Volume 0x0b, Issue 0x39, Phile #0x0a of 0x12

==Phrack Inc.==

|=-------------=[ Against the System: Rise of the Robots ]=--------------=|
|=-----------------------------------------------------------------------=|
|=-=[ (C)Copyright 2001 by Michal Zalewski <[email protected]> ]=-=|

-- [1] Introduction -------------------------------------------------------

"[...] big difference between the web and traditional well controlled
collections is that there is virtually no control over what people can
put on the web. Couple this flexibility to publish anything with the
enormous influence of search engines to route traffic and companies
which deliberately manipulating search engines for profit become a
serious problem."

-- Sergey Brin, Lawrence Page (see references, [A])

Consider a remote exploit that is able to compromise a remote system
without sending any attack code to its victim. Consider an exploit
which simply creates a local file to compromise thousands of computers,
and which does not involve any local resources in the attack. Welcome to
the world of zero-effort exploit techniques. Welcome to the world of
automation, welcome to the world of anonymous attacks that are
dramatically difficult to stop, resulting from increasing Internet
complexity.

Zero-effort exploits create their 'wishlist', and leave it somewhere
in cyberspace - it can even be their home host - in a place where others
can find it. Others - Internet workers (see references, [D]) - hundreds
of never sleeping, endlessly browsing information crawlers, intelligent
agents, search engines... They come to pick up this information, and -
unknowingly - to attack victims. You can stop one of them, but can't
stop them all. You can find out what their orders are, but you can't
guess what these orders will be tomorrow, hidden somewhere in the abyss
of not yet explored cyberspace.

Your private army, close at hand, picking up the orders you left for
them on their way. You exploit them without having to compromise them. They
do what they are designed for, and they do their best to accomplish it.
Welcome to the new reality, where our A.I. machines can rise against us.

Consider a worm. Consider a worm which does nothing. It is carried and
injected by others - but not by infecting them. This worm creates a
wishlist - a wishlist of, for example, 10,000 random addresses. And waits.
Intelligent agents pick up this list, and with their united forces they
try to attack all of the addresses. Imagine they are not lucky, with a
0.1% success ratio. Ten new hosts infected. On each of them, the worm
does exactly the same - and agents come back, to infect 100 hosts. The
story goes - or crawls, if you prefer.
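
The arithmetic above can be sketched in a few lines of Python. This is
only an idealized model, not a description of any real worm: it assumes
that every newly infected host publishes one wishlist of the same size,
that wishlists never overlap, and that the success ratio stays constant.
The function name and its parameters are made up for this illustration.

```python
def infected_per_round(list_size, success_ratio, rounds):
    """Return how many hosts are newly infected in each crawl round."""
    new_hosts = 1  # the single host that publishes the first wishlist
    history = []
    for _ in range(rounds):
        # Assumption: each host infected in the previous round publishes
        # one wishlist, and a fixed fraction of its entries is vulnerable.
        new_hosts = round(new_hosts * list_size * success_ratio)
        history.append(new_hosts)
    return history


# 10,000 addresses per wishlist, 0.1% success ratio, three crawl rounds
print(infected_per_round(10_000, 0.001, 3))  # prints [10, 100, 1000]
```

In practice the growth would flatten quickly, since random wishlists
overlap and the pool of vulnerable hosts is finite.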

Agents work virtually invisibly, people get used to their presence
everywhere. And crawlers just slowly go ahead, in a never-ending loop.
They work systematically, they do not choke with excessive data - they
crawl, there's no "boom" effect. Week after week after week, they try
new hosts, carefully, not overloading network uplinks, not generating
suspicious traffic; their recurrent exploration never ends. Can you
notice they carry a worm? Possibly...

-- [2] An example ---------------------------------------------------------

When this idea came to my mind, I tried to use the simplest test, just
to see if I am right. I targeted, if that's the right word, general-purpose
web indexing crawlers. I created a very short HTML document and put it
somewhere. And waited a few weeks. And then they came. Altavista, Lycos
and dozens of others. They found new links and picked them up
enthusiastically, then disappeared for days.

bigip1-snat.sv.av.com:
GET /indexme.html HTTP/1.0

sjc-fe5-1.sjc.lycos.com:
GET /indexme.html HTTP/1.0

[...]

They came back later, to see what I gave them to parse.

http://somehost/cgi-bin/script.pl?p1=../../../../attack
http://somehost/cgi-bin/script.pl?p1=;attack
http://somehost/cgi-bin/script.pl?p1=|attack
http://somehost/cgi-bin/script.pl?p1=`attack`
http://somehost/cgi-bin/script.pl?p1=$(attack)
http://somehost:54321/attack?`id`
http://somehost/AAAAAAAAAAAAAAAAAAAAA...

Our bots followed them, exploiting hypothetical vulnerabilities,
compromising remote servers:

sjc-fe6-1.sjc.lycos.com:
GET /cgi-bin/script.pl?p1=;attack HTTP/1.0

212.135.14.10:
GET /cgi-bin/script.pl?p1=$(attack) HTTP/1.0

bigip1-snat.sv.av.com:
GET /cgi-bin/script.pl?p1=../../../../attack HTTP/1.0

[...]

(BigIP is one of the famous "I observe you" load balancers from F5Labs)
Bots happily connected to non-http ports I prepared for them:

GET /attack?`id` HTTP/1.0
Host: somehost
Pragma: no-cache
Accept: text/*
User-Agent: Scooter/1.0
From: [email protected]

GET /attack?`id` HTTP/1.0
User-agent: Lycos_Spider_(T-Rex)
From: [email protected]
Accept: */*
Connection: close
Host: somehost:54321

GET /attack?`id` HTTP/1.0
Host: somehost:54321
From: [email protected]
Accept: */*
User-Agent: FAST-WebCrawler/2.2.6 ([email protected]; [...])
Connection: close

[...]

But publicly available crawlbot engines are not the only ones that can
be targeted. Crawlbots from alexa.com, ecn.purdue.edu, visual.com,
poly.edu, inria.fr, powerinter.net, xyleme.com, and even more
unidentified crawl engines found this page and enjoyed it. Some robots
didn't pick up all URLs. For example, some crawlers do not index scripts
at all, others won't use non-standard ports. But the majority of
the most powerful bots will - and even if not, trivial filtering
is not the answer. Many IIS vulnerabilities and so on can be triggered
without invoking any scripts.
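
To make the "trivial filtering" point concrete, here is a minimal sketch
of the kind of filter a crawler could apply before following a link. It
is an assumption of mine, not something any of the crawlers above is
known to do; the token list and the function name are hypothetical.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical blacklist: path traversal and common shell metacharacters.
SUSPICIOUS_TOKENS = ("../", ";", "|", "`", "$(")


def looks_suspicious(url):
    """Return True if the URL path or query contains a blacklisted token."""
    parts = urlsplit(url)
    # Decode %xx escapes once, so "%2e%2e%2f" is treated like "../".
    decoded = unquote(parts.path + "?" + parts.query)
    return any(token in decoded for token in SUSPICIOUS_TOKENS)


# One of the test links from this section is caught...
print(looks_suspicious("http://somehost/cgi-bin/script.pl?p1=;attack"))
# ...but an overlong request path passes untouched.
print(looks_suspicious("http://somehost/" + "A" * 2000))
```

The first call prints True and the second prints False. The filter also
flags legitimate URLs that merely contain ";" or "|", and it knows
nothing about vulnerabilities that need no metacharacters at all, which
is the reason a blacklist like this is not the answer.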

What if this server list was randomly generated, 10,000 IPs or 10,000
.com domains? What if script.pl is replaced with invocations of
the three, four, five or ten most popular IIS vulnerabilities or
buggy Unix scripts? What if one out of 2,000 is actually exploited?

What if somehost:54321 points to a vulnerable service which can
be exploited with the partially user-dependent contents of HTTP
requests (I consider the majority of fool-proof services that do not
drop connections after the first invalid command vulnerable)? What if...
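
The remark about services that keep listening after an invalid command
can be illustrated with a small sketch. It models a line-based service
as a plain function; the command set and the function name are
hypothetical and do not describe any real daemon.

```python
# Hypothetical command set of some line-based, non-HTTP service.
VALID_COMMANDS = {"HELO", "NOOP", "QUIT"}


def accepted_commands(lines, strict):
    """Return the lines the service would act on.

    With strict=True the session ends at the first invalid command;
    with strict=False invalid lines are skipped and parsing goes on.
    """
    accepted = []
    for line in lines:
        word = line.strip().split(" ", 1)[0].upper()
        if word in VALID_COMMANDS:
            accepted.append(line)
        elif strict:
            break  # drop the connection at the first invalid command
    return accepted


# An HTTP request arriving at the wrong port, followed by a line that
# happens to be a valid command for this service.
session = ["GET /attack HTTP/1.0", "Host: somehost:54321", "NOOP"]
print(accepted_commands(session, strict=False))  # prints ['NOOP']
print(accepted_commands(session, strict=True))   # prints []
```

A forgiving parser acts on whatever valid input eventually shows up in
the stream, while the strict one never gets past the request line.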

There is an army of robots, different species, different functions,
different levels of intelligence. And these robots will do whatever
you tell them to do. It is scary.

-- [3] Social considerations ----------------------------------------------

Who is guilty if a webcrawler compromises your system? The most obvious
answer is: the author of the original webpage the crawler visited. But
webpage authors are hard to trace, and a web crawler indexing cycle takes
weeks. It is hard to determine when a specific page was put on the net
- pages can be delivered in so many ways, or processed by other robots
earlier; there is no tracking mechanism like the one we can find in the
SMTP protocol and many others. Moreover, many crawlers don't remember
where they "learned" new URLs. Additional problems are caused by indexing
flags, like "noindex" without the "nofollow" option. In many cases, the
author's identity and the attack origin wouldn't be determined, while
compromises would take place.

And, finally, what if having a particular link followed by bots wasn't
what the author meant? Consider "educational" papers, etc - bots won't
read the disclaimer and the big fat warning "DO NOT TRY THESE LINKS"...

By analogy to other cases, e.g. Napster being forced to filter its
contents (or shut down its services) because copyrighted information
exchanged by its users was causing losses, it is reasonable to expect
that intelligent bot developers would be forced to implement specific
filters, or to pay enormous compensation to victims suffering because of
bot abuse.

On the other hand, it seems almost impossible to successfully filter
contents to eliminate malicious code, if you consider the number and
wide variety of known vulnerabilities. Not to mention targeted attacks
(see references, [B], for more information on proprietary solutions and
their insecurities). So the problem persists. An additional issue is that
not all crawler bots are under U.S. jurisdiction, which makes the whole
problem more complicated (in many countries, the U.S. approach is
considered controversial at the least).

-- [4] Defense ------------------------------------------------------------

As discussed above, a webcrawler itself has very limited defense and
avoidance possibilities, due to the wide variety of web-based
vulnerabilities. One of the more reasonable defense ideas is to use
secure and up-to-date software, but - obviously - this concept is
extremely unpopular for some reason - www.google.com, with the
unique documents filter enabled, returns 62,100 matches for the "cgi
vulnerability" query (see also references, [D]).

Another line of defense against bots is the standard /robots.txt
robot exclusion mechanism (see references, [C], for specifications).
The price you have to pay is partial or complete exclusion of your
site from search engines, which, in most cases, is undesired. Also,
some robots are broken, and do not respect /robots.txt when following
a direct link to a new website.
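
As a minimal sketch of the partial exclusion case, the snippet below
feeds a two-line robots.txt to the parser from the Python standard
library and asks what a well-behaved robot would be allowed to fetch.
The rule set (hiding only /cgi-bin/) and the robot name "ExampleBot" are
assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt: keep every robot out of the script
# directory, leave the rest of the site indexable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up robot name.
print(parser.can_fetch("ExampleBot",
                       "http://somehost/cgi-bin/script.pl?p1=x"))
print(parser.can_fetch("ExampleBot", "http://somehost/index.html"))
```

The first call prints False and the second prints True. This only helps
against robots that actually consult /robots.txt before following
a link.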

-- [5] References ---------------------------------------------------------

[A] "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
Googlebot concept, Sergey Brin, Lawrence Page, Stanford University
URL: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

[B] Proprietary web solutions security, Michal Zalewski
URL: http://lcamtuf.coredump.cx/milpap.txt

[C] "A Standard for Robot Exclusion", Martijn Koster
URL: http://info.webcrawler.com/mak/projects/robots/norobots.html

[D] "The Web Robots Database"
URL: http://www.robotstxt.org/wc/active.html
URL: http://www.robotstxt.org/wc/active/html/type.html

[E] "Web Security FAQ", Lincoln D. Stein
URL: http://www.w3.org/Security/Faq/www-security-faq.html

|=[ EOF ]=---------------------------------------------------------------=|