==Phrack Inc.==

              Volume 0x0b, Issue 0x39, Phile #0x0a of 0x12

|=-------------=[ Against the System: Rise of the Robots ]=--------------=|
|=-----------------------------------------------------------------------=|
|=-=[ (C)Copyright 2001 by Michal Zalewski <[email protected]> ]=-=|


-- [1] Introduction -------------------------------------------------------

 "[...] big difference between the web and traditional well controlled
  collections is that there is virtually no control over what people can
  put on the web. Couple this flexibility to publish anything with the
  enormous influence of search engines to route traffic and companies
  which deliberately manipulating search engines for profit become a
  serious problem."

                     -- Sergey Brin, Lawrence Page (see references, [A])

 Consider a remote exploit that is able to compromise a remote system
 without sending any attack code to its victim. Consider an exploit
 which simply creates a local file to compromise thousands of computers,
 and which does not involve any local resources in the attack. Welcome to
 the world of zero-effort exploit techniques. Welcome to the world of
 automation, welcome to the world of anonymous, dramatically difficult
 to stop attacks resulting from increasing Internet complexity.

 Zero-effort exploits create their 'wishlist' and leave it somewhere
 in cyberspace - it can even be on their home host - in a place where
 others can find it. Others - Internet workers (see references, [D]) -
 hundreds of never sleeping, endlessly browsing information crawlers,
 intelligent agents, search engines... They come to pick up this
 information, and - unknowingly - to attack victims. You can stop one of
 them, but you can't stop them all. You can find out what their orders
 are, but you can't guess what these orders will be tomorrow, hidden
 somewhere in the abyss of not yet explored cyberspace.

 Your private army, close at hand, picking up the orders you left for
 them on their way. You exploit them without having to compromise them.
 They do what they are designed for, and they do their best to accomplish
 it. Welcome to the new reality, where our A.I. machines can rise against
 us.

 Consider a worm. Consider a worm which does nothing. It is carried and
 injected by others - but not by infecting them. This worm creates a
 wishlist - a wishlist of, for example, 10,000 random addresses. And
 waits. Intelligent agents pick up this list, and with their united
 forces they try to attack all of the addresses. Imagine they are not
 lucky, with a 0.1% success ratio. Ten new hosts infected. On every one
 of them, the worm does exactly the same - and agents come back, to
 infect 100 hosts. The story goes on - or crawls, if you prefer.
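
 The arithmetic behind these numbers can be sketched in a few lines of
 Python. This is only an illustration of the figures quoted above: the
 function name and its parameters are made up for this example, and it
 assumes every wishlist has the same size and the same success ratio
 (0.1% is written here as "one in 1,000").

```python
def newly_infected(generation, wishlist_size=10_000, one_in=1_000):
    """Expected number of hosts newly infected in a given generation.

    Generation 0 is the single host that publishes the first wishlist.
    Assumption: each infected host publishes one wishlist of the same
    size, and one address out of `one_in` turns out to be vulnerable.
    """
    hits_per_wishlist = wishlist_size // one_in  # 10 with the defaults
    return hits_per_wishlist ** generation


for generation in range(4):
    print(generation, newly_infected(generation))
```

 With the defaults, generations 1 and 2 give the 10 and 100 hosts
 mentioned in the text; a real crawl would of course be far less regular.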

 Agents work virtually invisibly, and people get used to their presence
 everywhere. And crawlers just slowly go ahead, in a never-ending loop.
 They work systematically, they do not choke on excessive data - they
 crawl, there's no "boom" effect. Week after week after week, they try
 new hosts, carefully, not overloading network uplinks, not generating
 suspicious traffic; the recurrent exploration never ends. Can you notice
 they carry a worm? Possibly...

-- [2] An example ---------------------------------------------------------

 When this idea came to my mind, I tried the simplest test, just to see
 if I was right. I targeted, if that's the right word, general-purpose
 web indexing crawlers. I created a very short HTML document and put it
 somewhere. And waited a few weeks. And then they came. Altavista, Lycos
 and dozens of others. They found new links and picked them up
 enthusiastically, then disappeared for days.

 bigip1-snat.sv.av.com:
   GET /indexme.html HTTP/1.0

 sjc-fe5-1.sjc.lycos.com:
   GET /indexme.html HTTP/1.0

 [...]

 They came back later, to see what I gave them to parse.

   http://somehost/cgi-bin/script.pl?p1=../../../../attack
   http://somehost/cgi-bin/script.pl?p1=;attack
   http://somehost/cgi-bin/script.pl?p1=|attack
   http://somehost/cgi-bin/script.pl?p1=`attack`
   http://somehost/cgi-bin/script.pl?p1=$(attack)
   http://somehost:54321/attack?`id`
   http://somehost/AAAAAAAAAAAAAAAAAAAAA...


 Our bots followed them, exploiting hypothetical vulnerabilities and
 compromising remote servers:

 sjc-fe6-1.sjc.lycos.com:
   GET /cgi-bin/script.pl?p1=;attack HTTP/1.0

 212.135.14.10:
   GET /cgi-bin/script.pl?p1=$(attack) HTTP/1.0

 bigip1-snat.sv.av.com:
   GET /cgi-bin/script.pl?p1=../../../../attack HTTP/1.0

 [...]

 (BigIP is one of the famous "I observe you" load balancers from F5Labs.)
 Bots happily connected to the non-http ports I prepared for them:

 GET /attack?`id` HTTP/1.0
 Host: somehost
 Pragma: no-cache
 Accept: text/*
 User-Agent: Scooter/1.0
 From: [email protected]

 GET /attack?`id` HTTP/1.0
 User-agent: Lycos_Spider_(T-Rex)
 From: [email protected]
 Accept: */*
 Connection: close
 Host: somehost:54321

 GET /attack?`id` HTTP/1.0
 Host: somehost:54321
 From: [email protected]
 Accept: */*
 User-Agent: FAST-WebCrawler/2.2.6 ([email protected]; [...])
 Connection: close

 [...]
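
 On the receiving side, requests like the ones above can be attributed
 to a particular robot just by reading their headers. Below is a minimal
 sketch in Python, assuming the raw request text has already been
 captured; parse_request() and describe_visitor() are hypothetical
 helper names, and the sample is the Lycos request shown above with its
 From: header left out.

```python
SAMPLE_REQUEST = """\
GET /attack?`id` HTTP/1.0
User-agent: Lycos_Spider_(T-Rex)
Accept: */*
Connection: close
Host: somehost:54321
"""


def parse_request(raw):
    """Return (method, target, headers) for a raw HTTP request string."""
    lines = raw.strip().splitlines()
    method, target, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, headers


def describe_visitor(raw):
    """Summarize which robot asked for what, and on which port."""
    _method, target, headers = parse_request(raw)
    _host, _, port = headers.get("host", "").partition(":")
    return {
        "agent": headers.get("user-agent", "unknown"),
        "target": target,
        # Assumption: no port in the Host header means plain HTTP on 80.
        "port": int(port) if port else 80,
    }


print(describe_visitor(SAMPLE_REQUEST))
```

 Header names are lower-cased before lookup because, as the captures
 show, robots do not agree on their capitalization.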

 But it is not only publicly available crawlbot engines that can be
 targeted. Crawlbots from alexa.com, ecn.purdue.edu, visual.com, poly.edu,
 inria.fr, powerinter.net, xyleme.com, and even more unidentified
 crawl engines found this page and enjoyed it. Some robots didn't
 pick up all the URLs. For example, some crawlers do not index scripts
 at all, and others won't use non-standard ports. But the majority of
 the most powerful bots will - and even if they didn't, trivial
 filtering is not the answer. Many vulnerabilities, in IIS and elsewhere,
 can be triggered without invoking any scripts.
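
 To make the filtering point concrete, here is a rough sketch of the
 kind of trivial filter a crawler could apply to a link before following
 it. It is written in Python and is not taken from any real crawler: the
 function name, the character set, the port list and the length limit
 are all assumptions picked for illustration.

```python
from urllib.parse import unquote, urlsplit

# Assumed, deliberately incomplete list of characters a shell interprets.
SHELL_METACHARACTERS = set(";|`$")
DEFAULT_PORTS = {None, 80, 443}
MAX_PATH_LENGTH = 255  # arbitrary threshold for "suspiciously long"


def suspicious_reasons(url):
    """Return a list of reasons why a crawler might refuse to follow url."""
    parts = urlsplit(url)
    # Look at the decoded path and query, so %3B is treated like ';'.
    decoded = unquote(parts.path + "?" + parts.query)
    reasons = []
    if any(ch in SHELL_METACHARACTERS for ch in decoded):
        reasons.append("shell metacharacter")
    if "../" in decoded:
        reasons.append("path traversal")
    if parts.port not in DEFAULT_PORTS:
        reasons.append("non-standard port")
    if len(parts.path) > MAX_PATH_LENGTH:
        reasons.append("overlong path")
    return reasons


print(suspicious_reasons("http://somehost/cgi-bin/script.pl?p1=;attack"))
print(suspicious_reasons("http://somehost/indexme.html"))
```

 A filter like this flags the test links listed earlier, but it passes
 any request whose malicious part needs none of these patterns, and it
 may also reject legitimate URLs that happen to contain a semicolon.
 That is exactly why it is not a real answer.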

 What if this server list was randomly generated, 10,000 IPs or 10,000
 .com domains? What if script.pl is replaced with invocations of the
 three, four, five or ten most popular IIS vulnerabilities or
 buggy Unix scripts? What if one out of 2,000 is actually exploited?

 What if somehost:54321 points to a vulnerable service which can
 be exploited with the partially user-dependent contents of HTTP
 requests (I consider the majority of fool-proof services that do not
 drop connections after the first invalid command vulnerable)? What if...

 There is an army of robots, different species, different functions,
 different levels of intelligence. And these robots will do whatever
 you tell them to do. It is scary.

-- [3] Social considerations ----------------------------------------------

 Who is guilty if a webcrawler compromises your system? The most obvious
 answer is: the author of the original webpage the crawler visited. But
 webpage authors are hard to trace, and a web crawler indexing cycle
 takes weeks. It is hard to determine when a specific page was put on the
 net - pages can be delivered in so many ways, or processed by other
 robots earlier; there is no tracking mechanism of the kind we can find
 in the SMTP protocol and many others. Moreover, many crawlers don't
 remember where they "learned" new URLs. Additional problems are caused
 by indexing flags, like "noindex" without the "nofollow" option. In many
 cases, the author's identity and the attack origin couldn't be
 determined, while compromises would take place.

 And, finally, what if having a particular link followed by bots wasn't
 what the author meant? Consider "educational" papers, etc - bots won't
 read the disclaimer and the big fat warning "DO NOT TRY THESE LINKS"...

 By analogy to other cases, e.g. Napster being forced to filter its
 content (or shut down its services) because of copyrighted information
 exchanged by its users, causing losses, it is reasonable to expect that
 intelligent bot developers would be forced to implement specific filters,
 or to pay enormous compensation to victims suffering because of bot
 abuse.

 On the other hand, it seems almost impossible to successfully filter
 content to eliminate malicious code, if you consider the number and
 wide variety of known vulnerabilities. Not to mention targeted attacks
 (see references, [B], for more information on proprietary solutions and
 their insecurities). So the problem persists. An additional issue is
 that not all crawler bots are under U.S. jurisdiction, which makes the
 whole problem more complicated (in many countries, the U.S. approach is
 considered controversial, to say the least).

-- [4] Defense ------------------------------------------------------------

 As discussed above, the webcrawler itself has very limited defense and
 avoidance possibilities, due to the wide variety of web-based
 vulnerabilities. One of the more reasonable defense ideas is to use
 secure and up-to-date software, but - obviously - this concept is
 extremely unpopular for some reason - www.google.com, with the
 unique documents filter enabled, returns 62,100 matches for the "cgi
 vulnerability" query (see also references, [D]).

 Another line of defense against bots is the standard /robots.txt
 robot exclusion mechanism (see references, [C], for specifications).
 The price you have to pay is partial or complete exclusion of your
 site from search engines, which, in most cases, is undesired. Also,
 some robots are broken, and do not respect /robots.txt when following
 a direct link to a new website.
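
 As an illustration of partial exclusion, the sketch below uses Python's
 standard urllib.robotparser module to show how a well-behaved robot
 would read a /robots.txt that closes only /cgi-bin/. The rules, the
 "Scooter" agent string and the allowed() helper are examples chosen
 for this sketch, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Example policy: keep every robot out of /cgi-bin/, allow the rest.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Tell whether a compliant robot may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("Scooter", "http://somehost/cgi-bin/script.pl?p1=x"))
print(allowed("Scooter", "http://somehost/indexme.html"))
```

 The first call reports that the script is off limits, the second that
 the static page may still be indexed. None of this binds a robot that
 never asks for /robots.txt in the first place.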

-- [5] References ---------------------------------------------------------

 [A] "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
     Googlebot concept, Sergey Brin, Lawrence Page, Stanford University
     URL: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

 [B] Proprietary web solutions security, Michal Zalewski
     URL: http://lcamtuf.coredump.cx/milpap.txt

 [C] "A Standard for Robot Exclusion", Martijn Koster
     URL: http://info.webcrawler.com/mak/projects/robots/norobots.html

 [D] "The Web Robots Database"
     URL: http://www.robotstxt.org/wc/active.html
     URL: http://www.robotstxt.org/wc/active/html/type.html

 [E] "Web Security FAQ", Lincoln D. Stein
     URL: http://www.w3.org/Security/Faq/www-security-faq.html

|=[ EOF ]=---------------------------------------------------------------=|