#################################################################
# robots.txt at www.rogueriver.tzo.com
#################################################################
# See www.honeypot.be for updates on other nuisances
# added honeypot.be/robots.txt to my existing file
# this robots.txt file can be freely distributed
# Feel free to use it to build your own robots.txt file
#################################################################
# As you can see, I comment my listings pretty well.
#################################################################
# ATTENTION Bot-masters: apparent do-nothing bots and spambots
# will be banned. Any that ignore robots.txt will be blocked at
# the IP or domain level.
#################################################################
# NOTICE: I am NOT running a Win2K IIS web server! I use the
# latest version of Apache Server (currently at v2.2).
# additionally no perl or php scripts are used here.
# my blog is Thingamablog which doesn't require any MySQL, php
# or other on-server scripting. google thingamablog for more info
# crackers and other black hats: Please take your iis exploits,
# bots, trojans, viruses, and worms elsewhere.
#################################################################
#There is no reason that I have been able to research for a crawler
#to index the _vti_cnf Frontpage directory so ALL of them are now
#protected with a password to block the crawlers. Now you get a 401
#error.
#################################################################
# Thank You and Have a Nice Day!
#################################################################
###########################################################################
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
###########################################################################
#####################################################################
# Featured Bots of the month - SapphireWebCrawler/1.0 (IMAGE SCRAPER)
#####################################################################
########################################
#SapphireWebCrawler/1.0 (IMAGE SCRAPER)#
########################################
# DISREGARDS ROBOTS.TXT
# Private web crawler/IMAGE SCRAPER
# After full examination of January/February 2009 access.log files
# no entries for Sapphire crawler can be found indexing web pages.
# First observed using IP 64.88.164.198 from Lawrenceville, GA not CMU
# NetRange: 64.88.160.0 - 64.88.191.255
# 23 Oct 2009 08:37:05 observed using 4.71.254.198 - traceroute to Google-Atlanta-Level3
# contact info: mhoy@cs.cmu.edu - http://boston.lti.cs.cmu.edu/crawler/
# Since this bot disregards robots.txt it will be blocked by IP range
User-agent: SapphireWebCrawler/1.0
Disallow: /

#IBM sai-crawler
#observed 27 Oct 2009 03:45:17 129.34.20.17
#let's see if it obeys robots.txt
#yupper...
#now let's see if they override this
User-agent: http://domino.research.ibm.com/comm/research_projects.nsf/pages/sai-crawler.callingcard.html
Disallow: /

###########################################################################
###########################################################################
#################################################################
# Here are the usual search engines
#################################################################
#####################
#Feedjit Crawler 2.1 [69.46.36.7]
#####################
#Live network activity viewer - www.feedjit.com
#observed using [69.46.36.7] net range: 69.46.32.0 - 69.46.47.255
User-agent: Feedjit Crawler 2.1
Disallow: /cgi-bin/

###############
#Technoratibot
###############
User-agent: TechnoratiSnoop/1.0
Disallow: /cgi-bin/

#########
# YAHOO!
#########
#Yahoo! Slurp/3.0
#net range (67.195.0.0-67.195.255.255)(72.30.0.0-72.30.255.255)(74.6.0.0-74.6.255.255) aka Inktomi Corp.
#I have passworded the _vti_cnf directories so you will get a 401 error! Yahoo bot disregards some robots.txt
#instructions... manual override?
#No mega-scans from Yahoo!
#They ignore NoIndex instructions in html headers.
#Apparently they consider themselves as the 1000 pound gorilla?
#The Yahoo search engine Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp
User-agent: Yahoo! Slurp/3.0
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/
Disallow: /_vti_cnf/
Disallow: /*EntryPermalink*

####################
#Yahoo-MMCrawler/3.x
####################
#Haven't seen this one lately, maybe the Scooter/3.3 is its replacement.
#Yahoo-MMCrawler/3.x (mms-mmcrawler-support@yahoo-inc.com)
#So far, Yahoo has been a well behaved crawler.
#Looks frequently for robots.txt
User-agent: Yahoo-MMCrawler/3.x
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/
Disallow: /_vti_cnf/

###################
# Scooter/3.3
###################
#Apparently someone at Yahoo (New York) is being cute using this scooter/3.3 spider to index
#images without identifying themselves as Yahoo during the crawl.
#The IP used in this case is 69.147.79.37
#whois info: NetRange: 69.147.64.0 - 69.147.127.255 / Host: scrub3.media.search.re3.yahoo.com
#contact email: network-abuse@cc.yahoo-inc.com and rauschen@yahoo-inc.com
#Abuse and Tech Phone: +1-408-349-3300
User-agent: Scooter/3.3
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/
Disallow: /_vti_cnf/

#########
# GOOGLEBOT-MOBILE (66.249.65.136 and 66.249.65.200 so far!)
#########
# Twice now I have had Googlebot-mobile do a full 45 minute bandwidth hogging high speed crawl and site download.
# I'll try this first; if it doesn't work, a full ban on Google goes into effect
User-agent: Googlebot-Mobile/2.1
Disallow: /

#########
# GOOGLE
#########
#Google Bot (66.249.65.0 - 66.249.66.255)
#Google is famous for their ultra-high speed gigabyte whole web site every day download sessions.
#(66.249.65.136) did a FULL site download (almost 2 gigabytes) between 8:56 and 9:43 pm, so I am restricting
#Google access to between midnight and 6 AM. Other addresses will be added as Google acts like a net boor.
#Google checks for robots.txt but doesn't obey it!
#Google checks (maybe) some time during each calendar day for robots.txt, but not every 24 hours.
#Note to Google! I researched _vti_cnf and it is used by Frontpage and Micro$oft IIS Server.
#This is an APACHE server...I have passworded all of them.
#Search all you want... when you hit one it asks for a password.
User-agent: Googlebot/2.1
Disallow: /email*.*
Disallow: /_vti_cnf/
Disallow: /cgi-bin/
Disallow: /hivemind/

####################################################################################################
#Below are the Micro$oft public and stealth crawlers as I have spotted them in the apache access.log
#note to Micro$oft: NO I AM NOT USING A MICROSOFT IIS SERVER. Apache is superior in most ways to IIS
#I must say that WinXP Pro sp2 does work great with Apache.
####################################################################################################
#####################################
# MICRO$OFT crawlers and stealthbots
#####################################
########
#adidxbot/1.1 (+http://search.msn.com/msnbot.htm)
########
#the latest msn bot, but what is this one scanning for?
#observed on 6/11/2009 using ip address (65.55.214.145)
User-agent: adidxbot/1.1
Crawl-delay: 20
Disallow: /%3C$EntryPermalink$%3E
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/

########
#msnbot-media/1.0 (65.52.0.0 - 65.55.255.255) http://search.msn.com/msnbot.htm
########
#at least they know what _vti_cnf is and don't bother scanning it.
User-agent: msnbot-media/1.0
Crawl-delay: 20
Disallow: /%3C$EntryPermalink$%3E
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/

########
#msnbot/1.1
########
#I suspect that this is frequently operator controlled because of what it crawls and requests
#keeps hitting on %3C$EntryPermalink$%3E getting a 403 (forbidden) error, a blacklist is coming
User-agent: msnbot/1.1
Crawl-delay: 20
Disallow: /*EntryPermalink*
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/

########
#msrbot (http://research.microsoft.com/research/sv/msrbot/) [209.133.64.213]
########
#netname for the ip address shows as AboveNet from San Francisco.
#I think that Micro$oft has many stealth identities on the net.
#This bot has previously ignored robots.txt and was blacklisted
User-agent: msrbot
Disallow: /

########
#MSR-ISRCCrawler (131.107.65.41)
########
#Another Micro$oft stealth crawler
#It pulled robots.txt but since it does not identify with an address to contact them about its activities,
#I will be adding this to the blacklist. NetRange: 131.107.0.0 - 131.107.255.255
User-agent: MSR-ISRCCrawler
Disallow: /

########
#END of the Micro$oft section here
########
#############
# ASK JEEVES
#############
#Ask Jeeves/Teoma 65.214.44.75... latest info - scans from 206.80.1.253
#not identifying itself as spider from Jeeves but a lookup shows it is from there.
#I have watched Ask Jeeves spider my site for a LONG time now. I have to say,
#I do not know what they are looking for, but they are VERY civilized in doing it!
#I also have to say that I can never find what I am looking for using ASK.com
#No ultra-highspeed bandwidth hogging scans from Jeeves!
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/

############
# GIGABOT
############
#Gigabot/2.0 (66.154.102.0 - 66.154.103.255) GIGABLAST, Inc.
User-agent: Gigabot/2.0
Disallow: /cgi-bin/

############################
#ia_archiver-web.archive.org
############################
#just popped in "ia_archiver-web.archive.org" (207.241.229.150)
#ip range: (207.241.224.0 - 207.241.239.255)
#archive.org comes up as TheWayBackMachine - seems to copy and obey robots.txt
User-agent: ia_archiver
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

##################
#Baiduspider+(+http://www.baidu.com/search/spider.htm)
#observed using 61.135.168.82, 119.63.193.55, 123.125.64.38, 220.181.32.22
##################
User-agent: Baiduspider+
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

##################
#BlogsNowBot
##################
User-agent: BlogsNowBot, V 3.0 (+http://www.blogsnow.com/)
Disallow: /cgi-bin/

##############
# BLOGPULSE
##############
#BlogPulse (64.158.138.0 - 64.158.138.255) IP range may be narrower, not confirmed
#Leased from Level 3 Communications in Colorado (64.152.0.0 - 64.159.255.255)
#I visited the web site and searched through hundreds of listings for my pages...
#Nothing! I cannot figure out what they are cataloging. They are allowed for now.
#maybe looking for friends of Israel? I have my Friends of Israel image posted.
#Registered via GoDaddy.com Registrant:
#Buzzmetrics, Ltd 4 SHENKAR STREET POB 12686 HERZLIYA, TEL AVIV 46733 Israel
#spider looks for robots.txt
User-agent: BlogPulseLive (support@blogpulse.com)
Disallow: /cgi-bin/

###################
#BlogSearch/1.0
###################
User-agent: BlogSearch/1.0 +http://www.icerocket.com/
Disallow: /cgi-bin/

###############
#Charlotte/1.1
###############
#http://www.searchme.com/support/
#IPs observed so far [208.111.154.15] [208.111.154.16] [208.111.154.35]
#NetRange: [208.111.128.0 - 208.111.191.255]
User-agent: Charlotte/1.1
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

##############
#crawly_2.6.3
##############
#Here is a new one to me.
#crawly_2.6.3 crawly@commandcom.com IP 66.255.53.123
#commandcom.com defaults to www.authentium.com, a software and security company
#I'll observe to see how it behaves
User-agent: crawly_2.6.3
Disallow: /cgi-bin/

##############
# Jyxobot/1
##############
#CESNET, z.s.p.o. Prague, CZ
#(195.113.214.192 - 195.113.214.255) observed using 195.113.214.204
#First visit 11/2/2006 @ 23:03 well mannered, no high speed index
#visited again tonight [3-3-2009] same ip address and range
#no bad reports on the web
#weird items catalogued on this visit. will watch.
#Looked for robots.txt first!
User-agent: Jyxobot/1
Disallow: /cgi-bin/

################
#Snapbot/1.0
################
#Snapbot/1.0 (Snap Shots, +http://www.snap.com)
#IP range for this one is (38.98.19.60 - 38.98.19.99)
#Observed getting robots.txt unknown if it uses it
User-agent: Snapbot/1.0
Disallow: /cgi-bin/

##########################
# Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)
##########################
#crawling from Stockholm Sweden (88.131.106.2) on 10/23/2009
User-agent: Speedy Spider
Disallow: /cgi-bin/

######################################################
#MJ12bot/v1.2.1; http://www.majestic12.co.uk/bot.php?+
######################################################
User-agent: MJ12bot/v1.2.1
Disallow: /cgi-bin/

########################################################
#Voilabot BETA 1.2 (support.voilabot@orange-ftgroup.com)
########################################################
#observed using these IPs - 81.52.143.15, 81.52.143.16, 193.252.149.16, 193.252.149.15
#ip range - 81.52.142.0 - 81.52.143.223 and 193.252.149.0 - 193.252.149.31
#Still in beta.
#email - support.voilabot@orange-ftgroup.com
User-agent: Voilabot BETA 1.2
Disallow: /cgi-bin/

#################################################################################
#################################################################################
# Here is the riff-raff section of spiders, spambots, and general nuisances
#################################################################################
#################################################################################
#80legs bot (008/0.83; http://www.80legs.com/spider.html)
#a viral multiplatform botnet crawler
User-Agent: 008/0.83
Disallow: /

#aipbot (69.25.142.0 - 69.25.142.63) AIPBOT.COM Registrar- REGISTER.COM, INC.
#new info they are VERY private (info blocked by privacy protect)
#Whois Agent (jbvlcqjgv@whoisprivacyprotect.com) +1.9027492060
#PO Box 841 C/O aipbot.com Yarmouth, NS B5A 4K5 CA
#the web site gives robots.txt info and nothing more
#this is a suspected spambot
User-agent: aipbot/1.0
Disallow: /

#AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)
#no identified web page and no benefit to web masters.
#does not search for or use robots.txt. sure looks like a spambot to me.
#I emailed them to ask what they were indexing and who would benefit. no response. blocked (12/31/2008)
#addresses used: 67.202.11.9, 75.101.247.60, 174.129.151.88, 174.129.136.78, 174.129.145.155
#net ranges: [67.202.0.0 - 67.202.63.255] [75.101.128.0 - 75.101.255.255] [174.129.0.0 - 174.129.255.255]
#addresses used are several within listed ranges of Amazon Web Services, Elastic Compute Cloud dynamic hosting.
User-agent: AISearchBot
Disallow: /

#accellobot (first observed on my site 11/16/2006)
#http://www.accelobot.com scanning from (72.20.99.47)
#full IP range is (72.20.99.0 - 72.20.99.255) out of Santa Clara, Ca.
#the actual bot is a generic sourceforge gnu-project bot (they forgot to change its ID)
#let's see if it actually obeys this file
User-agent: heritrix/1.8.0
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

#accoonabot (69.25.71.0 - 69.25.71.255)
#Accoona-AI-Agent/1.1.2 (aicrawler at accoonabot dot com)
#IP comes back to Fast Search & Transfer of Wellesley, MA
#accoonabot.com comes back reg to John Fernandez of Jersey City, NJ
#www.accoonabot.com comes up with the apache tomcat web page. Something does not
#seem on the up and up. First time I have seen this one and it has done a full
#site crawl. SPAMBOT? A scammer trading on the name of www.accoona.com?
User-agent: Accoona-AI-Agent/1.1.2 (aicrawler at accoonabot dot com)
Disallow: /

#AlkalineBot is a commercial research bot - sounds like commercial spambot
User-agent: AlkalineBOT
Disallow: /

#BuzzCore/0.9.2 visited from (206.111.151.151)
#http://www.buzzrage.com/about/buzzcrawl.html - fake web site 404 error
#A whois of buzzrage.com comes back with an anonymous registration thru GoDaddy.com
#and a host of XO.com using assigned IP 206.111.151.142
User-agent: BuzzCore/0.9.2
Disallow: /

#bdfetch - www.brandimensions.com
#seen using 72.14.164.135 and 72.14.164.155
#After a google search to see what sort of company they are, IP range has been blocked
User-agent: bdfetch
Disallow: /

#CazoodleBot/CazoodleBot-0.1
#http://www.cazoodle.com/cazoodlebot; cazoodlebot@cazoodle.com (72.36.115.75)
#web page is www.apartments.cazoodle.com... no apartment listings here
User-agent: CazoodleBot
Disallow: /cgi-bin/

#cfetch/1.0 what is this? no info on a yahoo! or google search
#(38.112.0.0 - 38.119.255.255) Performance Systems International Inc.
#operates from ip 38.112.6.182 - now banned for non-compliance with rules of how
#spiders are supposed to use and react to robots.txt entries
User-agent: cfetch/1.0
Disallow: /

#discobot/1.0; +http://discoveryengine.com/discobot.html
#Is this what the disco/nutch below has become?
#Let's see if it obeys robots.txt
User-agent: discobot/1.0
Disallow: /%3C$EntryPermalink$%3E
Disallow: /email*.*
Disallow: /cgi-bin/
Disallow: /hivemind/
Disallow: /_vti_cnf/

#disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; nedrocks@gmail.com)
#also observed as ozzie (question ozzie@cs.stanford.edu)
#observed crawling from 208.96.54.88 (netrange for blacklisting - 208.96.0.0 - 208.96.63.255)
#let's see if this crawler obeys robots.txt
User-agent: disco/Nutch-0.9
Disallow: /

#DotBot/1.0.1 (http://www.dotnetdotcom.org/#info, crawler@dotnetdotcom.org)
#NetRange: 208.115.96.0 - 208.115.127.255
#after reading the info on their page this is either a spam bot or some other nefarious scheme
#they have no stated plans for what they are indexing or what it will be used for
#thus they are banned from here as I see no benefit to let them continue crawling here
User-agent: DotBot/1.0.1
Disallow: /

#e-Collector is a freeware e-mail harvesting tool
#any IP sporting this user agent will simply be banned
#observed today (15 April 2009) crawling from 83.12.228.78 (net range 83.12.228.76-83.12.228.79)
#no contact info but the whois comes back as a Polish RIPE registry... a spammer/scraper?
User-agent: LWP::
Disallow: /

#e-sense 1.0 ea(www.vigiltech.com/esensedisclaim.html) (92.48.126.214)
#looks like a search engine but no info page or contact info even on the search page
#e-sense does not look for robots.txt before during or after scan
#whois registrant info: vigiltech.com Funchal, Madeira 9004-521 PT
#email: 1c0ae8950a14115001fe37d4c271e217@domaindiscreet.com
#with an email address like that, you know they do NOT want to be contacted
#netrange for blacklisting 92.48.126.208 - 92.48.126.223 from Belgrade, Serbia
User-agent: e-sense 1.0 ea
Disallow: /

#Exabot-Images/3.0; +http://www.exabot.com/go/robot http://www.exalead.com/search
#Image collector Image search engine
#EXALEAD Paris France (193.47.80.0-193.47.80.255, 83.167.62.160-83.167.62.191)
User-agent: Exabot-Images/3.0
Disallow: /_vti_cnf

#and now
User-agent: Exabot/3.0
Disallow: /_vti_cnf

#Feedster crawler (64.95.116.1)
User-agent: Feedster Crawler/3.0
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

# GurujiBot/1.0 - http://www.guruji.com/en/WebmasterFAQ.html
# using IP 72.20.109.62 (net range: 72.20.109.32 - 72.20.109.63)
# observed snooping for spam bait lists and nothing else
# Disregards robots.txt
# I am blocking the IP net range as there is no benefit to me of
# them looking for email addresses, acting like a spambot
User-agent: GurujiBot/1.0
Disallow: /

#ImageWalker/2.0
#Image thief? only scans image files
#observed IP usage IP 72.14.164.93, 72.14.164.156
#netrange for blacklisting 72.14.160.0 - 72.14.175.255
User-agent: ImageWalker/2.0
Disallow: /*.jpg
Disallow: /*.gif
Disallow: /*.bmp
Disallow: /*.png
Disallow: /

#IRLbot Texas A&M University netrange:128.194.0.0 - 128.194.255.255 - does not look for robots.txt
#This is just a formality. WHOIS does not give an abuse address.
#Sample line from access.log below
#"GET /congress/person.xpd?id=400111 HTTP/1.1" 404 217 "-" "IRLbot/2.0 (compatible; MSIE 6.0;
#http://irl.cs.tamu.edu/crawler" (continued from line above)
User-agent: IRLbot/2.0
Disallow: /

#Kyluka crawl - http://www.kyluka.com/crawl.html is a bogus URL not found in ARIN database.
#The URL finally works. I'll watch them to see what they do.
#using 66.92.19.252 at this time.
User-agent: Kyluka crawl
Disallow: /

#larbin_2.6.3 from University of Dresden (Germany)- unidentified student or faculty
#using IP 141.76.44.181 of the IP net range 141.76.0.0 to 141.76.255.255
#Larbin pulled robots.txt but doesn't obey it
#sighting on 12-23-2006 - using 66.80.248.144 from megapath networks, inc.
#sighting on 06/09/2008 - 207.67.117.170 (net range: 207.67.117.0 - 207.67.117.255) from (mcafee)securecomputing.com
#sighting on 04-12-2009 - 85.72.217.33 IP block - 85.72.0.0 - 85.72.255.255
#sighting on 04/16/2009 - 64.56.64.57 IP block - 64.56.64.0 - 64.56.79.255 [vrtservers, inc.]
#note: SecureComputing.com is an INTERNATIONAL corporation and is now owned by McAfee
#WHY is someone from McAfee using an anonymous spider and a fake email address? (4-26-2009 See net range above)
#email address: larbin2.6.3@unspecified.mail (BOGUS) doing an IP Range blacklist on my servers
#no one seems to know what this is so I am blocking it this way first.
#Lots of hacker activity out of Dresden - the whole IP range from there is blacklisted now
User-agent: larbin_2.6.3
Disallow: /

#LiteFinder/1.0; +http://www.litefinder.net/about.html
#does not obey robots.txt - also does high speed full site download - bandwidth hog
#Web site says they are from Bangalore India
#operates from the following Net Ranges: 74.53.243.224 - 74.53.243.239 and 216.40.192.0 - 216.40.255.255
#just observed them using ThePlanet Net Range: 74.53.249.32 - 74.53.249.47 - Sneaky BASTARDS
#more IP-Network-Blocks will be added as they are observed. A simple domain ban doesn't seem to be effective.
#lookup of IPs used doesn't come back as litefinder.com - they have their ID blocked. THIS IS HIGHLY SUSPICIOUS!
User-agent: LiteFinder/1.0
Disallow: /

#Lycos_Spider_(modspider) - 209.202.205.1
#whois lookup - waltham-nat.ma.lycos.com
#ip range assigned - 209.202.192.0 - 209.202.255.255
User-agent: Lycos_Spider_(modspider)
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

#MLBot www.metadatalabs.com
#net range 71.41.192.0 - 71.41.223.255
#sighting on 6-18-09 net range 64.17.0.0 - 64.17.15.255
#web address - www.metadatalabs.com
#OrgName: Core NAP, L.P., Austin Texas
#It gives no real info about the purpose of the crawler, observed it trying to pull directory listings
#spambot or picture stealer?
#update: They are looking for video and audio files according to the writer at their Metablog.
User-agent: MLBot
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

#Mp3Bot/0.6; +http://mp3realm.org/mp3bot/
#ip: 66.147.236.94 (sighting 06/22/2009)
User-agent: Mp3Bot/0.6
Disallow: /

#Moreoverbot/5.00 (65.199.34.69) (+http://www.moreover.com)
#I have not seen this bot check robots.txt
User-agent: Moreoverbot/5.00
Disallow: /cgi-bin/

#MUNAX
#http://www.munax.com/referer.htm - from Stockholm, Sweden (82.99.30.0 - 82.99.30.127)
#A new bot, crawling with no bot identifier, to index pages for purpose of hotlinking
#to images. Since they do not bother with robots.txt this one gets an automatic blacklist.
#Goes immediately for images without the page. The whole IP range is not blocked.
User-agent: MUNAX (made up since they didnt use one)
Disallow: /

#NPBot is the Name Protect Bot (http://www.nameprotect.com/)
User-agent: NPBot
Disallow: /

#nicebot - eggdrop bot of some type
#it's unknown if they will scan for this robots.txt file but I will try
#NOPE... he doesn't even look for this.
#I am blocking the IP 60.60.120.169
User-agent: nicebot
Disallow: /

#OmniExplorer bot flooded my site with repeated requests for a bbs list
#Sneaky Bastards changed the user agent name and hit again
#any ip associated to this will simply be banned
User-agent: OmniExplorer_Bot/*
Disallow: /

#picmole/1.0 +http://www.picmole.com [67.205.96.37]
#NetRange: 67.205.64.0 - 67.205.127.255
#NetRange: [PicMole Dedicated] 67.205.96.32 - 67.205.96.63
User-agent: picmole/1.0
Disallow: /

#proximic; +http://www.proximic.com
#current IP (85.25.151.39) Munich, Germany
#seems to be looking for images, keep an eye on this one, let's see what it does
User-agent: proximic
Disallow: /_vti_cnf/

#psbot is the spider from picsearch.com, (217.212.224.128 - 217.212.224.255)
#They can't have my pics!
User-agent: psbot
Disallow: /

#From the honeypot.be robots.txt
#Robofox is a site leeching tool - as soon as I identify an IP it will be listed here!
User-agent: Robofox v2.0
Disallow: /

#Rufusbot (Rufus Web Miner; http://64.124.122.252/feedback.html)
#picture stealer, does not have a real web site. Has only an initial apache setup screen.
#Does a rapid hit and run on all available pictures.
#MORE INFO: Owned by Webaroo.com, a self described "Stealth Startup" no public info or product!
#(64.124.122.224 - 64.124.122.255) are associated IPs
#No they are not from Australia, they are from India. The web site name and graphic is deceptive!
#UPDATE: They have now further disguised themselves.
#New contact info: Private Registration
#Beanstalk Technology ATTN: WEBAROO.COM c/o NETWORK SOLUTIONS - Admin Contact, William Pagano
User-agent: Rufusbot
Disallow: /

#also
User-agent: Webaroo
Disallow: /

#schibstedsokbot FAST FreshCrawler 6; +http://www.schibstedsok.no/bot/
#fake bot info address returns a 404 not found, no other info found
#last used IP 81.93.168.71 in Basefarm network in Norway (81.93.168.64 - 81.93.168.79)
#on next scan, set up RIPE block if robots.txt is ignored
User-agent: schibstedsokbot
Disallow: /

#Shelob (shelob@gmx.net) 85.25.124.167
#spambot of some kind operating out of Germany - netrange: 85.25.120.0 - 85.25.127.255
##it went for all my email honey traps that I have set for them and got 100,000 bogus email addresses
User-agent: Shelob
Disallow: /

#Sosospider+(+http://help.soso.com/webspider.htm)
#Chinese web spider observed on 17/Jun/2009 using 58.61.164.141
User-agent: Sosospider
Disallow: /

#Spambot 207.36.209.124
User-agent: Mizzu Labs 2.2
Disallow: /

#Sphere Scout&v4.0 (beta) - scout (at) sphere (dot) com
#I see this in my access log several times now, so I decided to see what my listing looks like
#I couldn't find a listing for my blogs. What are they doing, wasting my bandwidth?
User-agent: Sphere Scout&v4.0 (beta)
Disallow: /

#studybot/1.0 - 66.65.114.70
#pretty anonymous, no bot info or contact address, google search no info except logs of scans
#on other pages on the web, looks like a possible spambot address harvester
User-agent: studybot/1.0
Disallow: /

#TurnitinBot/2.1 - 65.98.224.5 iParadigms Oakland, Ca
#IP range: 65.98.224.0 - 65.98.224.31
#crawls the web looking for info that students use for writing reports.
#I consider this as a spybot and am blocking it!
User-agent: TurnitinBot/2.1
Disallow: /

#Under The Rainbow 2.2 (80.58.13.108)
#honeypot.de identifies this as a spambot. Guess what! This sucker disregards robots.txt!
#The above listed IP is now blocked. I may have to resort to blocking a whole range of IP addresses.
User-agent: Under The Rainbow 2.2
Disallow: /

#Wasabot/1.4 (+ http://www.wasalive.com ) - [88.191.42.56]
#net range: 88.191.3.0 - 88.191.248.255
#owner-name: Trendy Buzz person: Benjamin Fabre city: Paris country: France
User-agent: Wasabot/1.4
Disallow: /cgi-bin/

#WebReaper is a freeware offline browser that downloads a website for viewing unconnected
User-agent: WebReaper [webreaper@otway.com]
Disallow: /

#Yandex (77.88.26.26, 77.88.22.109)
#Russian Federation Yandex Enterprise Network
#NetRange: 77.0.0.0 - 77.255.255.255
User-agent: Yandex/1.01.001
Disallow: /cgi-bin/

#Yoriwa/0.1 (24.4.88.223) and today 12/31/2006 from (66.160.142.7)
#www.yoriwa.com says this is a new web crawler startup not yet available for public use
User-agent: Yoriwa/0.1
Disallow: /cgi-bin/
Disallow: /_vti_cnf/

#zermelo; +http://www.powerset.com 67.202.28.203 (Net range - 67.202.0.0 - 67.202.63.255)
#After googling this bot, I have decided to block it due to comments of others
#Location: Kansas City Amazon Web Services, Elastic Compute Cloud, EC2
#Why would Amazon operate a bot like this?
User-agent: zermelo
Disallow: /

#ZIBB Crawler (email address / WWW address) is info sent by the bot using 208.68.136.5 out of a whole
#net range of 208.68.136.0 - 208.68.143.255 from FAST Search & Transfer Inc, Needham, MA
#zibber-v0.1(www.zibb.com/crawler/) is info I found via a Yahoo search... same or a bandit?
#the web site says it is a business oriented bot, so what is it doing at my web page?
#It did look for robots.txt but unknown if it obeys it. As you can see it is pretty anonymous. No contact
#info is included with the bot scan. I am going to observe to see what it is looking for.
#I tried using their web contact form and it gives an error message... these guys look phony.
#I recommend blocking the IP address.
User-agent: ZIBB Crawler (email address / WWW address)
Disallow: /

#also
User-agent: zibber-v0.1(www.zibb.com/crawler/)
Disallow: /

#########################################################
#Generic Crawler - all other crawlers can use this template
#########################################################
User-agent: *
Disallow: /cgi-bin/

#################################################################################
# the end for now! Bye-bye and toodles
#################################################################################