Filtering for Bots
By:
Siva Katir
- 4/24/2014
<p>It is estimated that bots account for 60% of all requests on the web now. We at POSGuys.com are no exception, bots account for a staggering number of page views every day. While on the front end we treat bots and users the same the story is very different on the back end. Because of the detailed analytics that we keep storing all requests from bots would cause a flood of meaningless data. Meaningless because they don't follow any pattern through the site and would cause useful user interaction data to be diluted. </p> <p>So how do we deal with this? </p> <h3>Request Filtering</h3> <p>Running an ASP.NET web application gives us a few opportune places to hook into and collect the data we're looking for. We track what views are loaded, how long it took to generate each, what URL it was for as well as basic information about the user. As the page is generated this data is collected. Once the generation is complete and the server begins to respond we begin another process to compile our data together and send it to storage for later processing. This is were we check for bots and throw out their results.</p> <h3>The Bot Array</h3> <p>To do this we keep an array of bot names:</p> <pre><code>"bingbot", "abacho", "abcdatos", "abcsearch", "acoon", "adsarobot", "aesop", "ah-ha", "alkalinebot", "almaden", "altavista", "antibot", "anzwerscrawl", "aol", "search", "appie", "arachnoidea", "araneo", "architext", "ariadne", "arianna", "ask", "jeeves", "aspseek", "asterias", "astraspider", "atomz", "augurfind", "backrub", "baiduspider", "bannana_bot", "bbot", "bdcindexer", "blindekuh", "boitho", "boito", "borg-bot", "bsdseek", "christcrawler", "computer_and_automation_research_institute_crawler", "coolbot", "cosmos", "crawler", "crawler@fast", "crawlerboy", "cruiser", "cusco", "cyveillance", "deepindex", "denmex", "dittospyder", "docomo", "dogpile", "dtsearch", "elfinbot", "entire", "esismartspider", "exalead", "excite", "ezresult", "fast", "fast-webcrawler", "fdse", "felix", "fido", "findwhat", "finnish", "firefly", "firstgov", "fluffy", "freecrawl", "frooglebot", "galaxy", "gaisbot", "geckobot", "gencrawler", "geobot", "gigabot", "girafa", "goclick", "goliat", "googlebot", "griffon", "gromit", "grub-client", "gulliver", "gulper", "henrythemiragorobot", "hometown", "hotbot", "htdig", "hubater", "ia_archiver", "ibm_planetwide", "iitrovatore-setaccio", "incywincy", "incrawler", "indy", "infonavirobot", "infoseek", "ingrid", "inspectorwww", "intelliseek", "internetseer", "ip3000.com-crawler", "iron33", "jcrawler", "jeeves", "jubii", "kanoodle", "kapito", "kit_fireball", "kit-fireball", "ko_yappo_robot", "kototoi", "lachesis", "larbin", "legs", "linkwalker", "lnspiderguy", "look.com", "lycos", "mantraagent", "markwatch", "maxbot", "mercator", "merzscope", "meshexplorer", "metacrawler", "mirago", "mnogosearch", "moget", "motor", "muscatferret", "nameprotect", "nationaldirectory", "naverrobot", "nazilla", "ncsa", "beta", "netnose", "netresearchserver", "ng/1.0", "northerlights", "npbot", "nttdirectory_robot", "nutchorg", "nzexplorer", "odp", "openbot", "openfind", "osis-project", "overture", "perlcrawler", "phpdig", "pjspide", "polybot", "pompos", "poppi", "portalb", "psbot", "quepasacreep", "rabot", "raven", "rhcs", "robi", "robocrawl", "robozilla", "roverbot", "scooter", "scrubby", "search.ch", "search.com.ua", "searchfeed", "searchspider", "searchuk", "seventwentyfour", "sidewinder", "sightquestbot", "skymob", "sleek", "slider_search", "slurp", "solbot", "speedfind", "speedy", "spida", "spider_monkey", "spiderku", "stackrambler", "steeler", "suchbot", "suchknecht.at-robot", "suntek", "szukacz", "surferf3", "surfnomore", "surveybot", "suzuran", "synobot", "tarantula", "teomaagent", "teradex", "t-h-u-n-d-e-r-s-t-o-n-e", "tigersuche", "topiclink", "toutatis", "tracerlock", "turnitinbot", "tutorgig", "uaportal", "uasearch.kiev.ua", "uksearcher", "ultraseek", "unitek", "vagabondo", "verygoodsearch", "vivisimo", "voilabot", "voyager", "vscooter", "w3index", "w3c_validator", "wapspider", "wdg_validator", "webcrawler", "webmasterresourcesdirectory", "webmoose", "websearchbench", "webspinne", "whatuseek", "whizbanglab", "winona", "wire", "wotbox", "wscbot", "www.webwombat.com.au", "xenu", "link", "sleuth", "xyro", "yahoobot", "yahoo!", "slurp", "yandex", "yellopet-spider", "zao/0", "zealbot", "zippy", "zyborg", "mediapartners-google", "www.majestic12.co.uk/bot.php", "ocspd" </code></pre> <p>These names match a portion of the user agent string that the bot sends with the request.</p> <p>Here's an example of how to use it to filter out bots in C#.</p> <pre><code>var bots = new[] { /* ...list from above */ } if(!(from b in bots where Request.UserAgent.ToLower().Contains(b) select b).Any()) { //Store Results } </code></pre> <p>And that's it!</p> <p>We sort through our logs to see if we can find new bots as they come along from time to time. If there's any that you've found missing from the list above feel free to leave a comment.</p> <h3>Speed</h3> <p>Because we process this out of scope of the request we are not that concerned about speed. If speed is a concern for you a far better solution would be to get the full user agent string and use them to create a hash table to match against.</p>
Share this post:
Recent Posts By Siva Katir
Using ASP.NET Custom Authorization Attribute to Hide Page Elements
Filtering for Bots
Let Us Fill Your Excel Spreadsheet With Barcodes
Easily Generate Barcodes using Microsoft Excel for Free
Introducing POSGuys Barcode API V.0.1
Please enable JavaScript to view the comments.