Do web bots dream of electric sheep?
Well, they would if you were using them to compile a search engine of sites relating to farmyard animals.
Fiat Stradas may have been hand-built by robots in the ’70s, but since then robotic hands have turned to building the internet instead.
In this article, we take a peek at the wiring behind various bionic buzzwords, from search engine spiders to artificial intelligence with world domination on its hive mind…
AKA ‘bot’, ‘web bot’, ‘internet bot’
In development, a robot is any program written to do a job too repetitive or dull for a human to do. Like the doors with personalities in Douglas Adams’ galaxy, web bots will sigh with satisfaction every time they carry out a task you or I would find mind-numbingly tedious – and they will do so over and over again without getting bored.
Some bots are tasked with giving new users of a service, such as the chat software Slack, a friendly welcome and helpful advice. Others have less wholesome intentions, including robbing you of that eBay item you’ve been eyeing up all day with a last-second bid.
Evil robot armies are also involved in DDoS attacks (where they will attempt to overload a server with repeated traffic), harvesting email addresses for spam and trying to crack passwords.
Like Cylons in the Battlestar Galactica reboot, these mendacious bots operate by pretending to be human. Though few are as pretty as Tricia Helfer, their creators go to extreme lengths to make their bots appear as realistic as possible in order to fool website security systems.
Bots working together are called ‘botnets’ (short for ‘robot network’). Malicious botnets infect computers like viruses and the software then ‘recruits’ other computers, giving criminals a wide network of computing power right under the noses of their owners.
Controversially, legal software is also distributed this way – Windows 10 turns idle PCs into mini cloud servers, using your bandwidth to share updates across the internet to other Windows 10 users. It’s probably only a few short steps from this to Skynet, Terminators and the destruction of mankind…
In the world of SEO, the most important kind of bot resembles the mechanical spiders that ruin Tom Cruise’s bath time in Minority Report.
Web crawlers are unleashed on the internet by search engine providers, not to terrify arachnophobes and technophobes alike, but to explore the nooks and crannies of the web.
A crawler bot will scuttle indiscriminately over your page, analysing code and content before slipping through hyperlinks like cracks in the floorboards.
Remember, the purpose of bots is to do work that is too monotonous for people. Google and Bing could employ thousands of workers to sit reading the internet all day then report back on their findings in order to rank pages on quality and relevance, but crawlers can do this a lot faster and won’t get sidetracked filling in quizzes to find out which member of Little Mix they most resemble.
Crawlers have OCD programmed into them, meaning not only will they keep browsing a site until they have followed every single link, they will come back a few days later to check if anything has changed.
This is how search engines keep their indexes up-to-date. Crawler bots will count keywords, analyse coding quality and test the speed of your hosting, all in the pursuit of knowledge.
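The crawl-every-link behaviour described above can be sketched in a few lines of Python. Everything here is illustrative: the tiny in-memory ‘web’ (a dict mapping page paths to the links found on them) stands in for real HTTP fetching, which a production crawler would do with proper rate limiting and revisit scheduling.

```python
from collections import deque

# A toy "web": each page maps to the links found on it (invented data).
PAGES = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/", "/about"],
}

def crawl(start):
    """Breadth-first crawl: keep following links until every page is seen."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)          # a real crawler would analyse the page here
        for link in PAGES.get(url, []):
            if link not in seen:   # don't re-queue pages already visited
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # every reachable page is visited exactly once
```

The `seen` set is what stops the crawler chasing its own tail around circular links – real crawlers do the same thing at vastly larger scale, plus the priority scoring mentioned above.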
However, like Robin Williams’ Bicentennial Man, they also wish to understand the human experience. The vastness of the internet makes it necessary for crawling software to second-guess what people actually want, so algorithms create priority ratings and some (or in fact, most) pages get forgotten about or are never indexed in the first place.
AKA ‘robots exclusion standard’
As Star Trek fans will know, robots sometimes need to be interrupted. After all, we’ve already mentioned that bots in DDoS attacks mimic genuine traffic to break a server, so if you’ve got dozens of search engines obsessively crawling your site, couldn’t that have the same effect?
Indeed it could, which is precisely why the robots.txt file was introduced back in 1994 – to prevent robots accidentally hogging all the bandwidth, a much more pressing issue pre-broadband.
These days, the bandwidth impact of crawlers is negligible, so robots.txt is mainly used to guide search engines to the content you want listing – and away from stuff you’d rather wasn’t.
‘Guide’ is the key word here – you’re only asking the crawler not to index stuff, so you’d better hope you have a polite robot like Red Dwarf’s Kryten who will respect the request and not a Hudzen-10 barging his way in with his groinal attachment.
The concept of ‘politeness’ is actually factored into the coding of crawlers: search engines would rather treat your site with respect than have you block them via robots.txt because they’re crawling too many pages too often.
Malicious bots, however, will simply ignore robots.txt, so your site will require additional security measures to deal with these threats. Robots.txt should be seen primarily as a way of talking to search engines about how you’d like your site listing.
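To sketch how a well-behaved crawler consults the file, Python’s standard-library `urllib.robotparser` can be fed a robots.txt directly. The rules below – a `/private/` directory and a ‘BadBot’ user agent – are invented for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: the paths and user agents are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks before fetching; remember, the file is a
# request, not a lock - malicious bots simply won't bother asking.
print(parser.can_fetch("*", "/blog/post-1"))       # public content
print(parser.can_fetch("*", "/private/draft"))     # politely declined
print(parser.can_fetch("BadBot", "/blog/post-1"))  # barred entirely
```

Note that `can_fetch` only tells the bot what it *should* do – enforcement is entirely down to the bot’s good manners.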
‘Disallow’, ‘noindex’ and ‘nofollow’
Your robots.txt file can tell a crawler to ignore certain pages using a ‘disallow’ command. You can also put code into the header portion of individual pages asking them to abide by a ‘noindex’ or ‘nofollow’ request.
‘Noindex’ simply means you don’t want that page indexing (i.e. appearing in search results).
‘Nofollow’ means you don’t want the links appearing on that page to contribute to other sites’ rankings, you miser, you!
Perversely, Google bots will still follow a link marked ‘nofollow’, they just won’t use that link when calculating the site’s PageRank (the score Google gives websites based on the number of links they receive from other sites). You are linking to the site, but you aren’t voting for them in the Google popularity contest.
Reducing other people’s PageRank might seem mean-spirited, but if you are linking to a rival company in a blog comparing your products with theirs, for example, you don’t want to inadvertently send a load of crawlers their way any more than you want to pass on potential customers with a glowing recommendation.
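In markup, these requests look something like the snippet below – a page-wide meta tag in the head, and a per-link `rel` attribute. The page content and URL are invented for illustration.

```html
<!-- In the <head>: ask crawlers not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- On an individual link: don't pass PageRank to the target -->
<a href="https://rival-company.example/prices" rel="nofollow">their prices</a>
```

The meta tag applies to the whole page, while the `rel="nofollow"` attribute lets you single out individual links – handy for that product-comparison blog post.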
[Chris writes more about implementing ‘nofollow’ here.]
AKA the ‘Imitation Game’
The renowned computer scientist and codebreaker Alan Turing recognised the need to distinguish between human and artificial intelligence back in 1950, long before most people had even heard of artificial intelligence as a concept, never mind such a thing becoming a potential reality.
The idea is simple – a person has a text-based conversation with what they believe is another human, but is in fact a computer. If the human can’t tell the difference, the computer wins and eventually enslaves the human race.
We’re a way away from that at the moment, but really intelligent humans like Stephen Hawking and Elon Musk are worried about the threat AI poses. Personally I believe evil robots are already fooling humans on a daily basis in the comments section of YouTube… I mean, some of those trolls can’t be real people, can they?