A search engine web crawler is an internet bot that search engines use to discover web content and keep their indices of other sites up to date. Web crawlers are also known as spiders, and although search engines are their best-known users, they are employed more widely for web indexing.

They can copy massive numbers of pages for a search engine to process and index, and this is what makes effective web search possible. Web crawlers don’t usually need permission to visit a page or use its resources, but mechanisms such as the robots.txt standard (the Robots Exclusion Protocol) exist to ask them to stay away from part of a site, or the whole of it.
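For example, a site owner can publish a robots.txt file listing the paths crawlers should avoid. The sketch below uses Python’s standard urllib.robotparser module to check a hypothetical robots.txt (the rules and bot name are made up for illustration) before fetching a URL.

```python
import urllib.robotparser

# Hypothetical robots.txt: block every crawler from /private/ and ask
# crawlers to wait ten seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler consults these rules before downloading a page.
print(parser.can_fetch("MyBot", "https://example.com/index.html"))   # True
print(parser.can_fetch("MyBot", "https://example.com/private/data")) # False
```

Honouring these rules is voluntary; the file is a request to crawlers, not an enforcement mechanism.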

Early search engines struggled to present users with the most relevant results. Advances in web crawler technology have enabled spiders to crawl faster, gather more data in less time and index more efficiently. Today, web users can get highly relevant results within seconds.

The web crawling process

Long before any search is conducted, spiders have already collected and indexed data from billions of pages. Crawling usually starts from a list of known URLs, called the seeds – often pages the crawler has visited before. As the crawler fetches those pages, it records every hyperlink it finds and uses them to reach and crawl more pages.
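The following sketch illustrates this loop with nothing but Python’s standard library: fetch a page from the frontier, extract its links, queue any links not seen before, and hand the page off for indexing. The seed URL and the indexing step are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)   # pages waiting to be crawled
    seen = set(seeds)         # never queue the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue          # unreachable page; a real crawler would log it
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)        # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)        # crawl it later
        yield url, html                      # hand the page to the indexer

# Usage (placeholder seed; 'store_in_index' is hypothetical):
# for url, page in crawl(["https://example.com/"]):
#     store_in_index(url, page)
```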

New sites, changes to known ones and dead links are given special attention. Policies and parameters govern which sites to crawl, how often to revisit them and how many pages to copy. A simple way to detect such changes is sketched below.
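One simple way a re-crawl can spot changes and dead links is to compare a hash of the downloaded page with the hash stored from the previous visit, and to flag failed requests as candidate dead links. This is only an illustrative sketch, not how any particular search engine does it.

```python
import hashlib
from urllib.error import URLError
from urllib.request import urlopen

def recheck(url, previous_hash):
    """Return ('dead' | 'changed' | 'unchanged', new_hash_or_None)."""
    try:
        body = urlopen(url, timeout=10).read()
    except URLError:
        return "dead", None                        # candidate dead link
    new_hash = hashlib.sha256(body).hexdigest()    # fingerprint of the content
    if new_hash != previous_hash:
        return "changed", new_hash                 # page needs re-indexing
    return "unchanged", new_hash
```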

Crawling policies

A combination of policies guides how web crawlers behave. Some of the policies are:

  • Selection policy – the World Wide Web is enormous, and even the largest crawlers cannot index all of it. Because a crawler can download only a fraction of the web, it is desirable that the downloaded pages be the most relevant ones. The selection policy tells spiders which pages to download.
  • Politeness policy – this prevents crawlers from overloading a site and crippling its performance by spacing out their requests. It matters most for sites serving large amounts of content.
  • Re-visit policy – after a site has been crawled and the spider has moved on to another part of the web, the content may change through additions, deletions and updates. The re-visit policy dictates when a crawler should check a page again for changes.
  • Parallelization policy – this dictates how to coordinate distributed web crawlers so they do not duplicate one another’s work; a minimal sketch of the politeness and parallelization policies follows this list.
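As an illustration of those last two policies, the sketch below enforces a minimum delay between requests to the same host and assigns every host to one of several crawler workers by hashing its name, so that a given site is only ever fetched by one worker. The delay value and worker count are arbitrary assumptions.

```python
import hashlib
import time
from urllib.parse import urlparse

MIN_DELAY = 10.0     # assumed politeness delay per host, in seconds
NUM_WORKERS = 4      # assumed number of distributed crawler processes

last_fetch = {}      # host -> time of the most recent request to it

def polite_wait(url):
    """Politeness: sleep if this host was contacted less than MIN_DELAY seconds ago."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - last_fetch.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_fetch[host] = time.monotonic()

def assigned_worker(url):
    """Parallelization: every URL from a given host maps to the same worker,
    so workers never compete for, or re-download, each other's pages."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS
```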