2. Crawling

Crawling is how search engines discover new links and process the pages they find on the internet. This section provides an overview based on reliable sources.

Google's Search Index contains hundreds of billions of webpages and is over 100,000,000 gigabytes (100,000 terabytes) in size. To feed that index, Google uses Googlebot, the generic name for its web crawling infrastructure, to discover new pages and add them to its list of known pages. Bing uses Bingbot as its standard crawler. Both Google and Bing crawl the web with desktop and mobile User-Agents.
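
Because any client can put a crawler's name in its User-Agent header, both engines recommend verifying crawler traffic with a reverse-then-forward DNS lookup. The sketch below shows the general shape of that check; the hostname suffixes and sample IP are illustrative and should be checked against each engine's current documentation.

    import socket

    # Hostname suffixes used by Google's and Bing's crawlers (illustrative;
    # confirm against each engine's documentation before relying on them).
    VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_verified_crawler(ip_address: str) -> bool:
        """Verify a claimed crawler IP with a reverse-then-forward DNS lookup."""
        try:
            # Reverse lookup: IP -> hostname, which must end in a known suffix.
            hostname, _, _ = socket.gethostbyaddr(ip_address)
            if not hostname.endswith(VERIFIED_SUFFIXES):
                return False
            # Forward lookup: the hostname must resolve back to the same IP.
            return socket.gethostbyname(hostname) == ip_address
        except (socket.herror, socket.gaierror):
            return False

    # A request claiming "Googlebot" in its User-Agent should also pass this check.
    print(is_verified_crawler("66.249.66.1"))  # sample IP, for illustration only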

The web has grown into an enormous resource of available information. It is important to note that Google (and Bing) only know about a portion of the web, called the surface web. The surface web is the portion that is publicly accessible to crawlers via web pages. Other parts of the web are the deep web and the hidden web. The deep web is estimated to be 4,000 to 5,000 times larger than the surface web. This article by the University of Washington details the differences.

Google uses a distributed crawling infrastructure that spreads load across many machines. It is also an incremental crawler: it continually refreshes its collection of pages with up-to-date versions, based on the perceived importance of each page to users. Crawling is distinct from rendering, although both processes are often attributed to Googlebot.
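
To make "incremental" concrete, here is a toy re-crawl scheduler. The scoring and constants are invented for illustration and are not Google's actual logic; they only show the idea that more important and staler pages get re-fetched sooner.

    import heapq
    import time

    class IncrementalCrawlScheduler:
        """Toy incremental crawler: URLs are re-fetched sooner when they are
        more important or have gone longer without a refresh."""

        def __init__(self):
            self._heap = []  # (next_crawl_time, url)

        def schedule(self, url: str, importance: float, last_crawled: float):
            # Higher assumed importance -> shorter refresh interval (min one hour).
            refresh_interval = max(3600, 86400 / max(importance, 0.01))
            heapq.heappush(self._heap, (last_crawled + refresh_interval, url))

        def next_due(self):
            """Return the next URL whose refresh time has passed, if any."""
            if self._heap and self._heap[0][0] <= time.time():
                return heapq.heappop(self._heap)[1]
            return None

    scheduler = IncrementalCrawlScheduler()
    scheduler.schedule("https://example.com/", importance=0.9, last_crawled=time.time() - 100000)
    scheduler.schedule("https://example.com/archive/2009", importance=0.1, last_crawled=time.time())
    print(scheduler.next_due())  # the stale, important homepage is due first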

Crawl Budget

In 2017, Google provided some guidance on how to think about crawl budget. "Crawl budget" is a term coined by SEOs to describe the amount of crawling resources Google will allocate to a given website. It is a bit of a misnomer, in that several different factors can affect the rate at which Google crawls a website's URLs.

Google first emphasizes that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently. (source)
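
One practical way to check this for your own site is to look at how often Googlebot actually requests your URLs in your server access logs. A minimal sketch, assuming a standard combined-format log at a hypothetical path (and skipping the DNS verification shown earlier):

    import re
    from collections import Counter

    # Count Googlebot requests per URL in a combined-format access log.
    # The log path is a placeholder; adjust it for your server.
    LOG_PATH = "/var/log/nginx/access.log"
    LINE_RE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[^"]*".*"(?P<agent>[^"]*)"$')

    hits = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            match = LINE_RE.search(line.rstrip())
            if match and "Googlebot" in match.group("agent"):
                hits[match.group("path")] += 1

    # URLs Googlebot requested most often in this log window.
    for path, count in hits.most_common(10):
        print(f"{count:6d}  {path}")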

Essentially, the goal of Technical SEOs is to ensure that, for larger websites, Google spends its available resources crawling and rendering the pages that are important from a search perspective, and little to no time crawling non-valuable content. Google lists the following categories of non-value-add URLs, in order of significance: faceted navigation and session identifiers, on-site duplicate content, soft error pages, hacked pages, infinite spaces and proxies, and low-quality and spam content.

The major areas that can affect the rate at which Google crawls your URLs are:

  • Content blocked in robots.txt (see the sketch after this list for a quick way to test whether specific URLs are blocked).

  • Server crawl health. Google tries to be a good citizen and is especially tuned to feedback from your server, such as slowing response times and server errors.

  • Site-wide events like site moves or major content changes.

  • Popularity of your URLs with users.

  • Staleness of the URLs in Google's index. Google tries to keep its index fresh.

  • Reliance on many embedded assets to render pages, like JS, CSS, XHR calls.

  • The category of website (e.g., news or other frequently changing, unique content).
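
As a quick illustration of the first factor above, the Python standard library's robots.txt parser can show whether a given URL is blocked for a crawler. The rules and URLs here are made up for the example:

    from urllib.robotparser import RobotFileParser

    # Illustrative robots.txt rules; a real check would load the live file.
    rules = [
        "User-agent: *",
        "Disallow: /search",
        "Disallow: /private/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    for url in ("https://example.com/products/blue-widget",
                "https://example.com/search?q=widgets"):
        verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
        print(url, "->", verdict)

Blocking low-value URL patterns this way is one of the simplest levers for steering crawl activity toward the pages that matter.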

Cloaking

Google defines cloaking as the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google’s Webmaster Guidelines because it provides users with different results than they expect. (source)

In 2016, Google trained a classification model that was able to accurately detect cloaking 95.5% of the time, with a false positive rate of 0.9%. JavaScript redirection was one of the strongest features predictive of a positive classification. (source)
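
A crude way to probe for this behaviour on a single URL is to fetch it once with a browser-like User-Agent and once with Googlebot's User-Agent, then compare the responses. This sketch only illustrates the idea; the URL and similarity threshold are placeholders, and the published classifier uses far more signals than raw response similarity.

    import difflib
    import urllib.request

    USER_AGENTS = {
        "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    }

    def fetch(url: str, user_agent: str) -> str:
        """Fetch a URL while presenting the given User-Agent."""
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    def looks_cloaked(url: str, threshold: float = 0.7) -> bool:
        """Flag the URL if the two responses differ substantially."""
        pages = {name: fetch(url, ua) for name, ua in USER_AGENTS.items()}
        similarity = difflib.SequenceMatcher(None, pages["browser"], pages["googlebot"]).ratio()
        return similarity < threshold  # very different responses are suspicious

    print(looks_cloaked("https://example.com/"))  # placeholder URL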

Research Articles on Crawling