Crawl Budget Management For Large Sites | Google Search Central | Documentation | Google for Developers

Large site owner's guide to managing your crawl budget

This guide describes how to optimize Google's crawling of very large and frequently updated sites.

If your site does not have a large number of pages that change rapidly, or if your pages seem to be crawled the same day that they are published, you don't need to read this guide; merely keeping your sitemap up to date and checking your index coverage regularly is adequate.

If you have content that's been available for a while but has never been indexed, this is a different problem; use the URL Inspection tool instead to find out why your page isn't being indexed.

Who this guide is for

This is an advanced guide and is intended for:

  * Large sites (1 million+ unique pages) with content that changes moderately often (for example, once a week)
  * Medium or larger sites (10,000+ unique pages) with very rapidly changing content (daily)

General theory of crawling

The web is a nearly infinite space, exceeding Google's ability to explore and index every available URL. As a result, there are limits to how much time Googlebot can spend crawling any single site. The amount of time and resources that Google devotes to crawling a site is commonly called the site's crawl budget. Note that not everything crawled on your site will necessarily be indexed; each page must be evaluated, consolidated, and assessed to determine whether it will be indexed after it has been crawled.

Crawl budget is determined by two main elements: crawl capacity limit and crawl demand.

Crawl capacity limit

Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
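As an illustration only (this is not Google's implementation), the two knobs described above, a cap on simultaneous connections and a delay between fetches, can be sketched as a small rate-limited fetcher; the limits and URLs below are made-up values:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical values; a real crawler adjusts these based on server health.
MAX_PARALLEL_CONNECTIONS = 4   # cap on simultaneous fetches
DELAY_BETWEEN_FETCHES = 1.0    # seconds each worker waits between its requests

def fetch(url: str) -> int:
    """Fetch one URL and return its HTTP status code."""
    with urlopen(url, timeout=10) as response:
        status = response.status
    time.sleep(DELAY_BETWEEN_FETCHES)  # back off before this worker's next fetch
    return status

def crawl(urls: list[str]) -> None:
    # The pool size enforces the cap on simultaneous parallel connections.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_CONNECTIONS) as pool:
        for url, status in zip(urls, pool.map(fetch, urls)):
            print(status, url)

if __name__ == "__main__":
    crawl(["https://example.com/", "https://example.com/about"])
```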

The crawl capacity limit can go up and down based on a few factors:

Crawl demand

Google typically spends as much time as necessary crawling a site, given its size, update frequency, page quality, and relevance, compared to other sites.

The factors that play a significant role in determining crawl demand are:

Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.

In sum

Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl. Even if the crawl capacity limit isn't reached, if crawl demand is low, Googlebot will crawl your site less.

Best practices

Follow these best practices to maximize your crawling efficiency:

Monitor your site's crawling and indexing

Here are the key steps to monitoring your site's crawl profile:

  1. See if Googlebot is encountering availability issues on your site.
  2. See whether you have pages that aren't being crawled, but should be.
  3. See whether any parts of your site need to be crawled more quickly than they already are.
  4. Improve your site's crawl efficiency.
  5. Handle overcrawling of your site.

See if Googlebot is encountering availability issues on your site

Improving your site availability won't necessarily increase your crawl budget; Google determines the best crawl rate based on the crawl demand, as described previously. However, availability issues do prevent Google from crawling your site as much as it might want to.

Diagnosing:

Use the Crawl Stats report to see Googlebot's crawling history for your site. The report shows when Google encountered availability issues on your site. If availability errors or warnings are reported for your site, look for instances in the Host availability graphs where Googlebot requests exceeded the red limit line, click into the graph to see which URLs were failing, and try to correlate those with issues on your site.

You can also use the URL Inspection tool to test a few URLs on your site. If the tool returns Hostload exceeded warnings, Googlebot can't crawl as many URLs from your site as it has discovered.

Treating:

See if any parts of your site are not crawled, but should be

Google spends as much time as necessary on your site in order to index all the high-quality, user-valuable content that it can find. If you think that Googlebot is missing important content, either it doesn't know about the content, the content is blocked from Google, or your site availability is throttling Google's access (or Google is trying not to overload your site).

Diagnosing:

Search Console doesn't provide a crawl history for your site that can be filtered by URL or path, but you can inspect your site logs to see whether specific URLs have been crawled by Googlebot. Whether or not those crawled URLs have been indexed is another story.
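For example, if your server writes Apache- or Nginx-style combined access logs, a short script can show when Googlebot requested specific paths and what status codes it got back. This is only a sketch: the log file name and format are assumptions, and matching on the user-agent string alone doesn't prove a request really came from Google (verifying that requires a reverse DNS lookup or Google's published IP ranges).

```python
import re
from collections import defaultdict

# Assumed location and format (Apache/Nginx "combined" log); adjust for your setup.
LOG_FILE = "access.log"

# Pulls the request path, status code, and user agent out of a combined-format line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_file: str) -> dict[str, list[str]]:
    """Map each requested path to the status codes Googlebot received for it."""
    hits: dict[str, list[str]] = defaultdict(list)
    with open(log_file, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LINE_RE.search(line)
            if match and "Googlebot" in match.group("agent"):
                hits[match.group("path")].append(match.group("status"))
    return hits

if __name__ == "__main__":
    for path, statuses in sorted(googlebot_hits(LOG_FILE).items()):
        print(f"{path}: crawled {len(statuses)} times, statuses {sorted(set(statuses))}")
```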

Remember that for most sites, new pages take at least several days to be noticed; don't expect same-day crawling of new URLs unless your site publishes time-sensitive content, such as a news site.

Treating:

If you are adding pages to your site and they are not being crawled in a reasonable amount of time, either Google doesn't know about them, the content is blocked, your site has reached its maximum serving capacity, or you are out of crawl budget.

  1. Tell Google about your new pages: update your sitemaps to reflect new URLs.
  2. Examine your robots.txt rules to confirm that you're not accidentally blocking pages (see the sketch below).
  3. Review your crawling priorities (a.k.a. use your crawl budget wisely). Manage your inventory and improve your site's crawling efficiency.
  4. Check that you're not running out of serving capacity. Googlebot will scale back its crawling if it detects that your servers are having trouble responding to crawl requests.

Note that pages might not be shown in search results, even if crawled, if there isn't sufficient value or user demand for the content.
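To double-check step 2 in the list above, you can test a handful of important URLs against your live robots.txt file using Python's standard library; the site and URLs here are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and URLs; substitute your own.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

important_urls = [
    "https://example.com/products/new-widget",
    "https://example.com/blog/latest-post",
]

for url in important_urls:
    if robots.can_fetch("Googlebot", url):
        print(f"Allowed: {url}")
    else:
        print(f"Blocked by robots.txt: {url}")
```

Note that this only approximates how robots.txt rules are parsed; Search Console remains the authoritative check for how Google interprets your file.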

See if updates are crawled quickly enough

If we're missing new or updated pages on your site, perhaps it's because we haven't seen them, or haven't noticed that they are updated. Here is how you can help us be aware of page updates.

Note that Google strives to check and index pages in a reasonably timely manner. For most sites, this is three days or more. Don't expect Google to index pages the same day that you publish them unless you are a news site or have other high-value, extremely time-sensitive content.

Diagnosing:

Examine your site logs to see when specific URLs were crawled by Googlebot.

To learn the indexing date, use the URL Inspection tool or do a Google search for URLs that you updated.

Treating:

Do:
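Keeping your sitemaps current is the main lever mentioned earlier in this guide; one widely documented refinement is an accurate lastmod value for each URL. Below is a minimal sketch that writes a sitemap with Python's standard library; the URLs and dates are hypothetical, and in practice you would pull them from your CMS and only change lastmod when the content meaningfully changes:

```python
import xml.etree.ElementTree as ET

# Hypothetical URLs and last-modified dates; in practice, pull these from your
# CMS or database and only update lastmod when content really changes.
pages = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/blog/latest-post", "2024-05-20"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```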

Avoid:

Improve your site's crawl efficiency

Increase your page loading speed

Google's crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests more quickly, we might be able to crawl more pages on your site. That said, Google only wants to crawl high-quality content, so simply making low-quality pages faster won't encourage Googlebot to crawl more of your site; conversely, if we think that we're missing high-quality content on your site, we'll probably increase your budget to crawl that content.
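Before optimizing, it can help to measure how quickly your server actually responds. Here is a rough sketch that times a few fetches with Python's standard library; the URLs are placeholders, and a single measurement is noisy, so averaging over many requests (or reading timings straight from your server logs) gives a better picture:

```python
import time
from urllib.request import urlopen

# Placeholder URLs; test a representative mix of your page templates.
urls = [
    "https://example.com/",
    "https://example.com/category/widgets",
]

for url in urls:
    start = time.perf_counter()
    with urlopen(url, timeout=10) as response:
        response.read()  # include body download, roughly what a crawler pays for
    elapsed = time.perf_counter() - start
    print(f"{url}: {elapsed * 1000:.0f} ms")
```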

Here's how you can optimize your pages and resources for crawling:

Specify content changes with HTTP status codes

Google generally supports the If-Modified-Since and If-None-Match HTTP request headers for crawling. Google's crawlers don't send the headers with all crawl attempts; it depends on the use case of the request (for example, AdsBot is more likely to set the If-Modified-Since and If-None-Match HTTP request headers). If our crawlers send the If-Modified-Since header, the header's value is the date and time the content was last crawled. Based on that value, the server may choose to return a 304 (Not Modified) HTTP status code with no response body, in which case Google will reuse the content version it crawled the last time. If the content is newer than the date specified by the crawler in the If-Modified-Since header, the server can return a 200 (OK) HTTP status code with the response body.

Independently of the request headers, you can send a 304 (Not Modified) HTTP status code and no response body for any Googlebot request if the content hasn't changed since Googlebot last visited the URL. This will save your server processing time and resources, which may indirectly improve crawl efficiency.
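As a sketch of what this looks like server-side, the handler below, built only on Python's standard library, returns 304 with no body when If-Modified-Since indicates the crawler already has the current version. The fixed last-modified timestamp and the single hard-coded page are placeholders; a real site would look these up per URL:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder: pretend every page on this server last changed at this moment.
LAST_MODIFIED = datetime(2024, 5, 20, 12, 0, 0, tzinfo=timezone.utc)
BODY = b"<html><body>Hello, crawler.</body></html>"

class ConditionalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                since = parsedate_to_datetime(ims)
                if LAST_MODIFIED <= since:
                    # Nothing changed since the crawler's last visit: 304, no body.
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # unparsable header; fall through and serve the full page
        self.send_response(200)
        self.send_header("Last-Modified", format_datetime(LAST_MODIFIED, usegmt=True))
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

if __name__ == "__main__":
    HTTPServer(("", 8000), ConditionalHandler).serve_forever()
```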

Hide URLs that you don't want in search results

Wasting server resources on unnecessary pages can reduce crawl activity from pages that are important to you, which may cause a significant delay in discovering great new or updated content on a site.
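One way to spot this kind of waste is to group the URLs Googlebot requests by path while ignoring query strings: a path with hundreds of parameterized variants is a candidate for cleanup. A sketch, assuming you have already extracted the requested URLs (for example with the log script earlier in this guide):

```python
from collections import Counter
from urllib.parse import urlsplit

def parameter_variants(crawled_urls: list[str]) -> Counter:
    """Count how many distinct parameterized URLs were crawled for each path."""
    variants = Counter()
    for url in set(crawled_urls):  # deduplicate repeated requests to the same URL
        parts = urlsplit(url)
        if parts.query:  # only count URLs that carry query parameters
            variants[parts.path] += 1
    return variants

if __name__ == "__main__":
    # Placeholder data; feed in real paths pulled from your access logs.
    crawled = [
        "/shoes?color=red&size=9",
        "/shoes?color=blue&size=9",
        "/shoes?sessionid=abc123",
        "/about",
    ]
    for path, count in parameter_variants(crawled).most_common(20):
        print(f"{path}: {count} parameterized variants crawled")
```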

Exposing many URLs on your site that you don't want crawled by Search can negatively affect a site's crawling and indexing. Typically these URLs fall into the following categories:

Do:

Avoid:

Handle overcrawling of your site (emergencies)

Googlebot has algorithms to prevent it from overwhelming your site with crawl requests. However, if you find that Googlebot is overwhelming your site, there are a few things you can do.

Diagnosing:

Monitor your server for excessive Googlebot requests to your site.

Treating:

In an emergency, we recommend the following steps to slow down an overwhelming crawl from Googlebot:

  1. Return 503 or 429 HTTP response status codes temporarily for Googlebot requests when your server is overloaded (a sketch follows this list). Googlebot will retry these URLs for about 2 days. Note that returning "no availability" codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps.
  2. When the crawl rate goes down, stop returning 503 or 429 HTTP response status codes for crawl requests; returning 503 or 429 for more than 2 days will cause Google to drop those URLs from the index.
  3. Monitor your crawling and your host capacity over time.
  4. If the problematic crawler is one of the AdsBot crawlers, the problem is likely that you have created Dynamic Search Ad targets for your site that Google is trying to crawl. This crawl recurs every 3 weeks. If you don't have the server capacity to handle these crawls, either limit your ad targets or increase your serving capacity.
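As a sketch of step 1, a server might shed load roughly like this; the overload check and threshold are placeholders (and Unix-only), and you should stop returning these codes as soon as the pressure eases:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def server_overloaded() -> bool:
    """Placeholder health check: treat a high 1-minute load average as overload."""
    return os.getloadavg()[0] > 8.0  # illustrative threshold; Unix-only call

class LoadSheddingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if server_overloaded():
            # Temporary "come back later" signal; don't keep this up for days.
            self.send_response(503)
            self.send_header("Retry-After", "3600")  # seconds; a hint, not a guarantee
            self.end_headers()
            return
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), LoadSheddingHandler).serve_forever()
```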

Myths and facts about crawling

Test your knowledge of how Google crawls and indexes websites.

Compressing my sitemaps can increase my crawl budget.

False

It won't. Zipped sitemaps still have to be fetched from the server, so you're not really saving much crawling time or effort on Google's part by sending compressed sitemaps.

Google prefers fresher content, so I'd better keep tweaking my page.

False

Content is rated by quality, regardless of age. Create and update your content as necessary, but there's no additional value in making pages artificially appear to be fresh by making trivial changes and updating the page date.

Google prefers old content (it has more weight) over fresh content.

False

If your page is useful, it's useful, whether it's new or old.

Google prefers clean URLs and doesn't like query parameters.

False

We can crawl URLs that include query parameters.

The faster your pages load and render, the more Google is able to crawl.

True

True, in that our resources are limited by a combination of time and number of crawling bots. If you can serve us more pages in a limited time, we will be able to crawl more of them. However, we might devote more time crawling a site that has more important information, even if it is slower. It's probably more important for you to make your site faster for your users than to make it faster to increase your crawl coverage. It's much simpler to help Google crawl the right content than it is to crawl all your content every time. Note that crawling a site involves both retrieving and rendering the content. Time spent rendering the page counts as much as time spent requesting the page. So making your pages faster to render will also increase the crawl speed.

Small sites aren't crawled as often as big ones.

False

If a site has important content that changes often, we crawl it often, regardless of the size.

The closer your content is to the home page, the more important it is to Google.

Partly true

Your site's home page is often the most important page on your site, and so pages linked directly to the home page may be seen as more important, and therefore crawled more often. However, this doesn't mean that these pages will be ranked more highly than other pages on your site.

URL versioning is a good way to encourage Google to recrawl my pages.

Partly true

Using a versioned URL for your page in order to entice Google to crawl it again sooner will probably work, but often this is not necessary, and will waste crawl resources if the page is not actually changed. If you do use versioned URLs to indicate new content, we recommend that you only change the URL when the page content has changed meaningfully.

Site speed and errors affect my crawl budget.

True

Making a site faster improves the users' experience while also increasing crawl rate. For Googlebot, a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx HTTP response status codes (server errors) or connection timeouts signal the opposite, and crawling slows down. We recommend paying attention to the Crawl Stats report in Search Console and keeping the number of server errors low.

Crawling is a ranking factor.

False

Improving your crawl rate will not necessarily lead to better positions in search results. Google uses many signals to rank the results, and while crawling is necessary for a page to be in search results, it's not a ranking signal.

Alternate URLs and embedded content count in the crawl budget.

True

Generally, any URL that Googlebot crawls will count towards a site's crawl budget. Alternate URLs, like AMP or hreflang, as well as embedded content, such as CSS and JavaScript, including XHR fetches, may have to be crawled and will consume a site's crawl budget.

I can control Googlebot with the "crawl-delay" rule.

False

The non-standard "crawl-delay" robots.txt rule is not processed by Googlebot.

The nofollow rule affects crawl budget.

Partly true

Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow, it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.

I can use noindex to control crawl budget.

Partly true

Any URL that is crawled affects crawl budget, and Google has to crawl the page in order to find the noindex rule.

However, noindex is there to help you keep things out of the index. If you want to ensure that those pages don't end up in Google's index, continue using noindex and don't worry about crawl budget. It's also important to note that if you remove URLs from Google's index with noindex or otherwise, Googlebot can focus on other URLs on your site, which means noindex can indirectly free up some crawl budget for your site in the long run.

Pages that serve 4xx HTTP status codes are wasting crawl budget.

False

Pages that serve 4xx HTTP status codes (except 429) don't waste crawl budget. Google attempted to crawl the page, but received a status code and no other content.