Crawl budget basics: Why Google isn’t indexing your pages—and what to do about it

As a marketer, you’ve spent hours adding value to your website. Now imagine a visitor drops by regularly to check what’s new and decide what’s worth showing in Google Search.

That visitor? It’s called Googlebot, and it’s the crawler responsible for discovering and indexing your content. It scans your pages to decide what should be included in Google Search and how often to return for updates.


But Googlebot doesn’t have unlimited resources to always crawl in-depth. Each site gets a set crawl budget, or an allowance of time and bandwidth for Googlebot to spend exploring your site.

The more efficiently you use your crawl budget, the easier it is for Googlebot to find and prioritize your most valuable content, which can help you rank.

Let’s start with the basics: What is crawl budget, and why does it matter?

What is crawl budget (and why does it matter)?

Crawl budget is the limit that Googlebot has for how many pages it’s willing to “crawl” on your website in a given timeframe.

Think of Googlebot as having a set amount of time and energy each day to explore your site. It flips through your site’s pages, deciding what to read and what to skip.

If your site has 10,000 URLs but Googlebot only has the energy to crawl 2,000 today, it has to prioritize. And you want it to prioritize the right things because without guidance, Googlebot might waste time on low-value pages.


Instead of indexing your latest blog post or your new campaign landing page, it could get stuck crawling 300 nearly identical filter URLs.

Let’s say you run an online shop with 6,000 pages. Now imagine half of those pages are variations—color filters, size options, slight duplicates.

To a customer, those variations are useful. But to Googlebot, they’re mostly the same.

So while it’s busy crawling those near-identical filter variations, it might skip pages like your new product launches, refreshed category pages, or latest blog posts.

Even if the content is ready, the most important pages might not be crawled—or indexed—soon enough. All because your crawl budget was spent elsewhere.

Crawlability vs. crawl budget: What’s the difference?

Crawlability and crawl budget sound similar, but they’re not the same thing.

Both matter because without access and priority, even your best pages can go unseen by Google and never show up in search.

1. Crawlability = Access

Crawlability answers a simple question: Can Googlebot access this page?

If the answer is no, it won’t crawl the page, no matter how important it is.

Example: You block a page in your robots.txt file. The page still exists, but Googlebot sees that block as a “Do not enter” sign.

It skips the page entirely, freeing up crawl budget for other areas.

2. Crawl budget = Priority and choice

Crawl budget comes after crawlability.

It’s no longer “Can I crawl this page?”—it’s:

“Do I have the time and energy to crawl this page soon?”

Even if a page is crawlable, Googlebot might decide it’s not worth its limited attention right now.

Example: You’ve got a crawlable event page from 2017 that’s still live. It isn’t blocked, but it’s outdated and gets no traffic.

Googlebot might think:

“Hmm. Not urgent. I’ll come back to it… eventually.”

So even though the page is crawlable, it might go untouched for months.

Crawlability vs. crawl budget isn’t an either/or choice. You need both working together.

If a page isn’t crawlable, it won’t be discovered.

If it’s crawlable but low priority, it might be ignored until it’s too late.

This helps show how they’re related, but not interchangeable.


Why crawl budget matters—and when it actually applies to your site

If Googlebot hasn’t crawled your page, it can’t rank it.

It might not even know it exists—or worse, it could be showing an outdated version in search results.

Your crawl budget decides whether Google sees your page and when, which has everything to do with your chances of showing up (and showing up well) in search.

For example, if you launch a new product page that hasn’t been crawled, it won’t appear in search. Or if you’ve updated pricing across service pages but Googlebot hasn’t had a chance to recrawl, users might still see outdated prices in the SERP.

This is where crawl budget gets serious.

When crawl budget becomes a real concern

While crawl budget affects every site, it’s especially critical for large sites with thousands of URLs, ecommerce stores with faceted navigation, publishers posting time-sensitive content, and sites that add or update pages frequently.

If Googlebot can’t keep up, your most important or time-sensitive content might be the very thing that gets missed.

Running a smaller site?

Larger sites are harder to manage, including from a crawl perspective. If your site has fewer than 500–1,000 indexable URLs, crawl budget likely isn’t your main issue. Googlebot can typically handle small and mid-sized sites with ease, reaching every part of your site.

In these cases, focus on what’s blocking indexing, not crawling. Common culprits include stray noindex tags, canonical tags pointing at other URLs, and thin or duplicate content.


Pro tip: Use the Pages report in Google Search Console to see which URLs are excluded and why. You might spot indexability problems faster than expected.


How Google calculates your crawl budget

Google looks at two main factors when deciding what, and how much, to crawl:

  1. Crawl Demand: How much Google wants to crawl from your site.
  2. Crawl Capacity Limit: How much your server can handle without performance issues.

Let’s look at what shapes them.

What drives crawl demand

Crawl demand reflects how valuable or fresh Google thinks your content is. With limited resources, it prioritizes pages that seem worth its time.

Here’s what affects that demand: how popular your URLs are around the web, how often your content actually changes (Google tries not to recrawl stale pages too often), and site-wide events like a migration, which can trigger a burst of recrawling.

What limits Google from crawling your site

Even if Google wants to crawl everything, it won’t if your site shows signs of instability. Crawl capacity generally comes down to two things: your site’s crawl health (slow responses, timeouts, and server errors make Googlebot back off) and Google’s own crawling resource limits.

Crawl signals: How to influence what Googlebot prioritizes

Google doesn’t just crawl everything on your site equally. It prioritizes pages that seem valuable, updated, or in demand.

Several signals influence whether and how often Google crawls a page. Some say “skip this,” while others flag content as important.

Signals that influence crawl budget

So, what exactly tells Google whether to pay attention to a page or skip it?

These signals behind the scenes shape how your crawl budget gets spent.

Robots.txt


This is a simple text file that sits in the root of your website. It tells Googlebot what not to crawl.

So if you block a page here, Google won’t waste any crawl budget trying to reach it. It’ll just move on.

Example: You might block your admin login page or thank-you pages after a form is submitted.
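
As a minimal sketch, that might look like the following robots.txt (the paths are placeholders for your own URLs):

```
# robots.txt lives at the site root, e.g. https://www.example.com/robots.txt
# Paths below are illustrative; swap in your own.
User-agent: *
Disallow: /admin/
Disallow: /thank-you/
```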

Noindex tags

This is a bit different. A noindex tag tells Google, “You can crawl this page, but don’t show it in search results.”

Google might still crawl it, but if it sees that noindex signal over time, it might decide not to crawl it much at all, since it’s not useful for search.

Example: A staging version of a landing page that’s not ready to go live.
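
In practice, noindex is usually a meta tag in the page’s <head> (it can also be sent as an X-Robots-Tag HTTP header). A minimal example:

```html
<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex">
```

One caveat: for Google to see this tag, the page must not be blocked in robots.txt. A blocked page never gets crawled, so the noindex directive is never read.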

Canonicals

Canonicals tell Google which version of similar pages to treat as primary, preventing crawl budget waste across duplicates. So if you’ve got loads of near-identical versions (like product filters or UTM-tagged URLs), a canonical says: “Hey, treat this version as the real deal.”

If you have five filtered product pages for “pink shoes under $20,” but they all show similar items, you can set a canonical tag to point back to the main “pink shoes” page.

That way, you’re not wasting crawl budget on all the lookalikes.
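
As a sketch, each filtered URL would carry a canonical link element in its <head> pointing at the main category page (the URLs here are hypothetical):

```html
<!-- On /shoes?color=pink&maxprice=20 and similar filtered variants -->
<link rel="canonical" href="https://www.example.com/shoes/pink/">
```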

Sitemap entries

A sitemap is like a treasure map of your site. It tells Google: “These are all the key pages I want you to know about.”

If your sitemap is clean, well-structured, and updated regularly, it’s like giving Googlebot a guided tour.

Make sure your sitemap includes your blog posts, main product pages, and key categories—not broken pages or expired URLs.
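
For reference, a minimal XML sitemap entry looks like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/crawl-budget-basics/</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
</urlset>
```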

Internal linking depth

This just means: how many clicks does it take to get to a page from your homepage? If it takes six to seven clicks to find a page, Google might think: “This page must not be that important since it’s not easily accessible for customers.”

Example: Pages linked directly from your homepage, footer, or main menu tend to get crawled more than ones buried deep inside subfolders.

Quick comparison:

| Signal | What it tells Google | Crawl budget effect |
|---|---|---|
| Robots.txt | “Don’t crawl this” | Saves budget outright |
| Noindex tag | “Crawl it, but don’t index it” | Crawling often tapers off over time |
| Canonical | “Treat this version as the primary one” | Consolidates budget onto one URL |
| Sitemap entry | “These pages matter” | Aids discovery and prioritization |
| Internal linking depth | “Shallow, well-linked pages are important” | Pages closer to the homepage get crawled more often |

What wastes crawl budget (and how to fix it)

Think of it like this: Googlebot is flipping through the pages of your website with limited energy. The more it wastes on low-value pages, the less it spends on your top content.

Before we get into the biggest crawl budget wasters, it’s worth running a quick site audit to see if any of these issues are already showing up on your site.

So let’s look at the biggest offenders and how to spot and stop them.

1. Duplicate pages

These are different URLs that show the exact same or very similar content. Think HTTP vs. HTTPS versions of a page, URLs with and without trailing slashes, or UTM-tagged links.

All those pages might look the same to a person, but to Googlebot? They’re separate pages. So it reads the same content over and over.

Exhausting, right?

Why it’s a problem: Google is spending energy crawling versions of the same thing instead of using that energy on new or updated content.

How to fix it: Add canonical tags pointing to the primary version, and consolidate true duplicates with 301 redirects.

Think of canonicals as a gentle nudge saying, “Hey, this version’s the one that matters.”

These are pages that no longer exist but still appear in your internal links or XML sitemaps.

Examples: A deleted product page that still lives in your sitemap or a blog link that returns a vague “Sorry, page not found” message (aka a soft 404).

Why it’s a problem: Google will keep trying to visit these pages like knocking on a door that’s not there. Over and over.

A total waste of time.

How to fix it: Remove or update internal links that point to deleted pages, keep your XML sitemap free of dead URLs, and make sure missing pages return a proper 404 (or a 301 redirect to a relevant replacement, as sketched below) instead of a soft 404.

Think of it like tidying up the hallways so Google doesn’t keep bumping into locked doors.
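
For deleted pages that have a close replacement, a permanent redirect keeps Googlebot (and visitors) from dead-ending. A sketch in nginx config, with placeholder paths:

```nginx
# Send a retired product URL to its replacement with a 301
location = /products/old-widget {
    return 301 /products/new-widget;
}
```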

3. Orphan pages

These include pages that exist, but nothing links to them. They’re floating around your site with no clear way in, almost like a ghost floating around your website.

Example: An old blog post from 2019 that has no links from your homepage, no category page, and no tags. Just… lost.

Why it’s a problem: Google might stumble on it eventually, but it’s using crawl budget on a page that’s not helping your site in any way.

How to fix it: Link to these pages from a relevant category page, blog post, or menu, add them to your XML sitemap, or remove them if they no longer serve a purpose.

No one likes to be left out in the cold. Help Google find your content with proper links.

4. Faceted navigation

These endless combinations of filters or sort orders—think size, color, price, category—generate thousands of slightly different URLs.

Examples: /shoes?color=pink&size=7, /shoes?size=7&color=pink&sort=price-asc, and every other ordering of the same filters.

Why it’s a problem: Googlebot gets stuck in a loop. It keeps crawling tiny variations in URL parameters showing the same products, wasting budget on pages that offer nothing new.

How to fix it: Block crawl-wasting filter parameters in robots.txt (see the sketch below), canonicalize filtered pages to the main category page, and avoid linking to endless filter combinations in the first place.

Think of this as closing the door on an endless maze. By doing so, you’re helping Google get to the good stuff faster.
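
Here’s what blocking filter parameters might look like in robots.txt. This is a sketch with placeholder parameter names; Googlebot supports the * wildcard used here, but test carefully, because overly broad rules can hide pages you do want crawled:

```
# Keep Googlebot out of endless filter combinations (illustrative)
User-agent: Googlebot
Disallow: /*?*sort=
Disallow: /*?*color=
```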

How do you check crawl activity?

Once you understand crawl budget, the next step is monitoring it. Google Search Console (GSC) gives you direct insight into how Googlebot interacts with your site.

This tool gives you a behind-the-scenes look at how Google is crawling your site: how often it visits, what it requests, and whether your server is keeping up.

We’ll walk through where to find this info and what each part means.

1. GSC crawl stats overview

To get started, head over to your GSC property, open Settings from the left-hand menu, and click “Open report” next to Crawl stats.


You’ll now be in the Crawl Stats report. This is where the good stuff lives.

From here, you’ll get a 90-day snapshot of Google’s crawl activity across your site, including any red flags or changes worth noting. Think of it as a little health check for your crawl budget.

How do you know if you’re hitting your crawl budget limit?

One common sign is a high number of pages in Google Search Console marked as “Discovered – currently not indexed” or “Crawled – currently not indexed.”

These signals suggest Google knows the pages exist, but hasn’t prioritized them for crawling or indexing yet.


Pro tip: If these messages show up often and your site has thousands of URLs, it’s a strong sign your crawl budget needs attention.


2. Over-time charts (aka Google’s crawl timeline)


Right at the top, you’ll see a visual chart of crawl activity over the last 90 days. This helps you spot patterns or sudden drops or spikes in crawling. And underneath the chart, you’ll see three key stats: total crawl requests, total download size, and average response time.

3. Host status

This part shows you how well your site is handling Google’s crawling, especially from a technical or server perspective.

If everything’s smooth, you’ll see something like: “Hosts are healthy.”

If not, you might get a warning like: “Hosts had problems in the past.”

Click into the box to find more details. You’ll see Google’s recent checks for robots.txt availability, DNS resolution, and server connectivity.

Why it matters: If Google can’t reach your site reliably, it’ll crawl less often. You’ll want to address any of these issues quickly.

4. Crawl requests breakdown

This is the really meaty bit. Google breaks down what it’s crawling, how, and why. You’ll see four handy categories: by response (200, 404, and so on), by file type (HTML, image, CSS), by purpose (discovery vs. refresh), and by Googlebot type (smartphone, desktop, and others).

Clicking into any item shows you specific pages that match that type, like which URLs returned a 404 or which ones were crawled by a specific bot.

Google Search Console gives you the basics straight from the source.

For enterprise or ecommerce websites with tens of thousands of URLs, consider running a crawl budget audit using tools like Semrush Log File Analyzer, Botify, or OnCrawl.

These help uncover how Googlebot behaves over time, where it’s spending crawl budget, where it’s dropping off, and which sections of your site may be undercrawled. You can quickly pinpoint opportunities for crawl budget optimization.
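
If you want a quick first look before reaching for those tools, raw server logs can show where Googlebot spends its time. Here’s a minimal sketch in Python, assuming a combined-format access log at a hypothetical path (user agents can be spoofed, so verify real Googlebot traffic via reverse DNS before acting on the numbers):

```python
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path; adjust for your server

# Capture the request path from log lines whose user agent mentions Googlebot
line_re = re.compile(r'"(?:GET|POST) (\S+) HTTP/[^"]*".*Googlebot')

sections = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_re.search(line)
        if match:
            path = match.group(1)
            # Bucket by first path segment: /blog/my-post -> /blog
            top = path.lstrip("/").split("/", 1)[0].split("?")[0]
            sections["/" + top] += 1

# Show the ten site sections Googlebot hits most often
for section, hits in sections.most_common(10):
    print(f"{hits:6d}  {section}")
```

If your top-converting sections barely show up in this list, that’s exactly the gap the pro tip below describes.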


Pro tip: Use log file data to compare crawl activity against revenue-driving URLs. If top-converting pages aren’t getting regular crawls, you’ve got an optimization opportunity.


Want to see what Google’s seeing?

You don’t need to master crawl budget today, but it does play a key role in how your content gets discovered and ranked. When search engines focus on the right pages, you’re more likely to show up where it counts.

Crawl budget helps Google prioritize your most valuable content. Make sure it’s working in your favor.

Start by checking what’s already visible. Use our SERP Checker to see which pages are ranking and which ones aren’t. This can help you spot missed opportunities and make your digital marketing efforts more effective.

Search Engine Land is owned by Semrush. We remain committed to providing high-quality coverage of marketing topics. Unless otherwise noted, this page’s content was written by either an employee or a paid contractor of Semrush Inc.