Crawl budget math
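Google frames crawl budget in terms of two factors, often summarized as crawl budget = crawl capacity × crawl demand. The multiplication is shorthand rather than a literal formula: crawl capacity is how much fetching your server can handle without slowing down, and crawl demand is how much Google wants to fetch. Both change over time, and you can influence both.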
For a 1,000-page site, Google's typical crawl rate is high enough that crawl budget is rarely the bottleneck. For a 100,000-page site, you might see Googlebot fetch around 30,000 URLs per day at peak, meaning roughly a third of your site could be refreshed daily. For a 10-million-page site, that fraction shrinks to a single-digit percentage.
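The arithmetic behind those fractions can be sketched in a few lines. This is a rough illustration only: the function names are made up for this example, and it assumes the best case in which Googlebot never fetches the same URL twice in a day.

```python
def daily_refresh_fraction(total_urls: int, fetches_per_day: int) -> float:
    """Share of the site Googlebot could re-fetch in one day, capped at 1.0."""
    if total_urls <= 0:
        raise ValueError("total_urls must be positive")
    return min(1.0, fetches_per_day / total_urls)


def days_for_full_recrawl(total_urls: int, fetches_per_day: int) -> float:
    """Best-case days to touch every URL once (assumes no repeat fetches)."""
    if fetches_per_day <= 0:
        raise ValueError("fetches_per_day must be positive")
    return total_urls / fetches_per_day


# The 100,000-page example from the text: 30,000 fetches per day at peak.
print(daily_refresh_fraction(100_000, 30_000))            # 0.3
print(round(days_for_full_recrawl(100_000, 30_000), 1))   # 3.3
```

In practice Googlebot revisits popular URLs more often than obscure ones, so the real time to cover every page is longer than this best case.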
When crawl budget actually matters
- Sites with more than 10,000 URLs that update frequently (ecommerce, news, classifieds).
- Sites with high turnover (job boards, real estate, events).
- Sites with auto-generated content (faceted navigation, search result pages, infinite scroll).
- Sites that recently changed structure and have many redirect chains.
- Sites on slow shared hosting that throttle Googlebot.
Diagnose: GSC Crawl Stats report
Open Search Console → Settings → Crawl Stats. You'll see three charts: total requests, total download size, and average response time. Healthy patterns:
- Requests trending up or flat — Google is crawling more or holding steady.
- Average response time under 200ms — fast enough that Googlebot will keep increasing crawl rate.
- Few 5xx errors — server isn't getting overwhelmed.
- Few 4xx errors on indexable URLs — no wasted crawl budget on dead URLs.
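If you copy the report's daily numbers into a script, the healthy patterns above can be checked mechanically. The sketch below is one possible way to do it: the row keys (`requests`, `avg_response_ms`, `status_5xx`, `status_4xx`) and the 10% tolerance for "flat" are assumptions made for illustration, not fields or thresholds defined by Search Console.

```python
from statistics import mean


def crawl_health(daily_rows):
    """Summarize daily crawl rows (oldest first) against the healthy patterns.

    Each row is a dict with hypothetical keys: 'requests',
    'avg_response_ms', 'status_5xx', 'status_4xx'.
    """
    half = len(daily_rows) // 2
    if half == 0:
        raise ValueError("need at least two days of data")
    total = sum(r["requests"] for r in daily_rows)
    if total == 0:
        raise ValueError("no requests recorded")

    # Compare the first half of the period with the second half.
    early = mean(r["requests"] for r in daily_rows[:half])
    late = mean(r["requests"] for r in daily_rows[half:])

    return {
        # Assumed tolerance: a drop of up to 10% still counts as "flat".
        "requests_flat_or_up": late >= early * 0.9,
        "response_under_200ms": mean(r["avg_response_ms"] for r in daily_rows) < 200,
        "share_5xx": sum(r["status_5xx"] for r in daily_rows) / total,
        "share_4xx": sum(r["status_4xx"] for r in daily_rows) / total,
    }


rows = [
    {"requests": 1000, "avg_response_ms": 180, "status_5xx": 2, "status_4xx": 30},
    {"requests": 1100, "avg_response_ms": 170, "status_5xx": 0, "status_4xx": 25},
    {"requests": 1200, "avg_response_ms": 190, "status_5xx": 1, "status_4xx": 20},
    {"requests": 1300, "avg_response_ms": 160, "status_5xx": 0, "status_4xx": 25},
]
print(crawl_health(rows))
```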
The four crawl-waste patterns
1. Faceted navigation
/products?color=red&size=large&sort=price-asc generates thousands of URL combinations, most of which are near-duplicates. Each one consumes crawl budget and dilutes ranking signals.
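Fix: point a canonical tag from each facet URL to the unfaceted version to consolidate ranking signals. For high-value facets you want indexed (e.g., /products/red-shoes), create static URLs with proper internal linking. Block low-value facet combinations in robots.txt so they are not crawled at all. Pick one treatment per URL pattern: Googlebot cannot read a canonical tag on a page that robots.txt stops it from fetching, and a canonical alone does not prevent the facet URL from being crawled.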
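To see how quickly facets multiply, here is a small counting sketch. The facet names and value counts are invented for illustration, and the count assumes each facet is either absent or set to exactly one value, with parameter order fixed.

```python
from math import prod


def facet_url_count(facets: dict[str, int]) -> int:
    """Distinct filtered URLs when each facet is absent or has one value.

    Subtracts 1 to exclude the unfiltered page itself.
    """
    return prod(n + 1 for n in facets.values()) - 1


# Hypothetical store: 12 colors, 8 sizes, 4 sort orders.
print(facet_url_count({"color": 12, "size": 8, "sort": 4}))               # 584
# Adding one more facet with 40 brands multiplies the total.
print(facet_url_count({"color": 12, "size": 8, "sort": 4, "brand": 40}))  # 23984
```

If the site also allows parameters in any order or multi-select filters, the real number of crawlable URLs is higher still.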
2. URL parameters
Tracking parameters (?utm_source, ?ref, ?sessionId) create unlimited URL variants of the same page. Handle them with canonicals and avoid emitting tracked URLs in your own internal links.
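One way to keep tracked URLs out of your own internal links is to normalize them before rendering. The sketch below uses only the Python standard library. The parameter list is an assumption based on the examples in this section, so extend it to match your own tracking setup.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; names are compared case-insensitively.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"ref", "sessionid"}


def strip_tracking(url: str) -> str:
    """Return the URL with tracking parameters removed, other parts untouched."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_tracking("https://example.com/shoes?utm_source=news&color=red&sessionId=abc"))
# https://example.com/shoes?color=red
```

The same function can produce the value for the canonical tag, so the tag and the internal links always agree.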
3. Internal duplicates
Common sources are print versions, accidentally exposed dev/staging URLs, and paginated archives where each page is too thin to stand alone. Canonicalize to the primary version, disallow the rest in robots.txt, or consolidate the pages.
4. Long redirect chains
Every hop in a redirect chain is a separate fetch. /a → /b → /c → /d costs four fetches (three redirects plus the final page) to reach one destination. Flatten redirects so that every old URL points directly at the final destination.
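Flattening can be done offline if you can export your redirect rules as source/target pairs. The following is a minimal sketch under that assumption: the dictionary format is hypothetical, and you would still need to write the flattened rules back into your server or CMS configuration.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Map every redirecting URL straight to its final destination.

    Raises ValueError if the rules contain a redirect loop.
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until the target no longer redirects anywhere.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(flatten_redirects(chain))
# {'/a': '/d', '/b': '/d', '/c': '/d'}
```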
Server performance optimization
Crawl capacity scales with server speed. Real improvements you can make this week:
- Add a CDN for static assets — cuts request load on your origin.
- Cache HTML responses for anonymous users with a 5-minute TTL.
- Audit slow database queries (anything over 100ms server-time).
- Move to HTTP/2 if you haven't already. Googlebot crawls over HTTP/1.1 and HTTP/2, so HTTP/3 mainly benefits visitors' browsers.
- Target a time to first byte (TTFB) under 200ms site-wide.
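The 5-minute HTML cache from the list above usually comes down to one response header. Here is a minimal sketch of the decision, assuming your application marks logged-in users with a cookie. The cookie name `session` and the function name are placeholders.

```python
def cache_control_for(request_cookies: dict[str, str], session_cookie: str = "session") -> str:
    """Pick a Cache-Control value: cacheable for anonymous visitors only."""
    if session_cookie in request_cookies:
        # Logged-in pages may contain personal data, so never share them.
        return "private, no-store"
    # 300 seconds is the 5-minute TTL mentioned above.
    return "public, max-age=300"


print(cache_control_for({}))                    # public, max-age=300
print(cache_control_for({"session": "abc123"}))  # private, no-store
```

Googlebot typically crawls without cookies, so its requests fall into the anonymous branch and can be served from the CDN or reverse-proxy cache.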
Sitemap segmentation for large sites
A single sitemap file can hold at most 50,000 URLs, and one mega-sitemap at that size is hard for Google to prioritize. Split it into segmented sitemaps referenced from a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://yoursite.com/sitemap-products.xml</loc>
<lastmod>2026-05-12</lastmod>
</sitemap>
<sitemap>
<loc>https://yoursite.com/sitemap-categories.xml</loc>
<lastmod>2026-05-12</lastmod>
</sitemap>
<sitemap>
<loc>https://yoursite.com/sitemap-blog.xml</loc>
<lastmod>2026-05-12</lastmod>
</sitemap>
</sitemapindex>
Each segment can have its own lastmod, which Google can use to prioritize re-crawling as long as the dates are kept accurate.
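On a large site the index is usually generated rather than written by hand. A minimal sketch with the standard library follows. The function name and the input shape (segment URL mapped to its last modification date) are assumptions, and the dates should come from real content changes rather than the build time.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(segments: dict[str, date]) -> str:
    """Render a sitemap index from {segment sitemap URL: last modified date}."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in segments.items():
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod.isoformat()
    ET.indent(root)  # pretty-printing; requires Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap_index({
    "https://yoursite.com/sitemap-products.xml": date(2026, 5, 12),
    "https://yoursite.com/sitemap-categories.xml": date(2026, 5, 12),
    "https://yoursite.com/sitemap-blog.xml": date(2026, 5, 12),
}))
```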
Force-index priority URLs via Indexing API
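The strategic move on large sites: stop trying to make Google crawl your entire site faster. Instead, pick your highest-value URLs and push them through the Indexing API, which asks Google to schedule a crawl of those specific URLs. Google officially documents its Indexing API for pages with JobPosting or BroadcastEvent structured data, so results on other page types are not guaranteed. Examples: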
- New product launches.
- Trending content (news articles, breaking topics).
- Updated cornerstone pages where you've added significant new content.
- Pages with new backlinks (force re-crawl so Google sees the inbound signal).
This is where Instant URL Indexer's bulk submit shines for enterprise sites: 500 URLs per request, integrate it into your CMS publish hook, and your priority pages no longer wait their turn in the regular crawl queue.
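A publish hook that respects the 500-URL batch size might look like the sketch below. It deliberately does not call any real endpoint: `submit` is a hypothetical callable that you would implement around whichever bulk submission service you use, returning True when a batch is accepted.

```python
from typing import Callable, Iterable, Iterator


def batches(urls: Iterable[str], size: int = 500) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: list[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_all(
    urls: Iterable[str],
    submit: Callable[[list[str]], bool],  # hypothetical wrapper around your bulk endpoint
    size: int = 500,
) -> dict[str, int]:
    """Submit URLs batch by batch and count how many were accepted or failed."""
    result = {"ok": 0, "failed": 0}
    for batch in batches(urls, size):
        result["ok" if submit(batch) else "failed"] += len(batch)
    return result


# Stand-in submitter for demonstration: accepts every batch.
urls = [f"https://yoursite.com/p/{i}" for i in range(1200)]
print(submit_all(urls, submit=lambda batch: True))
# {'ok': 1200, 'failed': 0}
```

Tracking the failed count per run gives you the submission success rate without extra tooling.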
Monitoring crawl budget over time
Set up weekly monitoring on:
- Crawl Stats average response time (alert if it crosses 500ms).
- 5xx error count (alert on any spike).
- Pages indexed vs total URLs (the target ratio depends on the site, but having about 30% of URLs unindexed is normal; 70% or more unindexed is a problem).
- Indexing API submission success rate (failed submissions point to systemic issues).
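The weekly checks above can be wired into a small alert function. This is only a sketch: the inputs are numbers you would read from Crawl Stats and the page indexing report yourself, and `spike_factor` (how far above the usual 5xx baseline counts as a spike) is an assumed threshold you should tune.

```python
def index_gap(indexed_pages: int, total_urls: int) -> float:
    """Fraction of known URLs that are not indexed."""
    if total_urls <= 0:
        raise ValueError("total_urls must be positive")
    return 1 - indexed_pages / total_urls


def weekly_alerts(
    avg_response_ms: float,
    errors_5xx: int,
    baseline_5xx: int,
    indexed_pages: int,
    total_urls: int,
    spike_factor: float = 2.0,  # assumed: alert at more than double the usual 5xx count
) -> list[str]:
    """Return human-readable alerts for the thresholds described above."""
    alerts = []
    if avg_response_ms > 500:
        alerts.append(f"average response time {avg_response_ms:.0f}ms exceeds 500ms")
    if errors_5xx > baseline_5xx * spike_factor:
        alerts.append(f"5xx spike: {errors_5xx} errors vs baseline {baseline_5xx}")
    gap = index_gap(indexed_pages, total_urls)
    if gap >= 0.70:
        alerts.append(f"{gap:.0%} of URLs are not indexed")
    return alerts


print(weekly_alerts(620, 40, 10, 250_000, 1_000_000))
# ['average response time 620ms exceeds 500ms',
#  '5xx spike: 40 errors vs baseline 10',
#  '75% of URLs are not indexed']
```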