Help Google Find Your Pages
Understanding how Google crawls your site is crucial for getting your content seen. Last month, Google Search Central published a series of blog posts covering various aspects of crawling, including how Google handles CDNs, sub-resources, HTTP caching, and faceted navigation. This post highlights some key updates that can help improve your website's crawlability.
Before diving into the specifics, check Google's guide to managing your crawl budget. It explains how to keep your site crawlable in a way that is sustainable both for your servers and for Google.
Let's take a quick look at a few of the updates:
- CDNs: Google doesn't want to overload your servers with bot traffic (it can still happen, and it's worth knowing how to spot it when it does), so using a CDN can actually improve your crawl capacity: Google detects when your servers are fronted by one or more Content Delivery Networks (CDNs). This puts a strong emphasis on caching, which offloads your origin servers by reducing how many requests they have to serve directly. Cached resources are also delivered faster and cut energy consumption by minimizing requests to the origin.
- Regarding traffic overload (which CDNs often mitigate, for example, during malicious attacks), Google's documentation advises using a 503 Service Unavailable HTTP response to avoid losing indexed pages. This also applies to "Are you sure you're a human" interstitials. Returning a 503 signals to Googlebot that your content is temporarily unavailable, prompting it to return later.
- Another update concerns the Web Rendering Service (WRS). Google explains that the WRS downloads the resources referenced in a page's HTML; crucial assets such as JavaScript and CSS are cached by the WRS for up to 30 days using its own caching mechanism, which is independent of HTTP caching directives such as Cache-Control. For more technical details, refer to the source:
“The time to live of the WRS cache is unaffected by HTTP caching directives; instead WRS caches everything for up to 30 days, which helps preserve the site’s crawl budget for other crawl tasks.”
- HTTP caching has long been a key Google recommendation for improving website performance and crawl efficiency. While CDNs enhance caching, the core principle remains: serving cached resources reduces server load and makes Googlebot's crawling more effective. The reported 0.017% share of cached requests is concerningly low and signals a significant opportunity for improvement; even a slight increase can have a positive impact. Common issues include:
- Missing or incorrect caching headers: This is the most common problem. Ensure your server sends the correct headers for different resource types.
- Overly short cache durations: Very short cache times negate caching benefits.
- Query strings in URLs: URLs with query strings are often not cached by default.
- Finally, Google reminds us about managing faceted navigation, a recurring issue for e-commerce sites. The many filters and attribute queries can significantly inflate the crawlable URL inventory, wasting Google's time on resources that shouldn't be crawled. Google's faceted navigation guidance is a helpful resource for addressing this issue.
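As a quick illustration, here is a minimal sketch that uses Python's standard urllib.robotparser to sanity-check that hypothetical faceted URLs are blocked by robots.txt while normal product pages stay crawlable. The paths and rules are illustrative assumptions, not a configuration recommended by Google.

```python
# Sanity-check robots.txt rules against hypothetical faceted-navigation URLs.
# The paths and rules below are illustrative assumptions, not a recommended setup.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /products/filter/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    ("https://example.com/products/red-running-shoes", "product page"),
    ("https://example.com/products/filter/color=red", "faceted filter URL"),
    ("https://example.com/search/?q=shoes&sort=price", "internal search URL"),
]

for url, label in urls:
    verdict = "ALLOW" if parser.can_fetch("Googlebot", url) else "BLOCK"
    print(f"{verdict}  {label}: {url}")
```

Note that Python's robotparser only implements simple prefix matching, so the sketch sticks to path prefixes rather than wildcard patterns.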
How can using a CDN improve my site's crawl capacity?
When Google detects that your servers are fronted by one or more CDNs, it can crawl more without overloading them, which improves crawl capacity. The emphasis is on caching: cached responses are served by the CDN, reducing the load on your origin servers.
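If you want a rough idea of whether a given asset is actually being served from a CDN cache, a small script like the one below can help. This is a sketch assuming the requests library; the cache-related headers it looks for (X-Cache, CF-Cache-Status, Age, ...) vary by CDN provider, and the URL is hypothetical.

```python
# Rough check of whether a URL is served from a CDN cache.
# Header names vary by provider, so treat this as a sketch, not a definitive test.
import requests

CACHE_HEADERS = ["x-cache", "cf-cache-status", "age", "cache-control", "etag"]

def inspect_cache_headers(url: str) -> None:
    response = requests.get(url, timeout=10)
    print(f"{url} -> HTTP {response.status_code}")
    for name in CACHE_HEADERS:
        if name in response.headers:  # requests headers are case-insensitive
            print(f"  {name}: {response.headers[name]}")

inspect_cache_headers("https://example.com/static/app.css")  # hypothetical asset URL
```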
What HTTP response code should I use during a traffic overload or attack?
Google recommends using a 503 Service Unavailable HTTP response. This tells Googlebot that your content is temporarily unavailable, and it will return later.
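As a minimal sketch of that behavior (assuming a Flask app, which is just an example setup, not something Google's post prescribes), the snippet below returns a 503 with a Retry-After hint while the site is overloaded:

```python
# Minimal Flask sketch: serve a 503 while the site is overloaded, so crawlers
# such as Googlebot know the content is temporarily unavailable and can retry later.
from flask import Flask, Response

app = Flask(__name__)

# Hypothetical flag; in practice this could come from a health check or config.
SITE_OVERLOADED = True

@app.before_request
def maybe_return_503():
    if SITE_OVERLOADED:
        return Response(
            "Service temporarily unavailable, please retry later.",
            status=503,
            headers={"Retry-After": "3600"},  # hint to retry in an hour
        )

@app.route("/")
def index():
    return "Normal content"

if __name__ == "__main__":
    app.run()
```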
How long does the Google Web Rendering Service (WRS) cache resources?
The WRS caches JavaScript and CSS resources for up to 30 days, independent of HTTP caching directives, to preserve crawl budget.
What are some common issues with HTTP caching?
Common issues include missing or incorrect caching headers, overly short cache durations, and query strings in URLs that prevent caching.
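To make the first two issues concrete, here is a small sketch (again assuming Flask; the values are illustrative, not Google-prescribed) that sets a longer-lived Cache-Control header and a stable ETag on a static asset:

```python
# Sketch: attach reasonable caching headers to a static asset served by Flask.
# The duration is illustrative; pick values that match how often your assets change.
import hashlib
from flask import Flask, Response, request

app = Flask(__name__)

CSS_BODY = "body { margin: 0; }"

@app.route("/static/app.css")
def app_css():
    response = Response(CSS_BODY, mimetype="text/css")
    # Cache for a week instead of a few seconds; overly short durations negate caching benefits.
    response.headers["Cache-Control"] = "public, max-age=604800"
    # A stable ETag allows cheap revalidation instead of a full re-download.
    response.set_etag(hashlib.md5(CSS_BODY.encode()).hexdigest())
    return response.make_conditional(request)
```

The ETag combined with make_conditional lets returning clients and crawlers receive a lightweight 304 Not Modified response instead of re-downloading the asset.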
How should I manage faceted navigation on my e-commerce site to improve crawlability?
You can manage faceted navigation by blocking unnecessary faceted pages using robots.txt, using canonical tags, and implementing AJAX-based filtering.