Help Google Find Your Pages
Understanding how Google crawls your site is crucial for getting your content seen. Last month, Google Search Central published a series of blog posts covering various aspects of crawling, including how Google handles CDNs, sub-resources, HTTP caching, and faceted navigation. This post highlights some key updates that can help improve your website's crawlability.
Before diving into the specifics, check Google's guide to managing your crawl budget. This document explains how to ensure your site is crawled in a way that is sustainable both for your servers and for Google.
Let's take a quick look at a few of the updates:
CDNs: Google doesn't want to overload your servers with bot traffic (it can still happen, and we'll see below how to spot it when it does), so using a Content Delivery Network (CDN) can actually improve your crawl capacity: when Google detects that your origin is fronted by one or more CDNs, it allows itself more crawling headroom. This puts a strong emphasis on caching, which offloads your origin servers so they no longer have to answer every request. Cached resources are also delivered faster and reduce energy consumption by minimizing round trips to the origin.
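To get a feel for when bot traffic is actually straining your origin, you can look at your raw access logs. Below is a minimal sketch, assuming a standard Apache/Nginx combined log format and a hypothetical access.log path; note that user-agent strings can be spoofed, so for real verification Google recommends checking requests against its published Googlebot IP ranges or via reverse DNS.

```python
# Hypothetical sketch: spot Googlebot crawl spikes in a "combined"-format access log.
# The log path and regex are assumptions; adapt them to your server's configuration.
import re
from collections import Counter

LOG_PATH = "access.log"  # assumption: adjust to your server's log location

# combined format: ip - user [10/Dec/2024:13:55:36 +0000] "GET /page HTTP/1.1" 200 1234 "ref" "UA"
LINE_RE = re.compile(r'\[(?P<day>[^:]+):(?P<hour>\d{2}):\d{2}:\d{2}[^\]]*\].*"(?P<ua>[^"]*)"$')

hits_per_hour = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("ua"):
            hits_per_hour[f'{match.group("day")} {match.group("hour")}h'] += 1

# Print the busiest hours first; a sudden jump may mean crawling is stressing the origin.
for hour, count in hits_per_hour.most_common(10):
    print(f"{hour}: {count} Googlebot requests")
```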
Regarding traffic overload (which CDNs often mitigate, for example during malicious attacks), Google's documentation advises returning a 503 Service Unavailable HTTP response to avoid losing indexed pages. The same applies to "Are you sure you're a human" interstitials. Returning a 503 signals to Googlebot that your content is temporarily unavailable, prompting it to come back later.
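As a rough illustration (assuming a Flask app, which the Google posts do not prescribe), here is how a site might switch to 503 responses during maintenance or overload so Googlebot backs off and retries later instead of indexing an error page:

```python
# Minimal sketch (assumes Flask): during planned maintenance or overload, answer with
# 503 + Retry-After rather than serving an error page with a 200 status, so Googlebot
# treats the outage as temporary and comes back later.
from flask import Flask

app = Flask(__name__)

MAINTENANCE_MODE = False  # hypothetical flag; flip it from your ops tooling


@app.before_request
def maybe_return_503():
    if MAINTENANCE_MODE:
        # 503 = "temporarily unavailable"; Retry-After hints when to try again (seconds).
        return "Service temporarily unavailable", 503, {"Retry-After": "3600"}


@app.route("/")
def home():
    return "Hello"
```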
Another update concerns the Web Rendering Service (WRS). Google explains that the WRS downloads the resources referenced in a page's HTML. Crucial assets such as JavaScript and CSS are cached by the WRS for up to 30 days using its own caching mechanism, which is independent of HTTP caching directives: the usual Cache-Control headers and other HTTP caching mechanisms do not affect this 30-day WRS cache. For more technical details, refer to the source.
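Because this cache ignores Cache-Control, a commonly suggested workaround when you need Googlebot to pick up an updated script or stylesheet quickly is content fingerprinting: put a hash of the file's contents in its URL so any change yields a new, never-before-cached URL. A minimal build-step sketch follows; the paths and naming scheme are assumptions.

```python
# Sketch of content fingerprinting (assumed build step): embed a short hash of the
# file's contents in the filename, so any change produces a new URL that the WRS
# has never cached before.
import hashlib
import shutil
from pathlib import Path


def fingerprint(asset: Path, out_dir: Path) -> Path:
    """Copy e.g. static/app.js to dist/app.3f2a1b7c.js and return the new path."""
    digest = hashlib.sha256(asset.read_bytes()).hexdigest()[:8]
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{asset.stem}.{digest}{asset.suffix}"
    shutil.copyfile(asset, target)
    return target


# Usage (hypothetical paths): reference the returned filename in your HTML templates.
# fingerprint(Path("static/app.js"), Path("dist"))
```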
HTTP caching has long been a key Google recommendation for improving website performance and crawl efficiency. While CDNs enhance caching, the core principle remains: serving cached resources reduces server load and boosts Googlebot's crawling effectiveness. Google reports that only about 0.017% of Googlebot's requests are currently served from cache, a strikingly low share that signals a significant opportunity for improvement. Even a slight increase can have a positive impact. Common issues include:
Missing or incorrect caching headers: This is the most common problem. Ensure your server sends the correct headers (Cache-Control plus an ETag or Last-Modified validator) for each resource type; see the sketch after this list.
Overly short cache durations: Very short cache times negate caching benefits.
Query strings in URLs: URLs with query strings are often not cached by default.
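To make the first point concrete, here is a small sketch (assuming Flask; the max-age value is an arbitrary example) of serving a static sub-resource with Cache-Control and an ETag, so Googlebot can revalidate it cheaply with If-None-Match and receive 304 Not Modified when nothing changed:

```python
# Rough sketch (assumes Flask; durations are arbitrary examples): attach Cache-Control
# and an ETag so repeat fetches can be answered with 304 Not Modified.
from flask import Flask, request, make_response

app = Flask(__name__)


@app.route("/styles.css")
def styles():
    css = "body { font-family: sans-serif; }"  # hypothetical static content
    response = make_response(css)
    response.mimetype = "text/css"
    # Long-lived caching for a static sub-resource; pick a value that fits your release cycle.
    response.cache_control.public = True
    response.cache_control.max_age = 86400  # one day, in seconds
    response.add_etag()  # lets Googlebot revalidate with If-None-Match
    return response.make_conditional(request)  # answers 304 when the ETag still matches
```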
Finally, Google reminds us about managing faceted navigation, a recurring issue for e-commerce sites. The countless filter and attribute combinations can dramatically inflate the crawl inventory, wasting Google's time on URLs that shouldn't be crawled. A helpful resource for addressing this issue can be found here.
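One common mitigation, among those Google discusses, is to keep facet combinations out of the indexable URL space, for example by pointing filtered URLs at a canonical version of the listing page. A hypothetical sketch follows; which query parameters count as facets is entirely an assumption.

```python
# Hypothetical sketch: strip facet parameters (names are made up) from a product-listing
# URL to build the URL you would reference in <link rel="canonical">. Robots.txt disallow
# rules for such parameters are another option covered in Google's guidance.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

FACET_PARAMS = {"color", "size", "brand", "sort", "price_min", "price_max"}  # assumption


def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FACET_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://shop.example/dresses?color=red&size=m&page=2"))
# -> https://shop.example/dresses?page=2
```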