Improving a CDN's cache hit ratio

A platform can take a long time to generate a webpage. Once the content is cached, TTFB and users will benefit, but the first user accessing new contents still pays a penalty. How do you prevent this and improve the cache hit ratio?

Maybe you've read about properly setting up Cloudflare caching for a faster TTFB before. But let's dive in a bit deeper.

Cache hit, miss and cache-hit ratio

When a CDN holds a cached response, it will return the contents from the cache, reducing server response times. This is called a "cache hit". When nothing is found, it's called a "cache miss" and the response time and TTFB will be higher.

Testing server side caching

I often test the difference by appending a dummy query string to the requested URL and checking both the response headers and timing information. Most CDNs will reveal caching information in the response headers. For example, Cloudflare might show you something like cf-cache-status: DYNAMIC|MISS|HIT
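This manual check can be sketched in a few lines of JavaScript. This is a minimal sketch, not a definitive tool: the example.com URL is a placeholder, and the cf-cache-status header is Cloudflare-specific (other CDNs use headers such as x-cache).

```javascript
// Append a unique query parameter so the CDN treats this as a new URL,
// forcing a cache miss you can compare against a normal (possibly cached) hit.
function addCacheBuster(url) {
  const u = new URL(url);
  u.searchParams.set('nocache', Date.now().toString());
  return u.toString();
}

// Read the cache status from a Headers object (fetch) or a plain object.
function cacheStatus(headers) {
  const get = typeof headers.get === 'function'
    ? (name) => headers.get(name)
    : (name) => headers[name];
  return get('cf-cache-status') || 'unknown';
}

// Usage sketch (Node 18+ with global fetch; URL is a placeholder):
// const normal = await fetch('https://example.com/page');
// console.log(cacheStatus(normal.headers)); // HIT, MISS or DYNAMIC
// const busted = await fetch(addCacheBuster('https://example.com/page'));
// console.log(cacheStatus(busted.headers)); // almost certainly a MISS
```

Comparing the timing of both requests then tells you roughly how much the cache is saving your real visitors.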

Calculating the cache hit ratio

By dividing the number of cache hits by the total number of requests (the sum of cache hits and misses), you get the relative amount of cache hits. This is called the cache hit ratio.
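The calculation itself is a one-liner:

```javascript
// Cache hit ratio = hits / (hits + misses).
function cacheHitRatio(hits, misses) {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// 900 hits and 100 misses means a ratio of 0.9, or 90%.
console.log(cacheHitRatio(900, 100)); // 0.9
```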

cache hit ratio as explained by Cloudflare.

Don't bother doing this yourself though. CDNs will often come with a dashboard or chart displaying this information.

But while a server side caching strategy will already help reduce TTFB, there will always be users who run into a cache miss and end up paying the penalty, in most cases because a previous copy needed to be invalidated. You still want those visitors to have an optimal experience as well, to reduce bounce rate and increase the likelihood of yet another conversion.

Luckily, there are ways to improve the cache hit ratio:

Stale while revalidate caching strategy

One of them is the stale-while-revalidate caching strategy, often abbreviated as SWR. The goal is to always return a cached version while the server checks for new contents in the background. This way, a user isn't bothered by invalidated contents.
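The behaviour boils down to the Cache-Control extension from RFC 5861, for example `Cache-Control: max-age=600, stale-while-revalidate=300`. The sketch below models the decision a cache makes based on the age of a stored response; it's an illustration of the semantics, not any particular CDN's implementation:

```javascript
// Given the age of a cached response (in seconds), decide how a
// stale-while-revalidate cache should behave.
function swrDecision(ageSeconds, maxAge, swrWindow) {
  if (ageSeconds <= maxAge) {
    return 'fresh';            // serve from cache, nothing else to do
  }
  if (ageSeconds <= maxAge + swrWindow) {
    return 'stale-revalidate'; // serve the stale copy, refresh in background
  }
  return 'fetch';              // too stale: the user waits for the origin
}

// With max-age=600 and stale-while-revalidate=300:
console.log(swrDecision(500, 600, 300));  // fresh
console.log(swrDecision(700, 600, 300));  // stale-revalidate
console.log(swrDecision(1000, 600, 300)); // fetch
```

Only requests falling in the last category pay the full origin penalty; everyone in the middle window still gets an instant (if slightly stale) response.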

Content Delivery Networks

Fastly describes the need for stale-while-revalidate as follows:

Certain pieces of content can take a long time to generate. Once the content is cached it will be served quickly, but the first user to try and access it will pay a penalty.

And while Cloudflare might be the first CDN that comes to mind, and its users have been asking for an SWR feature, Cloudflare hasn't seen a chance to support the stale-while-revalidate caching strategy just yet because of internal technical debt.

SWR challenges

But even a service worker's stale-while-revalidate caching strategy could lead to issues and race conditions, as references to static resources (especially JS and CSS files) will often change as well. As a result, this caching strategy might only help outliers, such as TTFB experiences at the 95th percentile.

Reducing CDN locations

We all know that the shorter the distance between a user and the server handling the request, the higher the chance of an improved TTFB. This becomes especially important when dealing with an international audience.

And maybe you've actually picked a CDN based on its geographical coverage: a more international audience could mean choosing a CDN with more Point of Presence (PoP) locations, to increase the likelihood of a server being close to each visitor.

More CDN locations can mean a lower cache hit ratio

However, more locations also means more servers that each need a cache of their own to benefit from server side caching. If your requests are now divided across, say, 5 servers, a visitor has a roughly 5 times higher chance of not receiving a cached copy: each server works independently and will only fetch and cache a response when a request comes in. And the more often you invalidate the cache, for whatever reason, the more often each server has to cache the page yet again.

It could then actually lead to fewer visitors benefitting from a cached version, and thus a lower cache hit ratio. That said, reducing the number of servers won't magically improve your server response time and TTFB, as it's still a combination of factors. My advice is to monitor TTFB across real users and experiment with the number of locations.

Dynamic rendering

Frequently changing contents are responsible for frequently invalidated caches, making them your number one cache-hit ratio bottleneck. Increasing the cache hit ratio will improve TTFB for more users, instead of only the ones that were lucky enough to run into a cache hit.

Critical versus less-critical

How to fight this? Get rid of frequently changing contents. Obviously you still want to show up-to-date prices and stock information, but such information isn't always the most important part of a webpage. On a product detail page, the title and image are more important. You still want the product price or rating to show up fast, but they just aren't as critical to perceived performance.

Move rendering from server to client

As part of a dynamic rendering strategy, you could choose to render those parts using JavaScript. Using API requests, you could fetch a product's detailed information and render it client side. Be sure to reserve the space needed to display dynamically rendered contents, to prevent running into layout shift issues.
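A minimal sketch of this split, assuming a hypothetical /api/products endpoint and a price field (both are illustrative, not a real API):

```javascript
// Server side: emit a placeholder with reserved dimensions, so filling it
// in later doesn't shift the layout around it.
function pricePlaceholder(productId) {
  return `<span id="price-${productId}" ` +
    `style="display:inline-block;min-width:5em;min-height:1.2em"></span>`;
}

// Client side: fetch the frequently changing data and fill the placeholder.
// This part runs in the browser; endpoint and field names are assumptions.
async function hydratePrice(productId) {
  const res = await fetch(`/api/products/${productId}`);
  const { price } = await res.json();
  document.getElementById(`price-${productId}`).textContent = price;
}
```

The cached HTML now only contains the stable placeholder, so a price change no longer invalidates the whole page.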

One example is the overall product rating. Or all individual reviews of a product. You could use a cronjob to fetch and store that information so it doesn't have to be gathered each time the page itself is requested, or just choose to fetch product review information entirely client side.

By splitting up rendering tasks into server side and client side rendering:

  • generating the boilerplate HTML doesn't depend on frequently changing data anymore;
  • a cached version of a page can then be re-used more often;
  • which will increase the cache-hit ratio and reduce server response time for a wider range of visitors.

Traffic from ads and campaigns

There are other ways to improve the cache hit ratio. One that I discussed before is looking out for common query strings that don't actually change any contents, but might be considered a new request because of the parameters. When already using Varnish, Cloudflare or other server side caching solutions, be sure to ignore campaign- and ads-related query string parameters.
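As a sketch, normalizing the request URL into a cache key could look like this. The parameter list below covers common campaign parameters (utm_*, gclid, fbclid); it's an assumption to extend for your own setup, not an exhaustive list:

```javascript
// Query string parameters that don't change the response and should be
// dropped before using the URL as a cache key.
const IGNORED_PARAMS = ['gclid', 'fbclid', 'msclkid'];
const IGNORED_PREFIXES = ['utm_'];

function cacheKey(url) {
  const u = new URL(url);
  // Copy the keys first, as deleting while iterating would skip entries.
  for (const name of [...u.searchParams.keys()]) {
    if (IGNORED_PARAMS.includes(name) ||
        IGNORED_PREFIXES.some((prefix) => name.startsWith(prefix))) {
      u.searchParams.delete(name);
    }
  }
  return u.toString();
}

console.log(cacheKey('https://example.com/p?utm_source=ads&id=42'));
// → https://example.com/p?id=42
```

With this normalization, a visitor arriving from an ad campaign hits the same cached copy as everyone else, instead of triggering a fresh origin request per campaign variant.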