Cloudfront empty responses
In our staging environment, we recently found an odd issue: it was returning blank pages - for some people, sometimes. Unable to reproduce the problem on their machine, developers started suspecting it’s an infrastructure issue.
All (animals) are equal, but some (animals) are more equal than others
I heard from a couple of colleagues that they were not able to reproduce the issue on their machine, but I tried it anyway. And voila!
This was really strange - the server was returning empty 200 OK response for any non-existent route. I was pretty sure it was returning proper 404s until recently, so something recent changed this behavior.
At least, I found a way to reproduce the problem, on my machine.
Feature flags and Middlewares
I remembered seeing a small change in our Ring middleware chain lately. It was related to our new REST API feature that we had just released. I jumped into the code and focused on the API middleware.
Eventually, I stumbled upon code that looked like this:
(defn wrap-api [handler] (-> handler wrap-logging wrap-auth wrap-common wrap-no-cache))
wrap-no-cache was the recent addition so I looked at it:
(defn- wrap-no-cache [handler] (fn [request] (-> (handler request) util/no-cache-response)))
Hmm… Could that cause a problem? I added some debug prints and it became obvious that this was the problematic piece.
A middleware is simply a higher-order function that wraps a request handler and adds some functionality.
A simple middleware adding a key to the request map may look like this:
(defn wrap-user [handler] (fn [request] (if-let [user-id (-> request :session :user-id)] (let [user (get-user-by-id user-id)] (handler (assoc request :user user))) (handler request))))
wrap-no-cache shown earlier is a middleware that calls
util/no-cache-response which inserts the
Cache-Control header to make sure that the client doesn’t cache the response.
The trouble is that
no-cache-response returns a response map even if
(handler-request) returns nil , thus turning a Not Found response to an OK response!
Feature flags or "why is it always me?"
We were using a feature flag (disabled by default) for the new REST API functionality. And the whole API middleware was only activated if the flag was on.
I had the flag enabled because I was testing REST API before, but my colleagues didn’t use it yet. Thus it was working for them just fine. We also had enabled the flag on staging a while ago to be able to test it continuously before releasing to production. So this feature flag was at least part of the equation.
But why it was sometimes working, for some people, on staging? That was still strange and demanded a closer look.
Mind your layers
Modern development might be tricky because there are typically many layers to be aware of, for instance:
your application code
3rd party services
the app (web) server
proxies and load balancers
These are nice and give us robustness and decouple concerns. Debugging, however, becomes more challenging.
We host all our infrastructure on AWS, and in front of it, there’s Cloudfront - a managed CDN service. It’s the usual suspect when it comes to caching problems.
My colleague noticed, that using
curl everything looked fine but in the browser, we were getting empty pages, intermittently.
We knew that the empty responses were due to a non-existent bundle files being requested and our app returning an OK response instead of Not Found. But why (some) clients keep requesting those non-existent files?
Here’s a summary of the first hypothesis trying to explain the behavior:
During a deployment of a new app version, the (two) web application nodes are updated one by one.
The client requests /index.html file from the server and it gets served by the node (B) having the latest version of the app
The client parses the HTML response, and requests the other resources, including JS and CSS bundles - but this time, they are served from the node (A) still having the older version of the app.
The bundle files that the new version of the app is using are not available on the node hosting the older version - it would normally yield 404 but due to the bug in our code, it returns 200 OK.
Such an empty response breaks the app and it manifests as a blank page in the browser.
Having said that, it was still unclear to me why would clients keep requesting the old bundles, even after the application deployment was finished. Even though
wrap-no-cache was broken, the responses it returned should be non-cacheable (
no-store value in the
Fortunately, a colleague of mine (thanks Simon!) had shed some light on it. He noticed we had special nginx configuration for JS & CSS files:
This finally explained it! While the app was returning
no-store cache-control header, the caching behavior was overwritten by nginx and the cache expiration time set to 1 year. So even if the user did a hard-refresh of their browser, Cloudfront still served the cached version of the non-existent/empty JS & CSS resources.
Fixing the problem was really simple - just use
some→ to make sure
nil is not turned into a map, unexpectedly:
(defn- wrap-no-cache [handler] (fn [request] (some-> (handler request) util/no-cache-response)))
Feature flags are nice but remember: they multiply the number of testing paths through the application and different configurations making it harder to spot and reproduce problems
There are often multiple contributing factors under the hood of a tricky problem, not a single "root cause".