When retries break production!
Introduction
Ever seen a seemingly harmless, non-critical service bring an entire production system to its knees?
It sounds like a bad joke, but it happened to me a few years back, and the irony was palpable. We often focus on the big, "important" services, but sometimes it's the unexpected ripple effects from smaller components that cause the real headaches.
In this writeup, I'll explore a sneaky issue that grew from something very "simple" into something with a great impact. More importantly, I'll share some hard-won lessons and improvements.
The System
Imagine a fairly standard e-commerce platform. You've got your user-facing API gateway, which routes requests to a constellation of backend microservices:
- Product Catalog Service
- User Accounts Service
- Order Processing Service
- Recommendations Service
These services might communicate with each other and rely on shared resources like SQL or NoSQL databases/clusters.
The Recommendations Service
When a user hits the homepage or a product page, the Recommendations service is called to display a curated list of items, which should increase the likelihood of the user buying again.
The same happens at several stages: after a user pays, after they issue a refund, when they look up a specific item, and so on.
This service, for instance, might pull data from the product catalog and user history to generate recommendations on the fly when needed.
The number of recommendations to return for the user can vary depending on the stage of their journey and the platform they're using.
If you're on a tiny mobile screen, the UI won't fit more than two or three products.
But when you're on the checkout page on a laptop, you'll typically see more items grouped into categories (10? 15?): some they've already bought, others from adjacent categories, and so on.
However, the recommender shouldn't go too far back in time for products to recommend; ideally, we want to select from the last X items, depending on the number of recommendations desired and the current screen.
To simplify, we can assume that the Recommender Service executes the following read query on a PostgreSQL database to generate recommendations:
```sql
SELECT *
FROM shop.products
WHERE UserId = @UserId AND LastBoughtOnEpoch > @FilterEpoch
```
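To make the shape of that call concrete, here's a minimal sketch of how the service might compute @FilterEpoch and run the query. The lookback windows, function name, and parameter style are assumptions for illustration, not the actual service code; the connection is assumed to be an open psycopg2 connection.

```python
import time

# Hypothetical lookback windows per client surface.
LOOKBACK_DAYS = {"mobile": 30, "desktop": 90}

def fetch_recent_purchases(conn, user_id: str, platform: str):
    """Fetch products the user bought within the lookback window for their platform.

    `conn` is assumed to be an open psycopg2 connection to the PostgreSQL database.
    """
    lookback = LOOKBACK_DAYS.get(platform, 30)
    filter_epoch = int(time.time()) - lookback * 24 * 3600

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT *
            FROM shop.products
            WHERE UserId = %(user_id)s AND LastBoughtOnEpoch > %(filter_epoch)s
            """,
            {"user_id": user_id, "filter_epoch": filter_epoch},
        )
        return cur.fetchall()
```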
Like any distributed system, this platform is designed to handle a certain capacity: a number of concurrent users, processing orders, and updating inventories.
Each service has its own performance characteristics and dependencies. It's a complex dance, and usually, it works seamlessly.
But what happens to the group when one dancer stumbles?
What happened?
In a typical shopping platform, we can clearly state that the 'Recommendations' service isn't essential for checkout.
Your highest priority for the system is to ensure customers can buy what they're looking for or already have in their basket. Anything beyond that is extra: features and ideas to maximize conversion.
The reputational aspect is crucial here. You're far less likely to hear people complaining on Twitter about bad, few, or missing recommendations than you are to hear them screaming (and potentially tanking your stock price) if they can't pay for their items.
If anything, comically bad or irrelevant recommendations might even give you some ironic publicity!
We've reached the trigger for the issue: on a random Tuesday, the Recommendations service started having intermittent latency issues due to an edge case.
The Edge Case
Due to some unusual sizing dimensions or browser quirks, a frontend started sending invalid epoch integers to the Recommender Service.
The service didn't quite handle the type validation for the epoch correctly, transforming the query into something like this:
```sql
SELECT *
FROM shop.products
WHERE UserId = 'user123' AND LastBoughtOnEpoch > 0
```
Notice the 0 there? That's often the default for an integer type when a proper LastBoughtOnEpoch filter value is missing.
This made each request for that user scan their entire purchase history.
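A minimal sketch of the kind of guard that was missing (hypothetical names and bounds, not the actual service code): validate the client-supplied epoch and clamp anything missing, non-numeric, or implausibly old to a bounded default instead of letting 0 through.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical upper bound on how far back recommendations should ever look.
MAX_LOOKBACK = timedelta(days=365)

def parse_filter_epoch(raw_value) -> int:
    """Validate the client-supplied epoch; never let a missing value default to 0."""
    floor_epoch = int((datetime.now(timezone.utc) - MAX_LOOKBACK).timestamp())
    try:
        epoch = int(raw_value)
    except (TypeError, ValueError):
        return floor_epoch  # bad or missing input: fall back to the bounded default
    # Clamp 0, negatives, and far-past values so the query stays within the window.
    return max(epoch, floor_epoch)
```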
For a handful of prolific shoppers, this full scan ground the Recommender service to a crawl, taking seconds or even minutes to respond, all while hammering the database by forcing it to scan massive amounts of data, potentially across multiple partitions, bypassing carefully crafted indexes.
It was a rogue query, and it only affected 1% of requests, but it wasn't the true catalyst for the widespread outage. A delay in recommendations, though not ideal, wouldn't normally crash the entire e-commerce platform!
The next culprit, the one that escalated this from a minor snag to a full-blown catastrophe, was the barrage of retries from downstream consumers.
Different services, like the Dashboard UI, had their own diverse retry strategies, which is natural, as these components are often owned by different teams. This divergence, however, proved to be a critical vulnerability.
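For illustration only, here's roughly what such a naive consumer-side retry looks like (hypothetical endpoint and client code, not any team's actual implementation): it retries immediately on any 5xx, with no backoff and no jitter, so each failed call turns into three hits on a service that is already struggling.

```python
import requests

def get_recommendations(user_id: str, retries: int = 2):
    """Anti-pattern: immediate fixed-count retries on any 5xx, no backoff, no jitter."""
    url = f"https://recommender.internal/users/{user_id}/recommendations"  # hypothetical
    resp = None
    for _ in range(retries + 1):
        resp = requests.get(url, timeout=5)
        if resp.status_code < 500:
            return resp.json()
        # 5xx: hammer the struggling service again, right away.
    resp.raise_for_status()
```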
In Normal Cases
If these retry attempts occurred during typical operations, it would likely be a non-event. A few extra hits on the database wouldn't cause any harm.
The database was designed to handle a certain load, and well-tuned expected queries are usually quick to resolve, retries included.
BUT
This underlying vulnerability was exposed only when the database/API were under pressure. The Recommendations Service returned a 500 error on those query timeouts. With a saturated DB, all new connections and even normal queries started to time out, causing complete downtime for that service.
Consumers, ignorant of the Recommender's specific woes, aggressively retried on the 500 error codes they received. This unexpectedly tripled the traffic to the already struggling service, ultimately taking down the entire database and service.
The Recommender Service never had a chance to recover, because this edge case was heavily amplified by retries.
Magic Libraries
Consumers were using bad retry strategies; fair enough, that happens. But the cherry on top was that some consumers were referencing an "SDK" or library to handle retry strategies and execution across the codebase.
Turns out, this library had a critical flaw: retries were always forked into concurrent threads without any limit.
This led to resource exhaustion in other services, primarily due to thread starvation, as they waited indefinitely for the struggling Recommender service.
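The flaw looked roughly like this (a reconstruction for illustration, not the library's real code): every retry is handed to a brand-new thread, with no pool, no semaphore, and no ceiling on how many are in flight, while the calling thread blocks waiting for a result.

```python
import threading

import requests

def call_with_retries(url: str, attempts: int = 3):
    """Reconstruction of the flaw: each retry forks a fresh, unbounded thread."""
    result = {}

    def attempt(remaining: int):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            result["response"] = resp
        except requests.RequestException:
            if remaining > 0:
                # Fork yet another thread; nothing caps how many exist at once.
                t = threading.Thread(target=attempt, args=(remaining - 1,))
                t.start()
                t.join()  # ...and the caller's thread sits blocked the whole time

    attempt(attempts)
    return result.get("response")
```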
We went from a single missing type validation, impacting only a handful of requests, to a full-blown outage of multiple "isolated" services!
Should you retry 500 errors?
Should your client automatically retry 500 errors? Generally, retrying a 500 can be okay for truly transient server-side issues.
But what if it's a persistent problem, like a bug in the API's code or a deeper infrastructural failure? Blindly retrying won't help and can just exacerbate the issue, leading to those dreaded retry storms.
Proper HTTP response codes are key here: a server returning 503 instead of 500 is effectively signaling that the request is retriable. But 500 is too generic for the caller to assume anything without proper context and a case-by-case study.
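As a sketch (an assumed policy, not a universal rule), the caller can at least make retriability explicit per status code and HTTP method rather than blanket-retrying every 500:

```python
# Status codes that usually advertise a transient condition (honor Retry-After when present).
RETRIABLE_STATUS_CODES = {429, 502, 503, 504}

# Methods that are idempotent by definition in HTTP semantics.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def should_retry(status_code: int, method: str) -> bool:
    """Retry only when the status signals a transient failure and the method is safe to repeat."""
    return status_code in RETRIABLE_STATUS_CODES and method.upper() in IDEMPOTENT_METHODS
```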
Another aspect is idempotency. What if the request is non-idempotent, like a 500 error after a 'submit order' or 'create user'? If the operation completed on the server before the 500 was sent (the DB write succeeded, but the response connection timed out), a retry will cause the operation to be performed a second time.
You could end up with duplicate orders, double charges, or multiple user accounts for the same person!
So, like everything: it depends. Retrying 500s is often fine, but you must ask these critical questions early in your design phase and build your API or service accordingly.
Perhaps the caller can implement a way to check if an operation truly succeeded despite the error. Maybe you can enforce idempotency using unique keys or specific headers.
Or, if the business risk is low, you might decide you're okay with these rare occurrences, provided you have robust monitoring to catch them.
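One common way to make a retried 'submit order' safe is an idempotency key: the client generates a unique key once and reuses it on every retry so the server can deduplicate. A minimal sketch, assuming a hypothetical endpoint and the conventional (not standardized) Idempotency-Key header:

```python
import uuid

import requests

def submit_order(order_payload: dict) -> requests.Response:
    """Retry-safe order submission using an idempotency key (hypothetical endpoint/header)."""
    # The same key is reused on every retry, so the server can detect a request
    # whose first attempt actually succeeded before the 500 was returned.
    idempotency_key = str(uuid.uuid4())
    resp = None
    for _ in range(3):
        resp = requests.post(
            "https://api.shop.example/orders",  # hypothetical endpoint
            json=order_payload,
            headers={"Idempotency-Key": idempotency_key},
            timeout=10,
        )
        if resp.status_code < 500:
            return resp
    return resp
```

The key idea is that the deduplication burden sits server-side, where the source of truth lives, instead of every consumer guessing whether the first attempt landed.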
Lessons Learned
After every failure, we would ask ourselves: how can we do better? A few easy-to-medium considerations below!
Rate Limiting
Implementing sensible rate limits (even if generous) is crucial for that recovery window and minimizing the blast radius of a single feature not working.
Assumptions are fine if you're building entirely new systems, but thinking about and enforcing that limit allows you to codify what you support and expand it as you go, if needed.
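As a sketch of what "sensible, even generous" can look like in code (illustrative numbers, not the platform's real limits), a small token bucket in front of the expensive path lets excess requests fail fast with a 429 instead of piling onto the database:

```python
import threading
import time

class TokenBucket:
    """Minimal token-bucket limiter: shed work beyond a configured rate."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller gets a 429 instead of piling onto the database

# Example: allow roughly 100 requests/sec with bursts of 20 (numbers are illustrative).
limiter = TokenBucket(rate_per_sec=100, burst=20)
```

Whether you shed load at the gateway or inside the service matters less than having an explicit, agreed-upon number that gives the system a recovery window.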
Retry Strategies and Testing
"How" you retry a downstream call is very important. Exponential backoff, jitters, and sensible thresholds are things that need to be explored and considered carefully across the teams owning those services.
Usually, it's ideal to give the team owning the API the option to create libraries and SDKs that consumers use to communicate with it.
This ensures retries are managed consistently and effectively by the team owning the underlying API, which can (and should) battle-test them with all their product knowledge and edge cases.
It's an effort investment for sure, but keeping this logic closer to the server/team can be beneficial in the long run!
But regardless of who does what: chaos/load testing your service against specific integrations, and understanding the dependencies you're using and how you're using them, will go a long way toward avoiding this mess!
SLOs
Setting expectations across teams and with clients is also important. What are their genuine availability and latency needs from your service? What kind of error rates are they designing their systems to gracefully handle?
You might want to push for reducing specific types of requests, or for clearing up any miscommunication about differing objectives between the clients and the service.
If you have similar occurrences of this, and strict availability objectives, you might want to start thinking about some system design changes, which would limit the blast radius of those issues for when they happen again. Some high level ideas:
- Having independent clusters for the database/services for different requirements and needs.
- Moving to an async/push-based flow for pushing recommendations across, with more native queuing systems and retries baked in.