Why an idempotency key isn’t an idempotency assure


It began with a one-line message from a finance crew on a Tuesday afternoon: A handful of shoppers had been charged twice that day, and one was disputing a reproduction cost with their financial institution.

I went straight to the monitoring, anticipating to search out one thing damaged. As a substitute, all the pieces appeared wholesome: By the system’s personal data, each order had been paid precisely as soon as. It took the crew a month of digging by means of manufacturing incidents to shut the hole between a dashboard that mentioned, “all good,” and a buyer billed twice.

I’ve since seen this type of failure throughout a number of fee methods, some dealing with a whole lot of 1000’s of transactions a day. What follows is a composite and doesn’t describe any single system or group. The numbers, timings and figuring out particulars have been modified to maintain something proprietary out.

The retry that charged twice

A buyer clicked Pay; the order service referred to as the fee service, which referred to as the exterior supplier. The supplier charged the cardboard for $200 and recorded a hit on its aspect.

The one factor that went flawed was timing. The supplier was beneath load and took simply over 3 seconds to reply. The shopper gave up after 2 seconds, a default inherited from inner service calls and by no means tuned for funds. From the caller’s aspect, the decision had merely failed, so nothing was marked as paid.

The retry logic did what it was constructed to do and despatched the request once more. The supplier noticed what appeared like a recent cost and took the cash a second time. The database recorded one fee: the retry. The primary cost lived solely within the supplier’s data, invisible to us till the complaints got here in.

Violetta Pidvolotska

The duplicates had been refunded inside hours, earlier than the dispute may develop into a chargeback. Understanding what had really failed took for much longer.

Later, we widened the timeout nicely previous the supplier’s slowest wholesome responses, however so long as a retry can set off a second cost, an extended timeout solely makes the double cost rarer.

The actual mistake was older than the restrict itself. The system had been informed {that a} timeout means failure.

The third state

We have a tendency to consider a community name as having two outcomes: it labored or it didn’t. A timeout is the third. The request might by no means have arrived. It might have completed its work and misplaced the response on the way in which again. That’s what bit us. Or it might nonetheless be operating. From the caller’s aspect, you possibly can’t inform which.

Code not often has a separate path for “unknown.” It will get lumped in with failure and the failure path retries. When the request strikes cash, that’s the way you cost somebody twice.

A sluggish service exhibits up as rising response occasions, an unreliable one as errors. A double cost exhibits up as a hit, and no one seen till a buyer did.

Timeouts, which I’ve written about earlier than, flip silent hangs into seen failures. And visual failures get retried, which is how we arrived at idempotency.

“Precisely-once” will get used as if it had been a setting you would activate. You can not promise exactly-once supply throughout an unreliable community, as Tyler Deal with explains. What you possibly can promise is exactly-once results: The request might arrive twice, whereas the cost occurs as soon as.

My first intuition was to cease retrying funds robotically and it helped. However not each retry is ours to modify off: The shopper refreshes the web page, or a retry coverage someplace within the infrastructure resends by itself.

The assumptions beneath the important thing

The usual treatment is an idempotency key: The caller attaches a singular worth to at least one try at an operation and sends the identical worth on each retry. A brand new key will get processed and its end result saved; a well-recognized one will get the saved end result again, so the retry has no further impact. Brandur Leach’s walkthrough of Stripe-like idempotency keys in Postgres lays the sample out finish to finish.

The important thing was shipped and the duplicates stopped. However we relaxed too early. The important thing turned out to be the simple half.

A key like this rests on 4 assumptions. I’ve since turned them right into a guidelines I name the four-assumptions take a look at:

  • Declare. Claiming a secret’s only a matter of checking it’s free first.
  • Intent. The identical key all the time carries the identical intent.
  • Reminiscence. No matter a key remembers is secure to replay.
  • Boundary. Nothing behind the important thing lies past your management.

Over the next month, all 4 broke: The race in a load take a look at, the opposite three in manufacturing.

Two requests, identical millisecond

In a load take a look at, two requests with the identical key arrived in the identical millisecond. Every checked for the important thing; neither discovered it and each began processing.

In a load test, two requests with the same key arrived in the same millisecond. Each checked for the key, neither found it and both started processing.

Violetta Pidvolotska

“Test whether or not the important thing exists, then write it” is a race like another and it broke the declare assumption. We mounted it by flipping the order: now writing the important thing is the examine. Each request writes it as “began” and the database lets just one declare win. The safeguard:

-- Attempt to declare the important thing; the UNIQUE index lets just one caller win.
INSERT INTO operations (idempotency_key, state) VALUES (:key, 'began')
ON CONFLICT (idempotency_key) DO NOTHING;

The insert touches one row or none, and that depend tells you which ones path you’re on. One row means you received: Name the supplier, then mark the row ‘accomplished’ and save the response. None means you misplaced: Learn the row and return its saved response or inform the caller to retry later whether it is nonetheless ‘began’.

One element is straightforward to get flawed: Commit the declare earlier than the supplier name goes out. In any other case, a crash rolls it again and erases the one report {that a} cost could also be in flight.

The more durable case is a profitable request that crashes mid-charge: Its secret’s caught at “began,” and each retry is informed to attend for a solution that can by no means come. A caught declare is identical unknown another time: As soon as it has sat in “began” longer than any wholesome name may take, ask the supplier what really occurred earlier than anybody prices once more.

Similar key, completely different request

The second hole appeared per week into manufacturing and broke the intent assumption: A caller reused one key for 2 completely different requests, $200 and $500, and the system returned the primary request’s saved response with out noticing the quantity had modified.

The second gap appeared a week into production and broke the intent assumption: a caller reused one key for two different requests, $200 and $500, and the system returned the first request's stored response without noticing the amount had changed.

Violetta Pidvolotska

We mounted it by storing a fingerprint of the request’s contents subsequent to the important thing, on the identical insert, so a request that loses the declare race can nonetheless examine its fingerprint in opposition to the winner’s. If the fingerprints match, it’s a real retry. In the event that they don’t, the important thing was reused for a distinct operation and we reject it.

That repair promptly rejected a sound retry. We had been fingerprinting the complete request, together with a timestamp that modified between makes an attempt and fields that arrived in a distinct order, so the fingerprints didn’t match.

A fingerprint has to seize what a request means moderately than how its bytes are organized. Hash a hand-picked record of enterprise fields and also you danger a silent collision: The one area no one remembered so as to add lets two completely different requests match. Hash the entire request minus identified noise like timestamps and the failure is loud as a substitute: A missed risky area rejects a sound retry. We selected loud, the repair got here down to 2 strains:

intent      = drop_fields(request.json, risky={"client_ts", "trace_id"})  # strip identified noise solely
fingerprint = sha256(canonical_json(intent))   # canonical kind: keys sorted, numbers and spacing normalized

Even “canonical” hides choices. RFC 8785 pins them down, but it surely runs each quantity by means of an IEEE 754 double, which loses precision on massive values, so cash quantities are safer as strings or integer cents. Change the canonical kind and each saved fingerprint stops matching, so we model it and retailer the model subsequent to the fingerprint.

The error we cached

The third hole got here in by means of assist: A buyer hit an insufficient-funds decline, added cash, tried once more with the identical key and obtained the previous “inadequate funds” again. The supplier was by no means requested. The system had been caching each response, declines included, so the failure caught to the important thing.

The third gap came in through support: a customer hit an insufficient-funds decline, added money, tried again with the same key and got the old "insufficient funds" back. The provider was never asked. The system had been caching every response, declines included, so the failure stuck to the key.

Violetta Pidvolotska

That compelled the query behind the reminiscence assumption: What’s a key allowed to recollect? The rule we landed on: cache solely success.

A mushy decline or a validation error releases the declare as a substitute: The row flips again to claimable, fingerprint stored. The following try reclaims it with an replace that just one retry can win and the client who provides cash will get a reside try as a substitute of a replay. Laborious declines are the exception: A stolen-card response is closing and that declare stays closed.

On a timeout, we don’t know whether or not the cost landed, so we ask the supplier whether or not the cost already went by means of and act on the reply.

The place the assure runs out

The primary three gaps had been on endpoints beneath the crew’s management. The fourth surfaced throughout reconciliation: A cost on an older supplier’s assertion with no matching inner report. That supplier had no idempotency keys, and the assure had reached its boundary. We couldn’t make it secure to name twice.

We obtained as shut as we may: A pending report earlier than the decision, a standing examine earlier than retrying, reconciliation to catch and refund no matter slips by means of. A window stays the place the cost has landed and our report doesn’t realize it but. We stored shrinking that window, however we by no means managed to shut it.

The database that holds the keys forces a choice of its personal: When it’s down, you both cease taking funds or take them unprotected. That alternative is a enterprise name. For a low-stakes write, cleansing up a uncommon duplicate can price lower than turning clients away. A fee will not be low stakes, so we fail closed and cease taking funds till the shop is again: A misplaced sale we will get better, and we had simply spent a month studying what duplicates price.

Questions I ask in design opinions

For something that shops or modifications knowledge, I ask three questions:

  • What occurs if this runs twice? Ask it out loud for each write.
  • Can we show the reply? Run it twice in checks, in sequence and in parallel; the second run ought to change nothing.
  • The place does the reality reside when methods disagree? For funds, it’s the supplier as a result of their data present whether or not cash really moved. Settle whose reply wins earlier than an incident does.

The secret’s a good suggestion and, in something that strikes cash, a needed one. It’s simply not a assure. The assure is the design round it: A declare that can’t race, an intent the fingerprint confirms, a reminiscence that retains solely what’s secure to replay and a boundary you might have mapped upfront. That’s the four-assumptions take a look at. Each assumption will get examined ultimately: You do it at design time or manufacturing does it for you.

This text is revealed as a part of the Foundry Skilled Contributor Community.
Need to be part of?

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles