Fraud Detection at Sub-200ms: The Latency Budget Nobody Talks About

A fraud model that takes 400ms to score is not a fraud model. It is a customer-service problem.

I learned this on a payments platform handling billions in volume, where the fraud team had built something genuinely good. Gradient boosting, hundreds of features, lift numbers that made the data scientists proud. It caught fraud the old rules missed. It also sat outside the authorization path, scoring transactions a beat too late, and by the time it flagged anything the money had already moved. So we did the obvious thing. We put it inline, in front of the authorization decision. And it fell over, because nobody had counted the milliseconds.

Here is the constraint that everyone designing inline fraud detection eventually runs into and almost nobody plans for. The authorization call has a hard deadline. The card network expects an answer in a couple of seconds, the payment processor wants its own margin inside that, and the merchant on the other end is watching a spinner. By the time the request reaches your fraud check, you do not own two seconds. You own a slice. On a good day that slice was around 200ms, and fraud scoring had to fit inside it without starving everything else that also needed time in that window.

Where the 200ms actually goes

The mistake is to think the budget is “model inference time.” Inference is usually the cheapest part. The budget gets eaten by everything around the model.

[Figure: an authorization request flowing through feature fetch, model inference, and decision, with the 200ms budget split across the stages.]

The model inference is the small box. The feature fetch is the one that blows the budget.

Walk it through. The authorization request lands. Before the model can do anything, you have to assemble the features it was trained on, and those features do not live in the request. The card’s recent velocity, how many times this device has been seen this hour, the merchant’s chargeback rate, the distance between this transaction and the last one: all of that is state, and state lives in a store you have to go ask. Every feature you fetch is a network hop or a cache lookup, and even if you fan the lookups out in parallel you are paying for the slowest one, not the average one. Fetch them one after another and you pay the sum. Then the model runs. Then you serialize a decision and hand it back. Then there is the network time getting in and out of your own service, which you do not control and cannot wish away.

When we instrumented it, the split was not close. Feature fetching was the overwhelming majority of the spend. The model itself ran in single-digit milliseconds. We had been optimizing the wrong thing for weeks, shaving the model when the model was never the problem.
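For a sense of what that instrumentation can look like, here is a minimal sketch, not the platform's actual code. It assumes nothing beyond the standard library: a small context manager records wall-clock time per stage, so the split becomes a number instead of a guess. `StageTimer`, the stage names, and the sleeps standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates elapsed milliseconds per named stage of one request."""

    def __init__(self):
        self.spent_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed = (time.perf_counter() - start) * 1000.0
            self.spent_ms[name] = self.spent_ms.get(name, 0.0) + elapsed

    def shares(self):
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.spent_ms.values())
        if total == 0:
            return {}
        return {name: ms / total for name, ms in self.spent_ms.items()}


# Hypothetical stand-ins: sleeps in place of the real fetch and inference.
timer = StageTimer()
with timer.stage("feature_fetch"):
    time.sleep(0.05)
with timer.stage("inference"):
    time.sleep(0.005)

for name, share in timer.shares().items():
    print(f"{name}: {timer.spent_ms[name]:.1f} ms ({share:.0%})")
```

The sleep durations are made up for illustration and are not measurements from the platform. The point is only that once every stage is wrapped like this, nobody can argue about where the time goes.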

The features that survive the budget

Once you accept that feature fetch is the budget, the whole design flips. The question stops being “what features make the model better” and becomes “what features can I get my hands on in time.” Those are different questions, and the gap between them is where the engineering lives.

We sorted every feature into three buckets by how expensive it was to retrieve at decision time.

Features already in the request were free. Amount, currency, merchant category, the card BIN, the rough shape of the transaction. No fetch, no hop. The model got all of these without spending a millisecond.

Features in a hot in-memory store were cheap. We precomputed the expensive aggregations offline and continuously, then kept the latest values in a low-latency cache keyed by card, device, and merchant. The model did not compute the card’s seven-day velocity at decision time. It read a number that a streaming job had already computed and parked there. A read, not a calculation.
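The read-versus-calculate split can be sketched in a few lines. This is an illustration under stated assumptions, not the store we ran: an in-process dictionary stands in for the low-latency cache, and `VelocityStore`, `ingest`, and `read` are hypothetical names. The streaming side does the aggregation once per event. The decision side only looks up the result.

```python
from collections import defaultdict, deque


class VelocityStore:
    """Keeps a rolling count of transactions per card over a fixed window."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._events = defaultdict(deque)   # card_id -> timestamps inside the window
        self._latest = {}                   # card_id -> precomputed count

    def ingest(self, card_id, ts):
        """Streaming side: called once per event, off the authorization path."""
        q = self._events[card_id]
        q.append(ts)
        # Evict events that have aged out of the window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        self._latest[card_id] = len(q)

    def read(self, card_id):
        """Decision side: a single lookup, no aggregation."""
        return self._latest.get(card_id, 0)


store = VelocityStore(window_s=7 * 24 * 3600)
store.ingest("card-1", ts=0)
store.ingest("card-1", ts=3600)
store.ingest("card-1", ts=8 * 24 * 3600)   # the first two events have aged out
print(store.read("card-1"))                # 1
```

One trade-off this sketch makes visible: the stored count is only refreshed when a new event arrives, so a read can be slightly stale. That is the price of turning a calculation into a read.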

Features that needed a real query were the ones we cut. Anything that meant joining across tables, hitting a warehouse, or calling another service synchronously. Some of those were genuinely predictive. We dropped them anyway, or we moved their computation offline so that what remained at decision time was a single fast read. (The graph feature the data scientists loved, the one that traced money between linked accounts, was exactly this kind. It was good. It also took 90ms to compute live, so it lost.)
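The three buckets reduce to a small decision rule. The sketch below is one way to write it down, assuming you have a measured tail latency per feature. `FeatureCost`, `classify`, and the 5ms ceiling are hypothetical. The only figure taken from the story above is the 90ms graph feature; the other latencies are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    INLINE = "already in the request"
    HOT_READ = "single read from the hot store"
    CUT = "too slow for the inline path"


@dataclass(frozen=True)
class FeatureCost:
    name: str
    in_request: bool
    p99_fetch_ms: float   # measured at decision time


def classify(cost, hot_read_ceiling_ms=5.0):
    """Sort a feature by what it costs to retrieve, not by how predictive it is."""
    if cost.in_request:
        return Bucket.INLINE
    if cost.p99_fetch_ms <= hot_read_ceiling_ms:
        return Bucket.HOT_READ
    return Bucket.CUT


candidates = [
    FeatureCost("amount", in_request=True, p99_fetch_ms=0.0),
    FeatureCost("card_velocity_7d", in_request=False, p99_fetch_ms=2.0),       # illustrative
    FeatureCost("linked_account_graph", in_request=False, p99_fetch_ms=90.0),  # the 90ms feature
]
for c in candidates:
    print(f"{c.name}: {classify(c).value}")
```

Classifying on a tail percentile rather than the mean is deliberate in this sketch, since the slow fetches are the ones that break the deadline.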

Here is the part nobody tells you. A model with twenty features it can actually fetch in time beats a model with two hundred features it cannot. The second model is better in the notebook and useless in production, because a fraud score that arrives after the deadline does not arrive at all. The network timed out, the authorization defaulted to its fallback, and your beautiful model contributed nothing except latency. Offline accuracy is a vanity metric if the model cannot answer the phone.
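To make "does not arrive at all" concrete, here is a minimal sketch of a caller that enforces the deadline with `asyncio.wait_for`. The function name `score_or_fallback`, the two toy models, and the choice of `"APPROVE"` as the fallback are assumptions for illustration. What the real fallback should be is a business decision this sketch does not settle.

```python
import asyncio


async def score_or_fallback(score_coro, deadline_ms, fallback):
    """Return (decision, answered_in_time). A late score is discarded."""
    try:
        decision = await asyncio.wait_for(score_coro, timeout=deadline_ms / 1000.0)
        return decision, True
    except asyncio.TimeoutError:
        return fallback, False


async def slow_model():
    await asyncio.sleep(0.4)    # stands in for the 400ms model from the opening
    return "DECLINE"


async def fast_model():
    await asyncio.sleep(0.01)   # stands in for a model that fits the budget
    return "DECLINE"


async def main():
    late = await score_or_fallback(slow_model(), deadline_ms=200, fallback="APPROVE")
    on_time = await score_or_fallback(fast_model(), deadline_ms=200, fallback="APPROVE")
    return late, on_time


late, on_time = asyncio.run(main())
print(late)      # ('APPROVE', False)
print(on_time)   # ('DECLINE', True)
```

Both toy models reach the same verdict. Only one of them gets to act on it.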

Make the timing impossible to ignore

The way you keep this honest is to measure every feature fetch the same way you measure the model, and to enforce a hard ceiling rather than hope. We wrapped feature retrieval so that anything blowing its slice of the budget got dropped and the model scored without it, instead of letting one slow lookup take the whole transaction down with it.

# Every feature carries its own time budget. A feature that can't answer
# in time is not a feature, it's a liability. We score without it.
# extract_inline, HOT_FEATURES, cache, and metrics are the service's own
# helpers, assumed to be defined elsewhere.
import asyncio
import time

def monotonic_ms():
    # monotonic clock in milliseconds, immune to wall-clock adjustments
    return time.monotonic() * 1000.0

async def gather_features(txn, budget_ms=120):
    deadline = monotonic_ms() + budget_ms
    feats = {}

    # free features: already in the request, no fetch, no excuse
    feats.update(extract_inline(txn))

    # cached aggregates: precomputed offline, we only read here.
    # reads run one after another, each capped by what is left of the budget.
    for name, key in HOT_FEATURES:
        remaining = deadline - monotonic_ms()
        if remaining <= 0:
            metrics.incr("feature.dropped.budget", tag=name)
            continue
        try:
            feats[name] = await cache.get(key(txn), timeout_ms=remaining)
        except (TimeoutError, asyncio.TimeoutError):
            # a missing feature is fine. a stalled authorization is not.
            metrics.incr("feature.dropped.timeout", tag=name)

    return feats
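
A variation worth sketching: when the reads are independent, they can be issued concurrently under one shared deadline, so a single slow lookup cannot eat the time the others needed. This is a hedged alternative, not what the code above does. `fetch_all` and the two toy fetchers are hypothetical names, and the sleeps stand in for real cache reads.

```python
import asyncio


async def fetch_all(fetchers, budget_ms):
    """Run every fetcher concurrently and keep whatever finishes inside the budget.

    fetchers maps feature name -> zero-argument coroutine function.
    Returns (features, dropped_names).
    """
    if not fetchers:
        return {}, []
    tasks = {asyncio.create_task(fn()): name for name, fn in fetchers.items()}
    done, pending = await asyncio.wait(set(tasks), timeout=budget_ms / 1000.0)

    for task in pending:
        task.cancel()   # stop waiting on anything past the deadline

    feats = {}
    dropped = [tasks[t] for t in pending]
    for task in done:
        if task.exception() is None:
            feats[tasks[task]] = task.result()
        else:
            dropped.append(tasks[task])   # a failed read is treated like a late one
    return feats, sorted(dropped)


async def fast():
    await asyncio.sleep(0.01)
    return 3


async def slow():
    await asyncio.sleep(1.0)
    return 0.02


feats, dropped = asyncio.run(
    fetch_all({"card_velocity_1h": fast, "merchant_chargeback_rate": slow}, budget_ms=100)
)
print(feats, dropped)
```

With these toy timings the fast read lands in `feats` and the slow one is reported as dropped. The total wait is bounded by the budget, not by the sum of the reads.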

The model has to be trained to expect this. If a feature can go missing under load, the model cannot treat it as always present, so we trained with the same features randomly dropped, which is also just good regularization. A model that falls apart the moment one cache read times out is not production-ready, however well it scored in the offline test.
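A minimal sketch of that training-time dropout, using only the standard library. The names `drop_features` and `DROPPABLE`, the example rows, and the 10% drop rate are illustrative assumptions, not the values we used. It also assumes the model can handle a missing key, which many gradient boosting libraries support natively.

```python
import random


def drop_features(row, droppable, drop_prob, rng):
    """Return a copy of row with each droppable feature removed with probability drop_prob.

    Features that arrive in the request itself are never dropped, since they
    cannot time out at decision time.
    """
    return {
        name: value
        for name, value in row.items()
        if name not in droppable or rng.random() >= drop_prob
    }


# Only the fetched features can go missing in production.
DROPPABLE = {"card_velocity_7d", "device_seen_1h"}

rows = [
    {"amount": 42.0, "card_velocity_7d": 5, "device_seen_1h": 1},
    {"amount": 980.0, "card_velocity_7d": 31, "device_seen_1h": 12},
]

rng = random.Random(0)   # seeded so a training run is reproducible
augmented = [drop_features(row, DROPPABLE, drop_prob=0.1, rng=rng) for row in rows]
print(augmented)
```

In practice the drop rate would ideally mirror how often each feature actually times out under load, which is exactly what the dropped-feature counters are there to tell you.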

async def score(txn):
    feats = await gather_features(txn)         # bounded, may be partial
    # model stands in for a wrapper that accepts a partial feature dict
    risk = model.predict_proba(feats)          # single-digit ms, the easy part
    # the threshold is a business call, not a model call: declining a real
    # customer costs more than letting one bad txn through, up to a point.
    if risk >= DECLINE_AT:
        return Decision.DECLINE
    if risk >= REVIEW_AT:
        return Decision.STEP_UP          # 3DS challenge, push the cost to the rare case
    return Decision.APPROVE

That step-up path mattered more than any model improvement. Most transactions are fine and you approve them without friction. A thin slice are clearly fraud and you decline. The interesting band in the middle does not need a better fraud model. It needs a second, slower check that you can afford precisely because you only run it on the few percent the fast model was unsure about. The expensive graph feature we cut from the inline path? It lives here, in the step-up flow, where it has time to run because we are no longer holding an authorization hostage.
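The two-tier shape can be sketched as a small router. Everything named here is a stand-in: `decide`, the toy scoring functions, the thresholds of 0.5 and 0.9, and the local `Decision` enum are assumptions for illustration. The point is only that the slow check is called for the middle band and nowhere else.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DECLINE = "decline"


def decide(txn, fast_score, slow_score, review_at=0.5, decline_at=0.9):
    """fast_score runs on every transaction; slow_score only in the middle band."""
    risk = fast_score(txn)
    if risk >= decline_at:
        return Decision.DECLINE, "fast"
    if risk < review_at:
        return Decision.APPROVE, "fast"
    # Middle band: spend the extra time here, off the authorization deadline.
    risk = slow_score(txn)
    return (Decision.DECLINE if risk >= decline_at else Decision.APPROVE), "slow"


slow_calls = []   # records which transactions paid for the expensive check


def fast_score(txn):
    return txn["fast_risk"]


def slow_score(txn):
    slow_calls.append(txn["id"])
    return txn["slow_risk"]


txns = [
    {"id": 1, "fast_risk": 0.02, "slow_risk": 0.01},
    {"id": 2, "fast_risk": 0.95, "slow_risk": 0.99},
    {"id": 3, "fast_risk": 0.60, "slow_risk": 0.20},
    {"id": 4, "fast_risk": 0.70, "slow_risk": 0.97},
]
results = [decide(t, fast_score, slow_score) for t in txns]
print(results)
print(slow_calls)   # [3, 4]
```

Two of the four toy transactions reach the slow check. In a real flow that check would sit behind the step-up challenge, where it has time to run.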

So the architecture that won was not the most accurate model. It was a fast, slightly dumber model that fit the budget, backed by a slower, smarter check that only fired when the fast one hesitated. The fraud team hated giving up features at first. They came around when the decline rate on good customers dropped, because the inline model finally answered before the network gave up on it.

The lesson generalizes past fraud. Any model you put on the critical path of a real-time decision is competing for the same milliseconds as the thing it is trying to protect. You do not get to spend the whole budget on being right. You get a slice, and the model that wins is the one that does the most with the slice it is handed.

What is the budget on your hottest path, and have you actually measured where it goes? Because if you have not, I will bet it is the feature fetch.