A merchant sells phone cases out of a stall. She has run that stall for six years, moves a steady volume of small transactions through us every week, pays her suppliers on time, and has never missed a settlement. Walk her into a bank and ask for a working-capital line and the answer is no, because the credit bureau has never heard of her. No file. No score. No story the bank’s underwriting system knows how to read.
There are hundreds of thousands of merchants like her on the platform. We had years of their transaction history sitting in our own systems. The pitch was obvious: build a scoring model on the data we already hold, turn it into an API, and sell that score to banking partners who want the lending volume but cannot underwrite a thin-file borrower. The pitch was easy. The model was easy. Everything after the model was the hard year.
The model was the easy part
We built an ensemble. Gradient-boosted trees on the structured transaction features, a neural net on the behavioral and sequence signals, and a thin layer on top that blended the two into one probability of default. The boosting model carried most of the weight, because tabular financial data is what boosting was born for. The net earned its keep on the things trees handle badly: the rhythm of a merchant’s activity over time, the shape of their week, the slow drift of a business that is quietly dying versus one that is just seasonal.
    def score(merchant):
        x_struct, x_seq = features(merchant)  # ~80 tabular, ~30 sequence
        p_gbm = gbm.predict_proba(x_struct)
        p_net = net.predict_proba(x_seq)
        # blend weights came from the holdout, not from a hunch.
        # the net only ever moved the needle on thin-history merchants.
        return 0.65 * p_gbm + 0.35 * p_net
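Picking blend weights from a holdout can be sketched in a few lines. This is a minimal illustration, not the production code: it assumes the holdout is available as plain lists of labels and per-model default probabilities, and it assumes a grid search on log loss. The names `log_loss` and `pick_blend_weight` and all the numbers are made up for the example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def pick_blend_weight(y_true, p_gbm, p_net, steps=20):
    """Grid-search the GBM weight w in [0, 1]; the net gets 1 - w."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(p_gbm, p_net)]
        loss = log_loss(y_true, blended)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy holdout (made-up numbers): 1 = defaulted, 0 = repaid.
y_holdout = [0, 0, 1, 1, 0, 1]
p_gbm_holdout = [0.10, 0.30, 0.80, 0.60, 0.20, 0.70]
p_net_holdout = [0.20, 0.10, 0.60, 0.90, 0.40, 0.50]
w, loss = pick_blend_weight(y_holdout, p_gbm_holdout, p_net_holdout)
```

Because the grid includes w = 0 and w = 1, the chosen blend can never score worse on the holdout than either model alone. Whether it holds up on fresh data is a separate question, which is why the weights should be chosen on data neither model trained on.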
Accuracy came fast. Within a couple of months the ensemble was separating good from bad borrowers well enough that the lending team wanted to ship it. I told them no. Not because the numbers were soft, but because a number that good on this kind of data is exactly what should make you nervous.
What a regulator actually asks
We took it to a national financial regulator early, before we had a product, because in this market you do not get to ask forgiveness on lending. The first meeting reset my whole sense of the problem.
They did not ask how accurate the model was. They asked three other things. Can you tell a merchant why they were declined, in plain language. Can you prove the model is not penalizing people for things they are not allowed to be penalized for. And can you show that a person can audit a single decision, end to end, months after it was made.
Notice that none of those is about accuracy. A model can be ninety-something percent accurate and still fail every one of them. (The most accurate model I could build used features I had to throw in the bin by the end of that same week.)
The features I refused to use
This is where it got concrete. We had a feature catalog, and we went through it line by line asking not “does this lift the score” but “what is this a proxy for.”
Device price was a strong predictor. It was also a proxy for wealth, which is a proxy for things a lender is not allowed to price on. Gone. Home neighborhood, inferred from delivery and registration addresses, lifted the model too. It is one of the cleanest proxies for ethnicity and class there is, and a model that learns the neighborhood has learned to redline without ever seeing a protected attribute. Gone. Time-of-day patterns that turned out to track religious observance. Gone.
None of those features mention race, religion, or gender. That is the trap. You never put the protected attribute in the model. You do not have to. The model reconstructs it from forty correlated breadcrumbs, and now you have a discrimination engine with plausible deniability built in. So we stopped asking whether a feature was a protected attribute and started asking whether a feature was a stand-in for one. The catalog got shorter and the model got slightly worse and I slept better.
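Asking "is this feature a stand-in" can be made measurable. The sketch below is one hypothetical way to screen a catalog: on an audit sample where the protected attribute is known, check how well each feature alone predicts it, using a rank-based AUC. The function names, the 0.3 threshold, and the sample values are all assumptions for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def proxy_strength(feature_values, protected_flags):
    """Distance of AUC from 0.5, scaled to [0, 1]. 0 = no signal, 1 = perfect proxy."""
    return abs(auc(feature_values, protected_flags) - 0.5) * 2

def screen_catalog(catalog, protected_flags, threshold=0.3):
    """Return feature names whose proxy strength meets the (assumed) threshold."""
    return sorted(
        name for name, values in catalog.items()
        if proxy_strength(values, protected_flags) >= threshold
    )

# Made-up audit sample: 1 = member of a protected group, 0 = not.
protected = [1, 1, 1, 0, 0, 0]
catalog = {
    "device_price": [120, 150, 140, 600, 550, 700],  # separates the groups cleanly
    "txn_count_30d": [40, 90, 55, 60, 35, 85],       # roughly unrelated
}
flagged = screen_catalog(catalog, protected)
```

A per-feature screen like this only catches the loud proxies. It cannot see a protected attribute reassembled from many weak features, so it narrows the catalog but does not clear the model.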
We also ran the model against itself. Build it, then test whether its decisions correlate with the protected attributes we deliberately kept out, on a held-aside dataset where we did know those attributes. If a model we built without ethnicity still declines one group at a different rate for the same underlying risk, the model found a proxy we missed. That test stayed in the pipeline forever, as a gate, not a one-time blessing.
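A gate of that kind might look roughly like this. It is a sketch under stated assumptions: the audit set is a list of `(risk_band, group, declined)` tuples, "same underlying risk" is approximated by bucketing into risk bands, and the 5-point tolerance is invented. A real gate would also need minimum sample sizes and a proper statistical test.

```python
from collections import defaultdict

def decline_rate_gaps(records):
    """For each risk band, the largest difference in decline rate between groups.

    Each record is (risk_band, group, declined), with declined as a bool.
    """
    counts = defaultdict(lambda: [0, 0])  # (band, group) -> [declined, total]
    for band, group, declined in records:
        counts[(band, group)][0] += int(declined)
        counts[(band, group)][1] += 1
    rates_by_band = defaultdict(dict)
    for (band, group), (n_declined, n_total) in counts.items():
        rates_by_band[band][group] = n_declined / n_total
    return {
        band: max(rates.values()) - min(rates.values())
        for band, rates in rates_by_band.items()
    }

def fairness_gate(records, max_gap=0.05):
    """Return (passed, gaps). Fails if any band's gap exceeds max_gap."""
    gaps = decline_rate_gaps(records)
    return all(gap <= max_gap for gap in gaps.values()), gaps

# Made-up audit set: in the "low" band, group B is declined far more often.
audit = (
    [("low", "A", False)] * 9 + [("low", "A", True)] * 1
    + [("low", "B", False)] * 6 + [("low", "B", True)] * 4
    + [("high", "A", True)] * 5 + [("high", "B", True)] * 5
)
passed, gaps = fairness_gate(audit)
```

Comparing within a band matters. A raw difference in decline rates between groups could reflect a real difference in risk, but a difference inside the same band points at a proxy.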
Reason codes, or you do not ship
The explainability requirement was not a nice-to-have. The regulator wanted a reason a human could read for every single decision, and so did I, because a merchant who is declined deserves better than a shrug.
This is where boosting earned its place over anything fancier. We ran SHAP over each scored decision and turned the top contributing features into a short, ranked list of reasons, mapped to plain phrasing a merchant could act on.
    def reason_codes(merchant, k=3):
        contribs = shap_values(gbm, features(merchant)[0])
        # keep only the contributions that counted against the merchant first,
        # then rank, so a decline always comes back with up to k reasons.
        against = [c for c in contribs if c.direction == "against"]
        top = sorted(against, key=lambda c: c.impact, reverse=True)[:k]
        # we map raw feature names to merchant-readable reasons.
        # "txn_volatility_90d" means nothing to a person who runs a stall.
        return [REASON_MAP[c.name] for c in top]
A reason like “your transaction volume has been declining over the last three months” is something a merchant can understand, dispute, or fix. “The model said so” is not. The reason codes also did something I did not expect: they became our best bug detector. When a reason made no business sense, it usually meant a feature was leaking or mislabeled. The explainability layer we built for the regulator kept catching our own mistakes.
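The "reason makes no business sense" check can be partly automated. The sketch below is hypothetical and not a description of the pipeline above: it assumes each contribution carries the merchant's value and a baseline, and that someone has written down, per feature, whether a higher value should normally help or hurt. `Contribution`, `HIGHER_IS`, and `suspicious_reasons` are illustrative names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contribution:
    name: str        # raw feature name
    value: float     # the merchant's value for that feature
    baseline: float  # a typical value, e.g. a portfolio median (assumed)
    direction: str   # "against" or "for" the merchant

# Hypothetical expectations: does a HIGHER value normally hurt or help?
HIGHER_IS = {
    "txn_volatility_90d": "against",
    "months_active": "for",
    "supplier_on_time_rate": "for",
}

def _flip(direction):
    return "for" if direction == "against" else "against"

def suspicious_reasons(contribs):
    """Flag contributions whose direction contradicts the business expectation."""
    flagged = []
    for c in contribs:
        if c.name not in HIGHER_IS:
            flagged.append((c.name, "no documented expectation"))
            continue
        if c.value == c.baseline:
            continue  # no deviation to interpret
        above = c.value > c.baseline
        expected = HIGHER_IS[c.name] if above else _flip(HIGHER_IS[c.name])
        if c.direction != expected:
            flagged.append((c.name, f"expected '{expected}', got '{c.direction}'"))
    return flagged

contribs = [
    Contribution("txn_volatility_90d", 0.9, 0.4, "against"),  # makes sense
    Contribution("months_active", 72, 24, "against"),         # long tenure held against them
    Contribution("mystery_feature", 1.0, 0.0, "for"),         # nobody documented this one
]
flags = suspicious_reasons(contribs)
```

A flag here is a prompt to go look, not a verdict. Interactions between features can legitimately flip a direction, but a six-year-old business being penalized for its tenure is the kind of thing that usually traces back to a leaking or mislabeled feature.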
The feedback loop nobody budgets for
Here is the part nobody tells you about scoring models that decide who gets credit. The model does not just predict the world. It shapes it, and then it trains on the world it shaped.
You decline a merchant. You never find out whether they would have repaid, because they never got the loan. So your training data only ever contains outcomes for the people the model already approved. Retrain on that, and the model gets more and more confident about a narrower and narrower slice of the population, while going blind to everyone it has been rejecting. It looks like the model is improving. Its accuracy on the approved population goes up. It is actually curling in on itself.
In a diagram of this system, the straight line from data to score is the easy part. The checkpoint that gates the score, and the feedback arrow looping the model’s own decisions back into its training data, are the two places this kind of system goes wrong quietly.
We did two unglamorous things about it. We approved a small, randomized slice of borderline applicants we would otherwise have declined, and we accepted the losses on that slice as the price of keeping our eyes open. Those merchants gave us the only honest signal we had about the population on the wrong side of our own line. And we kept monitoring the model’s decisions and their correlation with protected attributes over time, not just at launch, because a model that is fair on the day it ships can drift into unfair as the world and the data move.
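The randomized slice can be sketched as a decision policy that logs how likely each approval was. Everything specific here is an assumption for illustration: the 0.10 approval cutoff, the 0.05-wide borderline band, the 5% exploration rate, and the idea of using inverse-propensity weights at retraining time.

```python
import random

def decide(p_default, approve_below=0.10, explore_band=0.05,
           explore_rate=0.05, rng=random):
    """Return (approved, propensity, explored).

    propensity is the probability this applicant had of being approved under
    the policy; logging it lets later training reweight the explored loans.
    """
    if p_default < approve_below:
        return True, 1.0, False
    if p_default < approve_below + explore_band:
        approved = rng.random() < explore_rate
        return approved, explore_rate, approved
    return False, 0.0, False

def training_weight(approved, propensity):
    """Inverse-propensity weight for an approved loan whose outcome we observe."""
    if not approved:
        raise ValueError("declined applicants have no observed outcome")
    return 1.0 / propensity

# Simulate 10,000 borderline applicants with a seeded generator.
rng = random.Random(0)
decisions = [decide(0.12, rng=rng) for _ in range(10_000)]
explored_share = sum(1 for approved, _, _ in decisions if approved) / len(decisions)
```

Each explored loan stands in for the many similar applicants who were declined, which is what the weight of `1 / explore_rate` expresses. The randomness is the point: if a person hand-picks which borderline merchants get through, the slice stops being an honest sample.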
What I would tell someone starting today
If you are scoring thin-file borrowers, your hard problems are not the ones a Kaggle notebook prepares you for. The accuracy will come. The work is everything around it: knowing which features are proxies for things you are not allowed to price on, being able to hand a declined merchant a reason they can act on, and staying honest with yourself about the fact that your model is editing its own future training set every time it says no.
Build the ensemble in a month. Spend the year on the rest. A scoring API that a regulator trusts and a merchant can argue with is worth far more than one extra point of accuracy bought with a feature you should not be using.