Building an MCP Server Fabric for Financial Operations

The first version was one agent. It could read the ledger, query reconciliation, pull invoices from the document store, and post adjusting entries. It worked in the demo, which is the most dangerous sentence in this line of work.

What it could also do, on a bad day, was post a wrong adjusting entry to a real ledger because a retrieval came back stale and the model reasoned confidently over it. One process held every credential and every tool. The blast radius of a single bad turn was the whole finance stack. That is not a model you can put in front of a controller and an auditor and keep your job.

So I tore it apart. Not into a better agent. Into a fabric.

The fabric, not the brain

The shape that worked is small. Each financial system gets its own MCP server. The ledger server speaks ledger and nothing else. The reconciliation server speaks reconciliation. The document store sits behind its own server. Each one exposes a handful of tools, scoped to exactly what that system does, and holds only the credentials that system needs. An orchestrator sits above them and decides what to call. It holds no credentials of its own.

An orchestrator routing to four MCP servers, each fronting one financial system, with approval gates on every write tool

The orchestrator reasons. The servers do the work. Every arrow that mutates state passes through a gate first.

The thing the diagram is trying to make obvious: writes are physically routed differently from reads. A read tool returns data and the orchestrator moves on. A write tool does not execute on call. It stages a proposed change, emits an approval request, and parks. A human (or a policy engine standing in for one on the small stuff) releases it or kills it. The model never touches the ledger directly. It touches a queue of intentions, and a separate gate decides which intentions become facts.

Here is the part nobody tells you. The win from this shape is not really about the agent being smarter. The agent is the same. The win is that the failure modes got small enough to reason about. When the ledger server is the only thing holding ledger credentials and the only thing that can post an entry, a compromised or confused orchestrator cannot drain anything. It can only ask. And every ask is logged with the inputs that produced it.

Why one big agent was the wrong shape

Three reasons, and I had to live each of them.

Blast radius. One agent with every tool is one credential blast radius. A prompt injection in a fetched invoice, a hallucinated account number, a tool call that fires twice because of a retry, and the damage is unbounded across systems. Split the credentials across servers and the worst a single bad turn can do is bounded by one system’s scope. You stop designing for “what if the model is wrong” as a catastrophe and start treating it as a normal, contained event.

Auditability. An auditor does not want to read a chat transcript. They want to know, for this posted entry, what tool ran, with what arguments, who approved it, and at what time. When every write goes through a named tool on a named server behind a gate, that record falls out for free. The trace is the audit trail. With one agent improvising over a pile of functions, you are reconstructing intent from logs after the fact, which is exactly the position you do not want to be in when the regulator calls.

Testing. This is the unglamorous one and it mattered most. I can test the ledger server in isolation. Feed it a posting request, assert it validates the account, assert it refuses an unbalanced entry, assert the gate fires. No model in the loop, no orchestrator, no flakiness from a language model’s mood that morning. Each tool is a small unit with a contract. The monolith was untestable in any honest sense. You could only test the whole thing end to end, watch it pass, and pray the next prompt didn’t find a new path through it.

A tool definition with the guardrail built in

The guardrail is not a wrapper bolted on after. It lives in the tool. A write tool that cannot enforce its own invariants is not a tool I will ship. Here is the posting tool on the ledger server, roughly as it stands.

@server.tool()
async def propose_ledger_entry(
    debit_account: str,
    credit_account: str,
    amount_minor: int,          # always integer minor units; floats post wrong entries
    currency: str,
    memo: str,
    idempotency_key: str,       # the orchestrator MUST pass this; retries are not new entries
) -> ToolResult:
    # Guardrails run before anything is staged. A bad call fails loud, here,
    # not three steps later when it's a mystery in the reconciliation report.
    if amount_minor <= 0:
        return reject("amount must be positive; reversals use the reversal tool")
    if not ledger.account_exists(debit_account) or not ledger.account_exists(credit_account):
        return reject("unknown account; the model does not get to invent GL codes")
    if amount_minor > POLICY.auto_threshold_minor:
        require_human = True    # over the line, a person signs. no exceptions, no override flag.
    else:
        require_human = POLICY.always_require_human  # which, in finance, is true

    # We never post here. We stage and ask. The gate decides if this becomes real.
    proposal = ledger.stage(
        debit_account, credit_account, amount_minor, currency, memo,
        idempotency_key=idempotency_key,
    )
    return await approval.request(
        proposal,
        require_human=require_human,
        # the approver sees the exact arguments, not a model's summary of them
        evidence=proposal.as_diff(),
    )

Two things in there earn their keep. The amount is integer minor units, never a float, because a float is how you post a cent wrong and spend a Tuesday finding it. And the idempotency key is mandatory, because the single most common production failure of an agent is not a wrong decision, it is the right decision executed twice on a retry. The gate de-dupes on that key. Without it, “post this entry” plus a network blip becomes two entries and a reconciliation break you will chase for an afternoon.

The approver does not see “the agent wants to record a vendor payment.” They see the exact debit, credit, amount, and the source document the proposal was built from. Judgment needs the real arguments, not the model’s prose about them.

What it cost

This shape is more work up front than one clever agent. You are running and versioning several servers instead of one. The orchestrator has to know which server owns which capability, and you have to keep those contracts honest as the underlying systems change. There is real glue.

And the gates add latency and friction. A controller approving every machine proposal is not a controller, it is a bottleneck wearing a fancier title. So the threshold matters. Small, routine, fully-reconciled entries auto-release under policy. The human’s attention is spent only where the amount or the ambiguity earns it. Get that line wrong toward caution and people route around your system. Get it wrong toward speed and you have rebuilt the monolith with extra steps.

I will take the friction. In finance the question is never only “can the machine do this.” It is “can you prove, afterward, exactly what it did and that someone with authority let it.” A fabric of small servers with gated writes answers that by construction. One big agent answers it with a transcript and a shrug.

If you are wiring an agent to systems that move money, resist the elegant single brain. Build the boring fabric. The day something goes wrong, and it will, you want the blast radius to fit in one server and the answer to fit in one trace.