MCP a Year In: What Held Up, What Didn't

A year and a half ago I wrote that MCP was the USB-C of AI tools and that I was betting on the protocol over the ecosystem around it. I have now spent sixteen months building real systems on it, the kind that touch money and get audited, so it is time to grade my own homework instead of pretending the early take was prophecy.

The short version: the structural bet was right and most of the rough edges I waved off as growing pains were real, and some of them are still here. The protocol held. The maturity around it took longer than the optimists promised and arrived faster than the skeptics swore it would. Both camps get to feel a little vindicated, which is usually how these things go.

What held up

The interoperability bet was the whole thesis, and it landed. I argued that N callers times M tools collapses to N plus M once everyone speaks one handshake, and that is exactly what happened to my integration surface. I have servers I wrote in early 2025 that I have not touched since, still serving clients that did not exist when I wrote them. A colleague pointed an agent I have never seen at my reconciliation server last month and it just worked. No meeting, no integration ticket, no shared types package. That is the compounding I was hoping for, and it is real.

Tools as a clean boundary held up even better than the interoperability did. The line I drew in the first post (the model proposes, the server disposes) turned out to be the single most load-bearing decision in everything I have built since. The model never gets a connection string. It gets to ask. The server validates, scopes, audits, and decides. When the auditors came for the agentic finance work, that boundary is the thing that made the conversation short. Every action a model could take was a tool call, every tool call was logged, every log had who, what, and why. I did not design that for the audit. I designed it because I did not trust the model, and it paid off in a room I did not anticipate.

Testing servers in isolation held up too, and this one I undersold at the time. A server is just a process that answers a handshake. That means you can drive it with a test client and never involve a model at all. My server test suites run in CI in seconds, deterministically, no tokens spent, no flakiness from a model having a bad day. The agent layer is where the nondeterminism lives. Push as much logic as you can down into tools that you can test like normal software, and your reliability problem shrinks to the part that is genuinely hard.

What was rough

Auth and permissions were immature in early 2025 and the maturity curve was slower than I wanted. I wrote then that I had already rewritten one server’s auth twice and expected a third. I got the third. And a fourth. The remote, multi-tenant, real-identity case stayed unpaved far longer than the local stdio case, and for a while the answer to “how should a remote MCP server authenticate a caller and scope what it can see” was a stack of GitHub threads and a shrug. It is better now. It was painful for most of the year.

Versioning is the rough edge I did not see coming, and it is the one I would warn the loudest about today. A tool is an interface, and interfaces drift. I renamed a parameter on a tool that three different clients depended on, none of which I controlled, and discovered that there is no compiler for “you broke the agent’s understanding of your tool.” The client did not crash. The model just started calling the tool slightly wrong, and the failure showed up as degraded answers, not an error. Schema changes to a tool are breaking changes to every model that learned the old shape, and the protocol gives you very little to catch that. You end up versioning tools by hand and treating a tool rename like an API deprecation, because it is one.

Discovery was thinner than the optimism implied. The dream was a rich directory of production servers you could pull off a shelf. What I actually got was a lot of reference servers that were demos wearing a production costume, plus a smaller number of serious ones you had to find by word of mouth. I wrote most of my own anyway, which was fine, but the “ecosystem of strangers building the other half of your system” took longer to fill in with things I would put in front of a regulator.

And the gap between a demo server and a production one is the part nobody tells you, still, a year and a half in. A demo server is forty lines. A production server is the forty lines plus everything you do not see in the talk: input validation that assumes the model is adversarial, output that never leaks more than the caller is allowed to see, rate limits, idempotency on anything that writes, structured audit, timeouts, and a failure mode that fails loud instead of weird. The protocol makes the easy version look so close to the real version that teams ship the easy version and find out the difference in an incident.

The demo and the hardened version

This is the difference, made concrete. Same tool, two definitions. The first is the one that demos beautifully and gets you paged.

# The naive version. Reads great in a talk. Do not ship it.
@app.call_tool()
async def call_tool(name: str, args: dict) -> list[TextContent]:
    if name == "issue_refund":
        # trusts the model's arguments completely, no caller scope,
        # no idempotency, no audit. it WILL double-refund eventually.
        result = payments.refund(args["order_id"], args["amount"])
        return [TextContent(type="text", text=str(result))]

Here is the same tool the way it has to look once it touches money and an auditor exists. Nothing clever, just the boring parts the demo skipped.

# The hardened version. The boring parts are the whole job.
@app.call_tool()
async def call_tool(name: str, args: dict, ctx: CallContext) -> list[TextContent]:
    if name != "issue_refund":
        raise ValueError(f"no such tool: {name}")  # fail loud, not weird

    # 1. The caller has an identity. The model does not get to pick it.
    actor = ctx.principal  # set by the transport's auth, never by args
    require_scope(actor, "refunds:write")

    # 2. Validate against the world, not just the JSON schema. The schema
    #    says "amount is a number." It does not say "this order exists,
    #    is refundable, and the amount is <= what was actually paid."
    order = orders.get(args["order_id"])
    if order is None or not order.refundable:
        return deny("order not refundable", args)
    amount = Money(args["amount"])
    if amount <= 0 or amount > order.captured_total:
        return deny("amount out of bounds", args)

    # 3. Idempotency. The model retries. Agents loop. Without this key,
    #    a retried tool call is a second refund. Ask me how I know.
    idem_key = f"refund:{order.id}:{ctx.call_id}"

    # 4. Above a threshold, this is not the agent's decision to make.
    if amount > REVIEW_THRESHOLD:
        return enqueue_for_human(actor, order, amount, idem_key)

    with audit.record(actor=actor, action="issue_refund", target=order.id,
                      amount=amount, reason=args.get("reason")):
        result = payments.refund(order.id, amount, idempotency_key=idem_key)
    return [TextContent(type="text", text=render(result))]

The naive one is shorter and it works in every demo you will ever give. The hardened one is the only one I would put my name on, and almost none of the extra length is about MCP. It is identity, authorization, real validation, idempotency, a human gate, and an audit trail. The protocol got you a clean place to put all of that. It did not write any of it for you, and the demos keep implying it did.

What I would tell a team adopting it now

Start, but start with the boundary, not the framework. The reason to use MCP is the clean line between what the model can ask for and what your code decides to do. If you are not going to validate and scope and audit on the server side of that line, you are getting the syntax of the protocol and none of the value, and you would be safer with a plain function.

Version your tools from day one. Pick a deprecation discipline before you have a single external client, because the day you have three you cannot rename a parameter without breaking models you do not control, and nothing will tell you that you did.

Assume auth is your problem, not the protocol’s. For a local stdio server you are fine. For anything remote, multi-tenant, or carrying real identity, plan to own that layer. It has matured, but “matured” means the patterns exist, not that you get them for free.

And test your servers like the deterministic software they are. The model is where the uncertainty lives. Everything you push below the tool boundary is just code, and code you can test in CI is code you can trust at 2am.

I bet on the protocol over the ecosystem a year and a half ago, and I would make the same bet today, with one correction. I said standards win by being uninteresting. That was right. What I underweighted is how much interesting, unglamorous work the standard quietly leaves on your side of the wire. The handshake is solved. The hard part was never the handshake.