Ryan de Melo

The Lakehouse Won. Here's the Migration Nobody Warns You About.

The format conversion took a sprint. The migration took the better part of a year. If you only read the vendor decks, you would think those two numbers should be the same. The gap between them is the entire post.

The open table format won, and it deserved to. Files in object storage, an Iceberg-style table layer that gives you snapshots and schema evolution and ACID over those files, a catalog that tracks where the current metadata lives, and any engine you like reading on top. Decoupled storage and compute, no single vendor holding your data hostage, the warehouse experience without the warehouse lock-in. I moved a multi-petabyte estate off a proprietary warehouse onto exactly this, across a few hundred tables feeding analysts, ML pipelines, and a real-time-ish reporting layer. I would do it again.

But I want to talk about the year, not the sprint. The part the marketing skips is that the table format is the easy half. Writing your old Parquet into Iceberg tables is close to mechanical. Operating those tables in production is where the work actually is, and almost none of it shows up in the getting-started guide.

The diagram is the trap

Here is how the architecture gets drawn in every pitch.

[Figure: Lakehouse architecture. Object storage holding data files under a table format, a catalog tracking metadata, query engines reading on top, and maintenance jobs running off to the side.]

The boxes everyone draws are the bottom and the top. The box on the right, the one running on a schedule, is the one that decides whether this thing is good or miserable to live with.

The clean version is real. Data files sit in object storage. The table format keeps manifests that say which files belong to which snapshot. A catalog points engines at the current metadata pointer. Spark, Trino, your warehouse engine of the week, they all read the same tables without copying anything. Lovely.
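The "current metadata pointer" is worth seeing in miniature, because it is the one piece of the clean picture that has to be atomic. What follows is a toy sketch, not any real catalog's API: `ToyCatalog`, `current`, and `commit` are names invented for illustration, and the in-process lock stands in for whatever atomic swap a real catalog provides.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        """Return the metadata location engines should read, or None."""
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Compare-and-swap: move the pointer only if nobody else moved it first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # caller must re-read, rebase, and retry
            self._pointers[table] = new
            return True


catalog = ToyCatalog()
catalog.commit("events", expected=None, new="metadata/v1.json")

# Two writers both start from v1. Only the first commit lands.
base = catalog.current("events")
first = catalog.commit("events", expected=base, new="metadata/v2-writer-a.json")
second = catalog.commit("events", expected=base, new="metadata/v2-writer-b.json")
print(first, second, catalog.current("events"))
```

The losing writer is not an error case. It is the normal path whenever two jobs touch the same table, which is why the retry behaviour of every writer matters.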

What the diagram leaves off, every single time, is the box on the side: the maintenance jobs. Compaction, snapshot expiry, orphan-file cleanup, manifest rewriting. In the pitch those are a footnote. In production they are a standing service with an on-call rotation, and if you do not staff them on day one you will meet them on day ninety when queries that used to take seconds start taking minutes for no reason anyone can see.
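To make one of those jobs concrete, here is a minimal sketch of snapshot expiry in plain Python. It is a toy model and not a real table-format API: `Snapshot`, `expire_snapshots`, and `deletable_files` are names made up for illustration, and the retention rule (keep anything younger than `max_age`, plus the newest `keep_last`) is an assumption about what a reasonable policy looks like.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    files: frozenset  # data files reachable from this snapshot


def expire_snapshots(snapshots, now, max_age, keep_last):
    """Split snapshots into (kept, expired).

    A snapshot survives if it is among the keep_last most recent ones
    or is no older than max_age.
    """
    ordered = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    kept, expired = [], []
    for rank, snap in enumerate(ordered):
        if rank < keep_last or now - snap.committed_at <= max_age:
            kept.append(snap)
        else:
            expired.append(snap)
    return kept, expired


def deletable_files(kept, expired):
    """Only files that no surviving snapshot references are safe to delete."""
    live = set().union(*(s.files for s in kept))
    dead = set().union(*(s.files for s in expired))
    return dead - live
```

The second function is the part that bites in practice: a file written by an old snapshot is often still referenced by the current one, so expiry has to reason about reachability and not about age alone.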

The small-file problem will find you

Streaming and frequent writes are how the lakehouse earns its keep over the old batch warehouse. They are also how you generate ten thousand tiny files a day in a single partition.

Every commit that writes a few rows writes new files. A Kafka-fed ingestion job landing micro-batches every minute is, from the table’s point of view, a small-file generator with a nice name. Object storage does not care. The query engine cares a great deal. Planning a scan means opening and reading metadata for every one of those files, and a thousand 2 MB files cost vastly more to plan and read than ten 200 MB files holding the same rows. The data is identical. The query is slower by an order of magnitude, and the bill follows the query.
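The arithmetic is easy to sketch. The model below is deliberately crude and every constant in it is invented (`per_file_overhead_s` and `throughput_mb_s` are placeholders, not measurements, and real engines plan and read in parallel). The only thing it is meant to show is the shape: the byte term is identical for both layouts, and the per-file term is what grows.

```python
def scan_cost_seconds(num_files, file_mb, per_file_overhead_s=0.05, throughput_mb_s=200.0):
    """Toy model: a fixed cost per file opened, plus time to read the bytes."""
    open_cost = num_files * per_file_overhead_s
    read_cost = (num_files * file_mb) / throughput_mb_s
    return open_cost + read_cost


# Same 2000 MB of data, two layouts.
small_files = scan_cost_seconds(num_files=1000, file_mb=2)
large_files = scan_cost_seconds(num_files=10, file_mb=200)
print(f"1000 x 2 MB: {small_files:.1f}s   10 x 200 MB: {large_files:.1f}s")
```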

Here is the part nobody tells you. The lakehouse does not fix this for you. It hands you the tools (compaction that rewrites small files into right-sized ones, and bin-packing on write) and then expects you to schedule them, tune them, and pay for the compute they burn. Compaction is itself a heavy write job. So now you are scheduling a job whose entire purpose is to clean up after your other jobs, and the two compete for the same cluster and occasionally the same table. We learned to run compaction on its own isolated compute, off-peak, partition by partition, with concurrency limits so it never collided with a writer mid-commit. None of that was in a guide. All of it was load-bearing.
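The planning half of compaction is a bin-packing problem, and a rough sketch fits in a few lines. This is an illustration under assumptions, not how any particular engine implements it: `plan_compaction` is a made-up name, the 128 MB target and 32 MB "small" threshold are arbitrary defaults, and first-fit decreasing is just one simple heuristic.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group small files into rewrite batches of at most target_mb each.

    First-fit decreasing: sort candidates largest first and drop each one
    into the first batch that still has room. Files at or above the
    threshold are left alone.
    """
    candidates = sorted(
        (size for size in file_sizes_mb if size < small_threshold_mb),
        reverse=True,
    )
    batches = []  # each batch is a list of file sizes to merge into one file
    for size in candidates:
        for batch in batches:
            if sum(batch) + size <= target_mb:
                batch.append(size)
                break
        else:
            batches.append([size])
    # Rewriting a single file on its own gains nothing, so skip those.
    return [batch for batch in batches if len(batch) > 1]


# One partition: a hundred 2 MB micro-batch files and one healthy 200 MB file.
plan = plan_compaction([2] * 100 + [200])
print([(len(batch), sum(batch)) for batch in plan])
```

Everything the paragraph above calls load-bearing (isolated compute, off-peak scheduling, concurrency limits) lives outside this function. The plan is the cheap part.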

Schema evolution does what it says, not what you meant

The format’s headline feature is that schema evolution is safe. Add a column, rename one, reorder them, and old data still reads correctly because the format tracks fields by a stable ID, not by position. This is genuinely good and it is genuinely better than the Hive-table world it replaced, where a careless column add could quietly shift every value one position to the left.
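Here is the ID-tracking idea reduced to a toy. The stored row is keyed by field ID and the schema maps IDs to names, so a rename, a reorder, or an added column changes only the schema. Nothing here is a real file format or library call; `read_row` and the sample schemas are hypothetical.

```python
def read_row(stored_row, schema):
    """Project a stored row (field_id -> value) through the current schema
    (field_id -> column name). Columns the file predates read as None."""
    return {name: stored_row.get(field_id) for field_id, name in schema.items()}


old_file_row = {1: 42, 2: "alice"}  # written under schema v1

schema_v1 = {1: "id", 2: "user_name"}
# v2 renames field 2, reorders the columns, and adds field 3.
schema_v2 = {2: "customer_name", 1: "id", 3: "signup_date"}

print(read_row(old_file_row, schema_v1))
print(read_row(old_file_row, schema_v2))
```

The old file is never touched. The value `"alice"` simply shows up under whatever name field 2 carries today.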

The gotcha is that “the format won’t corrupt your data” is not the same promise as “your downstream consumers won’t break.” The format will happily let an upstream team rename a column with a clean, ID-tracked, fully reversible operation, and that rename will sail straight into a dozen dashboards and three ML feature pipelines that were selecting it by name. The table is fine. Everything reading the table is on fire. (We found this one the way you always find these, on a Monday, from a director, not a test.)
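The check that would have caught it before Monday is not complicated. A minimal sketch, assuming you keep some registry of which columns each consumer selects by name (the `consumers` mapping and `breaking_changes` function are hypothetical, and building that registry is the real work):

```python
def breaking_changes(old_schema, new_schema, consumers):
    """Report consumers that select a column whose name is about to vanish.

    old_schema / new_schema: field_id -> column name
    consumers: consumer name -> set of column names it selects
    """
    old_names = set(old_schema.values())
    new_names = set(new_schema.values())
    report = {}
    for consumer, columns in consumers.items():
        missing = sorted(c for c in columns if c in old_names and c not in new_names)
        if missing:
            report[consumer] = missing
    return report


old = {1: "id", 2: "user_name", 3: "amount"}
new = {1: "id", 2: "customer_name", 3: "amount"}  # a clean, ID-tracked rename

consumers = {
    "revenue_dashboard": {"id", "amount"},
    "churn_features": {"id", "user_name"},
}
print(breaking_changes(old, new, consumers))
```

Run something like this in the pipeline that applies schema changes and the rename becomes a failed check with a list of owners to talk to.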

Partition evolution is the sharper edge. You can change a table’s partitioning without rewriting history, which sounds like a free lunch and is not. You end up with old data partitioned one way and new data partitioned another, and a query spanning the boundary has to reason about both. It works. It is just slower and stranger than anyone expects.
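A toy version of that boundary, assuming a table that was partitioned by month and later switched to daily partitions. The file list, the two spec names, and `files_to_scan` are all invented for illustration; a real planner works from manifests and is far more involved.

```python
from datetime import date


def partition_value(spec, day):
    """Apply a partition transform to a date. Only two toy specs exist here."""
    if spec == "month":
        return (day.year, day.month)
    if spec == "day":
        return (day.year, day.month, day.day)
    raise ValueError(f"unknown spec: {spec}")


def files_to_scan(files, wanted_day):
    """Keep every file whose partition could contain wanted_day.

    Each file is (path, spec, partition_value). The filter has to be
    re-expressed under whichever spec that file was written with.
    """
    return [
        path
        for path, spec, value in files
        if partition_value(spec, wanted_day) == value
    ]


FILES = [
    ("old/2023-12.parquet", "month", (2023, 12)),     # written before the change
    ("new/2024-01-01.parquet", "day", (2024, 1, 1)),  # written after
    ("new/2024-01-02.parquet", "day", (2024, 1, 2)),
]

print(files_to_scan(FILES, date(2023, 12, 31)))
print(files_to_scan(FILES, date(2024, 1, 2)))
```

A one-day filter on new data prunes down to one day's file. The same filter on old data can only prune to the month, so it reads a month's worth of rows to return a day.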

Governance is the project the deck forgets entirely

This is the one that surprised even me, and I went in expecting surprises.

In the old warehouse, access control lived in one place. You granted on a schema, a table, a column, sometimes a row, and the warehouse enforced it on every query because every query went through the warehouse. One front door, one bouncer.

The lakehouse deliberately removes the front door. That is the whole point: many engines, direct access to the files. Which means a permission you defined for Trino does not automatically bind Spark, and a clever user with raw object-storage credentials can skip the table layer entirely and read the Parquet underneath your nicest row filter. The security boundary you assumed was a table just quietly became a bucket. If your governance model assumed the engine enforces access, the lakehouse just deleted that assumption and did not tell you.

You get this back, but you have to build it, and the building is where the months go. The catalog becomes your enforcement point, so the choice of catalog is suddenly a security decision and not just a metadata one. We had to lock the object-storage layer down so that humans and rogue jobs could not read raw files directly, route every real access through the catalog and a governed engine, and rebuild column masking and row filtering as policies the catalog enforced rather than properties of a single warehouse. Add lineage and audit on top, because in a regulated business “who read this and when” is a question you get asked under oath, not in a retro. The lakehouse does not ship that. It ships the slot where it goes.
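The shape of "policies the catalog enforces" looks roughly like the sketch below. It is an illustration only: `Policy`, `POLICIES`, `governed_read`, and the principals are hypothetical, and a real deployment enforces this in the catalog and a governed engine, not in application code. What it shows is the ordering that matters: the policy is looked up per principal and applied before any rows are handed back, whichever engine asked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    row_filter: Callable[[dict], bool]  # which rows this principal may see
    masked_columns: frozenset           # columns returned as "***"


# Hypothetical principals and rules, for illustration only.
POLICIES = {
    "eu_analyst": Policy(lambda row: row["region"] == "EU", frozenset({"email"})),
    "auditor": Policy(lambda row: True, frozenset()),
}


def governed_read(rows, principal):
    """Apply the principal's policy before anything downstream sees the rows."""
    policy = POLICIES.get(principal)
    if policy is None:
        raise PermissionError(f"no policy for {principal!r}")  # deny by default
    return [
        {col: ("***" if col in policy.masked_columns else val) for col, val in row.items()}
        for row in rows
        if policy.row_filter(row)
    ]


rows = [
    {"region": "EU", "email": "a@example.com", "amount": 10},
    {"region": "US", "email": "b@example.com", "amount": 20},
]
print(governed_read(rows, "eu_analyst"))
```

None of this helps if someone can read the Parquet straight out of the bucket, which is why locking down object storage has to come first.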

The catalog is the decision you will live with longest

Everything above routes through one choice you make early and revisit constantly: which catalog. Hive Metastore because everything speaks it, and you inherit its age and its scaling limits. A newer transactional catalog with proper commit semantics, branching, and finer governance, and you take on a younger system at the center of everything. A vendor’s managed catalog, which is excellent and is also the lock-in you told the CFO you were escaping.

There is no clean answer. The catalog is the part of an “open” architecture where openness goes to negotiate with reality. The format is a spec anyone can read. The catalog is the live service that decides what “current” means and who is allowed to ask. Pick it like you will be running it for five years, because you will be.

So yes. The lakehouse won, and it should have. Decoupled storage and compute, no proprietary vault, real engine choice. Every word of that is worth the move. Just know what you are buying. The format is a sprint. The maintenance, the schema discipline, the governance rebuild, and the catalog you will argue about for a year are what you are signing up to operate. That is not a warning against doing it. It is a warning against budgeting for the sprint and getting handed the year.

