No user files a ticket that says “our architecture has an abstraction problem.” They file tickets saying the data is wrong, or missing, or late. So engineering spends two weeks chasing a data-quality problem that doesn’t exist, fixes nothing, and the same ticket comes back the next quarter wearing a slightly different hat.
That was us. The most useful thing I learned from the whole effort is that the bug was never in the data. It was in what we were asking the data to be.
We had an on-premises MongoDB instance serving as the registered golden source for enterprise reference data. Codes, classifications, ID lookups: the unglamorous shared data that quietly underpins customer onboarding, regulatory reporting, and a dozen other things people only notice when they break. It was well-maintained, authoritative, the genuine single source of truth. The team that owned it was rightly proud of it. By every reasonable measure, the system was healthy.
And yet every time an analytics team or a downstream product group needed something from it, the experience was miserable. They reverse-engineered the operational schema. They wrote one-off queries against nested JSON they only half understood. They tracked down whoever still carried the institutional memory of the collection structure, waited, and then repeated the entire ritual three months later when the requirement shifted by an inch.
The diagnosis took longer than it should have
I watched this play out for months before it clicked. The data was fine. We were asking an operational store to moonlight as an analytical platform, and it was bad at the second job. Not through any flaw of its own. It was simply never built for that.
Operational stores optimise for correctness and lifecycle management. Analytics teams need something else entirely: stable shapes, fields that are actually documented, a refresh cadence you can predict, and a way to judge whether a dataset is fit for purpose without reverse-engineering someone else’s schema. These are not the same requirements, and conflating them is precisely how you end up with a system that’s technically perfect and practically useless. Healthy uptime, miserable users.
So we stopped asking people to consume reference data directly from MongoDB. We started treating each dataset as a data product: something with a named owner, a definition, quality gates, governed access, and a real path to publication. The technical pipeline (MongoDB through Kafka Connect into Landing, Bronze and Silver layers as Iceberg tables on S3, Athena on top, publication through the Data Marketplace) followed from that decision rather than driving it. Twenty-one reference data products eventually shipped down that single path.

Figure 1: The full pipeline. MongoDB as the authoritative golden source, events flowing through Kafka into Landing, Bronze and Silver layers as Iceberg tables on S3, Athena providing the query surface, and the enterprise Data Marketplace as the publication endpoint. Airflow orchestrates everything; DPPS UI provides operational visibility.
What “data product” actually forced us to decide
“Data product” is one of those phrases that can mean almost anything, which usually means it means nothing. So we made it mean something specific and non-negotiable: a dataset couldn’t be published until it had a named owner, a data dictionary, business and technical metadata, documented audit expectations, quality gates, and a governed route into the Marketplace. Compliance with all active standards at deployment time was mandatory, enforced at publication, not requested in a review meeting.
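One way to picture “enforced at publication” is a gate that refuses any product whose required attributes are empty. The sketch below is illustrative only: the article lists the requirements but not a schema, so `ProductSpec`, its field names, and `can_publish` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a product definition. The field names are assumptions
# that mirror the publication requirements listed in the text.
@dataclass
class ProductSpec:
    name: str
    owner: str = ""
    data_dictionary: dict = field(default_factory=dict)
    business_metadata: dict = field(default_factory=dict)
    technical_metadata: dict = field(default_factory=dict)
    audit_expectations: str = ""
    quality_gates: list = field(default_factory=list)

REQUIRED = (
    "owner",
    "data_dictionary",
    "business_metadata",
    "technical_metadata",
    "audit_expectations",
    "quality_gates",
)

def missing_requirements(spec):
    """Return the names of required attributes that are still empty."""
    return [name for name in REQUIRED if not getattr(spec, name)]

def can_publish(spec):
    """A product is publishable only when nothing required is missing."""
    return not missing_requirements(spec)

draft = ProductSpec(name="currency_codes", owner="reference-data-team")
complete = ProductSpec(
    name="currency_codes",
    owner="reference-data-team",
    data_dictionary={"code": "ISO 4217 alphabetic code"},
    business_metadata={"domain": "reference data"},
    technical_metadata={"format": "iceberg"},
    audit_expectations="every record carries provenance columns",
    quality_gates=["schema_validation", "record_count_check"],
)
```

Calling `missing_requirements(draft)` names exactly what still blocks publication, which is the kind of answer a review meeting rarely gives.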
That framing immediately surfaced questions that should have been answered years earlier. What’s the exact boundary of this product? Which attributes matter to consumers, and which are operational plumbing nobody outside the owning team cares about? What does “current” mean for this dataset, and how would a consumer know if it had gone stale? How does anyone discover it without filing a ticket and waiting for a human to point them at the right S3 path?
None of that was governance overhead bolted on for show. Answering those questions was the architecture. The Kafka connectors and Iceberg tables were almost the easy part by comparison.
The three decisions that shaped everything else
The first decision was to keep MongoDB as the golden source. No rip-and-replace. Authority stayed where it belonged, with the team that understood the data’s lifecycle and had maintained it correctly for years. The business requirement was explicit: no business-logic transformation, a one-to-one mapping from source to destination, faithful preservation rather than enrichment. The temptation to crown a shiny new system as the source of truth lurks in every modernisation project, and it’s almost always wrong. MongoDB did its job well. We were building a delivery layer, not replacing a foundation, and confusing the two is how good migrations turn into eighteen-month disasters.
The second was to build one delivery model instead of tolerating four. Before this work, at least four teams had independently extracted roughly the same reference data, each with its own refresh logic, its own reading of the field semantics, and its own private definition of “current.” The diplomatic word for that situation is “decentralised.” The honest word is chaos. Together, events flowing from MongoDB through Kafka Connect into the pipeline, Airflow orchestrating a monthly batch on the fifth at 07:00 UTC with no dependency on working days or holiday calendars, and schema validation firing before anything touched S3 replaced all four private empires with a single path anyone could reason about.
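The cadence described here, the fifth of every month at 07:00 UTC regardless of weekends or holidays, corresponds to the cron expression `0 7 5 * *`. The actual DAG definition is not shown in this article, so the snippet below is only a standard-library sketch of that rule; `next_run` is a hypothetical helper, not part of Airflow.

```python
from datetime import datetime, timezone

# Cron fields: minute hour day-of-month month day-of-week.
# The day-of-week field is "*", so weekends and holidays do not shift the run.
CRON_EXPRESSION = "0 7 5 * *"

def next_run(after):
    """Return the first scheduled run strictly after `after` (timezone-aware)."""
    after = after.astimezone(timezone.utc)
    # Candidate run in the same month as `after`.
    candidate = after.replace(day=5, hour=7, minute=0, second=0, microsecond=0)
    if candidate <= after:
        # This month's run has already happened, so roll to the next month.
        if candidate.month == 12:
            candidate = candidate.replace(year=candidate.year + 1, month=1)
        else:
            candidate = candidate.replace(month=candidate.month + 1)
    return candidate

run = next_run(datetime(2024, 10, 1, tzinfo=timezone.utc))
```

Here `run` falls on 5 October 2024, a Saturday, and the batch is still scheduled: the rule depends only on the calendar date, which is what makes the refresh cadence predictable for consumers.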
The cost of those four pipelines was never the compute or the storage, which was trivial. It was the reconciliation tax. Whenever two copies disagreed, and they did, someone senior and busy had to work out which one to believe. Multiply a half-day investigation by every quarter and every consuming team and you arrive at a genuinely expensive habit that never appeared on any budget line, because it was hidden inside everyone’s ordinary work. Collapsing four pipelines into one didn’t just simplify the diagram. It deleted an entire recurring class of argument.
The third was to treat publication as a real pipeline stage rather than an afterthought. Data that reached Silver got published into the Data Marketplace with metadata, a Kitemark quality score, documentation, and subscription behaviour already attached. Consumption happened only through the Marketplace subscription model, never by handing someone an S3 path. Consumers could find a product, judge whether it fit, and subscribe to it without needing to know which bucket to ask about or which Slack channel to beg in. Publication meant the product went live. It didn’t mean a file quietly appeared in storage and someone hoped the right people would notice.
The boring stuff turned out to be the hard stuff
I kept waiting for the hard problems to show up in the pipeline itself. Kafka connector configuration, Iceberg table maintenance, Athena partition tuning: all of it needed attention, and all of it got sorted eventually. But the gap between “a pipeline that works” and “a platform people trust” came from the things I used to wave off as housekeeping. Naming conventions. Audit column standards. Documentation templates someone would actually open. Ownership that was real rather than nominal.
Naming is a good example of how unglamorous and how decisive this gets. A consumer searching the catalogue has to find a dataset using enterprise-standard terminology, not the internal shorthand that made sense to the team that built it. Mapping the metadata framework to the enterprise standard is tedious work that shows up in no demo. It’s also the entire difference between a catalogue people can navigate and a list of cryptic table names only the authors understand.
Here is the uncomfortable part I didn’t appreciate going in: shared enterprise data tends to fail socially before it fails technically. The Kafka connector will be fine. What corrodes is the shared understanding of what “authoritative” means in practice, whether a given dataset is the real one or a copy somebody made eighteen months ago and forgot to deprecate. No amount of Iceberg optimisation touches that. You fix it at the layer where consumers decide whether to trust a dataset, which is the product layer, and nowhere else.
A concrete example of how social this gets. Early on, two teams disagreed about which currency-code dataset was correct. Both were internally consistent. Both had been “right” at some point. The difference came down to a refresh one team had quietly stopped running a year earlier, and neither team could prove which copy reflected the live source, because nothing in either dataset recorded where it came from or when. We didn’t fix that with a better connector. We fixed it by making provenance a first-class column. Every Silver record now carries SOURCE_SYSTEM, JOB_RUN_ID, VALID_FROM and VALID_TO, so the question “is this the real one, and is it current?” has a documented answer instead of a hallway debate.
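To make the provenance columns concrete, here is a minimal sketch of stamping a record and answering the “is it current?” question from the record alone. The four column names come from the text; everything else is an assumption made for illustration, including the helper names, the sample values, the half-open validity interval, and the convention that an empty VALID_TO means “still valid”.

```python
from datetime import datetime, timezone

def stamp(record, source_system, job_run_id, valid_from, valid_to=None):
    """Return a copy of `record` with the four provenance columns attached."""
    return {
        **record,
        "SOURCE_SYSTEM": source_system,
        "JOB_RUN_ID": job_run_id,
        "VALID_FROM": valid_from,
        "VALID_TO": valid_to,  # assumption: None marks an open-ended record
    }

def is_current(record, as_of):
    """True when VALID_FROM <= as_of < VALID_TO (assumed half-open interval)."""
    starts = record["VALID_FROM"]
    ends = record["VALID_TO"]
    return starts <= as_of and (ends is None or as_of < ends)

# Hypothetical sample values, not taken from the real pipeline.
live = stamp(
    {"code": "EUR", "name": "Euro"},
    source_system="mongodb-reference",
    job_run_id="run-2024-10-05",
    valid_from=datetime(2024, 10, 5, 7, 0, tzinfo=timezone.utc),
)
superseded = stamp(
    {"code": "EUR", "name": "Euro"},
    source_system="mongodb-reference",
    job_run_id="run-2023-10-05",
    valid_from=datetime(2023, 10, 5, 7, 0, tzinfo=timezone.utc),
    valid_to=datetime(2024, 10, 5, 7, 0, tzinfo=timezone.utc),
)
now = datetime(2024, 11, 1, tzinfo=timezone.utc)
```

With these columns in place, `is_current(live, now)` and `is_current(superseded, now)` give different answers for two otherwise identical rows, which is exactly the distinction the two teams could not prove.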
Storage is not the product
I’ve watched teams land data in S3, declare victory on self-service, and then spend six months baffled that nobody is using it. The answer is nearly always the same. “The data is in S3” is not a product. It’s a location. People need to know the data exists, work out what it means, judge whether it fits their purpose, and find out who to contact when something looks wrong. A path gives them none of that.
The Marketplace addressed this more than any individual pipeline component did. It turned a scattered set of S3 paths into a governed catalogue of subscribable products, each with documentation, a quality score, and clear ownership. That’s the difference between handing someone a warehouse address and handing them a shop. And because subscription is the only sanctioned route to the data, the catalogue stays the one front door rather than one option among several private back channels.
Separate truth, transport, and consumption
If I had five minutes with someone starting this work, I’d spend all of it on one idea. Separate truth, transport, and consumption, and treat them as three different concerns owned by three different parts of the system. MongoDB holds truth, and stays authoritative. The pipeline, Landing through Bronze to Silver, moves that truth reliably and proves it arrived intact with checksum reconciliation and inter-layer record-count checks. The product layer (Silver tables, Athena, and the Marketplace) makes truth consumable by people who do not know, and should never need to know, how MongoDB organises its collections.
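The “proves it arrived intact” step can be sketched as a comparison of record counts and content checksums between two adjacent layers. The article does not say how the real checks are implemented, so this is one plausible approach under stated assumptions: records are plain dictionaries, SHA-256 over canonical JSON is the checksum, and all function names are hypothetical.

```python
import hashlib
import json

def record_checksum(record):
    """Hash one record in a canonical form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def layer_fingerprint(records):
    """Return (record count, order-independent checksum) for one layer."""
    digests = sorted(record_checksum(r) for r in records)
    combined = hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
    return len(digests), combined

def reconcile(upstream, downstream):
    """Compare two layers; both flags must be True for a clean hand-off."""
    up_count, up_sum = layer_fingerprint(upstream)
    down_count, down_sum = layer_fingerprint(downstream)
    return {
        "upstream_count": up_count,
        "downstream_count": down_count,
        "counts_match": up_count == down_count,
        "checksums_match": up_sum == down_sum,
    }

# Illustrative rows only.
bronze = [{"code": "GBP", "name": "Pound sterling"}, {"code": "EUR", "name": "Euro"}]
silver = [{"name": "Euro", "code": "EUR"}, {"name": "Pound sterling", "code": "GBP"}]
clean = reconcile(bronze, silver)
dropped = reconcile(bronze, silver[:1])
altered = reconcile(bronze, [{"code": "EUR", "name": "Euro"}, {"code": "GBP", "name": "Pound"}])
```

The two checks catch different failures: a dropped record shows up in the count, while a silently altered value keeps the count intact and only the checksum disagrees. Because the requirement was a one-to-one mapping with no business-logic transformation, comparing content directly across layers is reasonable; provenance columns added at Silver would need to be excluded before hashing.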


Figure 2: The same data, three separated planes. Truth stays in the operational golden source; transport moves it and proves it arrived intact; consumption exposes it as governed, subscribable products. Separating the three concerns, each with its own owner, is what removes the friction between producers and consumers.
When those three are genuinely separate, an enormous amount of organisational friction simply evaporates. Producers stop getting dragged into ad-hoc reporting. Consumers stop reverse-engineering operational intent. The ops team can evolve the MongoDB schema without shattering six downstream jobs. And a new team that needs country codes or currency classifications can find them in the Marketplace, read the documentation, and be done in a day instead of a quarter.
The data was always fine. What we actually built was the boundary that let everyone stop arguing about it.
