This post was co-authored by Mike Araujo, Principal Engineer at Medidata Solutions.
The life sciences industry is transitioning from fragmented, standalone tools toward integrated, platform-based solutions. Medidata, a Dassault Systèmes company, is building a next-generation data platform that addresses the complex challenges of modern clinical research. In this post, we show you how Medidata created a unified, scalable, real-time data platform that serves thousands of clinical trials worldwide with AWS services, Apache Iceberg, and a modern lakehouse architecture.
Challenges with the legacy architecture
As the Medidata clinical data repository expanded, the team recognized the shortcomings of the legacy data solution for delivering quality data products to their customers across a growing portfolio of data offerings. Several core data tenets began to erode. The following diagram shows Medidata's legacy extract, transform, and load (ETL) architecture.
Built on a series of scheduled batch jobs, the legacy system proved ill-equipped to provide a unified view of the data across the entire ecosystem. Batch jobs ran at different intervals, often requiring a generous scheduling buffer to make sure upstream jobs completed within the expected window. As data volume expanded, the jobs and their schedules continued to inflate, introducing a latency window between ingestion and processing for dependent consumers. Different consumers operating from various underlying data services further magnified the problem, because pipelines had to be repeatedly built across a variety of data source stacks.
The expanding portfolio of pipelines began to overwhelm existing maintenance operations. With more operations, the opportunity for failure expanded and recovery efforts became more complicated. Existing observability systems were inundated with operational data, and identifying the root cause of data quality issues became a multi-day endeavor. Increases in data volume required scaling considerations across the entire data estate.
Furthermore, the proliferation of data pipelines and copies of the data in different technologies and storage systems necessitated expanding access controls with enhanced security features to make sure only the correct consumers had access to the subset of data they were permitted to see. Making sure access control changes were correctly propagated across all systems added an extra layer of complexity for consumers and producers alike.
Solution overview
With the advent of Clinical Data Studio (Medidata's unified data management and analytics solution for clinical trials) and Data Connect (Medidata's data solution for acquiring, transforming, and exchanging electronic health record (EHR) data across healthcare organizations), Medidata introduced a new world of data discovery, analysis, and integration to the life sciences industry, powered by open source technologies and hosted on AWS. The following diagram illustrates the solution architecture.

Fragmented batch ETL jobs were replaced by real-time streaming pipelines built on Apache Flink, an open source, distributed engine for stateful processing, running on Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service. The Flink jobs write to Apache Kafka running in Amazon Managed Streaming for Apache Kafka (Amazon MSK), a streaming data service that manages Kafka infrastructure and operations, before landing in Iceberg tables backed by the AWS Glue Data Catalog, a centralized metadata repository for data assets. From this collection of Iceberg tables, a central, single source of data is now accessible to a variety of consumers without additional downstream processing, alleviating the need for custom pipelines to satisfy the requirements of downstream consumers. Through these fundamental architectural changes, the team at Medidata solved the issues presented by the legacy solution.
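To make the pattern concrete, the following Flink SQL sketch shows how a Kafka topic on Amazon MSK can be declared as a source and continuously written into a Glue-cataloged Iceberg table. All table names, topic names, columns, and placeholder values here are hypothetical illustrations of the pattern, not Medidata's actual job definitions.

```sql
-- Hypothetical Kafka source table backed by an MSK topic
CREATE TABLE clinical_events_source (
  event_id   STRING,
  trial_id   STRING,
  payload    STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clinical-events',
  'properties.bootstrap.servers' = '<msk-bootstrap-brokers>',
  'format' = 'json'
);

-- Iceberg catalog backed by the AWS Glue Data Catalog
CREATE CATALOG glue_catalog WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
  'warehouse' = 's3://<bucket>/warehouse',
  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO'
);

-- Continuously land the stream in an Iceberg table
INSERT INTO glue_catalog.clinical.events
SELECT event_id, trial_id, payload, event_time
FROM clinical_events_source;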
Data availability and consistency
With the introduction of the Flink jobs and Iceberg tables, the team was able to deliver a consistent view of their data across the Medidata data experience. Pipeline latency was reduced from days to minutes, helping Medidata customers realize a 99% performance gain from the data ingestion layer to the data analytics layer. Because of Iceberg's interoperability, Medidata consumers saw the same view of the data regardless of where they accessed it, minimizing the need for consumer-driven custom pipelines because Iceberg could plug into existing consumers directly.
Maintenance and durability
Iceberg's interoperability provided a single copy of the data to satisfy their use cases, so the Medidata team could focus its observability and maintenance efforts on a five-times smaller set of operations than previously required. Observability was enhanced by tapping into the various metadata components and metrics exposed by Iceberg and the Data Catalog. Quality management transformed from cross-system traces and queries into a single analysis of unified pipelines, with the added benefit of point-in-time data queries thanks to Iceberg's snapshot feature. Data volume increases are handled with out-of-box scaling supported by the entire infrastructure stack and by AWS Glue Iceberg optimization features, including compaction, snapshot retention, and orphan file deletion, which provide a set-and-forget experience for resolving common Iceberg frustrations such as the small file problem, orphan file accumulation, and query performance.
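The point-in-time queries mentioned above come from Iceberg's snapshot mechanism: every commit produces a snapshot that engines can query directly. A minimal sketch in Spark SQL, using the same hypothetical catalog and table names as before:

```sql
-- Query the table as it existed at a past timestamp
SELECT * FROM glue_catalog.clinical.events
TIMESTAMP AS OF '2025-01-01 00:00:00';

-- Or pin the query to a specific snapshot ID from the table metadata
SELECT * FROM glue_catalog.clinical.events
VERSION AS OF 1234567890;

-- Compaction (which AWS Glue optimization can run automatically)
-- can also be invoked manually as an Iceberg Spark procedure
CALL glue_catalog.system.rewrite_data_files(table => 'clinical.events');
```

Because snapshots are part of the table metadata rather than a separate system, any engine that reads the Iceberg table gets the same time-travel capability.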
Security
With Iceberg at the center of its solution architecture, the Medidata team no longer had to spend time building custom access control layers with enhanced security features at each data integration point. Iceberg on AWS centralizes the authorization layer using familiar systems such as AWS Identity and Access Management (IAM), providing a single, durable control for data access. The data also stays entirely within the Medidata virtual private cloud (VPC), further reducing the risk of unintended disclosures.
Conclusion
In this post, we demonstrated how a legacy universe of consumer-driven custom ETL pipelines can be replaced with a scalable, high-performance streaming lakehouse. By putting Iceberg on AWS at the center of your data operations, you can provide a single source of data for your consumers.
To learn more about Iceberg on AWS, refer to Optimizing Iceberg tables and Using Apache Iceberg on AWS.
About the authors
