This post is written in collaboration with Philipp Karg from BMW Group.
Companies increasingly require scalable, cost-efficient architectures to process and transform large datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.
This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.
Challenges: Starting from a rigid and costly setup
In our early phases, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant maintenance effort. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this approach slowed us down when changing data models or dealing with schema changes, requiring high development effort.
As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity: seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn't optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing the query results of computationally intensive transformations. The absence of robust testing and lineage features made it challenging to identify the root causes of data inconsistencies when they occurred.
As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some reaching the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingestion, our largest datasets required 4–5 hours of processing time because each run performed a full scan, sometimes reading over a terabyte of data. This architecture wasn't helping us be more agile and quick while scaling up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.
To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly improved our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.
Solution overview
Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns perfectly with our team's SQL expertise, excels at querying Parquet data directly in our data lake, and removes the need for dedicated compute resources. This makes Athena an ideal fit for CLEA's architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. Athena's ability to efficiently query large-scale Parquet data, combined with its serverless nature, lets us focus on writing efficient transformations rather than managing infrastructure.
The following diagram illustrates the solution architecture.
Using this architecture, we've streamlined our data transformation process with dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view, essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates, whether creating new models, updating schemas, or modifying views, triggered simply by opening a pull request in our source code repository, with the rest handled automatically.
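To make the building-block idea concrete, here is a minimal sketch of what a single dbt model looks like. The model, source, and column names are hypothetical illustrations, not CLEA's actual models:

```sql
-- models/prepared/prepared_cost_records.sql
-- A dbt model is one SELECT statement; dbt materializes it
-- in Athena according to the config block.
{{ config(materialized='table') }}

select
    account_id,
    service,
    cast(usage_date as date) as usage_date,
    sum(unblended_cost) as total_cost
from {{ source('raw', 'cost_records') }}
group by 1, 2, 3
```

Running `dbt run` compiles the Jinja, executes the resulting SQL in Athena, and registers the table in the AWS Glue Data Catalog.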
Our workflow automation includes the following features:
- Pull request – When we create a pull request, it's deployed to our testing environment first. After passing validation and being approved and merged, it's deployed to production using GitHub workflows. This setup enables seamless model creation, schema updates, or view modifications, triggered simply by creating a pull request, with the rest handled automatically.
- Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data). This flexibility optimizes processing time and resource usage. We can target only specific folders, like the source, prepared, or semantic layers, and run dbt test afterward to validate model quality.
- On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure that changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
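The incremental update strategy mentioned above can be sketched as follows, using dbt's standard `is_incremental()` macro and the dbt-athena adapter's `insert_overwrite` strategy (model and column names are hypothetical):

```sql
-- models/prepared/daily_costs.sql
-- On incremental runs, only dates newer than the latest
-- already-loaded date are scanned and (re)written.
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partitioned_by=['usage_date']
) }}

select account_id, service, cost, usage_date
from {{ ref('source_costs') }}
{% if is_incremental() %}
where usage_date > (select max(usage_date) from {{ this }})
{% endif %}
```

On a full refresh (`dbt run --full-refresh`) the `is_incremental()` branch is skipped and the table is rebuilt from scratch, which is how the same model also supports backfills.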
Our pipeline is organized into three major stages, Source, Prepared, and Semantic, each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.
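In dbt, these stages are chained through `ref()` calls, which also give dbt the lineage graph it uses to order and parallelize runs. A sketch with hypothetical model names:

```sql
-- models/semantic/monthly_account_costs.sql
-- Semantic-stage model: aggregates a Prepared-stage model into a
-- business-ready table. ref() resolves to the upstream Athena
-- table and records the dependency in dbt's graph.
{{ config(materialized='table') }}

select
    account_id,
    date_trunc('month', usage_date) as billing_month,
    sum(total_cost) as monthly_cost
from {{ ref('prepared_cost_records') }}
group by 1, 2
```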
The Semantic stage (not to be confused with dbt's semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW's Cloud Data Hub, internal CLEA dashboards, data APIs, or the In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only the necessary columns using a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
This approach aligns with BMW Group's broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.
Overall, at a high level, we've fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with strong version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.
Key benefits of the dbt-Athena architecture
To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.
SQL-based, developer-friendly environment
Our team already had strong SQL skills, so dbt's SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with Athena's SQL interface and, combined with dbt's added functionality, has increased our team's productivity.
Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model's materialization type, for example from a view to a table, it's as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.
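For example (a hypothetical model), switching a materialization is a one-line config change; on the next run dbt drops the view, creates the table in Athena, and updates the Glue Data Catalog entry:

```sql
-- models/semantic/team_costs.sql
-- Was: {{ config(materialized='view') }}
{{ config(materialized='table') }}

select * from {{ ref('prepared_costs') }}
```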
Agility in modeling and deployment
Documentation is crucial for any data platform's success. We use dbt's built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.
We use dbt's built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt's dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, once approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days instead of weeks and reduced ongoing maintenance work by catching issues early in the development cycle.
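Schema tests of the kind described above are declared in a YAML file next to the models (model and column names here are hypothetical); `dbt test` compiles each entry into a SQL query that fails if it returns any rows:

```yaml
# models/prepared/schema.yml -- hypothetical example
version: 2
models:
  - name: prepared_cost_records
    columns:
      - name: record_id
        tests:
          - unique
          - not_null
      - name: account_id
        tests:
          - not_null
          - relationships:        # referential integrity check
              to: ref('accounts')
              field: account_id
```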
Athena workgroup separation
We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize each one individually. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.
Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
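On the dbt side, routing runs to a dedicated workgroup is a profile setting in the dbt-athena adapter. A sketch with placeholder bucket names, schemas, and regions:

```yaml
# profiles.yml (dbt-athena adapter) -- all values are placeholders
clea:
  target: prod
  outputs:
    prod:
      type: athena
      region_name: eu-central-1
      s3_staging_dir: s3://example-athena-results/dbt/
      database: awsdatacatalog
      schema: analytics
      work_group: dbt        # scheduled transformation runs
    ci:
      type: athena
      region_name: eu-central-1
      s3_staging_dir: s3://example-athena-results/dbt-test/
      database: awsdatacatalog
      schema: analytics_test
      work_group: dbt-test   # pull request validation runs
```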
Using QuickSight SPICE
QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we've optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering so that only recent data is available in dashboards.
We've implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.
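Such a view might look like the following sketch (table and column names are hypothetical, and the 90-day sliding window is illustrative):

```sql
-- Hypothetical Athena view ingested into SPICE: only the columns
-- the dashboard needs, one metadata join, and a sliding time
-- window so only recent data reaches QuickSight.
create or replace view quicksight_monthly_costs as
select
    c.account_id,
    a.team_name,
    c.billing_month,
    c.monthly_cost
from monthly_account_costs c
join accounts a
    on a.account_id = c.account_id
where c.billing_month >= date_add('day', -90, current_date);
```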
Scalability and cost-efficiency
Athena's serverless architecture eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use Athena's distributed query execution capabilities through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using the Parquet file format, we minimize the amount of data scanned while maximizing query performance.
Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt's incremental models and partitioning strategy, we process only new or changed data instead of entire datasets. This approach has proven highly effective: we've observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.
The effectiveness of these optimizations, implemented at the end of 2023, is visible in the following diagram, which shows costs by Athena workgroup.
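In the dbt-athena adapter, the partitioning and file format choices described above are declared in the model config (a hypothetical model; Parquet is also the adapter's default format, stated explicitly here for clarity):

```sql
-- models/semantic/partitioned_costs.sql
-- Partitioned Parquet table: queries filtering on usage_date
-- prune partitions, so Athena scans only the matching files.
-- Note: partition columns must come last in the select list.
{{ config(
    materialized='table',
    format='parquet',
    partitioned_by=['usage_date']
) }}

select account_id, service, cost, usage_date
from {{ ref('prepared_costs') }}
```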

The workgroups are illustrated as follows:
- Green (QuickSight): Shows reduced data scanning post-optimization.
- Light blue (Ad-hoc): Varies based on analysis needs.
- Dark blue (dbt): Maintains consistent processing patterns.
- Orange (dbt-test): Shows regular, efficient test execution.
The increased dbt workload costs directly correlate with the decreased QuickSight costs, reflecting our architectural shift from using complex views in the QuickSight workgroup (which previously masked query complexity but led to repeated computations) to using dbt to materialize these transformations. Although this increased the dbt workload, overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully handle growing data volumes while achieving a net cost reduction through efficient data materialization patterns.
Conclusion
Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena's ability to query data directly in Amazon S3 removes the need to move or copy data into a separate data warehouse, and its serverless model together with dbt's incremental processing minimizes both operational overhead and processing costs. Given our team's strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt's automatic documentation and lineage, troubleshooting and identifying data issues is simplified, and the system's modularity allows for quick adjustments to meet evolving business needs.
Getting started with this architecture is quick and straightforward: all that's needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup because it's a fully serverless service with seamless Amazon S3 integration. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models, optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.
For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.
About the Authors
Philipp Karg is a Lead FinOps Engineer at BMW Group with a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to use the cloud sustainably.
Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.
Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud-based and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.
