Organizations typically battle with constructing scalable and maintainable information lakes—particularly when dealing with advanced information transformations, imposing information high quality, and monitoring compliance with established governance. Conventional approaches sometimes contain customized scripts and disparate instruments, which may improve operational overhead and complicate entry management. A scalable, built-in method is required to simplify these processes, enhance information reliability, and help enterprise-grade governance.
Apache Airflow has emerged as a strong resolution for orchestrating advanced information pipelines within the cloud. Amazon Managed Workflows for Apache Airflow (MWAA) extends this functionality by offering a totally managed service that eliminates infrastructure administration overhead. This service permits groups to deal with constructing and scaling their information workflows whereas AWS handles the underlying infrastructure, safety, and upkeep necessities.
dbt enhances information transformation workflows by bringing software program engineering greatest practices to analytics. It permits analytics engineers to remodel warehouse information utilizing acquainted SQL choose statements whereas offering important options like model management, testing, and documentation. As a part of the ELT (Extract, Load, Remodel) course of, dbt handles the transformation section, working immediately inside an information warehouse to allow environment friendly and dependable information processing. This method permits groups to keep up a single supply of reality for metrics and enterprise definitions whereas enabling information high quality by means of built-in testing capabilities.
On this put up, we present easy methods to construct a ruled information lake that makes use of trendy information instruments and AWS companies.
Resolution overview
We discover a complete resolution that features:
- A metadata-driven framework in MWAA that dynamically generates directed acyclic graphs (DAGs), considerably enhancing pipeline scalability and lowering upkeep overhead.
- dbt with Amazon Athena adapter to implement modular, SQL-based information transformations immediately on an information lake, enabling well-structured, and totally examined transformations.
- An automatic framework that proactively identifies and segregates problematic data, sustaining the integrity of knowledge property.
- AWS Lake Formation to implement fine-grained entry controls for Athena tables, making certain correct information governance and safety all through an information lake surroundings.
Collectively, these elements create a sturdy, maintainable, and safe information administration resolution appropriate for enterprise-scale deployments.
The next structure illustrates the elements of the answer.
The workflow incorporates the next steps:
- A number of information sources (PostgreSQL, MySQL, SFTP) push information to an Amazon S3 uncooked bucket
- S3 occasion triggers AWS Lambda Operate
- Lambda perform triggers the MWAA DAG to transform file codecs to parquet
- Information is saved in Amazon S3 formatted bucket below formatted_stg prefix
- Crawler crawls the info in formatted_stg prefix within the formatted bucket and creates catalog tables
- dbt utilizing Athena adapter processes the info and places the processed information after information high quality checks below formatted prefix in Formatted bucket
- dbt utilizing Athena adapter can carry out additional transformations on the formatted information and put the remodeled information in Revealed bucket
Conditions
To implement this resolution, the next conditions have to be met.
- An AWS account with create/function entry for the next AWS companies:
Deploy the answer
For this resolution, we offer an AWS CloudFormation (CFN) template that units up the companies included within the structure, to allow repeatable deployments.
Observe:
- US-EAST-1 Area is required for the deployment.
- Deploying this resolution will contain prices related to AWS companies.
To deploy the answer, full the next steps:
- Earlier than deploying the stack, open the AWS Lake Formation console. Add your console position as a Information Lake Administrator and select Verify to avoid wasting the adjustments.

- Obtain the CloudFormation template.
After the file is downloaded to the native machine, comply with the steps beneath to deploy the stack utilizing this template:- Open the AWS CloudFormation Console.
- Select Create stack and select With new sources (commonplace).
- Beneath Specify template, choose Add a template file.
- Choose Select file and add the CFN template that was downloaded earlier.
- Select Subsequent to proceed.

- Enter a stack identify (for instance, bdb4834-data-lake-blog-stack) and configure the parameters (bdb4834-MWAAClusterName could be left because the default worth and replace SNSEmailEndpoints along with your e-mail deal with), then select Subsequent.

- Choose “I acknowledge that AWS CloudFormation would possibly create IAM sources with customized names” and select Subsequent

- Evaluate all of the configuration particulars on the subsequent web page, then select Submit.

- Anticipate the stack creation to finish within the AWS CloudFormation console. The method sometimes takes roughly 35 to 40 minutes to provision all required sources.
The next desk exhibits sources obtainable within the AWS Account after CloudFormation template deployment is efficiently accomplished:
Useful resource Kind Description Instance Useful resource Identify S3 Buckets For storing uncooked, processed information and property bdb4834-mwaa-bucket- – ,bdb4834-raw-bucket- – ,bdb4834-formatted-bucket- – ,bdb4834-published-bucket- – IAM Position Position assumed by MWAA for permissions bdb4834-mwaa-role MWAA Surroundings Managed Airflow surroundings for orchestration bdb4834-MyMWAACluster VPC Community setup required by MWAA bdb4834-MyVPC Glue Catalog Databases Logical grouping of metadata for tables bdb4834_formatted_stg,bdb4834_formatted_exception, bdb4834_formatted, bdb4834_published Glue Crawlers Robotically catalog metadata from S3 bdb4834-formatted-stg-crawler Lambda Lambda to Set off MWAA DAG on file arrival and to setup Lake Formation Permissions bdb4834_mwaa_trigger_process_s3_files,bdb4834-lf-tags-automation Lake Formation Setup Centralized governance and permissions LF-Setup for the above Sources Airflow DAGs Airflow DAGs are saved within the S3 bucket named mwaa-bucket- – below the dags/ prefix. These DAGs are liable for triggering information pipelines primarily based on both file arrival occasions or scheduled intervals. The precise performance of every DAG is defined within the following sections. blog-test-data-processingcrawler-daily-runcreate-audit-tableprocess_raw_to_formatted_stage - When the stack is full carry out the beneath steps:
- Open the Amazon Managed Workflows for Apache Airflow (MWAA) console, select on Open Airflow UI
- Within the DAGs console, find the next DAGs and unpause them by unchecking the toggle change (radio button) subsequent to every DAG.

Add pattern information to uncooked S3 bucket and create catalog tables
On this part, we add pattern information to uncooked S3 bucket (bucket identify beginning with bdb4834-raw-bucket) and convert the file codecs to parquet and run AWS Glue crawler to create catalog tables which might be utilized by dbt within the ELT Course of. Glue Crawler mechanically scans the info in S3 and creates or updates tables within the Glue Information Catalog, making the info queryable and accessible for transformation.
- Obtain the pattern information.
- Zip folder incorporates two pattern information information, playing cards.json and prospects.json
Schema for playing cards.jsonDiscipline Information Kind Description cust_id String Distinctive buyer identifier cc_number String Bank card quantity cc_expiry_date String Bank card expiry date Schema for patrons.json
Discipline Information Kind Description cust_id String Distinctive buyer identifier fname String First identify lname String Final identify gender String Gender deal with String Full deal with dob String Date of beginning (YYYY/MM/DD) cellphone String Cellphone quantity e-mail String E-mail deal with - Open S3 console, select Common objective buckets within the navigation pane.
- Find the S3 bucket with a reputation beginning with bdb4834-raw-bucket. This bucket is created by the CloudFormation stack and will also be discovered below the stack’s Sources tab within the CloudFormation console.

- Select the bucket identify to open it, and comply with these steps to create the required prefix:
- Select Create folder.
- Enter the folder identify as mwaa/weblog/partition_dt=YYYY-MM-DD/, changing YYYY-MM-DD with the precise date for use for the partition.
- Select Create folder to substantiate.
- Add the pattern information information from the placement to the s3 uncooked bucket prefix.

- As quickly because the information are uploaded, the on_put object occasion on the uncooked bucket invokes
thebdb4834_mwaa_trigger_process_s3_fileslambda which triggers the process_raw_to_formatted_stg MWAA DAG.- Within the Airflow UI, select the process_raw_to_formatted_stg DAG to view execution standing. This DAG converts the file codecs to parquet and sometimes completes inside a number of seconds.

- (Non-obligatory) To examine the Lambda execution particulars:
- On the AWS Lambda Console, select Features within the navigation pane.
- Choose the perform named bdb4834_mwaa_trigger_process_s3_files.

- Within the Airflow UI, select the process_raw_to_formatted_stg DAG to view execution standing. This DAG converts the file codecs to parquet and sometimes completes inside a number of seconds.
- Validate the parquet information are created in formatted bucket (bucket identify beginning with
bdb4834-formatted) below the respective information object prefix.

- Earlier than continuing additional, re-upload the Lake Formation metadata file in MWAA bucket.
- Open the S3 console, select Common objective buckets within the navigation pane.
- Seek for the bucket beginning with bdb4834-mwaa-bucket

- Select the bucket identify and go to the lakeformation prefix. Obtain the file named
lf_tags_metadata.json. Now, re-upload the identical file to the identical location.
Observe: This re-upload is important as a result of the Lambda perform is configured to set off on file arrival. When the sources had been initially created by the CloudFormation stack, the information had been merely moved to S3 and didn’t set off the Lambda. Re-uploading the file ensures the Lambda perform is executed as supposed. - As quickly because the file is uploaded, the on_put object occasion on the MWAA bucket invokes the lf_tags_automation lambda, which creates the Lake Formation (LF) tags as outlined within the metadata file and grants entry to the desired AWS Identification and Entry Administration (IAM) roles for learn/write.
- Validate that the LF-Tags have been created by visiting the Lake Formation Console. Within the left navigation pane, select Permissions, after which choose LF-Tags and permissions.

- Now, run the crawler DAG to create/replace the catalog tables:
crawler-daily-run- Within the Airflow UI choose the
crawler-daily-runDAG and select Set off DAG to execute it. - This DAG is configured to set off Glue Crawler which crawls the
formatted_stgprefix below thebdb4834-formatteds3 bucket to create catalog tables as per the prefixes obtainable below theformatted_stgprefix.
- Monitor the execution of the crawler-daily-run DAG till it completes, which usually takes 2 to three minutes. The crawler run standing could be verified within the AWS Glue Console by following these steps:
- Open the AWS Glue Console.
- Within the left navigation pane, select Crawlers.
- Seek for the crawler named
bdb4834-formatted-stg-crawler. - Verify the Final run standing column to substantiate the crawler executed efficiently.
- Select the crawler identify to view extra run particulars and logs if wanted.

- As soon as the crawler has accomplished efficiently, within the left-hand panel, select Databases and choose the bdb4834_formatted_stg database to view the created tables, which ought to seem as displaying within the following picture. Optionally, choose the desk’s identify to view its schema, after which choose Desk information to open Athena for information evaluation. (An error might seem when querying information utilizing Athena as a result of Lake Formation permissions. Evaluate the Governance utilizing Lake Formation part on this put up to resolve the difficulty.)
- Within the Airflow UI choose the
Observe: If that is the primary time Athena is getting used, a question consequence location should be configured by specifying an S3 bucket. Observe the directions within the AWS Athena documentation to arrange the S3 staging bucket for storing question outcomes.

Run mannequin by means of DAG in MWAA
On this part, we cowl how dbt fashions run in MWAA utilizing Athena adapter to create Glue-catalogued tables and the way auditing is completed for every run.
- After creating the tables within the Glue database utilizing the AWS Glue Crawler within the earlier steps, we are able to now proceed to run the dbt fashions in MWAA. These fashions are saved in S3 within the type of SQL information, positioned on the S3 prefix: bdb4834-mwaa-bucket-
-us-east-1/dags/dbt/fashions/
The next are the dbt fashions and their performance:mwaa_blog_cards_exception.sqlThis mannequin reads information from themwaa_blog_cardsdesk within thebdb4834_formatted_stgdatabase and writes data with information high quality points to themwaa_blog_cards_exceptiondesk within thebdb4834_formatted_exceptiondatabase.mwaa_blog_customers_exception.sqlThis mannequin reads information from themwaa_blog_customersdesk within thebdb4834_formatted_stgdatabase and writes data with information high quality points to themwaa_blog_customers_exceptiondesk within thebdb4834_formatted_exceptiondatabase.mwaa_blog_cards.sqlThis mannequin reads information from themwaa_blog_cardsdesk within thebdb4834_formatted_stgdatabase and hundreds it into themwaa_blog_cardsdesk within thebdb4834_formatteddatabase. If the goal desk doesn’t exist, dbt mechanically creates it.mwaa_blog_customers.sqlThis mannequin reads information from themwaa_blog_customersdesk within thebdb4834_formatted_stgdatabase and hundreds it into themwaa_blog_customersdesk within thebdb4834_formatteddatabase. If the goal desk doesn’t exist, dbt mechanically creates it.
- The
mwaa_blog_cards.sqlmannequin processes bank card information and is dependent upon themwaa_blog_customers.sqlmannequin to finish efficiently earlier than it runs. This dependency is important as a result of sure information high quality checks—reminiscent of referential integrity validations between buyer and card data—should be carried out beforehand.- These relationships and checks are outlined within the
schema.ymlfile positioned in the identical S3 path:bdb4834-mwaa-bucket-. The schema.yml file offers metadata for dbt fashions, together with mannequin dependencies, column definitions, and information high quality assessments. It makes use of macros like-us-east-1/dags/dbt/fashions/ get_dq_macro.sqlanddq_referentialcheck.sql(discovered below the macros/ listing) to implement these validations.
Because of this, dbt mechanically generates a lineage graph primarily based on the declared dependencies. This visible graph helps orchestrate mannequin execution order—making certain fashions like
mwaa_blog_customers.sqlrun earlier than dependent fashions reminiscent ofmwaa_blog_cards.sql, and identifies which fashions can execute in parallel to optimize the pipeline. - These relationships and checks are outlined within the
- As a pre-step earlier than operating fashions, select the set off DAG button for create-audit-table to create audit desk for storing run particulars for every mannequin.

- Set off the
blog-test-data-processingDAG within the Airflow UI to start out the Mannequin run. - Select
blog-test-data-processingto see the execution standing. This DAG runs the fashions so as and creates Glue catalogued iceberg tables. The move diagram of a DAG from Airflow UI could be discovered by selecting Graph after selecting DAG.

- The exception fashions places the failed data below exception prefix in S3:
Information that failed are present in an added column, tests_failed, the place all the info high quality checks that failed for that exact row are added, separated by a pipe (‘|’). (For the
mwaa_blog_customers_exceptiontwo exception data are discovered within the desk.) - The handed data are put below formatted prefix in S3.
- For every run, a run audit is captured within the audit desk with execution particulars like model_nm, process_nm, execution_start_date, execution_end_date, execution_status, execution_failure_reason, rows_affected.
Discover the info in S3 below the prefixbdb4834-formatted-bucket-- /audit_control/ - Monitor the execution till the DAG completes, which may take as much as 2-3 minutes. The execution standing of the DAG could be seen within the left panel after opening the DAG.

- As soon as the DAG has accomplished efficiently, open the AWS Glue console and choose Databases. Choose the
bdb4834_formatteddatabase, which ought to create three tables, as proven within the following picture.
Optionally, select Desk information to entry Athena for information evaluation.

- Select
bdb4834_formatted_exceptiondatabase from below Databases in AWS Glue console, which ought to create two tables as proven within the following picture.
- Every mannequin is assigned LF tags by means of the config block of mannequin itself. Due to this fact, when the iceberg tables are created by means of dbt, LF tags are connected to the tables after the run completes.
Validate the LF tags connected to the tables by visiting the AWS Lake Formation console. Within the left navigation pane, select Tables and search for
mwaa_blog_customersormwaa_blog_cardsdesk belowbdb4834_formatteddatabase. Choose any desk among the many two and below Actions, select Edit LF tags and the tags are connected, as proven within the following display shot.


- Equally, for the
bdb4834_formatted_exceptiondatabase, choose any one of many exception tables below thebdb4834_formatted_exceptiondatabase and the LF tags are connected.

- Run SQL queries on the tables created by opening the Athena console and operating Analytical queries on the tables created above.Pattern SQL queries:


- The exception fashions places the failed data below exception prefix in S3:
Governance utilizing Lake Formation
On this part, we present how assigning Lake Formation permissions and creating LF tags is automated utilizing the metadata file.Beneath is a metadata file construction, which is required for reference when importing the metadata file for Lake Formation in Airflow S3 bucket, contained in the Lake Formation prefix.
Parts of the metadata file
role_arn: The IAM position that the Lambda perform assumes to carry out operations.access_type: Specifies whether or not the motion is to grant or revoke permissions (GRANT, REVOKE).lf_tags: Tags used for tag-based entry management (TBAC) in Lake Formation.named_data_catalog: An inventory of databases and tables on which Lake Formation permissions or tags are utilized to.table_permissions: Lake Formation-specific permissions (e.g., SELECT, DESCRIBE, ALTER, and so on.).
Lambda perform bdb4834-lf-tags-automation parses this JSON and grants the required LF tags to the position with given desk permissions.
- To replace the metadata file, obtain it from the MWAA bucket (lakeformation prefix)
- Add a JSON object with the metadata construction outlined above, mentioning the IAM position ARN and the tags and tables to which entry must be granted.
Instance:Let’s assume beneath is how the metadata file initially seems to be like:Beneath is the json object that must be added within the above metadata file:
So now, the ultimate metadata file ought to appear like:
- Upon importing this file on the identical location (
bdb4834-mwaa-bucket-<) in S3, the>-< >/lakeformation/ lf_tags_automationlambda is triggered to create LF tags in the event that they don’t exist after which it assigns these tags to the IAM position ARN and likewise grants permission to the IAM position ARN utilizingnamed_data_catalogas outlined.To confirm the permissions, go to the Lake Formation console and select Tables below Information Catalog and seek for the desk identify.

To examine LF-Tags, select the desk identify and below the LF tags part, all of the tags are discovered connected to this desk.

This metadata file used as a structured enter to an AWS Lambda perform automates the next to carry out automated, constant, and scalable information entry governance throughout the AWS Lake Formation environments:
- Granting AWS Lake Formation (LF) permissions on Glue Information Catalog sources (like databases and tables).
- Creating Lake Formation Tags and Making use of Lake Formation tags (LF-Tags) for tag-based entry management (TBAC).
Discover extra on dbt
Now that the deployment features a bdb4834-printed S3 bucket and a printed Catalog database, sturdy dbt fashions could be constructed for information transformation and curation.
Right here’s easy methods to implement an entire dbt workflow:
- Begin by creating fashions that comply with this sample:
- Learn from the formatted tables within the staging space
- Apply enterprise logic, joins, and aggregations
- Write clear, analysis-ready information to the printed schema
- Tagging for automation: Use constant dbt tags to allow automated DAG technology. These tags set off MWAA orchestration to mechanically embrace new fashions within the execution pipeline.
- Including new fashions: When working with new datasets, consult with current fashions for steerage. Apply acceptable LF tags for information entry management. The brand new LF tags can even now be used for permissions.
- Allow DAG execution: For brand spanking new datasets, replace the MWAA metadata file to incorporate a new JSON entry. This step is important to generate a DAG that executes the brand new dbt fashions.
This method ensures the dbt implementation scales systematically whereas sustaining automated orchestration and correct information governance.
Clear up
1. Open the S3 console and delete all objects from beneath buckets:
- bdb4834-raw-bucket-
– - bdb4834-formatted -bucket-
– - bdb4834-mwaa-bucket-
– - bdb4834-published-bucket-
–
To delete all objects, select the bucket identify, choose all objects and select Delete.
After that, kind ‘completely delete’ within the textual content field and select Delete Objects.
Do that for all three buckets talked about above.


2. Go to the AWS Cloudformation console, select you’re the stack identify and choose Delete. It might take roughly 40 minutes for the deletion to finish.
Suggestions
When utilizing dbt with MWAA, some typical challenges embrace employee useful resource exhaustion, dependency administration points, and in some uncommon circumstances, points like DAGs disappearing and re-appearing when there are a lot of dynamic DAGs being created from a single python script.
To mitigate these points, comply with these greatest practices:
1. Scale the MWAA surroundings appropriately by upgrading the surroundings class as required.
2. Use customized necessities.txt and correct dbt adapter configuration to make sure constant environments.
3. Set airflow configuration parameters to tune the efficiency of MWAA.
Conclusion
On this put up, we explored the end-to-end setup of a ruled information lake utilizing MWAA and dbt which improved information high quality, safety, and compliance, main to raised decision-making and elevated operational effectivity. We additionally lined easy methods to construct customized dbt frameworks for auditing and information high quality, automate Lake Formation entry management, and dynamically generate MWAA DAGs primarily based on dbt tags. These capabilities allow a scalable, safe, and automatic information lake structure, streamlining information governance and orchestration.
For additional exploring, consult with From information lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud
In regards to the authors
