Constructing scalable AWS Lake Formation ruled information lakes with dbt and Amazon Managed Workflows for Apache Airflow

January 12, 2026

53

Organizations typically battle with constructing scalable and maintainable information lakes—particularly when dealing with advanced information transformations, imposing information high quality, and monitoring compliance with established governance. Conventional approaches sometimes contain customized scripts and disparate instruments, which may improve operational overhead and complicate entry management. A scalable, built-in method is required to simplify these processes, enhance information reliability, and help enterprise-grade governance.

Apache Airflow has emerged as a strong resolution for orchestrating advanced information pipelines within the cloud. Amazon Managed Workflows for Apache Airflow (MWAA) extends this functionality by offering a totally managed service that eliminates infrastructure administration overhead. This service permits groups to deal with constructing and scaling their information workflows whereas AWS handles the underlying infrastructure, safety, and upkeep necessities.

dbt enhances information transformation workflows by bringing software program engineering greatest practices to analytics. It permits analytics engineers to remodel warehouse information utilizing acquainted SQL choose statements whereas offering important options like model management, testing, and documentation. As a part of the ELT (Extract, Load, Remodel) course of, dbt handles the transformation section, working immediately inside an information warehouse to allow environment friendly and dependable information processing. This method permits groups to keep up a single supply of reality for metrics and enterprise definitions whereas enabling information high quality by means of built-in testing capabilities.

On this put up, we present easy methods to construct a ruled information lake that makes use of trendy information instruments and AWS companies.

Resolution overview

We discover a complete resolution that features:

A metadata-driven framework in MWAA that dynamically generates directed acyclic graphs (DAGs), considerably enhancing pipeline scalability and lowering upkeep overhead.
dbt with Amazon Athena adapter to implement modular, SQL-based information transformations immediately on an information lake, enabling well-structured, and totally examined transformations.
An automatic framework that proactively identifies and segregates problematic data, sustaining the integrity of knowledge property.
AWS Lake Formation to implement fine-grained entry controls for Athena tables, making certain correct information governance and safety all through an information lake surroundings.

Collectively, these elements create a sturdy, maintainable, and safe information administration resolution appropriate for enterprise-scale deployments.

The next structure illustrates the elements of the answer.

The workflow incorporates the next steps:

A number of information sources (PostgreSQL, MySQL, SFTP) push information to an Amazon S3 uncooked bucket
S3 occasion triggers AWS Lambda Operate
Lambda perform triggers the MWAA DAG to transform file codecs to parquet
Information is saved in Amazon S3 formatted bucket below formatted_stg prefix
Crawler crawls the info in formatted_stg prefix within the formatted bucket and creates catalog tables
dbt utilizing Athena adapter processes the info and places the processed information after information high quality checks below formatted prefix in Formatted bucket
dbt utilizing Athena adapter can carry out additional transformations on the formatted information and put the remodeled information in Revealed bucket

Conditions

To implement this resolution, the next conditions have to be met.

An AWS account with create/function entry for the next AWS companies:

Deploy the answer

For this resolution, we offer an AWS CloudFormation (CFN) template that units up the companies included within the structure, to allow repeatable deployments.

Observe:

US-EAST-1 Area is required for the deployment.
Deploying this resolution will contain prices related to AWS companies.

To deploy the answer, full the next steps:

Earlier than deploying the stack, open the AWS Lake Formation console. Add your console position as a Information Lake Administrator and select Verify to avoid wasting the adjustments.
Obtain the CloudFormation template.

After the file is downloaded to the native machine, comply with the steps beneath to deploy the stack utilizing this template:
1. Open the AWS CloudFormation Console.
2. Select Create stack and select With new sources (commonplace).
3. Beneath Specify template, choose Add a template file.
4. Choose Select file and add the CFN template that was downloaded earlier.
5. Select Subsequent to proceed.
Enter a stack identify (for instance, bdb4834-data-lake-blog-stack) and configure the parameters (bdb4834-MWAAClusterName could be left because the default worth and replace SNSEmailEndpoints along with your e-mail deal with), then select Subsequent.
Choose “I acknowledge that AWS CloudFormation would possibly create IAM sources with customized names” and select Subsequent
Evaluate all of the configuration particulars on the subsequent web page, then select Submit.

Anticipate the stack creation to finish within the AWS CloudFormation console. The method sometimes takes roughly 35 to 40 minutes to provision all required sources.

The next desk exhibits sources obtainable within the AWS Account after CloudFormation template deployment is efficiently accomplished:

Useful resource Kind	Description	Instance Useful resource Identify
S3 Buckets	For storing uncooked, processed information and property	bdb4834-mwaa-bucket-–,bdb4834-raw-bucket-–,bdb4834-formatted-bucket-–,bdb4834-published-bucket-–
IAM Position	Position assumed by MWAA for permissions	bdb4834-mwaa-role
MWAA Surroundings	Managed Airflow surroundings for orchestration	bdb4834-MyMWAACluster
VPC	Community setup required by MWAA	bdb4834-MyVPC
Glue Catalog Databases	Logical grouping of metadata for tables	bdb4834_formatted_stg,bdb4834_formatted_exception, bdb4834_formatted, bdb4834_published
Glue Crawlers	Robotically catalog metadata from S3	bdb4834-formatted-stg-crawler
Lambda	Lambda to Set off MWAA DAG on file arrival and to setup Lake Formation Permissions	bdb4834_mwaa_trigger_process_s3_files,bdb4834-lf-tags-automation
Lake Formation Setup	Centralized governance and permissions	LF-Setup for the above Sources
Airflow DAGs	Airflow DAGs are saved within the S3 bucket named mwaa-bucket-– below the dags/ prefix. These DAGs are liable for triggering information pipelines primarily based on both file arrival occasions or scheduled intervals. The precise performance of every DAG is defined within the following sections.	blog-test-data-processingcrawler-daily-runcreate-audit-tableprocess_raw_to_formatted_stage

When the stack is full carry out the beneath steps:
1. Open the Amazon Managed Workflows for Apache Airflow (MWAA) console, select on Open Airflow UI
2. Within the DAGs console, find the next DAGs and unpause them by unchecking the toggle change (radio button) subsequent to every DAG.

Constructing scalable AWS Lake Formation ruled information lakes with dbt and Amazon Managed Workflows for Apache Airflow

Add pattern information to uncooked S3 bucket and create catalog tables

On this part, we add pattern information to uncooked S3 bucket (bucket identify beginning with bdb4834-raw-bucket) and convert the file codecs to parquet and run AWS Glue crawler to create catalog tables which might be utilized by dbt within the ELT Course of. Glue Crawler mechanically scans the info in S3 and creates or updates tables within the Glue Information Catalog, making the info queryable and accessible for transformation.

Obtain the pattern information.

Zip folder incorporates two pattern information information, playing cards.json and prospects.json

Schema for playing cards.json

Discipline	Information Kind	Description
cust_id	String	Distinctive buyer identifier
cc_number	String	Bank card quantity
cc_expiry_date	String	Bank card expiry date

Schema for patrons.json

Discipline	Information Kind	Description
cust_id	String	Distinctive buyer identifier
fname	String	First identify
lname	String	Final identify
gender	String	Gender
deal with	String	Full deal with
dob	String	Date of beginning (YYYY/MM/DD)
cellphone	String	Cellphone quantity
e-mail	String	E-mail deal with

Open S3 console, select Common objective buckets within the navigation pane.
Find the S3 bucket with a reputation beginning with bdb4834-raw-bucket. This bucket is created by the CloudFormation stack and will also be discovered below the stack’s Sources tab within the CloudFormation console.
Select the bucket identify to open it, and comply with these steps to create the required prefix:
1. Select Create folder.
2. Enter the folder identify as mwaa/weblog/partition_dt=YYYY-MM-DD/, changing YYYY-MM-DD with the precise date for use for the partition.
3. Select Create folder to substantiate.
Add the pattern information information from the placement to the s3 uncooked bucket prefix.
As quickly because the information are uploaded, the on_put object occasion on the uncooked bucket invokes thebdb4834_mwaa_trigger_process_s3_files lambda which triggers the process_raw_to_formatted_stg MWAA DAG.
1. Within the Airflow UI, select the process_raw_to_formatted_stg DAG to view execution standing. This DAG converts the file codecs to parquet and sometimes completes inside a number of seconds.
2. (Non-obligatory) To examine the Lambda execution particulars:
  1. On the AWS Lambda Console, select Features within the navigation pane.
  2. Choose the perform named bdb4834_mwaa_trigger_process_s3_files.
Validate the parquet information are created in formatted bucket (bucket identify beginning with bdb4834-formatted) below the respective information object prefix.
Earlier than continuing additional, re-upload the Lake Formation metadata file in MWAA bucket.
1. Open the S3 console, select Common objective buckets within the navigation pane.
2. Seek for the bucket beginning with bdb4834-mwaa-bucket
3. Select the bucket identify and go to the lakeformation prefix. Obtain the file named lf_tags_metadata.json. Now, re-upload the identical file to the identical location.
  
  Observe: This re-upload is important as a result of the Lambda perform is configured to set off on file arrival. When the sources had been initially created by the CloudFormation stack, the information had been merely moved to S3 and didn’t set off the Lambda. Re-uploading the file ensures the Lambda perform is executed as supposed.
4. As quickly because the file is uploaded, the on_put object occasion on the MWAA bucket invokes the lf_tags_automation lambda, which creates the Lake Formation (LF) tags as outlined within the metadata file and grants entry to the desired AWS Identification and Entry Administration (IAM) roles for learn/write.
5. Validate that the LF-Tags have been created by visiting the Lake Formation Console. Within the left navigation pane, select Permissions, after which choose LF-Tags and permissions.
Now, run the crawler DAG to create/replace the catalog tables: crawler-daily-run
1. Within the Airflow UI choose the crawler-daily-run DAG and select Set off DAG to execute it.
2. This DAG is configured to set off Glue Crawler which crawls the formatted_stg prefix below the bdb4834-formatted s3 bucket to create catalog tables as per the prefixes obtainable below the formatted_stg prefix.
```
bdb4834-formatted-bucket--/formatted_stg/
```
3. Monitor the execution of the crawler-daily-run DAG till it completes, which usually takes 2 to three minutes. The crawler run standing could be verified within the AWS Glue Console by following these steps:
  1. Open the AWS Glue Console.
  2. Within the left navigation pane, select Crawlers.
  3. Seek for the crawler named bdb4834-formatted-stg-crawler.
  4. Verify the Final run standing column to substantiate the crawler executed efficiently.
  5. Select the crawler identify to view extra run particulars and logs if wanted.
4. As soon as the crawler has accomplished efficiently, within the left-hand panel, select Databases and choose the bdb4834_formatted_stg database to view the created tables, which ought to seem as displaying within the following picture. Optionally, choose the desk’s identify to view its schema, after which choose Desk information to open Athena for information evaluation. (An error might seem when querying information utilizing Athena as a result of Lake Formation permissions. Evaluate the Governance utilizing Lake Formation part on this put up to resolve the difficulty.)

Observe: If that is the primary time Athena is getting used, a question consequence location should be configured by specifying an S3 bucket. Observe the directions within the AWS Athena documentation to arrange the S3 staging bucket for storing question outcomes.

Run mannequin by means of DAG in MWAA

On this part, we cowl how dbt fashions run in MWAA utilizing Athena adapter to create Glue-catalogued tables and the way auditing is completed for every run.

After creating the tables within the Glue database utilizing the AWS Glue Crawler within the earlier steps, we are able to now proceed to run the dbt fashions in MWAA. These fashions are saved in S3 within the type of SQL information, positioned on the S3 prefix: bdb4834-mwaa-bucket-
-us-east-1/dags/dbt/fashions/

The next are the dbt fashions and their performance:
- mwaa_blog_cards_exception.sql This mannequin reads information from the mwaa_blog_cards desk within the bdb4834_formatted_stg database and writes data with information high quality points to the mwaa_blog_cards_exception desk within the bdb4834_formatted_exception database.
- mwaa_blog_customers_exception.sql This mannequin reads information from the mwaa_blog_customers desk within the bdb4834_formatted_stg database and writes data with information high quality points to the mwaa_blog_customers_exception desk within the bdb4834_formatted_exception database.
- mwaa_blog_cards.sql This mannequin reads information from the mwaa_blog_cards desk within the bdb4834_formatted_stg database and hundreds it into the mwaa_blog_cards desk within the bdb4834_formatted database. If the goal desk doesn’t exist, dbt mechanically creates it.
- mwaa_blog_customers.sql This mannequin reads information from the mwaa_blog_customers desk within the bdb4834_formatted_stg database and hundreds it into the mwaa_blog_customers desk within the bdb4834_formatted database. If the goal desk doesn’t exist, dbt mechanically creates it.
The mwaa_blog_cards.sql mannequin processes bank card information and is dependent upon the mwaa_blog_customers.sql mannequin to finish efficiently earlier than it runs. This dependency is important as a result of sure information high quality checks—reminiscent of referential integrity validations between buyer and card data—should be carried out beforehand.
- These relationships and checks are outlined within the schema.yml file positioned in the identical S3 path: bdb4834-mwaa-bucket--us-east-1/dags/dbt/fashions/. The schema.yml file offers metadata for dbt fashions, together with mannequin dependencies, column definitions, and information high quality assessments. It makes use of macros like get_dq_macro.sql and dq_referentialcheck.sql (discovered below the macros/ listing) to implement these validations.
Because of this, dbt mechanically generates a lineage graph primarily based on the declared dependencies. This visible graph helps orchestrate mannequin execution order—making certain fashions like mwaa_blog_customers.sql run earlier than dependent fashions reminiscent of mwaa_blog_cards.sql, and identifies which fashions can execute in parallel to optimize the pipeline.
As a pre-step earlier than operating fashions, select the set off DAG button for create-audit-table to create audit desk for storing run particulars for every mannequin.
Set off the blog-test-data-processing DAG within the Airflow UI to start out the Mannequin run.
Select blog-test-data-processing to see the execution standing. This DAG runs the fashions so as and creates Glue catalogued iceberg tables. The move diagram of a DAG from Airflow UI could be discovered by selecting Graph after selecting DAG.
1. The exception fashions places the failed data below exception prefix in S3:
```
bdb4834-formatted-bucket--/formatted_exception/
```
  Information that failed are present in an added column, tests_failed, the place all the info high quality checks that failed for that exact row are added, separated by a pipe (‘|’). (For the mwaa_blog_customers_exception two exception data are discovered within the desk.)
2. The handed data are put below formatted prefix in S3.
```
bdb4834-formatted-bucket--/formatted/
```
3. For every run, a run audit is captured within the audit desk with execution particulars like model_nm, process_nm, execution_start_date, execution_end_date, execution_status, execution_failure_reason, rows_affected.
  
  Discover the info in S3 below the prefix bdb4834-formatted-bucket--/audit_control/
4. Monitor the execution till the DAG completes, which may take as much as 2-3 minutes. The execution standing of the DAG could be seen within the left panel after opening the DAG.
5. As soon as the DAG has accomplished efficiently, open the AWS Glue console and choose Databases. Choose the bdb4834_formatted database, which ought to create three tables, as proven within the following picture.
  
  Optionally, select Desk information to entry Athena for information evaluation.
6. Select bdb4834_formatted_exception database from below Databases in AWS Glue console, which ought to create two tables as proven within the following picture.
7. Every mannequin is assigned LF tags by means of the config block of mannequin itself. Due to this fact, when the iceberg tables are created by means of dbt, LF tags are connected to the tables after the run completes.
  Validate the LF tags connected to the tables by visiting the AWS Lake Formation console. Within the left navigation pane, select Tables and search for mwaa_blog_customers or mwaa_blog_cards desk below bdb4834_formatted database. Choose any desk among the many two and below Actions, select Edit LF tags and the tags are connected, as proven within the following display shot.
8. Equally, for the bdb4834_formatted_exception database, choose any one of many exception tables below the bdb4834_formatted_exception database and the LF tags are connected.
9. Run SQL queries on the tables created by opening the Athena console and operating Analytical queries on the tables created above.Pattern SQL queries:
```
SELECT * FROM bdb4834_formatted.mwaa_blog_cards;
Output: Whole 30 rows
```
```
SELECT * FROM bdb4834_formatted_exception.mwaa_blog_customers_exception;
Output: Whole 2 data
```

Governance utilizing Lake Formation

On this part, we present how assigning Lake Formation permissions and creating LF tags is automated utilizing the metadata file.Beneath is a metadata file construction, which is required for reference when importing the metadata file for Lake Formation in Airflow S3 bucket, contained in the Lake Formation prefix.

Metadata file structure-
{
    "role_arn": "<>",
    "access_type": "GRANT",
    "lf_tags": [
      {
        "TagKey": "<>",
        "TagValues": ["<>"]
      }
    ],
	  "named_data_catalog": [
      {
        "Database": "<>",
        "Table": ""<>"
      }
    ],
    "table_permissions": ["SELECT", "DESCRIBE"]
  }

Parts of the metadata file

role_arn: The IAM position that the Lambda perform assumes to carry out operations.
access_type: Specifies whether or not the motion is to grant or revoke permissions (GRANT, REVOKE).
lf_tags: Tags used for tag-based entry management (TBAC) in Lake Formation.
named_data_catalog: An inventory of databases and tables on which Lake Formation permissions or tags are utilized to.
table_permissions: Lake Formation-specific permissions (e.g., SELECT, DESCRIBE, ALTER, and so on.).

Lambda perform bdb4834-lf-tags-automation parses this JSON and grants the required LF tags to the position with given desk permissions.

To replace the metadata file, obtain it from the MWAA bucket (lakeformation prefix)
```
bdb4834-mwaa-bucket-<>-<>/lakeformation/lf_tags_metadata.json
```

Add a JSON object with the metadata construction outlined above, mentioning the IAM position ARN and the tags and tables to which entry must be granted.

Instance:Let’s assume beneath is how the metadata file initially seems to be like:


	[
	{
    "role_arn": "arn:aws:iam::XXX:role/aws-reserved/sso.amazonaws.com/XX ",
    "access_type": "GRANT",
    "lf_tags": [
      {
        "TagKey": " blog",
        "TagValues": ["bdb-4834"]
      }
    ],
    "named_data_catalog": [],
    "table_permissions": ["SELECT", "DESCRIBE"]
  }
]

Beneath is the json object that must be added within the above metadata file:


{
          "role_arn": "arn:aws:iam::XXX:position/aws-reserved/sso.amazonaws.com/XX ",
          "access_type": "GRANT",
          "lf_tags": [],
          "named_data_catalog": [
          {
            "Database": " bdb4834_formatted",
            "Table": "audit_control"
          },
          {
            "Database": " bdb4834_formatted_stg",
            "Table": "*"
          }
         ],
         "table_permissions": ["SELECT", "DESCRIBE"]}

So now, the ultimate metadata file ought to appear like:


[
  {
    "role_arn": "arn:aws:iam::XXX:role/aws-reserved/sso.amazonaws.com/XX ",
    "access_type": "GRANT",
    "lf_tags": [
      {
        "TagKey": "blog",
        "TagValues": ["bdb-4834"]
      }
    ],
    "named_data_catalog": [],
    "table_permissions": ["SELECT", "DESCRIBE"]
  },
  {
    "role_arn": "arn:aws:iam::XXX:position/aws-reserved/sso.amazonaws.com/XX ",
    "access_type": "GRANT",
    "lf_tags": [],
    "named_data_catalog": [
      {
        "Database": " bdb4834_formatted",
        "Table": "audit_control"
      },
      {
        "Database": " bdb4834_formatted_stg",
        "Table": "*"
      }
    ],
    "table_permissions": ["SELECT", "DESCRIBE"]
  }
]

Upon importing this file on the identical location (bdb4834-mwaa-bucket-<>-<>/lakeformation/) in S3, the lf_tags_automation lambda is triggered to create LF tags in the event that they don’t exist after which it assigns these tags to the IAM position ARN and likewise grants permission to the IAM position ARN utilizing named_data_catalog as outlined.
To confirm the permissions, go to the Lake Formation console and select Tables below Information Catalog and seek for the desk identify.

To examine LF-Tags, select the desk identify and below the LF tags part, all of the tags are discovered connected to this desk.

This metadata file used as a structured enter to an AWS Lambda perform automates the next to carry out automated, constant, and scalable information entry governance throughout the AWS Lake Formation environments:

Granting AWS Lake Formation (LF) permissions on Glue Information Catalog sources (like databases and tables).
Creating Lake Formation Tags and Making use of Lake Formation tags (LF-Tags) for tag-based entry management (TBAC).

Discover extra on dbt

Now that the deployment features a bdb4834-printed S3 bucket and a printed Catalog database, sturdy dbt fashions could be constructed for information transformation and curation.

Right here’s easy methods to implement an entire dbt workflow:

Begin by creating fashions that comply with this sample:
- Learn from the formatted tables within the staging space
- Apply enterprise logic, joins, and aggregations
- Write clear, analysis-ready information to the printed schema
Tagging for automation: Use constant dbt tags to allow automated DAG technology. These tags set off MWAA orchestration to mechanically embrace new fashions within the execution pipeline.
Including new fashions: When working with new datasets, consult with current fashions for steerage. Apply acceptable LF tags for information entry management. The brand new LF tags can even now be used for permissions.
Allow DAG execution: For brand spanking new datasets, replace the MWAA metadata file to incorporate a new JSON entry. This step is important to generate a DAG that executes the brand new dbt fashions.

This method ensures the dbt implementation scales systematically whereas sustaining automated orchestration and correct information governance.

Clear up

1. Open the S3 console and delete all objects from beneath buckets:

bdb4834-raw-bucket-–
bdb4834-formatted -bucket-–
bdb4834-mwaa-bucket-–
bdb4834-published-bucket-–

To delete all objects, select the bucket identify, choose all objects and select Delete.

After that, kind ‘completely delete’ within the textual content field and select Delete Objects.

Do that for all three buckets talked about above.

2. Go to the AWS Cloudformation console, select you’re the stack identify and choose Delete. It might take roughly 40 minutes for the deletion to finish.

Suggestions

When utilizing dbt with MWAA, some typical challenges embrace employee useful resource exhaustion, dependency administration points, and in some uncommon circumstances, points like DAGs disappearing and re-appearing when there are a lot of dynamic DAGs being created from a single python script.

To mitigate these points, comply with these greatest practices:

1. Scale the MWAA surroundings appropriately by upgrading the surroundings class as required.

2. Use customized necessities.txt and correct dbt adapter configuration to make sure constant environments.

3. Set airflow configuration parameters to tune the efficiency of MWAA.

Conclusion

On this put up, we explored the end-to-end setup of a ruled information lake utilizing MWAA and dbt which improved information high quality, safety, and compliance, main to raised decision-making and elevated operational effectivity. We additionally lined easy methods to construct customized dbt frameworks for auditing and information high quality, automate Lake Formation entry management, and dynamically generate MWAA DAGs primarily based on dbt tags. These capabilities allow a scalable, safe, and automatic information lake structure, streamlining information governance and orchestration.

For additional exploring, consult with From information lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

Constructing scalable AWS Lake Formation ruled information lakes with dbt and Amazon Managed Workflows for Apache Airflow

Resolution overview

Conditions

Deploy the answer

Add pattern information to uncooked S3 bucket and create catalog tables

Run mannequin by means of DAG in MWAA

Governance utilizing Lake Formation

Discover extra on dbt

Clear up

Suggestions

Conclusion

In regards to the authors

Related Articles

Grilling Isn’t Only for Meat

Givaudan boosts high-quality fragrances international growth with Eurofragance acquisition

Finest Natural Clothes Manufacturers for Youngsters

LEAVE A REPLY Cancel reply

Latest Articles

Grilling Isn’t Only for Meat

Givaudan boosts high-quality fragrances international growth with Eurofragance acquisition

Finest Natural Clothes Manufacturers for Youngsters

From Labs to Studying: How Fingers-On Observe Modified Our Assumptions About On-line Training

Question Amazon Redshift utilizing pure language with Kiro