If you manage data in Amazon Aurora MySQL-Compatible Edition and need to make it accessible for analytics, machine learning (ML), or cross-service querying in a modern lakehouse format, you're not alone.
Organizations often need to run analytics, build ML models, or join data across multiple sources. These workloads can be resource-intensive and impractical to run directly against a transactional database. By extracting your Aurora MySQL data into Amazon S3 Tables in Apache Iceberg format, you can offload analytical queries from your production database without impacting its performance, while storing data in a fully managed Iceberg table store optimized for analytics. Built on the open Apache Iceberg standard, Amazon Simple Storage Service (Amazon S3) Tables data is queryable from engines such as Amazon Athena, Amazon Redshift Spectrum, and Apache Spark without additional data copies. You can also combine relational data with other datasets already in your data lake, enabling richer cross-domain insights.
Apache Iceberg and Amazon S3 Tables
Apache Iceberg is a widely adopted open table format that provides Atomicity, Consistency, Isolation, Durability (ACID) transactions, schema evolution, and time travel capabilities. It allows multiple engines to work concurrently on the same dataset, making it a popular choice for building open lakehouse architectures.
Amazon S3 Tables is a purpose-built, fully managed Apache Iceberg table store designed for analytics workloads. It delivers up to 3x faster query performance and up to 10x more transactions per second compared to self-managed Iceberg tables. It also automatically compacts data and removes unreferenced files to optimize storage and performance.
In this post, you learn how to set up an automated, end-to-end solution that extracts tables from Amazon Aurora MySQL Serverless v2 and writes them to Amazon S3 Tables in Apache Iceberg format using AWS Glue. The entire infrastructure is deployed using a single AWS CloudFormation stack.
Requirements
AWS offers zero-ETL integrations from Amazon Aurora to Amazon Redshift and Amazon SageMaker AI, enabling seamless data movement for analytics and machine learning workloads.
However, there isn't yet a native zero-ETL integration between Amazon Aurora and Amazon S3 Tables. This means that organizations looking to use Amazon S3 Tables for their lakehouse architecture currently face several requirements:
- Setting up ETL pipelines to extract data from Amazon Aurora and transform it into Apache Iceberg format
- Configuring networking and security for AWS Glue jobs to access Amazon Aurora databases in private subnets
- Coordinating the provisioning of source databases, ETL pipelines, and target table stores
- Managing the end-to-end workflow without native automation
Solution overview
In this solution, you automate the extraction of relational database tables from Amazon Aurora MySQL Serverless v2 to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. To help you get started and test this solution, a CloudFormation template is provided. This template provisions the required infrastructure, loads sample data, and configures the extract, transform, load (ETL) pipeline. You can adapt this template for your own scenario.
Sample data
This solution uses the TICKIT sample database, a well-known dataset used in the Amazon Redshift documentation. The TICKIT data models a fictional ticket sales system with seven interrelated tables: users, venue, category, date, event, listing, and sales. The dataset is publicly available, as mentioned in the Amazon Redshift Getting Started Guide.
Solution flow
The solution flow, as shown in the preceding architecture diagram, is as follows:
- An AWS Lambda function downloads the TICKIT sample dataset (a fictional ticket sales system used in the Amazon Redshift documentation) from a public Amazon S3 bucket to a staging S3 bucket.
- A second Lambda function, using PyMySQL (a Python MySQL client library), loads the staged data files into the Aurora MySQL Serverless v2 database using LOAD DATA LOCAL INFILE.
- The AWS Glue job reads seven TICKIT tables from Aurora MySQL through a native MySQL connection and writes them to Amazon S3 Tables in Apache Iceberg format using the S3 Tables REST catalog endpoint with SigV4 authentication.
- You can query the migrated data in S3 Tables using Amazon Athena.
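The REST catalog wiring used in the Glue step can be sketched as the following Spark configuration. The catalog name, table bucket ARN, and region below are placeholder assumptions, not values taken from the stack; in an AWS Glue 5.0 job, these entries would be passed as job `--conf` arguments or applied through the SparkSession builder.

```python
# Sketch of the Spark settings that register Amazon S3 Tables as an
# Iceberg REST catalog with SigV4 request signing. The catalog name,
# table bucket ARN, and region are illustrative placeholders.

def s3tables_catalog_conf(catalog_name: str, table_bucket_arn: str, region: str) -> dict:
    """Build the Spark conf entries for an S3 Tables Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # S3 Tables exposes an Iceberg REST catalog endpoint per region
        f"{prefix}.uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": table_bucket_arn,
        # SigV4 signing against the s3tables service
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "s3tables",
        f"{prefix}.rest.signing-region": region,
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }

conf = s3tables_catalog_conf(
    "s3tablesbucket",
    "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",  # placeholder ARN
    "us-east-1",
)
```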
The solution consists of the following key components:
- Amazon Aurora MySQL Serverless v2 as the source relational database containing the TICKIT sample dataset (users, venue, category, date, event, listing, and sales tables)
- AWS Secrets Manager to store the Aurora MySQL database credentials securely
- Amazon S3 staging bucket for the TICKIT sample data files downloaded from the public redshift-downloads S3 bucket
- AWS Lambda functions using PyMySQL to load data into Aurora MySQL
- AWS Glue 5.0 job (PySpark) to read tables from Aurora MySQL and write them to S3 Tables in Apache Iceberg format
- Amazon S3 Tables as the target storage for the migrated Iceberg tables
- Amazon VPC with private subnets and VPC endpoints for Amazon S3, S3 Tables, AWS Glue, Secrets Manager, AWS Security Token Service (AWS STS), CloudWatch Logs, and CloudFormation
Here are some advantages of this architecture:
- Fully automated setup: A single CloudFormation stack provisions the required infrastructure, loads sample data, and configures the ETL pipeline.
- Serverless and cost-efficient: Aurora MySQL Serverless v2 and AWS Glue both scale based on demand, minimizing idle costs.
- Apache Iceberg table format: Data is stored in Apache Iceberg format, enabling ACID transactions, schema evolution, and time travel queries.
- Network isolation and credential management: The resources run within private subnets with virtual private cloud (VPC) endpoints, and database credentials are managed through AWS Secrets Manager.
- Extensible pattern: The same approach can be adapted for other relational databases (PostgreSQL, SQL Server) and other target formats supported by AWS Glue.
Prerequisites
To follow along, you need an AWS account. If you don't have an AWS account yet, you can create one. The CloudFormation stack deployment takes approximately 30-45 minutes to complete and requires familiarity with Amazon S3 Tables, AWS CloudFormation, Apache Iceberg, AWS Glue, and Amazon Aurora. This solution will incur AWS costs. The main cost drivers are AWS Glue ETL job runs (billed per DPU-hour, proportional to data volume) and Amazon S3 Tables storage and request costs. Remember to clean up resources when you are done to avoid unnecessary costs.
CloudFormation parameters
You can configure the following parameters before deploying the CloudFormation stack:
| Parameter | Description | Default | Required |
| --- | --- | --- | --- |
| S3TableBucketName | Name of the S3 Tables bucket to create (or use an existing one) | (none) | Yes |
| DatabaseName | Name of the initial Aurora MySQL database | tickit | No |
| MasterUsername | Master username for Aurora MySQL | admin | No |
| VpcCidr | CIDR block for the VPC | 10.1.0.0/16 | No |
| S3TableNamespace | Namespace for S3 Tables | tickit | No |
Implementation walkthrough
The following steps walk you through the implementation. These steps deploy and test an end-to-end solution from scratch. If you are already running some of these components, you may skip to the relevant step. You can also refer to the aws-samples repository, sample-to-write-aurora-mysql-to-s3tables-using-glue, for the complete solution.
Step 1: Deploy the CloudFormation stack
Deploy the CloudFormation template scripts/aurora-mysql-to-s3tables-stack.yaml using the AWS Console or the AWS Command Line Interface (AWS CLI). Provide a name for the S3 Tables bucket; the stack will create it automatically (or use an existing one if it already exists).
To deploy using the AWS Console (recommended), navigate to the AWS CloudFormation Console and use the CloudFormation template. Alternatively, to deploy using the AWS CLI, first upload the template to an S3 bucket (the template exceeds the 51,200-byte limit for an inline --template-body), then create the stack.
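The CLI deployment path can be sketched with boto3 (the AWS SDK for Python). The stack name, template URL, and parameter values below are placeholders; calling `deploy_stack()` requires AWS credentials and the template already uploaded to S3.

```python
# Hypothetical sketch of deploying the stack programmatically. Stack name,
# template S3 URL, and parameter values are placeholder assumptions.

def to_cfn_parameters(params: dict) -> list:
    """Convert a plain dict into the ParameterKey/ParameterValue list
    that CloudFormation's create_stack API expects."""
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in params.items()]

def deploy_stack(stack_name: str, template_s3_url: str, params: dict) -> str:
    """Create the stack from its uploaded template. Needs AWS credentials."""
    import boto3  # imported here so the pure helper above stays dependency-free
    cfn = boto3.client("cloudformation")
    resp = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_s3_url,  # template is too large for --template-body
        Parameters=to_cfn_parameters(params),
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
    )
    return resp["StackId"]

params = to_cfn_parameters({"S3TableBucketName": "my-table-bucket", "DatabaseName": "tickit"})
```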
The stack will automatically:
- Create the S3 Tables bucket (or use an existing one if it already exists)
- Create a VPC with private subnets and VPC endpoints
- Provision an Aurora MySQL Serverless v2 cluster
- Download TICKIT sample data from the public Amazon S3 bucket
- Load the sample data into Aurora MySQL via a Lambda function using PyMySQL
- Create a Glue job configured to migrate data to S3 Tables in Iceberg format
Note: The S3 Tables bucket is retained when the stack is deleted to preserve your data.
Step 2: Verify the Aurora MySQL data
Retrieve the AuroraClusterEndpoint, DatabaseName, and SecretArn values from the CloudFormation stack outputs and make a note of them. You can navigate to the Amazon Aurora Console, choose the Query Editor, and enter the values from the CloudFormation stack to connect. You can also choose your preferred method of connecting to an Amazon Aurora DB cluster.
Use the AWS CLI to retrieve the stack outputs:
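A boto3 equivalent of retrieving the outputs might look like the following; the stack name is a placeholder, and `get_stack_outputs()` requires AWS credentials.

```python
# Hypothetical sketch of reading CloudFormation stack outputs with boto3.

def outputs_to_dict(outputs: list) -> dict:
    """Flatten CloudFormation's OutputKey/OutputValue list into a dict."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}

def get_stack_outputs(stack_name: str) -> dict:
    """Fetch and flatten stack outputs. Needs AWS credentials."""
    import boto3
    cfn = boto3.client("cloudformation")
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return outputs_to_dict(stack.get("Outputs", []))

# Example of the flattened shape (values are placeholders):
sample = outputs_to_dict([
    {"OutputKey": "AuroraClusterEndpoint",
     "OutputValue": "mycluster.cluster-abc.us-east-1.rds.amazonaws.com"},
    {"OutputKey": "DatabaseName", "OutputValue": "tickit"},
])
```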
Then run the following SQL commands to verify the data load:
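One simple verification, assuming the seven TICKIT tables listed earlier, is a row count per table. The helper below just builds the SQL statements; you can run them through the Query Editor or any MySQL client.

```python
# Build row-count verification queries for the seven TICKIT tables.
# Table names follow the TICKIT schema described in this post.
TICKIT_TABLES = ["users", "venue", "category", "date", "event", "listing", "sales"]

def count_queries(tables: list) -> list:
    """Return one SELECT COUNT(*) statement per table."""
    return [f"SELECT COUNT(*) AS row_count FROM {t};" for t in tables]

queries = count_queries(TICKIT_TABLES)
```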
Step 3: Run the Glue job
Navigate to the AWS Glue Console and choose ETL jobs under Data Integration and ETL in the left panel. Select the AWS Glue job mysql-tickit-to-iceberg-job and choose Run job to start execution. You can also start the ETL job using the AWS CLI:
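A boto3 sketch of starting the job and polling until it finishes is shown below; running `run_glue_job()` requires AWS credentials with glue:StartJobRun and glue:GetJobRun permissions.

```python
# Hypothetical sketch: start the Glue job and poll its run state.
import time

GLUE_JOB_NAME = "mysql-tickit-to-iceberg-job"
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def is_terminal(state: str) -> bool:
    """True once a Glue job run has finished (successfully or not)."""
    return state in TERMINAL_STATES

def run_glue_job(job_name: str = GLUE_JOB_NAME, poll_seconds: int = 30) -> str:
    """Start the job and wait for a terminal state. Needs AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if is_terminal(run["JobRunState"]):
            return run["JobRunState"]
        time.sleep(poll_seconds)
```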
The AWS Glue job performs the following operations for each of the seven TICKIT tables:
- Reads the table from Aurora MySQL through the native MySQL Glue connection
- Converts the data to a Spark DataFrame
- Creates the Iceberg table in the S3 Tables namespace using CREATE TABLE IF NOT EXISTS with the USING ICEBERG clause
- Inserts the data using INSERT INTO (or INSERT OVERWRITE if the table already exists)
- Verifies the record count and displays sample data
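The create-and-insert steps above can be sketched as generated Spark SQL. The catalog and namespace names ("s3tablesbucket", "tickit") and the temp-view suffix are illustrative assumptions; inside the job, each statement would run via spark.sql(...) against a temp view registered over the JDBC DataFrame.

```python
# Sketch of the per-table Spark SQL, mirroring the steps listed above.
# Names are placeholders; the LIMIT 0 CTAS (to create the table with the
# source schema but no rows) is an assumption about the job's approach.

def migrate_table_sql(catalog: str, namespace: str, table: str,
                      temp_view: str, overwrite: bool) -> list:
    """Return the DDL and DML statements for migrating one table."""
    ddl = (f"CREATE TABLE IF NOT EXISTS {catalog}.{namespace}.{table} "
           f"USING ICEBERG AS SELECT * FROM {temp_view} LIMIT 0")
    verb = "INSERT OVERWRITE" if overwrite else "INSERT INTO"
    dml = f"{verb} {catalog}.{namespace}.{table} SELECT * FROM {temp_view}"
    return [ddl, dml]

stmts = migrate_table_sql("s3tablesbucket", "tickit", "sales", "sales_src", overwrite=True)
```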
Step 4: Verify the results
After the AWS Glue job completes, verify that the tables have been created in your S3 table bucket by navigating to the Amazon S3 Console. Choose Table buckets under Buckets and select your S3 table bucket. You can also verify using the AWS CLI:
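A boto3 sketch of the same check is below; the table bucket ARN is a placeholder, and `list_migrated_tables()` requires AWS credentials with the s3tables:ListTables permission.

```python
# Hypothetical sketch: list the tables in the tickit namespace and check
# that all seven TICKIT tables arrived. The bucket ARN is a placeholder.
EXPECTED_TABLES = {"users", "venue", "category", "date", "event", "listing", "sales"}

def missing_tables(found: list) -> set:
    """Return expected TICKIT tables that are absent from a listing."""
    return EXPECTED_TABLES - set(found)

def list_migrated_tables(table_bucket_arn: str, namespace: str = "tickit") -> list:
    """List Iceberg tables in the namespace. Needs AWS credentials."""
    import boto3
    s3tables = boto3.client("s3tables")
    resp = s3tables.list_tables(tableBucketARN=table_bucket_arn, namespace=namespace)
    return [t["name"] for t in resp["tables"]]
```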
Select a table from the tickit namespace and choose Preview to inspect the data.

You can also query the migrated tables using Amazon Athena to validate the data.
Clean up resources
Remember to clean up resources when you no longer need them to avoid unnecessary costs.
Navigate to the CloudFormation console, search for your stack, and choose Delete. Alternatively, use the AWS CLI:
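A boto3 sketch of the cleanup path follows; the stack name and bucket ARN are placeholders, `cleanup()` requires AWS credentials, and note that the retained table bucket can only be deleted after its tables and namespaces have been removed.

```python
# Hypothetical cleanup sketch: delete the stack, then (optionally) the
# retained S3 Tables bucket. Names and ARNs are placeholders.

def is_table_bucket_arn(arn: str) -> bool:
    """Loose check that a string looks like an S3 Tables bucket ARN."""
    return arn.startswith("arn:aws:s3tables:") and ":bucket/" in arn

def cleanup(stack_name: str, table_bucket_arn: str = None) -> None:
    """Delete the stack and wait; then delete the table bucket if given.
    Needs AWS credentials; the bucket must already be empty of tables."""
    import boto3
    cfn = boto3.client("cloudformation")
    cfn.delete_stack(StackName=stack_name)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
    if table_bucket_arn and is_table_bucket_arn(table_bucket_arn):
        s3tables = boto3.client("s3tables")
        s3tables.delete_table_bucket(tableBucketARN=table_bucket_arn)
```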
The S3 Tables bucket is retained by default. To delete it, use the Amazon S3 console or the AWS CLI to remove the table bucket separately. The staging S3 bucket is automatically emptied and deleted as part of the stack deletion.
Summary
In this post, we showed you how to extract data from Amazon Aurora MySQL Serverless v2 and write it to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. By using the native Iceberg support in AWS Glue and the S3 Tables REST catalog endpoint, you can bridge the gap between relational databases and modern lakehouse storage formats. By automating the entire pipeline through CloudFormation, you can quickly set up and replicate this pattern across multiple environments.
As AWS Glue and Amazon S3 Tables continue to evolve, you can take advantage of future enhancements while maintaining this automated migration pattern.
If you have questions or suggestions, leave us a comment.
About the authors
