Extract data from Amazon Aurora MySQL to Amazon S3 Tables in Apache Iceberg format


If you manage data in Amazon Aurora MySQL-Compatible Edition and need to make it accessible for analytics, machine learning (ML), or cross-service querying in a modern lakehouse format, you're not alone.

Organizations often need to run analytics, build ML models, or join data across multiple sources. These are examples of workloads that can be resource-intensive and impractical to run directly against a transactional database. By extracting your Aurora MySQL data into Amazon S3 Tables in Apache Iceberg format, you can offload analytical queries from your production database without impacting its performance, while storing data in a fully managed Iceberg table store optimized for analytics. Built on the open Apache Iceberg standard, Amazon Simple Storage Service (Amazon S3) Tables data is queryable from engines like Amazon Athena, Amazon Redshift Spectrum, and Apache Spark without additional data copies. You can also combine relational data with other datasets already in your data lake, enabling richer cross-domain insights.

Apache Iceberg and Amazon S3 Tables

Apache Iceberg is a widely adopted open table format that provides atomicity, consistency, isolation, durability (ACID) transactions, schema evolution, and time travel capabilities. It enables multiple engines to work concurrently on the same dataset, making it a popular choice for building open lakehouse architectures.
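
To give a flavor of what those capabilities look like in practice, here is an illustrative Spark SQL fragment. This is a sketch, not part of this post's solution code: it assumes an existing SparkSession named spark with an Iceberg catalog named demo, and the table name and snapshot ID are placeholders.

# Illustrative Iceberg features in Spark SQL; the catalog, table, and
# snapshot values below are placeholders, not part of this solution.
spark.sql("ALTER TABLE demo.db.sales ADD COLUMN promo_code STRING")              # schema evolution
spark.sql("SELECT * FROM demo.db.sales TIMESTAMP AS OF '2025-01-01 00:00:00'")   # time travel
spark.sql("SELECT * FROM demo.db.sales VERSION AS OF 123456789")                 # read a specific snapshot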

Amazon S3 Tables is a purpose-built, fully managed Apache Iceberg table store designed for analytics workloads. It delivers up to 3x faster query performance and up to 10x more transactions per second compared to self-managed Iceberg tables. It also automatically compacts data and removes unreferenced files to optimize storage and performance.

In this post, you learn how to set up an automated, end-to-end solution that extracts tables from Amazon Aurora MySQL Serverless v2 and writes them to Amazon S3 Tables in Apache Iceberg format using AWS Glue. The entire infrastructure is deployed using a single AWS CloudFormation stack.

Requirements

AWS offers zero-ETL integrations from Amazon Aurora to Amazon Redshift and Amazon SageMaker AI, enabling seamless data flow for analytics and machine learning workloads.

However, there isn't yet a native zero-ETL integration between Amazon Aurora and Amazon S3 Tables. This means that organizations looking to use Amazon S3 Tables for their lakehouse architecture currently face several requirements:

  • Building ETL pipelines to extract data from Amazon Aurora and transform it into Apache Iceberg format
  • Configuring networking and security for AWS Glue jobs to access Amazon Aurora databases in private subnets
  • Coordinating the provisioning of source databases, ETL pipelines, and target table stores
  • Managing the end-to-end workflow without native automation

Solution overview

In this solution, you automate the extraction of relational database tables from Amazon Aurora MySQL Serverless v2 to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. To help you get started and test this solution, a CloudFormation template is provided. This template provisions the required infrastructure, loads sample data, and configures the extract, transform, load (ETL) pipeline. You can adapt this template for your own scenario.

Sample data

This solution uses the TICKIT sample database, a well-known dataset used in the Amazon Redshift documentation. The TICKIT data models a fictional ticket sales system with seven interrelated tables: users, venue, category, date, event, listing, and sales. The dataset is publicly available as mentioned in the Amazon Redshift Getting Started Guide.

Solution flow

The solution flow, as shown in the preceding architecture diagram, is as follows:

  1. An AWS Lambda function downloads the TICKIT sample dataset (a fictional ticket sales system used in the Amazon Redshift documentation) from a public Amazon S3 bucket to a staging S3 bucket.
  2. A second Lambda function, using PyMySQL (a Python MySQL client library), loads the staged data files into the Aurora MySQL Serverless v2 database using LOAD DATA LOCAL INFILE.
  3. The AWS Glue job reads seven TICKIT tables from Aurora MySQL through a native MySQL connection and writes them to Amazon S3 Tables in Apache Iceberg format using the S3 Tables REST catalog endpoint with SigV4 authentication (a configuration sketch follows this list).
  4. You can query the migrated data in S3 Tables using Amazon Athena.
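
The key piece of step 3 is pointing Spark's Iceberg catalog at the S3 Tables REST endpoint. The following is a minimal PySpark sketch of that configuration, not the exact job code; the catalog name s3tables, the Region, and the table bucket ARN are placeholders to adapt.

# Minimal PySpark session configuration for the S3 Tables Iceberg REST
# catalog with SigV4 request signing (values below are placeholders).
from pyspark.sql import SparkSession

REGION = "<your-region>"
TABLE_BUCKET_ARN = "arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-table-bucket-name>"

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.type", "rest")
    .config("spark.sql.catalog.s3tables.uri",
            f"https://s3tables.{REGION}.amazonaws.com/iceberg")
    .config("spark.sql.catalog.s3tables.warehouse", TABLE_BUCKET_ARN)
    .config("spark.sql.catalog.s3tables.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.s3tables.rest.signing-name", "s3tables")
    .config("spark.sql.catalog.s3tables.rest.signing-region", REGION)
    .getOrCreate()
)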

The solution consists of the following key components:

  1. Amazon Aurora MySQL Serverless v2 as the source relational database containing the TICKIT sample dataset (users, venue, category, date, event, listing, and sales tables)
  2. AWS Secrets Manager to store the Aurora MySQL database credentials securely
  3. Amazon S3 staging bucket for the TICKIT sample data files downloaded from the public redshift-downloads S3 bucket
  4. AWS Lambda functions using PyMySQL to load data into Aurora MySQL
  5. AWS Glue 5.0 job (PySpark) to read tables from Aurora MySQL and write them to S3 Tables in Apache Iceberg format
  6. Amazon S3 Tables as the target storage for the migrated Iceberg tables
  7. Amazon VPC with private subnets and VPC endpoints for Amazon S3, S3 Tables, AWS Glue, Secrets Manager, AWS Security Token Service (AWS STS), CloudWatch Logs, and CloudFormation

Here are some advantages of this architecture:

  • Fully automated setup: A single CloudFormation stack provisions the required infrastructure, loads sample data, and configures the ETL pipeline.
  • Serverless and cost-efficient: Aurora MySQL Serverless v2 and AWS Glue both scale based on demand, minimizing idle costs.
  • Apache Iceberg table format: Data is stored in Apache Iceberg format, enabling ACID transactions, schema evolution, and time travel queries.
  • Network isolation and credential management: The resources run inside private subnets with virtual private cloud (VPC) endpoints, and database credentials are managed through AWS Secrets Manager.
  • Extensible pattern: The same approach can be adapted for other relational databases (PostgreSQL, SQL Server) and other target formats supported by AWS Glue (a PostgreSQL variation is sketched after this list).
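
For example, adapting the read side for PostgreSQL is mostly a matter of swapping the JDBC options. This hypothetical fragment assumes the Spark session from the earlier sketch and placeholder endpoint and credential values:

# Hypothetical PostgreSQL variation: only the JDBC options change; the
# Iceberg write path to S3 Tables stays the same.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<cluster-endpoint>:5432/tickit")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "users")
    .option("user", "<username>")
    .option("password", "<password>")
    .load()
)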

Prerequisites

To follow along, you need an AWS account. If you don't yet have an AWS account, you must create one. The CloudFormation stack deployment takes roughly 30-45 minutes to complete, and the walkthrough assumes familiarity with Amazon S3 Tables, AWS CloudFormation, Apache Iceberg, AWS Glue, and Amazon Aurora. This solution will incur AWS costs. The main cost drivers are AWS Glue ETL job runs (billed per DPU-hour, proportional to data volume) and Amazon S3 Tables storage and request costs. Remember to clean up resources when you are done to avoid unnecessary costs.
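
As a rough illustration only (assuming the commonly quoted rate of $0.44 per DPU-hour; check current AWS Glue pricing for your Region), a job running on 10 DPUs for 15 minutes would cost about 10 × 0.25 × $0.44 ≈ $1.10 per run.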

CloudFormation parameters

You can configure the following parameters before deploying the CloudFormation stack:

Parameter | Description | Default | Required
S3TableBucketName | Name of the S3 Tables bucket to create (or use existing) | - | Yes
DatabaseName | Name of the initial Aurora MySQL database | tickit | No
MasterUsername | Master username for Aurora MySQL | admin | No
VpcCidr | CIDR block for the VPC | 10.1.0.0/16 | No
S3TableNamespace | Namespace for S3 Tables | tickit | No

Implementation walkthrough

The following steps walk you through the implementation. These steps deploy and test an end-to-end solution from scratch. If you are already running some of these components, you may skip to the relevant step. You can also refer to the aws-samples repository, sample-to-write-aurora-mysql-to-s3tables-using-glue, for the complete solution.

Step 1: Deploy the CloudFormation stack

Deploy the CloudFormation template scripts/aurora-mysql-to-s3tables-stack.yaml using the AWS Console or the AWS Command Line Interface (AWS CLI). Provide a name for the S3 Tables bucket; the stack will create it automatically (or use an existing one if it already exists).

To deploy using the AWS Console (recommended), navigate to the AWS CloudFormation console and use the CloudFormation template. Alternatively, to deploy using the AWS CLI, first upload the template to an S3 bucket (the template exceeds the 51,200-byte limit for an inline --template-body), then create the stack.

# Upload the template to S3
aws s3 cp scripts/aurora-mysql-to-s3tables-stack.yaml \
  s3://<your-bucket>/aurora-mysql-to-s3tables-stack.yaml \
  --region <your-region>

# Create the stack using the S3 template URL
aws cloudformation create-stack \
  --stack-name aurora-mysql-tickit-stack \
  --template-url https://<your-bucket>.s3.<your-region>.amazonaws.com/aurora-mysql-to-s3tables-stack.yaml \
  --parameters \
    ParameterKey=S3TableBucketName,ParameterValue=<your-table-bucket-name> \
  --capabilities CAPABILITY_NAMED_IAM \
  --region <your-region>

The stack will automatically:

  • Create the S3 Tables bucket (or use an existing one if it already exists)
  • Create a VPC with private subnets and VPC endpoints
  • Provision an Aurora MySQL Serverless v2 cluster
  • Download the TICKIT sample data from the public Amazon S3 bucket
  • Load the sample data into Aurora MySQL via a Lambda function using PyMySQL
  • Create a Glue job configured to migrate data to S3 Tables in Iceberg format

Note: The S3 Tables bucket is retained when the stack is deleted to preserve your data.
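
Stack creation takes a while. If you script the deployment, a small boto3 sketch like the following (Region is a placeholder) blocks until the stack finishes creating:

# Wait for the CloudFormation stack to finish creating (boto3 sketch).
import boto3

cloudformation = boto3.client("cloudformation", region_name="<your-region>")
cloudformation.get_waiter("stack_create_complete").wait(
    StackName="aurora-mysql-tickit-stack"
)
print("Stack created")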

Step 2: Verify the Aurora MySQL data

Retrieve the AuroraClusterEndpoint, DatabaseName, and SecretArn values from the CloudFormation stack outputs and make a note of them. You can navigate to the Amazon RDS console, choose the query editor, and enter the values from the CloudFormation stack to connect. You can also choose your preferred method of connecting to an Amazon Aurora DB cluster.

Use the AWS CLI to retrieve the stack outputs:

aws cloudformation describe-stacks --stack-name aurora-mysql-tickit-stack --region <your-region> --query "Stacks[0].Outputs"

Then run the following SQL commands to verify the data load:

-- Verify that the tables are created
SELECT * FROM information_schema.tables WHERE table_schema = 'tickit';

-- Verify that the data is loaded
SELECT 'users' AS table_name, COUNT(*) AS record_count FROM tickit.users
UNION ALL SELECT 'venue', COUNT(*) FROM tickit.venue
UNION ALL SELECT 'category', COUNT(*) FROM tickit.category
UNION ALL SELECT 'date', COUNT(*) FROM tickit.date
UNION ALL SELECT 'event', COUNT(*) FROM tickit.event
UNION ALL SELECT 'listing', COUNT(*) FROM tickit.listing
UNION ALL SELECT 'sales', COUNT(*) FROM tickit.sales;
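
If you prefer to script this check, here is a hypothetical sketch using PyMySQL with credentials fetched from Secrets Manager. It assumes you run it from somewhere with network access to the cluster's private endpoint (for example, a bastion host), and that the secret stores username and password keys, as RDS-managed secrets typically do:

# Hypothetical verification script: fetch credentials from Secrets Manager
# and count rows in one of the TICKIT tables with PyMySQL.
import json

import boto3
import pymysql

secret_arn = "<SecretArn-from-stack-outputs>"
creds = json.loads(
    boto3.client("secretsmanager", region_name="<your-region>")
    .get_secret_value(SecretId=secret_arn)["SecretString"]
)

connection = pymysql.connect(
    host="<AuroraClusterEndpoint-from-stack-outputs>",
    user=creds["username"],
    password=creds["password"],
    database="tickit",
)
with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM tickit.sales")
    print("sales rows:", cursor.fetchone()[0])
connection.close()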

Step 3: Run the Glue job

Navigate to the AWS Glue console and choose ETL jobs under Data Integration and ETL in the left panel. Select the AWS Glue job mysql-tickit-to-iceberg-job and choose Run job to start execution. You can also start the ETL job using the AWS CLI:

aws glue start-job-run --job-name mysql-tickit-to-iceberg-job --region <your-region>
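
If you start the job from a script instead, a boto3 sketch like the following (Region is a placeholder) polls the run until it reaches a terminal state:

# Start the Glue job and poll its state until it finishes (boto3 sketch).
import time

import boto3

glue = boto3.client("glue", region_name="<your-region>")
run_id = glue.start_job_run(JobName="mysql-tickit-to-iceberg-job")["JobRunId"]

while True:
    state = glue.get_job_run(
        JobName="mysql-tickit-to-iceberg-job", RunId=run_id
    )["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print("Job finished with state:", state)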

The AWS Glue job performs the following operations for each of the seven TICKIT tables (a simplified sketch follows the list):

  • Reads the table from Aurora MySQL through the native MySQL Glue connection
  • Converts the data to a Spark DataFrame
  • Creates the Iceberg table in the S3 Tables namespace using CREATE TABLE IF NOT EXISTS with the USING ICEBERG clause
  • Inserts the data using INSERT INTO (or INSERT OVERWRITE if the table already exists)
  • Verifies the record count and displays sample data
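
The following is a simplified sketch of that per-table loop, not the exact job code. It assumes the Spark session configured in the earlier REST catalog sketch, a placeholder JDBC endpoint, and credentials already fetched from Secrets Manager (the creds dictionary from the step 2 sketch):

# Simplified per-table loop: read from Aurora MySQL over JDBC, create the
# Iceberg table in the S3 Tables namespace if needed, then load the data.
TABLES = ["users", "venue", "category", "date", "event", "listing", "sales"]

for table in TABLES:
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://<AuroraClusterEndpoint>:3306/tickit")
        .option("dbtable", table)
        .option("user", creds["username"])
        .option("password", creds["password"])
        .load()
    )
    df.createOrReplaceTempView(f"src_{table}")
    # Create an empty Iceberg table with the source schema on first run
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS s3tables.tickit.{table} "
        f"USING iceberg AS SELECT * FROM src_{table} LIMIT 0"
    )
    # Replace the table contents so reruns stay idempotent
    spark.sql(f"INSERT OVERWRITE s3tables.tickit.{table} SELECT * FROM src_{table}")
    print(table, spark.table(f"s3tables.tickit.{table}").count())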

Step 4: Verify the results

After the AWS Glue job completes, verify that the tables have been created in your S3 table bucket by navigating to the Amazon S3 console. Choose Table buckets under Buckets and select your S3 table bucket. You can also verify using the AWS CLI:

aws s3tables list-tables \
  --table-bucket-arn arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-table-bucket-name> \
  --namespace tickit \
  --region <your-region>

Select a table from the tickit namespace and choose Preview to inspect the data.


You can also query the migrated tables using Amazon Athena to validate the data.
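
For example, here is a hypothetical boto3 sketch that submits an Athena query against the tickit namespace. The s3tablescatalog/<bucket> catalog path assumes your account's S3 Tables integration with AWS analytics services is enabled, and the workgroup and results bucket are placeholders to adjust:

# Hypothetical Athena validation query via boto3; assumes S3 Tables is
# reachable from Athena through the account's s3tablescatalog integration.
import boto3

athena = boto3.client("athena", region_name="<your-region>")
response = athena.start_query_execution(
    QueryString=(
        'SELECT COUNT(*) FROM '
        '"s3tablescatalog/<your-table-bucket-name>"."tickit"."sales"'
    ),
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},
)
print("Query execution id:", response["QueryExecutionId"])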

Clean up resources

Remember to clean up resources when you no longer need them to avoid unnecessary costs.

Navigate to the CloudFormation console, search for your stack, and choose Delete. Alternatively, use the AWS CLI:

aws cloudformation delete-stack --stack-name aurora-mysql-tickit-stack --region <your-region>

The S3 Tables bucket is retained by default. To delete it, use the Amazon S3 console or the AWS CLI to remove the table bucket separately. The staging S3 bucket is automatically emptied and deleted as part of the stack deletion.

aws s3tables delete-table-bucket --table-bucket-arn arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-table-bucket-name> --region <your-region>
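
Note that a table bucket can only be deleted once its tables and namespaces have been removed. Here is a hypothetical boto3 sketch of the full teardown (pagination of the list calls is omitted for brevity):

# Hypothetical teardown: delete all tables, then namespaces, then the
# table bucket itself (list calls are not paginated in this sketch).
import boto3

s3tables = boto3.client("s3tables", region_name="<your-region>")
bucket_arn = "arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-table-bucket-name>"

for table in s3tables.list_tables(tableBucketARN=bucket_arn)["tables"]:
    s3tables.delete_table(
        tableBucketARN=bucket_arn,
        namespace=table["namespace"][0],
        name=table["name"],
    )
for ns in s3tables.list_namespaces(tableBucketARN=bucket_arn)["namespaces"]:
    s3tables.delete_namespace(
        tableBucketARN=bucket_arn, namespace=ns["namespace"][0]
    )
s3tables.delete_table_bucket(tableBucketARN=bucket_arn)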

Abstract

In this post, we showed you how to extract data from Amazon Aurora MySQL Serverless v2 and write it to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. By using the native Iceberg support of AWS Glue and the S3 Tables REST catalog endpoint, you can bridge the gap between relational databases and modern lakehouse storage formats. By automating the entire pipeline through CloudFormation, you can quickly set up and replicate this pattern across multiple environments.

As AWS Glue and Amazon S3 Tables continue to evolve, you can take advantage of future enhancements while maintaining this automated migration pattern.

If you have questions or suggestions, leave us a comment.


About the authors

Kunal Ghosh

Kunal is a Sr. Solutions Architect at AWS. He is passionate about building efficient and effective solutions on AWS, especially involving generative AI, analytics, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies.

Arghya Banerjee

Arghya is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He concentrates on big data, data lakes, streaming and batch analytics services, and generative AI technologies.

Indranil Banerjee

Indranil is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers in the hi-tech and semiconductor sectors solve complex business problems using the AWS Cloud. His special interests are in the areas of legacy modernization and migration, building analytics platforms, and helping customers adopt cutting-edge technologies such as generative AI.

Vipan Kumar

Vipan is a Sr. Solutions Architect at AWS, where he works with strategic customers. He has extensive experience in machine learning and generative AI. With a background in application development, he is passionate about designing and building enterprise applications for the cloud.
