Data environments in data-driven organizations are changing to meet the growing demands for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have a huge demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.
Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. You can also access and query data in place with federated query capabilities across third-party data sources through Amazon Athena.
With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.
Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.
How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR
With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by configuring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.
Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which come preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.
To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:
| Spark config key | Value |
| --- | --- |
| `spark.sql.catalog.{catalog_name}` | `org.apache.iceberg.spark.SparkCatalog` |
| `spark.sql.catalog.{catalog_name}.type` | `glue` |
| `spark.sql.catalog.{catalog_name}.glue.id` | `{account_id}:{rms_catalog_name}/{database_name}` |
| `spark.sql.catalog.{catalog_name}.client.region` | `{aws_region}` |
| `spark.sql.extensions` | `org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions` |
Configuration parameters:
- `{catalog_name}`: Your chosen name for referencing the RMS catalog database in your application code
- `{rms_catalog_name}`: The RMS catalog name as shown in the AWS Lake Formation catalogs section
- `{database_name}`: The RMS database name
- `{aws_region}`: The AWS Region where the RMS catalog is located
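Because each RMS database needs its own catalog entry, it can help to generate these settings from the placeholders rather than typing them by hand. The following is a minimal sketch, not an official helper: the function name `rms_catalog_conf`, the account ID `123456789012`, and the Region `us-east-1` are illustrative assumptions, while the keys and value formats follow the configuration list above.

```python
def rms_catalog_conf(catalog_name, account_id, rms_catalog_name, database_name, aws_region):
    """Build the Spark settings for one RMS database (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "glue",
        # glue.id has the form {account_id}:{rms_catalog_name}/{database_name}
        f"{prefix}.glue.id": f"{account_id}:{rms_catalog_name}/{database_name}",
        f"{prefix}.client.region": aws_region,
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


# Placeholder account ID and Region; replace them with your own values.
conf = rms_catalog_conf("rmscatalog", "123456789012", "rms-catalog-demo", "dev", "us-east-1")
for key, value in conf.items():
    print(f"{key} = {value}")
```

To reach a second RMS database, you would call the helper again with a different `catalog_name` and `database_name` and merge the resulting dictionaries.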
For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.
In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.
Although you can bring existing Amazon Redshift tables into the AWS Glue Data Catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and providing access to a SageMaker Unified Studio project, in the following example you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.
Prerequisites
To follow these instructions, you must have the following prerequisites:
Create a SageMaker Unified Studio project
Complete the following steps to create a SageMaker Unified Studio project:
- Sign in to SageMaker Unified Studio.
- Choose Select a project on the top menu and choose Create project.
- For Project name, enter `demo`.
- For Project profile, choose All capabilities.
- Choose Continue.
- Leave the default values and choose Continue.
- Review the configurations and choose Create project.
You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.
- Make note of the Project role ARN because you’ll need it for subsequent steps.
You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.
Configure a Lakehouse catalog for your RMS
Complete the following steps to configure a Lakehouse catalog for your RMS:
- In the navigation pane, choose Data.
- Choose the `+` (plus) sign.
- Select Create Lakehouse catalog to create a new catalog and choose Next.
- For Lakehouse catalog name, enter `rms-catalog-demo`.
- Choose Add catalog.
- Wait for the catalog to be created.
- In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:
- In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
- On the top right, choose Select data source.
- For CONNECTIONS, choose Redshift (Lakehouse).
- For DATABASES, choose `dev@rms-catalog-demo`.
- For SCHEMAS, choose public.
- Choose Choose.
- In the query cell, enter and execute the following query to create a new schema:
- In a new cell, enter and execute the following query to create a new table:
- In a new cell, enter and execute the following query to populate the table with sample data:
- In a new cell, enter and run the following query to verify the table contents:
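As a rough illustration of the four query steps above, the statements might look like the following. This is a sketch under stated assumptions, not the post's original listing: only the `salesdb` schema and `store_sales` table names come from the text, while the column names, data types, and sample rows are hypothetical. The statements are assembled as Python strings so you can paste the printed output into the query editor one cell at a time.

```python
SCHEMA = "salesdb"      # schema name used in this walkthrough
TABLE = "store_sales"   # table name used in this walkthrough

# Hypothetical sample rows: (sold_at, item, sales_price)
rows = [
    ("2024-12-01 09:00:00", "Product 1", 100.0),
    ("2024-12-01 10:00:00", "Product 2", 500.0),
    ("2024-12-01 11:00:00", "Product 3", 20.0),
]
values = ",\n  ".join(f"('{ts}', '{item}', {price})" for ts, item, price in rows)

statements = [
    # 1. Create the schema
    f"CREATE SCHEMA {SCHEMA};",
    # 2. Create the table (hypothetical columns)
    f"CREATE TABLE {SCHEMA}.{TABLE} "
    "(sold_at TIMESTAMP, item VARCHAR(64), sales_price DECIMAL(10,2));",
    # 3. Populate the table with sample data
    f"INSERT INTO {SCHEMA}.{TABLE} VALUES\n  {values};",
    # 4. Verify the table contents
    f"SELECT * FROM {SCHEMA}.{TABLE};",
]

print("\n\n".join(statements))
```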

(Optional) Create an Amazon EMR Serverless application
IMPORTANT: This section is only required if you also plan to test using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.
- Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing tab. Choose Add compute.
- Choose Create new compute resources, then choose Next.
- Select EMR Serverless.
- Specify `emr_serverless_application` as Compute name, select Compatibility as Permission mode, and choose Add compute.
- Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark
In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.
- In the navigation pane, choose Data.
- Under Lakehouse, select the down arrow next to `rms-catalog-demo`.
- Under dev, select the down arrow next to `salesdb`, select `store_sales`, and choose the three dots.
SageMaker Lakehouse provides multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.
- Choose Open in Jupyter Lab notebook.
- On the Launcher tab, choose Python 3 (ipykernel).
In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (`project.spark.compatibility`), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following table shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:
| Compute option | PySpark code: Connection type | PySpark code: Compute | Spark SQL: Connection type | Spark SQL: Compute |
| --- | --- | --- | --- | --- |
| AWS Glue | PySpark | `project.spark.compatibility` | SQL | `project.spark.compatibility` |
| Amazon EMR Serverless | PySpark | `emr-s.emr_serverless_application` | SQL | `emr-s.emr_serverless_application` |
- In the notebook cell’s top left corner, set Connection Type to PySpark and select `spark.compatibility` (AWS Glue 5.0) as Compute.
- Execute the following code to initialize the SparkSession and configure `rmscatalog` as the session catalog for accessing the `dev` database under the `rms-catalog-demo` RMS catalog:
- Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly.
- Enter the following SQL statement to view all tables under `salesdb` (RMS schema) inside `rmscatalog`:
- In a new SQL cell, enter the following `DESCRIBE EXTENDED` statement to view detailed information about the `store_sales` table in the `salesdb` schema:
In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.
- In a new SQL cell, enter the following `SELECT` statement to view the content of the table:
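The notebook steps above can be sketched in PySpark as follows. Treat this as an illustration under stated assumptions rather than the post's original listing: the account ID, Region, and application name are placeholders, and the helper names `build_conf`, `create_session`, and `run_all` are hypothetical. The configuration keys follow the Spark configuration described earlier in this post, and `pyspark` is imported lazily so the configuration and queries can be inspected without a Spark runtime.

```python
CATALOG = "rmscatalog"                      # Spark catalog name used in this walkthrough
RMS_CATALOG, DATABASE = "rms-catalog-demo", "dev"


def build_conf(account_id, aws_region):
    """Spark settings that register the dev RMS database as rmscatalog."""
    prefix = f"spark.sql.catalog.{CATALOG}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "glue",
        f"{prefix}.glue.id": f"{account_id}:{RMS_CATALOG}/{DATABASE}",
        f"{prefix}.client.region": aws_region,
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


def create_session(account_id, aws_region):
    """Create (or reuse) a SparkSession with the RMS catalog configured."""
    from pyspark.sql import SparkSession  # lazy import: only needed on a Spark runtime

    builder = SparkSession.builder.appName("rms_demo")  # placeholder application name
    for key, value in build_conf(account_id, aws_region).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Spark SQL statements for the three SQL cells, fully qualified with the catalog name.
QUERIES = {
    "show_tables": f"SHOW TABLES IN {CATALOG}.salesdb",
    "describe": f"DESCRIBE EXTENDED {CATALOG}.salesdb.store_sales",
    "select": f"SELECT * FROM {CATALOG}.salesdb.store_sales",
}


def run_all(spark):
    """Run each statement and print its result."""
    for sql in QUERIES.values():
        spark.sql(sql).show(truncate=False)


for name, sql in QUERIES.items():
    print(f"{name}: {sql}")
```

On a Spark runtime, you could call `run_all(create_session("123456789012", "us-east-1"))` after replacing the placeholder account ID and Region. If a session already exists in the notebook, `getOrCreate()` returns it, and static settings such as `spark.sql.extensions` may not be applied to it.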

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.
Clean up
To avoid incurring future charges, clean up all created resources:
- Delete the created SageMaker Unified Studio project. This step will automatically delete Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
  - Inside SageMaker Unified Studio, navigate to the demo project’s Project overview section.
  - Choose Actions, then select Delete project.
  - Type confirm and choose Delete project.
- Delete the created Lakehouse catalog:
  - Navigate to the AWS Lake Formation page in the Catalogs section.
  - Select the `rms-catalog-demo` catalog, choose Actions, then select Delete.
  - In the confirmation window, type `rms-catalog-demo` and then choose Drop.
Conclusion
In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.
For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.
Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.
Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.
Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in a number of European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.
Appendix: Sample script for Lake Formation FGAC enabled Spark cluster
If you want to access RMS tables from a Lake Formation FGAC enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:
