Implement fine-grained entry management on information lake tables utilizing AWS Glue 5.0 built-in with AWS Lake Formation

December 9, 2024

178

AWS Glue 5.0 helps fine-grained entry management (FGAC) primarily based in your insurance policies outlined in AWS Lake Formation. FGAC allows you to granularly management entry to your information lake assets on the desk, column, and row ranges. This degree of management is crucial for organizations that must adjust to information governance and safety rules, or those who cope with delicate information.

Lake Formation makes it easy to construct, safe, and handle information lakes. It lets you outline fine-grained entry controls by way of grant and revoke statements, just like these used with relational database administration methods (RDBMS), and routinely implement these insurance policies utilizing suitable engines like Amazon Athena, Apache Spark on Amazon EMR, and Amazon Redshift Spectrum. With AWS Glue 5.0, the identical Lake Formation guidelines that you just arrange to be used with different companies like Athena now apply to your AWS Glue Spark jobs and Interactive Classes by way of built-in Spark SQL and Spark DataFrames. This simplifies safety and governance of your information lakes.

This publish demonstrates easy methods to implement FGAC on AWS Glue 5.0 by way of Lake Formation permissions.

How FGAC works on AWS Glue 5.0

Utilizing AWS Glue 5.0 with Lake Formation enables you to implement a layer of permissions on every Spark job to use Lake Formation permissions management when AWS Glue runs jobs. AWS Glue makes use of Spark useful resource profiles to create two profiles to successfully run jobs. The person profile runs user-supplied code, and the system profile enforces Lake Formation insurance policies. For extra data, see the AWS Lake Formation Developer Information.

The next diagram demonstrates a high-level overview of how AWS Glue 5.0 will get entry to information protected by Lake Formation permissions.

The workflow consists of the next steps:

A person calls the StartJobRun API on a Lake Formation enabled AWS Glue job.
AWS Glue sends the job to a person driver and runs the job within the person profile. The person driver runs a lean model of Spark that has no capability to launch duties, request executors, or entry Amazon Easy Storage Service (Amazon S3) or the AWS Glue Information Catalog. It builds a job plan.
AWS Glue units up a second driver known as the system driver and runs it within the system profile (with a privileged id). AWS Glue units up an encrypted TLS channel between the 2 drivers for communication. The person driver makes use of the channel to ship the job plans to the system driver. The system driver doesn’t run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Information Catalog for information entry. It requests executors and compiles the Job Plan right into a sequence of execution phases.
AWS Glue then runs the phases on executors with the person driver or system driver. The person code in any stage is run completely on person profile executors.
Levels that learn information from Information Catalog tables protected by Lake Formation or those who apply safety filters are delegated to system executors.

Allow FGAC on AWS Glue 5.0

To allow Lake Formation FGAC to your AWS Glue 5.0 jobs on the AWS Glue console, full the next steps:

On the AWS Glue console, select ETL jobs within the navigation pane.
Select your job.
Select the Job particulars
For Glue model, select Glue 5.0 – Helps spark 3.5, Scala 2, Python 3.
For Job parameters, add following parameter:
1. Key: --enable-lakeformation-fine-grained-access
2. Worth: true
Select Save.

To allow Lake Formation FGAC to your AWS Glue notebooks on the AWS Glue console, use %%configure magic:

%glue_version 5.0
%%configure
{
    "--enable-lakeformation-fine-grained-access": "true"
}

Instance use case

The next diagram represents the high-level structure of the use case we reveal on this publish. The target of the use case is to showcase how will you implement Lake Formation FGAC on each CSV and Iceberg tables and configure an AWS Glue PySpark job to learn from them.

The implementation consists of the next steps:

Create an S3 bucket and add the enter CSV dataset.
Create a normal Information Catalog desk and an Iceberg desk by studying information from the enter CSV desk, utilizing an Athena CTAS question.
Use Lake Formation to allow FGAC on each CSV and Iceberg tables utilizing row- and column-based filters.
Run two pattern AWS Glue jobs to showcase how one can run a pattern PySpark script in AWS Glue that respects the Lake Formation FGAC permissions, after which write the output to Amazon S3.

To reveal the implementation steps, we use pattern product stock information that has the next attributes:

op – The operation on the supply file. This reveals values I to symbolize insert operations, U to symbolize updates, and D to symbolize deletes.
product_id – The first key column within the supply database’s merchandise desk.
class – The product’s class, similar to Electronics or Cosmetics.
product_name – The identify of the product.
quantity_available – The amount accessible within the stock for a product.
last_update_time – The time when the product file was up to date on the supply database.

To implement this workflow, we create AWS assets similar to an S3 bucket, outline FGAC with Lake Formation, and construct AWS Glue jobs to question these tables.

Stipulations

Earlier than you get began, ensure you have the next conditions:

An AWS account with AWS Identification and Entry Administration (IAM) roles as wanted.
The required permissions to carry out the next actions:
- Learn or write to an S3 bucket.
- Create and run AWS Glue crawlers and jobs.
- Handle Information Catalog databases and tables.
- Handle Athena workgroups and run queries.
Lake Formation already arrange within the account and a Lake Formation administrator position or the same position to observe together with the directions on this publish. To be taught extra about organising permissions for a knowledge lake administrator position, see Create a knowledge lake administrator.

For this publish, we use the eu-west-1 AWS Area, however you may combine it in your most well-liked Area if the AWS companies included within the structure can be found in that Area.

Subsequent, let’s dive into the implementation steps.

Create an S3 bucket

To create an S3 bucket for the uncooked enter datasets and Iceberg desk, full the next steps:

On the Amazon S3 console, select Buckets within the navigation pane.
Select Create bucket.
Enter the bucket identify (for instance, glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and depart the remaining fields as default.
Select Create bucket.
On the bucket particulars web page, select Create folder.
Create two subfolders: raw-csv-input and iceberg-datalake.
Add the LOAD00000001.csv file into the raw-csv-input folder of the bucket.

Create tables

To create enter and output tables within the Information Catalog, full the next steps:

On the Athena console, navigate to the question editor.

Run the next queries in sequence (present your S3 bucket identify):

-- Create database for the demo
CREATE DATABASE glue5_lf_demo;

-- Create exterior desk in enter CSV information. Change the S3 path along with your bucket identify
CREATE EXTERNAL TABLE glue5_lf_demo.raw_csv_input(
 op string, 
 product_id bigint, 
 class string, 
 product_name string, 
 quantity_available bigint, 
 last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3:///raw-csv-input/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'typeOfData'='file');
 
-- Create output Iceberg desk with partitioning. Change the S3 bucket identify along with your bucket identify
CREATE TABLE glue5_lf_demo.iceberg_datalake WITH (
  table_type="ICEBERG",
  format="parquet",
  write_compression = 'SNAPPY',
  is_external = false,
  partitioning=ARRAY['category', 'bucket(product_id, 16)'],
  location='s3:///iceberg-datalake/'
) AS SELECT * FROM glue5_lf_demo.raw_csv_input;

Run the next question to validate the uncooked CSV enter information:
```
SELECT * FROM glue5_lf_demo.raw_csv_input;
```

The next screenshot reveals the question end result.

Run the next question to validate the Iceberg desk information:
```
SELECT * FROM glue5_lf_demo.iceberg_datalake;
```

The next screenshot reveals the question end result.

This step used DDL to create desk definitions. Alternatively, you should utilize a Information Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.

Subsequent, let’s configure Lake Formation permissions on the raw_csv_input desk and iceberg_datalake desk.

Configure Lake Formation permissions

To validate the potential, let’s outline FGAC permissions for the 2 Information Catalog tables we created.

For the raw_csv_input desk, we allow permission for particular rows, for instance enable learn entry just for the Furnishings class. Equally, for the iceberg_datalake desk, we allow a knowledge filter for the Electronics product class and restrict learn entry to a couple columns solely.

To configure Lake Formation permissions for the 2 tables, full the next steps:

On the Lake Formation console, select Information lake areas underneath Administration within the navigation pane.
Select Register location.
For Amazon S3 path, enter the trail of your S3 bucket to register the situation.
For IAM position, select your Lake Formation information entry IAM position, which isn’t a service linked position.
For Permission mode, choose Lake Formation.
Select Register location.

Grant desk permissions on the usual desk

The following step is to grant desk permissions on the raw_csv_input desk to the AWS Glue job position.

On the Lake Formation console, select Information lake permissions underneath Permissions within the navigation pane.
Select Grant.
For Principals, select IAM customers and roles.
For IAM customers and roles, select your IAM position that’s going for use on an AWS Glue job.
For LF-Tags or catalog assets, select Named Information Catalog assets.
For Databases, select glue5_lf_demo.
For Tables, select raw_csv_input.
For Information filters, select Create new.
Within the Create information filter dialog, present the next data:
1. For Information filter identify, enter product_furniture.
2. For Column-level entry, choose Entry to all columns.
3. Choose Filter rows.
4. For Row filter expression, enter class='Furnishings'.
5. Select Create filter.

For Information filters, choose the filter product_furniture you created.
For Information filter permissions, select Choose and Describe.
Select Grant.

Grant permissions on the Iceberg desk

The following step is to grant desk permissions on the iceberg_datalake desk to the AWS Glue job position.

On the Lake Formation console, select Information lake permissions underneath Permissions within the navigation pane.
Select Grant.
For Principals, select IAM customers and roles.
For IAM customers and roles, select your IAM position that’s going for use on an AWS Glue job.
For LF-Tags or catalog assets, select Named Information Catalog assets.
For Databases, select glue5_lf_demo.
For Tables, select iceberg_datalake.
For Information filters, select Create new.
Within the Create information filter dialog, present the next data:
1. For Information filter identify, enter product_electronics.
2. For Column-level entry, choose Embody columns.
3. For Included columns, select class, last_update_time, op, product_name, and quantity_available.
4. Select Filter rows.
5. For Row filter expression, enter class='Electronics'.
6. Select Create filter.
For Information filters, choose the filter product_electronics you created.
For Information filter permissions, select Choose and Describe.
Select

Subsequent, let’s create the AWS Glue PySpark job to course of the enter information.

Question the usual desk by way of an AWS Glue 5.0 job

Full the next steps to create an AWS Glue job to load information from the raw_csv_input desk:

On the AWS Glue console, select ETL jobs within the navigation pane.
For Create job, select Script Editor.
For Engine, select Spark.
For Choices, select Begin recent.
Select Create script.

For Script, use the next code, offering your S3 output path. This instance script writes the output in Parquet format; you may change this in line with your use case.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Learn from uncooked CSV desk
df = spark.sql("SELECT * FROM glue5_lf_demo.raw_csv_input")
df.present()

# Write to your most well-liked location.
df.write.mode("overwrite").parquet("s3://")

On the Job particulars tab, for Identify, enter glue5-lf-demo.
For IAM Position, assign an IAM position that has the required permissions to run an AWS Glue job and skim and write to the S3 bucket.
For Glue model, select Glue 5.0 – Helps spark 3.5, Scala 2, Python 3.
For Job parameters, add following parameter:
1. Key: --enable-lakeformation-fine-grained-access
2. Worth: true

Select Save after which Run.
When the job is full, on the Run particulars tab on the backside of job runs, select Output logs.

You’re redirected to the Amazon CloudWatch console to validate the output.

The printed desk is proven within the following screenshot. Solely two data had been returned as a result of they’re Furnishings class merchandise.

Question the Iceberg desk by way of an AWS Glue 5.0 job

Subsequent, full the next steps to create an AWS Glue job to load information from the iceberg_datalake desk:

On the AWS Glue console, select ETL jobs within the navigation pane.
For Create job, select Script Editor.
For Engine, select Spark.
For Choices, select Begin recent.
Select Create script.
For Script, substitute the next parameters:
1. Change aws_region along with your Area.
2. Change aws_account_id along with your AWS account ID.
3. Change warehouse_path along with your S3 warehouse path for the Iceberg desk.
4. Change along with your S3 output path.

This instance script writes the output in Parquet format; you may change it in line with your use case.

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

catalog_name = "spark_catalog"
aws_region = "eu-west-1"
aws_account_id = "123456789012"
warehouse_path = "s3:///warehouse"

# Create Spark Session with Iceberg Configurations
spark = SparkSession.builder 
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") 
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") 
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") 
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") 
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") 
    .config(f"spark.sql.catalog.{catalog_name}.shopper.area", f"{aws_region}") 
    .config(f"spark.sql.catalog.{catalog_name}.glue.account-id", f"{aws_account_id}") 
    .getOrCreate()

# Learn from Iceberg desk
df = spark.sql(f"SELECT * FROM {catalog_name}.glue5_lf_demo.iceberg_datalake")
df.present()

# Write to your most well-liked location.
df.write.mode("overwrite").parquet("s3://")

On the Job particulars tab, for Identify, enter glue5-lf-demo-iceberg.
For IAM Position, assign an IAM position that has the required permissions to run an AWS Glue job and skim and write to the S3 bucket.
For Glue model, select Glue 5.0 – Helps spark 3.5, Scala 2, Python 3.
For Job parameters, add following parameters:
1. Key: --enable-lakeformation-fine-grained-access
2. Worth: true
3. Key: --datalake-formats
4. Worth: iceberg
Select Save after which Run.
When the job is full, on the Run particulars tab, select Output logs.

You’re redirected to the CloudWatch console to validate the output.

The printed desk is proven within the following screenshot. Solely two data had been returned as a result of they’re Electronics class merchandise, and the product_id column is excluded.

You at the moment are in a position to confirm that data of the desk raw_csv_input and the desk iceberg_datalake are efficiently retrieved with configured Lake Formation information cell filters.

Clear up

Full the next steps to scrub up your assets:

Delete the AWS Glue jobs glue5-lf-demo and glue5-lf-demo-iceberg.
Delete the Lake Formation permissions.
Delete the output information written to the S3 bucket.
Delete the bucket you created for the enter datasets, which could have a reputation just like glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.

Conclusion

This publish defined how one can allow Lake Formation FGAC in AWS Glue jobs and notebooks that can implement entry management outlined utilizing Lake Formation grant instructions. Beforehand, you wanted to combine AWS Glue DynamicFrames to implement FGAC in AWS Glue jobs, however with this launch, you may implement FGAC by way of Spark DataFrame or Spark SQL. This functionality additionally works not solely with commonplace file codecs like CSV, JSON, and Parquet but additionally with Apache Iceberg.

This function can prevent effort and encourage portability whereas migrating Spark scripts to totally different serverless environments similar to AWS Glue and Amazon EMR.

Concerning the Authors

Sakti Mishra is a Principal Options Architect at AWS, the place he helps clients modernize their information structure and outline end-to end-data methods, together with information safety, accessibility, governance, and extra. He’s additionally the writer of Simplify Huge Information Analytics with Amazon EMR and AWS Licensed Information Engineer Examine Information. Exterior of labor, Sakti enjoys studying new applied sciences, watching motion pictures, and visiting locations with household. He may be reached through LinkedIn.

Noritaka Sekiyama is a Principal Huge Information Architect on the AWS Glue staff. He’s additionally the writer of the ebook Serverless ETL and Analytics with AWS Glue. He’s accountable for constructing software program artifacts to assist clients. In his spare time, he enjoys biking together with his highway bike.

Matt Su is a Senior Product Supervisor on the AWS Glue staff. He enjoys serving to clients uncover insights and make higher selections utilizing their information with AWS Analytics companies. In his spare time, he enjoys snowboarding and gardening.

Layth Yassin is a Software program Growth Engineer on the AWS Glue staff. He’s captivated with tackling difficult issues at a big scale, and constructing merchandise that push the boundaries of the sphere. Exterior of labor, he enjoys enjoying/watching basketball, and spending time with family and friends.

Implement fine-grained entry management on information lake tables utilizing AWS Glue 5.0 built-in with AWS Lake Formation

How FGAC works on AWS Glue 5.0

Allow FGAC on AWS Glue 5.0

Instance use case

Stipulations

Create an S3 bucket

Create tables

Configure Lake Formation permissions

Grant desk permissions on the usual desk

Grant permissions on the Iceberg desk

Question the usual desk by way of an AWS Glue 5.0 job

Question the Iceberg desk by way of an AWS Glue 5.0 job

Clear up

Conclusion

Concerning the Authors

Related Articles

USMNT Makes World Cup Historical past: Crew USA Secures First Again-to-Again Wins Since 1930

5 Finest Shapewear Items 2026, Examined & Authorized By Specialists

Love Island USA Alannah’s Pal’s Response to Racism Allegations

LEAVE A REPLY Cancel reply

Latest Articles

USMNT Makes World Cup Historical past: Crew USA Secures First Again-to-Again Wins Since 1930

5 Finest Shapewear Items 2026, Examined & Authorized By Specialists

Love Island USA Alannah’s Pal’s Response to Racism Allegations

Is Mistral late or savvy?

Why Pepcid Is the Solely H2 Blocker Price Taking