Information engineers working Apache Spark jobs on Amazon EMR face a persistent problem: understanding how information strikes by way of Spark pipelines because it’s reworked, joined, and written to downstream tables . Monitoring these transformations manually requires inspecting job logs, reviewing code, and piecing collectively transformation logic throughout a number of sources. As pipelines scale, this course of turns into advanced. The visibility hole impacts key enterprise actions: troubleshooting information high quality points takes longer – influence evaluation for schema modifications requires extra effort – and compliance audits want in depth documentation of knowledge provenance.
Amazon SageMaker is the middle for all of your information and analytics the place you’ll find and entry all the info in your group and act on it utilizing instruments throughout varied use case. This unified platform addresses the info visibility problem by bringing collectively information governance, collaboration, and discovery right into a single interface. On the coronary heart of this platform is Amazon SageMaker Catalog, a centralized hub that permits organizations to catalog, govern, and uncover all their information property with full visibility into lineage. By capturing information lineage throughout your complete information ecosystem from uncooked sources by way of transformations to ultimate outputs, SageMaker Catalog lets you monitor information provenance throughout your complete platform, allow collaboration with clear visibility into information possession and high quality metrics, construct belief by way of complete information lineage that helps compliance and assured decision-making, and speed up discovery of reliable, governance-ready information property. You possibly can entry and visualize this lineage immediately in Amazon SageMaker Unified Studio, which serves because the unified interface to discover information relationships and collaborate throughout your analytics workflows.
Amazon EMR, ranging from model 7.11, now contains native OpenLineage help that automates lineage seize. OpenLineage is an open-source framework for information lineage that mechanically emits lineage metadata out of your information transformation jobs immediately into Amazon SageMaker Catalog, or different information governance options, with out requiring customizations.
This EMR native help of OpenLineage is a part of a rising set of integrations throughout AWS analytics companies together with AWS Glue, Amazon EMR Serverless, and Amazon Redshift. The entire listing of companies with native OpenLineage integration could be discovered within the information lineage help matrix.
On this publish, you’ll stroll by way of a sensible, step-by-step instance that exhibits seize and monitor information lineage from Spark jobs working on Amazon EMR immediately into Amazon SageMaker Catalog utilizing OpenLineage. You’ll see how lineage metadata flows mechanically and discover information relationships and dependencies throughout your workflows in Amazon SageMaker Unified Studio.
Resolution overview
Think about you’re half of a big enterprise that depends on HR analytics to optimize workforce planning, compensation methods, and expertise retention practices. Your information engineering group owns the supply of those analytical merchandise by processing uncooked HR datasets (together with worker data, attendance logs, and compensation particulars), with Spark jobs working in your Amazon EMR infrastructure.
With time, Spark jobs have grown in complexity. Your group now struggles to take care of visibility into how information strikes by way of pipelines, who modified it, and map dependencies between datasets and ultimate analytical merchandise.
The next answer demonstrates how one can tackle these challenges by mechanically capturing information lineage end-to-end from Spark jobs working in your EMR infrastructure and visualizing it in Amazon SageMaker Unified Studio so that you simply and the enterprise perceive information provenance of the ultimate analytical merchandise.
The structure features a Information Layer with CSV information containing worker, attendance, wage, and bonus information saved in Amazon S3 (Easy Storage Service), representing typical HR and payroll supply techniques.
The Processing Layer makes use of Amazon EMR cluster working Apache Spark jobs that rework uncooked information into analytical tables. The primary Spark job joins worker and attendance information whereas the second Spark job combines attendance with compensation information. Each jobs use Apache Iceberg desk format to offer ACID (Atomic, Constant, Remoted, and Sturdy) transactions and time journey capabilities.
The Metadata Layer makes use of AWS Glue Information Catalog to retailer Iceberg desk metadata, making tables discoverable and accessible throughout AWS analytics companies. A Lineage Layer makes use of the OpenLineage integration in EMR to mechanically monitor enter/output datasets (CSV information and Iceberg tables), transformation logic at column stage (joins, filters, aggregations), and job execution metadata.
Lastly, the Information Governance Layer makes use of Amazon SageMaker Catalog to seize and course of OpenLineage occasions posted by the EMR Spark jobs and mechanically construct a complete lineage graph that exhibits full information provenance from CSV supply information by way of Spark transformations to Iceberg analytical tables.
Earlier than you deploy this answer, be sure to have the next assets in place.
Conditions
For this walkthrough, it is best to have the next conditions:
- An AWS account.
- Your assumed function ought to have full entry to Amazon EMR serverless, Amazon S3, Amazon Identification and Entry Administration (IAM) and AWS Lambda. Observe that for manufacturing workloads, minimal permissions are advisable.
- A Amazon VPC (Digital Non-public Cloud) with a minimum of one subnet with web entry. You possibly can provision this VPC as you create the Amazon SageMaker area subsequent.
- An present Amazon SageMaker Unified Studio area and undertaking. To get began, use the short setup possibility as defined right here. To create a undertaking, observe the directions right here.
- An S3 bucket with the pattern information information and Spark scripts uploaded (see Put together Your Supply Information beneath)
- Default EMR service roles — if that is your first time utilizing EMR on this account, run `aws emr create-default-roles` from the AWS CLI or CloudShell to create them.
With these conditions in place, let’s study what the AWS CloudFormation template will deploy to your AWS surroundings.
Structure parts
The deployment creates a number of interconnected parts that work collectively to seize and visualize lineage:
- An S3 bucket to retailer all information and artifacts for the answer.
- An EMR cluster (v 7.12.0) with Apache Iceberg help enabled and OpenLineage integration pre-installed, able to run Spark jobs with lineage monitoring.
- A set of IAM insurance policies that grant the mandatory permissions to the EMR cluster to publish lineage occasions to your SageMaker Unified Studio area.
- A set of AWS Lake Formation permissions that grant the EMR cluster to create, alter, and drop Iceberg tables in your specified Glue database.
With an understanding of what is going to be deployed, you’re able to launch the CloudFormation stack.
Deploy the answer
Observe: Whereas this walkthrough makes use of the AWS EMR console and AWS CLI to confirm the cluster and run Spark jobs, you may as well carry out these steps immediately from Amazon SageMaker Unified Studio. SMUS offers a unified interface to create and handle EMR clusters, submit Spark jobs, and monitor execution — all inside the identical surroundings the place you’ll later discover the lineage captured in Amazon SageMaker Catalog.
Put together your supply information
Earlier than deploying the CloudFormation stack, clone or obtain the Git repository.
Add the CSV information downloaded from git to the enter/ prefix and the spark scripts in scripts/ prefix. You possibly can run the next command to add the information:
aws s3 cp workers.csv s3://YOUR-BUCKET/enter/
aws s3 cp attendance.csv s3://YOUR-BUCKET/enter/
aws s3 cp salary_adjustments.csv s3://YOUR-BUCKET/enter/
aws s3 cp bonus_payments.csv s3://YOUR-BUCKET/enter/
aws s3 cp emr-lineage-spark-job.py s3://YOUR-BUCKET/scripts/
aws s3 cp emr-lineage-compensation-job.py s3://YOUR-BUCKET/scripts/
To deploy the answer, full the next steps in CloudFormation console:
- Create new stack by specifying the CloudFormation yaml file beforehand obtain from git repository
PutHereThe YMLFileName - Enter a stack title (equivalent to,
emr-lineage-demo) and supply the next parameters:- SourceS3BucketName: S3 bucket containing your CSV information and Spark scripts
- SourceCSVPrefix: S3 prefix the place CSV information are positioned
- SourceScriptsPrefix: S3 prefix the place Spark scripts are positioned
- GlueDatabaseName: The title of the Glue database related to your Amazon SageMaker Unified Studio undertaking.
- DataZoneDomainId: Your SageMaker Unified Studio area ID.
- VpcId: The id of the VPC that was deployed as a part of the conditions.
- For EMRReleaseLabel, MasterInstanceType, CoreInstanceType and CoreInstanceCount, hold the default values.
- Acknowledge IAM useful resource creation, select Subsequent after which Submit. The CloudFormation stack takes roughly 10 to fifteen minutes to finish.
- Within the EMR console, watch for the cluster standing to point out as WAITING earlier than shifting to the following step.

Now that the EMR cluster is working with OpenLineage enabled, let’s study how the Spark jobs are configured to seize lineage metadata.
Discover information lineage configuration in EMR
When submitting Spark jobs to EMR, particular configurations allow OpenLineage to create and publish lineage occasions to SageMaker Unified Studio because the job runs:
spark.hadoop.hive.metastore.consumer.manufacturing facility.class– Configures Spark to make use of AWS Glue because the Hive metastore.spark.jars– Path to the pre-installed OpenLineage library (out there on EMR 7.11+).spark.extraListeners– Registers an OpenLineage listener to seize metadata of enter / output datasets and transformations.spark.openlineage.transport.sort– Makes use of the OpenLineage DataZone transport choice to ship lineage occasions immediately into SageMaker Catalog.spark.openlineage.transport.domainId– The ID of your SageMaker Unified Studio area, that serves because the goal for lineage occasions.spark.glue.accountId– Your AWS account ID for Glue information catalog operations.
Now that you simply perceive the configuration that permits automated lineage seize, you’re able to run the info pipeline.
When working this two-step pipeline, you’ll calculate the whole worker compensation by combining wage changes, bonuses, and attendance information. The ultimate analytical asset will serve payroll processing and budgeting.
Run worker attendance evaluation job
The primary job reads worker particulars (in workers.csv dataset) and attendance data (in attendance.csv dataset), joins the datasets on EmployeeID and creates a unified dataset (employee_attendance Iceberg desk) in your Glue database.
Observe the steps beneath to run this primary job:
- Within the CloudFormation console, navigate to the stack’s Outputs tab
- Copy the worth of the
Job1SubmitCommandoutput key. Observe that that is the command you’ll use to submit the primary job in EMR with the best configuration.

- Run the command in your terminal or AWS CloudShell.
- Monitor the job within the Amazon EMR console beneath Steps.

Run worker compensation evaluation job
Now, you’ll calculate the whole worker compensation (Iceberg desk) by combining wage changes (salary_adjustments.csv dataset), bonuses (bonus_payments.csv dataset), and attendance (calculated within the final step):
- Repeat the steps 1 to 4 to run Job 2.
- After completion, open the AWS Glue console.
- Navigate to Information Catalog, then Tables and select your SageMaker undertaking’s database.
- Affirm that
employee_attendanceandemployee_compensationtables are listed.
With each Spark jobs full, now you can visualize the entire information lineage graph in Amazon SageMaker Unified Studio.
Visualizing lineage in SageMaker Unified Studio
SageMaker Unified Studio offers a graph-based information lineage visualization that helps information engineers, analysts, and information scientists clearly perceive which supply datasets (information or tables) feed into every dataset, what transformations and logic are utilized at each step, which downstream analytics property devour the info, and the way modifications to upstream information or transformations could influence the remainder of the info pipeline.
Now that the info pipeline run efficiently, let’s evaluation the captured lineage for the HR information in SageMaker Unified Studio:
- Navigate to the SageMaker Unified Studio console, sign up to your area.
- Open your undertaking and go to Information Sources
- Discover your AWS Glue Information Catalog supply

- Click on RUN. Two new property can be created.

- Navigate to Belongings and Click on on employee_compensation. Beneath the LINEAGE tab you’ll discover the lineage graph view that SageMaker builds primarily based on the OpenLineage metadata captured from the EMR Spark jobs as they run.

-
- You’ll first see three lineage nodes from left to proper: one representing the EMR Spark job that created the ultimate Iceberg desk, a second one representing the precise Iceberg desk within the Glue catalog, and a 3rd one representing the info asset within the SageMaker Catalog stock that maps to the Glue desk.
- Click on on any lineage node to view its underlying metadata within the particulars pane, together with dataset names, S3 areas, schema, information sorts, job execution particulars and extra.
- Broaden the lineage to the left by clicking on the double arrow subsequent to the primary lineage node. Hold increasing till you hit the originating datasets.

-
- Increasing the graph to the left reveals the entire information pipeline again to authentic CSV supply information. You possibly can see how compensation information is determined by upstream attendance analytics.
- Observe how every lineage node represents a component within the information pipeline you run, together with each Spark jobs and even the intermediate employee_attendance Iceberg desk that connects them.
- You possibly can increase column-level lineage by clicking on the column part of a lineage node of a dataset or information asset. This lets you perceive how information modifications at a column stage because it goes downstream your information pipeline.

Cleanup
To keep away from ongoing prices, clear up the assets:
- First, empty the vacation spot bucket by working the next command in your terminal or with AWS CloudShell.
aws s3 rm s3://${DEST_BUCKET}/ --recursive
- Delete the CloudFormation stack.
- On the AWS CloudFormation console, select Stacks within the navigation pane.
- Select the stack you created, then select Delete after which Delete stack when prompted.
Conclusion
On this publish, you discover seize information lineage from Spark jobs in Amazon EMR (v7.11+) immediately into Amazon SageMaker Unified Studio. You realized arrange an Amazon EMR cluster with native OpenLineage help to mechanically monitor lineage metadata from Spark jobs processing your information. You additionally configured the combination between EMR and Amazon SageMaker Catalog to make sure lineage info flows seamlessly into your governance platform. Lastly, you explored the ensuing lineage graph in SageMaker Unified Studio and noticed the way it offers complete visibility into information transformations, from supply CSV information by way of Spark processing jobs to ultimate analytical tables utilizing Apache Iceberg format.
We encourage you to now check these capabilities with your personal information pipelines working on EMR. By implementing automated lineage monitoring, many shoppers have strengthened their governance frameworks whereas gaining priceless insights into information dependencies, influence evaluation, and compliance necessities. This method allows information groups to construct belief of their analytics outputs whereas sustaining the agility wanted to derive enterprise worth from their information property.
In regards to the authors
