Seize information lineage of Amazon EMR spark jobs into Amazon SageMaker Unified Studio


Information engineers working Apache Spark jobs on Amazon EMR face a persistent problem: understanding how information strikes by way of Spark pipelines because it’s reworked, joined, and written to downstream tables . Monitoring these transformations manually requires inspecting job logs, reviewing code, and piecing collectively transformation logic throughout a number of sources. As pipelines scale, this course of turns into advanced. The visibility hole impacts key enterprise actions: troubleshooting information high quality points takes longer – influence evaluation for schema modifications requires extra effort – and compliance audits want in depth documentation of knowledge provenance.

Amazon SageMaker is the middle for all of your information and analytics the place you’ll find and entry all the info in your group and act on it utilizing instruments throughout varied use case. This unified platform addresses the info visibility problem by bringing collectively information governance, collaboration, and discovery right into a single interface. On the coronary heart of this platform is Amazon SageMaker Catalog, a centralized hub that permits organizations to catalog, govern, and uncover all their information property with full visibility into lineage. By capturing information lineage throughout your complete information ecosystem from uncooked sources by way of transformations to ultimate outputs, SageMaker Catalog lets you monitor information provenance throughout your complete platform, allow collaboration with clear visibility into information possession and high quality metrics, construct belief by way of complete information lineage that helps compliance and assured decision-making, and speed up discovery of reliable, governance-ready information property. You possibly can entry and visualize this lineage immediately in Amazon SageMaker Unified Studio, which serves because the unified interface to discover information relationships and collaborate throughout your analytics workflows.

Amazon EMR, ranging from model 7.11, now contains native OpenLineage help that automates lineage seize. OpenLineage is an open-source framework for information lineage that mechanically emits lineage metadata out of your information transformation jobs immediately into Amazon SageMaker Catalog, or different information governance options, with out requiring customizations.

This EMR native help of OpenLineage is a part of a rising set of integrations throughout AWS analytics companies together with AWS Glue, Amazon EMR Serverless, and Amazon Redshift. The entire listing of companies with native OpenLineage integration could be discovered within the information lineage help matrix.

On this publish, you’ll stroll by way of a sensible, step-by-step instance that exhibits seize and monitor information lineage from Spark jobs working on Amazon EMR immediately into Amazon SageMaker Catalog utilizing OpenLineage. You’ll see how lineage metadata flows mechanically and discover information relationships and dependencies throughout your workflows in Amazon SageMaker Unified Studio.

Resolution overview

Think about you’re half of a big enterprise that depends on HR analytics to optimize workforce planning, compensation methods, and expertise retention practices. Your information engineering group owns the supply of those analytical merchandise by processing uncooked HR datasets (together with worker data, attendance logs, and compensation particulars), with Spark jobs working in your Amazon EMR infrastructure.

With time, Spark jobs have grown in complexity. Your group now struggles to take care of visibility into how information strikes by way of pipelines, who modified it, and map dependencies between datasets and ultimate analytical merchandise.

The next answer demonstrates how one can tackle these challenges by mechanically capturing information lineage end-to-end from Spark jobs working in your EMR infrastructure and visualizing it in Amazon SageMaker Unified Studio so that you simply and the enterprise perceive information provenance of the ultimate analytical merchandise.

The structure features a Information Layer with CSV information containing worker, attendance, wage, and bonus information saved in Amazon S3 (Easy Storage Service), representing typical HR and payroll supply techniques.

The Processing Layer makes use of Amazon EMR cluster working Apache Spark jobs that rework uncooked information into analytical tables. The primary Spark job joins worker and attendance information whereas the second Spark job combines attendance with compensation information. Each jobs use Apache Iceberg desk format to offer ACID (Atomic, Constant, Remoted, and Sturdy) transactions and time journey capabilities.

The Metadata Layer makes use of AWS Glue Information Catalog to retailer Iceberg desk metadata, making tables discoverable and accessible throughout AWS analytics companies. A Lineage Layer makes use of the OpenLineage integration in EMR to mechanically monitor enter/output datasets (CSV information and Iceberg tables), transformation logic at column stage (joins, filters, aggregations), and job execution metadata.

Lastly, the Information Governance Layer makes use of Amazon SageMaker Catalog to seize and course of OpenLineage occasions posted by the EMR Spark jobs and mechanically construct a complete lineage graph that exhibits full information provenance from CSV supply information by way of Spark transformations to Iceberg analytical tables.

Earlier than you deploy this answer, be sure to have the next assets in place.

Conditions

For this walkthrough, it is best to have the next conditions:

  • An AWS account.
  • Your assumed function ought to have full entry to Amazon EMR serverless, Amazon S3, Amazon Identification and Entry Administration (IAM) and AWS Lambda. Observe that for manufacturing workloads, minimal permissions are advisable.
  • A Amazon VPC (Digital Non-public Cloud) with a minimum of one subnet with web entry. You possibly can provision this VPC as you create the Amazon SageMaker area subsequent.
  • An present Amazon SageMaker Unified Studio area and undertaking. To get began, use the short setup possibility as defined right here. To create a undertaking, observe the directions right here.
  • An S3 bucket with the pattern information information and Spark scripts uploaded (see Put together Your Supply Information beneath)
  • Default EMR service roles — if that is your first time utilizing EMR on this account, run `aws emr create-default-roles` from the AWS CLI or CloudShell to create them.

With these conditions in place, let’s study what the AWS CloudFormation template will deploy to your AWS surroundings.

Structure parts

The deployment creates a number of interconnected parts that work collectively to seize and visualize lineage:

  • An S3 bucket to retailer all information and artifacts for the answer.
  • An EMR cluster (v 7.12.0) with Apache Iceberg help enabled and OpenLineage integration pre-installed, able to run Spark jobs with lineage monitoring.
  • A set of IAM insurance policies that grant the mandatory permissions to the EMR cluster to publish lineage occasions to your SageMaker Unified Studio area.
  • A set of AWS Lake Formation permissions that grant the EMR cluster to create, alter, and drop Iceberg tables in your specified Glue database.

With an understanding of what is going to be deployed, you’re able to launch the CloudFormation stack.

Deploy the answer

Observe: Whereas this walkthrough makes use of the AWS EMR console and AWS CLI to confirm the cluster and run Spark jobs, you may as well carry out these steps immediately from Amazon SageMaker Unified Studio. SMUS offers a unified interface to create and handle EMR clusters, submit Spark jobs, and monitor execution — all inside the identical surroundings the place you’ll later discover the lineage captured in Amazon SageMaker Catalog.

Put together your supply information

Earlier than deploying the CloudFormation stack, clone or obtain the Git repository.

git clone https://github.com/aws-samples/sample-capture-data-lineage-of-amazon-emr-ec2

Add the CSV information downloaded from git to the enter/ prefix and the spark scripts in scripts/ prefix. You possibly can run the next command to add the information:

aws s3 cp workers.csv s3://YOUR-BUCKET/enter/
aws s3 cp attendance.csv s3://YOUR-BUCKET/enter/
aws s3 cp salary_adjustments.csv s3://YOUR-BUCKET/enter/
aws s3 cp bonus_payments.csv s3://YOUR-BUCKET/enter/
aws s3 cp emr-lineage-spark-job.py s3://YOUR-BUCKET/scripts/
aws s3 cp emr-lineage-compensation-job.py s3://YOUR-BUCKET/scripts/

To deploy the answer, full the next steps in CloudFormation console:

  1. Create new stack by specifying the CloudFormation yaml file beforehand obtain from git repository PutHereThe YMLFileName
  2. Enter a stack title (equivalent to, emr-lineage-demo) and supply the next parameters:
    • SourceS3BucketName: S3 bucket containing your CSV information and Spark scripts
    • SourceCSVPrefix: S3 prefix the place CSV information are positioned
    • SourceScriptsPrefix: S3 prefix the place Spark scripts are positioned
    • GlueDatabaseName: The title of the Glue database related to your Amazon SageMaker Unified Studio undertaking.
    • DataZoneDomainId: Your SageMaker Unified Studio area ID.
    • VpcId: The id of the VPC that was deployed as a part of the conditions.
    • For EMRReleaseLabel, MasterInstanceType, CoreInstanceType and CoreInstanceCount, hold the default values.
  3. Acknowledge IAM useful resource creation, select Subsequent after which Submit. The CloudFormation stack takes roughly 10 to fifteen minutes to finish.
  4. Within the EMR console, watch for the cluster standing to point out as WAITING earlier than shifting to the following step.

Screenshot of the Amazon EMR on EC2 Clusters management console showing a list of 14 clusters, with the cluster "EMR-Lineage-Demo-emr-ec2-lineage-demo-stack" (ID: j-3APWOTUDNYO2T) highlighted in a "Waiting – Ready to run steps" status with a green badge.

Now that the EMR cluster is working with OpenLineage enabled, let’s study how the Spark jobs are configured to seize lineage metadata.

Discover information lineage configuration in EMR

When submitting Spark jobs to EMR, particular configurations allow OpenLineage to create and publish lineage occasions to SageMaker Unified Studio because the job runs:

  • spark.hadoop.hive.metastore.consumer.manufacturing facility.class – Configures Spark to make use of AWS Glue because the Hive metastore.
  • spark.jars – Path to the pre-installed OpenLineage library (out there on EMR 7.11+).
  • spark.extraListeners – Registers an OpenLineage listener to seize metadata of enter / output datasets and transformations.
  • spark.openlineage.transport.sort – Makes use of the OpenLineage DataZone transport choice to ship lineage occasions immediately into SageMaker Catalog.
  • spark.openlineage.transport.domainId – The ID of your SageMaker Unified Studio area, that serves because the goal for lineage occasions.
  • spark.glue.accountId – Your AWS account ID for Glue information catalog operations.

Now that you simply perceive the configuration that permits automated lineage seize, you’re able to run the info pipeline.

When working this two-step pipeline, you’ll calculate the whole worker compensation by combining wage changes, bonuses, and attendance information. The ultimate analytical asset will serve payroll processing and budgeting.

Run worker attendance evaluation job

The primary job reads worker particulars (in workers.csv dataset) and attendance data (in attendance.csv dataset), joins the datasets on EmployeeID and creates a unified dataset (employee_attendance Iceberg desk) in your Glue database.

Observe the steps beneath to run this primary job:

  1. Within the CloudFormation console, navigate to the stack’s Outputs tab
  2. Copy the worth of the Job1SubmitCommand output key. Observe that that is the command you’ll use to submit the primary job in EMR with the best configuration.

AWS CloudFormation console screenshot showing the Outputs tab for the "emr-ec2-lineage-demo-stack" stack, displaying 9 outputs including the Job1SubmitCommand — an AWS EMR add-steps command with Apache Spark configuration for the EMR Lineage Demo Job targeting cluster j-3APWOTUDNYO2T.

  1. Run the command in your terminal or AWS CloudShell.
  2. Monitor the job within the Amazon EMR console beneath Steps.

Screenshot of the Amazon EMR console Steps tab for the cluster "EMR-Lineage-Demo-emr-ec2-lineage-demo-stack," showing one completed step named "EMR-Lineage-Demo-Job" with Step ID s-0270631D8DHBCJZKBAZ and a green "Completed" status checkmark.

Run worker compensation evaluation job

Now, you’ll calculate the whole worker compensation (Iceberg desk) by combining wage changes (salary_adjustments.csv dataset), bonuses (bonus_payments.csv dataset), and attendance (calculated within the final step):

  1. Repeat the steps 1 to 4 to run Job 2.
  2. After completion, open the AWS Glue console.
  3. Navigate to Information Catalog, then Tables and select your SageMaker undertaking’s database.
  4. Affirm that employee_attendance and employee_compensation tables are listed.

With each Spark jobs full, now you can visualize the entire information lineage graph in Amazon SageMaker Unified Studio.

Visualizing lineage in SageMaker Unified Studio

SageMaker Unified Studio offers a graph-based information lineage visualization that helps information engineers, analysts, and information scientists clearly perceive which supply datasets (information or tables) feed into every dataset, what transformations and logic are utilized at each step, which downstream analytics property devour the info, and the way modifications to upstream information or transformations could influence the remainder of the info pipeline.

Now that the info pipeline run efficiently, let’s evaluation the captured lineage for the HR information in SageMaker Unified Studio:

  1. Navigate to the SageMaker Unified Studio console, sign up to your area.
  2. Open your undertaking and go to Information Sources
  3. Discover your AWS Glue Information Catalog supply

Screenshot of the Amazon SageMaker project catalog Data Sources page listing three configured data sources: a Redshift Serverless source, an AWS Glue Lakehouse source named "AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource" (highlighted), and a Tooling SageMaker model package group source — all scheduled MTWTFSS and in Ready or Running status.

  1. Click on RUN. Two new property can be created.

Screenshot of the AWS Glue Data Catalog interface showing run activities for the data source "AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource," with two completed on-demand runs and a highlighted asset table showing employee_attendance and employee_compensation successfully created in the emr_ec2_lineage_blogpost_glue_db database.

  1. Navigate to Belongings and Click on on employee_compensation. Beneath the LINEAGE tab you’ll discover the lineage graph view that SageMaker builds primarily based on the OpenLineage metadata captured from the EMR Spark jobs as they run.

AWS Glue data lineage visualization showing the flow of the employee_compensation dataset from an Apache Spark job (default.emr_lineage_compensa, COMPLETE, Dec 22 2025 11:42:47 AM) through an AWS Glue Iceberg table (20 columns) to an AWS Glue Inventory destination table, with a right sidebar displaying lineage metadata including the dataset ARN, OpenLineage producer URL, Iceberg snapshot ID, and projected field names EmployeeID, Name, and Department.

    • You’ll first see three lineage nodes from left to proper: one representing the EMR Spark job that created the ultimate Iceberg desk, a second one representing the precise Iceberg desk within the Glue catalog, and a 3rd one representing the info asset within the SageMaker Catalog stock that maps to the Glue desk.
    • Click on on any lineage node to view its underlying metadata within the particulars pane, together with dataset names, S3 areas, schema, information sorts, job execution particulars and extra.
  1. Broaden the lineage to the left by clicking on the double arrow subsequent to the primary lineage node. Hold increasing till you hit the originating datasets.

Data pipeline lineage diagram showing the complete ETL flow from Amazon S3 source files (input/attendance.csv with 6 columns, input/employees.csv with 5 columns) through two Apache Spark jobs to intermediate tables (input/salary_adjustments.csv, iceberg/employee.csv, AWS Glue employee_attendance with 14 columns) and final destination tables (AWS Glue iceberg/employee_compensation with 29 columns, AWS Glue Inventory employee_compensation_hive with 30 columns), all timestamped Dec 22, 2025.

    • Increasing the graph to the left reveals the entire information pipeline again to authentic CSV supply information. You possibly can see how compensation information is determined by upstream attendance analytics.
    • Observe how every lineage node represents a component within the information pipeline you run, together with each Spark jobs and even the intermediate employee_attendance Iceberg desk that connects them.
  1. You possibly can increase column-level lineage by clicking on the column part of a lineage node of a dataset or information asset. This lets you perceive how information modifications at a column stage because it goes downstream your information pipeline.

Data lineage diagram showing the employee compensation ETL pipeline with four Amazon S3 source tables (employee.csv with 5 columns, input/attendance.csv with 6 columns, input/salary_adjustments.csv with 4 columns, output/employee_attendance.csv with 14 columns) processed by two Apache Spark jobs to produce a final s3://employee_compensation table with 20 columns, all dated Dec 22, 2025.

Cleanup

To keep away from ongoing prices, clear up the assets:

  1. First, empty the vacation spot bucket by working the next command in your terminal or with AWS CloudShell.

aws s3 rm s3://${DEST_BUCKET}/ --recursive

  1. Delete the CloudFormation stack.
    • On the AWS CloudFormation console, select Stacks within the navigation pane.
    • Select the stack you created, then select Delete after which Delete stack when prompted.

Conclusion

On this publish, you discover seize information lineage from Spark jobs in Amazon EMR (v7.11+) immediately into Amazon SageMaker Unified Studio. You realized arrange an Amazon EMR cluster with native OpenLineage help to mechanically monitor lineage metadata from Spark jobs processing your information. You additionally configured the combination between EMR and Amazon SageMaker Catalog to make sure lineage info flows seamlessly into your governance platform. Lastly, you explored the ensuing lineage graph in SageMaker Unified Studio and noticed the way it offers complete visibility into information transformations, from supply CSV information by way of Spark processing jobs to ultimate analytical tables utilizing Apache Iceberg format.

We encourage you to now check these capabilities with your personal information pipelines working on EMR. By implementing automated lineage monitoring, many shoppers have strengthened their governance frameworks whereas gaining priceless insights into information dependencies, influence evaluation, and compliance necessities. This method allows information groups to construct belief of their analytics outputs whereas sustaining the agility wanted to derive enterprise worth from their information property.


In regards to the authors

Yanick Houngbedji is a Options Architect for Unbiased Software program Distributors (ISV) at Amazon Internet Companies (AWS), primarily based in Montréal, Canada. He focuses on serving to prospects architect and implement extremely scalable, performant, and safe cloud options on AWS. Earlier than becoming a member of AWS, he spent over 8 years offering technical management in information engineering, huge information analytics, enterprise intelligence, and information science options.

Jose Romero is a Senior Options Architect for Startups at Amazon Internet Companies (AWS) primarily based in Austin, TX, US. He’s obsessed with serving to prospects architect trendy platforms at scale for information, AI, and ML. As a former senior architect in AWS Skilled Companies, he enjoys constructing and sharing options for frequent advanced issues in order that prospects can speed up their cloud journey and undertake greatest practices. Join with him on LinkedIn.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles