Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources


The next generation of Amazon SageMaker is the center for all of your data, analytics, and AI. Bringing together widely adopted Amazon Web Services (AWS) machine learning (ML) and analytics capabilities, it delivers an integrated experience for analytics and AI with unified access to all of your data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, and generative AI development.

With data lineage, now part of Amazon SageMaker Catalog, you can centralize lineage metadata of your data assets in a single place. You can track the flow of data over time, gaining a clear understanding of where it originated, how it has changed, and how it is used across the business. By providing this level of transparency, data lineage helps data consumers gain trust that the data is correct and compliant for their use cases. With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset does not meet the quality required by the business.

Data lineage is a powerful tool that can transform how organizations understand and manage their data flows. In this post, we explore its real-world impact through the lens of an ecommerce company striving to boost their bottom line.

To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Using this workflow, you can capture lineage automatically from additional data sources using AWS Glue crawlers. Refer to the Data lineage support matrix in the SageMaker Unified Studio User Guide for supported sources. We also use SageMaker Unified Studio to navigate these data assets and learn about their origin, transformations, and dependencies, thanks to the lineage metadata captured using the AWS Glue crawlers.

Key features of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you can explore and discover the data assets in your organization that suit your use case. As you dive into these data assets, you can learn more about their business context, schema, quality, and lineage. When you decide to work with a subset of these assets, you can subscribe to them in a self-service fashion and start working with them. For more detail, visit Data discovery, subscription, and consumption in the SageMaker Unified Studio User Guide.

SageMaker Unified Studio provides a visual lineage graph that shows how a data asset has evolved from its source through transformations to its final state. This helps data scientists, engineers, and analysts answer key questions such as:

  • Where did this data come from?
  • What transformations has it gone through?
  • Which downstream assets will be impacted by a change?
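
Conceptually, these questions are graph traversals over lineage edges. The following Python sketch is only an illustration of that idea, not how SageMaker Catalog stores or queries lineage; the node names and the EDGES list are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream -> downstream), for illustration only.
EDGES = [
    ("s3_clickstream_file", "clickstream"),
    ("dynamodb_order_table", "ordertransactiontable"),
    ("clickstream", "conversion_dashboard"),
    ("ordertransactiontable", "conversion_dashboard"),
]


def build_index(edges):
    """Index edges in both directions so either question can be answered."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_index(EDGES)
# "Which downstream assets will be impacted by a change?"
impacted = reachable("s3_clickstream_file", downstream)
# "Where did this data come from?"
origins = reachable("conversion_dashboard", upstream)
```

Walking the downstream index answers the impact question, and walking the upstream index answers the origin question.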

With this level of visibility, teams can perform faster impact analysis, find the root cause of data quality issues, and make sure models are built on trusted data. It also supports better collaboration so users can confidently use and share data across the organization. The following screenshot shows how SageMaker Unified Studio visualizes data lineage, making it simple to trace data flow and understand dependencies.

  • Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
  • Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node, which then lists only the searched column.
  • Details pane – Each lineage node captures and displays the following details:
    • Every dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the different versions of lineage events captured for that node.
    • The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
  • View dataset nodes only – If you want to filter out the job nodes, you can choose the open view control icon in the graph viewer and toggle the display dataset nodes only option, which removes all the job nodes from the graph and lets you navigate only the dataset nodes.
  • Version tabs – All lineage nodes in Amazon DataZone data lineage have versioning, captured as history, based on the lineage events captured. You can view lineage at a selected timestamp, which opens a new tab on the lineage page to help compare or contrast between the different timestamps.

You can try some of these features as you explore the data assets of this post. To learn more about data lineage in SageMaker, we encourage you to dive deep into Data lineage in Amazon SageMaker Unified Studio.

Solution overview

Consider a scenario where an ecommerce company aims to optimize conversion rates and enhance customer experience by gaining deeper insights into the customer journey. They need to connect the dots between user interactions and actual purchases, but with data scattered across multiple sources, where do they begin? This is where data lineage becomes invaluable. To perform their analysis, they need data from two primary sources:

  • Clickstream data stored in Amazon S3 (in JSON or Parquet format)
  • Transactional order data stored as items in Amazon DynamoDB

To make these datasets discoverable across the business, you need to:

  1. Create a project in SageMaker Unified Studio that will be used to source and manage the datasets
  2. Enable data lineage capture in the SageMaker Unified Studio project
  3. Set up the resources for this use case, which include an AWS Glue data source (set up in SageMaker Unified Studio) and an AWS Glue crawler (set up in AWS Glue)
  4. Run the AWS Glue crawler to catalog the datasets in AWS Glue Data Catalog
  5. Source the metadata of the data assets into the SageMaker Catalog by running the data source
  6. Use SageMaker Unified Studio to navigate through the lineage of the data assets and visualize their origin
  7. Understand how schema evolution is captured in the data asset’s lineage

Prerequisites

To complete the steps in this post, you need a SageMaker Unified Studio domain already deployed in your AWS account. To get started quickly in a testing environment, we suggest creating your SageMaker domain using the quick setup option as explained in Create an Amazon SageMaker Unified Studio domain – quick setup.

Solution steps

To capture data lineage for AWS Glue tables managed with AWS Glue crawlers using SageMaker Unified Studio, complete the steps in the following sections.

Set up a SageMaker project with SQL capability

In SageMaker Unified Studio, a project profile defines an uber template for projects in your Amazon SageMaker unified domain. By setting up a project with the right tooling (project profile), you provision resources you can use to work with data, which might include cataloging it in SageMaker, transforming it into new data assets, analyzing it to drive business value, or even using it for ML or AI purposes.

To demonstrate data lineage effectively, we use the SageMaker SQL analytics project profile for a streamlined setup. Although this profile offers comprehensive data analytics capabilities, we focus specifically on two key components:

  • AWS Glue database – A lakehouse for storing and managing technical metadata
  • Data source job – Automatically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complex manual configurations so we can focus on the core concepts of data lineage.

To create a new project in your SageMaker domain using the SQL analytics project profile, follow the steps detailed in SQL analytics project profile. Keep all default configurations when creating the project.

After creating your project in SageMaker Unified Studio, you unlock data lineage capabilities that make tracking and understanding your data flows intuitive. Through the data sourcing feature, you can monitor how data moves from the source to the AWS Glue database. This visibility becomes particularly valuable when debugging data issues: you can quickly trace data back to its source, understand how changes impact downstream processes, and identify affected analyses or reports. Next, populate the AWS Glue database with sample data to see these features in action and demonstrate how they can streamline your data operations.

For further guidance on how to access the details of the new SageMaker project, refer to Get project details. After you access the data source details, in the Database name field, take note of the AWS Glue database name associated with the SageMaker project.

Enable data lineage capture in the SageMaker project’s data source

To enable lineage capture, follow these steps:

  1. Expand the Actions menu, then choose Edit data source.
  2. Go to the connections and select Import data lineage to configure lineage capture from the source, as shown in the following screenshot.
  3. Make other changes to the data source fields as desired, then choose Save.

Enabling lineage makes sure that the data source job captures lineage in the next run.

Deploy resources for the use case

Follow these steps:

  1. To deploy the resources required for this post, download the AWS CloudFormation template from amazon-datazone-examples in the AWS Samples GitHub repository. Deploy it in your AWS account.

For further guidance on how to deploy a CloudFormation stack, refer to Create a stack from the CloudFormation console. You must provide a Stack name and the name of the AWS Glue database (GlueDatabaseName) associated with the project of your SageMaker domain, as shown in the following screenshot.

  2. Choose Next.

The template will deploy the following resources:

  • An S3 bucket with a sample file of clickstream data. The bucket name and location of the file will follow the path pattern s3://ecomm-analytics--/clickstream////data.json. The file will contain a sample record with the following structure:

{
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z"
}

  • A DynamoDB table with a sample item of order data (transactions). The table will be named OrderTransactionTable. The sample item will have the following structure:
{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z"
}

  • An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB table deployed as part of the stack and store the metadata in the AWS Glue database associated with the SageMaker project. You can access the crawler’s details in the AWS console, as shown in the following screenshot.
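
To make the business question concrete, the two sample records above can be joined on user_id and product_id to connect a product view to a purchase. The following Python sketch is an illustration only; the helper names are hypothetical and the records are copied from the samples above.

```python
from datetime import datetime

# Sample records copied from the structures shown above.
clickstream_event = {
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z",
}
order = {
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
}


def parse_ts(value):
    """Parse the ISO 8601 timestamps used in the samples (trailing Z means UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_to_purchase(event, order_item):
    """Return seconds between a product view and a matching order, or None if they do not match."""
    same_user = event["user_id"] == order_item["user_id"]
    same_product = event["product_id"] == order_item["product_id"]
    if not (same_user and same_product):
        return None
    delta = parse_ts(order_item["order_timestamp"]) - parse_ts(event["timestamp"])
    return delta.total_seconds()


elapsed = seconds_to_purchase(clickstream_event, order)
```

In practice this join would run as a SQL query over the cataloged tables; the sketch only shows which fields tie the two sources together.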

Run the AWS Glue crawler

The AWS Glue crawler deployed in the previous step lets you capture metadata from the two data sources, Amazon S3 and DynamoDB, and store it in AWS Glue Data Catalog, specifically in the database associated with the SageMaker project. After the metadata is stored, it will be accessible to SageMaker.

Before running the crawler, you need to grant AWS Lake Formation permissions to the IAM role that the AWS Glue crawler will use to interact with your data source and target AWS Glue database. The following command will grant the permissions needed for the crawler to store metadata into the AWS Glue database of the SageMaker project.

To invoke this command, we recommend using AWS CloudShell on the AWS console as explained in AWS CloudShell Concepts. Update the <region>, <account-id>, and <database-name> placeholders with the correct values for your AWS Region, AWS account ID, and name of the AWS Glue database associated with the SageMaker project.

aws lakeformation grant-permissions \
  --region <region> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:role/glue-crawler-role \
  --permissions CREATE_TABLE \
  --resource '{ "Database": { "Name": "<database-name>" } }'
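
If you prefer to script this step, the same grant can be sketched in Python. This assumes boto3 is installed and credentials are configured; the function names are hypothetical, and the role name glue-crawler-role is taken from the command above.

```python
def build_grant_request(account_id, database_name, role_name="glue-crawler-role"):
    """Build the arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": ["CREATE_TABLE"],
    }


def grant_crawler_permissions(region, account_id, database_name):
    """Grant CREATE_TABLE on the project's Glue database to the crawler role."""
    # Imported lazily so build_grant_request can be used without the SDK installed.
    import boto3

    client = boto3.client("lakeformation", region_name=region)
    return client.grant_permissions(**build_grant_request(account_id, database_name))
```

Separating the request builder from the API call makes it possible to review the exact principal and resource before anything is granted.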
  

Next, run the AWS Glue crawler on the AWS console. After the crawler successfully finishes, two new tables, clickstream and ordertransactiontable, will be created in the AWS Glue database associated with the SageMaker project. Refer to Viewing crawler results and details to learn more about AWS Glue crawler results.
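
As an alternative to the console, the crawler run can be sketched with the AWS SDK for Python. This is a minimal sketch, assuming a boto3 Glue client is passed in; the function name, polling interval, and timeout are illustrative choices, and the crawler name must be the one deployed by your stack.

```python
import time


def run_crawler_and_wait(glue, crawler_name, poll_seconds=15, timeout_seconds=900):
    """Start a Glue crawler, wait until it is READY again, and return the last crawl status."""
    glue.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        # A crawler returns to the READY state once the run has finished.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {crawler_name} did not finish within {timeout_seconds}s")
```

Example usage, assuming configured credentials: create the client with boto3.client("glue") and call run_crawler_and_wait(client, "<your-crawler-name>").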

Source metadata from the AWS Glue database into SageMaker

To source metadata from data assets in the AWS Glue database, including their lineage, into SageMaker, use the data source that was deployed as part of the SageMaker project creation.

  1. To run the data source, go to the data source details page.
  2. Choose Run. (Data sources can also be scheduled to run; for this demonstration, we trigger a manual run.)
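
The manual run can also be sketched programmatically. The following assumes the Amazon DataZone API is used through a boto3 datazone client that is passed in; the domain ID and data source ID are values you look up yourself, and the set of in-progress statuses is an assumption of this sketch.

```python
import time

# Assumption: statuses that mean the run is still in progress; anything else is treated as terminal.
IN_PROGRESS = {"REQUESTED", "RUNNING"}


def run_data_source(datazone, domain_id, data_source_id, poll_seconds=10, timeout_seconds=900):
    """Start a data source run and poll until it reaches a terminal status."""
    run = datazone.start_data_source_run(
        domainIdentifier=domain_id, dataSourceIdentifier=data_source_id
    )
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        current = datazone.get_data_source_run(
            domainIdentifier=domain_id, identifier=run["id"]
        )
        if current["status"] not in IN_PROGRESS:
            return current["status"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Data source run {run['id']} did not finish within {timeout_seconds}s")
```

The function returns the final status string so the caller can decide how to handle a failed or partially successful run.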

After the data source run is complete, metadata from both data assets in the AWS Glue database will be imported into the SageMaker domain as the project’s inventory assets. You can find the details of the data source run from within SageMaker Unified Studio, which include:

  • The data assets from the AWS Glue database that were ingested into SageMaker.
  • The status of the data lineage import for each data asset, which includes an event ID for traceability. This lineage event ID can be used to debug inconsistencies in the resulting lineage graph. You can use the GetLineageEvent API to retrieve the raw payload of the lineage event.
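
A minimal sketch of retrieving a lineage event payload with the GetLineageEvent API follows. It assumes a boto3 datazone client is passed in, that the payload is returned under the event key as bytes or a readable stream, and that the payload is JSON; the function name is hypothetical.

```python
import json


def fetch_lineage_event(datazone, domain_id, event_id):
    """Return the raw lineage event payload as a Python dict."""
    response = datazone.get_lineage_event(domainIdentifier=domain_id, identifier=event_id)
    payload = response["event"]
    # Assumption: the payload may arrive as a stream, as bytes, or as a string.
    raw = payload.read() if hasattr(payload, "read") else payload
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)
```

Example usage, assuming configured credentials: fetch_lineage_event(boto3.client("datazone"), "<domain-id>", "<lineage-event-id>"), using the event ID shown in the data source run details.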

Visualizing the data lineage graph of the data assets in SageMaker Unified Studio

With SageMaker Unified Studio, you have a single place to manage and discover data assets. When accessing a data asset published in the SageMaker central catalog or in your project’s own inventory, you can dive into the asset’s metadata, which includes its schema, business description, custom metadata forms, quality, lineage, and more. To visualize the lineage graph of each data asset of this post, follow these steps:

  1. In SageMaker Unified Studio, navigate to the Assets section of the SageMaker project details page and choose INVENTORY.
  2. Select the asset that you want to explore. You can also access the asset directly from the data source run by selecting the asset name.
  3. To view the lineage graph of the data asset up to its origin, shown in the following screenshots, choose the LINEAGE tab.
    • For the clickstream table (sourced from S3)

    • For the order transactions table (sourced from DynamoDB)

With lineage, you can now confirm that the data originated from sources such as Amazon S3 and Amazon DynamoDB and understand how it has been transformed along the way. Thanks to this end-to-end visibility, you can trust the data, make informed decisions, and support compliance with confidence. The lineage graph captures essential metadata that forms the foundation of lineage tracking.

  • This includes table schemas, column definitions, and their data types.
  • Column-level lineage becomes particularly powerful in this context. Imagine your clickstream AWS Glue table powers an Amazon QuickSight dashboard analyzing customer purchase patterns, and you notice discrepancies in your revenue reports. With column lineage, you can immediately trace the source of those columns.
  • This granular visibility not only accelerates debugging but also proves invaluable during schema changes, as we show in the following section by changing the source schema.
  • The crawler details, such as crawlerRunId (present in the source identifier of the lineage node) and crawler start and end times, can be used to debug which crawler runs updated the table.
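
To relate a crawler run seen in the lineage graph to the crawler’s own history, you could look it up with the AWS Glue ListCrawls API. The following sketch assumes a boto3 Glue client is passed in and that the crawlerRunId from the lineage node corresponds to the CrawlId returned by AWS Glue; that mapping is an assumption to verify in your account.

```python
def find_crawl(glue, crawler_name, crawl_id):
    """Search a crawler's run history for one crawl ID and return a short summary, or None."""
    kwargs = {"CrawlerName": crawler_name}
    while True:
        page = glue.list_crawls(**kwargs)
        for crawl in page.get("Crawls", []):
            if crawl.get("CrawlId") == crawl_id:
                keys = ("CrawlId", "State", "StartTime", "EndTime", "Summary")
                return {key: crawl.get(key) for key in keys}
        token = page.get("NextToken")
        if not token:
            return None
        # Continue with the next page of crawl history.
        kwargs["NextToken"] = token
```

The returned start and end times can then be compared with the timestamps shown on the lineage node.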

Understanding your data asset’s schema evolution through lineage in SageMaker Unified Studio

Imagine the order transactions source in DynamoDB was updated with new information. Because this source powers an Amazon QuickSight report for the customer using the AWS Glue database table, it’s important for consumers to know what changes in the data pipeline updated the report.

  1. Edit the DynamoDB table item with additional columns to learn how the lineage graph can be used to view historical updates:
{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
    "customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign"
}

  2. Run the OrderTransactionsCrawler AWS Glue crawler again on the AWS console. After completion, you’ll notice that it updated the ordertransactiontable AWS Glue table, as shown in the following screenshot.

  3. Run the data source associated with the project in SageMaker Unified Studio again to import the latest metadata into the SageMaker Catalog. After completion, you’ll notice the data source updated the ordertransactiontable data asset in the SageMaker Catalog, as shown in the following screenshot.

This section explores how lineage can be useful to track the updates.

Navigate to the ordertransactiontable data asset in SageMaker Catalog by selecting it from the data source run and choose the LINEAGE tab, as shown in the following screenshot.

Notice how the new columns are available in the lineage graph. A new crawler run ID is present as the source identifier of the crawler lineage node. The History tab shows multiple crawler runs. You can navigate to inspect the state of the system during the first run.
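
The schema change that the two lineage versions reveal can be reproduced as a simple column diff. The following sketch compares the field names of the original and updated sample items from this post; the function name is hypothetical, and column names in the resulting AWS Glue table may be normalized (for example, lowercased), so treat the output as illustrative.

```python
def diff_columns(before, after):
    """Return the column names added and removed between two schema versions."""
    before_cols, after_cols = set(before), set(after)
    return {
        "added": sorted(after_cols - before_cols),
        "removed": sorted(before_cols - after_cols),
    }


# Field names taken from the original and the updated DynamoDB sample items.
original_item = ["order_id", "user_id", "product_id", "order_total", "order_timestamp"]
updated_item = original_item + ["customerSegment", "conversionSource"]

changes = diff_columns(original_item, updated_item)
```

Here the diff reports the two new fields as added and nothing as removed, which matches what the LINEAGE and HISTORY tabs show across the two crawler runs.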

Cleanup

After you’re done, we recommend cleaning up the resources created for this post to avoid unintended charges:

  1. Delete the inventory assets that were cataloged in the SageMaker project’s inventory, as explained in Delete an Amazon SageMaker Unified Studio asset.
  2. Delete the SageMaker project that was created as part of this post, as explained in Delete a project.
  3. Delete the CloudFormation stack that was deployed as part of this post, as explained in Delete a stack from the CloudFormation console.
  4. The S3 bucket created as part of the CloudFormation stack will remain after the stack is deleted because it contains a data file. Empty and delete the bucket, as explained in Deleting a general purpose bucket.

Conclusion

In this post, you explored the data lineage capabilities of Amazon SageMaker, specifically when working with AWS Glue crawlers. You learned how to set up an AWS Glue crawler to infer metadata from data assets in multiple sources such as Amazon S3 and DynamoDB and store it in the AWS Glue Data Catalog. You also imported this metadata, including data lineage, into Amazon SageMaker through the data source capability of a SageMaker project. Finally, you explored the resulting lineage graph of data assets in SageMaker Unified Studio and saw some of the functionality available to trace their origin, understand how columns are transformed, and see what the impact looks like when changing any step of the pipeline.

We encourage you to now test the capabilities you explored in this post with your own data. By following the pattern presented in this post, many customers have been able to achieve governance of their data lake and lakehouse platforms on top of Amazon SageMaker with data lineage and more.


About the authors

Mohit Dawar is a Senior Software Engineer at Amazon Web Services (AWS) working on Amazon DataZone. Over the past 3 years, he has led efforts around the core metadata catalog, generative AI-powered metadata curation, and lineage visualization. He enjoys working on large-scale distributed systems, experimenting with AI to improve user experience, and building tools that make data governance feel effortless. Connect with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He’s passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn: Jose Romero.
