Seize knowledge lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker


The subsequent era of Amazon SageMaker is the middle in your knowledge, analytics, and AI. SageMaker brings collectively AWS synthetic intelligence and machine studying (AI/ML) and analytics capabilities and delivers an built-in expertise for analytics and AI with unified entry to knowledge. From Amazon SageMaker Unified Studio, a single interface, you possibly can entry your knowledge and use a collection of highly effective instruments for knowledge processing, SQL analytics, mannequin improvement, coaching and inference, in addition to generative AI improvement. This unified expertise is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which delivers an embedded generative AI and governance expertise at each step.

With knowledge lineage, now a part of SageMaker Catalog, area directors and knowledge producers can centralize lineage metadata of their knowledge property in a single place. You possibly can observe the stream of information over time, providing you with a transparent understanding of the place it originated, the way it has modified, and its final use throughout the enterprise. By offering this stage of transparency across the origin of information, knowledge lineage helps knowledge shoppers achieve belief that the info is appropriate for his or her use case. As a result of knowledge lineage is captured on the desk, column, and job stage, knowledge producers may conduct impression evaluation and reply to knowledge points when wanted.

Seize of information lineage in SageMaker begins after connections and knowledge sources are configured and lineage occasions are generated when knowledge is remodeled in AWS Glue or Amazon Redshift. This functionality can be totally appropriate with OpenLineage, so you possibly can additional increase knowledge lineage seize to different knowledge processing instruments. This publish walks you thru easy methods to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push knowledge lineage occasions programmatically from instruments supporting the OpenLineage customary like dbt, Apache Airflow, and Apache Spark.

Answer overview

Many third-party and open supply instruments which are used as we speak to orchestrate and run knowledge pipelines, like dbt, Airflow, and Spark, have lively help of the OpenLineage customary to offer interoperability throughout environments. With this functionality, you solely want to incorporate and configure the correct library to your setting, to have the ability to stream lineage occasions from jobs working on this device on to their corresponding output logs or to a goal HTTP endpoint that you simply specify.

With the goal HTTP endpoint choice, you possibly can introduce a sample to publish lineage occasions from these instruments into SageMaker or Amazon DataZone to additional provide help to centralize governance of your knowledge property and processes in a single place. This sample takes the type of a proxy, and its simplified structure is illustrated within the following determine.

The best way that the proxy for OpenLineage works is straightforward:

  • Amazon API Gateway exposes an HTTP endpoint and path. Jobs working with the OpenLineage package deal on high of the supported knowledge processing instruments will be arrange with the HTTP transport choice pointing to this endpoint and path. If connectivity permits, lineage occasions shall be streamed into this endpoint because the job runs.
  • An Amazon Easy Queue Service (Amazon SQS) queue buffers the occasions as they arrive. By storing them in a queue, you’ve gotten the choice to implement methods for retries and errors when wanted. For instances the place occasion order is required, we suggest using first-in-first-out (FIFO) queues; nevertheless, SageMaker and Amazon DataZone are capable of map incoming OpenLineage occasions, even when they’re out of order.
  • An AWS Lambda perform retrieves occasions from the queue in batches. For each occasion in a batch, the perform can carry out transformations when wanted and publish the ensuing occasion to the goal SageMaker or Amazon DataZone area.
  • Though it’s not proven within the structure, AWS Id and Entry Administration (IAM) and Amazon CloudWatch are key capabilities that permit safe interplay between sources with minimal permissions and logging for troubleshooting and observability.

The AWS pattern OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone supplies a working implementation of this simplified structure that you would be able to take a look at and customise as wanted. To deploy in a take a look at setting, observe the steps as described within the repository. We use an AWS CloudFormation template to deploy answer sources.

After you’ve gotten deployed the OpenLineage HTTP Proxy answer, you need to use it to publish lineage occasions from knowledge processing instruments like dbt, Airflow, and Spark right into a SageMaker or Amazon DataZone area, as proven within the following examples.

Arrange the OpenLineage package deal for Spark in AWS Glue 4.0

AWS Glue added built-in help for OpenLineage with AWS Glue 5.0 (to study extra, see Introducing AWS Glue 5.0 for Apache Spark). For jobs which are nonetheless working on AWS Glue 4.0, you continue to can stream OpenLineage occasions into SageMaker or Amazon DataZone through the use of the OpenLineage HTTP Proxy answer. This serves for example that may be utilized to different platforms working Spark like Amazon EMR, third-party options, or self-managed clusters.

To correctly add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage occasions into the OpenLineage HTTP Proxy answer, full the next steps:

  1. Obtain the official OpenLineage package deal for Spark. For our instance, we used the JAR package deal model 2.12 launch 1.9.1.
  2. Retailer the JAR file in an Amazon Easy Storage Service (Amazon S3) bucket that may be accessed by your AWS Glue job.
  3. On the AWS Glue console, open your job.
  4. Beneath Libraries, for Dependent JARs path, enter the trail of the JAR package deal saved in your S3 bucket.

  1. Within the Job parameters part, add the next parameters:
    1. Allow the OpenLineage package deal:
      1. Key: --user-jars-first
      2. Worth: true
    2. Configure how the OpenLineage package deal shall be used to stream lineage occasions. Exchange and with the corresponding values of the OpenLineage HTTP Proxy answer. These values will be discovered as outputs of the deployed CloudFormation stack. Exchange along with your AWS account ID.
      1. Key: --conf
      2. Worth:
        spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
        --conf spark.openlineage.transport.kind=http 
        --conf spark.openlineage.transport.url=
        --conf spark.openlineage.transport.endpoint=/
        --conf spark.openlineage.aspects.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
        --conf spark.glue.accountId=

With this setup, the AWS Glue 4.0 job will use the HTTP transport choice of the OpenLineage package deal to stream lineage occasions into the OpenLineage proxy, which is able to publish occasions to the SageMaker or Amazon DataZone area.

  1. Run the AWS Glue 4.0 job.

The job’s ensuing datasets must be sourced into SageMaker or Amazon DataZone in order that OpenLineage occasions are mapped to them. As you discover the sourced dataset in SageMaker Unified Studio, you possibly can observe its origin path as described by the OpenLineage occasions streamed by way of the OpenLineage proxy.

When working with Amazon DataZone, you’re going to get the identical outcome.

The origin path on this instance is in depth and maps the ensuing dataset right down to its origin, on this case, a few tables hosted in a relational database and remodeled by way of a knowledge pipeline with two AWS Glue 4.0 (Spark) jobs.

Arrange the OpenLineage package deal for dbt

dbt has quickly develop into a well-liked framework to construct knowledge pipelines on high of information processing and knowledge warehouse instruments like Amazon Redshift, Amazon EMR, and AWS Glue, in addition to different conventional and third-party options. This framework helps OpenLineage as a strategy to standardize era of lineage occasions and combine with the rising knowledge governance ecosystem.dbt deployments would possibly range per setting, which is why we don’t dive into the specifics on this publish. Nonetheless, to easily configure your dbt mission to leverage the OpenLineage HTTP Proxy answer, full the next steps:

  1. Set up the OpenLineage package deal for dbt. You possibly can study extra within the OpenLineage documentation.
  2. Within the root folder of your dbt mission, create an openlineage.yml file the place you possibly can specify the transport configuration. Exchange and with the values of the OpenLineage HTTP Proxy answer. These values will be discovered as outputs of the deployed CloudFormation stack.
transport:
  kind: http
  url: 
  endpoint: 
  timeout: 5

  1. Run your dbt pipeline. As defined within the OpenLineage documentation, as a substitute of working the usual dbt run command, you run the dbt-ol run command. The latter command is only a wrapper on high of the usual dbt run command in order that lineage occasions are captured and streamed as configured.

The job’s ensuing datasets must be sourced into SageMaker or Amazon DataZone in order that OpenLineage occasions are mapped to them. As you discover the sourced dataset in SageMaker Unified Studio, you possibly can observe its lineage path as described by the OpenLineage occasions streamed by way of the OpenLineage proxy.

When working with Amazon DataZone, you’re going to get the identical outcome.

On this instance, the dbt mission is working on high of Amazon Redshift, which is a standard use case amongst clients. Amazon Redshift is built-in for automated lineage seize with SageMaker and Amazon DataZone, however such capabilities weren’t used as a part of this instance for example how one can nonetheless combine OpenLineage occasions from dbt utilizing the sample carried out within the OpenLineage HTTP Proxy answer.The dbt pipeline is made by two phases working sequentially, that are illustrated within the origin path because the nodes with the dbt kind.

Arrange the OpenLineage package deal for Airflow

Airflow is a well-positioned device to orchestrate knowledge pipelines at any scale. AWS supplies Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed different for patrons that wish to cut back administration and speed up the event of their knowledge technique with Airflow in a cheap method. Airflow additionally helps OpenLineage, so you possibly can centralize lineage with instruments like SageMaker and Amazon DataZone.

The next steps are particular for Amazon MWAA, however they are often extrapolated to different types of deployment of Airflow:

  1. Set up the OpenLineage package deal for Airflow. You possibly can study extra within the OpenLineage documentation. For variations 2.7 and later, it’s really useful to make use of the native Airflow OpenLineage package deal (apache-airflow-providers-openlineage), which is the case for this instance.
  2. To put in the package deal, add it to the necessities.txt file that you’re storing in Amazon S3 and that you’re pointing to when provisioning your Amazon MWAA setting. To study extra, consult with Managing Python dependencies in necessities.txt.
  3. As you put in the OpenLineage package deal or afterwards, you possibly can configure it to ship lineage occasions to the OpenLineage proxy:
    1. When filling the shape to create a brand new Amazon MWAA setting or edit an present one, within the Airflow configuration choices part, add the next. Exchange and with the values of the OpenLineage HTTP Proxy answer. These values will be discovered as outputs of the deployed CloudFormation stack:
      1. Configuration choice: openlineage.transport
      2. Customized worth: {"kind": "http", "url": "", "endpoint": ""}

  1. Run your pipeline.

The Airflow duties will mechanically use the transport configuration to stream lineage occasions into the OpenLineage proxy as they run. The duty’s ensuing datasets must be sourced into SageMaker or Amazon DataZone in order that OpenLineage occasions are mapped to them.As you discover the sourced dataset in SageMaker Unified Studio, you possibly can observe its origin path as described by the OpenLineage occasions streamed by way of the OpenLineage proxy.

When working with Amazon DataZone, you’re going to get the identical outcome.

On this instance, the Amazon MWAA Directed Acyclic Graph (DAG) is working on high of Amazon Redshift, much like the dbt instance earlier than. Nonetheless, it’s nonetheless not utilizing the native integration for automated knowledge seize between Amazon Redshift and SageMaker or Amazon DataZone. This manner, we will illustrate how one can nonetheless combine OpenLineage occasions from Airflow utilizing the sample carried out within the OpenLineage HTTP Proxy answer.The Airflow DAG is made by a single job that outputs the ensuing dataset through the use of datasets that have been created as a part of the dbt pipeline within the earlier instance. That is illustrated within the origin path, the place it consists of nodes with the dbt kind and a node with AIRFLOW kind. With this ultimate instance, be aware how SageMaker and Amazon DataZone map all datasets and jobs to replicate the truth of your knowledge pipelines.

Extra issues when implementing the OpenLineage proxy sample

The OpenLineage proxy sample carried out within the pattern OpenLineage HTTP Proxy answer and introduced on this publish has proven to be a sensible method to combine a rising set of information processing instruments right into a centralized knowledge governance technique on high of SageMaker. We encourage you to dive into it and use it in your take a look at environments to learn the way it may be greatest used in your particular setup.If desirous about taking this sample to manufacturing, we propose you first evaluation it totally and customise it to your specific wants. The next are some objects value reviewing as you consider this sample implementation:

  • The answer used within the examples of this publish makes use of a public API endpoint with no authentication or authorization mechanism. For a manufacturing workload, we suggest limiting entry to the endpoint to a minimal so solely licensed sources are capable of stream messages into it. To study extra, consult with Management and handle entry to HTTP APIs in API Gateway.
  • The logic carried out within the Lambda perform is meant to be personalized relying in your use case. You would possibly must implement transformation logic, relying on how OpenLineage occasions are created by the device you might be utilizing. As a reference, for the case of the Amazon MWAA instance of this publish, some minor transformations have been required on the identify and namespace fields of the inputs and outputs components of the occasion for full compatibility with the format anticipated for Amazon Redshift datasets as described within the dataset naming conventions of OpenLineage. You may additionally want to vary how the perform logs execution particulars or embrace retry/error logic and extra.
  • The SQS queue used within the OpenLineage HTTP Proxy answer is customary, which means that occasions aren’t delivered so as. If this can be a requirement, you would use FIFO queues as a substitute.

For instances the place you wish to publish OpenLineage occasions immediately into SageMaker or Amazon DataZone, with out utilizing the proxy sample defined on this publish, a customized transport is now accessible as an extension of the OpenLineage mission model 1.33.0. Leverage this characteristic in instances the place you don’t want extra controls in your OpenLineage occasion stream, for instance, when you don’t want any customized transformation logic.

Abstract

On this publish, we confirmed easy methods to use the OpenLineage-compatible APIs of SageMaker to seize knowledge lineage from any device supporting this customary, by following an architectural sample launched because the OpenLineage proxy. We introduced some examples of how one can arrange instruments like dbt, Airflow, and Spark to stream lineage occasions to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone area. Lastly, we launched a working implementation of this sample that you would be able to take a look at and mentioned some issues when implementing this identical sample to manufacturing.

The SageMaker compatibility with OpenLineage can assist simplify governance of your knowledge property and enhance belief in your knowledge. This functionality is without doubt one of the many options that are actually accessible to construct a complete governance technique powered by knowledge lineage, knowledge high quality, enterprise metadata, knowledge discovery, entry automation, and extra. By bundling knowledge governance capabilities with the rising set of instruments accessible for knowledge and AI improvement, you possibly can derive worth out of your knowledge sooner and get nearer to consolidating a data-driven tradition. Check out this answer and get began with SageMaker to hitch the rising set of shoppers which are modernizing their knowledge platform.


In regards to the authors

Jose Romero is a Senior Options Architect for Startups at AWS, primarily based in Austin, Texas. He’s enthusiastic about serving to clients architect fashionable platforms at scale for knowledge, AI, and ML. As a former senior architect in AWS Skilled Companies, he enjoys constructing and sharing options for frequent complicated issues in order that clients can speed up their cloud journey and undertake greatest practices. Join with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Supervisor with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on constructing merchandise and their capabilities in knowledge analytics and governance. She is enthusiastic about constructing progressive merchandise to deal with and simplify clients’ challenges of their end-to-end knowledge journey. Exterior of labor, she enjoys being outdoor to hike and seize nature’s magnificence. Join along with her on LinkedIn.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles