We’re excited to announce the General Availability (GA) of PySpark’s Data Source API for Apache Spark™ 4.0 on Databricks Runtime (DBR) 15.4 LTS and above. This powerful feature enables developers to build custom data connectors for Spark using pure Python. It simplifies integration with external and non-Spark-native data sources, opening new possibilities for data pipelines and machine learning workflows.
Why This Matters
Today, data is ingested from myriad sources: some structured, some unstructured, and some multi-modal, like images and videos. Spark natively supports Data Source v1 (DSv1) and Data Source v2 (DSv2) standard formats like Delta, Iceberg, Parquet, JSON, CSV, and JDBC. However, it doesn’t provide built-in support for many other sources, such as Google Sheets, REST APIs, HuggingFace datasets, tweets from X, or proprietary internal systems. While DSv1/DSv2 can technically be extended to implement these sources, the process is overly complex and often unnecessary for lightweight use cases.
What if you need these other custom data sources to read from or write to for your use case, or what if your ETL pipeline for a machine learning use case needs to consume this data to train a model? That’s exactly the gap the Python Data Source API fills.
This blog will explore how you can write custom data sources in PySpark. Using this API, you can easily bring diverse datasets not built into Spark into your data processing pipelines for your specific use case. We will also explore some examples of custom data sources. But first, let’s understand the why and the what.
What Is the Python Data Source API?
Inspired by people’s love for writing in Python and the ease of installing packages with pip, the Python Data Source API makes it easy to build custom readers and writers for Spark using Python. This API unlocks access to any data source, eliminating the complex DSv1 and DSv2 development and knowledge of Spark internals that custom connectors previously required.
What Are the Key Features and Benefits?
There are several key benefits to using Python data sources.
1. Pure Python Implementation
Consider the Pythonic feel first. The Python Data Source API removes the barriers of more complicated JVM-based connector development. Data engineers building complex ETL pipelines over myriad data sources in Python can now leverage their existing Python skills to create custom connectors without diving into Spark’s internal complexities.
2. Support for Both Batch and Streaming Operations
The API supports both batch and streaming reads, allowing you to build connectors that handle multiple data access patterns (a short sketch follows the list):
- Batch Reading: Pull data from APIs, databases, or other sources in a single operation
- Stream Reading: Continuously ingest data from event-based or real-time sources
- Batch and Stream Writing: Write to data sinks, including Declarative Pipeline sinks
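Here is a minimal sketch of what that looks like in practice, assuming a custom connector has already been registered under the hypothetical short name "myapi"; the same format string drives both batch and streaming reads:

```python
# Batch: pull the data in a single operation
batch_df = (spark.read.format("myapi")        # "myapi" is a hypothetical registered connector
            .option("endpoint", "https://api.example.com/data")
            .load())

# Streaming: continuously ingest from the same source
stream_df = spark.readStream.format("myapi").load()
query = (stream_df.writeStream
         .format("console")                   # any sink works; console shown for brevity
         .trigger(processingTime="10 seconds")
         .start())
```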
3. Accessibility from SQL
Like any supported data source in Spark SQL, you can just as easily access your custom Python Data Source from Spark SQL. Once a data source has been loaded as a DataFrame, you can save it as a temp view or a persistent Unity Catalog managed table. This lets you incorporate your custom data sources into your downstream SQL analytics, as the following sketch shows.
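A minimal sketch, again assuming a hypothetical registered connector named "myapi" (the view, column, and table names are placeholders):

```python
# Load the custom source as a DataFrame, then expose it to SQL
df = spark.read.format("myapi").load()
df.createOrReplaceTempView("api_events")

spark.sql("""
    SELECT status, COUNT(*) AS event_count
    FROM api_events
    GROUP BY status
""").show()

# Or persist it as a Unity Catalog managed table for downstream analytics
df.write.saveAsTable("main.analytics.api_events")
```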
4. Simplified Integration with External Services
You can easily connect to external systems by passing in your custom options, like API keys, endpoints, or other configs, using the DataFrame API. This gives you full control over how your connector behaves. See the real-world examples below for details.
5. Community-Driven Connector Ecosystem
Since its preview release, the community has already begun building useful connectors with the Python Data Source API:
- Example Connectors: Reference implementations for REST APIs, CSV variants, and more (GitHub repository)
- HuggingFace Connector: Direct access to datasets from HuggingFace (GitHub repository)
6. Speed Without the Complexity
The Python Data Source API is designed not just for ease of use, but also for speed. It’s built on Apache Arrow, an in-memory data format optimized for fast data processing. This means your data can move between your custom connector and Spark with minimal overhead, making ingestion and writing much faster.
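One concrete way to tap into this: a reader’s read() method can yield pyarrow.RecordBatch objects instead of individual tuples, so Spark ingests whole columnar batches at once. A minimal sketch (the "fastsource" name and toy data are illustrative):

```python
import pyarrow as pa
from pyspark.sql.datasource import DataSource, DataSourceReader

class FastReader(DataSourceReader):
    def read(self, partition):
        # Yielding a pyarrow.RecordBatch hands Spark an entire columnar
        # batch at once, avoiding per-row serialization overhead.
        yield pa.record_batch(
            [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])],
            names=["id", "value"],
        )

class FastDataSource(DataSource):
    @classmethod
    def name(cls):
        return "fastsource"

    def schema(self):
        return "id bigint, value double"

    def reader(self, schema):
        return FastReader()
```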
Python Data Source Real-World Use Cases
Let’s explore real-world use cases for feeding data into your pipelines using Python Data Sources.
Custom API Integrations
First, let’s examine how you can benefit from integrating REST APIs.
Many data engineering teams build custom connectors to pull data from REST APIs and use it in downstream transformations. Instead of writing custom code to fetch the data, save it to disk or memory, and then load it into Spark, you can use the Python Data Source API to skip these steps.
With a custom data source, you can read data directly from an API into a Spark DataFrame, with no intermediate storage required. For example, here’s how to fetch the full output from a REST API call and load it straight into Spark.
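What follows is a condensed sketch of the idea; the endpoint URL, JSON shape, and class names are hypothetical, and the reference implementation linked below is more complete:

```python
import requests
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestApiReader(DataSourceReader):
    def __init__(self, options):
        self.url = options.get("url")

    def read(self, partition):
        # Fetch the full API payload and emit one row per record,
        # straight into Spark -- no intermediate storage.
        for record in requests.get(self.url, timeout=30).json():
            yield (record.get("id"), record.get("name"))

class RestApiDataSource(DataSource):
    @classmethod
    def name(cls):
        return "restapi"

    def schema(self):
        return "id int, name string"

    def reader(self, schema):
        return RestApiReader(self.options)

# Register once per session, then read it like any built-in format
spark.dataSource.register(RestApiDataSource)
df = (spark.read.format("restapi")
      .option("url", "https://api.example.com/users")  # hypothetical endpoint
      .load())
df.show()
```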
The REST API data source and its full implementation are available here as a reference example.
Unity Catalog Integration
The second integration is with data catalogs. More enterprise developers are turning to data catalogs, such as Unity Catalog, to store their AI and data assets in a central repository for centralized data governance and security. As this trend continues, your data pipeline jobs should be able to read from and write to these data assets in a secure and controlled manner.
You can read data directly from these custom data sources and write to Unity Catalog tables, bringing governance, security, and discoverability to data from any source:
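A short sketch, reusing the hypothetical "restapi" connector from above (the catalog, schema, and table names are placeholders):

```python
# Read from the custom source...
df = (spark.read.format("restapi")
      .option("url", "https://api.example.com/users")
      .load())

# ...and land it in a governed Unity Catalog managed table
(df.write
   .mode("overwrite")
   .saveAsTable("main.ingest.api_users"))
```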
This integration ensures that data from your specialized sources can be properly governed and secured through Unity Catalog.
Machine Learning Pipeline Integration
The third integration is with external machine learning datasets. Data scientists use the Python Data Source API to connect directly to specialized Machine Learning (ML) datasets and model repositories. HuggingFace has a ton of datasets explicitly curated for training and testing classical ML models.
To fetch such a dataset as a Spark DataFrame, you can use the HuggingFace connector. This connector leverages the power of the Python Data Source API to easily fetch ML assets that can be integrated into your data pipeline.
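A sketch of typical usage based on the connector’s published examples (the dataset name is just an example):

```python
# Requires: pip install pyspark_huggingface
import pyspark_huggingface  # importing the package registers the "huggingface" format

# Load a public HuggingFace dataset directly as a Spark DataFrame
df = spark.read.format("huggingface").load("rotten_tomatoes")
df.show(5)
```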
Once fetched, the resulting Spark DataFrame can be used with the relevant machine learning algorithm to train, test, and evaluate the model. Simple!
For more examples, check out the HuggingFace DataSource Connector.
Stream Processing with Custom Sources
Moreover, as a fourth integration point, streaming data sources are as much a part of your daily ETL pipelines as static sources in storage. You can also build streaming applications with custom sources that continuously ingest data.
Here is a code snippet for a custom Spark data source that streams real-time aircraft tracking data from the OpenSky Network API. The OpenSky Network, a community-driven receiver network, gathers air traffic surveillance data and offers it as open data to researchers and enthusiasts. To view a full implementation of this streaming custom data source, check the GitHub source here.
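The snippet below is a condensed sketch of the idea rather than the linked implementation: it polls the public /states/all endpoint once per microbatch via the SimpleDataSourceStreamReader interface (field positions follow the OpenSky state-vector layout; error handling is omitted):

```python
import requests
from pyspark.sql.datasource import DataSource, SimpleDataSourceStreamReader

API_URL = "https://opensky-network.org/api/states/all"

class OpenSkyStreamReader(SimpleDataSourceStreamReader):
    def initialOffset(self):
        return {"time": 0}

    def read(self, start):
        # Poll the REST endpoint once per microbatch; return the rows
        # plus the next offset (the API's report timestamp).
        resp = requests.get(API_URL, timeout=30).json()
        rows = [
            (s[0], s[1], s[5], s[6])  # icao24, callsign, longitude, latitude
            for s in (resp.get("states") or [])
        ]
        return iter(rows), {"time": resp["time"]}

class OpenSkyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "opensky"

    def schema(self):
        return "icao24 string, callsign string, longitude double, latitude double"

    def simpleStreamReader(self, schema):
        return OpenSkyStreamReader()

spark.dataSource.register(OpenSkyDataSource)
(spark.readStream.format("opensky").load()
 .writeStream.format("console").start())
```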
Declarative Pipeline Integration
Finally, as a data engineer, you can easily integrate Python Data Sources with Declarative Pipelines.
Reading From a Custom Data Source
In a Declarative Pipeline, ingesting data from a custom data source works the same way as in regular Databricks Jobs, as the short sketch below shows.
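A minimal sketch, reusing the hypothetical "restapi" connector from the earlier example as the table’s source:

```python
import dlt

@dlt.table
def api_users():
    # Ingest from the registered custom data source exactly as you
    # would in a regular job
    return (spark.read.format("restapi")
            .option("url", "https://api.example.com/users")
            .load())
```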
Writing to External Services via a Custom Data Source
In this example blog, Alex Ott demonstrates how the new Declarative Pipelines Sinks API allows you to use a sink object that points to an external Delta table or other streaming targets like Kafka as a built-in data source.
However, you can also write to a custom sink implemented as a Python Data Source. In the code below, we create a sink using the sink API and use it as the “sink” object. Once defined, you can append flows to it. You can peruse the full implementation of this code here.
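A sketch of the shape this takes, assuming a custom Python data source with a stream writer has been registered under the hypothetical short name "mysink", and that the sink API accepts that name as its format (the source table and options are placeholders):

```python
import dlt

# Create a sink backed by the registered custom data source
dlt.create_sink(
    name="external_sink",
    format="mysink",                               # hypothetical registered short name
    options={"endpoint": "https://api.example.com/ingest"},
)

# Append a flow that continuously writes to the custom sink
@dlt.append_flow(name="to_external", target="external_sink")
def to_external():
    return spark.readStream.table("events")        # placeholder source table
```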
Building Your Custom Connectors
To start using the Python Data Source API to build your custom connector, follow these four steps (a minimal end-to-end example follows the list):
- Ensure you have Spark 4.0 or Databricks Runtime 15.4 LTS or later: The Python Data Source API is available in DBR 15.4 LTS and above, or use Databricks Serverless Generic Compute.
- Use the implementation template: Reference the base classes in the pyspark.sql.datasource module.
- Register your connector: Make your connector discoverable in your Databricks workspace.
- Use your connector: Your connector can be used just like any built-in data source.
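Here is a minimal “hello world” connector that walks through steps two through four (the "hello" name is illustrative):

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

# Step 2: implement the base classes from pyspark.sql.datasource
class HelloReader(DataSourceReader):
    def read(self, partition):
        yield ("hello",)

class HelloDataSource(DataSource):
    @classmethod
    def name(cls):
        return "hello"

    def schema(self):
        return "greeting string"

    def reader(self, schema):
        return HelloReader()

# Step 3: register the connector with the session
spark.dataSource.register(HelloDataSource)

# Step 4: use it like any built-in data source
spark.read.format("hello").load().show()
```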
Customer Success Stories Using the Python Data Source API
At Shell, data engineers often needed to combine data from built-in Spark sources like Apache Kafka with external sources accessed via REST APIs or SDKs. This led to one-off, bespoke custom code that was hard to maintain and inconsistent across teams. Shell’s chief digital technology advisor noted that it wasted time and added complexity.
“We write a lot of cool REST APIs, including for streaming use cases, and would love to just use them as a data source in Databricks instead of writing all the plumbing code ourselves.” — Bryce Bartmann, Chief Digital Technology Advisor, Shell.
To make the most of those REST APIs, Shell’s data engineers used the new Python Data Source API to implement them as a data source. It allowed developers to treat APIs and other non-standard sources as first-class Spark data sources. With object-oriented abstractions, it was easier to plug in custom logic cleanly, with no more messy glue code.
Conclusion
In summary, PySpark’s Python Data Source API lets Python developers bring custom data into Apache Spark™ using familiar Python, combining simplicity and performance without requiring deep knowledge of Spark internals. Whether connecting to REST APIs, accessing machine learning datasets like HuggingFace, or streaming data from social platforms, this API simplifies custom connector development with a clean, Pythonic interface. It bypasses the complexity of JVM-based connectors and gives data teams the flexibility to build, register, and use their sources directly in PySpark and SQL.
By supporting batch and streaming workloads and integrating with Unity Catalog, the API ensures your external data stays governed and accessible. From real-time ML pipelines to analytics and data ingestion, the Python Data Source API turns Spark into an extensible data platform.
Future Roadmap
As the Python Data Source API continues to evolve, we’re excited about several upcoming enhancements:
- Column Pruning and Filter Pushdown: More sophisticated capabilities to optimize data transfer by moving filtering and column selection closer to the source
- Support for Custom Statistics: Allow connectors to provide source-specific statistics to improve query planning and optimization
- Better Observability and Debuggability: Enhanced logging tools to simplify connector development and troubleshooting
- Expanded Example Library: More reference implementations for familiar data sources and usage patterns
- Performance Optimizations: Continued improvements to reduce serialization overhead and increase throughput
Try It Today
The Python Data Source API is generally available today across the Databricks Intelligence Platform, including Databricks Runtime 15.4 LTS and above and Serverless environments.
Start building custom connectors to integrate any data source with your data lakehouse! And learn how others have implemented Python Data Sources by listening to these talks presented at the Data + AI Summit 2025:
Want to learn more? Contact us for a personalized Python Data Source API demo or explore our documentation to get started.
