How UiPath Built a Scalable Real-Time ETL pipeline on Databricks

August 17, 2025

Delivering on the promise of real-time agentic automation requires a fast, reliable, and scalable data foundation. At UiPath, we needed a modern streaming architecture to underpin products like Maestro and Insights, enabling near real-time visibility into agentic automation metrics as they unfold. That journey led us to unify batch and streaming on Azure Databricks using Apache Spark™ Structured Streaming, enabling cost-efficient, low-latency analytics that support agentic decision-making across the enterprise.

This blog details the technical approach, trade-offs, and impact of these enhancements.

With Databricks-based streaming, we have achieved sub-minute event-to-warehouse latency while delivering a simplified architecture and future-proof scalability, setting the new standard for event-driven data processing across UiPath.

Why Streaming Matters for UiPath Maestro and UiPath Insights

At UiPath, products like Maestro and Insights rely heavily on timely, reliable data. Maestro acts as the orchestration layer for our agentic automation platform, coordinating AI agents, robots, and humans based on real-time events. Whether it’s reacting to a system trigger, executing a long-running workflow, or including a human-in-the-loop step, Maestro depends on fast, accurate signal processing to make the right decisions.

UiPath Insights, which powers monitoring and analytics across these automations, adds another layer of demand: capturing key metrics and behavioral signals in near real time to surface trends, calculate ROI, and support issue detection.

Delivering these kinds of outcomes (reactive orchestration and real-time observability) requires a data pipeline architecture that is not only low-latency, but also scalable, reliable, and maintainable. That need is what led us to rethink our streaming architecture on Azure Databricks.

Building the Streaming Data Foundation

Delivering on the promise of powerful analytics and real-time monitoring requires a foundation of scalable, reliable data pipelines. Over the past few years, we have developed and expanded multiple pipelines to support new product features and respond to evolving business requirements. Now, we have the opportunity to assess how we can optimize these pipelines to not only save costs, but also gain better scalability and an at-least-once delivery guarantee to support data from new services like Maestro.

Previous architecture

While our previous setup (shown above) worked well for our customers, it also revealed areas for improvement:

The batching pipeline introduced up to 30 minutes of latency and relied on a complex infrastructure.
The real-time pipeline delivered faster data but came with higher cost.
For Robotlogs, our largest dataset, we maintained separate ingestion and storage paths for historical and real-time processing, resulting in duplication and inefficiency.
To support the new ETL pipeline for UiPath Maestro, a new UiPath product, we would need to achieve an at-least-once delivery guarantee.

To address these challenges, we undertook a major architectural overhaul. We merged the batching and real-time ingestion processes for Robotlogs into a single pipeline, and re-architected the real-time ingestion pipeline to be more cost-efficient and scalable.

Why Spark Structured Streaming on Databricks?

As we set out to simplify and modernize our pipeline architecture, we needed a framework that could handle both high-throughput batch workloads and low-latency real-time data without introducing operational overhead. Spark Structured Streaming (SSS) on Azure Databricks was a natural fit.

Built on top of Spark SQL and Spark Core, Structured Streaming treats real-time data as an unbounded table, allowing us to reuse familiar Spark batch constructs while gaining the benefits of a fault-tolerant, scalable streaming engine. This unified programming model reduced complexity and accelerated development.

We had already leveraged Spark Structured Streaming to develop our Real-time Alert feature, which uses stateful stream processing in Databricks. Now, we are expanding its capabilities to build our next generation of real-time ingestion pipelines, enabling us to achieve low latency, scalability, cost efficiency, and at-least-once delivery guarantees.

The Next Generation of Real-time Ingestion

Our new architecture, shown below, dramatically simplifies the data ingestion process by consolidating previously separate components into a unified, scalable pipeline using Spark Structured Streaming on Databricks:

At the core of this new design is a set of streaming jobs that read directly from event sources. These jobs perform parsing, filtering, flattening, and, most critically, joining each event with reference data to enrich it before writing to our data warehouse.
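To make the four steps concrete, here is a minimal plain-Python sketch of what one micro-batch goes through. It is not the production Spark job: the event shape, the `job.completed` type, the `tenantId` key, and the `TENANTS` lookup table are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical reference data keyed by tenant ID (illustrative only).
TENANTS = {"t-1": {"tenantName": "Acme", "region": "eu"}}


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def process_batch(raw_messages, reference=TENANTS):
    """Parse, filter, flatten, and enrich one micro-batch of raw JSON strings."""
    rows = []
    for raw in raw_messages:
        try:
            event = json.loads(raw)                 # parse
        except json.JSONDecodeError:
            continue                                # skip malformed payloads
        if not isinstance(event, dict):
            continue
        if event.get("type") != "job.completed":    # filter (assumed event type)
            continue
        row = flatten(event)                        # flatten
        # Enrich: a left join against reference data; unknown tenants add nothing.
        row.update(reference.get(row.get("tenantId"), {}))
        rows.append(row)
    return rows
```

In Spark the same logic would typically be expressed as DataFrame transformations and a join against a reference table, so the engine can parallelize it across partitions.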

We orchestrate these jobs using Databricks Lakeflow Jobs, which helps manage retries and job recovery in case of transient failures. This streamlined setup improves both developer productivity and system reliability.

The benefits of this new architecture include:

Cost efficiency: Saves COGS by reducing infrastructure complexity and compute usage
Low latency: Ingestion latency averages around one minute, with the flexibility to reduce this further
Future-proof scalability: Throughput is proportional to the number of cores, so we can scale out by adding more cores
No data loss: Spark does the heavy lifting of failure recovery, supporting at-least-once delivery.
- With downstream sink deduplication in future development, it will be able to achieve exactly-once delivery
Fast development cycle thanks to the Spark DataFrame API
Simple and unified architecture

Low-Latency

Our streaming job currently runs in micro-batch mode with a one-minute trigger interval. This means that from the moment an event is published to our Event Bus, it typically lands in our data warehouse in around 27 seconds at the median, with 95% of records arriving within 51 seconds and 99% within 72 seconds.

Structured Streaming provides configurable trigger settings, which could even bring the latency down to a few seconds. For now, we’ve chosen the one-minute trigger as the right balance between cost and performance, with the flexibility to lower it in the future if requirements change.
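Median, p95, and p99 figures like the ones above can be derived from two timestamps per record: when the event was published and when it landed in the warehouse. The sketch below shows one way to compute them with the Python standard library. The function names are hypothetical, and this is not necessarily how the figures in this post were measured.

```python
from datetime import datetime
from statistics import quantiles


def latency_seconds(published_at: datetime, landed_at: datetime) -> float:
    """Event-to-warehouse latency for one record, in seconds."""
    return (landed_at - published_at).total_seconds()


def latency_percentiles(latencies_s):
    """Return the p50, p95, and p99 of a list of latencies (needs >= 2 values)."""
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking these three numbers per trigger setting makes the cost versus latency trade-off measurable before changing the interval.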

Scalability

Spark divides big data work into partitions, which fully utilize the Worker/Executor CPU cores. Each Structured Streaming job is split into stages, which are further divided into tasks, each of which runs on a single core. This level of parallelization allows us to fully utilize our Spark cluster and scale efficiently with growing data volumes.

Thanks to optimizations like in-memory processing, Catalyst query planning, whole-stage code generation, and vectorized execution, we process around 40,000 events per second in scalability validation. If traffic increases, we can scale out simply by increasing partition counts on the source Event Bus and adding more worker nodes, ensuring future-proof scalability with minimal engineering effort.
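The scale-out reasoning can be written as back-of-the-envelope arithmetic. The sketch below assumes throughput grows linearly with cores and that each source partition is read by one task on one core. The per-core rate used in the usage note is a made-up number for illustration, not a measured figure.

```python
import math


def cores_needed(target_events_per_s, events_per_core_per_s, headroom=0.25):
    """Estimate cores required, assuming throughput scales linearly with cores."""
    if events_per_core_per_s <= 0:
        raise ValueError("events_per_core_per_s must be positive")
    return math.ceil(target_events_per_s * (1 + headroom) / events_per_core_per_s)


def usable_parallelism(source_partitions, total_cores):
    """Tasks that can run at once: one task per partition, one core per task."""
    return min(source_partitions, total_cores)
```

For example, with a hypothetical 2,500 events per second per core, `cores_needed(40_000, 2_500)` sizes a cluster for 40,000 events per second with 25% headroom, and `usable_parallelism` shows why partition counts and worker nodes have to grow together: extra cores beyond the partition count sit idle.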

Delivery Guarantee

Spark Structured Streaming provides exactly-once delivery by default, thanks to its checkpointing system. For each micro-batch, Spark persists the progress (or “epoch”) of each source partition as write-ahead logs and the job’s application state in the state store. In the event of a failure, the job resumes from the last checkpoint, ensuring no data is lost or skipped.

This is discussed in the original Spark Structured Streaming research paper, which states that achieving exactly-once delivery requires:

The input source to be replayable
The output sink to support idempotent writes

But there’s also an implicit third requirement that often goes unstated: the system must be able to detect and handle failures gracefully.

This is where Spark works well: its robust failure recovery mechanisms can detect job failures, executor crashes, and driver issues, and automatically take corrective actions such as retries or restarts.

Note that we are currently operating with at-least-once delivery, as our output sink isn’t idempotent yet. If we need exactly-once delivery in the future, we should be able to achieve it by investing additional engineering effort in sink idempotency.
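The difference between at-least-once and exactly-once comes down to what happens when a batch is written but the job fails before its progress is committed. The toy simulation below illustrates that with a replayable source, a checkpoint offset, and two kinds of sink. It is a simplified model, not Spark's actual checkpointing code, and every class and parameter name is hypothetical.

```python
class ReplayableSource:
    """Source that can re-read events from any offset (requirement 1)."""

    def __init__(self, events):
        self._events = list(events)

    def __len__(self):
        return len(self._events)

    def read(self, start, end):
        return self._events[start:end]


class AppendSink:
    """Non-idempotent sink: a replayed batch produces duplicate rows."""

    def __init__(self):
        self.rows = []

    def write(self, batch):
        self.rows.extend(batch)


class IdempotentSink:
    """Idempotent sink (requirement 2): rows are upserted by event ID."""

    def __init__(self):
        self.rows = {}

    def write(self, batch):
        for event in batch:
            self.rows[event["id"]] = event


def run(source, sink, batch_size, crash_after_batch=None):
    """Process micro-batches; optionally fail once after a write, before the commit."""
    checkpoint = 0          # last committed source offset
    crashed = False
    while checkpoint < len(source):
        end = min(checkpoint + batch_size, len(source))
        sink.write(source.read(checkpoint, end))
        if checkpoint // batch_size == crash_after_batch and not crashed:
            crashed = True
            continue        # simulated failure: offset not committed, batch replays
        checkpoint = end    # commit progress
    return sink
```

With six events, a batch size of two, and a crash after the second batch, the append sink ends up with eight rows (two duplicates, nothing lost), while the idempotent sink ends up with exactly six.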

Raw Data is Better

We have also made some other improvements. We now include and persist a common rawMessage field across all tables. This column stores the original event payload as a raw string. To borrow the sushi principle (although we mean a slightly different thing here): raw data is better.

Raw data significantly simplifies troubleshooting. When something goes wrong, like a missing field or an unexpected value, we can immediately refer to the original message and trace the issue without chasing down logs or upstream systems. Without this raw payload, diagnosing data issues becomes much harder and slower.

The downside is a small increase in storage. But thanks to cheap cloud storage and the columnar format of our warehouse, this has minimal cost and no impact on query performance.
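As a small illustration of why the raw payload helps, the sketch below builds a row that keeps the original string next to the parsed fields, then uses it to explain a missing value. Only the rawMessage column name comes from this post. The `eventId` and `status` fields and both function names are hypothetical.

```python
import json


def to_row(raw_message):
    """Build a warehouse row that keeps the original payload next to parsed fields."""
    try:
        event = json.loads(raw_message)
    except json.JSONDecodeError:
        event = {}
    if not isinstance(event, dict):
        event = {}
    return {
        "eventId": event.get("eventId"),
        "status": event.get("status"),
        "rawMessage": raw_message,  # original payload, stored untouched
    }


def explain_missing(row, field):
    """Use rawMessage to tell an upstream omission from a parsing problem."""
    if row.get(field) is not None:
        return "present"
    try:
        original = json.loads(row["rawMessage"])
    except json.JSONDecodeError:
        return "payload is not valid JSON"
    if isinstance(original, dict) and field in original:
        return "present in payload, lost during parsing"
    return "missing upstream"
```

If a producer sends `Status` instead of `status`, the parsed column is empty, but the stored payload shows the casing mismatch right away, with no need to dig through upstream logs.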

Simple and Powerful API

The new implementation takes us less development time. This is largely thanks to the DataFrame API in Spark, which provides a high-level, declarative abstraction over distributed data processing. In the past, using RDDs meant manually reasoning about execution plans, understanding DAGs, and optimizing the order of operations like joins and filters. DataFrames allow us to focus on the logic of what we want to compute, rather than how to compute it. This significantly simplifies the development process.

This has also improved operations. We no longer have to manually rerun failed jobs or trace errors across multiple pipeline components. With a simplified architecture and fewer moving parts, both development and debugging are significantly easier.

Driving Real-Time Analytics Across UiPath

The success of this new architecture has not gone unnoticed. It has quickly become the new standard for real-time event ingestion across UiPath. Beyond its initial implementation for UiPath Maestro and Insights, the pattern has been broadly adopted by multiple new teams and projects for their real-time analytics needs, including those working on cutting-edge initiatives. This widespread adoption is a testament to the architecture’s scalability, efficiency, and extensibility, making it easy for new teams to onboard and enabling a new generation of products with powerful real-time analytics capabilities.

If you’re looking to scale your real-time analytics workloads without the operational burden, the architecture outlined here offers a proven path, powered by Databricks and Spark Structured Streaming and ready to support the next generation of AI and agentic systems.

About UiPath
UiPath (NYSE: PATH) is a global leader in agentic automation, empowering enterprises to harness the full potential of AI agents to autonomously execute and optimize complex business processes. The UiPath Platform™ uniquely combines controlled agency, developer flexibility, and seamless integration to help organizations scale agentic automation safely and confidently. Committed to security, governance, and interoperability, UiPath supports enterprises as they transition into a future where automation delivers on the full potential of AI to transform industries.
