This post was originally published in March 2024 and updated in February 2026.
Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving the overall customer experience.
Apache Flink is a distributed computation framework that allows for stateful real-time data processing. It provides a single set of APIs for building batch and streaming jobs, making it easy for developers to work with bounded and unbounded data. Apache Flink provides different levels of abstraction to cover a variety of event processing use cases.
Amazon Managed Service for Apache Flink is an AWS service that provides a serverless infrastructure for running Apache Flink applications. It makes it easy for developers to build highly available, fault-tolerant, and scalable Apache Flink applications without needing to become an expert in building, configuring, and maintaining Apache Flink clusters on AWS.
Data streaming workloads often require data within the stream to be enriched via external sources (such as databases or other data streams). For example, assume you are receiving coordinate data from a GPS device and need to understand how these coordinates map to physical geographic regions; you need to enrich the stream with geolocation data. You can use several approaches to enrich your real-time data in Amazon Managed Service for Apache Flink, depending on your use case and Apache Flink abstraction level. Each method has different effects on throughput, network traffic, and CPU (or memory) utilization. In this post, we cover these approaches and discuss their benefits and drawbacks.
Data enrichment patterns
Data enrichment is a process that appends additional context to and enhances the collected data. The additional data is often collected from a variety of sources, and the format and frequency of the data updates may range from once a month to many times a second. The following table shows a few examples of different sources, formats, and update frequencies.
| Data | Format | Update Frequency |
| --- | --- | --- |
| IP address ranges by country | CSV | Once a month |
| Company organization chart | JSON | Twice a year |
| Machine names by ID | CSV | Once a day |
| Employee information | Table (relational database) | A few times a day |
| Customer information | Table (non-relational database) | A few times an hour |
| Customer orders | Table (relational database) | Many times a second |
Based on the use case, your data enrichment application may have different requirements in terms of latency, throughput, or other factors. The remainder of this post dives deeper into different patterns of data enrichment in Amazon Managed Service for Apache Flink, which are listed in the following table with their key characteristics. You can choose the best pattern based on the trade-off between these characteristics.
| Enrichment Pattern | Latency | Throughput | Accuracy if Reference Data Changes | Memory Utilization | Complexity |
| --- | --- | --- | --- | --- | --- |
| Pre-loading of reference data in Apache Flink task manager memory | Low | High | Low | High | Low |
| Partitioned pre-loading of reference data in Apache Flink state | Low | High | Low | Low | Low |
| Periodic partitioned pre-loading of reference data in Apache Flink state | Low | High | Medium | Low | Medium |
| Per-record asynchronous lookup with unordered map | Medium | Medium | High | Low | Low |
| Per-record asynchronous lookup from an external cache system | Low or Medium (depending on cache storage and implementation) | Medium | High | Low | Medium |
| Enriching streams using the Table API | Low | High | High | Low-Medium (depending on the chosen join operator) | Low |
Enrich streaming data by pre-loading the reference data
When the reference data is small in size and static in nature (for example, country data including country code and country name), it is recommended to enrich your streaming data by pre-loading the reference data, which you can do in several ways.
To see the code implementation for pre-loading reference data in various ways, refer to the GitHub repo. Follow the instructions in the GitHub repository to run the code and understand the data model.
Pre-loading of reference data in Apache Flink task manager memory
The simplest and also fastest enrichment method is to load the enrichment data into each of the Apache Flink task managers' on-heap memory. To implement this method, you create a new class by extending the RichFlatMapFunction abstract class. You define a global static variable in your class definition. The variable can be of any type; the only limitation is that it should extend java.io.Serializable, for example java.util.HashMap. Within the open() method, you define logic that loads the static data into your defined variable. The open() method is always called first, during the initialization of each task in Apache Flink's task managers, which ensures the whole reference data is loaded before processing begins. You then implement your processing logic by overriding the flatMap() method, accessing the reference data by its key from the defined global variable.
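The life cycle described above can be sketched without any Flink dependencies. The following minimal Java example mirrors the pattern: an open()-style method populates an in-memory map once, and the per-record path is a pure lookup. Class, field, and data values are illustrative, not taken from the post's repository.

```java
import java.util.HashMap;
import java.util.Map;

public class PreloadEnrichment {
    // Stand-in for the variable held in task manager heap memory.
    private final Map<String, String> countryByCode = new HashMap<>();

    // Analogous to RichFlatMapFunction#open(): runs once per task,
    // before any record is processed, so the map is fully populated up front.
    public void open() {
        countryByCode.put("NL", "Netherlands");
        countryByCode.put("US", "United States");
    }

    // Analogous to the per-record processing method: a pure in-memory
    // lookup, which is why latency is low and throughput is high.
    public String enrich(String countryCode) {
        return countryByCode.getOrDefault(countryCode, "unknown");
    }

    public static void main(String[] args) {
        PreloadEnrichment fn = new PreloadEnrichment();
        fn.open();
        System.out.println(fn.enrich("NL"));
        System.out.println(fn.enrich("XX"));
    }
}
```

Because the lookup never leaves the JVM heap, no network round trip is paid per record, which is the source of this pattern's speed.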
The following architecture diagram shows the full reference data load in each task slot of the task manager:
This method has the following benefits:
- Easy to implement
- Low latency
- Can support high throughput
However, it has the following drawbacks:
- If the reference data is large in size, the Apache Flink task manager may run out of memory.
- Reference data can become stale over a period of time.
- Multiple copies of the same reference data are loaded in each task slot of the task manager.
- Reference data has to be small enough to fit in the memory allotted to a single task slot. In Amazon Managed Service for Apache Flink, each KPU has 4 GB of memory, of which 3 GB can be used for heap memory. If the `ParallelismPerKPU` parameter is set to 1, one task slot runs in each task manager, and the task slot can use the whole 3 GB of heap memory. If `ParallelismPerKPU` is set to a value greater than 1, the 3 GB of heap memory is distributed across multiple task slots in the task manager. If you're deploying Apache Flink on Amazon EMR or in a self-managed mode, you can tune `taskmanager.memory.task.heap.size` to increase the heap memory of a task manager.
Partitioned pre-loading of reference data in Apache Flink state
In this approach, the reference data is loaded and kept in the Apache Flink state store at the start of the Apache Flink application. To optimize memory utilization, the main data stream is first divided by a specified field via the keyBy() operator across all task slots, and only the portion of the reference data that corresponds to each task slot is loaded in the state store. This is achieved in Apache Flink by creating the class PartitionPreLoadEnrichmentData, extending the RichFlatMapFunction abstract class. Within the open method, you define a ValueStateDescriptor to create a state handle. In the referenced example, the descriptor is named locationRefData, the state key type is String, and the value type is Location. In this code, we use ValueState rather than MapState because we only hold the location reference data for a particular key. For example, when we query Amazon S3 to get the location reference data, we query for the specific role and get a particular location as a value.
In Apache Flink, ValueState is used to hold a specific value for a key, whereas MapState is used to hold a combination of key-value pairs. This technique is useful when you have a large static dataset that is difficult to fit in memory as a whole for each partition.
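Setting Flink aside, the mechanics of partitioned pre-loading can be sketched in plain Java: records are grouped by a key (here, role), and each key's state holds only the reference value for that key, loaded once on first use. All names and values are illustrative, not from the post's repository; in Flink this corresponds to a ValueState populated after keyBy().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class PartitionedPreload {
    // Stand-in for the external reference source (for example, a file on S3).
    private final Function<String, String> buildingByRole;
    // Stand-in for keyed ValueState: one entry per key handled by this task.
    private final Map<String, String> keyedState = new HashMap<>();

    public PartitionedPreload(Function<String, String> buildingByRole) {
        this.buildingByRole = buildingByRole;
    }

    // Loads the reference value for this key only once, then reuses it,
    // so memory holds just the partitions this task is responsible for.
    public String enrich(String role) {
        return keyedState.computeIfAbsent(role, buildingByRole);
    }

    public static void main(String[] args) {
        Map<String, String> source = Map.of(
                "engineer", "Building-7",
                "analyst", "Building-2");
        PartitionedPreload fn = new PartitionedPreload(source::get);
        System.out.println(fn.enrich("engineer")); // first access loads from the source
        System.out.println(fn.enrich("engineer")); // served from keyed state
    }
}
```

The memory saving over the previous pattern comes from each parallel task holding only the keys routed to it, instead of every task holding the entire dataset.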
The following architecture diagram shows the load of reference data for the specific key for each partition of the stream.

For example, the reference data in the sample GitHub code has roles which are mapped to each building. Because the stream is partitioned by roles, only the specific building information per role needs to be loaded for each partition as the reference data.
This method has the following benefits:
- Low latency.
- Can support high throughput.
- Reference data for the specific partition is loaded in the keyed state.
- In Amazon Managed Service for Apache Flink, the default state store configured is RocksDB. RocksDB can utilize a significant portion of the 1 GB of managed memory and 50 GB of disk space provided by each KPU. This provides enough room for the reference data to grow.
However, it has the following drawbacks:
- Reference data can become stale over a period of time
Periodic partitioned pre-loading of reference data in Apache Flink state
This approach is a refinement of the previous technique, where each partition's reference data is reloaded on a periodic basis to keep it fresh. This is useful if your reference data changes occasionally.
The following architecture diagram shows the periodic load of reference data for the specific key for each partition of the stream:

In this approach, the class PeriodicPerPartitionLoadEnrichmentData is created, extending the KeyedProcessFunction class. Similar to the previous pattern, in the context of the GitHub example, ValueState is applicable here because each partition only loads a single value for the key. In the same way as mentioned earlier, in the open method, you define the ValueStateDescriptor to handle the value state and define a runtime context to access the state.
Within the processElement method, load the value state and attach the reference data (in the referenced GitHub example, we attached buildingNo to the customer data). Also register a timer service to be invoked when the processing time passes the given time. In the sample code, the timer service is scheduled to be invoked periodically (for example, every 60 seconds). In the onTimer method, update the state by making a call to reload the reference data for the specific role.
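The refresh logic can be illustrated without Flink by keeping, alongside each cached value, the time it was loaded, and reloading once the refresh interval has passed. In Flink this is driven by the keyed timer service (onTimer) rather than a manual clock check; the class, keys, and values below are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class PeriodicReload {
    private final long refreshMillis;
    private final Function<String, String> loader;
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> loadedAt = new HashMap<>();

    public PeriodicReload(long refreshMillis, Function<String, String> loader) {
        this.refreshMillis = refreshMillis;
        this.loader = loader;
    }

    // Reload when the entry is missing or older than the refresh interval,
    // mimicking an onTimer() callback firing every refresh period.
    public String enrich(String key, long nowMillis) {
        Long at = loadedAt.get(key);
        if (at == null || nowMillis - at >= refreshMillis) {
            values.put(key, loader.apply(key));
            loadedAt.put(key, nowMillis);
        }
        return values.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("engineer", "Building-7");
        PeriodicReload fn = new PeriodicReload(60_000, source::get);
        System.out.println(fn.enrich("engineer", 0));      // initial load
        source.put("engineer", "Building-9");              // reference data changes
        System.out.println(fn.enrich("engineer", 30_000)); // still cached: stale value
        System.out.println(fn.enrich("engineer", 60_000)); // interval elapsed: refreshed
    }
}
```

The middle call returning the stale value is exactly the accuracy trade-off listed in the drawbacks below: between refreshes, the application serves the last loaded version.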
This method has the following benefits:
- Low latency.
- Can support high throughput.
- Reference data for specific partitions is loaded in the keyed state.
- Reference data is refreshed periodically.
- In Amazon Managed Service for Apache Flink, the default state store configured is RocksDB, and each KPU also provides 50 GB of disk space. This provides enough room for the reference data to grow.
However, it has the following drawbacks:
- If the reference data changes frequently, the application may still serve stale data, depending on how frequently the state is reloaded
- The application can face load spikes during reloads of the reference data
Enrich streaming data using per-record lookup
Although pre-loading of reference data provides low latency and high throughput, it may not be suitable for certain types of workloads, such as the following:
- Reference data updates with high frequency
- Apache Flink needs to make an external call to compute the business logic
- Accuracy of the output is critical and the application shouldn't use stale data
Generally, for these types of use cases, developers trade off high throughput and low latency for data accuracy. In this section, you learn about a few common implementations for per-record data enrichment and their benefits and drawbacks.
Per-record asynchronous lookup with unordered map
In a synchronous per-record lookup implementation, the Apache Flink application has to wait for the response after sending every request. This causes the processor to stay idle for a significant portion of the processing time. Instead, the application can send requests for other elements in the stream while it waits for the response to the first element. This way, the wait time is amortized across multiple requests, which increases the processing throughput. Apache Flink provides asynchronous I/O for external data access. When using this pattern, you have to decide between unorderedWait (which emits the result to the next operator as soon as the response is received, disregarding the order of the elements in the stream) and orderedWait (which waits until all in-flight I/O operations complete, then sends the results to the next operator in the same order as the original elements were placed on the stream). Usually, when downstream consumers disregard the order of the elements in the stream, unorderedWait provides better throughput and less idle time. Visit Enrich your data stream asynchronously using Amazon Managed Service for Apache Flink to learn more about this pattern.
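The unorderedWait behavior can be demonstrated with plain JDK concurrency primitives: several lookups are in flight at once, and each result is emitted as soon as its response arrives, regardless of submission order. In Flink this is AsyncDataStream.unorderedWait over an AsyncFunction; here an executor and CompletableFuture stand in, with keys and latencies that are purely illustrative.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncUnorderedLookup {
    // Simulates one asynchronous lookup against an external system with a
    // given response latency; the result is emitted on completion.
    static CompletableFuture<Void> lookup(ExecutorService pool, String key,
                                          long latencyMillis, Queue<String> out) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            out.add(key + ":enriched");
        }, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        Queue<String> emitted = new ConcurrentLinkedQueue<>();
        // Both requests are in flight concurrently instead of back to back.
        List<CompletableFuture<Void>> inFlight = List.of(
                lookup(pool, "slow-key", 200, emitted),
                lookup(pool, "fast-key", 10, emitted));
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        // The fast response is emitted first even though it was sent second.
        System.out.println(emitted);
    }
}
```

With a synchronous loop the two lookups would take roughly the sum of their latencies; concurrently they take roughly the maximum, which is where the throughput gain comes from.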
The following architecture diagram shows how an Apache Flink application on Amazon Managed Service for Apache Flink makes asynchronous calls to an external database engine (for example, Amazon DynamoDB) for every event in the main stream:

This method has the following benefits:
- Still reasonably simple and easy to implement
- Reads the most up-to-date reference data
However, it has the following drawbacks:
- It generates a heavy read load on the external system (for example, a database engine or an external API) that hosts the reference data
- Overall, it might not be suitable for systems that require high throughput with low latency
Per-record asynchronous lookup from an external cache system
A way to enhance the previous pattern is to use a cache system to improve the read time of every lookup I/O call. You can use Amazon ElastiCache for caching, which accelerates application and database performance, or as a primary data store for use cases that don't require durability, like session stores, gaming leaderboards, streaming, and analytics. ElastiCache is compatible with Redis and Memcached.
For this pattern to work, you must implement a caching pattern for populating data in the cache storage. You can choose between a proactive or reactive approach depending on your application objectives and latency requirements. For more information, refer to Caching patterns.
The following architecture diagram shows how an Apache Flink application calls to read the reference data from an external cache storage (for example, Amazon ElastiCache for Redis). Data changes must be replicated from the main database (for example, Amazon Aurora) to the cache storage by implementing one of the caching patterns.

The implementation of this data enrichment pattern is similar to the per-record asynchronous lookup pattern; the only difference is that the Apache Flink application makes a connection to the cache storage instead of connecting to the primary database.
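The reactive (cache-aside) variant of this pattern can be sketched in plain Java: every lookup tries the cache first and only falls back to the primary database on a miss, writing the result back to the cache. In the Flink application the cache would be an external store such as ElastiCache for Redis; here two in-process structures stand in, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CacheAsideLookup {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> database;
    private int databaseReads = 0;

    public CacheAsideLookup(Function<String, String> database) {
        this.database = database;
    }

    public String enrich(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                  // cache hit: primary database untouched
        }
        databaseReads++;
        String value = database.apply(key); // cache miss: read from the database
        cache.put(key, value);              // populate the cache for later lookups
        return value;
    }

    public static void main(String[] args) {
        CacheAsideLookup fn = new CacheAsideLookup(k -> "customer-record-for-" + k);
        System.out.println(fn.enrich("42"));  // miss: hits the database
        System.out.println(fn.enrich("42"));  // hit: served from the cache
        System.out.println(fn.databaseReads); // the database was read only once
    }
}
```

This is also where the staleness drawback below comes from: once a value sits in the cache, updates in the primary database are not seen until the cached entry is invalidated or refreshed.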
This method has the following benefits:
- Better throughput, because caching can accelerate application and database performance
- Protects the primary data source from the read traffic created by the stream processing application
- Can provide lower read latency for every lookup call
- Overall, can be suitable for medium to high throughput systems that want to improve data freshness
However, it has the following drawbacks:
- Additional complexity of implementing a caching pattern for populating and syncing the data between the primary database and the cache storage
- There is a chance for the Apache Flink stream processing application to read stale reference data, depending on which caching pattern is implemented
- Depending on the chosen caching pattern (proactive or reactive), the response time for each enrichment I/O call may differ, so the overall processing time of the stream can be unpredictable
Alternatively, you can avoid these complexities by using the Apache Flink JDBC connector for Flink SQL APIs. We discuss enriching stream data via Flink SQL APIs in more detail later in this post.
Enrich stream data via another stream
In this pattern, the data in the main stream is enriched with the reference data in another data stream. This pattern works well for use cases in which the reference data is updated frequently and it is possible to perform change data capture (CDC) and publish the events to a data streaming service such as Apache Kafka or Amazon Kinesis Data Streams. This pattern is useful in the following use cases, for example:
- Customer purchase orders are published to a Kinesis data stream, and then joined with customer billing information in a DynamoDB stream
- Data events captured from IoT devices should be enriched with reference data in a table in Amazon Relational Database Service (Amazon RDS)
- Network log events should be enriched with the machine name for the source (and the destination) IP addresses
The following architecture diagram shows how an Apache Flink application on Amazon Managed Service for Apache Flink joins data in the main stream with the CDC data in a DynamoDB stream.

To enrich streaming data from another stream, we use common stream-to-stream join patterns, which we explain in the following sections.
Enrich streams using the Table API
Apache Flink Table APIs provide a higher level of abstraction for working with data events. With Table APIs, you can define your data stream as a table and attach the data schema to it.
In this pattern, you define tables for each data stream and then join those tables to achieve the data enrichment goals. Apache Flink Table APIs support different types of join conditions, like inner join and outer join. However, you want to avoid those when dealing with unbounded streams, because they are resource intensive. To limit resource utilization and run joins effectively, you should use either interval or temporal joins. An interval join requires one equi-join predicate and a join condition that bounds the time on both sides. To better understand how to implement an interval join, refer to Get started with Amazon Managed Service for Apache Flink (Table API).
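To illustrate the shape of an interval join, the following sketch assumes two hypothetical stream-backed tables, orders and shipments, each with an event-time attribute and watermark already defined (these names are not from the linked guide):

SELECT o.order_id, o.order_time, s.shipment_time
FROM orders o, shipments s
WHERE o.order_id = s.order_id                 -- the equi-join predicate
  AND s.shipment_time BETWEEN o.order_time    -- the time-bound condition
      AND o.order_time + INTERVAL '4' HOUR;

The time bound lets Flink discard state for records whose join window has passed, which is what keeps the join's resource usage limited on unbounded streams.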
Compared to interval joins, temporal table joins don't work with a time period within which different versions of a record are kept. Records from the main stream are always joined with the corresponding version of the reference data at the time specified by the watermark, so fewer versions of the reference data remain in the state. Note that the reference data may or may not have a time element associated with it. If it doesn't, you may need to add a processing time element for the join with the time-based stream.
In the following example code snippet, the update_time column is added to the currency_rates reference table from the change data capture metadata, such as Debezium. Furthermore, it is used to define a watermark strategy for the table.
CREATE TABLE currency_rates (
    currency STRING,
    conversion_rate DECIMAL(32, 2),
    update_time TIMESTAMP(3) METADATA FROM `values.source.timestamp` VIRTUAL,
    WATERMARK FOR update_time AS update_time,
    PRIMARY KEY(currency) NOT ENFORCED
) WITH (
    'connector' = 'kafka',
    'value.format' = 'debezium-json',
    /* ... */
);
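For illustration, a temporal join against such a versioned table might look like the following, assuming a hypothetical orders table with its own order_time event-time attribute and watermark (the orders table is not part of the original example):

SELECT
    o.order_id,
    o.price * r.conversion_rate AS converted_price
FROM orders AS o
-- FOR SYSTEM_TIME AS OF picks the version of the rate
-- that was valid at each order's event time.
JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
ON o.currency = r.currency;

Each order is enriched with the conversion rate that was current at the order's event time, and older rate versions can be dropped from state once the watermark passes them.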
This method has the following benefits:
- Easy to implement
- Low latency
- Can support high throughput when the reference data is a data stream
SQL APIs provide a higher level of abstraction over how the data is processed. For more complex logic around how the join operator should process the data, we recommend you always start with SQL APIs first and use DataStream APIs only if you really need to.
Conclusion
In this post, we demonstrated different data enrichment patterns in Amazon Managed Service for Apache Flink. You can explore these patterns, find the one that addresses your needs, and quickly develop a stream processing application.
For further information about this service, visit the official product page.
About the Authors
