This post was written by Eunice Aguilar and Francisco Rodera from REA Group.
Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as needs change. REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro.
REA Group’s team of more than 3,000 people is guided by our purpose: to change the way the world experiences property. We help people with all aspects of their property experience, not just buying, selling, and renting, through the richest content, data and insights, valuation estimates, and home financing solutions. We deliver unparalleled value to our customers, Australia’s real estate agents, by providing access to the largest and most engaged audience of property seekers.
To achieve this, the different technical products within the company regularly need to move data across domains and services efficiently and reliably.
Within the Data Platform team, we have built a data streaming platform called Hydro to provide this capability across the whole organization. Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. This type of structure is foundational at REA for building microservices and timely data processing for real-time and batch use cases like time-sensitive outbound messaging, personalization, and machine learning (ML).
In this post, we share our approach to MSK cluster capacity planning.
The problem
Hydro manages a large-scale Amazon MSK infrastructure by providing configuration abstractions, allowing users to focus on delivering value to REA without the cognitive overhead of infrastructure management. As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency.
Hydro uses provisioned MSK clusters in development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Proper capacity planning makes sure the clusters can handle high traffic and provide all users with the desired level of service.
Real-time streaming is a relatively new technology at REA. Many users aren’t yet familiar with Apache Kafka, and accurately assessing their workload requirements can be challenging. As the custodians of the Hydro platform, it’s our responsibility to find a way to perform capacity planning to proactively assess the impact of the user workloads on our clusters.
Goals
Capacity planning involves determining the appropriate size and configuration of the cluster based on current and projected workloads, as well as considering factors such as data replication, network bandwidth, and storage capacity.
Without proper capacity planning, Hydro clusters can become overwhelmed by high traffic and fail to provide users with the desired level of service. Therefore, it’s very important to us to invest time and resources into capacity planning to make sure Hydro clusters can deliver the performance and availability that modern applications require.
The capacity planning approach we follow for Hydro covers three main areas:
- The models used for the calculation of current and estimated future capacity needs, including the attributes used as variables in them
- The models used to assess the approximate expected capacity required for a new Hydro workload joining the platform
- The tooling available to operators and custodians to assess the historical and current capacity consumption of the platform and, based on them, the available headroom
The following diagram shows the interaction of capacity usage and the precalculated maximum utilization.
Although we don’t have this capability yet, the goal is to take this approach one step further in the future and predict the approximate resource depletion time, as shown in the following diagram.

To make sure our digital operations are resilient and efficient, we must maintain comprehensive observability of our current capacity usage. This detailed oversight allows us not only to understand the performance limits of our existing infrastructure, but also to identify potential bottlenecks before they impact our services and users.
By proactively setting and monitoring well-understood thresholds, we can receive timely alerts and take necessary scaling actions. This approach makes sure our infrastructure can meet demand spikes without compromising on performance, ultimately supporting a seamless user experience and maintaining the integrity of our system.
Solution overview
The MSK clusters in Hydro are configured with a PER_TOPIC_PER_BROKER level of monitoring, which provides metrics at the broker and topic levels. These metrics help us determine the attributes of the cluster usage effectively.
However, it wouldn’t be wise to display an excessive number of metrics on our monitoring dashboards because that could lead to less clarity and slower insights on the cluster. It’s more valuable to choose the metrics most relevant to capacity planning.
Cluster usage attributes
Based on the Amazon MSK best practices guidelines, we have identified several key attributes to assess the health of the MSK cluster. These attributes include the following:
- In/out throughput
- CPU utilization
- Disk space utilization
- Memory utilization
- Producer and consumer latency
- Producer and consumer throttling
For more information on right-sizing your clusters, see Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, Best practices for Standard brokers, Monitor CPU usage, Monitor disk space, and Monitor Apache Kafka memory.
The following table contains the detailed list of all the attributes we use for MSK cluster capacity planning in Hydro.
| Attribute Name | Attribute Type | Units | Comments |
|---|---|---|---|
| Bytes in | Throughput | Bytes per second | Depends on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput |
| Bytes out | Throughput | Bytes per second | Depends on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput |
| Consumer latency | Latency | Milliseconds | High or unacceptable latency values usually indicate user experience degradation before reaching actual resource (for example, CPU and memory) depletion |
| CPU utilization | Capacity limits | % CPU user + CPU system | Should stay under 60% |
| Disk space utilization | Persistent storage | Bytes | Should stay under 85% |
| Memory utilization | Capacity limits | % Memory in use | Should stay under 60% |
| Producer latency | Latency | Milliseconds | High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual capacity limits or actual resource (for example, CPU or memory) depletion |
| Throttling | Capacity limits | Milliseconds, bytes, or messages | High or unacceptable sustained throttling values indicate capacity limits are being reached before actual resource (for example, CPU or memory) depletion |
By monitoring these attributes, we can quickly evaluate the performance of the clusters as we add more workloads to the platform. We then match these attributes to the relevant MSK metrics available.
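As an illustration only, the percentage-based rows of the table could be captured in a small model like the following. This is a minimal sketch, not Hydro's implementation: the `Attribute` class, the dictionary keys, and both helper functions are hypothetical names, and the disk limit is expressed as a percentage of provisioned storage rather than in bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One capacity-planning attribute and its upper limit."""
    name: str
    unit: str
    limit: float


# Limits come from the table above; the identifiers are hypothetical.
ATTRIBUTES = {
    "cpu_utilization": Attribute("CPU utilization", "percent", 60.0),
    "disk_space_utilization": Attribute("Disk space utilization", "percent", 85.0),
    "memory_utilization": Attribute("Memory utilization", "percent", 60.0),
}


def headroom(attribute: Attribute, observed: float) -> float:
    """Return the remaining capacity as a fraction of the limit (negative when over)."""
    return (attribute.limit - observed) / attribute.limit


def over_limit(observations: dict[str, float]) -> list[str]:
    """Return the keys of attributes whose observed value meets or exceeds the limit."""
    return sorted(
        key for key, value in observations.items()
        if value >= ATTRIBUTES[key].limit
    )
```

Keeping the limit next to the attribute definition means that a dashboard, an alarm, and a report can all read the same number.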
Cluster capacity limits
During the initial capacity planning, our MSK clusters weren’t receiving enough traffic to give us a clear idea of their capacity limits. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits. We conducted performance and capacity tests on test MSK clusters that had the same cluster configurations as our development and production clusters. Running these test scenarios gave us a more comprehensive understanding of the cluster’s performance. The following figure shows an example of a test cluster’s performance metrics.

To perform the tests within a specific time frame and budget, we focused on the test scenarios that could efficiently measure the cluster’s capacity. For instance, we conducted tests that involved sending high-throughput traffic to the cluster and creating topics with many partitions.
After every test, we collected the metrics of the test cluster and extracted the maximum values of the key cluster usage attributes. We then consolidated the results and determined the most appropriate limits of each attribute. The following screenshot shows an example of the exported test cluster’s performance metrics.
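The extraction step described above can be sketched as follows. The post does not state how results from different tests were consolidated, so keeping the highest peak per attribute is an assumption made for this example, and the function names and input shape are hypothetical.

```python
def peak_per_attribute(samples: dict[str, list[float]]) -> dict[str, float]:
    """Return the maximum observed value of each attribute in one test run."""
    return {name: max(values) for name, values in samples.items() if values}


def consolidate_limits(runs: list[dict[str, list[float]]]) -> dict[str, float]:
    """Combine several runs by keeping the highest peak seen for each attribute.

    Assumption: the highest peak across runs is treated as the limit. A more
    conservative rule (for example, the lowest peak) may suit other clusters.
    """
    limits: dict[str, float] = {}
    for run in runs:
        for name, peak in peak_per_attribute(run).items():
            limits[name] = max(peak, limits.get(name, peak))
    return limits
```

Each run is a mapping from attribute name to the values sampled during that test. Attributes with no samples in a run are skipped rather than treated as zero.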
Capacity monitoring dashboards
As part of our platform management process, we conduct monthly operational reviews to maintain optimal performance. This involves analyzing an automated operational report that covers all the systems on the platform. During the review, we evaluate the service level objectives (SLOs) based on select service level indicators (SLIs) and assess the monitoring alerts triggered in the previous month. By doing so, we can identify any issues and take corrective actions.
To assist us in conducting the operational reviews and to provide us with an overview of the cluster’s usage, we developed a capacity monitoring dashboard, as shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated and managed automatically as a component of the platform infrastructure, along with the MSK cluster.

We define the maximum capacity limits of the MSK cluster in a configuration file, and the limits are automatically loaded into the capacity dashboard as annotations in the Amazon CloudWatch graph widgets. The capacity limit annotations are clearly visible and provide us with a view of the cluster’s capacity headroom based on usage.
We determined the capacity limits for throughput, latency, and throttling through the performance testing. Capacity limits of the other metrics, such as CPU, disk space, and memory, are based on the Amazon MSK best practices guidelines.
During the operational reviews, we proactively assess the capacity monitoring dashboards to determine if more capacity needs to be added to the cluster. This approach allows us to identify and address potential performance issues before they have a significant impact on user workloads. It’s a preventative measure rather than a reactive response to a performance degradation.
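To illustrate how limits from a configuration file can end up as graph annotations, the following sketch renders a CloudWatch dashboard body in which each metric widget carries a horizontal annotation at its limit. Hydro builds its dashboards with the AWS CDK; this example instead produces the raw dashboard JSON to stay self-contained. The configuration layout, the cluster name, the Region, and the limit value are placeholders, not values from Hydro.

```python
import json


def widget_with_limit(title: str, metric: list, limit: float, region: str) -> dict:
    """Build a CloudWatch metric widget with a horizontal annotation at the limit."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "stat": "Maximum",
            "period": 300,
            "metrics": [metric],
            "annotations": {
                "horizontal": [{"label": "Capacity limit", "value": limit}]
            },
        },
    }


def dashboard_body(config: dict) -> str:
    """Render the dashboard body JSON from a capacity-limits configuration."""
    widgets = [
        widget_with_limit(w["title"], w["metric"], w["limit"], config["region"])
        for w in config["widgets"]
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical configuration: every value below is a placeholder.
EXAMPLE_CONFIG = {
    "region": "ap-southeast-2",
    "widgets": [
        {
            "title": "Bytes in per second",
            "metric": ["AWS/Kafka", "BytesInPerSec",
                       "Cluster Name", "example-cluster", "Broker ID", "1"],
            "limit": 50_000_000.0,
        }
    ],
}
```

The resulting string has the shape that the CloudWatch `PutDashboard` API accepts as its dashboard body. Because the limit lives only in the configuration, changing it there changes the annotation on the next deployment.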
Preemptive CloudWatch alarms
We have implemented preemptive CloudWatch alarms in addition to the capacity monitoring dashboards. These alarms are configured to alert us before a specific capacity metric reaches its limit, notifying us when the sustained value reaches 80% of the capacity limit. This method of monitoring enables us to take immediate action instead of waiting for our monthly review cadence.
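The 80% rule can be sketched as a function that derives alarm settings from the capacity limits. The dictionary keys mirror parameter names of the CloudWatch `PutMetricAlarm` API. The alarm name prefix and the choice of three consecutive evaluation periods to represent a "sustained" value are assumptions for illustration, not Hydro's actual settings.

```python
ALERT_RATIO = 0.8  # alert at 80% of the capacity limit, as described above


def alarm_definitions(limits: dict[str, float], ratio: float = ALERT_RATIO) -> list[dict]:
    """Derive one preemptive alarm definition per capacity limit."""
    return [
        {
            "AlarmName": f"capacity-{name}",  # hypothetical naming scheme
            "Threshold": limit * ratio,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # Assumption: "sustained" means 3 consecutive breaching periods.
            "EvaluationPeriods": 3,
            "DatapointsToAlarm": 3,
        }
        for name, limit in sorted(limits.items())
    ]
```

A complete alarm would also need the metric namespace, name, dimensions, period, and statistic, which are omitted here to keep the threshold logic in focus.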
Value added by our capacity planning approach
As operators of the Hydro platform, our approach to capacity planning has provided a consistent way to assess how far we are from the theoretical capacity limits of all our clusters, regardless of their configuration. Our capacity monitoring dashboards are a key observability tool that we review on a regular basis; they’re also useful while troubleshooting performance issues. They help us quickly tell if capacity constraints could be a potential root cause of any ongoing issues. This means we can use our current capacity planning approach and tooling either proactively or reactively, depending on the situation and need.
Another benefit of this approach is that we calculate the theoretical maximum utilization values that a given cluster with a specific configuration can withstand on a separate cluster, without impacting any actual users of the platform. We spin up short-lived MSK clusters through our AWS CDK-based automation and perform capacity tests on them. We do this quite often to assess the impact, if any, that changes made to the cluster’s configurations have on the known capacity limits. In our current feedback loop, if these newly calculated limits differ from the previously known ones, they are used to automatically update our capacity dashboards and alarms in CloudWatch.
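The comparison step in that feedback loop could look something like the following sketch. The function name and the 1% relative tolerance are assumptions for illustration. The post does not say how large a difference must be before dashboards and alarms are updated.

```python
import math


def changed_limits(
    known: dict[str, float],
    measured: dict[str, float],
    rel_tol: float = 0.01,
) -> dict[str, float]:
    """Return measured limits that are new or differ from the known ones.

    A limit counts as changed when it differs by more than rel_tol
    (relative tolerance), so that measurement noise does not trigger updates.
    """
    return {
        name: value
        for name, value in measured.items()
        if name not in known
        or not math.isclose(known[name], value, rel_tol=rel_tol)
    }
```

Only the entries returned here would need to be written back to the limits configuration and redeployed.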
Future evolution
Hydro is a platform that is constantly improving with the introduction of new features. One of these features is the ability to conveniently create Kafka client applications. To meet the increasing demand, it’s essential to stay ahead of capacity planning. Although the approach discussed here has served us well so far, it’s by no means the final stage, and there are capabilities that we need to extend and areas we need to improve on.
Multi-cluster architecture
To support critical workloads, we’re considering a multi-cluster architecture using Amazon MSK, which would also affect our capacity planning. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster. In addition to the existing provisioned MSK clusters, we will evaluate how the Amazon MSK Serverless cluster type can complement our platform architecture.
Usage trends
We have added CloudWatch anomaly detection graphs to our capacity monitoring dashboards to track any unusual trends. However, because the CloudWatch anomaly detection algorithm only evaluates up to 2 weeks of metric data, we will reassess its usefulness as we onboard more workloads. Aside from identifying usage trends, we will explore options to implement an algorithm with predictive capabilities to detect when MSK cluster resources degrade and deplete.
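As one possible starting point for such a predictive capability, the following sketch fits a least-squares line to historical usage and extrapolates to the point where it crosses the capacity limit. This is an illustrative technique chosen for this example, not an algorithm that Hydro uses. It assumes roughly linear growth, which real workloads may not follow.

```python
from typing import Optional


def estimate_depletion(
    samples: list[tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate the time until usage reaches the limit using a linear trend.

    samples holds (timestamp, usage) pairs in ascending time order.
    Returns the time remaining after the last sample, in the same unit as
    the timestamps, or None when no upward trend can be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or decreasing usage never reaches the limit
    intercept = mean_u - slope * mean_t
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - samples[-1][0])
```

With daily samples of disk usage, for example, the result would be the approximate number of days until the limit is reached.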
Conclusion
Initial capacity planning lays a solid foundation for future improvements and provides a safe onboarding process for workloads. To achieve optimal performance of our platform, we must make sure that our capacity planning strategy evolves in line with the platform’s growth. As a result, we maintain a close collaboration with AWS to continually develop additional features that meet our business needs and are in sync with the Amazon MSK roadmap. This makes sure we stay ahead of the curve and can deliver the best possible experience to our users.
We recommend that all Amazon MSK users start planning their capacity so they don’t miss out on maximizing their cluster’s potential. Implementing the strategies listed in this post is a great first step and will lead to smoother operations and significant savings in the long run.
About the Authors
Eunice Aguilar is a Staff Data Engineer at REA. She has worked in software engineering in various industries throughout the years and recently for property data. She’s also an advocate for women interested in transitioning into tech, along with the well-versed who she takes inspiration from.
Francisco Rodera is a Staff Systems Engineer at REA. He has extensive experience building and operating large-scale distributed systems. His interests are automation, observability, and applying SRE practices to business-critical services and platforms.
Khizer Naeem is a Technical Account Manager at AWS. He specializes in Efficient Compute and has a deep passion for Linux and open-source technologies, which he leverages to help enterprise customers modernize and optimize their cloud workloads.

































