Organizations running Apache Kafka as their streaming platform need comprehensive monitoring to maintain reliable operations. Without proper visibility into broker health, resource utilization, and data flow metrics, teams risk service disruptions, data loss, and degraded performance that can affect critical business operations. Effective monitoring and alerting are essential to detect anomalies early, from high system load to connectivity issues, enabling teams to take preventive action before problems affect production workloads.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) addresses these monitoring challenges by publishing detailed metrics to Amazon CloudWatch. The service emits metrics at 1-minute intervals for provisioned (Standard) clusters, with flexible monitoring levels (DEFAULT, PER_BROKER, PER_TOPIC_PER_BROKER, or PER_TOPIC_PER_PARTITION) to control granularity and cost. At the DEFAULT level (free), cluster-level metrics are available; higher levels (paid) expose broker-level, per-topic, and per-partition metrics.
In this post, I show you how to implement effective monitoring for your MSK clusters using Amazon CloudWatch. You’ll learn how to track critical metrics like broker health, resource utilization, and consumer lag, and set up automated alerts to prevent operational issues. By following these practices, you can improve the reliability of your streaming operations, optimize resource utilization, and support high availability for your mission-critical applications.
Key metrics to monitor
This article groups important Amazon MSK metrics into logical categories. For each, we highlight key metrics and what they indicate:
- Broker Health and Cluster Availability:
  - ActiveControllerCount (cluster): Each broker reports whether it’s the active controller (1) or not (0). In a healthy cluster, exactly one broker serves as the active controller at any time. When you view this metric with the average statistic, the value equals 1 divided by the number of brokers. For example, a 3-broker cluster shows 0.33 (1/3). Set CloudWatch alarm thresholds accordingly: for six brokers, alert if the average falls below 0.166 (1/6). With the sum statistic, the value should always be 1, indicating one active controller regardless of cluster size. If the sum differs from 1, a controller election is in progress, typically during maintenance activities, configuration changes, or rolling restarts.
    Note: For KRaft-based clusters, ActiveControllerCount is exposed only on the dedicated controller endpoints, so the sample count is 3 and only the active controller reports a value of 1. The average is therefore always 0.33, no matter how many brokers are in the cluster. To monitor broker health for KRaft-based clusters, check the LeaderCount metric. If a broker isn’t emitting any metrics, that is a good indication that the broker may be unhealthy.
  - OfflinePartitionsCount (cluster): Number of partitions with no active leader. Non-zero values mean data is temporarily unavailable or unwritable. Trigger alerts if it rises above 0.
  - UnderReplicatedPartitions (per broker): Number of partitions where not all replicas are caught up. This should stay at 0 under normal conditions. Spikes indicate that traffic exceeds capacity or that replication is lagging; sustained values often mean a configuration/ACL issue. Refer to Troubleshoot your Amazon MSK cluster.
  - UnderMinIsrPartitionCount (per broker): Number of partitions below the minimum in-sync replica (ISR) count. A non-zero value means potential data loss if brokers fail. Monitor it to ensure replication is healthy. Refer to Custom configurations.
  - GlobalPartitionCount (cluster): Total number of partitions across all topics (leaders only). Useful for capacity planning and sanity checks.
  - PartitionCount (per broker): Number of partitions (including replicas) hosted by a broker. Sudden changes may indicate rebalances. (More partitions per broker can degrade performance.)
- Resource Utilization:
  - CPU: Total broker CPU utilization is defined as CpuUser + CpuSystem. Best practice is to keep average CPU utilization below 60%. Set alarms on the sum of user + system to detect overload.
  - CPUCreditBalance / CPUCreditUsage (per broker): For burstable instance types (T3), tracks earned/spent CPU credits. A declining credit balance or high credit usage warns that the instance may be CPU-starved.
  - Memory: MemoryUsed and MemoryFree (per broker) show RAM utilization. Critically, HeapMemoryAfterGC (per broker) reports JVM heap usage (%) after garbage collection. AWS recommends alerting if HeapMemoryAfterGC exceeds 60%, to avoid out-of-memory issues.
  - Disk: Kafka brokers use attached EBS storage for topic data. Monitor KafkaDataLogsDiskUsed (per broker), the percentage of disk used by message logs. Best practice: alarm when data log usage exceeds 85%. Also track RootDiskUsed, the percentage of the root disk used by the broker.
  - EBS I/O: Volume metrics (per broker) such as VolumeQueueLength, VolumeReadOps, VolumeWriteOps, VolumeReadBytes, and VolumeWriteBytes indicate I/O latency and throughput. Rising queue lengths or latency (such as VolumeTotalReadTime) suggest disk contention.
  - Network: Basic network stats per broker include NetworkRxPackets, NetworkTxPackets, and error/drop counts (NetworkRxErrors, NetworkTxErrors, NetworkRxDropped, NetworkTxDropped). Unexpected errors or drops can indicate network issues.
- Topic and Partition Activity:
  - Throughput: BytesInPerSec and BytesOutPerSec measure inbound/outbound data rates per broker or per topic. Sustained drops can signal lost producers/consumers; spikes may require scaling.
  - Replication Traffic: ReplicationBytesInPerSec/ReplicationBytesOutPerSec (per topic) show inter-broker replication volume.
  - Consumer Lag: Consumer lag metrics quantify the difference between the latest data written to your topics and the data read by your applications. Amazon MSK provides the following consumer-lag metrics, which you can get through Amazon CloudWatch or through open monitoring with Prometheus: EstimatedMaxTimeLag, EstimatedTimeLag, MaxOffsetLag, OffsetLag, and SumOffsetLag. For information about these metrics, see Amazon MSK metrics for monitoring Standard brokers with CloudWatch.
- Client Connections:
  - ConnectionCount (per broker): Total active connections (clients + inter-broker). Sudden drops or sustained high counts (hitting limits) merit attention.
  - ClientConnectionCount (per broker, with auth filter): Active authenticated client connections.
  - ConnectionCreationRate / ConnectionCloseRate (per broker): New or closed connections per second. Spikes in connection churn may indicate client issues.
  - Authentication: IAMNumberOfConnectionRequests and IAMTooManyConnections (per broker) show IAM auth request rates and throttle breaches (limit of 100 simultaneous connections).
- Network Bandwidth Metrics:
  - TrafficShaping: This metric serves as your primary warning signal. When its value exceeds zero (any throttling), your MSK cluster is experiencing network throttling at the EC2 layer, with packets being dropped or queued because allocations were exceeded. This throttling manifests as reduced throughput, increased latency, and potential network errors that affect both producer and consumer performance. TrafficShaping issues stem from two possible bandwidth limitations, BwInAllowanceExceeded and BwOutAllowanceExceeded:
    - BwInAllowanceExceeded tracks when inbound aggregate bandwidth surpasses broker maximums.
    - BwOutAllowanceExceeded tracks when outbound aggregate bandwidth exceeds limits.
    Both BwInAllowanceExceeded and BwOutAllowanceExceeded contribute directly to overall network throttling events.
- Other Operational Metrics:
  - Thread Pools: RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent (per broker) show how busy Kafka’s internal thread pools are. Consistently low idle (%) can indicate bottlenecks.
  - ZooKeeper: For ZooKeeper-based MSK clusters (older Kafka versions that use ZooKeeper), ZooKeeperRequestLatencyMsMean and ZooKeeperSessionState reflect ZooKeeper performance. For ZooKeeperSessionState, any value other than 1 that persists for 5-10 minutes should raise an alarm, because the broker may have an issue or ZooKeeper may be unable to connect to the brokers because of an intermittent network issue.
  - Tiered Storage: For clusters with tiered storage enabled, Amazon MSK provides metrics like RemoteFetchBytesPerSec, RemoteCopyBytesPerSec, RemoteLogSizeBytes, and related error/queue metrics. These track offloading to remote storage.
  - Intelligent rebalancing metrics: For MSK Provisioned clusters using Express brokers, Amazon MSK provides two key metrics to monitor rebalancing operations: RebalanceInProgress and UnderProvisioned. See Monitor Intelligent rebalancing metrics.
By grouping metrics into these categories, you can build dashboards and alerts that comprehensively cover Amazon MSK health and performance. Amazon CloudWatch also provides automatic dashboards for Amazon MSK.
Let’s take a quick look at how to access the CloudWatch automatic dashboard. In the AWS Console, go to the CloudWatch service. In the CloudWatch console, choose Dashboards. Open the Automatic dashboards tab and search for MSK in the filter bar.
These dashboards offer preconfigured visualizations of key metrics, enabling quick insights into the health and performance of your MSK clusters.
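To make the ActiveControllerCount arithmetic from the key metrics list concrete, here is a minimal Python sketch. It assumes you already have one 0/1 reading per reporting node for a single period; the function name `controller_summary` and the input shape are illustrative assumptions, not part of any AWS SDK.

```python
def controller_summary(samples):
    """Summarize one period of ActiveControllerCount readings.

    `samples` is a hypothetical list holding one 0/1 value per reporting
    node (brokers for ZooKeeper-based clusters, the three controller
    endpoints for KRaft-based clusters).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    total = sum(samples)
    return {
        "sum": total,                       # should be exactly 1
        "average": total / len(samples),    # 1 / number of reporters when healthy
        "healthy": total == 1,
    }


# A healthy 3-broker cluster: the average is 1/3 and the sum is 1.
three_brokers = controller_summary([1, 0, 0])

# A 6-broker cluster in the middle of a controller election: the sum is 0.
election = controller_summary([0, 0, 0, 0, 0, 0])
```

The sum statistic is the simpler one to alarm on because its healthy value does not change when you add or remove brokers.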
Recommended CloudWatch alarms
Setting alarms on key metrics helps catch issues early, which is crucial in streaming applications where every second counts. A single failing broker can trigger a chain reaction: halting data ingestion, backing up upstream systems, and breaking downstream applications. This can quickly escalate from delayed order processing to lost revenue. Proactive monitoring helps you catch and fix problems before they affect your business operations. Based on AWS best practices and experience, consider alarms such as:
| Metric (Dimension) | Alarm Condition | Rationale |
| --- | --- | --- |
| ActiveControllerCount (cluster) | ≠ 1 (count) | Only one active controller should exist. Deviation implies cluster instability. |
| CPU Utilization (Sum(CpuUser + CpuSystem), per broker) | > 60% (average) for 5+ minutes | Helps maintain headroom for broker load and maintenance. High CPU may slow processing, as outlined in the MSK best practices documentation. |
| HeapMemoryAfterGC (broker) | > 60% (percentage) | Indicates that the Kafka heap is filling up. Helps prevent OOM by alerting early. |
| KafkaDataLogsDiskUsed (broker) | ≥ 85% (%) | Warns that the disk is nearly full. Helps prevent data loss by providing time for scaling or cleanup. |
| OfflinePartitionsCount (cluster) | > 0 (count) | Any offline partition means unavailable data. Immediate investigation needed. |
| UnderReplicatedPartitions (broker) | > 0 (count) | No replicas should lag under healthy conditions. Spikes or sustained lag can indicate overload or ACL misconfiguration. |
| UnderMinIsrPartitionCount (broker) | > 0 (count) | Some topics have partitions with fewer in-sync replicas than the min.insync.replicas setting, or with RF=MinISR. To find the affected topics, describe them with the Kafka command-line tools (for example, kafka-topics.sh --describe). |
| ConnectionCount (broker) | Sudden drop (e.g. < 90% of baseline) or spike above a high threshold | Detects client connectivity issues or connection floods. Unexpected drops may mean a broker is unreachable. Refer to Amazon MSK Standard broker quota. |
| CPUCreditBalance (for T3 broker) | < a low threshold (e.g. 10 credits) | For burstable instances, alerts when credits are nearly exhausted, which degrades performance. |
| VolumeQueueLength (broker) | > 0 (sustained) or rising | Indicates that I/O operations are queuing, a possible disk bottleneck. |
| NetworkRxErrors/NetworkTxErrors (broker) | > 0 (count) | Any network errors can cause packet loss or disconnections. |
| IAMTooManyConnections (broker) | > 0 (count) | Exceeding the IAM connection limit (100) blocks new connections. |
| Consumer Lag (MaxOffsetLag or SumOffsetLag) (per consumer group/topic) | > threshold (depends on SLAs, e.g. growing beyond expected) | Alerts on slow consumers so you can scale consumers or investigate backlogs. |
| TrafficShaping | > 0 (any throttling) | Indicates that brokers are exceeding their allocated network bandwidth. |
These are illustrative thresholds; adjust them for your workload and SLAs. The remaining metrics listed in the CloudWatch metrics for Standard and Express brokers documentation are susceptible to downstream impact from anomalies in the primary metrics above. We recommend enabling CloudWatch alarms on a single test cluster first to validate thresholds before extending coverage across your MSK fleet.
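As one way to express the CPU row of the table, the following Python sketch builds the arguments for a CloudWatch metric math alarm on CpuUser + CpuSystem for a single broker. The helper name `cpu_alarm_params`, the alarm naming scheme, the cluster name, and the missing-data setting are assumptions for illustration; the namespace `AWS/Kafka` and the `Cluster Name` / `Broker ID` dimensions are the ones MSK publishes broker metrics under. The function only builds a dictionary, so it runs without AWS credentials.

```python
def cpu_alarm_params(cluster_name, broker_id, threshold=60.0, minutes=5):
    """Build keyword arguments for a CloudWatch alarm on total broker CPU.

    Hypothetical helper: the alarm name and defaults are illustrative choices.
    """
    dimensions = [
        {"Name": "Cluster Name", "Value": cluster_name},
        {"Name": "Broker ID", "Value": str(broker_id)},
    ]

    def metric_query(query_id, metric_name):
        # One input series for the metric math expression (not alarmed on directly).
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kafka",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 60,        # MSK emits metrics at 1-minute intervals
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"msk-{cluster_name}-broker-{broker_id}-cpu-high",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": minutes,    # breach must hold for `minutes` periods
        "DatapointsToAlarm": minutes,
        "TreatMissingData": "missing",   # assumption: tune for your environment
        "Metrics": [
            metric_query("cpu_user", "CpuUser"),
            metric_query("cpu_system", "CpuSystem"),
            {
                "Id": "cpu_total",
                "Expression": "cpu_user + cpu_system",
                "Label": "CpuUser + CpuSystem",
                "ReturnData": True,      # the alarm evaluates this series
            },
        ],
    }


params = cpu_alarm_params("demo-cluster", 1)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`, typically with an `AlarmActions` entry added for notifications.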
Conclusion
In this post, we covered the important CloudWatch metrics and alarms for monitoring Amazon MSK clusters effectively. By implementing these recommended alarms, you can proactively detect and respond to potential issues before they affect your Kafka workloads. To learn more about Amazon MSK monitoring, refer to the Amazon MSK Monitoring Best Practices documentation or explore the Amazon MSK Workshops for hands-on experience.
About the authors
