This post is co-written with Ido Ziv from Kaltura.
As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.
At Kaltura, an AI-infused video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn't just about troubleshooting: it's about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.
In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service, achieving higher log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.
Observability challenges at scale
Kaltura ingests over 8 TB of logs and traces daily, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant observability challenges. Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to growing operational complexity and fixed costs that didn't scale well with usage.
Kaltura's DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management.
Solution overview
Kaltura created a new AWS account to serve as a dedicated observability account, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2).
By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards for a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.
Ingestion
As seen in the following diagram, logs are shipped using log shippers, also known as collectors. In Kaltura's case, they used Fluent Bit. A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as log analytics platforms, management systems, or an aggregator system. Fluent Bit was used for all sources and also provided light processing capabilities. Fluent Bit was deployed as a DaemonSet in Kubernetes. The application development teams didn't have to change their code, because the Fluent Bit pods read the stdout of the application pods.

The following code is an example of the Fluent Bit configuration for Amazon EKS:
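Kaltura's actual configuration isn't reproduced here; the following is a minimal sketch of a Fluent Bit DaemonSet configuration that tails container stdout, enriches records with Kubernetes metadata, and forwards them to an OpenSearch Ingestion HTTP endpoint with SigV4 signing (the endpoint, path, and Region are placeholders):

```ini
[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Parser        docker
    Tag           kube.*
    Mem_Buf_Limit 50MB

[FILTER]
    Name      kubernetes
    Match     kube.*
    Merge_Log On
    Keep_Log  Off

[OUTPUT]
    Name        http
    Match       kube.*
    Host        logs-pipeline.example.us-east-1.osis.amazonaws.com
    Port        443
    URI         /log/ingest
    Format      json
    tls         On
    aws_auth    true
    aws_region  us-east-1
    aws_service osis
```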
Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry (OTel) Collector using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.
Knowledge from Fluent Bit and OpenTelemetry Collector was despatched to OpenSearch Ingestion, a completely managed, serverless information collector that delivers real-time log, metric, and hint information to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Every producer despatched information to a selected pipeline, one for logs and one for traces, the place information was remodeled, aggregated, enriched, and normalized earlier than being despatched to OpenSearch Service. The hint pipeline used the otel_trace and service_map processors, whereas utilizing the OpenSearch Ingestion OpenTelemetry hint analytics blueprint.
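As a rough illustration of that blueprint (sub-pipeline and processor names follow the public Data Prepper trace analytics examples; the domain endpoint is a placeholder, and IAM and network settings are omitted), the trace pipeline might look like the following:

```yaml
entry-pipeline:
  source:
    otel_trace_source:
      path: "/traces/ingest"
  sink:
    - pipeline:
        name: "raw-trace-pipeline"
    - pipeline:
        name: "service-map-pipeline"

raw-trace-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_trace_raw:
  sink:
    - opensearch:
        # IAM role and network configuration omitted for brevity
        hosts: ["https://search-observability.example.us-east-1.es.amazonaws.com"]
        index_type: trace-analytics-raw

service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map:
  sink:
    - opensearch:
        hosts: ["https://search-observability.example.us-east-1.es.amazonaws.com"]
        index_type: trace-analytics-service-map
```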
The following code is an example of the OpenSearch Ingestion pipeline for logs:
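The original pipeline definition isn't reproduced here; the following sketch reconstructs it from the processor descriptions that follow. The Grok pattern, endpoint, index name, and conditional expression syntax (which follows recent Data Prepper releases) are illustrative, not Kaltura's actual values:

```yaml
log-pipeline:
  source:
    http:
      path: "/log/ingest"
  processor:
    # Tag every event with a log_type derived from the file name
    - add_entries:
        entries:
          - key: "log_type"
            value: "default"
            overwrite_if_key_exists: false
          - key: "log_type"
            value: "api"
            overwrite_if_key_exists: true
            add_when: 'contains(/filename, "api.log")'
          - key: "log_type"
            value: "stats"
            overwrite_if_key_exists: true
            add_when: 'contains(/filename, "stats.log")'
    # Parse API logs with a custom Grok pattern
    - grok:
        grok_when: '/log_type == "api"'
        match:
          message:
            - '%{TIMESTAMP_ISO8601:timestamp} %{IP:logIp} %{HOSTNAME:host} %{WORD:priorityName} %{INT:priority} memory: %{INT:memory} real: %{NUMBER:real} %{GREEDYDATA:message}'
    # Normalize timestamp strings into @timestamp
    - date:
        match:
          - key: "timestamp"
            patterns: ["yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"]
        destination: "@timestamp"
    - rename_keys:
        entries:
          - from_key: "timestamp"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false
    # Drop noisy or irrelevant logs
    - drop_events:
        drop_when: 'contains(/filename, "simplesamlphp.log")'
  sink:
    - opensearch:
        # IAM role and network configuration omitted for brevity
        hosts: ["https://search-observability.example.us-east-1.es.amazonaws.com"]
        index: "logs-%{yyyy.MM.dd}"
```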
The preceding example shows the use of processors such as grok, date, add_entries, rename_keys, and drop_events:
- add_entries:
  - Adds a new field `log_type` based on the filename
    - Default: "default"
    - If the filename contains specific substrings (such as `api.log` or `stats.log`), it assigns a more specific type
- grok:
  - Applies Grok parsing to logs of type "api"
  - Extracts fields like `timestamp`, `logIp`, `host`, `priorityName`, `priority`, `memory`, `real`, and `message` using a custom pattern
- date:
  - Parses timestamp strings into a standard datetime format
  - Stores the result in a field called `@timestamp` in ISO8601 format
  - Handles multiple timestamp patterns
- rename_keys:
  - Renames `timestamp` or `date` to `@timestamp`
  - Doesn't overwrite if `@timestamp` already exists
- drop_events:
  - Drops logs where the filename contains `simplesamlphp.log`
  - This is a filtering rule to ignore noisy or irrelevant logs
The following is an example of an input log line:
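The original sample line isn't shown; a hypothetical API log line matching the fields described above might look like this:

```
2024-05-12 07:21:30 10.0.3.24 api-server-1 INFO 6 memory: 2048 real: 0.42 request served in 420ms
```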
After processing, we get the following output:
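For illustration only, a processed event would carry the extracted fields, the assigned `log_type`, and the normalized `@timestamp`, along the lines of:

```json
{
  "@timestamp": "2024-05-12T07:21:30.000Z",
  "log_type": "api",
  "logIp": "10.0.3.24",
  "host": "api-server-1",
  "priorityName": "INFO",
  "priority": "6",
  "memory": "2048",
  "real": "0.42",
  "message": "request served in 420ms"
}
```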
Kaltura followed some OpenSearch Ingestion best practices, such as:
- Including a dead-letter queue (DLQ) in the pipeline configuration. This can significantly help troubleshoot pipeline issues.
- Starting and stopping pipelines to optimize cost-efficiency, when possible.
- During the proof of concept stage:
  - Installing Data Prepper locally for faster development iterations.
  - Disabling persistent buffering to expedite blue/green deployments.
Achieving operational excellence with efficient log and trace management
Logs and traces play a vital role in identifying operational issues, but they come with unique challenges. First, they represent time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management crucial. Third, they're append-only in nature. With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.
Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Several Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting the indexes. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.
The following example ISM policy makes sure indexes are managed efficiently, rolled over, and moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it's retried with an exponential backoff strategy. In case of failures, notifications are sent to relevant teams to keep them informed.
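Kaltura's exact policy isn't reproduced here; the following sketch implements the behavior just described. The rollover thresholds, transition ages, index patterns, and Slack webhook are illustrative placeholders:

```json
{
  "policy": {
    "description": "Roll over hot indexes, migrate to UltraWarm, delete after 60 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "rollover": { "min_size": "50gb", "min_index_age": "1d" }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "warm_migration": {}
          }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ],
    "error_notification": {
      "destination": {
        "slack": { "url": "https://hooks.slack.com/services/example" }
      },
      "message_template": { "source": "ISM action failed for index {{ctx.index}}" }
    },
    "ism_template": [
      { "index_patterns": ["logs-*"], "priority": 100 }
    ]
  }
}
```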
To create a data stream in OpenSearch, an index template definition is required, which configures how the data stream and its backing indexes behave. In the following example, the index template specifies key index settings such as the number of shards, replication, and refresh interval, controlling how data is distributed, replicated, and refreshed across the cluster. It also defines the mappings, which describe the structure of the data: what fields exist, their types, and how they should be indexed. These mappings make sure the data stream knows how to interpret and store incoming log data efficiently. Finally, the template enables the @timestamp field as the time-based field required for a data stream.
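A template along these lines might look like the following (the template name, index pattern, field list, and setting values are illustrative):

```json
PUT _index_template/logs-app-template
{
  "index_patterns": ["logs-app-*"],
  "data_stream": {
    "timestamp_field": { "name": "@timestamp" }
  },
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "log_type":   { "type": "keyword" },
        "host":       { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  },
  "priority": 100
}
```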
Implementing role-based access control and user access
The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow.

Each user accesses the dashboards to view the observability items relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC). When users log in to the OpenSearch domain, they're automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging within development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration alleviates the need to manage internal OpenSearch users entirely.
For each role at Kaltura, a corresponding OpenSearch role was created with only the required permissions. For instance, support engineers are granted access to the monitoring plugin to create alerts based on logs, whereas QA engineers, who don't require this functionality, aren't granted that access.
The following screenshot shows the DevOps engineer role defined with cluster permissions.

These users are routed to their own dedicated DevOps tenant, to which only they have write access. This makes it possible for users in different roles at Kaltura to create the dashboard items that focus on their own priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped the Okta group to the role, so when a user logs in from Okta, they're automatically assigned based on their role.
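As an illustration (the role and group names are hypothetical), such a backend role mapping can be created through the Security plugin REST API:

```json
PUT _plugins/_security/api/rolesmapping/devops_role
{
  "backend_roles": ["okta-devops"],
  "hosts": [],
  "users": []
}
```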

This also works with IAM roles to facilitate automation in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot.

Using observability features and service mapping for enhanced trace and log correlation
After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can examine traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The service_map processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards.
Using the combined signals of traces and spans, OpenSearch discovers the application connectivity and maps it into a service map.

After OpenSearch ingests the traces and spans from the OTel collector, they are aggregated into groups according to paths and characteristics. Durations are also calculated and presented to the user over time.

With a trace ID, it's possible to filter all the relevant spans by service and see how long each took, identifying issues with external services such as MongoDB and Redis.

From the spans, users can discover the related logs.

Post-migration improvements
After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.
One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.
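A replay pipeline along these lines could use the Data Prepper S3 scan source (the bucket, Region, role ARN, and index name are placeholders):

```yaml
s3-replay-pipeline:
  source:
    s3:
      compression: "gzip"
      codec:
        newline:
      scan:
        buckets:
          - bucket:
              name: "example-historical-logs"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/example-pipeline-role"
  sink:
    - opensearch:
        # IAM role and network configuration omitted for brevity
        hosts: ["https://search-observability.example.us-east-1.es.amazonaws.com"]
        index: "logs-replay"
```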
In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating additional AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.
Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code tools like Terraform, developers can define index mappings, alerts, and dashboards as code, versioned in Git and deployed consistently across environments.
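As a sketch of this pattern, assuming the community opensearch-project Terraform provider and hypothetical resource and file names, a service repository might manage its own ISM policy and index template like so:

```hcl
terraform {
  required_providers {
    opensearch = {
      source = "opensearch-project/opensearch"
    }
  }
}

provider "opensearch" {
  url = "https://search-observability.example.us-east-1.es.amazonaws.com"
}

# ISM policy versioned alongside the service code it governs
resource "opensearch_ism_policy" "app_logs" {
  policy_id = "app-logs-retention"
  body      = file("${path.module}/ism/app-logs-retention.json")
}

# Composable index template backing the service's data stream
resource "opensearch_composable_index_template" "app_logs" {
  name = "logs-app-template"
  body = file("${path.module}/templates/logs-app.json")
}
```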
Conclusion
Kaltura successfully implemented a smart log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements. If Kaltura's journey has inspired you and you're interested in implementing a similar solution in your organization, consider these steps:
- Start by understanding the requirements and setting expectations with the engineering teams in your organization
- Start with a quick proof of concept to get hands-on experience
- Refer to the following resources to help you get started:
About the authors
Ido Ziv is a DevOps team lead at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).
Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.
Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is based in Israel and helps customers harness AWS analytical services to use data, gain insights, and derive value.
