This is a guest post co-written with Shashidhar Soppin, Manochandra Menni, and Anchal Kansal from Zeta.
Zeta is a core banking technology provider that enables banks to rapidly launch extensible banking asset and liability products. Zeta’s primary products are Olympus and Tachyon. Olympus is a platform as a service (PaaS) that simplifies building and operating cloud-native, secure, and distributed multi-tenant software as a service (SaaS) products. It blends infrastructure as code and GitOps methodologies for efficient and consistent deployment of SaaS products. Its architecture prioritizes strong tenant isolation, real-time event processing, and comprehensive observability, supporting robust API integrations and seamless deployment. Zeta’s Tachyon is a full-stack, cloud-native, API-first digital banking SaaS service delivered through Olympus. The banking services of Tachyon include payment engines (for UPI, credit, debit, and prepaid cards), savings and checking account management, and so on. Tachyon is a modern debit processing product with personal finance management and card controls. It’s designed to increase usage, upsell credit, reduce fraud, and improve customer satisfaction. The Tachyon product offers comprehensive provisioning, payments, and account management APIs and SDKs, enabling seamless integration of financial products into third-party apps without compromising privacy and security. Zeta operates Tachyon as a multi-tenant SaaS product, serving customers who are configured as individual tenants within the system. Zeta’s technology stack is monitored by their Customer Service Navigator (CSN) product, which is part of Olympus.
As a global SaaS provider, Zeta needed a solution capable of monitoring tenants, measuring SLAs, meeting local regulatory requirements, and scaling efficiently with both new tenant onboarding and seasonal usage spikes. Zeta sought a cost-effective, scalable system that would provide a unified “single pane of glass” to monitor the application services, cloud infrastructure, open-source components, and third-party products.
Zeta faced a formidable challenge in orchestrating a cohesive monitoring system across a rapidly expanding multi-tenant environment, diverse domains, and numerous tools. As more tenants joined their system, the complexity grew exponentially, making Zeta’s monitoring solution increasingly difficult to maintain. The primary challenge stemmed from fragmented monitoring tools that made it difficult to quickly identify root causes across interconnected systems, leading to prolonged troubleshooting times and potential service degradation. When users reported issues, such as credit card payment problems, the Site Reliability Engineering (SRE) team had to navigate multiple disparate monitoring tools and siloed data, and the lack of integrated observability resulted in time-consuming manual correlation efforts. This multi-tenant, multi-solution landscape significantly complicated the ability to maintain consistent monitoring standards and service levels. The challenge was further complicated by the complex regulatory landscape, where global expansion required adherence to diverse local regulations, necessitating a flexible architecture capable of accommodating varying data retention policies and access controls across different jurisdictions. Each new tenant addition multiplied the complexity of balancing the monitoring needs of internal SRE teams and customers, requiring sophisticated data segregation and access management. Additionally, Zeta required comprehensive anomaly detection capabilities across systems, components, infrastructure, and operations, which called for a solution that could scale dynamically while establishing dynamic baselines and identifying subtle patterns that might indicate emerging issues.
As the tenant base continued to grow, the need for a unified, scalable monitoring solution that could streamline these processes, enhance operational visibility, and maintain system integrity became critical.
Zeta’s goal was to streamline their processes and enhance operational visibility across the entire technology landscape. By addressing these challenges, Zeta aimed to create a unified observability solution that would significantly improve incident response times, strengthen regulatory compliance posture, and ultimately deliver a more reliable and performant service to their global customer base.
In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from 30+ minutes to under 5 minutes.
Solution overview
Zeta designed and built an observability system, CSN, to deliver comprehensive visibility across the service environment. CSN is part of the Olympus suite of products. CSN serves as the primary interface for the SRE team, offering real-time service health dashboards, infrastructure monitoring, SLA performance analytics, and an admin panel for user management. The system is equipped with single sign-on (SSO) integration and enforces role-based access control (RBAC) to enable secure, granular access. With CSN, SREs can efficiently monitor system health, receive actionable alerts and warnings, and manage operational workflows across critical services.
CSN is powered by OpenSearch Service to provide an integrated solution that helps DevOps and site reliability engineers identify critical events and issues. Zeta chose OpenSearch Service because it offers a fully managed, open-source search and analytics engine that scales to handle the growing number of tenants, associated data growth, and analytics needs. Its seamless integration with AWS services, robust security features, and support for real-time data ingestion and querying make it ideal for powering the CSN dashboards and analytics workloads. The following diagram illustrates the CSN deployment architecture.
The OpenSearch Service domain uses the Multi-AZ with Standby deployment model, following AWS best practices for high availability and fault tolerance. Nodes (including dedicated cluster manager nodes, data nodes, and UltraWarm nodes) are distributed evenly across three Availability Zones in the same AWS Region. Availability Zones 1 and 2 handle active indexing and search traffic, and Availability Zone 3 contains standby nodes that remain passive during normal operations. If an Availability Zone failure occurs, OpenSearch Service automatically promotes standby nodes to active status, maintaining cluster operations with minimal disruption and no need for data redistribution.
The OpenSearch cluster consists of three dedicated cluster manager nodes and a multiple-of-three data node count to maintain quorum and balanced shard allocation. Each index uses at least two replicas, providing redundant copies of data across the Availability Zones. This Multi-AZ with Standby configuration delivers high resilience and rapid failover, supporting continuous service availability and robust disaster recovery for the observability workloads.
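The sizing rules in the two paragraphs above can be written down as a small pre-deployment check. The following is a minimal sketch, not Zeta’s tooling: the `DomainTopology` type and `topology_problems` function are hypothetical names introduced here for illustration, and the rules encoded are only the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainTopology:
    """Hypothetical description of an OpenSearch Service domain layout."""
    availability_zones: int
    dedicated_manager_nodes: int
    data_nodes: int
    replicas_per_index: int


def topology_problems(t: DomainTopology) -> list[str]:
    """Return rule violations; an empty list means the layout passes."""
    problems = []
    if t.availability_zones != 3:
        problems.append("expected three Availability Zones")
    if t.dedicated_manager_nodes != 3:
        problems.append("expected three dedicated cluster manager nodes")
    if t.data_nodes == 0 or t.data_nodes % 3 != 0:
        problems.append("data node count should be a non-zero multiple of three")
    if t.replicas_per_index < 2:
        problems.append("each index should have at least two replicas")
    return problems
```

A layout with six data nodes and two replicas passes, whereas four data nodes with one replica yields two violations.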
Data collection and ingestion
The observability strategy centers on a data collection and ingestion pipeline designed to handle the complexity and scale of the environment. The architecture, as shown in the following diagram, addresses three critical data types: AWS resource logs, application logs, and distributed traces, with each data type using tailored collection and processing methods optimized for the workloads.
AWS resource logs collection
The infrastructure spans multiple AWS services including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Application Load Balancer, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Elastic Compute Cloud (Amazon EC2), and more. Zeta uses Amazon CloudWatch Logs as the primary collection point for AWS service logs, which provides native integration with these services.
AWS services send their logs directly to CloudWatch Logs, from which they are pulled by Fluentd running on the Amazon EKS cluster for centralized processing. This approach natively captures operational data from the AWS resources, including:
- Database operational logs and audit trails from Amazon RDS instances
- Data warehouse query execution logs from Amazon Redshift
- Application Load Balancer access logs capturing traffic patterns and performance metrics
- Kafka cluster operational logs from Amazon MSK
- AWS API invocation audit trails from AWS CloudTrail
- Container runtime and operating system logs from Amazon EC2
During log collection, personally identifiable information (PII) is filtered out. The solution adheres strictly to PCI-DSS guidelines throughout this process.
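To illustrate what filtering sensitive values out of a log line can look like, here is a minimal sketch. It is not the filter Zeta uses: the patterns, the placeholder strings, and the function names are all assumptions made for this example, and a production rule set would need to be far broader and formally vetted against PCI-DSS requirements.

```python
import re

# Assumed patterns for illustration only.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to avoid masking arbitrary long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    return "[REDACTED_PAN]" if luhn_valid(digits) else match.group()


def redact(message: str) -> str:
    """Mask card-like numbers that pass the Luhn check, then email addresses."""
    message = CARD_CANDIDATE.sub(_mask_card, message)
    return EMAIL.sub("[REDACTED_EMAIL]", message)
```

The Luhn check is a design choice to reduce false positives: a 13-digit epoch timestamp matches the digit pattern but usually fails the checksum, so it is left untouched.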
Zeta used Amazon MSK as a scalable and reliable backbone for collecting and streaming logs from various sources across the AWS resources. Logs are ingested into Amazon MSK, providing a durable and fault-tolerant buffer that decouples log producers from consumers. This architecture enables real-time log streaming and supports advanced processing pipelines before the logs are routed to OpenSearch Service. Integrating Amazon MSK into the logging workflow improves scalability, resilience, and flexibility, so that high log volumes are managed efficiently without impacting downstream systems. This approach, combined with native AWS integrations, minimizes operational complexity and maintains comprehensive, centralized log visibility across the cloud environment.
Fluentd processes these logs and routes them directly to OpenSearch Service, maintaining the benefits of AWS integration while providing centralized accessibility. This centralized logging approach with built-in buffering capabilities reduces the direct load on OpenSearch Service by batching and optimizing log delivery, helping to prevent potential ingestion bottlenecks during high-volume periods. The approach removes the need for custom log shipping agents on AWS resources, reducing operational overhead while maintaining comprehensive coverage of the cloud infrastructure.
Application logs processing
For application-level observability, a pipeline using Fluentd is deployed as a Kubernetes DaemonSet. Application microservices running on Amazon EKS generate logs that the Fluentd DaemonSet pods collect, parse, and enrich with metadata such as pod names, namespaces, and service identifiers. The processed logs then flow through Amazon MSK for reliable, high-throughput message streaming before final processing by Fluentd and indexing in OpenSearch Service.
This Kafka-based approach provides several advantages:
- Decoupling – This helps producers and consumers operate independently, so that Zeta can scale ingestion and processing separately based on demand.
- Backpressure handling – Using Kafka’s buffering capabilities, this manages traffic spikes during peak banking hours, absorbing sudden increases in log volume while maintaining system stability during seasonal usage surges.
- Durability of logs – Message persistence keeps logs durable, so that no log data is lost during system maintenance or unexpected failures.
The logs then flow through a second Fluentd layer for final processing and routing to OpenSearch Service, where they’re indexed across service-specific indexes (app-index, falco-index, kong-index).
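The enrich-then-route steps described above can be sketched in a few lines. This is an illustrative model only, not the Fluentd configuration: the index names come from the text, while the `kubernetes` field layout, the source-to-index mapping, and the function names are assumptions.

```python
from typing import Any

# Index names come from the text; the mapping from service to index is assumed.
INDEX_BY_SERVICE = {"falco": "falco-index", "kong": "kong-index"}
DEFAULT_INDEX = "app-index"


def enrich(record: dict[str, Any], pod: str, namespace: str, service: str) -> dict[str, Any]:
    """Return a copy of the record with Kubernetes metadata attached."""
    return {
        **record,
        "kubernetes": {"pod_name": pod, "namespace": namespace, "service": service},
    }


def target_index(record: dict[str, Any]) -> str:
    """Pick the service-specific index, falling back to the application index."""
    return INDEX_BY_SERVICE.get(record["kubernetes"]["service"], DEFAULT_INDEX)
```

Keeping enrichment separate from routing mirrors the two Fluentd layers: the first attaches metadata close to the pod, and the second decides where the record is indexed.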
Distributed trace collection
To address the challenge of correlating issues across Zeta’s microservices architecture, the system uses distributed tracing with Jaeger, an open-source, end-to-end distributed tracing system. Jaeger enables monitoring and troubleshooting of transactions in complex distributed systems by tracking requests as they flow through multiple services. The application services and Kong API Gateway are instrumented with Jaeger client libraries that generate trace data including spans, which represent individual operations within a trace. Each span contains metadata such as operation names, start and finish timestamps, tags, and logs that provide context about the operation being performed. The Jaeger Collector aggregates these spans from multiple services, performing validation, indexing, and transformation before forwarding the data.
The traces flow through Amazon MSK for the same reliability benefits as the logging pipeline: durability, decoupling, and backpressure handling during high-volume periods. Jaeger Ingester then consumes traces from Amazon MSK and processes them for storage in the jaeger-index within OpenSearch Service.
This data collection and ingestion strategy provides full end-to-end visibility and builds an observability system that enables SRE teams to monitor, troubleshoot, and optimize the services across the entire technology stack.
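To make the span and trace vocabulary concrete, the following sketch models a span with the metadata listed above and derives an end-to-end duration per trace. It is a simplified stand-in, not Jaeger’s data model: the `Span` class, its field names, and the microsecond units are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified span: one operation within a trace (field names assumed)."""
    trace_id: str
    operation_name: str
    start_us: int                      # start timestamp, microseconds
    end_us: int                        # finish timestamp, microseconds
    tags: dict = field(default_factory=dict)

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def trace_durations(spans: list[Span]) -> dict[str, int]:
    """Map each trace ID to its end-to-end duration in microseconds."""
    bounds: dict[str, tuple[int, int]] = {}
    for s in spans:
        lo, hi = bounds.get(s.trace_id, (s.start_us, s.end_us))
        bounds[s.trace_id] = (min(lo, s.start_us), max(hi, s.end_us))
    return {trace_id: hi - lo for trace_id, (lo, hi) in bounds.items()}
```

Grouping spans by trace ID is what lets a single slow request be followed across the gateway and the services behind it.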
Storage tiering
To manage the log, metric, and trace data at scale (about 3 TB generated daily), the solution implemented OpenSearch Service storage tiering to balance performance, retention, and cost. Zeta requires near real-time search and retrieval for at least a week, while retaining logs and traces for up to 10 years. Keeping this data in active clusters would impact search performance and significantly increase costs, so the solution uses the OpenSearch Service hot, UltraWarm, and cold storage tiers to optimize the data lifecycle. The following diagram illustrates storage tiering in OpenSearch Service.
Hot storage is used for the most recent and frequently accessed data, supporting real-time indexing and low-latency queries. This tier relies on high-performance storage attached to standard data nodes, making it ideal for powering live dashboards and analytics where speed is critical. The solution uses AWS Graviton2-powered m6g.4xlarge.search instance types to run the OpenSearch Service domain, which provides up to 40% lower cost compared to x86-based instances. Each hot data node has an attached gp3 EBS volume to store indexes. Zeta maintains data in hot storage for 1 week.
UltraWarm storage serves as a cost-effective layer for older, read-only data that’s queried less frequently but still needs to remain searchable. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as the backing store with an integrated caching mechanism, to retain large volumes of data at a fraction of the cost of hot storage while still supporting interactive queries for historical analysis. Zeta uses ultrawarm1.large.search instance types in the UltraWarm storage tier and maintains data in UltraWarm storage for 15 days.
Cold storage is designed for long-term archival of infrequently accessed or compliance-driven data. Data in cold storage is detached from active compute resources and resides in Amazon S3, incurring minimal cost. When historical data needs to be queried, the indexes are attached to the UltraWarm nodes using OpenSearch API calls. This supports extracting historical data for audits, periodic analysis, or forensic investigations without maintaining active compute for the entire retention period, thereby reducing storage cost.
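A rough back-of-the-envelope calculation shows why tiering matters at this volume. The figures for daily ingest, retention periods, and replica count come from the text; everything else is an assumption for this sketch, including treating the 3 TB as on-disk index size, ignoring compression and indexing overhead, and assuming UltraWarm and cold data are held as a single copy because they are backed by Amazon S3.

```python
DAILY_INGEST_TB = 3        # from the text
HOT_DAYS = 7               # from the text
WARM_DAYS = 15             # from the text
RETENTION_YEARS = 10       # from the text
REPLICAS = 2               # "at least two replicas" per index


def hot_tier_tb(daily_tb: int, days: int, replicas: int) -> int:
    """Primary data plus one full copy per replica."""
    return daily_tb * days * (1 + replicas)


def warm_tier_tb(daily_tb: int, days: int) -> int:
    """Single copy assumed, since UltraWarm data is backed by Amazon S3."""
    return daily_tb * days


def cold_tier_tb(daily_tb: int, retention_days: int, hot_days: int, warm_days: int) -> int:
    """Everything older than the hot and UltraWarm windows, single copy assumed."""
    return daily_tb * (retention_days - hot_days - warm_days)
```

Under these assumptions the hot tier holds about 63 TB, UltraWarm about 45 TB, and a full 10 years of cold data comes to roughly 10,884 TB, which is where the petabyte-scale framing comes from.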
OpenSearch Service automates index transitions between hot, UltraWarm, and cold storage tiers using Index State Management (ISM) policies. ISM policies specify the conditions and actions for each state, such as transitioning based on index age, size, or document count. When an index qualifies for a transition, ISM jobs, which run every 5 to 8 minutes, evaluate the policy and move the index to the next tier. When indexes reach the UltraWarm threshold, they’re migrated to UltraWarm nodes backed by Amazon S3, which reduces storage costs while keeping data accessible for queries. After the UltraWarm retention period, ISM archives the indexes to cold storage, detaching them from compute resources but allowing reattachment for future queries or compliance needs. This automated lifecycle management reduces operational overhead, optimizes storage costs, and maintains performance for both recent and historical data.
For observability data, new indexes are created in the hot tier, where they remain for 7 days to support fast ingestion and low-latency queries. After this period, ISM transitions these indexes to UltraWarm storage, where they’re retained for an additional 15 days as read-only data, balancing cost with searchability.
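The lifecycle described above maps naturally onto an ISM policy document. The following is a minimal sketch of such a policy built as a Python dictionary, not Zeta’s actual policy: the description string, the `@timestamp` field name, and the function name are assumptions. Because `min_index_age` is measured from index creation, the move to cold storage is expressed as 7 + 15 = 22 days.

```python
import json


def build_tiering_policy(hot_days: int = 7, warm_days: int = 15) -> dict:
    """Build an illustrative hot -> UltraWarm -> cold ISM policy body."""
    cold_after = hot_days + warm_days  # min_index_age counts from index creation
    return {
        "policy": {
            "description": "Hot to UltraWarm to cold storage (illustrative)",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm",
                         "conditions": {"min_index_age": f"{hot_days}d"}}
                    ],
                },
                {
                    "name": "warm",
                    "actions": [{"warm_migration": {}}],
                    "transitions": [
                        {"state_name": "cold",
                         "conditions": {"min_index_age": f"{cold_after}d"}}
                    ],
                },
                {
                    "name": "cold",
                    # timestamp_field is an assumed field name in the indexed documents
                    "actions": [{"cold_migration": {"timestamp_field": "@timestamp"}}],
                    "transitions": [],
                },
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tiering_policy(), indent=2))
```

The printed JSON is the kind of body that would be submitted to the ISM policy API (`PUT _plugins/_ism/policies/<policy_id>`) and then attached to the relevant index patterns.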
Security
Security is the most critical part of the architecture. Zeta’s observability system implements multiple layers of security for data confidentiality, integrity, and compliance with banking regulations, and is built using a zero-trust approach following the AWS shared responsibility model for OpenSearch Service:
- Infrastructure security: The OpenSearch Service domain is deployed within a virtual private cloud (VPC) with private subnets, isolating it from direct internet access. Security groups enforce restrictive ingress rules, allowing access only from authorized sources. The OpenSearch Service domain uses encryption at rest through AWS Key Management Service (AWS KMS). Data in transit is secured using TLS 1.3 encryption, so that log data, traces, and search queries remain protected during transmission. Service-to-service communication uses AWS Identity and Access Management (IAM) roles and encrypted connections, removing the need for hardcoded credentials.
- Access control and authentication: The solution uses Amazon OpenSearch Service fine-grained access control (FGAC) integrated with IAM, where IAM serves as the authentication provider and FGAC handles authorization by mapping IAM roles to OpenSearch backend roles. This approach helps Zeta control access permissions at the index and document level based on tenant requirements and user responsibilities. The data ingestion pipeline implements end-to-end security with Fluentd authenticating to Amazon MSK using IAM roles over encrypted connections. Amazon MSK clusters use encryption in transit and at rest, protecting log data throughout the streaming pipeline. Kubernetes RBAC policies restrict pod-to-pod communication and limit service account permissions.
- Data privacy and tenant isolation: Each tenant’s data is kept logically separate in OpenSearch Service using a tenant ID. CSN implements tenant-aware authentication and authorization with FGAC, restricting users to their authorized tenants’ dashboards and data. Every API endpoint validates tenant context, so that users can only access data within their authorized scope. Importantly, no customer data is captured in the logs; only system metrics are used to build the monitoring system, adhering to banking security standards and best practices. User actions are audited and logged for compliance purposes, with audit trails maintained according to regulatory requirements.
This security framework enables the observability system to meet the security requirements of core banking operations while maintaining operational efficiency and regulatory compliance across global industries.
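The combination of FGAC, IAM role mapping, and tenant-scoped document access described in the list above can be sketched as two request bodies for the OpenSearch security plugin. This is an illustrative example, not Zeta’s configuration: the `tenant_id` field name, the index pattern, the example ARN, and the function names are all assumptions.

```python
import json


def tenant_read_role(tenant_id: str, index_patterns: list[str]) -> dict:
    """Role body granting read access limited to one tenant's documents."""
    dls_query = {"term": {"tenant_id": tenant_id}}  # field name is assumed
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": index_patterns,
                # Document-level security expects the query as a JSON string.
                "dls": json.dumps(dls_query),
                "allowed_actions": ["read"],
            }
        ],
    }


def role_mapping(iam_role_arn: str) -> dict:
    """Role-mapping body that ties an IAM role to the role as a backend role."""
    return {"backend_roles": [iam_role_arn]}


if __name__ == "__main__":
    role = tenant_read_role("tenant-a", ["app-index*"])
    mapping = role_mapping("arn:aws:iam::123456789012:role/tenant-a-sre")  # hypothetical ARN
    print(json.dumps(role, indent=2))
    print(json.dumps(mapping, indent=2))
```

In this arrangement IAM answers who the caller is, and the role’s document-level filter answers which tenant’s documents that caller may read, which matches the split between authentication and authorization described above.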
Customer Service Navigator
CSN gives SREs a powerful diagnostics interface engineered for high-efficiency monitoring, deep analysis, and rapid troubleshooting of system performance across distributed environments. The system ingests and processes telemetry data at sub-minute intervals, providing near-real-time metrics, traces, and logs from critical infrastructure components. Actionable, interactive visualizations (such as heatmaps, anomaly graphs, and dependency maps) help SREs quickly detect SLO breaches and drill down to granular root causes, often within a few minutes of an incident.
The following screenshot shows an example service health dashboard in CSN for an Olympus tenant.
The following screenshot shows an example of the API performance insights dashboard in CSN.
Business and technical benefits
The OpenSearch Service-based CSN system provides the following business and technical benefits:
- Manual effort is reduced through automated Index State Management (ISM) and lifecycle policies, so that Zeta’s teams can focus on innovation
- Automated lifecycle policies facilitate seamless retention and archiving of compliance data, reducing the risk of non-compliance
- The system supports log retention for over 10 years to meet regulatory requirements for Zeta’s banking and financial services customers
- Multiple layers of security, including encryption at rest and in transit, FGAC, and tenant isolation, protect customer data and support Zeta’s zero-trust architecture
- By consolidating logs, traces, and metrics from disparate systems into OpenSearch Service, SRE teams can correlate events more effectively, thereby reducing troubleshooting efforts and achieving an 80% improvement in MTTR
- Zeta achieved 99.999999999% data durability for archived logs stored in Amazon S3, providing long-term data integrity
- Zstandard compression is being implemented to optimize long-term storage costs
Conclusion
CSN’s advanced correlation engine automatically associates related events across microservices, databases, network layers, and infrastructure, significantly streamlining root cause analysis. Integrated alerting and automated runbooks further reduce response times. Since implementing CSN, Zeta has achieved over an 80% reduction in MTTR, with incident response times decreasing from 30+ minutes to under 5 minutes. The service supports seamless multi-tenant monitoring, processes 3 TB of machine-generated data daily, and is architected for petabyte-scale growth. Additionally, CSN helps Zeta meet regulatory requirements for retaining historical logs over multiple years while keeping storage costs under control. This has significantly improved operational resilience, increased service availability, and empowered teams to proactively resolve issues before they affect end users.
Ready to take your organization’s observability capabilities to the next level? Dive into the technical details of OpenSearch Service in the Amazon OpenSearch Developer Guide. Visit our new migration hub page for more prescriptive guidance on moving your workloads to OpenSearch Service.
About the authors
Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies, including generative AI, intelligent agents, and the Model Context Protocol (MCP), to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to enhance operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.
Shashidhar (Shashi) Soppin is an accomplished Enterprise Architect and cloud transformation leader with over 24 years of experience spanning regulated industries and high-growth technology environments. Currently steering strategic initiatives as Lead Architect at Zeta’s CTO office, Shashidhar has built and led world-class engineering teams, driving innovation in cloud, security, and fintech domains. He has architected secure, scalable platforms, scaling user bases by 10x, enabling complex integrations for major bank migrations to Zeta’s platforms, and pioneering Zero Trust frameworks that achieved excellent regulatory compliance. A results-driven executive and former DMTS at Wipro, Shashidhar holds 25+ granted patents and has delivered multi-million dollar business deals across domains including AI/ML. Renowned as a published author (“Essentials of Deep Learning”), frequent industry speaker, and hands-on innovator, he combines technical expertise with business acumen, propelling organizations toward robust, future-ready cloud ecosystems and operational excellence. Prior to Wipro, he also worked at IBM-ISL.
Anchal Kansal is a Lead Site Reliability Engineer at Zeta, where she has spent the past 4 years building and scaling reliable, high-performance systems. With deep expertise in OpenSearch, observability platforms, and large-scale infrastructure, she focuses on ensuring uptime, performance, and operational efficiency. Anchal is passionate about solving complex reliability challenges and sharing practical insights with the engineering community.
Manochandra (Mano) is the Site Reliability Engineering (SRE) expert at Zeta, specializing in data management-oriented systems. With a deep understanding of large-scale distributed architectures, he has extensive experience designing, deploying, and maintaining resilient, production-grade OpenSearch systems. Mano is known for his proactive approach to optimizing infrastructure reliability and performance, as well as his ability to troubleshoot complex operational challenges. His expertise spans implementing automation, monitoring, and incident management best practices, making him a go-to resource for ensuring service availability and scalability at Zeta.
Hitesh Subnani is an FSI Solutions Architect at AWS India, where he works with customers to design and build architectures that deliver business value. He specializes in comprehensive observability and analytics systems, enabling organizations to gain deep insights from operational data. With expertise in search and analytics technologies, Hitesh focuses on scalable monitoring systems, real-time dashboards, and compliance-driven architectures for AWS customers in the financial sector.
Tarun Chakraborty is a Sr. Technical Account Manager (TAM) at AWS India, where he partners with leading banks and fintech organizations to accelerate their cloud transformation journeys. With over 15 years of experience in technology and financial services, he serves as a trusted advisor helping customers leverage AWS’s comprehensive suite of services to drive innovation and achieve their business objectives.




