How Razorpay achieved 11% performance improvement and 21% cost reduction with Amazon EMR


This is a guest post by Narendra Kumar, Head of Platform – Data at Razorpay, in partnership with AWS.

In this post, we explore how Razorpay, India's leading FinTech company, transformed its data platform by migrating from a third-party solution to Amazon EMR, unlocking improved performance and significant cost savings. We walk through the architectural decisions that guided this migration, the implementation strategy, and the measurable benefits Razorpay achieved.

Founded in 2014, Razorpay has become a powerhouse in comprehensive payment solutions, enabling businesses to accept, process, and disburse payments online. With offerings like RazorpayX for business banking and Razorpay Capital for lending, the company has experienced explosive growth and now serves millions of businesses. This rapid expansion brought significant data challenges. When Razorpay's data platform began straining under more than 1 PB of daily processing, the engineering team faced a critical decision: continue scaling their existing third-party solution or modernize with a platform offering greater flexibility and control. They chose Amazon EMR to build a comprehensive data architecture spanning batch warehousing, real-time stream processing, and interactive analytics, all running on Apache Spark with open source Delta Lake for ACID transactions. This wasn't merely an ETL migration; it was a complete platform transformation that gave Razorpay's 800 daily users access to more than 60 concurrent streaming pipelines, more than 3,000 orchestrated workflows, and the ability to query 6 PB of data daily. The results validated their architectural decisions: 11% better overall performance, 21% cost reduction, and the operational flexibility to optimize Spark resource allocation, use Amazon EC2 Spot Instances, and adopt advanced features like liquid clustering, all without vendor lock-in.

Achieving data insights cost-effectively with AWS

The data architecture has a data ingestion layer, a data processing layer, and a data consumption layer. Razorpay ingests more than 20 TB of new data daily and processes more than 1 PB of data every day using more than 60 stream processing pipelines. This data is then consumed through more than 3,000 scheduled workflows that query more than 6 PB of data daily.

Data flows from a variety of sources:

  • Online transaction processing (OLTP) databases, meaning traditional transactional or entity stores.
  • Events such as clickstream and application events.
  • Third-party sources such as reverse extract, transform, and load (ETL).

Most of the data consumption use cases power merchant reporting and internal analytics across the organization. The architecture also powers a variety of data science use cases and the financial infrastructure around a reconciliation service.

Solution overview

As shown in the following diagram, Razorpay operated at a small scale in its early stages. It used Sqoop to dump transactional data into a data lake each day and managed a Presto layer for querying that data. As the company grew, demand for near real-time data increased. To meet it, Razorpay set up a change data capture (CDC) collector using Maxwell to stream data manipulation language (DML) events to Kafka. To further improve data processing, Razorpay built a processing layer that consumed data from Kafka and used Apache Hudi to UPSERT records into the lake.

The company also onboarded data from third-party sources such as Freshdesk and Google Sheets and automated event ingestion from frontend applications using Lumberjack, streamlining its data management processes.

As Razorpay scaled its operations, several real-time use cases became mission-critical. This prompted the development of a robust data warehouse ingestion framework to ingest data into TiDB efficiently. To improve service reliability and support dashboard queries, the team built Harvester, a low-latency, high-throughput service that stores pre-aggregated data for effective monitoring. Over time, reporting use cases emerged, and the team used a warehouse service to establish a denormalized report data layer while also exploring a real-time layer for dynamic insights. Finally, to ease the transition to microservices, Razorpay built a unified storage layer that supports data from both its existing monolithic architecture and the new microservices. This ensured seamless integration and improved data accessibility across the organization.

Razorpay migrated its data services to Amazon EMR using a phased approach. The solution architecture, shown in the following diagram, includes multiple layers that handle data ingestion, processing, and consumption.

Technical implementation

The resulting modern, scalable analytics platform focuses on real-time data ingestion, petabyte-scale processing, and cost-optimized storage, all orchestrated with robust workflow management:

Data ingestion layer

To handle large-scale and diverse data sources, the team implemented a combination of CDC and file ingestion patterns:

  • CDC using Amazon Aurora MySQL-Compatible Edition – Used Debezium and Maxwell for low-latency replication and streaming of database changes
  • High-volume streaming pipelines – Configured streaming pipelines capable of processing more than 20 TB of daily inbound data
  • Third-party data integration – Implemented secure file push mechanisms to ingest partner and software as a service (SaaS) data into the platform
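Because the ingestion layer receives CDC events from both Debezium and Maxwell, downstream jobs are simpler when both arrive in a single change format. The following minimal sketch shows one way to normalize the two. It assumes Debezium's JSON envelope: `op`, `before`, `after`, and `source.ts_ms`, wrapped in `payload` when JSON schemas are enabled. It also assumes Maxwell's JSON output: `type`, `table`, `data`, and `ts` in seconds. The `ChangeRecord` type and its `upsert`/`delete` vocabulary are hypothetical and are not taken from Razorpay's pipeline.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical common shape for a single row change."""
    table: str
    op: str              # "upsert" or "delete"
    row: Dict[str, Any]  # new values for upserts, last known values for deletes
    ts_ms: int           # source commit time in epoch milliseconds


# Debezium op codes: c=create, u=update, d=delete, r=snapshot read
_DEBEZIUM_OPS = {"c": "upsert", "u": "upsert", "r": "upsert", "d": "delete"}
# Maxwell row-event types; DDL and other event types are ignored in this sketch
_MAXWELL_OPS = {"insert": "upsert", "update": "upsert",
                "bootstrap-insert": "upsert", "delete": "delete"}


def from_debezium(event: Dict[str, Any]) -> Optional[ChangeRecord]:
    # With JSON schemas enabled, the envelope sits under "payload".
    envelope = event.get("payload", event)
    op = _DEBEZIUM_OPS.get(envelope.get("op"))
    if op is None:
        return None  # e.g. truncate ("t") or unknown events
    row = envelope["before"] if op == "delete" else envelope["after"]
    source = envelope["source"]
    return ChangeRecord(source["table"], op, row, int(source["ts_ms"]))


def from_maxwell(event: Dict[str, Any]) -> Optional[ChangeRecord]:
    op = _MAXWELL_OPS.get(event.get("type"))
    if op is None:
        return None
    # Maxwell reports "ts" in seconds; for deletes, "data" holds the deleted row.
    return ChangeRecord(event["table"], op, event["data"], int(event["ts"]) * 1000)
```

Deletes keep the last known row image so that a later merge step can still locate the record by its primary key.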

Data processing layer

Razorpay built the processing stack on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), with Spark as the primary compute engine:

  • Batch warehousing – Daily ETL and aggregation jobs processing more than 1 PB of data
  • Stream processing – Real-time analytics pipelines across more than 60 concurrent processing streams
  • Delta merge operations – High-performance incremental updates across more than 25 Delta Lake tables
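Incremental Delta merges typically follow an upsert pattern:

  • Collapse each micro-batch to the newest change per key.
  • Delete matched rows flagged as deletes and update other matched rows.
  • Insert unmatched rows.

The collapse step matters because Delta Lake's MERGE fails when several source rows match the same target row. The sketch below models these semantics on in-memory dictionaries so it runs without Spark. The `(op, row, ts_ms)` tuple and the `id` key are assumptions for illustration. With the `delta-spark` Python API, the same logic would be expressed as `DeltaTable.merge(...)` followed by `whenMatchedDelete`, `whenMatchedUpdateAll`, and `whenNotMatchedInsertAll`.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row, int]  # (op, row, ts_ms); op is "upsert" or "delete"


def latest_per_key(changes: Iterable[Change], key: str) -> Dict[Any, Change]:
    """Collapse a micro-batch to the newest change per key.

    Delta Lake's MERGE errors out if several source rows match one target
    row, so this dedup step has to run before the merge.
    """
    latest: Dict[Any, Change] = {}
    for change in changes:
        k = change[1][key]
        if k not in latest or change[2] >= latest[k][2]:
            latest[k] = change  # on equal timestamps, the later-arriving change wins
    return latest


def merge(target: Dict[Any, Row], changes: Iterable[Change], key: str) -> Dict[Any, Row]:
    """Model MERGE semantics on a dict keyed by primary key; target is not mutated."""
    result = dict(target)
    for k, (op, row, _ts) in latest_per_key(changes, key).items():
        if op == "delete":
            result.pop(k, None)    # matched -> delete; unmatched delete -> no-op
        else:
            result[k] = dict(row)  # matched -> update all; unmatched -> insert all
    return result
```

Late-arriving changes with older timestamps are dropped during the collapse step. This is usually what CDC-driven tables need when Kafka partitions deliver events slightly out of order.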

Data storage and organization

Razorpay's data storage follows the medallion architecture pattern, layered on Amazon Simple Storage Service (Amazon S3):

  • Raw zone – Immutable ingestion zone for original source data
  • Processed and aggregated zone – Optimized datasets ready for analytics and reporting
  • Open source software (OSS) Delta Lake format – Implemented open source Delta Lake for ACID transactions, schema enforcement, and faster query performance
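One way to make the medallion zones concrete is a fixed S3 prefix convention plus a guard that keeps the raw zone append-only. The bucket name `example-datalake`, the prefix layout, and the `ingest_date` partition column below are hypothetical; the article doesn't describe Razorpay's actual layout. The modes checked are Spark's standard `DataFrameWriter` save modes.

```python
from datetime import date
from enum import Enum

BUCKET = "example-datalake"  # hypothetical bucket name


class Zone(Enum):
    RAW = "raw"              # immutable copy of source data
    PROCESSED = "processed"  # processed and aggregated Delta tables for analytics


def table_uri(zone: Zone, source: str, table: str) -> str:
    """S3 prefix for one table in one zone (the layout is an assumption)."""
    return f"s3://{BUCKET}/{zone.value}/{source}/{table}/"


def raw_partition_uri(source: str, table: str, ingest_day: date) -> str:
    """Each ingestion run lands raw data in its own date partition."""
    return f"{table_uri(Zone.RAW, source, table)}ingest_date={ingest_day.isoformat()}/"


def check_write_mode(zone: Zone, mode: str) -> None:
    """Reject Spark save modes (e.g. "overwrite") that could rewrite raw data."""
    if zone is Zone.RAW and mode != "append":
        raise ValueError(f"raw zone is append-only; got save mode {mode!r}")
```

Keeping the raw zone append-only means processed tables can always be rebuilt from the original data if a transformation bug is found later.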

Workflow orchestration

Complex data workflows are automated and monitored using a hybrid orchestration approach:

  • Apache Airflow integration – Scheduling and coordinating more than 3,000 workflows per day
  • dbt on Amazon EMR – SQL-based transformations for business logic and metric definitions
  • Specialized compliance jobs – Dedicated workflows that meet a 15-minute SLA for sensitive regulatory reporting
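Airflow models each workflow as a directed acyclic graph of tasks. The sketch below reproduces that idea with Python's standard-library `graphlib` to derive a valid run order. It also adds a simple check against the 15-minute compliance SLA. The task names are hypothetical and only illustrate how dbt transformations can sit downstream of extraction tasks; this is not Airflow's own API.

```python
from datetime import datetime, timedelta
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

COMPLIANCE_SLA = timedelta(minutes=15)

# Hypothetical regulatory-reporting workflow: task -> tasks it depends on.
WORKFLOW: Dict[str, Set[str]] = {
    "extract_settlements": set(),
    "extract_refunds": set(),
    "dbt_reconcile": {"extract_settlements", "extract_refunds"},
    "publish_regulatory_report": {"dbt_reconcile"},
}


def run_order(workflow: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every task runs after all of its dependencies."""
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle
        raise ValueError(f"workflow has a dependency cycle: {err.args[1]}") from err


def breaches_sla(started: datetime, finished: datetime,
                 sla: timedelta = COMPLIANCE_SLA) -> bool:
    """True if a run took longer than the SLA window."""
    return finished - started > sla
```

A run that finishes exactly at the 15-minute mark counts as within the SLA. A cyclic dependency is reported as an error instead of silently producing a partial order.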

Performance optimizations

To ensure cost efficiency and high throughput, the team applied the following optimizations:

  • Spark tuning – Custom configurations for executor memory, shuffle partitions, and serialization to maximize hardware utilization
  • Liquid clustering – Enabled on Delta Lake tables to improve query performance over large datasets
  • Optimized Delta merges – Reduced merge latency for incremental updates
  • Auto scaling – Dynamic scaling policies based on workload patterns to balance performance and cost
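To illustrate the Spark tuning described above, the following sketch builds an Amazon EMR configuration object, using the `Classification`/`Properties` format accepted when creating a cluster. It also derives `spark.sql.shuffle.partitions` from an expected shuffle size. The 128 MiB target partition size, the executor sizes, and the round-up-to-cores heuristic are assumptions, not Razorpay's published settings.

```python
import json
import math
from typing import Any, Dict, List


def shuffle_partitions(shuffle_bytes: int, total_cores: int,
                       target_bytes: int = 128 * 1024 ** 2) -> int:
    """Partitions needed to keep each shuffle partition near target_bytes,
    rounded up to a multiple of total_cores so every task wave fills the cluster."""
    needed = max(1, math.ceil(shuffle_bytes / target_bytes))
    return math.ceil(needed / total_cores) * total_cores


def spark_tuning_config(executor_memory_gb: int, executor_cores: int,
                        partitions: int) -> List[Dict[str, Any]]:
    """EMR configuration list with spark-defaults properties for memory, shuffle, and serialization."""
    return [{
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": f"{executor_memory_gb}g",
            "spark.executor.cores": str(executor_cores),
            "spark.sql.shuffle.partitions": str(partitions),
            # Kryo is generally faster and more compact than Java serialization
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        },
    }]


if __name__ == "__main__":
    # Hypothetical cluster: 1 TiB of shuffle data on 400 executor cores.
    parts = shuffle_partitions(1024 ** 4, total_cores=400)
    print(json.dumps(spark_tuning_config(18, 5, parts), indent=2))
```

When adaptive query execution (`spark.sql.adaptive.enabled`) is on, Spark can coalesce shuffle partitions at runtime. A static value like this then serves as the initial partition count rather than a fixed one.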

To secure the migration, the team followed the AWS guidance on encryption, authentication, and authorization documented in the Amazon EMR security best practices.

This architecture delivers low-latency ingestion, petabyte-scale processing, and robust workflow orchestration. Analytics teams can derive insights faster while maintaining compliance and optimizing for cost.

The combination of Debezium and Maxwell for CDC, Spark on Amazon EMR, OSS Delta Lake on Amazon S3, and Airflow with dbt has proven to be a scalable and resilient approach for modern data analytics workloads.

Business Impact: What Amazon EMR Enabled

  • 11% performance improvement, enabling faster insights for 800 daily active users
  • 13-15% faster execution for large warehouse jobs, accelerating time-to-insight for critical business decisions
  • 21% cost reduction, reinvested into product innovation for merchant customers
  • Seamless scaling from 20 TB to more than 1 PB of daily processing without performance degradation
  • Enterprise reliability supporting 350,000 operational reports and compliance requirements

Key learnings and best practices

Throughout its migration to Amazon EMR, Razorpay learned valuable lessons that helped optimize its data platform. We're sharing these insights to help other customers accelerate their own modernization journeys while avoiding common pitfalls.

Infrastructure Stability and Performance

  • Optimizing Spark Resource Allocation – Razorpay initially assumed that Spark's dynamic allocation would automatically optimize resource utilization. However, they discovered that it introduced overhead that degraded performance for certain workload patterns. They responded with one of two approaches, depending on the workload:
    ◦ For predictable workloads, they set explicit maxExecutors values.
    ◦ For other workloads, they enabled maximizeResourceAllocation to create "fat executors" that fully use the available cluster resources.
    These targeted configurations improved job execution times by 13-15% for large-scale data processing workloads.
  • Ensuring Stability with Yet Another Resource Negotiator (YARN) Node Labels – When using EC2 Spot Instances for cost optimization, Razorpay encountered a critical issue: Spot Instance interruptions occasionally terminated nodes running driver containers, causing entire jobs to fail. Their solution was simple and effective. They configured YARN node labels so that driver containers always run on On-Demand Instances, while task nodes use lower-cost Spot capacity. This setup made jobs resilient to Spot interruptions while preserving the 21% cost savings.
  • Managing Spot Instances Effectively – Razorpay's initial approach was to switch entirely to On-Demand Instances whenever Spot capacity was constrained, which eliminated the cost benefits they were seeking. To address this, they adopted several best practices:
    ◦ Using instance fleets with the price-capacity optimized and capacity optimized allocation strategies to maximize Spot availability.
    ◦ Spreading primary instances across multiple Availability Zones for fault tolerance.
    ◦ Accepting that heterogeneous instance types produce executors of varying sizes, and planning capacity accordingly.
    As a result, they maintained high Spot utilization while ensuring workload continuity and strong price-performance.
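The allocation approaches and the driver-placement fix above map onto Amazon EMR configuration classifications. The sketch below shows a hedged version of each:

  • A `spark-defaults` cap on `spark.dynamicAllocation.maxExecutors`.
  • The EMR-specific `spark` classification property `maximizeResourceAllocation`.
  • YARN node-label settings that keep the application master, which hosts the Spark driver in cluster deploy mode, on labeled nodes.

The label value `ON_DEMAND` and the executor cap of 200 are assumptions. Which labels exist, and which nodes carry them, depends on your EMR release and cluster setup, so verify against the EMR documentation before use.

```python
from typing import Any, Dict, List

Config = List[Dict[str, Any]]


def capped_allocation(max_executors: int) -> Config:
    """Predictable workloads: keep dynamic allocation but set an explicit ceiling."""
    return [{"Classification": "spark-defaults", "Properties": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }}]


def fat_executors() -> Config:
    """EMR-specific: size executors to use the full resources of each node."""
    return [{"Classification": "spark",
             "Properties": {"maximizeResourceAllocation": "true"}}]


def pin_driver_to_label(label: str) -> Config:
    """Schedule the YARN application master (which hosts the Spark driver in
    cluster deploy mode) only on nodes that carry `label` (assumed label name)."""
    return [
        # EMR yarn-site property for the cluster-wide default AM node label
        {"Classification": "yarn-site", "Properties": {
            "yarn.node-labels.enabled": "true",
            "yarn.node-labels.am.default-node-label-expression": label,
        }},
        # Standard Spark-on-YARN property, applied to every Spark job
        {"Classification": "spark-defaults", "Properties": {
            "spark.yarn.am.nodeLabelExpression": label,
        }},
    ]


def combine(*profiles: Config) -> Config:
    """Merge profiles so each classification appears once (later values win)."""
    merged: Dict[str, Dict[str, str]] = {}
    for profile in profiles:
        for entry in profile:
            merged.setdefault(entry["Classification"], {}).update(entry["Properties"])
    return [{"Classification": name, "Properties": props}
            for name, props in merged.items()]
```

For example, `combine(capped_allocation(200), pin_driver_to_label("ON_DEMAND"))` yields one `spark-defaults` and one `yarn-site` entry. Node labels only help if the On-Demand nodes actually carry the label, which this sketch doesn't configure.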

Cost Optimization

  • Achieving Sustainable Cost Efficiency – As data volumes grew to more than 20 TB daily, Razorpay needed to scale infrastructure while controlling costs. They implemented a cost optimization strategy with several components:
    ◦ Right-sizing primary nodes. They avoided over-provisioning and selected instance types that matched actual workload requirements.
    ◦ Consolidating workloads. They combined multiple jobs on fewer, larger clusters to maximize resource utilization.
    ◦ Moving SLA-sensitive jobs. They migrated these to Amazon EKS and Amazon EMR Serverless for automatic scaling and pay-per-use pricing.
    ◦ Adopting Graviton. They migrated compatible workloads to AWS Graviton processors for better price-performance.
    ◦ Diversifying instance fleets. They used multiple instance types to reduce the impact of Spot interruptions.
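Diversified, Spot-backed task capacity can be expressed as an instance fleet definition in the shape accepted by the EMR `RunJobFlow` API, for example boto3's `run_job_flow(Instances=...)`. The sketch below only builds the task-fleet dictionary and the subnet list; it doesn't call AWS. A real cluster also needs primary and core fleets. The Graviton instance types, the weights (vCPUs per instance), the target capacity, the timeout values, and the subnet IDs are illustrative assumptions. Listing several subnets lets EMR choose an Availability Zone with available capacity at launch.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Restricted to the two strategies named in the article; EMR supports others too.
ALLOWED_STRATEGIES = {"price-capacity-optimized", "capacity-optimized"}


def spot_task_fleet(target_units: int, instance_types: Sequence[Tuple[str, int]],
                    strategy: str = "price-capacity-optimized") -> Dict[str, Any]:
    """Task fleet that requests only Spot capacity, spread over several instance types.

    instance_types holds (instance type, weighted capacity) pairs.
    """
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unsupported allocation strategy: {strategy}")
    if len({name for name, _ in instance_types}) < 2:
        raise ValueError("diversify the fleet with at least two instance types")
    return {
        "Name": "spot-task-fleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": target_units,
        "InstanceTypeConfigs": [
            {"InstanceType": name, "WeightedCapacity": weight}
            for name, weight in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "AllocationStrategy": strategy,
                # Provisioning timeout: applies only while the fleet is first launched.
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


def instances_section(fleets: List[Dict[str, Any]],
                      subnet_ids: Sequence[str]) -> Dict[str, Any]:
    """Instances argument for RunJobFlow; EMR picks one subnet (and AZ) at launch."""
    return {"InstanceFleets": fleets, "Ec2SubnetIds": list(subnet_ids)}
```

The Spot timeout settings only govern initial provisioning. Interrupted task instances are replaced later from the same diversified list of instance types.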

These optimizations delivered 21% cost savings while supporting 800 daily active users and processing 1 PB of data daily. Razorpay reinvested the savings into product innovation for its merchant customers, showing how technical optimization translates directly into business value.

Conclusion

Razorpay's migration to Amazon EMR demonstrates how the right data processing platform can transform business outcomes at scale. By achieving 11% better performance, 13-15% faster execution times, and 21% cost savings, Amazon EMR enabled Razorpay to build an enterprise-grade data platform. The platform supports 800 daily users, more than 3,000 dashboards, and 10 million monthly queries.

About the authors

Narendra Kumar

Narendra is a senior data platform and engineering leader with deep experience building and operating large-scale data platforms for high-growth FinTech and SaaS organizations. He has worked across the full data lifecycle, including real-time data ingestion, modern lakehouse architectures, analytics platforms, and ML-ready data systems, with a strong focus on reliability, scalability, and cost efficiency.

Ravi Kompella

Ravi is a principal analytics specialist with experience driving adoption of modern data architectures, enterprise data lakehouses, and real-time data systems across multiple industry verticals in India, including startups and SaaS providers.

Shreshtha Dutta

Shreshtha is a business and IT transformation leader with deep experience in large-scale cloud migrations, data platforms, and AI-driven innovation. She has led complex Amazon EMR programs, helping enterprises modernize analytics, optimize costs, and realize measurable business value through pragmatic, execution-focused strategies.
