Amazon EMR Serverless eliminates local storage provisioning, reducing data processing costs by up to 20%


At AWS re:Invent 2025, Amazon Web Services (AWS) announced serverless storage for Amazon EMR Serverless, a new capability that eliminates the need to configure local disks for Apache Spark workloads. This reduces data processing costs by up to 20% while eliminating job failures from disk capacity constraints.

With serverless storage, Amazon EMR Serverless automatically handles intermediate data operations, such as shuffle, on your behalf. You pay only for compute and memory, with no storage charges. By decoupling storage from compute, Spark can release idle workers immediately, reducing costs throughout the job lifecycle. The following image shows the serverless storage for EMR Serverless announcement from the AWS re:Invent 2025 keynote:

The challenge: Sizing local disk storage

Running Apache Spark workloads requires sizing local disk storage for shuffle operations, where Spark redistributes data across executors during joins, aggregations, and sorts. This requires analyzing job histories to estimate disk requirements, leading to two common problems: overprovisioning wastes money on unused capacity, and underprovisioning causes job failures when disk space runs out. Most customers overprovision local storage to make sure jobs complete successfully in production.

Data skew compounds this further. When one executor handles a disproportionately large partition, that executor takes significantly longer to complete while other workers sit idle. If you didn't provision enough disk for that skewed executor, the job fails entirely, making data skew one of the top causes of Spark job failures. However, the problem extends beyond capacity planning. Because shuffle data couples tightly to local disks, Spark executors pin to worker nodes even when compute requirements drop between job stages. This prevents Spark from releasing workers and scaling down, inflating compute costs throughout the job lifecycle. When a worker node fails, Spark must recompute the shuffle data stored on that node, causing delays and inefficient resource utilization.

How it works

Serverless storage for Amazon EMR Serverless addresses these challenges by offloading shuffle operations from individual compute workers onto a separate, elastic storage layer. Instead of storing critical data on local disks attached to Spark executors, serverless storage automatically provisions and scales high-performance remote storage as your job runs.

The architecture provides several key benefits. First, compute and storage scale independently: Spark can acquire and release workers as needed across job stages without worrying about preserving locally stored data. Second, shuffle data is evenly distributed across the serverless storage layer, eliminating the data skew bottlenecks that occur when some executors handle disproportionately large shuffle partitions. Third, if a worker node fails, your job continues processing without delays or reruns because the data is reliably stored outside individual compute workers.

Serverless storage is available at no additional charge, and it eliminates the cost associated with local storage. Instead of paying for fixed disk capacity sized for the maximum potential I/O load, capacity that often sits idle during lighter workloads, you can use serverless storage without incurring storage costs. You can focus your budget on compute resources that directly process your data, not on managing and overprovisioning disk storage.

Technical innovation brings three breakthroughs

Serverless storage introduces three fundamental innovations that solve Spark's shuffle bottlenecks: a multi-tier aggregation architecture, purpose-built networking, and true storage-compute decoupling. Apache Spark's shuffle mechanism has a core constraint: each mapper independently writes its output as small files, and each reducer must fetch data from potentially thousands of workers. In a large-scale job with 10,000 mappers and 1,000 reducers, this creates 10 million individual data exchanges. Serverless storage aggregates early and intelligently: mappers stream data to an aggregation layer that consolidates shuffle data in memory before committing it to storage. While individual shuffle write and fetch operations might show slightly higher latency due to network round trips compared to local disk I/O, overall job performance improves by transforming millions of tiny I/O operations into a smaller number of large, sequential operations.
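As a rough illustration of the fan-out described above, the worst-case number of mapper-to-reducer exchanges in a mesh shuffle is simply the product of the two counts:

```shell
# Worst-case shuffle fan-out in a traditional mesh: every reducer
# fetches a block from every mapper, so exchanges = mappers * reducers.
mappers=10000
reducers=1000
echo $((mappers * reducers))   # 10000000 individual data exchanges
```

Aggregating shuffle data before it reaches storage collapses this quadratic fan-out into far fewer large, sequential operations.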

Traditional Spark shuffle creates a mesh network in which each worker maintains connections to potentially hundreds of other workers, spending significant CPU on connection management rather than data processing. We built a custom networking stack in which each mapper opens a single persistent remote procedure call (RPC) connection to our aggregator layer, eliminating the mesh complexity. Although individual shuffle operations might show slightly higher latency due to network round trips compared to local disk I/O, overall job performance improves through better resource utilization and elastic scaling. Workers no longer run a shuffle service; they focus exclusively on processing your data.

Traditional Amazon EMR Serverless jobs store shuffle data on local disks, coupling the data lifecycle to the worker lifecycle: idle workers can't terminate without losing shuffle data. Serverless storage decouples these completely by storing shuffle data in AWS managed storage, with opaque handles tracked by the driver. Workers can terminate immediately after completing tasks without data loss, enabling elastic scaling. In funnel-shaped queries, where early stages require massive parallelism that narrows as data aggregates, we're seeing up to 80% compute cost reduction in benchmarks by releasing idle workers instantly. The following diagram illustrates immediate worker release in funnel-shaped queries.
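To make the funnel-shaped savings concrete, here is a back-of-the-envelope sketch; the worker counts and stage durations are invented for illustration, and actual savings depend entirely on a job's shape:

```shell
# Hypothetical funnel-shaped job (all numbers invented for illustration).
wide_workers=100;  wide_minutes=5      # wide first stage
narrow_workers=10; narrow_minutes=20   # narrow aggregation stage

# Shuffle on local disks: all workers stay pinned for the whole job.
pinned=$((wide_workers * (wide_minutes + narrow_minutes)))
# Decoupled storage: idle workers release as soon as the wide stage ends.
elastic=$((wide_workers * wide_minutes + narrow_workers * narrow_minutes))

echo "pinned=$pinned elastic=$elastic worker-minutes"
```

In this invented example the pinned job consumes 2,500 worker-minutes versus 700 with elastic release, a roughly 72% reduction; the more the parallelism narrows between stages, the larger the gap.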

Our aggregator layer integrates directly with AWS Identity and Access Management (IAM), AWS Lake Formation, and fine-grained access control systems, providing job-level data isolation with access controls that match source data permissions.

Getting started

Serverless storage is available in several AWS Regions. For the current list of supported Regions, refer to the Amazon EMR User Guide.

New applications

Serverless storage can be enabled for new applications starting with Amazon EMR release 7.12. Follow these steps:

  1. Create an Amazon EMR Serverless application with Amazon EMR 7.12 or later:
aws emr-serverless create-application \
  --type "SPARK" \
  --name my-application \
  --release-label emr-7.12.0 \
  --runtime-configuration '[{
      "classification": "spark-defaults",
      "properties": {
        "spark.aws.serverlessStorage.enabled": "true"
      }
  }]' \
  --region us-east-1

  2. Submit your Spark job:
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<script>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=10"
    }
  }'

Existing applications

You can enable serverless storage for existing applications on Amazon EMR 7.12 or later by updating your application settings.

To enable serverless storage using the AWS Command Line Interface (AWS CLI), enter the following command:

aws emr-serverless update-application \
  --application-id <application-id> \
  --runtime-configuration '[{
      "classification": "spark-defaults",
      "properties": {
        "spark.aws.serverlessStorage.enabled": "true"
      }
  }]'

To enable serverless storage using the Amazon EMR Studio UI, navigate to your application in Amazon EMR Studio, go to Configuration, and add the Spark property spark.aws.serverlessStorage.enabled=true in the spark-defaults classification.

Job-level configuration

You can also enable serverless storage for specific jobs, even when it's not enabled at the application level:

aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<script>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.aws.serverlessStorage.enabled=true"
    }
  }'

(Optional) Disabling serverless storage

If you want to continue using local disks, you can disable serverless storage by omitting the spark.aws.serverlessStorage.enabled configuration or setting it to false at either the application or job level:

spark.aws.serverlessStorage.enabled=false

To use traditional local disk provisioning, configure the appropriate disk type and size for your application workers.

Monitoring and cost tracking

You can monitor elastic shuffle usage through standard Spark UI metrics and track costs at the application level in AWS Cost Explorer and AWS Cost and Usage Reports. The service automatically handles performance optimization and scaling, so you don't need to tune configuration parameters.

When to use serverless storage

Serverless storage delivers the most value for workloads with substantial shuffle operations, typically jobs that shuffle more than 10 GB of data (and less than 200 GB per job, the limit as of this writing). These include:

  • Large-scale data processing with heavy aggregations and joins
  • Sort-heavy analytics workloads
  • Iterative algorithms that repeatedly access the same datasets

Jobs with unpredictable shuffle sizes benefit particularly well because serverless storage automatically scales capacity up and down based on real-time demand. For workloads with minimal shuffle activity or very short duration (under 2–3 minutes), the benefits might be limited. In those cases, the overhead of remote storage access might outweigh the advantages of elastic scaling.
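A quick rule-of-thumb check against the thresholds stated above (more than 10 GB of shuffle, under the 200 GB per-job limit as of this writing) might look like the following; shuffle_fit is a hypothetical helper for illustration, not part of any AWS tooling:

```shell
# Hypothetical rule of thumb: is an estimated shuffle size (in GB) in
# the range where serverless storage helps most, per the guidance above?
shuffle_fit() {
  if [ "$1" -gt 10 ] && [ "$1" -lt 200 ]; then
    echo "good fit"
  else
    echo "limited benefit"
  fi
}

shuffle_fit 50   # prints "good fit"
shuffle_fit 5    # prints "limited benefit"
```

A job outside this window can still enable the feature; it simply may not see the same cost or reliability gains.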

Security and data lifecycle

Your data is stored in serverless storage only while your job is running and is automatically deleted when your job completes. Because Amazon EMR Serverless batch jobs can run for up to 24 hours, your data is stored for no longer than this maximum duration. Serverless storage encrypts your data, both in transit between your Amazon EMR Serverless application and the serverless storage layer and at rest while temporarily stored, using AWS managed encryption keys. The service uses an IAM-based security model with job-level data isolation, which means that one job can't access the shuffle data of another job. Serverless storage maintains the same security standards as Amazon EMR Serverless, with enterprise-grade security controls throughout the processing lifecycle.

Conclusion

Serverless storage represents a fundamental shift in how we approach data processing infrastructure, eliminating manual configuration, aligning costs with actual usage, and improving reliability for I/O-intensive workloads. By offloading shuffle operations to a managed service, data engineers can focus on building analytics rather than managing storage infrastructure.

To learn more about serverless storage and get started, visit the Amazon EMR Serverless documentation.


About the authors

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.

Ravi Kumar

Ravi is a Senior Product Manager, Technical at Amazon Web Services, specializing in exabyte-scale data infrastructure and analytics platforms. He helps customers unlock insights from structured and unstructured data using open-source technologies and cloud computing. Outside of work, Ravi enjoys exploring emerging trends in data science and machine learning.

Matt Tolton

Matt is a Senior Principal Engineer at Amazon Web Services.

Neil Mukerje

Neil is a Principal Product Manager at Amazon Web Services.
