The Amazon EMR runtime for Apache Spark provides a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and the Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts, and AWS Glue all use the optimized runtimes.
In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3 TB benchmark v2.13.
Iceberg is a popular open source high-performance format for large analytic tables. Our benchmarks demonstrate that Amazon EMR can run TPC-DS 3 TB workloads 3.6 times faster, reducing the runtime from 1.54 hours to 0.42 hours. Additionally, the cost efficiency improves by 2.9 times, with the total cost decreasing from $16.00 to $5.39 when using Amazon Elastic Compute Cloud (Amazon EC2) On-Demand r5d.4xlarge instances, providing observable gains for data processing tasks.
This is a further 32% increase from the optimizations shipped in Amazon EMR 7.1, covered in a previous post, "Amazon EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2." Since then, we have continued adding support for DataSource V2 to eight more existing query optimizations in the EMR runtime for Spark.
In addition to these DataSource V2 specific improvements, we have made more optimizations to Spark operators since Amazon EMR 7.1 that also contribute to the additional speedup.
Benchmark results for Amazon EMR 7.5 compared to open source Spark 3.5.3 and Iceberg 1.6.1
To evaluate the Spark engine's performance with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13 (our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences). Benchmark tests for the EMR runtime for Spark and Iceberg were conducted on Amazon EMR 7.5 EC2 clusters vs open source Spark 3.5.3 and Iceberg 1.6.1 on EC2 clusters.
The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by configuring the property spark.sql.catalog.. The fact tables used the default partitioning by the date column, with the number of partitions varying from 200 to 2,100. No precalculated statistics were used for these tables.
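As a minimal sketch of that catalog setup, the following Python snippet assembles the Spark properties for an Iceberg catalog backed by the Hadoop catalog and renders them as spark-submit --conf flags. The catalog name hadoop_catalog and the warehouse path are hypothetical placeholders, not the values used in the benchmark; the property keys follow the standard Apache Iceberg Spark configuration.

```python
def hadoop_catalog_conf(catalog_name: str, warehouse: str) -> dict:
    """Return Spark properties that register an Iceberg catalog of type hadoop."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog implementation under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" means table metadata is tracked on the file system itself.
        f"{prefix}.type": "hadoop",
        # Root location of the tables, for example an S3 prefix.
        f"{prefix}.warehouse": warehouse,
    }


def to_submit_flags(conf: dict) -> list:
    """Render a property dict as a flat list of --conf key=value arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Hypothetical catalog name and warehouse location, for illustration only.
conf = hadoop_catalog_conf("hadoop_catalog", "s3://amzn-s3-demo-bucket/warehouse/")
flags = to_submit_flags(conf)
print(" ".join(flags))
```

Keeping the properties in one dictionary makes it easier to pass the same catalog definition to both the open source cluster and the EMR cluster, which the comparison depends on.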
We ran a total of 104 SparkSQL queries in three sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the three rounds on Amazon EMR 7.5 with Iceberg enabled was 0.42 hours, demonstrating a 3.6-fold speed increase compared to open source Spark 3.5.3 and Iceberg 1.6.1. The following figure presents the total runtimes in seconds.
The following table summarizes the metrics.
| Metric | Amazon EMR 7.5 on EC2 | Amazon EMR 7.1 on EC2 | Open Source Spark 3.5.3 and Iceberg 1.6.1 |
| Average runtime in seconds | 1535.62 | 2033.17 | 5546.16 |
| Geometric mean over queries in seconds | 8.30046 | 10.13153 | 20.40555 |
| Cost* | $5.39 | $7.18 | $16.00 |
*Detailed cost estimates are discussed later in this post.
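The headline ratios can be checked directly from the average runtimes in the table above. The following Python sketch does only that arithmetic; the dictionary keys and the speedup helper are illustrative names, and the runtimes are copied from the table.

```python
# Average total runtimes in seconds, copied from the metrics table.
runtime_seconds = {
    "emr_7_5": 1535.62,
    "emr_7_1": 2033.17,
    "open_source": 5546.16,
}


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


vs_open_source = speedup(runtime_seconds["open_source"], runtime_seconds["emr_7_5"])
vs_emr_7_1 = speedup(runtime_seconds["emr_7_1"], runtime_seconds["emr_7_5"])

print(f"EMR 7.5 vs open source Spark: {vs_open_source:.1f}x")             # 3.6x
print(f"EMR 7.5 vs EMR 7.1: {(vs_emr_7_1 - 1) * 100:.0f}% improvement")  # 32%
```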
The following chart demonstrates the per-query performance improvement of Amazon EMR 7.5 relative to open source Spark 3.5.3 and Iceberg 1.6.1. The extent of the speedup varies from one query to another, with the largest gain, up to 9.4 times faster, seen for q93. The horizontal axis arranges the TPC-DS 3 TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Cost comparison
Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.
- Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
  - r5d.4xlarge hourly rate = $1.152 per hour in us-east-1
- Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
- Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR price * job runtime in hours
  - r5d.4xlarge Amazon EMR price = $0.27 per hour
- Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
The calculations reveal that the Amazon EMR 7.5 benchmark yields a 2.9-fold cost efficiency improvement over open source Spark 3.5.3 and Iceberg 1.6.1 in running the benchmark job.
| Metric | Amazon EMR 7.5 | Amazon EMR 7.1 | Open Source Spark 3.5.3 and Iceberg 1.6.1 |
| Runtime in hours | 0.426 | 0.564 | 1.540 |
| Number of EC2 instances (includes primary node) | 9 | 9 | 9 |
| Amazon EBS size | 20 GB | 20 GB | 20 GB |
| Amazon EC2 (total runtime cost) | $4.35 | $5.81 | $15.97 |
| Amazon EBS cost | $0.01 | $0.01 | $0.04 |
| Amazon EMR cost | $1.02 | $1.36 | $0 |
| Total cost | $5.38 | $7.18 | $16.01 |
| Cost savings | Amazon EMR 7.5 is 2.9 times better | Amazon EMR 7.1 is 2.2 times better | Baseline |
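The cost formulas above can be sketched in a few lines of Python. The EC2 and Amazon EMR hourly prices come from this post; the EBS rate of $0.10 per GB-month and the 730-hour month used to convert it to an hourly rate are assumptions, because the post does not state the EBS rate. The function and variable names are illustrative. The runtimes passed in are the rounded 0.42 and 1.54 hours quoted earlier.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used to convert the EBS rate


def benchmark_cost(
    runtime_hours: float,
    instances: int = 9,
    ec2_hourly_rate: float = 1.152,   # r5d.4xlarge On-Demand in us-east-1 (from the post)
    emr_hourly_rate: float = 0.27,    # r5d.4xlarge Amazon EMR price (from the post)
    ebs_gb_month_rate: float = 0.10,  # assumption: not stated in the post
    root_ebs_gb: int = 20,
) -> dict:
    """Apply the EC2, root EBS, and EMR cost formulas and return each component."""
    ec2 = instances * ec2_hourly_rate * runtime_hours
    ebs = instances * (ebs_gb_month_rate / HOURS_PER_MONTH) * root_ebs_gb * runtime_hours
    emr = instances * emr_hourly_rate * runtime_hours
    return {"ec2": ec2, "ebs": ebs, "emr": emr, "total": ec2 + ebs + emr}


emr_7_5 = benchmark_cost(runtime_hours=0.42)
# Open source Spark on EC2 carries no Amazon EMR charge.
open_source = benchmark_cost(runtime_hours=1.54, emr_hourly_rate=0.0)

print(f"Amazon EMR 7.5 total: ${emr_7_5['total']:.2f}")
print(f"Open source total:    ${open_source['total']:.2f}")
```

Under these assumptions the totals are sums of unrounded components, so they can differ by a cent from adding up the rounded rows in the table.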
In addition to the time-based metrics discussed so far, data from Spark event logs show that Amazon EMR scanned approximately 3.4 times less data from Amazon S3 and 4.1 times fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.
Run open source Spark benchmarks on Iceberg tables
We used separate EC2 clusters, each equipped with nine r5d.4xlarge instances, for testing both open source Spark 3.5.3 and Amazon EMR 7.5 with the Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the eight worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.
The following table summarizes the Amazon EC2 configurations for the primary node and eight worker nodes of type r5d.4xlarge.
| EC2 Instance | vCPU | Memory (GiB) | Instance Storage (GB) | EBS Root Volume (GB) |
| r5d.4xlarge | 16 | 128 | 2 x 300 NVMe SSD | 20 |
Prerequisites
The following prerequisites are required to run the benchmarking:
- Using the instructions in the emr-spark-benchmark GitHub repo, set up the TPC-DS source data in your S3 bucket and on your local computer.
- Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.3.jar to your S3 bucket.
- Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an EMR 7.5 cluster with Iceberg enabled to create the tables:
Note the Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run benchmarks with Amazon EMR 7.5 and open source Spark.
This benchmark application is built from the branch tpcds-v2.13_iceberg. If you're building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repo.
Create and configure a YARN cluster on Amazon EC2
To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repo to create an open source Spark cluster on Amazon EC2 using Flintrock with eight worker nodes.
Based on the cluster selection for this test, the following configurations are used:
Make sure to replace the placeholder in the yarn-site.xml file with the IP address of the primary node of your Flintrock cluster.
Run the TPC-DS benchmark with Spark 3.5.3 and Iceberg 1.6.1
Complete the following steps to run the TPC-DS benchmark:
- Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
- Submit your Spark job:
  - Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
  - The results are created in s3://./benchmark_run
  - You can track progress in /media/ephemeral0/spark_run.log.
Summarize the results
After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain four columns without headers:
- Query name
- Median time
- Minimum time
- Maximum time
With the data from three separate test runs with one iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.
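One way to do that aggregation is sketched below in Python. It assumes each summary file has the four headerless columns listed above, uses the median column as the runtime for a query in a given run, averages each query across runs, and then reports the total and the geometric mean of those per-query averages. The sample file contents and the function names are hypothetical and only illustrate the file shape.

```python
import csv
import io
import statistics


def read_summary(csv_text: str) -> dict:
    """Parse a headerless summary file: query name, median, minimum, maximum time."""
    rows = csv.reader(io.StringIO(csv_text))
    # Keep the median column as the per-run runtime of each query; skip blank lines.
    return {row[0]: float(row[1]) for row in rows if row}


def summarize(runs: list) -> tuple:
    """Average each query across runs, then return (total, geometric mean)."""
    queries = runs[0].keys()
    per_query_avg = [statistics.mean(run[q] for run in runs) for q in queries]
    return sum(per_query_avg), statistics.geometric_mean(per_query_avg)


# Hypothetical contents of three summary files, one per benchmark run.
run_texts = [
    "q1,4.0,4.0,4.0\nq2,16.0,16.0,16.0\n",
    "q1,3.0,3.0,3.0\nq2,15.0,15.0,15.0\n",
    "q1,5.0,5.0,5.0\nq2,17.0,17.0,17.0\n",
]

runs = [read_summary(text) for text in run_texts]
total, geo_mean = summarize(runs)
print(f"total runtime: {total:.2f}, geometric mean: {geo_mean:.2f}")
```

The geometric mean is less sensitive than the total to a handful of long-running queries, which is why both numbers are reported side by side.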
Run the TPC-DS benchmark with the EMR runtime for Spark
Most of the instructions are similar to Steps to run Spark Benchmarking, with a few Iceberg-specific details.
Prerequisites
Complete the following prerequisite steps:
- Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
- Upload the benchmark application JAR file to Amazon S3.
Deploy the EMR cluster and run the benchmark job
Complete the following steps to run the benchmark job:
- Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to spin up an EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and the same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
- Store the cluster ID from the response. We need this for the next step.
- Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
  - Replace the cluster ID placeholder with the cluster ID from Step 2.
  - The benchmark application is at s3://./spark-benchmark-assembly-3.5.3.jar
  - Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
  - The results will be in s3://./benchmark_run
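As a rough sketch of the submission step, the following Python snippet builds an aws emr add-steps command line with a single Spark step expressed as JSON. It only assembles and prints the command; it does not call AWS. The cluster ID, bucket name, main class, and application arguments are hypothetical placeholders, and the exact arguments the benchmark application expects should be taken from the GitHub repo rather than from this example.

```python
import json
import shlex


def add_steps_command(cluster_id: str, jar_uri: str, main_class: str, app_args: list) -> list:
    """Build the argument list for `aws emr add-steps` with one Spark step."""
    step = {
        "Type": "Spark",
        "Name": "TPC-DS Iceberg benchmark",
        "ActionOnFailure": "CONTINUE",
        # For a Spark step, Args are handed to spark-submit.
        "Args": ["--class", main_class, jar_uri, *app_args],
    }
    return [
        "aws", "emr", "add-steps",
        "--cluster-id", cluster_id,
        "--steps", json.dumps([step]),
    ]


# All values below are hypothetical placeholders for illustration.
command = add_steps_command(
    cluster_id="j-EXAMPLECLUSTER",
    jar_uri="s3://amzn-s3-demo-bucket/spark-benchmark-assembly-3.5.3.jar",
    main_class="com.example.Benchmark",  # hypothetical; use the class documented in the repo
    app_args=["s3://amzn-s3-demo-bucket/benchmark_run"],  # hypothetical output location
)
print(shlex.join(command))
```

Building the step as a dictionary and serializing it with json.dumps avoids hand-escaping commas and quotes in the step arguments.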
Summarize the results
After the step is complete, you can see the summarized benchmark result at s3:// in the same way as the previous run and compute the average and geometric mean of the query runtimes.
Clean up
To prevent any future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.
Summary
Amazon EMR is consistently enhancing the EMR runtime for Spark when used with Iceberg tables. With EMR 7.5, it achieves performance that is 3.6 times faster than open source Spark 3.5.3 and Iceberg 1.6.1 on TPC-DS 3 TB, v2.13. This is a further increase of 32% from EMR 7.1. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.
To stay informed, subscribe to the AWS Big Data Blog's RSS feed, where you can find updates on the EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.
About the Authors
Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.
Udit Mehrotra is an Engineering Manager for EMR at Amazon Web Services.
