Organizations run hundreds of thousands of Apache Spark applications each month on AWS, moving, processing, and preparing data for analytics and machine learning. As these applications age, keeping them secure and efficient becomes increasingly challenging. Data practitioners need to upgrade to the latest Spark releases to benefit from performance improvements, new features, bug fixes, and security enhancements. However, these upgrades are often complex, costly, and time-consuming.
Today, we are excited to announce the preview of generative AI upgrades for Spark, a new capability that enables data practitioners to quickly upgrade and modernize their Spark applications running on AWS. Starting with Spark jobs in AWS Glue, this feature allows you to upgrade from an older AWS Glue version to AWS Glue version 4.0. This new capability reduces the time data engineers spend modernizing their Spark applications, allowing them to focus on building new data pipelines and getting valuable analytics faster.
Understanding the Spark upgrade challenge
The traditional process of upgrading Spark applications requires significant manual effort and expertise. Data practitioners must carefully review incremental Spark release notes to understand the intricacies and nuances of breaking changes, some of which may be undocumented. They then need to modify their Spark scripts and configurations, updating features, connectors, and library dependencies as needed.
Testing these upgrades involves running the application and addressing issues as they arise. Each test run may reveal new problems, resulting in multiple iterations of changes. After the upgraded application runs successfully, practitioners must validate the new output against the expected results in production. This process often turns into year-long projects that cost millions of dollars and consume tens of thousands of engineering hours.
How generative AI upgrades for Spark works
The Spark Upgrades feature uses AI to automate both the identification and the validation of the changes required in your AWS Glue Spark applications. Let's explore how these capabilities work together to simplify your upgrade process.
AI-driven upgrade plan generation
When you initiate an upgrade, the service analyzes your application using AI to identify the necessary changes across both PySpark code and Spark configurations. During the preview, Spark Upgrades supports upgrading from AWS Glue 2.0 (Spark 2.4.3, Python 3.7) to AWS Glue 4.0 (Spark 3.3.0, Python 3.10), automatically handling changes that would typically require extensive manual review of the public Spark, Python, and AWS Glue version migration guides, followed by development, testing, and verification. Spark Upgrades addresses four key areas of change:
- Spark SQL API methods and functions
- Spark DataFrame API methods and operations
- Python language updates (including module deprecations and syntax changes)
- Spark SQL and Core configuration settings
The complexity of these upgrades becomes evident when you consider that migrating from Spark 2.4.3 to Spark 3.3.0 involves over 100 version-specific changes. Several factors contribute to the challenges of performing manual upgrades:
- A highly expressive language, with its mix of imperative and declarative programming styles, allows users to easily develop Spark applications. However, this increases the complexity of identifying the code impacted by an upgrade.
- Lazy execution of transformations in a distributed Spark application improves performance but makes runtime verification of application upgrades challenging for users.
- Changes in the default values of Spark configurations, or the introduction of new configurations across versions, can impact application behavior in different ways, making it difficult for users to identify issues during upgrades.
For example, in Spark 3.2, the Spark SQL TRANSFORM operator no longer supports aliases in its inputs. In Spark 3.1 and earlier, you could write a script transform like SELECT TRANSFORM(a AS c1, b AS c2) USING 'cat' FROM TBL.
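A common way to adapt such a query for Spark 3.2 and later is to move the aliases into a subquery; a sketch against the hypothetical table TBL from the example above could look like this:

```sql
-- Fails in Spark 3.2+: aliases are no longer allowed inside TRANSFORM inputs
-- SELECT TRANSFORM(a AS c1, b AS c2) USING 'cat' FROM TBL;

-- Equivalent rewrite: alias the columns in a subquery instead
SELECT TRANSFORM(c1, c2) USING 'cat'
FROM (SELECT a AS c1, b AS c2 FROM TBL) t;
```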
In Spark 3.1, loading and saving timestamps before 1900-01-01 00:00:00Z as INT96 in Parquet files causes errors. In Spark 3.0, this wouldn't fail but could result in timestamp shifts due to calendar rebasing. To restore the old behavior in Spark 3.1, you would need to set the Spark SQL configurations spark.sql.legacy.parquet.int96RebaseModeInRead and spark.sql.legacy.parquet.int96RebaseModeInWrite to LEGACY.
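Restoring that legacy behavior is a pure configuration change; shown here in spark-defaults style, though the same pairs can equally be passed through a Glue job's --conf argument or SparkSession.builder.config calls:

```
spark.sql.legacy.parquet.int96RebaseModeInRead   LEGACY
spark.sql.legacy.parquet.int96RebaseModeInWrite  LEGACY
```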
Automated validation in your environment
After identifying the necessary changes, Spark Upgrades validates the upgraded application by running it as an AWS Glue job in your AWS account. The service iterates through multiple validation runs, up to 10, reviewing any errors encountered in each iteration and refining the upgrade plan until it achieves a successful run. You can run a Spark Upgrade Analysis in your development account, using mock datasets supplied through Glue job parameters for the validation runs.
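The iterative validation behavior described above can be sketched as a simple loop. This is only an illustration of the documented behavior (up to 10 runs, each failure feeding the next refinement); run_glue_job and refine_plan are hypothetical stand-ins for the service's internals, not a real API:

```python
MAX_VALIDATION_RUNS = 10  # the service caps validation at 10 attempts


def validate_upgrade(plan, run_glue_job, refine_plan):
    """Run validation jobs, refining the plan after each failure,
    until a run succeeds or the attempt cap is reached."""
    attempts = []
    for attempt in range(1, MAX_VALIDATION_RUNS + 1):
        error = run_glue_job(plan)        # run the upgraded script as a Glue job
        attempts.append((attempt, error))
        if error is None:                 # successful run: the plan is validated
            return plan, attempts
        plan = refine_plan(plan, error)   # use the failure to refine the plan
    return None, attempts                 # cap reached without a successful run
```

With stub functions where the third refinement succeeds, the loop returns the validated plan together with a log of all three attempts, mirroring the per-attempt logs the service surfaces.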
After Spark Upgrades has successfully validated the changes, it presents an upgrade plan for you to review. You can then accept and apply the changes to your job in the development account, before replicating them to your job in the production account. The Spark upgrade plan includes the following:
- An upgrade summary with an explanation of the code updates made during the process
- The final script that you can use in place of your current script
- Logs from validation runs showing how issues were identified and resolved
You can review all aspects of the upgrade, including intermediate validation attempts and any error resolutions, before deciding to apply the changes to your production job. This approach gives you full visibility into, and control over, the upgrade process while benefiting from AI-driven automation.
Get started with generative AI Spark upgrades
Let's walk through the process of upgrading an AWS Glue 2.0 job to AWS Glue 4.0. Complete the following steps:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Select your AWS Glue 2.0 job, and choose Run upgrade analysis with AI.
- For Result path, enter s3://aws-glue-assets-<account ID>-<AWS Region>/scripts/upgraded/ (provide your own account ID and AWS Region).
- Choose Run.

- On the Upgrade analysis tab, wait for the analysis to be completed.

While an analysis is running, you can view the intermediate job analysis attempts (up to 10) for validation under the Runs tab. Additionally, the upgrade summary in S3 documents the upgrades made by the Spark Upgrades service so far, refining the upgrade plan with each attempt. Each attempt displays a different failure reason, which the service tries to address in the next attempt through code or configuration updates.
After a successful analysis, the upgraded script and a summary of the changes are uploaded to Amazon Simple Storage Service (Amazon S3).
- Review the changes to make sure they meet your requirements, then choose Apply upgraded script.

Your job has now been successfully upgraded to AWS Glue version 4.0. You can check the Script tab to verify the updated script and the Job details tab to review the modified configuration.
Understanding the upgrade process through an example

We now demonstrate a production Glue 2.0 job that we want to upgrade to Glue 4.0 using the Spark Upgrades feature. This Glue 2.0 job reads a dataset, updated daily in an S3 bucket under different partitions, containing new book reviews from an online marketplace, and runs SparkSQL to gather insights into the user votes for the book reviews.
Original code (Glue 2.0) – before upgrade
New code (Glue 4.0) – after upgrade
Upgrade summary
As seen in the diff of the updated Glue 4.0 (Spark 3.3.0) script against the Glue 2.0 (Spark 2.4.3) script, and in the resulting upgrade summary, a total of six different code and configuration updates were applied across the six attempts of the Spark Upgrade Analysis.
- Attempt #1 included a Spark SQL configuration (spark.sql.adaptive.enabled) to restore the application behavior, as adaptive query execution is a new Spark SQL feature introduced starting with Spark 3.2. Users can inspect this configuration change and further enable or disable it per their preference.
- Attempt #2 resolved a Python language change between Python 3.7 and 3.10: abstract base classes such as Sequence must now be imported from the abc submodule of the Python collections module (collections.abc) rather than from collections directly.
- Attempt #3 resolved an error caused by a change in DataFrame API behavior starting with Spark 3.1, where the path option cannot coexist with other DataFrameReader operations.
- Attempt #4 resolved an error caused by a change in the Spark SQL function API signature for DATE_ADD, which only accepts integers as the second argument starting with Spark 3.0.
- Attempt #5 resolved an error caused by a change in behavior of the Spark SQL function API for count(tblName.*) starting with Spark 3.2. The behavior was restored with the introduction of a new Spark SQL configuration, spark.sql.legacy.allowStarWithSingleTableIdentifierInCount.
- Attempt #6 successfully completed the analysis and ran the new script on Glue 4.0 without any new errors. The final attempt resolved an error caused by the prohibited use of a negative scale for cast(DecimalType(3, -6)) in the Spark DataFrame API starting with Spark 3.0. The issue was addressed by enabling the new Spark SQL configuration spark.sql.legacy.allowNegativeScaleOfDecimal.
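The Python language change behind attempt #2 can be illustrated in isolation. Since Python 3.10, abstract base classes such as Sequence can no longer be imported directly from collections; this minimal, runnable sketch shows the portable import:

```python
# Raises ImportError on Python 3.10+ (it worked on Python 3.7, with a
# DeprecationWarning since Python 3.3):
#   from collections import Sequence

# Portable form on both Python 3.7 and 3.10, and the kind of rewrite
# Spark Upgrades applies automatically:
from collections.abc import Sequence

assert isinstance([1, 2, 3], Sequence)      # lists are sequences
assert isinstance("abc", Sequence)          # so are strings
assert not isinstance({"a": 1}, Sequence)   # mappings are not
```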
Important considerations for preview
As you begin using automated Spark upgrades during the preview period, there are several important aspects to consider for optimal use of the service:
- Service scope and limitations – The preview release focuses on PySpark code upgrades from AWS Glue version 2.0 to version 4.0. At the time of writing, the service handles PySpark code that doesn't rely on additional library dependencies. You can run automated upgrades for up to 10 jobs concurrently in an AWS account, allowing you to efficiently modernize multiple jobs while maintaining system stability.
- Optimizing costs during the upgrade process – Because the service uses generative AI to validate the upgrade plan through multiple iterations, with each iteration running as an AWS Glue job in your account, it's essential to optimize the validation job run configurations for cost-efficiency. To achieve this, we recommend specifying a run configuration when starting an upgrade analysis as follows:
- Using non-production developer accounts and selecting sample mock datasets that represent your production data but are smaller in size for validation with Spark Upgrades.
- Using right-sized compute resources, such as G.1X workers, and selecting an appropriate number of workers for processing your sample data.
- Enabling AWS Glue auto scaling when applicable to automatically adjust resources based on the workload.
For example, if your production job processes terabytes of data with 20 G.2X workers, you might configure the upgrade job to process just a few gigabytes of representative data with 2 G.2X workers and auto scaling enabled for validation.
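As an illustration of that right-sizing, such a validation run configuration might take the following shape. The field names mirror the standard AWS Glue job-run settings (WorkerType, NumberOfWorkers, and the --enable-auto-scaling job argument); treat the exact keys accepted by the upgrade analysis as an assumption, since the preview console collects these values through its own form:

```json
{
  "WorkerType": "G.2X",
  "NumberOfWorkers": 2,
  "Arguments": {
    "--enable-auto-scaling": "true"
  }
}
```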
- Preview best practices – During the preview period, we strongly recommend starting your upgrade journey with non-production jobs. This approach allows you to familiarize yourself with the upgrade workflow and understand how the service handles different types of Spark code patterns.
Your experience and feedback are crucial in helping us enhance and improve this feature. We encourage you to share your insights, suggestions, and any challenges you encounter through AWS Support or your account team. This feedback will help us improve the service and add the capabilities that matter most to you during the preview.
Conclusion
This post demonstrated how automated Spark upgrades can assist with migrating your Spark applications in AWS Glue. It simplifies the migration process by using generative AI to automatically identify the necessary script changes across different Spark versions.
To learn more about this feature in AWS Glue, see Generative AI upgrades for Apache Spark in AWS Glue.
A special thanks to everyone who contributed to the launch of generative AI upgrades for Apache Spark in AWS Glue: Shuai Zhang, Mukul Prasad, Liyuan Lin, Rishabh Nair, Raghavendhar Thiruvoipadi Vidyasagar, Tina Shao, Chris Kha, Neha Poonia, Xiaoxi Liu, Japson Jeyasekaran, Suthan Phillips, Raja Jaya Chandra Mannem, Yu-Ting Su, Neil Jonkers, Boyko Radulov, Sujatha Rudra, Mohammad Sabeel, Mingmei Yang, Matt Su, Daniel Greenberg, Charlie Sim, McCall Petier, Adam Rohrscheib, Andrew King, Ranu Shah, Aleksei Ivanov, Bernie Wang, Karthik Seshadri, Sriram Ramarathnam, Asterios Katsifodimos, Brody Bowman, Sunny Konoplev, Bijay Bisht, Saroj Yadav, Carlos Orozco, Nitin Bahadur, Kinshuk Pahare, Santosh Chandrachood, and William Vambenepe.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his new road bike.
Keerthi Chadalavada is a Senior Software Development Engineer at AWS Glue, focusing on combining generative AI and data integration technologies to design and build comprehensive solutions for customers' data and analytics needs.
Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.
Pradeep Patel is a Software Development Manager on the AWS Glue team. He is passionate about helping customers solve their problems by using the power of the AWS Cloud to deliver highly scalable and robust solutions. In his spare time, he loves to hike and play with web applications.
Chuhan Liu is a Software Engineer at AWS Glue. He is passionate about building scalable distributed systems for big data processing, analytics, and management. He is also keen on using generative AI technologies to provide brand-new experiences to customers. In his spare time, he likes sports and enjoys playing tennis.
Vaibhav Naik is a software engineer at AWS Glue, passionate about building robust, scalable solutions to tackle complex customer problems. With a keen interest in generative AI, he likes to explore innovative ways to develop enterprise-level solutions that harness the power of cutting-edge AI technologies.
Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems that enable customers, through simple-to-use interfaces and AI-driven capabilities, to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.
