In the ever-evolving landscape of cloud computing and data management, AWS has consistently been at the forefront of innovation. One of the groundbreaking developments in recent years is zero-ETL integration, a set of fully managed integrations by AWS that minimizes the need to build extract, transform, and load (ETL) data pipelines. This post explores a brief history of zero-ETL, its importance for customers, and introduces an exciting new feature: history mode for Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integration with Amazon Redshift.
A brief history of zero-ETL integrations
The concept of zero-ETL integrations emerged as a response to the growing complexities and inefficiencies in traditional ETL processes. Traditional ETL processes are time-consuming and complex to develop, maintain, and scale. Although not all use cases can be replaced with zero-ETL, it simplifies replication and lets you apply transformations post-replication. This eliminates the need for additional ETL technology between the source database and Amazon Redshift. We at AWS recognized the need for a more streamlined approach to data integration, particularly between operational databases and cloud data warehouses. The journey of zero-ETL began in late 2022 when we launched the feature for Aurora MySQL with Amazon Redshift. This feature marked a pivotal moment in streamlining complex data workflows, enabling near real-time data replication and analysis while eliminating the need for ETL processes.
Building on the success of our first zero-ETL integration, we’ve made continuous strides in this space by working backward from our customers’ needs and launching features like data filtering, auto and incremental refresh of materialized views, refresh interval, and more. Additionally, we increased the breadth of sources to include Aurora PostgreSQL, DynamoDB, and Amazon RDS for MySQL to Amazon Redshift integrations, solidifying our commitment to making it seamless for you to run analytics on your data. The introduction of zero-ETL was not just a technological advancement; it represented a paradigm shift in how organizations could approach their data strategies. By removing the need for intermediate data processing steps, we opened up new possibilities for near real-time analytics and decision-making.
Introducing history mode: A new frontier in data analysis
Zero-ETL has already simplified data integration, and we’re excited to further enhance its capabilities by announcing a new feature that takes it a step further: history mode with Amazon Redshift. Using history mode with zero-ETL integrations, you can streamline your historical data analysis by maintaining full change data capture (CDC) from the source in Amazon Redshift. History mode lets you unlock the full potential of your data by seamlessly capturing and retaining historical versions of records across your zero-ETL data sources. You can perform advanced historical analysis, build look-back reports, perform trend analysis, and create slowly changing dimensions (SCD) Type 2 tables on Amazon Redshift. This lets you consolidate your core analytical assets and derive insights across multiple applications, gaining cost savings and operational efficiencies. History mode enables organizations to comply with regulatory requirements for maintaining historical records, facilitating comprehensive data governance and informed decision-making.
Without history mode, zero-ETL integrations provide a current view of records in near real time, meaning only the latest changes from source databases are retained on Amazon Redshift. With history mode, Amazon Redshift introduces a new approach to historical data analysis. You can now configure your zero-ETL integrations to track every version of your records in source tables directly in Amazon Redshift, along with a source timestamp on every record version indicating when each record was inserted, modified, or deleted. Because data changes are tracked and retained by Amazon Redshift, this can help you meet your compliance requirements without having to maintain duplicate copies in data sources. In addition, you don’t have to maintain and manage partitioned tables that keep older data intact as separate partitions to version records, or maintain historical data in source databases.
In a data warehouse, the most common dimensional modeling technique is a star schema, where there is a fact table at the center surrounded by a number of associated dimension tables. A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. For example, in a typical sales domain, customer, time, or product are dimensions and sales transactions is a fact. An SCD is a data warehousing concept for a dimension that contains relatively static data that can change slowly over a period of time. There are three major types of SCDs maintained in data warehousing: Type 1 (no history), Type 2 (full history), and Type 3 (limited history). CDC is a characteristic of a database that provides an ability to identify the data that changed between two database loads, so that an action can be performed on the changed data.
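To make the difference between Type 1 and Type 2 concrete, the following is a minimal Python sketch rather than anything Amazon Redshift runs. The `DimRow` class, its `valid_from` and `valid_to` fields, and the city values are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DimRow:
    # Hypothetical dimension row used only for this illustration.
    customer_id: int
    city: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still current"


def apply_type1(rows: list[DimRow], customer_id: int, city: str) -> None:
    """Type 1: overwrite in place, so the previous value is lost."""
    for row in rows:
        if row.customer_id == customer_id:
            row.city = city


def apply_type2(rows: list[DimRow], customer_id: int, city: str, at: datetime) -> None:
    """Type 2: close the current row and append a new version."""
    for row in rows:
        if row.customer_id == customer_id and row.valid_to is None:
            row.valid_to = at
    rows.append(DimRow(customer_id, city, valid_from=at))


t0 = datetime(2024, 1, 1)
t1 = datetime(2024, 6, 1)

type1 = [DimRow(1, "Austin", t0)]
apply_type1(type1, 1, "Denver")      # one row remains; "Austin" is gone

type2 = [DimRow(1, "Austin", t0)]
apply_type2(type2, 1, "Denver", t1)  # two rows: the closed "Austin" and the current "Denver"
```

With Type 2, a question such as "where did this customer live in March?" can still be answered, which is the property the rest of this post relies on.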
In this post, we demonstrate how to enable history mode for tables in a zero-ETL integration and capture the full historical data changes as SCD2.
Solution overview
In this use case, we explore how a fictional national retail chain, AnyCompany, uses AWS services to gain valuable insights into their customer base. With multiple locations across the country, AnyCompany aims to enhance their understanding of customer behavior and improve their marketing strategies through two key initiatives:
- Customer migration analysis – AnyCompany seeks to track and analyze customer relocation patterns, focusing on how geographical moves impact purchasing behavior. By monitoring these changes, the company can adapt its inventory, services, and local marketing efforts to better serve customers in their new locations.
- Marketing campaign effectiveness – The retailer wants to evaluate the impact of targeted marketing campaigns based on customer demographics at the time of campaign execution. This analysis can help AnyCompany refine its marketing strategies, optimize resource allocation, and improve overall campaign performance.
By closely tracking changes in customer profiles for both geographic movement and marketing responsiveness, AnyCompany is positioning itself to make more informed, data-driven decisions.
In this demonstration, we begin by loading a sample dataset into the source table, customer, in Aurora PostgreSQL-Compatible. To maintain historical records, we enable history mode on the customer table, which automatically tracks changes in Amazon Redshift.
When history mode is turned on, the following columns are automatically added to the target table, customer, in Amazon Redshift to keep track of changes in the source.
| Column name | Data type | Description |
| --- | --- | --- |
| _record_is_active | Boolean | Indicates if a record in the target is currently active in the source. True indicates the record is active. |
| _record_create_time | Timestamp | Starting time (UTC) when the source record is active. |
| _record_delete_time | Timestamp | Ending time (UTC) when the source record is updated or deleted. |
Next, we create a dimension table, customer_dim, in Amazon Redshift with an additional surrogate key column to show an example of creating an SCD table. To optimize query performance for different queries, some of which might analyze active or inactive records only while others might analyze data as of a certain date, we defined a sort key consisting of the _record_is_active, _record_create_time, and _record_delete_time attributes in the customer_dim table.
The following figure provides the schema of the source table in Aurora PostgreSQL-Compatible, and the target table and target customer dimension table in Amazon Redshift.
To streamline the data population process, we developed a stored procedure named SP_Customer_Type2_SCD(). This procedure is designed to populate incremental data into the customer_dim table from the replicated customer table. It handles various data changes, including updates, inserts, and deletes in the source table, and implements an SCD2 approach.
Prerequisites
Before you get started, complete the following steps:
- Configure your Aurora DB cluster and your Redshift data warehouse with the required parameters and permissions. For instructions, refer to Getting started with Aurora zero-ETL integrations with Amazon Redshift.
- Create an Aurora zero-ETL integration with Amazon Redshift.
- From an Amazon Elastic Compute Cloud (Amazon EC2) terminal or using AWS CloudShell, SSH into the Aurora PostgreSQL cluster and run the following commands to install psql:
- Load the sample source data:
  - Download the TPC-DS sample dataset for the customer table onto the machine running psql.
  - From the EC2 terminal, run the following command to connect to the Aurora PostgreSQL DB using the default super user postgres:
  - Run the following SQL command to create the database zetl:
  - Change the connection to the newly created database:
  - Create the customer table (the following example creates it in the public schema):
  - Run the following command to load customer data from the downloaded dataset after changing the highlighted location of the dataset to your directory path:
  - Run the following query to validate the successful creation of the table and loading of sample data:
The SQL output should be as follows:
Create a target database in Amazon Redshift
To replicate data from your source into Amazon Redshift, you must create a target database from your integration in Amazon Redshift. For this post, we have already created a source database called zetl in Aurora PostgreSQL-Compatible as part of the prerequisites. Complete the following steps to create the target database:
- On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
- Run the following commands to create a database called postgres in Amazon Redshift using the zero-ETL integration_id with history mode turned on.
Turning on history mode at the time of target database creation on Amazon Redshift enables history mode for existing tables and for new tables created in the future.
- Run the following query to validate the successful replication of the initial data from the source into Amazon Redshift:
The table customer should show table_state as Synced with is_history_mode as true.
Enable history mode for existing zero-ETL integrations
History mode can be enabled for your existing zero-ETL integrations using either the Amazon Redshift console or SQL commands. Based on your use case, you can turn on history mode at the database, schema, or table level. To use the Amazon Redshift console, complete the following steps:
- On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
- Choose your desired integration.
- Choose Manage history mode.

On this page, you can either enable or disable history mode for all tables or a subset of tables.
- Select Manage history mode for individual tables and select Turn on for the history mode for the customer table.
- Choose Save changes.

- To confirm changes, choose Table statistics and make sure History mode is On for the customer table.
- Optionally, you can run the following SQL command in Amazon Redshift to enable history mode for the customer table:
- Optionally, you can enable history mode for all current tables and tables created in the future in the database:
- Optionally, you can enable history mode for all current tables and tables created in the future in one or more schemas. The following query enables history mode for all current and future tables in the public schema:
- Run the following query to validate that the customer table has been successfully changed to history mode, with the is_history_mode column as true, so that it can begin tracking every version (including updates and deletes) of all records changed in the source:
Initially, the table will be in ResyncInitiated state before changing to Synced.
- Run the following query in the zetl database of Aurora PostgreSQL-Compatible to modify a source record and observe the behavior of history mode in the Amazon Redshift target:
- Now run the following query in the postgres database of Amazon Redshift to see all versions of the same record:
The zero-ETL integration with history mode has inactivated the old record by setting its _record_is_active column value to false and created a new record with _record_is_active as true. You can also see how it maintains the _record_create_time and _record_delete_time column values for both records. The inactive record has a delete timestamp that matches the active record’s create timestamp.
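The versioning behavior described here can be modeled with a small Python sketch. This is a toy model of the observable result only, not how Amazon Redshift implements history mode. The `apply_change` helper, the `customer_id` key, and the `city` attribute are hypothetical; the three `_record_*` column names come from the table earlier in this post.

```python
from datetime import datetime, timezone


def apply_change(versions, op, key, ts, values=None):
    """Apply one source change ('insert', 'update', or 'delete') to the version list."""
    if op in ("update", "delete"):
        # Close the currently active version of this key.
        for v in versions:
            if v["customer_id"] == key and v["_record_is_active"]:
                v["_record_is_active"] = False
                v["_record_delete_time"] = ts
    if op in ("insert", "update"):
        # Open a new active version that starts at the same timestamp.
        versions.append({
            "customer_id": key,
            **values,
            "_record_is_active": True,
            "_record_create_time": ts,
            "_record_delete_time": None,
        })


versions = []
t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 3, 1, tzinfo=timezone.utc)
t3 = datetime(2024, 9, 1, tzinfo=timezone.utc)

apply_change(versions, "insert", 7, t1, {"city": "Austin"})
apply_change(versions, "update", 7, t2, {"city": "Denver"})
apply_change(versions, "delete", 7, t3)
```

After the update, the closed version's `_record_delete_time` equals the new version's `_record_create_time`, which mirrors the matching timestamps noted above. After the delete, no active version remains, but both historical versions are still available for analysis.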
Load incremental data in an SCD2 table
Complete the following steps to create an SCD2 table and implement an incremental data load process in a regular database of Amazon Redshift, in this case dev:
- Create an empty customer SCD2 table called customer_dim with SCD fields. The table also has DISTSTYLE AUTO and SORTKEY columns _record_is_active, _record_create_time, and _record_delete_time. When you define a sort key on a table, Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
Next, you create a stored procedure called SP_Customer_Type2_SCD() to populate incremental data in the customer_dim SCD2 table created in the preceding step. The stored procedure contains the following components:
- First, it fetches the max _record_create_time and max _record_delete_time for each customer_id.
- Then, it compares the output of the preceding step with the ongoing zero-ETL integration replicated table for records created after the max creation time in the dimension table, or records in the replicated table with _record_delete_time after the max _record_delete_time in the dimension table, for each customer_id.
- The output of the preceding step captures the changed data between the replicated customer table and the target customer_dim dimension table. The interim data is staged to a customer_stg table, which is ready to be merged with the target table.
- During the merge process, records that need to be deleted are marked with _record_delete_time and _record_is_active is set to false, while newly created records are inserted into the target table customer_dim with _record_is_active as true.
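The four components above can be sketched end to end in Python. This is not the SP_Customer_Type2_SCD() procedure itself: it uses an in-memory SQLite database from the standard library as a stand-in for Amazon Redshift, stores timestamps as ISO date strings and Booleans as 0/1, and assumes a single hypothetical `city` attribute. The `load_scd2` function name is also an assumption made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (               -- stand-in for the replicated table
        customer_id INTEGER, city TEXT,
        _record_is_active INTEGER,
        _record_create_time TEXT, _record_delete_time TEXT);
    CREATE TABLE customer_dim (           -- SCD2 target with a surrogate key
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER, city TEXT,
        _record_is_active INTEGER,
        _record_create_time TEXT, _record_delete_time TEXT);
""")


def load_scd2(conn):
    cur = conn.cursor()
    # Components 1-3: compare against per-customer max timestamps and stage the changes.
    cur.execute("DROP TABLE IF EXISTS customer_stg")
    cur.execute("""
        CREATE TABLE customer_stg AS
        SELECT c.*
        FROM customer c
        LEFT JOIN (
            SELECT customer_id,
                   MAX(_record_create_time) AS max_create,
                   MAX(_record_delete_time) AS max_delete
            FROM customer_dim
            GROUP BY customer_id
        ) d ON d.customer_id = c.customer_id
        WHERE d.customer_id IS NULL
           OR c._record_create_time > d.max_create
           OR (c._record_delete_time IS NOT NULL
               AND (d.max_delete IS NULL OR c._record_delete_time > d.max_delete))
    """)
    # Component 4a: close dimension rows whose source version is no longer active.
    cur.execute("""
        UPDATE customer_dim
        SET _record_is_active = 0,
            _record_delete_time = (
                SELECT s._record_delete_time FROM customer_stg s
                WHERE s.customer_id = customer_dim.customer_id
                  AND s._record_create_time = customer_dim._record_create_time)
        WHERE _record_is_active = 1
          AND EXISTS (
                SELECT 1 FROM customer_stg s
                WHERE s.customer_id = customer_dim.customer_id
                  AND s._record_create_time = customer_dim._record_create_time
                  AND s._record_is_active = 0)
    """)
    # Component 4b: insert staged versions that the dimension does not have yet.
    cur.execute("""
        INSERT INTO customer_dim (customer_id, city, _record_is_active,
                                  _record_create_time, _record_delete_time)
        SELECT s.customer_id, s.city, s._record_is_active,
               s._record_create_time, s._record_delete_time
        FROM customer_stg s
        WHERE NOT EXISTS (
            SELECT 1 FROM customer_dim d
            WHERE d.customer_id = s.customer_id
              AND d._record_create_time = s._record_create_time)
    """)
    conn.commit()


# Initial load of one customer.
conn.execute("INSERT INTO customer VALUES (1, 'Austin', 1, '2024-01-01', NULL)")
load_scd2(conn)

# Simulate what history mode would replicate after an address change.
conn.execute("""UPDATE customer SET _record_is_active = 0,
                _record_delete_time = '2024-06-01' WHERE customer_id = 1""")
conn.execute("INSERT INTO customer VALUES (1, 'Denver', 1, '2024-06-01', NULL)")
load_scd2(conn)
load_scd2(conn)  # running again with no new changes stages nothing

rows = conn.execute("""
    SELECT customer_id, city, _record_is_active,
           _record_create_time, _record_delete_time
    FROM customer_dim ORDER BY _record_create_time
""").fetchall()
```

Because the staging query filters on the per-customer max timestamps, rerunning the load without new source changes leaves the dimension untouched, which is what makes it safe to schedule.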
- Create the stored procedure with the following code:
- Run and schedule the stored procedure to load the initial and ongoing incremental data into the customer_dim SCD2 table:
- Validate the data in the customer_dim table for the same customer with a changed address:

You have successfully implemented an incremental load strategy for the customer SCD2 table. Going forward, all changes to customer will be tracked and maintained in this customer dimension table by running the stored procedure. This lets you analyze customer data at a desired point in time for different use cases, for example, performing customer migration analysis to see how geographical moves impact purchasing behavior, or measuring marketing campaign effectiveness by analyzing the impact of targeted marketing campaigns based on customer demographics at the time of campaign execution.
Industry use cases for history mode
The following are other industry use cases enabled by history mode between operational data stores and Amazon Redshift:
- Financial auditing or regulatory compliance – Track changes in financial records over time to support compliance and audit requirements. History mode allows auditors to reconstruct the state of financial data at any point in time, which is crucial for investigations and regulatory reporting.
- Customer journey analysis – Understand how customer data evolves to gain insights into behavior patterns and preferences. Marketers can analyze how customer profiles change over time, informing personalization strategies and lifetime value calculations.
- Supply chain optimization – Analyze historical inventory and order data to identify trends and optimize stock levels. Supply chain managers can review how demand patterns have shifted over time, improving forecasting accuracy.
- HR analytics – Track employee data changes over time for better workforce planning and performance analysis. HR professionals can analyze career progression, salary changes, and skill development trends across the organization.
- Machine learning model auditing – Data scientists can use historical data to train models, compare predictions vs. actuals to improve accuracy, and help explain model behavior and identify potential biases over time.
- Hospitality and airline industry use cases – For example:
  - Customer service – Access historical reservation data to swiftly address customer queries, enhancing service quality and customer satisfaction.
  - Crew scheduling – Track crew schedule changes to help comply with union contracts, maintaining positive labor relations and optimizing workforce management.
  - Data science applications – Use historical data to train models on multiple scenarios from different time periods. Compare predictions against actuals to improve model accuracy for key operations such as airport gate management, flight prioritization, and crew scheduling optimization.
Best practices
If your requirement is to separate active and inactive records, you can use _record_is_active as the first sort key. For other patterns where you want to analyze data as of a specific date in the past, regardless of whether data is active or inactive, _record_create_time and _record_delete_time can be added as sort keys.
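The two access patterns mentioned here filter on different columns, which the following Python sketch illustrates. It uses an in-memory SQLite table as a stand-in for customer_dim, with ISO date strings for timestamps and 0/1 for Booleans. The sample rows, the `city` column, and the `customers_as_of` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_id INTEGER, city TEXT,
        _record_is_active INTEGER,
        _record_create_time TEXT, _record_delete_time TEXT)
""")
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Austin", 0, "2024-01-01", "2024-06-01"),
    (1, "Denver", 1, "2024-06-01", None),
    (2, "Boston", 1, "2024-02-15", None),
])


def customers_as_of(conn, as_of):
    """Return the version of each customer that was in effect at `as_of`."""
    return conn.execute("""
        SELECT customer_id, city
        FROM customer_dim
        WHERE _record_create_time <= ?
          AND (_record_delete_time IS NULL OR _record_delete_time > ?)
        ORDER BY customer_id
    """, (as_of, as_of)).fetchall()


# Pattern 1: current state only, filtering on _record_is_active.
active_now = conn.execute("""
    SELECT customer_id, city FROM customer_dim
    WHERE _record_is_active = 1 ORDER BY customer_id
""").fetchall()

# Pattern 2: point-in-time state, filtering on the create and delete timestamps.
march = customers_as_of(conn, "2024-03-01")
```

The first query benefits from _record_is_active leading the sort key, while the second benefits from the timestamp columns, so the right ordering depends on which pattern dominates your workload.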
History mode retains record versions, which increases the table size in Amazon Redshift and could impact query performance. Therefore, periodically perform DML deletes for old record versions (delete data beyond a certain timeframe if not needed for analysis). When executing these deletions, maintain data integrity by deleting across all related tables. Vacuuming also becomes necessary when you perform DML deletes on records whose versioning is no longer required. Amazon Redshift auto vacuum delete is more efficient when operating on bulk deletes, so prefer fewer, larger deletions over many small ones. You can monitor vacuum progress using the SYS_VACUUM_HISTORY table.
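As a rough illustration of the periodic cleanup, the following Python sketch removes inactive versions that ended before a cutoff in a single bulk statement. It again uses in-memory SQLite rather than Amazon Redshift, and the `purge_old_versions` helper, the sample rows, and the cutoff date are assumptions. Active records are never touched because the predicate requires _record_is_active to be false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_id INTEGER, city TEXT,
        _record_is_active INTEGER,
        _record_create_time TEXT, _record_delete_time TEXT)
""")
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Austin", 0, "2022-01-01", "2022-06-01"),
    (1, "Denver", 0, "2022-06-01", "2024-03-01"),
    (1, "Miami", 1, "2024-03-01", None),
])


def purge_old_versions(conn, cutoff):
    """Delete inactive versions that ended before `cutoff`; return the number removed."""
    cur = conn.execute(
        "DELETE FROM customer_dim "
        "WHERE _record_is_active = 0 AND _record_delete_time < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


removed = purge_old_versions(conn, "2023-01-01")
remaining = conn.execute(
    "SELECT city FROM customer_dim ORDER BY _record_create_time"
).fetchall()
```

In a real deployment, the retention window would come from your compliance requirements, and the same cutoff would be applied to every related table.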
Clean up
Complete the following steps to clean up your resources:
Conclusion
Zero-ETL integrations have already made significant strides in simplifying data integration and enabling near real-time analytics. With the addition of history mode, AWS continues to innovate, providing you with even more powerful tools to derive value from your data.
As businesses increasingly rely on data-driven decision-making, zero-ETL with history mode will be crucial in maintaining a competitive edge in the digital economy. These advancements not only streamline data processes but also open up new avenues for analysis and insight generation.
To learn more about zero-ETL integration with history mode, refer to Zero-ETL integrations and Limitations. Get started with zero-ETL on AWS by creating a free account today!
About the Authors
Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.
Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services including databases, data warehouses, and data lakes.
Gopal Paliwal is a Principal Engineer for Amazon Redshift, leading the software development of ZeroETL initiatives for Amazon Redshift.
Harman Nagra is a Principal Solutions Architect at AWS, based in San Francisco. He works with global financial services organizations to design, develop, and optimize their workloads on AWS.
Sumanth Punyamurthula is a Senior Data and Analytics Architect at Amazon Web Services with more than 20 years of experience in leading large analytical initiatives, including analytics, data warehouse, data lakes, data governance, security, and cloud infrastructure across travel, hospitality, financial, and healthcare industries.
