This guest post was co-authored with Kostas Diamantis from Skroutz.
At Skroutz, we are passionate about our product, and it is always our top priority. We're constantly working to improve and evolve it, supported by a large and talented team of software engineers. Our product's continuous innovation and evolution lead to frequent updates, often necessitating changes and additions to the schemas of our operational databases.
When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the main challenge (and also a strict requirement) was to make sure it wouldn't block or delay our product development.
We chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data, enabling faster insights and more informed decision-making. This choice supports a culture of transparency and collaboration, as data becomes readily available for analysis and innovation across all departments.
However, keeping up with schema changes from our operational databases, while updating the data warehouse without constantly coordinating with development teams, delaying releases, or risking data loss, became a new challenge for us.
In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.
Solution overview
Most of our data resides in our operational databases, such as MariaDB and MongoDB. Our approach involves using the change data capture (CDC) technique, which automatically handles the schema evolution of the data stores being captured. For this, we used Debezium together with a Kafka cluster. This solution allows schema changes to be propagated without disrupting the Kafka consumers.
However, handling schema evolution in Amazon Redshift became a bottleneck, prompting us to develop a strategy to address this challenge. It's important to note that, in our case, changes in our operational databases primarily involve adding new columns rather than breaking changes like altering data types. Therefore, we've implemented a semi-manual process to resolve this issue, including a mandatory alerting mechanism to notify us of any schema changes. This two-step process consists of handling schema evolution in real time and handling data updates in an asynchronous manual step. The following architectural diagram illustrates a hybrid deployment model, integrating both on-premises and cloud-based components.
The data flow begins with data from MariaDB and MongoDB, captured using Debezium for CDC in near real-time mode. The captured changes are streamed to a Kafka cluster, where Kafka consumers (built on the Ruby Karafka framework) read them and write them to the staging area, either in Amazon Redshift or Amazon Simple Storage Service (Amazon S3). From the staging area, DataLoaders promote the data to production tables in Amazon Redshift. At this stage, we apply the slowly changing dimension (SCD) concept to these tables, using Type 7 for most of them.
In data warehousing, an SCD is a dimension that stores data that is generally stable but might change over time. Various methodologies address the complexities of SCD management. SCD Type 7 places both the surrogate key and the natural key into the fact table. This allows the user to select the appropriate dimension records based on:
- The primary effective date on the fact record
- The most recent or current information
- Other dates associated with the fact record
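To make the Type 7 idea concrete, here is a minimal Python sketch. The key names (`shop_sk`, `shop_id`), the validity columns, and the sample rows are illustrative assumptions, not the actual Skroutz model. Because the fact row carries both keys, the same fact can be joined to the dimension version recorded with it, to the current version, or to the version valid on any other date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ShopDim:
    shop_sk: int                # surrogate key: one value per version of a shop
    shop_id: int                # natural key: stable across versions
    name: str
    valid_from: date
    valid_to: Optional[date]    # None marks the current version


@dataclass(frozen=True)
class OrderFact:
    order_id: int
    shop_sk: int                # version in effect when the fact was recorded
    shop_id: int                # natural key, used to reach other versions
    order_date: date


# Hypothetical dimension history: the shop was renamed on 2024-01-01.
DIM: List[ShopDim] = [
    ShopDim(1, 100, "Old Name", date(2023, 1, 1), date(2024, 1, 1)),
    ShopDim(2, 100, "New Name", date(2024, 1, 1), None),
]


def as_recorded(fact: OrderFact, dim: List[ShopDim]) -> ShopDim:
    """Dimension row tied to the fact's primary effective date (surrogate key)."""
    return next(d for d in dim if d.shop_sk == fact.shop_sk)


def current(fact: OrderFact, dim: List[ShopDim]) -> ShopDim:
    """Most recent dimension row for the same natural key."""
    return next(d for d in dim if d.shop_id == fact.shop_id and d.valid_to is None)


def as_of(fact: OrderFact, dim: List[ShopDim], when: date) -> ShopDim:
    """Dimension row that was valid on an arbitrary date."""
    return next(
        d for d in dim
        if d.shop_id == fact.shop_id
        and d.valid_from <= when
        and (d.valid_to is None or when < d.valid_to)
    )


fact = OrderFact(order_id=1, shop_sk=1, shop_id=100, order_date=date(2023, 6, 1))
```

In a warehouse, the three lookups correspond to three different join conditions on the same fact table; the sketch only mirrors that logic in memory.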
Afterwards, analytical jobs are run to create reporting tables, enabling BI and reporting processes. The following diagram provides an example of the data modeling process from a staging table to a production table.

The architecture depicted in the diagram shows only our CDC pipeline, which fetches data from our operational databases; it doesn't include other pipelines, such as those for fetching data through APIs, scheduled batch processes, and many more. Also note that, by convention, dw_* columns hold SCD metadata and other metadata in general. In the following sections, we discuss the key components of the solution in more detail.
Real-time workflow
For the schema evolution part, we focus on the column dw_md_missing_data, which captures, in near real time, the schema evolution changes that occur in the source databases. When a new change is produced to the Kafka cluster, the Kafka consumer is responsible for writing this change to the staging table in Amazon Redshift. For example, a message produced by Debezium to the Kafka cluster will have the following structure when a new shop entity is created:
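As a rough illustration (not the actual Skroutz payload), a Debezium change event wraps the row in an envelope with `before`, `after`, `source`, `op`, and `ts_ms` fields. Depending on the converter settings, that envelope may be nested under a `payload` key. The Python sketch below assumes hypothetical column names (`id`, `name`), a hypothetical source database and table, and made-up values.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical create event for a new shop; only the envelope keys are
# standard Debezium fields, everything else is invented for illustration.
RAW_MESSAGE = json.dumps({
    "before": None,
    "after": {"id": 1, "name": "Example shop"},
    "source": {"db": "shop_db", "table": "shops"},
    "op": "c",                      # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1700000000000,
})


def extract_row(raw: str) -> Optional[Dict[str, Any]]:
    """Return the row image to write to staging, or None when there is none."""
    event = json.loads(raw)
    # Tolerate both layouts: bare envelope, or envelope nested under "payload".
    envelope = event.get("payload", event)
    if envelope.get("op") in ("c", "u", "r"):
        return envelope["after"]
    return None                     # deletes carry no "after" image
```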
The Kafka consumer is responsible for preparing and executing the SQL INSERT statement:
After that, let's say a new column called new_column is added to the source table, with the value new_value.
The new message produced to the Kafka cluster will have the following format:
Now the SQL INSERT statement executed by the Kafka consumer will be as follows:
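A minimal Python sketch of this routing, under stated assumptions: the known schema is a plain list of column names, the staging table is called `staging.shops`, and the statement uses DB-API `%s` placeholders as accepted by drivers such as psycopg2. Fields outside the known schema are serialized to JSON and passed through Redshift's `JSON_PARSE` so they land in the SUPER column. The production consumers are written in Ruby, so this only mirrors the idea.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical known schema for the shops staging table.
KNOWN_COLUMNS = ["id", "name"]


def split_row(row: Dict[str, Any], known_columns: List[str]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate fields the warehouse knows about from fields it does not."""
    known = {c: row.get(c) for c in known_columns}
    missing = {k: v for k, v in row.items() if k not in known_columns}
    return known, missing


def build_insert(table: str, row: Dict[str, Any], known_columns: List[str]) -> Tuple[str, List[Any]]:
    """Build a parameterized INSERT; unknown fields go to dw_md_missing_data."""
    known, missing = split_row(row, known_columns)
    columns = list(known)
    placeholders = ["%s"] * len(known)
    params: List[Any] = list(known.values())
    if missing:
        # Only touch the SUPER column when there is something to store.
        columns.append("dw_md_missing_data")
        placeholders.append("JSON_PARSE(%s)")
        params.append(json.dumps(missing, sort_keys=True))
    # The table name is interpolated directly, so it must come from trusted config.
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
    return sql, params
```

With this shape, a row that matches the known schema produces the same INSERT as before, and a row with extra fields gains one additional column and parameter rather than failing.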
The consumer performs an INSERT as it would for the known schema, and anything new is added to the dw_md_missing_data column as key-value JSON. After the data is promoted from the staging table to the production table, it will have the following structure.

At this point, the data flow continues running without any data loss or the need for communication with the teams responsible for maintaining the schema in the operational databases. However, this data might not be easily accessible for data consumers, analysts, or other personas. It's worth noting that dw_md_missing_data is defined as a column of the SUPER data type, which was introduced in Amazon Redshift to store semistructured data or documents as values.
Monitoring mechanism
To track new columns added to a table, we have a scheduled process that runs weekly. This process checks for tables in Amazon Redshift with values in the dw_md_missing_data column and generates a list of tables requiring manual action to make this data available through a structured schema. A notification is then sent to the team.
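The weekly check could look roughly like the following Python sketch. It assumes the list of candidate tables is supplied by the caller and that `run_scalar` is a hypothetical helper that executes a query against Amazon Redshift and returns a single number; how the notification is delivered is left out.

```python
from typing import Callable, Iterable, List


def check_query(table: str) -> str:
    """Count rows that still hold unmapped fields in the SUPER column."""
    return f"SELECT COUNT(*) FROM {table} WHERE dw_md_missing_data IS NOT NULL"


def tables_needing_action(tables: Iterable[str], run_scalar: Callable[[str], int]) -> List[str]:
    """Return the tables where at least one row has dw_md_missing_data set."""
    return [t for t in tables if run_scalar(check_query(t)) > 0]


def format_notification(tables: List[str]) -> str:
    """Build the message body; an empty string means nothing to report."""
    if not tables:
        return ""
    lines = ["Tables with unmapped columns in dw_md_missing_data:"]
    lines += [f"- {t}" for t in sorted(tables)]
    return "\n".join(lines)
```

Passing the query runner in as a function keeps the check easy to exercise without a live cluster.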
Manual remediation steps
In the aforementioned example, the manual steps to make this column available would be:
- Add the new columns to both staging and production tables:
- Update the Kafka consumer's known schema. In this step, we just need to add the new column name to a simple array list. For example:
- Update the DataLoader's SQL logic for the new column. A DataLoader is responsible for promoting the data from the staging area to the production table.
- Transfer the data that has been loaded in the meantime from the dw_md_missing_data SUPER column to the newly added column and then clean up. In this step, we just need to run a data migration like the following:
To perform the preceding operations, we make sure that no one else performs changes to the production.shops table because we want no new data to be added to the dw_md_missing_data column.
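The remediation steps above can be sketched as generated SQL. This is an illustration under assumptions, not the exact statements Skroutz runs: the staging table name (`staging.shops`), the `VARCHAR(255)` type, and the known-schema list are hypothetical, and the SUPER navigation and cast syntax should be checked against the Amazon Redshift documentation before use.

```python
from typing import List

# Hypothetical known schema before and after the change (step 2).
KNOWN_COLUMNS = ["id", "name"]
UPDATED_KNOWN_COLUMNS = KNOWN_COLUMNS + ["new_column"]


def add_column_statements(column: str, column_type: str, tables: List[str]) -> List[str]:
    """Step 1: add the new column to the staging and production tables."""
    return [f"ALTER TABLE {t} ADD COLUMN {column} {column_type};" for t in tables]


def migration_statements(table: str, column: str, column_type: str) -> List[str]:
    """Step 4: copy values out of the SUPER column, then clear it.

    Assumes `column` is the only key currently stored in dw_md_missing_data;
    with several pending keys, the cleanup must wait until all are migrated.
    """
    return [
        "BEGIN;",
        f"UPDATE {table} SET {column} = dw_md_missing_data.{column}::{column_type} "
        f"WHERE dw_md_missing_data.{column} IS NOT NULL;",
        f"UPDATE {table} SET dw_md_missing_data = NULL "
        f"WHERE dw_md_missing_data IS NOT NULL;",
        "COMMIT;",
    ]


ddl = add_column_statements("new_column", "VARCHAR(255)", ["staging.shops", "production.shops"])
migration = migration_statements("production.shops", "new_column", "VARCHAR(255)")
```

Wrapping both updates in one transaction keeps readers from seeing rows where the value has been cleared from the SUPER column but not yet written to the new column.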
Conclusion
The solution discussed in this post enabled Skroutz to manage schema evolution in operational databases while seamlessly updating the data warehouse. This alleviated the need for constant development team coordination and removed risks of data loss during releases, ultimately fostering innovation rather than stifling it.
As the migration of Skroutz to the AWS Cloud approaches, discussions are underway on how the current architecture can be adapted to align more closely with AWS-centered principles. To that end, one of the changes being considered is Amazon Redshift streaming ingestion from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or open source Kafka, which will make it possible for Skroutz to process large volumes of streaming data from multiple sources with low latency and high throughput to derive insights in seconds.
If you face similar challenges, discuss them with an AWS representative and work backward from your use case to arrive at the most suitable solution.
About the authors
Konstantina Mavrodimitraki is a Senior Solutions Architect at Amazon Web Services, where she assists customers in designing scalable, robust, and secure systems in global markets. With deep expertise in data strategy, data warehousing, and big data systems, she helps organizations transform their data landscapes. A passionate technologist and people person, Konstantina loves exploring emerging technologies and supports the local tech communities. Additionally, she enjoys reading books and playing with her dog.
Kostas Diamantis is the Head of the Data Warehouse at Skroutz. With a background in software engineering, he transitioned into data engineering, using his technical expertise to build scalable data solutions. Passionate about data-driven decision-making, he focuses on optimizing data pipelines, enhancing analytics capabilities, and driving business insights.
