Speed up data lake operations with Apache Iceberg V3 deletion vectors and row lineage


Organizations building petabyte-scale data lakes face growing challenges as their data grows. Batch updates and compliance deletes create a proliferation of positional delete files, slowing downstream data pipelines and driving up storage costs. Tracking data changes for audit trails and incremental processing requires custom, engine-specific implementations that add complexity and maintenance burden. As data volumes scale, these challenges compound, leaving data teams juggling custom solutions and growing operational costs just to maintain data freshness and compliance.

Apache Iceberg V3 addresses these challenges with two new capabilities: deletion vectors and row lineage. AWS now delivers these capabilities across Apache Spark on Amazon EMR 7.12, AWS Glue, Amazon SageMaker notebooks, Amazon S3 Tables, and the AWS Glue Data Catalog, providing you with a complete, built-in V3 experience without custom implementation. This means faster writes, lower storage costs, comprehensive audit trails, and efficient incremental processing, all working seamlessly across your entire data lake architecture.

In this post, we walk you through the new capabilities in Iceberg V3, explain how deletion vectors and row lineage address these challenges, explore real-world use cases across industries, and provide practical guidance on implementing Iceberg V3 features across AWS analytics, catalog, and storage services.

What’s new in Iceberg V3

Iceberg V3 introduces new capabilities and data types. Two key capabilities that address the challenges discussed earlier are deletion vectors and row lineage.

Deletion vectors replace positional delete files with an efficient binary format stored in Puffin files. Instead of creating a separate delete file for each delete operation, Iceberg consolidates the delete references into a single deletion vector per data file. During query execution, engines efficiently filter out deleted rows using these compact vectors, maintaining query performance while removing the need to merge multiple delete files.

This avoids write amplification from random batch updates and GDPR compliance deletes, significantly reducing the overhead of keeping data fresh. High-frequency update workloads can see immediate improvements in write performance and reduced storage costs from fewer small delete files. Additionally, having fewer small delete files reduces table maintenance costs for compaction operations.

Row lineage enables precise change tracking at the row level with full auditability. Row lineage adds metadata fields to each data file that track when rows were created and last modified. The _row_id field uniquely identifies each row, and the _last_updated_sequence_number field records the sequence number of the commit that last modified the row. These fields enable efficient change tracking queries without scanning entire tables, and they're maintained automatically, as defined by the Iceberg specification, without requiring custom code.

Before row lineage, change tracking in Iceberg provided only the net changes between snapshots, making it difficult to track individual record modifications. Row lineage metadata fields can now be queried to return all incremental changes, giving you full fidelity for auditing data modifications and regulatory compliance. For data transformations, your downstream systems can process changes incrementally, speeding up data pipelines and reducing compute costs for change data capture (CDC) workflows. Row lineage is engine agnostic, interoperable, and built into the Iceberg V3 specification, alleviating the need for custom, engine-specific change tracking implementations.

Real-world use cases

The new Iceberg V3 capabilities address critical challenges across multiple industries:

  • Marketing and advertising services organizations – You can now efficiently handle GDPR right-to-be-forgotten requests and regulatory compliance deletes without the write amplification that previously degraded pipeline performance. Row lineage provides complete audit trails for data modifications, meeting strict regulatory requirements for data governance.
  • Ecommerce platforms processing millions of product updates and inventory changes daily – You can maintain data freshness while reducing storage costs. Deletion vectors enable faster upsert operations, helping teams meet aggressive SLA requirements during peak shopping periods.
  • Healthcare and life sciences companies – You can track patient data modifications with precision for compliance purposes while efficiently processing large-scale genomic datasets. Row lineage provides the detailed change history required for clinical trial audits and regulatory submissions.
  • Media and entertainment providers managing large catalogs of user viewing data – You can efficiently process incremental changes for recommendation engines. Row lineage enables downstream analytics systems to process only changed records, reducing compute costs in incremental processing scenarios.

Get started with Iceberg V3

To take advantage of deletion vectors for optimized writes and row lineage for built-in change tracking in Iceberg V3, set the table property format-version = 3 during table creation. Alternatively, setting this property on an existing Iceberg V2 table atomically upgrades the table without data rewrites. Before creating or upgrading V3 tables, make sure the Iceberg engines in your solution are V3-compatible. Refer to Apache Iceberg V3 on AWS for more details.

Create a new V3 table with Apache Spark on Amazon EMR 7.12

The following code creates a new table named customer_data. Setting the table property format-version = 3 creates a V3 table. If the format-version table property is not explicitly set, a V2 table is created, because V2 is currently the Iceberg default table version. Setting write.delete.mode, write.update.mode, and write.merge.mode to merge-on-read configures Spark to write deletion vectors for delete, update, or merge statements performed on the table.

CREATE TABLE customer_data (
  customer_id   bigint,
  name          string,
  email         string,
  last_purchase timestamp,
  total_spent   decimal(10,2)
)
USING iceberg
TBLPROPERTIES (
  -- Create the table as Iceberg V3 (the default would be V2)
  'format-version' = '3',
  -- Merge-on-read makes Spark write deletion vectors instead of rewriting data files
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read'
)

Run the following code to insert records into the customer_data table:

INSERT INTO customer_data VALUES
  (1, 'Alejandro Rosalez', 'alejandro_rosalez@example.org', TIMESTAMP '2025-11-24 18:55:27', 42.97),
  (2, 'Akua Mansa', 'akua_mansa@example.org', TIMESTAMP '2025-11-24 17:55:27', 25.02),
  (3, 'Ana Carolina Silva', 'anacarolina_silva@example.org', TIMESTAMP '2025-11-24 16:55:27', 43.67),
  (4, 'Arnav Desai', 'arnav_desai@example.org', TIMESTAMP '2025-11-24 15:55:27', 98.32),
  (5, 'Carlos Salazar', 'carlos_salazar@example.org', TIMESTAMP '2025-11-24 12:55:27', 76.45)

Delete the record where customer_id = 5 to generate a delete file:

DELETE 
  FROM customer_data 
  WHERE customer_id = 5

Updating a record with the following update statement also generates a delete file:

UPDATE customer_data
  SET name = 'Mansa Akua'
  WHERE customer_id = 2

The last part of this example queries the manifests metadata table to verify that delete files were produced:

SELECT added_snapshot_id
      ,sum(added_delete_files_count) as added_delete_files_count 
FROM customer_data.manifests 
GROUP BY added_snapshot_id 
ORDER BY added_snapshot_id

This query returns three records, as shown in the following screenshot. The added_delete_files_count for the first snapshot, which inserts the records, should be 0. The next two snapshots, for the delete and update statements, should each have an added_delete_files_count of 1.
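
To look at the delete files themselves rather than only their counts, a minimal sketch is to query the delete_files metadata table that Iceberg exposes alongside manifests. This assumes your Spark and Iceberg versions expose that metadata table for the customer_data table created earlier; on a V3 table written in merge-on-read mode, the file_format column would be expected to show Puffin files for deletion vectors, but verify this in your own environment.

```sql
-- Sketch: list the delete files currently tracked for the table.
-- Assumes the delete_files metadata table is available in your engine version.
SELECT content,
       file_path,
       file_format,
       record_count
FROM customer_data.delete_files
```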

Query row lineage for change tracking

Row lineage is automatically enabled on V3 tables. The following example includes the row lineage metadata fields and shows how to query table changes after a given row lineage sequence number:

SELECT
  customer_id,
  name,
  email,
  _row_id,
  _last_updated_sequence_number
FROM customer_data
WHERE _last_updated_sequence_number > 0
ORDER BY _last_updated_sequence_number, _row_id

Running this query after the previous insert, update, and delete statements returns four records, as shown in the following screenshot. The deleted record is removed. The _last_updated_sequence_number is 3 for the update to customer_id = 2.
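
The same filter can drive incremental processing. The following sketch assumes a downstream job that remembers the highest sequence number it has already processed; the watermark value 2 is a hypothetical placeholder, not something Iceberg stores for you. Note that the query only returns rows still present in the table.

```sql
-- Sketch: pull rows changed since a hypothetical watermark of 2.
SELECT customer_id,
       _row_id,
       _last_updated_sequence_number
FROM customer_data
WHERE _last_updated_sequence_number > 2
ORDER BY _last_updated_sequence_number, _row_id;

-- Candidate watermark to store for the next incremental run.
SELECT max(_last_updated_sequence_number) AS next_watermark
FROM customer_data;
```

With the sequence numbers described above, the first query should return only the updated row for customer_id = 2.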

Upgrade an existing V2 table

You can upgrade your existing V2 tables to V3 with the following command:

ALTER TABLE existing_customer_data
SET TBLPROPERTIES ('format-version' = '3')

When you upgrade a table from V2 to V3, several important operations occur atomically:

  • A new metadata snapshot is created atomically, resulting in no data loss.
  • Existing Parquet data files are reused without modification.
  • Row-lineage fields (_row_id and _last_updated_sequence_number) are added to the table metadata.
  • The next compaction operation will remove old V2 positional delete files. If new deletion vector files are generated before compaction runs, they will merge existing V2 positional delete files.
  • New modifications will automatically use V3's deletion vector files.
  • The upgrade doesn't perform a historical backfill of row-lineage change tracking records.

The upgrade process is synchronous and completes in seconds for most tables. If the upgrade fails, an error message is returned immediately, and the table remains in its V2 state.
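
One way to confirm the result is to read the property back after running the ALTER TABLE statement. This is a sketch that assumes your engine surfaces format-version through SHOW TBLPROPERTIES, as Spark with Iceberg generally does; existing_customer_data is the example table name used above.

```sql
-- Sketch: check which format version the table reports after the upgrade.
SHOW TBLPROPERTIES existing_customer_data ('format-version')
```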

Getting the most from Iceberg V3

In this section, we share the key things we've learned from customers already using these features.

Know your workload pattern

Deletion vectors work best when you're doing lots of writes, such as high-frequency updates, batch deletes, or CDC workloads making random non-append-only updates. If you're writing more than you're reading, deletion vectors will deliver immediate performance gains. To unlock these benefits, set your table to merge-on-read mode for delete, update, and merge operations.
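
For a table that already exists, the merge-on-read properties can be set after the fact. The following is a minimal sketch using the same three properties shown in the CREATE TABLE example; customer_data stands in for your own table name.

```sql
-- Sketch: switch an existing table to merge-on-read for deletes, updates, and merges.
ALTER TABLE customer_data
SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
)
```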

Let AWS handle compaction

Enable automatic compaction through the Data Catalog or use S3 Tables (on by default). You'll get hands-free optimization without building custom maintenance jobs. Deletion vectors produce fewer delete files than positional deletes in Iceberg V2. Given the same pattern and volume of modified records, V3 compaction should be quicker and cost less than V2.

Understand the importance of row lineage when using the V2 changelog

With the Spark changelog procedure in Iceberg V2, if a row gets inserted and then deleted between snapshots, it disappears from your change feed and you never see it. Iceberg V3 row lineage captures both operations because _last_updated_sequence_number updates on each modification. This full fidelity is critical for audit trails and regulatory compliance, where you need to prove what happened to every record. Performance-wise, the V2 changelog requires scanning and merging delete files to compute changes, which is compute you pay for on every read. V3 row lineage stores metadata fields directly on each row, so filtering by _last_updated_sequence_number is a simple metadata scan.

Test before you upgrade

Iceberg V3 upgrades are atomic and fast, but test in dev first. Make sure all your query engines support Iceberg V3 before upgrading shared tables, because mixing V2 and V3 engines causes headaches. After upgrading, keep a few V2 snapshots around temporarily for time-travel queries while you validate performance.
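
As a sketch of that validation step, you can list the snapshots a table still retains and then read one of them with Spark's time-travel syntax. The snapshot ID below is a hypothetical placeholder; substitute a snapshot_id value returned by the first query. This assumes a Spark version that supports VERSION AS OF.

```sql
-- Sketch: list retained snapshots, oldest first.
SELECT snapshot_id,
       committed_at,
       operation
FROM existing_customer_data.snapshots
ORDER BY committed_at;

-- Read the table as of an earlier snapshot (placeholder ID, replace with a real one).
SELECT count(*) AS row_count
FROM existing_customer_data VERSION AS OF 1234567890123456789;
```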

Conclusion

Iceberg V3 support across AWS analytics, catalog, and storage services marks a significant advancement in data lake capabilities. By combining deletion vectors' write optimization with row lineage's comprehensive change tracking, you can build more efficient, auditable, and cost-effective data lakes at scale. The seamless interoperability across AWS services makes sure your data lake architecture stays flexible and future-proof.

To learn more about AWS support for Iceberg V3, refer to Using Apache Iceberg on AWS.

To learn more about building modern data lakes with Iceberg on AWS, refer to Analytics on AWS.


About the authors

Ron Ortloff

Ron is a Principal Product Manager at AWS.
