Introducing Variant: A New Open Standard for Semi-Structured Data in Apache Parquet™, Delta Lake, and Apache Iceberg™


Semi-structured data is everywhere in AI, application logs, and telemetry. This data is valuable, but changing schemas make it hard to store and query. For years, the standard practice was to store this data as strings. Strings were flexible but had poor query performance, because the engine needed to parse and search through the entire string.

The Variant data type, now ratified in Apache Parquet™, takes a different approach. It stores the data in a compact binary format that’s both flexible and performant for querying. This approach isn’t tied to one engine or format: Variant is the open standard for semi-structured data across the lakehouse, with support in Apache Spark™, Delta Lake, and Apache Iceberg™.

In this blog post, we’ll cover:

  • Investing in Variant open standards
  • How Variant and shredding work
  • Fast performance on semi-structured data

Databricks is leading Variant efforts in open source

Last year, we collaborated with the open source community to introduce Variant to Apache Spark™ and Delta Lake. This new data type provides both flexibility and performance compared to storing semi-structured data as strings (which have poor performance) or structs (which aren’t flexible).

Variant’s release quickly drew interest from other major open source projects, including Apache Iceberg™ and Apache Arrow™. To unify the ecosystem, we proposed bringing Variant to all engines and formats by incorporating the type directly into Parquet and moving the Spark implementation to the Parquet-java open source project, contributing over 9,600 lines of code. This allows all open table formats to easily leverage the Variant data type.

Now that Variant has been approved within the Parquet community, the entire lakehouse ecosystem has a standard, open data type for semi-structured data. Variant is already supported in open table formats: Delta has included Variant support for the past year, and last May, Iceberg approved v3, which includes Variant support. As a result, users of Delta or Iceberg can now benefit from Variant’s flexibility and performance.

The Parquet Variant artifacts include:

The Delta and Iceberg protocols to support Variant are:

We extend our gratitude to all the individuals and organizations involved for their contributions across many open source communities, including Apache Parquet™, Apache Spark™, Apache Iceberg™, Delta Lake, and Apache Arrow™.

How Variant and shredding work

Variant uses a binary encoding format to provide a flexible interface for data storage. Variant also has a shredding scheme, a technique for storing Variant more efficiently to improve performance.

Binary encoding format

The Variant data type leverages an efficient binary encoding scheme to represent semi-structured data. Instead of storing the data as a plain-text value (like JSON), Variant encodes the values and the structure in a binary format that prioritizes efficient navigation.

Navigating a JSON string requires reading and processing the entire JSON object to find the relevant field. With the Variant binary encoding, the structure of the data is encoded using offsets to other locations within the Variant value. With these offsets, navigating through the Variant structure doesn’t require reading or processing the entire value. This offset-based navigation greatly improves the performance of processing semi-structured data.

This example demonstrates that navigating to the path order.item.name requires inspecting only a few portions of the Variant value using the offsets. This reduces the amount of data to process and parse, and leads to faster performance.
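To make the idea of offset-based navigation concrete, here is a minimal Python sketch. It uses a made-up toy layout (a type tag, a field table with byte offsets, then the child values), not the actual Parquet Variant encoding; the function names `encode`, `get_path`, and `decode_scalar` are hypothetical and exist only for this illustration.

```python
import struct


def encode(value):
    """Encode nested dicts, strings, and ints into a toy offset-indexed layout.

    This is NOT the Parquet Variant format; it only illustrates the idea of
    storing offsets so a reader can jump straight to a field.
    """
    if isinstance(value, dict):
        keys = sorted(value)
        children = [encode(value[k]) for k in keys]
        # Field table: tag, field count, then for each field the key length,
        # the key bytes, and the child's offset relative to the payload start.
        table = bytearray(b"O" + struct.pack("<H", len(keys)))
        offset = 0
        for key, child in zip(keys, children):
            raw = key.encode("utf-8")
            table += struct.pack("<H", len(raw)) + raw + struct.pack("<I", offset)
            offset += len(child)
        return bytes(table) + b"".join(children)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"S" + struct.pack("<I", len(raw)) + raw
    if isinstance(value, int):
        return b"I" + struct.pack("<q", value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def decode_scalar(buf, pos):
    """Decode the string or integer that starts at byte position `pos`."""
    tag = buf[pos:pos + 1]
    if tag == b"S":
        (length,) = struct.unpack_from("<I", buf, pos + 1)
        return buf[pos + 5:pos + 5 + length].decode("utf-8")
    if tag == b"I":
        return struct.unpack_from("<q", buf, pos + 1)[0]
    raise ValueError("path does not end at a scalar value")


def get_path(buf, path):
    """Follow a dotted path by reading field tables and jumping via offsets.

    Sibling values that are not on the path are never read or decoded.
    """
    pos = 0
    for key in path.split("."):
        if buf[pos:pos + 1] != b"O":
            raise KeyError(key)
        (count,) = struct.unpack_from("<H", buf, pos + 1)
        cursor = pos + 3
        target = None
        for _ in range(count):
            (key_len,) = struct.unpack_from("<H", buf, cursor)
            name = buf[cursor + 2:cursor + 2 + key_len].decode("utf-8")
            (child_offset,) = struct.unpack_from("<I", buf, cursor + 2 + key_len)
            cursor += 2 + key_len + 4
            if name == key:
                target = child_offset
        if target is None:
            raise KeyError(key)
        # The payload begins right after the field table, where `cursor` now points.
        pos = cursor + target
    return decode_scalar(buf, pos)


doc = {
    "notes": "x" * 1000,  # large sibling value that the lookup never touches
    "order": {"id": 42, "item": {"name": "widget", "sku": "A-1"}},
}
buf = encode(doc)
print(get_path(buf, "order.item.name"))
```

In this sketch the lookup reads three small field tables and one string. The 1,000-byte `notes` value is skipped entirely, which is the property the offsets are there to provide.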

Shredding

Shredding automatically extracts common fields from the Variant values. These fields are stored as separate, typed chunks in the same column. Without shredding, the entire Variant value is stored as a single “binary blob” in the file.

There are several performance advantages to shredding Variants:

  • Pruned I/O: When fields are stored separately, only the fields required by the query need to be fetched. That means if the query only requires a small fraction of the Variant fields, only a small fraction of the I/O is required.
  • Data skipping: When shredded fields are stored as separate Parquet chunks, engines can use all the Parquet optimizations for efficient row group and column page skipping.
  • Compression: Since shredded fields are columnar, the data can be compressed more efficiently, reducing the storage size.

This example demonstrates that with shredding, the scan only needs to read the columns required by the query. The scan uses Parquet column statistics, so irrelevant row groups can be completely skipped. Reading shredded data improves performance by avoiding unnecessary work.
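The effect of pruned I/O and data skipping can be sketched with a small in-memory model. The Python below is only an illustration under simplifying assumptions: row groups are plain dictionaries, the leftover fields are kept as a JSON string rather than a binary Variant, and the helper names `shred` and `count_greater_than` are hypothetical. It does not reflect the actual Parquet shredding layout.

```python
import json


def shred(rows, fields, group_size):
    """Split rows into row groups and pull `fields` out into separate columns.

    Each group keeps per-field (min, max) statistics, and everything that was
    not shredded stays in a residual blob per row.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {f: [row.get(f) for row in chunk] for f in fields}
        stats = {}
        for field, values in columns.items():
            present = [v for v in values if v is not None]
            stats[field] = (min(present), max(present)) if present else None
        residual = [
            json.dumps({k: v for k, v in row.items() if k not in fields})
            for row in chunk
        ]
        groups.append({"columns": columns, "stats": stats, "residual": residual})
    return groups


def count_greater_than(groups, field, threshold):
    """Count rows where `field` > `threshold`.

    Only the one shredded column is read, and any row group whose maximum is
    not above the threshold is skipped using its statistics.
    Returns (matching_rows, row_groups_actually_read).
    """
    matched = 0
    groups_read = 0
    for group in groups:
        bounds = group["stats"][field]
        if bounds is None or bounds[1] <= threshold:
            continue  # no row in this group can satisfy the predicate
        groups_read += 1
        matched += sum(
            1 for v in group["columns"][field] if v is not None and v > threshold
        )
    return matched, groups_read


rows = [{"id": i, "ts": i * 10, "payload": {"x": i}} for i in range(100)]
groups = shred(rows, fields=["id", "ts"], group_size=25)
print(count_greater_than(groups, "ts", 900))
```

With four row groups of 25 rows each, the query on `ts` reads a single column from a single row group. The residual blobs and the `id` column are never touched.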

Fast performance on semi-structured data

The Variant binary format and shredding technique enable significant performance improvements compared to storing semi-structured data as JSON strings. We conducted performance benchmarks using semi-structured data based on TPC-DS to compare Variant and string representations.

Compared to storing JSON as a string, Variant has 8x faster read performance. With shredding, Variant writes are 20% to 50% slower, but reads are 30x faster, so a modest write cost buys a large read speedup.

Try Variant Today

With native Parquet, Delta, and Iceberg support, the Variant data type is the open and standardized data type for semi-structured data. By eliminating the need for complex ETL and brittle parsing, Variant empowers users to analyze data fast, easily, and reliably.

Creating a table with a Variant column is simple:

To load Variant data, Databricks supports Variant ingestion functions from JSON, XML, and CSV:

Variant shredding is supported in DBR 17.2+ (DBSQL 2025.30+) with support in Delta and Iceberg tables. This improves query performance without code changes:

Stay tuned for our follow-up post on Variant, where we’ll walk through practical examples and share customer stories.

The focus on performance, simplicity, and value is the foundation of Databricks SQL, where the best data warehouse is a lakehouse. To learn more about Databricks SQL, visit our website, read the documentation, or check out the product tour. Databricks SQL is the high-performance, lower-cost, and serverless data warehouse. Try it for free today.
