Delta Lake
Integrations / ecosystem
Observations / thoughts / questions
- Atomic log record insertion depends on atomic “put if absent” or rename operations
- To use Delta Lake on S3, a transaction coordinator service is required because S3 lacks these atomic operations.
- P.S. Since Aug 20, 2024, S3 supports put if absent via its conditional writes feature.
- Checkpoint is made at the end of write transactions.
- Checkpoint happens every 10 transactions by default.
- Reference: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, VLDB 2020
- Row Tracking and Row IDs
- every row has two Row IDs, fresh Row ID and stable Row ID
- Default generated Row IDs: calculated from the baseRowId field of add and remove actions plus the row position
- Materialized Row IDs: stored in a column in the data files
- fresh Row ID = Default generated Row ID
- stable Row ID = Materialized Row ID if not null, otherwise Default generated Row ID
- Data skipping
- By default, statistics of the first 32 columns are collected
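The "put if absent" primitive the log commit depends on can be sketched on a local filesystem: `O_CREAT | O_EXCL` fails when the target already exists, which is the same guarantee that S3 conditional writes (or a transaction coordinator) provide for `_delta_log` commits. `commit_log_entry` is a hypothetical helper for illustration, not Delta's actual API.

```python
import os

def commit_log_entry(log_dir, version, payload):
    """Atomically create the log record for `version`, failing if it exists.

    O_CREAT | O_EXCL mirrors the put-if-absent semantics Delta needs:
    exactly one writer wins the race for each log version. Sketch only.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer already committed this version; retry on top of it
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True
```

A losing writer gets `False` and must re-read the log and retry against the new latest version, which is where optimistic concurrency control comes in.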
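The fresh/stable Row ID rules above can be sketched as a small resolution function (a hypothetical helper for illustration, not Delta's implementation):

```python
def stable_row_id(materialized_row_id, base_row_id, row_position):
    """Resolve a row's stable Row ID per the rules noted above (sketch).

    The default generated Row ID is baseRowId (from the file's add action)
    plus the row's position within the file; a materialized Row ID stored
    in a column takes precedence when present.
    """
    default_generated = base_row_id + row_position
    if materialized_row_id is not None:
        return materialized_row_id
    return default_generated
```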
Links
Iceberg
Update June 4th, 2024: Databricks acquired Tabular. Delta Lake and Iceberg will probably be merged gradually in the near future.
Integrations / ecosystem
- Trino
- Hive
- DuckDB
- read only as of 2024/08/07
- ClickHouse
Observations / thoughts / questions
- A manifest list is created for each table snapshot.
- Puffin is a file format for storing indexes and statistics of a table
Atomic data commit
- Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. (from Metastore Tables)
- The atomic swap needed to commit new versions of table metadata can be implemented with a check-and-put update of a pointer stored in a metastore or database.
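The check-and-put commit can be sketched as a compare-and-swap on a table's metadata pointer. `MetastorePointer` is a toy stand-in for a real metastore (which would use a database transaction for the same effect):

```python
import threading

class MetastorePointer:
    """Toy metastore holding the current metadata file location per table.

    check_and_put succeeds only if the caller's expected value matches the
    current one, giving the atomic swap Iceberg needs to commit a new
    metadata file. Illustrative sketch, not a real metastore client.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._locations = {}

    def check_and_put(self, table, expected, new):
        with self._lock:
            current = self._locations.get(table)
            if current != expected:
                return False  # concurrent commit won; re-read and retry
            self._locations[table] = new
            return True
```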
Delete format
- Position delete file points to rows by data file path and position within the file
- Equality delete file points to rows by equality on column values
- How to confirm that data files pointed to by a position delete file still exist?
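Applying a position delete file to one data file can be sketched as a filter on row positions. The `(file_path, position)` tuple shape follows the description above; the function name is hypothetical:

```python
def apply_position_deletes(rows, deletes, data_file):
    """Filter the rows of one data file using position delete entries.

    `rows` is the file's rows in order (index = position in the file);
    `deletes` is a list of (file_path, position) entries, possibly
    covering several data files. Sketch of position-delete semantics.
    """
    deleted = {pos for path, pos in deletes if path == data_file}
    return [row for pos, row in enumerate(rows) if pos not in deleted]
```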
Maintenance
Hudi
Integrations / ecosystem
Observations / thoughts / questions
- It seems Hudi is optimized for near-realtime scan and ingest use cases rather than batch processing.
- Development seems most active for Spark and Flink.
- Trino and Hive support only read as of Aug. 2024.
- Supports both copy-on-write and merge-on-read table types
- merge-on-read table is optimized for update and delete heavy workload
- File group of a merge-on-read table comprises columnar base files and row-based delta log files
- Table and query types
- File locations are stored in the files index of the metadata table
- Eliminates expensive file listing operations on DFS/cloud object storage
- Hudi supports a record-level index
- Implemented with the HFile format, which has a B+-tree-like structure
- The index is built on the primary key, i.e. keys must be unique across all partitions within a table
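The merge-on-read read path described above (columnar base file + row-based delta log) can be sketched as a replay of log entries over the base file. The `("upsert"/"delete", key, record)` entry shape is an assumption for illustration; real Hudi log files hold encoded data blocks, not tuples:

```python
def merge_on_read(base_rows, delta_log):
    """Merge a base file with its delta log at query time (sketch).

    `base_rows` maps record key -> record; `delta_log` is an ordered list
    of ("upsert", key, record) / ("delete", key, None) entries, replayed
    in order, as in a merge-on-read file group.
    """
    merged = dict(base_rows)
    for op, key, record in delta_log:
        if op == "delete":
            merged.pop(key, None)
        else:  # upsert
            merged[key] = record
    return merged
```

This replay cost at read time is why merge-on-read favors update/delete-heavy ingest over scan-heavy batch workloads until compaction folds the log back into the base file.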
Concurrency Control
Hudi implements a file level, log based concurrency control protocol on the Hudi timeline, which in-turn relies on bare minimum atomic puts to cloud storage. (from: Lakehouse Concurrency Control: Are we too optimistic?)
Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time. Atomicity is achieved by relying on the atomic puts to the underlying storage to move the write operations through various states in the timeline. (from: Timeline)
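The atomic-put-based state transitions on the timeline can be sketched as create-if-absent of instant files named by instant time, action, and state. The naming below is simplified for illustration and does not match Hudi's exact file layout:

```python
import os

def transition(timeline_dir, instant_time, action, state):
    """Record a timeline state transition by atomically creating a file.

    Each transition (requested -> inflight -> completed) is a separate
    atomic put, so an observer sees each state fully or not at all.
    Completed instants drop the state suffix in this simplified scheme.
    """
    if state == "completed":
        name = f"{instant_time}.{action}"
    else:
        name = f"{instant_time}.{action}.{state}"
    path = os.path.join(timeline_dir, name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # atomic put
    os.close(fd)
    return name
```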
Kudu
Integrations / ecosystem
Tightly integrated with Impala. Has integration with NiFi and Spark.
Observations / thoughts / questions
- Has replication and high availability mechanism by itself.
- Others rely on the reliability of underlying storage, e.g. HDFS or cloud object storage
- Reaches consensus via the Raft algorithm
- It plays some of the roles of a distributed file system rather than being a simple table format.
- Its direction is somewhat similar to Hudi's: it aims at near-realtime scan and ingest use cases.