Delta Lake
Integrations / ecosystem
Observations / thoughts / questions
- Atomic log record insertion depends on atomic “put if absent” or rename operations
- To use Delta Lake on S3, a transaction coordinator service is required because S3 lacks these atomic operations.
- P.S. Since Aug 20, 2024, S3 supports put if absent via its conditional writes feature.
- Checkpoint is made at the end of write transactions.
- Checkpoint happens every 10 transactions by default.
- Reference: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, VLDB 2020
- Row Tracking and Row IDs
- every row has two Row IDs, fresh Row ID and stable Row ID
- Default generated Row IDs: calculated from the baseRowId field of add and remove actions plus the row position
- Materialized Row IDs: stored in a column in the data files
- fresh Row ID = Default generated Row ID
- stable Row ID = Materialized Row ID if not null, otherwise Default generated Row ID
- Data skipping
- By default, statistics of the first 32 columns are collected
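The "put if absent" primitive the log commit depends on can be sketched on a local filesystem: `O_CREAT | O_EXCL` fails when the target already exists, which is the same guarantee that S3 conditional writes (or a transaction coordinator) provide for `_delta_log` commits. `commit_log_entry` is a hypothetical helper for illustration, not Delta's actual API.

```python
import os

def commit_log_entry(log_dir, version, payload):
    """Atomically create the log record for `version`, failing if it exists.

    O_CREAT | O_EXCL mirrors the put-if-absent semantics Delta needs:
    exactly one writer wins the race for each log version. Sketch only.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer already committed this version; retry on top of it
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True
```

A losing writer gets `False` and must re-read the log and retry against the new latest version, which is where optimistic concurrency control comes in.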
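The fresh/stable Row ID rules above can be sketched as a small resolution function (a hypothetical helper for illustration, not Delta's implementation):

```python
def stable_row_id(materialized_row_id, base_row_id, row_position):
    """Resolve a row's stable Row ID per the rules noted above (sketch).

    The default generated Row ID is baseRowId (from the file's add action)
    plus the row's position within the file; a materialized Row ID stored
    in a column takes precedence when present.
    """
    default_generated = base_row_id + row_position
    if materialized_row_id is not None:
        return materialized_row_id
    return default_generated
```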
Links
Iceberg
Update June 4th, 2024: Databricks acquired Tabular. Delta Lake and Iceberg will probably be merged gradually in the near future.
Integrations / ecosystem
- Trino
- Hive
- DuckDB
- read only as of 2024/08/07
- ClickHouse
Observations / thoughts / questions
- A manifest list is created for each table snapshot.
- Puffin is a file format for storing indexes and statistics of a table
Atomic data commit
- Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. (from Metastore Tables)
- The atomic swap needed to commit new versions of table metadata can be implemented with a check-and-put update of a pointer stored in a metastore or database.
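The check-and-put commit can be sketched as a compare-and-swap on a table's metadata pointer. `MetastorePointer` is a toy stand-in for a real metastore (which would use a database transaction for the same effect):

```python
import threading

class MetastorePointer:
    """Toy metastore holding the current metadata file location per table.

    check_and_put succeeds only if the caller's expected value matches the
    current one, giving the atomic swap Iceberg needs to commit a new
    metadata file. Illustrative sketch, not a real metastore client.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._locations = {}

    def check_and_put(self, table, expected, new):
        with self._lock:
            current = self._locations.get(table)
            if current != expected:
                return False  # concurrent commit won; re-read and retry
            self._locations[table] = new
            return True
```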
Delete format
- Position delete file points to rows by data file path and position within the file
- Equality delete file points to rows by equality on column values
- How to confirm that data files pointed to by a position delete file still exist?
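Applying a position delete file to one data file can be sketched as a filter on row positions. The `(file_path, position)` tuple shape follows the description above; the function name is hypothetical:

```python
def apply_position_deletes(rows, deletes, data_file):
    """Filter the rows of one data file using position delete entries.

    `rows` is the file's rows in order (index = position in the file);
    `deletes` is a list of (file_path, position) entries, possibly
    covering several data files. Sketch of position-delete semantics.
    """
    deleted = {pos for path, pos in deletes if path == data_file}
    return [row for pos, row in enumerate(rows) if pos not in deleted]
```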
Maintenance
Hudi
Integrations / ecosystem
Observations / thoughts / questions
- It seems Hudi is optimized for near-realtime scan and ingest use cases rather than batch processing.
- Development seems most active for Spark and Flink.
- Trino and Hive support only read as of Aug. 2024.
- Supports both copy-on-write and merge-on-read table types
- merge-on-read table is optimized for update and delete heavy workload
- File group of a merge-on-read table comprises columnar base files and row-based delta log files
- Table and query types
- File locations are stored in the files index of the metadata table
- Eliminates expensive file listing operations on DFS/cloud object storage
- Hudi supports a record-level index
- Implemented with the HFile format, which has a B+-tree-like structure
- The index is built on the primary key, i.e. keys must be unique across all partitions within a table
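The merge-on-read read path described above (columnar base file + row-based delta log) can be sketched as a replay of log entries over the base file. The `("upsert"/"delete", key, record)` entry shape is an assumption for illustration; real Hudi log files hold encoded data blocks, not tuples:

```python
def merge_on_read(base_rows, delta_log):
    """Merge a base file with its delta log at query time (sketch).

    `base_rows` maps record key -> record; `delta_log` is an ordered list
    of ("upsert", key, record) / ("delete", key, None) entries, replayed
    in order, as in a merge-on-read file group.
    """
    merged = dict(base_rows)
    for op, key, record in delta_log:
        if op == "delete":
            merged.pop(key, None)
        else:  # upsert
            merged[key] = record
    return merged
```

This replay cost at read time is why merge-on-read favors update/delete-heavy ingest over scan-heavy batch workloads until compaction folds the log back into the base file.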
Concurrency Control
Hudi implements a file level, log based concurrency control protocol on the Hudi timeline, which in-turn relies on bare minimum atomic puts to cloud storage. (from: Lakehouse Concurrency Control: Are we too optimistic?)
Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time. Atomicity is achieved by relying on the atomic puts to the underlying storage to move the write operations through various states in the timeline. (from: Timeline)
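The atomic-put-based state transitions on the timeline can be sketched as create-if-absent of instant files named by instant time, action, and state. The naming below is simplified for illustration and does not match Hudi's exact file layout:

```python
import os

def transition(timeline_dir, instant_time, action, state):
    """Record a timeline state transition by atomically creating a file.

    Each transition (requested -> inflight -> completed) is a separate
    atomic put, so an observer sees each state fully or not at all.
    Completed instants drop the state suffix in this simplified scheme.
    """
    if state == "completed":
        name = f"{instant_time}.{action}"
    else:
        name = f"{instant_time}.{action}.{state}"
    path = os.path.join(timeline_dir, name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # atomic put
    os.close(fd)
    return name
```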
Kudu
Integrations / ecosystem
Tightly integrated with Impala. Has integration with NiFi and Spark.
Observations / thoughts / questions
- Has replication and high availability mechanism by itself.
- Others rely on the reliability of underlying storage, e.g. HDFS or cloud object storage
- Reaches consensus via the Raft algorithm
- It plays some of the roles of a distributed file system rather than being a simple table format.
- Its direction is somewhat similar to Hudi's: it aims at near-realtime scan and ingest use cases.