Parquet
Libraries
- Java
- High-level interfaces such as parquet-hadoop and parquet-avro are tightly coupled with Hadoop.
- If you do not use Hadoop and want to avoid dependency hell, you may need to implement your own Parquet writer on top of the low-level modules parquet-{common,column,encoding}.
- e.g. Iceberg’s Parquet writer, Trino’s Parquet writer
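To give a rough sense of what a Hadoop-free writer ends up owning, here is a minimal sketch of the outermost Parquet file framing (with a plaintext footer): the `PAR1` magic, the data, the footer, a 4-byte little-endian footer length, and the `PAR1` magic again. It uses only the JDK. The class and method names (`ParquetFraming`, `frame`, `footerLength`) are hypothetical, and the row-group and footer bytes are opaque placeholders; a real writer would produce encoded pages and a Thrift-serialized `FileMetaData` instead, so the output here is not a readable Parquet file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ParquetFraming {
    // 4-byte magic that opens and closes a Parquet file with a plaintext footer.
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Wraps already-encoded bytes in the file layout:
    // magic | data | footer | footer length (int32, little-endian) | magic.
    // Both arguments are placeholders (assumption): this sketch does no encoding.
    static byte[] frame(byte[] rowGroupBytes, byte[] footerBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        out.write(rowGroupBytes, 0, rowGroupBytes.length);
        out.write(footerBytes, 0, footerBytes.length);
        byte[] len = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(footerBytes.length)
                .array();
        out.write(len, 0, len.length);
        out.write(MAGIC, 0, MAGIC.length);
        return out.toByteArray();
    }

    // Checks both magics and reads the footer length from the file tail,
    // which is the first thing a reader does to locate the metadata.
    static int footerLength(byte[] file) {
        int n = file.length;
        if (n < 12) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        boolean headOk = Arrays.equals(Arrays.copyOfRange(file, 0, 4), MAGIC);
        boolean tailOk = Arrays.equals(Arrays.copyOfRange(file, n - 4, n), MAGIC);
        if (!headOk || !tailOk) {
            throw new IllegalArgumentException("missing PAR1 magic");
        }
        return ByteBuffer.wrap(file, n - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] file = frame(new byte[] {1, 2, 3}, new byte[] {9, 9, 9, 9, 9});
        System.out.println("file size = " + file.length);
        System.out.println("footer length = " + footerLength(file));
    }
}
```

Everything between the two magics (page encoding, column chunk layout, statistics, the footer itself) is what parquet-{common,column,encoding} help with, and what the Iceberg and Trino writers assemble without the Hadoop-bound layer.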
Links
- Capacitor (BigQuery’s columnar storage format)
- Has the same ancestor as Parquet (Dremel)
- Motivation of Parquet
ORC
- ORC Specification v1
- A v2 specification exists, but it appears to have made no progress since 2018.
- protobuf definition
Links
- An Empirical Evaluation of Columnar Storage Formats (2023)
- Exploiting Cloud Object Storage for High-Performance Analytics (2023)
- Amazon’s Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2 (2024)
- Column Storage For the AI Era (2025)
- Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet (2025)