Playing with DuckDB
Installation
Build
Prerequisite: install cmake
CMAKE_BUILD_PARALLEL_LEVEL=6 make
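For reference, a sketch of the full source build (assuming a fresh checkout of the duckdb repository; adjust the parallelism to your core count):
git clone https://github.com/duckdb/duckdb.git
cd duckdb
CMAKE_BUILD_PARALLEL_LEVEL=6 make
The CLI binary typically ends up under build/release/duckdb.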
TPC-H extension
install 'tpch';
load 'tpch';
call dbgen(sf=1);
pragma show_tables;
| name |
|----------|
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
pragma database_size;
-- in memory
| database_name | database_size | block_size | total_blocks | used_blocks | free_blocks | wal_size | memory_usage | memory_limit |
|---------------|---------------|------------|--------------|-------------|-------------|----------|--------------|--------------|
| memory | 0 bytes | 0 | 0 | 0 | 0 | 0 bytes | 1.4GB | 13.3GB |
-- file
| database_name | database_size | block_size | total_blocks | used_blocks | free_blocks | wal_size | memory_usage | memory_limit |
|---------------|---------------|------------|--------------|-------------|-------------|----------|--------------|--------------|
| tpch_sf1 | 260.3MB | 262144 | 993 | 993 | 0 | 0 bytes | 259.2MB | 13.3GB |
describe region;
| column_name | column_type | null | key | default | extra |
|-------------|-------------|------|-----|---------|-------|
| r_regionkey | INTEGER | NO | | | |
| r_name | VARCHAR | NO | | | |
| r_comment | VARCHAR | NO | | | |
summarize lineitem;
| column_name | column_type | min | max | approx_unique | avg | std | q25 | q50 | q75 | count | null_percentage |
|-----------------|---------------|-------------|---------------------|---------------|---------------------|---------------------|---------|---------|---------|---------|-----------------|
| l_orderkey | INTEGER | 1 | 6000000 | 1508227 | 3000279.604204982 | 1732187.8734803302 | 1526218 | 3009234 | 4504205 | 6001215 | 0.0% |
| l_partkey | INTEGER | 1 | 200000 | 202598 | 100017.98932999402 | 57735.69082650548 | 50076 | 99980 | 150179 | 6001215 | 0.0% |
| l_suppkey | INTEGER | 1 | 10000 | 10061 | 5000.602606138924 | 2886.961998730616 | 2500 | 5000 | 7499 | 6001215 | 0.0% |
| l_linenumber | INTEGER | 1 | 7 | 7 | 3.0005757167506912 | 1.7324314036519408 | 2 | 3 | 4 | 6001215 | 0.0% |
| l_quantity | DECIMAL(15,2) | 1.00 | 50.00 | 50 | 25.507967136654827 | 14.426262537016848 | 13 | 25 | 38 | 6001215 | 0.0% |
| l_extendedprice | DECIMAL(15,2) | 901.00 | 104949.50 | 923139 | 38255.138484656854 | 23300.43871096203 | 18746 | 36718 | 55151 | 6001215 | 0.0% |
| l_discount | DECIMAL(15,2) | 0.00 | 0.10 | 11 | 0.04999943011540163 | 0.03161985510812599 | 0 | 0 | 0 | 6001215 | 0.0% |
| l_tax | DECIMAL(15,2) | 0.00 | 0.08 | 9 | 0.04001350893110812 | 0.02581655179884276 | 0 | 0 | 0 | 6001215 | 0.0% |
| l_returnflag | VARCHAR | A | R | 3 | | | | | | 6001215 | 0.0% |
| l_linestatus | VARCHAR | F | O | 2 | | | | | | 6001215 | 0.0% |
| l_shipdate | DATE | 1992-01-02 | 1998-12-01 | 2516 | | | | | | 6001215 | 0.0% |
| l_commitdate | DATE | 1992-01-31 | 1998-10-31 | 2460 | | | | | | 6001215 | 0.0% |
| l_receiptdate | DATE | 1992-01-04 | 1998-12-31 | 2549 | | | | | | 6001215 | 0.0% |
| l_shipinstruct | VARCHAR | COLLECT COD | TAKE BACK RETURN | 4 | | | | | | 6001215 | 0.0% |
| l_shipmode | VARCHAR | AIR | TRUCK | 7 | | | | | | 6001215 | 0.0% |
| l_comment | VARCHAR | Tiresias | zzle? furiously iro | 3558599 | | | | | | 6001215 | 0.0% |
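The extension can also run the benchmark queries themselves; for example, TPC-H query 1 against the generated data (a quick sketch, assuming the SF1 data above is loaded):
pragma tpch(1);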
Postgres Scanner extension
install postgres;
load postgres;
attach 'postgresql://postgres@localhost/pgbench' as pgbench (type postgres);
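Once attached, the Postgres tables can be queried like local ones. A quick check (a sketch, assuming the standard pgbench schema exists in that database):
select count(*) from pgbench.public.pgbench_accounts;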
Parquet extension
The Parquet extension is built-in.
Local Parquet files
Export
copy lineitem to 'lineitem.parquet' (format 'parquet');
Query
select l_linenumber, count(*) from read_parquet('lineitem.parquet') group by 1;
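read_parquet also accepts globs, so a directory of files can be queried in one go (a sketch, assuming the files live under a lineitem/ directory):
select l_linenumber, count(*) from read_parquet('lineitem/*.parquet') group by 1;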
Parquet files on S3
- Load AWS credentials from environment variables
export AWS_DEFAULT_REGION=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
export AWS_ACCESS_KEY_ID=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${role_name} | jq -r '.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${role_name} | jq -r '.SecretAccessKey')
export AWS_SESSION_TOKEN=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${role_name} | jq -r '.Token')
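One way to hand these values to DuckDB is via the httpfs extension's s3_* settings (a sketch; substitute the real values from the variables above):
install httpfs;
load httpfs;
set s3_region='us-east-1';          -- use the value of AWS_DEFAULT_REGION
set s3_access_key_id='...';         -- AWS_ACCESS_KEY_ID
set s3_secret_access_key='...';     -- AWS_SECRET_ACCESS_KEY
set s3_session_token='...';         -- AWS_SESSION_TOKEN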
The aws extension was introduced in DuckDB 0.9.0. Credentials can be loaded as follows:
CALL load_aws_credentials();
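This assumes the aws extension (and httpfs for the actual S3 I/O) has been installed and loaded first:
install aws;
load aws;
install httpfs;
load httpfs;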
Export
copy lineitem to 's3://${bucket}/lineitem.parquet';
Query
select l_linenumber, count(*) from read_parquet('s3://${bucket}/lineitem.parquet') group by 1;
Tips
CLI
- default mode of the CLI is “duckbox”
D .mode
current output mode: duckbox
- increase the number of rows displayed by duckbox (default = 40)
.maxrows 100
- SQL timer
.timer on
- Read SQL from file
.read query.sql
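These dot-commands also work non-interactively; for example, running a saved query file against a database from the shell (a sketch, assuming the TPC-H data was persisted to tpch_sf1.db):
duckdb tpch_sf1.db < query.sql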
Configuration
- Set number of threads
SET threads TO 1;
- Enable profiling (a sketch for writing the profile to a file follows this list)
PRAGMA enable_profiling='json';
- Set timezone
set TimeZone = 'UTC';
e.g.
D set TimeZone = 'UTC';
D SELECT * FROM duckdb_settings() WHERE name = 'TimeZone';
| name | value | description | input_type |
|----------|-------|-----------------------|------------|
| TimeZone | UTC | The current time zone | VARCHAR |
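For the profiling bullet above, the profile can also be written to a file instead of stdout. A sketch (the setting name is profiling_output; the file name here is arbitrary):
PRAGMA enable_profiling='json';
PRAGMA profiling_output='profile.json';
select count(*) from lineitem;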
Tuning to avoid out-of-memory errors
- OOM in in-memory mode
A query aborts if it consumes more memory than the limit and no temporary directory is specified, e.g.
Error: Out of Memory Error: failed to allocate data of size 2.0MB (13.1GB/13.1GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.
Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'
Set the temporary directory to avoid OOM:
D set temp_directory='duckdb_tmp';
- Use jemalloc to avoid memory fragmentation
See the jemalloc extension documentation for details.
- Set memory_limit conservatively
The default value is 80% of RAM, but there are currently edge cases in which memory consumption may exceed the limit.
From PR #6733:
What happens is that query 21 has a pipeline with more than one join hashtable, which both need to spill to disk. The join hashtables try not to take up all available memory, but are not aware of each others memory usage, and therefore take up more memory than there is available, causing the OOM exception.
memory_limit needs to be configured so that the total memory consumption doesn't exceed the available memory.
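For example, a conservative setup (a sketch; the values are arbitrary and should be sized to what is actually free on the host):
set memory_limit='8GB';
set temp_directory='duckdb_tmp';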