Use Spark with AWS Glue Iceberg REST API and S3 Tables
A quick experiment using Apache Spark to query Iceberg tables that are stored in S3 table buckets and managed by the Glue Data Catalog via the Iceberg REST API.
S3 Tables provides built-in support for the Apache Iceberg format, along with integration with AWS Glue and Lake Formation. When the integration is enabled, tables stored in table buckets are registered in the Glue Data Catalog and become available through the Iceberg REST API.
This post explains how to access Iceberg tables in S3 Tables from Apache Spark via the Glue Iceberg REST API.
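If you don't have a table bucket yet, one can be created with the AWS SDK. Below is a minimal sketch using boto3 (1.35.x or later, which includes S3 Tables support); the bucket name and region are hypothetical examples, and the Glue/Lake Formation integration itself is typically enabled from the S3 console:
import boto3

# create an S3 table bucket (name and region are example values)
s3tables = boto3.client("s3tables", region_name="us-east-1")
response = s3tables.create_table_bucket(name="my-table-bucket")

# the response carries the new table bucket's ARN
print(response.get("arn"))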
Environment
- Run Spark locally in a container using the apache/spark image
Run an interactive PySpark session in a local container
export AWS_REGION=${your_region}
# retrieve and export credentials
eval $(aws configure export-credentials --format env)
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# packages required to use Iceberg and S3
ICEBERG_VERSION="1.9.2"
SPARK_SCALA_VERSION="3.5_2.12"
AWS_SDK_VERSION="2.32.10"
HADOOP_AWS_VERSION="3.3.6"
SPARK_PACKAGES_CONFIG="org.apache.iceberg:iceberg-spark-runtime-${SPARK_SCALA_VERSION}:${ICEBERG_VERSION},software.amazon.awssdk:s3:${AWS_SDK_VERSION},software.amazon.awssdk:sts:${AWS_SDK_VERSION},org.apache.hadoop:hadoop-aws:${HADOOP_AWS_VERSION}"
GLUE_CATALOG_ID="${AWS_ACCOUNT_ID}"
# If the S3 Tables and Glue/Lake Formation integration is enabled, a catalog is created per table bucket:
# GLUE_CATALOG_ID="${AWS_ACCOUNT_ID}:s3tablescatalog/${BUCKET_NAME}"
podman run --rm -it \
--name spark-iceberg-job \
-v ./:/opt/spark/work-dir \
-e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
-e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
-e AWS_SESSION_TOKEN="${AWS_SESSION_TOKEN}" \
-e AWS_REGION="${AWS_REGION}" \
spark:3.5.6-java17-python3 \
/opt/spark/bin/pyspark \
--conf "spark.jars.packages=${SPARK_PACKAGES_CONFIG}" \
--conf "spark.jars.ivy=/opt/spark/work-dir/.ivy" \
--conf "spark.sql.catalog.glue_rest_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.glue_rest_catalog.type=rest" \
--conf "spark.sql.catalog.glue_rest_catalog.warehouse=${GLUE_CATALOG_ID}" \
--conf "spark.sql.catalog.glue_rest_catalog.uri=https://glue.${AWS_REGION}.amazonaws.com/iceberg" \
--conf "spark.sql.catalog.glue_rest_catalog.rest.auth.type=sigv4" \
--conf "spark.sql.catalog.glue_rest_catalog.rest.signing-name=glue" \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.defaultCatalog=glue_rest_catalog"
SQL examples
- list databases
spark.sql("SHOW databases").show()
- list tables
spark.sql("SHOW tables in test_db").show()
- create database
spark.sql("CREATE DATABASE test_db")
- create table
# BUCKET_NAME, DATABASE_NAME, and TABLE_NAME must be defined beforehand
create_table_sql = f"""
CREATE TABLE IF NOT EXISTS test_db.test_tbl (id LONG)
USING iceberg
LOCATION 's3://{BUCKET_NAME}/{DATABASE_NAME}/{TABLE_NAME}'
TBLPROPERTIES ('write.format.default'='parquet')
"""
spark.sql(create_table_sql)
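As a quick end-to-end check, you can write a few rows and read them back; Iceberg also exposes metadata tables such as snapshots. The examples below assume the table created above:
- insert and query data
spark.sql("INSERT INTO test_db.test_tbl VALUES (1), (2), (3)")
spark.sql("SELECT * FROM test_db.test_tbl ORDER BY id").show()
- inspect table snapshots via Iceberg's metadata table
spark.sql("SELECT snapshot_id, committed_at, operation FROM test_db.test_tbl.snapshots").show()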