Use Spark with AWS Glue Iceberg REST API and S3 Tables

A quick experiment using Spark with Iceberg tables that are stored in S3 table buckets and managed by the Glue Data Catalog, accessed via the Iceberg REST API.

S3 Tables provides built-in support for the Apache Iceberg format, and it integrates with Glue and Lake Formation. When the integration is enabled, tables stored in table buckets are registered in the Glue Data Catalog and become available through the Glue Iceberg REST API.
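
If you need a table bucket to experiment with, one can be created together with a namespace using boto3. The sketch below is a minimal example under assumptions: the bucket and namespace names are hypothetical, and enabling the Glue/Lake Formation integration itself is a separate one-time step (for example, from the S3 console).

import boto3

# hypothetical names; pick your own
BUCKET_NAME = "my-table-bucket"
NAMESPACE = "demo_ns"

s3tables = boto3.client("s3tables")

# create the table bucket, then a namespace inside it
bucket = s3tables.create_table_bucket(name=BUCKET_NAME)
s3tables.create_namespace(
    tableBucketARN=bucket["arn"],
    namespace=[NAMESPACE],
)
print(bucket["arn"])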

This post explains how to access the Iceberg tables of S3 Tables from Apache Spark via the Glue Iceberg REST API.

Environment

Run an interactive PySpark session in a local container:

export AWS_REGION=${your_region}
# retrieve and export credentials
eval $(aws configure export-credentials --format env)
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# packages required to use iceberg and S3
ICEBERG_VERSION="1.9.2"
SPARK_SCALA_VERSION="3.5_2.12"
AWS_SDK_VERSION="2.32.10"
HADOOP_AWS_VERSION="3.3.6"
SPARK_PACKAGES_CONFIG="org.apache.iceberg:iceberg-spark-runtime-${SPARK_SCALA_VERSION}:${ICEBERG_VERSION},software.amazon.awssdk:s3:${AWS_SDK_VERSION},software.amazon.awssdk:sts:${AWS_SDK_VERSION},org.apache.hadoop:hadoop-aws:${HADOOP_AWS_VERSION}"

GLUE_CATALOG_ID="${AWS_ACCOUNT_ID}"
# If you enabled the S3 Tables and Glue/Lake Formation integration, a catalog is created per table bucket
# GLUE_CATALOG_ID="${AWS_ACCOUNT_ID}:s3tablescatalog/${BUCKET_NAME}"

podman run --rm -it \
  --name spark-iceberg-job \
  -v ./:/opt/spark/work-dir \
  -e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
  -e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
  -e AWS_SESSION_TOKEN="${AWS_SESSION_TOKEN}" \
  -e AWS_REGION="${AWS_REGION}" \
  spark:3.5.6-java17-python3 \
  /opt/spark/bin/pyspark \
  --conf "spark.jars.packages=${SPARK_PACKAGES_CONFIG}" \
  --conf "spark.jars.ivy=/opt/spark/work-dir/.ivy" \
  --conf "spark.sql.catalog.glue_rest_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.glue_rest_catalog.type=rest" \
  --conf "spark.sql.catalog.glue_rest_catalog.warehouse=${GLUE_CATALOG_ID}" \
  --conf "spark.sql.catalog.glue_rest_catalog.uri=https://glue.${AWS_REGION}.amazonaws.com/iceberg" \
  --conf "spark.sql.catalog.glue_rest_catalog.rest.auth.type=sigv4" \
  --conf "spark.sql.catalog.glue_rest_catalog.rest.signing-name=glue" \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.defaultCatalog=glue_rest_catalog"

SQL examples
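
With spark.sql.defaultCatalog set to glue_rest_catalog, plain SQL statements are routed through the REST catalog. Below is a minimal round trip, assuming the hypothetical names demo_ns and events and a role with the corresponding Glue/Lake Formation permissions:

# create a namespace (backed by a Glue database)
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo_ns")

# create an Iceberg table; S3 Tables manages the storage location
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_ns.events (
        id BIGINT,
        name STRING,
        created_at TIMESTAMP
    ) USING iceberg
""")

# write and read back
spark.sql("INSERT INTO demo_ns.events VALUES (1, 'first', current_timestamp())")
spark.sql("SELECT * FROM demo_ns.events").show()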

Links