Study Apache Iceberg ecosystems in AWS

Written on: July 11, 2025
Last modified on: August 14, 2025

Study notes on the Apache Iceberg ecosystem in AWS.

S3 Tables

S3 Tables is an Amazon S3 feature for storing Apache Iceberg tables. Tables are stored in a dedicated bucket type, the table bucket, which supports IAM-based and resource-based access control and automatic maintenance operations. It was released on 2024/12/03.

Table maintenance

Compaction, snapshot management, and unreferenced file removal are enabled by default. Compaction and snapshot management are configurable per table; unreferenced file removal is configured per table bucket.
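As a rough sketch of how a per-table setting such as compaction could be changed, the helper below builds the keyword arguments for the boto3 `s3tables` client's `put_table_maintenance_configuration` call. The field names reflect my understanding of that API and should be checked against the current boto3 reference; the bucket ARN, namespace, and table name are placeholders.

```python
def compaction_request(table_bucket_arn, namespace, table,
                       target_file_size_mb=512, enabled=True):
    """Build kwargs for s3tables.put_table_maintenance_configuration.

    The request shape is an assumption based on the boto3 documentation.
    """
    return {
        "tableBucketARN": table_bucket_arn,
        "namespace": namespace,
        "name": table,
        "type": "icebergCompaction",
        "value": {
            "status": "enabled" if enabled else "disabled",
            "settings": {
                "icebergCompaction": {"targetFileSizeMB": target_file_size_mb}
            },
        },
    }


request = compaction_request(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",  # placeholder
    "analytics",  # placeholder namespace
    "events",     # placeholder table
    target_file_size_mb=256,
)
# With AWS credentials available, the request could be sent as:
#   boto3.client("s3tables").put_table_maintenance_configuration(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS access; only the commented-out call touches the service.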

Integration with Glue and Lake Formation

S3 table buckets can be integrated with Glue and Lake Formation. When the integration is enabled, a Glue catalog is created per table bucket and Iceberg tables are managed on that catalog.

The integration is enabled by the following steps.

- Register the table buckets with Lake Formation as a data location
- Create a federated catalog in Glue

Resource mapping between S3 Tables and AWS Glue (left is the S3 Tables resource, right is the Glue resource)
- Table bucket = Catalog
- Namespace = Database
- Table = Table
Client configuration to use the Glue Iceberg REST endpoint
- Warehouse location: <accountid>:s3tablescatalog/<table-bucket-name>
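To make the warehouse location concrete, here is a minimal sketch that assembles REST catalog properties for a client such as PyIceberg. The `uri` value and the `rest.*` SigV4 property names are assumptions based on the commonly documented setup for the Glue Iceberg REST endpoint; the account ID, bucket name, and region are placeholders.

```python
def glue_rest_catalog_properties(account_id, table_bucket, region):
    """Return REST catalog properties that point at a table bucket's Glue catalog."""
    return {
        "type": "rest",
        # Assumed endpoint format for the Glue Iceberg REST endpoint.
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Warehouse location format taken from the note above.
        "warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        # Assumed SigV4 signing properties (PyIceberg naming).
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


props = glue_rest_catalog_properties("111122223333", "example-table-bucket", "us-east-1")
# With PyIceberg installed and AWS credentials configured, this could be used as:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("s3tables", **props)
```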

Quotas

Table buckets per region in an AWS account = 10
Namespaces in a table bucket = 10,000
Tables in a table bucket = 10,000

Limitations

Presigned URLs to access objects associated with a table are not supported.
Tags are not supported for table buckets and tables. Therefore, attribute-based access control (ABAC) and tag-based cost allocation are unavailable.

AWS Glue

AWS launched the Glue Iceberg REST endpoint alongside S3 Tables. (The release date is the same as S3 Tables.)

Creating Iceberg tables
- By default, Iceberg v2 tables are created
- Data Catalog doesn’t support creating partitions or adding Iceberg table properties.
Optimizing Iceberg tables
- The same table optimizers as S3 Tables are available
- Column statistics, including the number of distinct values (NDVs), can also be generated

Data Catalog

An AWS account has a default Data Catalog per region
- Catalog ID = account ID
- Only the default catalog is displayed on Glue UI
  - Non-default catalogs are available on API
    - GetCatalogs
  - Also available on Lake Formation UI
Iceberg REST APIs have a free-form prefix. It can be used to logically segment catalogs.
- Prefix and catalog path parameters
- For S3 Tables, catalog ID is <accountid>:s3tablescatalog/<table-bucket-name>
  - S3 Table integration must be enabled on Lake Formation
- For Iceberg tables in regular S3 buckets, prefix / catalog ID is unavailable.
  - All tables are stored in the default Data Catalog (Catalog ID = AWS account ID) (reference: Populating catalog)
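Because non-default catalogs only show up through the API, a small sketch of listing them with `GetCatalogs` may help. The function accepts any object with a `get_catalogs` method, so it runs offline against the stand-in `FakeGlue` class (a hypothetical helper for this example); the response keys `CatalogList`, `CatalogId`, and `NextToken` are assumptions based on the boto3 Glue documentation.

```python
def list_catalog_ids(glue_client, parent_catalog_id=None):
    """Return catalog IDs from GetCatalogs, following NextToken pagination."""
    kwargs = {}
    if parent_catalog_id:
        kwargs["ParentCatalogId"] = parent_catalog_id
    ids = []
    while True:
        page = glue_client.get_catalogs(**kwargs)
        ids.extend(entry["CatalogId"] for entry in page.get("CatalogList", []))
        token = page.get("NextToken")
        if not token:
            return ids
        kwargs["NextToken"] = token


class FakeGlue:
    """Offline stand-in for boto3.client("glue"); returns canned pages in order."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_catalogs(self, **kwargs):
        return self._pages.pop(0)


pages = [
    {"CatalogList": [{"CatalogId": "111122223333:s3tablescatalog"}], "NextToken": "t1"},
    {"CatalogList": [{"CatalogId": "111122223333:s3tablescatalog/example-table-bucket"}]},
]
catalog_ids = list_catalog_ids(FakeGlue(pages))
```

Against a real account, `boto3.client("glue")` would be passed in place of `FakeGlue`.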

Access control

A resource-based policy can be attached to a catalog
- ARNs of data catalog resources
  - Federated catalogs are also catalog resources
Cross-account permissions of Glue and Lake Formation can be used at the same time
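For illustration, the sketch below builds a resource-based policy document that lets another account read one database's metadata. The account IDs, region, and database name are placeholders, and the chosen actions are only an example set, not a recommended policy.

```python
import json


def glue_arn(region, account_id, resource="catalog"):
    """Build a Glue Data Catalog ARN such as arn:aws:glue:<region>:<account>:catalog."""
    return f"arn:aws:glue:{region}:{account_id}:{resource}"


def cross_account_read_policy(region, owner_account_id, consumer_account_id, database):
    """Return a policy document granting read-only metadata access to another account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables"],
                "Resource": [
                    glue_arn(region, owner_account_id),
                    glue_arn(region, owner_account_id, f"database/{database}"),
                    glue_arn(region, owner_account_id, f"table/{database}/*"),
                ],
            }
        ],
    }


policy = cross_account_read_policy("us-east-1", "111122223333", "444455556666", "sales")
policy_json = json.dumps(policy)
# With AWS credentials available, the policy could be attached as:
#   boto3.client("glue").put_resource_policy(PolicyInJson=policy_json)
```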

Quotas

Max databases per region in an AWS account = 10,000
Max tables per region in an AWS account = 1,000,000
Max tables per database = 200,000

AWS Lake Formation (LF)

Lake Formation provides an RDBMS-style permissions model to grant or revoke access to Data Catalog resources.

Permissions model

Lake Formation manages two types of permissions.

Metadata access (Data Catalog permissions)
- Permissions on Data Catalog resources
Underlying data access
- Permissions to read and write data in the S3 locations that Data Catalog resources point to

Lake Formation uses a combination of Lake Formation permissions and IAM permissions. A principal must pass both Lake Formation and IAM permissions checks.

Metadata permissions

By default, all databases and tables have a permission granted to the IAMAllowedPrincipals group
- If this permission exists on a database or table, all principals are granted access to it as long as their IAM policies allow it
- IAMAllowedPrincipals must be removed for granular access control
- IAMAllowedPrincipals is set on new databases and tables by default. The default setting can be modified.
LF-Tag based access control (LF-TBAC) is the recommended way to scale permissions across a large number of resources
- LF-Tags can be assigned to databases and tables, but not to catalogs
Metadata access control
Lake Formation personas and IAM permissions reference
Lake Formation permissions reference
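The two metadata mechanisms above can be sketched as Lake Formation API requests: one that revokes the IAMAllowedPrincipals grant (API identifier `IAM_ALLOWED_PRINCIPALS`) from a table, and one that grants permissions through an LF-Tag expression. The request shapes follow my reading of the boto3 `lakeformation` client; the database, table, tag, and role names are placeholders.

```python
def revoke_iam_allowed_principals(database, table):
    """Build kwargs for lakeformation.revoke_permissions on a single table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["ALL"],
    }


def grant_by_lf_tag(principal_arn, tag_key, tag_values,
                    permissions=("SELECT", "DESCRIBE")):
    """Build kwargs for lakeformation.grant_permissions using an LF-Tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": list(permissions),
    }


revoke_request = revoke_iam_allowed_principals("sales", "orders")
grant_request = grant_by_lf_tag(
    "arn:aws:iam::111122223333:role/analyst",  # placeholder role
    "domain",
    ["sales"],
)
# With AWS credentials available:
#   lf = boto3.client("lakeformation")
#   lf.revoke_permissions(**revoke_request)
#   lf.grant_permissions(**grant_request)
```

With the tag-based grant, every table tagged `domain=sales` becomes accessible to the role without a per-table grant, which is what makes LF-TBAC scale.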

Underlying data access permissions

The following permissions are required to enable principals to read and write underlying data

Register the Amazon S3 locations that contain the data with Lake Formation.
Principals who create Data Catalog tables that point to underlying data locations must have data location permissions.
Principals who read and write underlying data must have Lake Formation data access permissions on the Data Catalog tables that point to the underlying data locations.
Principals who read and write underlying data must have the lakeformation:GetDataAccess IAM permission when the underlying data location is registered with Lake Formation.

The Lake Formation permissions model doesn’t prevent access to Amazon S3 locations through the Amazon S3 API or console if you have access to them through IAM or Amazon S3 policies. You can attach IAM policies to principals to block this access.

(from Underlying data access control)
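The requirements quoted above can be sketched as a set of request and policy builders. This is an illustration under assumptions: the request shapes follow my reading of the boto3 `lakeformation` client, the deny policy is only one possible way to block direct S3 access, and all ARNs, bucket names, and prefixes are placeholders.

```python
def register_location(s3_arn, role_arn):
    """kwargs for lakeformation.register_resource (registers the S3 location)."""
    return {"ResourceArn": s3_arn, "RoleArn": role_arn}


def grant_data_location(principal_arn, s3_arn):
    """kwargs for lakeformation.grant_permissions (data location permission for table creators)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"DataLocation": {"ResourceArn": s3_arn}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


def grant_table_select(principal_arn, database, table):
    """kwargs for lakeformation.grant_permissions (data access permission for readers)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


# IAM statement that readers and writers need when the location is registered.
GET_DATA_ACCESS_STATEMENT = {
    "Effect": "Allow",
    "Action": "lakeformation:GetDataAccess",
    "Resource": "*",
}


def deny_direct_s3_access(bucket, prefix):
    """IAM policy that blocks a principal from reaching the objects through the S3 API."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


location = "arn:aws:s3:::example-data-bucket/warehouse"  # placeholder
reader = "arn:aws:iam::111122223333:role/analyst"        # placeholder
steps = [
    register_location(location, "arn:aws:iam::111122223333:role/lf-registration"),
    grant_data_location("arn:aws:iam::111122223333:role/etl", location),
    grant_table_select(reader, "sales", "orders"),
]
deny_policy = deny_direct_s3_access("example-data-bucket", "warehouse")
```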

Querying and joining tables across multiple accounts is available with cross-account data sharing
AWS Resource Access Manager (RAM) is used to share LF resources
If the grantee (consumer) account is in the same organization as the grantor (provider) account, shared access is available immediately
- Otherwise, RAM sends an invitation to the grantee account to accept or reject the resource grant
Setup required in each consumer account
- At least one user in the consumer account must be a data lake administrator to view shared resources
- The data lake administrator can grant Lake Formation permissions on the shared resources to other principals in the account
The consumer account principals cannot assign new LF-Tags to shared resources
- For fine-grained database-level or table-level access control in the consumer account, only the named resource method is available
Permissions required to access underlying data of shared table
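A cross-account share with the named resource method can be sketched as a single grant whose principal is the receiving account's ID. The request shape follows my reading of the boto3 `lakeformation` client; `owner_account_id` and `recipient_account_id` are hypothetical parameter names, and the helper `needs_ram_invitation` only restates the organization rule from the notes above.

```python
def cross_account_table_grant(owner_account_id, recipient_account_id,
                              database, table, permissions=("SELECT",)):
    """Build kwargs for lakeformation.grant_permissions run in the owning account."""
    return {
        # The principal is the whole recipient account, identified by its account ID.
        "Principal": {"DataLakePrincipalIdentifier": recipient_account_id},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
        # Grant option lets the recipient's data lake administrator re-grant locally.
        "PermissionsWithGrantOption": list(permissions),
    }


def needs_ram_invitation(same_organization):
    """RAM sends an invitation only when the accounts are in different organizations."""
    return not same_organization


share_request = cross_account_table_grant("111122223333", "444455556666", "sales", "orders")
# With AWS credentials for the owning account:
#   boto3.client("lakeformation").grant_permissions(**share_request)
```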

Example steps for cross account data sharing with LF-TBAC

Permissions enforcement

Permissions management workflow
- If the user is authorized, Lake Formation provides temporary access to data
- Creation of tables at a specific S3 location can be blocked by data location permissions

Storage access management

Column-level, row-level, and cell-level filtering are enforced by the integrated service
- Integrated services are trusted to properly enforce Lake Formation permissions (distributed-enforcement)
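To illustrate what enforcement by the integrated service means, here is a toy model of an engine applying a column list and a row predicate to the rows it has read. It is purely conceptual: `apply_filters` and the sample data are hypothetical and do not correspond to any Lake Formation API.

```python
def apply_filters(rows, allowed_columns, row_predicate=lambda row: True):
    """Keep only permitted rows, then project each onto the permitted columns.

    Combining a row predicate with a column list gives cell-level filtering.
    """
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


rows = [
    {"id": 1, "country": "JP", "email": "a@example.com"},
    {"id": 2, "country": "US", "email": "b@example.com"},
]
visible = apply_filters(rows, ["id", "country"], lambda row: row["country"] == "JP")
# visible == [{"id": 1, "country": "JP"}]
```

The point of the model is that the engine sees the full data and is trusted to return only the filtered result.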

Credential vending

Lake Formation can vend scoped-down temporary credentials in the form of AWS STS tokens to registered Amazon S3 locations based on the effective permissions
Credential vending APIs
- GetTemporaryGlueTableCredentials
- GetTemporaryGluePartitionCredentials
- APIs are disabled by default.
  - Third-party query engines must be registered to use them, or full access must be enabled
    - The registered IAM session tag must be set when a third-party query engine calls AssumeRole for the role that is used to call the credential vending APIs.
Credential vending only works with queries that run through the AWS Glue ETL library
Lake Formation credential vending API operations enable a distributed-enforcement model with explicit deny on failure (fail-close)
- Roles and responsibilities
- Snowflake supports use of vended credentials
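The credential vending flow for a third-party engine can be sketched in two steps: assume the role with the registered session tag, then request table credentials. The tag key `LakeFormationAuthorizedCaller` and the request and response fields reflect my understanding of the AWS documentation; the role ARN, table ARN, and tag value are placeholders, and `FakeLakeFormation` is a hypothetical stand-in so the sketch runs offline.

```python
SESSION_TAG_KEY = "LakeFormationAuthorizedCaller"  # assumed tag key for registered engines


def assume_role_request(role_arn, session_tag_value, session_name="query-engine"):
    """Build kwargs for sts.assume_role carrying the registered session tag."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "Tags": [{"Key": SESSION_TAG_KEY, "Value": session_tag_value}],
    }


def vend_table_credentials(lf_client, table_arn, permissions=("SELECT",),
                           duration_seconds=3600):
    """Call GetTemporaryGlueTableCredentials and return only the credential fields."""
    response = lf_client.get_temporary_glue_table_credentials(
        TableArn=table_arn,
        Permissions=list(permissions),
        DurationSeconds=duration_seconds,
        SupportedPermissionTypes=["COLUMN_PERMISSION"],
    )
    return {key: response[key] for key in ("AccessKeyId", "SecretAccessKey", "SessionToken")}


class FakeLakeFormation:
    """Offline stand-in for boto3.client("lakeformation"); records the last request."""

    def __init__(self):
        self.last_request = None

    def get_temporary_glue_table_credentials(self, **kwargs):
        self.last_request = kwargs
        return {
            "AccessKeyId": "EXAMPLEKEY",
            "SecretAccessKey": "example-secret",
            "SessionToken": "example-token",
            "Expiration": "2025-01-01T00:00:00Z",
        }


sts_request = assume_role_request("arn:aws:iam::111122223333:role/engine", "engine1")
fake_lf = FakeLakeFormation()
credentials = vend_table_credentials(
    fake_lf, "arn:aws:glue:us-east-1:111122223333:table/sales/orders"
)
```

In a real setup the assumed-role session would create the `lakeformation` client, and the returned credentials would be used only to read the table's registered S3 location.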

Quotas

Number of registered paths per region in an AWS account = 10,000

Links
