Study Apache Iceberg ecosystems in AWS
Study note about Apache Iceberg ecosystems in AWS.
S3 Tables
S3 Tables supports IAM-based and resource-based access control and automatic maintenance operations for Iceberg tables stored in buckets. S3 Tables is available in S3 table buckets. It was released on 2024/12/03.
Table maintenance
Compaction, snapshot management, and unreferenced file removal are enabled by default. They are configurable per table.
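As a minimal sketch of what per-table configuration can look like, the helper below builds the arguments for the boto3 `s3tables` client's `put_table_maintenance_configuration` call. The ARN, namespace, table name, and the 512 MB target file size are placeholder assumptions, and the request shape should be checked against the current boto3 documentation before use.

```python
def compaction_config_request(table_bucket_arn, namespace, table,
                              enabled=True, target_file_size_mb=512):
    """Build kwargs for s3tables.put_table_maintenance_configuration.

    All argument values are illustrative; only the compaction
    maintenance type is covered here.
    """
    return {
        "tableBucketARN": table_bucket_arn,
        "namespace": namespace,
        "name": table,
        "type": "icebergCompaction",
        "value": {
            "status": "enabled" if enabled else "disabled",
            "settings": {
                "icebergCompaction": {"targetFileSizeMB": target_file_size_mb}
            },
        },
    }


def apply_compaction_config(request):
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("s3tables").put_table_maintenance_configuration(**request)
```

Keeping the request builder separate from the call makes it easy to review the configuration before anything is sent to AWS.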
Integration with Glue and Lake Formation
S3 table buckets can be integrated with Glue and Lake Formation. When the integration is enabled, a Glue catalog is created per table bucket and Iceberg tables are managed on that catalog.
The integration is enabled by the following steps.
- registering buckets to Lake Formation as data location
- creating a federated catalog on Glue
- Resource mapping between S3 Tables and AWS Glue (left is the S3 Tables resource, right is the Glue resource)
- Table bucket = Catalog
- Namespace = Database
- Table = Table
- Client configuration to use Glue Iceberg endpoint
- Warehouse location : <accountid>:s3tablescatalog/<table-bucket-name>
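The client configuration can be sketched as a set of Spark catalog properties. This is a hedged example: the catalog name `s3tables`, the account ID, bucket name, and region are placeholders, and the property names follow the usual Iceberg REST catalog settings with SigV4 signing, which should be verified against the Iceberg and AWS Glue documentation for the versions in use.

```python
def glue_iceberg_rest_spark_conf(account_id, table_bucket, region,
                                 catalog_name="s3tables"):
    """Return Spark properties pointing an Iceberg REST catalog at Glue.

    catalog_name is a hypothetical Spark-side name chosen for this note.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Warehouse location format from the list above
        f"{prefix}.warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


if __name__ == "__main__":
    conf = glue_iceberg_rest_spark_conf("111122223333", "example-bucket", "us-east-1")
    for key, value in conf.items():
        print(f"--conf {key}={value}")
```

The printed `--conf` pairs can be passed to `spark-submit` or set on a `SparkSession` builder.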
Quotas
- Table buckets per region in an AWS account = 10
- Namespaces in a table bucket = 10,000
- Tables in a table bucket = 10,000
Limitations
- Presigned URLs to access objects associated with a table are not supported.
- Tags are not supported for table buckets and tables. Therefore, support for attribute-based access control and tag-based allocation is unavailable.
AWS Glue
AWS launched the Glue Iceberg REST endpoint together with S3 Tables. (The release date is the same as S3 Tables.)
- Creating Iceberg tables
- By default, Iceberg v2 tables are created
- Data Catalog doesn’t support creating partitions or adding Iceberg table properties.
- Optimizing Iceberg tables
- The same table optimizers as S3 Tables are available
- Number of distinct values (NDVs) of columns is also supported
Data Catalog
- An AWS account has a default Data Catalog per region
- Catalog ID = account ID
- Only the default catalog is displayed on Glue UI
- Non-default catalogs are available on API
- Also available on Lake Formation UI
- Iceberg REST APIs have a free-form prefix. It can be used to logically segment catalogs.
- Prefix and catalog path parameters
- For S3 Tables, catalog ID is <accountid>:s3tablescatalog/<table-bucket-name>
- S3 Table integration must be enabled on Lake Formation
- For Iceberg tables in regular S3 buckets, prefix / catalog ID is unavailable.
- All tables are stored in the default Data Catalog (Catalog ID = AWS account ID) (reference: Populating catalog)
Access control
- A resource-based policy can be attached to a catalog
- ARNs of data catalog resources
- Federated catalogs are also catalog resources
- Cross-account permissions of Glue and Lake Formation can be used at the same time
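To make the ARN formats concrete, here is a small sketch that builds ARNs for the catalog, a database, and a table in the default Data Catalog, plus a minimal resource-based policy document that references them. The account IDs, names, and the chosen action are illustrative assumptions, and ARNs for federated catalogs are not covered.

```python
def glue_arn(region, account_id, database=None, table=None):
    """Return the Glue ARN for a catalog, database, or table."""
    base = f"arn:aws:glue:{region}:{account_id}"
    if database is None:
        if table is not None:
            raise ValueError("a table ARN needs a database name")
        return f"{base}:catalog"
    if table is None:
        return f"{base}:database/{database}"
    return f"{base}:table/{database}/{table}"


def read_table_policy(region, owner_account, reader_account, database, table):
    """Build a minimal resource policy letting another account read one table.

    Hypothetical example: the action list is deliberately short.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{reader_account}:root"},
            "Action": ["glue:GetTable"],
            # Table access also involves the parent catalog and database
            "Resource": [
                glue_arn(region, owner_account),
                glue_arn(region, owner_account, database),
                glue_arn(region, owner_account, database, table),
            ],
        }],
    }
```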
Quotas
- Max databases per region in an AWS account = 10,000
- Max tables per region in an AWS account = 1,000,000
- Max tables per database = 200,000
AWS Lake Formation (LF)
Lake Formation provides an RDBMS-style permissions model to grant or revoke access to Data Catalog resources.
Permissions model
Lake Formation manages two types of permissions.
- Metadata access (Data Catalog permissions)
- Permissions on Data Catalog resources
- Underlying data access
- Permissions to read and write data in the S3 locations that Data Catalog resources point to
Lake Formation uses a combination of Lake Formation permissions and IAM permissions. A principal must pass both Lake Formation and IAM permissions checks.
Metadata permissions
- By default, all databases and tables grant permissions to the IAMAllowedPrincipals group
- If this permission exists on a database or table, access is governed by IAM policies alone, so every principal that IAM allows can access the database or table
- IAMAllowedPrincipals must be removed for granular access control
- IAMAllowedPrincipals is set on new databases and tables by default. The default setting can be modified.
- LF-Tag based access control (LF-TBAC) is the best way to scale permissions across a huge number of resources
- LF-Tags can be assigned to databases and tables, but not to catalogs
- Metadata access control
- Lake Formation personas and IAM permissions reference
- Lake Formation permissions reference
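Removing the default group for granular access control can be sketched with the boto3 `lakeformation` client's `revoke_permissions` call. In the API the group is addressed with the identifier `IAM_ALLOWED_PRINCIPALS`, and the `ALL` permission corresponds to what the console shows as Super. The database and table names are placeholders, so treat this as an illustration rather than a complete migration procedure.

```python
def revoke_iam_allowed_principals_request(database, table=None):
    """Build kwargs for lakeformation.revoke_permissions.

    Targets the whole database when table is None, otherwise one table.
    """
    if table is None:
        resource = {"Database": {"Name": database}}
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": resource,
        "Permissions": ["ALL"],
    }


def revoke_iam_allowed_principals(database, table=None):
    """Send the revoke call. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    request = revoke_iam_allowed_principals_request(database, table)
    boto3.client("lakeformation").revoke_permissions(**request)
```

After the revoke, principals need explicit Lake Formation grants, so it is worth granting those first to avoid locking out existing users.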
Underlying data access permissions
The following are required to enable principals to read and write underlying data:
- Register the Amazon S3 locations that contain the data with Lake Formation.
- Principals who create Data Catalog tables that point to underlying data locations must have data location permissions.
- Principals who read and write underlying data must have Lake Formation data access permissions on the Data Catalog tables that point to the underlying data locations.
- Principals who read and write underlying data must have the lakeformation:GetDataAccess IAM permission when the underlying data location is registered with Lake Formation.
The Lake Formation permissions model doesn’t prevent access to Amazon S3 locations through the Amazon S3 API or console if you have access to them through IAM or Amazon S3 policies. You can attach IAM policies to principals to block this access.
(from Underlying data access control)
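Two of the requirements above map directly to API inputs, sketched here under stated assumptions: the arguments for registering an S3 location with the boto3 `lakeformation` client's `register_resource` call, and an IAM policy document that allows `lakeformation:GetDataAccess`. The bucket path is a placeholder, and using the service-linked role is only one of the available registration options.

```python
def register_location_request(s3_arn, role_arn=None):
    """Build kwargs for lakeformation.register_resource.

    Uses the service-linked role unless a custom role ARN is given.
    """
    request = {"ResourceArn": s3_arn}
    if role_arn is None:
        request["UseServiceLinkedRole"] = True
    else:
        request["RoleArn"] = role_arn
    return request


def get_data_access_policy():
    """Return an IAM policy document allowing lakeformation:GetDataAccess."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
        }],
    }
```

The policy only lets a principal ask Lake Formation for data access. Whether access is actually given still depends on the Lake Formation permissions granted on the table.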
Cross account data sharing
- Querying and joining tables across multiple accounts is possible with cross-account data sharing
- AWS Resource Access Manager (RAM) is used to share LF resources
- If the grantee (consumer) account is in the same organization as the grantor (producer) account, shared access is available immediately
- Otherwise, RAM sends an invitation to the grantee account to accept or reject the resource grant
- Setup required in each consumer account
- At least one user in the consumer account must be a data lake administrator to view shared resources
- The data lake administrator can grant Lake Formation permissions on the shared resources to other principals in the account
- The consumer account principals cannot assign new LF-Tags to shared resources
- For fine-grained database- or table-level access control in the consumer account, only the named resource method is available
- Permissions required to access underlying data of shared table
Example steps for cross account data sharing with LF-TBAC
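On the producer side, an LF-TBAC grant to a consumer account can be sketched as a `grant_permissions` request for the boto3 `lakeformation` client. The account ID and the tag key and values are hypothetical, and the LF-Tag is assumed to exist already and to be attached to the tables being shared.

```python
def lf_tag_grant_request(consumer_account_id, tag_key, tag_values,
                         resource_type="TABLE",
                         permissions=("SELECT", "DESCRIBE")):
    """Build kwargs for lakeformation.grant_permissions using an LF-Tag policy.

    The grant option lets the consumer's data lake administrator pass
    the permissions on to principals in the consumer account.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
        "PermissionsWithGrantOption": list(permissions),
    }


def grant_to_consumer(request):
    """Send the grant. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("lakeformation").grant_permissions(**request)
```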
Permissions enforcement
- Permissions management workflow
- If the user is authorized, Lake Formation provides temporary access to data
- Creation of tables at a specific S3 location can be blocked by data location permissions
Storage access management
- Column level, row level and cell level filtering are enforced by the integrated service
- Integrated services are trusted to properly enforce Lake Formation permissions (distributed-enforcement)
Credential vending
- Lake Formation can vend scoped-down temporary credentials in the form of AWS STS tokens to registered Amazon S3 locations based on the effective permissions
- Credential vending APIs
GetTemporaryGlueTableCredentials
GetTemporaryGluePartitionCredentials
- APIs are disabled by default.
- Third party query engines must be registered to use them or full access must be enabled
- Registered IAM session tag must be set when third party query engines call assume role for the role that is used to call credential vending APIs.
- Credential vending only works with queries that run through the AWS Glue ETL library
- Lake Formation credential vending API operations enable a distributed-enforcement model with explicit deny on failure (fail-close)
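The flow for a third-party query engine can be sketched in two steps: assume a role with the registered session tag, then call `GetTemporaryGlueTableCredentials` with that role's credentials. This is a hedged outline using boto3. The role ARN, table ARN, session name, and tag value are placeholders, and `LakeFormationAuthorizedCaller` is used as the session tag key on the assumption that the tag value has been registered in the Lake Formation settings.

```python
def assume_role_request(role_arn, session_tag_value, session_name="query-engine"):
    """Build kwargs for sts.assume_role carrying the registered session tag."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "Tags": [
            {"Key": "LakeFormationAuthorizedCaller", "Value": session_tag_value}
        ],
    }


def table_credentials_request(table_arn, permissions=("SELECT",),
                              duration_seconds=900):
    """Build kwargs for lakeformation.get_temporary_glue_table_credentials."""
    return {
        "TableArn": table_arn,
        "Permissions": list(permissions),
        "DurationSeconds": duration_seconds,
        "SupportedPermissionTypes": ["COLUMN_PERMISSION"],
    }


def vend_table_credentials(role_arn, session_tag_value, table_arn, region):
    """Run both steps. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    sts = boto3.client("sts", region_name=region)
    role = sts.assume_role(**assume_role_request(role_arn, session_tag_value))
    creds = role["Credentials"]
    lakeformation = boto3.client(
        "lakeformation",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # The response holds scoped-down temporary credentials for the table data
    return lakeformation.get_temporary_glue_table_credentials(
        **table_credentials_request(table_arn)
    )
```

Because the model is fail-close, the engine should treat any error from the second call as a denial rather than fall back to its own S3 access.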
Quotas
- Number of registered paths per region in an AWS account = 10,000