[core] Support chain table #6380
Open
Purpose
see: https://cwiki.apache.org/confluence/display/PAIMON/PIP-37%3A+Introduce+Chain+Table
1. Overview
Chain Table is a new feature in Paimon for data warehouses that periodically store full data. It reduces redundant storage and computation by dividing data into delta branches and snapshot branches.
1.1 Motivation
In data warehouse systems, there is a typical scenario: periodically storing full data (e.g., daily or hourly). However, between consecutive time intervals, most of the data is redundant, with only a small amount of newly changed data. Traditional processing methods have the following issues:
Chain Table optimizes through the following approaches:
2. Design Solution
2.1 Configuration Options
2.2 Solution
Add two new branches on top of the warehouse, delta and snapshot, which hold newly changed data and the full data generated by chain compaction, respectively.
2.2.1 Table Structure
2.2.2 Write Strategy
Write data to the corresponding branch based on branch configuration, using partition 20250722 as an example:
2.2.3 Read Strategy
Full Batch Read
Adopt corresponding strategies based on whether the partition exists in the snapshot branch:
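The branch-selection idea can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class `FullBatchReadPlanSketch`, the method `plan`, and the `branch/partition` labels are hypothetical, and the fallback rule (nearest earlier snapshot partition plus the delta partitions after it) is an assumption inferred from the chain compaction description in section 2.2.4. Partitions are assumed to be `yyyyMMdd` strings, so lexicographic order matches date order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class FullBatchReadPlanSketch {

    /**
     * Returns the branch-qualified partitions to read for a full batch read,
     * listed in merge order (oldest first). The returned strings are
     * illustrative labels, not real Paimon paths.
     */
    public static List<String> plan(
            String partition,
            TreeSet<String> snapshotPartitions,
            TreeSet<String> deltaPartitions) {
        List<String> sources = new ArrayList<>();

        // Case 1: the partition already exists in the snapshot branch.
        if (snapshotPartitions.contains(partition)) {
            sources.add("t$branch_snapshot/" + partition);
            return sources;
        }

        // Case 2 (assumed): start from the nearest earlier snapshot partition,
        // if any, then add every delta partition after it up to the target.
        String base = snapshotPartitions.lower(partition);
        NavigableSet<String> deltas;
        if (base != null) {
            sources.add("t$branch_snapshot/" + base);
            deltas = deltaPartitions.subSet(base, false, partition, true);
        } else {
            deltas = deltaPartitions.headSet(partition, true);
        }
        for (String delta : deltas) {
            sources.add("t$branch_delta/" + delta);
        }
        return sources;
    }
}
```

Under these assumptions, a query for 20250724 with only 20250722 present in the snapshot branch would read the 20250722 snapshot partition followed by the 20250723 and 20250724 delta partitions.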
Incremental Batch Read
Read incremental partitions directly from t$branch_delta. For example, a query on partition 20250722 reads only that partition from t$branch_delta.
Stream Read
Read data directly from t$branch_delta.
2.2.4 Chain Compaction
Merge the incremental data of the current cycle with the full data of the previous cycle to generate the full data for the current cycle. For example, the full data for date=20250729 is generated by merging all incremental partitions from 20250723 to 20250729 in t$branch_delta with the full data of 20250722 in t$branch_snapshot.
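The merge described above can be illustrated with a small in-memory sketch. It assumes primary-key semantics where a newer row replaces an older row with the same key. The class `ChainCompactionSketch` and the method `compact` are hypothetical names used only for illustration, rows are simplified to key/value strings, and deletes and Paimon's actual merge engines are not modeled.

```java
import java.util.Map;
import java.util.TreeMap;

public class ChainCompactionSketch {

    /**
     * Builds the full data of the current cycle from the previous full
     * partition and the delta partitions that follow it.
     *
     * @param previousFull rows of the previous snapshot partition, by primary key
     * @param deltas delta partitions keyed by partition date (yyyyMMdd); the
     *     TreeMap iterates them from oldest to newest
     */
    public static Map<String, String> compact(
            Map<String, String> previousFull,
            TreeMap<String, Map<String, String>> deltas) {
        Map<String, String> result = new TreeMap<>(previousFull);
        for (Map<String, String> delta : deltas.values()) {
            // Later partitions are applied last, so their rows win on key conflicts.
            result.putAll(delta);
        }
        return result;
    }
}
```

With the example in the text, `previousFull` would hold the rows of 20250722 from t$branch_snapshot and `deltas` the partitions 20250723 through 20250729 from t$branch_delta; the returned map stands in for the new 20250729 snapshot partition.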
3. Implementation Plan
3.1 Core Class Design
3.1.1 ChainFileStoreTable
Inherits from FallbackReadFileStoreTable, implementing chain table splitting and reading functionality.
3.1.2 ChainTableBatchScan
Implements batch scan logic for chain tables.
3.1.3 ChainTableRead
Implements read logic for chain tables.
3.2 Configuration Implementation
Add chain table related configurations in CoreOptions class:
3.3 Table Factory Modification
Modify FileStoreTableFactory to support chain table creation:
4. Usage Examples
4.1 Create Table
4.2 Create Branches
4.3 Write Data
4.4 Read Data
Tests
API and Format
Documentation