Daft checkpoint design #5868
Replies: 5 comments 10 replies
- Looking forward to discussing our checkpoint implementation with the community! We’ve built v1 of the checkpoint and would love to contribute!
- Great work on the checkpoint proposal! This is a solid step forward for Daft's robustness. Excited to see this feature take shape! I have a few questions about this:
- Is there a mistake here? Should it be
- Why is it necessary to add some
- If needed later, I would be glad to participate in co-construction.
Context
A long-running job may fail and terminate for various reasons (resource limits, an unstable environment, code bugs, etc.). When a failure happens partway through, restarting usually means running the entire workflow from the beginning, so data that was already processed is re-executed. This redundant computation is a huge waste of resources and time.
We therefore propose a checkpoint design that enables "incremental processing": if a previous run terminates after processing and writing only part of the data, subsequent runs skip the already-processed data and complete only the missing part.
Design
Checkpointing in Daft enables incremental processing. Its core principle is to use a primary key (or composite primary key) to filter out rows that have already been processed, so that only new data is processed and appended to the target path.
This is achieved by injecting a filter predicate into the logical plan, immediately after the source node. When a write operation is initiated with a checkpoint_config, Daft first reads the primary keys from the existing data at the destination. This set of primary keys is loaded into memory and distributed across a pool of checkpoint actors. During execution, the injected filter (implemented as a UDF actor) consults these actors to efficiently discard rows whose primary keys already exist. The DataFrame.write_* APIs are extended to accept checkpoint_config as a parameter, which controls the behavior described above.
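To make the filtering step concrete, here is a minimal, self-contained Python sketch of the idea described above. It is not Daft's internal code: the class and function names are hypothetical, the actors are plain in-process objects rather than distributed actors, and it works on lists of keys rather than columnar batches.

```python
# Hypothetical sketch of primary-key filtering with a sharded "actor" pool.
# In the real design these would be distributed checkpoint actors; here they
# are ordinary objects for illustration only.

class CheckpointActor:
    """Holds one shard of the primary keys already written to the target."""

    def __init__(self) -> None:
        self.seen: set = set()

    def load(self, keys) -> None:
        self.seen.update(keys)

    def unseen_mask(self, keys) -> list[bool]:
        """True for keys that have NOT been processed yet."""
        return [k not in self.seen for k in keys]


def build_actor_pool(existing_keys, num_buckets: int) -> list[CheckpointActor]:
    """Shard the destination's existing primary keys across num_buckets actors."""
    actors = [CheckpointActor() for _ in range(num_buckets)]
    for key in existing_keys:
        actors[hash(key) % num_buckets].load([key])
    return actors


def checkpoint_filter(batch_keys, actors) -> list[bool]:
    """The injected filter: keep only rows whose key is absent from the target."""
    num_buckets = len(actors)
    return [
        actors[hash(key) % num_buckets].unseen_mask([key])[0]
        for key in batch_keys
    ]


# Example: keys 1 and 3 already exist at the destination, so only 2 and 4 pass.
pool = build_actor_pool(existing_keys=[1, 3], num_buckets=2)
assert checkpoint_filter([1, 2, 3, 4], pool) == [False, True, False, True]
```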
Planning
Milestone 1: Checkpointing for major and basic scenarios
Status: ✅ Completed
Tasks:
- checkpoint_config parameter: must be a dictionary containing the following keys (see the usage sketch after this list):
  - key_column: The name of the column(s) to use as the primary key / composite primary key.
  - num_buckets (optional): The number of checkpoint actors to create for sharding the primary-key set.
  - num_cpus (optional): The number of CPUs to allocate to each checkpoint actor.
  - batch_size (optional): The batch size of the checkpoint filter operation.
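A usage sketch based on the parameter list above; the paths, column name, and values are illustrative, and the exact accepted keys and write_* signature may differ from the shipped implementation.

```python
import daft

df = daft.read_parquet("s3://bucket/landing/*.parquet")  # illustrative source path

# Proposed checkpoint_config dictionary (keys as listed above).
checkpoint_config = {
    "key_column": ["order_id"],  # primary key; multiple columns form a composite key
    "num_buckets": 8,            # optional: number of checkpoint actors
    "num_cpus": 1,               # optional: CPUs allocated to each checkpoint actor
    "batch_size": 65536,         # optional: rows per checkpoint filter batch
}

# Rows whose primary keys already exist at the destination are filtered out;
# only new rows are written to the target path.
df.write_parquet("s3://bucket/output/", checkpoint_config=checkpoint_config)
```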
Limits in Milestone 1:
Milestone 2: Checkpoint Enhancement.
Status: ⌛️ In Progress
Tasks:
Limits in Milestone 2:
Milestone 3: Checkpoint available for all formats, like Flink CDC
Today: the actor-based filter delivers incremental processing without any external state.
Long term: we could consider a stateful checkpointing mode inspired by Flink CDC / Flink checkpoints.
Tasks:
Limits in Milestone 3:
Benchmark
We conducted a test: reading data from Parquet files and deduplicating all of the data. We compared two methods: anti-join-based deduplication, and the Milestone 1 actor-based checkpoint.
It is observed that this Milestone 1 checkpoint exhibits greater stability compared to anti-join-based deduplication, and can support larger-scale datasets without triggering OOM. This advantage stems from the fact that actor-based checkpointing eliminates the need for costly data shuffling operations.
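For reference, the anti-join baseline looked roughly like the sketch below. The paths and the key column are illustrative, and it assumes the installed Daft version supports how="anti" joins.

```python
import daft

# New data to be written, and the keys already present at the destination
# (paths and the "order_id" key column are illustrative).
new_df = daft.read_parquet("s3://bucket/landing/*.parquet")
existing_keys = daft.read_parquet("s3://bucket/output/*.parquet").select("order_id")

# The anti join keeps only rows whose key is absent from the existing data.
# Both sides must be shuffled/partitioned by the join key, which is the main
# source of memory pressure on large datasets.
deduped = new_df.join(existing_keys, on="order_id", how="anti")
deduped.write_parquet("s3://bucket/output/")
```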
Note about the above dedup benchmark: