@bingogome

What this does

This PR fixes an error in the metadata file mapping.

The old aggregate.py can produce a merged dataset in which the number of video files does not match the number of data files and the metadata maps to the wrong data files. For example:

data/chunk-000/file-000.parquet
videos/observation.images.left/chunk-000/file-000.mp4
videos/observation.images.left/chunk-000/file-001.mp4

Data conversion tools then fail on the merged dataset with an error such as:

"/.../convert_dataset_v30_to_v21.py", line 175, in convert_data
raise FileNotFoundError(f"Expected source parquet file not found: {source_path}")
FileNotFoundError: Expected source parquet file not found: .../data/chunk-000/file-001.parquet
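
A quick way to spot this condition before running a converter is to check that every data file the metadata refers to actually exists. The sketch below is only an illustration: `find_missing_data_files` is a hypothetical helper, not part of aggregate.py or the conversion script, and it assumes the caller has already extracted (chunk index, file index) pairs from the episode metadata. The path layout is taken from the listings in this description.

```python
from pathlib import Path

# Path layout as shown in this PR description.
DATA_PATH = "data/chunk-{chunk:03d}/file-{file:03d}.parquet"


def find_missing_data_files(root, references):
    """Return the relative paths that are referenced but not on disk.

    `references` is an iterable of (chunk_index, file_index) pairs.
    How those pairs are read from the metadata is left to the caller
    (assumption: the metadata stores one such pair per episode).
    """
    root = Path(root)
    missing = set()
    for chunk, file in references:
        rel = DATA_PATH.format(chunk=chunk, file=file)
        if not (root / rel).is_file():
            missing.add(rel)
    return sorted(missing)
```

For the merged dataset above, passing the pairs `(0, 0)` and `(0, 1)` would report `data/chunk-000/file-001.parquet` as missing, which is the file named in the traceback.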

This happens when you merge datasets A and B, where A has

data/chunk-000/file-000.parquet
data/chunk-000/file-001.parquet
videos/observation.images.left/chunk-000/file-000.mp4
videos/observation.images.left/chunk-000/file-001.mp4

and B has

data/chunk-000/file-000.parquet
videos/observation.images.left/chunk-000/file-000.mp4
Copilot AI review requested due to automatic review settings October 20, 2025 17:24

Copilot AI left a comment

Pull Request Overview

This PR fixes a metadata mapping issue in dataset aggregation that caused misalignment between video files and their corresponding data files. The fix introduces a data_mapping dictionary to track the actual destination chunk/file indices when data files are merged, ensuring metadata correctly references the consolidated data files.

Key Changes:

  • Added data mapping tracking to ensure metadata points to correct data file locations after aggregation
  • Modified append_or_create_parquet_file to return destination chunk/file indices
  • Updated update_meta_data to use the mapping instead of simple offset arithmetic
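
The difference between offset arithmetic and an explicit mapping can be sketched with a toy model. This is not the code from the PR: `offset_destinations`, `merge_with_mapping`, the size bookkeeping, and the shape of `data_mapping` are all assumptions made for illustration, and chunk rollover is ignored for brevity.

```python
def offset_destinations(src_files, file_offset):
    """Offset-style guess: destination file index = source index + offset.

    Only correct if every source file becomes its own destination file.
    """
    return {src: (src[0], src[1] + file_offset) for src in src_files}


def merge_with_mapping(src_files, sizes, dst, max_size):
    """Append source files into destination files and record where each lands.

    `dst` is a mutable dict with keys "chunk", "file", "size" describing the
    destination file currently being written (hypothetical structure).
    Returns a dict: (src_chunk, src_file) -> (dst_chunk, dst_file).
    """
    data_mapping = {}
    for src in src_files:
        size = sizes[src]
        # Start a new destination file only when the current one would overflow.
        if dst["size"] > 0 and dst["size"] + size > max_size:
            dst["file"] += 1
            dst["size"] = 0
        dst["size"] += size
        data_mapping[src] = (dst["chunk"], dst["file"])
    return data_mapping


# Dataset A from the description: two small data files, both fit in one output file.
a_files = [(0, 0), (0, 1)]
a_sizes = {(0, 0): 10, (0, 1): 10}
dst = {"chunk": 0, "file": 0, "size": 0}

guessed = offset_destinations(a_files, file_offset=0)
actual = merge_with_mapping(a_files, a_sizes, dst, max_size=100)
```

In this toy run `guessed` sends A's second file to `file-001`, while `actual` records that both files were appended to `file-000`. Metadata built from `guessed` would reference a parquet file that was never written, which matches the failure reported in the PR description.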
