
Conversation


@Coffeempty Coffeempty commented Oct 27, 2025

Fixes #2312
Hey, I took a shot at solving the issue; it turned out to be fairly straightforward.

`StreamingLeRobotDataset` currently assumes parquet files follow the pattern `data/*/*.parquet`.
However, some valid Hub datasets (e.g., `yaak-ai/L2D-v3`) do not use this directory structure, causing
`load_dataset` to fail even though the dataset is fully compatible.

This PR removes that structural assumption and makes dataset loading more flexible.


Here is what I did:

- Adds an optional `data_files=None` parameter to `StreamingLeRobotDataset.__init__()` (see the sketch after this list).
- If `data_files` is not provided, `load_dataset()` automatically detects the parquet files in the repo.
- Allows users to pass custom glob patterns when desired.
- Fully preserves backward compatibility.
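
A minimal sketch of the idea, assuming `load_dataset()` is invoked roughly as below. The real `__init__` in `streaming_dataset.py` takes many more arguments; only the `data_files` handling is shown, and the `revision`/`split` arguments here are illustrative assumptions:

```python
from datasets import load_dataset


class StreamingLeRobotDataset:
    # Sketch only: the actual class has many more parameters and setup steps.
    def __init__(self, repo_id: str, data_files=None, revision=None):
        self.repo_id = repo_id
        # data_files=None lets load_dataset() auto-detect parquet files in the
        # repo instead of assuming the hard-coded data/*/*.parquet layout.
        self.hf_dataset = load_dataset(
            repo_id,
            data_files=data_files,
            revision=revision,
            split="train",
            streaming=True,
        )
```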

Looking forward to review + feedback!

### Summary
This PR resolves a streaming issue where `StreamingLeRobotDataset`
assumed all datasets stored parquet files under `data/*/*.parquet`.
This caused load failures for valid HF datasets with different layouts.

### Changes
- Added a new parameter `data_files=None` to `StreamingLeRobotDataset.__init__()`.
- When `data_files=None`, the underlying `load_dataset()` automatically
  detects parquet files in the repo.
- Users can still specify custom directory patterns if needed (see the example after this list).
- Maintains full backward compatibility.
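
For instance (the glob below is purely illustrative and would need to match the repo's actual layout):

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

# Explicit pattern for repos whose layout is known; omit data_files to let
# load_dataset() auto-detect the parquet files instead.
dataset = StreamingLeRobotDataset(
    "yaak-ai/L2D-v3",
    data_files="data/**/*.parquet",
)
```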

### Result
Now this works as expected:

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("yaak-ai/L2D-v3")
```

### Motivation
This allows streaming from any Hub dataset that contains parquet data,
without requiring a specific folder hierarchy.
@Coffeempty Coffeempty changed the title Update streaming_dataset.py Oct 27, 2025
@sattwik-sahu

Hi, I tried running the streaming dataset feature with this PR, but got the error from #2312 again. I switched to another dataset and the `StreamingLeRobotDataset(...)` call succeeded. However, when I tried to read data from the dataset using

```python
repo_id = "lerobot/metaworld_mt50"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub

print(dataset[0])
```

I got the following error:

```
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 dataset[0]

File ~/sattwik/projects/robot-learning/jepavizor/.venv/lib/python3.12/site-packages/torch/utils/data/dataset.py:59, in Dataset.__getitem__(self, index)
     58 def __getitem__(self, index) -> _T_co:
---> 59     raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")

NotImplementedError: Subclasses of Dataset should implement __getitem__.
```

It seems that not all datasets on the Hub follow the same naming/directory conventions or API conventions, which may be what leads to these inconsistencies?
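
For what it's worth, the `NotImplementedError` above is raised by the base `torch.utils.data.Dataset.__getitem__`, so indexing does not appear to be wired up here. Assuming the streaming dataset is meant to be consumed as an iterable (an assumption, not confirmed in this thread), iterating avoids the indexing path:

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("lerobot/metaworld_mt50")

# Assumes the class implements __iter__ and yields samples lazily from the Hub.
first_sample = next(iter(dataset))
print(first_sample)
```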

