
Conversation


@Coffeempty Coffeempty commented Oct 27, 2025

Fixes #2312
Hey, I took a shot at solving the issue; it turned out to be fairly straightforward.

`StreamingLeRobotDataset` currently assumes parquet files follow the pattern `data/*/*.parquet`.
However, some valid Hub datasets (e.g., `yaak-ai/L2D-v3`) do not use this directory structure, causing
`load_dataset` to fail even though the dataset is fully compatible.

This PR removes that structural assumption and makes dataset loading more flexible.


Here is what I did:

- Adds an optional `data_files=None` parameter to `StreamingLeRobotDataset.__init__()` (see the sketch after this list).
- If `data_files` is not provided, `load_dataset()` automatically detects the parquet files in the repo.
- Allows users to pass custom glob patterns when desired.
- Fully preserves backward compatibility.
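
A minimal sketch of the idea, assuming `load_dataset()` is invoked roughly as below. The real `__init__` in `streaming_dataset.py` takes many more arguments; only the `data_files` handling is shown, and the `revision`/`split` arguments here are illustrative assumptions:

```python
from datasets import load_dataset


class StreamingLeRobotDataset:
    # Sketch only: the actual class has many more parameters and setup steps.
    def __init__(self, repo_id: str, data_files=None, revision=None):
        self.repo_id = repo_id
        # data_files=None lets load_dataset() auto-detect parquet files in the
        # repo instead of assuming the hard-coded data/*/*.parquet layout.
        self.hf_dataset = load_dataset(
            repo_id,
            data_files=data_files,
            revision=revision,
            split="train",
            streaming=True,
        )
```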

Looking forward to review + feedback!

### Summary
This PR resolves a streaming issue where `StreamingLeRobotDataset`
assumed all datasets stored parquet files under `data/*/*.parquet`.
This caused load failures for valid HF datasets with different layouts.

### Changes
- Added a new parameter `data_files=None` to `StreamingLeRobotDataset.__init__()`.
- When `data_files=None`, the underlying `load_dataset()` automatically
  detects parquet files in the repo.
- Users can still specify custom directory patterns if needed (see the example after this list).
- Maintains full backward compatibility.
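
For instance (the glob below is purely illustrative and would need to match the repo's actual layout):

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

# Explicit pattern for repos whose layout is known; omit data_files to let
# load_dataset() auto-detect the parquet files instead.
dataset = StreamingLeRobotDataset(
    "yaak-ai/L2D-v3",
    data_files="data/**/*.parquet",
)
```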

### Result
Now this works as expected:

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("yaak-ai/L2D-v3")
```

### Motivation
This allows streaming from any Hub dataset that contains parquet data,
without requiring a specific folder hierarchy.
@Coffeempty Coffeempty changed the title Update streaming_dataset.py Oct 27, 2025
@sattwik-sahu

Hi, I tried running the streaming dataset feature with this PR, but got the error from #2312 again. I switched to another dataset and the `StreamingLeRobotDataset(...)` call succeeded. However, when I tried to read data from the dataset using

```python
repo_id = "lerobot/metaworld_mt50"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub

print(dataset[0])
```

I got the following error:

```
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 dataset[0]

File ~/sattwik/projects/robot-learning/jepavizor/.venv/lib/python3.12/site-packages/torch/utils/data/dataset.py:59, in Dataset.__getitem__(self, index)
     58 def __getitem__(self, index) -> _T_co:
---> 59     raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")

NotImplementedError: Subclasses of Dataset should implement __getitem__.
```

It seems that not all datasets on the Hub follow the same naming/directory conventions or API conventions, which may be what leads to these inconsistencies?
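
For what it's worth, the `NotImplementedError` above is raised by the base `torch.utils.data.Dataset.__getitem__`, so indexing does not appear to be wired up here. Assuming the streaming dataset is meant to be consumed as an iterable (an assumption, not confirmed in this thread), iterating avoids the indexing path:

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("lerobot/metaworld_mt50")

# Assumes the class implements __iter__ and yields samples lazily from the Hub.
first_sample = next(iter(dataset))
print(first_sample)
```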

