Conversation

haroon0x commented Oct 18, 2025

Description

Integrates synthetic-data-kit into the preprocessing module.
When the user uploads local documents, they go through the separate parsers implemented in the services folder.
Then, when the process endpoint is hit, synthetic data generation begins.

Fixes #72

Changes Made

  • Vendors the synthetic-data-kit source code and integrates it into the DatasetLoader class via the _load_uploaded_dataset method.
  • A new dataset_synthesizer module exposes a synthetic_data_pipeline function that takes the uploaded file and returns curated synthetic QA pairs, generated with Gemini as the judge and driven by the config.yaml file in the synthetic-data-kit directory (see the sketch after this list).
  • For datasets/upload, separate standalone parsers are used for now to produce the sample previews returned in the metadata.
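
A rough sketch of the pipeline shape the dataset_synthesizer bullet describes. Only the synthetic_data_pipeline name, the config.yaml location, and the Gemini-as-judge curation step come from this PR; the stage helpers below are hypothetical stand-ins for the vendored synthetic-data-kit code.

```python
import yaml  # pyyaml


def synthetic_data_pipeline(file_path: str, config_path: str = "synthetic_data_kit/config.yaml") -> list[dict]:
    """Illustrative pipeline shape: parse the document, generate candidate QA
    pairs, then curate them with a judge model. The three stage functions below
    are stand-ins; in the PR they are backed by the vendored synthetic-data-kit."""
    with open(config_path) as f:
        config = yaml.safe_load(f)

    text = _parse_document(file_path)              # PDF/DOCX/HTML/... -> plain text
    candidates = _generate_qa_pairs(text, config)  # LLM-generated QA pairs
    return _curate_with_judge(candidates, config)  # keep pairs the judge (Gemini) rates highly


# --- stand-in stages, for illustration only ---
def _parse_document(file_path: str) -> str:
    with open(file_path, "r", errors="ignore") as f:
        return f.read()


def _generate_qa_pairs(text: str, config: dict) -> list[dict]:
    return [{"question": "What is this document about?", "answer": text[:200]}]


def _curate_with_judge(pairs: list[dict], config: dict) -> list[dict]:
    # "curate.threshold" is an assumed config key, not necessarily the kit's real schema
    threshold = config.get("curate", {}).get("threshold", 7.0)
    return [p for p in pairs if p.get("score", threshold) >= threshold]
```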

Minor improvements

  • Improve dataset metadata for all MIME types in the datasets/upload endpoint
  • Update the local file storage dataset listing function to be cross-platform (see the pathlib sketch below)
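
A minimal sketch of the cross-platform listing idea; the function name is hypothetical, the point is that pathlib avoids hard-coded path separators:

```python
from pathlib import Path


def list_uploaded_datasets(storage_dir: str) -> list[str]:
    """List stored dataset files relative to storage_dir. pathlib handles path
    separators, so the same code works on Windows and POSIX systems."""
    root = Path(storage_dir)
    if not root.is_dir():
        return []
    return sorted(
        str(path.relative_to(root)).replace("\\", "/")  # normalise for display
        for path in root.rglob("*")
        if path.is_file()
    )
```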

Type of Change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

I have tested the two main API endpoints: datasets/upload and datasets/process.

Next Improvements

  • Optimize the vendored synthetic-data-kit and remove unnecessary files from the module.
  • Make dataset_synthesizer more robust.
  • Remove the standalone parsers in the services dir and reuse the synthetic-data-kit parsers for the sample previews in the datasets/upload metadata.
  • Replace the static config.yaml with a Pydantic BaseSettings class for robust, type-safe, and environment-aware configuration management (see the sketch after this list).
- Update parsers to take an in-memory file stream as input instead of a file path
- Update dataset_handler and dataset_loader to integrate the above change
- Improve dataset metadata for content type in the datasets/upload endpoint
- Update local file storage dataset listing to be cross-platform
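
A minimal sketch of the Pydantic BaseSettings idea mentioned above, assuming pydantic-settings (Pydantic v2); the field names and the SYNTH_ prefix are illustrative, not the project's actual configuration schema:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class SynthesisSettings(BaseSettings):
    """Hypothetical replacement for the static config.yaml: values are typed,
    validated, and can be overridden via environment variables or a .env file."""
    model_config = SettingsConfigDict(env_prefix="SYNTH_", env_file=".env")

    gemini_api_key: str = ""
    judge_model: str = "gemini-1.5-flash"
    num_pairs: int = 25              # QA pairs to generate per document
    curation_threshold: float = 7.0  # minimum judge score to keep a pair


settings = SynthesisSettings()  # e.g. SYNTH_GEMINI_API_KEY=... overrides the default
```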
haroon0x changed the title to "[FEAT] Add synthetic dataset generation for local documents to preprocessing pipeline" on Oct 18, 2025
haroon0x (Author) commented Oct 18, 2025

@supreme-gg-gg let me know your feedback after you test it out.

supreme-gg-gg (Contributor)

I will test out the functionality soon, but why did you copy the entire source code of synthetic-data-kit? Is there a reason we cannot install it from PyPI or from their GitHub?

haroon0x (Author)

That is why I asked about it in the issue (#72 (comment)): whether to vendor the source code or use it as a CLI, since using it as a CLI tool means invoking it through subprocess, which hurts performance.
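
To illustrate the trade-off being described: the rejected CLI route would look roughly like the sketch below, spawning a new process per request and exchanging data through the filesystem. The subcommand and output layout are assumptions, not the kit's documented interface.

```python
import json
import subprocess
import tempfile


def generate_via_cli(input_path: str) -> list[dict]:
    """Hypothetical sketch of driving the kit as a CLI: process startup and
    file-based I/O add overhead compared with calling the vendored code
    in-process."""
    out_dir = tempfile.mkdtemp()
    subprocess.run(
        ["synthetic-data-kit", "create", input_path, "--type", "qa", "-o", out_dir],  # command shape assumed
        check=True,
    )
    with open(f"{out_dir}/generated.json") as f:  # output file name assumed
        return json.load(f)
```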

supreme-gg-gg (Contributor) commented Oct 19, 2025

I see. Is it not possible to install synthetic-data-kit and then import it like a module? Or perhaps we can do an editable install from their source directly? If we just copy the source code it will be very difficult to maintain.

haroon0x (Author)

It is not possible to import it as a module. We can do an editable install from their source; I will work on that.

supreme-gg-gg (Contributor)

Sounds good!

supreme-gg-gg (Contributor)

These parsers seem to be identical to the ones in synthetic_data_kit. Can you import them from there directly when using the parsers in dataset_handler?

haroon0x (Author)

The parsers in the services folder use an in-memory file object, whereas the ones in synthetic-data-kit use a file path.

  • The current services/parsers are designed to return a Hugging Face Dataset object, which the rest of the application expects. The synthetic_data_kit/parsers return raw text or a list of dictionaries.

I can work on using a single set of parsers for everything.

supreme-gg-gg (Contributor)

If you can, I believe wrapping their parser is better than copying the code directly, but this is not a big deal; whichever is simpler is fine.
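
A minimal sketch of what such a wrapper could look like, assuming the kit's parsers expose a parse(path) method (an assumption) and that the app keeps returning a Hugging Face Dataset:

```python
import tempfile
from pathlib import Path

from datasets import Dataset


def parse_upload_with_sdk_parser(parser, file_obj, suffix: str) -> Dataset:
    """Spill the in-memory upload to a temporary file so a path-based
    synthetic-data-kit parser can read it, then wrap the extracted text in the
    Hugging Face Dataset object the rest of the app expects."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_obj.read())
        tmp_path = tmp.name
    try:
        text = parser.parse(tmp_path)  # assumed parser interface; adapt to the real one
    finally:
        Path(tmp_path).unlink(missing_ok=True)
    return Dataset.from_dict({"text": [text]})
```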

```python
num_examples = 0

if file_type in self.parsers:
    parser = self.parsers[file_type]
```
supreme-gg-gg (Contributor)

It might not be necessary to parse the uploaded files for samples, because these files may be meant to be used for synthesising samples; for example, they might not contain actual QA pairs directly. Since we're doing synthetic generation right after, I think it's good enough to return the generated samples later instead.

Perhaps we can break this feature down into synthesis + plain upload:

  1. Upload will handle already-formatted samples, usually JSON or JSONL. No complex multimedia parsing is needed, just a direct conversion to a dataset; it's pretty much identical to how the current HF remote dataset flow works, only from a local upload.
  2. Synthesis will be your new endpoint; it handles any non-structured formats and returns the generated samples directly.

Let me know what you think is better for UX.

haroon0x (Author) commented Oct 19, 2025

Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.

So should the synthesis happen in the upload endpoint itself?

haroon0x (Author)

@supreme-gg-gg
Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.

So should the synthesis happen in the upload endpoint itself?

supreme-gg-gg (Contributor)

> @supreme-gg-gg Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.
>
> So should the synthesis happen in the upload endpoint itself?

No, what I meant is that the upload endpoint is separate from synthesis: synthesis takes unstructured documents -> synthetic-data-kit -> HF dataset format, while the upload endpoint takes structured documents and parses them directly into an HF dataset, restricting the upload file types to parquet, json, jsonl, and other common directly convertible formats.
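
A rough sketch of the agreed split, assuming a FastAPI-style app; the endpoint paths follow the discussion above, while everything else (extension list, response shape, the 501 placeholder) is illustrative:

```python
import tempfile
from pathlib import Path

from datasets import load_dataset
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()  # sketch only; the real app/router and response models may differ

STRUCTURED_EXTENSIONS = {".json", ".jsonl", ".parquet", ".csv"}


@app.post("/datasets/upload")
async def upload_dataset(file: UploadFile):
    """Structured files only: convert directly to an HF dataset, mirroring the
    existing remote-HF flow but from a local upload."""
    suffix = Path(file.filename or "").suffix.lower()
    if suffix not in STRUCTURED_EXTENSIONS:
        raise HTTPException(400, "Unstructured documents should go to /datasets/synthesis")
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    builder = "json" if suffix in {".json", ".jsonl"} else suffix.lstrip(".")
    ds = load_dataset(builder, data_files=tmp_path, split="train")
    samples = [ds[i] for i in range(min(3, ds.num_rows))]
    return {"num_examples": ds.num_rows, "samples": samples}


@app.post("/datasets/synthesis")
async def synthesize_dataset(file: UploadFile):
    """Unstructured documents (PDF, DOCX, HTML, ...): run them through the
    synthetic-data-kit pipeline and return the generated QA pairs."""
    # Placeholder: save the upload, call synthetic_data_pipeline(...), return its output.
    raise HTTPException(501, "synthesis not implemented in this sketch")
```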

haroon0x (Author)

Okay, so the upload endpoint is for uploading only structured formats, and the synthesis endpoint turns unstructured formats into an HF dataset.

haroon0x (Author)

The issue is that I have added synthetic-data-kit as a submodule, but when I try to use Gemini in the config there are formatting and other compatibility issues, so I would need to make changes to the source code. That would mean creating a fork of synthetic-data-kit, using it as a submodule, and maintaining that. Should I proceed with this approach, or do you have another suggestion?

supreme-gg-gg (Contributor)

> Okay, so the upload endpoint is for uploading only structured formats, and the synthesis endpoint turns unstructured formats into an HF dataset.

Sounds good to me.

supreme-gg-gg (Contributor) commented Oct 20, 2025

> The issue is that I have added synthetic-data-kit as a submodule, but when I try to use Gemini in the config there are formatting and other compatibility issues, so I would need to make changes to the source code. That would mean creating a fork of synthetic-data-kit, using it as a submodule, and maintaining that. Should I proceed with this approach, or do you have another suggestion?

Does it have to do with this issue: meta-llama/synthetic-data-kit#44?
Yeah, I think forking is fine. I'll create a fork under our organisation, though it would be better if you can fix the library there. Sorry about the complexity, but it makes maintaining the fork easier for us in the long run.

Here is the fork, feel free to submit any PR there and I'll merge them: https://github.com/gemma-facet/synthetic-data-kit

haroon0x (Author) commented Oct 20, 2025

Yeah, it is this issue: meta-llama/synthetic-data-kit#44.
I'll make a PR on the fork. Doing it this way makes sense since it's better for long-term maintenance.

haroon0x (Author)

Created the PR on synthetic-data-kit.

supreme-gg-gg (Contributor)

> Created the PR on synthetic-data-kit.

Great, thanks, I've merged it. I'll test it together with this PR once it's done. If they merge your PR upstream you can use it from the meta-llama source as well, but I don't know how fast that will happen, so we can just use our fork.
