Conversation

haroon0x commented Oct 18, 2025

Description

Integrates synthetic-data-kit into the preprocessing module.
When the user uploads local documents, they go through the separate parsers implemented in the services folder.
Then, when the process endpoint is hit, synthetic data generation begins.

Fixes #72

Changes Made

  • Vendors the synthetic-data-kit source code and integrates it into the DatasetLoader class via the _load_uploaded_dataset method.
  • A new dataset_synthesizer module exposes a synthetic_data_pipeline function that takes the uploaded file and returns curated synthetic QA pairs, generated with Gemini as the judge and driven by the config.yaml file in the synthetic-data-kit directory (see the sketch after this list).
  • For datasets/upload, separate standalone parsers are used for now to produce the sample previews returned in the metadata.
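
A rough sketch of the pipeline shape the dataset_synthesizer bullet describes. Only the synthetic_data_pipeline name, the config.yaml location, and the Gemini-as-judge curation step come from this PR; the stage helpers below are hypothetical stand-ins for the vendored synthetic-data-kit code.

```python
import yaml  # pyyaml


def synthetic_data_pipeline(file_path: str, config_path: str = "synthetic_data_kit/config.yaml") -> list[dict]:
    """Illustrative pipeline shape: parse the document, generate candidate QA
    pairs, then curate them with a judge model. The three stage functions below
    are stand-ins; in the PR they are backed by the vendored synthetic-data-kit."""
    with open(config_path) as f:
        config = yaml.safe_load(f)

    text = _parse_document(file_path)              # PDF/DOCX/HTML/... -> plain text
    candidates = _generate_qa_pairs(text, config)  # LLM-generated QA pairs
    return _curate_with_judge(candidates, config)  # keep pairs the judge (Gemini) rates highly


# --- stand-in stages, for illustration only ---
def _parse_document(file_path: str) -> str:
    with open(file_path, "r", errors="ignore") as f:
        return f.read()


def _generate_qa_pairs(text: str, config: dict) -> list[dict]:
    return [{"question": "What is this document about?", "answer": text[:200]}]


def _curate_with_judge(pairs: list[dict], config: dict) -> list[dict]:
    # "curate.threshold" is an assumed config key, not necessarily the kit's real schema
    threshold = config.get("curate", {}).get("threshold", 7.0)
    return [p for p in pairs if p.get("score", threshold) >= threshold]
```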

Minor improvements

  • Improve dataset metadata for all MIME types in the datasets/upload endpoint
  • Update the local file storage dataset listing function to be cross-platform (see the pathlib sketch below)
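
A minimal sketch of the cross-platform listing idea; the function name is hypothetical, the point is that pathlib avoids hard-coded path separators:

```python
from pathlib import Path


def list_uploaded_datasets(storage_dir: str) -> list[str]:
    """List stored dataset files relative to storage_dir. pathlib handles path
    separators, so the same code works on Windows and POSIX systems."""
    root = Path(storage_dir)
    if not root.is_dir():
        return []
    return sorted(
        str(path.relative_to(root)).replace("\\", "/")  # normalise for display
        for path in root.rglob("*")
        if path.is_file()
    )
```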

Type of Change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

I have tested the two main API endpoints: datasets/upload and datasets/process.

Next Improvements

  • Optimize the vendored synthetic-data-kit and remove unnecessary files from the module.
  • Make dataset_synthesizer more robust.
  • Remove the standalone parsers in the services dir and reuse the synthetic-data-kit parsers for the sample previews in the datasets/upload metadata.
  • Replace the static config.yaml with a Pydantic BaseSettings class for robust, type-safe, and environment-aware configuration management (see the sketch after this list).
- Update parsers to take an in-memory file stream as input instead of a file path
- Update dataset_handler and dataset_loader to integrate the above change
- Improve dataset metadata for content type in the datasets/upload endpoint
- Update local file storage dataset listing to be cross-platform
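
A minimal sketch of the Pydantic BaseSettings idea mentioned above, assuming pydantic-settings (Pydantic v2); the field names and the SYNTH_ prefix are illustrative, not the project's actual configuration schema:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class SynthesisSettings(BaseSettings):
    """Hypothetical replacement for the static config.yaml: values are typed,
    validated, and can be overridden via environment variables or a .env file."""
    model_config = SettingsConfigDict(env_prefix="SYNTH_", env_file=".env")

    gemini_api_key: str = ""
    judge_model: str = "gemini-1.5-flash"
    num_pairs: int = 25              # QA pairs to generate per document
    curation_threshold: float = 7.0  # minimum judge score to keep a pair


settings = SynthesisSettings()  # e.g. SYNTH_GEMINI_API_KEY=... overrides the default
```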
haroon0x changed the title to "[FEAT] Add synthetic dataset generation for local documents to preprocessing pipeline" on Oct 18, 2025
haroon0x (Author) commented Oct 18, 2025

@supreme-gg-gg let me know your feedback after you test it out.

supreme-gg-gg (Contributor)

I will test out the functionality soon, but why did you copy the entire source code of synthetic-data-kit? Is there a reason we cannot install it from PyPI or from their GitHub?

haroon0x (Author)

That is why I asked about it in the issue (#72 (comment)): whether to vendor the source code or use it as a CLI, since using it as a CLI tool means invoking it through subprocess, which hurts performance.
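
To illustrate the trade-off being described: the rejected CLI route would look roughly like the sketch below, spawning a new process per request and exchanging data through the filesystem. The subcommand and output layout are assumptions, not the kit's documented interface.

```python
import json
import subprocess
import tempfile


def generate_via_cli(input_path: str) -> list[dict]:
    """Hypothetical sketch of driving the kit as a CLI: process startup and
    file-based I/O add overhead compared with calling the vendored code
    in-process."""
    out_dir = tempfile.mkdtemp()
    subprocess.run(
        ["synthetic-data-kit", "create", input_path, "--type", "qa", "-o", out_dir],  # command shape assumed
        check=True,
    )
    with open(f"{out_dir}/generated.json") as f:  # output file name assumed
        return json.load(f)
```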

supreme-gg-gg (Contributor) commented Oct 19, 2025

I see. Is it not possible to install synthetic-data-kit and then import it like a module? Or perhaps we can do an editable install from their source directly? If we just copy the source code it will be very difficult to maintain.

haroon0x (Author)

It is not possible to import it as a module. We can do an editable install from their source; I will work on that.

supreme-gg-gg (Contributor)

Sounds good!

supreme-gg-gg (Contributor)

These parsers seem to be identical to the ones in synthetic_data_kit. Can you import them from there directly when using the parsers in dataset_handler?

haroon0x (Author)

The parsers in the services folder use an in-memory file object, whereas the ones in synthetic-data-kit use a file path.

  • The current services/parsers are designed to return a Hugging Face Dataset object, which the rest of the application expects. The synthetic_data_kit/parsers return raw text or a list of dictionaries.

I can work on using a single set of parsers for everything.

supreme-gg-gg (Contributor)

If you can, I believe wrapping their parser is better than copying the code directly, but this is not a big deal; whichever is simpler is fine.
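
A minimal sketch of what such a wrapper could look like, assuming the kit's parsers expose a parse(path) method (an assumption) and that the app keeps returning a Hugging Face Dataset:

```python
import tempfile
from pathlib import Path

from datasets import Dataset


def parse_upload_with_sdk_parser(parser, file_obj, suffix: str) -> Dataset:
    """Spill the in-memory upload to a temporary file so a path-based
    synthetic-data-kit parser can read it, then wrap the extracted text in the
    Hugging Face Dataset object the rest of the app expects."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_obj.read())
        tmp_path = tmp.name
    try:
        text = parser.parse(tmp_path)  # assumed parser interface; adapt to the real one
    finally:
        Path(tmp_path).unlink(missing_ok=True)
    return Dataset.from_dict({"text": [text]})
```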

```python
num_examples = 0

if file_type in self.parsers:
    parser = self.parsers[file_type]
```
supreme-gg-gg (Contributor)

It might not be necessary to parse the uploaded files for samples, because these files may be meant to be used for synthesising samples; for example, they might not contain actual QA pairs directly. Since we're doing synthetic generation right after, I think it's good enough to return the generated samples later instead.

Perhaps we can break this feature down into synthesis + plain upload:

  1. Upload will handle already-formatted samples, usually JSON or JSONL. No complex multimedia parsing is needed, just a direct conversion to a dataset; it's pretty much identical to how the current HF remote dataset flow works, only from a local upload.
  2. Synthesis will be your new endpoint; it handles any non-structured formats and returns the generated samples directly.

Let me know what you think is better for UX.

haroon0x (Author) commented Oct 19, 2025

Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.

So should the synthesis happen in the upload endpoint itself?

haroon0x (Author)

@supreme-gg-gg
Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.

So should the synthesis happen in the upload endpoint itself?

supreme-gg-gg (Contributor)

> @supreme-gg-gg Are you suggesting running the synthetic data generation in the upload endpoint itself and then showing the samples? I will make a synthesis endpoint that takes the non-structured formats and returns the dataset in HF Dataset format.
>
> So should the synthesis happen in the upload endpoint itself?

No, what I meant is that the upload endpoint is separate from synthesis: synthesis takes unstructured documents -> synthetic-data-kit -> HF dataset format, while the upload endpoint takes structured documents and parses them directly into an HF dataset, restricting the upload file types to parquet, json, jsonl, and other common directly convertible formats.
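
A rough sketch of the agreed split, assuming a FastAPI-style app; the endpoint paths follow the discussion above, while everything else (extension list, response shape, the 501 placeholder) is illustrative:

```python
import tempfile
from pathlib import Path

from datasets import load_dataset
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()  # sketch only; the real app/router and response models may differ

STRUCTURED_EXTENSIONS = {".json", ".jsonl", ".parquet", ".csv"}


@app.post("/datasets/upload")
async def upload_dataset(file: UploadFile):
    """Structured files only: convert directly to an HF dataset, mirroring the
    existing remote-HF flow but from a local upload."""
    suffix = Path(file.filename or "").suffix.lower()
    if suffix not in STRUCTURED_EXTENSIONS:
        raise HTTPException(400, "Unstructured documents should go to /datasets/synthesis")
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    builder = "json" if suffix in {".json", ".jsonl"} else suffix.lstrip(".")
    ds = load_dataset(builder, data_files=tmp_path, split="train")
    samples = [ds[i] for i in range(min(3, ds.num_rows))]
    return {"num_examples": ds.num_rows, "samples": samples}


@app.post("/datasets/synthesis")
async def synthesize_dataset(file: UploadFile):
    """Unstructured documents (PDF, DOCX, HTML, ...): run them through the
    synthetic-data-kit pipeline and return the generated QA pairs."""
    # Placeholder: save the upload, call synthetic_data_pipeline(...), return its output.
    raise HTTPException(501, "synthesis not implemented in this sketch")
```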

haroon0x (Author)

Okay, so the upload endpoint is for uploading only structured formats, and the synthesis endpoint turns unstructured formats into an HF dataset.

haroon0x (Author)

The issue is that I have added synthetic-data-kit as a submodule, but when I try to use Gemini in the config there are formatting and other compatibility issues, so I would need to make changes to the source code. That would mean creating a fork of synthetic-data-kit, using it as a submodule, and maintaining that. Should I proceed with this approach, or do you have another suggestion?

supreme-gg-gg (Contributor)

> Okay, so the upload endpoint is for uploading only structured formats, and the synthesis endpoint turns unstructured formats into an HF dataset.

Sounds good to me.

supreme-gg-gg (Contributor) commented Oct 20, 2025

> The issue is that I have added synthetic-data-kit as a submodule, but when I try to use Gemini in the config there are formatting and other compatibility issues, so I would need to make changes to the source code. That would mean creating a fork of synthetic-data-kit, using it as a submodule, and maintaining that. Should I proceed with this approach, or do you have another suggestion?

Does it have to do with this issue: meta-llama/synthetic-data-kit#44?
Yeah, I think forking is fine. I'll create a fork under our organisation, though it would be better if you can fix the library there. Sorry about the complexity, but it makes maintaining the fork easier for us in the long run.

Here is the fork, feel free to submit any PR there and I'll merge them: https://github.com/gemma-facet/synthetic-data-kit

haroon0x (Author) commented Oct 20, 2025

Yeah, it is this issue: meta-llama/synthetic-data-kit#44.
I'll make a PR on the fork. Doing it this way makes sense since it's better for long-term maintenance.

haroon0x (Author)

Created the PR on synthetic-data-kit.

supreme-gg-gg (Contributor)

> Created the PR on synthetic-data-kit.

Great, thanks, I've merged it. I'll test it together with this PR once it's done. If they merge your PR upstream you can use it from the meta-llama source as well, but I don't know how fast that will happen, so we can just use our fork.
