[FEAT] Integrate synthetic dataset generation for local documents into preprocessing pipeline #79
Conversation
- Update parsers to take an in-memory file stream as input instead of a file path
- Update dataset_handler and dataset_loader to integrate the above change
- Improve dataset metadata for content type in the datasets/upload endpoint
- Update local file storage dataset listing to be cross-platform
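For illustration only, a minimal sketch of what a stream-based parser in the services folder could look like (the class name, the JSONL focus, and the column layout are assumptions, not the actual implementation):

```python
import json
from typing import BinaryIO

from datasets import Dataset


class JSONLStreamParser:
    """Parse an uploaded JSONL file from an in-memory stream, not a file path."""

    def parse(self, file_stream: BinaryIO) -> Dataset:
        # Decode the raw bytes and keep one record per non-empty line
        records = [
            json.loads(line)
            for line in file_stream.read().decode("utf-8").splitlines()
            if line.strip()
        ]
        return Dataset.from_list(records)
```

An upload handler could then pass something like `BytesIO(await file.read())` straight to `parse` without ever writing the upload to disk.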
@supreme-gg-gg let me know the feedback after you test it out.
I will test out the functionality soon, but why did you copy the entire source code for synthetic-data-kit? Is there a reason we cannot install it from PyPI or from their GitHub?
That is why I asked about it in the issue (#72 (comment)): whether to use the source code or use it as a CLI. Using it as a CLI tool means spawning a subprocess, which decreases performance.
I see. Is it not possible to install synthetic-data-kit and then import it like a module? Or perhaps we can do an editable install from their source directly? If we just copy the source code it will be very difficult to maintain.
It is not possible to import it as a module, but we can do an editable install from their source. I will work on that.
Sounds good!
These parsers seem to be identical to the ones in synthetic_data_kit. Can you import from there directly when using the parsers in dataset_handler?
The parsers in the services folder use an in-memory file object, whereas the ones in the synthetic data kit use a file path.
- The current services/parsers are designed to return a Hugging Face Dataset object, which the rest of the application expects. The synthetic_data_kit/parsers return raw text or a list of dictionaries.
I can work on consolidating them so that just one set of parsers is used for everything.
If you can, I believe wrapping their parser is better than copying the code directly, but this is not a big deal; whichever is simpler is fine.
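If the wrapping route is taken, a minimal sketch of the idea, assuming the kit's parsers expose a path-based `parse()` method that returns raw text (that signature is an assumption about the kit's interface, not confirmed here):

```python
import tempfile
from pathlib import Path
from typing import BinaryIO

from datasets import Dataset


def parse_stream_with_kit(parser, file_stream: BinaryIO, suffix: str) -> Dataset:
    """Bridge an in-memory upload to a path-based synthetic-data-kit parser."""
    # Spill the stream to a temporary file so the path-based parser can read it
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_stream.read())
        tmp_path = Path(tmp.name)
    try:
        text = parser.parse(str(tmp_path))          # assumed kit interface
        return Dataset.from_list([{"text": text}])  # shape expected downstream
    finally:
        tmp_path.unlink(missing_ok=True)
```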
Diff context under review:

```python
num_examples = 0
...
if file_type in self.parsers:
    parser = self.parsers[file_type]
```
It might not be necessary to parse the uploaded files for samples, because these files could be meant to be used to synthesise samples; for example, they might not contain actual QA pairs directly. Since we're doing synthetic generation directly afterwards, I think it's good enough to return the generated samples later instead.
Perhaps we can break this feature down into synthesis + plain upload:
- Upload will handle already-formatted samples, usually JSON and JSONL, so there is no need for any complex multimedia parsing; convert directly to a dataset, pretty much identical to how the current HF remote dataset flow works, just from a local upload
- Synthesis will be your new endpoint; it handles any non-structured formats and returns the generated samples directly
Let me know what you think is better for UX.
Are you suggesting doing the synthetic data generation in the upload endpoint itself and then showing the samples?
I will make a synthesis endpoint which takes the non-structured format and returns the dataset in HF Dataset format.
So should the synthesis happen in the upload endpoint itself?
@supreme-gg-gg so should the synthesis happen in the upload endpoint itself?
No, what I meant is that the upload endpoint is separate from synthesis: synthesis takes unstructured documents -> synthetic kit -> HF dataset format, while the upload endpoint takes structured documents and parses them directly into a HF dataset, restricting the upload file type to parquet, JSON, JSONL, or whatever common directly convertible types.
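For illustration, a rough sketch of that split, assuming a FastAPI-style service; the route paths, response shape, and temp-file handling are assumptions, and only the structured-upload vs. synthesis division comes from the discussion:

```python
import tempfile
from pathlib import Path

from datasets import load_dataset
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
# Structured formats the upload endpoint accepts directly, mapped to HF dataset builders
STRUCTURED = {"json": "json", "jsonl": "json", "parquet": "parquet", "csv": "csv"}


@app.post("/datasets/upload")
async def upload_dataset(file: UploadFile):
    """Structured files only: convert straight to a HF Dataset, no document parsing."""
    ext = Path(file.filename or "").suffix.lstrip(".").lower()
    if ext not in STRUCTURED:
        raise HTTPException(400, "Unstructured documents should go to /datasets/synthesis")
    with tempfile.NamedTemporaryFile(suffix=f".{ext}", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = Path(tmp.name)
    try:
        ds = load_dataset(STRUCTURED[ext], data_files=str(tmp_path), split="train")
        return {"num_examples": ds.num_rows, "columns": ds.column_names}
    finally:
        tmp_path.unlink(missing_ok=True)


@app.post("/datasets/synthesis")
async def synthesize_dataset(file: UploadFile):
    """Unstructured documents: file -> synthetic-data-kit generation/curation -> HF Dataset."""
    raw_bytes = await file.read()
    # The synthetic-data-kit pipeline (generate QA pairs, curate with the judge model)
    # would run on raw_bytes here; its wiring is PR-specific, so it is not shown.
    raise HTTPException(501, f"synthesis of {file.filename} not implemented in this sketch")
```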
Okay, so the upload endpoint is only for uploading structured formats, and the synthesis endpoint converts unstructured formats to a HF dataset.
The issue is that I have added synthetic-data-kit as a submodule, but when I try to use Gemini in the config there are some formatting and other compatibility issues, so I would need to make changes in the source code. That would mean creating a fork of synthetic-data-kit to use as a submodule and then maintaining that fork. Should I proceed with this approach, or do you have any other suggestion?
Sounds good to me.
Does it have to do with this issue: meta-llama/synthetic-data-kit#44? Here is the fork, feel free to submit any PRs there and I'll merge them: https://github.com/gemma-facet/synthetic-data-kit
Yeah, it is this issue: meta-llama/synthetic-data-kit#44
Created the PR on synthetic-data-kit.
Great, thanks. I've merged it, and I'll test it together with this PR once it's done. If they merge your PR upstream you can use the meta-llama source as well, but I don't know how fast that will happen, so we can just use our fork.
Description
Integrates synthetic-data-kit into the preprocessing module.
When the user uploads local documents, they go through the separate parsers implemented in the services folder.
Then, when the process endpoint is hit, the synthetic data generation begins.
Fixes #72
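A hedged sketch of the flow described above; `generate_qa_pairs` and `judge_with_gemini` are hypothetical stand-ins for the generation and Gemini-as-judge curation steps that actually live in the synthetic-data-kit fork and its `config.yaml`:

```python
from datasets import Dataset


def generate_qa_pairs(text: str) -> list[dict]:
    """Hypothetical stand-in for the synthetic-data-kit QA generation step."""
    raise NotImplementedError


def judge_with_gemini(pairs: list[dict]) -> list[dict]:
    """Hypothetical stand-in for curation with Gemini as the judge model."""
    raise NotImplementedError


def synthetic_data_pipeline(raw_text: str) -> Dataset:
    """Document text -> generated QA pairs -> judged/curated pairs -> HF Dataset."""
    pairs = generate_qa_pairs(raw_text)
    curated = judge_with_gemini(pairs)
    return Dataset.from_list(curated)
```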
Changes Made
- Updated the `DatasetLoader` class in the `_load_uploaded_dataset` method
- Added a `dataset_synthesizer` file; its `synthetic_data_pipeline` function takes the file and returns the curated synthetic QA pairs generated using Gemini as the judge, based on the `config.yaml` file in the synthetic data kit dir
- In `datasets/upload`, separate standalone parsers are used for now to output the samples in the metadata to be returned
- Minor improvements to the `datasets/upload` endpoint

Type of Change
How Has This Been Tested?
I have tested the 2 API endpoints, mainly `datasets/upload` and `datasets/process`.

Next Improvements
- `datasets/upload` metadata