Skip to content

Conversation

@GTimothee
Copy link
Contributor

@GTimothee GTimothee commented Apr 18, 2025

Description

  • Adds save and load methods to KnowledgeBase
  • Adds save and load methods to KnowledgeBase using hugginface hub
  • Adds card template for the newly created dataset on the hub
  • Adds a _config property to KnowledgeBase
  • Adds a list of file names to be saved to the KnowledgeBase class

Related Issue

#2145

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

It is an improvement as it avoids embeddings to be recomputed.

It is a new feature because we can now save/load to/from disk and hugginface hub.

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.
  • I've updated the pdm.lock running pdm update-lock (only applicable when pyproject.toml has been
    modified)
@GTimothee
Copy link
Contributor Author

GTimothee commented Apr 18, 2025

I still need to write the tests, but I am sharing the PR already to get feedbacks on the implementation.

Little note: You will see that in the load methods you can select another llm or embedding client than the one that was used to generate the KnowledgeBase. I added this purely to add flexibility and improve developer experience, but it was not necessary nor asked in the issue.

@GTimothee
Copy link
Contributor Author

GTimothee commented Apr 18, 2025

TODO

  • fix test push to hub
  • fix test load from hub
  • add documentation to docs/
  • make pre commit checks pass
Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @GTimothee, this already looks nice. I left some comments to improve some things..

Hugging Face token for authentication. If None, will use local token.
private : bool
Whether to make the repo private or public.
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also pass kwargs for the push to hub/upload file etc?

Comment on lines 401 to 409
kb = cls(
data=data,
columns=config.get("columns"),
llm_client=llm_client,
embedding_model=embedding_model,
chunk_size=config.get("chunk_size", 2048),
seed=config.get("seed"),
min_topic_size=config.get("min_topic_size"),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this already create the embeddings during initialisation?

GTimothee and others added 5 commits May 29, 2025 18:31
Co-authored-by: David Berenstein <david.m.berenstein@gmail.com>
Co-authored-by: David Berenstein <david.m.berenstein@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants