@sabaul sabaul commented Oct 23, 2025

Add file metadata when embedding files through the Knowledge Base Source Classes

  • Add per-chunk metadata in all KnowledgeSources (filepath, chunk_index, source_type)
  • Backwards compatible storage with _coerce_to_records
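As a rough illustration of the two bullets above, here is a minimal sketch of what per-chunk metadata and a backwards-compatible coercion step could look like. The record shape (`content` plus `metadata`) and both function names are assumptions made for this example; the PR's actual `_coerce_to_records` is not shown in this thread and may differ.

```python
from typing import Any


def build_chunk_records(
    filepath: str, chunks: list[str], source_type: str
) -> list[dict[str, Any]]:
    """Attach filepath, chunk_index and source_type to every chunk of one file."""
    return [
        {
            "content": chunk,
            "metadata": {
                "filepath": filepath,
                "chunk_index": index,
                "source_type": source_type,
            },
        }
        for index, chunk in enumerate(chunks)
    ]


def coerce_to_records_sketch(documents: list[Any]) -> list[dict[str, Any]]:
    """Hypothetical stand-in: accept legacy plain strings as well as dict records."""
    records: list[dict[str, Any]] = []
    for doc in documents:
        if isinstance(doc, str):
            # Legacy callers pass bare strings; give them empty metadata.
            records.append({"content": doc, "metadata": {}})
        elif isinstance(doc, dict) and "content" in doc:
            records.append(
                {"content": doc["content"], "metadata": dict(doc.get("metadata") or {})}
            )
        else:
            raise TypeError(f"Unsupported document type: {type(doc).__name__}")
    return records
```

With this shape, old callers that save a list of strings and new callers that save records with metadata can share the same storage path.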

Why this is needed

  • If we persist the embeddings by creating a child class of KnowledgeStorage.py (the one-time embedding approach from the knowledge base best practices), it is very difficult to recover any file-level information.
  • This change attaches file-level metadata to each embedding, which can later be used for ChromaDB-level queries and, if required, embedding deletion.
  • As a result, a file can be deleted together with its corresponding embeddings.

Related to #3812


cursor bot commented Oct 23, 2025

This PR is being reviewed by Cursor Bugbot


cursor bot commented Oct 23, 2025

Bug: Metadata Corruption in Document Conversion

The zip call in add() uses strict=False, which can silently misalign filepath metadata. If document conversion yields fewer documents than input files, chunks may be incorrectly associated with file paths, corrupting the intended file-level metadata.


cursor bot commented Oct 23, 2025

Bug: Empty String Handling Inconsistency

The _coerce_to_records function skips dictionary-based documents with empty string content due to the if not content: continue check. This creates an inconsistency with string-based inputs, which accept empty content, and can lead to silent data loss, breaking backward compatibility.
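The difference comes down to a truthiness check versus an explicit `None` check. The following sketch contrasts the two on a single dict record; both function names and the record shape are hypothetical, and the fix actually applied in the PR may look different.

```python
from typing import Any, Optional


def coerce_dict_truthy(doc: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Reported behaviour: `not content` also rejects the empty string."""
    content = doc.get("content")
    if not content:
        return None
    return {"content": content, "metadata": doc.get("metadata", {})}


def coerce_dict_explicit(doc: dict[str, Any]) -> Optional[dict[str, Any]]:
    """One possible fix: skip only when content is genuinely missing."""
    content = doc.get("content")
    if content is None:
        return None
    return {"content": content, "metadata": doc.get("metadata", {})}
```

With the explicit check, a dict record whose content is `""` is kept, matching how an empty plain string is treated.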


…e by one in docling, fixed empty string issue in _coerce_to_records

cursor bot commented Oct 23, 2025

Bug: Incomplete Flag Causes Dynamic Import Error

The DOCLING_AVAILABLE flag is incomplete, as it doesn't account for DocumentConverter. This allows an ImportError to occur during model_post_init when dynamically importing DocumentConverter, even if DOCLING_AVAILABLE is True.
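One way to keep such a flag honest is to compute it from every symbol that is needed later, so that nothing is imported dynamically outside the guarded check. The sketch below uses a hypothetical helper, `symbols_available`; the module path `docling.document_converter` is where `DocumentConverter` is normally imported from, but the full set of names this PR depends on is an assumption.

```python
import importlib


def symbols_available(requirements: dict[str, list[str]]) -> bool:
    """Return True only if every module imports and exposes every listed name."""
    for module_name, names in requirements.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            return False
        if not all(hasattr(module, name) for name in names):
            return False
    return True


# The flag now covers DocumentConverter too, so code guarded by it
# cannot hit an ImportError for that name later on.
DOCLING_AVAILABLE = symbols_available(
    {"docling.document_converter": ["DocumentConverter"]}
)
```

Any further docling names used in `model_post_init` would need to be added to the same mapping for the flag to stay complete.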


@sabaul sabaul changed the title Metadata addition to embeddings created using Knowledge Base Source Classes Oct 23, 2025