Cosine distances for document retrieval #120

@schumannc

Description

Current Behavior:
The function get_most_relevant_contents_from_message fetches documents from a database, ordered by their L2 score (Euclidean distance) to the customer's request, and returns at most the top N. The L2 score is unbounded, and its scale depends heavily on the embeddings being compared. With the current filter (L2 score less than 1), a document can pass on almost any query, even when it has no semantic relevance to the request. This bypasses the fallback mechanisms that are meant to activate when the database contains no meaningful answer.

Issue:
The core problem is that L2 distance is unbounded, so a fixed threshold does not mean the same thing across different embeddings. Documents that do not semantically match the customer's request can pass the filter, which prevents the system from falling back when no genuinely relevant answer exists in the database.
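
To illustrate the point, here is a minimal sketch (not the project's code) using toy 2-dimensional vectors as stand-ins for real embeddings. The helper names `l2_distance` and `cosine_distance` are assumptions made for this example. It shows that two orthogonal vectors, which share no direction at all, can still fall under an L2 threshold of 1 once their magnitudes are small, while their cosine distance is unaffected by scale.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: depends on both direction and magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors (hypothetical): the document is orthogonal to the query.
query = [1.0, 0.0]
doc = [0.0, 1.0]

# The same two directions, with smaller magnitudes.
small_query = [0.1 * x for x in query]
small_doc = [0.1 * x for x in doc]

l2_full = l2_distance(query, doc)               # above 1: filtered out
l2_small = l2_distance(small_query, small_doc)  # below 1: passes the filter
cos_full = cosine_distance(query, doc)
cos_small = cosine_distance(small_query, small_doc)

print(l2_full, l2_small, cos_full, cos_small)
```

Whether this happens in practice depends on the embedding model. If the embeddings are already normalized to unit length, L2 and cosine distance rank documents in the same order, and only the threshold scale differs.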

Suggested Improvement:
Switch the similarity metric from L2 (Euclidean distance) to cosine distance, which is bounded: 0 to 2 in general, and 0 to 1 when the cosine similarity between embeddings is never negative. This puts all comparisons on a common scale, so document similarity scores are consistently interpretable. In addition, make the similarity threshold a configurable parameter, so that retrieval can be tuned for specific needs or contexts.
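
A rough sketch of what the suggestion could look like, under stated assumptions: the function name `retrieve_relevant`, the parameters `max_distance` and `top_n`, the default threshold of 0.3, and the in-memory document list are all hypothetical and stand in for the real database query. Cosine distance is computed as 1 minus cosine similarity, so opposite vectors reach a distance of 2. An empty result is the signal for the caller to use the fallback path.

```python
import math
from typing import List, Sequence, Tuple

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity. Raises on zero vectors, where it is undefined."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def retrieve_relevant(
    query_embedding: Sequence[float],
    documents: Sequence[Tuple[str, Sequence[float]]],
    max_distance: float = 0.3,  # hypothetical default; meant to be configurable
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Return up to top_n (distance, text) pairs within max_distance, closest first."""
    scored = [(cosine_distance(query_embedding, emb), text) for text, emb in documents]
    kept = [pair for pair in scored if pair[0] <= max_distance]
    kept.sort(key=lambda pair: pair[0])
    return kept[:top_n]

# Toy data (hypothetical): texts paired with 2-dimensional stand-in embeddings.
documents = [
    ("refund policy", [1.0, 0.1]),
    ("opening hours", [0.0, 1.0]),
]

matches = retrieve_relevant([1.0, 0.0], documents)
no_matches = retrieve_relevant([-1.0, 0.0], documents)

if not no_matches:
    print("no relevant document, use fallback")
```

In a real deployment the distance would more likely be computed by the vector store itself, with the threshold read from configuration. The sketch only shows the filtering and fallback logic.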
