Cosine distances for document retrieval

Current Behavior:
The function get_most_relevant_contents_from_message fetches documents from a database, ordered by their L2 (Euclidean) distance to the customer's request, and returns at most the top N. L2 distance has no fixed upper bound, and its scale depends on the embeddings being compared. As a result, the current filter of L2 score < 1 can pass a document for nearly every query, even when that document has no semantic relevance to the request. This bypasses the fallback mechanisms that should activate when the database holds no meaningful answer.

Issue:
The root cause is that L2 distance is unbounded, so a fixed threshold does not mean the same thing across different embeddings. Documents that do not semantically match the customer's request can pass the filter, which prevents the system from falling back when the database contains no genuinely relevant answer.
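
To illustrate the scale dependence, here is a minimal sketch using toy two-dimensional vectors. The helper names and the vectors are made up for illustration and are not taken from the codebase. Both pairs below are orthogonal (semantically unrelated), yet only the L2 distance changes with the vector magnitude:

```python
import math

def l2_distance(a, b):
    # Euclidean distance: grows with the magnitude of the vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings: same direction pair, different magnitudes.
small_query, small_doc = [0.1, 0.0], [0.0, 0.1]
large_query, large_doc = [10.0, 0.0], [0.0, 10.0]

l2_small = l2_distance(small_query, small_doc)   # below 1, passes an "L2 < 1" filter
l2_large = l2_distance(large_query, large_doc)   # far above 1, rejected by the same filter
cos_small = cosine_distance(small_query, small_doc)
cos_large = cosine_distance(large_query, large_doc)
```

With the small vectors, the unrelated document slips under the threshold of 1, while the cosine distance is the same for both pairs.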

Suggested Improvement:
Switch the similarity metric from L2 (Euclidean distance) to cosine distance. Cosine distance (1 minus cosine similarity) is bounded: it ranges from 0 to 2 in general, and from 0 to 1 whenever the cosine similarity is non-negative. A bounded scale makes scores comparable across queries, so a single threshold can be interpreted consistently. In addition, expose the similarity threshold as a configurable parameter so that retrieval can be tuned for specific needs or contexts.
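
A rough sketch of what this could look like, assuming documents are available as (text, embedding) pairs in memory. The function name `retrieve_relevant`, the parameters `top_n` and `max_distance`, and the default values are hypothetical; the real get_most_relevant_contents_from_message queries a database and may look quite different:

```python
import math
from typing import List, Sequence, Tuple

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # 1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

def retrieve_relevant(
    query_embedding: Sequence[float],
    documents: List[Tuple[str, Sequence[float]]],
    top_n: int = 3,
    max_distance: float = 0.3,  # hypothetical default; meant to be configurable
) -> List[Tuple[str, float]]:
    # Score every document, keep the closest top_n, then drop any that
    # exceed the threshold. An empty result signals "use the fallback".
    scored = [(cosine_distance(query_embedding, emb), text) for text, emb in documents]
    scored.sort(key=lambda pair: pair[0])
    return [(text, dist) for dist, text in scored[:top_n] if dist <= max_distance]

# Toy data, made up for illustration.
docs = [("refund policy", [1.0, 0.1]), ("shipping times", [0.0, 1.0])]
matches = retrieve_relevant([1.0, 0.0], docs)
no_matches = retrieve_relevant([-1.0, 0.0], docs)
```

Returning an empty list when nothing clears the threshold gives the caller an unambiguous signal to trigger the fallback path. A suitable value for `max_distance` depends on the embedding model and would need to be tuned on real queries.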

Cosine distances for document retrieval #120
