comet

package module

v0.1.1 Latest Latest Go to latest Published: Oct 14, 2025 License: MIT Imports: 16 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/wizenheimer/comet

Links

Open Source Insights

README ¶

Comet

Cover

A high-performance hybrid vector store written in Go. Comet brings together multiple indexing strategies and search modalities into a unified, hackable package. Hybrid retrieval with reciprocal rank fusion, autocut, pre-filtering, semantic search, full-text search, and multi-KNN searches, and multi-query operations — all in pure Go.

Understand search internals from the inside out. Built for hackers, not hyperscalers. Tiny enough to fit in your head. Decent enough to blow it.

Choose from:

Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid) storage backends
Full-Text Search: BM25 ranking algorithm with tokenization and normalization
Metadata Filtering: Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes
Ranking Programmability: Reciprocal Rank Fusion, Fixed size result sets, Threshold based result sets, Dynamic result sets etc.
Hybrid Search: Unified interface combining vector, text, and metadata search

Overview

Everything you need to understand how vector databases actually work—and build one yourself.

What's inside:

5 Vector Storage Types: Flat, HNSW, IVF, PQ, IVFPQ
3 Distance Metrics: L2, L2 Squared, Cosine
Full-Text Search: BM25 ranking with Unicode tokenization
Metadata Filtering: Roaring bitmaps + Bit-Sliced Indexes
Hybrid Search: Combine vector + text + metadata with Reciprocal Rank Fusion
Advanced Search: Multi-KNN queries, multi-query operations, autocut result truncation
Production Features: Thread-safe, serialization, soft deletes, configurable parameters

Everything you need to understand how vector databases actually work—and build one yourself.

Features

Vector Storage

Flat: Brute-force exact search (100% recall baseline)
HNSW: Hierarchical navigable small world graphs (95-99% recall, O(log n) search)
IVF: Inverted file index with k-means clustering (85-95% recall, 10-20x speedup)
PQ: Product quantization for compression (85-95% recall, 10-500x memory reduction)
IVFPQ: IVF + PQ combined (85-95% recall, 100x speedup + 500x compression)

Search Modalities

Vector Search: L2, L2 Squared, and Cosine distance metrics
Full-Text Search: BM25 ranking with Unicode-aware tokenization
Metadata Filtering: Boolean queries on structured attributes
Hybrid Search: Combine all three with configurable fusion strategies

Fusion Strategies

Weighted Sum: Linear combination with configurable weights
Reciprocal Rank Fusion (RRF): Scale-independent rank-based fusion
Max/Min Score: Simple score aggregation

Data Structures (The Good Stuff)

HNSW Graphs: Multi-layer skip lists for approximate nearest neighbor search
Roaring Bitmaps: Compressed bitmaps for metadata filtering (array, bitmap, run-length encoding)
Bit-Sliced Index (BSI): Efficient numeric range queries without full scans
Product Quantization Codebooks: Learned k-means centroids for vector compression
Inverted Indexes: Token-to-document mappings for full-text search

Other Capabilities

Quantization: Full precision, half precision, int8 precision
Soft Deletes: Fast deletion with lazy cleanup
Serialization: Persist and reload indexes
Thread-Safe: Concurrent read/write operations
Autocut: Automatic result truncation based on score gaps

Installation

go get github.com/wizenheimer/comet

Quick Start

package main

import (
    "fmt"
    "log"

    "github.com/wizenheimer/comet"
)

func main() {
    // Create a vector store (384-dimensional embeddings with cosine distance)
    index, err := comet.NewFlatIndex(384, comet.Cosine)
    if err != nil {
        log.Fatal(err)
    }

    // Add vectors
    vec1 := make([]float32, 384)
    // ... populate vec1 with your embedding ...
    node := comet.NewVectorNode(vec1)
    index.Add(*node)

    // Search for similar vectors
    query := make([]float32, 384)
    // ... populate query vector ...
    results, err := index.NewSearch().
        WithQuery(query).
        WithK(10).
        Execute()

    if err != nil {
        log.Fatal(err)
    }

    // Process results
    for i, result := range results {
        fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
    }
}

Output:

1. ID=123, Score=0.0234
2. ID=456, Score=0.0567
3. ID=789, Score=0.0823
...

Architecture

System Architecture

Comet is organized into three main search engines that can work independently or together:

Application Layer

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│            (Using Comet as a Go Library)                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
         ▼             ▼             ▼
    Vector         Text         Metadata

Search Engine Layer

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Vector    │    │    Text     │    │  Metadata   │
│   Search    │    │   Search    │    │  Filtering  │
│   Engine    │    │   Engine    │    │   Engine    │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       │ Semantic         │ Keywords         │ Filters
       │ Similarity       │ + Relevance      │ + Boolean Logic
       ▼                  ▼                  ▼

Index Storage Layer

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ HNSW / IVF  │    │ BM25 Index  │    │  Roaring    │
│ / PQ / Flat │    │ (Inverted)  │    │  Bitmaps    │
└─────────────┘    └─────────────┘    └─────────────┘
   Graph/Trees      Token→DocIDs       Compressed Sets

Hybrid Coordinator

                 All Three Engines
                       │
                       ▼
              ┌─────────────────┐
              │  Hybrid Search   │
              │  Coordinator     │
              │  (Score Fusion)  │
              └─────────────────┘
                       │
                       ▼
              Combined Results

Component Details

Component A: Vector Storage Engine

Manages vector storage and similarity search across multiple index types.

Common Interface:

┌────────────────────────────────────┐
│  VectorIndex Interface             │
│                                    │
│  ├─ Train(vectors)                 │
│  ├─ Add(vector)                    │
│  ├─ Remove(vector)                 │
│  └─ NewSearch()                    │
└────────────────────────────────────┘

Available Implementations:

FlatIndex          → Brute force, 100% recall
HNSWIndex          → Graph-based, O(log n)
IVFIndex           → Clustering, 10-20x faster
PQIndex            → Quantization, 10-500x compression
IVFPQIndex         → Hybrid, best of IVF + PQ

Responsibilities:

Vector preprocessing (normalization for cosine distance)
Distance calculations (Euclidean, L2², Cosine)
K-nearest neighbor search
Serialization/deserialization
Soft delete management with flush mechanism

Performance Characteristics:

Flat: O(n×d) search, 100% recall
HNSW: O(M×ef×log n) search, 95-99% recall
IVF: O(nProbes×n/k×d) search, 85-95% recall

Component B: Text Search Engine

Full-text search using BM25 ranking algorithm.

Inverted Index:

┌────────────────────────────────────┐
│  term → RoaringBitmap(docIDs)      │
│                                    │
│  "machine"  →  {1, 5, 12, 45}      │
│  "learning" →  {1, 3, 12, 20}      │
│  "neural"   →  {3, 20, 45}         │
└────────────────────────────────────┘

Term Frequencies:

┌────────────────────────────────────┐
│  term → {docID: count}             │
│                                    │
│  "machine" → {1: 3, 5: 1, 12: 2}   │
└────────────────────────────────────┘

Document Stats:

┌────────────────────────────────────┐
│  docID → (length, token_count)     │
│                                    │
│  1  →  (250 chars, 45 tokens)      │
│  5  →  (180 chars, 32 tokens)      │
└────────────────────────────────────┘

Responsibilities:

Text tokenization (UAX#29 word segmentation)
Unicode normalization (NFKC)
Inverted index maintenance
BM25 score calculation
Top-K retrieval with heap

Performance Characteristics:

Add: O(m) where m = tokens
Search: O(q×d_avg) where q = query terms, d_avg = avg docs per term
Memory: Compressed inverted index, no original text stored

Component C: Metadata Filter Engine

Fast filtering using compressed bitmaps.

Categorical Fields (Roaring Bitmaps):

┌────────────────────────────────────┐
│  field:value → Bitmap(docIDs)      │
│                                    │
│  "category:electronics" → {1,5,12} │
│  "category:books"       → {2,8,15} │
│  "in_stock:true"        → {1,2,5}  │
└────────────────────────────────────┘

Numeric Fields (Bit-Sliced Index):

┌────────────────────────────────────┐
│  field → BSI (range queries)       │
│                                    │
│  "price"  → [0-1000, 1000-5000]    │
│  "rating" → [0-5 scale]            │
└────────────────────────────────────┘

Document Universe:

┌────────────────────────────────────┐
│  allDocs → Bitmap(all IDs)         │
│                                    │
│  Used for NOT operations           │
└────────────────────────────────────┘

Responsibilities:

Bitmap index maintenance (Roaring compression)
BSI for numeric range queries
Boolean query evaluation (AND, OR, NOT)
Existence checks
Set operations (IN, NOT IN)

Performance Characteristics:

Add: O(f) where f = fields
Query: O(f×log n) with bitmap operations
Memory: Highly compressed, 1-10% of uncompressed

Data Flow

How a hybrid search request flows through the system:

Step 1: Query Input

User Query:
├─ Vector:    [0.1, 0.5, ...]
├─ Text:      "machine learning"
└─ Filters:   {category="ai", price<100}

Step 2: Validation

┌────────────┐
│ Validation │ ──✗──> Error Handler
└─────┬──────┘         (Invalid dimension, etc.)
      │
      ✓ Valid

Step 3: Metadata Pre-filtering

┌─────────────────────────┐
│ Metadata Filter Engine  │
│ Apply: category="ai"    │
│        price<100        │
└─────┬───────────────────┘
      │
      ▼
Candidates: {1, 5, 7, 12, 15}

Step 4: Parallel Search

Candidates → Split into both engines

┌────────────┐         ┌────────────┐
│  Vector    │         │   Text     │
│  Search    │         │  Search    │
│ (on cands) │         │ (on cands) │
└─────┬──────┘         └─────┬──────┘
      │                      │
      ▼                      ▼
Vec Results:           Text Results:
{1: 0.2,               {7: 8.5,
 5: 0.3,                1: 7.2,
 7: 0.4}                12: 6.8}

Step 5: Score Fusion

      Both Result Sets
              │
              ▼
      ┌───────────────┐
      │ Score Fusion  │
      │ (RRF/Weighted)│
      └───────┬───────┘
              │
              ▼
      Fused Rankings

Step 6: Final Ranking

      ┌───────────────┐
      │  Rank & Sort  │
      └───────┬───────┘
              │
              ▼
      ┌───────────────┐
      │  Top-K Filter │
      │  (k=10)       │
      └───────┬───────┘
              │
              ▼
      Final Results:
      [{1: 0.8},
       {7: 0.7},
       {5: 0.6}]

Memory Layout

Understanding how different index types use memory:

HNSW Index Memory

File Header (24 bytes):

┌─────────────────────────────┐
│ Magic:      "HNSW" (4 B)    │
│ Version:    1 (4 B)          │
│ Dimensions: 384 (4 B)        │
│ M:          16 (4 B)         │
│ Max Level:  3 (4 B)          │
│ Entry Point: 42 (4 B)        │
└─────────────────────────────┘

Per-Node Storage:

Node ID:       4 bytes
Level:         4 bytes
Vector Data:   1536 bytes (384-dim × 4)
Graph Edges:   320 bytes (M connections × layers)
              ─────────────
Total:         ~1864 bytes per node

Scaling Analysis:

┌───────────────────┬──────────┬───────────┐
│ Component         │ Per Node │ 1M Nodes  │
├───────────────────┼──────────┼───────────┤
│ Vectors (raw)     │ 1536 B   │ 1.46 GB   │
│ Graph structure   │ 320 B    │ 305 MB    │
│ Metadata          │ 8 B      │ 7.6 MB    │
├───────────────────┼──��───────┼───────────┤
│ Total             │ 1864 B   │ 1.78 GB   │
└───────────────────┴──────────┴───────────┘

Product Quantization Memory

Compression Overview:

Original Vector (384-dim):
384 × 4 bytes = 1536 bytes
         ↓
    Quantization
         ↓
PQ Codes (8 subspaces):
8 × 1 byte = 8 bytes
         ↓
192x smaller!

Codebook Storage:

8 codebooks × 256 centroids × 48 dims × 4 bytes
= 393 KB (shared across all vectors)

1M Vectors Comparison:

┌───────────────────┬────────────┬──────────┐
│ Format            │ Size       │ Ratio    │
├───────────────────┼────────────┼──────────┤
│ Original (float32)│ 1.46 GB    │ 1x       │
│ PQ-8              │ 7.6 MB     │ 192x     │
│ PQ-16             │ 15.3 MB    │ 96x      │
│ PQ-32             │ 30.5 MB    │ 48x      │
│ + Codebooks       │ +393 KB    │ -        │
└───────────────────┴────────────┴──────────┘

Core Concepts

Concept 1: Vector Storage and Distance Metrics

A vector store maintains high-dimensional embeddings and enables efficient similarity search. The choice of storage type determines the tradeoff between search speed, memory usage, and accuracy.

Example: How vectors are stored and searched

Given these input vectors:

Vector 1: [0.1, 0.5, 0.3, 0.8]
Vector 2: [0.2, 0.4, 0.7, 0.1]
Query:    [0.15, 0.45, 0.5, 0.5]

The index computes distances and returns nearest neighbors:

┌─────────────┬────────────────────────────────┬──────────┐
│ Vector ID   │ Distance to Query              │ Rank     │
├─────────────┼────────────────────────────────┼──────────┤
│ 1           │ 0.234 (closest)                │ 1st      │
│ 2           │ 0.567 (further)                │ 2nd      │
└─────────────┴────────────────────────────────┴──────────┘

Visual Representation: Storage Types Comparison

Flat Index (Brute Force)

Query → Compare with ALL vectors → Sort → Return K

✓ 100% Accuracy
✗ O(n) time complexity
✗ Slow for large datasets

HNSW Index (Graph-Based)

Layer 2: ●─────────────●    (highways - long jumps)
          │             │
Layer 1: ●───●───●───●    (roads - medium jumps)
          │   │ \ │ \ │
Layer 0: ●─●─●─●─●─●─●    (streets - all nodes)

Search: Start high → Navigate greedily → Descend

✓ O(log n) time complexity
✓ 95-99% recall
✗ 2-3x memory overhead

IVF Index (Clustering)

Cluster 1: ●●●●     Cluster 2: ●●●●
Cluster 3: ●●●●     Cluster 4: ●●●●

Query → Find nearest clusters → Search within

✓ Fast on large datasets
✗ Requires training
~ 85-95% recall

Product Quantization (Compression)

Original:  [0.123, 0.456, 0.789, ...]
                    ↓ Compress
Quantized: [17, 42, 89, ...]

4 bytes each → 1 byte each

✓ 4-32x memory reduction
✗ Slight accuracy loss

Benefits of Different Storage Types:

Flat Index: Perfect recall, simple, no training. Use for small datasets (<100K vectors)
HNSW: Excellent speed/accuracy tradeoff, no training. Use for most production workloads
IVF: Fast filtering with clusters, scalable. Use for very large datasets (>10M vectors)
PQ: Massive memory savings. Use when storage cost is a concern
IVFPQ: Best of IVF + PQ. Use for extremely large, memory-constrained environments

Concept 2: BM25 Full-Text Search Algorithm

BM25 (Best Matching 25) ranks documents by relevance to a text query using term frequency and inverse document frequency.

Setup

Test Corpus:

Doc 1: "the quick brown fox jumps over the lazy dog"  (9 words)
Doc 2: "the lazy dog sleeps"                           (4 words)
Doc 3: "quick brown rabbits jump"                      (4 words)

Query: "quick brown"

Parameters:

Average Document Length: 7 words
K1 = 1.2  (term frequency saturation)
B  = 0.75 (length normalization)

Step 1: Calculate IDF (Inverse Document Frequency)

For term "quick":

Appears in: 2 out of 3 documents

IDF = log((N - df + 0.5) / (df + 0.5) + 1)
    = log((3 - 2 + 0.5) / (2 + 0.5) + 1)
        = log(1.5 / 2.5 + 1)
        = log(1.6)
        = 0.470

For term "brown":

Appears in: 2 out of 3 documents
IDF = 0.470  (same calculation)

Step 2: Calculate TF Component for Each Document

Doc 1 (9 words - longer than average):

Term "quick" (tf=1):
  TF_score = (tf × (K1 + 1)) / (tf + K1 × (1 - B + B × (docLen/avgLen)))
           = (1 × 2.2) / (1 + 1.2 × (1 - 0.75 + 0.75 × (9/7)))
             = 2.2 / 2.457
             = 0.895

Term "brown" (tf=1):
  TF_score = 0.895  (same calculation)

Doc 2 (4 words):

Terms "quick" and "brown": tf=0 (not present)
    TF_score = 0

Doc 3 (4 words - shorter than average):

Term "quick" (tf=1):
  TF_score = (1 × 2.2) / (1 + 1.2 × (1 - 0.75 + 0.75 × (4/7)))
             = 2.2 / 1.815
             = 1.212

Term "brown" (tf=1):
  TF_score = 1.212  (same calculation)

Step 3: Calculate Final BM25 Scores

Combine IDF × TF for each document:

Doc 1: (0.895 × 0.470) + (0.895 × 0.470) = 0.841
Doc 2: 0  (no matching terms)
Doc 3: (1.212 × 0.470) + (1.212 × 0.470) = 1.139

Final Ranking

┌─────────┬──────────┬────────┐
│ Rank    │ Doc ID   │ Score  │
├─────────┼──────────┼────────┤
│ 1st     │ Doc 3    │ 1.139  │
│ 2nd     │ Doc 1    │ 0.841  │
│ 3rd     │ Doc 2    │ 0.000  │
└─────────┴──────────┴────────┘

Why Doc 3 Ranks Higher

Doc 1 vs Doc 3:
├─ Same term frequencies (1 occurrence each)
├─ Doc 3 is shorter (4 words vs 9 words)
├─ BM25 length normalization penalizes longer docs
└─ Result: Doc 3 gets higher TF scores (1.212 vs 0.895)

Key Insight: Shorter documents with same term frequency
             are considered more relevant ✓

Concept 3: Hybrid Search with Score Fusion

Hybrid search combines vector similarity, text relevance, and metadata filtering. Different fusion strategies handle score normalization differently.

Query Setup

INPUT:
├─ Query:   "machine learning tutorial"
├─ Vector:  [0.12, 0.45, 0.89, ...]  (embedding)
└─ Filter:  category="education" AND price<50

Step 1: Apply Metadata Filter

┌─────────────────────────────────┐
│  Metadata Filtering             │
│  category="education"           │
│  price<50                       │
└─────────────────────────────────┘
                │
                ▼
Candidate Docs: {1, 3, 5, 7, 9, 12, 15, 18, 20}

Step 2: Vector Search Results

Semantic Similarity (on candidates):

┌────────┬──────────┬──────┐
│ Doc ID │ Distance │ Rank │
├────────┼──────────┼──────┤
│   1    │  0.12    │  1   │  ← Closest
│   5    │  0.23    │  2   │
│   7    │  0.34    │  3   │
│  12    │  0.45    │  4   │
└────────┴──────────┴──────┘

Step 3: Text Search Results

BM25 Ranking (on candidates):

┌────────┬────────────┬──────┐
│ Doc ID │ BM25 Score │ Rank │
├────────┼────────────┼──────┤
│   7    │   8.5      │  1   │  ← Most relevant
│   1    │   7.2      │  2   │
│  12    │   6.8      │  3   │
│   5    │   4.1      │  4   │
└────────┴────────────┴──────┘

Step 4: Reciprocal Rank Fusion (RRF)

Formula: RRF_score = sum(1 / (K + rank_i)) where K=60

Doc 1 Calculation:

  Vector rank: 1 → 1/(60+1) = 0.0164
Text rank:   2 → 1/(60+2) = 0.0161
                          ────────
Combined RRF score:         0.0325

Doc 5 Calculation:

  Vector rank: 2 → 1/(60+2) = 0.0161
Text rank:   4 → 1/(60+4) = 0.0156
                          ────────
Combined RRF score:         0.0317

Doc 7 Calculation:

  Vector rank: 3 → 1/(60+3) = 0.0159
Text rank:   1 → 1/(60+1) = 0.0164
                          ────────
Combined RRF score:         0.0323

Doc 12 Calculation:

  Vector rank: 4 → 1/(60+4) = 0.0156
Text rank:   3 → 1/(60+3) = 0.0159
                          ────────
Combined RRF score:         0.0315

Final Ranking

┌──────┬────────┬───────────┬──────────────────┐
│ Rank │ Doc ID │ RRF Score │ Why              │
├──────┼────────┼───────────┼──────────────────┤
│ 1st  │   1    │  0.0325   │ Best vector      │
│ 2nd  │   7    │  0.0323   │ Best text        │
│ 3rd  │   5    │  0.0317   │ Balanced         │
│ 4th  │  12    │  0.0315   │ Lower in both    │
└──────┴────────┴───────────┴──────────────────┘

Why RRF Over Weighted Sum?

Problem with Weighted Sum:
├─ Vector distances:    0-2 range
├─ BM25 scores:         0-100+ range
└─ Different scales need manual tuning ✗

RRF Advantages:
├─ ✓ Scale Independent  (uses ranks, not raw scores)
├─ ✓ Robust             (stable across score distributions)
├─ ✓ No Tuning          (no manual weight calibration)
└─ ✓ Industry Standard  (Elasticsearch, Vespa, etc.)

Storage Type Deep Dive

Choose your poison: Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid). Each trades speed, memory, and accuracy differently.

Full Deep Dive: See INDEX.md for complete algorithms, benchmarks, and implementation details.

Flat Index (Brute Force)

The Simplest Approach: Compare query against EVERY vector. 100% recall, zero approximation.

How It Works

Query Vector
     ↓
┌────────────────────┐
│  Compare to ALL n  │
│     vectors        │  ← Every single one checked
└────────────────────┘   No shortcuts!
        ↓
   Sort by distance
        ↓
   Return top K

Implementation

1. Store vectors → Raw float32 arrays
2. Search step   → Compute distance to all vectors
3. Selection     → Keep top-k closest
4. Index struct  → None (just a list)

Complexity Analysis

┌─────────────���──────────────────────────┐
│ Operation   │ Complexity               │
├─────────────┼──────────────────────────┤
│ Build       │ O(1) - just store        │
│ Search      │ O(n × d) - check all     │
│ Memory      │ O(n × d) - raw vectors   │
│ Recall      │ 100% - exhaustive        │
└─────────────┴──────────────────────────┘

When to Use

✓ Small datasets (<10K vectors)
✓ 100% recall required (fingerprinting, security)
✓ Benchmarking baseline
✗ Large datasets (too slow)

HNSW Index (Hierarchical Graph)

The Graph Navigator: Build a multi-layer graph where each node connects to nearby vectors. Search by greedy navigation from layer to layer.

Graph Structure

Layer 2: [A]─────────[D]           ← Highways (long jumps)
          │           │              Few nodes

Layer 1: [A]──[B]────[D]──[E]      ← Roads (medium jumps)
          │   │ \    │ \  │          More nodes

Layer 0: [A]─[B]─[C]─[D]─[E]─[F]   ← Streets (short links)
                                      ALL vectors here

Search Process

1. Start at top layer    → Long-range navigation
2. Greedy best move      → Move to closest neighbor
3. Drop to next layer    → Refine search
4. Repeat steps 2-3      → Until Layer 0
5. Return k-nearest      → Final results

How Insertion Works

New Vector:
├─ 1. Add to Layer 0 (always)
├─ 2. Randomly assign to higher layers (exponential decay)
├─ 3. Connect to M nearest neighbors per layer
└─ 4. Update neighbors' connections (bidirectional)

Complexity Analysis

┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Search      │ O(M × efSearch × log n)        │
│             │ Typically checks <1% of data   │
│ Insert      │ O(M × efConstruction × log n)  │
│ Memory      │ O(n × d + n × M × log n)       │
│ Recall      │ 95-99% with proper tuning      │
└─────────────┴────────────────────────────────┘

Key Parameters

M (Connections per layer):
├─ 4-8:   Low memory, lower recall
├─ 16:    Balanced (default)
└─ 32-48: High recall, more memory

efConstruction (Build quality):
├─ 100:   Fast build, lower quality
├─ 200:   Good balance (default)
└─ 400+:  Better graph, slower build

efSearch (Search quality):
├─ 50:    Fast search, ~85% recall
├─ 200:   Balanced (default), ~96% recall
└─ 400+:  High recall, slower search

When to Use

✓ Large datasets (10K-10M vectors)
✓ Need 95-99% recall
✓ Sub-millisecond latency required
✓ Can afford 2-3x memory overhead
✗ Memory constrained environments

IVF Index (Inverted File with Clustering)

The Clustering Approach: Partition vectors into clusters using k-means. Search only the nearest clusters.

Build Phase (Training)

All Vectors (n)
      ↓
   k-means clustering
      ↓
┌─���──────────────────────────────────┐
│ [C1]  [C2]  [C3]  [C4]  ... [Cn]  │  ← Centroids
└────────────────────────────────────┘
   ↓      ↓     ↓     ↓         ↓
  {v1}   {v8}  {v3}  {v12}    {v7}     ← Vectors assigned
  {v5}   {v9}  {v6}  {v19}    {v11}      to nearest
  {v20}  ...   ...   ...      ...        centroid

Search Phase

Query Vector
      ↓
Find nProbe nearest centroids
      ↓
┌────────────────────────────┐
│ [C2] [C3]  (only these!)   │  ← Search 2 of 100 clusters
└────────────────────────────┘
   ↓     ↓
  {...} {...}  Search within clusters
      ↓
  Top-k results

Key Insight

Instead of:  Check all 1M vectors
Do this:     1. Find 8 nearest clusters  (O(nClusters))
             2. Search 80K vectors total  (10% of data)
             └─> 10-20x faster!

Complexity Analysis

┌─────────────┬───���────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Build       │ O(iterations × n × k × d)      │
│             │ k-means clustering             │
│ Search      │ O(k × d + (n/k) × nProbe × d)  │
│             │ Typically checks 5-20% of data │
│ Memory      │ O(n × d + k × d)               │
│ Recall      │ 80-95% depending on nProbe     │
└─────────────┴────────────────────────────────┘

Key Parameters

nClusters (number of partitions):
├─ Rule of thumb: sqrt(n) to n/10
├─ 10K vectors   → 100 clusters
├─ 100K vectors  → 316 clusters
├─ 1M vectors    → 1,000 clusters
└─ More clusters → Faster search, slower build

nProbe (clusters to search):
├─ 1:   Fastest, ~60-70% recall
├─ 8:   Good balance, ~85% recall
├─ 16:  Better recall, ~92% recall
└─ 32+: High recall, ~96% recall

When to Use

✓ Large datasets (>100K vectors)
✓ Can tolerate 85-95% recall
✓ Want 10-20x speedup over Flat
✓ Have training data available
✗ Need 100% recall (use Flat)
✗ Dataset too small (<10K)

PQ Index (Product Quantization)

The Compression Master: Split vectors into subvectors, quantize each subvector to 256 codes. Reduce memory by 16-32×.

Compression Process

Original Vector (384 dimensions):
  [0.23, 0.91, ..., 0.15, 0.44, ..., 0.73, 0.22, ...]
   \_____48D_____/  \_____48D_____/  \_____48D_____/
   Subspace 1       Subspace 2       Subspace 3
        ↓                ↓                ↓
   K-means on       K-means on       K-means on
   subspace 1       subspace 2       subspace 3
        ↓                ↓                ↓
 Codebook with    Codebook with    Codebook with
 256 centroids    256 centroids    256 centroids
        ↓                ↓                ↓
   Find nearest    Find nearest     Find nearest
   centroid ID     centroid ID      centroid ID
        ↓                ↓                ↓
      [12]             [203]            [45]
       1 byte          1 byte           1 byte

Result

Before: [0.23, 0.91, ...] → 384 × 4 bytes = 1536 bytes
After:  [12, 203, 45, ...]  → 8 × 1 byte  = 8 bytes

192x compression!

Search Process

1. Query arrives
        ↓
2. Split query into subspaces (like training)
        ↓
3. Precompute distance tables for each subspace
   (query_subspace to all 256 codebook centroids)
        ↓
4. For each vector:
   - Look up codes: [12, 203, 45, ...]
   - Table lookup: distances[12] + distances[203] + ...
   - No float operations! Just array lookups
        ↓
5. Return top-k

Complexity Analysis

┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Training    │ O(M × iterations × K × n/M)    │
│             │ K-means on each subspace       │
│ Encoding    │ O(M × K × d/M) per vector      │
│ Search      │ O(M × K + n × M)               │
│             │ Super fast, cache-friendly     │
│ Memory      │ O(n × M) - massive savings!    │
│ Recall      │ 70-85% typical                 │
└─────────────┴────────────────────────────────┘

Memory Comparison

1M vectors, 384 dimensions:

Float32:  1M × 384 × 4 bytes = 1.46 GB
PQ-8:     1M × 8 × 1 byte    = 7.6 MB  (192x smaller!)
PQ-16:    1M × 16 × 1 byte   = 15.3 MB (96x smaller!)
PQ-32:    1M × 32 × 1 byte   = 30.5 MB (48x smaller!)

+ Codebooks: ~393 KB (shared across all vectors)

Key Parameters

M (nSubspaces):
├─ 8:   Maximum compression, lower accuracy
├─ 16:  Good balance
├─ 32:  Better accuracy, less compression
└─ 64:  High accuracy, moderate compression

bitsPerCode:
└─ 8:   Standard (256 centroids per subspace)
        Perfect for uint8 storage

When to Use

✓ Massive datasets (millions of vectors)
✓ Memory is the bottleneck
✓ Can tolerate 70-85% recall
✓ Want 30-200x memory reduction
✗ Need high recall (>95%)
✗ Have plenty of memory

IVFPQ Index (Hybrid: Clustering + Quantization)

Best of Both Worlds: IVF clusters to reduce search space, PQ compression to reduce memory. The ultimate scalability index.

Build Process

Step 1: IVF Clustering

All Vectors
      ↓
  K-means clustering
      ↓
[C1]  [C2]  [C3]  [C4]  ... [Cn]

Step 2: PQ Compression per Cluster

Cluster 1:        Cluster 2:        Cluster 3:
  {vectors}         {vectors}         {vectors}
      ↓                 ↓                 ↓
  Apply PQ          Apply PQ          Apply PQ
      ↓                 ↓                 ↓
[12,203,45]      [91,34,178]       [56,211,19]
[88,9,101]       [23,156,88]       [199,44,73]
[...]            [...]             [...]
           uint8 codes      uint8 codes       uint8 codes

Search Process

Step 1: IVF Stage
  Query → Find nProbe nearest centroids
          ↓
     [C2] [C3] [C7]  (e.g., 3 of 100 clusters)
          ↓
     Search only 3% of the data!

Step 2: PQ Stage
  For each selected cluster:
    ├─ Precompute distance tables
    ├─ Fast table lookups on PQ codes
    └─ No float operations
          ↓
     Top-k results

The Magic Combination

IVF contributes:
└─ Speed:  10-100x faster (search only nProbe clusters)

PQ contributes:
└─ Memory: 30-200x smaller (compressed codes)

Combined:
└─ Fast + Tiny = Billion-scale capability!

Complexity Analysis

┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Training    │ O(IVF_kmeans + PQ_kmeans)      │
│ Search      │ O(k×d + (n/k)×nProbe×M)        │
│             │ Searches ~1-10% of data        │
│ Memory      │ O(n×M + k×d)                   │
│             │ Massive compression            │
│ Recall      │ 70-90% depending on params     │
└─────────────┴────────────────────────────────┘

Memory Savings Example

1M vectors, 384 dimensions:

Float32 + IVF:   1.46 GB
IVFPQ (M=8):     7.6 MB + 400 KB (centroids)
                 = ~8 MB total

180x compression!

Key Parameters

nClusters (IVF):
├─ 100:   For 100K vectors
├─ 1K:    For 1M vectors
└─ 10K:   For 100M vectors

nProbe (IVF search):
├─ 1:     Fastest, lower recall
├─ 8:     Good balance
└─ 16:    Better recall

M (PQ subspaces):
├─ 8:     Maximum compression
├─ 16:    Good balance
└─ 32:    Better accuracy

When to Use

✓ Billion-scale datasets
✓ Need <10ms latency
✓ Severe memory constraints
✓ Can tolerate 70-90% recall
✓ Want 100x speed + 100x compression
✗ Need >95% recall (use HNSW)
✗ Small datasets (use Flat or HNSW)

Real-World Example

Use Case: 100M vectors, 768-dim

Flat Index:
├─ Memory: ~288 GB
├─ Search: ~10 seconds
└─ Not feasible! ✗

IVFPQ:
├─ Memory: ~800 MB
├─ Search: ~5 ms
└─ Practical! ✓

Decision Matrix

Index	Recall	Search Speed	Memory	Build Time	Best For
Flat	100%	Slow O(n)	High	Instant	<10k vectors, benchmarks
HNSW	90-99%	Fast O(log n)	Highest	Slow	10k-10M vectors, low latency
IVF	80-95%	Medium	High	Medium	>100k vectors, moderate recall
PQ	70-85%	Fast	Lowest	Slow	>1M vectors, memory-constrained
IVFPQ	70-90%	Fastest	Very Low	Slow	>10M vectors, billion-scale

Rules of thumb:

Small data (<10k): Flat - brute force is fast enough
Medium data (10k-1M): HNSW - best recall/speed tradeoff
Large data (1M-100M): IVF or PQ - choose speed (IVF) vs memory (PQ)
Massive data (>100M): IVFPQ - only option that scales to billions

Latency targets:

Flat: 1-100ms (depends on n)
HNSW: 0.1-2ms (sub-millisecond possible)
IVF: 0.5-5ms
PQ: 1-10ms (fast scan but approximate)
IVFPQ: 0.5-3ms (fastest for massive datasets)

API Reference

For detailed API documentation, see API.md.

Examples

For practical examples and code samples, see EXAMPLE.md.

Configuration

Basic Configuration

// Flat Index - No configuration needed
flatIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

// HNSW Index - Configure build and search parameters
hnswIdx, _ := comet.NewHNSWIndex(
    384,              // dimensions
    comet.Cosine,     // distance metric
    16,               // M: connections per layer
    200,              // efConstruction: build quality
    200,              // efSearch: search quality
)

// IVF Index - Configure clustering
ivfIdx, _ := comet.NewIVFIndex(
    384,              // dimensions
    comet.Cosine,     // distance metric
    100,              // nClusters: number of partitions
)

// Training required for IVF
trainingVectors := []comet.VectorNode{ /* ... */ }
ivfIdx.Train(trainingVectors)

Configuration Options

HNSW Parameters

M (Connections Per Layer)

Type: int
Default: 16
Range: 4 to 64
Description: Number of bidirectional connections per node at each layer (except layer 0 which uses 2×M)

Effects of different values:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 4-8     │ Low memory, faster build, lower recall   │
│ 12-16   │ Balanced (recommended for most cases)    │
│ 24-48   │ High recall, slower build, more memory   │
│ 64+     │ Diminishing returns, excessive memory    │
└─────────┴──────────────────────────────────────────┘

efConstruction (Build-Time Candidate List Size)

Type: int
Default: 200
Range: 100 to 1000
Description: Size of candidate list during index construction. Higher = better graph quality

Effects:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 100     │ Fast build, lower quality graph          │
│ 200     │ Good balance (default)                   │
│ 400-500 │ Better quality, 2-3x slower build        │
│ 800+    │ Marginal gains, very slow build          │
└─────────┴──────────────────────────────────────────┘

efSearch (Search-Time Candidate List Size)

Type: int
Default: 200
Range: k to 1000 (must be >= k)
Description: Size of candidate list during search. Can be adjusted dynamically.

Effects:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 50      │ Very fast, ~85% recall                   │
│ 100     │ Fast, ~92% recall                        │
│ 200     │ Balanced, ~96% recall                    │
│ 400     │ Slower, ~98% recall                      │
│ 800     │ Much slower, ~99% recall                 │
└─────────┴──────────────────────────────────────────┘

Example:

// Create HNSW with high quality settings
index, _ := comet.NewHNSWIndex(
    384,
    comet.Cosine,
    32,    // More connections = better recall
    400,   // Higher construction quality
    200,   // Search quality (can adjust later)
)

// Dynamically adjust search quality
index.SetEfSearch(400)  // Trade speed for recall

IVF Parameters

nClusters (Number of Clusters)

Type: int
Default: sqrt(n) where n = dataset size
Range: 16 to n/100
Description: Number of k-means clusters (Voronoi cells) to partition the space

┌─────────────┬────────────┬─────────────────────────┐
│ Dataset Size│ nClusters  │ Typical Range           │
├─────────────┼────────────┼─────────────────────────┤
│ 10K         │ 100        │ 50-200                  │
│ 100K        │ 316        │ 200-500                 │
│ 1M          │ 1000       │ 500-2000                │
│ 10M         │ 3162       │ 2000-5000               │
└─────────────┴────────────┴─────────────────────────┘

nProbes (Search-Time Clusters to Probe)

Type: int
Default: 1
Range: 1 to nClusters
Description: Number of nearest clusters to search during query

┌─────────┬──────────────────────────────────────────┐
│ nProbes │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 1       │ Fastest, ~60-70% recall                  │
│ 8       │ Good balance, ~85% recall                │
│ 16      │ Better recall, ~92% recall               │
│ 32      │ High recall, ~96% recall                 │
│ 64      │ Very high recall, slower                 │
└─────────┴──────────────────────────────────────────┘

Example:

// Create IVF index
index, _ := comet.NewIVFIndex(384, comet.Cosine, 256)

// Train with representative sample
trainData := sampleVectors(10000)  // 10K samples for training
index.Train(trainData)

// Search with multiple probes
results, _ := index.NewSearch().
    WithQuery(query).
    WithK(10).
    WithNProbes(16).  // Search 16 nearest clusters
    Execute()

Advanced Configuration

Fusion Configuration

// Weighted Sum Fusion (manual weights)
config := &comet.FusionConfig{
    VectorWeight: 0.7,  // 70% weight to semantic similarity
    TextWeight:   0.3,  // 30% weight to keyword relevance
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)

hybridSearch := hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusion(fusion).
    Execute()

// Reciprocal Rank Fusion (rank-based, no weights needed)
hybridSearch := hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusionKind(comet.ReciprocalRankFusion).
    Execute()

Distance Metrics

// Euclidean (L2): Use when magnitude matters
euclideanIdx, _ := comet.NewFlatIndex(384, comet.Euclidean)

// L2 Squared: Faster than Euclidean (no sqrt), preserves ranking
l2sqIdx, _ := comet.NewFlatIndex(384, comet.L2Squared)

// Cosine: Use for normalized vectors (text embeddings, etc.)
cosineIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

API Details

FlatIndex

Constructor:

func NewFlatIndex(dim int, distanceKind DistanceKind) (*FlatIndex, error)

Parameters:

Parameter	Type	Required	Description
`dim`	`int`	Yes	Vector dimension (must be > 0)
`distanceKind`	`DistanceKind`	Yes	Distance metric: `Euclidean`, `L2Squared`, or `Cosine`

Returns:

*FlatIndex: Initialized flat index
error: Error if parameters are invalid

HNSWIndex

Constructor:

func NewHNSWIndex(dim int, distanceKind DistanceKind, m, efConstruction, efSearch int) (*HNSWIndex, error)

Parameters:

Parameter	Type	Required	Default	Description
`dim`	`int`	Yes	-	Vector dimension (must be > 0)
`distanceKind`	`DistanceKind`	Yes	-	Distance metric: `Euclidean`, `L2Squared`, or `Cosine`
`m`	`int`	No	16	Max connections per node (higher = better accuracy, more memory)
`efConstruction`	`int`	No	200	Build-time search depth (higher = better graph quality, slower build)
`efSearch`	`int`	No	`efConstruction`	Query-time search depth (higher = better accuracy, slower search)

Returns:

*HNSWIndex: Initialized HNSW index
error: Error if parameters are invalid

Parameter Guidelines:

┌──────────────┬─────┬────────────────┬──────────┬─────────────┐
│ Use Case     │  M  │ efConstruction │ efSearch │ Description │
├──────────────┼─────┼────────────────┼──────────┼─────────────┤
│ Fast         │  8  │      100       │    50    │ Speed first │
│ Balanced     │ 16  │      200       │   100    │ Default     │
│ High Recall  │ 32  │      400       │   200    │ Accuracy    │
│ Memory Eff.  │  4  │       50       │    25    │ Low memory  │
└──────────────┴─────┴────────────────┴──────────┴─────────────┘

IVFIndex

Constructor:

func NewIVFIndex(dim int, nlist int, distanceKind DistanceKind) (*IVFIndex, error)

Parameters:

Parameter	Type	Required	Description
`dim`	`int`	Yes	Vector dimension (must be > 0)
`nlist`	`int`	Yes	Number of clusters/partitions (must be > 0)
`distanceKind`	`DistanceKind`	Yes	Distance metric: `Euclidean`, `L2Squared`, or `Cosine`

Returns:

*IVFIndex: Initialized IVF index
error: Error if parameters are invalid

Parameter Guidelines:

nlist Selection:
├─ Small dataset (10K-100K):    nlist = sqrt(n) = 100-300
├─ Medium dataset (100K-1M):    nlist = sqrt(n) = 300-1000
├─ Large dataset (1M-10M):      nlist = sqrt(n) = 1000-3000
└─ Very large (10M+):           nlist = sqrt(n) = 3000-10000

Rule of thumb: nlist ≈ sqrt(number_of_vectors)

Search Parameters:

// nProbe: number of clusters to search (1 to nlist)
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).  // Search top 8 nearest clusters
    Execute()

Speed vs Accuracy:
├─ nProbe = 1:     Fastest, lowest recall
├─ nProbe = 8:     Balanced (typical)
├─ nProbe = nlist: Exhaustive (same as flat)

PQIndex

Constructor:

func NewPQIndex(dim int, distanceKind DistanceKind, M int, Nbits int) (*PQIndex, error)

Parameters:

Parameter	Type	Required	Constraint	Description
`dim`	`int`	Yes	> 0	Vector dimension
`distanceKind`	`DistanceKind`	Yes	-	Distance metric: `Euclidean`, `L2Squared`, or `Cosine`
`M`	`int`	Yes	dim % M == 0	Number of subspaces (dim must be divisible by M)
`Nbits`	`int`	Yes	1-16	Bits per subspace (typical: 8)

Returns:

*PQIndex: Initialized PQ index
error: Error if parameters are invalid or constraints violated

Parameter Guidelines:

M (Number of Subspaces):
├─ M = 8:    192x compression (typical, good balance)
├─ M = 16:    96x compression (better accuracy)
├─ M = 32:    48x compression (highest accuracy)
└─ Constraint: dim must be divisible by M

Nbits (Codebook size = 2^Nbits):
├─ Nbits = 4:   16 centroids per subspace (very fast)
├─ Nbits = 8:  256 centroids per subspace (typical)
├─ Nbits = 12: 4096 centroids (high accuracy)
└─ Constraint: 1 ≤ Nbits ≤ 16

Common Configurations:
┌──────┬───────┬───────────────┬─────────────────────┐
│  M   │ Nbits │ Compression   │ Use Case            │
├──────┼───────��───────────────┼────────────────��────┤
│   8  │   8   │ 192x (1 byte) │ Maximum compression │
│  16  │   8   │  96x (2 byte) │ Balanced            │
│  32  │   8   │  48x (4 byte) │ Better accuracy     │
│   8  │  12   │ 128x (1.5 B)  │ Higher quality      │
└──────┴───────┴───────────────┴─────────────────────┘

IVFPQIndex

Constructor:

func NewIVFPQIndex(dim int, distanceKind DistanceKind, nlist int, m int, nbits int) (*IVFPQIndex, error)

Parameters:

Parameter	Type	Required	Constraint	Description
`dim`	`int`	Yes	> 0	Vector dimension
`distanceKind`	`DistanceKind`	Yes	-	Distance metric: `Euclidean`, `L2Squared`, or `Cosine`
`nlist`	`int`	Yes	> 0	Number of IVF clusters
`m`	`int`	Yes	dim % m == 0	Number of PQ subspaces
`nbits`	`int`	Yes	1-16	Bits per PQ subspace

Returns:

*IVFPQIndex: Initialized IVFPQ index
error: Error if parameters are invalid

Parameter Guidelines:

Combined IVF + PQ Configuration:

IVF Parameters:
├─ nlist ≈ sqrt(n) for number of vectors
└─ See IVFIndex guidelines above

PQ Parameters:
├─ m: Typically 8, 16, or 32
└─ nbits: Typically 8

Common Configurations for 100M vectors (384-dim):
┌────────┬─────┬────────┬──────────┬─────────────────┐
│ nlist  │  m  │ nbits  │ Memory   │ Use Case        │
├────────┼─────┼────────┼──────────┼─────────────────┤
│  4096  │  8  │   8    │  ~800 MB │ Extreme speed   │
│  8192  │  8  │   8    │  ~900 MB │ Balanced        │
│ 16384  │ 16  │   8    │ ~1.6 GB  │ Better accuracy │
│  8192  │  8  │  12    │ ~1.2 GB  │ High quality    │
└────────┴─────��────────┴──────────┴─────────────────┘

Memory Savings Example (100M vectors, 384-dim):
├─ Original float32:  100M × 384 × 4 = 146 GB
├─ IVF only:          Still ~146 GB (no compression)
├─ PQ only:           100M × 8 × 1 = 760 MB (faster train)
└─ IVFPQ (m=8):       100M × 8 × 1 + centroids ≈ 800 MB (best of both!)

Search Parameters:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).       // IVF: clusters to search
    WithNRefine(100).    // Optional: refine top-100 with original vectors
    Execute()

BM25SearchIndex

Constructor:

func NewBM25SearchIndex() *BM25SearchIndex

Parameters: None (uses default BM25 parameters: k1=1.5, b=0.75)

Returns:

*BM25SearchIndex: Initialized BM25 full-text search index

BM25 Parameters:

k1 = 1.5: Term frequency saturation (typical range: 1.2-2.0)
b = 0.75: Document length normalization (typical range: 0.5-0.9)

RoaringMetadataIndex

Constructor:

func NewRoaringMetadataIndex() *RoaringMetadataIndex

Parameters: None

Returns:

*RoaringMetadataIndex: Initialized metadata filtering index using Roaring Bitmaps

HybridSearchIndex

Constructor:

func NewHybridSearchIndex(
    vectorIndex VectorIndex,
    textIndex TextIndex,
    metadataIndex MetadataIndex,
) HybridSearchIndex

Parameters:

Parameter	Type	Required	Description
`vectorIndex`	`VectorIndex`	Yes	Any vector index (Flat, HNSW, IVF, PQ, IVFPQ)
`textIndex`	`TextIndex`	Yes	BM25 text search index
`metadataIndex`	`MetadataIndex`	Yes	Roaring metadata filter index

Returns:

HybridSearchIndex: Initialized hybrid search combining all three modalities

Search APIs

Vector Search (VectorSearch Interface)

Creating a Search:

search := index.NewSearch()  // Available on all VectorIndex implementations

Methods:

Method	Parameters	Description
`WithQuery`	`queries ...[]float32`	Set query vector(s) for similarity search
`WithNodes`	`nodes ...uint32`	Find vectors similar to indexed nodes by ID
`WithK`	`k int`	Number of results to return (default: 10)
`WithDocumentIDs`	`ids ...uint32`	Pre-filter: only search within these document IDs
`WithScoreAggregation`	`agg ScoreAggregation`	How to combine multi-query results: `SumAggregation`, `MaxAggregation`, `MeanAggregation`
`WithThreshold`	`threshold float32`	Only return results with distance ≤ threshold
`WithAutocut`	`enabled bool`	Automatically determine optimal cutoff point
`Execute`	-	Run the search, returns `[]VectorResult, error`

HNSW-Specific Methods:

Method	Parameters	Description
`WithEfSearch`	`ef int`	Override default efSearch for this query

IVF-Specific Methods:

Method	Parameters	Description
`WithNProbe`	`nprobe int`	Number of clusters to search (1 to nlist)

IVFPQ-Specific Methods:

Method	Parameters	Description
`WithNProbe`	`nprobe int`	Number of clusters to search
`WithNRefine`	`nrefine int`	Refine top-N results with original vectors

Examples:

// Basic single-query search
results, err := index.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    Execute()

// Multi-query with aggregation
results, _ := index.NewSearch().
    WithQuery(query1, query2, query3).
    WithK(20).
    WithScoreAggregation(comet.MeanAggregation).
    Execute()

// Node-based search (find similar to existing vectors)
results, _ := index.NewSearch().
    WithNodes(1, 5, 10).  // IDs of indexed vectors
    WithK(10).
    Execute()

// Pre-filtered search
eligibleIDs := []uint32{100, 200, 300, 400, 500}
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// Distance threshold filtering
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithThreshold(0.5).  // Only distances ≤ 0.5
    Execute()

// Autocut (automatic result truncation)
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithAutocut(true).  // Returns fewer if quality drops
    Execute()

// HNSW with custom efSearch
hnswIndex, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 100)
results, _ := hnswIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithEfSearch(200).  // Higher accuracy for this query
    Execute()

// IVF with nProbe
ivfIndex, _ := comet.NewIVFIndex(384, 1000, comet.Cosine)
results, _ := ivfIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithNProbe(16).  // Search 16 nearest clusters
    Execute()

// IVFPQ with refinement
ivfpqIndex, _ := comet.NewIVFPQIndex(384, comet.Cosine, 8192, 8, 8)
results, _ := ivfpqIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithNProbe(32).
    WithNRefine(100).  // Refine top-100 with original vectors
    Execute()

Result Format:

type VectorResult struct {
    Node  VectorNode  // The matched vector node
    Score float32     // Distance score (lower = more similar)
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()
    vector := result.Node.Vector()
    fmt.Printf("ID: %d, Distance: %.4f\n", id, score)
}

Score Aggregation Strategies:

// When using multiple queries, results are aggregated:

comet.SumAggregation   // Sum of all distances (penalizes far results)
comet.MaxAggregation   // Maximum distance (pessimistic)
comet.MeanAggregation  // Average distance (balanced, recommended)

Text Search (TextSearch Interface)

Creating a Search:

search := textIndex.NewSearch()  // BM25SearchIndex

Methods:

Method	Parameters	Description
`WithQuery`	`query string`	Text query for keyword search
`WithK`	`k int`	Number of results to return (default: 10)
`WithDocumentIDs`	`ids ...uint32`	Pre-filter: only search within these document IDs
`WithAutocut`	`enabled bool`	Automatically determine optimal cutoff point
`Execute`	-	Run the search, returns `[]TextResult, error`

Examples:

// Basic text search
results, err := textIndex.NewSearch().
    WithQuery("machine learning algorithms").
    WithK(10).
    Execute()

// Pre-filtered text search
eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := textIndex.NewSearch().
    WithQuery("neural networks").
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// With autocut
results, _ := textIndex.NewSearch().
    WithQuery("deep learning").
    WithK(50).
    WithAutocut(true).  // Returns fewer if relevance drops
    Execute()

Result Format:

type TextResult struct {
    Node  TextNode  // The matched text node
    Score float32   // BM25 relevance score (higher = more relevant)
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()
    text := result.Node.Text()
    fmt.Printf("ID: %d, BM25: %.4f, Text: %s\n", id, score, text)
}

BM25 Scoring:

BM25 score components:
├─ IDF: Term rarity (rare terms score higher)
├─ TF:  Term frequency (with saturation via k1)
└─ DL:  Document length normalization (via b)

Higher score = more relevant
Typical range: 0-10+ (no upper bound)

Metadata Search (MetadataSearch Interface)

Creating a Search:

search := metadataIndex.NewSearch()  // RoaringMetadataIndex

Methods:

Method	Parameters	Description
`WithFilters`	`filters ...Filter`	Metadata filter conditions (AND logic)
`Execute`	-	Run the search, returns `[]MetadataResult, error`

Filter Functions:

Function	Parameters	Description
`Eq`	`field string, value interface{}`	Equality: field == value
`Lt`	`field string, value interface{}`	Less than: field < value
`Lte`	`field string, value interface{}`	Less than or equal: field ≤ value
`Gt`	`field string, value interface{}`	Greater than: field > value
`Gte`	`field string, value interface{}`	Greater than or equal: field ≥ value
`Between`	`field string, min, max interface{}`	Range: min ≤ field ≤ max

Examples:

// Single filter
results, err := metadataIndex.NewSearch().
    WithFilters(comet.Eq("category", "electronics")).
    Execute()

// Multiple filters (AND logic)
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "books"),
        comet.Gte("rating", 4.0),
        comet.Lte("price", 50.0),
        comet.Eq("in_stock", true),
    ).
    Execute()

// Range filter
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Between("year", 2020, 2024),
        comet.Eq("status", "published"),
    ).
    Execute()

// Numeric comparisons
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Gt("views", 1000),
        comet.Lt("price", 100),
    ).
    Execute()

// String equality
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Eq("author", "John Doe"),
        comet.Eq("language", "en"),
    ).
    Execute()

Result Format:

type MetadataResult struct {
    Node MetadataNode  // The matched metadata node
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()  // Always 0 (binary match)
    metadata := result.Node.Metadata()

    fmt.Printf("ID: %d\n", id)
    fmt.Printf("Category: %v\n", metadata["category"])
    fmt.Printf("Price: %v\n", metadata["price"])
}

Supported Data Types:

Supported field types:
├─ string:   Equality only (Eq)
├─ int/int64: All comparisons (Eq, Lt, Lte, Gt, Gte, Between)
├─ float32/float64: All comparisons
├─ bool:     Equality only (Eq)
���─ nil:      Equality only (Eq)

Hybrid Search (HybridSearch Interface)

Creating a Search:

search := hybridIndex.NewSearch()  // HybridSearchIndex

Methods:

Method	Parameters	Description
`WithVectorQuery`	`queries ...[]float32`	Semantic search queries (vector embeddings)
`WithTextQuery`	`query string`	Keyword search query (BM25)
`WithFilters`	`filters ...Filter`	Metadata filters (pre-filtering before search)
`WithK`	`k int`	Number of results to return per modality before fusion
`WithFusion`	`fusion FusionKind`	Score fusion strategy: `ReciprocalRankFusion` or `WeightedSumFusion`
`WithDocumentIDs`	`ids ...uint32`	Pre-filter: only search within these document IDs
`WithAutocut`	`enabled bool`	Automatically determine optimal cutoff point
`Execute`	-	Run hybrid search, returns `[]HybridResult, error`

Fusion Strategies:

Strategy	Description	When to Use
`ReciprocalRankFusion`	Rank-based fusion: `score = Σ(1/(k+rank))`	Default, works without tuning, handles different score scales
`WeightedSumFusion`	Linear combination of normalized scores	When you want to weight modalities differently

Examples:

// All three modalities
results, err := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("machine learning").
    WithFilters(
        comet.Eq("category", "ai"),
        comet.Gte("year", 2020),
    ).
    WithK(20).
    WithFusion(comet.ReciprocalRankFusion).
    Execute()

// Vector + Text only (no metadata filters)
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("neural networks").
    WithK(10).
    Execute()

// Vector + Metadata only (no text)
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithFilters(
        comet.Eq("category", "research"),
        comet.Lte("price", 0),  // Free papers
    ).
    WithK(15).
    Execute()

// Multi-query vector search with text
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(embedding1, embedding2, embedding3).
    WithTextQuery("deep learning transformers").
    WithK(20).
    WithFusion(comet.ReciprocalRankFusion).
    Execute()

// Pre-filtered hybrid search
eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("optimization").
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// With autocut
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("information retrieval").
    WithK(50).
    WithAutocut(true).
    Execute()

Result Format:

type HybridResult struct {
    ID          uint32   // Document ID
    FusedScore  float32  // Combined score from all modalities
    VectorScore float32  // Individual vector similarity score
    TextScore   float32  // Individual BM25 score
}

// Access results
for _, result := range results {
    fmt.Printf("ID: %d\n", result.ID)
    fmt.Printf("Fused Score: %.4f\n", result.FusedScore)
    fmt.Printf("Vector: %.4f, Text: %.4f\n",
        result.VectorScore, result.TextScore)
}

How Fusion Works:

Reciprocal Rank Fusion (RRF):
┌──────────────────────────────────────┐
│ Step 1: Get results from each modality
│   Vector: [id=1, id=5, id=7, ...]
│   Text:   [id=7, id=1, id=12, ...]
│
│ Step 2: Compute RRF score per document
│   RRF(doc) = Σ 1/(k + rank_i)
│   where k=60 (default), rank_i is rank in modality i
│
│ Example for doc_id=1:
│   Vector rank: 1 → 1/(60+1) = 0.0164
│   Text rank:   2 → 1/(60+2) = 0.0161
│   RRF score:       0.0164 + 0.0161 = 0.0325
│
│ Step 3: Sort by RRF score (higher = better)
└──────────────────────────────────────┘

Benefits:
├─ No score normalization needed
├─ Robust to different score scales
├─ Emphasizes top-ranked results
└─ Works out-of-the-box (no tuning)

Search Flow:

Hybrid Search Execution:
┌───────────────────────────────────────────┐
│ 1. Metadata Filtering (if filters present)│
│    → Get eligible document IDs            │
└────────────┬──────────────────────────────┘
             │
      ┌──────┴──────┐
      │             │
┌─────▼─────┐ ┌─────▼─────┐
│  Vector   │ │   Text    │
│  Search   │ │  Search   │
│ (top-K)   │ │ (top-K)   │
└─────┬─────┘ └─────┬─────┘
      │             │
      └──────┬──────┘
             │
       ┌─────▼─────┐
       │   Fusion  │
       │ (RRF/WS)  │
       └─────┬─────┘
             │
       ┌─────▼─────┐
       │  Results  │
       │ (sorted)  │
       └───────────┘

Use Case Examples:

// E-commerce: "Show me red dresses under $100 similar to this image"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(imageEmbedding).
    WithTextQuery("red dress").
    WithFilters(
        comet.Eq("category", "dresses"),
        comet.Lte("price", 100),
        comet.Eq("in_stock", true),
    ).
    WithK(20).
    Execute()

// Academic search: "Papers about 'attention mechanisms' similar to this abstract, published after 2020"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(abstractEmbedding).
    WithTextQuery("attention mechanisms transformers").
    WithFilters(
        comet.Gte("year", 2020),
        comet.Eq("peer_reviewed", true),
    ).
    WithK(50).
    Execute()

// Job search: "Backend engineer roles in San Francisco paying over $150k"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(resumeEmbedding).
    WithTextQuery("backend engineer golang kubernetes").
    WithFilters(
        comet.Eq("location", "San Francisco"),
        comet.Gte("salary", 150000),
        comet.Eq("remote", false),
    ).
    WithK(30).
    Execute()

Use Cases

Use Case 1: Semantic Document Search

Problem: A documentation platform needs to find relevant documents based on user queries, going beyond simple keyword matching to understand intent.

Solution: Use Comet's vector search with text embeddings from a sentence transformer model.

Implementation:

package main

import (
    "github.com/wizenheimer/comet"
    "log"
)

type Document struct {
    ID       uint32
    Title    string
    Content  string
    Embedding []float32  // From sentence-transformers
}

func main() {
    // Create HNSW vector store for 384-dim embeddings
    m, efC, efS := comet.DefaultHNSWConfig()
    index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efC, efS)

    // Add documents (embeddings generated by your ML model)
    docs := []Document{
        {Title: "Getting Started", Embedding: getEmbedding("getting started guide")},
        {Title: "API Reference", Embedding: getEmbedding("api documentation")},
        {Title: "Troubleshooting", Embedding: getEmbedding("common problems and solutions")},
    }

    for _, doc := range docs {
        node := comet.NewVectorNode(doc.Embedding)
        index.Add(*node)
    }

    // Search with user query
    queryEmbedding := getEmbedding("how do I start using this?")
    results, _ := index.NewSearch().
        WithQuery(queryEmbedding).
        WithK(5).
        Execute()

    // Return relevant documents
    for _, result := range results {
        log.Printf("Found: %s (similarity: %.4f)",
            docs[result.GetId()].Title, 1-result.GetScore())
    }
}

func getEmbedding(text string) []float32 {
    // Call your embedding model API (OpenAI, Cohere, local model, etc.)
    return nil  // Placeholder
}

Use Case 2: E-commerce Product Recommendations

Problem: Recommend similar products based on browsing history, with filtering by price, category, and availability.

Solution: Hybrid search combining product embeddings with metadata filters.

Implementation:

type Product struct {
    ID          uint32
    Name        string
    ImageVector []float32  // From image embedding model
    Price       float64
    Category    string
    InStock     bool
}

func RecommendSimilarProducts(
    productID uint32,
    maxPrice float64,
    category string,
) ([]Product, error) {
    // Setup hybrid store
    vecIdx, _ := comet.NewHNSWIndex(512, comet.Cosine, 16, 200, 200)
    metaIdx := comet.NewRoaringMetadataIndex()
    hybrid := comet.NewHybridSearchIndex(vecIdx, nil, metaIdx)

    // Get the product's embedding
    product := getProduct(productID)

    // Search with filters
    results, err := hybrid.NewSearch().
        WithVector(product.ImageVector).
        WithMetadata(
            comet.Eq("category", category),
            comet.Lte("price", maxPrice),
            comet.Eq("in_stock", true),
        ).
        WithK(10).
        Execute()

    if err != nil {
        return nil, err
    }

    // Convert to Product structs
    products := make([]Product, len(results))
    for i, result := range results {
        products[i] = getProduct(result.ID)
    }

    return products, nil
}

Use Case 3: Question-Answering System with Hybrid Retrieval

Problem: Build a QA system that combines semantic understanding (vector search) with keyword precision (BM25) for better answer retrieval.

Solution: Use RRF fusion to combine both modalities.

Implementation:

func AnswerQuestion(question string) ([]string, error) {
    // Create hybrid store
    vecIdx, _ := comet.NewFlatIndex(768, comet.Cosine)  // BERT embeddings
    txtIdx := comet.NewBM25SearchIndex()
    hybrid := comet.NewHybridSearchIndex(vecIdx, txtIdx, nil)

    // Index knowledge base
    for _, doc := range knowledgeBase {
        embedding := getBERTEmbedding(doc.Text)
        hybrid.Add(embedding, doc.Text, nil)
    }

    // Search with both modalities
    questionEmbedding := getBERTEmbedding(question)
    results, err := hybrid.NewSearch().
        WithVector(questionEmbedding).        // Semantic similarity
        WithText(question).                    // Keyword matching
        WithFusionKind(comet.ReciprocalRankFusion).  // Combine rankings
        WithK(5).
        Execute()

    if err != nil {
        return nil, err
    }

    // Extract answer passages
    answers := make([]string, len(results))
    for i, result := range results {
        answers[i] = knowledgeBase[result.ID].Text
    }

    return answers, nil
}

Contributing

Contributions are welcome! Please follow these guidelines:

Development Setup

# Clone repository
git clone https://github.com/wizenheimer/comet.git
cd comet

# Install dependencies
make deps

# Run tests
make test

# Run linter
make lint

# Check everything
make check

Code Style

Follow standard Go conventions (gofmt, go vet)
Write descriptive comments for exported functions
Include examples in documentation comments
Add tests for new features (maintain >95% coverage)
Keep functions focused and composable
Use meaningful variable names

Commit Messages

Follow conventional commits format:

Good commit messages:

feat: Add IVF index with k-means clustering
fix: Handle zero vectors in cosine distance
perf: Optimize HNSW search with heap pooling
docs: Update API reference for hybrid search
test: Add benchmarks for BM25 search

Bad commit messages:

Update code
Fix bug
WIP
asdf

Pull Request Process

Fork the repository and create a feature branch
Write your code following the style guidelines
Add tests for all new functionality
Run the full test suite and ensure all tests pass
Update documentation (README, godoc comments)
Commit with descriptive messages using conventional commits
Push to your fork and open a Pull Request
Respond to review feedback promptly

Code Review Checklist

Before submitting a PR, ensure:

All tests pass (make test)
Linter shows no issues (make lint)
Coverage remains >95% (make test-coverage)
Documentation is updated
Benchmarks show no regression (if applicable)
No breaking changes (unless discussed)

License

MIT License - see LICENSE file for details

References

HNSW Algorithm: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
Product Quantization: Product quantization for nearest neighbor search
BM25 Ranking: The Probabilistic Relevance Framework: BM25 and Beyond
Roaring Bitmaps: Better bitmap performance with Roaring bitmaps
Reciprocal Rank Fusion: Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods

Documentation ¶

Overview ¶

Package comet implements a BM25-based full-text search index.

WHAT IS BM25? BM25 (Best Matching 25) is a probabilistic ranking function used to estimate the relevance of documents to a given search query. It is one of the most widely used ranking functions in information retrieval.

HOW BM25 WORKS: For a given query Q with terms {t1, t2, ..., tn} and document D: 1. Tokenizes and normalizes both query and documents using UAX#29 word segmentation 2. For each query term, calculates:

IDF (Inverse Document Frequency): log((N - df + 0.5) / (df + 0.5) + 1) where N is total docs and df is docs containing the term
TF component: (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen))) where tf is term frequency in the document

3. Final score is the sum of (IDF × TF) for all query terms

KEY PARAMETERS:

K1 (default 1.2): Controls term frequency saturation. Higher values mean term frequency has more impact on the score. Typical range: 1.2-2.0
B (default 0.75): Controls document length normalization. 0 means no normalization (all docs treated equally), 1 means full normalization. Typical range: 0.75

TIME COMPLEXITY:

Add: O(m) where m is the number of tokens in the document
Search: O(q × d) where q is query tokens and d is average docs per term
Remove: O(m) where m is the number of tokens in the document

MEMORY REQUIREMENTS: - Stores inverted index (term -> docIDs) using roaring bitmaps for compression - Stores term frequencies (term -> docID -> count) - Stores document lengths and tokens (not full text) - Much more memory efficient than storing full document text

GUARANTEES & TRADE-OFFS: ✓ Pros:

Excellent relevance ranking for text search
Handles term frequency and document length well
Fast search using inverted index
Memory efficient (doesn't store original text)
Thread-safe for concurrent use

✗ Cons:

Requires tokenization and normalization
Cannot retrieve original document text
Updates require document replacement (remove + add)

WHEN TO USE: Use BM25 index when: 1. You need full-text search with relevance ranking 2. You want fast keyword-based search 3. Memory efficiency is important (vs storing full text) 4. You have your own document store and just need search

Package comet provides a high-performance hybrid vector search library for Go.

Comet combines multiple indexing strategies and search modalities into a unified, efficient package. It supports semantic search (vector embeddings), full-text search (BM25), metadata filtering, and hybrid search with score fusion.

Overview ¶

Comet is built for developers who want to understand how vector databases work from the inside out. It provides production-ready implementations of modern vector search algorithms with comprehensive documentation and examples.

Quick Start ¶

Create a vector index and perform similarity search:

package main

import (
    "fmt"
    "log"

    "github.com/wizenheimer/comet"
)

func main() {
    // Create a flat index for 384-dimensional vectors using cosine distance
    index, err := comet.NewFlatIndex(384, comet.Cosine)
    if err != nil {
        log.Fatal(err)
    }

    // Add vectors to the index
    vec1 := make([]float32, 384)
    // ... populate vec1 with your embedding ...
    node := comet.NewVectorNode(vec1)
    if err := index.Add(*node); err != nil {
        log.Fatal(err)
    }

    // Search for similar vectors
    query := make([]float32, 384)
    // ... populate query vector ...
    results, err := index.NewSearch().
        WithQuery(query).
        WithK(10).
        Execute()

    if err != nil {
        log.Fatal(err)
    }

    // Process results
    for i, result := range results {
        fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
    }
}

Vector Storage Types ¶

Comet provides five vector index implementations, each with different tradeoffs:

FlatIndex: Brute-force exhaustive search with 100% recall. Best for small datasets (<10K vectors) or when perfect accuracy is required.

index, _ := comet.NewFlatIndex(384, comet.Cosine)

HNSWIndex: Hierarchical graph-based search with 95-99% recall and O(log n) performance. Best for most production workloads (10K-10M vectors).

m, efConstruction, efSearch := comet.DefaultHNSWConfig()
index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efConstruction, efSearch)

IVFIndex: Inverted file index using k-means clustering with 85-95% recall. Best for large datasets (>100K vectors) with moderate accuracy requirements.

index, _ := comet.NewIVFIndex(384, comet.Cosine, 100) // 100 clusters
index.Train(trainingVectors) // Training required

PQIndex: Product quantization for massive memory compression (10-500x smaller) with 70-85% recall. Best for memory-constrained environments.

m, nbits := comet.CalculatePQParams(384)
index, _ := comet.NewPQIndex(384, comet.Cosine, m, nbits)
index.Train(trainingVectors) // Training required

IVFPQIndex: Combines IVF and PQ for maximum scalability with 70-90% recall. Best for billion-scale datasets.

m, nbits := comet.CalculatePQParams(384)
index, _ := comet.NewIVFPQIndex(384, comet.Cosine, 100, m, nbits)
index.Train(trainingVectors) // Training required

Distance Metrics ¶

Three distance metrics are supported:

Euclidean (L2): Measures absolute spatial distance. Use when magnitude matters.

index, _ := comet.NewFlatIndex(384, comet.Euclidean)

L2Squared: Squared Euclidean distance (faster, preserves ordering). Use for better performance when only relative distances matter.

index, _ := comet.NewFlatIndex(384, comet.L2Squared)

Cosine: Measures angular similarity, independent of magnitude. Use for normalized vectors like text embeddings.

index, _ := comet.NewFlatIndex(384, comet.Cosine)

Full-Text Search ¶

BM25-based full-text search with Unicode tokenization:

index := comet.NewBM25SearchIndex()
index.Add(1, "the quick brown fox jumps over the lazy dog")
index.Add(2, "pack my box with five dozen liquor jugs")

results, _ := index.NewSearch().
    WithQuery("quick fox").
    WithK(10).
    Execute()

Metadata Filtering ¶

Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes:

index := comet.NewRoaringMetadataIndex()

// Add documents with metadata
node := comet.NewMetadataNode(map[string]interface{}{
    "category": "electronics",
    "price": 999,
    "in_stock": true,
})
index.Add(*node)

// Filter by metadata
results, _ := index.NewSearch().
    WithFilter(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
        comet.Eq("in_stock", true),
    ).
    Execute()

Hybrid Search ¶

Combine vector, text, and metadata search with score fusion:

// Create indexes
vecIdx, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 200)
txtIdx := comet.NewBM25SearchIndex()
metaIdx := comet.NewRoaringMetadataIndex()

// Create hybrid index
hybrid := comet.NewHybridSearchIndex(vecIdx, txtIdx, metaIdx)

// Add documents
id, _ := hybrid.Add(
    embedding,  // 384-dim vector
    "machine learning tutorial",  // text
    map[string]interface{}{  // metadata
        "category": "education",
        "price": 29.99,
    },
)

// Hybrid search with all modalities
results, _ := hybrid.NewSearch().
    WithVector(queryEmbedding).
    WithText("machine learning").
    WithMetadata(
        comet.Eq("category", "education"),
        comet.Lt("price", 50),
    ).
    WithFusionKind(comet.ReciprocalRankFusion).
    WithK(10).
    Execute()

Score Fusion Strategies ¶

When combining results from multiple search modalities, different fusion strategies are available:

WeightedSumFusion: Linear combination with configurable weights

config := &comet.FusionConfig{
    VectorWeight: 0.7,
    TextWeight: 0.3,
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)

ReciprocalRankFusion: Rank-based fusion (scale-independent, recommended)

fusion, _ := comet.NewFusion(comet.ReciprocalRankFusion, nil)

MaxFusion/MinFusion: Simple maximum or minimum across modalities

fusion, _ := comet.NewFusion(comet.MaxFusion, nil)

Serialization ¶

All indexes support persistence:

// Save index
file, _ := os.Create("index.bin")
defer file.Close()
index.WriteTo(file)

// Load index
file, _ := os.Open("index.bin")
defer file.Close()
index, _ := comet.NewFlatIndex(384, comet.Cosine)
index.ReadFrom(file)

Performance Tuning ¶

HNSW parameters for tuning search quality:

// Higher M = better recall, more memory
// Lower M = faster search, less memory
m := 16  // connections per layer (default: 16)

// Higher efConstruction = better graph quality, slower build
efConstruction := 200  // build quality (default: 200)

// Higher efSearch = better recall, slower search
efSearch := 200  // search quality (default: 200)

index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efConstruction, efSearch)

IVF parameters for tuning speed/accuracy:

nClusters := 100  // more clusters = faster search, lower recall
index, _ := comet.NewIVFIndex(384, comet.Cosine, nClusters)

// At search time
nProbes := 8  // more probes = better recall, slower search
results, _ := index.NewSearch().
    WithQuery(query).
    WithK(10).
    WithNProbes(nProbes).
    Execute()

Thread Safety ¶

All indexes are safe for concurrent use. Multiple goroutines can search simultaneously while one goroutine adds or removes vectors.

Use Cases ¶

Document Search: Use vector embeddings for semantic search in documentation, knowledge bases, or content management systems.

Product Recommendations: Combine product image embeddings with metadata filters for personalized recommendations.

Question Answering: Use hybrid search (vector + BM25) for retrieval-augmented generation (RAG) systems.

Duplicate Detection: Use high-recall vector search to find near-duplicate documents or images.

Multi-modal Search: Combine text, image embeddings, and structured metadata for comprehensive search experiences.

Best Practices ¶

Choose the right index type:

<10K vectors: Use FlatIndex
10K-10M vectors: Use HNSWIndex
>10M vectors: Use IVFIndex or IVFPQIndex
Memory constrained: Use PQIndex or IVFPQIndex

Use appropriate distance metrics:

Text embeddings: Use Cosine
Image features: Use L2 or L2Squared
Custom embeddings: Depends on how they were trained

Batch operations:

Add vectors in batches for better performance
Call Flush() periodically after Remove() operations

Training indexes:

Use representative sample (10K-100K vectors) for IVF/PQ training
Training sample should cover the data distribution

Metadata filtering:

Apply filters before vector search to reduce candidates
Use equality filters for categorical data
Use range filters for numeric data

Documentation and Examples ¶

For detailed API documentation, see the godoc comments on each type and function.

For more examples and use cases, visit: https://github.com/wizenheimer/comet

License ¶

MIT License - Copyright (c) 2025 wizenheimer ¶

Package comet implements a k-Nearest Neighbors (kNN) flat index for similarity search.

WHAT IS A FLAT INDEX? A flat index is the most naive and simple approach to similarity search. The term "flat" indicates that vectors are stored without any compression or transformation - they are stored "as-is" in their original form. This is also known as brute-force or exhaustive search.

HOW kNN WORKS: For a given query vector Q, the algorithm: 1. Calculates the distance from Q to EVERY vector in the dataset 2. Sorts all distances 3. Returns the k vectors with the smallest distances

TIME COMPLEXITY:

Training: O(1) - No training phase required! This is one of the few ML algorithms that doesn't need training.
Search: O(m*n) where:
n = number of vectors in the dataset
m = dimensionality of each vector For each of the n vectors, we need O(m) time to calculate the distance.

MEMORY REQUIREMENTS: - 4 bytes per float32 component - Total per vector: 4 * d bytes (where d is the dimensionality) - No compression, so memory scales linearly with dataset size

GUARANTEES & TRADE-OFFS: ✓ Pros:

100% accuracy - always finds the true nearest neighbors
No training phase required
Simple to implement and understand

✗ Cons:

Slow for large datasets (exhaustive search)
No memory compression
Not scalable (O(mn) time complexity)

WHEN TO USE: Use flat index only when: 1. Dataset size or embedding dimensionality is relatively small 2. You MUST have 100% accuracy (e.g., fingerprint matching, security applications) 3. Speed is not a critical concern

Package comet implements HNSW (Hierarchical Navigable Small World).

WHAT IS HNSW? HNSW is a state-of-the-art graph-based algorithm for approximate nearest neighbor search. It builds a multi-layered graph where search is O(log n) - incredibly fast!

HIERARCHICAL SKIP-LIST-LIKE STRUCTURE ¶

Layer 2: Few nodes, long-range connections (highways) Layer 1: More nodes, medium-range connections (state roads) Layer 0: All nodes, short-range connections (local streets)

Search starts at top layer and descends, getting more refined at each level!

PERFORMANCE:

Search: O(log n) with ~95-99% recall
Memory: 2-3x raw vectors (stores full precision + graph)
Build: O(log n) per insertion
Use case: When speed and accuracy are critical

TIME COMPLEXITY:

Insert: O(M × efConstruction × log n)
Search: O(M × efSearch × log n)
Typical: 10-50 distance calculations for 1M vectors!

Package comet implements a hybrid search index that combines vector, text, and metadata search.

WHAT IS HYBRIDSEARCHINDEX? HybridSearchIndex is a facade that provides a unified interface over three specialized indexes: 1. VectorIndex: For semantic similarity search using vector embeddings 2. TextIndex: For keyword-based BM25 full-text search 3. MetadataIndex: For filtering by structured metadata attributes

HOW IT WORKS: The index maintains three separate indexes internally and coordinates search across them. When searching, it follows this flow: 1. Apply metadata filters first (if any) to get candidate document IDs 2. Pass candidate IDs to vector and/or text search for relevance ranking 3. Combine results from multiple search modes using score aggregation

SEARCH MODES: - Vector-only: Semantic similarity search using embeddings - Text-only: Keyword-based BM25 search - Metadata-only: Pure filtering without ranking - Hybrid: Combine any or all of the above with score aggregation

WHEN TO USE: Use HybridSearchIndex when: 1. You need to combine multiple search modalities 2. You want to filter by metadata before expensive vector search 3. You need both semantic and keyword-based search 4. You want a simple unified API instead of managing multiple indexes

Package comet implements a k-Nearest Neighbors (kNN) IVF index for similarity search.

WHAT IS AN IVF INDEX? IVF (Inverted File Index) is a partitioning-based approximate nearest neighbor search algorithm. It divides the vector space into Voronoi cells using k-means clustering, then searches only the nearest cells instead of scanning all vectors.

HOW IVF WORKS: Training Phase: 1. Run k-means on training vectors to learn nlist cluster centroids 2. These centroids define Voronoi partitions of the vector space

Indexing Phase: 1. For each vector, find its nearest centroid 2. Add the vector to that centroid's inverted list

Search Phase: 1. Find the nprobe nearest centroids to the query vector 2. Search only the vectors in those nprobe inverted lists 3. Return the top-k nearest neighbors from candidates

TIME COMPLEXITY:

Training: O(iterations × nlist × n × dim) - k-means clustering
Add: O(nlist × dim) - find nearest centroid
Search: O(nprobe × (nlist/nprobe) × dim + nprobe × (n/nlist) × dim) ≈ O(nprobe × dim + (nprobe/nlist) × n × dim)

MEMORY REQUIREMENTS: - Vectors: 4 × n × dim bytes (stored as-is) - Centroids: 4 × nlist × dim bytes - Lists: negligible overhead (just pointers) - Total: ~4 × (n + nlist) × dim bytes

ACCURACY VS SPEED TRADEOFF: - nprobe = 1: Fastest, lowest recall (~30-50%) - nprobe = sqrt(nlist): Good balance (~70-90% recall) - nprobe = nlist: Same as flat search (100% recall)

CHOOSING NLIST: Rule of thumb: nlist = sqrt(n) or nlist = 4*sqrt(n) - For 1M vectors: nlist = 1,000 to 4,000 - For 100K vectors: nlist = 316 to 1,264 - For 10K vectors: nlist = 100 to 400

WHEN TO USE IVF: Use IVF when: 1. Dataset is large (>10K vectors) 2. You can tolerate ~90-95% recall (not 100%) 3. You want 10-100x speedup over flat search 4. Memory usage is not a primary concern

DON'T use IVF when: 1. Dataset is small (<10K vectors) - use flat index 2. You need 100% recall - use flat index 3. Memory is very limited - use PQ or IVFPQ

Package comet implements IVFPQ (Inverted File with Product Quantization).

WHAT IS IVFPQ? IVFPQ combines IVF (scope reduction) with PQ (compression) to create one of the most powerful similarity search algorithms. It's the workhorse of large-scale vector search systems.

RESIDUAL VECTORS IVFPQ encodes RESIDUALS (vector - centroid) instead of original vectors. This dramatically improves compression quality because:

Residuals are centered near 0 (low variance)
Single set of codebooks works for all clusters
Better approximation than encoding original vectors

PERFORMANCE:

Speed: 10-100x faster than flat (from IVF)
Memory: 100-500x compression (from PQ on residuals)
Accuracy: 85-95% recall with proper tuning

TIME COMPLEXITY:

Training: O(IVF_kmeans + PQ_kmeans)
Add: O(nlist × dim + M × K × dsub)
Search: O(nlist + nprobe × (M × K × dsub + candidates × M))

Package comet implements a metadata filtering index for vector search.

WHAT IS A METADATA INDEX? A metadata index enables fast filtering of documents based on structured metadata attributes before performing expensive vector similarity searches. This dramatically improves search performance by reducing the candidate set.

HOW IT WORKS: The index uses two specialized data structures: 1. Roaring Bitmaps: For categorical fields (strings, booleans)

Memory-efficient compressed bitmaps
Fast set operations (AND, OR, NOT)

2. Bit-Sliced Index (BSI): For numeric fields (integers, floats)

Enables range queries without scanning all documents
Fast comparison operations (>, <, =, BETWEEN)

QUERY TYPES: - Equality: field = value - Inequality: field != value - Comparisons: field > value, field >= value, field < value, field <= value - Range: field BETWEEN min AND max - Set membership: field IN (val1, val2, val3) - Set exclusion: field NOT IN (val1, val2) - Existence: field EXISTS, field NOT EXISTS

TIME COMPLEXITY:

Add: O(m) where m is the number of metadata fields
Query: O(f × log(n)) where f is filter count, n is documents
Remove: O(m) where m is the number of metadata fields

MEMORY REQUIREMENTS: - Roaring bitmaps: Highly compressed, typically 1-10% of uncompressed size - BSI: ~64 bits per numeric value (compressed with roaring) - Much more efficient than traditional B-tree indexes for high-cardinality data

GUARANTEES & TRADE-OFFS: ✓ Pros:

Extremely fast filtering (microseconds for millions of documents)
Memory efficient with compression
Supports complex boolean queries
Thread-safe for concurrent use

✗ Cons:

Only supports exact matches and ranges (no fuzzy matching)
Floats converted to integers (precision loss)
Updates require full document replacement

WHEN TO USE: Use metadata index when: 1. Pre-filtering documents before vector search 2. Need to filter by structured attributes (price, date, category, etc.) 3. Working with large datasets (100K+ documents) 4. Need sub-millisecond filter performance

Package comet implements Product Quantization (PQ) for similarity search.

WHAT IS PRODUCT QUANTIZATION? PQ is a lossy compression technique that dramatically reduces memory usage for vector storage while enabling approximate similarity search. It achieves compression ratios of 10-500x by dividing vectors into subspaces and quantizing each independently.

THE CORE IDEA - DIVIDE AND COMPRESS: Instead of storing full high-dimensional vectors: 1. Divide each vector into M equal-sized subvectors (subspaces) 2. Learn a codebook of K centroids for each subspace via k-means 3. Encode each subvector with the ID of its nearest centroid 4. Store only these compact codes instead of original vectors

COMPRESSION EXAMPLE: Original: 768 dims × 4 bytes = 3,072 bytes PQ (M=8, K=256): 8 subspaces × 1 byte = 8 bytes Compression: 384x smaller!

TIME COMPLEXITY:

Training: O(M × iterations × K × n × dsub) where dsub = dim/M
Add: O(M × K × dsub) per vector
Search: O(M × K × dsub + n × M) - table build + lookups

WHEN TO USE PQ: - Dataset too large for RAM - Can tolerate 85-95% recall - L2 or inner product metric - Want massive compression

Index ¶

Constants ¶

View Source

const (
	// K1 controls term frequency saturation (typical range: 1.2-2.0)
	K1 = 1.2
	// B controls document length normalization (0 = no normalization, 1 = full normalization)
	B = 0.75
)

BM25 parameters for ranking

View Source

const (
	// UnassignedCluster indicates a vector hasn't been assigned to any cluster yet
	UnassignedCluster = -1
)

Variables ¶

View Source

var (
	// DefaultMaxIter is the default maximum number of iterations for k-means clustering.
	DefaultMaxIter = 20
)

View Source

var ErrUnknownDistanceKind = errors.New("unknown distance kind")

ErrUnknownDistanceKind is returned when an unknown distance kind is provided to NewDistance.

View Source

var ErrZeroVector = errors.New("zero vector not allowed for this metric")

ErrZeroVector is returned when a zero vector is provided for a metric that doesn't support it.

Functions ¶

func Autocut ¶

func Autocut(yValues []float32, cutOff int) int

Autocut determines optimal cutoff point in a score distribution.

It analyzes the normalized difference between actual scores and ideal linear distribution to find local maxima (extrema). Returns the index before the Nth extremum where N is the cutOff parameter.

Parameters:

yValues: slice of score values (typically distances or similarity scores)
cutOff: number of extrema to encounter before cutting

Returns the index at which to cut the results.

func AutocutResults ¶

func AutocutResults[T Result](results []T, cutoff int) []T

AutocutResults applies autocut algorithm to determine optimal result cutoff. This generic function works with any result type that implements the Result interface.

It extracts scores from the result slice and uses the Autocut algorithm to find the optimal cutoff point based on the score distribution.

Type parameter T must be a type that implements the Result interface.

Parameters:

results: slice of results implementing Result interface
cutoff: number of extrema to find before cutting (-1 disables autocut)

Returns the sliced results up to the autocut point. If cutoff is -1, returns results unchanged (no-op).

Usage:

vectorResults = AutocutResults(vectorResults, 1)  // Apply autocut with 1 extremum
textResults = AutocutResults(textResults, -1)     // No-op, returns all results

func CalculatePQParams ¶

func CalculatePQParams(dim int) (M int, Nbits int)

CalculatePQParams returns recommended PQ parameters for a given dimension. A neat utility function to get the recommended PQ parameters for a given dimension. Returns:

M: Number of subquantizers (subspaces) that divides dim evenly
Nbits: Bits per PQ code (default: 8, giving K=256 centroids per subspace)

func DefaultHNSWConfig ¶

func DefaultHNSWConfig() (m, efConstruction, efSearch int)

DefaultHNSWConfig returns recommended default configuration parameters.

Returns:

m: connections per layer (16) - layer 0 uses 2*M connections
efConstruction: candidate list size during construction (200)
efSearch: candidate list size during search (200)

func FindNearestCentroidIndex ¶

func FindNearestCentroidIndex(v []float32, centroids [][]float32, distance Distance) int

FindNearestCentroidIndex finds the index of the nearest centroid to a given vector.

This is a utility function that performs a linear search through all centroids to find the one with minimum distance to the query vector.

Parameters:

v: Query vector to find nearest centroid for
centroids: Slice of centroid vectors to search through
distance: Distance metric to use for comparison

Returns:

int: Index of the nearest centroid (0 to len(centroids)-1)

Time Complexity: O(k × dim) where k = len(centroids)

func KMeans ¶

func KMeans(vectors [][]float32, k int, distance Distance, maxIter int) (centroids [][]float32, vectorToClusterMapping []int)

KMeans performs k-means clustering to learn cluster centroids.

K-MEANS CLUSTERING ALGORITHM ¶

K-means partitions data into k clusters by iteratively refining cluster assignments and centroid positions. Each cluster is represented by its centroid (center point).

Algorithm Steps:

INITIALIZATION: Select k initial centroid positions (uniform spacing)
ASSIGNMENT: Assign each vector to its nearest centroid
UPDATE: Recompute centroids as the mean of assigned vectors
REPEAT: Steps 2-3 until convergence or max iterations

CONVERGENCE: The algorithm converges when assignments stop changing, meaning all vectors are assigned to their nearest centroid and centroids are at the center of their clusters. Typical convergence: 5-20 iterations for most datasets.

VORONOI PARTITIONS: K-means naturally creates Voronoi partitions where each cluster forms a Voronoi cell. Vectors in each cluster are guaranteed to be closer to their centroid than to any other centroid.

TIME COMPLEXITY: O(iterations × k × n × dim) where:

iterations: number of iterations until convergence (typically 10-20)
k: number of clusters
n: number of training vectors
dim: dimensionality

For 100K training vectors, 768 dims, k=316, 10 iterations:

About 243 billion floating point operations
Takes seconds to minutes depending on hardware

Parameters:

vectors: Training vectors to cluster (each is []float32)
k: Number of clusters to create
distance: Distance metric to use for computing distances
maxIter: Maximum number of k-means iterations (typical: 20-100)

Returns:

[][]float32: k learned centroids that define the clusters
[]int: cluster assignments for each input vector (vector i -> cluster assignments[i])

func KMeansSubspace ¶

func KMeansSubspace(vectors [][]float32, k int, maxIter int) (centroids [][]float32, vectorToClusterMapping []int)

KMeansSubspace performs k-means clustering on subspace vectors for codebook learning.

K-MEANS FOR SUBSPACE CLUSTERING ¶

This function learns k centroids that best represent the distribution of vectors in a lower-dimensional subspace. Commonly used for Product Quantization (PQ) codebook learning and other subspace quantization methods.

SUBSPACE CLUSTERING ADVANTAGES:

Faster: fewer dimensions to process
More effective: less curse of dimensionality
Parallelizable: independent across subspaces

ALGORITHM:

INITIALIZATION: Choose k initial centroids using uniform sampling
ASSIGNMENT: Assign each subvector to nearest centroid (using squared L2 distance)
UPDATE: Recompute centroids as mean of assigned subvectors
REPEAT: Steps 2-3 until convergence or max iterations

CONVERGENCE: Converges when no assignments change. Typically takes 3-10 iterations for most subspaces due to lower dimensionality providing more stability.

DISTANCE METRIC: Uses squared L2 distance (Euclidean) - standard for:

Product Quantization
Vector quantization
Codebook learning

TIME COMPLEXITY: O(iterations × k × n × dsub) where:

iterations: number of k-means iterations (typically 5-20)
k: number of centroids (often 256 = 2^8)
n: number of training subvectors
dsub: subspace dimension (e.g., 768/8 = 96)

For 100K training vectors, dsub=96, k=256, 10 iterations:

About 2.5 billion floating point operations per subspace
Takes milliseconds to seconds per subspace

Parameters:

vectors: Subspace vectors to cluster (all from same subspace)
k: Number of centroids to learn (typically 256 = 2^8 bits)
maxIter: Maximum k-means iterations (typical: 20-50)

Returns:

[][]float32: k learned centroids for this subspace (the codebook)
[]int: cluster assignments for each input subvector

func LimitResults ¶

func LimitResults[T Result](results []T, k int) []T

LimitResults applies k-limiting to any result slice that implements the Result interface. This generic function works with both VectorResult and TextResult.

Type parameter T must be a type that implements the Result interface.

Usage:

vectorResults := LimitResults(vectorResults, requestedK)
textResults := LimitResults(textResults, requestedK)

func Norm ¶

func Norm(v []float32) float32

Norm computes the L2 norm (Euclidean length/magnitude) of a vector. This represents the "length" of the vector from the origin in n-dimensional space.

The norm is useful for:

Measuring vector magnitude
Normalizing vectors to unit length
Computing distances and similarities

Formula: sqrt(sum(v[i]^2))

Example:

v := []float32{3, 4}
length := Norm(v)  // Returns 5.0 (3²+4² = 25, sqrt(25) = 5)

Time complexity: O(n) where n is the vector dimension

func Normalize ¶

func Normalize(v []float32) []float32

Normalize returns a new vector with the same direction as v but with unit length (magnitude = 1). The original vector is not modified.

Normalization is essential for:

Cosine similarity calculations (removes magnitude, keeps direction)
Comparing vectors by direction only
Machine learning feature scaling
Unit vector representations

Special case:

If the input is a zero vector (all elements are 0), returns the zero vector unchanged to avoid division by zero and NaN values

Formula: result = v / ||v|| where ||v|| is the L2 norm

Example:

v := []float32{3, 4}
unit := Normalize(v)     // Returns [0.6, 0.8] (magnitude = 1)
zero := []float32{0, 0}
safe := Normalize(zero)  // Returns [0, 0] (safely handles zero vector)

Time complexity: O(n) where n is the vector dimension

func NormalizeInPlace ¶

func NormalizeInPlace(v []float32)

NormalizeInPlace normalizes the vector to unit length in-place, modifying the original vector. This is more memory-efficient than Normalize() as it doesn't allocate a new vector.

Use this when:

You don't need to keep the original vector
Memory efficiency is important
Processing large batches of vectors

Special case:

If the input is a zero vector (all elements are 0), the vector remains unchanged to avoid division by zero and NaN values

Formula: v[i] = v[i] / ||v|| for all i

Example:

v := []float32{3, 4}
NormalizeInPlace(v)      // v is now [0.6, 0.8] (magnitude = 1)

zero := []float32{0, 0}
NormalizeInPlace(zero)   // zero remains [0, 0] (safely handles zero vector)

Time complexity: O(n) where n is the vector dimension

func ReturnDocumentFilter ¶

func ReturnDocumentFilter(filter *DocumentFilter)

ReturnDocumentFilter returns a document filter to the pool for reuse. This should be called after the filter is no longer needed to reduce allocations. Do not use the filter after calling this method.

func Scale ¶

func Scale(v []float32, scalar float32) []float32

Scale returns a new vector with all elements multiplied by the given scalar. The original vector is not modified.

Use cases:

Scaling vectors to a desired magnitude
Implementing weighted vectors
Normalizing vectors (when combined with Norm)

Formula: result[i] = v[i] * scalar

Example:

v := []float32{1, 2, 3}
doubled := Scale(v, 2.0)        // Returns [2, 4, 6]
inverted := Scale(v, -1.0)      // Returns [-1, -2, -3]
normalized := Scale(v, 1.0/Norm(v))  // Returns unit vector

Time complexity: O(n) where n is the vector dimension

Types ¶

type BM25SearchIndex ¶

type BM25SearchIndex struct {
	// contains filtered or unexported fields
}

BM25SearchIndex is a full-text search index that uses BM25 scoring for relevance ranking. It maintains an inverted index using roaring bitmaps for efficient storage and retrieval. All methods are safe for concurrent use by multiple goroutines.

The index stores only document IDs and tokens for efficient memory usage. Applications should maintain their own document store and use the returned document IDs to retrieve full text.

func NewBM25SearchIndex ¶

func NewBM25SearchIndex() *BM25SearchIndex

NewBM25SearchIndex creates and returns a new empty BM25SearchIndex. The index stores only document IDs and tokens for efficient memory usage. Applications should maintain their own document store and use the returned document IDs to retrieve full text.

Returns:

*BM25SearchIndex: A new empty index ready to accept documents

Example:

idx := NewBM25SearchIndex()
idx.Add(1, "the quick brown fox")
results := idx.NewSearch().WithQuery("fox").WithK(10).Execute()

func (*BM25SearchIndex) Add ¶

func (ix *BM25SearchIndex) Add(id uint32, text string) error

Add indexes a document with the given docID and text. If a document with the same ID already exists, it will be replaced. This method is safe for concurrent use.

Note: The index does NOT store the original text, only tokens for efficient memory usage.

Parameters:

id: Document ID (must be unique)
text: Document text to index

Returns:

error: Always returns nil (exists to satisfy TextIndex interface)

Time Complexity: O(m) where m is the number of tokens in the text

Thread-safety: Acquires exclusive lock

Example:

err := idx.Add(1, "the quick brown fox jumps over the lazy dog")

func (*BM25SearchIndex) Flush ¶

func (ix *BM25SearchIndex) Flush() error

Flush performs hard delete of soft-deleted documents.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted documents are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Physically removes all soft-deleted documents from all data structures 2. Updates inverted index (postings), term frequencies, document lengths 3. Reclaims memory occupied by deleted documents 4. Clears the deleted documents bitmap

COST: O(d × m) where d = number of deleted docs, m = avg tokens per doc

Thread-safety: Acquires exclusive write lock

Returns:

error: Always returns nil

func (*BM25SearchIndex) NewSearch ¶

func (ix *BM25SearchIndex) NewSearch() TextSearch

NewSearch creates a new search builder for this index.

Returns:

TextSearch: A new search builder ready to be configured

Example:

results, err := idx.NewSearch().
	WithQuery("quick brown").
	WithK(5).
	Execute()

func (*BM25SearchIndex) ReadFrom ¶

func (ix *BM25SearchIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a BM25SearchIndex from an io.Reader.

This method reconstructs a BM25SearchIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("bm25_index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("bm25_index.bin")
idx2 := NewBM25SearchIndex()
idx2.ReadFrom(file)
file.Close()

func (*BM25SearchIndex) Remove ¶

func (ix *BM25SearchIndex) Remove(id uint32) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if document exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from all data structures (expensive O(m)), we mark as deleted in roaring bitmap. Deleted documents are:

Skipped during search
Still in internal data structures
Not counted as active documents

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

id: Document ID to remove

Returns:

error: Always returns nil (exists to satisfy TextIndex interface)

Time Complexity: O(log n) for bitmap operation (vs O(m) for hard delete)

Thread-safety: Uses read lock for validation, write lock for modification

func (*BM25SearchIndex) WriteTo ¶

func (ix *BM25SearchIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the BM25SearchIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted documents are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "BM25" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Statistics:

numDocs (4 bytes)
totalTokens (4 bytes)
avgDocLen (8 bytes as float64)

4. Document lengths (map[uint32]int):

Count (4 bytes)
For each entry: docID (4 bytes) + length (4 bytes)

5. Document tokens (map[uint32][]string):

Count (4 bytes)
For each entry:
docID (4 bytes)
Token count (4 bytes)
For each token: token length (4 bytes) + token bytes

6. Postings (map[string]*roaring.Bitmap):

Count (4 bytes)
For each entry:
Term length (4 bytes) + term bytes
Bitmap size (4 bytes) + bitmap bytes

7. Term frequencies (map[string]map[uint32]int):

Count (4 bytes)
For each term:
Term length (4 bytes) + term bytes
Doc count (4 bytes)
For each doc: docID (4 bytes) + frequency (4 bytes)

8. Deleted docs bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type CompressedVector ¶

type CompressedVector struct {
	// Node is the original VectorNode
	Node VectorNode

	// Code is the PQ-encoded residual (M bytes)
	Code []uint8
}

CompressedVector represents a PQ-encoded residual in an inverted list.

type Distance ¶

type Distance interface {
	// Calculate computes the distance between two vectors a and b.
	// The vectors must have the same dimensionality.
	// Returns a float32 representing the distance (lower values = more similar).
	Calculate(a, b []float32) float32

	// CalculateBatch computes distances from multiple query vectors to a single target vector.
	// This is more efficient than calling Calculate multiple times as it can optimize
	// computations (e.g., precomputing norms for cosine distance).
	//
	// Parameters:
	//   - queries: slice of query vectors (each vector is []float32)
	//   - target: single target vector to compare against
	//
	// Returns:
	//   - slice of distances where result[i] is the distance from queries[i] to target
	//
	// All vectors (queries and target) must have the same dimensionality.
	CalculateBatch(queries [][]float32, target []float32) []float32

	// PreprocessInPlace preprocesses the target vector in-place for the distance metric.
	// For cosine distance, this normalizes the vector to unit length.
	// For euclidean distance, this is a no-op.
	// Returns an error if the vector is invalid for this metric (e.g., zero vector for cosine).
	PreprocessInPlace(target []float32) error

	// Preprocess preprocesses the target vector for the distance metric, returning a new vector.
	// For cosine distance, this returns a normalized copy.
	// For euclidean distance, this returns the original vector unchanged.
	// Returns an error if the vector is invalid for this metric (e.g., zero vector for cosine).
	Preprocess(target []float32) ([]float32, error)
}

Distance is the interface for computing distances between vectors. Implementations provide different distance metrics for vector similarity search.

func NewDistance ¶

func NewDistance(t DistanceKind) (Distance, error)

NewDistance returns a singleton Distance implementation for the specified metric type. The returned instances are stateless and safe for concurrent use across goroutines. Returns ErrUnknownDistanceKind if the distance kind is not recognized.

Example:

dist, err := NewDistance(Euclidean)
if err != nil {
    log.Fatal(err)
}
distance := dist.Calculate([]float32{1, 2, 3}, []float32{4, 5, 6})

type DistanceKind ¶

type DistanceKind string

DistanceKind represents the type of distance metric to use for vector comparisons. Different distance metrics are suitable for different use cases: - Euclidean (L2): Measures absolute spatial distance between points - L2Squared: Squared Euclidean distance (faster, preserves ordering) - Cosine: Measures angular similarity, independent of magnitude

const (
	// Euclidean (L2) distance measures the straight-line distance between two points.
	// Use this when the magnitude of vectors matters.
	// Formula: sqrt(sum((a[i] - b[i])^2))
	Euclidean DistanceKind = "l2"

	// L2Squared (squared Euclidean) distance measures the squared distance between two points.
	// This is faster than L2 as it avoids the sqrt operation.
	// Use this when you only need to compare distances (ordering is preserved).
	// Formula: sum((a[i] - b[i])^2)
	L2Squared DistanceKind = "l2_squared"

	// Cosine distance measures the angular difference between vectors (1 - cosine similarity).
	// Use this when you care about direction but not magnitude (e.g., text embeddings).
	// Formula: 1 - (dot(a,b) / (||a|| * ||b||))
	// Range: [0, 2] where 0 = identical direction, 1 = orthogonal, 2 = opposite
	Cosine DistanceKind = "cosine"
)

type DocumentFilter ¶

type DocumentFilter struct {
	// contains filtered or unexported fields
}

DocumentFilter provides efficient document ID filtering for vector search. It uses roaring bitmaps for fast membership testing during search operations.

func NewDocumentFilter ¶

func NewDocumentFilter(documentIDs []uint32) *DocumentFilter

NewDocumentFilter creates a new document filter from a list of document IDs. If the document IDs list is empty, returns nil (no filtering). The filter should be returned to the pool using ReturnDocumentFilter when done.

func (*DocumentFilter) Count ¶

func (f *DocumentFilter) Count() uint64

Count returns the number of eligible documents. Returns 0 if filter is nil (meaning all documents are eligible).

func (*DocumentFilter) IsEligible ¶

func (f *DocumentFilter) IsEligible(docID uint32) bool

IsEligible checks if a document ID is eligible for search. If filter is nil, all documents are eligible. Otherwise, checks if the document ID exists in the filter bitmap.

func (*DocumentFilter) IsEmpty ¶

func (f *DocumentFilter) IsEmpty() bool

IsEmpty returns true if no documents are eligible. Returns false if filter is nil (all documents eligible).

func (*DocumentFilter) ShouldSkip ¶

func (f *DocumentFilter) ShouldSkip(docID uint32) bool

ShouldSkip returns true if the document should be skipped (not eligible). This is a convenience method for use in loops with continue statements.

type Filter ¶

type Filter struct {
	Field    string
	Operator Operator
	Value    interface{}
	Value2   interface{} // Used for range queries
}

Filter represents a query filter

func AnyOf ¶

func AnyOf(field string, values ...interface{}) Filter

AnyOf creates an OR group for multiple values on the same field (alias for In)

func Between ¶

func Between(field string, min, max interface{}) Filter

Between is an alias for Range

func Eq ¶

func Eq(field string, value interface{}) Filter

Eq creates an equality filter

func Exists ¶

func Exists(field string) Filter

Exists checks if a field exists (has any value)

func Gt ¶

func Gt(field string, value interface{}) Filter

Gt creates a greater-than filter

func Gte ¶

func Gte(field string, value interface{}) Filter

Gte creates a greater-than-or-equal filter

func In ¶

func In(field string, values ...interface{}) Filter

In creates an in-set filter

func IsNotNull ¶

func IsNotNull(field string) Filter

IsNotNull checks if a field is not null/exists

func IsNull ¶

func IsNull(field string) Filter

IsNull checks if a field is null/doesn't exist

func Lt ¶

func Lt(field string, value interface{}) Filter

Lt creates a less-than filter

func Lte ¶

func Lte(field string, value interface{}) Filter

Lte creates a less-than-or-equal filter

func Ne ¶

func Ne(field string, value interface{}) Filter

Ne creates a not-equal filter

func NoneOf ¶

func NoneOf(field string, values ...interface{}) Filter

NoneOf creates a NOT IN filter (alias for NotIn)

func Not ¶

func Not(filter Filter) Filter

Not creates a negation filter by inverting the operator

func NotExists ¶

func NotExists(field string) Filter

NotExists checks if a field doesn't exist

func NotIn ¶

func NotIn(field string, values ...interface{}) Filter

NotIn creates a not-in-set filter

func Range ¶

func Range(field string, min, max interface{}) Filter

Range creates a range filter [min, max]

type FilterGroup struct {
	Filters []Filter
	Logic   LogicOperator
}

FilterGroup represents a group of filters with a logical operator

type FlatIndex ¶

type FlatIndex struct {
	// contains filtered or unexported fields
}

FlatIndex represents a flat (brute-force) kNN index.

This is the simplest form of similarity search index where vectors are stored with metric-appropriate preprocessing applied. For cosine distance, vectors are normalized to unit length. For euclidean distance, vectors are stored as-is.

Thread-safety: This index is safe for concurrent use through a read-write mutex. Multiple readers can search simultaneously, but Add operations are exclusive.

func NewFlatIndex ¶

func NewFlatIndex(dim int, distanceKind DistanceKind) (*FlatIndex, error)

NewFlatIndex creates a new flat (kNN) index with the specified dimensionality and distance metric.

This is the only "training" step for a flat index, which simply initializes the data structure. Unlike other indexes (IVF, HNSW, PQ), there's no actual training phase or preprocessing required.

Parameters:

dim: The dimensionality of vectors that will be stored. All vectors must have this exact dimension. For example, if you're using 384-dimensional embeddings from a sentence transformer model, dim should be 384.
distanceKind: The distance metric to use for similarity comparison.

Returns:

*FlatIndex: A new flat index ready to accept vectors via Add()
error: Returns error if dim <= 0 or distance kind is invalid

Time Complexity: O(1) Memory: O(1) initially, grows to O(n*d) where n is number of vectors added

Example:

idx, err := NewFlatIndex(384, Cosine)  // For sentence embeddings
if err != nil { log.Fatal(err) }

func (*FlatIndex) Add ¶

func (idx *FlatIndex) Add(vector VectorNode) error

Add adds a vector to the flat index.

Vectors are preprocessed according to the distance metric:

For cosine distance: vectors are normalized to unit length in-place
For euclidean distance: vectors are stored as-is (preprocessing is a no-op)

This preprocessing happens once during insertion, making all subsequent distance calculations more efficient by eliminating redundant norm computations and divisions.

Parameters:

vector: Vector to add. Must have dimensionality matching idx.dim

Returns:

error: Returns error if vector has wrong dimension or preprocessing fails (e.g., zero vector for cosine)

Time Complexity: O(m) where m = dimensionality

For L2 metric: Just appends vector (preprocessing is no-op)
For cosine metric: Normalizes vector then appends

Thread-safety: Acquires exclusive lock, blocking all searches during addition

func (*FlatIndex) Dimensions ¶

func (idx *FlatIndex) Dimensions() int

Dimensions returns the dimensionality of vectors stored in this index.

Returns:

int: The dimensionality (number of components in each vector)

func (*FlatIndex) DistanceKind ¶

func (idx *FlatIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric used by this index.

Returns:

DistanceKind: The distance metric (L2, Cosine, or Dot)

func (*FlatIndex) Flush ¶

func (idx *FlatIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted nodes are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from the vectors slice 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = number of vectors in the index

Thread-safety: Acquires exclusive write lock

func (*FlatIndex) Kind ¶

func (idx *FlatIndex) Kind() VectorIndexKind

Kind returns the type of this index.

Returns:

VectorIndexKind: Always returns FlatIndex

func (*FlatIndex) NewSearch ¶

func (idx *FlatIndex) NewSearch() VectorSearch

NewSearch creates a new search builder for this index.

Returns:

VectorSearch: A new search builder ready to be configured

func (*FlatIndex) ReadFrom ¶

func (idx *FlatIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a FlatIndex from an io.Reader.

This method reconstructs a FlatIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewFlatIndex(384, Cosine)
idx2.ReadFrom(file)
file.Close()

func (*FlatIndex) Remove ¶

func (idx *FlatIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from the vectors slice (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

Skipped during search
Still in vectors slice
Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

vector: Vector to remove (only the ID field is used for matching)

Returns:

error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*FlatIndex) Train ¶

func (idx *FlatIndex) Train(vectors []VectorNode) error

Train is a no-op for flat index since vectors are stored in memory. This method exists to satisfy the VectorIndex interface.

Returns:

error: Always returns nil

func (*FlatIndex) Trained ¶

func (idx *FlatIndex) Trained() bool

Trained returns true if the index has been trained This is a no-op for flat index since it doesn't need training

func (*FlatIndex) WriteTo ¶

func (idx *FlatIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the FlatIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "FLAT" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Dimensionality (4 bytes) 4. Distance kind length (4 bytes) + distance kind string 5. Number of vectors (4 bytes) 6. For each vector:

Vector ID (4 bytes)
Vector dimension (4 bytes)
Vector data (dim * 4 bytes as float32)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type FullPrecisionQuantizer ¶

type FullPrecisionQuantizer struct{}

FullPrecisionQuantizer stores vectors in full 32-bit floating point.

Memory: 4 bytes per dimension Accuracy: Full IEEE 754 single precision Training: Not required

This is essentially a no-op quantizer that provides the interface for consistency while maintaining full precision.

func (*FullPrecisionQuantizer) Dequantize ¶

func (q *FullPrecisionQuantizer) Dequantize(stored any) ([]float32, error)

func (*FullPrecisionQuantizer) IsTrained ¶

func (q *FullPrecisionQuantizer) IsTrained() bool

func (*FullPrecisionQuantizer) Quantize ¶

func (q *FullPrecisionQuantizer) Quantize(vector []float32) (any, error)

func (*FullPrecisionQuantizer) Train ¶

func (q *FullPrecisionQuantizer) Train(vectors [][]float32)

func (*FullPrecisionQuantizer) Type ¶

func (q *FullPrecisionQuantizer) Type() QuantizerType

type Fusion ¶

type Fusion interface {
	// Kind returns the kind of fusion strategy
	Kind() FusionKind

	// Combine takes separate result sets from vector and text search and
	// combines them into a single ranked list
	//
	// Parameters:
	//   - vectorResults: Results from vector search (map of docID -> score)
	//   - textResults: Results from text search (map of docID -> score)
	//
	// Returns:
	//   - Combined results as map of docID -> score
	Combine(vectorResults map[uint32]float64, textResults map[uint32]float64) map[uint32]float64
}

Fusion defines how to combine scores from different search modalities (vector search, text search) into a single ranking.

Different fusion strategies handle score normalization and combination differently: - WeightedSum: Direct weighted combination of scores - ReciprocalRank: Rank-based fusion (more robust to score scale differences) - Max/Min: Simple maximum or minimum across modalities

func DefaultFusion ¶

func DefaultFusion() Fusion

DefaultFusion returns the default fusion strategy (WeightedSum)

func NewFusion ¶

func NewFusion(kind FusionKind, config *FusionConfig) (Fusion, error)

NewFusion creates a new fusion strategy with the given configuration

type FusionConfig ¶

type FusionConfig struct {
	// VectorWeight is the weight for vector search scores (used by WeightedSumFusion)
	VectorWeight float64

	// TextWeight is the weight for text search scores (used by WeightedSumFusion)
	TextWeight float64

	// K is the constant used in Reciprocal Rank Fusion (default: 60)
	// Lower k gives more weight to top-ranked items
	K float64
}

FusionConfig holds configuration for fusion strategies

func DefaultFusionConfig ¶

func DefaultFusionConfig() *FusionConfig

DefaultFusionConfig returns the default fusion configuration

type FusionKind ¶

type FusionKind string

FusionKind defines the type of score fusion strategy for combining results from multiple search modalities (vector, text, metadata).

const (
	// WeightedSumFusion combines scores using weighted sum
	// finalScore = vectorScore * vectorWeight + textScore * textWeight
	WeightedSumFusion FusionKind = "weighted_sum"

	// ReciprocalRankFusion (RRF) combines results based on their ranks
	// More robust to differences in score scales between modalities
	ReciprocalRankFusion FusionKind = "reciprocal_rank"

	// MaxFusion takes the maximum score across modalities
	MaxFusion FusionKind = "max"

	// MinFusion takes the minimum score across modalities
	MinFusion FusionKind = "min"
)

type HNSWIndex ¶

type HNSWIndex struct {

	// M is connections per layer (except layer 0)
	M int
	// contains filtered or unexported fields
}

HNSWIndex implements HNSW (Hierarchical Navigable Small World).

Memory layout:

Vectors: n × dim × 4 bytes
Graph: n × M × avgLayers × 4 bytes (uint32 IDs)
Total: typically 2-3x raw vector size

Thread-safety: All public methods use read-write mutex.

func NewHNSWIndex ¶

func NewHNSWIndex(dim int, distanceKind DistanceKind, m, efConstruction, efSearch int) (*HNSWIndex, error)

NewHNSWIndex creates a new HNSW index.

Creates an empty multi-layered graph. Vectors can be added immediately (no training required unlike IVF/IVFPQ).

Parameters:

dim: Vector dimensionality
distanceKind: Distance metric
m: Connections per layer (except layer 0 which uses 2*M). Typical: 12-48, pass 0 for default (16)
efConstruction: Candidate list size during construction. Higher = better graph, slower build. Typical: 100-500, pass 0 for default (200)
efSearch: Candidate list size during search. Higher = better recall, slower search. Pass 0 for default (200)

Returns:

*HNSWIndex: New empty HNSW index
error: Returns error if parameters invalid

func (*HNSWIndex) Add ¶

func (idx *HNSWIndex) Add(vector VectorNode) error

Add adds a vector to the index by inserting into the graph.

CONCURRENCY OPTIMIZATION: All expensive operations (validation, level assignment, node preparation) are performed OUTSIDE the lock to minimize critical section time.

Algorithm:

Assign random level (outside lock)
Prepare node structure (outside lock)
Critical section: graph modification
Insert and connect

func (*HNSWIndex) Dimensions ¶

func (idx *HNSWIndex) Dimensions() int

Dimensions returns vector dimensionality.

func (*HNSWIndex) DistanceKind ¶

func (idx *HNSWIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*HNSWIndex) Flush ¶

func (idx *HNSWIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted nodes are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Removes all edges pointing to deleted nodes 2. Deletes nodes from memory 3. Updates entry point if needed 4. Rebuilds graph structure

COST: O(n × M × L) - expensive, so batch deletions

func (*HNSWIndex) Kind ¶

func (idx *HNSWIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*HNSWIndex) NewSearch ¶

func (idx *HNSWIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*HNSWIndex) ReadFrom ¶

func (idx *HNSWIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an HNSWIndex from an io.Reader.

This method reconstructs an HNSWIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("hnsw_index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("hnsw_index.bin")
m, efC, efS := DefaultHNSWConfig()
idx2, _ := NewHNSWIndex(384, Cosine, m, efC, efS)
idx2.ReadFrom(file)
file.Close()

func (*HNSWIndex) Remove ¶

func (idx *HNSWIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing (expensive O(n × M × L)), we mark as deleted in roaring bitmap. Deleted nodes are:

Skipped during search
Still in graph structure
Not counted as active nodes

Call Flush() periodically for actual cleanup.

func (*HNSWIndex) SetEfSearch ¶

func (idx *HNSWIndex) SetEfSearch(ef int)

SetEfSearch adjusts search-time candidate list size.

func (*HNSWIndex) Train ¶

func (idx *HNSWIndex) Train(vectors []VectorNode) error

Train is a no-op for HNSW (no training required).

func (*HNSWIndex) Trained ¶

func (idx *HNSWIndex) Trained() bool

Trained always returns true (HNSW doesn't require training).

func (*HNSWIndex) WriteTo ¶

func (idx *HNSWIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the HNSWIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted nodes are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "HNSW" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

Dimensionality (4 bytes)
Distance kind length (4 bytes) + distance kind string
M (4 bytes)
efConstruction (4 bytes)
efSearch (4 bytes)
levelMult (8 bytes as float64)

4. Graph structure:

maxLevel (4 bytes)
entryPoint (4 bytes)

5. Number of nodes (4 bytes) 6. For each node:

Node ID (4 bytes)
Level (4 bytes)
Vector dimension (4 bytes)
Vector data (dim * 4 bytes as float32)
Number of edge layers (4 bytes)
For each layer:
Number of edges (4 bytes)
Edge IDs (n * 4 bytes)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type HalfPrecisionQuantizer ¶

type HalfPrecisionQuantizer struct{}

HalfPrecisionQuantizer compresses vectors to 16-bit floating point.

Memory: 2 bytes per dimension (50% savings vs float32) Accuracy: IEEE 754 half precision (1 sign, 5 exp, 10 mantissa bits) Training: Not required

Trade-off: Significant memory savings with minimal accuracy loss for most use cases. Values are stored as uint16 bit representations.

func (*HalfPrecisionQuantizer) Dequantize ¶

func (q *HalfPrecisionQuantizer) Dequantize(stored any) ([]float32, error)

func (*HalfPrecisionQuantizer) IsTrained ¶

func (q *HalfPrecisionQuantizer) IsTrained() bool

func (*HalfPrecisionQuantizer) Quantize ¶

func (q *HalfPrecisionQuantizer) Quantize(vector []float32) (any, error)

func (*HalfPrecisionQuantizer) Train ¶

func (q *HalfPrecisionQuantizer) Train(vectors [][]float32)

func (*HalfPrecisionQuantizer) Type ¶

func (q *HalfPrecisionQuantizer) Type() QuantizerType

type HybridSearch ¶

type HybridSearch interface {
	// WithVector sets the query vector for vector search
	WithVector(query []float32) HybridSearch

	// WithText sets the query text for text search
	WithText(queries ...string) HybridSearch

	// WithMetadata sets metadata filters
	WithMetadata(filters ...Filter) HybridSearch

	// WithMetadataGroups sets complex metadata filter groups
	WithMetadataGroups(groups ...*FilterGroup) HybridSearch

	// WithK sets the number of results to return
	WithK(k int) HybridSearch

	// WithNProbes sets the number of probes for IVF-based vector indexes
	WithNProbes(nProbes int) HybridSearch

	// WithEfSearch sets the efSearch parameter for HNSW search
	WithEfSearch(efSearch int) HybridSearch

	// WithThreshold sets a score threshold for results
	WithThreshold(threshold float32) HybridSearch

	// WithScoreAggregation sets the strategy for aggregating scores
	WithScoreAggregation(kind ScoreAggregationKind) HybridSearch

	// WithCutoff sets the autocut parameter for result cutoff
	WithCutoff(cutoff int) HybridSearch

	// WithFusion sets the fusion strategy for combining vector and text scores
	WithFusion(fusion Fusion) HybridSearch

	// WithFusionKind sets the fusion strategy by kind with default config
	WithFusionKind(kind FusionKind) HybridSearch

	// Execute performs the search and returns results
	Execute() ([]HybridSearchResult, error)
}

HybridSearch encapsulates the search context for the hybrid index

type HybridSearchIndex ¶

type HybridSearchIndex interface {
	// Add adds a document with its vector, text, and metadata to the index
	// The ID is auto-generated and returned
	Add(vector []float32, text string, metadata map[string]interface{}) (uint32, error)

	// AddWithID adds a document with a specific ID
	AddWithID(id uint32, vector []float32, text string, metadata map[string]interface{}) error

	// Remove removes a document from all indexes
	Remove(id uint32) error

	// NewSearch creates a new search builder
	NewSearch() HybridSearch

	// Train trains the vector index (required for some index types like IVF, PQ)
	Train(vectors [][]float32) error

	// Flush flushes all indexes
	Flush() error

	// VectorIndex returns the underlying vector index
	VectorIndex() VectorIndex

	// TextIndex returns the underlying text index
	TextIndex() TextIndex

	// MetadataIndex returns the underlying metadata index
	MetadataIndex() MetadataIndex

	// Serialization support
	// WriteTo writes each component to a separate writer for maximum flexibility
	// Usage: WriteTo(hybridWriter, vectorWriter, textWriter, metadataWriter)
	WriteTo(hybridWriter, vectorWriter, textWriter, metadataWriter io.Writer) error

	// ReadFrom reads from a combined reader (use io.MultiReader to combine separate readers)
	io.ReaderFrom
}

HybridSearchIndex provides a unified interface for multi-modal search

func NewHybridSearchIndex ¶

func NewHybridSearchIndex(vectorIndex VectorIndex, textIndex TextIndex, metadataIndex MetadataIndex) HybridSearchIndex

NewHybridSearchIndex creates a new hybrid search index

Parameters:

vectorIndex: The vector index to use (can be nil if not using vector search)
textIndex: The text index to use (can be nil if not using text search)
metadataIndex: The metadata index to use (can be nil if not using metadata search)

Returns:

HybridSearchIndex: A new hybrid search index

Example:

vecIdx, _ := NewFlatIndex(384, Cosine)
txtIdx := NewBM25SearchIndex()
metaIdx := NewRoaringMetadataIndex()
idx := NewHybridSearchIndex(vecIdx, txtIdx, metaIdx)

type HybridSearchResult ¶

type HybridSearchResult struct {
	ID uint32 // Document ID
	// Score is float64 (not float32) because:
	// 1. Fusion combines scores from multiple sources (vector + text) with arithmetic operations
	// 2. RRF uses fractional calculations (1.0 / (k + rank)) requiring higher precision
	// 3. Better numerical stability when sorting results with very similar scores
	// 4. Prevents cumulative rounding errors from weighted sums and aggregations
	Score float64 // Combined relevance score
}

HybridSearchResult represents a search result from the hybrid index

func (HybridSearchResult) GetId ¶

func (r HybridSearchResult) GetId() uint32

func (HybridSearchResult) GetScore ¶

func (r HybridSearchResult) GetScore() float32

type IVFIndex ¶

type IVFIndex struct {
	// contains filtered or unexported fields
}

IVFIndex represents an Inverted File index for approximate nearest neighbor search.

The index partitions the vector space into nlist Voronoi cells using k-means clustering. Each cell has an inverted list containing vectors assigned to that cell's centroid. During search, only the nearest nprobe cells are examined, providing a speed-accuracy tradeoff.

Thread-safety: This index is safe for concurrent use through a read-write mutex. Multiple readers can search simultaneously, but training and adding are exclusive.

func NewIVFIndex ¶

func NewIVFIndex(dim int, nlist int, distanceKind DistanceKind) (*IVFIndex, error)

NewIVFIndex creates a new IVF index with the specified parameters.

The index must be trained with representative vectors before it can be used. Training learns the Voronoi partitions (cluster centroids) via k-means clustering.

Parameters:

dim: The dimensionality of vectors. All vectors must have this exact dimension.
nlist: Number of Voronoi cells (clusters). Typical: sqrt(n) or 4*sqrt(n).
distanceKind: The distance metric for similarity comparison.

Returns:

*IVFIndex: A new IVF index ready to be trained
error: Returns error if parameters are invalid

Choosing nlist:

Too small: Poor speed improvement (searching too many vectors per cell)
Too large: Poor recall (query's neighbors scattered across many cells)
Rule of thumb: nlist = sqrt(expected_dataset_size)

Example:

idx, err := NewIVFIndex(384, 316, Cosine)  // For ~100K vectors
if err != nil { log.Fatal(err) }

// Must train before use
err = idx.Train(trainingVectors)

func (*IVFIndex) Add ¶

func (idx *IVFIndex) Add(vector VectorNode) error

Add adds a vector to the IVF index.

The vector is assigned to the nearest centroid and added to that centroid's inverted list. This is a simple O(nlist × dim) operation.

Parameters:

vector: Vector to add. Must have dimensionality matching idx.dim

Returns:

error: Returns error if index not trained, wrong dimension, or preprocessing fails

Time Complexity: O(nlist × dim) - compute distance to all centroids

Thread-safety: Acquires exclusive lock, blocking all searches during addition

func (*IVFIndex) Dimensions ¶

func (idx *IVFIndex) Dimensions() int

Dimensions returns the dimensionality of vectors stored in this index.

func (*IVFIndex) DistanceKind ¶

func (idx *IVFIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric used by this index.

func (*IVFIndex) Flush ¶

func (idx *IVFIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted nodes are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from all inverted lists 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = total number of vectors across all lists

Thread-safety: Acquires exclusive write lock

func (*IVFIndex) Kind ¶

func (idx *IVFIndex) Kind() VectorIndexKind

Kind returns the type of this index.

func (*IVFIndex) NewSearch ¶

func (idx *IVFIndex) NewSearch() VectorSearch

NewSearch creates a new search builder for this index.

Returns:

VectorSearch: A new search builder ready to be configured

func (*IVFIndex) ReadFrom ¶

func (idx *IVFIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an IVFIndex from an io.Reader.

This method reconstructs an IVFIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewIVFIndex(384, 100, Cosine)
idx2.ReadFrom(file)
file.Close()

func (*IVFIndex) Remove ¶

func (idx *IVFIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from inverted lists (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

Skipped during search
Still in inverted lists
Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

vector: Vector to remove (only the ID field is used for matching)

Returns:

error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*IVFIndex) Train ¶

func (idx *IVFIndex) Train(vectors []VectorNode) error

Train learns the Voronoi partitions (cluster centroids) using k-means clustering.

Training is required before the index can accept vectors or perform searches. The training vectors should be representative of the full dataset distribution.

Algorithm: 1. Run k-means clustering with k=nlist on the training vectors 2. Store the learned centroids 3. Mark the index as trained

Parameters:

vectors: Training vectors used to learn cluster centroids. Should be at least nlist vectors, ideally 10-100x more.

Returns:

error: Returns error if insufficient training vectors or training fails

Training Tips:

Use at least max(nlist, 1000) training vectors
More training vectors = better centroids = better recall
Training vectors should represent your full dataset's distribution
Can use a random sample of your full dataset

Time Complexity: O(iterations × nlist × n × dim) where n = len(vectors) Typical training time: seconds to minutes depending on dataset size

Example:

// Sample 100K vectors from your dataset for training
trainingVectors := sampleVectors(allVectors, 100000)
err := idx.Train(trainingVectors)

func (*IVFIndex) Trained ¶

func (idx *IVFIndex) Trained() bool

Trained returns true if the index has been trained

func (*IVFIndex) WriteTo ¶

func (idx *IVFIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the IVFIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "IVFX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

Dimensionality (4 bytes)
Distance kind length (4 bytes) + distance kind string
nlist (4 bytes) - number of IVF clusters
trained (1 byte) - whether index is trained

4. Centroids (only if trained):

For each of nlist centroids:
Centroid size (4 bytes)
Centroid data (dim * 4 bytes as float32)

5. Number of inverted lists (4 bytes) 6. For each inverted list:

List size (4 bytes)
For each vector:
Vector ID (4 bytes)
Vector data (dim * 4 bytes as float32)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type IVFPQIndex ¶

type IVFPQIndex struct {

	// M is number of PQ subspaces
	M int

	// Nbits is bits per PQ code
	Nbits int

	// Ksub is centroids per subspace (K = 2^Nbits)
	Ksub int
	// contains filtered or unexported fields
}

IVFPQIndex represents an IVFPQ index.

Memory layout:

IVF centroids: nlist × dim × 4 bytes
PQ codebooks: M × K × (dim/M) × 4 bytes
Compressed vectors: n × M bytes

func NewIVFPQIndex ¶

func NewIVFPQIndex(dim int, distanceKind DistanceKind, nlist int, m int, nbits int) (*IVFPQIndex, error)

NewIVFPQIndex creates a new IVFPQ index.

Parameters:

dim: Vector dimensionality (must be divisible by M)
distanceKind: Distance metric
nlist: Number of IVF clusters (rule of thumb: sqrt(n))
m: Number of PQ subspaces (must divide dimension evenly)
nbits: Bits per PQ code (most common: 8, which gives K=256)

Returns:

*IVFPQIndex: New untrained IVFPQ index
error: Returns error if parameters invalid

func (*IVFPQIndex) Add ¶

func (idx *IVFPQIndex) Add(vector VectorNode) error

Add encodes vectors as residuals and adds to inverted lists.

Algorithm:

Find nearest IVF centroid
Compute residual = vector - centroid
Encode residual with PQ
Store code in inverted list

Parameters:

vector: Vector to add

Returns:

error: Returns error if not trained or dimension mismatch

func (*IVFPQIndex) Dimensions ¶

func (idx *IVFPQIndex) Dimensions() int

Dimensions returns vector dimensionality.

func (*IVFPQIndex) DistanceKind ¶

func (idx *IVFPQIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*IVFPQIndex) Flush ¶

func (idx *IVFPQIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted nodes are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from all inverted lists 2. Reclaims memory occupied by deleted vectors and their PQ codes 3. Clears the deleted nodes bitmap

COST: O(n) where n = total number of vectors across all lists

Thread-safety: Acquires exclusive write lock

func (*IVFPQIndex) Kind ¶

func (idx *IVFPQIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*IVFPQIndex) NewSearch ¶

func (idx *IVFPQIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*IVFPQIndex) ReadFrom ¶

func (idx *IVFPQIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an IVFPQIndex from an io.Reader.

This method reconstructs an IVFPQIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewIVFPQIndex(384, Cosine, 100, 8, 8)
idx2.ReadFrom(file)
file.Close()

func (*IVFPQIndex) Remove ¶

func (idx *IVFPQIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from inverted lists (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

Skipped during search
Still in inverted lists
Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

vector: Vector to remove (only the ID field is used for matching)

Returns:

error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*IVFPQIndex) Train ¶

func (idx *IVFPQIndex) Train(vectors []VectorNode) error

Train learns IVF clusters and PQ codebooks on residuals.

THE TRAINING ALGORITHM:

Run k-means to create nlist IVF clusters
Assign vectors to nearest centroids
Compute residuals = vector - centroid
Train PQ codebooks on ALL residuals together

The key innovation: Training PQ on residuals instead of original vectors. Residuals have much less variance, enabling better compression.

Parameters:

vectors: Training vectors (need at least nlist*10)

Returns:

error: Returns error if insufficient training data

func (*IVFPQIndex) Trained ¶

func (idx *IVFPQIndex) Trained() bool

Trained returns true if the index has been trained

func (*IVFPQIndex) WriteTo ¶

func (idx *IVFPQIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the IVFPQIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "IVPQ" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

Dimensionality (4 bytes)
Distance kind length (4 bytes) + distance kind string
nlist (4 bytes) - number of IVF clusters
M (4 bytes) - number of PQ subspaces
Nbits (4 bytes) - bits per PQ code
Ksub (4 bytes) - centroids per subspace
dsub (4 bytes) - subspace dimension
trained (1 byte) - whether index is trained

4. Centroids (only if trained):

For each of nlist centroids:
Centroid size (4 bytes)
Centroid data (dim * 4 bytes as float32)

5. Codebooks (only if trained):

For each of M subquantizers:
Codebook size (4 bytes)
Codebook data (Ksub * dsub * 4 bytes as float32)

6. Number of inverted lists (4 bytes) 7. For each inverted list:

List size (4 bytes)
For each compressed vector:
Vector ID (4 bytes)
PQ code (M bytes)

8. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type Int8Quantizer ¶

type Int8Quantizer struct {
	// contains filtered or unexported fields
}

Int8Quantizer uses symmetric scalar quantization to compress vectors.

Memory: 1 byte per dimension (75% savings vs float32) Accuracy: Maps [-AbsMax, AbsMax] to [-127, 127] Training: REQUIRED - must call Train() before use

How it works: 1. Train: Finds the maximum absolute value across sample vectors 2. Quantize: Maps [-AbsMax, AbsMax] to [-127, 127] using linear scaling 3. Dequantize: Reverses the scaling to approximate original values

Trade-off: Maximum memory savings with some accuracy loss. Best for large-scale deployments where memory is critical.

func (*Int8Quantizer) Dequantize ¶

func (q *Int8Quantizer) Dequantize(stored any) ([]float32, error)

func (*Int8Quantizer) GetAbsMax ¶

func (q *Int8Quantizer) GetAbsMax() float32

GetAbsMax returns the trained maximum absolute value (for serialization)

func (*Int8Quantizer) IsTrained ¶

func (q *Int8Quantizer) IsTrained() bool

func (*Int8Quantizer) Quantize ¶

func (q *Int8Quantizer) Quantize(vector []float32) (any, error)

func (*Int8Quantizer) SetAbsMax ¶

func (q *Int8Quantizer) SetAbsMax(absMax float32)

SetAbsMax sets the maximum absolute value (for deserialization)

func (*Int8Quantizer) Train ¶

func (q *Int8Quantizer) Train(vectors [][]float32)

func (*Int8Quantizer) Type ¶

func (q *Int8Quantizer) Type() QuantizerType

type LogicOperator ¶

type LogicOperator string

LogicOperator defines how filters are combined

const (
	AND LogicOperator = "AND"
	OR  LogicOperator = "OR"
)

type MetadataFilterQueryBuilder ¶

type MetadataFilterQueryBuilder struct {
	// contains filtered or unexported fields
}

MetadataFilterQueryBuilder provides a type-safe fluent API for building complex metadata filter queries

func NewMetadataFilterQuery ¶

func NewMetadataFilterQuery() *MetadataFilterQueryBuilder

NewMetadataFilterQuery creates a new query builder

Example:

results, err := NewMetadataFilterQuery().
	Where(Eq("category", "electronics"), Gte("price", 500)).
	Or(Eq("category", "phones"), Gte("rating", 4.5)).
	Execute(idx)

func (*MetadataFilterQueryBuilder) And ¶

func (qb *MetadataFilterQueryBuilder) And(filters ...Filter) *MetadataFilterQueryBuilder

And adds filters with AND logic to the last group

func (*MetadataFilterQueryBuilder) Build ¶

func (qb *MetadataFilterQueryBuilder) Build() []*FilterGroup

Build returns the constructed filter groups

func (*MetadataFilterQueryBuilder) Execute ¶

func (qb *MetadataFilterQueryBuilder) Execute(idx MetadataIndex) ([]MetadataResult, error)

Execute runs the query against a metadata index

Example:

results, err := NewMetadataFilterQuery().
	Where(Eq("brand", "Apple")).
	Or(Eq("category", "phone"), Eq("brand", "Samsung")).
	Execute(idx)

func (*MetadataFilterQueryBuilder) Or ¶

func (qb *MetadataFilterQueryBuilder) Or(filters ...Filter) *MetadataFilterQueryBuilder

Or starts a new filter group (filters within are combined with AND, but this group is OR'd with previous groups)

func (*MetadataFilterQueryBuilder) Where ¶

func (qb *MetadataFilterQueryBuilder) Where(filters ...Filter) *MetadataFilterQueryBuilder

Where starts a new filter group with AND logic

type MetadataIndex ¶

type MetadataIndex interface {
	// Add adds a document with its metadata to the index
	Add(node MetadataNode) error

	// Remove removes a document from the index
	Remove(node MetadataNode) error

	// NewSearch creates a new search builder
	NewSearch() MetadataSearch

	// Flush the metadata index
	Flush() error

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

MetadataIndex is the interface for filtering documents based on metadata

type MetadataNode ¶

type MetadataNode struct {
	// contains filtered or unexported fields
}

MetadataNode represents a document with structured metadata attributes.

MetadataNode is used for filtering and structured queries in metadata indexes. It associates a unique ID with a map of metadata fields that can be:

Strings: for categorical data (e.g., "category", "status")
Numbers (int, float64): for numeric fields (e.g., "price", "rating")
Booleans: for binary flags (e.g., "in_stock", "featured")

The metadata is stored as map[string]interface{} for flexibility, but the metadata index will interpret types and use appropriate data structures:

Categorical values → Roaring Bitmaps for fast set operations
Numeric values → Bit-Sliced Index (BSI) for range queries

Example:

metadata := map[string]interface{}{
    "category": "electronics",
    "price": 999.99,
    "rating": 4.5,
    "in_stock": true,
    "tags": []string{"premium", "wireless"},  // Lists also supported
}
node := comet.NewMetadataNode(metadata)
index.Add(*node)

func NewMetadataNode ¶

func NewMetadataNode(metadata map[string]interface{}) *MetadataNode

NewMetadataNode creates a new MetadataNode with an auto-incremented ID.

The ID is generated atomically and is guaranteed to be unique across all VectorNode and MetadataNode instances. This makes it safe for concurrent use.

Parameters:

metadata: Map of field names to values. Supported types are:
string: categorical data
int, int32, int64: integer values
float32, float64: floating point values (converted to int64 for BSI)
bool: boolean flags
[]string: list of string values (for IN queries)

Returns:

*MetadataNode: A new node with auto-generated ID

Thread-safety: This function is safe for concurrent use.

Example:

node := comet.NewMetadataNode(map[string]interface{}{
    "category": "books",
    "price": 29.99,
    "in_stock": true,
})
index.Add(*node)

func NewMetadataNodeWithID ¶

func NewMetadataNodeWithID(id uint32, metadata map[string]interface{}) *MetadataNode

NewMetadataNodeWithID creates a new MetadataNode with an explicit ID.

Use this when you need to control the ID assignment, typically when:

Mapping metadata to existing document IDs from your database
Ensuring consistency with vector node IDs in hybrid indexes
Deserializing indexes from disk

WARNING: The caller is responsible for ensuring ID uniqueness.

Parameters:

id: The unique identifier for this node
metadata: Map of field names to values

Returns:

*MetadataNode: A new node with the specified ID

Example:

dbID := uint32(12345)
node := comet.NewMetadataNodeWithID(dbID, map[string]interface{}{
    "category": "electronics",
    "price": 999,
})
index.Add(*node)

func (*MetadataNode) ID ¶

func (n *MetadataNode) ID() uint32

ID returns the unique identifier of this metadata node.

Returns:

uint32: The node's ID

func (*MetadataNode) Metadata ¶

func (n *MetadataNode) Metadata() map[string]interface{}

Metadata returns the metadata map for this node.

Returns:

map[string]interface{}: The metadata fields (not a copy - do not modify)

type MetadataResult ¶

type MetadataResult struct {
	// Node is the matched metadata node containing ID and metadata fields
	Node MetadataNode
}

MetadataResult represents a search result from metadata filtering.

Unlike vector and text results which have meaningful scores, metadata results are binary (match or no match). All returned results match the filter criteria, so they don't have associated relevance scores.

The result contains the full metadata node, allowing access to all metadata fields of matched documents.

Example:

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
    ).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d\n", result.GetId())
    fmt.Printf("Metadata: %v\n", result.Node.Metadata())
}

func (MetadataResult) GetId ¶

func (r MetadataResult) GetId() uint32

GetId returns the ID of the matched document.

func (MetadataResult) GetScore ¶

func (r MetadataResult) GetScore() float32

GetScore returns 0 for metadata results (no meaningful score). This method exists to implement the Result interface.

type MetadataSearch ¶

type MetadataSearch interface {
	// WithFilters sets the filters to apply with AND logic.
	// All filters must match for a document to be included.
	//
	// Available filter functions:
	//   - Eq(field, value): field == value
	//   - Ne(field, value): field != value
	//   - Lt(field, value): field < value
	//   - Lte(field, value): field <= value
	//   - Gt(field, value): field > value
	//   - Gte(field, value): field >= value
	//   - Between(field, min, max): min <= field <= max
	//   - In(field, ...values): field in values
	//   - NotIn(field, ...values): field not in values
	//   - Exists(field): field exists
	//   - NotExists(field): field doesn't exist
	//
	// Parameters:
	//   - filters: One or more filter conditions (combined with AND)
	//
	// Returns:
	//   - MetadataSearch: The search instance for method chaining
	WithFilters(filters ...Filter) MetadataSearch

	// WithFilterGroups sets complex filter groups with OR logic.
	// Documents matching ANY group are included (OR between groups).
	// Filters within each group are combined with AND.
	//
	// This enables complex boolean expressions like:
	//   (A AND B) OR (C AND D) OR (E AND F)
	//
	// Parameters:
	//   - groups: One or more filter groups (combined with OR)
	//
	// Returns:
	//   - MetadataSearch: The search instance for method chaining
	WithFilterGroups(groups ...*FilterGroup) MetadataSearch

	// Execute performs the filtering and returns matching document IDs.
	// The results contain only document IDs (no scores).
	//
	// Returns:
	//   - []MetadataResult: Matching documents
	//   - error: Error if filtering fails
	Execute() ([]MetadataResult, error)
}

MetadataSearch provides a fluent interface for filtering documents by metadata.

MetadataSearch uses roaring bitmaps and bit-sliced indexes for extremely fast filtering on structured attributes. It supports:

Equality and inequality queries
Numeric range queries
Set membership (IN, NOT IN)
Existence checks
Complex boolean logic (AND, OR, NOT)

Filters within a single WithFilters() call are combined with AND logic. Filter groups enable OR logic between different filter combinations.

Example - Simple filters (AND logic):

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
        comet.Eq("in_stock", true),
    ).
    Execute()

Example - Complex query with OR:

// (category=electronics AND price<500) OR (category=books AND rating>=4)
group1 := comet.NewFilterGroup().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lt("price", 500),
    )
group2 := comet.NewFilterGroup().
    WithFilters(
        comet.Eq("category", "books"),
        comet.Gte("rating", 4),
    )
results, _ := metaIndex.NewSearch().
    WithFilterGroups(group1, group2).
    Execute()

Example - Set membership:

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.In("category", "electronics", "computers", "phones"),
        comet.NotIn("brand", "brandX", "brandY"),
    ).
    Execute()

type Operator ¶

type Operator string

Operator represents a filter operation type

const (
	// Equality operators
	OpEqual    Operator = "eq" // Equal to
	OpNotEqual Operator = "ne" // Not equal to

	// Comparison operators (numeric)
	OpGreaterThan        Operator = "gt"  // Greater than
	OpGreaterThanOrEqual Operator = "gte" // Greater than or equal to
	OpLessThan           Operator = "lt"  // Less than
	OpLessThanOrEqual    Operator = "lte" // Less than or equal to

	// Set operators
	OpIn    Operator = "in"     // In a set of values
	OpNotIn Operator = "not_in" // Not in a set of values

	// Range operators
	OpRange Operator = "range" // Within a range [Value, Value2]

	// Existence operators
	OpExists    Operator = "exists"     // Field exists (has any value)
	OpNotExists Operator = "not_exists" // Field doesn't exist
)

Operator constants

type PQIndex ¶

type PQIndex struct {

	// M is the number of subquantizers
	M int

	// Nbits is bits per PQ code
	Nbits int

	// Ksub is centroids per subquantizer (K = 2^Nbits)
	Ksub int
	// contains filtered or unexported fields
}

PQIndex represents a Product Quantization index.

Memory layout:

Codes: n × M bytes (compressed vectors)
Codebooks: M × K × (dim/M) × 4 bytes
Typical: 10-500x smaller than original

func NewPQIndex ¶

func NewPQIndex(dim int, distanceKind DistanceKind, M int, Nbits int) (*PQIndex, error)

NewPQIndex creates a new Product Quantization index.

Parameters:

dim: Vector dimensionality (must be divisible by M)
distanceKind: Distance metric
M: Number of subquantizers (subspaces). Must divide dim evenly.
Nbits: Bits per PQ code, determines K=2^Nbits centroids per subspace

Returns:

*PQIndex: New untrained PQ index
error: Returns error if parameters invalid

Tip: Use CalculatePQParams(dim) to get recommended M and Nbits values.

func (*PQIndex) Add ¶

func (idx *PQIndex) Add(vector VectorNode) error

Add compresses and adds vectors to the index.

Encoding process: 1. Divide vector into M subvectors 2. For each subvector, find nearest centroid in its codebook 3. Store centroid IDs as M-byte code

Original vectors are discarded after encoding!

Parameters:

vector: Vector to compress and add

Returns:

error: Returns error if not trained or dimension mismatch

func (*PQIndex) Dimensions ¶

func (idx *PQIndex) Dimensions() int

Dimensions returns the dimensionality of original vectors.

func (*PQIndex) DistanceKind ¶

func (idx *PQIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*PQIndex) Flush ¶

func (idx *PQIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

After multiple Remove() calls (batch cleanup)
When deleted nodes are significant (e.g., > 10% of index)
During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted codes and vectorNodes from their parallel slices 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = number of vectors in the index

Thread-safety: Acquires exclusive write lock

func (*PQIndex) Kind ¶

func (idx *PQIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*PQIndex) NewSearch ¶

func (idx *PQIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*PQIndex) ReadFrom ¶

func (idx *PQIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a PQIndex from an io.Reader.

This method reconstructs a PQIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
M, Nbits := CalculatePQParams(384)
idx2, _ := NewPQIndex(384, Cosine, M, Nbits)
idx2.ReadFrom(file)
file.Close()

func (*PQIndex) Remove ¶

func (idx *PQIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from the codes and vectorNodes slices (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

Skipped during search
Still in storage slices
Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

vector: Vector to remove (only the ID field is used for matching)

Returns:

error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*PQIndex) Train ¶

func (idx *PQIndex) Train(vectors []VectorNode) error

Train learns codebooks for each subspace using k-means.

Algorithm:

For each of M subspaces: a. Extract that subspace from all training vectors b. Run k-means to find K centroids c. Store centroids as codebook for that subspace

Parameters:

vectors: Training vectors (need at least Ksub vectors)

Returns:

error: Returns error if insufficient training data

func (*PQIndex) Trained ¶

func (idx *PQIndex) Trained() bool

Trained returns true if the index has been trained

func (*PQIndex) WriteTo ¶

func (idx *PQIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the PQIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "PQIX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

Dimensionality (4 bytes)
Distance kind length (4 bytes) + distance kind string
M (4 bytes) - number of subquantizers
Nbits (4 bytes) - bits per PQ code
Ksub (4 bytes) - centroids per subquantizer
dsub (4 bytes) - dimension of each subspace
trained (1 byte) - whether codebooks have been learned

4. Codebooks (only if trained):

For each of M subquantizers:
Codebook size (4 bytes)
Codebook data (Ksub * dsub * 4 bytes as float32)

5. Number of vectors (4 bytes) 6. For each vector:

Vector ID (4 bytes)
PQ code (M bytes)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type Quantizer ¶

type Quantizer interface {
	// Train prepares the quantizer using sample vectors.
	// Required for Int8Quantizer, no-op for others.
	Train(vectors [][]float32)

	// IsTrained returns true if the quantizer is ready to use.
	// Always true for FullPrecision and HalfPrecision.
	// For Int8, true only after Train() has been called.
	IsTrained() bool

	// Quantize converts a float32 vector to the quantizer's storage format.
	// Returns:
	//   - []float32 for FullPrecisionQuantizer
	//   - []uint16 for HalfPrecisionQuantizer (float16 bits)
	//   - []int8 for Int8Quantizer
	Quantize(vector []float32) (any, error)

	// Dequantize converts a stored vector back to float32.
	// The input type must match the quantizer's storage format.
	Dequantize(stored any) ([]float32, error)

	// Type returns the quantizer type
	Type() QuantizerType
}

Quantizer defines the interface for vector quantization operations. Implementations handle conversion between float32 vectors and their compressed representations, supporting different precision levels.

func NewQuantizer ¶

func NewQuantizer(qType QuantizerType) (Quantizer, error)

NewQuantizer creates a quantizer of the specified type.

type QuantizerType ¶

type QuantizerType string

QuantizerType represents the type of quantization

const (
	FullPrecision QuantizerType = "float32"
	HalfPrecision QuantizerType = "float16"
	Int8Precision QuantizerType = "int8"
)

type Result ¶

type Result interface {
	// GetId returns the unique identifier for this result
	GetId() uint32

	// GetScore returns the relevance score for this result.
	// Interpretation depends on search type:
	//   - Vector search: distance (lower = better)
	//   - Text search: BM25 score (higher = better)
	//   - Metadata search: typically 0 or 1 (binary match)
	GetScore() float32
}

Result is a common interface for all search result types.

This interface enables generic operations like limiting, autocut, and score aggregation to work uniformly across different search modalities (vector, text, metadata).

All result types (VectorResult, TextResult, MetadataResult) implement this interface.

type RoaringMetadataIndex ¶

type RoaringMetadataIndex struct {
	// contains filtered or unexported fields
}

RoaringMetadataIndex provides fast filtering using Roaring Bitmaps and BSI.

This index maintains two types of indexes: 1. Categorical index: Maps "field:value" to document IDs using roaring bitmaps 2. Numeric index: Maps field names to BSI structures for range queries

Thread-safety: This index is safe for concurrent use through a read-write mutex.

func NewRoaringMetadataIndex ¶

func NewRoaringMetadataIndex() *RoaringMetadataIndex

NewRoaringMetadataIndex creates a new roaring bitmap-based metadata index.

Returns:

*RoaringMetadataIndex: A new empty index ready to accept documents

Example:

idx := NewRoaringMetadataIndex()
node := NewMetadataNode(1, map[string]interface{}{
	"category": "electronics",
	"price": 999,
})
idx.Add(node)

func (*RoaringMetadataIndex) Add ¶

func (idx *RoaringMetadataIndex) Add(node MetadataNode) error

Add adds a document with its metadata to the index.

The metadata is automatically classified into categorical or numeric fields:

Numeric: int, int64, float64 (stored in BSI)
Categorical: string, bool (stored in roaring bitmaps)

Parameters:

node: MetadataNode containing document ID and metadata fields

Returns:

error: Returns error if metadata contains unsupported types

Time Complexity: O(m) where m is the number of metadata fields

Thread-safety: Acquires exclusive lock

func (*RoaringMetadataIndex) Flush ¶

func (idx *RoaringMetadataIndex) Flush() error

Flush is a no-op for roaring metadata index since data is stored in memory. This method exists to satisfy the MetadataIndex interface.

Returns:

error: Always returns nil

func (*RoaringMetadataIndex) NewSearch ¶

func (idx *RoaringMetadataIndex) NewSearch() MetadataSearch

NewSearch creates a new search builder for this index.

Returns:

MetadataSearch: A new search builder ready to be configured

Example:

results, err := idx.NewSearch().
	WithFilters(
		Eq("category", "electronics"),
		Gte("price", 500),
	).
	Execute()

func (*RoaringMetadataIndex) ReadFrom ¶

func (idx *RoaringMetadataIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a RoaringMetadataIndex from an io.Reader.

This method reconstructs a RoaringMetadataIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

int64: Number of bytes read
error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2 := NewRoaringMetadataIndex()
idx2.ReadFrom(file)
file.Close()

func (*RoaringMetadataIndex) Remove ¶

func (idx *RoaringMetadataIndex) Remove(node MetadataNode) error

Remove removes a document from all indexes.

Parameters:

node: MetadataNode to remove (only the ID field is used for matching)

Returns:

error: Always returns nil (exists to satisfy MetadataIndex interface)

Time Complexity: O(m) where m is the number of metadata fields

Thread-safety: Acquires exclusive lock

func (*RoaringMetadataIndex) WriteTo ¶

func (idx *RoaringMetadataIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the RoaringMetadataIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization (though Flush is a no-op for metadata index since it's memory-based).

The serialization format is: 1. Magic number (4 bytes) - "MTIX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. All docs bitmap size (4 bytes) + bitmap bytes 4. Number of categorical entries (4 bytes) 5. For each categorical entry:

Key length (4 bytes) + key string
Bitmap size (4 bytes) + bitmap bytes

6. Number of numeric entries (4 bytes) 7. For each numeric entry:

Field name length (4 bytes) + field name string
BSI size (4 bytes) + BSI bytes

Thread-safety: Acquires read lock during serialization

Returns:

int64: Number of bytes written
error: Returns error if write fails or flush fails

type ScoreAggregationKind ¶

type ScoreAggregationKind string

ScoreAggregationKind defines the type of score aggregation strategy.

const (
	// SumAggregation sums all scores for the same node
	SumAggregation ScoreAggregationKind = "sum"

	// MaxAggregation takes the maximum score for the same node
	MaxAggregation ScoreAggregationKind = "max"

	// MeanAggregation averages all scores for the same node
	MeanAggregation ScoreAggregationKind = "mean"
)

type SearchResult ¶

type SearchResult struct {
	DocID uint32  // Document ID
	Score float64 // BM25 relevance score
}

SearchResult represents a single search result with its score.

type TextAggregation ¶

type TextAggregation interface {
	// Kind returns the kind of aggregation strategy
	Kind() ScoreAggregationKind

	// Aggregate takes a slice of TextResults (potentially with duplicate document IDs),
	// deduplicates by document ID, aggregates scores for each unique document,
	// and returns the deduplicated results sorted by aggregated score (descending).
	Aggregate(results []TextResult) []TextResult
}

TextAggregation defines how to aggregate scores when the same text document appears in results from multiple queries.

When performing multi-query text searches, the same document may be returned multiple times with different scores. The aggregation strategy determines how to combine these scores and deduplicate results by document ID.

Note: Text results use relevance scores where HIGHER score = BETTER match (e.g., BM25).

func DefaultTextAggregation ¶

func DefaultTextAggregation() TextAggregation

DefaultTextAggregation returns the default text aggregation strategy (Sum).

func NewTextAggregation ¶

func NewTextAggregation(kind ScoreAggregationKind) (TextAggregation, error)

NewTextAggregation returns the singleton text aggregation instance for the given kind. Returns error if the kind is not recognized.

type TextIndex ¶

type TextIndex interface {
	// Add a new text to the index
	Add(id uint32, text string) error

	// Remove a text from the index
	Remove(id uint32) error

	// NewSearch creates a new search builder
	NewSearch() TextSearch

	// Flush the text index
	Flush() error

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

type TextResult ¶

type TextResult struct {
	// Id is the unique identifier of the matched document
	Id uint32

	// Score is the BM25 relevance score (higher = more relevant)
	Score float32
}

TextResult represents a search result from full-text (BM25) search.

Each result contains the matched document ID and a relevance score. The score represents BM25 relevance:

Higher scores indicate better relevance (more important terms, higher frequency)
Scores are not normalized and can exceed 1.0
Scores depend on:
Term frequency in document
Inverse document frequency
Document length normalization

Note: Unlike vector results where lower is better, text results have higher-is-better scores (standard for information retrieval).

Example:

results, _ := index.NewSearch().
    WithQuery("machine learning").
    WithK(10).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d, BM25 Score: %.4f\n",
        result.Id, result.Score)
}

func (TextResult) GetId ¶

func (r TextResult) GetId() uint32

GetId returns the ID of the matched document.

func (TextResult) GetScore ¶

func (r TextResult) GetScore() float32

GetScore returns the BM25 relevance score (higher = more relevant).

type TextSearch ¶

type TextSearch interface {
	// WithQuery sets the query text(s) for full-text search.
	// Supports single or multiple queries for batch search.
	// Text is tokenized and normalized using UAX#29 word segmentation.
	//
	// Parameters:
	//   - queries: One or more query strings
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithQuery(queries ...string) TextSearch

	// WithNode sets the node ID(s) to search from (not commonly used for text).
	// This is provided for interface consistency but is rarely needed.
	//
	// Parameters:
	//   - nodeIDs: One or more node IDs
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithNode(nodeIDs ...uint32) TextSearch

	// WithK sets the number of top results to return.
	//
	// Parameters:
	//   - k: Number of results (default: 10)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithK(k int) TextSearch

	// WithScoreAggregation sets how to combine scores from multiple queries.
	// Only relevant when using multiple queries.
	//
	// Available strategies:
	//   - SumAggregation: Sum all scores (default, emphasizes total relevance)
	//   - MaxAggregation: Take maximum score (best match across queries)
	//   - MeanAggregation: Average all scores (balanced)
	//
	// Parameters:
	//   - kind: The aggregation strategy
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithScoreAggregation(kind ScoreAggregationKind) TextSearch

	// WithCutoff enables automatic result cutoff based on score distribution.
	//
	// Parameters:
	//   - cutoff: Number of extrema to find before cutting (-1 disables)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithCutoff(cutoff int) TextSearch

	// WithDocumentIDs pre-filters the search to specific document IDs.
	// Only documents with IDs in this set will be scored.
	//
	// Parameters:
	//   - docIDs: Eligible document IDs (empty means all documents)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithDocumentIDs(docIDs ...uint32) TextSearch

	// Execute performs the configured search and returns results.
	// Results are sorted by BM25 score (descending - higher is better).
	//
	// Returns:
	//   - []TextResult: Sorted search results
	//   - error: Error if search fails
	Execute() ([]TextResult, error)
}

TextSearch provides a fluent interface for configuring and executing text searches.

TextSearch uses the builder pattern for configuration, similar to VectorSearch. It supports BM25-based full-text search with tokenization and normalization.

Features:

Single or multi-query search with score aggregation
Pre-filtering by document IDs
Automatic result cutoff (autocut)
Top-K retrieval with heap-based ranking

Example - Basic search:

results, _ := txtIndex.NewSearch().
    WithQuery("machine learning tutorial").
    WithK(10).
    Execute()

Example - Multi-query search:

results, _ := txtIndex.NewSearch().
    WithQuery("deep learning", "neural networks", "transformers").
    WithK(20).
    WithScoreAggregation(comet.MaxAggregation).
    Execute()

Example - Pre-filtered search:

eligibleDocs := []uint32{1, 5, 10, 15}
results, _ := txtIndex.NewSearch().
    WithQuery("tutorial").
    WithDocumentIDs(eligibleDocs...).
    WithK(5).
    Execute()

type VectorAggregation ¶

type VectorAggregation interface {
	// Kind returns the kind of aggregation strategy
	Kind() ScoreAggregationKind

	// Aggregate takes a slice of VectorResults (potentially with duplicate node IDs),
	// deduplicates by node ID, aggregates scores for each unique node,
	// and returns the deduplicated results sorted by aggregated score (ascending).
	Aggregate(results []VectorResult) []VectorResult
}

VectorAggregation defines how to aggregate scores when the same vector node appears in results from multiple queries or node searches.

When performing multi-query or multi-node searches, the same vector may be returned multiple times with different scores. The aggregation strategy determines how to combine these scores and deduplicate results by node ID.

Note: Vector results use distance metrics where LOWER score = BETTER match.

func DefaultVectorAggregation ¶

func DefaultVectorAggregation() VectorAggregation

DefaultVectorAggregation returns the default vector aggregation strategy (Sum).

func NewVectorAggregation ¶

func NewVectorAggregation(kind ScoreAggregationKind) (VectorAggregation, error)

NewVectorAggregation returns the singleton vector aggregation instance for the given kind. Returns error if the kind is not recognized.

type VectorIndex ¶

type VectorIndex interface {
	// Train the index
	Train(vectors []VectorNode) error

	// Add a new vector to the index
	Add(vector VectorNode) error

	// Remove a vector from the index
	Remove(vector VectorNode) error

	// Flush the vector index
	Flush() error

	// NewSearch creates a new search builder
	NewSearch() VectorSearch

	// Dimensions returns the dimensionality of vectors stored in this index
	Dimensions() int

	// DistanceKind returns the distance metric used for similarity measurement
	DistanceKind() DistanceKind

	// Kind returns the type of vector index
	Kind() VectorIndexKind

	// Trained returns true if the index has been trained
	Trained() bool

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

VectorIndex is the interface for the vector index

type VectorIndexKind ¶

type VectorIndexKind string

VectorIndexKind represents the type of indexing strategy used for vector search. Different index types offer different tradeoffs between speed, accuracy, and memory usage.

var (
	// FlatIndex performs exhaustive search by comparing the query against all vectors.
	// Provides perfect recall but has O(n) time complexity.
	FlatIndexKind VectorIndexKind = "flat"

	// IVFIndex (Inverted File Index) partitions the vector space into clusters.
	// Searches only the nearest clusters, trading accuracy for speed.
	IVFIndexKind VectorIndexKind = "ivf"

	// PQIndex (Product Quantization) compresses vectors using learned codebooks.
	// Significantly reduces memory usage at the cost of some accuracy.
	PQIndexKind VectorIndexKind = "pq"

	// IVFPQIndex combines IVF clustering with PQ compression.
	// Provides fast search with low memory footprint.
	IVFPQIndexKind VectorIndexKind = "ivfpq"

	// HNSWIndex (Hierarchical Navigable Small World) builds a multi-layer graph.
	// Offers excellent query performance with high recall.
	HNSWIndexKind VectorIndexKind = "hnsw"
)

type VectorNode ¶

type VectorNode struct {
	// contains filtered or unexported fields
}

VectorNode represents a vector embedding with a unique identifier.

VectorNode is the fundamental data structure for storing vectors in all index types. Each node contains:

id: A unique identifier for retrieval and reference
vector: The actual float32 embedding values

The node is immutable after creation - neither the ID nor vector can be modified. This immutability ensures thread-safety and prevents accidental corruption of indexed data.

Example:

// Auto-generated ID
vec := []float32{0.1, 0.5, 0.3, 0.8}
node := comet.NewVectorNode(vec)
fmt.Println(node.ID())      // e.g., 1
fmt.Println(node.Vector())  // [0.1, 0.5, 0.3, 0.8]

// Explicit ID (useful for mapping to external document IDs)
node2 := comet.NewVectorNodeWithID(42, vec)
fmt.Println(node2.ID())  // 42

func NewVectorNode ¶

func NewVectorNode(vector []float32) *VectorNode

NewVectorNode creates a new VectorNode with an auto-incremented ID.

The ID is generated atomically and is guaranteed to be unique across all VectorNode and MetadataNode instances created by this package. This makes it safe to create nodes concurrently from multiple goroutines.

Parameters:

vector: The embedding vector to store. This should match the dimensionality expected by your index (e.g., 384 for sentence transformers, 768 for BERT).

Returns:

*VectorNode: A new node with auto-generated ID

Thread-safety: This function is safe for concurrent use.

Example:

embedding := []float32{0.1, 0.2, 0.3, ...}  // 384 dimensions
node := comet.NewVectorNode(embedding)
index.Add(*node)

func NewVectorNodeWithID ¶

func NewVectorNodeWithID(id uint32, vector []float32) *VectorNode

NewVectorNodeWithID creates a new VectorNode with an explicit ID.

Use this when you need to control the ID assignment, typically when:

Mapping vectors to existing document IDs from your database
Deserializing indexes from disk
Maintaining consistent IDs across restarts

WARNING: The caller is responsible for ensuring ID uniqueness. Using duplicate IDs may cause undefined behavior in the index.

Parameters:

id: The unique identifier for this node
vector: The embedding vector to store

Returns:

*VectorNode: A new node with the specified ID

Example:

// Map to database primary key
dbID := uint32(12345)
embedding := []float32{0.1, 0.2, 0.3, ...}
node := comet.NewVectorNodeWithID(dbID, embedding)
index.Add(*node)

func (*VectorNode) ID ¶

func (n *VectorNode) ID() uint32

ID returns the unique identifier of this vector node.

Returns:

uint32: The node's ID

func (*VectorNode) Vector ¶

func (n *VectorNode) Vector() []float32

Vector returns the embedding vector stored in this node.

Returns:

[]float32: The vector data (not a copy - do not modify)

type VectorResult ¶

type VectorResult struct {
	// Node is the matched vector node containing ID and vector data
	Node VectorNode

	// Score is the distance from the query vector (lower = more similar)
	Score float32
}

VectorResult represents a search result from vector similarity search.

Each result contains the matched vector node and a similarity score. The score represents distance from the query vector:

Lower scores indicate higher similarity (closer in vector space)
Score interpretation depends on the distance metric:
L2/Euclidean: absolute distance
L2Squared: squared distance (faster, preserves ordering)
Cosine: angular distance (range: [0, 2])

Example:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d, Distance: %.4f\n",
        result.GetId(), result.Score)
    fmt.Printf("Vector: %v\n", result.Node.Vector())
}

func (VectorResult) GetId ¶

func (r VectorResult) GetId() uint32

GetId returns the ID of the matched vector node.

func (VectorResult) GetScore ¶

func (r VectorResult) GetScore() float32

GetScore returns the distance score (lower = more similar).

type VectorSearch ¶

type VectorSearch interface {
	// WithQuery sets the query vector(s) for similarity search.
	// Supports single or multiple query vectors for batch search.
	// Results from multiple queries are aggregated using the configured strategy.
	//
	// Parameters:
	//   - queries: One or more query vectors (each []float32 must match index dimension)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithQuery(queries ...[]float32) VectorSearch

	// WithNode sets the node ID(s) to search from (node-based similarity).
	// Finds vectors similar to the specified indexed nodes.
	// Supports single or multiple nodes for batch search.
	//
	// Parameters:
	//   - nodeIDs: One or more node IDs to use as query vectors
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithNode(nodeIDs ...uint32) VectorSearch

	// WithK sets the number of nearest neighbors to return.
	//
	// Parameters:
	//   - k: Number of results (default: 10)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithK(k int) VectorSearch

	// WithNProbes sets the number of clusters to probe (IVF/IVFPQ only).
	// Higher values increase recall but reduce speed.
	// This is a no-op for other index types.
	//
	// Typical values:
	//   - 1: Fastest, ~60-70% recall
	//   - 8: Good balance, ~85% recall
	//   - 16: Better recall, ~92% recall
	//   - 32: High recall, ~96% recall
	//
	// Parameters:
	//   - nProbes: Number of clusters to search
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithNProbes(nProbes int) VectorSearch

	// WithEfSearch sets the efSearch parameter (HNSW only).
	// Controls the size of the dynamic candidate list during search.
	// Higher values increase recall but reduce speed.
	// This is a no-op for other index types.
	//
	// Typical values:
	//   - 50: Very fast, ~85% recall
	//   - 100: Fast, ~92% recall
	//   - 200: Balanced, ~96% recall (default)
	//   - 400: Slower, ~98% recall
	//
	// Parameters:
	//   - efSearch: Size of candidate list
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithEfSearch(efSearch int) VectorSearch

	// WithThreshold sets a distance threshold for filtering results.
	// Only vectors with distance <= threshold are returned.
	// Useful for finding all "sufficiently similar" vectors.
	//
	// Parameters:
	//   - threshold: Maximum distance (results with distance > threshold are excluded)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithThreshold(threshold float32) VectorSearch

	// WithScoreAggregation sets how to combine scores from multiple queries/nodes.
	// Only relevant when using multiple queries or nodes.
	//
	// Available strategies:
	//   - SumAggregation: Sum all scores (emphasizes frequency)
	//   - MaxAggregation: Take maximum distance (conservative)
	//   - MeanAggregation: Average all scores (balanced, default)
	//
	// Parameters:
	//   - kind: The aggregation strategy
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithScoreAggregation(kind ScoreAggregationKind) VectorSearch

	// WithCutoff enables automatic result cutoff based on score distribution.
	// Analyzes the score curve to find natural breakpoints.
	//
	// Parameters:
	//   - cutoff: Number of extrema to find before cutting (-1 disables autocut)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithCutoff(cutoff int) VectorSearch

	// WithDocumentIDs pre-filters the search to specific document IDs.
	// Only vectors with IDs in this set will be considered.
	// Useful for combining with metadata filters in hybrid search.
	//
	// Parameters:
	//   - docIDs: Eligible document IDs (empty means all documents)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithDocumentIDs(docIDs ...uint32) VectorSearch

	// Execute performs the configured search and returns results.
	// Results are sorted by distance (ascending - lower is better).
	//
	// Returns:
	//   - []VectorResult: Sorted search results
	//   - error: Error if search fails
	Execute() ([]VectorResult, error)
}

VectorSearch provides a fluent interface for configuring and executing vector searches.

VectorSearch uses the builder pattern to configure search parameters before execution. All With* methods return the search instance for method chaining.

Search modes:

Query-based: Search for vectors similar to provided query vectors
Node-based: Find vectors similar to indexed nodes (by ID)
Hybrid: Combine both query and node searches

Advanced features:

Pre-filtering: Restrict search to specific document IDs
Score aggregation: Combine results from multiple queries/nodes
Autocut: Automatically determine optimal result cutoff
Threshold filtering: Exclude results beyond distance threshold

Example - Basic search:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    Execute()

Example - Multi-query with aggregation:

results, _ := index.NewSearch().
    WithQuery(query1, query2, query3).
    WithK(20).
    WithScoreAggregation(comet.MeanAggregation).
    Execute()

Example - Pre-filtered search:

eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

README ¶

Comet

Table of Contents

Overview

Features

Vector Storage

Search Modalities

Fusion Strategies

Data Structures (The Good Stuff)

Other Capabilities

Installation

Quick Start

Architecture

System Architecture

Application Layer

Search Engine Layer

Index Storage Layer

Hybrid Coordinator

Component Details

Component A: Vector Storage Engine

Component B: Text Search Engine

Component C: Metadata Filter Engine

Data Flow

Step 1: Query Input

Step 2: Validation

Step 3: Metadata Pre-filtering

Step 4: Parallel Search

Step 5: Score Fusion

Step 6: Final Ranking

Memory Layout

HNSW Index Memory

Product Quantization Memory

Core Concepts

Concept 1: Vector Storage and Distance Metrics

Flat Index (Brute Force)

HNSW Index (Graph-Based)

IVF Index (Clustering)

Product Quantization (Compression)

Concept 2: BM25 Full-Text Search Algorithm

Setup

Step 1: Calculate IDF (Inverse Document Frequency)

Step 2: Calculate TF Component for Each Document

Step 3: Calculate Final BM25 Scores

Final Ranking

Why Doc 3 Ranks Higher

Concept 3: Hybrid Search with Score Fusion

Query Setup

Step 1: Apply Metadata Filter

Step 2: Vector Search Results

Step 3: Text Search Results

Step 4: Reciprocal Rank Fusion (RRF)

Final Ranking

Why RRF Over Weighted Sum?

Storage Type Deep Dive

Flat Index (Brute Force)

How It Works

Implementation

Complexity Analysis

When to Use

HNSW Index (Hierarchical Graph)

Graph Structure

Search Process

How Insertion Works

Complexity Analysis

Key Parameters

When to Use

IVF Index (Inverted File with Clustering)

Build Phase (Training)

Search Phase

Key Insight

Complexity Analysis

Key Parameters

When to Use

PQ Index (Product Quantization)

Compression Process

Result

Search Process

Complexity Analysis

Memory Comparison

Key Parameters