comet

package module
v0.1.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 14, 2025 License: MIT Imports: 16 Imported by: 0

README

Comet

Cover

A high-performance hybrid vector store written in Go. Comet brings together multiple indexing strategies and search modalities into a unified, hackable package. Hybrid retrieval with reciprocal rank fusion, autocut, pre-filtering, semantic search, full-text search, and multi-KNN searches, and multi-query operations — all in pure Go.

Understand search internals from the inside out. Built for hackers, not hyperscalers. Tiny enough to fit in your head. Decent enough to blow it.

Choose from:

  • Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid) storage backends
  • Full-Text Search: BM25 ranking algorithm with tokenization and normalization
  • Metadata Filtering: Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes
  • Ranking Programmability: Reciprocal Rank Fusion, Fixed size result sets, Threshold based result sets, Dynamic result sets etc.
  • Hybrid Search: Unified interface combining vector, text, and metadata search

Table of Contents

Overview

Everything you need to understand how vector databases actually work—and build one yourself.

What's inside:

  • 5 Vector Storage Types: Flat, HNSW, IVF, PQ, IVFPQ
  • 3 Distance Metrics: L2, L2 Squared, Cosine
  • Full-Text Search: BM25 ranking with Unicode tokenization
  • Metadata Filtering: Roaring bitmaps + Bit-Sliced Indexes
  • Hybrid Search: Combine vector + text + metadata with Reciprocal Rank Fusion
  • Advanced Search: Multi-KNN queries, multi-query operations, autocut result truncation
  • Production Features: Thread-safe, serialization, soft deletes, configurable parameters

Everything you need to understand how vector databases actually work—and build one yourself.

Features

Vector Storage
  • Flat: Brute-force exact search (100% recall baseline)
  • HNSW: Hierarchical navigable small world graphs (95-99% recall, O(log n) search)
  • IVF: Inverted file index with k-means clustering (85-95% recall, 10-20x speedup)
  • PQ: Product quantization for compression (85-95% recall, 10-500x memory reduction)
  • IVFPQ: IVF + PQ combined (85-95% recall, 100x speedup + 500x compression)
Search Modalities
  • Vector Search: L2, L2 Squared, and Cosine distance metrics
  • Full-Text Search: BM25 ranking with Unicode-aware tokenization
  • Metadata Filtering: Boolean queries on structured attributes
  • Hybrid Search: Combine all three with configurable fusion strategies
Fusion Strategies
  • Weighted Sum: Linear combination with configurable weights
  • Reciprocal Rank Fusion (RRF): Scale-independent rank-based fusion
  • Max/Min Score: Simple score aggregation
Data Structures (The Good Stuff)
  • HNSW Graphs: Multi-layer skip lists for approximate nearest neighbor search
  • Roaring Bitmaps: Compressed bitmaps for metadata filtering (array, bitmap, run-length encoding)
  • Bit-Sliced Index (BSI): Efficient numeric range queries without full scans
  • Product Quantization Codebooks: Learned k-means centroids for vector compression
  • Inverted Indexes: Token-to-document mappings for full-text search
Other Capabilities
  • Quantization: Full precision, half precision, int8 precision
  • Soft Deletes: Fast deletion with lazy cleanup
  • Serialization: Persist and reload indexes
  • Thread-Safe: Concurrent read/write operations
  • Autocut: Automatic result truncation based on score gaps

Installation

go get github.com/wizenheimer/comet

Quick Start

package main

import (
    "fmt"
    "log"

    "github.com/wizenheimer/comet"
)

func main() {
    // Create a vector store (384-dimensional embeddings with cosine distance)
    index, err := comet.NewFlatIndex(384, comet.Cosine)
    if err != nil {
        log.Fatal(err)
    }

    // Add vectors
    vec1 := make([]float32, 384)
    // ... populate vec1 with your embedding ...
    node := comet.NewVectorNode(vec1)
    index.Add(*node)

    // Search for similar vectors
    query := make([]float32, 384)
    // ... populate query vector ...
    results, err := index.NewSearch().
        WithQuery(query).
        WithK(10).
        Execute()

    if err != nil {
        log.Fatal(err)
    }

    // Process results
    for i, result := range results {
        fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
    }
}

Output:

1. ID=123, Score=0.0234
2. ID=456, Score=0.0567
3. ID=789, Score=0.0823
...

Architecture

System Architecture

Comet is organized into three main search engines that can work independently or together:

Application Layer
┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│            (Using Comet as a Go Library)                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
         ▼             ▼             ▼
    Vector         Text         Metadata
Search Engine Layer
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Vector    │    │    Text     │    │  Metadata   │
│   Search    │    │   Search    │    │  Filtering  │
│   Engine    │    │   Engine    │    │   Engine    │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       │ Semantic         │ Keywords         │ Filters
       │ Similarity       │ + Relevance      │ + Boolean Logic
       ▼                  ▼                  ▼
Index Storage Layer
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ HNSW / IVF  │    │ BM25 Index  │    │  Roaring    │
│ / PQ / Flat │    │ (Inverted)  │    │  Bitmaps    │
└─────────────┘    └─────────────┘    └─────────────┘
   Graph/Trees      Token→DocIDs       Compressed Sets
Hybrid Coordinator
                 All Three Engines
                       │
                       ▼
              ┌─────────────────┐
              │  Hybrid Search   │
              │  Coordinator     │
              │  (Score Fusion)  │
              └─────────────────┘
                       │
                       ▼
              Combined Results
Component Details
Component A: Vector Storage Engine

Manages vector storage and similarity search across multiple index types.

Common Interface:

┌────────────────────────────────────┐
│  VectorIndex Interface             │
│                                    │
│  ├─ Train(vectors)                 │
│  ├─ Add(vector)                    │
│  ├─ Remove(vector)                 │
│  └─ NewSearch()                    │
└────────────────────────────────────┘

Available Implementations:

FlatIndex          → Brute force, 100% recall
HNSWIndex          → Graph-based, O(log n)
IVFIndex           → Clustering, 10-20x faster
PQIndex            → Quantization, 10-500x compression
IVFPQIndex         → Hybrid, best of IVF + PQ

Responsibilities:

  • Vector preprocessing (normalization for cosine distance)
  • Distance calculations (Euclidean, L2², Cosine)
  • K-nearest neighbor search
  • Serialization/deserialization
  • Soft delete management with flush mechanism

Performance Characteristics:

  • Flat: O(n×d) search, 100% recall
  • HNSW: O(M×ef×log n) search, 95-99% recall
  • IVF: O(nProbes×n/k×d) search, 85-95% recall
Component B: Text Search Engine

Full-text search using BM25 ranking algorithm.

Inverted Index:

┌────────────────────────────────────┐
│  term → RoaringBitmap(docIDs)      │
│                                    │
│  "machine"  →  {1, 5, 12, 45}      │
│  "learning" →  {1, 3, 12, 20}      │
│  "neural"   →  {3, 20, 45}         │
└────────────────────────────────────┘

Term Frequencies:

┌────────────────────────────────────┐
│  term → {docID: count}             │
│                                    │
│  "machine" → {1: 3, 5: 1, 12: 2}   │
└────────────────────────────────────┘

Document Stats:

┌────────────────────────────────────┐
│  docID → (length, token_count)     │
│                                    │
│  1  →  (250 chars, 45 tokens)      │
│  5  →  (180 chars, 32 tokens)      │
└────────────────────────────────────┘

Responsibilities:

  • Text tokenization (UAX#29 word segmentation)
  • Unicode normalization (NFKC)
  • Inverted index maintenance
  • BM25 score calculation
  • Top-K retrieval with heap

Performance Characteristics:

  • Add: O(m) where m = tokens
  • Search: O(q×d_avg) where q = query terms, d_avg = avg docs per term
  • Memory: Compressed inverted index, no original text stored
Component C: Metadata Filter Engine

Fast filtering using compressed bitmaps.

Categorical Fields (Roaring Bitmaps):

┌────────────────────────────────────┐
│  field:value → Bitmap(docIDs)      │
│                                    │
│  "category:electronics" → {1,5,12} │
│  "category:books"       → {2,8,15} │
│  "in_stock:true"        → {1,2,5}  │
└────────────────────────────────────┘

Numeric Fields (Bit-Sliced Index):

┌────────────────────────────────────┐
│  field → BSI (range queries)       │
│                                    │
│  "price"  → [0-1000, 1000-5000]    │
│  "rating" → [0-5 scale]            │
└────────────────────────────────────┘

Document Universe:

┌────────────────────────────────────┐
│  allDocs → Bitmap(all IDs)         │
│                                    │
│  Used for NOT operations           │
└────────────────────────────────────┘

Responsibilities:

  • Bitmap index maintenance (Roaring compression)
  • BSI for numeric range queries
  • Boolean query evaluation (AND, OR, NOT)
  • Existence checks
  • Set operations (IN, NOT IN)

Performance Characteristics:

  • Add: O(f) where f = fields
  • Query: O(f×log n) with bitmap operations
  • Memory: Highly compressed, 1-10% of uncompressed
Data Flow

How a hybrid search request flows through the system:

Step 1: Query Input
User Query:
├─ Vector:    [0.1, 0.5, ...]
├─ Text:      "machine learning"
└─ Filters:   {category="ai", price<100}
Step 2: Validation
┌────────────┐
│ Validation │ ──✗──> Error Handler
└─────┬──────┘         (Invalid dimension, etc.)
      │
      ✓ Valid
Step 3: Metadata Pre-filtering
┌─────────────────────────┐
│ Metadata Filter Engine  │
│ Apply: category="ai"    │
│        price<100        │
└─────┬───────────────────┘
      │
      ▼
Candidates: {1, 5, 7, 12, 15}
Candidates → Split into both engines

┌────────────┐         ┌────────────┐
│  Vector    │         │   Text     │
│  Search    │         │  Search    │
│ (on cands) │         │ (on cands) │
└─────┬──────┘         └─────┬──────┘
      │                      │
      ▼                      ▼
Vec Results:           Text Results:
{1: 0.2,               {7: 8.5,
 5: 0.3,                1: 7.2,
 7: 0.4}                12: 6.8}
Step 5: Score Fusion
      Both Result Sets
              │
              ▼
      ┌───────────────┐
      │ Score Fusion  │
      │ (RRF/Weighted)│
      └───────┬───────┘
              │
              ▼
      Fused Rankings
Step 6: Final Ranking
      ┌───────────────┐
      │  Rank & Sort  │
      └───────┬───────┘
              │
              ▼
      ┌───────────────┐
      │  Top-K Filter │
      │  (k=10)       │
      └───────┬───────┘
              │
              ▼
      Final Results:
      [{1: 0.8},
       {7: 0.7},
       {5: 0.6}]
Memory Layout

Understanding how different index types use memory:

HNSW Index Memory

File Header (24 bytes):

┌─────────────────────────────┐
│ Magic:      "HNSW" (4 B)    │
│ Version:    1 (4 B)          │
│ Dimensions: 384 (4 B)        │
│ M:          16 (4 B)         │
│ Max Level:  3 (4 B)          │
│ Entry Point: 42 (4 B)        │
└─────────────────────────────┘

Per-Node Storage:

Node ID:       4 bytes
Level:         4 bytes
Vector Data:   1536 bytes (384-dim × 4)
Graph Edges:   320 bytes (M connections × layers)
              ─────────────
Total:         ~1864 bytes per node

Scaling Analysis:

┌───────────────────┬──────────┬───────────┐
│ Component         │ Per Node │ 1M Nodes  │
├───────────────────┼──────────┼───────────┤
│ Vectors (raw)     │ 1536 B   │ 1.46 GB   │
│ Graph structure   │ 320 B    │ 305 MB    │
│ Metadata          │ 8 B      │ 7.6 MB    │
├───────────────────┼──��───────┼───────────┤
│ Total             │ 1864 B   │ 1.78 GB   │
└───────────────────┴──────────┴───────────┘
Product Quantization Memory

Compression Overview:

Original Vector (384-dim):
384 × 4 bytes = 1536 bytes
         ↓
    Quantization
         ↓
PQ Codes (8 subspaces):
8 × 1 byte = 8 bytes
         ↓
192x smaller!

Codebook Storage:

8 codebooks × 256 centroids × 48 dims × 4 bytes
= 393 KB (shared across all vectors)

1M Vectors Comparison:

┌───────────────────┬────────────┬──────────┐
│ Format            │ Size       │ Ratio    │
├───────────────────┼────────────┼──────────┤
│ Original (float32)│ 1.46 GB    │ 1x       │
│ PQ-8              │ 7.6 MB     │ 192x     │
│ PQ-16             │ 15.3 MB    │ 96x      │
│ PQ-32             │ 30.5 MB    │ 48x      │
│ + Codebooks       │ +393 KB    │ -        │
└───────────────────┴────────────┴──────────┘

Core Concepts

Concept 1: Vector Storage and Distance Metrics

A vector store maintains high-dimensional embeddings and enables efficient similarity search. The choice of storage type determines the tradeoff between search speed, memory usage, and accuracy.

Example: How vectors are stored and searched

Given these input vectors:

Vector 1: [0.1, 0.5, 0.3, 0.8]
Vector 2: [0.2, 0.4, 0.7, 0.1]
Query:    [0.15, 0.45, 0.5, 0.5]

The index computes distances and returns nearest neighbors:

┌─────────────┬────────────────────────────────┬──────────┐
│ Vector ID   │ Distance to Query              │ Rank     │
├─────────────┼────────────────────────────────┼──────────┤
│ 1           │ 0.234 (closest)                │ 1st      │
│ 2           │ 0.567 (further)                │ 2nd      │
└─────────────┴────────────────────────────────┴──────────┘

Visual Representation: Storage Types Comparison

Flat Index (Brute Force)
Query → Compare with ALL vectors → Sort → Return K

✓ 100% Accuracy
✗ O(n) time complexity
✗ Slow for large datasets
HNSW Index (Graph-Based)
Layer 2: ●─────────────●    (highways - long jumps)
          │             │
Layer 1: ●───●───●───●    (roads - medium jumps)
          │   │ \ │ \ │
Layer 0: ●─●─●─●─●─●─●    (streets - all nodes)

Search: Start high → Navigate greedily → Descend

✓ O(log n) time complexity
✓ 95-99% recall
✗ 2-3x memory overhead
IVF Index (Clustering)
Cluster 1: ●●●●     Cluster 2: ●●●●
Cluster 3: ●●●●     Cluster 4: ●●●●

Query → Find nearest clusters → Search within

✓ Fast on large datasets
✗ Requires training
~ 85-95% recall
Product Quantization (Compression)
Original:  [0.123, 0.456, 0.789, ...]
                    ↓ Compress
Quantized: [17, 42, 89, ...]

4 bytes each → 1 byte each

✓ 4-32x memory reduction
✗ Slight accuracy loss

Benefits of Different Storage Types:

  • Flat Index: Perfect recall, simple, no training. Use for small datasets (<100K vectors)
  • HNSW: Excellent speed/accuracy tradeoff, no training. Use for most production workloads
  • IVF: Fast filtering with clusters, scalable. Use for very large datasets (>10M vectors)
  • PQ: Massive memory savings. Use when storage cost is a concern
  • IVFPQ: Best of IVF + PQ. Use for extremely large, memory-constrained environments
Concept 2: BM25 Full-Text Search Algorithm

BM25 (Best Matching 25) ranks documents by relevance to a text query using term frequency and inverse document frequency.

Setup

Test Corpus:

Doc 1: "the quick brown fox jumps over the lazy dog"  (9 words)
Doc 2: "the lazy dog sleeps"                           (4 words)
Doc 3: "quick brown rabbits jump"                      (4 words)

Query: "quick brown"

Parameters:

Average Document Length: 7 words
K1 = 1.2  (term frequency saturation)
B  = 0.75 (length normalization)
Step 1: Calculate IDF (Inverse Document Frequency)

For term "quick":

Appears in: 2 out of 3 documents

IDF = log((N - df + 0.5) / (df + 0.5) + 1)
    = log((3 - 2 + 0.5) / (2 + 0.5) + 1)
        = log(1.5 / 2.5 + 1)
        = log(1.6)
        = 0.470

For term "brown":

Appears in: 2 out of 3 documents
IDF = 0.470  (same calculation)
Step 2: Calculate TF Component for Each Document

Doc 1 (9 words - longer than average):

Term "quick" (tf=1):
  TF_score = (tf × (K1 + 1)) / (tf + K1 × (1 - B + B × (docLen/avgLen)))
           = (1 × 2.2) / (1 + 1.2 × (1 - 0.75 + 0.75 × (9/7)))
             = 2.2 / 2.457
             = 0.895

Term "brown" (tf=1):
  TF_score = 0.895  (same calculation)

Doc 2 (4 words):

Terms "quick" and "brown": tf=0 (not present)
    TF_score = 0

Doc 3 (4 words - shorter than average):

Term "quick" (tf=1):
  TF_score = (1 × 2.2) / (1 + 1.2 × (1 - 0.75 + 0.75 × (4/7)))
             = 2.2 / 1.815
             = 1.212

Term "brown" (tf=1):
  TF_score = 1.212  (same calculation)
Step 3: Calculate Final BM25 Scores

Combine IDF × TF for each document:

Doc 1: (0.895 × 0.470) + (0.895 × 0.470) = 0.841
Doc 2: 0  (no matching terms)
Doc 3: (1.212 × 0.470) + (1.212 × 0.470) = 1.139
Final Ranking
┌─────────┬──────────┬────────┐
│ Rank    │ Doc ID   │ Score  │
├─────────┼──────────┼────────┤
│ 1st     │ Doc 3    │ 1.139  │
│ 2nd     │ Doc 1    │ 0.841  │
│ 3rd     │ Doc 2    │ 0.000  │
└─────────┴──────────┴────────┘
Why Doc 3 Ranks Higher
Doc 1 vs Doc 3:
├─ Same term frequencies (1 occurrence each)
├─ Doc 3 is shorter (4 words vs 9 words)
├─ BM25 length normalization penalizes longer docs
└─ Result: Doc 3 gets higher TF scores (1.212 vs 0.895)

Key Insight: Shorter documents with same term frequency
             are considered more relevant ✓
Concept 3: Hybrid Search with Score Fusion

Hybrid search combines vector similarity, text relevance, and metadata filtering. Different fusion strategies handle score normalization differently.

Query Setup
INPUT:
├─ Query:   "machine learning tutorial"
├─ Vector:  [0.12, 0.45, 0.89, ...]  (embedding)
└─ Filter:  category="education" AND price<50
Step 1: Apply Metadata Filter
┌─────────────────────────────────┐
│  Metadata Filtering             │
│  category="education"           │
│  price<50                       │
└─────────────────────────────────┘
                │
                ▼
Candidate Docs: {1, 3, 5, 7, 9, 12, 15, 18, 20}
Step 2: Vector Search Results
Semantic Similarity (on candidates):

┌────────┬──────────┬──────┐
│ Doc ID │ Distance │ Rank │
├────────┼──────────┼──────┤
│   1    │  0.12    │  1   │  ← Closest
│   5    │  0.23    │  2   │
│   7    │  0.34    │  3   │
│  12    │  0.45    │  4   │
└────────┴──────────┴──────┘
Step 3: Text Search Results
BM25 Ranking (on candidates):

┌────────┬────────────┬──────┐
│ Doc ID │ BM25 Score │ Rank │
├────────┼────────────┼──────┤
│   7    │   8.5      │  1   │  ← Most relevant
│   1    │   7.2      │  2   │
│  12    │   6.8      │  3   │
│   5    │   4.1      │  4   │
└────────┴────────────┴──────┘
Step 4: Reciprocal Rank Fusion (RRF)

Formula: RRF_score = sum(1 / (K + rank_i)) where K=60

Doc 1 Calculation:

  Vector rank: 1 → 1/(60+1) = 0.0164
Text rank:   2 → 1/(60+2) = 0.0161
                          ────────
Combined RRF score:         0.0325

Doc 5 Calculation:

  Vector rank: 2 → 1/(60+2) = 0.0161
Text rank:   4 → 1/(60+4) = 0.0156
                          ────────
Combined RRF score:         0.0317

Doc 7 Calculation:

  Vector rank: 3 → 1/(60+3) = 0.0159
Text rank:   1 → 1/(60+1) = 0.0164
                          ────────
Combined RRF score:         0.0323

Doc 12 Calculation:

  Vector rank: 4 → 1/(60+4) = 0.0156
Text rank:   3 → 1/(60+3) = 0.0159
                          ────────
Combined RRF score:         0.0315
Final Ranking
┌──────┬────────┬───────────┬──────────────────┐
│ Rank │ Doc ID │ RRF Score │ Why              │
├──────┼────────┼───────────┼──────────────────┤
│ 1st  │   1    │  0.0325   │ Best vector      │
│ 2nd  │   7    │  0.0323   │ Best text        │
│ 3rd  │   5    │  0.0317   │ Balanced         │
│ 4th  │  12    │  0.0315   │ Lower in both    │
└──────┴────────┴───────────┴──────────────────┘
Why RRF Over Weighted Sum?
Problem with Weighted Sum:
├─ Vector distances:    0-2 range
├─ BM25 scores:         0-100+ range
└─ Different scales need manual tuning ✗

RRF Advantages:
├─ ✓ Scale Independent  (uses ranks, not raw scores)
├─ ✓ Robust             (stable across score distributions)
├─ ✓ No Tuning          (no manual weight calibration)
└─ ✓ Industry Standard  (Elasticsearch, Vespa, etc.)

Storage Type Deep Dive

Choose your poison: Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid). Each trades speed, memory, and accuracy differently.

Full Deep Dive: See INDEX.md for complete algorithms, benchmarks, and implementation details.

Flat Index (Brute Force)

The Simplest Approach: Compare query against EVERY vector. 100% recall, zero approximation.

How It Works
Query Vector
     ↓
┌────────────────────┐
│  Compare to ALL n  │
│     vectors        │  ← Every single one checked
└────────────────────┘   No shortcuts!
        ↓
   Sort by distance
        ↓
   Return top K
Implementation
1. Store vectors → Raw float32 arrays
2. Search step   → Compute distance to all vectors
3. Selection     → Keep top-k closest
4. Index struct  → None (just a list)
Complexity Analysis
┌─────────────���──────────────────────────┐
│ Operation   │ Complexity               │
├─────────────┼──────────────────────────┤
│ Build       │ O(1) - just store        │
│ Search      │ O(n × d) - check all     │
│ Memory      │ O(n × d) - raw vectors   │
│ Recall      │ 100% - exhaustive        │
└─────────────┴──────────────────────────┘
When to Use
✓ Small datasets (<10K vectors)
✓ 100% recall required (fingerprinting, security)
✓ Benchmarking baseline
✗ Large datasets (too slow)

HNSW Index (Hierarchical Graph)

The Graph Navigator: Build a multi-layer graph where each node connects to nearby vectors. Search by greedy navigation from layer to layer.

Graph Structure
Layer 2: [A]─────────[D]           ← Highways (long jumps)
          │           │              Few nodes

Layer 1: [A]──[B]────[D]──[E]      ← Roads (medium jumps)
          │   │ \    │ \  │          More nodes

Layer 0: [A]─[B]─[C]─[D]─[E]─[F]   ← Streets (short links)
                                      ALL vectors here
Search Process
1. Start at top layer    → Long-range navigation
2. Greedy best move      → Move to closest neighbor
3. Drop to next layer    → Refine search
4. Repeat steps 2-3      → Until Layer 0
5. Return k-nearest      → Final results
How Insertion Works
New Vector:
├─ 1. Add to Layer 0 (always)
├─ 2. Randomly assign to higher layers (exponential decay)
├─ 3. Connect to M nearest neighbors per layer
└─ 4. Update neighbors' connections (bidirectional)
Complexity Analysis
┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Search      │ O(M × efSearch × log n)        │
│             │ Typically checks <1% of data   │
│ Insert      │ O(M × efConstruction × log n)  │
│ Memory      │ O(n × d + n × M × log n)       │
│ Recall      │ 95-99% with proper tuning      │
└─────────────┴────────────────────────────────┘
Key Parameters
M (Connections per layer):
├─ 4-8:   Low memory, lower recall
├─ 16:    Balanced (default)
└─ 32-48: High recall, more memory

efConstruction (Build quality):
├─ 100:   Fast build, lower quality
├─ 200:   Good balance (default)
└─ 400+:  Better graph, slower build

efSearch (Search quality):
├─ 50:    Fast search, ~85% recall
├─ 200:   Balanced (default), ~96% recall
└─ 400+:  High recall, slower search
When to Use
✓ Large datasets (10K-10M vectors)
✓ Need 95-99% recall
✓ Sub-millisecond latency required
✓ Can afford 2-3x memory overhead
✗ Memory constrained environments

IVF Index (Inverted File with Clustering)

The Clustering Approach: Partition vectors into clusters using k-means. Search only the nearest clusters.

Build Phase (Training)
All Vectors (n)
      ↓
   k-means clustering
      ↓
┌─���──────────────────────────────────┐
│ [C1]  [C2]  [C3]  [C4]  ... [Cn]  │  ← Centroids
└────────────────────────────────────┘
   ↓      ↓     ↓     ↓         ↓
  {v1}   {v8}  {v3}  {v12}    {v7}     ← Vectors assigned
  {v5}   {v9}  {v6}  {v19}    {v11}      to nearest
  {v20}  ...   ...   ...      ...        centroid
Search Phase
Query Vector
      ↓
Find nProbe nearest centroids
      ↓
┌────────────────────────────┐
│ [C2] [C3]  (only these!)   │  ← Search 2 of 100 clusters
└────────────────────────────┘
   ↓     ↓
  {...} {...}  Search within clusters
      ↓
  Top-k results
Key Insight
Instead of:  Check all 1M vectors
Do this:     1. Find 8 nearest clusters  (O(nClusters))
             2. Search 80K vectors total  (10% of data)
             └─> 10-20x faster!
Complexity Analysis
┌─────────────┬───���────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Build       │ O(iterations × n × k × d)      │
│             │ k-means clustering             │
│ Search      │ O(k × d + (n/k) × nProbe × d)  │
│             │ Typically checks 5-20% of data │
│ Memory      │ O(n × d + k × d)               │
│ Recall      │ 80-95% depending on nProbe     │
└─────────────┴────────────────────────────────┘
Key Parameters
nClusters (number of partitions):
├─ Rule of thumb: sqrt(n) to n/10
├─ 10K vectors   → 100 clusters
├─ 100K vectors  → 316 clusters
├─ 1M vectors    → 1,000 clusters
└─ More clusters → Faster search, slower build

nProbe (clusters to search):
├─ 1:   Fastest, ~60-70% recall
├─ 8:   Good balance, ~85% recall
├─ 16:  Better recall, ~92% recall
└─ 32+: High recall, ~96% recall
When to Use
✓ Large datasets (>100K vectors)
✓ Can tolerate 85-95% recall
✓ Want 10-20x speedup over Flat
✓ Have training data available
✗ Need 100% recall (use Flat)
✗ Dataset too small (<10K)

PQ Index (Product Quantization)

The Compression Master: Split vectors into subvectors, quantize each subvector to 256 codes. Reduce memory by 16-32×.

Compression Process
Original Vector (384 dimensions):
  [0.23, 0.91, ..., 0.15, 0.44, ..., 0.73, 0.22, ...]
   \_____48D_____/  \_____48D_____/  \_____48D_____/
   Subspace 1       Subspace 2       Subspace 3
        ↓                ↓                ↓
   K-means on       K-means on       K-means on
   subspace 1       subspace 2       subspace 3
        ↓                ↓                ↓
 Codebook with    Codebook with    Codebook with
 256 centroids    256 centroids    256 centroids
        ↓                ↓                ↓
   Find nearest    Find nearest     Find nearest
   centroid ID     centroid ID      centroid ID
        ↓                ↓                ↓
      [12]             [203]            [45]
       1 byte          1 byte           1 byte
Result
Before: [0.23, 0.91, ...] → 384 × 4 bytes = 1536 bytes
After:  [12, 203, 45, ...]  → 8 × 1 byte  = 8 bytes

192x compression!
Search Process
1. Query arrives
        ↓
2. Split query into subspaces (like training)
        ↓
3. Precompute distance tables for each subspace
   (query_subspace to all 256 codebook centroids)
        ↓
4. For each vector:
   - Look up codes: [12, 203, 45, ...]
   - Table lookup: distances[12] + distances[203] + ...
   - No float operations! Just array lookups
        ↓
5. Return top-k
Complexity Analysis
┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Training    │ O(M × iterations × K × n/M)    │
│             │ K-means on each subspace       │
│ Encoding    │ O(M × K × d/M) per vector      │
│ Search      │ O(M × K + n × M)               │
│             │ Super fast, cache-friendly     │
│ Memory      │ O(n × M) - massive savings!    │
│ Recall      │ 70-85% typical                 │
└─────────────┴────────────────────────────────┘
Memory Comparison
1M vectors, 384 dimensions:

Float32:  1M × 384 × 4 bytes = 1.46 GB
PQ-8:     1M × 8 × 1 byte    = 7.6 MB  (192x smaller!)
PQ-16:    1M × 16 × 1 byte   = 15.3 MB (96x smaller!)
PQ-32:    1M × 32 × 1 byte   = 30.5 MB (48x smaller!)

+ Codebooks: ~393 KB (shared across all vectors)
Key Parameters
M (nSubspaces):
├─ 8:   Maximum compression, lower accuracy
├─ 16:  Good balance
├─ 32:  Better accuracy, less compression
└─ 64:  High accuracy, moderate compression

bitsPerCode:
└─ 8:   Standard (256 centroids per subspace)
        Perfect for uint8 storage
When to Use
✓ Massive datasets (millions of vectors)
✓ Memory is the bottleneck
✓ Can tolerate 70-85% recall
✓ Want 30-200x memory reduction
✗ Need high recall (>95%)
✗ Have plenty of memory

IVFPQ Index (Hybrid: Clustering + Quantization)

Best of Both Worlds: IVF clusters to reduce search space, PQ compression to reduce memory. The ultimate scalability index.

Build Process

Step 1: IVF Clustering

All Vectors
      ↓
  K-means clustering
      ↓
[C1]  [C2]  [C3]  [C4]  ... [Cn]

Step 2: PQ Compression per Cluster

Cluster 1:        Cluster 2:        Cluster 3:
  {vectors}         {vectors}         {vectors}
      ↓                 ↓                 ↓
  Apply PQ          Apply PQ          Apply PQ
      ↓                 ↓                 ↓
[12,203,45]      [91,34,178]       [56,211,19]
[88,9,101]       [23,156,88]       [199,44,73]
[...]            [...]             [...]
           uint8 codes      uint8 codes       uint8 codes
Search Process
Step 1: IVF Stage
  Query → Find nProbe nearest centroids
          ↓
     [C2] [C3] [C7]  (e.g., 3 of 100 clusters)
          ↓
     Search only 3% of the data!

Step 2: PQ Stage
  For each selected cluster:
    ├─ Precompute distance tables
    ├─ Fast table lookups on PQ codes
    └─ No float operations
          ↓
     Top-k results
The Magic Combination
IVF contributes:
└─ Speed:  10-100x faster (search only nProbe clusters)

PQ contributes:
└─ Memory: 30-200x smaller (compressed codes)

Combined:
└─ Fast + Tiny = Billion-scale capability!
Complexity Analysis
┌─────────────┬────────────────────────────────┐
│ Operation   │ Complexity                     │
├─────────────┼────────────────────────────────┤
│ Training    │ O(IVF_kmeans + PQ_kmeans)      │
│ Search      │ O(k×d + (n/k)×nProbe×M)        │
│             │ Searches ~1-10% of data        │
│ Memory      │ O(n×M + k×d)                   │
│             │ Massive compression            │
│ Recall      │ 70-90% depending on params     │
└─────────────┴────────────────────────────────┘
Memory Savings Example
1M vectors, 384 dimensions:

Float32 + IVF:   1.46 GB
IVFPQ (M=8):     7.6 MB + 400 KB (centroids)
                 = ~8 MB total

180x compression!
Key Parameters
nClusters (IVF):
├─ 100:   For 100K vectors
├─ 1K:    For 1M vectors
└─ 10K:   For 100M vectors

nProbe (IVF search):
├─ 1:     Fastest, lower recall
├─ 8:     Good balance
└─ 16:    Better recall

M (PQ subspaces):
├─ 8:     Maximum compression
├─ 16:    Good balance
└─ 32:    Better accuracy
When to Use
✓ Billion-scale datasets
✓ Need <10ms latency
✓ Severe memory constraints
✓ Can tolerate 70-90% recall
✓ Want 100x speed + 100x compression
✗ Need >95% recall (use HNSW)
✗ Small datasets (use Flat or HNSW)
Real-World Example
Use Case: 100M vectors, 768-dim

Flat Index:
├─ Memory: ~288 GB
├─ Search: ~10 seconds
└─ Not feasible! ✗

IVFPQ:
├─ Memory: ~800 MB
├─ Search: ~5 ms
└─ Practical! ✓

Decision Matrix
Index Recall Search Speed Memory Build Time Best For
Flat 100% Slow O(n) High Instant <10k vectors, benchmarks
HNSW 90-99% Fast O(log n) Highest Slow 10k-10M vectors, low latency
IVF 80-95% Medium High Medium >100k vectors, moderate recall
PQ 70-85% Fast Lowest Slow >1M vectors, memory-constrained
IVFPQ 70-90% Fastest Very Low Slow >10M vectors, billion-scale

Rules of thumb:

  • Small data (<10k): Flat - brute force is fast enough
  • Medium data (10k-1M): HNSW - best recall/speed tradeoff
  • Large data (1M-100M): IVF or PQ - choose speed (IVF) vs memory (PQ)
  • Massive data (>100M): IVFPQ - only option that scales to billions

Latency targets:

  • Flat: 1-100ms (depends on n)
  • HNSW: 0.1-2ms (sub-millisecond possible)
  • IVF: 0.5-5ms
  • PQ: 1-10ms (fast scan but approximate)
  • IVFPQ: 0.5-3ms (fastest for massive datasets)

API Reference

For detailed API documentation, see API.md.

Examples

For practical examples and code samples, see EXAMPLE.md.

Configuration

Basic Configuration
// Flat Index - No configuration needed
flatIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

// HNSW Index - Configure build and search parameters
hnswIdx, _ := comet.NewHNSWIndex(
    384,              // dimensions
    comet.Cosine,     // distance metric
    16,               // M: connections per layer
    200,              // efConstruction: build quality
    200,              // efSearch: search quality
)

// IVF Index - Configure clustering
ivfIdx, _ := comet.NewIVFIndex(
    384,              // dimensions
    comet.Cosine,     // distance metric
    100,              // nClusters: number of partitions
)

// Training required for IVF
trainingVectors := []comet.VectorNode{ /* ... */ }
ivfIdx.Train(trainingVectors)
Configuration Options
HNSW Parameters

M (Connections Per Layer)

  • Type: int
  • Default: 16
  • Range: 4 to 64
  • Description: Number of bidirectional connections per node at each layer (except layer 0 which uses 2×M)

Effects of different values:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 4-8     │ Low memory, faster build, lower recall   │
│ 12-16   │ Balanced (recommended for most cases)    │
│ 24-48   │ High recall, slower build, more memory   │
│ 64+     │ Diminishing returns, excessive memory    │
└─────────┴──────────────────────────────────────────┘

efConstruction (Build-Time Candidate List Size)

  • Type: int
  • Default: 200
  • Range: 100 to 1000
  • Description: Size of candidate list during index construction. Higher = better graph quality

Effects:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 100     │ Fast build, lower quality graph          │
│ 200     │ Good balance (default)                   │
│ 400-500 │ Better quality, 2-3x slower build        │
│ 800+    │ Marginal gains, very slow build          │
└─────────┴──────────────────────────────────────────┘

efSearch (Search-Time Candidate List Size)

  • Type: int
  • Default: 200
  • Range: k to 1000 (must be >= k)
  • Description: Size of candidate list during search. Can be adjusted dynamically.

Effects:

┌─────────┬──────────────────────────────────────────┐
│ Value   │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 50      │ Very fast, ~85% recall                   │
│ 100     │ Fast, ~92% recall                        │
│ 200     │ Balanced, ~96% recall                    │
│ 400     │ Slower, ~98% recall                      │
│ 800     │ Much slower, ~99% recall                 │
└─────────┴──────────────────────────────────────────┘

Example:

// Create HNSW with high quality settings
index, _ := comet.NewHNSWIndex(
    384,
    comet.Cosine,
    32,    // More connections = better recall
    400,   // Higher construction quality
    200,   // Search quality (can adjust later)
)

// Dynamically adjust search quality
index.SetEfSearch(400)  // Trade speed for recall
IVF Parameters

nClusters (Number of Clusters)

  • Type: int
  • Default: sqrt(n) where n = dataset size
  • Range: 16 to n/100
  • Description: Number of k-means clusters (Voronoi cells) to partition the space
┌─────────────┬────────────┬─────────────────────────┐
│ Dataset Size│ nClusters  │ Typical Range           │
├─────────────┼────────────┼─────────────────────────┤
│ 10K         │ 100        │ 50-200                  │
│ 100K        │ 316        │ 200-500                 │
│ 1M          │ 1000       │ 500-2000                │
│ 10M         │ 3162       │ 2000-5000               │
└─────────────┴────────────┴─────────────────────────┘

nProbes (Search-Time Clusters to Probe)

  • Type: int
  • Default: 1
  • Range: 1 to nClusters
  • Description: Number of nearest clusters to search during query
┌─────────┬──────────────────────────────────────────┐
│ nProbes │ Effect                                   │
├─────────┼──────────────────────────────────────────┤
│ 1       │ Fastest, ~60-70% recall                  │
│ 8       │ Good balance, ~85% recall                │
│ 16      │ Better recall, ~92% recall               │
│ 32      │ High recall, ~96% recall                 │
│ 64      │ Very high recall, slower                 │
└─────────┴──────────────────────────────────────────┘

Example:

// Create IVF index
index, _ := comet.NewIVFIndex(384, comet.Cosine, 256)

// Train with representative sample
trainData := sampleVectors(10000)  // 10K samples for training
index.Train(trainData)

// Search with multiple probes
results, _ := index.NewSearch().
    WithQuery(query).
    WithK(10).
    WithNProbes(16).  // Search 16 nearest clusters
    Execute()
Advanced Configuration
Fusion Configuration
// Weighted Sum Fusion (manual weights)
config := &comet.FusionConfig{
    VectorWeight: 0.7,  // 70% weight to semantic similarity
    TextWeight:   0.3,  // 30% weight to keyword relevance
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)

hybridSearch := hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusion(fusion).
    Execute()

// Reciprocal Rank Fusion (rank-based, no weights needed)
hybridSearch := hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusionKind(comet.ReciprocalRankFusion).
    Execute()
Distance Metrics
// Euclidean (L2): Use when magnitude matters
euclideanIdx, _ := comet.NewFlatIndex(384, comet.Euclidean)

// L2 Squared: Faster than Euclidean (no sqrt), preserves ranking
l2sqIdx, _ := comet.NewFlatIndex(384, comet.L2Squared)

// Cosine: Use for normalized vectors (text embeddings, etc.)
cosineIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

API Details

FlatIndex

Constructor:

func NewFlatIndex(dim int, distanceKind DistanceKind) (*FlatIndex, error)

Parameters:

Parameter Type Required Description
dim int Yes Vector dimension (must be > 0)
distanceKind DistanceKind Yes Distance metric: Euclidean, L2Squared, or Cosine

Returns:

  • *FlatIndex: Initialized flat index
  • error: Error if parameters are invalid
HNSWIndex

Constructor:

func NewHNSWIndex(dim int, distanceKind DistanceKind, m, efConstruction, efSearch int) (*HNSWIndex, error)

Parameters:

Parameter Type Required Default Description
dim int Yes - Vector dimension (must be > 0)
distanceKind DistanceKind Yes - Distance metric: Euclidean, L2Squared, or Cosine
m int No 16 Max connections per node (higher = better accuracy, more memory)
efConstruction int No 200 Build-time search depth (higher = better graph quality, slower build)
efSearch int No efConstruction Query-time search depth (higher = better accuracy, slower search)

Returns:

  • *HNSWIndex: Initialized HNSW index
  • error: Error if parameters are invalid

Parameter Guidelines:

┌──────────────┬─────┬────────────────┬──────────┬─────────────┐
│ Use Case     │  M  │ efConstruction │ efSearch │ Description │
├──────────────┼─────┼────────────────┼──────────┼─────────────┤
│ Fast         │  8  │      100       │    50    │ Speed first │
│ Balanced     │ 16  │      200       │   100    │ Default     │
│ High Recall  │ 32  │      400       │   200    │ Accuracy    │
│ Memory Eff.  │  4  │       50       │    25    │ Low memory  │
└──────────────┴─────┴────────────────┴──────────┴─────────────┘
IVFIndex

Constructor:

func NewIVFIndex(dim int, nlist int, distanceKind DistanceKind) (*IVFIndex, error)

Parameters:

Parameter Type Required Description
dim int Yes Vector dimension (must be > 0)
nlist int Yes Number of clusters/partitions (must be > 0)
distanceKind DistanceKind Yes Distance metric: Euclidean, L2Squared, or Cosine

Returns:

  • *IVFIndex: Initialized IVF index
  • error: Error if parameters are invalid

Parameter Guidelines:

nlist Selection:
├─ Small dataset (10K-100K):    nlist = sqrt(n) = 100-300
├─ Medium dataset (100K-1M):    nlist = sqrt(n) = 300-1000
├─ Large dataset (1M-10M):      nlist = sqrt(n) = 1000-3000
└─ Very large (10M+):           nlist = sqrt(n) = 3000-10000

Rule of thumb: nlist ≈ sqrt(number_of_vectors)

Search Parameters:

// nProbe: number of clusters to search (1 to nlist)
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).  // Search top 8 nearest clusters
    Execute()

Speed vs Accuracy:
├─ nProbe = 1:     Fastest, lowest recall
├─ nProbe = 8:     Balanced (typical)
├─ nProbe = nlist: Exhaustive (same as flat)
PQIndex

Constructor:

func NewPQIndex(dim int, distanceKind DistanceKind, M int, Nbits int) (*PQIndex, error)

Parameters:

Parameter Type Required Constraint Description
dim int Yes > 0 Vector dimension
distanceKind DistanceKind Yes - Distance metric: Euclidean, L2Squared, or Cosine
M int Yes dim % M == 0 Number of subspaces (dim must be divisible by M)
Nbits int Yes 1-16 Bits per subspace (typical: 8)

Returns:

  • *PQIndex: Initialized PQ index
  • error: Error if parameters are invalid or constraints violated

Parameter Guidelines:

M (Number of Subspaces):
├─ M = 8:    192x compression (typical, good balance)
├─ M = 16:    96x compression (better accuracy)
├─ M = 32:    48x compression (highest accuracy)
└─ Constraint: dim must be divisible by M

Nbits (Codebook size = 2^Nbits):
├─ Nbits = 4:   16 centroids per subspace (very fast)
├─ Nbits = 8:  256 centroids per subspace (typical)
├─ Nbits = 12: 4096 centroids (high accuracy)
└─ Constraint: 1 ≤ Nbits ≤ 16

Common Configurations:
┌──────┬───────┬───────────────┬─────────────────────┐
│  M   │ Nbits │ Compression   │ Use Case            │
├──────┼───────��───────────────┼────────────────��────┤
│   8  │   8   │ 192x (1 byte) │ Maximum compression │
│  16  │   8   │  96x (2 byte) │ Balanced            │
│  32  │   8   │  48x (4 byte) │ Better accuracy     │
│   8  │  12   │ 128x (1.5 B)  │ Higher quality      │
└──────┴───────┴───────────────┴─────────────────────┘
IVFPQIndex

Constructor:

func NewIVFPQIndex(dim int, distanceKind DistanceKind, nlist int, m int, nbits int) (*IVFPQIndex, error)

Parameters:

Parameter Type Required Constraint Description
dim int Yes > 0 Vector dimension
distanceKind DistanceKind Yes - Distance metric: Euclidean, L2Squared, or Cosine
nlist int Yes > 0 Number of IVF clusters
m int Yes dim % m == 0 Number of PQ subspaces
nbits int Yes 1-16 Bits per PQ subspace

Returns:

  • *IVFPQIndex: Initialized IVFPQ index
  • error: Error if parameters are invalid

Parameter Guidelines:

Combined IVF + PQ Configuration:

IVF Parameters:
├─ nlist ≈ sqrt(n) for number of vectors
└─ See IVFIndex guidelines above

PQ Parameters:
├─ m: Typically 8, 16, or 32
└─ nbits: Typically 8

Common Configurations for 100M vectors (384-dim):
┌────────┬─────┬────────┬──────────┬─────────────────┐
│ nlist  │  m  │ nbits  │ Memory   │ Use Case        │
├────────┼─────┼────────┼──────────┼─────────────────┤
│  4096  │  8  │   8    │  ~800 MB │ Extreme speed   │
│  8192  │  8  │   8    │  ~900 MB │ Balanced        │
│ 16384  │ 16  │   8    │ ~1.6 GB  │ Better accuracy │
│  8192  │  8  │  12    │ ~1.2 GB  │ High quality    │
└────────┴─────��────────┴──────────┴─────────────────┘

Memory Savings Example (100M vectors, 384-dim):
├─ Original float32:  100M × 384 × 4 = 146 GB
├─ IVF only:          Still ~146 GB (no compression)
├─ PQ only:           100M × 8 × 1 = 760 MB (faster train)
└─ IVFPQ (m=8):       100M × 8 × 1 + centroids ≈ 800 MB (best of both!)

Search Parameters:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).       // IVF: clusters to search
    WithNRefine(100).    // Optional: refine top-100 with original vectors
    Execute()
BM25SearchIndex

Constructor:

func NewBM25SearchIndex() *BM25SearchIndex

Parameters: None (uses default BM25 parameters: k1=1.5, b=0.75)

Returns:

  • *BM25SearchIndex: Initialized BM25 full-text search index

BM25 Parameters:

  • k1 = 1.5: Term frequency saturation (typical range: 1.2-2.0)
  • b = 0.75: Document length normalization (typical range: 0.5-0.9)
RoaringMetadataIndex

Constructor:

func NewRoaringMetadataIndex() *RoaringMetadataIndex

Parameters: None

Returns:

  • *RoaringMetadataIndex: Initialized metadata filtering index using Roaring Bitmaps
HybridSearchIndex

Constructor:

func NewHybridSearchIndex(
    vectorIndex VectorIndex,
    textIndex TextIndex,
    metadataIndex MetadataIndex,
) HybridSearchIndex

Parameters:

Parameter Type Required Description
vectorIndex VectorIndex Yes Any vector index (Flat, HNSW, IVF, PQ, IVFPQ)
textIndex TextIndex Yes BM25 text search index
metadataIndex MetadataIndex Yes Roaring metadata filter index

Returns:

  • HybridSearchIndex: Initialized hybrid search combining all three modalities

Search APIs

Vector Search (VectorSearch Interface)

Creating a Search:

search := index.NewSearch()  // Available on all VectorIndex implementations

Methods:

Method Parameters Description
WithQuery queries ...[]float32 Set query vector(s) for similarity search
WithNodes nodes ...uint32 Find vectors similar to indexed nodes by ID
WithK k int Number of results to return (default: 10)
WithDocumentIDs ids ...uint32 Pre-filter: only search within these document IDs
WithScoreAggregation agg ScoreAggregation How to combine multi-query results: SumAggregation, MaxAggregation, MeanAggregation
WithThreshold threshold float32 Only return results with distance ≤ threshold
WithAutocut enabled bool Automatically determine optimal cutoff point
Execute - Run the search, returns []VectorResult, error

HNSW-Specific Methods:

Method Parameters Description
WithEfSearch ef int Override default efSearch for this query

IVF-Specific Methods:

Method Parameters Description
WithNProbe nprobe int Number of clusters to search (1 to nlist)

IVFPQ-Specific Methods:

Method Parameters Description
WithNProbe nprobe int Number of clusters to search
WithNRefine nrefine int Refine top-N results with original vectors

Examples:

// Basic single-query search
results, err := index.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    Execute()

// Multi-query with aggregation
results, _ := index.NewSearch().
    WithQuery(query1, query2, query3).
    WithK(20).
    WithScoreAggregation(comet.MeanAggregation).
    Execute()

// Node-based search (find similar to existing vectors)
results, _ := index.NewSearch().
    WithNodes(1, 5, 10).  // IDs of indexed vectors
    WithK(10).
    Execute()

// Pre-filtered search
eligibleIDs := []uint32{100, 200, 300, 400, 500}
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// Distance threshold filtering
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithThreshold(0.5).  // Only distances ≤ 0.5
    Execute()

// Autocut (automatic result truncation)
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithAutocut(true).  // Returns fewer if quality drops
    Execute()

// HNSW with custom efSearch
hnswIndex, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 100)
results, _ := hnswIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithEfSearch(200).  // Higher accuracy for this query
    Execute()

// IVF with nProbe
ivfIndex, _ := comet.NewIVFIndex(384, 1000, comet.Cosine)
results, _ := ivfIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithNProbe(16).  // Search 16 nearest clusters
    Execute()

// IVFPQ with refinement
ivfpqIndex, _ := comet.NewIVFPQIndex(384, comet.Cosine, 8192, 8, 8)
results, _ := ivfpqIndex.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    WithNProbe(32).
    WithNRefine(100).  // Refine top-100 with original vectors
    Execute()

Result Format:

type VectorResult struct {
    Node  VectorNode  // The matched vector node
    Score float32     // Distance score (lower = more similar)
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()
    vector := result.Node.Vector()
    fmt.Printf("ID: %d, Distance: %.4f\n", id, score)
}

Score Aggregation Strategies:

// When using multiple queries, results are aggregated:

comet.SumAggregation   // Sum of all distances (penalizes far results)
comet.MaxAggregation   // Maximum distance (pessimistic)
comet.MeanAggregation  // Average distance (balanced, recommended)
Text Search (TextSearch Interface)

Creating a Search:

search := textIndex.NewSearch()  // BM25SearchIndex

Methods:

Method Parameters Description
WithQuery query string Text query for keyword search
WithK k int Number of results to return (default: 10)
WithDocumentIDs ids ...uint32 Pre-filter: only search within these document IDs
WithAutocut enabled bool Automatically determine optimal cutoff point
Execute - Run the search, returns []TextResult, error

Examples:

// Basic text search
results, err := textIndex.NewSearch().
    WithQuery("machine learning algorithms").
    WithK(10).
    Execute()

// Pre-filtered text search
eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := textIndex.NewSearch().
    WithQuery("neural networks").
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// With autocut
results, _ := textIndex.NewSearch().
    WithQuery("deep learning").
    WithK(50).
    WithAutocut(true).  // Returns fewer if relevance drops
    Execute()

Result Format:

type TextResult struct {
    Node  TextNode  // The matched text node
    Score float32   // BM25 relevance score (higher = more relevant)
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()
    text := result.Node.Text()
    fmt.Printf("ID: %d, BM25: %.4f, Text: %s\n", id, score, text)
}

BM25 Scoring:

BM25 score components:
├─ IDF: Term rarity (rare terms score higher)
├─ TF:  Term frequency (with saturation via k1)
└─ DL:  Document length normalization (via b)

Higher score = more relevant
Typical range: 0-10+ (no upper bound)
Metadata Search (MetadataSearch Interface)

Creating a Search:

search := metadataIndex.NewSearch()  // RoaringMetadataIndex

Methods:

Method Parameters Description
WithFilters filters ...Filter Metadata filter conditions (AND logic)
Execute - Run the search, returns []MetadataResult, error

Filter Functions:

Function Parameters Description
Eq field string, value interface{} Equality: field == value
Lt field string, value interface{} Less than: field < value
Lte field string, value interface{} Less than or equal: field ≤ value
Gt field string, value interface{} Greater than: field > value
Gte field string, value interface{} Greater than or equal: field ≥ value
Between field string, min, max interface{} Range: min ≤ field ≤ max

Examples:

// Single filter
results, err := metadataIndex.NewSearch().
    WithFilters(comet.Eq("category", "electronics")).
    Execute()

// Multiple filters (AND logic)
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "books"),
        comet.Gte("rating", 4.0),
        comet.Lte("price", 50.0),
        comet.Eq("in_stock", true),
    ).
    Execute()

// Range filter
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Between("year", 2020, 2024),
        comet.Eq("status", "published"),
    ).
    Execute()

// Numeric comparisons
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Gt("views", 1000),
        comet.Lt("price", 100),
    ).
    Execute()

// String equality
results, _ := metadataIndex.NewSearch().
    WithFilters(
        comet.Eq("author", "John Doe"),
        comet.Eq("language", "en"),
    ).
    Execute()

Result Format:

type MetadataResult struct {
    Node MetadataNode  // The matched metadata node
}

// Access results
for _, result := range results {
    id := result.GetId()
    score := result.GetScore()  // Always 0 (binary match)
    metadata := result.Node.Metadata()

    fmt.Printf("ID: %d\n", id)
    fmt.Printf("Category: %v\n", metadata["category"])
    fmt.Printf("Price: %v\n", metadata["price"])
}

Supported Data Types:

Supported field types:
├─ string:   Equality only (Eq)
├─ int/int64: All comparisons (Eq, Lt, Lte, Gt, Gte, Between)
├─ float32/float64: All comparisons
├─ bool:     Equality only (Eq)
���─ nil:      Equality only (Eq)
Hybrid Search (HybridSearch Interface)

Creating a Search:

search := hybridIndex.NewSearch()  // HybridSearchIndex

Methods:

Method Parameters Description
WithVectorQuery queries ...[]float32 Semantic search queries (vector embeddings)
WithTextQuery query string Keyword search query (BM25)
WithFilters filters ...Filter Metadata filters (pre-filtering before search)
WithK k int Number of results to return per modality before fusion
WithFusion fusion FusionKind Score fusion strategy: ReciprocalRankFusion or WeightedSumFusion
WithDocumentIDs ids ...uint32 Pre-filter: only search within these document IDs
WithAutocut enabled bool Automatically determine optimal cutoff point
Execute - Run hybrid search, returns []HybridResult, error

Fusion Strategies:

Strategy Description When to Use
ReciprocalRankFusion Rank-based fusion: score = Σ(1/(k+rank)) Default, works without tuning, handles different score scales
WeightedSumFusion Linear combination of normalized scores When you want to weight modalities differently

Examples:

// All three modalities
results, err := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("machine learning").
    WithFilters(
        comet.Eq("category", "ai"),
        comet.Gte("year", 2020),
    ).
    WithK(20).
    WithFusion(comet.ReciprocalRankFusion).
    Execute()

// Vector + Text only (no metadata filters)
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("neural networks").
    WithK(10).
    Execute()

// Vector + Metadata only (no text)
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithFilters(
        comet.Eq("category", "research"),
        comet.Lte("price", 0),  // Free papers
    ).
    WithK(15).
    Execute()

// Multi-query vector search with text
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(embedding1, embedding2, embedding3).
    WithTextQuery("deep learning transformers").
    WithK(20).
    WithFusion(comet.ReciprocalRankFusion).
    Execute()

// Pre-filtered hybrid search
eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("optimization").
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// With autocut
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(queryEmbedding).
    WithTextQuery("information retrieval").
    WithK(50).
    WithAutocut(true).
    Execute()

Result Format:

type HybridResult struct {
    ID          uint32   // Document ID
    FusedScore  float32  // Combined score from all modalities
    VectorScore float32  // Individual vector similarity score
    TextScore   float32  // Individual BM25 score
}

// Access results
for _, result := range results {
    fmt.Printf("ID: %d\n", result.ID)
    fmt.Printf("Fused Score: %.4f\n", result.FusedScore)
    fmt.Printf("Vector: %.4f, Text: %.4f\n",
        result.VectorScore, result.TextScore)
}

How Fusion Works:

Reciprocal Rank Fusion (RRF):
┌──────────────────────────────────────┐
│ Step 1: Get results from each modality
│   Vector: [id=1, id=5, id=7, ...]
│   Text:   [id=7, id=1, id=12, ...]
│
│ Step 2: Compute RRF score per document
│   RRF(doc) = Σ 1/(k + rank_i)
│   where k=60 (default), rank_i is rank in modality i
│
│ Example for doc_id=1:
│   Vector rank: 1 → 1/(60+1) = 0.0164
│   Text rank:   2 → 1/(60+2) = 0.0161
│   RRF score:       0.0164 + 0.0161 = 0.0325
│
│ Step 3: Sort by RRF score (higher = better)
└──────────────────────────────────────┘

Benefits:
├─ No score normalization needed
├─ Robust to different score scales
├─ Emphasizes top-ranked results
└─ Works out-of-the-box (no tuning)

Search Flow:

Hybrid Search Execution:
┌───────────────────────────────────────────┐
│ 1. Metadata Filtering (if filters present)│
│    → Get eligible document IDs            │
└────────────┬──────────────────────────────┘
             │
      ┌──────┴──────┐
      │             │
┌─────▼─────┐ ┌─────▼─────┐
│  Vector   │ │   Text    │
│  Search   │ │  Search   │
│ (top-K)   │ │ (top-K)   │
└─────┬─────┘ └─────┬─────┘
      │             │
      └──────┬──────┘
             │
       ┌─────▼─────┐
       │   Fusion  │
       │ (RRF/WS)  │
       └─────┬─────┘
             │
       ┌─────▼─────┐
       │  Results  │
       │ (sorted)  │
       └───────────┘

Use Case Examples:

// E-commerce: "Show me red dresses under $100 similar to this image"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(imageEmbedding).
    WithTextQuery("red dress").
    WithFilters(
        comet.Eq("category", "dresses"),
        comet.Lte("price", 100),
        comet.Eq("in_stock", true),
    ).
    WithK(20).
    Execute()

// Academic search: "Papers about 'attention mechanisms' similar to this abstract, published after 2020"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(abstractEmbedding).
    WithTextQuery("attention mechanisms transformers").
    WithFilters(
        comet.Gte("year", 2020),
        comet.Eq("peer_reviewed", true),
    ).
    WithK(50).
    Execute()

// Job search: "Backend engineer roles in San Francisco paying over $150k"
results, _ := hybridIndex.NewSearch().
    WithVectorQuery(resumeEmbedding).
    WithTextQuery("backend engineer golang kubernetes").
    WithFilters(
        comet.Eq("location", "San Francisco"),
        comet.Gte("salary", 150000),
        comet.Eq("remote", false),
    ).
    WithK(30).
    Execute()

Use Cases

Problem: A documentation platform needs to find relevant documents based on user queries, going beyond simple keyword matching to understand intent.

Solution: Use Comet's vector search with text embeddings from a sentence transformer model.

Implementation:

package main

import (
    "github.com/wizenheimer/comet"
    "log"
)

type Document struct {
    ID       uint32
    Title    string
    Content  string
    Embedding []float32  // From sentence-transformers
}

func main() {
    // Create HNSW vector store for 384-dim embeddings
    m, efC, efS := comet.DefaultHNSWConfig()
    index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efC, efS)

    // Add documents (embeddings generated by your ML model)
    docs := []Document{
        {Title: "Getting Started", Embedding: getEmbedding("getting started guide")},
        {Title: "API Reference", Embedding: getEmbedding("api documentation")},
        {Title: "Troubleshooting", Embedding: getEmbedding("common problems and solutions")},
    }

    for _, doc := range docs {
        node := comet.NewVectorNode(doc.Embedding)
        index.Add(*node)
    }

    // Search with user query
    queryEmbedding := getEmbedding("how do I start using this?")
    results, _ := index.NewSearch().
        WithQuery(queryEmbedding).
        WithK(5).
        Execute()

    // Return relevant documents
    for _, result := range results {
        log.Printf("Found: %s (similarity: %.4f)",
            docs[result.GetId()].Title, 1-result.GetScore())
    }
}

func getEmbedding(text string) []float32 {
    // Call your embedding model API (OpenAI, Cohere, local model, etc.)
    return nil  // Placeholder
}
Use Case 2: E-commerce Product Recommendations

Problem: Recommend similar products based on browsing history, with filtering by price, category, and availability.

Solution: Hybrid search combining product embeddings with metadata filters.

Implementation:

type Product struct {
    ID          uint32
    Name        string
    ImageVector []float32  // From image embedding model
    Price       float64
    Category    string
    InStock     bool
}

func RecommendSimilarProducts(
    productID uint32,
    maxPrice float64,
    category string,
) ([]Product, error) {
    // Setup hybrid store
    vecIdx, _ := comet.NewHNSWIndex(512, comet.Cosine, 16, 200, 200)
    metaIdx := comet.NewRoaringMetadataIndex()
    hybrid := comet.NewHybridSearchIndex(vecIdx, nil, metaIdx)

    // Get the product's embedding
    product := getProduct(productID)

    // Search with filters
    results, err := hybrid.NewSearch().
        WithVector(product.ImageVector).
        WithMetadata(
            comet.Eq("category", category),
            comet.Lte("price", maxPrice),
            comet.Eq("in_stock", true),
        ).
        WithK(10).
        Execute()

    if err != nil {
        return nil, err
    }

    // Convert to Product structs
    products := make([]Product, len(results))
    for i, result := range results {
        products[i] = getProduct(result.ID)
    }

    return products, nil
}
Use Case 3: Question-Answering System with Hybrid Retrieval

Problem: Build a QA system that combines semantic understanding (vector search) with keyword precision (BM25) for better answer retrieval.

Solution: Use RRF fusion to combine both modalities.

Implementation:

func AnswerQuestion(question string) ([]string, error) {
    // Create hybrid store
    vecIdx, _ := comet.NewFlatIndex(768, comet.Cosine)  // BERT embeddings
    txtIdx := comet.NewBM25SearchIndex()
    hybrid := comet.NewHybridSearchIndex(vecIdx, txtIdx, nil)

    // Index knowledge base
    for _, doc := range knowledgeBase {
        embedding := getBERTEmbedding(doc.Text)
        hybrid.Add(embedding, doc.Text, nil)
    }

    // Search with both modalities
    questionEmbedding := getBERTEmbedding(question)
    results, err := hybrid.NewSearch().
        WithVector(questionEmbedding).        // Semantic similarity
        WithText(question).                    // Keyword matching
        WithFusionKind(comet.ReciprocalRankFusion).  // Combine rankings
        WithK(5).
        Execute()

    if err != nil {
        return nil, err
    }

    // Extract answer passages
    answers := make([]string, len(results))
    for i, result := range results {
        answers[i] = knowledgeBase[result.ID].Text
    }

    return answers, nil
}

Contributing

Contributions are welcome! Please follow these guidelines:

Development Setup
# Clone repository
git clone https://github.com/wizenheimer/comet.git
cd comet

# Install dependencies
make deps

# Run tests
make test

# Run linter
make lint

# Check everything
make check
Code Style
  • Follow standard Go conventions (gofmt, go vet)
  • Write descriptive comments for exported functions
  • Include examples in documentation comments
  • Add tests for new features (maintain >95% coverage)
  • Keep functions focused and composable
  • Use meaningful variable names
Commit Messages

Follow conventional commits format:

Good commit messages:

  • feat: Add IVF index with k-means clustering
  • fix: Handle zero vectors in cosine distance
  • perf: Optimize HNSW search with heap pooling
  • docs: Update API reference for hybrid search
  • test: Add benchmarks for BM25 search

Bad commit messages:

  • Update code
  • Fix bug
  • WIP
  • asdf
Pull Request Process
  1. Fork the repository and create a feature branch
  2. Write your code following the style guidelines
  3. Add tests for all new functionality
  4. Run the full test suite and ensure all tests pass
  5. Update documentation (README, godoc comments)
  6. Commit with descriptive messages using conventional commits
  7. Push to your fork and open a Pull Request
  8. Respond to review feedback promptly
Code Review Checklist

Before submitting a PR, ensure:

  • All tests pass (make test)
  • Linter shows no issues (make lint)
  • Coverage remains >95% (make test-coverage)
  • Documentation is updated
  • Benchmarks show no regression (if applicable)
  • No breaking changes (unless discussed)

License

MIT License - see LICENSE file for details

Copyright (c) 2025 wizenheimer

References

Documentation

Overview

Package comet implements a BM25-based full-text search index.

WHAT IS BM25? BM25 (Best Matching 25) is a probabilistic ranking function used to estimate the relevance of documents to a given search query. It is one of the most widely used ranking functions in information retrieval.

HOW BM25 WORKS: For a given query Q with terms {t1, t2, ..., tn} and document D: 1. Tokenizes and normalizes both query and documents using UAX#29 word segmentation 2. For each query term, calculates:

  • IDF (Inverse Document Frequency): log((N - df + 0.5) / (df + 0.5) + 1) where N is total docs and df is docs containing the term
  • TF component: (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgDocLen))) where tf is term frequency in the document

3. Final score is the sum of (IDF × TF) for all query terms

KEY PARAMETERS:

  • K1 (default 1.2): Controls term frequency saturation. Higher values mean term frequency has more impact on the score. Typical range: 1.2-2.0
  • B (default 0.75): Controls document length normalization. 0 means no normalization (all docs treated equally), 1 means full normalization. Typical range: 0.75

TIME COMPLEXITY:

  • Add: O(m) where m is the number of tokens in the document
  • Search: O(q × d) where q is query tokens and d is average docs per term
  • Remove: O(m) where m is the number of tokens in the document

MEMORY REQUIREMENTS: - Stores inverted index (term -> docIDs) using roaring bitmaps for compression - Stores term frequencies (term -> docID -> count) - Stores document lengths and tokens (not full text) - Much more memory efficient than storing full document text

GUARANTEES & TRADE-OFFS: ✓ Pros:

  • Excellent relevance ranking for text search
  • Handles term frequency and document length well
  • Fast search using inverted index
  • Memory efficient (doesn't store original text)
  • Thread-safe for concurrent use

✗ Cons:

  • Requires tokenization and normalization
  • Cannot retrieve original document text
  • Updates require document replacement (remove + add)

WHEN TO USE: Use BM25 index when: 1. You need full-text search with relevance ranking 2. You want fast keyword-based search 3. Memory efficiency is important (vs storing full text) 4. You have your own document store and just need search

Package comet provides a high-performance hybrid vector search library for Go.

Comet combines multiple indexing strategies and search modalities into a unified, efficient package. It supports semantic search (vector embeddings), full-text search (BM25), metadata filtering, and hybrid search with score fusion.

Overview

Comet is built for developers who want to understand how vector databases work from the inside out. It provides production-ready implementations of modern vector search algorithms with comprehensive documentation and examples.

Quick Start

Create a vector index and perform similarity search:

package main

import (
    "fmt"
    "log"

    "github.com/wizenheimer/comet"
)

func main() {
    // Create a flat index for 384-dimensional vectors using cosine distance
    index, err := comet.NewFlatIndex(384, comet.Cosine)
    if err != nil {
        log.Fatal(err)
    }

    // Add vectors to the index
    vec1 := make([]float32, 384)
    // ... populate vec1 with your embedding ...
    node := comet.NewVectorNode(vec1)
    if err := index.Add(*node); err != nil {
        log.Fatal(err)
    }

    // Search for similar vectors
    query := make([]float32, 384)
    // ... populate query vector ...
    results, err := index.NewSearch().
        WithQuery(query).
        WithK(10).
        Execute()

    if err != nil {
        log.Fatal(err)
    }

    // Process results
    for i, result := range results {
        fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
    }
}

Vector Storage Types

Comet provides five vector index implementations, each with different tradeoffs:

FlatIndex: Brute-force exhaustive search with 100% recall. Best for small datasets (<10K vectors) or when perfect accuracy is required.

index, _ := comet.NewFlatIndex(384, comet.Cosine)

HNSWIndex: Hierarchical graph-based search with 95-99% recall and O(log n) performance. Best for most production workloads (10K-10M vectors).

m, efConstruction, efSearch := comet.DefaultHNSWConfig()
index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efConstruction, efSearch)

IVFIndex: Inverted file index using k-means clustering with 85-95% recall. Best for large datasets (>100K vectors) with moderate accuracy requirements.

index, _ := comet.NewIVFIndex(384, comet.Cosine, 100) // 100 clusters
index.Train(trainingVectors) // Training required

PQIndex: Product quantization for massive memory compression (10-500x smaller) with 70-85% recall. Best for memory-constrained environments.

m, nbits := comet.CalculatePQParams(384)
index, _ := comet.NewPQIndex(384, comet.Cosine, m, nbits)
index.Train(trainingVectors) // Training required

IVFPQIndex: Combines IVF and PQ for maximum scalability with 70-90% recall. Best for billion-scale datasets.

m, nbits := comet.CalculatePQParams(384)
index, _ := comet.NewIVFPQIndex(384, comet.Cosine, 100, m, nbits)
index.Train(trainingVectors) // Training required

Distance Metrics

Three distance metrics are supported:

Euclidean (L2): Measures absolute spatial distance. Use when magnitude matters.

index, _ := comet.NewFlatIndex(384, comet.Euclidean)

L2Squared: Squared Euclidean distance (faster, preserves ordering). Use for better performance when only relative distances matter.

index, _ := comet.NewFlatIndex(384, comet.L2Squared)

Cosine: Measures angular similarity, independent of magnitude. Use for normalized vectors like text embeddings.

index, _ := comet.NewFlatIndex(384, comet.Cosine)

BM25-based full-text search with Unicode tokenization:

index := comet.NewBM25SearchIndex()
index.Add(1, "the quick brown fox jumps over the lazy dog")
index.Add(2, "pack my box with five dozen liquor jugs")

results, _ := index.NewSearch().
    WithQuery("quick fox").
    WithK(10).
    Execute()

Metadata Filtering

Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes:

index := comet.NewRoaringMetadataIndex()

// Add documents with metadata
node := comet.NewMetadataNode(map[string]interface{}{
    "category": "electronics",
    "price": 999,
    "in_stock": true,
})
index.Add(*node)

// Filter by metadata
results, _ := index.NewSearch().
    WithFilter(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
        comet.Eq("in_stock", true),
    ).
    Execute()

Combine vector, text, and metadata search with score fusion:

// Create indexes
vecIdx, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 200)
txtIdx := comet.NewBM25SearchIndex()
metaIdx := comet.NewRoaringMetadataIndex()

// Create hybrid index
hybrid := comet.NewHybridSearchIndex(vecIdx, txtIdx, metaIdx)

// Add documents
id, _ := hybrid.Add(
    embedding,  // 384-dim vector
    "machine learning tutorial",  // text
    map[string]interface{}{  // metadata
        "category": "education",
        "price": 29.99,
    },
)

// Hybrid search with all modalities
results, _ := hybrid.NewSearch().
    WithVector(queryEmbedding).
    WithText("machine learning").
    WithMetadata(
        comet.Eq("category", "education"),
        comet.Lt("price", 50),
    ).
    WithFusionKind(comet.ReciprocalRankFusion).
    WithK(10).
    Execute()

Score Fusion Strategies

When combining results from multiple search modalities, different fusion strategies are available:

WeightedSumFusion: Linear combination with configurable weights

config := &comet.FusionConfig{
    VectorWeight: 0.7,
    TextWeight: 0.3,
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)

ReciprocalRankFusion: Rank-based fusion (scale-independent, recommended)

fusion, _ := comet.NewFusion(comet.ReciprocalRankFusion, nil)

MaxFusion/MinFusion: Simple maximum or minimum across modalities

fusion, _ := comet.NewFusion(comet.MaxFusion, nil)

Serialization

All indexes support persistence:

// Save index
file, _ := os.Create("index.bin")
defer file.Close()
index.WriteTo(file)

// Load index
file, _ := os.Open("index.bin")
defer file.Close()
index, _ := comet.NewFlatIndex(384, comet.Cosine)
index.ReadFrom(file)

Performance Tuning

HNSW parameters for tuning search quality:

// Higher M = better recall, more memory
// Lower M = faster search, less memory
m := 16  // connections per layer (default: 16)

// Higher efConstruction = better graph quality, slower build
efConstruction := 200  // build quality (default: 200)

// Higher efSearch = better recall, slower search
efSearch := 200  // search quality (default: 200)

index, _ := comet.NewHNSWIndex(384, comet.Cosine, m, efConstruction, efSearch)

IVF parameters for tuning speed/accuracy:

nClusters := 100  // more clusters = faster search, lower recall
index, _ := comet.NewIVFIndex(384, comet.Cosine, nClusters)

// At search time
nProbes := 8  // more probes = better recall, slower search
results, _ := index.NewSearch().
    WithQuery(query).
    WithK(10).
    WithNProbes(nProbes).
    Execute()

Thread Safety

All indexes are safe for concurrent use. Multiple goroutines can search simultaneously while one goroutine adds or removes vectors.

Use Cases

Document Search: Use vector embeddings for semantic search in documentation, knowledge bases, or content management systems.

Product Recommendations: Combine product image embeddings with metadata filters for personalized recommendations.

Question Answering: Use hybrid search (vector + BM25) for retrieval-augmented generation (RAG) systems.

Duplicate Detection: Use high-recall vector search to find near-duplicate documents or images.

Multi-modal Search: Combine text, image embeddings, and structured metadata for comprehensive search experiences.

Best Practices

Choose the right index type:

  • <10K vectors: Use FlatIndex
  • 10K-10M vectors: Use HNSWIndex
  • >10M vectors: Use IVFIndex or IVFPQIndex
  • Memory constrained: Use PQIndex or IVFPQIndex

Use appropriate distance metrics:

  • Text embeddings: Use Cosine
  • Image features: Use L2 or L2Squared
  • Custom embeddings: Depends on how they were trained

Batch operations:

  • Add vectors in batches for better performance
  • Call Flush() periodically after Remove() operations

Training indexes:

  • Use representative sample (10K-100K vectors) for IVF/PQ training
  • Training sample should cover the data distribution

Metadata filtering:

  • Apply filters before vector search to reduce candidates
  • Use equality filters for categorical data
  • Use range filters for numeric data

Documentation and Examples

For detailed API documentation, see the godoc comments on each type and function.

For more examples and use cases, visit: https://github.com/wizenheimer/comet

License

Package comet implements a k-Nearest Neighbors (kNN) flat index for similarity search.

WHAT IS A FLAT INDEX? A flat index is the most naive and simple approach to similarity search. The term "flat" indicates that vectors are stored without any compression or transformation - they are stored "as-is" in their original form. This is also known as brute-force or exhaustive search.

HOW kNN WORKS: For a given query vector Q, the algorithm: 1. Calculates the distance from Q to EVERY vector in the dataset 2. Sorts all distances 3. Returns the k vectors with the smallest distances

TIME COMPLEXITY:

  • Training: O(1) - No training phase required! This is one of the few ML algorithms that doesn't need training.
  • Search: O(m*n) where:
  • n = number of vectors in the dataset
  • m = dimensionality of each vector For each of the n vectors, we need O(m) time to calculate the distance.

MEMORY REQUIREMENTS: - 4 bytes per float32 component - Total per vector: 4 * d bytes (where d is the dimensionality) - No compression, so memory scales linearly with dataset size

GUARANTEES & TRADE-OFFS: ✓ Pros:

  • 100% accuracy - always finds the true nearest neighbors
  • No training phase required
  • Simple to implement and understand

✗ Cons:

  • Slow for large datasets (exhaustive search)
  • No memory compression
  • Not scalable (O(mn) time complexity)

WHEN TO USE: Use flat index only when: 1. Dataset size or embedding dimensionality is relatively small 2. You MUST have 100% accuracy (e.g., fingerprint matching, security applications) 3. Speed is not a critical concern

Package comet implements HNSW (Hierarchical Navigable Small World).

WHAT IS HNSW? HNSW is a state-of-the-art graph-based algorithm for approximate nearest neighbor search. It builds a multi-layered graph where search is O(log n) - incredibly fast!

HIERARCHICAL SKIP-LIST-LIKE STRUCTURE

Layer 2: Few nodes, long-range connections (highways) Layer 1: More nodes, medium-range connections (state roads) Layer 0: All nodes, short-range connections (local streets)

Search starts at top layer and descends, getting more refined at each level!

PERFORMANCE:

  • Search: O(log n) with ~95-99% recall
  • Memory: 2-3x raw vectors (stores full precision + graph)
  • Build: O(log n) per insertion
  • Use case: When speed and accuracy are critical

TIME COMPLEXITY:

  • Insert: O(M × efConstruction × log n)
  • Search: O(M × efSearch × log n)
  • Typical: 10-50 distance calculations for 1M vectors!

Package comet implements a hybrid search index that combines vector, text, and metadata search.

WHAT IS HYBRIDSEARCHINDEX? HybridSearchIndex is a facade that provides a unified interface over three specialized indexes: 1. VectorIndex: For semantic similarity search using vector embeddings 2. TextIndex: For keyword-based BM25 full-text search 3. MetadataIndex: For filtering by structured metadata attributes

HOW IT WORKS: The index maintains three separate indexes internally and coordinates search across them. When searching, it follows this flow: 1. Apply metadata filters first (if any) to get candidate document IDs 2. Pass candidate IDs to vector and/or text search for relevance ranking 3. Combine results from multiple search modes using score aggregation

SEARCH MODES: - Vector-only: Semantic similarity search using embeddings - Text-only: Keyword-based BM25 search - Metadata-only: Pure filtering without ranking - Hybrid: Combine any or all of the above with score aggregation

WHEN TO USE: Use HybridSearchIndex when: 1. You need to combine multiple search modalities 2. You want to filter by metadata before expensive vector search 3. You need both semantic and keyword-based search 4. You want a simple unified API instead of managing multiple indexes

Package comet implements a k-Nearest Neighbors (kNN) IVF index for similarity search.

WHAT IS AN IVF INDEX? IVF (Inverted File Index) is a partitioning-based approximate nearest neighbor search algorithm. It divides the vector space into Voronoi cells using k-means clustering, then searches only the nearest cells instead of scanning all vectors.

HOW IVF WORKS: Training Phase: 1. Run k-means on training vectors to learn nlist cluster centroids 2. These centroids define Voronoi partitions of the vector space

Indexing Phase: 1. For each vector, find its nearest centroid 2. Add the vector to that centroid's inverted list

Search Phase: 1. Find the nprobe nearest centroids to the query vector 2. Search only the vectors in those nprobe inverted lists 3. Return the top-k nearest neighbors from candidates

TIME COMPLEXITY:

  • Training: O(iterations × nlist × n × dim) - k-means clustering
  • Add: O(nlist × dim) - find nearest centroid
  • Search: O(nprobe × (nlist/nprobe) × dim + nprobe × (n/nlist) × dim) ≈ O(nprobe × dim + (nprobe/nlist) × n × dim)

MEMORY REQUIREMENTS: - Vectors: 4 × n × dim bytes (stored as-is) - Centroids: 4 × nlist × dim bytes - Lists: negligible overhead (just pointers) - Total: ~4 × (n + nlist) × dim bytes

ACCURACY VS SPEED TRADEOFF: - nprobe = 1: Fastest, lowest recall (~30-50%) - nprobe = sqrt(nlist): Good balance (~70-90% recall) - nprobe = nlist: Same as flat search (100% recall)

CHOOSING NLIST: Rule of thumb: nlist = sqrt(n) or nlist = 4*sqrt(n) - For 1M vectors: nlist = 1,000 to 4,000 - For 100K vectors: nlist = 316 to 1,264 - For 10K vectors: nlist = 100 to 400

WHEN TO USE IVF: Use IVF when: 1. Dataset is large (>10K vectors) 2. You can tolerate ~90-95% recall (not 100%) 3. You want 10-100x speedup over flat search 4. Memory usage is not a primary concern

DON'T use IVF when: 1. Dataset is small (<10K vectors) - use flat index 2. You need 100% recall - use flat index 3. Memory is very limited - use PQ or IVFPQ

Package comet implements IVFPQ (Inverted File with Product Quantization).

WHAT IS IVFPQ? IVFPQ combines IVF (scope reduction) with PQ (compression) to create one of the most powerful similarity search algorithms. It's the workhorse of large-scale vector search systems.

RESIDUAL VECTORS IVFPQ encodes RESIDUALS (vector - centroid) instead of original vectors. This dramatically improves compression quality because:

  • Residuals are centered near 0 (low variance)
  • Single set of codebooks works for all clusters
  • Better approximation than encoding original vectors

PERFORMANCE:

  • Speed: 10-100x faster than flat (from IVF)
  • Memory: 100-500x compression (from PQ on residuals)
  • Accuracy: 85-95% recall with proper tuning

TIME COMPLEXITY:

  • Training: O(IVF_kmeans + PQ_kmeans)
  • Add: O(nlist × dim + M × K × dsub)
  • Search: O(nlist + nprobe × (M × K × dsub + candidates × M))

Package comet implements a metadata filtering index for vector search.

WHAT IS A METADATA INDEX? A metadata index enables fast filtering of documents based on structured metadata attributes before performing expensive vector similarity searches. This dramatically improves search performance by reducing the candidate set.

HOW IT WORKS: The index uses two specialized data structures: 1. Roaring Bitmaps: For categorical fields (strings, booleans)

  • Memory-efficient compressed bitmaps
  • Fast set operations (AND, OR, NOT)

2. Bit-Sliced Index (BSI): For numeric fields (integers, floats)

  • Enables range queries without scanning all documents
  • Fast comparison operations (>, <, =, BETWEEN)

QUERY TYPES: - Equality: field = value - Inequality: field != value - Comparisons: field > value, field >= value, field < value, field <= value - Range: field BETWEEN min AND max - Set membership: field IN (val1, val2, val3) - Set exclusion: field NOT IN (val1, val2) - Existence: field EXISTS, field NOT EXISTS

TIME COMPLEXITY:

  • Add: O(m) where m is the number of metadata fields
  • Query: O(f × log(n)) where f is filter count, n is documents
  • Remove: O(m) where m is the number of metadata fields

MEMORY REQUIREMENTS: - Roaring bitmaps: Highly compressed, typically 1-10% of uncompressed size - BSI: ~64 bits per numeric value (compressed with roaring) - Much more efficient than traditional B-tree indexes for high-cardinality data

GUARANTEES & TRADE-OFFS: ✓ Pros:

  • Extremely fast filtering (microseconds for millions of documents)
  • Memory efficient with compression
  • Supports complex boolean queries
  • Thread-safe for concurrent use

✗ Cons:

  • Only supports exact matches and ranges (no fuzzy matching)
  • Floats converted to integers (precision loss)
  • Updates require full document replacement

WHEN TO USE: Use metadata index when: 1. Pre-filtering documents before vector search 2. Need to filter by structured attributes (price, date, category, etc.) 3. Working with large datasets (100K+ documents) 4. Need sub-millisecond filter performance

Package comet implements Product Quantization (PQ) for similarity search.

WHAT IS PRODUCT QUANTIZATION? PQ is a lossy compression technique that dramatically reduces memory usage for vector storage while enabling approximate similarity search. It achieves compression ratios of 10-500x by dividing vectors into subspaces and quantizing each independently.

THE CORE IDEA - DIVIDE AND COMPRESS: Instead of storing full high-dimensional vectors: 1. Divide each vector into M equal-sized subvectors (subspaces) 2. Learn a codebook of K centroids for each subspace via k-means 3. Encode each subvector with the ID of its nearest centroid 4. Store only these compact codes instead of original vectors

COMPRESSION EXAMPLE: Original: 768 dims × 4 bytes = 3,072 bytes PQ (M=8, K=256): 8 subspaces × 1 byte = 8 bytes Compression: 384x smaller!

TIME COMPLEXITY:

  • Training: O(M × iterations × K × n × dsub) where dsub = dim/M
  • Add: O(M × K × dsub) per vector
  • Search: O(M × K × dsub + n × M) - table build + lookups

WHEN TO USE PQ: - Dataset too large for RAM - Can tolerate 85-95% recall - L2 or inner product metric - Want massive compression

Index

Constants

View Source
const (
	// K1 controls term frequency saturation (typical range: 1.2-2.0)
	K1 = 1.2
	// B controls document length normalization (0 = no normalization, 1 = full normalization)
	B = 0.75
)

BM25 parameters for ranking

View Source
const (
	// UnassignedCluster indicates a vector hasn't been assigned to any cluster yet
	UnassignedCluster = -1
)

Variables

View Source
var (
	// DefaultMaxIter is the default maximum number of iterations for k-means clustering.
	DefaultMaxIter = 20
)
View Source
var ErrUnknownDistanceKind = errors.New("unknown distance kind")

ErrUnknownDistanceKind is returned when an unknown distance kind is provided to NewDistance.

View Source
var ErrZeroVector = errors.New("zero vector not allowed for this metric")

ErrZeroVector is returned when a zero vector is provided for a metric that doesn't support it.

Functions

func Autocut

func Autocut(yValues []float32, cutOff int) int

Autocut determines optimal cutoff point in a score distribution.

It analyzes the normalized difference between actual scores and ideal linear distribution to find local maxima (extrema). Returns the index before the Nth extremum where N is the cutOff parameter.

Parameters:

  • yValues: slice of score values (typically distances or similarity scores)
  • cutOff: number of extrema to encounter before cutting

Returns the index at which to cut the results.

func AutocutResults

func AutocutResults[T Result](results []T, cutoff int) []T

AutocutResults applies autocut algorithm to determine optimal result cutoff. This generic function works with any result type that implements the Result interface.

It extracts scores from the result slice and uses the Autocut algorithm to find the optimal cutoff point based on the score distribution.

Type parameter T must be a type that implements the Result interface.

Parameters:

  • results: slice of results implementing Result interface
  • cutoff: number of extrema to find before cutting (-1 disables autocut)

Returns the sliced results up to the autocut point. If cutoff is -1, returns results unchanged (no-op).

Usage:

vectorResults = AutocutResults(vectorResults, 1)  // Apply autocut with 1 extremum
textResults = AutocutResults(textResults, -1)     // No-op, returns all results

func CalculatePQParams

func CalculatePQParams(dim int) (M int, Nbits int)

CalculatePQParams returns recommended PQ parameters for a given dimension. A neat utility function to get the recommended PQ parameters for a given dimension. Returns:

  • M: Number of subquantizers (subspaces) that divides dim evenly
  • Nbits: Bits per PQ code (default: 8, giving K=256 centroids per subspace)

func DefaultHNSWConfig

func DefaultHNSWConfig() (m, efConstruction, efSearch int)

DefaultHNSWConfig returns recommended default configuration parameters.

Returns:

  • m: connections per layer (16) - layer 0 uses 2*M connections
  • efConstruction: candidate list size during construction (200)
  • efSearch: candidate list size during search (200)

func FindNearestCentroidIndex

func FindNearestCentroidIndex(v []float32, centroids [][]float32, distance Distance) int

FindNearestCentroidIndex finds the index of the nearest centroid to a given vector.

This is a utility function that performs a linear search through all centroids to find the one with minimum distance to the query vector.

Parameters:

  • v: Query vector to find nearest centroid for
  • centroids: Slice of centroid vectors to search through
  • distance: Distance metric to use for comparison

Returns:

  • int: Index of the nearest centroid (0 to len(centroids)-1)

Time Complexity: O(k × dim) where k = len(centroids)

func KMeans

func KMeans(vectors [][]float32, k int, distance Distance, maxIter int) (centroids [][]float32, vectorToClusterMapping []int)

KMeans performs k-means clustering to learn cluster centroids.

K-MEANS CLUSTERING ALGORITHM

K-means partitions data into k clusters by iteratively refining cluster assignments and centroid positions. Each cluster is represented by its centroid (center point).

Algorithm Steps:

  1. INITIALIZATION: Select k initial centroid positions (uniform spacing)
  2. ASSIGNMENT: Assign each vector to its nearest centroid
  3. UPDATE: Recompute centroids as the mean of assigned vectors
  4. REPEAT: Steps 2-3 until convergence or max iterations

CONVERGENCE: The algorithm converges when assignments stop changing, meaning all vectors are assigned to their nearest centroid and centroids are at the center of their clusters. Typical convergence: 5-20 iterations for most datasets.

VORONOI PARTITIONS: K-means naturally creates Voronoi partitions where each cluster forms a Voronoi cell. Vectors in each cluster are guaranteed to be closer to their centroid than to any other centroid.

TIME COMPLEXITY: O(iterations × k × n × dim) where:

  • iterations: number of iterations until convergence (typically 10-20)
  • k: number of clusters
  • n: number of training vectors
  • dim: dimensionality

For 100K training vectors, 768 dims, k=316, 10 iterations:

  • About 243 billion floating point operations
  • Takes seconds to minutes depending on hardware

Parameters:

  • vectors: Training vectors to cluster (each is []float32)
  • k: Number of clusters to create
  • distance: Distance metric to use for computing distances
  • maxIter: Maximum number of k-means iterations (typical: 20-100)

Returns:

  • [][]float32: k learned centroids that define the clusters
  • []int: cluster assignments for each input vector (vector i -> cluster assignments[i])

func KMeansSubspace

func KMeansSubspace(vectors [][]float32, k int, maxIter int) (centroids [][]float32, vectorToClusterMapping []int)

KMeansSubspace performs k-means clustering on subspace vectors for codebook learning.

K-MEANS FOR SUBSPACE CLUSTERING

This function learns k centroids that best represent the distribution of vectors in a lower-dimensional subspace. Commonly used for Product Quantization (PQ) codebook learning and other subspace quantization methods.

SUBSPACE CLUSTERING ADVANTAGES:

  • Faster: fewer dimensions to process
  • More effective: less curse of dimensionality
  • Parallelizable: independent across subspaces

ALGORITHM:

  1. INITIALIZATION: Choose k initial centroids using uniform sampling
  2. ASSIGNMENT: Assign each subvector to nearest centroid (using squared L2 distance)
  3. UPDATE: Recompute centroids as mean of assigned subvectors
  4. REPEAT: Steps 2-3 until convergence or max iterations

CONVERGENCE: Converges when no assignments change. Typically takes 3-10 iterations for most subspaces due to lower dimensionality providing more stability.

DISTANCE METRIC: Uses squared L2 distance (Euclidean) - standard for:

  • Product Quantization
  • Vector quantization
  • Codebook learning

TIME COMPLEXITY: O(iterations × k × n × dsub) where:

  • iterations: number of k-means iterations (typically 5-20)
  • k: number of centroids (often 256 = 2^8)
  • n: number of training subvectors
  • dsub: subspace dimension (e.g., 768/8 = 96)

For 100K training vectors, dsub=96, k=256, 10 iterations:

  • About 2.5 billion floating point operations per subspace
  • Takes milliseconds to seconds per subspace

Parameters:

  • vectors: Subspace vectors to cluster (all from same subspace)
  • k: Number of centroids to learn (typically 256 = 2^8 bits)
  • maxIter: Maximum k-means iterations (typical: 20-50)

Returns:

  • [][]float32: k learned centroids for this subspace (the codebook)
  • []int: cluster assignments for each input subvector

func LimitResults

func LimitResults[T Result](results []T, k int) []T

LimitResults applies k-limiting to any result slice that implements the Result interface. This generic function works with both VectorResult and TextResult.

Type parameter T must be a type that implements the Result interface.

Usage:

vectorResults := LimitResults(vectorResults, requestedK)
textResults := LimitResults(textResults, requestedK)

func Norm

func Norm(v []float32) float32

Norm computes the L2 norm (Euclidean length/magnitude) of a vector. This represents the "length" of the vector from the origin in n-dimensional space.

The norm is useful for:

  • Measuring vector magnitude
  • Normalizing vectors to unit length
  • Computing distances and similarities

Formula: sqrt(sum(v[i]^2))

Example:

v := []float32{3, 4}
length := Norm(v)  // Returns 5.0 (3²+4² = 25, sqrt(25) = 5)

Time complexity: O(n) where n is the vector dimension

func Normalize

func Normalize(v []float32) []float32

Normalize returns a new vector with the same direction as v but with unit length (magnitude = 1). The original vector is not modified.

Normalization is essential for:

  • Cosine similarity calculations (removes magnitude, keeps direction)
  • Comparing vectors by direction only
  • Machine learning feature scaling
  • Unit vector representations

Special case:

  • If the input is a zero vector (all elements are 0), returns the zero vector unchanged to avoid division by zero and NaN values

Formula: result = v / ||v|| where ||v|| is the L2 norm

Example:

v := []float32{3, 4}
unit := Normalize(v)     // Returns [0.6, 0.8] (magnitude = 1)
zero := []float32{0, 0}
safe := Normalize(zero)  // Returns [0, 0] (safely handles zero vector)

Time complexity: O(n) where n is the vector dimension

func NormalizeInPlace

func NormalizeInPlace(v []float32)

NormalizeInPlace normalizes the vector to unit length in-place, modifying the original vector. This is more memory-efficient than Normalize() as it doesn't allocate a new vector.

Use this when:

  • You don't need to keep the original vector
  • Memory efficiency is important
  • Processing large batches of vectors

Special case:

  • If the input is a zero vector (all elements are 0), the vector remains unchanged to avoid division by zero and NaN values

Formula: v[i] = v[i] / ||v|| for all i

Example:

v := []float32{3, 4}
NormalizeInPlace(v)      // v is now [0.6, 0.8] (magnitude = 1)

zero := []float32{0, 0}
NormalizeInPlace(zero)   // zero remains [0, 0] (safely handles zero vector)

Time complexity: O(n) where n is the vector dimension

func ReturnDocumentFilter

func ReturnDocumentFilter(filter *DocumentFilter)

ReturnDocumentFilter returns a document filter to the pool for reuse. This should be called after the filter is no longer needed to reduce allocations. Do not use the filter after calling this method.

func Scale

func Scale(v []float32, scalar float32) []float32

Scale returns a new vector with all elements multiplied by the given scalar. The original vector is not modified.

Use cases:

  • Scaling vectors to a desired magnitude
  • Implementing weighted vectors
  • Normalizing vectors (when combined with Norm)

Formula: result[i] = v[i] * scalar

Example:

v := []float32{1, 2, 3}
doubled := Scale(v, 2.0)        // Returns [2, 4, 6]
inverted := Scale(v, -1.0)      // Returns [-1, -2, -3]
normalized := Scale(v, 1.0/Norm(v))  // Returns unit vector

Time complexity: O(n) where n is the vector dimension

Types

type BM25SearchIndex

type BM25SearchIndex struct {
	// contains filtered or unexported fields
}

BM25SearchIndex is a full-text search index that uses BM25 scoring for relevance ranking. It maintains an inverted index using roaring bitmaps for efficient storage and retrieval. All methods are safe for concurrent use by multiple goroutines.

The index stores only document IDs and tokens for efficient memory usage. Applications should maintain their own document store and use the returned document IDs to retrieve full text.

func NewBM25SearchIndex

func NewBM25SearchIndex() *BM25SearchIndex

NewBM25SearchIndex creates and returns a new empty BM25SearchIndex. The index stores only document IDs and tokens for efficient memory usage. Applications should maintain their own document store and use the returned document IDs to retrieve full text.

Returns:

  • *BM25SearchIndex: A new empty index ready to accept documents

Example:

idx := NewBM25SearchIndex()
idx.Add(1, "the quick brown fox")
results := idx.NewSearch().WithQuery("fox").WithK(10).Execute()

func (*BM25SearchIndex) Add

func (ix *BM25SearchIndex) Add(id uint32, text string) error

Add indexes a document with the given docID and text. If a document with the same ID already exists, it will be replaced. This method is safe for concurrent use.

Note: The index does NOT store the original text, only tokens for efficient memory usage.

Parameters:

  • id: Document ID (must be unique)
  • text: Document text to index

Returns:

  • error: Always returns nil (exists to satisfy TextIndex interface)

Time Complexity: O(m) where m is the number of tokens in the text

Thread-safety: Acquires exclusive lock

Example:

err := idx.Add(1, "the quick brown fox jumps over the lazy dog")

func (*BM25SearchIndex) Flush

func (ix *BM25SearchIndex) Flush() error

Flush performs hard delete of soft-deleted documents.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted documents are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Physically removes all soft-deleted documents from all data structures 2. Updates inverted index (postings), term frequencies, document lengths 3. Reclaims memory occupied by deleted documents 4. Clears the deleted documents bitmap

COST: O(d × m) where d = number of deleted docs, m = avg tokens per doc

Thread-safety: Acquires exclusive write lock

Returns:

  • error: Always returns nil

func (*BM25SearchIndex) NewSearch

func (ix *BM25SearchIndex) NewSearch() TextSearch

NewSearch creates a new search builder for this index.

Returns:

  • TextSearch: A new search builder ready to be configured

Example:

results, err := idx.NewSearch().
	WithQuery("quick brown").
	WithK(5).
	Execute()

func (*BM25SearchIndex) ReadFrom

func (ix *BM25SearchIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a BM25SearchIndex from an io.Reader.

This method reconstructs a BM25SearchIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("bm25_index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("bm25_index.bin")
idx2 := NewBM25SearchIndex()
idx2.ReadFrom(file)
file.Close()

func (*BM25SearchIndex) Remove

func (ix *BM25SearchIndex) Remove(id uint32) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if document exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from all data structures (expensive O(m)), we mark as deleted in roaring bitmap. Deleted documents are:

  • Skipped during search
  • Still in internal data structures
  • Not counted as active documents

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

  • id: Document ID to remove

Returns:

  • error: Always returns nil (exists to satisfy TextIndex interface)

Time Complexity: O(log n) for bitmap operation (vs O(m) for hard delete)

Thread-safety: Uses read lock for validation, write lock for modification

func (*BM25SearchIndex) WriteTo

func (ix *BM25SearchIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the BM25SearchIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted documents are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "BM25" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Statistics:

  • numDocs (4 bytes)
  • totalTokens (4 bytes)
  • avgDocLen (8 bytes as float64)

4. Document lengths (map[uint32]int):

  • Count (4 bytes)
  • For each entry: docID (4 bytes) + length (4 bytes)

5. Document tokens (map[uint32][]string):

  • Count (4 bytes)
  • For each entry:
  • docID (4 bytes)
  • Token count (4 bytes)
  • For each token: token length (4 bytes) + token bytes

6. Postings (map[string]*roaring.Bitmap):

  • Count (4 bytes)
  • For each entry:
  • Term length (4 bytes) + term bytes
  • Bitmap size (4 bytes) + bitmap bytes

7. Term frequencies (map[string]map[uint32]int):

  • Count (4 bytes)
  • For each term:
  • Term length (4 bytes) + term bytes
  • Doc count (4 bytes)
  • For each doc: docID (4 bytes) + frequency (4 bytes)

8. Deleted docs bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type CompressedVector

type CompressedVector struct {
	// Node is the original VectorNode
	Node VectorNode

	// Code is the PQ-encoded residual (M bytes)
	Code []uint8
}

CompressedVector represents a PQ-encoded residual in an inverted list.

type Distance

type Distance interface {
	// Calculate computes the distance between two vectors a and b.
	// The vectors must have the same dimensionality.
	// Returns a float32 representing the distance (lower values = more similar).
	Calculate(a, b []float32) float32

	// CalculateBatch computes distances from multiple query vectors to a single target vector.
	// This is more efficient than calling Calculate multiple times as it can optimize
	// computations (e.g., precomputing norms for cosine distance).
	//
	// Parameters:
	//   - queries: slice of query vectors (each vector is []float32)
	//   - target: single target vector to compare against
	//
	// Returns:
	//   - slice of distances where result[i] is the distance from queries[i] to target
	//
	// All vectors (queries and target) must have the same dimensionality.
	CalculateBatch(queries [][]float32, target []float32) []float32

	// PreprocessInPlace preprocesses the target vector in-place for the distance metric.
	// For cosine distance, this normalizes the vector to unit length.
	// For euclidean distance, this is a no-op.
	// Returns an error if the vector is invalid for this metric (e.g., zero vector for cosine).
	PreprocessInPlace(target []float32) error

	// Preprocess preprocesses the target vector for the distance metric, returning a new vector.
	// For cosine distance, this returns a normalized copy.
	// For euclidean distance, this returns the original vector unchanged.
	// Returns an error if the vector is invalid for this metric (e.g., zero vector for cosine).
	Preprocess(target []float32) ([]float32, error)
}

Distance is the interface for computing distances between vectors. Implementations provide different distance metrics for vector similarity search.

func NewDistance

func NewDistance(t DistanceKind) (Distance, error)

NewDistance returns a singleton Distance implementation for the specified metric type. The returned instances are stateless and safe for concurrent use across goroutines. Returns ErrUnknownDistanceKind if the distance kind is not recognized.

Example:

dist, err := NewDistance(Euclidean)
if err != nil {
    log.Fatal(err)
}
distance := dist.Calculate([]float32{1, 2, 3}, []float32{4, 5, 6})

type DistanceKind

type DistanceKind string

DistanceKind represents the type of distance metric to use for vector comparisons. Different distance metrics are suitable for different use cases: - Euclidean (L2): Measures absolute spatial distance between points - L2Squared: Squared Euclidean distance (faster, preserves ordering) - Cosine: Measures angular similarity, independent of magnitude

const (
	// Euclidean (L2) distance measures the straight-line distance between two points.
	// Use this when the magnitude of vectors matters.
	// Formula: sqrt(sum((a[i] - b[i])^2))
	Euclidean DistanceKind = "l2"

	// L2Squared (squared Euclidean) distance measures the squared distance between two points.
	// This is faster than L2 as it avoids the sqrt operation.
	// Use this when you only need to compare distances (ordering is preserved).
	// Formula: sum((a[i] - b[i])^2)
	L2Squared DistanceKind = "l2_squared"

	// Cosine distance measures the angular difference between vectors (1 - cosine similarity).
	// Use this when you care about direction but not magnitude (e.g., text embeddings).
	// Formula: 1 - (dot(a,b) / (||a|| * ||b||))
	// Range: [0, 2] where 0 = identical direction, 1 = orthogonal, 2 = opposite
	Cosine DistanceKind = "cosine"
)

type DocumentFilter

type DocumentFilter struct {
	// contains filtered or unexported fields
}

DocumentFilter provides efficient document ID filtering for vector search. It uses roaring bitmaps for fast membership testing during search operations.

func NewDocumentFilter

func NewDocumentFilter(documentIDs []uint32) *DocumentFilter

NewDocumentFilter creates a new document filter from a list of document IDs. If the document IDs list is empty, returns nil (no filtering). The filter should be returned to the pool using ReturnDocumentFilter when done.

func (*DocumentFilter) Count

func (f *DocumentFilter) Count() uint64

Count returns the number of eligible documents. Returns 0 if filter is nil (meaning all documents are eligible).

func (*DocumentFilter) IsEligible

func (f *DocumentFilter) IsEligible(docID uint32) bool

IsEligible checks if a document ID is eligible for search. If filter is nil, all documents are eligible. Otherwise, checks if the document ID exists in the filter bitmap.

func (*DocumentFilter) IsEmpty

func (f *DocumentFilter) IsEmpty() bool

IsEmpty returns true if no documents are eligible. Returns false if filter is nil (all documents eligible).

func (*DocumentFilter) ShouldSkip

func (f *DocumentFilter) ShouldSkip(docID uint32) bool

ShouldSkip returns true if the document should be skipped (not eligible). This is a convenience method for use in loops with continue statements.

type Filter

type Filter struct {
	Field    string
	Operator Operator
	Value    interface{}
	Value2   interface{} // Used for range queries
}

Filter represents a query filter

func AnyOf

func AnyOf(field string, values ...interface{}) Filter

AnyOf creates an OR group for multiple values on the same field (alias for In)

func Between

func Between(field string, min, max interface{}) Filter

Between is an alias for Range

func Eq

func Eq(field string, value interface{}) Filter

Eq creates an equality filter

func Exists

func Exists(field string) Filter

Exists checks if a field exists (has any value)

func Gt

func Gt(field string, value interface{}) Filter

Gt creates a greater-than filter

func Gte

func Gte(field string, value interface{}) Filter

Gte creates a greater-than-or-equal filter

func In

func In(field string, values ...interface{}) Filter

In creates an in-set filter

func IsNotNull

func IsNotNull(field string) Filter

IsNotNull checks if a field is not null/exists

func IsNull

func IsNull(field string) Filter

IsNull checks if a field is null/doesn't exist

func Lt

func Lt(field string, value interface{}) Filter

Lt creates a less-than filter

func Lte

func Lte(field string, value interface{}) Filter

Lte creates a less-than-or-equal filter

func Ne

func Ne(field string, value interface{}) Filter

Ne creates a not-equal filter

func NoneOf

func NoneOf(field string, values ...interface{}) Filter

NoneOf creates a NOT IN filter (alias for NotIn)

func Not

func Not(filter Filter) Filter

Not creates a negation filter by inverting the operator

func NotExists

func NotExists(field string) Filter

NotExists checks if a field doesn't exist

func NotIn

func NotIn(field string, values ...interface{}) Filter

NotIn creates a not-in-set filter

func Range

func Range(field string, min, max interface{}) Filter

Range creates a range filter [min, max]

type FilterGroup

type FilterGroup struct {
	Filters []Filter
	Logic   LogicOperator
}

FilterGroup represents a group of filters with a logical operator

type FlatIndex

type FlatIndex struct {
	// contains filtered or unexported fields
}

FlatIndex represents a flat (brute-force) kNN index.

This is the simplest form of similarity search index where vectors are stored with metric-appropriate preprocessing applied. For cosine distance, vectors are normalized to unit length. For euclidean distance, vectors are stored as-is.

Thread-safety: This index is safe for concurrent use through a read-write mutex. Multiple readers can search simultaneously, but Add operations are exclusive.

func NewFlatIndex

func NewFlatIndex(dim int, distanceKind DistanceKind) (*FlatIndex, error)

NewFlatIndex creates a new flat (kNN) index with the specified dimensionality and distance metric.

This is the only "training" step for a flat index, which simply initializes the data structure. Unlike other indexes (IVF, HNSW, PQ), there's no actual training phase or preprocessing required.

Parameters:

  • dim: The dimensionality of vectors that will be stored. All vectors must have this exact dimension. For example, if you're using 384-dimensional embeddings from a sentence transformer model, dim should be 384.
  • distanceKind: The distance metric to use for similarity comparison.

Returns:

  • *FlatIndex: A new flat index ready to accept vectors via Add()
  • error: Returns error if dim <= 0 or distance kind is invalid

Time Complexity: O(1) Memory: O(1) initially, grows to O(n*d) where n is number of vectors added

Example:

idx, err := NewFlatIndex(384, Cosine)  // For sentence embeddings
if err != nil { log.Fatal(err) }

func (*FlatIndex) Add

func (idx *FlatIndex) Add(vector VectorNode) error

Add adds a vector to the flat index.

Vectors are preprocessed according to the distance metric:

  • For cosine distance: vectors are normalized to unit length in-place
  • For euclidean distance: vectors are stored as-is (preprocessing is a no-op)

This preprocessing happens once during insertion, making all subsequent distance calculations more efficient by eliminating redundant norm computations and divisions.

Parameters:

  • vector: Vector to add. Must have dimensionality matching idx.dim

Returns:

  • error: Returns error if vector has wrong dimension or preprocessing fails (e.g., zero vector for cosine)

Time Complexity: O(m) where m = dimensionality

  • For L2 metric: Just appends vector (preprocessing is no-op)
  • For cosine metric: Normalizes vector then appends

Thread-safety: Acquires exclusive lock, blocking all searches during addition

func (*FlatIndex) Dimensions

func (idx *FlatIndex) Dimensions() int

Dimensions returns the dimensionality of vectors stored in this index.

Returns:

  • int: The dimensionality (number of components in each vector)

func (*FlatIndex) DistanceKind

func (idx *FlatIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric used by this index.

Returns:

  • DistanceKind: The distance metric (L2, Cosine, or Dot)

func (*FlatIndex) Flush

func (idx *FlatIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted nodes are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from the vectors slice 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = number of vectors in the index

Thread-safety: Acquires exclusive write lock

func (*FlatIndex) Kind

func (idx *FlatIndex) Kind() VectorIndexKind

Kind returns the type of this index.

Returns:

  • VectorIndexKind: Always returns FlatIndex

func (*FlatIndex) NewSearch

func (idx *FlatIndex) NewSearch() VectorSearch

NewSearch creates a new search builder for this index.

Returns:

  • VectorSearch: A new search builder ready to be configured

func (*FlatIndex) ReadFrom

func (idx *FlatIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a FlatIndex from an io.Reader.

This method reconstructs a FlatIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewFlatIndex(384, Cosine)
idx2.ReadFrom(file)
file.Close()

func (*FlatIndex) Remove

func (idx *FlatIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from the vectors slice (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

  • Skipped during search
  • Still in vectors slice
  • Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

  • vector: Vector to remove (only the ID field is used for matching)

Returns:

  • error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*FlatIndex) Train

func (idx *FlatIndex) Train(vectors []VectorNode) error

Train is a no-op for flat index since vectors are stored in memory. This method exists to satisfy the VectorIndex interface.

Returns:

  • error: Always returns nil

func (*FlatIndex) Trained

func (idx *FlatIndex) Trained() bool

Trained returns true if the index has been trained This is a no-op for flat index since it doesn't need training

func (*FlatIndex) WriteTo

func (idx *FlatIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the FlatIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "FLAT" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Dimensionality (4 bytes) 4. Distance kind length (4 bytes) + distance kind string 5. Number of vectors (4 bytes) 6. For each vector:

  • Vector ID (4 bytes)
  • Vector dimension (4 bytes)
  • Vector data (dim * 4 bytes as float32)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type FullPrecisionQuantizer

type FullPrecisionQuantizer struct{}

FullPrecisionQuantizer stores vectors in full 32-bit floating point.

Memory: 4 bytes per dimension Accuracy: Full IEEE 754 single precision Training: Not required

This is essentially a no-op quantizer that provides the interface for consistency while maintaining full precision.

func (*FullPrecisionQuantizer) Dequantize

func (q *FullPrecisionQuantizer) Dequantize(stored any) ([]float32, error)

func (*FullPrecisionQuantizer) IsTrained

func (q *FullPrecisionQuantizer) IsTrained() bool

func (*FullPrecisionQuantizer) Quantize

func (q *FullPrecisionQuantizer) Quantize(vector []float32) (any, error)

func (*FullPrecisionQuantizer) Train

func (q *FullPrecisionQuantizer) Train(vectors [][]float32)

func (*FullPrecisionQuantizer) Type

type Fusion

type Fusion interface {
	// Kind returns the kind of fusion strategy
	Kind() FusionKind

	// Combine takes separate result sets from vector and text search and
	// combines them into a single ranked list
	//
	// Parameters:
	//   - vectorResults: Results from vector search (map of docID -> score)
	//   - textResults: Results from text search (map of docID -> score)
	//
	// Returns:
	//   - Combined results as map of docID -> score
	Combine(vectorResults map[uint32]float64, textResults map[uint32]float64) map[uint32]float64
}

Fusion defines how to combine scores from different search modalities (vector search, text search) into a single ranking.

Different fusion strategies handle score normalization and combination differently: - WeightedSum: Direct weighted combination of scores - ReciprocalRank: Rank-based fusion (more robust to score scale differences) - Max/Min: Simple maximum or minimum across modalities

func DefaultFusion

func DefaultFusion() Fusion

DefaultFusion returns the default fusion strategy (WeightedSum)

func NewFusion

func NewFusion(kind FusionKind, config *FusionConfig) (Fusion, error)

NewFusion creates a new fusion strategy with the given configuration

type FusionConfig

type FusionConfig struct {
	// VectorWeight is the weight for vector search scores (used by WeightedSumFusion)
	VectorWeight float64

	// TextWeight is the weight for text search scores (used by WeightedSumFusion)
	TextWeight float64

	// K is the constant used in Reciprocal Rank Fusion (default: 60)
	// Lower k gives more weight to top-ranked items
	K float64
}

FusionConfig holds configuration for fusion strategies

func DefaultFusionConfig

func DefaultFusionConfig() *FusionConfig

DefaultFusionConfig returns the default fusion configuration

type FusionKind

type FusionKind string

FusionKind defines the type of score fusion strategy for combining results from multiple search modalities (vector, text, metadata).

const (
	// WeightedSumFusion combines scores using weighted sum
	// finalScore = vectorScore * vectorWeight + textScore * textWeight
	WeightedSumFusion FusionKind = "weighted_sum"

	// ReciprocalRankFusion (RRF) combines results based on their ranks
	// More robust to differences in score scales between modalities
	ReciprocalRankFusion FusionKind = "reciprocal_rank"

	// MaxFusion takes the maximum score across modalities
	MaxFusion FusionKind = "max"

	// MinFusion takes the minimum score across modalities
	MinFusion FusionKind = "min"
)

type HNSWIndex

type HNSWIndex struct {

	// M is connections per layer (except layer 0)
	M int
	// contains filtered or unexported fields
}

HNSWIndex implements HNSW (Hierarchical Navigable Small World).

Memory layout:

  • Vectors: n × dim × 4 bytes
  • Graph: n × M × avgLayers × 4 bytes (uint32 IDs)
  • Total: typically 2-3x raw vector size

Thread-safety: All public methods use read-write mutex.

func NewHNSWIndex

func NewHNSWIndex(dim int, distanceKind DistanceKind, m, efConstruction, efSearch int) (*HNSWIndex, error)

NewHNSWIndex creates a new HNSW index.

Creates an empty multi-layered graph. Vectors can be added immediately (no training required unlike IVF/IVFPQ).

Parameters:

  • dim: Vector dimensionality
  • distanceKind: Distance metric
  • m: Connections per layer (except layer 0 which uses 2*M). Typical: 12-48, pass 0 for default (16)
  • efConstruction: Candidate list size during construction. Higher = better graph, slower build. Typical: 100-500, pass 0 for default (200)
  • efSearch: Candidate list size during search. Higher = better recall, slower search. Pass 0 for default (200)

Returns:

  • *HNSWIndex: New empty HNSW index
  • error: Returns error if parameters invalid

func (*HNSWIndex) Add

func (idx *HNSWIndex) Add(vector VectorNode) error

Add adds a vector to the index by inserting into the graph.

CONCURRENCY OPTIMIZATION: All expensive operations (validation, level assignment, node preparation) are performed OUTSIDE the lock to minimize critical section time.

Algorithm:

  1. Assign random level (outside lock)
  2. Prepare node structure (outside lock)
  3. Critical section: graph modification
  4. Insert and connect

func (*HNSWIndex) Dimensions

func (idx *HNSWIndex) Dimensions() int

Dimensions returns vector dimensionality.

func (*HNSWIndex) DistanceKind

func (idx *HNSWIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*HNSWIndex) Flush

func (idx *HNSWIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted nodes are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Removes all edges pointing to deleted nodes 2. Deletes nodes from memory 3. Updates entry point if needed 4. Rebuilds graph structure

COST: O(n × M × L) - expensive, so batch deletions

func (*HNSWIndex) Kind

func (idx *HNSWIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*HNSWIndex) NewSearch

func (idx *HNSWIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*HNSWIndex) ReadFrom

func (idx *HNSWIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an HNSWIndex from an io.Reader.

This method reconstructs an HNSWIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("hnsw_index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("hnsw_index.bin")
m, efC, efS := DefaultHNSWConfig()
idx2, _ := NewHNSWIndex(384, Cosine, m, efC, efS)
idx2.ReadFrom(file)
file.Close()

func (*HNSWIndex) Remove

func (idx *HNSWIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing (expensive O(n × M × L)), we mark as deleted in roaring bitmap. Deleted nodes are:

  • Skipped during search
  • Still in graph structure
  • Not counted as active nodes

Call Flush() periodically for actual cleanup.

func (*HNSWIndex) SetEfSearch

func (idx *HNSWIndex) SetEfSearch(ef int)

SetEfSearch adjusts search-time candidate list size.

func (*HNSWIndex) Train

func (idx *HNSWIndex) Train(vectors []VectorNode) error

Train is a no-op for HNSW (no training required).

func (*HNSWIndex) Trained

func (idx *HNSWIndex) Trained() bool

Trained always returns true (HNSW doesn't require training).

func (*HNSWIndex) WriteTo

func (idx *HNSWIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the HNSWIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted nodes are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "HNSW" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

  • Dimensionality (4 bytes)
  • Distance kind length (4 bytes) + distance kind string
  • M (4 bytes)
  • efConstruction (4 bytes)
  • efSearch (4 bytes)
  • levelMult (8 bytes as float64)

4. Graph structure:

  • maxLevel (4 bytes)
  • entryPoint (4 bytes)

5. Number of nodes (4 bytes) 6. For each node:

  • Node ID (4 bytes)
  • Level (4 bytes)
  • Vector dimension (4 bytes)
  • Vector data (dim * 4 bytes as float32)
  • Number of edge layers (4 bytes)
  • For each layer:
  • Number of edges (4 bytes)
  • Edge IDs (n * 4 bytes)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type HalfPrecisionQuantizer

type HalfPrecisionQuantizer struct{}

HalfPrecisionQuantizer compresses vectors to 16-bit floating point.

Memory: 2 bytes per dimension (50% savings vs float32) Accuracy: IEEE 754 half precision (1 sign, 5 exp, 10 mantissa bits) Training: Not required

Trade-off: Significant memory savings with minimal accuracy loss for most use cases. Values are stored as uint16 bit representations.

func (*HalfPrecisionQuantizer) Dequantize

func (q *HalfPrecisionQuantizer) Dequantize(stored any) ([]float32, error)

func (*HalfPrecisionQuantizer) IsTrained

func (q *HalfPrecisionQuantizer) IsTrained() bool

func (*HalfPrecisionQuantizer) Quantize

func (q *HalfPrecisionQuantizer) Quantize(vector []float32) (any, error)

func (*HalfPrecisionQuantizer) Train

func (q *HalfPrecisionQuantizer) Train(vectors [][]float32)

func (*HalfPrecisionQuantizer) Type

type HybridSearch

type HybridSearch interface {
	// WithVector sets the query vector for vector search
	WithVector(query []float32) HybridSearch

	// WithText sets the query text for text search
	WithText(queries ...string) HybridSearch

	// WithMetadata sets metadata filters
	WithMetadata(filters ...Filter) HybridSearch

	// WithMetadataGroups sets complex metadata filter groups
	WithMetadataGroups(groups ...*FilterGroup) HybridSearch

	// WithK sets the number of results to return
	WithK(k int) HybridSearch

	// WithNProbes sets the number of probes for IVF-based vector indexes
	WithNProbes(nProbes int) HybridSearch

	// WithEfSearch sets the efSearch parameter for HNSW search
	WithEfSearch(efSearch int) HybridSearch

	// WithThreshold sets a score threshold for results
	WithThreshold(threshold float32) HybridSearch

	// WithScoreAggregation sets the strategy for aggregating scores
	WithScoreAggregation(kind ScoreAggregationKind) HybridSearch

	// WithCutoff sets the autocut parameter for result cutoff
	WithCutoff(cutoff int) HybridSearch

	// WithFusion sets the fusion strategy for combining vector and text scores
	WithFusion(fusion Fusion) HybridSearch

	// WithFusionKind sets the fusion strategy by kind with default config
	WithFusionKind(kind FusionKind) HybridSearch

	// Execute performs the search and returns results
	Execute() ([]HybridSearchResult, error)
}

HybridSearch encapsulates the search context for the hybrid index

type HybridSearchIndex

type HybridSearchIndex interface {
	// Add adds a document with its vector, text, and metadata to the index
	// The ID is auto-generated and returned
	Add(vector []float32, text string, metadata map[string]interface{}) (uint32, error)

	// AddWithID adds a document with a specific ID
	AddWithID(id uint32, vector []float32, text string, metadata map[string]interface{}) error

	// Remove removes a document from all indexes
	Remove(id uint32) error

	// NewSearch creates a new search builder
	NewSearch() HybridSearch

	// Train trains the vector index (required for some index types like IVF, PQ)
	Train(vectors [][]float32) error

	// Flush flushes all indexes
	Flush() error

	// VectorIndex returns the underlying vector index
	VectorIndex() VectorIndex

	// TextIndex returns the underlying text index
	TextIndex() TextIndex

	// MetadataIndex returns the underlying metadata index
	MetadataIndex() MetadataIndex

	// Serialization support
	// WriteTo writes each component to a separate writer for maximum flexibility
	// Usage: WriteTo(hybridWriter, vectorWriter, textWriter, metadataWriter)
	WriteTo(hybridWriter, vectorWriter, textWriter, metadataWriter io.Writer) error

	// ReadFrom reads from a combined reader (use io.MultiReader to combine separate readers)
	io.ReaderFrom
}

HybridSearchIndex provides a unified interface for multi-modal search

func NewHybridSearchIndex

func NewHybridSearchIndex(vectorIndex VectorIndex, textIndex TextIndex, metadataIndex MetadataIndex) HybridSearchIndex

NewHybridSearchIndex creates a new hybrid search index

Parameters:

  • vectorIndex: The vector index to use (can be nil if not using vector search)
  • textIndex: The text index to use (can be nil if not using text search)
  • metadataIndex: The metadata index to use (can be nil if not using metadata search)

Returns:

  • HybridSearchIndex: A new hybrid search index

Example:

vecIdx, _ := NewFlatIndex(384, Cosine)
txtIdx := NewBM25SearchIndex()
metaIdx := NewRoaringMetadataIndex()
idx := NewHybridSearchIndex(vecIdx, txtIdx, metaIdx)

type HybridSearchResult

type HybridSearchResult struct {
	ID uint32 // Document ID
	// Score is float64 (not float32) because:
	// 1. Fusion combines scores from multiple sources (vector + text) with arithmetic operations
	// 2. RRF uses fractional calculations (1.0 / (k + rank)) requiring higher precision
	// 3. Better numerical stability when sorting results with very similar scores
	// 4. Prevents cumulative rounding errors from weighted sums and aggregations
	Score float64 // Combined relevance score
}

HybridSearchResult represents a search result from the hybrid index

func (HybridSearchResult) GetId

func (r HybridSearchResult) GetId() uint32

func (HybridSearchResult) GetScore

func (r HybridSearchResult) GetScore() float32

type IVFIndex

type IVFIndex struct {
	// contains filtered or unexported fields
}

IVFIndex represents an Inverted File index for approximate nearest neighbor search.

The index partitions the vector space into nlist Voronoi cells using k-means clustering. Each cell has an inverted list containing vectors assigned to that cell's centroid. During search, only the nearest nprobe cells are examined, providing a speed-accuracy tradeoff.

Thread-safety: This index is safe for concurrent use through a read-write mutex. Multiple readers can search simultaneously, but training and adding are exclusive.

func NewIVFIndex

func NewIVFIndex(dim int, nlist int, distanceKind DistanceKind) (*IVFIndex, error)

NewIVFIndex creates a new IVF index with the specified parameters.

The index must be trained with representative vectors before it can be used. Training learns the Voronoi partitions (cluster centroids) via k-means clustering.

Parameters:

  • dim: The dimensionality of vectors. All vectors must have this exact dimension.
  • nlist: Number of Voronoi cells (clusters). Typical: sqrt(n) or 4*sqrt(n).
  • distanceKind: The distance metric for similarity comparison.

Returns:

  • *IVFIndex: A new IVF index ready to be trained
  • error: Returns error if parameters are invalid

Choosing nlist:

  • Too small: Poor speed improvement (searching too many vectors per cell)
  • Too large: Poor recall (query's neighbors scattered across many cells)
  • Rule of thumb: nlist = sqrt(expected_dataset_size)

Example:

idx, err := NewIVFIndex(384, 316, Cosine)  // For ~100K vectors
if err != nil { log.Fatal(err) }

// Must train before use
err = idx.Train(trainingVectors)

func (*IVFIndex) Add

func (idx *IVFIndex) Add(vector VectorNode) error

Add adds a vector to the IVF index.

The vector is assigned to the nearest centroid and added to that centroid's inverted list. This is a simple O(nlist × dim) operation.

Parameters:

  • vector: Vector to add. Must have dimensionality matching idx.dim

Returns:

  • error: Returns error if index not trained, wrong dimension, or preprocessing fails

Time Complexity: O(nlist × dim) - compute distance to all centroids

Thread-safety: Acquires exclusive lock, blocking all searches during addition

func (*IVFIndex) Dimensions

func (idx *IVFIndex) Dimensions() int

Dimensions returns the dimensionality of vectors stored in this index.

func (*IVFIndex) DistanceKind

func (idx *IVFIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric used by this index.

func (*IVFIndex) Flush

func (idx *IVFIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted nodes are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from all inverted lists 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = total number of vectors across all lists

Thread-safety: Acquires exclusive write lock

func (*IVFIndex) Kind

func (idx *IVFIndex) Kind() VectorIndexKind

Kind returns the type of this index.

func (*IVFIndex) NewSearch

func (idx *IVFIndex) NewSearch() VectorSearch

NewSearch creates a new search builder for this index.

Returns:

  • VectorSearch: A new search builder ready to be configured

func (*IVFIndex) ReadFrom

func (idx *IVFIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an IVFIndex from an io.Reader.

This method reconstructs an IVFIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewIVFIndex(384, 100, Cosine)
idx2.ReadFrom(file)
file.Close()

func (*IVFIndex) Remove

func (idx *IVFIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from inverted lists (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

  • Skipped during search
  • Still in inverted lists
  • Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

  • vector: Vector to remove (only the ID field is used for matching)

Returns:

  • error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*IVFIndex) Train

func (idx *IVFIndex) Train(vectors []VectorNode) error

Train learns the Voronoi partitions (cluster centroids) using k-means clustering.

Training is required before the index can accept vectors or perform searches. The training vectors should be representative of the full dataset distribution.

Algorithm: 1. Run k-means clustering with k=nlist on the training vectors 2. Store the learned centroids 3. Mark the index as trained

Parameters:

  • vectors: Training vectors used to learn cluster centroids. Should be at least nlist vectors, ideally 10-100x more.

Returns:

  • error: Returns error if insufficient training vectors or training fails

Training Tips:

  • Use at least max(nlist, 1000) training vectors
  • More training vectors = better centroids = better recall
  • Training vectors should represent your full dataset's distribution
  • Can use a random sample of your full dataset

Time Complexity: O(iterations × nlist × n × dim) where n = len(vectors) Typical training time: seconds to minutes depending on dataset size

Example:

// Sample 100K vectors from your dataset for training
trainingVectors := sampleVectors(allVectors, 100000)
err := idx.Train(trainingVectors)

func (*IVFIndex) Trained

func (idx *IVFIndex) Trained() bool

Trained returns true if the index has been trained

func (*IVFIndex) WriteTo

func (idx *IVFIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the IVFIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "IVFX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

  • Dimensionality (4 bytes)
  • Distance kind length (4 bytes) + distance kind string
  • nlist (4 bytes) - number of IVF clusters
  • trained (1 byte) - whether index is trained

4. Centroids (only if trained):

  • For each of nlist centroids:
  • Centroid size (4 bytes)
  • Centroid data (dim * 4 bytes as float32)

5. Number of inverted lists (4 bytes) 6. For each inverted list:

  • List size (4 bytes)
  • For each vector:
  • Vector ID (4 bytes)
  • Vector data (dim * 4 bytes as float32)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type IVFPQIndex

type IVFPQIndex struct {

	// M is number of PQ subspaces
	M int

	// Nbits is bits per PQ code
	Nbits int

	// Ksub is centroids per subspace (K = 2^Nbits)
	Ksub int
	// contains filtered or unexported fields
}

IVFPQIndex represents an IVFPQ index.

Memory layout:

  • IVF centroids: nlist × dim × 4 bytes
  • PQ codebooks: M × K × (dim/M) × 4 bytes
  • Compressed vectors: n × M bytes

func NewIVFPQIndex

func NewIVFPQIndex(dim int, distanceKind DistanceKind, nlist int, m int, nbits int) (*IVFPQIndex, error)

NewIVFPQIndex creates a new IVFPQ index.

Parameters:

  • dim: Vector dimensionality (must be divisible by M)
  • distanceKind: Distance metric
  • nlist: Number of IVF clusters (rule of thumb: sqrt(n))
  • m: Number of PQ subspaces (must divide dimension evenly)
  • nbits: Bits per PQ code (most common: 8, which gives K=256)

Returns:

  • *IVFPQIndex: New untrained IVFPQ index
  • error: Returns error if parameters invalid

func (*IVFPQIndex) Add

func (idx *IVFPQIndex) Add(vector VectorNode) error

Add encodes vectors as residuals and adds to inverted lists.

Algorithm:

  1. Find nearest IVF centroid
  2. Compute residual = vector - centroid
  3. Encode residual with PQ
  4. Store code in inverted list

Parameters:

  • vector: Vector to add

Returns:

  • error: Returns error if not trained or dimension mismatch

func (*IVFPQIndex) Dimensions

func (idx *IVFPQIndex) Dimensions() int

Dimensions returns vector dimensionality.

func (*IVFPQIndex) DistanceKind

func (idx *IVFPQIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*IVFPQIndex) Flush

func (idx *IVFPQIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted nodes are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted vectors from all inverted lists 2. Reclaims memory occupied by deleted vectors and their PQ codes 3. Clears the deleted nodes bitmap

COST: O(n) where n = total number of vectors across all lists

Thread-safety: Acquires exclusive write lock

func (*IVFPQIndex) Kind

func (idx *IVFPQIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*IVFPQIndex) NewSearch

func (idx *IVFPQIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*IVFPQIndex) ReadFrom

func (idx *IVFPQIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes an IVFPQIndex from an io.Reader.

This method reconstructs an IVFPQIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2, _ := NewIVFPQIndex(384, Cosine, 100, 8, 8)
idx2.ReadFrom(file)
file.Close()

func (*IVFPQIndex) Remove

func (idx *IVFPQIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from inverted lists (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

  • Skipped during search
  • Still in inverted lists
  • Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

  • vector: Vector to remove (only the ID field is used for matching)

Returns:

  • error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*IVFPQIndex) Train

func (idx *IVFPQIndex) Train(vectors []VectorNode) error

Train learns IVF clusters and PQ codebooks on residuals.

THE TRAINING ALGORITHM:

  1. Run k-means to create nlist IVF clusters
  2. Assign vectors to nearest centroids
  3. Compute residuals = vector - centroid
  4. Train PQ codebooks on ALL residuals together

The key innovation: Training PQ on residuals instead of original vectors. Residuals have much less variance, enabling better compression.

Parameters:

  • vectors: Training vectors (need at least nlist*10)

Returns:

  • error: Returns error if insufficient training data

func (*IVFPQIndex) Trained

func (idx *IVFPQIndex) Trained() bool

Trained returns true if the index has been trained

func (*IVFPQIndex) WriteTo

func (idx *IVFPQIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the IVFPQIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "IVPQ" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

  • Dimensionality (4 bytes)
  • Distance kind length (4 bytes) + distance kind string
  • nlist (4 bytes) - number of IVF clusters
  • M (4 bytes) - number of PQ subspaces
  • Nbits (4 bytes) - bits per PQ code
  • Ksub (4 bytes) - centroids per subspace
  • dsub (4 bytes) - subspace dimension
  • trained (1 byte) - whether index is trained

4. Centroids (only if trained):

  • For each of nlist centroids:
  • Centroid size (4 bytes)
  • Centroid data (dim * 4 bytes as float32)

5. Codebooks (only if trained):

  • For each of M subquantizers:
  • Codebook size (4 bytes)
  • Codebook data (Ksub * dsub * 4 bytes as float32)

6. Number of inverted lists (4 bytes) 7. For each inverted list:

  • List size (4 bytes)
  • For each compressed vector:
  • Vector ID (4 bytes)
  • PQ code (M bytes)

8. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type Int8Quantizer

type Int8Quantizer struct {
	// contains filtered or unexported fields
}

Int8Quantizer uses symmetric scalar quantization to compress vectors.

Memory: 1 byte per dimension (75% savings vs float32) Accuracy: Maps [-AbsMax, AbsMax] to [-127, 127] Training: REQUIRED - must call Train() before use

How it works: 1. Train: Finds the maximum absolute value across sample vectors 2. Quantize: Maps [-AbsMax, AbsMax] to [-127, 127] using linear scaling 3. Dequantize: Reverses the scaling to approximate original values

Trade-off: Maximum memory savings with some accuracy loss. Best for large-scale deployments where memory is critical.

func (*Int8Quantizer) Dequantize

func (q *Int8Quantizer) Dequantize(stored any) ([]float32, error)

func (*Int8Quantizer) GetAbsMax

func (q *Int8Quantizer) GetAbsMax() float32

GetAbsMax returns the trained maximum absolute value (for serialization)

func (*Int8Quantizer) IsTrained

func (q *Int8Quantizer) IsTrained() bool

func (*Int8Quantizer) Quantize

func (q *Int8Quantizer) Quantize(vector []float32) (any, error)

func (*Int8Quantizer) SetAbsMax

func (q *Int8Quantizer) SetAbsMax(absMax float32)

SetAbsMax sets the maximum absolute value (for deserialization)

func (*Int8Quantizer) Train

func (q *Int8Quantizer) Train(vectors [][]float32)

func (*Int8Quantizer) Type

func (q *Int8Quantizer) Type() QuantizerType

type LogicOperator

type LogicOperator string

LogicOperator defines how filters are combined

const (
	AND LogicOperator = "AND"
	OR  LogicOperator = "OR"
)

type MetadataFilterQueryBuilder

type MetadataFilterQueryBuilder struct {
	// contains filtered or unexported fields
}

MetadataFilterQueryBuilder provides a type-safe fluent API for building complex metadata filter queries

func NewMetadataFilterQuery

func NewMetadataFilterQuery() *MetadataFilterQueryBuilder

NewMetadataFilterQuery creates a new query builder

Example:

results, err := NewMetadataFilterQuery().
	Where(Eq("category", "electronics"), Gte("price", 500)).
	Or(Eq("category", "phones"), Gte("rating", 4.5)).
	Execute(idx)

func (*MetadataFilterQueryBuilder) And

And adds filters with AND logic to the last group

func (*MetadataFilterQueryBuilder) Build

func (qb *MetadataFilterQueryBuilder) Build() []*FilterGroup

Build returns the constructed filter groups

func (*MetadataFilterQueryBuilder) Execute

Execute runs the query against a metadata index

Example:

results, err := NewMetadataFilterQuery().
	Where(Eq("brand", "Apple")).
	Or(Eq("category", "phone"), Eq("brand", "Samsung")).
	Execute(idx)

func (*MetadataFilterQueryBuilder) Or

Or starts a new filter group (filters within are combined with AND, but this group is OR'd with previous groups)

func (*MetadataFilterQueryBuilder) Where

Where starts a new filter group with AND logic

type MetadataIndex

type MetadataIndex interface {
	// Add adds a document with its metadata to the index
	Add(node MetadataNode) error

	// Remove removes a document from the index
	Remove(node MetadataNode) error

	// NewSearch creates a new search builder
	NewSearch() MetadataSearch

	// Flush the metadata index
	Flush() error

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

MetadataIndex is the interface for filtering documents based on metadata

type MetadataNode

type MetadataNode struct {
	// contains filtered or unexported fields
}

MetadataNode represents a document with structured metadata attributes.

MetadataNode is used for filtering and structured queries in metadata indexes. It associates a unique ID with a map of metadata fields that can be:

  • Strings: for categorical data (e.g., "category", "status")
  • Numbers (int, float64): for numeric fields (e.g., "price", "rating")
  • Booleans: for binary flags (e.g., "in_stock", "featured")

The metadata is stored as map[string]interface{} for flexibility, but the metadata index will interpret types and use appropriate data structures:

  • Categorical values → Roaring Bitmaps for fast set operations
  • Numeric values → Bit-Sliced Index (BSI) for range queries

Example:

metadata := map[string]interface{}{
    "category": "electronics",
    "price": 999.99,
    "rating": 4.5,
    "in_stock": true,
    "tags": []string{"premium", "wireless"},  // Lists also supported
}
node := comet.NewMetadataNode(metadata)
index.Add(*node)

func NewMetadataNode

func NewMetadataNode(metadata map[string]interface{}) *MetadataNode

NewMetadataNode creates a new MetadataNode with an auto-incremented ID.

The ID is generated atomically and is guaranteed to be unique across all VectorNode and MetadataNode instances. This makes it safe for concurrent use.

Parameters:

  • metadata: Map of field names to values. Supported types are:
  • string: categorical data
  • int, int32, int64: integer values
  • float32, float64: floating point values (converted to int64 for BSI)
  • bool: boolean flags
  • []string: list of string values (for IN queries)

Returns:

  • *MetadataNode: A new node with auto-generated ID

Thread-safety: This function is safe for concurrent use.

Example:

node := comet.NewMetadataNode(map[string]interface{}{
    "category": "books",
    "price": 29.99,
    "in_stock": true,
})
index.Add(*node)

func NewMetadataNodeWithID

func NewMetadataNodeWithID(id uint32, metadata map[string]interface{}) *MetadataNode

NewMetadataNodeWithID creates a new MetadataNode with an explicit ID.

Use this when you need to control the ID assignment, typically when:

  • Mapping metadata to existing document IDs from your database
  • Ensuring consistency with vector node IDs in hybrid indexes
  • Deserializing indexes from disk

WARNING: The caller is responsible for ensuring ID uniqueness.

Parameters:

  • id: The unique identifier for this node
  • metadata: Map of field names to values

Returns:

  • *MetadataNode: A new node with the specified ID

Example:

dbID := uint32(12345)
node := comet.NewMetadataNodeWithID(dbID, map[string]interface{}{
    "category": "electronics",
    "price": 999,
})
index.Add(*node)

func (*MetadataNode) ID

func (n *MetadataNode) ID() uint32

ID returns the unique identifier of this metadata node.

Returns:

  • uint32: The node's ID

func (*MetadataNode) Metadata

func (n *MetadataNode) Metadata() map[string]interface{}

Metadata returns the metadata map for this node.

Returns:

  • map[string]interface{}: The metadata fields (not a copy - do not modify)

type MetadataResult

type MetadataResult struct {
	// Node is the matched metadata node containing ID and metadata fields
	Node MetadataNode
}

MetadataResult represents a search result from metadata filtering.

Unlike vector and text results which have meaningful scores, metadata results are binary (match or no match). All returned results match the filter criteria, so they don't have associated relevance scores.

The result contains the full metadata node, allowing access to all metadata fields of matched documents.

Example:

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
    ).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d\n", result.GetId())
    fmt.Printf("Metadata: %v\n", result.Node.Metadata())
}

func (MetadataResult) GetId

func (r MetadataResult) GetId() uint32

GetId returns the ID of the matched document.

func (MetadataResult) GetScore

func (r MetadataResult) GetScore() float32

GetScore returns 0 for metadata results (no meaningful score). This method exists to implement the Result interface.

type MetadataSearch

type MetadataSearch interface {
	// WithFilters sets the filters to apply with AND logic.
	// All filters must match for a document to be included.
	//
	// Available filter functions:
	//   - Eq(field, value): field == value
	//   - Ne(field, value): field != value
	//   - Lt(field, value): field < value
	//   - Lte(field, value): field <= value
	//   - Gt(field, value): field > value
	//   - Gte(field, value): field >= value
	//   - Between(field, min, max): min <= field <= max
	//   - In(field, ...values): field in values
	//   - NotIn(field, ...values): field not in values
	//   - Exists(field): field exists
	//   - NotExists(field): field doesn't exist
	//
	// Parameters:
	//   - filters: One or more filter conditions (combined with AND)
	//
	// Returns:
	//   - MetadataSearch: The search instance for method chaining
	WithFilters(filters ...Filter) MetadataSearch

	// WithFilterGroups sets complex filter groups with OR logic.
	// Documents matching ANY group are included (OR between groups).
	// Filters within each group are combined with AND.
	//
	// This enables complex boolean expressions like:
	//   (A AND B) OR (C AND D) OR (E AND F)
	//
	// Parameters:
	//   - groups: One or more filter groups (combined with OR)
	//
	// Returns:
	//   - MetadataSearch: The search instance for method chaining
	WithFilterGroups(groups ...*FilterGroup) MetadataSearch

	// Execute performs the filtering and returns matching document IDs.
	// The results contain only document IDs (no scores).
	//
	// Returns:
	//   - []MetadataResult: Matching documents
	//   - error: Error if filtering fails
	Execute() ([]MetadataResult, error)
}

MetadataSearch provides a fluent interface for filtering documents by metadata.

MetadataSearch uses roaring bitmaps and bit-sliced indexes for extremely fast filtering on structured attributes. It supports:

  • Equality and inequality queries
  • Numeric range queries
  • Set membership (IN, NOT IN)
  • Existence checks
  • Complex boolean logic (AND, OR, NOT)

Filters within a single WithFilters() call are combined with AND logic. Filter groups enable OR logic between different filter combinations.

Example - Simple filters (AND logic):

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lte("price", 1000),
        comet.Eq("in_stock", true),
    ).
    Execute()

Example - Complex query with OR:

// (category=electronics AND price<500) OR (category=books AND rating>=4)
group1 := comet.NewFilterGroup().
    WithFilters(
        comet.Eq("category", "electronics"),
        comet.Lt("price", 500),
    )
group2 := comet.NewFilterGroup().
    WithFilters(
        comet.Eq("category", "books"),
        comet.Gte("rating", 4),
    )
results, _ := metaIndex.NewSearch().
    WithFilterGroups(group1, group2).
    Execute()

Example - Set membership:

results, _ := metaIndex.NewSearch().
    WithFilters(
        comet.In("category", "electronics", "computers", "phones"),
        comet.NotIn("brand", "brandX", "brandY"),
    ).
    Execute()

type Operator

type Operator string

Operator represents a filter operation type

const (
	// Equality operators
	OpEqual    Operator = "eq" // Equal to
	OpNotEqual Operator = "ne" // Not equal to

	// Comparison operators (numeric)
	OpGreaterThan        Operator = "gt"  // Greater than
	OpGreaterThanOrEqual Operator = "gte" // Greater than or equal to
	OpLessThan           Operator = "lt"  // Less than
	OpLessThanOrEqual    Operator = "lte" // Less than or equal to

	// Set operators
	OpIn    Operator = "in"     // In a set of values
	OpNotIn Operator = "not_in" // Not in a set of values

	// Range operators
	OpRange Operator = "range" // Within a range [Value, Value2]

	// Existence operators
	OpExists    Operator = "exists"     // Field exists (has any value)
	OpNotExists Operator = "not_exists" // Field doesn't exist
)

Operator constants

type PQIndex

type PQIndex struct {

	// M is the number of subquantizers
	M int

	// Nbits is bits per PQ code
	Nbits int

	// Ksub is centroids per subquantizer (K = 2^Nbits)
	Ksub int
	// contains filtered or unexported fields
}

PQIndex represents a Product Quantization index.

Memory layout:

  • Codes: n × M bytes (compressed vectors)
  • Codebooks: M × K × (dim/M) × 4 bytes
  • Typical: 10-500x smaller than original

func NewPQIndex

func NewPQIndex(dim int, distanceKind DistanceKind, M int, Nbits int) (*PQIndex, error)

NewPQIndex creates a new Product Quantization index.

Parameters:

  • dim: Vector dimensionality (must be divisible by M)
  • distanceKind: Distance metric
  • M: Number of subquantizers (subspaces). Must divide dim evenly.
  • Nbits: Bits per PQ code, determines K=2^Nbits centroids per subspace

Returns:

  • *PQIndex: New untrained PQ index
  • error: Returns error if parameters invalid

Tip: Use CalculatePQParams(dim) to get recommended M and Nbits values.

func (*PQIndex) Add

func (idx *PQIndex) Add(vector VectorNode) error

Add compresses and adds vectors to the index.

Encoding process: 1. Divide vector into M subvectors 2. For each subvector, find nearest centroid in its codebook 3. Store centroid IDs as M-byte code

Original vectors are discarded after encoding!

Parameters:

  • vector: Vector to compress and add

Returns:

  • error: Returns error if not trained or dimension mismatch

func (*PQIndex) Dimensions

func (idx *PQIndex) Dimensions() int

Dimensions returns the dimensionality of original vectors.

func (*PQIndex) DistanceKind

func (idx *PQIndex) DistanceKind() DistanceKind

DistanceKind returns the distance metric.

func (*PQIndex) Flush

func (idx *PQIndex) Flush() error

Flush performs hard delete of soft-deleted nodes.

WHEN TO CALL:

  • After multiple Remove() calls (batch cleanup)
  • When deleted nodes are significant (e.g., > 10% of index)
  • During off-peak hours

WHAT IT DOES: 1. Removes all soft-deleted codes and vectorNodes from their parallel slices 2. Reclaims memory occupied by deleted vectors 3. Clears the deleted nodes bitmap

COST: O(n) where n = number of vectors in the index

Thread-safety: Acquires exclusive write lock

func (*PQIndex) Kind

func (idx *PQIndex) Kind() VectorIndexKind

Kind returns the index type.

func (*PQIndex) NewSearch

func (idx *PQIndex) NewSearch() VectorSearch

NewSearch creates a new search builder.

func (*PQIndex) ReadFrom

func (idx *PQIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a PQIndex from an io.Reader.

This method reconstructs a PQIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
M, Nbits := CalculatePQParams(384)
idx2, _ := NewPQIndex(384, Cosine, M, Nbits)
idx2.ReadFrom(file)
file.Close()

func (*PQIndex) Remove

func (idx *PQIndex) Remove(vector VectorNode) error

Remove performs soft delete using roaring bitmap.

CONCURRENCY OPTIMIZATION: - Uses read lock first (cheaper) to check if node exists - Only acquires write lock for the actual bitmap modification - Minimizes write lock contention

SOFT DELETE MECHANISM: Instead of immediately removing from the codes and vectorNodes slices (expensive O(n)), we mark as deleted in roaring bitmap. Deleted nodes are:

  • Skipped during search
  • Still in storage slices
  • Not counted as active nodes

Call Flush() periodically for actual cleanup and memory reclamation.

Parameters:

  • vector: Vector to remove (only the ID field is used for matching)

Returns:

  • error: Returns error if vector is not found or already deleted

Time Complexity: O(n) for existence check + O(log n) for bitmap operation

Thread-safety: Uses read lock for validation, write lock for modification

func (*PQIndex) Train

func (idx *PQIndex) Train(vectors []VectorNode) error

Train learns codebooks for each subspace using k-means.

Algorithm:

  1. For each of M subspaces: a. Extract that subspace from all training vectors b. Run k-means to find K centroids c. Store centroids as codebook for that subspace

Parameters:

  • vectors: Training vectors (need at least Ksub vectors)

Returns:

  • error: Returns error if insufficient training data

func (*PQIndex) Trained

func (idx *PQIndex) Trained() bool

Trained returns true if the index has been trained

func (*PQIndex) WriteTo

func (idx *PQIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the PQIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization to ensure all soft-deleted vectors are permanently removed from the serialized data.

The serialization format is: 1. Magic number (4 bytes) - "PQIX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. Basic parameters:

  • Dimensionality (4 bytes)
  • Distance kind length (4 bytes) + distance kind string
  • M (4 bytes) - number of subquantizers
  • Nbits (4 bytes) - bits per PQ code
  • Ksub (4 bytes) - centroids per subquantizer
  • dsub (4 bytes) - dimension of each subspace
  • trained (1 byte) - whether codebooks have been learned

4. Codebooks (only if trained):

  • For each of M subquantizers:
  • Codebook size (4 bytes)
  • Codebook data (Ksub * dsub * 4 bytes as float32)

5. Number of vectors (4 bytes) 6. For each vector:

  • Vector ID (4 bytes)
  • PQ code (M bytes)

7. Deleted nodes bitmap size (4 bytes) + roaring bitmap bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type Quantizer

type Quantizer interface {
	// Train prepares the quantizer using sample vectors.
	// Required for Int8Quantizer, no-op for others.
	Train(vectors [][]float32)

	// IsTrained returns true if the quantizer is ready to use.
	// Always true for FullPrecision and HalfPrecision.
	// For Int8, true only after Train() has been called.
	IsTrained() bool

	// Quantize converts a float32 vector to the quantizer's storage format.
	// Returns:
	//   - []float32 for FullPrecisionQuantizer
	//   - []uint16 for HalfPrecisionQuantizer (float16 bits)
	//   - []int8 for Int8Quantizer
	Quantize(vector []float32) (any, error)

	// Dequantize converts a stored vector back to float32.
	// The input type must match the quantizer's storage format.
	Dequantize(stored any) ([]float32, error)

	// Type returns the quantizer type
	Type() QuantizerType
}

Quantizer defines the interface for vector quantization operations. Implementations handle conversion between float32 vectors and their compressed representations, supporting different precision levels.

func NewQuantizer

func NewQuantizer(qType QuantizerType) (Quantizer, error)

NewQuantizer creates a quantizer of the specified type.

type QuantizerType

type QuantizerType string

QuantizerType represents the type of quantization

const (
	FullPrecision QuantizerType = "float32"
	HalfPrecision QuantizerType = "float16"
	Int8Precision QuantizerType = "int8"
)

type Result

type Result interface {
	// GetId returns the unique identifier for this result
	GetId() uint32

	// GetScore returns the relevance score for this result.
	// Interpretation depends on search type:
	//   - Vector search: distance (lower = better)
	//   - Text search: BM25 score (higher = better)
	//   - Metadata search: typically 0 or 1 (binary match)
	GetScore() float32
}

Result is a common interface for all search result types.

This interface enables generic operations like limiting, autocut, and score aggregation to work uniformly across different search modalities (vector, text, metadata).

All result types (VectorResult, TextResult, MetadataResult) implement this interface.

type RoaringMetadataIndex

type RoaringMetadataIndex struct {
	// contains filtered or unexported fields
}

RoaringMetadataIndex provides fast filtering using Roaring Bitmaps and BSI.

This index maintains two types of indexes: 1. Categorical index: Maps "field:value" to document IDs using roaring bitmaps 2. Numeric index: Maps field names to BSI structures for range queries

Thread-safety: This index is safe for concurrent use through a read-write mutex.

func NewRoaringMetadataIndex

func NewRoaringMetadataIndex() *RoaringMetadataIndex

NewRoaringMetadataIndex creates a new roaring bitmap-based metadata index.

Returns:

  • *RoaringMetadataIndex: A new empty index ready to accept documents

Example:

idx := NewRoaringMetadataIndex()
node := NewMetadataNode(1, map[string]interface{}{
	"category": "electronics",
	"price": 999,
})
idx.Add(node)

func (*RoaringMetadataIndex) Add

func (idx *RoaringMetadataIndex) Add(node MetadataNode) error

Add adds a document with its metadata to the index.

The metadata is automatically classified into categorical or numeric fields:

  • Numeric: int, int64, float64 (stored in BSI)
  • Categorical: string, bool (stored in roaring bitmaps)

Parameters:

  • node: MetadataNode containing document ID and metadata fields

Returns:

  • error: Returns error if metadata contains unsupported types

Time Complexity: O(m) where m is the number of metadata fields

Thread-safety: Acquires exclusive lock

func (*RoaringMetadataIndex) Flush

func (idx *RoaringMetadataIndex) Flush() error

Flush is a no-op for roaring metadata index since data is stored in memory. This method exists to satisfy the MetadataIndex interface.

Returns:

  • error: Always returns nil

func (*RoaringMetadataIndex) NewSearch

func (idx *RoaringMetadataIndex) NewSearch() MetadataSearch

NewSearch creates a new search builder for this index.

Returns:

  • MetadataSearch: A new search builder ready to be configured

Example:

results, err := idx.NewSearch().
	WithFilters(
		Eq("category", "electronics"),
		Gte("price", 500),
	).
	Execute()

func (*RoaringMetadataIndex) ReadFrom

func (idx *RoaringMetadataIndex) ReadFrom(r io.Reader) (int64, error)

ReadFrom deserializes a RoaringMetadataIndex from an io.Reader.

This method reconstructs a RoaringMetadataIndex from the serialized format created by WriteTo. The deserialized index is fully functional and ready to use for searches.

Thread-safety: Acquires write lock during deserialization

Returns:

  • int64: Number of bytes read
  • error: Returns error if read fails, format is invalid, or data is corrupted

Example:

// Save index
file, _ := os.Create("index.bin")
idx.WriteTo(file)
file.Close()

// Load index
file, _ := os.Open("index.bin")
idx2 := NewRoaringMetadataIndex()
idx2.ReadFrom(file)
file.Close()

func (*RoaringMetadataIndex) Remove

func (idx *RoaringMetadataIndex) Remove(node MetadataNode) error

Remove removes a document from all indexes.

Parameters:

  • node: MetadataNode to remove (only the ID field is used for matching)

Returns:

  • error: Always returns nil (exists to satisfy MetadataIndex interface)

Time Complexity: O(m) where m is the number of metadata fields

Thread-safety: Acquires exclusive lock

func (*RoaringMetadataIndex) WriteTo

func (idx *RoaringMetadataIndex) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the RoaringMetadataIndex to an io.Writer.

IMPORTANT: This method calls Flush() before serialization (though Flush is a no-op for metadata index since it's memory-based).

The serialization format is: 1. Magic number (4 bytes) - "MTIX" identifier for validation 2. Version (4 bytes) - Format version for backward compatibility 3. All docs bitmap size (4 bytes) + bitmap bytes 4. Number of categorical entries (4 bytes) 5. For each categorical entry:

  • Key length (4 bytes) + key string
  • Bitmap size (4 bytes) + bitmap bytes

6. Number of numeric entries (4 bytes) 7. For each numeric entry:

  • Field name length (4 bytes) + field name string
  • BSI size (4 bytes) + BSI bytes

Thread-safety: Acquires read lock during serialization

Returns:

  • int64: Number of bytes written
  • error: Returns error if write fails or flush fails

type ScoreAggregationKind

type ScoreAggregationKind string

ScoreAggregationKind defines the type of score aggregation strategy.

const (
	// SumAggregation sums all scores for the same node
	SumAggregation ScoreAggregationKind = "sum"

	// MaxAggregation takes the maximum score for the same node
	MaxAggregation ScoreAggregationKind = "max"

	// MeanAggregation averages all scores for the same node
	MeanAggregation ScoreAggregationKind = "mean"
)

type SearchResult

type SearchResult struct {
	DocID uint32  // Document ID
	Score float64 // BM25 relevance score
}

SearchResult represents a single search result with its score.

type TextAggregation

type TextAggregation interface {
	// Kind returns the kind of aggregation strategy
	Kind() ScoreAggregationKind

	// Aggregate takes a slice of TextResults (potentially with duplicate document IDs),
	// deduplicates by document ID, aggregates scores for each unique document,
	// and returns the deduplicated results sorted by aggregated score (descending).
	Aggregate(results []TextResult) []TextResult
}

TextAggregation defines how to aggregate scores when the same text document appears in results from multiple queries.

When performing multi-query text searches, the same document may be returned multiple times with different scores. The aggregation strategy determines how to combine these scores and deduplicate results by document ID.

Note: Text results use relevance scores where HIGHER score = BETTER match (e.g., BM25).

func DefaultTextAggregation

func DefaultTextAggregation() TextAggregation

DefaultTextAggregation returns the default text aggregation strategy (Sum).

func NewTextAggregation

func NewTextAggregation(kind ScoreAggregationKind) (TextAggregation, error)

NewTextAggregation returns the singleton text aggregation instance for the given kind. Returns error if the kind is not recognized.

type TextIndex

type TextIndex interface {
	// Add a new text to the index
	Add(id uint32, text string) error

	// Remove a text from the index
	Remove(id uint32) error

	// NewSearch creates a new search builder
	NewSearch() TextSearch

	// Flush the text index
	Flush() error

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

type TextResult

type TextResult struct {
	// Id is the unique identifier of the matched document
	Id uint32

	// Score is the BM25 relevance score (higher = more relevant)
	Score float32
}

TextResult represents a search result from full-text (BM25) search.

Each result contains the matched document ID and a relevance score. The score represents BM25 relevance:

  • Higher scores indicate better relevance (more important terms, higher frequency)
  • Scores are not normalized and can exceed 1.0
  • Scores depend on:
  • Term frequency in document
  • Inverse document frequency
  • Document length normalization

Note: Unlike vector results where lower is better, text results have higher-is-better scores (standard for information retrieval).

Example:

results, _ := index.NewSearch().
    WithQuery("machine learning").
    WithK(10).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d, BM25 Score: %.4f\n",
        result.Id, result.Score)
}

func (TextResult) GetId

func (r TextResult) GetId() uint32

GetId returns the ID of the matched document.

func (TextResult) GetScore

func (r TextResult) GetScore() float32

GetScore returns the BM25 relevance score (higher = more relevant).

type TextSearch

type TextSearch interface {
	// WithQuery sets the query text(s) for full-text search.
	// Supports single or multiple queries for batch search.
	// Text is tokenized and normalized using UAX#29 word segmentation.
	//
	// Parameters:
	//   - queries: One or more query strings
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithQuery(queries ...string) TextSearch

	// WithNode sets the node ID(s) to search from (not commonly used for text).
	// This is provided for interface consistency but is rarely needed.
	//
	// Parameters:
	//   - nodeIDs: One or more node IDs
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithNode(nodeIDs ...uint32) TextSearch

	// WithK sets the number of top results to return.
	//
	// Parameters:
	//   - k: Number of results (default: 10)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithK(k int) TextSearch

	// WithScoreAggregation sets how to combine scores from multiple queries.
	// Only relevant when using multiple queries.
	//
	// Available strategies:
	//   - SumAggregation: Sum all scores (default, emphasizes total relevance)
	//   - MaxAggregation: Take maximum score (best match across queries)
	//   - MeanAggregation: Average all scores (balanced)
	//
	// Parameters:
	//   - kind: The aggregation strategy
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithScoreAggregation(kind ScoreAggregationKind) TextSearch

	// WithCutoff enables automatic result cutoff based on score distribution.
	//
	// Parameters:
	//   - cutoff: Number of extrema to find before cutting (-1 disables)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithCutoff(cutoff int) TextSearch

	// WithDocumentIDs pre-filters the search to specific document IDs.
	// Only documents with IDs in this set will be scored.
	//
	// Parameters:
	//   - docIDs: Eligible document IDs (empty means all documents)
	//
	// Returns:
	//   - TextSearch: The search instance for method chaining
	WithDocumentIDs(docIDs ...uint32) TextSearch

	// Execute performs the configured search and returns results.
	// Results are sorted by BM25 score (descending - higher is better).
	//
	// Returns:
	//   - []TextResult: Sorted search results
	//   - error: Error if search fails
	Execute() ([]TextResult, error)
}

TextSearch provides a fluent interface for configuring and executing text searches.

TextSearch uses the builder pattern for configuration, similar to VectorSearch. It supports BM25-based full-text search with tokenization and normalization.

Features:

  • Single or multi-query search with score aggregation
  • Pre-filtering by document IDs
  • Automatic result cutoff (autocut)
  • Top-K retrieval with heap-based ranking

Example - Basic search:

results, _ := txtIndex.NewSearch().
    WithQuery("machine learning tutorial").
    WithK(10).
    Execute()

Example - Multi-query search:

results, _ := txtIndex.NewSearch().
    WithQuery("deep learning", "neural networks", "transformers").
    WithK(20).
    WithScoreAggregation(comet.MaxAggregation).
    Execute()

Example - Pre-filtered search:

eligibleDocs := []uint32{1, 5, 10, 15}
results, _ := txtIndex.NewSearch().
    WithQuery("tutorial").
    WithDocumentIDs(eligibleDocs...).
    WithK(5).
    Execute()

type VectorAggregation

type VectorAggregation interface {
	// Kind returns the kind of aggregation strategy
	Kind() ScoreAggregationKind

	// Aggregate takes a slice of VectorResults (potentially with duplicate node IDs),
	// deduplicates by node ID, aggregates scores for each unique node,
	// and returns the deduplicated results sorted by aggregated score (ascending).
	Aggregate(results []VectorResult) []VectorResult
}

VectorAggregation defines how to aggregate scores when the same vector node appears in results from multiple queries or node searches.

When performing multi-query or multi-node searches, the same vector may be returned multiple times with different scores. The aggregation strategy determines how to combine these scores and deduplicate results by node ID.

Note: Vector results use distance metrics where LOWER score = BETTER match.

func DefaultVectorAggregation

func DefaultVectorAggregation() VectorAggregation

DefaultVectorAggregation returns the default vector aggregation strategy (Sum).

func NewVectorAggregation

func NewVectorAggregation(kind ScoreAggregationKind) (VectorAggregation, error)

NewVectorAggregation returns the singleton vector aggregation instance for the given kind. Returns error if the kind is not recognized.

type VectorIndex

type VectorIndex interface {
	// Train the index
	Train(vectors []VectorNode) error

	// Add a new vector to the index
	Add(vector VectorNode) error

	// Remove a vector from the index
	Remove(vector VectorNode) error

	// Flush the vector index
	Flush() error

	// NewSearch creates a new search builder
	NewSearch() VectorSearch

	// Dimensions returns the dimensionality of vectors stored in this index
	Dimensions() int

	// DistanceKind returns the distance metric used for similarity measurement
	DistanceKind() DistanceKind

	// Kind returns the type of vector index
	Kind() VectorIndexKind

	// Trained returns true if the index has been trained
	Trained() bool

	// Serialization support
	io.WriterTo
	io.ReaderFrom
}

VectorIndex is the interface for the vector index

type VectorIndexKind

type VectorIndexKind string

VectorIndexKind represents the type of indexing strategy used for vector search. Different index types offer different tradeoffs between speed, accuracy, and memory usage.

var (
	// FlatIndex performs exhaustive search by comparing the query against all vectors.
	// Provides perfect recall but has O(n) time complexity.
	FlatIndexKind VectorIndexKind = "flat"

	// IVFIndex (Inverted File Index) partitions the vector space into clusters.
	// Searches only the nearest clusters, trading accuracy for speed.
	IVFIndexKind VectorIndexKind = "ivf"

	// PQIndex (Product Quantization) compresses vectors using learned codebooks.
	// Significantly reduces memory usage at the cost of some accuracy.
	PQIndexKind VectorIndexKind = "pq"

	// IVFPQIndex combines IVF clustering with PQ compression.
	// Provides fast search with low memory footprint.
	IVFPQIndexKind VectorIndexKind = "ivfpq"

	// HNSWIndex (Hierarchical Navigable Small World) builds a multi-layer graph.
	// Offers excellent query performance with high recall.
	HNSWIndexKind VectorIndexKind = "hnsw"
)

type VectorNode

type VectorNode struct {
	// contains filtered or unexported fields
}

VectorNode represents a vector embedding with a unique identifier.

VectorNode is the fundamental data structure for storing vectors in all index types. Each node contains:

  • id: A unique identifier for retrieval and reference
  • vector: The actual float32 embedding values

The node is immutable after creation - neither the ID nor vector can be modified. This immutability ensures thread-safety and prevents accidental corruption of indexed data.

Example:

// Auto-generated ID
vec := []float32{0.1, 0.5, 0.3, 0.8}
node := comet.NewVectorNode(vec)
fmt.Println(node.ID())      // e.g., 1
fmt.Println(node.Vector())  // [0.1, 0.5, 0.3, 0.8]

// Explicit ID (useful for mapping to external document IDs)
node2 := comet.NewVectorNodeWithID(42, vec)
fmt.Println(node2.ID())  // 42

func NewVectorNode

func NewVectorNode(vector []float32) *VectorNode

NewVectorNode creates a new VectorNode with an auto-incremented ID.

The ID is generated atomically and is guaranteed to be unique across all VectorNode and MetadataNode instances created by this package. This makes it safe to create nodes concurrently from multiple goroutines.

Parameters:

  • vector: The embedding vector to store. This should match the dimensionality expected by your index (e.g., 384 for sentence transformers, 768 for BERT).

Returns:

  • *VectorNode: A new node with auto-generated ID

Thread-safety: This function is safe for concurrent use.

Example:

embedding := []float32{0.1, 0.2, 0.3, ...}  // 384 dimensions
node := comet.NewVectorNode(embedding)
index.Add(*node)

func NewVectorNodeWithID

func NewVectorNodeWithID(id uint32, vector []float32) *VectorNode

NewVectorNodeWithID creates a new VectorNode with an explicit ID.

Use this when you need to control the ID assignment, typically when:

  • Mapping vectors to existing document IDs from your database
  • Deserializing indexes from disk
  • Maintaining consistent IDs across restarts

WARNING: The caller is responsible for ensuring ID uniqueness. Using duplicate IDs may cause undefined behavior in the index.

Parameters:

  • id: The unique identifier for this node
  • vector: The embedding vector to store

Returns:

  • *VectorNode: A new node with the specified ID

Example:

// Map to database primary key
dbID := uint32(12345)
embedding := []float32{0.1, 0.2, 0.3, ...}
node := comet.NewVectorNodeWithID(dbID, embedding)
index.Add(*node)

func (*VectorNode) ID

func (n *VectorNode) ID() uint32

ID returns the unique identifier of this vector node.

Returns:

  • uint32: The node's ID

func (*VectorNode) Vector

func (n *VectorNode) Vector() []float32

Vector returns the embedding vector stored in this node.

Returns:

  • []float32: The vector data (not a copy - do not modify)

type VectorResult

type VectorResult struct {
	// Node is the matched vector node containing ID and vector data
	Node VectorNode

	// Score is the distance from the query vector (lower = more similar)
	Score float32
}

VectorResult represents a search result from vector similarity search.

Each result contains the matched vector node and a similarity score. The score represents distance from the query vector:

  • Lower scores indicate higher similarity (closer in vector space)
  • Score interpretation depends on the distance metric:
  • L2/Euclidean: absolute distance
  • L2Squared: squared distance (faster, preserves ordering)
  • Cosine: angular distance (range: [0, 2])

Example:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    Execute()

for _, result := range results {
    fmt.Printf("ID: %d, Distance: %.4f\n",
        result.GetId(), result.Score)
    fmt.Printf("Vector: %v\n", result.Node.Vector())
}

func (VectorResult) GetId

func (r VectorResult) GetId() uint32

GetId returns the ID of the matched vector node.

func (VectorResult) GetScore

func (r VectorResult) GetScore() float32

GetScore returns the distance score (lower = more similar).

type VectorSearch

type VectorSearch interface {
	// WithQuery sets the query vector(s) for similarity search.
	// Supports single or multiple query vectors for batch search.
	// Results from multiple queries are aggregated using the configured strategy.
	//
	// Parameters:
	//   - queries: One or more query vectors (each []float32 must match index dimension)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithQuery(queries ...[]float32) VectorSearch

	// WithNode sets the node ID(s) to search from (node-based similarity).
	// Finds vectors similar to the specified indexed nodes.
	// Supports single or multiple nodes for batch search.
	//
	// Parameters:
	//   - nodeIDs: One or more node IDs to use as query vectors
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithNode(nodeIDs ...uint32) VectorSearch

	// WithK sets the number of nearest neighbors to return.
	//
	// Parameters:
	//   - k: Number of results (default: 10)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithK(k int) VectorSearch

	// WithNProbes sets the number of clusters to probe (IVF/IVFPQ only).
	// Higher values increase recall but reduce speed.
	// This is a no-op for other index types.
	//
	// Typical values:
	//   - 1: Fastest, ~60-70% recall
	//   - 8: Good balance, ~85% recall
	//   - 16: Better recall, ~92% recall
	//   - 32: High recall, ~96% recall
	//
	// Parameters:
	//   - nProbes: Number of clusters to search
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithNProbes(nProbes int) VectorSearch

	// WithEfSearch sets the efSearch parameter (HNSW only).
	// Controls the size of the dynamic candidate list during search.
	// Higher values increase recall but reduce speed.
	// This is a no-op for other index types.
	//
	// Typical values:
	//   - 50: Very fast, ~85% recall
	//   - 100: Fast, ~92% recall
	//   - 200: Balanced, ~96% recall (default)
	//   - 400: Slower, ~98% recall
	//
	// Parameters:
	//   - efSearch: Size of candidate list
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithEfSearch(efSearch int) VectorSearch

	// WithThreshold sets a distance threshold for filtering results.
	// Only vectors with distance <= threshold are returned.
	// Useful for finding all "sufficiently similar" vectors.
	//
	// Parameters:
	//   - threshold: Maximum distance (results with distance > threshold are excluded)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithThreshold(threshold float32) VectorSearch

	// WithScoreAggregation sets how to combine scores from multiple queries/nodes.
	// Only relevant when using multiple queries or nodes.
	//
	// Available strategies:
	//   - SumAggregation: Sum all scores (emphasizes frequency)
	//   - MaxAggregation: Take maximum distance (conservative)
	//   - MeanAggregation: Average all scores (balanced, default)
	//
	// Parameters:
	//   - kind: The aggregation strategy
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithScoreAggregation(kind ScoreAggregationKind) VectorSearch

	// WithCutoff enables automatic result cutoff based on score distribution.
	// Analyzes the score curve to find natural breakpoints.
	//
	// Parameters:
	//   - cutoff: Number of extrema to find before cutting (-1 disables autocut)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithCutoff(cutoff int) VectorSearch

	// WithDocumentIDs pre-filters the search to specific document IDs.
	// Only vectors with IDs in this set will be considered.
	// Useful for combining with metadata filters in hybrid search.
	//
	// Parameters:
	//   - docIDs: Eligible document IDs (empty means all documents)
	//
	// Returns:
	//   - VectorSearch: The search instance for method chaining
	WithDocumentIDs(docIDs ...uint32) VectorSearch

	// Execute performs the configured search and returns results.
	// Results are sorted by distance (ascending - lower is better).
	//
	// Returns:
	//   - []VectorResult: Sorted search results
	//   - error: Error if search fails
	Execute() ([]VectorResult, error)
}

VectorSearch provides a fluent interface for configuring and executing vector searches.

VectorSearch uses the builder pattern to configure search parameters before execution. All With* methods return the search instance for method chaining.

Search modes:

  • Query-based: Search for vectors similar to provided query vectors
  • Node-based: Find vectors similar to indexed nodes (by ID)
  • Hybrid: Combine both query and node searches

Advanced features:

  • Pre-filtering: Restrict search to specific document IDs
  • Score aggregation: Combine results from multiple queries/nodes
  • Autocut: Automatically determine optimal result cutoff
  • Threshold filtering: Exclude results beyond distance threshold

Example - Basic search:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    Execute()

Example - Multi-query with aggregation:

results, _ := index.NewSearch().
    WithQuery(query1, query2, query3).
    WithK(20).
    WithScoreAggregation(comet.MeanAggregation).
    Execute()

Example - Pre-filtered search:

eligibleIDs := []uint32{1, 5, 10, 15, 20}
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()