Why Superlinked?

Your users expect better search

The landscape of search and information retrieval is rapidly evolving. With the rise of AI and large language models, user expectations for search capabilities have skyrocketed. Your users now expect your search to handle complex, nuanced queries that go beyond simple keyword matching. Here is how Algolia's CTO put it:

"We saw 2x more keywords search 6 months after the ChatGPT launch." Algolia CTO, 2023

Algolia has 17,000 customers accounting for 120B searches per month, and this trend isn't isolated. Across industries, we're seeing a shift towards more sophisticated search queries that blend multiple concepts, contexts, and data types.

Vector search with text-only embeddings (and with multi-modal embeddings too) fails on complex queries, because complex queries are never just about text. They involve other kinds of data as well.

Consider these examples:

E-commerce: A query like "comfortable running shoes for marathon training under $150" involves text, numerical data (price), and categorical information (product type, use case).
Content platforms: "Popular science fiction movies from the 80s with strong female leads" combines text analysis, temporal data, and popularity metrics.
Job search: "Entry-level data science positions in tech startups with good work-life balance" requires understanding of text, categorical data (industry, job level), and even subjective metrics.

Enter Superlinked

Imagine you are building a system that can handle a query like “recent news about crop yield”. After collecting your data, you define your schema, ingest the data, and build an index like this:

Schema definition


from superlinked import framework as sl

class News(sl.Schema):
    id: sl.IdField
    created_at: sl.Timestamp
    like_count: sl.Integer
    moderation_score: sl.Float
    content: sl.String

class User(sl.Schema):
    id: sl.IdField
    interest: sl.String

class Event(sl.EventSchema):
    id: sl.IdField
    news: sl.SchemaReference[News]
    user: sl.SchemaReference[User]
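    event_type: sl.String

# Instantiate the schemas; the snippets below refer to these objects.
news = News()
user = User()
event = Event()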

Encoder definition


recency_space = sl.RecencySpace(timestamp=news.created_at)
popularity_space = sl.NumberSpace(number=news.like_count, mode=sl.Mode.MAXIMUM)
trust_space = sl.NumberSpace(number=news.moderation_score, mode=sl.Mode.MAXIMUM)
semantic_space = sl.TextSimilaritySpace(
    text=[news.content, user.interest], model="sentence-transformers/all-mpnet-base-v2"
)

Index definition

index = sl.Index(
    spaces=[recency_space, popularity_space, trust_space, semantic_space],
    effects=[sl.Effect(semantic_space, event.user, 0.8 * event.news)],
)

You define your queries and parameterize them like this:

Query definition


query = (
    sl.Query(
        index,
        weights={
            recency_space: sl.Param("recency_weight"),
            popularity_space: sl.Param("popularity_weight"),
            trust_space: sl.Param("trust_weight"),
            semantic_space: sl.Param("semantic_weight"),
        },
    )
    .find(news)
    .similar(semantic_space.text, sl.Param("content_query"))
    .with_vector(user, sl.Param("user_id"))
    .select_all()
)

Debug in notebook, run as server


sl.RestExecutor(
    sources=[sl.RestSource(news), sl.RestSource(user)],
    index=[index],
    query=[query],
    vector_database=sl.InMemoryVectorDatabase(),
    # vector_database = sl.MongoDBVectorDatabase(...),
    # vector_database = sl.RedisVectorDatabase(...),
    # vector_database = sl.QdrantVectorDatabase(...),
)

# SparkExecutor()   <-- Coming soon in Superlinked Cloud


curl -X POST \
    'http://localhost:8000/api/v1/search/query' \
    --header 'Accept: */*' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "content_query": "crop yields",
        "semantic_weight": 0.5,
        "recency_weight": 0.9,
        "popularity_weight": 0.5,
        "trust_weight": 0.2
    }'
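
For callers that prefer Python over curl, the same request can be sketched with the standard library. This is a minimal illustration, not part of Superlinked: the helper name `build_request` is hypothetical, and the endpoint and parameter names are taken from the curl example above. Building the body with `json.dumps` guarantees the payload is valid JSON.

```python
import json
import urllib.request

# Endpoint copied from the curl example; adjust it to your own deployment.
SEARCH_URL = "http://localhost:8000/api/v1/search/query"


def build_request(content_query, weights):
    """Build a POST request carrying the query text and the per-space weights."""
    payload = {"content_query": content_query, **weights}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "crop yields",
    {
        "semantic_weight": 0.5,
        "recency_weight": 0.9,
        "popularity_weight": 0.5,
        "trust_weight": 0.2,
    },
)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Changing only the weight values in the payload changes the ranking, without touching the index or the query definition.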

Handle natural language queries

# In a notebook like this:

query = (
    sl.Query(...)
    .with_natural_query(sl.Param("recent news about crop yield"))
)

# As an API call like this:

curl -X POST \
    'http://localhost:8000/api/v1/search/query' \
    --header 'Accept: */*' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "natural_language_query": "recent news about crop yield"
    }'

But can't I put all my data in JSON, stringify it, and embed it using an LLM?

The stringify-and-embed approach produces unpredictable results. For example (code below):

Embed the integers 0 to 100 with the OpenAI API
Calculate and plot the cosine similarity
Observe the difference between expected and actual results


from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

response = OpenAI().embeddings.create(
    input=[str(i) for i in range(0, 101)],
    model="text-embedding-3-small",
)
embeddings = np.array([r.embedding for r in response.data])
scores = cosine_similarity(embeddings, embeddings)
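
To make the "expected" side of that comparison concrete, here is a minimal sketch of a number-aware encoding. It is an illustrative assumption, not Superlinked's actual implementation: each value in [0, 100] is mapped to a point on a quarter circle, so cosine similarity falls off smoothly as two numbers move apart. The names `encode_number` and `cosine` are hypothetical.

```python
import math


def encode_number(value, low=0.0, high=100.0):
    """Map a number in [low, high] to a point on a quarter circle (hypothetical encoding)."""
    angle = (value - low) / (high - low) * (math.pi / 2)
    return (math.cos(angle), math.sin(angle))


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = [encode_number(i) for i in range(0, 101)]

# Similarity of every number to 50: highest at 50, decreasing on both sides.
similarity_to_50 = [cosine(vectors[50], v) for v in vectors]
```

With this encoding the similarity to 50 is monotone in the distance from 50, which is the behaviour to compare the text-embedding scores against.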

Okay, but can't I ...

1. Use a separate, existing store for each attribute, fire multiple searches, and then reconcile the results?

It's better to store all your attribute vectors in the same vector store and perform a single search, weighting your attributes at query time.
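
One way to see why a single search is enough: if each attribute is encoded as its own unit-length segment and the segments are concatenated, then scaling the query's segments by weights makes one dot product equal to the weighted sum of the per-attribute similarities. The sketch below is a toy illustration under that assumption; the item names, vectors, and helper functions are all made up for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def concat(segments, weights=None):
    """Concatenate per-attribute segments, optionally scaling each by a weight."""
    weights = weights or [1.0] * len(segments)
    out = []
    for segment, weight in zip(segments, weights):
        out.extend(weight * x for x in segment)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy items: one [semantic, recency, popularity] segment list per item.
items = {
    "fresh_but_off_topic": [normalize([0.1, 0.9]), normalize([1.0, 0.05]), normalize([0.5, 0.5])],
    "old_but_on_topic": [normalize([0.9, 0.1]), normalize([0.1, 1.0]), normalize([0.5, 0.5])],
}
query_segments = [normalize([1.0, 0.0]), normalize([1.0, 0.0]), normalize([1.0, 1.0])]


def search(weights):
    """Run one search over the concatenated vectors; weights are applied at query time."""
    query_vector = concat(query_segments, weights)
    scores = {name: dot(query_vector, concat(segments)) for name, segments in items.items()}
    return max(scores, key=scores.get)


semantic_first = search([1.0, 0.1, 0.1])
recency_first = search([0.1, 1.0, 0.1])
```

The stored item vectors never change between the two calls; only the query weights do, which is what lets the ranking be tuned without re-indexing or reconciling separate result lists.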

2. Use metadata filters or candidate re-ranking?

When you convert a fuzzy preference like “recent”, “risky” or “popular” into a filter, you approximate a smooth, sigmoid-like preference curve with a binary step function, which does not give you enough resolution.
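
The loss of resolution can be shown with a small sketch. The 7-day cutoff, the steepness value, and both function names are assumptions chosen for illustration; they are not taken from Superlinked.

```python
import math


def recency_filter(age_days, cutoff_days=7.0):
    """Hard metadata filter: 1.0 inside the cutoff, 0.0 outside."""
    return 1.0 if age_days <= cutoff_days else 0.0


def recency_score(age_days, midpoint_days=7.0, steepness=0.5):
    """Smooth preference: a logistic curve that decays around the midpoint."""
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint_days)))


# Compare the two views of "recent" for items of different ages.
for age in [1, 6, 8, 30]:
    print(age, recency_filter(age), round(recency_score(age), 3))
```

The filter treats a 1-day-old and a 6-day-old item as identical, and does the same for an 8-day-old and a 30-day-old item, while the smooth score still separates each pair.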

Semantic ranking and ColBERT use only text, and learning-to-rank models need ML engineers. Broad queries, e.g. “popular pants”, can’t be handled by re-ranking at all, because the initial candidate set has poor recall.
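
The recall problem can be sketched with a toy catalog. All names and scores below are invented for illustration: the most popular item ranks fourth on text similarity, so a retrieve-then-re-rank pipeline with a candidate set of three never sees it.

```python
# Toy catalog: (name, semantic_similarity_to_query, popularity). Values are made up.
catalog = [
    ("pants_a", 0.95, 0.10),
    ("pants_b", 0.93, 0.20),
    ("pants_c", 0.91, 0.15),
    ("pants_d", 0.90, 0.95),  # most popular, but only 4th by text similarity
]


def retrieve_then_rerank(items, k):
    """Take the top-k items by text similarity, then re-rank them by popularity."""
    candidates = sorted(items, key=lambda item: item[1], reverse=True)[:k]
    return max(candidates, key=lambda item: item[2])[0]


def single_weighted_search(items, semantic_weight, popularity_weight):
    """Score every item on both signals at once and return the best one."""
    return max(
        items,
        key=lambda item: semantic_weight * item[1] + popularity_weight * item[2],
    )[0]


reranked_best = retrieve_then_rerank(catalog, k=3)
weighted_best = single_weighted_search(catalog, semantic_weight=0.5, popularity_weight=0.5)
```

Re-ranking can only reorder what the first stage returned, so `pants_d` is lost before popularity is ever considered; the single weighted search scores it alongside everything else.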