Why Superlinked?


Last updated 4 months ago

Your users expect better search

The landscape of search and information retrieval is evolving rapidly. With the rise of AI and large language models, user expectations for search have skyrocketed. Your users now expect your search to handle complex, nuanced queries that go beyond simple keyword matching. Here is what Algolia's CTO had to say:

"We saw 2x more keywords search 6 months after the ChatGPT launch." Algolia CTO, 2023

Algolia has 17,000 customers accounting for 120B searches/month. This trend isn't isolated. Across industries, we're seeing a shift towards more sophisticated search queries that blend multiple concepts, contexts, and data types.

Vector search with text-only embeddings (and also with multi-modal embeddings) fails on complex queries, because complex queries are never just about text. They involve other kinds of data too!

Consider these examples:

  1. E-commerce: A query like "comfortable running shoes for marathon training under $150" involves text, numerical data (price), and categorical information (product type, use case).

  2. Content platforms: "Popular science fiction movies from the 80s with strong female leads" combines text analysis, temporal data, and popularity metrics.

  3. Job search: "Entry-level data science positions in tech startups with good work-life balance" requires understanding of text, categorical data (industry, job level), and even subjective metrics.

Enter Superlinked

Imagine you are building a system that can deal with a query like “recent news about crop yield”. After collecting your data, you define your schema, ingest the data, and build an index like this:

Schema definition


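from superlinked import framework as sl

# A news article with the attributes we want to search over.
class News(sl.Schema):
    id: sl.IdField
    created_at: sl.Timestamp
    like_count: sl.Integer
    moderation_score: sl.Float
    content: sl.String

# A user described by a free-text interest.
class User(sl.Schema):
    id: sl.IdField
    interest: sl.String

# An interaction event linking a user to a news article.
class Event(sl.EventSchema):
    id: sl.IdField
    news: sl.SchemaReference[News]
    user: sl.SchemaReference[User]
    event_type: sl.String

# Schema instances; the snippets below refer to these lowercase names.
news = News()
user = User()
event = Event()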

Encoder definition


recency_space = sl.RecencySpace(timestamp=news.created_at)
popularity_space = sl.NumberSpace(number=news.like_count, mode=sl.Mode.MAXIMUM)
trust_space = sl.NumberSpace(number=news.moderation_score, mode=sl.Mode.MAXIMUM)
semantic_space = sl.TextSimilaritySpace(
    text=[news.content, user.interest], model="sentence-transformers/all-mpnet-base-v2"
)

Define Indexes

index = sl.Index(
    spaces=[recency_space, popularity_space, trust_space, semantic_space],
    effects=[sl.Effect(semantic_space, event.user, 0.8 * event.news)],
)

You define your queries and parameterize them like this:

Query definition


query = (
    sl.Query(
        index,
        weights={
            recency_space: sl.Param("recency_weight"),
            popularity_space: sl.Param("popularity_weight"),
            trust_space: sl.Param("trust_weight"),
            semantic_space: sl.Param("semantic_weight"),
        },
    )
    .find(news)
    .similar(semantic_space.text, sl.Param("content_query"))
    .with_vector(user, sl.Param("user_id"))
    .select_all()
)

Debug in notebook, run as server


sl.RestExecutor(
    sources=[sl.RestSource(news), sl.RestSource(user)],
    index=[index],
    query=[query],
    vector_database=sl.InMemoryVectorDatabase(),
    # vector_database=sl.MongoDBVectorDatabase(...),
    # vector_database=sl.RedisVectorDatabase(...),
    # vector_database=sl.QdrantVectorDatabase(...),
)

# SparkExecutor()   <-- Coming soon in Superlinked Cloud

curl -X POST \
    'http://localhost:8000/api/v1/search/query' \
    --header 'Accept: */*' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "content_query": "crop yields",
        "semantic_weight": 0.5,
        "recency_weight": 0.9,
        "popularity_weight": 0.5,
        "trust_weight": 0.2
    }'
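
The same request can be prepared from Python. The following is a minimal sketch using only the standard library; the URL and parameter names are copied from the curl call above, while the helper name `build_query_request` is hypothetical and not part of Superlinked. It builds the request without sending it, so it runs without a server.

```python
import json
import urllib.request

# Endpoint copied from the curl example; adjust it to your own deployment.
URL = "http://localhost:8000/api/v1/search/query"


def build_query_request(content_query, weights, url=URL):
    """Build (but do not send) a POST request carrying the query parameters.

    Hypothetical helper for illustration only.
    """
    payload = {"content_query": content_query, **weights}
    # json.dumps always emits valid JSON (no trailing commas).
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "crop yields",
    {
        "semantic_weight": 0.5,
        "recency_weight": 0.9,
        "popularity_weight": 0.5,
        "trust_weight": 0.2,
    },
)
# To send it to a running server: urllib.request.urlopen(request)
```

Building the body with `json.dumps` rather than by hand avoids malformed payloads when the weights are changed programmatically.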

Handle natural language queries

# In a notebook like this:

query = (
    sl.Query(...)
    .with_natural_query(sl.Param("recent news about crop yield"))
)

# As an API call like this:

curl -X POST \
    'http://localhost:8000/api/v1/search/query' \
    --header 'Accept: */*' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "natural_language_query": "recent news about crop yield"
    }'

But can't I put all my data in JSON, stringify it, and embed it using an LLM?

The stringify-and-embed approach produces unpredictable results. For example (code below):

  • Embed 0..100 with OpenAI API

  • Calculate and plot the cosine similarity

  • Observe the difference between expected and actual results


from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

response = OpenAI().embeddings.create(
    input=[str(i) for i in range(0, 101)],
    model="text-embedding-3-small",
)
embeddings = np.array([r.embedding for r in response.data])
scores = cosine_similarity(embeddings, embeddings)
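
To make "expected" concrete: for numbers, one would expect similarity to an anchor value to fall steadily as the distance from that anchor grows. The sketch below, in pure Python with no API key needed, defines such a check and applies it to a toy number embedding that places 0..100 on a quarter circle. The function names are hypothetical, and the toy embedding only illustrates the desired property; it is not a description of how Superlinked's number embeddings are implemented.

```python
import math


def quarter_circle_embedding(value, min_value=0.0, max_value=100.0):
    """Toy embedding: map a number to a point on a quarter of the unit circle."""
    angle = (value - min_value) / (max_value - min_value) * (math.pi / 2)
    return (math.cos(angle), math.sin(angle))


def cos_sim(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_monotonic_around(row, anchor):
    """True if similarity never rises while moving away from the anchor index."""
    left_ok = all(row[i] <= row[i + 1] for i in range(anchor))
    right_ok = all(row[i] >= row[i + 1] for i in range(anchor, len(row) - 1))
    return left_ok and right_ok


toy_embeddings = [quarter_circle_embedding(n) for n in range(0, 101)]
anchor = 50
toy_row = [cos_sim(toy_embeddings[anchor], e) for e in toy_embeddings]
toy_is_monotonic = is_monotonic_around(toy_row, anchor)
```

The same check can be run on one row of the `scores` matrix computed above, for example `is_monotonic_around(list(scores[50]), 50)`, to see whether the LLM embeddings preserve numeric order around 50.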

Okay, but can't I ...

1. Use separate, already existing stores per attribute, fire multiple searches, and then reconcile the results?

It's better to store all your attribute vectors in the same vector store and perform a single search, weighting your attributes at query time.
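
The idea of a single search with query-time weights can be sketched as follows. This is a toy illustration in pure Python, assuming each attribute is already encoded as a small vector; the helper names and the data are made up for the example, and this is not Superlinked's actual implementation. Each item stores the concatenation of its normalized attribute vectors, the weights are applied to the query vector only, and one dot product then equals the weighted sum of the per-attribute cosine similarities.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_item_vector(parts):
    """Concatenate the normalized attribute vectors of one item."""
    return [x for part in parts for x in normalize(part)]


def build_query_vector(parts, weights):
    """Concatenate normalized query parts, each scaled by its query-time weight."""
    return [w * x for part, w in zip(parts, weights) for x in normalize(part)]


def search(items, query_vector, limit=3):
    """Rank items by a single dot product against the weighted query vector."""
    scored = [
        (sum(a * b for a, b in zip(vector, query_vector)), item_id)
        for item_id, vector in items.items()
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:limit]]


# Toy data: attribute 1 is a "topic" vector, attribute 2 a "recency" vector.
items = {
    "old_relevant": build_item_vector([[1.0, 0.0], [1.0, 0.0]]),
    "new_relevant": build_item_vector([[0.8, 0.6], [0.6, 0.8]]),
    "new_offtopic": build_item_vector([[0.0, 1.0], [0.0, 1.0]]),
}
query_parts = [[1.0, 0.0], [0.0, 1.0]]  # on-topic and as recent as possible

topic_first = search(items, build_query_vector(query_parts, weights=[1.0, 0.1]))
recency_first = search(items, build_query_vector(query_parts, weights=[0.1, 1.0]))
```

Because the weights live only in the query vector, the stored item vectors never need to be re-indexed when the balance between attributes changes.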

2. Use metadata filters or candidate re-ranking?

When you convert a fuzzy preference like “recent”, “risky” or “popular” into a filter, you approximate a smooth, sigmoid-like preference with a binary step function, which does not provide enough resolution.
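
The loss of resolution can be seen in a small sketch. Both functions below are hypothetical illustrations of "recent", with an assumed 7-day cutoff and an assumed steepness; neither is Superlinked's recency implementation.

```python
import math


def step_filter(age_days, cutoff_days=7.0):
    """Hard metadata filter: an item is either 'recent' (1.0) or not (0.0)."""
    return 1.0 if age_days <= cutoff_days else 0.0


def smooth_recency(age_days, midpoint_days=7.0, steepness=0.5):
    """Sigmoid-shaped preference: the score decays gradually with age."""
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint_days)))


for age in (1.0, 6.9, 7.1, 14.0):
    print(age, step_filter(age), round(smooth_recency(age), 3))
```

With the step function, a 1-day-old and a 6.9-day-old item are indistinguishable, while items aged 6.9 and 7.1 days land on opposite sides of the filter. The smooth score ranks the 1-day-old item clearly higher and treats the two items near the cutoff almost identically.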

Semantic ranking and ColBERT only use text, and learning-to-rank models need ML engineers. Broad queries, e.g. “popular pants”, can’t be handled by re-ranking at all because of poor candidate recall.

Okay, seems like Superlinked proposes a nice approach, but

How does it fit in the big picture?

This is where Superlinked comes in, offering a powerful, flexible framework designed to handle the complexities of modern search and information retrieval. Superlinked is a vector embedding solution for AI teams working with complicated data within their RAG, Search, Recommendations, and Analytics stack.

Let's quickly go through an example. Keep in mind that a lot of new concepts are introduced at once; this is just to illustrate how Superlinked 'looks'. We'll go over each concept in detail in the following sections.

Discover the powerful capabilities Superlinked offers here.

Our naive approach (above), storing and searching attribute vectors separately and then combining the results, is limited in ability, subtlety, and efficiency when we need to retrieve objects with multiple simultaneous attributes. Moreover, multiple kNN searches take more time than a single search with concatenated vectors.

Read more here: Multi-attribute search with vector embeddings

How can I build with it at scale?

The Superlinked Server is a deployable component available as a Python package on PyPI, designed to enhance the operation of Superlinked by providing a RESTful API for communicating with your application. This package streamlines the integration of Superlinked's search functionality into existing applications by offering REST endpoints and vector database connectivity. It lets developers focus on using Superlinked's capabilities without the burden of infrastructure management, from initial prototype to full-scale production.

Figure (stringify-and-embed experiment): OpenAI embeddings result in noisy, non-monotonic cosine similarity scores. For example, CosSim(25, 50) equals 0.69 while CosSim(32, 50) equals 0.42, meaning that 25 is scored as more similar to 50 than 32 is, which doesn't make sense. Superlinked number embeddings avoid such inconsistencies by design.