Stop using text-to-SQL for search. Here's why.
Multi-attribute vectors + natural language are changing how we find products online
This article is part of a two-article series on building a modern search app for Amazon e-commerce products, one that leverages tabular semantic search and natural language queries to transform the shopping experience.
The article makes sense on its own, but for more context, here is the first one:
Vector search (semantic search) is the new standard for information retrieval.
Instead of using fixed keywords for text or filters for numeric or categorical variables, it powers search algorithms with more nuance.
For example, when surfing the Internet for books, instead of searching for the fixed word “engineering” and using filters to look only for books with a rating larger than 4 and a price tag smaller than $70, it combines all these features into an embedding space, unlocking interactions between them and more complex relationships.
As seen in Figure 1, imagine a rectangle split equally between two colors: blue and orange. That’s how the classic approach looks when using static filters. When working purely with embeddings, we have a gradient instead of a binary split, opening the search space to more combinations and relationships within our data.
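To make the binary-split vs. gradient intuition concrete, here is a minimal sketch with a hypothetical book catalog (titles, ratings, and prices are invented for illustration). A hard filter drops near-misses entirely, while a continuous score ranks every item:

```python
# Hypothetical book catalog: (title, rating, price). Names are illustrative.
books = [
    ("Clean Code", 4.4, 35.0),
    ("Intro to Algorithms", 4.6, 89.0),
    ("Engineering 101", 3.9, 45.0),
    ("Systems Design", 4.8, 72.0),
]

# Classic approach: hard binary filters (rating > 4 AND price < 70).
hard_hits = [b for b in books if b[1] > 4 and b[2] < 70]

# Gradient-style approach: every item gets a continuous relevance score,
# so near-misses (e.g. a $72 book with a 4.8 rating) are ranked, not dropped.
def soft_score(rating, price, max_price=100.0):
    return 0.5 * (rating / 5.0) + 0.5 * (1.0 - price / max_price)

ranked = sorted(books, key=lambda b: soft_score(b[1], b[2]), reverse=True)
```

With the hard filters, only one book survives; the soft score still surfaces the highly rated $72 book instead of discarding it.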
Intuitively, we aim to leverage only semantic search instead of using SQL.
However, as the data is complex, using semantic search only on one feature at a time is insufficient. For example, querying only on the description of an Amazon product or its image will not capture the complexity of the product.
That’s why we have to leverage multi-attribute vector indexes, where we embed all the aspects of the product that we have to query, such as its description, reviews, category and price.
Another popular method of integrating tabular data with LLMs is text-to-SQL algorithms. However, these methods do not leverage the “goodies” that come with semantic search; instead, they integrate with existing SQL-based infrastructure.
How do the text-to-SQL and multi-attribute vector index methods overlap?
First, they solve the same problem: search on top of tabular data.
Secondly, both leverage LLMs to decode natural language queries into programmable queries to modernize search as follows:
text-to-SQL: decodes language to SQL queries
multi-attribute vector indexes: decodes language to values to embed and vector weights to manipulate the query
Since both methods can solve similar problems, it’s valuable to understand the pros and cons of each.
In this article, we will implement both methods using a tabular dataset of Amazon products.
Ok. That sounds great but complex for a single article…
How can we quickly implement this?
That’s where frameworks, such as Superlinked, kick in.
Together with battle-tested databases that have vector index support, such as MongoDB, you can quickly:
Build a multi-attribute retrieval system.
Deploy it.
Scale it.
Also, we will leverage LlamaIndex to implement text-to-SQL.
With that in mind, in this article, you will learn:
How multi-attribute vector indexes work.
How text-to-SQL works.
Implementing multi-attribute vector apps powered by natural language queries using Superlinked and MongoDB.
Implementing text-to-SQL using LlamaIndex.
Multi-attribute vector indexes vs. text-to-SQL on Amazon products.
In a previous article, we detailed how to build a tabular semantic search for Amazon products using Superlinked and MongoDB, from raw data to a RESTful API.
In this article, we will explore the topic of multi-attribute vector indexes in more detail.
🔗 Read the article below for more context:
Forget text-to-SQL: Use this natural query instead
Table of Contents:
Understanding how multi-attribute vector indexes work
Connecting multi-attribute vector indexes with natural language queries
Understanding how text-to-SQL works
Implementing multi-attribute natural language semantic search queries
Implementing text-to-SQL queries
Multi-attribute vector indexes + NLQ vs. text-to-SQL on Amazon products
1. Understanding how multi-attribute vector indexes work
First, we must understand how multi-attribute vector indexes work under the hood.
There are two main ways to build them. In both methods, we have to embed each object attribute separately.
The trick is to determine how the embeddings are stored and searched.
The naive approach
Store each attribute vector in separate vector stores (one per attribute), do an individual search for each attribute, combine the results and post-process the aggregated list as required.
The Superlinked approach
Concatenate all the attribute vectors into the same vector. Thus, a single vector index is created, which allows us to search just once for any query combination of attributes.
By weighting each attribute at query time, we can prioritize different attributes directly in the vector index search without further post-processing. This results in lower network bandwidth, infrastructure, and computing requirements, directly translating to lower latencies and out-of-the-box costs.
Another by-product of simplifying (or eliminating) the post-processing steps is less code to write. Thus, the app is easier to develop, debug and maintain.
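The concatenation idea can be sketched in a few lines of NumPy. This is not Superlinked’s internals, just a toy illustration with tiny invented 4-d attribute vectors: each attribute vector is normalized, concatenated into one index vector, and the query weights each segment so a single dot product per product replaces per-attribute searches and post-processing:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical per-attribute embeddings for 3 products (tiny 4-d vectors).
title_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0, 0.0]])
desc_vecs = np.array([[0.0, 0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0],
                      [0.0, 0.0, 0.8, 0.6]])

# Concatenate normalized attribute vectors into one index vector per product.
index_vecs = np.hstack([np.apply_along_axis(normalize, 1, title_vecs),
                        np.apply_along_axis(normalize, 1, desc_vecs)])

# At query time, weight each attribute segment instead of post-processing results.
def search(query_title, query_desc, w_title, w_desc):
    q = np.hstack([w_title * normalize(query_title), w_desc * normalize(query_desc)])
    scores = index_vecs @ q  # one dot product per product: a single search
    return np.argsort(-scores)

# Query closest to product 0's title; title weighted higher than description.
order = search(np.array([1.0, 0.05, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.2]), 0.8, 0.2)
```

Changing `w_title` and `w_desc` reprioritizes attributes without touching the stored index, which is exactly the property that removes the aggregation step of the naive approach.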
For more on the naive vs. Superlinked approach, here is a fantastic Dungeons & Dragons use case benchmarking the difference.
Why can’t we convert everything into text and embed it into a single vector index?
Your first thought might be, why can’t we create a prompt template for all our tabular data columns, convert them to text, and embed them using a single text embedding model?
Well… You can, but that doesn’t mean it will perform as expected.
For example, when embedding numbers, their relationship will be inconsistent.
Let’s assume we stringify and embed a list of numbers from 1 to 100 using OpenAI’s embedding model as follows:
import os

import numpy as np
import openai

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = client.embeddings.create(
    input=[str(i) for i in range(101)], model="text-embedding-ada-002"
)
embeddings = np.array([item.embedding for item in response.data])
Then, we compute the distance between 50 (as our reference number) and the whole [1 - 100] list:
from sklearn.metrics.pairwise import cosine_similarity

reference_idx = 50
cos_sim_scores = cosine_similarity(
    embeddings, embeddings[reference_idx : reference_idx + 1]
).flatten()
In Figure 3, we can visualize that, when using OpenAI’s embeddings with cosine similarity, the expected relationship between the numbers is wholly lost. For example, CosSim(40, 50) equals 0.93 while CosSim(46, 50) equals 0.89, meaning 40 is more similar to 50 than 46 is, which doesn’t make sense.
But, if we embed the numbers using embeddings specialized per type, as Superlinked proposes, the relationship is conserved.

To conclude, the strategy “stringify and embed” produces unpredictable results.
Quick insight: The plot was made in Python with Altair, which we prefer over matplotlib and seaborn for this kind of visualization. Here is our code.
Metadata filters vs. multi-attribute vector indexes
Ok, so we can’t stringify and embed all our structured data.
But why can’t we embed our unstructured data, such as product descriptions and images, and leverage the rest of the fields, such as product price and review, as metadata filters?
Well… Again… You can… However, as highlighted at the beginning of the article, you lose all the benefits of “semantic search.”
For instance, rather than restricting our search to products that pass a hard "review count >= 5" filter, we favor those that typically have more reviews. This means our search extends beyond the items that meet the threshold exactly, seeking out the most-reviewed products in the database that are otherwise relevant to our query, even if some have fewer than 5 reviews. In Superlinked terminology, we refer to this as a maximizer.
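One way such a maximizer can work is to embed numbers so that similarity to the maximum value grows monotonically with the number itself. The sketch below is our own minimal illustration of that idea (not necessarily Superlinked’s exact implementation): each number is mapped onto the unit quarter circle, and a MAXIMUM-mode query compares against the embedding of the range’s upper bound:

```python
import math

def embed_number(x, min_value=0.0, max_value=5.0):
    # Map x linearly onto [0, pi/2] and place it on the unit quarter circle.
    t = (x - min_value) / (max_value - min_value)
    angle = t * math.pi / 2
    return (math.cos(angle), math.sin(angle))

def cosine(u, v):
    return u[0] * v[0] + u[1] * v[1]  # unit vectors, so the dot product is the cosine

# A MAXIMUM-mode query compares against the embedding of max_value:
# similarity now grows monotonically with the rating, with no hard cutoff.
anchor = embed_number(5.0)
sims = [cosine(embed_number(r), anchor) for r in [1.0, 2.0, 3.0, 4.0, 4.9]]
```

Because similarity increases smoothly with the rating, a 4.9-rated product scores almost as high as a 5.0 one instead of being treated identically to a 4.1, which is the gradient behavior described above.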
Read more on the benefits of multi-attribute vector indexes with Superlinked.
2. Connecting multi-attribute vector indexes with natural language queries
The first article detailed how natural language queries (NLQs) work with multi-attribute vector indexes implemented in Superlinked.
But as a quick recap in the context of multi-attribute vector indexes, here is how it works:
You enter your natural language query (NLQ):
query = "books on psychology with a price lower than 100 and a rating bigger than 4"
An LLM maps the natural language query to Superlinked parameters:
Context to be embedded:
query_description = “psychology” (text)
query_price = 100 (int)
query_review_rating = 4 (int)
Query embedding weights:
price_minimizer_weights = 0.7
review_rating_maximizer_weight = 0.5
Filters:
filter_by_type = “book”
The context elements are embedded, and the multi-attribute vector query is created.
The query uses a vector index of choice (e.g., MongoDB Atlas, Qdrant) and the embedding weights and filters.
The results from the vector database are returned.
Optionally, the results are interpreted by an LLM and returned in a humanized answer.
As in the example above, we can always combine multi-attribute vector search with metadata filters.
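The mapping step above can be pictured as the LLM filling a small structured object. The sketch below hard-codes what the model would return for the example query; in production the LLM (e.g. via function calling) produces this mapping, and the field names here are our own illustrative choices mirroring the parameters listed above:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative "slots" the LLM fills from the natural language query.
@dataclass
class SuperlinkedParams:
    query_description: Optional[str] = None   # text to embed
    query_price: Optional[float] = None       # value to embed in the price space
    query_review_rating: Optional[float] = None
    price_minimizer_weight: float = 0.0
    review_rating_maximizer_weight: float = 0.0
    filter_by_type: Optional[str] = None      # hard metadata filter

# What the LLM would extract for the example query (hard-coded here):
nlq = "books on psychology with a price lower than 100 and a rating bigger than 4"
params = SuperlinkedParams(
    query_description="psychology",
    query_price=100,
    query_review_rating=4,
    price_minimizer_weight=0.7,
    review_rating_maximizer_weight=0.5,
    filter_by_type="book",
)
```

Note how much smaller this target is than a full SQL statement: the model only picks values and weights, while the query structure itself stays fixed.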
Now, let’s see how it compares to text-to-SQL.
3. Understanding how text-to-SQL works
Another popular method of integrating tabular data with LLMs is using text-to-SQL algorithms.
This article does not aim to explain text-to-SQL in-depth, but we will quickly review it to compare the two methods.
Text-to-SQL works as follows:
You enter your natural language query (NLQ):
query = "books with a price lower than 100 and a rating bigger than 4"
An LLM maps the natural language query to SQL:
SELECT * FROM books WHERE price < 100 AND rating > 4;
The SQL query is executed using a SQL database of choice (e.g., SQLite, PostgreSQL)
The results from the database are returned
Optionally, the results are interpreted by an LLM and returned in a humanized answer.
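The whole pipeline above can be sketched end-to-end with the standard library. The "LLM" here is a stub that returns a hard-coded translation for the example query (table name, columns, and rows are invented for illustration); a real system would call a model at that step:

```python
import sqlite3

# Stub "LLM": in a real system a model generates the SQL from the NLQ.
# Here we hard-code the translation for the example query.
def text_to_sql_stub(nlq: str) -> str:
    return "SELECT title, price, rating FROM books WHERE price < 100 AND rating > 4;"

# Toy database with a few invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL, rating REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("Thinking, Fast and Slow", 18.0, 4.6),
     ("Some Textbook", 150.0, 4.8),
     ("Mediocre Reads", 12.0, 3.2)],
)

# Steps 2-4: translate the NLQ, execute the SQL, return the rows.
sql = text_to_sql_stub("books with a price lower than 100 and a rating bigger than 4")
rows = conn.execute(sql).fetchall()
```

Note that only one book satisfies both conditions: the strict WHERE clause either matches or it doesn’t, which is precisely the rigidity discussed in the cons below.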
Pros and cons of text-to-SQL
On the pros side, there are obvious factors such as:
Leveraging existing infrastructure, tooling and code with natural language queries (NLQs).
Leveraging battle-tested solutions that have been improved in the past 50 years.
Making database interaction accessible to non-technical users who don’t know SQL.
Accelerating writing queries for custom schemas.
On the cons side, things get trickier. Here is what can go wrong when mapping NLQs to SQL:
Generating wrong SQL commands. A SQL command can be wrong syntactically (the code won’t run) or semantically (the logic is wrong).
LLMs often struggle with poorly documented schemas, thus making the translation from NLQs to SQL ambiguous.
Difficulties arise with complex queries such as nested joins or advanced SQL commands (e.g., window functions).
Adding a non-deterministic layer of complexity makes the system harder to monitor and debug.
Using LLMs adds extra costs and dependencies to your system.
In reality, text-to-SQL works best when:
Your queries are simple.
Your schemas are well-defined and documented.
It augments development, helping you sketch SQL queries faster.
You write advanced pre-processing and post-processing techniques that validate the user’s query and the generated SQL commands.
The beauty is that Superlinked’s multi-attribute vector indexes + NLQs check all the boxes from above.
Let’s see that with some concrete examples.
First, we will implement both methods. Then, we will compare them with some concrete examples.
4. Implementing multi-attribute natural language semantic search queries
In the first article, we showed how to implement an end-to-end Superlinked app, from data pre-processing to deploying it as a RESTful API.
But let’s have a quick refresher on how to define a multi-attribute query in Superlinked, as this is the main focus of this article:
First, we have to define our product schema, embedding spaces and vector index:
from superlinked import framework as sl


class ProductSchema(sl.Schema):
    id: sl.IdField
    type: sl.String
    title: sl.String
    description: sl.String
    review_rating: sl.Float
    price: sl.Float


product = ProductSchema()
# Vector embeddings / attribute
title_space = sl.TextSimilaritySpace(
    text=product.title, model="Alibaba-NLP/gte-large-en-v1.5"
)
description_space = sl.TextSimilaritySpace(
    text=product.description, model="Alibaba-NLP/gte-large-en-v1.5"
)
review_rating_maximizer_space = sl.NumberSpace(
    number=product.review_rating, min_value=-1.0, max_value=5.0, mode=sl.Mode.MAXIMUM
)
price_minimizer_space = sl.NumberSpace(
    number=product.price, min_value=0.0, max_value=1000, mode=sl.Mode.MINIMUM
)
# Vector index
product_index = sl.Index(
    spaces=[
        title_space,
        description_space,
        review_rating_maximizer_space,
        price_minimizer_space,
    ],
    fields=[product.type, product.review_rating, product.price],
)
We will find similar products based on their title and description. You can do something similar with images, but we won’t do it in this article.
As for the price and review rating, things get more interesting.
We defined a minimizer for the price, which means that instead of searching for similar prices, we are directly searching for prices lower than the given query. As for the rating maximizer, the opposite happens: we search for products with higher review ratings.
Now, let’s define the query for the `product_index`:
title_similar_param = sl.Param(
    "query_title",
    description=(
        "The text in the user's query that is used to search in the products' title."
        " Extract info that does not apply to other spaces or params."
    ),
)
text_similar_param = sl.Param(
    "query_description",
    description=(
        "The text in the user's query that is used to search in the products' description."
        " Extract info that does not apply to other spaces or params."
    ),
)
query = (
    sl.Query(
        index.product_index,
        weights={
            index.description_space: sl.Param("description_weight"),
            index.review_rating_maximizer_space: sl.Param(
                "review_rating_maximizer_weight"
            ),
            index.price_minimizer_space: sl.Param("price_minimizer_weights"),
        },
    )
    .find(index.product)
    .limit(sl.Param("limit"))
    .with_natural_query(sl.Param("natural_query"), openai_config)
    .filter(
        index.product.type
        == sl.Param(
            "filter_by_type",
            description="Used to only present items that have a specific type",
            options=constants.TYPES,
        )
    )
    .similar(
        index.title_space,
        title_similar_param,
        sl.Param("title_similar_clause_weight"),
    )
    .similar(
        index.description_space,
        text_similar_param,
        sl.Param("description_similar_clause_weight"),
    )
)
Here’s what is important to notice from the `query` object:
The `Param` objects act as placeholders that are filled at query time, either manually or by the NLQ layer.
Within `Query(weights={...})`, we define the weights for our multi-attribute vector index.
Using the `with_natural_query()` method, the `natural_query` variable and all the other `Param` variables will be filled out automatically.
Using the `similar` method, we define the fields in which we want to search for semantic similarity. The review and pricing spaces are maximizers/minimizers that automatically look for a combination of items with the highest review and lowest price.
On top of the `query`, we can add more tricks, such as finding items similar to both our description and the currently selected item:
similar_items_query = query.with_vector(index.product, sl.Param("product_id"))
Amazing, right?
To check the end-to-end implementation of the Superlinked app, check our first article from the series.
Why MongoDB Atlas as our vector DB
MongoDB is a battle-tested product, first released over 15 years ago. It’s a stable and robust database that has passed the test of time, being among the most widely used modern databases.
By nature, it is a NoSQL database that powers 52k+ applications, from small startups to enterprises.
In late 2023, MongoDB released a vector search feature on top of its NoSQL database. With both a NoSQL and a vector database integrated under the same technology, everything becomes simpler:
You can already leverage your existing MongoDB database.
Less infrastructure to maintain.
Data is automatically synchronized between the two.
Developers work with the same MongoDB query API and technology.
Enables powerful hybrid search capabilities, combining vector queries with their NoSQL features: documents or graphs.
Its distributed architecture scales vector search independently from the core database.
Everything translates to a smaller learning curve, less development time, infrastructure and overall maintenance.
Vector search is available on MongoDB Atlas, a fully managed platform that allows you to focus on your problem instead of infrastructure issues.
Along with that, MongoDB provides all the goodies that come with modern vector databases, such as:
Integration with LangChain, LlamaIndex, HayStack, AWS, etc.
Support for embeddings with up to 4096 dimensions
Metadata filtering
HNSW indexing
Quantization
Using MongoDB as your vector DB is a safe bet for building AI applications.
Read more on MongoDB Atlas Vector Search and its optimization features.
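For a feel of what a vector query looks like in MongoDB Atlas, here is a sketch of a `$vectorSearch` aggregation pipeline as you would pass it to pymongo. The index name, field paths, filter, and query vector are illustrative assumptions, not the app’s actual configuration:

```python
# A sketch of an Atlas Vector Search aggregation pipeline (pymongo-style dicts).
query_vector = [0.1] * 1536  # would come from your embedding model

pipeline = [
    {
        "$vectorSearch": {
            "index": "product_vector_index",  # hypothetical Atlas index name
            "path": "embedding",              # field holding the stored vector
            "queryVector": query_vector,
            "numCandidates": 100,             # candidates considered by the ANN index
            "limit": 5,
            "filter": {"type": "book"},       # metadata pre-filter -> hybrid search
        }
    },
    {"$project": {"title": 1, "price": 1, "score": {"$meta": "vectorSearchScore"}}},
]
# With a live cluster, you would run: db.products.aggregate(pipeline)
```

The `filter` field is where the hybrid-search benefit shows up: the same stage combines a semantic vector query with a classic document filter.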
Now, let’s quickly look at how we can implement text-to-SQL queries.
5. Implementing text-to-SQL queries
Let’s explore how to implement text-to-SQL with LlamaIndex and an SQLite database:
First, we need to set up our database structure. We'll create an SQLite database in memory and define a product table to store all our Amazon product information. The table includes columns for the product's ASIN (Amazon's unique identifier), type, title, description, review metrics, and price:
from sqlalchemy import (
    Column,
    Float,
    Integer,
    MetaData,
    String,
    Table,
    create_engine,
)

engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

product_table_name = "products"
product_table = Table(
    product_table_name,
    metadata_obj,
    Column("asin", String(255), primary_key=True),
    Column("type", String(255)),
    Column("title", String(255)),
    Column("description", String(1000)),
    Column("review_rating", Float),
    Column("review_count", Integer),
    Column("price", Float),
)
metadata_obj.create_all(engine)
Using LlamaIndex, we initialize the OpenAI LLM with a low temperature for consistent results and create an SQL database wrapper around our engine:
from llama_index.core import SQLDatabase
from llama_index.llms.openai import OpenAI
llm = OpenAI(temperature=0.1, model=settings.OPENAI_MODEL_ID)
sql_database = SQLDatabase(engine, include_tables=[product_table_name])
Now, we populate the SQLite database with Amazon products:
from sqlalchemy import insert

for _, row in df.iterrows():
    stmt = insert(product_table).values(**row.to_dict())
    with engine.begin() as connection:
        connection.execute(stmt)
Next comes the interesting part - we create a natural language SQL retriever that will translate English queries into SQL:
from llama_index.core.retrievers import NLSQLRetriever

nl_sql_retriever = NLSQLRetriever(
    sql_database, tables=[product_table_name], return_raw=False
)
Finally, we can test our system by asking questions in plain English. Here's an example query that finds the top 5 highest-rated books:
results = nl_sql_retriever.retrieve(
    "books with a price lower than 100 and a rating bigger than 4"
)
And there you have it! We've built a system that lets anyone query Amazon product data without knowing SQL.
Why SQLite as our SQL database?
SQLite is a powerful file-based SQL database that can also run in memory. It's light, fast, and quick to set up, requiring no separate server process or configuration. This makes it extremely useful for:
Proof of Concept projects
Early-stage applications
Applications with moderate user or storage requirements (up to ~100GB)
In our example, we use SQLite in-memory mode for simplicity and demonstration.
The Notebook with all the code is available on our GitHub.
Finally, let’s compare the two methods on multiple natural queries.
6. Multi-attribute vector indexes + NLQ vs. text-to-SQL on Amazon products
Using the code from sections 4 (on multi-attribute indexes + NLQ with Superlinked) and 5 (on text-to-SQL), we will look over multiple examples to see how they perform in the wild.
Example 1: Simple
The first example will be simple, without any ambiguities:
query = "books with a price lower than 100 and a rating bigger than 4"
The result from Superlinked:
The results from text-to-SQL:
Both results are acceptable. However, since the query is not very specific and we don’t have a ground truth (GT), it’s hard to tell which one is better.
Some key observations are:
The Superlinked results are not ordered, meaning the first result is not necessarily more relevant than the 3rd.
The text-to-SQL method tries to hit the maximum rating and the minimum price more aggressively.
Let’s add more complexity.
Example 2: Specific categories
We will search for products in a particular category:
query = "psychology and mindfulness with a rating bigger than 4"
The result from Superlinked:
The results from text-to-SQL:
The screenshot from `text-to-SQL` is not a mistake.
As we add more details about the Amazon product we are looking for, we move away from explicitly mapping natural language to SQL commands. Here, methods like text-to-SQL start to break.
In our case, it returns nothing.
Meanwhile, the Superlinked approach is more flexible as it leverages semantic search for all its query fields.
Remember that Superlinked uses NLQ just to map between the user query and its semantic query fields, which is a far simpler operation than generating a full SQL command.
You might say that’s a cherry-picked example, right?
Let’s look at another one.
query = "Return the top items (along with their price) with the highest reviews rating on science"
The result from Superlinked:
The results from text-to-SQL:
Unfortunately, text-to-SQL has difficulty picking up on any particularities of the item.
You can further engineer it and write complex SQL commands that perform look-ups in the title and description of the product, but that defeats the whole point of using natural queries, right?
Example 3: Titles or keywords
Another common use case is when looking for specific titles or keywords. Let’s search for products on Lord of the Rings:
query = "Lord of the Rings"
The result from Superlinked:
The results from text-to-SQL:
This time, text-to-SQL picked up only on an exact match, which makes sense, as it relies solely on classic exact or fuzzy string matching.
Meanwhile, the Superlinked method based on multi-attribute embeddings understood that the Lord of the Rings series is similar to The Hobbit and some products with lava, rocks, stones, etc. (Mordor, you know?)
What’s the final take on this?
We don’t want to demonize text-to-SQL. As we highlighted in section 3, it has some advantages:
Data ingestion is faster as it doesn’t require data embedding and vector indexing.
Higher precision, lower recall: Returns only strict matches, which is better for keyword matching.
The ability to dynamically select what fields to return due to the nature of SQL.
However, it also has many disadvantages:
Rigid
Fails on more ambiguous queries
Leverages only classical methods when searching keywords
Harder to map complex queries to SQL
On the other end of the spectrum, the disadvantages of multi-attribute methods with Superlinked are:
The technique is new: it hasn't passed the test of time.
The indexing time is much longer due to its vector index.
But, in our opinion, the advantages are worth it:
Flexibility
Picks up on ambiguous queries
It brings a native approach to semantic search on top of your tabular data
Easier to map complex queries to a few Superlinked parameters
All examples (+ many more) are available on our GitHub:
Conclusion
This was an exciting article, in which we learned how multi-attribute vector indexes work and how to implement them, together with natural language queries, using Superlinked and MongoDB’s vector database.
Next, we quickly understood how text-to-SQL works.
Ultimately, we used the Amazon Tabular Semantic Search app, built in the first article of the series, to compare the two methods and better understand the power of multi-attribute vector indexes on top of tabular datasets.
References
Literature
[1] Decoding ML, information-retrieval-tutorials: Tutorial on a tabular semantic search system for Amazon e-commerce products that enables natural language queries, GitHub. https://github.com/decodingml/information-retrieval-tutorials
[2] P. Iusztin, Natural Queries for E-commerce with Tabular Semantic Search, Decoding ML (Dec. 19, 2024). https://decodingml.substack.com/p/forget-text-to-sql-use-this-natural
[3] Multi-attribute search with vector embeddings, VectorHub by Superlinked. https://superlinked.com/vectorhub/articles/multi-attribute-semantic-search
[4] Improve Vector Search Performance, MongoDB Atlas documentation. https://www.mongodb.com/docs/atlas/atlas-vector-search/tune-vector-search
[5] Text-to-SQL Guide (Query Engine + Retriever), LlamaIndex documentation. https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo/
Images
If not otherwise stated, all images are created by the author.
Sponsors
Thank you to our sponsors for supporting our work!