Lately, I've been digging into vector embeddings, since they're such an important part of the RAG (Retrieval Augmented Generation) pattern that we use in our most popular AI samples. I think that when many developers hear "vector embeddings" these days, they immediately think of dedicated vector databases such as Pinecone, Qdrant, or Chroma.
As it turns out, you can use vectors in many existing databases as well, such as the very popular and open-source PostgreSQL database. You just need to install the open-source pgvector extension, and boom, you can store vector-type columns, use four different distance operators to compare vectors, and use two different index types to efficiently perform searches on large tables.
For this year's PosetteConf, I put together a talk called "pgvector for Python developers" to explain what vectors are, why they matter, how to use them with pgvector, and how to use pgvector from Python for similarity search.
If you're a Django developer, then you may also be interested in this talk on "Semantic search with Django and pgvector" from Paolo Melchiorre, which you can watch on YouTube or below:
Today I went on a quest to figure out the best way to use SLMs (small language models) like Phi-3 in a GitHub Codespace,
so that I can provide a browser-only way for anyone to start working with language models. My main target audience is teachers and students, who may not have GPUs or the budget to pay for large language models, but it's always good to make new technologies accessible to as many people as possible.
❌ Don't use transformers
For my first approach, I tried to use the HuggingFace transformers package, using code similar to their text generation tutorial, but modified for my non-GPU environment:
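(What follows is a sketch in the spirit of that code, not my exact script; the model name, prompt, and generation arguments are illustrative.)

from transformers import pipeline

# Load a small instruct model; just downloading the weights takes a while in a Codespace.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device="cpu",  # no GPU available in the Codespace
)

result = generator(
    "Write a haiku about a hungry cat",
    max_new_tokens=100,
    return_full_text=False,  # only return the newly generated text
)
print(result[0]["generated_text"])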
Unfortunately, that took a very long time. Enough time for me to go for a walk, chase the garbage truck around the neighborhood, make breakfast, and drop my kid at school: 133 minutes total!
So, yes, you can use transformers, technically. But without a very powerful CPU or, better, a GPU, the performance is just too slow. Let's move on...
✅ Do use Ollama!
For my next approach, I tried setting up a dev container with Ollama built-in, by adding the ollama feature to my devcontainer.json:
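(The snippet below is a rough reconstruction rather than a copy of the actual file; in particular, the base image and the Ollama feature reference are from memory, so double-check them against the repository.)

{
    "name": "ollama-python-playground",
    "image": "mcr.microsoft.com/devcontainers/python:3.11",
    "features": {
        // Community-maintained Ollama feature; verify the exact feature path and version.
        "ghcr.io/prulloac/devcontainer-features/ollama:1": {}
    }
}

Once the container builds, the ollama CLI is available inside the Codespace, so you can pull and chat with a small model using a command like "ollama run phi3:mini".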
That repository includes the Ollama feature, OpenAI SDK, a notebook with demonstrations of few-shot and RAG, and a script for an interactive chat.
I hope it can be a helpful resource for teachers and students who want a quick and easy way to get started with small language models.
I also hope to add the Ollama feature to other repositories where it can be helpful, like the Phi-3 cookbook.
RAG (Retrieval Augmented Generation) is one of the most promising uses for large language models. Instead of asking an LLM a question and hoping the answer lies somewhere in its weights, we instead first query a knowledge base for anything relevant to the question, and then feed both those results and the original question to the LLM.
We have many RAG solutions out there for asking questions on unstructured documents, like PDFs and Word documents. Our most popular Azure solution for this scenario includes a data ingestion process to extract the text from the documents, chunk them up into appropriate sizes, and store them in an Azure AI Search index. When your RAG is on unstructured documents, you'll always need a data ingestion step to store them in an LLM-compatible format.
But what if you just want users to ask questions about structured data, like a table in a database? Imagine customers that want to ask questions about the products in a store's inventory, and each product is a row in the table. We can use the RAG approach there, too, and in some ways, it's a simpler process.
To get you started with this flavor of RAG, we've created a new RAG-on-PostgreSQL solution that includes a FastAPI backend, React frontend, and infrastructure-as-code for deploying it all to Azure Container Apps with Azure PostgreSQL Flexible Server. Here it is with the sample seed data:
We use the user's question to query a single PostgreSQL table and send the matching rows to the LLM. We display the answer plus information about any of the referenced products from the answer. Now let's break down how that solution works.
Data preparation
When we eventually query the database table with the user's query, we ideally want to perform a hybrid search: both a full text search and a vector search of any columns that might match the user's intent. In order to perform a vector search, we also need a column that stores a vector embedding of the target columns.
This is what the sample table looks like, described using SQLAlchemy 2.0 model classes. The final embedding column is a Vector type, from the pgvector extension for PostgreSQL:
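(Here's a simplified sketch rather than the sample's exact schema; the column names are illustrative.)

from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Item(Base):
    __tablename__ = "items"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column()
    description: Mapped[str] = mapped_column()
    price: Mapped[float] = mapped_column()
    # Vector column type comes from the pgvector extension
    embedding: Mapped[list[float]] = mapped_column(Vector(1536))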
The embedding column has 1536 dimensions to match OpenAI's text-embedding-ada-002 model, but you could configure it to match the dimensions of a different embedding model instead. The most important thing is to know exactly which model you used for generating embeddings, so that we can later search with that same model.
To compute the value of the embedding column, we concatenate the text columns from the table row, send them to the OpenAI embedding model, and store the result:
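(A sketch of that step with the openai Python package; the client setup, the helper name, and the concatenation format are illustrative.)

import openai
from sqlalchemy import select

openai_client = openai.OpenAI()  # or AzureOpenAI, depending on where your model is hosted

def compute_embedding(item: Item) -> list[float]:
    # Concatenate the text columns that the embedding should represent.
    text = f"Name: {item.name} Description: {item.description}"
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding

# Assumes an existing SQLAlchemy session bound to the PostgreSQL database.
for item in session.scalars(select(Item)):
    item.embedding = compute_embedding(item)
session.commit()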
We only need to run that once, if our data is static. However, if any of the included columns change, we should re-run that for the changed rows. Another approach is to use the Azure AI extension for Azure PostgreSQL Flexible Server. I didn't use it in my solution since I also wanted it to run with a local PostgreSQL server, but it should work great if you're always using the Azure-hosted PostgreSQL Flexible Server.
Hybrid search in PostgreSQL
Now our database table has both text columns and a vector column, so we should be able to perform a hybrid search: using the pgvector distance operator on the embedding column, using the built-in full-text search functions on the text columns, and merging them using the Reciprocal-Rank Fusion algorithm.
vector_query = f"""
    SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
    FROM items
    ORDER BY embedding <=> :embedding
    LIMIT 20
"""

fulltext_query = f"""
    SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC)
    FROM items, plainto_tsquery('english', :query) query
    WHERE to_tsvector('english', description) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC
    LIMIT 20
"""

hybrid_query = f"""
WITH vector_search AS (
    {vector_query}
),
fulltext_search AS (
    {fulltext_query}
)
SELECT
    COALESCE(vector_search.id, fulltext_search.id) AS id,
    COALESCE(1.0 / (:k + vector_search.rank), 0.0) +
    COALESCE(1.0 / (:k + fulltext_search.rank), 0.0) AS score
FROM vector_search
FULL OUTER JOIN fulltext_search ON vector_search.id = fulltext_search.id
ORDER BY score DESC
LIMIT 20
"""
from sqlalchemy import text

sql = text(hybrid_query)  # wrap the query string so the named parameters can be bound
results = session.execute(
    sql,
    {"embedding": to_db(query_vector), "query": query_text, "k": 60},
).fetchall()
That hybrid search is missing the final step that we always recommend for Azure AI Search: semantic ranker, a re-ranking model that sorts the results according to the original user query. It should be possible to add a re-ranking model, as shown in another pgvector-python example, but such an addition requires load testing and possibly an architectural change, since re-ranking models are CPU-intensive. Ideally, the re-ranking model would be deployed on dedicated infrastructure optimized for running models, not on the same server as our app backend.
We get fairly good results from that hybrid search query, however! It easily finds rows that match both the exact keywords in a query and semantically similar phrases, as demonstrated by these user questions:
Function calling for SQL filtering
The next step is to handle user queries like, "climbing gear cheaper than $100." Our hybrid search query can definitely find "climbing gear", but it's not designed to find products whose price is lower than some amount. The hybrid search isn't querying the price column at all, and isn't appropriate for a numeric comparison query anyway. Ideally, we would do both a hybrid search and add a filter clause, like WHERE price < 100.
Fortunately, we can use an LLM to suggest filter clauses based on user queries, and the OpenAI GPT models are very good at it. We add a query-rewriting phase to our RAG flow which uses OpenAI function calling to come up with the optimal search query and column filters.
In order to use OpenAI function calling, we need to describe the function and its parameters. Here's what that looks like for a search query and single column's filter clause:
{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search PostgreSQL database for relevant products based on user query",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "Query string to use for full text search, e.g. 'red shoes'"
                },
                "price_filter": {
                    "type": "object",
                    "description": "Filter search results based on price of the product",
                    "properties": {
                        "comparison_operator": {
                            "type": "string",
                            "description": "Operator to compare the column value, either '>', '<', '>=', '<=', '='"
                        },
                        "value": {
                            "type": "number",
                            "description": "Value to compare against, e.g. 30"
                        }
                    }
                }
            }
        }
    }
}
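Not shown above: the tool definition gets passed to the chat completions call, and the suggested arguments come back as a JSON string. Here's a minimal sketch of that round trip (the variable and model names are illustrative):

import json

response = openai_client.chat.completions.create(
    model="gpt-35-turbo",  # illustrative model/deployment name
    messages=[
        {"role": "system", "content": "Rewrite the user question into a search query and filters."},
        {"role": "user", "content": "climbing gear cheaper than $100"},
    ],
    tools=[search_database_tool],  # the JSON definition shown above
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
search_query = arguments.get("search_query")
price_filter = arguments.get("price_filter")  # e.g. {"comparison_operator": "<", "value": 100}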
We can easily add additional parameters for other column filters, or we could even have a generic column filter parameter and have OpenAI suggest the column based on the table schema. For my solution, I am intentionally constraining the LLM to only suggest a subset of possible filters, to minimize risk of SQL injection or poor SQL performance. There are many libraries out there that do full text-to-SQL, and that's another approach you could try out, if you're comfortable with the security of those approaches.
When we get back the results from the function call, we use them to build a filter clause and append that to our original hybrid search query. We want to do the filtering before the vector and full text search, to narrow down the search space to only what could possibly match. Here's what the new vector search looks like, with the additional filter clause:
vector_query = f"""
    SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
    FROM items
    {filter_clause}
    ORDER BY embedding <=> :embedding
    LIMIT 20
"""
With the query rewriting and filter building in place, our RAG app can now answer questions that depend on filters:
RAG on unstructured vs structured data
Trying to decide what RAG approach to use, or which of our solutions to use for a prototype? If your target data is largely unstructured documents, then you should try out our Azure AI Search RAG starter solution, which will take care of the complex data ingestion phase for you. However, if your target data is an existing database table, and you want to RAG over a single table (or a small number of tables), then try out the PostgreSQL RAG starter solution and modify it to work with your table schema. If your target data is a database with a multitude of tables with different schemas, then you probably want to research full text-to-SQL solutions. Also check out the llamaindex and langchain libraries, as they often have functionality and samples for common RAG scenarios.
I'm concerned by the number of times I've heard, "oh, we can do RAG with retriever X, here's the vector search query." Yes, your retriever for a RAG flow should definitely support vector search, since that will let you find documents with similar semantics to a user's query, but vector search is not enough. Your retriever should support a full hybrid search, meaning that it can perform both a vector search and a full text search, then merge and re-rank the results. That will allow your RAG flow to find both semantically similar concepts and exact matches like proper names, IDs, and numbers.
Hybrid search steps
Azure AI Search offers a full hybrid search with all those components:
It performs a vector search using a distance metric (typically cosine or dot product).
It performs a full-text search using the BM25 scoring algorithm.
It re-ranks the results using semantic ranker, a machine learning model used by Bing, which compares each result to the original user query and assigns a score from 0-4.
The search team even researched all the options against a standard dataset, and wrote a blog post comparing the retrieval results for full text search only, vector search only, hybrid search only, and hybrid plus ranker. Unsurprisingly, they found that the best results came from using the full stack, and that's why it's the default configuration we use in the AI Search RAG starter app.
When is hybrid search needed?
To demonstrate the importance of going beyond vector search, I'll show some queries based off the sample documents in the AI Search RAG starter app. Those documents are from a fictional company and discuss internal policies like healthcare and benefits.
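The query snippets below assume a search client and an embedding helper along these lines (the endpoint, key, index name, and field names here are placeholders):

import os
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="documents-index",  # placeholder index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

openai_client = openai.OpenAI()

def get_embedding(text: str) -> list[float]:
    # Use the same embedding model that was used to embed the indexed documents.
    response = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding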
Let's start by searching "what plan costs $45.00?" with a pure vector search using an AI Search index:
search_query = "what plan costs $45.00"
search_vector = get_embedding(search_query)
r = search_client.search(None, top=3, vector_queries=[
    VectorizedQuery(search_vector, k_nearest_neighbors=50, fields="embedding")])
The results for that query contain numbers and costs, like the string "The copayment for primary care visits is typically around $20, while specialist visits have a copayment of around $50.", but none of the results contain the exact cost of $45.00 that the user was looking for.
Now let's try that query with a pure full-text search:
r = search_client.search(search_query, top=3)
The top result for that query contains a table of costs for the health insurance plans, with a row containing $45.00.
Of course, we don't want to be limited to full text queries, since many user queries would be better answered by vector search, so let's try this query with hybrid:
r = search_client.search(search_query, top=15, vector_queries=[
    VectorizedQuery(search_vector, k_nearest_neighbors=10, fields="embedding")])
Once again, the top result is the table with the costs and exact string of $45.00. When the user asks that question in the context of the full RAG app, they get the answer they were hoping for:
You might think, well, how many users are searching for exact strings? Consider how often you search your email for a particular person's name, or how often you search the web for a particular programming function name. Users will make queries that are better answered by full-text search, and that's why we need hybrid search solutions.
Here's one more reason why vector search alone isn't enough: assuming you're using generic embedding models like the OpenAI models, those models are generally not a perfect fit for your domain. Their understanding of certain terms isn't going to be the same as that of a model trained entirely on your domain's data. Using hybrid search helps to compensate for the differences in the embedding domain.
When is re-ranking needed?
Now that you're hopefully convinced about hybrid search, let's talk about the final step: re-ranking results according to the original user query.
Now we'll search the same documents for "learning about underwater activities" with a hybrid search:
search_query = "learning about underwater activities"
search_vector = get_embedding(search_query)
r = search_client.search(search_query, top=5, vector_queries=[
    VectorizedQuery(search_vector, k_nearest_neighbors=10, fields="embedding")])
The third result for that query is the most relevant one: a benefits document that mentions surfing lessons and scuba diving lessons. Notably, the phrase "underwater" doesn't appear in any of the documents, so those results are coming from the vector search component.
What happens if we add in the semantic ranker?
search_query = "learning about underwater activities"
search_vector = get_embedding(search_query)
r = search_client.search(search_query, top=5, vector_queries=[
        VectorizedQuery(search_vector, k_nearest_neighbors=50, fields="embedding")],
    query_type="semantic", semantic_configuration_name="default")
Now the very top result for the query is the document chunk about surfing and scuba diving lessons, since the semantic ranker realized that was the most pertinent result for the user query. When the user asks a question like that in the RAG flow, they get a correct answer with the expected citation:
Our search yielded the right result in both cases, so why should we bother with the ranker? For RAG applications, which send search results to an LLM like GPT-3.5, we typically limit the number of results to a fairly low number, like 3 or 5 results. That's due to research that shows that LLMs tend to get "lost in the middle" when too much context is thrown at them. We want those top N results to be the most relevant results, and to not contain any irrelevant results. By using the re-ranker, our top results are more likely to contain the closest matching content for the query.
Plus, there's a big additional benefit: each of the results now has a re-ranker score from 0-4, which makes it easy for us to filter out results with re-ranker scores below some threshold (like < 1.5). Remember that any search algorithm that includes vector search will always find results, even if those results aren't very close to the original query at all, since vector search just looks for the closest vectors in the entire vector space. So when your search involves vector search, you ideally want a re-ranking step and a scoring approach that will make it easier for you to discard results that just aren't relevant enough on an absolute scale.
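For example, with the Azure AI Search Python SDK, that kind of thresholding can be a one-liner over the results (a sketch; the threshold of 1.5 is just the example from above, and the score key name may vary by SDK version):

MIN_RERANKER_SCORE = 1.5  # example threshold

filtered_results = [
    doc for doc in r
    if (doc.get("@search.reranker_score") or 0) >= MIN_RERANKER_SCORE
]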
Implementing hybrid search
As you can see from my examples, Azure AI Search can do everything we need for a RAG retrieval solution (and even more than we've covered here, like filters and custom scoring algorithms). However, you might be reading this because you're interested in using a different retriever for your RAG solution, such as a database. You should be able to implement hybrid search on top of most databases, provided they have some capability for text search and vector search.
As an example, consider the PostgreSQL database. It already has built-in full text search, and there's a popular extension called pgvector for bringing in vector indexes and distance operators. The next step is to combine them together in a hybrid search, which is demonstrated in this example from the pgvector-python repository:
WITH semantic_search AS (
    SELECT id, RANK () OVER (ORDER BY embedding <=> %(embedding)s) AS rank
    FROM documents
    ORDER BY embedding <=> %(embedding)s
    LIMIT 20
),
keyword_search AS (
    SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC)
    FROM documents, plainto_tsquery('english', %(query)s) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 20
)
SELECT
    COALESCE(semantic_search.id, keyword_search.id) AS id,
    COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0) +
    COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC
LIMIT 5
That SQL performs a hybrid search by running a vector search and text search and combining them together with RRF. Another example from that repo shows how we could bring in a cross-encoding model for a final re-ranking step:
from sentence_transformers import CrossEncoder

encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = encoder.predict([(query, item[1]) for item in results])
results = [v for _, v in sorted(zip(scores, results), reverse=True)]
That code would run the cross-encoding model in the same process as the rest of the PostgreSQL query, so it could work well in a local or test environment, but it wouldn't necessarily scale well in a production environment. Ideally, a call to a cross-encoder would be made in a separate service that had access to a GPU and dedicated resources.
I have implemented the first three steps of hybrid search in a RAG-on-PostgreSQL starter app. Since I don't yet have a good way to productionize a call to a cross-encoding model, I have not brought in the final re-ranking step.
After seeing what it takes to replicate full hybrid search options on other databases, I am even more appreciative of the work done by the Azure AI Search team. If you've decided that, never mind, you'll go with Azure AI Search, check out the AI Search RAG starter app. You might also check out open source packages, such as llamaindex, which has at least partial hybrid search support for a number of databases. If you've used or implemented hybrid search on a different database, please share your experience in the comments.
When in doubt, evaluate
When choosing our retriever and retriever options for RAG applications, we need to evaluate answer quality. I stepped through a few example queries above, but for a user-facing app, we really need to do bulk evaluations of a large quantity of questions (~200) to see the effect of an option on answer quality. To make it easier to run bulk evaluations, I've created the ai-rag-chat-evaluator repository, that can run both GPT-based metrics and code-based metrics against RAG chat apps.
Here are the results from evaluations against a synthetically generated data set for a RAG app based on all my personal blog posts:
| search mode        | groundedness | relevance | answer_length | citation_match |
| ------------------ | ------------ | --------- | ------------- | -------------- |
| vector only        | 2.79         | 1.81      | 366.73        | 0.02           |
| text only          | 4.87         | 4.74      | 662.34        | 0.89           |
| hybrid             | 3.26         | 2.15      | 365.66        | 0.11           |
| hybrid with ranker | 4.89         | 4.78      | 670.89        | 0.92           |
Despite being the author of this blog post, I was shocked to see how poorly vector search did on its own, with an average groundedness of 2.79 (out of 5) and only 2% of the answers with citations matching the ground truth citations. Full-text search on its own did fairly well, with an average groundedness of 4.87 and a citation match rate of 89%. Hybrid search without the semantic ranker improved upon vector search, with an average groundedness of 3.26 and a citation match rate of 11%, but it did much better with the semantic ranker, with an average groundedness of 4.89 and a citation match rate of 92%. As we would expect, those are the highest numbers across all the options.
But why do we see vector search and ranker-less hybrid search scoring so remarkably low? Besides what I've talked about above, I think it's also due to:
The full-text search option in Azure AI Search is really good. It uses BM25 and is fairly battle-tested, having been around for many years before vector search became so popular. The BM25 algorithm is based off TF-IDF and produces something like sparse vectors itself, so it's more advanced than a simple substring search. AI Search also uses standard NLP tricks like stemming and spell check. Many databases have full text search capabilities, but they won't all be as full-featured as the Azure AI Search full-text search.
My ground truth data set is biased towards compatibility with full-text search. I generated the sample questions and answers by feeding my blog posts to GPT-4 and asking it to come up with good Q&A based off the text, so I think it's very likely that GPT-4 chose to use similar wording as my posts. An actual question-asker might use very different wording - heck, they might even ask in a different language like Spanish or Chinese! That's where vector search could really shine, and where full-text search wouldn't do so well. It's a good reminder of why we need to continue updating evaluation data sets based off what our RAG chat users ask in the real world.
So in conclusion, if we are going to go down the path of using vector search, it is absolutely imperative that we employ a full hybrid search with all four steps and that we evaluate our results to ensure we're using the best retrieval options for the job.
When I build chat applications using the OpenAI chat completions API, I often want to send a user's previous messages to the model so that the model has more context for a user's question. However, OpenAI models have limited context windows, ranging between 4K and 128K tokens depending on the model. If we send more tokens than the model allows, the API will respond with an error.
We need a way to make sure to only send as many tokens as a model can handle. You might consider several approaches:
Send the last N messages (where N is some small number like 3). Don't do this! That is very likely to end up in an error. A particular message might be very long, or might be written in a language with a higher token:word ratio, or might contain symbols that require surprisingly high token counts. Similarly, don't rely on character count as a reliable indicator of token count; it will fail with any message that isn't just common English words.
Use a separate OpenAI call to summarize the conversation, and send the summary. This approach can work, especially if you specify the maximum tokens for the Chat Completion call and verify the number of tokens used in the response. It does have the drawback of requiring an additional OpenAI call, which can significantly affect user-perceived latency.
Send the last N messages that fit inside the remaining token count. This approach requires the use of the tiktoken library for calculating token usage for each possible message that you might send. That does take time, but is faster than an additional LLM call. This is what we use in azure-search-openai-demo and rag-postgres-openai-python, and what I'll explain in this post.
Overall algorithm for conversation history truncation
Here is the approach we take to squeezing in as much conversation history as possible, assuming a function that takes an input of model, system_prompt, few_shots, past_messages, and new_user_message. The function defaults to the maximum token window for the given model, but can also be customized with a different max_tokens.
Start with the system prompt and few shot examples. We always want to send those.
Add the new user message, the one that ultimately requires an answer. Compute the token count of the current set of messages.
Starting from the most recent of the past messages, compute the token count of the message. If adding that token count wouldn't go over the max token count, then add the message. Otherwise, stop.
And that's it! You can see it implemented in code in my build_messages function.
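For illustration, a stripped-down version of that algorithm might look like this (a hypothetical sketch, not the actual build_messages code; it omits few-shot examples, and the per-message token overhead is an approximation that varies by model):

import tiktoken

def truncate_history(model, system_prompt, past_messages, new_user_message, max_tokens):
    encoding = tiktoken.encoding_for_model(model)

    def count(message):
        # Content tokens plus a few tokens of overhead for the role and message framing.
        return len(encoding.encode(message["content"])) + 4

    system_message = {"role": "system", "content": system_prompt}
    new_message = {"role": "user", "content": new_user_message}
    total = count(system_message) + count(new_message)

    # Walk backwards through the history, newest first, and stop as soon as
    # the next message would push us over the token budget.
    kept = []
    for message in reversed(past_messages):
        message_tokens = count(message)
        if total + message_tokens > max_tokens:
            break
        kept.append(message)
        total += message_tokens

    return [system_message] + list(reversed(kept)) + [new_message]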
Basically, we can use the tiktoken library to figure out the encoding for the given model, and ask for the token count of a particular user message's content, like "Please write a poem". But we also need to account for the other tokens that are a part of a request, like "role": "user" and images in GPT-4-vision requests, and OpenAI's guides provide tips for counting those additional tokens. You can see the code in my count_tokens_for_messages function, which accounts for both text messages and image messages.
The calculation gets trickier when there's function calling involved, since function definitions also use up tokens, and exactly how many depends on the system message contents, presumably because OpenAI is actually stuffing the function schema into the system message behind the scenes. That calculation is done in my count_tokens_for_system_and_tools function, which was based on great reverse engineering work by other developers in the OpenAI community.
Using message history truncation in a chat app
Now that I've encapsulated the token counting and message truncation functionality in the openai-messages-token-helper package, I can use that inside my OpenAI chat apps.
For example, azure-search-openai-demo is a RAG chat application that answers questions based off content from an Azure AI Search index. In the function that handles a new question from a user, here's how we build the messages parameter for the chat completion call:
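(The sketch below is condensed and from memory, so check the repository and the package documentation for the exact parameter names.)

from openai_messages_token_helper import build_messages, get_token_limit

response_token_limit = 1024  # tokens reserved for the model's answer
messages = build_messages(
    model="gpt-35-turbo",  # illustrative model/deployment name
    system_prompt=system_prompt,
    new_user_content=user_question,
    past_messages=past_messages,
    max_tokens=get_token_limit("gpt-35-turbo") - response_token_limit,
)

chat_completion = openai_client.chat.completions.create(
    model="gpt-35-turbo",
    messages=messages,
    max_tokens=response_token_limit,
    temperature=0.3,
)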
We first decide how many tokens we'll allow for the response, then use build_messages to truncate the message history as needed, then pass the possibly truncated messages into the chat completion call.
I would very much like for this type of functionality to be built into either the OpenAI API itself, the OpenAI SDK, or the tiktoken package, as I don't know how sustainable it is for the community to be maintaining token counting packages - and I've found similar calculation logic scattered across JavaScript, Go, Java, Dart, and Python. Our token counting logic may become out-of-date when new models or new API parameters come out, and then we have to go through the reverse engineering process again to come up with new calculations. Ultimately, I'm hopeful for one of these possibilities:
All LLM providers, including OpenAI API, provide token-counting estimators as part of their APIs or SDKs.
LLM APIs add parameters which allow developers to specify our preferred truncation or summarization schemes, such as "last_n_conversations": 10 or "summarize_all": true.
LLMs will eventually have such high context windows that we won't feel such a need to possibly truncate our messages based on token counts. Perhaps we'd send the last 10 messages, always, and we'd be confident enough that those would always fit in the high context windows.
Until then, I will maintain the openai-messages-token-helper package and use that whenever I feel the need to truncate conversation history.