Wednesday, July 10, 2024

Playwright and Pytest parametrization for responsive E2E tests

I am a big fan of Playwright, a tool for end-to-end testing that was originally built for Node.js but is also available in Python and other languages.

Playwright 101

For example, here's a simplified test for the chat functionality of our open-source RAG solution:

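from playwright.sync_api import Page, expect

def test_chat(page: Page, live_server_url: str):
  page.goto(live_server_url)  # load the app before asserting on it

  expect(page).to_have_title("Azure OpenAI + AI Search")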
  expect(page.get_by_role("heading", name="Chat with your data")).to_be_visible()

  page.get_by_placeholder("Type a new question").click()
  page.get_by_placeholder("Type a new question").fill("Whats the dental plan?")
  page.get_by_role("button", name="Submit question").click()

  expect(page.get_by_text("Whats the dental plan?")).to_be_visible()

We then run that test using pytest and the pytest-playwright plugin on headless browsers, typically Chromium, though other browsers are supported. We can run the tests locally and in our GitHub Actions workflows.

Viewport testing

We recently improved the responsiveness of our RAG solution, with different font sizing and margins in smaller viewports, plus a burger menu:

Screenshot of RAG chat at a small viewport size

Fortunately, Playwright makes it easy to change the viewport of a browser window, via the set_viewport_size function:

page.set_viewport_size({"width": 600, "height": 1024})

I wanted to make sure that all the functionality was still usable at all supported viewport sizes. I didn't want to write a new test for every viewport size, however. So I wrote this parameterized pytest fixture:

@pytest.fixture(params=[(480, 800), (600, 1024), (768, 1024), (992, 1024), (1024, 768)])
def sized_page(page: Page, request):
    size = request.param
    page.set_viewport_size({"width": size[0], "height": size[1]})
    yield page
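
One optional refinement, sketched here rather than taken from our repository: pytest can label each parametrized run with a readable ID, so a failure points at something like 480x800 instead of an auto-generated name such as sized_page0. The names VIEWPORT_SIZES, viewport_id, and viewport_dict are my own, chosen for illustration.

```python
# Viewport sizes as (width, height), copied from the fixture above.
VIEWPORT_SIZES = [(480, 800), (600, 1024), (768, 1024), (992, 1024), (1024, 768)]


def viewport_id(size: tuple[int, int]) -> str:
    """Build a readable test ID such as '480x800' from a (width, height) pair."""
    width, height = size
    return f"{width}x{height}"


def viewport_dict(size: tuple[int, int]) -> dict[str, int]:
    """Convert a (width, height) pair into the dict that set_viewport_size expects."""
    width, height = size
    return {"width": width, "height": height}


# In the fixture, these helpers would be used roughly like so:
#   @pytest.fixture(params=VIEWPORT_SIZES, ids=viewport_id)
#   def sized_page(page, request):
#       page.set_viewport_size(viewport_dict(request.param))
#       yield page
print([viewport_id(size) for size in VIEWPORT_SIZES])
```

Passing a callable to ids keeps the size list as the single source of truth, so adding a viewport only means adding one tuple.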

Then I modified the most important tests to take the sized_page fixture instead:

def test_chat(sized_page: Page, live_server_url: str):
  page = sized_page

Since our website now has a burger menu at smaller viewport sizes, I also had to add an optional click() on that menu:

if page.get_by_role("button", name="Toggle menu").is_visible():
    page.get_by_role("button", name="Toggle menu").click()
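
If several tests need that optional click, it could be pulled into a small helper. This is only a sketch: click_if_visible is a name I made up, and FakeLocator is a stand-in so the example runs without a browser. It relies only on the is_visible() and click() methods that Playwright locators provide.

```python
def click_if_visible(locator) -> bool:
    """Click the locator only if it is visible right now; report whether we clicked."""
    if locator.is_visible():
        locator.click()
        return True
    return False


class FakeLocator:
    """Stand-in for a Playwright Locator, used only to demonstrate the helper."""

    def __init__(self, visible: bool):
        self.visible = visible
        self.clicks = 0

    def is_visible(self) -> bool:
        return self.visible

    def click(self) -> None:
        self.clicks += 1


burger_menu = FakeLocator(visible=True)
print(click_if_visible(burger_menu), burger_menu.clicks)
```

In a real test, the call would be click_if_visible(page.get_by_role("button", name="Toggle menu")).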

Now we can confidently say that all our functionality works at the supported viewport sizes, and if we have any regressions, we can add additional tests or viewport sizes as needed. So cool!

Should you use Quart or FastAPI for an AI app?

As I have discussed previously, it is very important to use an async framework when developing apps that make calls to generative AI APIs, so that your backend processes can concurrently handle other requests while they wait for the (relatively slow) response from the AI API.
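
To make that concrete, here is a toy sketch using only the standard library. The function names are hypothetical and asyncio.sleep stands in for the slow AI API call. No web framework is involved, but the same principle is what lets an async worker serve a second request while the first one waits.

```python
import asyncio
import time


async def call_ai_api(prompt: str, delay: float = 0.2) -> str:
    """Pretend to call a slow AI API; asyncio.sleep stands in for the network wait."""
    await asyncio.sleep(delay)
    return f"answer to {prompt!r}"


async def handle_requests(prompts: list[str]) -> tuple[list[str], float]:
    """Handle all prompts concurrently and report the total elapsed time."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(call_ai_api(p) for p in prompts))
    return list(answers), time.perf_counter() - start


answers, elapsed = asyncio.run(handle_requests(["first", "second"]))
# The waits overlap, so elapsed stays close to one delay rather than two.
print(answers, f"{elapsed:.2f}s")
```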

Diagram of worker handling second request while first request waits for API response

Async frameworks

There are a few options for asynchronous web frameworks for Python developers:

  • FastAPI: A framework that was designed to be async-only from the beginning, and an increasingly popular option for Python web developers. It's particularly well suited to APIs, because it includes Swagger (OpenAPI) for auto-generated documentation based off type annotations.
  • Quart: The async version of the popular Flask framework. It is now actually built on Flask, so it brings it in as a dependency and reuses what it can. It tries to mimic the Flask interface as much as possible, with exceptions only when needed for better async support.
  • Django: The default for Django is a WSGI app with synchronous views, but it is now possible to write async views as well and run the Django app as an ASGI app.

Quart vs. FastAPI

So which framework should you choose? Since I have not personally used Django with async views, I'm going to focus on comparing Quart vs. FastAPI, as I have used them for a number of AI-on-Azure samples.

  • If you already have Flask apps, it is much easier to turn them into Quart apps than FastAPI apps, given the purposeful similarity of Quart to Flask. You may run into issues if you are using many Flask extensions, however, since not all of them have been ported to Quart.
  • In my experience, Quart is easier to use if your app includes static files / HTML routes. It is possible to use FastAPI for a full webapp, but it is harder. That said, I've figured it out in a few projects, such as rag-postgres-openai-python, so you can look at that approach for inspiration.
  • FastAPI has built-in API documentation. To do that with Quart, you need to use Quart-Schema. That extension is fairly straightforward to use, and I have successfully used it with Quart apps, but it is certainly easier with FastAPI.
  • Quart has a good number of extensions available, largely due to many extensions being forked from Flask extensions. There is less of an extension ecosystem for FastAPI, perhaps because there is not an established extension mechanism. There are many tutorials and discussion posts that show how to implement features in FastAPI, however, thanks to the popularity of FastAPI.
  • The performance of Quart and FastAPI should be fairly similar, though I haven't done tests to directly compare the two. The most standard way to run them is with gunicorn and a uvicorn worker, but it is now possible to run uvicorn directly, as of the latest uvicorn release. Another server is hypercorn, created by the Quart creator, but I haven't used that in production myself.
  • Quart is an open-source project that is part of the Pallets ecosystem, and primarily maintained by @pgjones. FastAPI is also an open-source project, primarily maintained by @tiangolo, who recently received funding to work on monetization strategies. Both of them are regularly maintained at this point.

Both frameworks are solid options, with different benefits. Share any experiences you've had in the comments!

Friday, June 14, 2024

pgvector for Python developers

Lately, I've been digging into vector embeddings, since they're such an important part of the RAG (Retrieval Augmented Generation) pattern that we use in our most popular AI samples. I think that when many developers hear "vector embeddings" these days, they immediately think of dedicated vector databases such as Pinecone, Qdrant, or Chroma.

As it turns out, you can use vectors in many existing databases as well, such as the very popular and open-source PostgreSQL database. You just need to install the open-source pgvector extension, and boom, you can store vector-type columns, use four different distance operators to compare vectors, and use two different index types to efficiently perform searches on large tables.
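
To give a feel for what those distance operators compute, here is a plain-Python sketch. I'm assuming the four are the ones pgvector documents for its vector type: <-> for L2 distance, <#> for negative inner product, <=> for cosine distance, and <+> for L1 distance. The function names are mine, and in practice the database does this math for you.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance, the pgvector <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: list[float], b: list[float]) -> float:
    """Negated dot product, the pgvector <#> operator (smaller means more similar)."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """One minus cosine similarity, the pgvector <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def l1_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute differences, the pgvector <+> operator."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = [1.0, 0.0], [0.0, 1.0]
print(l2_distance(a, b), negative_inner_product(a, b), cosine_distance(a, b), l1_distance(a, b))
```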

For this year's PosetteConf, I put together a talk called "pgvector for Python developers" to explain what vectors are, why they matter, how to use them with pgvector, and how to use pgvector from Python for similarity and searching.

Check out the video on YouTube or below:

You can also follow along with the online slides, and try the repositories I used in my demos: pgvector playground and RAG on PostgreSQL. If your goal is simply to deploy pgvector to Azure, also check out Azure PostgreSQL Flexible Server + pgvector.

If you're a Django developer, then you may also be interested in this talk on "Semantic search with Django and pgvector" from Paolo Melchiorre, which you can watch on YouTube or below:

Using SLMs in GitHub Codespaces

Today I went on a quest to figure out the best way to use SLMs (small language models) like Phi-3 in a GitHub Codespace, so that I can provide a browser-only way for anyone to start working with language models. My main target audience is teachers and students, who may not have GPUs or the budget to pay for large language models, but it's always good to make new technologies accessible to as many people as possible.

❌ Don't use transformers

For my first approach, I tried to use the HuggingFace transformers package, using code similar to their text generation tutorial, but modified for my non-GPU environment:

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

prompt = "insert your prompt here"
model_checkpoint = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)
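# Assumed minimal arguments for a CPU-only run: default device, PyTorch tensors
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             trust_remote_code=True)
inputs = tokenizer(prompt,
                   return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, max_new_tokens=120)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)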

Unfortunately, that took a very long time. Enough time for me to go for a walk, chase the garbage truck around the neighborhood, make breakfast, and drop my kid at school: 133 minutes total!

Screenshot of notebook with 133m duration

So, yes, you can use transformers, technically. But without either a very powerful CPU or, better, a GPU, the performance is just too slow. Let's move on...

✅ Do use Ollama!

For my next approach, I tried setting up a dev container with Ollama built-in, by adding the ollama feature to my devcontainer.json:

"features": {
    "": {}
}
I then pulled a small phi3 model using "ollama run phi3:mini", and I was able to generate text in a matter of seconds:

Screenshot of Ollama generation

So I proceeded to use Ollama via the Python OpenAI SDK, which I can do thanks to Ollama's OpenAI-compatible endpoints.

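import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default local OpenAI-compatible endpoint
    api_key="nokeyneeded",  # placeholder value; a local Ollama server does not check it
)
response = client.chat.completions.create(
    model="phi3:mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about a hungry cat"},
    ],
)
print(response.choices[0].message.content)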

Try it yourself!

To make it super easy for anyone to get started with SLMs in a Codespace, I bundled everything into this repository:

That repository includes the Ollama feature, OpenAI SDK, a notebook with demonstrations of few-shot and RAG, and a script for an interactive chat. I hope it can be a helpful resource for teachers and students who want a quick and easy way to get started with small language models.

I also hope to add the Ollama feature to other repositories where it can be helpful, like the Phi-3 cookbook.

Monday, June 10, 2024

RAG on a database table with PostgreSQL

RAG (Retrieval Augmented Generation) is one of the most promising uses for large language models. Instead of asking an LLM a question and hoping the answer lies somewhere in its weights, we instead first query a knowledge base for anything relevant to the question, and then feed both those results and the original question to the LLM.

We have many RAG solutions out there for asking questions on unstructured documents, like PDFs and Word Documents. Our most popular Azure solution for this scenario includes a data ingestion process to extract the text from the documents, chunk them up into appropriate sizes, and store them in an Azure AI Search index. When your RAG is on unstructured documents, you'll always need a data ingestion step to store them in an LLM-compatible format.

But what if you just want users to ask questions about structured data, like a table in a database? Imagine customers that want to ask questions about the products in a store's inventory, and each product is a row in the table. We can use the RAG approach there, too, and in some ways, it's a simpler process.

Diagram of RAG on database rows

To get you started with this flavor of RAG, we've created a new RAG-on-PostgreSQL solution that includes a FastAPI backend, React frontend, and infrastructure-as-code for deploying it all to Azure Container Apps with Azure PostgreSQL Flexible Server. Here it is with the sample seed data:

Screenshot of RAG app with question about waterproof camping gear

We use the user's question to query a single PostgreSQL table and send the matching rows to the LLM. We display the answer plus information about any of the referenced products from the answer. Now let's break down how that solution works.

Data preparation

When we eventually query the database table with the user's query, we ideally want to perform a hybrid search: both a full text search and a vector search of any columns that might match the user's intent. In order to perform a vector search, we also need a column that stores a vector embedding of the target columns.

This is what the sample table looks like, described using SQLAlchemy 2.0 model classes. The final embedding column is a Vector type, from the pgvector extension for PostgreSQL:

class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    type: Mapped[str] = mapped_column()
    brand: Mapped[str] = mapped_column()
    name: Mapped[str] = mapped_column()
    description: Mapped[str] = mapped_column()
    price: Mapped[float] = mapped_column()
    embedding: Mapped[Vector] = mapped_column(Vector(1536))

The embedding column has 1536 dimensions to match OpenAI's text-embedding-ada-002 model, but you could configure it to match the dimensions of different embedding models instead. The most important thing is to know exactly which model you used for generating embeddings, so that you can later search with that same model.

To compute the value of the embedding column, we concatenate the text columns from the table row, send them to the OpenAI embedding model, and store the result:

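items = session.scalars(select(Item)).all()
for item in items:
  # Concatenate the text columns of the row into one string to embed
  item_for_embedding = f"Name: {item.name} Description: {item.description} Type: {item.type}"
  item.embedding = openai_client.embeddings.create(
      model="text-embedding-ada-002", input=item_for_embedding
  ).data[0].embedding
session.commit()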

We only need to run that once, if our data is static. However, if any of the included columns change, we should re-run that for the changed rows. Another approach is to use the Azure AI extension for Azure PostgreSQL Flexible Server. I didn't use it in my solution since I also wanted it to run with a local PostgreSQL server, but it should work great if you're always using the Azure-hosted PostgreSQL Flexible Server.

Hybrid search in PostgreSQL

Now our database table has both text columns and a vector column, so we should be able to perform a hybrid search: using the pgvector distance operator on the embedding column, using the built-in full-text search functions on the text columns, and merging them using the Reciprocal Rank Fusion algorithm.

We use this SQL query for hybrid search, inspired by an example from the pgvector-python repository:

from sqlalchemy import text

vector_query = f"""
  SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
  FROM items
  ORDER BY embedding <=> :embedding
  LIMIT 20
"""

fulltext_query = f"""
  SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC) AS rank
  FROM items, plainto_tsquery('english', :query) query
  WHERE to_tsvector('english', description) @@ query
  ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC
  LIMIT 20
"""

# Merge both ranked lists with Reciprocal Rank Fusion: 1 / (k + rank) per list, summed
hybrid_query = f"""
WITH vector_search AS (
  {vector_query}
),
fulltext_search AS (
  {fulltext_query}
)
SELECT
  COALESCE(vector_search.id, fulltext_search.id) AS id,
  COALESCE(1.0 / (:k + vector_search.rank), 0.0) +
  COALESCE(1.0 / (:k + fulltext_search.rank), 0.0) AS score
FROM vector_search
FULL OUTER JOIN fulltext_search ON vector_search.id = fulltext_search.id
ORDER BY score DESC
LIMIT 20
"""

sql = text(hybrid_query)
results = session.execute(sql,
    {"embedding": to_db(query_vector), "query": query_text, "k": 60},
).fetchall()
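
If the SQL is hard to follow, the Reciprocal Rank Fusion step can be sketched in plain Python. This is an illustration, not code from the solution: reciprocal_rank_fusion is a name I picked, and it takes ranked lists of row IDs. It mirrors the two COALESCE terms in the query, with k=60 as in the parameters above.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked ID lists: each ID earns 1 / (k + rank) for every list it appears in."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; an ID missing from a list simply earns nothing from it.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


vector_ranking = [1, 2, 3]   # IDs ordered by vector distance
fulltext_ranking = [3, 4]    # IDs ordered by full-text rank
for row_id, score in reciprocal_rank_fusion([vector_ranking, fulltext_ranking]):
    print(row_id, round(score, 5))
```

Row 3 comes out on top because it is the only ID that appears in both lists, which is the behavior we want from a hybrid search.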

That hybrid search is missing the final step that we always recommend for Azure AI Search: semantic ranker, a re-ranking model that sorts the results according to the original user query. It should be possible to add a re-ranking model, as shown in another pgvector-python example, but such an addition requires load testing and possibly an architectural change, since re-ranking models are CPU-intensive. Ideally, the re-ranking model would be deployed on dedicated infrastructure optimized for model running, not on the same server as our app backend.

We get fairly good results from that hybrid search query, however! It easily finds both rows that match the exact keywords in a query and rows with semantically similar phrases, as demonstrated by these user questions:

Screenshot of question 'dark blue shoes for hiking up trails' Screenshot of question 'sneakers for walking up steep hills'

Function calling for SQL filtering

The next step is to handle user queries like, "climbing gear cheaper than $100." Our hybrid search query can definitely find "climbing gear", but it's not designed to find products whose price is lower than some amount. The hybrid search isn't querying the price column at all, and isn't appropriate for a numeric comparison query anyway. Ideally, we would do both a hybrid search and add a filter clause, like WHERE price < 100.

Fortunately, we can use an LLM to suggest filter clauses based on user queries, and the OpenAI GPT models are very good at it. We add a query-rewriting phase to our RAG flow which uses OpenAI function calling to come up with the optimal search query and column filters.

In order to use OpenAI function calling, we need to describe the function and its parameters. Here's what that looks like for a search query and single column's filter clause:

{
  "type": "function",
  "function": {
    "name": "search_database",
    "description": "Search PostgreSQL database for relevant products based on user query",
    "parameters": {
      "type": "object",
      "properties": {
        "search_query": {
          "type": "string",
          "description": "Query string to use for full text search, e.g. 'red shoes'"
        },
        "price_filter": {
          "type": "object",
          "description": "Filter search results based on price of the product",
          "properties": {
            "comparison_operator": {
              "type": "string",
              "description": "Operator to compare the column value, either '>', '<', '>=', '<=', '='"
            },
            "value": {
              "type": "number",
              "description": "Value to compare against, e.g. 30"
            }
          }
        }
      }
    }
  }
}

We can easily add additional parameters for other column filters, or we could even have a generic column filter parameter and have OpenAI suggest the column based on the table schema. For my solution, I am intentionally constraining the LLM to only suggest a subset of possible filters, to minimize risk of SQL injection or poor SQL performance. There are many libraries out there that do full text-to-SQL, and that's another approach you could try out, if you're comfortable with the security of those approaches.

When we get back the results from the function call, we use them to build a filter clause, and append that to our original hybrid search query. We want to do the filtering before the vector and full text search, to narrow down the search space to only what could possibly match. Here's what the new vector search looks like, with the additional filter clause:

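# filter_clause (an illustrative name) holds the generated clause, e.g. "WHERE price < 100"
vector_query = f"""
  SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
    FROM items
    {filter_clause}
    ORDER BY embedding <=> :embedding
    LIMIT 20
"""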
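
Here is one way the filter-building step could look, as a hedged sketch rather than the exact code in the solution. The function name build_price_filter and the :price_value parameter name are hypothetical. The idea is to accept only the operators listed in the function description and to pass the value as a bound parameter, so nothing the LLM suggests is pasted into the SQL unchecked.

```python
from typing import Optional

# Only the operators named in the function description are allowed through.
ALLOWED_OPERATORS = {">", "<", ">=", "<=", "="}


def build_price_filter(price_filter: Optional[dict]) -> tuple[str, dict]:
    """Turn the LLM-suggested price_filter argument into a WHERE clause plus bind parameters."""
    if not price_filter:
        return "", {}
    operator = price_filter.get("comparison_operator")
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"Unsupported comparison operator: {operator!r}")
    value = float(price_filter["value"])  # raises if the value is not numeric
    return f"WHERE price {operator} :price_value", {"price_value": value}


clause, params = build_price_filter({"comparison_operator": "<", "value": 100})
print(clause, params)
```

The returned clause would be interpolated into the query text, and the returned parameters merged into the dictionary passed to session.execute.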

With the query rewriting and filter building in place, our RAG app can now answer questions that depend on filters:

Screenshot of question 'climbing gear cheaper than $30'

RAG on unstructured vs structured data

Trying to decide what RAG approach to use, or which of our solutions to use for a prototype? If your target data is largely unstructured documents, then you should try out our Azure AI Search RAG starter solution, which will take care of the complex data ingestion phase for you. However, if your target data is an existing database table, and you want to RAG over a single table (or a small number of tables), then try out the PostgreSQL RAG starter solution and modify it to work with your table schema. If your target data is a database with a multitude of tables with different schemas, then you probably want to research full text-to-SQL solutions. Also check out the llamaindex and langchain libraries, as they often have functionality and samples for common RAG scenarios.

Monday, June 3, 2024

Doing RAG? Vector search is *not* enough

I'm concerned by the number of times I've heard, "oh, we can do RAG with retriever X, here's the vector search query." Yes, your retriever for a RAG flow should definitely support vector search, since that will let you find documents with similar semantics to a user's query, but vector search is not enough. Your retriever should support a full hybrid search, meaning that it can perform both a vector search and full text search, then merge and re-rank the results. That will allow your RAG flow to find both semantically similar concepts and exact matches like proper names, IDs, and numbers.

Hybrid search steps

Azure AI Search offers a full hybrid search with all those components:

Diagram of Azure AI Search hybrid search flow
  1. It performs a vector search using a distance metric (typically cosine or dot product).
  2. It performs a full-text search using the BM25 scoring algorithm.
  3. It merges the results using the Reciprocal Rank Fusion algorithm.
  4. It re-ranks the results using semantic ranker, a machine learning model used by Bing, that compares each result to the original user query and assigns a score from 0-4.

The search team even researched all the options against a standard dataset, and wrote a blog post comparing the retrieval results for full text search only, vector search only, hybrid search only, and hybrid plus ranker. Unsurprisingly, they found that the best results came from using the full stack, and that's why it's the default configuration we use in the AI Search RAG starter app.

When is hybrid search needed?

To demonstrate the importance of going beyond vector search, I'll show some queries based off the sample documents in the AI Search RAG starter app. Those documents are from a fictional company and discuss internal policies like healthcare and benefits.

Let's start by searching "what plan costs $45.00?" with a pure vector search using an AI Search index:

search_query = "what plan costs $45.00"
search_vector = get_embedding(search_query)
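# search_client is assumed to be an azure.search.documents.SearchClient for the index
r = search_client.search(None, top=3, vector_queries=[
  VectorizedQuery(vector=search_vector, k_nearest_neighbors=50, fields="embedding")])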

The results for that query contain numbers and costs, like the string "The copayment for primary care visits is typically around $20, while specialist visits have a copayment of around $50.", but none of the results contain an exact cost of $45.00, which is what the user was looking for.

Now let's try that query with a pure full-text search:

r = search_client.search(search_query, top=3)

The top result for that query contains a table of costs for the health insurance plans, with a row containing $45.00.

Of course, we don't want to be limited to full text queries, since many user queries would be better answered by vector search, so let's try this query with hybrid:

r = search_client.search(search_query, top=15, vector_queries=[
  VectorizedQuery(vector=search_vector, k_nearest_neighbors=10, fields="embedding")])

Once again, the top result is the table with the costs and exact string of $45.00. When the user asks that question in the context of the full RAG app, they get the answer they were hoping for:

You might think, well, how many users are searching for exact strings? Consider how often you search your email for a particular person's name, or how often you search the web for a particular programming function name. Users will make queries that are better answered by full-text search, and that's why we need hybrid search solutions.

Here's one more reason why vector search alone isn't enough: assuming you're using generic embedding models like the OpenAI models, those models are generally not a perfect fit for your domain. Their understanding of certain terms isn't going to be the same as that of a model trained entirely on your domain's data. Using hybrid search helps to compensate for the differences in the embedding domain.

When is re-ranking needed?

Now that you're hopefully convinced about hybrid search, let's talk about the final step: re-ranking results according to the original user query.

Now we'll search the same documents for "learning about underwater activities" with a hybrid search:

search_query = "learning about underwater activities"
search_vector = get_embedding(search_query)
r = search_client.search(search_query, top=5, vector_queries=[
  VectorizedQuery(vector=search_vector, k_nearest_neighbors=10, fields="embedding")])

The third result for that query is the most relevant one, a benefits document that mentions surfing lessons and scuba diving lessons. The word "underwater" doesn't appear in any documents, notably, so those results are coming from the vector search component.

What happens if we add in the semantic ranker?

search_query = "learning about underwater activities"
search_vector = get_embedding(search_query)
r = search_client.search(search_query, top=5, vector_queries=[
  VectorizedQuery(vector=search_vector, k_nearest_neighbors=50, fields="embedding")],
  query_type="semantic", semantic_configuration_name="default")

Now the very top result for the query is the document chunk about surfing and scuba diving lessons, since the semantic ranker realized that was the most pertinent result for the user query. When the user asks a question like that in the RAG flow, they get a correct answer with the expected citation:

Screenshot of user asking question about underwater activities and getting a good answer

Our search yielded the right result in both cases, so why should we bother with the ranker? For RAG applications, which send search results to an LLM like GPT-3.5, we typically limit the number of results to a fairly low number, like 3 or 5 results. That's due to research that shows that LLMs tend to get "lost in the middle" when too much context is thrown at them. We want those top N results to be the most relevant results, and to not contain any irrelevant results. By using the re-ranker, our top results are more likely to contain the closest matching content for the query.

Plus, there's a big additional benefit: each of the results now has a re-ranker score from 0-4, which makes it easy for us to filter out results with re-ranker scores below some threshold (like < 1.5). Remember that any search algorithm that includes vector search will always find results, even if those results aren't very close to the original query at all, since vector search just looks for the closest vectors in the entire vector space. So when your search involves vector search, you ideally want a re-ranking step and a scoring approach that will make it easier for you to discard results that just aren't relevant enough on an absolute scale.
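
That thresholding step can be sketched in a few lines. I'm assuming each result behaves like a dict that exposes the score under the "@search.reranker_score" key, as AI Search results do when semantic ranking is on. The function name and the sample results are made up for illustration.

```python
def filter_by_reranker_score(results: list[dict], minimum_score: float = 1.5) -> list[dict]:
    """Keep only the results whose re-ranker score reaches the minimum; a missing score counts as 0."""
    return [
        result for result in results
        if (result.get("@search.reranker_score") or 0.0) >= minimum_score
    ]


sample_results = [
    {"id": "benefits-12", "@search.reranker_score": 2.8},
    {"id": "handbook-3", "@search.reranker_score": 0.9},
    {"id": "policy-7", "@search.reranker_score": None},
]
print([result["id"] for result in filter_by_reranker_score(sample_results)])
```

The right threshold depends on your data, so it is worth tuning it against an evaluation set rather than trusting 1.5 blindly.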

Implementing hybrid search

As you can see from my examples, Azure AI Search can do everything we need for a RAG retrieval solution (and even more than we've covered here, like filters and custom scoring algorithms). However, you might be reading this because you're interested in using a different retriever for your RAG solution, such as a database. You should be able to implement hybrid search on top of most databases, provided they have some capability for text search and vector search.

As an example, consider the PostgreSQL database. It already has built-in full text search, and there's a popular extension called pgvector for bringing in vector indexes and distance operators. The next step is to combine them in a hybrid search, which is demonstrated in this example from the pgvector-python repository:

WITH semantic_search AS (
  SELECT id, RANK () OVER (ORDER BY embedding <=> %(embedding)s) AS rank
  FROM documents
  ORDER BY embedding <=> %(embedding)s
  LIMIT 20
),
keyword_search AS (
  SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
  FROM documents, plainto_tsquery('english', %(query)s) query
  WHERE to_tsvector('english', content) @@ query
  ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
  LIMIT 20
)
SELECT
  COALESCE(semantic_search.id, keyword_search.id) AS id,
  COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0) +
  COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC

That SQL performs a hybrid search by running a vector search and text search and combining them together with RRF. Another example from that repo shows how we could bring in a cross-encoding model for a final re-ranking step:

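from sentence_transformers import CrossEncoder

# Score each (query, result text) pair, then sort the results from best to worst
encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = encoder.predict([(query, item[1]) for item in results])
results = [v for _, v in sorted(zip(scores, results), reverse=True)]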

That code would run the cross-encoding model in the same process as the rest of the PostgreSQL query, so it could work well in a local or test environment, but it wouldn't necessarily scale well in a production environment. Ideally, a call to a cross-encoder would be made in a separate service that had access to a GPU and dedicated resources.

I have implemented the first three steps of hybrid search in a RAG-on-PostgreSQL starter app. Since I don't yet have a good way to productionize a call to a cross-encoding model, I have not brought in the final re-ranking step.

After seeing what it takes to replicate full hybrid search options on other databases, I am even more appreciative of the work done by the Azure AI Search team. If you've decided that, never mind, you'll go with Azure AI Search, check out the AI Search RAG starter app. You might also check out open source packages, such as llamaindex, which has at least partial hybrid search support for a number of databases. If you've used or implemented hybrid search on a different database, please share your experience in the comments.

When in doubt, evaluate

When choosing our retriever and retriever options for RAG applications, we need to evaluate answer quality. I stepped through a few example queries above, but for a user-facing app, we really need to do bulk evaluations on a large number of questions (~200) to see the effect of an option on answer quality. To make it easier to run bulk evaluations, I've created the ai-rag-chat-evaluator repository, which can run both GPT-based metrics and code-based metrics against RAG chat apps.

Here are the results from evaluations against a synthetically generated data set for a RAG app based on all my personal blog posts:

search mode          groundedness   relevance   answer_length   citation_match
vector only          2.79           1.81        366.73          0.02
text only            4.87           4.74        662.34          0.89
hybrid               3.26           2.15        365.66          0.11
hybrid with ranker   4.89           4.78        670.89          0.92

Despite being the author of this blog post, I was shocked to see how poorly vector search did on its own, with an average groundedness of 2.79 (out of 5) and only 2% of the answers with citations matching the ground truth citations. Full-text search on its own did fairly well, with an average groundedness of 4.87 and a citation match rate of 89%. Hybrid search without the semantic ranker improved upon vector search, with an average groundedness of 3.26 and citation match of 11%, but it did much better with the semantic ranker, with an average groundedness of 4.89 and a citation match rate of 92%. As we would expect, those are the highest numbers across all the options.

But why do we see vector search and ranker-less hybrid search scoring so remarkably low? Besides what I've talked about above, I think it's also due to:

  • The full-text search option in Azure AI Search is really good. It uses BM25 and is fairly battle-tested, having been around for many years before vector search became so popular. The BM25 algorithm is based off TF-IDF and produces something like sparse vectors itself, so it's more advanced than a simple substring search. AI Search also uses standard NLP tricks like stemming and spell check. Many databases have full text search capabilities, but they won't all be as full-featured as the Azure AI Search full-text search.
  • My ground truth data set is biased towards compatibility with full-text search. I generated the sample questions and answers by feeding my blog posts to GPT-4 and asking it to come up with good Q&A based off the text, so I think it's very likely that GPT-4 chose to use similar wording as my posts. An actual question-asker might use very different wording - heck, they might even ask in a different language like Spanish or Chinese! That's where vector search could really shine, and where full-text search wouldn't do so well. It's a good reminder of why we need to continue updating evaluation data sets based off what our RAG chat users ask in the real world.
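
For readers who haven't met BM25, mentioned in the first bullet above, here is a textbook-style sketch. It is not Azure AI Search's implementation: tokenization is a bare lowercase split, the k1 and b defaults are the commonly quoted ones, and the function name is my own. It does show why BM25 is more than substring matching: rarer terms weigh more, and long documents are normalized.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the standard BM25 formula."""
    if not documents:
        return []
    tokenized = [document.lower().split() for document in documents]
    average_length = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            frequency = counts[term]
            if frequency == 0:
                continue
            containing = sum(1 for other in tokenized if term in other)
            # Inverse document frequency: terms found in fewer documents count for more.
            idf = math.log((len(tokenized) - containing + 0.5) / (containing + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by document length via b.
            length_norm = 1 - b + b * len(tokens) / average_length
            score += idf * frequency * (k1 + 1) / (frequency + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["the dental plan costs $45.00", "vision plan details", "surfing lessons"]
print(bm25_scores("dental plan", docs))
```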

So in conclusion, if we are going to go down the path of using vector search, it is absolutely imperative that we employ a full hybrid search with all four steps and that we evaluate our results to ensure we're using the best retrieval options for the job.

Saturday, June 1, 2024

Truncating conversation history for OpenAI chat completions

When I build chat applications using the OpenAI chat completions API, I often want to send a user's previous messages to the model so that the model has more context for a user's question. However, OpenAI models have limited context windows, ranging between 4K and 128K tokens depending on the model. If we send more tokens than the model allows, the API will respond with an error.

We need a way to make sure to only send as many tokens as a model can handle. You might consider several approaches:

  • Send the last N messages (where N is some small number like 3). Don't do this! That is very likely to end up in an error. A particular message might be very long, or might be written in a language with a higher token:word ratio, or might contain symbols that require surprisingly high token counts. Similarly, don't rely on character count as a reliable indicator of token count; it will fail with any message that isn't just common English words.
  • Use a separate OpenAI call to summarize the conversation, and send the summary. This approach can work, especially if you specify the maximum tokens for a Chat Completion call and verify the number of tokens used in the response. It does have the drawback of requiring an additional OpenAI call, which can significantly affect user-perceived latency.
  • Send the last N messages that fit inside the remaining token count. This approach requires the use of the tiktoken library for calculating token usage for each possible message that you might send. That does take time, but is faster than an additional LLM call. This is what we use in azure-search-openai-demo and rag-postgres-openai-python, and what I'll explain in this post.

Overall algorithm for conversation history truncation

Here is the approach we take to squeezing in as much conversation history as possible, assuming a function that takes an input of model, system_prompt, few_shots, past_messages, and new_user_content. The function defaults to the maximum token window for the given model, but can also be customized with a different max_tokens.

  1. Start with the system prompt and few shot examples. We always want to send those.
  2. Add the new user message, the one that ultimately requires an answer. Compute the token count of the current set of messages.
  3. Starting from the most recent of the past messages, compute the token count of the message. If adding that token count wouldn't go over the max token count, then add the message. Otherwise, stop.

And that's it! You can see it implemented in code in my build_messages function.
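The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual build_messages code: the truncate_history name and the count_tokens callback are hypothetical, and the demo substitutes a crude word count for real token counting.

```python
from typing import Callable

Message = dict[str, str]


def truncate_history(
    system_prompt: str,
    few_shots: list[Message],
    past_messages: list[Message],
    new_user_message: str,
    max_tokens: int,
    count_tokens: Callable[[Message], int],
) -> list[Message]:
    """Keep as many recent past messages as fit under max_tokens (hypothetical helper)."""
    # Steps 1 and 2: the system prompt, few shots, and new question are always sent.
    head = [{"role": "system", "content": system_prompt}, *few_shots]
    tail = [{"role": "user", "content": new_user_message}]
    total = sum(count_tokens(m) for m in head + tail)

    # Step 3: walk backwards from the most recent past message.
    kept: list[Message] = []
    for message in reversed(past_messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break  # stop at the first message that doesn't fit
        kept.insert(0, message)  # preserve chronological order
        total += cost
    return head + kept + tail


def word_count(message: Message) -> int:
    # Stand-in only: real code should count tokens, not words.
    return len(message["content"].split())


history = [
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven"},
]
result = truncate_history("be brief", [], history, "what about dental", 8, word_count)
print([m["content"] for m in result])
```

With a budget of 8 "tokens", the oldest message is dropped and the two most recent ones are kept, sandwiched between the system prompt and the new question.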

Token counting for each message

Actually, there's more! How do we actually compute the token count for each message? OpenAI documents that in a few places: Cookbook: How to count tokens with tiktoken, OpenAI guides: Managing tokens, and GPT-4 vision: Calculating costs.

Basically, we can use the tiktoken library to figure out the encoding for the given model, and ask for the token count of a particular user message's content, like "Please write a poem". But we also need to account for the other tokens that are a part of a request, like "role": "user" and images in GPT-4-vision requests, and the guides above provide tips for counting the additional tokens. You can see code in my count_tokens_for_messages function, which accounts for both text messages and image messages.
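As a rough sketch of the per-message arithmetic for text-only messages: the cookbook's formula for recent gpt-3.5-turbo and gpt-4 models adds about 3 tokens of overhead per message, 1 more when a name is present, and 3 for priming the reply. The function below is hypothetical and is not the code from count_tokens_for_messages. It takes the encoder as a parameter so that it can be exercised without tiktoken, and it ignores images and function calling entirely.

```python
from typing import Callable, Sequence


def count_tokens_for_text_messages(
    messages: Sequence[dict[str, str]],
    encode: Callable[[str], Sequence],
    tokens_per_message: int = 3,  # overhead per message, per the cookbook (assumed model family)
    tokens_per_name: int = 1,     # extra overhead when a "name" key is present
    reply_priming: int = 3,       # every reply is primed with an assistant header
) -> int:
    """Estimate prompt tokens for text-only chat messages (hypothetical helper)."""
    total = reply_priming
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(encode(value))  # both role and content strings are encoded
            if key == "name":
                total += tokens_per_name
    return total


# Demo with a fake encoder that splits on whitespace.
demo_messages = [{"role": "user", "content": "Please write a poem"}]
print(count_tokens_for_text_messages(demo_messages, str.split))
```

With tiktoken installed, the encode argument could be tiktoken.encoding_for_model("gpt-4").encode, which returns a list of token ids for a string.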

The calculation gets trickier when there's function calling involved, since that also uses up token costs, and the exact way it uses up the token costs depends on the system message contents, presumably since OpenAI is actually stuffing the function schema into the system message behind the scenes. That calculation is done in my count_tokens_for_system_and_tools function, which was based on great reverse engineering work by other developers in the OpenAI community.

Using message history truncation in a chat app

Now that I've encapsulated the token counting and message truncation functionality in the openai-messages-token-helper package, I can use that inside my OpenAI chat apps.

For example, azure-search-openai-demo is a RAG chat application that answers questions based off content from an Azure AI Search index. In the function that handles a new question from a user, here's how we build the messages parameter for the chat completion call:

response_token_limit = 1024
messages = build_messages(
  new_user_content=original_user_query + "\n\nSources:\n" + content,
  max_tokens=self.chatgpt_token_limit - response_token_limit,

chat_completion = await

We first decide how many tokens we'll allow for the response, then use build_messages to truncate the message history as needed, then pass the possibly truncated messages into the chat completion call.

We use very similar code in the chat handler from rag-postgres-openai-python as well.

Why isn't this built into the OpenAI API?

I would very much like for this type of functionality to be built into either the OpenAI API itself, the OpenAI SDK, or the tiktoken package, as I don't know how sustainable it is for the community to be maintaining token counting packages - and I've found similar calculation logic scattered across JavaScript, Go, Java, Dart, and Python. Our token counting logic may become out-of-date when new models or new API parameters come out, and then we have to go through a reverse engineering process again to come up with calculations. Ultimately, I'm hopeful for one of these possibilities:

  • All LLM providers, including OpenAI API, provide token-counting estimators as part of their APIs or SDKs.
  • LLM APIs add parameters which allow developers to specify our preferred truncation or summarization schemes, such as "last_n_conversations": 10 or "summarize_all": true.
  • LLMs will eventually have such high context windows that we won't feel such a need to possibly truncate our messages based on token counts. Perhaps we'd send the last 10 messages, always, and we'd be confident enough that those would always fit in the high context windows.

Until then, I will maintain the openai-messages-token-helper package and use that whenever I feel the need to truncate conversation history.

Tuesday, March 5, 2024

Evaluating RAG chat apps: Can your app say "I don't know"?

In a recent blog post, I talked about the importance of evaluating the answer quality from any RAG-powered chat app, and I shared my ai-rag-chat-evaluator repo for running bulk evaluations.

In that post, I focused on evaluating a model’s answers for a set of questions that could be answered by the data. But what about all those questions that can’t be answered by the data? Does your model know how to say “I don’t know?” LLMs are very eager-to-please, so it actually takes a fair bit of prompt engineering to persuade them to answer in the negative, especially for answers in their weights somewhere.

For example, consider this question for a RAG based on internal company handbooks:

User asks question 'should I stay at home from work when I have the flu?' and app responds 'Yes' with additional advice

The company handbooks don't actually contain advice on whether employees should stay home when they're sick, but the LLM still tries to give general advice based on what it's seen in training data, and it cites the most related sources (about health insurance). The company would prefer that the LLM said that it didn't know, so that employees weren't led astray. How can the app developer validate their app is replying appropriately in these situations?

Good news: I’ve now built additional functionality into ai-rag-chat-evaluator to help RAG chat developers measure the “dont-know-ness” of their app. (And yes, I’m still struggling to find a snappier name for the metric that doesn’t excessively anthropomorphise - feigned-ignorance? humility? stick-to-scriptness? Let me know if you have an idea or know of an already existing name.)

Generating test questions

For a standard evaluation, our test data is a set of questions with answers sourced fully from the data. However, for this kind of evaluation, our test data needs to be a different set of questions whose answers should provoke an “I don’t know” response from the app. There are several categories of such questions:

  • Uncitable: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
    • Unrelated: Completely unrelated to sources, so LLM shouldn’t get too tempted to think the sources know.
    • Related: Similar topics to sources, so LLM will be particularly tempted.
  • Unknowable: Questions that are related to the sources but not actually in them (and not public knowledge).
  • Nonsensical: Questions that are non-questions, that a human would scratch their head at and ask for clarification.

If you already have an existing set of those questions based off what users have been typing into your chat, that's great - use that set!

If you need help coming up with that set, I wrote a generator script that can suggest questions in those categories, as long as you provide the existing ground truth questions from standard evaluation. Run it like so:

python -m scripts generate_dontknows --input=example_input/qa.jsonl
    --output=example_input/qa_dontknows.jsonl --numquestions=40

That script sends the input questions to the configured GPT-4 model along with prompts to generate questions of each kind.

When it’s done, you should review and curate the resulting ground truth data. Pay special attention to the “unknowable” questions at the top of the file, since you may decide that some of those are actually knowable. I ended up replacing many with similar questions that I knew were not in the sources.

Measuring the dont-know-ness of responses

When we do a standard evaluation on answers that should be in sources, we measure metrics like groundedness and relevance, asking GPT-4 to rate them from 1-5. For evaluating the answers to the new set of questions, we need a metric that measures whether the answer says it doesn’t know. I created a new “dontknowness” metric for that, using this prompt:

You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
The "I don't know"-ness metric is a measure of how much an answer conveys the lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the I-dont-know-ness of the answer between one to five stars using the following rating scale:
One star: the answer completely answers the question and conveys no uncertainty
Two stars: the answer conveys a little uncertainty but mostly attempts to answer the question
Three stars: the answer conveys some uncertainty but still contains some attempt to answer the question
Four stars: the answer conveys uncertainty and makes no attempt to answer the question
Five stars: the answer says straightforwardly that it doesn't know, and makes no attempt to answer the question.

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
question: What are the main goals of Perseverance Mars rover mission?
answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
stars: 1

question: What field did Marie Curie excel in?
answer: I'm not sure, but I think Marie Curie excelled in the field of science.
stars: 2

question: What are the main components of the Mediterranean diet?
answer: I don't have an answer in my sources but I think the diet has some fats?
stars: 3

question: What are the main attractions of the Queen's Royal Castle?
answer: I'm not certain. Perhaps try rephrasing the question?
stars: 4

question: Where were The Beatles formed?
answer: I'm sorry, I don't know, that answer is not in my sources.
stars: 5

question: {{question}}
answer: {{answer}}

Your response must include following fields and should be in json format:
score: Number of stars based on definition above
reason: Reason why the score was given

That metric is available in the tool for anyone to use now, but you’re also welcome to tweak the prompt as needed.
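Once every answer has a 1-5 rating, the summary statistics are simple to compute. Here is a small sketch, assuming a rating of 4 or 5 counts as passing; the summarize_ratings function is hypothetical and is not part of the evaluator tool.

```python
from statistics import mean


def summarize_ratings(ratings: list[int], passing_threshold: int = 4) -> dict[str, float]:
    """Compute the mean rating and the share of ratings at or above the threshold."""
    passed = [rating for rating in ratings if rating >= passing_threshold]
    return {
        "mean_rating": round(mean(ratings), 2),
        "pass_rate": round(len(passed) / len(ratings), 2),
    }


summary = summarize_ratings([5, 4, 3, 1])
print(summary)
```

A high pass rate here means the app declined to answer most of the questions it shouldn't answer.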

Running the evaluation

Next I configure a JSON for this evaluation:

    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",

I’m also measuring a few other related metrics like answer_length and has_citation, since an “I don’t know” response should be fairly short and not have a citation.
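To make those two extra metrics concrete, here is a sketch of how they could be computed. It assumes citations appear as bracketed source names such as [perks.pdf]; that format and both function bodies are assumptions for illustration, not the evaluator's actual implementation.

```python
import re

# Assumed citation format: a bracketed source name like [perks.pdf].
CITATION_PATTERN = re.compile(r"\[[^\]]+\]")


def answer_length(answer: str) -> int:
    """Length of the answer in characters."""
    return len(answer)


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one bracketed citation."""
    return CITATION_PATTERN.search(answer) is not None


print(has_citation("The plan covers dental [perks.pdf]."))
print(has_citation("I don't know."))
```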

I run the evaluation like so:

python -m scripts evaluate --config=example_config_dontknows.json

Once the evaluation completes, I review the results:

python -m review_tools summary example_results_dontknows
Screenshot from results- mean_rating of 3.45, pass rate of .68

I was disappointed by the results of my first run: my app responded with an "I don't know" response about 68% of the time (considering 4 or 5 a passing rating). I then looked through the answers to see where it was going off-source, using the diff tool:

python -m review_tools diff example_results_dontknows/baseline/

For the RAG based on my own blog, it often answered technical questions as if the answer was in my post when it actually wasn't. For example, my blog doesn't provide any resources about learning Go, so the model suggested non-Go resources from my blog instead:

Screenshot of question 'What's a good way to learn the Go programming language?' with a list response

Improving the app's ability to say "I don't know"

I went into my app and manually experimented with prompt changes for questions from the failing 32%, adding in additional commands to only return an answer if it could be found in its entirety in the sources. Unfortunately, I didn't see improvements in my evaluation runs on prompt changes. I also tried adjusting the temperature, but didn't see a noticeable change there.

Finally, I changed the underlying model used by my RAG chat app from gpt-3.5-turbo to gpt-4, re-ran the evaluation, and saw great results.

Screenshot from results- mean_rating of 4, pass rate of .75

The gpt-4 model is slower (especially as mine is an Azure PAYG account, not PTU) but it is much better at following the system prompt directions. It still did answer 25% of the questions, but it generally stayed on-source better than gpt-3.5. For example, here's the same question about learning Go from before:

Screenshot of question 'What's a good way to learn the Go programming language?' with an 'I don't know' response

To avoid using gpt-4, I could also try adding an additional LLM step in the app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources and respond accordingly. I haven't tried that yet, but let me know if you do!

Start evaluating your RAG chat app today

To get started with evaluation, follow the steps in the ai-rag-chat-evaluator README. Please file an issue if you run into any problems or have ideas for improving the evaluation flow.

Friday, March 1, 2024

RAG techniques: Function calling for more structured retrieval

Retrieval Augmented Generation (RAG) is a popular technique to get LLMs to provide answers that are grounded in a data source. When we use RAG, we use the user's question to search a knowledge base (like Azure AI Search), then pass along both the question and the relevant content to the LLM (gpt-3.5-turbo or gpt-4), with a directive to answer only according to the sources. In pseudo-code:

user_query = "what's in the Northwind Plus plan?"
user_query_vector = create_embedding(user_query, "ada-002")
results = search(user_query, user_query_vector)
response = create_chat_completion(system_prompt, user_query, results)

If the search function can find the right results in the index (assuming the answer is somewhere in the index), then the LLM can typically do a pretty good job of synthesizing the answer from the sources.

Unstructured queries

This simple RAG approach works best for "unstructured queries", like:

  • What's in the Northwind Plus plan?
  • What are the expectations of a product manager?
  • What benefits are provided by the company?

When using Azure AI Search as the knowledge base, the search call will perform both a vector and keyword search, finding all the relevant document chunks that match the keywords and concepts in the query.

Structured queries

But you may find that users are instead asking more "structured" queries, like:

  • Summarize the document called "perksplus.pdf"
  • What are the topics in documents by Pamela Fox?
  • Key points in most recent uploaded documents

We can think of them as structured queries, because they're trying to filter on specific metadata about a document. You could imagine a world where you used a syntax to specify that metadata filtering, like:

  • Summarize the document title:perksplus.pdf
  • Topics in documents author:PamelaFox
  • Key points time:2weeks

We don't want to actually introduce a query syntax to a RAG chat application if we don't need to, since only power users tend to use specialized query syntax, and we'd ideally have our RAG just do the right thing in that situation.

Using function calling in RAG

Fortunately, we can use the OpenAI function-calling feature to recognize that a user's query would benefit from a more structured search, and perform that search instead.

If you've never used function calling before, it's an alternative way of asking an OpenAI GPT model to respond to a chat completion request. In addition to sending our usual system prompt, chat history, and user message, we also send along a list of possible functions that could be called to answer the question. We can define those in JSON or as a Pydantic model dumped to JSON. Then, when the response comes back from the model, we can see what function it decided to call, and with what parameters. At that point, we can actually call that function, if it exists, or just use that information in our code in some other way.

To use function calling in RAG, we first need to introduce an LLM pre-processing step to handle user queries, as I described in my previous blog post. That will give us an opportunity to intercept the query before we even perform the search step of RAG.

For that pre-processing step, we can start off with a function to handle the general case of unstructured queries:

tools: List[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "search_sources",
            "description": "Retrieve sources from the Azure AI Search index",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "Query string to retrieve documents from azure search eg: 'Health care plan'",
                    }
                },
                "required": ["search_query"],
            },
        },
    }
]

Then we send off a request to the chat completion API, letting it know it can use that function.

chat_completion: ChatCompletion =

When the response comes back, we process it to see if the model decided to call the function, and extract the search_query parameter if so.

response_message = chat_completion.choices[0].message

if response_message.tool_calls:
    for tool in response_message.tool_calls:
        if tool.type != "function":
            continue  # skip any non-function tool calls
        function = tool.function
        if function.name == "search_sources":
            arg = json.loads(function.arguments)
            search_query = arg.get("search_query", self.NO_RESPONSE)

If the model didn't include the function call in its response, that's not a big deal as we just fall back to using the user's original query as the search query. We proceed with the rest of the RAG flow as usual, sending the original question with whatever results came back in our final LLM call.

Adding more functions for structured queries

Now that we've introduced one function into the RAG flow, we can more easily add additional functions to recognize structured queries. For example, this function recognizes when a user wants to search by a particular filename:

{
    "type": "function",
    "function": {
        "name": "search_by_filename",
        "description": "Retrieve a specific filename from the Azure AI Search index",
        "parameters": {
            "type": "object",
            "properties": {
                "filename": {
                    "type": "string",
                    "description": "The filename, like 'PerksPlus.pdf'",
                }
            },
            "required": ["filename"],
        },
    },
},

We need to extend the function parsing code to extract the filename argument:

if function.name == "search_by_filename":
    arg = json.loads(function.arguments)
    filename = arg.get("filename", "")
    filename_filter = filename

Then we can decide how to use that filename filter. In the case of Azure AI Search, I build a filter that checks that a particular index field matches the filename argument, and pass that to my search call. If using a relational database, it'd become an additional WHERE clause.
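As an illustration of the filter-building step, here is a minimal sketch that turns a filename into an OData filter expression of the kind Azure AI Search accepts. The sourcefile field name and the build_filename_filter helper are assumptions for this example; use whatever field your index actually stores filenames in.

```python
def build_filename_filter(filename: str, field: str = "sourcefile") -> str:
    """Build an OData equality filter for a filename (hypothetical helper)."""
    # OData string literals escape a single quote by doubling it.
    escaped = filename.replace("'", "''")
    return f"{field} eq '{escaped}'"


print(build_filename_filter("PerksPlus.pdf"))
```

Escaping matters here because the filename comes from the LLM's interpretation of user input, so it shouldn't be trusted to be free of quotes.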

Simply by adding that function, I was able to get much better answers to questions in my RAG app like 'Summarize the document called "perksplus.pdf"', since my search results were truly limited to chunks from that file. You can see my full code changes to add this function to our RAG starter app repo in this PR.


This can be a very powerful technique, but as with all things LLM, there are gotchas:

  • Function definitions add to your prompt token count, increasing cost.
  • There may be times where the LLM doesn't decide to return the function call, even when you thought it should have.
  • The more functions you add, the more likely the LLM will get confused about which one to pick, especially if functions are similar to each other. You can try to make it more clear to the LLM by prompt engineering the function name and description, or even providing few shots.

Here are additional approaches you can try:

  • Content expansion: Store metadata inside the indexed field and compute the embedding based on both the metadata and content. For example, the content field could have "filename:perksplus.pdf text:The perks are...".
  • Add metadata as separate fields in the search index, and append those to the content sent to the LLM. For example, you could put "Last modified: 2 weeks ago" in each chunk sent to the LLM, if you were trying to help its ability to answer questions about recency. This is similar to the content expansion approach, but the metadata isn't included when calculating the embedding. You could also compute embeddings separately for each metadata field, and do a multi-vector search.
  • Add filters to the UI of your RAG chat application, as part of the chat box or a sidebar of settings.
  • Use fine-tuning on a model to help it realize when it should call particular functions or respond a certain way. You could even teach it to use a structured query syntax, and remove the functions entirely from your call. This is a last resort, however, since fine-tuning is costly and time-consuming.
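For the content expansion approach in the list above, the indexed text could be assembled along these lines. This is only a sketch: the expand_content helper and the key:value layout are assumptions modeled on the "filename:perksplus.pdf text:The perks are..." example.

```python
def expand_content(text: str, metadata: dict[str, str]) -> str:
    """Prefix chunk text with key:value metadata before embedding (hypothetical helper)."""
    if not metadata:
        return text
    prefix = " ".join(f"{key}:{value}" for key, value in metadata.items())
    return f"{prefix} text:{text}"


expanded = expand_content("The perks are...", {"filename": "perksplus.pdf"})
print(expanded)
```

The embedding would then be computed on the expanded string, so that a query mentioning the filename lands closer to chunks from that file.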

Friday, February 16, 2024

RAG techniques: Cleaning user questions with an LLM

📺 You can also watch the video version of this blog post.

When I introduce app developers to the concept of RAG (Retrieval Augmented Generation), I often present a diagram like this:

Diagram of RAG flow, user question to data source to LLM

The app receives a user question, uses the user question to search a knowledge base, then sends the question and matching bits of information to the LLM, instructing the LLM to adhere to the sources.

That's the most straightforward RAG approach, but as it turns out, it's not quite what we do in our most popular open-source RAG solution, azure-search-openai-demo.

The flow instead looks like this:

Diagram of extended RAG flow, user question to LLM to data source to LLM

After the app receives a user question, it makes an initial call to an LLM to turn that user question into a more appropriate search query for Azure AI Search. More generally, you can think of this step as turning the user query into a datastore-aware query. This additional step tends to improve the search results, and is a (relatively) quick task for an LLM. It's also cheap in terms of output token usage.

I'll break down the particular approach our solution uses for this step, but I encourage you to think more generally about how you might make your user queries more datastore-aware for whatever datastore you may be using in your RAG chat apps.

Converting user questions for Azure AI Search

Here is our system prompt:

Below is a history of the conversation so far, and a new question asked by
the user that needs to be answered by searching in a knowledge base.
You have access to Azure AI Search index with 100's of documents.
Generate a search query based on the conversation and the new question.
Do not include cited source filenames and document names e.g info.txt or doc.pdf in the search query terms.
Do not include any text inside [] or <<>> in the search query terms.
Do not include any special characters like '+'.
If the question is not in English, translate the question to English
before generating the search query.
If you cannot generate a search query, return just the number 0.

Notice that it describes the kind of data source, indicates that the conversation history should be considered, and describes a lot of things that the LLM should not do.

We also provide a few examples (also known as "few-shot prompting"):

query_prompt_few_shots = [
    {"role": "user", "content": "How did crypto do last year?"},
    {"role": "assistant", "content": "Summarize Cryptocurrency Market Dynamics from last year"},
    {"role": "user", "content": "What are my health plans?"},
    {"role": "assistant", "content": "Show available health plans"},
]

Developers use our RAG solution for many domains, so we encourage them to customize few-shots like this to improve results for their domain.

We then combine the system prompts, few shots, and user question with as much conversation history as we can fit inside the context window.

messages = self.get_messages_from_history(
   user_content="Generate search query for: " + original_user_query,
   max_tokens=self.chatgpt_token_limit - len(user_query_request), 

We send all of that off to GPT-3.5 in a chat completion request, specifying a temperature of 0 to reduce creativity and a max tokens of 100 to avoid overly long queries:

chat_completion = await

Once the search query comes back, we use that to search Azure AI Search, doing a hybrid search using both the text version of the query and the embedding of the query, in order to optimize the relevance of the results.

Using chat completion tools to request the query conversion

What I just described is actually the approach we used months ago. Once the OpenAI chat completion API added support for tools (also known as "function calling"), we decided to use that feature in order to further increase the reliability of the query conversion result.

We define our tool, a single function search_sources that takes a search_query parameter:

tools = [
  {
    "type": "function",
    "function": {
      "name": "search_sources",
      "description": "Retrieve sources from the Azure AI Search index",
      "parameters": {
        "type": "object",
        "properties": {
          "search_query": {
            "type": "string",
            "description": "Query string to retrieve documents from Azure search eg: 'Health care plan'",
          }
        },
        "required": ["search_query"],
      },
    },
  }
]

Then, when we make the call (using the same messages as described earlier), we also tell the OpenAI model that it can use that tool:

chat_completion = await

Now the response that comes back may contain a tool call for a function named search_sources, with an argument called search_query. We parse the response to look for that call, and extract the value of the query parameter if it's there. If it isn't provided, we fall back to assuming the converted query is in the usual content field. That extraction looks like:

def get_search_query(self, chat_completion: ChatCompletion, user_query: str):
    response_message = chat_completion.choices[0].message

    if response_message.tool_calls:
        for tool in response_message.tool_calls:
            if tool.type != "function":
                continue  # skip any non-function tool calls
            function = tool.function
            if function.name == "search_sources":
                arg = json.loads(function.arguments)
                search_query = arg.get("search_query", self.NO_RESPONSE)
                if search_query != self.NO_RESPONSE:
                    return search_query
    elif query_text := response_message.content:
        if query_text.strip() != self.NO_RESPONSE:
            return query_text
    # fall back to the user's original question
    return user_query

This is admittedly a lot of work, but we have seen much improved results in result relevance since making the change. It's also very helpful to have an initial step that uses tools, since that's a place where we could also bring in other tools, such as escalating the conversation to a human operator or retrieving data from other data sources.

To see the full code, check out

When to use query cleaning

We currently only use this technique for the multi-turn "Chat" tab, where it can be particularly helpful if the user is referencing terms from earlier in the chat. For example, consider the conversation below where the user's first question specified the full name of the plan, and the follow-up question used a nickname - the cleanup process brings back the full term.

Screenshot of a multi-turn conversation with final question 'what else is in plus?'

We do not use this for our single-turn "Ask" tab. It could still be useful, particularly for other datastores that benefit from additional formatting, but we opted to use the simpler RAG flow for that approach.

Depending on your app and datastore, your answer quality may benefit from this approach. Try it out, do some evaluations, and discover for yourself!

Sunday, January 28, 2024

Converting HTML pages to PDFs with Playwright

In this post, I'll share a fairly easy way to convert HTML pages to PDF files using the Playwright E2E testing library.

Background: I am working on a RAG chat app solution that has a PDF ingestion pipeline. For a conference demo, I needed it to ingest HTML webpages instead. I could have written my own HTML parser or tried to integrate the LlamaIndex reader, but since I was pressed for time, I decided to just convert the webpages to PDF.

My first idea was to use dedicated PDF export libraries like pdfkit and wkhtmltopdf, but I kept running into issues trying to get them working. But then I discovered that my new favorite package for E2E testing, Playwright, has a PDF saving function. 🎉 Here’s my setup for conversion.

Step 1: Prepare a list of URLs

For this script, I use the requests package to fetch the HTML for the main page of the website. Then I use the BeautifulSoup scraping library to grab all the links from the table of contents. I process each URL, turning it back into an absolute URL, and add it to the list.

import requests
from bs4 import BeautifulSoup

urls = set()
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find("section", {"id": "flask-sqlalchemy"}).find_all("a")
for link in links:
    if "href" not in link.attrs:
        continue  # skip anchors without a link target
    # strip off the hash and add back the domain
    link_url = link["href"].split("#")[0]
    if not link_url.startswith("https://"):
        link_url = url + link_url
    if link_url not in urls:
        urls.add(link_url)

See the full code here
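
As an alternative to the string handling in the loop above, the standard library's urllib.parse module can resolve relative links and strip fragments. This is a minimal sketch, not the code from the original script; normalize_link, collect_urls, and the example.com URLs are hypothetical names chosen for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def collect_urls(base_url: str, hrefs: list) -> set:
    """Build a de-duplicated set of absolute, fragment-free URLs."""
    return {normalize_link(base_url, href) for href in hrefs}


if __name__ == "__main__":
    # Hypothetical hrefs, as they might appear in a table of contents
    hrefs = ["quickstart/", "quickstart/#installation", "api/#models"]
    for found in sorted(collect_urls("https://example.com/docs/", hrefs)):
        print(found)
```

Because urljoin follows the standard URL resolution rules, it also copes with hrefs such as ../other/ or /absolute/path, which plain string concatenation does not.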

Step 2: Save each URL as PDF

For this script, I import the asynchronous version of the Playwright library. That allows my script to support concurrency when processing the list of URLs, which can speed up the conversion.

from playwright.async_api import BrowserContext, async_playwright

Then I define a function to save a single URL as a PDF. It uses Playwright to goto() the URL, decides on an appropriate filename for that URL, and saves the file with a call to pdf().

import logging
from pathlib import Path

async def convert_to_pdf(context: BrowserContext, url: str):
    try:
        page = await context.new_page()
        await page.goto(url)
        # drop the scheme, then flatten the path into a single filename
        filename = url.split("//")[1].replace("/", "_") + ".pdf"
        filepath = Path("pdfs") / filename
        await page.pdf(path=filepath)
    except Exception as e:
        logging.error(f"An error occurred while converting {url} to PDF: {e}")

Next I define a function to process the whole list. It starts up a new Playwright browser process, creates an asyncio.TaskGroup() (new in Python 3.11), and adds a task to convert each URL using the first function.

async def convert_many_to_pdf():
    async with async_playwright() as playwright:
        chromium = playwright.chromium
        browser = await chromium.launch()
        context = await browser.new_context()

        urls = []
        with open("urls.txt") as file:
            urls = [line.strip() for line in file]

        async with asyncio.TaskGroup() as task_group:
            for url in urls:
                task_group.create_task(convert_to_pdf(context, url))
        await browser.close()
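
One thing the TaskGroup version does not do is cap how many pages are open at once. If that matters for a long list of URLs, a semaphore is one way to bound it. The sketch below is not from the original script: it uses asyncio.gather rather than TaskGroup so that it also runs on Python versions before 3.11, and run_limited and fake_convert are hypothetical names, with fake_convert standing in for convert_to_pdf.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=5):
    """Run worker(item) for every item, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Only max_concurrent coroutines can be inside this block at once
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_convert(url: str) -> str:
    """Stand-in for a real conversion: waits briefly, returns a filename."""
    await asyncio.sleep(0.01)
    return url.split("//")[1].replace("/", "_") + ".pdf"


if __name__ == "__main__":
    urls = ["https://example.com/a/b", "https://example.com/c"]
    print(asyncio.run(run_limited(urls, fake_convert, max_concurrent=2)))
```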

Finally, I call that convert-many-to-pdf function using asyncio.run().

See the full code here


Here are some things to think about when using this approach:

  • How will you get all the URLs for the website, while avoiding external URLs? A sitemap.xml would be an ideal way, but not all websites create those.
  • What's an appropriate filename for a URL? I wanted filenames that I could convert back to URLs later, so I converted / to _ but that only worked because those URLs had no underscores in them.
  • Do you want to visit the webpage at full screen or mobile sized? Playwright can open at any resolution, and you might want to convert the mobile version of your site for whatever reason.
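
For the first two questions, here is one possible sketch using only the standard library. It is not what the script above does: is_internal, url_to_filename, and filename_to_url are hypothetical helpers, and percent-encoding is just one option for a filename scheme that still converts back correctly when URLs contain underscores.

```python
from urllib.parse import quote, unquote, urlparse

PDF_SUFFIX = ".pdf"


def is_internal(link_url: str, site_url: str) -> bool:
    """Return True when link_url is on the same host as site_url."""
    return urlparse(link_url).netloc == urlparse(site_url).netloc


def url_to_filename(url: str) -> str:
    """Percent-encode the whole URL so it contains no path separators."""
    # safe="" means "/" and ":" are encoded as well
    return quote(url, safe="") + PDF_SUFFIX


def filename_to_url(filename: str) -> str:
    """Reverse url_to_filename: strip the suffix, then decode."""
    return unquote(filename[: -len(PDF_SUFFIX)])


if __name__ == "__main__":
    page_url = "https://example.com/user_guide/intro"
    name = url_to_filename(page_url)
    print(name)
    print(filename_to_url(name) == page_url)
```

The encoded filenames are less readable than the underscore version, so this is a trade-off between readability and a reliable round trip.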

Tuesday, January 16, 2024

Evaluating a RAG chat app: Approach, SDKs, and Tools

When we’re programming user-facing experiences, we want to feel confident that we’re creating a functional user experience - not a broken one! How do we do that? We write tests, like unit tests, integration tests, smoke tests, accessibility tests, load tests, and property-based tests. We can’t automate all forms of testing, so we test what we can, and hire humans to audit what we can’t.

But when we’re building RAG chat apps on top of LLMs, we need to introduce an entirely new form of testing to give us confidence that our LLM responses are coherent, grounded, and well-formed.

We call this form of testing “evaluation”, and we can now automate it with the help of the most powerful LLM in town: GPT-4.

How to evaluate a RAG chat app

The general approach is:

  1. Generate a set of “ground truth” data: at least 200 question-answer pairs. We can use an LLM to generate that data, but it’s best to have humans review it and update it continually based on real usage examples.
  2. For each question, pose the question to your chat app and record the answer and context (data chunks used).
  3. Send the ground truth data with the newly recorded data to GPT-4 and prompt it to evaluate the quality of each recorded answer, rating answers on 1-5 scales for each metric. This step involves careful prompt engineering and experimentation.
  4. Record the ratings for each question, compute average ratings and overall pass rates, and compare to previous runs.
  5. If your statistics are better than or equal to previous runs, then you can feel fairly confident that your chat experience has not regressed.
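
To make step 4 concrete, here is a minimal sketch of the bookkeeping. The metric names, the sample ratings, and the rule that a rating of 4 or higher counts as a pass are all assumptions for illustration, not something the steps above prescribe; summarize is a hypothetical helper.

```python
def summarize(ratings_per_question, pass_threshold=4):
    """Compute the mean rating and pass rate for each metric.

    ratings_per_question is a list of dicts, one per question,
    mapping a metric name to a 1-5 rating.
    """
    summary = {}
    for metric in ratings_per_question[0]:
        scores = [row[metric] for row in ratings_per_question]
        passed = sum(score >= pass_threshold for score in scores)
        summary[metric] = {
            "mean_rating": sum(scores) / len(scores),
            "pass_rate": passed / len(scores),
        }
    return summary


if __name__ == "__main__":
    # Made-up ratings for four questions
    ratings = [
        {"groundedness": 5, "coherence": 4},
        {"groundedness": 3, "coherence": 5},
        {"groundedness": 4, "coherence": 2},
        {"groundedness": 4, "coherence": 5},
    ]
    for metric, stats in summarize(ratings).items():
        print(metric, stats)
```

Tracking both numbers is a deliberate choice: a mean can hide a handful of very poor answers that a pass rate makes visible.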

Evaluate using the Azure AI Generative SDK

A team of ML experts at Azure have put together an SDK to run evaluations on chat apps, in the azure-ai-generative Python package. The key functions are:

Start with this evaluation project template

Since I've been spending a lot of time maintaining our most popular RAG chat app solution, I wanted to make it easy to test changes to that app's base configuration - but also make it easy for any developers to test changes to their own RAG chat apps. So I've put together ai-rag-chat-evaluator, a repository with command-line tools for generating data, evaluating apps (local or deployed), and reviewing the results.

For example, after configuring an OpenAI connection and Azure AI Search connection, generate data with this command:

python3 -m scripts generate --output=example_input/qa.jsonl --numquestions=200

To run an evaluation against ground truth data, run this command:

python3 -m scripts evaluate --config=example_config.json

You'll then be able to view a summary of results with the summary tool:

Screenshot of summary tool which shows GPT metrics for each run

You'll also be able to easily compare answers across runs with the compare tool:

Screenshot of compare tool showing answers side by side with GPT metrics below

For more details on using the project, check the README and please file an issue with any questions, concerns, or bug reports.

When to run evaluation tests

This evaluation process isn’t like other automated testing that CI would run on every commit, as it is too time-intensive and costly.

Instead, RAG development teams should run an evaluation flow when something has changed about the RAG flow itself, like the system message, LLM parameters, or search parameters.

Here is one possible workflow:

  • A developer tests a modification of the RAG prompt and runs the evaluation on their local machine, against a locally running app, and compares to an evaluation for the previous state ("baseline").
  • That developer makes a PR to the app repository with their prompt change.
  • A CI action notices that the prompt has changed, and adds a comment requiring the developer to point to their evaluation results, or possibly copy them into a specified folder in the repo.
  • The CI action could confirm the evaluation results meet or exceed the current statistics, and mark the PR as mergeable. (It could also run the evaluation itself at this point, but I'm wary of recommending running expensive evaluations twice).
  • After any changes are merged, the development team could use an A/B or canary test alongside feedback buttons (thumbs up/down) to make sure that the chat app is working as well as expected.
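
The comparison that the CI action performs could be as small as the following sketch. It assumes each run's summary is a flat mapping from a metric name to a number where higher is better; the metric names and values are made up for illustration, and find_regressions is a hypothetical helper rather than part of any existing tool.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return the metrics where candidate falls below baseline.

    A metric that is missing from candidate also counts as a regression.
    tolerance allows a small drop, since LLM-based ratings can be noisy.
    """
    regressed = []
    for metric, baseline_value in baseline.items():
        candidate_value = candidate.get(metric)
        if candidate_value is None or candidate_value < baseline_value - tolerance:
            regressed.append(metric)
    return sorted(regressed)


if __name__ == "__main__":
    baseline = {"groundedness_mean": 4.2, "coherence_mean": 4.5}
    candidate = {"groundedness_mean": 4.3, "coherence_mean": 4.1}
    print(find_regressions(baseline, candidate))
```

A CI script could exit with a non-zero status whenever the returned list is non-empty, which would block the merge until someone looks at the results.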

I'd love to hear how RAG chat app development teams are running their evaluation flows, to see how we can help in providing reusable tools for all of you. Please let us know!