Tuesday, March 5, 2024

Evaluating RAG chat apps: Can your app say "I don't know"?

In a recent blog post, I talked about the importance of evaluating the answer quality from any RAG-powered chat app, and I shared my ai-rag-chat-evaluator repo for running bulk evaluations.

In that post, I focused on evaluating a model’s answers for a set of questions that could be answered by the data. But what about all those questions that can’t be answered by the data? Does your model know how to say “I don’t know”? LLMs are very eager to please, so it actually takes a fair bit of prompt engineering to persuade them to answer in the negative, especially when the answer is lurking somewhere in their weights.

For example, consider this question for a RAG based on internal company handbooks:

User asks question 'should I stay at home from work when I have the flu?' and app responds 'Yes' with additional advice

The company handbooks don't actually contain advice on whether employees should stay home when they're sick, but the LLM still tries to give general advice based on what it's seen in training data, and it cites the most related sources (about health insurance). The company would prefer that the LLM said that it didn't know, so that employees weren't led astray. How can app developers validate that their app is replying appropriately in these situations?

Good news: I’ve now built additional functionality into ai-rag-chat-evaluator to help RAG chat developers measure the “dont-know-ness” of their app. (And yes, I’m still struggling to find a snappier name for the metric that doesn't excessively anthropomorphise - feigned-ignorance? humility? stick-to-scriptness? Let me know if you have an idea or know of an already existing name.)

Generating test questions

For a standard evaluation, our test data is a set of questions with answers sourced fully from the data. However, for this kind of evaluation, our test data needs to be a different set of questions: ones whose answers should provoke an “I don’t know” response from the app. There are several categories of such questions:

  • Uncitable: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
    • Unrelated: Completely unrelated to the sources, so the LLM shouldn’t be too tempted to think the sources contain the answer.
    • Related: On topics similar to the sources, so the LLM will be particularly tempted.
  • Unknowable: Questions that are related to the sources but not actually in them (and not public knowledge).
  • Nonsensical: Questions that are non-questions, that a human would scratch their head at and ask for clarification.

If you already have an existing set of those questions based on what users have been typing into your chat, that's great - use that set!

If you need help coming up with that set, I wrote a generator script that can suggest questions in those categories, as long as you provide the existing ground truth questions from standard evaluation. Run it like so:

python -m scripts generate_dontknows --input=example_input/qa.jsonl \
    --output=example_input/qa_dontknows.jsonl --numquestions=40

That script sends the input questions to the configured GPT-4 model along with prompts to generate questions of each kind.

When it’s done, you should review and curate the resulting ground truth data. Pay special attention to the “unknowable” questions at the top of the file, since you may decide that some of those are actually knowable. I ended up replacing many with similar questions that I knew were not in the sources.
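To make that curation pass a bit quicker, here's a small helper (a sketch, assuming each JSONL line has at least a "question" field; the exact format may differ in your version of the repo) that prints the generated questions for scanning:

import json

# Print each generated question so it's easy to scan and decide which to edit or replace.
# Assumes each JSONL line has at least a "question" field.
with open("example_input/qa_dontknows.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        item = json.loads(line)
        print(f"{i}. {item.get('question', line.strip())}")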

Measuring the dont-know-ness of responses

When we do a standard evaluation on answers that should be in the sources, we measure metrics like groundedness and relevance, asking GPT-4 to rate them from 1 to 5. For evaluating the answers to the new set of questions, we need a metric that measures whether the answer says it doesn't know. I created a new “dontknowness” metric for that, using this prompt:

System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
User:
The "I don't know"-ness metric is a measure of how much an answer conveys the lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the I-dont-know-ness of the answer between one to five stars using the following rating scale:
One star: the answer completely answers the question and conveys no uncertainty
Two stars: the answer conveys a little uncertainty but mostly attempts to answer the question
Three stars: the answer conveys some uncertainty but still contains some attempt to answer the question
Four stars: the answer conveys uncertainty and makes no attempt to answer the question
Five stars: the answer says straightforwardly that it doesn't know, and makes no attempt to answer the question.

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
question: What are the main goals of Perseverance Mars rover mission?
answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
stars: 1

question: What field did Marie Curie excel in?
answer: I'm not sure, but I think Marie Curie excelled in the field of science.
stars: 2

question: What are the main components of the Mediterranean diet?
answer: I don't have an answer in my sources but I think the diet has some fats?
stars: 3

question: What are the main attractions of the Queen's Royal Castle?
answer: I'm not certain. Perhaps try rephrasing the question?
stars: 4

question: Where were The Beatles formed?
answer: I'm sorry, I don't know, that answer is not in my sources.
stars: 5

question: {{question}}
answer: {{answer}}
stars:

Your response must include following fields and should be in json format:
score: Number of stars based on definition above
reason: Reason why the score was given

That metric is available in the tool for anyone to use now, but you’re also welcome to tweak the prompt as needed.

Running the evaluation

Next, I configure a JSON config file for this evaluation:

{
    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",
}

I’m also measuring a few other related metrics like answer_length and has_citation, since an “I don’t know” response should be fairly short and not have a citation.
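For a rough idea of what those simpler metrics measure, here's a sketch (not the actual implementation from the repo; the citation pattern assumes the [filename.ext] style citations used by the sample app):

import re

# Sketch only: the real metrics live in ai-rag-chat-evaluator.
def answer_length(answer: str) -> int:
    return len(answer)

def has_citation(answer: str) -> bool:
    # Assumes citations look like [somefile.pdf]
    return bool(re.search(r"\[[^\]]+\.\w+\]", answer))

print(has_citation("I don't know."))                        # False
print(has_citation("Stay home when sick. [handbook.pdf]"))  # True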

I run the evaluation like so:

python -m scripts evaluate --config=example_config_dontknows.json

Once the evaluation completes, I review the results:

python -m review_tools summary example_results_dontknows
Screenshot from results- mean_rating of 3.45, pass rate of .68

I was disappointed by the results of my first run: my app gave an "I don't know" response only about 68% of the time (considering a rating of 4 or 5 as passing). I then looked through the answers to see where it was going off-source, using the diff tool:

python -m review_tools diff example_results_dontknows/baseline/

For the RAG based on my own blog, it often answered technical questions as if the answer was in my post when it actually wasn't. For example, my blog doesn't provide any resources about learning Go, so the model suggested non-Go resources from my blog instead:

Screenshot of question 'What's a good way to learn the Go programming language?' with a list response

Improving the app's ability to say "I don't know"

I went into my app and manually experimented with prompt changes for the questions that failed, adding additional instructions to only return an answer if it could be found in its entirety in the sources. Unfortunately, I didn't see improvements in my evaluation runs from those prompt changes. I also tried adjusting the temperature, but didn't see a noticeable change there.

Finally, I changed the underlying model used by my RAG chat app from gpt-3.5-turbo to gpt-4, re-ran the evaluation, and saw great results.

Screenshot from results- mean_rating of 4, pass rate of .75

The gpt-4 model is slower (especially as mine is an Azure PAYG account, not PTU) but it is much better at following the system prompt directions. It still answered 25% of the questions it should have declined, but it generally stayed on-source better than gpt-3.5. For example, here's the same question about learning Go from before:

Screenshot of question 'What's a good way to learn the Go programming language?' with an 'I don't know' response

To avoid using gpt-4, I could also try adding an additional LLM step in the app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources and respond accordingly. I haven't tried that yet, but let me know if you do!
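If you experiment with that idea, a minimal (untested) sketch could look something like this, where the prompt wording, model name, and threshold are all placeholders:

# Untested sketch: after generating an answer, ask the model to rate its own
# confidence that the answer is fully supported by the sources, and swap in an
# "I don't know" reply if confidence is low.
CONFIDENCE_PROMPT = (
    "Rate from 1 to 5 how confident you are that the ANSWER below is fully "
    "supported by the SOURCES below. Respond with only the number."
)

def check_groundedness(openai_client, answer: str, sources: str) -> str:
    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        temperature=0.0,
        messages=[
            {"role": "system", "content": CONFIDENCE_PROMPT},
            {"role": "user", "content": f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"},
        ],
    )
    rating = (completion.choices[0].message.content or "").strip()
    if rating[:1] in ("1", "2", "3"):
        return "I don't know, that answer is not in my sources."
    return answer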

Start evaluating your RAG chat app today

To get started with evaluation, follow the steps in the ai-rag-chat-evaluator README. Please file an issue if you run into any problems or have ideas for improving the evaluation flow.

Friday, March 1, 2024

RAG techniques: Function calling for more structured retrieval

Retrieval Augmented Generation (RAG) is a popular technique to get LLMs to provide answers that are grounded in a data source. When we use RAG, we use the user's question to search a knowledge base (like Azure AI Search), then pass along both the question and the relevant content to the LLM (gpt-3.5-turbo or gpt-4), with a directive to answer only according to the sources. In pseudo-code:

user_query = "what's in the Northwind Plus plan?"
user_query_vector = create_embedding(user_query, "ada-002")
results = search(user_query, user_query_vector)
response = create_chat_completion(system_prompt, user_query, results)

If the search function can find the right results in the index (assuming the answer is somewhere in the index), then the LLM can typically do a pretty good job of synthesizing the answer from the sources.

Unstructured queries

This simple RAG approach works best for "unstructured queries", like:

  • What's in the Northwind Plus plan?
  • What are the expectations of a product manager?
  • What benefits are provided by the company?

When using Azure AI Search as the knowledge base, the search call will perform both a vector and keyword search, finding all the relevant document chunks that match the keywords and concepts in the query.

Structured queries

But you may find that users are instead asking more "structured" queries, like:

  • Summarize the document called "perksplus.pdf"
  • What are the topics in documents by Pamela Fox?
  • Key points in most recent uploaded documents

We can think of them as structured queries, because they're trying to filter on specific metadata about a document. You could imagine a world where you used a syntax to specify that metadata filtering, like:

  • Summarize the document title:perksplus.pdf
  • Topics in documents author:PamelaFox
  • Key points time:2weeks

We don't want to actually introduce a query syntax to a RAG chat application if we don't need to, since only power users tend to use specialized query syntax, and we'd ideally have our RAG app just do the right thing in that situation.

Using function calling in RAG

Fortunately, we can use the OpenAI function-calling feature to recognize that a user's query would benefit from a more structured search, and perform that search instead.

If you've never used function calling before, it's an alternative way of asking an OpenAI GPT model to respond to a chat completion request. In addition to sending our usual system prompt, chat history, and user message, we also send along a list of possible functions that could be called to answer the question. We can define those in JSON or as a Pydantic model dumped to JSON. Then, when the response comes back from the model, we can see what function it decided to call, and with what parameters. At that point, we can actually call that function, if it exists, or just use that information in our code in some other way.

To use function calling in RAG, we first need to introduce an LLM pre-processing step to handle user queries, as I described in my previous blog post. That will give us an opportunity to intercept the query before we even perform the search step of RAG.

For that pre-processing step, we can start off with a function to handle the general case of unstructured queries:

from typing import List

from openai.types.chat import ChatCompletionToolParam

tools: List[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "search_sources",
            "description": "Retrieve sources from the Azure AI Search index",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "Query string to retrieve documents from azure search eg: 'Health care plan'",
                    }
                },
                "required": ["search_query"],
            },
        },
    }
]

Then we send off a request to the chat completion API, letting it know it can use that function.

chat_completion: ChatCompletion = self.openai_client.chat.completions.create(
    messages=messages,
    model=model,
    temperature=0.0,
    max_tokens=100,
    n=1,
    tools=tools,
    tool_choice="auto",
)

When the response comes back, we process it to see if the model decided to call the function, and extract the search_query parameter if so.

response_message = chat_completion.choices[0].message

if response_message.tool_calls:
    for tool in response_message.tool_calls:
        if tool.type != "function":
            continue
        function = tool.function
        if function.name == "search_sources":
            arg = json.loads(function.arguments)
            search_query = arg.get("search_query", self.NO_RESPONSE)

If the model didn't include the function call in its response, that's not a big deal as we just fall back to using the user's original query as the search query. We proceed with the rest of the RAG flow as usual, sending the original question with whatever results came back in our final LLM call.

Adding more functions for structured queries

Now that we've introduced one function into the RAG flow, we can more easily add additional functions to recognize structured queries. For example, this function recognizes when a user wants to search by a particular filename:

{
    "type": "function",
    "function": {
        "name": "search_by_filename",
        "description": "Retrieve a specific filename from the Azure AI Search index",
        "parameters": {
            "type": "object",
            "properties": {
                "filename": {
                    "type": "string",
                    "description": "The filename, like 'PerksPlus.pdf'",
                }
            },
            "required": ["filename"],
        },
    },
},

We need to extend the function parsing code to extract the filename argument:

if function.name == "search_by_filename":
    arg = json.loads(function.arguments)
    filename = arg.get("filename", "")
    filename_filter = filename

Then we can decide how to use that filename filter. In the case of Azure AI Search, I build a filter that checks that a particular index field matches the filename argument, and pass that to my search call. If using a relational database, it'd become an additional WHERE clause.
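Here's a sketch of what that looks like with the Azure AI Search Python SDK, assuming the index stores the filename in a field named "sourcefile" (your field name may differ) and that search_client and search_query are already set up:

# Build an OData filter from the extracted filename (field name is an assumption)
filter_expr = None
if filename_filter:
    safe_name = filename_filter.replace("'", "''")  # escape single quotes for OData
    filter_expr = f"sourcefile eq '{safe_name}'"

results = search_client.search(
    search_text=search_query,
    filter=filter_expr,
    top=3,
)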

Simply by adding that function, I was able to get much better answers to questions in my RAG app like 'Summarize the document called "perksplus.pdf"', since my search results were truly limited to chunks from that file. You can see my full code changes to add this function to our RAG starter app repo in this PR.

Considerations

This can be a very powerful technique, but as with all things LLM, there are gotchas:

  • Function definitions add to your prompt token count, increasing cost.
  • There may be times where the LLM doesn't decide to return the function call, even when you thought it should have.
  • The more functions you add, the more likely the LLM will get confused about which one to pick, especially if functions are similar to each other. You can try to make it more clear to the LLM by prompt engineering the function name and description, or even by providing few-shot examples.

Here are additional approaches you can try:

  • Content expansion: Store metadata inside the indexed field and compute the embedding based on both the metadata and content. For example, the content field could have "filename:perksplus.pdf text:The perks are..." (see the sketch after this list).
  • Add metadata as separate fields in the search index, and append those to the content sent to the LLM. For example, you could put "Last modified: 2 weeks ago" in each chunk sent to the LLM, if you were trying to improve its ability to answer questions about recency. This is similar to the content expansion approach, but the metadata isn't included when calculating the embedding. You could also compute embeddings separately for each metadata field, and do a multi-vector search.
  • Add filters to the UI of your RAG chat application, as part of the chat box or a sidebar of settings.
  • Use fine-tuning on a model to help it realize when it should call particular functions or respond a certain way. You could even teach it to use a structured query syntax, and remove the functions entirely from your call. This is a last resort, however, since fine-tuning is costly and time-consuming.
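To make the content expansion idea from the list above concrete, here's a sketch (function and field names are illustrative, not from the repo, and it assumes an openai_client is already constructed) of prepending metadata to each chunk before computing its embedding:

def build_embeddable_content(filename: str, chunk_text: str) -> str:
    # Metadata terms become part of the embedded text, so they influence vector search
    return f"filename: {filename} text: {chunk_text}"

content = build_embeddable_content("perksplus.pdf", "The perks are ...")
embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002",
    input=content,
).data[0].embedding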

Friday, February 16, 2024

RAG techniques: Cleaning user questions with an LLM

📺 You can also watch the video version of this blog post.

When I introduce app developers to the concept of RAG (Retrieval Augmented Generation), I often present a diagram like this:

Diagram of RAG flow, user question to data source to LLM

The app receives a user question, uses the user question to search a knowledge base, then sends the question and matching bits of information to the LLM, instructing the LLM to adhere to the sources.

That's the most straightforward RAG approach, but as it turns out, it's not quite what we do in our most popular open-source RAG solution, azure-search-openai-demo.

The flow instead looks like this:

Diagram of extended RAG flow, user question to LLM to data source to LLM

After the app receives a user question, it makes an initial call to an LLM to turn that user question into a more appropriate search query for Azure AI Search. More generally, you can think of this step as turning the user query into a datastore-aware query. This additional step tends to improve the search results, and is a (relatively) quick task for an LLM. It's also cheap in terms of output token usage.

I'll break down the particular approach our solution uses for this step, but I encourage you to think more generally about how you might make your user queries more datastore-aware for whatever datastore you may be using in your RAG chat apps.

Converting user questions for Azure AI search

Here is our system prompt:

Below is a history of the conversation so far, and a new question asked by
the user that needs to be answered by searching in a knowledge base.
You have access to Azure AI Search index with 100's of documents.
Generate a search query based on the conversation and the new question.
Do not include cited source filenames and document names e.g info.txt or doc.pdf in the search query terms.
Do not include any text inside [] or <<>> in the search query terms.
Do not include any special characters like '+'.
If the question is not in English, translate the question to English
before generating the search query.
If you cannot generate a search query, return just the number 0.

Notice that it describes the kind of data source, indicates that the conversation history should be considered, and describes a lot of things that the LLM should not do.

We also provide a few examples (also known as "few-shot prompting"):

query_prompt_few_shots = [
    {"role": "user", "content": "How did crypto do last year?"},
    {"role": "assistant", "content": "Summarize Cryptocurrency Market Dynamics from last year"},
    {"role": "user", "content": "What are my health plans?"},
    {"role": "assistant", "content": "Show available health plans"},
]

Developers use our RAG solution for many domains, so we encourage them to customize few-shots like this to improve results for their domain.

We then combine the system prompts, few shots, and user question with as much conversation history as we can fit inside the context window.

messages = self.get_messages_from_history(
   system_prompt=self.query_prompt_template,
   few_shots=self.query_prompt_few_shots,
   history=history,
   user_content="Generate search query for: " + original_user_query,
   model_id=self.chatgpt_model,
   max_tokens=self.chatgpt_token_limit - len(user_query_request), 
)

We send all of that off to GPT-3.5 in a chat completion request, specifying a temperature of 0 to reduce creativity and a max tokens of 100 to avoid overly long queries:

chat_completion = await self.openai_client.chat.completions.create(
    messages=messages,
    model=self.chatgpt_model,
    temperature=0.0,
    max_tokens=100,
    n=1
)

Once the search query comes back, we use it to search Azure AI Search, doing a hybrid search with both the text version of the query and its embedding, in order to optimize the relevance of the results.
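Here's a rough sketch of what that hybrid search call can look like with the azure-search-documents SDK; the vector field name ("embedding"), the top value, and k_nearest_neighbors are assumptions, and the real call in the solution has more options (semantic ranking, filters, and so on):

from azure.search.documents.models import VectorizedQuery

# query_text is the LLM-generated search query; query_embedding is its embedding
vector_query = VectorizedQuery(
    vector=query_embedding, k_nearest_neighbors=50, fields="embedding"
)
results = search_client.search(
    search_text=query_text,
    vector_queries=[vector_query],
    top=3,
)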

Using chat completion tools to request the query conversion

What I just described is actually the approach we used months ago. Once the OpenAI chat completion API added support for tools (also known as "function calling"), we decided to use that feature in order to further increase the reliability of the query conversion result.

We define our tool, a single function search_sources that takes a search_query parameter:

tools = [
  {
    "type": "function",
    "function": {
      "name": "search_sources",
      "description": "Retrieve sources from the Azure AI Search index",
      "parameters": {
        "type": "object",
        "properties": {
          "search_query": {
            "type": "string",
            "description": "Query string to retrieve documents from
                            Azure search eg: 'Health care plan'",
          }
        },
        "required": ["search_query"],
      },
    },
  }
]

Then, when we make the call (using the same messages as described earlier), we also tell the OpenAI model that it can use that tool:

chat_completion = await self.openai_client.chat.completions.create(
    messages=messages,
    model=self.chatgpt_model,
    temperature=0.0,
    max_tokens=100,
    n=1,
    tools=tools,
    tool_choice="auto",
)

Now the response that comes back may contain a tool call for search_sources with an argument called search_query. We parse the response to look for that call, and extract the value of the search_query parameter if so. If it's not provided, then we fall back to assuming the converted query is in the usual content field. That extraction looks like:

def get_search_query(self, chat_completion: ChatCompletion, user_query: str):
    response_message = chat_completion.choices[0].message

    if response_message.tool_calls:
        for tool in response_message.tool_calls:
            if tool.type != "function":
                continue
            function = tool.function
            if function.name == "search_sources":
                arg = json.loads(function.arguments)
                search_query = arg.get("search_query", self.NO_RESPONSE)
                if search_query != self.NO_RESPONSE:
                    return search_query
    elif query_text := response_message.content:
        if query_text.strip() != self.NO_RESPONSE:
            return query_text
    return user_query

This is admittedly a lot of work, but we have seen much improved results in result relevance since making the change. It's also very helpful to have an initial step that uses tools, since that's a place where we could also bring in other tools, such as escalating the conversation to a human operator or retrieving data from other data sources.

To see the full code, check out chatreadretrieveread.py.

When to use query cleaning

We currently only use this technique for the multi-turn "Chat" tab, where it can be particularly helpful if the user is referencing terms from earlier in the chat. For example, consider the conversation below where the user's first question specified the full name of the plan, and the follow-up question used a nickname - the cleanup process brings back the full term.

Screenshot of a multi-turn conversation with final question 'what else is in plus?'

We do not use this for our single-turn "Ask" tab. It could still be useful, particularly for other datastores that benefit from additional formatting, but we opted to use the simpler RAG flow for that approach.

Depending on your app and datastore, your answer quality may benefit from this approach. Try it out, do some evaluations, and discover for yourself!

Sunday, January 28, 2024

Converting HTML pages to PDFs with Playwright

In this post, I'll share a fairly easy way to convert HTML pages to PDF files using the Playwright E2E testing library.

Background: I am working on a RAG chat app solution that has a PDF ingestion pipeline. For a conference demo, I needed it to ingest HTML webpages instead. I could have written my own HTML parser or tried to integrate the LlamaIndex reader, but since I was pressed for time, I decided to just convert the webpages to PDF.

My first idea was to use dedicated PDF export libraries like pdfkit and wkhtmltopdf, but I kept running into issues trying to get them working. But then I discovered that my new favorite package for E2E testing, Playwright, has a PDF saving function. 🎉 Here’s my setup for conversion.

Step 1: Prepare a list of URLs

For this script, I use the requests package to fetch the HTML for the main page of the website. Then I use the BeautifulSoup scraping library to grab all the links from the table of contents. I process each URL, turning it back into an absolute URL, and add it to the list.

import requests
from bs4 import BeautifulSoup

# The page whose table of contents we'll scrape (the Flask-SQLAlchemy docs in this example)
url = "https://flask-sqlalchemy.palletsprojects.com/en/3.1.x/"

urls = set()
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find("section", {"id": "flask-sqlalchemy"}).find_all("a")
for link in links:
    if "href" not in link.attrs:
        continue
    # strip off the hash and add back the domain
    link_url = link["href"].split("#")[0]
    if not link_url.startswith("https://"):
        link_url = url + link_url
    if link_url not in urls:
        urls.add(link_url)

See the full code here
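The conversion script in the next step reads its input from a urls.txt file, so the collected URLs need to end up there; a minimal way to write them out, one per line:

# Write the collected URLs to the file that the conversion script reads from
with open("urls.txt", "w") as file:
    for page_url in sorted(urls):
        file.write(page_url + "\n")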

Step 2: Save each URL as a PDF

For this script, I import the asynchronous version of the Playwright library (plus a few standard-library modules used below). That allows my script to support concurrency when processing the list of URLs, which can speed up the conversion.

import asyncio
import logging
from pathlib import Path

from playwright.async_api import BrowserContext, async_playwright

Then I define a function to save a single URL as a PDF. It uses Playwright to goto() the URL, decides on an appropriate filename for that URL, and saves the file with a call to pdf().

async def convert_to_pdf(context: BrowserContext, url: str):
    try:
        page = await context.new_page()
        await page.goto(url)
        filename = url.split("https://flask-sqlalchemy.palletsprojects.com/en/3.1.x/")[1].replace("/", "_") + ".pdf"
        filepath = "pdfs/" / Path(filename)
        await page.pdf(path=filepath)
    except Exception as e:
        logging.error(f"An error occurred while converting {url} to PDF: {e}")

Next I define a function to process the whole list. It starts up a new Playwright browser process, creates an asyncio.TaskGroup() (new in 3.11), and adds a task to convert each URL using the first function.

async def convert_many_to_pdf():
    async with async_playwright() as playwright:
        chromium = playwright.chromium
        browser = await chromium.launch()
        context = await browser.new_context()

        urls = []
        with open("urls.txt") as file:
            urls = [line.strip() for line in file]

        async with asyncio.TaskGroup() as task_group:
            for url in urls:
                task_group.create_task(convert_to_pdf(context, url))
        await browser.close()

Finally, I call that convert_many_to_pdf function using asyncio.run():

asyncio.run(convert_many_to_pdf())

See the full code here

Considerations

Here are some things to think about when using this approach:

  • How will you get all the URLs for the website, while avoiding external URLs? A sitemap.xml would be an ideal way, but not all websites create those.
  • What's an appropriate filename for a URL? I wanted filenames that I could convert back to URLs later, so I converted / to _ but that only worked because those URLs had no underscores in them.
  • Do you want to visit the webpage at full screen or mobile sized? Playwright can open at any resolution, and you might want to convert the mobile version of your site for whatever reason (see the sketch after this list).
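For that last consideration, here's a sketch of how the browser context from the earlier script could be opened at a phone-sized viewport instead of the default (the dimensions and device preset are just examples):

# Open pages at a phone-sized viewport to capture the mobile rendering
context = await browser.new_context(viewport={"width": 390, "height": 844})
# or use one of Playwright's built-in device presets:
# context = await browser.new_context(**playwright.devices["iPhone 13"])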

Tuesday, January 16, 2024

Evaluating a RAG chat app: Approach, SDKs, and Tools

When we’re programming user-facing experiences, we want to feel confident that we’re creating a functional user experience - not a broken one! How do we do that? We write tests: unit tests, integration tests, smoke tests, accessibility tests, load tests, property-based tests. We can’t automate all forms of testing, so we test what we can, and hire humans to audit what we can’t.

But when we’re building RAG chat apps built on LLMs, we need to introduce an entirely new form of testing to give us confidence that our LLM responses are coherent, grounded, and well-formed.

We call this form of testing “evaluation”, and we can now automate it with the help of the most powerful LLM in town: GPT-4.

How to evaluate a RAG chat app

The general approach is:

  1. Generate a set of “ground truth” data (at least 200 question-answer pairs). We can use an LLM to generate that data, but it’s best to have humans review it and update it continually based on real usage examples.
  2. For each question, pose the question to your chat app and record the answer and context (data chunks used).
  3. Send the ground truth data with the newly recorded data to GPT-4 and prompt it to evaluate its quality, rating answers on 1-5 scales for each metric. This step involves careful prompt engineering and experimentation.
  4. Record the ratings for each question, compute average ratings and overall pass rates, and compare to previous runs.
  5. If your statistics are better or equal to previous runs, then you can feel fairly confident that your chat experience has not regressed.

Evaluate using the Azure AI Generative SDK

A team of ML experts at Azure has put together an SDK to run evaluations on chat apps, in the azure-ai-generative Python package, with functions for both generating ground truth data and running evaluations.

Start with this evaluation project template

Since I've been spending a lot of time maintaining our most popular RAG chat app solution, I wanted to make it easy to test changes to that app's base configuration - but also make it easy for any developers to test changes to their own RAG chat apps. So I've put together ai-rag-chat-evaluator, a repository with command-line tools for generating data, evaluating apps (local or deployed), and reviewing the results.

For example, after configuring an OpenAI connection and Azure AI Search connection, generate data with this command:

python3 -m scripts generate --output=example_input/qa.jsonl --numquestions=200

To run an evaluation against ground truth data, run this command:

python3 -m scripts evaluate --config=example_config.json

You'll then be able to view a summary of results with the summary tool:

Screenshot of summary tool which shows GPT metrics for each run

You'll also be able to easily compare answers across runs with the compare tool:

Screenshot of compare tool showing answers side by side with GPT metrics below

For more details on using the project, check the README and please file an issue with any questions, concerns, or bug reports.

When to run evaluation tests

This evaluation process isn’t like other automated testing that a CI would run on every commit, as it is too time-intensive and costly.

Instead, RAG development teams should run an evaluation flow when something has changed about the RAG flow itself, like the system message, LLM parameters, or search parameters.

Here is one possible workflow:

  • A developer tests a modification of the RAG prompt and runs the evaluation on their local machine, against a locally running app, and compares to an evaluation for the previous state ("baseline").
  • That developer makes a PR to the app repository with their prompt change.
  • A CI action notices that the prompt has changed, and adds a comment requiring the developer to point to their evaluation results, or possibly copy them into the repo into a specified folder.
  • The CI action could confirm the evaluation results exceed or are equal to the current statistics, and mark the PR as mergeable. (It could also run the evaluation itself at this point, but I'm wary of recommending running expensive evaluations twice).
  • After any changes are merged, the development team could use an A/B or canary test alongside feedback buttons (thumbs up/down) to make sure that the chat app is working as well as expected.

I'd love to hear how RAG chat app development teams are running their evaluation flows, to see how we can help in providing reusable tools for all of you. Please let us know!

Wednesday, January 10, 2024

Developer relations & motherhood: Will they blend?

My very first job out of college was in developer relations at Google, and it was absolutely perfect for me; a way to combine my love for programming with my interest in teaching. I got to code, write blog posts, organize events, work tightly with eng teams, and do so much traveling, giving talks all over the world. I only left when Google started killing products left and right, including the one I was working on (Wave), and well, my heart was a little broken. (I'm now jaded enough to not loan my whole heart out to corporations)

12 years pass...

I'm back in developer relations, this time for Microsoft/Azure on the Python Advocacy team, and I once again am loving it. It's similar to my old role at Google, but involves more open source work (yay!) and more forms of virtual advocacy (both due to the pandemic and an increasingly global audience).

There's a big difference for me this time though: I'm a mom of two kids, a 4-year-old and a 1-year-old (born the week after I started the job). My littlest one is still very attached to me, both emotionally and physically, as she's still nursing and co-sleeping at night, so I essentially have no free time outside of 9-5. (For example, I am writing this a few inches away from her in our floor bed, and have already had to stop/start a few times.)

Generally, developer advocacy has been fairly compatible with motherhood, and I'm hugely thankful to Microsoft for their parental leave program (5 months) and support for remote work, and to my manager for understanding my needs as a mother.

However, I've found it stressful to participate fully in all the kinds of events that used to fill my days in DevRel. I'll break down difficulties I've had in fitting events in with my new mom-of-two life, from least to most friction:

  • Live streams: Many advocates (and content creators, generally) will easily hop on a stream to show what they're working on, and it can be a really fun, casual way to connect with the community. I avoided casual streams for the first year of my baby's life, while I was still pumping, as I had to pump too often for it to be practical to be on camera. Now that I'm done pumping, I've had a great time jumping on streams on my colleague's channel. Thanks for the invites, Jay!
  • Virtual events: I'm the one that gets really excited when I hear a conference will be online, since then I can participate from the comfort of my own home. But after speaking at a number of virtual events, I've learnt to ask for more information about the exact timing before getting too excited. Specifically:
    • Is the event in my timezone? I'm in PT, and lots of events cater to audiences in Europe/Asia (rightly so), and their timing may not overlap my workday.
    • Is the event during the week? Lots of conferences are on the weekend, which means paying for childcare and potentially missing out on events with my kids.
    • Is the speaker rehearsal check-in at a convenient time? This is what keeps burning me: I'll happily get a slot speaking at 10AM PT, and then realize there's a speaker mic check at 7AM. I am usually awake at that time, but with a child draped over me who will wake up screaming if I jostle her, waking the rest of the house. Now, if I discover early mic checks, I either pay for my nanny to come early or I explain to them that I can connect but can't test my A/V yet.
  • Local events: I've attended a few Microsoft-sponsored events in SF that were pretty fun. I had to leave before the after parties, and even before the final keynote, in order to get home at a reasonable time for evening nursing, but I still got a lot of good interactions in from 10AM-4PM. There are some local meetups as well, but they tend to be on weeknights/weekends, so I generally avoid them due to the need for childcare. The hassle and added stress on the household often doesn't seem worth it.
  • Non-local events: I've managed to attend zero such events in my 1.5 years at Microsoft! My colleagues have attended events like PyCon and PyCascades, but I haven't felt like I could take an airplane ride with a nursing baby at home. Now that she's nearing two years old, I'm hoping to wean her soon, and a non-local event might become the forcing function for that. I'll be running a session in March at SIGCSE 2024 in Portland, Oregon, which is just a 2-hour plane ride from here, but I'd love to attend for a few days. I'll need to pay our nanny for the night, since she and I are the only two people who can get my little one to sleep, but hey, at least Microsoft pays me fairly well.

You may very well read through all my difficulties and think, "well, why doesn't she just wean the baby? or at least sleep train her?" Reader, I've tried. I'm trying. We're trying. It'll happen eventually.

Once both our kids are preschool aged, it should be much easier for me to participate more fully in events. I never see myself doing anywhere as much travel as I did back in my 20-something Google days, however. It wouldn't be fair to my never-traveling partner to constantly leave him with full parenting duties, and as the child of an always-traveling parent, it's not something I want to do to my kids either. Fortunately, the developer relations field is already much more focused on virtual forms of advocacy, so that is where I hope to hone my skills.

I hope this post helps anyone else considering the combination of developer relations and motherhood (or more generally, parenting).

Wednesday, January 3, 2024

Using FastAPI for an OpenAI chat backend

When building web APIs that make calls to OpenAI servers, we really want a backend that supports concurrency, so that it can handle a new user request while waiting for the OpenAI server response. Since my apps have Python backends, I typically use either Quart, the asynchronous version of Flask, or FastAPI, the most popular asynchronous Python web framework.

In this post, I'm going to walk through a FastAPI backend that makes chat completion calls to OpenAI. Full code is available on GitHub: github.com/pamelafox/chatgpt-backend-fastapi/


Initializing the OpenAI client

In the new (>= 1.0) version of the openai Python package, the first step is to construct a client, using either OpenAI(), AsyncOpenAI(), AzureOpenAI(), or AsyncAzureOpenAI(). Since we're using FastAPI, we should use one of the Async* variants: AsyncOpenAI() for openai.com accounts or AsyncAzureOpenAI() for Azure OpenAI accounts.

But when do we actually initialize that client? We could do it in every single request, but that would be doing unnecessary work. Ideally, we would do it once, when the app started up on a particular machine, and keep the client in memory for future requests. The way to do that in FastAPI is with lifespan events.

When constructing the FastAPI object, we must point the lifespan parameter at a function.

app = fastapi.FastAPI(docs_url="/", lifespan=lifespan)

That lifespan function must be wrapped with the @contextlib.asynccontextmanager decorator. The body of the function sets up the OpenAI client, stores it as a global, issues a yield to signal that it's done setting up, and then closes the client as part of shutdown.

import contextlib
import os

import openai

from .globals import clients


@contextlib.asynccontextmanager
async def lifespan(app: fastapi.FastAPI):
    
    if os.getenv("OPENAI_KEY"):
         # openai.com OpenAI
        clients["openai"] = openai.AsyncOpenAI(
            api_key=os.getenv("OPENAI_KEY")
        )
    else:
        # Azure OpenAI: auth is more involved, see full code.
        clients["openai"] = openai.AsyncAzureOpenAI(
            api_version="2023-07-01-preview",
            azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
            **client_args,
        )

    yield

    await clients["openai"].close()

See full __init__.py.

Unfortunately, FastAPI doesn't have a standard way of defining globals (like Flask/Quart with the g object), so I am storing the client in a dictionary from a shared module. There are some more sophisticated approaches to shared globals in this discussion.
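That shared module can be as small as a single dictionary; a minimal globals.py might be nothing more than:

# globals.py
# Shared dictionary that the lifespan function fills in and the routes read from.
clients: dict = {}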


Making chat completion API calls

Now that the client is setup, the next step is to create a route that processes a message from the user, sends it to OpenAI, and returns the OpenAI response as an HTTP response.

We start off by defining pydantic models that describe what a request looks like for our chat app. In our case, each HTTP request will contain JSON with two keys, a list of "messages" and a "stream" boolean:

class ChatRequest(pydantic.BaseModel):
    messages: list[Message]
    stream: bool = True

Each message contains a "role" and "content" key, where role defaults to "user". I chose to be consistent with the OpenAI API here, but you could of course define your own input format and do pre-processing as needed.

class Message(pydantic.BaseModel):
    content: str
    role: str = "user"

Then we can define a route that handles chat requests over POST and sends back a non-streaming response:

@router.post("/chat")
async def chat_handler(chat_request: ChatRequest):
    messages = [{"role": "system", "content": system_prompt}] + chat_request.messages
    response = await clients["openai"].chat.completions.create(
        # model (or Azure deployment) name is required; placeholder shown here
        model="gpt-3.5-turbo",
        messages=messages,
        stream=False,
    )
    return response.model_dump()

The auto-generated documentation shows the JSON response as expected:

Screenshot of FastAPI documentation with JSON response from OpenAI chat completion call
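You can also call the route from any HTTP client; for example, with the requests package, assuming the app is running locally on the default uvicorn port:

import requests

response = requests.post(
    "http://localhost:8000/chat",
    json={"messages": [{"content": "Tell me a joke", "role": "user"}], "stream": False},
)
# The response is the full chat completion object, serialized to JSON
print(response.json()["choices"][0]["message"]["content"])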


Sending back streamed responses

It gets more interesting when we add support for streamed responses, as we need to return a StreamingResponse object pointing at an asynchronous generator function.

We'll add this code inside the "/chat" route:

if chat_request.stream:
    async def response_stream():
        chat_coroutine = clients["openai"].chat.completions.create(
            # model (or Azure deployment) name is required; placeholder shown here
            model="gpt-3.5-turbo",
            messages=messages,
            stream=True,
        )
        async for event in await chat_coroutine:
            yield json.dumps(event.model_dump(), ensure_ascii=False) + "\n"

    return fastapi.responses.StreamingResponse(response_stream())

The response_stream() function is an asynchronous generator, since it is defined with async and has a yield inside it. It uses async for to loop through the asynchronous iterable results of the Chat Completion call. For each event it receives, it yields a JSON string with a newline after it. This sort of response is known as "json lines" or "ndjson" and is my preferred approach for streaming JSON over HTTP versus other protocols like server-sent events.
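A client can then consume the stream line by line; for example, with the requests package (again assuming the app is running locally on the default uvicorn port):

import json
import requests

with requests.post(
    "http://localhost:8000/chat",
    json={"messages": [{"content": "Tell me a joke", "role": "user"}], "stream": True},
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        # Each line is one chat completion chunk; print the content delta as it arrives
        if event["choices"]:
            print(event["choices"][0]["delta"].get("content") or "", end="")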

The auto-generated documentation doesn't natively understand streamed JSON lines, but it happily displays it anyways:

Screenshot of FastAPI server response with streamed JSON lines


All together now

You can see the full router code in chat.py. You may also be interested in the tests folder to see how I fully tested the app using pytest, extensive mocks of the OpenAI API, including Azure OpenAI variations, and snapshot testing.