Friday, February 16, 2024

RAG techniques: Cleaning user questions with an LLM

📺 You can also watch the video version of this blog post.

When I introduce app developers to the concept of RAG (Retrieval Augmented Generation), I often present a diagram like this:

Diagram of RAG flow, user question to data source to LLM

The app receives a user question, uses the user question to search a knowledge base, then sends the question and matching bits of information to the LLM, instructing the LLM to adhere to the sources.

That's the most straightforward RAG approach, but as it turns out, it's not quite what we do in our most popular open-source RAG solution, azure-search-openai-demo.

The flow instead looks like this:

Diagram of extended RAG flow, user question to LLM to data source to LLM

After the app receives a user question, it makes an initial call to an LLM to turn that user question into a more appropriate search query for Azure AI Search. More generally, you can think of this step as turning the user query into a datastore-aware query. This additional step tends to improve the search results, and is a (relatively) quick task for an LLM. It's also cheap in terms of output token usage.

I'll break down the particular approach our solution uses for this step, but I encourage you to think more generally about how you might make your user queries more datastore-aware for whatever datastore you may be using in your RAG chat apps.
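
To make that flow concrete, here's a minimal sketch of the extended approach. The prompt constants, the search_knowledge_base helper, and the model name are placeholders for illustration, not the demo's actual code:

async def rag_answer(openai_client, history, user_question):
    # Step 1: ask the LLM to rewrite the user question as a datastore-aware search query
    query_completion = await openai_client.chat.completions.create(
        model="gpt-35-turbo",
        messages=[
            {"role": "system", "content": QUERY_REWRITE_PROMPT},
            *history,
            {"role": "user", "content": "Generate search query for: " + user_question},
        ],
        temperature=0.0,
        max_tokens=100,
    )
    search_query = query_completion.choices[0].message.content

    # Step 2: search the knowledge base with the rewritten query (placeholder helper)
    sources = await search_knowledge_base(search_query)

    # Step 3: ask the LLM to answer the original question, grounded in the retrieved sources
    answer_completion = await openai_client.chat.completions.create(
        model="gpt-35-turbo",
        messages=[
            {"role": "system", "content": ANSWER_PROMPT + "\n\nSources:\n" + "\n".join(sources)},
            *history,
            {"role": "user", "content": user_question},
        ],
        temperature=0.3,
    )
    return answer_completion.choices[0].message.content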

Converting user questions for Azure AI Search

Here is our system prompt:

Below is a history of the conversation so far, and a new question asked by
the user that needs to be answered by searching in a knowledge base.
You have access to Azure AI Search index with 100's of documents.
Generate a search query based on the conversation and the new question.
Do not include cited source filenames and document names e.g info.txt or doc.pdf in the search query terms.
Do not include any text inside [] or <<>> in the search query terms.
Do not include any special characters like '+'.
If the question is not in English, translate the question to English
before generating the search query.
If you cannot generate a search query, return just the number 0.

Notice that it describes the kind of data source, indicates that the conversation history should be considered, and spells out several things the LLM should not do.

We also provide a few examples (also known as "few-shot prompting"):

query_prompt_few_shots = [
    {"role": "user", "content": "How did crypto do last year?"},
    {"role": "assistant", "content": "Summarize Cryptocurrency Market Dynamics from last year"},
    {"role": "user", "content": "What are my health plans?"},
    {"role": "assistant", "content": "Show available health plans"},
]

Developers use our RAG solution across many domains, so we encourage them to customize the few-shot examples to improve results for their own domain.
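
For example, an app answering questions over a hypothetical IT helpdesk knowledge base might swap in few-shots like these (illustrative examples, not from the demo):

query_prompt_few_shots = [
    {"role": "user", "content": "my laptop won't connect to the office wifi"},
    {"role": "assistant", "content": "Troubleshoot laptop wireless network connection issues"},
    {"role": "user", "content": "how do I get a replacement badge?"},
    {"role": "assistant", "content": "Request replacement employee security badge"},
]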

We then combine the system prompt, few-shot examples, and user question with as much of the conversation history as we can fit inside the context window.

messages = self.get_messages_from_history(
    system_prompt=self.query_prompt_template,
    few_shots=self.query_prompt_few_shots,
    history=history,
    user_content="Generate search query for: " + original_user_query,
    model_id=self.chatgpt_model,
    max_tokens=self.chatgpt_token_limit - len(user_query_request),
)
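
The get_messages_from_history helper lives in the repo, but conceptually it assembles the list in a fixed order: system prompt, few-shot examples, then as many of the most recent history turns as the token budget allows, with the new user message last. Here's a rough sketch of that logic, where count_tokens is a hypothetical token counter (e.g. built on tiktoken):

def build_messages(system_prompt, few_shots, history, user_content, max_tokens):
    # System prompt and few-shot examples always come first
    messages = [{"role": "system", "content": system_prompt}, *few_shots]
    new_user_message = {"role": "user", "content": user_content}

    # Walk the history from newest to oldest, keeping turns while they fit in the budget
    budget = max_tokens - count_tokens(messages + [new_user_message])
    kept_history = []
    for message in reversed(history):
        cost = count_tokens([message])
        if cost > budget:
            break
        kept_history.insert(0, message)
        budget -= cost

    return messages + kept_history + [new_user_message]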

We send all of that off to GPT-3.5 in a chat completion request, specifying a temperature of 0 to reduce creativity and a max_tokens of 100 to avoid overly long queries:

chat_completion = await self.openai_client.chat.completions.create(
    messages=messages,
    model=self.chatgpt_model,
    temperature=0.0,
    max_tokens=100,
    n=1
)

Once the search query comes back, we use it to search Azure AI Search, performing a hybrid search that combines full-text search on the query with a vector search on the query's embedding, to optimize the relevance of the results.
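
In code, that search step looks roughly like the sketch below, using the azure-search-documents SDK. The embedding model name and the vector field name are placeholders, and the demo's real version layers on more options such as filters and the semantic ranker:

from azure.search.documents.aio import SearchClient
from azure.search.documents.models import VectorizedQuery

async def hybrid_search(search_client: SearchClient, openai_client, query_text: str):
    # Embed the rewritten query so we can run a vector search alongside the text search
    embedding_response = await openai_client.embeddings.create(
        model="text-embedding-ada-002", input=query_text
    )
    query_vector = embedding_response.data[0].embedding

    # Hybrid search: full-text search on the query plus k-nearest-neighbor search on its embedding
    results = await search_client.search(
        search_text=query_text,
        vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="embedding")],
        top=3,
    )
    return [document async for document in results]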

Using chat completion tools to request the query conversion

What I just described is actually the approach we used months ago. Once the OpenAI chat completion API added support for tools (also known as "function calling"), we decided to use that feature in order to further increase the reliability of the query conversion result.

We define our tool, a single function search_sources that takes a search_query parameter:

tools = [
  {
    "type": "function",
    "function": {
      "name": "search_sources",
      "description": "Retrieve sources from the Azure AI Search index",
      "parameters": {
        "type": "object",
        "properties": {
          "search_query": {
            "type": "string",
            "description": "Query string to retrieve documents from
                            Azure search eg: 'Health care plan'",
          }
        },
        "required": ["search_query"],
      },
    },
  }
]

Then, when we make the call (using the same messages as described earlier), we also tell the OpenAI model that it can use that tool:

chat_completion = await self.openai_client.chat.completions.create(
    messages=messages,
    model=self.chatgpt_model,
    temperature=0.0,
    max_tokens=100,
    n=1,
    tools=tools,
    tool_choice="auto",
)

Now the response that comes back may contain a tool call to the search_sources function with a search_query argument. We parse the response to look for that call and, if found, extract the value of the search_query argument. If no tool call is present, we fall back to assuming the converted query is in the usual content field. That extraction looks like:

def get_search_query(self, chat_completion: ChatCompletion, user_query: str):
    response_message = chat_completion.choices[0].message

    if response_message.tool_calls:
        # Preferred path: the model called the search_sources tool with a search_query argument
        for tool in response_message.tool_calls:
            if tool.type != "function":
                continue
            function = tool.function
            if function.name == "search_sources":
                arg = json.loads(function.arguments)
                search_query = arg.get("search_query", self.NO_RESPONSE)
                if search_query != self.NO_RESPONSE:
                    return search_query
    elif query_text := response_message.content:
        # Fallback path: the model put the converted query in the message content
        if query_text.strip() != self.NO_RESPONSE:
            return query_text
    # Last resort: use the original user question as the search query
    return user_query

This is admittedly a lot of work, but we have seen a notable improvement in result relevance since making the change. It's also very helpful to have an initial step that uses tools, since that's a place where we could bring in other tools as well, such as escalating the conversation to a human operator or retrieving data from other data sources.
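
For instance, a hypothetical escalation tool could be registered alongside search_sources (illustrative only; this is not part of the demo):

# Hypothetical additional tool: lets the model signal that the conversation
# should be handed off to a human operator instead of searching.
tools.append(
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Escalate the conversation to a human support agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Short reason for the escalation, eg: 'User asked for a person'",
                    }
                },
                "required": ["reason"],
            },
        },
    }
)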

To see the full code, check out chatreadretrieveread.py.

When to use query cleaning

We currently only use this technique for the multi-turn "Chat" tab, where it can be particularly helpful if the user is referencing terms from earlier in the chat. For example, in the conversation below, the user's first question specified the full name of the plan and the follow-up question used a nickname; the cleanup process brings back the full term.

Screenshot of a multi-turn conversation with final question 'what else is in plus?'

We do not use this for our single-turn "Ask" tab. It could still be useful there, particularly with other datastores that benefit from additional query formatting, but we opted to keep the simpler RAG flow for that tab.

Depending on your app and datastore, your answer quality may benefit from this approach. Try it out, do some evaluations, and discover for yourself!