Saturday, June 1, 2024

Truncating conversation history for OpenAI chat completions

When I build chat applications using the OpenAI chat completions API, I often want to send a user's previous messages to the model so that the model has more context for the user's current question. However, OpenAI models have limited context windows, ranging between 4K and 128K tokens depending on the model. If we send more tokens than the model allows, the API will respond with an error.

We need a way to make sure we only send as many tokens as the model can handle. You might consider several approaches:

  • Send the last N messages (where N is some small number like 3). Don't do this! It is very likely to result in an error, since a particular message might be very long, might be written in a language with a higher token:word ratio, or might contain symbols that require surprisingly many tokens. Similarly, don't rely on character count as an indicator of token count; it will fail for any message that isn't just common English words (see the short tiktoken sketch after this list).
  • Use a separate OpenAI call to summarize the conversation, and send the summary instead. This approach can work, especially if you specify the maximum tokens for that Chat Completion call and verify the number of tokens used in the response. It does have the drawback of requiring an additional OpenAI call, which can significantly increase user-perceived latency.
  • Send the last N messages that fit inside the remaining token count. This approach requires the use of the tiktoken library for calculating token usage for each possible message that you might send. That does take time, but is faster than an additional LLM call. This is what we use in azure-search-openai-demo and rag-postgres-openai-python, and what I'll explain in this post.
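
For a quick sense of why character counts are not a good proxy, here is a tiny tiktoken example (the model name is just an example; the encoding differs by model):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

# Similar character counts can produce very different token counts.
for text in ["What is the capital of France?", "日本の首都はどこですか？"]:
    print(f"{len(text)} characters -> {len(encoding.encode(text))} tokens")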

Overall algorithm for conversation history truncation

Here is the approach we take to squeeze in as much conversation history as possible, assuming a function that takes model, system_prompt, few_shots, past_messages, and new_user_message as inputs. The function defaults to the maximum token window for the given model, but can also be customized with a different max_tokens.

  1. Start with the system prompt and few shot examples. We always want to send those.
  2. Add the new user message, the one that ultimately requires an answer. Compute the token count of the current set of messages.
  3. Working backward from the most recent of the past messages, compute the token count of each message. If adding that count would not exceed the max token count, add the message; otherwise, stop.

And that's it! You can see it implemented in code in my build_messages function.
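
If you just want the gist without reading the package source, here's a simplified sketch of that algorithm (not the actual build_messages implementation). It assumes a count_tokens_for_message(model, message) helper like the one described in the next section:

def truncate_history_sketch(model, system_prompt, new_user_message, max_tokens, past_messages=(), few_shots=()):
    # Step 1: always include the system prompt and few-shot examples.
    messages = [{"role": "system", "content": system_prompt}, *few_shots]

    # Step 2: the new user message always gets sent, so count it up front.
    new_message = {"role": "user", "content": new_user_message}
    token_count = sum(count_tokens_for_message(model, m) for m in [*messages, new_message])

    # Step 3: walk the history from most recent to oldest, keeping messages
    # until the next one would push us over the token budget.
    kept_history = []
    for message in reversed(past_messages):
        message_tokens = count_tokens_for_message(model, message)
        if token_count + message_tokens > max_tokens:
            break
        kept_history.insert(0, message)
        token_count += message_tokens

    return [*messages, *kept_history, new_message]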

Token counting for each message

Actually, there's more! How do we compute the token count for each message? OpenAI documents that in a few places: Cookbook: How to count tokens with tiktoken, OpenAI guides: Managing tokens, and GPT-4 vision: Calculating costs.

Basically, we can use the tiktoken library to figure out the encoding for the given model, and ask for the token count of a particular user message's content, like "Please write a poem". But we also need to account for the other tokens that are a part of a request, like "role": "user" and images in GPT-4-vision requests, and the guides above provide tips for counting the additional tokens. You can see code in my count_tokens_for_messages function, which accounts for both text messages and image messages.
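
For plain text messages, that counting looks roughly like the sketch below, adapted from the OpenAI cookbook. The 3-token-per-message overhead applies to the current gpt-3.5-turbo and gpt-4 family and could change for future models; image messages need the additional math from the GPT-4 vision guide.

import tiktoken

def count_tokens_for_text_message(model: str, message: dict) -> int:
    # Adapted from the OpenAI cookbook "How to count tokens with tiktoken".
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message is wrapped in special tokens like <|start|> and <|end|>
    tokens_per_name = 1     # an optional "name" field costs one extra token
    num_tokens = tokens_per_message
    for key, value in message.items():
        num_tokens += len(encoding.encode(value))
        if key == "name":
            num_tokens += tokens_per_name
    return num_tokens

# A full request also adds ~3 tokens to prime the assistant's reply.
print(count_tokens_for_text_message("gpt-4", {"role": "user", "content": "Please write a poem"}))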

The calculation gets trickier when there's function calling involved, since the function definitions also consume tokens, and exactly how many depends on the system message contents, presumably because OpenAI stuffs the function schemas into the system message behind the scenes. That calculation is done in my count_tokens_for_system_and_tools function, which was based on great reverse engineering work by other developers in the OpenAI community.
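
I won't reproduce that reverse-engineered logic here, but if you only need a ballpark figure, one rough (and imprecise) approach is to serialize the tool definitions and count the tokens of the result. This is only an estimate and will not match OpenAI's actual accounting the way count_tokens_for_system_and_tools does:

import json
import tiktoken

def rough_tool_token_estimate(model: str, tools: list) -> int:
    # Rough estimate only: encode the JSON-serialized tool definitions.
    # OpenAI reformats the schemas before injecting them into the system
    # message, so the true cost differs from this number.
    encoding = tiktoken.encoding_for_model(model)
    return sum(len(encoding.encode(json.dumps(tool))) for tool in tools)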

Using message history truncation in a chat app

Now that I've encapsulated the token counting and message truncation functionality in the openai-messages-token-helper package, I can use that inside my OpenAI chat apps.

For example, azure-search-openai-demo is a RAG chat application that answers questions based on content from an Azure AI Search index. In the function that handles a new question from a user, here's how we build the messages parameter for the chat completion call:

response_token_limit = 1024  # reserve room in the context window for the model's response
updated_messages = build_messages(
    model=self.chatgpt_model,
    system_prompt=system_message,
    past_messages=messages[:-1],
    new_user_content=original_user_query + "\n\nSources:\n" + content,
    max_tokens=self.chatgpt_token_limit - response_token_limit,
)

chat_completion = await self.openai_client.chat.completions.create(
    model=self.chatgpt_deployment,
    messages=updated_messages,
    temperature=0.3,
    max_tokens=response_token_limit,
    n=1,
)

We first decide how many tokens we'll allow for the response, then use build_messages to truncate the message history as needed, and finally pass the possibly truncated messages into the chat completion call.

We use very similar code in the chat handler from rag-postgres-openai-python as well.

Why isn't this built into the OpenAI API?

I would very much like for this type of functionality to be built into either the OpenAI API itself, the OpenAI SDK, or the tiktoken package, as I don't know how sustainable it is for the community to be maintaining token counting packages - and I've found similar calculation logic scattered across JavaScript, Go, Java, Dart, and Python. Our token counting logic may become out-of-date when new models or new API parameters come out, and then we have to go through the reverse engineering process again to come up with new calculations. Ultimately, I'm hopeful for one of these possibilities:

  • All LLM providers, including OpenAI API, provide token-counting estimators as part of their APIs or SDKs.
  • LLM APIs add parameters that allow developers to specify their preferred truncation or summarization schemes, such as "last_n_conversations": 10 or "summarize_all": true.
  • LLMs will eventually have such large context windows that we won't feel the need to truncate our messages based on token counts. Perhaps we'd always send the last 10 messages and be confident enough that those would fit.

Until then, I will maintain the openai-messages-token-helper package and use that whenever I feel the need to truncate conversation history.
