Thursday, March 6, 2025

Evaluating gpt-4o-mini vs. gpt-3.5-turbo for RAG applications

The azure-search-openai-demo repository was first created in March 2023 and is now the most popular RAG sample solution for Azure. Since the world of generative AI changes so rapidly, we've made many upgrades to its underlying packages and technologies over the past two years. But we've never changed the default GPT model used for the RAG flow: gpt-35-turbo.

Why, when there are new models that are cheaper and reportedly better, such as gpt-4o-mini? Well, changing the model is one of the most significant changes you can make to impact RAG answer quality, and I did not want to make the change without thorough evaluation.

Good news! I have now run several bulk evaluations on different RAG knowledge bases, and I feel fairly confident that a switch to gpt-4o-mini is a positive overall change, with some caveats. In my evaluations, gpt-4o-mini generates answers with comparable groundedness and relevance. The time-per-token is slightly lower, but the answers are 50% longer on average, so they take about 45% more time to generate. The additional length often provides additional details based on the context, especially for questions where the answer is a list or a sequential process. The gpt-4o-mini per-token pricing is about 1/3 of gpt-35-turbo pricing, so even the longer answers work out to a lower overall cost.

Let's dig into the results more in this post.

Evaluation results

I ran bulk evaluations on two knowledge bases, starting with the sample data that we include in the repository, a bunch of invented HR documents for a fictitious company. Then, since I always like to evaluate knowledge that I know deeply, I also ran evaluations on a search index composed entirely of my own blog posts from this very blog.
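If you're curious what a bulk evaluation loop looks like, here's a simplified sketch (not my actual evaluation tooling): it loops over ground truth Q/A pairs, sends each question to the running RAG app, and records the answer along with latency and answer length. The endpoint URL, request/response fields, and the qa_pairs.jsonl filename are all assumptions for illustration.

```python
import json
import time

import requests

APP_URL = "http://localhost:50505/chat"  # assumed local endpoint for the RAG app

results = []
with open("qa_pairs.jsonl") as f:  # ground truth: one {"question": ..., "truth": ...} per line
    for line in f:
        qa = json.loads(line)
        start = time.time()
        # Send the question to the app; the request/response shape here is an assumption
        response = requests.post(
            APP_URL,
            json={"messages": [{"role": "user", "content": qa["question"]}]},
        )
        latency = time.time() - start
        answer = response.json()["message"]["content"]
        results.append({
            "question": qa["question"],
            "truth": qa["truth"],
            "answer": answer,
            "latency": latency,
            "answer_length": len(answer),
        })

# The GPT-graded metrics (groundedness, relevance) are then computed over
# these results with an LLM judge.
print(f"Evaluated {len(results)} Q/A pairs, "
      f"mean latency {sum(r['latency'] for r in results) / len(results):.2f}s")
```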

Here are the results for the HR documents, for 50 Q/A pairs:

| metric | stat | gpt-35-turbo | gpt-4o-mini |
| --- | --- | --- | --- |
| gpt_groundedness | pass_rate | 0.98 | 0.98 |
| gpt_groundedness | mean_rating | 4.94 | 4.9 |
| gpt_relevance | pass_rate | 0.98 | 0.96 |
| gpt_relevance | mean_rating | 4.42 | 4.54 |
| answer_length | mean | 667.7 | 934.36 |
| latency | mean | 2.96 | 3.8 |
| citations_matched | rate | 0.45 | 0.53 |
| any_citation | rate | 1.0 | 1.0 |

For that evaluation, groundedness was essentially the same (and was already very high), and relevance increased in its average rating but not its pass rate (the percentage of answers rated 4 or 5). We do see an increase in the number of citations in the answers that match the citations from the ground truth. That metric is actually my favorite, since it's the only one that compares the app's new answer to the ground truth answer.
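In case you're wondering how a metric like citations_matched can be computed, here's a simplified sketch (not the exact implementation from my tooling): it extracts the [filename] style citations from both the generated answer and the ground truth answer with a regex, then computes the fraction of ground truth citations that also appear in the new answer.

```python
import re

CITATION_PATTERN = re.compile(r"\[([^\]]+\.html)\]")  # matches citations like [some-post.html]

def citations_matched(answer: str, truth: str) -> float:
    """Fraction of ground truth citations that also appear in the generated answer."""
    truth_citations = set(CITATION_PATTERN.findall(truth))
    answer_citations = set(CITATION_PATTERN.findall(answer))
    if not truth_citations:
        return 0.0
    return len(truth_citations & answer_citations) / len(truth_citations)

def any_citation(answer: str) -> bool:
    """Whether the generated answer contains at least one citation."""
    return bool(CITATION_PATTERN.search(answer))
```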

Here are the results for my blog, for 200 Q/A pairs:

| metric | stat | gpt-35-turbo | gpt-4o-mini |
| --- | --- | --- | --- |
| gpt_groundedness | pass_rate | 0.97 | 0.95 |
| gpt_groundedness | mean_rating | 4.89 | 4.8 |
| gpt_relevance | pass_rate | 0.89 | 0.94 |
| gpt_relevance | mean_rating | 4.04 | 4.25 |
| answer_length | mean | 402.24 | 663.34 |
| latency | mean | 2.74 | 3.27 |
| citations_matched | rate | 0.8 | 0.8 |
| any_citation | rate | 1.0 | 0.96 |

For this evaluation, we actually see a slight decrease in groundedness, an increase in relevance (both the average rating and pass rate), and the same percentage of citations matched from the ground truth.

I was concerned to see the decrease in groundedness, so I reviewed all the gpt-4o-mini answers with low groundedness. Almost all of them were variations of "I don't know." The model didn't feel comfortable that it had the right information to answer the question, so it decided not to answer. As I've discussed here in a previous blog post, that's a good thing! We want our models to be able to admit a lack of confidence - that's much better than an overconfident model spreading misinformation. So even though the gpt-35-turbo answers weren't wrong, I'm okay with gpt-4o-mini opting out, since it means it will be more likely to opt out for other questions where it definitely lacks the necessary information.
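A simple filter made those refusals easy to spot during the review. Here's a sketch of that kind of review step, assuming each per-answer result already has the gpt_groundedness rating and answer attached (the refusal phrases are just example variants, not an exhaustive list):

```python
REFUSAL_PHRASES = ["i don't know", "i do not know", "not enough information"]  # example phrasings

def find_refusals(results: list[dict]) -> list[dict]:
    """Return low-groundedness answers that look like refusals rather than wrong answers."""
    low_groundedness = [r for r in results if r["gpt_groundedness"] < 4]
    return [
        r for r in low_groundedness
        if any(phrase in r["answer"].lower() for phrase in REFUSAL_PHRASES)
    ]
```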

Why are the answers wordier?

You can also see an increase in answer length and latency in both evaluations, so it's clear that gpt-4o-mini has a tendency towards longer answers across domains.

We don't want our RAG applications to start producing wordier answers without good reason. A wordier answer requires more tokens to generate, increasing our costs, and it takes longer to finish. Fortunately, our app has a streaming interface, so users can start reading the response as soon as the first token is available, but they still may not want to wait through an unnecessarily verbose answer.
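To sanity-check the cost claim from earlier: even if gpt-4o-mini answers use roughly 50% more output tokens, the per-token price is about a third of gpt-35-turbo's, so the overall cost still drops. Here's the back-of-the-envelope arithmetic; the prices and token counts below are placeholders, so check the current Azure OpenAI pricing page for real numbers.

```python
# Back-of-the-envelope output cost comparison (illustrative numbers, not current list prices)
GPT35_PRICE_PER_1K_OUTPUT = 0.0015       # placeholder $ per 1K output tokens
GPT4O_MINI_PRICE_PER_1K_OUTPUT = 0.0006  # placeholder: roughly 1/3 of the gpt-35-turbo price

avg_tokens_gpt35 = 200       # hypothetical average answer length in tokens
avg_tokens_gpt4o_mini = 300  # ~50% longer answers, as seen in the evaluations

cost_gpt35 = avg_tokens_gpt35 / 1000 * GPT35_PRICE_PER_1K_OUTPUT
cost_gpt4o_mini = avg_tokens_gpt4o_mini / 1000 * GPT4O_MINI_PRICE_PER_1K_OUTPUT

print(f"gpt-35-turbo: ${cost_gpt35:.6f} per answer")
print(f"gpt-4o-mini:  ${cost_gpt4o_mini:.6f} per answer")
# Even with 1.5x the tokens, gpt-4o-mini comes out cheaper because the per-token price is ~3x lower.
```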

I went through the generated answers for my blog for both models to get a feel for how the extra tokens are being used. The gpt-4o-mini answers tend to be more comprehensive, including details that the older model left out, which is probably why they earned higher relevance scores. In addition, the gpt-4o-mini answers tend to use more Markdown formatting for lists and bolding, which makes the longer answers surprisingly easy to read, often easier than the shorter, less-formatted answers.

I'll share a few examples here so you can see what I mean:

Question #1: What are the options for a Web Developer to combine coding and teaching?

gpt-35-turbo:

Some options for a Web Developer to combine coding and teaching include:

  1. Developer relations: Working with API companies to help teach their APIs [girldevelopit-sf-thanks-for-great-2012.html].
  2. Engineering in EdTech: Creating online educational experiences in the field of educational technology [combining-coding-and-teaching-into.html].
  3. Teaching online: Creating content on platforms like Youtube, Udemy, or Lynda, or applying to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].

gpt-4o-mini:

A Web Developer can combine coding and teaching in several ways:

  1. Developer Relations: Work with API companies to help teach their APIs [combining-coding-and-teaching-into.html].
  2. Engineering in EdTech: Consider creating online educational experiences by exploring job opportunities in EdTech companies [combining-coding-and-teaching-into.html].
  3. Teaching Online: Create content on platforms like YouTube, Udemy, or Lynda, or apply to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].
  4. Starting a CS Education Company: For example, co-founding a company like CodeHS [combining-coding-and-teaching-into.html].
  5. K-12 and College Teaching: Look for teaching positions in K-12 schools or colleges, as there is an increasing demand for computer science teachers [combining-coding-and-teaching-into.html].
  6. Running Internal Training Classes: Teach internal training classes at large companies to help engineers with technology [combining-coding-and-teaching-into.html].
  7. Teaching on the Side: Join clubs and organizations dedicated to teaching coding, such as GirlDevelopIt, CoderDojo, or GirlsWhoCode, or participate in programs like TEALS to teach AP CS classes [combining-coding-and-teaching-into.html].

Question #2: What does .editorconfig do?

gpt-35-turbo:

The .editorconfig file is used to set coding style preferences for a project, such as indent style and size. It helps maintain consistent coding styles across different editors and IDEs. [porting-project-from-spaces-to-tabs.html]

gpt-4o-mini:

An .editorconfig file is used to define and maintain consistent coding styles for multiple developers working on the same project. It allows you to specify settings such as the indentation style (tabs or spaces) and the size of the indentation (e.g., 4 spaces for tabs). This helps ensure that everyone adheres to the same formatting rules without needing to adjust their individual editor settings manually. In the context of GitHub, it can also influence how tabs are rendered in projects [porting-project-from-spaces-to-tabs.html].

Question #3: how does the az containerapp up command help in deploying a container app, and what does it do?

gpt-35-turbo:

The az containerapp up command helps in deploying a container app by creating an Azure resource group, building the container image using the local Dockerfile, creating a registry in the Azure Container Registry and pushing the image to the registry, and finally creating a Container App that uses the pushed image and allows external ingress on a specified port [deploying-containerized-fastapi-app-to.html].

gpt-4o-mini:

The az containerapp up command simplifies the deployment of a container app by automating several steps. When executed, it performs the following actions:

  1. Creates an Azure resource group (e.g., "flask-aca-rg" or "fastapi-aca-rg") to organize resources.
  2. Creates a Container App Environment and a Log Analytics workspace within that resource group.
  3. Builds the container image using the local Dockerfile.
  4. Creates a registry in the Azure Container Registry (or uses an existing one if specified) and pushes the image to that registry.
  5. Creates a Container App (e.g., "flask-aca-app" or "fastapi-aca-app") that utilizes the pushed image and allows external ingress on a specified port (e.g., port 5000 for Flask or port 80 for FastAPI). Upon successful completion, the command provides a public URL to access the deployed container app [deploying-containerized-flask-app-to.html][deploying-containerized-fastapi-app-to.html].

Those answers are fairly representative of the differences. For short, simple questions, gpt-4o-mini may sometimes answer with slightly more detail. For any question where the answer is a list or a sequence, gpt-4o-mini is more likely to write a longer list with bolded list items for better readability.

Next steps

I will send a PR to azure-search-openai-demo to default the model to gpt-4o-mini, and once it's merged, I'll note in the release notes that developers may see longer response lengths with the new model. Developers can always override the default model, as many have been doing over the past year to use gpt-4, gpt-4o-mini, and gpt-4o.

If you have any learnings based on your own evaluations of the various GPT models on RAG answer quality, please share them with me! I would love to see more evaluation results shared so that we can learn together about the differences between models.