The azure-search-openai-demo repository was first created in March 2023 and is now the most popular RAG sample solution for Azure. Since the world of generative AI changes so rapidly, we've made many upgrades to its underlying packages and technologies over the past two years. But we've never changed the default GPT model used for the RAG flow: gpt-35-turbo.
Why, when there are newer models that are cheaper and reportedly better, such as gpt-4o-mini? Well, the model is one of the most significant factors affecting RAG answer quality, and I did not want to change it without thorough evaluation.
Good news! I have now run several bulk evaluations on different RAG knowledge bases, and I feel fairly confident that a switch to gpt-4o-mini is a positive overall change, with some caveats. In my evaluations, gpt-4o-mini generates answers with comparable groundedness and relevance. Its time-per-token is slightly lower, but its answers are 50% longer on average, so they take about 45% more time to generate. The extra length often adds details drawn from the context, especially for questions where the answer is a list or a sequential process. The gpt-4o-mini per-token pricing is about 1/3 of gpt-35-turbo pricing, which works out to a lower overall cost.
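To make the cost claim concrete, here's a quick back-of-envelope check using only the ratios mentioned above (no absolute prices, since those vary by region and change over time, so treat the numbers as illustrative):

```python
# Sanity check on the cost claim: even with ~50% longer answers,
# a model whose per-token price is ~1/3 of the old one costs less overall.
price_ratio = 1 / 3   # gpt-4o-mini per-token price relative to gpt-35-turbo
length_ratio = 1.5    # gpt-4o-mini answers are ~50% longer on average

relative_cost = price_ratio * length_ratio
print(f"Relative per-answer cost: {relative_cost:.2f}x")  # ~0.50x, i.e. about half
```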
Let's dig into the results more in this post.
Evaluation results
I ran bulk evaluations on two knowledge bases, starting with the sample data that we include in the repository, a bunch of invented HR documents for a fictitious company. Then, since I always like to evaluate knowledge that I know deeply, I also ran evaluations on a search index composed entirely of my own blog posts from this very blog.
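The general shape of a bulk evaluation run is simple: for each ground truth question, call the app's chat endpoint, record the answer along with its length and latency, then compute metrics over the results. Here's a minimal sketch of that loop; the endpoint URL, request/response shapes, and file layout are illustrative assumptions, not the exact evaluation tooling I used:

```python
import json
import time

import requests  # assumes the RAG app is running locally

APP_URL = "http://localhost:50505/chat"  # hypothetical local chat endpoint


def run_bulk_eval(ground_truth_path: str, results_path: str) -> None:
    """For each ground truth question, ask the app and record answer, length, latency."""
    with open(ground_truth_path) as f, open(results_path, "w") as out:
        for line in f:
            item = json.loads(line)  # assumed format: {"question": ..., "truth": ...}
            start = time.time()
            response = requests.post(
                APP_URL,
                json={"messages": [{"role": "user", "content": item["question"]}]},
            )
            answer = response.json()["message"]["content"]  # response shape is an assumption
            out.write(json.dumps({
                "question": item["question"],
                "truth": item["truth"],
                "answer": answer,
                "answer_length": len(answer),
                "latency": time.time() - start,
            }) + "\n")
```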
Here are the results for the HR documents, for 50 Q/A pairs:
metric | stat | gpt-35-turbo | gpt-4o-mini |
---|---|---|---|
gpt_groundedness | pass_rate | 0.98 | 0.98 |
gpt_groundedness | mean_rating | 4.94 | 4.9 |
gpt_relevance | pass_rate | 0.98 | 0.96 |
gpt_relevance | mean_rating | 4.42 | 4.54 |
answer_length | mean | 667.7 | 934.36 |
latency | mean | 2.96 | 3.8 |
citations_matched | rate | 0.45 | 0.53 |
any_citation | rate | 1.0 | 1.0 |
For that evaluation, groundedness was essentially the same (and was already very high), and relevance increased only in its average rating, not in its pass rate (the percentage of answers rated 4 or 5). However, we do see an increase in the number of citations in the answer that match the citations from the ground truth. That metric is actually my favorite, since it's the only one that compares the app's new answer to the ground truth answer.
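If you're curious how a metric like that can work, here's a minimal sketch of one way to compute citations_matched and any_citation, assuming citations appear in square brackets like [some-doc.html]; the regex and the pass criteria here are illustrative choices, not necessarily the exact implementation in the evaluation tooling:

```python
import re

# Matches bracketed citations of the form [filename.ext]
CITATION_PATTERN = re.compile(r"\[([^\]]+?\.\w+)\]")


def citation_metrics(answer: str, truth: str) -> dict:
    """Compare bracketed citations in the generated answer against the ground truth answer."""
    answer_citations = set(CITATION_PATTERN.findall(answer))
    truth_citations = set(CITATION_PATTERN.findall(truth))
    return {
        "any_citation": len(answer_citations) > 0,
        # "Matched" here means every ground truth citation also shows up in the answer.
        "citations_matched": truth_citations.issubset(answer_citations) if truth_citations else True,
    }


print(citation_metrics(
    answer="Use an .editorconfig file [porting-project-from-spaces-to-tabs.html].",
    truth="An .editorconfig file sets indentation [porting-project-from-spaces-to-tabs.html].",
))
# {'any_citation': True, 'citations_matched': True}
```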
Here are the results for my blog, for 200 Q/A pairs:
metric | stat | gpt-35-turbo | gpt-4o-mini |
---|---|---|---|
gpt_groundedness | pass_rate | 0.97 | 0.95 |
gpt_groundedness | mean_rating | 4.89 | 4.8 |
gpt_relevance | pass_rate | 0.89 | 0.94 |
gpt_relevance | mean_rating | 4.04 | 4.25 |
answer_length | mean | 402.24 | 663.34 |
latency | mean | 2.74 | 3.27 |
citations_matched | rate | 0.8 | 0.8 |
any_citation | rate | 1.0 | 0.96 |
For this evaluation, we actually see a slight decrease in groundedness, an increase in relevance (both the average rating and pass rate), and the same percentage of citations matched from the ground truth.
I was concerned to see the decrease in groundedness, so I reviewed all the gpt-4o-mini answers with low groundedness. Almost all of them were variations of "I don't know." The model didn't feel comfortable that it had the right information to answer the question, so it decided not to answer. As I've discussed here in a previous blog post, that's a good thing! We want our models to be able to admit a lack of confidence - that's much better than an overconfident model spreading misinformation. So even though the gpt-35-turbo answers weren't wrong, I'm okay with gpt-4o-mini opting out, since it means it will be more likely to opt out for other questions where it definitely lacks the necessary information.
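That kind of review is easy to do programmatically. Here's the sort of quick filter I mean, assuming the per-question evaluation results are stored in a JSONL file with gpt_groundedness, question, and answer fields; the field names and threshold are assumptions about the results layout:

```python
import json


def low_groundedness_answers(results_path: str, threshold: int = 4) -> list[dict]:
    """Return the questions whose generated answer scored below the groundedness threshold."""
    flagged = []
    with open(results_path) as f:
        for line in f:
            row = json.loads(line)
            if row["gpt_groundedness"] < threshold:
                flagged.append({"question": row["question"], "answer": row["answer"]})
    return flagged


for row in low_groundedness_answers("eval_results.jsonl"):
    # In my case, nearly all of these were polite refusals along the lines of "I don't know."
    print(row["question"], "->", row["answer"][:80])
```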
Why are the answers wordier?
You can also see an increase in answer length and latency in both evaluations, so it's clear that gpt-4o-mini has a tendency towards longer answers across domains.
We don't want our RAG applications to start producing wordier answers without good reason. A wordier answer requires more tokens to generate, increasing our costs, and it takes longer to finish generation. Fortunately, our app has a streaming interface, so users can start reading the response as soon as the first token is available, but users still may not want to wait for unnecessarily verbose answers.
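For reference, streaming with the openai Python package looks roughly like this; it's a minimal sketch with placeholder endpoint, key, and deployment names, not the app's actual chat implementation:

```python
from openai import AzureOpenAI  # assumes openai>=1.x

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder endpoint
    api_key="YOUR-KEY",                                       # placeholder credential
    api_version="2024-06-01",
)

# stream=True yields chunks as tokens are generated, so the UI can render
# the beginning of a long answer while the rest is still being written.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # your gpt-4o-mini deployment name
    messages=[{"role": "user", "content": "What does .editorconfig do?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```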
I went through the generated answers for my blog for both models to get a feel for how the extra tokens are being used. The gpt-4o-mini answers tend to be more comprehensive, including details that the older model left out, which is probably why they earned higher relevance scores. In addition, the gpt-4o-mini answers tend to use more Markdown formatting for lists and bolding, which, surprisingly, makes the longer answers easier to read than the shorter, less formatted answers.
I'll share a few examples here so you can see what I mean:
Question #1: What are the options for a Web Developer to combine coding and teaching?
gpt-35-turbo | gpt-4o-mini |
---|---|
Some options for a Web Developer to combine coding and teaching include: … | A Web Developer can combine coding and teaching in several ways: … |
Question #2: What does .editorconfig do?
gpt-35-turbo | gpt-4o-mini |
---|---|
The .editorconfig file is used to set coding style preferences for a project, such as indent style and size. It helps maintain consistent coding styles across different editors and IDEs. [porting-project-from-spaces-to-tabs.html] | An .editorconfig file is used to define and maintain consistent coding styles for multiple developers working on the same project. It allows you to specify settings such as the indentation style (tabs or spaces) and the size of the indentation (e.g., 4 spaces for tabs). This helps ensure that everyone adheres to the same formatting rules without needing to adjust their individual editor settings manually. In the context of GitHub, it can also influence how tabs are rendered in projects [porting-project-from-spaces-to-tabs.html]. |
Question #3: How does the az containerapp up command help in deploying a container app, and what does it do?
gpt-35-turbo | gpt-4o-mini |
---|---|
The … | The … |
Those answers are fairly representative of the differences. For short, simple questions, gpt-4o-mini may answer with slightly more detail. For questions where the answer is a list or a sequence, gpt-4o-mini is more likely to write a longer list with bolded items for better readability.
Next steps
I will send a PR to azure-search-openai-demo to change the default model to gpt-4o-mini, and once it's merged, I'll note in the release notes that developers may see longer responses with the new model. Of course, developers can always override the default model, as many have been doing over the past year to use gpt-4, gpt-4o-mini, and gpt-4o.
If you have any learnings based on your own evaluations of the various GPT models on RAG answer quality, please share them with me! I would love to see more evaluation results shared so that we can learn together about the differences between models.