Monday, August 11, 2025

GPT-5: Will it RAG?

OpenAI released the GPT-5 model family today, with an emphasis on accurate tool calling and reduced hallucinations. For those of us working on RAG (Retrieval-Augmented Generation), it's particularly exciting to see a model specifically trained to reduce hallucination. There are five variants in the family:

  • gpt-5
  • gpt-5-mini
  • gpt-5-nano
  • gpt-5-chat: Not a reasoning model, optimized for chat applications
  • gpt-5-pro: Only available in ChatGPT, not via the API

As soon as the GPT-5 models were available in Azure AI Foundry, I deployed them and evaluated them inside our popular open source RAG template. I was immediately impressed - not by the model's ability to answer a question, but by its ability to admit when it could not answer a question!

You see, we have one test question for our sample data (HR documents for a fictional company) that sounds like it should be an easy question: "What does a Product Manager do?" But if you actually look at the company documents, there's no job description for "Product Manager", only related jobs like "Senior Manager of Product Management". Every other model, including the reasoning models, has pretended that it could answer that question. For example, here's a response from o4-mini:

Screenshot of model responding with description of PM role

However, the gpt-5 model realizes that it doesn't have the information necessary, and responds that it cannot answer the question:

Screenshot of model responding with "I don't know"

As I always say: I would much rather have an LLM admit that it doesn't have enough information than make up an answer.

Bulk evaluation

But that's just a single question! What we really need to know is whether the GPT-5 models do a better job across the board, on a wide range of questions. So I ran bulk evaluations using the azure-ai-evaluation SDK, checking my favorite metrics: groundedness (LLM-judged), relevance (LLM-judged), and citation_match (regex-based, using the ground truth citations). I didn't bother evaluating gpt-5-nano, as some quick manual tests didn't impress me enough - plus, we've never used a nano-sized model for our RAG scenarios. Here are the results for 50 Q/A pairs:

metric             stat          gpt-4.1-mini  gpt-4o-mini  gpt-5     gpt-5-chat  gpt-5-mini  o3-mini
groundedness       pass %        94%           86%          100% 🏆   96%         94%         96%
                   mean score    4.76          4.50         5.00 🏆   4.86        4.82        4.80
relevance          pass %        94% 🏆        84%          90%       90%         74%         90%
                   mean score    4.42 🏆       4.22         4.20      4.06        4.02        4.00
answer_length      mean (chars)  829           919          844       549         987         499
latency            mean (s)      2.9           4.5          9.6       2.9         12.8        19.4
citations_matched  %             52%           49%          47%       52%         54% 🏆      51%

For the LLM-judged metrics of groundedness and relevance, the LLM awards a score of 1-5, and both 4 and 5 are considered passing scores. That's why you see both a "pass %" (percentage with 4 or 5 score) and an average score in the table above.
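If you're curious what that looks like in code, here's a rough sketch of wiring up those metrics with the azure-ai-evaluation SDK. The results.jsonl filename, its column names, and the citation regex below are illustrative placeholders, not the exact setup from our RAG template (that lives in its evaluation guide):

import os
import re

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate

# LLM judge used by the built-in groundedness/relevance evaluators
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "azure_deployment": os.environ["AZURE_OPENAI_EVAL_DEPLOYMENT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
}

def citation_match(*, response: str, ground_truth: str, **kwargs):
    # Regex-based check: did the answer cite every [file.pdf#page=N] citation
    # that appears in the ground truth answer?
    pattern = r"\[([^\]]+\.pdf#page=\d+)\]"
    expected = set(re.findall(pattern, ground_truth))
    cited = set(re.findall(pattern, response))
    return {"citations_matched": expected <= cited}

result = evaluate(
    data="results.jsonl",  # one JSON object per line: query, context, response, ground_truth
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "citation_match": citation_match,
    },
)
print(result["metrics"])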

For the groundedness metric, which measures whether an answer is grounded in the retrieved search results, the gpt-5 model does the best (100%), while the other gpt-5 models also do quite well, on par with our current default model of gpt-4.1-mini. For the relevance metric, which measures whether an answer fully answers the question, the gpt-5 models don't score as highly as gpt-4.1-mini. I looked into the discrepancies there, and I think that's actually due to gpt-5 being less willing to give an answer when it's not fully confident in it - it would rather give a partial answer instead. That's a good thing for RAG apps, so I am comfortable with that metric being less than 100%.

The latency metric is generally higher for the gpt-5 reasoning models, as would be expected, but I think some network latency on my end affected the gpt-5-mini numbers, so I encourage you to run your own latency evaluation. Latency will also depend on deployment region, region capacity, etc., assuming you're not using a "provisioned throughput" deployment. Also note that the latency here records the total time to receive the complete response, whereas the most important metric for a user-facing streaming chat is the time to the *first* token.

For the gpt-5 reasoning models in this evaluation, I set the reasoning_effort to "minimal", which means the model chooses whether to use reasoning tokens. I have never seen it actually use any reasoning tokens when I set the effort to minimal, so maybe that means that a higher reasoning effort is really only needed for longer or more complex tasks, and RAG answering is a non-reasoning task. A higher reasoning effort would definitely affect the latency and likely also affect the answer quality. I did not test that out, since the "minimal" effort setting already results in high quality answers.
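For reference, here's roughly how that parameter gets passed when calling the Chat Completions API with a recent version of the openai Python package. The deployment name and messages below are placeholders - the real template builds the messages from the retrieved sources:

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],  # a version that supports your gpt-5 deployment
)

response = client.chat.completions.create(
    model="gpt-5-mini",          # your deployment name
    reasoning_effort="minimal",  # "minimal" | "low" | "medium" | "high"
    messages=[
        {"role": "system", "content": "Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know."},
        {"role": "user", "content": "What does a Product Manager do?\n\nSources: ..."},
    ],
)
print(response.choices[0].message.content)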

Answer differences

Now that we've seen the overall metrics, let's dig into some actual answers, and see some of the key ways that GPT-5 model answers differ.

Quicker to say "I don't know" 🤷🏻

Let's start with the first thing I noticed: gpt-5 was happy to admit "I don't know". Of the 50 questions evaluated, the gpt-5 model started off its answer with "I don't know" for 3 of the answers, while gpt-5-mini started that way for 10 of the answers. In table form:

gpt-4.1-mini  gpt-5   gpt-5-mini  gpt-5-chat
1 (2%)        3 (6%)  10 (20%)    1 (2%)
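Here's the tiny sketch I'd use to tally those up - the results.jsonl filename and its "response" field are assumptions about the evaluation output, not the project's exact schema:

import json

refusals = 0
total = 0
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        answer = json.loads(line)["response"]
        total += 1
        # Normalize the curly apostrophe so "I don’t know" also counts
        if answer.replace("\u2019", "'").lower().startswith("i don't know"):
            refusals += 1

print(f"{refusals}/{total} answers started with 'I don't know'")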

Here's an example of a gpt-4.1-mini answer versus a gpt-5 answer:

Can you explain the eligibility criteria for the Northwind Standard plan offered to Contoso employees?

gpt-4.1-mini:
The eligibility criteria for the Northwind Standard plan for Contoso employees are not explicitly detailed in the provided sources. However, it is mentioned that prior-authorization may be required for certain services or treatments, which involves approval from Northwind Health based on medical necessity and coverage under the plan. Employees must have their physician submit a request for prior-authorization, and in some cases, meet specific criteria such as medical tests, treatments, or lifestyle changes to be approved. Preventive care, emergency services, mental health, and substance abuse coverage do not require prior-authorization. Out-of-network services do require prior-authorization and may incur additional costs [Northwind_Standard_Benefits_Details.pdf#page=63][Northwind_Standard_Benefits_Details.pdf#page=102]. If you need information on basic eligibility (such as employment status or tenure), that is not provided in these sources.

gpt-5:
I don’t know. The provided sources describe compliance (ERISA, COBRA, HIPAA, ACA) and note that Northwind Standard is a group plan covering “all eligible employees of Contoso and their spouses or dependents,” but they do not define the specific eligibility criteria (e.g., hours worked, waiting period, employment class). [Northwind_Standard_Benefits_Details.pdf#page=88][Northwind_Standard_Benefits_Details.pdf#page=85][Northwind_Standard_Benefits_Details.pdf#page=86]

Both of the answers admit some degree of not-knowing, but gpt-5 comes right out with it at the start of the answer. Notably, our system prompt includes this instruction:

Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know.

So what we may be seeing is a combination of two factors: 1) the gpt-5 reasoning models are generally better at following instructions, and 2) the gpt-5 reasoning models went through a specific alignment process to reduce hallucination. Notably, however, the gpt-5-chat model rarely responded with "I don't know", so it is not as affected by whatever factors make the reasoning models more likely to answer that way.

You may be wondering why I have questions in the evaluation set that aren't fully answerable - it's hard to generate a good set of ground truth for evaluation, and my ground truth data is not the optimal set of questions. Ideally, I would do a separate evaluation for all the fully knowable questions versus all the unknowable questions - and I even have a specific custom metric for "don't-know-ness" that I run on those - but a few partially knowable questions snuck into this ground truth data set.

Reasoning models ❤️ lists

I saw a definite tendency of the GPT-5 reasoning models to output the answer as a list, even though there's nothing in the RAG system prompt that tells them to use lists. Out of the 50 questions evaluated, here are the counts of answers formatted as a top-level bulleted list:

gpt-4.1-mini  gpt-5     gpt-5-mini  gpt-5-chat
0             36 (72%)  26 (52%)    0

Here's an example of a gpt-4.1-mini answer (paragraph) versus a gpt-5 answer (list):

What are the coverage limitations of the Northwind Standard plan regarding clinical trials?

gpt-4.1-mini:
The Northwind Standard plan covers certain clinical trial services such as diagnostic testing, treatment of the condition being studied, medications, lab services, and imaging services. However, it does not cover travel expenses associated with attending clinical trials. Additionally, any experimental treatments or services not part of the clinical trial are not covered. Coverage for other clinical trial services not explicitly listed may be considered on a case-by-case basis. Members should contact Northwind Health customer service for more details [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24].

gpt-5:
  • Travel expenses for attending clinical trials are not covered [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24].
  • Experimental treatments or services that are not part of the clinical trial are not covered [Northwind_Standard_Benefits_Details.pdf#page=24].
  • Services or treatments deemed experimental or investigational are not covered by the plan [Northwind_Standard_Benefits_Details.pdf#page=53].

Now, is it a bad thing that the gpt-5 reasoning models use lists? Not necessarily! But if that's not the style you're looking for, then you'll either want to consider the gpt-5-chat model or add specific messaging in the system prompt to steer the model away from top-level lists.
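For example, an instruction along these lines (illustrative wording, not something that's in the template's actual prompt) should steer the reasoning models back toward prose:

Answer in concise prose paragraphs. Do not format the answer as a bulleted or numbered list unless the user explicitly asks for a list.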

Longer answers

As we saw in the overall metrics above, the models differ in answer length (measured in number of characters, not tokens). Let's isolate those stats:

gpt-4.1-mini  gpt-5  gpt-5-mini  gpt-5-chat
829           844    990         549

The gpt-5 reasoning models are generating answers of similar length to the current baseline of gpt-4.1-mini, though the gpt-5-mini model seems to be a bit more verbose. The API now has a new parameter to control verbosity for those models, which defaults to "medium". I did not try an evaluation with that set to "low" or "high", which would be an interesting evaluation to run.

The gpt-5-chat model outputs relatively short answers, closer to the answer lengths that I used to see from gpt-3.5-turbo.

What answer length is best? A longer answer takes longer to finish rendering for the user (even when streaming) and costs the developer more tokens. However, sometimes answers are longer due to better formatting that is easier to skim, so longer does not always mean less readable. For the user-facing RAG chat scenario, I generally think that shorter answers are better. If I were putting these gpt-5 reasoning models in production, I'd probably try out the "low" verbosity value, or put instructions in the system prompt, so that users get their answers more quickly. They can always ask follow-up questions as needed.
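Here's roughly what that would look like with the Chat Completions API, mirroring the earlier reasoning_effort sketch; the verbosity parameter is the new option mentioned above, and whether it's accepted depends on your openai package and API version, so treat this as a sketch:

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

response = client.chat.completions.create(
    model="gpt-5",               # your deployment name
    reasoning_effort="minimal",
    verbosity="low",             # "low" | "medium" (default) | "high"
    messages=[
        {"role": "system", "content": "Answer ONLY with the facts listed in the list of sources below. Keep answers brief."},
        {"role": "user", "content": "What does a Product Manager do?\n\nSources: ..."},
    ],
)
print(response.choices[0].message.content)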

Fancy punctuation

This is a quirk I discovered while researching the other differences: the GPT-5 models are more likely to use “smart” (curly) quotes instead of standard ASCII quotes. Specifically:

  • Left single: ‘ (U+2018)
  • Right single / apostrophe: ’ (U+2019)
  • Left double: “ (U+201C)
  • Right double: ” (U+201D)

For example, the gpt-5 model actually responded with "I don’t know", not with "I don't know". It's a subtle difference, but one worth knowing about if you do any sort of post-processing or analysis on the answers. I've also seen some reports of the models using smart quotes incorrectly in coding contexts, so that's another potential issue to watch out for. My assumption is that the models were trained on data that tends to use smart quotes more often, perhaps synthetic data or book text. As a normal human, I rarely use them, given the extra effort required to type them.
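If you do need plain ASCII downstream, a tiny normalization step like this is enough (just a sketch):

# Map the curly quotes back to their ASCII equivalents
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})

def normalize_quotes(text: str) -> str:
    return text.translate(SMART_TO_ASCII)

print(normalize_quotes("I don\u2019t know"))  # -> I don't know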

Evaluate for yourself!

I share my evaluations on our sample data as a way to share general learnings on model differences, but I encourage every developer to evaluate these models for your specific domain, alongside domain experts who can reason about the correctness of the answers. How can you evaluate?

  • If you are already using our open source RAG project for Azure, deploy the GPT-5 models and follow the steps in the evaluation guide.
  • If you have your own solution, you can use an open-source evaluation SDK like azure-ai-evaluation (the one that I use), DeepEval, promptfoo, etc. If you are using an observability platform like Langfuse, Arize, or LangSmith, they have evaluation strategies baked in. And if you're using an agent framework like Pydantic AI, it likely has built-in eval mechanisms as well.

If you can share what you learn from evaluations, please do! We are all learning about the strange new world of LLMs together.
