Monday, March 16, 2026

Do stricter MCP tool schemas increase agent reliability?

MCP servers contain tools, and each tool is described by its name, description, input parameters, and return type. When an agent calls a tool, it formulates the call based solely on that metadata; it knows nothing about the tool's internals. For my PyAI talk last week, I investigated this hypothesis:

If we use stricter types for MCP tool schemas, then agents calling those tools will be more successful.

This was a hypothesis based on my personal experience over the last year of developing with agents and MCP servers, where I'd started with MCP servers with very minimal schemas, witnessed agents failing to call them correctly, and then iterated on the schemas to improve tool-calling success. I thought for sure that my hypothesis would be validated with flying colors. Let's see what I discovered instead...

Table of contents:
  1. A basic MCP tool and schema
  2. Annotating parameters with descriptions
  3. Constraining parameters with types
  4. Setting up evaluations
  5. Evaluation results: category
  6. Evaluation results: date
  7. Cross-model evaluations
  8. Impact of reasoning effort
  9. Comparing agent frameworks
  10. Takeaways

A basic MCP tool and schema

For this experiment, I built an MCP server that can add expenses to a database. My add_expense tool needs four pieces of information:

  • date: The date that the expense was incurred
  • amount: The amount of the expense
  • category: The category of the expense
  • description: A free-form description of the expense

This is what a first attempt at the tool might look like, using the Python FastMCP framework, and a Python type annotation for each parameter:

@mcp.tool
async def add_expense(
    expense_date: str,
    amount: float,
    category: str,
    description: str,
):
    """Add a new expense."""
    ...

See full code in expenses_mcp.py.

When FastMCP generates the schema based on that function signature, it produces this JSON schema:

{
  "name": "add_expense",
  "description": "Add a new expense.",
  "inputSchema": {
    "properties": {
      "expense_date": {"type": "string"},
      "amount": {"type": "number"},
      "category": {"type": "string"},
      "description": {"type": "string"}
    },
    "required": ["expense_date", "amount",
      "category", "description"],
    "type": "object"
  }
}

That schema is what agents see - nothing else! The name is the function name, the description is the function docstring, and the inputSchema describes each parameter based on its type annotation, and marks all of them as required, since none of them are marked as optional.

We've done only the bare minimum for that tool schema, assigning types for each parameter. But most of those types are bare strings, so the LLM can decide what to pass into each string. As we know, LLMs can be very creative, and can vary wildly in their choices. For example, this is a word cloud of the category values across 83 tool calls:

Word cloud of category strings

Now let's explore different ways to enhance the generated schemas, and evaluate whether those better schemas improve agent success.

Annotating parameters with descriptions

The first step that I always recommend to developers is to annotate each parameter with a description. Any LLM that is using the tool will see the description, and will alter its behavior based on the guidance inside. (We are basically doing prompt engineering inside our function signatures!) To add a description with FastMCP, wrap the type annotation in typing.Annotated and pass in a pydantic.Field with a description. This tool definition adds a description to just the category field:

from pydantic import Field
from typing import Annotated

@mcp.tool
async def add_expense_cat_b(
    expense_date: date,
    amount: float,
    category: Annotated[
        str,
        Field(
            description="Must be one of: Food & drink, Transit and Fuel, Media & streaming, Apparel and Beauty, "
            "Electronics & tech, Home and office, ..."
        ),
    ],
    description: str,
):

With that change, the generated JSON schema now includes the description:

"category": {
    "type": "string",
    "description": "Must be one of:
      Food & drink,
      Transit and Fuel,
      Media & streaming,
      Apparel and Beauty,
      Electronics & tech,
      Home and office, ..."
}

The description can be quite long - and in fact, my actual description became a lot longer to guide the LLM when faced with ambiguous cases:


Choose the closest category for the expense.
Do not ask follow-up questions just to disambiguate the category;
pick the best fit using the description and common sense.
If truly unclear, use Misc.

Heuristics: Food & drink = meals, groceries, coffee, restaurants, snacks;
Transit and Fuel = rideshare, taxi, gas, parking, public transit, tolls;
Media & streaming = movies, concerts, subscriptions, streaming, games, tickets;
Apparel and Beauty = clothing, shoes, cosmetics, haircuts, personal care;
Electronics & tech = devices, gadgets, accessories, apps, software;
Home and office = furniture, supplies, housewares, decor, cleaning;
Health & Fitness = gym, medical, wellness, supplements, pharmacy;
Arts and hobbies = crafts, sports equipment, creative supplies, lessons;
Fees & services = banking, professional services, insurance, subscriptions;
Misc = anything that does not fit well into other categories.

However, the longer the description, the higher the token cost, so you don't get a long description for free!
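To get a rough sense of that cost, you can estimate the token count of a schema fragment before shipping it. This sketch uses the common rule of thumb of roughly four characters per token (an approximation; for exact counts you would run the model's actual tokenizer, e.g. via tiktoken):

```python
import json

# Hypothetical fragment: a shortened version of the category description above.
schema_fragment = {
    "category": {
        "type": "string",
        "description": (
            "Choose the closest category for the expense. "
            "Heuristics: Food & drink = meals, groceries, coffee; "
            "Transit and Fuel = rideshare, taxi, gas, parking; ..."
        ),
    }
}

# Rough rule of thumb for English text: ~4 characters per token.
approx_tokens = len(json.dumps(schema_fragment)) // 4
print(approx_tokens)
```

Multiply that by the number of tools a server exposes, and long descriptions add up quickly.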

Constraining parameters with types

See how we're asking the LLM to constrain itself to a single option in a pre-determined list of options? In this case, we can enforce that in the schema, using enum types. With FastMCP, we can specify that in two different ways. The first option is to type the parameter as a Literal:

from typing import Literal

@mcp.tool
async def add_expense_cat_c(
    expense_date: date,
    amount: float,
    category: Literal[
      "Food & drink", "Transit and Fuel", "Media & streaming", ...
    ],
    description: str,
):

The second option is to type the parameter as an Enum:

from enum import Enum

class Category(Enum):
    FOOD_AND_DRINK = "Food & drink"
    TRANSIT_AND_FUEL = "Transit and Fuel"
    ...

@mcp.tool
async def add_expense_cat_d(
    expense_date: date,
    amount: float,
    category: Category,
    description: str,
):

Both of those options result in the same generated JSON schema for the category parameter:

"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ]
}

Fun fact: For the Enum case, FastMCP used to generate a different JSON schema that used "references", but multiple models errored when they saw that schema. FastMCP decided to simplify both cases to always output the flat enum array to reduce model errors.

We can combine these approaches, wrapping an Enum with a description, like so:

category: Annotated[
    Category,
    Field(
        description=(
            "Choose the closest category. "
            "If truly unclear, use Misc.\n\n"
            "Heuristics: "
            "Food & drink=meals, coffee; "
            "Transit and Fuel=rideshare, "
            "gas, parking; ..."
        )
    ),
],

See full code in expenses_mcp.py.

Then the generated schema includes both the possible values and the description with guidance on selecting them:

"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ],
    "description": "Choose the closest
      category. If truly unclear, use
      Misc. Heuristics: Food & drink=
      meals, coffee; Transit and Fuel=
      rideshare, gas, parking; ..."
}

Any constraint should beat a bare string for something as free-form as category — but which of these schemas has the greatest impact on getting the agent to pass in the right one? To find out, I set up a series of evaluations.

Setting up evaluations

In my expenses MCP server, I defined multiple tools, each with a different version of the schema:

def add_expense_cat_b(category: Annotated[str, Field(description="...")], ...): ...

def add_expense_cat_c(category: Literal["Food & drink", ...], ...): ...

def add_expense_cat_d(category: Category, ...): ...

def add_expense_cat_e(category: Annotated[Category, Field(description="...")], ...): ...

Next, I created an agent using Pydantic AI and pointed it to my local expenses MCP server. Here's simplified code:

server = MCPServerStreamableHTTP(url="http://localhost:8000/mcp")

model = OpenAIResponsesModel(
    "gpt-4.1-mini",
    provider=OpenAIProvider(openai_client=azure_openai_client))

agent = Agent(
    model,
    system_prompt=(
        "You help users log expenses. "
        f"Today's date is {datetime.now().strftime('%B %-d, %Y')}."
    ),
    output_type=str,
    toolsets=[server],
)

result = await agent.run("I bought a sandwich for $12.50.")

See full code in pydanticai_agent.py.

Now, I needed a way to vary which tool schema the agent saw. Fortunately, Pydantic AI makes it easy to filter tools on MCP servers, using code like this:

toolset = server.filtered(
    lambda ctx, tool: tool.name == "add_expense_cat_b")
agent = Agent(model, toolsets=[toolset], ...)
result = await agent.run(case.prompt)

See full code in runner.py.

I came up with 17 different example user inputs for the agent:

  • Yesterday I bought a sandwich for $12.50.
  • I bought a sandwich the Monday before this one for $12.50.
  • Two Mondays ago I spent $8.75 on coffee.
  • I bought a sandwich on the first Monday of this month for $12.50.
  • On the last day of last month I spent $25.99 on a movie ticket.
  • I paid $60 for gas on the last business day of last month.
  • The day before yesterday I spent $4.50 on coffee.
  • Three days ago I took an Uber to the airport for $38.
  • Last Friday I spent $18 on a movie ticket.
  • The day after tomorrow I will buy a bus pass for $20.
  • Yesterday I paid $65 for Instacart grocery delivery.
  • On the last day of last month I bought headphones for $79.99.
  • Yesterday I bought a car for 35000 USD.
  • I paid $0.99 for an app yesterday.
  • Yesterday I spent $200 on a spa treatment.
  • I spent €50 on dinner yesterday.
  • Ayer compré una laptop por 1200 dólares.

For each input, I stored the desired values for each of the tool parameters, like so:

EXPENSE_CASES = [
    ExpenseCase(
        name="clear_food_yesterday",
        prompt="Yesterday I bought a sandwich for $12.50.",
        expected_category="Food & drink",
        expected_date=get_yesterday(),
        expected_amount=12.50,
    ),
    ...  # 17 cases ➡️
]

See full code in dataset.py.
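The case structure and date helpers can be sketched with stdlib dataclasses. This is an assumed reconstruction for illustration, not the exact dataset.py code:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ExpenseCase:
    name: str
    prompt: str
    expected_category: str
    expected_date: date
    expected_amount: float


def get_yesterday() -> date:
    # Relative expectations are computed at run time, so the dataset
    # stays correct no matter which day the evaluation runs on.
    return date.today() - timedelta(days=1)
```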

I ran the agent on each input for each tool schema variant: 17 inputs × 4 schema variants, 68 runs in total:

for variant in ["cat_b", "cat_c", "cat_d", "cat_e"]:
    toolset = server.filtered(
        lambda ctx, tool: tool.name == f"add_expense_{variant}")
    agent = Agent(model, toolsets=[toolset], ...)
    for case in EXPENSE_CASES:
        result = await agent.run(case.prompt)
        evals = run_all_evaluations(
            result.tool_calls, case)

See full code in runner.py.

Each time the agent ran, I inspected the tool calls to verify whether it had issued a tool call at all, and whether the tool call arguments matched the expected values. I recorded the results in both a JSON file and a more human-readable Markdown file.
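The per-case checks can be sketched roughly like this. This is a simplified stand-in for run_all_evaluations; the exact shape of the recorded tool calls (a dict with an "args" key) is an assumption about the stored data, not Pydantic AI's API:

```python
def evaluate_case(tool_calls: list[dict], case) -> dict[str, bool]:
    """Score one agent run against an ExpenseCase's expected values.

    Each recorded tool call is assumed to be a dict whose arguments
    live under an "args" key.
    """
    results = {"tool_called": len(tool_calls) > 0}
    if tool_calls:
        args = tool_calls[0]["args"]
        results["category_match"] = args.get("category") == case.expected_category
        results["date_match"] = args.get("expense_date") == case.expected_date.isoformat()
        results["amount_match"] = args.get("amount") == case.expected_amount
    return results
```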

Evaluation results: category

For the four category variants, these are the results across the 17 cases:

                              Annotated[str]  Literal  Enum   Annotated[Enum]
Was tool called?              15/17           16/17    16/17  17/17
Category match (when called)  14/15           13/16    13/16  15/17
Schema size (avg tokens)      374             412      424    836

There's no clear winner amongst the first three schemas. For the first schema, where we just provided a description, the agent was more likely to decide not to call the tool at all, and instead respond with a clarifying question, like "could you please provide a category?". That may be desirable for some scenarios, to encourage agents to ask users in the face of ambiguity, but if we believe that we've provided enough information in the schema for the agent to make a clear choice, then our schema has failed. For the middle two schemas, where we provided just the enum options with no description, the agent was more likely to call the tool, but it selected the wrong category more often. That makes sense, since the schema lacked the description with the additional guidance.

The final schema is the clear winner: the agent called the tool every time and matched the desired category most often. There is a drawback, of course, and that's why I included the schema size in the table: the combination of description and enum list roughly doubled the schema size compared to the other variants. That extra cost is likely worth it here, but we always need to weigh quality improvements against their token cost.

You might be thinking, "hey, clearly stricter schemas are always better!" Alas, the story gets murkier.

Evaluation results: date

Remember that our add_expense tool also has the expense_date parameter, specified as a string in our basic schema. I wanted to make sure that those dates always came in a format that I could easily store in my database as YYYY-MM-DD, so I came up with three stricter schemas.

I started off by adding a description specifying the format:

expense_date: Annotated[
    str, "Date in YYYY-MM-DD format"
]

As a reminder, that generates this JSON schema:

"expense_date": {
    "description": "Date in YYYY-MM-DD format",
    "type": "string"
}

Then I discovered that FastMCP supports date as a type for tool parameters, so I added that variant:

expense_date: date

That results in this JSON schema:

"expense_date": {
    "format": "date",
    "type": "string"
}

I thought it'd be fun to also try using a regular expression, since FastMCP supports that too:

expense_date: Annotated[
    str,
    Field(pattern=r"^\d{4}-\d{2}-\d{2}$"),
],

See full code in expenses_mcp.py.

That schema adds a "pattern" field to the generated JSON schema:

"expense_date": {
    "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
    "type": "string"
}

Here are the evaluation results running the Pydantic AI agent with gpt-4.1-mini across the 17 cases and 4 schema variants, including the bare string:

                          str    Annotated[str]  date   Field(pattern)
Was tool called?          17/17  17/17           17/17  17/17
Date match (when called)  12/17  12/17           12/17  12/17
Schema size (avg tokens)  326    406             414    423

Do you see what I see? Every single variant had the same success rates! The agent called the tool 100% of the time, and it matched the expected date the same fraction of the time. I expected lower success for the bare string schema, but even without any description at all, the agent always specified the date in YYYY-MM-DD format. It appears that because I named the field with "_date" and YYYY-MM-DD is the standard ISO format for dates, that's the format the model defaults to. I suspect that if I had tried the evaluation with a small language model or the oldest tool-calling model possible, I might have seen worse results. Frontier models, however, need no additional prompting to produce dates in the standard ISO format.

Of course, you likely still want to use one of these schemas to guide the agents, to be on the safe side, and they fortunately do not increase the token size significantly. Personally, I like the date option, since that plays nicely with the rest of the Python server code.
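Whichever schema variant you pick, the server side can validate the incoming string with the stdlib. A small sketch (parse_expense_date is a hypothetical helper, not from the post's code):

```python
from datetime import date


def parse_expense_date(value: str) -> date:
    # date.fromisoformat raises ValueError for malformed strings and for
    # impossible dates such as "2026-02-29" (2026 is not a leap year).
    return date.fromisoformat(value)
```

Using the `date` type annotation gets you equivalent validation from FastMCP's Pydantic layer for free.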

You might be wondering about all the cases where the agent failed to suggest the right date. All of those failures were due to date math. For example, when the user says "Two Mondays ago I spent $8.75 on coffee.", the agent calculated the date as one Monday ago instead of two Mondays ago. If users were truly entering their data like this, then it might be a good idea to equip the server with some date calculation tools, or give the agent some guidance on when it should ask users to clarify the date.

Cross-model evaluations

After seeing the results for an agent powered by gpt-4.1-mini, I was super curious to see what would happen with both an older model and a newer model, so I deployed a gpt-4o and a gpt-5.3-codex and ran them through the same evaluations.

For the category schema variants, the results are very interesting:

Did agent call the tool?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   15/17     17/17
Literal          17/17   16/17     17/17
Enum             17/17   16/17     17/17
Annotated[Enum]  17/17   17/17     17/17

When called, did category match expected?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   14/15     15/17
Literal          15/17   13/16     13/17
Enum             14/17   13/16     13/17
Annotated[Enum]  17/17   15/17     15/17

As you can see, the gpt-4o model appears to be the winner: it always calls the tool, and it matches the category correctly 100% of the time, as long as it is provided a description. The gpt-5.3-codex model also always calls the tool, but it often chooses a different category than our desired category. So, at least for this particular scenario, the gpt-4o model aligns closer to our human decision-making process than the gpt-5.3-codex model.

But what if the newer model is just smarter than we are? Consider this example input and category choices:

"Yesterday I spent $200 on a spa treatment." with Annotated[Enum]

  • gpt-4o: 🤖 Health & Fitness
  • gpt-4.1-mini: 🤖 Apparel and Beauty
  • gpt-5.3-codex: 🤖 Apparel and Beauty

We marked "spa treatment" as "Health & Fitness" in our data, but newer models both preferred "Apparel and Beauty". Both of them seem like reasonable options, so the model disagreement is pointing out the ambiguity in the categories of our ground truth data. If we really wanted "spa treatment" to be "Health & Fitness", then we may need to give that example in our category description. Or, we might decide to change our ground truth data entirely to agree with the newer model's category selection. That's one thing that I love about running evaluations: they force you to think more deeply about your expectations of LLMs in the face of diverse user inputs.

For the date schema variants, we see a very different story:

Did agent call the tool?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             17/17   17/17     17/17
Annotated[str]  17/17   17/17     17/17
date            17/17   17/17     17/17
Field(pattern)  17/17   17/17     17/17

When called, did date match expected?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             15/17   12/17     17/17
Annotated[str]  15/17   12/17     17/17
date            15/17   12/17     17/17
Field(pattern)  15/17   12/17     17/17

The gpt-5.3-codex model is the clear winner here, as it calls the tool and selects the right date 100% of the time. Remember how gpt-4.1-mini couldn't do the date math? Apparently this newer model can! It correctly calculated "two Mondays ago", and when told "the last day of last month", it even realized that there is no February 29th in 2026. The gpt-4.1-mini model has the worst results here, and that may be due to it being a "-mini" model.

But now we're scratching our heads, because gpt-5.3-codex was not the winner across the board, despite being the much newer model. Let's dig deeper.

Impact of reasoning effort

The gpt-5.3-codex model is a reasoning model - and reasoning models can have very different outputs based on their reasoning effort level. I ran that evaluation on "medium", the default level, but for that model, the effort can be "low", "medium", "high", or "xhigh". Here are the results:

                                  low    medium  high   xhigh
Did category match ground truth?  100%   88.2%   88.2%  88.2%
Did date match ground truth?      100%   100%    100%   100%
Schema size (average tokens)      862    890     939    1114
Latency (average ms)              7,129  7,474   8,828  11,554

For the category schema variants, the category match percentage drops from 100% to 88% as soon as we go from "low" to "medium" and stays there. Meanwhile, higher reasoning effort steadily increases cost and latency: by the time we reach "xhigh", the schema size has grown by about 250 tokens and the average latency by more than 60%.
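For reference, this is roughly how the effort level gets set when building the Pydantic AI agent. A sketch assuming the OpenAIResponsesModelSettings API in current pydantic-ai versions (field names and accepted levels may differ in yours):

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import (
    OpenAIResponsesModel,
    OpenAIResponsesModelSettings,
)

# Effort levels per the post: "low", "medium", "high", "xhigh";
# the value is passed through to the Responses API.
settings = OpenAIResponsesModelSettings(openai_reasoning_effort="low")
agent = Agent(OpenAIResponsesModel("gpt-5.3-codex"), model_settings=settings)
```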

Since this is a reasoning model, we can actually dig into the reasoning traces, to give us more insight into how we might steer the model more towards our desired category.

Here's how it reasons on "low" mode:

I’m considering categories like Health & Fitness or Apparel/Beauty since it’s for a spa treatment. There’s ambiguity because spa treatments could fit into personal care or wellness, but I might lean towards Apparel and Beauty. However, Health & Fitness could work too, especially under wellness. I’ll go with one of those!

As you can see, it's not even sure which category to select at this level of reasoning - we only see its final selection in the tool call itself. Compare that to "medium" mode:

The category for this expense is a spa treatment. I'm considering whether it should go under Apparel and Beauty or Health & Fitness. Using some heuristics, personal care fits in Apparel and Beauty, while wellness aligns with Health & Fitness. Since a spa treatment feels more like a beauty or personal care choice, I think I'll choose Apparel and Beauty.

This time, it explicitly decided on the category in its reasoning, and called the tool with the selection. Once again, you might agree with the model's choice here, and change the ground truth itself.

When we are developing MCP servers, we're not necessarily in control of the models powering the agents that call those MCP servers. Ideally, we're designing tool schemas that are clear and constrained enough so that all the most popular models at all reasoning effort levels will call our tools the way we expect.

Comparing agent frameworks

We live in a world with hundreds of agent frameworks and coding agent tools. All of them share a common approach: calling tools in a loop until the user's goal is reached. Behind the scenes, agent implementations vary. Some agents attach their own system prompts to your prompt; some agents add in memory and caching; some agents have special built-in reflection and retry loops. So when it comes to calling MCP servers, how much variance might we expect to see?

For my final evaluation, I wrote an agent using the GitHub Copilot SDK, and gave it the same system prompt and MCP server connection as the Pydantic AI agent. The simplified code:

client = CopilotClient()

session = await client.create_session(SessionConfig(
    model="gpt-5.3-codex",
    mcp_servers={
        "expenses": MCPRemoteServerConfig(
            type="http",
            url="http://localhost:8000/mcp",
            tools=["add_expense_cat_e"],
        )
    },
    system_message={
        "mode": "replace",
        "content": "You help users log expenses. "
            f"Today's date is {datetime.now().strftime('%B %-d, %Y')}.",
    },
))

await session.send_and_wait({"prompt": "I bought a sandwich for $12.50."})

See full code in copilotsdk_agent.py.

For the evaluation, I used the gpt-5.3-codex model on medium effort across the 4 category schemas, 4 date schemas, and 17 inputs. The results:

Was tool called at all?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   17/17        17/17
Literal          17/17        17/17
Enum             17/17        17/17
Annotated[Enum]  17/17        17/17

Did category match expected?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   15/17        15/17
Literal          13/17        13/17
Enum             13/17        13/17
Annotated[Enum]  15/17        15/17

Did date match expected?

Schema          Pydantic AI  Copilot SDK
str             17/17        17/17
Annotated[str]  17/17        17/17
date            17/17        17/17
Field(pattern)  17/17        17/17

The success rates are exactly the same across both agents! Now, I will confess that in my first attempt at the evaluation, the Copilot SDK agent had an off-by-one error in every date it selected, and I suspect there's a UTC date somewhere in the default system prompt. When I re-ran the evaluation on a date when UTC and my timezone (PT) agreed, the dates were all correct. You learn all sorts of things when running evaluations.
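That off-by-one is easy to reproduce directly, by comparing the calendar date in a local timezone against UTC. A sketch using the stdlib (the timezone name is an example):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

local_today = datetime.now(ZoneInfo("America/Los_Angeles")).date()
utc_today = datetime.now(timezone.utc).date()

# Every evening from 4 or 5pm Pacific until midnight, UTC has already
# rolled over to the next day, so a UTC-based "today" in a system
# prompt disagrees with the user's calendar date by one.
print(f"local={local_today} utc={utc_today} agree={local_today == utc_today}")
```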

Takeaways

I went into this investigation certain that I would see significant improvement from agents when I used stricter types and constraints for the parameters. I realize now that models have improved so much, and been so robustly trained for tool calling, that they often do not need the specificity of stricter types. They mostly need clarity wherever there is ambiguity, and that can come in the form of a string description.

However, there are still other benefits to using stricter schemas, like increased type safety and validation in our MCP server codebase. Personally, I would rather use date for the date input and Enum for the category input, as those lead to cleaner code inside the tool code.

LLMs, and the agents powered by them, are both non-deterministic and not that predictable. The only way to really see how an agent will respond to your MCP server tool schemas is to set up evaluations for the scenarios that you care about. If you're new to evaluations, check out the fantastic resources from ML engineer Hamel Husain.

I learned a lot during this investigation, and hope my approach is useful to you as well. All of the code — the MCP server, schema variants, agents, and evaluation framework — is available in my GitHub repository, so feel free to explore, adapt, and run your own experiments. Please share any of your own experience with MCP tool schemas and evaluations with me. Thank you!
