I'm writing this post on the flight back from PyCon US 2026 in Long Beach, California.
It was my second time attending PyCon, and it was a fantastic conference -
a cornucopia of Python knowledge, but more importantly, a coming-together of developers across the Python ecosystem.
I'll recap my PyCon US 2026 experience in this post, both what I contributed and what thousands of others contributed.
First, a big old disclaimer: part of my job as a developer advocate at Microsoft is to attend conferences like PyCon,
so I was able to expense my travel and spend my work days on my PyCon contributions. But that's also why I picked my job,
as it gives me the excuse to do things that I'd want to do anyway, like attending the largest gathering of Python devs in the world.
My tutorial
Since I've been spending so much time on Model Context Protocol (MCP) in the past year,
I submitted an idea to the PyCon CFP to run a tutorial walking developers through the process
of building their first MCP server. I was thrilled that the tutorial was accepted, but nervous since I'd never delivered a tutorial at a PyCon before.
Fortunately, I was able to test it out with the SF Python meetup group a few days before,
and their feedback helped me streamline the tutorial experience immensely.
I delivered the tutorial at PyCon to a packed room: 84 people, all seats filled, bright and early at 9AM on Wednesday morning, the first slot of the week-long conference.
We started off the tutorial with an icebreaker, which included attendees inventing their own meaning for "MCP".
Of course, my not-so-secret goal was for them to get to know their neighbor, to encourage pair debugging during the exercises.
I alternated between slides and exercises in the 3.5-hour tutorial,
trying to give attendees enough background knowledge while also giving them the time to get hands-on.
We started off with attendees using MCP servers, via both coding agents (Copilot/Claude Code)
and agent frameworks (Pydantic AI, LangChain, Agent-framework). Then attendees moved on to building MCP servers,
using FastMCP and the open-source Keycloak identity server.
Overall, the tutorial went really well with minimal technical issues - and hey, the WiFi even worked, which is my #1 need in a conference venue. Thank you to my colleagues Gwen and Sarah for TAing, and to all the attendees for being so eager to learn! I'll definitely submit a tutorial proposal for next year's PyCon.
Education Summit
Kelly Paredes, from the Teaching Python podcast, organizes a day-long Education Summit each year before the main PyCon begins.
The mini conference brings together educators, researchers, students, and EdTech software developers,
to talk about the intersection of Python and education.
This year, I gave two sessions, starting with a talk called
"Your slides but faster: Building an AI-powered presentation workflow".
I walked through my process of using Reveal.JS alongside GitHub Copilot to make presentations,
and shared the prompts and skills I use to collaborate with the coding agent.
My colleague Gwen gave a talk called
"Big Lessons from Small Models: Teaching Python AI with SLMs",
based on our attempts to add SLM support to every code sample we use in our livestream series.
Gwen showed Ollama setup code, gave recommendations for which SLMs to use, and highlighted a teaching angle of SLMs:
students have to get creative to work around the constraints of SLMs, and it forces them to understand SLMs more deeply.
EduSummit was filled with many other great talks. My favorite was from the always entertaining Reuven Lerner on "Vibe teaching: Python training in the age of AI", where he shared his realization that he couldn't ignore agentic coding in his Python training courses anymore, as his customers are insisting on its inclusion. He showed ways he changed his existing courses and shared prototypes for new courses. To make sure he really understands the benefits and pitfalls of agentic coding, he is vibe-coding an app that helps students practice what they've learnt and receive LLM-based feedback.
Booth
As a sponsor of PyCon, Microsoft had a booth from Thursday evening through Saturday of the event. I was at the booth for most of the hours it was open, both because we were understaffed and hey, I just like boothing! It's a great opportunity to chat with developers, and many of the folks who stopped by were attendees of my earlier sessions. We talked about things like MCP, agents, models, GitHub Copilot, and agent skills - anything that was on their mind or projecting from my laptop. It was also a chance to connect with Microsoft colleagues that I rarely see IRL, since we mostly work remotely.
Talks
I managed to see a good number of talks this year, especially from my colleagues and folks I know from the community. They were all fantastic. A few highlights:
"AI-Assisted Contributions and Maintainer Load" by Paolo Melchiorre, a prominent Django maintainer. He discussed the negative impacts of AI on OSS maintainers, highlighted different approaches that projects are taking, and encouraged every project to discuss what their own approach should be.
PyCon lets attendees propose "open spaces" based on topics we suggest, so that we can come together on topics that aren't on the schedule, typically for a group discussion. Occasionally open spaces are used for not-so-technical topics, like ice cream selfies and juggling. At Evan Kohilas's great suggestion, he and I organized an open space for improv! We gathered together in the lobby and played newbie-friendly improv games with whoever wanted to come. It was super fun - so fun that we did it again that night in the hotel lobby.
Hallway Track
We often say that the best part of the conference is the "hallway track": the spontaneous interactions that happen between the sessions in the halls.
I talked with developers I'd met at previous conferences, developers I'd only ever met online, and developers I'd never talked with before.
As a whole, the Python community is a very welcoming bunch, and everyone seemed eager to make new connections. And take new selfies!
PyLadies Auction
Every year, the PyLadies organization organizes a charity auction to help raise funds for PyLadies chapters and their annual conference. The PyCon community donates all sorts of fun items for the auction, like life-size cut-outs of Guido, Python-themed art, autographed books, and 3D-printed snakes. Last year, I won an amazing pair of homemade earrings, and I kept up the earrings tradition this year with a pair of snake earrings. The auction is quite fun, and a good excuse to donate to PyLadies while getting new jewelry out of the deal. By the end of the night, we'd collectively raised $60,000 for PyLadies!
Sprints
The last two days of the PyCon conference are dedicated to OSS sprints. This is when maintainers of Python packages sit at a table, welcome new contributors, and guide them towards their first contributions to the project. This was my first year staying for sprints, and to keep my trip shorter, I only stayed for a half day. I sat down at the Pallets table, since I've contributed to Flask-SQLAlchemy and Flask-Admin in the past, and tried to both help the newer folks and make a few fixes myself (to the click and website repos). My changes were only documentation improvements, but that's often a good place to start, as it still introduces you to the fork-PR-merge flow used by each project.
Overall impression
PyCon US 2026 was an experience. It offers so many ways to contribute and participate, and I feel like I only talked about half of it here - I left out the hotel board games, the happy hours in funny venues, the bonding over yummy noms, the random encounters on boardwalks. Thank you so much to the Python community for being so welcoming and to the PyCon organizers for a job well done!
The Model Context Protocol (MCP) gives AI agents a standard way to call external tools, but things get more complicated when those tools need to know who the user is. In this post, I’ll show how to build an MCP server with the Python FastMCP package that authenticates users with Microsoft Entra ID when they connect from a pre-authorized client such as VS Code.
If you need to build a server that works with any MCP client, read my previous blog post. With Microsoft Entra as the authorization server, supporting arbitrary clients currently requires adding an OAuth proxy in front, which increases security risk. This post focuses on the simpler pre-authorized-client path instead.
MCP auth
Let’s start by digging into the MCP auth spec, since that explains both the shape of the flow and the constraints we run into with Entra.
The MCP specification includes an authorization protocol based on OAuth 2.1, so an MCP client can send a request that includes a Bearer token from an authorization server, and the MCP server can validate that token.
In OAuth 2.1 terms, the MCP client is acting as the OAuth client, the MCP server is the resource server, the signed-in user is the resource owner, and the authorization server issues an access token. In this case, Entra will be our authorization server. We can't necessarily use any OAuth-compatible authorization server, as MCP auth requires more than just the core OAuth 2.1 functionality.
In OAuth, the authorization server needs a relationship with the client. MCP auth describes three options:
Pre-registration: the auth server has a pre-existing relationship and has the client ID in its database already
CIMD (Client ID Metadata Document): the MCP client sends the URL of its CIMD, a JSON document that describes its attributes, and the auth server bases its interactions on that information.
DCR (Dynamic Client Registration): when the auth server sees a new client, it explicitly registers it and stores the client information in its own data. DCR is now considered a "legacy" path, as the hope is for CIMD to be the supported path in the future.
For each MCP scenario - each combination of MCP server, MCP client, and authorization server - we need to determine which of those options are viable and optimal. Here's one way of thinking through it:
VS Code supports all of MCP auth, so its MCP client includes both CIMD and DCR support. However, the Microsoft Entra authorization server does not support CIMD or DCR. That leaves us with only one official option: pre-registration. If we desperately need support for arbitrary clients, it is possible to put a CIMD/DCR proxy in front of Entra, as discussed in my previous blog post, but the Entra team discourages that approach due to increased security risks.
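To make that reasoning concrete, here is a small sketch of the decision as a function. The `AuthCapabilities` class, the `choose_registration` helper, and the order of preference are my own illustrative assumptions, not part of the MCP spec or any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthCapabilities:
    """Which client-registration mechanisms a party supports (hypothetical model)."""

    cimd: bool = False
    dcr: bool = False


def choose_registration(
    client: AuthCapabilities,
    auth_server: AuthCapabilities,
    client_is_preregistered: bool,
) -> str:
    """Pick a registration option; the preference order here is an assumption."""
    if client_is_preregistered:
        return "pre-registration"
    if client.cimd and auth_server.cimd:
        return "CIMD"
    if client.dcr and auth_server.dcr:
        return "DCR (legacy)"
    return "no direct option (would need a CIMD/DCR proxy)"


# VS Code supports CIMD and DCR; Entra supports neither.
vscode = AuthCapabilities(cimd=True, dcr=True)
entra = AuthCapabilities(cimd=False, dcr=False)

print(choose_registration(vscode, entra, client_is_preregistered=True))
print(choose_registration(vscode, entra, client_is_preregistered=False))
```

With VS Code and Entra, only the pre-registered case yields a direct option, which matches the conclusion above.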
When using pre-registration, the auth flow is relatively simple (but still complex, because hey, this is OAuth!):
1. User asks to use auth-restricted MCP server
2. MCP client makes a request to MCP server without a bearer token
3. MCP server responds with an HTTP 401 and a pointer to its PRM (Protected Resource Metadata) document
4. MCP client reads PRM to discover the authorization server and options
5. MCP client redirects to authorization server, including its client ID
6. User signs into authorization server
7. Authorization server returns authorization code
8. MCP client exchanges authorization code for access token
9. Authorization server returns access token
10. MCP client re-tries original request, but now with bearer token included
11. MCP server validates bearer token and returns successfully
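For a feel of the 401 response and the PRM document in that flow, here is a rough sketch of the data involved. The field names follow RFC 9728 (OAuth 2.0 Protected Resource Metadata); the server URL, tenant ID, and `<app-id>` placeholder are made-up values, and the helper functions are hypothetical. In practice the MCP server framework generates both for you.

```python
import json

# Hypothetical values, for illustration only.
SERVER_URL = "https://mcp.example.com"
TENANT_ID = "00000000-0000-0000-0000-000000000000"
PRM_URL = f"{SERVER_URL}/.well-known/oauth-protected-resource"


def unauthorized_response() -> tuple[int, dict[str, str]]:
    """Status and headers for a request that arrived without a bearer token."""
    return 401, {"WWW-Authenticate": f'Bearer resource_metadata="{PRM_URL}"'}


def protected_resource_metadata() -> dict:
    """The PRM document that the MCP client fetches after seeing the 401."""
    return {
        "resource": SERVER_URL,
        "authorization_servers": [
            f"https://login.microsoftonline.com/{TENANT_ID}/v2.0"
        ],
        "scopes_supported": ["api://<app-id>/user_impersonation"],
        "bearer_methods_supported": ["header"],
    }


status, headers = unauthorized_response()
print(status, headers["WWW-Authenticate"])
print(json.dumps(protected_resource_metadata(), indent=2))
```

The client follows the `resource_metadata` URL, reads `authorization_servers`, and then starts the sign-in redirect against that server.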
Here's what that looks like:
Now let's dig into the code for implementing MCP auth with the pre-registered VS Code client.
Registering the MCP server with Entra
Before the server can use Entra to authorize users, we need to register the server with Entra via an app registration. We can do registration using the Azure Portal, Azure CLI, Microsoft Graph SDK, or even Bicep. In this case, I use the Python MS Graph SDK as it allows me to specify everything programmatically.
First, I create the Entra app registration, specifying the sign-in audience (single-tenant) and configuring the MCP server as a protected resource:
import uuid

# Application, ApiApplication, PermissionScope, and PreAuthorizedApplication
# are model classes from the Python MS Graph SDK; VSCODE_CLIENT_ID is defined elsewhere.
scope_id = str(uuid.uuid4())
Application(
    display_name="Entra App for MCP server",
    sign_in_audience="AzureADMyOrg",  # single-tenant
    api=ApiApplication(
        requested_access_token_version=2,
        oauth2_permission_scopes=[
            PermissionScope(
                admin_consent_description="Allows access to the MCP server as the signed-in user.",
                admin_consent_display_name="Access MCP Server",
                id=scope_id,
                is_enabled=True,
                type="User",
                user_consent_description="Allow access to the MCP server on your behalf.",
                user_consent_display_name="Access MCP Server",
                value="user_impersonation",
            )
        ],
        pre_authorized_applications=[
            PreAuthorizedApplication(
                app_id=VSCODE_CLIENT_ID,
                delegated_permission_ids=[scope_id],
            )
        ],
    ),
)
The api parameter is doing the heavy lifting, ensuring that other applications (like VS Code) can request permission to access the server on behalf of a user. Here's what each parameter does:
requested_access_token_version=2: Entra ID has two token formats (v1.0 and v2.0). We need v2.0 because that's what FastMCP's token validator expects.
oauth2_permission_scopes: This defines a permission called user_impersonation that MCP clients can request when connecting to your server. It's the server saying: "I accept tokens that let an MCP client act on behalf of a signed-in user." Without at least one scope defined, no MCP client can obtain a token for your server — Entra wouldn't know what permission to grant. The name user_impersonation is a convention (we could call it anything), but it clearly signals that the MCP client is accessing your server as the user, not as itself.
pre_authorized_applications: This list tells Entra which client applications are pre-approved to request tokens for this server’s API without showing an extra consent prompt to the user. In this case, I list VS Code’s application ID and tie it to the user_impersonation scope, so VS Code can request a token for the MCP server as the signed-in user.
Thanks to that configuration, when VS Code requests a token, it will request a token with the scope "api://{app_id}/user_impersonation", and the FastMCP server will validate that incoming tokens contain that scope.
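As a rough illustration of that scope check, here is a stdlib-only sketch. It assumes the token has already been validated and decoded into a claims dict, and that the granted scopes arrive in the space-delimited `scp` claim, as they do in Entra v2.0 access tokens. The `full_scope` and `token_has_scope` helpers and the sample values are hypothetical; FastMCP performs the real check internally.

```python
def full_scope(app_id: str, scope_name: str = "user_impersonation") -> str:
    """Build the scope string that a client requests for this server."""
    return f"api://{app_id}/{scope_name}"


def token_has_scope(claims: dict, required_scope: str) -> bool:
    """Return True if the space-delimited `scp` claim includes the scope."""
    granted = claims.get("scp", "").split()
    return required_scope in granted


# Hypothetical decoded claims from an already-validated token.
claims = {"oid": "11111111-2222-3333-4444-555555555555", "scp": "user_impersonation"}

print(full_scope("my-app-id"))
print(token_has_scope(claims, "user_impersonation"))
print(token_has_scope({"scp": "other.scope"}, "user_impersonation"))
```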
Next, I create a Service Principal for that Entra app registration, which represents the Entra app in my tenant.
I also need a way for the server to prove that it can use that Entra app registration. There are three options:
Client secret: Easiest to set up, but since it's a secret, it must be stored securely, protected carefully, and rotated regularly.
Certificate: Stronger than a client secret and generally better suited for production, but it still requires certificate storage, renewal, and lifecycle management.
Managed identity as Federated Identity Credential (MI-as-FIC): No stored secret, no certificate to manage, and usually the best choice when your app is hosted on Azure. However, it doesn't support local development.
I wanted the best of both worlds: easy local development on my machine, but the most secure production story for deployment on Azure Container Apps. So I actually created two Entra app registrations, one for local with client secret, and one for production with managed identity.
Here's how I set up the password for the local Entra app:
It's a bit trickier to set up the MI-as-FIC, since we first need to provision the managed identity and associate that with our Azure Container Apps resource. I set all of that up in Bicep, and then after provisioning completes, I run this code to configure a FIC using the managed identity:
Since I now have two Entra app registrations, I make sure that the environment variables in my local .env point to the secret-secured local Entra app registration, and the environment variables on my Azure Container App point to the FIC-secured prod Entra app registration.
Granting admin consent
This next step is only necessary if the MCP server uses the on-behalf-of (OBO) flow to exchange the incoming access token for a token to a downstream API, such as Microsoft Graph. In this case, my demo server uses OBO so it can query Microsoft Graph to check the signed-in user's group membership.
The earlier code added VS Code as a pre-authorized application, but that only allows VS Code to obtain a token for the MCP server itself; it does not grant the MCP server permission to call Microsoft Graph on the user's behalf. Because the MCP sign-in flow in VS Code does not include a separate consent step for those downstream Graph scopes, I grant admin consent up front so the OBO exchange can succeed.
This code grants the admin consent to the associated service principal for the Graph API resource and scopes:
Notice that we do not need to pass in a client secret at this point, even when using the local Entra app registration. FastMCP validates the tokens using Entra's public keys - no Entra app credentials needed.
To make it easy for our MCP tools to access an identifier for the currently logged in user, we define a middleware that inspects the claims of the current token using FastMCP's get_access_token() and sets the "oid" (Entra object identifier) in the state:
from fastmcp.server.dependencies import get_access_token
from fastmcp.server.middleware import Middleware, MiddlewareContext


class UserAuthMiddleware(Middleware):
    """Copy the signed-in user's Entra object ID into the request state."""

    def _get_user_id(self):
        # Returns None when there is no token or it carries no claims.
        token = get_access_token()
        if not (token and hasattr(token, "claims")):
            return None
        return token.claims.get("oid")

    async def on_call_tool(self, context: MiddlewareContext, call_next):
        user_id = self._get_user_id()
        if context.fastmcp_context is not None:
            await context.fastmcp_context.set_state("user_id", user_id)
        return await call_next(context)

    async def on_read_resource(self, context: MiddlewareContext, call_next):
        user_id = self._get_user_id()
        if context.fastmcp_context is not None:
            await context.fastmcp_context.set_state("user_id", user_id)
        return await call_next(context)
When we initialize the FastMCP server, we set the auth provider and include that middleware:
Now, every request made to the MCP server will require authentication. The server will return a 401 if a valid token isn't provided, and that 401 will prompt the VS Code MCP client to kick off the MCP authorization flow.
Inside each tool, we can grab the user id from the state, and use that to customize the response for the user, like to store or query items in a database.
@mcp.tool
async def add_user_expense(
    date: Annotated[date, "Date of the expense in YYYY-MM-DD format"],
    amount: Annotated[float, "Positive numeric amount of the expense"],
    description: Annotated[str, "Human-readable description of the expense"],
    ctx: Context,
):
    """Add a new expense to Cosmos DB."""
    user_id = await ctx.get_state("user_id")
    if not user_id:
        return "Error: Authentication required (no user_id present)"
    expense_item = {
        "id": str(uuid.uuid4()),
        "user_id": user_id,
        "date": date.isoformat(),
        "amount": amount,
        "description": description
    }
    await cosmos_container.create_item(body=expense_item)
Using OBO flow in FastMCP server
Remember when we granted admin consent for the Entra app registration earlier? That means we can use an OBO flow inside the MCP server, to make calls to the Graph API on behalf of the signed-in user.
To make it easier to exchange and validate tokens, we use the Python MSAL SDK and configure a ConfidentialClientApplication.
When using the local secret-secured Entra app registration, this is all we need to set it up:
Once we successfully acquire the token, we can use that token with the Graph API, for any operations permitted by the scopes in the admin consent granted earlier. For this example, we call the Graph API to check whether the logged in user is a member of a particular Entra group:
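One way such a membership check could look, sketched with only the standard library: it builds (but does not send) a request to the Microsoft Graph `checkMemberGroups` endpoint and interprets the response. Whether the demo server uses this endpoint or another one (such as `memberOf`) is an assumption on my part, and the helper names and sample values are hypothetical.

```python
import json
import urllib.request

GRAPH_URL = "https://graph.microsoft.com/v1.0/me/checkMemberGroups"


def build_membership_request(graph_token: str, group_id: str) -> urllib.request.Request:
    """Build the POST request asking Graph which of the given groups the user is in."""
    body = json.dumps({"groupIds": [group_id]}).encode("utf-8")
    return urllib.request.Request(
        GRAPH_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {graph_token}",
            "Content-Type": "application/json",
        },
    )


def is_member(response_json: dict, group_id: str) -> bool:
    """Graph returns the subset of requested group IDs the user belongs to."""
    return group_id in response_json.get("value", [])


# Hypothetical token and group ID; the request is built but not sent here.
request = build_membership_request("graph-token-from-obo", "admin-group-id")
print(request.get_method(), request.full_url)
print(is_member({"value": ["admin-group-id"]}, "admin-group-id"))
```

The token passed in is the Graph token from the OBO exchange, not the token that the MCP client originally sent.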
FastMCP 3.0 now provides a way to restrict tool visibility based on authorization checks, so I wrapped the above code in a function and set it as the auth constraint for the admin tool:
FastMCP will run that function both when an MCP client requests the list of tools, to determine which tools can be seen by the current user, and again when a user tries to use that tool, for an added just-in-time security check.
This is just one way to use an OBO flow, however. You can also use it directly inside tools, like to query for more details from the Graph API, upload documents to OneDrive/SharePoint/Notes, send emails, etc.
auth_init.py: Creates the Entra app registrations for production and local development, defines the delegated user_impersonation scope, pre-authorizes VS Code, creates the service principal, and grants admin consent for the Microsoft Graph scopes used in the OBO flow.
auth_postprovision.py: Adds the federated identity credential (FIC) after deployment so the container app's managed identity can act as the production Entra app without storing a client secret.
main.py: Implements the MCP server using FastMCP's RemoteAuthProvider and AzureJWTVerifier for direct Entra authentication, plus OBO-based Microsoft Graph calls for admin group membership checks.
As always, please let me know if you have further questions or ideas for other Entra integrations.
Acknowledgements: Thank you to Matt Gotteiner for his guidance in implementing the OBO flow and review of the blog post.
MCP servers contain tools, and each tool is described by its name, description, input parameters, and return type. When an agent is calling a tool, it formulates its call based on only that metadata; it does not know anything about the internals of a tool. For my PyAI talk last week, I investigated this hypothesis:
If we use stricter types for MCP tool schemas, then agents calling those tools will be more successful.
This was a hypothesis based on my personal experience over the last year of developing with agents and MCP servers, where I'd started with MCP servers with very minimal schemas, witnessed agents failing to call them correctly, and then iterated on the schemas to improve tool-calling success. I thought for sure that my hypothesis would be validated with flying colors. Let's see what I discovered instead...
That schema is what agents see - nothing else! The name is the function name, the description is the function docstring, and the inputSchema describes each parameter based on its type annotation, and marks all of them as required, since none of them are marked as optional.
We've done only the bare minimum for that tool schema, assigning types for each parameter. But most of those types are bare strings, so the LLM can decide what to pass into each string. As we know, LLMs can be very creative, and can vary wildly in their choices. For example, this is a word cloud of the category values across 83 tool calls:
Now let's explore different ways to enhance the generated schemas, and evaluate whether those better schemas improve agent success.
Annotating parameters with descriptions
The first step that I always recommend to developers is to annotate each parameter with a description. Any LLM that is using the tool will see the description, and will alter its behavior based on the guidance inside. (We are basically doing prompt engineering inside our function signatures!) To add a description with FastMCP, wrap the type annotation in typing.Annotated and pass in a pydantic.Field with a description. This tool definition adds a description to just the category field:
from datetime import date
from typing import Annotated

from pydantic import Field


@mcp.tool
async def add_expense_cat_b(
    expense_date: date,
    amount: float,
    category: Annotated[
        str,
        Field(
            description="Must be one of: Food & drink, Transit and Fuel, Media & streaming, Apparel and Beauty, "
            "Electronics & tech, Home and office, ..."
        ),
    ],
    description: str,
):
    ...  # tool body not shown
With that change, the generated JSON schema now includes the description:
"category": {
  "type": "string",
  "description": "Must be one of: Food & drink, Transit and Fuel, Media & streaming, Apparel and Beauty, Electronics & tech, Home and office, ..."
}
The description can be quite long - and in fact, my actual description became a lot longer to guide the LLM when faced with ambiguous cases:
Choose the closest category for the expense.
Do not ask follow-up questions just to disambiguate the category;
pick the best fit using the description and common sense.
If truly unclear, use Misc.
Heuristics: Food & drink = meals, groceries, coffee, restaurants, snacks;
Transit and Fuel = rideshare, taxi, gas, parking, public transit, tolls;
Media & streaming = movies, concerts, subscriptions, streaming, games, tickets;
Apparel and Beauty = clothing, shoes, cosmetics, haircuts, personal care;
Electronics & tech = devices, gadgets, accessories, apps, software;
Home and office = furniture, supplies, housewares, decor, cleaning;
Health & Fitness = gym, medical, wellness, supplements, pharmacy;
Arts and hobbies = crafts, sports equipment, creative supplies, lessons;
Fees & services = banking, professional services, insurance, subscriptions;
Misc = anything that does not fit well into other categories.
However, the longer the description, the higher the token cost, so you don't get a long description for free!
Constraining parameters with types
See how we're asking the LLM to constrain itself to a single option in a pre-determined list of options? In this case, we can enforce that in the schema, using enum types. With FastMCP, we can specify that in two different ways. The first option is to type the parameter as a Literal:
Fun fact: For the Enum case, FastMCP used to generate a different JSON schema that used "references", but multiple models errored when they saw that schema. FastMCP decided to simplify both cases to always output the flat enum array to reduce model errors.
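To illustrate how both forms can boil down to the same flat enum array, here is a stdlib-only sketch. The `flat_enum_schema` helper is a hand-rolled stand-in, not FastMCP's actual schema generator, and the three categories are just the first few from the list above.

```python
from enum import Enum
from typing import Literal, get_args

# Option 1: a Literal listing the allowed strings.
CategoryLiteral = Literal["Food & drink", "Transit and Fuel", "Media & streaming"]


# Option 2: an Enum whose values are the allowed strings.
class CategoryEnum(Enum):
    FOOD_AND_DRINK = "Food & drink"
    TRANSIT_AND_FUEL = "Transit and Fuel"
    MEDIA_AND_STREAMING = "Media & streaming"


def flat_enum_schema(annotation) -> dict:
    """Hand-rolled illustration of a flat JSON schema for either annotation."""
    if isinstance(annotation, type) and issubclass(annotation, Enum):
        values = [member.value for member in annotation]
    else:
        values = list(get_args(annotation))
    return {"type": "string", "enum": values}


print(flat_enum_schema(CategoryLiteral))
print(flat_enum_schema(CategoryEnum))
```

On the Python side the two options still differ: a tool typed with `Literal` receives a plain string, while one typed with an `Enum` works with enum members.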
We can combine these approaches, wrapping an Enum with a description, like so:
Then the generated schema includes both the possible values and the description with guidance on selecting them:
"category": {
  "type": "string",
  "enum": [
    "Food & drink",
    "Transit and Fuel",
    "Media & streaming", ...
  ],
  "description": "Choose the closest category. If truly unclear, use Misc. Heuristics: Food & drink = meals, coffee; Transit and Fuel = rideshare, gas, parking; ..."
}
Any constraint should beat a bare string for something as free-form as category — but which of these schemas has the greatest impact on getting the agent to pass in the right one? To find out, I set up a series of evaluations.
Setting up evaluations
In my expenses MCP server, I defined multiple tools, each with a different version of the schema:
Next, I created an agent using Pydantic AI and pointed it to my local expenses MCP server. Here's simplified code:
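from datetime import datetime

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

# azure_openai_client is an Azure OpenAI client configured elsewhere.
server = MCPServerStreamableHTTP(url="http://localhost:8000/mcp")
model = OpenAIResponsesModel(
    "gpt-4.1-mini",
    provider=OpenAIProvider(openai_client=azure_openai_client))
agent = Agent(
    model,
    system_prompt=(
        "You help users log expenses. "
        # %-d (day without zero padding) is not supported by strftime on Windows.
        f"Today's date is {datetime.now().strftime('%B %-d, %Y')}."
    ),
    output_type=str,
    toolsets=[server],
)
# Run this inside an async function.
result = await agent.run("I bought a sandwich for $12.50.")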
Now, I needed a way to vary which tool schema the agent saw. Fortunately, Pydantic AI makes it easy to filter tools on MCP servers, using code like this:
Each time the agent ran, I inspected the tool calls to verify whether it had issued a tool call at all, and whether the tool call arguments matched my desired arguments. I recorded the results in both a JSON file and more human-readable Markdown file.
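The per-case check can be sketched roughly like this. The `CaseResult` class, the `score_case` helper, and the shape of the recorded tool calls (a list of dicts with an `args` key) are hypothetical stand-ins, not the actual evaluation code.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    tool_called: bool
    args_matched: bool


def score_case(tool_calls: list[dict], expected_args: dict) -> CaseResult:
    """Compare the first recorded tool call against the expected arguments."""
    if not tool_calls:
        return CaseResult(tool_called=False, args_matched=False)
    actual_args = tool_calls[0].get("args", {})
    # Only the expected keys are compared; extra arguments are ignored.
    matched = all(actual_args.get(key) == value for key, value in expected_args.items())
    return CaseResult(tool_called=True, args_matched=matched)


expected = {"amount": 12.5, "category": "Food & drink"}
calls = [
    {
        "tool": "add_expense",
        "args": {"amount": 12.5, "category": "Food & drink", "description": "sandwich"},
    }
]

print(score_case(calls, expected))
print(score_case([], expected))
```

Keeping "was the tool called" separate from "did the arguments match" is what allows the results to report the two rates independently.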
Evaluation results: category
For the four category variants, these are the results across the 17 cases:
| | Annotated[str] | Literal | Enum | Annotated[Enum] |
| --- | --- | --- | --- | --- |
| Was tool called? | 15/17 | 16/17 | 16/17 | 17/17 |
| When called, did category match expected? | 14/15 | 13/16 | 13/16 | 15/17 |
| Schema size (avg tokens) | 374 | 412 | 424 | 836 |
There's no clear winner amongst the first three schemas. For the first schema, where we just provided a description, the agent was more likely to decide not to call the tool at all, and instead respond with a clarifying question, like "could you please provide a category?". That may be desirable for some scenarios, to encourage agents to ask users in the face of ambiguity, but if we believe that we've provided enough information in the schema for the agent to make a clear choice, then our schema has failed. For the middle two schemas, where we provided just the enum options with no description, the agent was more likely to call the tool, but it selected the wrong category more often. That makes sense, since the schema lacked the description with the additional guidance.
The final schema is the clear winner, as the agent called the tool every time and matched the desired category in the most cases (15 of 17). There is a drawback of course, and that's why I included the schema size in the table: the combination of description and enum list roughly doubled the size of the schema compared to any of the other variants. That extra cost is likely worth it, but we always need to weigh any quality improvement against the tokens it costs.
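To compare the variants on one scale, the counts from the results above can be turned into an end-to-end success rate (correct category out of all 17 cases) alongside the token cost. This is just arithmetic on the reported numbers; the `summarize` helper is my own.

```python
TOTAL_CASES = 17

# Counts copied from the category results table.
results = {
    "Annotated[str]": {"called": 15, "matched": 14, "tokens": 374},
    "Literal": {"called": 16, "matched": 13, "tokens": 412},
    "Enum": {"called": 16, "matched": 13, "tokens": 424},
    "Annotated[Enum]": {"called": 17, "matched": 15, "tokens": 836},
}


def summarize(row: dict) -> dict:
    """Compute call rate, match rate when called, and end-to-end success."""
    return {
        "call_rate": row["called"] / TOTAL_CASES,
        "match_when_called": row["matched"] / row["called"],
        "end_to_end": row["matched"] / TOTAL_CASES,
    }


for name, row in results.items():
    rates = summarize(row)
    print(
        f"{name:16} called={rates['call_rate']:.0%} "
        f"match={rates['match_when_called']:.0%} "
        f"end-to-end={rates['end_to_end']:.0%} tokens={row['tokens']}"
    )
```

Annotated[str] has the highest match rate when the tool is called (14 of 15), but once skipped calls count as failures, Annotated[Enum] comes out ahead at 15 of 17, for about twice the schema tokens of the Enum variant.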
You might be thinking, "hey, clearly stricter schemas are always better!" Alas, the story gets murkier.
Evaluation results: date
Remember that our add_expense tool also has the expense_date parameter, specified as a string in our basic schema. I wanted to make sure that those dates always came in a format that I could easily store in my database as YYYY-MM-DD, so I came up with three stricter schemas.
I started off by adding a description specifying the format:
expense_date: Annotated[
    str, "Date in YYYY-MM-DD format"
]
As a reminder, that generates this JSON schema:
"expense_date": {
  "description": "Date in YYYY-MM-DD format",
  "type": "string"
}
Then I discovered that FastMCP supports date as a type for tool parameters, so I added that variant:
Here are the evaluation results running the Pydantic AI agent with gpt-4.1-mini across the 17 cases and 4 schema variants, including the bare string:
| | str | Annotated[str] | date | Field(pattern) |
| --- | --- | --- | --- | --- |
| Was tool called? | 17/17 | 17/17 | 17/17 | 17/17 |
| Date match (of called) | 12/17 | 12/17 | 12/17 | 12/17 |
| Schema size (avg tokens) | 326 | 406 | 414 | 423 |
Do you see what I see? Every single variant had the same success rates! The agent called the tool 100% of the time, and it matched the expected date the same fraction of the time. I expected to see lower success for that first schema, but even without any description at all, the agent always used YYYY-MM-DD format to specify the date. It appears that since I named the field with "_date" and YYYY-MM-DD is the standard ISO format for dates, that's what the model suggests. I suspect that if I had tried the evaluation with an SLM or the oldest tool-calling model possible, I would have seen worse results. Our frontier models, however, do not need any additional prompting to produce a date in standard ISO format.
Of course, you likely still want to use one of these schemas to guide the agents, to be on the safe side, and they fortunately do not increase the token size significantly. Personally, I like the date option, since that plays nicely with the rest of the Python server code.
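To make the difference between the date and Field(pattern) variants concrete, here is a stdlib-only sketch of what each kind of constraint can catch. The regular expression is my assumption of a typical YYYY-MM-DD pattern, not necessarily the one used in the evaluation, and the helper names are hypothetical.

```python
import re
from datetime import date

# Assumed pattern for illustration: four digits, two digits, two digits.
ISO_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def passes_pattern(value: str) -> bool:
    """Shape-only check, similar in spirit to a pattern constraint."""
    return ISO_DATE_PATTERN.fullmatch(value) is not None


def parses_as_date(value: str) -> bool:
    """Calendar-aware check, similar in spirit to typing the parameter as date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


for candidate in ["2026-05-20", "2026-02-29", "May 20, 2026"]:
    print(candidate, passes_pattern(candidate), parses_as_date(candidate))
```

A pattern only checks the shape of the string, so an impossible date like 2026-02-29 slips through, while parsing into a date rejects it.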
You might be wondering about all the cases where the agent failed to suggest the right date. All of those failures were due to date math. For example, when the user says "Two Mondays ago I spent $8.75 on coffee.", the agent calculated the date as one Monday ago instead of two Mondays ago. If users were truly entering their data like this, then it might be a good idea to equip the server with some date calculation tools, or give the agent some guidance on when it should ask users to clarify the date.
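As a sketch of what such date calculation helpers might look like, here are two small stdlib functions. The names are hypothetical, and the reading of "two Mondays ago" as the second most recent Monday strictly before today is my assumption; users may mean something else, which is exactly why asking for clarification can be the better option.

```python
import calendar
from datetime import date, timedelta

MONDAY = 0  # date.weekday() numbering: Monday is 0, Sunday is 6


def nth_weekday_ago(today: date, weekday: int, n: int) -> date:
    """Return the n-th most recent given weekday strictly before today."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Days back to the most recent such weekday; a full week if today matches.
    days_back = (today.weekday() - weekday) % 7 or 7
    return today - timedelta(days=days_back + 7 * (n - 1))


def last_day_of_month(year: int, month: int) -> date:
    """Return the final calendar day of the given month."""
    return date(year, month, calendar.monthrange(year, month)[1])


today = date(2026, 5, 20)  # a Wednesday
print(nth_weekday_ago(today, MONDAY, 2))
print(last_day_of_month(2026, 2))
```

Exposed as MCP tools, helpers like these would let the agent delegate calendar arithmetic instead of doing it in its head.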
Cross-model evaluations
After seeing the results for an agent powered by gpt-4.1-mini, I was super curious to see what would happen with both an older model and a newer model, so I deployed a gpt-4o and a gpt-5.3-codex and ran them through the same evaluations.
For the category schema variants, the results are very interesting:
Did agent call the tool?

| Schema | gpt-4o | 4.1-mini | 5.3-codex (med) |
| --- | --- | --- | --- |
| Annotated[str] | 17/17 | 15/17 | 17/17 |
| Literal | 17/17 | 16/17 | 17/17 |
| Enum | 17/17 | 16/17 | 17/17 |
| Annotated[Enum] | 17/17 | 17/17 | 17/17 |
When called, did category match expected?

| Schema | gpt-4o | 4.1-mini | 5.3-codex (med) |
| --- | --- | --- | --- |
| Annotated[str] | 17/17 | 14/15 | 15/17 |
| Literal | 15/17 | 13/16 | 13/17 |
| Enum | 14/17 | 13/16 | 13/17 |
| Annotated[Enum] | 17/17 | 15/17 | 15/17 |
As you can see, the gpt-4o model appears to be the winner: it always calls the tool, and it matches the category correctly 100% of the time, as long as it is provided a description. The gpt-5.3-codex model also always calls the tool, but it often chooses a different category than our desired category. So, at least for this particular scenario, the gpt-4o model aligns more closely with our human decision-making process than the gpt-5.3-codex model does.
But what if the newer model is just smarter than we are? Consider this example input and category choices:
"Yesterday I spent $200 on a spa treatment." with Annotated[Enum]
gpt-4o 🤖 Health & Fitness ✅
gpt-4.1-mini 🤖 Apparel and Beauty ❌
gpt-5.3-codex 🤖 Apparel and Beauty ❌
We marked "spa treatment" as "Health & Fitness" in our data, but both newer models preferred "Apparel and Beauty". Both categories seem like reasonable options, so the model disagreement is pointing out the ambiguity in the categories of our ground truth data. If we really wanted "spa treatment" to be "Health & Fitness", then we may need to give that example in our category description. Or, we might decide to change our ground truth data entirely to agree with the newer models' category selection. That's one thing that I love about running evaluations: they force you to think more deeply about your expectations of LLMs in the face of diverse user inputs.
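To make the first option concrete, here's a sketch of a category description that spells out where spa treatments belong. The description wording and the add_expense function are hypothetical, and the enum lists only the two categories discussed here, not the full set used in the evaluation.

```python
from enum import Enum
from typing import Annotated, get_type_hints


class Category(Enum):
    # Only the two categories discussed here; a real list would have more.
    HEALTH_AND_FITNESS = "Health & Fitness"
    APPAREL_AND_BEAUTY = "Apparel and Beauty"


# Hypothetical wording that resolves the ambiguity with an explicit example.
CATEGORY_DESCRIPTION = (
    "Expense category. Use 'Health & Fitness' for wellness services such as "
    "spa treatments; use 'Apparel and Beauty' for clothing and cosmetics."
)


def add_expense(
    amount: float,
    category: Annotated[Category, CATEGORY_DESCRIPTION],
) -> str:
    """Hypothetical tool function; a real server would persist the expense."""
    return f"Logged ${amount:.2f} under {category.value}"


# The description travels with the type hint, where a framework can read it.
hints = get_type_hints(add_expense, include_extras=True)
print(hints["category"].__metadata__[0])
print(add_expense(200, Category.HEALTH_AND_FITNESS))
```

Whether a given model actually follows the hint is exactly the kind of thing to confirm by re-running the evaluation.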
For the date schema variants, we see a very different story:
Did agent call the tool?

| Schema | gpt-4o | 4.1-mini | 5.3-codex (med) |
|---|---|---|---|
| str | 17/17 | 17/17 | 17/17 |
| Annotated[str] | 17/17 | 17/17 | 17/17 |
| date | 17/17 | 17/17 | 17/17 |
| Field(pattern) | 17/17 | 17/17 | 17/17 |

When called, did date match expected?

| Schema | gpt-4o | 4.1-mini | 5.3-codex (med) |
|---|---|---|---|
| str | 15/17 | 12/17 | 17/17 |
| Annotated[str] | 15/17 | 12/17 | 17/17 |
| date | 15/17 | 12/17 | 17/17 |
| Field(pattern) | 15/17 | 12/17 | 17/17 |
The gpt-5.3-codex model is the clear winner here, as it calls the tool and selects the right date 100% of the time. Remember how gpt-4.1-mini couldn't do the date math? Apparently this newer model can! It correctly calculated "two Mondays ago", and when told "the last day of the month", it even realized that there is no February 29th in 2026. The gpt-4.1-mini model has the worst results here, and that may be due to it being a "-mini" model.
But now we're scratching our heads, because gpt-5.3-codex was not the winner across the board, despite being the much newer model. Let's dig deeper.
Impact of reasoning effort
The gpt-5.3-codex model is a reasoning model - and reasoning models can have very different outputs based on their reasoning effort level. I ran that evaluation on "medium", the default level, but for that model, the effort can be either "low", "medium", "high", or "xhigh". Here are the results:
| Metric | low | medium | high | xhigh |
|---|---|---|---|---|
| Did category match ground truth? | 100% | 88.2% | 88.2% | 88.2% |
| Did date match ground truth? | 100% | 100% | 100% | 100% |
| Schema size (average tokens) | 862 | 890 | 939 | 1114 |
| Latency (average ms) | 7,129 | 7,474 | 8,828 | 11,554 |
For the category schema variants, we see that the category match percentage drops from 100% to 88% as soon as we go from "low" to "medium" and stays there. Meanwhile, higher reasoning effort steadily increases cost and latency: by the time we reach "xhigh", the average token count has grown by about 250 tokens and the latency has increased by about 60%.
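If you want to check the overhead numbers yourself, this small sketch recomputes them from the averages in the table above. The helper name is just for illustration.

```python
# Averages copied from the reasoning-effort table above.
tokens = {"low": 862, "medium": 890, "high": 939, "xhigh": 1114}
latency_ms = {"low": 7129, "medium": 7474, "high": 8828, "xhigh": 11554}


def overhead_vs_low(values: dict[str, int]) -> dict[str, tuple[int, float]]:
    """Return (absolute increase, percent increase) relative to the 'low' level."""
    base = values["low"]
    return {
        level: (value - base, round(100 * (value - base) / base, 1))
        for level, value in values.items()
    }


for name, values in [("tokens", tokens), ("latency_ms", latency_ms)]:
    for level, (delta, percent) in overhead_vs_low(values).items():
        print(f"{name:>10} {level:>6}: +{delta} ({percent}%)")
```

On these numbers, "xhigh" adds 252 tokens and 4,425 ms over "low", with most of the jump happening between "high" and "xhigh".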
Since this is a reasoning model, we can actually dig into the reasoning traces, to give us more insight into how we might steer the model more towards our desired category.
Here's how it reasons on "low" mode:
I’m considering categories like Health & Fitness or Apparel/Beauty since it’s for a spa treatment. There’s ambiguity because spa treatments could fit into personal care or wellness, but I might lean towards Apparel and Beauty. However, Health & Fitness could work too, especially under wellness. I’ll go with one of those!
As you can see, it's not even sure which category to select at this level of reasoning - we only see its final selection in the tool call itself. Compare that to "medium" mode:
The category for this expense is a spa treatment. I'm considering whether it should go under Apparel and Beauty or Health & Fitness. Using some heuristics, personal care fits in Apparel and Beauty, while wellness aligns with Health & Fitness. Since a spa treatment feels more like a beauty or personal care choice, I think I'll choose Apparel and Beauty.
This time, it explicitly decided on the category in its reasoning, and called the tool with the selection. Once again, you might agree with the model's choice here, and change the ground truth itself.
When we are developing MCP servers, we're not necessarily in control of the models powering the agents that call those MCP servers. Ideally, we're designing tool schemas that are clear and constrained enough so that all the most popular models at all reasoning effort levels will call our tools the way we expect.
Comparing agent frameworks
We live in a world with hundreds of agent frameworks and coding agent tools. All of them share a common approach: calling tools in a loop until the user's goal is reached. Behind the scenes, agent implementation varies. Some agents attach their own system prompts to your prompt; some agents add in memory and caching; some agents have special built-in reflection and retry loops. So when it comes to calling MCP servers, how much variance might we expect to see?
For my final evaluation, I wrote an agent using the GitHub Copilot SDK, and gave it the same system prompt and MCP server connection as the Pydantic AI agent. The simplified code:
client = CopilotClient()
session = await client.create_session(SessionConfig(
    model="gpt-5.3-codex",
    mcp_servers={
        "expenses": MCPRemoteServerConfig(
            type="http",
            url="http://localhost:8000/mcp",
            tools=["add_expense_cat_e"],
        )
    },
    system_message={
        "mode": "replace",
        "content": "You help users log expenses. "
            f"Today's date is {datetime.now().strftime('%B %-d, %Y')}.",
    },
))
await session.send_and_wait({"prompt": "I bought a sandwich for $12.50."})
For the evaluation, I used the gpt-5.3-codex model on medium effort across the 4 category schemas, 4 date schemas, and 17 variants. The results:
Was tool called at all?

| Schema | Pydantic AI | Copilot SDK |
|---|---|---|
| Annotated[str] | 17/17 | 17/17 |
| Literal | 17/17 | 17/17 |
| Enum | 17/17 | 17/17 |
| Annotated[Enum] | 17/17 | 17/17 |

Did category match expected?

| Schema | Pydantic AI | Copilot SDK |
|---|---|---|
| Annotated[str] | 15/17 | 15/17 |
| Literal | 13/17 | 13/17 |
| Enum | 13/17 | 13/17 |
| Annotated[Enum] | 15/17 | 15/17 |

Did date match expected?

| Schema | Pydantic AI | Copilot SDK |
|---|---|---|
| str | 17/17 | 17/17 |
| Annotated[str] | 17/17 | 17/17 |
| date | 17/17 | 17/17 |
| Field(pattern) | 17/17 | 17/17 |
The success rates are exactly the same across both agents! Now, I will confess that in my first attempt at evaluation, the Copilot SDK agent had an off-by-one error for each date it selected, and I suspect there's a UTC date somewhere in the default system prompt. When I re-ran the evaluation at a time of day when the UTC date and the date in my timezone (PT) were the same, the dates were all correct. You learn all sorts of things when running evaluations.
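Here's a small sketch of how that kind of off-by-one can happen. It uses a fixed UTC-7 offset as a stand-in for Pacific Daylight Time, and the timestamps are arbitrary examples, not the times of my actual evaluation runs.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for Pacific Daylight Time; real code would use
# zoneinfo so that daylight saving changes are handled.
PDT = timezone(timedelta(hours=-7), name="PDT")


def dates_disagree(local_time: datetime) -> bool:
    """Return True if the local calendar date differs from the UTC calendar date."""
    return local_time.date() != local_time.astimezone(timezone.utc).date()


morning = datetime(2026, 5, 20, 10, 0, tzinfo=PDT)
evening = datetime(2026, 5, 20, 18, 0, tzinfo=PDT)
print(dates_disagree(morning))  # 10:00 PDT is 17:00 UTC on the same day
print(dates_disagree(evening))  # 18:00 PDT is 01:00 UTC on the next day
```

An evaluation run in the Pacific evening would hit the second case, so any "today" derived from UTC lands one day ahead of the expected dates.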
Takeaways
I went into this investigation certain that I would see significant improvement from agents when I used stricter types and constraints for the parameter types. I realize now that the models have improved so much and been so robustly trained for tool calling, that they often do not need the specificity of the stricter types. They mostly need clarity whenever there is ambiguity, and that can come in the form of a string description.
However, there are still other benefits to using stricter schemas, like increased type safety and validation in our MCP server codebase. Personally, I would rather use date for the date input and Enum for the category input, as those lead to cleaner code inside the tool code.
LLMs, and the agents powered by them, are both non-deterministic and not that predictable. The only way to really see how an agent will respond to your MCP server tool schemas is to set up evaluations for the scenarios that you care about. If you're new to evaluations, check out the fantastic resources from ML engineer Hamel Husain.
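To give a sense of how little code a basic evaluation needs, here's a sketch that tallies results in the same "n/total" style as the tables above. The EvalResult shape and the sample records are made up for illustration; they are not the evaluation framework from this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    expected_category: str
    called_tool: bool
    actual_category: Optional[str] = None  # None when the tool was never called


def summarize(results: list[EvalResult]) -> dict[str, str]:
    """Count tool calls over all cases, and category matches over called cases only."""
    called = [r for r in results if r.called_tool]
    matched = [r for r in called if r.actual_category == r.expected_category]
    return {
        "tool_called": f"{len(called)}/{len(results)}",
        "category_matched": f"{len(matched)}/{len(called)}",
    }


# Made-up records: one match, one mismatch, one case with no tool call.
results = [
    EvalResult("Health & Fitness", True, "Health & Fitness"),
    EvalResult("Health & Fitness", True, "Apparel and Beauty"),
    EvalResult("Apparel and Beauty", False),
]
print(summarize(results))
```

Scoring matches only over the cases where the tool was called keeps "did it call the tool" and "did it pick the right value" as separate questions.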
I learned a lot during this investigation, and hope my approach is useful to you as well. All of the code — the MCP server, schema variants, agents, and evaluation framework — is available in my GitHub repository, so feel free to explore, adapt, and run your own experiments. Please share any of your own experience with MCP tool schemas and evaluations with me. Thank you!
When I was a kid, one of my first Java applets was a UI for choosing outfits by mixing and matching different articles of clothing. Now, with the advent of agents and MCP, I realized that I could make a modern, more dynamic version: an MCP server that can find relevant clothing based off a user query, and render matching clothing as a slideshow. Let's walk through the experience and code powering it.
Searching for relevant clothing
After connecting VS Code to my closet MCP server, I ask a query like:
i am presenting at PyAI about MCP, do I have MCP themed clothing? show me the best option.
GitHub Copilot decides that it can use the closet MCP server to answer that question, and it calls the image_search tool with these arguments:
The tool call returns a mix of binary files (thumbnails for each matching article of clothing) and structured data (JSON containing the filename, display name, and description for each article).
{
  "results": [
    {
      "filename": "IMG_3234.jpg",
      "display_name": "IMG_3234.jpg",
      "description": "The image shows a black sleeveless dress hanging on a white hanger against a plain wall. The dress has a printed text on the front that reads: \"YOU DOWN WITH MCP? Yeah, you know me!\" The first line is in large white uppercase letters, and the second line is in smaller pink cursive letters. The dress has a fitted top and a flared skirt."
    },...
Here's what that looks like in the GitHub Copilot chat interface. Notice that Copilot attaches the images, so I can actually click on them to see each result directly in VS Code, as if they were a file in the workspace.
Now let's look at the code powering that tool call. I built the server using FastMCP, so I declare my tools by wrapping functions in the mcp.tool() decorator and annotating the arguments with types and helpful descriptions. Inside the function, I use Azure AI Search with hybrid retrieval on both the text query and the query's vector, against a target index that has multimodal embeddings for the images plus LLM-generated descriptions for the images. The tool returns a result that contains both the binary files and the structured content.
@mcp.tool()
async def image_search(
    query: Annotated[
        str, "Text description of images to find (e.g., 'red dress')"
    ],
    max_results: Annotated[int, "Max number of images to return (1-20)"] = 5,
) -> ToolResult:
    """
    Search for images matching a natural language query.
    Returns the image data and descriptions.
    """
    results = await search_client.search(
        search_text=query,
        top=max_results,
        vector_queries=[VectorizableTextQuery(
            k_nearest_neighbors=max_results, fields="embedding", text=query)],
        select="metadata_storage_path,verbalized_image")
    blob_service_client = get_blob_service_client()
    files: list[File] = []
    image_results: list[dict[str, str]] = []
    result_index = 0
    async for result in results:
        result_index += 1
        url = result["metadata_storage_path"]
        description = result.get("verbalized_image")
        container_name, blob_name = get_blob_reference_from_url(url)
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
        stream = await blob_client.download_blob()
        image_bytes = await stream.readall()
        image_format = get_image_format(url)
        display_name = os.path.basename(blob_name)
        file_basename = Path(display_name).stem
        thumbnail_bytes = resize_image_bytes(image_bytes, image_format)
        files.append(File(data=thumbnail_bytes, format=image_format, name=file_basename))
        image_results.append({
            "filename": blob_name,
            "display_name": display_name,
            "description": description})
    return ToolResult(
        content=files,
        structured_content={
            "query": query,
            "results": image_results})
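The snippet above calls a get_blob_reference_from_url helper that isn't shown. Here's one way such a helper might be written with the standard library. This is a guess at its behavior, assuming the common https://<account>.blob.core.windows.net/<container>/<blob path> URL layout; the actual helper in the repository may differ.

```python
from urllib.parse import unquote, urlparse


def get_blob_reference_from_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (container_name, blob_name).

    Assumes the path looks like /<container>/<blob path>.
    """
    path = unquote(urlparse(url).path).lstrip("/")
    # The first path segment is the container; everything after it is the blob name.
    container_name, _, blob_name = path.partition("/")
    if not container_name or not blob_name:
        raise ValueError(f"Not a blob URL: {url}")
    return container_name, blob_name


# "example" is a placeholder storage account name.
print(get_blob_reference_from_url(
    "https://example.blob.core.windows.net/images/IMG_3234.jpg"))
```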
Displaying selected clothing
Once the agent finds possible matching clothing, it then reasons over the results and selects the best of those results. If the agent is using a multimodal LLM, like most modern frontier models, it's able to reason over both the image content and the image descriptions. It can then render its top choices directly in the UI, using an MCP app that renders a JavaScript-powered slideshow of images.
Here's what that looks like in GitHub Copilot chat:
Let's check out the code that powers that MCP app. An app is actually a kind of tool, so we once again wrap a Python function in @mcp.tool. However, this time, we specify that it's an app by passing an AppConfig with an associated resource for the image viewer HTML. Inside that function, we fetch the images from Azure Blob Storage based on their filenames, and return both the binary data for the images and structured content that includes the filename and MIME type of each image.
@mcp.tool(
    app=AppConfig(resource_uri=IMAGE_VIEW_URI)
)
async def display_image_files(
    filenames: Annotated[list[str], "List of image filenames to retrieve"]
) -> ToolResult:
    """Fetch images by filename and render them in a carousel display."""
    blob_service_client = get_blob_service_client()
    image_blocks: list[types.ImageContent] = []
    image_results: list[dict[str, str]] = []
    for filename in filenames:
        blob_client = blob_service_client.get_blob_client(container=IMAGE_CONTAINER_NAME, blob=filename)
        stream = await blob_client.download_blob()
        image_bytes = await stream.readall()
        mime_type = get_image_mime_type(filename)
        image_blocks.append(
            types.ImageContent(
                type="image",
                data=base64.b64encode(image_bytes).decode("utf-8"),
                mimeType=mime_type))
        image_results.append({
            "filename": filename,
            "mimeType": mime_type})
    return ToolResult(
        content=image_blocks,
        structured_content={
            "images": image_results,
        })
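The get_image_mime_type helper isn't shown above. A minimal version could lean on the standard library's mimetypes module, as in this sketch. The fallback value is my own assumption, and the real helper may behave differently.

```python
import mimetypes


def get_image_mime_type(filename: str) -> str:
    """Guess an image MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    # Fall back to a generic binary type for unknown or non-image extensions.
    if mime_type is None or not mime_type.startswith("image/"):
        return "application/octet-stream"
    return mime_type


print(get_image_mime_type("IMG_3234.jpg"))
print(get_image_mime_type("dress.png"))
```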
Next we need to define the resource that serves up the image viewer HTML page. We wrap a Python function in @mcp.resource, assign it a "ui://" URI that is unique for our MCP server, and declare which domains are allowed in its Content Security Policy (CSP):
@mcp.resource(
    IMAGE_VIEW_URI,
    app=AppConfig(csp=ResourceCSP(resource_domains=["https://unpkg.com"])),
)
def image_view() -> str:
    """Render images returned by display_image_files as an MCP App."""
    return load_image_viewer_html()
Finally, we need the actual HTML that will render inside the iframed app. This tiny webpage brings in ext-apps, a JavaScript package which manages bidirectional communication with the MCP client. In our JavaScript, we declare an App instance, define the ontoolresult callback, and connect the app. That callback receives the images from the tool result and renders them inside the HTML. Note that apps also can communicate back, but that wasn't necessary for this UI.
If I want more ideas of how to put together my outfit, I can keep asking questions that will prompt the agent to call the MCP server. For example, my first follow-up question was:
great, i love the pink, matches pydantic-ai colors. can you find some pink accessories to go with it?
Then, after it suggested some nice accessories, I finished with:
sounds good. i also need a jacket to keep me warm. show me my final outfit.
To show me my final outfit, it called the display_image_files tool with only the selected articles of clothing - jacket, dress, and earrings. I can navigate through them with the arrows:
How'd the outfit work out? Pretty great!
Try it yourself!
The full MCP server code is available in the Azure-Samples/image-search-aisearch repository, along with a minimal frontend for image searching and data ingestion via an Azure AI Search indexer with Azure OpenAI LLMs (for describing the images) and Azure AI Vision (for multimodal embeddings of the images). The code can be used for any images, not just pictures of your clothing.
Here are ways you could improve it:
Use an image-generation model: visualize the head-to-toe outfit on a mannequin (instead of showing each item separately in the carousel)
Optimize token consumption: currently, since it returns each image thumbnail when searching, and images require a lot of tokens to represent them, conversations can easily exceed the context window. You could experiment with smaller images, higher compression, or other approaches.
Add user login: my MCP server is a public endpoint, but most people don't want their closet (or private images) to be public knowledge. You can add on key-based auth or OAuth using the FastMCP auth providers, as I described in the MCP auth livestream.
Have fun, and let me know if you build your own version!