Wednesday, March 11, 2026

Learnings from the PyAI conference

I recently spoke at the PyAI conference, put on by the good folks at Prefect and Pydantic, and I learnt so much from the talks. Here are my top takeaways from the sessions I watched:


AI Evals Pitfalls

Hamel Husain

  • View slides
  • Hamel cautioned against blindly using automated evaluation frameworks and built-in evaluators (like helpfulness and coherence).
  • Instead, we should adopt a data science approach to evaluation: explore the data, discover what's actually breaking, identify the most important metric, and iterate as new data comes in.
  • We shouldn't just trust an LLM-as-a-judge to give accurate scores. Instead, we should validate it like we would validate an ML classifier: with labeled data, train/dev/test splits, and precision/recall metrics. LLM judges should always give pass/fail results instead of 1-5 scores, so that there's no ambiguity in their judgment.
  • When generating synthetic data, first come up with dimensions (such as persona), generate combinations based on those dimensions, and then convert those combinations into realistic queries.
  • Hamel created evals-skills, a collection of skills for coding agents that can be run against evaluation pipelines to find issues like poorly designed LLM-judges.
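The judge-validation advice above can be sketched as a plain binary-classifier check. This is a minimal illustration, not Hamel's actual tooling; the verdicts and labels are made-up data:

```python
def precision_recall(judge_verdicts, human_labels):
    # True = "pass" verdict; compare the judge against human gold labels
    tp = sum(1 for j, h in zip(judge_verdicts, human_labels) if j and h)
    fp = sum(1 for j, h in zip(judge_verdicts, human_labels) if j and not h)
    fn = sum(1 for j, h in zip(judge_verdicts, human_labels) if not j and h)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

judge = [True, True, False, True, False]   # LLM judge's pass/fail verdicts
human = [True, False, False, True, True]   # human labels on the same outputs
p, r = precision_recall(judge, human)
```

Because the judge emits binary verdicts rather than 1-5 scores, precision and recall are well-defined and you can track them as you refine the judge prompt.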
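The dimension-based synthetic data recipe can be sketched with `itertools.product`; the dimensions and the query template below are made-up stand-ins for what an LLM would generate in practice:

```python
import itertools

dimensions = {
    "persona": ["new user", "power user"],
    "feature": ["billing", "export"],
    "tone": ["frustrated", "curious"],
}

# every combination of dimension values (2 * 2 * 2 = 8 tuples)
combos = list(itertools.product(*dimensions.values()))

# convert each combination into a query; a template stands in for the
# LLM that would write realistic phrasings in practice
queries = [
    f"As a {persona}, ask a {tone} question about {feature}."
    for persona, feature, tone in combos
]
```

Enumerating dimensions first keeps the synthetic set balanced across personas and scenarios instead of skewed toward whatever the LLM generates by default.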

Build Reasonable Software

Jeremiah Lowin (FastMCP/Prefect)

  • Write your Python programs in a way that coding agents can reason about, so that they can more easily maintain and extend them. For example, the FastMCP v2 SDK was not well designed (bad abstractions), so a new CodeMod feature required 4,000 lines of code. In the new FastMCP v3 SDK (same functional API, but different abstractions backing it), the same feature required only 500 lines of code.
  • To make Python FastMCP servers more Pythonic, Jeremiah is developing a new package for MCP apps which includes the most common UIs (forms/tables/charts), called PreFab: https://github.com/PrefectHQ/prefab

Panel: Open Source in the Age of AI

Guido van Rossum (CPython), Samuel Colvin (Pydantic), Sebastián Ramírez (FastAPI), Jeremiah Lowin (FastMCP)

  • OSS maintainers are overwhelmed by AI slop PRs. As one maintainer put it, "Don't expect someone else to be the first one to read your code". Each maintainer is coming up with their own systems/bots/heuristics to detect and triage PRs (FastMCP, for example, auto-rejects PRs that are too long!). Some maintainers are going to turn off PRs entirely, as GitHub now permits.
  • Samuel's opinion: GitHub should add a "human identity" vs "user identity", as well as a user reputation system where reputation is based on how many useful contributions you've made (or a "sloppiness" metric).

Do developer tools matter to agents?

Zanie Blue (Astral)

  • Astral is considering ways to make their tools more agent-friendly. For example, ty's error messages are currently fairly long and include ASCII arrows pointing to the code in question, and they suspect agents may not need all of that in their context.
  • Astral is also re-prioritizing based on the move toward 100% agentic coding, with less emphasis on tools that would be used solely by a developer who is manually typing. For example, they were once considering a "review" feature for stepping through each ruff suggestion one-by-one, but that seems unlikely to be used by developers these days.
  • Astral may now be able to take advantage of agents' ability to reason about whether proposed ruff fixes are safe. Currently, ruff only auto-fixes code when it knows the change can't introduce any unwanted side effects (like comment deletions), and it marks other fixes as "unsafe". Now ruff could add more unsafe fixes, knowing that an LLM could decide whether a given change is actually safe.

Context Engineering for MCP Servers

Till Döhmen (MotherDuck)

  • Till walked through the multi-step process of developing MCP servers that let developers interact with their MotherDuck databases. The server started with a single "query" tool, which was later split into multiple tools, including "list_databases" and "list_tables". They had to offer dedicated schema-exploration tools because DuckDB uses a different syntax than PostgreSQL, and the agents kept suggesting PostgreSQL syntax that didn't work.
  • They also added a tool to search the documentation (powered by the same search used by their website) and a tool that teaches the agent how to create "dives", visualizations of the database state.
  • One of their big struggles is the lack of MCP spec support across clients: the MCP spec is so rich and full of features, but only a handful of clients support those features. It's hard for them to take advantage of the new features, knowing their users may be using a client that does not support them.
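The tool split described above can be sketched with plain Python functions in a dict registry rather than a real MCP SDK; the tool names follow the talk, but the data and stubs below are hypothetical:

```python
def list_databases():
    """Databases the agent may query (stubbed data)."""
    return ["analytics", "staging"]

def list_tables(database: str):
    """Tables per database, so the agent explores schemas through a
    dedicated tool instead of guessing PostgreSQL catalog syntax."""
    return {"analytics": ["events", "users"], "staging": ["events"]}[database]

def query(sql: str):
    """Run DuckDB SQL (stubbed here)."""
    return f"ran: {sql}"

# the registry a server would expose; it began life as {"query": query} only
TOOLS = {
    "list_databases": list_databases,
    "list_tables": list_tables,
    "query": query,
}

tables = TOOLS["list_tables"]("analytics")
```

Splitting exploration out of the generic query tool means the agent gets accurate schema information up front, instead of burning turns on failed PostgreSQL-flavored introspection queries.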

Controlling the wild: from tool calling to computer use

Samuel Colvin (Pydantic)

  • Samuel built Monty to be a minimal implementation of Python for agents to use. It intentionally does not support all of the Python standard library (like sockets or file open), but it does include a way to call back to functions on the host. When using Monty, you do not need to set up a separate sandbox.
  • Monty is not designed to run full applications - it's designed to run Python code generated by agents.
  • The models vary in how successfully they use Monty in a REPL loop: Opus 4.5 works best, and Opus 4.6 works worse, presumably because the RLHF process taught 4.6 to execute code in a particular way.
  • github.com/pydantic/monty
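The host-callback idea can be illustrated with a toy sketch. This is not Monty's actual API, and bare `exec` with stripped builtins is not a real sandbox (Monty is a separate minimal interpreter); it only shows the shape of the pattern:

```python
def run_agent_code(code: str, host_functions: dict):
    """Execute agent-generated code with no builtins in scope, only the
    explicitly passed host callbacks. NOT a real sandbox: exec can still
    be escaped; this just illustrates the callback pattern."""
    env = {"__builtins__": {}}  # no open(), no __import__, etc.
    env.update(host_functions)
    exec(code, env)
    return env.get("result")

# the agent's generated code can only reach the host through the
# functions we chose to hand it
output = run_agent_code(
    "result = get_user_count() * 2",
    {"get_user_count": lambda: 21},
)
```

The appeal of this shape is that the host, not the agent, decides exactly which capabilities (database reads, API calls) the generated code can touch.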

What's new in FastAPI for AI

Sebastián Ramírez (FastAPI)

  • There's now a VS Code extension for FastAPI, built by my brilliant former colleague, Savannah Ostrowski. It makes it easy to navigate to different routes in your app, and it adds a CodeLens for navigating from pytest tests back to the route that they're testing.
  • FastAPI has built-in support for streaming JSON lines! Just yield an AsyncIterable. I plan to port my FastAPI streaming chat apps to this approach, pronto.
  • In pyproject.toml, you can now specify the FastAPI entrypoint, so that the fastapi command knows exactly where your FastAPI app is.
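The JSON-lines streaming idea boils down to an async generator that yields one JSON document per line. The sketch below shows just the generator side, with the FastAPI route wiring omitted; the event payloads are made up:

```python
import asyncio
import json

async def chat_events():
    """Yield one JSON document per line (JSON Lines framing); per the
    talk, a FastAPI route can return an AsyncIterable like this and the
    framework streams each item to the client."""
    for i, token in enumerate(["Hello", "world", "!"]):
        yield json.dumps({"index": i, "token": token}) + "\n"

async def collect():
    # stand-in for a client consuming the stream line by line
    return [line async for line in chat_events()]

lines = asyncio.run(collect())
```

JSON Lines keeps each event independently parseable, so a chat frontend can render tokens as they arrive instead of waiting for a complete JSON array.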

Context Engineering 2.0: MCP, Agentic RAG & Memory

Simba Khadder (Redis)

  • Redis is adding many features specifically to help developers who are creating apps with generative AI. For example, they've added semantic caching of queries, based on a fine-tuned BERT model, so that developers don't have to pay for an LLM call every time someone says "good morning" to a chatbot. Anyone can use semantic caching in open-source Redis by bringing their own models, but the fine-tuned model is available only on Redis Cloud.
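Conceptually, semantic caching works like the toy sketch below. This is not Redis's actual API, and the hand-picked 2-D vectors stand in for real BERT embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Serve a cached response for any query whose embedding is close
    enough to a previously answered one."""

    def __init__(self, threshold=0.9):
        self.entries = []  # (embedding, response) pairs
        self.threshold = threshold

    def get(self, embedding):
        for cached_emb, response in self.entries:
            if cosine(cached_emb, embedding) >= self.threshold:
                return response  # cache hit: skip the paid LLM call
        return None  # cache miss: caller falls through to the LLM

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "Good morning! How can I help?")
hit = cache.get([0.99, 0.05])   # near-duplicate query: served from cache
miss = cache.get([0.0, 1.0])    # unrelated query: goes to the LLM
```

The threshold is the key tuning knob: too low and unrelated queries get wrong cached answers, too high and near-duplicates still trigger paid LLM calls.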