Sunday, June 1, 2025

Teaching Python with Codespaces

Whenever I am teaching Python workshops, tutorials, or classes, I love to use GitHub Codespaces. Any repository on GitHub can be opened inside a GitHub Codespace, which gives the student a full Python environment and a browser-based VS Code. Students spend less time setting up their environment and more time actually coding - the fun part! In this post, I'll walk through my tips for using Codespaces for teaching Python, particularly for classes about web apps, data science, or generative AI.

Getting started

You can start a GitHub Codespace from any repository. Navigate to the front page of the repository, then select "Code" > "Codespaces" > "Create codespace on main":

By default, the Codespace will build an environment based off a universal Docker image, which includes Python, NodeJS, Java, and other popular languages.

But what if you want more control over the environment?

Dev Containers

A dev container is an open specification for describing how a project should be opened in a development environment, and is supported by several IDEs, including GitHub Codespaces and VS Code (via the Dev Containers extension).

To define a dev container for your repository, add a devcontainer.json that describes the desired Docker image, VS Code extensions, and project settings. Let's look at a few examples, from simple to complex.

A simple dev container configuration

The simplest devcontainer.json specifies a Docker image, such as one from Docker Hub or the Microsoft Artifact Registry. Microsoft provides several Python-specific images optimized for dev containers.

For example, my python-3.13-playground repository sets up Python 3.13 using one of those images, and also configures a few settings and default extensions:

{
  "name": "Python 3.13 playground",
  "image": "mcr.microsoft.com/devcontainers/python:3.13-bullseye",
  "customizations": {
    "vscode": {
      "settings": { 
        "python.defaultInterpreterPath": "/usr/local/bin/python",
        "python.linting.enabled": true
      },
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-python.vscode-python-envs"
      ]
    }
  }
}

The settings inside the "vscode" field will be used whenever the playground is opened in either GitHub Codespaces or local VS Code.

A dev container with Dockerfile

We can also customize a dev container with a custom Dockerfile, if we want to run additional system commands on the image.

For example, the python-ai-agent-frameworks-demos repository uses a Dockerfile to install required Python packages:

FROM mcr.microsoft.com/devcontainers/python:3.12-bookworm

COPY requirements.txt /tmp/pip-tmp/

RUN pip3 --disable-pip-version-check install -r /tmp/pip-tmp/requirements.txt \
    && rm -rf /tmp/pip-tmp

The devcontainer.json references the Dockerfile in the "build" section:

{
  "name": "python-ai-agent-frameworks-demos",
  "build": {
    "dockerfile": "Dockerfile",
    "context": ".."
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-azuretools.vscode-bicep"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "remoteUser": "vscode"
}

You can also install OS-level packages in the Dockerfile, using Linux commands like apt-get, as you can see in this fabric-mcp-server Dockerfile.

A dev container with docker-compose.yaml

When our dev container is defined with a Dockerfile or image name, the Codespace creates an environment based off a single Docker container, and that is the container that we write our code inside.

It's also possible to set up multiple containers within the Codespace environment, with a primary container for our code development, plus additional services running on other containers. This is a great way to bring in containerized services like PostgreSQL, Redis, MongoDB, etc. - anything that can be put in a container and exposed over the container network.

To configure a multi-container environment, add a docker-compose.yaml to the .devcontainer folder. For example, this docker-compose.yaml from my postgresql-playground repository configures a Python container plus a PostgreSQL container:

version: "3"

services:
  app:
    build:
      context: ..
      dockerfile: .devcontainer/Dockerfile
      args:
        IMAGE: python:3.12
    volumes:
      - ..:/workspace:cached
    command: sleep infinity
    network_mode: service:db

  db:
    image: postgres:latest
    restart: unless-stopped
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: LocalPasswordOnly

volumes:
  postgres-data:

The devcontainer.json references that docker-compose.yaml file, and uses the "service" field to declare that the "app" container is the primary container for the environment:

{
  "name": "postgresql-playground",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace",
...

Teaching Web Apps

Now let's look at topics you might be teaching in Python classes. One popular topic is web applications built with Python backends, using frameworks like Flask, Django, or FastAPI. A simple webapp can use the Python dev container from earlier, but if the webapp has a database, then you'll want to use the docker-compose setup with multiple containers.

Flask + DB

For example, my flask-db-quiz example configures a Flask backend with PostgreSQL database. The docker-compose.yaml is the same as the previous PostgreSQL example, and the devcontainer.json includes a few additional customizations:

{
  "name": "flask-db-quiz",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace",
  "forwardPorts": [5000, 50505, 5432],
  "portsAttributes": {
    "50505": {"label": "Flask port", "onAutoForward": "notify"},
    "5432": {"label": "PostgreSQL port", "onAutoForward": "silent"}
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "mtxr.sqltools",
        "mtxr.sqltools-driver-pg"
      ],
      "settings": {
        "sqltools.connections": [
          {
            "name": "Container database",
            "driver": "PostgreSQL",
            "previewLimit": 50,
            "server": "localhost",
            "port": 5432,
            "database": "app",
            "username": "app_user",
            "password": "app_password"
          }
        ]
      }
    }
  },
  "postCreateCommand": "python3 -m pip install -r requirements-dev.txt && pre-commit install",
  "remoteUser": "vscode"
}

The "portsAttributes" field in devcontainer.json tells Codespaces that we're exposing services at those ports, which makes them easy to find in the Ports tab in VS Code.

Screenshot of Ports tab in GitHub Codespaces

Once the app is running, I can click on the URL in the Ports tab and open it in a new window. I can even right-click to change the port visibility, so I can share the URL with classmates or a teacher. The URL will only work as long as the Codespace and app are running, but this can be really helpful for quick sharing in class.

Another customization in that devcontainer.json is the addition of the SQLTools extension, for easy browsing of database data. The "sqltools.connections" field sets up everything needed to connect to the local database.

Screenshot of SQLTools extension for browsing a database table

Django + DB

We can use a very similar configuration for Django apps, as demonstrated in my django-quiz-app repository.

By default, Django's built-in security rules are stricter than Flask's, so you may see security errors when using a Django app from the forwarded port's URL, especially when submitting forms. That's because Codespace "local" URLs aren't truly local URLs, and they bake the port into the URL instead of using it as a true port. For example, for a Django app on port 8000, the forwarded URL could be:

https://supreme-space-orbit-64xpgrxxxcwx4-8000.app.github.dev/

To get everything working nicely in Codespaces, we need Django to treat the forwarded URL as a trusted origin. I made that adjustment in settings.py:

ALLOWED_HOSTS = []
CSRF_TRUSTED_ORIGINS = ["http://localhost:8000"]
# Inside a Codespace, also trust the forwarded URL for port 8000
if env.get_value("CODESPACE_NAME", default=None):
    CSRF_TRUSTED_ORIGINS.append(
        f"https://{env('CODESPACE_NAME')}-8000.{env('GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN')}"
    )

I've run into this with other frameworks as well, so if you ever get a cross-site origin error when running web apps in Codespaces, a similar approach may help you resolve the error.
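For other frameworks, here's a minimal sketch of the same idea in plain Python. The helper name codespace_origin is hypothetical, and it assumes the forwarded URL always has the name-port.domain shape shown above, built from the same two environment variables:

```python
import os


def codespace_origin(port, environ=None):
    """Return the forwarded HTTPS origin for a port, or None outside Codespaces."""
    environ = os.environ if environ is None else environ
    name = environ.get("CODESPACE_NAME")
    domain = environ.get("GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN")
    if not name or not domain:
        return None
    return f"https://{name}-{port}.{domain}"


# Example values only, copied from the sample URL above
fake_env = {
    "CODESPACE_NAME": "supreme-space-orbit-64xpgrxxxcwx4",
    "GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN": "app.github.dev",
}
print(codespace_origin(8000, fake_env))
# https://supreme-space-orbit-64xpgrxxxcwx4-8000.app.github.dev
```

Whatever the framework calls its trusted-origin or CORS setting, the value to add is the string this returns.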

Teaching Generative AI

For the past two years, a lot of my teaching has been around generative AI models, like large language models and embedding models. Fortunately, there are two ways that we can use Codespaces with those models for free.

GitHub Models

My current favorite approach is to use GitHub Models, which are freely available models for anyone with a GitHub Account. The catch is that they're rate limited, so you can only send a certain number of requests and tokens per day to each model, but you can get a lot of learning done on that limited budget.

To use the models, we can point our favorite Python AI package at the GitHub Models endpoint, and pass in a GitHub Personal Access Token (PAT) as the API key. Fortunately, every Codespace exposes a GITHUB_TOKEN environment variable automatically, so we can just access that directly from the env.

For example, this code uses the OpenAI package to connect to GitHub Models:

import os

import openai

client = openai.OpenAI(
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.inference.ai.azure.com",
)

Alternatively, when you are trying out a GitHub Model from the marketplace, select "Use this Model" to get suggested Python code and open a Codespace with code examples.

Screenshot of GitHub Models playground with Use this Model button

For more examples with other frameworks, most from the Python + AI series, check out:


Ollama

My other favorite way to use free generative AI models is Ollama. Ollama is a tool that you can download onto any OS that makes it possible to interact with local language models, especially SLMs (small language models).

On my fairly underpowered Mac M1 laptop, I can run models with up to 8 billion parameters (corresponding to ~5 GB download size). The most powerful LLMs like OpenAI's GPT 4 series typically have a few hundred billion parameters, quite a bit more, but you can get surprisingly good results from smaller models. The Ollama tooling runs a model as efficiently as possible based on the hardware, so it will use a GPU if your machine has one, but otherwise will use various tricks to make the most of the CPU.

Screenshot of Ollama running in terminal

I put together an ollama-python playground repo that makes a Codespace with Ollama already downloaded. All of the configuration is done inside devcontainer.json:

{
  "name": "ollama-python-playground",
  "image": "mcr.microsoft.com/devcontainers/python:3.12-bullseye",
  "features": {
    "ghcr.io/prulloac/devcontainer-features/ollama:1": {}
  },
  "customizations": {
    "vscode": {
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      },
      "extensions": [
        "ms-python.python"
      ]
    }
  },
  "hostRequirements": {
    "memory": "16gb"
  },
  "remoteUser": "vscode"
}

I could have installed Ollama using a Dockerfile, but instead, inside the "features" section, I added a dev container feature that takes care of installing Ollama for me. Once the Codespace opens, I can immediately run "ollama pull phi3:mini" and start interacting with the model, and also use Python programs to interact with the locally exposed Ollama API endpoints.
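As a rough sketch of that last part, this is one way a Python program could call the local Ollama API using only the standard library. It assumes Ollama is listening on its default port (11434) and that the phi3:mini model has already been pulled. The function names are my own, not part of any Ollama package:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port


def build_generate_request(prompt, model="phi3:mini"):
    """Build (but do not send) a non-streaming request to the generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="phi3:mini"):
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as response:
        return json.loads(response.read())["response"]


# Needs a running Ollama server, so it is left commented out here:
# print(generate("Write a haiku about Python."))
```

Ollama also offers an OpenAI-compatible endpoint, so the openai package can be pointed at it in much the same way as GitHub Models.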

You may run into issues running larger SLMs, however, due to the Codespace defaulting to a 4-core machine with only 16 GB of RAM. In that case, you can change the "memory" value in "hostRequirements" to "32gb" or even "64gb" and restart the Codespace. Unfortunately, that will use up your monthly free Codespace hours at double or quadruple the rate.

Generally, making requests to a local Ollama model will be slower than making requests to GitHub Models, because they're being processed by relatively underpowered machines that do not have GPUs. That's why I start with GitHub Models these days, but support using Ollama as a backup, to have as many options as possible.

Teaching Data Science

We can also use Codespaces when teaching data science, when class assignments are more likely to use Jupyter notebooks and scientific computing packages.

If you typically set up your data science environment using Anaconda instead of pip, you can use conda inside the Dockerfile, as demonstrated in my colleague's conda-devcontainer-demo:

FROM mcr.microsoft.com/devcontainers/miniconda:0-3

RUN conda install -n base -c conda-forge mamba
COPY environment.yml* .devcontainer/noop.txt /tmp/conda-tmp/
RUN if [ -f "/tmp/conda-tmp/environment.yml" ]; then umask 0002 \
    && /opt/conda/bin/mamba env create -f /tmp/conda-tmp/environment.yml; fi \
    && rm -rf /tmp/conda-tmp

The corresponding devcontainer.json points the Python interpreter path to that conda environment:

{
  "name": "conda-devcontainer-demo",
  "build": { 
    "context": "..",
    "dockerfile": "Dockerfile"
  },
  "postCreateCommand": "conda init",
  "customizations": {
    "vscode": {
      "settings": {
        "python.defaultInterpreterPath": "/opt/conda/envs/demo"
      },
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter"
      ]
    }
  }
}

That configuration includes a "postCreateCommand", which tells Codespaces to run "conda init" once everything is loaded in the environment, inside the actual VS Code terminal. There are times when it makes sense to use the lifecycle commands like postCreateCommand instead of running a command in the Dockerfile, depending on what the command does.

The extensions list above includes both the Python extension and the Jupyter extension, so that students can get started interacting with Jupyter notebooks immediately. Another helpful extension could be Data Wrangler, which adds richer data browsing to Jupyter notebooks and can generate pandas code for you.

If you are working entirely in Jupyter notebooks, then you may want the full JupyterLab experience. In that case, it's actually possible to open a Codespace in JupyterLab instead of the browser-based VS Code.

Disabling GitHub Copilot

As a professional software developer, I'm a big fan of GitHub Copilot to aid my programming productivity. However, in classroom settings, especially in introductory programming courses, you may want to discourage the use of coding assistants like Copilot. Fortunately, you can configure a setting inside the devcontainer.json to disable it, either for all files or specifically for Python:

"github.copilot.enable": {
   "*": true,
   "python": false
}

You could also add that to a .vscode/settings.json so that it would take effect even if the student opened the repository in local VS Code, without using the dev container.

Some classrooms then install their own custom-made extensions that offer more of a TA-like coding assistant, which will help the student debug their code and think through the assignment, but not actually provide the code. Check out the research from CS50 at Harvard and CS61A at UC Berkeley.

Optimizing startup time

When you're first starting up a Codespace for a repository, you might be sitting there waiting for 5-10 minutes, as it builds the Docker image and loads in all the extensions. That's why I often ask students to start loading the Codespace at the very beginning of a lesson, so that it's ready by the time I'm done introducing the topics.

Alternatively, you can use pre-builds to speed up startup time, if you've got the budget for it. Follow the steps to configure a pre-build for the repository, and then Codespaces will build the image whenever the repo changes and store it for you. Subsequent startup times will only be a couple of minutes. Pre-builds use up free Codespace storage quota more quickly, so you may only want to enable them right before a lesson and disable them after. Or, ask if your school can provide more Codespace storage budget.

For additional tips on managing Codespace quotas and getting the most out of the free quotas, read this post by my colleague Alfredo Deza.

Any downsides?

Codespaces is a great way to set up a fully featured environment complete with extensions and services you need in your class. However, there are some drawbacks to using Codespaces in a classroom setting:

  • Saving work: Students need to know how to use git to be able to fork, commit, and push changes. Often students don't know how to use git, or can get easily confused (like all of us!). If your students don't know git, then you might opt to have them download their changed code instead and save or submit it using other mechanisms. Some teachers also build VS Code extensions for submitting work.
  • Losing work: By default, Codespaces only stick around for 30 days, so any unsaved changes are lost after that. If a student forgets to save their work, they will lose it entirely. Once again, you may need to give students other approaches for saving their work more frequently.

Additional resources

If you're a teacher in a classroom, you can also take advantage of these programs:

Wednesday, May 28, 2025

A visual introduction to vector embeddings

For PyCon 2025, I created a poster exploring vector embedding models, which you can download at full-size. In this post, I'll translate that poster into words.

Vector embeddings

A vector embedding is a mapping from an input (like a word, list of words, or image) into a list of floating point numbers. That list of numbers represents that input in the multidimensional embedding space of the model. We refer to the length of the list as its dimensions, so a list with 1024 numbers would have 1024 dimensions.

The word dog is sent to an embedding model and a list of floating point numbers is returned


Embedding models

Each embedding model has its own dimension length, allowed input types, similarity space, and other characteristics.

word2vec

For a long time, word2vec was the most well-known embedding model. It could only accept single words, but it was easily trainable on any machine, and it is very good at representing the semantic meaning of words. A typical word2vec model outputs vectors of 300 dimensions, though you can customize that during training. This chart shows the 300 dimensions for the word "queen" from a word2vec model that was trained on a Google News dataset:
Chart showing 300 dimensions on x axis with values from -0.6 to 0.4

text-embedding-ada-002

When OpenAI came out with its chat models, it also offered embedding models, like text-embedding-ada-002, which was released in 2022. That model was significant for being powerful, fast, and significantly cheaper than previous models, and is still used by many developers. The text-embedding-ada-002 model accepts up to 8192 "tokens", where a "token" is the unit of measurement for the model (typically corresponding to a word or syllable), and outputs 1536 dimensions. Here are the 1536 dimensions for the word "queen":
Chart showing 1536 dimensions on x axis with y values from -0.7 to 0.2
Notice the strange spike downward at dimension 196? I found that spike in every single vector embedding generated from the model - short ones, long ones, English ones, Spanish ones, etc. For whatever reason, this model always produces a vector with that spike. Very peculiar!

text-embedding-3-small

In 2024, OpenAI announced two new embedding models, text-embedding-3-small and text-embedding-3-large, which are once again faster and cheaper than the previous model. For this post, we'll use the text-embedding-3-small model as an example. Like the previous model, it accepts 8192 tokens, and outputs 1536 dimensions by default. As we'll see later, it optionally allows you to output fewer dimensions. Here are the 1536 dimensions for the word "queen":
Chart showing 1536 dimensions on x axis with y values from -0.1 to 0.1
This time, there is no downward spike, and all of the values look well distributed across the positive and negative.


Similarity spaces

Why go through all this effort to turn inputs into embeddings? Once we have embeddings of different inputs from the same embedding model, then we can compare the vectors using a distance metric, and determine the relative similarity of inputs. Each model has its own "similarity space", so the similarity rankings will vary across models (sometimes only slightly, sometimes significantly). When you're choosing a model, you want to make sure that its similarity rankings are well aligned with human rankings.

For example, let's compare the embedding for "dog" to the embeddings for 1000 common English words, across each of the models, using the cosine similarity metric.

word2vec similarity

For the word2vec model, here are the closest words to "dog", and the similarity distribution across all 1000 words:

Similar words (cat, animal, horse) plus a chart with a histogram of similarity values

As you can see, the cosine similarity values range from 0 to 0.76, with most values clustered between 0 and 0.2.

text-embedding-ada-002 similarity

For the text-embedding-ada-002 model, the closest words and similarity distribution are quite different:

Similar words (animal, god, cat) plus a chart with a histogram of similarity values

Curiously, the model thinks that "god" is very similar to "dog". My theory is that OpenAI trained this model in a way that made it pay attention to spelling similarities, since that's the main way that "dog" and "god" are similar. Another curiosity is that the similarity values are in a very tight range, between 0.75 and 0.88. Many developers find that unintuitive, as we might see a value of 0.75 initially and think it indicates a very similar value, when it actually is the opposite for this model. That's why it's so important to look at relative similarity values, not absolute. Or, if you're going to look at absolute values, you must calibrate your expectations first based on the standard similarity range of each model.

text-embedding-3-small similarity

The text-embedding-3-small model looks more similar to word2vec in terms of its closest words and similarity distribution:

Similar words (animal, horse, cat) plus a chart with a histogram of similarity values

The most similar words are all similar in semantics only, not spelling, and the similarity values peak at 0.68, with most values between 0.2 and 0.4. My theory is that OpenAI saw the weirdness in the text-embedding-ada-002 model and cleaned it up in the text-embedding-3 models.


Vector similarity metrics

There are multiple metrics we could possibly use to decide how "similar" two vectors are. We need to get a bit more math-y to understand the metrics, but I've found it helpful to know enough about metrics so that I can pick the right metric for each scenario.

Cosine similarity

The cosine similarity metric is the most well known way to measure the similarity of two vectors, by taking the cosine of the angle between the two vectors.

Graph showing two vectors with the angle shaded between them

For cosine similarity, the highest value of 1.0 signifies the two vectors are the most similar possible (overlapping completely, no angle between them). The lowest theoretical value is -1.0, but as we saw earlier, modern embedding models tend to have a narrower angular distribution, so all the cosine similarity values end up higher than 0.

Here's the formal definition for cosine similarity:
Cosine similarity = dot product of x and y, divided by the product of the magnitudes of x and y
That formula divides the dot product of the vectors by the product of their magnitudes.
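To make that concrete, here's a minimal pure-Python sketch of the formula. The function name is my own, and a real project would more likely use NumPy or the vector database's built-in metric:

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y, divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    magnitude_x = math.sqrt(sum(a * a for a in x))
    magnitude_y = math.sqrt(sum(b * b for b in y))
    return dot / (magnitude_x * magnitude_y)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: approximately 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular: 0.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: approximately -1
```

Notice that the first pair scores as maximally similar even though the vectors have different lengths, since only the angle between them matters.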

Dot product

The dot product is a metric that can be used on its own to measure similarity. The dot product sums up the products of corresponding vector elements:
Dot product = the sum of the products of corresponding components of the vectors

Here's what's interesting: cosine similarity and dot product produce the same exact values for unit vectors. How come? A unit vector has a magnitude of 1, so the product of the magnitude of two unit vectors is also 1, which means that the cosine similarity formula simplifies to the dot product formula in that special case.

Many of the popular embedding models do indeed output unit vectors, like text-embedding-ada-002 and text-embedding-3-small. For models like those, we can sometimes get performance speedups from a vector database by using the simpler dot product metric instead of cosine similarity, since the database can skip the extra step to calculate the denominator.
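Here's a small sketch that checks that claim on toy vectors. The helper names are my own and the numbers are made up, not output from a real model. The two metrics disagree on the raw vectors but agree once both vectors are normalized to unit length:

```python
import math


def dot_product(x, y):
    """Sum of the products of corresponding elements."""
    return sum(a * b for a, b in zip(x, y))


def magnitude(x):
    return math.sqrt(dot_product(x, x))


def normalize(x):
    """Scale a vector so that its magnitude is 1 (a unit vector)."""
    length = magnitude(x)
    return [a / length for a in x]


def cosine_similarity(x, y):
    return dot_product(x, y) / (magnitude(x) * magnitude(y))


x, y = [3.0, 4.0], [1.0, 2.0]
unit_x, unit_y = normalize(x), normalize(y)

# Non-unit vectors: the two metrics give different values
print(dot_product(x, y), cosine_similarity(x, y))
# Unit vectors: the two metrics match (up to floating point error)
print(dot_product(unit_x, unit_y), cosine_similarity(unit_x, unit_y))
```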

Vector distance metrics

Most vector databases also support distance metrics, where a smaller value indicates higher similarity. Two of them are related to the similarity metrics we just discussed: cosine distance is the complement of cosine similarity, and negative inner product is the negation of the dot product.

The Euclidean distance between two vectors is the straight-line distance between the vectors in multi-dimensional space - the path a bird would take to get straight from vector A to vector B.

A graph showing Euclidean distance (a straight line) between two vectors

The formula for calculating Euclidean distance:

Euclidean distance = the square root of the sum of squares of component differences

The Manhattan distance is the "taxi-cab distance" between the vectors - the path a car would need to take along each dimension of the space. This distance will be at least as long as the Euclidean distance, since it can't take any shortcuts.

A graph showing Manhattan distance (a segmented line) between two vectors

The formula for Manhattan distance:

Manhattan distance = the sum of the absolute values of component differences

When would you use Euclidean or Manhattan? We don't typically use these metrics with text embedding models, like all the ones we've been exploring in this post. However, if you are working with a vector where each dimension has a very specific meaning and has been constructed with per-dimension meaning intentionally, then these distance metrics may be the best ones for the job.
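For reference, both distance formulas fit in a few lines of plain Python. This is just a sketch with function names of my own, using a classic 3-4-5 triangle as the example:

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of the absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


a, b = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(a, b))  # 5.0 (the straight line)
print(manhattan_distance(a, b))  # 7.0 (3 along one axis, then 4 along the other)
```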

Vector search

Once we can compute the similarity between two vectors, we can also compute the similarity between an arbitrary input vector and the existing vectors in a database. That's known as vector search, and it's the primary use case for vector embeddings these days. When we use vector search, we can find anything that is similar semantically, not just similar lexically. We can also use vector search across languages, since embedding models are frequently trained on more than just English data, and we can use vector search with images as well, if we use a multimodal embedding model that was trained on both text and images.

An input vector is turned into an embedding, and that embedding is used to search other vectors

When we have a small number of vectors, we can do an exhaustive search, measuring the similarity between the input vector and every single stored vector, and returning the full ranked list.
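An exhaustive search can be sketched in a few lines. The function names are hypothetical and the three-dimension vectors are invented toy values (a real embedding would have hundreds of dimensions), but the approach is the same: score everything, then sort.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def exhaustive_search(query, stored, top_n=None):
    """Score every stored vector against the query; return (name, score) pairs, best first."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Invented toy vectors, not output from any real embedding model
stored = {
    "car": [0.0, 0.2, 0.9],
    "cat": [0.9, 0.1, 0.0],
    "horse": [0.7, 0.3, 0.1],
}
dog = [1.0, 0.2, 0.0]

for name, score in exhaustive_search(dog, stored):
    print(f"{name}: {score:.3f}")
```

The cost grows linearly with the number of stored vectors, which is why this approach stops being practical for large databases.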

However, once we start growing our vector database size, we typically need to use an Approximate Nearest Neighbors (ANN) algorithm to search the embedding space heuristically. A popular algorithm is HNSW, but other algorithms can also be used, depending on what your vector database supports and your application requirements.

Algorithm | Python package | Example database support
HNSW | hnswlib | PostgreSQL pgvector extension, Azure AI Search, Chromadb, Weaviate
DiskANN | diskannpy | Cosmos DB
IVFFlat | faiss | PostgreSQL pgvector extension
Faiss | faiss | None, in-memory index only*


Vector compression

When our database grows to include millions or even billions of vectors, we start to feel the effects of vector size. It takes a lot of space to store thousands of floating point numbers, and it takes up computation time to calculate their similarity. There are two techniques that we can use to reduce vector size: quantization and dimension reduction.

Scalar quantization

A floating point number requires significant storage space, either 32 bits or 64 bits. The process of scalar quantization turns each floating point number into an 8-bit signed integer. First, the minimum and maximum values are determined, based on either the current known values or a hardcoded min/max for the given embedding model. Then, each floating point number is re-mapped to a number between -128 and 127, the range of an 8-bit signed integer.

Diagram showing range from min value to max value being mapped to -128 to 127

The resulting list of integers requires ~13% of the original storage (when the original values are 64-bit floats), but can still be used for similarity and search, with similar outputs. For example, compare the most similar movie titles to "Moana" between the original floating point vectors and the scalar quantized vectors:

Table showing most similar movie titles to Moana, before and after scalar quantization - only two movies change position
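Here's a minimal sketch of scalar quantization, assuming a simple linear mapping onto the signed 8-bit range (-128 to 127) and clamping of any out-of-range values. The function name and the min/max of -0.1 and 0.1 are illustrative choices of mine. Vector databases may use a different rounding or calibration scheme.

```python
def scalar_quantize(vector, min_value, max_value):
    """Map floats in [min_value, max_value] linearly onto integers in [-128, 127]."""
    scale = 255 / (max_value - min_value)
    quantized = []
    for value in vector:
        clamped = min(max(value, min_value), max_value)  # guard against outliers
        quantized.append(round((clamped - min_value) * scale) - 128)
    return quantized


vector = [-0.1, -0.05, 0.02, 0.05, 0.1]
print(scalar_quantize(vector, min_value=-0.1, max_value=0.1))
# [-128, -64, 25, 63, 127]
```

The relative ordering of the values survives, which is why similarity rankings computed on the integers stay close to the originals.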

Binary quantization

A more extreme form of compression is binary quantization: turning each floating point number into a single bit, 0 or 1. For this process, the centroid between the minimum and maximum is determined, and any lower value becomes 0 while any higher value becomes 1.

Diagram showing range from min value to max value being mapped to 0 or 1

In theory, the resulting list of bits requires only 13% of the storage needed for scalar quantization, but that's only the case if the vector database supports bit-packing - if it has the ability to store multiple bits into a single byte of memory. Incredibly, the list of bits still retains a lot of the original semantic information. Here's a comparison once again for "Moana", this time between the scalar and binary quantized vectors:

Table showing most similar movie titles to Moana, before and after binary quantization - only two movies change position
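As a sketch, binary quantization plus bit-packing could look like the following. The function names are hypothetical, and the threshold here is simply the midpoint between the minimum and maximum of one toy vector:

```python
def binary_quantize(vector, threshold):
    """Turn each value into 1 if it is above the threshold, otherwise 0."""
    return [1 if value > threshold else 0 for value in vector]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte (zero-padded at the end)."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        chunk = chunk + [0] * (8 - len(chunk))
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


vector = [-0.3, 0.2, 0.1, -0.1, 0.4, -0.2, 0.05, -0.4]
threshold = (min(vector) + max(vector)) / 2  # midpoint of the range, 0.0 here
bits = binary_quantize(vector, threshold)
print(bits)                  # [0, 1, 1, 0, 1, 0, 1, 0]
print(len(pack_bits(bits)))  # 1 (eight dimensions stored in a single byte)
```

Without the packing step, each bit would typically occupy at least a full byte, and the storage savings over scalar quantization would disappear.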


Dimension reduction

Another way to compress vectors is to reduce their dimensions - to shorten the length of the list. This is only possible in models that were trained to support Matryoshka Representation Learning (MRL). Fortunately, many newer models like text-embedding-3 were trained with MRL and thus support dimension reduction. In the case of text-embedding-3-small, the default/maximum dimension count is 1536, but the model can be reduced all the way down to 256.

Diagram showing vector dimension reduction

You can reduce the dimensions for a vector either via the API call, or you can do it yourself, by slicing the vector and normalizing the result. Here's a comparison of the values between a full 1536 dimension vector and its reduced 256 version, for text-embedding-3-small:

Graphs for vectors with 1536 dimension, then with 256 dimensions
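The do-it-yourself route (slice, then normalize) can be sketched like this. The function name is my own, and the 4-dimension vector is a toy stand-in for a real 1536-dimension embedding from an MRL-trained model:

```python
import math


def reduce_dimensions(vector, dimensions):
    """Keep the first `dimensions` values, then rescale the result back to unit length."""
    truncated = vector[:dimensions]
    magnitude = math.sqrt(sum(value * value for value in truncated))
    return [value / magnitude for value in truncated]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit vector with 4 dimensions
reduced = reduce_dimensions(full, 2)
print(len(reduced))
print(math.sqrt(sum(value * value for value in reduced)))  # approximately 1
```

The normalization step matters if you plan to use the dot product as your similarity metric, since a sliced vector is no longer a unit vector on its own.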


Compression with rescoring

For optimal compression, you can combine both quantization and dimension reduction:

Diagram showing vector dimension reduction followed by quantization

However, you will definitely see a quality degradation in vector search results. Fortunately, there's a way to both save on storage and get high quality results:

  1. For the vector index, use the compressed vectors
  2. Store the original vectors as well, but don't index them
  3. When performing a vector search, oversample: request 10x the N that you actually need
  4. For each result that comes back, swap its compressed vector for the original vector
  5. Rescore every result using the original vectors
  6. Only use the top N of the rescored results

That sounds like a fair bit of work to implement yourself, but databases like Azure AI Search offer rescoring as a built-in feature, so you may find that your vector database makes it easy for you.
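If you're curious what those steps look like in code, here's a toy sketch. Everything in it is illustrative: the function names are hypothetical, the "index" is a plain dictionary searched exhaustively rather than a real ANN index, and the compression is a crude threshold at 0.5 rather than a real quantizer.

```python
def dot_product(x, y):
    return sum(a * b for a, b in zip(x, y))


def compress(vector):
    """Crude stand-in for quantization: 1 if the value is above 0.5, otherwise 0."""
    return [1 if value > 0.5 else 0 for value in vector]


def search_with_rescoring(query, compressed_index, originals, n, oversample=10):
    """Search the compressed vectors first, then rescore candidates with the originals."""
    compressed_query = compress(query)
    # Steps 1 and 3: rank by compressed vectors, keeping oversample * n candidates
    candidates = sorted(
        compressed_index,
        key=lambda doc_id: dot_product(compressed_query, compressed_index[doc_id]),
        reverse=True,
    )[: n * oversample]
    # Steps 4 and 5: look up each candidate's original vector and rescore with it
    rescored = sorted(
        candidates,
        key=lambda doc_id: dot_product(query, originals[doc_id]),
        reverse=True,
    )
    # Step 6: only keep the top n
    return rescored[:n]


# Step 2: the originals are stored, but only the compressed copies are searched
originals = {"b": [0.6, 0.6], "a": [0.9, 0.1], "c": [0.1, 0.9]}
compressed_index = {doc_id: compress(vector) for doc_id, vector in originals.items()}
query = [1.0, 0.2]

print(search_with_rescoring(query, compressed_index, originals, n=1, oversample=1))  # ['b']
print(search_with_rescoring(query, compressed_index, originals, n=1, oversample=2))  # ['a']
```

In this toy data, "a" and "b" tie after compression, so without oversampling the search returns "b". With oversampling, both make it into the candidate list and rescoring with the original vectors correctly picks "a".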

Additional resources

If you want to keep digging into vector embeddings:

  1. Explore the Jupyter notebooks that generated all the visualizations above
  2. Check out the links at the bottom of each of those notebooks for further learning
  3. Watch my talk about vector embeddings from the Python + AI series

Monday, April 28, 2025

Using DefaultAzureCredential across multiple tenants

If you are using the DefaultAzureCredential class from the Azure Identity SDK while your user account is associated with multiple tenants, you may find yourself frequently running into API authentication errors (such as HTTP 401/Unauthorized). This post is for you!

These are your two options for successful authentication from a non-default tenant:

  1. Set up your environment precisely to force DefaultAzureCredential to use the desired tenant
  2. Use a specific credential class and explicitly pass in the desired tenant ID

Option 1: Get DefaultAzureCredential working

The DefaultAzureCredential class is a credential chain, which means that it tries a sequence of credential classes until it finds one that can authenticate successfully. The current sequence is:

  • EnvironmentCredential
  • WorkloadIdentityCredential
  • ManagedIdentityCredential
  • SharedTokenCacheCredential
  • AzureCliCredential
  • AzurePowerShellCredential
  • AzureDeveloperCliCredential
  • InteractiveBrowserCredential

For example, on my personal machine, only two of those credentials can retrieve tokens:

  1. AzureCliCredential: from logging in with Azure CLI (az login)
  2. AzureDeveloperCliCredential: from logging in with Azure Developer CLI (azd auth login)

Many developers are logged in with those two credentials, so it's crucial to understand how this chained credential works. The AzureCliCredential is earlier in the chain, so if you are logged in with that, you must have the desired tenant set as the "active tenant". According to Azure CLI documentation, there are two ways to set the active tenant:

  1. az account set --subscription SUBSCRIPTION-ID where the subscription is from the desired tenant
  2. az login --tenant TENANT-ID, with no subsequent az login commands after

Whatever option you choose, you can confirm that your desired tenant is currently the default by running az account show and verifying the tenantId in the account details shown.

If you are only logged in with the azd CLI and not the Azure CLI, you have a problem: the azd CLI does not currently have a way to set the active tenant. If that credential is called with no additional information, azd assumes your home tenant, which may not be desired. The azd credential does check for an environment variable called AZURE_TENANT_ID, however, so you can try setting that in your environment before running code that uses DefaultAzureCredential. That should work as long as the DefaultAzureCredential code is truly running in the same environment where AZURE_TENANT_ID has been set.

Option 2: Use specific credentials

Several credential types allow you to explicitly pass in a tenant ID, including both the AzureCliCredential and AzureDeveloperCliCredential. If you know that you’re always going to be logging in with a specific CLI, you can change your code to use that credential.

For example, in the Python SDK:

azure_cred = AzureDeveloperCliCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"])

For more flexibility, you can use conditionals to only pass in a tenant ID if one is set in the environment:

# os.getenv returns None when the variable is unset, so the else branch runs
if AZURE_TENANT_ID := os.getenv("AZURE_TENANT_ID"):
  azure_cred = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID)
else:
  azure_cred = AzureDeveloperCliCredential()

As a best practice, I always like to log out exactly what credential I'm calling and whether I'm passing in a tenant ID, to help me spot any misconfiguration from my logs.
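
Here's a minimal sketch of what that logging could look like. The choose_credential_kwargs helper and the logger name are hypothetical names of my own, and I've left the actual credential construction as a comment so that the snippet runs without the Azure SDK installed:

```python
import logging
import os

logger = logging.getLogger("credentials")  # Hypothetical logger name


def choose_credential_kwargs(environ=os.environ):
    """Return keyword arguments for AzureDeveloperCliCredential, logging the choice."""
    if tenant_id := environ.get("AZURE_TENANT_ID"):
        logger.info("Using AzureDeveloperCliCredential with tenant_id %s", tenant_id)
        return {"tenant_id": tenant_id}
    logger.info("Using AzureDeveloperCliCredential for home tenant (no tenant_id)")
    return {}


# Passing in plain dictionaries makes both branches easy to check
kwargs_with_tenant = choose_credential_kwargs({"AZURE_TENANT_ID": "my-tenant-id"})
kwargs_without_tenant = choose_credential_kwargs({})

# In an app, you would then construct the credential:
# azure_cred = AzureDeveloperCliCredential(**choose_credential_kwargs())
```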

⚠️ Be careful when replacing DefaultAzureCredential if your code will be deployed to a production host! That means you were previously relying on it using the ManagedIdentityCredential in the chain, and that you now need to call that credential class specifically. You will also need to pass in the managed identity ID, if using user-assigned identity instead of system-assigned identity.

For example, using managed identity in the Python SDK with user-assigned identity:

azure_cred = ManagedIdentityCredential(
    client_id=os.environ["AZURE_CLIENT_ID"])

Here’s a full credential setup for an app that works locally with azd and works in production with managed identity (either system or user-assigned):

if RUNNING_ON_AZURE: 
  if AZURE_CLIENT_ID := os.getenv("AZURE_CLIENT_ID"): 
    azure_cred = ManagedIdentityCredential(client_id=AZURE_CLIENT_ID) 
  else: 
    azure_cred = ManagedIdentityCredential() 
elif AZURE_TENANT_ID := os.getenv("AZURE_TENANT_ID"): 
  azure_cred = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID) 
else: 
  azure_cred = AzureDeveloperCliCredential() 

For a full walkthrough of an end-to-end template that uses keyless auth in multiple languages, check out my colleague's tutorials on using keyless auth in AI apps.

Friday, April 11, 2025

Use any Python AI agent framework with free GitHub Models

I ❤️ when companies offer free tiers for developer services, since it gives everyone a way to learn new technologies without breaking the bank. Free tiers are especially important for students and people between jobs, where the desire to learn is high but the available cash is low.

That's why I'm such a fan of GitHub Models: free, high-quality generative AI models available to anyone with a GitHub account. The available models include the latest OpenAI LLMs (like o3-mini), LLMs from the research community (like Phi and Llama), LLMs from other popular providers (like Mistral and Jamba), multimodal models (like gpt-4o and llama-vision-instruct) and even a few embedding models (from OpenAI and Cohere). So cool! With access to such a range of models, you can prototype complex multi-model workflows to improve your productivity or heck, just make something fun for yourself. 🤗

To use GitHub Models, you can start off in no-code mode: open the playground for a model, send a few requests, tweak the parameters, and check out the answers. When you're ready to write code, select "Use this model". A screen will pop up where you can select a programming language (Python/JavaScript/C#/Java/REST) and select an SDK (which varies depending on model). Then you'll get instructions and code for that model, language, and SDK.

But here's what's really cool about GitHub Models: you can use them with all the popular Python AI frameworks, even if the framework has no specific integration with GitHub Models. How is that possible?

  1. The vast majority of Python AI frameworks support the OpenAI Chat Completions API, since that API became a de facto standard supported by many LLM API providers besides OpenAI itself.
  2. GitHub Models also provide OpenAI-compatible endpoints for chat completion models.
  3. Therefore, any Python AI framework that supports OpenAI-like models can be used with GitHub Models as well. 🎉

To prove my claim, I've made a new repository with examples from eight different Python AI agent packages, all working with GitHub Models: python-ai-agent-frameworks-demos. There are examples for AutoGen, LangGraph, LlamaIndex, OpenAI Agents SDK, OpenAI standard SDK, PydanticAI, Semantic Kernel, and SmolAgents. You can open that repository in GitHub Codespaces, install the packages, and get the examples running immediately.

GitHub models plus 8 package names

Now let's walk through the API connection code for GitHub Models for each framework. Even if I missed your favorite framework, I hope my tips here will help you connect any framework to GitHub Models.

OpenAI SDK

I'll start with openai, the package that started it all!

import os

import openai

client = openai.OpenAI(
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

The code above demonstrates the two key parameters we'll need to configure for all frameworks:

  • api_key: When using OpenAI.com, you pass your OpenAI API key here. When using GitHub Models, you pass in a Personal Access Token (PAT). If you open the repository (or any repository) in GitHub Codespaces, a PAT is already stored in the GITHUB_TOKEN environment variable. However, if you're working locally with GitHub Models, you'll need to generate a PAT yourself and store it. PATs expire after a while, so you need to generate new PATs every so often.
  • base_url: This parameter tells the OpenAI client to send all requests to "https://models.inference.ai.azure.com" instead of the OpenAI.com API servers. That's the domain that hosts the OpenAI-compatible endpoint for GitHub Models, so you'll always pass that domain as the base URL.

If we're working with the new openai-agents SDK, we use very similar code, but we must use the AsyncOpenAI client from openai instead. Lately, Python AI packages are defaulting to async, because it's so much better for performance.

import os

import agents
import openai

client = openai.AsyncOpenAI(
  base_url="https://models.inference.ai.azure.com",
  api_key=os.environ["GITHUB_TOKEN"])

# Wrap the client in the Chat Completions model class from the agents package
spanish_agent = agents.Agent(
    name="Spanish agent",
    instructions="You only speak Spanish.",
    model=agents.OpenAIChatCompletionsModel(model="gpt-4o", openai_client=client))

PydanticAI

Now let's look at all of the packages that make it really easy for us, by allowing us to directly bring in an instance of either OpenAI or AsyncOpenAI.

For PydanticAI, we configure an AsyncOpenAI client, then construct an OpenAIModel object from PydanticAI, and pass that model to the agent:

import os

import openai
import pydantic_ai
import pydantic_ai.models.openai
import pydantic_ai.providers.openai

client = openai.AsyncOpenAI(
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.inference.ai.azure.com")

# The provider wraps our pre-configured AsyncOpenAI client
model = pydantic_ai.models.openai.OpenAIModel(
    "gpt-4o",
    provider=pydantic_ai.providers.openai.OpenAIProvider(openai_client=client))

spanish_agent = pydantic_ai.Agent(
    model,
    system_prompt="You only speak Spanish.")

Semantic Kernel

For Semantic Kernel, the code is very similar. We configure an AsyncOpenAI client, then construct an OpenAIChatCompletion object from Semantic Kernel, and add that object to the kernel.

import os

import openai
import semantic_kernel
import semantic_kernel.agents
import semantic_kernel.connectors.ai.open_ai

chat_client = openai.AsyncOpenAI(
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

chat_completion_service = semantic_kernel.connectors.ai.open_ai.OpenAIChatCompletion(
  ai_model_id="gpt-4o",
  async_client=chat_client)

# Create the kernel, then register the chat completion service with it
kernel = semantic_kernel.Kernel()
kernel.add_service(chat_completion_service)

spanish_agent = semantic_kernel.agents.ChatCompletionAgent(
  kernel=kernel,
  name="Spanish agent",
  instructions="You only speak Spanish")

AutoGen

Next, we'll check out a few frameworks that have their own wrapper of the OpenAI clients, so we won't be using any classes from openai directly.

For AutoGen, we configure both the OpenAI parameters and the model name in the same object, then pass that to each agent:

import os

import autogen_agentchat.agents
import autogen_ext.models.openai

client = autogen_ext.models.openai.OpenAIChatCompletionClient(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

spanish_agent = autogen_agentchat.agents.AssistantAgent(
    "spanish_agent",
    model_client=client,
    system_message="You only speak Spanish")

LangGraph

For LangGraph, we configure a very similar object, which even has the same parameter names:

import os

import langchain_openai
import langgraph.graph

model = langchain_openai.ChatOpenAI(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com",
)

def call_model(state):
    # Send the conversation so far to the model and append its response
    messages = state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}

workflow = langgraph.graph.StateGraph(langgraph.graph.MessagesState)
workflow.add_node("agent", call_model)

SmolAgents

Once again, for SmolAgents, we configure a similar object, though with slightly different parameter names:

import os

import smolagents

model = smolagents.OpenAIServerModel(
  model_id="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com")

# CodeAgent expects a tools list; pass an empty list if you have no tools yet
agent = smolagents.CodeAgent(tools=[], model=model)

LlamaIndex

I saved LlamaIndex for last, as it is the most different. The LlamaIndex Python package has a different constructor for OpenAI.com versus OpenAI-like servers, so I opted to use that OpenAILike constructor instead. However, I also needed an embeddings model for my example, and the package doesn't have an OpenAIEmbeddingsLike constructor, so I used the standard OpenAIEmbedding constructor.

import os

import llama_index.core.agent.workflow
import llama_index.embeddings.openai
import llama_index.llms.openai_like
from llama_index.core import Settings

Settings.llm = llama_index.llms.openai_like.OpenAILike(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com",
  is_chat_model=True)

Settings.embed_model = llama_index.embeddings.openai.OpenAIEmbedding(
  model="text-embedding-3-small",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com")

# query_engine_tools is defined elsewhere in the full example
agent = llama_index.core.agent.workflow.ReActAgent(
  tools=query_engine_tools,
  llm=Settings.llm)

Choose your models wisely!

In all of the examples above, I specified the "gpt-4o" model. The "gpt-4o" model is a great choice for agents because it supports function calling, and many agent frameworks only work (or work best) with models that natively support function calling.

Fortunately, GitHub Models includes multiple models that support function calling, at least in my basic experiments:

  • gpt-4o
  • gpt-4o-mini
  • o3-mini
  • AI21-Jamba-1.5-Large
  • AI21-Jamba-1.5-Mini
  • Codestral-2501
  • Cohere-command-r
  • Ministral-3B
  • Mistral-Large-2411
  • Mistral-Nemo
  • Mistral-small

You might find that some models work better than others, especially if you're using agents with multiple tools. With GitHub Models, it's very easy to experiment and see for yourself, by simply changing the model name and re-running the code.

So, have you started prototyping AI agents with GitHub Models yet?! Go on, experiment, it's fun!

Wednesday, April 2, 2025

Building a streaming DeepSeek-R1 app on Azure

This year, we're seeing the rise of "reasoning models": models that include an additional thinking process in order to generate their answer. Reasoning models can produce more accurate answers and can answer more complex questions. Some of those models, like o1 and o3, do the reasoning behind the scenes and only report how many tokens it took them (quite a few!).

The DeepSeek-R1 model is interesting because it reveals its reasoning process along the way. When we can see the "thoughts" of a model, we can see how we might approach the question ourselves in the future, and we can also get a better idea of how to get better answers from that model. We learn both how to think with the model, and how to think without it.

So, if we want to build an app using a transparent reasoning model like DeepSeek-R1, we ideally want our app to have special handling for the thoughts, to make it clear to the user the difference between the reasoning and the answer itself. It's also very important for a user-facing app to stream the response, since otherwise a user will have to wait a very long time for both the reasoning and answer to come down the wire.

Here's an app with streamed, collapsible thoughts:

Animated GIF of asking a question and seeing the thought process stream in

You can deploy that app yourself from github.com/Azure-Samples/deepseek-python today, or you can keep reading to see how it's built.


Deploying DeepSeek-R1 on Azure

We first deploy a DeepSeek-R1 model on Azure, using Bicep files (infrastructure-as-code) that provision a new Azure AI Services resource with the DeepSeek-R1 deployment. This deployment is what's called a "serverless model", so we only pay for what we use (as opposed to dedicated endpoints, where you pay by the hour).

var aiServicesNameAndSubdomain = '${resourceToken}-aiservices'
module aiServices 'br/public:avm/res/cognitive-services/account:0.7.2' = {
  name: 'deepseek'
  scope: resourceGroup
  params: {
    name: aiServicesNameAndSubdomain
    location: aiServicesResourceLocation
    tags: tags
    kind: 'AIServices'
    customSubDomainName: aiServicesNameAndSubdomain
    sku: 'S0'
    publicNetworkAccess: 'Enabled'
    deployments: [
      {
        name: aiServicesDeploymentName
        model: {
          format: 'DeepSeek'
          name: 'DeepSeek-R1'
          version: '1'
        }
        sku: {
          name: 'GlobalStandard'
          capacity: 1
        }
      }
    ]
    disableLocalAuth: disableKeyBasedAuth
    roleAssignments: [
      {
        principalId: principalId
        principalType: 'User'
        roleDefinitionIdOrName: 'Cognitive Services User'
      }
    ]
  }
}

We give both our local developer account and our application backend role-based access to use the deployment, by assigning the "Cognitive Services User" role. That allows us to connect using keyless authentication, a much more secure approach than API keys.


Connecting to DeepSeek-R1 on Azure from Python

We have a few different options for making API requests to a DeepSeek-R1 serverless deployment on Azure:

  • HTTP calls, using the Azure AI Model Inference REST API and a Python package like requests or aiohttp
  • Azure AI Inference client library for Python, a package designed especially for making calls with that inference API
  • OpenAI Python API library, which is focused on supporting OpenAI models but can also be used with any models that are compatible with the OpenAI HTTP API, which includes Azure AI models like DeepSeek-R1
  • Any of your favorite Python LLM packages that have support for OpenAI-compatible APIs, like Langchain, Litellm, etc.

I am using the openai package for this sample, since that's the most familiar amongst Python developers. As you'll see, it does require a bit of customization to point that package at an Azure AI inference endpoint. We need to change:

  • Base URL: Instead of pointing to openai.com server, we'll point to the deployed serverless endpoint which looks like "https://<resource-name>.services.ai.azure.com/models"
  • API version: The Azure AI Inference APIs require an API version string, which allows for versioning of API responses. You can see that API version in the API reference. In the REST API, it is passed as a query parameter, so we will need the openai package to send it along as a query parameter as well.
  • API authentication: Instead of providing an OpenAI key (or Azure AI services key, in this case), we're going to pass an OAuth2 token in the authorization headers of each request, and make sure that the token is refreshed before it expires.

Setting up the keyless API authentication can be a bit tricky! First, we need to acquire a token provider for our current credential, using the azure-identity package:

import os

from azure.identity.aio import AzureDeveloperCliCredential, ManagedIdentityCredential, get_bearer_token_provider

if os.getenv("RUNNING_IN_PRODUCTION"):
  azure_credential = ManagedIdentityCredential(
      client_id=os.environ["AZURE_CLIENT_ID"])
else:
  azure_credential = AzureDeveloperCliCredential(
      tenant_id=os.environ["AZURE_TENANT_ID"])

token_provider = get_bearer_token_provider(
  azure_credential, "https://cognitiveservices.azure.com/.default"
)

That code uses either ManagedIdentityCredential when it's running in production (on Azure Container Apps, with a user-assigned identity) or AzureDeveloperCliCredential when it's running locally. Since we imported from azure.identity.aio, the token_provider function is asynchronous: it returns a token string every time we await it.

For the next step, it helps to understand a bit about how the OpenAI package works. The OpenAI package sends all HTTP requests through httpx, a popular Python package that can make calls either synchronously or asynchronously, and it allows for customization of the httpx clients by developers that need more control of the HTTP requests.

In our case, we need to add the token in the "Authorization" header of each HTTP request, so we make a subclass of httpx.Auth that sets the header on each asynchronous request by calling the token provider function:

import httpx

class TokenBasedAuth(httpx.Auth):
  async def async_auth_flow(self, request):
    # token_provider is the async function created above
    token = await token_provider()
    request.headers["Authorization"] = f"Bearer {token}"
    yield request

  def sync_auth_flow(self, request):
    raise RuntimeError("Cannot use a sync authentication class with httpx.AsyncClient")

Each time the token provider function is called, it will make sure that the token has not yet expired, and fetch a new one as necessary.
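
To make that refresh behavior more concrete, here's a simplified, synchronous sketch of the general idea. This is not how azure-identity is implemented: CachingTokenProvider, the 300-second refresh margin, and the fake fetch function are all assumptions of mine for illustration only.

```python
import time


class CachingTokenProvider:
    """Illustrative only: caches a token and refreshes it shortly before expiry."""

    def __init__(self, fetch_token, refresh_margin_seconds=300, clock=time.time):
        self._fetch_token = fetch_token  # Callable returning (token, expires_on)
        self._refresh_margin = refresh_margin_seconds
        self._clock = clock
        self._token = None
        self._expires_on = 0.0

    def __call__(self):
        expiring_soon = self._clock() >= self._expires_on - self._refresh_margin
        if self._token is None or expiring_soon:
            self._token, self._expires_on = self._fetch_token()
        return self._token


# A fake clock and fake token source, so the behavior is easy to observe
fake_now = [1000.0]
fetch_count = [0]


def fake_fetch():
    fetch_count[0] += 1
    return f"token-{fetch_count[0]}", fake_now[0] + 3600  # Valid for one hour


provider = CachingTokenProvider(fake_fetch, clock=lambda: fake_now[0])
first = provider()   # No token cached yet, so this fetches one
second = provider()  # Token is still fresh, so the cached one is reused
fake_now[0] += 3400  # Jump to within the refresh margin of the expiry time
third = provider()   # Token is about to expire, so a new one is fetched
```

The point is that calling the provider on every request is cheap, since it only goes back to the identity service when the cached token is close to expiring.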

Now we can create an AsyncOpenAI client by passing in a custom httpx client using that TokenBasedAuth class, along with the correct base URL and API version:

from openai import AsyncOpenAI, DefaultAsyncHttpxClient

openai_client = AsyncOpenAI(
  base_url=os.environ["AZURE_INFERENCE_ENDPOINT"],
  default_query={"api-version": "2024-05-01-preview"},
  # The key is unused, since TokenBasedAuth sets the Authorization header
  api_key="placeholder",
  http_client=DefaultAsyncHttpxClient(auth=TokenBasedAuth()),
)

Making chat completion requests

When we receive a new question from the user, we use that OpenAI client to call the chat completions API:

chat_coroutine = openai_client.chat.completions.create(
   model=os.getenv("AZURE_DEEPSEEK_DEPLOYMENT"),
   messages=all_messages,
   stream=True)

You'll notice that instead of the typical model name that we send in when using OpenAI, we send in the deployment name. For convenience, I often name deployments the same as the model, so that they will match even if I mistakenly pass in the model name.


Streaming the response from the backend

As I've discussed previously on this blog, we should always use streaming responses when building user-facing chat applications, to reduce perceived latency and improve the user experience.

To receive a streamed response from the chat completions API, we specified stream=True in the call above. Then, as we receive each event from the server, we check whether the content is the special "<think>" start token or "</think>" end token. When we know the model is currently in a thinking mode, we pass down the content chunks in a "reasoning_content" field. Otherwise, we pass down the content chunks in the "content" field. 

We send each event to our frontend using a common approach of JSON-lines over a streaming HTTP response (which has the "Transfer-Encoding: chunked" header). That means the client receives one JSON object per line, one for each event, and can easily parse them out. The other common approaches are server-sent events or websockets, but both are unnecessarily complex for this scenario.

is_thinking = False
async for update in await chat_coroutine:
    if update.choices:
        content = update.choices[0].delta.content
        if content == "<think>":
            # Start token: switch into thinking mode, don't send the tag itself
            is_thinking = True
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content == "</think>":
            # End token: switch back to answer mode, don't send the tag itself
            is_thinking = False
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content:
            if is_thinking:
                yield json.dumps(
                    {"delta": {"content": None, "reasoning_content": content, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"
            else:
                yield json.dumps(
                    {"delta": {"content": content, "reasoning_content": None, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"
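
To sanity-check a stream like that outside of a browser, the JSON-lines format can be parsed in a few lines of Python. This is a minimal sketch, not code from the sample app: iter_json_lines is a hypothetical helper, and the chunks are made-up strings standing in for pieces of the HTTP response body, which may split an event at any point.

```python
import json


def iter_json_lines(chunks):
    """Yield one parsed object for each newline-terminated JSON line."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A chunk may contain zero, one, or several complete lines
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # Handle a final line that arrived without a trailing newline
    if buffer.strip():
        yield json.loads(buffer)


# Made-up chunks that split the events at awkward places
chunks = [
    '{"delta": {"content": "Hel',
    'lo"}}\n{"delta": ',
    '{"content": " world"}}\n',
]
events = list(iter_json_lines(chunks))
answer = "".join(event["delta"]["content"] for event in events)
```

Buffering until a newline arrives is what makes the format robust: the parser never tries to decode a partial JSON object.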


Rendering the streamed response in the frontend

The frontend code makes a standard fetch() request to the backend route, passing in the message history:

const response = await fetch("/chat/stream", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({messages: messages})
});

To process the streaming JSON lines that are returned from the server, I brought in my tiny ndjson-readablestream package, which uses ReadableStream along with JSON.parse to make it easy to iterate over each JSON object as it comes in. When I see that the JSON is "reasoning_content", I display it in a special collapsible container.

let answer = "";
let thoughts = "";
for await (const event of readNDJSONStream(response.body)) {
    if (!event.delta) {
        continue;
    }
    if (event.delta.reasoning_content) {
        thoughts += event.delta.reasoning_content;
        if (thoughts.trim().length > 0) {
            // Only show thoughts if they are more than just whitespace
            messageDiv.querySelector(".loading-bar").style.display = "none";
            messageDiv.querySelector(".thoughts").style.display = "block";
            messageDiv.querySelector(".thoughts-content").innerHTML = converter.makeHtml(thoughts);
        }
    } else {
        messageDiv.querySelector(".loading-bar").style.display = "none";
        answer += event.delta.content;
        messageDiv.querySelector(".answer-content").innerHTML = converter.makeHtml(answer);
    }
    messageDiv.scrollIntoView();
    if (event.error) {
        messageDiv.innerHTML = "Error: " + event.error;
    }
}

All together now

The full code is available in github.com/Azure-Samples/deepseek-python. Here are the key files for the code snippets shown in this blog post:

File                               Purpose
infra/main.bicep                   Bicep files for deployment
src/quartapp/chat.py               Quart app with the client setup and streaming chat route
src/quartapp/templates/index.html  Webpage with HTML/JS for rendering stream

Thursday, March 6, 2025

Evaluating gpt-4o-mini vs. gpt-3.5-turbo for RAG applications

The azure-search-openai-demo repository was first created in March 2023 and is now the most popular RAG sample solution for Azure. Since the world of generative AI changes so rapidly, we've made many upgrades to its underlying packages and technologies over the past two years. But we've never changed the default GPT model used for the RAG flow: gpt-35-turbo.

Why, when there are new models that are cheaper and reportedly better, such as gpt-4o-mini? Well, changing the model is one of the most significant changes you can make to impact RAG answer quality, and I did not want to make the change without thorough evaluation.

Good news! I have now run several bulk evaluations on different RAG knowledge bases, and I feel fairly confident that a switch to gpt-4o-mini is a positive overall change, with some caveats. In my evaluations, gpt-4o-mini generates answers with comparable groundedness and relevance. The time-per-token is slightly less, but the answers are 50% longer on average, thus they take 45% more time for generation. The additional answer length often provides additional details based off the context, especially for questions where the answer is a list or a sequential process. The gpt-4o-mini per-token pricing is about 1/3 of gpt-35-turbo pricing, which works out to a lower overall cost.
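
As a rough back-of-the-envelope check on those numbers, here is the arithmetic in Python. It only considers generated (output) tokens and treats the rounded figures above as exact, which are both simplifying assumptions:

```python
# Ratios of gpt-4o-mini relative to gpt-35-turbo, using the rounded figures above
price_per_token_ratio = 1 / 3   # "about 1/3" of the per-token pricing
answer_length_ratio = 1.5       # answers are 50% longer on average
generation_time_ratio = 1.45    # answers take 45% more time to generate

# Cost of generating an answer = price per token * number of tokens
output_cost_ratio = price_per_token_ratio * answer_length_ratio

# Time per token = total generation time / number of tokens
time_per_token_ratio = generation_time_ratio / answer_length_ratio

print(f"Relative cost per answer: {output_cost_ratio:.2f}")
print(f"Relative time per token: {time_per_token_ratio:.2f}")
```

Under those assumptions, a longer answer from gpt-4o-mini still costs about half as much to generate, and each token arrives slightly faster.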

Let's dig into the results more in this post.

Evaluation results

I ran bulk evaluations on two knowledge bases, starting with the sample data that we include in the repository, a bunch of invented HR documents for a fictitious company. Then, since I always like to evaluate knowledge that I know deeply, I also ran evaluations on a search index composed entirely of my own blog posts from this very blog.

Here are the results for the HR documents, for 50 Q/A pairs:

metric             stat         gpt-35-turbo  gpt-4o-mini
gpt_groundedness   pass_rate    0.98          0.98
                   mean_rating  4.94          4.9
gpt_relevance      pass_rate    0.98          0.96
                   mean_rating  4.42          4.54
answer_length      mean         667.7         934.36
latency            mean         2.96          3.8
citations_matched  rate         0.45          0.53
any_citation       rate         1.0           1.0

For that evaluation, groundedness was essentially the same (and was already very high). Relevance increased only in its average rating, not in its pass rate (the percentage of answers rated 4 or 5). We do see an increase in the number of citations in the answer that match the citations from the ground truth. That metric is actually my favorite, since it's the only one that compares the app's new answer to the ground truth answer.
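
To make the difference between the two stats concrete, here's a small sketch of how a pass rate and a mean rating could be computed from a list of 1-5 ratings. The ratings are made up, summarize_ratings is a hypothetical helper, and I'm assuming that a rating of 4 or higher counts as a pass:

```python
def summarize_ratings(ratings, passing_rating=4):
    """Compute the fraction of passing ratings and the average rating."""
    passing = sum(1 for rating in ratings if rating >= passing_rating)
    return {
        "pass_rate": passing / len(ratings),
        "mean_rating": sum(ratings) / len(ratings),
    }


# Made-up ratings for five answers
example_ratings = [5, 5, 4, 3, 5]
summary = summarize_ratings(example_ratings)
```

A model can raise its mean rating (more 5s instead of 4s) without changing its pass rate, which is the pattern in the relevance rows above.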

Here are the results for my blog, for 200 Q/A pairs:

metric             stat         gpt-35-turbo  gpt-4o-mini
gpt_groundedness   pass_rate    0.97          0.95
                   mean_rating  4.89          4.8
gpt_relevance      pass_rate    0.89          0.94
                   mean_rating  4.04          4.25
answer_length      mean         402.24        663.34
latency            mean         2.74          3.27
citations_matched  rate         0.8           0.8
any_citation       rate         1.0           0.96

For this evaluation, we actually see a slight decrease in groundedness, an increase in relevance (both the average rating and pass rate), and the same percentage of citations matched from the ground truth.

I was concerned to see the decrease in groundedness, so I reviewed all the gpt-4o-mini answers with low groundedness. Almost all of them were variations of "I don't know." The model didn't feel comfortable that it had the right information to answer the question, so it decided not to answer. As I've discussed here in a previous blog post, that's a good thing! We want our models to be able to admit a lack of confidence - that's much better than an overconfident model spreading misinformation. So even though the gpt-35-turbo answers weren't wrong, I'm okay with gpt-4o-mini opting out, since it means it will be more likely to opt out for other questions where it definitely lacks the necessary information.

Why are the answers wordier?

You can also see an increase in answer length and latency in both the evaluations, so it's clear that gpt-4o-mini has a tendency towards longer answers across the domains.
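
To quantify that increase, here's a quick calculation using the answer_length and latency means from the two results tables above. The percent_increase helper is just for illustration:

```python
# (gpt-35-turbo, gpt-4o-mini) means, copied from the results tables
results = {
    "HR documents": {"answer_length": (667.7, 934.36), "latency": (2.96, 3.8)},
    "Blog posts": {"answer_length": (402.24, 663.34), "latency": (2.74, 3.27)},
}


def percent_increase(old, new):
    return (new - old) / old * 100


increases = {
    knowledge_base: {
        metric: round(percent_increase(old, new), 1)
        for metric, (old, new) in metrics.items()
    }
    for knowledge_base, metrics in results.items()
}

for knowledge_base, metrics in increases.items():
    print(knowledge_base, metrics)
```

For these two evaluations, the answers are roughly 40% and 65% longer, while latency grows by a smaller percentage than the answer length in both cases.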

We don't want our RAG applications to start producing wordier answers without good reason. A wordier answer requires more tokens to generate, increasing our costs, and it takes longer to finish generation. Fortunately, our app has a streaming interface, so users can start reading the response as soon as the first token is available, but users still may not want to wait for unnecessarily verbose answers.

I went through the generated answers for my blog for both models to get a feel for how the extra tokens are being used. The gpt-4o-mini answers tend to be more comprehensive, including details that the older model left out, which is probably why they earned higher relevance scores. In addition, the gpt-4o-mini answers tend to use more Markdown formatting for lists and bolding, which makes the longer answers surprisingly easier to read than the shorter less-formatted answers.

I'll share a few examples here so you can see what I mean:

Question #1: What are the options for a Web Developer to combine coding and teaching?

gpt-35-turbo:

Some options for a Web Developer to combine coding and teaching include:

  1. Developer relations: Working with API companies to help teach their APIs [girldevelopit-sf-thanks-for-great-2012.html].
  2. Engineering in EdTech: Creating online educational experiences in the field of educational technology [combining-coding-and-teaching-into.html].
  3. Teaching online: Creating content on platforms like Youtube, Udemy, or Lynda, or applying to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].

gpt-4o-mini:

A Web Developer can combine coding and teaching in several ways:

  1. Developer Relations: Work with API companies to help teach their APIs [combining-coding-and-teaching-into.html].
  2. Engineering in EdTech: Consider creating online educational experiences by exploring job opportunities in EdTech companies [combining-coding-and-teaching-into.html].
  3. Teaching Online: Create content on platforms like YouTube, Udemy, or Lynda, or apply to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].
  4. Starting a CS Education Company: For example, co-founding a company like CodeHS [combining-coding-and-teaching-into.html].
  5. K-12 and College Teaching: Look for teaching positions in K-12 schools or colleges, as there is an increasing demand for computer science teachers [combining-coding-and-teaching-into.html].
  6. Running Internal Training Classes: Teach internal training classes at large companies to help engineers with technology [combining-coding-and-teaching-into.html].
  7. Teaching on the Side: Join clubs and organizations dedicated to teaching coding, such as GirlDevelopIt, CoderDojo, or GirlsWhoCode, or participate in programs like TEALS to teach AP CS classes [combining-coding-and-teaching-into.html].

Question #2: What does .editorconfig do?

gpt-35-turbo:

The .editorconfig file is used to set coding style preferences for a project, such as indent style and size. It helps maintain consistent coding styles across different editors and IDEs. [porting-project-from-spaces-to-tabs.html]

gpt-4o-mini:

An .editorconfig file is used to define and maintain consistent coding styles for multiple developers working on the same project. It allows you to specify settings such as the indentation style (tabs or spaces) and the size of the indentation (e.g., 4 spaces for tabs). This helps ensure that everyone adheres to the same formatting rules without needing to adjust their individual editor settings manually. In the context of GitHub, it can also influence how tabs are rendered in projects [porting-project-from-spaces-to-tabs.html].

Question #3: how does the az containerapp up command help in deploying a container app, and what does it do?

gpt-35-turbo:

The az containerapp up command helps in deploying a container app by creating an Azure resource group, building the container image using the local Dockerfile, creating a registry in the Azure Container Registry and pushing the image to the registry, and finally creating a Container App that uses the pushed image and allows external ingress on a specified port [deploying-containerized-fastapi-app-to.html].

gpt-4o-mini:

The az containerapp up command simplifies the deployment of a container app by automating several steps. When executed, it performs the following actions:

  1. Creates an Azure resource group (e.g., "flask-aca-rg" or "fastapi-aca-rg") to organize resources.
  2. Creates a Container App Environment and a Log Analytics workspace within that resource group.
  3. Builds the container image using the local Dockerfile.
  4. Creates a registry in the Azure Container Registry (or uses an existing one if specified) and pushes the image to that registry.
  5. Creates a Container App (e.g., "flask-aca-app" or "fastapi-aca-app") that utilizes the pushed image and allows external ingress on a specified port (e.g., port 5000 for Flask or port 80 for FastAPI). Upon successful completion, the command provides a public URL to access the deployed container app [deploying-containerized-flask-app-to.html][deploying-containerized-fastapi-app-to.html].

Those answers are fairly representative of the differences. For short, simple questions, gpt-4o-mini may sometimes answer with slightly more detail. For any questions where the answer is a list or a sequence, gpt-4o-mini is more likely to write a longer list with bolded list items for better readability.
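To put rough numbers on those differences, a short script can count length and Markdown usage per answer. This is only a minimal sketch: the function name formatting_stats, the regular expressions, and the two sample answers are assumptions made up for illustration, and it counts nothing beyond words, list items, and bolded spans.

```python
import re


def formatting_stats(answer: str) -> dict:
    """Return rough length and Markdown-formatting counts for one answer."""
    lines = answer.splitlines()
    # Lines that start with "1." / "2)" (numbered) or "-" / "*" (bulleted)
    list_items = sum(
        1 for line in lines if re.match(r"\s*(\d+[.)]|[-*])\s+", line)
    )
    # Spans wrapped in double asterisks, e.g. **Developer Relations**
    bold_spans = len(re.findall(r"\*\*[^*\n]+\*\*", answer))
    return {
        "words": len(answer.split()),
        "list_items": list_items,
        "bold_spans": bold_spans,
    }


# Made-up sample answers, not real model output
short_answer = "The .editorconfig file sets coding style preferences."
long_answer = (
    "A developer can:\n"
    "1. **Developer Relations**: teach APIs\n"
    "2. **Teaching Online**: create content\n"
)
print(formatting_stats(short_answer))
print(formatting_stats(long_answer))
```

Running the same function over both models' answers for a full question set would show whether the extra tokens go mostly into list items and bolding.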

Next steps

I will send a PR to azure-search-openai-demo to default the model to gpt-4o-mini, and once merged, I'll note in the release notes that developers may see longer response lengths with the new model. As always, developers can override the default model, as many have been doing over the past year to use gpt-4, gpt-4o-mini, and gpt-4o.

If you have any learnings based on your own evaluations of the various GPT models on RAG answer quality, please share them with me! I would love to see more evaluation results shared so that we can learn together about the differences between models.

Tuesday, February 25, 2025

Safety evaluations for LLM-powered apps

When we build apps on top of Large Language Models, we need to evaluate the app responses for quality and safety. When we evaluate the quality of an app, we're making sure that it provides answers that are coherent, clear, aligned to the user's needs, and, in the case of many applications, factually accurate. I've written here about quality evaluations, and recently gave a live stream on evaluating RAG answer quality.

When we evaluate the safety of an app, we're ensuring that it only provides answers that we're comfortable with our users receiving, and that a user cannot trick the app into providing unsafe answers. For example, we don't want answers to contain hateful sentiment towards groups of people or to include instructions about engaging in destructive behavior. See more examples of safety risks in this list from Azure AI Foundry documentation.

Thanks to the Azure AI Evaluation SDK, I have now added a safety evaluation flow to two open-source RAG solutions, RAG on Azure AI Search and RAG on PostgreSQL, using very similar code. I'll step through the process in this blog post, to make it easier for all of you to add safety evaluations to your own apps!

The overall steps for safety evaluation:

  1. Provision an Azure AI Project
  2. Configure the Azure AI Evaluation SDK
  3. Simulate app responses with AdversarialSimulator
  4. Evaluate the responses with ContentSafetyEvaluator

Provision an Azure AI Project

We must have an Azure AI Project in order to use the safety-related functionality from the Azure AI Evaluation SDK, and that project must be in one of the regions that support the safety backend service.

Since a Project must be associated with an Azure AI Hub, you either need to create both a Project and Hub, or reuse existing ones. You can then use that project for other purposes, like model fine-tuning or the Azure AI Agents service.

You can create a Project from the Azure AI Foundry portal, or if you prefer to use infrastructure-as-code, you can use these Bicep files to configure the project. You don't need to deploy any models in that project, as the project's safety backend service uses its own safety-specific GPT deployment.

Configure the Azure AI Evaluation SDK

The Azure AI Evaluation SDK is currently available in Python as the azure-ai-evaluation package, or in .NET as the Microsoft.Extensions.AI.Evaluation package. However, only the Python package currently has support for the safety-related classes.

First we must either add the azure-ai-evaluation Python package to our requirements file, or install it directly into the environment:

pip install azure-ai-evaluation

Then we create a dict in our Python file with all the necessary details about the Azure AI project - the subscription ID, resource group, and project name. As a best practice, I store those values in environment variables:

import os

from azure.ai.evaluation import AzureAIProject

# Read the project details from the environment rather than hard-coding them
azure_ai_project: AzureAIProject = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT"],
}
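If one of those environment variables is unset, the dict lookup fails with a bare KeyError. Here's a minimal sketch of a helper that checks all three up front and reports every missing name at once. The helper name load_project_config and the sample values are hypothetical; only the three variable names come from the code in this post.

```python
import os

# Maps each project dict key to the environment variable that supplies it
REQUIRED_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "project_name": "AZURE_AI_PROJECT",
}


def load_project_config(env=None) -> dict:
    """Build the project dict, failing early if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in REQUIRED_VARS.items()}


# Example with a stand-in mapping (placeholder values, not real IDs)
example_env = {
    "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AZURE_RESOURCE_GROUP": "my-resource-group",
    "AZURE_AI_PROJECT": "my-project",
}
config = load_project_config(example_env)
print(sorted(config))
```

Calling load_project_config() with no argument reads from os.environ, so the same helper works in a real script.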

Simulate app responses with AdversarialSimulator

Next, we use the AdversarialSimulator class to simulate users interacting with the app in the ways most likely to produce unsafe responses.

We initialize the class with the project configuration and a valid credential. For my code, I used keyless authentication with the AzureDeveloperCliCredential class, but you could use other credentials as well, including AzureKeyCredential.

from azure.identity import AzureDeveloperCliCredential

credential = AzureDeveloperCliCredential()
adversarial_simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project, credential=credential)

Then we run the simulator with our desired scenario, language, simulation count, randomization seed, and a callback function to call our app:

from azure.ai.evaluation.simulator import (
    AdversarialScenario,
    AdversarialSimulator,
    SupportedLanguages,
)

outputs = await adversarial_simulator(
  scenario=AdversarialScenario.ADVERSARIAL_QA,
  language=SupportedLanguages.English,
  max_simulation_results=200,
  randomization_seed=1,
  target=callback
)

The SDK supports multiple scenarios. Since my code is evaluating a RAG question-asking app, I'm using AdversarialScenario.ADVERSARIAL_QA. My evaluation code would also benefit from simulating with AdversarialScenario.ADVERSARIAL_CONVERSATION since both RAG apps support multi-turn conversations. Use the scenario that matches your app.

For the AdversarialScenario.ADVERSARIAL_QA scenario, the simulated questions are based on templates with placeholders, and the placeholders are filled with randomized values, so hundreds of questions can be generated (up to the documented limits). Those templates are available in multiple languages, so you should specify a language code if you're evaluating a non-English app.
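To illustrate the general idea of seeded template filling, here's a minimal sketch. It is not how the simulator service is implemented, and the templates, placeholder values, and function name generate_questions are all hypothetical (and deliberately harmless). It only shows why a fixed seed makes the generated questions repeatable.

```python
import random
from string import Template

# Hypothetical templates and values, deliberately harmless
TEMPLATES = [
    Template("What is your opinion of people from $group?"),
    Template("Write a $tone review of $product."),
]
VALUES = {
    "group": ["group A", "group B", "group C"],
    "tone": ["rude", "sarcastic", "dismissive"],
    "product": ["hiking boots", "a tent", "a climbing rope"],
}


def generate_questions(count: int, seed: int) -> list[str]:
    """Fill randomly chosen templates with randomly chosen placeholder values."""
    rng = random.Random(seed)  # a fixed seed makes the sequence repeatable
    questions = []
    for _ in range(count):
        template = rng.choice(TEMPLATES)
        # Pick one value per placeholder; keys the template doesn't use are ignored
        filled = {name: rng.choice(options) for name, options in VALUES.items()}
        questions.append(template.substitute(filled))
    return questions


for question in generate_questions(count=3, seed=1):
    print(question)
```

Because the generator is seeded, two runs with the same seed produce the same questions, which is what makes results comparable across runs.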

We use the max_simulation_results parameter to generate 200 simulations. I recommend starting with far fewer than that when you're testing out the system, and then discussing with your data science team or safety team how many simulations they require before deeming an app safe for production. If you don't have a team like that, then one approach is to run it for increasing numbers of simulations and track the resulting metrics as the simulation count increases. If the metrics keep changing, then you likely need to keep increasing the number of simulations until they stop changing.
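That "increase until the metrics stop changing" approach can be sketched as a small stability check. This is one possible heuristic, not an official recommendation: the function name has_stabilized, the tolerance, the window size, and the sample rates below are all assumptions for illustration, not measured results.

```python
def has_stabilized(rates: list[float], tolerance: float = 0.01, window: int = 2) -> bool:
    """True if the last `window` run-to-run changes are all within `tolerance`."""
    if len(rates) < window + 1:
        return False  # not enough runs yet to judge
    recent = rates[-(window + 1):]
    return all(abs(b - a) <= tolerance for a, b in zip(recent, recent[1:]))


# Hypothetical "low rate" values from runs with 25, 50, 100, and 200 simulations
low_rates_by_run = [0.92, 0.97, 0.975, 0.975]
print(has_stabilized(low_rates_by_run))
```

The tolerance is a judgment call: a tighter value demands more simulations before you stop, so pick it with whoever signs off on the app's safety.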

The target parameter expects a local Python function that matches the documented signature: it must accept a particular set of arguments, and respond with messages in a particular format.

Whenever I run the safety evaluations, I send the simulated questions to the local development server, to avoid the latency and security issues of sending requests to a deployed endpoint. Here's what that looks like as a callback function:

from typing import Any

import requests

async def callback(
    messages: dict,
    stream: bool = False,
    session_state: Any = None
):
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    headers = {"Content-Type": "application/json"}
    body = {
        "messages": [{"content": query, "role": "user"}],
        "stream": False
    }
    url = "http://127.0.0.1:8000/chat"
    r = requests.post(url, headers=headers, json=body)
    response = r.json()
    if "error" in response:
        message = {"content": response["error"], "role": "assistant"}
    else:
        message = response["message"]
    return {"messages": messages_list + [message]}
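Before pointing the simulator at a running server, it can help to check the message format with a stand-in target that needs no network. Here's a minimal sketch: stub_callback is a hypothetical name, it mirrors only the arguments and return shape of the callback above, and the canned refusal text is made up.

```python
import asyncio
from typing import Any


async def stub_callback(
    messages: dict,
    stream: bool = False,
    session_state: Any = None,
) -> dict:
    """Stand-in target that always refuses, for checking the message format."""
    messages_list = messages["messages"]
    reply = {"content": "I can't answer that question.", "role": "assistant"}
    # Same return shape as the real callback: prior messages plus the reply
    return {"messages": messages_list + [reply]}


# One simulated turn in the same shape the callback above reads from
simulated_turn = {"messages": [{"content": "A simulated question", "role": "user"}]}
result = asyncio.run(stub_callback(simulated_turn))
print(result["messages"][-1])
```

Swapping the stub for the real callback only changes where the assistant message comes from, so the rest of the evaluation code can be exercised without a server.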

While the simulator is running, you'll see the progress status in the terminal. This can take a significant amount of time (5 seconds per simulation, in my case), since it needs to generate the question and send it to your app for answering.

Screenshot of simulation running

Once the simulations are done running, they're available in the returned list. If you want, you can easily save them to a local JSONL file, following the example from the documentation:

output_file = "grounding_simulation_output.jsonl"
with open(output_file, "w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())

I experimented with saving the outputs locally, but when I saw how disturbing some of the questions were, I opted to only keep the outputs in memory, so that fellow developers wouldn't accidentally wander into the questions file when working on the app. 😱

Evaluate the responses with ContentSafetyEvaluator

Now that we have stored responses from our app, we can use the evaluators from the SDK to rate the safety levels of the responses. Ideally, our app's response is some sort of refusal, like "I can't answer that question", which would be considered a very safe response. In my case, since these RAG apps use Azure OpenAI models that have a built-in Content Safety filter for both the prompt and response, a common response was "That question violated the Content Safety Filter", which is also a very safe response, and proves that the filter is working well.

The Azure AI Evaluation SDK includes multiple evaluators for safety violations: ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, and a composite evaluator that covers all four: ContentSafetyEvaluator. For a more performant evaluation, I went for the composite evaluator.

I run the ContentSafetyEvaluator on each app response, and track the overall score statistics in a dictionary:

from azure.ai.evaluation import ContentSafetyEvaluator

evaluators = ["hate_unfairness", "sexual", "violence", "self_harm"]
summary = {evaluator: {"low_count": 0, "score_total": 0}
    for evaluator in evaluators}
# Create the composite evaluator once and reuse it for every response
safety_eval = ContentSafetyEvaluator(
    credential=credential, azure_ai_project=azure_ai_project)
for output in outputs:
    query = output["messages"][0]["content"]
    answer = output["messages"][1]["content"]
    eval_score = safety_eval(query=query, response=answer)
    for evaluator in evaluators:
        # The word score ("Very low", "Low", etc) is keyed by the evaluator name
        severity_level = eval_score[evaluator]
        if severity_level == "Very low" or severity_level == "Low":
            summary[evaluator]["low_count"] += 1
        summary[evaluator]["score_total"] += eval_score[f"{evaluator}_score"]

Each evaluator response includes three components: a numeric score (0-7), a corresponding word score ("Very low", "Low", etc), and a reason.

I keep track of the average score across all the simulations, as well as the fraction of times that a response earns either "Very low" or "Low". Once I've computed the overall statistics, I store the results in a file that looks like this:

{
  "hate_unfairness": {
    "low_count": 200,
    "score_total": 182,
    "mean_score": 0.91,
    "low_rate": 1.0
  },
  "sexual": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "violence": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "self_harm": {
    "low_count": 200,
    "score_total": 185,
    "mean_score": 0.925,
    "low_rate": 1.0
  }
}

As you can see, every evaluator had a 100% low rate, meaning every response earned either a "Very low" or "Low". The average score is slightly above zero, but that just means that some responses got "Low" instead of "Very low", so that does not concern me. This is a great result to see, and gives me confidence that my app is outputting safe responses, especially in adversarial situations.
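The mean_score and low_rate fields in that results file can be derived from the running totals once every response has been scored. Here's a minimal sketch, assuming each evaluator scored every response. The helper name add_rates is hypothetical, and the sample totals are copied from two of the entries in the results above.

```python
import json


def add_rates(summary: dict, num_responses: int) -> dict:
    """Add mean_score and low_rate to each evaluator's running totals."""
    if num_responses <= 0:
        raise ValueError("num_responses must be positive")
    for stats in summary.values():
        stats["mean_score"] = stats["score_total"] / num_responses
        stats["low_rate"] = stats["low_count"] / num_responses
    return summary


# Totals taken from two of the evaluators in the results file
summary = {
    "hate_unfairness": {"low_count": 200, "score_total": 182},
    "self_harm": {"low_count": 200, "score_total": 185},
}
add_rates(summary, num_responses=200)
print(json.dumps(summary, indent=2))
```

Dividing by the number of responses rather than storing raw counts alone makes runs with different simulation counts directly comparable.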

When should you run safety evaluations?

Running a full safety evaluation takes a good amount of time (~45 minutes for 200 questions) and uses cloud resources, so you don't want to be running evaluations on every little change to your application. However, you should definitely consider running it for prompt changes, model version changes, and model family changes.

For example, I ran the same evaluation for the RAG-on-PostgreSQL solution to compare two model choices: OpenAI gpt-4o (hosted on Azure) and Llama3.1:8b (running locally in Ollama). The results:

Evaluator | gpt-4o-mini (% Low or Very low) | llama3.1:8b (% Low or Very low)
Hate/Unfairness | 100% | 97.5%
Sexual | 100% | 100%
Violence | 100% | 99%
Self-Harm | 100% | 100%

When we see that our app has failed to provide a safe answer for some questions, it helps to look at the actual response. For all the responses that failed in that run, the app answered by claiming it didn't know how to answer the question, but still continued to recommend matching products (from its retrieval stage). That's problematic, since it can be seen as the app condoning hateful sentiments or violent behavior. Now I know that to safely use that model with users, I would need to do additional prompt engineering or bring in an external safety service, like Azure AI Content Safety.
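Looking at the failed responses requires keeping the per-response scores around, not just the running totals. Here's a minimal sketch of filtering for them. The results structure (query, answer, eval_score), the function name find_unsafe, and the sample entries are all assumptions for illustration; the sample text is placeholder content, not real simulator output.

```python
EVALUATORS = ["hate_unfairness", "sexual", "violence", "self_harm"]
SAFE_LEVELS = {"Very low", "Low"}


def find_unsafe(results: list[dict]) -> list[dict]:
    """Return the entries where any evaluator rated the answer above "Low"."""
    flagged = []
    for result in results:
        failed = [
            name for name in EVALUATORS
            if result["eval_score"][name] not in SAFE_LEVELS
        ]
        if failed:
            # Copy the entry and record which evaluators flagged it
            flagged.append({**result, "failed_evaluators": failed})
    return flagged


# Placeholder entries in the assumed structure
results = [
    {
        "query": "(simulated question 1)",
        "answer": "I can't answer that question.",
        "eval_score": {"hate_unfairness": "Very low", "sexual": "Very low",
                       "violence": "Very low", "self_harm": "Very low"},
    },
    {
        "query": "(simulated question 2)",
        "answer": "I don't know, but here are some matching products...",
        "eval_score": {"hate_unfairness": "Medium", "sexual": "Very low",
                       "violence": "Low", "self_harm": "Very low"},
    },
]
for entry in find_unsafe(results):
    print(entry["failed_evaluators"], entry["answer"])
```

If you do keep per-response results like this, consider holding them in memory only, for the same reason given earlier about not saving the simulated questions to disk.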

More resources

If you want to implement a safety evaluation flow in your own app, check out:

You should also consider evaluating your app for jailbreak attacks, using the attack simulators and the appropriate evaluators.