April 2025

Monday, April 28, 2025

Using DefaultAzureCredential across multiple tenants

If you are using the DefaultAzureCredential class from the Azure Identity SDK while your user account is associated with multiple tenants, you may find yourself frequently running into API authentication errors (such as HTTP 401/Unauthorized). This post is for you!

These are your two options for successful authentication from a non-default tenant:

Setup your environment precisely to force DefaultAzureCredential to use the desired tenant
Use a specific credential class and explicitly pass in the desired tenant ID

Option 1: Get DefaultAzureCredential working

The DefaultAzureCredential class is a credential chain, which means that it tries a sequence of credential classes until it finds one that can authenticate successfully. The current sequence is:

EnvironmentCredential
WorkloadIdentityCredential
ManagedIdentityCredential
SharedTokenCacheCredential
AzureCliCredential
AzurePowerShellCredential
AzureDeveloperCliCredential
InteractiveBrowserCredential

For example, on my personal machine, only two of those credentials can retrieve tokens:

AzureCliCredential: from logging in with Azure CLI (az login)
AzureDeveloperCliCredential: from logging in with Azure Developer CLI (azd auth login)

Many developers are logged in with those two credentials, so it's crucial to understand how this chained credential works. The AzureCliCredential is earlier in the chain, so if you are logged in with that, you must have the desired tenant set as the "active tenant". According to Azure CLI documentation, there are two ways to set the active tenant:

az account set --subscription SUBSCRIPTION-ID where the subscription is from the desired tenant
az login --tenant TENANT-ID, with no subsequent az login commands after

Whatever option you choose, you can confirm that your desired tenant is currently the default by running az account show and verifying the tenantId in the account details shown.

If you are only logged in with the azd CLI and not the Azure CLI, you have a problem: the azd cli does not currently have a way to set the active tenant. If that credential is called with no additional information, azd assumes your home tenant, which may not be desired. The azd credential does check for a system variable called AZURE_TENANT_ID, however, so you can try setting that in your environment before running code that uses DefaultAzureCredential. That should work as long as the DefaultAzureCredential code is truly running in the same environment where AZURE_TENANT_ID has been set.

Option 2: Use specific credentials

Several credential types allow you to explicitly pass in a tenant ID, including both the AzureCliCredential and AzureDeveloperCliCredential. If you know that you’re always going to be logging in with a specific CLI, you can change your code to that credential:

For example, in the Python SDK:

azure_cred = AzureDeveloperCliCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"])

For more flexibility, you can use conditionals to only pass in a tenant ID if one is set in the environment:

if AZURE_TENANT_ID := os.environ("AZURE_TENANT_ID"): 
  azure_cred = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID) 
else: 
  azure_cred = AzureDeveloperCliCredential()

As a best practice, I always like to log out exactly what credential I'm calling and whether I'm passing in a tenant ID, to help me spot any misconfiguration from my logs.

⚠️ Be careful when replacing DefaultAzureCredential if your code will be deployed to a production host! That means you were previously relying on it using the ManagedIdentityCredential in the chain, and that you now need to call that credential class specifically. You will also need to pass in the managed identity ID, if using user-assigned identity instead of system-assigned identity.

For example, using managed identity in the Python SDK with user-assigned identity:

azure_cred = ManagedIdentityCredential(
    client_id=os.environ["AZURE_CLIENT_ID"])

Here’s a full credential setup for an app that works locally with azd and works in production with managed identity (either system or user-assigned):

if RUNNING_ON_AZURE: 
  if AZURE_CLIENT_ID := os.getenv("AZURE_CLIENT_ID"): 
    azure_cred = ManagedIdentityCredential(client_id=AZURE_CLIENT_ID) 
  else: 
    azure_cred = ManagedIdentityCredential() 
elif AZURE_TENANT_ID := os.getenv("AZURE_TENANT_ID"): 
  azure_cred = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID) 
else: 
  azure_cred = AzureDeveloperCliCredential()

For a full walkthrough of an end-to-end template that uses keyless auth in multiple languages, check out my colleague's tutorials on using keyless auth in AI apps.

Friday, April 11, 2025

Use any Python AI agent framework with free GitHub Models

I ❤️ when companies offer free tiers for developer services, since it gives everyone a way to learn new technologies without breaking the bank. Free tiers are especially important for students and people between jobs, where the desire to learn is high but the available cash is low.

That's why I'm such a fan of GitHub Models: free, high-quality generative AI models available to anyone with a GitHub account. The available models include the latest OpenAI LLMs (like o3-mini), LLMs from the research community (like Phi and Llama), LLMs from other popular providers (like Mistral and Jamba), multimodal models (like gpt-4o and llama-vision-instruct) and even a few embedding models (from OpenAI and Cohere). So cool! With access to such a range of models, you can prototype complex multi-model workflows to improve your productivity or heck, just make something fun for yourself. 🤗

To use GitHub Models, you can start off in no-code mode: open the playground for a model, send a few requests, tweak the parameters, and check out the answers. When you're ready to write code, select "Use this model". A screen will pop up where you can select a programming language (Python/JavaScript/C#/Java/REST) and select an SDK (which varies depending on model). Then you'll get instructions and code for that model, language, and SDK.

But here's what's really cool about GitHub Models: you can use them with all the popular Python AI frameworks, even if the framework has no specific integration with GitHub Models. How is that possible?

The vast majority of Python AI frameworks support the OpenAI Chat Completions API, since that API became a defacto standard supported by many LLM API providers besides OpenAI itself.
GitHub Models also provide OpenAI-compatible endpoints for chat completion models.
Therefore, any Python AI framework that supports OpenAI-like models can be used with GitHub Models as well. 🎉

To prove my claim, I've made a new repository with examples from eight different Python AI agent packages, all working with GitHub Models: python-ai-agent-frameworks-demos. There are examples for AutoGen, LangGraph, Llamaindex, OpenAI Agents SDK, OpenAI standard SDK, PydanticAI, Semantic Kernel, and SmolAgents. You can open that repository in GitHub Codespaces, install the packages, and get the examples running immediately.

Now let's walk through the API connection code for GitHub Models for each framework. Even if I missed your favorite framework, I hope my tips here will help you connect any framework to GitHub Models.

OpenAI sdk

I'll start with openai, the package that started it all!

import openai

client = openai.OpenAI(
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

The code above demonstrates the two key parameters we'll need to configure for all frameworks:

api_key: When using OpenAI.com, you pass your OpenAI API key here. When using GitHub Models, you pass in a Personal Access Token (PAT). If you open the repository (or any repository) in GitHub Codespaces, a PAT is already stored in the GITHUB_TOKEN environment variable. However, if you're working locally with GitHub Models, you'll need to generate a PAT yourself and store it. PATs expire after a while, so you need to generate new PATs every so often.
base_url: This parameter tells the OpenAI client to send all requests to "https://models.inference.ai.azure.com" instead of the OpenAI.com API servers. That's the domain that hosts the OpenAI-compatible endpoint for GitHub Models, so you'll always pass that domain as the base URL.

If we're working with the new openai-agents SDK, we use very similar code, but we must use the AsyncOpenAI client from openai instead. Lately, Python AI packages are defaulting to async, because it's so much better for performance.

import agents
import openai

client = openai.AsyncOpenAI(
  base_url="https://models.inference.ai.azure.com",
  api_key=os.environ["GITHUB_TOKEN"])

spanish_agent = agents.Agent(
    name="Spanish agent",
    instructions="You only speak Spanish.",
    model=OpenAIChatCompletionsModel(model="gpt-4o", openai_client=client))

PydanticAI

Now let's look at all of the packages that make it really easy for us, by allowing us to directly bring in an instance of either OpenAI or AsyncOpenAI.

For PydanticAI, we configure an AsyncOpenAI client, then construct an OpenAIModel object from PydanticAI, and pass that model to the agent:

import openai
import pydantic_ai
import pydantic_ai.models.openai


client = openai.AsyncOpenAI(
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.inference.ai.azure.com")

model = pydantic_ai.models.openai.OpenAIModel(
    "gpt-4o", provider=OpenAIProvider(openai_client=client))

spanish_agent = pydantic_ai.Agent(
    model,
    system_prompt="You only speak Spanish.")

Semantic Kernel

For Semantic Kernel, the code is very similar. We configure an AsyncOpenAI client, then construct an OpenAIChatCompletion object from Semantic Kernel, and add that object to the kernel.

import openai
import semantic_kernel.connectors.ai.open_ai
import semantic_kernel.agents

chat_client = openai.AsyncOpenAI(
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

chat_completion_service = semantic_kernel.connectors.ai.open_ai.OpenAIChatCompletion(
  ai_model_id="gpt-4o",
  async_client=chat_client)

kernel.add_service(chat_completion_service)
  
spanish_agent = semantic_kernel.agents.ChatCompletionAgent(
  kernel=kernel,
  name="Spanish agent"
  instructions="You only speak Spanish")

AutoGen

Next, we'll check out a few frameworks that have their own wrapper of the OpenAI clients, so we won't be using any classes from openai directly.

For AutoGen, we configure both the OpenAI parameters and the model name in the same object, then pass that to each agent:

import autogen_ext.models.openai
import autogen_agentchat.agents

client = autogen_ext.models.openai.OpenAIChatCompletionClient(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com")

spanish_agent = autogen_agentchat.agents.AssistantAgent(
    "spanish_agent",
    model_client=client,
    system_message="You only speak Spanish")

LangGraph

For LangGraph, we configure a very similar object, which even has the same parameter names:

import langchain_openai
import langgraph.graph

model = langchain_openai.ChatOpenAI(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  base_url="https://models.inference.ai.azure.com", 
)

def call_model(state):
    messages = state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}

workflow = langgraph.graph.StateGraph(MessagesState)
workflow.add_node("agent", call_model)

SmolAgents

Once again, for SmolAgents, we configure a similar object, though with slightly different parameter names:

import smolagents

model = smolagents.OpenAIServerModel(
  model_id="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com")
  
agent = smolagents.CodeAgent(model=model)

Llamaindex

I saved Llamaindex for last, as it is the most different. The Llamaindex Python package has a different constructor for OpenAI.com versus OpenAI-like servers, so I opted to use that OpenAILike constructor instead. However, I also needed an embeddings model for my example, and the package doesn't have an OpenAIEmbeddingsLike constructor, so I used the standard OpenAIEmbedding constructor.

import llama_index.embeddings.openai
import llama_index.llms.openai_like
import llama_index.core.agent.workflow

Settings.llm = llama_index.llms.openai_like.OpenAILike(
  model="gpt-4o",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com",
  is_chat_model=True)

Settings.embed_model = llama_index.embeddings.openai.OpenAIEmbedding(
  model="text-embedding-3-small",
  api_key=os.environ["GITHUB_TOKEN"],
  api_base="https://models.inference.ai.azure.com")

agent = llama_index.core.agent.workflow.ReActAgent(
  tools=query_engine_tools,
  llm=Settings.llm)

Choose your models wisely!

In all of the examples above, I specified the "gpt-4o" model. The "gpt-4o" model is a great choice for agents because it supports function calling, and many agent frameworks only work (or work best) with models that natively support function calling.

Fortunately, GitHub Models includes multiple models that support function calling, at least in my basic experiments:

gpt-4o
gpt-4o-mini
o3-mini
AI21-Jamba-1.5-Large
AI21-Jamba-1.5-Mini
Codestral-2501
Cohere-command-r
Ministral-3B
Mistral-Large-2411
Mistral-Nemo
Mistral-small

You might find that some models work better than others, especially if you're using agents with multiple tools. With GitHub Models, it's very easy to experiment and see for yourself, by simply changing the model name and re-running the code.

So, have you started prototyping AI agents with GitHub Models yet?! Go on, experiment, it's fun!

Wednesday, April 2, 2025

Building a streaming DeepSeek-R1 app on Azure

Update: The approach has slightly changed (in a good way!). Read this Microsoft Learn article for an updated guide.

This year, we're seeing the rise in "reasoning models", models that include an additional thinking process in order to generate their answer. Reasoning models can produce more accurate answers and can answer more complex questions. Some of those models, like o1 and o3, do the reasoning behind the scenes and only report how many tokens it took them (quite a few!).

The DeepSeek-R1 model is interesting because it reveals its reasoning process along the way. When we can see the "thoughts" of a model, we can see how we might approach the question ourself in the future, and we can also get a better idea for how to get better answers from that model. We learn both how to think with the model, and how to think without it.

So, if we want to build an app using a transparent reasoning model like DeepSeek-R1, we ideally want our app to have special handling for the thoughts, to make it clear to the user the difference between the reasoning and the answer itself. It's also very important for a user-facing app to stream the response, since otherwise a user will have to wait a very long time for both the reasoning and answer to come down the wire.

Here's an app with streamed, collapsible thoughts:

Animated GIF of asking a question and seeing the thought process stream in

You can deploy that app yourself from github.com/Azure-Samples/deepseek-python today, or you can keep reading to see how it's built.

Deploying DeepSeek-R1 on Azure

We first deploy a DeepSeek-R1 model on Azure, using Bicep files (infrastructure-as-code) that provision a new Azure AI Services resource with the DeepSeek-R1 deployment. This deployment is what's called a "serverless model", so we only pay for what we use (as opposed to dedicated endpoints, where the pay is by hour).

var aiServicesNameAndSubdomain = '${resourceToken}-aiservices'
module aiServices 'br/public:avm/res/cognitive-services/account:0.7.2' = {
  name: 'deepseek'
  scope: resourceGroup
  params: {
    name: aiServicesNameAndSubdomain
    location: aiServicesResourceLocation
    tags: tags
    kind: 'AIServices'
    customSubDomainName: aiServicesNameAndSubdomain
    sku: 'S0'
    publicNetworkAccess: 'Enabled'
    deployments: [
      {
        name: aiServicesDeploymentName
        model: {
          format: 'DeepSeek'
          name: 'DeepSeek-R1'
          version: '1'
        }
        sku: {
          name: 'GlobalStandard'
          capacity: 1
        }
      }
    ]
    disableLocalAuth: disableKeyBasedAuth
    roleAssignments: [
      {
        principalId: principalId
        principalType: 'User'
        roleDefinitionIdOrName: 'Cognitive Services User'
      }
    ]
  }
}

We give both our local developer account and our application backend role-based access to use the deployment, by assigning the "Cognitive Services User" role. That allows us to connect using keyless authentication, a much more secure approach than API keys.

Connecting to DeepSeek-R1 on Azure from Python

We have a few different options for making API requests to a DeepSeek-R1 serverless deployment on Azure:

HTTP calls, using the Azure AI Model Inference REST API and a Python package like requests or aiohttp
Azure AI Inference client library for Python, a package designed especially for making calls with that inference API
OpenAI Python API library, which is focused on supporting OpenAI models but can also be used with any models that are compatible with the OpenAI HTTP API, which includes Azure AI models like DeepSeek-R1
Any of your favorite Python LLM packages that have support for OpenAI-compatible APIs, like Langchain, Litellm, etc.

I am using the openai package for this sample, since that's the most familiar amongst Python developers. As you'll see, it does require a bit of customization to point that package at an Azure AI inference endpoint. We need to change:

Base URL: Instead of pointing to openai.com server, we'll point to the deployed serverless endpoint which looks like "https://<resource-name>.services.ai.azure.com/models"
API version: The Azure AI Inference APIs require an API version string, which allows for versioning of API responses. You can see that API version in the API reference. In the REST API, it is passed as a query parameter, so we will need the openai package to send it along as a query parameter as well.
API authentication: Instead of providing an OpenAI key (or Azure AI services key, in this case), we're going to pass an OAuth2 token in the authorization headers of each request, and make sure that the token is refreshed before it expires.

Setting up the keyless API authentication can be a bit tricky! First, we need to acquire a token provider for our current credential, using the azure-identity package:

from azure.identity.aio import AzureDeveloperCliCredential, ManagedIdentityCredential, get_bearer_token_provider

if os.getenv("RUNNING_IN_PRODUCTION"):
  azure_credential = ManagedIdentityCredential(
      client_id=os.environ["AZURE_CLIENT_ID"])
else:
  azure_credential = AzureDeveloperCliCredential(
      tenant_id=os.environ["AZURE_TENANT_ID"])

token_provider = get_bearer_token_provider(
  azure_credential, "https://cognitiveservices.azure.com/.default"
)

That code uses either ManagedIdentityCredential when it's running in production (on Azure Container Apps, with a user-assigned identity) or AzureDeveloperCliCredential when it's running locally. The token_provider function returns a token string every time we call it

For the next step, it helps to understand a bit about how the OpenAI package works. The OpenAI package sends all HTTP requests through httpx, a popular Python package that can make calls either synchronously or asynchronously, and it allows for customization of the httpx clients by developers that need more control of the HTTP requests.

In our case, we need to add the token in the "Authorization" header of each HTTP request, so we make a subclass of httpx.Auth that sets the header on each asynchronous request by calling the token provider function:

class TokenBasedAuth(httpx.Auth):
  async def async_auth_flow(self, request):
    token = await openai_token_provider()
    request.headers["Authorization"] = f"Bearer {token}"
    yield request

  def sync_auth_flow(self, request):
    raise RuntimeError("Cannot use a sync authentication class with httpx.AsyncClient")

Each time the token provider function is called, it will make sure that the token has not yet expired, and fetch a new one as necessary.

Now we can create a AsyncOpenAI client by passing in a custom httpx client using that TokenBasedAuth class, along with the correct base URL and API version:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(
  base_url=os.environ["AZURE_INFERENCE_ENDPOINT"],
  default_query={"api-version": "2024-05-01-preview"},
  api_key="placeholder",
  http_client=DefaultAsyncHttpxClient(auth=TokenBasedAuth()),
)

Making chat completion requests

When we receive a new question from the user, we use that OpenAI client to call the chat completions API:

chat_coroutine = openai_client.chat.completions.create(
   model=os.getenv("AZURE_DEEPSEEK_DEPLOYMENT"),
   messages=all_messages,
   stream=True)

You'll notice that instead of the typical model name that we send in when using OpenAI, we send in the deployment name. For convenience, I often name deployments the same as the model, so that they will match even if I mistakenly pass in the model name.

Streaming the response from the backend

As I've discussed previously on this blog, we should always use streaming responses when building user-facing chat applications, to reduce perceive latency and improve the user experience.

To receive a streamed response from the chat completions API, we specified stream=True in the call above. Then, as we receive each event from the server, we check whether the content is the special "<think>" start token or "</think>" end token. When we know the model is currently in a thinking mode, we pass down the content chunks in a "reasoning_content" field. Otherwise, we pass down the content chunks in the "content" field.

We send each event to our frontend using a common approach of JSON-lines over a streaming HTTP response (which has the "Transfer-encoding: chunked" header). That means the client receives a JSON separated by a new line for each event, and can easily parse them out. The other common approaches are server-sent events or websockets, but both are unnecessarily complex for this scenario.

is_thinking = False
async for update in await chat_coroutine:
    if update.choices:
        content = update.choices[0].delta.content
        if content == "":
            is_thinking = True
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content == "":
            is_thinking = False
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content:
            if is_thinking:
                yield json.dumps(
                    {"delta": {"content": None, "reasoning_content": content, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"
            else:
                yield json.dumps(
                    {"delta": {"content": content, "reasoning_content": None, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"

Rendering the streamed response in the frontend

The frontend code makes a standard fetch() request to the backend route, passing in the message history:

const response = await fetch("/chat/stream", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({messages: messages})
});

To process the streaming JSON lines that are returned from the server, I brought in my tiny ndjson-readablestream package, which uses ReadableStream along with JSON.parse to make it easy to iterate over each JSON object as it comes in. When I see that the JSON is "reasoning_content", I display it in a special collapsible container.

let answer = "";
let thoughts = "";
for await (const event of readNDJSONStream(response.body)) {
    if (!event.delta) {
        continue;
    }
    if (event.delta.reasoning_content) {
        thoughts += event.delta.reasoning_content;
        if (thoughts.trim().length > 0) {
            // Only show thoughts if they are more than just whitespace
            messageDiv.querySelector(".loading-bar").style.display = "none";
            messageDiv.querySelector(".thoughts").style.display = "block";
            messageDiv.querySelector(".thoughts-content").innerHTML = converter.makeHtml(thoughts);
        }
    } else {
        messageDiv.querySelector(".loading-bar").style.display = "none";
        answer += event.delta.content;
        messageDiv.querySelector(".answer-content").innerHTML = converter.makeHtml(answer);
    }
    messageDiv.scrollIntoView();
    if (event.error) {
        messageDiv.innerHTML = "Error: " + event.error;
    }
}

All together now

The full code is available in github.com/Azure-Samples/deepseek-python. Here are the key files for the code snippeted in this blog post:

File	Purpose
infra/main.bicep	Bicep files for deployment
src/quartapp/chat.py	Quart app with the client setup and streaming chat route
src/quartapp/templates/index.html	Webpage with HTML/JS for rendering stream