Wednesday, November 27, 2024

Running Azurite inside a Dev Container

I recently worked on an improvement to the flask-admin extension to upgrade the Azure Blob Storage SDK from v2 (a legacy SDK) to v12 (the latest). To make it easy to test the change without touching a production Blob storage account, I used the Azurite server, the official local emulator. I could have installed that emulator on my Mac, but I was already working in GitHub Codespaces, so I wanted Azurite to be automatically set up inside that environment, for me and any future developers. I decided to create a dev container definition for the flask-admin repository, and used that to bring in Azurite.

To make it easy for *anyone* to make a dev container with Azurite, I've created a GitHub repository whose sole purpose is to set up Azurite:
https://github.com/pamelafox/azurite-python-playground

You can open that up in a GitHub Codespace or VS Code Dev Container immediately and start playing with it, or continue reading to learn how it works.

devcontainer.json

The entry point for a dev container is .devcontainer/devcontainer.json, which tells the IDE how to set up the containerized environment.

For a container with Azurite, here's the devcontainer.json:

{
  "name": "azurite-python-playground",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace",
  "forwardPorts": [10000, 10001],
  "portsAttributes": {
    "10000": {"label": "Azurite Blob Storage Emulator", "onAutoForward": "silent"},
    "10001": {"label": "Azurite Blob Storage Emulator HTTPS", "onAutoForward": "silent"}
  },
  "customizations": {
    "vscode": {
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "remoteUser": "vscode"
}

That dev container tells the IDE to build a container using docker-compose.yaml and to treat the "app" service as the main container for the editor to open. It also tells the IDE to forward the two ports exposed by Azurite (10000 for HTTP, 10001 for HTTPS) and to label them in the "Ports" tab. That's not strictly necessary, but it's a nice way to see that the server is running.
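If you want to double-check from inside the container that the emulator is listening on those ports, a quick probe like this works (a minimal sketch using only the Python standard library; any HTTP response, even an authorization error, proves the blob endpoint is up):

import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://127.0.0.1:10000/devstoreaccount1?comp=list")
    print("Azurite responded")
except urllib.error.HTTPError as err:
    # An auth error still means the emulator is listening
    print(f"Azurite is up (HTTP {err.code})")
except urllib.error.URLError:
    print("Azurite is not reachable on port 10000")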

docker-compose.yaml

The docker-compose.yaml file first needs to describe the "app" container that will be used for the IDE's editing environment, and then define the "azurite" container for the local Azurite server.

version: '3'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile

    volumes:
      - ..:/workspace:cached

    # Overrides default command so things don't shut down after the process ends.
    command: sleep infinity
    environment:
      AZURE_STORAGE_CONNECTION_STRING: DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;

  azurite:
    container_name: azurite
    image: mcr.microsoft.com/azure-storage/azurite:latest
    restart: unless-stopped
    volumes:
      - azurite-data:/data
    network_mode: service:app

volumes:
  azurite-data:

A few things to note:

  • The "app" service is based on a local Dockerfile with a base Python image. It also sets the AZURE_STORAGE_CONNECTION_STRING for connecting with the local server.
  • The "azurite" service is based off the official azurite image and uses a volume for data persistance.
  • The "azurite" service uses network_mode: service:app so that it is on the same network as the "app" service. This means that the app can access them at a localhost URL. The other approach is to use network_mode: bridge, the default, which would mean the Azurite service was only available at its service name, like "http://azurite:10000". Either approach works, as long as the connection string is set correctly.

Dockerfile

The Dockerfile defines the environment for the code editing experience. In this case, I am bringing in a devcontainer-optimized Python image. You could adapt it for other languages, like Java, .NET, JavaScript, Go, etc.

FROM mcr.microsoft.com/devcontainers/python:3.12

# Install the Python dependencies at build time (assumes requirements.txt
# is in the build context alongside the Dockerfile)
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt

Monday, November 25, 2024

Making a dev container with multiple data services

A dev container is a specification that describes how to open up a project in VS Code, GitHub Codespaces, or any other IDE supporting dev containers, in a consistent and repeatable manner. It builds on Docker and docker-compose, and also allows for IDE customizations like extensions and settings. These days, I always try to add a .devcontainer/ folder to my GitHub templates, so that developers can open them up quickly and get the full environment set up for them.

In the past, I've made dev containers to bring in PostgreSQL, pgvector, and Redis, but I'd never made a dev container that could bring in multiple data services at the same time. I finally made a multi-service dev container today, as part of a pull request to flask-admin, so I'm sharing my approach here.

devcontainer.json

The entry point for a dev container is devcontainer.json, which tells the IDE to use a particular Dockerfile, docker-compose, or public image. Here's what it looks like for the multi-service container:

{
  "name": "Multi-service dev container",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace"
}

That dev container tells the IDE to build a container using docker-compose.yaml and to treat the "app" service as the main container for the editor to open.

docker-compose.yaml

The docker-compose.yaml file first needs to describe the "app" container that will be used for the IDE's editing environment, and then describe any additional services. Here's what one looks like for a Python app bringing in PostgreSQL, Azurite, and MongoDB:

version: '3'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        IMAGE: python:3.12

    volumes:
      - ..:/workspace:cached

    # Overrides default command so things don't shut down after the process ends.
    command: sleep infinity
    environment:
      AZURE_STORAGE_CONNECTION_STRING: DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
      POSTGRES_HOST: localhost
      POSTGRES_PASSWORD: postgres
      MONGODB_HOST: localhost

  postgres:
    image: postgis/postgis:16-3.4
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: flask_admin_test
    volumes:
      - postgres-data:/var/lib/postgresql/data
    network_mode: service:app

  azurite:
    container_name: azurite
    image: mcr.microsoft.com/azure-storage/azurite:latest
    restart: unless-stopped
    volumes:
      - azurite-data:/data
    network_mode: service:app

  mongo:
    image: mongo:5.0.14-focal
    restart: unless-stopped
    network_mode: service:app

volumes:
  postgres-data:
  azurite-data:

A few things to point out:

  • The "app" service is based on a local Dockerfile with a base Python image. It also sets environment variables for connecting to the subsequent services.
  • The "postgres" service is based off the official postgis image. The postgres or pgvector image would also work there. It specifies environment variables matching those used by the "app" service. It sets up a volume so that the data can persist inside the container.
  • The "azurite" service is based off the official azurite image, and also uses a volume for data persistance.
  • The "mongo service" is based off the official mongo image, and in this case, I did not set up a volume for it.
  • Each of the data services uses network_mode: service:app so that they are on the same network as the "app" service. This means that the app can access them at a localhost URL. The other approach is to use network_mode: bridge, the default, which would mean the services were only available at their service names, like "http://postgres:5432" or "http://azurite:10000". Either approach works, as long as your app code knows how to find the service ports.
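As a quick connectivity check, here's a sketch that talks to all three services from the "app" container; it assumes the psycopg2, pymongo, and azure-storage-blob packages are installed, which the dev container definition itself doesn't do:

import os

import psycopg2
import pymongo
from azure.storage.blob import BlobServiceClient

# PostgreSQL: host and password come from the compose environment variables
conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    user="postgres",
    password=os.environ["POSTGRES_PASSWORD"],
    dbname="flask_admin_test",
)
print("PostgreSQL server version:", conn.server_version)

# Azurite: the connection string points at localhost:10000
blob_client = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
print("Azurite account:", blob_client.account_name)

# MongoDB: default port 27017 on localhost, thanks to network_mode: service:app
mongo = pymongo.MongoClient(os.environ["MONGODB_HOST"], 27017)
print("MongoDB version:", mongo.server_info()["version"])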

Dockerfile

Any of the services can be defined with a Dockerfile, but the example above only uses a Dockerfile for the default "app" service, shown below:

ARG IMAGE=bullseye
FROM mcr.microsoft.com/devcontainers/${IMAGE}

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends postgresql-client \
    && apt-get clean -y && rm -rf /var/lib/apt/lists/*

That file brings in a devcontainer-optimized Python image, and then goes on to install the psql client for interaction with the PostgreSQL database. You can also install other tools here, plus install Python requirements. It just depends on what you want to be available in the environment, versus what commands you want developers to be running themselves.

Wednesday, November 20, 2024

My first PyBay: Playing improv with Python

A few months ago in September, I attended my very first PyBay: an annual conference in San Francisco bringing together Pythonistas from across the Bay Area. It was a 2-track single-day conference, with nearly 300 attendees, and talks ranging from 10 to 60 minutes.


My talk

I was very honored to present one of the first talks of the day, on a topic that's near and dear to my heart: improv! Back before I had kids, I spent many years taking improv classes and running an improv club with friends out of my home. I love that improv games force me to be in the moment, and I also just generally find spontaneous generation to be a source of much hilarity. 😜

I've always wanted an excuse to re-create my favorite improv games as computer programs, and now with language models (both small and large), it's actually quite doable! So my talk was about "Playing improv with Python", where I used local models (Llama 3.1 and Phi 3.5) to play increasingly complex games, and demonstrated different approaches along the way: prompt engineering, few-shot examples, function calling, and multimodal input. You can check out my slides and code examples. You're always welcome to re-use my slides or examples yourself! I spoke with several folks who want to use them as a way to teach language models.

To make the talk more interactive, I also asked the audience to play improv games, starting with an audience-wide game of "reverse charades", where attendees acted out a word displayed on the screen while a kind volunteer attempted to guess the word. I was very nervous about asking the audience for such a high level of interactivity, and thrilled when they joined in!

Then, before demonstrating each game, I asked for volunteers to come on stage and play it first, before making the computer play. Once again, the attendees eagerly jumped up, and it was so fun to get to play improv games with humans for the first time in years.

You can watch the whole talk on YouTube. You may want to fast-forward through the beginning, since the recording couldn't capture the off-stage improv shenanigans.



Other talks

Since it was a two-track conference, I could only attend half of the talks, but I did manage to watch quite a few interesting ones. Highlights:

  • From Pandas to Polars: Upgrading Your Data Workflow
    By Matthew Harrison, author of Pandas/Polars books. My takeaways: Polars looks more intuitive than Pandas in some ways, and Matt Harrison really encourages us to use chaining instead of intermediary variables. I liked how he presented in a Jupyter notebook and just used copious empty cells to present only one "slide" at a time.
  • The Five Demons of Python Packaging That Fuel Our Persistent Nightmare
    By Peter Wang, Anaconda creator. Great points on packaging difficulties, including a slide reminding folks that Windows users exist and must have working packages! He also called out the tension with uv being VC-funded, and said that Python OSS creators should not have to take a vow of poverty. Peter also suggested a PEP for a way that packages could declare their interface versus their runtime. I asked him his thoughts on using extras, and he said yes, we should use extras more often.
  • F-Strings! (Slides)
    By Mariatta Wijaya, CPython maintainer. Starts with the basics but then ramps up to the wild new 3.12 f-string features, which I had fun playing with afterwards.
  • Thinking of Topic Modeling as Search (Slides | Code)
    By Kas Stohr. Used embeddings for "Hot topics" in a social media app. Really interesting use case for vector embeddings, and how to combine with clustering algorithms.
  • Master Python typing with Python-Type-Challenges
    By Laike9m. Try it out! Fun way to practice type annotations.
  • PyTest, The Improv Way
    By Joshua Grant. A 10-minute talk where he asked the audience what he should test in the testing pyramid (unit/integration/browser). I quickly shouted "browser", so he proceeded to write a test using Playwright, my current favorite browser automation library. Together with the audience, he got the test passing! 🎉
  • Secret Snake: Using AI to Entice Technical and Non-Technical Employees to Python
    By Paul Karayan. A short talk about how a dev at a Fintech firm used ChatGPT as a "gateway drug" to get their colleagues eventually making PRs to GitHub repos with prompt changes and even writing Python. They even put together a curriculum with projects for their non-technical colleagues.
  • Accelerating ML Prototyping: The Pythonic Way
    By Parul Gupta. About Meta's approach to Jupyter notebooks, which involves custom VS Code integration and extensions.
  • Let's make a working implementation of async functions in Python 2.1; or, why to use newer Pythons
    By Christopher Neugebauer, PSF. He managed to implement async in Python 2.1, using bytecode patching and sys.settrace. His conclusion is that we should use the latest Python for async, of course. 🙂
  • Scrolling Animated ASCII Art in Python (Scrollart.org)
    By Al Sweigart, author of many Python books. Very fun ideas for classroom projects!

Next year?

PyBay was a fantastic conference! Kudos to the organizers for a job well done. I look forward to returning next year, and hopefully finding something equally fun to talk about.

Tuesday, November 5, 2024

Entity extraction using OpenAI structured outputs mode

The relatively new structured outputs mode from the OpenAI gpt-4o model makes it easy for us to define an object schema and get a response from the LLM that conforms to that schema.

Here's the most basic example from the Azure OpenAI tutorial about structured outputs:

from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

# "client" is an OpenAI (or AzureOpenAI) client instance, created beforehand
completion = client.beta.chat.completions.parse(
    model="MODEL_DEPLOYMENT_NAME",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

output = completion.choices[0].message.parsed

The code first defines the CalendarEvent class, a Pydantic model. Then it sends a request to the GPT model specifying a response_format of CalendarEvent. The parsed output will contain a name, date, and participants.

We can even go a step further and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method:

event = CalendarEvent.model_validate(output)

With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed.

Extracting from GitHub READMEs

Let's see an example of how I actually used structured outputs: to help summarize the submissions we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask it to extract key details like the project title and technologies used.

First I define the Pydantic models:

from enum import Enum

from pydantic import BaseModel, Field

class Language(str, Enum):
    JAVASCRIPT = "JavaScript"
    PYTHON = "Python"
    DOTNET = ".NET"

class Framework(str, Enum):
    LANGCHAIN = "Langchain"
    SEMANTICKERNEL = "Semantic Kernel"
    LLAMAINDEX = "Llamaindex"
    AUTOGEN = "Autogen"
    SPRINGBOOT = "Spring Boot"
    PROMPTY = "Prompty"

class RepoOverview(BaseModel):
    name: str
    summary: str = Field(..., description="A 1-2 sentence description of the project")
    languages: list[Language]
    frameworks: list[Framework]

In the code above, I asked for lists of Python enums, which constrain the model to return only options matching those lists. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annotated the description using the Pydantic Field class so that I could specify the desired length of the description. Without that annotation, the descriptions are often much longer. We can use that description whenever we want to give additional guidance to the model about a field.
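If you're curious what the model actually sees, you can inspect the JSON schema that gets derived from the Pydantic model and sent along with the request; the Language and Framework options show up in it as "enum" lists that the response must draw from:

import json

# Print the JSON schema derived from the Pydantic model
print(json.dumps(RepoOverview.model_json_schema(), indent=2))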

Next, I fetch the GitHub readme, storing it as a string:

url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md"
response = requests.get(url)
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")

Finally, I send off the request and convert the result into a RepoOverview instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {
            "role": "system",
            "content": "Extract info from the GitHub issue markdown about this hack submission.",
        },
        {"role": "user", "content": readme_content},
    ],
    response_format=RepoOverview,
)
output = completion.choices[0].message.parsed
repo_overview = RepoOverview.model_validate(output)

You can see the full code in extract_github_repo.py
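Once each README has been parsed into a RepoOverview, summarizing the hackathon becomes easy. For example, here's a sketch that tallies framework usage, assuming the parsed results were collected into a list called repo_overviews (a hypothetical name, not in the original code):

from collections import Counter

# Count how often each framework appears across all submissions
framework_counts = Counter(
    framework.value
    for overview in repo_overviews
    for framework in overview.frameworks
)
print(framework_counts.most_common())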

Extracting from PDFs

I talk to many customers who want to extract details from PDFs, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the PDF as text, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, as I wanted to try out their specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown.

First I load in a PDF of an order receipt and convert it to markdown:

md_text = pymupdf4llm.to_markdown("example_receipt.pdf")

Then I define the Pydantic models for a receipt:

class Item(BaseModel):
    product: str
    price: float
    quantity: int


class Receipt(BaseModel):
    total: float
    shipping: float
    payment_method: str
    items: list[Item]
    order_number: int

In this example, I'm using a nested Pydantic model Item for each item in the receipt, so that I can get detailed information about each item.

And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the blog post"},
        {"role": "user", "content": md_text},
    ],
    response_format=Receipt,
)
output = completion.choices[0].message.parsed
receipt = Receipt.model_validate(output)

You can see the full code in extract_pdf_receipt.py
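Since language models can mis-read numbers, a cheap sanity check is to compare the extracted line items against the extracted total. This sketch assumes prices are per-unit and that the total includes shipping, which may not hold for every receipt:

# Cross-check the line items against the total
items_total = sum(item.price * item.quantity for item in receipt.items)
if abs(items_total + receipt.shipping - receipt.total) > 0.01:
    print(f"Mismatch: items ({items_total}) + shipping ({receipt.shipping}) != total ({receipt.total})")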

Extracting from images

Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means that we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool!

First I load in a local image as a base-64 encoded data URI:

import base64

def open_image_as_base64(filename):
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"


image_url = open_image_as_base64("example_graph_treecover.png")

For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models:

class Graph(BaseModel):
    title: str
    description: str = Field(..., description="1 sentence description of the graph")
    x_axis: str
    y_axis: str
    legend: list[str]

Then I send off the base-64 image URI to the GPT model, inside an "image_url" type message, and convert the response back to a Graph object:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"image_url": {"url": image_url}, "type": "image_url"},
            ],
        },
    ],
    response_format=Graph,
)
output = completion.choices[0].message.parsed
graph = Graph.model_validate(output)

More examples

You can use this same general approach for entity extraction across many file types, as long as they can be represented in either a text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.