Tuesday, December 17, 2024

Add browser speech input & output to your app

One of the amazing benefits of modern machine learning is that computers can reliably turn text into speech, or transcribe speech into text, across multiple languages and accents. We can then use those capabilities to make our web apps more accessible for anyone who has a situational, temporary, or chronic issue that makes typing difficult. That describes so many people - for example, a parent holding a squirmy toddler in their arms, an athlete with a broken arm, or an individual with Parkinson's disease.

There are two approaches we can use to add speech capabilities to our apps:

  1. Use the built-in browser APIs: the SpeechRecognition API and SpeechSynthesis API.
  2. Use a cloud-based service, like the Azure Speech API.

Which one to use? The great thing about the browser APIs is that they're free and available in most modern browsers and operating systems. The drawback of the APIs is that they're often not as powerful and flexible as cloud-based services, and the speech output often sounds much more robotic. There are also a few niche browser/OS combos where the built-in APIs don't work, like SpeechRecognition on Microsoft Edge on a Mac M1. That's why we decided to add both options to azure-search-openai-demo, to let developers decide for themselves.

In this post, I'm going to show you how to add speech capabilities using the free built-in browser APIs, since free APIs are often easier to get started with, and it's important to do what we can to improve the accessibility of our apps. The GIF below shows the end result, a chat app with both speech input and output buttons:

GIF of speech input and output for a chat app

All of the code described in this post is part of openai-chat-vision-quickstart, so you can grab the full code yourself after seeing how it works.

Speech input with SpeechRecognition API

To make it easier to add a speech input button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechInputButton. First I construct the speech input button element with an instance of the SpeechRecognition API, making sure to use the browser's preferred language if any are set:

class SpeechInputButton extends HTMLElement {
  constructor() {
    super();
    this.isRecording = false;
    const SpeechRecognition =
      window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!SpeechRecognition) {
      this.dispatchEvent(
        new CustomEvent("speecherror", {
          detail: { error: "SpeechRecognition not supported" },
        })
      );
      return;
    }
    this.speechRecognition = new SpeechRecognition();
    this.speechRecognition.lang = navigator.language || navigator.userLanguage;
    this.speechRecognition.interimResults = false;
    this.speechRecognition.continuous = true;
    this.speechRecognition.maxAlternatives = 1;
  }

Then I define the connectedCallback() method that will be called whenever this custom element has been added to the DOM. When that happens, I define the inner HTML to render a button and attach event listeners for both mouse and keyboard events. Since we want this to be fully accessible, keyboard support is important.

connectedCallback() {
  this.innerHTML = `
        <button class="btn btn-outline-secondary" type="button" title="Start recording (Shift + Space)">
            <i class="bi bi-mic"></i>
        </button>`;
  this.recordButton = this.querySelector('button');
  this.recordButton.addEventListener('click', () => this.toggleRecording());
  document.addEventListener('keydown', this.handleKeydown.bind(this));
}
  
handleKeydown(event) {
  if (event.key === 'Escape') {
    this.abortRecording();
  } else if (event.key === ' ' && event.shiftKey) { // Shift + Space
    event.preventDefault();
    this.toggleRecording();
  }
}
  
toggleRecording() {
  if (this.isRecording) {
    this.stopRecording();
  } else {
    this.startRecording();
  }
}

The majority of the code is in the startRecording function. It sets up a listener for the "result" event from the SpeechRecognition instance, which contains the transcribed text. It also sets up a listener for the "end" event, which is triggered either automatically after a few seconds of silence (in some browsers) or when the user ends the recording by clicking the button. Finally, it sets up a listener for any "error" events. Once all listeners are ready, it calls start() on the SpeechRecognition instance and styles the button to be in an active state.

startRecording() {
  if (this.speechRecognition == null) {
    this.dispatchEvent(
      new CustomEvent("speech-input-error", {
        detail: { error: "SpeechRecognition not supported" },
      })
    );
    return;
  }

  this.speechRecognition.onresult = (event) => {
    let input = "";
    for (const result of event.results) {
      input += result[0].transcript;
    }
    this.dispatchEvent(
      new CustomEvent("speech-input-result", {
        detail: { transcript: input },
      })
    );
  };

  this.speechRecognition.onend = () => {
    this.isRecording = false;
    this.renderButtonOff();
    this.dispatchEvent(new Event("speech-input-end"));
  };

  this.speechRecognition.onerror = (event) => {
    if (this.speechRecognition) {
      this.speechRecognition.stop();
      if (event.error == "no-speech") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "No speech was detected. Please check your system audio settings and try again." },
          })
        );
      } else if (event.error == "language-not-supported") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "The selected language is not supported. Please try a different language." },
          })
        );
      } else if (event.error != "aborted") {
        this.dispatchEvent(
          new CustomEvent("speech-input-error", {
            detail: { error: "An error occurred while recording. Please try again: " + event.error },
          })
        );
      }
    }
  };

  this.speechRecognition.start();
  this.isRecording = true;
  this.renderButtonOn();
}
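
The startRecording code above calls renderButtonOn() and renderButtonOff(), which aren't shown here. A minimal sketch of those methods might just toggle a CSS class and swap the Bootstrap icon (the "speech-input-active" class name is an assumption; use whatever your stylesheet defines):

renderButtonOn() {
  this.recordButton.classList.add("speech-input-active");
  this.recordButton.querySelector("i").classList.replace("bi-mic", "bi-mic-fill");
}

renderButtonOff() {
  this.recordButton.classList.remove("speech-input-active");
  this.recordButton.querySelector("i").classList.replace("bi-mic-fill", "bi-mic");
}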

If the user stops the recording using the keyboard shortcut or button click, we call stop() on the SpeechRecognition instance. At that point, anything the user had said will be transcribed and become available via the "result" event.

stopRecording() {
  if (this.speechRecognition) {
    this.speechRecognition.stop();
  }
}

Alternatively, if the user presses the Escape keyboard shortcut, we instead call abort() on the SpeechRecognition instance, which stops the recording and does not send any previously untranscribed speech over.

abortRecording() {
  if (this.speechRecognition) {
    this.speechRecognition.abort();
  }
}

Once the custom HTML element is fully defined, we register it with the desired tag name, speech-input-button:

customElements.define("speech-input-button", SpeechInputButton);

To use the custom speech-input-button element in a chat application, we add it to the HTML for the chat form:


  <speech-input-button></speech-input-button>
  <input id="message" name="message" class="form-control form-control-sm" type="text" rows="1"></input>

Then we attach an event listener for the custom events dispatched by the element, and we update the input text field with the transcribed text:

const messageInput = document.getElementById("message");
const speechInputButton = document.querySelector("speech-input-button");
speechInputButton.addEventListener("speech-input-result", (event) => {
    messageInput.value += " " + event.detail.transcript.trim();
    messageInput.focus();
});
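
The element also dispatches "speech-input-error" and "speech-input-end" events, so the page can react to problems or to the end of a recording. For example, a minimal error handler (how you surface the message to the user is up to you):

speechInputButton.addEventListener("speech-input-error", (event) => {
    // Could also display the message near the chat input instead of just logging it
    console.warn("Speech input error:", event.detail.error);
});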

You can see the full custom HTML element code in speech-input.js and the usage in index.html. There's also a fun pulsing animation for the button's active state in styles.css.

Speech output with SpeechSynthesis API

Once again, to make it easier to add a speech output button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechOutputButton. When defining the custom element, we specify an observed attribute named "text", to store whatever text should be turned into speech when the button is clicked.

class SpeechOutputButton extends HTMLElement {
  static observedAttributes = ["text"];

In the constructor, we check to make sure the SpeechSynthesis API is supported, and remember the browser's preferred language for later use.

constructor() {
  super();
  this.isPlaying = false;
  const SpeechSynthesis = window.speechSynthesis || window.webkitSpeechSynthesis;
  if (!SpeechSynthesis) {
    this.dispatchEvent(
      new CustomEvent("speech-output-error", {
        detail: { error: "SpeechSynthesis not supported" }
    }));
    return;
  }
  this.synth = SpeechSynthesis;
  this.lngCode = navigator.language || navigator.userLanguage;
}

When the custom element is added to the DOM, I define the inner HTML to render a button and attach mouse and keyboard event listeners:

connectedCallback() {
    this.innerHTML = `
            <button class="btn btn-outline-secondary" type="button">
                <i class="bi bi-volume-up"></i>
            </button>`;
    this.speechButton = this.querySelector("button");
    this.speechButton.addEventListener("click", () =>
      this.toggleSpeechOutput()
    );
    document.addEventListener('keydown', this.handleKeydown.bind(this));
}

The majority of the code is in the toggleSpeechOutput function. If the speech is not yet playing, it creates a new SpeechSynthesisUtterance instance, passes it the "text" attribute, and sets the language and audio properties. It attempts to use a voice that's optimal for the desired language, but falls back to "en-US" if none is found. It attaches event listeners for the start and end events, which will change the button's style to look either active or inactive. Finally, it tells the SpeechSynthesis API to speak the utterance.

toggleSpeechOutput() {
    if (!this.isConnected) {
      return;
    }
    const text = this.getAttribute("text");
    if (this.synth != null) {
      if (this.isPlaying || text === "") {
        this.stopSpeech();
        return;
      }

      // Create a new utterance and play it.
      const utterance = new SpeechSynthesisUtterance(text);
      utterance.lang = this.lngCode;
      utterance.volume = 1;
      utterance.rate = 1;
      utterance.pitch = 1;

      let voice = this.synth
        .getVoices()
        .filter((voice) => voice.lang === this.lngCode)[0];
      if (!voice) {
        voice = this.synth
          .getVoices()
          .filter((voice) => voice.lang === "en-US")[0];
      }
      utterance.voice = voice;

      if (!utterance) {
        return;
      }

      utterance.onstart = () => {
        this.isPlaying = true;
        this.renderButtonOn();
      };

      utterance.onend = () => {
        this.isPlaying = false;
        this.renderButtonOff();
      };
      
      this.synth.speak(utterance);
    }
  }

When the user no longer wants to hear the speech output, indicated either via another press of the button or by pressing the Escape key, we call cancel() from the SpeechSynthesis API.

stopSpeech() {
  if (this.synth) {
    this.synth.cancel();
    this.isPlaying = false;
    this.renderButtonOff();
  }
}
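
The handleKeydown method attached in connectedCallback isn't shown above; a minimal version might simply stop playback when the Escape key is pressed:

handleKeydown(event) {
  if (event.key === 'Escape') {
    this.stopSpeech();
  }
}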

Once the custom HTML element is fully defined, we register it with the desired tag name, speech-output-button:

customElements.define("speech-output-button", SpeechOutputButton);

To use this custom speech-output-button element in a chat application, we construct it dynamically each time that we've received a full response from an LLM, and call setAttribute to pass in the text to be spoken:

const speechOutput = document.createElement("speech-output-button");
speechOutput.setAttribute("text", answer);
messageDiv.appendChild(speechOutput);

You can see the full custom HTML element code in speech-output.js and the usage in index.html. This button also uses the same pulsing animation for the active state, defined in styles.css.

Acknowledgments

I want to give a huge shout-out to John Aziz for his amazing work adding speech input and output to the azure-search-openai-demo, as that was the basis for the code I shared in this blog post.

Wednesday, November 27, 2024

Running Azurite inside a Dev Container

I recently worked on an improvement to the flask-admin extension to upgrade the Azure Blob Storage SDK from v2 (an old legacy SDK) to v12 (the latest). To make it easy for me to test out the change without touching a production Blob storage account, I used the Azurite server, the official local emulator. I could have installed that emulator on my Mac, but I was already working in GitHub Codespaces, so I wanted Azurite to be automatically set up inside that environment, for me and any future developers. I decided to create a dev container definition for the flask-admin repository, and used that to bring in Azurite.

To make it easy for *anyone* to make a dev container with Azurite, I've created a GitHub repository whose sole purpose is to set up Azurite:
https://github.com/pamelafox/azurite-python-playground

You can open that up in a GitHub Codespace or VS Code Dev Container immediately and start playing with it, or continue reading to learn how it works.

devcontainer.json

The entry point for a dev container is .devcontainer/devcontainer.json, which tells the IDE how to set up the containerized environment.

For a container with Azurite, here's the devcontainer.json:

{
  "name": "azurite-python-playground",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace",
  "forwardPorts": [10000, 10001],
  "portsAttributes": {
    "10000": {"label": "Azurite Blob Storage Emulator", "onAutoForward": "silent"},
    "10001": {"label": "Azurite Blob Storage Emulator HTTPS", "onAutoForward": "silent"}
  },
  "customizations": {
    "vscode": {
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "remoteUser": "vscode"
}

That dev container definition tells the IDE to build a container using docker-compose.yaml and to treat the "app" service as the main container for the editor to open. It also tells the IDE to forward two of the ports exposed by Azurite (10000 for the Blob service, 10001 for the Queue service) and to label them in the "Ports" tab. That's not strictly necessary, but it's a nice way to see that the server is running.

docker-compose.yaml

The docker-compose.yaml file needs to describe first the "app" container that will be used for the IDE's editing environment, and then define the "azurite" container for the local Azurite server.

version: '3'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile

    volumes:
      - ..:/workspace:cached

    # Overrides default command so things don't shut down after the process ends.
    command: sleep infinity
    environment:
      AZURE_STORAGE_CONNECTION_STRING: DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;

  azurite:
    container_name: azurite
    image: mcr.microsoft.com/azure-storage/azurite:latest
    restart: unless-stopped
    volumes:
      - azurite-data:/data
    network_mode: service:app

volumes:
  azurite-data:

A few things to note:

  • The "app" service is based on a local Dockerfile with a base Python image. It also sets the AZURE_STORAGE_CONNECTION_STRING for connecting with the local server.
  • The "azurite" service is based off the official azurite image and uses a volume for data persistance.
  • The "azurite" service uses network_mode: service:app so that it is on the same network as the "app" service. This means that the app can access them at a localhost URL. The other approach is to use network_mode: bridge, the default, which would mean the Azurite service was only available at its service name, like "http://azurite:10000". Either approach works, as long as the connection string is set correctly.

Dockerfile

The Dockerfile defines the environment for the code editing experience. In this case, I am bringing in a devcontainer-optimized Python image. You could adapt it for other languages, like Java, .NET, JavaScript, Go, etc.

FROM mcr.microsoft.com/devcontainers/python:3.12

# Install the Python dependencies (requirements.txt must be available in the build context)
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt

Monday, November 25, 2024

Making a dev container with multiple data services

A dev container is a specification that describes how to open up a project in VS Code, GitHub Codespaces, or any other IDE supporting dev containers, in a consistent and repeatable manner. It builds on Docker and docker-compose, and also allows for IDE customizations like extensions and settings. These days, I always try to add a .devcontainer/ folder to my GitHub templates, so that developers can open them up quickly and get the full environment set up for them.

In the past, I've made dev containers to bring in PostgreSQL, pgvector, and Redis, but I'd never made a dev container that could bring in multiple data services at the same time. I finally made a multi-service dev container today, as part of a pull request to flask-admin, so I'm sharing my approach here.

devcontainer.json

The entry point for a dev container is devcontainer.json, which tells the IDE to use a particular Dockerfile, docker-compose, or public image. Here's what it looks like for the multi-service container:

{
  "name": "Multi-service dev container",
  "dockerComposeFile": "docker-compose.yaml",
  "service": "app",
  "workspaceFolder": "/workspace"
}

That dev container tells the IDE to build a container using docker-compose.yaml and to treat the "app" service as the main container for the editor to open.

docker-compose.yaml

The docker-compose.yaml file needs to describe first the "app" container that will be used for the IDE's editing environment, and then describe any additional services. Here's what one looks like for a Python app bringing in PostgreSQL, Azurite, and MongoDB:

version: '3'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        IMAGE: python:3.12

    volumes:
      - ..:/workspace:cached

    # Overrides default command so things don't shut down after the process ends.
    command: sleep infinity
    environment:
      AZURE_STORAGE_CONNECTION_STRING: DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
      POSTGRES_HOST: localhost
      POSTGRES_PASSWORD: postgres
      MONGODB_HOST: localhost

  postgres:
    image: postgis/postgis:16-3.4
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: flask_admin_test
    volumes:
      - postgres-data:/var/lib/postgresql/data
    network_mode: service:app

  azurite:
    container_name: azurite
    image: mcr.microsoft.com/azure-storage/azurite:latest
    restart: unless-stopped
    volumes:
      - azurite-data:/data
    network_mode: service:app

  mongo:
    image: mongo:5.0.14-focal
    restart: unless-stopped
    network_mode: service:app

volumes:
  postgres-data:
  azurite-data:

A few things to point out:

  • The "app" service is based on a local Dockerfile with a base Python image. It also sets environment variables for connecting to the subsequent services.
  • The "postgres" service is based off the official postgis image. The postgres or pgvector image would also work there. It specifies environment variables matching those used by the "app" service. It sets up a volume so that the data can persist inside the container.
  • The "azurite" service is based off the official azurite image, and also uses a volume for data persistance.
  • The "mongo service" is based off the official mongo image, and in this case, I did not set up a volume for it.
  • Each of the data services uses network_mode: service:app so that they are on the same network as the "app" service. This means that the app can access them at a localhost URL. The other approach is to use network_mode: bridge, the default, which would mean the services were only available at their service names, like "http://postgres:5432" or "http://azurite:10000". Either approach works, as long as your app code knows how to find the service ports.
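
Here's a rough Python connection check that can be run inside the "app" container (a sketch; it assumes the psycopg2, pymongo, and azure-storage-blob packages are installed):

import os

import psycopg2
import pymongo
from azure.storage.blob import BlobServiceClient

# PostgreSQL, using the environment variables from docker-compose.yaml
conn = psycopg2.connect(host=os.environ["POSTGRES_HOST"], user="postgres",
                        password=os.environ["POSTGRES_PASSWORD"], dbname="flask_admin_test")
print("PostgreSQL server version:", conn.server_version)

# MongoDB, on its default port
mongo = pymongo.MongoClient(os.environ["MONGODB_HOST"], 27017)
print("MongoDB version:", mongo.server_info()["version"])

# Azurite Blob storage emulator
blob_client = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
print("Blob containers:", [container.name for container in blob_client.list_containers()])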

Dockerfile

Any of the services can be defined with a Dockerfile, but the example above only uses a Dockerfile for the default "app" service, shown below:

ARG IMAGE=bullseye
FROM mcr.microsoft.com/devcontainers/${IMAGE}

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends postgresql-client \
     && apt-get clean -y && rm -rf /var/lib/apt/lists/*

That file brings in a devcontainer-optimized Python image, and then goes on to install the psql client for interaction with the PostgreSQL database. You can also install other tools here, plus install Python requirements. It just depends on what you want to be available in the environment, versus what commands you want developers to be running themselves.

Wednesday, November 20, 2024

My first PyBay: Playing improv with Python

A few months ago in September, I attended my very first PyBay: an annual conference in San Francisco bringing together Pythonistas from across the bay area. It was a 2-track single-day conference, with nearly 300 attendees, and talks ranging from 10 to 60 minutes.


My talk

I was very honored to present one of the first talks of the day, on a topic that's near and dear to my heart: improv! Back before I had kids, I spent many years taking improv classes and running an improv club with friends out of my home. I love that improv games force me to be in the moment, and I also just generally find spontaneous generation to be a source of much hilarity. 😜

I've always wanted an excuse to re-create my favorite improv games as computer programs, and now with language models (both small and large), it's actually quite doable! So my talk was about "Playing improv with Python", where I used local models (Llama 3.1 and Phi 3.5) to play increasingly complex games, and demonstrated different approaches along the way: prompt engineering, few-shot examples, function calling, and multimodal input. You can check out my slides and code examples. You're always welcome to re-use my slides or examples yourself! I spoke with several folks who want to use them as a way to teach language models.

To make the talk more interactive, I also asked the audience to play improv games, starting with an audience-wide game of "reverse charades", where attendees acted out a word displayed on the screen while a kind volunteer attempted to guess the word. I was very nervous about asking the audience for such a high level of interactivity, and thrilled when they joined in! Here's a shot from one part of the room:

Then, before each game, I asked for volunteers to come on stage to play it, before making the computer play the same game. Once again, the attendees eagerly jumped up, and it was so fun to get to play improv games with humans for the first time in years.

You can watch the whole talk on YouTube or embedded below. You may want to fast-forward through the beginning, since the recording couldn't capture the off-stage improv shenanigans.



Other talks

Since it was a two-track conference, I could only attend half of the talks, but I did manage to watch quite a few interesting ones. Highlights:

  • From Pandas to Polars: Upgrading Your Data Workflow
By Matthew Harrison, author of Pandas/Polars books. My takeaways: Polars looks more intuitive than Pandas in some ways, and Matt Harrison really encourages us to use chaining instead of intermediary variables. I liked how he presented in a Jupyter notebook and just used copious empty cells to present only one "slide" at a time.
  • The Five Demons of Python Packaging That Fuel Our Persistent Nightmare
By Peter Wang, Anaconda creator. Great points on packaging difficulties, including a slide reminding folks that Windows users exist and must have working packages! He also called out the tension with uv being VC-funded, and said that Python OSS creators should not have to take a vow of poverty. Peter also suggested a PEP for a way that packages could declare their interface versus their runtime. I asked him his thoughts on using extras, and he said yes, we should use extras more often.
  • F-Strings! (Slides)
By Mariatta Wijaya, CPython maintainer. Starts with the basics but then ramps up to the wild new 3.12 f-string features, which I had fun playing with afterwards.
  • Thinking of Topic Modeling as Search (Slides | Code)
    By Kas Stohr. Used embeddings for "Hot topics" in a social media app. Really interesting use case for vector embeddings, and how to combine with clustering algorithms.
  • Master Python typing with Python-Type-Challenges
    By Laike9m. Try it out! Fun way to practice type annotations.
  • PyTest, The Improv Way
    By Joshua Grant. A 10-minute talk where he asked the audience what he should test in the testing pyramid (unit/integration/browser). I quickly shouted "browser", so he proceeded to write a test using Playwright, my current favorite browser automation library. Together with the audience, he got the test passing! 🎉
  • Secret Snake: Using AI to Entice Technical and Non-Technical Employees to Python
    By Paul Karayan. A short talk about how a dev at a Fintech firm used ChatGPT as a "gateway drug" to get their colleagues eventually making PRs to GitHub repos with prompt changes and even writing Python. They even put together a curriculum with projects for their non-technical colleagues.
  • Accelerating ML Prototyping: The Pythonic Way
    By Parul Gupta. About Meta's approach to Jupyter notebooks, which involves custom VS Code integration and extensions.
  • Let's make a working implementation of async functions in Python 2.1; or, why to use newer Pythons
    By Christopher Neugebauer, PSF. He managed to implement async in Python 1.6, using bytecode patching and sys.settrace. His conclusion is that we should use the latest Python for async, of course. 🙂
  • Scrolling Animated ASCII Art in Python (Scrollart.org)
    By Al Sweigart, author of many Python books. Very fun ideas for classroom projects!

Next year?

PyBay was a fantastic conference! Kudos to the organizers for a job well done. I look forward to returning next year, and hopefully finding something equally fun to talk about.

Tuesday, November 5, 2024

Entity extraction using OpenAI structured outputs mode

The relatively new structured outputs mode from the OpenAI gpt-4o model makes it easy for us to define an object schema and get a response from the LLM that conforms to that schema.

Here's the most basic example from the Azure OpenAI tutorial about structured outputs:

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="MODEL_DEPLOYMENT_NAME",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

output = completion.choices[0].message.parsed

The code first defines the CalendarEvent class, a Pydantic model. Then it sends a request to the GPT model specifying a response_format of CalendarEvent. The parsed output will be a dictionary containing a name, date, and participants.

We can even go a step further and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method:

event = CalendarEvent.model_validate(output)

With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed.
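
The snippets in this post assume that an OpenAI client has already been created. With Azure OpenAI and key-based auth, that setup might look like this sketch (the environment variable names are assumptions; keyless auth with the azure-identity package works too):

import os

import openai

client = openai.AzureOpenAI(
    api_version="2024-08-01-preview",  # structured outputs requires a recent API version
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
)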

Extracting from GitHub READMEs

Let's see an example of a way that I actually used structured outputs, to help me summarize the submissions that we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask for it to extract key details like project title and technologies used.

First I define the Pydantic models:

from enum import Enum

from pydantic import BaseModel, Field


class Language(str, Enum):
    JAVASCRIPT = "JavaScript"
    PYTHON = "Python"
    DOTNET = ".NET"

class Framework(str, Enum):
    LANGCHAIN = "Langchain"
    SEMANTICKERNEL = "Semantic Kernel"
    LLAMAINDEX = "Llamaindex"
    AUTOGEN = "Autogen"
    SPRINGBOOT = "Spring Boot"
    PROMPTY = "Prompty"

class RepoOverview(BaseModel):
    name: str
    summary: str = Field(..., description="A 1-2 sentence description of the project")
    languages: list[Language]
    frameworks: list[Framework]

In the code above, I asked for lists of Python enum values, which constrains the model to return only options from those lists. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annotated the summary field using the Pydantic Field class so that I could specify the desired length of the description. Without that annotation, the descriptions are often much longer. We can use that description whenever we want to give additional guidance to the model about a field.

Next, I fetch the GitHub readme, storing it as a string:

url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md"
response = requests.get(url)
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")

Finally, I send off the request and convert the result into a RepoOverview instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {
            "role": "system",
            "content": "Extract info from the GitHub issue markdown about this hack submission.",
        },
        {"role": "user", "content": readme_content},
    ],
    response_format=RepoOverview,
)
output = completion.choices[0].message.parsed
repo_overview = RepoOverview.model_validate(output)

You can see the full code in extract_github_repo.py

Extracting from PDFs

I talk to many customers that want to extract details from PDFs, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the text from the PDF, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, as I wanted to try out their specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown.

First I load in a PDF of an order receipt and convert it to markdown:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("example_receipt.pdf")

Then I define the Pydantic models for a receipt:

class Item(BaseModel):
    product: str
    price: float
    quantity: int


class Receipt(BaseModel):
    total: float
    shipping: float
    payment_method: str
    items: list[Item]
    order_number: int

In this example, I'm using a nested Pydantic model Item for each item in the receipt, so that I can get detailed information about each item.

And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the blog post"},
        {"role": "user", "content": md_text},
    ],
    response_format=Receipt,
)
output = completion.choices[0].message.parsed
receipt = Receipt.model_validate(output)
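
Since the validated receipt is an ordinary Pydantic object, the nested items are easy to work with:

for item in receipt.items:
    print(f"{item.quantity} x {item.product}: ${item.price:.2f}")
print(f"Order #{receipt.order_number} total: ${receipt.total:.2f}")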

You can see the full code in extract_pdf_receipt.py

Extracting from images

Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means that we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool!

First I load in a local image as a base-64 encoded data URI:

import base64


def open_image_as_base64(filename):
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"


image_url = open_image_as_base64("example_graph_treecover.png")

For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models:

class Graph(BaseModel):
    title: str
    description: str = Field(..., description="1 sentence description of the graph")
    x_axis: str
    y_axis: str
    legend: list[str]

Then I send off the base-64 image URI to the GPT model, inside an "image_url" type message, and convert the response back to a Graph object:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"image_url": {"url": image_url}, "type": "image_url"},
            ],
        },
    ],
    response_format=Graph,
)
output = completion.choices[0].message.parsed
graph = Graph.model_validate(output)

More examples

You can use this same general approach for entity extraction across many file types, as long as they can be represented in either a text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.

Friday, September 27, 2024

My parenting strategy: earn enough $ to outsource

Two kids are a lot. I know, it’s really not a lot in comparison to the many kids that women have had to birth and care for over the history of humanity. But still, it feels like a lot to me. My partner and I both have full-time jobs that are fortunately remote-friendly, but we’re both tired by the time the kids are home, and we need to keep them fed and occupied until bedtime.

We have a 2 year old and a 5 year old, and they spend 2% of their time playing together and the other 98% fighting over who gets to play with mommy. And of course, mommy is thinking of all the other stuff that needs to get done: laundry, dishes, dinner, cleaning, and wouldn’t it be nice if I could have a few minutes to shower?

But alas, where is the time for all that? How are we supposed to get all the chores done, take care of two little kids, and have some time for the “self-care” I’ve heard so much about? There isn’t enough time!

Plus, my kids are also night owls, staying up until 10ish each night and often falling asleep on me, so I don’t have the magical “time after kids went to sleep” that I’ve heard so much about.

Enough with the ranting though.

Fortunately, I recently switched jobs from UC Berkeley lecturer ($100K, no bonuses) to Microsoft developer advocate ($220K plus bonuses), so I’ve decided to shamelessly pay my way to less stress. More money, less problems!

Here’s what I spend my funds on:

  • Meal delivery services. Currently: Plantedtable (vegan meals) and OutTheCaveFood (Paleo meals, lol). They both deliver fully ready meals in plastic-free packaging from their local kitchens. My kids have mixed feelings about the meals, but they have mixed feelings about any non-pizza foods.
  • Grocery delivery. We use a combination of Safeway (via DoorDash) and GoodEggs, depending on what items we’re missing. I prefer GoodEggs since they work with local companies, but they lack some kid essentials, like massive blocks of cheddar cheese.
  • Weekly house cleaners. I tip them extra for also folding our clean laundry, which tends to sit on the bed for days at a time. They come Fridays, so that we can start the weekends on a clean foot! (Yes, the house is a disaster by Monday.)
  • Nanny overtime. Our amazing nanny will often take the 2 year old on Saturdays, so I can spend solo time with my 5 year old, and sometimes keeps her late during the week if I have an event to attend in the city. She also cares for the 5 year old if she has a day off school.
  • Evening babysitter. In addition, a local babysitter comes once a week to play with the 5 year old, which gives me a break from referee-ing them, and also gives my partner the opportunity to keep his weekly D&D night.
  • Handymen. I used to fancy myself as a DIYer that could do home improvement projects, but I just can’t focus on them enough now to do a good job. So I pay these two local handymen to do tiny jobs (hang a curtain rod!) as well as large jobs (toddler-safe, to-code stair railings). Professionals just do it better.
  • Gardening. This is the one thing that I actually still do a lot of myself, especially planting new natives, but when I need help removing an influx of invasive weeds or pruning trees, I call a local gardener. He’s so local that folks often stop to talk with him when he’s working outside. :)

As you can see, I try to “shop local” when I can, but if I need to go to Amazon to buy a massive tub of freeze-dried strawberries to appease a picky two year old, I’m okay with that.

The point of this post is *not* to gloat about my privilege in being able to pay for all this. And yes, I have privilege up the wazoo.

The point of this post is to empower other parents, especially mothers, to feel totally okay to outsource parts of parenting and household management to others. It helps if you have some financial independence from your partner, so that you have the option to pay for outsourcing a task even if they disagree. Freedom!

Many parents do not have a high enough income for this approach, and that is why I currently would vote for policies like universal basic income, government-sponsored health insurance, universal preschool, etc. Parents need a break, wherever it comes from.

Sunday, September 8, 2024

Integrating vision into RAG applications

 Retrieval Augmented Generation (RAG) is a popular technique to get LLMs to provide answers that are grounded in a data source. What do you do when your knowledge base includes images, like graphs or photos? By adding multimodal models into your RAG flow, you can get answers based off image sources, too!  

Our most popular RAG solution accelerator, azure-search-openai-demo, now has support for RAG on image sources. In the example question below, the LLM answers the question by correctly interpreting a bar graph:  

This blog post will walk through the changes we made to enable multimodal RAG, both so that developers using the solution accelerator can understand how it works, and so that developers using other RAG solutions can bring in multimodal support. 

First let's talk about two essential ingredients: multimodal LLMs and multimodal embedding models. 


Multimodal LLMs 

Azure now offers multiple multimodal LLMs: gpt-4o and gpt-4o-mini, through the Azure OpenAI service, and phi3-vision, through the Azure AI Model Catalog. These models allow you to send in both images and text, and return text responses. (In the future, we may have LLMs that take audio input and return non-text outputs!)

For example, an API call to the gpt-4o model can contain a question along with an image URL: 

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What’s in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

Those image URLs can be specified as full HTTP URLs, if the image happens to be available on the public web, or they can be specified as base-64 encoded Data URIs, which is particularly helpful for privately stored images. 
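
From Python, the same message shape works with the openai package. Here's a rough sketch of a call to a gpt-4o deployment (the client construction, the deployment name, and the image_data_uri variable holding either an HTTP URL or a base-64 data URI are assumptions):

import os

completion = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_GPT4O_DEPLOYMENT"],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": image_data_uri}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)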

For more examples working with gpt-4o, check out openai-chat-vision-quickstart, a repo which can deploy a simple Chat+Vision app to Azure, plus includes Jupyter notebooks showcasing scenarios. 

 

Multimodal embedding models 

Azure also offers a multimodal embedding API, as part of the Azure AI Vision APIs, that can compute embeddings in a multimodal space for both text and images. The API uses the state-of-the-art Florence model from Microsoft Research. 

For example, this API call returns the embedding vector for an image: 

curl.exe -v -X POST "https://<endpoint>/computervision/retrieval:vectorizeImage?api-version=2024-02-01-preview&model-version=2023-04-15" -H "Ocp-Apim-Subscription-Key: <subscription_key>" -H "Content-Type: application/json" --data-ascii " { 'url':'https://learn.microsoft.com/azure/ai-services/computer-vision/media/quickstarts/presentation.png' }"

Once we have the ability to embed both images and text in the same embedding space, we can use vector search to find images that are similar to a user's query. Check out this notebook that sets up a basic multimodal search of images using Azure AI Search.
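
For reference, the same image vectorization call from Python with the requests package might look like this sketch (the endpoint and key environment variable names are assumptions):

import os

import requests

response = requests.post(
    f"{os.environ['AZURE_AI_VISION_ENDPOINT']}/computervision/retrieval:vectorizeImage",
    params={"api-version": "2024-02-01-preview", "model-version": "2023-04-15"},
    headers={"Ocp-Apim-Subscription-Key": os.environ["AZURE_AI_VISION_KEY"]},
    json={"url": "https://learn.microsoft.com/azure/ai-services/computer-vision/media/quickstarts/presentation.png"},
)
image_vector = response.json()["vector"]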
 

Multimodal RAG 

With those two multimodal models, we were able to give our RAG solution the ability to include image sources in both the retrieval and answering process. 

At a high-level, we made the following changes: 

  • Search index: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings). 
  • Data ingestion: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image with the filename rendered on top, and add the embedding to the index. 
  • Question answering: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources. 
  • Citations: The frontend displays both image sources and text sources, to help users understand how the answer was generated. 

Let's dive deeper into each of the changes above. 


Search index 

For our standard RAG on documents approach, we use an Azure AI search index that stores the following fields: 

  • content: The extracted text content from Azure Document Intelligence, which can process a wide range of files and can even OCR images inside files. 
  • sourcefile: The filename of the document 
  • sourcepage: The filename with page number, for more precise citations. 
  • embedding: A vector field with 1536 dimensions, to store the embedding of the content field, computed using the text-only OpenAI ada-002 model.

For RAG on images, we add an additional field: 

  • imageEmbedding: A vector field with 1024 dimensions, to store the embedding of the image version of the document page, computed using the AI Vision vectorizeImage API endpoint. 
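
With the azure-search-documents Python SDK, defining that field might look roughly like this (a sketch; the vector search profile name is an assumption and must match a profile defined elsewhere in the index):

from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

image_embedding_field = SearchField(
    name="imageEmbedding",
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    searchable=True,
    vector_search_dimensions=1024,
    vector_search_profile_name="imageEmbeddingProfile",
)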


Data ingestion 

For our standard RAG approach, data ingestion involves these steps: 

  1. Use Azure Document Intelligence to extract text out of a document 
  2. Use a splitting strategy to chunk the text into sections. This is necessary in order to keep chunk sizes at a reasonable size, as sending too much content to an LLM at once tends to reduce answer quality. 
  3. Upload the original file to Azure Blob storage. 
  4. Compute ada-002 embeddings for the content field. 
  5. Add each chunk to the Azure AI search index. 

For RAG on images,  we add two additional steps before indexing: uploading an image version of each document page to Blob Storage and computing multi-modal embeddings for each image. 


Generating citable images 

The images are not just a direct copy of the document page. Instead, they contain the original document filename written in the top left corner of the image, like so: 

 

This crucial step will enable the GPT vision model to later provide citations in its answers. From a technical perspective, we achieved this by first using the PyMuPDF Python package to convert documents to images, then using the Pillow Python package to add a top border to the image and write the filename there.
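
Here's a rough sketch of that conversion for a single page, using PyMuPDF and Pillow (the filename, border height, and output path are placeholders; the real ingestion code loops over every page of every document):

import io

import pymupdf  # the PyMuPDF package (older versions are imported as fitz)
from PIL import Image, ImageDraw

doc = pymupdf.open("employee_handbook.pdf")
pixmap = doc[0].get_pixmap()  # render the first page as an image
page_image = Image.open(io.BytesIO(pixmap.tobytes("png")))

# Add a white strip at the top and write the filename at coordinates (10, 10)
bordered = Image.new("RGB", (page_image.width, page_image.height + 40), "white")
bordered.paste(page_image, (0, 40))
ImageDraw.Draw(bordered).text((10, 10), "SourceFileName:employee_handbook.pdf", fill="black")
bordered.save("employee_handbook-page-0.png")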


Question answering 

Now that our Blob storage container has citable images and our AI search index has multi-modal embeddings, users can start to ask questions about images. 

Our RAG app has two primary question asking flows, one for "single-turn" questions, and the other for "multi-turn" questions, which incorporates as much conversation history as can fit in the context window. To simplify this explanation, we'll focus on the single-turn flow.

Our single-turn RAG on documents flow looks like: 

 

1. Receive a user question from the frontend. 

2. Compute an embedding for the user question using the OpenAI ada-002 model. 

3. Use the user question to fetch matching documents from the Azure AI search index, using a hybrid search that does a keyword search on the text and a vector search on the question embedding. 

4. Pass the resulting document chunks and the original user question to the gpt-3.5 model, with a system prompt that instructs it to adhere to the sources and provide citations with a certain format. 

Our single-turn RAG on documents-plus-images flow looks like this:

 

1. Receive a user question from the frontend. 

2. Compute an embedding for the user question using the OpenAI ada-002 model AND an additional embedding using the AI Vision API multimodal model.

3. Use the user question to fetch matching documents from the Azure AI search index, using a hybrid multivector search that also searches on the imageEmbedding field using the additional embedding. This way, the underlying vector search algorithm will find results that are both similar semantically to the text of the document but also similar semantically to any images in the document (e.g. "what trends are increasing?" could match a chart with a line going up and to the right). 

4. For each document chunk returned in the search results, convert the Blob image URL into a base64 data-encoded URI. Pass both the text content and the image URIs to GPT-4-vision, with this prompt that describes how to find and format citations: 

The documents contain text, graphs, tables and images. 

Each image source has the file name in the top left corner of the image with coordinates (10,10) pixels and is in the format SourceFileName:<file_name> 

Each text source starts in a new line and has the file name followed by colon and the actual information. Always include the source name from the image or text for each fact you use in the response in the format: [filename]  

Answer the following question using only the data provided in the sources below. 

The text and image source can be the same file name, don't use the image title when citing the image source, only use the file name as mentioned. 
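
For the retrieval in step 3, the hybrid multivector query with the azure-search-documents SDK looks roughly like this sketch (the field names match the index described earlier; the question_embedding and question_image_embedding variables come from step 2, and the k values are assumptions):

from azure.search.documents.models import VectorizedQuery

results = search_client.search(
    search_text=question,
    vector_queries=[
        VectorizedQuery(vector=question_embedding, k_nearest_neighbors=50, fields="embedding"),
        VectorizedQuery(vector=question_image_embedding, k_nearest_neighbors=50, fields="imageEmbedding"),
    ],
    top=3,
)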

Now, users can ask questions where the answers are entirely contained in the images and get correct answers! This can be a great fit for diagram-heavy domains, like finance. 

 

Considerations 

We have seen some really exciting uses of this multimodal RAG approach, but there is much to explore to improve the experience. 

More file types: Our repository only implements image generation for PDFs, but developers are now ingesting many more formats, both image files like PNG and JPEG as well as non-image files like HTML, docx, etc. We'd love help from the community in bringing support for multimodal RAG to more file formats. 

More selective embeddings: Our ingestion flow uploads images for *every* PDF page, but many pages may be lacking in visual content, and that can negatively affect vector search results. For example, if your PDF contains completely blank pages, and the index stored the embeddings for those, we have found that vector searches often retrieve those blank pages. Perhaps in the multimodal space, "blankness" is considered similar to everything. We've considered approaches like using a vision model in the ingestion phase to decide whether an image is meaningful, or using that model to write a very descriptive caption for images instead of storing the image embeddings themselves. 

Image extraction: Another approach would be to extract images from document pages, and store each image separately. That would be helpful for documents where the pages contain multiple distinct images with different purposes, since then the LLM would be able to focus more on only the most relevant image. 

We would love your help in experimenting with RAG on images, sharing how it works for your domain, and suggesting what we can improve. Head over to our repo and follow the steps for deploying with the optional GPT vision feature enabled, and let us know how it goes!