
Two kids are a lot. I know, it’s really not a lot compared to the many kids that women have had to birth and care for over the history of humanity. But still, it feels like a lot to me. My partner and I both have full-time jobs that are fortunately remote-friendly, but we’re both tired by the time the kids are home, and we need to keep them fed and occupied until bedtime.

We have a 2 year old and a 5 year old, and they spend 2% of their time playing together and the other 98% fighting over who gets to play with mommy. And of course, mommy is thinking of all the other stuff that needs to get done: laundry, dishes, dinner, cleaning, and wouldn’t it be nice if I could have a few minutes to shower?

But alas, where is the time for all that? How are we supposed to get all the chores done, take care of two little kids, and have some time for the “self-care” I’ve heard so much about? There isn’t enough time!

Plus, my kids are also night owls, staying up until 10ish each night and often falling asleep on me, so I don’t have the magical “time after kids went to sleep” that I’ve heard so much about.

Enough with the ranting though.

Fortunately, I recently switched jobs from UC Berkeley lecturer (100K, no bonuses) to Microsoft developer advocate (220K plus bonuses), so I’ve decided to shamelessly pay my way to less stress. More money, less problems!

Here’s what I spend my funds on:

Meal delivery services. Currently: Plantedtable (vegan meals) and OutTheCaveFood (Paleo meals, lol). They both deliver fully ready meals in plastic-free packaging from their local kitchens. My kids have mixed feelings about the meals, but they have mixed feelings about any non-pizza foods.
Grocery delivery. We use a combination of Safeway (via DoorDash) and GoodEggs, depending on what items we’re missing. I prefer GoodEggs since they work with local companies, but they lack some kid essentials, like massive blocks of cheddar cheese.
Weekly house cleaners. I tip them extra for also folding our clean laundry, which tends to sit on the bed for days at a time. They come Fridays, so that we can start the weekends on a clean foot! (Yes, the house is a disaster by Monday.)
Nanny overtime. Our amazing nanny will often take the 2 year old on Saturdays, so I can spend solo time with my 5 year old, and sometimes keeps her late during the week if I have an event to attend in the city. She also cares for the 5 year old if she has a day off school.
Evening babysitter. In addition, a local babysitter comes once a week to play with the 5 year old, which gives me a break from referee-ing them, and also gives my partner the opportunity to keep his weekly D&D night.
Handymen. I used to fancy myself a DIYer who could do home improvement projects, but I just can’t focus on them enough now to do a good job. So I pay two local handymen to do tiny jobs (hang a curtain rod!) as well as large jobs (toddler-safe to-code stair railings). Professionals just do it better.
Gardening. This is the one thing that I actually still do a lot of myself, especially planting new natives, but when I need help removing an influx of invasive weeds or pruning trees, I call a local gardener. He’s so local that folks often stop to talk with him when he’s working outside. :)

As you can see, I try to “shop local” when I can, but if I need to go to Amazon to buy a massive tub of freeze-dried strawberries to appease a picky two year old, I’m okay with that.

The point of this post is *not* to gloat about my privilege in being able to pay for all this. And yes, I have privilege up the wazoo.

The point of this post is to empower other parents, especially mothers, to feel totally okay to outsource parts of parenting and household management to others. It helps if you have some financial independence from your partner, so that you have the option to pay for outsourcing a task even if they disagree. Freedom!

Many parents do not have a high enough income for this approach, and that is why I currently would vote for policies like universal basic income, government-sponsored health insurance, universal preschool, etc. Parents need a break, wherever it comes from.

Retrieval Augmented Generation (RAG) is a popular technique to get LLMs to provide answers that are grounded in a data source. What do you do when your knowledge base includes images, like graphs or photos? By adding multimodal models into your RAG flow, you can get answers based on image sources, too!

Our most popular RAG solution accelerator, azure-search-openai-demo, now has support for RAG on image sources. In the example question below, the LLM answers the question by correctly interpreting a bar graph:

This blog post will walk through the changes we made to enable multimodal RAG, both so that developers using the solution accelerator can understand how it works, and so that developers using other RAG solutions can bring in multimodal support.

First let's talk about two essential ingredients: multimodal LLMs and multimodal embedding models.

Multimodal LLMs

Azure now offers multiple multimodal LLMs: gpt-4o and gpt-4o-mini, through the Azure OpenAI service, and phi3-vision, through the Azure AI Model Catalog. These models allow you to send in both images and text, and return text responses. (In the future, we may have LLMs that take audio input and return non-text outputs!)

For example, an API call to the gpt-4o model can contain a question along with an image URL:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What’s in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}

Those image URLs can be specified as full HTTP URLs, if the image happens to be available on the public web, or they can be specified as base-64 encoded Data URIs, which is particularly helpful for privately stored images.
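
As a minimal sketch of the Data URI option, here is one way to turn image bytes into a base64-encoded Data URI using only the Python standard library. The function name, the filename, and the placeholder bytes are made up for illustration; in a real app the bytes would come from wherever your private images are stored.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 Data URI."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Hypothetical bytes standing in for a real PNG read from private storage.
fake_png = b"\x89PNG\r\n\x1a\n"

# The Data URI goes in the same "url" slot that an HTTP URL would occupy.
image_part = {
    "type": "image_url",
    "image_url": {"url": to_data_uri(fake_png, "chart.png")},
}
```

The resulting dictionary can sit in the "content" list of a user message, in the same position as the HTTP URL version shown above.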

For more examples working with gpt-4o, check out openai-chat-vision-quickstart, a repo that deploys a simple Chat+Vision app to Azure and includes Jupyter notebooks showcasing scenarios.

Multimodal embedding models

Azure also offers a multimodal embedding API, as part of the Azure AI Vision APIs, that can compute embeddings in a multimodal space for both text and images. The API uses the state-of-the-art Florence model from Microsoft Research.

For example, this API call returns the embedding vector for an image:

curl.exe -v -X POST "https://<endpoint>/computervision/retrieval:vectorizeImage?api-version=2024-02-01-preview&model-version=2023-04-15" -H "Content-Type: application/json" -H "Ocp-Apim-Subscription-Key: <subscription-key>" --data-ascii " { 'url':'https://learn.microsoft.com/azure/ai-services/computer-vision/media/quickstarts/presentation.png' }"

Once we have the ability to embed both images and text in the same embedding space, we can use vector search to find images that are similar to a user's query. Check out this notebook that sets up a basic multimodal search of images using Azure AI Search.
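
To make the idea of "similar in the embedding space" concrete, here is a toy sketch that ranks images by cosine similarity to a query vector. It is a brute-force illustration, not how Azure AI Search works internally, and the 3-dimensional vectors and filenames are invented; real embeddings from the AI Vision API are much longer.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vector: list[float], image_vectors: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (filename, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in image_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors: pretend the first dimension means "chart-like".
image_vectors = {
    "beach_photo.png": [0.0, 0.2, 0.9],
    "bar_chart.png": [0.9, 0.1, 0.0],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this embeds the text "sales graph"

ranking = rank_images(query_vector, image_vectors)
```

Because the text query and the images live in the same space, the same similarity function works for text-to-image search as for image-to-image search.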

Multimodal RAG

With those two multimodal models, we were able to give our RAG solution the ability to include image sources in both the retrieval and answering process.

At a high level, we made the following changes:

Search index: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings).
Data ingestion: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image with the filename rendered on top, and add the embedding to the index.
Question answering: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources.
Citations: The frontend displays both image sources and text sources, to help users understand how the answer was generated.

Let's dive deeper into each of the changes above.

Search index

For our standard RAG on documents approach, we use an Azure AI Search index that stores the following fields:

content: The extracted text content from Azure Document Intelligence, which can process a wide range of files and can even OCR images inside files.
sourcefile: The filename of the document
sourcepage: The filename with page number, for more precise citations.
embedding: A vector field with 1536 dimensions, to store the embedding of the content field, computed using the text-only OpenAI ada-002 model.

For RAG on images, we add an additional field:

imageEmbedding: A vector field with 1024 dimensions, to store the embedding of the image version of the document page, computed using the AI Vision vectorizeImage API endpoint.
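
For readers who think in schemas, here is a rough sketch of those five fields as Python dictionaries, loosely modeled on the field definitions in the Azure AI Search REST API. The property names reflect my understanding of that API, the profile name is hypothetical, and a real index also needs a key field that isn't listed above, so treat this as an illustration rather than the repo's actual index definition.

```python
# Hypothetical profile name; a real index defines it in its vector search config.
VECTOR_PROFILE = "embedding_config"

index_fields = [
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "sourcefile", "type": "Edm.String", "filterable": True},
    {"name": "sourcepage", "type": "Edm.String", "filterable": True},
    {
        # Text embedding of the content field (OpenAI ada-002).
        "name": "embedding",
        "type": "Collection(Edm.Single)",
        "searchable": True,
        "dimensions": 1536,
        "vectorSearchProfile": VECTOR_PROFILE,
    },
    {
        # Multimodal embedding of the page image (AI Vision vectorizeImage).
        "name": "imageEmbedding",
        "type": "Collection(Edm.Single)",
        "searchable": True,
        "dimensions": 1024,
        "vectorSearchProfile": VECTOR_PROFILE,
    },
]


def vector_dimensions(fields: list[dict]) -> dict[str, int]:
    """Map each vector field name to its expected embedding length."""
    return {f["name"]: f["dimensions"] for f in fields if "dimensions" in f}


expected = vector_dimensions(index_fields)
```

A lookup like `expected` is handy during ingestion for checking that each embedding has the length its field expects before uploading a document.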

Data ingestion

For our standard RAG approach, data ingestion involves these steps:

Use Azure Document Intelligence to extract text out of a document.
Use a splitting strategy to chunk the text into sections. This is necessary to keep chunks at a reasonable size, as sending too much content to an LLM at once tends to reduce answer quality.
Upload the original file to Azure Blob storage.
Compute ada-002 embeddings for the content field.
Add each chunk to the Azure AI Search index.
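
The splitting step is the easiest one to picture in code. Below is a deliberately simple sketch that packs whole sentences into chunks up to a character limit. It is not the splitter used in the repo; the function name and the limit are assumptions, and a production splitter would typically also handle overlap between chunks, tables, and very long sentences.

```python
import re


def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars becomes its own oversized chunk.
    return chunks


chunks = split_into_chunks("One. Two. Three.", max_chars=10)
```

Splitting on sentence boundaries keeps each chunk readable on its own, which matters because the chunk text is what the LLM eventually sees as a source.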

For RAG on images, we add two additional steps before indexing: uploading an image version of each document page to Blob Storage and computing multimodal embeddings for each image.

Generating citable images

The images are not just a direct copy of the document page. Instead, they contain the original document filename written in the top left corner of the image, like so:

This crucial step will enable the GPT vision model to later provide citations in its answers. From a technical perspective, we achieved this by first using the PyMuPDF Python package to convert documents to images, then using the Pillow Python package to add a top border to the image and write the filename there.
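
Here is a minimal sketch of the Pillow half of that process: adding a top border to a page image and writing the filename into it. It assumes the page has already been rendered to a Pillow image (the PyMuPDF conversion is not shown), and the function name, border height, and colors are illustrative choices rather than the repo's exact values.

```python
from PIL import Image, ImageDraw


def add_filename_banner(page: Image.Image, filename: str, banner_height: int = 30) -> Image.Image:
    """Return a new image with a white top border containing the filename."""
    # Create a canvas that is taller than the page by the banner height.
    stamped = Image.new("RGB", (page.width, page.height + banner_height), "white")
    # Paste the original page below the banner.
    stamped.paste(page.convert("RGB"), (0, banner_height))
    # Write the filename near the top left corner, using Pillow's default font.
    draw = ImageDraw.Draw(stamped)
    draw.text((10, 10), f"SourceFileName:{filename}", fill="black")
    return stamped


# A plain gray rectangle stands in for a rendered PDF page.
page = Image.new("RGB", (400, 300), (128, 128, 128))
stamped = add_filename_banner(page, "report.pdf")
```

Putting the text in an added border, rather than drawing over the page, means nothing in the original page gets covered up.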

Question answering

Now that our Blob storage container has citable images and our AI Search index has multimodal embeddings, users can start to ask questions about images.

Our RAG app has two primary question asking flows, one for "single-turn" questions, and the other for "multi-turn" questions, which incorporates as much conversation history as can fit in the context window. To simplify this explanation, we'll focus on the single-turn flow.

Our single-turn RAG on documents flow looks like:

1. Receive a user question from the frontend.

2. Compute an embedding for the user question using the OpenAI ada-002 model.

3. Use the user question to fetch matching documents from the Azure AI search index, using a hybrid search that does a keyword search on the text and a vector search on the question embedding.

4. Pass the resulting document chunks and the original user question to the gpt-3.5 model, with a system prompt that instructs it to adhere to the sources and provide citations with a certain format.
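
The hybrid search in step 3 has to merge a keyword ranking and a vector ranking into one list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its way of doing that, so here is a small sketch of the RRF idea. The document IDs are invented, and k=60 is a commonly used constant that I'm assuming here rather than a value taken from our app.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


keyword_results = ["a", "b", "c"]  # hypothetical keyword search ranking
vector_results = ["b", "d", "a"]   # hypothetical vector search ranking

fused = reciprocal_rank_fusion([keyword_results, vector_results])
```

Documents that show up in both lists rise to the top, which is why a chunk that matches on both keywords and meaning tends to win. The same fusion extends naturally to a third ranking, such as one from an image embedding field.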

Our single-turn RAG on documents-plus-images flow looks like this:

1. Receive a user question from the frontend.

2. Compute an embedding for the user question using the OpenAI ada-002 model AND an additional embedding using the Azure AI Vision API multimodal model.

3. Use the user question to fetch matching documents from the Azure AI Search index, using a hybrid multivector search that also searches on the imageEmbedding field using the additional embedding. This way, the underlying vector search algorithm will find results that are semantically similar to the text of the document and also semantically similar to any images in the document (e.g. "what trends are increasing?" could match a chart with a line going up and to the right).

4. For each document chunk returned in the search results, convert the Blob image URL into a base64-encoded data URI. Pass both the text content and the image URIs to GPT-4-vision, with this prompt that describes how to find and format citations:

The documents contain text, graphs, tables and images.

Each image source has the file name in the top left corner of the image with coordinates (10,10) pixels and is in the format SourceFileName:<file_name>

Each text source starts in a new line and has the file name followed by colon and the actual information. Always include the source name from the image or text for each fact you use in the response in the format: [filename]

Answer the following question using only the data provided in the sources below.

The text and image source can be the same file name, don't use the image title when citing the image source, only use the file name as mentioned.
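
To show how step 4 might come together, here is a sketch that assembles the user message content from a question, some text sources, and some image data URIs. The function name, the ordering of the parts, and the example source name are my own assumptions for illustration; the content part shapes follow the gpt-4o example earlier in this post.

```python
def build_user_content(
    question: str,
    text_sources: dict[str, str],
    image_data_uris: list[str],
) -> list[dict]:
    """Build the "content" list for a user message with text and image sources."""
    content: list[dict] = [{"type": "text", "text": question}]
    # Each image is passed as its own part; the filename is visible inside the image.
    for uri in image_data_uris:
        content.append({"type": "image_url", "image_url": {"url": uri}})
    # Each text source goes on its own line as "filename: information".
    if text_sources:
        lines = [f"{name}: {text}" for name, text in text_sources.items()]
        content.append({"type": "text", "text": "Sources:\n" + "\n".join(lines)})
    return content


content = build_user_content(
    "What trends are increasing?",
    {"report.pdf#page=2": "Revenue grew 10%."},  # hypothetical source name and text
    ["data:image/png;base64,AAAA"],              # placeholder data URI
)
message = {"role": "user", "content": content}
```

The prompt shown above would go in a separate system message, so the model knows where to look for filenames in both kinds of sources.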

Now, users can ask questions where the answers are entirely contained in the images and get correct answers! This can be a great fit for diagram-heavy domains, like finance.

Considerations

We have seen some really exciting uses of this multimodal RAG approach, but there is much to explore to improve the experience.

More file types: Our repository only implements image generation for PDFs, but developers are now ingesting many more formats, both image files like PNG and JPEG as well as non-image files like HTML, docx, etc. We'd love help from the community in bringing support for multimodal RAG to more file formats.

More selective embeddings: Our ingestion flow uploads images for *every* PDF page, but many pages may be lacking in visual content, and that can negatively affect vector search results. For example, if your PDF contains completely blank pages, and the index stored the embeddings for those, we have found that vector searches often retrieve those blank pages. Perhaps in the multimodal space, "blankness" is considered similar to everything. We've considered approaches like using a vision model in the ingestion phase to decide whether an image is meaningful, or using that model to write a very descriptive caption for images instead of storing the image embeddings themselves.

Image extraction: Another approach would be to extract images from document pages, and store each image separately. That would be helpful for documents where the pages contain multiple distinct images with different purposes, since then the LLM would be able to focus on only the most relevant image.

We would love your help in experimenting with RAG on images, sharing how it works for your domain, and suggesting what we can improve. Head over to our repo and follow the steps for deploying with the optional GPT vision feature enabled, and let us know how it goes!

pamela fox's blog

Friday, September 27, 2024

My parenting strategy: earn enough $ to outsource

Sunday, September 8, 2024

Integrating vision into RAG applications