Tuesday, November 5, 2024

Entity extraction using OpenAI structured outputs mode

The relatively new structured outputs mode from the OpenAI gpt-4o model makes it easy for us to define an object schema and get a response from the LLM that conforms to that schema.
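The snippets in this post assume you've already created an openai client pointed at a deployment that supports structured outputs, and that the Pydantic imports are in place. A minimal Azure OpenAI setup might look something like this, where the environment variable names and API version are placeholders for whatever your own deployment uses:

import os

from openai import AzureOpenAI
from pydantic import BaseModel

# Placeholder environment variable names; swap in your own configuration.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",  # any API version that supports structured outputs
)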

Here's the most basic example from the Azure OpenAI tutorial about structured outputs:

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="MODEL_DEPLOYMENT_NAME",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

output = completion.choices[0].message.parsed

The code first defines the CalendarEvent class, a Pydantic model. Then it sends a request to the GPT model, specifying a response_format of CalendarEvent. The parsed output will contain a name, date, and participants.

We can even go a step further and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method:

event = CalendarEvent.model_validate(output)
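From there, the extracted fields are just typed attributes on the event. A quick illustrative check (not part of the tutorial code) could be:

# Illustrative check of the extracted fields
print(event.name)          # e.g. "Science Fair"
print(event.date)          # e.g. "Friday"
print(event.participants)  # e.g. ["Alice", "Bob"]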

With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed.

Extracting from GitHub READMEs

Let's see an example of a way that I actually used structured outputs: to help me summarize the submissions that we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask it to extract key details like the project title and technologies used.

First I define the Pydantic models:

from enum import Enum

from pydantic import BaseModel, Field


class Language(str, Enum):
    JAVASCRIPT = "JavaScript"
    PYTHON = "Python"
    DOTNET = ".NET"

class Framework(str, Enum):
    LANGCHAIN = "Langchain"
    SEMANTICKERNEL = "Semantic Kernel"
    LLAMAINDEX = "Llamaindex"
    AUTOGEN = "Autogen"
    SPRINGBOOT = "Spring Boot"
    PROMPTY = "Prompty"

class RepoOverview(BaseModel):
    name: str
    summary: str = Field(..., description="A 1-2 sentence description of the project")
    languages: list[Language]
    frameworks: list[Framework]

In the code above, I asked for lists of Python enum values, which constrains the model to return only options from those lists. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annotated the summary field using the Pydantic Field class so that I could specify how long the description should be. Without that annotation, the descriptions are often much longer. We can use a Field description whenever we want to give additional guidance to the model about a field.
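If you're curious how the enums and Field description actually reach the model, you can print the JSON schema that Pydantic generates for the class, which is essentially what the openai library sends along as the response format. This is just a quick check, not part of the extraction script:

import json

# Print the JSON schema generated for RepoOverview; the enum values and the
# Field description show up here, which is what guides the model's output.
print(json.dumps(RepoOverview.model_json_schema(), indent=2))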

Next, I fetch the GitHub readme, storing it as a string:

url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md"
response = requests.get(url)
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")
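The GitHub contents API returns the file content base64-encoded, which is why the decode step is needed. If you adapt this for many repositories, you may also want to fail fast on HTTP errors and send a token to avoid anonymous rate limits. Here's a small optional sketch of that, where GITHUB_TOKEN is just a placeholder variable name:

import os

# Optional hardening (sketch): authenticate if a token is available, and raise on HTTP errors.
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"} if os.getenv("GITHUB_TOKEN") else {}
response = requests.get(url, headers=headers)
response.raise_for_status()
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")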

Finally, I send off the request and convert the result into a RepoOverview instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {
            "role": "system",
            "content": "Extract info from the GitHub issue markdown about this hack submission.",
        },
        {"role": "user", "content": readme_content},
    ],
    response_format=RepoOverview,
)
output = completion.choices[0].message.parsed
repo_overview = RepoOverview.model_validate(output)

You can see the full code in extract_github_repo.py.
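Since the point was to summarize all of the hackathon submissions, you can run the same extraction in a loop over many repositories. Here's a rough sketch of that idea, where summarize_repo, fetch_readme, and submission_repos are hypothetical names rather than code from the actual script:

# Hypothetical sketch: reuse the extraction logic across many submissions.
def summarize_repo(readme_content: str) -> RepoOverview:
    completion = client.beta.chat.completions.parse(
        model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
        messages=[
            {"role": "system", "content": "Extract info from the GitHub README markdown about this hack submission."},
            {"role": "user", "content": readme_content},
        ],
        response_format=RepoOverview,
    )
    return RepoOverview.model_validate(completion.choices[0].message.parsed)


overviews = [summarize_repo(fetch_readme(repo)) for repo in submission_repos]  # placeholder helpers
for overview in overviews:
    print(overview.name, [language.value for language in overview.languages])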

Extracting from PDFs

I talk to many customers who want to extract details from PDFs, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the text from the PDF, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, since I wanted to try out its specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown.

First I load in a PDF of an order receipt and convert it to markdown:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("example_receipt.pdf")

Then I define the Pydantic models for a receipt:

class Item(BaseModel):
    product: str
    price: float
    quantity: int


class Receipt(BaseModel):
    total: float
    shipping: float
    payment_method: str
    items: list[Item]
    order_number: int

In this example, I'm using a nested Pydantic model, Item, so that I can get detailed information about each line item in the receipt.

And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the blog post"},
        {"role": "user", "content": md_text},
    ],
    response_format=Receipt,
)
output = completion.choices[0].message.parsed
receipt = Receipt.model_validate(output)

You can see the full code in extract_pdf_receipt.py.
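Because the items come back as a nested list of Item models, it's also easy to add a quick consistency check on the extraction, for example comparing the summed line items plus shipping against the extracted total. This is just an illustrative check that assumes the receipt's total works that way:

# Illustrative sanity check: do the line items plus shipping roughly match the total?
items_sum = sum(item.price * item.quantity for item in receipt.items)
if abs(items_sum + receipt.shipping - receipt.total) > 0.01:
    print(f"Totals don't add up: items={items_sum:.2f}, shipping={receipt.shipping:.2f}, total={receipt.total:.2f}")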

Extracting from images

Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means that we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool!

First I load in a local image as a base-64 encoded data URI:

import base64


def open_image_as_base64(filename):
    # Read the image bytes and return them as a base-64 encoded data URI.
    # Note: the data URI assumes a PNG image; adjust the MIME type for other formats.
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"


image_url = open_image_as_base64("example_graph_treecover.png")

For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models:

class Graph(BaseModel):
    title: str
    description: str = Field(..., description="1 sentence description of the graph")
    x_axis: str
    y_axis: str
    legend: list[str]

Then I send off the base-64 image URI to the GPT model, inside an "image_url" type message, and convert the response back to a Graph object:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"image_url": {"url": image_url}, "type": "image_url"},
            ],
        },
    ],
    response_format=Graph,
)
output = completion.choices[0].message.parsed
graph = Graph.model_validate(output)
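Since the user message content is a list, you can also mix a "text" part with the "image_url" part if you want to give the model extra written context about the image. Here's an optional variation on the call above, where the hint text is made up for illustration:

# Optional variation (sketch): combine a text hint with the image in one user message.
completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "This chart comes from an annual report."},  # hypothetical hint
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ],
    response_format=Graph,
)
graph = Graph.model_validate(completion.choices[0].message.parsed)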

More examples

You can use this same general approach for entity extraction across many file types, as long as they can be represented in either a text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.
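One lightweight way to do that evaluation is to hand-label the expected output for a few representative documents and compare the extracted fields against those labels. Here's a minimal sketch using the receipt example, where the expected values are made up:

# Minimal evaluation sketch: compare extracted fields against hand-labeled expectations.
expected = Receipt(
    total=47.50,
    shipping=5.00,
    payment_method="credit card",
    items=[Item(product="Widget", price=21.25, quantity=2)],
    order_number=12345,
)  # made-up labels for one known test receipt
for field in Receipt.model_fields:
    if getattr(receipt, field) != getattr(expected, field):
        print(f"Mismatch on {field}: {getattr(receipt, field)!r} != {getattr(expected, field)!r}")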
