Monday, June 10, 2024

RAG on a database table with PostgreSQL

RAG (Retrieval Augmented Generation) is one of the most promising uses for large language models. Instead of asking an LLM a question and hoping the answer lies somewhere in its weights, we first query a knowledge base for anything relevant to the question, and then feed both those results and the original question to the LLM.

We have many RAG solutions out there for asking questions about unstructured documents, like PDFs and Word documents. Our most popular Azure solution for this scenario includes a data ingestion process to extract the text from the documents, chunk it up into appropriately sized pieces, and store those chunks in an Azure AI Search index. When your RAG is on unstructured documents, you'll always need a data ingestion step to store them in an LLM-compatible format.

But what if you just want users to ask questions about structured data, like a table in a database? Imagine customers who want to ask questions about the products in a store's inventory, where each product is a row in the table. We can use the RAG approach there, too, and in some ways, it's a simpler process.

Diagram of RAG on database rows

To get you started with this flavor of RAG, we've created a new RAG-on-PostgreSQL solution that includes a FastAPI backend, React frontend, and infrastructure-as-code for deploying it all to Azure Container Apps with Azure PostgreSQL Flexible Server. Here it is with the sample seed data:

Screenshot of RAG app with question about waterproof camping gear

We use the user's question to query a single PostgreSQL table and send the matching rows to the LLM. We display the answer plus information about any products referenced in the answer. Now let's break down how that solution works.



Data preparation

When we eventually query the database table with the user's query, we ideally want to perform a hybrid search: both a full text search and a vector search of any columns that might match the user's intent. In order to perform a vector search, we also need a column that stores a vector embedding of the target columns.

This is what the sample table looks like, described using SQLAlchemy 2.0 model classes. The final embedding column is a Vector type, from the pgvector extension for PostgreSQL:

from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    type: Mapped[str] = mapped_column()
    brand: Mapped[str] = mapped_column()
    name: Mapped[str] = mapped_column()
    description: Mapped[str] = mapped_column()
    price: Mapped[float] = mapped_column()
    embedding: Mapped[Vector] = mapped_column(Vector(1536))

The embedding column has 1536 dimensions to match OpenAI's text-embedding-ada-002 model, but you could configure it to match the dimensions of a different embedding model instead. The most important thing is to know exactly which model you used to generate the embeddings, so that we can later search with that same model.

To compute the value of the embedding column, we concatenate the text columns from the table row, send them to the OpenAI embedding model, and store the result:

items = session.scalars(select(Item)).all()
for item in items:
    # Concatenate the text columns that should influence search relevance
    item_for_embedding = f"Name: {item.name} Description: {item.description} Type: {item.type}"
    item.embedding = openai_client.embeddings.create(
        model=EMBED_DEPLOYMENT,
        input=item_for_embedding,
    ).data[0].embedding
session.commit()

We only need to run that once, if our data is static. However, if any of the included columns change, we should re-run that for the changed rows. Another approach is to use the Azure AI extension for Azure PostgreSQL Flexible Server. I didn't use it in my solution since I also wanted it to run with a local PostgreSQL server, but it should work great if you're always using the Azure-hosted PostgreSQL Flexible Server.
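
For example, here's a minimal sketch of re-embedding just a handful of changed rows. The changed_ids list is hypothetical and would come from wherever your application updates products:

changed_ids = [3, 7, 42]  # hypothetical: IDs of rows whose text columns changed
for item in session.scalars(select(Item).where(Item.id.in_(changed_ids))):
    item_for_embedding = f"Name: {item.name} Description: {item.description} Type: {item.type}"
    item.embedding = openai_client.embeddings.create(
        model=EMBED_DEPLOYMENT,
        input=item_for_embedding,
    ).data[0].embedding
session.commit()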



Hybrid search in PostgreSQL

Now our database table has both text columns and a vector column, so we should be able to perform a hybrid search: using the pgvector distance operator on the embedding column, using the built-in full-text search functions on the text columns, and merging the two result sets using the Reciprocal Rank Fusion (RRF) algorithm.

We use this SQL query for hybrid search, inspired by an example from the pgvector-python repository:

vector_query = f"""
SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
  FROM items
  ORDER BY embedding <=> :embedding
  LIMIT 20
"""

fulltext_query = f"""
SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC)
  FROM items, plainto_tsquery('english', :query) query
  WHERE to_tsvector('english', description) @@ query
  ORDER BY ts_rank_cd(to_tsvector('english', description), query) DESC
  LIMIT 20
"""

hybrid_query = f"""
WITH vector_search AS (
  {vector_query}
),
fulltext_search AS (
  {fulltext_query}
)
SELECT
  COALESCE(vector_search.id, fulltext_search.id) AS id,
  COALESCE(1.0 / (:k + vector_search.rank), 0.0) +
  COALESCE(1.0 / (:k + fulltext_search.rank), 0.0) AS score
FROM vector_search
FULL OUTER JOIN fulltext_search ON vector_search.id = fulltext_search.id
ORDER BY score DESC
LIMIT 20
"""

results = session.execute(
    text(hybrid_query),  # text() comes from sqlalchemy; it wraps the raw SQL string
    {"embedding": to_db(query_vector), "query": query_text, "k": 60},  # 60 is the standard RRF constant
).fetchall()
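
The query_vector parameter passed there is the embedding of the user's question, computed with the same embedding model we used for the table rows. A minimal sketch of that step, reusing the openai_client and EMBED_DEPLOYMENT from the data preparation code:

query_vector = openai_client.embeddings.create(
    model=EMBED_DEPLOYMENT,
    input=query_text,
).data[0].embedding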

That hybrid search is missing the final step that we always recommend for Azure AI Search: semantic ranker, a re-ranking model that sorts the results according to their relevance to the original user query. It should be possible to add a re-ranking model, as shown in another pgvector-python example, but such an addition requires load testing and possibly an architectural change, since re-ranking models are CPU-intensive. Ideally, the re-ranking model would be deployed on dedicated infrastructure optimized for model inference, not on the same server as our app backend.
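
If you do want to experiment with re-ranking, here's a rough sketch using a cross-encoder from the sentence-transformers library, applied to rows already returned by the hybrid query. This isn't part of the deployed solution; the retrieved_items variable and the model choice are just illustrative assumptions:

from sentence_transformers import CrossEncoder

# Hypothetical re-ranking step: score each retrieved row against the user's question,
# then sort the rows by that score (highest first).
encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = encoder.predict([(query_text, item.description) for item in retrieved_items])
reranked_items = [item for _, item in sorted(zip(scores, retrieved_items), key=lambda pair: pair[0], reverse=True)]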

Even without a re-ranking step, we get fairly good results from that hybrid search query! It easily finds rows that match both the exact keywords in a query and semantically similar phrases, as demonstrated by these user questions:

Screenshot of question 'dark blue shoes for hiking up trails' Screenshot of question 'sneakers for walking up steep hills'

Function calling for SQL filtering

The next step is to handle user queries like, "climbing gear cheaper than $100." Our hybrid search query can definitely find "climbing gear", but it's not designed to find products whose price is lower than some amount. The hybrid search isn't querying the price column at all, and isn't appropriate for a numeric comparison query anyway. Ideally, we would do both a hybrid search and add a filter clause, like WHERE price < 100.

Fortunately, we can use an LLM to suggest filter clauses based on user queries, and the OpenAI GPT models are very good at it. We add a query-rewriting phase to our RAG flow which uses OpenAI function calling to come up with the optimal search query and column filters.

In order to use OpenAI function calling, we need to describe the function and its parameters. Here's what that looks like for a search query and single column's filter clause:

{
  "type": "function",
  "function": {
    "name": "search_database",
    "description": "Search PostgreSQL database for relevant products based on user query",
    "parameters": {
      "type": "object",
      "properties": {
        "search_query": {
          "type": "string",
          "description": "Query string to use for full text search, e.g. 'red shoes'"
        },
        "price_filter": {
          "type": "object",
          "description": "Filter search results based on price of the product",
          "properties": {
            "comparison_operator": {
              "type": "string",
              "description": "Operator to compare the column value, either '>', '<', '>=', '<=', '='"
            },
            "value": {
              "type": "number",
               "description": "Value to compare against, e.g. 30"
            }
          }
        }
      }
    }
  }
}

We can easily add additional parameters for other column filters, or we could even have a generic column filter parameter and have OpenAI suggest the column based on the table schema. For my solution, I am intentionally constraining the LLM to only suggest a subset of possible filters, to minimize the risk of SQL injection or poor SQL performance. There are many libraries out there that do full text-to-SQL, and that's another approach you could try, if you're comfortable with the security of those approaches.
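
To show how that definition is used, here's a rough sketch of the query-rewriting call with the OpenAI Python client. Assume the tool definition above is stored in a search_database_tool variable and CHAT_DEPLOYMENT names your chat model deployment; both names are illustrative, not the solution's actual code:

import json

response = openai_client.chat.completions.create(
    model=CHAT_DEPLOYMENT,
    messages=[
        {"role": "system", "content": "Rewrite the user's question into a search query and optional filters."},
        {"role": "user", "content": user_query},
    ],
    tools=[search_database_tool],
)
# Pull the suggested search query and price filter out of the tool call arguments
tool_call = response.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
search_query = arguments.get("search_query", user_query)
price_filter = arguments.get("price_filter")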

When we get back the results from the function call, we use them to build a filter clause, and append that to our original hybrid search query. We want to do the filtering before the vector and full-text searches, to narrow down the search space to only the rows that could possibly match. Here's what the new vector search looks like, with the additional filter clause:

vector_query = f"""
  SELECT id, RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
    FROM items
    {filter_clause}
    ORDER BY embedding <=> :embedding
    LIMIT 20
"""

With the query rewriting and filter building in place, our RAG app can now answer questions that depend on filters:

Screenshot of question 'climbing gear cheaper than $30'

RAG on unstructured vs structured data

Trying to decide what RAG approach to use, or which of our solutions to use for a prototype? If your target data is largely unstructured documents, then you should try out our Azure AI Search RAG starter solution, which will take care of the complex data ingestion phase for you. However, if your target data is an existing database table, and you want to RAG over a single table (or a small number of tables), then try out the PostgreSQL RAG starter solution and modify it to work with your table schema. If your target data is a database with a multitude of tables with different schemas, then you probably want to research full text-to-SQL solutions. Also check out the LlamaIndex and LangChain libraries, as they often have functionality and samples for common RAG scenarios.
