Wednesday, January 3, 2024

Using llamafile for local dev for an OpenAI Python web app

We're seeing more and more LLMs that can be run locally on a laptop, especially on laptops with GPUs and multiple cores. Open source projects are also making it easier to run LLMs locally, so that you don't have to be an ML engineer or C/C++ programmer to get started (Phew!).

One of those projects is llamafile, which provides a single executable that serves up an API and frontend to interact with a local LLM (defaulting to LLaVA). With just a few steps, I was able to get the llamafile server running. I then discovered that llamafile includes an OpenAI-compatible endpoint, so I can point my Azure OpenAI apps at the llamafile server for local development. That means I can save costs and also evaluate the quality difference between deployed models and local models. Amazing!

I'll step through the process and share my sample app, a FastAPI chat app backend.

Running the llamafile server

Follow the instructions in the quickstart to get the server running.

Test out the server by chatting with the LLM:

[Screenshot of a llama.cpp conversation about haikus]

Using the OpenAI-compatible endpoint

The llamafile server includes an endpoint at "/v1" that behaves just like the OpenAI servers. Note that it mimics the OpenAI servers, not the *Azure* OpenAI servers, so it does not include additional properties like the content safety filters.

Test out that endpoint by running the curl command in the JSON API quickstart.

You can also run the Python code in that quickstart to confirm the endpoint works from Python.
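For a quick sanity check, here's a minimal sketch along the lines of the quickstart's Python example, assuming the server is running on its default port of 8080. The "LLaMA_CPP" model name is the value used in the llamafile docs; the API key can be any string since the local server doesn't check it.

import openai

# Local llamafile server exposes an OpenAI-compatible API under /v1
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # any value works; the client just requires one
)

response = client.chat.completions.create(
    model="LLaMA_CPP",  # model name from the llamafile docs
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about local LLMs."},
    ],
)
print(response.choices[0].message.content)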

Using llamafile with an existing OpenAI app

As the llamafile documentation shows, you can point an OpenAI Python client at a local server by overriding base_url and providing a bogus api_key.

from openai import AsyncOpenAI

# Point the async OpenAI client at the local llamafile server
client = AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",
)

I tried that out with one of my Azure OpenAI samples, a FastAPI chat backend, and it worked for both streaming and non-streaming responses! 🎉

[Screenshot of the response from the FastAPI-generated documentation for a request to make a haiku]
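For reference, here's a rough sketch of what the two call patterns look like with the async client pointed straight at the local server. The prompt and the "LLaMA_CPP" model name are just placeholders, not the sample's actual code.

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",
)

async def main():
    messages = [{"role": "user", "content": "Write a haiku about FastAPI."}]

    # Non-streaming: wait for the complete response
    response = await client.chat.completions.create(model="LLaMA_CPP", messages=messages)
    print(response.choices[0].message.content)

    # Streaming: print each chunk of content as it arrives
    stream = await client.chat.completions.create(model="LLaMA_CPP", messages=messages, stream=True)
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())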

Switching with environment variables

I wanted it to be easy to switch between Azure OpenAI and a local LLM without changing any code, so I made an environment variable for the local LLM endpoint. Now my client initialization code looks like this:

if os.getenv("LOCAL_OPENAI_ENDPOINT"):
    client = openai.AsyncOpenAI(
        api_key="no-key-required",
        base_url=os.getenv("LOCAL_OPENAI_ENDPOINT"),
    )
else:
    # Lots of Azure initialization code here...
    # See link below for full code.
    client = openai.AsyncAzureOpenAI(
        api_version="2023-07-01-preview",
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        # plus additional args
    )

See full code in __init__.py.

Notice that I am using the Async version of the OpenAI clients in both cases, since this backend uses FastAPI and makes 100% asynchronous calls. For llamafile, I don't bother using the Azure version of the client, since llamafile only tries to mimic the openai.com servers. That should be fine, as I typically code against the openai.com servers as a baseline and only take advantage of Azure extras (like content safety filters) when they're available.
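To show how that async client ends up getting used, here's a simplified sketch of a FastAPI streaming route in the spirit of the sample. The "/chat" path, the NDJSON response format, and the request shape are illustrative placeholders, not the sample's exact code, and the client is hard-coded to the local server for brevity.

import json

import fastapi
import openai
from fastapi.responses import StreamingResponse

# In the sample, this client comes from the initialization code shown above
client = openai.AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="no-key-required",
)

router = fastapi.APIRouter()

@router.post("/chat")
async def chat_handler(request: dict):
    # Stream chunks back to the browser as newline-delimited JSON
    async def response_stream():
        stream = await client.chat.completions.create(
            model="LLaMA_CPP",  # placeholder model name for the local server
            messages=request["messages"],
            stream=True,
        )
        async for event in stream:
            if event.choices and event.choices[0].delta.content:
                yield json.dumps({"content": event.choices[0].delta.content}) + "\n"

    return StreamingResponse(response_stream(), media_type="application/x-ndjson")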

I will likely try out llamafile for my other Azure OpenAI samples soon, and run some evaluations to see how llamafile compares in terms of quality. I don't have any plans to use non-OpenAI models in production, but I want to keep monitoring how well the local LLMs can perform and what use cases there may be for them.
