Friday, June 14, 2024

Using SLMs in GitHub Codespaces

Today I went on a quest to figure out the best way to use SLMs (small language models) like Phi-3 in a GitHub Codespace, so that I can provide a browser-only way for anyone to start working with language models. My main target audience is teachers and students, who may not have GPUs or the budget to pay for large language models, but it's always good to make new technologies accessible to as many people as possible.

❌ Don't use transformers

For my first approach, I tried to use the HuggingFace transformers package, using code similar to their text generation tutorial, but modified for my non-GPU environment:

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# Seed for reproducible sampling
set_seed(2024)
prompt = "insert your prompt here"
model_checkpoint = "microsoft/Phi-3-mini-4k-instruct"

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             trust_remote_code=True,
                                             torch_dtype="auto",
                                             device_map="auto")

# Tokenize the prompt, generate up to 120 new tokens on the CPU, and decode
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Unfortunately, that took a very long time. Enough time for me to go for a walk, chase the garbage truck around the neighborhood, make breakfast, and drop my kid at school: 133 minutes total!

Screenshot of notebook with 133m duration

So, yes, you can technically use transformers. But without either a very powerful CPU or, better, a GPU, the performance is just too slow. Let's move on...

✅ Do use Ollama!

For my next approach, I tried setting up a dev container with Ollama built-in, by adding the ollama feature to my devcontainer.json:

"features": {
        "ghcr.io/prulloac/devcontainer-features/ollama:1": {}
},

I then pulled a small phi3 model using "ollama run phi3:mini", and I was able to generate text in a matter of seconds:

Screenshot of Ollama generation
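
If you want to confirm that generation works from Python before bringing in any SDK, you can call Ollama's own REST API directly. Here's a minimal sketch, assuming the server is on its default port (11434) and the phi3:mini model has already been pulled:

import json
import urllib.request

# Ollama's native generate endpoint (the OpenAI-compatible one used below lives under /v1)
url = "http://localhost:11434/api/generate"
payload = {
    "model": "phi3:mini",
    "prompt": "Write a haiku about a hungry cat",
    "stream": False,  # return a single JSON object instead of streamed chunks
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())

print(result["response"])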

So I proceeded to use Ollama via the Python OpenAI SDK, which I can do thanks to Ollama's OpenAI-compatible endpoints.

import openai

# Point the OpenAI client at the local Ollama server, which exposes
# OpenAI-compatible endpoints on port 11434
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="nokeyneeded",  # Ollama doesn't check the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="phi3:mini",
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about a hungry cat"},
    ],
)

print("Response:")
print(response.choices[0].message.content)
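
The same OpenAI-compatible endpoint also supports streaming, which is handy for an interactive chat where you want tokens to appear as they're generated. Here's a quick sketch using the same client, assuming phi3:mini is still the model being served:

# Stream the response token by token instead of waiting for the full completion
stream = client.chat.completions.create(
    model="phi3:mini",
    temperature=0.7,
    stream=True,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about a hungry cat"},
    ],
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()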

Try it yourself!

To make it super easy for anyone to get started with SLMs in a Codespace, I bundled everything into this repository:

https://github.com/pamelafox/ollama-python-playground/

That repository includes the Ollama feature, the OpenAI SDK, a notebook with demonstrations of few-shot prompting and RAG, and a script for an interactive chat. I hope it can be a helpful resource for teachers and students who want a quick and easy way to get started with small language models.
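
If few-shot prompting is new to you, here's a minimal sketch of the idea with the same local client (the review-classification task and settings are just illustrative, not the repository's actual demo):

import openai

# Same local Ollama client as in the earlier example
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="nokeyneeded")

# Few-shot prompting: seed the chat with example exchanges so the model
# picks up the pattern before answering the real question
response = client.chat.completions.create(
    model="phi3:mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": "You classify movie reviews as positive or negative."},
        {"role": "user", "content": "Review: I loved every minute of it."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: I walked out halfway through."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: The soundtrack alone was worth the ticket."},
    ],
)
print(response.choices[0].message.content)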

I also hope to add the Ollama feature to other repositories where it can be helpful, like the Phi-3 cookbook.
