Everyone's going ga-ga for DeepSeek-R1, so I thought I'd try it out in a live stream today. I'll summarize my experience in this post.
I tried DeepSeek-R1 from Python through two different hosts, via the OpenAI Python SDK:
- GitHub Models: Open to anyone with a GitHub account, free up to a certain number of requests per day. Great for learning and experimenting with new models.
- Ollama: Includes models from 1.5B all the way up to 671B, but my Mac M1 can only run the 8B.
It's also possible to deploy DeepSeek-R1 on Azure, but I used the hosts that were easy to set up quickly.
Connecting with the OpenAI SDK
The DeepSeek-R1 model provides an "OpenAI-compatible interface", so that you can use the OpenAI Python SDK for making chat completion requests. The model is fairly limited in its compatibility - no temperature, no function calling, less attention paid to the "system" message - but it's still very usable.
Here's how I connected for GitHub models:
import os

import openai

# GitHub Models endpoint, authenticated with a GitHub token
client = openai.OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.getenv("GITHUB_TOKEN"))
model_name = "DeepSeek-R1"
And here's how I connected for Ollama:
# Ollama's local OpenAI-compatible endpoint; no real API key is needed
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="nokeyneeded")
model_name = "deepseek-r1:8b"
Then I make the chat completion request, leaving off most parameters and the system message. It is possible to specify max_tokens, but the model might end its response in the middle of a thought, so we need to be very careful when setting that parameter. It also supports the stop parameter.
response = client.chat.completions.create(
model=model_name,
messages=[
{
"role": "user",
"content": "You're an assistant that loves emojis. Write a haiku about a hungry cat who wants tuna"
},
],
)
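If you do want to cap the response or stop on a marker, you can pass those parameters like this (a sketch with arbitrary values; the stop sequence here is just a placeholder to show the parameter, and remember that a low max_tokens can cut the response off mid-thought):

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": "You're an assistant that loves emojis. Write a haiku about a hungry cat who wants tuna"
        },
    ],
    # Arbitrary cap: too low and the response ends mid-thought
    max_tokens=2000,
    # Hypothetical stop sequence, just to show the parameter
    stop=["## END"],
)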
Now you'll get a response like this:
<think>
The model's thought process, which can be VERY long.
</think>
The model's final answer.
You can choose to extract the thoughts using a regular expression for those tags, as shown in this article, and then render it differently to the user.
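If you go the non-streaming route, a minimal sketch of that extraction might look like this (assuming the full response text has already been retrieved into the response variable from earlier):

import re

# Split a non-streamed response into the thoughts and the final answer
content = response.choices[0].message.content
match = re.search(r"<think>(.*?)</think>(.*)", content, re.DOTALL)
if match:
    thoughts, answer = match.group(1).strip(), match.group(2).strip()
else:
    thoughts, answer = "", content.strip()
print("Thoughts:", thoughts)
print("Answer:", answer)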
The thinking can take a very long time, however, so my preference is to stream the response. That way I can start reading its thoughts as soon as they begin.
Handling streamed thoughts
To receive a streamed response, we first add stream=True to the chat completion call:
response = client.chat.completions.create(
model=model_name,
messages=[
{"role": "user", "content": "Who painted the Mona Lisa?"},
],
stream=True
)
Then, in our stream processing code, we keep track of whether we've seen the start <think> tag or the end </think> tag, and display the thoughts differently to the user:
is_thinking = False
for event in response:
    if event.choices:
        content = event.choices[0].delta.content
        if content == "<think>":
            is_thinking = True
            print("🧠Thinking...", end="", flush=True)
        elif content == "</think>":
            is_thinking = False
            print("🛑\n\n")
        elif content:
            print(content, end="", flush=True)
Then our output looks like this:
🧠Thinking... The model's thought process, which can be VERY long. 🛑 The model's final answer.
We could use a similar approach when streaming down thoughts from the backend to the frontend, so that the frontend could visually distinguish between the thoughts and the answer itself.
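For example, the backend could wrap the same loop in a generator that yields typed events, and the frontend could style them based on the type. Here's a sketch (the event names are made up, and this isn't tied to any particular web framework):

import json

def stream_chat(messages):
    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=True)
    is_thinking = False
    for event in response:
        if not event.choices:
            continue
        content = event.choices[0].delta.content
        if content == "<think>":
            is_thinking = True
        elif content == "</think>":
            is_thinking = False
        elif content:
            # Tag each chunk so the frontend can render thoughts differently
            kind = "thinking" if is_thinking else "answer"
            yield json.dumps({"type": kind, "content": content}) + "\n"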
Tip: There are some questions that are so easy for the model to answer that the "thoughts" will simply be a newline - for example, if I simply say "hi" to the model. We may want to consider that edge case in how we render thoughts. The vast majority of questions will have thoughts, however - even a seemingly simple question like "who painted the Mona Lisa?" had a long thinking process to determine that, yes, it was definitely Leonardo da Vinci.
Using DeepSeek-R1 with RAG
Since I spend most of my time these days on applications that use RAG (Retrieval-Augmented Generation), I wanted to see how it would handle answering questions based on provided context.
I used two RAG scenarios:
- A CSV of hybrid cars, with 153 rows and 6 columns.
- Document chunks from PDFs, from the search index created by this Azure RAG solution, with ~500 tokens in each chunk, and three chunks retrieved per question.
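In both scenarios, the retrieved data gets stuffed into the prompt as context. A minimal sketch of what that looks like for the CSV (the file name and prompt wording here are illustrative, not my exact prompt):

# Pass the retrieved rows as context in the user message
with open("hybrid_cars.csv") as f:
    csv_data = f.read()

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": (
                "Answer the question using only the sources below.\n\n"
                f"Sources:\n{csv_data}\n\n"
                "Question: what's the fastest hybrid car?"
            ),
        },
    ],
    stream=True,
)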
I started off trying RAG with Ollama and the 8B model, and I was surprised at the results: the model completely hallucinated the sources, seemingly ignoring what I passed to it, in a very confident way. For example, here's the model describing a line of the CSV and its format:
"F-Typesport; engine: 3.0L turbocharged I6; horsepower: 505.0; torque: 295 lb-ft; acceleration: 2.8s; top speed: 155 mph." So, each car's description continues with multiple attributes separated by semicolons and some other punctuation.
However, there is no "F-Typesport" in the CSV, no mention of torque or horsepower. It made that all up!
So I moved back to GitHub Models, where I could try RAG with the full 671-billion-parameter model. I got very good results with the full model, along with a very long thought process. I asked it "what's the fastest hybrid car?", and it went line-by-line through the CSV, mentioning the acceleration of each car. Here's how it started:
Looking at the CSV headers: vehicle, year, msrp, acceleration, mpg, class. The acceleration column is the one to focus on. The task is to scan through all entries and find the minimum acceleration value. Let me go through the data row by row and note the acceleration values. The first entry is the Prius (1st Gen) with 7.46 seconds. Then the Tino at 8.2, Prius (2nd Gen) at 7.97. The Insight from 2000 is 9.52, Civic (1st Gen) 2001 at 7.04. Wait, that's a low number. Continuing: Alphard 2003 is 8.33, Escape 2004 at 10.32, Lexus RX400h 2005 at 12.76. Then Highlander 2005 (12.76), Mercury Mariner 2006 (8.98), Camry 2006 (11.28), Lexus GS450h 2006 (18.65). That's pretty high. Estima 2006 (9.26), Altima 2006 (13.29).
At the end, it did some "double-checking" (like to make sure the fastest car was indeed a hybrid car), and then correctly reported the fastest car.
I tried the same query with gpt-4o and o1, and they both also correctly reported the fastest car. The difference, of course, is that DeepSeek-R1 shows its thought process. That can be helpful in some situations, and was actually helpful for me, as I'm not a car person and didn't know what column/value would properly answer that question.
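Since the thought process named the acceleration column, it's easy to sanity-check that answer outside the model. A quick sketch using the headers quoted above (the file name is illustrative):

import csv

# Find the row with the lowest acceleration value (lower = faster)
with open("hybrid_cars.csv") as f:
    rows = list(csv.DictReader(f))

fastest = min(rows, key=lambda row: float(row["acceleration"]))
print(fastest["vehicle"], fastest["acceleration"])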
I then tried RAG on the document chunks, and it came up with a correct answer, as well as providing the citations in the correct format (square brackets, page numbers, as requested in my prompt). Once again, it had a very lengthy thought process, where it looked at each source in turn and described what that source contributed to the answer. For example, here's how it started:
Okay, the user is asking what's included in the Northwind Health Plus plan that isn't in the standard plan. Let me go through the provided sources carefully. From Benefit_Options.pdf#page=3, it mentions that Northwind Health Plus covers emergency services, mental health and substance abuse, and out-of-network services, which the Standard plan doesn't. Also, Health Plus has more comprehensive prescription drug coverage and emergency services both in and out of network. Another source, Northwind_Standard_Benefits_Details.pdf#page=45, confirms that Standard covers preventive care, professional visits, and some diagnostics but doesn't list those additional services. The same Benefit_Options.pdf#page=3 also compares costs, but the key point here is the coverage differences. So, putting it together, the Plus plan includes those additional services and broader drug coverage that the Standard lacks.
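For reference, the sources and citation instructions went into the prompt in roughly this shape (a sketch: the chunk text is elided, the source names come from the thought process above, and the wording is approximate rather than my exact prompt):

# Assemble document chunks and citation instructions into the user message
retrieved_chunks = [
    {"source": "Benefit_Options.pdf#page=3", "text": "..."},
    {"source": "Northwind_Standard_Benefits_Details.pdf#page=45", "text": "..."},
]
sources_text = "\n\n".join(
    f"{chunk['source']}: {chunk['text']}" for chunk in retrieved_chunks)
user_message = (
    "Answer using only the sources below. After each fact, cite its source "
    "in square brackets. Don't combine sources, list each source separately, "
    "for example [info1.txt][info2.pdf].\n\n"
    f"Sources:\n{sources_text}\n\n"
    "Question: What is included in Northwind Health Plus that is not in the standard plan?"
)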
Generally, it seems like the thinking process scales linearly with the size of the data: i.e. for the CSV, it scales with the number of rows, and for the document chunks, it scales with the number of chunks. I have not scientifically verified this, of course, so it's possible that the model would take a different approach with vastly different amounts of context, but this is what it looks like for the kind of data I sent to it. If anyone does verify that with more rigor, let me know!
The thought process also looked at each line in the instructions portion of my prompt that described how to cite the sources, so we can expect longer thought processes for each additional instruction requested of the model. For example, this is the model trying to adhere to one of the lines:
But the user's instruction says: "Don't combine sources, list each source separately, for example [info1.txt][info2.pdf]." However, if all benefits are from the same source, it's allowed to list the source once per claim. Wait, no, if multiple facts are from the same source, each fact should be followed by the source. For example, "[Benefit_Options.pdf#page=3]" after each item.
That makes me think very carefully about each line in the prompt, knowing how much attention the model is actually paying to each one. It also seems like a good way to iterate on prompts to find the clearest wording for the desired behavior.