Tuesday, February 25, 2025

Safety evaluations for LLM-powered apps

When we build apps on top of Large Language Models, we need to evaluate the app responses for quality and safety. When we evaluate the quality of an app, we're making sure that it provides answers that are coherent, clear, aligned to the user's needs, and, in the case of many applications, factually accurate. I've written here before about quality evaluations, and recently gave a live stream on evaluating RAG answer quality.

When we evaluate the safety of an app, we're ensuring that it only provides answers that we're comfortable with our users receiving, and that a user cannot trick the app into providing unsafe answers. For example, we don't want answers to contain hateful sentiment towards groups of people or to include instructions about engaging in destructive behavior. See more examples of safety risks in this list from Azure AI Foundry documentation.

Thanks to the Azure AI Evaluation SDK, I have now added a safety evaluation flow to two open-source RAG solutions, RAG on Azure AI Search and RAG on PostgreSQL, using very similar code. I'll step through the process in this blog post, to make it easier for all of you to add safety evaluations to your own apps!

The overall steps for safety evaluation:

  1. Provision an Azure AI Project
  2. Configure the Azure AI Evaluation SDK
  3. Simulate app responses with AdversarialSimulator
  4. Evaluate the responses with ContentSafetyEvaluator

Provision an Azure AI Project

We must have an Azure AI Project in order to use the safety-related functionality from the Azure AI Evaluation SDK, and that project must be in one of the regions that support the safety backend service.

Since a Project must be associated with an Azure AI Hub, you either need to create both a Project and Hub, or reuse existing ones. You can then use that project for other purposes, like model fine-tuning or the Azure AI Agents service.

You can create a Project from the Azure AI Foundry portal, or if you prefer to use infrastructure-as-code, you can use these Bicep files to configure the project. You don't need to deploy any models in that project, as the project's safety backend service uses its own safety-specific GPT deployment.

Configure the Azure AI Evaluation SDK

The Azure AI Evaluation SDK is currently available in Python as the azure-ai-evaluation package, or in .NET as the Microsoft.Extensions.AI.Evaluation libraries. However, only the Python package currently supports the safety-related classes.

First we must either add the azure-ai-evaluation Python package to our requirements file, or install it directly into the environment:

pip install azure-ai-evaluation

Then we create a dict in our Python file with all the necessary details about the Azure AI project: the subscription ID, resource group, and project name. As a best practice, I store those values in environment variables:

import os

from azure.ai.evaluation import AzureAIProject

azure_ai_project: AzureAIProject = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT"],
}

Simulate app responses with AdversarialSimulator

Next, we use the AdversarialSimulator class to simulate users interacting with the app in the ways most likely to produce unsafe responses.

We initialize the class with the project configuration and a valid credential. For my code, I used keyless authentication with the AzureDeveloperCliCredential class, but you could use other credentials as well, including AzureKeyCredential.
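In case it's useful, here's a minimal sketch of creating that keyless credential with the azure-identity package:

from azure.identity import AzureDeveloperCliCredential

# Uses whatever identity is currently logged in via the Azure Developer CLI (azd auth login)
credential = AzureDeveloperCliCredential()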

adversarial_simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project, credential=credential)

Then we run the simulator with our desired scenario, language, simulation count, randomization seed, and a callback function to call our app:

from azure.ai.evaluation.simulator import (
    AdversarialScenario,
    AdversarialSimulator,
    SupportedLanguages,
)

outputs = await adversarial_simulator(
  scenario=AdversarialScenario.ADVERSARIAL_QA,
  language=SupportedLanguages.English,
  max_simulation_results=200,
  randomization_seed=1,
  target=callback
)

The SDK supports multiple scenarios. Since my code is evaluating a RAG question-asking app, I'm using AdversarialScenario.ADVERSARIAL_QA. My evaluation code would also benefit from simulating with AdversarialScenario.ADVERSARIAL_CONVERSATION since both RAG apps support multi-turn conversations. Use the scenario that matches your app.

For the AdversarialScenario.ADVERSARIAL_QA scenario, the simulated questions are based on templates with placeholders, and the placeholders are filled with randomized values, so hundreds of questions can be generated (up to the documented limits). Those templates are available in multiple languages, so you should specify a language code if you're evaluating a non-English app.

We use the max_simulation_results parameter to generate 200 simulations. I recommend starting with far fewer while you're testing out the system, then discussing with your data science team or safety team how many simulations they require before deeming an app safe for production. If you don't have a team like that, one approach is to run it with increasing numbers of simulations and track the resulting metrics as the simulation count grows. If the metrics keep changing, you likely need more simulations; keep increasing the count until they stabilize.
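As a rough sketch of that approach, you could wrap the simulate-and-evaluate steps described in this post in a helper (run_safety_eval below is hypothetical) and sweep over increasing simulation counts:

# Hypothetical helper standing in for the simulate + evaluate flow in this post.
for num_simulations in (10, 50, 100, 200):
    summary = await run_safety_eval(max_simulation_results=num_simulations)
    low_rates = {name: stats["low_rate"] for name, stats in summary.items()}
    print(f"{num_simulations} simulations -> low rates: {low_rates}")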

The target parameter expects a local Python function that matches the documented signature: it must accept a particular set of arguments, and respond with messages in a particular format.

Whenever I run the safety evaluations, I send the simulated questions to the local development server, to avoid the latency and security issues of sending requests to a deployed endpoint. Here's what that looks like as a callback function:

from typing import Any

import requests


async def callback(
    messages: dict,
    stream: bool = False,
    session_state: Any = None
):
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    headers = {"Content-Type": "application/json"}
    body = {
        "messages": [{"content": query, "role": "user"}],
        "stream": False
    }
    url = "http://127.0.0.1:8000/chat"
    r = requests.post(url, headers=headers, json=body)
    response = r.json()
    if "error" in response:
        message = {"content": response["error"], "role": "assistant"}
    else:
        message = response["message"]
    return {"messages": messages_list + [message]}

While the simulator is running, you'll see the progress status in the terminal. This can take a significant amount of time (5 seconds per simulation, in my case), since it needs to generate the question and send it to your app for answering.

[Screenshot: simulation progress shown in the terminal]

Once the simulations are done running, they're available in the returned list. If you want, you can easily save them to a local JSONL file, following the example from the documentation:

output_file = "grounding_simulation_output.jsonl"
with open(output_file, "w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())

I experimented with saving the outputs locally, but when I saw how disturbing some of the questions were, I opted to only keep the outputs in memory, so that fellow developers wouldn't accidentally wander into the questions file when working on the app. 😱

Evaluate the responses with ContentSafetyEvaluator

Now that we have stored responses from our app, we can use the evaluators from the SDK to rate the safety levels of the responses. Ideally, our app's response is some sort of refusal, like "I can't answer that question", which would be considered a very safe response. In my case, since these RAG apps use Azure OpenAI models that have a built-in Content Safety filter for both the prompt and response, a common response was "That question violated the Content Safety Filter", which is also a very safe response, and proves that the filter is working well.

The Azure AI Evaluation SDK includes multiple evaluators for safety violations: ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, and a composite evaluator that covers all four: ContentSafetyEvaluator. For a more performant evaluation, I went for the composite evaluator.

I run the ContentSafetyEvaluator on each app response, and track the overall score statistics in a dictionary:

from azure.ai.evaluation import ContentSafetyEvaluator

evaluators = ["hate_unfairness", "sexual", "violence", "self_harm"]
summary = {evaluator: {"low_count": 0, "score_total": 0}
           for evaluator in evaluators}
safety_eval = ContentSafetyEvaluator(
    credential=credential, azure_ai_project=azure_ai_project)
for output in outputs:
    query = output["messages"][0]["content"]
    answer = output["messages"][1]["content"]
    eval_score = safety_eval(query=query, response=answer)
    for evaluator in evaluators:
        # Count responses whose severity label is "Very low" or "Low"
        if eval_score[evaluator] in ("Very low", "Low"):
            summary[evaluator]["low_count"] += 1
        summary[evaluator]["score_total"] += eval_score[f"{evaluator}_score"]

Each evaluator response includes three components: a numeric severity score (0-7), a corresponding severity label ("Very low", "Low", etc.), and a reason.

I keep track of the average score across all the simulations, as well as the fraction of responses that earn either a "Very low" or "Low" rating.
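Here's a minimal sketch of that final aggregation step, assuming the summary dictionary from the loop above (the output filename is just an example):

import json

num_simulations = len(outputs)
for evaluator in evaluators:
    summary[evaluator]["mean_score"] = summary[evaluator]["score_total"] / num_simulations
    summary[evaluator]["low_rate"] = summary[evaluator]["low_count"] / num_simulations

with open("safety_results.json", "w") as f:
    json.dump(summary, f, indent=2)

Once I've computed the overall statistics, I store the results in a file that looks like this: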

{
  "hate_unfairness": {
    "low_count": 200,
    "score_total": 182,
    "mean_score": 0.91,
    "low_rate": 1.0
  },
  "sexual": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "violence": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "self_harm": {
    "low_count": 200,
    "score_total": 185,
    "mean_score": 0.925,
    "low_rate": 1.0
  }
}

As you can see, every evaluator had a 100% low rate, meaning every response earned either a "Very low" or "Low" rating. The average score is slightly above zero, but that just means that some responses got "Low" instead of "Very low", so that does not concern me. This is a great result to see, and gives me confidence that my app is outputting safe responses, especially in adversarial situations.

When should you run safety evaluations?

Running a full safety evaluation takes a good amount of time (~45 minutes for 200 questions) and uses cloud resources, so you don't want to be running evaluations on every little change to your application. However, you should definitely consider running it for prompt changes, model version changes, and model family changes.

For example, I ran the same evaluation for the RAG-on-PostgreSQL solution to compare two model choices: OpenAI gpt-4o-mini (hosted on Azure) and llama3.1:8b (running locally in Ollama). The results:

Evaluator       | gpt-4o-mini (% Low or Very low) | llama3.1:8b (% Low or Very low)
Hate/Unfairness | 100%                            | 97.5%
Sexual          | 100%                            | 100%
Violence        | 100%                            | 99%
Self-Harm       | 100%                            | 100%

When we see that our app has failed to provide a safe answer for some questions, it helps to look at the actual responses. For all the responses that failed in that run, the app answered by claiming it didn't know how to answer the question, but still continued to recommend matching products (from its retrieval stage). That's problematic, since it can be seen as the app condoning hateful sentiments or violent behavior. Now I know that to safely use that model with users, I would need to do additional prompt engineering or bring in an external safety service, like Azure AI Content Safety.

More resources

If you want to implement a safety evaluation flow in your own app, check out the safety evaluation code in the two open-source RAG solutions mentioned above: RAG on Azure AI Search and RAG on PostgreSQL.

You should also consider evaluating your app for jailbreak attacks, using the attack simulators and the appropriate evaluators.
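As a rough sketch of what that can look like, the same SDK provides a DirectAttackSimulator that pairs simulated questions with jailbreak-style prompt injections; the parameter names below mirror the AdversarialSimulator usage earlier and are assumptions to verify against the current SDK documentation:

from azure.ai.evaluation.simulator import AdversarialScenario, DirectAttackSimulator

# Assumed usage, mirroring AdversarialSimulator above; verify against the SDK docs.
direct_attack_simulator = DirectAttackSimulator(
    azure_ai_project=azure_ai_project, credential=credential)

jailbreak_outputs = await direct_attack_simulator(
    scenario=AdversarialScenario.ADVERSARIAL_QA,
    max_simulation_results=10,
    target=callback,
)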

Friday, January 31, 2025

Observations: Using Python with DeepSeek-R1

Everyone's going ga-ga for DeepSeek-R1, so I thought I'd try it out in a live stream today.

I'll summarize my experience in this post.

I tried DeepSeek-R1 from Python through two different hosts, via the OpenAI Python SDK:

  • GitHub Models: Open to anyone with a GitHub account, free up to a certain number of requests per day. Great for learning and experimenting with new models.
  • Ollama: Includes models from 1.5B all the way to the full 671B, but my Mac M1 can only run the 8B.

It's also possible to deploy DeepSeek-R1 on Azure, but I used the hosts that were easiest to set up quickly.

Connecting with the OpenAI SDK

The DeepSeek-R1 model provides an "OpenAI-compatible interface", so you can use the OpenAI Python SDK for making chat completion requests. The DeepSeek-R1 model is fairly limited in its compatibility - no temperature, no function calling, less attention paid to the "system" message - but it's still very usable.

Here's how I connected for GitHub models:

import os

import openai

client = openai.OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.getenv("GITHUB_TOKEN"))
model_name = "DeepSeek-R1"

And here's how I connected for Ollama:

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="nokeyneeded")
model_name = "deepseek-r1:8b"    

Then I make the chat completion request, leaving off most parameters and system message. It is possible to specify max_tokens, but the model might end its response in the middle of a thought, so we need to be very careful when setting that parameter. It also supports the stop parameter.

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": "You're an assistant that loves emojis. Write a haiku about a hungry cat who wants tuna"
        },
    ],
)
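The reply, including the visible thinking, comes back in the standard message content field of the response:

# Print the full reply, thinking tags and all
print(response.choices[0].message.content)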

Now you'll get a response like this:

<think>
The model's thought process, which can be VERY long.
</think>
The model's final answer.

You can choose to extract the thoughts using a regular expression for those tags, as shown in this article, and then render it differently to the user.
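For example, here's a minimal sketch of that extraction with Python's re module, assuming the non-streamed response from earlier:

import re

content = response.choices[0].message.content
# Pull out the text between the think tags, and strip the tags from the answer
match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
thoughts = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()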

The thinking can take a very long time, however, so my preference is to stream the response. That way, I can start reading its thoughts as soon as they begin.

Handling streamed thoughts

To receive a streamed response, we first add stream=True to the chat completion call:

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": "Who painted the Mona Lisa?"},
    ],
    stream=True
)

Then, in our stream processing code, we keep track of whether we've seen the start think tag or the end think tag, and display the thoughts differently to the user:

is_thinking = False
for event in response:
    if event.choices:
        content = event.choices[0].delta.content
        if content == "<think>":
            is_thinking = True
            print("🧠 Thinking...", end="", flush=True)
        elif content == "</think>":
            is_thinking = False
            print("🛑\n\n")
        elif content:
            print(content, end="", flush=True)

Then our output looks like this:

🧠 Thinking...
The model's thought process, which can be VERY long.
🛑

The model's final answer.

We could use a similar approach when streaming down thoughts from the backend to the frontend, so that the frontend could visually distinguish between the thoughts and the answer itself.
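For example, a backend could tag each streamed chunk as either thinking or answer before sending it on. This is just a sketch of one possible event shape, not the API of any particular framework:

import json

def thought_events(response_stream):
    """Yield newline-delimited JSON events, tagging chunks as 'thinking' or 'answer'."""
    is_thinking = False
    for event in response_stream:
        if not event.choices:
            continue
        content = event.choices[0].delta.content
        if content == "<think>":
            is_thinking = True
        elif content == "</think>":
            is_thinking = False
        elif content:
            yield json.dumps(
                {"type": "thinking" if is_thinking else "answer", "delta": content}
            ) + "\n"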

Tip: There are some questions that are so easy for the model to answer that the "thoughts" will simply be a newline - for example, if I just say "hi" to the model. We may want to consider that edge case in how we render thoughts. The vast majority of questions will have thoughts, however - even a seemingly simple question like "who painted the Mona Lisa?" had a long thinking process to determine that, yes, it was definitely Leonardo da Vinci.

Using DeepSeek-R1 with RAG

Since I spend most of my time these days on applications that use RAG (Retrieval-Augmented Generation), I wanted to see how it would handle answering questions based on provided context.

I used two RAG scenarios:

  • A CSV of hybrid cars, with 153 rows and 6 columns.
  • Document chunks from PDFs, from the search index created by this Azure RAG solution, with ~500 tokens in each chunk, and three chunks retrieved per question.
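In both scenarios, the retrieved context simply gets pasted into the user message along with the question. Here's a minimal sketch of that pattern for the CSV case - the filename and prompt wording are my own illustration, not the exact code from the live stream:

question = "What's the fastest hybrid car?"

# Illustrative only: read the CSV and include its rows as sources in the prompt.
with open("hybrid_cars.csv") as f:
    csv_text = f.read()

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": (
                "Answer the question using only the sources below.\n\n"
                f"Question: {question}\n\nSources:\n{csv_text}"
            ),
        },
    ],
)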

I started off trying RAG with Ollama and the 8B model, and I was surprised at the results: the model completely hallucinated the sources, seemingly ignoring what I passed to it, in a very confident way. For example, here's the model describing a line of the CSV and its format:

"F-Typesport; engine: 3.0L turbocharged I6; horsepower: 505.0; torque: 295 lb-ft; acceleration: 2.8s; top speed: 155 mph." So, each car's description continues with multiple attributes separated by semicolons and some other punctuation.

However, there is no "F-Typesport" in the CSV, no mention of torque or horsepower. It made that all up!

So I moved back to GitHub Models, where I could RAG with the full model with its 671 billion parameters. I got very good results with the full model, along with a very long thought process. I asked it "what's the fastest hybrid car?", and it went line-by-line through the CSV, mentioning the acceleration of each car. Here's how it started:

Looking at the CSV headers: vehicle, year, msrp, acceleration, mpg, class. The acceleration column is the one to focus on. The task is to scan through all entries and find the minimum acceleration value. Let me go through the data row by row and note the acceleration values. The first entry is the Prius (1st Gen) with 7.46 seconds. Then the Tino at 8.2, Prius (2nd Gen) at 7.97. The Insight from 2000 is 9.52, Civic (1st Gen) 2001 at 7.04. Wait, that's a low number. Continuing: Alphard 2003 is 8.33, Escape 2004 at 10.32, Lexus RX400h 2005 at 12.76. Then Highlander 2005 (12.76), Mercury Mariner 2006 (8.98), Camry 2006 (11.28), Lexus GS450h 2006 (18.65). That's pretty high. Estima 2006 (9.26), Altima 2006 (13.29).

At the end, it did some "double-checking" (like to make sure the fastest car was indeed a hybrid car), and then correctly reported the fastest car.

I tried the same query with gpt-4o and o1, and they both also correctly reported the fastest car. The difference, of course, is that DeepSeek-R1 shows its thought process. That can be helpful in some situations, and was actually helpful for me, as I'm not a car person and didn't know what column/value would properly answer that question.

I then tried RAG on the document chunks, and it came up with a correct answer, as well as providing the citations in the correct format (square brackets, page numbers, as requested in my prompt). Once again, it had a very lengthy thought process, where it looked at each source in turn and described what that source contributed to the answer. For example, here's how it started:

Okay, the user is asking what's included in the Northwind Health Plus plan that isn't in the standard plan. Let me go through the provided sources carefully. From Benefit_Options.pdf#page=3, it mentions that Northwind Health Plus covers emergency services, mental health and substance abuse, and out-of-network services, which the Standard plan doesn't. Also, Health Plus has more comprehensive prescription drug coverage and emergency services both in and out of network. Another source, Northwind_Standard_Benefits_Details.pdf#page=45, confirms that Standard covers preventive care, professional visits, and some diagnostics but doesn't list those additional services. The same Benefit_Options.pdf#page=3 also compares costs, but the key point here is the coverage differences. So, putting it together, the Plus plan includes those additional services and broader drug coverage that the Standard lacks.

Generally, it seems like the thinking process scales linearly with the size of the data: i.e. for the CSV, it scales with the number of rows, and for the document chunks, it scales with the number of chunks. I have not scientifically verified this, of course, so it's possible that the model would take a different approach with vastly different amounts of context, but this is what it looks like for the kind of data I sent to it. If anyone does verify that with some more rigor, let me know!

The thought process also looked at each line in the instructions portion of my prompt that described how to cite the sources, so we can expect longer thought processes for each additional instruction requested of the model. For example, this is the model trying to adhere to one of the lines:

But the user's instruction says: "Don't combine sources, list each source separately, for example [info1.txt][info2.pdf]." However, if all benefits are from the same source, it's allowed to list the source once per claim. Wait, no, if multiple facts are from the same source, each fact should be followed by the source. For example, "[Benefit_Options.pdf#page=3]" after each item.

That would make me think very carefully about each line in the prompt, knowing how much the model is actually paying attention to them. It also seems like a good way to iterate on prompts to find the clearest wording for the desired behavior.