Evaluation Framework

The evaluation framework is a Python CLI that drives the chatbot with golden test cases and scores responses using LLM-as-judge rubrics. Results are posted to Langfuse for tracking and comparison.

Overview

flowchart LR
    Datasets[datasets/*.yaml] --> Runner[run_eval.py]
    Judges[judges/*.yaml] --> Runner
    Runner -->|Drive chatbot| Chatbot[Chatbot API]
    Runner -->|Score responses| JudgeLLM[Judge LLM]
    Runner -->|Record results| Langfuse

The pipeline has three commands:

| Command | Purpose |
|---------|---------|
| sync | Push golden datasets to Langfuse |
| run | Execute test cases, score with judges, post results |
| report | Print pass/fail summary table |

Golden Datasets

Test cases are defined in YAML files under evals/datasets/:

| File | Scenarios |
|------|-----------|
| fraud_scenarios.yaml | Fraud flag checks, review status |
| kyc_scenarios.yaml | KYC verification, document requirements |
| funding_scenarios.yaml | Deposit/withdrawal status |
| cross_domain.yaml | Queries spanning multiple domains |
| escalation.yaml | Frustrated users, escalation handling |

Each test case specifies:

- id: fraud_basic_check
  description: User asks about account restriction
  input:
    test_user: test_user_1
    messages:
      - "My account seems restricted, can you help?"
  expected:
    tool_calls:
      - check_fraud_flags
      - get_fraud_review_status
    scoring:
      correctness: 4
      policy_compliance: 4

  • input.test_user: determines which hardcoded MCP data is returned
  • input.messages: conversation turns to send (supports multi-turn)
  • expected.tool_calls: tools the agent should call (checked deterministically)
  • expected.scoring: minimum thresholds per judge dimension (1-5 scale)
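
The two `expected` checks can be sketched in Python as below. This is a minimal illustration, not the runner's actual code: the helper names `missing_tool_calls` and `failed_thresholds` are hypothetical, and the sketch assumes that call order is ignored and that a dimension with no judge score counts as a failure.

```python
def missing_tool_calls(expected, actual):
    """Return the expected tool names that never appear in the actual calls."""
    called = set(actual)
    return [name for name in expected if name not in called]


def failed_thresholds(thresholds, scores):
    """Return the dimensions whose score is below its minimum (or missing)."""
    return [
        dimension
        for dimension, minimum in thresholds.items()
        if scores.get(dimension, 0) < minimum
    ]


# The `expected` block of the fraud_basic_check case, written as a plain dict.
expected = {
    "tool_calls": ["check_fraud_flags", "get_fraud_review_status"],
    "scoring": {"correctness": 4, "policy_compliance": 4},
}

# Hypothetical observations from one run of the chatbot.
missing = missing_tool_calls(expected["tool_calls"], ["check_fraud_flags"])
failed = failed_thresholds(
    expected["scoring"], {"correctness": 5, "policy_compliance": 3}
)
print(missing)  # tools the agent skipped
print(failed)   # dimensions below threshold
```

Keeping the tool-call check free of any LLM means a regression in tool selection shows up as a hard failure rather than as a drifting judge score.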

LLM-as-Judge Scoring

Five judge dimensions evaluate different aspects of the chatbot's response:

| Judge | What It Measures |
|-------|------------------|
| Correctness | Is the response factually accurate given the tool results? |
| Empathy | Does it acknowledge the customer's situation? |
| Directness | Does it get to the point without unnecessary filler? |
| Policy Compliance | Does it follow rules (e.g., never say "fraud", no overpromising)? |
| Tool Usage | Did it call the right tools and chain them correctly? |

Each judge is defined in evals/judges/ as a YAML file containing:

  • A prompt template with a 1-5 scale rubric
  • Concrete examples of each score level
  • Model and temperature settings (temperature 0 for consistency)
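
To make the shape of a judge concrete, here is a rough Python sketch of how a definition might be rendered into a prompt and how a numeric score could be extracted from the judge's reply. The dictionary keys, the prompt wording, and the `Score: <n>` reply convention are all assumptions made for illustration; the real files in evals/judges/ may use a different schema and output format.

```python
import re

# Hypothetical judge definition; the actual YAML schema may differ.
judge = {
    "name": "directness",
    "model": "gpt-4o",
    "temperature": 0,  # temperature 0 for consistency, as described above
    "prompt": (
        "Rate the response for directness on a 1-5 scale.\n"
        "Response:\n{response}\n"
        "Reply with 'Score: <n>'."
    ),
}


def render_prompt(judge, response):
    """Fill the chatbot's response into the judge's prompt template."""
    return judge["prompt"].format(response=response)


def parse_score(judge_reply):
    """Extract an integer score in the 1-5 range from the judge's reply."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        raise ValueError("judge reply contains no 'Score: <n>' line")
    score = int(match.group(1))
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    return score


prompt = render_prompt(judge, "Your deposit clears tomorrow.")
score = parse_score("The answer is short and on point. Score: 5")
print(score)
```

Rejecting out-of-range or missing scores, instead of silently defaulting, keeps a malformed judge reply from being recorded as a valid result.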

Running Evaluations

Prerequisites

cd evals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
set -a            # export every variable assigned while sourcing (assumes plain KEY=value lines)
source ../.env
set +a

Sync Datasets

Push golden datasets to Langfuse so they appear in the UI:

python run_eval.py sync

Run Evaluation

Execute all test cases and score them:

python run_eval.py run --run-name "experiment-1"

This will:

  1. Load all datasets and judge definitions
  2. For each test case, send messages to the chatbot's /chat endpoint
  3. Check tool call expectations deterministically
  4. Score the final response with each judge LLM
  5. Post scores to the corresponding Langfuse trace
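
Steps 2 to 5 can be sketched as a single function with its collaborators injected, which keeps the flow testable without a live chatbot, judge LLM, or Langfuse instance. Everything here is an assumption for illustration: the function name `run_case`, the reply shape (`text`, `tool_calls`, `trace_id`), and the callable signatures are hypothetical and not taken from run_eval.py.

```python
def run_case(case, send_message, judges, post_score):
    """Run one test case and return (tools_ok, scores)."""
    user = case["input"]["test_user"]
    tool_calls = []
    reply = None
    # Step 2: send every conversation turn; keep the last reply.
    for message in case["input"]["messages"]:
        reply = send_message(user, message)
        tool_calls.extend(reply["tool_calls"])
    if reply is None:
        raise ValueError("test case has no messages")

    # Step 3: deterministic tool-call check (order ignored in this sketch).
    expected_tools = case["expected"].get("tool_calls", [])
    tools_ok = all(tool in tool_calls for tool in expected_tools)

    # Steps 4 and 5: score the final response, then record each score.
    scores = {}
    for dimension, judge in judges.items():
        scores[dimension] = judge(reply["text"])
        post_score(reply["trace_id"], dimension, scores[dimension])
    return tools_ok, scores


# Stubs standing in for the /chat endpoint, a judge LLM, and Langfuse.
def fake_chatbot(user, message):
    return {
        "text": "Your account is under review.",
        "tool_calls": ["check_fraud_flags", "get_fraud_review_status"],
        "trace_id": "trace-1",
    }


posted = []
case = {
    "input": {"test_user": "test_user_1", "messages": ["Can you help?"]},
    "expected": {"tool_calls": ["check_fraud_flags"]},
}
tools_ok, scores = run_case(
    case,
    fake_chatbot,
    {"correctness": lambda text: 5},
    lambda trace_id, dimension, score: posted.append((trace_id, dimension, score)),
)
print(tools_ok, scores, posted)
```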

View Report

Print a summary table with pass/fail status:

python run_eval.py report

The command exits with code 1 if any test case fails its scoring thresholds, making it suitable for CI pipelines.
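
The exit-code behavior can be sketched as follows. The result shape (`id`, `passed`), the table layout, and the second case id are hypothetical; only the rule "exit 1 if any case fails" comes from the description above.

```python
def summarize(results):
    """Return (report_text, exit_code) for a list of per-case results."""
    lines = []
    failed = 0
    for result in results:
        status = "PASS" if result["passed"] else "FAIL"
        if not result["passed"]:
            failed += 1
        lines.append(f"{result['id']:<24} {status}")
    lines.append(f"{len(results) - failed} passed, {failed} failed")
    # Non-zero exit code when at least one case failed its thresholds.
    return "\n".join(lines), (1 if failed else 0)


results = [
    {"id": "fraud_basic_check", "passed": True},
    {"id": "example_failing_case", "passed": False},  # hypothetical case id
]
report, code = summarize(results)
print(report)
# A CLI entry point would finish with: raise SystemExit(code)
```

In a CI job, the non-zero exit code is what fails the step, so no log parsing is needed.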

Test Users

The MCP servers return deterministic, hardcoded data keyed on the user_id, so the tool results behind each test case are the same no matter when the evaluation runs.

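| User | Fraud | KYC | Funding |
|------|-------|-----|---------|
| test_user_1 | Has flags | - | - |
| test_user_2 | - | Pending verification | - |
| test_user_3 | - | - | Deposit on hold |
| test_user_4 | - | - | Withdrawal blocked |
| test_user_5 | Multiple flags | Expired KYC | Deposit + withdrawal issues |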
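
The idea of deterministic fixtures can be illustrated with a small Python lookup. This is only a sketch of the principle: the MCP servers are separate services and are not implemented this way, the `FIXTURES` mapping and `lookup` helper are hypothetical, and the assignment of each status to the Fraud, KYC, or Funding domain is inferred from its wording.

```python
# Hypothetical fixture table mirroring the test users listed above.
FIXTURES = {
    "test_user_1": {"fraud": "Has flags"},
    "test_user_2": {"kyc": "Pending verification"},
    "test_user_3": {"funding": "Deposit on hold"},
    "test_user_4": {"funding": "Withdrawal blocked"},
    "test_user_5": {
        "fraud": "Multiple flags",
        "kyc": "Expired KYC",
        "funding": "Deposit + withdrawal issues",
    },
}


def lookup(user_id, domain):
    """Return the canned status for a user and domain, or None if none is recorded."""
    return FIXTURES.get(user_id, {}).get(domain)


# No clock, randomness, or external state is involved, so repeated calls agree.
print(lookup("test_user_5", "kyc"))
print(lookup("test_user_1", "funding"))
```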

Docker Integration Suite

Instead of running the chatbot and eval framework locally, you can run the entire pipeline in Docker. This starts all infrastructure, the chatbot, and the eval runner in containers — no local Rust or Python required.

Prerequisites

  • Docker and Docker Compose
  • A .env file with at least OPENAI_API_KEY set (see Getting Started)

Running

docker compose --profile integration up -d

This brings up everything from the base docker compose up (Langfuse, MCP servers, etc.) plus:

| Service | Purpose |
|---------|---------|
| chatbot | The chatbot HTTP server (port 8080) |
| eval-runner | Syncs datasets, runs all evals, writes report |

The eval runner waits for the chatbot and Langfuse to be healthy, then automatically:

  1. Syncs golden datasets to Langfuse
  2. Runs all test cases and scores them with LLM judges
  3. Writes output to .reports/eval-output.txt (bind-mounted from the host)
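
One way a runner could wait for its dependencies is a polling loop like the sketch below. This is an assumption about the mechanism, not a description of the actual container: the wait may equally be handled by Docker Compose health checks, and the helper names and any health URL passed to `http_probe` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Build a probe that returns True when `url` answers with a 2xx status."""
    def probe():
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                return 200 <= response.status < 300
        except (urllib.error.URLError, OSError):
            return False  # connection refused, timeout, or HTTP error status
    return probe


def wait_until_healthy(probe, timeout_s=60.0, interval_s=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it succeeds; return False once the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a fake probe that becomes healthy on the third attempt.
attempts = {"count": 0}

def flaky_probe():
    attempts["count"] += 1
    return attempts["count"] >= 3

healthy = wait_until_healthy(flaky_probe, timeout_s=5, interval_s=0,
                             sleep=lambda seconds: None)
print(healthy, attempts["count"])
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in a container, a call such as `wait_until_healthy(http_probe(url))` would be made once per dependency before syncing datasets.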

Viewing Results

Watch the eval runner logs in real time:

docker compose logs -f eval-runner

Or check the report after it completes:

cat .reports/eval-output.txt

Results are also visible in Langfuse under the dataset runs.

Environment Variables

The integration suite uses these env vars from your .env file:

| Variable | Required | Default |
|----------|----------|---------|
| OPENAI_API_KEY | Yes | - |
| JUDGE_API_KEY | No | Falls back to OPENAI_API_KEY |
| JUDGE_BASE_URL | No | https://api.openai.com/v1 |
| JUDGE_MODEL | No | gpt-4o |
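
The fallback rules in the table can be expressed in a few lines of Python. The function name `judge_settings` is hypothetical, and treating an empty value the same as an unset one is an assumption of this sketch rather than documented behavior.

```python
import os


def judge_settings(env=None):
    """Resolve judge settings from environment variables, applying the defaults."""
    env = os.environ if env is None else env
    return {
        # Raises KeyError if neither key is set, since OPENAI_API_KEY is required.
        "api_key": env.get("JUDGE_API_KEY") or env["OPENAI_API_KEY"],
        "base_url": env.get("JUDGE_BASE_URL") or "https://api.openai.com/v1",
        "model": env.get("JUDGE_MODEL") or "gpt-4o",
    }


# Only the required variable is set, so every judge setting falls back.
settings = judge_settings({"OPENAI_API_KEY": "sk-example"})
print(settings)
```

Passing a plain dict instead of reading `os.environ` directly makes the resolution easy to exercise in isolation.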

Tearing Down

docker compose --profile integration down

Add -v to also remove persistent volumes (Langfuse database, ClickHouse, MinIO).

Configuration

Eval configuration lives in evals/config.yaml:

chatbot_url: http://localhost:8080
judge:
  model: gpt-4o
  temperature: 0
langfuse:
  base_url: http://localhost:3000

The judge model and chatbot model can differ — you might use a stronger model as judge to evaluate a cheaper model as chatbot.