Evaluation Framework

The evaluation framework is a Python CLI that drives the chatbot with golden test cases and scores responses using LLM-as-judge rubrics. Results are posted to Langfuse for tracking and comparison.

Overview

flowchart LR
    Datasets[datasets/*.yaml] --> Runner[run_eval.py]
    Judges[judges/*.yaml] --> Runner
    Runner -->|Drive chatbot| Chatbot[Chatbot API]
    Runner -->|Score responses| JudgeLLM[Judge LLM]
    Runner -->|Record results| Langfuse

The CLI exposes three commands:

Command	Purpose
`sync`	Push golden datasets to Langfuse
`run`	Execute test cases, score with judges, post results
`report`	Print pass/fail summary table

Golden Datasets

Test cases are defined in YAML files under evals/datasets/:

File	Scenarios
`fraud_scenarios.yaml`	Fraud flag checks, review status
`kyc_scenarios.yaml`	KYC verification, document requirements
`funding_scenarios.yaml`	Deposit/withdrawal status
`cross_domain.yaml`	Queries spanning multiple domains
`escalation.yaml`	Frustrated users, escalation handling

Each test case specifies:

- id: fraud_basic_check
  description: User asks about account restriction
  input:
    test_user: test_user_1
    messages:
      - "My account seems restricted, can you help?"
  expected:
    tool_calls:
      - check_fraud_flags
      - get_fraud_review_status
    scoring:
      correctness: 4
      policy_compliance: 4

input.test_user — determines which hardcoded MCP data is returned
input.messages — conversation turns to send (supports multi-turn)
expected.tool_calls — tools the agent should call (checked deterministically)
expected.scoring — minimum thresholds per judge dimension (1-5 scale)
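To make the two kinds of expectation concrete, here is a minimal sketch of how they could be checked once a test case has been parsed. The class and function names are hypothetical, and it assumes that tool-call order is ignored and extra tool calls are allowed; the actual runner may be stricter.

```python
from dataclasses import dataclass, field


@dataclass
class Expected:
    """Mirrors the `expected` block of a test case (hypothetical helper class)."""
    tool_calls: list[str] = field(default_factory=list)
    scoring: dict[str, int] = field(default_factory=dict)


def missing_tool_calls(expected: Expected, actual_calls: list[str]) -> list[str]:
    """Return the expected tools that the agent never called.

    Assumption: order is ignored and additional calls are not penalised.
    """
    called = set(actual_calls)
    return [tool for tool in expected.tool_calls if tool not in called]


def failed_dimensions(expected: Expected, scores: dict[str, int]) -> list[str]:
    """Return the judge dimensions that scored below their minimum threshold.

    A dimension with no recorded score is treated as a failure.
    """
    return [
        dim
        for dim, minimum in expected.scoring.items()
        if scores.get(dim, 0) < minimum
    ]
```

A test case passes under this sketch only when both functions return an empty list.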

LLM-as-Judge Scoring

Five judge dimensions evaluate different aspects of the chatbot's response:

Judge	What It Measures
Correctness	Is the response factually accurate given the tool results?
Empathy	Does it acknowledge the customer's situation?
Directness	Does it get to the point without unnecessary filler?
Policy Compliance	Does it follow rules (e.g., never say "fraud", no overpromising)?
Tool Usage	Did it call the right tools and chain them correctly?

Each judge is defined in evals/judges/ as a YAML file containing:

A prompt template with a 1-5 scale rubric
Concrete examples of each score level
Model and temperature settings (temperature 0 for consistency)
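As an illustration of how a judge definition might be used, the sketch below renders a prompt template and extracts a 1-5 score from the judge's reply. The dictionary stands in for a parsed YAML file; its field names, the prompt wording, and the "Score: n" reply format are all assumptions, not the project's actual schema.

```python
import re

# Hypothetical judge definition, shaped like a parsed YAML file.
JUDGE = {
    "name": "directness",
    "model": "gpt-4o",
    "temperature": 0,  # temperature 0 keeps scoring as consistent as possible
    "prompt": (
        "Rate the response for directness on a 1-5 scale.\n"
        "Response: {response}\n"
        "Reply with 'Score: <n>'."
    ),
}


def render_prompt(judge: dict, response: str) -> str:
    """Fill the judge's prompt template with the chatbot response."""
    return judge["prompt"].format(response=response)


def parse_score(judge_output: str) -> int:
    """Extract a score in the range 1-5 from the judge's reply.

    Raises ValueError when no valid score is present, so that a malformed
    judge reply is never silently recorded as a score.
    """
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge output: {judge_output!r}")
    return int(match.group(1))
```

Rejecting unparseable replies, rather than defaulting to a score, keeps a judge failure distinguishable from a low rating.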

Running Evaluations

Prerequisites

cd evals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
set -a            # export every variable assigned while sourcing
source ../.env
set +a

Sync Datasets

Push golden datasets to Langfuse so they appear in the UI:

python run_eval.py sync

Run Evaluation

Execute all test cases and score them:

python run_eval.py run --run-name "experiment-1"

This will:

Load all datasets and judge definitions
For each test case, send messages to the chatbot's /chat endpoint
Check tool call expectations deterministically
Score the final response with each judge LLM
Post scores to the corresponding Langfuse trace
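The steps above can be sketched for a single test case as follows. The chatbot client and the judges are passed in as plain callables with hypothetical signatures, so the example runs without a network; posting scores to Langfuse is left out because the client calls are not shown in this document.

```python
from typing import Callable

# Hypothetical signatures, chosen only for this sketch.
SendMessage = Callable[[str, str], tuple[str, list[str]]]  # (user, message) -> (reply, tool calls)
Judge = Callable[[str], int]                                # reply -> score on the 1-5 scale


def run_case(case: dict, send_message: SendMessage, judges: dict[str, Judge]) -> dict:
    """Drive one test case and return its scores and pass/fail status."""
    user = case["input"]["test_user"]
    reply, tool_calls = "", []
    for message in case["input"]["messages"]:  # multi-turn: send every turn in order
        reply, calls = send_message(user, message)
        tool_calls.extend(calls)

    expected = case["expected"]
    # Deterministic check: every expected tool must have been called.
    tools_ok = all(tool in tool_calls for tool in expected.get("tool_calls", []))
    # Only the final reply is scored, once per judge dimension.
    scores = {name: judge(reply) for name, judge in judges.items()}
    thresholds_ok = all(
        scores.get(dim, 0) >= minimum
        for dim, minimum in expected.get("scoring", {}).items()
    )
    return {"id": case["id"], "scores": scores, "passed": tools_ok and thresholds_ok}
```

Injecting the client and judges this way also makes the loop easy to exercise with stubs before pointing it at a live chatbot.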

View Report

Print a summary table with pass/fail status:

python run_eval.py report

The report command exits with code 1 if any test case falls below its scoring thresholds, which makes it suitable as a gate in CI pipelines.
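A minimal sketch of that behaviour, assuming each result is a dictionary with hypothetical `id` and `passed` keys; the real report layout may differ.

```python
def format_report(results: list[dict]) -> str:
    """Render a two-column pass/fail table, one row per test case."""
    lines = [f"{'TEST CASE':<30} STATUS"]
    for result in results:
        status = "PASS" if result["passed"] else "FAIL"
        lines.append(f"{result['id']:<30} {status}")
    return "\n".join(lines)


def exit_code(results: list[dict]) -> int:
    """Return 1 if any test case failed, otherwise 0."""
    return 0 if all(result["passed"] for result in results) else 1
```

A CLI entry point would print `format_report(results)` and then call `sys.exit(exit_code(results))`, so that a CI job fails whenever a threshold is missed.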

Test Users

The MCP servers return deterministic, hardcoded data based on the user_id. This ensures evaluations are reproducible regardless of when they run.

User	Fraud	KYC	Funding
`test_user_1`	Has flags	—	—
`test_user_2`	—	Pending verification	—
`test_user_3`	—	—	Deposit on hold
`test_user_4`	—	—	Withdrawal blocked
`test_user_5`	Multiple flags	Expired KYC	Deposit + withdrawal issues

Docker Integration Suite

Instead of running the chatbot and eval framework locally, you can run the entire pipeline in Docker. This starts all infrastructure, the chatbot, and the eval runner in containers — no local Rust or Python required.

Prerequisites

Docker and Docker Compose
An .env file with at least OPENAI_API_KEY set (see Getting Started)

Running

docker compose --profile integration up -d

This brings up everything from the base docker compose up (Langfuse, MCP servers, etc.) plus:

Service	Purpose
`chatbot`	The chatbot HTTP server (port 8080)
`eval-runner`	Syncs datasets, runs all evals, writes report

The eval runner waits for the chatbot and Langfuse to be healthy, then automatically:

Syncs golden datasets to Langfuse
Runs all test cases and scores them with LLM judges
Writes output to .reports/eval-output.txt (bind-mounted from the host)
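This document does not say how the wait is implemented; it may rely entirely on Docker Compose health checks. If a similar wait were needed inside a script, a generic polling loop could look like the sketch below, where `check` is a hypothetical callable that returns True once a service responds.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 2.0,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed.

    Returns True if the service became healthy in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic()` rather than wall-clock time keeps the timeout correct even if the system clock is adjusted while waiting.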

Viewing Results

Watch the eval runner logs in real time:

docker compose logs -f eval-runner

Or check the report after it completes:

cat .reports/eval-output.txt

Results are also visible in Langfuse under the dataset runs.

Environment Variables

The integration suite uses these env vars from your .env file:

Variable	Required	Default
`OPENAI_API_KEY`	Yes	—
`JUDGE_API_KEY`	No	Falls back to `OPENAI_API_KEY`
`JUDGE_BASE_URL`	No	`https://api.openai.com/v1`
`JUDGE_MODEL`	No	`gpt-4o`
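The fallback rules in the table can be expressed as a small resolver. This is a sketch only: the function name and the returned keys are hypothetical, and the runner's actual lookup code is not shown here.

```python
import os
from typing import Mapping, Optional


def judge_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve judge settings from environment variables using the table's defaults."""
    env = os.environ if env is None else env
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        # JUDGE_API_KEY is optional and falls back to OPENAI_API_KEY.
        "api_key": env.get("JUDGE_API_KEY") or openai_key,
        "base_url": env.get("JUDGE_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("JUDGE_MODEL", "gpt-4o"),
    }
```

Passing a plain dictionary as `env` makes the resolver testable without touching the real process environment.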

Tearing Down

docker compose --profile integration down

Add -v to also remove persistent volumes (Langfuse database, ClickHouse, MinIO).

Configuration

Eval configuration lives in evals/config.yaml:

chatbot_url: http://localhost:8080
judge:
  model: gpt-4o
  temperature: 0
langfuse:
  base_url: http://localhost:3000

The judge model and chatbot model can differ — you might use a stronger model as judge to evaluate a cheaper model as chatbot.