Evaluation Framework
The evaluation framework is a Python CLI that drives the chatbot with golden test cases and scores responses using LLM-as-judge rubrics. Results are posted to Langfuse for tracking and comparison.
Overview
flowchart LR
Datasets[datasets/*.yaml] --> Runner[run_eval.py]
Judges[judges/*.yaml] --> Runner
Runner -->|Drive chatbot| Chatbot[Chatbot API]
Runner -->|Score responses| JudgeLLM[Judge LLM]
Runner -->|Record results| Langfuse
The pipeline has three commands:
| Command | Purpose |
|---|---|
| sync | Push golden datasets to Langfuse |
| run | Execute test cases, score with judges, post results |
| report | Print pass/fail summary table |
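A three-command CLI like this is commonly built with subcommands. The sketch below is only an illustration of that shape using `argparse`; the real `run_eval.py` may be organised differently, and `build_parser` is a hypothetical name, not taken from the project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline step (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="run_eval.py",
        description="Chatbot evaluation CLI (sketch, not the project's actual code)",
    )
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("sync", help="Push golden datasets to Langfuse")
    sub.add_parser("run", help="Execute test cases, score with judges, post results")
    sub.add_parser("report", help="Print pass/fail summary table")
    return parser


# Parse an explicit argument list so the sketch runs without real CLI input.
args = build_parser().parse_args(["run"])
print(f"would dispatch to: {args.command}")
```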
Golden Datasets
Test cases are defined in YAML files under evals/datasets/:
| File | Scenarios |
|---|---|
| fraud_scenarios.yaml | Fraud flag checks, review status |
| kyc_scenarios.yaml | KYC verification, document requirements |
| funding_scenarios.yaml | Deposit/withdrawal status |
| cross_domain.yaml | Queries spanning multiple domains |
| escalation.yaml | Frustrated users, escalation handling |
Each test case specifies:
- id: fraud_basic_check
  description: User asks about account restriction
  input:
    test_user: test_user_1
    messages:
      - "My account seems restricted, can you help?"
  expected:
    tool_calls:
      - check_fraud_flags
      - get_fraud_review_status
    scoring:
      correctness: 4
      policy_compliance: 4
- input.test_user: determines which hardcoded MCP data is returned
- input.messages: conversation turns to send (supports multi-turn)
- expected.tool_calls: tools the agent should call (checked deterministically)
- expected.scoring: minimum thresholds per judge dimension (1-5 scale)
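The two `expected` checks can be pictured as plain comparisons. The following is a minimal sketch, assuming a test case has already been parsed into a dictionary; the helper names `missing_tool_calls` and `failed_thresholds` are hypothetical, and treating a missing score as a failure is an assumption rather than documented behaviour.

```python
from typing import Dict, List


def missing_tool_calls(expected: List[str], actual: List[str]) -> List[str]:
    """Return the expected tools the agent never called, in expected order."""
    called = set(actual)
    return [tool for tool in expected if tool not in called]


def failed_thresholds(scores: Dict[str, int], minimums: Dict[str, int]) -> Dict[str, int]:
    """Return each dimension whose score is below its minimum.

    Assumption: a dimension with no score at all is treated as 0 and fails.
    """
    return {
        dim: scores.get(dim, 0)
        for dim, minimum in minimums.items()
        if scores.get(dim, 0) < minimum
    }


# The "expected" section of the fraud_basic_check case shown above.
expected = {
    "tool_calls": ["check_fraud_flags", "get_fraud_review_status"],
    "scoring": {"correctness": 4, "policy_compliance": 4},
}

print(missing_tool_calls(expected["tool_calls"], ["check_fraud_flags"]))
print(failed_thresholds({"correctness": 5, "policy_compliance": 3}, expected["scoring"]))
```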
LLM-as-Judge Scoring
Five judge dimensions evaluate different aspects of the chatbot's response:
| Judge | What It Measures |
|---|---|
| Correctness | Is the response factually accurate given the tool results? |
| Empathy | Does it acknowledge the customer's situation? |
| Directness | Does it get to the point without unnecessary filler? |
| Policy Compliance | Does it follow rules (e.g., never say "fraud", no overpromising)? |
| Tool Usage | Did it call the right tools and chain them correctly? |
Each judge is defined in evals/judges/ as a YAML file containing:
- A prompt template with a 1-5 scale rubric
- Concrete examples of each score level
- Model and temperature settings (temperature 0 for consistency)
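To make the judge definition more concrete, here is a rough sketch of rendering a rubric prompt and extracting a 1-5 score from the judge's reply. The template wording, the `Score: <n>` reply convention, and the function names are all assumptions made for illustration; the actual judge YAML files may use a different prompt and output format.

```python
import re
from string import Template

# Hypothetical rubric template; the real prompts live in evals/judges/*.yaml.
RUBRIC = Template(
    "Rate the response for $dimension on a 1-5 scale.\n"
    "Tool results: $tool_results\n"
    "Response: $response\n"
    "Reply with 'Score: <n>' followed by a short justification."
)


def render_prompt(dimension: str, tool_results: str, response: str) -> str:
    """Fill the rubric template for one judge dimension."""
    return RUBRIC.substitute(
        dimension=dimension, tool_results=tool_results, response=response
    )


def parse_score(judge_output: str) -> int:
    """Extract a single-digit 1-5 score; raise if the reply has no valid score."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge output: {judge_output!r}")
    return int(match.group(1))


prompt = render_prompt("correctness", "flags: [velocity]", "Your account is under review.")
print(parse_score("Score: 4\nAccurate given the tool results."))
```

Rejecting out-of-range or missing scores up front keeps a malformed judge reply from silently counting as a pass or a fail.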
Running Evaluations
Prerequisites
cd evals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
source ../.env
Sync Datasets
Push golden datasets to Langfuse so they appear in the UI:
Run Evaluation
Execute all test cases and score them:
This will:
- Load all datasets and judge definitions
- For each test case, send messages to the chatbot's /chat endpoint
- Check tool call expectations deterministically
- Score the final response with each judge LLM
- Post scores to the corresponding Langfuse trace
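The per-case steps above can be sketched as a single function. This is an illustration only: the chatbot call and the judges are passed in as plain callables so the sketch stays self-contained, posting to Langfuse is left out, and `run_case`, its argument shapes, and the result keys are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shapes: chat returns (final_response, tool_calls_made);
# each judge maps the final response to a 1-5 score.
ChatFn = Callable[[str, List[str]], Tuple[str, List[str]]]
JudgeFn = Callable[[str], int]


def run_case(case: dict, chat: ChatFn, judges: Dict[str, JudgeFn]) -> dict:
    """Drive one test case, check tool calls, and score the final response."""
    response, tool_calls = chat(case["input"]["test_user"], case["input"]["messages"])
    expected = case["expected"]

    # Deterministic check: every expected tool must have been called.
    missing = [t for t in expected.get("tool_calls", []) if t not in tool_calls]

    # LLM-as-judge scores, one per dimension.
    scores = {name: judge(response) for name, judge in judges.items()}
    below = [
        dim
        for dim, minimum in expected.get("scoring", {}).items()
        if scores.get(dim, 0) < minimum
    ]
    return {
        "id": case["id"],
        "scores": scores,
        "missing_tools": missing,
        "below_threshold": below,
        "passed": not missing and not below,
    }


def fake_chat(user: str, messages: List[str]) -> Tuple[str, List[str]]:
    """Stand-in for the real /chat call."""
    return "Your account is under review.", ["check_fraud_flags", "get_fraud_review_status"]


case = {
    "id": "fraud_basic_check",
    "input": {"test_user": "test_user_1", "messages": ["My account seems restricted, can you help?"]},
    "expected": {
        "tool_calls": ["check_fraud_flags", "get_fraud_review_status"],
        "scoring": {"correctness": 4, "policy_compliance": 4},
    },
}
result = run_case(case, fake_chat, {"correctness": lambda r: 5, "policy_compliance": lambda r: 3})
print(result["passed"], result["below_threshold"])
```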
View Report
Print a summary table with pass/fail status:
The command exits with code 1 if any test case fails its scoring thresholds, making it suitable for CI pipelines.
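The exit-code behaviour amounts to a one-line rule over the per-case results. A minimal sketch, assuming each result is a dictionary with hypothetical `id` and `passed` keys; the table layout is also invented for illustration and may not match the real report.

```python
from typing import Dict, List


def format_report(results: List[Dict]) -> str:
    """Render a plain-text pass/fail table, one row per test case."""
    lines = [f"{'test case':<24} status"]
    for r in results:
        lines.append(f"{r['id']:<24} {'PASS' if r['passed'] else 'FAIL'}")
    return "\n".join(lines)


def exit_code(results: List[Dict]) -> int:
    """Return 1 if any test case failed, otherwise 0."""
    return 1 if any(not r["passed"] for r in results) else 0


results = [
    {"id": "fraud_basic_check", "passed": True},
    {"id": "kyc_pending", "passed": False},
]
print(format_report(results))
# A real CLI would finish with: sys.exit(exit_code(results))
```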
Test Users
The MCP servers return deterministic, hardcoded data based on the user_id. This ensures evaluations are reproducible regardless of when they run.
| User | Fraud | KYC | Funding |
|---|---|---|---|
| test_user_1 | Has flags | — | — |
| test_user_2 | — | Pending verification | — |
| test_user_3 | — | — | Deposit on hold |
| test_user_4 | — | — | Withdrawal blocked |
| test_user_5 | Multiple flags | Expired KYC | Deposit + withdrawal issues |
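The idea of deterministic, user-keyed data can be illustrated with a plain lookup table. This sketch mirrors the table above but is not the MCP servers' actual implementation; the `FIXTURES` structure, the use of `None` for an empty cell, and the behaviour for unknown users are assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory mirror of the test-user table; None means "no issue".
FIXTURES: Dict[str, Dict[str, Optional[str]]] = {
    "test_user_1": {"fraud": "Has flags", "kyc": None, "funding": None},
    "test_user_2": {"fraud": None, "kyc": "Pending verification", "funding": None},
    "test_user_3": {"fraud": None, "kyc": None, "funding": "Deposit on hold"},
    "test_user_4": {"fraud": None, "kyc": None, "funding": "Withdrawal blocked"},
    "test_user_5": {
        "fraud": "Multiple flags",
        "kyc": "Expired KYC",
        "funding": "Deposit + withdrawal issues",
    },
}


def domains_with_issues(user_id: str) -> List[str]:
    """List the domains in which a test user has an issue.

    Assumption: an unknown user_id yields no issues rather than an error.
    """
    record = FIXTURES.get(user_id, {})
    return [domain for domain, status in record.items() if status is not None]


print(domains_with_issues("test_user_5"))
```

Because the lookup depends only on `user_id`, the same test case sees the same data on every run.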
Docker Integration Suite
Instead of running the chatbot and eval framework locally, you can run the entire pipeline in Docker. This starts all infrastructure, the chatbot, and the eval runner in containers — no local Rust or Python required.
Prerequisites
- Docker and Docker Compose
- An .env file with at least OPENAI_API_KEY set (see Getting Started)
Running
This brings up everything from the base docker compose up (Langfuse, MCP servers, etc.) plus:
| Service | Purpose |
|---|---|
| chatbot | The chatbot HTTP server (port 8080) |
| eval-runner | Syncs datasets, runs all evals, writes report |
The eval runner waits for the chatbot and Langfuse to be healthy, then automatically:
- Syncs golden datasets to Langfuse
- Runs all test cases and scores them with LLM judges
- Writes output to .reports/eval-output.txt (bind-mounted from the host)
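Waiting for dependencies to become healthy is usually a poll-with-timeout loop. The sketch below shows that pattern in isolation; how the eval runner actually detects health (for example via Compose health checks or an HTTP probe) is not stated here, so the `check` callable, the timeout values, and the function name are all assumptions.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if the service became healthy, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Simulated service that reports healthy on the third probe.
probes = iter([False, False, True])
print(wait_until_healthy(lambda: next(probes), sleep=lambda _: None))
```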
Viewing Results
Watch the eval runner logs in real time:
Or check the report after it completes:
Results are also visible in Langfuse under the dataset runs.
Environment Variables
The integration suite uses these env vars from your .env file:
| Variable | Required | Default |
|---|---|---|
| OPENAI_API_KEY | Yes | — |
| JUDGE_API_KEY | No | Falls back to OPENAI_API_KEY |
| JUDGE_BASE_URL | No | https://api.openai.com/v1 |
| JUDGE_MODEL | No | gpt-4o |
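The fallback and default rules in this table can be expressed as a small resolver. A minimal sketch, using the variable names and defaults from the table; the function name `judge_settings` and the returned dictionary shape are hypothetical, and treating an empty `JUDGE_API_KEY` as unset is an assumption.

```python
import os
from typing import Dict, Mapping


def judge_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve judge configuration from environment variables."""
    # JUDGE_API_KEY is optional and falls back to OPENAI_API_KEY.
    api_key = env.get("JUDGE_API_KEY") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise KeyError("OPENAI_API_KEY is required when JUDGE_API_KEY is not set")
    return {
        "api_key": api_key,
        "base_url": env.get("JUDGE_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("JUDGE_MODEL", "gpt-4o"),
    }


# Passing a plain dict keeps the example independent of the real environment.
print(judge_settings({"OPENAI_API_KEY": "sk-example"})["model"])
```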
Tearing Down
Add -v to also remove persistent volumes (Langfuse database, ClickHouse, MinIO).
Configuration
Eval configuration lives in evals/config.yaml:
chatbot_url: http://localhost:8080
judge:
  model: gpt-4o
  temperature: 0
langfuse:
  base_url: http://localhost:3000
The judge model and chatbot model can differ — you might use a stronger model as judge to evaluate a cheaper model as chatbot.