Architecture
System Overview
The system consists of three layers: a chatbot service that orchestrates LLM calls and tool use, a set of MCP servers that expose domain-specific tools, and a Langfuse-based observability stack for tracing and evaluation.
graph TB
User([User]) -->|HTTP POST /chat| Chatbot
subgraph Chatbot Service
Chatbot[Axum HTTP Server]
Agent[Agentic Loop]
MCP[MCP Client]
LF[Langfuse Client]
Chatbot --> Agent
Agent --> MCP
Agent --> LF
end
Agent -->|Chat Completion API| OpenAI[OpenAI API]
subgraph MCP Servers
Fraud[mcp-server-fraud<br/>:3001]
KYC[mcp-server-kyc<br/>:3002]
Funding[mcp-server-funding<br/>:3003]
end
MCP -->|Streamable HTTP| Fraud
MCP -->|Streamable HTTP| KYC
MCP -->|Streamable HTTP| Funding
LF -->|REST API| Langfuse[Langfuse :3000]
subgraph Langfuse Infrastructure
Langfuse --> Postgres[(PostgreSQL)]
Langfuse --> ClickHouse[(ClickHouse)]
Langfuse --> Redis[(Redis)]
Langfuse --> MinIO[(MinIO)]
end
Agentic Loop
The core of the chatbot is an agentic tool-use loop. The LLM decides which tools to call based on the conversation context and system prompt rules.
flowchart TD
A[User sends message] --> B[Append to session history]
B --> C[Build messages:<br/>system prompt + history]
C --> D[Call OpenAI Chat Completion]
D --> E{Response contains<br/>tool_calls?}
E -->|Yes| F[Parse tool name & args]
F --> G[Route to MCP server]
G --> H[Execute tool via RMCP]
H --> I[Append tool result to history]
I --> D
E -->|No| J[Return text response to user]
J --> K[Record trace in Langfuse]
The loop continues until the LLM produces a final text response with no tool calls. The system prompt enforces mandatory tool chaining — for example, if fraud flags are found, the agent must also check the review status before responding.
MCP Integration
On startup, the chatbot connects to each MCP server and discovers available tools via the tools/list RPC. Tool definitions are converted from RMCP format to OpenAI's ChatCompletionTool format so the LLM can select them.
sequenceDiagram
participant Chatbot
participant Fraud as mcp-server-fraud
participant KYC as mcp-server-kyc
participant Funding as mcp-server-funding
Note over Chatbot: Startup - Tool Discovery
Chatbot->>Fraud: tools/list
Fraud-->>Chatbot: check_fraud_flags, get_fraud_review_status
Chatbot->>KYC: tools/list
KYC-->>Chatbot: get_kyc_status, get_required_documents
Chatbot->>Funding: tools/list
Funding-->>Chatbot: get_deposit_status, get_withdrawal_status
Note over Chatbot: Runtime - Tool Execution
Chatbot->>Chatbot: LLM requests check_fraud_flags(user_id)
Chatbot->>Fraud: tools/call check_fraud_flags
Fraud-->>Chatbot: {flag_type, severity}
MCP Server Tools
Each MCP server is an independent Axum + RMCP binary that returns deterministic data for test users (test_user_1 through test_user_5).
| Server | Port | Tools |
|---|---|---|
| Fraud | 3001 | check_fraud_flags(user_id), get_fraud_review_status(user_id) |
| KYC | 3002 | get_kyc_status(user_id), get_required_documents(user_id) |
| Funding | 3003 | get_deposit_status(user_id), get_withdrawal_status(user_id) |
HTTP API
The chatbot exposes a simple REST API via Axum:
| Method | Path | Description |
|---|---|---|
POST |
/chat |
Send a message. Body: {session_id, user_id, message} |
DELETE |
/chat/{session_id} |
Clear conversation history for a session |
Response format:
{
"response": "I can see your deposit is currently on hold...",
"tool_calls": [
{
"name": "get_deposit_status",
"arguments": "{\"user_id\": \"test_user_3\"}",
"result": "{\"deposit_id\": \"DEP-301\", ...}"
}
]
}
Session state is stored in an in-memory HashMap — it is not persisted across restarts.
Langfuse Observability
Every conversation turn creates a trace in Langfuse with:
- Trace — one per turn, tagged with
session_idfor multi-turn grouping - Spans/Events — tool calls, LLM requests, and results
Langfuse is deployed locally via Docker Compose with pre-seeded credentials (org, project, API keys) so it works out of the box.
graph LR
Chatbot -->|POST /api/public/ingestion| Langfuse
EvalRunner[Eval Runner] -->|Langfuse SDK| Langfuse
Langfuse --> UI[Langfuse UI<br/>localhost:3000]
Evaluation Architecture
The evaluation framework is a separate Python system that drives the chatbot and scores responses.
flowchart TD
subgraph Eval Runner
Sync[sync command] -->|Push datasets| LF[Langfuse]
Run[run command] -->|Drive chatbot| ChatAPI[Chatbot /chat]
Run -->|Score with judges| Judge[Judge LLM]
Run -->|Post scores| LF
Report[report command] -->|Fetch results| LF
end
subgraph Inputs
DS[datasets/*.yaml<br/>Golden test cases]
JD[judges/*.yaml<br/>Scoring rubrics]
end
DS --> Sync
DS --> Run
JD --> Run
Report --> Summary[Pass/Fail Summary Table]
See Evaluation Framework for details on datasets, judges, and running evals.
Key Design Decisions
| Decision | Rationale |
|---|---|
| Cargo workspace | Clear separation between chatbot and MCP servers while sharing dependencies |
| Deterministic MCP responses | Hardcoded per-user data enables reproducible evaluation results |
| Langfuse REST API (not SDK) | Lightweight custom client avoids pulling in a full Rust SDK |
| Python for evals | Leverages the Langfuse Python SDK for dataset management; separate concern from the Rust chatbot |
| Streamable HTTP transport | RMCP over HTTP for inter-service MCP communication |
| System prompt as file | Centralized policy enforcement, easy to iterate on without recompilation |
| Temperature 0.3 / 0.0 | Chatbot slightly creative (0.3); judges fully deterministic (0.0) |