Architecture

System Overview

The system consists of three layers: a chatbot service that orchestrates LLM calls and tool use, a set of MCP servers that expose domain-specific tools, and a Langfuse-based observability stack for tracing and evaluation.

graph TB
    User([User]) -->|HTTP POST /chat| Chatbot

    subgraph Chatbot Service
        Chatbot[Axum HTTP Server]
        Agent[Agentic Loop]
        MCP[MCP Client]
        LF[Langfuse Client]
        Chatbot --> Agent
        Agent --> MCP
        Agent --> LF
    end

    Agent -->|Chat Completion API| OpenAI[OpenAI API]

    subgraph MCP Servers
        Fraud[mcp-server-fraud<br/>:3001]
        KYC[mcp-server-kyc<br/>:3002]
        Funding[mcp-server-funding<br/>:3003]
    end

    MCP -->|Streamable HTTP| Fraud
    MCP -->|Streamable HTTP| KYC
    MCP -->|Streamable HTTP| Funding

    LF -->|REST API| Langfuse[Langfuse :3000]

    subgraph Langfuse Infrastructure
        Langfuse --> Postgres[(PostgreSQL)]
        Langfuse --> ClickHouse[(ClickHouse)]
        Langfuse --> Redis[(Redis)]
        Langfuse --> MinIO[(MinIO)]
    end

Agentic Loop

The core of the chatbot is an agentic tool-use loop. The LLM decides which tools to call based on the conversation context and system prompt rules.

flowchart TD
    A[User sends message] --> B[Append to session history]
    B --> C[Build messages:<br/>system prompt + history]
    C --> D[Call OpenAI Chat Completion]
    D --> E{Response contains<br/>tool_calls?}
    E -->|Yes| F[Parse tool name & args]
    F --> G[Route to MCP server]
    G --> H[Execute tool via RMCP]
    H --> I[Append tool result to history]
    I --> D
    E -->|No| J[Return text response to user]
    J --> K[Record trace in Langfuse]

The loop continues until the LLM produces a final text response with no tool calls. The system prompt enforces mandatory tool chaining — for example, if fraud flags are found, the agent must also check the review status before responding.
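
The flow above can be sketched in Rust roughly as follows. This is a minimal illustration, not the chatbot's actual code: the names `ToolCall`, `LlmReply`, `Message`, and `run_turn` are hypothetical, the model and the tool executor are injected as closures so the sketch runs without network access, and the `max_iterations` cap is an added safeguard that the description above does not mention.

```rust
/// One tool invocation requested by the model (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    name: String,
    arguments: String, // JSON-encoded arguments, passed through untouched
}

/// What the model returns for a single completion request.
#[derive(Debug, Clone, PartialEq)]
enum LlmReply {
    ToolCalls(Vec<ToolCall>),
    Text(String),
}

/// Entries in the session history. A real Chat Completions history would also
/// keep the assistant message that requested the tool calls; omitted here.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    User(String),
    ToolResult { name: String, content: String },
    Assistant(String),
}

/// Runs one turn: call the model, execute any requested tools, feed the
/// results back, and stop at the first plain-text reply.
fn run_turn<L, T>(
    history: &mut Vec<Message>,
    user_message: &str,
    mut llm: L,
    mut run_tool: T,
    max_iterations: usize, // assumption: a cap so a misbehaving model cannot loop forever
) -> Result<String, String>
where
    L: FnMut(&[Message]) -> LlmReply,
    T: FnMut(&ToolCall) -> String,
{
    history.push(Message::User(user_message.to_string()));
    for _ in 0..max_iterations {
        match llm(history.as_slice()) {
            LlmReply::ToolCalls(calls) => {
                // Append every tool result, then ask the model again.
                for call in &calls {
                    let content = run_tool(call);
                    history.push(Message::ToolResult {
                        name: call.name.clone(),
                        content,
                    });
                }
            }
            LlmReply::Text(text) => {
                history.push(Message::Assistant(text.clone()));
                return Ok(text);
            }
        }
    }
    Err(format!("no final response after {max_iterations} tool rounds"))
}
```

Calling `run_turn` with a closure that first requests `check_fraud_flags` and then returns text leaves three entries in the history: the user message, the tool result, and the assistant reply. Because the model and the tool executor are passed in as closures, the loop can be tested without OpenAI or the MCP servers.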

MCP Integration

On startup, the chatbot connects to each MCP server and discovers available tools via the `tools/list` RPC. Tool definitions are converted from RMCP format to OpenAI's `ChatCompletionTool` format so the LLM can select them.

sequenceDiagram
    participant Chatbot
    participant Fraud as mcp-server-fraud
    participant KYC as mcp-server-kyc
    participant Funding as mcp-server-funding

    Note over Chatbot: Startup - Tool Discovery
    Chatbot->>Fraud: tools/list
    Fraud-->>Chatbot: check_fraud_flags, get_fraud_review_status
    Chatbot->>KYC: tools/list
    KYC-->>Chatbot: get_kyc_status, get_required_documents
    Chatbot->>Funding: tools/list
    Funding-->>Chatbot: get_deposit_status, get_withdrawal_status

    Note over Chatbot: Runtime - Tool Execution
    Chatbot->>Chatbot: LLM requests check_fraud_flags(user_id)
    Chatbot->>Fraud: tools/call check_fraud_flags
    Fraud-->>Chatbot: {flag_type, severity}
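
The discovery and routing steps could be modelled as shown below. This is a sketch under stated assumptions rather than the project's code: `McpTool`, `OpenAiTool`, `ToolRegistry`, and their methods are hypothetical names, JSON Schemas are kept as raw text to avoid external crates, and the `http://localhost:<port>` base URLs in the usage note are assumed (only the port numbers come from this document).

```rust
use std::collections::HashMap;

/// Tool metadata as returned by an MCP `tools/list` call.
#[derive(Debug, Clone)]
struct McpTool {
    name: String,
    description: String,
    input_schema: String, // JSON Schema, kept as raw JSON text in this sketch
}

/// Function-tool shape used by the Chat Completions API:
/// {"type": "function", "function": {"name", "description", "parameters"}}.
#[derive(Debug, Clone, PartialEq)]
struct OpenAiTool {
    kind: &'static str, // always "function" here
    name: String,
    description: String,
    parameters: String, // the MCP input schema, passed through unchanged
}

/// Converts one discovered MCP tool into the OpenAI tool shape.
fn to_openai_tool(tool: &McpTool) -> OpenAiTool {
    OpenAiTool {
        kind: "function",
        name: tool.name.clone(),
        description: tool.description.clone(),
        parameters: tool.input_schema.clone(),
    }
}

/// Built once at startup: tool definitions for the LLM plus a
/// tool-name -> server lookup used when executing tool calls.
#[derive(Debug, Default)]
struct ToolRegistry {
    tools: Vec<OpenAiTool>,
    routes: HashMap<String, String>,
}

impl ToolRegistry {
    /// Records the tools one server reported. If two servers report the
    /// same tool name, the later registration wins the route.
    fn register(&mut self, server_url: &str, discovered: &[McpTool]) {
        for tool in discovered {
            self.tools.push(to_openai_tool(tool));
            self.routes.insert(tool.name.clone(), server_url.to_string());
        }
    }

    /// Returns the server that owns `tool_name`, if any.
    fn route(&self, tool_name: &str) -> Option<&str> {
        self.routes.get(tool_name).map(String::as_str)
    }
}
```

After registering the fraud server's tools under `http://localhost:3001`, `route("check_fraud_flags")` returns that URL and an unknown tool name returns `None`, which the caller can report back to the LLM as a tool error instead of panicking.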

MCP Server Tools

Each MCP server is an independent Axum + RMCP binary that returns deterministic data for test users (`test_user_1` through `test_user_5`).

Server	Port	Tools
Fraud	3001	`check_fraud_flags(user_id)`, `get_fraud_review_status(user_id)`
KYC	3002	`get_kyc_status(user_id)`, `get_required_documents(user_id)`
Funding	3003	`get_deposit_status(user_id)`, `get_withdrawal_status(user_id)`

HTTP API

The chatbot exposes a simple REST API via Axum:

Method	Path	Description
`POST`	`/chat`	Send a message. Body: `{session_id, user_id, message}`
`DELETE`	`/chat/{session_id}`	Clear conversation history for a session

Response format:

{
  "response": "I can see your deposit is currently on hold...",
  "tool_calls": [
    {
      "name": "get_deposit_status",
      "arguments": "{\"user_id\": \"test_user_3\"}",
      "result": "{\"deposit_id\": \"DEP-301\", ...}"
    }
  ]
}

Session state is stored in an in-memory `HashMap`; it is not persisted across restarts.
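
A session store of this kind might look like the sketch below. The `SessionStore` type and its methods are hypothetical, and messages are reduced to plain strings for brevity. In a concurrent Axum server the map would also need shared ownership and locking (for example `Arc<Mutex<...>>`), which is left out here.

```rust
use std::collections::HashMap;

/// In-memory conversation history keyed by session ID.
/// Everything is lost when the process exits.
#[derive(Debug, Default)]
struct SessionStore {
    sessions: HashMap<String, Vec<String>>,
}

impl SessionStore {
    /// Appends a message, creating the session on first use.
    /// Returns the new history length.
    fn append(&mut self, session_id: &str, message: &str) -> usize {
        let history = self.sessions.entry(session_id.to_string()).or_default();
        history.push(message.to_string());
        history.len()
    }

    /// Returns the stored history, or an empty slice for unknown sessions.
    fn history(&self, session_id: &str) -> &[String] {
        self.sessions
            .get(session_id)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }

    /// Drops a session, as a `DELETE /chat/{session_id}` handler might.
    /// Returns whether the session existed.
    fn clear(&mut self, session_id: &str) -> bool {
        self.sessions.remove(session_id).is_some()
    }
}
```

Returning a `bool` from `clear` lets a handler distinguish "cleared" from "no such session" if it wants to respond differently; whether the real endpoint does so is not stated in this document.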

Langfuse Observability

Every conversation turn creates a trace in Langfuse with:

- Trace: one per turn, tagged with `session_id` for multi-turn grouping
- Spans/Events: tool calls, LLM requests, and results
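
One way to picture the per-turn structure is the sketch below. `Observation`, `TurnTrace`, and `event_types` are hypothetical names. The sketch assumes that LLM requests are sent as `generation-create` events and tool calls as `span-create` events. That mapping and the event type strings are assumptions to check against the Langfuse ingestion API reference, not something this document specifies.

```rust
/// One thing that happened during a turn (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
enum Observation {
    LlmRequest { model: String },
    ToolCall { name: String, result: String },
}

/// Everything recorded for a single conversation turn.
#[derive(Debug, Clone, PartialEq)]
struct TurnTrace {
    trace_id: String,
    session_id: String, // groups the turns of one conversation
    observations: Vec<Observation>,
}

impl TurnTrace {
    /// Ingestion event types for this turn, in send order:
    /// the trace itself first, then one event per observation.
    fn event_types(&self) -> Vec<&'static str> {
        let mut types = vec!["trace-create"];
        for observation in &self.observations {
            types.push(match observation {
                // Assumption: LLM calls are recorded as generations.
                Observation::LlmRequest { .. } => "generation-create",
                // Assumption: tool calls are recorded as spans.
                Observation::ToolCall { .. } => "span-create",
            });
        }
        types
    }
}
```

A turn with an LLM request, one tool call, and a second LLM request would yield one trace event followed by three observation events, all of which can be sent in a single ingestion batch.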

Langfuse is deployed locally via Docker Compose with pre-seeded credentials (org, project, API keys) so it works out of the box.

graph LR
    Chatbot -->|POST /api/public/ingestion| Langfuse
    EvalRunner[Eval Runner] -->|Langfuse SDK| Langfuse
    Langfuse --> UI[Langfuse UI<br/>localhost:3000]

Evaluation Architecture

The evaluation framework is a separate Python system that drives the chatbot and scores responses.

flowchart TD
    subgraph Eval Runner
        Sync[sync command] -->|Push datasets| LF[Langfuse]
        Run[run command] -->|Drive chatbot| ChatAPI[Chatbot /chat]
        Run -->|Score with judges| Judge[Judge LLM]
        Run -->|Post scores| LF
        Report[report command] -->|Fetch results| LF
    end

    subgraph Inputs
        DS[datasets/*.yaml<br/>Golden test cases]
        JD[judges/*.yaml<br/>Scoring rubrics]
    end

    DS --> Sync
    DS --> Run
    JD --> Run

    Report --> Summary[Pass/Fail Summary Table]

See Evaluation Framework for details on datasets, judges, and running evals.

Key Design Decisions

Decision	Rationale
Cargo workspace	Clear separation between chatbot and MCP servers while sharing dependencies
Deterministic MCP responses	Hardcoded per-user data enables reproducible evaluation results
Langfuse REST API (not SDK)	Lightweight custom client avoids pulling in a full Rust SDK
Python for evals	Leverages the Langfuse Python SDK for dataset management; separate concern from the Rust chatbot
Streamable HTTP transport	RMCP over HTTP for inter-service MCP communication
System prompt as file	Centralized policy enforcement, easy to iterate on without recompilation
Temperature 0.3 / 0.0	Chatbot slightly creative (0.3); judges as close to deterministic as the API allows (0.0), since temperature 0 reduces but does not eliminate run-to-run variation