diligence/TESTING.md
Marc J. Schmidt bd178fcaf0 Initial release: MCP server enforcing Worker-Reviewer loop
Diligence prevents AI agents from shipping quick fixes that break things
by enforcing a research-propose-verify loop before any code changes.

Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from .claude/CODEBASE_CONTEXT.md
- State persisted across sessions

Validated on production codebase: caught architectural mistake (broker
subscriptions on client-side code) that naive agent would have shipped.
2026-01-22 06:22:59 +01:00


Testing the Diligence MCP Server

Test Suite Overview

The diligence project includes a comprehensive test suite that validates:

  1. Workflow mechanics - State transitions, round limits, phase enforcement
  2. Mock scenarios - Predefined Worker-Reviewer interactions with expected outcomes
  3. Dry-run mode - Test against real projects without making changes

Quick Start

# Run all workflow tests
node test/run-tests.mjs --workflow

# Run mock scenario tests
node test/run-tests.mjs --mock

# Run a specific scenario
node test/run-tests.mjs --mock --scenario=blocking-voice

# Dry run against nexus (no changes)
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice

Test Structure

test/
├── mcp-client.mjs      # Programmatic MCP client
├── run-tests.mjs       # Main test runner
├── dry-run.mjs         # Dry-run against real projects
├── fixture/            # Mock codebase for testing
│   ├── src/
│   │   ├── broker/events.ts
│   │   ├── services/
│   │   └── controllers/
│   └── .claude/
│       └── CODEBASE_CONTEXT.md
└── scenarios/          # Test scenarios
    ├── index.json
    ├── blocking-voice.json
    └── permission-cache.json

Test Modes

1. Workflow Tests (--workflow)

Tests the MCP server mechanics without AI:

  • Phase transitions (conversation → researching → approved → implementing)
  • Round increment on NEEDS_WORK
  • Max rounds enforcement (resets after 5 rounds)
  • Feedback accumulation
  • Abort functionality

node test/run-tests.mjs --workflow
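
The mechanics listed above can be pictured as a small state machine. The following is a minimal sketch only: the function names, the state shape, and the assumption that exhausting the round budget resets the workflow to the `conversation` phase are illustrative and are not taken from the server's actual code.

```javascript
// Hypothetical sketch of the workflow mechanics. Function and field names
// are illustrative assumptions, not the server's actual API.
const MAX_ROUNDS = 5;

// Fresh state: no task in progress.
function createWorkflow() {
  return { phase: "conversation", round: 0, feedback: [] };
}

// conversation -> researching (round 1).
function start(state) {
  if (state.phase !== "conversation") {
    throw new Error(`cannot start from phase "${state.phase}"`);
  }
  return { ...state, phase: "researching", round: 1 };
}

// Apply a Reviewer verdict while researching.
function review(state, verdict, note = "") {
  if (state.phase !== "researching") {
    throw new Error(`cannot review in phase "${state.phase}"`);
  }
  if (verdict === "APPROVED") {
    return { ...state, phase: "approved" };
  }
  if (verdict !== "NEEDS_WORK") {
    throw new Error(`unknown verdict "${verdict}"`);
  }
  // Assumption: exhausting the round budget resets to a fresh workflow.
  if (state.round >= MAX_ROUNDS) {
    return createWorkflow();
  }
  // Otherwise accumulate the feedback and move to the next round.
  return {
    ...state,
    round: state.round + 1,
    feedback: [...state.feedback, note],
  };
}

// approved -> implementing; any other phase is rejected (phase enforcement).
function implement(state) {
  if (state.phase !== "approved") {
    throw new Error(`cannot implement in phase "${state.phase}"`);
  }
  return { ...state, phase: "implementing" };
}

// Abort returns to a fresh workflow from any phase.
function abort() {
  return createWorkflow();
}
```

Keeping each transition a pure function that returns a new state makes this kind of check easy to run without any AI in the loop, which is the point of the workflow mode.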

2. Mock Tests (--mock)

Tests complete Worker-Reviewer scenarios with predefined responses:

  1. Scenario defines a task and expected naive/correct fixes
  2. Test simulates Worker submitting naive proposal
  3. Reviewer catches issues, sends NEEDS_WORK
  4. Worker submits revised proposal with all fixes
  5. Reviewer approves
  6. Validates that proposal mentions all required elements

# All scenarios
node test/run-tests.mjs --mock

# Single scenario
node test/run-tests.mjs --mock --scenario=permission-cache
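
As a rough illustration of the mock flow, the sketch below replays a scripted Worker-Reviewer exchange and then checks the approved proposal for required elements. The script shape, the `runMockScenario` helper, and the plain substring matching are assumptions made for this example. The real runner may work differently.

```javascript
// Hypothetical scripted exchange for the blocking-voice scenario.
// The { proposal, verdict } shape is an assumption for illustration.
const script = [
  { proposal: "Add blocking check to answerDmCall", verdict: "NEEDS_WORK" },
  {
    proposal:
      "Add blocking check to answerDmCall and declineDmCall; " +
      "filter notifications for blocked users; " +
      "add voice cleanup to blockUser(); " +
      "subscribe to BusUserBlockChange for mid-call kicks",
    verdict: "APPROVED",
  },
];

const requiredElements = [
  "answerDmCall",
  "declineDmCall",
  "blockUser",
  "BusUserBlockChange",
];

// Replay the script until approval, then check the approved proposal.
function runMockScenario(steps, required) {
  const verdicts = [];
  let finalProposal = null;
  for (const step of steps) {
    verdicts.push(step.verdict);
    if (step.verdict === "APPROVED") {
      finalProposal = step.proposal;
      break;
    }
  }
  // Without an approved proposal, every required element counts as missing.
  const missing =
    finalProposal === null
      ? [...required]
      : required.filter((el) => !finalProposal.includes(el));
  return {
    verdicts,
    rounds: verdicts.length,
    approved: finalProposal !== null,
    missing,
    passed: finalProposal !== null && missing.length === 0,
  };
}
```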

3. Dry Run (--project)

Connects to the MCP server in a real project directory:

# With predefined scenario
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice

# With custom task
node test/dry-run.mjs --project=/path/to/project --task="Fix the caching bug"

This:

  • Starts the workflow with the task
  • Shows the full Worker Brief (including real CODEBASE_CONTEXT.md)
  • Does NOT make any code changes
  • Aborts the workflow on exit

Test Fixture

The fixture (test/fixture/) is a mini codebase that mirrors real-world patterns:

Files

| File | Purpose | Known Bugs (for testing) |
|------|---------|--------------------------|
| broker/events.ts | Event bus definitions | Reference implementation |
| services/user-block.service.ts | Blocking logic | Missing voice cleanup |
| services/voice-channel.service.ts | Voice/DM calls | Missing blocking check on answerDmCall |
| services/team.service.ts | Permission cache | Doesn't subscribe to role events |
| services/chat.service.ts | Correct pattern | Shows permission vs action separation |
| controllers/roles.controller.ts | Role CRUD | Missing broker events on create/delete |

Patterns Tested

  1. Broker event emission - Every state change should emit events
  2. Cache invalidation - Caches must subscribe to relevant events
  3. Permission vs Action - Permissions control visibility, actions have separate checks
  4. Multi-location fixes - If one place has a check, similar places need it too
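
To make the first two patterns concrete, here is a minimal sketch of a cache that invalidates itself in response to a broker event. The `Broker` and `TeamPermissionCache` classes and the `{ teamId }` payload are hypothetical stand-ins written for this example. Only the event name `BusTeamRoleChange` comes from the scenarios in this document, and the fixture's real code may look quite different.

```javascript
// Hypothetical in-memory event bus, standing in for the fixture's broker.
class Broker {
  constructor() {
    this.handlers = new Map();
  }
  subscribe(event, handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }
  publish(event, payload) {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Hypothetical permission cache that subscribes to role changes.
class TeamPermissionCache {
  constructor(broker, loadPermissions) {
    this.cache = new Map();
    this.loadPermissions = loadPermissions;
    // Pattern 2: the cache subscribes to the event that makes it stale.
    broker.subscribe("BusTeamRoleChange", ({ teamId }) => {
      this.cache.delete(teamId);
    });
  }
  get(teamId) {
    if (!this.cache.has(teamId)) {
      this.cache.set(teamId, this.loadPermissions(teamId));
    }
    return this.cache.get(teamId);
  }
}
```

The cache only stays fresh if every code path that changes roles also publishes the event (pattern 1), which is why the permission-cache scenario requires broker events in both createRole() and deleteRole().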

Test Scenarios

blocking-voice

Task: Fix blocked users can still answer DM voice calls

Naive fix: Add blocking check to answerDmCall only

Correct fix:

  • Add blocking check to answerDmCall
  • Add blocking check to declineDmCall
  • Filter notifications for blocked users
  • Add voice cleanup to blockUser()
  • Subscribe to BusUserBlockChange for mid-call kicks

permission-cache

Task: Fix permission cache doesn't invalidate when roles change

Naive fix: Add .clear() somewhere

Correct fix:

  • Subscribe to BusTeamRoleChange in team.service
  • Subscribe to BusTeamMemberRoleChange in team.service
  • Add broker event to createRole()
  • Add broker event to deleteRole()

Adding New Scenarios

Create a new JSON file in test/scenarios/:

{
  "id": "my-scenario",
  "name": "Human-readable name",
  "description": "Brief description",

  "task": "The task description given to the Worker",

  "naive_fix": {
    "description": "What a quick-fix agent would do",
    "changes": [
      { "file": "path/to/file.ts", "change": "Quick fix description" }
    ],
    "issues": [
      "What the naive fix misses #1",
      "What the naive fix misses #2"
    ]
  },

  "correct_fix": {
    "description": "Complete fix description",
    "required_changes": [
      { "file": "path.ts", "function": "funcName", "change": "What to change" }
    ],
    "required_broker_subscriptions": [
      { "service": "x.service.ts", "event": "BusEventName", "action": "What to do" }
    ]
  },

  "validation_criteria": {
    "must_mention": ["keyword1", "keyword2"],
    "should_reference_pattern": "reference-file.ts"
  }
}

Add to test/scenarios/index.json:

{
  "scenarios": [
    { "id": "my-scenario", "file": "my-scenario.json" }
  ]
}
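
How `validation_criteria` is evaluated is not spelled out here, so the following is only a sketch of one plausible check. It assumes case-insensitive substring matching for `must_mention` and treats `should_reference_pattern` as a soft signal that is reported but does not fail the scenario. The `checkValidationCriteria` name is hypothetical.

```javascript
// Hypothetical check of a proposal against a scenario's validation_criteria.
// Assumes case-insensitive substring matching; the real runner may differ.
function checkValidationCriteria(criteria, proposalText) {
  const text = proposalText.toLowerCase();
  const missing = (criteria.must_mention ?? []).filter(
    (keyword) => !text.includes(keyword.toLowerCase())
  );
  const pattern = criteria.should_reference_pattern;
  // Assumption: "should" is advisory, so it is reported but does not affect ok.
  const referencesPattern = pattern ? proposalText.includes(pattern) : true;
  return { ok: missing.length === 0, missing, referencesPattern };
}

// Example with the placeholder values from the template above.
const example = checkValidationCriteria(
  {
    must_mention: ["keyword1", "keyword2"],
    should_reference_pattern: "reference-file.ts",
  },
  "This proposal covers Keyword1 only."
);
console.log(example.missing);
```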

Real-World Testing

For true validation, test with real Claude sub-agents:

Root Agent:
  1. Call mcp__diligence__start with task
  2. Spawn Worker agent (Task tool) with get_worker_brief
  3. Worker researches, submits proposal via mcp__diligence__propose
  4. Spawn Reviewer agent (Task tool) with get_reviewer_brief
  5. Reviewer verifies claims, submits via mcp__diligence__review
  6. If NEEDS_WORK, spawn new Worker with updated brief
  7. If APPROVED, proceed to implementation
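
The loop above could be sketched as follows. This is a simplified, synchronous illustration: the `tools` object is a hypothetical set of stand-ins for the MCP tool calls and the Task-tool spawns, and real calls would be asynchronous. In the real flow the Worker and Reviewer submit through `mcp__diligence__propose` and `mcp__diligence__review` themselves, whereas here the loop does it on their behalf to keep the control flow visible.

```javascript
// Hypothetical orchestration sketch. `tools` stands in for:
//   start / propose / review  -> the mcp__diligence__* tool calls
//   getWorkerBrief / getReviewerBrief -> the brief tools
//   spawnWorker / spawnReviewer -> fresh sub-agents via the Task tool
function runDiligenceLoop(task, tools, maxRounds = 5) {
  tools.start(task);
  for (let round = 1; round <= maxRounds; round++) {
    // A fresh Worker each round, given the current brief
    // (which includes any earlier Reviewer feedback).
    const proposal = tools.spawnWorker(tools.getWorkerBrief());
    tools.propose(proposal);

    // A fresh Reviewer that sees only its own brief.
    const verdict = tools.spawnReviewer(tools.getReviewerBrief());
    tools.review(verdict);

    if (verdict.status === "APPROVED") {
      return { approved: true, rounds: round, proposal };
    }
    // NEEDS_WORK: loop again with a new Worker and the updated brief.
  }
  return { approved: false, rounds: maxRounds, proposal: null };
}
```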

Why separate agents matter:

  • Fresh context = no bias from previous reasoning
  • Reviewer doesn't know Worker's search results
  • Forces genuine verification, not rubber-stamping

Success Criteria

  1. Reviewer catches issues that Worker initially misses
  2. Multiple rounds occur before approval
  3. Final proposal is more complete than naive approach
  4. Validation criteria are all met

Environment Variables

| Variable | Purpose |
|----------|---------|
| DEBUG=1 | Show MCP server stderr output |
| ANTHROPIC_API_KEY | Required for --live mode (future) |

CI Integration

# Run in CI
npm test

# Or directly
node test/run-tests.mjs --workflow && node test/run-tests.mjs --mock

Exit codes: 0 = pass, 1 = fail