
Marc J. Schmidt bd178fcaf0 Initial release: MCP server enforcing Worker-Reviewer loop

Diligence prevents AI agents from shipping quick fixes that break things
by enforcing a research-propose-verify loop before any code changes.

Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from .claude/CODEBASE_CONTEXT.md
- State persisted across sessions

Validated on production codebase: caught architectural mistake (broker
subscriptions on client-side code) that naive agent would have shipped.

2026-01-22 06:22:59 +01:00


Testing the Diligence MCP Server

Test Suite Overview

The diligence project includes a comprehensive test suite that validates:

Workflow mechanics - State transitions, round limits, phase enforcement
Mock scenarios - Predefined Worker-Reviewer interactions with expected outcomes
Dry-run mode - Test against real projects without making changes

Quick Start

# Run all workflow tests
node test/run-tests.mjs --workflow

# Run mock scenario tests
node test/run-tests.mjs --mock

# Run a specific scenario
node test/run-tests.mjs --mock --scenario=blocking-voice

# Dry run against nexus (no changes); $HOME is used because bash does not expand ~ after --project=
node test/dry-run.mjs --project="$HOME/bude/codecharm/nexus" --scenario=blocking-voice

Test Structure

test/
├── mcp-client.mjs      # Programmatic MCP client
├── run-tests.mjs       # Main test runner
├── dry-run.mjs         # Dry-run against real projects
├── fixture/            # Mock codebase for testing
│   ├── src/
│   │   ├── broker/events.ts
│   │   ├── services/
│   │   └── controllers/
│   └── .claude/
│       └── CODEBASE_CONTEXT.md
└── scenarios/          # Test scenarios
    ├── index.json
    ├── blocking-voice.json
    └── permission-cache.json

Test Modes

1. Workflow Tests (`--workflow`)

Tests the MCP server mechanics without AI:

Phase transitions (conversation → researching → approved → implementing)
Round increment on NEEDS_WORK
Max rounds enforcement (resets after 5 rounds)
Feedback accumulation
Abort functionality

node test/run-tests.mjs --workflow
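
The mechanics listed above can be pictured as a small state machine. The following is a minimal sketch only, not the server's actual code: the function names (`createWorkflow`, `start`, `review`, `implement`) and the state shape are hypothetical, and the reset to `conversation` after the fifth NEEDS_WORK is an assumed reading of "resets after 5 rounds".

```javascript
// Hypothetical sketch of the workflow mechanics; names and state shape are
// illustrative assumptions, not the Diligence server's real API.
const MAX_ROUNDS = 5; // round limit stated in this document

function createWorkflow() {
  return { phase: "conversation", round: 0, feedback: [] };
}

function start(state) {
  if (state.phase !== "conversation") {
    throw new Error(`cannot start from phase "${state.phase}"`);
  }
  return { ...state, phase: "researching", round: 1 };
}

function review(state, verdict, note) {
  if (state.phase !== "researching") {
    throw new Error(`cannot review in phase "${state.phase}"`);
  }
  if (verdict === "APPROVED") {
    return { ...state, phase: "approved" };
  }
  if (verdict !== "NEEDS_WORK") {
    throw new Error(`unknown verdict "${verdict}"`);
  }
  if (state.round >= MAX_ROUNDS) {
    // Assumed reset behaviour: the workflow starts over from scratch.
    return createWorkflow();
  }
  // NEEDS_WORK: accumulate the feedback and move to the next round.
  return {
    ...state,
    round: state.round + 1,
    feedback: [...state.feedback, note],
  };
}

function implement(state) {
  if (state.phase !== "approved") {
    throw new Error(`cannot implement from phase "${state.phase}"`);
  }
  return { ...state, phase: "implementing" };
}
```

Keeping each transition a pure function makes phase enforcement easy to test: calling `implement` before approval throws instead of silently changing state.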

2. Mock Tests (`--mock`)

Tests complete Worker-Reviewer scenarios with predefined responses:

Scenario defines a task and expected naive/correct fixes
Test simulates Worker submitting naive proposal
Reviewer catches issues, sends NEEDS_WORK
Worker submits revised proposal with all fixes
Reviewer approves
Validates that proposal mentions all required elements

# All scenarios
node test/run-tests.mjs --mock

# Single scenario
node test/run-tests.mjs --mock --scenario=permission-cache
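
The final validation step ("proposal mentions all required elements") could be implemented as a keyword check against the scenario's `must_mention` list. This is a sketch under assumptions: `missingKeywords` is a hypothetical helper, the case-insensitive substring match is a guess at the matching rule, and the keyword list is made up for illustration.

```javascript
// Hypothetical helper: returns the required keywords that a proposal
// does not mention. Case-insensitive substring matching is an assumption.
function missingKeywords(proposalText, mustMention) {
  const haystack = proposalText.toLowerCase();
  return mustMention.filter(
    (keyword) => !haystack.includes(keyword.toLowerCase())
  );
}

// Illustrative keyword list, not taken from the real scenario file.
const exampleKeywords = ["answerDmCall", "declineDmCall", "BusUserBlockChange"];

const naiveProposal = "Add blocking check to answerDmCall";
const revisedProposal =
  "Add blocking checks to answerDmCall and declineDmCall, " +
  "and subscribe to BusUserBlockChange for mid-call kicks";

console.log(missingKeywords(naiveProposal, exampleKeywords));
console.log(missingKeywords(revisedProposal, exampleKeywords));
```

With these inputs the naive proposal is reported as missing two keywords, while the revised proposal passes with an empty list.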

3. Dry Run (`--project`)

Connects to the MCP server in a real project directory:

# With predefined scenario
node test/dry-run.mjs --project="$HOME/bude/codecharm/nexus" --scenario=blocking-voice

# With custom task
node test/dry-run.mjs --project=/path/to/project --task="Fix the caching bug"

A dry run:

Starts the workflow with the task
Shows the full Worker Brief (including real CODEBASE_CONTEXT.md)
Does NOT make any code changes
Aborts the workflow on exit
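
For illustration, the `--project`, `--scenario` and `--task` flags could be parsed as below. This is a hedged sketch, not how `dry-run.mjs` is known to work: `parseDryRunArgs` is a hypothetical function, and expanding a leading `~` inside the script is an assumption, added because bash passes `--project=~/...` through unexpanded.

```javascript
// Hypothetical argument parser for the dry-run flags shown above.
// homeDir is passed in explicitly so the function stays pure and testable.
function parseDryRunArgs(argv, homeDir) {
  const options = {};
  for (const arg of argv) {
    // Only --name=value pairs are recognised in this sketch.
    const match = /^--([a-z-]+)=(.*)$/.exec(arg);
    if (match) {
      options[match[1]] = match[2];
    }
  }
  if (!options.project) {
    throw new Error("--project is required");
  }
  // Assumed convenience: expand a leading "~" that the shell left alone.
  if (options.project === "~" || options.project.startsWith("~/")) {
    options.project = homeDir + options.project.slice(1);
  }
  return options;
}

const parsed = parseDryRunArgs(
  ["--project=~/bude/codecharm/nexus", "--scenario=blocking-voice"],
  "/home/dev" // illustrative home directory
);
console.log(parsed.project, parsed.scenario);
```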

Test Fixture

The fixture (test/fixture/) is a mini codebase that mirrors real-world patterns:

Files

| File | Purpose | Known Bugs or Role (for testing) |
| --- | --- | --- |
| `broker/events.ts` | Event bus definitions | Reference implementation |
| `services/user-block.service.ts` | Blocking logic | Missing voice cleanup |
| `services/voice-channel.service.ts` | Voice/DM calls | Missing blocking check on answerDmCall |
| `services/team.service.ts` | Permission cache | Doesn't subscribe to role events |
| `services/chat.service.ts` | Correct pattern | Shows permission vs action separation |
| `controllers/roles.controller.ts` | Role CRUD | Missing broker events on create/delete |

Patterns Tested

Broker event emission - Every state change should emit events
Cache invalidation - Caches must subscribe to relevant events
Permission vs Action - Permissions control visibility, actions have separate checks
Multi-location fixes - If one place has a check, similar places need it too

Test Scenarios

blocking-voice

Task: Fix the bug where blocked users can still answer DM voice calls

Naive fix: Add blocking check to answerDmCall only

Correct fix:

Add blocking check to answerDmCall
Add blocking check to declineDmCall
Filter notifications for blocked users
Add voice cleanup to blockUser()
Subscribe to BusUserBlockChange for mid-call kicks

permission-cache

Task: Fix the permission cache, which doesn't invalidate when roles change

Naive fix: Add .clear() somewhere

Correct fix:

Subscribe to BusTeamRoleChange in team.service
Subscribe to BusTeamMemberRoleChange in team.service
Add broker event to createRole()
Add broker event to deleteRole()
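
The shape of the correct fix can be sketched with an in-memory broker. Everything here is an illustrative assumption rather than the fixture's code: the `Broker` and `TeamPermissionCache` classes, the `{ teamId }` payload, and the plain JavaScript (the fixture itself is TypeScript). Only the event names come from the scenario.

```javascript
// Hypothetical in-memory broker standing in for the real event bus.
class Broker {
  constructor() {
    this.handlers = new Map();
  }
  subscribe(event, handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }
  publish(event, payload) {
    for (const handler of this.handlers.get(event) ?? []) {
      handler(payload);
    }
  }
}

// Hypothetical cache that invalidates itself on role events instead of
// relying on callers to remember to clear it.
class TeamPermissionCache {
  constructor(broker, loadPermissions) {
    this.cache = new Map();
    this.loadPermissions = loadPermissions;
    const invalidate = ({ teamId }) => this.cache.delete(teamId);
    broker.subscribe("BusTeamRoleChange", invalidate);
    broker.subscribe("BusTeamMemberRoleChange", invalidate);
  }
  get(teamId) {
    if (!this.cache.has(teamId)) {
      this.cache.set(teamId, this.loadPermissions(teamId));
    }
    return this.cache.get(teamId);
  }
}

// The controller side of the fix: state changes must publish an event.
function deleteRole(broker, roles, teamId, roleId) {
  roles.delete(roleId);
  broker.publish("BusTeamRoleChange", { teamId });
}
```

The naive `.clear()` fix only works for the one call site that remembers to call it; subscribing to the events covers every code path that publishes a role change, which is why `createRole()` and `deleteRole()` also need to emit them.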

Adding New Scenarios

Create a new JSON file in test/scenarios/:

{
  "id": "my-scenario",
  "name": "Human-readable name",
  "description": "Brief description",

  "task": "The task description given to the Worker",

  "naive_fix": {
    "description": "What a quick-fix agent would do",
    "changes": [
      { "file": "path/to/file.ts", "change": "Quick fix description" }
    ],
    "issues": [
      "What the naive fix misses #1",
      "What the naive fix misses #2"
    ]
  },

  "correct_fix": {
    "description": "Complete fix description",
    "required_changes": [
      { "file": "path.ts", "function": "funcName", "change": "What to change" }
    ],
    "required_broker_subscriptions": [
      { "service": "x.service.ts", "event": "BusEventName", "action": "What to do" }
    ]
  },

  "validation_criteria": {
    "must_mention": ["keyword1", "keyword2"],
    "should_reference_pattern": "reference-file.ts"
  }
}

Add to test/scenarios/index.json:

{
  "scenarios": [
    { "id": "my-scenario", "file": "my-scenario.json" }
  ]
}
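
Before registering a new scenario, it can help to check that the file has the expected shape. The sketch below is an assumption about which fields matter, inferred from the template above; `validateScenario` is a hypothetical helper and the test runner may enforce different rules.

```javascript
// Hypothetical shape check for a scenario object parsed from JSON.
// Which fields are mandatory is an assumption based on the template above.
function validateScenario(scenario) {
  const errors = [];

  for (const key of ["id", "name", "description", "task"]) {
    if (typeof scenario[key] !== "string" || scenario[key].trim() === "") {
      errors.push(`"${key}" must be a non-empty string`);
    }
  }

  const issues = scenario.naive_fix?.issues;
  if (!Array.isArray(issues) || issues.length === 0) {
    errors.push('"naive_fix.issues" must list at least one missed issue');
  }

  const required = scenario.correct_fix?.required_changes;
  if (!Array.isArray(required) || required.length === 0) {
    errors.push('"correct_fix.required_changes" must list at least one change');
  }

  const mustMention = scenario.validation_criteria?.must_mention;
  if (!Array.isArray(mustMention) || mustMention.length === 0) {
    errors.push('"validation_criteria.must_mention" must list at least one keyword');
  }

  return errors; // an empty array means the scenario looks well-formed
}
```

Returning a list of messages rather than throwing on the first problem lets a scenario author fix every issue in one pass.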

Real-World Testing

For true validation, test with real Claude sub-agents:

Root Agent:
  1. Call mcp__diligence__start with task
  2. Spawn Worker agent (Task tool) with get_worker_brief
  3. Worker researches, submits proposal via mcp__diligence__propose
  4. Spawn Reviewer agent (Task tool) with get_reviewer_brief
  5. Reviewer verifies claims, submits via mcp__diligence__review
  6. If NEEDS_WORK, spawn new Worker with updated brief
  7. If APPROVED, proceed to implementation

Why separate agents matter:

Fresh context = no bias from previous reasoning
Reviewer doesn't know Worker's search results
Forces genuine verification, not rubber-stamping
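
The root-agent loop can be sketched as plain control flow. This is a simplified model under assumptions: `runLoop`, `worker` and `reviewer` are hypothetical synchronous stand-ins for the spawned sub-agents and MCP calls, and the verdict shape `{ status, note }` is invented for illustration.

```javascript
// Hypothetical model of the Worker-Reviewer loop. worker and reviewer are
// injected functions standing in for freshly spawned sub-agents.
function runLoop(task, worker, reviewer, maxRounds = 5) {
  const feedback = [];
  for (let round = 1; round <= maxRounds; round++) {
    // Each Worker gets the task plus all feedback accumulated so far.
    const proposal = worker({ task, round, feedback: [...feedback] });
    // The Reviewer sees only the proposal, not the Worker's search results.
    const verdict = reviewer({ task, proposal });
    if (verdict.status === "APPROVED") {
      return { status: "APPROVED", round, proposal };
    }
    feedback.push(verdict.note);
  }
  return { status: "MAX_ROUNDS", round: maxRounds, feedback };
}

// Toy agents: the worker starts naive and adds whatever the reviewer asks for.
const toyWorker = ({ feedback }) => ["answerDmCall", ...feedback].join(", ");
const toyReviewer = ({ proposal }) =>
  proposal.includes("declineDmCall")
    ? { status: "APPROVED" }
    : { status: "NEEDS_WORK", note: "declineDmCall" };

const outcome = runLoop("Fix blocking in DM calls", toyWorker, toyReviewer);
console.log(outcome.status, outcome.round);
```

Passing the agents in as functions mirrors the point about separate contexts: the reviewer function receives nothing from the worker except the proposal itself.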

Success Criteria

Reviewer catches issues that Worker initially misses
Multiple rounds occur before approval
Final proposal is more complete than naive approach
Validation criteria are all met

Environment Variables

| Variable | Purpose |
| --- | --- |
| `DEBUG=1` | Show MCP server stderr output |
| `ANTHROPIC_API_KEY` | Required for `--live` mode (future) |

CI Integration

# Run in CI
npm test

# Or directly
node test/run-tests.mjs --workflow && node test/run-tests.mjs --mock

Exit codes: 0 = pass, 1 = fail
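
A runner can derive that exit code from its collected results. The sketch below assumes a hypothetical result shape of `{ name, passed }`; `summarize` is an illustrative helper, not a function from `run-tests.mjs`.

```javascript
// Hypothetical result shape: { name: string, passed: boolean }.
function summarize(results) {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  // 0 = every test passed, 1 = at least one failure (as documented above).
  return { exitCode: failed.length === 0 ? 0 : 1, failed };
}

const summary = summarize([
  { name: "phase transitions", passed: true },
  { name: "max rounds enforcement", passed: false },
]);
console.log(summary.exitCode, summary.failed);
```

In a Node script, assigning the result to `process.exitCode` lets pending output flush before the process ends, which is why it is often preferred over calling `process.exit()` directly.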
