Initial release: MCP server enforcing Worker-Reviewer loop

Diligence prevents AI agents from shipping quick fixes that break things by enforcing a research-propose-verify loop before any code changes.

Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from .claude/CODEBASE_CONTEXT.md
- State persisted across sessions

Validated on production codebase: caught architectural mistake (broker subscriptions on client-side code) that naive agent would have shipped.
2026-01-22 06:22:59 +01:00
commit bd178fcaf0
23 changed files with 4001 additions and 0 deletions
--- a/TESTING.md
+++ b/TESTING.md
@@ -0,0 +1,243 @@
+# Testing the Diligence MCP Server
+
+## Test Suite Overview
+
+The diligence project includes a test suite that covers:
+
+1. **Workflow mechanics** - State transitions, round limits, phase enforcement
+2. **Mock scenarios** - Predefined Worker-Reviewer interactions with expected outcomes
+3. **Dry-run mode** - Test against real projects without making changes
+
+## Quick Start
+
+```bash
+# Run all workflow tests
+node test/run-tests.mjs --workflow
+
+# Run mock scenario tests
+node test/run-tests.mjs --mock
+
+# Run a specific scenario
+node test/run-tests.mjs --mock --scenario=blocking-voice
+
+# Dry run against nexus (no changes)
+node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice
+```
+
+## Test Structure
+
+```
+test/
+├── mcp-client.mjs      # Programmatic MCP client
+├── run-tests.mjs       # Main test runner
+├── dry-run.mjs         # Dry-run against real projects
+├── fixture/            # Mock codebase for testing
+│   ├── src/
+│   │   ├── broker/events.ts
+│   │   ├── services/
+│   │   └── controllers/
+│   └── .claude/
+│       └── CODEBASE_CONTEXT.md
+└── scenarios/          # Test scenarios
+    ├── index.json
+    ├── blocking-voice.json
+    └── permission-cache.json
+```
+
+## Test Modes
+
+### 1. Workflow Tests (`--workflow`)
+
+Tests the MCP server mechanics without AI:
+
+- Phase transitions (conversation → researching → approved → implementing)
+- Round increment on NEEDS_WORK
+- Max rounds enforcement (resets after 5 rounds)
+- Feedback accumulation
+- Abort functionality
+
+```bash
+node test/run-tests.mjs --workflow
+```
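+
+The mechanics listed above can be sketched as a small state machine. This is an illustrative sketch, not the server's actual code: `createState`, `start`, `review`, `implement`, and the state shape are hypothetical names, and counting rounds from 1 is an assumption.
+
+```javascript
+// Hypothetical sketch of the workflow mechanics; not the real server code.
+const MAX_ROUNDS = 5;
+
+// Fresh state: no task, no rounds, no accumulated feedback.
+function createState() {
+  return { phase: 'conversation', round: 0, feedback: [] };
+}
+
+// conversation -> researching (assumption: rounds are counted from 1)
+function start(state, task) {
+  if (state.phase !== 'conversation') {
+    throw new Error(`cannot start from phase "${state.phase}"`);
+  }
+  return { ...state, phase: 'researching', task, round: 1 };
+}
+
+// Apply a Reviewer decision while researching.
+function review(state, decision, feedback = '') {
+  if (state.phase !== 'researching') {
+    throw new Error(`cannot review in phase "${state.phase}"`);
+  }
+  if (decision === 'APPROVED') {
+    return { ...state, phase: 'approved' };
+  }
+  if (decision !== 'NEEDS_WORK') {
+    throw new Error(`unknown decision "${decision}"`);
+  }
+  // NEEDS_WORK on the final round resets the workflow.
+  if (state.round >= MAX_ROUNDS) {
+    return createState();
+  }
+  // Otherwise keep the feedback and move to the next round.
+  return { ...state, round: state.round + 1, feedback: [...state.feedback, feedback] };
+}
+
+// approved -> implementing (phase enforcement: no other phase may implement)
+function implement(state) {
+  if (state.phase !== 'approved') {
+    throw new Error(`cannot implement from phase "${state.phase}"`);
+  }
+  return { ...state, phase: 'implementing' };
+}
+```
+
+Returning a new object from each transition keeps the sketch easy to assert against in a test; the real server may well mutate persisted state instead.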
+
+### 2. Mock Tests (`--mock`)
+
+Tests complete Worker-Reviewer scenarios with predefined responses:
+
+1. Scenario defines a task and expected naive/correct fixes
+2. Test simulates Worker submitting naive proposal
+3. Reviewer catches issues, sends NEEDS_WORK
+4. Worker submits revised proposal with all fixes
+5. Reviewer approves
+6. Validates that proposal mentions all required elements
+
+```bash
+# All scenarios
+node test/run-tests.mjs --mock
+
+# Single scenario
+node test/run-tests.mjs --mock --scenario=permission-cache
+```
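+
+Step 6 above can be pictured as a keyword check over the final proposal. The sketch below is an assumption about how such a check might look: `missingMentions` is a hypothetical helper, case-insensitive substring matching is a guess at the matching rule, and the keywords are picked from the blocking-voice fix list purely for illustration.
+
+```javascript
+// Hypothetical helper: return the required keywords a proposal fails to mention.
+// Case-insensitive substring matching is an assumption, not the runner's confirmed rule.
+function missingMentions(proposalText, mustMention) {
+  const haystack = proposalText.toLowerCase();
+  return mustMention.filter((keyword) => !haystack.includes(keyword.toLowerCase()));
+}
+
+// Illustrative inputs only; the real scenario files define their own keywords.
+const naive = 'Add blocking check to answerDmCall';
+const revised =
+  'Add blocking checks to answerDmCall and declineDmCall, and add voice cleanup to blockUser()';
+const required = ['answerDmCall', 'declineDmCall', 'blockUser'];
+
+console.log(missingMentions(naive, required));   // keywords the naive proposal lacks
+console.log(missingMentions(revised, required)); // empty when everything is mentioned
+```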
+
+### 3. Dry Run (`--project`)
+
+Runs the separate `test/dry-run.mjs` script, which connects to the MCP server in a real project directory:
+
+```bash
+# With predefined scenario
+node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice
+
+# With custom task
+node test/dry-run.mjs --project=/path/to/project --task="Fix the caching bug"
+```
+
+A dry run:
+- Starts the workflow with the task
+- Shows the full Worker Brief (including real CODEBASE_CONTEXT.md)
+- Does NOT make any code changes
+- Aborts the workflow on exit
+
+## Test Fixture
+
+The fixture (`test/fixture/`) is a mini codebase that mirrors real-world patterns:
+
+### Files
+
+| File (under `src/`) | Purpose | Known Bugs or Notes (for testing) |
+|------|---------|-------------------------|
+| `broker/events.ts` | Event bus definitions | Reference implementation |
+| `services/user-block.service.ts` | Blocking logic | Missing voice cleanup |
+| `services/voice-channel.service.ts` | Voice/DM calls | Missing blocking check on answerDmCall |
+| `services/team.service.ts` | Permission cache | Doesn't subscribe to role events |
+| `services/chat.service.ts` | **Correct pattern** | Shows permission vs action separation |
+| `controllers/roles.controller.ts` | Role CRUD | Missing broker events on create/delete |
+
+### Patterns Tested
+
+1. **Broker event emission** - Every state change should emit events
+2. **Cache invalidation** - Caches must subscribe to relevant events
+3. **Permission vs Action** - Permissions control visibility, actions have separate checks
+4. **Multi-location fixes** - If one place has a check, similar places need it too
+
+## Test Scenarios
+
+### blocking-voice
+
+**Task:** Fix blocked users can still answer DM voice calls
+
+**Naive fix:** Add blocking check to answerDmCall only
+
+**Correct fix:**
+- Add blocking check to answerDmCall
+- Add blocking check to declineDmCall
+- Filter notifications for blocked users
+- Add voice cleanup to blockUser()
+- Subscribe to BusUserBlockChange for mid-call kicks
+
+### permission-cache
+
+**Task:** Fix permission cache doesn't invalidate when roles change
+
+**Naive fix:** Add .clear() somewhere
+
+**Correct fix:**
+- Subscribe to BusTeamRoleChange in team.service
+- Subscribe to BusTeamMemberRoleChange in team.service
+- Add broker event to createRole()
+- Add broker event to deleteRole()
+
+## Adding New Scenarios
+
+Create a new JSON file in `test/scenarios/`:
+
+```json
+{
+  "id": "my-scenario",
+  "name": "Human-readable name",
+  "description": "Brief description",
+
+  "task": "The task description given to the Worker",
+
+  "naive_fix": {
+    "description": "What a quick-fix agent would do",
+    "changes": [
+      { "file": "path/to/file.ts", "change": "Quick fix description" }
+    ],
+    "issues": [
+      "What the naive fix misses #1",
+      "What the naive fix misses #2"
+    ]
+  },
+
+  "correct_fix": {
+    "description": "Complete fix description",
+    "required_changes": [
+      { "file": "path.ts", "function": "funcName", "change": "What to change" }
+    ],
+    "required_broker_subscriptions": [
+      { "service": "x.service.ts", "event": "BusEventName", "action": "What to do" }
+    ]
+  },
+
+  "validation_criteria": {
+    "must_mention": ["keyword1", "keyword2"],
+    "should_reference_pattern": "reference-file.ts"
+  }
+}
+```
+
+Add to `test/scenarios/index.json`:
+
+```json
+{
+  "scenarios": [
+    { "id": "my-scenario", "file": "my-scenario.json" }
+  ]
+}
+```
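+
+Before registering a new scenario, it may help to sanity-check its shape. The sketch below is a hypothetical pre-flight check, not part of the test runner: `scenarioProblems` is an invented name, and the set of required fields is inferred from the template above rather than from the runner's code.
+
+```javascript
+// Hypothetical pre-flight check for a scenario object parsed from JSON.
+// The required fields mirror the template above; the runner's real checks may differ.
+function scenarioProblems(scenario) {
+  const problems = [];
+  for (const key of ['id', 'name', 'task']) {
+    if (typeof scenario[key] !== 'string' || scenario[key].length === 0) {
+      problems.push(`"${key}" must be a non-empty string`);
+    }
+  }
+  for (const key of ['naive_fix', 'correct_fix', 'validation_criteria']) {
+    if (typeof scenario[key] !== 'object' || scenario[key] === null) {
+      problems.push(`"${key}" must be an object`);
+    }
+  }
+  const mustMention = scenario.validation_criteria?.must_mention;
+  if (!Array.isArray(mustMention) || mustMention.length === 0) {
+    problems.push('"validation_criteria.must_mention" must be a non-empty array');
+  }
+  return problems;
+}
+```
+
+An empty array means the object has the expected top-level shape; each string otherwise names one missing or malformed field.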
+
+## Real-World Testing
+
+For true validation, test with real Claude sub-agents:
+
+```
+Root Agent:
+  1. Call mcp__diligence__start with task
+  2. Spawn Worker agent (Task tool) with get_worker_brief
+  3. Worker researches, submits proposal via mcp__diligence__propose
+  4. Spawn Reviewer agent (Task tool) with get_reviewer_brief
+  5. Reviewer verifies claims, submits via mcp__diligence__review
+  6. If NEEDS_WORK, spawn new Worker with updated brief
+  7. If APPROVED, proceed to implementation
+```
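+
+The loop above can also be sketched in code. This is a simplified, synchronous illustration under stated assumptions: `runDiligenceLoop`, `runWorker`, and `runReviewer` are hypothetical names, the two callbacks stand in for spawning fresh sub-agents, and a real driver would be asynchronous and call the MCP tools listed above.
+
+```javascript
+// Hypothetical driver for the loop above. `runWorker` and `runReviewer` stand in
+// for spawning fresh sub-agents; they are injected so the loop can be tested with stubs.
+function runDiligenceLoop({ task, runWorker, runReviewer, maxRounds = 5 }) {
+  const feedback = [];
+  for (let round = 1; round <= maxRounds; round++) {
+    // Each round gets a new Worker whose brief carries all prior feedback.
+    const proposal = runWorker({ task, round, feedback: [...feedback] });
+    // The Reviewer receives only the proposal, not the Worker's search results.
+    const verdict = runReviewer({ task, round, proposal });
+    if (verdict.decision === 'APPROVED') {
+      return { approved: true, rounds: round, proposal };
+    }
+    feedback.push(verdict.feedback);
+  }
+  // No approval within the round limit.
+  return { approved: false, rounds: maxRounds, proposal: null };
+}
+```
+
+Passing the Worker a copy of the feedback list, and the Reviewer nothing but the proposal, mirrors the separation described below: neither side inherits the other's reasoning.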
+
+**Why separate agents matter:**
+- Fresh context = no bias from previous reasoning
+- Reviewer doesn't know Worker's search results
+- Forces genuine verification, not rubber-stamping
+
+## Success Criteria
+
+1. **Reviewer catches issues** that Worker initially misses
+2. **Multiple rounds** occur before approval
+3. **Final proposal** is more complete than naive approach
+4. **Validation criteria** are all met
+
+## Environment Variables
+
+| Variable | Purpose |
+|----------|---------|
+| `DEBUG=1` | Show MCP server stderr output |
+| `ANTHROPIC_API_KEY` | Required for `--live` mode (future) |
+
+## CI Integration
+
+```bash
+# Run in CI
+npm test
+
+# Or directly
+node test/run-tests.mjs --workflow && node test/run-tests.mjs --mock
+```
+
+Exit codes: 0 = pass, 1 = fail
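+
+As a rough illustration of the exit-code convention, a runner could derive its status from the collected results. `exitCodeFor` and the `{ name, passed }` result shape are hypothetical, not taken from `run-tests.mjs`.
+
+```javascript
+// Hypothetical: map a list of { name, passed } results to the documented exit code.
+function exitCodeFor(results) {
+  return results.every((result) => result.passed) ? 0 : 1;
+}
+
+// A runner could finish with: process.exitCode = exitCodeFor(results);
+```
+
+Note that an empty result list counts as a pass in this sketch; a stricter runner might treat "no tests ran" as a failure.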