Initial release: MCP server enforcing Worker-Reviewer loop

Diligence prevents AI agents from shipping quick fixes that break things by enforcing a research-propose-verify loop before any code changes. Key features: - Worker sub-agent researches and proposes with file:line citations - Reviewer sub-agent independently verifies claims by searching codebase - Iterates until approved (max 5 rounds) - Loads project-specific context from .claude/CODEBASE_CONTEXT.md - State persisted across sessions Validated on production codebase: caught architectural mistake (broker subscriptions on client-side code) that naive agent would have shipped.
2026-01-22 06:22:59 +01:00
commit bd178fcaf0
23 changed files with 4001 additions and 0 deletions
--- a/test/results/2026-01-22-comparison-report.md
+++ b/test/results/2026-01-22-comparison-report.md
@@ -0,0 +1,183 @@
+# Diligence vs Naive Approach: Comparison Report
+
+**Date:** 2026-01-22
+**Test Bug:** B1 - Blocked users can answer DM voice calls
+**Project:** nexus (~/bude/codecharm/nexus)
+
+---
+
+## Executive Summary
+
+| Metric | Naive Approach | Diligence Approach |
+|--------|----------------|-------------------|
+| Bug verified exists? | ✅ Yes | ✅ Yes |
+| Correct line numbers? | ✅ Yes (1050, 965) | ✅ Worker correct |
+| Found declineDmCall gap? | ✅ Yes | ⚠️ Reviewer found it |
+| Found notification filtering? | ✅ Yes | ⚠️ Reviewer found it |
+| Found blockUser cleanup? | ✅ Yes | ⚠️ Reviewer found it |
+| Reviewer caught errors? | N/A | ✅ Caught line number discrepancy* |
+
+*Reviewer searched wrong codebase (test fixture instead of nexus), but the PROCESS of verification worked.
+
+---
+
+## Bug Verification: CONFIRMED REAL
+
+**Evidence from actual nexus code:**
+
+```typescript
+// startDmCall (lines 965-969) - HAS blocking check ✅
+const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
+if (blocked) {
+  throw new UserError('Cannot call this user');
+}
+
+// answerDmCall (line 1050+) - NO blocking check ❌
+async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
+  // Only checks: auth, call exists, state=ringing, user=callee, not expired
+  // MISSING: blocking check
+}
+
+// declineDmCall (line 1115+) - NO blocking check ❌
+async declineDmCall(callId: MongoId): Promise<void> {
+  // Only checks: auth, call exists, state=ringing, user=callee
+  // MISSING: blocking check
+}
+```
+
+**Conclusion:** Bug B1 is REAL. Both approaches correctly identified it.
+
+---
+
+## Detailed Comparison
+
+### Naive Approach Output
+
+The naive agent (single Explore agent) produced:
+- ✅ Root cause analysis
+- ✅ Correct file identification (`voice-channel.rpc.ts`)
+- ✅ Correct line numbers (965-969, 1050-1109)
+- ✅ Compared startDmCall vs answerDmCall patterns
+- ✅ Identified additional issues:
+  - declineDmCall needs blocking check
+  - notifyDmCall needs filtering
+  - blockUser() needs voice cleanup
+  - BusUserBlockChange subscription needed
+- ✅ Implementation order recommendation
+- ✅ Edge cases considered
+
+**Quality:** Surprisingly thorough. Searched actual code, cited lines, found patterns.
+
+### Diligence Approach Output
+
+**Worker:**
+- ✅ Verified bug exists by searching code
+- ✅ Correct line numbers
+- ✅ Cited exact file:line
+- ✅ Proposed fix matching startDmCall pattern
+
+**Reviewer:**
+- ✅ Attempted independent verification (correct process)
+- ❌ Searched wrong codebase (test fixture, 220 lines)
+- ✅ Noticed discrepancy ("file only 220 lines, Worker cited 1050")
+- ✅ Found additional gaps (declineDmCall, notification filtering)
+- ✅ Gave NEEDS_WORK decision with specific issues
+
+**Quality:** Process worked correctly. Reviewer caught a "discrepancy" (even if due to searching wrong place).
+
+---
+
+## Key Findings
+
+### 1. Both approaches verified the bug exists
+
+Neither approach blindly trusted the task description. Both:
+- Searched for answerDmCall implementation
+- Compared with startDmCall pattern
+- Verified blocking check is actually missing
+
+### 2. Naive approach was surprisingly thorough
+
+The single agent produced analysis comparable to the Worker. This suggests:
+- For bugs with clear descriptions, naive approach may suffice
+- The value of diligence may be in more ambiguous tasks
+
+### 3. Reviewer process works, but needs correct context
+
+The Reviewer:
+- Did NOT rubber-stamp the Worker's proposal
+- Actually searched and found discrepancies
+- Caught additional issues the Worker missed
+- BUT searched the wrong codebase due to test setup
+
+### 4. Test setup flaw identified
+
+The Reviewer searched `/Users/marc/bude/strikt/diligence/test/fixture/` instead of `~/bude/codecharm/nexus`. This is because:
+- Agents were spawned from the diligence project
+- They defaulted to searching the current working directory
+
+**Fix needed:** In real usage, diligence MCP runs IN the target project, so this wouldn't happen.
+
+---
+
+## What Diligence Should Catch That Naive Might Miss
+
+Based on this test, diligence adds value when:
+
+1. **Worker makes incorrect claims** - Reviewer verifies by searching
+2. **Worker misses related issues** - Reviewer's independent search finds them
+3. **Task description is wrong** - Both should verify bug exists, not assume
+4. **Patterns are misunderstood** - Reviewer checks against CODEBASE_CONTEXT.md
+
+### This test showed:
+
+| Scenario | Did Diligence Help? |
+|----------|---------------------|
+| Verify bug exists | Both approaches did this |
+| Catch wrong line numbers | Reviewer caught discrepancy ✅ |
+| Find additional gaps | Reviewer found more than Worker ✅ |
+| Prevent hallucinated bugs | Would catch if Reviewer searched correctly |
+
+---
+
+## Recommendations
+
+### 1. Run real test in nexus project
+
+Start a Claude Code session IN nexus and test the full workflow there. This ensures:
+- MCP server runs in correct project
+- Agents search the right codebase
+- Full context from CODEBASE_CONTEXT.md is loaded
+
+### 2. Test with a more ambiguous bug
+
+B1 is well-documented. Test with something like:
+- "Voice seems laggy sometimes"
+- "Users report weird permission issues"
+
+These require more investigation to even determine if there's a bug.
+
+### 3. Test if diligence catches non-bugs
+
+Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?
+
+### 4. Add explicit codebase path to Worker/Reviewer briefs
+
+The briefs should specify: "Search in /path/to/project, not elsewhere"
+
+---
+
+## Conclusion
+
+**Does diligence work?** Yes, the process is sound:
+- Worker researches and proposes
+- Reviewer independently verifies
+- Discrepancies are caught
+- Multiple rounds can iterate
+
+**Is it better than naive?** For this test, similar results. But:
+- Reviewer caught additional issues Worker missed
+- Process would catch hallucinated bugs if Reviewer searches correctly
+- Real value may be in more complex/ambiguous tasks
+
+**Next step:** Run a real test in a Claude Code session in nexus, with a more ambiguous task.