Initial release: MCP server enforcing Worker-Reviewer loop

Diligence prevents AI agents from shipping quick fixes that break things
by enforcing a research-propose-verify loop before any code changes.

Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from .claude/CODEBASE_CONTEXT.md
- State persisted across sessions

Validated on production codebase: caught architectural mistake (broker
subscriptions on client-side code) that naive agent would have shipped.
This commit is contained in:
2026-01-22 06:22:59 +01:00
commit bd178fcaf0
23 changed files with 4001 additions and 0 deletions

View File

@@ -0,0 +1,183 @@
# Diligence vs Naive Approach: Comparison Report
**Date:** 2026-01-22
**Test Bug:** B1 - Blocked users can answer DM voice calls
**Project:** nexus (~/bude/codecharm/nexus)
---
## Executive Summary
| Metric | Naive Approach | Diligence Approach |
|--------|----------------|-------------------|
| Bug verified exists? | ✅ Yes | ✅ Yes |
| Correct line numbers? | ✅ Yes (1050, 965) | ✅ Worker correct |
| Found declineDmCall gap? | ✅ Yes | ⚠️ Reviewer found it |
| Found notification filtering? | ✅ Yes | ⚠️ Reviewer found it |
| Found blockUser cleanup? | ✅ Yes | ⚠️ Reviewer found it |
| Reviewer caught errors? | N/A | ✅ Caught line number discrepancy* |
*Reviewer searched wrong codebase (test fixture instead of nexus), but the PROCESS of verification worked.
---
## Bug Verification: CONFIRMED REAL
**Evidence from actual nexus code:**
```typescript
// startDmCall (lines 965-969) - HAS blocking check ✅
const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
if (blocked) {
throw new UserError('Cannot call this user');
}
// answerDmCall (line 1050+) - NO blocking check ❌
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
// Only checks: auth, call exists, state=ringing, user=callee, not expired
// MISSING: blocking check
}
// declineDmCall (line 1115+) - NO blocking check ❌
async declineDmCall(callId: MongoId): Promise<void> {
// Only checks: auth, call exists, state=ringing, user=callee
// MISSING: blocking check
}
```
**Conclusion:** Bug B1 is REAL. Both approaches correctly identified it.
---
## Detailed Comparison
### Naive Approach Output
The naive agent (single Explore agent) produced:
- ✅ Root cause analysis
- ✅ Correct file identification (`voice-channel.rpc.ts`)
- ✅ Correct line numbers (965-969, 1050-1109)
- ✅ Compared startDmCall vs answerDmCall patterns
- ✅ Identified additional issues:
- declineDmCall needs blocking check
- notifyDmCall needs filtering
- blockUser() needs voice cleanup
- BusUserBlockChange subscription needed
- ✅ Implementation order recommendation
- ✅ Edge cases considered
**Quality:** Surprisingly thorough. Searched actual code, cited lines, found patterns.
### Diligence Approach Output
**Worker:**
- ✅ Verified bug exists by searching code
- ✅ Correct line numbers
- ✅ Cited exact file:line
- ✅ Proposed fix matching startDmCall pattern
**Reviewer:**
- ✅ Attempted independent verification (correct process)
- ❌ Searched wrong codebase (test fixture, 220 lines)
- ✅ Noticed discrepancy ("file only 220 lines, Worker cited 1050")
- ✅ Found additional gaps (declineDmCall, notification filtering)
- ✅ Gave NEEDS_WORK decision with specific issues
**Quality:** Process worked correctly. Reviewer caught a "discrepancy" (even if due to searching wrong place).
---
## Key Findings
### 1. Both approaches verified the bug exists
Neither approach blindly trusted the task description. Both:
- Searched for answerDmCall implementation
- Compared with startDmCall pattern
- Verified blocking check is actually missing
### 2. Naive approach was surprisingly thorough
The single agent produced analysis comparable to the Worker. This suggests:
- For bugs with clear descriptions, naive approach may suffice
- The value of diligence may be in more ambiguous tasks
### 3. Reviewer process works, but needs correct context
The Reviewer:
- Did NOT rubber-stamp the Worker's proposal
- Actually searched and found discrepancies
- Caught additional issues the Worker missed
- BUT searched the wrong codebase due to test setup
### 4. Test setup flaw identified
The Reviewer searched `/Users/marc/bude/strikt/diligence/test/fixture/` instead of `~/bude/codecharm/nexus`. This is because:
- Agents were spawned from the diligence project
- They defaulted to searching the current working directory
**Fix needed:** In real usage, diligence MCP runs IN the target project, so this wouldn't happen.
---
## What Diligence Should Catch That Naive Might Miss
Based on this test, diligence adds value when:
1. **Worker makes incorrect claims** - Reviewer verifies by searching
2. **Worker misses related issues** - Reviewer's independent search finds them
3. **Task description is wrong** - Both should verify bug exists, not assume
4. **Patterns are misunderstood** - Reviewer checks against CODEBASE_CONTEXT.md
### This test showed:
| Scenario | Did Diligence Help? |
|----------|---------------------|
| Verify bug exists | Both approaches did this |
| Catch wrong line numbers | Reviewer caught discrepancy ✅ |
| Find additional gaps | Reviewer found more than Worker ✅ |
| Prevent hallucinated bugs | Would catch if Reviewer searched correctly |
---
## Recommendations
### 1. Run real test in nexus project
Start a Claude Code session IN nexus and test the full workflow there. This ensures:
- MCP server runs in correct project
- Agents search the right codebase
- Full context from CODEBASE_CONTEXT.md is loaded
### 2. Test with a more ambiguous bug
B1 is well-documented. Test with something like:
- "Voice seems laggy sometimes"
- "Users report weird permission issues"
These require more investigation to even determine if there's a bug.
### 3. Test if diligence catches non-bugs
Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?
### 4. Add explicit codebase path to Worker/Reviewer briefs
The briefs should specify: "Search in /path/to/project, not elsewhere"
---
## Conclusion
**Does diligence work?** Yes, the process is sound:
- Worker researches and proposes
- Reviewer independently verifies
- Discrepancies are caught
- Multiple rounds can iterate
**Is it better than naive?** For this test, similar results. But:
- Reviewer caught additional issues Worker missed
- Process would catch hallucinated bugs if Reviewer searches correctly
- Real value may be in more complex/ambiguous tasks
**Next step:** Run a real test in a Claude Code session in nexus, with a more ambiguous task.