Initial release: MCP server enforcing Worker-Reviewer loop
Diligence prevents AI agents from shipping quick fixes that break things by enforcing a research-propose-verify loop before any code changes. Key features: - Worker sub-agent researches and proposes with file:line citations - Reviewer sub-agent independently verifies claims by searching codebase - Iterates until approved (max 5 rounds) - Loads project-specific context from .claude/CODEBASE_CONTEXT.md - State persisted across sessions Validated on production codebase: caught architectural mistake (broker subscriptions on client-side code) that naive agent would have shipped.
This commit is contained in:
183
test/results/2026-01-22-comparison-report.md
Normal file
183
test/results/2026-01-22-comparison-report.md
Normal file
@@ -0,0 +1,183 @@
|
||||
# Diligence vs Naive Approach: Comparison Report
|
||||
|
||||
**Date:** 2026-01-22
|
||||
**Test Bug:** B1 - Blocked users can answer DM voice calls
|
||||
**Project:** nexus (~/bude/codecharm/nexus)
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Metric | Naive Approach | Diligence Approach |
|
||||
|--------|----------------|-------------------|
|
||||
| Bug verified exists? | ✅ Yes | ✅ Yes |
|
||||
| Correct line numbers? | ✅ Yes (1050, 965) | ✅ Worker correct |
|
||||
| Found declineDmCall gap? | ✅ Yes | ⚠️ Reviewer found it |
|
||||
| Found notification filtering? | ✅ Yes | ⚠️ Reviewer found it |
|
||||
| Found blockUser cleanup? | ✅ Yes | ⚠️ Reviewer found it |
|
||||
| Reviewer caught errors? | N/A | ✅ Caught line number discrepancy* |
|
||||
|
||||
*Reviewer searched wrong codebase (test fixture instead of nexus), but the PROCESS of verification worked.
|
||||
|
||||
---
|
||||
|
||||
## Bug Verification: CONFIRMED REAL
|
||||
|
||||
**Evidence from actual nexus code:**
|
||||
|
||||
```typescript
|
||||
// startDmCall (lines 965-969) - HAS blocking check ✅
|
||||
const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
|
||||
if (blocked) {
|
||||
throw new UserError('Cannot call this user');
|
||||
}
|
||||
|
||||
// answerDmCall (line 1050+) - NO blocking check ❌
|
||||
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
|
||||
// Only checks: auth, call exists, state=ringing, user=callee, not expired
|
||||
// MISSING: blocking check
|
||||
}
|
||||
|
||||
// declineDmCall (line 1115+) - NO blocking check ❌
|
||||
async declineDmCall(callId: MongoId): Promise<void> {
|
||||
// Only checks: auth, call exists, state=ringing, user=callee
|
||||
// MISSING: blocking check
|
||||
}
|
||||
```
|
||||
|
||||
**Conclusion:** Bug B1 is REAL. Both approaches correctly identified it.
|
||||
|
||||
---
|
||||
|
||||
## Detailed Comparison
|
||||
|
||||
### Naive Approach Output
|
||||
|
||||
The naive agent (single Explore agent) produced:
|
||||
- ✅ Root cause analysis
|
||||
- ✅ Correct file identification (`voice-channel.rpc.ts`)
|
||||
- ✅ Correct line numbers (965-969, 1050-1109)
|
||||
- ✅ Compared startDmCall vs answerDmCall patterns
|
||||
- ✅ Identified additional issues:
|
||||
- declineDmCall needs blocking check
|
||||
- notifyDmCall needs filtering
|
||||
- blockUser() needs voice cleanup
|
||||
- BusUserBlockChange subscription needed
|
||||
- ✅ Implementation order recommendation
|
||||
- ✅ Edge cases considered
|
||||
|
||||
**Quality:** Surprisingly thorough. Searched actual code, cited lines, found patterns.
|
||||
|
||||
### Diligence Approach Output
|
||||
|
||||
**Worker:**
|
||||
- ✅ Verified bug exists by searching code
|
||||
- ✅ Correct line numbers
|
||||
- ✅ Cited exact file:line
|
||||
- ✅ Proposed fix matching startDmCall pattern
|
||||
|
||||
**Reviewer:**
|
||||
- ✅ Attempted independent verification (correct process)
|
||||
- ❌ Searched wrong codebase (test fixture, 220 lines)
|
||||
- ✅ Noticed discrepancy ("file only 220 lines, Worker cited 1050")
|
||||
- ✅ Found additional gaps (declineDmCall, notification filtering)
|
||||
- ✅ Gave NEEDS_WORK decision with specific issues
|
||||
|
||||
**Quality:** Process worked correctly. Reviewer caught a "discrepancy" (even if due to searching wrong place).
|
||||
|
||||
---
|
||||
|
||||
## Key Findings
|
||||
|
||||
### 1. Both approaches verified the bug exists
|
||||
|
||||
Neither approach blindly trusted the task description. Both:
|
||||
- Searched for answerDmCall implementation
|
||||
- Compared with startDmCall pattern
|
||||
- Verified blocking check is actually missing
|
||||
|
||||
### 2. Naive approach was surprisingly thorough
|
||||
|
||||
The single agent produced analysis comparable to the Worker. This suggests:
|
||||
- For bugs with clear descriptions, naive approach may suffice
|
||||
- The value of diligence may be in more ambiguous tasks
|
||||
|
||||
### 3. Reviewer process works, but needs correct context
|
||||
|
||||
The Reviewer:
|
||||
- Did NOT rubber-stamp the Worker's proposal
|
||||
- Actually searched and found discrepancies
|
||||
- Caught additional issues the Worker missed
|
||||
- BUT searched the wrong codebase due to test setup
|
||||
|
||||
### 4. Test setup flaw identified
|
||||
|
||||
The Reviewer searched `/Users/marc/bude/strikt/diligence/test/fixture/` instead of `~/bude/codecharm/nexus`. This is because:
|
||||
- Agents were spawned from the diligence project
|
||||
- They defaulted to searching the current working directory
|
||||
|
||||
**Fix needed:** In real usage, diligence MCP runs IN the target project, so this wouldn't happen.
|
||||
|
||||
---
|
||||
|
||||
## What Diligence Should Catch That Naive Might Miss
|
||||
|
||||
Based on this test, diligence adds value when:
|
||||
|
||||
1. **Worker makes incorrect claims** - Reviewer verifies by searching
|
||||
2. **Worker misses related issues** - Reviewer's independent search finds them
|
||||
3. **Task description is wrong** - Both should verify bug exists, not assume
|
||||
4. **Patterns are misunderstood** - Reviewer checks against CODEBASE_CONTEXT.md
|
||||
|
||||
### This test showed:
|
||||
|
||||
| Scenario | Did Diligence Help? |
|
||||
|----------|---------------------|
|
||||
| Verify bug exists | Both approaches did this |
|
||||
| Catch wrong line numbers | Reviewer caught discrepancy ✅ |
|
||||
| Find additional gaps | Reviewer found more than Worker ✅ |
|
||||
| Prevent hallucinated bugs | Would catch if Reviewer searched correctly |
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
### 1. Run real test in nexus project
|
||||
|
||||
Start a Claude Code session IN nexus and test the full workflow there. This ensures:
|
||||
- MCP server runs in correct project
|
||||
- Agents search the right codebase
|
||||
- Full context from CODEBASE_CONTEXT.md is loaded
|
||||
|
||||
### 2. Test with a more ambiguous bug
|
||||
|
||||
B1 is well-documented. Test with something like:
|
||||
- "Voice seems laggy sometimes"
|
||||
- "Users report weird permission issues"
|
||||
|
||||
These require more investigation to even determine if there's a bug.
|
||||
|
||||
### 3. Test if diligence catches non-bugs
|
||||
|
||||
Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?
|
||||
|
||||
### 4. Add explicit codebase path to Worker/Reviewer briefs
|
||||
|
||||
The briefs should specify: "Search in /path/to/project, not elsewhere"
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
**Does diligence work?** Yes, the process is sound:
|
||||
- Worker researches and proposes
|
||||
- Reviewer independently verifies
|
||||
- Discrepancies are caught
|
||||
- Multiple rounds can iterate
|
||||
|
||||
**Is it better than naive?** For this test, similar results. But:
|
||||
- Reviewer caught additional issues Worker missed
|
||||
- Process would catch hallucinated bugs if Reviewer searches correctly
|
||||
- Real value may be in more complex/ambiguous tasks
|
||||
|
||||
**Next step:** Run a real test in a Claude Code session in nexus, with a more ambiguous task.
|
||||
Reference in New Issue
Block a user