Diligence prevents AI agents from shipping quick fixes that break things by enforcing a research-propose-verify loop before any code changes.

Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching the codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from `.claude/CODEBASE_CONTEXT.md`
- State persisted across sessions

Validated on a production codebase: caught an architectural mistake (broker subscriptions in client-side code) that a naive agent would have shipped.
# Diligence vs Naive Approach: Comparison Report
**Date:** 2026-01-22
**Test bug:** B1 - Blocked users can answer DM voice calls
**Project:** nexus (`~/bude/codecharm/nexus`)
## Executive Summary
| Metric | Naive Approach | Diligence Approach |
|---|---|---|
| Bug verified exists? | ✅ Yes | ✅ Yes |
| Correct line numbers? | ✅ Yes (1050, 965) | ✅ Worker correct |
| Found declineDmCall gap? | ✅ Yes | ⚠️ Reviewer found it |
| Found notification filtering? | ✅ Yes | ⚠️ Reviewer found it |
| Found blockUser cleanup? | ✅ Yes | ⚠️ Reviewer found it |
| Reviewer caught errors? | N/A | ✅ Caught line number discrepancy* |
*The Reviewer searched the wrong codebase (the test fixture instead of nexus), but the verification process itself worked.
## Bug Verification: CONFIRMED REAL
Evidence from actual nexus code:
```typescript
// startDmCall (lines 965-969) - HAS blocking check ✅
const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
if (blocked) {
  throw new UserError('Cannot call this user');
}

// answerDmCall (line 1050+) - NO blocking check ❌
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
  // Only checks: auth, call exists, state=ringing, user=callee, not expired
  // MISSING: blocking check
}

// declineDmCall (line 1115+) - NO blocking check ❌
async declineDmCall(callId: MongoId): Promise<void> {
  // Only checks: auth, call exists, state=ringing, user=callee
  // MISSING: blocking check
}
```
**Conclusion:** Bug B1 is REAL. Both approaches correctly identified it.
## Detailed Comparison
### Naive Approach Output
The naive agent (single Explore agent) produced:
- ✅ Root cause analysis
- ✅ Correct file identification (`voice-channel.rpc.ts`)
- ✅ Correct line numbers (965-969, 1050-1109)
- ✅ Compared startDmCall vs answerDmCall patterns
- ✅ Identified additional issues:
  - `declineDmCall` needs a blocking check
  - `notifyDmCall` needs filtering
  - `blockUser()` needs voice cleanup
  - `BusUserBlockChange` subscription needed (sketched below)
- ✅ Implementation order recommendation
- ✅ Edge cases considered
**Quality:** Surprisingly thorough. It searched the actual code, cited line numbers, and found the relevant patterns.
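For the fourth item above, a minimal sketch of what that subscription handler could look like. Everything below except the `BusUserBlockChange` name is a hypothetical stand-in, not nexus code:

```typescript
// Hypothetical sketch: on a user-block event, tear down any ringing or
// active DM call between the pair. The event payload and DmCallService
// API are assumptions for illustration.
type MongoId = string;

interface BusUserBlockChange {
  blockerId: MongoId;
  blockedId: MongoId;
}

interface DmCallService {
  findActiveCallBetween(a: MongoId, b: MongoId): Promise<{ id: MongoId } | null>;
  endCall(callId: MongoId, reason: string): Promise<void>;
}

async function onUserBlockChange(
  calls: DmCallService,
  event: BusUserBlockChange,
): Promise<void> {
  const call = await calls.findActiveCallBetween(event.blockerId, event.blockedId);
  if (call) {
    await calls.endCall(call.id, 'user-blocked');
  }
}
```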
### Diligence Approach Output
**Worker:**
- ✅ Verified bug exists by searching code
- ✅ Correct line numbers
- ✅ Cited exact file:line
- ✅ Proposed a fix matching the startDmCall pattern (sketched below)
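A minimal sketch of that proposed guard, reusing `isBlockingEitherWay` and `UserError` from the excerpt above (the `DmCall` shape and the `assertNotBlocked` helper are hypothetical):

```typescript
// Hypothetical sketch of the Worker's proposal: apply startDmCall's blocking
// check in answerDmCall (and declineDmCall) before acting on the call.
type MongoId = string;

class UserError extends Error {}

interface UserBlockService {
  isBlockingEitherWay(a: MongoId, b: MongoId): Promise<boolean>;
}

interface DmCall {
  callerId: MongoId;
  calleeId: MongoId;
}

async function assertNotBlocked(blocks: UserBlockService, call: DmCall): Promise<void> {
  const blocked = await blocks.isBlockingEitherWay(call.callerId, call.calleeId);
  if (blocked) {
    // Same error startDmCall already throws (lines 965-969).
    throw new UserError('Cannot call this user');
  }
}
```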
**Reviewer:**
- ✅ Attempted independent verification (correct process)
- ❌ Searched wrong codebase (test fixture, 220 lines)
- ✅ Noticed discrepancy ("file only 220 lines, Worker cited 1050")
- ✅ Found additional gaps (declineDmCall, notification filtering)
- ✅ Gave NEEDS_WORK decision with specific issues
**Quality:** The process worked correctly. The Reviewer caught a "discrepancy" (even if it arose from searching the wrong place).
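For context, a rough sketch of the round structure driving this exchange. The names here are hypothetical; the APPROVED/NEEDS_WORK decisions and the five-round cap come from the report itself:

```typescript
// Hypothetical sketch of the research-propose-verify loop: the Worker
// proposes, the Reviewer independently verifies, and the Reviewer's
// objections feed the next round until approval or the round cap.
interface Verdict {
  decision: 'APPROVED' | 'NEEDS_WORK';
  issues: string[];
}

interface SubAgents {
  worker(task: string, feedback: string[]): Promise<string>; // proposal with file:line citations
  reviewer(proposal: string): Promise<Verdict>;              // re-searches the codebase itself
}

const MAX_ROUNDS = 5;

async function diligenceLoop(agents: SubAgents, task: string): Promise<string> {
  let feedback: string[] = [];
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const proposal = await agents.worker(task, feedback);
    const verdict = await agents.reviewer(proposal);
    if (verdict.decision === 'APPROVED') {
      return proposal;
    }
    feedback = verdict.issues; // carry objections into the next round
  }
  throw new Error(`No approved proposal after ${MAX_ROUNDS} rounds`);
}
```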
## Key Findings
### 1. Both approaches verified the bug exists
Neither approach blindly trusted the task description. Both:
- Searched for answerDmCall implementation
- Compared with startDmCall pattern
- Verified blocking check is actually missing
### 2. Naive approach was surprisingly thorough
The single agent produced analysis comparable to the Worker's. This suggests:
- For bugs with clear descriptions, naive approach may suffice
- The value of diligence may be in more ambiguous tasks
### 3. Reviewer process works, but needs correct context
The Reviewer:
- Did NOT rubber-stamp the Worker's proposal
- Actually searched and found discrepancies
- Caught additional issues the Worker missed
- BUT searched the wrong codebase due to test setup
### 4. Test setup flaw identified
The Reviewer searched `/Users/marc/bude/strikt/diligence/test/fixture/` instead of `~/bude/codecharm/nexus`. This is because:
- Agents were spawned from the diligence project
- They defaulted to searching the current working directory
**Fix needed:** In real usage, the diligence MCP server runs IN the target project, so this wouldn't happen.
## What Diligence Should Catch That Naive Might Miss
Based on this test, diligence adds value when:
- **Worker makes incorrect claims** - the Reviewer verifies them by searching
- **Worker misses related issues** - the Reviewer's independent search finds them
- **Task description is wrong** - both agents should verify the bug exists, not assume it
- **Patterns are misunderstood** - the Reviewer checks against CODEBASE_CONTEXT.md
This test showed:
| Scenario | Did Diligence Help? |
|---|---|
| Verify bug exists | Both approaches did this |
| Catch wrong line numbers | Reviewer caught discrepancy ✅ |
| Find additional gaps | Reviewer found more than Worker ✅ |
| Prevent hallucinated bugs | Would catch if Reviewer searched correctly |
## Recommendations
### 1. Run a real test in the nexus project
Start a Claude Code session IN nexus and test the full workflow there. This ensures:
- MCP server runs in correct project
- Agents search the right codebase
- Full context from CODEBASE_CONTEXT.md is loaded (illustrative sample below)
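For illustration, such a file might look like the sample below. The content is invented for this example, not taken from nexus, though `UserError` and `isBlockingEitherWay` appear in the code quoted earlier:

```markdown
# Codebase Context (illustrative sample)

- RPC handlers live in `*.rpc.ts` files, one class per domain.
- User-visible failures throw `UserError`, never raw `Error`.
- Blocking is always checked both ways via
  `userBlockService.isBlockingEitherWay(a, b)`.
- Bus subscriptions belong in server-side services, never in client code.
```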
### 2. Test with a more ambiguous bug
B1 is well-documented. Test with something like:
- "Voice seems laggy sometimes"
- "Users report weird permission issues"
These require more investigation to even determine if there's a bug.
### 3. Test if diligence catches non-bugs
Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?
### 4. Add an explicit codebase path to Worker/Reviewer briefs
The briefs should specify: "Search in `/path/to/project`, not elsewhere."
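A minimal sketch of such a brief builder (everything here is hypothetical; only the pinned-search-path requirement comes from this recommendation):

```typescript
// Hypothetical helper: pin an explicit codebase path into each sub-agent
// brief so Worker and Reviewer search the target project rather than the
// spawning directory or its test fixtures.
interface BriefOptions {
  role: 'Worker' | 'Reviewer';
  task: string;
  codebasePath: string; // absolute path to the target project
}

function buildBrief({ role, task, codebasePath }: BriefOptions): string {
  return [
    `You are the ${role} sub-agent.`,
    `Search ONLY in ${codebasePath}; never the current working directory or test fixtures.`,
    `Cite every claim as file:line relative to ${codebasePath}.`,
    `Task: ${task}`,
  ].join('\n');
}
```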
## Conclusion
**Does diligence work?** Yes, the process is sound:
- Worker researches and proposes
- Reviewer independently verifies
- Discrepancies are caught
- Multiple rounds can iterate
**Is it better than naive?** For this test, the results were similar. But:
- The Reviewer caught additional issues the Worker missed
- The process would catch hallucinated bugs if the Reviewer searched correctly
- The real value may lie in more complex or ambiguous tasks
**Next step:** Run a real test in a Claude Code session in nexus, with a more ambiguous task.