
Diligence vs Naive Approach: Comparison Report

Date: 2026-01-22
Test Bug: B1 - Blocked users can answer DM voice calls
Project: nexus (~/bude/codecharm/nexus)


Executive Summary

| Metric | Naive Approach | Diligence Approach |
|---|---|---|
| Bug verified exists? | Yes | Yes |
| Correct line numbers? | Yes (1050, 965) | Worker correct |
| Found declineDmCall gap? | Yes | ⚠️ Reviewer found it |
| Found notification filtering? | Yes | ⚠️ Reviewer found it |
| Found blockUser cleanup? | Yes | ⚠️ Reviewer found it |
| Reviewer caught errors? | N/A | Caught line-number discrepancy* |

*The Reviewer searched the wrong codebase (the test fixture instead of nexus), but the PROCESS of verification worked.


Bug Verification: CONFIRMED REAL

Evidence from actual nexus code:

// startDmCall (lines 965-969) - HAS blocking check ✅
const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
if (blocked) {
  throw new UserError('Cannot call this user');
}

// answerDmCall (line 1050+) - NO blocking check ❌
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
  // Only checks: auth, call exists, state=ringing, user=callee, not expired
  // MISSING: blocking check
}

// declineDmCall (line 1115+) - NO blocking check ❌
async declineDmCall(callId: MongoId): Promise<void> {
  // Only checks: auth, call exists, state=ringing, user=callee
  // MISSING: blocking check
}
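
A minimal sketch of what the fix could look like, mirroring the startDmCall pattern above. This is hedged: the loadRingingCall helper and the call fields are hypothetical stand-ins for the existing checks; only the blocking check itself is taken from the snippets above.

// Sketch only: mirror the startDmCall blocking check in answerDmCall.
// loadRingingCall is a hypothetical helper standing in for the existing
// auth/state/expiry checks; userBlockService and UserError are taken
// from the startDmCall snippet above.
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
  const call = await this.loadRingingCall(callId);

  const blocked = await this.userBlockService.isBlockingEitherWay(call.callerId, call.calleeId);
  if (blocked) {
    throw new UserError('Cannot answer this call');
  }

  // ... existing token/channel issuance continues unchanged
}

The same two-line check would slot into declineDmCall.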

Conclusion: Bug B1 is REAL. Both approaches correctly identified it.


Detailed Comparison

Naive Approach Output

The naive agent (single Explore agent) produced:

  • Root cause analysis
  • Correct file identification (voice-channel.rpc.ts)
  • Correct line numbers (965-969, 1050-1109)
  • Compared startDmCall vs answerDmCall patterns
  • Identified additional issues:
    • declineDmCall needs blocking check
    • notifyDmCall needs filtering
    • blockUser() needs voice cleanup
    • BusUserBlockChange subscription needed
  • Implementation order recommendation
  • Edge cases considered

Quality: Surprisingly thorough. Searched actual code, cited lines, found patterns.
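
To illustrate the notification-filtering gap the naive agent flagged, a hedged sketch (the shape of notifyDmCall and the pushService API are assumptions, not actual nexus code; only the isBlockingEitherWay check comes from the verified snippets):

// Sketch only: hypothetical notifyDmCall with blocked-user filtering.
async notifyDmCall(callerId: MongoId, calleeId: MongoId, callId: MongoId): Promise<void> {
  // Skip the notification entirely if either side blocks the other,
  // mirroring the isBlockingEitherWay check used in startDmCall.
  const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
  if (blocked) return;

  await this.pushService.sendCallNotification(calleeId, callId); // hypothetical push API
}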

Diligence Approach Output

Worker:

  • Verified bug exists by searching code
  • Correct line numbers
  • Cited exact file:line
  • Proposed fix matching startDmCall pattern

Reviewer:

  • Attempted independent verification (correct process)
  • Searched wrong codebase (test fixture, 220 lines)
  • Noticed discrepancy ("file only 220 lines, Worker cited 1050")
  • Found additional gaps (declineDmCall, notification filtering)
  • Gave NEEDS_WORK decision with specific issues

Quality: Process worked correctly. The Reviewer caught a "discrepancy" (even if it stemmed from searching the wrong place).
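
That discrepancy check can be made mechanical. A minimal sketch in TypeScript (assuming citations arrive as a file path plus line number; the brief format is an assumption):

import { readFile } from 'node:fs/promises';

// Sketch only: verify that a Worker citation like "voice-channel.rpc.ts:1050"
// at least points inside the file. This catches the "file only 220 lines,
// Worker cited 1050" class of discrepancy automatically.
async function verifyCitation(projectRoot: string, file: string, line: number): Promise<boolean> {
  const text = await readFile(`${projectRoot}/${file}`, 'utf8');
  const lineCount = text.split('\n').length;
  if (line > lineCount) {
    console.warn(`Citation ${file}:${line} exceeds file length (${lineCount} lines)`);
    return false;
  }
  return true;
}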


Key Findings

1. Both approaches verified the bug exists

Neither approach blindly trusted the task description. Both:

  • Searched for answerDmCall implementation
  • Compared with startDmCall pattern
  • Verified blocking check is actually missing

2. Naive approach was surprisingly thorough

The single agent produced analysis comparable to the Worker's. This suggests:

  • For bugs with clear descriptions, naive approach may suffice
  • The value of diligence may be in more ambiguous tasks

3. Reviewer process works, but needs correct context

The Reviewer:

  • Did NOT rubber-stamp the Worker's proposal
  • Actually searched and found discrepancies
  • Caught additional issues the Worker missed
  • BUT searched the wrong codebase due to test setup

4. Test setup flaw identified

The Reviewer searched /Users/marc/bude/strikt/diligence/test/fixture/ instead of ~/bude/codecharm/nexus. This is because:

  • Agents were spawned from the diligence project
  • They defaulted to searching the current working directory

Fix needed: In real usage the diligence MCP server runs IN the target project, so this wouldn't happen; the test harness, however, should pin the codebase path explicitly (see Recommendation 4).


What Diligence Should Catch That Naive Might Miss

Based on this test, diligence adds value when:

  1. Worker makes incorrect claims - Reviewer verifies by searching
  2. Worker misses related issues - Reviewer's independent search finds them
  3. Task description is wrong - Both should verify bug exists, not assume
  4. Patterns are misunderstood - Reviewer checks against CODEBASE_CONTEXT.md

This test showed:

| Scenario | Did Diligence Help? |
|---|---|
| Verify bug exists | Both approaches did this |
| Catch wrong line numbers | Reviewer caught discrepancy |
| Find additional gaps | Reviewer found more than Worker |
| Prevent hallucinated bugs | Would catch, if Reviewer searched correctly |

Recommendations

1. Run real test in nexus project

Start a Claude Code session IN nexus and test the full workflow there. This ensures:

  • MCP server runs in correct project
  • Agents search the right codebase
  • Full context from CODEBASE_CONTEXT.md is loaded

2. Test with a more ambiguous bug

B1 is well-documented. Test with something like:

  • "Voice seems laggy sometimes"
  • "Users report weird permission issues"

These require more investigation to even determine if there's a bug.

3. Test if diligence catches non-bugs

Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?

4. Add explicit codebase path to Worker/Reviewer briefs

The briefs should specify: "Search in /path/to/project, not elsewhere"
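
A hedged sketch of how a brief could pin the path (the brief structure itself is an assumption; the point is only that the project root is stated explicitly):

// Sketch only: prepend an explicit project root to every brief so
// sub-agents never fall back to their own working directory.
function buildBrief(projectRoot: string, task: string): string {
  return [
    `Project root: ${projectRoot}`,
    `All searches and file:line citations MUST refer to files under ${projectRoot}; do not search elsewhere.`,
    '',
    task,
  ].join('\n');
}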


Conclusion

Does diligence work? Yes, the process is sound:

  • Worker researches and proposes
  • Reviewer independently verifies
  • Discrepancies are caught
  • Multiple rounds can iterate
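
In outline, the loop looks something like this (a sketch under assumptions: the agent interfaces are hypothetical and the round cap is illustrative, not a documented constant):

// Sketch only: the Worker-Reviewer iteration, with hypothetical agent interfaces.
interface Review { decision: 'APPROVED' | 'NEEDS_WORK'; issues: string[] }

interface Agents {
  workerPropose(task: string): Promise<string>;       // research + proposal with file:line citations
  workerRevise(proposal: string, issues: string[]): Promise<string>;
  reviewerVerify(proposal: string): Promise<Review>;  // independent search of the codebase
}

async function diligenceLoop(agents: Agents, task: string, maxRounds = 5): Promise<string> {
  let proposal = await agents.workerPropose(task);
  for (let round = 1; round <= maxRounds; round++) {
    const review = await agents.reviewerVerify(proposal);
    if (review.decision === 'APPROVED') return proposal; // claims verified
    proposal = await agents.workerRevise(proposal, review.issues);
  }
  throw new Error(`Not approved within ${maxRounds} rounds`);
}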

Is it better than naive? For this test, similar results. But:

  • Reviewer caught additional issues Worker missed
  • Process would catch hallucinated bugs if Reviewer searches correctly
  • Real value may be in more complex/ambiguous tasks

Next step: Run a real test in a Claude Code session in nexus, with a more ambiguous task.