
Diligence vs Naive Approach: Comparison Report

Date: 2026-01-22
Test Bug: B1 - Blocked users can answer DM voice calls
Project: nexus (~/bude/codecharm/nexus)


Executive Summary

| Metric | Naive Approach | Diligence Approach |
|---|---|---|
| Bug verified exists? | Yes | Yes |
| Correct line numbers? | Yes (1050, 965) | Worker correct |
| Found declineDmCall gap? | Yes | ⚠️ Reviewer found it |
| Found notification filtering? | Yes | ⚠️ Reviewer found it |
| Found blockUser cleanup? | Yes | ⚠️ Reviewer found it |
| Reviewer caught errors? | N/A | Caught line-number discrepancy* |

*The Reviewer searched the wrong codebase (the test fixture instead of nexus), but the PROCESS of verification worked.


Bug Verification: CONFIRMED REAL

Evidence from actual nexus code:

// startDmCall (lines 965-969) - HAS blocking check ✅
const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
if (blocked) {
  throw new UserError('Cannot call this user');
}

// answerDmCall (line 1050+) - NO blocking check ❌
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
  // Only checks: auth, call exists, state=ringing, user=callee, not expired
  // MISSING: blocking check
}

// declineDmCall (line 1115+) - NO blocking check ❌
async declineDmCall(callId: MongoId): Promise<void> {
  // Only checks: auth, call exists, state=ringing, user=callee
  // MISSING: blocking check
}
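
A minimal sketch of what the fix could look like, mirroring the startDmCall pattern above. This is hedged: the loadRingingCall helper and the call fields are hypothetical stand-ins for the existing checks; only the blocking check itself is taken from the snippets above.

// Sketch only: mirror the startDmCall blocking check in answerDmCall.
// loadRingingCall is a hypothetical helper standing in for the existing
// auth/state/expiry checks; userBlockService and UserError are taken
// from the startDmCall snippet above.
async answerDmCall(callId: MongoId): Promise<{ token: string; channelId: string }> {
  const call = await this.loadRingingCall(callId);

  const blocked = await this.userBlockService.isBlockingEitherWay(call.callerId, call.calleeId);
  if (blocked) {
    throw new UserError('Cannot answer this call');
  }

  // ... existing token/channel issuance continues unchanged
}

The same two-line check would slot into declineDmCall.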

Conclusion: Bug B1 is REAL. Both approaches correctly identified it.


Detailed Comparison

Naive Approach Output

The naive agent (single Explore agent) produced:

  • Root cause analysis
  • Correct file identification (voice-channel.rpc.ts)
  • Correct line numbers (965-969, 1050-1109)
  • Compared startDmCall vs answerDmCall patterns
  • Identified additional issues:
    • declineDmCall needs blocking check
    • notifyDmCall needs filtering
    • blockUser() needs voice cleanup
    • BusUserBlockChange subscription needed
  • Implementation order recommendation
  • Edge cases considered

Quality: Surprisingly thorough. Searched actual code, cited lines, found patterns.
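
To illustrate the notification-filtering gap the naive agent flagged, a hedged sketch (the shape of notifyDmCall and the pushService API are assumptions, not actual nexus code; only the isBlockingEitherWay check comes from the verified snippets):

// Sketch only: hypothetical notifyDmCall with blocked-user filtering.
async notifyDmCall(callerId: MongoId, calleeId: MongoId, callId: MongoId): Promise<void> {
  // Skip the notification entirely if either side blocks the other,
  // mirroring the isBlockingEitherWay check used in startDmCall.
  const blocked = await this.userBlockService.isBlockingEitherWay(callerId, calleeId);
  if (blocked) return;

  await this.pushService.sendCallNotification(calleeId, callId); // hypothetical push API
}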

Diligence Approach Output

Worker:

  • Verified bug exists by searching code
  • Correct line numbers
  • Cited exact file:line
  • Proposed fix matching startDmCall pattern

Reviewer:

  • Attempted independent verification (correct process)
  • Searched wrong codebase (test fixture, 220 lines)
  • Noticed discrepancy ("file only 220 lines, Worker cited 1050")
  • Found additional gaps (declineDmCall, notification filtering)
  • Gave NEEDS_WORK decision with specific issues

Quality: Process worked correctly. The Reviewer caught a "discrepancy" (even if it stemmed from searching the wrong place).
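
That discrepancy check can be made mechanical. A minimal sketch in TypeScript (assuming citations arrive as a file path plus line number; the brief format is an assumption):

import { readFile } from 'node:fs/promises';

// Sketch only: verify that a Worker citation like "voice-channel.rpc.ts:1050"
// at least points inside the file. This catches the "file only 220 lines,
// Worker cited 1050" class of discrepancy automatically.
async function verifyCitation(projectRoot: string, file: string, line: number): Promise<boolean> {
  const text = await readFile(`${projectRoot}/${file}`, 'utf8');
  const lineCount = text.split('\n').length;
  if (line > lineCount) {
    console.warn(`Citation ${file}:${line} exceeds file length (${lineCount} lines)`);
    return false;
  }
  return true;
}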


Key Findings

1. Both approaches verified the bug exists

Neither approach blindly trusted the task description. Both:

  • Searched for answerDmCall implementation
  • Compared with startDmCall pattern
  • Verified blocking check is actually missing

2. Naive approach was surprisingly thorough

The single agent produced analysis comparable to the Worker's. This suggests:

  • For bugs with clear descriptions, naive approach may suffice
  • The value of diligence may be in more ambiguous tasks

3. Reviewer process works, but needs correct context

The Reviewer:

  • Did NOT rubber-stamp the Worker's proposal
  • Actually searched and found discrepancies
  • Caught additional issues the Worker missed
  • BUT searched the wrong codebase due to test setup

4. Test setup flaw identified

The Reviewer searched /Users/marc/bude/strikt/diligence/test/fixture/ instead of ~/bude/codecharm/nexus. This is because:

  • Agents were spawned from the diligence project
  • They defaulted to searching the current working directory

Fix needed: In real usage the diligence MCP server runs IN the target project, so this wouldn't happen; the test harness, however, should pin the codebase path explicitly (see Recommendation 4).


What Diligence Should Catch That Naive Might Miss

Based on this test, diligence adds value when:

  1. Worker makes incorrect claims - Reviewer verifies by searching
  2. Worker misses related issues - Reviewer's independent search finds them
  3. Task description is wrong - Both should verify bug exists, not assume
  4. Patterns are misunderstood - Reviewer checks against CODEBASE_CONTEXT.md

This test showed:

| Scenario | Did Diligence Help? |
|---|---|
| Verify bug exists | Both approaches did this |
| Catch wrong line numbers | Reviewer caught discrepancy |
| Find additional gaps | Reviewer found more than Worker |
| Prevent hallucinated bugs | Would catch, if Reviewer searched correctly |

Recommendations

1. Run real test in nexus project

Start a Claude Code session IN nexus and test the full workflow there. This ensures:

  • MCP server runs in correct project
  • Agents search the right codebase
  • Full context from CODEBASE_CONTEXT.md is loaded

2. Test with a more ambiguous bug

B1 is well-documented. Test with something like:

  • "Voice seems laggy sometimes"
  • "Users report weird permission issues"

These require more investigation to even determine if there's a bug.

3. Test if diligence catches non-bugs

Give a task for a bug that doesn't exist. Does the workflow correctly identify "no bug found"?

4. Add explicit codebase path to Worker/Reviewer briefs

The briefs should specify: "Search in /path/to/project, not elsewhere"
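
A hedged sketch of how a brief could pin the path (the brief structure itself is an assumption; the point is only that the project root is stated explicitly):

// Sketch only: prepend an explicit project root to every brief so
// sub-agents never fall back to their own working directory.
function buildBrief(projectRoot: string, task: string): string {
  return [
    `Project root: ${projectRoot}`,
    `All searches and file:line citations MUST refer to files under ${projectRoot}; do not search elsewhere.`,
    '',
    task,
  ].join('\n');
}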


Conclusion

Does diligence work? Yes, the process is sound:

  • Worker researches and proposes
  • Reviewer independently verifies
  • Discrepancies are caught
  • Multiple rounds can iterate
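
In outline, the loop looks something like this (a sketch under assumptions: the agent interfaces are hypothetical and the round cap is illustrative, not a documented constant):

// Sketch only: the Worker-Reviewer iteration, with hypothetical agent interfaces.
interface Review { decision: 'APPROVED' | 'NEEDS_WORK'; issues: string[] }

interface Agents {
  workerPropose(task: string): Promise<string>;       // research + proposal with file:line citations
  workerRevise(proposal: string, issues: string[]): Promise<string>;
  reviewerVerify(proposal: string): Promise<Review>;  // independent search of the codebase
}

async function diligenceLoop(agents: Agents, task: string, maxRounds = 5): Promise<string> {
  let proposal = await agents.workerPropose(task);
  for (let round = 1; round <= maxRounds; round++) {
    const review = await agents.reviewerVerify(proposal);
    if (review.decision === 'APPROVED') return proposal; // claims verified
    proposal = await agents.workerRevise(proposal, review.issues);
  }
  throw new Error(`Not approved within ${maxRounds} rounds`);
}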

Is it better than naive? For this test, similar results. But:

  • Reviewer caught additional issues Worker missed
  • Process would catch hallucinated bugs if Reviewer searches correctly
  • Real value may be in more complex/ambiguous tasks

Next step: Run a real test in a Claude Code session in nexus, with a more ambiguous task.