Diligence prevents AI agents from shipping quick fixes that break things by enforcing a research-propose-verify loop before any code changes.
Key features:
- Worker sub-agent researches and proposes with file:line citations
- Reviewer sub-agent independently verifies claims by searching codebase
- Iterates until approved (max 5 rounds)
- Loads project-specific context from .claude/CODEBASE_CONTEXT.md
- State persisted across sessions
Validated on production codebase: caught architectural mistake (broker subscriptions on client-side code) that naive agent would have shipped.
Testing the Diligence MCP Server
Test Suite Overview
The diligence project includes a comprehensive test suite that validates:
- Workflow mechanics - State transitions, round limits, phase enforcement
- Mock scenarios - Predefined Worker-Reviewer interactions with expected outcomes
- Dry-run mode - Test against real projects without making changes
Quick Start
# Run all workflow tests
node test/run-tests.mjs --workflow
# Run mock scenario tests
node test/run-tests.mjs --mock
# Run a specific scenario
node test/run-tests.mjs --mock --scenario=blocking-voice
# Dry run against nexus (no changes)
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice
Test Structure
test/
├── mcp-client.mjs # Programmatic MCP client
├── run-tests.mjs # Main test runner
├── dry-run.mjs # Dry-run against real projects
├── fixture/ # Mock codebase for testing
│ ├── src/
│ │ ├── broker/events.ts
│ │ ├── services/
│ │ └── controllers/
│ └── .claude/
│ └── CODEBASE_CONTEXT.md
└── scenarios/ # Test scenarios
    ├── index.json
    ├── blocking-voice.json
    └── permission-cache.json
Test Modes
1. Workflow Tests (--workflow)
Tests the MCP server mechanics without AI:
- Phase transitions (conversation → researching → approved → implementing)
- Round increment on NEEDS_WORK
- Max rounds enforcement (resets after 5 rounds)
- Feedback accumulation
- Abort functionality
node test/run-tests.mjs --workflow
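The mechanics these tests exercise can be pictured as a small state machine. The sketch below is a hypothetical model, not the server's actual code: the function names, the state shape, and the reading of "resets after 5 rounds" as a return to the conversation phase are all assumptions.

```javascript
// Hypothetical model of the workflow mechanics; names are illustrative.
const MAX_ROUNDS = 5;

// Initial state: no task started, no feedback collected.
function createWorkflow() {
  return { phase: 'conversation', round: 0, feedback: [] };
}

// conversation -> researching
function start(state) {
  if (state.phase !== 'conversation') {
    throw new Error(`cannot start from phase "${state.phase}"`);
  }
  return { ...state, phase: 'researching', round: 1 };
}

// researching -> approved, or another round on NEEDS_WORK
function review(state, verdict, note = '') {
  if (state.phase !== 'researching') {
    throw new Error(`cannot review in phase "${state.phase}"`);
  }
  if (verdict === 'APPROVED') return { ...state, phase: 'approved' };
  if (verdict !== 'NEEDS_WORK') throw new Error(`unknown verdict: ${verdict}`);
  // Assumption: hitting the round limit resets the whole workflow.
  if (state.round >= MAX_ROUNDS) return createWorkflow();
  return {
    ...state,
    round: state.round + 1,
    feedback: [...state.feedback, note], // feedback accumulates across rounds
  };
}

// approved -> implementing (phase enforcement: nothing else may implement)
function implement(state) {
  if (state.phase !== 'approved') {
    throw new Error(`cannot implement in phase "${state.phase}"`);
  }
  return { ...state, phase: 'implementing' };
}

// Abort discards all progress.
function abort() {
  return createWorkflow();
}
```

In this model, one NEEDS_WORK followed by APPROVED ends in round 2 with one feedback note, and a fifth consecutive NEEDS_WORK returns the state to its initial value.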
2. Mock Tests (--mock)
Tests complete Worker-Reviewer scenarios with predefined responses:
- Scenario defines a task and expected naive/correct fixes
- Test simulates Worker submitting naive proposal
- Reviewer catches issues, sends NEEDS_WORK
- Worker submits revised proposal with all fixes
- Reviewer approves
- Validates that proposal mentions all required elements
# All scenarios
node test/run-tests.mjs --mock
# Single scenario
node test/run-tests.mjs --mock --scenario=permission-cache
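The final validation step, checking that a proposal mentions every required element, might look roughly like the sketch below. It assumes a case-insensitive substring match against the scenario's must_mention list; the helper name and the example keywords are illustrative and may not match the real scenario files.

```javascript
// Hypothetical helper: returns the required keywords a proposal fails to mention.
function missingMentions(proposalText, mustMention) {
  const haystack = proposalText.toLowerCase();
  return mustMention.filter((keyword) => !haystack.includes(keyword.toLowerCase()));
}

// Illustrative keywords for the permission-cache scenario (assumed, not copied
// from permission-cache.json).
const required = ['BusTeamRoleChange', 'BusTeamMemberRoleChange', 'createRole', 'deleteRole'];

const naiveProposal = 'Add .clear() to the permission cache in team.service';
const revisedProposal =
  'Subscribe to BusTeamRoleChange and BusTeamMemberRoleChange in team.service; ' +
  'emit broker events from createRole() and deleteRole().';

const naiveGaps = missingMentions(naiveProposal, required);     // all four keywords
const revisedGaps = missingMentions(revisedProposal, required); // empty array
```

A scenario would pass only when the revised proposal leaves no gaps.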
3. Dry Run (--project)
Connects to MCP server in a real project directory:
# With predefined scenario
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice
# With custom task
node test/dry-run.mjs --project=/path/to/project --task="Fix the caching bug"
A dry run:
- Starts the workflow with the task
- Shows the full Worker Brief (including real CODEBASE_CONTEXT.md)
- Does NOT make any code changes
- Aborts the workflow on exit
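The dry-run commands pass options as --name=value flags, and shells often leave a ~ that follows = unexpanded, so a script in this style has to handle both. The sketch below shows one way to do that; the helper names are hypothetical and the real dry-run.mjs may parse its arguments differently.

```javascript
// Hypothetical helpers; not taken from dry-run.mjs.

// Turn ['--project=/x', '--mock'] into { project: '/x', mock: true }.
function parseFlags(argv) {
  const flags = {};
  for (const arg of argv) {
    const match = /^--([^=]+)(?:=(.*))?$/.exec(arg);
    if (match) flags[match[1]] = match[2] ?? true;
  }
  return flags;
}

// Expand a leading "~" to the home directory (assumes HOME is set).
function expandTilde(path, home = process.env.HOME ?? '') {
  if (path === '~') return home;
  return path.startsWith('~/') ? home + path.slice(1) : path;
}

// Collect the three options the dry-run examples use.
function resolveOptions(argv, home) {
  const flags = parseFlags(argv);
  return {
    project: typeof flags.project === 'string' ? expandTilde(flags.project, home) : undefined,
    scenario: flags.scenario,
    task: flags.task,
  };
}
```

A script would typically call resolveOptions(process.argv.slice(2)) and reject the run when project is undefined.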
Test Fixture
The fixture (test/fixture/) is a mini codebase that mirrors real-world patterns:
Files
| File | Purpose | Known Bugs (for testing) |
|---|---|---|
| broker/events.ts | Event bus definitions | Reference implementation |
| services/user-block.service.ts | Blocking logic | Missing voice cleanup |
| services/voice-channel.service.ts | Voice/DM calls | Missing blocking check on answerDmCall |
| services/team.service.ts | Permission cache | Doesn't subscribe to role events |
| services/chat.service.ts | Correct pattern | Shows permission vs action separation |
| controllers/roles.controller.ts | Role CRUD | Missing broker events on create/delete |
Patterns Tested
- Broker event emission - Every state change should emit events
- Cache invalidation - Caches must subscribe to relevant events
- Permission vs Action - Permissions control visibility, actions have separate checks
- Multi-location fixes - If one place has a check, similar places need it too
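The first two patterns work together: a cache stays correct only if every state change emits an event and the cache subscribes to it. The sketch below illustrates this with a minimal in-memory broker. The Broker and PermissionCache classes and the { teamId } payload are assumptions for illustration; only the BusTeamRoleChange event name comes from the fixture.

```javascript
// Minimal stand-in for the fixture's event bus (hypothetical API).
class Broker {
  constructor() {
    this.handlers = new Map();
  }
  subscribe(event, handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }
  emit(event, payload) {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Cache that drops a team's entry whenever that team's roles change.
class PermissionCache {
  constructor(broker, loadPermissions) {
    this.entries = new Map();
    this.loadPermissions = loadPermissions;
    broker.subscribe('BusTeamRoleChange', ({ teamId }) => this.entries.delete(teamId));
  }
  get(teamId) {
    if (!this.entries.has(teamId)) {
      this.entries.set(teamId, this.loadPermissions(teamId));
    }
    return this.entries.get(teamId);
  }
}

// A role update must emit the event, or the cache keeps serving stale data.
function updateRoles(broker, store, teamId, permissions) {
  store[teamId] = permissions;
  broker.emit('BusTeamRoleChange', { teamId });
}
```

Dropping either half, the emit in updateRoles or the subscribe in the constructor, reproduces the kind of stale-cache bug the fixture plants.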
Test Scenarios
blocking-voice
Task: Fix blocked users can still answer DM voice calls
Naive fix: Add blocking check to answerDmCall only
Correct fix:
- Add blocking check to answerDmCall
- Add blocking check to declineDmCall
- Filter notifications for blocked users
- Add voice cleanup to blockUser()
- Subscribe to BusUserBlockChange for mid-call kicks
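The first two items of the correct fix reflect the multi-location pattern: the same guard has to protect every entry point, not just the one named in the bug report. A rough sketch, assuming a symmetric block list and a call object with a callerId field (both hypothetical, as are the factory names):

```javascript
// Hypothetical block list; treats blocking as symmetric between two users.
function createBlockList() {
  const pairs = new Set();
  const key = (a, b) => [a, b].sort().join(':');
  return {
    block(a, b) {
      pairs.add(key(a, b));
    },
    isBlocked(a, b) {
      return pairs.has(key(a, b));
    },
  };
}

// Hypothetical voice service: one shared guard, applied at every call action.
function createVoiceService(blocks) {
  const assertNotBlocked = (call, userId) => {
    if (blocks.isBlocked(call.callerId, userId)) {
      throw new Error('call rejected: users have blocked each other');
    }
  };
  return {
    answerDmCall(call, userId) {
      assertNotBlocked(call, userId);
      return { ...call, status: 'active' };
    },
    declineDmCall(call, userId) {
      assertNotBlocked(call, userId); // the check a naive fix would skip
      return { ...call, status: 'declined' };
    },
  };
}
```

Sharing one guard function makes it harder for a later change to update one entry point and miss the other.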
permission-cache
Task: Fix permission cache doesn't invalidate when roles change
Naive fix: Add .clear() somewhere
Correct fix:
- Subscribe to BusTeamRoleChange in team.service
- Subscribe to BusTeamMemberRoleChange in team.service
- Add broker event to createRole()
- Add broker event to deleteRole()
Adding New Scenarios
Create a new JSON file in test/scenarios/:
{
  "id": "my-scenario",
  "name": "Human-readable name",
  "description": "Brief description",
  "task": "The task description given to the Worker",
  "naive_fix": {
    "description": "What a quick-fix agent would do",
    "changes": [
      { "file": "path/to/file.ts", "change": "Quick fix description" }
    ],
    "issues": [
      "What the naive fix misses #1",
      "What the naive fix misses #2"
    ]
  },
  "correct_fix": {
    "description": "Complete fix description",
    "required_changes": [
      { "file": "path.ts", "function": "funcName", "change": "What to change" }
    ],
    "required_broker_subscriptions": [
      { "service": "x.service.ts", "event": "BusEventName", "action": "What to do" }
    ]
  },
  "validation_criteria": {
    "must_mention": ["keyword1", "keyword2"],
    "should_reference_pattern": "reference-file.ts"
  }
}
Add to test/scenarios/index.json:
{
  "scenarios": [
    { "id": "my-scenario", "file": "my-scenario.json" }
  ]
}
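Before registering a new scenario, it can help to check that the file has the expected shape. The sketch below is a hypothetical validator: which fields are mandatory is an assumption based on the template above, not a rule taken from the test runner.

```javascript
// Hypothetical shape check; returns a list of problems (empty means OK).
function validateScenario(scenario) {
  const errors = [];
  for (const field of ['id', 'name', 'description', 'task']) {
    if (typeof scenario[field] !== 'string' || scenario[field] === '') {
      errors.push(`missing string field: ${field}`);
    }
  }
  if (!Array.isArray(scenario.naive_fix?.issues)) {
    errors.push('naive_fix.issues must be an array');
  }
  if (!Array.isArray(scenario.correct_fix?.required_changes)) {
    errors.push('correct_fix.required_changes must be an array');
  }
  if (!Array.isArray(scenario.validation_criteria?.must_mention)) {
    errors.push('validation_criteria.must_mention must be an array');
  }
  return errors;
}
```

The scenario object would come from JSON.parse on the file contents; a runner could refuse to load any scenario for which the returned list is not empty.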
Real-World Testing
For true validation, test with real Claude sub-agents:
Root Agent:
1. Call mcp__diligence__start with task
2. Spawn Worker agent (Task tool) with get_worker_brief
3. Worker researches, submits proposal via mcp__diligence__propose
4. Spawn Reviewer agent (Task tool) with get_reviewer_brief
5. Reviewer verifies claims, submits via mcp__diligence__review
6. If NEEDS_WORK, spawn new Worker with updated brief
7. If APPROVED, proceed to implementation
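The loop above can be sketched as a small orchestrator. Everything in the tools object is a hypothetical wrapper: start, getWorkerBrief, and getReviewerBrief stand in for the corresponding MCP tool calls, while spawnWorker and spawnReviewer stand in for launching a fresh sub-agent. The sketch assumes the Worker submits its own proposal and that spawnReviewer resolves to the decision the Reviewer submitted.

```javascript
// Hypothetical orchestration of the root agent's steps; tools are injected
// so the loop can be exercised with stubs.
async function runDiligence(task, tools, maxRounds = 5) {
  await tools.start(task);
  for (let round = 1; round <= maxRounds; round++) {
    // A new Worker each round: it sees only the (updated) brief.
    await tools.spawnWorker(await tools.getWorkerBrief());
    // A new Reviewer: it verifies claims without the Worker's context.
    const decision = await tools.spawnReviewer(await tools.getReviewerBrief());
    if (decision === 'APPROVED') return { approved: true, rounds: round };
  }
  return { approved: false, rounds: maxRounds };
}
```

Fetching a fresh brief and spawning a new agent on every iteration is what keeps each Worker and Reviewer free of earlier reasoning.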
Why separate agents matter:
- Fresh context = no bias from previous reasoning
- Reviewer doesn't know Worker's search results
- Forces genuine verification, not rubber-stamping
Success Criteria
- Reviewer catches issues that Worker initially misses
- Multiple rounds occur before approval
- Final proposal is more complete than naive approach
- Validation criteria are all met
Environment Variables
| Variable | Purpose |
|---|---|
| DEBUG=1 | Show MCP server stderr output |
| ANTHROPIC_API_KEY | Required for --live mode (future) |
CI Integration
# Run in CI
npm test
# Or directly
node test/run-tests.mjs --workflow && node test/run-tests.mjs --mock
Exit codes: 0 = pass, 1 = fail
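As an illustration of the exit-code convention, a runner could reduce its results to a single code as sketched below. The result shape ({ name, passed }) and the summarize helper are assumptions, not the actual run-tests.mjs internals.

```javascript
// Hypothetical result shape: { name: string, passed: boolean }.
function summarize(results) {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return { exitCode: failed.length === 0 ? 0 : 1, failed };
}
```

A runner could then assign process.exitCode = summarize(results).exitCode, which lets Node finish pending output before the process exits with that code.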