
# Testing the Diligence MCP Server

## Test Suite Overview

The diligence test suite covers three areas:

1. **Workflow mechanics** - State transitions, round limits, phase enforcement
2. **Mock scenarios** - Predefined Worker-Reviewer interactions with expected outcomes
3. **Dry-run mode** - Test against real projects without making changes

## Quick Start

```bash
# Run all workflow tests
node test/run-tests.mjs --workflow

# Run mock scenario tests
node test/run-tests.mjs --mock

# Run a specific scenario
node test/run-tests.mjs --mock --scenario=blocking-voice

# Dry run against nexus (no changes)
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice
```

## Test Structure

```
test/
├── mcp-client.mjs      # Programmatic MCP client
├── run-tests.mjs       # Main test runner
├── dry-run.mjs         # Dry-run against real projects
├── fixture/            # Mock codebase for testing
│   ├── src/
│   │   ├── broker/events.ts
│   │   ├── services/
│   │   └── controllers/
│   └── .claude/
│       └── CODEBASE_CONTEXT.md
└── scenarios/          # Test scenarios
    ├── index.json
    ├── blocking-voice.json
    └── permission-cache.json
```

## Test Modes

### 1. Workflow Tests (`--workflow`)

Exercises the MCP server's mechanics without calling any AI model:

- Phase transitions (conversation → researching → approved → implementing)
- Round increment on NEEDS_WORK
- Max-rounds enforcement (the workflow resets after 5 rounds)
- Feedback accumulation
- Abort functionality

```bash
node test/run-tests.mjs --workflow
```
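The mechanics listed above can be pictured as a small state machine. The following is a minimal sketch, not the server's actual code: the state shape, the function names, and the reading of "resets after 5 rounds" as "the fifth NEEDS_WORK returns the workflow to its initial state" are all assumptions made for illustration.

```javascript
// Illustrative sketch: the state shape and function names are assumptions,
// not the diligence server's actual implementation.
const MAX_ROUNDS = 5;

function createWorkflow() {
  return { phase: "conversation", round: 0, feedback: [], task: null };
}

function start(state, task) {
  if (state.phase !== "conversation") {
    throw new Error(`cannot start from phase "${state.phase}"`);
  }
  return { ...state, task, phase: "researching", round: 1 };
}

function review(state, decision, note = "") {
  if (state.phase !== "researching") {
    throw new Error(`cannot review in phase "${state.phase}"`);
  }
  if (decision === "APPROVED") {
    return { ...state, phase: "approved" };
  }
  if (decision !== "NEEDS_WORK") {
    throw new Error(`unknown decision "${decision}"`);
  }
  // Assumed reading of the reset rule: the fifth NEEDS_WORK resets the workflow.
  if (state.round >= MAX_ROUNDS) {
    return createWorkflow();
  }
  // Otherwise keep the reviewer's note and move to the next round.
  return { ...state, round: state.round + 1, feedback: [...state.feedback, note] };
}

function implement(state) {
  // Phase enforcement: implementation is only reachable from "approved".
  if (state.phase !== "approved") {
    throw new Error("implementation requires an approved proposal");
  }
  return { ...state, phase: "implementing" };
}

function abort() {
  return createWorkflow();
}
```

Keeping each transition a pure function that returns a new state makes every rule in the list checkable with a plain equality assertion.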

### 2. Mock Tests (`--mock`)

Runs complete Worker-Reviewer scenarios with predefined responses:

1. The scenario defines a task and the expected naive and correct fixes
2. The test simulates a Worker submitting the naive proposal
3. The Reviewer catches the issues and responds with NEEDS_WORK
4. The Worker submits a revised proposal containing all fixes
5. The Reviewer approves
6. The test validates that the final proposal mentions all required elements

```bash
# All scenarios
node test/run-tests.mjs --mock

# Single scenario
node test/run-tests.mjs --mock --scenario=permission-cache
```
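The final validation step can be sketched as a keyword check against a scenario's `must_mention` list. This is a hedged illustration only: the helper name `findMissingKeywords` is hypothetical, and case-insensitive substring matching is an assumption about how the runner compares text.

```javascript
// Hypothetical helper: returns the required keywords a proposal fails to mention.
// Case-insensitive substring matching is an assumption, not confirmed behavior.
function findMissingKeywords(proposalText, mustMention) {
  const haystack = proposalText.toLowerCase();
  return mustMention.filter((keyword) => !haystack.includes(keyword.toLowerCase()));
}

// Example keywords taken from the blocking-voice scenario described in this document.
const mustMention = ["answerDmCall", "declineDmCall", "BusUserBlockChange"];

const naiveProposal = "Add a blocking check to answerDmCall.";
const revisedProposal =
  "Add blocking checks to answerDmCall and declineDmCall, " +
  "and subscribe to BusUserBlockChange to end calls in progress.";

const missingFromNaive = findMissingKeywords(naiveProposal, mustMention);
const missingFromRevised = findMissingKeywords(revisedProposal, mustMention);
```

Here `missingFromNaive` lists the two elements the naive proposal skipped, while `missingFromRevised` is empty.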

### 3. Dry Run (`--project`)

Connects to the MCP server in a real project directory:

```bash
# With predefined scenario
node test/dry-run.mjs --project=~/bude/codecharm/nexus --scenario=blocking-voice

# With custom task
node test/dry-run.mjs --project=/path/to/project --task="Fix the caching bug"
```

A dry run:
- Starts the workflow with the given task
- Shows the full Worker Brief, including the project's real `CODEBASE_CONTEXT.md`
- Does NOT make any code changes
- Aborts the workflow on exit
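Depending on the shell, the `~` in an argument such as `--project=~/bude/codecharm/nexus` may reach the script unexpanded, because many shells only expand a tilde at the start of a word. Whether `dry-run.mjs` already handles this is not stated here, so the following is only a sketch of one way to do it. The helper names `expandHome` and `parseProjectArg` are hypothetical.

```javascript
// Hypothetical helpers; not taken from dry-run.mjs.
// Expands a leading "~" or "~/" using the given home directory.
function expandHome(p, home = process.env.HOME ?? process.env.USERPROFILE ?? "") {
  if (p === "~") return home;
  if (p.startsWith("~/")) return home + p.slice(1);
  return p; // absolute paths, relative paths, and "~user" forms are left alone
}

// Extracts the value of --project=<path> from an argv-style array.
function parseProjectArg(argv, home) {
  const prefix = "--project=";
  const arg = argv.find((a) => a.startsWith(prefix));
  return arg === undefined ? undefined : expandHome(arg.slice(prefix.length), home);
}
```

In a script this would typically be called as `parseProjectArg(process.argv.slice(2))`. Passing the path without `=` (where the CLI supports it) or using `$HOME` instead of `~` sidesteps the problem entirely.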

## Test Fixture

The fixture (`test/fixture/`) is a mini codebase that mirrors real-world patterns. File paths in the table below are relative to `test/fixture/src/`.

### Files

| File | Purpose | Known Bugs or Notes (for testing) |
|------|---------|-------------------------|
| `broker/events.ts` | Event bus definitions | Reference implementation |
| `services/user-block.service.ts` | Blocking logic | Missing voice cleanup |
| `services/voice-channel.service.ts` | Voice/DM calls | Missing blocking check on answerDmCall |
| `services/team.service.ts` | Permission cache | Doesn't subscribe to role events |
| `services/chat.service.ts` | **Correct pattern** | Shows permission vs action separation |
| `controllers/roles.controller.ts` | Role CRUD | Missing broker events on create/delete |

### Patterns Tested

1. **Broker event emission** - Every state change should emit events
2. **Cache invalidation** - Caches must subscribe to relevant events
3. **Permission vs Action** - Permissions control visibility, actions have separate checks
4. **Multi-location fixes** - If one place has a check, similar places need it too
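The first two patterns work together: a state change emits a broker event, and any cache built on that state subscribes to it. The sketch below shows the idea with a tiny in-memory broker. Only the event name `BusTeamRoleChange` comes from this document; the `Broker` and `PermissionCache` classes, the `createRole` signature, and the payload shape are assumptions for illustration, not the fixture's code.

```javascript
// Minimal in-memory broker (hypothetical; stands in for the real event bus).
class Broker {
  constructor() {
    this.handlers = new Map();
  }
  subscribe(event, handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }
  emit(event, payload) {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Pattern 2: the cache subscribes to the event that makes its entries stale.
class PermissionCache {
  constructor(broker, loadPermissions) {
    this.entries = new Map();
    this.loadPermissions = loadPermissions;
    broker.subscribe("BusTeamRoleChange", ({ teamId }) => this.entries.delete(teamId));
  }
  get(teamId) {
    if (!this.entries.has(teamId)) {
      this.entries.set(teamId, this.loadPermissions(teamId));
    }
    return this.entries.get(teamId);
  }
}

// Pattern 1: the state change and the event emission happen together.
function createRole(broker, rolesByTeam, teamId, role) {
  rolesByTeam.get(teamId).push(role);
  broker.emit("BusTeamRoleChange", { teamId });
}

const broker = new Broker();
const rolesByTeam = new Map([["team-1", ["admin"]]]);
const cache = new PermissionCache(broker, (teamId) => [...rolesByTeam.get(teamId)]);

const before = cache.get("team-1"); // loaded and cached
createRole(broker, rolesByTeam, "team-1", "moderator");
const after = cache.get("team-1"); // reloaded, because the event evicted the entry
```

If `createRole` skipped the `emit` call, `after` would still be the stale cached list, which is the class of bug the fixture plants in `controllers/roles.controller.ts` and `services/team.service.ts`.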

## Test Scenarios

### blocking-voice

**Task:** Fix blocked users can still answer DM voice calls

**Naive fix:** Add a blocking check to `answerDmCall` only

**Correct fix:**
- Add blocking check to `answerDmCall`
- Add blocking check to `declineDmCall`
- Filter notifications for blocked users
- Add voice cleanup to `blockUser()`
- Subscribe to `BusUserBlockChange` for mid-call kicks
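The first two items of the correct fix illustrate the multi-location pattern: once one entry point needs a check, its siblings usually do too. One way to make that hard to forget is a single shared guard. The sketch below is illustrative only; the function names `answerDmCall` and `declineDmCall` come from this document, while `createVoiceService`, the `isBlocked` callback, and the call object shape are assumptions.

```javascript
// Hypothetical service factory; isBlocked(a, b) answers "has a blocked b?".
function createVoiceService(isBlocked) {
  // One guard shared by every DM call entry point.
  const assertNotBlocked = (callerId, calleeId) => {
    if (isBlocked(callerId, calleeId) || isBlocked(calleeId, callerId)) {
      throw new Error("call not allowed: one user has blocked the other");
    }
  };

  return {
    answerDmCall(call, userId) {
      assertNotBlocked(call.callerId, userId);
      return { ...call, status: "answered" };
    },
    declineDmCall(call, userId) {
      assertNotBlocked(call.callerId, userId);
      return { ...call, status: "declined" };
    },
  };
}
```

A naive fix would inline the check in `answerDmCall` alone, leaving `declineDmCall` unguarded.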

### permission-cache

**Task:** Fix permission cache doesn't invalidate when roles change

**Naive fix:** Add `.clear()` somewhere

**Correct fix:**
- Subscribe to `BusTeamRoleChange` in `team.service`
- Subscribe to `BusTeamMemberRoleChange` in `team.service`
- Add broker event to `createRole()`
- Add broker event to `deleteRole()`

## Adding New Scenarios

Create a new JSON file in `test/scenarios/`:

```json
{
  "id": "my-scenario",
  "name": "Human-readable name",
  "description": "Brief description",

  "task": "The task description given to the Worker",

  "naive_fix": {
    "description": "What a quick-fix agent would do",
    "changes": [
      { "file": "path/to/file.ts", "change": "Quick fix description" }
    ],
    "issues": [
      "What the naive fix misses #1",
      "What the naive fix misses #2"
    ]
  },

  "correct_fix": {
    "description": "Complete fix description",
    "required_changes": [
      { "file": "path.ts", "function": "funcName", "change": "What to change" }
    ],
    "required_broker_subscriptions": [
      { "service": "x.service.ts", "event": "BusEventName", "action": "What to do" }
    ]
  },

  "validation_criteria": {
    "must_mention": ["keyword1", "keyword2"],
    "should_reference_pattern": "reference-file.ts"
  }
}
```

Add to `test/scenarios/index.json`:

```json
{
  "scenarios": [
    { "id": "my-scenario", "file": "my-scenario.json" }
  ]
}
```
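Before registering a new scenario, it can help to sanity-check its shape. The following is a rough sketch of such a check. The helper name `checkScenario` is hypothetical, and which fields are treated as mandatory is an assumption drawn from the template above, not from the test runner.

```javascript
// Hypothetical pre-flight check for a parsed scenario object.
// Returns a list of human-readable problems; an empty list means it looks complete.
function checkScenario(scenario) {
  const problems = [];

  for (const key of ["id", "name", "description", "task"]) {
    if (typeof scenario[key] !== "string" || scenario[key].trim() === "") {
      problems.push(`"${key}" must be a non-empty string`);
    }
  }

  // Assumed to be required: the lists the mock flow and validation rely on.
  const lists = [
    ["naive_fix.issues", scenario.naive_fix?.issues],
    ["correct_fix.required_changes", scenario.correct_fix?.required_changes],
    ["validation_criteria.must_mention", scenario.validation_criteria?.must_mention],
  ];
  for (const [path, value] of lists) {
    if (!Array.isArray(value) || value.length === 0) {
      problems.push(`"${path}" must be a non-empty array`);
    }
  }

  return problems;
}
```

Calling it on the result of `JSON.parse` for the new file and printing any returned problems gives quick feedback before running the mock suite.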

## Real-World Testing

For true validation, test with real Claude sub-agents:

```
Root Agent:
  1. Call mcp__diligence__start with task
  2. Spawn Worker agent (Task tool) with get_worker_brief
  3. Worker researches, submits proposal via mcp__diligence__propose
  4. Spawn Reviewer agent (Task tool) with get_reviewer_brief
  5. Reviewer verifies claims, submits via mcp__diligence__review
  6. If NEEDS_WORK, spawn new Worker with updated brief
  7. If APPROVED, proceed to implementation
```
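The control flow of those steps can be sketched as a loop over injected agent functions. This is an illustration under stated assumptions: `runWorker` and `runReviewer` are hypothetical stand-ins for spawning sub-agents and calling the MCP tools, and the default limit of 5 rounds mirrors the max-rounds rule from the workflow tests.

```javascript
// Hypothetical orchestration loop; runWorker and runReviewer are injected stubs
// standing in for freshly spawned Worker and Reviewer agents.
async function runDiligence(task, { runWorker, runReviewer, maxRounds = 5 }) {
  const feedback = [];

  for (let round = 1; round <= maxRounds; round++) {
    // A new Worker each round sees only the task and the accumulated feedback.
    const proposal = await runWorker({ task, feedback: [...feedback], round });

    // The Reviewer sees the proposal, not the Worker's reasoning or search results.
    const verdict = await runReviewer({ task, proposal });

    if (verdict.decision === "APPROVED") {
      return { status: "approved", rounds: round, proposal };
    }
    feedback.push(verdict.feedback);
  }

  return { status: "max_rounds_exceeded", rounds: maxRounds, proposal: null };
}
```

Passing the agents in as functions keeps the loop testable with canned responses, which is essentially what the mock mode does.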

**Why separate agents matter:**
- Fresh context = no bias from previous reasoning
- Reviewer doesn't know Worker's search results
- Forces genuine verification, not rubber-stamping

## Success Criteria

1. **Reviewer catches issues** that Worker initially misses
2. **Multiple rounds** occur before approval
3. **Final proposal** is more complete than naive approach
4. **Validation criteria** are all met

## Environment Variables

| Variable | Purpose |
|----------|---------|
| `DEBUG=1` | Show MCP server stderr output |
| `ANTHROPIC_API_KEY` | Required for the planned `--live` mode (future) |

## CI Integration

```bash
# Run in CI
npm test

# Or directly
node test/run-tests.mjs --workflow && node test/run-tests.mjs --mock
```

Exit codes: 0 = pass, 1 = fail
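
For reference, a runner can map its results to these exit codes in one place. This is a minimal sketch, assuming each result is an object with a boolean `passed` field; the helper name `exitCodeFor` and the result shape are hypothetical.

```javascript
// Hypothetical helper: 0 when every test passed, 1 otherwise.
function exitCodeFor(results) {
  return results.every((result) => result.passed) ? 0 : 1;
}

// Typical use at the end of a runner script:
//   process.exitCode = exitCodeFor(results);
// Setting process.exitCode (rather than calling process.exit()) lets pending
// output finish writing before Node exits.
```

Note that an empty result list yields 0 here, so a runner may want to treat "no tests ran" as a separate failure.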