32blog by Studio Mitsu

Claude Code Test Generation: A Practical Guide to AI-Assisted Testing

Learn how to enforce AI-generated test quality with CLAUDE.md rules, Hooks automation, TDD subagent workflows, and Stryker mutation testing in Vitest projects.

by omitsu · 13 min read

To get useful tests from Claude Code, you need three layers: CLAUDE.md rules that define quality standards, Hooks that automate test execution after every edit, and mutation testing (Stryker) to verify the generated tests actually catch bugs — not just inflate coverage numbers.

Ask Claude Code to "write tests" and you'll get 20+ tests in minutes. The problem is that half of them might be meaningless. Run Stryker against Claude-generated tests and you may see mutation scores in the low 60s. Almost 40% of mutations surviving means the tests look good on paper but can't detect real code changes.

AI-generated tests are great at inflating coverage numbers but struggle to write tests that actually catch bugs. Duplicate tests covering the same code paths, happy-path-only assertions that skip edge cases, hallucinated API calls to methods that don't exist — these are the typical problems when you let Claude Code handle test generation unsupervised.

This article covers the system I built to fix this: CLAUDE.md rules for quality enforcement, Hooks for automated test execution, TDD workflows with subagents, and mutation testing to quantitatively verify test quality.

*Diagram: CLAUDE.md (test rule definitions, constrain) → Hooks (automated test runs, automate) → TDD enforcement (test-first workflow) → Quality verification (mutation testing, verify)*

Why Let Claude Code Write Your Tests

Testing is the most "tedious but necessary" part of development. Claude Code can dramatically accelerate this work.

Why Claude Code is strong at test generation:

  • Reads the entire codebase — it understands not just function signatures, but call sites and dependencies, writing tests with full context
  • Runs the test→fail→fix loop autonomously — it fixes failing tests and iterates until everything passes. This is fundamentally different from GitHub Copilot's inline completions
  • Learns from existing test patterns — it matches the style and conventions of tests already in your project

OpenObserve scaled from 380 to 700+ tests using Claude Code's automated testing, reducing flaky tests by 85%. Feature analysis time dropped from 45–60 minutes to 5–10 minutes.

But "write tests" as a prompt isn't enough. You need systems that enforce quality.


The Problems with AI-Generated Tests

When you let Claude Code generate tests without guardrails, these problems appear consistently.

1. Dead weight tests

Multiple tests cover the same code path. Coverage goes up but bug detection doesn't improve. Ask Claude to test a URL slug generator, and you may find tests that cover the exact same code path with trivially different inputs mixed in.

2. Happy path bias

Claude generates tests for normal inputs and skips error paths. HTTP 500 errors, empty arrays, null inputs, and boundary conditions get ignored. In one run I reviewed, every generated test passed, yet the first malformed input in production would have caused a crash.

3. Hallucinations

Tests assume a function returns a Promise when it doesn't, call methods that don't exist, or use incorrect import paths. Some generated tests fail to compile entirely.

4. Implementation-first by default

Without explicit instructions, Claude Code writes implementation code first and tests second. Even when asked for TDD, it drifts back to implementation-first as context fills up.

5. Framework confusion

Uses Jest APIs in a Vitest project, creates unnecessary mock files, or modifies test configuration files without permission. I once found jest.mock() calls in a project that had never installed Jest.

Trying to solve these with "careful prompting" has limits. Systems that enforce rules are the answer.


Define Test Rules in CLAUDE.md

Writing test rules in CLAUDE.md ensures consistent test quality from the start of every session. If you haven't set up CLAUDE.md yet, check out the CLAUDE.md design patterns guide first.

```markdown
# Testing Rules

## Framework
- Vitest 4.1 + React Testing Library + MSW for API mocking
- Test files: `__tests__/{module}.test.ts` (colocated with source)
- Config: vitest.config.ts (already configured, do not modify)

## Test Quality Rules
- NEVER write tests that only cover the happy path. Every test file must include edge cases
- ALWAYS test error paths: null input, empty arrays, network failures, validation errors
- NEVER mock what you don't own — use MSW for HTTP, real implementations for utilities
- ONE assertion focus per test. "should handle X" not "should handle X and Y and Z"
- ALWAYS include a descriptive test name that explains the expected behavior

## Test Structure
- Arrange-Act-Assert pattern for every test
- Use test.each() for parameterized tests instead of duplicating similar tests
- Group related tests with describe() blocks matching the function/component name

## Forbidden
- Do NOT modify vitest.config.ts or test setup files without asking
- Do NOT add snapshot tests (they pass trivially and catch nothing useful)
- Do NOT use jest.* APIs — this project uses Vitest (vi.*)
- Do NOT write tests after implementation. Write tests FIRST (TDD)
```

Apply rules per file pattern with .claude/rules/

You can load additional rules only when working with test files using path-scoped rules.

```markdown
<!-- .claude/rules/testing.md -->
---
paths:
  - "**/*.test.ts"
  - "**/*.test.tsx"
  - "**/__tests__/**"
---

# Test File Rules
- Import from vitest: describe, it, expect, vi, beforeEach, afterEach
- Import from @testing-library/react: render, screen, fireEvent, waitFor
- Import from msw: http, HttpResponse for API mocks
- Always clean up: afterEach(() => cleanup())
- Prefer userEvent over fireEvent for user interactions
```

This rule is automatically loaded into context only when Claude accesses matching test files. It doesn't consume tokens during normal development work.

Note: early versions of Claude Code had YAML parsing issues with the paths: frontmatter (#17204, #13905). These have been largely resolved — use the YAML list format shown above. If you also use Cursor, note that Cursor uses globs: instead of paths: in its rule files.


Automate Test Runs with Hooks

CLAUDE.md alone can't prevent rule violations. Hooks add mechanical enforcement by automating test execution.

Auto-run tests after file edits

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "npx vitest run --reporter=verbose 2>&1 | tail -20"
          }
        ]
      }
    ]
  }
}
```

Every file edit triggers Vitest automatically. Claude sees failures immediately and self-corrects.
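If running the whole suite on every edit is too slow, a narrower variant is possible. This sketch assumes your Claude Code version passes the tool's JSON payload to the hook on stdin (with the edited path at `.tool_input.file_path`) and that `jq` is installed; it uses Vitest's `related` command to run only the tests that import the edited file:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs npx vitest related --run 2>&1 | tail -20"
          }
        ]
      }
    ]
  }
}
```

On a large suite this cuts the per-edit feedback loop from the full run down to the handful of affected test files.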

Enforce all tests pass before session stops

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "agent",
            "prompt": "Run the full test suite with `npx vitest run`. If any test fails, fix it before stopping. Do not stop until all tests pass.",
            "timeout": 120
          }
        ]
      }
    ]
  }
}
```

The Stop hook with agent type runs a subagent before Claude stops. If tests fail, they get fixed automatically before the session ends.

Combined effect

| Hook | When | Effect |
| --- | --- | --- |
| PostToolUse + `Write\|Edit` | Every file edit | Claude sees failures immediately and self-corrects |
| Stop + agent | Before session ends | Guarantees all tests pass |

These two hooks together nearly eliminate the scenario of "untested code getting committed." Once this combo is in place, you'll start catching regressions that would otherwise slip through.


In Practice: Adding Tests to Existing Code

A practical workflow for adding tests to an existing Next.js project.

Step 1: Identify test targets

```text
Find functions in this project that have low test coverage.
Prioritize business logic (lib/, utils/, actions/).
```

Claude Code scans the codebase and lists modules without existing tests.

Step 2: Generate tests with priority

```text
Write tests for lib/auth/validate-session.ts.
Include edge cases: expired token, malformed format, null input.
```

Request tests one module at a time rather than all files at once. This keeps context focused and produces higher-quality tests.

Step 3: Review generated tests

Points to check in Claude Code's generated tests:

  • Each test is independent — no shared state between tests
  • Edge cases are covered — not just happy paths but error paths too
  • Assertions are specific — `toBe(true)` or `toEqual(expected)` instead of `toBeTruthy()`
  • Mocks are minimal — nothing mocked beyond what's necessary

Then ask Claude to prune the dead weight:

```text
Check if the generated tests have dead weight (duplicate tests).
Remove any that cover the same code paths.
```

TDD Workflow: Enforcing Test-First

Practicing TDD (Test-Driven Development) with Claude Code requires explicit instructions and subagent usage.

Basic TDD prompt

```text
Use TDD.

1. Write the test first (RED) — the test should fail
2. Write the minimum implementation to pass (GREEN)
3. Refactor (REFACTOR)

Run tests after each step and show me the results.
```

The context pollution problem

Running Red→Green→Refactor in a single session causes each phase's context to bleed into the next, degrading quality. This is called context pollution — a key factor in token management.

The fix is to isolate each phase in its own subagent.

```markdown
<!-- .claude/commands/tdd.md -->
Execute TDD workflow.

1. Test Writer agent (RED):
   - Write tests from the feature requirements in $ARGUMENTS
   - Confirm the tests fail

2. Implementer agent (GREEN):
   - Read only the generated tests
   - Write minimum code to make tests pass

3. Refactorer agent (REFACTOR):
   - Read both tests and implementation
   - Refactor while keeping tests passing
```

Each agent runs with its own independent context, preventing interference between phases. See the Claude Code commands cheatsheet for more on custom slash commands.


Verifying AI-Generated Test Quality

A test existing and a test catching bugs are different things. Mutation testing lets you quantitatively verify the quality of generated tests.

What is mutation testing

Mutation testing introduces small changes (mutations) to source code and checks whether tests detect them.

  • Example mutation: `if (a > b)` → `if (a >= b)`
  • Test fails → mutant "killed" (test is effective)
  • Test passes → mutant "survived" (test is insufficient)

Run mutation tests with Stryker

Install Stryker with the Vitest runner plugin:

```bash
npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner
npx stryker run
```

```javascript
// stryker.config.mjs
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  testRunner: "vitest",
  mutate: ["src/lib/**/*.ts", "!src/lib/**/*.test.ts"],
  reporters: ["html", "clear-text"],
  vitest: {
    configFile: "vitest.config.ts",
  },
};
```

Reading the results

```text
Mutation score: 78.5%
Killed: 51 | Survived: 14 | No coverage: 3
```

  • Above 85%: Test quality is solid
  • 70–85%: Some edge cases are missing
  • Below 70%: Tests need significant revision

Fix surviving mutants with Claude

```text
Read the Stryker mutation report (stryker-report/mutation.html).
Add test cases for surviving mutants.
```

Claude Code identifies where mutants survived and adds tests covering those code paths. Repeat until mutation score exceeds 85%.


Frequently Asked Questions

Does Claude Code support Vitest out of the box?

Yes. Claude Code detects your test framework from your project configuration. If you have vitest in your package.json and a vitest.config.ts, it will use Vitest APIs automatically. The problem is it sometimes confuses Jest and Vitest APIs — which is exactly why you should declare the framework explicitly in CLAUDE.md.

Can I use these patterns with Jest instead of Vitest?

Absolutely. The CLAUDE.md rules, Hooks configuration, and TDD workflow are framework-agnostic. Swap npx vitest run for npx jest in the Hook commands, and update the CLAUDE.md framework section. Stryker supports Jest as a test runner too.

How many tokens does the PostToolUse Hook consume?

The PostToolUse Hook runs Vitest and pipes the last 20 lines of output back to Claude. In a typical project with 50–100 tests, this adds roughly 200–400 tokens per file edit. The Stop Hook with type "agent" is heavier — budget around 2,000–5,000 tokens depending on how many fixes it needs to make.

Is the TDD subagent approach worth the extra token cost?

For small bug fixes or simple utilities, no — just write tests normally. For new features with complex logic (auth flows, data transformations, multi-step workflows), the subagent TDD approach pays for itself by producing better-structured code and fewer regressions. I reserve it for modules where getting the interface right matters more than speed.

What mutation score should I target?

85% is the practical sweet spot. Below 70% means your tests have real gaps. Above 90% often means you're writing tests for trivial code paths. Focus mutation testing on business logic (lib/, utils/, actions/) — don't waste time mutating UI components or configuration files.

Can Claude Code read Stryker HTML reports directly?

Claude Code can read the text-based output from npx stryker run, which includes surviving mutant details. For the HTML report, you'll need to describe the findings or paste the relevant sections. The clear-text reporter in the Stryker config gives Claude enough detail to target surviving mutants.

How do I prevent Claude from modifying my test configuration?

Add explicit rules to CLAUDE.md: "Do NOT modify vitest.config.ts or test setup files without asking." Claude Code respects these instructions reliably. For extra safety, add those file paths to your .claudeignore file.

Does this work with monorepos?

Yes, but you'll need separate CLAUDE.md test rules per package or use path-scoped rules in .claude/rules/. The Hooks configuration applies project-wide, so adjust the Vitest command to target the specific package: npx vitest run --project=packages/core.


Wrapping Up

Claude Code's test generation can't be quality-controlled by prompt engineering alone. Systems that enforce rules are the answer.

| Layer | Role | Setup |
| --- | --- | --- |
| CLAUDE.md | Define test rules (framework, quality standards, forbidden patterns) | 5 minutes |
| Hooks | Automate test execution (PostToolUse + Stop) | 5 minutes |
| TDD workflow | Enforce test-first (subagent isolation) | 10 minutes |
| Mutation testing | Verify generated test quality (Stryker) | 15 minutes |

Define rules in CLAUDE.md, enforce execution with Hooks, verify quality with mutation testing. With these three layers in place, Claude Code generates tests that catch bugs — not tests that only inflate coverage numbers. The total setup takes about 35 minutes — and it's saved me hours of debugging bad tests since.
