The pattern
What is auto-research?
The basic idea is to turn research into a closed loop. An agent proposes a small change, runs a fixed evaluation, compares the result to the current best score, and either keeps the change or throws it away.
That only works when the eval is cheap, repeatable, and difficult to game. The metric becomes the feedback signal that lets the loop search without a human judging every intermediate attempt.
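A minimal sketch of that loop in Swift, assuming hypothetical proposeAndApply, revert, and runEval closures that stand in for the agent step and the eval harness:

```swift
// Sketch only: keep a change when the metric improves, revert it otherwise.
// `Change`, `proposeAndApply`, `revert`, and `runEval` are placeholders, not a real API.
func autoResearchLoop<Change>(
    iterations: Int,
    proposeAndApply: () -> Change,   // agent proposes and applies a small change
    revert: (Change) -> Void,        // undo the change if it didn't help
    runEval: () -> Double            // cheap, repeatable, hard-to-game metric
) -> Double {
    var bestScore = runEval()
    for _ in 0..<iterations {
        let change = proposeAndApply()
        let score = runEval()
        if score > bestScore {
            bestScore = score        // keep the change
        } else {
            revert(change)           // throw it away
        }
    }
    return bestScore
}
```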
The problem
Unit tests: discovery and implementation.
When we ask an agent to write unit tests, we usually collapse two jobs into one. First it has to decide what behaviors are worth testing. Then it has to generate the actual test code.
The issue is discovery quality: how do we know the skill is targeting the correct behaviors and use cases before we ever ask it to write the final test implementation?
The target
Our focus.
We want to focus on the discovery side: finding the right behaviors and use cases to write tests for.
This experiment is limited to automatically determining which tests should be written; it does not generate the final XCTest implementation.
The dataset
A collection of ViewModels.
We use a collected set of ViewModels as the target surface for generating use cases and behaviors. Each run asks the skill to inspect those files and decide which tests should exist.
A ViewModel might contain validation rules, async submit paths, state transitions, enum modes, or computed properties that drive what the user sees.
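As a rough illustration (a made-up SignupViewModel, not one of the files in the dataset), a single ViewModel can bundle several of those behaviors:

```swift
import Combine

// Hypothetical example of the kind of ViewModel in the dataset:
// a validation rule, a small state machine, and an async submit path.
final class SignupViewModel: ObservableObject {
    enum State { case idle, submitting, success, failure(String) }

    @Published var email = ""
    @Published private(set) var state: State = .idle

    // Validation rule that drives what the user sees.
    var isEmailValid: Bool { email.contains("@") }

    // Async submit path with state transitions.
    func submit() async {
        guard isEmailValid else {
            state = .failure("Invalid email")
            return
        }
        state = .submitting
        // ... call a service here, then transition accordingly
        state = .success
    }
}
```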
The artifact
The output.
For each identified behavior or user path, we output a test title, a mock function name, and a short description of what that test should cover.
This gives us a clean intermediate artifact: the agent's answer to the question, "What should we test?"
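One possible shape for that artifact (the field names below are illustrative, not the exact schema):

```swift
// Hypothetical shape for one entry of the proposed test plan.
struct ProposedTest: Codable {
    let title: String            // human-readable test title
    let mockFunctionName: String // placeholder test function name
    let description: String      // what the test should cover
}

let entry = ProposedTest(
    title: "Submitting with an invalid email surfaces a failure state",
    mockFunctionName: "test_submit_invalidEmail_setsFailureState",
    description: "Calling submit() while isEmailValid is false should move state to .failure."
)
```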
The answer key
The golden set.
For selected ViewModels, we already have unit tests that humans wrote. Those tests define the behaviors this experiment treats as the target.
The eval does not ask whether every possible test was found. It asks how closely the agent's proposed plan matches the existing human-defined set.
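Using the hypothetical SignupViewModel sketched earlier, a human-written test like this would define one of those target behaviors:

```swift
import XCTest

// Illustrative golden test; the real golden set is the existing human-written suites.
final class SignupViewModelTests: XCTestCase {
    func test_submit_invalidEmail_setsFailureState() async {
        let viewModel = SignupViewModel()
        viewModel.email = "not-an-email"

        await viewModel.submit()

        guard case .failure = viewModel.state else {
            XCTFail("Expected a failure state for an invalid email")
            return
        }
    }
}
```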
The comparison
LLM as a judge.
The judge takes the agent's proposed test plan and the expected tests for the same ViewModel. It then decides which generated ideas match, which are extra, and which expected behaviors were missed.
This is the layer that turns a qualitative test plan into measurable feedback.
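One way to picture the judge's verdict and the score derived from it (the schema and scoring below are illustrative, not the exact ones the eval uses):

```swift
// Hypothetical verdict schema returned by the judge for one ViewModel.
struct JudgeVerdict: Codable {
    let matched: [String]   // proposed tests that cover an expected behavior
    let extra: [String]     // proposed tests with no counterpart in the golden set
    let missed: [String]    // expected behaviors the plan failed to cover
}

// Turn the verdict into precision/recall-style numbers the loop can compare.
func scores(for verdict: JudgeVerdict) -> (precision: Double, recall: Double) {
    let matched = Double(verdict.matched.count)
    let proposed = matched + Double(verdict.extra.count)
    let expected = matched + Double(verdict.missed.count)
    return (
        precision: proposed == 0 ? 0 : matched / proposed,
        recall: expected == 0 ? 0 : matched / expected
    )
}
```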