
Comparing LLMs for Test Synthesis. Part 2

EXPLYT TEAM

05.07.2025

10 MINUTES

In Part 1 of this article, we compared leading large language models (LLMs) from OpenAI, DeepSeek, and Alibaba in the context of test synthesis. We examined scenario quality, instruction following, cost, and other metrics. In this part, we additionally cover Claude Sonnet 4, Qwen3-235B, and Devstral. As before, all measurements were performed on Explyt Test’s internal benchmark.
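
For readers joining at Part 2: the benchmark exercises a two-stage pipeline, where the model first proposes natural-language test scenarios for a class and then turns those scenarios into a compilable test class. Below is a minimal sketch of such a pipeline, assuming an OpenAI-compatible API; the prompts and helper names are illustrative, not Explyt Test’s actual implementation.

```python
# A minimal two-stage test-synthesis pipeline (illustrative sketch,
# assuming the OpenAI Python SDK; not Explyt Test's real prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_scenarios(source_code: str, model: str = "o4-mini") -> str:
    """Stage 1: ask the model for natural-language test scenarios."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"List test scenarios for this class:\n\n{source_code}",
        }],
    )
    return resp.choices[0].message.content


def generate_tests(source_code: str, scenarios: str,
                   model: str = "o4-mini") -> str:
    """Stage 2: turn the scenarios into a compilable test class."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Write a JUnit test class implementing these "
                        f"scenarios:\n\n{scenarios}\n\n"
                        f"Class under test:\n{source_code}"),
        }],
    )
    return resp.choices[0].message.content
```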


Sonnet 4 vs OpenAI o4-mini

Test Scenarios

  • Sonnet 4’s level of detail makes its scenarios less comfortable for humans to read.
  • Sonnet 4’s scenarios reflect functional behavior statistically better than o4-mini’s and cover more complex, non-obvious conditions.

Code Generation from Scenarios

  • Sonnet 4 follows the project’s test style slightly better, though the difference is not statistically significant.
  • Tests generated by o4-mini reflect scenario logic statistically better (what that means in practice is illustrated below).
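
To make “reflects scenario logic” concrete, here is a hypothetical scenario and a test that implements it faithfully. The scenario text and `apply_discount` are invented for this example; the benchmark itself targets the project’s own classes.

```python
# Hypothetical illustration of tests that faithfully reflect their scenarios.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Function under test (hypothetical)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


# Scenario: "Applying a discount above 100% is rejected with a validation error."
def test_discount_above_100_percent_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 110.0)


# Scenario: "A 25% discount on 80.00 yields 60.00."
def test_regular_discount_is_applied():
    assert apply_discount(80.0, 25.0) == 60.0
```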

Formal Metrics

  • Average scenario length: Sonnet – 669 chars, o4-mini – 462 chars.
  • Out of 34 test classes, compiled: Sonnet – 26, o4-mini – 31.
  • Median token count is lower for Sonnet 4, explained by the absence of reasoning tokens (o4-mini spends tokens on reasoning).
  • Synthesis cost for Sonnet 4 was 1.7× higher than for o4-mini on our benchmark (a sketch of how these metrics can be computed follows this list).
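
The formal metrics above are straightforward to compute. In the sketch below, the `javac` compile check and the per-million-token prices are stand-ins for the benchmark’s real harness and price sheet.

```python
# Sketch of the formal metrics (compile check and pricing are placeholders).
import statistics
import subprocess


def avg_scenario_length(scenarios: list[str]) -> float:
    """Average scenario length in characters."""
    return statistics.mean(len(s) for s in scenarios)


def compiled_count(test_files: list[str]) -> int:
    """Number of generated test classes that compile (javac exit code 0)."""
    return sum(
        subprocess.run(["javac", path], capture_output=True).returncode == 0
        for path in test_files
    )


def synthesis_cost(prompt_tokens: int, completion_tokens: int,
                   in_price: float, out_price: float) -> float:
    """Dollar cost of one run, given per-million-token prices."""
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1e6
```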

Sonnet 4 vs GPT-4.1

Scenarios

  • GPT-4.1’s scenarios, like o4-mini’s, reflect basic functionality but do not cover complex cases the way Sonnet 4’s do.

Code Generation from Scenarios

  • GPT-4.1 follows project class style better than Sonnet 4.
  • GPT-4.1 tests capture scenario logic well, similar to o4-mini.

Formal Metrics

  • Average scenario length: GPT-4.1 – 523 chars.
  • Out of 34 tests compiled: GPT-4.1 – 26.
  • Median token count is lower than Sonnet 4’s.
  • Synthesis cost for GPT-4.1 is 1.35× lower than for o4-mini, but the quality is notably worse.

Vibe Checking – Sonnet 4 vs GPT-4.1 vs OpenAI o4-mini

We ran a blind comparison of all three models (GPT-4.1, o4-mini, Sonnet 4) with the internal team.

  • GPT-4.1 clearly loses to both o4-mini (at similar price) and Sonnet 4.
  • o4-mini vs Sonnet 4 is less clear: formal metrics are higher for o4-mini, but that may be due to simpler scenarios.
  • In practice, Sonnet 4’s code feels nicer to work with. However, given its ~2× higher cost and less stable API, it’s hard to claim it is strictly better than o4-mini. (One way to score such a blind vote is sketched below.)
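
A two-sided sign test over the non-tied votes is one standard way to score a blind pairwise comparison. The vote counts below are made up; only the method is the point.

```python
# Scoring a blind pairwise preference vote with a sign test
# (vote counts are hypothetical).
from scipy.stats import binomtest

wins_sonnet = 14  # reviewer preferred Sonnet 4's output (hypothetical)
wins_o4mini = 9   # reviewer preferred o4-mini's output (hypothetical)

# Under the null hypothesis of no preference, each non-tied vote is a
# fair coin flip, so a two-sided binomial (sign) test applies.
result = binomtest(wins_sonnet, wins_sonnet + wins_o4mini, p=0.5)
print(f"p-value = {result.pvalue:.3f}")  # high p => no clear winner
```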

Qwen2.5-Coder-32B vs Qwen3-235B (w/o reasoning)

Scenarios

  • Qwen3-235B clearly outperforms Qwen2.5-Coder-32B on all scenario quality metrics.

Code Generation from Scenarios

  • Qwen3-235B better follows project test style (p-value = 0.06).
  • Qwen2.5-Coder-32B’s tests reflect scenario logic somewhat better, but not to a statistically significant degree.

Formal Metrics

  • Average scenario length: Qwen3-235B – 549 chars, Qwen2.5-Coder-32B – 412.
  • Out of 34 test classes, compiled: Qwen3-235B – 16, Qwen2.5-Coder-32B – 18 (whether this gap is meaningful is checked in the sketch after this list).
  • Median token difference is negligible.
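
Whether 16 versus 18 compiled classes out of 34 is a real difference can be checked with a Fisher exact test on the 2×2 compiled/failed table; a sketch using the counts from the list above:

```python
# Significance check on the compiled-class counts from the list above.
from scipy.stats import fisher_exact

table = [[16, 18],            # compiled:  Qwen3-235B, Qwen2.5-Coder-32B
         [34 - 16, 34 - 18]]  # failed to compile
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")   # ~0.81: the gap is not statistically significant
```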

Devstral vs Qwen2.5-Coder-32B

Scenarios

  • Devstral is statistically worse than Qwen2.5-Coder-32B across all scenario quality metrics.

Code Generation from Scenarios

  • Devstral does not outperform Qwen2.5-Coder-32B on any criterion, even with simpler scenarios.

Formal Metrics

  • Average scenario length: Devstral – 549 chars, Qwen2.5-Coder-32B – 412.
  • Out of 34 tests compiled: Qwen2.5-Coder-32B – 18, Devstral – 13.
  • Devstral’s median token count is 2× lower than Qwen2.5-Coder-32B’s.

Conclusions

Let’s summarize the group comparisons:

  • Commercial Models:

    • o4-mini: The most practical choice.
    • Sonnet 4: Recommended if maximum quality is required.
  • Open-Source Models:

    • Qwen3-235B: Generates high-quality test scenarios.
    • Qwen2.5-Coder-32B: A strong baseline solution, runs on 1× H100.
    • DeepSeek-V3: The best open model, but GPU-hungry (see Part 1).
