
Comparing LLMs for Test Synthesis. Part 2

EXPLYT TEAM

05.07.2025

10 MINUTES

In Part 1 of this article, we compared leading large language models (LLMs) from OpenAI, DeepSeek, and Alibaba in the context of test synthesis. We examined scenario quality, instruction following, cost, and other metrics. In this part, we additionally cover Claude Sonnet 4, Qwen3-235B, and Devstral. As before, all measurements were performed on Explyt Test’s internal benchmark.
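
For readers joining at Part 2: the benchmark exercises a two-stage pipeline, where the model first proposes natural-language test scenarios for a class and then turns those scenarios into a compilable test class. Below is a minimal sketch of such a pipeline, assuming an OpenAI-compatible API; the prompts and helper names are illustrative, not Explyt Test’s actual implementation.

```python
# A minimal two-stage test-synthesis pipeline (illustrative sketch,
# assuming the OpenAI Python SDK; not Explyt Test's real prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_scenarios(source_code: str, model: str = "o4-mini") -> str:
    """Stage 1: ask the model for natural-language test scenarios."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"List test scenarios for this class:\n\n{source_code}",
        }],
    )
    return resp.choices[0].message.content


def generate_tests(source_code: str, scenarios: str,
                   model: str = "o4-mini") -> str:
    """Stage 2: turn the scenarios into a compilable test class."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Write a JUnit test class implementing these "
                        f"scenarios:\n\n{scenarios}\n\n"
                        f"Class under test:\n{source_code}"),
        }],
    )
    return resp.choices[0].message.content
```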


Sonnet 4 vs OpenAI o4-mini

Test Scenarios

  • Sonnet 4’s level of detail makes its scenarios less comfortable for humans to read.
  • Sonnet 4’s scenarios reflect functional behavior statistically better than o4-mini’s and cover more complex, non-obvious conditions.

Code Generation from Scenarios

  • Sonnet 4 follows the project’s test style slightly better, though the difference is not statistically significant.
  • Tests generated by o4-mini reflect scenario logic statistically better (what that means in practice is illustrated below).
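
To make “reflects scenario logic” concrete, here is a hypothetical scenario and a test that implements it faithfully. The scenario text and `apply_discount` are invented for this example; the benchmark itself targets the project’s own classes.

```python
# Hypothetical illustration of tests that faithfully reflect their scenarios.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Function under test (hypothetical)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


# Scenario: "Applying a discount above 100% is rejected with a validation error."
def test_discount_above_100_percent_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 110.0)


# Scenario: "A 25% discount on 80.00 yields 60.00."
def test_regular_discount_is_applied():
    assert apply_discount(80.0, 25.0) == 60.0
```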

Formal Metrics

  • Average scenario length: Sonnet – 669 chars, o4-mini – 462 chars.
  • Out of 34 test classes, compiled: Sonnet – 26, o4-mini – 31.
  • Median token count is lower for Sonnet 4, explained by the absence of reasoning tokens (o4-mini spends tokens on reasoning).
  • Synthesis cost for Sonnet 4 was 1.7× higher than for o4-mini on our benchmark (a sketch of how these metrics can be computed follows this list).
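
The formal metrics above are straightforward to compute. In the sketch below, the `javac` compile check and the per-million-token prices are stand-ins for the benchmark’s real harness and price sheet.

```python
# Sketch of the formal metrics (compile check and pricing are placeholders).
import statistics
import subprocess


def avg_scenario_length(scenarios: list[str]) -> float:
    """Average scenario length in characters."""
    return statistics.mean(len(s) for s in scenarios)


def compiled_count(test_files: list[str]) -> int:
    """Number of generated test classes that compile (javac exit code 0)."""
    return sum(
        subprocess.run(["javac", path], capture_output=True).returncode == 0
        for path in test_files
    )


def synthesis_cost(prompt_tokens: int, completion_tokens: int,
                   in_price: float, out_price: float) -> float:
    """Dollar cost of one run, given per-million-token prices."""
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1e6
```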

Sonnet 4 vs GPT-4.1

Scenarios

  • GPT-4.1’s scenarios, like o4-mini’s, reflect basic functionality but do not cover complex cases the way Sonnet 4’s do.

Code Generation from Scenarios

  • GPT-4.1 follows project class style better than Sonnet 4.
  • GPT-4.1 tests capture scenario logic well, similar to o4-mini.

Formal Metrics

  • Average scenario length: GPT-4.1 – 523 chars.
  • Out of 34 tests compiled: GPT-4.1 – 26.
  • Median token count is lower than Sonnet 4’s.
  • Synthesis cost for GPT-4.1 is 1.35× lower than for o4-mini, but the quality is notably worse.

Vibe Checking – Sonnet 4 vs GPT-4.1 vs OpenAI o4-mini

We ran a blind comparison of all three models (GPT-4.1, o4-mini, Sonnet 4) with the internal team.

  • GPT-4.1 clearly loses to both o4-mini (at similar price) and Sonnet 4.
  • o4-mini vs Sonnet 4 is less clear: formal metrics are higher for o4-mini, but that may be due to simpler scenarios.
  • In practice, Sonnet 4’s code feels nicer to work with. However, given its ~2× higher cost and less stable API, it’s hard to claim it is strictly better than o4-mini. (One way to score such a blind vote is sketched below.)
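
A two-sided sign test over the non-tied votes is one standard way to score a blind pairwise comparison. The vote counts below are made up; only the method is the point.

```python
# Scoring a blind pairwise preference vote with a sign test
# (vote counts are hypothetical).
from scipy.stats import binomtest

wins_sonnet = 14  # reviewer preferred Sonnet 4's output (hypothetical)
wins_o4mini = 9   # reviewer preferred o4-mini's output (hypothetical)

# Under the null hypothesis of no preference, each non-tied vote is a
# fair coin flip, so a two-sided binomial (sign) test applies.
result = binomtest(wins_sonnet, wins_sonnet + wins_o4mini, p=0.5)
print(f"p-value = {result.pvalue:.3f}")  # high p => no clear winner
```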

Qwen2.5-Coder-32B vs Qwen3-235B (w/o reasoning)

Scenarios

  • Qwen3-235B clearly outperforms Qwen2.5-Coder-32B on all scenario quality metrics.

Code Generation from Scenarios

  • Qwen3-235B better follows project test style (p-value = 0.06).
  • Qwen2.5-Coder-32B’s tests reflect scenario logic somewhat better, but not to a statistically significant degree.

Formal Metrics

  • Average scenario length: Qwen3-235B – 549 chars, Qwen2.5-Coder-32B – 412.
  • Out of 34 test classes, compiled: Qwen3-235B – 16, Qwen2.5-Coder-32B – 18 (whether this gap is meaningful is checked in the sketch after this list).
  • Median token difference is negligible.
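
Whether 16 versus 18 compiled classes out of 34 is a real difference can be checked with a Fisher exact test on the 2×2 compiled/failed table; a sketch using the counts from the list above:

```python
# Significance check on the compiled-class counts from the list above.
from scipy.stats import fisher_exact

table = [[16, 18],            # compiled:  Qwen3-235B, Qwen2.5-Coder-32B
         [34 - 16, 34 - 18]]  # failed to compile
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")   # ~0.81: the gap is not statistically significant
```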

Devstral vs Qwen2.5-Coder-32B

Scenarios

  • Devstral is statistically worse than Qwen2.5-Coder-32B across all scenario quality metrics.

Code Generation from Scenarios

  • Devstral does not outperform Qwen2.5-Coder-32B on any criterion, even with simpler scenarios.

Formal Metrics

  • Average scenario length: Devstral – 549 chars, Qwen2.5-Coder-32B – 412.
  • Out of 34 tests compiled: Qwen2.5-Coder-32B – 18, Devstral – 13.
  • Devstral’s median token count is 2× lower than Qwen2.5-Coder-32B’s.

Conclusions

Let’s summarize the group comparisons:

  • Commercial Models:

    • o4-mini: The most practical choice.
    • Sonnet 4: Recommended if maximum quality is required.
  • Open-Source Models:

    • Qwen3-235B: Generates high-quality test scenarios.
    • Qwen2.5-Coder-32B: A strong baseline solution, runs on 1× H100.
    • DeepSeek-V3: The best open model, but GPU-hungry (see Part 1).
