
Which LLM Generates the Best Tests? Explyt’s Internal Benchmark Compares OpenAI, DeepSeek, and Qwen

EXPLYT TEAM

22.06.2025

9 MINUTES


Choosing an LLM for Test Synthesis

In this review, we compare several modern large language models (LLMs) on the task of test synthesis. All evaluations were conducted on Explyt’s internal benchmark, which includes both closed-source and open-source Java and Kotlin projects, with and without Spring.

The metrics include formal ones, such as line coverage of the tested class or method, the number of executed tests, the number of compilation errors, and mutation coverage, as well as LLM-as-judge metrics, such as the complexity, usefulness, and detail of test scenarios, and the alignment between a test method and its natural-language scenario, among others.

Experiments were carried out through the Explyt Test plugin for IntelliJ IDEA, where different models were connected and their test synthesis quality was measured against the benchmark. To ensure more precise grading, we use pairwise comparisons between models.
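For readers curious how the p-values below can come out of pairwise comparisons: one common reduction is an exact sign test over per-task wins. A hedged sketch; the win/loss counts are hypothetical, and Explyt's actual statistical procedure may differ:

```python
from math import comb

def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test: probability of a win/loss split at least
    this lopsided under the null hypothesis that both models are equally good."""
    n = wins + losses
    k = max(wins, losses)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5), then doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical example: model A beats model B on 18 of 22 benchmark tasks.
print(round(sign_test_p_value(wins=18, losses=4), 4))  # 0.0043
```

A p-value this small would let us call the pairwise difference statistically significant; an even 5-vs-5 split gives 1.0, i.e. no evidence either way.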


GPT-4o vs. GPT-4.1

We start with the solid baseline of GPT-4o and compare it to the newer GPT-4.1 from OpenAI.

On our benchmark, GPT-4.1 generates more complex, detailed, and useful scenarios (per our LLM-as-judge metrics), whereas GPT-4o mostly covers happy-path cases. GPT-4.1 also implements the requested scenario behaviors more faithfully, scoring 0.86 vs. 0.66 (p-value = 0.0006).

In terms of formal metrics like average code coverage and number of executed test classes, both models perform similarly without statistically significant differences.

Regarding cost, GPT-4.1 is cheaper per token, but because it consumes more tokens, the final price on our benchmark is nearly identical to GPT-4o's.
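The effect is simple arithmetic: total cost is tokens times per-token price, so a model that is cheaper per token but "talks more" can land on the same bill. The per-million-token prices and token counts below are illustrative placeholders, not our measured figures:

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Total cost in dollars for one benchmark run.
    Prices are dollars per one million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Illustrative only: model B is cheaper per token but emits more output
# tokens, so both runs end up costing exactly the same.
cost_a = run_cost(2_000_000, 400_000, in_price=2.50, out_price=10.00)  # 9.0
cost_b = run_cost(2_000_000, 625_000, in_price=2.00, out_price=8.00)   # 9.0
print(cost_a, cost_b)
```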


o4-mini vs. GPT-4.1

Next, let's compare OpenAI's reasoning-focused o4-mini against a standard LLM.

o4-mini outperforms GPT-4.1 in the number of executed tests (31 vs. 25) and in average code coverage (0.637 vs. 0.391, p-value = 0.01). Thanks to reasoning chains, o4-mini's median number of generated tokens is 3.5× higher than GPT-4.1's. As a result, the reasoning model ends up costing about 1.36× more on our benchmark.


DeepSeek-R1-0528/V3 vs. GPT-4.1

Now we move on to the large open-weight models from the Chinese startup DeepSeek.

The biggest advantage of Chinese models is price. On our benchmark, DeepSeek-V3 and DeepSeek-R1-0528 spent roughly 8–9× less than OpenAI’s GPT-4.1 (~$0.40 vs. $3.46).

In terms of quality, we cannot say that DeepSeek R1 or V3 are significantly different from GPT-4.1. On formal metrics, both DeepSeek models even outperform GPT-4.1 in absolute numbers, though not statistically significantly.

When it comes to following instructions, DeepSeek has some shortcomings: on 1–2 benchmark tasks it synthesized a different number of tests than the prompt specified.

Overall, DeepSeek is a solid option, especially since you can further fine-tune the weights and deploy them within your company’s environment.


Qwen3-32B with Reasoning vs. Qwen2.5-coder

Among open models with lower GPU requirements, we compared two from Alibaba’s Qwen family: the older code-specialized Qwen2.5-coder-32B vs. Qwen3-32B with reasoning enabled.

Surprisingly, the general-purpose Qwen3 is not much worse than the specialized coder model. Test scenarios generated by Qwen3 tend to be more detailed and useful compared to Qwen2.5-coder.

However, on formal metrics, Qwen3 underperforms in absolute values (though not statistically significantly). A downside is that Qwen3 sometimes loops during generation, which does not happen with Qwen2.5-coder.
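Looping of this kind is usually caught with a cheap repetition check on the generated text, so a stuck sample can be aborted and retried. A minimal sketch; the threshold of five identical trailing lines is an arbitrary assumption, not Explyt's actual heuristic:

```python
def is_looping(text: str, repeats: int = 5) -> bool:
    """Return True if the last non-empty line repeats verbatim at least
    `repeats` times in a row at the end of the generated output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < repeats:
        return False
    tail = lines[-repeats:]
    return all(ln == tail[0] for ln in tail)

# A stuck generation repeating the same assertion line over and over:
stuck = "fun test() {\n" + "    assertEquals(1, 1)\n" * 20
print(is_looping(stuck))                                           # True
print(is_looping("fun testAdd() { assertEquals(4, add(2, 2)) }"))  # False
```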

With reasoning enabled, Qwen3 does not surpass the coder model, while consuming ~2× more tokens and slowing down synthesis (median 18,114 tokens vs. 9,664).

Overall, considering Qwen3 is a general-purpose model, its benchmark results are fairly strong—but this comparison shows it is too early to retire Alibaba’s previous coding-focused model.
