Autogeneration of tests for Java/Kotlin in IntelliJ IDEA: comparison of AI tools

For most developers, tests are the least favorite part of the job. We recently confirmed this by surveying more than 400 developers at the Joker and Heisenbug conferences about their attitude towards AI tools for testing. In this article, we will share what else we learned from them, which AI tools exist for automatic test generation, and what their pros and cons are.

Why no one wants to write tests, and what language models have to do with it

So, what conclusions did the Joker and Heisenbug participants help us draw? First, it turned out that although the advent of transformers gave the community a powerful code generation tool, most respondents do not use any AI tools in their daily work.

We assume this is due to security requirements. Many of the specialists we talked to confirmed that they are not ready to consider AI assistants without an on-premise option.

Secondly, according to developers, there are several main problems with writing tests:

  • laziness / takes too long / boring (the overwhelming majority!)

  • repetitive work

  • it is difficult to come up with many edge cases

  • it is difficult to properly mock dependencies

If a process is monotonous and boring, it is worth thinking about automating it. The problem of test autogeneration has existed for a very long time; it was traditionally solved with test template generation, property-based testing, fuzzing, and symbolic execution. The first two approaches still force you to write part of the tests manually, for example, to come up with test data or properties. The last two generate unreadable tests that are hard to maintain: symbolic execution, for instance, can create five thousand tests with one hundred percent coverage, but reading and maintaining them in the long run is practically impossible.
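Property-based testing is a good illustration of that manual effort: you still have to invent both the property and the input generator yourself. Here is a minimal hand-rolled sketch in Kotlin (no property-testing library; the `reverse` function is a toy example made up for this article):

```kotlin
import kotlin.random.Random

// Toy function under test, invented for this example.
fun reverse(s: String): String = s.reversed()

fun main() {
    val random = Random(42)
    repeat(1_000) {
        // Generator: the developer still writes this by hand.
        // Random lowercase strings of length 0..20.
        val input = List(random.nextInt(21)) { 'a' + random.nextInt(26) }.joinToString("")

        // Properties: the developer also has to come up with these.
        check(reverse(reverse(input)) == input) { "double reverse failed for \"$input\"" }
        check(reverse(input).length == input.length) { "length changed for \"$input\"" }
    }
    println("All properties held for 1000 random inputs")
}
```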

Thus, the ideal test tool:

  • does routine work for the developer

  • saves the developer's time on coming up with corner cases

  • understands the context of the codebase

  • generates maintainable human-readable tests

  • helps with mocking

Thanks to ChatGPT, a new promising way of generating tests using a language model has emerged: copy your code into ChatGPT, ask for tests, and it will generate them for you.
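For a small, self-contained function this works surprisingly well. As a rough illustration (the function and the tests below were written for this article, not produced by any particular model), pasting a method like `fizzBuzz` into the chat and asking for JUnit 5 tests typically yields something along these lines:

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical function pasted into the chat.
fun fizzBuzz(n: Int): String = when {
    n % 15 == 0 -> "FizzBuzz"
    n % 3 == 0 -> "Fizz"
    n % 5 == 0 -> "Buzz"
    else -> n.toString()
}

// The kind of JUnit 5 test an LLM typically returns for it.
class FizzBuzzTest {
    @Test
    fun `multiples of three return Fizz`() = assertEquals("Fizz", fizzBuzz(9))

    @Test
    fun `multiples of five return Buzz`() = assertEquals("Buzz", fizzBuzz(10))

    @Test
    fun `multiples of fifteen return FizzBuzz`() = assertEquals("FizzBuzz", fizzBuzz(30))

    @Test
    fun `other numbers are returned as text`() = assertEquals("7", fizzBuzz(7))
}
```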

Why do we need specialized plugins for test generation at all?

Strictly speaking, a plugin is not required for test generation. You can subscribe to one of the models and send generation requests directly in the web chat at chatgpt.com. There are also many plugins that integrate a chat into the IDE, such as Codeium, or autocomplete code, like GitHub Copilot. Why not use them for test generation as well?

The key problem is gathering the code context. If you give the LLM only the code of the function without its dependencies, it is strange to expect good tests or that they will even compile (imagine you were given an unfamiliar codebase, forbidden to look into it, and asked to write a test for a random function). Manually specifying all dependencies is also a routine mechanical task. Therefore, a good test generation plugin should automatically gather reasonable dependencies into the prompt.

Prompts (task descriptions for the LLM in natural language) are a very important part of interacting with the LLM. The quality of the prompt determines the quality of the generated tests: whether they compile, whether they follow the style of the codebase, and what test coverage they provide. Composing such a prompt well is laborious work: the developer ends up replacing the monotonous task of writing tests with the task of correcting and rewriting the prompt. LLM plugins for test generation try to shield the user from this nightmare by offering ready-made generation pipelines with prompts for different use cases already built in.
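To give a feel for what such a built-in prompt might look like, here is a rough, hypothetical template assembled in Kotlin; the structure is our illustration, not the actual prompt of any plugin discussed below:

```kotlin
// A hypothetical prompt template for test generation.
// The placeholders (language version, frameworks, code under test,
// collected dependencies, a sample existing test) are exactly the
// pieces a good plugin should fill in automatically.
fun buildTestGenerationPrompt(
    languageVersion: String,
    testFramework: String,
    mockingFramework: String,
    codeUnderTest: String,
    dependencySources: List<String>,
    sampleExistingTest: String?,
): String = buildString {
    appendLine("You are generating unit tests for a $languageVersion project.")
    appendLine("Use $testFramework and $mockingFramework, and match the project's test style.")
    appendLine("Code under test:")
    appendLine(codeUnderTest)
    appendLine("Relevant dependencies:")
    dependencySources.forEach { appendLine(it) }
    sampleExistingTest?.let {
        appendLine("Follow the structure of this existing test:")
        appendLine(it)
    }
    appendLine("Return only compilable test code, covering normal and edge cases.")
}
```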

Why do we need specialized plugins for test generation if AI assistants for code generation, such as Codeium, Cursor, or GigaCode, can collect context and use task-specific prompts? Their capabilities are broader but less tailored in terms of user experience: a test plugin should not only generate code but also integrate with the existing codebase. For example, it should determine the language version, the build system (Maven, Gradle, Gradle Kotlin DSL), the mocking framework (Mockito or MockK), the test library (Kotest, JUnit, TestNG), the test style used in the project, and so on. If all of this has to be specified in the prompt manually, there is no particular difference between generating tests with chatgpt.com and with Codeium.
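For example, for the same service the "right" test looks quite different depending on whether the project uses MockK or Mockito, and a plugin that ignores this convention produces tests that feel foreign to the codebase. A sketch (the `UserService` and `UserRepository` types are invented here, and a real project would contain only one of the two styles):

```kotlin
import io.mockk.every
import io.mockk.mockk
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import org.mockito.Mockito.mock
import org.mockito.Mockito.`when`

// Hypothetical production code.
interface UserRepository { fun findName(id: Long): String? }
class UserService(private val repo: UserRepository) {
    fun greet(id: Long): String = "Hello, ${repo.findName(id) ?: "guest"}!"
}

class UserServiceTest {
    // In a MockK codebase the generated test should look like this...
    @Test
    fun `greets known user with mockk`() {
        val repo = mockk<UserRepository>()
        every { repo.findName(1L) } returns "Alice"
        assertEquals("Hello, Alice!", UserService(repo).greet(1L))
    }

    // ...while a Mockito codebase expects this style instead.
    @Test
    fun `greets known user with mockito`() {
        val repo = mock(UserRepository::class.java)
        `when`(repo.findName(1L)).thenReturn("Alice")
        assertEquals("Hello, Alice!", UserService(repo).greet(1L))
    }
}
```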

☝️🤓: Despite the obvious advantage of understanding the context, language models have drawbacks. For example, you cannot guarantee that the code provided by the LLM will compile and run. Also, each of the existing LLM tools operates with so-called "semantic coverage" rather than actual instruction coverage, so, generally speaking, you have no strict guarantees that increasing the number of LLM tests will increase instruction coverage.
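If instruction coverage matters to you, it is worth measuring it explicitly rather than taking the tool's word for it. For a Gradle Kotlin DSL build, that might look roughly like the following sketch (the plugin version and the threshold are placeholders):

```kotlin
// build.gradle.kts (fragment): measure real instruction coverage of the
// generated tests with JaCoCo instead of relying on "semantic coverage".
plugins {
    kotlin("jvm") version "2.0.0" // placeholder version
    jacoco
}

tasks.test {
    useJUnitPlatform()
    finalizedBy(tasks.jacocoTestReport) // always produce a report after the tests
}

tasks.jacocoTestReport {
    dependsOn(tasks.test)
    reports {
        html.required.set(true) // human-readable report in build/reports/jacoco
        xml.required.set(true)  // machine-readable report for CI
    }
}

tasks.jacocoTestCoverageVerification {
    violationRules {
        rule {
            limit {
                counter = "INSTRUCTION"
                minimum = "0.80".toBigDecimal() // fail the build below 80% instruction coverage
            }
        }
    }
}
```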

Which one to choose?

To assess the advantages of specialized plugins for test generation, it makes sense to compare them with each other.

For the analysis, we will consider Tabnine, Qodo, Explyt Test, and TestSpark, and also mention Diffblue Cover, which represents an ML-based approach that does not use an LLM. Codeium and GigaCode will serve as an alternative to the specialized approaches.

Which plugins will we compare?

We have already mentioned that you can use an AI assistant that works with the project context to generate tests. For comparison with the specialized tools, we will take two such products: Codeium and GigaCode.

Typical LLM-based test generation tools are Tabnine, Qodo, and Explyt Test. Language models let you quickly generate code in any language, while a dedicated UI and two-phase generation (test scenarios first, then code) help manage corner cases. Another LLM tool is TestSpark, a plugin for generating tests in Java and Kotlin; unlike Tabnine and Qodo, it skips scenario generation, but it lets you edit the code of each test directly in the generation window and is open source.

There are also ML-based solutions that do not use an LLM, such as Diffblue Cover. Its developers claim that, unlike tests generated by LLM tools, its tests always compile and run.

Comparison

The basic functionality of all the compared plugins is similar: you select the required method or class, press a button in the interface, and get the code of the test methods or classes.

What then distinguishes them from each other?

What unique features do these plugins have?

Automatic correction of non-compiling and failing tests

Frontier language models can classify emerging problems and offer solutions "out of the box". It is convenient when the test generation plugin has a button that allows you to automatically fix the test using the language model.

The need for autofixes stems from LLMs' inability to reliably generate compiling code and correctly import dependencies. We talked about these and other problems at JokerConf.
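Conceptually, such an autofix is a simple loop: compile and run the generated test, feed the error output back to the model together with the test, and retry a few times. A rough sketch with hypothetical helpers (`compileAndRun` and `askLlmToFix` are not real plugin APIs):

```kotlin
// Hypothetical result of compiling and running a generated test.
data class RunResult(val success: Boolean, val errorLog: String)

// These helpers are placeholders for what a real plugin does internally.
fun compileAndRun(testSource: String): RunResult = TODO("compile and execute the test")
fun askLlmToFix(testSource: String, errorLog: String): String = TODO("call the LLM with the error log")

// The core idea of automatic test repair: a bounded retry loop
// that feeds compiler/runtime errors back into the model.
fun repairTest(initialTest: String, maxAttempts: Int = 3): String? {
    var current = initialTest
    repeat(maxAttempts) {
        val result = compileAndRun(current)
        if (result.success) return current              // test compiles and passes
        current = askLlmToFix(current, result.errorLog) // otherwise ask the LLM to fix it
    }
    return null // give up after maxAttempts and let the user decide
}
```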



[Image: AI tools for autogeneration of tests in IntelliJ IDEA: review and comparison]

Using your own LLM key

If you already have a personal or corporate key from any provider, it is very convenient when the test generation plugin allows you to use it and interact with the provider directly.
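With your own key, "interacting with the provider directly" boils down to an ordinary HTTPS request. A minimal sketch using only the JDK HTTP client against the OpenAI chat completions endpoint (the model name and the prompt are placeholders):

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Minimal sketch: send a test-generation request straight to the provider
// with your own API key, with no intermediate servers involved.
fun main() {
    val apiKey = System.getenv("OPENAI_API_KEY") ?: error("set OPENAI_API_KEY")
    val body = """
        {
          "model": "gpt-4o",
          "messages": [
            {"role": "user", "content": "Write JUnit 5 tests for: fun add(a: Int, b: Int) = a + b"}
          ]
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.openai.com/v1/chat/completions"))
        .header("Authorization", "Bearer $apiKey")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // raw JSON containing the generated tests
}
```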



[Image: Process of autogeneration of tests for Java/Kotlin in IntelliJ IDEA using AI]

Provider and model customization

Language models are in an arms race, and OpenAI models do not suit everyone. It is good when the test generation plugin lets you change the provider and model to better fit your usage scenario. For example, DeepSeek is significantly cheaper than the top providers, and Groq generates thousands of tokens per second, while others measure generation speed in hundreds of tokens per second.



[Image: Comparison of the capabilities of AI tools for autogeneration of tests in IntelliJ IDEA]

Fine-tuning the model for your codebase

Fine-tuning is the process of adapting a model for a specific task. Some plugins allow you to fine-tune language models on your codebase to improve generation quality. According to Codeium, fine-tuning can significantly improve the quality of code proposed by the AI assistant.

☝️🤓: It is worth noting that fine-tuning in a closed loop is a complex task and may not be worth the effort. The effect of fine-tuning heavily depends on the quality of the data, and fine-tuning itself is difficult to automate.

Using a locally deployed model

Sometimes you do not want the code to go to the internet, or there is simply no internet access (for example, on a plane). If the computer is powerful enough, you can run a sufficiently large language model on it that will generate meaningful code. For this there is, for example, the Ollama project, which lets you deploy any model with open weights locally. It is good if the test generation plugin supports using a locally deployed model.
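A minimal sketch of talking to such a locally deployed model through Ollama's HTTP API (assuming Ollama is running on its default port 11434 and the model was pulled beforehand, e.g. with `ollama pull llama3`):

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Minimal sketch: ask a locally deployed model (via Ollama) to generate a test.
// Nothing leaves the machine: the request goes to localhost only.
fun main() {
    val body = """
        {
          "model": "llama3",
          "prompt": "Write a JUnit 5 test for: fun add(a: Int, b: Int) = a + b",
          "stream": false
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11434/api/generate"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // JSON whose "response" field contains the generated test
}
```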



[Image: Review of AI tools for autogeneration of tests in IntelliJ IDEA: Java and Kotlin]

☝️🤓: Although the possibility of running the model locally does exist, you need to be careful when using models for test generation. On most work laptops and computers, an arbitrary open model will either work too slowly or produce poor results. Since some laptop manufacturers are designing new devices with specialized processors for neural networks, this situation may improve in a few years.

For convenience, we have compiled all the considered attributes of all the solutions into one table. Since Diffblue Cover is based on reinforcement learning rather than language models, the LLM-related parameters are not applicable to it: Diffblue does not support test repair, you cannot choose an ML model for generation, and you cannot retrain it. Also, Diffblue Cover works exclusively locally, although it requires an internet connection on all plans except Enterprise.

| Plugin | Auto-fix tests | Own LLM key | Provider and model selection | Retraining | Local hosting |
|---|---|---|---|---|---|
| Codeium | no | no | yes | yes | no |
| GigaCode | no | no | no | no | no |
| Tabnine | no (for JVM) | yes (Enterprise) | Tabnine, OpenAI, Claude | yes (Enterprise) | no |
| Qodo | no (for JVM) | no | OpenAI, Claude, Gemini | no | no |
| TestSpark | yes | yes | HuggingFace, OpenAI | no | no |
| Diffblue Cover | no | - | - | - | works only locally |
| Explyt Test | yes, but only for LLM tests (compile + runtime) | yes | OpenAI, Claude, Gemini, DeepSeek, Groq, Cerebras, Anthropic | no | yes |

How do plugins work with project context?

Proper context selection is very important for obtaining high-quality results from LLM.

There are several ways to collect code context. The simpler one is heuristic, as in GitHub Copilot: analyze the last three open files. Such heuristics are poorly suited to test generation, which calls for a specialized algorithm. Although most AI assistants require you to collect context manually, automatic context collection can be implemented for the task of test generation.
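A very rough sketch of what "automatic context collection for test generation" means, using hypothetical model classes instead of real IDE (PSI) APIs:

```kotlin
// Hypothetical, simplified model of the codebase; a real plugin would
// work with the IDE's PSI/AST instead of these toy classes.
data class MethodUnderTest(
    val source: String,
    val parameterTypes: List<ClassModel>,
    val returnType: ClassModel,
    val calledMethods: List<MethodModel>,
)
data class MethodModel(val signature: String, val owner: ClassModel)
data class ClassModel(val name: String, val source: String, val isFromProject: Boolean)

// Collect the context an LLM needs to write a compilable test:
// the method itself plus the project classes it actually depends on.
fun collectContext(method: MethodUnderTest): List<String> {
    val projectClasses = buildSet {
        addAll(method.parameterTypes)
        add(method.returnType)
        method.calledMethods.forEach { add(it.owner) }
    }.filter { it.isFromProject } // library classes are known to the model anyway

    return listOf(method.source) + projectClasses.map { it.source }
}
```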

It is often desirable for a generated test to be similar to the existing ones, down to the specifics of the tested behavior. To ensure this "similarity" to the user's code automatically, the test plugin should be able to find similar tests and use them as a template: see how they are structured and pass that information into the prompt. It is also good if the user can pick such a reference test manually.

The code is often also influenced by non-code context: application configuration files that contain no code but are important for understanding the application under test. Examples of non-code context are an XML file with a Spring bean configuration or an .env file with environment variables.

| Plugin | Context | Using a similar test | Non-code context |
|---|---|---|---|
| Codeium | auto + manual | manual | no information |
| GigaCode | selection or current file | manual | no information |
| Tabnine | auto + manual | manual | no information |
| Qodo | auto + manual | manual | no information |
| TestSpark | auto | manual | no information |
| Diffblue Cover | auto | no | no information |
| Explyt Test | auto | auto | in development |

What if it is important which servers the code is sent to?

Although there are free chat providers and code generation tools, our surveys show that the vast majority of specialists do not use AI tools at all. When asked why, they usually answer that company policy prohibits it due to security requirements: the source code must not leave the company's perimeter, and using tools with third-party hosting is forbidden. However, if you really want to use such a tool, you can choose a hosting option that meets your privacy requirements:

  • Community. Your requests are sent to the LLM provider's server. Depending on the chosen plan, the provider may promise not to use your data to train models or may provide an opt-out (Qodo does this, providing an opt-out for the free version). With some providers (such as Tabnine) you can request zero data retention: your request data will not be stored on their servers at all.

  • (Virtual) Private Cloud ((V)PC). You use the cloud provider's infrastructure, while remote access to it is only available to you. Such a service is provided, for example, by Amazon, Yandex, and cloud.ru from Sber.

  • On-premise hosting. You fully manage the infrastructure on which the solution is deployed: it runs on your own servers or on servers to which you have access.

For international software security certification, the SOC 2 standard is used. In short, a SOC 2-certified SaaS provider meets five criteria: security (protection against unauthorized access), availability (ensuring service availability according to agreements), processing integrity (data accuracy and authorization), confidentiality (restricting access to data), and privacy (compliance with personal information processing policies). In Russia there is an equivalent, FSTEC certification, whose requirements are generally similar to SOC 2. None of the foreign plugins, of course, have passed FSTEC certification.

| Plugin | Private hosting options | Certifications | FSTEC certification |
|---|---|---|---|
| Codeium | on-prem (Enterprise) | SOC 2 | no |
| GigaCode | on-prem | no | yes |
| Tabnine | on-prem (Enterprise) | GDPR, SOC 2 | no |
| Qodo | on-prem (Enterprise) | SOC 2 | no |
| TestSpark | unavailable | no | no |
| Diffblue Cover | runs locally only | - | no |
| Explyt Test | on-prem (Enterprise) | SOC 2, GDPR (planned) | planned |

What can be obtained for free, and what for money?

Most often, money buys access to more powerful language models, removal of request limits, or improved UX. Tabnine's paid subscription removes the restriction on using frontier models (the most advanced models available, such as GPT-4o); Codeium's Pro version gives credits for frontier models and a more advanced context collection algorithm. GigaCode sells tokens for its models directly. Qodo provides all test generation features in the free version without restrictions, including frontier models, and adds code autocompletion and several other non-test-related features in the paid version. Explyt Test lets you purchase tokens to spend on generation; in a typical usage scenario, a programmer spends about 3,000 tokens per month.

Diffblue's limits differ from all the other providers: it offers a limited number of generations, that is, clicks on the "create test" button.

| Plugin | Works in Russia | Free | Paid |
|---|---|---|---|
| Codeium | no | only Codeium's own models | $15 |
| GigaCode | yes | 10^6 free tokens | depends on the model |
| Tabnine | no | Tabnine models: no limits; frontier models: 2-4 requests per day | $9 (90 days free) |
| Qodo | no | no limits | $19 (14 days free) |
| TestSpark | from sources | - | - |
| Diffblue Cover | no | 25 generations/month | $30: 100 generations/month |
| Explyt Test | yes | one-time: 1000 tokens | $0.01 per token |

In conclusion

There are other approaches to test generation. For critical code, tools based on symbolic execution, such as UTBot Java, can be considered. Symbolic execution allows for efficient enumeration of states that a program can reach, thus covering the program well with tests.

As a less fundamental approach, you can choose solutions based on automatic code analysis, such as Squaretest or Jtest. Because their correctness guarantees are weaker, these tools generate code faster.

We are preparing material on alternative approaches to test generation. What else would you like to know and what is interesting to discuss? Have you already tried tools for automatic test generation, with and without LLM? Share your experiences and opinions in the comments. Thank you for reading to the end :)
