When two heads are better than one: scientists experiment with collective work of neural networks

Today I want to talk about the Japanese startup Sakana AI, which created the open-source framework TreeQuest. It allows several different large language models to be used at once to achieve more accurate results. Plus a little about Grok4 Heavy.

Hello, tekkix! My name is Kirill Pshinnik, I am a researcher at Innopolis University and CEO of the ZeroCoder Online University, as well as the author of the book "Artificial Intelligence: A Path to a New World." As you can tell, I am interested in neural networks and the many ways they are used: I read the news and scientific papers, and I write articles myself.

Today, I want to talk about the Japanese startup Sakana AI, which invented the open-source framework TreeQuest. It allows using several different large language models at once to get a more accurate result.

But I'll start with pigeons.

The Collective Intelligence of Oncology Pigeons

It turns out that pigeons can distinguish malignant tumors from benign ones. They do this with high accuracy, and, interestingly, they do it even better when working in a team.

The study was conducted by a team from the University of Iowa. The test subjects were ordinary city pigeons, the familiar residents of streets and dovecotes. They were shown histological slides and mammograms, and the correct answer (cancer or not) was reinforced with food: classic operant conditioning, only without words, labels, or explanations.

After some time, individual birds began to demonstrate a stable accuracy of 80–85%. Moreover, they generalized what they had learned: the skill transferred to unfamiliar images they had never seen during training.

However, the key finding of the researchers was that the group accuracy was significantly higher than the individual accuracy.

This method was called flock-sourcing: the answers of a team of four birds were pooled, and if enough of them judged an image to be malignant, the system treated that as a "cancer" signal. Collective accuracy soared to 99%, far above any individual participant's result (73–85% on average). The only area where the pigeons lagged was particularly difficult mammograms with subtle masses, where even experienced radiologists often make mistakes.
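
As a rough illustration of why pooling helps at all, here is a toy simulation (my own, not the pooling rule from the study): independent observers, each right about 85% of the time, voting by simple majority.

```python
import random

def flock_accuracy(p_correct: float, flock_size: int, trials: int = 100_000) -> float:
    """Majority-vote accuracy of `flock_size` independent observers.

    Idealized assumption: every observer is right with the same probability
    and errors are independent. Real birds (and real models) make correlated
    mistakes, so the actual gain is smaller.
    """
    correct = 0
    for _ in range(trials):
        votes = sum(random.random() < p_correct for _ in range(flock_size))
        if votes > flock_size / 2:
            correct += 1
    return correct / trials

print(flock_accuracy(0.85, flock_size=1))  # ~0.85, a single observer
print(flock_accuracy(0.85, flock_size=5))  # ~0.97, a small "flock" voting by majority
```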

The fact that a group of "non-specialists" shows results comparable to professionals, albeit in a narrow scenario, is interesting in itself. But even more important is that collective perception (a sort of "averaged opinion") can be used in real-world tasks.

For example:

  • for evaluating the quality of medical images,

  • as an auxiliary filter when testing imaging systems,

  • or even as an analog of voting in ensemble neural networks.

The pigeon study was the first in a series. In the years since, similar results have appeared in machine learning, where collective accuracy is already put to work. And this makes one think: perhaps we should pay more attention not only to architectures and metrics, but also to "behavior in a flock."

Ensemble Learning

Ensemble learning is an approach in machine learning where several models are combined to achieve a more accurate and stable result than each one individually. The principle is the same as with pigeons: group opinion is more reliable than individual opinions.

How does this work in practice? There are several methods.

1. Bagging: models are trained independently on different subsets of the data. A classic example is Random Forest: it consists of many decision trees, each trained on a random part of the data. The final answer is determined by voting (classification) or averaging (regression).

2. Boosting: models are trained sequentially, and each new model tries to correct the errors of the previous ones. The best-known examples are XGBoost and CatBoost. Here, weak models (usually trees) "boost" each other.

3. Stacking: models are trained in parallel, and a meta-model is added on top, which learns from their outputs. For example, logistic regression combines the results of an SVM, a decision tree, and a neural network.
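
To make the three approaches concrete, here is a minimal sketch with scikit-learn on synthetic data; it is a generic illustration, not tied to any of the systems discussed below.

```python
# pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,       # bagging over decision trees
    GradientBoostingClassifier,   # sequential boosting of weak trees
    StackingClassifier,           # meta-model trained on base-model outputs
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
    "stacking (SVM + tree -> logistic regression)": StackingClassifier(
        estimators=[("svm", SVC(probability=True)), ("tree", DecisionTreeClassifier())],
        final_estimator=LogisticRegression(),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy {model.score(X_test, y_test):.3f}")
```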

Ensembles work because different models make different errors, and these errors are smoothed out when combined, while the strengths of individual models complement each other.

This is like expert voting: one might make a mistake, but if the majority agrees, the result will be more reliable.

TreeQuest: how models stopped competing and started working together

It was this ensemble mechanism that the Japanese startup Sakana AI decided to exploit. While most LLM developers focus on one thing, building a single large model that does everything faster, more accurately, and more deeply, Sakana AI proposes a different approach: not to compete, but to cooperate.

They developed the Adaptive Branching Monte Carlo Tree Search (AB-MCTS) algorithm, which allows multiple large language models to solve tasks together — like a team, where each agent contributes. This algorithm underpins TreeQuest — a new open-source framework already available under Apache 2.0.

Sakana AI believes that each LLM has its own strengths: one is better at logical structuring, another at generating text, and a third at making quality generalizations. Instead of trying to create a single "perfect" model, TreeQuest involves different LLMs working simultaneously, like agents in a collective problem-solving process.

The ideology is simple: "We perceive the unique features of each model not as limitations but as resources for forming collective intelligence."

The key is where the models are combined: not during fine-tuning or pretraining, but at inference time. The AB-MCTS algorithm organizes the search in the following way:

  • some agents deepen current hypotheses (depth-first search),

  • others suggest alternatives (breadth-first search),

  • and all of this takes previous answers into account, as in a decision tree.

This results in an iterative search with feedback between the models. Each model can use the answers from others as a hint and "guess" the solution.
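
The real algorithm lives in the TreeQuest repository; the sketch below is only a toy approximation of the idea, not the TreeQuest API. The ask_model() and score() helpers are hypothetical stand-ins, and the widen-or-deepen choice is a coin flip here, whereas AB-MCTS makes that choice adaptively.

```python
import random

# Toy approximation of adaptive-branching search over candidate answers.
MODELS = ["o4-mini", "gemini-2.5-pro", "deepseek-r1"]

def ask_model(model: str, prompt: str) -> str:
    # Dummy implementation so the sketch runs; in reality this calls the model's API.
    return f"[{model}] attempt at: {prompt[:40]}..."

def score(answer: str) -> float:
    # Dummy evaluator; in reality unit tests, a verifier, or a judge model.
    return random.random()

def ab_search(task: str, budget: int = 20) -> str:
    candidates: list[tuple[float, str]] = []   # (score, answer) pairs found so far
    for _ in range(budget):
        model = random.choice(MODELS)          # each step may use a different model
        if candidates and random.random() < 0.5:
            # "Go deeper": refine the best answer so far, feeding it back as a hint.
            best_score, best_answer = max(candidates)
            prompt = (f"{task}\n\nA previous attempt (score {best_score:.2f}):\n"
                      f"{best_answer}\nImprove on it.")
        else:
            # "Go wider": start a fresh, independent attempt.
            prompt = task
        answer = ask_model(model, prompt)
        candidates.append((score(answer), answer))
    return max(candidates)[1]                  # best-scoring answer wins

print(ab_search("Write a function that parses dates in three formats."))
```

The important part is the feedback loop: the refined prompt carries a previous answer, so each model can build on, or correct, what another model produced.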

What practice shows

TreeQuest was tested on ARC-AGI-2 — a benchmark for evaluating "generalizing intelligence". The variant with three models (o4-mini + Gemini 2.5 Pro + DeepSeek-R1-0528) solved over 30% of tasks, while the single o4-mini only solved 23%.

Interesting conclusions:

  • o4-mini makes a mistake,

  • DeepSeek and Gemini use this mistake as partial guidance,

  • and the result is a correct solution after a couple of iterations.

In other words, the wrong answer of one model can fuel the correct output of another.

TreeQuest is designed for tasks that require step-by-step, multi-stage solutions, especially in situations with a limited number of API calls to models. Examples include: automatic code generation and refactoring, improving the accuracy of ML models through data reinterpretation, reducing hallucinations in generative systems, optimizing computational services and pipelines.

All of this can be integrated via an open API, supporting custom quality metrics for solutions.

An interesting parallel: this is reminiscent of the pigeon experiments, where the flock diagnosed more accurately than individual birds. Here it is the same story, only instead of pigeons we have o4-mini, Gemini, and DeepSeek.

Grok4 Heavy: How the "study group" of neural networks is structured

While the industry competes in creating powerful universal models, Elon Musk's company xAI decided to take a different approach. Recently, they introduced Grok4 Heavy — an architecture in which neural networks work not alone, but cooperatively, like students preparing for an exam in a group.

The results are impressive: the model surpassed not only the base Grok4 but also flagship solutions from OpenAI and Google on several benchmarks. But the key here is not just the power, but the architectural idea, which is closer to collective intelligence than to classic inference.


Grok4 Heavy runs several agents in parallel. Each solves the task independently, without knowledge of the approaches of the others. And then the most interesting part begins:

  • agents compare their conclusions;

  • exchange ideas;

  • and collectively form the final answer.

Important: this is not just majority voting. Sometimes, only one agent finds the right path and explains it to the others. This makes the final decision not just an average, but one that has undergone internal interpretation and verification.

In essence, Grok4 Heavy implements a mechanism of collective interpretation: each model gives its version of the solution, but the final version is the result of dialogue between them.
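
xAI has not published implementation details, so the sketch below is just a generic "solve independently, then discuss, then synthesize" pattern in the same spirit; call_llm() is a hypothetical stand-in for a real API client.

```python
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["agent-1", "agent-2", "agent-3", "agent-4"]

def call_llm(agent: str, prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion call.
    return f"[{agent}] draft answer to: {prompt[:40]}..."

def solve_collectively(task: str) -> str:
    # Phase 1: each agent solves the task independently, without seeing the others.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda agent: call_llm(agent, task), AGENTS))

    # Phase 2: every agent sees all drafts and argues for or against them.
    shared = "\n\n".join(f"Agent {i + 1}:\n{d}" for i, d in enumerate(drafts))
    critiques = [
        call_llm(agent, f"{task}\n\nAll drafts:\n{shared}\n\n"
                        "Critique them and pick the strongest reasoning.")
        for agent in AGENTS
    ]

    # Phase 3: one synthesis pass turns the discussion into a single answer.
    discussion = "\n\n".join(critiques)
    return call_llm(AGENTS[0], f"{task}\n\nDiscussion:\n{discussion}\n\nGive the final answer.")

print(solve_collectively("Estimate how much water an Olympic pool holds."))
```

The synthesis step is what distinguishes this from plain voting: the final answer can come from a minority position if its reasoning holds up.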

This architecture requires significantly more computation during the inference phase compared to running a single model. But xAI bets that quality is more important than query cost, especially in tasks where the price of error is high: medicine, science, robotics.

Considering that the cost of an AI query keeps drifting down toward the underlying price of compute and electricity, this bet looks justified.

Collective AIs: A New Paradigm

Grok4 Heavy and TreeQuest from Sakana AI are different implementations of the same idea: abandoning the single "supermodel" in favor of a team of interacting agents.

In the first case, coordination is within the architecture, while in the second, it is between different external models (e.g., o4-mini, Gemini, and DeepSeek).

But the conclusion is the same: multi-headed intelligence works. In fact, in several tasks, it outperforms single models, even if each of them is strong on its own.

This approach opens up prospects for flexible AI systems in which:

  • different models specialize in their tasks: code, visual analysis, logic, language;

  • they interact, suggest, and "check" each other;

  • a new type of reliability and adaptability emerges, especially in unstable or interdisciplinary tasks.

In practice, this can lead to architectures where:

  • complex engineering tasks are solved by a "council" of models;

  • automated assistants reconsider their own hypotheses and change their opinions;

  • important decisions pass through not one, but a series of filters and rounds of rethinking.

The example with pigeons, TreeQuest, and Grok4 Heavy shows: collective intelligence is not a metaphor, but a working technical mechanism. We have only just started using it in AI, but we already see growth in accuracy, stability, and contextual flexibility.

The next step, apparently, is not to create yet another "largest model," but to build an environment where different models can learn from each other. As with some species in nature: it is not the strongest that survives, but the one capable of working in a team.
