How we at Yandex are building a robotic arm with artificial intelligence

Ten to twenty years ago, many expected that robots controlled by artificial intelligence would take over the heavy and dangerous work in industry. Instead, neural networks found their way into offices, call centers, and support services, and even became useful to people in creative professions: copywriters, designers, programmers. Yet building robots that can independently perform complex physical manipulations with material objects remains a hard, unsolved problem.

In this article, I will explain how the ML R&D team in Yandex Market's robotics department is building a robotic arm and training the neural networks that let the robot interact with the physical world.

Introduction, or Why robots are still not walking on our streets

Hardware Quality

Building hardware for robots is not easy. I would single out sensors in particular. Modern RGB cameras produce decent images, but depth cameras and touch sensors remain a challenge. The human hand, able to sense the slightest changes in force and position and respond to them instantly, is the ideal model, yet replicating that functionality in robots is hard. To reach a level of precision comparable to a human hand, a robot needs sensors that are just as sensitive.

Our department does not design hardware, but we do tackle the problem of the robot's grip "strength": we want to teach the robot to understand what a given object is made of, so that it can apply just enough force to grasp the object without damaging it. This requires a synergy of algorithms and hardware with the right properties.
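As a toy illustration of the idea, here is a minimal sketch of force-feedback grasping: tighten the grip only until the object stops slipping, and never beyond a safety cap. The `gripper` interface and the numbers are hypothetical, not our production code.

```python
# A minimal sketch of force-feedback grasping (illustrative, not production code).
# Assumes a hypothetical `gripper` driver exposing grip force and slip feedback.

MAX_FORCE_N = 40.0   # hard safety cap, e.g. for fragile items
STEP_N = 0.5         # how much to tighten per control tick

def gentle_grasp(gripper, initial_force_n=2.0):
    """Tighten the grip just until the object stops slipping."""
    force = initial_force_n
    gripper.set_force(force)
    while gripper.slip_detected():          # tactile sensor reports micro-slip
        if force >= MAX_FORCE_N:            # give up rather than crush the object
            raise RuntimeError("Cannot hold object within the safe force limit")
        force = min(force + STEP_N, MAX_FORCE_N)
        gripper.set_force(force)
    return force
```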

World Dynamics

The world we live in has nonlinear dynamics: small changes in conditions or actions can lead to large, disproportionate changes in the outcome. To respond effectively to any change, robots need high precision and real-time adaptability, and this same nonlinearity is what makes motion control so difficult.

The classical approach to robotics, built on trajectory planning and analytic models of the world's dynamics, does not always transfer to reality. In tasks with a lot of tactile interaction with the physical world it stops working altogether, because it is impossible to compute every nuance of the interaction in advance. Even the smartest robot cannot account for all the irregularities of the object it is grasping and plan a perfect trajectory.

Our robots constantly take feedback from sensors and cameras into account, which lets them adjust their behavior when something goes wrong. We achieve this by collecting data: at certain moments a human takes over control and helps the robot, so that the next time the robot encounters a similar situation, it already knows how to handle it.

Training Data

The key to the success of neural networks is a large amount of high-quality data. Top LLMs are pre-trained on a good half of the internet, if not more, and then fine-tuned on specially labeled datasets that each team keeps in strict secrecy. Data is the fuel without which nothing works.

With data for robots, things are even harder. Only recently did an open dataset of several million episodes appear, and it mixes many robots with different sensors and cameras performing completely different tasks. In practice, you need data for a specific robot and a specific task. For example, our robotic arm can accidentally pick up several objects at once; no data existed for that problem, so we had to build a separate model.

Yandex Robotic Arm

The Yandex Market robotics department has been working on the robotic arm for several years. We use an off-the-shelf collaborative robot (cobot): for our purposes, that is quite enough. Our main goal is to build efficient algorithms and ML models capable of solving complex tasks of fine motor control and interaction with the environment.


The robotic arm with artificial intelligence, developed at Yandex, performs precise manipulations with objects.

Data: the basis of our approach

What makes our solution special is that the robot perceives its environment on its own, determines which object it needs to take, and chooses how to grasp it.

Data collection plays a key role in our project. We built our own VR-based teleoperation system that lets us collect training data quickly and conveniently. With it, operators control the robot in various scenarios (for example, grasping objects) and gather high-quality data on robot-object interactions.
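To make this concrete, here is a rough sketch of what logging one teleoperation episode could look like: at every step we record the synchronized observations together with the human operator's command, which later serves as the training label. All field and method names here are illustrative, not our actual schema.

```python
# A sketch of recording one teleoperation episode (illustrative interfaces).
import json
import time

def record_episode(robot, vr_controller, cameras, path):
    """Log synchronized (observation, action) pairs while a human teleoperates."""
    steps = []
    while not vr_controller.episode_done():
        obs = {
            # save_frame() is assumed to write a frame to disk and return its path
            "images": {cam.name: cam.save_frame() for cam in cameras},
            "joint_positions": robot.joint_positions(),
            "gripper_state": robot.gripper_state(),
            "timestamp": time.time(),
        }
        action = vr_controller.read_action()   # the human's command is the label
        robot.apply(action)
        steps.append({"obs": obs, "action": action})
    with open(path, "w") as f:
        json.dump(steps, f)
```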

In the video, we are assembling Lego using our teleoperation system:

We also use the DAgger (Dataset Aggregation) algorithm, which lets us iteratively improve data quality. In each cycle, the robot performs a task using the current ML model; when it makes a mistake, the operator takes over and corrects the actions, showing how to do it right. This gives us reference trajectories, and a model trained on them learns to correct its own mistakes.
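In condensed form, the loop looks roughly like this. The interfaces are illustrative; in the variant described above, the operator steps in only when the robot goes wrong:

```python
# A condensed DAgger-style loop (illustrative interfaces, not our exact code).

def dagger(policy, operator, env, train, n_iterations=10):
    dataset = []
    for _ in range(n_iterations):
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)                     # the current model acts
            if operator.sees_mistake(obs, action):   # human decides to intervene
                action = operator.take_over(obs)     # corrective expert action
            dataset.append((obs, action))            # aggregate into one dataset
            obs, done = env.step(action)
        policy = train(policy, dataset)              # retrain on all data so far
    return policy
```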

Another simple but important principle of data collection is to complicate the task gradually rather than trying to solve everything at once. Our robot first learned to pick up a plastic lemon lying at the bottom of a basket. Then we added more plastic fruit and other items: markers, pens, bottles, toy figures, and objects of various shapes and sizes. Eventually the robot learned to grasp objects it had never seen before, and even to move items around to make them easier to pick up.

We are currently working with the real assortment of Yandex Market and facing various challenges: how to pick up heavy, slippery, or fragile items, and items packed in transparent bags.

Data is a key ingredient in building a robot. It can partially compensate for shortcomings in hardware and sensors, because it gives models more information for training and adaptation. The more data we collect, the more accurate and efficient the models become. As a result, whoever can collect and process more data will gain a significant advantage in developing and deploying advanced robots.

Training Pipeline

Our training pipeline consists of two key stages: Imitation Learning and fine-tuning using Reinforcement Learning.

In the first stage, the robot is trained in a supervised fashion, simply repeating human movement trajectories, and so masters basic object-manipulation skills. We say that this stage trains the robot's "spinal cord": it learns the dynamics of the world and how to interact with it.
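A minimal behavior-cloning step might look like the sketch below, assuming the action is predicted as discretized tokens (see the Model Architecture section); the batch layout and input names are assumptions for illustration:

```python
# A minimal behavior-cloning step; tensor shapes and inputs are assumptions.
import torch.nn.functional as F

def bc_step(model, optimizer, batch):
    """One supervised step: predict the operator's action tokens from observations."""
    logits = model(batch["images"], batch["robot_state"], batch["prev_action"])
    # logits: (batch, action_dims, n_bins); targets: (batch, action_dims), dtype long
    loss = F.cross_entropy(
        logits.flatten(0, 1),             # (batch * action_dims, n_bins)
        batch["action_tokens"].flatten(),  # (batch * action_dims,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```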

After that, the robot improves its skills on its own through Reinforcement Learning: it performs tasks independently, and the model receives feedback in the form of rewards for successful actions. The reward can be assigned automatically or given manually by the operator. We use PPO as the algorithm; the hardest part is finding good hyperparameters and achieving stability and convergence with the small number of trajectories that can be executed on real robots.
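For reference, the heart of PPO is the clipped surrogate objective, which penalizes updates that move the policy too far from the one that collected the data. This is the standard formulation, not our exact code:

```python
# The standard PPO clipped surrogate objective.
import torch

def ppo_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate loss: discourages policy updates that move too far."""
    ratio = torch.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```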

Model Architecture

We took RT-1 as the base architecture, but it has undergone significant changes. We use a ViT as the encoder for images and depth maps. Besides images, the robot's internal state and the last executed action are also fed as input. All real-valued inputs and outputs are discretized.
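Discretization can be as simple as uniform binning over a known value range; the bin count below is illustrative:

```python
# Uniform binning of continuous values into tokens; N_BINS is illustrative.
import numpy as np

N_BINS = 256

def to_tokens(values, low, high):
    """Map real values in [low, high] to integer bins 0..N_BINS-1."""
    clipped = np.clip(values, low, high)
    return ((clipped - low) / (high - low) * (N_BINS - 1)).round().astype(np.int64)

def from_tokens(tokens, low, high):
    """Map bins back to their representative real values."""
    return low + (tokens.astype(np.float64) / (N_BINS - 1)) * (high - low)
```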

We currently use a large transformer that processes all incoming tokens. We are experimenting with model sizes and can already train models comparable in size to LLMs on our data. We have also optimized model inference on the robot: despite the large model size, we can predict the robot's actions at 25-30 Hz.
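A fixed-rate inference loop at this frequency might be organized like the sketch below (the interfaces are hypothetical); the key constraint is that prediction must fit within the roughly 33 ms budget per tick:

```python
# A sketch of a fixed-rate control loop at ~30 Hz (hypothetical interfaces).
import time

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ

def control_loop(model, robot, cameras):
    while True:
        t0 = time.monotonic()
        obs = {cam.name: cam.capture() for cam in cameras}
        action = model.predict(obs, robot.state())  # must finish within ~33 ms
        robot.apply(action)
        # sleep off the remainder of the period to hold a steady control rate
        time.sleep(max(0.0, PERIOD - (time.monotonic() - t0)))
```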

During the project, we gained many insights. I would highlight two of them:

  • Data is more important than architecture. A large amount of quality data is significantly more important than a perfect model architecture. Even if the architecture is not ideal, with an extensive and high-quality dataset, you will achieve better results.

  • If you have a lot of data, try making the model bigger. At some point, adding data stops improving quality; often this means the model size needs to grow. We have hit this limit several times, and each time a larger model produced a leap in performance.

What our robot can do

Today the robotic arm can grasp almost any object that physically fits the gripper, even heavy or wide ones. We pay special attention to gentle-grip algorithms, so that the robot can manipulate objects without damaging them.

In the video, the robot first takes a fragile and soft object, and then a heavy bottle.

Our robot can accidentally pick up several objects at once instead of one, and these multipicks became a real challenge. Surprisingly, there is very little information on the topic online; at least, I could not find anyone who had solved a similar problem. We had to build a separate model that detects multipicks from several cameras. We keep refining this task and collecting new data for it.
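One possible shape for such a detector is a binary classifier over features from several camera views; the architecture below is an assumption for illustration, not our production model:

```python
# An illustrative multipick detector: binary classification over multiple views.
import torch
import torch.nn as nn

class MultipickDetector(nn.Module):
    def __init__(self, backbone, feat_dim=512, n_cameras=3):
        super().__init__()
        self.backbone = backbone                 # shared image encoder
        self.head = nn.Sequential(
            nn.Linear(feat_dim * n_cameras, 256),
            nn.ReLU(),
            nn.Linear(256, 1),                   # logit for "more than one object"
        )

    def forward(self, views):                    # views: list of (B, C, H, W) tensors
        feats = [self.backbone(v) for v in views]
        return torch.sigmoid(self.head(torch.cat(feats, dim=-1)))
```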

We are now launching the first projects to test our developments in real conditions at the Market warehouse, for example, at the sorting stage. This will help us better understand the range of products robots need to learn to handle and find out how robots and humans can interact effectively. It is a serious challenge, but we are confident of success.


The development of AI-powered robots capable of complex physical manipulations is one of the most exciting and challenging tasks at the intersection of modern engineering and science. Our robotic arm project shows that with modern algorithms and approaches, such as Imitation Learning and Reinforcement Learning, significant progress can be made in teaching robots complex manipulations.

The use of improved transformer architectures, the collection of specialized data, and the gradual increase in the functionality of robots allow us to confidently move towards creating universal assistants. In the future, they may be able to effectively perform routine and complex tasks, freeing up time for humans for creative and intellectual activities.
