How we at Yandex make a robotic arm with artificial intelligence

12:52
22.10.2024
dimakl

Ten to twenty years ago, many people thought that robots controlled by artificial intelligence would take over all the heavy and dangerous work in industry. Instead, neural networks have found applications in offices, call centers, and support services, and have even become useful to people in creative professions: copywriters, designers, programmers. Meanwhile, creating robots that can independently perform complex physical manipulations of real objects remains a difficult, unsolved problem.

In this article, I will explain how the ML R&D team in the Yandex Market robotics department is building a robotic arm and training the neural networks that let the robot interact with the physical world.

Introduction, or Why robots are still not walking on our streets

Hardware Quality

Creating hardware for robots is not an easy task, and sensors are a particular problem. Modern RGB cameras provide decent image quality, but depth cameras and touch sensors remain a challenge. The human hand, which senses the slightest changes in force and position and responds to them instantly, is an ideal model, but replicating this in robots is difficult. Robots need equally sensitive sensors to reach a level of precision comparable to that of a human hand.

We in the robotics department do not design hardware, but we do work on the problem of the robot's "strength": we want to teach it to understand what a particular object is made of. This way, the robot will be able to apply just enough force to grasp an object without damaging it. This requires algorithms and suitably capable hardware to work together.

World Dynamics

The world we live in has nonlinear dynamics: small changes in conditions or actions can lead to large, disproportionate changes in the result. Robots therefore need high precision and the ability to adapt in real time to any change, and this is exactly what makes motion control so difficult.

The old approach to building robots, based on trajectory planning and classical algorithms for computing the dynamics of the world, is not always applicable in practice. In tasks that involve a lot of tactile interaction with the physical world, it stops working because it is impossible to calculate every nuance of the interaction in advance. Even the smartest robot cannot account for all the irregularities of the object it is grabbing and plan the trajectory perfectly.

Our robots constantly take feedback from sensors and cameras into account, allowing them to adjust their behavior if something goes wrong. We achieve this effect by collecting data: at certain moments, a human takes over control and helps the robot. Thus, when the robot encounters a similar situation again, it will already know how to handle it.
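The idea of a human taking over and the robot learning from it can be sketched as a simple loop. This is only an illustration under assumptions, not our production code: `run_with_interventions`, `toy_policy`, `toy_operator`, and the string observations and actions are all hypothetical names invented for the example.

```python
def run_with_interventions(policy, operator, observations):
    """Run one episode and collect the moments where a human took over.

    policy(obs) returns the action the robot proposes.
    operator(obs, proposed) returns a corrected action, or None if the
    human does not intervene. Both are hypothetical callables.
    """
    executed = []      # actions that were actually carried out
    corrections = []   # (observation, human action) pairs for later training
    for obs in observations:
        proposed = policy(obs)
        override = operator(obs, proposed)
        if override is None:
            executed.append(proposed)
        else:
            executed.append(override)
            corrections.append((obs, override))
    return executed, corrections


# Toy stand-ins: the policy always closes the gripper, and the operator
# steps in only when the object is slipping.
def toy_policy(obs):
    return "close_gripper"


def toy_operator(obs, proposed):
    return "reposition" if obs == "object_slipping" else None


executed, corrections = run_with_interventions(
    toy_policy,
    toy_operator,
    ["object_centered", "object_slipping", "object_centered"],
)
```

The `corrections` list is what matters here: it holds exactly the situations the robot could not handle, paired with what a human did instead, and it can be added to the training set.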

Training Data

The key to the success of neural networks is a large amount of high-quality data. Top LLMs are pre-trained on a good half of the internet, if not more, and are then further trained on specially labeled datasets that each team keeps strictly secret. Data is the fuel without which nothing works.

With data for robots, it is even harder. An open dataset of several million episodes appeared only recently, and it mixes many robots with different sensors and cameras and completely different tasks. In practice, you need data for a specific robot and a specific task. For example, our robotic arm can pick up several objects at once. There was no data for this problem, so we had to create a separate model.

Yandex Robotic Arm

The Market robotics department has been working on the robotic arm for several years. We use a simple stock collaborative robot (cobot), which is quite enough for our purposes. Our main goal is to create efficient algorithms and ML models capable of solving complex tasks that require fine motor skills and interaction with the environment.

The robotic arm with artificial intelligence, developed at Yandex, performs precise manipulations with objects.

Data: the basis of our approach

What sets our solution apart is that the robot perceives the environment on its own, decides which object to pick up, and chooses how to grasp it.

Training Pipeline

Our training pipeline consists of two key stages: Imitation Learning and fine-tuning using Reinforcement Learning.

At the first stage, the robot is trained in supervised mode, simply repeating human movement trajectories. This way, it masters basic object manipulation skills. We say that this stage trains the robot's "spinal cord": it learns the dynamics of the world and how to interact with it.
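In its simplest form, this supervised stage is behavior cloning: fit a policy that predicts the demonstrated action from the observation. The sketch below is a toy under assumptions, not our model: it uses a linear softmax policy, two discrete actions, and a made-up one-dimensional task, and every name in it (`train_behavior_cloning`, `act`, `demos`) is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_behavior_cloning(demos, num_features, num_actions, lr=0.5, epochs=200):
    """Fit a linear softmax policy to (observation, expert_action) pairs
    by stochastic gradient descent on the cross-entropy loss."""
    weights = [[0.0] * num_features for _ in range(num_actions)]
    for _ in range(epochs):
        for obs, expert_action in demos:
            logits = [sum(w * x for w, x in zip(row, obs)) for row in weights]
            probs = softmax(logits)
            for a in range(num_actions):
                # Cross-entropy gradient w.r.t. logit a: prob minus one-hot target.
                grad = probs[a] - (1.0 if a == expert_action else 0.0)
                for j in range(num_features):
                    weights[a][j] -= lr * grad * obs[j]
    return weights


def act(weights, obs):
    """Pick the action with the highest logit."""
    logits = [sum(w * x for w, x in zip(row, obs)) for row in weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy demonstrations: observation = [horizontal offset to the object, bias term].
# The "human" moves left (action 0) for negative offsets, right (action 1) otherwise.
demos = [
    ([-1.0, 1.0], 0),
    ([-0.5, 1.0], 0),
    ([0.5, 1.0], 1),
    ([1.0, 1.0], 1),
]
weights = train_behavior_cloning(demos, num_features=2, num_actions=2)
```

After training, `act(weights, [-0.8, 1.0])` picks the "move left" action and `act(weights, [0.8, 1.0])` picks "move right", even though neither offset appeared in the demonstrations.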

After that, the robot improves its skills on its own through Reinforcement Learning. At this stage, it performs tasks by itself, and its model receives feedback in the form of rewards for successful actions. The reward can be computed automatically or assigned manually by an operator. We use PPO as the algorithm, but the hardest part is finding good hyperparameters and achieving stability and convergence with the small number of trajectories that can be collected on real robots.
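For readers unfamiliar with PPO, its core is the clipped surrogate objective from the original PPO paper. The sketch below shows only that objective in plain Python; it is not our training code. The function name and the default `clip_eps=0.2` are assumptions chosen for illustration, and a real setup would also need advantage estimation, a value loss, and an optimizer.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Average PPO clipped surrogate objective (to be maximized).

    new_log_probs: log-probabilities of the taken actions under the current policy
    old_log_probs: the same under the policy that collected the data
    advantages:    advantage estimates for those actions
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the policy
        # far from the data-collecting policy in a single update.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

The clipping is one reason PPO is a natural fit when trajectories are scarce: each batch of robot data can be reused for several gradient steps without the policy drifting too far from the one that collected it.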

Model Architecture

We took RT-1 as the base architecture, but have changed it significantly. We use ViT as the encoder for images and depth maps. In addition to images, the robot's internal state and its last performed action are also fed as input. All real numbers at the input and output are discretized.
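Discretization here means turning each continuous value into an integer token that a transformer can consume or predict. A minimal sketch using uniform bins is shown below. The bin count of 256, the value range, and the function names are assumptions for illustration; the article does not specify the scheme we actually use.

```python
def discretize(value, low, high, num_bins=256):
    """Map a real value to an integer token in [0, num_bins - 1].

    Values outside [low, high] are clipped to the range first.
    """
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # fraction == 1.0 would give num_bins, so cap it at the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(token, low, high, num_bins=256):
    """Map a token back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width


# Example: a joint velocity command of 0.3 in a hypothetical [-1, 1] range.
token = discretize(0.3, -1.0, 1.0)
recovered = undiscretize(token, -1.0, 1.0)
```

The round trip loses at most half a bin width, so with 256 bins over a range of 2.0 the error stays below 0.004. The payoff is that action prediction becomes a classification problem over tokens, the same kind of problem a language model solves.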

We are currently using a large transformer that processes all incoming tokens. We are experimenting with sizes and can already train models comparable in size to LLMs on our data. We have also optimized model inference on the robot: despite the large model sizes, it can predict the robot's actions at 25-30 Hz, which leaves roughly 33-40 ms per prediction.

During the project, we gained many insights. I would highlight two of them:

Data is more important than architecture. A large amount of quality data matters significantly more than a perfect model architecture. Even with an imperfect architecture, an extensive, high-quality dataset will give you better results.
If you have a lot of data, try making the model bigger. At some point, adding data stops improving quality. Often this means the model is too small. We have hit this limit several times, and each time a larger model gave us a jump in performance.

What our robot can do

Currently, the robotic arm can grasp almost any object, including heavy or wide ones, as long as the gripper can physically hold it. We pay special attention to developing algorithms for gentle gripping so that the robot can manipulate objects without damaging them.
