Goodbye, programmer? AI already writes code better than you

Dmitry Rozhkov, manager of the Kubernetes services team and creator of the popular YouTube channel Senior Software Vlogger, shared his experience testing AI assistants for programming. He talked about whether neural networks can replace programmers, what pitfalls await when working with AI assistants, and why we still do not see a boom in new applications created with the help of artificial intelligence.

We asked Dmitry Rozhkov to talk about the future of programming after his video with testing AI programmers. Dmitry's experiment sparked heated discussions in the IT community and raised many questions about the role of artificial intelligence in software development. We decided to delve into the topic and find out firsthand how real the threat of replacing a human programmer with artificial intelligence is today.

The full version of the interview, organized by our company Artezio, can be viewed on the Ai4Dev channel on YouTube. We also run a Telegram channel for developers who use AI, where you can exchange opinions and real-world cases.

“Expecting a neural network to write everything correctly on the first try is a big mistake”

Dmitry, tell us why you decided to test these AI tools and how you chose the products for testing?

I work as a manager, leading a team of 15 people. Although I haven't programmed at work for a long time, I still write code for personal projects, including media projects, blogs, and Telegram bots.

AI agents came into my area of interest partly because of my media activities, and also because I still consider myself a developer. The choice of tools for testing started with the high-profile Devin project. However, since Devin is in closed beta, I decided to explore other available tools on the market, preferably open source or subscription-based.

First, I tested Devika, which appeared shortly after the announcement of Devin. Devika provides a development environment with a chat for communicating with the agent, a virtual internet browser, a console, and a virtual IDE for writing code.

Then I tried Cursor, an IDE with artificial intelligence features. Its interface rethinks the traditional IDE, letting users write instructions to the agent instead of writing code directly.

Next was Aider - a terminal tool that, in my opinion, fits most organically into the current workflow of programmers.

Finally, I tested the tool from Replit. They provide hosting and have developed an agent that allows you to write code directly on their service, deploy it immediately, and pay for hosting. This is part of their strategy to attract more users and democratize development.

When choosing tools for testing, I focused on those that were most discussed in the developer community. Of course, I couldn't cover all the available options. For example, I tried OpenHands, but it didn't work at all in my case.

It is interesting to note how quickly this area is developing. Literally the day after my video was released, OpenAI released a new model, o1. And just recently, I wrote a post about the emergence of two more tools working in the same direction. Progress in this area is measured in weeks, making any review or test quickly outdated. In essence, my testing video became obsolete almost immediately after publication.

The essence of my experiment was that the project had to be created completely from scratch. There was an empty folder without any code or templates. I wanted to write one prompt, as if it were a task from a freelance site. Theoretically, such a task could indeed be received on freelance.

I told myself, let's imagine that I am a blogger with a large amount of video materials. I need a service that would accept video or audio files, transcribe them into text, then compress this text and create YouTube headings in a specific format. The task seems quite simple and well-formalized. Moreover, I even indicated which services should be used, since I already use them myself and have my own code for these tasks.

During testing, I encountered a number of problems. Some agents, for example, started writing code in Python, although I explicitly asked to use TypeScript. It is worth noting that Cursor and Aider correctly used TypeScript, as requested. However, I must say that none of the projects created by various agents initially started and were fully ready for use, regardless of the programming language used or the quality of the code itself.

Since some agents started writing in Python, I had interesting conversations with them. For example: "Hey, can you run a file with the .py extension as TypeScript?" "No, that extension is for Python." I say: "Yes, but I asked you to write in TypeScript." Then they apologized and rewrote everything in TypeScript, though again with uneven quality.

The second thing I discovered: they write code, but with environments like Node.js and TypeScript, it takes a fair amount of scaffolding to get TypeScript running on Node.js. You need to configure the TypeScript compiler itself, and none of the AI agents did this. They honestly wrote code that simply could not be run. To run it, you had to do some more work and set everything up, and that didn't always succeed.

And that's the problem. On the one hand they write code; on the other hand, they write only code. In some environments that is enough: in Python, for example, you have code and dependencies, you install the dependencies from requirements.txt, and everything works. With TypeScript, there are many different setup options.

Plus, probably the biggest mistake is to expect the neural network to write everything correctly in one attempt. In my projects, I noticed that a neural network copes much better when there is already some code: when the project has dependencies it can refer to, look at, and follow by example. In this regard, perhaps my experiment was not constructed quite correctly. But since I mainly use JavaScript and TypeScript, I was interested in testing the network's capabilities with my languages.

Why don't these agents write quality code right away?

I did not delve specifically into how the agents break down the task and how they execute it. It is important to note that not all agents initially wrote code in Python. However, I assume that Python may be a preferred language for them, and this is often demonstrated in examples. My assumption is based on the fact that there should be more examples of Python code in the training set, especially considering its popularity on platforms like GitHub.

Interestingly, I even wrote a post on my Telegram channel about the fact that languages that we traditionally do not consider "beautiful", such as PHP, may become more in demand for AI precisely because of the large amount of existing code. The more code written in a language, the more material for training the neural network, which can potentially increase the AI's efficiency when working with this language.

This observation emphasizes that the effectiveness of AI in programming may depend not so much on our subjective preferences in languages, but on the amount of data available for training.

“Reproducing the flexibility and efficiency of human thinking in the context of programming has not yet been achieved”

Recently, we have observed a trend of integrating code execution environments directly into AI model interfaces. Claude has implemented this through "artifacts", and OpenAI is experimenting with similar solutions. How do you assess the impact of these built-in sandboxes on the efficiency of AI in development? How much does the ability to immediately execute and test generated code bring us closer to creating a full-fledged AI assistant for programmers?

These thoughts largely reflect what I talked about in my two videos on artificial intelligence in programming. The integration of code execution environments into AI models is indeed very important. It seems to me that these AI agents, which are now being actively developed, will help us better understand the programming process itself. At first glance, it seems that a person is just writing code, but in fact, this process includes many other aspects.

First, the code needs to be run somewhere. Consequently, the AI agent must have software interfaces for working with the code and the execution platform. This is why the creation of sandboxes is becoming an important direction. For example, Replit is engaged in this, providing a sandbox directly on their website. There is also Canvas from OpenAI, which I recently tested, and StackBlitz with their new service bolt.new. All these platforms are moving in a similar direction. There is also Aider, which works directly from the terminal and can execute code on a local machine, although that may raise security issues.

In fact, the programming process includes many such moments. We read documentation, search for information on Google, study APIs, and then try to adapt this knowledge to our specific task. I am not saying that a neural network cannot do this, but an AI agent needs to be constructed in a special way.

It is necessary to create an agent so that it can consider the entire process step by step, keeping all relevant information in its context. In this context, it should hold the current task, all written code, and relevant documentation. At the same time, the agent should be able to choose from the documentation what needs to be kept in context and what can be omitted.

It is important to understand that all these large language models (LLMs) working with prompts actually operate on one large prompt, not a series of separate messages, as it may seem in the chat. Each new request or clarification is simply appended to the existing prompt. Because of this, the longer we talk to the neural network while writing a program, the more it can lose the context of what was said at the beginning of the conversation. This creates certain problems that need to be considered when developing AI assistants for programming.
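The mechanics described above can be sketched in a few lines of TypeScript. This is an illustration, not any real API: the `Message` type and the character-based window are simplifications (real models count tokens, not characters), but the effect is the same, namely that old context is the first to be lost.

```typescript
// Sketch: a "chat" is really one growing prompt. Each turn is appended,
// and the whole history is re-sent on every request.
type Message = { role: "system" | "user" | "assistant"; content: string };

function buildPrompt(history: Message[]): string {
  // Every request serializes the entire history into one block of text.
  return history.map((m) => `${m.role}: ${m.content}`).join("\n");
}

function truncateToWindow(prompt: string, contextWindow: number): string {
  // When the prompt outgrows the window, the oldest text goes first,
  // which is why long sessions "forget" the start of the conversation.
  return prompt.length <= contextWindow
    ? prompt
    : prompt.slice(prompt.length - contextWindow);
}

const history: Message[] = [
  { role: "system", content: "You are a TypeScript assistant." },
  { role: "user", content: "Write a transcription service." },
  { role: "assistant", content: "Here is a first draft..." },
  { role: "user", content: "Now add YouTube chapter output." },
];

console.log(truncateToWindow(buildPrompt(history), 120));
```

With a small window, the system instruction at the top is exactly what gets cut off, mirroring the problem described above.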

In light of these limitations, a more effective strategy seems to be the development of an initial prompt capable of immediately generating high-quality code, and then iteratively working with this result.

Returning to the issue of sandboxes and integration of execution environments, it is important to note that companies working on AI for programming strive to create comprehensive systems. These systems allow neural network agents to interact with various development components. We are talking about platforms for running code and handling errors, accessing the internet for search queries and gathering information from documentation and other sources. In addition, vector databases are needed for efficient search and retrieval of relevant information, such as RAG search, as well as tools for creating and managing a code repository map.

From the point of view of human activity, we, programmers, do not keep the entire context of the project in our heads, especially when it comes to millions of lines of code. Instead, we are able to effectively switch between different contexts using certain "anchors" or landmarks in the code.

It is this ability to flexibly manage context that is now being attempted to recreate in AI agents. However, so far it has not been possible to fully reproduce such flexibility and efficiency of human thinking in the context of programming. This remains one of the key challenges in the development of AI for software development.

During your testing of various AI agents, which ones showed the best results in solving the task? What factors, in your opinion, determined their superiority? Did you notice a significant difference in the quality of code generated by different systems?

The most impressive results were shown by Cursor Composer, based on the Claude 3.5 model from Anthropic. I want to note that I did not test Anthropic's Artifacts separately, but used an integrated solution within my Cursor subscription.

However, it is important to emphasize that technologies in this area are developing rapidly. Already after the completion of the main experiment, I discovered that the bolt.new service provides even better results. This tool created well-structured code, breaking my task into logical modules with a clear architecture. It created an index file as an entry point and divided the functionality into separate modules, which significantly improves the readability and maintainability of the code.

For comparison, Canvas from OpenAI generated code in the form of one large block, which resembles my own style of writing drafts or prototypes. Although this reflects a common approach to rapid prototyping, I expected a more structured and optimized solution from AI.

This experience confirms the reputation of the Claude model as one of the most advanced for programming tasks at the time of the test. However, it also demonstrates how quickly this area is developing, and how new tools can offer even more advanced solutions in a short period of time.

The key factor in evaluating the quality of the generated code for me was its structure, modularity, and adherence to modern clean code practices. Tools that were able to not only solve the problem but do it elegantly and professionally left the most positive impression.

"Garbage in, garbage out"

Let's talk about the complexity of interacting with AI models in the context of programming. How would you characterize the process of formulating a task for AI? How iterative is it usually, and what level of detail in the prompt do you consider optimal for obtaining a quality result? Are there any features or techniques of prompt engineering that you find particularly effective when working with AI in the field of software development?

One of the key problems in working with neural networks is the quality of the input data. People who have worked a lot with language models (LLMs) come to the conclusion: "garbage in, garbage out." In other words, the quality of the answer directly depends on the quality of the query.

Take, for example, neural networks for generating images. It would seem simple: write "a beautiful kitten at sunset" and you will get the corresponding picture. But if you need a specific image that you have in mind, difficulties arise. How to describe it so that you get exactly what you want?

The situation is similar with programming. A neural network can easily handle the request "make a random web page." But if you need something specific that can be verified, then problems begin.

Interestingly, none of the tested models suggested writing tests for the code. This is surprising, considering that testing is a widely recognized best practice in development, helping to verify the program.

As for the detail of the prompt, I tend to think that it should be in human language and not too detailed. Otherwise, writing a prompt for a simple task can take as much time as writing the code itself. I have seen examples where people use 200-line prompts, detailing the structure of the database and ORM in TypeScript. But this already looks like you wrote half of the code yourself and then say that the neural network did it.

I believe that the task for AI should be formulated in much the same way as a technical task from a competent manager for a programmer.

The number of iterations when working with AI depends on the user. I have come to the conclusion that the target audience for these tools is people who understand the basics of programming. They should be able to read code, know about variables, loops, dependencies, be able to run and test code. Essentially, this should be at least a competent tester who understands internal processes, rather than just working on the "black box" principle.

In addition, working with AI requires time and patience. You need to be able to improve the initial prompt, try different approaches. In my experiments, each neural network took from half an hour to an hour. At some point, I just stopped, not wanting to spend the whole day getting fully working code from one neural network.

Are you saying that a non-programmer cannot create a program on their own? But an LLM can report errors back, and the neural network will keep refining the code until it reaches a working state. You don't need to be a programmer to see whether a program works or not.

The question of whether a person without programming skills can create a program independently with the help of an LLM is quite complex. On the one hand, modern language models have built-in error-handling mechanisms and can iteratively improve code. Theoretically, even a non-programmer can determine whether a program works or not.

I would not categorically state that this is impossible. Surely there are examples where people without programming experience have successfully created simple programs using LLM. After all, neural networks are capable of generating code even from a single prompt. However, the result largely depends on the complexity of the project and the number of "degrees of freedom" in the task.

For example, in my test, there were several interconnected tasks: uploading a file to remote storage, converting it to text, compressing the text, creating chapters for YouTube, and publishing a blog post. This is already quite a complex chain of actions.
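To make those "degrees of freedom" concrete, here is a sketch of that chain with stub implementations. Every function body is a placeholder: real versions would call external services (remote storage, speech-to-text, an LLM, a blog platform), and the URL and helper names are invented for illustration.

```typescript
async function uploadToStorage(file: string): Promise<string> {
  // Stub: a real version would upload to remote storage and return its URL.
  return `https://storage.example.com/${file}`;
}

async function transcribe(url: string): Promise<string> {
  // Stub: a real version would call a speech-to-text service.
  return `transcript of ${url}`;
}

async function summarize(text: string): Promise<string> {
  // Stub: a real version would ask an LLM to compress the transcript.
  return text.slice(0, 40);
}

function makeChapters(transcript: string): string[] {
  // Stub: a real version would derive timestamped YouTube chapters.
  return ["00:00 Intro", "01:30 Main topic"];
}

async function publishPost(summary: string, chapters: string[]): Promise<string> {
  // Stub: a real version would post to a blog platform.
  return `published: ${summary} (${chapters.length} chapters)`;
}

async function run(file: string): Promise<string> {
  // Each step alone is simple; the difficulty is wiring them into one
  // working system with credentials, error handling, and deployment.
  const url = await uploadToStorage(file);
  const transcript = await transcribe(url);
  const summary = await summarize(transcript);
  const chapters = makeChapters(transcript);
  return publishPost(summary, chapters);
}

run("episode-01.mp4").then(console.log);
```

Each stub on its own is the kind of isolated task the agents handled well; it is the end-to-end pipeline that tripped them up.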

On the other hand, there are examples where people create simple web interfaces that call one or two APIs using Python - a language well known to neural networks. In such cases, you can get a working result in 30-40 minutes of working with various prompts.

The key question here is the efficiency of time use. Should a manager spend a whole day communicating with a neural network instead of performing their direct duties? If the value of the created product exceeds the potential value of other tasks of the manager (for example, communicating with clients or optimizing processes), then this approach may make sense. However, it is often more efficient to assign work with the neural network to a junior developer who already has basic programming knowledge. This will allow you to get the result faster and better.

In light of the development of AI technologies, how do you assess the future of the programmer profession? Many companies are investing in platforms where a client can order a website "in one click," and AI supposedly does all the work. Do you consider this a realistic scenario or more of a marketing ploy?

I tend to think that such a scenario is possible, but with some reservations. It is important to understand that for AI there is no fundamental difference between writing a complex algorithm used in interviews or simple code for saving data to a database. However, there are limitations that need to be considered. These are the quality of the input data and the AI training set, the limitations of the context window with which the neural network works, and the complexity of integrating various system components.

In my tests, I noticed that models cope better with simpler, isolated tasks, but begin to experience difficulties when it is necessary to create a system with a large number of interconnected components. For example, AI cannot simply take and run code on your computer - additional infrastructure and setup are needed for this.

Therefore, although the idea of a "one-click site" sounds attractive, implementing such a system requires solving many technical and organizational issues. It is not impossible, but it is not as simple as it may seem at first glance. The key challenge is not so much in writing code, but in creating a holistic, integrated system capable of working in real conditions.

How do you evaluate new approaches to AI code generation, in particular, systems like o1-preview? How much, in your opinion, can such innovations improve the quality of programming with AI assistants? What approaches are currently the most common in this area?

It seems to me that with o1-preview, they just packaged what some people, the so-called prompt engineers, were doing on their own. One of my acquaintances joked that the profession of prompt engineer didn't even have time to appear before it was automated.

The thing is, it really improves the quality of the output. I watched an interview with Maxim Strakhov on "Podlodka", I think, where he explains in great detail how LLM works. In particular, he talks about one of his cases where he asked the neural network to generate a haiku, a Japanese poem. And with the second request, he asked if it looked like a haiku. That is, he asked the same neural network without prior context to verify the result. And this second request significantly improved the output.

It turns out that if the answer is "no, it doesn't look like a haiku," then we start the cycle again and, say, try this cycle 10 times. If after 10 attempts the haiku has not been generated, we fail with an error or go on to the next step. This cycle of generation, verification, and possible repetition really does dramatically improve the quality of the output.
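The cycle just described can be sketched as a small generic loop. Here `generate` and `verify` are stand-ins for real model calls (in the haiku example, both would go to the same LLM), and the retry limit of 10 matches the description above.

```typescript
// Generate-then-verify loop: produce a candidate, check it, retry up to a limit.
function generateVerifyLoop(
  generate: (attempt: number) => string,
  verify: (candidate: string) => boolean,
  maxAttempts = 10,
): string {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(attempt);
    if (verify(candidate)) return candidate; // accepted: stop the cycle
  }
  // After maxAttempts failures: fail loudly (or hand off to a next step).
  throw new Error(`no acceptable result after ${maxAttempts} attempts`);
}

// Toy usage: "generation" improves with each attempt, the verifier checks length.
const result = generateVerifyLoop(
  (n) => "x".repeat(n),
  (s) => s.length >= 3,
);
console.log(result); // "xxx", accepted on the third attempt
```

The loop itself is trivial; the hard part, as the next paragraph notes, is building a `verify` step for programs whose correctness cannot be checked by string inspection.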

The only difference, probably, is that we need to learn to verify the result of the written program again. If we are talking about a program that simply adds up some numbers, it is probably easy to verify. If the program controls a drone, then we need to somehow confirm that it does so correctly. Accordingly, some kind of emulator is needed, and the data from that emulator must be automatically analyzable, so we can tell that everything is happening correctly. A person can simply see with their own eyes that the drone really took off, flew two meters, and landed. A neural network cannot do this yet.

That is, it already needs to connect the body, hands, eyes, some kind of physical interface, to move in the direction of connecting artificial intelligence to the body, and then the results will improve.

“When programmers say that AI cannot yet write complex systems, it's a defensive reaction”

How complex can projects built with an AI assistant be? The developers we talked to believe only simple ones are feasible: some kind of website, a bot, and that's it. But something complex, some kind of multimodal system, is almost impossible. That is, if we are talking about, say, a bank, then AI cannot be applied there comprehensively. You can write individual pieces of code, but you cannot implement the whole thing end to end. What do you think about this?

I think this is about the same problem. It's not about the complexity of the algorithms, because basically the neural network doesn't care what complexity of code it generates. It generates code that writes "Hello World" or implements the Union-Find algorithm equally quickly. It doesn't care.

But the problem, in particular with large systems, is that they usually use a large context. And when we write large systems, we usually involve many people. Including in order to load this huge context that we have in small pieces into different people. Accordingly, none of the people has the full context in their head. It just doesn't fit in the head.

We have, let's say, some architect who just draws UML diagrams and roughly understands how this system works. Then we have each individual brick, which represents a group of systems, breaking down into organizations. In these organizations, systems break down into teams. Teams write microservices. And then it all integrates together. And, accordingly, it all comes together from the bottom up. We have some metrics on dashboards. And based on these readings, we draw conclusions.

It seems to me that the problem is not that the neural network is not capable of writing this code. The problem is that we have not yet learned to overcome these, as they say in English, "air gaps" - gaps between systems. It is necessary to teach the neural network to somehow transmit information through these gaps.

Suppose we have technical documentation, a thick book on the development of a banking system. Someone has to read all this documentation. But even if he reads it all, he will not remember it all. Accordingly, this person will make a summary. A summary is essentially a text summarization. This problem is solved. That is, each individual small problem can be solved by a neural network. But linking them together is a problem. AI developers are currently working on solving it. They are trying to fill these gaps with something, some kind of "ether" that the neural network can also use.

It seems to me that development is moving in this direction. And the fact that programmers say that such complex systems cannot yet be written by a neural network seems to me to be a kind of defense mechanism.

Will the development of AI assistants and automatic code generation systems affect the profession of a programmer in the near future? Do you expect a significant reduction in the number of developers or a change in the structure of demand in the labor market?

It's a difficult question. Some people say that we will simply write more programs. On the other hand, as the situation in the US and European economies after the COVID bubble burst shows, when tens of thousands of programmers were laid off, it turned out that not so many programs were needed. Free venture money ran out and stopped flowing into all those mobile applications. Why should the arrival of neural networks suddenly increase the number of programmers, if tens of thousands were laid off even without them? That question needs to be answered first, and I probably don't have an answer to it.

It all depends on how much this network can reproduce itself. Theoretically, it is possible, probably, to come up with such a step that we will have a perfect black box that no one else can reproduce again, but it solves all our problems. Perhaps there will be a group of scientists of a hundred people who service it, and a specialized institute where we prepare a replacement for these scientists. That is, the training is no longer for hundreds of thousands of engineers a year, but for people at the PhD level who specifically go to work on this system.

If an error occurred somewhere, it goes back to the initial requirements. This is the moment I talked about in my first video about the perfect artificial programmer. What is the problem with live programmers? When new requirements come in, we have to somehow fit them into the already written system. But the artificial programmer does not need to fit them. He will simply rewrite the entire system from scratch perfectly with the new requirements.

“You need to take the time to figure it out”

And can't the LLM itself create its own programming language that will allow people without programming experience to successfully control the effectiveness of AI?

By and large, such attempts have been made with various languages, like Rust. The Rust language is famous for its error messages: everything is described in detail, what happened, where, and where to look. Theoretically, it seems to me that it is not the neural network that should make such a language for itself. This should be done by people whose task will be to say: "Okay, the error messages of this programming language should be written in plain English, without lines of code, without anything." But lines of code are still needed by people, to understand where to look and what to do. And that link will still have to be preserved somehow.

It seems to me that you will be able to verify whether the neural network has solved your problem correctly or not. You just need to formulate the request in such a way that it can be checked: if, say, we write a calculator, we can test it and confirm that it does the math correctly. If errors remain, that is no longer about verifying the program's correctness, but about debugging it.
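The calculator example can be made concrete: a non-programmer can verify the result as a black box by running a handful of arithmetic cases, without reading the code at all. The `evaluate` function here is a stand-in for whatever the neural network generated.

```typescript
// Stand-in for generated calculator code; only its behavior is checked below.
function evaluate(a: number, op: "+" | "-" | "*" | "/", b: number): number {
  switch (op) {
    case "+": return a + b;
    case "-": return a - b;
    case "*": return a * b;
    case "/": return a / b;
  }
}

// Black-box verification: known inputs against expected outputs.
const checks: Array<[number, "+" | "-" | "*" | "/", number, number]> = [
  [2, "+", 2, 4],
  [10, "-", 3, 7],
  [6, "*", 7, 42],
  [8, "/", 2, 4],
];

for (const [a, op, b, expected] of checks) {
  if (evaluate(a, op, b) !== expected) throw new Error(`failed: ${a} ${op} ${b}`);
}
console.log("all checks passed");
```

If a check fails, the user knows the program is wrong without understanding why; fixing it is the separate activity of debugging.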

What advice would you give to programmers, developers who want to try using AI in their work? Based on your test, what pitfalls should be considered first? And, in general, how applicable is all this today?

AI assistants are already quite applicable in the work of a programmer, but the key to their effective use is understanding their capabilities and limitations. My main advice is to invest time in studying these tools. Spend at least a day to deeply understand their functionality, understand their strengths and weaknesses. It is important to realize that AI is not a magical solution to all problems, but a powerful tool that requires skillful handling.

In programming, there is a concept of "leaky abstraction". It implies that for effective use of the tool, it is necessary to understand its internal structure. I recommend watching several interviews or reports where experts explain in detail how neural networks process requests. This knowledge will help you formulate more effective prompts and interact with AI more consciously.

Understanding the principles of AI assistants will allow you to achieve significantly higher quality results. You will be able to better anticipate which queries will give the desired output and how to interpret the received responses. As a result, this will not only increase your productivity but also help avoid common mistakes when using AI in programming.

As I said before, garbage in, garbage out. And if a person just comes to the neural network and writes: "Create a TikTok for me," and the neural network does something, but it doesn't work out, and the person says: "Haha, you can't write code" - this is a very arrogant position, I think. You need to take the time to figure it out.

Why don't we see new TikToks or programs created by neural networks today? It turned out that many non-programmers entered the market, but we do not see a boom of new ideas, new programs.

Because, as it turned out, programming is not the most difficult part. First, you need the very idea of a product that solves a real problem or meets the needs of users. Generating and validating such an idea is a separate skill that cannot be replaced by AI.

Secondly, even having an idea, it is necessary to validate it correctly. This requires understanding the market, user needs, and business processes. AI can help with data analysis, but the interpretation of results and decision-making remains with the person.

Finally, the most difficult part is marketing and attracting an audience. Creating an application with the help of AI is possible, but getting people to use it is a completely different task.

Therefore, if a hypothetical manager spends the whole day creating a landing page with the help of AI, that may be an inefficient use of his time. Instead, he should focus on validating the idea, communicating with potential customers, and developing a promotion strategy. The technical part can be delegated to a junior developer who, with the help of AI, can quickly create the necessary landing page.

Thus, the absence of a boom in new applications is not due to the limitations of AI in development, but to the complexity of the process of creating a successful product, where programming is just one of many components.
