Why Is AI Music Human, and New Without Being New?
The revolution has already happened: for most listeners, music created by AI is indistinguishable from live music. AI artists are signing contracts, NARAS is okay with the use of AI, and market leaders like Suno and Udio play by the big boys' rules. Perhaps we could end the debates here, but no: some insist that the soul is missing, while others claim that musical AI is developing in a false direction. I believe the situation can be clarified by turning to the very nature of music.
What is music? Expressive means
Of all the definitions, I prefer the formal one: "Music is sounds organized by humans in time and space."
The organization of sounds into a musical image that conveys emotion is achieved through expressive means (EM). Western European music theory (WEMT) recognizes melody, harmony, rhythm, and so on. However, it is poorly suited for analyzing rock, jazz, electronic music, and folk. It is better to rely on psychoacoustics (a synthesis of physics and psychology) and distinguish the following components:
Pitch-related — changes in the frequency of the fundamental tone, combinations of tones (intervals, scale, Pitch curve...);
Loudness-related — changes in amplitude (volume, accents, the envelope of an individual sound, phrases, periods, amplitude modulation...);
Rhythmic — temporal parameters (duration of sound, pauses, rhythmic figures, tempo and its variations);
Timbre-related — changes in the spectrum (frequency components, their periodic and non-periodic changes...);
Spatial — concerning the environment and positioning of sounds (left/right, closer/further, higher/lower, reflection/absorption...).
A musical element (melody, chord, etc.) and a piece as a whole are combinations of these components. In different cultures and genres, the sets of EM vary (essentially, these are different "musical languages"). Sometimes, a certain component may be completely absent.
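For readers who think in code, here is a minimal sketch of how a single sound could be described in terms of these five groups of EM. It is only an illustration: the field names and units are my own invention, not any standard, and real psychoacoustic models are far richer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SoundEvent:
    """One sound described by the five groups of expressive means (EM).
    Field names and units are illustrative only."""
    # Pitch-related: fundamental frequency and its curve over time
    pitch_hz: float = 440.0
    pitch_curve: List[Tuple[float, float]] = field(default_factory=list)  # (time s, offset in cents)
    # Loudness-related: overall level and amplitude envelope
    level_db: float = -12.0
    envelope: List[Tuple[float, float]] = field(default_factory=list)     # (time s, gain 0..1)
    # Rhythmic: placement and length in time
    onset_s: float = 0.0
    duration_s: float = 0.5
    # Timbre-related: a rough spectral description
    harmonic_amplitudes: List[float] = field(default_factory=list)
    # Spatial: position of the sound in space
    pan: float = 0.0        # -1 = left, +1 = right
    distance: float = 1.0   # arbitrary units
    reverb_amount: float = 0.2

# A piece is a combination of such events; a "musical language" restricts
# which parameters are actually varied (drum music, say, may ignore pitch_hz).
piece: List[SoundEvent] = [
    SoundEvent(pitch_hz=220.0, onset_s=0.0),
    SoundEvent(pitch_hz=330.0, onset_s=0.5, pan=-0.3),
]
```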
A representative of WEMT will not understand the musical language of an African musician who performs songs (wedding, funeral, battle, harvest, a prayer for rain) only on a drum, without singing. To put it bluntly: "Where is the music if there is no piano, no cello, no tonic, no dominant, no major/minor, i.e. nothing at all?" Likewise, a piece played on the khomus cannot be described within this theory. Even rock music was not considered new by the classical school because of the similarity of its melodies and harmonies, although there was much that was new in its performance techniques (articulations) and timbres.
Search for novelty: from classics to experiments
However, in the early 20th century even the classical composers felt the narrowness of WEMT. Hence the search for new timbres: the "prepared piano" (coins placed between the strings), the interest in electronic instruments such as the Theremin. Composers lacked freedom in classical harmony and melody alike, which led to a turn toward the aleatoric (chance-based) and the unnatural: "ink splashes on sheet music," the use of mathematical theories in composition (K. Stockhausen, J. Cage, I. Xenakis).
All of this, in essence, was a rejection of the natural, of what is defined by musical acoustics and the psychology of perception. But it was fresh and, importantly, expressive! Why? The miracle happened at the moment of performance. A person cannot deliver a phrase in a completely inexpressive way; musicians cannot play abstract notes without performance techniques. Once, I assigned conservatory students to play a set of random notes with no meter, tempo, or bar markings. Every one of them performed this nonsense quite expressively: they shaped phrases, placed accents, and so on. This was true of wind players and vocalists, and equally of violinists and pianists, who did not even need to take a breath.
What makes music music?
The main thing in music is the performer operating the EM. The key word is "operating." If you give a musician an instrument that reproduces "the most perfect sound in the universe" at the push of a button, they will not be satisfied. Where is the performance? They need to influence the sound: to change that one note, or to pull out a run (even if only through the pentatonic scale).
What is new in music, what drives its development, is a change in the set of expressive means: one musician introduced a new chord into circulation, another a melodic turn, another a timbre or instrument, another a technique or articulation, another a rhythm, and so on.
Everyone hears something of their own. Examples
As noted earlier, differential thresholds of perception play a significant role in how accurately sound patterns are recognized, and the development of musical AI suggests ways to update the theory of hearing.
1] At a symphony concert, a composer picks out a barely audible flute playing a secondary part, and even "hears" notes that aren't there because they should be there. A hunter in the woods sees a bird invisible to the untrained eye. Switch their places: the hunter will ask, "What does it sound like?", and the composer will say, "There's nobody here at all."
2] While working on arrangements for a film with Indian musicians (those who have not yet fully transitioned to Western templates), I noted three points:
Sharp changes of mood and emotion within a single song. We are somewhat more definite: it is either sadness or joy.
A form unusual for us: chorus → verse → interlude, then all of it again, plus a mandatory third part. I called it the "Moral" or "Conclusion": in it, both the text and the melody carried a clear instruction.
Ignoring the WEMT rules of harmonization: any chord can be dropped in without regard to what came before or what comes after.
Conclusion: if they have a different set of EM (a different musical language), they have a different hearing.
3] I watch a video of, say, Swedes or Irish listening to a folk song with tears in their eyes; it was probably sung to them in childhood. For me, however, there is "nothing" in it: the music does not touch me, the emotion is not conveyed.
4] An adult presses a doorbell button, a melody plays... and suddenly memories of first love resurface. The sound is awful, but the memories come alive, and that very song plays in their head... By the way, a good musician "hears" music even when looking at an unfamiliar score.
5] An episode from the 90s: Moscow musicians in a Novosibirsk restaurant, on hearing the local ensemble play, came up to the stage: "Where is that sound coming from, what kind of device is that? There's no such timbre in this Roland!"... Behind the keys was Sergey Kokorin (now the artistic director of the Sochi Big Band). He had "disassembled" the new sound into its elements by ear, "calculated" its "formula," and precisely set some 30 synthesizer parameters to imitate it.
So, we all hear different things; each person has their own "sound picture." And in most cases, it's not about anatomy. It's the result of training, i.e., the brain's work: we "complete the sound" (draw the picture) based on our past experience, even when we don't receive all the information.
The Strength and Weakness of Musical AI
The generation of music has long interested humanity. Before the advent of AI, everything was based on formalizing rules: how music is structured and how musicians perform it. While much was clear where sheet music was concerned, performance nuances (i.e., what makes music alive) turned out to be far more complicated. To build generation algorithms, one can take everything that has been formalized in WEMT textbooks, jazz improvisation courses, and electronic music lessons, but the result, for the most part, will never become "human." Evidence of this is the auto-accompaniment devices and programs (the so-called "self-playing" instruments, even the most expensive ones), and the fact that without human participation Vocaloid and its analogs do not produce live vocals. The reason lies in the difficulty of imitating the EM that are secondary in WEMT: timbre and its variations, articulations, subtle changes in Pitch; in other words, everything that matters for expressiveness but is not represented in traditional sheet music.
How can one reflect in notation that a guitarist plucked two strings, the 1st and 2nd, fretted with an offset of 4 frets (a gap of 1 semitone), and pulled the 2nd string up to the 1st, into a unison (more effective with Distortion/Overdrive)? Or that a keyboardist "played" the Pitch Wheel during a solo? There are hundreds of such techniques, and their sound depends on many factors: the make of the instrument and amplifier, the preset, the positions of the controls, and so forth. One can only approximate a model by experimenting and analyzing audio and video; that is, there are no formal rules that could be programmed and used in generation. Essentially, classical music as a whole is covered by written notation, while modern music (at the necessary level of detail) is not. At the same time, the training of musicians in these traditions is based mainly on repetition ("do as I do") rather than on reading sheet music.
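To make the point concrete, here is a toy sketch (my own, not taken from any notation standard or from the systems discussed here) of what even the simplest of these techniques, a one-semitone bend, actually requires: not a symbol on paper but a continuous frequency trajectory. It assumes only NumPy and synthesizes a bare sine tone instead of a guitar.

```python
import numpy as np

SR = 44100  # sample rate, Hz

def bent_note(f_start: float, f_end: float, dur: float, bend_at: float) -> np.ndarray:
    """A tone whose pitch glides from f_start to f_end, as in a string bend.
    The score would show a single note; the sound needs the whole curve."""
    t = np.arange(int(SR * dur)) / SR
    # hold the starting pitch, then glide linearly to the target
    glide = np.clip((t - bend_at) / (dur - bend_at), 0.0, 1.0)
    freq = f_start + (f_end - f_start) * glide
    phase = 2 * np.pi * np.cumsum(freq) / SR  # integrate frequency to get phase
    return 0.5 * np.sin(phase)

# A one-semitone bend from E4 up to F4, starting halfway through the note.
audio = bent_note(329.63, 349.23, dur=1.0, bend_at=0.5)
```

A real articulation adds vibrato, finger noise, and a changing spectrum on top of this curve, which is exactly why such techniques resist formal rules.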
Down with the rules!
In AI music generation that works on the spectrum (Suno, Udio, Riffusion, Sonauto, Mureka...), none of this formalization is required. Previously, in electronic instruments, some automation was added to enliven the sound, and the engineer or musician knew exactly why each operation was performed. Now, with AI, all the rules remain "in the background," in implicit form; more precisely, in a "black box." Analysis of performance nuances is not required at all: the rules work, but no one has seen them! You train the system on a number of samples, and then it creates a new spectrum (a new "canvas") in which every stroke corresponds to what it was taught. It is just like painting: feed the AI 30 paintings by Aivazovsky, and you will get new paintings that follow all the characteristic features of the artist...
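The "canvas" here is a time-frequency picture such as a mel spectrogram. The internals of Suno or Udio are not public, so the sketch below only shows what that kind of representation looks like, using the librosa library and a placeholder file name.

```python
import numpy as np
import librosa

# Turn an audio file (path is a placeholder) into the kind of
# time-frequency "canvas" that spectrum-based generators learn from.
y, sr = librosa.load("example_track.wav", sr=22050, mono=True)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Shape: (128 mel bands, number of frames) -- a 2-D picture of the sound,
# with every performance nuance already "painted" into it.
print(mel_db.shape)
```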
Although 90% of music is listened to through headphones, in cafes, cars, and shopping centers rather than on Hi-End systems, the criticisms about quality from two years ago are understandable. But look at the notes and the strokes: they are all correct. At first (since I had worked on generation myself), I was quite surprised and thought I might be missing something. I consulted professional musicians, and they confirmed that the notes, the nuances, everything is correct.
Removing the formalization of musical performance from the generation process is an incredible, revolutionary breakthrough. At the same time:
Before the advent of AI, a person could control any parameter of generation, but the system did not reproduce expressive components sufficiently naturally.
Now, with AI, we do not have explicit access to generation parameters, but the system recreates expressive elements at the level of the best human examples.
On the "soul" in AI generations
Everything is determined by the listener's perception. Who said that all live performers sing and play "with soul," genuinely loving and respecting the audience? Spend some time backstage at least once and you will start to doubt it. An artist who has mastered the full range of expression can make the audience cry, empathize, gain confidence, or laugh, and at the same time often treats the fans with disdain. You have to be a person of very high principles not to succumb to the temptation of considering yourself above the audience when they look at you with adoration.
Moreover, mastery of the performing arts sometimes lets one do without a "face of one's own." For a time, I worked with a vocalist who perfectly imitated I. Gillan, P. McCartney, V. Leontyev, and V. Mulyavin, but could not sing a song "in his own way," one that no one had performed before.
Different AI generators. Evaluation criteria
In my opinion, the generations of Suno, Udio, Sonauto are quite "alive and human," especially with acoustic solos (guitar, strings, wind instruments). They are significantly more natural than anything created by machines before the AI era.
The exception is systems trained not on spectra but on MIDI files (electronic scores): AIVA, Boomy, Loudly, Media.io, Melobytes, Mubert, and others. These AI generators do not reproduce performance nuances, because the nuances are absent from the dataset. In many of them you cannot even specify styles like Bossa-nova or Acoustic guitar solo. It seems the system simply creates a new MIDI file, which is then voiced with samples. For electronic or background music this is sometimes sufficient; however, such generators usually do not offer vocal tracks.
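For comparison, here is what a MIDI-level dataset actually contains, sketched with the mido package (the notes and file name are arbitrary): note numbers, velocities, and timings, with timbre and micro-intonation left entirely to whatever samples voice it later.

```python
from mido import Message, MidiFile, MidiTrack

mid = MidiFile()
track = MidiTrack()
mid.tracks.append(track)

track.append(Message('program_change', program=24, time=0))  # "nylon guitar" patch number
for note in (60, 64, 67):  # a C major arpeggio
    track.append(Message('note_on', note=note, velocity=80, time=0))
    track.append(Message('note_off', note=note, velocity=64, time=240))

mid.save('sketch.mid')  # nothing here says how the notes should actually sound
```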
It seems unfair to compare every generated track with the world's masterpieces, which make up thousandths of a percent of what humans create. In purely musical terms (harmony, melody, arrangement, detail), the output of the market leaders exceeds the average professional level. Order an arrangement of a song at an ordinary studio for 30-100 thousand rubles: most likely, the music will be weaker than Suno's. Listen to the soundtracks of our typical TV series: they, too, often lag behind the generations (unless they are licensed global hits). And this is the work of professionals. And if you need a "serious" style, a symphony orchestra or a big band, where will you order that?
Why is there nothing new in generations?
Innovation in music means a new set of expressive means, while generations are built from what people have already found and worked out. At this stage, generative AI operates on the principle of "made from what was there."
Of course, something new can be obtained by mixing styles, especially ones that are close and "compatible"; not all combinations turn out well. For example, an attempt to combine a lyric tenor (I. Kozlovsky, J. Carreras) with Thrash metal does not yield a meaningful result (I have checked). Yes, a Russian text combined with the backing track and vocals of a Chris Rea song is a new song, but we clearly understand that we have received a variation of what already existed.
Limitations of Variation
If the system uses a model trained on a single track (Audio Upload), and the influence of other models is minimized (the Weirdness and Style/Audio Influence parameters), the "stock" of variations quickly runs out. In tests with Suno, this becomes noticeable after about 50 tracks. I ran into a similar effect back in 2017, while developing my own generator: rendering more than 100 tracks in one style made no sense, since everything became very similar. And if I allowed more freedom in harmony without changing the style, the stylistic accuracy inevitably dropped.
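This "everything became very similar" effect can even be measured. The sketch below is one possible measuring stick, not the method I used in 2017: it averages MFCCs per track (with librosa; file names are placeholders) and compares tracks by cosine similarity, which creeps upward as the stock of variations runs out.

```python
import numpy as np
import librosa

def fingerprint(path: str) -> np.ndarray:
    """A crude per-track fingerprint: time-averaged MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

paths = ["gen_01.wav", "gen_02.wav", "gen_03.wav"]  # a batch of generations
prints = [fingerprint(p) for p in paths]
for i in range(len(prints)):
    for j in range(i + 1, len(prints)):
        print(paths[i], paths[j], round(cosine(prints[i], prints[j]), 3))
```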
On the False Vector of Musical AI Development
Recently, I heard a suggestion to send AI "in the footsteps" of human thought and science, that is, essentially, to deepen WEMT and digitize everything that can be digitized. For a guitarist, this would mean building models of the strings, the neck, the guitar body, the fingers... And then what? Modeling the brain, or even the whole person? After all, it is necessary not only to reproduce sounds but also to hear and to feel...
A small part of this list was implemented long ago without any AI: Physical Modeling Synthesis (the VL series at Yamaha, MOSS at Korg, COSM at Roland). However, imitating plucked and wind instruments from a keyboard has not become any easier: it is simpler to invite a live guitarist or saxophonist than to coax something close to natural out of these high-tech devices. Evidence of this is BreathControl, which lets the player drive the synthesis with their breath, as on an acoustic instrument.
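For those curious what "modeling a string" means in the simplest case, here is the textbook Karplus-Strong pluck in a few lines of NumPy; commercial systems such as Yamaha's VL series are of course far more elaborate, so treat this only as an illustration of the idea.

```python
import numpy as np

SR = 44100

def karplus_strong(freq: float, dur: float, damping: float = 0.996) -> np.ndarray:
    """Karplus-Strong plucked-string model: a burst of noise circulating
    in a delay line, with a gentle lowpass in the feedback path."""
    n = int(SR * dur)
    delay = int(SR / freq)                 # delay-line length sets the pitch
    buf = np.random.uniform(-1, 1, delay)  # the "pluck": a burst of white noise
    out = np.empty(n)
    for i in range(n):
        out[i] = buf[i % delay]
        # averaging two neighbouring samples acts as a lowpass (string damping)
        buf[i % delay] = damping * 0.5 * (buf[i % delay] + buf[(i + 1) % delay])
    return out

note = karplus_strong(196.0, 2.0)  # roughly a plucked G3 "string"
```

Even this tiny model produces a recognizable pluck, and yet it says nothing about how a particular guitarist would shape the note.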
I believe that proponents of the classical engineering approach, with its focus on planning, design, and control, will quickly lose their enthusiasm when faced with the complexity of formalizing musical performance. (I realized this 30 years ago, while debugging algorithms similar to "supervised learning" for StyleEnhancer.) In other fields (pattern recognition, big data processing), engineers have already realized that effective problem-solving is impossible without AI. Ultimately, AI is an attempt to replicate nature, where everything is arranged intelligently and economically.
What about the future?
I think everything will blend together and the approaches will merge: the use of AI by musicians will become commonplace, as has happened with other innovations in music. DAWs of a new type, incorporating the latest musical AI algorithms (like ACE Studio), will become the standard. Scientists will continue "taming" AI to make generations more predictable.
Competition among people will move to a new level. However, I believe that AI detectors and "plagiarism analyzers" are a temporary phenomenon, and even a harmful one for musicians. I am 100% sure it will soon become clear that everyone borrows from everyone else, and has been doing so for a long time. I heard a story (I cannot vouch for its accuracy) that an attempt was made in the USSR to run computer analysis of scores for plagiarism, but the Union of Composers opposed it and the project was shelved...
The giants of the music industry will adapt to the new realities, share financial streams, and, as before, will live off the creativity and enthusiasm of millions of professional musicians and amateurs.
P.S. This article is not a scientific report. Many statements sound categorical, but only for the sake of brevity; a detailed exposition would be significantly longer. If anyone needs more precise formulations (scientific in a musicologist's sense), quotations, and references on musical performance and expressive means from the perspective of psychoacoustics, you will find them in the articles on my website.