MMAudio: Sound Generation for Video Using AI

Hello everyone! Researchers from the University of Illinois and Sony AI have presented an interesting project, MMAudio: a neural network that generates sound for video. My name is Ilya. I am the founder of ArtGeneration.me, an online neural network for creating images, as well as a tech blogger and a neuro-evangelist. Today I want to take a closer look at this technology and share our portable version. The system's main feature is that it can generate sound not only from text descriptions but also from images or video.

How it works

MMAudio is built on the idea of multimodal learning: the system analyzes video, sound, and text descriptions jointly. Video is processed by two parallel streams: CLIP for understanding the overall context (8 frames/sec) and Synchformer for precise synchronization (24 frames/sec).
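
To make the two frame rates concrete, here is a minimal Python sketch of how many frames each stream would see for one 8-second clip. It only illustrates the sampling arithmetic. The function name and constants are hypothetical and are not taken from the MMAudio code.

```python
def frame_timestamps(duration_s: float, fps: float) -> list[float]:
    """Return the timestamps (in seconds) of frames sampled at a fixed rate."""
    n_frames = int(round(duration_s * fps))
    return [i / fps for i in range(n_frames)]

# Rates quoted in the text above; the names are illustrative only.
CONTEXT_FPS = 8   # CLIP stream: general context
SYNC_FPS = 24     # Synchformer stream: precise synchronization

clip_times = frame_timestamps(8.0, CONTEXT_FPS)
sync_times = frame_timestamps(8.0, SYNC_FPS)

# The synchronization stream samples the clip three times more densely.
print(len(clip_times), len(sync_times))  # 64 192
```

The sparse stream is enough to tell what kind of scene is on screen, while the dense stream gives finer timing for when a sound should start.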

This approach helps the system understand what is happening in the video and produce a more fitting soundtrack. For example, when it sees falling raindrops, MMAudio does not just generate generic rain noise but tries to reproduce the characteristic sound of drops hitting different surfaces.


Importantly, the system uses Flow Matching instead of traditional diffusion, which makes it impressively fast: generating an 8-second clip takes only a few seconds.
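
Roughly speaking, flow matching trains a network to predict a velocity that moves a noise sample toward data, and generation then integrates that velocity over a small number of steps. The toy Python sketch below shows only this sampling loop. The `velocity` function is a hand-written stand-in for a learned network (it points straight at a known target), and none of the names come from MMAudio itself.

```python
import random

def velocity(x: list[float], t: float, target: list[float]) -> list[float]:
    """Toy stand-in for a learned velocity field.

    It returns the velocity of the straight-line path that reaches
    `target` at time t = 1. A real model would predict this with a network.
    """
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]

def sample(target: list[float], steps: int = 10, seed: int = 0) -> list[float]:
    """Integrate the velocity field from noise (t = 0) to data (t = 1) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Ten Euler steps are enough to land on the target in this toy setting.
result = sample([0.5, -0.25], steps=10)
print([round(v, 6) for v in result])  # [0.5, -0.25]
```

Because the path from noise to data is close to a straight line, a handful of integration steps can be enough, which is one plausible reason for the speed mentioned above.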

System requirements

To work with MMAudio, you need:

  • NVIDIA GPU with 8+ GB of video memory

  • Windows 10/11 64-bit

  • 16 GB of RAM

  • 12 GB of free disk space
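
As a quick sanity check before installing, the minimums above can be written down as a small Python helper. This is only a sketch: the function and constant names are hypothetical, and the video memory and RAM figures are typed in by hand, since reading them automatically requires extra tools.

```python
import shutil

# Minimums taken from the list above, in gigabytes.
MIN_VRAM_GB = 8
MIN_RAM_GB = 16
MIN_FREE_DISK_GB = 12

def unmet_requirements(vram_gb: float, ram_gb: float, free_disk_gb: float) -> list[str]:
    """Return a readable list of the requirements that are not met."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"video memory: {vram_gb} GB < {MIN_VRAM_GB} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb} GB < {MIN_RAM_GB} GB")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"free disk space: {free_disk_gb:.1f} GB < {MIN_FREE_DISK_GB} GB")
    return problems

# Free disk space on the current drive can be read with the standard library.
free_disk_gb = shutil.disk_usage(".").free / 1024**3

# Replace the VRAM and RAM values with those of your own machine.
print(unmet_requirements(vram_gb=8, ram_gb=16, free_disk_gb=free_disk_gb))
```

An empty list means all three numeric minimums are satisfied. The operating system requirement is not checked here.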

Capabilities of MMAudio

The system handles basic sound design for video well. Above all, this means everyday sounds: footsteps, moving objects, sounds of nature. Here are some examples:

Dynamic sounds: the system accurately captures moments of movement and synchronizes the sounds of footsteps, jumps, and object movements. The synchronization accuracy reaches 25 milliseconds.

Natural effects: the system realistically recreates the sounds of rain, wind, and running water. Its handling of rain is particularly impressive: it distinguishes how rain sounds on different surfaces.

Sporting events: the system accurately identifies the moments of ball hits and jumps and creates a realistic acoustic atmosphere of a stadium or gym.

Soundscapes: the system can create atmospheric sounds for various locations, such as a forest, a city, or a beach.

Current limitations

Unfortunately, the system does not solve all tasks equally well:

Speech problems: generated human speech is still unintelligible. The system can create speech-like sounds, but they are impossible to understand.

Complex music: musical accompaniment is limited to simple effects. Full-fledged compositions are not yet available.

Clip length: the system works with clips 8-10 seconds long. Longer videos have to be processed in parts.

Unusual sounds: there may be problems with sounds that were not in the training set.
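
Because the system works with clips of 8-10 seconds, a longer video has to be cut into parts first. A minimal Python sketch of that planning step is shown below. The function name and the 8-second default are assumptions for illustration, and the sketch only computes time ranges; cutting the video and joining the generated audio is left to your editing tools.

```python
def split_into_segments(duration_s: float, max_len_s: float = 8.0) -> list[tuple[float, float]]:
    """Split a video duration into consecutive (start, end) ranges of at most max_len_s seconds."""
    if duration_s <= 0 or max_len_s <= 0:
        raise ValueError("duration_s and max_len_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# A 20-second video becomes two full parts and one shorter tail.
print(split_into_segments(20.0))  # [(0.0, 8.0), (8.0, 16.0), (16.0, 20.0)]
```

Each range can then be processed separately. The sketch does not deal with how the audio sounds at the boundaries between parts.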

Who will benefit from this

Content creators:
MMAudio will become an indispensable assistant for YouTubers and streamers. With it, you can quickly add sound to a short video or add sound effects to a live broadcast. Animators will appreciate being able to quickly prototype a soundtrack.

Game developers:
The technology is perfect for creating test sound effects and procedural sound generation. This is especially useful at the prototyping stage when you need to quickly test different sound concepts.

Video editors:
MMAudio will help create rough versions of soundtracks. The system will quickly generate a basic sound accompaniment, which can then be manually refined.

3D animators:
Ideal for quickly adding sound to previsualizations and test renders.

How to try

You can try MMAudio in several ways:

Online demo

For developers
The source code is available on GitHub

Our portable version

Together with the Neuro-Soft channel, we have prepared a modified Russian-language portable build of MMAudio, which includes:

  • Russian interface

  • One-click simplified installation that downloads and installs everything automatically

  • Ability to save audio separately from video

  • Audio generation from images

  • Optimization for running on consumer-grade GPUs

Everything you need is already included in the distribution: just unpack and run. No additional setup is required. Get it here.

My experience

I actively use MMAudio to add sound to videos generated in various img2video services. The results are really impressive: the system handles basic sounds well and creates a fairly realistic atmosphere. It is especially good with natural scenes and with actions such as walking or sports movements.

Of course, the technology is still developing, and generation sometimes produces funny artifacts, but it is an excellent tool for quickly creating a basic soundtrack.
