Practice: my experience of integrating more than 50 neural networks into one project

The article is based on a year and a half of work on implementing neural networks in an open-source web application. It contains practical life hacks for solving real problems and overcoming difficulties faced by developers.

A year and a half ago, I started working on an open-source project that gradually grew and developed. Inspired by the project AUTOMATIC1111, which had just appeared at that time, I added more and more functionality and features. Today, my project includes more than 50 neural networks, each performing its unique task. In this article, I share practical life hacks and insights that helped me along the way. I hope they will be useful to you as well.

The project is focused on creating and editing videos, images, and audio using neural networks. Different methods can often perform similar tasks. Since I integrated open-source solutions, optimized them, and added new functionality, the key task was to unify the methods. For example, functions such as face replacement, lip synchronization, and portrait animation all require face recognition. In my project, a single model is responsible for that task, rather than several different methods as in the original solutions. As a result, all 50+ models are distributed so that each is responsible for its own unique direction, without duplication.

During the development process, I fundamentally abandoned TensorFlow and related solutions, focusing exclusively on PyTorch and ONNX Runtime.

For those who want to learn more about the functionality and find out which neural networks I used, I offer several links: a YouTube playlist, where you can track how the project developed and improved, as well as a short video created with my program — for those who do not have access to YouTube.

The functionality of each model is diverse and complex: from generating images and videos to face recognition, segmentation, and much more. There are no simple solutions in the project, and each model performs its unique task.

So, let's get started.

Lifehack 1

The first thing I encountered, and what surprised me: you cannot load a model into video memory once and use it simultaneously for several tasks. Each task needs its own loaded instance of the model. This constraint underpins most of the lifehacks that follow.

Lifehack 2

Queue. My application is based on Flask, so a user does not wait for processing to finish and can launch as many tasks as they like, loading up the memory. To avoid two or more tasks starting at the same moment, I artificially insert a randomized delay between task launches. This ties in with Lifehack 3.
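
A minimal sketch of this idea, assuming tasks arrive through a standard queue.Queue; the names `task_queue` and `run_task` are placeholders, not the project's real API, and the delay bounds are assumptions:

import random
from time import sleep

min_delay, max_delay = 5, 30  # Assumed bounds; the real values depend on the hardware

def worker(task_queue, run_task):
    """Process tasks one by one, pausing randomly so two heavy launches never coincide."""
    while True:
        task = task_queue.get()  # Blocks until the next task arrives
        sleep(random.randint(min_delay, max_delay))  # Stagger launches to protect VRAM
        run_task(task)
        task_queue.task_done()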

Lifehack 3

Before launching a task, I measure memory. If the amount of memory currently available on the device is less than the model requires, I can artificially delay the launch.

import torch
import psutil

def get_vram_gb(device="cuda"):
    if torch.cuda.is_available():
        properties = torch.cuda.get_device_properties(device)  # Properties of the specific GPU we run on
        total_vram_gb = properties.total_memory / (1024 ** 3)
        available_vram_gb = (properties.total_memory - torch.cuda.memory_allocated(device)) / (1024 ** 3)
        busy_vram_gb = total_vram_gb - available_vram_gb
        return total_vram_gb, available_vram_gb, busy_vram_gb
    return 0, 0, 0

def get_ram_gb():
    mem = psutil.virtual_memory()
    total_ram_gb = mem.total / (1024 ** 3)
    available_ram_gb = mem.available / (1024 ** 3)
    busy_ram_gb = total_ram_gb - available_ram_gb
    return total_ram_gb, available_ram_gb, busy_ram_gb
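
And a sketch of how these measurements can gate a launch; `required_vram_gb` here is an assumed per-model requirement and the 30-second re-check interval is arbitrary:

from time import sleep

required_vram_gb = 8  # Assumed requirement of the model about to be launched

_, available_vram_gb, _ = get_vram_gb()
while available_vram_gb < required_vram_gb:
    sleep(30)  # Wait and re-measure until enough VRAM is free
    _, available_vram_gb, _ = get_vram_gb()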

Lifehack 4

Along with the delayed launch, I use checks for the most common error: “CUDA out of memory”. The idea is that if we get a memory shortage message, we need to clear the memory of unnecessary data and restart the process.

import random
from time import sleep

min_delay = 20
max_delay = 180
try:
    ...  # Launch the method with a neural network
except RuntimeError as err:
    if "CUDA out of memory" in str(err):
        ...  # Clear memory (see Lifehack 6)
        sleep(random.randint(min_delay, max_delay))
        ...  # Clear memory again
        ...  # Launch the method again
    else:
        raise

We will return to this part later: it is not enough to simply perform `# Clear memory`; things need to be done a little differently.

Lifehack 5

The backend of my program is split into modules classified along two axes. The first is what the model does: video or image modification, video and image generation, audio modification — that is, the property of the model. The second is who the result is for: either the model serves the frontend and its output must be returned to the user instantly (segmentation, txt2img, and img2img), or it runs as a large background task whose result is delivered when finished. This does not include models that run in the browser itself, using:

const session = await ort.InferenceSession.create(MODEL_DIR);
console.log("Model loaded");

Therefore, I load the quick-response models into memory and keep them there, while making sure that different users do not use the same loaded model simultaneously (Lifehack 1) and that these models are not used for long processing tasks.
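
A minimal sketch of how this can be enforced, assuming the fast-response models are served from one process; the model names and the `run_inference` callable are illustrative, not the project's real API:

import threading

# One lock per always-loaded "fast response" model
model_locks = {"segmentation": threading.Lock(), "img2img": threading.Lock()}

def run_fast_model(name, run_inference, *args, **kwargs):
    """Serialize access so two users never hit the same loaded model at once (Lifehack 1)."""
    with model_locks[name]:
        return run_inference(*args, **kwargs)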

Lifehack 6

Models for long processing can be very demanding, and depending on the amount of video memory, a single such model may occupy it completely. Loading and unloading such a model every time is very costly in terms of optimization, although sometimes, unfortunately, it has to be done. These large models are often accompanied by micro models that take up little memory, but whose loading and unloading still costs time. So when launching tasks, we group them by long-processing method: tasks from one group first pass through the small models, forming a queue before the single large model is loaded. Remember Lifehacks 3 and 4? We have two options: measure in advance how much memory such a model consumes, or run it, catch the “CUDA out of memory” error, and clear the cache.

Having received this error, we clear the memory of unnecessary models, including those used for quick response, and also clear unused data, if any remains.

import gc
import torch

if torch.cuda.is_available():  # If CUDA is available, because the application can also work without CUDA
    torch.cuda.empty_cache()  # Frees unused memory held in the CUDA caching allocator
    torch.cuda.ipc_collect()  # Collects CUDA memory shared via IPC (interprocess communication)
gc.collect()  # Calls Python's garbage collector to free memory occupied by unused objects

Lifehack 7

After completing each task, clear the memory and delete variables and models if they are no longer needed.

del ...
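
A minimal sketch of what this looks like after a task, assuming `model` and `frames` are the objects that are no longer needed:

import gc
import torch

del model, frames  # Drop the last references so Python can free the objects
gc.collect()  # Collect whatever is left behind by reference cycles
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Return the freed CUDA blocks to the driver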

Lifehack 8

Models can be loaded layer by layer across GPU and CPU, or across multiple GPUs, but all elements of one layer must sit on the same device. This approach is used when video memory is scarce; it is common in image and video generation, but is not limited to it.

device_map = {
    'encoder.layer.0': 'cuda:0',
    'encoder.layer.1': 'cuda:1',
    'decoder.layer.0': 'cuda:0',
    'decoder.layer.1': 'cuda:1',
}
# Or
device_map = {
    'encoder.layer.0': 'cuda',
    'encoder.layer.1': 'cpu',
    'decoder.layer.0': 'cuda',
    'decoder.layer.1': 'cpu',
}
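
If you work with the Hugging Face stack, such a map can usually be passed straight to from_pretrained (this requires accelerate to be installed); a sketch with a placeholder model id:

from transformers import AutoModel

# The keys of device_map must match the module names of the specific model
model = AutoModel.from_pretrained("placeholder/model-id", device_map=device_map)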

Lifehack 9

Don't forget to use enable_xformers_memory_efficient_attention() if the model pipeline supports it. The documentation describes other methods too, such as enable_model_cpu_offload(), enable_vae_tiling(), and enable_attention_slicing(). These work for me when restyling videos, but for image generation I use a completely different scheme:

# `vram` is the GPU memory in GB (for example, measured with get_vram_gb() from Lifehack 3)
if vram < 12:
    pipe.enable_sequential_cpu_offload()
    print("VRAM below 12 GB: Using sequential CPU offloading for memory efficiency. Expect slower generation.")
elif vram < 20:
    print("VRAM between 12-20 GB: Medium generation speed enabled.")
elif vram < 30:
    # Load essential modules to GPU
    for module in [pipe.vae, pipe.dit, pipe.text_encoder]:
        module.to("cuda")
    cpu_offloading = False
    print("VRAM between 20-30 GB: Sufficient memory for faster generation.")
else:
    # Maximize performance by disabling memory-saving options
    for module in [pipe.vae, pipe.dit, pipe.text_encoder]:
        module.to("cuda")
    cpu_offloading = False
    save_memory = False
    print("VRAM above 30 GB: Maximum speed enabled for generation.")

These approaches reduce memory usage but increase processing time.

Lifehack 10

We do not store frames in memory. In fact, this is a double-edged sword. If you need results quickly on a powerful machine, with limits on resolution and content duration, keeping everything in memory can pay off. But users of my project run it on weak devices with hour-long high-resolution videos. So I rewrote all methods to work with the current frame and its values, saving them to the hard drive and reading this data back as needed, which avoids many device limitations. In lists I store only file paths, which makes the process more efficient. Additionally, you can use generators or chunked processing to handle only the current values, as I do in some modules, for example when replacing faces.
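
A minimal sketch of that pattern, assuming OpenCV is used to read the video so that only the current frame ever lives in memory; `process` and `save_to_disk` are hypothetical stand-ins for the real per-frame logic:

import cv2

def iter_frames(video_path):
    """Yield frames one by one instead of loading the whole video into RAM."""
    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            yield frame
    finally:
        capture.release()

# for frame in iter_frames("input.mp4"):
#     result = process(frame)      # hypothetical per-frame model call
#     save_to_disk(result)         # hypothetical writer that appends to the output on disk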

Lifehack 11

Frame resolution. Depending on the model, you sometimes have to downscale the frame to limits the user's device can handle, and then restore its size afterwards with a plain resize or a more advanced upscaler.
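
A minimal sketch of such a resize step, again with OpenCV; the 512-pixel limit is an assumption, since the real limit depends on the model and the device:

import cv2

def resize_for_model(frame, max_side=512):
    """Downscale so the longer side fits the model limit; return the frame and the original size."""
    h, w = frame.shape[:2]
    scale = max_side / max(h, w)
    if scale >= 1.0:
        return frame, (w, h)
    small = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)
    return small, (w, h)

# After inference, restore the original size with a plain resize (or hand it to an upscaler)
# output = cv2.resize(model_output, original_size, interpolation=cv2.INTER_CUBIC)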

Lifehack 12

Models are not asynchronous? This is not a firm statement, since the world of artificial intelligence changes constantly and this is only my experience. I found that asynchronous methods give me no significant gains, except for individual data-processing operations not directly related to the model, and for requests that download a model or check whether it is up to date. The models themselves run synchronously.

Lifehack 13

Let's talk about the compatibility of library versions, especially torch, torchvision, torchaudio, and xformers. It is important that they are compatible with each other and with your version of CUDA. How do we proceed?

First — check your CUDA version:

nvcc -V

Second — go to the PyTorch website to check version compatibility: PyTorch Previous Versions, or the download page, where cu118 stands for your CUDA version. Note that a newer CUDA may still work with wheels built for an older one: for example, CUDA 12.6 can run a torch build compiled for cu118.

I noticed that torch and torchaudio usually share the same version number, for example 2.4.1, while torchvision's version differs, such as 0.19.1. From this you can work out that torch and torchaudio 2.2.2 pair with torchvision 0.17.2. See the pattern?
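
A quick way to check at runtime which build actually got installed, using only standard PyTorch attributes:

import torch

print(torch.__version__)          # e.g. "2.4.1+cu118" — the suffix shows the CUDA build of the wheel
print(torch.version.cuda)         # CUDA version the wheel was compiled against
print(torch.cuda.is_available())  # False usually means a CPU-only wheel or a driver mismatch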

Additionally, you can download the .whl files via the link and even unpack them yourself. For me, version compliance is critically important: the program is installed through an installer, and on first launch Windows users download torch, torchaudio, and torchvision according to their choice, with the download progress shown, after which the wheels are unpacked.

Third — you need to make sure that xformers is also compatible. To do this, visit the xformers repository on GitHub and carefully review which versions of torch and CUDA a given xformers release works with, as support for older versions may be discontinued, including for torch. For example, when using CUDA 11.8, you will benefit from xformers, especially if your device has a limited amount of video memory.

Fourth — this is not a mandatory step, but there is such a thing as flash-attn. If you decide to install it, you can do it faster using the command:

MAX_JOBS=4 pip install flash-attn

where MAX_JOBS sets the number of parallel build jobs that suits your machine. I use it as follows:

try:
    from flash_attn import flash_attn_qkvpacked_func, flash_attn_func
    from flash_attn.bert_padding import pad_input, unpad_input, index_first_axis
    from flash_attn.flash_attn_interface import flash_attn_varlen_func
except ImportError:
    flash_attn_func = None
    flash_attn_qkvpacked_func = None
    flash_attn_varlen_func = None

Lifehack 14

To make sure CUDA is available in ONNX Runtime providers, run the following code:

import onnxruntime
import torch

access_providers = onnxruntime.get_available_providers()
if "CUDAExecutionProvider" in access_providers:
    provider = ["CUDAExecutionProvider"] if torch.cuda.is_available() and self.device == "cuda" else ["CPUExecutionProvider"]
else:
    provider = ["CPUExecutionProvider"]

For new versions of CUDA 12.x, unlike the older 11.8, you will also need to install cuDNN 9.x on Linux (on Windows this may not be necessary). Note that onnxruntime-gpu is sometimes installed without CUDA support. So once you are sure your torch build matches your CUDA version, it is worth reinstalling onnxruntime-gpu:

pip install -U onnxruntime-gpu

Lifehack 15

What to do if some models only work with old libraries, while others only work with new ones? I ran into this with GFPGAN, which requires an old version of torchvision, while video generation needs new versions of torch. In this case, you can use the following approach:

import sys
import types

try:
    # Check whether `torchvision.transforms.functional_tensor` with `rgb_to_grayscale` still exists
    from torchvision.transforms.functional_tensor import rgb_to_grayscale
except ImportError:
    # In new torchvision versions the function lives in `functional`, so import it from there
    from torchvision.transforms.functional import rgb_to_grayscale

    # Create a stand-in module for `torchvision.transforms.functional_tensor`
    functional_tensor = types.ModuleType("torchvision.transforms.functional_tensor")
    functional_tensor.rgb_to_grayscale = rgb_to_grayscale

    # Register it in `sys.modules` so other imports can find it
    sys.modules["torchvision.transforms.functional_tensor"] = functional_tensor

In effect, you re-expose methods that have disappeared in newer versions under their old import paths. This allows you to keep different libraries and models compatible.

Lifehack 16

Pay attention to warnings. Always watch for Warning messages that announce upcoming changes in new library versions. Find the corresponding lines of code in your project and add or change the necessary parameters. This helps you avoid an accumulation of incompatibilities when you update.
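
One way to make such warnings impossible to miss during development is to temporarily escalate them to errors; a minimal sketch using only the standard library (not something the project requires):

import warnings

# Turn deprecation-style warnings into exceptions so the offending call site
# appears immediately in the traceback (development only, not production).
warnings.simplefilter("error", FutureWarning)
warnings.simplefilter("error", DeprecationWarning)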

Lifehack 17

GPU management in a cluster. If you use a cluster of several machines, remember that you cannot simply sum the video memory of different GPUs. However, if the machines with the GPUs are on a local network, you can manage them from one controller; there are libraries for this, such as Ray. Even then the VRAM is not pooled: only on a single machine with several GPUs can you split a model across devices, and that is Lifehack 8 rather than memory summation.
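
A minimal sketch of the Ray approach, assuming a Ray cluster is already running and each task fits on a single GPU; `process_on_single_gpu` and `tasks` are hypothetical names:

import ray

ray.init(address="auto")  # Connect to the existing Ray cluster

@ray.remote(num_gpus=1)  # The scheduler assigns each call one whole GPU somewhere in the cluster
def run_inference(task):
    return process_on_single_gpu(task)  # hypothetical single-GPU model call

# results = ray.get([run_inference.remote(t) for t in tasks])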

Lifehack 18

Using torch.jit to compile models can significantly speed up their execution, or serve as a step toward converting them to ONNX. Use torch.jit.trace() or torch.jit.script() to convert the model into an optimized format that runs faster, especially on repeated calls. This is particularly useful if you frequently call the same model for different tasks.

import torch

# Example of using torch.jit to trace a model
model = ...  # model
example_input = ...  # sample input suitable for your model
traced_model = torch.jit.trace(model, example_input)

# Now you can use traced_model instead of the original model
output = traced_model(example_input)

Lifehack 19

Use profiling tools such as torch.profiler to analyze the performance of your model and identify bottlenecks. This will help you determine which parts of the code need optimization and how to better allocate resources. For example, you can profile the execution time of various operations and identify those that take the most time.

import torch
from torch.profiler import profile, record_function

with profile(profile_memory=True) as prof:
    with record_function("model_inference"):
        output = model(input_data)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

And here we come to the end of our article with 19 lifehacks! Although this is not a round number, I feel that one more is missing. So please share your 20th lifehack in the comments to complete this list.

Lyrical Conclusion

I have a dream — to see 4096 stars on GitHub for my project. I believe that there should be more projects from Russian-speaking developers at the top of GitHub, and your support gives me strength and inspiration to continue. It allows me to improve the code, develop new approaches, and share experiences. If you liked my work, support the project — and I will definitely continue to create useful materials and share new ideas. And also tell us about your projects with neural networks on GitHub 🖐 — in the comments!
