Autoencoders in simple terms

15:30
18.10.2024
Squirrelfm

Autoencoders are a basic machine learning technique on which more complex models are built, including latent diffusion models such as Stable Diffusion. So what is an autoencoder?

Autoencoders are called autoencoders because they automatically learn to encode data into a more compact or low-dimensional representation and then decode it back into its original form. This term can be broken down as follows:

Auto: means that the process is automatic. The model learns by itself, without the need for explicit labels or human intervention to extract features. During training, it automatically finds the best way to represent the data in a low-dimensional form.
Encoder: the first part of the model, the encoder, compresses or encodes the input data into a smaller, compressed form. This stage involves reducing the dimensionality of the input data, effectively learning a more compact version of the data.
Decoder: the second part of the model, the decoder, attempts to restore the original input data from the encoded representation. The goal is to ensure that the output data is as similar as possible to the original input, showing that the encoding preserves the main features.

So, first of all, an autoencoder is a type of neural network used for unsupervised learning. More specifically, it is a network that can encode and decode data, much as a ZIP archiver can compress and decompress files. In machine learning, it is used for dimensionality reduction or data compression, as well as for removing noise from images (we will talk about this later).

However, it does this more intelligently than a ZIP archiver. It is able to understand the most important features of the data (so-called latent, or hidden features) and remembers them instead of all the data, in order to then reconstruct something close to the original from an approximate description. In images, for example, it can remember the outlines of objects or the relative position of objects to each other. This allows for interesting lossy compression. It works roughly as follows:

[Figure: diagram of an autoencoder compressing input data into a compact representation and reconstructing the output]

Strictly speaking, each latent feature is stored as a single fixed number, so the process is closer to the following:

Here it should be noted that this is just an example for understanding, and that in reality what the network recognizes will be different. Also, no octopuses were harmed in these experiments.

This approach is important for diffusion models because it allows them to work in a lower-dimensional space (the so-called latent space), which is much faster than working directly with high-resolution images. That is, instead of performing the denoising process directly on the pixels, the image is first compressed into the latent space using an autoencoder, and the diffusion process takes place in this lower-dimensional space. After diffusion, the decoder restores the high-resolution image from the latent representation. If you think about it, this makes perfect sense. For the diffusion model to work, it only needs to know the main features of the original image, not all the fine details, so why waste time and computational resources on them?
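To get a feel for how much smaller the latent space can be, here is a quick back-of-the-envelope calculation. It assumes the shapes used by Stable Diffusion v1 (a 512x512 RGB image mapped to a 64x64 latent with 4 channels); other models use different sizes, so treat the numbers as an illustration only.

```python
from math import prod

# Assumed shapes (Stable Diffusion v1): height, width, channels.
image_shape = (512, 512, 3)    # RGB image in pixel space
latent_shape = (64, 64, 4)     # compressed representation in latent space

pixel_values = prod(image_shape)     # numbers the diffusion model would process in pixel space
latent_values = prod(latent_shape)   # numbers it processes in latent space
reduction = pixel_values / latent_values

print(f"pixel space:  {pixel_values} values")
print(f"latent space: {latent_values} values")
print(f"reduction:    {reduction:.0f}x")
```

Under these assumptions the diffusion process handles 48 times fewer values per image than it would in pixel space.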

The idea behind the autoencoder is a very simple trick. If we artificially limit the number of nodes in a hidden layer of the network and train it to reconstruct its own input, the network is forced to learn a compressed representation of the data, simply because it does not have enough nodes to remember every feature. It has to discard the least significant ones. This is, of course, only possible if there is some structure in the data (for example, correlations between input features), because then this structure can be learned and used when the data passes through the bottleneck of the network.

Creating a bottleneck in the neural network

This network can be trained by minimizing the reconstruction error, which measures the difference between the original input and the reconstruction produced by the decoder.
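The bottleneck idea can be sketched in a few lines of NumPy. This is a deliberately minimal example and not how production autoencoders are built: it assumes a purely linear encoder and decoder, toy data generated from 2 hidden factors, mean squared error as the reconstruction error, and plain gradient descent. All names (`W_enc`, `W_dec`, `encode`, `reconstruct`) and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with structure: 8 correlated features generated from 2 hidden factors.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = factors @ mixing

n_in, n_hidden = 8, 2  # bottleneck: 8 -> 2 -> 8
W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))  # decoder weights

def encode(x):
    return x @ W_enc

def reconstruct(x):
    return encode(x) @ W_dec

def reconstruction_error(x):
    # Mean squared difference between the input and its reconstruction.
    return float(np.mean((reconstruct(x) - x) ** 2))

initial_error = reconstruction_error(X)

learning_rate = 0.05
for step in range(2000):
    code = encode(X)                 # compressed representation, shape (200, 2)
    error = code @ W_dec - X         # reconstruction minus original
    grad_out = 2 * error / error.size            # gradient of the MSE w.r.t. the output
    grad_W_dec = code.T @ grad_out               # backpropagate into the decoder
    grad_W_enc = X.T @ (grad_out @ W_dec.T)      # backpropagate into the encoder
    W_dec -= learning_rate * grad_W_dec
    W_enc -= learning_rate * grad_W_enc

final_error = reconstruction_error(X)
print(f"error before training: {initial_error:.4f}")
print(f"error after training:  {final_error:.4f}")
```

Because the toy data really is generated from 2 underlying factors, a 2-node bottleneck is enough to reconstruct it well. With unstructured random data the same network would have nothing to exploit, and the error would stay high.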

Sparse autoencoders

We can constrain the network by limiting the number of neurons that can activate simultaneously. This will essentially force the network to use separate hidden layer nodes for specific features of the input data (an idea somewhat similar to how different areas of the brain process different types of information).

Sparse autoencoders apply regularization methods that encourage hidden units (neurons) in the network to maintain a certain level of sparsity, meaning only a small portion of them should be active (i.e., have a non-zero output) at any given time. One widely used type of regularization is described below.

Regularization using KL divergence:

A common way to regularize sparse autoencoders is to impose a sparsity constraint on the hidden-unit activations using Kullback-Leibler (KL) divergence: the average activation of each hidden unit is pushed toward a small target value.
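As a rough sketch, the KL sparsity penalty can be computed like this. The example assumes hidden activations in the range (0, 1), for instance sigmoid outputs, and treats each unit's average activation as the parameter of a Bernoulli distribution. The function name `kl_sparsity_penalty` and the target sparsity `rho=0.05` are illustrative choices, not values taken from this article.

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """KL divergence between a target sparsity rho and the observed
    average activation of each hidden unit, summed over units.

    activations: array of shape (n_samples, n_hidden), values in (0, 1).
    """
    # Average activation of each hidden unit over the batch.
    rho_hat = np.clip(activations.mean(axis=0), eps, 1 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return float(kl.sum())

# Units that are active about 5% of the time on average incur almost no penalty...
sparse_penalty = kl_sparsity_penalty(np.full((10, 4), 0.05))
# ...while units that are active half of the time are penalized.
dense_penalty = kl_sparsity_penalty(np.full((10, 4), 0.5))
print(sparse_penalty, dense_penalty)
```

In training, this penalty would typically be multiplied by a weighting factor and added to the reconstruction error, so the network has to balance accurate reconstruction against keeping most hidden units quiet.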

Denoising Autoencoders

Another idea is to add noise to the original image and feed the noisy version to the network as input, while computing the error against the clean original. The model thus learns to remove noise from the image and, as a side effect, can no longer simply memorize the input data, because the input and the target output do not match. Here is what the training process looks like:

This, as mentioned above, forces the model to memorize only important features and ignore noise (and other insignificant details).
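The key detail of the denoising setup is which tensor goes where: the noisy version is the input, and the clean version is the target. A minimal sketch, assuming Gaussian noise and mean squared error (both are illustrative choices, as are the helper names `add_gaussian_noise` and `denoising_loss`):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(clean, noise_std=0.1):
    # Corrupt the clean data; the result is what the model sees as input.
    return clean + rng.normal(scale=noise_std, size=clean.shape)

def denoising_loss(model, clean, noise_std=0.1):
    noisy = add_gaussian_noise(clean, noise_std)
    reconstruction = model(noisy)             # the model only ever sees the noisy input
    # The error is measured against the clean original, not the noisy input.
    return float(np.mean((reconstruction - clean) ** 2))

# A "model" that just copies its input cannot reach zero loss here.
def copy_input(x):
    return x

clean_batch = rng.uniform(size=(16, 64))      # stand-in for 16 flattened 8x8 images
copy_loss = denoising_loss(copy_input, clean_batch, noise_std=0.1)
print(f"loss of a model that copies its input: {copy_loss:.4f}")
```

The copying model ends up with a loss of roughly the noise variance (about 0.01 here), which shows why plain memorization of the input stops being a winning strategy.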

Variational Autoencoders (VAE)

Variational autoencoders are the kind used in latent diffusion models. The main idea is that a regular autoencoder stores each latent feature as a single fixed value, whereas a variational autoencoder (VAE) describes each latent feature with a probability distribution. This allows the network to implement some interesting functions that I will describe in a separate article.
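To make the difference concrete, here is a small sketch of how a latent vector might be drawn in a VAE. It assumes the common setup where the encoder outputs a mean and a log-variance for each latent feature and the latent value is sampled from the corresponding Gaussian. The function name `sample_latent` and the example numbers are hypothetical, and the encoder itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Draw one latent vector from a Gaussian with the given per-feature
    mean and log-variance: z = mu + sigma * noise."""
    std = np.exp(0.5 * log_var)
    noise = rng.standard_normal(mu.shape)
    return mu + std * noise

# Pretend the encoder produced these values for one input (made-up numbers).
mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([0.0, -2.0, -4.0])

z1 = sample_latent(mu, log_var)
z2 = sample_latent(mu, log_var)
print(z1)
print(z2)  # a different sample for the same input
```

A regular autoencoder would return the same latent vector every time for the same input; here each call gives a slightly different point scattered around `mu`.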

Limitations
Since autoencoders learn to compress data by identifying patterns and relationships (i.e., correlations between input features) that emerge during training, they tend to reconstruct well only data similar to what they saw during training, and to perform poorly on data that looks different.

Also, the ability of autoencoders to compress data is rarely used on its own, because they are usually less efficient than hand-designed algorithms built for specific types of data, such as sound or images.

Although autoencoders can be used for text encoding, they are less frequently applied compared to more modern architectures such as transformers (BERT, GPT, etc.), because autoencoders may struggle with processing complex language structures or long sequences, especially without mechanisms like attention that help capture long-range dependencies.

Conclusion
Autoencoders are fundamental building blocks in the field of machine learning and AI, offering a versatile approach to tasks such as data compression, dimensionality reduction, and noise removal. By learning to encode data into compact, low-dimensional representations and then decode them back, autoencoders can effectively capture important features of the input data while discarding less significant details. This capability makes them valuable in applications ranging from image processing to data preprocessing, as well as components of more complex architectures such as latent diffusion models.