
Understanding Autoencoders: Model Configuration

This is the third article in a series exploring autoencoders. The only prerequisite is a basic knowledge of neural networks. Although the code used for this article is available here, it is not necessary to read it in order to follow along.

In the previous article, we tinkered a bit with the hyperparameters of a very basic linear MNIST autoencoder. In this article, we will dig deeper into the model itself and see if we can improve it. Before we blindly adjust every parameter to minimize the loss function, we need to understand what we are trying to achieve. If all that mattered were for the model to produce an output identical to the input, we could simply make the latent vector the same size as the input (28x28 = 784 values), allowing the model to pass the vector through unmodified and achieve zero loss. However, our model wouldn't have learned anything about the dataset; it would only know how to copy the input to the output.

What we actually want the autoencoder to do is compress the input and thereby learn about structures within the input data. What do I mean by this? It's not so complicated.

Assume I show you the following image from the MNIST dataset:

This is a 28x28 pixel image, where the color of each pixel is represented by 8 bits (0 to 255), with 0 being black and 255 white. This amounts to a total of 28x28x8 = 6272 bits of information per image. I bet that if I asked you to describe the content of the image simply by sending me a single digit (~4 bits), I would be able to recreate the image fairly well. It wouldn't be perfect, but the loss would be relatively low. You might start by simply sending me the digit "9." If we were more strategic, we might try to encode other characteristics about the image, such as how bold the font is or how tilted the number is. In this scenario, you and I are acting as an autoencoder capable of compressing the image from 6272 bits to just 4 bits. The small amount of information we use to communicate is the latent vector of the autoencoder. Let's see if we can get our AI to do something similar.

In previous articles, we used a latent vector size of 64. We will now reduce that to a size of 2:
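A minimal sketch of what such a model might look like in PyTorch (class and layer names here are illustrative, not the article's actual code):

```python
import torch
from torch import nn

LATENT_SIZE = 2  # reduced from 64 in the previous articles

class LinearAutoencoder(nn.Module):
    """Single linear layer down to the latent vector, single layer back up."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(28 * 28, LATENT_SIZE)
        self.decoder = nn.Linear(LATENT_SIZE, 28 * 28)

    def forward(self, x):
        x = x.view(x.size(0), -1)        # flatten each 28x28 image to 784 values
        latent = self.encoder(x)         # compress to 2 values
        out = self.decoder(latent)       # reconstruct 784 values
        return out.view(-1, 1, 28, 28)   # reshape back to image form
```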

As in the previous article, we use just a single layer in the encoder and a single output layer in the decoder. Let’s see how it performs:

Yikes, it seems like we need some more layers. Let's add another linear layer with 40 neurons in both the encoder and decoder.
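The deeper variant might be sketched like this (again, illustrative rather than the article's exact code), with a 40-neuron hidden layer in both halves:

```python
from torch import nn

# Encoder: 784 -> 40 -> 2
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 40),
    nn.Linear(40, 2),
)

# Decoder: 2 -> 40 -> 784, reshaped back into an image
decoder = nn.Sequential(
    nn.Linear(2, 40),
    nn.Linear(40, 28 * 28),
    nn.Unflatten(1, (1, 28, 28)),
)
```

Note that stacked linear layers without activation functions in between still collapse to a single linear mapping, which is part of why the results below remain subpar.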

It seems like it's training, but the results still look subpar. To truly capture the complexity of an image, we need to introduce convolutional layers. We will also add activation functions; these introduce the non-linearity that the model needs to learn more than a linear mapping. We will use both the ReLU and the Sigmoid function. Finally, we will add some batch normalization between the convolutional layers. Let's see what we get now:
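One possible shape for such a convolutional autoencoder is sketched below. The channel counts and kernel sizes are assumptions for illustration; the article's actual architecture may differ:

```python
from torch import nn

# Encoder: two strided conv layers with batch norm and ReLU,
# then a linear layer down to the 2-value latent vector.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 2),                               # latent vector
)

# Decoder: mirror image using transposed convolutions; a final
# Sigmoid keeps output pixels in [0, 1].
decoder = nn.Sequential(
    nn.Linear(2, 32 * 7 * 7),
    nn.Unflatten(1, (32, 7, 7)),
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 7x7 -> 14x14
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 14x14 -> 28x28
    nn.Sigmoid(),
)
```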

Alright, now we are getting somewhere. Some numbers still seem a bit confused, but overall it's okay.

Let's peek inside the autoencoder and see if we can learn something. Since the latent vector consists of just two floating-point numbers, we can skip the encoder entirely and feed hand-picked latent vectors straight into the decoder. In the image below, I've generated 100 images this way. Since the latent vector consists of two values, we can think of them as x and y coordinates. The top-left image is generated with [-2.0, 2.0] and the bottom-right with [2.0, -2.0].
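The sweep can be sketched as follows. Here `decoder` stands in for the trained decoder; a random untrained one is used only to show the mechanics:

```python
import torch
from torch import nn

# Stand-in for the trained decoder (maps a 2-value latent vector to a 28x28 image).
decoder = nn.Sequential(nn.Linear(2, 28 * 28), nn.Unflatten(1, (1, 28, 28)))

# Build a 10x10 grid of latent vectors spanning [-2, 2] on both axes.
steps = torch.linspace(-2.0, 2.0, 10)
latents = torch.cartesian_prod(steps, steps)  # shape (100, 2)

# Decode each latent vector into an image.
with torch.no_grad():
    images = decoder(latents)  # shape (100, 1, 28, 28)
```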

Why this range of -2.0 to 2.0? The default data type in PyTorch is a 32-bit floating point, meaning that, theoretically, the AI could choose any number from about -3.4028235×10^38 to 3.4028235×10^38. It turns out the AI doesn't use this range because it doesn't need to: the weights are initialized around 0, and there is plenty of room to store the information there.
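You can ask PyTorch for these limits directly:

```python
import torch

# Largest representable 32-bit float, as quoted above.
print(torch.finfo(torch.float32).max)  # 3.4028234663852886e+38
```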

In future articles, we will explore different ways of forming this latent vector and how to use it to generate new data.
