Understanding Autoencoders: Introduction

This is the first article in a series exploring autoencoders. The only prerequisite is basic knowledge of neural nets. The code used for this article is available here, but you don't need to read it in order to follow along.

An autoencoder is a type of artificial intelligence (AI) that takes some kind of input (such as images, audio or video), converts it to an array of floating point numbers and then reverses the process by converting that array back into something as close as possible to the original input.

At first glance this may seem really stupid. Why would someone waste their precious computing resources only to end up with a slightly worse image than the input? It turns out that this can be used for multiple things, including noise reduction, compression and generative applications. We will explore these, but first we need to understand the basics. By the way, this kind of AI falls under the category of unsupervised learning, meaning that we don't train it on a question-answer style dataset. We just feed it a bunch of data and hope it figures out by itself what's going on.

So how does it work and what about this array?

An autoencoder consists of two components, an encoder and a decoder. The encoder encodes the input to something that is called a latent vector. The decoder reconstructs the original input based on that latent vector.

Typically, the encoder and the decoder are trainable neural nets. The best type of neural net depends on the type of data. In this article we will start with a very simple linear neural net to encode the classic MNIST dataset.

The MNIST dataset is basically the "hello world" of AI programming. It consists of 70 000 grayscale images of handwritten digits. The dataset also contains a label for each image, but for our autoencoder we are only interested in the images. We will use 60 000 for training and 10 000 for testing.

Both the encoder and decoder will be extremely simple, each containing only one input layer and one output layer. The input layer of the encoder and the output layer of the decoder will both have 784 neurons. Why? It's simply due to the fact that each image is made up of 28x28 pixels. Each pixel gets its own neuron, meaning we need 28x28=784 of them to represent an image. For now, we set the size of the latent vector to 64, so the output layer of the encoder and the input layer of the decoder each have 64 neurons. Later on we will look deeper into different sizes and see how they affect the result.

And that's it for the model and the input data. Yes, as simple as that, no hidden layers, no activation function, no normalization of the input data. We will explore all that later but just to get something working this is all we need.
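To make the shapes concrete, here is a minimal sketch of the forward pass in plain Python. It is not the code from the article: the helper names (make_linear, linear), the weight initialization, the bias terms and the random stand-in image are all assumptions made for illustration.

```python
import random

INPUT_SIZE = 28 * 28   # 784 pixels per MNIST image
LATENT_SIZE = 64       # size of the latent vector used in this article


def make_linear(n_in, n_out, rng):
    """Create weights and biases for one linear layer (n_in -> n_out)."""
    scale = 1.0 / n_in ** 0.5  # assumed init range, not taken from the article
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out     # assumed: the article does not mention biases
    return weights, biases


def linear(layer, x):
    """Apply y = W x + b, with no activation function."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


rng = random.Random(0)
encoder = make_linear(INPUT_SIZE, LATENT_SIZE, rng)
decoder = make_linear(LATENT_SIZE, INPUT_SIZE, rng)

# A stand-in "image": 784 raw pixel values in 0..255 (not normalized).
image = [rng.randint(0, 255) for _ in range(INPUT_SIZE)]

latent = linear(encoder, image)            # 784 -> 64
reconstruction = linear(decoder, latent)   # 64 -> 784

print(len(image), len(latent), len(reconstruction))  # 784 64 784
```

With untrained weights the reconstruction is just noise. The point is only that an image goes in as 784 numbers, is squeezed into 64, and comes back out as 784.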

Before we can use the model we need to train it, and for that we need an optimizer. Two of the most common optimizers are Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD). Adam is more forgiving, more robust and performs well on a wide range of problems, while SGD is less forgiving but simpler and cheaper per update step. We will experiment with both optimizers, but for now we will use the more robust Adam with a learning rate of 1e-3.
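To get a feel for how the two optimizers differ, the sketch below applies one SGD step and one Adam step to a single weight. The toy function f(w) = (w - 3)^2 and the helper names are made up for illustration. The values for beta1, beta2 and eps are the commonly used Adam defaults, which is an assumption, since the article only specifies the learning rate.

```python
def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return w - lr * grad


def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; state holds the running averages."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (v_hat ** 0.5 + eps)


def grad_f(w):
    """Gradient of the toy function f(w) = (w - 3)^2."""
    return 2 * (w - 3.0)


lr = 1e-3
w_sgd = sgd_step(0.0, grad_f(0.0), lr)

state = {"t": 0, "m": 0.0, "v": 0.0}
w_adam = adam_step(0.0, grad_f(0.0), state, lr)

print(w_sgd, w_adam)
```

The SGD step is proportional to the gradient (here 0.006), while the first Adam step is roughly the size of the learning rate (here about 0.001) no matter how large the gradient is. This normalization is one reason Adam tends to be less sensitive to the scale of the problem.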

Finally, we need to define a loss function to tell the optimizer in which direction we want to go. For this model I will use Mean Squared Error (MSE): the average of the squared differences between each input pixel and its reconstruction. It's cheap to compute and very simple to understand. In future articles we will explore other loss functions.
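As a small worked example, MSE can be computed by hand like this. The four "pixel" values are invented for illustration, and the function name mse is just a label for this sketch.

```python
def mse(original, reconstruction):
    """Mean of the squared element-wise differences between two vectors."""
    if len(original) != len(reconstruction):
        raise ValueError("both vectors must have the same length")
    squared_errors = [(o - r) ** 2 for o, r in zip(original, reconstruction)]
    return sum(squared_errors) / len(squared_errors)


original = [0.0, 128.0, 255.0, 64.0]
reconstruction = [2.0, 126.0, 251.0, 64.0]

# Differences are -2, 2, 4 and 0, so the squares are 4, 4, 16 and 0.
print(mse(original, reconstruction))  # 6.0
```

A perfect reconstruction gives a loss of 0, and because the errors are squared, a few large mistakes are punished harder than many small ones.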

We are almost there now. The final thing we need to define before we train the model is the batch size. The batch size is simply how many images we load and train on at the same time. We will explore different batch sizes in the future, but for now we will go with a batch size of 64.
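The arithmetic behind batching can be sketched as follows. The helper batch_ranges is hypothetical, and whether the article's code shuffles the data or drops the final short batch is not stated, so this sketch simply keeps it.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, stop) index pairs that cover n_items in consecutive batches."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


batches = list(batch_ranges(60_000, 64))

print(len(batches))   # 938
print(batches[0])     # (0, 64)
print(batches[-1])    # (59968, 60000)
```

Since 60 000 is not a multiple of 64, one pass over the training data takes 938 optimizer steps: 937 full batches plus a final batch of 32 images.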

Let's train it and plot the loss function along the way:
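For readers who want to see the moving parts, here is a rough sketch of such a training loop in plain Python. It is not the article's code. To keep it fast without any libraries it uses toy sizes (8 inputs and a latent vector of 3 instead of 784 and 64), synthetic data instead of MNIST, no bias terms, and hand-written gradients. The batch size, learning rate, optimizer and loss follow the choices made above, and the Adam constants are the commonly used defaults.

```python
import random

rng = random.Random(0)
N_IN, N_LATENT = 8, 3          # toy sizes; the article uses 784 and 64
BATCH_SIZE, LR = 64, 1e-3      # values taken from the article

# Synthetic stand-in data: every sample is a random mix of N_LATENT patterns.
patterns = [[rng.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_LATENT)]

def make_sample():
    mix = [rng.uniform(-1, 1) for _ in range(N_LATENT)]
    return [sum(m * p[j] for m, p in zip(mix, patterns)) for j in range(N_IN)]

data = [make_sample() for _ in range(512)]

def init(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

enc = init(N_LATENT, N_IN)     # latent = enc @ x
dec = init(N_IN, N_LATENT)     # x_hat  = dec @ latent

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def loss_and_grads(batch):
    """Return the MSE over a batch and its gradients w.r.t. enc and dec."""
    g_enc = [[0.0] * N_IN for _ in range(N_LATENT)]
    g_dec = [[0.0] * N_LATENT for _ in range(N_IN)]
    scale = 2.0 / (len(batch) * N_IN)      # derivative of the mean of squares
    total = 0.0
    for x in batch:
        z = matvec(enc, x)                 # encode
        x_hat = matvec(dec, z)             # decode
        err = [h - t for h, t in zip(x_hat, x)]
        total += sum(e * e for e in err)
        # Backpropagate the error through the decoder to the latent vector.
        g_z = [sum(err[i] * dec[i][k] for i in range(N_IN))
               for k in range(N_LATENT)]
        for i in range(N_IN):
            for k in range(N_LATENT):
                g_dec[i][k] += scale * err[i] * z[k]
        for k in range(N_LATENT):
            for j in range(N_IN):
                g_enc[k][j] += scale * g_z[k] * x[j]
    return total / (len(batch) * N_IN), g_enc, g_dec

def zeros_like(p):
    return [[0.0] * len(row) for row in p]

params = [enc, dec]
m = [zeros_like(p) for p in params]        # Adam: running mean of gradients
v = [zeros_like(p) for p in params]        # Adam: running mean of squared gradients
beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 0

loss_before, _, _ = loss_and_grads(data)
history = []                               # one loss value per step, for plotting
for epoch in range(50):
    rng.shuffle(data)
    for start in range(0, len(data), BATCH_SIZE):
        loss, g_enc, g_dec = loss_and_grads(data[start:start + BATCH_SIZE])
        history.append(loss)
        t += 1
        for p, g, mp, vp in zip(params, [g_enc, g_dec], m, v):
            for r in range(len(p)):
                for c in range(len(p[r])):
                    mp[r][c] = beta1 * mp[r][c] + (1 - beta1) * g[r][c]
                    vp[r][c] = beta2 * vp[r][c] + (1 - beta2) * g[r][c] ** 2
                    m_hat = mp[r][c] / (1 - beta1 ** t)
                    v_hat = vp[r][c] / (1 - beta2 ** t)
                    p[r][c] -= LR * m_hat / (v_hat ** 0.5 + eps)
loss_after, _, _ = loss_and_grads(data)

print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The history list holds the values one would plot against the step number to get a loss curve. In practice a deep learning framework computes the gradients and the Adam update for you, so the loop body shrinks to a handful of lines.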

Alright! It seems to be learning. Let's feed it some images from the testing data and see how it performs.

Not perfect, but definitely impressive for such a simple model. That's it for now. In the next article we will tinker with the hyperparameters to get a feel for what's important and what's not.

Looking for the code? It's here.

