This is the second article in a series exploring autoencoders. The only prerequisite is a basic knowledge of neural networks. The code used for this article is available here, but it is not necessary to read it in order to follow along.
In the previous article we created a super simple autoencoder and made a lot of assumptions regarding the hyperparameters. In this article we will tinker with them and see how they affect the result.
Total number of training steps
In the previous article we only iterated over the training dataset once - one epoch. Let's increase the number of epochs from 1 to 5 and see if it makes any difference. I will plot the performance of the model measured on the test data every 100 batches to make sure we aren't overfitting.
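A minimal sketch of such a training loop is shown below. It assumes the `Autoencoder` class and the `train_loader`/`test_loader` data loaders from the previous article; the names are illustrative and the exact code may differ.

```python
import torch
import torch.nn as nn

device = torch.device("cpu")
model = Autoencoder().to(device)   # autoencoder class from the previous article (assumed)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 5
step = 0
for epoch in range(num_epochs):
    for images, _ in train_loader:               # labels are ignored by the autoencoder
        images = images.to(device)
        loss = criterion(model(images), images)  # reconstruction error (MSE)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # every 100 batches, check the loss on the test set to watch for overfitting
        if step % 100 == 0:
            model.eval()
            with torch.no_grad():
                test_loss = sum(
                    criterion(model(x.to(device)), x.to(device)).item()
                    for x, _ in test_loader
                ) / len(test_loader)
            model.train()
            print(f"step {step}: train loss {loss.item():.5f}, test loss {test_loss:.5f}")
        step += 1
```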
Perfect, no overfitting (this would have shown up as a growing discrepancy between the training and test loss). Furthermore, we notice that no improvement was achieved after roughly 2000 steps with batch size 64, which corresponds to about 2 epochs. The final loss (MSE) was 0.00916.
Batch Size
Let's see how the batch size affects the training time and the final loss. The batch size is simply how many images we feed into the model at once. Since we saw no further improvement beyond 2 epochs with 64 images per batch, we will use 2 epochs as the stopping criterion. Let's do some training runs on my laptop:
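The sweep itself is just a loop over batch sizes, roughly as sketched here. `train_dataset` is assumed from the previous article, `train_for_two_epochs` is a hypothetical helper wrapping the training loop above, and the list of batch sizes is illustrative.

```python
import time
from torch.utils.data import DataLoader

results = {}
for batch_size in [8, 16, 32, 64, 128, 256, 512]:            # illustrative set of batch sizes
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    model = Autoencoder()                                     # fresh model for every run
    start = time.time()
    final_loss = train_for_two_epochs(model, train_loader)   # hypothetical helper
    results[batch_size] = (time.time() - start, final_loss)

for batch_size, (seconds, loss) in results.items():
    print(f"batch size {batch_size:4d}: {seconds:6.1f} s, final loss {loss:.5f}")
```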
It seems like the optimal batch size for minimizing the training time over 2 epochs is 64; the final loss also seems to go up beyond that point. The training is however pretty slow overall. Let's see if we can improve it by switching to CUDA and running on my GPU.
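In PyTorch this is only a matter of moving the model and each batch to the GPU, something like:

```python
import torch

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Autoencoder().to(device)   # move the model's parameters to the GPU

# inside the training loop, every batch has to be moved to the same device:
# images = images.to(device)
```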
A dramatic decrease in training time, as expected. As for the increase in loss at larger batch sizes, could it be that 2 epochs simply isn't enough to train the model with bigger batches? Let's change the stopping criterion to stop when the model's loss stops going down and see what happens.
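One simple way to express "stop when the loss stops going down" is a patience counter, as sketched below. The helper functions and the patience value are hypothetical; the exact criterion used for the plots may differ.

```python
best_loss = float("inf")
patience = 5        # evaluations without improvement before stopping (illustrative value)
bad_evals = 0

while bad_evals < patience:
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    test_loss = evaluate(model, test_loader)          # hypothetical helper returning mean test MSE
    if test_loss < best_loss - 1e-5:                  # small tolerance to ignore noise
        best_loss = test_loss
        bad_evals = 0
    else:
        bad_evals += 1
```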
It turns out that larger batch sizes aren't necessarily better when we take training time into account. There is a big jump in training time at batch size 256, which could be a limitation of my GPU (GeForce MX150). To generate this graph I subtracted the time it took to evaluate the model on the test dataset. In terms of loss, however, larger batch sizes do yield lower values: a batch size of 8 ends up with about 15% higher loss than 256 and 512.
Learning Rate
So far the learning rate was arbitrarily set to 1e-3. Let's experiment with different learning rates and see what happens. For this test we simply calculate and plot the loss on the test data (the test loss) during training for each learning rate.
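The experiment boils down to repeating the training with a fresh model for each rate, roughly as below. The list of rates is illustrative and `train_and_record` is a hypothetical helper, not part of the article's code.

```python
learning_rates = [1e-5, 5e-5, 1e-4, 1e-3, 5e-3, 1e-2]   # candidate rates (values illustrative)
test_loss_curves = {}

for lr in learning_rates:
    model = Autoencoder()                                # fresh model for every rate
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # train_and_record is a hypothetical helper that trains the model and returns
    # the test loss measured at regular intervals during training
    test_loss_curves[lr] = train_and_record(model, optimizer, train_loader, test_loader)
```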
As you can see, the learning rate has a pretty big impact on how fast and how well the model converges. Rates below 1e-4 (green, orange and blue) seem to be too slow and don't learn as fast as the others. Learning rates of 5e-3 and higher (brown and pink) seem to show some weird behavior. Let's zoom in a bit:
The 5e-3 (brown) and 1e-2 (pink) learning rates lead to instabilities. While both 1e-3 (purple) and 1e-4 (red) converge in a stable manner, 1e-3 is much faster here. Again, these curves apply to this model on this dataset only; there is no one-size-fits-all learning rate.
Optimization Algorithm
Now that we have a feel for different learning rates, let's see what happens when we swap the Adam optimizer for SGD.
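The swap itself is a one-line change (the learning rate here is just an example):

```python
# before: Adam, which adapts the step size per parameter
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# after: plain stochastic gradient descent with a fixed step size
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```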
As you can see, the same learning rates behave quite differently under SGD than under Adam, since Adam adapts the effective step size of each parameter while plain SGD does not.
Alright, that looks better. Although not as good as Adam, the curves do converge. I tried even higher learning rates, but they ended up "exploding". SGD is known for being more sensitive and in some cases more efficient; in this case it isn't.
In the next article we will see if we can improve the model itself.