Neural Networks: Optimization Algorithms
Introduction:
When training a neural network, we want to normalize/standardize our data ahead of time before feeding it into the training process, because raw input features can come in very different ranges. So we begin by scaling these features in the preprocessing step using the first optimization technique discussed in this article: feature scaling.
Feature scaling:
The idea behind it is very simple: we need our features on the same scale. For example, imagine a feature such as the distance walked by an average person annually; its values can span a very large range of kilometers. So instead of using values in kilometers, we scale them into a much smaller range like [0, 1], where 0 is the smallest distance walked and 1 is the largest.
The two most common types of feature scaling are :
- Min-Max normalization: which scales feature values into [0, 1].
- Standardization: which represents values as standard deviations from the mean, typically falling roughly in the range [-3, 3].
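As a quick illustration, here is a minimal NumPy sketch of both techniques; the example array x is an illustrative assumption, not data from the article.

```python
import numpy as np

x = np.array([1.2, 3.5, 0.4, 7.8, 2.1])  # e.g. distances walked, in thousands of km

# Min-Max normalization: rescale values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: express values as standard deviations from the mean
x_standard = (x - x.mean()) / x.std()

print(x_minmax)    # every value now lies in [0, 1]
print(x_standard)  # values typically fall roughly in [-3, 3]
```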
One of the main reasons we scale our features is that gradient descent struggles to converge without proper scaling.
In this graph we see the loss contours for unnormalized vs. normalized features. The red dot at the center of both plots represents the bottom of our loss surface, where we eventually want gradient descent to take us. The steps taken by gradient descent are represented by the blue line, and we can easily notice that on the unnormalized surface the steps oscillate instead of heading directly to the minimum as they do on the normalized surface; each step also makes far less progress toward the global minimum.
Batch normalization:
As we discussed earlier, normalizing our data ahead of time is crucial for neural network performance, but that alone isn't flawless, as some issues can still arise even with normalized inputs. To put everything in perspective, so far we have only normalized the data before passing it to the neural network; we want to do the same at every layer by normalizing the outputs of the activation functions, since they become the inputs of the next layer, and so on. In other words, we want to normalize each layer's inputs so that the following layers become robust to changes in the previous layers ("covariate shift").
Here we introduce three additional quantities: ϵ, a very small number that prevents division by zero, and γ and β, which are learnable parameters used across all batches.
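A minimal NumPy sketch of the batch normalization step is shown below; the variable names (z, gamma, beta, eps) follow common convention and are illustrative assumptions rather than code from the article.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize a mini-batch of layer outputs z (shape: batch_size x units)."""
    mu = z.mean(axis=0)                    # per-unit mean over the mini-batch
    var = z.var(axis=0)                    # per-unit variance over the mini-batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * z_hat + beta            # gamma and beta are learnable scale/shift

# Usage: normalize the outputs of a layer with 4 units over a mini-batch of 32 examples
z = np.random.randn(32, 4) * 5 + 3
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
```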
Mini batch gradient descent:
So far we have normalized our features at the preprocessing step and at each layer of our neural network. Now it's time to use the variant of gradient descent that pairs naturally with the batch norm we just applied. Mini-batch gradient descent is known as the most efficient variant of gradient descent; it works by splitting a large training set into mini-batches to speed up training. Let's say we have a training set of 5,000,000 examples. With plain gradient descent we have to process the entire training set to take one step, then process it all again to take another step, and so on; that takes a very long time on millions of training examples. It turns out training can be much faster if we let gradient descent make some progress before it has seen the entire training set. We do this by using mini-batches of, say, 1,000 examples, so we end up with 5,000 batches of 1,000 examples each (5,000 * 1,000 = 5,000,000). For every mini-batch we run forward propagation, compute the cost, then back-propagate to compute the gradients and update the weights and biases. One pass over all of the mini-batches is called "1 epoch".
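As a rough sketch of that loop (not the article's code), where forward, compute_cost, and backward are hypothetical helpers standing in for your own forward pass, loss, and gradient computation:

```python
import numpy as np

def minibatch_gradient_descent(X, Y, params, forward, compute_cost, backward,
                               batch_size=1000, learning_rate=0.01, epochs=10):
    m = X.shape[0]  # number of training examples
    for epoch in range(epochs):                    # one epoch = one pass over all mini-batches
        permutation = np.random.permutation(m)     # shuffle the examples each epoch
        X_shuffled, Y_shuffled = X[permutation], Y[permutation]
        for start in range(0, m, batch_size):
            X_batch = X_shuffled[start:start + batch_size]
            Y_batch = Y_shuffled[start:start + batch_size]
            preds, cache = forward(X_batch, params)    # forward propagation
            cost = compute_cost(preds, Y_batch)        # cost on this mini-batch
            grads = backward(cache, Y_batch, params)   # back propagation
            for key in params:                         # gradient step per parameter
                params[key] -= learning_rate * grads[key]
    return params
```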
As we can see, mini-batch gradient descent is very noisy; we will discuss later the techniques used to smooth out these oscillations.
Gradient descent with momentum:
It is an algorithm that almost always works faster than plain gradient descent. It works by computing exponentially weighted averages of our gradients. As we saw in the previous graph, the up-and-down oscillations slow down gradient descent's journey to the minimum cost and prevent us from using larger learning rates.
By using exponentially weighted averages of the dW and db values, the oscillations in the vertical direction tend to average out to nearly zero, since they go in both the positive and negative directions. In the horizontal direction, however, all the derivatives point the same way, so the average there remains quite large. This lets the algorithm take a more direct path toward the local optimum and damps out the vertical oscillations, so it reaches the local optimum in fewer iterations.
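A minimal sketch of the momentum update, using the article's dW/db notation; the helper name, the velocity variables vW and vb, and the default values are illustrative assumptions:

```python
import numpy as np

def momentum_update(W, b, dW, db, vW, vb, learning_rate=0.01, beta=0.9):
    # Exponentially weighted averages of the gradients
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    # Step in the direction of the averaged (smoothed) gradients
    W = W - learning_rate * vW
    b = b - learning_rate * vb
    return W, b, vW, vb
```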
RMSProp optimization:
The RMSprop optimizer is similar to gradient descent with momentum: it restricts the oscillations in the vertical direction, so we can increase our learning rate and take larger steps in the horizontal direction, converging faster. The difference between RMSprop and gradient descent with momentum lies in how the gradient updates are computed. The following equations show how the updates are computed for RMSprop and for gradient descent with momentum. The value of momentum is denoted by beta and is usually set to 0.9. If you are not interested in the math behind the optimizer, you can just skip the following equations.
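The standard forms of both updates are written out below (reconstructed from the usual definitions rather than copied from the article's figure, with α denoting the learning rate and dW, db the current gradients).

Gradient descent with momentum:

$$v_{dW} = \beta\, v_{dW} + (1-\beta)\, dW, \qquad v_{db} = \beta\, v_{db} + (1-\beta)\, db$$

$$W := W - \alpha\, v_{dW}, \qquad b := b - \alpha\, v_{db}$$

RMSprop:

$$s_{dW} = \beta\, s_{dW} + (1-\beta)\, dW^{2}, \qquad s_{db} = \beta\, s_{db} + (1-\beta)\, db^{2}$$

$$W := W - \alpha\, \frac{dW}{\sqrt{s_{dW}} + \epsilon}, \qquad b := b - \alpha\, \frac{db}{\sqrt{s_{db}} + \epsilon}$$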
Adam optimization:
Adam is an alternative optimization algorithm that finds better neural network weights by running repeated cycles of "adaptive moment estimation." Adam extends stochastic gradient descent to solve non-convex problems faster while using fewer resources than many other optimization methods. It is most effective on very large data sets, keeping the gradients "tighter" over many learning iterations.
Adam combines the advantages of two other stochastic gradient techniques, Adaptive Gradients and Root Mean Square Propagation, to create a new learning approach to optimize a variety of neural networks.
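A minimal sketch of one Adam step, combining the momentum-style average with the RMSprop-style squared average; the parameter names and the defaults beta1 = 0.9, beta2 = 0.999 are common choices, not values taken from the article.

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style exponentially weighted average of the gradients
    vW = beta1 * vW + (1 - beta1) * dW
    # RMSprop-style exponentially weighted average of the squared gradients
    sW = beta2 * sW + (1 - beta2) * dW ** 2
    # Bias correction so the averages are not too small in early iterations t
    vW_corr = vW / (1 - beta1 ** t)
    sW_corr = sW / (1 - beta2 ** t)
    # Combined update
    W = W - learning_rate * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW
```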
Building upon the strengths of the previous methods, the Adam optimizer delivers much higher performance than the earlier optimizers and outperforms them by a wide margin, producing a better-optimized gradient descent. The plot shown below clearly depicts how Adam outperforms the other optimizers by a considerable margin, achieving lower training cost and higher performance.
Learning rate decay:
Learning rate decay is a technique for training modern neural networks. It starts training the network with a large learning rate and then slowly reduces/decays it until a local minimum is reached. It has been empirically observed to help both optimization and generalization.
In the first image, where we have a constant learning rate, the steps taken by our algorithm while iterating toward the minimum are so noisy that, after a certain number of iterations, it seems to wander around the minimum without actually converging.
But in the second image, where the learning rate is reduced over time (represented by the green line), learning is still relatively fast at first because the learning rate starts out large; as we approach the minimum, the learning rate gets smaller and smaller, so we end up oscillating in a tighter region around the minimum rather than wandering far away from it.
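One common decay schedule (a sketch of a standard formula, not one taken from the article) divides the initial learning rate by a factor that grows with the epoch number:

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    # The learning rate shrinks as the epoch number grows
    return initial_lr / (1 + decay_rate * epoch)

# Usage: start at 0.2 and decay over the epochs
for epoch in range(5):
    print(epoch, decayed_learning_rate(0.2, 1.0, epoch))
# epoch 0 -> 0.2, epoch 1 -> 0.1, epoch 2 -> ~0.067, ...
```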
Conclusion:
The choice of optimizer influences both the speed of convergence and whether it occurs. Several alternatives to the classic gradient descent algorithms have been developed in the past few years. Adaptive optimization methods such as Adam or RMSprop perform well in the initial portion of training, but they have been found to generalize poorly at later stages compared to stochastic gradient descent. Exploring optimization methods and hyperparameter values can help you build intuition for optimizing networks for your own tasks. During hyperparameter search, it’s important to understand intuitively the optimization’s sensitivity to learning rate, batch size, optimizer, and so on. That intuitive understanding, combined with the right method (random search or Bayesian optimization), will help you find the right model.