Regularization in machine learning

Taha Yazidi
6 min read · Jul 12, 2021

Introduction:

Machine learning is based on training and evaluation: we collect a fair amount of data and feed it into our deep neural network. First we split the data into two sets, a training set and a testing set, on which we respectively train and evaluate our model’s performance. Two main problems can occur during these training and testing steps. One is when our model performs well on the training set but is not able to give good results on the testing set; we call this over-fitting, or high variance. The other is when our model performs poorly even on the training data; this is described as an under-fitting, or high bias, model. Luckily we can overcome these issues by implementing one of the most widely used machine learning techniques: regularization. In this article we’ll discuss the what, the how, and the main types of this technique, with examples.

What is Regularization:

For the sake of explanation, I want you to imagine two neural networks: one with few parameters and layers, and another with much deeper layers and far more parameters. It may seem obvious that their performance is going to be different, but how? The model with few parameters is likely too simple to learn the problem, while the one with many parameters can fit the training data too well; in both cases these models don’t generalize well, and that is exactly what we want to avoid. To reduce generalization error we tend to use a large neural network and apply regularization techniques to it in order to keep certain weights small, thereby reducing their impact on the model’s predictions.

L2 & L1 regularizations:

Also called ridge and lasso regression respectively, these are the first regularization techniques we are going to tackle in this article. They add a penalty to the cost: instead of computing the cost with the loss function alone, we add a second component, known as the regularization term, which imposes a penalty on insignificant weights. The intuition is that a model with fewer (or smaller) weights is a simpler model and hence less prone to overfitting. Both techniques may look similar on the surface, but they are very different. Let’s see how they work.

For instance, let’s define a simple linear regression model Y with independent variables x1, …, xn to understand how L2 and L1 regularization work.

For this model, w and b represent the weights and the bias respectively, such that:

w = w1, w2, w3, w4 … wn

and b is a single bias term (a scalar).

And Ŷ is the predicted result such that

Ŷ = w1 x1 + w2 x2 + … + wn xn + b

Then we calculate the loss function

Loss = Error(Y, Ŷ)
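As a minimal sketch, here is how this prediction and loss could be computed with NumPy; the data, weights and bias below are made-up values, and the mean squared error stands in for Error(Y, Ŷ).

```python
import numpy as np

# Made-up data: 100 examples with 4 input features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # inputs x1 ... xn
w = np.array([0.8, -1.2, 0.05, 2.0])             # weights w1 ... wn
b = 0.5                                          # bias (a single scalar)
Y = X @ w + b + rng.normal(scale=0.1, size=100)  # noisy targets

# Y_hat = w1*x1 + w2*x2 + ... + wn*xn + b
Y_hat = X @ w + b

# Loss = Error(Y, Y_hat), here taken to be the mean squared error
loss = np.mean((Y - Y_hat) ** 2)
print(loss)
```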

And the loss computed with the L2 regularization term added becomes:

Loss = Error(Y, Ŷ) + λ (w1² + w2² + … + wn²)

λ is known as the regularization parameter; it is a hyperparameter we set manually, with λ > 0.

Similarly, with the L1 regularization term the loss becomes:

Loss = Error(Y, Ŷ) + λ (|w1| + |w2| + … + |wn|)
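As a small sketch, both penalized losses can be computed directly in NumPy; the weights, error value and λ below are made-up numbers, with error standing in for Error(Y, Ŷ), e.g. a mean squared error.

```python
import numpy as np

w = np.array([0.8, -1.2, 0.05, 2.0])  # model weights w1 ... wn (made up)
error = 0.42                          # Error(Y, Y_hat), e.g. an MSE value
lam = 0.1                             # regularization parameter lambda (> 0)

l2_loss = error + lam * np.sum(w ** 2)     # L2: penalize the sum of squared weights
l1_loss = error + lam * np.sum(np.abs(w))  # L1: penalize the sum of absolute weights
print(l2_loss, l1_loss)
```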

On the surface the two loss functions look very similar, the only visible difference being that L2 penalizes the sum of squared weights while L1 penalizes the sum of their absolute values. That difference is easy to spot, but the more important one lies in how the two techniques behave. Sparsity is the property of having many coefficients driven very close to zero, or exactly to zero; coefficients at zero contribute nothing to the prediction. Feature selection takes sparsity one step further: instead of merely confining coefficients near zero, it sets them exactly to zero, effectively removing the corresponding features from the model. In this context, L1 regularization tends to push some weights exactly to zero and can therefore be helpful for feature selection by eliminating unimportant features, whereas L2 regularization only shrinks weights toward zero and is not recommended for feature selection.
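To make the sparsity difference concrete, here is a small sketch comparing ridge (L2) and lasso (L1) regression with scikit-learn on a synthetic problem; the dataset and the alpha values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: only 3 of the 10 features are actually informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: can set weights exactly to zero

print("Ridge coefficients:", ridge.coef_.round(2))  # small but mostly non-zero
print("Lasso coefficients:", lasso.coef_.round(2))  # several exactly 0.0
```

The lasso model will typically zero out the uninformative features entirely, which is exactly the feature-selection behaviour described above.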

Dropout:

— DNN before and after applying dropout —

Dropout is another regularization technique, used inside a single model: it simulates a large number of different network architectures by randomly dropping out nodes in one or more hidden layers during training. It is regarded as one of the most computationally cheap yet effective regularization techniques.

We said that dropout can be applied to one or more layers, with the exception of the output layer. A new hyperparameter is introduced that specifies the probability at which a layer’s outputs are dropped out, or inversely, the probability at which they are retained; which interpretation is used is an implementation detail that differs from paper to code library. A common value is a probability of 0.5 for retaining the output of each node in a hidden layer, and a value closer to 1.0, such as 0.8, for retaining inputs from the visible (input) layer.

Rather than guessing at a suitable dropout rate for your network, test different rates systematically; for example, try values between 0.1 and 1.0 in increments of 0.1. This will help you discover both what works best for your specific model and dataset and how sensitive the model is to the dropout rate. A model that is very sensitive to the rate may be unstable and could benefit from an increase in size.
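As a minimal sketch of dropout in Keras (assuming TensorFlow is installed), note that in Keras the rate argument is the fraction of units to drop, not to retain; the layer sizes and rates below are illustrative, not tuned values.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),  # drop 50% of this hidden layer's outputs
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),  # a lighter rate, as if found by testing rates
    tf.keras.layers.Dense(1),      # no dropout on the output layer
])
model.compile(optimizer="adam", loss="mse")
```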

Data Augmentation:

Data is a valuable yet very costly part of machine learning, so what if we want to train our model but lack sufficient data? Thankfully, data augmentation allows us to create a variety of modified copies and variations of our existing data and train our model on many more examples.

Common augmentation techniques include flipping, rotating, scaling, cropping, translating and adding Gaussian noise. These apply to image data; other techniques exist for other types of data such as text and audio. Popular ML libraries such as TensorFlow and Keras provide simple ways to implement data augmentation.
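For instance, a minimal sketch with Keras’ ImageDataGenerator might look like this; the transformation ranges are illustrative values, and x_train / y_train are assumed to be NumPy arrays of images and labels.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # random horizontal translations
    height_shift_range=0.1,  # random vertical translations
    zoom_range=0.1,          # random zooming (scaling)
    horizontal_flip=True,    # random left-right flips
)

# The model then trains on augmented batches generated on the fly:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)
```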

Early Stopping:

Another issue that comes with training neural nets is how long the model should keep training. Too little and it underfits; too much and it overfits. We therefore want to stop training once no further progress is made or performance starts to degrade; this procedure is called early stopping.

To implement this technique, we first monitor the model’s performance by creating a validation set and using the loss on this set as the metric that defines the stopping point of training; in its simplest form this is called the stopping trigger. More elaborate stopping triggers also exist, involving more than one metric or additional conditions such as the stagnation of a metric over a certain number of epochs, a radical change in a metric, and many more.

Once training is halted by the stopping trigger, we save the model from a particular epoch, and which epoch that is depends on the trigger chosen to stop the training process. For example, if the trigger is a simple decrease in performance from one epoch to the next, then the weights of the model at the prior epoch will be preferred.
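A minimal sketch of this in Keras, assuming a held-out validation set, could use the EarlyStopping callback; monitoring the validation loss with a patience of 5 epochs is an illustrative choice.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # the stopping trigger: validation loss
    patience=5,                 # stop after 5 epochs without improvement
    restore_best_weights=True,  # roll back to the weights of the best epoch
)

# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=200, callbacks=[early_stop])
```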
