# Concise Lecture Notes - Lesson 5 | Fastai v3 (2019)

These notes were typed out by me while watching the lecture, for a quick revision later on. To be able to fully understand them, they should be used alongside the jupyter notebooks that are available here:

### Preamble:

- Kindly use the Jupyter notebook in parallel with these notes for revision.
- The course consists of 7 lessons and the recommended study pattern is around 10 hours a week, so about 70 hours of DL practice overall.
- We will be using Jupyter notebooks, the fastai library and PyTorch to do the course.
- Fastai can be used to solve problems in these four areas: Computer Vision, Natural Language Text, Tabular data and Collaborative filtering.

### Notes:

The rest of the lectures cover how everything works, starting with collaborative filtering (since it’s a linear model and easier to understand).

Activation functions are element-wise functions. Therefore the input activations and the output activations are of the same length.
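As a quick illustration of the element-wise point, here is a minimal sketch using ReLU in plain Python. The function name `relu` and the sample values are my own choices for illustration, not something from the lecture notebooks.

```python
def relu(activations):
    """Apply ReLU to every element independently (element-wise)."""
    return [max(0.0, a) for a in activations]


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = relu(inputs)

# Each output depends only on the input at the same position,
# so the output has exactly as many elements as the input.
print(len(inputs), len(outputs))
print(outputs)
```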

When you stack big enough weight matrices with activation functions in between them, the Universal Approximation Theorem applies, which means

**the network can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy.**

#### What happens when we do transfer learning on a resnet-34?

In the ImageNet problem, there are 1000 classes to predict. So the target vector/output layer has a thousand elements. Therefore the weight matrix of the final layer has 1000 elements along one axis.

This is useless to us, since we might not have a thousand categories and even if we do, they are not the same.

So `create_cnn` removes the last layer entirely and puts in two weight matrices with a ReLU in between them. The size of the second matrix is set by the number of classes to predict.

The later layers of a trained CNN identify specific things like eyeballs, unlike the early layers, which identify edges. So the later layers are much less useful to us, while the early layers transfer well. That is why you want the weights of the earlier layers to stay close to what they already are.

In the beginning, we freeze all the earlier layers and train only the newly added last layers, so that backpropagation gets the last layers to better-than-random weights.

Then we unfreeze the earlier layers and we split the model into a few sections. We want to train the later part more and earlier part less.

#### How do we do that?

Give a smaller learning rate to the earlier parts and a higher learning rate to the later parts of the network. This is called using **discriminative learning rates.**

- `fit(1, 1e-3)` -> All layers `LR = 1e-3`
- `fit(1, slice(1e-3))` -> Final layer `LR = 1e-3`, all other layers `LR = (1e-3)/3`
- `fit(1, slice(1e-5, 1e-3))` -> Final layer `LR = 1e-3`, early layer `LR = 1e-5`, and the middle layers get LRs that are between these two (spread evenly).

To make things more manageable, fastai gives a different learning rate to different layer groups (not individual layers). By default a CNN is split into 3 groups.
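To make the `slice(1e-5, 1e-3)` case concrete, here is a small sketch of how learning rates could be spread across layer groups. It assumes the spacing is even on a multiplicative (log) scale, which is my reading of what "spread evenly" means here. `spread_lrs` is a hypothetical helper written for this note, not a fastai function.

```python
def spread_lrs(start, stop, n_groups):
    """Return one learning rate per layer group, from `start` (earliest group)
    to `stop` (final group), evenly spaced on a multiplicative scale.

    Assumption: geometric spacing; this is an illustration, not fastai's code.
    """
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


# Three layer groups, as in the default CNN split mentioned above.
lrs = spread_lrs(1e-5, 1e-3, 3)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr ~ {lr:.0e}")
```

With three groups, the middle group ends up at roughly 1e-4, ten times the earliest group and a tenth of the final one.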

Affine functions: These are linear functions, which in a deep learning context mostly means matrix multiplication.

#### Embedding matrices:

An embedding is an array lookup, which is mathematically identical to multiplying a weight matrix by a one-hot encoded matrix.

Embedding simply means looking something up in an array. It is a fast and memory-efficient way of multiplying parameters (the weight matrix) by a one-hot encoded matrix.

As it turns out, each of these embedding vectors corresponds to a single input, and once the neural network is trained, the values of these embeddings describe the input in some fashion or other. These values are called latent factors or latent features.
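The equivalence between a lookup and a one-hot multiplication can be checked in a few lines of plain Python. The matrix values and the helper names (`one_hot`, `vec_mat_mul`) below are made up for illustration.

```python
# Embedding matrix: 4 items, 3 latent factors each (made-up values).
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat_mul(vec, mat):
    """Multiply a row vector by a matrix (list of rows)."""
    n_cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(n_cols)]


item = 2
via_matmul = vec_mat_mul(one_hot(item, len(embedding)), embedding)
via_lookup = embedding[item]  # the "embedding" way: just index the array

print(via_matmul == via_lookup)
```

The lookup does no arithmetic at all and never has to build the one-hot vector, which is where the speed and memory savings come from.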

__Movie rating Collab Filtering example__

Semantically, these embeddings correspond to properties of the input. In this example, the user embeddings may correspond to what types of movies that person likes and the movie embeddings correspond to what elements or categories that movie falls into.

Even so, a user may generally like movies with John Travolta, and Battlefield Earth is a movie that has John Travolta, but the movie sucks.

How do we deal with that? There has to be a way to say "unless it’s Battlefield Earth". This is where bias comes in. The user bias should be able to tell if the user is someone who rates movies highly and the movie bias should be able to tell if the movie is “good”.

Therefore, every embedding has a bias vector associated with it.
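A rough sketch of how the biases enter the prediction: the score is the dot product of the user and movie factors plus the two bias terms. All numbers and names here (`predict_score`, the factor values, the bias values) are invented for illustration.

```python
def predict_score(user_factors, movie_factors, user_bias, movie_bias):
    """Dot product of the latent factors plus one bias per user and per movie."""
    dot = sum(u * m for u, m in zip(user_factors, movie_factors))
    return dot + user_bias + movie_bias


# Hypothetical user who likes Travolta movies a lot (first factor).
user_factors = [0.9, 0.1]
user_bias = 0.1

# Two hypothetical movies with identical factors (both "Travolta movies"),
# but very different movie biases: one is good, the other is not.
travolta_movie_factors = [0.8, 0.2]
score_good = predict_score(user_factors, travolta_movie_factors, user_bias, movie_bias=0.5)
score_bad = predict_score(user_factors, travolta_movie_factors, user_bias, movie_bias=-1.5)

print(score_good > score_bad)
```

The factors alone cannot tell the two movies apart; only the movie bias lets the model say "unless it is this particular movie".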

- The first argument in `fit_one_cycle` is `epochs`.
- When using old datasets that are not Unicode or `'utf-8'`, you have to guess the encoding. Just try `'latin-1'`; it is likely to be that.
- `CollabDataBunch` expects the data columns to be in the order: user, item, rating.
- Sigmoid never reaches the ends of its range; it only asymptotes towards them. So a trick is to set `y_range = (0, 5.5)`, even though the lowest rating is 0.5 and the highest is 5. People do liberally give a 5 to a movie, and the wider range makes that value reachable.
- `n_factors` is the number of factors, i.e. the width of the embedding matrix.
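The `y_range` trick can be seen with a scaled sigmoid. This is a minimal sketch assuming the prediction is squashed as `lo + (hi - lo) * sigmoid(x)`; the function name `scaled_sigmoid` is my own.

```python
import math


def scaled_sigmoid(x, y_range=(0.0, 5.5)):
    """Squash a raw activation `x` into the open interval (lo, hi)."""
    lo, hi = y_range
    return lo + (hi - lo) / (1.0 + math.exp(-x))


# With a range that stops exactly at 5, even a large activation falls short of 5.
almost_five = scaled_sigmoid(10.0, y_range=(0.0, 5.0))

# With the wider range, a modest activation of ln(10) ~ 2.3 already gives 5.
five = scaled_sigmoid(math.log(10.0), y_range=(0.0, 5.5))

print(almost_five < 5.0, round(five, 6))
```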

#### Why `n_factors` = 40?

Simple, because it works. Jeremy tried 10, 20, 40 and 80. 40 worked best.

It is often insightful to actually look at the bias values and weights of particular users or movies (items).

The embedding weights have many dimensions, so we can apply PCA and reduce them to 3 components that are easier to interpret.

__Regularization:__

- Myth: You need to use fewer parameters so that the function does not overfit.
Reality: You can have lots of parameters. More parameters mean more non-linearities, aka more curvy bits, which is what life is like.

We don’t want the function to be more curvy than necessary (overfit). So let’s use lots of parameters and then **penalize** complexity. We use `wd` for this. `wd` is weight decay. It is a type of regularization.

#### How to penalize complexity using weight decay?

- We take the sum of the squares of our parameters, multiply it by some number `wd`, and add it to our loss function.
- Then, to reduce the loss, the model has to reduce the values of the parameters, and the smaller they are, the less complex our function will be.
- We multiply by `wd` because the sum of the squares of the parameters is likely to be huge. Added unscaled, it would dominate the loss, and the best way to reduce the loss would be to push all the parameters towards zero. The value of `wd` that works for most cases is 1e-1.

Equation: \(loss + wd*\sum{w^2}\)

Now in this form, when we add \((wd*\sum{w^2})\) to the loss, it is called __L2 regularization__.

But when we take its derivative with respect to a weight, we get \((2*wd*w)\), which we can write as \((wd*w)\) by absorbing the constant 2 into `wd`. In this form, where \((wd*w)\) is applied directly in the weight update, it is called __weight decay__.

We should always use weight decay rather than L2 regularization.
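The two forms can be compared on a single weight with a plain SGD step. This is a sketch under my own naming (`sgd_step_l2`, `sgd_step_weight_decay`) and made-up numbers. With plain SGD the two give the same update once the constant 2 is absorbed; as far as I know, they stop being equivalent once momentum or Adam-style scaling is applied to the gradient.

```python
def sgd_step_l2(w, grad_loss, lr, wd):
    """L2 regularization: wd * w**2 is added to the loss,
    so its gradient 2 * wd * w is added to the loss gradient."""
    return w - lr * (grad_loss + 2 * wd * w)


def sgd_step_weight_decay(w, grad_loss, lr, wd):
    """Weight decay: shrink the weight by lr * wd * w directly in the update,
    separately from the gradient of the loss."""
    return w - lr * grad_loss - lr * wd * w


w, grad_loss, lr = 2.0, 0.5, 0.1

# wd = 0.05 in the L2 form matches wd = 0.1 in the weight decay form
# (the factor of 2 from the derivative is absorbed into wd).
after_l2 = sgd_step_l2(w, grad_loss, lr, wd=0.05)
after_decay = sgd_step_weight_decay(w, grad_loss, lr, wd=0.1)

print(abs(after_l2 - after_decay) < 1e-12)
```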

__Adam Optimizer__

- Adam is an optimizer, much like Stochastic Gradient Descent (SGD).

##### Loss plot when SGD is used:

##### Loss plot when Adam is used:

Adam reaches a much lower loss much quicker and therefore we prefer to use that.

Adam combines two techniques on top of SGD, namely momentum and RMSProp.

Momentum takes each step along an exponentially weighted moving average of the last few steps and the current gradient. More can be understood here.

RMSProp keeps an exponentially weighted moving average of the squared gradient and divides each step by its square root. Parameters whose gradients are large or volatile take smaller steps, which damps oscillations, while parameters whose gradients are consistently small take larger steps, allowing for much faster convergence.
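To show how the two moving averages fit together, here is a minimal single-parameter sketch of an Adam-style update, following the commonly published form of the algorithm with bias correction. The function name `adam_minimize`, the hyperparameter values and the toy objective \((w - 3)^2\) are my own choices for illustration, not the lecture's spreadsheet.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Minimize a 1-D function given its gradient, using an Adam-style update."""
    avg_grad = 0.0     # momentum part: moving average of gradients
    avg_sq_grad = 0.0  # RMSProp part: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        avg_grad = beta1 * avg_grad + (1 - beta1) * g
        avg_sq_grad = beta2 * avg_sq_grad + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so rescale early steps.
        m_hat = avg_grad / (1 - beta1 ** t)
        v_hat = avg_sq_grad / (1 - beta2 ** t)
        # Step along the averaged gradient, scaled by the RMS of recent gradients.
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```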

Even with Adam, you need to use learning rate annealing.

In PyTorch, you just have to use `optim.Adam(model.parameters(), lr)`.

__Cross Entropy Loss__

The aim is to have a loss function that assigns a small loss for predicting a right thing confidently and a huge loss for predicting a wrong thing confidently.

Cross entropy loss is the negative sum, over the classes, of the product of the binary indicator of a class (1 or 0) and the log of the predicted probability for that class (the probabilities have to add up to 1). Therefore, when using cross entropy loss, the last activation layer is almost always softmax.

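- For binary classification: \(-(y*\log(p) + (1-y)*\log(1-p))\)
- For multiclass classification: \(-\sum_{c}{y_c*\log(p_c)}\)

Here \(y\) is the binary indicator of the class and \(p\) is the predicted probability.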
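The description of cross entropy above can be sketched in plain Python. Because the class indicator is 1 only for the true class, the sum reduces to the negative log of the probability given to the true class. The helper names (`softmax`, `cross_entropy`) and the activation values are made up for illustration.

```python
import math


def softmax(activations):
    """Turn raw activations into probabilities that add up to 1."""
    biggest = max(activations)  # subtract the max for numerical stability
    exps = [math.exp(a - biggest) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[target_index])


# The model is confident that class 0 is the answer.
probs = softmax([2.0, 0.5, -1.0])

loss_if_right = cross_entropy(probs, target_index=0)  # confident and right
loss_if_wrong = cross_entropy(probs, target_index=2)  # confident and wrong

print(loss_if_right < loss_if_wrong)
```

The same confident prediction gives a small loss when it is right and a much larger loss when it is wrong, which is the behaviour the loss function is meant to have.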