I typed these notes while watching the lecture, for quick revision later on. To fully understand them, use them alongside the Jupyter notebooks that are available here:

### Preamble:

• Kindly use the Jupyter notebook in parallel with these notes for revision.
• The course consists of 7 lessons, and the recommended pace is around 10 hours a week, so roughly 70 hours of DL practice overall
• We will be using Jupyter notebooks, the fastai library, and PyTorch for the course
• Fastai can be used to solve problems in these four areas: Computer Vision, Natural Language Text, Tabular data and Collaborative filtering.
• This particular lesson’s notes are not very concise; there was too much information to skip. Apologies in advance.

• First create your item list, then decide how to split it. You nearly always want a validation set; you can’t skip that step entirely. You have to say how to split, and one of the options is no_split
• In fast.ai we use the same parlance that Kaggle does: the training set is what you train on, and the validation set has labels and is used to test that your model is working.
• Next thing we can do is to add transforms. For small images of digits like this, you just add a bit of random padding. tfms = ([*rand_pad(padding=3, size=28, mode='zeros')], [])
• The random padding function actually returns two transforms: the bit that does the padding and the bit that does the random crop. So you have to use the star (*) to put both of these transforms into the list.
• The empty list ([]) holds the transforms for the validation set; so here, no transforms on the validation set.
• We can choose to use .normalize. When we’re not using a pre-trained model, there’s no reason to use ImageNet stats.
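• As a rough illustration of what those two transforms do, here is a plain-PyTorch sketch (not fastai’s rand_pad implementation; the function name is made up):

```python
import torch
import torch.nn.functional as F

def pad_then_random_crop(img, padding=3, size=28):
    """Zero-pad an image tensor, then take a random crop back to `size`.

    Mimics the two transforms that rand_pad(padding=3, size=28, mode='zeros')
    returns: one pads, the other randomly crops. `img` is (C, H, W).
    """
    padded = F.pad(img, (padding, padding, padding, padding), mode='constant', value=0)
    top = torch.randint(0, 2 * padding + 1, (1,)).item()
    left = torch.randint(0, 2 * padding + 1, (1,)).item()
    return padded[:, top:top + size, left:left + size]

img = torch.rand(1, 28, 28)          # a fake single-channel digit image
out = pad_then_random_crop(img)
print(out.shape)                     # torch.Size([1, 28, 28])
```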

#### CNN

```python
def conv(ni, nf): return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),   # 14
    nn.BatchNorm2d(8),
    nn.ReLU(),
    conv(8, 16),  # 7
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 32), # 4
    nn.BatchNorm2d(32),
    nn.ReLU(),
    conv(32, 16), # 2
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 10), # 1
    nn.BatchNorm2d(10),
    Flatten()     # remove (1,1) grid
)
```

• Since stride=2, each convolution skips over every other pixel, i.e. it jumps two steps at a time. That means each convolution halves the grid size: starting from a 28x28 grid, after one convolution it becomes 14x14.
• You always get to pick how many filters you create regardless of whether it’s a fully connected layer in which case it’s just the width of the matrix you’re multiplying by, or with the 2D conv, it’s just how many filters do you want.
• We keep stacking conv, BatchNorm, ReLU with the aim of having the last conv layer give an output with a 1x1 grid.
• Our loss functions generally expect a vector, not a rank-3 tensor, so we chuck Flatten at the end; flatten just removes any unit axes. That leaves us with a vector of length 10, which is what we expect.
• We do the same as before: LR find, fit one cycle, and quickly get predictions out.
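• The grid sizes noted in the model’s comments can be sanity-checked with plain PyTorch; here nn.Flatten stands in for fastai’s Flatten:

```python
import torch
import torch.nn as nn

def conv(ni, nf):  # same helper as in the notes
    return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),   nn.BatchNorm2d(8),  nn.ReLU(),  # 28 -> 14
    conv(8, 16),  nn.BatchNorm2d(16), nn.ReLU(),  # 14 -> 7
    conv(16, 32), nn.BatchNorm2d(32), nn.ReLU(),  # 7  -> 4
    conv(32, 16), nn.BatchNorm2d(16), nn.ReLU(),  # 4  -> 2
    conv(16, 10), nn.BatchNorm2d(10),             # 2  -> 1
    nn.Flatten(),                                 # (bs, 10, 1, 1) -> (bs, 10)
)

x = torch.randn(2, 1, 28, 28)   # a fake batch of two 28x28 digits
print(model(x).shape)           # torch.Size([2, 10])
```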
• Rather than saying conv, batch norm, ReLU all the time, fast.ai already has something called conv_layer which lets you create conv, batch norm, ReLU combinations.

#### How to improve this?

• What we really want to do is create a deeper network. How to achieve that without overfitting? We use skip connections.
• Details about skip connections and how they help: How skip connections changed deep learning
```python
class ResBlock(nn.Module):
    def __init__(self, nf:int):
        super().__init__()  # initialize
        # input and output channels are the same size
        self.conv1 = conv_layer(nf, nf, 3, 1)
        self.conv2 = conv_layer(nf, nf, 3, 1)

    def forward(self, x:torch.Tensor) -> torch.Tensor:
        return self.conv2(self.conv1(x)) + x  # SKIP CONNECTION
```

• There’s a res_block function already in fast.ai so you can just call res_block.
• There’s something else here which is you can optionally set dense=True, and what happens if you do?
• It returns cat([x,x.orig]) instead. In other words, rather than putting a plus in this connection, it does a concatenate.
• This is a dense block. And it’s not called a ResNet anymore, it’s called a DenseNet.
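• The only difference between the two blocks is the merge operation; a toy illustration of add versus concatenate (sizes made up):

```python
import torch

x = torch.randn(2, 16, 8, 8)        # (bs, channels, h, w): input to the block
out = torch.randn(2, 16, 8, 8)      # pretend output of the two convs

res = out + x                       # ResBlock: add, channels stay at 16
dense = torch.cat([out, x], dim=1)  # dense block: concat, channels grow to 32

print(res.shape, dense.shape)  # torch.Size([2, 16, 8, 8]) torch.Size([2, 32, 8, 8])
```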

#### U-Net

• We use U-Nets for image segmentation, or generally any task where the output needs to be of a similar form to the input.
• This is the architecture of a U-Net (diagram not reproduced here). We can see it as three parts:

1. Downsampling path (left side of the U)
2. Convolutions (bottom of the U)
3. Upsampling path (right side of the U)
• The encoder refers to the downsampling part of the U-Net. The original has a specific older-style architecture here, but we replace those older-style bits with pretrained ResNet bits.

• The convolutions at the bottom are there to add more computation. It is simply two convolutions, i.e. (conv2d, batchnorm, relu) x 2

• So we have encoder, batchnorm, relu and two convolutions. Now how to upsample?

• We have to double the grid size so we can do a stride half conv, also known as a deconvolution, also known as a transpose convolution.

• Originally this was done as shown in the lecture diagram: as highlighted in red there, this led to a lot of empty information and uneven information being fed into the convolution.
• There are better and simpler alternatives, like nearest-neighbour interpolation or bilinear interpolation followed by a stride-1 conv.

• Fastai uses something called a pixel shuffle also known as sub pixel convolutions.
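• Both upsampling styles are easy to sketch in plain PyTorch (sizes here are made up); note that pixel shuffle needs a conv that first makes 4x the channels, which nn.PixelShuffle then rearranges into a grid twice as large in each dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)  # a low-resolution feature map

# Nearest-neighbour interpolation followed by a stride-1 conv:
up_nn = nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                      nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1))

# Pixel shuffle (sub-pixel convolution):
up_ps = nn.Sequential(nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1),
                      nn.PixelShuffle(2))

print(up_nn(x).shape)  # torch.Size([1, 64, 14, 14])
print(up_ps(x).shape)  # torch.Size([1, 64, 14, 14])
```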

#### Why U-nets work so well

• So what Olaf Ronneberger et al. did was they added a skip connection, an identity connection, and amazingly enough, this was before ResNets existed.

• But rather than adding a skip connection that skipped every two convolutions, they added skip connections from each part of the downsampling path to the same-sized part of the upsampling path.

• Those are the gray lines in the U-net.
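• Each gray line just concatenates the saved downsampling activations onto the same-sized upsampling activations before the next conv. A minimal sketch (sizes made up):

```python
import torch
import torch.nn as nn

# Encoder activations saved on the way down (same spatial size as the
# upsampled decoder activations they will be joined with).
enc_act = torch.randn(1, 64, 28, 28)

# Decoder activations after upsampling back to 28x28.
dec_act = torch.randn(1, 64, 28, 28)

# The gray line: concatenate along the channel dimension, then convolve.
merged = torch.cat([dec_act, enc_act], dim=1)        # (1, 128, 28, 28)
conv = nn.Conv2d(128, 64, kernel_size=3, padding=1)
print(conv(merged).shape)                            # torch.Size([1, 64, 28, 28])
```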

#### Generative modelling

• We can take image restoration as an example, i.e. turning a low-res image into a high-res image, or converting a black-and-white image into a color image.
• We first take high-resolution images and then crappify them. Those crappified images will be our input.
• We can use something like a U-net to actually generate these higher quality images and then compare it to the original ground truth images and optimize.
• We quickly run into an issue when optimizing.
• The reason we aren’t making as much progress with that as we’d like is that our loss function doesn’t really describe what we want.
• Because actually, the mean squared error between pixels of nearly the same color is very small! This doesn’t let the model learn texture or other nuances.
• We need a better loss function.

#### GAN

• A GAN, or Generative Adversarial Network, tries to solve this problem by using a loss function which actually calls another model.
• This other model is called the discriminator, or the critic.
• This model is a binary classification model that takes all the pairs of the generated image and the real high-res image, and learns to classify which is which.
• So we train this model until it is easy for it to identify which is the generated image and which is the original image.
• Now we train the generator a little bit more using that critic as the loss function, and the generator is going to get really good at fooling the critic.
• So now we’re going to stop training the generator, and we’ll train the critic some more on these newly generated images. Now that the generator is better, it’s now a tougher task for the critic to decide which is real and which is fake.
• So we’ll just go ping pong ping pong, backwards and forwards. That’s a GAN
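• A minimal toy sketch of that ping-pong loop (this is not fastai’s GANLearner; the 1-D "images" and all sizes are made up):

```python
import torch
import torch.nn as nn

G = nn.Linear(1, 1)                              # toy generator: noise -> number
D = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())  # toy critic: number -> real?
opt_g = torch.optim.SGD(G.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

for step in range(10):
    real = torch.randn(16, 1) + 5.0      # "real" data centered at 5
    noise = torch.randn(16, 1)

    # Train the critic: real -> 1, generated -> 0.
    d_loss = bce(D(real), torch.ones(16, 1)) + \
             bce(D(G(noise).detach()), torch.zeros(16, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to fool the critic into saying 1.
    g_loss = bce(D(G(noise)), torch.ones(16, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```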
• Fastai version of GAN pre-trains the generator and pre-trains the critic.
• When you’re doing a GAN, you need to be particularly careful that the generator and the critic can’t both push in the same direction and increase the weights out of control.
• So we have to use something called spectral normalization to make GANs work nowadays.
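• PyTorch ships a spectral normalization wrapper; for example, wrapping a critic conv constrains the layer’s largest singular value (the layer sizes here are made up):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping the layer keeps its largest singular value near 1, which
# stops critic weights from spiralling out of control.
critic_conv = spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1))

x = torch.randn(1, 3, 8, 8)
print(critic_conv(x).shape)  # torch.Size([1, 64, 4, 4])
```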
• A GAN critic uses a slightly different way of averaging the different parts of the image when it does the loss, so anytime you’re doing a GAN at the moment, you have to wrap your loss function with AdaptiveLoss
```python
loss_critic = AdaptiveLoss(nn.BCEWithLogitsLoss())

def create_critic_learner(data, metrics):
    return Learner(data, gan_critic(), metrics=metrics, loss_func=loss_critic, wd=wd)

learn_critic = create_critic_learner(data_crit, accuracy_thresh_expand)
```

• At the moment, GANs hate momentum when you’re training them. It doesn’t make sense to train them with momentum because you keep switching between generator and critic.
```python
switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)
learn = GANLearner.from_learners(learn_gen, learn_crit, weights_gen=(1., 50.),
                                 switcher=switcher, wd=wd)

learn.callback_fns.append(partial(GANDiscriminativeLR, mult_lr=5.))
```

• One of the tough things about GANs is that their loss numbers are meaningless.
• You can’t expect them to go down, because as the generator gets better, it gets harder for the discriminator (i.e. the critic), and then as the critic gets better, it gets harder for the generator.
• So the numbers should stay about the same.
• The only way to know how they are doing is to actually look at the results from time to time.

#### Feature Loss

• The next step is “can we get rid of GANs entirely?”
• Obviously, the thing we really want to do is come up with a better loss function
• We want a loss function that does a good job of saying this is a high-quality image without having to go through all the GAN trouble; and preferably one that doesn’t just say it’s a high-quality image, but that it’s an image which actually looks like the thing it’s meant to.
• The trick is here: Justin Johnson et al. created this thing they call perceptual losses
• But in fact there is nothing perceptual about them, so in fastai they are called feature losses.
• The architecture is similar to a U-Net: there is an encoder (downsampling) and a decoder (upsampling).
• Here, instead of taking the final output of the VGG model on the generated image, let’s take the activations of some layer in the middle.
• Those activations might be a feature map of, say, 256 channels by 28 by 28. Those 28 by 28 grid cells roughly, semantically, say things like:

“in this part of that 28 by 28 grid, is there something that looks kind of furry? Or is there something that looks kind of shiny? Or is there something that was kind of circular? Is there something that kind of looks like an eyeball?”

• So the loss function says something like “there’s eyeballs here (in the target), but there isn’t here (in the generated version), so do a better job of that please”

• Hence the name feature losses.

```python
class FeatureLoss(nn.Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = (['pixel'] + [f'feat_{i}' for i in range(len(layer_ids))]
                             + [f'gram_{i}' for i in range(len(layer_ids))])

    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input, target)]
        self.feat_losses += [base_loss(f_in, f_out)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out))*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()


feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5, 15, 2])

learn = unet_learner(data, arch, wd=1e-3, loss_func=feat_loss,
                     callback_fns=LossMetrics, blur=True, norm_type=NormType.Weight)
gc.collect();
```
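• FeatureLoss relies on two helpers defined elsewhere in the notebook: base_loss (a plain pixel-wise loss such as F.l1_loss) and gram_matrix. A minimal gram_matrix along those lines (this sketch is my reconstruction, not copied from the lesson):

```python
import torch

def gram_matrix(x):
    # Flatten each channel's activations, then take channel-by-channel
    # dot products, normalized by the number of elements per channel map.
    n, c, h, w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1, 2)) / (c * h * w)

feat = torch.randn(2, 8, 4, 4)       # a fake (bs, channels, h, w) feature map
print(gram_matrix(feat).shape)       # torch.Size([2, 8, 8])
```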


#### RNN

1. Grab word 1 as an input.
2. Chuck it through an embedding, create some activations.
3. Pass that through a matrix product and nonlinearity.
4. Grab the second word.
5. Put it through an embedding.
6. Then we could either add those two things together or concatenate them. Generally speaking, when you see two sets of activations coming together in a diagram, you normally have a choice of concatenate or add. That’s going to create the second bunch of activations.
7. Repeat for word 3
8. Then you can put it through one more fully connected layer and softmax to create an output.
• When we write this as refactored and efficient code:
```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)  # green arrow
        self.h_h = nn.Linear(nh, nh)     # brown arrow
        self.h_o = nn.Linear(nh, nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        # x is (seq_len, batch); one hidden state per sequence in the batch
        h = torch.zeros(x.shape[1], nh).to(device=x.device)
        for xi in x:
            h = h + self.i_h(xi)
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)
```

• Previously we were comparing the result of our model only to the last word of the sequence. That is very wasteful, because there are a lot of words in the sequence.
• So let’s compare every word in x to every word in y. To do that, we need to change the diagram so it’s not just one triangle at the end of the loop; the triangle is inside the loop:

```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()  # bs = batch size (x isn't available here)

    def forward(self, x):
        res = []
        h = self.h
        for xi in x:
            h = h + self.i_h(xi)
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))  # batch norm per step: BatchNorm1d expects (batch, nh)
        self.h = h.detach()         # keep the state across batches, drop the history
        res = torch.stack(res)
        return self.h_o(res)
```

• So nn.RNN basically says do the loop for me.
• We’ve still got the same embedding, the same output, the same batch norm, and the same initialization of h; we’ve just got rid of the loop.
• One of the nice things about nn.RNN is that you can now say how many layers you want.
```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.RNN(nh, nh, 2)   # 2 stacked layers
        self.h_o = nn.Linear(nh, nv)
        self.bn = BatchNorm1dFlat(nh)  # fastai's BatchNorm1d that handles the sequence axis
        self.h = torch.zeros(2, bs, nh).cuda()  # one initial hidden state per layer

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))
```
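• The shapes nn.RNN expects and returns can be checked with plain PyTorch (the sizes here are made up):

```python
import torch
import torch.nn as nn

nv, nh, bs, sl = 30, 64, 8, 16   # assumed vocab size, hidden size, batch, seq len
i_h = nn.Embedding(nv, nh)
rnn = nn.RNN(nh, nh, 2)          # 2 stacked RNN layers
h0 = torch.zeros(2, bs, nh)      # one initial hidden state per layer

x = torch.randint(0, nv, (sl, bs))   # (seq_len, batch) of token ids
res, h = rnn(i_h(x), h0)
print(res.shape, h.shape)  # torch.Size([16, 8, 64]) torch.Size([2, 8, 64])
```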

• This is what we just wrote in code.
• When you really think about it, this is how it looks without the loop.
• There are a few tricks you can do. One thing is you can add skip connections, of course.
• But what people normally do is, instead of just adding these together (the green and orange arrows), they use a little mini neural net to decide how much of the green arrow and how much of the orange arrow to keep.
• When you do that, you get something that’s either called GRU or LSTM depending on the details of that little neural net.
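• In PyTorch, swapping the plain RNN for a gated version is a one-line change; nn.GRU (and nn.LSTM, which additionally carries a cell state) takes the same input shapes:

```python
import torch
import torch.nn as nn

nh, bs, sl = 64, 8, 16          # assumed hidden size, batch, seq len
gru = nn.GRU(nh, nh, 2)         # drop-in replacement for nn.RNN(nh, nh, 2)
h0 = torch.zeros(2, bs, nh)

x = torch.randn(sl, bs, nh)     # already-embedded input
res, h = gru(x, h0)
print(res.shape)  # torch.Size([16, 8, 64])
```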
• Example use cases: in the sequence of outputs, for every word there could be something saying whether it is sensitive information I want to anonymize or not.
• So it says private data or not. Or it could be a part-of-speech tag for that word, or something saying how that word should be formatted.
• These are called sequence labeling tasks and so you can use this same approach for pretty much any sequence labeling task.