These notes were typed out by me while watching the lecture, for quick revision later on. To fully understand them, use them alongside the Jupyter notebooks that are available here:
Some other very important links:
- Kindly use the Jupyter notebook in parallel with these notes for revision.
- The course consists of 7 lessons, and the recommended study pattern is around 10 hours a week, so roughly 70 hours of DL practice overall.
- We will be using Jupyter notebooks, the fastai library, and PyTorch throughout the course.
- fastai can be used to solve problems in four areas: computer vision, natural language text, tabular data, and collaborative filtering.
- Transforms are bits of code that run every time something is grabbed from a dataset.
- Preprocessors are like transforms, but they run once, before you do any training. They run once on the training set, and any state or metadata that's created is then shared with the validation and test sets.
fastai provides a few preprocessors:
- Categorify, which converts the string values of a column into categories or classes based on the unique values in that column.
- FillMissing, which creates a boolean column column_name_na that is true wherever column_name was missing. It also replaces the missing values in the original column with the median, so that column remains a continuous variable.
- Normalize, which normalizes the numerical columns using a z-score (subtract the mean, divide by the standard deviation).
We can use `processes = [FillMissing, Categorify, Normalize]` to apply all of them together easily.
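As a hedged sketch (plain Python, not fastai's actual implementation), the following shows conceptually what the three preprocessors do; the function names and toy data are made up for illustration. The key point is that all state (the category mapping, the median, the mean and standard deviation) is computed on the training set and then reused on the validation set:

```python
import statistics

def categorify(train_col, valid_col):
    """Map string values to integer codes; the mapping is built on the
    training set and reused (shared state) on the validation set."""
    classes = sorted(set(train_col))
    code = {c: i for i, c in enumerate(classes)}
    # Unseen categories in the validation set map to -1.
    return [code[v] for v in train_col], [code.get(v, -1) for v in valid_col]

def fill_missing(train_col, valid_col):
    """Replace None with the *training-set* median and return a boolean
    '_na' indicator column alongside the filled values."""
    median = statistics.median(v for v in train_col if v is not None)
    def fill(col):
        na = [v is None for v in col]
        filled = [median if v is None else v for v in col]
        return filled, na
    return fill(train_col), fill(valid_col)

def normalize(train_col, valid_col):
    """z-score normalize using the training-set mean and std."""
    mean = statistics.fmean(train_col)
    std = statistics.pstdev(train_col)
    return ([(v - mean) / std for v in train_col],
            [(v - mean) / std for v in valid_col])
```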
- Dropout is a kind of regularization method.
- In Dropout, at random, we throw away some percentage of the activations, not the weights/parameters. Remember, there are only two types of numbers in a neural net: parameters (also called weights, kind of) and activations.
- Every time a mini-batch goes through, we throw away some of the activations at random. Then for the next mini-batch, we put them back and throw away some different ones.
- It means that no one activation can memorize some part of the input, because that's what happens when we overfit.
- Too much dropout, of course, reduces the capacity of your model, so it's going to underfit. You've got to play around with different dropout values for each of your layers to decide.
- Note that at test time we turn off dropout, because we want the model to be as accurate as possible.
- In PyTorch, we don't need to do that by hand: dropout layers are automatically disabled when the model is put into evaluation mode.
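A minimal sketch of (inverted) dropout on a list of activations; this is not PyTorch's implementation, just the idea:

```python
import random

def dropout(activations, p, training=True):
    """Inverted dropout on a list of activations. During training, each
    activation is zeroed with probability p and the survivors are scaled
    by 1/(1-p), so the expected magnitude is unchanged and nothing needs
    to be rescaled at test time."""
    if not training or p == 0.0:
        return list(activations)  # dropout is a no-op at inference time
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]
```

Each mini-batch gets a fresh random mask, so a different subset of activations is thrown away every time.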
- Batch Normalization is kind of a regularization method, or can be thought of as a training helper. Exactly why it works is unknown.
- Initially, it was assumed to work by reducing internal covariate shift; recent papers suggest that is not the case.
- Either way, its effect is undeniable, and that alone should be motivation for being a practitioner first.
- What batch norm does is what you see in the picture in the paper: steps or batches on the x-axis, and loss on the y-axis.
- The red line is what happens when you train without batch norm - very very bumpy.
- And here, the blue line is what happens when you train with batch norm - not very bumpy at all.
- What that means is, you can increase your learning rate with batch norm. Because these big bumps represent times that you’re really at risk of your set of weights jumping off into some awful part of the weight space that it can never get out of again.
- So if it’s less bumpy, then you can train at a higher learning rate.
- In a neural net there are only two kinds of numbers: activations and parameters. \(\beta\) and \(\gamma\) here are parameters. They're things that are learnt with gradient descent.
- \(\beta\) is just a normal bias layer and \(\gamma\) is a multiplicative bias layer. Nobody calls it that, but that’s all it is. It’s just like bias, but we multiply rather than add.
- That’s what batch norm is. That’s what the layer does.
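As a sketch, for each mini-batch the layer roughly computes the following (plain Python, a single channel of scalar activations; PyTorch's BatchNorm layers also track running statistics for inference, which are omitted here):

```python
import statistics

def batch_norm(x, gamma, beta, eps=1e-5):
    """Minimal batch-norm over one mini-batch of scalar activations.
    gamma and beta are the learnable parameters: beta acts like a normal
    (additive) bias, gamma like a multiplicative one."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x)
    x_hat = [(v - mean) / (var + eps) ** 0.5 for v in x]  # normalize
    return [gamma * v + beta for v in x_hat]              # scale and shift
```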
In what proportion would you use dropout vs. other regularization methods, like weight decay, L2 norms, etc.?
- We should always use the weight decay version, not the L2 regularization version.
- batch norm, we always want.
- We pretty much always want some weight decay, but we often also want a bit of dropout.
- This is one of these things you have to try out and kind of get a feel for what tends to work for your kinds of problems.
- Data augmentation is one of the least well-studied types of regularization.
- You can do data augmentation and get better generalization without it taking longer to train, without underfitting (to an extent, at least).
In fastai we can use get_transforms in computer vision problems:

```python
tfms = get_transforms(max_rotate=20, max_zoom=1.3, max_lighting=0.4, max_warp=0.4, p_affine=1., p_lighting=1.)
```
More about what each of these transforms does can be checked here: List of transforms
Basically, these adjust the lighting, zoom, rotation, and warping of the images to create examples that mimic real-life noise, so the neural network learns to generalize better.
Padding is generally adding values to the border of the original image.
We can use zero padding, border padding or reflection padding.
You can pick zeros; border, which just replicates the edge pixels; or reflection, which, as you can see, looks as if the last few pixels are in a mirror.
Reflection padding is nearly always better.
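A toy 1-D sketch of the three padding modes (illustrative only; fastai and PyTorch apply these in 2-D):

```python
def pad_1d(row, n, mode):
    """Pad a 1-D row of pixels with n values on each side.
    'zeros' pads with 0s, 'border' replicates the edge pixel, and
    'reflection' mirrors the pixels next to the edge."""
    if mode == "zeros":
        left, right = [0] * n, [0] * n
    elif mode == "border":
        left, right = [row[0]] * n, [row[-1]] * n
    elif mode == "reflection":
        left = row[1:n + 1][::-1]        # mirror, excluding the edge pixel
        right = row[-n - 1:-1][::-1]
    return left + row + right
```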
Convolutional Neural Network
- A convolutional neural network contains what we call convolutions.
- A convolution is just a matrix multiply with interesting properties.
- This thing where we take each 3x3 area, element-wise multiply it with a kernel, and add those products together to create one output is called a convolution.
- To learn this interactively, go to setosa.io/ev/image-kernels
- Different kernels give us different results. They allow us to identify edges and patterns.
- We have to think about padding because if you have a 3 by 3 kernel and a 3 by 3 image, then that can only create one pixel of output.
- There’s only one place that this 3x3 can go. So if we want to create more than one pixel of output, we have to do padding which is to put additional numbers all around the outside.
- We generally have a stack of these kernels so that the network can capture as much information as possible.
- In order to avoid our memory going out of control, from time to time we create a convolution where we don’t step over every single set of 3x3, but instead we skip over two at a time.
- We would start with a 3x3 centered at (2, 2) and then we’d jump over to (2, 4), (2, 6), (2, 8), and so forth.
- That’s called a stride 2 convolution. What that does is, it looks exactly the same, it’s still just a bunch of kernels, but we’re just jumping over 2 at a time. We’re skipping every alternate input pixel.
- So the output from that will be H/2 by W/2.
- When we do that, we generally create twice as many kernels, so we can now have 32 activations in each of those spots. That’s what modern convolutional neural networks tend to look like.
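The convolution and stride-2 ideas above can be sketched in plain Python (a single channel with zero padding; like deep-learning libraries, this actually computes a cross-correlation, which everyone calls "convolution"):

```python
def conv2d(img, kernel, stride=1, pad=1):
    """2-D convolution of a single-channel image (list of lists) with a
    square kernel, using zero padding and an optional stride."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    # zero padding around the outside so we get more than one output pixel
    padded = [[0] * (w + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + row + [0] * pad for row in img]
    padded += [[0] * (w + 2 * pad) for _ in range(pad)]
    out = []
    for i in range(0, h + 2 * pad - k + 1, stride):
        out.append([
            sum(padded[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(0, w + 2 * pad - k + 1, stride)
        ])
    return out
```

With `stride=2`, the output is H/2 by W/2, exactly as described above.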
For reasons we’ll talk about in part 2, we often use a larger kernel for the very first conv layer.
In the above image, the section in blue is the convolution part. This is the part we reuse when we do transfer learning.
In this case, the output of the conv layers is 512 x 11 x 11. We have to convert this to an output with 37 values, since we have 37 classes (in this example).
So first, we average across the 11 x 11 values in each of the 512 channels. This is called average pooling: we just take the average of all 11 x 11 values in each channel.
Then we multiply this 512-long avg-pooled vector by a matrix of shape 512x37. The final result of the network (a vector of length 37) then goes to softmax.
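A tiny sketch of this head (global average pooling followed by the matrix multiply), using nested lists instead of tensors; the 512 x 11 x 11 → 37 shapes from the example work, but so does any channel and class count:

```python
def head(features, weight):
    """Global average pooling followed by a linear layer.
    features is n_channels x h x w (nested lists);
    weight is n_channels x n_classes."""
    # global average pooling: one number per channel
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in features]
    n_classes = len(weight[0])
    # matrix-vector product: (n_channels,) @ (n_channels x n_classes)
    return [sum(pooled[c] * weight[c][k] for c in range(len(pooled)))
            for k in range(n_classes)]
```

The resulting vector (length n_classes) is what then goes through softmax.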
Ethics in Data Science
- The person who develops these algorithms is entirely responsible for understanding their real-world implications.
- The results of bias and other mistakes can affect a lot of lives. Here is a scenario:
- There are now systems in America that will identify a person of interest in a video and send a ping to the local police.
- These systems are extremely inaccurate and extremely biased. If you're in a predominantly black neighborhood, where the probability of the system successfully recognizing you is much lower, and you're much more likely to be surrounded by black people, then suddenly all of these black people start popping up as persons of interest, or in a video of a person of interest.
- When all the people in a video are recognized as being in the vicinity of a person of interest, you suddenly get all these pings going off at the local police department, causing the police to run down there.
- That is likely to lead to a larger number of arrests, which is then likely to feed back into the data used to develop the systems. (A positive feedback loop!)
- A real-world example is how Facebook heavily influenced the genocide of the Rohingya people. Not because anybody at Facebook wanted it; they were really trying to create a product that people like, but not in a thoughtful enough way.
- When the generals in the Myanmar army who were literally throwing babies onto bonfires were actually asked, they said:
“We know that these are not humans. We know that they are animals, because we read the news. We read the internet.”
- Because these are the stories the algorithms are pushing.
- To summarize, we are part of the 0.3 to 0.5% of the world that knows how to code. We have a skill that very few other people do.
- Not only that, we now know how to code deep learning algorithms which is the most powerful kind of code.
- We should explicitly think about at least not making the world worse, and perhaps explicitly making it better.