These notes were typed out by me while watching the lecture, for quick revision later on. To fully understand them, kindly use the lesson's Jupyter notebooks in parallel with these notes.
- The course consists of 7 lessons, and the recommended study pattern is around 10 hours a week, so about 70 hours of DL practice overall.
- We will be using Jupyter notebooks, the Fastai library, and PyTorch for the course.
- Fastai can be used to solve problems in these four areas: Computer Vision, Natural Language Text, Tabular Data, and Collaborative Filtering.
A language model is a model that learns to predict the next word in a sentence.
WikiText-103 is a corpus built from some of the largest articles on Wikipedia; the pretrained model we start from is a language model trained on it.
A corpus is just a bunch of documents.
Does the language model work on informal text where people use near-illegible short forms?
Yes, it absolutely does. You can take the language model and fine-tune it with the target corpus.
The procedure we follow is to first fine-tune a language model, then use our training data to do transfer learning for the classifier, and then use the validation set to get our accuracy scores.
Language Model Creation
TextDataBunch performs Tokenization and Numericalization behind the scenes and out of the box.
Both TextDataBunch and TextClasDataBunch are Factory classes and need the text to be arranged in a certain format.
We can use the Data Block API (TextList) to get more flexibility with how our data is set up.
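A minimal sketch of building the language-model databunch with the Data Block API, assuming the IMDB folder layout from the lesson notebook (train/, test/ and unsup/ folders of text files under path):

```python
from fastai.text import *

data_lm = (TextList.from_folder(path)
           # use every bit of text available, including the unlabelled reviews
           .filter_by_folder(include=['train', 'test', 'unsup'])
           # hold out 10% of the texts at random for validation
           .split_by_rand_pct(0.1)
           # a language model labels itself: the target is simply the next word
           .label_for_lm()
           .databunch(bs=48))
```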
We are going to use that ‘knowledge’ of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset.
Because the English of the reviews people leave on IMDB isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model a little.
Plus there might be some words that are extremely common in the reviews dataset but barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.
We are fine-tuning our language model, so we can use all of our text data, i.e. both the train and test texts (note: not the labels!). We should use every bit of text available to familiarize our language model with the domain.
Now this text databunch can be passed into language_model_learner and fit. The result is a domain-specific language model.
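A sketch of that step (the exact epoch counts and learning rates here are illustrative, not prescriptive):

```python
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))    # train the new head first
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))   # then fine-tune the whole model
learn.save_encoder('fine_tuned_enc')             # keep the encoder for the classifier
```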
Creating a classifier
Step 1 is to again create a databunch
This time we have to separate the test set.
When creating the databunch, remember to pass in the same vocab (data_lm.vocab), so that the classifier uses exactly the same token-to-index mapping as the language model.
If you run out of memory, reduce batch size.
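A sketch of the classifier databunch, again assuming the IMDB layout (here the test folder doubles as the validation set):

```python
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)  # same vocab as the LM
             .split_by_folder(valid='test')
             .label_from_folder(classes=['neg', 'pos'])
             .databunch(bs=48))  # lower bs if you run out of GPU memory
```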
This time we create a text_classifier_learner.
drop_mult scales the amount of dropout (regularization): reduce it to prevent underfitting, increase it to prevent overfitting.
We load in the encoder that we saved before.
Freeze the learner object and fit one cycle on it.
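Roughly:

```python
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')  # the encoder saved from the language model
learn.freeze()
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))
```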
It turns out that in text classification, unfreezing one layer group at a time over the last few layers works much, much better. freeze_to(-2) unfreezes the last two layer groups.
moms are the momentums passed to fit_one_cycle; for training RNNs, lowering the momentum works really well, so moms are kept at (0.8, 0.7).
Why is the learning rate for the RNN divided by 2.6^4?
For NLP RNNs, discriminative learning rates spanning a factor of 2.6^4 between the lowest and the highest layer group were found to work best, i.e. we pass slice(lr/(2.6**4), lr).
Trivia: Jeremy found this out by running a Random Forest to predict the ideal hyperparameters and came to this rule of thumb.
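Putting the gradual unfreezing, the lowered momentums and the 2.6^4 ratio together (learning rates illustrative):

```python
learn.freeze_to(-2)  # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))

learn.freeze_to(-3)  # then one more
learn.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))

learn.unfreeze()     # finally the whole model
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7))
```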
Tabular data
- It is a misconception that neural networks cannot be used for tabular data; in practice they give pretty great results.
What are the 10% of cases where using Neural Nets for Tabular data is not a good option?
Always try both random forests and neural networks, and use whichever works better.
Categorical variables have to be converted to embeddings (each category gets a learned vector), while continuous variables can be fed in directly.
We do preprocessing of the dataframe. This can be done using procs (preprocessors such as FillMissing, Categorify, and Normalize).
The process, again, is the same: create a learner object and fit.
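A minimal sketch with hypothetical column names (the lesson uses the Adult census dataset; substitute your own dataframe's columns):

```python
from fastai.tabular import *

dep_var = 'salary'                                    # assumed target column
cat_names = ['workclass', 'education', 'occupation']  # assumed categorical columns
cont_names = ['age', 'education-num']                 # assumed continuous columns
procs = [FillMissing, Categorify, Normalize]          # the preprocessors ("procs")

data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))  # hold out some rows for validation
        .label_from_df(cols=dep_var)
        .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)
```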
How do we combine NLP (tokenized data) and metadata (tabular data) with fastai?
Conceptually, in the neural network you can have two different sets of inputs merging together into some layer.
You probably want the text going into an RNN, the image going into a CNN, and the metadata going into a tabular model, then have them all concatenated and passed through some fully connected layers, trained end to end.
Collaborative filtering
Collaborative filtering is where you have information about who bought what, or who liked what, so that you can predict whether a particular person would be interested in a particular product.
This usually runs into what is called the cold start problem: the time you most want to be good at recommending products is precisely when you have a new user or a new product.
The only way to solve this as of now is to use another model, one that is not a collaborative filtering model, for new users or new products; you can make guesses using their location, age, or sex.
Will the NLP model work when there are emojis, or when Hindi is written in Roman script?
The emoji case is a simple problem, since there are comparatively very few emojis; adding such a corpus for fine-tuning the language model will mostly give the result we want.
For the other problem, you can either map the vocab of the Hindi words directly to their English counterparts, or use the fact that it is the first layer that converts tokens into vectors: you can throw that layer away and fine-tune just the first layer.
Fastai is building a model zoo where they are adding more and more language models for different languages, and for different domains like medical texts, molecular data, musical notes, etc.
Looking at the source code of one of the learners, we find that it contains data, a model, and metrics.
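For reference, creating the collab learner is just as short as in the other applications; a sketch assuming ratings is a dataframe with user, item and rating columns:

```python
from fastai.collab import *

data = CollabDataBunch.from_df(ratings, seed=42)
# y_range slightly wider than the 0-5 ratings, since a sigmoid
# only reaches its extremes asymptotically
learn = collab_learner(data, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(3, 5e-3)
```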
The model is essentially a dot product of a user embedding and a movie embedding, summed with bias terms and then scaled.
Specifically, an embedding is a matrix of weights that you index into. In the example of movies and users, there is one embedding matrix for users and one for movies, and a particular user's or movie's embedding/weight vector is looked up by its index.
A particular user may simply like lots of movies, and a particular movie may be liked by everyone; to capture these effects we add bias terms.
The output is sometimes scaled as well; in this example (EmbeddingDotBias) it is passed through a sigmoid that is rescaled to the rating range.
Why do this step?
- If it is impossible for the model to predict too high or too low, it does not have to spend capacity learning to stay in the valid range, and can devote its energy to getting the predictions within that range right, giving better accuracy.
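A simplified sketch of what such a model looks like in plain PyTorch (modelled on fastai's EmbeddingDotBias, details trimmed):

```python
import torch
import torch.nn as nn

class EmbeddingDotBias(nn.Module):
    "Dot product of user/item embeddings plus biases, squashed into y_range."
    def __init__(self, n_factors, n_users, n_items, y_range=(0., 5.5)):
        super().__init__()
        self.y_range = y_range
        self.u_weight = nn.Embedding(n_users, n_factors)  # one vector per user
        self.i_weight = nn.Embedding(n_items, n_factors)  # one vector per movie
        self.u_bias = nn.Embedding(n_users, 1)  # "this user rates everything highly"
        self.i_bias = nn.Embedding(n_items, 1)  # "everyone likes this movie"

    def forward(self, users, items):
        dot = (self.u_weight(users) * self.i_weight(items)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.i_bias(items).squeeze(1)
        # the sigmoid keeps the raw score in (0, 1); rescale it into the rating range
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo
```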
What does the neural net contain?
- Neural nets are made of layers: every matrix, whether an input, an intermediate input, an output, or an intermediate output, is a layer.
- These layers fall into two kinds: parameters (weights) and activations.
Any matrix of weights, i.e. a weight matrix or a filter that gets multiplied with the input (or with the output of a previous multiplication), is called a parameter; parameters are what training learns.
The output of any matrix multiplication is called an activation.
This cover picture should make it clearer
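As a tiny concrete illustration of the same distinction in code (hypothetical shapes):

```python
import torch

x = torch.randn(64, 10)  # input: a batch of 64 examples with 10 features each
w = torch.randn(10, 5, requires_grad=True)  # parameter: a weight matrix, learned by gradient descent
a = torch.relu(x @ w)    # activation: the output of the matrix multiply (plus a nonlinearity)
```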