FastAI Study Notes 6


Identity mapping: Returning the input without changing it at all. This process is performed by an identity function.

In a ResNet, we don’t actually proceed by first training a smaller number of layers, and then adding new layers on the end and fine-tuning. Instead, we use ResNet blocks like the one in <> throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train with SGD.

The ResNet paper actually proposed a variant of this, which is to instead “skip over” every second convolution.
A ResNet is, therefore, good at learning about slight differences between doing nothing and passing through a block of two convolutional layers (with trainable weights). This is how these models got their name: they’re predicting residuals (reminder: “residual” is prediction minus target).
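The skip connection can be sketched in a few lines of plain Python. Here `conv_block` is only a stand-in (an illustrative scaling, not real convolutions); the point is that the block's output is added back onto the unchanged input:

```python
# Toy sketch of a ResNet skip connection in plain Python. `conv_block`
# stands in for the two convolutional layers (here just a scaling).
def conv_block(x, weight):
    return [weight * v for v in x]        # placeholder for conv -> relu -> conv

def res_block(x, weight):
    # output = input + block(input): the block only learns the residual
    return [a + b for a, b in zip(x, conv_block(x, weight))]

# With the block's weight at 0 the whole thing is an identity mapping:
print(res_block([1.0, 2.0, 3.0], 0.0))  # -> [1.0, 2.0, 3.0]
```

When the block contributes nothing, the input passes through untouched, which is exactly what makes the network easy to train with SGD.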

Solving overfitting

Step one is to get to the point where you can overfit. Then the question is how to reduce that overfitting. Here is how we recommend prioritizing the steps from there.

Many practitioners, when faced with an overfitting model, start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularization. Using a smaller model should be absolutely the last step you take, unless training your model is taking up too much time or memory. Reducing the size of your model reduces the ability of your model to learn subtle relationships in your data.

Instead, your first step should be to seek to create more data. That could involve adding more labels to data that you already have, finding additional tasks that your model could be asked to solve (or, to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data by using more or different data augmentation techniques. Thanks to the development of Mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.

Once you’ve got as much data as you think you can reasonably get hold of, and are using it as effectively as possible by taking advantage of all the labels that you can find and doing all the augmentation that makes sense, if you are still overfitting you should think about using more generalizable architectures. For instance, adding batch normalization may improve generalization.

If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularization. Generally speaking, adding dropout to the last layer or two will do a good job of regularizing your model. However, as we learned from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help even more. Generally speaking, a larger model with more regularization is more flexible, and can therefore be more accurate than a smaller model with less regularization.

Only after considering all of these options would we recommend that you try using a smaller version of your architecture.

Back /Forward

to train a model, we will need to compute all the gradients of a given loss with respect to its parameters, which is known as the backward pass. The forward pass is where we compute the output of the model on a given input, based on the matrix products.
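The two passes can be shown on a toy one-parameter model y = w * x with squared-error loss (the model and names here are illustrative, not anything from fastai):

```python
# Toy forward and backward pass for the model y = w * x.
def forward(w, x):
    return w * x                          # forward pass: compute the prediction

def backward(w, x, target):
    # backward pass: gradient of the loss (w*x - target)^2 with
    # respect to the parameter w, i.e. 2 * (w*x - target) * x
    return 2 * (forward(w, x) - target) * x

print(forward(3.0, 2.0))         # -> 6.0
print(backward(3.0, 2.0, 10.0))  # 2 * (6 - 10) * 2 = -16.0
```

In a real network the forward pass is a chain of matrix products and the backward pass applies the chain rule through them, but the structure is the same.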

FastAI Study Notes 5

Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text.

Universal Language Model Fine-tuning (ULMFiT) approach, introduced in the paper by Jeremy Howard and Sebastian Ruder.

token: One element of a list created by the tokenization process. It could be a word, part of a word (a subword, e.g., as in Chinese word segmentation), or a single character.


Each step necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:

  • Tokenization:: Convert the text into a list of words (or characters, or substrings)
  • Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab
  • Language model data loader creation:: fastai’s LMDataLoader class automatically handles creating a dependent variable that is offset from the independent variable by one token.
  • Language model creation:: We need a special kind of model that does something we haven’t seen before: handle input lists which could be arbitrarily big or small. There are a number of ways to do this; a recurrent neural network (RNN) is one of them.
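The first three steps above can be sketched in a few lines of plain Python (a toy word-level version; fastai's `Tokenizer`, `Numericalize`, and `LMDataLoader` do the real work):

```python
# Toy word-level sketch of tokenization, numericalization, and the
# offset-by-one language-model targets.
text = "the cat sat on the mat"
tokens = text.split()                      # tokenization
vocab = sorted(set(tokens))                # the vocab: unique words
nums = [vocab.index(t) for t in tokens]    # numericalization
x, y = nums[:-1], nums[1:]                 # LM targets: offset by one token
print(vocab)  # -> ['cat', 'mat', 'on', 'sat', 'the']
print(x, y)
```

At every position, the model sees the tokens so far (`x`) and is asked to predict the next token (`y`).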

A neural network that is defined using a loop like this is called a recurrent neural network (RNN). An RNN is not a complicated new architecture, but simply a refactoring of a multilayer neural network using a for loop.
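That refactoring can be made concrete with a toy scalar version (the weights and the single multiply stand in for real linear layers and nonlinearities):

```python
# An RNN really is just a loop: the same weights are applied at every
# token, carrying a hidden state forward (toy scalar version).
def rnn(inputs, w_in=0.5, w_h=0.5):
    h = 0.0
    for x in inputs:              # one step per token, same weights each time
        h = w_in * x + w_h * h    # stand-in for a linear layer + nonlinearity
    return h

print(rnn([1.0, 2.0]))  # 0.5*1.0 -> 0.5, then 0.5*2.0 + 0.5*0.5 -> 1.25
```

Unrolling the loop gives an ordinary multilayer network with tied weights, which is why the input can be arbitrarily long.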


A feature is a transformation of the data which is designed to make it easier to model. Feature engineering: creating new transformations of the input data in order to make it easier to model.

A convolution requires nothing more than multiplication and addition, two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book! A convolution applies a kernel across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of the image. The convolution operation multiplies each element of the kernel by each element of a 3×3 block of the image; the results of these multiplications are then added together.

convolutions are just a type of matrix multiplication, with two constraints on the weight matrix: some elements are always zero, and some elements are tied (forced to always have the same value).  Convolutions are by far the most common pattern of connectivity we see in neural nets (along with regular linear layers, which we refer to as fully connected), but it’s likely that many more will be discovered.

FastAI Study Notes 4

Collaborative filtering (Recommendation system)
It works like this: look at what products the current user has used or liked, find other users who have used or liked similar products, and then recommend other products that those users have used or liked.

Weight decay, or L2 regularization, consists of adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.
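A minimal sketch of the penalty and its gradient contribution (plain Python, with a made-up base loss):

```python
# Weight decay: the penalty wd * sum(w^2) is added to the loss, and
# its gradient contribution 2 * wd * w nudges each weight toward zero.
def loss_with_wd(base_loss, weights, wd):
    return base_loss + wd * sum(w ** 2 for w in weights)

def wd_grad_term(w, wd):
    return 2 * wd * w          # extra gradient from the penalty

print(loss_with_wd(1.0, [3.0, 4.0], 0.1))  # 1.0 + 0.1 * 25 = 3.5
```

In practice, libraries fold the `2 * wd * w` term directly into the optimizer's update rather than modifying the loss.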

The biggest challenge with using collaborative filtering models in practice is the bootstrapping problem. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What products do you recommend to your very first user?

There is no magic solution to this problem, and really the solutions that we suggest are just variations of “use your common sense.”

Tabular modeling

Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value in one column based on the values in the other columns.

Modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:

  1. Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data (such as you might find in a database table at most companies)
  2. Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)

A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question the data at that part of the tree is split between a “yes” and a “no” branch, as shown in <>. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required.

The basic steps to train a decision tree can be written down very easily:

  1. Loop through each column of the dataset in turn.
  2. For each column, loop through each possible level of that column in turn.
  3. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
  4. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple “model” where our predictions are simply the average sale price of the item’s group.
  5. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
  6. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each by going back to step 1 for each group.
  7. Continue this process recursively, until you have reached some stopping criterion for each group—for instance, stop splitting a group further when it has only 20 items in it.
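Steps 3 to 5 above can be sketched by scoring one candidate split: predict each group's mean sale price and sum the squared errors (the years and prices below are made up for illustration):

```python
# Scoring one candidate split: predict each group's mean and sum the
# squared errors; the split with the lowest score wins.
def split_score(xs, ys, threshold):
    left  = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    def sse(group):
        if not group:
            return 0.0
        mean = sum(group) / len(group)
        return sum((y - mean) ** 2 for y in group)
    return sse(left) + sse(right)

years  = [1995, 1998, 2005, 2010]          # one candidate column
prices = [10.0, 12.0, 30.0, 32.0]          # sale prices to predict
print(split_score(years, prices, 2000))    # -> 4.0, the best split here
```

A real implementation loops this over every column and every candidate level, then recurses on the two resulting groups.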

Random forest

This procedure is known as “bagging.” In 1994, Berkeley professor Leo Breiman proposed:

  1. Randomly choose a subset of the rows of your data (i.e., “bootstrap replicates of your learning set”).
  2. Train a model using this subset.
  3. Save that model, and then return to step 1 a few times.
  4. This will give you a number of trained models. To make a prediction, predict using all of the models, and then take the average of each of those model’s predictions.
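The four steps can be sketched with the simplest possible "model," one that just predicts the mean of its bootstrap sample (everything here is a toy illustration of the procedure, not a real learner):

```python
# Bagging sketch: each "model" is the mean of a bootstrap sample of
# the targets; predictions from many such models are averaged.
import random

def bagged_predict(ys, n_models=100, seed=42):
    random.seed(seed)
    preds = []
    for _ in range(n_models):
        subset = random.choices(ys, k=len(ys))   # step 1: bootstrap replicate
        preds.append(sum(subset) / len(subset))  # steps 2-3: "train" and save
    return sum(preds) / len(preds)               # step 4: average predictions

ys = [1.0, 2.0, 3.0, 4.0]
print(round(bagged_predict(ys), 1))  # close to the full-data mean, 2.5
```

Each individual bootstrap mean is noisy, but averaging many of them gives an estimate close to the full-data mean, which is the insight behind bagging.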

It is based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors, therefore, is: zero! This is an extraordinary result—it means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of the data, and averaging its predictions.

In 2001 Leo Breiman went on to demonstrate that this approach to building models, when applied to decision tree building algorithms, was particularly powerful. He went even further than just randomly choosing rows for each model’s training: he also randomly selected from a subset of columns when choosing each split in each decision tree. He called this method the random forest.

A tabular model is simply a model that takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.

tabular modeling:

  • Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.
  • Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.
  • Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.

FastAI Study Notes 3

Mixup, introduced in the 2017 paper “mixup: Beyond Empirical Risk Minimization” by Hongyi Zhang et al., is a very powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model that was trained on data similar to your dataset.
Mixup works as follows, for each image:

  1. Select another image from your dataset at random.
  2. Pick a weight at random.
  3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.
  4. Take a weighted average (with the same weight) of this image’s labels with your image’s labels; this will be your dependent variable.
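The four steps above can be sketched for two tiny "images" (flat pixel lists) and their one-hot labels; in practice the partner image is drawn at random from the batch, but here both are passed in explicitly:

```python
# Mixup sketch: the same weighted average is applied to the pixels
# and to the one-hot labels.
import random

def mixup(img1, lbl1, img2, lbl2, lam=None):
    if lam is None:
        lam = random.random()          # step 2: pick a weight at random
    # steps 3-4: weighted average of pixels, and of labels
    x = [lam * a + (1 - lam) * b for a, b in zip(img1, img2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(lbl1, lbl2)]
    return x, y

x, y = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1], lam=0.3)
print([round(v, 2) for v in x], [round(v, 2) for v in y])  # -> [0.7, 0.3] [0.3, 0.7]
```

Because the labels are blended with the same weight as the pixels, the model learns to output soft, mixed predictions rather than hard one-hot ones.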

label smoothing

Instead, we could replace all our 1s with a number a bit less than 1, and our 0s by a number a bit more than 0, and then train. This is called label smoothing. By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better.
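One common recipe (the usual one, with `eps = 0.1` as an example choice) replaces the one-hot 1 with `1 - eps + eps/N` and each 0 with `eps/N`, where N is the number of classes:

```python
# Label-smoothing sketch: soften a one-hot target.
def smooth_labels(one_hot, eps=0.1):
    n = len(one_hot)
    return [y * (1 - eps) + eps / n for y in one_hot]

print([round(v, 3) for v in smooth_labels([0, 1, 0, 0])])  # -> [0.025, 0.925, 0.025, 0.025]
```

The targets still sum to 1, but the model is never asked to be perfectly confident.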

FastAI study notes 2

The approach to each one generally follows some basic principles. Here are a few guidelines:

  • Initialize:: We initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initializing them to the percentage of times that pixel is activated for that category—but since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well.
  • Loss:: This is what Samuel referred to when he spoke of testing the effectiveness of any current weight assignment in terms of actual performance. We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention).
  • Step:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it: increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount that works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out in which direction, and by roughly how much, to change each weight, without having to try all these small changes. The way to do this is by calculating gradients. This is just a performance optimization, we would get exactly the same results by using the slower manual process as well.
  • Stop:: Once we’ve decided how many epochs to train the model for (a few suggestions for this were given in the earlier list), we apply that decision. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time.

To summarize, at the beginning, the weights of our model can be random (training from scratch) or come from a pretrained model (transfer learning). In the first case, the output we will get from our inputs won’t have anything to do with what we want, and even in the second case, it’s very likely the pretrained model won’t be very good at the specific task we are targeting. So the model will need to learn better weights.

We begin by comparing the outputs the model gives us with our targets (we have labeled data, so we know what result the model should give) using a loss function, which returns a number that we want to make as low as possible by improving our weights. To do this, we take a few data items (such as images) from the training set and feed them to our model. We compare the model’s predictions with the corresponding targets using our loss function, and the score we get tells us how wrong our predictions were. We then change the weights a little bit to make them slightly better.

To find how to change the weights to make the loss a bit better, we use calculus to calculate the gradients. Let’s consider an analogy. Imagine you are lost in the mountains with your car parked at the lowest point. To find your way back to it, you might wander in a random direction, but that probably wouldn’t help much. Since you know your vehicle is at the lowest point, you would be better off going downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the learning rate to decide on the step size. We then iterate until we have reached the lowest point, which will be our parking lot, then we can stop.

Now that we have a loss function that is suitable for driving SGD, we can consider some of the details involved in the next phase of the learning process, which is to change or update the weights based on the gradients. This is called an optimization step.

In order to take an optimization step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much information, so it would result in a very imprecise and unstable gradient. That is, you’d be going to the trouble of updating the weights, but taking into account only how that would improve the model’s performance on that single item.

So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a mini-batch. The number of data items in the mini-batch is called the batch size. A larger batch size means that you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but it will take longer, and you will process fewer mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.
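One optimization step on a mini-batch can be sketched for the toy model y = w * x (the data and learning rate below are made up for illustration):

```python
# Mini-batch SGD sketch for y = w * x: average the per-item gradients
# over the batch, then step by learning rate times that average.
def sgd_step(w, batch, lr):
    grads = [2 * (w * x - y) * x for x, y in batch]  # per-item gradients
    return w - lr * sum(grads) / len(grads)          # the optimization step

batch = [(1.0, 2.0), (2.0, 4.0)]   # data generated by y = 2 * x
w = 0.0
for _ in range(50):                # repeat the step many times
    w = sgd_step(w, batch, lr=0.1)
print(round(w, 3))  # -> 2.0: the weight converges to the true value
```

Averaging over the batch is what gives the more stable gradient estimate described above; with a single item the per-step direction would be much noisier.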

Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time, so it’s helpful if we can give them lots of data items to work on. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory—making GPUs happy is also tricky!

A neural network contains a lot of numbers, but they are only of two types: numbers that are calculated, and the parameters that these numbers are calculated from. This gives us the two most important pieces of jargon to learn:

  • Activations:: Numbers that are calculated (both by linear and nonlinear layers)
  • Parameters:: Numbers that are randomly initialized, and optimized (that is, the numbers that define the model)

Our activations and parameters are all contained in tensors. These are simply regularly shaped arrays—for example, a matrix. Matrices have rows and columns; we call these the axes or dimensions. The number of dimensions of a tensor is its rank. There are some special tensors:

  • Rank zero: scalar
  • Rank one: vector
  • Rank two: matrix
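A quick way to see rank as "number of axes," using nested Python lists standing in for tensors:

```python
# Rank is just the number of axes (dimensions) of a tensor.
def rank(t):
    r = 0
    while isinstance(t, list):   # each level of nesting is one axis
        r += 1
        t = t[0]
    return r

print(rank(3.0), rank([1, 2, 3]), rank([[1, 2], [3, 4]]))  # -> 0 1 2
```

In PyTorch the same information is `tensor.ndim`.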

A neural network contains a number of layers. Each layer is either linear or nonlinear. We generally alternate between these two kinds of layers in a neural network. Sometimes people refer to both a linear layer and its subsequent nonlinearity together as a single layer. Yes, this is confusing. Sometimes a nonlinearity is referred to as an activation function.

ReLU
Function that returns 0 for negative numbers and doesn’t change positive numbers.
Mini-batch
A small group of inputs and labels gathered together in two arrays. A gradient descent step is updated on this batch (rather than a whole epoch).
Forward pass
Applying the model to some input and computing the predictions.
Loss
A value that represents how well (or badly) our model is doing.
Gradient
The derivative of the loss with respect to some parameter of the model.
Backward pass
Computing the gradients of the loss with respect to all model parameters.
Gradient descent
Taking a step in the directions opposite to the gradients to make the model parameters a little bit better.
Learning rate
The size of the step we take when applying SGD to update the parameters of the model.

FastAI study notes 1

The book defined parallel distributed processing as requiring:

  1. A set of processing units
  2. A state of activation
  3. An output function for each unit
  4. A pattern of connectivity among units
  5. A propagation rule for propagating patterns of activities through the network of connectivities
  6. An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit
  7. A learning rule whereby patterns of connectivity are modified by experience
  8. An environment within which the system must operate

Deep learning terminology for all the pieces we have discussed:

  • The functional form of the model is called its architecture (but be careful—sometimes people use model as a synonym of architecture, so this can get confusing).
  • The weights are called parameters.
  • The predictions are calculated from the independent variable, which is the data not including the labels.
  • The results of the model are called predictions.
  • The measure of performance is called the loss.
  • The loss depends not only on the predictions, but also the correct labels (also known as targets or the dependent variable); e.g., “dog” or “cat.”

Deep learning vocabulary

Label
The data that we’re trying to predict, such as “dog” or “cat”
Architecture
The _template_ of the model that we’re trying to fit; the actual mathematical function that we’re passing the input data and parameters to
Model
The combination of the architecture with a particular set of parameters
Parameters
The values in the model that change what task it can do, and are updated through model training
Fit
Update the parameters of the model such that the predictions of the model using the input data match the target labels
Train
A synonym for _fit_
Pretrained model
A model that has already been trained, generally using a large dataset, and will be fine-tuned
Fine-tune
Update a pretrained model for a different task
Epoch
One complete pass through the input data
Loss
A measure of how good the model is, chosen to drive training via SGD
Metric
A measurement of how good the model is, using the validation set, chosen for human consumption
Validation set
A set of data held out from training, used only for measuring how good the model is
Training set
The data used for fitting the model; does not include any data from the validation set
Overfitting
Training a model in such a way that it _remembers_ specific features of the input data, rather than generalizing well to data not seen during training
CNN
Convolutional neural network; a type of neural network that works particularly well for computer vision tasks

Google Natural Language API

Here are some notes about the Google Natural Language API that I wrote three years ago. Some features may have changed a bit; refer to the official documentation for the latest details and for the basic concepts.



The Google Cloud Natural Language API provides natural language understanding technologies to developers, including:

  • sentiment analysis – English
  • entity recognition – English, Spanish, and Japanese – in effect, analysis of nouns
  • syntax analysis – English, Spanish, and Japanese

The API has a separate call for each feature, or can do them all in one call: analyzeEntities, analyzeSentiment, annotateText.

sentiment analysis

Returns values denoting only the extent to which the text’s emotion is negative or positive.
The result contains a polarity and a magnitude value.

entity recognition

Finds the “entities” in the text – prominent named “things” such as famous individuals, landmarks, etc.
Returns the entities along with their URL/wiki references, etc.

syntactic analysis

Two things are returned from syntax analysis:
1. The sentences/sub-sentences of the input text.
2. The tokens (words) and their metadata in a grammatical syntax dependency tree.


Test Steps of commands ++++++++++++++++++++++++++++++++++

gcloud auth activate-service-account --key-file=/yourprojectkeyfile.json

gcloud auth print-access-token

print-access-token gives you a token for the following commands. I created three JSON files, one to test each feature, so I can use these commands to try the three APIs:

curl -s -k -H "Content-Type: application/json" \
-H "Authorization: Bearer ya29.CjBWA2oWnup6dVvAlv6NTJyLsDtfqdCx70tX6_J0H7KFngd1ual2Osd8gCpcc" \
-d @entity-request.json

curl -s -k -H "Content-Type: application/json" \
-H "Authorization: Bearer ya29.CjBWA2op6dVv_T7nAlv6NTJyLsDtfqdCx70tX6_J0H7KFngd1ual2Osd8gCpcc" \
-d @syntactic-request.json

curl -s -k -H "Content-Type: application/json" \
-H "Authorization: Bearer ya29.CjBWA2oWnup6_T7nAlv6NTJyLsDtfqdCx70tX6_J0H7KFngd1ual2Osd8gCpcc" \
-d @3in1-request.json

For how to create these JSON input files, please refer to the Google SDK docs.




Google Cloud Basic

I have been using some Google Cloud API features, like ASR, TTS with WaveNet, Translate, and Calendar, for some years. My experience is basically that they have clear product definitions and clear APIs. But they also have some products without many users that have had to fade out of the market. Compared with AWS, Google Cloud is more technically oriented, while AWS is good at facing commercial companies.

Here is a summary for some basic things to use google cloud:

  1. First, create an account and do some basic setup.
  2. At first, you get some free volume for some time for most APIs.
  3. Create a project in the account and download the project private key as a JSON file; you will use this key later.
  4. Then you can download the Google Cloud SDK; this SDK gives you the gcloud commands to do a lot of work.
  5. Many Google API features can be tested through gcloud on the command line. Normally you need to get a token through OAuth2 and use this token to call the API. The token will expire after some time.
    So from the command line you can do immediate tests of the API.
  6. If you use an API client package (Java, Python, JS, etc.), you need to set an environment variable pointing to your project key for the client package to run, like:
    export GOOGLE_APPLICATION_CREDENTIALS="/opt/googleSDK/myProject-ffe45577ca.json"
    Then the client can run OK. Please refer to each API’s docs for how to use it.
  7. When your free trial period ends, you just need to upgrade your account to a GCP billing account with proper billing information; then it will start charging. After you activate the upgrade, you do not need to update your project key JSON; the cloud side will automatically know your project is validated.
  8. In the management console, you can see the API call activity: calls per day, per minute, etc. The management console looks very complex, but in fact there are not too many things to manage there; it is easy to use.





Docker common commands


  • Install
apt-get update
apt-get install docker.io
usermod -a -G docker $USER
systemctl start docker
systemctl enable docker

Then log off and back on; use groups to check whether you are in the docker group:



  • build image:

create a Dockerfile file and then run

docker build -t your_image .

list images and containers

docker images
docker container ls --all

run a container and check the port:

docker run -d -p 8800:80 --name your_container your_image
docker port your_container

This runs the image as a detached container and maps port 80 of the container to host port 8800.

Then we can check that the new container based on ‘your_image’ is running. List the running containers:

docker ps

Or check all containers

docker ps -a
docker stop your_container
docker rm your_container
docker rmi your_image
Pull an image from a registry
docker pull alpine:3.4
Retag a local image with a new image name and tag
docker tag alpine:3.4 myrepo/myalpine:3.4
Log in to a registry (the Docker Hub by default)
docker login
Push an image to a registry
docker push myrepo/myalpine:3.4
Stop a running container through SIGTERM
docker stop my_container
Stop a running container through SIGKILL
docker kill my_container
Create an overlay network and specify a subnet
docker network create --subnet --gateway -d overlay mynet
List the networks
docker network ls
Delete all running and stopped containers
docker container prune
Or this command in old versions:
docker rm -f $(docker ps -aq)
Remove specific containers by ID:
docker rm 305297d7a235 ff0a5c3750b9
Create a new bash process inside the container and connect it to the terminal
docker exec -it my_container bash
Print the last 100 lines of a container’s logs
docker logs --tail 100 my_container
  • docker run – Runs a command in a new container.
    docker run hello-world
    docker run -it busybox sh


  • docker start – Starts one or more stopped containers
  • docker stop – Stops one or more running containers
  • docker build – Builds an image from a Dockerfile
  • docker pull – Pulls an image or a repository from a registry
docker pull busybox
docker pull ubuntu:16.04
  • docker push – Pushes an image or a repository to a registry
  • docker export – Exports a container’s filesystem as a tar archive
  • docker exec – Runs a command in a running container
  • docker search – Searches the Docker Hub for images
    docker search elasticsearch
    docker search mysql

  • docker attach – Attaches to a running container

  • docker commit – Creates a new image from a container’s changes
  • Watch the log
docker container logs my_container

Spring boot build version and time access

Spring Boot can create the build version info for you automatically, with a default bean to access this information. I found this link very helpful on this, and I just list the main points here.

  1. The build-info goal creates the version info


2. In Java code, just use BuildProperties to get the version info. You can then display it in the UI and logs.

@Autowired
BuildProperties buildProperties;


3. You can even add more properties in the POM to have them saved in build-info; refer to the link for the details.