Deep Learning
Stanford CS230
Matt Deitke
Last updated: February 15, 2020
Acknowledgements
This set of notes follows the lectures from Stanford’s graduate-level course CS230: Deep
Learning. The course was taught by Andrew Ng and Kian Katanforoosh. The course
website is cs230.stanford.edu.
Contents

1 Introduction
  1.1 Deep learning
    1.1.1 What is a neural network?
    1.1.2 Supervised learning with neural networks
  1.2 Logistic regression as a neural network
    1.2.1 Notation
    1.2.2 Binary classification
    1.2.3 Logistic Regression
    1.2.4 Logistic regression cost function
    1.2.5 Gradient descent
    1.2.6 Computation graphs
    1.2.7 Derivatives with a computational graph
    1.2.8 Logistic regression gradient descent
    1.2.9 Gradient descent on m examples
  1.3 Python and vectorization
    1.3.1 Vectorization
    1.3.2 Vectorizing logistic regression
    1.3.3 Vectorizing logistic regression's gradient computation
    1.3.4 Broadcasting example
    1.3.5 A note on NumPy vectors
2 Neural networks
  2.1 Deep learning intuition
    2.1.1 Day'n'night classification
    2.1.2 Face verification
    2.1.3 Face recognition
    2.1.4 Art generation (neural style transfer)
    2.1.5 Trigger word detection
  2.2 Shallow neural networks
    2.2.1 Neural networks overview
    2.2.2 Neural network representations
    2.2.3 Computing a neural network's output
    2.2.4 Vectorizing across multiple examples
    2.2.5 Activation functions
    2.2.6 Why use non-linear activation functions?
    2.2.7 Derivatives of activation functions
    2.2.8 Gradient descent for neural networks
    2.2.9 Random Initialization
  2.3 Deep neural networks
    2.3.1 Forward propagation in a deep network
    2.3.2 Getting the matrix dimensions right
    2.3.3 Why deep representations?
    2.3.4 Forward and backward propagation
    2.3.5 Parameters vs hyperparameters
    2.3.6 Connection to the brain
3 Optimizing and structuring a neural network
  3.1 Full-cycle deep learning projects
  3.2 Setting up a machine learning application
    3.2.1 Training, development, and testing sets
    3.2.2 Bias and variance
    3.2.3 Combatting bias and variance
  3.3 Regularizing a neural network
    3.3.1 Regularization
    3.3.2 Why regularization reduces overfitting
    3.3.3 Dropout regularization
    3.3.4 Other regularization methods
  3.4 Setting up an optimization problem
    3.4.1 Normalizing inputs
    3.4.2 Data vanishing and exploding gradients
    3.4.3 Weight initialization for deep neural networks
    3.4.4 Gradient numerical approximations
    3.4.5 Gradient checking
  3.5 Optimization algorithms
    3.5.1 Mini-batch gradient descent
    3.5.2 Exponentially weighted averages
    3.5.3 Bias correction for exponentially weighted averages
    3.5.4 Gradient descent with momentum
    3.5.5 RMSprop
    3.5.6 Adam optimization algorithm
    3.5.7 Learning rate decay
    3.5.8 The problem with local optima
  3.6 Hyperparameter tuning
    3.6.1 Tuning process
    3.6.2 Using an appropriate scale to pick hyperparameters
    3.6.3 Hyperparameter tuning in practice: pandas vs caviar
  3.7 Batch normalization
    3.7.1 Normalizing activations in a network
    3.7.2 Fitting batch norm in a neural network
    3.7.3 Batch norm at test time
  3.8 Multi-class classification
    3.8.1 Softmax regression
    3.8.2 Training a softmax classifier
4 Applied deep learning
  4.1 Orthogonalization
  4.2 Setting up goals
    4.2.1 Single number evaluation metric
    4.2.2 Train/dev/test distributions
  4.3 Comparing to human-level performance
    4.3.1 Avoidable bias
    4.3.2 Understanding human-level performance
    4.3.3 Surpassing human-level performance
  4.4 Error analysis
    4.4.1 Cleaning up incorrectly labeled data
  4.5 Mismatched training and dev set
    4.5.1 Bias and variance with mismatched data distributions
    4.5.2 Addressing data mismatch
  4.6 Learning from multiple tasks
    4.6.1 Transfer learning
    4.6.2 Multi-task learning
5 Convolutional neural networks
  5.1 Edge detection
  5.2 Padding
  5.3 Strided convolutions
  5.4 Cross-correlation vs convolution
  5.5 Convolutions over volume
  5.6 One layer convolution network
  5.7 Pooling layers
  5.8 Why we use convolutions
  5.9 Classic networks
    5.9.1 LeNet-5
    5.9.2 AlexNet
    5.9.3 VGG-16
    5.9.4 ResNet
    5.9.5 1 × 1 convolution
    5.9.6 Inception network
  5.10 Competitions and benchmarks
6 Detection algorithms
  6.1 Object localization
  6.2 Landmark detection
  6.3 Object detection
  6.4 Sliding windows with convolution
  6.5 Bounding box predictions
  6.6 Intersection over Union (IoU)
  6.7 Non-max suppression
  6.8 Anchor boxes
  6.9 YOLO algorithm
7 Sequence models
  7.1 Recurrent neural networks
    7.1.1 Backpropagation through time
    7.1.2 Variational RNNs
    7.1.3 Language modeling
    7.1.4 Sampling novel sequences
    7.1.5 Vanishing gradients with RNNs
    7.1.6 GRU: Gated recurrent unit
    7.1.7 LSTM: Long short-term memory
    7.1.8 BRNNs: Bidirectional RNNs
    7.1.9 Deep RNNs
  7.2 Word embedding
  7.3 Word representation
    7.3.1 Using word embeddings
    7.3.2 Properties of word embeddings
Chapter 1
Introduction
Deep learning, and the algorithms behind it, have been around for decades. However, as we started to add data to neural networks, they started to perform much better than traditional machine learning algorithms. With advances in GPU computing and the amount of data we have available, training larger neural networks is easier than ever before. With more data, larger neural networks have been shown to outperform all other machine learning algorithms.
Figure 1.0.1: Performance as a function of the amount of data for large, medium, and small neural networks and traditional ML. With more data, neural networks have performed better than traditional machine learning.
Artificial intelligence can be broken down into a few subfields, including deep learning (DL), machine learning (ML), probabilistic graphical models (PGM), planning agents, search algorithms, knowledge representation (KR), and game theory. The only subfields whose performance has dramatically improved are deep learning and machine learning. We can see the rise of artificial intelligence in many industries today; however, there is still a long way to go. Merely using neural networks does not make a company an artificial intelligence company. The best AI companies are good at strategic data acquisition, putting data together, spotting automation opportunities, and writing job descriptions around technologies at the forefront of the field.
Figure 1.0.2: Performance over time for (a) DL and ML, (b) PGM, (c) planning agents, (d) search algorithms, (e) knowledge representation, and (f) game theory. The performance of deep learning and machine learning algorithms has been exploding in the last few years, in comparison to other branches of artificial intelligence.
1.1 Deep learning
1.1.1 What is a neural network?
The aim of a neural network is to learn representations that best predict the output y, given a set of features as the input x.
Figure 1.1.1: Simplest possible neural network: a single "neuron" that takes the size of a home (x) and predicts its price (y).
Figure 1.1.1 represents the simplest possible neural network. The neuron is the part of the neural network that tries to learn a function that maps x → y. Here, we only have one neuron, which is why it is the simplest case. We can make more complicated neural networks by stacking neurons. Consider the case where we not only have the size but also the number of bedrooms, the zip code, and wealth as well. We can represent the extra inputs inside of a neural network as shown in Figure 1.1.2.
Neural networks work well because we only need to feed the algorithm supervised data, without specifying what all the intermediate values may be, as we did in Figure 1.1.2 with family size, walkability, and schooling. Instead of connecting together only a few inputs, we can connect all of the inputs together. Layers with all inputs connected to the output are known as fully-connected layers. The general structure of a neural network with fully-connected layers can be seen in Figure 1.1.3.
Figure 1.1.2: In a neural network with more neurons, the intermediate connections between inputs may represent their own values (here, size, bedrooms, zip code, and wealth feed family size, walkability, and schooling, which feed the price y).
Figure 1.1.3: Neural network with fully-connected layers: an input layer (x_1, ..., x_4), a hidden layer, and an output layer producing y.
1.1.2 Supervised learning with neural networks
Almost all of the hype around machine learning has been centered around supervised learning. A supervised learning algorithm takes in a set of features and outputs a number or a set of numbers. Some examples can be seen in Table 1.1.
Input (x)            Output (y)                 Application
Home features        Price                      Real estate
Ad, user info        Click on ad? (0/1)         Online advertising
Image                Object (1, ..., 1000)      Photo tagging
Audio                Text transcript            Speech recognition
English              Chinese                    Machine translation
Image, radar info    Position of other cars     Autonomous driving

Table 1.1: Examples of supervised learning
As we have already seen, we can change the connections between layers to form different types of neural networks. Indeed, changing the structure of layers has led to many of the breakthroughs in deep learning, and we have formulated tasks that benefit from not using a neural network with fully-connected layers. For example, while for real estate and online advertising we may use a neural network with fully-connected layers, photo tagging may use convolutional neural networks (CNNs), sequenced data may use recurrent neural networks (RNNs), and for something more complicated like autonomous driving, we may use some custom hybrid neural network. Figure 1.1.4 shows the basic structure of different neural networks, although we will go more in-depth into how each works in this text.
Figure 1.1.4: Different types of neural network architectures
Supervised learning can use both structured and unstructured data. Structured data may come in the form of database entries, while unstructured data may be audio, image clips, or text. Historically, unstructured data tasks like image recognition have often been harder problems for computers to solve, while humans have little difficulty solving these types of problems. Neural networks have given computers a lot of help when interpreting unstructured data.
Figure 1.1.5: Structured vs unstructured data
1.2 Logistic regression as a neural network
1.2.1 Notation
Each training set will be composed of m training examples. We may denote m_train to be the number of training examples in the training set, and m_test to be the number of examples in the testing set. The ith input instance will be defined as (x^(i), y^(i)), where x^(i) ∈ ℝ^{n_x} and each y^(i) ∈ {0, 1}. To make notation more concise, we will use X to denote the matrix formed by

X = [ x^(1)  x^(2)  . . .  x^(m) ],   (1.2.1)

where X ∈ ℝ^{n_x × m}. Similarly, we define Y as

Y = [ y^(1)  y^(2)  . . .  y^(m) ],   (1.2.2)

where Y ∈ ℝ^{1 × m}.
1.2.2 Binary classification
A binary classifier takes some input and classifies that input into one of two categories, typically yes (1) or no (0). For example, an image of a cat may be classified as a cat or non-cat. We will use y to denote the output, as either a 0 or 1.
Figure 1.2.1: Colored images can be represented with the addition of red, green, and blue channels.
Each color can be represented using the colors red, green, and blue. Each pixel in a colored image can be represented as a tuple storing the amount of red, green, and blue inside that pixel. For images, we say that the red, green, and blue channels correspond to the amount of that color in each pixel, as shown in Figure 1.2.1. For an image of size n × m, we will define our input as a single column vector x that stacks the red values on the blue values on the green values:

x = [ red values ; blue values ; green values ].   (1.2.3)

We will use n_x to denote the input size of our vector. In this case, our input size is

n_x = 3nm,   (1.2.4)

since each channel has size (n × m) and there are 3 channels.
1.2.3 Logistic Regression
Given an input feature x ∈ ℝ^{n_x}, such as the pixels of an image, we would like to find an algorithm that makes a prediction ŷ, where

ŷ = P(y = 1 | x).   (1.2.5)

Inside of our algorithm we will define weights w ∈ ℝ^{n_x} and a bias term b ∈ ℝ, where our output is

ŷ = σ(w^T x + b).   (1.2.6)

The sigmoid function σ is used to transform the input so that 0 ≤ ŷ ≤ 1, where

σ(z) = 1 / (1 + e^{−z}).   (1.2.7)

We will denote the inner part of the sigmoid function as

z = w^T x + b.   (1.2.8)
Figure 1.2.2: Sigmoid function
1.2.4 Logistic regression cost function
Logistic regression is a supervised learning algorithm where we train on the training examples {(x^(1), y^(1)), . . . , (x^(m), y^(m))} and would like ŷ^(i) ≈ y^(i). While many machine learning algorithms use a squared error cost function, logistic regression does not work well when using squared error because many local optima arise. Instead, we will use the loss function

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ)).   (1.2.9)

Intuitively, consider the following two scenarios. When y = 1, we will end up with

L(ŷ, 1) = −log ŷ,   (1.2.10)

which is shown in Figure 1.2.3a. The second scenario occurs when y = 0. In that case, we will have

L(ŷ, 0) = −log(1 − ŷ),   (1.2.11)

which is shown in Figure 1.2.3b. When small values of ŷ are input, we will have a small loss, and when ŷ approaches 1, our estimates will be way off, producing a large loss.

The loss function is a measure of how well the classifier is doing on a single example. The cost function is defined as the average loss

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)).   (1.2.12)
Figure 1.2.3: A binary classifier's loss as a function of ŷ: (a) −log ŷ when y = 1, and (b) −log(1 − ŷ) when y = 0.
1.2.5 Gradient descent
For many logistic regression problems, our cost function will look bowl-shaped, which is known as a convex function, shown in Figure 1.2.4. With convex functions, there is a clear minimum of the function, and we would like to find the values of w and b that will put us at the minimum.
Figure 1.2.4: An example of what our cost function might look like with the inputs w and b.
For logistic regression, we generally start by initializing our weights to be zero. Gradient descent steps in the direction of steepest descent. If we start at some point that is not the minimum, we will move in the direction of the minimum, until we converge near the minimum. In one dimension, we can define the gradient descent algorithm as

w = w − α ∂J(w, b)/∂w   (1.2.13)
b = b − α ∂J(w, b)/∂b,   (1.2.14)

where we will first calculate both of the partial derivative terms and then store the results in the variables on the left-hand side.
1.2.6 Computation graphs
Consider the function

y(a, b, c) = 3(a + bc).   (1.2.15)

Computing the output requires three steps: first compute bc, then add that result to a, and then multiply the sum by 3. We can represent this function as the computational graph shown in Figure 1.2.5. Each forward propagation step in the computation graph produces a number that gets closer to the result.

Figure 1.2.5: Computation graph for 3(a + bc)
1.2.7 Derivatives with a computational graph
Figure 1.2.6: Computation graph for 3(a + bc) with values filled in: a = 5, b = 3, c = 2, u = bc = 6, v = a + u = 11, and J = 3v = 33.

Consider the updated computational graph in Figure 1.2.6. Now, suppose that we want to find how J changes with respect to v, that is, ∂J/∂v at v = 11. Since we know that J = 3v, we can take the derivative of the function, giving us

∂J/∂v = 3,   (1.2.16)

which in this case does not depend on the input at all. Now if we wish to find ∂J/∂a, we can use the chain rule to break it apart:

∂J/∂a = ∂J/∂v · ∂v/∂a.   (1.2.17)

Since we already calculated the value of ∂J/∂v = 3, we only need to find ∂v/∂a. Since v = a + u, differentiating both sides with respect to a will give us

∂v/∂a = 1.   (1.2.18)

Now, using Equation 1.2.17,

∂J/∂a = 3.   (1.2.19)

To find ∂J/∂u, we would use a similar process to how we found ∂J/∂a, and we will again end up with ∂J/∂u = 3. To find ∂J/∂b, we need

∂J/∂b = ∂J/∂v · ∂v/∂u · ∂u/∂b   (1.2.20)
      = 3 · ∂u/∂b.   (1.2.21)

Since u = bc, differentiating both sides with respect to b gives us ∂u/∂b = c, where c in this case is 2, so

∂J/∂b = 6.   (1.2.22)

A similar process would be carried out to find ∂J/∂c = 9.
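The forward and backward passes above fit in a few lines of Python. The following is a minimal sketch (not course-provided code); the variable names mirror Figure 1.2.6, and the finite-difference check at the end is one way to verify the chain-rule results.

    # Forward pass through the graph J = 3(a + b*c), mirroring Figure 1.2.6.
    a, b, c = 5.0, 3.0, 2.0
    u = b * c          # u = 6
    v = a + u          # v = 11
    J = 3 * v          # J = 33

    # Backward pass: apply the chain rule from the output back to the inputs.
    dJ_dv = 3.0                # Equation 1.2.16
    dJ_da = dJ_dv * 1.0        # dv/da = 1, so dJ/da = 3 (Equation 1.2.19)
    dJ_du = dJ_dv * 1.0        # dv/du = 1, so dJ/du = 3
    dJ_db = dJ_du * c          # du/db = c = 2, so dJ/db = 6 (Equation 1.2.22)
    dJ_dc = dJ_du * b          # du/dc = b = 3, so dJ/dc = 9

    # Numerical check of dJ/db with a small finite difference.
    eps = 1e-6
    J_eps = 3 * (a + (b + eps) * c)
    assert abs((J_eps - J) / eps - dJ_db) < 1e-3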
1.2.8 Logistic regression gradient descent
With our logistic regression computational graph defined in Figure 1.2.7, we can again work backward, calculating the changes in L with respect to each of the variables. Differentiating both sides of our loss function with respect to a gives us

∂L(a, y)/∂a = −y/a + (1 − y)/(1 − a).   (1.2.23)

We also find that

∂a/∂z = a(1 − a),   (1.2.24)

so

∂L/∂z = a − y.   (1.2.25)
Figure 1.2.7: Computation graph for logistic regression: inputs x_1, w_1, x_2, w_2, and b feed z = w_1 x_1 + w_2 x_2 + b, then a = σ(z), then L(a, y).
Now, backpropagating toward the inputs, we have

∂L/∂x_1 = (a − y) · w_1   (1.2.26)
∂L/∂x_2 = (a − y) · w_2   (1.2.27)
∂L/∂w_1 = (a − y) · x_1   (1.2.28)
∂L/∂w_2 = (a − y) · x_2   (1.2.29)
∂L/∂b = a − y.   (1.2.30)

In general, we can show

∂L/∂x_n = (a − y) · w_n   (1.2.31)
∂L/∂w_n = (a − y) · x_n.   (1.2.32)
1.2.9 Gradient descent on m examples

Recall that when we have m input instances to train on,

J(w, b) = (1/m) ∑_{i=1}^{m} L(a^(i), y^(i)),   (1.2.33)

where a^(i) = ŷ^(i) = σ(w^T x^(i) + b). If we differentiate J with respect to w_1, we will have

∂J(w, b)/∂w_1 = (1/m) ∑_{i=1}^{m} ∂L(a^(i), y^(i))/∂w_1.   (1.2.34)

In Equation 1.2.28, we calculated ∂L/∂w_1 to be (a − y) · x_1, so

∂J(w, b)/∂w_1 = (1/m) ∑_{i=1}^{m} (a^(i) − y^(i)) · x_1^(i).   (1.2.35)
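Putting Equations 1.2.25 through 1.2.35 together, one deliberately explicit (and slow) implementation of a single gradient descent step might look like the following sketch; the function and array names here are illustrative, not from the course.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gradient_step(w, b, X, y, alpha):
        """One unvectorized step of logistic regression gradient descent.
        X has shape (n_x, m); y has shape (m,); w has shape (n_x,)."""
        n_x, m = X.shape
        dw, db, J = np.zeros(n_x), 0.0, 0.0
        for i in range(m):                      # loop over the m examples
            a = sigmoid(w @ X[:, i] + b)        # a^(i) = sigma(w^T x^(i) + b)
            J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
            dw += (a - y[i]) * X[:, i]          # dL/dw = (a - y) * x  (Eq. 1.2.32)
            db += a - y[i]                      # dL/db = a - y        (Eq. 1.2.30)
        dw, db, J = dw / m, db / m, J / m       # average over the m examples
        return w - alpha * dw, b - alpha * db, J

The explicit loop over i is exactly what Section 1.3 will remove through vectorization.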
1.3 Python and vectorization
1.3.1 Vectorization
The aim of vectorization is to remove explicit for-loops in code. With deep learning, datasets can be massive, and it is vital to understand how vectorization works in order to improve algorithmic efficiency. Recall that we set z = w^T x + b. One implementation of this calculation would be the loop

z = ∑_{i=1}^{n_x} w_i x_i + b,   (1.3.1)

which would end up being very slow. In contrast, we can use NumPy to speed up the calculation, where the command will be

z = np.dot(w, x) + b.   (1.3.2)

Using built-in NumPy functions enables parallelism much better than writing out explicit for-loops. When using neural networks, whenever possible, avoid explicit for-loops. A few more examples of calculations that can be vectorized include the matrix-vector product Ax, which is also written as

Ax = np.dot(A, x).   (1.3.3)

To apply some operation, such as exponentiating, to each item in a vector v, write

v = np.exp(v).   (1.3.4)
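As a quick illustration of the speed difference (a sketch, not from the notes), one can time an explicit Python loop against np.dot:

    import time
    import numpy as np

    n = 1_000_000
    w, x = np.random.randn(n), np.random.randn(n)

    t0 = time.time()
    z_loop = sum(w[i] * x[i] for i in range(n))   # explicit for-loop
    t1 = time.time()
    z_vec = np.dot(w, x)                          # vectorized dot product
    t2 = time.time()

    print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.4f}s")
    assert abs(z_loop - z_vec) < 1e-6 * n         # same result, far faster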
1.3.2 Vectorizing logistic regression
For logistic regression, we need to find a^(i) = σ(z^(i)), m times, where z^(i) = w^T x^(i) + b. Instead of running explicit for-loops for each instance, we would like a way to vectorize the process of finding every value of a at once. Recall that we set

X = [ x^(1)  x^(2)  . . .  x^(m) ],   (1.3.5)

where X ∈ ℝ^{n_x × m}. We also know that there is a weight associated with each input feature, so w^T ∈ ℝ^{1 × n_x}. Carrying out the product gives

w^T X = [ w^T x^(1)  w^T x^(2)  · · ·  w^T x^(m) ].   (1.3.6)

Now, broadcasting b to ℝ^{1 × m}, taking the sum of w^T X and b gives us

w^T X + b = [ w^T x^(1) + b  w^T x^(2) + b  · · ·  w^T x^(m) + b ].   (1.3.7)

In order to implement this in NumPy, call

Z = np.dot(w.T, X) + b,   (1.3.8)

where b ∈ ℝ and Python will expand it to be a 1 × m matrix. The concept of expanding a number into a matrix in Python is known as broadcasting.

Similar to how X stores the instances x^(i), we will define the matrix

A = [ a^(1)  a^(2)  · · ·  a^(m) ].   (1.3.9)

With Z being a 1 × m matrix, we can apply the sigmoid function to each element of Z by calling

A = 1 / (1 + np.exp(−Z)).   (1.3.10)
1.3.3 Vectorizing logistic regression’s gradient computation
Recall from earlier that the change in b, which we will denote as db, was equal to dz. We then wanted to update db by the average of all of the nudges over the training examples. Instead of initializing db = 0, looping over and adding every instance of dz^(i), and dividing the result by m, we could instead compute

db = (1/m) ∑_{i=1}^{m} dz^(i)   (1.3.11)
   = np.sum(dZ) / m,   (1.3.12)

where dZ is a row matrix formed by the dz^(i) elements. To update the change in weights, recall that dw = (1/m) ∑_{i=1}^{m} x^(i) dz^(i), which can be expressed in vectorized form as

dw = np.dot(X, dZ.T) / m.   (1.3.13)
1.3.4 Broadcasting example
Consider the matrix

A = [ 56    0    4   68
       1  102   52    8
       2  135   99    1 ]   (1.3.14)

and suppose we want to find the percentage each element contributes to the sum of its column. For example, for A_11 this is 56/(56 + 1 + 2). To calculate the sum of each column, we would type

s = A.sum(axis = 0).   (1.3.15)

Axis 0 refers to the vertical columns inside of a matrix, while axis 1 refers to the horizontal rows in a matrix. Calling the method in Equation 1.3.15 will produce a vector of 4 elements; however, we would like to convert it from ℝ^4 to ℝ^{1×4}. To make this conversion, call

s = s.reshape(1, 4).   (1.3.16)

Now we can compute A/s in order to get the percentage each element contributes to the sum of its column. Dividing the matrix A ∈ ℝ^{3×4} by the row vector s ∈ ℝ^{1×4} will divide each of the three elements in a column by the value in the corresponding column of s. Since their sizes are different, this is known as broadcasting. A few other examples of broadcasting include adding a vector and a real number,

[ 1; 2; 3; 4 ] + 100 = [ 101; 102; 103; 104 ],   (1.3.17)

and adding a matrix in ℝ^{m×n} with a row vector in ℝ^{1×n},

[ 1 2 3 ; 4 5 6 ] + [ 100 200 300 ] = [ 101 202 303 ; 104 205 306 ].   (1.3.18)

Notice, when applying an operation between a matrix in ℝ^{m×n} and a vector in ℝ^{1×n}, the vector will expand to a matrix in ℝ^{m×n}, where each row has the same elements. Similarly, when combining a matrix in ℝ^{m×n} with a column vector in ℝ^{m×1}, the vector will expand to be in ℝ^{m×n}, where each column is identical. An example of a column vector expanding is

[ 1 2 3 ; 4 5 6 ] + [ 100 ; 200 ] = [ 101 102 103 ; 204 205 206 ].   (1.3.19)
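The examples above can be reproduced directly in NumPy (a short sketch; variable names are illustrative):

    import numpy as np

    A = np.array([[56, 0, 4, 68],
                  [1, 102, 52, 8],
                  [2, 135, 99, 1]])

    s = A.sum(axis=0)            # column sums, shape (4,)   (Eq. 1.3.15)
    s = s.reshape(1, 4)          # explicit (1, 4) row vector (Eq. 1.3.16)
    percent = 100 * A / s        # (3, 4) / (1, 4): s broadcasts down the rows

    v = np.array([[1], [2], [3], [4]]) + 100           # Equation 1.3.17
    M = np.array([[1, 2, 3], [4, 5, 6]])
    row = M + np.array([[100, 200, 300]])              # Equation 1.3.18
    col = M + np.array([[100], [200]])                 # Equation 1.3.19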
1.3.5 A note on NumPy vectors

Consider the example

a = np.random.randn(5).   (1.3.20)

Calling a.shape will return the shape (5,), which is considered a rank one array: neither a row vector nor a column vector. If we call a.T, nothing will be changed in the array, and it will still have the shape (5,). Instead of using a rank one array, it is recommended to use proper matrices when dealing with neural networks. If we wanted a matrix of size ℝ^{5×1} instead of the rank one array in Equation 1.3.20, we would call

a = np.random.randn(5, 1).   (1.3.21)

One good tip when dealing with a matrix of an unknown size is to first assert that it is the size we want, by calling

assert(a.shape == (5, 1)).   (1.3.22)
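A short interactive check of the rank-one pitfall (an illustrative sketch):

    import numpy as np

    a = np.random.randn(5)         # rank one array, shape (5,)
    print(a.shape)                 # (5,)
    print(a.T.shape)               # still (5,): transposing does nothing

    b = np.random.randn(5, 1)      # an explicit column vector
    print(b.shape)                 # (5, 1)
    print(b.T.shape)               # (1, 5): now the transpose is meaningful

    a = a.reshape(5, 1)            # convert the rank one array into a column
    assert a.shape == (5, 1)       # cheap guard against shape bugs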
Chapter 2
Neural networks
2.1 Deep learning intuition
A model for deep learning is based on an architecture and parameters. The architecture is the algorithmic design we choose, like logistic regression, linear regression, or a shallow neural network. The parameters are the weights and biases of the model, which takes in an instance as input and attempts to classify it correctly based on the parameters.
Figure 2.1.1: Where the model fits in and what can be adjusted. Model = Architecture + Parameters; the model maps an input to an output, from which a loss and gradients are computed. Things that can change: the activation function, optimizer, hyperparameters, loss function, input, output, and architecture.
Consider if we wanted to use a multi-class classifier that not only predicts if an image is a cat or not a cat, but instead predicts from the image whether the image is a cat, dog, or giraffe. Recall that when using a binary classifier our weights were the column vector

w = [ w_1 ; w_2 ; . . . ; w_{n_x} ],   (2.1.1)

and our labels were single numbers, y ∈ {0, 1}. Now, let us add two more columns to the weights, which will be the weights for the dog classifier and the giraffe classifier:

w = [ w_cats  w_dogs  w_giraffes ],   (2.1.2)

where each column holds the weights for one class. Our weights are now in ℝ^{n_x × 3}.
To update our labels, we have a few options. First, let us put y ∈ {0, 1, 2}, where 0 corresponds to a cat image, 1 corresponds to a dog image, and 2 corresponds to a giraffe image. The second option is one-hot encoding, where we will have an array with each index mapping to an animal. With one-hot encoding, only one item in each labeled array will have a value of 1 (the picture is of that animal), while the rest are 0 (the picture is not of that animal). For example, if we have an image of a cat, our label would be

[ 1  0  0 ]   (cat, dog, giraffe).   (2.1.3)

While both of these options would work well in most cases, if we have a picture of both a cat and a dog in the same image, our classifier will not work. Instead, we will use multi-hot encoding, which works similarly to one-hot encoding, with the difference being that multiple values can take on the value 1. So now if we have a picture of both a cat and a dog at the same time, we can encode our label as

[ 1  1  0 ]   (cat, dog, giraffe).   (2.1.4)
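For concreteness, here is a small sketch of how such labels might be built in NumPy (the helper names are hypothetical; the class ordering cat, dog, giraffe is the one used above):

    import numpy as np

    classes = ["cat", "dog", "giraffe"]

    def one_hot(label):
        """One-hot label: exactly one entry is 1."""
        y = np.zeros(len(classes))
        y[classes.index(label)] = 1
        return y

    def multi_hot(labels):
        """Multi-hot label: every class present in the image gets a 1."""
        y = np.zeros(len(classes))
        for label in labels:
            y[classes.index(label)] = 1
        return y

    print(one_hot("cat"))             # [1. 0. 0.]   (Equation 2.1.3)
    print(multi_hot(["cat", "dog"]))  # [1. 1. 0.]   (Equation 2.1.4)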
The jth neuron in the ith layer will be labeled a^[i]_j, as shown in Figure 2.1.2.

Figure 2.1.2: Notation for the course: inputs x^(i)_1, ..., x^(i)_4 feed a hidden layer of neurons a^[1]_1, ..., a^[1]_5, which feed the output neuron a^[2]_1 producing ŷ^(i).
As the layers increase, the activation neurons start to look at more complex items. For example, if we are working with a face classifier, the first layer might only detect edges, the second layer might put some of the edges together to find ears and eyes, and the third layer might detect the entire face. The process of extracting data from the neurons is known as encoding.
2.1.1 Day’n’night classification
Our goal is to create an image classifier that predicts, from a photo taken outside, whether it was taken "during the day" (0) or "during the night" (1). From the cat example, we needed about 10,000 images to create a good classifier, so we estimate this problem is about as difficult and that we will need about 10,000 images to classify everything correctly. We will split the data up into both a training and a testing set. In this case, about 80% of the data will be in the training set. In cases with more training data, we would give a smaller percentage to the testing set and a larger percentage to the training set. We also need to make sure that the data is not split randomly, but rather proportionally to the group the data is in (80% of day examples and 80% of night examples go in the training set).

For our input, we would like to work with images that have the lowest possible quality while still achieving good results. One clever way to choose the lowest possible quality is to find the lowest quality at which humans can still perfectly identify the images, and then use that quality. After comparing to human performance, we find that 64 × 64 × 3 pixels will work well as our image size.

Our output will be 0 for the day and 1 for the night. We will also use the sigmoid function as our last activation function because it maps a number in ℝ to be between 0 and 1. A shallow neural network, one with only one hidden layer, should work pretty well with this example because it is a fairly straightforward task. For our loss function, we will use the logistic loss

L(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ).   (2.1.5)
2.1.2 Face verification
Our goal is to find a way for a school to use face verification for validating student IDs in facilities such as dining halls, gyms, and pools. The data we will need is a picture of every student labeled with their name. The input will be the data from the cameras as well as who the person is claiming to be. In general, faces have more detail than the night sky, so we will need the input to be a larger size. Using the human test we used last time, we find a solid resolution to be 412 × 412 × 3 pixels. We will make the output 1 if it is the correct person at the camera and 0 if it is not.

To compare the image we have of a person in our database with the image we have from a camera, we want to find a function to call on both of the images in order to determine whether the person in the images is the same. We could directly compare the differences in pixels; however, that presents severe problems if the background lighting is different, the person grows facial hair, or the ID photo is outdated. Instead, we will use a deep network for our function, as illustrated in Figure 2.1.3.
Figure 2.1.3: The architecture we will use to classify images will be deep networks: the same deep network maps each input image to an encoding vector, and a small distance between the two encodings indicates the same person.
We will set the distance threshold to be some number, and for any distance less than that number we will classify the pair with y = 1. Because the school will not have tons of data on the pictures of students, we will use a public face dataset, where we want to find images of the same people to ensure that their encodings are similar, and images of different people to ensure their encodings are different. To train our network, we will feed it triplets. Each instance will have an anchor (the actual person), a positive example (the same person, whose encoding distance we minimize), and a negative example (a different person, whose encoding distance we maximize).

Figure 2.1.4: Ordered triplet input example: minimize the distance between anchor and positive, maximize the distance between anchor and negative.
For our loss function, we can use

L = ||Enc(A) − Enc(P)|| − ||Enc(A) − Enc(N)|| + α,   (2.1.6)

where ||·|| represents the L2 norm and Enc represents the encoding. The loss function presented makes sense because we want ||Enc(A) − Enc(P)|| to be minimized and ||Enc(A) − Enc(N)|| to be maximized. To minimize a maximized function, we use the negative of that function, which gives us the −||Enc(A) − Enc(N)|| term. The α term, called the margin, is used as a kickstart to our encoding function, to avoid the case where every weight in the deep network is zero and we end up with a perfect loss. It is common to keep loss functions positive, so we generally train on the maximum of the loss function and 0.
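A minimal NumPy sketch of the loss in Equation 2.1.6, assuming enc_a, enc_p, and enc_n are the encoding vectors produced by the network (hypothetical inputs, not the course's implementation):

    import numpy as np

    def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
        """Triplet loss of Equation 2.1.6, clipped at 0 as described above."""
        pos_dist = np.linalg.norm(enc_a - enc_p)   # anchor-positive (minimize)
        neg_dist = np.linalg.norm(enc_a - enc_n)   # anchor-negative (maximize)
        return max(pos_dist - neg_dist + alpha, 0.0)

The value 0.2 for the margin α is just a placeholder; the notes only say the margin must be positive.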
Figure 2.1.5: Updated model for face verification: the inputs are the anchor, positive, and negative images, and the loss is computed from their encodings Enc(A), Enc(P), and Enc(N).
2.1.3 Face recognition
In our school, we now want to use face identification to recognize students in facilities. Instead of just verifying who the person is given their ID, we need to identify the person out of many people. We will reuse a lot of the details from the face verification example, but instead of inputting 3 images and outputting 3 encodings, we will input one image and compute the encoding of that image. Then, we will compare the encoding against our database, which contains encodings for every student, and use a k-nearest neighbors algorithm to make comparisons. To go over the entire database, the runtime complexity will be O(n).

For another example, suppose we have thousands of pictures on our phone of 20 different people and we want to make groups of the same people in folders. Using the encoding of each image, we could run a k-means clustering algorithm to find groups within the vector space that correspond to the same people.
2.1.4 Art generation (neural style transfer)
Given a picture, our goal is to make it beautiful. Suppose we have any data that we want. Our input will contain both content and the style we consider beautiful. Then, once we input a new content image, we would like to generate a styled image based on the content.

Figure 2.1.6: Input for art generation: (a) a content image and (b) a style image.
The architecture we will use is first based on a model that understands images very well. The existing ImageNet models, for instance, are well-suited to an image understanding task. After the image is forward propagated through a few layers of the network, we will get information about the content of the image by looking at the neurons. We will call this information Content_C. Then, giving the style image to the deep network, we will use the Gram matrix to extract Style_S from the style image. More about the Gram matrix will be discussed later in the course.

The loss function we can define as

L = ||Content_C − Content_G|| + ||Style_S − Style_G||,   (2.1.7)

where the subscript G denotes the generated image.
2.1.5 Trigger word detection
Given a 10-second audio clip, we want to detect when the word "active" has been said. To create a model, we will need a lot of 10-second audio clips with as much variety in the voices as possible. We can break the input down into three segments: when "active" is being said, when a word that is not "active" is being said, and when nothing is being said. The sample rate of the input would be chosen similarly to how we found the resolution of images: determining the minimum viable rate that humans have no trouble with.
Figure 2.1.7: Art generation architecture: a pretrained deep network is used to compute the loss, and after many iterations the pixels of the generated image are updated using the gradients.
Figure 2.1.8: Input for the trigger word detection model. Green represents the time frame "active" is being said, red represents the time frame that non-"active" words are being said, and black represents the time frame nothing is being said.
The output will be a sequence of 0s and then a 1 after the word "active" is said. This output is generally easier to debug and works better for continuous models. The last activation we will use is a sigmoid, and the architecture we should use is a recurrent neural network. We can generate data for our model by collecting positive words, negative words, and background noise, and then adding different combinations of the data together to form 10-second clips.
2.2 Shallow neural networks
Figure 2.2.1: Shallow neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3, a hidden layer a^[1]_1, a^[1]_2, a^[1]_3, and an output neuron a^[2]_1 producing ŷ^(i).
2.2.1 Neural networks overview
In each layer of a neural network, we will calculate both the linear activation, and then map the linear activation to be between 0 and 1. To calculate the linear activation in the ith layer,

z^[i] = W^[i] x + b^[i].   (2.2.1)
Figure 2.2.2: Each neuron in an activation layer represents two computations: the linear weights computation plus the bias (z = w^T x + b), and then squeezing that number between 0 and 1 (a = σ(z)).
To calculate the non-linear activation in the ith layer, we would use the sigmoid function

a^[i] = σ(z^[i]).   (2.2.2)

Only once we get to the final layer do we calculate the loss.
2.2.2 Neural network representations
The hidden layer refers to data that is not seen in the training set. To denote the input layer x, we may also use the notation a^[0]. When we count neural network layers, we do not include the input layer in that count; however, we do include the output layer. We will also refer to the nodes in the hidden layers and output layer as neurons.
2.2.3 Computing a neural network’s output
In the shallow neural network pictured in Figure 2.2.5, for the hidden layer, we must calculate

z^[1]_1 = w^[1]T_1 x + b^[1]_1,   a^[1]_1 = σ(z^[1]_1)
z^[1]_2 = w^[1]T_2 x + b^[1]_2,   a^[1]_2 = σ(z^[1]_2)
z^[1]_3 = w^[1]T_3 x + b^[1]_3,   a^[1]_3 = σ(z^[1]_3).   (2.2.3)
Implementing the assignments in Equation 2.2.3 would require an inefficient for-loop. Instead, we will vectorize the process as follows:

z^[1] = [ — w^[1]T_1 — ]       [ x_1 ]   [ b^[1]_1 ]   [ w^[1]T_1 x + b^[1]_1 ]
        [ — w^[1]T_2 — ]   ·   [ x_2 ] + [ b^[1]_2 ] = [ w^[1]T_2 x + b^[1]_2 ]
        [ — w^[1]T_3 — ]       [ x_3 ]   [ b^[1]_3 ]   [ w^[1]T_3 x + b^[1]_3 ].   (2.2.4)

The dimensions of the weight matrix will be the number of neurons in the hidden layer by the number of inputs. The bias vector will be a column vector with the same number of rows as the weight matrix. We will call the weight matrix W^[1] and the bias vector b^[1]. For our activation, it follows that

a^[1] = [ a^[1]_1 ; a^[1]_2 ; a^[1]_3 ] = σ(z^[1]).   (2.2.5)
Putting both the values of z and a together, we now only have to calculate

z^[1] = W^[1] x + b^[1],   a^[1] = σ(z^[1]).   (2.2.6)

Instead of writing x, we will write a^[0] in its place, which gives us the assignments

z^[1] = W^[1] a^[0] + b^[1],   a^[1] = σ(z^[1]).   (2.2.7)

We prefer the notation a^[0] in this case, because it will make deeper layers more natural to reference. Now for the second layer, we can use a similar process to show

z^[2] = W^[2] a^[1] + b^[2],   a^[2] = σ(z^[2]).   (2.2.8)

From Figure 2.2.5, we see that there is only 1 neuron in activation layer 2, with 3 inputs. This implies that the dimensions of the weight matrix are 1 × 3, multiplied by the 3 × 1 input matrix and added to the 1 × 1 bias term.
2.2.4 Vectorizing across multiple examples
Previously, we have only calculated the values from a single input x. For m training examples, we would be required to run a loop over every single example in order to calculate each corresponding a^[2](i) value, where the index i denotes the ith training example. The corresponding for-loop is shown in Algorithm 1.
Algorithm 1 Forward propagation
for i ∈ {1, . . . , m} do
    z^[1](i) = W^[1] a^[0](i) + b^[1]
    a^[1](i) = σ(z^[1](i))
    z^[2](i) = W^[2] a^[1](i) + b^[2]
    a^[2](i) = σ(z^[2](i))
end for
Earlier, we defined the matrix X to store all of the input instances as

X = [ x^(1)  . . .  x^(m) ],   (2.2.9)

where X ∈ ℝ^{n_x × m}. We now instead define

Z^[1] = W^[1] X + b^[1],   (2.2.10)

where we broadcast the column vector b to be a matrix with m columns. It follows that

A^[1] = σ(Z^[1]).   (2.2.11)

Now, the columns of Z and A will hold the per-example values:

Z^[1] = [ z^[1](1)  . . .  z^[1](m) ],   A^[1] = [ a^[1](1)  . . .  a^[1](m) ].   (2.2.12)

Horizontal indices correspond to different training examples, and vertical indices correspond to different neurons in the hidden unit. The complete vectorization across multiple examples will be:

Z^[1] = W^[1] A^[0] + b^[1]
A^[1] = σ(Z^[1])
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = σ(Z^[2]).   (2.2.13)
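Equation 2.2.13 translates almost line for line into NumPy. This sketch assumes sigmoid activations in both layers and the parameter shapes discussed in Section 2.3.2; the function name is illustrative:

    import numpy as np

    def sigmoid(Z):
        return 1 / (1 + np.exp(-Z))

    def forward(X, W1, b1, W2, b2):
        """Vectorized forward pass of Equation 2.2.13.
        X is (n_x, m); W1 is (n1, n_x); b1 is (n1, 1); W2 is (n2, n1); b2 is (n2, 1)."""
        Z1 = np.dot(W1, X) + b1     # b1 broadcasts across the m columns
        A1 = sigmoid(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)
        return Z1, A1, Z2, A2       # cache Z and A for backpropagation later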
2.2.5 Activation functions
Previously, when forward propagating, we used σ as our activation function, although that is just one choice. Recall that the graph of

σ(z) = 1 / (1 + e^{−z})   (2.2.14)

in Figure 2.2.3 was used to map z from any number in ℝ to a number between 0 and 1.

Figure 2.2.3: Sigmoid function
Let us denote our activation function as g, so that

a^[i] = g(z^[i]).   (2.2.15)

Figure 2.2.4: Hyperbolic tangent function as an activation function
An alternative to the traditional sigmoid function is a sigmoid with a range from −1 to 1, which can be thought of as dragging the bottom of the sigmoid from the line at 0 down to the line at −1. After working through the calculations, we would find that this function is the hyperbolic tangent function, so our new activation function could be

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).   (2.2.16)

Generally, the hyperbolic tangent function tends to outperform the sigmoid function because it has the effect of centering the activations around 0. For the last layer, however, it is still common to use the sigmoid function to give a better interpretation of the output from a probabilistic perspective. Since different layers may have different activation functions, we will refer to the activation function in the ith layer as g^[i].
As we approach larger and smaller inputs for both the sigmoid and hyperbolic tangent functions, it is clear that the rate of change is decreasing drastically and approaching no rate of change at the asymptotes. The rectified linear unit (ReLU) activation function is often used to solve this problem at the infinities. The ReLU function is defined as

ReLU(z) = max(0, z),   (2.2.17)

which gives the plot shown in Figure 2.2.6. The rate of change of the ReLU function will be 1 if z > 0 and 0 if z < 0. It is quite improbable that z = 0 exactly, in which case our derivative is undefined, and we should manually set the result to either 0 or 1 in that case.

Figure 2.2.5: Different layers in the neural network may have different activation functions (here, tanh in the hidden layer and σ at the output).
Figure 2.2.6: ReLU activation function
For binary classification problems, where y ∈ {0, 1}, it is common practice to use the sigmoid activation function in the last layer and the ReLU activation function everywhere else. We prefer the ReLU activation function in practice because its derivatives are often much easier and faster to handle.

The last activation function is the leaky ReLU, which behaves similarly to the ReLU function, with the exception being how negative numbers are handled. Instead of making all negative numbers 0, the leaky ReLU uses some scaling factor multiplied by z. The most common scaling factor is 0.01, which corresponds to the leaky ReLU function

leaky ReLU(z) = max(0.01z, z).   (2.2.18)

In practice, it is often hard to determine which activation function should be used. It is quite common to test different combinations of activation functions in order to find which one works the best.
2.2.6 Why use non-linear activation functions?
Consider the shallow neural network pictured in Figure 2.2.8. If we used a linear activation function, that is,

g^[i](z^[i]) = z^[i],   (2.2.19)
Figure 2.2.7: Leaky ReLU activation function

Figure 2.2.8: Different layers in the neural network may have different activation functions (here, linear activations in the hidden layer and σ at the output).
then we have

a^[1] = z^[1] = W^[1] x + b^[1]
a^[2] = z^[2] = W^[2] a^[1] + b^[2].   (2.2.20)

Substituting W^[1] x + b^[1] in for a^[1] in the second equation, we have

a^[2] = W^[2] (W^[1] x + b^[1]) + b^[2].   (2.2.21)

Expanding out the matrix product, we have

(W^[2] W^[1]) x + (W^[2] b^[1] + b^[2]),   (2.2.22)

which is a linear function of x. Since this is a linear hidden layer, stacking deeper and deeper sets of linear weights on top of each other will never change our function from being linear, so the extra weights end up being useless. When using a linear activation function for the hidden-layer calculations, the entire network tends to behave just like a logistic regression model.

In the case of linear regression where ŷ ∈ ℝ, such as when housing prices are being predicted, we may use a linear function in the output layer of the neural network in order to map to the set of real numbers, but other than that case, linear activation functions should be avoided.
2.2.7 Derivatives of activation functions
With a sigmoid activation function

g(z) = 1 / (1 + e^{−z}),   (2.2.23)

we can write the derivative in terms of the original function using some clever algebraic manipulation that follows:

∂σ/∂z = e^{−z} / (1 + e^{−z})^2   (2.2.24)
      = [e^{−z} / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.25)
      = [(1 + e^{−z} − 1) / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.26)
      = [(1 + e^{−z}) / (1 + e^{−z}) − 1 / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.27)
      = (1 − σ) · σ.   (2.2.28)
When using the hyperbolic tangent function as our activation function, we have

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).   (2.2.29)

The derivative of the hyperbolic tangent function can also be written in terms of its input:

∂tanh(z)/∂z = 1 − tanh^2(z).   (2.2.30)

For the ReLU function

g(z) = max(0, z),   (2.2.31)

the derivative is

g′(z) = { 1  if z > 0 ;  0  if z < 0 }.   (2.2.32)

In the rare case where we have g′(z = 0), the derivative is technically undefined, although in practice we choose a value of either 0 or 1. The choice tends to be very insignificant.

For the leaky ReLU function

g(z) = max(0.01z, z),   (2.2.33)

the derivative is

g′(z) = { 1  if z > 0 ;  0.01  if z < 0 }.   (2.2.34)

We will treat the case where z = 0 for the leaky ReLU similarly to the case where z = 0 for the ReLU, choosing either 0.01 or 1.
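All four activation functions and their derivatives fit into a few lines of NumPy. This is a sketch; the z = 0 cases fall into the z ≤ 0 branches here, which is one of the allowed choices discussed above:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1 - s)                       # Equation 2.2.28

    def d_tanh(z):
        return 1 - np.tanh(z) ** 2               # Equation 2.2.30

    def relu(z):
        return np.maximum(0, z)                  # Equation 2.2.31

    def d_relu(z):
        return (z > 0).astype(float)             # Equation 2.2.32; 0 chosen at z = 0

    def leaky_relu(z):
        return np.maximum(0.01 * z, z)           # Equation 2.2.33

    def d_leaky_relu(z):
        return np.where(z > 0, 1.0, 0.01)        # Equation 2.2.34; 0.01 chosen at z = 0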
2.2.8 Gradient descent for neural networks
For a shallow neural network with one input layer, one hidden layer, and one output layer, we will have the parameters W^[1] ∈ ℝ^{n^[1] × n^[0]}, b^[1] ∈ ℝ^{n^[1] × 1}, W^[2] ∈ ℝ^{n^[2] × n^[1]}, and b^[2] ∈ ℝ^{n^[2] × 1}. So far we have only seen examples where n^[2] = 1. We are also denoting n^[0] as n_x, and n^[i] as the number of nodes in the ith layer.

For now, assume we are doing binary classification with the cost function

J(W^[1], b^[1], W^[2], b^[2]) = (1/m) ∑_{i=1}^{m} L(ŷ, y),   (2.2.35)

where ŷ = a^[2] in this case and our loss function is

L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ).   (2.2.36)

We then make updates to our parameters with the gradient descent algorithm shown in Algorithm 2.
Figure 2.2.9: Each neuron can be broken into two parts: the linear activation z and then the non-linear activation a.
Algorithm 2 Gradient descent
repeat
    Compute the predictions ŷ^(i) for i ∈ {1, . . . , m}
    Compute the derivatives dW^[ℓ], db^[ℓ] of the parameters with respect to J
    Update the weights and biases with W^[ℓ] = W^[ℓ] − α dW^[ℓ] and b^[ℓ] = b^[ℓ] − α db^[ℓ]
until the cost function J converges
Figure 2.2.10: Computation graph for logistic regression
Recall that when we were using logistic regression, we had the computational graph in Figure 2.2.10, where the red arrows represent forward propagation and the blue arrows represent backward propagation. We previously calculated

∂L/∂a = −y/a + (1 − y)/(1 − a)   (2.2.37)
∂L/∂z = ∂L/∂b = a − y   (2.2.38)
∂L/∂x = w(a − y)   (2.2.39)
∂L/∂w = x(a − y).   (2.2.40)

In the two-layer computational graph pictured in Figure 2.2.11, it is clear that a^[2], z^[2], W^[2], and b^[2] will have rates of change with respect to the loss function similar to those of a, z, w, and b in logistic regression, so we have the derivatives

∂L/∂a^[2] = −y/a^[2] + (1 − y)/(1 − a^[2])   (2.2.41)
∂L/∂z^[2] = a^[2] − y   (2.2.42)
∂L/∂b^[2] = a^[2] − y   (2.2.43)
∂L/∂W^[2] = (a^[2] − y) · a^[1]T.   (2.2.44)
Figure 2.2.11: Two-layer neural network computational graph: W^[1], x, and b^[1] feed z^[1] = W^[1] x + b^[1] and a^[1] = σ(z^[1]); then W^[2] and b^[2] feed z^[2] = W^[2] a^[1] + b^[2] and a^[2] = σ(z^[2]), which feeds L(a^[2], y).
We can calculate ∂L/∂a^[1] from the chain rule:

∂L/∂a^[1] = ∂L/∂z^[2] · ∂z^[2]/∂a^[1]   (2.2.45)
          = W^[2]T (a^[2] − y).   (2.2.46)

Going backward to ∂L/∂z^[1], we can use the result for ∂L/∂a^[1]:

∂L/∂z^[1] = ∂L/∂a^[1] · ∂a^[1]/∂z^[1]   (2.2.47)
          = (W^[2]T (a^[2] − y)) ∘ (a^[1] ∘ (1 − a^[1])),   (2.2.48)

using ∂a^[1]/∂z^[1] = a^[1](1 − a^[1]) from Equation 1.2.24. Now we can get back to the weight and bias terms in the first layer, where the rate of change with respect to W^[1] is

∂L/∂W^[1] = ∂L/∂z^[1] · ∂z^[1]/∂W^[1]   (2.2.49)
          = [W^[2]T (a^[2] − y) ∘ a^[1] ∘ (1 − a^[1])] · x^T,   (2.2.50)

and the rate of change with respect to b^[1] is

∂L/∂b^[1] = ∂L/∂z^[1] · ∂z^[1]/∂b^[1]   (2.2.51)
          = W^[2]T (a^[2] − y) ∘ a^[1] ∘ (1 − a^[1]).   (2.2.52)
What we have written is for the case of one input example; however, we can extend to all the input examples by using vectorization, where

∂L/∂Z^[2] = A^[2] − Y.   (2.2.53)

For ∂L/∂W^[2], we will need to consider the cost function inside of our rate of change, so we get

∂L/∂W^[2] = (1/m) · ∂L/∂Z^[2] · A^[1]T.   (2.2.54)

We will write ∂L/∂b^[2] using the NumPy notation

∂L/∂b^[2] = (1/m) · np.sum(∂L/∂Z^[2], axis = 1, keepdims = True).   (2.2.55)

We will have ∂L/∂Z^[1] ∈ ℝ^{n^[1] × m}, where

∂L/∂Z^[1] = W^[2]T · ∂L/∂Z^[2] ∘ g^[1]′(Z^[1]),   (2.2.56)

where g^[1] is the activation function in layer 1 and ∘ is the element-wise product operator. Finally, the changes to the weights and bias in layer 1 will be

∂L/∂W^[1] = (1/m) · ∂L/∂Z^[1] · X^T   (2.2.57)

and

∂L/∂b^[1] = (1/m) · np.sum(∂L/∂Z^[1], axis = 1, keepdims = True).   (2.2.58)
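Equations 2.2.53 through 2.2.58 correspond to the following NumPy sketch for a two-layer network with a sigmoid output. This is an illustration, assuming the cached forward-pass values from the forward sketch after Equation 2.2.13 and an activation-derivative function such as d_tanh from Section 2.2.7:

    import numpy as np

    def backward(X, Y, Z1, A1, A2, W2, d_g1):
        """Vectorized gradients for a two-layer network.
        d_g1 is the derivative of the layer 1 activation, e.g. d_tanh."""
        m = X.shape[1]
        dZ2 = A2 - Y                                          # Equation 2.2.53
        dW2 = np.dot(dZ2, A1.T) / m                           # Equation 2.2.54
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m          # Equation 2.2.55
        dZ1 = np.dot(W2.T, dZ2) * d_g1(Z1)                    # Equation 2.2.56
        dW1 = np.dot(dZ1, X.T) / m                            # Equation 2.2.57
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m          # Equation 2.2.58
        return dW1, db1, dW2, db2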
Figure 2.2.12: Two-layer neural network with inputs x_1, x_2, hidden neurons a^[1]_1, a^[1]_2, and output neuron a^[2]_1 producing ŷ^(i).
2.2.9 Random Initialization
Consider if we initialized

W^[1] = [ 0  0 ; 0  0 ],   (2.2.59)

then a^[1]_1 = a^[1]_2. If two neurons in the same layer are equal, then they are going to have the same rate of change and the same update from the gradient. With the same changes after each iteration, the two neurons will act as if they are the same, which means having two neurons is pointless since it does not add additional information. We call this error the symmetry breaking problem.

The solution to the problem is to initialize small non-zero random weights, where we initialize

W^[1] = np.random.randn(2, 2) * 0.01.   (2.2.60)

We use the 0.01 factor to make sure that our inputs are very small, since our hyperbolic tangent and sigmoid functions have more meaningful values near 0, and our function converges faster.

It turns out that our bias term b does not have a symmetry problem, so it is alright if we initialize it to be the zero column vector

b^[1] = np.zeros((2, 1)).   (2.2.61)

We can initialize the rest of the weights and bias terms in a similar fashion.
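A sketch of initializing every layer this way, given a list of layer sizes (the helper name is hypothetical, following Equations 2.2.60 and 2.2.61):

    import numpy as np

    def initialize_parameters(layer_sizes, scale=0.01):
        """layer_sizes = [n_x, n1, ..., nL]; small random weights, zero biases."""
        params = {}
        for l in range(1, len(layer_sizes)):
            params["W" + str(l)] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * scale
            params["b" + str(l)] = np.zeros((layer_sizes[l], 1))
        return params

    params = initialize_parameters([3, 4, 1])   # e.g. the network in Figure 2.2.1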
2.3 Deep neural networks
Figure 2.3.1: 4-layer neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3, hidden layers of 4, 4, and 3 neurons, and an output neuron a^[4]_1 producing ŷ^(i).
For deep neural networks, we will use L to denote the number of layers and n^[ℓ] to denote the number of nodes in the ℓth layer. For example, in Figure 2.3.1, we have L = 4, n^[1] = n^[2] = 4, and n^[3] = 3. To refer to the activations in each layer, we will use a^[ℓ], and to refer to the weights in each layer we will use W^[ℓ]. Similarly, the linear activation and bias in each layer will be z^[ℓ] and b^[ℓ], respectively. We can refer to the input as ℓ = 0 and the output as ℓ = L.
2.3.1 Forward propagation in a deep network
To propagate a training example forward, using the network in Figure 2.3.1, we would use a similar process to how we forward propagated shallow neural networks, but now we will extend to more layers, calculating

z^{[1]} = w^{[1]}a^{[0]} + b^{[1]},  a^{[1]} = g^{[1]}(z^{[1]})
z^{[2]} = w^{[2]}a^{[1]} + b^{[2]},  a^{[2]} = g^{[2]}(z^{[2]})
z^{[3]} = w^{[3]}a^{[2]} + b^{[3]},  a^{[3]} = g^{[3]}(z^{[3]})
z^{[4]} = w^{[4]}a^{[3]} + b^{[4]},  a^{[4]} = g^{[4]}(z^{[4]}).    (2.3.1)
In general, it is easy to see that

z^{[ℓ]} = w^{[ℓ]}a^{[ℓ−1]} + b^{[ℓ]},  a^{[ℓ]} = g^{[ℓ]}(z^{[ℓ]}),    (2.3.2)

for 1 ≤ ℓ ≤ L. For an entire training set, instead of a single training instance, we could use

Z^{[ℓ]} = W^{[ℓ]}A^{[ℓ−1]} + b^{[ℓ]},  A^{[ℓ]} = g^{[ℓ]}(Z^{[ℓ]}).    (2.3.3)
While we would prefer to never use for-loops, using a for loop to iterate over each layer is
the best way to forward propagate.
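A minimal sketch of that per-layer for loop, assuming a params dictionary like the one above and, for simplicity, a sigmoid activation at every layer:

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, params, L):
    # X plays the role of A^{[0]}; caches hold each Z^{[l]} for backprop
    A = X
    caches = []
    for l in range(1, L + 1):
        Z = params["W" + str(l)] @ A + params["b" + str(l)]  # (2.3.3)
        A = sigmoid(Z)
        caches.append(Z)
    return A, caches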
2.3.2 Getting the matrix dimensions right
Figure 2.3.2: Network for the matrix dimensions walkthrough (two inputs x₁, x₂ and one hidden layer with three units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3).
One of the easiest mistakes to make in deep learning is calculating the dimensions of the matrices incorrectly. In this section, we are going to walk through how we can think about the dimensions and make sure they are correct at each step. Consider the neural network in Figure 2.3.2, where our goal is to form W^{[1]}, b^{[1]}, and x in order to form

z^{[1]} = W^{[1]}x + b^{[1]}.    (2.3.4)
We know that input features correspond to the rows of our matrix and different training examples correspond to different columns. So, if each instance has two input features, that means each instance is in R^{2×1}. Further, since n^{[1]} = 3, we know that for each instance, z^{[1]} ∈ R^{3×1}. So, we currently have the dimensions for the equation to be

(3 × 1) = W^{[1]}(2 × 1) + b^{[1]}.    (2.3.5)

From matrix multiplication (without broadcasting), W^{[1]} must have 2 columns to match the number of rows of x. We also know that when adding matrices the dimensions must match, so b^{[1]} must be in the same dimensional space as z^{[1]} and W^{[1]}x, which means b^{[1]} ∈ R^{3×1}.
Now, from properties of matrix multiplication, W^{[1]}x must have W^{[1]} ∈ R^{3×2}. The matrix a will always be in the same dimensional space as z, since a only applies element-wise operations to the elements of z. More generally, the sizes for each item in our neural network will be

z^{[ℓ]}, a^{[ℓ]} ∈ R^{n^{[ℓ]}×m}    (2.3.6)
W^{[ℓ]} ∈ R^{n^{[ℓ]}×n^{[ℓ−1]}}    (2.3.7)
b^{[ℓ]} ∈ R^{n^{[ℓ]}×1}.    (2.3.8)
The derivatives of each variable will also be in the same dimensional space as the variables, so when we are backward propagating

dz^{[ℓ]}, da^{[ℓ]} ∈ R^{n^{[ℓ]}×m}    (2.3.9)
dW^{[ℓ]} ∈ R^{n^{[ℓ]}×n^{[ℓ−1]}}    (2.3.10)
db^{[ℓ]} ∈ R^{n^{[ℓ]}×1}.    (2.3.11)
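One practical way to catch dimension mistakes early is to assert the shapes from Equations 2.3.6-2.3.8 directly; a sketch, with illustrative names:

def check_shapes(params, layer_dims):
    # layer_dims[l] = n^{[l]}; raises AssertionError on any mismatch
    for l in range(1, len(layer_dims)):
        assert params["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert params["b" + str(l)].shape == (layer_dims[l], 1)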
2.3.3 Why deep representations?
Figure 2.3.3: Deeper layers in a neural network analyze more complex data (less complex → more complex).
When using a neural network architecture, the deeper nodes look at more complex data. For example, with images, the first layer may detect edges, the second layer may put a few edges together, and the third layer may formulate the understanding of a face. Compounding knowledge throughout multiple layers is analogous to how many people believe the human brain works.
Graphs, in general, are often more useful when multiple layers can be stacked on top of each other. Consider the logic tree in Figure 2.3.4 that represents the function y = x₁ ⊕ (x₂ ⊕ x₃), where ⊕ denotes XOR. With more layers, it is easier to store information that becomes more complex as the layers grow in depth.
2.3.4 Forward and backward propagation
The forward propagation step on layer ℓ takes in a^{[ℓ−1]} as input and outputs a^{[ℓ]}, while also caching z^{[ℓ]} for back propagation. For backward propagation, we will input da^{[ℓ]} and output da^{[ℓ−1]}, dW^{[ℓ]}, and db^{[ℓ]}.
Figure 2.3.4: Comparing logic trees of different depths for y = x₁ ⊕ (x₂ ⊕ x₃). (a) Few nodes are required for a deeper tree. (b) Exponentially more nodes are required for a shallow tree, since it must test every possible combination of x₁, x₂, and x₃. It is easier for deeper trees to represent the same information as shallow trees with fewer nodes.
Figure 2.3.5: Forward and backward propagation steps for a two layer neural network. We first forward propagate across the top (a^{[0]} → a^{[1]} → a^{[2]} = ŷ), caching the intermediate z^{[ℓ]} values, then we backward propagate across the bottom (da^{[2]} → da^{[1]} → da^{[0]}) to find the changes dw^{[ℓ]}, db^{[ℓ]} to the weights.
From the chain rule, the outputs of back propagation are

dz^{[ℓ]} = da^{[ℓ]} ⊙ g^{[ℓ]′}(z^{[ℓ]})
dw^{[ℓ]} = dz^{[ℓ]} · a^{[ℓ−1]T}
db^{[ℓ]} = dz^{[ℓ]}
da^{[ℓ−1]} = w^{[ℓ]T} · dz^{[ℓ]}.    (2.3.12)
For an entire training set, we would have

dZ^{[ℓ]} = dA^{[ℓ]} ⊙ g^{[ℓ]′}(Z^{[ℓ]})
dW^{[ℓ]} = (1/m) · dZ^{[ℓ]} · A^{[ℓ−1]T}
db^{[ℓ]} = (1/m) · np.sum(dZ^{[ℓ]}, axis=1, keepdims=True)
dA^{[ℓ−1]} = W^{[ℓ]T} · dZ^{[ℓ]}.    (2.3.13)
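These four equations translate into a single backward step per layer that can be called in a loop from layer L down to layer 1; a sketch, assuming the cached Z^{[ℓ]} and A^{[ℓ−1]} and a function g_prime for the derivative of the layer's activation:

import numpy as np

def backward_step(dA, Z, A_prev, W, g_prime):
    # dA: (n_l, m), Z: (n_l, m), A_prev: (n_{l-1}, m), W: (n_l, n_{l-1})
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                               # dZ = dA ⊙ g'(Z)
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db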
2.3.5 Parameters vs hyperparameters
For neural network models, our parameters consist strictly of the weights W^{[ℓ]} and biases b^{[ℓ]}. However, there are plenty more variables that we have to set in order to optimize our function, such as the learning rate α, number of gradient descent iterations, number of hidden layers L, number of hidden units n^{[ℓ]}, and type of activation function. The variables that affect our parameters, as just listed, are known as hyperparameters. Many hyperparameters are set on an experimental basis, by trying a range of values and keeping what works best for the application at hand.
2.3.6 Connection to the brain
Modern deep learning has little resemblance to what we know about how the brain learns. While it is still unknown exactly how a neuron in the brain works, we have a loose idea that each neuron takes some signal and passes a signal on to the next neuron. Putting many neurons together then loosely resembles a tree-like structure similar to a neural network. However, there is little evidence that anything like forward and backward propagation occurs, so the connection to the brain is very loose.
Chapter 3
Optimizing and structuring a
neural network
3.1 Full-cycle deep learning projects
For most machine learning projects, we have to do the following tasks in order:
1. find a problem
2. get data
3. design a model
4. train the model
5. test the model
6. deploy the project
7. maintain the project
The model that we originally choose tends to be changed over time in order to find the model that performs best. Most machine learning papers tend to focus on steps 2-5, but in this section, we will focus on steps 1, 6, and 7. For deep learning problems, good problems tend to be interesting, with available data, in a field we have domain knowledge in, as well as useful and feasible to create. Feasibility is a massive problem when choosing projects, especially in industry. To find data, let us use the example of a trigger word speech detection system, such as Alexa. It may be a good idea to first use a small set of data in order to quickly start building out a model and finding the difficulties in the problem at hand; even a couple hundred instances may initially work well. One way to collect new data would be to go around and ask different people to speak into a recording device, where they say either nothing, "Alexa", or some other words.
As we train the models iteratively, it is a good idea to document each model that we set up and the results it produced. That way, it will be easy to look back on older models when making tweaks to a current system.
Figure 3.1.1: Breaking a problem involving trigger word detection into two parts: voice activity detection (VAD) followed by a large neural network for trigger word detection (audio → VAD → NN → ŷ).
For our system involving trigger word detection, it may be a good idea to break the problem into two parts: one that detects if a word is spoken, and one that detects if the spoken word was the trigger word. In modern literature, the spoken-word detector may be referred to as voice activity detection (VAD). When using VAD we primarily have two options to choose from when selecting a model. We can use a non-machine learning based approach that detects if the volume is greater than some threshold, or we can use a small neural network that is trained to detect human speech. Since the non-machine learning option may only take a few minutes to implement, it may be best to choose that option first in order to minimize cost and move quickly. Another downside to the neural network approach is that new accents may be introduced in production that the model has never heard before, which could cause abnormal results.
3.2 Setting up a machine learning application
3.2.1 Training, development, and testing sets
Applied machine learning involves choosing many hyperparameters for a model, such as the number of layers, hidden units, learning rates, and activation functions. Most of the choices of hyperparameters for one subset of problems, such as natural language processing, tend to not work very well on other machine learning problems, like computer vision. Because it is often difficult to determine which hyperparameters work best, we tend to check our choice of hyperparameters against a held-out set in order to determine its effectiveness.
Figure 3.2.1: A data set is broken into three parts: training, development, and testing.
We tend to break our data into three sets: training, cross-validation or development, and testing.¹ The training data set is used to change the weights and biases of our model, the cross-validation set is used to determine if our model has overfitted the data and whether our hyperparameters generalize well to data we have not seen, and the testing set is used to estimate how accurate our model will be once deployed. In the past, the best practice for breaking the data up was to give the training set 60% of the data, the development set 20% of the data, and the testing set 20% of the data. Now, however, our datasets have grown large enough that breaking up the data based on percentages no longer makes sense. Today, we may break up a data set with 1 million examples into a development set with 10 thousand examples and a testing set with 10 thousand examples. It is important to distribute different types of data between the testing and development sets evenly.
3.2.2 Bias and variance
Figure 3.2.2: Different types of variance and bias: high bias corresponds to underfitting, high variance to overfitting, with "just right" in between.
When we talk about the variance of a particular model, we are referring to the amount of overfitting in our model. When there is a lot of overfitting, we say the model has high variance. Similarly, when the model is underfitting, we say there is low variance. We use the term bias to describe how well our model fits the data. If our model performs well on both the training and development set, we say that the model has low bias, and if the model performs badly on both the training and development set, we say the model has a high bias.
¹Many people may only use a training and testing set, where the testing set acts as if it were a development set.
We often determine variance based on the training set and development set error rates.

Train set error    1%              15%         15%                         0.5%
Dev set error      11%             16%         30%                         1%
Diagnosis          high variance   high bias   high bias, high variance    low bias, low variance

Table 3.1: How we may denote high and low variance and bias when given the training and development set error rates. Here, we assume that the human error is approximately 0%.
Figure 3.2.3: An example of a model having both high variance and high bias.
3.2.3 Combatting bias and variance
With a model, we first check for high bias and then high variance. When a model has high bias, the network will not be fitting the training data well. It may be a good idea to test out a larger neural network, run gradient descent longer, or change neural network architectures. When a model has high variance, the development set error will be high. We can improve the high variance problem by introducing new data, regularization, or changing the neural network architecture.
Before neural networks became widely adopted, we often had to decide whether to improve the bias or the variance, but not both. Having to choose one or the other came to be known as the bias-variance tradeoff. However, with larger neural networks we have been able to independently improve both the bias and variance of a network.
3.3 Regularizing a neural network
3.3.1 Regularization
Regularization is a way to improve a high variance model. To show why regularization may be useful, let us consider logistic regression, where our goal was to find

min_{w,b} J(w, b)    (3.3.1)

where w ∈ R^{n_x}, b ∈ R, and

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.3.2)
When using regularization we will redefine our cost function to be

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) ||w||₂²,    (3.3.3)

where we are using the L2 norm defined as

||w||₂² = Σ_{j=1}^{n_x} w_j² = wᵀw.    (3.3.4)
When using the L2 norm in our regularization term, we say that we are using L2 regularization. The other main type of regularization is L1 regularization, where our regularization term is

(λ/2m) Σ_{j=1}^{n_x} |w_j| = (λ/2m) ||w||₁.    (3.3.5)

L1 regularization tends to result in w being a sparse vector, that is, a vector with a lot of zeros, which is convenient for calculations. L2 regularization tends to be used much more often with neural networks. The term λ is known as the regularization parameter, which is another hyperparameter, often set during the validation process of training the network.
For neural networks, we want to find

min_{w^{[1]},b^{[1]},...,w^{[L]},b^{[L]}} J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}),    (3.3.6)

where our cost function is also defined as

J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.3.7)
The regularized cost involving multiple bias and weight terms is

J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) Σ_{ℓ=1}^{L} ||w^{[ℓ]}||_F²,    (3.3.8)

where the squared norm of a matrix is known as the Frobenius norm, which is defined as

||w^{[ℓ]}||_F² = Σ_{i=1}^{n^{[ℓ−1]}} Σ_{j=1}^{n^{[ℓ]}} (w^{[ℓ]}_{ij})².    (3.3.9)
Before we added the additional regularization term to our cost function, we updated the weights like

w^{[ℓ]} = w^{[ℓ]} − α dw^{[ℓ]},    (3.3.10)

where we calculated dw^{[ℓ]} to be the change in the cost function with respect to w. When adding the regularization term, our calculation of gradient descent must also include the changes of the regularization term with respect to the weights and biases. When our bias changes, the regularization term does not change, so the gradient descent term for the bias remains the same. However, when our weights change, the derivative of our regularization term turns out to be

d/dw^{[ℓ]} [(λ/2m) ||w^{[ℓ]}||_F²] = (λ/m) w^{[ℓ]}.    (3.3.11)
While it may seem weird that the derivative of the Frobenius norm results in a matrix (since the Frobenius norm produces a scalar value), we are really just looking for the change in the Frobenius norm due to changes in the entire matrix, not the change due to one element, which is why the result ends up as a matrix. Now we can rewrite our weights gradient descent expression as

w^{[ℓ]} = w^{[ℓ]} − α (dw^{[ℓ]} + (λ/m) w^{[ℓ]}).    (3.3.12)
Distributing the α term, we have

w^{[ℓ]} = w^{[ℓ]} − α dw^{[ℓ]} − α (λ/m) w^{[ℓ]}.    (3.3.13)

Notice that with the terms

w^{[ℓ]} − α (λ/m) w^{[ℓ]}    (3.3.14)

in our gradient descent function, the addition of the regularization term will always shrink the values of the weights toward zero, since α, λ, and m are all nonnegative. We sometimes refer to regularization that strictly decreases the weights as weight decay.
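In code, the weight decay update of Equation 3.3.13 is a one line change to the gradient step; a sketch with illustrative names:

def update_with_weight_decay(W, dW, alpha, lam, m):
    # the alpha * (lam / m) * W term shrinks the weights toward zero
    return W - alpha * dW - alpha * (lam / m) * W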
3.3.2 Why regularization reduces overfitting
With the L2 regularized cost function

J = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) Σ_{ℓ=1}^{L} ||w^{[ℓ]}||_F²,    (3.3.15)

making λ a large number will make the term

(1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)})    (3.3.16)

negligible in comparison. In the scenario where λ is very large, we must make the weights very small in order to minimize the cost. If the weights are near zero, they will not have much effect on the neural network and our new neural network will become much simpler. In practice, all of the weights become much smaller, not just a select few as shown in Figure 3.3.1. But, with all the weights near 0, the model will still become simpler.
Figure 3.3.1: If the weights connected to a neuron are near 0, then the neuron will not make a significant contribution to the hypothesis. With fewer neurons in the neural network, the model will become simpler.
With our linear activation calculation of

z^{[ℓ]} = w^{[ℓ]}a^{[ℓ−1]} + b^{[ℓ]},    (3.3.17)

if we make w^{[ℓ]} really small, then z^{[ℓ]} will end up really small (assuming b^{[ℓ]} does not make a significant contribution). If we are using the hyperbolic tangent function or the sigmoid function as our activation function, then when z is small we will pretty much have a linear activation function. Recall that with a linear activation function, the depth of our neural network would not matter and the entire model could just be represented with a linear function.
Figure 3.3.2: The hyperbolic tangent function tanh(z) is nearly a linear function for values of z near 0.
Figure 3.3.3: 4 layer neural network with 3 hidden layers, each with 4 neurons.
3.3.3 Dropout regularization
Consider the neural network in Figure 3.3.3, which has a total of 12 hidden neurons. For each iteration of gradient descent that we train our neural network on, dropout regularization visits each hidden neuron and randomly chooses whether or not that neuron will be removed from the neural network. The probability that each hidden neuron remains is a hyperparameter that we can set. With a neural network that has a differing number of neurons in each hidden layer, we can make the keep_prob value a hyperparameter in each layer. If a neuron is removed, then all the weights and biases connected to the neuron are also removed. When testing our neural network or once it is deployed, we will no longer use dropout regularization, primarily because we do not want randomized predictions. In computer vision, the input to the neural network is very large, and many people in computer vision use dropout regularly.
To implement dropout regularization, let us set

d = np.random.rand(a.shape[0], a.shape[1]) < keep_prob,    (3.3.18)

where keep_prob is the probability that we keep the neuron and d is the same shape as a^{[ℓ]}. d will now store a matrix of True or False values. We can then take

a = a * d    (3.3.19)

in order to zero out the values in the matrix where d is 0 and keep the values in the matrix where d is 1. We also want to make sure that z^{[ℓ]} is not reduced from its expected value without dropout regularization, which can be done by taking

a = a / keep_prob.    (3.3.20)

Consider, for example, if we had 100 hidden neurons in a layer and we set keep_prob = 0.8. On average, 20 neurons would be hidden and a^{[ℓ]} would be reduced by 20%.
Figure 3.3.4: A potential simplified neural network containing only the hidden neurons that were not removed by dropout regularization.
To mitigate this loss, we divide by our keep_prob value of 0.8. Dividing by the hyperparameter keep_prob is known as the inverted dropout technique.
With dropout regularization, any feature could be removed randomly from the neural network on any given iteration. Because of this, the network can learn not to rely on any one feature too much, which can regularize a model. In general, when using dropout regularization, the weights shrink in size.
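Putting Equations 3.3.18-3.3.20 together, a forward pass through one layer's activations with inverted dropout might look as follows; a sketch, with keep_prob = 0.8 matching the example above:

import numpy as np

def dropout_forward(A, keep_prob=0.8):
    # boolean mask: each unit survives with probability keep_prob
    D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob
    A = A * D               # zero out dropped units
    A = A / keep_prob       # inverted dropout: rescale to preserve the expected value
    return A, D             # D is cached so backprop drops the same units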
3.3.4 Other regularization methods
While we know that adding more data to our model could help reduce overfitting, we do not always have easy access to new data. Instead, one way to add more data is to augment the existing data that we have, which is known as data augmentation. If we were to augment images, for example, then we might reflect, crop, distort, and change the colors of our input data, as shown in Figure 3.3.5.
Figure 3.3.5: Data augmentation on images
Early stopping is another way to regularize our model. With early stopping, we decrease the number of iterations when running gradient descent in hopes of finding a more general model. When using early stopping, the primary downside is that we are no longer looking to completely optimize the cost function, and it is often hard to determine when it is a good time to stop iterating.
3.4 Setting up an optimization problem
3.4.1 Normalizing inputs
Figure 3.4.1: Normal distribution of data vertically centered at 5 with a variance of 2.25.
To normalize a dataset, we must do two things: center the mean at the origin in each direction and set the variance to be 1. Consider the two dimensional data in Figure 3.4.1. Centering the data in each direction would require us to calculate each direction's mean

μ = (1/m) Σ_{i=1}^{m} x^{(i)}    (3.4.1)

and then take each point and subtract the mean, which can be done with

x := x − μ.    (3.4.2)

The result after centering the data in each direction at 0 is shown in Figure 3.4.2.
Figure 3.4.2: Centering the mean of the data in each direction at 0.
The variance is defined as the square of the standard deviation. In general, the variance is

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)} − μ)²,    (3.4.3)

but since we have set μ = 0, we can simplify the variance to

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)})².    (3.4.4)
To normalize the variance, we must set the variance of the entire dataset to be 1, which can be done by dividing the entire dataset by the standard deviation

x := x / σ.    (3.4.5)

Figure 3.4.3 shows the normalized data after making the variance of the data 1.

Figure 3.4.3: Normalized data after centering the mean at 0 and making the variance 1.

It is important when using a training set and testing set that both sets are normalized with the same μ and σ² in order to best simulate how the model would perform when deployed.
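A sketch of normalizing a training set and then applying the same μ and σ² to the test set; the helper names are illustrative:

import numpy as np

def fit_normalizer(X_train):
    # X has shape (n_x, m); statistics are computed per feature
    mu = np.mean(X_train, axis=1, keepdims=True)       # (3.4.1)
    sigma2 = np.var(X_train, axis=1, keepdims=True)    # (3.4.3)
    return mu, sigma2

def normalize(X, mu, sigma2):
    # center the mean, then scale to unit variance, per (3.4.2) and (3.4.5)
    return (X - mu) / np.sqrt(sigma2)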
Figure 3.4.4: Potential contour plots of the cost function (axes w and b) for (a) normalized and (b) unnormalized data.
Normalizing the data often makes it quicker for gradient descent to converge. Considering the two-dimensional contour plots of a potential cost function in Figure 3.4.4, our goal with normalizing is to make it easier to find the minimum of the cost function. If we have a set of features with a range of (−1000, 1000) and another set of features with a range of (−1, 1), it is often quite hard to pick a learning rate that will work well for both features.
Figure 3.4.5: 7 layer neural network with two units per hidden layer and a single output ŷ.
3.4.2 Vanishing and exploding gradients
Consider the neural network in Figure 3.4.5, where we will assume the input is non-negative and will ignore the bias term for now. Here, we will exclusively use ReLU activations and initialize all of our weights to be of the form

W^{[ℓ]} = [ γ 0 ; 0 γ ],    (3.4.6)

where γ > 0. In the case presented, the ReLU activation function will behave like a linear activation function and we will have

W^{[7]} W^{[6]} W^{[5]} W^{[4]} W^{[3]} W^{[2]} W^{[1]} a^{[0]} = ŷ.    (3.4.7)
The multiplication of all of the W^{[ℓ]} will simplify to

∏_{ℓ=1}^{7} W^{[ℓ]} = ∏_{ℓ=1}^{7} [ γ 0 ; 0 γ ] = [ γ⁷ 0 ; 0 γ⁷ ].    (3.4.8)
When we initialize γ < 1, we will end up multiplying our input by an exponentially small number, and when we initialize γ > 1, the multiplication will be exponentially large. Consider, for example, if we set γ = 0.5, then our input will be scaled down by a factor of γ⁷ = 0.0078125. Similarly, if we set γ = 1.5, then the input will be scaled up by γ⁷ = 17.0859375. When the prediction of our model is a huge number, then the gradients in back propagation will explode, and when our prediction is near zero, the gradients will vanish and updates will have a negligible impact.
3.4.3 Weight initialization for deep neural networks
For each neuron, we calculate the linear activation to be

z = w₁x₁ + w₂x₂ + · · · + wₙxₙ.    (3.4.9)

At the moment, we are going to ignore the bias term. Notice that when n is large, we would like the weights to be small in order to keep z from exploding. Similarly, when n is small, making the weights larger would make the most sense in order to have the input make a contribution. One solution is to initialize our weights to have a variance of

σ² = 1/n.    (3.4.10)
We previously calculated the variance of a data set centered at zero to be

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)})².    (3.4.11)

Starting with a random normally distributed data set with a variance of σ² = 1, we can change the variance to ζ by multiplying each instance by √ζ. We can show that this is correct by carrying out the multiplication on each instance and substituting in the original variance of 1, which gives us

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)} √ζ)²    (3.4.12)
   = ζ [ (1/m) Σ_{i=1}^{m} (x^{(i)})² ]    (3.4.13)
   = ζ.    (3.4.14)
So, if we wanted to make the variance 1/n, we would initialize our weights to be

W^{[ℓ]} = np.random.randn(shape) * np.sqrt(1/n^{[ℓ−1]}).    (3.4.15)

The modern best practice tends to set the variance of weights used with the ReLU function to be

σ² = 2/n^{[ℓ−1]},    (3.4.16)

which is known as He initialization. With a hyperbolic tangent function we generally stick with

σ² = 1/n^{[ℓ−1]},    (3.4.17)

which is known as Xavier initialization. Occasionally, some people may also set the variance to be

σ² = 2/(n^{[ℓ−1]} + n^{[ℓ]}).    (3.4.18)
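A sketch of He initialization for a single layer; swapping the 2 for a 1 gives Xavier initialization:

import numpy as np

def he_init(n_l, n_prev):
    # weights drawn with variance 2 / n^{[l-1]}, per Equation 3.4.16
    return np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)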
3.4.4 Gradient numerical approximations
Numerical approximations of our gradient can be useful when debugging a network and testing if the forward and backward propagation steps are working correctly. From the definition of the derivative, we know that the one-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ)) / ε    (3.4.19)

will give us the slope of the line tangent to f at θ. We could also use the two-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ − ε)) / 2ε    (3.4.20)

to calculate the derivative, which will produce an equivalent result. The two-sided limit, as shown in Figure 3.4.6, works best for numerical approximations because it gets a more accurate picture of the slope on both sides.

Figure 3.4.6: Two-sided limit derivative definition.
The error when using the two-sided limit is O(ε²) and the error when using a one-sided limit is O(ε). Although it is twice as computationally expensive, it is often worth it to use the two-sided limit.
3.4.5 Gradient checking
Using the numerical approximation of the derivative, our goal is to check if our calculations implemented in back propagation are correct. To simplify calculations, we will store every entry of our parameters W^{[1]}, b^{[1]}, . . . , W^{[L]}, b^{[L]} in a single large vector denoted θ. Similarly, we will store dW^{[1]}, db^{[1]}, . . . , dW^{[L]}, db^{[L]} in a vector dθ. We can now write our cost function as

J(θ) = J(θ₁, θ₂, θ₃, . . . ).    (3.4.21)

For each element i of our vector θ, we can approximate the derivative with

dθ̃ᵢ = [J(θ₁, θ₂, . . . , θᵢ + ε, . . . ) − J(θ₁, θ₂, . . . , θᵢ − ε, . . . )] / 2ε.    (3.4.22)
If our calculations are correct, then dθ̃ should approximately equal dθ. We can check our error rate E by taking the normalized Euclidean distance between the vectors,

E = ||dθ̃ − dθ||₂ / (||dθ̃||₂ + ||dθ||₂).    (3.4.23)

In practice, if we set ε = 10⁻⁷, then with E < 10⁻⁷ we can generally confirm that our calculations are correct, and with E > 10⁻³ we should double check our calculations. For values in between, it is harder to determine correctness. For debugging, it is best to first check for bugs where the difference between dθ̃ᵢ and dθᵢ is the greatest.
We can run gradient checking with L2 regularization by making sure to include our regularization term in the cost function. However, because of the randomness of dropout regularization, we cannot use gradient checking with dropout turned on.
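A sketch of the full check, assuming J is a function mapping the parameter vector θ (a float NumPy array) to the scalar cost and grad is the dθ vector produced by back propagation:

import numpy as np

def gradient_check(J, theta, grad, eps=1e-7):
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)   # (3.4.22)
    # normalized Euclidean distance, Equation 3.4.23
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))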
3.5 Optimization algorithms
3.5.1 Mini-batch gradient descent
Deep learning works best when there is a lot of data and we can quickly optimize our model. We have previously been using gradient descent on all of our training examples with our vectorized input

X = [ x^{(1)} x^{(2)} . . . x^{(m)} ],    (3.5.1)

where X ∈ R^{n_x×m}, and our vectorized output

Y = [ y^{(1)} y^{(2)} . . . y^{(m)} ],    (3.5.2)

where Y ∈ R^{1×m}. The gradient descent we have previously been using is known as batch gradient descent. If m is very large, then it may take a while to compute each iteration of gradient descent. For now, suppose that our training set is of size m = 5,000,000. Instead of using all m training examples for each iteration of gradient descent, let us break the input into groups of 1000, which we will call mini-batches. For example, the first mini-batch will be
X^{{1}} = [ x^{(1)} x^{(2)} . . . x^{(1000)} ],  Y^{{1}} = [ y^{(1)} y^{(2)} . . . y^{(1000)} ],    (3.5.3)

where we will denote the ith mini-batch with the superscript {i}. In total, we will have 5000
mini-batches. Next, we will make a pass through the training set as described in Algorithm 3. Each pass through the training set using any form of gradient descent is known as an epoch.
With batch gradient descent, our cost will ideally always be decreasing; however, that is not the case with mini-batch gradient descent, because each mini-batch does not reflect the entire training set and our gradient may not step in the perfect direction that minimizes our cost.
Algorithm 3 A single pass through the training set using mini-batch gradient descent.
for t ∈ {1, . . . , 5000} do
    Forward propagate on X^{{t}}
    Compute the cost J^{{t}}
    Backprop to compute the gradients
    Update the weights and biases
end for
Figure 3.5.1: How the cost function could decrease based on the choice of gradient descent: (a) batch gradient descent (cost vs. epoch) and (b) mini-batch gradient descent (cost vs. mini-batch iteration t).
When setting the hyperparameter that represents the size of the mini-batch, there are two extreme cases. First, if we set the mini-batch size equal to m, then we just have our old method of batch gradient descent. If we set the mini-batch size to be 1, then each example is a mini-batch; we call this stochastic gradient descent. With stochastic gradient descent, we cannot use vectorization, and it will not typically converge to a single value; instead, it will jump around near the minimum. In practice, we often choose a mini-batch size in between the two extremes in order to preserve vectorization and make faster progress.
The modern best practice in the field is to use batch gradient descent when m ≤ 2000; otherwise, we typically use a mini-batch size of 2⁶, 2⁷, 2⁸, or 2⁹ in order to store data most efficiently.
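A sketch of partitioning a shuffled training set into mini-batches of size 2⁶ = 64; the names are illustrative:

import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # shuffle the examples (columns)
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        # the last batch may be smaller if batch_size does not divide m
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))
    return batches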
3.5.2 Exponentially weighted averages
Figure 3.5.2: Data set for calculating the exponentially weighted average.
To describe other optimization algorithms that may work better than gradient descent, we first need to introduce the concept of exponentially weighted averages. Consider the dataset in Figure 3.5.2, where

θ = (θ₁, θ₂, . . . , θ₁₀₀)ᵀ = (1124, 1184, . . . , 1546)ᵀ.    (3.5.4)
To calculate the exponentially weighted average, we will start by initializing v₀ = 0 and recursively defining

vₜ = 0.9 vₜ₋₁ + 0.1 θₜ    (3.5.5)

for t ∈ {1, . . . , 100}. After computing all of the vₜ points, we can plot the joined data as shown in red in Figure 3.5.3.
Figure 3.5.3: Exponentially weighted average as a joined plot (red).
In general, we can set

vₜ = β vₜ₋₁ + (1 − β) θₜ,    (3.5.6)

where 0 ≤ β < 1. The choice of β sets the value of vₜ to be approximately the average over the last

1/(1 − β)    (3.5.7)

data points. For example, when we set β = 0.9, each value of vₜ closely mirrored the average of the previous 10 instances. Additional changes to the value of β are shown in Figure 3.5.4. In statistics, the exponentially weighted average may also be referred to as the exponentially weighted moving average.
Figure 3.5.4: Exponentially weighted averages with β = 0.01 (green), β = 0.6 (purple), and β = 0.95 (orange).
To better understand exponentially weighted averages, let us set β = 0.9 and look at the sequence

v₁₀₀ = 0.9 v₉₉ + 0.1 θ₁₀₀
v₉₉ = 0.9 v₉₈ + 0.1 θ₉₉
v₉₈ = 0.9 v₉₇ + 0.1 θ₉₈
⋮    (3.5.8)
Using substitution, we can write

v₁₀₀ = 0.9 (0.9 (0.9 (v₉₇) + 0.1 θ₉₈) + 0.1 θ₉₉) + 0.1 θ₁₀₀    (3.5.9)
     = 0.1 θ₁₀₀ + (0.1 × 0.9) θ₉₉ + (0.1 × 0.9²) θ₉₈ + (0.1 × 0.9³) θ₉₇ + · · ·    (3.5.10)
     = Σ_{t=1}^{100} (0.1 × 0.9^{100−t}) θₜ.    (3.5.11)
Visually, the weight applied to each θₜ term decreases exponentially the further t is from 100, as shown in Figure 3.5.5.

Figure 3.5.5: v₁₀₀ is the sum of the element-wise products between the exponentially decaying scale factors and the data.
The time constant τ is the time it takes for an exponentially decaying plot to reach 1/e of its maximum. From our decay equation, we have

β^{t−τ} = 1/e.    (3.5.12)

Solving for τ, we find the time constant to be

τ = t + log_β(e),    (3.5.13)

where |log_β(e)| represents the number of data points we go back in order for the scale factor to fall to 1/e of its value at vₜ. The approximation 1/(1 − β) in Equation 3.5.7 is a simplified version of this quantity and is often used because it gives a close enough answer to the true time constant in our domain for β. The differences between |log_β(e)| and 1/(1 − β) are shown in Figure 3.5.6.

Figure 3.5.6: Approximation of the time constant τ = 1/(1 − β) (red) and the true time constant τ = |log_β(e)| (blue).
We generally choose to use this definition of a weighted average instead of truly calculating
the average in a particular window because it is easier to compute and requires less memory
storage.
Figure 3.5.7: Data for bias correction analysis, where the red line is the exponentially weighted average with β = 0.5.
3.5.3 Bias correction for exponentially weighted averages
Notice that at the beginning of the plot in Figure 3.5.7, the weighted average starts off with small values and grows relatively slowly. Let us denote ε = 1 − β. Because of how we defined our exponentially weighted average, each plot will start with

v₀ = 0
v₁ = ε θ₁
v₂ = βε θ₁ + ε θ₂
v₃ = β²ε θ₁ + βε θ₂ + ε θ₃.    (3.5.14)
We can more easily interpret this data by plugging in a few values of β and analyzing the results, which is done in Table 3.2.

β     0.1                              0.5                              0.9
v₁    0.9 θ₁                           0.5 θ₁                           0.1 θ₁
v₂    0.09 θ₁ + 0.9 θ₂                 0.25 θ₁ + 0.5 θ₂                 0.09 θ₁ + 0.1 θ₂
v₃    0.009 θ₁ + 0.09 θ₂ + 0.9 θ₃      0.125 θ₁ + 0.25 θ₂ + 0.5 θ₃      0.081 θ₁ + 0.09 θ₂ + 0.1 θ₃

Table 3.2: Calculations of the first three vₜ terms.
Notice that when β is large, vₜ starts off growing the slowest. If we set each value of θₙ to 1, then we can get the total amount by which we are scaling our input. We would expect the scaling factor to be 1 in the general case, because when averaging n numbers, if each number is 1, then the average is 1. However, when β = 0.9 our total scaling factor for v₁ is 0.1, for v₂ is 0.19, and for v₃ is 0.271, which are all much lower than 1.
If we instead define our recursive vₜ equation as

vₜ = [β vₜ₋₁ + (1 − β) θₜ] / (1 − βᵗ),    (3.5.15)

then we will scale up the first few instances of the data without affecting the later entries, as shown in Figure 3.5.8. The denominator of Equation 3.5.15 is known as the bias correction term for exponentially weighted averages.
In practice, most people do not actually calculate the bias correction because it does not make much of a difference on large datasets.
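A sketch of computing the exponentially weighted average of a sequence; here the bias correction is applied to the reported value while the recursion itself stays uncorrected, which is a common way to implement Equation 3.5.15:

import numpy as np

def ewa(thetas, beta=0.9, bias_correction=True):
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta              # (3.5.6)
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)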
3.5.4 Gradient descent with momentum
In gradient descent, when we are far from an optimal point in our cost function, the gradients will be larger numbers. We are typically far at the beginning, taking large steps but often not in a great direction. Figure 3.5.9 illustrates a potential path through the contour plot. Ideally, we would like to have a larger learning rate in the horizontal direction and a smaller learning rate in the vertical direction, in order to step more directly towards the minimum. Looking at the previous iterations of the direction of gradient descent, we can get a pretty good feeling for the direction of the minimum by taking the weighted average. In our figure, the vertical weighted average would be near 0, while the horizontal weighted average would be in the direction towards the center. Using the exponentially weighted average with gradient descent is referred to as gradient descent with momentum. Gradient descent with momentum has two hyperparameters, α and β.

Figure 3.5.8: Weighted average of our data with bias correction (purple) and without bias correction (red).

Figure 3.5.9: Contour plot of the cost function with the minimum in red and our path from gradient descent in blue.
Algorithm 4 Gradient descent with momentum
v_dW, v_db = 0
repeat
    Compute dW, db with respect to J
    v_dW = β v_dW + (1 − β) dW
    v_db = β v_db + (1 − β) db
    W = W − α v_dW
    b = b − α v_db
until the cost function J converges
Consider how a ball rolling down the sides of a bowl will find the minimum by taking a step in the direction of the gradient. Here, the equation

v_dW = β v_dW + (1 − β) dW    (3.5.16)

has a loose analog to the physical system of a ball rolling down the sides of a bowl, where dW represents acceleration, v_dW represents velocity, and β represents friction.
Finally, some people may change the exponentially weighted average term to be

v_dW = β v_dW + dW,    (3.5.17)

where they omit the (1 − β) factor in hopes of avoiding overfitting; however, this would require us to make less intuitive changes to our learning rate α in order for the cost function to converge quickly.
3.5.5 RMSprop
From Figure 3.5.10, notice that it would be more beneficial to move further in the direction with less oscillation than it would be to continue taking large steps vertically.

Figure 3.5.10: Plotting the gradient in each direction, where we denote the vertical axis as the bias b and the horizontal axis as the weight w.

One way to
determine the amount of oscillation would be to calculate

S_dW = β₂ S_dW + (1 − β₂) dW²
S_db = β₂ S_db + (1 − β₂) db²,    (3.5.18)

where the square operation is element-wise and β₂ is a new hyperparameter we are introducing that works similarly to β.² Squaring the rate of change will make the term positive, rewarding small values while making large values more costly. In our example, the changes in the horizontal direction, which correspond to the changes in the weights, are relatively small, so S_dW will be relatively small. In contrast, S_db will be huge because the vertical changes are massive. Now, if we update the weights and bias with
w = w − α dW / (√S_dW + ε)
b = b − α db / (√S_db + ε),    (3.5.19)

we will end up moving more in the direction that does not oscillate and less in the direction that oscillates a lot. Here, we include ε in order to avoid a division by zero error, where ε is commonly set to a small value such as ε = 10⁻⁸. The algorithm described is known as root mean square prop (RMSprop) and is detailed in Algorithm 5.
Algorithm 5 RMSprop
S_dW, S_db = 0
repeat
    Compute dW, db with respect to J
    S_dW = β₂ S_dW + (1 − β₂) dW²
    S_db = β₂ S_db + (1 − β₂) db²
    w = w − α dW / (√S_dW + ε)
    b = b − α db / (√S_db + ε)
until the cost function J converges
²We introduce the new hyperparameter β₂ primarily to prepare for the upcoming section, which combines RMSprop and momentum.
3.5.6 Adam optimization algorithm
The Adam optimization algorithm combines both RMSprop and momentum and has been shown to work well across plenty of optimization problems. The complete algorithm is shown in Algorithm 6. Although we did not use bias correction with momentum or RMSprop, it is most common to use bias correction with the Adam algorithm.
Algorithm 6 Adam optimization
S_dW, S_db, v_dW, v_db = 0
repeat
    t = t + 1
    Compute dW, db with respect to J
    v_dW = β₁ v_dW + (1 − β₁) dW,  v_db = β₁ v_db + (1 − β₁) db
    S_dW = β₂ S_dW + (1 − β₂) dW²,  S_db = β₂ S_db + (1 − β₂) db²
    v_dW := v_dW / (1 − β₁ᵗ),  v_db := v_db / (1 − β₁ᵗ)
    S_dW := S_dW / (1 − β₂ᵗ),  S_db := S_db / (1 − β₂ᵗ)
    w = w − α v_dW / (√S_dW + ε)
    b = b − α v_db / (√S_db + ε)
until the cost function J converges
From the Adam optimization paper (Kingma et al.), the default values for the hyperparameters are

β₁ = 0.9    (3.5.20)
β₂ = 0.999    (3.5.21)
ε = 10⁻⁸.    (3.5.22)

The learning rate α does not have a default value and still needs to be tuned. Adam stands for adaptive moment estimation, and the terms β₁ and β₂ control the first and second moments, respectively.
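A sketch of one Adam update for a single parameter matrix, following Algorithm 6 with the default hyperparameters above (α must still be passed in and tuned):

import numpy as np

def adam_step(W, dW, v, S, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)          # bias-corrected first moment
    S_hat = S / (1 - beta2 ** t)          # bias-corrected second moment
    W = W - alpha * v_hat / (np.sqrt(S_hat) + eps)
    return W, v, S                        # v and S carry over to the next step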
3.5.7 Learning rate decay
Figure 3.5.11: Gradient descent may take large steps when near the minimum, bouncing around and never quite converging.

Learning rate decay refers to reducing the size of the learning rate as the number of iterations increases. Consider the example in Figure 3.5.11, where the steps, once we are near the center, appear to never quite reach the minimum and stay there. Instead, if we took smaller steps when closer to the minimum, we could more easily determine the values gradient descent is converging towards. One way to make α decay is by setting

α = α₀ / (1 + decay_rate · epoch_num),    (3.5.23)

where α₀ is the initial learning rate, decay_rate is a hyperparameter that we must set, and epoch_num is the number of times we have iterated through the entire training set. A few other ways to decay α include exponential decay

α = α₀ · 0.95^{epoch_num},    (3.5.24)

or

α = k α₀ / √(epoch_num).    (3.5.25)

Occasionally, people may use a discrete staircase to change the learning rate, in which case the value changes after a set number of iterations, or we can go in and manually change the decay rate.
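A sketch of the three decay schedules above; epoch_num is assumed to start at 1 for the square root schedule:

import numpy as np

def decayed_lr(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)       # (3.5.23)

def exponential_lr(alpha0, epoch_num):
    return alpha0 * 0.95 ** epoch_num                  # (3.5.24)

def sqrt_lr(alpha0, k, epoch_num):
    return k * alpha0 / np.sqrt(epoch_num)             # (3.5.25)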
3.5.8 The problem with local optima
Back in the day, deep learning practitioners feared that their algorithms would find some local minimum and get trapped inside, due to the gradient being 0 at a local optimum. In higher dimensional space, however, it is quite difficult for local optima to form, because it would require every single direction to have a minimum at a single point. In practice, we are more likely to see saddle points and plateaus. Plateaus tend to be the biggest problem, because they will slow down the rate of gradient descent and could possibly fool us into believing that we have found a global minimum if we stop running gradient descent early.
3.6 Hyperparameter Tuning
3.6.1 Tuning process
With deep neural networks, there are quite a few hyperparameters that need to be tuned, including the learning rate, momentum term, number of layers, number of hidden units, learning rate decay, and mini-batch size. The learning rate tends to be the hyperparameter that can vary the most and is often the one that is hardest to tune. The next set of hyperparameters that can make a significant difference includes the momentum term, number of hidden units, and mini-batch size. The hyperparameters that represent the number of layers and the learning rate decay may make a difference, but not to the degree of the rest of the hyperparameters mentioned.
When testing out choices for hyperparameters, it is often best to start with a reasonable range of values that could potentially work for each hyperparameter. Then, rather than changing one hyperparameter at a time in a grid-like way, it is best to randomly choose values in the range for the hyperparameter to take on, which will allow us to test a lot more hyperparameter values. For example, in Figure 3.6.1, randomly choosing hyperparameter 1 each time allows us to test 25 different values, while a grid choice would only allow for 5 different values we could test.
If we find that a section of the choices for our hyperparameters is producing good results, we can often zoom into that section and focus more on the hyperparameters there, narrowing down the range of values. This is known as coarse to fine and is shown in Figure 3.6.2.
3.6.2 Using an appropriate scale to pick hyperparameters
Randomly testing hyperparameter values tends to work out well for things like the number of hidden units in each layer and the number of total layers. In each of these cases, we may only be choosing from around 50 different integer values. For the learning rate, however, the range might be from 0.0001 to 1; if we randomly choose values uniformly, then 90% of the values we choose will be between 0.1 and 1. Instead of using a linear scale, we may use a log scale to divide up the values.
To choose α from the logarithmic scale in Figure 3.6.3, we could first choose a random variable r between −4 and 0, and then take

α = 10ʳ.    (3.6.1)
Figure 3.6.1: Ways we can go about choosing different hyperparameter values: (a) grid choice of hyperparameters versus (b) random choice of hyperparameter values.
Figure 3.6.2: Coarse to fine, where we zoom into the region of hyperparameter choices that has produced the most promising results.
Another problem that may arise is sampling the values of β, which are often between 0.9 and 0.999. If we instead look to find values of 1 − β, we will be searching for values between 0.1 and 0.001. Now, using the logarithmic scale just introduced, we can pick a random variable r between −3 and −1 and set

β = 1 − 10ʳ.    (3.6.2)
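A sketch of sampling α and β on these logarithmic scales:

import numpy as np

rng = np.random.default_rng()
r = -4 * rng.random()        # r uniform in [-4, 0]
alpha = 10 ** r              # alpha log-uniform in [1e-4, 1]      (3.6.1)

r = -2 * rng.random() - 1    # r uniform in [-3, -1]
beta = 1 - 10 ** r           # beta log-uniform in [0.9, 0.999]    (3.6.2)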
3.6.3 Hyperparameter tuning in practice: pandas vs caviar
For models that we do not have immense GPU power to run on, we may only be able to train a single model over a certain amount of time. During this training period, we could proactively adjust the hyperparameters as the model is training, attempting to improve the process at each change. Here, we are taking a careful approach to make sure we get everything right on this model as best as we can, since we will only be testing a single model at a time. This is known as babysitting one model and is considered the pandas approach, because pandas take careful care of their children. In contrast, if we have multiple GPUs to run our model across, we may run a bunch of tests in parallel and use the hyperparameters that give us the best results, which is known as the caviar approach.

Figure 3.6.3: The bin sizes for random variables on a linear scale (top) versus a logarithmic scale (bottom).
3.7 Batch normalization
3.7.1 Normalizing activations in a network
Previously, we have been normalizing our inputs a^{[0]} to more easily train w^{[1]} and b^{[1]}. Batch normalization takes this a step further and works to normalize a^{[ℓ]} in order to better train w^{[ℓ+1]} and b^{[ℓ+1]}. In practice, people are more likely to normalize z^{[ℓ]} instead of a^{[ℓ]}, so that is what we will be focusing on.
To calculate batch normalization on layer ℓ, we will have the batch of values {z^{[ℓ](1)}, . . . , z^{[ℓ](m)}}, where
μ = (1/m) Σ_{i=1}^{m} z^{[ℓ](i)}    (3.7.1)

and

σ² = (1/m) Σ_{i=1}^{m} (z^{[ℓ](i)} − μ)².    (3.7.2)
Just in case σ² turns out to be 0, we will introduce a hyperparameter ε and set the normalization of z to be

z^{[ℓ](i)}_norm = (z^{[ℓ](i)} − μ) / √(σ² + ε),    (3.7.3)

where ε is some small number like 10⁻⁸. We may not always want the mean to be 0 and the variance to be 1, so instead we will set

z̃^{[ℓ](i)} = γ z^{[ℓ](i)}_norm + β,    (3.7.4)

where both the new standard deviation γ and the new mean β can be learned by the model and are updated after each gradient descent iteration.
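A sketch of the batch normalization forward step for one layer's pre-activations Z of shape (n^{[ℓ]}, m):

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)          # (3.7.1)
    sigma2 = np.var(Z, axis=1, keepdims=True)       # (3.7.2)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)       # (3.7.3)
    return gamma * Z_norm + beta                    # (3.7.4)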
3.7.2 Fitting batch norm in a neural network
To add batch normalization to our model, each neuron will now have to calculate the linear activation, then the batch normalization of the linear activation, and then the non-linear activation. For each layer ℓ, the batch normalization step will have the parameters β^{[ℓ]} and γ^{[ℓ]}, as shown in Figure 3.7.1. For the entire neural network, our parameters will be W^{[ℓ]}, b^{[ℓ]}, β^{[ℓ]}, and γ^{[ℓ]} for ℓ ∈ {1, . . . , L}, with our original gradient descent step updating to

W^{[ℓ]} = W^{[ℓ]} − α dW^{[ℓ]}
b^{[ℓ]} = b^{[ℓ]} − α db^{[ℓ]}
β^{[ℓ]} = β^{[ℓ]} − α dβ^{[ℓ]}
γ^{[ℓ]} = γ^{[ℓ]} − α dγ^{[ℓ]}.    (3.7.5)
Figure 3.7.1: With batch normalization we now have three calculations at each node: the linear activation z^{[ℓ]}, the batch normalization z̃^{[ℓ]} (with parameters β^{[ℓ]}, γ^{[ℓ]}), and the non-linear activation a^{[ℓ]}.
In our current process, we have

z^{[ℓ]} = W^{[ℓ]} a^{[ℓ−1]} + b^{[ℓ]},    (3.7.6)

which is then normalized with

z̃^{[ℓ]} = γ^{[ℓ]} [ (z^{[ℓ]} − μ) / √(σ² + ε) ] + β^{[ℓ]}    (3.7.7)
       = γ^{[ℓ]} [ (W^{[ℓ]} a^{[ℓ−1]} + b^{[ℓ]} − μ) / √(σ² + ε) ] + β^{[ℓ]}    (3.7.8)
       = γ^{[ℓ]} [ (W^{[ℓ]} a^{[ℓ−1]} − μ) / √(σ² + ε) ] + β^{[ℓ]} + b^{[ℓ]} [ γ^{[ℓ]} / √(σ² + ε) ].    (3.7.9)
Here, notice that the bias term is not really adding any new information, since the β^{[ℓ]} term can absorb the b^{[ℓ]} term into a single shift parameter. For simplicity, we will choose to remove the term b^{[ℓ]} and keep β^{[ℓ]}, giving us the updated parameters γ^{[ℓ]}, W^{[ℓ]}, and β^{[ℓ]}.
Currently, we have been using batch normalization with the entire training set, but if we were to break the data up into mini-batches, we would go through a similar process using batch normalization on each mini-batch. The complete algorithm for gradient descent with batch normalization is described in Algorithm 7.
Algorithm 7 Gradient descent with batch normalization
for t = 1 to the number of mini-batches do
    Forward propagate on X^{{t}}
    In each hidden layer, replace z^{[ℓ]} with its batch normalized version z̃^{[ℓ]}
    Backward propagate to compute dW^{[ℓ]}, dβ^{[ℓ]}, dγ^{[ℓ]}
    Update the parameters with W^{[ℓ]} = W^{[ℓ]} − α dW^{[ℓ]}, . . .
end for
3.7.3 Batch norm at test time
Recall that for each mini-batch, we calculated

μ = (1/m) Σ_{i=1}^{m} z^{(i)}
σ² = (1/m) Σ_{i=1}^{m} (z^{(i)} − μ)²
z^{(i)}_norm = (z^{(i)} − μ) / √(σ² + ε)
z̃^{(i)} = γ z^{(i)}_norm + β,    (3.7.10)

which relies on having an entire mini-batch of data, which we do not have at test time. In practice, we estimate μ and σ² as the exponentially weighted averages across mini-batches. For example, if we had three mini-batches, then for each layer we would find μ^{{1}[ℓ]}, μ^{{2}[ℓ]}, μ^{{3}[ℓ]} and σ²^{{1}[ℓ]}, σ²^{{2}[ℓ]}, σ²^{{3}[ℓ]} and calculate the exponentially weighted average of both μ and σ². At test time, we would compute

z^{(i)}_norm = (z^{(i)} − μ) / √(σ² + ε),    (3.7.11)

where μ and σ² come from the latest calculation of our exponentially weighted averages.
3.8 Multi-class classification
3.8.1 Softmax regression
Currently, we have only been dealing with binary classification problems, but it would be much more useful to be able to handle multi-class classification.

Figure 3.8.1: Multi-class classification with the labels 0 → cat, 1 → dog, 2 → lion, and 3 → zebra.
We will denote C to be the number of classes, so in Figure 3.8.1, C = 4. The number of classes will correspond to the number of neurons in the last layer of the neural network, so

n^{[L]} = C,    (3.8.1)

where each neuron corresponds to the probability of a class, as shown in Figure 3.8.2. From the laws of probability, we know that the activations in the final layer must sum to 1. We will start off, as usual, calculating the linear activation in the final layer, but we must change the activation function. For our activation function, let us first calculate the element-wise exponentiation of each linear activation

t = e^{z^{[L]}},    (3.8.2)

where t is a temporary variable and t ∈ R^{C×1}. Next, we will normalize the vector t, which gives our activation function

a^{[L]} = e^{z^{[L]}} / Σ_{j=1}^{C} t_j,    (3.8.3)
Figure 3.8.2: Each neuron in the final layer corresponds to the probability of a class: a^{[L]}_1 = P(class = 0 | X), a^{[L]}_2 = P(class = 1 | X), a^{[L]}_3 = P(class = 2 | X), a^{[L]}_4 = P(class = 3 | X).
where a^{[L]} ∈ R^{C×1}. Equation 3.8.3 is again an element-wise function where the ith element is

a^{[L]}_i = t_i / Σ_{j=1}^{C} t_j.    (3.8.4)
As a concrete example, suppose we have

z^{[L]} = (5, 2, −1, 3)ᵀ.    (3.8.5)

We will first exponentiate each element

t = e^{z^{[L]}} = (e⁵, e², e⁻¹, e³)ᵀ ≈ (148.4, 7.4, 0.4, 20.1)ᵀ    (3.8.6)

and calculate

Σ_{j=1}^{4} t_j ≈ 176.3.    (3.8.7)
Next, we will divide each element in t by the sum of all the values in t, which gives us the probability that each class occurs

a^{[L]} = (e⁵/176.3, e²/176.3, e⁻¹/176.3, e³/176.3)ᵀ ≈ (0.842, 0.042, 0.002, 0.114)ᵀ.    (3.8.8)
The activation function we have been using on the final layer of our neural network is known as the softmax activation function. The name softmax is in contrast to a hardmax classifier that would output 1 for the max and 0 for everything else. In our example, the hardmax classifier would output

a^{[L]} = (1, 0, 0, 0)ᵀ.    (3.8.9)
3.8.2 Training a softmax classifier
As an example, suppose our neural network outputs the prediction

ŷ = (0.3, 0.2, 0.1, 0.4)ᵀ    (3.8.10)

where the actual labels are

y = (0, 1, 0, 0)ᵀ.    (3.8.11)
The loss function we will use for a softmax classifier is defined as

L(ŷ, y) = −Σ_{j=1}^{C} y_j log ŷ_j,    (3.8.12)

which leaves only the prediction for the positive label contributing to the loss, so we can simplify our loss to

L(ŷ, y) = −y₂ log ŷ₂ = −log ŷ₂.    (3.8.13)

In order to minimize our loss, we would want to make ŷ₂ as close to 1 as possible. If ŷ₂ is near 0⁺, then we will have a loss approaching ∞. For a multi-class classifier, we will use the same definition of the cost function as before, with

J = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.8.14)

We can calculate the backward propagation step through a softmax classifier with

∂J/∂z^{[L]} = ŷ − y.    (3.8.15)
Chapter 4
Applied deep learning
4.1 Orthogonalization
We will use the term orthogonalization to refer to changes in one hyperparameter not influencing the rest of the hyperparameters. For instance, in a self-driving car, we would not want changes in the acceleration to have any effect on the steering. In regards to machine learning, we primarily have to fit the training set well, fit the dev set well, fit the test set well, and then maintain the model on real-world examples. If one of these steps is not working well, then we would like to adjust a set of hyperparameters that will have little impact on the other steps.
4.2 Setting up goals
4.2.1 Single number evaluation metric
Using a binary cat classification example, we will say that precision refers to the percentage
of images our model recognized as cats that actually are cats. Recall refers to the percentage
of actual cats that are correctly labeled.
Classifier   Precision   Recall
A            95%         90%
B            98%         85%

Table 4.1: Example of precision and recall for two different classifiers
Suppose our classifiers give us the precision and recall data in Table 4.1, where one classifier
gives us better precision, while the other gives us better recall. With two evaluation metrics,
determining the best classifier is hard to do. Instead, we can combine precision and recall
into a single metric known as the F1 score, where

F1 = 2 / (1/P + 1/R),    (4.2.1)

where P represents precision and R represents recall. This way of combining the two values is
known as the harmonic mean. Now, using the F1 score, we can simply say that classifier
A is the better classifier.
When choosing metrics, we typically first want to define a metric and then have a completely
separate process that deals with optimizing that metric. It is often best to quickly choose an
evaluation metric and then change that metric in the future if there is a significant difference
in how the model performs on the development set versus the test set versus the real world.
Classifier   Precision   Recall   F1 Score
A            95%         90%      92.4%
B            98%         85%      91.0%

Table 4.2: Calculating the F1 score for each classifier
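Equation 4.2.1 is simple enough to check directly; this short Python sketch reproduces the values in Table 4.2:

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Equation 4.2.1)."""
    return 2 / (1 / precision + 1 / recall)

print(f1_score(0.95, 0.90))  # classifier A: ~0.924
print(f1_score(0.98, 0.85))  # classifier B: ~0.910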
4.2.2 Train/dev/test distributions
For our development and test sets, we would like instances to come from the same
distribution as our data. In particular, we would like each of these sets to model the
data we expect to receive in the future as well as possible. The size of each set may
vary. Our test set should be large enough that it gives us high confidence
when our model performs well on it.
4.3 Comparing to human-level performance
Figure 4.3.1: How our model's accuracy could behave over time, relative to human-level performance and Bayes optimal error.
In cases such as image recognition or speech detection, we tend to be able to create models
that rapidly approach human-level performance, but then the rate at which the accuracy
improves slows down. One reason for the plateau after human-level performance is achieved
is that human-level error and Bayes optimal error are not too far apart on tasks like
recognizing images. When our model performs worse than human-level performance, we
still have ways to improve, such as gathering more training data and performing manual
error analysis. Our model will never surpass the Bayes optimal error, which is the best
error rate we could possibly achieve, as shown in Figure 4.3.1. The best achievable accuracy
will not always be 100%, because in scenarios where images are blurred, for example, it may
be impossible to extract information about what is in the image.
4.3.1 Avoidable bias
We will hardly ever actually have a number for Bayes optimal error. Instead, we use human-
level error to approximate Bayes optimal error. From Table 4.3, we have two different tasks
that give us the same training and development error but differ in human error. In task
A, where our training error is quite far from our human error, we likely want to focus on
improving the bias in our model. In contrast, when our training error is close to our human
error, like in task B, it may be best to focus on improving variance in the model. We
say that the difference between our approximation for Bayes optimal error and our training
error is the avoidable bias, and the difference between the training error and development
error is the variance.

Task                 A     B
Human Error          1%    7.5%
Training Error       8%    8%
Development Error    10%   10%

Table 4.3: Example error rates for two tasks with different human error rates.
4.3.2 Understanding human-level performance
Suppose we have the following medical imaging classification error rates:

Typical human: 3%
Typical doctor: 1%
Experienced doctor: 0.7%
Team of experienced doctors: 0.5%

and we want to determine a human-level error. Since human-level error is often used as an
approximation of Bayes optimal error, it is best, in this case, to go with the error rate of
0.5% from the team of experienced doctors. For deployment purposes, however, we may
have an application that just needs to beat a typical doctor, in which case it may be best
to use the typical doctor's error rate of 1% as human-level.
4.3.3 Surpassing human-level performance
If we have the error rates

Team of humans: 0.5%
One human: 1%
Training error: 0.3%
Dev error: 0.4%

where we have already surpassed our estimate for Bayes optimal error, then it is hard to
determine what we should focus on improving next. There are a number of problems that fall
into categories where machine learning far outperforms human-level performance, such as
online advertising, product recommendations, transit time predictions, and loan approvals.
4.4 Error analysis
Suppose that we are working on a cat classifier and we have achieved a 10% error rate, with
a Bayes optimal error rate of 2%. Further, suppose we want to see whether better recognition
of dogs could help improve our cat classifier in cases where it mislabels a dog as a cat. Before
attempting to make our cat recognizer do better with dogs, it is likely best to find out how
many images are mislabeled in our development set and how many of those images are dogs.
If we have around 100 mislabeled images in total and only 5 of those images contain dogs,
then the best error rate we could reach by perfectly recognizing dogs is 9.5%. Instead,
if 50 of the mislabeled images in our development set are dogs, then improving dog
classification would be more worth our time, since the ceiling for our development set error
rate would be 5%.

We can also look for multiple sets of misclassified images in parallel. For example, for our
cat classifier, we may want to check for dogs, great cats (lions, panthers), and blurry
images being mislabeled. To check for multiple categories, it may be best to create a table
or spreadsheet like the one shown in Table 4.4.
Image     Dog   Great cats   Blurry   Comments
1         ✓                           Pitbull
2                            ✓
3               ✓            ✓        Rainy day at zoo
...
Total %   8%    43%          61%

Table 4.4: Using a table to represent our misclassified images
4.4.1 Cleaning up incorrectly labeled data
With a large enough training set, it may not matter much if data is randomly labeled
incorrectly. However, for smaller data sets, or if data is systematically labeled incorrectly, it
may be in our best interest to manually fix the labels. In our development set, we may
update our table to include a column for incorrectly labeled instances, as shown in Table 4.5.
Image     Dog   Great cats   Blurry   Incorrectly labeled   Comments
...
98                                    ✓                     Cat was in background
99              ✓
100                                   ✓                     Cat drawing
Total %   8%    43%          61%      6%

Table 4.5: Adding an incorrectly labeled category to our table of misclassified images
We tend to only make adjustments to the labels in our development set when the changes
could produce significantly better results. For example, if our development set error was
10% and the error due to incorrect labels was 0.6%, then it may be most worthwhile to
focus on the other 9.4% first. If we later get the development set error rate down to 2%, while
still having 0.6% error due to incorrect labels, then it may be worthwhile to adjust the labels
so that they are accurate. If we go in and adjust the development set for errors in the
mislabeled images, then it may also be worthwhile to look through the images our network
classified correctly, since some of those may carry incorrect labels as well.
4.5 Mismatched training and dev set
Figure 4.5.1: Training data (200,000 images) could vary significantly from the data our model views once deployed (10,000 images uploaded from users).
After developing a model, we may test it in the real world and find that the data uploaded
by users varies significantly from the data we trained our model on. Consider the data in
Figure 4.5.1, which shows that we have 200,000 images in our training set and 10,000
images from users, each distributed differently. We also know that we want to
optimize for the uploaded data from the users as best as possible, although 10,000 images
will not work well as an entire training set.
One way we could adjust our sets would be to combine the training and uploaded data into
a single large array, then shuffle the array and break up the data into groups for a training,
development, and testing set. Suppose we set our training set size to be 205,000 and our
development set size and testing set size each to 2,500. If we were to break the data up randomly,
then we would only expect around 119 images in each of the development and test sets to come
from the data uploaded by users.

Instead, since our primary goal is to optimize for the data uploaded by users, we could again use
bin sizes of 205,000 for the training set and 2,500 for both the development and testing sets,
but this time we will make up the development and testing sets entirely from the uploaded
user data. While the data is not split evenly, we tend to prefer this distribution of
data because it allows us to focus more on optimizing the right goals.
4.5.1 Bias and variance with mismatched data distributions
When we distribute the types of data differently between our training set and development
set, it is likely that we may have a model with a low training error and a relatively high
development error, even without having a variance problem. Instead of directly comparing
the results against the development set, we will introduce a training-development set,
which has the same distribution as the training set, although we do not actually use data in
this set to train on. Now, we can look at the training set error rate, the training-development
set error rate, and the development set error rate, where we will determine the variance
based on the difference between the training set error rate and the training-development set
error rate. If we have both a low training set error rate and a low training-development set error
rate while our development set error rate is high, then we say we have a data mismatch
problem.
4.5.2 Addressing data mismatch
With a data mismatch problem, it may first be beneficial to look at where the errors are
coming from in the development set. If we find, for example, that our errors are due to
blurry images, then we may want to get more training examples of blurry images. Instead of
actually going out and getting more data, we could use artificial data synthesis, which
in this case would take crisp images and apply a blurring filter to them.

For another example, if we are trying to create a trigger word detector and we have a bunch
of low-noise voice recordings, but we find that the users' data once our model is deployed
tends to have a lot of noise, then we can artificially add car noise. One thing to keep in
mind here is that if we have a lot more voice recording data than we have car noise data,
adding the same car noise to each audio clip may cause our model to overfit to the car noise
data.
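As a rough illustration of this kind of synthesis (a sketch; the array names and the noise_level parameter are hypothetical), we can mix a randomly chosen segment of a long noise recording into each clean clip, so that no two clips share exactly the same noise:

import numpy as np

def add_car_noise(clean, noise, noise_level=0.1):
    """Mix a randomly chosen slice of a long noise recording into a clip.

    Sampling a different noise segment for every clip helps avoid
    overfitting to a single noise recording."""
    rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(clean))
    segment = noise[start:start + len(clean)]
    return clean + noise_level * segment

# Hypothetical data: 1 second of speech and 10 minutes of car noise at 16 kHz.
speech = np.random.randn(16_000)
car_noise = np.random.randn(16_000 * 600)
noisy = add_car_noise(speech, car_noise)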
4.6 Learning from multiple tasks
4.6.1 Transfer learning
Transfer learning is the process of taking a neural network that has been trained for one task
and using it as a starting point to train for another. For instance, if we are training
an image classification system, we may be able to transfer the low-level implementation
details in a neural network, which ideally look for edges and curves, and use those details
for radiology diagnosis.

When changing the last layer of a neural network, as shown in Figure 4.6.1, we should set
the new weight and bias terms randomly. We are also able to add more layers to a network
if necessary and are not restricted to simply changing out the last layer when using transfer
learning.
Figure 4.6.1: Disconnecting the end of a neural network and transferring each trained weight and bias to another, similar problem.
In our image classification to radiology diagnosis example, we say that the image classifica-
tion data is the pre-training data, while the radiology diagnosis data is the fine-tuning
data.

We will often change the number of layers that are learned in a neural network based on
the amount of fine-tuning data we have available. With more fine-tuning data it would be
beneficial to retrain more layers, while with only a little bit of fine-tuning data we would rely
more heavily on our pre-training data. It is also common to freeze the representations of
earlier layers in a neural network and only learn the disconnected part, or the latter part
of the model. With plenty of data, we may use the pre-trained model as our initialized
representations and retrain the entire network.
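The notes do not include code for this, but as a rough sketch of the workflow, here is one way it could look in Keras (the framework choice, the pre-trained model, and the hypothetical 3-class radiology head are all our assumptions):

import tensorflow as tf

# Load a network pre-trained on ImageNet, dropping its original output layer.
base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained representations

# Attach a new, randomly initialized output layer for the fine-tuning task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# With plenty of fine-tuning data, we could instead unfreeze everything and
# retrain the whole network:
# base.trainable = True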
4.6.2 Multi-task learning
For multi-task learning, suppose that we are building a self-driving car and need to classify
pedestrians, cars, stop signs, and traffic lights. Instead of building four separate neural
networks for these tasks, we can use a single neural network with multi-hot
encoding, where our outputs are labeled with

y^{(i)} = [pedestrian?, car?, stop sign?, traffic light?]^T    (4.6.1)

and the final layer has four output units. If some of our labels do not have all the information
filled in, for instance if it is unknown whether there is a stop sign or traffic light in the image,
then we can have our loss function not sum over those values (see the sketch at the end of this
section). Multi-task learning is different from softmax regression because each image can have
multiple labels.
Multi-task learning works best when the tasks share low-level features, when each task has a
relatively similar amount of data, and when we can use a large enough neural network.
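A minimal NumPy sketch of such a loss (using NaN as a hypothetical marker for missing labels) might look like this:

import numpy as np

def multi_task_loss(y_hat, y):
    """Binary cross-entropy summed over tasks, skipping unlabeled entries.

    Unknown labels are marked with NaN and do not contribute to the loss."""
    known = ~np.isnan(y)
    y, y_hat = y[known], y_hat[known]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = np.array([0.9, 0.2, 0.6, 0.1])     # pedestrian, car, stop sign, light
y     = np.array([1.0, 0.0, np.nan, 0.0])  # stop sign label is unknown
print(multi_task_loss(y_hat, y))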
Chapter 5
Convolutional neural networks
With our current approach to image classification, where we took pixels as our input, we
will quickly find that with larger images or images with different aspect ratios, our neural
network will no longer work well. Consider an image of size 1000 × 1000 × 3, which would make
our input size 3 million. With such a large input, it is too computationally expensive
to run most models, and overfitting is prone to occur. Instead, we will introduce and use
convolutional neural networks in this chapter.
5.1 Edge detection
From the field of image processing, we have long been able to apply filters to images. We
represent different filters as matrices or tensors. Applying a filter to a set of pixels can
give us properties of an image, including where the edges are. When we apply a filter to a
matrix of the same size, the result is the sum of the element-wise product of the filter and the
matrix.

With images, the matrix that represents the pixels is usually larger than the filter. In cases
where our matrix is of a different size than the filter, we can apply a convolution operation
that results in a matrix of all the possible ways to place a complete filter across the image.
It is typical to denote the convolution operator as ∗, which is not to be confused with the
element-wise operator we defined with the same notation earlier.
We will denote filters to be of size f × f and our image matrix of size n × n. After convolving
a filter across our image matrix, we will end up with a matrix of size (n − f + 1) × (n − f + 1).
An example of a convolution operation can be found in Figure 5.1.1.
Figure 5.1.1: A convolution over a matrix takes the sum of the element-wise product between a filter and same-sized regions of the matrix. The value from the convolution between the filter and the upper left-hand corner of the matrix corresponds to the top left entry in the result of the convolution.
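A minimal NumPy sketch of this operation (a plain nested loop, not an optimized implementation; as discussed in Section 5.4, this is technically cross-correlation):

import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution: slide the f x f kernel over the n x n image and
    take the sum of the element-wise product at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
image = np.random.randint(0, 10, (6, 6))
print(conv2d(image, vertical_edge).shape)  # (4, 4), i.e. n - f + 1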
With different filters, we can find different properties of our image. When looking
at results after applying an edge detection filter, gray represents no edge detected, black
represents positive values from our filter, and white represents negative values from our
filter, as shown in Figure 5.1.2. Some of the most common filters used for edge detection
are shown in Figure 5.1.3.
(a) Original image (b) Horizontal edge detector (c) Vertical edge detector
Figure 5.1.2: Resulting images after applying a horizontal and vertical filter
(a) Standard vertical filter:
[ 1  0  -1 ]
[ 1  0  -1 ]
[ 1  0  -1 ]

(b) Sobel vertical filter:
[ 1  0  -1 ]
[ 2  0  -2 ]
[ 1  0  -1 ]

(c) Scharr vertical filter:
[  3  0   -3 ]
[ 10  0  -10 ]
[  3  0   -3 ]

(d) Standard horizontal filter:
[ -1  -1  -1 ]
[  0   0   0 ]
[  1   1   1 ]

(e) Sobel horizontal filter:
[ -1  -2  -1 ]
[  0   0   0 ]
[  1   2   1 ]

(f) Scharr horizontal filter:
[ -3  -10  -3 ]
[  0    0   0 ]
[  3   10   3 ]

Figure 5.1.3: Different types of horizontal and vertical filters that are used for edge detection
5.2 Padding
If we stack multiple convolution operations back to back, then we will end up with a
significantly smaller final output image than we started with. One way to combat the changing
size is to add padding to the image. When we add padding to the image, we
typically add a border of all 0s, as shown in Figure 5.2.1.
Figure 5.2.1: Adding padding to our image can produce an output of the same size as our input.
We add padding uniformly to each side of the image, and the amount of padding we add
is denoted by p. With padding, the output of our convolution will be of size (n + 2p − f +
1) × (n + 2p − f + 1). If we want our input size to be equal to our output size, then we can
solve n + 2p − f + 1 = n for p and find

p = (f − 1)/2.    (5.2.1)
When the output size matches the input size, we call the convolution a same convolution,
and when we do not add padding, we say that we have a valid convolution. It is also most
common to see f as an odd number, because that allows the filter to have a distinct
center pixel.
5.3 Strided convolutions
Figure 5.3.1: An invalid filter placement. We will not use filter placements that run over the side of the image; instead, we will only use placements that fill a complete f × f grid.
With convolutions, we are also able to take stride lengths that are larger than 1. From the
top left corner, when moving right or down, we will move each spot in the filter a length of
s. With strides, we only count the convolutions that lie entirely inside the boundary of our
image with padding, as shown in Figure 5.3.1. Therefore, the size of our output will be

⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋.    (5.3.1)
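Equation 5.3.1 is easy to wrap in a small helper; a sketch:

import math

def conv_output_size(n, f, p=0, s=1):
    """Output dimension of a convolution (Equation 5.3.1)."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # valid convolution: 4
print(conv_output_size(6, 3, p=1))       # same convolution: 6
print(conv_output_size(7, 3, p=0, s=2))  # strided: 3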
5.4 Cross-correlation vs convolution
In image processing, the convolution operator actually mirrors the filter horizontally and
vertically, as shown in Figure 5.4.1. The method we have been using thus far, where we do
not mirror the filter, is known in image processing as cross-correlation. However, it is most
common in computer vision to use the word convolution in place of cross-correlation, which
is what we will continue to do.
[ 1 2 3 ]        [ 9 8 7 ]
[ 4 5 6 ]   →    [ 6 5 4 ]
[ 7 8 9 ]        [ 3 2 1 ]

Figure 5.4.1: With a traditional convolution operation, we first flip the filter horizontally and vertically.
5.5 Convolutions over volume
For colored images, we have three channels (red, green, and blue) and can define a filter for
each channel, as shown in Figure 5.5.1. For channels with a height and width of f, our filter
will have f × f × 3 parameters. To convolve over volume, we apply the front filter only to the red
channel, the second filter only to the green channel, and the third filter only to the blue
channel. Then, on each channel, we carry out the convolution process described earlier and
sum the results across channels, so the stack of filters still produces a single 2D output.
Previously, we saw that each filter may be looking for different things inside of an image.
We can also set up multiple filters, storing the resulting matrices from each convolution
inside of a tensor. We will denote the number of filters as n_c, so our output will
be n_c layers deep, as shown in Figure 5.5.2.
Figure 5.5.1: Convolution of an RGB image.

Figure 5.5.2: Using a convolution with n_c filters produces an output with n_c channels, each channel corresponding to the result of one convolved filter.
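Extending the earlier 2D sketch, a minimal NumPy implementation of convolution over volume with multiple filters might look like this (again a plain loop, not optimized):

import numpy as np

def conv_volume(image, filters):
    """Convolve an n x n x n_c image with a bank of f x f x n_c filters.

    Each filter is applied across all input channels and the results are
    summed, so each filter contributes one 2D slice of the output."""
    n = image.shape[0]
    f, _, _, num_filters = filters.shape
    out = np.zeros((n - f + 1, n - f + 1, num_filters))
    for k in range(num_filters):
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[..., k])
    return out

image = np.random.randn(6, 6, 3)       # RGB image
filters = np.random.randn(3, 3, 3, 4)  # four 3 x 3 x 3 filters
print(conv_volume(image, filters).shape)  # (4, 4, 4)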
5.6 One layer convolution network
For convolutional neural networks, we will treat the resulting matrices created from the
convolution, added to the bias term b, as our linear activation z. Then we will put each
element through a non-linear activation, as shown in Figure 5.6.1. Each index inside of the
filter tensor is treated as a weight.
Figure 5.6.1: One layer of a convolutional neural network: a^{[0]} is convolved with the filters W^{[1]}, the biases b^{[1]} are added to form z^{[1]}, and a ReLU produces a^{[1]}.
For the lth layer of a convolutional neural network:

f^{[l]} = filter size
p^{[l]} = padding
s^{[l]} = stride
n_c^{[l]} = number of filters
n_h^{[l-1]} × n_w^{[l-1]} × n_c^{[l-1]} = size of input (height × width × channels)
n_h^{[l]} = ⌊(n_h^{[l-1]} + 2p^{[l]} − f^{[l]})/s^{[l]} + 1⌋
n_w^{[l]} = ⌊(n_w^{[l-1]} + 2p^{[l]} − f^{[l]})/s^{[l]} + 1⌋
f^{[l]} × f^{[l]} × n_c^{[l-1]} = size of each filter
n_h^{[l]} × n_w^{[l]} × n_c^{[l]} = size of the activations a^{[l]}
f^{[l]} × f^{[l]} × n_c^{[l-1]} × n_c^{[l]} = size of the weights
n_c^{[l]} = size of the bias
5.7 Pooling layers
Pooling layers give us a way to shrink the size of our input while still capturing the essential
details. The primary types of pooling are average pooling and max pooling. Pooling
layers have both a filter size and a stride, although there are no weights to be learned. Instead,
pooling applies a function to each region of the input it focuses on, where each frame is f × f
and the stride length is s. For 2D max pooling, the output is the maximum number inside
of the frame, and for 2D average pooling, the output is the average of all the numbers inside
of the frame.
Input:
[ 5 6 1 2 ]
[ 1 3 2 3 ]
[ 2 9 1 1 ]
[ 1 3 2 1 ]

(a) Max pooling output:
[ 6 3 ]
[ 9 2 ]

(b) Average pooling output:
[ 3.75 2.00 ]
[ 3.75 1.25 ]

Figure 5.7.1: 2D pooling with the hyperparameters f = 2 and s = 2
For 3D pooling, the number of channels in the input will match the number of channels
in the output. Each of the channels works independently: for the ith channel in the
output, we take the pool from the filter over the ith channel of the input.

Max pooling is used most commonly because it generally produces better results. Some
people hypothesize that max pooling performs better because it captures, in each frame,
whether a certain aspect of the image is present, while average pooling can be saturated with
a lot of data.

In deep learning literature, we combine convolution and pooling steps into a single layer
because pooling does not have any representations to be learned.
5.8 Why we use convolutions
As a thought experiment, consider a single layer neural network with an input size of
32 × 32 × 3 and an output size of 28 × 28 × 6. If we were to use a fully-connected neural
network, we would have around 14 million representations we would have to learn, whereas
if we use a convolutional neural network with a filter size of 5 × 5, then we would only need
to learn 156 representations.

With fewer representations, we can take advantage of parameter sharing and sparsity
of connections. Parameter sharing allows us to detect features at any area inside of the
image. Sparsity of connections means that for every output, there will only be a few
numbers that impact it.
5.9 Classic networks
Since choosing the architecture for a neural network is often an empirical task, we can look
at other people's models to make better decisions about our own. The main way to look at
how other people build their models is by reading their research. We will be going over how
LeNet-5, AlexNet, VGG-16, ResNet, and Inception work.
5.9.1 LeNet-5
Figure 5.9.1: LeNet-5 architecture (LeCun et al., 1998): a 32×32 input, alternating convolutions and subsampling producing feature maps of 6@28×28, 6@14×14, 16@10×10, and 16@5×5, followed by fully connected layers of 120 and 84 units and an output of 10 units.
LeNet-5, named after the paper's lead author Yann LeCun, was introduced in 1998 with a
primary goal of recognizing characters. LeNet-5 uses average pooling, which was more widely
used at the time of publication. This model contains around 60 thousand representations,
which is now considered a small amount. Among other things, this architecture popularized
the ideas of shrinking the spatial size while increasing the number of channels as we move
deeper into the network, and of following convolutional layers with pooling layers.
5.9.2 AlexNet
Figure 5.9.2: AlexNet architecture (Krizhevsky et al., 2012).
AlexNet looks quite similar to LeNet-5; however, AlexNet has around 60 million parameters
in total and uses the ReLU activation function. AlexNet sparked a surge of
research toward artificial intelligence when published in 2012, because the results from the
neural network far outperformed state-of-the-art approaches at the time.
5.9.3 VGG-16
Figure 5.9.3: VGG-16 architecture: stacks of convolutions with ReLU followed by max pooling, shrinking 224×224 feature maps down to 7×7×512, followed by fully connected layers with ReLU of 4096 units and a 1000-way softmax.
The VGG-16 architecture was introduced in a 2015 paper by Karen Simonyan and Andrew
Zisserman. In it, all convolutions are same convolutions with a filter size
of 3 × 3 and a stride length of 1. Each max pooling layer has a filter size of 2 × 2 with a
stride length of 2. This architecture is on the larger side of modern neural networks, with
a total number of representations around 138 million, and it sticks out primarily due to its
simple choices for hyperparameters that turn out to work well.
5.9.4 ResNet
A residual network (ResNet) gives us a way to skip connections inside of a neural network.
With a plain network, our path from a^{[ℓ]} to a^{[ℓ+2]} may be what is shown in Figure 5.9.4.
a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → ReLU → a^{[ℓ+2]}

Figure 5.9.4: Plain network
However, with a ResNet, we use what is known as a residual block, which sets

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]}),    (5.9.1)

where adding a^{[ℓ]} is the skip connection. The path of a residual block is shown in Figure 5.9.5.
a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → (+ a^{[ℓ]}) → ReLU → a^{[ℓ+2]}

Figure 5.9.5: Residual block
As shown in Figure 5.9.6, residual networks work well with deep neural networks that have
a lot of layers whereas plain neural networks often have weight decay.
Figure 5.9.6: The error rate for plain networks (left) and ResNets (middle, right) when the
number of layers in the network changes (He et al., 2015).
The reasoning why ResNets work with more layers can be shown by expanding the right-
hand side of Equation 5.9.1:

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]})    (5.9.2)
         = g(W^{[ℓ+2]} a^{[ℓ+1]} + b^{[ℓ+2]} + a^{[ℓ]}).    (5.9.3)

Notice that when weight decay occurs in plain neural networks, where W^{[ℓ+2]} → 0, and we
are using a ReLU activation function, we will come close to having a^{[ℓ+2]} = g(a^{[ℓ]}) = a^{[ℓ]}.
Therefore, adding more layers to a ResNet may have no effect on the activation if weight
decay occurs, or the weights could add additional information that could decrease the error
rate.
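A minimal NumPy sketch of a fully-connected residual block (an illustration of Equation 5.9.1, not the convolutional blocks used in the actual paper) shows why a block with decayed weights passes its input through unchanged:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two linear + ReLU steps with a skip connection (Equation 5.9.1)."""
    a_l1 = relu(W1 @ a_l + b1)   # a^[l+1]
    z_l2 = W2 @ a_l1 + b2        # z^[l+2]
    return relu(z_l2 + a_l)      # a^[l+2] = g(z^[l+2] + a^[l])

a = relu(np.random.randn(4))                    # non-negative input activation
W1, b1 = np.random.randn(4, 4), np.zeros(4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)          # simulate decayed weights
print(np.allclose(residual_block(a, W1, b1, W2, b2), a))  # True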
Figure 5.9.7: How a 34-layer ResNet compares to a 34-layer plain network and VGG-19, which is a modified version of VGG-16 (He et al., 2015).
5.9.5 1 × 1 convolution
While a 1 × 1 convolution would simply scale each value in a 2D input, with 3D inputs
we can do much more. For a 1 × 1 convolution, the depth of the filter is equal to the
depth of the input. The output at each pixel position comes from the dot product between
the filter and the input values across all channels at that position. Also, our output will
always have the same height and width as our input. With multiple filters, we add depth
to the output, as shown in Figure 5.9.8. 1 × 1 convolutions may also be referred to as a
network in network (Lin et al., 2013).
Figure 5.9.8: 1 × 1 convolution
5.9.6 Inception network
The primary goal of the inception network is to allow a model to learn the best hy-
perparameter choices for each layer of a network. Instead of manually choosing a filter size,
the model uses three different filter sizes: 1 × 1, 3 × 3, and 5 × 5; in addition, each layer
can also learn max pooling. To pick the best hyperparameters, the model concatenates the
results from convolving with the different filter sizes and from max pooling into a single output,
as shown in Figure 5.9.9a.
When we go deeper into the network, we often run into a large number of channels with
relatively small spatial dimensions. Now consider, for example, a layer whose input is of size
28 × 28 × 192 with a same convolution that has a 5 × 5 × 192 filter size and 32 filters. In
this scenario, our output will be of size 28 × 28 × 32, and for each output value we will need
5 × 5 × 192 multiplications, which equates to around 120 million in total. Even with
modern computers, 120 million multiplications take a substantial amount of time.

Instead, if we first pass the input of size 28 × 28 × 192 into a 1 × 1 × 192 convolution with
16 filters and then pass that result into a 5 × 5 × 16 same convolution with 32 filters, we
end up with a fairly similar result at a far smaller computational cost. In total, we
would have around 12 million multiplications using the dimension reduction method
just proposed, which is shown in Figure 5.9.9b. Also, notice in the dimension reduction
figure that we pass max pooling into a 1 × 1 convolution. This is because a max pooling
filter cannot shrink the number of channels. Rather, we shrink the number of channels with
a 1 × 1 convolution, where the number of filters corresponds to the output channel size.
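These multiplication counts are straightforward to verify directly:

# Direct 5x5 same convolution on a 28 x 28 x 192 input with 32 filters:
direct = 28 * 28 * 32 * (5 * 5 * 192)
print(f"{direct:,}")       # 120,422,400  (~120 million)

# 1x1 bottleneck down to 16 channels, then a 5x5 convolution to 32 channels:
bottleneck = 28 * 28 * 16 * (1 * 1 * 192) + 28 * 28 * 32 * (5 * 5 * 16)
print(f"{bottleneck:,}")   # 12,443,648   (~12 million, roughly a 10x saving)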
Figure 5.9.9: Inception layer (Szegedy et al., 2015). (a) Naive version: 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling applied to the previous layer, with their outputs concatenated. (b) Dimension reduction: 1×1 convolutions placed before the 3×3 and 5×5 convolutions and after the max pooling.
The entire original inception network is shown in Figure 5.9.10 and was named GoogLeNet
after the popular LeNet architecture.
Figure 5.9.10: GoogLeNet architecture (Szegedy et al., 2015), built from stacked inception modules, with two auxiliary softmax outputs partway through the network and one at the end.
Figure 5.9.11: Meme cited in the original inception paper as the reasoning for the name inception network.

As a fun fact, the inception paper cites the popular meme in Figure 5.9.11, which comes from
the movie Inception, as the reasoning for the network's name.
5.10 Competitions and benchmarks
To beat state-of-the-art models on a specific benchmark in competitions, a few
methods are employed that are rarely used in practice, including ensembling and multi-crop at
test time. Ensembling is when we run each instance of the testing set through a range of
models, usually around 3-15, and use the prediction that either occurs most often (classi-
fication) or the average prediction (regression). Ensembling often performs better, but
it introduces a significant computational cost. The other method employed at
test time is multi-crop, which averages the output from different crops of the image. One
common way to multi-crop an image is known as 10-crop and is shown in Figure 5.10.1.

Figure 5.10.1: 10-crop includes mirroring an image and taking different croppings of both the image and its mirroring.
Chapter 6
Detection algorithms
6.1 Object localization
Figure 6.1.1: Classification vs. localization vs. detection: (a) image classification, (b) classification with localization, (c) detection.
Previously, we have only been working on image classification, which assumes only one
object is present in an image and determines what that object is. Next, we will be going into
classification with localization, which also determines where the object is located in the image,
as shown in Figure 6.1.1b. We will also be going over object detection, which allows us to
classify multiple occurrences of the same or different classes in a single image, as shown
in Figure 6.1.1c.
With classification with localization, we are going to assume there is only one object in every
image. As an example, we are going to walk through how a self-driving car could localize a
pedestrian (class 1), car (class 2), motorcycle (class 3), or, if no object exists, a background
(class 4). We can then change the output of the neural network from a softmax over 4 categories
to also include a bounding box with a center point, height, and width. The center of the bounding
box will be denoted as (b_x, b_y), and the height and width will be b_h and b_w, respectively.
Currently, we will be using square images and will label the top left corner as (0, 0) and the
bottom right corner as (1, 1). The entire labeling of our image is shown in Figure 6.1.2.
Figure 6.1.2: Our labeling for localization: the image spans (0, 0) in the top left to (1, 1) in the bottom right, with a bounding box of height b_h and width b_w centered at (b_x, b_y).
We will also need to adjust our supervised learning labels to match the output of our model.
We will define the output as

y = [P_c, b_x, b_y, b_h, b_w, pedestrian?, car?, motorcycle?]^T,    (6.1.1)

where P_c is the probability that an object of class 1, 2, or 3 is present. A few labeled examples
are shown in Figure 6.1.3.
(a) Car label:
y = [1, 0.5, 0.7, 0.2, 0.5, 0, 1, 0]^T

(b) Background label:
y = [0, None, None, None, None, None, None, None]^T

Figure 6.1.3: Our labels for different images
We will also define our loss function to be a squared error loss function

L(ŷ, y) = \sum_{i=1}^{8} (ŷ_i − y_i)^2   if y_1 = 1
L(ŷ, y) = (ŷ_1 − y_1)^2                  if y_1 = 0,    (6.1.2)

where the 8 comes from the size of our label.
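A minimal NumPy sketch of this loss (an illustration; the prediction vector here is hypothetical):

import numpy as np

def localization_loss(y_hat, y):
    """Squared error loss from Equation 6.1.2.

    If an object is present (y[0] == 1), all 8 components are penalized;
    otherwise only the object-presence term matters."""
    if y[0] == 1:
        return np.sum((y_hat - y) ** 2)
    return (y_hat[0] - y[0]) ** 2

y     = np.array([1, 0.5, 0.7, 0.2, 0.5, 0, 1, 0])          # car label
y_hat = np.array([0.9, 0.4, 0.7, 0.3, 0.5, 0.1, 0.8, 0.1])  # hypothetical output
print(localization_loss(y_hat, y))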
6.2 Landmark detection
Landmark detection works to find specific points of interest in an image. Recently, landmark
detection has been used to determine human emotions, determine the orientation of a moving
body, and apply Snapchat filters. In order to create a Snapchat face filter, like the one shown in
Figure 6.2.1c, we can train a neural network to determine specific landmark locations inside
of an image. To determine landmark locations on a face, we first need to classify a face and
then label each landmark's location in our training data, as shown in Figure 6.2.2, where
each label with n landmarks would be in the form
y = [face?, ℓ_{1x}, ℓ_{1y}, ..., ℓ_{nx}, ℓ_{ny}]^T,    (6.2.1)
where (ℓ_{ix}, ℓ_{iy}) is the xy-coordinate of the ith landmark. We also must make sure that
we always label the ith landmark consistently in each image. For example, if the first
landmark is the leftmost part of somebody's left eye in one image, then it also must be the
leftmost part of a person's left eye in every other image.
Figure 6.2.1: Applications for landmark detection: (a) emotion classification (e.g., happy vs. angry), (b) body tracking, (c) Snapchat face filters.
6.3 Object detection
Object detection focuses on locating multiple objects in a given image. A naive approach to
object detection is the sliding windows detection algorithm. We will start with
a training set of tightly cropped images, as shown in Figure 6.3.1.
Figure 6.2.2: Adding a specific numbered landmark to each part of the image we are interested in.
Figure 6.3.1: Sliding windows detection training set: tightly cropped images labeled 1 if they contain the object and 0 otherwise.
Then, we will create a window and slide it throughout our image. At each location where
the window is on the image, only the pixels that the window is covering will be passed into
the convolutional neural network, in hopes that some window will perfectly identify the location
of objects. We will then change the size of the window and repeat the sliding process, as
shown in Figure 6.3.2.
With the sliding windows detection algorithm just presented, we will run into a high computational
cost in order to have an accurate image detector. Further, improving the algorithm
by making the stride length shorter and testing more window sizes will increase the cost.
6.4 Sliding windows with convolution
To improve the computational cost of our sliding windows detection algorithm, let us try
a convolutional implementation where the output corresponds to sliding the window
across each position. Consider taking a 3 × 3 matrix and convolving it twice with a 2 × 2
filter, which will produce a 1 × 1 output, as shown in Figure 6.4.1a. Now, consider if we were
to add a row and column to the end of our input matrix, increasing it to a size of 4 × 4, and
then convolve it twice with a 2 × 2 filter. What we end up with is a 2 × 2 output matrix
where the upper left entry is equivalent to the two convolutions from the 3 × 3 input matrix,
as shown in Figure 6.4.1b. Similarly, the top right entry will be the two convolutions of
the 3 × 3 input matrix without the first column and without the last row.

Using the method just introduced, we have a way to create a sliding window with a convolution.
Taking this method a step further, we can also adjust the stride by adding a max
pooling layer, where the filter size of the max pooling layer determines the stride.
Figure 6.3.2: The sliding windows detection algorithm chooses different window sizes and slides each window across the entire image. At each slide, the area that the window is covering is fed into a convolutional neural network in order to try and classify the image.
Figure 6.4.1: Applying the same model (two convolutions with f = 2) to inputs of different size: (a) a 3 × 3 input, (b) a 4 × 4 input. The dark portion in Figure 6.4.1b is equivalent to the model in Figure 6.4.1a.
Further, if we want to learn a more complex function, we can add 1 × 1 convolutions to
the end of our network that act like fully connected layers, as shown in Figure 6.4.2.
Figure 6.4.2: Convolutional neural network (convolutions with ReLU, max pooling with f = stride, and 1 × 1 convolutions acting like fully connected layers, ending in a softmax) that determines whether a window image is one of four categories: pedestrian, car, motorcycle, or background. The output size is #(vertical convolutions) × #(horizontal convolutions) × #(classes).
6.5 Bounding box predictions
y = [1, 0.1, 0.5, 1, 1.3, 0, 1, 0]^T        y = [1, 0.4, 0.9, 1.9, 1.2, 0, 1, 0]^T

Figure 6.5.1: Using bounding boxes on an image containing cars, with the corresponding labels. We are using the labeling set in Equation 6.1.1.
Currently, we have only seen guess-and-check algorithms in terms of choosing a filter and
stride size. Instead, it would be nice to only have one pass through a convolutional neural
network that performs object detection. Let us consider breaking the image into a grid
and treating each grid cell as its own image that we have to label. If an object we are
trying to identify has its center point in a grid cell, then we will label P_c in that cell as 1.
Grid cells that do not contain any objects, or contain only non-center points of objects, will
have P_c as 0 with None in all of the other values. When labeling images relative to a grid
cell's coordinates, notice that our constraints are 0 ≤ b_x, b_y ≤ 1 and b_h, b_w ≥ 0. With a 3 × 3
grid, the output will need to be 3 × 3 × 8, because each grid cell has its own label with 8
entries. Therefore, we can construct a convolutional neural network, like the one shown in
Figure 6.4.2, that takes our image as input and outputs a 3 × 3 × 8 tensor. The way we have
set up bounding box prediction is known as the YOLO (You Only Look Once) algorithm, and
we will continue discussing it after introducing a few more ideas that play a key role.
6.6 Intersection over Union (IoU)
One common way to assess whether a bounding box prediction is accurate is to take the inter-
section of the labeled bounding box and the predicted bounding box and divide it by the
union of the labeled and predicted bounding boxes, known as IoU. Typically, if IoU ≥ 0.5,
we say that our model's bounding box is correct; however, 0.5 is just an arbitrary choice
and is sometimes changed.
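A minimal Python sketch of IoU for axis-aligned boxes (the (x1, y1, x2, y2) corner format is our assumption; the notes do not fix a box representation):

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    with (x1, y1) the upper-left corner and (x2, y2) the lower-right."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ~= 0.143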
6.7 Non-max suppression
When using smaller grid sizes for bounding box prediction, it is more likely that neighboring
boxes will predict references to the same object, as shown in Figure 6.7.1a. To start, we will
first want to remove any boxes that are uncertain of containing a class.
Figure 6.6.1: Taking the intersection over union of two bounding boxes to assess correctness.
Figure 6.7.1: Predicted bounding boxes using a 19 × 19 grid, along with the confidence that a non-background class exists inside each box: (a) pre non-max suppression (confidences 0.9, 0.8, 0.7, 0.8, 0.95, 0.45), (b) post non-max suppression (0.9 and 0.95 remain).
Typically, we may remove all boxes that have P_c ≤ 0.6. With a way to calculate the similarity
between two bounding boxes (IoU), we can use it to determine whether two bounding boxes are
referring to the same object. Starting with the box that has the highest probability P_c that a
class exists inside of it, we will keep that box and then remove any other box from the same
class prediction that has an IoU ≥ 0.5 with the box in focus. Thus, we are effectively removing
the non-maximum boxes from the prediction. The complete algorithm for non-max suppression
is shown in Algorithm 8.
Algorithm 8 Non-max suppression
Input: boxes, a list of all bounding boxes
Initialize: finalBoxes, an empty list
Remove any box from boxes with P_c ≤ 0.6
while boxes length > 0 do
    maxBox ← box with max P_c from boxes
    Append maxBox to finalBoxes
    Remove maxBox from boxes
    for each box in boxes do
        if IoU(box, maxBox) ≥ 0.5 then
            Remove box from boxes
        end if
    end for
end while
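A direct Python translation of Algorithm 8 (a sketch; the dictionary-based box representation is our assumption, and iou() is the function sketched in Section 6.6):

def non_max_suppression(boxes, p_threshold=0.6, iou_threshold=0.5):
    """Algorithm 8. Each box is a dict with keys 'p_c' and 'coords'."""
    boxes = [b for b in boxes if b["p_c"] > p_threshold]
    final_boxes = []
    while boxes:
        max_box = max(boxes, key=lambda b: b["p_c"])
        final_boxes.append(max_box)
        boxes = [b for b in boxes
                 if b is not max_box
                 and iou(b["coords"], max_box["coords"]) < iou_threshold]
    return final_boxes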
6.8 Anchor boxes
Figure 6.8.1: The midpoints of two objects, a pedestrian and a car, are located in the same grid cell.
With our current implementation for labeling each grid cell, we have no way to label two
separate classes that occur in the same cell. For example, Figure 6.8.1 shows the
center points of a pedestrian and a car occurring in the same center grid cell. We will start off
by defining a set number of different anchor boxes, as shown in Figure 6.8.2. The sizes of the
anchor boxes can be determined by a k-means clustering algorithm that looks for groupings of
box sizes among the labeled objects of each class.
(a) Anchor box 1 (b) Anchor box 2
Figure 6.8.2: Defining multiple distinct anchor boxes
Now, for each label in our training data, we will label it with

y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3,  P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T,    (6.8.1)

where the first eight entries correspond to anchor box 1 and the last eight to anchor box 2.
Then, for each grid cell, we will use the IoU to determine which anchor box is closest to each
bounding box. In Figure 6.8.1, the pedestrian would be assigned to anchor box 1 and the
car would be assigned to anchor box 2. We then fill in the output with the labels from each
bounding box. With our 3 × 3 grid, the output size will be 3 × 3 × 16, or 3 × 3 × (2 × 8),
since there are two anchor boxes and each anchor box has 8 entries. When using two anchor
boxes, our prediction will give us 2 bounding boxes in each grid cell.

We could always define more anchor boxes in practice, although with a larger grid size, it
is less likely for collisions in anchor boxes to occur. If we have more objects that share the
same grid cell than we have anchor boxes, or we have two objects in the same cell whose
bounding boxes have the same dimensions, our algorithm will not perform well.
6.9 YOLO algorithm
The YOLO (You Only Look Once) algorithm was introduced in 2016 by Joseph Redmon,
Santosh Divvala, Ross Girshick, and Ali Farhadi. The algorithm combines bounding box
predictions with anchor boxes and non-max suppression. With only having to move through
the neural network in one pass, YOLO beat the running time of all previous state-of-the-art
object detection algorithms.
Chapter 7
Sequence models
Sequence models allow us to work with variable-length inputs or outputs. This feature
greatly expands the number of problems we are able to tackle. For example,
sequence models are used for speech recognition, where given a variable-length audio clip, we
transcribe the spoken words; music generation, where we can input a set of data and output
a sequence of music; sentiment classification, where we could take in a written review and
output a star rating; DNA sequence analysis, where we take a DNA sequence and label the
part of the sequence corresponding to, say, a protein; machine (or language) translation,
where we are given a sentence in one language and have to convert it into another; video
activity recognition, where we are given a set of frames and want to output the activity; and
named entity recognition, where we are given a set of text and want to find the names within
the text.
As an example, suppose we would like to build a sequence model for named entity recognition
with the input

x : Harry Potter and Hermione Granger invented a new spell.    (7.0.1)

We will denote the tth word of the input as x^{<t>}, starting at index 1. For the output y, we
will use multi-hot encoding, where each word is labeled with a 1 or 0, as follows:

y : 1 1 0 1 1 0 0 0 0.    (7.0.2)
The tth element of the output will similarly be referred to as y^{<t>}. The length of the input
sequence will be denoted by T_x, and the length of the output sequence by T_y. In this
case, both T_x and T_y are equal to 9, although that is not always the case. If we want to
refer to the tth word in the ith element of the training sample, we will write x^{(i)<t>}. Since
the input and output lengths of a sequence can vary, we will use T_x^{(i)} and T_y^{(i)} to denote
the lengths of the ith input and output, respectively.
To represent words, we will need to use a vocabulary list of all the possible words that are likely
to occur. For modern applications, this could be around 50,000 words in total. We will
represent the vocabulary list sorted in ascending order. Then, to store x^{<t>}, we can use
one-hot encoding, where we have a 1 in the index of the vocabulary list that stores the word
we are looking at, and a 0 everywhere else. We will talk about unknown words, or words
that are not in our vocabulary list, soon.
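A minimal NumPy sketch of one-hot encoding over a tiny, hypothetical vocabulary list (with an unknown-word token):

import numpy as np

vocab = ["<UNK>", "a", "and", "harry", "hermione", "invented",
         "new", "potter", "spell"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Encode a word as a vector with a 1 at its vocabulary index."""
    v = np.zeros(len(vocab))
    v[index.get(word.lower(), index["<UNK>"])] = 1
    return v

print(one_hot("Harry"))    # 1 at index 3
print(one_hot("granger"))  # falls back to the <UNK> token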
7.1 Recurrent neural networks
The problem with using a standard neural network is that the input and output
lengths can differ across samples of our data. If we tried to set a maximum
length as the input, with 0 for each element that is not necessary, then we would not only
greatly diminish the number of problems that we can focus on, but we would also be using
a bunch of unnecessary memory in the network. A plain neural network will also not
take advantage of shared features of sequential data, similar to how convolutional neural
networks took advantage of using the same filters over different regions of the image.
With recurrent neural networks, we build a model that takes as input the current time step
t and all of the previous time steps in order to compute the output y^{<t>}. At each time step,
the model is parameterized by shared weights W_{ax}, W_{aa}, and W_{ya}, as shown in Figure 7.1.1.
A big problem with this type of representation is that we are only looking at the previous
time steps. Suppose we have the following sentences as input:

He said, "Teddy Roosevelt was a great President."
He said, "Teddy bears are on sale!"    (7.1.1)

In both of these cases, the first three words are equivalent; however, only in the first case
does the word "Teddy" represent a name. We will delve into this topic soon when we look
at bidirectional RNNs (BRNNs); however, to understand the basics of RNNs, we will stick
with our unidirectional RNN.
For now, we will set a^{<0>} = 0, which is typical in many cases. Now, to represent the first
activations and output in the recurrent neural network pictured in Figure 7.1.1, we will set

a^{<1>} = g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)    (7.1.2)
ŷ^{<1>} = g(W_{ya} a^{<1>} + b_y).    (7.1.3)

The activation functions in Equation 7.1.2 and Equation 7.1.3 do not have to be the same.
For the hidden activation function, tanh and ReLU are commonly used, while sigmoid or softmax
are typically used for the output activation function. To generalize the previous equations,
we have

a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)    (7.1.4)
ŷ^{<t>} = g(W_{ya} a^{<t>} + b_y).    (7.1.5)
We will further simplify Equation 7.1.4 to

a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a),    (7.1.6)

where W_a is formed by stacking

W_a = [W_{aa} | W_{ax}]    (7.1.7)

into a single matrix. We will also use the notation

[a^{<t-1>}, x^{<t>}] = the column vector formed by stacking a^{<t-1>} on top of x^{<t>}.    (7.1.8)

We will also simplify Equation 7.1.5 to

ŷ^{<t>} = g(W_y a^{<t>} + b_y).    (7.1.9)
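A minimal NumPy sketch of one forward time step (Equations 7.1.4 and 7.1.5), with arbitrary sizes chosen for illustration:

import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN time step: tanh for the hidden activation, softmax output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    z_y = Wya @ a_t + by
    y_hat = np.exp(z_y - z_y.max())
    y_hat /= y_hat.sum()
    return a_t, y_hat

n_a, n_x, n_y = 5, 10, 10  # hidden size and vocabulary size (hypothetical)
Waa, Wax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
Wya = np.random.randn(n_y, n_a)
ba, by = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)            # a^<0> = 0
for x_t in np.eye(n_x)[:3]:  # three one-hot inputs
    a, y_hat = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)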
7.1.1 Backpropagation through time
We will define our loss at time step t as our familiar cross-entropy, or logistic, loss

L^{<t>}(ŷ^{<t>}, y^{<t>}) = -y^{<t>} log ŷ^{<t>} - (1 - y^{<t>}) log(1 - ŷ^{<t>}).    (7.1.10)

For the entire sequence, we will compute the loss as

L(ŷ, y) = \sum_{t=1}^{T_y} L^{<t>}(ŷ^{<t>}, y^{<t>}).    (7.1.11)

Now, we can define the computation graph pictured in Figure 7.1.2. When backpropagating
through this network, we say that we are backpropagating through time, because we are going
backwards in the time series.
Figure 7.1.1: The structure of a basic recurrent neural network, which predicts ŷ^{<t>} based on all of the previous time steps along with the current time step. The parameters of the network (W_{ax}, W_{aa}, W_{ya}, b_a, b_y) are shared across all time steps.
Figure 7.1.2: Computation graph for our recurrent neural network.
7.1.2 Variational RNNs
RNNs can be broken into different categories, as shown in Figure 7.1.3. What we have
currently been working with is a many-to-many model. If we were doing sentiment
classification, where we input a review sentence and output a number of stars, we would use a
many-to-one model. A one-to-many model may consist of music generation, where we could
input some metadata (i.e., genre, style, etc.) and output scripts of music. Many-to-many
models may also consist of examples where the number of inputs and the number of outputs
differ, such as with machine translation.

Figure 7.1.3: Depending on the combination of fixed or variable length input and output, we can form different types of RNNs.
7.1.3 Language modeling
Suppose we would like to build a speech recognition system that inputs an audio clip of
speech and outputs the spoken words. Looking only at the pronunciation of each word is
not enough to determine the correct word that is spoken. For instance, "pair" and "pear";
"two", "to", and "too"; and "their", "they're", and "there" are all pronounced identically,
but are different words altogether. A language model may look at a set of sentences that
sound identical and choose the sentence that is most likely to have occurred. For example,
suppose we have the two equivalently spoken sentences

The apple and pair salad.
The apple and pear salad.    (7.1.12)

Our language model may output something of the form

P(The apple and pair salad.) = 3.2 × 10^{-13}
P(The apple and pear salad.) = 5.7 × 10^{-10},    (7.1.13)

and thus the second sentence would be selected because it is more probable.
To create a language model with an RNN, we need a training set that consists of a large corpus
of English text. The word corpus comes from the field of natural language processing and
refers to a set of text. For each sentence we are given as input, we will tokenize the words
using one-hot encoding. It is also common to add an end-of-sentence token. If we are given
a sentence like

The Egyptian Mau is a breed of cat.    (7.1.14)

and do not have the word "Mau" in our vocabulary list, then we will use a 1 to encode the
unknown-word token in our vocabulary list, while making everything else a 0.

Suppose we have a vocabulary list with 10k entries (excluding punctuation and including
a spot for unknown words) and are working with the following sentence:

x : Well, hello there!    (7.1.15)
With an RNN model that is trying to predict the next word, the inputs will be the ordered
list of words that came before the current time step t. For the first time step, the input will
be the 0 vector, and the first prediction ŷ^{<1>} will output a 10k-way softmax classification,
where each word in the vocabulary list is given a probability that it occurs; that is,

ŷ^{<1>} = P(<each word>) = [P(<word one>), ..., P(<word 10k>)]^T.    (7.1.16)
In this case, we have y^{<1>} as a one-hot encoding of the word "well". For the second
time step, our input will be y^{<1>} = "well". The prediction ŷ^{<2>} will be a conditional
probability over the second word, given the first word; that is,

ŷ^{<2>} = P(<each word> | y^{<1>} = "well").    (7.1.17)

For the third time step, the input will be y^{<1>} = "well", y^{<2>} = "hello". Following the
same pattern, the prediction will be in the form of a 10k-way softmax classifier, where

ŷ^{<3>} = P(<each word> | y^{<1>} = "well", y^{<2>} = "hello").    (7.1.18)
On the backward pass, we will look to maximize the predictions

ŷ^{<1>} = well,  ŷ^{<2>} = hello,  ŷ^{<3>} = there.    (7.1.19)

To predict the probability of an entire sentence, we could calculate

P(y^{<1>}) · P(y^{<2>} | y^{<1>}) · P(y^{<3>} | y^{<1>}, y^{<2>}) ⋯ P(y^{<T>} | y^{<1>}, ..., y^{<T-1>}).    (7.1.20)
7.1.4 Sampling novel sequences
To sample a sequence, we will use an RNN model similar to the one just described. However,
we will now change the input at each time step to be the prediction from the previous time
step, as shown in Figure 7.1.4. We will also have the predictions be random choices from the
output probability distribution, in order to avoid an infinite loop of the same few
words. In NumPy, we can use np.random.choice to draw a random word from the
output prediction distribution. To end the program, we can wait until the sentence is over,
in which case the output will be our <EOS> token. We could also iterate for a certain number
of time steps, or run the program for a specific amount of time. A sketch of this sampling
loop is given at the end of this subsection.
Figure 7.1.4: RNN model for sequence generation. Here, the input to the time step t is the output
from all of the previous time steps.
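A minimal sketch of this sampling loop follows, assuming a hypothetical rnn_step function that stands in for a trained model and returns the softmax distribution together with the next hidden state; the toy vocabulary and hidden size are assumptions.

import numpy as np

def sample_sequence(rnn_step, vocab, hidden_size=128, max_steps=50):
    """Sample a novel sequence from an RNN language model.

    rnn_step is a stand-in for a trained model: it maps (x, a_prev)
    to (y_hat, a), where y_hat is the softmax distribution over vocab.
    """
    x = np.zeros(len(vocab))       # first input is the zero vector
    a = np.zeros(hidden_size)
    words = []
    for _ in range(max_steps):
        y_hat, a = rnn_step(x, a)
        # Sample rather than take the argmax, to avoid an infinite
        # loop of the same few words.
        idx = np.random.choice(len(vocab), p=y_hat)
        words.append(vocab[idx])
        if vocab[idx] == "<EOS>":  # the sentence is over
            break
        x = np.zeros(len(vocab))   # feed the sampled word back in,
        x[idx] = 1.0               # one-hot encoded
    return words

# Toy stand-in model so the sketch runs: a uniform next-word distribution.
toy_vocab = ["well", "hello", "there", "<EOS>"]
uniform = lambda x, a: (np.ones(len(toy_vocab)) / len(toy_vocab), a)
print(sample_sequence(uniform, toy_vocab))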
Up until this point, we have been focused on using a word vocabulary list; however, it is
also common to use a character vocabulary list, such as all of the ASCII characters. An
advantage of character level models is that we will never have to use an unknown token. One
of the drawbacks with the character level approach is that it is worse at capturing long-range
dependencies, such as opening and closing quotes, along with being more computationally
expensive.
7.1.5 Vanishing gradients with RNNs
When we are working wit h RNNs, it is not unrealistic to be working with hundreds or
thousands of time steps in a given model. However, recall the pr obl em with deep neural
networks that had a lot of layers, where we continued to multiply by smaller and smaller
numbers as we backpropagate further. We can think of our the number of layers we have
with NNs similar to the number of time steps with RNNs. With our current RNN approach,
it will much harder to r ep re se nt long-term dependencies in our data, such as opening and
closing quotes.
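A toy numerical sketch of this effect, not part of any training algorithm: repeatedly multiplying a gradient by the same small recurrent weight matrix, once per time step, collapses its norm. The matrix size and the 0.1 scale here are arbitrary assumptions.

import numpy as np

# Backpropagating through T time steps multiplies the gradient by
# (roughly) the recurrent weight matrix T times. With small weights,
# the gradient norm vanishes long before reaching early time steps.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1  # small recurrent weights (assumption)
grad = np.ones(4)
for t in range(100):                   # 100 time steps of backpropagation
    grad = W.T @ grad
print(np.linalg.norm(grad))            # many orders of magnitude below 1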
7.1.6 GRU: Gated recurrent unit
Currently, we have been using RNNs that have a hidden unit in the following form

a^{<t>} = \tanh\left( W_a \left[ a^{<t-1>}, x^{<t>} \right] + b_a \right).    (7.1.21)
Figure 7.1.5: Our current approach to RNN calculations, where the gray box represents a hidden
black box calculation.
We can picture this as a black box calculation at each time step t, as shown in Figure
7.1.5. However, our model is not adept at capturing long-term dependencies in the data.
Suppose we are working with the following sentence
x : The cat, which already ate, was full. (7.1.22)
If we slightly changed this sentence to have the word cat as cats, then we would also need
to update was to were. With a GRU, a^{<t>} will be treated like a memory cell. We will then
set

\tilde{a}^{<t>} = \tanh\left( W_a \left[ a^{<t-1>}, x^{<t>} \right] + b_a \right)    (7.1.23)

as a potential candidate for a^{<t>}. Next, we will create an update gate

\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, x^{<t>} \right] + b_u \right),    (7.1.24)
where the majority of values in \Gamma_u will be 0 or 1, based on the functionality of the sigmoid
function. The update gate ultimately determines if we will update a^{<t>}. For example,
suppose the model learns that singular words are denoted with a^{<t>} = 1. Then, in our
working sentence we may have something along the lines of:

t         1    2    3      4        5    6    7
x         The  cat  which  already  ate  was  full
a^{<t>}        1    1      1        1    1
Notice that the update gate continues to denote that our sentence is singular at further time
steps t. At some future time t, if we run into a plural phrase, then it is the job of \Gamma_u to
update a^{<t>}. Specifically, our update will be

a^{<t>} = \Gamma_u \odot \tilde{a}^{<t>} + (1 - \Gamma_u) \odot a^{<t-1>},    (7.1.25)
where \odot is an element-wise multiplication between the vectors. Notice here that when \Gamma_u = 1,
we will update a^{<t>} to be \tilde{a}^{<t>}, such as when the time step was 2 in the above example.
When \Gamma_u = 0, we keep a^{<t>} at the same value, such as in the time steps [3, 6] above. An
important note is that the values \tilde{a}^{<t>}, a^{<t>}, and \Gamma_u are all vectors that can learn many
different types of long-term dependencies. For example, we can learn how to open and close
brackets and quotes, and use singular or plural phrases when appropriate. The computation
graph for GRUs is shown in Figure 7.1.6.
Figure 7.1.6: Simplified GRU computational graph at a time step t.
There is one other part to the GRU that we have currently left out, and that is to determine
how relevant the last memory cell a^{<t-1>} is for the calculation of the next, potential,
memory cell \tilde{a}^{<t>}. Here, we add the relevance gate \Gamma_r in as follows:

\tilde{a}^{<t>} = \tanh\left( W_a \left[ \Gamma_r \odot a^{<t-1>}, x^{<t>} \right] + b_a \right)    (7.1.26)
\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, x^{<t>} \right] + b_u \right)    (7.1.27)
\Gamma_r = \sigma\left( W_r \left[ a^{<t-1>}, x^{<t>} \right] + b_r \right)    (7.1.28)
a^{<t>} = \Gamma_u \odot \tilde{a}^{<t>} + (1 - \Gamma_u) \odot a^{<t-1>}.    (7.1.29)
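Below is a NumPy sketch of one full GRU step implementing equations (7.1.26)-(7.1.29). The function and parameter names are our own, and the randomly initialized (untrained) parameters in the usage snippet are purely illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_prev, x, Wa, ba, Wu, bu, Wr, br):
    """One GRU time step following equations (7.1.26)-(7.1.29).

    [u, v] in the text denotes vertical concatenation, and * below is
    the element-wise product written as \odot in the equations.
    """
    za = np.concatenate([a_prev, x])
    gamma_u = sigmoid(Wu @ za + bu)  # update gate    (7.1.27)
    gamma_r = sigmoid(Wr @ za + br)  # relevance gate (7.1.28)
    a_tilde = np.tanh(Wa @ np.concatenate([gamma_r * a_prev, x]) + ba)
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev  # (7.1.29)

# Toy usage with randomly initialized (untrained) parameters.
n_a, n_x = 64, 100
rng = np.random.default_rng(0)
Wa, Wu, Wr = (rng.standard_normal((n_a, n_a + n_x)) * 0.01 for _ in range(3))
ba = bu = br = np.zeros(n_a)
a_next = gru_step(np.zeros(n_a), rng.standard_normal(n_x), Wa, ba, Wu, bu, Wr, br)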
7.1.7 LSTM: Long short-term memory
In addition to GRUs to store long-term dependencies in sequential data, LSTMs are also
commonly used. Figure 7.1.7 shows an example of an LSTM model. We can think about
LSTMs as an extension of GRUs with 2 hidden units h^1 and h^2. We use 3 separate gates,
a forget gate \Gamma_f, an update gate \Gamma_u, and an output gate \Gamma_o, that are each defined as

\Gamma_u = \sigma\left( W_u \cdot [x_t, h_{t-1}] \right)    (7.1.30)
\Gamma_f = \sigma\left( W_f \cdot [x_t, h_{t-1}] \right)    (7.1.31)
\Gamma_o = \sigma\left( W_o \cdot [x_t, h_{t-1}] \right).    (7.1.32)
With LSTMs, we remove the relevance gate \Gamma_r because it often does not improve the perfor-
mance of a model in practice. With our 1st hidden unit h^1, we will set its replacement candidate
based on the other hidden unit input h^2_{t-1} and the new input x_t, which we will define as

\tilde{h}^1_t = \tanh\left( W_c \cdot [x_t, h^2_{t-1}] \right).    (7.1.33)
Figure 7.1.7: LSTM model. (a) Single LSTM gate; (b) stacking multiple LSTM gates in a row.

Now, instead of using a single update gate that determines both how much of the previous
layer we want to forget and how much of the replacement candidate we want to remember,
we use 2 separate gates to define these quantities. In particular, we define our 1st hidden
unit to be
h^1_t = \Gamma_u \odot \tilde{h}^1_t + \Gamma_f \odot h^1_{t-1}.    (7.1.34)
Our 2nd hidden unit is dependent on h^1_t and passed into the softmax function to produce
the output y_t, which we define as

h^2_t = \Gamma_o \odot \tanh\left( h^1_t \right).    (7.1.35)
Here, the output gate determines the relevance of each hidden unit for the output. The blue
path that we show in Figure 7.1.7b connects the 1st hidden unit between multiple LSTM
gates. The path could give an intuitive feel for why LSTMs give long-range dependency
connections, which is primarily due to having multiple update gates control what we store
in the internal memory, making it harder to forget everything completely.
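A hedged NumPy sketch of one LSTM step following equations (7.1.30)-(7.1.35): the function name is ours, biases are omitted to match the equations, and we take the h_{t-1} in the gate equations to be the output hidden unit h^2_{t-1}, which is an interpretation rather than something the text states explicitly.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h1_prev, h2_prev, x, Wu, Wf, Wo, Wc):
    """One LSTM time step: h1 plays the role of the internal memory cell
    and h2 the output hidden state that feeds the softmax."""
    concat = np.concatenate([x, h2_prev])
    gamma_u = sigmoid(Wu @ concat)   # update gate (7.1.30)
    gamma_f = sigmoid(Wf @ concat)   # forget gate (7.1.31)
    gamma_o = sigmoid(Wo @ concat)   # output gate (7.1.32)
    h1_tilde = np.tanh(Wc @ concat)  # replacement candidate (7.1.33)
    h1 = gamma_u * h1_tilde + gamma_f * h1_prev  # (7.1.34)
    h2 = gamma_o * np.tanh(h1)                   # (7.1.35)
    return h1, h2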
7.1.8 BRNNs: Bidirectional RNNs
Recall our 2 sentences we used earlier

He said, "Teddy bears are on sale!"
He said, "Teddy Roosevelt was a great President!",    (7.1.36)
where it is not enough to predict if the word Teddy is a name based on the information before
the word occurs. In this case, we can use a bidirectional RNN (BRNN) to make predictions
with the words before and after Teddy. Figure 7.1.8 shows the diagram for a BRNN with a
fixed size input length of 4. Our BRNN is broken up into 2 separate forward RNNs. Here,
we have a forward RNN with activations in the form \overrightarrow{a} that takes into account all of the
previous time step inputs. We also have another RNN with activations in the form of \overleftarrow{a}
that looks at all the future time steps in reverse order. For example, if our sentence is

He said, "Teddy Roosevelt!",    (7.1.37)
then at t = 3 we can form \overrightarrow{a}^{<3>} with {He, said, Teddy} and we can form \overleftarrow{a}^{<3>} with
{Roosevelt}. We can then make an output prediction, such as the meaning of the word
Teddy, in the form of

\hat{y}^{<t>} = g\left( W \cdot \left[ \overrightarrow{a}^{<3>}, \overleftarrow{a}^{<3>} \right] \right).    (7.1.38)
One of the downsides to using a BRNN is that we are required to supply the entire input
at the start.
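A minimal sketch of combining the two directions as in equation (7.1.38); the text leaves the output activation g open, so the tanh default here, the function name, and the toy sizes are all assumptions.

import numpy as np

def brnn_output(a_fwd, a_bwd, W, g=np.tanh):
    """Combine the forward and backward activations at one time step,
    as in equation (7.1.38)."""
    return g(W @ np.concatenate([a_fwd, a_bwd]))

# Toy usage with random (untrained) activations and weights.
n_a, n_y = 32, 10
rng = np.random.default_rng(0)
y_hat = brnn_output(rng.standard_normal(n_a), rng.standard_normal(n_a),
                    rng.standard_normal((n_y, 2 * n_a)))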
Figure 7.1.8: BRNN with an input of length of 4. (a) Forward and backward connections;
(b) forward RNN connection with the sequence in order; (c) forward RNN connection, where the
sequence is in reverse order.
7.1.9 Deep RNNs
Figure 7.1.9 shows the 2 primary types of deep RNNs, where we stack multiple activations
at each time step. The 1st type of network, in Figure 7.1.9a, shows 3 stacked activations at
each time step that are each connected to the next time step. Here, each layer has its own
weights, which are shared at each time step. If, for example, we want to calculate the hidden
activation in layer \ell on the tth time step, then we can use

a^{[\ell]<t>} = g\left( W^{[\ell]} \cdot \left[ a^{[\ell]<t-1>}, a^{[\ell-1]<t>} \right] \right),    (7.1.39)

which takes the left and below activations as the input. The 2nd type of network, in Figure
7.1.9b, shows the first few layers connecting to future time steps, while the deeper layers are
only connected to a single time step. Here, the deeper layers are not connected to future
time steps because connecting them would be computationally expensive.
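A one-function NumPy sketch of equation (7.1.39); the bias term and the tanh default for g are assumptions, since the equation omits both.

import numpy as np

def deep_rnn_step(a_left, a_below, W, b, g=np.tanh):
    """Activation of one layer at one time step, as in equation (7.1.39):
    a_left is this layer's activation from the previous time step, and
    a_below is the current activation from the layer beneath."""
    return g(W @ np.concatenate([a_left, a_below]) + b)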
Figure 7.1.9: Deep RNNs. (a) Stacking together multiple layers of activations that are each
connected to the next and previous time steps; (b) deep RNN where the first few layers have
activations that connect to future time steps, while the deeper layers only carry weight in a single
time step.

7.2 Word embedding

7.3 Word representation

Up until this point, we have represented words with a dictionary, where we specify a specific
word using a one-hot encoding vector. One of the problems with this representation is that
the words do not have a connection with each other. For example, suppose we input

I want a glass of orange ____    (7.3.1)

into our network, where it is the goal of the network to predict the blank. With a well
trained network, the likely prediction is the word juice. But suppose we now have the input

I want a glass of apple ____    (7.3.2)
and our model does not know about apple juice. In this situation, if our model knows that
there is some connection between orange and apple, then it would still likely predict juice.
One approach to learning connections would be to discover feature representations for each
word. Table 7.1 shows an example of a potential encoding that our model may find. The
table shows the words king, queen, apple, and orange, along with their responses to a set of
features. For example, if we have a royal feature, then king and queen might have a high
response; alternatively, if we have a food feature, then apple and orange might have a high
response. Here, if we have 300 features, then we can represent each word with a vector of
length 300. Now, due to the similarities between features, our network is more likely to pick
up similar words.
         King   Queen   Apple   Orange
Royal    0.93   0.95    -0.01    0.00
Age      0.70   0.71     0.03   -0.02
...       ...    ...      ...     ...
Food     0.01  -0.03     0.95    0.97

Table 7.1: Sample words in a dictionary (top) with their responses to each feature (left), each
ranging between [-1, 1]
7.3.1 Using word embeddings
To learn the features for a set of words, it is common to use transfer learning. There are
currently many freely available word embeddings that were learned by training on text
corpora of 1B to 100B words. Once we have word embeddings, we can transfer them to a
smaller task and fine-tune the parameters if necessary.
7.3.2 Properties of word embeddings
Once we have the word embeddings, one neat thing that we can find is the difference between
2 words. Suppose, for example, that one of our features gives the value 1 for words describing
objects with an orange color, -1 for words describing a colored object that is not orange,
and 0 for objects that are not defined by a color. Table 7.2 shows how we can spot the
feature differences between the apple and orange vectors. In particular, if we subtract the 2
vectors, then the features that are largest in magnitude correspond to the feature differences
between words, while the feature differences near 0 refer to features being the same.
                  Apple   Orange   Apple - Orange (approximate)
Orange colored    -0.96    0.98    -2
Royal             -0.01    0.00     0
Age                0.03   -0.02     0
Food               0.95    0.97     0

Table 7.2: To compare two words, we can take the difference between their feature vectors: the
features with the largest difference in magnitude are where the words differ most, while feature
differences near 0 mark features on which the words are quite similar.
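A small sketch of this comparison using the values from Table 7.2; the vectors are just the four illustrative features above, not real learned embeddings.

import numpy as np

# Feature order: [orange colored, royal, age, food], as in Table 7.2.
apple  = np.array([-0.96, -0.01,  0.03, 0.95])
orange = np.array([ 0.98,  0.00, -0.02, 0.97])

diff = apple - orange
print(diff)                     # ~[-1.94, -0.01, 0.05, -0.02]
# The entry largest in magnitude marks where the words differ most:
print(np.argmax(np.abs(diff)))  # 0, the "orange colored" feature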