Deep Learning
Stanford CS230
Matt Deitke
Last updated: February 15, 2020
Acknowledgements
This set of notes follows the lectures from Stanford’s graduate-level course CS230: Deep
Learning. The course was taught by Andrew Ng and Kian Katanforoosh. The course
website is cs230.stanford.edu.
Contents

1 Introduction
  1.1 Deep learning
    1.1.1 What is a neural network?
    1.1.2 Supervised learning with neural networks
  1.2 Logistic regression as a neural network
    1.2.1 Notation
    1.2.2 Binary classification
    1.2.3 Logistic Regression
    1.2.4 Logistic regression cost function
    1.2.5 Gradient descent
    1.2.6 Computation graphs
    1.2.7 Derivatives with a computational graph
    1.2.8 Logistic regression gradient descent
    1.2.9 Gradient descent on m examples
  1.3 Python and vectorization
    1.3.1 Vectorization
    1.3.2 Vectorizing logistic regression
    1.3.3 Vectorizing logistic regression's gradient computation
    1.3.4 Broadcasting example
    1.3.5 A note on NumPy vectors
2 Neural networks
  2.1 Deep learning intuition
    2.1.1 Day'n'night classification
    2.1.2 Face verification
    2.1.3 Face recognition
    2.1.4 Art generation (neural style transfer)
    2.1.5 Trigger word detection
  2.2 Shallow neural networks
    2.2.1 Neural networks overview
    2.2.2 Neural network representations
    2.2.3 Computing a neural network's output
    2.2.4 Vectorizing across multiple examples
    2.2.5 Activation functions
    2.2.6 Why use non-linear activation functions?
    2.2.7 Derivatives of activation functions
    2.2.8 Gradient descent for neural networks
    2.2.9 Random Initialization
  2.3 Deep neural networks
    2.3.1 Forward propagation in a deep network
    2.3.2 Getting the matrix dimensions right
    2.3.3 Why deep representations?
    2.3.4 Forward and backward propagation
    2.3.5 Parameters vs hyperparameters
    2.3.6 Connection to the brain
3 Optimizing and structuring a neural network
  3.1 Full-cycle deep learning projects
  3.2 Setting up a machine learning application
    3.2.1 Training, development, and testing sets
    3.2.2 Bias and variance
    3.2.3 Combatting bias and variance
  3.3 Regularizing a neural network
    3.3.1 Regularization
    3.3.2 Why regularization reduces overfitting
    3.3.3 Dropout regularization
    3.3.4 Other regularization methods
  3.4 Setting up an optimization problem
    3.4.1 Normalizing inputs
    3.4.2 Data vanishing and exploding gradients
    3.4.3 Weight initialization for deep neural networks
    3.4.4 Gradient numerical approximations
    3.4.5 Gradient checking
  3.5 Optimization algorithms
    3.5.1 Mini-batch gradient descent
    3.5.2 Exponentially weighted averages
    3.5.3 Bias correction for exponentially weighted averages
    3.5.4 Gradient descent with momentum
    3.5.5 RMSprop
    3.5.6 Adam optimization algorithm
    3.5.7 Learning rate decay
    3.5.8 The problem with local optima
  3.6 Hyperparameter tuning
    3.6.1 Tuning process
    3.6.2 Using an appropriate scale to pick hyperparameters
    3.6.3 Hyperparameter tuning in practice: pandas vs caviar
  3.7 Batch normalization
    3.7.1 Normalizing activations in a network
    3.7.2 Fitting batch norm in a neural network
    3.7.3 Batch norm at test time
  3.8 Multi-class classification
    3.8.1 Softmax regression
    3.8.2 Training a softmax classifier
4 Applied deep learning
  4.1 Orthogonalization
  4.2 Setting up goals
    4.2.1 Single number evaluation metric
    4.2.2 Train/dev/test distributions
  4.3 Comparing to human-level performance
    4.3.1 Avoidable bias
    4.3.2 Understanding human-level performance
    4.3.3 Surpassing human-level performance
  4.4 Error analysis
    4.4.1 Cleaning up incorrectly labeled data
  4.5 Mismatched training and dev set
    4.5.1 Bias and variance with mismatched data distributions
    4.5.2 Addressing data mismatch
  4.6 Learning from multiple tasks
    4.6.1 Transfer learning
    4.6.2 Multi-task learning
5 Convolutional neural networks
  5.1 Edge detection
  5.2 Padding
  5.3 Strided convolutions
  5.4 Cross-correlation vs convolution
  5.5 Convolutions over volume
  5.6 One layer convolution network
  5.7 Pooling layers
  5.8 Why we use convolutions
  5.9 Classic networks
    5.9.1 LeNet-5
    5.9.2 AlexNet
    5.9.3 VGG-16
    5.9.4 ResNet
    5.9.5 1 × 1 convolution
    5.9.6 Inception network
  5.10 Competitions and benchmarks
6 Detection algorithms
  6.1 Object localization
  6.2 Landmark detection
  6.3 Object detection
  6.4 Sliding windows with convolution
  6.5 Bounding box predictions
  6.6 Intersection over Union (IoU)
  6.7 Non-max suppression
  6.8 Anchor boxes
  6.9 YOLO algorithm
7 Sequence models
  7.1 Recurrent neural networks
    7.1.1 Backpropagation through time
    7.1.2 Variational RNNs
    7.1.3 Language modeling
    7.1.4 Sampling novel sequences
    7.1.5 Vanishing gradients with RNNs
    7.1.6 GRU: Gated recurrent unit
    7.1.7 LSTM: Long short-term memory
    7.1.8 BRNNs: Bidirectional RNNs
    7.1.9 Deep RNNs
  7.2 Word embedding
  7.3 Word representation
    7.3.1 Using word embeddings
    7.3.2 Properties of word embeddings
Chapter 1
Introduction
Deep learning, and the algorithms behind it, have been around for decades. However, as we started to add data to neural networks, they started to perform much better than traditional machine learning algorithms. With advances in GPU computing and the amount of data we have available, training larger neural networks is easier than ever before. With more data, larger neural networks have been shown to outperform all other machine learning algorithms.
Figure 1.0.1: Performance as a function of the amount of data for large, medium, and small neural networks and traditional ML. With more data, neural networks have performed better than traditional machine learning.
Artificial intelligence can be broken down into a few subfields, including deep learning (DL), machine learning (ML), probabilistic graphical models (PGM), planning agents, search algorithms, knowledge representation (KR), and game theory. The only subfields whose performance has dramatically improved are deep learning and machine learning. We can see the rise of artificial intelligence in many industries today; however, there is still a long way to go. Merely using neural networks does not make a company an artificial intelligence company. The best AI companies are good at strategic data acquisition, putting data together, spotting automation opportunities, and writing job descriptions around technologies at the forefront of the field.
Figure 1.0.2: Performance over time for (a) DL and ML, (b) PGM, (c) planning agents, (d) search algorithms, (e) knowledge representation, and (f) game theory. The performance of deep learning and machine learning algorithms has been exploding in the last few years, in comparison to other branches of artificial intelligence.
1.1 Deep learning
1.1.1 What is a neural network?
The aim of a neural network is to learn representations that best predict the output y, given a set of features as the input x.
Figure 1.1.1: Simplest possible neural network: a single "neuron" that takes the size of a home (x) and predicts its price (y).
Figure 1.1.1 represents the simplest possible neural network. The neuron is the part of the neural network that tries to learn a function that maps x → y. Here, we only have one neuron, which is why it is the simplest case. We can make more complicated neural networks by stacking neurons. Consider the case where we not only have the size but also the number of bedrooms, the zip code, and wealth as well. We can represent the extra inputs inside of a neural network as shown in Figure 1.1.2.
Neural networks work well because we only need to feed the algorithm supervised data, without specifying what all the intermediate values may be, as we did in Figure 1.1.2 with family size, walkability, and schooling. Instead of connecting together only a few inputs, we can connect all of the inputs together. Layers with all inputs connected to the output are known as fully-connected layers. The general structure of a neural network with fully-connected layers can be seen in Figure 1.1.3.
Figure 1.1.2: In a neural network with more neurons, the intermediate connections between inputs may represent their own values (here, size, bedrooms, zip code, and wealth feed family size, walkability, and schooling, which feed the price y).
Figure 1.1.3: Neural network with fully-connected layers: an input layer (x_1, ..., x_4), a hidden layer, and an output layer producing y.
1.1.2 Supervised learning with neural networks
Almost all of the hype around machine learning has been centered around supervised learning. A supervised learning algorithm takes in a set of features and outputs a number or a set of numbers. Some examples can be seen in Table 1.1.
Input (x)            Output (y)                 Application
Home features        Price                      Real estate
Ad, user info        Click on ad? (0/1)         Online advertising
Image                Object (1, ..., 1000)      Photo tagging
Audio                Text transcript            Speech recognition
English              Chinese                    Machine translation
Image, radar info    Position of other cars     Autonomous driving

Table 1.1: Examples of supervised learning
As we have already seen, we can change the connections between layers to form different types of neural networks. Indeed, changing the structure of layers has led to many of the breakthroughs in deep learning, and we have formulated tasks that benefit from not using a neural network with fully-connected layers. For example, while for real estate and online advertising we may use a neural network with fully-connected layers, photo tagging may use convolutional neural networks (CNNs), sequenced data may use recurrent neural networks (RNNs), and for something more complicated like autonomous driving, we may use some custom hybrid neural network. Figure 1.1.4 shows the basic structure of different neural networks, although we will go more in-depth into how each works in this text.
Figure 1.1.4: Different types of neural network architectures
Supervised learning can use both structured and unstructured data. Structured data may come in the form of database entries, while unstructured data may be audio, image clips, or text. Historically, unstructured data tasks like image recognition have often been harder problems for computers to solve, while humans have little difficulty solving these types of problems. Neural networks have given computers a lot of help when interpreting unstructured data.
Figure 1.1.5: Structured vs unstructured data
1.2 Logistic regression as a neural network
1.2.1 Notation
Each training set will be composed of m training examples. We may denote m_train to be the number of training examples in the training set, and m_test to be the number of examples in the testing set. The ith input instance will be defined as (x^(i), y^(i)), where x^(i) ∈ ℝ^{n_x} and each y^(i) ∈ {0, 1}. To make notation more concise, we will use X to denote the matrix formed by

X = [ x^(1)  x^(2)  . . .  x^(m) ],   (1.2.1)

where X ∈ ℝ^{n_x × m}. Similarly, we define Y as

Y = [ y^(1)  y^(2)  . . .  y^(m) ],   (1.2.2)

where Y ∈ ℝ^{1 × m}.
1.2.2 Binary classification
A binary classifier takes some input and classifies that input into one of two categories, typically yes (1) or no (0). For example, an image of a cat may be classified as a cat or non-cat. We will use y to denote the output, as either a 0 or 1.
Figure 1.2.1: Colored images can be represented with the addition of red, green, and blue channels.
Each color can be represented using the colors red, green, and blue. Each pixel in a colored image can be represented as a tuple storing the amount of red, green, and blue inside that pixel. For images, we say that the red, green, and blue channels correspond to the amount of that color in each pixel, as shown in Figure 1.2.1. For an image of size n × m, we will define our input as a single column vector x that stacks the red values on the blue values on the green values:

x = [ red values ; blue values ; green values ].   (1.2.3)

We will use n_x to denote the input size of our vector. In this case, our input size is

n_x = 3nm,   (1.2.4)

since each channel has size (n × m) and there are 3 channels.
1.2.3 Logistic Regression
Given an input feature x ∈ ℝ^{n_x}, such as the pixels of an image, we would like to find an algorithm that makes a prediction ŷ, where

ŷ = P(y = 1 | x).   (1.2.5)

Inside of our algorithm we will define weights w ∈ ℝ^{n_x} and a bias term b ∈ ℝ, where our output is

ŷ = σ(w^T x + b).   (1.2.6)

The sigmoid function σ is used to transform the input so that 0 ≤ ŷ ≤ 1, where

σ(z) = 1 / (1 + e^{−z}).   (1.2.7)

We will denote the inner part of the sigmoid function as

z = w^T x + b.   (1.2.8)
Figure 1.2.2: Sigmoid function
1.2.4 Logistic regression cost function
Logistic regression is a supervised learning algorithm where we train on the training examples {(x^(1), y^(1)), . . . , (x^(m), y^(m))} and would like ŷ^(i) ≈ y^(i). While many machine learning algorithms use a squared error cost function, logistic regression does not work well when using squared error because many local optima arise. Instead, we will use the loss function

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ)).   (1.2.9)

Intuitively, consider the following two scenarios. When y = 1, we will end up with

L(ŷ, 1) = −log ŷ,   (1.2.10)

which is shown in Figure 1.2.3a. The second scenario occurs when y = 0. In that case, we will have

L(ŷ, 0) = −log(1 − ŷ),   (1.2.11)

which is shown in Figure 1.2.3b. When small values of ŷ are input, we will have a small loss, and when ŷ approaches 1, our estimates will be way off, producing a large loss.

The loss function is a measure of how well the classifier is doing on a single example. The cost function is defined as the average loss

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)).   (1.2.12)
Figure 1.2.3: A binary classifier's loss as a function of ŷ: (a) −log ŷ when y = 1, and (b) −log(1 − ŷ) when y = 0.
1.2.5 Gradient descent
For many logistic regression problems, our cost function will look bowl-shaped, which is known as a convex function, shown in Figure 1.2.4. With convex functions, there is a clear minimum of the function, and we would like to find the values of w and b that will put us at the minimum.
Figure 1.2.4: An example of what our cost function might look like with the inputs w and b.
For logistic regression, we generally start by initializing our weights to be zero. Gradient descent steps in the direction of steepest descent. If we start at some point that is not the minimum, we will move in the direction of the minimum, until we converge near the minimum. In one dimension, we can define the gradient descent algorithm as

w = w − α ∂J(w, b)/∂w   (1.2.13)
b = b − α ∂J(w, b)/∂b,   (1.2.14)

where we will first calculate both of the partial derivative terms and then store the results in the variables on the left-hand side.
1.2.6 Computation graphs
Consider the function

y(a, b, c) = 3(a + bc).   (1.2.15)

Computing the output requires three steps: first compute bc, then add that result to a, and then multiply the sum by 3. We can represent this function as the computational graph shown in Figure 1.2.5. Each forward propagation step in the computation graph produces a number that gets closer to the result.

Figure 1.2.5: Computation graph for 3(a + bc)
1.2.7 Derivatives with a computational graph
Figure 1.2.6: Computation graph for 3(a + bc) with values filled in: a = 5, b = 3, c = 2, u = bc = 6, v = a + u = 11, and J = 3v = 33.

Consider the updated computational graph in Figure 1.2.6. Now, suppose that we want to find how J changes with respect to v, that is, ∂J/∂v at v = 11. Since we know that J = 3v, we can take the derivative of the function, giving us

∂J/∂v = 3,   (1.2.16)

which in this case does not depend on the input at all. Now if we wish to find ∂J/∂a, we can use the chain rule to break it apart:

∂J/∂a = ∂J/∂v · ∂v/∂a.   (1.2.17)

Since we already calculated the value of ∂J/∂v = 3, we only need to find ∂v/∂a. Since v = a + u, differentiating both sides with respect to a will give us

∂v/∂a = 1.   (1.2.18)

Now, using Equation 1.2.17,

∂J/∂a = 3.   (1.2.19)

To find ∂J/∂u, we would use a similar process to how we found ∂J/∂a, and we will again end up with ∂J/∂u = 3. To find ∂J/∂b, we need

∂J/∂b = ∂J/∂v · ∂v/∂u · ∂u/∂b   (1.2.20)
      = 3 · ∂u/∂b.   (1.2.21)

Since u = bc, differentiating both sides with respect to b gives us ∂u/∂b = c, where c in this case is 2, so

∂J/∂b = 6.   (1.2.22)

A similar process would be carried out to find ∂J/∂c = 9.
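The forward and backward passes above fit in a few lines of Python. The following is a minimal sketch (not course-provided code); the variable names mirror Figure 1.2.6, and the finite-difference check at the end is one way to verify the chain-rule results.

    # Forward pass through the graph J = 3(a + b*c), mirroring Figure 1.2.6.
    a, b, c = 5.0, 3.0, 2.0
    u = b * c          # u = 6
    v = a + u          # v = 11
    J = 3 * v          # J = 33

    # Backward pass: apply the chain rule from the output back to the inputs.
    dJ_dv = 3.0                # Equation 1.2.16
    dJ_da = dJ_dv * 1.0        # dv/da = 1, so dJ/da = 3 (Equation 1.2.19)
    dJ_du = dJ_dv * 1.0        # dv/du = 1, so dJ/du = 3
    dJ_db = dJ_du * c          # du/db = c = 2, so dJ/db = 6 (Equation 1.2.22)
    dJ_dc = dJ_du * b          # du/dc = b = 3, so dJ/dc = 9

    # Numerical check of dJ/db with a small finite difference.
    eps = 1e-6
    J_eps = 3 * (a + (b + eps) * c)
    assert abs((J_eps - J) / eps - dJ_db) < 1e-3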
1.2.8 Logistic regression gradient descent
With our logistic regression computational graph defined in Figure 1.2.7, we can again work backward, calculating the changes in L with respect to each of the variables. Differentiating both sides of our loss function with respect to a gives us

∂L(a, y)/∂a = −y/a + (1 − y)/(1 − a).   (1.2.23)

We also find that

∂a/∂z = a(1 − a),   (1.2.24)

so

∂L/∂z = a − y.   (1.2.25)
Figure 1.2.7: Computation graph for logistic regression: inputs x_1, w_1, x_2, w_2, and b feed z = w_1 x_1 + w_2 x_2 + b, then a = σ(z), then L(a, y).
Now, backpropagating toward the inputs, we have

∂L/∂x_1 = (a − y) · w_1   (1.2.26)
∂L/∂x_2 = (a − y) · w_2   (1.2.27)
∂L/∂w_1 = (a − y) · x_1   (1.2.28)
∂L/∂w_2 = (a − y) · x_2   (1.2.29)
∂L/∂b = a − y.   (1.2.30)

In general, we can show

∂L/∂x_n = (a − y) · w_n   (1.2.31)
∂L/∂w_n = (a − y) · x_n.   (1.2.32)
1.2.9 Gradient descent on m examples

Recall that when we have m input instances to train on,

J(w, b) = (1/m) ∑_{i=1}^{m} L(a^(i), y^(i)),   (1.2.33)

where a^(i) = ŷ^(i) = σ(w^T x^(i) + b). If we differentiate J with respect to w_1, we will have

∂J(w, b)/∂w_1 = (1/m) ∑_{i=1}^{m} ∂L(a^(i), y^(i))/∂w_1.   (1.2.34)

In Equation 1.2.28, we calculated ∂L/∂w_1 to be (a − y) · x_1, so

∂J(w, b)/∂w_1 = (1/m) ∑_{i=1}^{m} (a^(i) − y^(i)) · x_1^(i).   (1.2.35)
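Putting Equations 1.2.25 through 1.2.35 together, one deliberately explicit (and slow) implementation of a single gradient descent step might look like the following sketch; the function and array names here are illustrative, not from the course.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gradient_step(w, b, X, y, alpha):
        """One unvectorized step of logistic regression gradient descent.
        X has shape (n_x, m); y has shape (m,); w has shape (n_x,)."""
        n_x, m = X.shape
        dw, db, J = np.zeros(n_x), 0.0, 0.0
        for i in range(m):                      # loop over the m examples
            a = sigmoid(w @ X[:, i] + b)        # a^(i) = sigma(w^T x^(i) + b)
            J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
            dw += (a - y[i]) * X[:, i]          # dL/dw = (a - y) * x  (Eq. 1.2.32)
            db += a - y[i]                      # dL/db = a - y        (Eq. 1.2.30)
        dw, db, J = dw / m, db / m, J / m       # average over the m examples
        return w - alpha * dw, b - alpha * db, J

The explicit loop over i is exactly what Section 1.3 will remove through vectorization.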
1.3 Python and vectorization
1.3.1 Vectorization
The aim of vectorization is to remove explicit for-loops in code. With deep learning, datasets can be massive, and it is vital to understand how vectorization works in order to improve algorithmic efficiency. Recall that we set z = w^T x + b. One implementation of this calculation would be the loop

z = ∑_{i=1}^{n_x} w_i x_i + b,   (1.3.1)

which would end up being very slow. In contrast, we can use NumPy to speed up the calculation, where the command will be

z = np.dot(w, x) + b.   (1.3.2)

Using built-in NumPy functions enables parallelism much better than writing out explicit for-loops. When using neural networks, whenever possible, avoid explicit for-loops. A few more examples of calculations that can be vectorized include the matrix-vector product Ax, which is also written as

Ax = np.dot(A, x).   (1.3.3)

To apply some operation, such as exponentiating, to each item in a vector v, write

v = np.exp(v).   (1.3.4)
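As a quick illustration of the speed difference (a sketch, not from the notes), one can time an explicit Python loop against np.dot:

    import time
    import numpy as np

    n = 1_000_000
    w, x = np.random.randn(n), np.random.randn(n)

    t0 = time.time()
    z_loop = sum(w[i] * x[i] for i in range(n))   # explicit for-loop
    t1 = time.time()
    z_vec = np.dot(w, x)                          # vectorized dot product
    t2 = time.time()

    print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.4f}s")
    assert abs(z_loop - z_vec) < 1e-6 * n         # same result, far faster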
1.3.2 Vectorizing logistic regression
For logistic regression, we need to find a^(i) = σ(z^(i)), m times, where z^(i) = w^T x^(i) + b. Instead of running explicit for-loops for each instance, we would like a way to vectorize the process of finding every value of a at once. Recall that we set

X = [ x^(1)  x^(2)  . . .  x^(m) ],   (1.3.5)

where X ∈ ℝ^{n_x × m}. We also know that there is a weight associated with each input feature, so w^T ∈ ℝ^{1 × n_x}. Carrying out the product gives

w^T X = [ w^T x^(1)  w^T x^(2)  · · ·  w^T x^(m) ].   (1.3.6)

Now, broadcasting b to ℝ^{1 × m}, taking the sum of w^T X and b gives us

w^T X + b = [ w^T x^(1) + b  w^T x^(2) + b  · · ·  w^T x^(m) + b ].   (1.3.7)

In order to implement this in NumPy, call

Z = np.dot(w.T, X) + b,   (1.3.8)

where b ∈ ℝ and Python will expand it to be a 1 × m matrix. The concept of expanding a number into a matrix in Python is known as broadcasting.

Similar to how X stores the instances x^(i), we will define the matrix

A = [ a^(1)  a^(2)  · · ·  a^(m) ].   (1.3.9)

With Z being a 1 × m matrix, we can apply the sigmoid function to each element of Z by calling

A = 1 / (1 + np.exp(−Z)).   (1.3.10)
1.3.3 Vectorizing logistic regression’s gradient computation
Recall from earlier that the change in b, which we will denote as db, was equal to dz. We then wanted to update db by the average of all of the nudges over the training examples. Instead of initializing db = 0, looping over and adding every instance of dz^(i), and dividing the result by m, we could instead compute

db = (1/m) ∑_{i=1}^{m} dz^(i)   (1.3.11)
   = np.sum(dZ) / m,   (1.3.12)

where dZ is a row matrix formed by the dz^(i) elements. To update the change in weights, recall that dw = (1/m) ∑_{i=1}^{m} x^(i) dz^(i), which can be expressed in vectorized form as

dw = np.dot(X, dZ.T) / m.   (1.3.13)
1.3.4 Broadcasting example
Consider the matrix

A = [ 56    0    4   68
       1  102   52    8
       2  135   99    1 ]   (1.3.14)

and suppose we want to find the percentage each element contributes to the sum of its column. For example, for A_11 this is 56/(56 + 1 + 2). To calculate the sum of each column, we would type

s = A.sum(axis = 0).   (1.3.15)

Axis 0 refers to the vertical columns inside of a matrix, while axis 1 refers to the horizontal rows in a matrix. Calling the method in Equation 1.3.15 will produce a vector of 4 elements; however, we would like to convert it from ℝ^4 to ℝ^{1×4}. To make this conversion, call

s = s.reshape(1, 4).   (1.3.16)

Now we can compute A/s in order to get the percentage each element contributes to the sum of its column. Dividing the matrix A ∈ ℝ^{3×4} by the row vector s ∈ ℝ^{1×4} will divide each of the three elements in a column by the value in the corresponding column of s. Since their sizes are different, this is known as broadcasting. A few other examples of broadcasting include adding a vector and a real number,

[ 1; 2; 3; 4 ] + 100 = [ 101; 102; 103; 104 ],   (1.3.17)

and adding a matrix in ℝ^{m×n} with a row vector in ℝ^{1×n},

[ 1 2 3 ; 4 5 6 ] + [ 100 200 300 ] = [ 101 202 303 ; 104 205 306 ].   (1.3.18)

Notice, when applying an operation between a matrix in ℝ^{m×n} and a vector in ℝ^{1×n}, the vector will expand to a matrix in ℝ^{m×n}, where each row has the same elements. Similarly, when combining a matrix in ℝ^{m×n} with a column vector in ℝ^{m×1}, the vector will expand to be in ℝ^{m×n}, where each column is identical. An example of a column vector expanding is

[ 1 2 3 ; 4 5 6 ] + [ 100 ; 200 ] = [ 101 102 103 ; 204 205 206 ].   (1.3.19)
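The examples above can be reproduced directly in NumPy (a short sketch; variable names are illustrative):

    import numpy as np

    A = np.array([[56, 0, 4, 68],
                  [1, 102, 52, 8],
                  [2, 135, 99, 1]])

    s = A.sum(axis=0)            # column sums, shape (4,)   (Eq. 1.3.15)
    s = s.reshape(1, 4)          # explicit (1, 4) row vector (Eq. 1.3.16)
    percent = 100 * A / s        # (3, 4) / (1, 4): s broadcasts down the rows

    v = np.array([[1], [2], [3], [4]]) + 100           # Equation 1.3.17
    M = np.array([[1, 2, 3], [4, 5, 6]])
    row = M + np.array([[100, 200, 300]])              # Equation 1.3.18
    col = M + np.array([[100], [200]])                 # Equation 1.3.19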
1.3.5 A note on NumPy vectors

Consider the example

a = np.random.randn(5).   (1.3.20)

Calling a.shape will return the shape (5,), which is considered a rank one array: neither a row vector nor a column vector. If we call a.T, nothing will be changed in the array, and it will still have the shape (5,). Instead of using a rank one array, it is recommended to use proper matrices when dealing with neural networks. If we wanted a matrix of size ℝ^{5×1} instead of the rank one array in Equation 1.3.20, we would call

a = np.random.randn(5, 1).   (1.3.21)

One good tip when dealing with a matrix of an unknown size is to first assert that it is the size we want, by calling

assert(a.shape == (5, 1)).   (1.3.22)
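A short interactive check of the rank-one pitfall (an illustrative sketch):

    import numpy as np

    a = np.random.randn(5)         # rank one array, shape (5,)
    print(a.shape)                 # (5,)
    print(a.T.shape)               # still (5,): transposing does nothing

    b = np.random.randn(5, 1)      # an explicit column vector
    print(b.shape)                 # (5, 1)
    print(b.T.shape)               # (1, 5): now the transpose is meaningful

    a = a.reshape(5, 1)            # convert the rank one array into a column
    assert a.shape == (5, 1)       # cheap guard against shape bugs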
Chapter 2
Neural networks
2.1 Deep learning intuition
A model for deep learning is based on an architecture and parameters. The architecture is the algorithmic design we choose, like logistic regression, linear regression, or a shallow neural network. The parameters are the weights and biases of the model, which takes in an instance as input and attempts to classify it correctly based on the parameters.
Figure 2.1.1: Where the model fits in and what can be adjusted. Model = Architecture + Parameters; the model maps an input to an output, from which a loss and gradients are computed. Things that can change: the activation function, optimizer, hyperparameters, loss function, input, output, and architecture.
Consider if we wanted to use a multi-class classifier that not only predicts if an image is a cat or not a cat, but instead predicts from the image whether the image is a cat, dog, or giraffe. Recall that when using a binary classifier our weights were the column vector

w = [ w_1 ; w_2 ; . . . ; w_{n_x} ],   (2.1.1)

and our labels were single numbers, y ∈ {0, 1}. Now, let us add two more columns to the weights, which will be the weights for the dog classifier and the giraffe classifier:

w = [ w_cats  w_dogs  w_giraffes ],   (2.1.2)

where each column holds the weights for one class. Our weights are now in ℝ^{n_x × 3}.
To update our labels, we have a few options. First, let us put y ∈ {0, 1, 2}, where 0 corresponds to a cat image, 1 corresponds to a dog image, and 2 corresponds to a giraffe image. The second option is one-hot encoding, where we will have an array with each index mapping to an animal. With one-hot encoding, only one item in each labeled array will have a value of 1 (the picture is of that animal), while the rest are 0 (the picture is not of that animal). For example, if we have an image of a cat, our label would be

[ 1  0  0 ]   (cat, dog, giraffe).   (2.1.3)

While both of these options would work well in most cases, if we have a picture of both a cat and a dog in the same image, our classifier will not work. Instead, we will use multi-hot encoding, which works similarly to one-hot encoding, with the difference being that multiple values can take on the value 1. So now if we have a picture of both a cat and a dog at the same time, we can encode our label as

[ 1  1  0 ]   (cat, dog, giraffe).   (2.1.4)
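For concreteness, here is a small sketch of how such labels might be built in NumPy (the helper names are hypothetical; the class ordering cat, dog, giraffe is the one used above):

    import numpy as np

    classes = ["cat", "dog", "giraffe"]

    def one_hot(label):
        """One-hot label: exactly one entry is 1."""
        y = np.zeros(len(classes))
        y[classes.index(label)] = 1
        return y

    def multi_hot(labels):
        """Multi-hot label: every class present in the image gets a 1."""
        y = np.zeros(len(classes))
        for label in labels:
            y[classes.index(label)] = 1
        return y

    print(one_hot("cat"))             # [1. 0. 0.]   (Equation 2.1.3)
    print(multi_hot(["cat", "dog"]))  # [1. 1. 0.]   (Equation 2.1.4)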
The jth neuron in the ith layer will be labeled a^[i]_j, as shown in Figure 2.1.2.

Figure 2.1.2: Notation for the course: inputs x^(i)_1, ..., x^(i)_4 feed a hidden layer of neurons a^[1]_1, ..., a^[1]_5, which feed the output neuron a^[2]_1 producing ŷ^(i).
As the layers increase, the activation neurons start to look at more complex items. For example, if we are working with a face classifier, the first layer might only detect edges, the second layer might put some of the edges together to find ears and eyes, and the third layer might detect the entire face. The process of extracting data from the neurons is known as encoding.
2.1.1 Day’n’night classification
Our goal is to create an image classifier that predicts, from a photo taken outside, whether it was taken "during the day" (0) or "during the night" (1). From the cat example, we needed about 10,000 images to create a good classifier, so we estimate this problem is about as difficult and that we will need about 10,000 images to classify everything correctly. We will split the data up into both a training and a testing set. In this case, about 80% of the data will be in the training set. In cases with more training data, we would give a smaller percentage to the testing set and a larger percentage to the training set. We also need to make sure that the data is not split randomly, but rather proportionally to the group the data is in (80% of day examples and 80% of night examples go in the training set).

For our input, we would like to work with images that have the lowest possible quality while still achieving good results. One clever way to choose the lowest possible quality is to find the lowest quality at which humans can still perfectly identify the images, and then use that quality. After comparing to human performance, we find that 64 × 64 × 3 pixels will work well as our image size.

Our output will be 0 for the day and 1 for the night. We will also use the sigmoid function as our last activation function because it maps a number in ℝ to be between 0 and 1. A shallow neural network, one with only one hidden layer, should work pretty well with this example because it is a fairly straightforward task. For our loss function, we will use the logistic loss

L(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ).   (2.1.5)
2.1.2 Face verification
Our goal is to find a way for a school to use face verification for validating student IDs in facilities such as dining halls, gyms, and pools. The data we will need is a picture of every student labeled with their name. The input will be the data from the cameras as well as who the person is claiming to be. In general, faces have more detail than the night sky, so we will need the input to be a larger size. Using the human test we used last time, we find a solid resolution to be 412 × 412 × 3 pixels. We will make the output 1 if it is the correct person at the camera and 0 if it is not.

To compare the image we have of a person in our database with the image we have from a camera, we want to find a function to call on both of the images in order to determine whether the person in the images is the same. We could directly compare the differences in pixels; however, that presents severe problems if the background lighting is different, the person grows facial hair, or the ID photo is outdated. Instead, we will use a deep network for our function, as illustrated in Figure 2.1.3.
Figure 2.1.3: The architecture we will use to classify images will be deep networks: the same deep network maps each input image to an encoding vector, and a small distance between the two encodings indicates the same person.
We will set the distance threshold to be some number, and for any distance less than that number we will classify the pair with y = 1. Because the school will not have tons of data on the pictures of students, we will use a public face dataset, where we want to find images of the same people to ensure that their encodings are similar, and images of different people to ensure their encodings are different. To train our network, we will feed it triplets. Each instance will have an anchor (the actual person), a positive example (the same person, whose encoding distance we minimize), and a negative example (a different person, whose encoding distance we maximize).

Figure 2.1.4: Ordered triplet input example: minimize the distance between anchor and positive, maximize the distance between anchor and negative.
For our loss function, we can use

L = ||Enc(A) − Enc(P)|| − ||Enc(A) − Enc(N)|| + α,   (2.1.6)

where ||·|| represents the L2 norm and Enc represents the encoding. The loss function presented makes sense because we want ||Enc(A) − Enc(P)|| to be minimized and ||Enc(A) − Enc(N)|| to be maximized. To minimize a maximized function, we use the negative of that function, which gives us the −||Enc(A) − Enc(N)|| term. The α term, called the margin, is used as a kickstart to our encoding function, to avoid the case where every weight in the deep network is zero and we end up with a perfect loss. It is common to keep loss functions positive, so we generally train on the maximum of the loss function and 0.
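A minimal NumPy sketch of the loss in Equation 2.1.6, assuming enc_a, enc_p, and enc_n are the encoding vectors produced by the network (hypothetical inputs, not the course's implementation):

    import numpy as np

    def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
        """Triplet loss of Equation 2.1.6, clipped at 0 as described above."""
        pos_dist = np.linalg.norm(enc_a - enc_p)   # anchor-positive (minimize)
        neg_dist = np.linalg.norm(enc_a - enc_n)   # anchor-negative (maximize)
        return max(pos_dist - neg_dist + alpha, 0.0)

The value 0.2 for the margin α is just a placeholder; the notes only say the margin must be positive.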
Figure 2.1.5: Updated model for face verification: the inputs are the anchor, positive, and negative images, and the loss is computed from their encodings Enc(A), Enc(P), and Enc(N).
2.1.3 Face recognition
In our school, we now want to use face identification to recognize students in facilities. Instead of just verifying who the person is given their ID, we need to identify the person out of many people. We will reuse a lot of the details from the face verification example, but instead of inputting 3 images and outputting 3 encodings, we will input one image and compute the encoding of that image. Then, we will compare the encoding against our database, which contains encodings for every student, and use a k-nearest neighbors algorithm to make comparisons. To go over the entire database, the runtime complexity will be O(n).

For another example, suppose we have thousands of pictures on our phone of 20 different people and we want to make groups of the same people in folders. Using the encoding of each image, we could run a k-means clustering algorithm to find groups within the vector space that correspond to the same people.
2.1.4 Art generation (neural style transfer)
Given a picture, our goal is to make it beautiful. Suppose we have any data that we want. Our input will contain both content and the style we consider beautiful. Then, once we input a new content image, we would like to generate a styled image based on the content.

Figure 2.1.6: Input for art generation: (a) a content image and (b) a style image.
The architecture we will use is first based on a model that understands images very well. The existing ImageNet models, for instance, are well-suited to an image understanding task. After the image is forward propagated through a few layers of the network, we will get information about the content of the image by looking at the neurons. We will call this information Content_C. Then, giving the style image to the deep network, we will use the Gram matrix to extract Style_S from the style image. More about the Gram matrix will be discussed later in the course.

The loss function we can define as

L = ||Content_C − Content_G|| + ||Style_S − Style_G||,   (2.1.7)

where the subscript G denotes the generated image.
2.1.5 Trigger word detection
Given a 10-second audio clip, we want to detect when the word "active" has been said. To create a model, we will need a lot of 10-second audio clips with as much variety in the voices as possible. We can break the input down into three segments: when "active" is being said, when a word that is not "active" is being said, and when nothing is being said. The sample rate of the input would be chosen similarly to how we found the resolution of images: determining the minimum viable rate that humans have no trouble with.
Figure 2.1.7: Art generation architecture: a pretrained deep network is used to compute the loss, and after many iterations the pixels of the generated image are updated using the gradients.
Figure 2.1.8: Input for the trigger word detection model. Green represents the time frame "active" is being said, red represents the time frame that non-"active" words are being said, and black represents the time frame nothing is being said.
The output will be a sequence of 0s and then a 1 after the word "active" is said. This output is generally easier to debug and works better for continuous models. The last activation we will use is a sigmoid, and the architecture we should use is a recurrent neural network. We can generate data for our model by collecting positive words, negative words, and background noise, and then adding different combinations of the data together to form 10-second clips.
2.2 Shallow neural networks
Figure 2.2.1: Shallow neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3, a hidden layer a^[1]_1, a^[1]_2, a^[1]_3, and an output neuron a^[2]_1 producing ŷ^(i).
2.2.1 Neural networks overview
In each layer of a neural network, we will calculate both the linear activation, and then map the linear activation to be between 0 and 1. To calculate the linear activation in the ith layer,

z^[i] = W^[i] x + b^[i].   (2.2.1)
Figure 2.2.2: Each neuron in an activation layer represents two computations: the linear weights computation plus the bias (z = w^T x + b), and then squeezing that number between 0 and 1 (a = σ(z)).
To calculate the non-linear activation in the ith layer, we would use the sigmoid function

a^[i] = σ(z^[i]).   (2.2.2)

Only once we get to the final layer do we calculate the loss.
2.2.2 Neural network representations
The hidden layer refers to data that is not seen in the training set. To denote the input layer x, we may also use the notation a^[0]. When we count neural network layers, we do not include the input layer in that count; however, we do include the output layer. We will also refer to the nodes in the hidden layers and output layer as neurons.
2.2.3 Computing a neural network’s output
In the shallow neural network pictured in Figure 2.2.5, for the hidden layer, we must calculate

z^[1]_1 = w^[1]T_1 x + b^[1]_1,   a^[1]_1 = σ(z^[1]_1)
z^[1]_2 = w^[1]T_2 x + b^[1]_2,   a^[1]_2 = σ(z^[1]_2)
z^[1]_3 = w^[1]T_3 x + b^[1]_3,   a^[1]_3 = σ(z^[1]_3).   (2.2.3)
Implementing the assignments in Equation 2.2.3 would require an inefficient for-loop. Instead, we will vectorize the process as follows:

z^[1] = [ — w^[1]T_1 — ]       [ x_1 ]   [ b^[1]_1 ]   [ w^[1]T_1 x + b^[1]_1 ]
        [ — w^[1]T_2 — ]   ·   [ x_2 ] + [ b^[1]_2 ] = [ w^[1]T_2 x + b^[1]_2 ]
        [ — w^[1]T_3 — ]       [ x_3 ]   [ b^[1]_3 ]   [ w^[1]T_3 x + b^[1]_3 ].   (2.2.4)

The dimensions of the weight matrix will be the number of neurons in the hidden layer by the number of inputs. The bias vector will be a column vector with the same number of rows as the weight matrix. We will call the weight matrix W^[1] and the bias vector b^[1]. For our activation, it follows that

a^[1] = [ a^[1]_1 ; a^[1]_2 ; a^[1]_3 ] = σ(z^[1]).   (2.2.5)
Putting both the values of z and a together, we now only have to calculate

z^[1] = W^[1] x + b^[1],   a^[1] = σ(z^[1]).   (2.2.6)

Instead of writing x, we will write a^[0] in its place, which gives us the assignments

z^[1] = W^[1] a^[0] + b^[1],   a^[1] = σ(z^[1]).   (2.2.7)

We prefer the notation a^[0] in this case, because it will make deeper layers more natural to reference. Now for the second layer, we can use a similar process to show

z^[2] = W^[2] a^[1] + b^[2],   a^[2] = σ(z^[2]).   (2.2.8)

From Figure 2.2.5, we see that there is only 1 neuron in activation layer 2, with 3 inputs. This implies that the dimensions of the weight matrix are 1 × 3, multiplied by the 3 × 1 input matrix and added to the 1 × 1 bias term.
2.2.4 Vectorizing across multiple examples
Previously, we have only calculated the values from a single input x. For m training examples, we would be required to run a loop over every single example in order to calculate each corresponding a^[2](i) value, where the index i denotes the ith training example. The corresponding for-loop is shown in Algorithm 1.
Algorithm 1 Forward propagation
for i ∈ {1, . . . , m} do
    z^[1](i) = W^[1] a^[0](i) + b^[1]
    a^[1](i) = σ(z^[1](i))
    z^[2](i) = W^[2] a^[1](i) + b^[2]
    a^[2](i) = σ(z^[2](i))
end for
Earlier, we defined the matrix X to store all of the input instances as

X = [ x^(1)  . . .  x^(m) ],   (2.2.9)

where X ∈ ℝ^{n_x × m}. We now instead define

Z^[1] = W^[1] X + b^[1],   (2.2.10)

where we broadcast the column vector b to be a matrix with m columns. It follows that

A^[1] = σ(Z^[1]).   (2.2.11)

Now, the columns of Z and A will hold the per-example values:

Z^[1] = [ z^[1](1)  . . .  z^[1](m) ],   A^[1] = [ a^[1](1)  . . .  a^[1](m) ].   (2.2.12)

Horizontal indices correspond to different training examples, and vertical indices correspond to different neurons in the hidden unit. The complete vectorization across multiple examples will be:

Z^[1] = W^[1] A^[0] + b^[1]
A^[1] = σ(Z^[1])
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = σ(Z^[2]).   (2.2.13)
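Equation 2.2.13 translates almost line for line into NumPy. This sketch assumes sigmoid activations in both layers and the parameter shapes discussed in Section 2.3.2; the function name is illustrative:

    import numpy as np

    def sigmoid(Z):
        return 1 / (1 + np.exp(-Z))

    def forward(X, W1, b1, W2, b2):
        """Vectorized forward pass of Equation 2.2.13.
        X is (n_x, m); W1 is (n1, n_x); b1 is (n1, 1); W2 is (n2, n1); b2 is (n2, 1)."""
        Z1 = np.dot(W1, X) + b1     # b1 broadcasts across the m columns
        A1 = sigmoid(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)
        return Z1, A1, Z2, A2       # cache Z and A for backpropagation later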
2.2.5 Activation functions
Previously, when forward propagating, we used σ as our activation function, although that is just one choice. Recall that the graph of

σ(z) = 1 / (1 + e^{−z})   (2.2.14)

in Figure 2.2.3 was used to map z from any number in ℝ to a number between 0 and 1.

Figure 2.2.3: Sigmoid function
Let us denote our activation function as g, so that

a^[i] = g(z^[i]).   (2.2.15)

Figure 2.2.4: Hyperbolic tangent function as an activation function
An alternative to the traditional sigmoid function is a sigmoid with a range from −1 to 1, which can be thought of as dragging the bottom of the sigmoid from the line at 0 down to the line at −1. After working through the calculations, we would find that this function is the hyperbolic tangent function, so our new activation function could be

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).   (2.2.16)

Generally, the hyperbolic tangent function tends to outperform the sigmoid function because it has the effect of centering the activations around 0. For the last layer, however, it is still common to use the sigmoid function to give a better interpretation of the output from a probabilistic perspective. Since different layers may have different activation functions, we will refer to the activation function in the ith layer as g^[i].
As we approach larger and smaller inputs for both the sigmoid and hyperbolic tangent functions, it is clear that the rate of change is decreasing drastically and approaching no rate of change at the asymptotes. The rectified linear unit (ReLU) activation function is often used to solve this problem at the infinities. The ReLU function is defined as

ReLU(z) = max(0, z),   (2.2.17)

which gives the plot shown in Figure 2.2.6. The rate of change of the ReLU function will be 1 if z > 0 and 0 if z < 0. It is quite improbable that z = 0 exactly, in which case our derivative is undefined, and we should manually set the result to either 0 or 1 in that case.

Figure 2.2.5: Different layers in the neural network may have different activation functions (here, tanh in the hidden layer and σ at the output).
Figure 2.2.6: ReLU activation function
For binary classification problems, where y ∈ {0, 1}, it is common practice to use the sigmoid activation function in the last layer and the ReLU activation function everywhere else. We prefer the ReLU activation function in practice because its derivatives are often much easier and faster to handle.

The last activation function is the leaky ReLU, which behaves similarly to the ReLU function, with the exception being how negative numbers are handled. Instead of making all negative numbers 0, the leaky ReLU uses some scaling factor multiplied by z. The most common scaling factor is 0.01, which corresponds to the leaky ReLU function

leaky ReLU(z) = max(0.01z, z).   (2.2.18)

In practice, it is often hard to determine which activation function should be used. It is quite common to test different combinations of activation functions in order to find which one works the best.
2.2.6 Why use non-linear activation functions?
Consider the shallow neural network pictured in Figure 2.2.8. If we used a linear activation function, that is,

g^[i](z^[i]) = z^[i],   (2.2.19)
Figure 2.2.7: Leaky ReLU activation function

Figure 2.2.8: Different layers in the neural network may have different activation functions (here, linear activations in the hidden layer and σ at the output).
then we have

a^[1] = z^[1] = W^[1] x + b^[1]
a^[2] = z^[2] = W^[2] a^[1] + b^[2].   (2.2.20)

Substituting W^[1] x + b^[1] in for a^[1] in the second equation, we have

a^[2] = W^[2] (W^[1] x + b^[1]) + b^[2].   (2.2.21)

Expanding out the matrix product, we have

(W^[2] W^[1]) x + (W^[2] b^[1] + b^[2]),   (2.2.22)

which is a linear function of x. Since this is a linear hidden layer, stacking deeper and deeper sets of linear weights on top of each other will never change our function from being linear, so the extra weights end up being useless. When using a linear activation function for the hidden-layer calculations, the entire network tends to behave just like a logistic regression model.

In the case of linear regression where ŷ ∈ ℝ, such as when housing prices are being predicted, we may use a linear function in the output layer of the neural network in order to map to the set of real numbers, but other than that case, linear activation functions should be avoided.
2.2.7 Derivatives of activation functions
With a sigmoid activation function

g(z) = 1 / (1 + e^{−z}),   (2.2.23)

we can write the derivative in terms of the original function using some clever algebraic manipulation that follows:

∂σ/∂z = e^{−z} / (1 + e^{−z})^2   (2.2.24)
      = [e^{−z} / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.25)
      = [(1 + e^{−z} − 1) / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.26)
      = [(1 + e^{−z}) / (1 + e^{−z}) − 1 / (1 + e^{−z})] · [1 / (1 + e^{−z})]   (2.2.27)
      = (1 − σ) · σ.   (2.2.28)
When using the hyperbolic tangent function as our activation function, we have

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).   (2.2.29)

The derivative of the hyperbolic tangent function can also be written in terms of its input:

∂tanh(z)/∂z = 1 − tanh^2(z).   (2.2.30)

For the ReLU function

g(z) = max(0, z),   (2.2.31)

the derivative is

g′(z) = { 1  if z > 0 ;  0  if z < 0 }.   (2.2.32)

In the rare case where we have g′(z = 0), the derivative is technically undefined, although in practice we choose a value of either 0 or 1. The choice tends to be very insignificant.

For the leaky ReLU function

g(z) = max(0.01z, z),   (2.2.33)

the derivative is

g′(z) = { 1  if z > 0 ;  0.01  if z < 0 }.   (2.2.34)

We will treat the case where z = 0 for the leaky ReLU similarly to the case where z = 0 for the ReLU, choosing either 0.01 or 1.
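All four activation functions and their derivatives fit into a few lines of NumPy. This is a sketch; the z = 0 cases fall into the z ≤ 0 branches here, which is one of the allowed choices discussed above:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1 - s)                       # Equation 2.2.28

    def d_tanh(z):
        return 1 - np.tanh(z) ** 2               # Equation 2.2.30

    def relu(z):
        return np.maximum(0, z)                  # Equation 2.2.31

    def d_relu(z):
        return (z > 0).astype(float)             # Equation 2.2.32; 0 chosen at z = 0

    def leaky_relu(z):
        return np.maximum(0.01 * z, z)           # Equation 2.2.33

    def d_leaky_relu(z):
        return np.where(z > 0, 1.0, 0.01)        # Equation 2.2.34; 0.01 chosen at z = 0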
2.2.8 Gradient descent for neural networks
For a shallow neural network with one input layer, one hidden layer, and one output layer, we will have the parameters W^[1] ∈ ℝ^{n^[1] × n^[0]}, b^[1] ∈ ℝ^{n^[1] × 1}, W^[2] ∈ ℝ^{n^[2] × n^[1]}, and b^[2] ∈ ℝ^{n^[2] × 1}. So far we have only seen examples where n^[2] = 1. We are also denoting n^[0] as n_x, and n^[i] as the number of nodes in the ith layer.

For now, assume we are doing binary classification with the cost function

J(W^[1], b^[1], W^[2], b^[2]) = (1/m) ∑_{i=1}^{m} L(ŷ, y),   (2.2.35)

where ŷ = a^[2] in this case and our loss function is

L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ).   (2.2.36)

We then make updates to our parameters with the gradient descent algorithm shown in Algorithm 2.
Figure 2.2.9: Each neuron can be broken into two parts: the linear activation z and then the non-linear activation a.
Algorithm 2 Gradient descent
repeat
    Compute the predictions ŷ^(i) for i ∈ {1, . . . , m}
    Compute the derivatives dW^[ℓ], db^[ℓ] of the parameters with respect to J
    Update the weights and biases with W^[ℓ] = W^[ℓ] − α dW^[ℓ] and b^[ℓ] = b^[ℓ] − α db^[ℓ]
until the cost function J converges
Figure 2.2.10: Computation graph for logistic regression
Recall that when we were using logistic regression, we had the computational graph in Figure 2.2.10, where the red arrows represent forward propagation and the blue arrows represent backward propagation. We previously calculated

∂L/∂a = −y/a + (1 − y)/(1 − a)   (2.2.37)
∂L/∂z = ∂L/∂b = a − y   (2.2.38)
∂L/∂x = w(a − y)   (2.2.39)
∂L/∂w = x(a − y).   (2.2.40)

In the two-layer computational graph pictured in Figure 2.2.11, it is clear that a^[2], z^[2], W^[2], and b^[2] will have rates of change with respect to the loss function similar to those of a, z, w, and b in logistic regression, so we have the derivatives

∂L/∂a^[2] = −y/a^[2] + (1 − y)/(1 − a^[2])   (2.2.41)
∂L/∂z^[2] = a^[2] − y   (2.2.42)
∂L/∂b^[2] = a^[2] − y   (2.2.43)
∂L/∂W^[2] = (a^[2] − y) · a^[1]T.   (2.2.44)
Figure 2.2.11: Two-layer neural network computational graph: W^[1], x, and b^[1] feed z^[1] = W^[1] x + b^[1] and a^[1] = σ(z^[1]); then W^[2] and b^[2] feed z^[2] = W^[2] a^[1] + b^[2] and a^[2] = σ(z^[2]), which feeds L(a^[2], y).
We can calculate ∂L/∂a^[1] from the chain rule:

∂L/∂a^[1] = ∂L/∂z^[2] · ∂z^[2]/∂a^[1]   (2.2.45)
          = W^[2]T (a^[2] − y).   (2.2.46)

Going backward to ∂L/∂z^[1], we can use the result for ∂L/∂a^[1]:

∂L/∂z^[1] = ∂L/∂a^[1] · ∂a^[1]/∂z^[1]   (2.2.47)
          = (W^[2]T (a^[2] − y)) ∘ (a^[1] ∘ (1 − a^[1])),   (2.2.48)

using ∂a^[1]/∂z^[1] = a^[1](1 − a^[1]) from Equation 1.2.24. Now we can get back to the weight and bias terms in the first layer, where the rate of change with respect to W^[1] is

∂L/∂W^[1] = ∂L/∂z^[1] · ∂z^[1]/∂W^[1]   (2.2.49)
          = [W^[2]T (a^[2] − y) ∘ a^[1] ∘ (1 − a^[1])] · x^T,   (2.2.50)

and the rate of change with respect to b^[1] is

∂L/∂b^[1] = ∂L/∂z^[1] · ∂z^[1]/∂b^[1]   (2.2.51)
          = W^[2]T (a^[2] − y) ∘ a^[1] ∘ (1 − a^[1]).   (2.2.52)
What we have written is for the case of one input example; however, we can extend to all the input examples by using vectorization, where

∂L/∂Z^[2] = A^[2] − Y.   (2.2.53)

For ∂L/∂W^[2], we will need to consider the cost function inside of our rate of change, so we get

∂L/∂W^[2] = (1/m) · ∂L/∂Z^[2] · A^[1]T.   (2.2.54)

We will write ∂L/∂b^[2] using the NumPy notation

∂L/∂b^[2] = (1/m) · np.sum(∂L/∂Z^[2], axis = 1, keepdims = True).   (2.2.55)

We will have ∂L/∂Z^[1] ∈ ℝ^{n^[1] × m}, where

∂L/∂Z^[1] = W^[2]T · ∂L/∂Z^[2] ∘ g^[1]′(Z^[1]),   (2.2.56)

where g^[1] is the activation function in layer 1 and ∘ is the element-wise product operator. Finally, the changes to the weights and bias in layer 1 will be

∂L/∂W^[1] = (1/m) · ∂L/∂Z^[1] · X^T   (2.2.57)

and

∂L/∂b^[1] = (1/m) · np.sum(∂L/∂Z^[1], axis = 1, keepdims = True).   (2.2.58)
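Equations 2.2.53 through 2.2.58 correspond to the following NumPy sketch for a two-layer network with a sigmoid output. This is an illustration, assuming the cached forward-pass values from the forward sketch after Equation 2.2.13 and an activation-derivative function such as d_tanh from Section 2.2.7:

    import numpy as np

    def backward(X, Y, Z1, A1, A2, W2, d_g1):
        """Vectorized gradients for a two-layer network.
        d_g1 is the derivative of the layer 1 activation, e.g. d_tanh."""
        m = X.shape[1]
        dZ2 = A2 - Y                                          # Equation 2.2.53
        dW2 = np.dot(dZ2, A1.T) / m                           # Equation 2.2.54
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m          # Equation 2.2.55
        dZ1 = np.dot(W2.T, dZ2) * d_g1(Z1)                    # Equation 2.2.56
        dW1 = np.dot(dZ1, X.T) / m                            # Equation 2.2.57
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m          # Equation 2.2.58
        return dW1, db1, dW2, db2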
Figure 2.2.12: Two-layer neural network with inputs x_1, x_2, hidden neurons a^[1]_1, a^[1]_2, and output neuron a^[2]_1 producing ŷ^(i).
2.2.9 Random Initialization
Consider if we initialized

W^[1] = [ 0  0 ; 0  0 ],   (2.2.59)

then a^[1]_1 = a^[1]_2. If two neurons in the same layer are equal, then they are going to have the same rate of change and the same update from the gradient. With the same changes after each iteration, the two neurons will act as if they are the same, which means having two neurons is pointless since it does not add additional information. We call this error the symmetry breaking problem.

The solution to the problem is to initialize small non-zero random weights, where we initialize

W^[1] = np.random.randn(2, 2) * 0.01.   (2.2.60)

We use the 0.01 factor to make sure that our inputs are very small, since our hyperbolic tangent and sigmoid functions have more meaningful values near 0, and our function converges faster.

It turns out that our bias term b does not have a symmetry problem, so it is alright if we initialize it to be the zero column vector

b^[1] = np.zeros((2, 1)).   (2.2.61)

We can initialize the rest of the weights and bias terms in a similar fashion.
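A sketch of initializing every layer this way, given a list of layer sizes (the helper name is hypothetical, following Equations 2.2.60 and 2.2.61):

    import numpy as np

    def initialize_parameters(layer_sizes, scale=0.01):
        """layer_sizes = [n_x, n1, ..., nL]; small random weights, zero biases."""
        params = {}
        for l in range(1, len(layer_sizes)):
            params["W" + str(l)] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * scale
            params["b" + str(l)] = np.zeros((layer_sizes[l], 1))
        return params

    params = initialize_parameters([3, 4, 1])   # e.g. the network in Figure 2.2.1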
2.3 Deep neural networks
Figure 2.3.1: 4-layer neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3, hidden layers of 4, 4, and 3 neurons, and an output neuron a^[4]_1 producing ŷ^(i).
For deep neural networks, we will use L to denote the number of layers and n^[ℓ] to denote the number of nodes in the ℓth layer. For example, in Figure 2.3.1, we have L = 4, n^[1] = n^[2] = 4, and n^[3] = 3. To refer to the activations in each layer, we will use a^[ℓ], and to refer to the weights in each layer we will use W^[ℓ]. Similarly, the linear activation and bias in each layer will be z^[ℓ] and b^[ℓ], respectively. We can refer to the input as ℓ = 0 and the output as ℓ = L.
2.3.1 Forward propagation in a deep network
To propagate a training example forward, using the network in Figure 2.3.1, we would use a similar process to how we forward propagated shallow neural networks, but now we will extend to more layers, calculating

z^{[1]} = w^{[1]}a^{[0]} + b^{[1]},  a^{[1]} = g^{[1]}(z^{[1]})
z^{[2]} = w^{[2]}a^{[1]} + b^{[2]},  a^{[2]} = g^{[2]}(z^{[2]})
z^{[3]} = w^{[3]}a^{[2]} + b^{[3]},  a^{[3]} = g^{[3]}(z^{[3]})
z^{[4]} = w^{[4]}a^{[3]} + b^{[4]},  a^{[4]} = g^{[4]}(z^{[4]}).    (2.3.1)
In general, it is easy to see that

z^{[ℓ]} = w^{[ℓ]}a^{[ℓ−1]} + b^{[ℓ]},  a^{[ℓ]} = g^{[ℓ]}(z^{[ℓ]}),    (2.3.2)

for 1 ≤ ℓ ≤ L. For an entire training set, instead of a single training instance, we could use

Z^{[ℓ]} = W^{[ℓ]}A^{[ℓ−1]} + b^{[ℓ]},  A^{[ℓ]} = g^{[ℓ]}(Z^{[ℓ]}).    (2.3.3)
While we would prefer to never use for-loops, using a for loop to iterate over each layer is
the best way to forward propagate.
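A minimal sketch of that per-layer for loop, assuming a params dictionary like the one above and, for simplicity, a sigmoid activation at every layer:

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, params, L):
    # X plays the role of A^{[0]}; caches hold each Z^{[l]} for backprop
    A = X
    caches = []
    for l in range(1, L + 1):
        Z = params["W" + str(l)] @ A + params["b" + str(l)]  # (2.3.3)
        A = sigmoid(Z)
        caches.append(Z)
    return A, caches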
2.3.2 Getting the matrix dimensions right
Figure 2.3.2: Network for the matrix dimensions walkthrough (two inputs x₁, x₂ and one hidden layer with three units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3).
One of the easiest mistakes to make in deep learning is calculating the dimensions of the matrices incorrectly. In this section, we are going to walk through how we can think about the dimensions and make sure they are correct at each step. Consider the neural network in Figure 2.3.2, where our goal is to form W^{[1]}, b^{[1]}, and x in order to form

z^{[1]} = W^{[1]}x + b^{[1]}.    (2.3.4)
We know that input features correspond to the rows of our matrix and different training examples correspond to different columns. So, if each instance has two input features, that means each instance is in R^{2×1}. Further, since n^{[1]} = 3, we know that for each instance, z^{[1]} ∈ R^{3×1}. So, we currently have the dimensions for the equation to be

(3 × 1) = W^{[1]}(2 × 1) + b^{[1]}.    (2.3.5)

From matrix multiplication (without broadcasting), W^{[1]} must have 2 columns to match the number of rows of x. We also know that when adding matrices the dimensions must match, so b^{[1]} must be in the same dimensional space as z^{[1]} and W^{[1]}x, which means b^{[1]} ∈ R^{3×1}.
Now, from properties of matrix multiplication, W^{[1]}x must have W^{[1]} ∈ R^{3×2}. The matrix a will always be in the same dimensional space as z, since a only applies element-wise operations to the elements of z. More generally, the sizes for each item in our neural network will be

z^{[ℓ]}, a^{[ℓ]} ∈ R^{n^{[ℓ]}×m}    (2.3.6)
W^{[ℓ]} ∈ R^{n^{[ℓ]}×n^{[ℓ−1]}}    (2.3.7)
b^{[ℓ]} ∈ R^{n^{[ℓ]}×1}.    (2.3.8)
The derivatives of each variable will also be in the same dimensional space as the variables, so when we are backward propagating

dz^{[ℓ]}, da^{[ℓ]} ∈ R^{n^{[ℓ]}×m}    (2.3.9)
dW^{[ℓ]} ∈ R^{n^{[ℓ]}×n^{[ℓ−1]}}    (2.3.10)
db^{[ℓ]} ∈ R^{n^{[ℓ]}×1}.    (2.3.11)
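One practical way to catch dimension mistakes early is to assert the shapes from Equations 2.3.6-2.3.8 directly; a sketch, with illustrative names:

def check_shapes(params, layer_dims):
    # layer_dims[l] = n^{[l]}; raises AssertionError on any mismatch
    for l in range(1, len(layer_dims)):
        assert params["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert params["b" + str(l)].shape == (layer_dims[l], 1)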
2.3.3 Why deep representations?
Figure 2.3.3: Deeper layers in a neural network analyze more complex data (less complex → more complex).
When using a neural network architecture, the deeper nodes look at more complex data. For example, with images, the first layer may detect edges, the second layer may put a few edges together, and the third layer may formulate the understanding of a face. Compounding knowledge throughout multiple layers is analogous to how many people believe the human brain works.
Graphs, in general, are often more useful when multiple layers can be stacked on top of each other. Consider the logic tree in Figure 2.3.4 that represents the function y = x₁ ⊕ (x₂ ⊕ x₃), where ⊕ denotes XOR. With more layers, it is easier to store information that becomes more complex as the layers grow in depth.
2.3.4 Forward and backward propagation
The forward propagation step on layer ℓ takes in a^{[ℓ−1]} as input and outputs a^{[ℓ]}, while also caching z^{[ℓ]} for back propagation. For backward propagation, we will input da^{[ℓ]} and output da^{[ℓ−1]}, dW^{[ℓ]}, and db^{[ℓ]}.
Figure 2.3.4: Comparing logic trees of different depths for y = x₁ ⊕ (x₂ ⊕ x₃). (a) Few nodes are required for a deeper tree. (b) Exponentially more nodes are required for a shallow tree, since it must test every possible combination of x₁, x₂, and x₃. It is easier for deeper trees to represent the same information as shallow trees with fewer nodes.
Figure 2.3.5: Forward and backward propagation steps for a two layer neural network. We first forward propagate across the top (a^{[0]} → a^{[1]} → a^{[2]} = ŷ), caching the intermediate z^{[ℓ]} values, then we backward propagate across the bottom (da^{[2]} → da^{[1]} → da^{[0]}) to find the changes dw^{[ℓ]}, db^{[ℓ]} to the weights.
From the chain rule, the outputs of back propagation are

dz^{[ℓ]} = da^{[ℓ]} ⊙ g^{[ℓ]′}(z^{[ℓ]})
dw^{[ℓ]} = dz^{[ℓ]} · a^{[ℓ−1]T}
db^{[ℓ]} = dz^{[ℓ]}
da^{[ℓ−1]} = w^{[ℓ]T} · dz^{[ℓ]}.    (2.3.12)
For an entire training set, we would have

dZ^{[ℓ]} = dA^{[ℓ]} ⊙ g^{[ℓ]′}(Z^{[ℓ]})
dW^{[ℓ]} = (1/m) · dZ^{[ℓ]} · A^{[ℓ−1]T}
db^{[ℓ]} = (1/m) · np.sum(dZ^{[ℓ]}, axis=1, keepdims=True)
dA^{[ℓ−1]} = W^{[ℓ]T} · dZ^{[ℓ]}.    (2.3.13)
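These four equations translate into a single backward step per layer that can be called in a loop from layer L down to layer 1; a sketch, assuming the cached Z^{[ℓ]} and A^{[ℓ−1]} and a function g_prime for the derivative of the layer's activation:

import numpy as np

def backward_step(dA, Z, A_prev, W, g_prime):
    # dA: (n_l, m), Z: (n_l, m), A_prev: (n_{l-1}, m), W: (n_l, n_{l-1})
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                               # dZ = dA ⊙ g'(Z)
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db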
2.3.5 Parameters vs hyperparameters
For neural network models, our parameters consist strictly of the weights W^{[ℓ]} and biases b^{[ℓ]}. However, there are plenty more variables that we have to set in order to optimize our function, such as the learning rate α, number of gradient descent iterations, number of hidden layers L, number of hidden units n^{[ℓ]}, and type of activation function. The variables that affect our parameters, as just listed, are known as hyperparameters. Many hyperparameters are set on an experimental basis, by trying a range of values and keeping what works best for the application at hand.
2.3.6 Connection to the brain
Modern deep learning has little resemblance to what we know about how the brain learns. While it is still unknown exactly how a neuron in the brain works, we have a loose idea that each neuron takes some signal and passes a signal on to the next neuron. Putting many neurons together then loosely resembles a tree-like structure similar to a neural network. However, there is little evidence that anything like forward and backward propagation occurs, so the connection to the brain is very loose.
Chapter 3
Optimizing and structuring a
neural network
3.1 Full-cycle deep learning projects
For most machine learning projects, we have to do the following tasks in order:
1. find a problem
2. get data
3. design a model
4. train the model
5. test the model
6. deploy the project
7. maintain the project
The model that we originally choose tends to be changed over time in order to find the model that performs best. Most machine learning papers tend to focus on steps 2-5, but in this section, we will focus on steps 1, 6, and 7. For deep learning problems, good problems tend to be interesting, with available data, in a field we have domain knowledge in, as well as useful and feasible to create. Feasibility is a massive problem when choosing projects, especially in industry. To find data, let us use the example of a trigger word speech detection system, such as Alexa. It may be a good idea to first use a small set of data in order to quickly start building out a model and finding the difficulties in the problem at hand; even a couple hundred instances may initially work well. One way to collect new data would be to go around and ask different people to speak into a recording device, where they say either nothing, "Alexa", or some other words.
As we train the models iteratively, it is a good idea to document each model that we set up and the results it produced. That way, it will be easy to look back on older models when making tweaks to a current system.
Figure 3.1.1: Breaking a problem involving trigger word detection into two parts: voice activity detection (VAD) followed by a large neural network for trigger word detection (audio → VAD → NN → ŷ).
For our system involving trigger word detection, it may be a good idea to break the problem into two parts: one that detects if a word is spoken, and one that detects if the spoken word was the trigger word. In modern literature, the spoken-word detector may be referred to as voice activity detection (VAD). When using VAD we primarily have two options to choose from when selecting a model. We can use a non-machine learning based approach that detects if the volume is greater than some threshold, or we can use a small neural network that is trained to detect human speech. Since the non-machine learning option may only take a few minutes to implement, it may be best to choose that option first in order to minimize cost and move quickly. Another downside to the neural network approach is that new accents may be introduced in production that the model has never heard before, which could cause abnormal results.
3.2 Setting up a machine learning application
3.2.1 Training, development, and testing sets
Applied machine learning involves choosing many hyperparameters for a model, such as the number of layers, hidden units, learning rates, and activation functions. Most of the choices of hyperparameters for one subset of problems, such as natural language processing, tend to not work very well on other machine learning problems, like computer vision. Because it is often difficult to determine which hyperparameters work best, we tend to check our choice of hyperparameters against a held-out set in order to determine its effectiveness.
Figure 3.2.1: A data set is broken into three parts: training, development, and testing.
We tend to break our data into three sets: training, cross-validation or development, and testing.¹ The training data set is used to change the weights and biases of our model, the cross-validation set is used to determine if our model has overfitted the data and whether our hyperparameters generalize well to data we have not seen, and the testing set is used to estimate how accurate our model will be once deployed. In the past, the best practice for breaking the data up was to give the training set 60% of the data, the development set 20% of the data, and the testing set 20% of the data. Now, however, our datasets have grown large enough that breaking up the data based on percentages no longer makes sense. Today, we may break up a data set with 1 million examples into a development set with 10 thousand examples and a testing set with 10 thousand examples. It is important to distribute different types of data between the testing and development sets evenly.
3.2.2 Bias and variance
Figure 3.2.2: Different types of variance and bias: high bias corresponds to underfitting, high variance to overfitting, with "just right" in between.
When we talk about the variance of a particular model, we are referring to the amount of overfitting in our model. When there is a lot of overfitting, we say the model has high variance. Similarly, when the model is underfitting, we say there is low variance. We use the term bias to describe how well our model fits the data. If our model performs well on both the training and development set, we say that the model has low bias, and if the model performs badly on both the training and development set, we say the model has a high bias.
¹Many people may only use a training and testing set, where the testing set acts as if it were a development set.
We often determine variance based on the training set and development set error rates.

Train set error    1%              15%         15%                         0.5%
Dev set error      11%             16%         30%                         1%
Diagnosis          high variance   high bias   high bias, high variance    low bias, low variance

Table 3.1: How we may denote high and low variance and bias when given the training and development set error rates. Here, we assume that the human error is approximately 0%.
Figure 3.2.3: An example of a model having both high variance and high bias.
3.2.3 Combatting bias and variance
With a model, we first check for high bias and then high variance. When a model has high bias, the network will not be fitting the training data well. It may be a good idea to test out a larger neural network, run gradient descent longer, or change neural network architectures. When a model has high variance, the development set error will be high. We can improve the high variance problem by introducing new data, regularization, or changing the neural network architecture.
Before neural networks became widely adopted, we often had to decide whether to improve the bias or the variance, but not both. Having to choose one or the other came to be known as the bias-variance tradeoff. However, with larger neural networks we have been able to independently improve both the bias and variance of a network.
3.3 Regularizing a neural network
3.3.1 Regularization
Regularization is a way to improve a high variance model. To show why regularization may be useful, let us consider logistic regression, where our goal was to find

min_{w,b} J(w, b)    (3.3.1)

where w ∈ R^{n_x}, b ∈ R, and

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.3.2)
When using regularization we will redefine our cost function to be

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) ||w||₂²,    (3.3.3)

where we are using the L2 norm defined as

||w||₂² = Σ_{j=1}^{n_x} w_j² = wᵀw.    (3.3.4)
When using the L2 norm in our regularization term, we say that we are using L2 regularization. The other main type of regularization is L1 regularization, where our regularization term is

(λ/2m) Σ_{j=1}^{n_x} |w_j| = (λ/2m) ||w||₁.    (3.3.5)

L1 regularization tends to result in w being a sparse vector, that is, a vector with a lot of zeros, which is convenient for calculations. L2 regularization tends to be used much more often with neural networks. The term λ is known as the regularization parameter, which is another hyperparameter, often set during the validation process of training the network.
For neural networks, we want to find

min_{w^{[1]},b^{[1]},...,w^{[L]},b^{[L]}} J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}),    (3.3.6)

where our cost function is also defined as

J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.3.7)
The regularized cost involving multiple bias and weight terms is

J(w^{[1]}, b^{[1]}, . . . , w^{[L]}, b^{[L]}) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) Σ_{ℓ=1}^{L} ||w^{[ℓ]}||_F²,    (3.3.8)

where the squared norm of a matrix is known as the Frobenius norm, which is defined as

||w^{[ℓ]}||_F² = Σ_{i=1}^{n^{[ℓ−1]}} Σ_{j=1}^{n^{[ℓ]}} (w^{[ℓ]}_{ij})².    (3.3.9)
Before we added the additional regularization term to our cost function, we updated the weights like

w^{[ℓ]} = w^{[ℓ]} − α dw^{[ℓ]},    (3.3.10)

where we calculated dw^{[ℓ]} to be the change in the cost function with respect to w. When adding the regularization term, our calculation of gradient descent must also include the changes of the regularization term with respect to the weights and biases. When our bias changes, the regularization term does not change, so the gradient descent term for the bias remains the same. However, when our weights change, the derivative of our regularization term turns out to be

d/dw^{[ℓ]} [(λ/2m) ||w^{[ℓ]}||_F²] = (λ/m) w^{[ℓ]}.    (3.3.11)
While it may seem weird that the derivative of the Frobenius norm results in a matrix (since the Frobenius norm produces a scalar value), we are really just looking for the change in the Frobenius norm due to changes in the entire matrix, not the change due to one element, which is why the result ends up as a matrix. Now we can rewrite our weights gradient descent expression as

w^{[ℓ]} = w^{[ℓ]} − α (dw^{[ℓ]} + (λ/m) w^{[ℓ]}).    (3.3.12)
Distributing the α term, we have

w^{[ℓ]} = w^{[ℓ]} − α dw^{[ℓ]} − α (λ/m) w^{[ℓ]}.    (3.3.13)

Notice that with the terms

w^{[ℓ]} − α (λ/m) w^{[ℓ]}    (3.3.14)

in our gradient descent function, the addition of the regularization term will always shrink the values of the weights toward zero, since α, λ, and m are all nonnegative. We sometimes refer to regularization that strictly decreases the weights as weight decay.
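In code, the weight decay update of Equation 3.3.13 is a one line change to the gradient step; a sketch with illustrative names:

def update_with_weight_decay(W, dW, alpha, lam, m):
    # the alpha * (lam / m) * W term shrinks the weights toward zero
    return W - alpha * dW - alpha * (lam / m) * W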
3.3.2 Why regularization reduces overfitting
With the L2 regularized cost function

J = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}) + (λ/2m) Σ_{ℓ=1}^{L} ||w^{[ℓ]}||_F²,    (3.3.15)

making λ a large number will make the term

(1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)})    (3.3.16)

negligible in comparison. In the scenario where λ is very large, we must make the weights very small in order to minimize the cost. If the weights are near zero, they will not have much effect on the neural network and our new neural network will become much simpler. In practice, all of the weights become much smaller, not just a select few as shown in Figure 3.3.1. But, with all the weights near 0, the model will still become simpler.
Figure 3.3.1: If the weights connected to a neuron are near 0, then the neuron will not make a significant contribution to the hypothesis. With fewer neurons in the neural network, the model will become simpler.
With our linear activation calculation of

z^{[ℓ]} = w^{[ℓ]}a^{[ℓ−1]} + b^{[ℓ]},    (3.3.17)

if we make w^{[ℓ]} really small, then z^{[ℓ]} will end up really small (assuming b^{[ℓ]} does not make a significant contribution). If we are using the hyperbolic tangent function or the sigmoid function as our activation function, then when z is small we will pretty much have a linear activation function. Recall that with a linear activation function, the depth of our neural network would not matter and the entire model could just be represented with a linear function.
Figure 3.3.2: The hyperbolic tangent function tanh(z) is nearly a linear function for values of z near 0.
Figure 3.3.3: 4 layer neural network with 3 hidden layers, each with 4 neurons.
3.3.3 Dropout regularization
Consider the neural network in Figure 3.3.3, which has a total of 12 hidden neurons. For each iteration of gradient descent that we train our neural network on, dropout regularization visits each hidden neuron and randomly chooses whether or not that neuron will be removed from the neural network. The probability that each hidden neuron remains is a hyperparameter that we can set. With a neural network that has a differing number of neurons in each hidden layer, we can make the keep_prob value a hyperparameter in each layer. If a neuron is removed, then all the weights and biases connected to the neuron are also removed. When testing our neural network or once it is deployed, we will no longer use dropout regularization, primarily because we do not want randomized predictions. In computer vision, the input to the neural network is very large, and many people in computer vision use dropout regularly.
To implement dropout regularization, let us set

d = np.random.rand(a.shape[0], a.shape[1]) < keep_prob,    (3.3.18)

where keep_prob is the probability that we keep the neuron and d is the same shape as a^{[ℓ]}. d will now store a matrix of True or False values. We can then take

a = a * d    (3.3.19)

in order to zero out the values in the matrix where d is 0 and keep the values in the matrix where d is 1. We also want to make sure that z^{[ℓ]} is not reduced from its expected value without dropout regularization, which can be done by taking

a = a / keep_prob.    (3.3.20)

Consider, for example, if we had 100 hidden neurons in a layer and we set keep_prob = 0.8. On average, 20 neurons would be hidden and a^{[ℓ]} would be reduced by 20%.
Figure 3.3.4: A potential simplified neural network containing only the hidden neurons that were not removed by dropout regularization.
To mitigate this loss, we divide by our keep_prob value of 0.8. Dividing by the hyperparameter keep_prob is known as the inverted dropout technique.
With dropout regularization, any feature could be removed randomly from the neural network on any given iteration. Because of this, the network can learn not to rely on any one feature too much, which can regularize a model. In general, when using dropout regularization, the weights shrink in size.
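Putting Equations 3.3.18-3.3.20 together, a forward pass through one layer's activations with inverted dropout might look as follows; a sketch, with keep_prob = 0.8 matching the example above:

import numpy as np

def dropout_forward(A, keep_prob=0.8):
    # boolean mask: each unit survives with probability keep_prob
    D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob
    A = A * D               # zero out dropped units
    A = A / keep_prob       # inverted dropout: rescale to preserve the expected value
    return A, D             # D is cached so backprop drops the same units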
3.3.4 Other regularization methods
While we know that adding more data to our model could help reduce overfitting, we do not always have easy access to new data. Instead, one way to add more data is to augment the existing data that we have, which is known as data augmentation. If we were to augment images, for example, then we might reflect, crop, distort, and change the colors of our input data, as shown in Figure 3.3.5.
Figure 3.3.5: Data augmentation on images
Early stopping is another way to regularize our model. With early stopping, we decrease the number of iterations when running gradient descent in hopes of finding a more general model. When using early stopping, the primary downside is that we are no longer looking to completely optimize the cost function, and it is often hard to determine when it is a good time to stop iterating.
3.4 Setting up an optimization problem
3.4.1 Normalizing inputs
Figure 3.4.1: Normal distribution of data vertically centered at 5 with a variance of 2.25.
To normalize a dataset, we must do two things: center the mean at the origin in each direction and set the variance to be 1. Consider the two dimensional data in Figure 3.4.1. Centering the data in each direction would require us to calculate each direction's mean

μ = (1/m) Σ_{i=1}^{m} x^{(i)}    (3.4.1)

and then take each point and subtract the mean, which can be done with

x := x − μ.    (3.4.2)

The result after centering the data in each direction at 0 is shown in Figure 3.4.2.
Figure 3.4.2: Centering the mean of the data in each direction at 0.
The variance is defined as the square of the standard deviation. In general, the variance is

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)} − μ)²,    (3.4.3)

but since we have set μ = 0, we can simplify the variance to

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)})².    (3.4.4)
To normalize the variance, we must set the variance of the entire dataset to be 1, which can be done by dividing the entire dataset by the standard deviation

x := x / σ.    (3.4.5)

Figure 3.4.3 shows the normalized data after making the variance of the data 1.

Figure 3.4.3: Normalized data after centering the mean at 0 and making the variance 1.

It is important when using a training set and testing set that both sets are normalized with the same μ and σ² in order to best simulate how the model would perform when deployed.
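A sketch of normalizing a training set and then applying the same μ and σ² to the test set; the helper names are illustrative:

import numpy as np

def fit_normalizer(X_train):
    # X has shape (n_x, m); statistics are computed per feature
    mu = np.mean(X_train, axis=1, keepdims=True)       # (3.4.1)
    sigma2 = np.var(X_train, axis=1, keepdims=True)    # (3.4.3)
    return mu, sigma2

def normalize(X, mu, sigma2):
    # center the mean, then scale to unit variance, per (3.4.2) and (3.4.5)
    return (X - mu) / np.sqrt(sigma2)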
Figure 3.4.4: Potential contour plots of the cost function (axes w and b) for (a) normalized and (b) unnormalized data.
Normalizing the data often makes it quicker for gradient descent to converge. Considering the two-dimensional contour plots of a potential cost function in Figure 3.4.4, our goal with normalizing is to make it easier to find the minimum of the cost function. If we have a set of features with a range of (−1000, 1000) and another set of features with a range of (−1, 1), it is often quite hard to pick a learning rate that will work well for both features.
Figure 3.4.5: 7 layer neural network with two units per hidden layer and a single output ŷ.
3.4.2 Vanishing and exploding gradients
Consider the neural network in Figure 3.4.5, where we will assume the input is non-negative and will ignore the bias term for now. Here, we will exclusively use ReLU activations and initialize all of our weights to be of the form

W^{[ℓ]} = [ γ 0 ; 0 γ ],    (3.4.6)

where γ > 0. In the case presented, the ReLU activation function will behave like a linear activation function and we will have

W^{[7]} W^{[6]} W^{[5]} W^{[4]} W^{[3]} W^{[2]} W^{[1]} a^{[0]} = ŷ.    (3.4.7)
The multiplication of all of the W^{[ℓ]} will simplify to

∏_{ℓ=1}^{7} W^{[ℓ]} = ∏_{ℓ=1}^{7} [ γ 0 ; 0 γ ] = [ γ⁷ 0 ; 0 γ⁷ ].    (3.4.8)
When we initialize γ < 1, we will end up multiplying our input by an exponentially small number, and when we initialize γ > 1, the multiplication will be exponentially large. Consider, for example, if we set γ = 0.5, then our input will be scaled down by a factor of γ⁷ = 0.0078125. Similarly, if we set γ = 1.5, then the input will be scaled up by γ⁷ = 17.0859375. When the prediction of our model is a huge number, then the gradients in back propagation will explode, and when our prediction is near zero, the gradients will vanish and updates will have a negligible impact.
3.4.3 Weight initialization for deep neural networks
For each neuron, we calculate the linear activation to be

z = w₁x₁ + w₂x₂ + · · · + wₙxₙ.    (3.4.9)

At the moment, we are going to ignore the bias term. Notice that when n is large, we would like the weights to be small in order to keep z from exploding. Similarly, when n is small, making the weights larger would make the most sense in order to have the input make a contribution. One solution is to initialize our weights to have a variance of

σ² = 1/n.    (3.4.10)
We previously calculated the variance of a data set centered at zero to be

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)})².    (3.4.11)

Starting with a random normally distributed data set with a variance of σ² = 1, we can change the variance to ζ by multiplying each instance by √ζ. We can show that this is correct by carrying out the multiplication on each instance and substituting in the original variance of 1, which gives us

σ² = (1/m) Σ_{i=1}^{m} (x^{(i)} √ζ)²    (3.4.12)
   = ζ [ (1/m) Σ_{i=1}^{m} (x^{(i)})² ]    (3.4.13)
   = ζ.    (3.4.14)
So, if we wanted to make the variance 1/n, we would initialize our weights to be

W^{[ℓ]} = np.random.randn(shape) * np.sqrt(1/n^{[ℓ−1]}).    (3.4.15)

The modern best practice tends to set the variance of weights used with the ReLU function to be

σ² = 2/n^{[ℓ−1]},    (3.4.16)

which is known as He initialization. With a hyperbolic tangent function we generally stick with

σ² = 1/n^{[ℓ−1]},    (3.4.17)

which is known as Xavier initialization. Occasionally, some people may also set the variance to be

σ² = 2/(n^{[ℓ−1]} + n^{[ℓ]}).    (3.4.18)
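A sketch of He initialization for a single layer; swapping the 2 for a 1 gives Xavier initialization:

import numpy as np

def he_init(n_l, n_prev):
    # weights drawn with variance 2 / n^{[l-1]}, per Equation 3.4.16
    return np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)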
3.4.4 Gradient numerical approximations
Numerical approximations of our gradient can be useful when debugging a network and testing if the forward and backward propagation steps are working correctly. From the definition of the derivative, we know that the one-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ)) / ε    (3.4.19)

will give us the slope of the line tangent to f at θ. We could also use the two-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ − ε)) / 2ε    (3.4.20)

to calculate the derivative, which will produce an equivalent result. The two-sided limit, as shown in Figure 3.4.6, works best for numerical approximations because it gets a more accurate picture of the slope on both sides.

Figure 3.4.6: Two-sided limit derivative definition.
The error when using the two-sided limit is O(ε²) and the error when using a one-sided limit is O(ε). Although it is twice as computationally expensive, it is often worth it to use the two-sided limit.
3.4.5 Gradient checking
Using the numerical approximation of the derivative, our goal is to check if our calculations implemented in back propagation are correct. To simplify calculations, we will store every entry of our parameters W^{[1]}, b^{[1]}, . . . , W^{[L]}, b^{[L]} in a single large vector denoted θ. Similarly, we will store dW^{[1]}, db^{[1]}, . . . , dW^{[L]}, db^{[L]} in a vector dθ. We can now write our cost function as

J(θ) = J(θ₁, θ₂, θ₃, . . . ).    (3.4.21)

For each element i of our vector θ, we can approximate the derivative with

dθ̃ᵢ = [J(θ₁, θ₂, . . . , θᵢ + ε, . . . ) − J(θ₁, θ₂, . . . , θᵢ − ε, . . . )] / 2ε.    (3.4.22)
If our calculations are correct, then dθ̃ should approximately equal dθ. We can check our error rate E by taking the normalized Euclidean distance between the vectors,

E = ||dθ̃ − dθ||₂ / (||dθ̃||₂ + ||dθ||₂).    (3.4.23)

In practice, if we set ε = 10⁻⁷, then with E < 10⁻⁷ we can generally confirm that our calculations are correct, and with E > 10⁻³ we should double check our calculations. For values in between, it is harder to determine correctness. For debugging, it is best to first check for bugs where the difference between dθ̃ᵢ and dθᵢ is the greatest.
We can run gradient checking with L2 regularization by making sure to include our regularization term in the cost function. However, because of the randomness of dropout regularization, we cannot use gradient checking with dropout turned on.
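A sketch of the full check, assuming J is a function mapping the parameter vector θ (a float NumPy array) to the scalar cost and grad is the dθ vector produced by back propagation:

import numpy as np

def gradient_check(J, theta, grad, eps=1e-7):
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)   # (3.4.22)
    # normalized Euclidean distance, Equation 3.4.23
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))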
3.5 Optimization algorithms
3.5.1 Mini-batch gradient descent
Deep learning works best when there is a lot of data and we can quickly optimize our model. We have previously been using gradient descent on all of our training examples with our vectorized input

X = [ x^{(1)} x^{(2)} . . . x^{(m)} ],    (3.5.1)

where X ∈ R^{n_x×m}, and our vectorized output

Y = [ y^{(1)} y^{(2)} . . . y^{(m)} ],    (3.5.2)

where Y ∈ R^{1×m}. The gradient descent we have previously been using is known as batch gradient descent. If m is very large, then it may take a while to compute each iteration of gradient descent. For now, suppose that our training set is of size m = 5,000,000. Instead of using all m training examples for each iteration of gradient descent, let us break the input into groups of 1000, which we will call mini-batches. For example, the first mini-batch will be
X^{{1}} = [ x^{(1)} x^{(2)} . . . x^{(1000)} ],  Y^{{1}} = [ y^{(1)} y^{(2)} . . . y^{(1000)} ],    (3.5.3)

where we will denote the ith mini-batch with the superscript {i}. In total, we will have 5000
mini-batches. Next, we will make a pass through the training set as described in Algorithm 3. Each pass through the training set using any form of gradient descent is known as an epoch.
With batch gradient descent, our cost will ideally always be decreasing; however, that is not the case with mini-batch gradient descent, because each mini-batch does not reflect the entire training set and our gradient may not step in the perfect direction that minimizes our cost.
Algorithm 3 A single pass through the training set using mini-batch gradient descent.
for t ∈ {1, . . . , 5000} do
    Forward propagate on X^{{t}}
    Compute the cost J^{{t}}
    Backprop to compute the gradients
    Update the weights and biases
end for
Figure 3.5.1: How the cost function could decrease based on the choice of gradient descent: (a) batch gradient descent (cost vs. epoch) and (b) mini-batch gradient descent (cost vs. mini-batch iteration t).
When setting the hyperparameter that represents the size of the mini-batch, there are two extreme cases. First, if we set the mini-batch size equal to m, then we just have our old method of batch gradient descent. If we set the mini-batch size to be 1, then each example is a mini-batch; we call this stochastic gradient descent. With stochastic gradient descent, we cannot use vectorization, and it will not typically converge to a single value; instead, it will jump around near the minimum. In practice, we often choose a mini-batch size in between the two extremes in order to preserve vectorization and make faster progress.
The modern best practice in the field is to use batch gradient descent when m ≤ 2000; otherwise, we typically use a mini-batch size of 2⁶, 2⁷, 2⁸, or 2⁹ in order to store data most efficiently.
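A sketch of partitioning a shuffled training set into mini-batches of size 2⁶ = 64; the names are illustrative:

import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # shuffle the examples (columns)
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        # the last batch may be smaller if batch_size does not divide m
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))
    return batches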
3.5.2 Exponentially weighted averages
Figure 3.5.2: Data set for calculating the exponentially weighted average.
To describe other optimization algorithms that may work better than gradient descent, we first need to introduce the concept of exponentially weighted averages. Consider the dataset in Figure 3.5.2, where

θ = (θ₁, θ₂, . . . , θ₁₀₀)ᵀ = (1124, 1184, . . . , 1546)ᵀ.    (3.5.4)
To calculate the exponentially weighted average, we will start by initializing v₀ = 0 and recursively defining

vₜ = 0.9 vₜ₋₁ + 0.1 θₜ    (3.5.5)

for t ∈ {1, . . . , 100}. After computing all of the vₜ points, we can plot the joined data as shown in red in Figure 3.5.3.
Figure 3.5.3: Exponentially weighted average as a joined plot (red).
In general, we can set

vₜ = β vₜ₋₁ + (1 − β) θₜ,    (3.5.6)

where 0 ≤ β < 1. The choice of β sets the value of vₜ to be approximately the average over the last

1/(1 − β)    (3.5.7)

data points. For example, when we set β = 0.9, each value of vₜ closely mirrored the average of the previous 10 instances. Additional changes to the value of β are shown in Figure 3.5.4. In statistics, the exponentially weighted average may also be referred to as the exponentially weighted moving average.
Figure 3.5.4: Exponentially weighted averages with β = 0.01 (green), β = 0.6 (purple), and β = 0.95 (orange).
To better understand exponentially weighted averages, let us set β = 0.9 and look at the sequence

v₁₀₀ = 0.9 v₉₉ + 0.1 θ₁₀₀
v₉₉ = 0.9 v₉₈ + 0.1 θ₉₉
v₉₈ = 0.9 v₉₇ + 0.1 θ₉₈
⋮    (3.5.8)
Using substitution, we can write

v₁₀₀ = 0.9 (0.9 (0.9 (v₉₇) + 0.1 θ₉₈) + 0.1 θ₉₉) + 0.1 θ₁₀₀    (3.5.9)
     = 0.1 θ₁₀₀ + (0.1 × 0.9) θ₉₉ + (0.1 × 0.9²) θ₉₈ + (0.1 × 0.9³) θ₉₇ + · · ·    (3.5.10)
     = Σ_{t=1}^{100} (0.1 × 0.9^{100−t}) θₜ.    (3.5.11)
Visually, the weight applied to each θₜ term decreases exponentially the further t is from 100, as shown in Figure 3.5.5.

Figure 3.5.5: v₁₀₀ is the sum of the element-wise products between the exponentially decaying scale factors and the data.
The time constant τ is the time it takes for an exponentially decaying plot to reach 1/e of its maximum. From our decay equation, we have

β^{t−τ} = 1/e.    (3.5.12)

Solving for τ, we find the time constant to be

τ = t + log_β(e),    (3.5.13)

where |log_β(e)| represents the number of data points we go back in order for the scale factor to fall to 1/e of its value at vₜ. The approximation 1/(1 − β) in Equation 3.5.7 is a simplified version of this quantity and is often used because it gives a close enough answer to the true time constant in our domain for β. The differences between |log_β(e)| and 1/(1 − β) are shown in Figure 3.5.6.

Figure 3.5.6: Approximation of the time constant τ = 1/(1 − β) (red) and the true time constant τ = |log_β(e)| (blue).
We generally choose to use this definition of a weighted average instead of truly calculating
the average in a particular window because it is easier to compute and requires less memory
storage.
Figure 3.5.7: Data for bias correction analysis, where the red line is the exponentially weighted average with β = 0.5.
3.5.3 Bias correction for exponentially weighted averages
Notice that at the beginning of the plot in Figure 3.5.7, the weighted average starts off with small values and grows relatively slowly. Let us denote ε = 1 − β. Because of how we defined our exponentially weighted average, each plot will start with

v₀ = 0
v₁ = ε θ₁
v₂ = βε θ₁ + ε θ₂
v₃ = β²ε θ₁ + βε θ₂ + ε θ₃.    (3.5.14)
We can more easily interpret this data by plugging in a few values of β and analyzing the results, which is done in Table 3.2.

β     0.1                              0.5                              0.9
v₁    0.9 θ₁                           0.5 θ₁                           0.1 θ₁
v₂    0.09 θ₁ + 0.9 θ₂                 0.25 θ₁ + 0.5 θ₂                 0.09 θ₁ + 0.1 θ₂
v₃    0.009 θ₁ + 0.09 θ₂ + 0.9 θ₃      0.125 θ₁ + 0.25 θ₂ + 0.5 θ₃      0.081 θ₁ + 0.09 θ₂ + 0.1 θ₃

Table 3.2: Calculations of the first three vₜ terms.
Notice that when β is large, vₜ starts off growing the slowest. If we set each value of θₙ to 1, then we can get the total amount by which we are scaling our input. We would expect the scaling factor to be 1 in the general case, because when averaging n numbers, if each number is 1, then the average is 1. However, when β = 0.9 our total scaling factor for v₁ is 0.1, for v₂ is 0.19, and for v₃ is 0.271, which are all much lower than 1.
If we instead define our recursive vₜ equation as

vₜ = [β vₜ₋₁ + (1 − β) θₜ] / (1 − βᵗ),    (3.5.15)

then we will scale up the first few instances of the data without affecting the later entries, as shown in Figure 3.5.8. The denominator of Equation 3.5.15 is known as the bias correction term for exponentially weighted averages.
In practice, most people do not actually calculate the bias correction because it does not make much of a difference on large datasets.
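A sketch of computing the exponentially weighted average of a sequence; here the bias correction is applied to the reported value while the recursion itself stays uncorrected, which is a common way to implement Equation 3.5.15:

import numpy as np

def ewa(thetas, beta=0.9, bias_correction=True):
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta              # (3.5.6)
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)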
3.5.4 Gradient descent with momentum
In gradient descent, when we are far from an optimal point in our cost function, the gradients will be larger numbers. We are typically far at the beginning, taking large steps but often not in a great direction. Figure 3.5.9 illustrates a potential path through the contour plot. Ideally, we would like to have a larger learning rate in the horizontal direction and a smaller learning rate in the vertical direction, in order to step more directly towards the minimum. Looking at the previous iterations of the direction of gradient descent, we can get a pretty good feeling for the direction of the minimum by taking the weighted average. In our figure, the vertical weighted average would be near 0, while the horizontal weighted average would be in the direction towards the center. Using the exponentially weighted average with gradient descent is referred to as gradient descent with momentum. Gradient descent with momentum has two hyperparameters, α and β.

Figure 3.5.8: Weighted average of our data with bias correction (purple) and without bias correction (red).

Figure 3.5.9: Contour plot of the cost function with the minimum in red and our path from gradient descent in blue.
Algorithm 4 Gradient descent with momentum
v_dW, v_db = 0
repeat
    Compute dW, db with respect to J
    v_dW = β v_dW + (1 − β) dW
    v_db = β v_db + (1 − β) db
    W = W − α v_dW
    b = b − α v_db
until the cost function J converges
Consider how a ball rolling down the sides of a bowl will find the minimum by taking a step in the direction of the gradient. Here, the equation

v_dW = β v_dW + (1 − β) dW    (3.5.16)

has a loose analog to the physical system of a ball rolling down the sides of a bowl, where dW represents acceleration, v_dW represents velocity, and β represents friction.
Finally, some people may change the exponentially weighted average term to be

v_dW = β v_dW + dW,    (3.5.17)

where they omit the (1 − β) factor in hopes of avoiding overfitting; however, this would require us to make less intuitive changes to our learning rate α in order for the cost function to converge quickly.
3.5.5 RMSprop
From Figure 3.5.10, notice that it would be more beneficial to move further in the direction with less oscillation than it would be to continue taking large steps vertically.

Figure 3.5.10: Plotting the gradient in each direction, where we denote the vertical axis as the bias b and the horizontal axis as the weight w.

One way to
determine the amount of oscillation would be to calculate

S_dW = β₂ S_dW + (1 − β₂) dW²
S_db = β₂ S_db + (1 − β₂) db²,    (3.5.18)

where the square operation is element-wise and β₂ is a new hyperparameter we are introducing that works similarly to β.² Squaring the rate of change will make the term positive, rewarding small values while making large values more costly. In our example, the changes in the horizontal direction, which correspond to the changes in the weights, are relatively small, so S_dW will be relatively small. In contrast, S_db will be huge because the vertical changes are massive. Now, if we update the weights and bias with
w = w − α dW / (√S_dW + ε)
b = b − α db / (√S_db + ε),    (3.5.19)

we will end up moving more in the direction that does not oscillate and less in the direction that oscillates a lot. Here, we include ε in order to avoid a division by zero error, where ε is commonly set to a small value such as ε = 10⁻⁸. The algorithm described is known as root mean square prop (RMSprop) and is detailed in Algorithm 5.
Algorithm 5 RMSprop
S_dW, S_db = 0
repeat
    Compute dW, db with respect to J
    S_dW = β₂ S_dW + (1 − β₂) dW²
    S_db = β₂ S_db + (1 − β₂) db²
    w = w − α dW / (√S_dW + ε)
    b = b − α db / (√S_db + ε)
until the cost function J converges
²We introduce the new hyperparameter β₂ primarily to prepare for the upcoming section, which combines RMSprop and momentum.
3.5.6 Adam optimization algorithm
The Adam optimization algorithm combines both RMSprop and momentum and has been shown to work well across plenty of optimization problems. The complete algorithm is shown in Algorithm 6. Although we did not use bias correction with momentum or RMSprop, it is most common to use bias correction with the Adam algorithm.
Algorithm 6 Adam optimization
S_dW, S_db, v_dW, v_db = 0
repeat
    t = t + 1
    Compute dW, db with respect to J
    v_dW = β₁ v_dW + (1 − β₁) dW,  v_db = β₁ v_db + (1 − β₁) db
    S_dW = β₂ S_dW + (1 − β₂) dW²,  S_db = β₂ S_db + (1 − β₂) db²
    v_dW := v_dW / (1 − β₁ᵗ),  v_db := v_db / (1 − β₁ᵗ)
    S_dW := S_dW / (1 − β₂ᵗ),  S_db := S_db / (1 − β₂ᵗ)
    w = w − α v_dW / (√S_dW + ε)
    b = b − α v_db / (√S_db + ε)
until the cost function J converges
From the Adam optimization paper (Kingma et al.), the default values for the hyperparameters are

β₁ = 0.9    (3.5.20)
β₂ = 0.999    (3.5.21)
ε = 10⁻⁸.    (3.5.22)

The learning rate α does not have a default value and still needs to be tuned. Adam stands for adaptive moment estimation, and the terms β₁ and β₂ control the first and second moments, respectively.
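A sketch of one Adam update for a single parameter matrix, following Algorithm 6 with the default hyperparameters above (α must still be passed in and tuned):

import numpy as np

def adam_step(W, dW, v, S, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)          # bias-corrected first moment
    S_hat = S / (1 - beta2 ** t)          # bias-corrected second moment
    W = W - alpha * v_hat / (np.sqrt(S_hat) + eps)
    return W, v, S                        # v and S carry over to the next step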
3.5.7 Learning rate decay
Figure 3.5.11: Gradient descent may take large steps when near the minimum, bouncing around and never quite converging.

Learning rate decay refers to reducing the size of the learning rate as the number of iterations increases. Consider the example in Figure 3.5.11, where the steps, once we are near the center, appear to never quite reach the minimum and stay there. Instead, if we took smaller steps when closer to the minimum, we could more easily determine the values gradient descent is converging towards. One way to make α decay is by setting

α = α₀ / (1 + decay_rate · epoch_num),    (3.5.23)

where α₀ is the initial learning rate, decay_rate is a hyperparameter that we must set, and epoch_num is the number of times we have iterated through the entire training set. A few other ways to decay α include exponential decay

α = α₀ · 0.95^{epoch_num},    (3.5.24)

or

α = k α₀ / √(epoch_num).    (3.5.25)

Occasionally, people may use a discrete staircase to change the learning rate, in which case the value changes after a set number of iterations, or we can go in and manually change the decay rate.
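A sketch of the three decay schedules above; epoch_num is assumed to start at 1 for the square root schedule:

import numpy as np

def decayed_lr(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)       # (3.5.23)

def exponential_lr(alpha0, epoch_num):
    return alpha0 * 0.95 ** epoch_num                  # (3.5.24)

def sqrt_lr(alpha0, k, epoch_num):
    return k * alpha0 / np.sqrt(epoch_num)             # (3.5.25)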
3.5.8 The problem with local optima
Back in the day, deep learning practitioners feared that their algorithms would find some local minimum and get trapped inside, due to the gradient being 0 at a local optimum. In higher dimensional space, however, it is quite difficult for local optima to form, because it would require every single direction to have a minimum at a single point. In practice, we are more likely to see saddle points and plateaus. Plateaus tend to be the biggest problem, because they will slow down the rate of gradient descent and could possibly fool us into believing that we have found a global minimum if we stop running gradient descent early.
3.6 Hyperparameter Tuning
3.6.1 Tuning process
With deep neural networks, there are quite a few hyperparameters that need to be tuned, including the learning rate, momentum term, number of layers, number of hidden units, learning rate decay, and mini-batch size. The learning rate tends to be the hyperparameter that can vary the most and is often the one that is hardest to tune. The next set of hyperparameters that can make a significant difference includes the momentum term, number of hidden units, and mini-batch size. The hyperparameters that represent the number of layers and the learning rate decay may make a difference, but not to the degree of the rest of the hyperparameters mentioned.
When testing out choices for hyperparameters, it is often best to start with a reasonable range of values that could potentially work for each hyperparameter. Then, rather than changing one hyperparameter at a time in a grid-like way, it is best to randomly choose values in the range for the hyperparameter to take on, which will allow us to test a lot more hyperparameter values. For example, in Figure 3.6.1, randomly choosing hyperparameter 1 each time allows us to test 25 different values, while a grid choice would only allow for 5 different values we could test.
If we find that a section of the choices for our hyperparameters is producing good results, we can often zoom into that section and focus more on the hyperparameters there, narrowing down the range of values. This is known as coarse to fine and is shown in Figure 3.6.2.
3.6.2 Using an appropriate scale to pick hyperparameters
Randomly testing hyperparameter values tends to work out well for things like the number of hidden units in each layer and the number of total layers. In each of these cases, we may only be choosing from around 50 different integer values. For the learning rate, however, the range might be from 0.0001 to 1; if we randomly choose values uniformly, then 90% of the values we choose will be between 0.1 and 1. Instead of using a linear scale, we may use a log scale to divide up the values.
To choose α from the logarithmic scale in Figure 3.6.3, we could first choose a random variable r between −4 and 0, and then take

α = 10ʳ.    (3.6.1)
Figure 3.6.1: Ways we can go about choosing different hyperparameter values: (a) grid choice of hyperparameters versus (b) random choice of hyperparameter values.
Figure 3.6.2: Coarse to fine, where we zoom into the region of hyperparameter choices that has produced the most promising results.
Another problem that may arise is sampling the values of β, which are often between 0.9 and 0.999. If we instead look to find values of 1 − β, we will be searching for values between 0.1 and 0.001. Now, using the logarithmic scale just introduced, we can pick a random variable r between −3 and −1 and set

β = 1 − 10ʳ.    (3.6.2)
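A sketch of sampling α and β on these logarithmic scales:

import numpy as np

rng = np.random.default_rng()
r = -4 * rng.random()        # r uniform in [-4, 0]
alpha = 10 ** r              # alpha log-uniform in [1e-4, 1]      (3.6.1)

r = -2 * rng.random() - 1    # r uniform in [-3, -1]
beta = 1 - 10 ** r           # beta log-uniform in [0.9, 0.999]    (3.6.2)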
3.6.3 Hyperparameter tuning in practice: pandas vs caviar
For models that we do not have immense GPU power to run on, we may only be able to train a single model over a certain amount of time. During this training period, we could proactively adjust the hyperparameters as the model is training, attempting to improve the process at each change. Here, we are taking a careful approach to make sure we get everything right on this model as best as we can, since we will only be testing a single model at a time. This is known as babysitting one model and is considered the pandas approach, because pandas take careful care of their children. In contrast, if we have multiple GPUs to run our model across, we may run a bunch of tests in parallel and use the hyperparameters that give us the best results, which is known as the caviar approach.

Figure 3.6.3: The bin sizes for random variables on a linear scale (top) versus a logarithmic scale (bottom).
3.7 Batch normalization
3.7.1 Normalizing activations in a network
Previously, we have been normalizing our inputs a^{[0]} to more easily train w^{[1]} and b^{[1]}. Batch normalization takes this a step further and works to normalize a^{[ℓ]} in order to better train w^{[ℓ+1]} and b^{[ℓ+1]}. In practice, people are more likely to normalize z^{[ℓ]} instead of a^{[ℓ]}, so that is what we will be focusing on.
To calculate batch normalization on layer ℓ, we will have the batch of values {z^{[ℓ](1)}, . . . , z^{[ℓ](m)}}, where
μ = (1/m) Σ_{i=1}^{m} z^{[ℓ](i)}    (3.7.1)

and

σ² = (1/m) Σ_{i=1}^{m} (z^{[ℓ](i)} − μ)².    (3.7.2)
Just in case σ² turns out to be 0, we will introduce a hyperparameter ε and set the normalization of z to be

z^{[ℓ](i)}_norm = (z^{[ℓ](i)} − μ) / √(σ² + ε),    (3.7.3)

where ε is some small number like 10⁻⁸. We may not always want the mean to be 0 and the variance to be 1, so instead we will set

z̃^{[ℓ](i)} = γ z^{[ℓ](i)}_norm + β,    (3.7.4)

where both the new standard deviation γ and the new mean β can be learned by the model and are updated after each gradient descent iteration.
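A sketch of the batch normalization forward step for one layer's pre-activations Z of shape (n^{[ℓ]}, m):

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)          # (3.7.1)
    sigma2 = np.var(Z, axis=1, keepdims=True)       # (3.7.2)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)       # (3.7.3)
    return gamma * Z_norm + beta                    # (3.7.4)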
3.7.2 Fitting batch norm in a neural network
To add batch normalization to our model, each neuron will now have to calculate the linear activation, then the batch normalization of the linear activation, and then the non-linear activation. For each layer ℓ, the batch normalization step will have the parameters β^{[ℓ]} and γ^{[ℓ]}, as shown in Figure 3.7.1. For the entire neural network, our parameters will be W^{[ℓ]}, b^{[ℓ]}, β^{[ℓ]}, and γ^{[ℓ]} for ℓ ∈ {1, . . . , L}, with our original gradient descent step updating to

W^{[ℓ]} = W^{[ℓ]} − α dW^{[ℓ]}
b^{[ℓ]} = b^{[ℓ]} − α db^{[ℓ]}
β^{[ℓ]} = β^{[ℓ]} − α dβ^{[ℓ]}
γ^{[ℓ]} = γ^{[ℓ]} − α dγ^{[ℓ]}.    (3.7.5)
Figure 3.7.1: With batch normalization we now have three calculations at each node: the linear activation z^{[ℓ]}, the batch normalization z̃^{[ℓ]} (with parameters β^{[ℓ]}, γ^{[ℓ]}), and the non-linear activation a^{[ℓ]}.
In our current process, we have

z^{[ℓ]} = W^{[ℓ]} a^{[ℓ−1]} + b^{[ℓ]},    (3.7.6)

which is then normalized with

z̃^{[ℓ]} = γ^{[ℓ]} [ (z^{[ℓ]} − μ) / √(σ² + ε) ] + β^{[ℓ]}    (3.7.7)
       = γ^{[ℓ]} [ (W^{[ℓ]} a^{[ℓ−1]} + b^{[ℓ]} − μ) / √(σ² + ε) ] + β^{[ℓ]}    (3.7.8)
       = γ^{[ℓ]} [ (W^{[ℓ]} a^{[ℓ−1]} − μ) / √(σ² + ε) ] + β^{[ℓ]} + b^{[ℓ]} [ γ^{[ℓ]} / √(σ² + ε) ].    (3.7.9)
Here, notice that the bias term is not really adding any new information, since the β^{[ℓ]} term can absorb the b^{[ℓ]} term into a single shift parameter. For simplicity, we will choose to remove the term b^{[ℓ]} and keep β^{[ℓ]}, giving us the updated parameters γ^{[ℓ]}, W^{[ℓ]}, and β^{[ℓ]}.
Currently, we have been using batch normalization with the entire training set, but if we were to break the data up into mini-batches, we would go through a similar process using batch normalization on each mini-batch. The complete algorithm for gradient descent with batch normalization is described in Algorithm 7.
Algorithm 7 Gradient descent with batch normalization
for t = 1 to the number of mini-batches do
    Forward propagate on X^{{t}}
    In each hidden layer, replace z^{[ℓ]} with its batch normalized version z̃^{[ℓ]}
    Backward propagate to compute dW^{[ℓ]}, dβ^{[ℓ]}, dγ^{[ℓ]}
    Update the parameters with W^{[ℓ]} = W^{[ℓ]} − α dW^{[ℓ]}, . . .
end for
3.7.3 Batch norm at test time
Recall that for each mini-batch, we calculated

μ = (1/m) Σ_{i=1}^{m} z^{(i)}
σ² = (1/m) Σ_{i=1}^{m} (z^{(i)} − μ)²
z^{(i)}_norm = (z^{(i)} − μ) / √(σ² + ε)
z̃^{(i)} = γ z^{(i)}_norm + β,    (3.7.10)

which relies on having an entire mini-batch of data, which we do not have at test time. In practice, we estimate μ and σ² as the exponentially weighted averages across mini-batches. For example, if we had three mini-batches, then for each layer we would find μ^{{1}[ℓ]}, μ^{{2}[ℓ]}, μ^{{3}[ℓ]} and σ²^{{1}[ℓ]}, σ²^{{2}[ℓ]}, σ²^{{3}[ℓ]} and calculate the exponentially weighted average of both μ and σ². At test time, we would compute

z^{(i)}_norm = (z^{(i)} − μ) / √(σ² + ε),    (3.7.11)

where μ and σ² come from the latest calculation of our exponentially weighted averages.
3.8 Multi-class classification
3.8.1 Softmax regression
Currently, we have only been dealing with binary classification problems, but it would be much more useful to be able to handle multi-class classification.

Figure 3.8.1: Multi-class classification with the labels 0 → cat, 1 → dog, 2 → lion, and 3 → zebra.
We will denote C to be the number of classes, so in Figure 3.8.1, C = 4. The number of classes will correspond to the number of neurons in the last layer of the neural network, so

n^{[L]} = C,    (3.8.1)

where each neuron corresponds to the probability of a class, as shown in Figure 3.8.2. From the laws of probability, we know that the activations in the final layer must sum to 1. We will start off, as usual, calculating the linear activation in the final layer, but we must change the activation function. For our activation function, let us first calculate the element-wise exponentiation of each linear activation

t = e^{z^{[L]}},    (3.8.2)

where t is a temporary variable and t ∈ R^{C×1}. Next, we will normalize the vector t, which gives our activation function

a^{[L]} = e^{z^{[L]}} / Σ_{j=1}^{C} t_j,    (3.8.3)
Figure 3.8.2: Each neuron in the final layer corresponds to the probability of a class: a^{[L]}_1 = P(class = 0 | X), a^{[L]}_2 = P(class = 1 | X), a^{[L]}_3 = P(class = 2 | X), a^{[L]}_4 = P(class = 3 | X).
where a^{[L]} ∈ R^{C×1}. Equation 3.8.3 is again an element-wise function where the ith element is

a^{[L]}_i = t_i / Σ_{j=1}^{C} t_j.    (3.8.4)
As a concrete example, suppose we have

z^{[L]} = (5, 2, −1, 3)ᵀ.    (3.8.5)

We will first exponentiate each element

t = e^{z^{[L]}} = (e⁵, e², e⁻¹, e³)ᵀ ≈ (148.4, 7.4, 0.4, 20.1)ᵀ    (3.8.6)

and calculate

Σ_{j=1}^{4} t_j ≈ 176.3.    (3.8.7)
Next, we will divide each element in t by the sum of all the values in t, which gives us the probability that each class occurs

a^{[L]} = (e⁵/176.3, e²/176.3, e⁻¹/176.3, e³/176.3)ᵀ ≈ (0.842, 0.042, 0.002, 0.114)ᵀ.    (3.8.8)
The activation function we have been using on the final layer of our neural network is known as the softmax activation function. The name softmax is in contrast to a hardmax classifier that would output 1 for the max and 0 for everything else. In our example, the hardmax classifier would output

a^{[L]} = (1, 0, 0, 0)ᵀ.    (3.8.9)
3.8.2 Training a softmax classifier
As an example, suppose our neural network outputs the prediction

ŷ = (0.3, 0.2, 0.1, 0.4)ᵀ    (3.8.10)

where the actual labels are

y = (0, 1, 0, 0)ᵀ.    (3.8.11)
The loss function we will use for a softmax classifier is defined as

L(ŷ, y) = −Σ_{j=1}^{C} y_j log ŷ_j,    (3.8.12)

which leaves only the prediction for the positive label contributing to the loss, so we can simplify our loss to

L(ŷ, y) = −y₂ log ŷ₂ = −log ŷ₂.    (3.8.13)

In order to minimize our loss, we would want to make ŷ₂ as close to 1 as possible. If ŷ₂ is near 0⁺, then we will have a loss approaching ∞. For a multi-class classifier, we will use the same definition of the cost function as before, with

J = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).    (3.8.14)

We can calculate the backward propagation step through a softmax classifier with

∂J/∂z^{[L]} = ŷ − y.    (3.8.15)
Chapter 4
Applied deep learning
4.1 Orthogonalization
We will use the term orthogonalization to refer to changes in one hyperparameter not influencing the rest of the hyperparameters. For instance, in a self-driving car, we would not want changes in the acceleration to have any effect on the steering. In regards to machine learning, we primarily have to fit the training set well, fit the dev set well, fit the test set well, and then maintain the model on real-world examples. If one of these steps is not working well, then we would like to adjust a set of hyperparameters that will have little impact on the other steps.
4.2 Setting up goals
4.2.1 Single number evaluation metric
Using a binary cat classification example, we will say that precision refers to the percentage
of images our model recognized as cats that actually are cats. Recall refers to the percentage
of actual cats that are correctly labeled.
Classifier   Precision   Recall
A            95%         90%
B            98%         85%

Table 4.1: Example of precision and recall for two different classifiers
Suppose our classifiers give us the precision and recall data in Table 4.1, where one classifier
gives us better precision, while the other gives us better recall. With two evaluation metrics,
determining the best classifier is hard to do. Instead, we can combine precision and recall
into a single metric known as the F1 score, where

F1 = 2 / (1/P + 1/R),    (4.2.1)

where P represents precision and R represents recall. This way of combining the two values is
known as the harmonic mean. Now, using the F1 score, we can simply say that classifier
A is the better classifier.
When choosing metrics, we typically first want to define a metric and then have a completely
separate process that deals with optimizing that metric. It is often best to quickly choose an
evaluation metric and then change that metric in the future if there is a significant difference
in how the model performs on the development set versus the test set versus the real world.
Classifier   Precision   Recall   F1 Score
A            95%         90%      92.4%
B            98%         85%      91.0%

Table 4.2: Calculating the F1 score for each classifier
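Equation 4.2.1 is simple enough to check directly; this short Python sketch reproduces the values in Table 4.2:

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Equation 4.2.1)."""
    return 2 / (1 / precision + 1 / recall)

print(f1_score(0.95, 0.90))  # classifier A: ~0.924
print(f1_score(0.98, 0.85))  # classifier B: ~0.910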
4.2.2 Train/dev/test distributions
For our development and test sets, we would like instances to come from the same
distribution as our data. In particular, we would like each of these sets to model the
data we expect to receive in the future as well as possible. The size of each set may
vary. Our test set should be large enough that it gives us high confidence
when our model performs well on it.
4.3 Comparing to human-level performance
Figure 4.3.1: How our model's accuracy could behave over time, relative to human-level performance and Bayes optimal error.
In cases such as image recognition or speech detection, we tend to be able to create models
that rapidly approach human-level performance, but then the rate at which the accuracy
improves slows down. One reason for the plateau after human-level performance is achieved
is that human-level error and Bayes optimal error are not too far apart on tasks like
recognizing images. When our model performs worse than human-level performance, we
still have ways to improve, such as gathering more training data and performing manual
error analysis. Our model will never surpass the Bayes optimal error, which is the best
error rate we could possibly achieve, as shown in Figure 4.3.1. The best achievable accuracy
will not always be 100%, because in scenarios where images are blurred, for example, it may
be impossible to extract information about what is in the image.
4.3.1 Avoidable bias
We will hardly ever actually have a number for Bayes optimal error. Instead, we use human-
level error to approximate Bayes optimal error. From Table 4.3, we have two different tasks
that give us the same training and development error but differ in human error. In task
A, where our training error is quite far from our human error, we likely want to focus on
improving the bias in our model. In contrast, when our training error is close to our human
error, like in task B, it may be best to focus on improving variance in the model. We
say that the difference between our approximation for Bayes optimal error and our training
error is the avoidable bias, and the difference between the training error and development
error is the variance.

Task                 A     B
Human Error          1%    7.5%
Training Error       8%    8%
Development Error    10%   10%

Table 4.3: Example error rates for two tasks with different human error rates.
4.3.2 Understanding human-level performance
Suppose we have the following medical imaging classification error rates:

Typical human: 3%
Typical doctor: 1%
Experienced doctor: 0.7%
Team of experienced doctors: 0.5%

and we want to determine a human-level error. Since human-level error is often used as an
approximation of Bayes optimal error, it is best, in this case, to go with the error rate of
0.5% from the team of experienced doctors. For deployment purposes, however, we may
have an application that just needs to beat a typical doctor, in which case it may be best
to use the typical doctor's error rate of 1% as human-level.
4.3.3 Surpassing human-level performance
If we have the error rates

Team of humans: 0.5%
One human: 1%
Training error: 0.3%
Dev error: 0.4%

where we have already surpassed our estimate for Bayes optimal error, then it is hard to
determine what we should focus on improving next. There are a number of problems that fall
into categories where machine learning far outperforms human-level performance, such as
online advertising, product recommendations, transit time predictions, and loan approvals.
4.4 Error analysis
Suppose that we are working on a cat classifier and we have achieved a 10% error rate, with
a Bayes optimal error rate of 2%. Further, suppose we want to see whether better recognition
of dogs could help improve our cat classifier in cases where it mislabels a dog as a cat. Before
attempting to make our cat recognizer do better with dogs, it is likely best to find out how
many images are mislabeled in our development set and how many of those images are dogs.
If we have around 100 mislabeled images in total and only 5 of those images contain dogs,
then the best error rate we could reach by perfectly recognizing dogs is 9.5%. Instead,
if 50 of the mislabeled images in our development set are dogs, then improving dog
classification would be more worth our time, since the ceiling for our development set error
rate would be 5%.

We can also look for multiple sets of misclassified images in parallel. For example, for our
cat classifier, we may want to check for dogs, great cats (lions, panthers), and blurry
images being mislabeled. To check for multiple categories, it may be best to create a table
or spreadsheet like the one shown in Table 4.4.
Image     Dog   Great cats   Blurry   Comments
1         ✓                           Pitbull
2                            ✓
3               ✓            ✓        Rainy day at zoo
...
Total %   8%    43%          61%

Table 4.4: Using a table to represent our misclassified images
4.4.1 Cleaning up incorrectly labeled data
With a large enough training set, it may not matter much if data is randomly labeled
incorrectly. However, for smaller data sets, or if data is systematically labeled incorrectly, it
may be in our best interest to manually fix the labels. In our development set, we may
update our table to include a column for incorrectly labeled instances, as shown in Table 4.5.
Image     Dog   Great cats   Blurry   Incorrectly labeled   Comments
...
98                                    ✓                     Cat was in background
99              ✓
100                                   ✓                     Cat drawing
Total %   8%    43%          61%      6%

Table 4.5: Adding an incorrectly labeled category to our table of misclassified images
We tend to only make adjustments to the labels in our development set when the changes
could produce significantly better results. For example, if our development set error was
10% and the error due to incorrect labels was 0.6%, then it may be most worthwhile to
focus on the other 9.4% first. If we later get the development set error rate down to 2%, while
still having 0.6% error due to incorrect labels, then it may be worthwhile to adjust the labels
so that they are accurate. If we go in and adjust the development set for errors in the
mislabeled images, then it may also be worthwhile to look through the images our network
classified correctly, since some of those may carry incorrect labels as well.
4.5 Mismatched training and dev set
Figure 4.5.1: Training data (200,000 images) could vary significantly from the data our model views once deployed (10,000 images uploaded from users).
After developing a model, we may test it in the real world and find that the data uploaded
by users varies significantly from the data we trained our model on. Consider the data in
Figure 4.5.1, which shows that we have 200,000 images in our training set and 10,000
images from users, each distributed differently. We also know that we want to
optimize for the uploaded data from the users as best as possible, although 10,000 images
will not work well as an entire training set.
One way we could adjust our sets would be to combine the training and uploaded data into
a single large array, then shuffle the array and break up the data into groups for a training,
development, and testing set. Suppose we set our training set size to be 205,000 and our
development set size and testing set size each to 2,500. If we were to break the data up randomly,
then we would only expect around 119 images in each of the development and test sets to come
from the data uploaded by users.

Instead, since our primary goal is to optimize for the data uploaded by users, we could again use
bin sizes of 205,000 for the training set and 2,500 for both the development and testing sets,
but this time we will make up the development and testing sets entirely from the uploaded
user data. While the data is not split evenly, we tend to prefer this distribution of
data because it allows us to focus more on optimizing the right goals.
4.5.1 Bias and variance with mismatched data distributions
When we distribute the types of data differently between our training set and development
set, it is likely that we may have a model with a low training error and a relatively high
development error, even without having a variance problem. Instead of directly comparing
the results against the development set, we will introduce a training-development set,
which has the same distribution as the training set, although we do not actually use data in
this set to train on. Now, we can look at the training set error rate, the training-development
set error rate, and the development set error rate, where we will determine the variance
based on the difference between the training set error rate and the training-development set
error rate. If we have both a low training set error rate and a low training-development set error
rate while our development set error rate is high, then we say we have a data mismatch
problem.
4.5.2 Addressing data mismatch
With a data mismatch problem, it may first be beneficial to look at where the errors are
coming from in the development set. If we find, for example, that our errors are due to
blurry images, then we may want to get more training examples of blurry images. Instead of
actually going out and getting more data, we could use artificial data synthesis, which
in this case would take crisp images and apply a blurring filter to them.

For another example, if we are trying to create a trigger word detector and we have a bunch
of low-noise voice recordings, but we find that the users' data once our model is deployed
tends to have a lot of noise, then we can artificially add car noise. One thing to keep in
mind here is that if we have a lot more voice recording data than we have car noise data,
adding the same car noise to each audio clip may cause our model to overfit to the car noise
data.
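As a rough illustration of this kind of synthesis (a sketch; the array names and the noise_level parameter are hypothetical), we can mix a randomly chosen segment of a long noise recording into each clean clip, so that no two clips share exactly the same noise:

import numpy as np

def add_car_noise(clean, noise, noise_level=0.1):
    """Mix a randomly chosen slice of a long noise recording into a clip.

    Sampling a different noise segment for every clip helps avoid
    overfitting to a single noise recording."""
    rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(clean))
    segment = noise[start:start + len(clean)]
    return clean + noise_level * segment

# Hypothetical data: 1 second of speech and 10 minutes of car noise at 16 kHz.
speech = np.random.randn(16_000)
car_noise = np.random.randn(16_000 * 600)
noisy = add_car_noise(speech, car_noise)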
4.6 Learning from multiple tasks
4.6.1 Transfer learning
Transfer learning is the process of taking a neural network that has been trained for one task
and using it as a starting point to train for another. For instance, if we are training
an image classification system, we may be able to transfer the low-level implementation
details in a neural network, which ideally look for edges and curves, and use those details
for radiology diagnosis.

When changing the last layer of a neural network, as shown in Figure 4.6.1, we should set
the new weight and bias terms randomly. We are also able to add more layers to a network
if necessary and are not restricted to simply changing out the last layer when using transfer
learning.
Figure 4.6.1: Disconnecting the end of a neural network and transferring each trained weight and bias to another, similar problem.
In our image classification to radiology diagnosis example, we say that the image classifica-
tion data is the pre-training data, while the radiology diagnosis data is the fine-tuning
data.

We will often change the number of layers that are learned in a neural network based on
the amount of fine-tuning data we have available. With more fine-tuning data it would be
beneficial to retrain more layers, while with only a little bit of fine-tuning data we would rely
more heavily on our pre-training data. It is also common to freeze the representations of
earlier layers in a neural network and only learn the disconnected part, or the latter part
of the model. With plenty of data, we may use the pre-trained model as our initialized
representations and retrain the entire network.
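The notes do not include code for this, but as a rough sketch of the workflow, here is one way it could look in Keras (the framework choice, the pre-trained model, and the hypothetical 3-class radiology head are all our assumptions):

import tensorflow as tf

# Load a network pre-trained on ImageNet, dropping its original output layer.
base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained representations

# Attach a new, randomly initialized output layer for the fine-tuning task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# With plenty of fine-tuning data, we could instead unfreeze everything and
# retrain the whole network:
# base.trainable = True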
4.6.2 Multi-task learning
For multi-task learning, suppose that we are building a self-driving car and need to classify
pedestrians, cars, stop signs, and traffic lights. Instead of building four separate neural
networks for these tasks, we can use a single neural network with multi-hot
encoding, where our outputs are labeled with

y^{(i)} = [pedestrian?, car?, stop sign?, traffic light?]^T    (4.6.1)

and the final layer has four output units. If some of our labels do not have all the information
filled in, for instance if it is unknown whether there is a stop sign or traffic light in the image,
then we can have our loss function not sum over those values (see the sketch at the end of this
section). Multi-task learning is different from softmax regression because each image can have
multiple labels.
Multi-task learning works best when the tasks share low-level features, when each task has a
relatively similar amount of data, and when we can use a large enough neural network.
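A minimal NumPy sketch of such a loss (using NaN as a hypothetical marker for missing labels) might look like this:

import numpy as np

def multi_task_loss(y_hat, y):
    """Binary cross-entropy summed over tasks, skipping unlabeled entries.

    Unknown labels are marked with NaN and do not contribute to the loss."""
    known = ~np.isnan(y)
    y, y_hat = y[known], y_hat[known]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = np.array([0.9, 0.2, 0.6, 0.1])     # pedestrian, car, stop sign, light
y     = np.array([1.0, 0.0, np.nan, 0.0])  # stop sign label is unknown
print(multi_task_loss(y_hat, y))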
Chapter 5
Convolutional neural networks
With our current approach to image classification, where we took pixels as our input, we
will quickly find that with larger images or images with different aspect ratios, our neural
network will no longer work well. Consider an image of size 1000 × 1000 × 3, which would make
our input size 3 million. With such a large input, it is too computationally expensive
to run most models, and overfitting is prone to occur. Instead, we will introduce and use
convolutional neural networks in this chapter.
5.1 Edge detection
From the field of image processing, we have long been able to apply filters to images. We
represent different filters as matrices or tensors. Applying a filter to a set of pixels can
give us properties of an image, including where the edges are. When we apply a filter to a
matrix of the same size, the result is the sum of the element-wise product of the filter and the
matrix.

With images, the matrix that represents the pixels is usually larger than the filter. In cases
where our matrix is of a different size than the filter, we can apply a convolution operation
that results in a matrix of all the possible ways to place a complete filter across the image.
It is typical to denote the convolution operator as ∗, which is not to be confused with the
element-wise operator we defined with the same notation earlier.
We will denote filters to be of size f × f and our image matrix of size n × n. After convolving
a filter across our image matrix, we will end up with a matrix of size (n − f + 1) × (n − f + 1).
An example of a convolution operation can be found in Figure 5.1.1.
Figure 5.1.1: A convolution over a matrix takes the sum of the element-wise product between a filter and same-sized regions of the matrix. The value from the convolution between the filter and the upper left-hand corner of the matrix corresponds to the top left entry in the result of the convolution.
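A minimal NumPy sketch of this operation (a plain nested loop, not an optimized implementation; as discussed in Section 5.4, this is technically cross-correlation):

import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution: slide the f x f kernel over the n x n image and
    take the sum of the element-wise product at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
image = np.random.randint(0, 10, (6, 6))
print(conv2d(image, vertical_edge).shape)  # (4, 4), i.e. n - f + 1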
With different filters, we can find different properties of our image. When looking
at results after applying an edge detection filter, gray represents no edge detected, black
represents positive values from our filter, and white represents negative values from our
filter, as shown in Figure 5.1.2. Some of the most common filters used for edge detection
are shown in Figure 5.1.3.
(a) Original image (b) Horizontal edge detector (c) Vertical edge detector
Figure 5.1.2: Resulting images after applying a horizontal and vertical filter
(a) Standard vertical filter:
[ 1  0  -1 ]
[ 1  0  -1 ]
[ 1  0  -1 ]

(b) Sobel vertical filter:
[ 1  0  -1 ]
[ 2  0  -2 ]
[ 1  0  -1 ]

(c) Scharr vertical filter:
[  3  0   -3 ]
[ 10  0  -10 ]
[  3  0   -3 ]

(d) Standard horizontal filter:
[ -1  -1  -1 ]
[  0   0   0 ]
[  1   1   1 ]

(e) Sobel horizontal filter:
[ -1  -2  -1 ]
[  0   0   0 ]
[  1   2   1 ]

(f) Scharr horizontal filter:
[ -3  -10  -3 ]
[  0    0   0 ]
[  3   10   3 ]

Figure 5.1.3: Different types of horizontal and vertical filters that are used for edge detection
5.2 Padding
If we stack multiple convolution operations back to back, then we will end up with a
significantly smaller final output image than we started with. One way to combat the changing
size is to add padding to the image. When we add padding to the image, we
typically add a border of all 0s, as shown in Figure 5.2.1.
Figure 5.2.1: Adding padding to our image can produce an output of the same size as our input.
We add padding uniformly to each side of the image, and the amount of padding we add
is denoted by p. With padding, the output of our convolution will be of size (n + 2p − f +
1) × (n + 2p − f + 1). If we want our input size to be equal to our output size, then we can
solve n + 2p − f + 1 = n for p and find

p = (f − 1)/2.    (5.2.1)
When the output size matches the input size, we call the convolution a same convolution,
and when we do not add padding, we say that we have a valid convolution. It is also most
common to see f as an odd number, because that allows the filter to have a distinct
center pixel.
5.3 Strided convolutions
Figure 5.3.1: An invalid filter placement. We will not use filter placements that run over the side of the image; instead, we will only use placements that fill a complete f × f grid.
With convolutions, we are also able to take stride lengths that are larger than 1. From the
top left corner, when moving right or down, we will move each spot in the filter a length of
s. With strides, we only count the convolutions that lie entirely inside the boundary of our
image with padding, as shown in Figure 5.3.1. Therefore, the size of our output will be

⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋.    (5.3.1)
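Equation 5.3.1 is easy to wrap in a small helper; a sketch:

import math

def conv_output_size(n, f, p=0, s=1):
    """Output dimension of a convolution (Equation 5.3.1)."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # valid convolution: 4
print(conv_output_size(6, 3, p=1))       # same convolution: 6
print(conv_output_size(7, 3, p=0, s=2))  # strided: 3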
5.4 Cross-correlation vs convolution
In image processing, the convolution operator actually mirrors the filter horizontally and
vertically, as shown in Figure 5.4.1. The method we have been using thus far, where we do
not mirror the filter, is known in image processing as cross-correlation. However, it is most
common in computer vision to use the word convolution in place of cross-correlation, which
is what we will continue to do.
[ 1 2 3 ]        [ 9 8 7 ]
[ 4 5 6 ]   →    [ 6 5 4 ]
[ 7 8 9 ]        [ 3 2 1 ]

Figure 5.4.1: With a traditional convolution operation, we first flip the filter horizontally and vertically.
5.5 Convolutions over volume
For colored images, we have three channels (red, green, and blue) and can define a filter for
each channel, as shown in Figure 5.5.1. For channels with a height and width of f, our filter
will have f × f × 3 parameters. To convolve over volume, we apply the front filter only to the red
channel, the second filter only to the green channel, and the third filter only to the blue
channel. Then, on each channel, we carry out the convolution process described earlier and
sum the results across channels, so the stack of filters still produces a single 2D output.
Previously, we saw that each filter may be looking for different things inside of an image.
We can also set up multiple filters, storing the resulting matrices from each convolution
inside of a tensor. We will denote the number of filters as n_c, so our output will
be n_c layers deep, as shown in Figure 5.5.2.
Figure 5.5.1: Convolution of an RGB image.

Figure 5.5.2: Using a convolution with n_c filters produces an output with n_c channels, each channel corresponding to the result of one convolved filter.
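Extending the earlier 2D sketch, a minimal NumPy implementation of convolution over volume with multiple filters might look like this (again a plain loop, not optimized):

import numpy as np

def conv_volume(image, filters):
    """Convolve an n x n x n_c image with a bank of f x f x n_c filters.

    Each filter is applied across all input channels and the results are
    summed, so each filter contributes one 2D slice of the output."""
    n = image.shape[0]
    f, _, _, num_filters = filters.shape
    out = np.zeros((n - f + 1, n - f + 1, num_filters))
    for k in range(num_filters):
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[..., k])
    return out

image = np.random.randn(6, 6, 3)       # RGB image
filters = np.random.randn(3, 3, 3, 4)  # four 3 x 3 x 3 filters
print(conv_volume(image, filters).shape)  # (4, 4, 4)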
5.6 One layer convolution network
For convolutional neural networks, we will treat the resulting matrices created from the
convolution, added to the bias term b, as our linear activation z. Then we will put each
element through a non-linear activation, as shown in Figure 5.6.1. Each index inside of the
filter tensor is treated as a weight.
Figure 5.6.1: One layer of a convolutional neural network: a^{[0]} is convolved with the filters W^{[1]}, the biases b^{[1]} are added to form z^{[1]}, and a ReLU produces a^{[1]}.
For the lth layer of a convolutional neural network:

f^{[l]} = filter size
p^{[l]} = padding
s^{[l]} = stride
n_c^{[l]} = number of filters
n_h^{[l-1]} × n_w^{[l-1]} × n_c^{[l-1]} = size of input (height × width × channels)
n_h^{[l]} = ⌊(n_h^{[l-1]} + 2p^{[l]} − f^{[l]})/s^{[l]} + 1⌋
n_w^{[l]} = ⌊(n_w^{[l-1]} + 2p^{[l]} − f^{[l]})/s^{[l]} + 1⌋
f^{[l]} × f^{[l]} × n_c^{[l-1]} = size of each filter
n_h^{[l]} × n_w^{[l]} × n_c^{[l]} = size of the activations a^{[l]}
f^{[l]} × f^{[l]} × n_c^{[l-1]} × n_c^{[l]} = size of the weights
n_c^{[l]} = size of the bias
5.7 Pooling layers
Pooling layers give us a way to shrink the size of our input while still capturing the essential
details. The primary types of pooling are average pooling and max pooling. Pooling
layers have both a filter size and a stride, although there are no weights to be learned. Instead,
pooling applies a function to each region of the input it focuses on, where each frame is f × f
and the stride length is s. For 2D max pooling, the output is the maximum number inside
of the frame, and for 2D average pooling, the output is the average of all the numbers inside
of the frame.
Input:
[ 5 6 1 2 ]
[ 1 3 2 3 ]
[ 2 9 1 1 ]
[ 1 3 2 1 ]

(a) Max pooling output:
[ 6 3 ]
[ 9 2 ]

(b) Average pooling output:
[ 3.75 2.00 ]
[ 3.75 1.25 ]

Figure 5.7.1: 2D pooling with the hyperparameters f = 2 and s = 2
For 3D pooling, the number of channels in the input will match the number of channels
in the output. Each of the channels works independently: for the ith channel in the
output, we take the pool from the filter over the ith channel of the input.

Max pooling is used most commonly because it generally produces better results. Some
people hypothesize that max pooling performs better because it captures, in each frame,
whether a certain aspect of the image is present, while average pooling can be saturated with
a lot of data.

In deep learning literature, we combine convolution and pooling steps into a single layer
because pooling does not have any representations to be learned.
5.8 Why we use convolutions
As a thought experiment, consider a single layer neural network with an input size of
32 × 32 × 3 and an output size of 28 × 28 × 6. If we were to use a fully-connected neural
network, we would have around 14 million representations we would have to learn, whereas
if we use a convolutional neural network with a filter size of 5 × 5, then we would only need
to learn 156 representations.

With fewer representations, we can take advantage of parameter sharing and sparsity
of connections. Parameter sharing allows us to detect features at any area inside of the
image. Sparsity of connections means that for every output, there will only be a few
numbers that impact it.
5.9 Classic networks
Since choosing the architecture for a neural network is often an empirical task, we can look
at other people's models to make better decisions about our own. The main way to look at
how other people build their models is by reading their research. We will be going over how
LeNet-5, AlexNet, VGG-16, ResNet, and Inception work.
5.9.1 LeNet-5
Figure 5.9.1: LeNet-5 architecture (LeCun et al., 1998): a 32×32 input, alternating convolutions and subsampling producing feature maps of 6@28×28, 6@14×14, 16@10×10, and 16@5×5, followed by fully connected layers of 120 and 84 units and an output of 10 units.
LeNet-5, named after the paper's lead author Yann LeCun, was introduced in 1998 with a
primary goal of recognizing characters. LeNet-5 uses average pooling, which was more widely
used at the time of publication. This model contains around 60 thousand representations,
which is now considered a small amount. Among other things, this architecture popularized
the ideas of shrinking the spatial size while increasing the number of channels as we move
deeper into the network, and of following convolutional layers with pooling layers.
5.9.2 AlexNet
Figure 5.9.2: AlexNet architecture (Krizhevsky et al., 2012).
AlexNet looks quite similar to LeNet-5; however, AlexNet has around 60 million parameters
in total and uses the ReLU activation function. AlexNet sparked a surge of
research toward artificial intelligence when published in 2012, because the results from the
neural network far outperformed state-of-the-art approaches at the time.
5.9.3 VGG-16
Figure 5.9.3: VGG-16 architecture: stacks of convolutions with ReLU followed by max pooling, shrinking 224×224 feature maps down to 7×7×512, followed by fully connected layers with ReLU of 4096 units and a 1000-way softmax.
The VGG-16 architecture was introduced in a 2015 paper by Karen Simonyan and Andrew
Zisserman. In it, all convolutions are same convolutions with a filter size
of 3 × 3 and a stride length of 1. Each max pooling layer has a filter size of 2 × 2 with a
stride length of 2. This architecture is on the larger side of modern neural networks, with
a total number of representations around 138 million, and it sticks out primarily due to its
simple choices for hyperparameters that turn out to work well.
5.9.4 ResNet
A residual network (ResNet) gives us a way to skip connections inside of a neural network.
With a plain network, our path from a^{[ℓ]} to a^{[ℓ+2]} may be what is shown in Figure 5.9.4.
a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → ReLU → a^{[ℓ+2]}

Figure 5.9.4: Plain network
However, with a ResNet, we use what is known as a residual block, which sets

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]}),    (5.9.1)

where adding a^{[ℓ]} is the skip connection. The path of a residual block is shown in Figure 5.9.5.
a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → (+ a^{[ℓ]}) → ReLU → a^{[ℓ+2]}

Figure 5.9.5: Residual block
As shown in Figure 5.9.6, residual networks work well with deep neural networks that have
a lot of layers whereas plain neural networks often have weight decay.
Figure 5.9.6: The error rate for plain networks (left) and ResNets (middle, right) when the
number of layers in the network changes (He et al., 2015).
The reasoning why ResNets work with more layers can be shown by expanding the right-
hand side of Equation 5.9.1:

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]})    (5.9.2)
         = g(W^{[ℓ+2]} a^{[ℓ+1]} + b^{[ℓ+2]} + a^{[ℓ]}).    (5.9.3)

Notice that when weight decay occurs in plain neural networks, where W^{[ℓ+2]} → 0, and we
are using a ReLU activation function, we will come close to having a^{[ℓ+2]} = g(a^{[ℓ]}) = a^{[ℓ]}.
Therefore, adding more layers to a ResNet may have no effect on the activation if weight
decay occurs, or the weights could add additional information that could decrease the error
rate.
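A minimal NumPy sketch of a fully-connected residual block (an illustration of Equation 5.9.1, not the convolutional blocks used in the actual paper) shows why a block with decayed weights passes its input through unchanged:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two linear + ReLU steps with a skip connection (Equation 5.9.1)."""
    a_l1 = relu(W1 @ a_l + b1)   # a^[l+1]
    z_l2 = W2 @ a_l1 + b2        # z^[l+2]
    return relu(z_l2 + a_l)      # a^[l+2] = g(z^[l+2] + a^[l])

a = relu(np.random.randn(4))                    # non-negative input activation
W1, b1 = np.random.randn(4, 4), np.zeros(4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)          # simulate decayed weights
print(np.allclose(residual_block(a, W1, b1, W2, b2), a))  # True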
Figure 5.9.7: How a 34-layer ResNet compares to a 34-layer plain network and VGG-19, which is a modified version of VGG-16 (He et al., 2015).
5.9.5 1 × 1 convolution
While a 1 × 1 convolution would simply scale each value in a 2D input, with 3D inputs
we can do much more. For a 1 × 1 convolution, the depth of the filter is equal to the
depth of the input. The output at each pixel position comes from the dot product between
the filter and the input values across all channels at that position. Also, our output will
always have the same height and width as our input. With multiple filters, we add depth
to the output, as shown in Figure 5.9.8. 1 × 1 convolutions may also be referred to as a
network in network (Lin et al., 2013).
Figure 5.9.8: 1 × 1 convolution
5.9.6 Inception network
The primary goal of the inception network is to allow a model to learn the best hy-
perparameter choices for each layer of a network. Instead of manually choosing a filter size,
the model uses three different filter sizes: 1 × 1, 3 × 3, and 5 × 5; in addition, each layer
can also learn max pooling. To pick the best hyperparameters, the model concatenates the
results from convolving with the different filter sizes and from max pooling into a single output,
as shown in Figure 5.9.9a.
When we go deeper into the network, we often run into a large number of channels with
relatively small spatial dimensions. Now consider, for example, a layer whose input is of size
28 × 28 × 192 with a same convolution that has a 5 × 5 × 192 filter size and 32 filters. In
this scenario, our output will be of size 28 × 28 × 32, and for each output value we will need
5 × 5 × 192 multiplications, which equates to around 120 million in total. Even with
modern computers, 120 million multiplications take a substantial amount of time.

Instead, if we first pass the input of size 28 × 28 × 192 into a 1 × 1 × 192 convolution with
16 filters and then pass that result into a 5 × 5 × 16 same convolution with 32 filters, we
end up with a fairly similar result at a far smaller computational cost. In total, we
would have around 12 million multiplications using the dimension reduction method
just proposed, which is shown in Figure 5.9.9b. Also, notice in the dimension reduction
figure that we pass max pooling into a 1 × 1 convolution. This is because a max pooling
filter cannot shrink the number of channels. Rather, we shrink the number of channels with
a 1 × 1 convolution, where the number of filters corresponds to the output channel size.
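These multiplication counts are straightforward to verify directly:

# Direct 5x5 same convolution on a 28 x 28 x 192 input with 32 filters:
direct = 28 * 28 * 32 * (5 * 5 * 192)
print(f"{direct:,}")       # 120,422,400  (~120 million)

# 1x1 bottleneck down to 16 channels, then a 5x5 convolution to 32 channels:
bottleneck = 28 * 28 * 16 * (1 * 1 * 192) + 28 * 28 * 32 * (5 * 5 * 16)
print(f"{bottleneck:,}")   # 12,443,648   (~12 million, roughly a 10x saving)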
Figure 5.9.9: Inception layer (Szegedy et al., 2015). (a) Naive version: 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling applied to the previous layer, with their outputs concatenated. (b) Dimension reduction: 1×1 convolutions placed before the 3×3 and 5×5 convolutions and after the max pooling.
The entire original inception network is shown in Figure 5.9.10 and was named GoogLeNet
after the popular LeNet architecture.
Figure 5.9.10: GoogLeNet architecture (Szegedy et al., 2015), built from stacked inception modules, with two auxiliary softmax outputs partway through the network and one at the end.
Figure 5.9.11: Meme cited in the original inception paper as the reasoning for the name inception network.

As a fun fact, the inception paper cites the popular meme in Figure 5.9.11, which comes from
the movie Inception, as the reasoning for the network's name.
5.10 Competitions and benchmarks
To beat state-of-the-art models on a specific benchmark in competitions, a few
methods are employed that are rarely used in practice, including ensembling and multi-crop at
test time. Ensembling is when we run each instance of the testing set through a range of
models, usually around 3-15, and use the prediction that either occurs most often (classi-
fication) or the average prediction (regression). Ensembling often performs better, but
it introduces a significant computational cost. The other method employed at
test time is multi-crop, which averages the output from different crops of the image. One
common way to multi-crop an image is known as 10-crop and is shown in Figure 5.10.1.

Figure 5.10.1: 10-crop includes mirroring an image and taking different croppings of both the image and its mirroring.
Chapter 6
Detection algorithms
6.1 Object localization
Figure 6.1.1: Classification vs. localization vs. detection: (a) image classification, (b) classification with localization, (c) detection.
Previously, we have only been working on image classification, which assumes only one
object is present in an image and determines what that object is. Next, we will be going into
classification with localization, which also determines where the object is located in the image,
as shown in Figure 6.1.1b. We will also be going over object detection, which allows us to
classify multiple occurrences of the same or different classes in a single image, as shown
in Figure 6.1.1c.
With classification with localization, we are going to assume there is only one object in every
image. As an example, we are going to walk through how a self-driving car could localize a
pedestrian (class 1), car (class 2), motorcycle (class 3), or, if no object exists, a background
(class 4). We can then change the output of the neural network from a softmax over 4 categories
to also include a bounding box with a center point, height, and width. The center of the bounding
box will be denoted as (b_x, b_y), and the height and width will be b_h and b_w, respectively.
Currently, we will be using square images and will label the top left corner as (0, 0) and the
bottom right corner as (1, 1). The entire labeling of our image is shown in Figure 6.1.2.
Figure 6.1.2: Our labeling for localization: the image spans (0, 0) in the top left to (1, 1) in the bottom right, with a bounding box of height b_h and width b_w centered at (b_x, b_y).
We will also need to adjust our supervised learning labels to match the output of our model.
We will define the output as

y = [P_c, b_x, b_y, b_h, b_w, pedestrian?, car?, motorcycle?]^T,    (6.1.1)

where P_c is the probability that an object of class 1, 2, or 3 is present. A few labeled examples
are shown in Figure 6.1.3.
(a) Car label:
y = [1, 0.5, 0.7, 0.2, 0.5, 0, 1, 0]^T

(b) Background label:
y = [0, None, None, None, None, None, None, None]^T

Figure 6.1.3: Our labels for different images
We will also define our loss function to be a squared error loss function

L(ŷ, y) = \sum_{i=1}^{8} (ŷ_i − y_i)^2   if y_1 = 1
L(ŷ, y) = (ŷ_1 − y_1)^2                  if y_1 = 0,    (6.1.2)

where the 8 comes from the size of our label.
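A minimal NumPy sketch of this loss (an illustration; the prediction vector here is hypothetical):

import numpy as np

def localization_loss(y_hat, y):
    """Squared error loss from Equation 6.1.2.

    If an object is present (y[0] == 1), all 8 components are penalized;
    otherwise only the object-presence term matters."""
    if y[0] == 1:
        return np.sum((y_hat - y) ** 2)
    return (y_hat[0] - y[0]) ** 2

y     = np.array([1, 0.5, 0.7, 0.2, 0.5, 0, 1, 0])          # car label
y_hat = np.array([0.9, 0.4, 0.7, 0.3, 0.5, 0.1, 0.8, 0.1])  # hypothetical output
print(localization_loss(y_hat, y))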
6.2 Landmark detection
Landmark detection works to find specific points of interest in an image. Recently, landmark
detection has been used to determine human emotions, determine the orientation of a moving
body, and apply Snapchat filters. In order to create a Snapchat face filter, like the one shown in
Figure 6.2.1c, we can train a neural network to determine specific landmark locations inside
of an image. To determine landmark locations on a face, we first need to classify a face and
then label each landmark's location in our training data, as shown in Figure 6.2.2, where
each label with n landmarks would be in the form
y = [face?, ℓ_{1x}, ℓ_{1y}, ..., ℓ_{nx}, ℓ_{ny}]^T,    (6.2.1)
where (ℓ_{ix}, ℓ_{iy}) is the xy-coordinate of the ith landmark. We also must make sure that
we always label the ith landmark consistently in each image. For example, if the first
landmark is the leftmost part of somebody's left eye in one image, then it also must be the
leftmost part of a person's left eye in every other image.
Figure 6.2.1: Applications for landmark detection: (a) emotion classification (e.g., happy vs. angry), (b) body tracking, (c) Snapchat face filters.
6.3 Object detection
Object detection focuses on locating multiple objects in a given image. A naive approach to
object detection is the sliding windows detection algorithm. We will start with
a training set of tightly cropped images, as shown in Figure 6.3.1.
Figure 6.2.2: Adding a specific numbered landmark to each part of the image we are interested in.
Figure 6.3.1: Sliding windows detection training set: tightly cropped images labeled 1 if they contain the object and 0 otherwise.
Then, we will create a window and slide it throughout our image. At each location where
the window is on the image, only the pixels that the window is covering will be passed into
the convolutional neural network, in hopes that some window will perfectly identify the location
of objects. We will then change the size of the window and repeat the sliding process, as
shown in Figure 6.3.2.
With the sliding windows detection algorithm just presented, we will run into a high computational
cost in order to have an accurate image detector. Further, improving the algorithm
by making the stride length shorter and testing more window sizes will increase the cost.
6.4 Sliding windows with convolution
To improve the computational cost of our sliding windows detection algorithm, let us try
a convolutional implementation where the output corresponds to sliding the window
across each position. Consider taking a 3 × 3 matrix and convolving it twice with a 2 × 2
filter, which will produce a 1 × 1 output, as shown in Figure 6.4.1a. Now, consider if we were
to add a row and column to the end of our input matrix, increasing it to a size of 4 × 4, and
then convolve it twice with a 2 × 2 filter. What we end up with is a 2 × 2 output matrix
where the upper left entry is equivalent to the two convolutions from the 3 × 3 input matrix,
as shown in Figure 6.4.1b. Similarly, the top right entry will be the two convolutions of
the 3 × 3 input matrix without the first column and without the last row.

Using the method just introduced, we have a way to create a sliding window with a convolution.
Taking this method a step further, we can also adjust the stride by adding a max
pooling layer, where the filter size of the max pooling layer determines the stride.
Figure 6.3.2: The sliding windows detection algorithm chooses different window sizes and slides each window across the entire image. At each slide, the area that the window is covering is fed into a convolutional neural network in order to try and classify the image.
Figure 6.4.1: Applying the same model (two convolutions with f = 2) to inputs of different size: (a) a 3 × 3 input, (b) a 4 × 4 input. The dark portion in Figure 6.4.1b is equivalent to the model in Figure 6.4.1a.
Further, if we want to learn a more complex function, we can add 1 × 1 convolutions to
the end of our network that act like fully connected layers, as shown in Figure 6.4.2.
Figure 6.4.2: Convolutional neural network (convolutions with ReLU, max pooling with f = stride, and 1 × 1 convolutions acting like fully connected layers, ending in a softmax) that determines whether a window image is one of four categories: pedestrian, car, motorcycle, or background. The output size is #(vertical convolutions) × #(horizontal convolutions) × #(classes).
6.5 Bounding box predictions
y = [1, 0.1, 0.5, 1, 1.3, 0, 1, 0]^T        y = [1, 0.4, 0.9, 1.9, 1.2, 0, 1, 0]^T

Figure 6.5.1: Using bounding boxes on an image containing cars, with the corresponding labels. We are using the labeling set in Equation 6.1.1.
Currently, we have only seen guess-and-check algorithms in terms of choosing a filter and
stride size. Instead, it would be nice to only have one pass through a convolutional neural
network that performs object detection. Let us consider breaking the image into a grid
and treating each grid cell as its own image that we have to label. If an object we are
trying to identify has its center point in a grid cell, then we will label P_c in that cell as 1.
Grid cells that do not contain any objects, or contain only non-center points of objects, will
have P_c as 0 with None in all of the other values. When labeling images relative to a grid
cell's coordinates, notice that our constraints are 0 ≤ b_x, b_y ≤ 1 and b_h, b_w ≥ 0. With a 3 × 3
grid, the output will need to be 3 × 3 × 8, because each grid cell has its own label with 8
entries. Therefore, we can construct a convolutional neural network, like the one shown in
Figure 6.4.2, that takes our image as input and outputs a 3 × 3 × 8 tensor. The way we have
set up bounding box prediction is known as the YOLO (You Only Look Once) algorithm, and
we will continue discussing it after introducing a few more ideas that play a key role.
6.6 Intersection over Union (IoU)
One common way to assess whether a bounding box prediction is accurate is to take the inter-
section of the labeled bounding box and the predicted bounding box and divide it by the
union of the labeled and predicted bounding boxes, known as IoU. Typically, if IoU ≥ 0.5,
we say that our model's bounding box is correct; however, 0.5 is just an arbitrary choice
and is sometimes changed.
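A minimal Python sketch of IoU for axis-aligned boxes (the (x1, y1, x2, y2) corner format is our assumption; the notes do not fix a box representation):

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    with (x1, y1) the upper-left corner and (x2, y2) the lower-right."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ~= 0.143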
6.7 Non-max suppression
When using smaller grid sizes for bounding box prediction, it is more likely that neighboring
boxes will predict references to the same object, as shown in Figure 6.7.1a. To start, we will
first want to remove any boxes that are uncertain of containing a class.
Figure 6.6.1: Taking the intersection over union of two bounding boxes to assess correctness.
Figure 6.7.1: Predicted bounding boxes using a 19 × 19 grid, along with the confidence that a non-background class exists inside each box: (a) pre non-max suppression (confidences 0.9, 0.8, 0.7, 0.8, 0.95, 0.45), (b) post non-max suppression (0.9 and 0.95 remain).
Typically, we may remove all boxes that have P_c ≤ 0.6. With a way to calculate the similarity
between two bounding boxes (IoU), we can use it to determine whether two bounding boxes are
referring to the same object. Starting with the box that has the highest probability P_c that a
class exists inside of it, we will keep that box and then remove any other box from the same
class prediction that has an IoU ≥ 0.5 with the box in focus. Thus, we are effectively removing
the non-maximum boxes from the prediction. The complete algorithm for non-max suppression
is shown in Algorithm 8.
Algorithm 8 Non-max suppression
Input: boxes, a list of all bounding boxes
Initialize: finalBoxes, an empty list
Remove any box from boxes with P_c ≤ 0.6
while boxes length > 0 do
    maxBox ← box with max P_c from boxes
    Append maxBox to finalBoxes
    Remove maxBox from boxes
    for each box in boxes do
        if IoU(box, maxBox) ≥ 0.5 then
            Remove box from boxes
        end if
    end for
end while
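A direct Python translation of Algorithm 8 (a sketch; the dictionary-based box representation is our assumption, and iou() is the function sketched in Section 6.6):

def non_max_suppression(boxes, p_threshold=0.6, iou_threshold=0.5):
    """Algorithm 8. Each box is a dict with keys 'p_c' and 'coords'."""
    boxes = [b for b in boxes if b["p_c"] > p_threshold]
    final_boxes = []
    while boxes:
        max_box = max(boxes, key=lambda b: b["p_c"])
        final_boxes.append(max_box)
        boxes = [b for b in boxes
                 if b is not max_box
                 and iou(b["coords"], max_box["coords"]) < iou_threshold]
    return final_boxes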
6.8 Anchor boxes
Figure 6.8.1: The midpoints of two objects, a pedestrian and a car, are located in the same grid cell.
With our current implementation for labeling each grid cell, we have no way to label two
separate classes that occur in the same cell. For example, Figure 6.8.1 shows the
center points of a pedestrian and a car occurring in the same center grid cell. We will start off
by defining a set number of different anchor boxes, as shown in Figure 6.8.2. The sizes of the
anchor boxes can be determined by a k-means clustering algorithm that looks for groupings of
box sizes among the labeled objects of each class.
(a) Anchor box 1 (b) Anchor box 2
Figure 6.8.2: Defining multiple distinct anchor boxes
Now, for each label in our training data, we will label it with

y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3,  P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T,    (6.8.1)

where the first eight entries correspond to anchor box 1 and the last eight to anchor box 2.
Then, for each grid cell, we will use the IoU to determine which anchor box is closest to each
bounding box. In Figure 6.8.1, the pedestrian would be assigned to anchor box 1 and the
car would be assigned to anchor box 2. We then fill in the output with the labels from each
bounding box. With our 3 × 3 grid, the output size will be 3 × 3 × 16, or 3 × 3 × (2 × 8),
since there are two anchor boxes and each anchor box has 8 entries. When using two anchor
boxes, our prediction will give us 2 bounding boxes in each grid cell.

We could always define more anchor boxes in practice, although with a larger grid size, it
is less likely for collisions in anchor boxes to occur. If we have more objects that share the
same grid cell than we have anchor boxes, or we have two objects in the same cell whose
bounding boxes have the same dimensions, our algorithm will not perform well.
6.9 YOLO algorithm
The YOLO (You Only Look Once) algorithm was introduced in 2016 by Joseph Redmon,
Santosh Divvala, Ross Girshick, and Ali Farhadi. The algorithm combines bounding box
predictions with anchor boxes and non-max suppression. With only having to move through
the neural network in one pass, YOLO beat the running time of all previous state-of-the-art
object detection algorithms.
Chapter 7
Sequence models
Sequence models allow us to work with variable-length inputs or outputs. This feature
greatly expands the number of problems we are able to tackle. For example,
sequence models are used for speech recognition, where given a variable-length audio clip, we
transcribe the spoken words; music generation, where we can input a set of data and output
a sequence of music; sentiment classification, where we could take in a written review and
output a star rating; DNA sequence analysis, where we take a DNA sequence and label the
part of the sequence corresponding to, say, a protein; machine (or language) translation,
where we are given a sentence in one language and have to convert it into another; video
activity recognition, where we are given a set of frames and want to output the activity; and
named entity recognition, where we are given a set of text and want to find the names within
the text.
As an example, suppose we would like to build a sequence model for named entity recognition
with the input

x : Harry Potter and Hermione Granger invented a new spell.    (7.0.1)

We will denote the tth word of the input as x^{<t>}, starting at index 1. For the output y, we
will use multi-hot encoding, where each word is labeled with a 1 or 0, as follows:

y : 1 1 0 1 1 0 0 0 0.    (7.0.2)
The tth element of the output will similarly be referred to as y^{<t>}. The length of the input
sequence will be denoted by T_x, and the length of the output sequence by T_y. In this
case, both T_x and T_y are equal to 9, although that is not always the case. If we want to
refer to the tth word in the ith element of the training sample, we will write x^{(i)<t>}. Since
the input and output lengths of a sequence can vary, we will use T_x^{(i)} and T_y^{(i)} to denote
the lengths of the ith input and output, respectively.
To represent words, we will need to use a vocabulary list of all the possible words that are likely
to occur. For modern applications, this could be around 50,000 words in total. We will
represent the vocabulary list sorted in ascending order. Then, to store x^{<t>}, we can use
one-hot encoding, where we have a 1 in the index of the vocabulary list that stores the word
we are looking at, and a 0 everywhere else. We will talk about unknown words, or words
that are not in our vocabulary list, soon.
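A minimal NumPy sketch of one-hot encoding over a tiny, hypothetical vocabulary list (with an unknown-word token):

import numpy as np

vocab = ["<UNK>", "a", "and", "harry", "hermione", "invented",
         "new", "potter", "spell"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Encode a word as a vector with a 1 at its vocabulary index."""
    v = np.zeros(len(vocab))
    v[index.get(word.lower(), index["<UNK>"])] = 1
    return v

print(one_hot("Harry"))    # 1 at index 3
print(one_hot("granger"))  # falls back to the <UNK> token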
7.1 Recurrent neural networks
The problem with using a standard neural network is that the input and output
lengths can differ across samples of our data. If we tried to set a maximum
length as the input, with 0 for each element that is not necessary, then we would not only
greatly diminish the number of problems that we can focus on, but we would also be using
a bunch of unnecessary memory in the network. A plain neural network will also not
take advantage of shared features of sequential data, similar to how convolutional neural
networks took advantage of using the same filters over different regions of the image.
With recurrent neural networks, we build a model that takes as input the current time step
t and all of the previous time steps in order to compute the output y^{<t>}. At each time step,
the model is parameterized by shared weights W_{ax}, W_{aa}, and W_{ya}, as shown in Figure 7.1.1.
A big problem with this type of representation is that we are only looking at the previous
time steps. Suppose we have the following sentences as input:

He said, "Teddy Roosevelt was a great President."
He said, "Teddy bears are on sale!"    (7.1.1)

In both of these cases, the first three words are equivalent; however, only in the first case
does the word "Teddy" represent a name. We will delve into this topic soon when we look
at bidirectional RNNs (BRNNs); however, to understand the basics of RNNs, we will stick
with our unidirectional RNN.
For now, we will set a^{<0>} = 0, which is typical in many cases. Now, to represent the first
activations and output in the recurrent neural network pictured in Figure 7.1.1, we will set

a^{<1>} = g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)    (7.1.2)
ŷ^{<1>} = g(W_{ya} a^{<1>} + b_y).    (7.1.3)

The activation functions in Equation 7.1.2 and Equation 7.1.3 do not have to be the same.
For the hidden activation function, tanh and ReLU are commonly used, while sigmoid or softmax
are typically used for the output activation function. To generalize the previous equations,
we have

a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)    (7.1.4)
ŷ^{<t>} = g(W_{ya} a^{<t>} + b_y).    (7.1.5)
We will further simplify Equation 7.1.4 to

a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a),    (7.1.6)

where W_a is formed by stacking

W_a = [W_{aa} | W_{ax}]    (7.1.7)

into a single matrix. We will also use the notation

[a^{<t-1>}, x^{<t>}] = the column vector formed by stacking a^{<t-1>} on top of x^{<t>}.    (7.1.8)

We will also simplify Equation 7.1.5 to

ŷ^{<t>} = g(W_y a^{<t>} + b_y).    (7.1.9)
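A minimal NumPy sketch of one forward time step (Equations 7.1.4 and 7.1.5), with arbitrary sizes chosen for illustration:

import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN time step: tanh for the hidden activation, softmax output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    z_y = Wya @ a_t + by
    y_hat = np.exp(z_y - z_y.max())
    y_hat /= y_hat.sum()
    return a_t, y_hat

n_a, n_x, n_y = 5, 10, 10  # hidden size and vocabulary size (hypothetical)
Waa, Wax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
Wya = np.random.randn(n_y, n_a)
ba, by = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)            # a^<0> = 0
for x_t in np.eye(n_x)[:3]:  # three one-hot inputs
    a, y_hat = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)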
7.1.1 Backpropagation through time
We will define our loss at time step t as our familiar cross-entropy, or logistic, loss

L^{<t>}(ŷ^{<t>}, y^{<t>}) = -y^{<t>} log ŷ^{<t>} - (1 - y^{<t>}) log(1 - ŷ^{<t>}).    (7.1.10)

For the entire sequence, we will compute the loss as

L(ŷ, y) = \sum_{t=1}^{T_y} L^{<t>}(ŷ^{<t>}, y^{<t>}).    (7.1.11)

Now, we can define the computation graph pictured in Figure 7.1.2. When backpropagating
through this network, we say that we are backpropagating through time, because we are going
backwards in the time series.
Figure 7.1.1: The structure of a basic recurrent neural network, which predicts ŷ^{<t>} based on all of the previous time steps along with the current time step. The parameters of the network (W_{ax}, W_{aa}, W_{ya}, b_a, b_y) are shared across all time steps.
Figure 7.1.2: Computation graph for our recurrent neural network.
7.1.2 Variational RNNs
RNNs can be broken into different categories, as shown in Figure 7.1.3. What we have
currently been working with is a many-to-many model. If we were doing sentiment
classification, where we input a review sentence and output a number of stars, we would use a
many-to-one model. A one-to-many model may consist of music generation, where we could
input some metadata (i.e., genre, style, etc.) and output scripts of music. Many-to-many
models may also consist of examples where the number of inputs and the number of outputs
differ, such as with machine translation.

Figure 7.1.3: Depending on the combination of fixed or variable length input and output, we can form different types of RNNs.
7.1.3 Language modeling
Suppose we would like to build a speech recognition system that inputs an audio clip of
speech and outputs the spoken words. Looking only at the pronunciation of each word is
not enough to determine the correct word that is spoken. For instance, "pair" and "pear";
"two", "to", and "too"; and "their", "they're", and "there" are all pronounced identically,
but are different words altogether. A language model may look at a set of sentences that
sound identical and choose the sentence that is most likely to have occurred. For example,
suppose we have the two equivalently spoken sentences

The apple and pair salad.
The apple and pear salad.    (7.1.12)

Our language model may output something of the form

P(The apple and pair salad.) = 3.2 × 10^{-13}
P(The apple and pear salad.) = 5.7 × 10^{-10},    (7.1.13)

and thus the second sentence would be selected because it is more probable.
To create a language model with an RNN, we need a training set that consists of a large corpus
of English text. The word corpus comes from the field of natural language processing and
refers to a set of text. For each sentence we are given as input, we will tokenize the words
using one-hot encoding. It is also common to add an end-of-sentence token. If we are given
a sentence like

The Egyptian Mau is a breed of cat.    (7.1.14)

and do not have the word "Mau" in our vocabulary list, then we will use a 1 to encode the
unknown-word token in our vocabulary list, while making everything else a 0.

Suppose we have a vocabulary list with 10k entries (excluding punctuation and including
a spot for unknown words) and are working with the following sentence:

x : Well, hello there!    (7.1.15)
With an RNN model that is trying to predict the next word, the inputs will be the ordered
list of words that came before the current time step t. For the first time step, the input will
be the 0 vector, and the first prediction ŷ^{<1>} will output a 10k-way softmax classification,
where each word in the vocabulary list is given a probability that it occurs; that is,

ŷ^{<1>} = P(<each word>) = [P(<word one>), ..., P(<word 10k>)]^T.    (7.1.16)
In this case, we have y^{<1>} as a one-hot encoding of the word "well". For the second
time step, our input will be y^{<1>} = "well". The prediction ŷ^{<2>} will be a conditional
probability over the second word, given the first word; that is,

ŷ^{<2>} = P(<each word> | y^{<1>} = "well").    (7.1.17)

For the third time step, the input will be y^{<1>} = "well", y^{<2>} = "hello". Following the
same pattern, the prediction will be in the form of a 10k-way softmax classifier, where

ŷ^{<3>} = P(<each word> | y^{<1>} = "well", y^{<2>} = "hello").    (7.1.18)
On the backward pass, we will look to maximize the predictions

ŷ^{<1>} = well,  ŷ^{<2>} = hello,  ŷ^{<3>} = there.    (7.1.19)

To predict the probability of an entire sentence, we could calculate

P(y^{<1>}) · P(y^{<2>} | y^{<1>}) · P(y^{<3>} | y^{<1>}, y^{<2>}) ⋯ P(y^{<T>} | y^{<1>}, ..., y^{<T-1>}).    (7.1.20)
7.1.4 Sampling novel sequences
To sample a sequence, we will use an RNN model similar to the one just described. However,
we will now change the input at each time step to be the prediction from the previous time
step, as shown in Figure 7.1.4. We will also have the predictions be random choices from the
output probability distribution, in order to avoid an infinite loop of the same few
words. In NumPy, we can use np.random.choice to draw a random word from the
output prediction distribution. To end the program, we can wait until the sentence is over,
in which case the output will be our <EOS> token. We could also iterate for a certain number
of time steps, or run the program for a specific amount of time. A sketch of this sampling
loop is given at the end of this subsection.
Figure 7.1.4: RNN model for sequence generation. Here, the input to the time step t is the output
from all of the previous time steps.
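A minimal sketch of this sampling loop follows, assuming a hypothetical rnn_step function that stands in for a trained model and returns the softmax distribution together with the next hidden state; the toy vocabulary and hidden size are assumptions.

import numpy as np

def sample_sequence(rnn_step, vocab, hidden_size=128, max_steps=50):
    """Sample a novel sequence from an RNN language model.

    rnn_step is a stand-in for a trained model: it maps (x, a_prev)
    to (y_hat, a), where y_hat is the softmax distribution over vocab.
    """
    x = np.zeros(len(vocab))       # first input is the zero vector
    a = np.zeros(hidden_size)
    words = []
    for _ in range(max_steps):
        y_hat, a = rnn_step(x, a)
        # Sample rather than take the argmax, to avoid an infinite
        # loop of the same few words.
        idx = np.random.choice(len(vocab), p=y_hat)
        words.append(vocab[idx])
        if vocab[idx] == "<EOS>":  # the sentence is over
            break
        x = np.zeros(len(vocab))   # feed the sampled word back in,
        x[idx] = 1.0               # one-hot encoded
    return words

# Toy stand-in model so the sketch runs: a uniform next-word distribution.
toy_vocab = ["well", "hello", "there", "<EOS>"]
uniform = lambda x, a: (np.ones(len(toy_vocab)) / len(toy_vocab), a)
print(sample_sequence(uniform, toy_vocab))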
Up until this point, we have been focused on using a word vocabulary list; however, it is
also common to use a character vocabulary list, such as all of the ASCII characters. An
advantage of character level models is that we will never have to use an unknown token. One
of the drawbacks with the character level approach is that it is worse at capturing long-range
dependencies, such as opening and closing quotes, along with being more computationally
expensive.
7.1.5 Vanishing gradients with RNNs
When we are working wit h RNNs, it is not unrealistic to be working with hundreds or
thousands of time steps in a given model. However, recall the pr obl em with deep neural
networks that had a lot of layers, where we continued to multiply by smaller and smaller
numbers as we backpropagate further. We can think of our the number of layers we have
with NNs similar to the number of time steps with RNNs. With our current RNN approach,
it will much harder to r ep re se nt long-term dependencies in our data, such as opening and
closing quotes.
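A toy numerical sketch of this effect, not part of any training algorithm: repeatedly multiplying a gradient by the same small recurrent weight matrix, once per time step, collapses its norm. The matrix size and the 0.1 scale here are arbitrary assumptions.

import numpy as np

# Backpropagating through T time steps multiplies the gradient by
# (roughly) the recurrent weight matrix T times. With small weights,
# the gradient norm vanishes long before reaching early time steps.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1  # small recurrent weights (assumption)
grad = np.ones(4)
for t in range(100):                   # 100 time steps of backpropagation
    grad = W.T @ grad
print(np.linalg.norm(grad))            # many orders of magnitude below 1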
7.1.6 GRU: Gated recurrent unit
Currently, we have been using RNNs that have a hidden unit in the following form

a^{<t>} = \tanh\left( W_a \left[ a^{<t-1>}, x^{<t>} \right] + b_a \right).    (7.1.21)
Figure 7.1.5: Our current approach to RNN calculations, where the gray box represents a hidden
black box calculation.
We can picture this as a black box calculation at each time step t, as shown in Figure
7.1.5. However, our model is not adept at capturing long-term dependencies in the data.
Suppose we are working with the following sentence
x : The cat, which already ate, was full. (7.1.22)
If we slightly changed this sentence to have the word cat as cats, then we would also need
to update was to were. With a GRU, a^{<t>} will be treated like a memory cell. We will then
set

\tilde{a}^{<t>} = \tanh\left( W_a \left[ a^{<t-1>}, x^{<t>} \right] + b_a \right)    (7.1.23)

as a potential candidate for a^{<t>}. Next, we will create an update gate

\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, x^{<t>} \right] + b_u \right),    (7.1.24)
where the majority of values in \Gamma_u will be 0 or 1, based on the functionality of the sigmoid
function. The update gate ultimately determines if we will update a^{<t>}. For example,
suppose the model learns that singular words are denoted with a^{<t>} = 1. Then, in our
working sentence we may have something along the lines of:

t         1    2    3      4        5    6    7
x         The  cat  which  already  ate  was  full
a^{<t>}        1    1      1        1    1
Notice that the update gate continues to denote that our sentence is singular at further time
steps t. At some future time t, if we run into a plural phrase, then it is the job of \Gamma_u to
update a^{<t>}. Specifically, our update will be

a^{<t>} = \Gamma_u \odot \tilde{a}^{<t>} + (1 - \Gamma_u) \odot a^{<t-1>},    (7.1.25)
where \odot is an element-wise multiplication between the vectors. Notice here that when \Gamma_u = 1,
we will update a^{<t>} to be \tilde{a}^{<t>}, such as when the time step was 2 in the above example.
When \Gamma_u = 0, we keep a^{<t>} at the same value, such as in the time steps [3, 6] above. An
important note is that the values \tilde{a}^{<t>}, a^{<t>}, and \Gamma_u are all vectors that can learn many
different types of long-term dependencies. For example, we can learn how to open and close
brackets and quotes, and use singular or plural phrases when appropriate. The computation
graph for GRUs is shown in Figure 7.1.6.
Figure 7.1.6: Simplified GRU computational graph at a time step t.
There is one other part to the GRU that we have currently left out, and that is to determine
how relevant the last memory cell a^{<t-1>} is for the calculation of the next, potential,
memory cell \tilde{a}^{<t>}. Here, we add the relevance gate \Gamma_r in as follows:

\tilde{a}^{<t>} = \tanh\left( W_a \left[ \Gamma_r \odot a^{<t-1>}, x^{<t>} \right] + b_a \right)    (7.1.26)
\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, x^{<t>} \right] + b_u \right)    (7.1.27)
\Gamma_r = \sigma\left( W_r \left[ a^{<t-1>}, x^{<t>} \right] + b_r \right)    (7.1.28)
a^{<t>} = \Gamma_u \odot \tilde{a}^{<t>} + (1 - \Gamma_u) \odot a^{<t-1>}.    (7.1.29)
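Below is a NumPy sketch of one full GRU step implementing equations (7.1.26)-(7.1.29). The function and parameter names are our own, and the randomly initialized (untrained) parameters in the usage snippet are purely illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_prev, x, Wa, ba, Wu, bu, Wr, br):
    """One GRU time step following equations (7.1.26)-(7.1.29).

    [u, v] in the text denotes vertical concatenation, and * below is
    the element-wise product written as \odot in the equations.
    """
    za = np.concatenate([a_prev, x])
    gamma_u = sigmoid(Wu @ za + bu)  # update gate    (7.1.27)
    gamma_r = sigmoid(Wr @ za + br)  # relevance gate (7.1.28)
    a_tilde = np.tanh(Wa @ np.concatenate([gamma_r * a_prev, x]) + ba)
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev  # (7.1.29)

# Toy usage with randomly initialized (untrained) parameters.
n_a, n_x = 64, 100
rng = np.random.default_rng(0)
Wa, Wu, Wr = (rng.standard_normal((n_a, n_a + n_x)) * 0.01 for _ in range(3))
ba = bu = br = np.zeros(n_a)
a_next = gru_step(np.zeros(n_a), rng.standard_normal(n_x), Wa, ba, Wu, bu, Wr, br)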
7.1.7 LSTM: Long short-term memory
In addition to GRUs to store long-term dependencies in sequential data, LSTMs are also
commonly used. Figure 7.1.7 shows an example of an LSTM model. We can think about
LSTMs as an extension of GRUs with 2 hidden units h^1 and h^2. We use 3 separate gates,
a forget gate \Gamma_f, an update gate \Gamma_u, and an output gate \Gamma_o, that are each defined as

\Gamma_u = \sigma\left( W_u \cdot [x_t, h_{t-1}] \right)    (7.1.30)
\Gamma_f = \sigma\left( W_f \cdot [x_t, h_{t-1}] \right)    (7.1.31)
\Gamma_o = \sigma\left( W_o \cdot [x_t, h_{t-1}] \right).    (7.1.32)
With LSTMs, we remove the relevance gate \Gamma_r because it often does not improve the perfor-
mance of a model in practice. With our 1st hidden unit h^1, we will set its replacement candidate
based on the other hidden unit input h^2_{t-1} and the new input x_t, which we will define as

\tilde{h}^1_t = \tanh\left( W_c \cdot [x_t, h^2_{t-1}] \right).    (7.1.33)
Figure 7.1.7: LSTM model. (a) Single LSTM gate; (b) stacking multiple LSTM gates in a row.

Now, instead of using a single update gate that determines both how much of the previous
layer we want to forget and how much of the replacement candidate we want to remember,
we use 2 separate gates to define these quantities. In particular, we define our 1st hidden
unit to be
h^1_t = \Gamma_u \odot \tilde{h}^1_t + \Gamma_f \odot h^1_{t-1}.    (7.1.34)
Our 2nd hidden unit is dependent on h^1_t and passed into the softmax function to produce
the output y_t, which we define as

h^2_t = \Gamma_o \odot \tanh\left( h^1_t \right).    (7.1.35)
Here, the output gate determines the relevance of each hidden unit for the output. The blue
path that we show in Figure 7.1.7b connects the 1st hidden unit between multiple LSTM
gates. The path could give an intuitive feel for why LSTMs give long-range dependency
connections, which is primarily due to having multiple update gates control what we store
in the internal memory, making it harder to forget everything completely.
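A hedged NumPy sketch of one LSTM step following equations (7.1.30)-(7.1.35): the function name is ours, biases are omitted to match the equations, and we take the h_{t-1} in the gate equations to be the output hidden unit h^2_{t-1}, which is an interpretation rather than something the text states explicitly.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h1_prev, h2_prev, x, Wu, Wf, Wo, Wc):
    """One LSTM time step: h1 plays the role of the internal memory cell
    and h2 the output hidden state that feeds the softmax."""
    concat = np.concatenate([x, h2_prev])
    gamma_u = sigmoid(Wu @ concat)   # update gate (7.1.30)
    gamma_f = sigmoid(Wf @ concat)   # forget gate (7.1.31)
    gamma_o = sigmoid(Wo @ concat)   # output gate (7.1.32)
    h1_tilde = np.tanh(Wc @ concat)  # replacement candidate (7.1.33)
    h1 = gamma_u * h1_tilde + gamma_f * h1_prev  # (7.1.34)
    h2 = gamma_o * np.tanh(h1)                   # (7.1.35)
    return h1, h2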
7.1.8 BRNNs: Bidirectional RNNs
Recall our 2 sentences we used earlier

He said, "Teddy bears are on sale!"
He said, "Teddy Roosevelt was a great President!",    (7.1.36)
where it is not enough to predict if the word Teddy is a name based on the information before
the word occurs. In this case, we can use a bidirectional RNN (BRNN) to make predictions
with the words before and after Teddy. Figure 7.1.8 shows the diagram for a BRNN with a
fixed size input length of 4. Our BRNN is broken up into 2 separate forward RNNs. Here,
we have a forward RNN with activations in the form \overrightarrow{a} that takes into account all of the
previous time step inputs. We also have another RNN with activations in the form of \overleftarrow{a}
that looks at all the future time steps in reverse order. For example, if our sentence is

He said, "Teddy Roosevelt!",    (7.1.37)
then at t = 3 we can form \overrightarrow{a}^{<3>} with {He, said, Teddy} and we can form \overleftarrow{a}^{<3>} with
{Roosevelt}. We can then make an output prediction, such as the meaning of the word
Teddy, in the form of

\hat{y}^{<t>} = g\left( W \cdot \left[ \overrightarrow{a}^{<3>}, \overleftarrow{a}^{<3>} \right] \right).    (7.1.38)
One of the downsides to using a BRNN is that we are required to supply the entire input
at the start.
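A minimal sketch of combining the two directions as in equation (7.1.38); the text leaves the output activation g open, so the tanh default here, the function name, and the toy sizes are all assumptions.

import numpy as np

def brnn_output(a_fwd, a_bwd, W, g=np.tanh):
    """Combine the forward and backward activations at one time step,
    as in equation (7.1.38)."""
    return g(W @ np.concatenate([a_fwd, a_bwd]))

# Toy usage with random (untrained) activations and weights.
n_a, n_y = 32, 10
rng = np.random.default_rng(0)
y_hat = brnn_output(rng.standard_normal(n_a), rng.standard_normal(n_a),
                    rng.standard_normal((n_y, 2 * n_a)))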
Figure 7.1.8: BRNN with an input of length of 4. (a) Forward and backward connections;
(b) forward RNN connection with the sequence in order; (c) forward RNN connection, where the
sequence is in reverse order.
7.1.9 Deep RNNs
Figure 7.1.9 shows the 2 primary types of deep RNNs, where we stack multiple activations
at each time step. The 1st type of network, in Figure 7.1.9a, shows 3 stacked activations at
each time step that are each connected to the next time step. Here, each layer has its own
weights, which are shared at each time step. If, for example, we want to calculate the hidden
activation in layer \ell on the tth time step, then we can use

a^{[\ell]<t>} = g\left( W^{[\ell]} \cdot \left[ a^{[\ell]<t-1>}, a^{[\ell-1]<t>} \right] \right),    (7.1.39)

which takes the left and below activations as the input. The 2nd type of network, in Figure
7.1.9b, shows the first few layers connecting to future time steps, while the deeper layers are
only connected to a single time step. Here, the deeper layers are not connected to future
time steps because connecting them would be computationally expensive.
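A one-function NumPy sketch of equation (7.1.39); the bias term and the tanh default for g are assumptions, since the equation omits both.

import numpy as np

def deep_rnn_step(a_left, a_below, W, b, g=np.tanh):
    """Activation of one layer at one time step, as in equation (7.1.39):
    a_left is this layer's activation from the previous time step, and
    a_below is the current activation from the layer beneath."""
    return g(W @ np.concatenate([a_left, a_below]) + b)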
Figure 7.1.9: Deep RNNs. (a) Stacking together multiple layers of activations that are each
connected to the next and previous time steps; (b) deep RNN where the first few layers have
activations that connect to future time steps, while the deeper layers only carry weight in a single
time step.

7.2 Word embedding

7.3 Word representation

Up until this point, we have represented words with a dictionary, where we specify a specific
word using a one-hot encoding vector. One of the problems with this representation is that
the words do not have a connection with each other. For example, suppose we input

I want a glass of orange ____    (7.3.1)

into our network, where it is the goal of the network to predict the blank. With a well
trained network, the likely prediction is the word juice. But suppose we now have the input

I want a glass of apple ____    (7.3.2)
and our model does not know about apple juice. In this situation, if our model knows that
there is some connection between orange and apple, then it would still likely predict juice.
One approach to learning connections would be to discover feature representations for each
word. Table 7.1 shows an example of a potential encoding that our model may find. The
table shows the words king, queen, apple, and orange, along with their responses to a set of
features. For example, if we have a royal feature, then king and queen might have a high
response; alternatively, if we have a food feature, then apple and orange might have a high
response. Here, if we have 300 features, then we can represent each word with a vector of
length 300. Now, due to the similarities between features, our network is more likely to pick
up similar words.
         King   Queen   Apple   Orange
Royal    0.93   0.95    -0.01    0.00
Age      0.70   0.71     0.03   -0.02
...       ...    ...      ...     ...
Food     0.01  -0.03     0.95    0.97

Table 7.1: Sample words in a dictionary (top) with their responses to each feature (left), each
ranging between [-1, 1]
7.3.1 Using word embeddings
To learn the features for a set of words, it is common to use transfer learning. There are
currently many freely available word embeddings that were learned by training on text
corpora of 1B to 100B words. Once we have word embeddings, we can transfer them to a
smaller task and fine-tune the parameters if necessary.
7.3.2 Properties of word embeddings
Once we have the word embeddings, one neat thing that we can find is the difference between
2 words. Suppose, for example, that one of our features gives the value 1 for words describing
objects with an orange color, -1 for words describing a colored object that is not orange,
and 0 for objects that are not defined by a color. Table 7.2 shows how we can spot the
feature differences between the apple and orange vectors. In particular, if we subtract the 2
vectors, then the features that are largest in magnitude correspond to the feature differences
between words, while the feature differences near 0 refer to features being the same.
                  Apple   Orange   Apple - Orange (approximate)
Orange colored    -0.96    0.98    -2
Royal             -0.01    0.00     0
Age                0.03   -0.02     0
Food               0.95    0.97     0

Table 7.2: To compare two words, we can take the difference between their feature vectors: the
features with the largest difference in magnitude are where the words differ most, while feature
differences near 0 mark features on which the words are quite similar.
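A small sketch of this comparison using the values from Table 7.2; the vectors are just the four illustrative features above, not real learned embeddings.

import numpy as np

# Feature order: [orange colored, royal, age, food], as in Table 7.2.
apple  = np.array([-0.96, -0.01,  0.03, 0.95])
orange = np.array([ 0.98,  0.00, -0.02, 0.97])

diff = apple - orange
print(diff)                     # ~[-1.94, -0.01, 0.05, -0.02]
# The entry largest in magnitude marks where the words differ most:
print(np.argmax(np.abs(diff)))  # 0, the "orange colored" feature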