Data 311: Machine Learning

Lecture 18 - Deep Learning

Dr. Irene Vrbik

University of British Columbia Okanagan

Background

  • Neural networks (NN) are supervised machine learning algorithms that rose to fame in the late 1980s

  • Automatic methods like Support Vector Machines (SVM), boosting, and random forests caused NN, which required a lot of tinkering, to take a backseat.

  • Neural networks resurfaced after 2010 under the new name deep learning, and by 2020 they had become one of the most popular algorithms in machine learning.

Implementation

  • The rising success of NN is due, in part, to vast improvements in computing power, larger training sets, and software such as TensorFlow and PyTorch
  • In this course, we will utilize an R package to run these models. However, for more intensive or large-scale computations, consider using tools outside of R.

Preface

  • NN cover a broad range of concepts and techniques.

  • As with most of the subjects in this class, an entire course (grad school-level) could be given on this matter!

  • This lecture just scratches the surface

  • These methods are often used as a black box, but they are rooted in math and statistics.

  • The material in this unit is slightly more challenging than elsewhere in this course.

Neural Networks

  • A neural network is a series of algorithms designed to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
  • These can be applied to a variety of predictors inputs (e.g. video, images, speech, sounds, text, time series, etc.)
  • NN require (a lot of) labeled training data to learn patterns and make predictions (the more diverse and representative, the better).

Structure of a NN

  • NN are often visualized as a Network Diagram with connections and nodes

  • These connections (denoted by arrows on the diagram) are each associated with a parameter (aka weight)

  • In a feedforward network, information always moves in one direction; it never goes backwards.

Terminology

  • These models are likened to the human brain; some of the terminology closely mirrors the connection with biology.

  • Each NN is made up of nodes (akin to neurons), which are connected to one another (depicted by arrows and akin to synapses)

  • Some terms in this unit are simply different names for things we’ve learned about previously in the course.

  • I will draw those connections by calling them what we would have called them in statistics, striking that out, and renaming it using the language adopted in the Deep Learning community.

Building Block: Logistic Regression

  • Neural networks and logistic regression are closely connected, as logistic regression forms the foundation of the simplest type of neural network.
  • Recall: Logistic regression is a linear model that predicts the probability of a binary outcome using the sigmoid function:

\[ \begin{align} \Pr(y = 1 \mid x) &= \frac{1}{1 + e^{- \beta^\top \mathbf x}} \\ &= \dfrac{1}{1+ e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}} \end{align} \]

Another way we wrote this:

\[ g(\eta) = \dfrac{1}{1 + e^{-\eta}} \]

where we called \(\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\) the linear predictor.
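To make the connection concrete, here is a minimal R sketch (the coefficients and the observation are made up for illustration) that computes \(\Pr(y = 1 \mid x)\) by passing the linear predictor through the sigmoid:

# sigmoid: maps the linear predictor eta onto (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))

beta <- c(-1, 0.5, 2)     # hypothetical (beta_0, beta_1, beta_2)
x    <- c(1, 1.2, -0.3)   # leading 1 multiplies the intercept
eta  <- sum(beta * x)     # linear predictor beta_0 + beta_1*X_1 + beta_2*X_2
sigmoid(eta)              # estimated Pr(y = 1 | x)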

Building Block: Logistic Regression

\[ g(\mathbf x) = \frac{1}{1 + e^{- \mathbf w^\top \mathbf x}} = \frac{1}{1 + e^{- (w_0 + \sum_{j=1}^p w_{j} x_j)}} \]

  • where \(\mathbf w\) are the coefficients weights

  • \(w_0\) is the intercept bias1

  • and \(g(\cdot )\) is the activation function2

Visual Diagram

The diagram represents a single-layer neural network (also called a perceptron), which is conceptually equivalent to logistic regression when used for binary classification tasks.

Single Layer Neural Networks

In its simplest form, a single layer neural network has only three layers:

  1. input layer: data features/predictors
  2. hidden layer: transformations and computations
  3. output layer1: continuous predictions or classifications

Let’s explore one for modeling a quantitative response using \(p = 4\) predictors.

Network Diagram

This “shallow” feed-forward NN has: 4 input nodes, 1 hidden layer (with 5 neurons/nodes/units), and 1 output node.

ISLR Fig 10.1 Neural network with a single hidden layer. Note that the textbook does not visualize the green ‘1’ node associated with bias.

Jump to: Single Hidden Layer Revisited

Single Layer NN

  • In NN terminology, the four features \(X_1, \dots ,X_4\) make up the “units” or “nodes” in the input layer

  • Each arrow feeds into each (\(k = 1, \dots, K\)1) of the so-called activations of the hidden layer: \[ A_k = h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \] These \(A_k\)s are not directly observed, hence “hidden”.

The Model

The resulting model is then a linear combination in the \(K = 5\) activations:

\[ \begin{align} f(X) &= \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \dots + \beta_K A_K\\ &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\\ &= \beta_0 + \sum_{k=1}^K \beta_k g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \\ \end{align} \]

\(g(z)\) is typically a non-linear function (see activation functions) of \(z\), a linear combination of the inputs.
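As a concrete (and entirely arbitrary) illustration of this formula, here is a minimal R sketch of the forward pass with randomly generated weights, \(p = 4\) inputs, \(K = 5\) hidden units, and a ReLU activation:

set.seed(1)
p <- 4; K <- 5
g <- function(z) pmax(0, z)                    # ReLU activation

W    <- matrix(rnorm(K * (p + 1)), nrow = K)   # row k holds (w_k0, w_k1, ..., w_kp)
beta <- rnorm(K + 1)                           # (beta_0, beta_1, ..., beta_K)

x <- c(0.5, -1, 2, 0.3)                        # one observation
A <- g(W %*% c(1, x))                          # hidden activations A_1, ..., A_K
f <- beta[1] + sum(beta[-1] * A)               # the output f(X)
f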

Activation Functions

\(g(z)\) is called the activation function

Plot of g(z) against standardized z values from -4 to 4: the sigmoid function (green) is an S-shaped curve that starts at 0 and finishes at 1; the ReLU function (black) is piecewise linear, equal to 0 for z less than 0 and to z otherwise.

A plot of the popular activation functions: the sigmoid function in green and rectified linear (or Relu) in black.

Sigmoid

In the early instances of neural networks, the sigmoid activation function was favoured:

\[ g(z) = \frac{e^z}{1+e^z} \]

Recall:

  • the sigmoid function forms an S-shaped graph
  • this was used in logistic regression to convert values on the real line into probabilities between zero and one

Relu

The preferred1 choice in modern neural networks is the ReLU (rectified linear unit) activation function, which takes the form

\[ g(z) = \text{max}(0,z) = \begin{cases} 0 & \text{if }z<0\\ z & \text{otherwise} \end{cases} \]

Although it thresholds at zero, because we apply it to a linear function the constant term \(w_{k0}\) will shift this inflection point.
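Both activation functions are one-liners in R; a quick sketch that also reproduces the plot described on the earlier slide:

sigmoid <- function(z) exp(z) / (1 + exp(z))
relu    <- function(z) pmax(0, z)

z <- seq(-4, 4, length.out = 200)
plot(z, relu(z), type = "l", col = "black", ylab = "g(z)")   # ReLU in black
lines(z, sigmoid(z), col = "green")                          # sigmoid in green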

Common Activation Functions

Source of Image

Summary of Single Layer NN

In summary, we derive five new features by computing five different linear combinations of \(X\), and then plug each through an activation function \(g(·)\) to transform it. The final model is linear in these derived variables and has the following parameters …

Comment

  • All hidden layers typically use the same activation function.

  • The output layer will typically use a different activation function from the hidden layers.

  • The \(w_{kj}\)s are the coefficients weights and \(w_{k0}\)s are the intercepts biases1

  • This is a “feed-forward neural network” - there are more complicated types of NN (e.g. backwards arrows, loops, no arrows)

Output Layer

Common choices of activation function for the output layer:

  1. Linear (aka “identity”)

    • multiplication by 1, i.e. no activation.
  2. Sigmoid (aka “logistic”)

  3. Softmax1 (for multi-class)

iClicker

bias

In a neural network, what is the purpose of the bias term?

  1. To prevent overfitting

  2. To normalize input data

  3. To shift the activation function

  4. To minimize the loss function

Why Non-linear Activation Functions?

  • The nonlinearity in the activation function \(g(·)\) is essential

  • Without it the model \(f(X)\) would collapse into a simple linear model in \(X_1,...,X_p\).

  • Moreover, having a nonlinear activation function allows us to capture complex nonlinearities and interaction effects.

Let’s look at an example where the sum of two nonlinear transformations of linear functions can give us an interaction

Example: Single Layer NN

Suppose we have two input variables \(X = (X_1, X_2)\), i.e. \(p=2\). In our hidden layer suppose we have two hidden units (i.e \(K=2\)) with activation function equal to \(g(z) = z^2\)

Suppose we have parameters: \[\begin{align*} \beta_0 &= 0, & \beta_1 &= \frac{1}{4} & \beta_2 &= - \frac{1}{4}\\ w_{10} &= 0, & w_{11} &=1 & w_{12} &= 1, \\ w_{20} &= 0, & w_{21} &=1 & w_{22} &= -1, \\ \end{align*}\]

Example (activations)

The two activations in the hidden layer are then computed as \[\begin{align*} A_k = h_k(X) &= g (w_{k0} + \sum_{j=1}^p w_{kj}X_j) \end{align*}\]

\[ \begin{align*} A_1 &= h_1(X) \\ &= g (w_{10} + w_{11}X_1 + w_{12}X_2)\\ &= g (0 + 1\cdot X_1 + 1 \cdot X_2)\\ &= (X_1 + X_2)^2 \end{align*} \]

\[ \begin{align*} A_2 &= h_2(X)\\ &= g (w_{20} + w_{21}X_1 + w_{22}X_2)\\ &= g(0 + 1 \cdot X_1 + (-1) \cdot X_2)\\ &= (X_1 - X_2)^2 \end{align*} \]

Example (model)

Plugging \(h_1(X) = (X_1 + X_2)^2\) and \(h_2(X) =(X_1 - X_2)^2\) into the neural network model:

\[\begin{align*} f(X) &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X) = \beta_0 + \beta_1 h_1(X) + \beta_2 h_2(X)\\ &= 0 + \frac{1}{4}(X_1 + X_2)^2 - \frac{1}{4} (X_1 - X_2)^2\\ &= \frac{1}{4}\left[ X_1^2 + 2\cdot X_1 X_2 + X_2^2 - (X_1^2 - 2 \cdot X_1 X_2 + X_2^2)\right] \\ &= X_1X_2 \end{align*}\]
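We can verify this algebra numerically in R; with \(g(z) = z^2\) and the parameters above, \(f(X)\) returns \(X_1 X_2\) for any inputs:

g <- function(z) z^2

f <- function(x1, x2) {
  A1 <- g(0 + 1 * x1 + 1 * x2)    # h_1(X) = (X1 + X2)^2
  A2 <- g(0 + 1 * x1 - 1 * x2)    # h_2(X) = (X1 - X2)^2
  0 + (1/4) * A1 - (1/4) * A2     # beta_0 + beta_1*A1 + beta_2*A2
}

f(2, 3)      # 6  = 2 * 3
f(-1.5, 4)   # -6 = -1.5 * 4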

Multilayer Neural Networks

  • In theory1 a single hidden layer with a large number of units has the ability to approximate most functions.

  • However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.

  • The weights \(w\) are parameters that require estimation. The quantity of these gets out of hand quickly.

  • The adjective “deep” in deep learning refers to the use of multiple layers in the network.

Digit Recognition Example

  • Digit recognition problems were the catalyst that accelerated the development of neural network technology in the late 1980s at AT&T Bell Laboratories and elsewhere

  • It turns out that these pattern-recognition tasks (relatively simple for humans) are not so simple for machines.

  • It has taken more than 30 years to refine the neural-network architectures to match human performance.

Digit Dataset

  • The textbook goes through the process of setting up a large dense network on the famous and publicly available MNIST handwritten digit dataset (60K training, 10K testing)

  • The idea is to build a model to classify the images into their correct digit class 0–9.

  • Every image has \(p = 28 × 28 = 784\) pixels, each of which is an eight-bit grayscale value between 0 and 255 representing the brightness of a pixel (0 = black, 255 = white):

Example images

Two-Layer Feed-forward NN

ISLR Fig 10.4. Neural network diagram suitable for the MNIST handwritten-digit problem. The input layer has \(p = 784\) units, the two hidden layers having \(K_1 = 256\) and \(K_2 = 128\) units respectively, and 10 output layer units. Along with intercepts (AKA biases) this network has \(235,146\) parameters (aka weights).
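For reference, a network with this architecture could be specified with the keras R package (introduced near the end of these slides); the code below is only a sketch and assumes keras/TensorFlow are installed and the images have been flattened into 784-element vectors:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 784) %>%  # hidden layer 1 (K1)
  layer_dense(units = 128, activation = "relu") %>%                     # hidden layer 2 (K2)
  layer_dense(units = 10,  activation = "softmax")                      # output layer (digits 0-9)

summary(model)   # reports 235,146 parameters in total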

First Hidden Layer

The first hidden layer takes a linear combination of the 784 inputs stored in \(X\) as input to the activation function. For \(k = 1, \dots, K_1 = 256\) we have: \[\begin{align*} &A_k^{(1)} = h^{(1)}_k(X) = g (w^{(1)}_{k0} + \sum_{j=1}^{p = 784} w^{(1)}_{kj}X_j) \end{align*}\]

  • \(W_1\) is the 785 \(\times\) 256 matrix of weights that feed from the input layer to hidden layer one (L1),
  • i.e. \(W_1 = \{w_{kj}^{(1)}\}\), \(j =\) 0, 1, \(\dots p\), and \(k=1, \dots, K_1\)

Second Hidden Layer

The second hidden layer takes a linear combination of the activations \(A_k^{(1)}\) from the first hidden layer as inputs for the activation function. For \(\ell = 1, \dots, K_2= 128\) we have:

\[ \begin{align*} &A^{(2)}_\ell= h^{(2)}_\ell(X)= g (w^{(2)}_{\ell 0} + \sum_{k=1}^{K_1=256} w^{(2)}_{ \ell k}A_k^{(1)}) \end{align*} \]

  • \(W_2\) is the \(25\textbf{7}\times 128\) matrix of weights that feed from the first hidden layer (L1) to the second hidden layer (L2),
  • i.e. \(W_2 = \{w_{\ell k}^{(2)}\}\), with \(k=\textbf{0}, \dots, K_1\) and \(\ell = 1,\dots K_2\)

Output Layer

The output layer takes a linear combination of these activations \(A_\ell^{(2)}\) from the second hidden layer as inputs to the output activation function. For \(m = 0, \dots, 9\): \[\begin{align*} f_m(A_\ell^{(2)}) &= f_m \left( \beta_{m0} + \sum_{\ell=1}^{K_2 = 128} \beta_{m\ell} A_\ell^{(2)} \right) \end{align*}\]

  • \(B\) is the \(12\textbf{9} \times 10\) matrix of weights that feed from the second hidden layer (L2) to output layer,

  • i.e. \(B = \{ \beta_{m \ell} \}\) with \(\ell = \textbf{0},\dots K_2\), and \(m=0, \dots, 9\)

Softmax

  • As stated previously, the output layer typically uses a different activation function from the hidden layers

  • The output activation function used here is the softmax function, which returns probabilities, i.e. \(\sum_{m=0}^9 f_m(X) = 1\): \[ f_m(X) = \text{P}(Y=m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^9 e^{Z_{\ell}}} \] where \(Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell}A_\ell^{(2)}\). We assign the image to the class with the highest probability.
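The softmax itself is a one-line function in R; a small sketch with made-up scores \(Z_0, \dots, Z_9\):

softmax <- function(z) exp(z) / sum(exp(z))   # in practice, subtract max(z) first for numerical stability

z <- c(2.0, 0.1, -1.3, 0.5, 1.1, -0.7, 0.0, 3.2, 0.4, -2.0)  # hypothetical Z_m values
p <- softmax(z)
sum(p)             # 1
which.max(p) - 1   # predicted digit (subtract 1 because the classes are 0-9)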

Cross-entropy

We fit the model by minimizing the negative multinomial log-likelihood, or cross-entropy1:

\[ \mathcal{L} = - \sum_{i=1}^n \sum_{m=0}^9 y_{im} \log (f_m(x_i)) \]

where \(y_{im}\) is 1 if the true class for observation \(i\) is \(m\), else 0. These classes are said to be one-hot encoded2
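As a small numerical sketch, for a single one-hot-encoded observation the cross-entropy reduces to minus the log of the probability assigned to the true class (the probabilities below are made up):

p_hat <- c(0.01, 0.02, 0.05, 0.02, 0.70, 0.05, 0.05, 0.04, 0.03, 0.03)  # predicted probabilities for digits 0-9
y <- rep(0, 10); y[5] <- 1            # one-hot encoding: the true digit is 4

-sum(y * log(p_hat))                  # this observation's contribution to the loss
-log(p_hat[5])                        # the same thing, written directly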

Loss Function

In the regression1 setting, for example, the model is fit by minimizing the familiar residual sum of squares:

\[ \mathcal{L} = \sum_{i=1}^n (y_i - f(x_i))^2 \]

This is commonly referred to as a loss function (or cost function, or objective function).2

Dummy variables One-hot encoding

  • Like the regression model, NN will require that our categorical data be converted to numeric form.

  • When no ordinal relationship exists we use one-hot encoding which encodes \(N\) categories using binary (aka dummy) variables

Original   red.dummy   green.dummy   blue.dummy
red            1            0             0
green          0            1             0
blue           0            0             1
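In R, one-hot encoding like the table above can be produced with model.matrix() by dropping the intercept; a small sketch with a toy colour variable (note the columns come out in alphabetical order of the levels):

colour <- factor(c("red", "green", "blue"))

model.matrix(~ colour - 1)   # one dummy column per level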

Number of Parameters Weights

  • This model has 235,146 parameters (referred to as weights):
    • \(W_1\) has \(785×256 = 200,960\) weights1
    • \(W_2\) has \(257 × 128 = 32,896\) weights\(^1\)
    • \(B\) has \(129×10 = 1290\) weights\(^1\).

Note that we have close to 4 times as many parameters as we do training observations (60k).
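A quick sanity check of these counts in R:

W1 <- (784 + 1) * 256   # input layer (plus bias) to hidden layer 1: 200,960
W2 <- (256 + 1) * 128   # hidden layer 1 (plus bias) to hidden layer 2: 32,896
B  <- (128 + 1) * 10    # hidden layer 2 (plus bias) to output layer: 1,290

W1 + W2 + B             # 235,146 parameters in total
(W1 + W2 + B) / 60000   # roughly 3.9 parameters per training image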

What should we be concerned about?

Overfitting

  • One of the most important aspects when training neural networks is avoiding overfitting.

  • To avoid overfitting, some regularization is needed.

  • As in our regression unit, regularization will be achieved by adding a penalty term regularization term to our loss function in order to penalize complexity.

  • This will effectively shrink (or remove) certain weights thereby making some hidden neurons negligible and reducing the overall complexity of the NN.

Regularization

Two popular regularization techniques are:

  1. L1 regularization (aka LASSO regularization)
  2. L2 regularization (aka Ridge regularization)

As you may be able to guess, L1 regularization forces the weights to become (exactly) zero and L2 regularization forces the weights towards (but never exactly) zero.

L1/L2 Regularization

  • L2/Ridge regularization uses the L2 norm \(|| W ||_2^2\) in its regularization term1 and adds it to the loss function: \[ \mathcal{L} + \frac{\alpha}{2}|| W ||_2^2 \]

  • L1/LASSO regularization uses the L1 norm \(|| W ||_1\) in its regularization term\(^1\) and adds it to the loss function: \[ \mathcal{L} + \alpha|| W ||_1 \]
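In the keras R package, for example, such a penalty is attached to a layer through its kernel_regularizer argument; the sketch below is illustrative only, and the penalty strength 0.01 is an arbitrary choice:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 784,
              kernel_regularizer = regularizer_l2(l = 0.01)) %>%   # Ridge-style (L2) penalty
  layer_dense(units = 10, activation = "softmax")

# swap in regularizer_l1(l = 0.01) above for a LASSO-style (L1) penalty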

Dropout

Another powerful option is dropout regularization.

Diagram of a neural network with three hidden layers: on the left, a feedforward network with no dropout; on the right, some of the nodes in the hidden layers have an x through them, indicating that they have been dropped from the network.

Figure source: DeepLearning.AI YouTube video

Simply put, we create a smaller network by removing nodes at random according to some probability (in this case \(p = 0.5\)).
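In keras, dropout is specified as its own layer placed after the layer whose units it should drop; a minimal sketch (using the rate \(p = 0.5\) from the figure):

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 784) %>%
  layer_dropout(rate = 0.5) %>%                  # each unit is dropped with probability 0.5 during training
  layer_dense(units = 10, activation = "softmax")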

Results

Method                                      Test Error
Neural Network + Ridge Regularization            2.3%
Neural Network + Dropout Regularization          1.8%
Multinomial Logistic Regression                  7.2%
Linear Discriminant Analysis                    12.7%

This is a historic example that marks one of the early “wins” for neural networks in the 1990s.

Training NN

  • Once the structure is determined, we’re left with a non-linear optimization problem.

  • Typically, NNs are trained using stochastic gradient descent, with the weights updated using backpropagation.

  • We will give a brief overview using the loss function for regression, but these ideas can be extended to classification and to settings where regularization penalties are applied.

Single Hidden Layer Revisited

  • Let’s return to the single hidden layer example used in our first network diagram.

  • In this model the parameters are \(\beta = (\beta_0, \beta_1, \dots, \beta_K)\), as well as each of the \(w_k = (w_{k0}, w_{k1},\dots,w_{kp})\), for \(k = 1, \dots, K\).

  • Given observations \((x_i, y_i)\), for \(i = 1, \dots, n\), we could fit the model by minimizing the Loss Function with respect to the parameters \(\beta\) and \(W\).

Nonconvex problem

While this might look like the minimization problem we had in linear regression, the following is not straightforward to minimize. \[ \mathcal{L}(\beta, W) = \sum_{i=1}^n (y_i - f(x_i))^2 \]

Furthermore, this problem is nonconvex in the parameters, and hence there are multiple solutions.

Gradient Descent

  • At the basic level, we adjust the weights so that error is reduced for the next iteration.

  • More technically, the optimization algorithm navigates down the gradient (or slope) of the error and seeks to change each weight proportionally to its effect on the RSS: \[ \Delta w_{\cdots} = - \alpha \frac{d \, RSS}{d w_{\cdots}} \] where \(\alpha\) is a learning rate.

Gradient Descent Algorithm

  1. Take the derivative of the loss function for each parameter (i.e. find the gradient for the loss function)
  2. Pick random values for parameters
  3. Plug in the parameters values into the derivatives (i.e. gradient)
  4. Calculate \(\text{Step Size} = - \alpha \frac{ d RSS}{d w_{ \cdots}}\)1
  5. Update: New Parameters = Old Parameters + Step Size (repeat steps 3-5 until the step sizes become very small or a maximum number of iterations is reached)
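A bare-bones R implementation of these steps for a one-parameter problem (a no-intercept regression \(y \approx w x\) on made-up data, so the RSS has a single weight \(w\)):

set.seed(1)
x <- rnorm(50)
y <- 3 * x + rnorm(50)                               # simulated data; the true slope is 3

rss_grad <- function(w) sum(-2 * x * (y - w * x))    # step 1: derivative of RSS with respect to w

alpha <- 0.005                                       # learning rate
w <- runif(1)                                        # step 2: random starting value
for (i in 1:200) {
  step <- -alpha * rss_grad(w)                       # steps 3-4: plug w into the gradient, get the step size
  w <- w + step                                      # step 5: update the parameter
}
w                                                    # close to the least-squares slope (about 3)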

Non-convexity

ISLR Fig 10.7: Illustration of gradient descent for one-dimensional \(\theta\). The objective function \(R(\theta)\) is not convex, and has two minima, one at \(\theta = -0.46\) (local), the other at \(\theta = 1.02\) (global). Starting at some value \(\theta_0\) (typically randomly chosen), each step in θ moves downhill — against the gradient — until it cannot go down any further. Here gradient descent reached the global minimum in 7 steps.

Learning Rate

  • The learning rate is akin to that which we saw in Boosting.

  • Essentially it is included so that the algorithm “learns slowly” and avoids overfitting.

  • Typically this value is very small (say ~0.1)

  • The algorithm is highly dependent on this value, and often it is set according to a “schedule” that starts off large and gets smaller and smaller.

Stochastic Gradient Descent

  • The gradient descent just described computes the gradient of the loss function using the entire dataset.

  • While accurate, this is computationally expensive

  • Instead we use stochastic gradient descent, in which a small random subset of the data (called a batch) is used to compute an approximate gradient.
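Modifying the earlier gradient descent sketch, each update now uses only a random mini-batch (the batch size of 10 is an arbitrary choice):

set.seed(1)
x <- rnorm(500)
y <- 3 * x + rnorm(500)                              # simulated data; the true slope is 3

alpha <- 0.01; batch_size <- 10
w <- runif(1)
for (i in 1:500) {
  b <- sample(length(x), batch_size)                 # draw a random mini-batch
  grad <- sum(-2 * x[b] * (y[b] - w * x[b]))         # approximate gradient from the batch alone
  w <- w - alpha * grad                              # update
}
w                                                    # close to 3, with some random fluctuation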

Backpropagation

  • Geoffrey Hinton, David Rumelhart and Ronald J. Williams pioneered the back-propagation algorithm in a pair of landmark papers published in 1986.

  • As a very brief overview, we work our way backwards through the NN (hence the name) and compute the gradient of the loss with respect to each weight

  • Stochastic gradient descent is then used to perform learning using this gradient.

Visualization of Backpropagation

Source of Image

Epoch in NN

  • An epoch is completed each time the algorithm has seen all the samples in the dataset in one cycle (i.e. a forward pass and a backward pass).

  • An epoch is made up of one or more batches (or mini-batches), where we use a part of the dataset to train the neural network.

Note

When you run through all the batches, that is called an epoch. The length of the training of NN is usually measured in epochs.
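A quick illustration with the MNIST numbers (the batch size of 128 is just an example choice):

n_train    <- 60000             # MNIST training images
batch_size <- 128               # an arbitrary example choice

ceiling(n_train / batch_size)   # 469 batches (i.e. gradient updates) per epoch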

Network Tuning

Fitting NN requires a number of choices that all have an effect on the performance:

  • The number of hidden layers, and the number of units (or nodes) per layer.

  • Regularization techniques / tuning parameters

  • Details of stochastic gradient descent

  • Choice of activation functions

  • Number of epochs

TensorFlow Playground

NN in R

  • One popular way of fitting Neural Networks in R is to use the keras package1

  • It can be a bit finicky to install on your computer, but you can follow the instructions from the textbook’s website here

  • This package interfaces to the tensorflow package, which in turn links to efficient Python code.

  • For this course, we’ll just stick to the nnet and NeuralNetTools packages for our simple demonstrations.

Example: Body NN

  • We fit a neural network with one hidden layer containing four nodes (or neurons/units) to predict recorded Gender.
library(gclus)
data(body)
sbod <- cbind(scale(body[,1:24]), factor(body[,25]))
colnames(sbod)[25] <- "Gender"
# load the package for Neural Networks
library(nnet)
# size = nodes in hidden layer
# fits a single layer NN
nnbod2 <- nnet(factor(Gender)~., data=sbod, size=4)

We call on another package to produce a plot of it …

Plotting the NN

library(NeuralNetTools);  plotnet(nnbod2)

iClicker

Parameter Count

How many parameters are in this NN? Note there are \(p\) = 24 predictors

  1. \(24 \times 4 + 4 \times 1\)
  2. \(25 \times 5 + 5 \times 1\)
  3. \(24 \times 4 \times 1\)
  4. \(25 \times 4 + 5 \times 1\)
  5. none of the above

Properties of the Network Diagram

  • For the NeuralNetTools package, positive weights show up as black, negative as grey.

  • Also, the biases are shown as separate nodes (B1 and B2) on the graph; for simpler NN, the weights would be shown on the lines connecting the nodes.

  • In this case, the magnitude of the weight shows up as the thickness of the line.

  • There are 25 (24 predictors + 1 bias term) x 4 (hidden layer units) + 5 (hidden layer units + 1 bias term) x 1 (output layer unit) = 105 weights
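Assuming nnbod2 from the earlier slide is still in your workspace, this count can be verified directly from the fitted object, which stores the estimated weights in its wts component:

length(nnbod2$wts)   # 105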

Example: Body NN

  • How does it perform??
table(body[,25], predict(nnbod2, type="class"))
   
      1   2
  0 260   0
  1   0 247
  • 0 misclassifications!

But of course, this is on the training set, so let’s do a training/testing set to approximate the long-run error.

Validation Approach with Body NN

  • Set up a quick training and testing set. Sample approximately half the data for a 50-50 training/testing scenario, refit the 4-hidden-unit NN to the training set, and predict on the test set…
set.seed(53747958)
bindex <- sample(1:nrow(sbod), nrow(sbod)*.50)
btrain <- sbod[bindex,]
btest <- sbod[-bindex,]
nnbodtr <- nnet(factor(Gender)~., data=btrain, size=4)
table(btest[,25], predict(nnbodtr, newdata=btest[,-25], type="class"))
   
      1   2
  1 122   2
  2   4 126

6 misclassifications, for a misclassification rate of 0.023622

Example: Body NN

  • How does that compare to some of our other methods?

  • RF on same train/test? Misclassify 21. Rate closer to 8%. NN wins that battle.

  • LDA on same train/test? Misclassify 6. Rate closer to 2%. And it’s a much simpler model for interpretation.

  • So, be careful. Trendy, flexible models are not a cure-all. They need to be carefully tuned and tested.

Summary

Pros

✅ Effective for high-dim and complex data.
✅ Parallelization
✅ Easily extended to multiple response models
✅ Can use pre-trained NN

Cons

❌ Black box (lack of interpretability)
❌ Lots of tuning to consider
❌ High danger of overfitting
❌ Requires very large training set

Taking it further

Different types of data call for different variations. For example

  • ISLR2 Ch 10.3 - CNN (Convolutional Neural Networks) are a special family of NN for classifying images and video
  • ISLR2 Ch 10.4 - the bag-of-words model can be used for predicting attributes of documents
  • ISLR2 Ch 10.5 - RNN (Recurrent Neural Networks) are used when data are sequential in nature (e.g. time series, documents)

CNN