Neural networks (NN) are supervised machine learning algorithms that rose to fame in the late 1980s
More automatic methods like Support Vector Machines (SVM), boosting, and random forests caused NN, which required a lot of tinkering, to take a backseat.
Neural networks resurfaced after 2010 under the new name deep learning, and by 2020 they were among the most popular algorithms in machine learning.
Implementation
The rising success of NN is due, in part, to vast improvements in computing power, larger training sets, and software such as TensorFlow and PyTorch
In this course, we will utilize an R package to run these models. However, for more intensive or large-scale computations, consider using tools outside of R.
Preface
NN cover a broad range of concepts and techniques.
As with most of the subjects in this class, an entire course (grad school-level) could be given on this matter!
This lecture just scratches the surface
These methods are often used as a black box, but they are rooted in math and statistics.
The material in this unit is slightly more challenging than elsewhere in this book.
Neural Networks
A neural network is a series of algorithms designed to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
These can be applied to a wide variety of inputs (e.g. video, images, speech, sounds, text, time series, etc.)
NN require (a lot of) labeled training data to learn patterns and make predictions (the more diverse and representative, the better).
Structure of a NN
NN are often visualized as a Network Diagram with connections and nodes
These connections (denoted by arrows on the diagram) are each associated with a parameter (aka weight)
In a feedforward network, information always moves one direction; it never goes backwards.
Terminology
These models are likened to the human brain; some of the terminology closely mirrors the connection with biology.
Each NN is made of nodes (akin to neurons) that are connected to one another; the connections (depicted by arrows) are akin to synapses
Some terms in this unit are simply different names for things we’ve learned about previously in the course.
I will draw those connections by first calling them what we would have called them in statistics, striking that out, and renaming them using the language adopted in the Deep Learning community.
Building Block: Logistic Regression
Neural networks and logistic regression are closely connected, as logistic regression forms the foundation of the simplest type of neural network.
Recall: Logistic regression is a linear model that predicts the probability of a binary outcome by passing a linear combination of the predictors through the sigmoid function: \[
p(X) = \sigma(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p), \qquad \text{where } \sigma(z) = \frac{1}{1+e^{-z}}
\]
The diagram represents a single-layer neural network (also called a perceptron), which is conceptually equivalent to logistic regression when used for binary classification tasks.
Single Layer Neural Networks
In its simplest form, a single layer neural network has only three layers:
input layer: data features/predictors
hidden layer: transformations and computations
output layer: continuous predictions or classifications
Let’s explore one for modeling a quantitative response using \(p = 4\) predictors.
Network Diagram
This “shallow” feed-forward NN has: 4 input nodes, 1 hidden layer (with 5 neurons/nodes/units), and 1 output node.
ISLR Fig 10.1 Neural network with a single hidden layer. Note that the textbook does not visualize the green ‘1’ node associated with bias.
In NN terminology, the four features \(X_1, \dots ,X_4\) make up the “units” or “nodes” in the input layer
Each arrow feeds into each of the so-called activations of the hidden layer (\(k = 1, \dots, K\)): \[
A_k = h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj}X_j)
\] These \(A_k\)s are not directly observed, hence “hidden”.
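To make the formula concrete, here is a minimal R sketch (not course code; the observation and weights below are made up) that computes the \(K = 5\) hidden activations for a single observation with \(p = 4\) inputs, using the sigmoid as \(g(\cdot)\):

```r
set.seed(1)                              # hypothetical weights generated below
x <- c(0.5, -1.2, 0.3, 2.0)              # one observation, p = 4 predictors
W <- matrix(rnorm(5 * 5), nrow = 5)      # row k holds (w_k0, w_k1, ..., w_k4)
g <- function(z) 1 / (1 + exp(-z))       # sigmoid activation function
A <- g(W %*% c(1, x))                    # A_k = g(w_k0 + sum_j w_kj * x_j)
A                                        # the K = 5 hidden activations
```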
The Model
The resulting model is then a linear combination of the \(K = 5\) activations: \[
f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k
\]
In summary, we derive five new features by computing five different linear combinations of \(X\), and then plug each through an activation function \(g(·)\) to transform it. The final model is linear in these derived variables and has the following parameters …
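For \(p = 4\) and \(K = 5\), that amounts to \(K(p + 1) = 25\) weights \(w_{kj}\) (including the biases \(w_{k0}\)) in the hidden layer, plus the \(K + 1 = 6\) parameters \(\beta_0, \dots, \beta_K\) in the output layer, i.e. 31 parameters in total.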
Comment
All hidden layers typically use the same activation function.
The output layer will typically use a different activation function from the hidden layers.
The \(w_{kj}\)s are the ~~coefficients~~ weights and the \(w_{k0}\)s are the ~~intercepts~~ biases
This is a “feed-forward neural network” - there are more complicated types of NN (e.g. backwards arrows, loops, no arrows)
Output Layer
Common choices of activation functions for the output layer:
In a neural network, what is the purpose of the bias term?
To prevent overfitting
To normalize input data
To shift the activation function ✅
To minimize the loss function
Why Non-linear Activation Functions?
The nonlinearity in the activation function \(g(·)\) is essential
Without it the model \(f(X)\) would collapse into a simple linear model in \(X_1,...,X_p\).
Moreover, having a nonlinear activation function allows us to capture complex nonlinearities and interaction effects.
Let’s look at an example where the sum of two nonlinear transformations of linear functions can give us an interaction
Example: Single Layer NN
Suppose we have two input variables \(X = (X_1, X_2)\), i.e. \(p = 2\). In our hidden layer suppose we have two hidden units (i.e. \(K = 2\)) with activation function \(g(z) = z^2\)
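One concrete choice of weights (in the spirit of the textbook's example; the exact numbers shown in lecture may differ) makes the interaction explicit. Take \(w_{10} = 0, w_{11} = 1, w_{12} = 1\), \(w_{20} = 0, w_{21} = 1, w_{22} = -1\), and output parameters \(\beta_0 = 0, \beta_1 = \tfrac{1}{4}, \beta_2 = -\tfrac{1}{4}\). Then \[
f(X) = \tfrac{1}{4}(X_1 + X_2)^2 - \tfrac{1}{4}(X_1 - X_2)^2 = X_1 X_2,
\] a pure interaction between \(X_1\) and \(X_2\), which no linear model in \(X_1\) and \(X_2\) alone could produce.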
In theory, a single hidden layer with a large number of units has the ability to approximate most functions.
However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.
The weights \(w\) are parameters that require estimation. The quantity of these gets out of hand quickly.
The adjective “deep” in deep learning refers to the use of multiple layers in the network.
Digit Recognition Example
Digit recognition problems were the catalyst that accelerated the development of neural network technology in the late 1980s at AT&T Bell Laboratories and elsewhere
It turns out that these pattern recognition tasks (which are relatively simple for humans) are not so simple for machines.
It has taken more than 30 years to refine the neural-network architectures to match human performance.
Digit Dataset
The textbook goes through the process of setting up a large dense network on the famous and publicly available MNIST handwritten digit dataset (60K training, 10K testing)
The idea is to build a model to classify the images into their correct digit class 0–9.
Every image has \(p = 28 × 28 = 784\) pixels, each of which is an eight-bit grayscale value between 0 and 255 representing the brightness of a pixel (0 = black, 255 = white):
Example images
Two-Layer Feed-forward NN
ISLR Fig 10.4. Neural network diagram suitable for the MNIST handwritten-digit problem. The input layer has \(p = 784\) units, the two hidden layers having \(K_1 = 256\) and \(K_2 = 128\) units respectively, and 10 output layer units. Along with intercepts (AKA biases) this network has \(235,146\) parameters (aka weights).
First Hidden Layer
The first hidden layer takes a linear combination of the 784 inputs stored in \(X\) as the input to the activation function. For \(k = 1, \dots, K_1 = 256\) we have: \[\begin{align*}
&A_k^{(1)} = h^{(1)}_k(X) = g (w^{(1)}_{k0} + \sum_{j=1}^{p = 784} w^{(1)}_{kj}X_j)
\end{align*}\]
\(W_1\) is the \(785 \times 256\) matrix of weights that feed from the input layer to hidden layer one (L1),
i.e. \(W_1 = \{w_{kj}^{(1)}\}\), \(j =\)0, 1, \(\dots p\), and \(k=1, \dots, K_1\)
Second Hidden Layer
The second hidden layer takes a linear combination of the activations \(A_k^{(1)}\) from the first hidden layer as the input to the activation function. For \(\ell = 1, \dots, K_2 = 128\) we have: \[\begin{align*}
&A_\ell^{(2)} = h^{(2)}_\ell(A^{(1)}) = g (w^{(2)}_{\ell 0} + \sum_{k=1}^{K_1 = 256} w^{(2)}_{\ell k} A_k^{(1)})
\end{align*}\]
\(W_2\) is the \(25\textbf{7}\times 128\) matrix of weights that feed from the first hidden layer (L1) to the second hidden layer (L2),
i.e. \(W_2 = \{w_{\ell k}^{(2)}\}\), with \(k=\textbf{0}, \dots, K_1\) and \(\ell = 1,\dots K_2\)
Output Layer
The output layer takes a linear combination of the activations \(A_\ell^{(2)}\) from the second hidden layer as the input to the output activation function. For \(m = 0, \dots, 9\) we form the linear combinations \[\begin{align*}
Z_m
&= \beta_{m0} + \sum_{\ell=1}^{K_2 = 128} \beta_{m\ell} A_\ell^{(2)}
\end{align*}\] each of which is then passed through the output activation function \(f_m\) (described next).
\(B\) is the \(12\textbf{9} \times 10\) matrix of weights that feed from the second hidden layer (L2) to output layer,
i.e. \(B = \{ \beta_{m \ell} \}\) with \(\ell = \textbf{0},\dots K_2\), and \(m=0, \dots, 9\)
Softmax
As stated previously, the output layer typically uses a different activation function from the hidden layers
The output activation function used here is the softmax function which returns probabilities i.e. \(\sum_{m=0}^9 f_m(X) = 1\): \[\begin{equation}
f_m(X) = \text{P}(Y=m \mid X)
= \frac{e^{Z_m}}{\sum_{\ell=0}^9 e^{Z_{\ell}}}
\end{equation}\] where \(Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell}A_\ell^{(2)}\). We assign the image to the class with the highest probability.
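As a quick illustration (a minimal sketch, not course code), the softmax is easy to compute directly in R and always returns probabilities that sum to 1:

```r
softmax <- function(z) exp(z) / sum(exp(z))    # softmax over a vector of scores

z <- c(2.0, 0.5, -1.0, 0.1, 0.0, 1.2, -0.3, 0.4, 0.9, -2.0)  # made-up Z_0, ..., Z_9
p <- softmax(z)
sum(p)              # equals 1
which.max(p) - 1    # predicted digit (subtract 1 since R indexes from 1)
```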
Cross-entropy
We fit the model by minimizing the negative multinomial log-likelihood, also known as the cross-entropy: \[
-\sum_{i=1}^n \sum_{m=0}^9 y_{im} \log f_m(x_i)
\]
where \(y_{im}\) is 1 if the true class for observation \(i\) is \(m\), and 0 otherwise. These classes are said to be one-hot encoded
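In code the cross-entropy is a one-liner; here is a minimal sketch (not course code) assuming \(Y\) is the \(n \times 10\) one-hot matrix of labels and \(P\) is the \(n \times 10\) matrix of predicted probabilities \(f_m(x_i)\):

```r
# cross-entropy loss for one-hot labels Y and predicted probabilities P
cross_entropy <- function(Y, P) -sum(Y * log(P))
```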
Loss Function
In the regression setting, for example, the model is fit by minimizing the familiar residual sum of squares:
\[
\mathcal{L} = \sum_{i=1}^n (y_i - f(x_i))^2
\]
This is commonly referred to as a loss function (or cost function, or objective function).
~~Dummy variables~~ One-hot encoding
Like the regression model, NN will require that our categorical data be converted to numeric form.
When no ordinal relationship exists, we use one-hot encoding, which encodes \(N\) categories using \(N\) binary (aka dummy) variables:
| Original | red.dummy | green.dummy | blue.dummy |
|----------|-----------|-------------|------------|
| red      | 1         | 0           | 0          |
| green    | 0         | 1           | 0          |
| blue     | 0         | 0           | 1          |
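In R, one way to build these columns is with model.matrix(); a minimal sketch (the data frame and its column name are made up for illustration):

```r
colours <- data.frame(Original = factor(c("red", "green", "blue")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
model.matrix(~ Original - 1, data = colours)
```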
Number of ~~Parameters~~ Weights
This model has 235,146 parameters (referred to as weights):
\(W_1\) has \(785×256 = 200,960\) weights
\(W_2\) has \(257 × 128 = 32,896\) weights
\(B\) has \(129×10 = 1290\) weights.
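A quick sanity check of that total in R:

```r
785*256 + 257*128 + 129*10   # = 235146
```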
Note that we have close to 4 times as many parameters as we do training observations (60k).
What should we be concerned about?
Overfitting
One of the most important aspects when training neural networks is avoiding overfitting.
To avoid overfitting, some regularization is needed.
As in our regression unit, regularization will be achieved by adding a ~~penalty term~~ regularization term to our loss function in order to penalize complexity.
This will effectively shrink (or remove) certain weights thereby making some hidden neurons negligible and reducing the overall complexity of the NN.
Regularization
Two popular regularization techniques are:
L1 regularization (aka LASSO regularization)
L2 regularization (aka Ridge regularization)
As you may be able to guess, L1 regularization can force some weights to become (exactly) zero, while L2 regularization shrinks the weights towards (but never exactly to) zero.
L1/L2 Regularization
L2/Ridge regularization uses the L2 norm \(|| W ||_2^2\) in its regularization term and adds it to the loss function: \[
\mathcal{L} + \frac{\alpha}{2}|| W ||_2^2
\]
L1/LASSO regularization uses the L1 norm \(|| W ||_1\) in its regularization term and adds it to the loss function: \[
\mathcal{L} + \alpha|| W ||_1
\]
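For the simple R tools used later in this lecture, the nnet() function offers a basic form of L2 regularization through its decay argument (weight decay). A minimal sketch, using the built-in iris data purely for illustration (the decay value 0.01 is an arbitrary choice):

```r
library(nnet)

# weight decay adds an L2 penalty on the weights; larger decay = more shrinkage
nn.reg <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01)
```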
Dropout
Another powerful option is dropout regularization.
We will give a brief overview using the loss function for regression, but these ideas extend to classification and to settings where regularization penalties are also applied.
Single Hidden Layer Revisited
Let's return to the single hidden layer example used in our first network diagram.
In this model the parameters are \(\beta = (\beta_0, \beta_1, \dots, \beta_K)\), as well as each of the \(w_k = (w_{k0}, w_{k1}, \dots, w_{kp})\), for \(k = 1, \dots, K\).
Given observations \((x_i, y_i)\), for \(i = 1, \dots, n\), we could fit the model by minimizing the loss function with respect to the parameters \(\beta\) and \(W\).
Nonconvex problem
While this might look like the minimization problem we had in linear regression, the following is not straightforward to minimize. \[
\mathcal{L}(\beta, W) = \sum_{i=1}^n (y_i - f(x_i))^2
\]
Furthermore, this problem is nonconvex in the parameters, and hence there are multiple solutions.
Gradient Descent
At the basic level, we adjust the weights so that error is reduced for the next iteration.
More technically, the optimization algorithm is navigating down the gradient (or slope) of error and seeks to change each weight proportionally to its effect on the RSS: \[\begin{equation}
\Delta w_{\cdots} = - \alpha \frac{ d RSS}{d w_{ \cdots}}
\end{equation}\] where \(\alpha\) is a learning rate.
Gradient Descent Algorithm
Take the derivative of the loss function for each parameter (i.e. find the gradient for the loss function)
Pick random values for parameters
Plug in the parameters values into the derivatives (i.e. gradient)
Update: compute the Step Size (learning rate \(\times\) gradient) and set New Parameters = Old Parameters \(-\) Step Size; repeat steps 3 and 4 until the steps become very small or a maximum number of iterations is reached (a minimal sketch of this loop is given below)
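Here is a minimal sketch of these steps in R, fitting a simple linear regression by gradient descent; everything below (the made-up data, starting values, learning rate, and number of iterations) is an arbitrary illustration:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)         # made-up data with known truth (2, 3)

alpha <- 0.1                        # learning rate
w0 <- 0; w1 <- 0                    # step 2: starting values for the parameters

for (iter in 1:200) {
  res   <- y - (w0 + w1 * x)        # current residuals
  grad0 <- -2 * mean(res)           # step 3: plug into the derivative w.r.t. w0
  grad1 <- -2 * mean(res * x)       # step 3: plug into the derivative w.r.t. w1
  w0    <- w0 - alpha * grad0       # step 4: new = old - (learning rate x gradient)
  w1    <- w1 - alpha * grad1
}
c(w0, w1)                           # close to the least-squares fit (about 2 and 3)
```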
Non-convexity
ISLR Fig 10.7: Illustration of gradient descent for one-dimensional \(\theta\). The objective function \(R(\theta)\) is not convex, and has two minima, one at \(\theta = -0.46\) (local), the other at \(\theta = 1.02\) (global). Starting at some value \(\theta_0\) (typically randomly chosen), each step in θ moves downhill — against the gradient — until it cannot go down any further. Here gradient descent reached the global minimum in 7 steps.
Learning Rate
The learning rate is akin to that which we saw in Boosting.
Essentially it is included so that the algorithm “learns slowly” and avoids overfitting.
Typically this value is very small (say ~0.1)
The algorithm is highly dependent on this value, and often it is set according to a “schedule” that starts off large and gets smaller and smaller.
Stochastic Gradient Descent
The gradient descent just described computes the gradient of the loss function using the entire dataset.
While accurate, this is computationally expensive
Instead, we use stochastic gradient descent, in which a small random subset of the data (called a batch) is used to compute an approximate gradient at each step.
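Continuing the gradient descent sketch from above (reusing its x and y), the only change is that each iteration computes the gradient on a random batch; the batch size of 10 is an arbitrary choice:

```r
alpha <- 0.1
w0 <- 0; w1 <- 0
for (iter in 1:2000) {
  batch <- sample(seq_along(y), size = 10)        # a random mini-batch of 10 rows
  res   <- y[batch] - (w0 + w1 * x[batch])        # residuals on the batch only
  w0    <- w0 - alpha * (-2 * mean(res))          # approximate gradient steps
  w1    <- w1 - alpha * (-2 * mean(res * x[batch]))
}
c(w0, w1)      # roughly (2, 3) again, but noisier: each step saw only 10 rows
```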
Backpropagation
Geoffrey Hinton, David Rumelhart and Ronald J. Williams pioneered the back-propagation algorithm in a pair of landmark papers published in 1986.
As a very brief overview: we work our way backwards through the NN (hence the name), computing the gradient of the loss with respect to each weight as we go.
Stochastic gradient descent is then used to update the weights using this gradient.
One popular way of fitting Neural Networks in R is to use the keras package
It can be a bit finicky to install on your computer, but you can follow the instructions from the textbook's website here
This package interfaces with the tensorflow package, which in turn links to efficient Python code.
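To give a flavour (we will not run this in class), here is a minimal keras-style sketch of the two-hidden-layer MNIST network described earlier, assuming the keras package and its Python backend are installed; the dropout rates and optimizer are illustrative choices:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%                      # dropout regularization
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")    # 10-class (digit) output

model %>% compile(
  loss = "categorical_crossentropy",   # the cross-entropy from earlier
  optimizer = "rmsprop",
  metrics = c("accuracy")
)
```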
For this course, we’ll just stick to the neuralnet and NeuralNetTools packages for our simple demonstrations.
Example: Body NN
We fit a neural network with one hidden layer containing four nodes (aka neurons/units) to predict recorded Gender.
library(gclus)
data(body)
# convert to a data frame so nnet's formula interface can use the data
sbod <- as.data.frame(cbind(scale(body[,1:24]), factor(body[,25])))
colnames(sbod)[25] <- "Gender"
# load the package for Neural Networks
library(nnet)
# size = nodes in hidden layer
# fits a single layer NN
nnbod2 <- nnet(factor(Gender)~., data=sbod, size=4)
We call to another package to produce a plot for it …
Plotting the NN
library(NeuralNetTools); plotnet(nnbod2)
iClicker
Parameter Count
How many parameters are in this NN? Note there are \(p\) = 24 predictors
\(24 \times 4 + 4 \times 1\)
\(25 \times 5 + 5 \times 1\)
\(24 \times 4 \times 1\)
\(25 \times 4 + 5 \times 1\) ✅
none of the above
Properties of the Network Diagram
For the NeuralNetTools package, positive weights show up as black, negative weights as grey.
Also the biases are shown as a separate node (B1 and B2) on the graph; for simpler NN, the weights would be shown on the lines connecting the nodes.
In this case, the magnitude of the weight shows up as the thickness of the line.
There are 25 (24 predictors + 1 bias term) x 4 (hidden layer units) + 5 (hidden layer units + 1 bias term) x 1 (output layer unit) = 105 weights
Example: Body NN
How does it perform??
table(body[,25], predict(nnbod2, type="class"))
1 2
0 260 0
1 0 247
0 misclassifications!
But of course, this is on the training set, so let’s do a training/testing set to approximate the long-run error.
Validation Approach with Body NN
Set up a quick training and testing set: sample approximately half the data for a 50-50 training/testing scenario, refit the 4-hidden-node NN to the training set, and predict on the test set, as sketched below …
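A minimal sketch of that validation approach (the seed and the exact 50-50 split are arbitrary choices; sbod, body, and nnet come from the code above):

```r
set.seed(2024)                                 # arbitrary seed for reproducibility
n <- nrow(sbod)
train <- sample(1:n, size = round(n / 2))      # roughly half the rows for training

# refit the single-layer, 4-hidden-node NN on the training half
nn.train <- nnet(factor(Gender) ~ ., data = sbod[train, ], size = 4)

# predict the held-out half and tabulate the test-set confusion matrix
preds <- predict(nn.train, newdata = sbod[-train, ], type = "class")
table(body[-train, 25], preds)
```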