# Deep Learning with Keras

Created: Notes on Deep Learning with Keras and TensorFlow

# Deep Learning and Neural Networks with Keras

Notes from this YouTube Series, Full course notes can be found on GitHub

## Overview of Neural Networks

A Neural Network takes in some kind of data and is able to handle and process data that other ML models are not really able to process

In a normal model you would pass in a 1D vector such as a list of predictors. With an NN you can pass in more complex data, and the model will place weight on the position as well as the value of each data point, which is something other models can't necessarily handle

Some examples of higher order data can be:

- 1D Vector - Normal input, like a row in a spreadsheet
- 2D Matrix - Grayscale image
- 3D Matrix - Colour image
- nD Matrix - Any higher order data

With traditional models we speak about regression or classification.

A regression network could have a single numerical output, while a classification network could have a set of binary outputs, one for each class (like one-hot), or a probability of the result being each of the possible outputs

Neural Networks are also capable of more complex outputs or even combinations of outputs

In general an NN consists of an input layer which takes in the input data, a few hidden layers which process the data, and an output layer which is our target outcome. Each layer passes weighted data to the next layer

There are usually these types of neurons:

- Input - get the input data
- Hidden - sit between input and output and do the abstract processing
- Output - the output that's calculated
- Context - hold state between calls to the network
- Bias Neurons - similar to a y-intercept, allow us to offset the data going into a neuron

Neural networks pass data to nodes using Activation functions, some common ones are:

- Rectified Linear Unit (ReLU) - used for hidden layers
- Softmax - output for classification
- Linear - for regression

The Bias Neuron along with a Weight allow us to move and scale our activation functions
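As a rough illustration of the activation functions listed above, the sketch below implements them with NumPy and shows how a weight scales and a bias shifts the input to an activation. The `neuron` helper and the sample values are made up for this example and are not part of Keras.

```python
import numpy as np

def relu(z):
    # ReLU: pass positive values through, clamp negatives to zero
    return np.maximum(0.0, z)

def softmax(z):
    # Softmax: turn raw scores into probabilities that sum to 1
    # (subtracting the max keeps the exponentials numerically stable)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def linear(z):
    # Linear: identity, typically used for regression outputs
    return z

def neuron(x, weight, bias, activation):
    # Hypothetical single-input neuron: the weight scales the input
    # and the bias shifts it before the activation is applied
    return activation(weight * x + bias)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(neuron(x, weight=1.0, bias=0.0, activation=relu))  # plain ReLU
print(neuron(x, weight=1.0, bias=1.0, activation=relu))  # shifted by the bias
print(neuron(x, weight=2.0, bias=0.0, activation=relu))  # scaled by the weight
print(softmax(np.array([1.0, 2.0, 3.0])))                # probabilities
```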

## Tensorflow and Keras

TensorFlow is the low-level library for Neural Networks, and Keras is an API that sits on top of TF and allows you to interact with it at a higher level

At the time these notes were written, the current version of TF required Python 3.7, so it's simplest to align with whichever Python version your TF release supports

TensorBoard is a way to visualize Neural Networks

### Using Tensorflow Directly

#### Simple Matrix Multiplication

```
import tensorflow as tf
```

```
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
product = tf.matmul(matrix1, matrix2)
print(product)
float(product)
```

```
12.0
```

#### Using Variables

Variables can be created, used, reassigned with `assign`, and then used to recalculate a result:

```
x = tf.Variable([1., 2.])
a = tf.constant([3., 3.])
print(tf.subtract(x, a).numpy())
```

```
x.assign([4., 6.])
print(tf.subtract(x, a).numpy())
```

### Using Keras with MPG Dataset

Keras enables us to think about the layers in an NN. We'll use the Miles Per Gallon dataset for this example

```
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn import metrics
```

```
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
COLUMN_NAMES = [
    'mpg',
    'cylinders',
    'displacement',
    'horsepower',
    'weight',
    'acceleration',
    'model year',
    'origin',
    'car name'
]
```

```
# Read the fixed-width data file, treating 'NA' and '?' as missing values
df = pd.read_fwf(
    DATA_URL,
    names=COLUMN_NAMES,
    na_values=['NA', '?']
)
# fill missing
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())
```

```
df.head()
```

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | "chevrolet chevelle malibu" |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | "buick skylark 320" |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | "plymouth satellite" |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | "amc rebel sst" |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | "ford torino" |

```
X = df.drop(['mpg', 'car name'], axis=1).values
y = df[['mpg']].values
```

### Build Regression Model with Keras

When building a Neural Network we take the following steps:

- Create a Sequential model
- Define the Hidden Layers
- Define the Output Layer
- Compile and Train the Model

#### 1. Create Sequential

```
model = Sequential()
```

#### 2. Define Hidden Layers

Define the first hidden layer with the `input_dim` to be the shape of our input data set (the number of `X` columns in this case)

A dense layer is one where each neuron is connected to every neuron in the next layer

```
model.add(Dense(25, input_dim=X.shape[1], activation='relu'))
model.add(Dense(10, activation='relu'))
```

#### 3. Define the Output Layer

This depends on the dimensionality of the output, similar to the input. For this case it is one dimensional

```
model.add(Dense(1))
```

#### 4. Compile and train the model

We specify a `loss` function and an `optimizer` for the model, and then give it the `X` and `y` values to train on as well as how many `epochs` we want it to train for

For a Regression NN you usually use MSE as the loss

We can also make use of methods to increase the model's effectiveness and identify the optimal number of epochs

```
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, verbose=0, epochs=400)
```

```
<tensorflow.python.keras.callbacks.History at 0x2041a5ff688>
```

#### Test the Model

```
y_pred = model.predict(X)
# Taking the square root of the MSE gives the RMSE
score = np.sqrt(metrics.mean_squared_error(y, y_pred))
'RMSE: ' + str(score)
```

```
'RMSE: 3.4881811444565303'
```

### Build a Classification Model with Keras

Building a Classification Model is much the same, however we need to ensure that we one-hot encode our categorical values, and in this case we'll have a categorical output which means more than one potential result

For this we're making use of the Iris Dataset

For a multi-class classification we use `softmax` and `categorical_crossentropy`

For a binary classification we can instead use an applicable loss and activation (e.g. `binary_crossentropy` with `sigmoid`)
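To make the encoding and loss a little more concrete, here is a small NumPy sketch of one-hot encoding and the categorical cross-entropy calculation. The labels and probabilities are invented for illustration, and the helper functions are hypothetical stand-ins rather than the Keras implementations.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Each row gets a 1 in the column of its class and 0 elsewhere
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    # Mean negative log of the probability assigned to the true class
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

labels = np.array([0, 2, 1])            # made-up class indices
y_true = one_hot(labels, num_classes=3)

y_prob = np.array([                     # made-up softmax outputs
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])

print(y_true)
print(categorical_crossentropy(y_true, y_prob))
```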

```
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
    'sepal length',
    'sepal width',
    'petal length',
    'petal width',
    'class'
]
```

```
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
```

```
df.head()
```

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |

```
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
```

```
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, verbose=0, epochs=100)
```

```
<tensorflow.python.keras.callbacks.History at 0x2041b8c9c88>
```

```
y_pred = model.predict(X)
predict_classes = np.argmax(y_pred,axis=1)
expected_classes = np.argmax(y,axis=1)
print(f"Predictions: {predict_classes}")
print(f"Expected: {expected_classes}")
print(species[predict_classes[1:10]])
score = metrics.accuracy_score(expected_classes,predict_classes)
'Accuracy: ' + str(score)
```

```
'Accuracy: 0.9733333333333334'
```

## Saving and Loading Neural Networks

We can store a trained network in a few different formats. The one used here is the `HDF5` format, which stores both the structure and the weights of the network

### Save Model

We can save the model we just trained with:

```
MODEL_SAVE_PATH = './exported-models/iris-model.h5'
```

```
model.save(MODEL_SAVE_PATH)
```

### Load Model

```
from tensorflow.keras.models import load_model
```

```
loaded_model = load_model(MODEL_SAVE_PATH)
loaded_model
```

```
<tensorflow.python.keras.engine.sequential.Sequential at 0x2041b8141c8>
```

## Early Stopping to prevent Overfitting

We can make use of test/train sets to help us prevent overfitting. They do this by helping us identify when to stop training the network

It's important that we stop and keep the model at the point where it is well fitted, rather than continuing to train past it

Data is usually split into the following sets:

- Test
- Train
- Holdout

If we have a lot of data we can even try to have multiple test and train sets

To train the model we'll do the normal preprocessing and model definition as before, and then we'll implement `EarlyStopping` from Keras when doing the `model.fit` portion

### Categorical

```
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
```

#### Preprocessing and Model Definition

```
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
    'sepal length',
    'sepal width',
    'petal length',
    'petal width',
    'class'
]
```

```
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
```

```
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

#### Train/Test Split

```
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0
)
```

#### Train the Model

The below applies to both categorical and regression models

We can train the model using an `EarlyStopping` callback, in this we specify:

- The metric we want to monitor for change (`monitor`), `val_loss` uses the validation `loss` we defined for the model as the metric
- The minimum change we want to count as an improvement (`min_delta`), this will not have much of an impact if made smaller
- The number of rounds we want the delta to be small for before stopping (`patience`)
- The mode (`mode`), which is whether to minimize or maximize the monitored metric, usually keep this at `auto`
- Restore best weights automatically (`restore_best_weights`), always keep this at `True`

```
monitor = EarlyStopping(
    monitor='val_loss',
    min_delta=1e-3,
    patience=50,
    verbose=1,
    mode='auto',
    restore_best_weights=True
)
```

```
model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    callbacks=[monitor],
    verbose=0,
    epochs=1000
)
```

```
<tensorflow.python.keras.callbacks.History at 0x204254a7108>
```
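The stopping rule itself can be sketched in plain Python. This is a simplified illustration of the `min_delta` and `patience` logic, assuming a loss that should be minimized; it is not Keras's actual implementation, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, min_delta=1e-3, patience=3):
    """Return (stopped_epoch, best_epoch) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement larger than min_delta: remember it, reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No real improvement for `patience` epochs: stop here
                return epoch, best_epoch
    # Ran through every epoch without triggering early stopping
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 3
losses = [0.90, 0.60, 0.45, 0.44, 0.46, 0.47, 0.48, 0.30]
stopped, best = early_stopping_epoch(losses, min_delta=1e-3, patience=3)
print(stopped, best)
```

Restoring the best weights corresponds to going back to the model as it was at `best_epoch`, not at the epoch where training stopped.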

#### Measure the Accuracy

```
y_pred = model.predict(X_test)
predicted_classes = np.argmax(y_pred, axis=1)
expected_classes = np.argmax(y_test, axis=1)
score = metrics.accuracy_score(expected_classes, predicted_classes)
'Accuracy: ' + str(score)
```

```
'Accuracy: 0.9736842105263158'
```

## Feature Vectors and Tabular Data

All data that comes into a Neural Network must be numerical

Some of the processing we will typically do are:

- Convert categorical values to dummies (features and target)
- Drop any columns like ID, etc.
- Get all the different numerical data to be in the same range
- Center numerical data around a mean of zero
- Fill missing values as appropriate for the relevant data
- If we have missing data in the target column we should drop those rows

We can use a Z-score to work with points 3 and 4
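A minimal sketch of the Z-score calculation, using the `weight` values from the first rows of the MPG data shown earlier. The `zscore` helper is written here for illustration.

```python
import numpy as np

def zscore(column):
    # Subtract the mean and divide by the standard deviation so the
    # column ends up centred on zero with a standard deviation of one
    return (column - column.mean()) / column.std()

weight = np.array([3504.0, 3693.0, 3436.0, 3433.0, 3449.0])
scaled = zscore(weight)

print(scaled)
print(scaled.mean(), scaled.std())
```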

## Classification Metrics

Sometimes we care about factors other than just the accuracy, such as the counts of false positives or false negatives

### ROC Values

- False Positives
- False Negatives
- True Positives
- True Negatives

These can also be described as Type-1 and Type-2 Errors as well as Test Sensitivity and Specificity

A more sensitive NN will lead to more false positives, and a more specific NN will lead to fewer false positives

A ROC chart compares our model to random predictions, the higher up our line is the more accurate our model. We measure the area under this curve to get the AUC value. If our model falls below the `0.5` mark (below the random line) it means our model is doing worse than a random guess (which is really bad)
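The sketch below counts the four outcomes and computes sensitivity, specificity and AUC with plain NumPy on made-up labels and scores. The helper names are hypothetical; in practice `sklearn.metrics.roc_auc_score` gives the AUC directly.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    # Count each of the four outcome types for a binary classifier
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def auc(y_true, y_score):
    # AUC as the probability that a random positive example
    # scores higher than a random negative one (ties count half)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

y_true = np.array([0, 0, 1, 1, 0, 1])                # made-up labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # made-up probabilities
y_pred = (y_score >= 0.5).astype(int)                # threshold at 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(tp, tn, fp, fn)
print(sensitivity, specificity, auc(y_true, y_score))
```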

### Log Loss

With a Log Loss calculation we can get a sort of accuracy score that's more harsh on overconfidence
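As a small illustration of that harshness, the sketch below compares two made-up sets of binary predictions that have the same accuracy but different confidence. The `log_loss` helper is written here in NumPy; `sklearn.metrics.log_loss` is the library equivalent.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    # Clip so that log(0) never occurs, then average the negative
    # log-probability assigned to the correct class
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(
        y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    ))

y_true = np.array([1, 0, 1, 1])

# Both models get the last example wrong, so both are 75% accurate
cautious = np.array([0.7, 0.3, 0.7, 0.3])
overconfident = np.array([0.99, 0.01, 0.99, 0.01])

print(log_loss(y_true, cautious))
print(log_loss(y_true, overconfident))  # higher: the confident mistake costs more
```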

### Confusion Matrix

This compares our predicted values to the actual values, in this we would ideally want to see a strong diagonal correlation
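A minimal NumPy sketch of building a confusion matrix from class indices is shown below, using made-up expected and predicted classes. The helper is hypothetical; `sklearn.metrics.confusion_matrix` does the same job.

```python
import numpy as np

def confusion_matrix(expected, predicted, num_classes):
    # Rows are the actual classes, columns are the predicted classes
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (expected, predicted), 1)
    return matrix

expected = np.array([0, 0, 1, 1, 2, 2, 2])    # made-up actual classes
predicted = np.array([0, 0, 1, 2, 2, 2, 1])   # made-up predictions

cm = confusion_matrix(expected, predicted, num_classes=3)
print(cm)

# Correct predictions sit on the diagonal
print(np.trace(cm) / cm.sum())
```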

## Regression Metrics

When working with regression models there are different metrics that we can use

### Mean Squared Error and Root Mean Squared Error

We usually work with the MSE value, which is relative to the scale of our dataset. Taking the square root of this gives us the RMSE, which tells us how close we are to our actual value in the same units as our target data
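A short worked example, using the `mpg` values from the first rows of the MPG data and a set of made-up predictions:

```python
import numpy as np

y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])  # actual mpg values
y_pred = np.array([20.0, 14.0, 17.0, 19.0, 17.0])  # made-up predictions

# MSE: the average of the squared errors (units are mpg squared)
mse = float(np.mean((y_true - y_pred) ** 2))

# RMSE: the square root brings the error back into mpg
rmse = float(np.sqrt(mse))

print(mse, rmse)
```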

### Lift Chart

A Lift chart is a way to compare our model output to the actual test data in order to see how our model compares over specific value ranges in the target vector
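One common way to prepare the data for such a chart is to sort both series by the actual value, so that plotting them together shows where the predictions drift. The sketch below only prepares the data; the helper name and the predictions are made up, and the plotting step is left out.

```python
import numpy as np

def lift_chart_data(y_true, y_pred):
    # Sort both series by the actual value so they can be plotted
    # against the same x-axis (position in the sorted order)
    order = np.argsort(y_true, kind="stable")
    return y_true[order], y_pred[order]

y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])  # actual values
y_pred = np.array([20.0, 14.0, 17.0, 19.0, 17.0])  # made-up predictions

sorted_true, sorted_pred = lift_chart_data(y_true, y_pred)
print(sorted_true)
print(sorted_pred)
```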

## Backpropagation

We have a few types of backpropagation which we use when training a model

- Classic - using gradient descent with a learning rate (e.g. 0.1, 0.01, 0.001)
- Momentum - pushes weights in order to avoid local minimums (e.g. 0.9)
- Batch and Online - update weights in batches instead of every iteration
- Stochastic Gradient Descent - Often used with batching, network trained on differing sets of the data and decreases overfitting by focusing on a smaller number of weights
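To show how the learning rate and momentum values fit into a weight update, here is a sketch of gradient descent on a toy one-dimensional loss. The loss function, the `descend` helper, and the step count are chosen for illustration only.

```python
def gradient(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, which has its minimum at w = 3
    return 2.0 * (w - 3.0)

def descend(learning_rate, momentum=0.0, steps=300, w=0.0):
    velocity = 0.0
    for _ in range(steps):
        # Momentum keeps a fraction of the previous update, and the
        # learning rate scales the new gradient step
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w

print(descend(learning_rate=0.1))                  # classic gradient descent
print(descend(learning_rate=0.01, momentum=0.9))   # gradient descent with momentum
```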

Additionally we have a few methods that can help us to automate certain hyperparameters:

- Resilient Propagation - uses only the sign of the gradient and allows each weight its own update step size
- Nesterov accelerated gradient - helps mitigate the risk of choosing a bad batch
- Adagrad - allows an automatically decaying learning rate per weight
- Adadelta - based on Adagrad, addresses its monotonically decreasing learning rate

There are also other non gradient methods such as:

- Simulated Annealing
- Genetic Algorithm
- Particle Swarm Optimization
- Nelder Mead

Some interesting diagrams comparing different algorithms

## Regularization

Regularization is used to combat overfitting. The two main types of regularization we have are Lasso (L1) and Ridge (L2)

L1 regularization can help a network focus on the important factors

The `alpha` value lets us say how important the regularization is to our model. In general a higher `alpha` will cause the model to have a lower accuracy on the training data but help prevent overfitting
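The penalty terms that `alpha` scales can be sketched as follows. The weights are made up, and the helpers show only the penalty that gets added to the loss, not a full model.

```python
import numpy as np

def l1_penalty(weights, alpha):
    # Lasso: alpha times the sum of the absolute weights
    return float(alpha * np.sum(np.abs(weights)))

def l2_penalty(weights, alpha):
    # Ridge: alpha times the sum of the squared weights
    return float(alpha * np.sum(weights ** 2))

weights = np.array([0.5, -2.0, 0.0, 3.0])   # made-up model weights
alpha = 0.1

print(l1_penalty(weights, alpha))
print(l2_penalty(weights, alpha))
```

Because the L2 term squares each weight, the large weights dominate its penalty, while the L1 term grows at the same rate for every weight.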

### Lasso (L1)

```
from sklearn.linear_model import Lasso
model = Lasso(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
```

```
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=0,
selection='cyclic', tol=0.0001, warm_start=False)
```

### Ridge (L2)

L2 regularization (Ridge) penalizes the squared size of the weights, so it shrinks large weights strongly but, unlike L1, rarely pushes any weight all the way to zero

```
from sklearn.linear_model import Ridge
model = Ridge(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
```

```
Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=0, solver='auto', tol=0.001)
```

### ElasticNet

ElasticNet uses a combination of L1 and L2 regularization

```
from sklearn.linear_model import ElasticNet
model = ElasticNet(random_state=0, alpha=0.1)
model.fit(X_train, y_train)
```

```
ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
```

### Dropout

Dropout is another method of regularization and is applied during training

When using dropout we disable random neurons during each training step to prevent them from becoming too specialized. This helps us to prevent overfitting as well as reduce the variance in the overall trained network

The dropped neurons are re-added once the training is complete

In order to use dropout in Keras we can add a `Dropout` layer with a value for what fraction of neurons we want to be dropped out

The suggestion is usually not to use a dropout after the final hidden layer
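What a dropout layer does to a batch of activations can be sketched in NumPy as below. This is an illustrative version using the common "inverted dropout" scaling (surviving values are scaled up by `1 / (1 - rate)`); the `dropout` function and its arguments are made up for the example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    if not training:
        # Once training is complete every neuron is active again
        return activations
    # Keep each neuron with probability (1 - rate), then rescale the
    # survivors so the expected total activation stays the same
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

dropped = dropout(activations, rate=0.5, rng=rng)
print((dropped == 0).mean())   # roughly half the neurons are disabled
print(dropped.mean())          # the average activation stays close to 1
```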

```
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
COLUMN_NAMES = [
    'sepal length',
    'sepal width',
    'petal length',
    'petal width',
    'class'
]
```

```
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
```

```
df = pd.read_csv(DATA_URL, names=COLUMN_NAMES)
X = df.drop('class', axis=1).values
dummies = pd.get_dummies(df['class'])
species = dummies.columns
y = dummies.values
```

```
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0
)
```

Below we will train the model, the model has the following:

- 1st Dense Layer with 50 neurons and ReLU activation
- A dropout of 50%
- 2nd Dense Layer with 25 neurons, ReLU, and an L1 Regularization
- An Output Layer with the categories and Softmax activation

```
model = Sequential()
model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(
    25,
    activation='relu',
    activity_regularizer=regularizers.l1(1e-4)
))
model.add(Dense(y.shape[1], activation='softmax'))
```

```
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

```
model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    verbose=0,
    epochs=100
)
```

```
<tensorflow.python.keras.callbacks.History at 0x1804e5a00c8>
```

```
y_pred = model.predict(X_test)
```

```
predicted_classes = np.argmax(y_pred, axis=1)
expected_classes = np.argmax(y_test, axis=1)
print(predicted_classes)
print(expected_classes)
```

```
score = metrics.accuracy_score(expected_classes, predicted_classes)
'Accuracy: ' + str(score)
```

```
'Accuracy: 0.9736842105263158'
```

## Benchmarking and Regularization

So far we've seen that the performance of a network is based on the following:

- Number of layers
- How many neurons per layer
- Activation functions for each layer
- Dropout per layer
- L1 and L2 Regularization

There are additional parameters that can also influence the network

Due to the different parameters and the random nature of a network it can be difficult to see if our change in hyperparameters is actually impacting the output of a network

We can use something called Bootstrapping, which is similar to cross validation but samples with replacement, together with early stopping to see how well our network converges on average and how many epochs this takes

```
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.utils import resample
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
```

```
SPLITS = 15
```

### Bootstrap

For a Regression model we can use:

```
boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1)
```

and then:

```
for train, test in boot.split(X):
    # train and evaluate the model on X[train] / X[test] here
    pass
```

However, for a categorical classification we want to ensure that we have a class balance in each split. We can do this with `StratifiedShuffleSplit`

Note that the `stopped_epoch` of the `EarlyStopping` monitor is `0` if the training was not stopped early (i.e. it trained until the end)

The stratified split is created like so:

```
boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.2)
```

### Progress Tracking

```
accuracy_tracker = []
epoch_tracker = []
```

```
for train, test in boot.split(X, y):  # using the data from the last import
    X_train = X[train]
    X_test = X[test]
    y_train = y[train]
    y_test = y[test]

    # Build a fresh model for every split
    model = Sequential()
    model.add(Dense(50, input_dim=X.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(
        25,
        activation='relu'
    ))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    monitor = EarlyStopping(
        monitor='val_loss',
        min_delta=1e-3,
        patience=50,
        verbose=0,
        mode='auto',
        restore_best_weights=True
    )

    model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        callbacks=[monitor],
        verbose=0,
        epochs=1000
    )

    # stopped_epoch is 0 when training ran for the full 1000 epochs
    epoch_tracker.append(
        monitor.stopped_epoch if monitor.stopped_epoch > 0 else 1000
    )

    y_pred = model.predict(X_test)
    predicted_classes = np.argmax(y_pred, axis=1)
    expected_classes = np.argmax(y_test, axis=1)
    score = metrics.accuracy_score(expected_classes, predicted_classes)
    accuracy_tracker.append(score)
```

```
pd.DataFrame({
    "Score": accuracy_tracker,
    "Epochs": epoch_tracker
}).describe()
```

| | Score | Epochs |
|---|---|---|
| count | 15.000000 | 15.000000 |
| mean | 0.993333 | 350.200000 |
| std | 0.013801 | 76.766064 |
| min | 0.966667 | 226.000000 |
| 25% | 1.000000 | 308.000000 |
| 50% | 1.000000 | 321.000000 |
| 75% | 1.000000 | 397.000000 |
| max | 1.000000 | 528.000000 |
