Machine Learning with Python

Introduction to Machine Learning with Python


Based on this Cognitive Class Course

Labs

The labs for the course are located in the Labs folder; they are from CognitiveClass and are licensed under MIT

Intro to ML

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed

Some popular techniques are:

  • Regression for predicting continuous values
  • Classification for predicting a class/category
  • Clustering for finding the structure of data and for summarization
  • Associations for finding items/events that co-occur
  • Anomaly detection for finding abnormal/unusual cases
  • Sequence mining for predicting the next value in a sequence
  • Dimension reduction for reducing the number of features in the data
  • Recommendation systems

We have a few different buzzwords

  • AI
    • Computer Vision
    • Language processing
    • Creativity
  • Machine learning
    • Field of AI
    • Experience based
    • Classification
    • Clustering
    • Neural Networks
  • Deep Learning
    • Specialized case of ML
    • More automation than most ML

Python for Machine Learning

Python has many different libraries for machine learning such as

  • NumPy
  • SciPy
  • Matplotlib
  • Pandas
  • Scikit Learn

Supervised vs Unsupervised

Supervised learning involves us supervising a machine learning model. We do this by teaching the model with a labelled dataset

There are two types of supervised learning, namely Classification and Regression

Unsupervised learning is when the model works on its own to discover information about data using techniques such as Dimension Reduction, Density Estimation, Market Basket Analysis, and Clustering

  • Supervised
    • Classification
    • Regression
    • More evaluation methods
    • Controlled environment
  • Unsupervised
    • Clustering
    • Fewer evaluation methods
    • Less controlled environment

Regression

Regression makes use of two types of variables

  • Independent - Predictors $X$
  • Dependent - Target $Y$

With regression our target $Y$ must be continuous, while the predictor $X$ values can be continuous, discrete, or categorical

There are two types of regression:

  • Simple Regression
    • Simple Linear Regression
    • Simple Non-Linear Regression
    • Single $X$
  • Multiple Regression
    • Multiple Linear Regression
    • Multiple Non-Linear Regression
    • Multiple $X$

Regression is used when the target is continuous, and is well suited to predicting continuous values

There are many regression algorithms such as

  • Ordinal regression
  • Poisson regression
  • Fast forest quantile regression
  • Linear, polynomial, lasso, stepwise, and ridge regression
  • Bayesian linear regression
  • Neural network regression
  • Decision forest regression
  • Boosted decision tree regression
  • K nearest neighbors (KNN)

Each of these is better suited to some circumstances than to others

Simple Linear Regression

In SLR we have two variables, one dependent and one independent. The target variable ($y$) must be continuous, while the predictor ($x$) can be continuous or categorical

To get a better idea of whether SLR is appropriate we can simply plot $x$ vs $y$ and find the line which best fits the data

The line is represented by the following equation

$y=\theta_0+\theta_1x_1$

The aim of SLR is to adjust the $\theta$ values to minimize the residual error in our data and find the best fit, for example by minimizing the Mean Squared Error (MSE)

$MSE=\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2$

Estimating Parameters

We have two options to estimate our parameters, given an SLR problem: solve for them directly with the equations below, or use an optimization approach such as gradient descent

Estimate $\theta_0$ and $\theta_1$ using the following equations

$\theta_1 = \frac{\Sigma_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\Sigma_{i=1}^n(x_i-\bar x)^2}$

$\theta_0=\bar y-\theta_1\bar x$

We can use these values to make predictions with the equation

$\hat y=\theta_0+\theta_1x_1$
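
As a quick illustration of these formulas, here is a minimal NumPy sketch; the x and y arrays are made-up example values rather than data from the labs

import numpy as np

# made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# theta_1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

# theta_0 = y_bar - theta_1 * x_bar
theta_0 = y.mean() - theta_1 * x.mean()

# predict y_hat for a new x value
print(theta_0, theta_1, theta_0 + theta_1 * 6.0)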

Pros

  • Fast
  • Easy to Understand
  • No tuning needed

Lab

Import Necessary Libraries
%reset -f
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import numpy as np
%matplotlib inline
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv')
df.head()
MODELYEAR MAKE MODEL VEHICLECLASS ENGINESIZE CYLINDERS TRANSMISSION FUELTYPE FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
0 2014 ACURA ILX COMPACT 2.0 4 AS5 Z 9.9 6.7 8.5 33 196
1 2014 ACURA ILX COMPACT 2.4 4 M6 Z 11.2 7.7 9.6 29 221
2 2014 ACURA ILX HYBRID COMPACT 1.5 4 AV7 Z 6.0 5.8 5.9 48 136
3 2014 ACURA MDX 4WD SUV - SMALL 3.5 6 AS6 Z 12.7 9.1 11.1 25 255
4 2014 ACURA RDX AWD SUV - SMALL 3.5 6 AS6 Z 12.1 8.7 10.6 27 244
Data Exploration
df.describe()
MODELYEAR ENGINESIZE CYLINDERS FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
count 1067.0 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000
mean 2014.0 3.346298 5.794752 13.296532 9.474602 11.580881 26.441425 256.228679
std 0.0 1.415895 1.797447 4.101253 2.794510 3.485595 7.468702 63.372304
min 2014.0 1.000000 3.000000 4.600000 4.900000 4.700000 11.000000 108.000000
25% 2014.0 2.000000 4.000000 10.250000 7.500000 9.000000 21.000000 207.000000
50% 2014.0 3.400000 6.000000 12.600000 8.800000 10.900000 26.000000 251.000000
75% 2014.0 4.300000 8.000000 15.550000 10.850000 13.350000 31.000000 294.000000
max 2014.0 8.400000 12.000000 30.200000 20.500000 25.800000 60.000000 488.000000
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(10)
ENGINESIZE CYLINDERS FUELCONSUMPTION_COMB CO2EMISSIONS
0 2.0 4 8.5 196
1 2.4 4 9.6 221
2 1.5 4 5.9 136
3 3.5 6 11.1 255
4 3.5 6 10.6 244
5 3.5 6 10.0 230
6 3.5 6 10.1 232
7 3.7 6 11.1 255
8 3.7 6 11.6 267
9 2.4 4 9.2 212
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
plt.title('CO2 Emission vs Fuel Consumption')
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Engine Size')
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Cylinders')
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()
Test-Train Split

We need to split our data into a test set and a train set

tt_mask = np.random.rand(len(df)) < 0.8
train = cdf[tt_mask].reset_index()
test = cdf[~tt_mask].reset_index()
Simple Regression Model

We can look at the distribution of the Engine Size in our training and test set respectively as follows

plt.title('CO2 Emissions vs Engine Size for Test and Train Data')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.scatter(test.ENGINESIZE, test.CO2EMISSIONS,color='red',label='test')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Modeling
from sklearn import linear_model
lin_reg = linear_model.LinearRegression()
train_x = train[['ENGINESIZE']]
train_y = train[['CO2EMISSIONS']]

test_x = test[['ENGINESIZE']]
test_y = test[['CO2EMISSIONS']]
lin_reg.fit(train_x, train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
'Coefficients: ' + str(lin_reg.coef_) + ' Intercept: ' + str(lin_reg.intercept_)
'Coefficients: [[ 39.30964622]] Intercept: [ 124.8710344]'

We can plot the line on our data to see the fit

plt.title('CO2 Emissions vs Engine Size, Training and Fit')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.plot(train_x, lin_reg.coef_[0,0]*train_x + lin_reg.intercept_[0],color='red',label='regression')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Model Evaluation
Import Packages
from sklearn.metrics import r2_score
Predict the CO2 Emissions
predicted_y = lin_reg.predict(test_x)
Display Results
results = pd.DataFrame()

results[['ENGINESIZE']] = test_x
results[['ACTUALCO2']] = test_y
results[['PREDICTEDCO2']] = pd.DataFrame(predicted_y)
results[['ERROR']] = pd.DataFrame(np.abs(predicted_y - test_y))
results[['SQUAREDERROR']] = pd.DataFrame((predicted_y - test_y)**2)

results.head()
ENGINESIZE ACTUALCO2 PREDICTEDCO2 ERROR SQUAREDERROR
0 5.9 359 356.797947 2.202053 4.849037
1 2.0 230 203.490327 26.509673 702.762771
2 2.0 230 203.490327 26.509673 702.762771
3 2.0 214 203.490327 10.509673 110.453230
4 5.2 409 329.281195 79.718805 6355.087912
Model Evaluation
MAE = np.mean(results[['ERROR']])
MSE = np.mean(results[['SQUAREDERROR']])
R2  = r2_score(test_y, predicted_y)

print("Mean absolute error: %.2f" % MAE)
print("Residual sum of squares (MSE): %.2f" % MSE)
print("R2-score: %.2f" % R2)

Multiple Linear Regression

In reality multiple independent variables will define a specific target. MLR is simply an extension of the SLR model

MLR is useful for solving problems such as

  • Determining the impact of independent variables on the effectiveness of the prediction
  • Predicting the impact of change in a specific variable

MLR makes use of multiple predictors to predict the target value, and is generally of the form

$\hat y=\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n$

$\hat y=\theta^TX$

$\theta$ is a vector of coefficients (also called the parameters or weight vector) which is multiplied by $X$, the feature set. The idea with MLR is to find the best-fit hyperplane for our data

Estimating Parameters

We have a few ways to estimate the best parameters, such as

  • Ordinary Least Squares
    • Linear algebra
    • Not suited to large datasets
  • Gradient Descent
    • Good for large datasets
  • Other methods are available to do this as well
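
As a rough sketch of the Ordinary Least Squares option above, the normal equation can be written directly in NumPy; the X and y arrays below are made-up example values, and for large datasets gradient descent or sklearn's LinearRegression would be preferable

import numpy as np

# made-up example data: 5 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([8.0, 7.0, 18.0, 17.0, 25.0])

# add a column of ones so that theta_0 (the intercept) is part of theta
X_b = np.c_[np.ones(len(X)), X]

# normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# predictions: y_hat = X theta
y_hat = X_b @ theta
print(theta, y_hat)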

How Many Variables?

Making use of more variables will generally increase the accuracy of the model, however using too many variables without good justification can lead to us overfitting the model

We can make use of categorical variables if we convert them to numeric values

MLR assumes that we have a linear relationship between the dependent and independent variables

Model Evaluation

We have to perform regression evaluation when building a model

Train and Test on the Same Dataset

We use all of our data to train our model, and then compare the predicted values to the actual values for that same data

The error of the model is then the average difference between the predicted and actual values

This approach has a high training accuracy, but a lower out-of-sample accuracy

Aiming for a very high training accuracy can lead to overfitting to the training data, resulting in poor out-of-sample accuracy

Train/Test Split

We split our data into a portion for testing and a portion for training, these two sets are mutually exclusive and allow us to get a good idea of what our out-of-sample accuracy will be

Generally we would retrain the model on the testing data afterwards, once evaluation is done, in order to make use of all the data and increase accuracy

K-Fold Cross-Validation

This involves splitting the dataset into several pieces (folds) and using each piece in turn as the test set while training on the rest, then averaging the results to get a more reliable estimate of out-of-sample accuracy
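
A minimal sketch of K-fold cross-validation with scikit-learn's cross_val_score, assuming the cdf DataFrame built in the regression lab above is still in scope; the choice of a 5-fold split and the R^2 scoring are my own

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation of a simple linear regression on the fuel data
scores = cross_val_score(LinearRegression(),
                         cdf[['ENGINESIZE']], cdf['CO2EMISSIONS'],
                         cv=5, scoring='r2')

print(scores)         # R^2 for each fold
print(scores.mean())  # aggregated estimate of out-of-sample performance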

Evaluation Metrics

Evaluation metrics are used to evaluate the performance of a model; they provide insight into areas of the model that require attention

Errors

In the context of regression, the error is the difference between the data points and the values determined by the model

Some of the main error equations are defined below

$MAE=\frac{1}{n}\Sigma_{i=1}^n|y_i-\hat y_i|$

$MSE=\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2$

$RMSE=\sqrt{\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2}$

$RAE=\frac{\Sigma_{i=1}^n|y_i-\hat y_i|}{\Sigma_{i=1}^n|y_i-\bar y|}$

$RSE=\frac{\Sigma_{i=1}^n(y_i-\hat y_i)^2}{\Sigma_{i=1}^n(y_i-\bar y)^2}$

Fit

$R^2$ helps us see how closely our data is represented by a specific regression line, and is defined as

$R^2=1-RSE$

Or

$R^2=1-\frac{\Sigma_{i=1}^n(y_i-\hat y_i)^2}{\Sigma_{i=1}^n(y_i-\bar y)^2}$

A higher $R^2$ represents a better fit
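
These metrics are easy to compute directly with NumPy. A minimal sketch, where y and y_hat are stand-in arrays of actual and predicted values (the numbers are made up for illustration)

import numpy as np

def regression_metrics(y, y_hat):
    # y: actual values, y_hat: predicted values (equal-length NumPy arrays)
    error = y - y_hat
    mae  = np.mean(np.abs(error))
    mse  = np.mean(error**2)
    rmse = np.sqrt(mse)
    rae  = np.sum(np.abs(error)) / np.sum(np.abs(y - y.mean()))
    rse  = np.sum(error**2) / np.sum((y - y.mean())**2)
    r2   = 1 - rse
    return mae, mse, rmse, rae, rse, r2

y     = np.array([196.0, 221.0, 136.0, 255.0, 244.0])  # made-up actual values
y_hat = np.array([200.0, 215.0, 140.0, 250.0, 248.0])  # made-up predictions
print(regression_metrics(y, y_hat))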

Non-Linear Regression

Not all data can be modelled well with a straight regression line; we have many different regression forms to fit more complex data

Polynomial Regression

Polynomial Regression is a method with which we can fit a polynomial to our data. It is still possible for us to solve a polynomial regression by transforming it into a multi-variable linear regression problem as follows

Given the polynomial

$\hat y=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3$

We can create new variables which represent the different powers of our initial variable

$x_1=x$

$x_2=x^2$

$x_3=x^3$

Therefore resulting in the following linear equation

$\hat y=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3$
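
scikit-learn can perform this transformation for us with PolynomialFeatures. A minimal sketch, assuming the train and test frames from the fuel-consumption lab above are still in scope; the choice of degree 3 and of the ENGINESIZE column is my own

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# expand x into [1, x, x^2, x^3] and fit an ordinary linear regression on it
poly = PolynomialFeatures(degree=3)
train_x_poly = poly.fit_transform(train[['ENGINESIZE']])

poly_reg = LinearRegression()
poly_reg.fit(train_x_poly, train[['CO2EMISSIONS']])

test_x_poly = poly.transform(test[['ENGINESIZE']])
predictions = poly_reg.predict(test_x_poly)
print(predictions[:5])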

Other Non-Linear Regression

Non-Linear Regression can be of many forms as well, including any other mathematical relationships that we can define

For more complex NLR problems it can be difficult to evaluate the parameters for the equation

Lab

There are many different model types and equations shown in the Lab Notebook aside from what I have here

Import the Data

Using China's GDP data

df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv')
df.head()
Year Value
0 1960 5.918412e+10
1 1961 4.955705e+10
2 1962 4.668518e+10
3 1963 5.009730e+10
4 1964 5.906225e+10
x_data, y_data = (df[['Year']], df[['Value']])
Plotting the Data
plt.title('China\'s GDP by Year')
plt.plot(x_data, y_data, 'o')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Defining a Fit

Next we can try to approximate a curve that we think will fit the data we have; here we can use a sigmoid, as defined below

$\hat{Y} = \frac{1}{1+e^{-\beta_1(X-\beta_2)}}$

$\beta_1$: controls the curve's steepness

$\beta_2$: slides the curve along the x-axis

def sigmoid(x, b_1, b_2):
     y = 1 / (1 + np.exp(-b_1*(x-b_2)))
     return y

The shape of this function can be seen below

X = np.arange(-5.0, 5.0, 0.1)
Y = sigmoid(X, 1, 1)

plt.title('Sigmoid')
plt.plot(X,Y) 
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

Next let's try to fit this to the data with some example values

b_1 = 0.10
b_2 = 1990.0

#logistic function
y_pred = sigmoid(x_data, b_1 , b_2)

#plot initial prediction against datapoints
plt.title('Approximating NLR with Sigmoid')
plt.plot(x_data, y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
plt.show()
Data Normalization

Let's normalize our data so that we don't need to multiply by crazy numbers as before

# for some reason this seems to be the only way the conversion
# from a dataframe works as desired
# the normalization from the labs are as such:
# xdata =x_data/max(x_data)
# ydata =y_data/max(y_data)
x_norm = (np.array(x_data)/max(np.array(x_data))).transpose()[0]
y_norm = (np.array(y_data)/max(np.array(y_data))).transpose()[0]
Finding the Best Fit

Next we can import curve_fit to help us fit the curve to our data

from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, x_norm, y_norm)
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
print(popt)
print(pcov)

And we can plot the result as follows

x = np.linspace(1960, 2015, 55)
x = x/max(x)
y = sigmoid(x, *popt)

plt.title('Sigmoid Fit of Data')
plt.plot(x_norm, y_norm, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend()
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Model Accuracy
from sklearn.metrics import r2_score

# split data into train/test
mask = np.random.rand(len(df)) < 0.8
train_x = x_norm[mask]
test_x = x_norm[~mask]
train_y = y_norm[mask]
test_y = y_norm[~mask]

# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)

# predict using test set
y_hat = sigmoid(test_x, *popt)

# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) ** 2))
print("R2-score: %.2f" % r2_score(y_hat , test_y) )

Classification

Classification is a supervised learning approach which is a means of splitting data into discrete classes

The target attribute is a categorical value with discrete values

Classification will determine the class label for a specific test case

Binary as well as multi-class classification methods are available

Learning Algorithms

Many learning algorithms are available for classification such as

  • Decision trees
  • Naive Bayes
  • KNN
  • Logistic Regression
  • Neural Networks
  • SVM

Evaluation Metrics

We have a few different evaluation metrics for classification

Jaccard Index

We simply measure what fraction of our predicted values $\hat y$ intersect with the actual values $y$

$J(y,\hat y)=\frac{|y\cap\hat y|}{|y\cup\hat y|}=\frac{|y\cap\hat y|}{|y|+|\hat y|-|y\cap\hat y|}$

F1 Score

This is a measure which makes use of a confusion matrix and compares the predictions vs actual values for each class

In the case of binary classification this will give us our True Positives, False Positives, True Negatives, and False Negatives

We can define some metrics for each class with the following

$Precision=\frac{TP}{TP+FP}$

$Recall=\frac{TP}{TP+FN}$

$F1=2\cdot\frac{Precision\times Recall}{Precision+Recall}$

F1 varies between 0 and 1, with 1 being the best

The F1 score for a classifier is the average of the F1 scores of each of its classes

Log Loss

The log loss measures the performance of a classifier where the predicted output is a probability between 0 and 1

$LogLoss=-\frac{1}{n}\Sigma_{i=1}^n\left(y_i\cdot\log(\hat y_i)+(1-y_i)\cdot\log(1-\hat y_i)\right)$

Better classifiers have a log loss closer to zero
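
A minimal sketch of these three metrics using scikit-learn, with made-up true labels, predicted labels, and predicted probabilities (jaccard_similarity_score matches the older scikit-learn version used in these labs; newer versions rename it jaccard_score)

import numpy as np
from sklearn.metrics import jaccard_similarity_score, f1_score, log_loss

y_true = np.array([1, 0, 1, 1, 0, 1])              # made-up actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1])              # made-up predicted labels
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7])  # made-up P(y=1) for each case

print(jaccard_similarity_score(y_true, y_pred))  # fraction of matching labels
print(f1_score(y_true, y_pred))                  # harmonic mean of precision and recall
print(log_loss(y_true, y_prob))                  # penalizes confident but wrong probabilities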

K-Nearest Neighbor

KNN is a method of determining the class of a test datapoint based on the training datapoints that sit near it, on the premise that nearby datapoints are more informative than those further away when predicting a value

Algorithm

  1. Pick a value for K
  2. Calculate distance of unknown case from known cases
  3. Select the K nearest observations
  4. Predict the value based on the most common observation value among those neighbours

We can make use of euclidean distance to calculate the distance between our continuous values, and a voting system for discrete data

Using a low K value can lead to overfitting, and using a very high value can lead to us underfitting

In order to find the optimal K value we do multiple tests by continuously increasing our K value and measuring the accuracy for that K value

Furthermore, KNN can also be used to predict continuous values (regression) when the target variable and predictors are continuous, as in the sketch below
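
A minimal sketch of KNN regression with scikit-learn's KNeighborsRegressor, assuming the train and test frames from the fuel-consumption lab earlier are still in scope; the column choice and k value are my own

from sklearn.neighbors import KNeighborsRegressor

# predict a continuous target by averaging the values of the 5 nearest neighbours
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(train[['ENGINESIZE']], train['CO2EMISSIONS'])
print(knn_reg.predict(test[['ENGINESIZE']])[:5])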

Lab

Import Libraries
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
Import Data

The dataset being used is one in which demographic data is used to predict a customer service group; the groups are as follows

Value Category
1 Basic Service
2 E-Service
3 Plus Service
4 Total Service
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv')
df.head()
region tenure age marital address income ed employ retire gender reside custcat
0 2 13 44 1 9 64.0 4 5 0.0 0 2 1
1 3 11 33 1 7 136.0 5 5 0.0 0 6 4
2 3 68 52 1 24 116.0 1 29 0.0 1 2 3
3 2 33 33 0 12 33.0 2 0 0.0 1 1 1
4 2 23 30 1 9 30.0 1 2 0.0 0 4 3
Data Visualization and Analysis

We can look at the number of customers in each class

df.custcat.value_counts()
3    281
1    266
4    236
2    217
Name: custcat, dtype: int64
df.hist()
plt.show()

We can take a closer look at income with

df.income.hist(bins=50)
plt.title('Income of Customers')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
Features

To use sklearn we need to convert our data into an array as follows

df.columns
Index(['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
       'employ', 'retire', 'gender', 'reside', 'custcat'],
      dtype='object')
# X = df.loc[:, 'region':'reside'].values
# Y = df.loc[:,'custcat'].values
X = df.loc[:, 'region':'reside']
Y = df.loc[:,'custcat']
X.head()
region tenure age marital address income ed employ retire gender reside
0 2 13 44 1 9 64.0 4 5 0.0 0 2
1 3 11 33 1 7 136.0 5 5 0.0 0 6
2 3 68 52 1 24 116.0 1 29 0.0 1 2
3 2 33 33 0 12 33.0 2 0 0.0 1 1
4 2 23 30 1 9 30.0 1 2 0.0 0 4
Y.head()
0    1
1    4
2    3
3    1
4    3
Name: custcat, dtype: int64
Normalize Data

For algorithms like KNN which are distance based, it is useful to normalize the data to have a zero mean and unit variance; we can do this using the sklearn.preprocessing package

X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
print(X[0:5])
Test/Train Split

Next we can split our model into a test and train set using sklearn.model_selection.train_test_split()

from sklearn.model_selection import train_test_split
ran = 4
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=ran)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Classification

We can then make use of the KNN classifier on our data

from sklearn.neighbors import KNeighborsClassifier as knn_classifier

We will use an initial value of 4 for k, but will later evaluate different k values

k = 4
knn = knn_classifier(n_neighbors=k)
knn.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='uniform')
Y_hat = knn.predict(X_test)
print(Y_hat[0:5])
Model Evaluation
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, knn.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_hat))
Other K Values

We can do this for additional K values to look at how the accuracy is affected

k_max = 100
mean_acc = np.zeros((k_max))
std_acc = np.zeros((k_max))
for n in range(1,k_max + 1):
    
    #Train Model and Predict  
    knn = knn_classifier(n_neighbors = n).fit(X_train,Y_train)
    Y_hat = knn.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(Y_test, Y_hat)

    std_acc[n-1] = np.std(Y_hat == Y_test)/np.sqrt(Y_hat.shape[0])
    
print(mean_acc)
plt.title('Accuracy vs K')
plt.plot(range(1,k_max + 1),mean_acc,'g')
plt.fill_between(range(1,k_max + 1),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy', '+/- 1xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

The maximum accuracy can be found to be

print('Max Accuracy: {}, K={}'.format(max(mean_acc),mean_acc.argmax() + 1))
Test Sample

It can be noted that the accuracy and optimal K value vary based on the random_state parameter in the train_test_split function used when doing the test/train split

Retrain with All Data

We can retrain the model to use all the data at the determined optimal value and look at the in-sample accuracy

k = mean_acc.argmax() + 1  # argmax is 0-indexed, while K values start at 1
knn = knn_classifier(n_neighbors=k)
knn.fit(X, Y)
print("In-Sample Accuracy: ", metrics.accuracy_score(Y, knn.predict(X)))

Decision Trees

Decision Trees allow us to make use of discrete and continuous predictors to find a discrete target

Decision trees test a condition and branch off based on the result, eventually leading to a specific outcome/decision

Algorithm

  1. Choose an attribute from the dataset
  2. Calculate the significance of the attribute in splitting the data
  3. Split the data based on the value of the best attribute
  4. Go to step 1

We aim to have resulting nodes that are high in purity. A higher purity increases predictiveness/significance

Recursive partitioning is used to decrease the impurity/entropy in the resulting nodes

Entropy is a measurement of randomness

If the samples are equally mixed, the entropy is 1; if the samples are pure, the entropy is 0

$Entropy = -\Sigma_v P(v)\log_2(P(v))$

The best tree is the one that results in the most information gain after the split

$Gain(S,A) = Entropy(S)-\Sigma_v\frac{|S_v|}{|S|}Entropy(S_v)$
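
A minimal NumPy sketch of entropy and information gain for a toy binary split; the class counts are made up for illustration

import numpy as np

def entropy(counts):
    # counts: number of samples of each class in a node
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# toy parent node with 9 samples of class A and 5 of class B
parent = [9, 5]
# candidate split producing two child nodes
left, right = [7, 1], [2, 4]

n = sum(parent)
gain = entropy(parent) - (sum(left) / n) * entropy(left) - (sum(right) / n) * entropy(right)
print(entropy(parent), gain)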

Lab

Import Libraries
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv')
print(df.shape)
df.head()
Age Sex BP Cholesterol Na_to_K Drug
0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
Split X and Y Values
X_headers = ['Age','Sex','BP','Cholesterol','Na_to_K']
X = df[X_headers]
X.head()
Age Sex BP Cholesterol Na_to_K
0 23 F HIGH HIGH 25.355
1 47 M LOW HIGH 13.093
2 47 M LOW HIGH 10.114
3 28 F NORMAL HIGH 7.798
4 61 F LOW HIGH 18.043
Y = df[['Drug']]
Y.head()
Drug
0 drugY
1 drugC
2 drugC
3 drugX
4 drugY
Create Numeric Variables

We need to get numeric variables for X as sklearn does not support string categorization (according to the guy in the course anyway)

from sklearn import preprocessing
X_arr = np.array(X)

encoder = preprocessing.LabelEncoder()
encoder.fit(['F','M'])
X_arr[:,1] = encoder.transform(X_arr[:,1])

encoder.fit(['LOW','NORMAL','HIGH'])
X_arr[:,2] = encoder.transform(X_arr[:,2])

encoder.fit(['NORMAL','HIGH'])
X_arr[:,3] = encoder.transform(X_arr[:,3])

print(X_arr[0:5])
X_encoded = pd.DataFrame(data=X_arr, columns=X_headers)
X_encoded.head()
Age Sex BP Cholesterol Na_to_K
0 23 0 0 0 25.355
1 47 1 1 0 13.093
2 47 1 1 0 10.114
3 28 0 2 0 7.798
4 61 0 1 0 18.043
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y, test_size = 0.3)
print('Training: X : {}, Y : {}'.format(X_train.shape,Y_train.shape))
print('Testing: X : {}, Y : {}'.format(X_test.shape,Y_test.shape))
Decision Tree
drug_tree = DecisionTreeClassifier(criterion='entropy', max_depth = 4)
drug_tree
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
drug_tree.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
Prediction
Y_predicted = drug_tree.predict(X_test)
print(Y_predicted[0:5])
print(Y_test[0:5])
Evaluation
from sklearn import metrics
print('Decision Tree Accuracy: ', metrics.accuracy_score(Y_test, Y_predicted))
Visualization
!pip install pydotplus
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
dot_data = StringIO()
filename = 'drug_decision_tree.png'
feature_names = X_headers
target_names = df['Drug'].unique().tolist()
out = tree.export_graphviz(drug_tree, 
                           feature_names=feature_names, 
                           out_file=dot_data, 
                           class_names=target_names, 
                           filled=True, 
                           special_characters=True, 
                           rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
True
img = mpimg.imread(filename)
plt.figure(figsize=(100, 100))
plt.imshow(img, interpolation='nearest')
plt.show()

Logistic Regression

Logistic regression is a classification algorithm for categorical targets, based on a linear division (decision boundary) between the classes

Logistic regression can be used for binary and multi class classification and predicts the probability of a class which is then mapped to a discrete value

Logistic regression is best suited to

  • Binary Classification
  • If you need probabilistic results
  • Linear decision boundary

$\theta_0+\theta_1x_1+\theta_2x_2>0$

  • If you need to understand the impact of a feature

A logistic regression can calculate

$\hat y=P(y=1|x)$

Logistic vs Linear Regression

We could use linear regression with a dividing line to decide whether or not a specific case leads to a specific output, where we define a threshold value which acts as a boundary for the target class

The problem with this method is that we only get a binary outcome, and no information as to what the probability of that outcome is. Logistic regression helps us define this by making use of a sigmoid to smooth out the classification boundary; the sigmoid function can be seen below

$\sigma(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}$

import numpy as np
from math import exp
import matplotlib.pyplot as plt

x = np.array(range(-100,102,2))/10
sigmoid = 1/(1+np.exp(-1*x))
step = []
for i in range(len(x)):
    step.append(1 if x[i] >= 0 else 0)

plt.plot(x,step, label='Step' )
plt.plot(x,sigmoid, label='Sigmoid')
# plt.xlim(-10,10)
plt.ylim(-0.1,1.1)
plt.xlabel('$x$')
plt.ylabel('$\sigma(x)$')
plt.legend()
plt.show()

Based on the above we can see that, depending on the value of $x$, the output tends towards 0 or 1 without ever reaching either exactly

Algorithm

  1. Initialize $\theta$
  2. Calculate $\hat y=\sigma(\theta^TX)$ for an $X$
  3. Compare $Y$ and $\hat Y$ and record the error, defined by a cost function $J(\theta)$
  4. Change $\theta$ to reduce the cost
  5. Go to 2

We can use different methods to change $\theta$, such as gradient descent; a minimal sketch of this loop follows
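
A minimal NumPy sketch of batch gradient descent for logistic regression; the learning rate, iteration count, and toy data are my own choices

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: X has a leading column of ones for the intercept, y is a binary target
X = np.array([[1, 0.5], [1, 1.5], [1, 2.5], [1, 3.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])           # 1. initialize theta
lr = 0.1
for _ in range(1000):
    y_hat = sigmoid(X @ theta)         # 2. calculate y_hat
    grad = X.T @ (y_hat - y) / len(y)  # 3. gradient of the log-loss cost J(theta)
    theta -= lr * grad                 # 4. change theta to reduce the cost
print(theta, sigmoid(X @ theta))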

Lab

Import Libraries
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
import matplotlib.pyplot as plt
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv')
df.head()
tenure age address income ed employ equip callcard wireless longmon ... pager internet callwait confer ebill loglong logtoll lninc custcat churn
0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 4.40 ... 1.0 0.0 1.0 1.0 0.0 1.482 3.033 4.913 4.0 1.0
1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 9.45 ... 0.0 0.0 0.0 0.0 0.0 2.246 3.240 3.497 1.0 1.0
2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 6.30 ... 0.0 0.0 0.0 1.0 0.0 1.841 3.240 3.401 3.0 0.0
3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 6.05 ... 1.0 1.0 1.0 1.0 1.0 1.800 3.807 4.331 4.0 0.0
4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 7.10 ... 0.0 0.0 1.0 1.0 0.0 1.960 3.091 4.382 3.0 0.0

5 rows × 28 columns

Preprocessing
df = df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',   'callcard', 'wireless','churn']]
df[['churn']] = df[['churn']].astype('int')
df.head()
tenure age address income ed employ equip callcard wireless churn
0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 1
1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 1
2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 0
3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 0
4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 0
Define X and Y
X = np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]
array([[  11.,   33.,    7.,  136.,    5.,    5.,    0.],
       [  33.,   33.,   12.,   33.,    2.,    0.,    0.],
       [  23.,   30.,    9.,   30.,    1.,    2.,    0.],
       [  38.,   35.,    5.,   76.,    2.,   10.,    1.],
       [   7.,   35.,   14.,   80.,    2.,   15.,    0.]])
Y = np.asarray(df['churn'])
Y[0:5]
array([1, 1, 0, 0, 0])
Normalize Data
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-1.13518441, -0.62595491, -0.4588971 ,  0.4751423 ,  1.6961288 ,
        -0.58477841, -0.85972695],
       [-0.11604313, -0.62595491,  0.03454064, -0.32886061, -0.6433592 ,
        -1.14437497, -0.85972695],
       [-0.57928917, -0.85594447, -0.261522  , -0.35227817, -1.42318853,
        -0.92053635, -0.85972695],
       [ 0.11557989, -0.47262854, -0.65627219,  0.00679109, -0.6433592 ,
        -0.02518185,  1.16316   ],
       [-1.32048283, -0.47262854,  0.23191574,  0.03801451, -0.6433592 ,
         0.53441472, -0.85972695]])
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, Y_train)
lr
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Predict
Y_hat = lr.predict(X_test)
Y_hat_prob = lr.predict_proba(X_test)
Y_hat_prob[0:5]
array([[ 0.54132919,  0.45867081],
       [ 0.60593357,  0.39406643],
       [ 0.56277713,  0.43722287],
       [ 0.63432489,  0.36567511],
       [ 0.56431839,  0.43568161]])
Evaluation
Jaccard Index
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(Y_test, Y_hat)
0.75
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(Y_test, Y_hat, labels=[1,0]))
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False,  title='Confusion matrix')
print(classification_report(Y_test, Y_hat))
from sklearn.metrics import log_loss
log_loss(Y_hat, Y_hat_prob)
0.54903192026736869
Using Different Model Parameters
lr2 = LogisticRegression(C=10, solver='sag').fit(X_train,Y_train)
Y_hat_prob2 = lr2.predict_proba(X_test)
print ("LogLoss: : %.2f" % log_loss(Y_test, Y_hat_prob2))

Support Vector Machine

SVM is a supervised algorithm that classifies data by finding a separator

  1. Map data to higher-dimensional Feature Space
  2. Find a separating hyperplane in higher dimensional space

Data Transformation

Mapping data into a higher-dimensional space is known as kernelling, and can be done with different kernel functions such as

  • Linear
  • Polynomial
  • RBF
  • Sigmoid

The best hyperplane is the one that results in the largest margin possible between the hyperplane and our closest samples; the samples closest to our hyperplane are known as support vectors

Advantages and Disadvantages

  • Advantages
    • Accurate in high dimensional spaces
    • Memory efficient
  • Disadvantages
    • Prone to overfitting
    • No probability estimation
    • Not suited to very large datasets

Applications

  • Image Recognition
  • Text mining/categorization
    • Spam detection
    • Sentiment analysis
  • Regression
  • Outlier detection
  • Clustering

Lab

Import Packages
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Import Data

The data is from the UCI Machine Learning Archive, the fields are as follows

Field name Description
ID Sample identifier
Clump Clump thickness
UnifSize Uniformity of cell size
UnifShape Uniformity of cell shape
MargAdh Marginal adhesion
SingEpiSize Single epithelial cell size
BareNuc Bare nuclei
BlandChrom Bland chromatin
NormNucl Normal nucleoli
Mit Mitoses
Class Benign or malignant
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv')
df.head()
ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc BlandChrom NormNucl Mit Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
Visualization

The Class field contains the diagnosis where 2 means benign, and 4 means malignant

ax = df[df['Class'] == 4][0:50].plot(kind='scatter', 
                                     x='Clump', 
                                     y='UnifSize', 
                                     color='DarkBlue', 
                                     label='malignant');
df[df['Class'] == 2][0:50].plot(kind='scatter', 
                                x='Clump', 
                                y='UnifSize', 
                                color='Yellow', 
                                label='benign', 
                                ax=ax);
plt.show()
Preprocessing Data
print(df.dtypes)
df = df[pd.to_numeric(df['BareNuc'].apply(lambda x: x.isnumeric()))]
df['BareNuc'] = df['BareNuc'].astype('int')
df.dtypes
ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int64
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
Break into X and Y
X = np.asarray(df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']])
X[0:5]
array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]])
Y = np.asarray(df['Class'])
Y[0:5]
array([2, 2, 2, 2, 2])
Train/Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.2, 
                                                    random_state=4)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Test set:', X_test.shape,  Y_test.shape)
Modeling
from sklearn import svm
clf = svm.SVC(gamma='auto', kernel='rbf')
clf.fit(X_train, Y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Y_hat = clf.predict(X_test)
Y_hat[0:5]
array([2, 4, 2, 4, 2])
Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[2,4])
np.set_printoptions(precision=2)

print (classification_report(Y_test, Y_hat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],
                      normalize= False,  title='Confusion matrix')
from sklearn.metrics import f1_score
print('F1 Score: ', f1_score(Y_test, Y_hat, average='weighted'))
from sklearn.metrics import jaccard_similarity_score
print('Jaccard Index: ', jaccard_similarity_score(Y_test, Y_hat))
Using an Alternative Kernel
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, Y_train) 
Y_hat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(Y_test, Y_hat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(Y_test, Y_hat2))

Clustering

Clustering is an unsupervised grouping of data in which similar datapoints are grouped together

The difference between clustering and classification is that clustering does not specify what the groupings should be

Uses of Clustering

  • Exploration of data
  • Summary Generation
  • Outlier Detection
  • Finding Duplicates
  • Data Pre-Processing

Clustering Algorithms

  • Partitioned Based
    • Efficient
  • Hierarchical
    • Produces trees of clusters
  • Density based
    • Produces arbitrary shaped clusters

K-Means

  • Partitioning Clustering
  • Divides data into K non-overlapping subsets

K-Means tries to minimize intra-cluster distances and maximize inter-cluster distances

Distance

We can define the distance simply as the euclidean distance, typically normalizing the values so that our distances are not affected more by one value than another

Other distance formulas can be used depending on our understanding of the data as appropriate

Algorithm

  1. Determine K and initialize centroids randomly
  2. Measure distance from centroids to each datapoint
  3. Assign each point to closest centroid
  4. Move each centroid to the mean of the points in its cluster
  5. Go to 2 if not converged

K-Means may not converge to the global optimum but only to a local one, and is somewhat dependent on the initial centroid choice in step 1

Accuracy

Average distance between datapoints within a cluster is a measure of error

Choice of K

We can use the elbow method in which we look at the distance of the datapoints to their centroid versus the K value, and select the one at which we notice a sharp change in the distance gradient
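
A minimal sketch of the elbow method using KMeans and its inertia_ attribute (the total within-cluster sum of squared distances), assuming X is a normalized feature matrix such as the one built in the lab below; the range of K values is my own choice

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=12).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, 'o-')
plt.xlabel('K')
plt.ylabel('Within-cluster sum of squares')
plt.show()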

Lab

Import Packages
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
# from sklearn.datasets.samples_generator import make_blobs 
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv')
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted Address DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 NBA001 6.3
1 2 47 1 26 100 4.582 8.218 0.0 NBA021 12.8
2 3 33 2 10 57 6.111 5.802 1.0 NBA013 20.9
3 4 29 2 4 19 0.681 0.516 0.0 NBA009 6.3
4 5 47 1 31 253 9.308 8.908 0.0 NBA008 7.2
df = df.drop('Address', axis=1)
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 6.3
1 2 47 1 26 100 4.582 8.218 0.0 12.8
2 3 33 2 10 57 6.111 5.802 1.0 20.9
3 4 29 2 4 19 0.681 0.516 0.0 6.3
4 5 47 1 31 253 9.308 8.908 0.0 7.2
Normalize the Data
from sklearn.preprocessing import StandardScaler
X = np.asarray(df.values[:,1:])
X = np.nan_to_num(X)
X
array([[ 41.  ,   2.  ,   6.  , ...,   1.07,   0.  ,   6.3 ],
       [ 47.  ,   1.  ,  26.  , ...,   8.22,   0.  ,  12.8 ],
       [ 33.  ,   2.  ,  10.  , ...,   5.8 ,   1.  ,  20.9 ],
       ..., 
       [ 25.  ,   4.  ,   0.  , ...,   3.21,   1.  ,  33.4 ],
       [ 32.  ,   1.  ,  12.  , ...,   0.7 ,   0.  ,   2.9 ],
       [ 52.  ,   1.  ,  16.  , ...,   3.64,   0.  ,   8.6 ]])
X_norm = StandardScaler().fit_transform(X)
X_norm
array([[ 0.74,  0.31, -0.38, ..., -0.59, -0.52, -0.58],
       [ 1.49, -0.77,  2.57, ...,  1.51, -0.52,  0.39],
       [-0.25,  0.31,  0.21, ...,  0.8 ,  1.91,  1.6 ],
       ..., 
       [-1.25,  2.47, -1.26, ...,  0.04,  1.91,  3.46],
       [-0.38, -0.77,  0.51, ..., -0.7 , -0.52, -1.08],
       [ 2.11, -0.77,  1.1 , ...,  0.16, -0.52, -0.23]])

Modeling

k = 3
k_means = KMeans(init='k-means++',
                n_clusters=k,
                n_init=12)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=12, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
labels = k_means.labels_
print(labels[:20], labels.shape)
df['Cluster'] = labels
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio Cluster
0 1 41 2 6 19 0.124 1.073 0.0 6.3 1
1 2 47 1 26 100 4.582 8.218 0.0 12.8 0
2 3 33 2 10 57 6.111 5.802 1.0 20.9 1
3 4 29 2 4 19 0.681 0.516 0.0 6.3 1
4 5 47 1 31 253 9.308 8.908 0.0 7.2 2
df.groupby('Cluster').mean()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
Cluster
0 402.295082 41.333333 1.956284 15.256831 83.928962 3.103639 5.765279 0.171233 10.724590
1 432.468413 32.964561 1.614792 6.374422 31.164869 1.032541 2.104133 0.285185 10.094761
2 410.166667 45.388889 2.666667 19.555556 227.166667 5.678444 10.907167 0.285714 7.322222
Visualization
%matplotlib inline
area = np.pi*(X[:,1])**2
plt.figure()
plt.title('Income vs Age')
plt.scatter(X[:,0], X[:,3], s=area, c=labels, alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
from mpl_toolkits.mplot3d import Axes3D
plt.clf()
ax = Axes3D(plt.figure(figsize=(8,6)), rect=[0,0,0.95,1], elev=48, azim=134)
plt.cla()
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')

ax.scatter(X[:,1], X[:,0], X[:,3], c=labels)

plt.figure()
plt.show()

Hierarchical Clustering

Two types

  • Divisive - Top Down
  • Agglomerative - Bottom Up

Agglomerative clustering works by combining clusters based on the distance between them; this is the most popular method for hierarchical clustering

Agglomerative Algorithm

  1. Create n clusters, one for each datapoint
  2. Compute the proximity matrix
  3. Repeat Until a single cluster remains
    1. Merge the two closest clusters
    2. Update the proximity matrix

We can use any distance function we want, and there are multiple linkage algorithms for deciding which clusters to merge

  • Single linkage clustering
  • Complete linkage clustering
  • Average linkage clustering
  • Centroid linkage clustering

Advantages and Disadvantages

  • Advantages
    • Number of clusters does not need to be specified
    • Easy to implement
    • Dendrogram can be easily understood
  • Disadvantages
    • Long runtimes
    • Cannot undo previous steps
    • Difficult to identify the number of clusters on the dendrogram

Lab

Import Packages
import numpy as np 
import pandas as pd
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets.samples_generator import make_blobs
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv')
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition
0 Acura Integra 16.919 16.360 0.000 21.500 1.800 140.000 101.200 67.300 172.400 2.639 13.200 28.000 2.828 0.0
1 Acura TL 39.384 19.875 0.000 28.400 3.200 225.000 108.100 70.300 192.900 3.517 17.200 25.000 3.673 0.0
2 Acura CL 14.114 18.225 0.000 $null$ 3.200 225.000 106.900 70.600 192.000 3.470 17.200 26.000 2.647 0.0
3 Acura RL 8.588 29.725 0.000 42.000 3.500 210.000 114.600 71.400 196.600 3.850 18.000 22.000 2.150 0.0
4 Audi A4 20.397 22.255 0.000 23.990 1.800 150.000 102.600 68.200 178.000 2.998 16.400 27.000 3.015 0.0
Clean Data
print ("Shape of dataset before cleaning: ", df.size)

df[[ 'sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']] = df[['sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
df = df.reset_index(drop=True)

print ("Shape of dataset after cleaning: ", df.size)
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition
0 Acura Integra 16.919 16.360 0.0 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 2.828 0.0
1 Acura TL 39.384 19.875 0.0 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 3.673 0.0
2 Acura RL 8.588 29.725 0.0 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 2.150 0.0
3 Audi A4 20.397 22.255 0.0 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 3.015 0.0
4 Audi A6 18.780 23.555 0.0 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 2.933 0.0
Selecting Features
X = df[['engine_s','horsepow', 'wheelbas', 
        'width', 'length', 'curb_wgt', 
        'fuel_cap', 'mpg']].values
print(X[:5])
Normalization
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm[:5])
Clustering with Scipy
import scipy as sp

entries = X_norm.shape[0]
D = sp.zeros([entries, entries])
for i in range(entries):
    for j in range(entries):
        D[i,j] = sp.spatial.distance.euclidean(X[i],X[j])
print(D)

We have different linkage methods available, such as

  • single
  • complete
  • average
  • weighted
  • centroid
import pylab
import scipy.cluster.hierarchy
Z = hierarchy.linkage(D, 'complete')
print(Z[:5])
from scipy.cluster.hierarchy import fcluster
max_d =  3
clusters = fcluster(Z, max_d, criterion='distance')
print(clusters)
max_d = 5
clusters = fcluster(Z, max_d, criterion='maxclust')
print(clusters)
fig = pylab.figure(figsize=(18,50))
def llf(id):
    return '[%s %s %s]' % (df['manufact'][id], 
                           df['model'][id], 
                           int(float(df['type'][id])))

dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, 
                             leaf_rotation=0, 
                             leaf_font_size=12, 
                             orientation='right')
Clustering with SciKit Learn
D = distance_matrix(X, X)
print(D)
agglom = AgglomerativeClustering(n_clusters=6, linkage='complete')
agglom.fit(X)
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='complete', memory=None,
            n_clusters=6, pooling_func=<function mean at 0x7f2ad42c3730>)
df['cluster_'] = agglom.labels_
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition cluster_
0 Acura Integra 16.919 16.360 0.0 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 2.828 0.0 2
1 Acura TL 39.384 19.875 0.0 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 3.673 0.0 0
2 Acura RL 8.588 29.725 0.0 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 2.150 0.0 0
3 Audi A4 20.397 22.255 0.0 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 3.015 0.0 3
4 Audi A6 18.780 23.555 0.0 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 2.933 0.0 0
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0,1,n_clusters))
cluster_labels = list(range(0,n_clusters))

plt.figure(figsize=(16,14))

for color, label in zip(colors, cluster_labels):
    subset = df[df.cluster_ == label]
    for i in subset.index:
            plt.text(subset.horsepow[i], 
                     subset.mpg[i],
                     str(subset['model'][i]), 
                     rotation=25) 
            
    plt.scatter(subset.horsepow, subset.mpg, 
                s= subset.price*10, c=color, 
                label='cluster'+str(label),alpha=0.5)

plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
plt.show()
df.groupby(['cluster_','type'])['cluster_'].count()
cluster_  type
0         0.0     29
          1.0     14
1         0.0     10
2         0.0     26
          1.0      4
3         0.0     21
          1.0     11
4         0.0      1
5         0.0      1
Name: cluster_, dtype: int64
df_mean = df.groupby(['cluster_','type'])['horsepow','engine_s','mpg','price'].mean()
df_mean
horsepow engine_s mpg price
cluster_ type
0 0.0 210.551724 3.420690 23.648276 30.449310
1.0 206.428571 4.064286 18.500000 28.727714
1 0.0 294.700000 4.380000 21.600000 57.864000
2 0.0 121.230769 1.934615 29.115385 14.720385
1.0 133.750000 2.225000 22.750000 15.856500
3 0.0 160.857143 2.680952 24.857143 19.822048
1.0 154.272727 2.936364 20.909091 21.199364
4 0.0 55.000000 1.000000 45.000000 9.235000
5 0.0 450.000000 8.000000 16.000000 69.725000
plt.figure(figsize=(16,10))
for color, label in zip(colors, cluster_labels):
    subset = df_mean.loc[(label,),]
    for i in subset.index:
        plt.text(subset.loc[i][0]+5, subset.loc[i][2], 'type='+str(int(i)) + ', price='+str(int(subset.loc[i][3]))+'k')
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, c=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
Text(0,0.5,'mpg')

DBSCAN

Density-based clustering locates regions of high density and separates outliers; it is able to find arbitrarily shaped clusters while ignoring noise

  • Density Based Spatial Clustering of Applications with Noise
    • Common clustering algorithm
    • Based on object density
  • Radius of neighborhood
  • Min number of neighbors

Different types of points

  • Core
    • Has at least M neighbours within R
  • Border
    • Has a core point within R, but fewer than M neighbours within R
  • Outlier
    • Not a core point, and not within R of any core point

DBSCAN visits each point and identifies its type, and then groups points based on this
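
A minimal sketch of how these point types surface in scikit-learn's DBSCAN: outliers get the label -1, and core points are listed in core_sample_indices_; the blob data and parameter values are my own choices

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets.samples_generator import make_blobs

# generate some example blobs to cluster
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

core_mask = np.zeros_like(db.labels_, dtype=bool)
core_mask[db.core_sample_indices_] = True

print('clusters:', set(db.labels_) - {-1})
print('outliers:', np.sum(db.labels_ == -1))                 # label -1 marks noise
print('core points:', core_mask.sum())
print('border points:', np.sum(~core_mask & (db.labels_ != -1)))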

Lab

Import Packages
import numpy as np 
from sklearn.cluster import DBSCAN 
from sklearn.datasets.samples_generator import make_blobs 
from sklearn.preprocessing import StandardScaler 
import matplotlib.pyplot as plt
import pandas as pd
About the Data
Environment Canada

Monthly Values for July - 2015

The columns in the table have the following meanings

  • Stn_Name: Station Name
  • Lat: Latitude (North+, degrees)
  • Long: Longitude (West−, degrees)
  • Prov: Province
  • Tm: Mean Temperature (°C)
  • DwTm: Days without Valid Mean Temperature
  • D: Mean Temperature difference from Normal (1981-2010) (°C)
  • Tx: Highest Monthly Maximum Temperature (°C)
  • DwTx: Days without Valid Maximum Temperature
  • Tn: Lowest Monthly Minimum Temperature (°C)
  • DwTn: Days without Valid Minimum Temperature
  • S: Snowfall (cm)
  • DwS: Days without Valid Snowfall
  • S%N: Percent of Normal (1981-2010) Snowfall
  • P: Total Precipitation (mm)
  • DwP: Days without Valid Precipitation
  • P%N: Percent of Normal (1981-2010) Precipitation
  • S_G: Snow on the ground at the end of the month (cm)
  • Pd: Number of days with Precipitation 1.0 mm or more
  • BS: Bright Sunshine (hours)
  • DwBS: Days without Valid Bright Sunshine
  • BS%: Percent of Normal (1981-2010) Bright Sunshine
  • HDD: Degree Days below 18 °C
  • CDD: Degree Days above 18 °C
  • Stn_No: Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically)
  • NA: Not Available
Import the Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/weather-stations20140101-20141231.csv')
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
3 DISCOVERY ISLAND 48.425 -123.226 BC NaN NaN NaN 12.5 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1012475
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573

5 rows × 25 columns

Clean Data
df = df[pd.notnull(df['Tm'])]  # drop stations with no valid mean temperature
df.reset_index(drop=True)      # note: not assigned back, so the original index (with gaps) is kept below
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573
5 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... 8.0 NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710

5 rows × 25 columns

# ! pip install --user git+https://github.com/matplotlib/basemap.git
# from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = (14,10)

# longitude/latitude bounding box used to keep only stations in this region
llon=-140
ulon=-50
llat=40
ulat=65

df = df[(df['Long'] > llon) & (df['Long'] < ulon) & (df['Lat'] > llat) & (df['Lat'] < ulat)]

plt.title('Location of Sensors')
plt.scatter(list(df['Long']),list(df['Lat']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Compute DBSCAN
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
<mtrand.RandomState at 0x7f29c41733a8>
# cluster on location only: replace NaNs and standardise the Lat/Long values
X = np.nan_to_num(df[['Lat','Long']])
X = StandardScaler().fit_transform(X)

X
array([[-0.3 , -1.17],
       [-0.33, -1.19],
       [-0.33, -1.18],
       ..., 
       [ 1.84,  1.47],
       [ 1.01,  1.65],
       [ 0.6 ,  1.28]])
db = DBSCAN(eps=0.15, min_samples=10).fit(X)  # eps is the neighborhood radius R, min_samples is M
db
DBSCAN(algorithm='auto', eps=0.15, leaf_size=30, metric='euclidean',
    metric_params=None, min_samples=10, n_jobs=1, p=None)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True  # True for the points DBSCAN identified as core samples

core_samples_mask
array([ True,  True,  True, ..., False, False, False], dtype=bool)
df['Clus_db'] = db.labels_
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... P%N S_G Pd BS DwBS BS% HDD CDD Stn_No Clus_db
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500 0
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040 0
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055 0
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573 0
5 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710 0

5 rows × 26 columns

df[['Stn_Name','Tx','Tm','Clus_db']][1000:1500:45]
Stn_Name Tx Tm Clus_db
1138 HEATH POINT -1.0 -13.3 -1
1185 LA GRANDE RIVIERE A -11.6 -28.4 -1
1234 BRIER ISLAND 4.4 -6.3 3
1286 BRANCH 8.0 -3.4 4
1332 GOOSE A -4.2 -22.0 -1
Cluster Visualization
print(df['Clus_db'].max(), df['Clus_db'].min())
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
colours = ['#D3D3D3','blue','red','green','purple','yellow','deepskyblue']
le.fit(colours)
le.classes_
array(['#D3D3D3', 'blue', 'deepskyblue', 'green', 'purple', 'red', 'yellow'],
      dtype='<U11')
# le.inverse_transform([0,1,2,3,4,5,6])
df['Colours'] = le.inverse_transform(db.labels_ + 1)  # shift labels so noise (-1) maps to the grey '#D3D3D3'
df[['Stn_Name','Tx','Tm','Clus_db', 'Colours']][1000:1500:45]
Stn_Name Tx Tm Clus_db Colours
1138 HEATH POINT -1.0 -13.3 -1 #D3D3D3
1185 LA GRANDE RIVIERE A -11.6 -28.4 -1 #D3D3D3
1234 BRIER ISLAND 4.4 -6.3 3 purple
1286 BRANCH 8.0 -3.4 4 red
1332 GOOSE A -4.2 -22.0 -1 #D3D3D3
plt.title('Clusters')
plt.scatter(list(df['Long']),
            list(df['Lat']),
            c=list(df['Colours']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

Recommender Systems

Recommender systems try to capture people's behaviour in order to predict what they may like

There are two main types

  • Content based
    • Recommends more content similar to what the user has already liked
  • Collaborative filtering
    • Recommends content that other, similar users have liked

There are two types of implementations (contrasted in the sketch after this list)

  • Memory based
    • Uses the entire user-item dataset to generate a recommendation
  • Model based
    • Develops a model of users in an attempt to learn their preferences
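
As a rough contrast (a toy sketch of my own with a made-up ratings matrix, not from the course), a memory based approach works directly on the full user-item matrix at recommendation time, while a model based approach fits a compact model such as a low-rank factorisation

import numpy as np

# tiny hypothetical user-item ratings matrix: rows = users, columns = items, 0 = unrated
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])

# memory based: compute similarities (here cosine) over the whole matrix,
# then weight other users' ratings by that similarity
norms = np.linalg.norm(R, axis=1, keepdims=True)
user_similarity = (R @ R.T) / (norms @ norms.T)

# model based: learn a compact representation instead, e.g. a rank-2 factorisation
# whose reconstruction can be used to estimate the unrated entries
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

print(np.round(user_similarity, 2))
print(np.round(R_hat, 2))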

Content Based

Content based systems build a profile of the user from the features of the content they interact with, and then recommend items whose features are similar to that profile
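
To make this concrete, the weighting scheme used in the lab below can be written as follows: let f_{ij} be 1 if item i has feature (genre) j and 0 otherwise, and let r_i be the user's rating of item i. The user's profile weight for feature j, and the score for a candidate item k, are

p_j=\sum_{i\,\in\,\text{rated}}r_i\,f_{ij}

\text{score}(k)=\frac{\sum_j f_{kj}\,p_j}{\sum_j p_j}

so a candidate scores highly when most of its features carry a large weight in the user's profile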

Lab

Download the Data

The dataset being used is a movie dataset from GroupLens

#only run once

# !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
# print('unzipping ...')
# !unzip -o -j moviedataset.zip 
Import Packages
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
Import Data
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
ratings_df.head()
userId movieId rating timestamp
0 1 169 2.5 1204927694
1 1 2471 3.0 1204927438
2 1 48516 5.0 1204927435
3 2 2571 3.5 1436165433
4 2 109487 4.0 1436165496
Preprocessing
# extract the year from the title, e.g. "Toy Story (1995)" -> "1995"
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# remove the "(1995)" part of the title and strip any surrounding whitespace
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

movies_df.head()
movieId title genres year
0 1 Toy Story Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men Comedy|Romance 1995
3 4 Waiting to Exhale Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II Comedy 1995
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()
movieId title genres year
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995
1 2 Jumanji [Adventure, Children, Fantasy] 1995
2 3 Grumpier Old Men [Comedy, Romance] 1995
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995
4 5 Father of the Bride Part II [Comedy] 1995
# one-hot encode the genres: add a column per genre, set to 1 where the movie has that genre
genres_df = movies_df.copy()

for index, row in movies_df.iterrows():
    for genre in row['genres']:
        genres_df.at[index, genre] = 1

# movies that lack a given genre end up as NaN above, so fill those with 0
genres_df = genres_df.fillna(0)
genres_df.head()
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Jumanji [Adventure, Children, Fantasy] 1995 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Grumpier Old Men [Comedy, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Father of the Bride Part II [Comedy] 1995 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns

ratings_df.head()
userId movieId rating timestamp
0 1 169 2.5 1204927694
1 1 2471 3.0 1204927438
2 1 48516 5.0 1204927435
3 2 2571 3.5 1436165433
4 2 109487 4.0 1436165496
ratings_df = ratings_df.drop('timestamp', axis=1)  # the timestamp column isn't needed
ratings_df.head()
userId movieId rating
0 1 169 2.5
1 1 2471 3.0
2 1 48516 5.0
3 2 2571 3.5
4 2 109487 4.0
User Interests
user_movies = pd.DataFrame([
                            {'title':'Breakfast Club, The', 'rating':5},
                            {'title':'Toy Story', 'rating':3.5},
                            {'title':'Jumanji', 'rating':2},
                            {'title':"Pulp Fiction", 'rating':5},
                            {'title':'Akira', 'rating':4.5}
                           ])
user_movies
rating title
0 5.0 Breakfast Club, The
1 3.5 Toy Story
2 2.0 Jumanji
3 5.0 Pulp Fiction
4 4.5 Akira
# pull the full genre rows for the movies the user has rated, then attach the ratings
movie_ids = genres_df[genres_df['title'].isin(user_movies['title'].tolist())]
user_movies = pd.merge(movie_ids, user_movies)
user_genres = user_movies.drop('genres', axis=1).drop('year', axis=1)
user_genres
movieId title Adventure Animation Children Comedy Fantasy Romance Drama Action ... Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed) rating
0 1 Toy Story 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.5
1 2 Jumanji 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
2 296 Pulp Fiction 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0
3 1274 Akira 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.5
4 1968 Breakfast Club, The 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0

5 rows × 23 columns

Since we only need the genre columns, we can drop everything else

user_genres.drop('title', axis=1, inplace=True)
user_genres.drop('movieId', axis=1, inplace=True)
user_genres.drop('rating', axis=1, inplace=True)
user_genres
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Next we take the dot product of the genre table with the ratings column to get the user's weighted genre profile

user_profile = user_genres.transpose().dot(user_movies['rating'])
user_profile
Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64

We can then compare this profile against the genre table of every movie, score each one, and build recommendations from the highest-scoring movies

all_genres = genres_df.set_index(genres_df['movieId'])
all_genres.head()
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
movieId
1 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 Jumanji [Adventure, Children, Fantasy] 1995 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 Grumpier Old Men [Comedy, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4 Waiting to Exhale [Comedy, Drama, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 5 Father of the Bride Part II [Comedy] 1995 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns

all_genres.drop(['movieId','title','genres','year'], axis=1, inplace=True)
all_genres.head()
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
movieId
1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
user_recommendation = all_genres.dot(user_profile)/user_profile.sum()  # weighted average of each movie's genres against the user profile
user_recommendation.head()
movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
user_recommendation.sort_values(ascending=False, inplace=True)
user_recommendation.head(10)
movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
117646    0.678322
64645     0.671329
81132     0.671329
122787    0.671329
2987      0.664336
dtype: float64
Top Recommendations for User
movies_df.loc[movies_df['movieId'].isin(user_recommendation.head().keys())]
movieId title genres year
4923 5018 Motorama [Adventure, Comedy, Crime, Drama, Fantasy, Mys... 1991
6793 6902 Interstate 60 [Adventure, Comedy, Drama, Fantasy, Mystery, S... 2002
8605 26093 Wonderful World of the Brothers Grimm, The [Adventure, Animation, Children, Comedy, Drama... 1962
9296 27344 Revolutionary Girl Utena: Adolescence of Utena... [Action, Adventure, Animation, Comedy, Drama, ... 1999
33509 148775 Wizards of Waverly Place: The Movie [Adventure, Children, Comedy, Drama, Fantasy, ... 2009

Collaborative Filtering

Collaborative filtering recommends content based on the preferences of similar users, or on the similarity between items

There are two types

  • User based
    • Recommendations come from users with a similar rating history (the user's neighborhood)
  • Item based
    • Recommendations come from items that are rated similarly to the items the user already likes

Lab

This lab uses the same movie data as before: it applies the Pearson Correlation Coefficient to the ratings table to identify users who rate movies similarly, and it can be found in 5-2-Collaborative-Filtering. A rough sketch of that similarity measure is shown below
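
As a minimal sketch (my own toy example with made-up ratings, not the lab's code), the Pearson correlation between two users can be computed over the movies they have both rated

from math import sqrt

def pearson(ratings_a, ratings_b):
    """ratings_a / ratings_b: dicts mapping movieId -> rating for two users."""
    common = set(ratings_a) & set(ratings_b)   # only compare commonly rated movies
    n = len(common)
    if n == 0:
        return 0
    sum_a = sum(ratings_a[m] for m in common)
    sum_b = sum(ratings_b[m] for m in common)
    sum_a2 = sum(ratings_a[m] ** 2 for m in common)
    sum_b2 = sum(ratings_b[m] ** 2 for m in common)
    sum_ab = sum(ratings_a[m] * ratings_b[m] for m in common)
    num = sum_ab - (sum_a * sum_b / n)
    den = sqrt((sum_a2 - sum_a ** 2 / n) * (sum_b2 - sum_b ** 2 / n))
    return num / den if den != 0 else 0

# hypothetical users with similar tastes: correlation is close to +1
print(pearson({1: 5.0, 2: 3.0, 3: 4.5}, {1: 4.5, 2: 2.5, 3: 5.0}))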