Machine Learning with Python
Introduction to Machine Learning with Python
Based on this Cognitive Class Course
Labs
The labs for the course are located in the Labs folder; they are from CognitiveClass and are licensed under the MIT License
Intro to ML
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed
Some popular techniques are:
- Regression for predicting continuous values
- Classification for predicting a class/category
- Clustering for finding structure of data and summarization
- Associations for finding items/events that co-occur
- Anomaly detection is used for finding abnormal/unusual cases
- Sequence mining is for predicting next values
- Dimension reduction for reducing the size of data
- Recommendation systems
We have a few different buzzwords
- AI
- Computer Vision
- Language processing
- Creativity
- Machine learning
- Field of AI
- Experience based
- Classification
- Clustering
- Neural Networks
- Deep Learning
- Specialized case of ML
- More automation than most ML
Python for Machine Learning
Python has many different libraries for machine learning such as
- NumPy
- SciPy
- Matplotlib
- Pandas
- Scikit Learn
Supervised vs Unsupervised
Supervised learning involves us supervising a machine learning model. We do this by teaching the model with a labelled dataset
There are two types of supervised learning, namely Classification and Regression
Unsupervised learning is when the model works on its own to discover information about data using techniques such as Dimension Reduction, Density Estimation, Market Basket Analysis, and Clustering
- Supervised
- Classification
- Regression
- More evaluation methods
- Controlled environment
- Unsupervised
- Clustering
- Fewer evaluation methods
- Less controlled environment
Regression
Regression makes use of two different types of variables
- Independent variables ($x$), also known as the predictors
- A dependent variable ($y$), the target we are trying to predict
With regression the dependent value must be continuous, while the independent variables can be continuous, discrete, or categorical
There are two types of regression:
- Simple Regression (a single independent variable)
- Simple Linear Regression
- Simple Non-Linear Regression
- Multiple Regression (multiple independent variables)
- Multiple Linear Regression
- Multiple Non-Linear Regression
Regression is used when the target is a continuous value, and is well suited to predicting such values from one or more observed variables
There are many regression algorithms such as
- Ordinal regression
- Poisson regression
- Fast forest quantile regression
- Linear, polynomial, lasso, stepwise, and ridge regression
- Bayesian linear regression
- Neural network regression
- Decision forest regression
- Boosted decision tree regression
- K nearest neighbors (KNN)
Each of which is better suited to some circumstances than to others
Simple Linear Regression
In SLR we have two variables, one dependent and one independent. The target variable ($y$) must be continuous, while the predictor ($x$) can be either continuous or categorical

To get a better idea of whether SLR is appropriate we can simply plot $y$ vs $x$ and find the line which best fits the data

The line is represented by the following equation

$$\hat{y} = \theta_0 + \theta_1 x_1$$

The aim of SLR is to adjust $\theta_0$ and $\theta_1$ to minimize the residual error in our data and find the best fit
Estimating Parameters
We have two options to estimate our parameters, given an SLR problem
Estimate $\theta_0$ and $\theta_1$ using the following equations

$$\theta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\theta_0 = \bar{y} - \theta_1\bar{x}$$

We can use these values to make predictions with the equation

$$\hat{y} = \theta_0 + \theta_1 x_1$$
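As a minimal sketch of these estimates with NumPy (the x and y values below are made up for illustration and are not the lab data)

```python
import numpy as np

# hypothetical data: engine size (x) and CO2 emissions (y)
x = np.array([1.5, 2.0, 2.4, 3.5, 3.7, 4.3])
y = np.array([136, 196, 221, 255, 267, 294])

# least-squares estimates for y_hat = theta_0 + theta_1 * x
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()

# predict the emissions for a new engine size
x_new = 3.0
y_hat = theta_0 + theta_1 * x_new
print(theta_0, theta_1, y_hat)
```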
Pros
- Fast
- Easy to Understand
- No tuning needed
Lab
Import Necessary Libraries
%reset -f
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import numpy as np
%matplotlib inline
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv')
df.head()
MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
Data Exploration
df.describe()
MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS | |
---|---|---|---|---|---|---|---|---|
count | 1067.0 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 |
mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.794510 | 3.485595 | 7.468702 | 63.372304 |
min | 2014.0 | 1.000000 | 3.000000 | 4.600000 | 4.900000 | 4.700000 | 11.000000 | 108.000000 |
25% | 2014.0 | 2.000000 | 4.000000 | 10.250000 | 7.500000 | 9.000000 | 21.000000 | 207.000000 |
50% | 2014.0 | 3.400000 | 6.000000 | 12.600000 | 8.800000 | 10.900000 | 26.000000 | 251.000000 |
75% | 2014.0 | 4.300000 | 8.000000 | 15.550000 | 10.850000 | 13.350000 | 31.000000 | 294.000000 |
max | 2014.0 | 8.400000 | 12.000000 | 30.200000 | 20.500000 | 25.800000 | 60.000000 | 488.000000 |
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(10)
ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS | |
---|---|---|---|---|
0 | 2.0 | 4 | 8.5 | 196 |
1 | 2.4 | 4 | 9.6 | 221 |
2 | 1.5 | 4 | 5.9 | 136 |
3 | 3.5 | 6 | 11.1 | 255 |
4 | 3.5 | 6 | 10.6 | 244 |
5 | 3.5 | 6 | 10.0 | 230 |
6 | 3.5 | 6 | 10.1 | 232 |
7 | 3.7 | 6 | 11.1 | 255 |
8 | 3.7 | 6 | 11.6 | 267 |
9 | 2.4 | 4 | 9.2 | 212 |
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
plt.title('CO2 Emission vs Fuel Consumption')
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Engine Size')
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Cylinders')
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()
Test-Train Split
We need to split our data into a test set and a train set
tt_mask = np.random.rand(len(df)) < 0.8
train = cdf[tt_mask].reset_index()
test = cdf[~tt_mask].reset_index()
Simple Regression Model
We can look at the distribution of the Engine Size in our training and test set respectively as follows
plt.title('CO2 Emissions vs Engine Size for Test and Train Data')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.scatter(test.ENGINESIZE, test.CO2EMISSIONS,color='red',label='test')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Modeling
from sklearn import linear_model
lin_reg = linear_model.LinearRegression()
train_x = train[['ENGINESIZE']]
train_y = train[['CO2EMISSIONS']]
test_x = test[['ENGINESIZE']]
test_y = test[['CO2EMISSIONS']]
lin_reg.fit(train_x, train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
'Coefficients: ' + str(lin_reg.coef_) + ' Intercept: ' + str(lin_reg.intercept_)
'Coefficients: [[ 39.30964622]] Intercept: [ 124.8710344]'
We can plot the line on our data to see the fit
plt.title('CO2 Emissions vs Engine Size, Training and Fit')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.plot(train_x, lin_reg.coef_[0,0]*train_x + lin_reg.intercept_[0],color='red',label='regression')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Model Evaluation
Import Packages
from sklearn.metrics import r2_score
Predict the CO2 Emissions
predicted_y = lin_reg.predict(test_x)
Display Results
results = pd.DataFrame()
results[['ENGINESIZE']] = test_x
results[['ACTUALCO2']] = test_y
results[['PREDICTEDCO2']] = pd.DataFrame(predicted_y)
results[['ERROR']] = pd.DataFrame(np.abs(predicted_y - test_y))
results[['SQUAREDERROR']] = pd.DataFrame((predicted_y - test_y)**2)
results.head()
ENGINESIZE | ACTUALCO2 | PREDICTEDCO2 | ERROR | SQUAREDERROR | |
---|---|---|---|---|---|
0 | 5.9 | 359 | 356.797947 | 2.202053 | 4.849037 |
1 | 2.0 | 230 | 203.490327 | 26.509673 | 702.762771 |
2 | 2.0 | 230 | 203.490327 | 26.509673 | 702.762771 |
3 | 2.0 | 214 | 203.490327 | 10.509673 | 110.453230 |
4 | 5.2 | 409 | 329.281195 | 79.718805 | 6355.087912 |
Model Evaluation
MAE = np.mean(results[['ERROR']])
MSE = np.mean(results[['SQUAREDERROR']])
R2 = r2_score(test_y, predicted_y)
print("Mean absolute error: %.2f" % MAE)
print("Residual sum of squares (MSE): %.2f" % MSE)
print("R2-score: %.2f" % R2)
Multiple Linear Regression
In reality multiple independent variables will define a specific target. MLR is simply an extension of the SLR model
MLR is useful for solving problems such as
- Define the impact of independent variables on effectiveness of prediction
- Predicting the impact of change in a specific variable
MLR makes use of multiple predictors to predict the target value, and is generally of the form

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T X$$

$\theta$ is the vector of coefficients which is multiplied by $X$; these are called the parameters or the weight vector, and $X$ is the feature set. The idea with MLR is to find the best-fit hyperplane for our data
Estimating Parameters
We have a few ways to estimate the best parameters, such as
- Ordinary Least Squares
- Linear algebra
- Not suited to large datasets
- Gradient Descent
- Good for large datasets
- Other methods are available to do this as well
How Many Variables?
Making use of more variables will generally increase the accuracy of the model, however using too many variables without good justification can lead to us overfitting the model
We can make use of categorical variables if we convert them to numeric values
MLR assumes that we have a linear relationship between the dependent and independent variables
Model Evaluation
We have to perform regression evaluation when building a model
Train and Test on the Same Dataset
We make use of our data to train our model, and then compare the predicted values to the actual values of our model
The error of the model is the average of the differences between the actual and predicted values
This approach has a high training accuracy, but a lower out-of-sample accuracy
Aiming for a very high training accuracy can lead to overfitting to the training data, resulting in poor out-of-sample accuracy
Train/Test Split
We split our data into a portion for testing and a portion for training, these two sets are mutually exclusive and allow us to get a good idea of what our out-of-sample accuracy will be
Generally we would retrain the model on the full dataset (including the test portion) afterwards, so that the final model makes use of all the available data
K-Fold Cross-Validation
This involves splitting the dataset into K folds and using each fold once as the test set while training on the remaining folds, then averaging the results in order to get a more reliable, aggregated estimate of out-of-sample accuracy
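As a quick sketch of this idea with scikit-learn (assuming the cdf frame from the fuel-consumption lab above is still in scope), cross_val_score handles the splitting and aggregation for us

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# assumes `cdf` from the simple linear regression lab above
X = cdf[['ENGINESIZE']]
y = cdf['CO2EMISSIONS']

# 5-fold cross-validation: each fold is used once as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)         # R^2 for each fold
print(scores.mean())  # aggregated estimate of out-of-sample accuracy
```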
Evaluation Metrics
Evaluation metrics are used to evaluate the performance of a model; they provide insight into areas of the model that require attention
Errors
In the context of regression, the error is the difference between the data points and the values predicted by the model

Some of the main error measures are defined below

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{MSE}$$
Fit
$R^2$ helps us see how closely our data is represented by a specific regression line, and is defined as

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

Or equivalently

$$R^2 = 1 - \frac{RSS}{TSS}$$

A higher $R^2$ represents a better fit
Non-Linear Regression
Not all data can be predicted using a straight regression line; we have many different regression curves available to fit more complex data
Polynomial Regression
Polynomial Regression is a method with which we can fit a polynomial to our data. It is still possible for us to solve a polynomial regression by transforming it into a multi-variable linear regression problem as follows

Given the polynomial

$$\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$

We can create new variables which represent the different powers of our initial variable

$$x_1 = x, \quad x_2 = x^2, \quad x_3 = x^3$$

Therefore resulting in the following linear equation

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$$
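A minimal sketch of this transformation with scikit-learn's PolynomialFeatures (the single-feature data below is made up for illustration)

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# hypothetical single-feature data
x = np.array([[1.5], [2.0], [2.4], [3.5], [4.3]])
y = np.array([136, 196, 221, 255, 294])

# expand x into [1, x, x^2, x^3] so a linear model can fit a cubic polynomial
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print(model.coef_, model.intercept_)
print(model.predict(poly.transform([[3.0]])))
```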
Other Non-Linear Regression
Non-Linear Regression can be of many forms as well, including any other mathematical relationships that we can define
For more complex NLR problems it can be difficult to estimate the parameters for the equation
Lab
There are many different model types and equations shown in the Lab Notebook aside from what I have here
Import the Data
Using China's GDP data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv')
df.head()
Year | Value | |
---|---|---|
0 | 1960 | 5.918412e+10 |
1 | 1961 | 4.955705e+10 |
2 | 1962 | 4.668518e+10 |
3 | 1963 | 5.009730e+10 |
4 | 1964 | 5.906225e+10 |
x_data, y_data = (df[['Year']], df[['Value']])
Plotting the Data
plt.title('China\'s GDP by Year')
plt.plot(x_data, y_data, 'o')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Defining a Fit
Next we can try to approximate a curve that we think will fit the data we have, in this case we can use a sigmoid, as defined below

$$\hat{y} = \frac{1}{1 + e^{-\beta_1(x - \beta_2)}}$$

$\beta_1$: Controls the curve's steepness

$\beta_2$: Slides the curve along the x-axis
def sigmoid(x, b_1, b_2):
y = 1 / (1 + np.exp(-b_1*(x-b_2)))
return y
The shape of the above function can be seen below
X = np.arange(-5.0, 5.0, 0.1)
Y = sigmoid(X, 1, 1)
plt.title('Sigmoid')
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Next let's try to fit this to the data with some example values
b_1 = 0.10
b_2 = 1990.0
#logistic function
y_pred = sigmoid(x_data, b_1 , b_2)
#plot initial prediction against datapoints
plt.title('Approximating NLR with Sigmoid')
plt.plot(x_data, y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
plt.show()
Data Normalization
Let's normalize our data so that we don't need to multiply by crazy numbers as before
# for some reason this seems to be the only way the conversion
# from a dataframe works as desired
# the normalization from the labs are as such:
# xdata =x_data/max(x_data)
# ydata =y_data/max(y_data)
x_norm = (np.array(x_data)/max(np.array(x_data))).transpose()[0]
y_norm = (np.array(y_data)/max(np.array(y_data))).transpose()[0]
Finding the Best Fit
Next we can import curve_fit
to help us fit the curve to our data
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, x_norm, y_norm)
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
print(popt)
print(pcov)
And we can plot the result as follows
x = np.linspace(1960, 2015, 55)
x = x/max(x)
y = sigmoid(x, *popt)
plt.title('Sigmoid Fit of Data')
plt.plot(x_norm, y_norm, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend()
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Model Accuracy
from sklearn.metrics import r2_score
# split data into train/test
mask = np.random.rand(len(df)) < 0.8
train_x = x_norm[mask]
test_x = x_norm[~mask]
train_y = y_norm[mask]
test_y = y_norm[~mask]
# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)
# predict using test set
y_hat = sigmoid(test_x, *popt)
# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) ** 2))
print("R2-score: %.2f" % r2_score(y_hat , test_y) )
Classification
Classification is a supervised learning approach which is a means of splitting data into discrete classes
The target attribute is a categorical value with discrete values
Classification will determine the class label for a specific test case
Binary as well as multi-class classification methods are available
Learning Algorithms
Many learning algorithms are available for classification such as
- Decision trees
- Naive Bayes
- KNN
- Logistic Regression
- Neural Networks
- SVM
Evaluation Metrics
We have a few different evaluation metrics for classification
Jaccard Index
We simply measure which fraction of our predicted values intersect with the actual values

$$J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$$
F1 Score
This is a measure which makes use of a confusion matrix and compares the predictions vs actual values for each class
In the case of binary classification this will give us our True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN)

We can define some metrics for each class with the following

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$

F1 varies between 0 and 1, with 1 being the best

The F1 score for the classifier is the average of the F1 scores of each of its classes
Log Loss
The log loss measures the performance of a classifier where the predicted output is a probability between 0 and 1

$$LogLoss = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
Better classifiers have a log loss closer to zero
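As a toy illustration of these metrics with scikit-learn (the labels and probabilities below are made up; note that newer scikit-learn versions expose jaccard_score, while the labs below use the older jaccard_similarity_score)

```python
import numpy as np
from sklearn.metrics import jaccard_score, f1_score, log_loss

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
# predicted probability of class 1 for each sample
y_prob = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.6])

print(jaccard_score(y_true, y_pred))  # intersection over union of the positive class
print(f1_score(y_true, y_pred))       # harmonic mean of precision and recall
print(log_loss(y_true, y_prob))       # lower is better
```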
K-Nearest Neighbor
KNN is a method of determining the class of a test datapoint based on the training datapoints that sit nearest to it, working on the assumption that nearby datapoints are more important than those further away when predicting a specific value
Algorithm
- Pick a value for K
- Calculate the distance of the unknown case from all known cases
- Select the k nearest observations
- Predict the value based on the most common observation value among them
We can make use of euclidean distance to calculate the distance between our continuous values, and a voting system for discrete data
Using a low K value can lead to overfitting, and using a very high value can lead to us underfitting
In order to find the optimal K value we do multiple tests by continuously increasing our K value and measuring the accuracy for that K value
Furthermore KNN can also be used to predict continuous values (regression) by simply having a target variable and predictors that are continuous
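To make the idea concrete, here is a minimal from-scratch sketch of a single KNN classification (made-up data, Euclidean distance, majority vote)

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the unknown case to every known case
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected class: 0
```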
Lab
Import Libraries
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
Import Data
The dataset being used is one in which demographic data is used to define a customer service group, these being as follows
Value | Category |
---|---|
1 | Basic Service |
2 | E-Service |
3 | Plus Service |
4 | Total Service |
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv')
df.head()
region | tenure | age | marital | address | income | ed | employ | retire | gender | reside | custcat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 13 | 44 | 1 | 9 | 64.0 | 4 | 5 | 0.0 | 0 | 2 | 1 |
1 | 3 | 11 | 33 | 1 | 7 | 136.0 | 5 | 5 | 0.0 | 0 | 6 | 4 |
2 | 3 | 68 | 52 | 1 | 24 | 116.0 | 1 | 29 | 0.0 | 1 | 2 | 3 |
3 | 2 | 33 | 33 | 0 | 12 | 33.0 | 2 | 0 | 0.0 | 1 | 1 | 1 |
4 | 2 | 23 | 30 | 1 | 9 | 30.0 | 1 | 2 | 0.0 | 0 | 4 | 3 |
Data Visualization and Analysis
We can look at the number of customers in each class
df.custcat.value_counts()
3 281
1 266
4 236
2 217
Name: custcat, dtype: int64
df.hist()
plt.show()
We can take a closer look at income with
df.income.hist(bins=50)
plt.title('Income of Customers')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
Features
To use sklearn
we need to convert our data into an array as follows
df.columns
Index(['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
'employ', 'retire', 'gender', 'reside', 'custcat'],
dtype='object')
# X = df.loc[:, 'region':'reside'].values
# Y = df.loc[:,'custcat'].values
X = df.loc[:, 'region':'reside']
Y = df.loc[:,'custcat']
X.head()
region | tenure | age | marital | address | income | ed | employ | retire | gender | reside | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 13 | 44 | 1 | 9 | 64.0 | 4 | 5 | 0.0 | 0 | 2 |
1 | 3 | 11 | 33 | 1 | 7 | 136.0 | 5 | 5 | 0.0 | 0 | 6 |
2 | 3 | 68 | 52 | 1 | 24 | 116.0 | 1 | 29 | 0.0 | 1 | 2 |
3 | 2 | 33 | 33 | 0 | 12 | 33.0 | 2 | 0 | 0.0 | 1 | 1 |
4 | 2 | 23 | 30 | 1 | 9 | 30.0 | 1 | 2 | 0.0 | 0 | 4 |
Y.head()
0 1
1 4
2 3
3 1
4 3
Name: custcat, dtype: int64
Normalize Data
For algorithms like KNN which are distance based it is useful to normalize the data to have a zero mean and unit variance, we can do this using the sklearn.preprocessing
package
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
print(X[0:5])
Test/Train Split
Next we can split our model into a test and train set using sklearn.model_selection.train_test_split()
from sklearn.model_selection import train_test_split
ran = 4
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=ran)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Classification
We can then make use of the KNN classifier on our data
from sklearn.neighbors import KNeighborsClassifier as knn_classifier
We will use an initial value of 4 for k, but will later evaluate different k values
k = 4
knn = knn_classifier(n_neighbors=k)
knn.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=4, p=2,
weights='uniform')
Y_hat = knn.predict(X_test)
print(Y_hat[0:5])
Model Evaluation
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, knn.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_hat))
Other K Values
We can do this for additional K values to look at how the accuracy is affected
k_max = 100
mean_acc = np.zeros((k_max))
std_acc = np.zeros((k_max))
ConfustionMx = [];
for n in range(1,k_max + 1):
#Train Model and Predict
knn = knn_classifier(n_neighbors = n).fit(X_train,Y_train)
Y_hat = knn.predict(X_test)
mean_acc[n-1] = metrics.accuracy_score(Y_test, Y_hat)
std_acc[n-1] = np.std(Y_hat == Y_test)/np.sqrt(Y_hat.shape[0])
print(mean_acc)
plt.title('Accuracy vs K')
plt.plot(range(1,k_max + 1),mean_acc,'g')
plt.fill_between(range(1,k_max + 1),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1 std'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
The maximum accuracy can be found to be
print('Max Accuracy: {}, K={}'.format(max(mean_acc),mean_acc.argmax() + 1))
Test Sample
It can be noted that the accuracy and the optimal K value vary based on the random_state
parameter in the train_test_split
function used when doing the test/train split
Retrain with All Data
We can retrain the model to use all the data at the determined optimal value and look at the in-sample accuracy
k = mean_acc.argmax() + 1  # argmax is zero-based, K values start at 1
knn = knn_classifier(n_neighbors=k)
knn.fit(X, Y)
print("In-Sample Accuracy: ", metrics.accuracy_score(Y, knn.predict(X)))
Decision Trees
Decision Trees allow us to make use of discrete and continuous predictors to find a discrete target
Decision trees test a condition and branch off based on the result, eventually leading to a specific outcome/decision
Algorithm
- Choose an attribute from the dataset
- Calculate the significance of an attribute in splitting the data
- Split the data based on the value of the attribute
- Go to 1
We aim to have resulting nodes that are high in purity. A higher purity increases predictiveness/significance
Recursive partitioning is used to decrease the impurity/entropy in the resulting nodes
Entropy is a measurement of randomness
If the samples are equally mixed the entropy is 1, and if the samples are completely pure the entropy is 0
The best tree is the one that results in the most information gain after the split
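A small sketch of how entropy and the information gain of a candidate split could be computed (the class counts below are made up)

```python
import numpy as np

def entropy(labels):
    # entropy of a set of class labels: -sum(p * log2(p))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # reduction in entropy achieved by splitting `parent` into `left` and `right`
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array(['drugY'] * 9 + ['drugX'] * 5)
left = np.array(['drugY'] * 8 + ['drugX'] * 1)   # left branch of a candidate split
right = np.array(['drugY'] * 1 + ['drugX'] * 4)  # right branch of a candidate split
print(entropy(parent), information_gain(parent, left, right))
```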
Lab
Import Libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv')
print(df.shape)
df.head()
Age | Sex | BP | Cholesterol | Na_to_K | Drug | |
---|---|---|---|---|---|---|
0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
1 | 47 | M | LOW | HIGH | 13.093 | drugC |
2 | 47 | M | LOW | HIGH | 10.114 | drugC |
3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
4 | 61 | F | LOW | HIGH | 18.043 | drugY |
Split X and Y Values
X_headers = ['Age','Sex','BP','Cholesterol','Na_to_K']
X = df[X_headers]
X.head()
Age | Sex | BP | Cholesterol | Na_to_K | |
---|---|---|---|---|---|
0 | 23 | F | HIGH | HIGH | 25.355 |
1 | 47 | M | LOW | HIGH | 13.093 |
2 | 47 | M | LOW | HIGH | 10.114 |
3 | 28 | F | NORMAL | HIGH | 7.798 |
4 | 61 | F | LOW | HIGH | 18.043 |
Y = df[['Drug']]
Y.head()
Drug | |
---|---|
0 | drugY |
1 | drugC |
2 | drugC |
3 | drugX |
4 | drugY |
Create Numeric Variables
We need to get numeric variables for X as sklearn
does not support string categorization (according to the guy in the course anyway)
from sklearn import preprocessing
X_arr = np.array(X)
encoder = preprocessing.LabelEncoder()
encoder.fit(['F','M'])
X_arr[:,1] = encoder.transform(X_arr[:,1])
encoder.fit(['LOW','NORMAL','HIGH'])
X_arr[:,2] = encoder.transform(X_arr[:,2])
encoder.fit(['NORMAL','HIGH'])
X_arr[:,3] = encoder.transform(X_arr[:,3])
print(X_arr[0:5])
X_encoded = pd.DataFrame(data=X_arr, columns=X_headers)
X_encoded.head()
Age | Sex | BP | Cholesterol | Na_to_K | |
---|---|---|---|---|---|
0 | 23 | 0 | 0 | 0 | 25.355 |
1 | 47 | 1 | 1 | 0 | 13.093 |
2 | 47 | 1 | 1 | 0 | 10.114 |
3 | 28 | 0 | 2 | 0 | 7.798 |
4 | 61 | 0 | 1 | 0 | 18.043 |
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y, test_size = 0.3)
print('Training: X : {}, Y : {}'.format(X_train.shape,Y_train.shape))
print('Testing: X : {}, Y : {}'.format(X_test.shape,Y_test.shape))
Decision Tree
drug_tree = DecisionTreeClassifier(criterion='entropy', max_depth = 4)
drug_tree
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
drug_tree.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
Prediction
Y_predicted = drug_tree.predict(X_test)
print(Y_predicted[0:5])
print(Y_test[0:5])
Evaluation
from sklearn import metrics
print('Decision Tree Accuracy: ', metrics.accuracy_score(Y_test, Y_predicted))
Visualization
!pip install pydotplus
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
dot_data = StringIO()
filename = 'drug_decision_tree.png'
feature_names = X_headers
target_names = df['Drug'].unique().tolist()
out = tree.export_graphviz(drug_tree,
feature_names=feature_names,
out_file=dot_data,
class_names=target_names,
filled=True,
special_characters=True,
rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
True
img = mpimg.imread(filename)
plt.figure(figsize=(100, 100))
plt.imshow(img, interpolation='nearest')
plt.show()
Logistic Regression
Logistic regression is a classification algorithm for categorical variables, based on a linear division (decision boundary) between the classes
Logistic regression can be used for binary and multi class classification and predicts the probability of a class which is then mapped to a discrete value
Logistic regression is best suited to
- Binary Classification
- If you need probabilistic results
- Linear decision boundary
- If you need to understand the impact of a feature
A logistic regression calculates the probability

$$P(y = 1 \mid x)$$

and from this

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$$
Logistic vs Linear Regression
We can use linear regression with a dividing line to determine whether or not a specific circumstance will lead to a specific output, where we define a threshold value which acts as a boundary for the target class
The problem with this method is that we only have a specific binary outcome, and not any information as to what the probability of that outcome is. Logistic regression helps us to define this by making use of a sigmoid to smooth out the classification boundary, the sigmoid function can be seen below
import numpy as np
from math import exp
import matplotlib.pyplot as plt
x = np.array(range(-100,102,2))/10
sigmoid = 1/(1+np.exp(-1*x))
step = []
for i in range(len(x)):
step.append(1 if x[i] >= 0 else 0)
plt.plot(x,step, label='Step' )
plt.plot(x,sigmoid, label='Sigmoid')
# plt.xlim(-10,10)
plt.ylim(-0.1,1.1)
plt.xlabel('$x$')
plt.ylabel('$\sigma(x)$')
plt.legend()
plt.show()
Based on the above we can see that depending on the value of $x$ the output will tend towards 0 or 1 without ever being exactly either
Algorithm
- Initialize $\theta$
- Calculate $\hat{y} = \sigma(\theta^T X)$ for a sample
- Compare $\hat{y}$ and the actual $y$ and record the error, defined by a cost function
- Change $\theta$ to reduce the cost
- Go to 2
We can use different ways to change $\theta$, such as gradient descent
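A minimal sketch of the algorithm above using batch gradient descent on made-up data (the cost here is the log loss)

```python
import numpy as np

# made-up binary classification data: a bias column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])  # 1. initialize theta
lr = 0.1                      # learning rate

for _ in range(1000):
    y_hat = 1 / (1 + np.exp(-X @ theta))  # 2. predicted probabilities (sigmoid)
    grad = X.T @ (y_hat - y) / len(y)     # 3. gradient of the log-loss cost
    theta -= lr * grad                    # 4. update theta to reduce the cost

print(theta)
print((1 / (1 + np.exp(-X @ theta)) >= 0.5).astype(int))  # predicted classes
```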
Lab
Import Libraries
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
import matplotlib.pyplot as plt
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv')
df.head()
tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.40 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.240 | 3.497 | 1.0 | 1.0 |
2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.30 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.240 | 3.401 | 3.0 | 0.0 |
3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.800 | 3.807 | 4.331 | 4.0 | 0.0 |
4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.10 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.960 | 3.091 | 4.382 | 3.0 | 0.0 |
5 rows × 28 columns
Preprocessing
df = df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]
df[['churn']] = df[['churn']].astype('int')
df.head()
tenure | age | address | income | ed | employ | equip | callcard | wireless | churn | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 1 |
1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0 |
3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 0 |
4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 0 |
Define X and Y
X = np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]
array([[ 11., 33., 7., 136., 5., 5., 0.],
[ 33., 33., 12., 33., 2., 0., 0.],
[ 23., 30., 9., 30., 1., 2., 0.],
[ 38., 35., 5., 76., 2., 10., 1.],
[ 7., 35., 14., 80., 2., 15., 0.]])
Y = np.asarray(df['churn'])
Y[0:5]
array([1, 1, 0, 0, 0])
Normalize Data
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-1.13518441, -0.62595491, -0.4588971 , 0.4751423 , 1.6961288 ,
-0.58477841, -0.85972695],
[-0.11604313, -0.62595491, 0.03454064, -0.32886061, -0.6433592 ,
-1.14437497, -0.85972695],
[-0.57928917, -0.85594447, -0.261522 , -0.35227817, -1.42318853,
-0.92053635, -0.85972695],
[ 0.11557989, -0.47262854, -0.65627219, 0.00679109, -0.6433592 ,
-0.02518185, 1.16316 ],
[-1.32048283, -0.47262854, 0.23191574, 0.03801451, -0.6433592 ,
0.53441472, -0.85972695]])
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, Y_train)
lr
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Predict
Y_hat = lr.predict(X_test)
Y_hat_prob = lr.predict_proba(X_test)
Y_hat_prob[0:5]
array([[ 0.54132919, 0.45867081],
[ 0.60593357, 0.39406643],
[ 0.56277713, 0.43722287],
[ 0.63432489, 0.36567511],
[ 0.56431839, 0.43568161]])
Evaluation
Jaccard Index
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(Y_test, Y_hat)
0.75
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(confusion_matrix(Y_test, Y_hat, labels=[1,0]))
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[1,0])
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')
print(classification_report(Y_test, Y_hat))
from sklearn.metrics import log_loss
log_loss(Y_test, Y_hat_prob)
Using Different Model Parameters
lr2 = LogisticRegression(C=10, solver='sag').fit(X_train,Y_train)
Y_hat_prob2 = lr2.predict_proba(X_test)
print ("LogLoss: : %.2f" % log_loss(Y_test, Y_hat_prob2))
Support Vector Machine
SVM is a supervised algorithm that classifies data by finding a separator
- Map data to higher-dimensional Feature Space
- Find a separating hyperplane in higher dimensional space
Data Transformation
Mapping data into a higher-dimensional space is known as kernelling, and can make use of different kernel functions such as
- Linear
- Polynomial
- RBF
- Sigmoid
The best hyperplane is the one that results in the largest margin possible between the hyperplane and our closest samples; the samples closest to the hyperplane are known as support vectors
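As a toy sketch of trying the different kernels with scikit-learn (the two made-up classes below are linearly separable, so all kernels should do reasonably well here)

```python
import numpy as np
from sklearn import svm

np.random.seed(0)
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])  # two made-up classes
y = np.array([0] * 20 + [1] * 20)

for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVC(kernel=kernel, gamma='auto').fit(X, y)
    print(kernel, clf.score(X, y), 'support vectors:', len(clf.support_vectors_))
```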
Advantages and Disadvantages
- Advantages
- Accurate in high dimensional spaces
- Memory efficient
- Disadvantages
- Prone to overfitting
- No probability estimation
- Not suited to very large datasets
Applications
- Image Recognition
- Text mining/categorization
- Spam detection
- Sentiment analysis
- Regression
- Outlier detection
- Clustering
Lab
Import Packages
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Import Data
The data is from the UCI Machine Learning Repository, the fields are as follows
Field name | Description |
---|---|
ID | Sample identifier |
Clump | Clump thickness |
UnifSize | Uniformity of cell size |
UnifShape | Uniformity of cell shape |
MargAdh | Marginal adhesion |
SingEpiSize | Single epithelial cell size |
BareNuc | Bare nuclei |
BlandChrom | Bland chromatin |
NormNucl | Normal nucleoli |
Mit | Mitoses |
Class | Benign or malignant |
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv')
df.head()
ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
Visualization
The Class
field contains the diagnosis where 2 means benign, and 4 means malignant
ax = df[df['Class'] == 4][0:50].plot(kind='scatter',
x='Clump',
y='UnifSize',
color='DarkBlue',
label='malignant');
df[df['Class'] == 2][0:50].plot(kind='scatter',
x='Clump',
y='UnifSize',
color='Yellow',
label='benign',
ax=ax);
plt.show()
Preprocessing Data
print(df.dtypes)
df = df[pd.to_numeric(df['BareNuc'].apply(lambda x: x.isnumeric()))]
df['BareNuc'] = df['BareNuc'].astype('int')
df.dtypes
ID int64
Clump int64
UnifSize int64
UnifShape int64
MargAdh int64
SingEpiSize int64
BareNuc int64
BlandChrom int64
NormNucl int64
Mit int64
Class int64
dtype: object
Break into X and Y
X = np.asarray(df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']])
X[0:5]
array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],
[ 5, 4, 4, 5, 7, 10, 3, 2, 1],
[ 3, 1, 1, 1, 2, 2, 3, 1, 1],
[ 6, 8, 8, 1, 3, 4, 3, 7, 1],
[ 4, 1, 1, 3, 2, 1, 3, 1, 1]])
Y = np.asarray(df['Class'])
Y[0:5]
array([2, 2, 2, 2, 2])
Train/Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2,
random_state=4)
print ('Train set:', X_train.shape, Y_train.shape)
print ('Test set:', X_test.shape, Y_test.shape)
Modeling
from sklearn import svm
clf = svm.SVC(gamma='auto', kernel='rbf')
clf.fit(X_train, Y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Y_hat = clf.predict(X_test)
Y_hat[0:5]
array([2, 4, 2, 4, 2])
Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[2,4])
np.set_printoptions(precision=2)
print (classification_report(Y_test, Y_hat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],
normalize= False, title='Confusion matrix')
from sklearn.metrics import f1_score
print('F1 Score: ', f1_score(Y_test, Y_hat, average='weighted'))
from sklearn.metrics import jaccard_similarity_score
print('Jaccard Index: ', jaccard_similarity_score(Y_test, Y_hat))
Using an Alternative Kernel
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, Y_train)
Y_hat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(Y_test, Y_hat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(Y_test, Y_hat2))
Clustering
Clustering is an unsupervised grouping of data in which similar datapoints are grouped together
The difference between clustering and classification is that clustering does not specify what the groupings should be
Uses of Clustering
- Exploration of data
- Summary Generation
- Outlier Detection
- Finding Duplicates
- Data Pre-Processing
Clustering Algorithms
- Partitioned Based
- Efficient
- Hierarchical
- Produces trees of clusters
- Density based
- Produces arbitrary shaped clusters
K-Means
- Partitioning Clustering
- Divides data into K non-overlapping subsets
K-Means tries to minimize intra-cluster distances, and maximize inter-cluster distances
Distance
We can define the distance simply as the euclidean distance, typically normalizing the values so that our distances are not affected more by one value than another
Other distance formulas can be used depending on our understanding of the data as appropriate
Algorithm
- Determine K and initialize centroids randomly
- Measure distance from centroids to each datapoint
- Assign each point to closest centroid
- New centroids are at the mean of the points in its cluster
- Go to 2 if not converged
K-Means may not converge to a global optimum, but simply a local one, and the result is somewhat dependent on the initial choice of centroids in step 1
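A compact sketch of the loop above with NumPy on made-up data (K = 2; this ignores convergence checks and is only meant to illustrate the assign/update steps)

```python
import numpy as np

np.random.seed(0)
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])  # two made-up blobs
k = 2

centroids = X[np.random.choice(len(X), k, replace=False)]  # 1. random initial centroids
for _ in range(10):
    # 2./3. assign each point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # 4. move each centroid to the mean of the points assigned to it
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                          for j in range(k)])

print(centroids)
```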
Accuracy
Average distance between datapoints within a cluster is a measure of error
Choice of K
We can use the elbow method, in which we plot the total distance of the datapoints to their centroids against the K value, and select the K at which we notice a sharp change in the gradient of that distance
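A sketch of the elbow method using scikit-learn's KMeans and its inertia_ attribute (the blobs below are made up; in practice this would be run on the normalized feature matrix)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(0)
# three made-up blobs
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

# total within-cluster sum of squared distances (inertia) for each candidate K
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 10)]

plt.plot(range(1, 10), inertias, 'o-')
plt.xlabel('K')
plt.ylabel('Within-cluster sum of squared distances')
plt.title('Elbow Method')
plt.show()
```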
Lab
Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# from sklearn.datasets.samples_generator import make_blobs
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv')
df.head()
Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | Address | DebtIncomeRatio | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | NBA001 | 6.3 |
1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | NBA021 | 12.8 |
2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | NBA013 | 20.9 |
3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | NBA009 | 6.3 |
4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | NBA008 | 7.2 |
df = df.drop('Address', axis=1)
df.head()
Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | 6.3 |
1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | 12.8 |
2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | 20.9 |
3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | 6.3 |
4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | 7.2 |
Normalize the Data
from sklearn.preprocessing import StandardScaler
X = np.asarray(df.values[:,1:])
X = np.nan_to_num(X)
X
array([[ 41. , 2. , 6. , ..., 1.07, 0. , 6.3 ],
[ 47. , 1. , 26. , ..., 8.22, 0. , 12.8 ],
[ 33. , 2. , 10. , ..., 5.8 , 1. , 20.9 ],
...,
[ 25. , 4. , 0. , ..., 3.21, 1. , 33.4 ],
[ 32. , 1. , 12. , ..., 0.7 , 0. , 2.9 ],
[ 52. , 1. , 16. , ..., 3.64, 0. , 8.6 ]])
X_norm = StandardScaler().fit_transform(X)
X_norm
array([[ 0.74, 0.31, -0.38, ..., -0.59, -0.52, -0.58],
[ 1.49, -0.77, 2.57, ..., 1.51, -0.52, 0.39],
[-0.25, 0.31, 0.21, ..., 0.8 , 1.91, 1.6 ],
...,
[-1.25, 2.47, -1.26, ..., 0.04, 1.91, 3.46],
[-0.38, -0.77, 0.51, ..., -0.7 , -0.52, -1.08],
[ 2.11, -0.77, 1.1 , ..., 0.16, -0.52, -0.23]])
Modeling
k = 3
k_means = KMeans(init='k-means++',
n_clusters=k,
n_init=12)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=12, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
labels = k_means.labels_
print(labels[:20], labels.shape)
df['Cluster'] = labels
df.head()
Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio | Cluster | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | 6.3 | 1 |
1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | 12.8 | 0 |
2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | 20.9 | 1 |
3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | 6.3 | 1 |
4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | 7.2 | 2 |
df.groupby('Cluster').mean()
Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio | |
---|---|---|---|---|---|---|---|---|---|
Cluster | |||||||||
0 | 402.295082 | 41.333333 | 1.956284 | 15.256831 | 83.928962 | 3.103639 | 5.765279 | 0.171233 | 10.724590 |
1 | 432.468413 | 32.964561 | 1.614792 | 6.374422 | 31.164869 | 1.032541 | 2.104133 | 0.285185 | 10.094761 |
2 | 410.166667 | 45.388889 | 2.666667 | 19.555556 | 227.166667 | 5.678444 | 10.907167 | 0.285714 | 7.322222 |
Visualization
%matplotlib inline
area = np.pi*(X[:,1])**2
plt.figure()
plt.title('Income vs Age')
plt.scatter(X[:,0], X[:,3], s=area, c=labels, alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
from mpl_toolkits.mplot3d import Axes3D
plt.clf()
ax = Axes3D(plt.figure(figsize=(8,6)), rect=[0,0,0.95,1], elev=48, azim=134)
plt.cla()
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')
ax.scatter(X[:,1], X[:,0], X[:,3], c=labels)
plt.figure()
plt.show()
Hierarchical Clustering
Two types
- Divisive - Top Down
- Agglomerative - Bottom Up
Agglomerative works by combining clusters based on the distance between them, this is the most popular method for HC
Agglomerative Algorithm
- Create n clusters, one for each datapoint
- Compute the proximity matrix
- Repeat Until a single cluster remains
- Merge the two closest clusters
- Update the proximity matrix
We can use any distance function we want to, and there are multiple linkage algorithms for computing the distance between clusters
- Single linkage clustering
- Complete linkage clustering
- Average linkage clustering
- Centroid linkage clustering
Advantages and Disadvantages
- Advantages
- Number of clusters does not need to be specified
- Easy to implement
- Dendrogram can be easily understood
- Disadvantages
- Long runtimes
- Cannot undo previous steps
- Difficult to identify the number of clusters on the dendrogram
Lab
Import Packages
import numpy as np
import pandas as pd
from scipy import ndimage
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets.samples_generator import make_blobs
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv')
df.head()
manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.000 | 21.500 | 1.800 | 140.000 | 101.200 | 67.300 | 172.400 | 2.639 | 13.200 | 28.000 | 2.828 | 0.0 |
1 | Acura | TL | 39.384 | 19.875 | 0.000 | 28.400 | 3.200 | 225.000 | 108.100 | 70.300 | 192.900 | 3.517 | 17.200 | 25.000 | 3.673 | 0.0 |
2 | Acura | CL | 14.114 | 18.225 | 0.000 | $null$ | 3.200 | 225.000 | 106.900 | 70.600 | 192.000 | 3.470 | 17.200 | 26.000 | 2.647 | 0.0 |
3 | Acura | RL | 8.588 | 29.725 | 0.000 | 42.000 | 3.500 | 210.000 | 114.600 | 71.400 | 196.600 | 3.850 | 18.000 | 22.000 | 2.150 | 0.0 |
4 | Audi | A4 | 20.397 | 22.255 | 0.000 | 23.990 | 1.800 | 150.000 | 102.600 | 68.200 | 178.000 | 2.998 | 16.400 | 27.000 | 3.015 | 0.0 |
Clean Data
print ("Shape of dataset before cleaning: ", df.size)
df[[ 'sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']] = df[['sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
df = df.reset_index(drop=True)
print ("Shape of dataset after cleaning: ", df.size)
df.head()
manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.0 | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2.828 | 0.0 |
1 | Acura | TL | 39.384 | 19.875 | 0.0 | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 3.673 | 0.0 |
2 | Acura | RL | 8.588 | 29.725 | 0.0 | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 2.150 | 0.0 |
3 | Audi | A4 | 20.397 | 22.255 | 0.0 | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 3.015 | 0.0 |
4 | Audi | A6 | 18.780 | 23.555 | 0.0 | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 2.933 | 0.0 |
Selecting Features
X = df[['engine_s','horsepow', 'wheelbas',
'width', 'length', 'curb_wgt',
'fuel_cap', 'mpg']].values
print(X[:5])
Normalization
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm[:5])
Clustering with Scipy
import scipy as sp
entries = X_norm.shape[0]
D = np.zeros([entries, entries])
for i in range(entries):
    for j in range(entries):
        D[i, j] = sp.spatial.distance.euclidean(X_norm[i], X_norm[j])
print(D)
We have different linkage methods available, such as
- single
- complete
- average
- weighted
- centroid
import pylab
import scipy.cluster.hierarchy
Z = hierarchy.linkage(D, 'complete')
print(Z[:5])
from scipy.cluster.hierarchy import fcluster
max_d = 3
clusters = fcluster(Z, max_d, criterion='distance')
print(clusters)
k = 5
clusters = fcluster(Z, k, criterion='maxclust')
print(clusters)
fig = pylab.figure(figsize=(18,50))
def llf(id):
return '[%s %s %s]' % (df['manufact'][id],
df['model'][id],
int(float(df['type'][id])))
dendro = hierarchy.dendrogram(Z, leaf_label_func=llf,
leaf_rotation=0,
leaf_font_size=12,
orientation='right')
Clustering with SciKit Learn
D = distance_matrix(X, X)
print(D)
agglom = AgglomerativeClustering(n_clusters=6, linkage='complete')
agglom.fit(X)
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='complete', memory=None,
n_clusters=6, pooling_func=<function mean at 0x7f2ad42c3730>)
df['cluster_'] = agglom.labels_
df.head()
manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition | cluster_ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.0 | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2.828 | 0.0 | 2 |
1 | Acura | TL | 39.384 | 19.875 | 0.0 | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 3.673 | 0.0 | 0 |
2 | Acura | RL | 8.588 | 29.725 | 0.0 | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 2.150 | 0.0 | 0 |
3 | Audi | A4 | 20.397 | 22.255 | 0.0 | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 3.015 | 0.0 | 3 |
4 | Audi | A6 | 18.780 | 23.555 | 0.0 | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 2.933 | 0.0 | 0 |
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0,1,n_clusters))
cluster_labels = list(range(0,n_clusters))
plt.figure(figsize=(16,14))
for color, label in zip(colors, cluster_labels):
subset = df[df.cluster_ == label]
for i in subset.index:
plt.text(subset.horsepow[i],
subset.mpg[i],
str(subset['model'][i]),
rotation=25)
plt.scatter(subset.horsepow, subset.mpg,
s= subset.price*10, c=color,
label='cluster'+str(label),alpha=0.5)
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
plt.show()
df.groupby(['cluster_','type'])['cluster_'].count()
cluster_ type
0 0.0 29
1.0 14
1 0.0 10
2 0.0 26
1.0 4
3 0.0 21
1.0 11
4 0.0 1
5 0.0 1
Name: cluster_, dtype: int64
df_mean = df.groupby(['cluster_','type'])['horsepow','engine_s','mpg','price'].mean()
df_mean
horsepow | engine_s | mpg | price | ||
---|---|---|---|---|---|
cluster_ | type | ||||
0 | 0.0 | 210.551724 | 3.420690 | 23.648276 | 30.449310 |
1.0 | 206.428571 | 4.064286 | 18.500000 | 28.727714 | |
1 | 0.0 | 294.700000 | 4.380000 | 21.600000 | 57.864000 |
2 | 0.0 | 121.230769 | 1.934615 | 29.115385 | 14.720385 |
1.0 | 133.750000 | 2.225000 | 22.750000 | 15.856500 | |
3 | 0.0 | 160.857143 | 2.680952 | 24.857143 | 19.822048 |
1.0 | 154.272727 | 2.936364 | 20.909091 | 21.199364 | |
4 | 0.0 | 55.000000 | 1.000000 | 45.000000 | 9.235000 |
5 | 0.0 | 450.000000 | 8.000000 | 16.000000 | 69.725000 |
plt.figure(figsize=(16,10))
for color, label in zip(colors, cluster_labels):
subset = df_mean.loc[(label,),]
for i in subset.index:
plt.text(subset.loc[i][0]+5, subset.loc[i][2], 'type='+str(int(i)) + ', price='+str(int(subset.loc[i][3]))+'k')
plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, c=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
DBSCAN
Density-based clustering locates regions of high density and separates outliers, and is able to find arbitrarily shaped clusters while ignoring noise
- Density-Based Spatial Clustering of Applications with Noise
- Common clustering algorithm
- Based on object density
- Radius of neighborhood
- Min number of neighbors
Different types of points
- Core
- Has at least M neighbors within radius R
- Border
- Has a Core point within R, but fewer than M neighbors within R
- Outlier
- Not Core, or within R of Core
DBSCAN visits each point and identifies its type, and then groups points based on this
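A small sketch with scikit-learn's DBSCAN on made-up points, showing how core points and outliers are reported

```python
import numpy as np
from sklearn.cluster import DBSCAN

np.random.seed(0)
# two made-up blobs plus one far-away outlier
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8, [[20.0, 20.0]]])

db = DBSCAN(eps=1.5, min_samples=5).fit(X)  # eps is the radius R, min_samples is M

core_mask = np.zeros_like(db.labels_, dtype=bool)
core_mask[db.core_sample_indices_] = True

print(set(db.labels_))                 # cluster labels; -1 marks outliers/noise
print(core_mask.sum(), 'core points')
print((db.labels_ == -1).sum(), 'outliers')
```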
Lab
Import Packages
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
About the Data
Environment Canada
Monthly Values for July - 2015
Name in the table | Meaning |
---|---|
Stn_Name | Station Name |
Lat | Latitude (North+, degrees) |
Long | Longitude (West - , degrees) |
Prov | Province |
Tm | Mean Temperature (°C) |
DwTm | Days without Valid Mean Temperature |
D | Mean Temperature difference from Normal (1981-2010) (°C) |
Tx | Highest Monthly Maximum Temperature (°C) |
DwTx | Days without Valid Maximum Temperature |
Tn | Lowest Monthly Minimum Temperature (°C) |
DwTn | Days without Valid Minimum Temperature |
S | Snowfall (cm) |
DwS | Days without Valid Snowfall |
S%N | Percent of Normal (1981-2010) Snowfall |
P | Total Precipitation (mm) |
DwP | Days without Valid Precipitation |
P%N | Percent of Normal (1981-2010) Precipitation |
S_G | Snow on the ground at the end of the month (cm) |
Pd | Number of days with Precipitation 1.0 mm or more |
BS | Bright Sunshine (hours) |
DwBS | Days without Valid Bright Sunshine |
BS% | Percent of Normal (1981-2010) Bright Sunshine |
HDD | Degree Days below 18 °C |
CDD | Degree Days above 18 °C |
Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically). |
NA | Not Available |
Import the Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/weather-stations20140101-20141231.csv')
df.head()
Stn_Name | Lat | Long | Prov | Tm | DwTm | D | Tx | DwTx | Tn | ... | DwP | P%N | S_G | Pd | BS | DwBS | BS% | HDD | CDD | Stn_No | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CHEMAINUS | 48.935 | -123.742 | BC | 8.2 | 0.0 | NaN | 13.5 | 0.0 | 1.0 | ... | 0.0 | NaN | 0.0 | 12.0 | NaN | NaN | NaN | 273.3 | 0.0 | 1011500 |
1 | COWICHAN LAKE FORESTRY | 48.824 | -124.133 | BC | 7.0 | 0.0 | 3.0 | 15.0 | 0.0 | -3.0 | ... | 0.0 | 104.0 | 0.0 | 12.0 | NaN | NaN | NaN | 307.0 | 0.0 | 1012040 |
2 | LAKE COWICHAN | 48.829 | -124.052 | BC | 6.8 | 13.0 | 2.8 | 16.0 | 9.0 | -2.5 | ... | 9.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 168.1 | 0.0 | 1012055 |
3 | DISCOVERY ISLAND | 48.425 | -123.226 | BC | NaN | NaN | NaN | 12.5 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1012475 |
4 | DUNCAN KELVIN CREEK | 48.735 | -123.728 | BC | 7.7 | 2.0 | 3.4 | 14.5 | 2.0 | -1.0 | ... | 2.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 267.7 | 0.0 | 1012573 |
5 rows × 25 columns
Clean Data
df = df[pd.notnull(df['Tm'])]
df.reset_index(drop=True)  # note: the result is not assigned back, so df keeps its original index
df.head()
Stn_Name | Lat | Long | Prov | Tm | DwTm | D | Tx | DwTx | Tn | ... | DwP | P%N | S_G | Pd | BS | DwBS | BS% | HDD | CDD | Stn_No | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CHEMAINUS | 48.935 | -123.742 | BC | 8.2 | 0.0 | NaN | 13.5 | 0.0 | 1.0 | ... | 0.0 | NaN | 0.0 | 12.0 | NaN | NaN | NaN | 273.3 | 0.0 | 1011500 |
1 | COWICHAN LAKE FORESTRY | 48.824 | -124.133 | BC | 7.0 | 0.0 | 3.0 | 15.0 | 0.0 | -3.0 | ... | 0.0 | 104.0 | 0.0 | 12.0 | NaN | NaN | NaN | 307.0 | 0.0 | 1012040 |
2 | LAKE COWICHAN | 48.829 | -124.052 | BC | 6.8 | 13.0 | 2.8 | 16.0 | 9.0 | -2.5 | ... | 9.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 168.1 | 0.0 | 1012055 |
4 | DUNCAN KELVIN CREEK | 48.735 | -123.728 | BC | 7.7 | 2.0 | 3.4 | 14.5 | 2.0 | -1.0 | ... | 2.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 267.7 | 0.0 | 1012573 |
5 | ESQUIMALT HARBOUR | 48.432 | -123.439 | BC | 8.8 | 0.0 | NaN | 13.1 | 0.0 | 1.9 | ... | 8.0 | NaN | NaN | 12.0 | NaN | NaN | NaN | 258.6 | 0.0 | 1012710 |
5 rows × 25 columns
# ! pip install --user git+https://github.com/matplotlib/basemap.git
# from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = (14,10)
# bounding box (longitude/latitude limits) used to filter the stations
llon = -140
ulon = -50
llat = 40
ulat = 65
df = df[(df['Long'] > llon) & (df['Long'] < ulon) & (df['Lat'] > llat) & (df['Lat'] < ulat)]
plt.title('Location of Sensors')
plt.scatter(list(df['Long']),list(df['Lat']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
(Scatter plot of sensor locations: Longitude vs Latitude)
Compute DBSCAN
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
X = np.nan_to_num(df[['Lat','Long']])
X = StandardScaler().fit_transform(X)
X
array([[-0.3 , -1.17],
[-0.33, -1.19],
[-0.33, -1.18],
...,
[ 1.84, 1.47],
[ 1.01, 1.65],
[ 0.6 , 1.28]])
db = DBSCAN(eps=0.15, min_samples=10).fit(X)
db
DBSCAN(algorithm='auto', eps=0.15, leaf_size=30, metric='euclidean',
metric_params=None, min_samples=10, n_jobs=1, p=None)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
core_samples_mask
array([ True, True, True, ..., False, False, False], dtype=bool)
df['Clus_db'] = db.labels_
df.head()
Stn_Name | Lat | Long | Prov | Tm | DwTm | D | Tx | DwTx | Tn | ... | P%N | S_G | Pd | BS | DwBS | BS% | HDD | CDD | Stn_No | Clus_db | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CHEMAINUS | 48.935 | -123.742 | BC | 8.2 | 0.0 | NaN | 13.5 | 0.0 | 1.0 | ... | NaN | 0.0 | 12.0 | NaN | NaN | NaN | 273.3 | 0.0 | 1011500 | 0 |
1 | COWICHAN LAKE FORESTRY | 48.824 | -124.133 | BC | 7.0 | 0.0 | 3.0 | 15.0 | 0.0 | -3.0 | ... | 104.0 | 0.0 | 12.0 | NaN | NaN | NaN | 307.0 | 0.0 | 1012040 | 0 |
2 | LAKE COWICHAN | 48.829 | -124.052 | BC | 6.8 | 13.0 | 2.8 | 16.0 | 9.0 | -2.5 | ... | NaN | NaN | 11.0 | NaN | NaN | NaN | 168.1 | 0.0 | 1012055 | 0 |
4 | DUNCAN KELVIN CREEK | 48.735 | -123.728 | BC | 7.7 | 2.0 | 3.4 | 14.5 | 2.0 | -1.0 | ... | NaN | NaN | 11.0 | NaN | NaN | NaN | 267.7 | 0.0 | 1012573 | 0 |
5 | ESQUIMALT HARBOUR | 48.432 | -123.439 | BC | 8.8 | 0.0 | NaN | 13.1 | 0.0 | 1.9 | ... | NaN | NaN | 12.0 | NaN | NaN | NaN | 258.6 | 0.0 | 1012710 | 0 |
5 rows × 26 columns
df[['Stn_Name','Tx','Tm','Clus_db']][1000:1500:45]
Stn_Name | Tx | Tm | Clus_db | |
---|---|---|---|---|
1138 | HEATH POINT | -1.0 | -13.3 | -1 |
1185 | LA GRANDE RIVIERE A | -11.6 | -28.4 | -1 |
1234 | BRIER ISLAND | 4.4 | -6.3 | 3 |
1286 | BRANCH | 8.0 | -3.4 | 4 |
1332 | GOOSE A | -4.2 | -22.0 | -1 |
Cluster Visualization
print(df['Clus_db'].max(), df['Clus_db'].min())
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
colours = ['#D3D3D3','blue','red','green','purple','yellow','deepskyblue']
le.fit(colours)
le.classes_
array(['#D3D3D3', 'blue', 'deepskyblue', 'green', 'purple', 'red', 'yellow'],
dtype='<U11')
# le.inverse_transform([0,1,2,3,4,5,6])
# shift labels by 1 so that noise (-1) maps to class 0, which is the grey '#D3D3D3'
df['Colours'] = le.inverse_transform(db.labels_ + 1)
df[['Stn_Name','Tx','Tm','Clus_db', 'Colours']][1000:1500:45]
Stn_Name | Tx | Tm | Clus_db | Colours | |
---|---|---|---|---|---|
1138 | HEATH POINT | -1.0 | -13.3 | -1 | #D3D3D3 |
1185 | LA GRANDE RIVIERE A | -11.6 | -28.4 | -1 | #D3D3D3 |
1234 | BRIER ISLAND | 4.4 | -6.3 | 3 | purple |
1286 | BRANCH | 8.0 | -3.4 | 4 | red |
1332 | GOOSE A | -4.2 | -22.0 | -1 | #D3D3D3 |
plt.title('Clusters')
plt.scatter(list(df['Long']),
list(df['Lat']),
c=list(df['Colours']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
(Scatter plot of weather stations coloured by DBSCAN cluster)
Recommender Systems
Recommender systems try to capture people's behaviour in order to predict what people may like
There are two main types
- Content based
- Provide more content similar to what that user likes
- Collaborative filtering
- A user may be interested in what other similar users like
There are two types of implementations
- Memory based
- Uses the entire user-item dataset to generate a recommendation
- Model based
- Develops a model of users in an attempt to learn their preferences
Content Based
Content-based systems build a profile (model) of the user from the content they interact with, and recommend new content that is most similar to that profile
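As a sketch of the scoring that the lab below works towards (with each movie's genres as the content features), the user profile is a rating-weighted sum of feature vectors, and each candidate item is scored by a normalised dot product with that profile:

$$\text{profile}_g = \sum_{i \in \text{rated}} r_i \, x_{i,g} \qquad \text{score}(j) = \frac{\sum_g x_{j,g} \, \text{profile}_g}{\sum_g \text{profile}_g}$$

where $r_i$ is the user's rating of item $i$ and $x_{i,g} \in \{0,1\}$ indicates whether item $i$ has feature (genre) $g$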
Lab
Download the Data
The dataset being used is a movie dataset from GroupLens
# only run once
# !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
# print('unzipping ...')
# !unzip -o -j moviedataset.zip
Import Packages
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
Import Data
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
ratings_df.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 169 | 2.5 | 1204927694 |
1 | 1 | 2471 | 3.0 | 1204927438 |
2 | 1 | 48516 | 5.0 | 1204927435 |
3 | 2 | 2571 | 3.5 | 1436165433 |
4 | 2 | 109487 | 4.0 | 1436165496 |
Preprocessing
# extract the four-digit year in parentheses from the title into its own column
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
# remove the year from the title and strip leftover whitespace
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '')
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()
movieId | title | genres | year | |
---|---|---|---|---|
0 | 1 | Toy Story | Adventure|Animation|Children|Comedy|Fantasy | 1995 |
1 | 2 | Jumanji | Adventure|Children|Fantasy | 1995 |
2 | 3 | Grumpier Old Men | Comedy|Romance | 1995 |
3 | 4 | Waiting to Exhale | Comedy|Drama|Romance | 1995 |
4 | 5 | Father of the Bride Part II | Comedy | 1995 |
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()
movieId | title | genres | year | |
---|---|---|---|---|
0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 |
1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 |
2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 |
3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 |
4 | 5 | Father of the Bride Part II | [Comedy] | 1995 |
# one-hot encode the genres: add a column per genre, set to 1 where the movie has that genre
genres_df = movies_df.copy()
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        genres_df.at[index, genre] = 1
genres_df = genres_df.fillna(0)
genres_df.head()
movieId | title | genres | year | Adventure | Animation | Children | Comedy | Fantasy | Romance | ... | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 5 | Father of the Bride Part II | [Comedy] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 24 columns
ratings_df.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 169 | 2.5 | 1204927694 |
1 | 1 | 2471 | 3.0 | 1204927438 |
2 | 1 | 48516 | 5.0 | 1204927435 |
3 | 2 | 2571 | 3.5 | 1436165433 |
4 | 2 | 109487 | 4.0 | 1436165496 |
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()
userId | movieId | rating | |
---|---|---|---|
0 | 1 | 169 | 2.5 |
1 | 1 | 2471 | 3.0 |
2 | 1 | 48516 | 5.0 |
3 | 2 | 2571 | 3.5 |
4 | 2 | 109487 | 4.0 |
User Interests
user_movies = pd.DataFrame([
{'title':'Breakfast Club, The', 'rating':5},
{'title':'Toy Story', 'rating':3.5},
{'title':'Jumanji', 'rating':2},
{'title':"Pulp Fiction", 'rating':5},
{'title':'Akira', 'rating':4.5}
])
user_movies
rating | title | |
---|---|---|
0 | 5.0 | Breakfast Club, The |
1 | 3.5 | Toy Story |
2 | 2.0 | Jumanji |
3 | 5.0 | Pulp Fiction |
4 | 4.5 | Akira |
movie_ids = genres_df[genres_df['title'].isin(user_movies['title'].tolist())]
user_movies = pd.merge(movie_ids, user_movies)
user_genres = user_movies.drop('genres', axis=1).drop('year', axis=1)
user_genres
movieId | title | Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | ... | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.5 |
1 | 2 | Jumanji | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
2 | 296 | Pulp Fiction | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
3 | 1274 | Akira | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5 |
4 | 1968 | Breakfast Club, The | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
5 rows × 23 columns
Since we only need the genre columns, we drop the rest
user_genres.drop('title', axis=1, inplace=True)
user_genres.drop('movieId', axis=1, inplace=True)
user_genres.drop('rating', axis=1, inplace=True)
user_genres
Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | Crime | Thriller | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Next we take the dot product of the transposed genre table with the ratings column, which gives a rating-weighted genre profile for the user
user_profile = user_genres.transpose().dot(user_movies['rating'])
user_profile
Adventure 10.0
Animation 8.0
Children 5.5
Comedy 13.5
Fantasy 5.5
Romance 0.0
Drama 10.0
Action 4.5
Crime 5.0
Thriller 5.0
Horror 0.0
Mystery 0.0
Sci-Fi 4.5
IMAX 0.0
Documentary 0.0
War 0.0
Musical 0.0
Western 0.0
Film-Noir 0.0
(no genres listed) 0.0
dtype: float64
We can then score every movie against this profile, taking the weighted average of its genre indicators, and build a recommendation from the highest-scoring movies
all_genres = genres_df.set_index(genres_df['movieId'])
all_genres.head()
movieId | title | genres | year | Adventure | Animation | Children | Comedy | Fantasy | Romance | ... | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
movieId | |||||||||||||||||||||
1 | 1 | Toy Story | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2 | Jumanji | [Adventure, Children, Fantasy] | 1995 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 4 | Waiting to Exhale | [Comedy, Drama, Romance] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 5 | Father of the Bride Part II | [Comedy] | 1995 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 24 columns
all_genres.drop(['movieId','title','genres','year'], axis=1, inplace=True)
all_genres.head()
Adventure | Animation | Children | Comedy | Fantasy | Romance | Drama | Action | Crime | Thriller | Horror | Mystery | Sci-Fi | IMAX | Documentary | War | Musical | Western | Film-Noir | (no genres listed) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
movieId | ||||||||||||||||||||
1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
user_recommendation = all_genres.dot(user_profile)/user_profile.sum()
user_recommendation.head()
movieId
1 0.594406
2 0.293706
3 0.188811
4 0.328671
5 0.188811
dtype: float64
user_recommendation.sort_values(ascending=False, inplace=True)
user_recommendation.head(10)
movieId
5018 0.748252
26093 0.734266
27344 0.720280
148775 0.685315
6902 0.678322
117646 0.678322
64645 0.671329
81132 0.671329
122787 0.671329
2987 0.664336
dtype: float64
Top Recommendations for User
movies_df.loc[movies_df['movieId'].isin(user_recommendation.head().keys())]
movieId | title | genres | year | |
---|---|---|---|---|
4923 | 5018 | Motorama | [Adventure, Comedy, Crime, Drama, Fantasy, Mys... | 1991 |
6793 | 6902 | Interstate 60 | [Adventure, Comedy, Drama, Fantasy, Mystery, S... | 2002 |
8605 | 26093 | Wonderful World of the Brothers Grimm, The | [Adventure, Animation, Children, Comedy, Drama... | 1962 |
9296 | 27344 | Revolutionary Girl Utena: Adolescence of Utena... | [Action, Adventure, Animation, Comedy, Drama, ... | 1999 |
33509 | 148775 | Wizards of Waverly Place: The Movie | [Adventure, Children, Comedy, Drama, Fantasy, ... | 2009 |
Collaborative Filtering
Collaborative filtering recommends content based on the behaviour of similar users or items
There are two types
- User based
- Based on a neighborhood of users who are similar to the target user
- Item based
- Based on similarity between the items that users have rated
Lab
Note that this lab uses the same movie data as before and uses the Pearson Correlation Coefficient on the ratings table to identify users who rate movies similarly; it can be found in 5-2-Collaborative-Filtering. A minimal sketch of that similarity measure follows
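As a minimal sketch of the Pearson similarity between two users (the toy ratings and the helper pearson_similarity below are made up for illustration; the actual lab works on the full ratings_df):
import numpy as np
import pandas as pd

# toy ratings, values made up for illustration
ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 2],
    'movieId': [1, 2, 296, 1, 2, 296, 1274],
    'rating':  [5.0, 3.0, 4.5, 4.0, 2.5, 5.0, 3.0],
})

def pearson_similarity(ratings, user_a, user_b):
    # Pearson correlation of two users' ratings over the movies both have rated
    a = ratings[ratings['userId'] == user_a].set_index('movieId')['rating']
    b = ratings[ratings['userId'] == user_b].set_index('movieId')['rating']
    common = a.index.intersection(b.index)
    if len(common) < 2:
        return 0.0  # not enough overlap to compute a correlation
    return np.corrcoef(a.loc[common], b.loc[common])[0, 1]

print(pearson_similarity(ratings, 1, 2))  # ~0.8 for this toy data
A correlation close to 1 means the two users rate movies in a similar pattern, while a value close to -1 means they tend to disagree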