Machine Learning Application Checklist
In this section, we will discuss the ML workflow.
StandardScaler vs Normalizer
StandardScaler transforms each feature to zero mean and unit variance, while Normalizer rescales each sample to a unit-norm vector:
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# StandardScaler: zero mean and unit variance per feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print 'X_scaled.mean = %f, std = %f' % (np.mean(X_scaled), np.std(X_scaled))

# Normalizer: unit L2 norm per sample
normalizer = Normalizer()
X_norm = normalizer.fit_transform(X)
print 'X_norm norm = ', np.sum(np.power(X_norm, 2), axis=1)
X_scaled.mean = 0.000000, std = 1.000000
X_norm norm =  [ 1.  1.  1.]
High bias vs High variance
Data visualization is our best friend for finding the “just right” fit. Unfortunately, visualization is not feasible for a dataset with many features.
A more generic approach is to split the samples into a “training set” and a “testing set”, using the latter to measure the accuracy of the hypothesis. Some hypotheses are parameterized, such as the regularization parameter for a support vector machine or the polynomial degree for linear regression. We then need yet another dataset, the “cross validation set”, to try the combinations of these parameters before testing.
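For instance, here is a rough sketch of such a split with scikit-learn's train_test_split (found in sklearn.model_selection on recent releases, sklearn.cross_validation on older ones; the 60/20/20 ratio and the toy arrays below are only placeholders):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # placeholder features
y = np.arange(10)                  # placeholder targets

# hold out 40% of the samples, then split that portion half-and-half
# into cross-validation and test sets (60% / 20% / 20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)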
Using ex5 (ex5data1.mat from the Coursera Machine Learning exercises) as an example:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = scipy.io.loadmat('ex5data1.mat')

# fit a straight line to the training data
clf = LinearRegression()
clf.fit(data['X'], data['y'])
print 'intercept: ', clf.intercept_
print 'Coefficients: ', clf.coef_

# plot the fitted line against the samples
xx = np.linspace(-60, 60, 100)[:, np.newaxis]
yy = clf.predict(xx)
plt.scatter(data['X'], data['y'])
plt.plot(xx, yy, 'r--');
intercept:  [ 13.08790351]
Coefficients:  [[ 0.36777923]]
It is more interesting to inspect how the error evolves as more training samples are processed, alongside the error on the cross-validation samples:
def converged_error(clf, X, y, Xval, yval):
    # learning curve: train on the first i samples, then track the error
    # (taken as 100 * (1 - score), i.e. unexplained variance in percent)
    # on the training subset and on the cross-validation set
    iterations = range(2, len(X))
    train_error = []
    val_error = []
    for i in iterations:
        clf.fit(X[:i], y[:i])
        train_error.append(100 * (1 - clf.score(X[:i], y[:i])))
        val_error.append(100 * (1 - clf.score(Xval, yval)))
    plt.plot(iterations, train_error, label="Train")
    plt.plot(iterations, val_error, label="Cross Validation")
    plt.ylabel('Error (%)')
    plt.xlabel('Number of training examples')
    plt.xlim([0, 12])
    plt.legend()

fig = plt.figure(figsize=(8, 6))
converged_error(LinearRegression(), data['X'], data['y'], data['Xval'], data['yval'])
It is quite clear we are underfitting: the errors on both the training set and the cross-validation set converge to the same, relatively high, constant value.
# feature scaling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# mapping the single feature to polynomial features up to degree 8
XX = np.hstack([np.power(data['X'], i) for i in range(1, 8 + 1)])
xx = np.linspace(-60, 60, 100)[:, np.newaxis]
xx_poly = np.hstack([np.power(xx, i) for i in range(1, 8 + 1)])

# normalize the polynomial features and apply the same scaling to the plot grid
scaler = StandardScaler()
X_norm = scaler.fit_transform(XX)
xx_norm = scaler.transform(xx_poly)

def fit_and_plot(clf, X_norm, y, style, label):
    # fit on the normalized features and draw the prediction curve
    clf.fit(X_norm, y)
    z = clf.predict(xx_norm)
    plt.plot(xx, z, style, label=label)

plt.scatter(data['X'], data['y'])
plt.xlim([-60, 50])
plt.ylim([-10, 60]);
fit_and_plot(LinearRegression(), X_norm, data['y'], 'r--', label='lambda=0')
fit_and_plot(Ridge(alpha=1.0), X_norm, data['y'], 'g-', label='lambda=1')
fit_and_plot(Ridge(alpha=100.0), X_norm, data['y'], 'c-', label='lambda=100')
plt.legend(loc='upper left');
# apply the same polynomial mapping and scaling to the cross-validation set
Xval_poly = np.hstack([np.power(data['Xval'], i) for i in range(1, 8 + 1)])
Xval_norm = scaler.transform(Xval_poly)
fig = plt.figure(figsize=(24, 6))
plt.subplot(131)
plt.title('lambda = 0')
converged_error(LinearRegression(), X_norm, data['y'], Xval_norm, data['yval'])
plt.subplot(132)
plt.title('lambda = 1')
converged_error(Ridge(alpha=1.0), X_norm, data['y'], Xval_norm, data['yval'])
plt.subplot(133)
plt.title('lambda = 100')
converged_error(Ridge(alpha=100.0), X_norm, data['y'], Xval_norm, data['yval'])
It is clear we are overfitting when lambda = 0: the cross-validation error does not converge and is much bigger than the training error; on the other hand, lambda = 100 is underfitting, and lambda = 1 is about right.
Here is how the training and cross-validation errors vary with lambda:
lambdas = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
val_errors = []
train_errors = []
for l in lambdas:
    # retrain with each regularization strength and record both errors
    clf = Ridge(alpha=l)
    clf.fit(X_norm, data['y'])
    train_errors.append(100 * (1 - clf.score(X_norm, data['y'])))
    val_errors.append(100 * (1 - clf.score(Xval_norm, data['yval'])))
plt.plot(lambdas, train_errors, label='Train')
plt.plot(lambdas, val_errors, label='Cross Validation')
plt.xlabel('lambda')
plt.ylabel('Error (%)')
plt.ylim([0, 20])
plt.legend(loc='upper left');
sklearn.linear_model.RidgeCV has built-in support for cross-validation, which can be quite convenient.
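For instance, a minimal sketch that searches the same grid of regularization strengths as above and exposes the winner as alpha_:
from sklearn.linear_model import RidgeCV

# pick the best regularization strength by built-in cross-validation
clf = RidgeCV(alphas=[0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10])
clf.fit(X_norm, data['y'])
print 'best alpha: ', clf.alpha_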
Methodology of Cross Validation
TBD
Precision vs Recall
Accuracy is less informative if the classes are highly skewed; a more useful metric is the F1 score, which is built from precision and recall.
Using the table from Wikipedia (the condition is as determined by the “Gold standard”):

|                       | Condition positive | Condition negative |  |
|-----------------------|--------------------|--------------------|--|
| Test outcome positive | True positive      | False positive (Type I error) | Precision = $\Sigma$ True positive / $\Sigma$ Test outcome positive |
| Test outcome negative | False negative (Type II error) | True negative | Negative predictive value = $\Sigma$ True negative / $\Sigma$ Test outcome negative |
|                       | Sensitivity = $\Sigma$ True positive / $\Sigma$ Condition positive | Specificity = $\Sigma$ True negative / $\Sigma$ Condition negative | Accuracy = ($\Sigma$ True positive + $\Sigma$ True negative) / $\Sigma$ Total population |
Sensitivity is also known as Recall.
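The F1 score is the harmonic mean of the two: $F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. A minimal sketch with sklearn.metrics, using placeholder labels for a skewed binary problem:
from sklearn.metrics import precision_score, recall_score, f1_score

# a highly skewed toy example: mostly negatives, a few positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print 'precision: ', precision_score(y_true, y_pred)
print 'recall: ', recall_score(y_true, y_pred)
print 'F1: ', f1_score(y_true, y_pred)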
Principal component analysis
Principal component analysis, herein PCA, is backed by eigenvalue and singular value decomposition. An analogy is the Fourier transform of a time-series waveform: the conversion is lossless, but you can simplify the problem by discarding the high-frequency components without losing too much fidelity.
Using ex7 as a concrete example:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = scipy.io.loadmat('ex7data1.mat')

# feature normalization
scaler = StandardScaler()
X = scaler.fit_transform(data['X'])

# keep only the first principal component
pca = PCA(n_components=1).fit(X)
X_pca = pca.transform(X)
print 'PCA eigen vector: %s' % pca.components_
print 'PCA variance ratio: %s' % pca.explained_variance_ratio_

plt.scatter(data['X'][:, 0], data['X'][:, 1], alpha=0.5);
# reconstruct from the single component and map back to the original scale
X_proj = scaler.inverse_transform(np.dot(X_pca, pca.components_))
plt.scatter(X_proj[:, 0], X_proj[:, 1], marker='x', c='r');
PCA eigen vector: [[ 0.70710678  0.70710678]]
PCA variance ratio: [ 0.86776519]
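The “discard without losing too much fidelity” idea can be made explicit by inspecting the cumulative explained variance; with a recent scikit-learn, a float n_components keeps just enough components to retain that fraction of the variance (the 95% threshold below is an arbitrary choice):
# fit a full PCA and inspect how much variance each component explains
pca_full = PCA().fit(X)
print 'cumulative variance ratio: %s' % np.cumsum(pca_full.explained_variance_ratio_)

# with a recent scikit-learn, a float n_components keeps just enough
# components to retain that fraction of the variance
pca_95 = PCA(n_components=0.95).fit(X)
print 'components kept: %d' % pca_95.n_components_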