Machine Learning Application Checklist

In this section, we will discuss the ML workflow.

StandardScaler vs Normalizer

StandardScaler transforms each feature (column) to zero mean and unit variance, while Normalizer rescales each sample (row) to a unit vector:

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array(
    [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print('X_scaled.mean = %f, std = %f' % (np.mean(X_scaled), np.std(X_scaled)))

normalizer = Normalizer()
X_norm = normalizer.fit_transform(X)
print('X_norm norm = ', np.sum(np.power(X_norm, 2), axis=1))

X_scaled.mean = 0.000000, std = 1.000000
X_norm norm =  [ 1.  1.  1.]
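
The difference between the two is the axis they work along. The following is a minimal NumPy sketch of the arithmetic, assuming the default settings (population standard deviation for StandardScaler, L2 norm for Normalizer); it is an illustration, not the library's implementation.

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# StandardScaler-style: centre and scale each column (feature).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer-style: divide each row (sample) by its own L2 norm.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

print('per-feature mean:', X_scaled.mean(axis=0))
print('per-feature std: ', X_scaled.std(axis=0))
print('per-sample norm: ', np.linalg.norm(X_norm, axis=1))
```

Checking the mean and standard deviation per column, rather than over the whole array, confirms that every feature is standardized individually.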

High bias vs High variance

Data visualization is our best friend for finding the “just right” fit. Unfortunately, it is not feasible for datasets with many features.

A more general approach is to split the samples into a “training set” and a “testing set”, using the latter to measure the accuracy of the hypothesis. Some hypotheses have hyperparameters, such as $\gamma$ for a support vector machine or the degree for polynomial regression. To tune them we need a third dataset, the “cross-validation set”, on which to try combinations of parameters, so that the testing set is only used for the final evaluation.
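
A three-way split can be sketched with NumPy alone. The helper name `train_val_test_split`, the 60/20/20 proportions, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def train_val_test_split(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the sample indices once, then cut them into three disjoint parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    n_val = int(round(val_fraction * len(X)))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 10 samples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

The parameters are tuned on the cross-validation part, and the testing part is touched only once at the end.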

Using ex5 as an example:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

import scipy.io
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression


data = scipy.io.loadmat('ex5data1.mat')
clf = LinearRegression()
clf.fit(data['X'], data['y'])
        
print('intercept: ', clf.intercept_)
print('Coefficients: ', clf.coef_)

xx = np.linspace(-60, 60, 100)[:, np.newaxis]
yy = clf.predict(xx)

plt.scatter(data['X'], data['y'])
plt.plot(xx, yy, 'r--');

intercept:  [ 13.08790351]
Coefficients:  [[ 0.36777923]]

It is more informative to examine how the error changes as more training samples are used, comparing the training error with the cross-validation error (a learning curve).

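def converged_error(clf, X, y, Xval, yval):
    """Plot the learning curve: error vs. number of training examples."""
    # Start at 2 samples (R^2 is not well defined for one) and include the full set.
    iterations = range(2, len(X) + 1)
    train_error = []
    val_error = []
    for i in iterations:
        # Refit on the first i samples only.
        clf.fit(X[:i], y[:i])
        # score() returns R^2, so 1 - R^2 is the unexplained fraction of variance.
        train_error.append(100 * (1 - clf.score(X[:i], y[:i])))
        val_error.append(100 * (1 - clf.score(Xval, yval)))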

    plt.plot(iterations, train_error, label="Train")
    plt.plot(iterations, val_error, label="Cross Validation")
    plt.ylabel('Error (%)')
    plt.xlabel('Number of training examples')
    plt.xlim([0, 12])
    plt.legend()

fig = plt.figure(figsize=(8, 6))
converged_error(LinearRegression(), data['X'], data['y'], data['Xval'], data['yval'])

We are clearly underfitting (high bias): the training error and the cross-validation error both converge to a similar, high value, so adding more samples would not help.

# feature scaling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

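# Map the single feature to the polynomial terms x, x^2, ..., x^8.
# np.hstack needs a sequence, so build lists rather than generators.
XX = np.hstack([np.power(data['X'], i) for i in range(1, 8 + 1)])
xx = np.linspace(-60, 60, 100)[:, np.newaxis]
yy = np.hstack([np.power(xx, i) for i in range(1, 8 + 1)])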

scaler = StandardScaler()
X_norm = scaler.fit_transform(XX)
x_norm = scaler.transform(yy)

def fit_transform(clf, X_norm, y, style, label):
    clf.fit(X_norm, y)
    z = clf.predict(x_norm)
    plt.plot(xx, z, style, label=label)

    
plt.scatter(data['X'], data['y'])
plt.xlim([-60, 50])
plt.ylim([-10, 60]);
    
fit_transform(LinearRegression(), X_norm, data['y'], 'r--', label='lambda=0')
fit_transform(Ridge(alpha=1.0), X_norm, data['y'], 'g-', label='lambda=1')
fit_transform(Ridge(alpha=100.0), X_norm, data['y'], 'c-', label='lambda=100')
plt.legend(loc='upper left');

Xval = np.hstack([np.power(data['Xval'], i) for i in range(1, 9)])
Xval_norm = scaler.transform(Xval)
fig = plt.figure(figsize=(24, 6))

plt.subplot(131)
plt.title('lambda = 0')
converged_error(LinearRegression(), X_norm, data['y'], Xval_norm, data['yval'])

plt.subplot(132)
plt.title('lambda = 1')
converged_error(Ridge(alpha=1.0), X_norm, data['y'], Xval_norm, data['yval'])

plt.subplot(133)
plt.title('lambda = 100')
converged_error(Ridge(alpha=100.0), X_norm, data['y'], Xval_norm, data['yval'])

We are clearly overfitting when lambda = 0: the cross-validation error does not converge and stays much larger than the training error. Conversely, lambda = 100 underfits, and lambda = 1 is about right.
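
The role of lambda can also be explored without the exercise data. The sketch below is a NumPy-only illustration on synthetic data; the helper names (`poly_features`, `ridge_fit`, `mse`), the sine target, and the noise level are all assumptions. It fits a degree-8 polynomial with ridge regularization by solving an augmented least-squares problem and reports the mean squared error on the training and validation samples.

```python
import numpy as np

def poly_features(x, degree):
    """Stack x, x^2, ..., x^degree as columns."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

def ridge_fit(X, y, lam):
    """Ridge regression via least squares on an augmented system.

    X is assumed to be column-centred, so the unpenalized intercept is mean(y).
    """
    y_mean = y.mean()
    n_features = X.shape[1]
    A = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    b = np.concatenate([y - y_mean, np.zeros(n_features)])
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    return w, y_mean

def mse(X, y, w, b):
    return float(np.mean((X @ w + b - y) ** 2))

def target(x):
    return np.sin(3 * x).ravel()

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(12, 1))
x_val = rng.uniform(-1, 1, size=(20, 1))
y_train = target(x_train) + 0.2 * rng.normal(size=12)
y_val = target(x_val) + 0.2 * rng.normal(size=20)

# Standardize with statistics from the training set only.
P_train = poly_features(x_train, 8)
mu, sigma = P_train.mean(axis=0), P_train.std(axis=0)
Z_train = (P_train - mu) / sigma
Z_val = (poly_features(x_val, 8) - mu) / sigma

results = {}
for lam in [0.0, 1.0, 100.0]:
    w, b = ridge_fit(Z_train, y_train, lam)
    results[lam] = (mse(Z_train, y_train, w, b), mse(Z_val, y_val, w, b))
    print('lambda=%g  train MSE=%.4f  validation MSE=%.4f' % ((lam,) + results[lam]))
```

The training error can only grow as lambda increases, because the penalty pulls the fit away from the least-squares optimum. The validation error is what tells us which lambda generalizes best.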

Here is how the training and cross-validation errors vary with lambda:

lambdas = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
val_errors = []
train_errors = []
for l in lambdas:
    clf = Ridge(alpha=l)
    clf.fit(X_norm, data['y'])
    train_errors.append(100 * (1- clf.score(X_norm, data['y'])))
    val_errors.append(100 * (1- clf.score(Xval_norm, data['yval'])))


plt.plot(lambdas, train_errors, label='Train')
plt.plot(lambdas, val_errors, label='Cross Validation')

plt.xlabel('lambdas')
plt.ylabel('Error');
plt.ylim([0, 20])
plt.legend(loc='upper left');

sklearn.linear_model.RidgeCV has built-in support for cross-validation over a list of candidate alpha values, which can be quite convenient.
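
A minimal usage sketch follows. The synthetic data and the candidate `alphas` are assumptions chosen for illustration; on the exercise data you would pass the scaled polynomial features instead.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (assumption, not the exercise data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + 0.1 * rng.normal(size=50)

# RidgeCV tries every candidate and keeps the one with the best CV score.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
model = RidgeCV(alphas=alphas).fit(X, y)

print('selected alpha:', model.alpha_)
print('coefficients:', model.coef_)
```

The chosen regularization strength is stored in `alpha_` after fitting, so no manual loop over lambdas is needed.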

Methodology of Cross Validation

TBD

Precision vs Recall

Accuracy is less informative when the classes are highly skewed. A more useful metric in that case is the F1 score, the harmonic mean of precision $P$ and recall $R$:

$$F_1 = 2 \frac{PR}{P + R}$$

Using the table from Wikipedia (columns give the condition as determined by the "Gold standard", rows give the test outcome):

| | Condition positive | Condition negative | |
|---|---|---|---|
| Test outcome positive | True positive | False positive (Type I error) | Precision = $\Sigma$ True positive / $\Sigma$ Test outcome positive |
| Test outcome negative | False negative (Type II error) | True negative | Negative predictive value = $\Sigma$ True negative / $\Sigma$ Test outcome negative |
| | Sensitivity = $\Sigma$ True positive / $\Sigma$ Condition positive | Specificity = $\Sigma$ True negative / $\Sigma$ Condition negative | Accuracy |

Sensitivity is also known as Recall.
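
As a small worked example, the definitions above can be applied to a skewed confusion matrix. The counts are hypothetical and the helper name `precision_recall_f1` is an assumption; the code expects Python 3 division.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical skewed problem: 12 positives among 100 samples.
tp, fp, fn, tn = 8, 2, 4, 86

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print('accuracy  = %.3f' % accuracy)
print('precision = %.3f' % precision)
print('recall    = %.3f' % recall)
print('F1        = %.3f' % f1)
```

The accuracy looks high because the negatives dominate, while the F1 score reflects that a third of the positives were missed.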

Principal component analysis

Principal component analysis (PCA) is built on eigenvalue decomposition and singular value decomposition (SVD). A useful analogy is the Fourier transform of a time-series waveform: the transform itself is lossless, but you can simplify the problem by discarding the high-frequency components without losing too much fidelity.
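
The link between PCA, the eigenvalues of the covariance matrix, and the SVD can be checked numerically. This is a NumPy sketch on synthetic two-dimensional data (an assumption), not a description of how scikit-learn implements PCA internally.

```python
import numpy as np

# Synthetic, strongly correlated 2-D data (assumption).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])
Xc = X - X.mean(axis=0)          # PCA works on centred data

# SVD route: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)

# Eigenvalue route: same variances from the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order

# Keeping every component is lossless; keeping one discards some variance.
X_full = Xc @ Vt.T @ Vt
X_one = Xc @ Vt[:1].T @ Vt[:1]
rel_error = np.sum((Xc - X_one) ** 2) / np.sum(Xc ** 2)

print('explained variance ratio:', explained_ratio)
print('relative error with one component:', rel_error)
```

The relative reconstruction error with one component equals the share of variance carried by the discarded component, which is the sense in which little fidelity is lost.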

Using ex7 as a concrete example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = scipy.io.loadmat('ex7data1.mat')

# feature normalization
scaler = StandardScaler()
X = scaler.fit_transform(data['X'])
pca = PCA(n_components=1).fit(X)
X_pca = pca.transform(X)
print('PCA eigen vector: %s' % pca.components_)
print('PCA variance ratio: %s' % pca.explained_variance_ratio_)


plt.scatter(data['X'][:, 0], data['X'][:, 1], alpha=0.5);
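# Project back onto the principal axis, then undo the standardization
# (both scale and mean) to return to the original units.
X_proj = scaler.inverse_transform(pca.inverse_transform(X_pca))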
plt.scatter(X_proj[:, 0], X_proj[:, 1], marker='x', c='r');

PCA eigen vector: [[ 0.70710678  0.70710678]]
PCA variance ratio: [ 0.86776519]