Logistic Regression

Linear regression helps us find the correlation for a continuous target, while logistic regression addresses classification. In classification the boundary is what matters; an intuitive approach is to make the regression saturate quickly away from the boundary, as the logistic function below does:

The basic idea of logistic regression is that the hypothesis starts from a linear approximation, which is then mapped through the logistic function for binary prediction, thus:

$$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}}$$
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-6, 6, 200)
Y = 1 / (1 + np.exp(-X))

plt.figure()
plt.plot(X, Y)
plt.grid(True)
plt.show()

[Figure: the logistic (sigmoid) function]

The cost function is merely a quantitative measure of how far the hypothesis is from the observations. The logistic cost function is defined as:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_{\theta}(x), y)$$

$$Cost(h_{\theta}(x), y) = \left\{ \begin{array}{ll} -\log(h_{\theta}(x)) & \quad \text{if } y = 1 \\ -\log(1 - h_{\theta}(x)) & \quad \text{if } y = 0 \end{array} \right.$$

Thus:

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y \log(h_{\theta}(x)) + (1 - y) \log(1 - h_{\theta}(x)) \right]$$

Taking the derivative with respect to $\theta$:

$$\begin{aligned} \frac{\partial}{\partial \theta}J(\theta) & = -\frac{1}{m} \left[ \sum_{i=1}^{m} y \frac{h'_{\theta}(x)}{h_{\theta}(x)} + (1 - y) \frac{-h'_{\theta}(x)}{1 - h_{\theta}(x)} \right] \\ & = -\frac{1}{m} \left[ \sum_{i=1}^{m} \frac{y(1 - h_{\theta}(x))\, h'_{\theta}(x) + (y - 1) h_{\theta}(x) h'_{\theta}(x)}{h_{\theta}(x)(1 - h_{\theta}(x))} \right] \\ & = -\frac{1}{m} \left[ \sum_{i=1}^{m} \frac{y h'_{\theta}(x) - y h_{\theta}(x) h'_{\theta}(x) + y h_{\theta}(x) h'_{\theta}(x) - h_{\theta}(x) h'_{\theta}(x)}{h_{\theta}(x)(1 - h_{\theta}(x))} \right] \\ & = -\frac{1}{m} \left[ \sum_{i=1}^{m} \frac{(y - h_{\theta}(x)) h'_{\theta}(x)}{h_{\theta}(x)(1 - h_{\theta}(x))} \right] \end{aligned}$$

And we know

$$\begin{aligned} \frac{\partial}{\partial \theta}h_{\theta}(x) & = -h^{2}_{\theta}(x)\, e^{-\theta^{T}x}\, (-x) \\ & = h^{2}_{\theta}(x) \left( \frac{1}{h_{\theta}(x)} - 1 \right) x \\ & = h^{2}_{\theta}(x)\, \frac{1 - h_{\theta}(x)}{h_{\theta}(x)}\, x \\ & = h_{\theta}(x)(1 - h_{\theta}(x))\, x \end{aligned}$$

Thus

$$\begin{aligned} \frac{\partial}{\partial \theta}J(\theta) & = -\frac{1}{m} \left[ \sum_{i=1}^{m} \frac{(y - h_{\theta}(x))\, h_{\theta}(x)(1 - h_{\theta}(x))\, x}{h_{\theta}(x)(1 - h_{\theta}(x))} \right] \\ & = \frac{1}{m} \left[ \sum_{i=1}^{m} (h_{\theta}(x) - y)\, x \right] \end{aligned}$$
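This is the same gradient form as in linear regression. As a minimal sketch (we will rely on scikit-learn below rather than this hand-rolled version), batch gradient descent on this cost could look like the following, assuming X already carries a leading column of ones for the intercept:

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -1/m * sum(y*log(h) + (1-y)*log(1-h))
    h = sigmoid(np.dot(X, theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha=0.001, iterations=10000):
    # batch update: theta <- theta - alpha * 1/m * X^T (h - y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = sigmoid(np.dot(X, theta))
        theta -= alpha * np.dot(X.T, h - y) / len(y)
    return theta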

Let’s use exercise #2 as an example. First, visualize the data:

def render_exams(data, admitted, rejected):
    plt.figure(figsize=(6, 6))

    plt.scatter(np.extract(admitted, data['ex1']), 
                np.extract(admitted, data['ex2']), 
                c='b', marker='+', label='admitted')
    plt.scatter(np.extract(rejected, data['ex1']), 
                np.extract(rejected, data['ex2']), 
                c='y', marker='o', label='rejected')
    plt.xlabel('Exam 1 score');
    plt.ylabel('Exam 2 score');
    plt.gca().set_aspect('equal', 'datalim')
   
data = np.loadtxt('ex2data1.txt', delimiter=',', dtype={
    'names': ('ex1', 'ex2', 'score'), 
    'formats': ('f4', 'f4', 'i4')},
)
admitted = data['score'] == 1
rejected = data['score'] == 0
render_exams(data, admitted, rejected)
plt.legend();

[Figure: exam scores, admitted vs. rejected]

We will not reinvent the wheel: we will leverage scikit-learn’s OneVsRestClassifier and LogisticRegression.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X = np.column_stack((data['ex1'], data['ex2']))
Y = data['score']

classifier = OneVsRestClassifier(LogisticRegression(penalty='l1')).fit(X, Y)
print 'Coefficients: ', classifier.coef_
print 'Intercept: ', classifier.intercept_

coef = classifier.coef_
intercept = classifier.intercept_

# see the contour approach below for a more general solution
ex1 = np.linspace(30, 100, 100)
ex2 = -(coef[:, 0] * ex1 + intercept[:, 0]) / coef[:,1]

render_exams(data, admitted, rejected)
plt.plot(ex1, ex2, color='r', label='decision boundary');
plt.legend();
Coefficients:  [[ 0.09788619  0.09175555]]
Intercept:  [[-11.57316814]]

[Figure: exam scores with the linear decision boundary]
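The plotted line is just the boundary equation solved for the second exam score: the boundary is where $\theta^{T}x = 0$, i.e. where $h_{\theta}(x) = 0.5$:

$$\theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} = 0 \quad \Longrightarrow \quad x_{2} = -\frac{\theta_{1} x_{1} + \theta_{0}}{\theta_{2}}$$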

Predict the admission using the model:

print classifier.score(X, Y)
theta = np.concatenate((intercept[0], coef[0]), axis=0)
prob = 1 / (1 + np.exp(-1 * np.dot(theta, [1, 45, 85])))
print "For a student with scores 45 and 85, we predict an admission probability of %f" % prob
0.91
For a student with scores 45 and 85, we predict an admission probability of 0.652701
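As a cross-check (assuming the fitted classifier from above), its own predict_proba should report approximately the same probability; the second column is the probability of class 1, i.e. admission:

# probability of (rejected, admitted) for exam scores 45 and 85
print classifier.predict_proba([[45, 85]])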

Regularization

The idea behind regularization: a plain linear fit does not work very well (high bias, or underfitting); we might want some ingredients from higher-order polynomials for a better fit, but we by no means expect the fit to go the extra mile and pass through every single point (high variance, or overfitting). There are two commonly used regularization terms to cap the impact of the higher-order polynomial features: the l1 (Lasso) penalty and the l2 (Ridge) penalty.
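For instance, with the l2 penalty the logistic cost function from above picks up an extra term that shrinks the weights (by convention the intercept $\theta_{0}$ is not penalized):

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y \log(h_{\theta}(x)) + (1 - y) \log(1 - h_{\theta}(x)) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^{2}$$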

See more on regularization of linear models on Wikipedia.

Luckily, LogisticRegression in scikit-learn supports both l1 and l2 regularization through the penalty parameter. More concretely, the l1 penalty corresponds to the Lasso, and the l2 penalty to Ridge regression.

Here is a more sophisticated example: the acceptance test for microchips. We are going to use it to demonstrate the impact of regularization, both high variance and high bias.

def render_tests(data, accepted, rejected):
    plt.figure(figsize=(6, 6))
    plt.scatter(np.extract(accepted, data['test1']), 
            np.extract(accepted, data['test2']), 
            c='b', marker='+', label='accepted')
    plt.scatter(np.extract(rejected, data['test1']), 
            np.extract(rejected, data['test2']), 
            c='y', marker='o', label='rejected')
    plt.xlabel('Microchip Test 1');
    plt.ylabel('Microchip Test 2');
    plt.gca().set_aspect('equal', 'datalim')
    
data = np.loadtxt('ex2data2.txt', delimiter=',', dtype={
    'names': ('test1', 'test2', 'score'), 
    'formats': ('f4', 'f4', 'i4')},
)

accepted = data['score'] == 1
rejected = data['score'] == 0

render_tests(data, accepted, rejected)
plt.legend();

[Figure: microchip test scores, accepted vs. rejected]

It is clear that we need higher-order features for the input. First, define map_features to map the two features to their higher-order polynomial terms:

def map_features(f1, f2, order=1):
    '''map f1 and f2 to their higher-order polynomial terms'''
    assert order >= 1
    def poly_terms():
        # all terms f1^(i-j) * f2^j for 1 <= i <= order, 0 <= j <= i
        for i in range(1, order + 1):
            for j in range(i + 1):
                yield np.power(f1, i - j) * np.power(f2, j)
    return np.vstack(list(poly_terms()))
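As a quick sanity check (with made-up inputs), an order-6 mapping yields $\sum_{i=1}^{6}(i+1) = 27$ terms per sample, which matches the 27 coefficients reported below:

f1 = np.array([0.5, -0.2])   # hypothetical test1 values
f2 = np.array([0.1, 0.3])    # hypothetical test2 values
print map_features(f1, f2, order=6).shape   # (27, 2): 27 features, 2 samples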

We will use up to 6th-order polynomial features for the LogisticRegression with $\lambda = 1$, i.e. the inverse regularization strength $C = 1/\lambda = 1$:

out = map_features(data['test1'], data['test2'], order=6)
X = out.transpose()
Y = data['score']

classifier = OneVsRestClassifier(LogisticRegression(penalty='l1', C=1)).fit(X, Y)
print 'Coefficients: ', classifier.coef_
print 'Intercept: ', classifier.intercept_
print 'Accuracy: ', classifier.score(X, Y)
Coefficients:  [[ 0.68658946  1.28037888 -4.86238904 -1.62177411 -2.34202973  0.          0.
   0.          0.          0.          0.          0.          0.
  -2.36760082  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.        ]]
Intercept:  [[ 1.86957363]]
Accuracy:  0.796610169492

Visualizing the decision boundary is a little bit tricky: since $h_{\theta}(x) = 0.5$ exactly when $\theta^{T}x = 0$, we draw the level-0 contour of $\theta^{T}x$ to get the decision boundary:

def draw_boundary(classifier):
    dim = np.linspace(-1, 1.5, 1000)
    dx, dy = np.meshgrid(dim, dim)
    # evaluate theta^T x on the grid of mapped features
    v = map_features(dx.flatten(), dy.flatten(), order=6)
    z = (np.dot(classifier.coef_, v) + classifier.intercept_).reshape(1000, 1000)
    # the decision boundary is the zero level set
    plt.contour(dx, dy, z, levels=[0], colors=['r'])

render_tests(data, accepted, rejected)
draw_boundary(classifier)
plt.legend();

[Figure: decision boundary with C = 1]

Here is an example of overfitting: $C = 1000$, $\lambda \approx 0$:

overfitter = OneVsRestClassifier(LogisticRegression(penalty='l1', C=1000)).fit(X, Y)
render_tests(data, accepted, rejected)
draw_boundary(overfitter)
plt.legend();

[Figure: overfitted decision boundary with C = 1000]

Here is an example of underfitting: $C = 0.01$, $\lambda = 100$:

underfitter = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X, Y)
render_tests(data, accepted, rejected)
draw_boundary(underfitter)
plt.legend();

[Figure: underfitted decision boundary with C = 0.01]

Logistic regression is just one of the approaches to map a continuous linear space to a classification; the Support Vector Machine is probably more popular in the field.

Multi-class Classification

We can use OneVsOneClassifier for multi-class classification; it trains one binary classifier for every pair of classes and lets them vote on the final label:

import scipy.io
data = scipy.io.loadmat('ex3data1.mat');

# pick 100 random handwritten digits
import random
indexes = random.sample(range(0, 5000), 100)

Render the selected handwritten digits:

figure = plt.figure(figsize=(10, 10))
for index, i in enumerate(indexes):
    plt.subplot(10, 10, index + 1)
    plt.axis('off')
    plt.imshow(data['X'][i].reshape(20, 20).transpose(), cmap='Greys')

[Figure: 100 randomly selected handwritten digits]

from sklearn.multiclass import OneVsOneClassifier
clf = OneVsOneClassifier(LogisticRegression(penalty='l2', C=0.01)
                         ).fit(data['X'], data['y'].ravel())
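With 10 digit classes, one-vs-one fits $10 \times 9 / 2 = 45$ pairwise binary classifiers under the hood; a quick way to confirm this with the fitted clf is:

print len(clf.estimators_)   # 45 pairwise binary classifiers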

Let’s print out the predicted digits:

predicted = clf.predict(data['X'][indexes])
# the dataset encodes the digit 0 as label 10, so map it back for display
for l, u in zip(range(0, 91, 10), range(10, 101, 10)):
    print map(lambda x: 0 if x == 10 else x, predicted[l:u])
    
print "Predict score for the sample input: %f" % \
    clf.score(data['X'][indexes], data['y'][indexes])
[7, 8, 2, 4, 1, 8, 5, 2, 0, 8]
[8, 1, 7, 6, 7, 8, 8, 7, 7, 3]
[8, 3, 4, 8, 3, 6, 1, 6, 9, 6]
[1, 4, 1, 5, 7, 2, 5, 1, 9, 0]
[0, 7, 8, 7, 2, 5, 0, 9, 7, 8]
[1, 4, 8, 3, 2, 0, 3, 9, 3, 1]
[5, 8, 6, 8, 6, 8, 4, 1, 3, 3]
[6, 0, 6, 1, 8, 4, 0, 0, 1, 3]
[8, 2, 4, 1, 3, 8, 9, 5, 5, 0]
[5, 7, 2, 9, 5, 9, 6, 7, 9, 0]
Prediction accuracy for the sample input: 0.880000