Linear Regression

The idea behind linear regression is least squares: minimize the sum of the squared errors between the hypothesis and the observed results, a.k.a. the cost function $J$.

Hypothesis: $h_{\theta}(x) = \theta_{0} + \theta_{1}x = \theta^{T}x$ (with the convention $x_{0} = 1$, so $x = [x_{0}, x_{1}]^{T}$)
Cost function: $J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$
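
The same cost has an equivalent matrix form, which is handy for the vectorized code later on. Writing $X$ for the $m \times 2$ design matrix whose first column is all ones (an augmented matrix introduced here for exposition, not the raw 1-D array named X in the code below) and $y$ for the vector of targets:

$J(\theta) = \frac{1}{2m} (X\theta - y)^{T}(X\theta - y)$
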
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Let's use the sample data from exercise #1.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt(open("ex1data1.txt"), delimiter=",")
X = data[:, 0]  # population of a city, in 10,000s
Y = data[:, 1]  # profit in that city, in $10,000s

plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.show()

[Figure: scatter plot of the training data]

Here is the scalar cost function, computeCost.

def computeCost(theta_0, theta_1):
    # J(theta_0, theta_1): half the mean squared error over the m training examples
    return np.sum((theta_0 + theta_1 * X - Y) ** 2) / (2 * len(X))

print(computeCost(0, 0))
32.0727338775
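
The same cost can also be computed in a vectorized form that matches the design-matrix setup used for gradient descent later (a sketch; computeCostVec and V_demo are names introduced here for illustration):

def computeCostVec(theta, V, Y):
    # theta: length-2 parameter vector, V: 2 x m design matrix, Y: length-m targets
    residuals = np.dot(theta, V) - Y
    return np.dot(residuals, residuals) / (2 * len(Y))

V_demo = np.array([np.ones(X.shape), X])
print(computeCostVec(np.array([0.0, 0.0]), V_demo, Y))  # should agree with computeCost(0, 0)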

By trial and error, we find that the minimum of the cost function lies at $\theta_{0} \in [-5, 5]$ and $\theta_{1} \in [0, 2]$, so we sample that part of the parameter space more densely:

theta_0_space = np.concatenate([
    np.linspace(-10, -5, 400),
    np.linspace(-5, 5, 1000),
    np.linspace(5, 10, 400)])

theta_1_space = np.concatenate([
    np.linspace(-1, 0, 400),
    np.linspace(0, 2, 1200),
    np.linspace(2, 4, 400)])

Theta_0, Theta_1 = np.meshgrid(theta_0_space, theta_1_space)
rows, cols = np.shape(Theta_0)

Z = np.zeros((rows, cols))

# evaluate the cost at every grid point
for i in range(rows):
    for j in range(cols):
        Z[i, j] = computeCost(Theta_0[i, j], Theta_1[i, j])
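
The double Python loop above works, but it is slow for a grid of a few million points. One alternative is to expand the square in $J$ analytically, so the whole surface comes out of a handful of array operations on the coordinate grids (a sketch; Sx, Sy, Sxx, Sxy, Syy and Z_vec are names introduced here for illustration):

m = len(X)
Sx, Sy = X.sum(), Y.sum()
Sxx, Sxy, Syy = (X * X).sum(), (X * Y).sum(), (Y * Y).sum()
# sum((t0 + t1*x - y)^2) = m*t0^2 + Sxx*t1^2 + Syy + 2*t0*t1*Sx - 2*t0*Sy - 2*t1*Sxy
Z_vec = (m * Theta_0 ** 2 + Sxx * Theta_1 ** 2 + Syy
         + 2 * Theta_0 * Theta_1 * Sx - 2 * Theta_0 * Sy - 2 * Theta_1 * Sxy) / (2 * m)
# sanity check: np.allclose(Z, Z_vec) should be True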

Here is the contour plot of $J(\theta)$:

plt.figure()
CS = plt.contour(Theta_0, Theta_1, Z, 
                 levels=[5, 8, 16, 32, 64, 128, 256])
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.title('The contour of the J(theta)')       
plt.clabel(CS, inline=1, fontsize=10)
plt.show()

[Figure: contour plot of J(theta)]

The minimum is found at the $(\theta_{0}, \theta_{1})$ where the gradient $\frac{\partial}{\partial\theta}J(\theta) = 0$; since $J(\theta)$ is convex, this stationary point is the global minimum (not a saddle point). The numerical approach is to use gradient descent:

$\theta_{0} = \theta_{0} - \alpha \frac{\partial}{\partial\theta_{0}}J(\theta)$
$\theta_{1} = \theta_{1} - \alpha \frac{\partial}{\partial\theta_{1}}J(\theta)$

where $\alpha$ is the learning rate and both parameters are updated simultaneously.

More concretely:

$J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^m (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2$
$\frac{\partial}{\partial\theta_{0}}J(\theta) = \frac{1}{2m} \sum_{i=1}^m 2 (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})$
$\frac{\partial}{\partial\theta_{1}}J(\theta) = \frac{1}{2m} \sum_{i=1}^m 2 (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}) \cdot x^{(i)} = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$
theta = np.array([0.0, 0.0])         # initial guess for (theta_0, theta_1)
alpha = 0.01                         # learning rate
m = len(X)
iters = 1000
V = np.array([np.ones(X.shape), X])  # 2 x m design matrix: a row of ones and the inputs
thetas = np.zeros((iters, 2))        # trace of theta at every iteration

for i in range(iters):
    thetas[i] = theta
    # batch gradient descent step, vectorized over all m examples
    theta = theta - (alpha / m) * np.dot(np.dot(theta, V) - Y, V.transpose())
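
To confirm that gradient descent is converging, we can evaluate the cost along the stored trace (a quick check; cost_history is a name introduced here for illustration):

cost_history = [computeCost(t0, t1) for t0, t1 in thetas]
print(cost_history[0], cost_history[-1])  # the cost should decrease steadily for a small enough alpha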

We can also render the gradient descent trace on the contour plot.

plt.figure()
CS = plt.contour(Theta_0, Theta_1, Z, 
                 levels=[5, 8, 16, 32, 64, 128, 256])
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.title('The trace of gradient descent in J(theta)')       
plt.clabel(CS, inline=1, fontsize=10)
plt.plot(thetas[:, 0], thetas[:, 1], c='r')
plt.show()

[Figure: gradient descent trace on the contour plot of J(theta)]

Let's check how well the hypothesis fits the observed data:

B = np.linspace(0, 25, 100)           # population values for plotting the fitted line
Xh = np.array([np.ones(B.shape), B])  # 2 x 100 design matrix over the plotting range
Yh = np.dot(theta, Xh)                # predicted profits

plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.plot(B, Yh, c='red')
plt.xlim([0, 25])
plt.show()

print('theta: \n', theta)

[Figure: training data with the fitted regression line]

theta: 
[-3.24140214  1.1272942 ]
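
With the fitted parameters we can make point predictions, for example for city populations of 35,000 and 70,000 (values picked here purely for illustration):

def predict(population_in_10k):
    # h_theta(x) evaluated with the gradient descent estimate of theta
    return np.dot(theta, [1.0, population_in_10k])

print(predict(3.5) * 10000)  # predicted profit, in dollars, for a city of 35,000 people
print(predict(7.0) * 10000)  # predicted profit, in dollars, for a city of 70,000 people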

There is no reason to reinvent the wheel; we can leverage the power of the Generalized Linear Models from scikit-learn:

from sklearn import linear_model

X = data[:, 0]
Y = data[:, 1]
X = X.reshape(len(X), 1)             # scikit-learn expects a 2-D feature matrix
clf = linear_model.LinearRegression()
clf.fit(X, Y)

print('intercept: ', clf.intercept_)
print('Coefficients: ', clf.coef_)

plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.plot(X, clf.predict(X), color='red')
plt.xlim([0, 25])
plt.show()
intercept:  -3.89578087831
Coefficients:  [ 1.19303364]

[Figure: scikit-learn fit over the training data]
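
scikit-learn's exact least squares solution (intercept ≈ -3.896, slope ≈ 1.193) differs slightly from the gradient descent estimate above simply because 1,000 iterations at $\alpha = 0.01$ have not fully converged. As a cross-check, the same closed-form solution can also be obtained directly from NumPy (a sketch; Xd and theta_exact are names introduced here for illustration):

Xd = np.column_stack([np.ones(len(data)), data[:, 0]])  # design matrix with an intercept column
theta_exact, _, _, _ = np.linalg.lstsq(Xd, data[:, 1], rcond=None)
print(theta_exact)  # should be close to [clf.intercept_, clf.coef_[0]]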