Linear Regression

The idea behind linear regression is least squares: minimize the sum of the squared errors between the hypothesis and the observed results, also known as the cost function $J$.

$$ \begin{equation} Hypothesis: h_{\theta}(x) = \theta_{0} + \theta_{1}x = \theta^{T}x \end{equation} $$

$$ \begin{equation} Cost~function: J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 \end{equation} $$

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Let's use the sample data from exercise 1.

In [47]:
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt(open("ex1data1.txt"), delimiter=",")
X = data[:, 0]
Y = data[:, 1]

plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.show()

Here is the scalar cost function, computeCost.

In [3]:
def computeCost(theta_0, theta_1):
    # half of the mean squared error over the m training examples
    return np.sum((theta_0 + theta_1 * X - Y) ** 2) / (2 * len(X))

print(computeCost(0, 0))
32.0727338775
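As a quick sanity check, the same cost formula can be evaluated by hand on a tiny synthetic sample (these values are hypothetical, not the exercise data): with a perfect fit the cost is zero, and with $\theta = (0, 0)$ the cost is just half the mean of $y^2$.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x, so J(0, 2) should be 0.
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

def computeCost(theta_0, theta_1):
    return np.sum((theta_0 + theta_1 * X - Y) ** 2) / (2 * len(X))

print(computeCost(0, 2))  # -> 0.0 (exact fit)
print(computeCost(0, 0))  # (4 + 16 + 36) / 6, about 9.33
```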

By trial and error, we find that the minimum lies roughly at $\theta_{0} \in [-5, 5]$ and $\theta_{1} \in [0, 2]$, so we sample those ranges more densely:

In [4]:
theta_0_space = np.concatenate([
    np.linspace(-10, -5, 400),
    np.linspace(-5, 5, 1000),
    np.linspace(5, 10, 400)])

theta_1_space = np.concatenate([
    np.linspace(-1, 0, 400),
    np.linspace(0, 2, 1200),
    np.linspace(2, 4, 400)])

Theta_0, Theta_1 = np.meshgrid(theta_0_space, theta_1_space)
rows, cols = np.shape(Theta_0)

Z = np.zeros(shape=(rows, cols))

for i in range(rows):
    for j in range(cols):
        Z[i, j] = computeCost(Theta_0[i, j], Theta_1[i, j])
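The double loop above can also be written with NumPy broadcasting, which evaluates the whole grid at once. A minimal self-contained sketch on synthetic data (the grid ranges here are illustrative, not the exercise ones):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])  # exactly y = 2x, so J(0, 2) = 0

t0 = np.linspace(-1, 1, 5)   # candidate theta_0 values
t1 = np.linspace(0, 4, 5)    # candidate theta_1 values
T0, T1 = np.meshgrid(t0, t1)

# Broadcast to shape (grid_rows, grid_cols, m): one residual vector per cell.
E = T0[..., None] + T1[..., None] * X - Y
Z = (E ** 2).sum(axis=-1) / (2 * len(X))

print(Z.shape)   # (5, 5)
print(Z[2, 2])   # theta_0 = 0, theta_1 = 2 -> 0.0
```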

Here is the contour plot of $J(\theta)$:

In [5]:
plt.figure()
CS = plt.contour(Theta_0, Theta_1, Z, 
                 levels=[5, 8, 16, 32, 64, 128, 256])
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.title('The contour of the J(theta)')       
plt.clabel(CS, inline=1, fontsize=10)
plt.show()

The minimum is found at $(\theta_{0}, \theta_{1})$ where the gradient $\nabla J(\theta) = 0$ and the Hessian of $J$ is positive definite. The numerical approach is to use gradient descent.

$$ \begin{equation} \theta_{0} = \theta_{0} - \alpha \frac{\partial}{\partial\theta_{0}}J(\theta) \end{equation} $$

$$ \begin{equation} \theta_{1} = \theta_{1} - \alpha \frac{\partial}{\partial\theta_{1}}J(\theta) \end{equation} $$

More concretely:

$$ \begin{equation} J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^m (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2 \end{equation} $$

$$ \begin{equation} \frac{\partial}{\partial\theta_{0}}J(\theta) = \frac{1}{2m} \sum_{i=1}^m 2 (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \end{equation} $$

$$ \begin{equation} \frac{\partial}{\partial\theta_{1}}J(\theta) = \frac{1}{2m} \sum_{i=1}^m 2 (\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})\cdot x^{(i)} = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})\cdot x^{(i)} \end{equation} $$
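The partial derivatives above can be sanity-checked numerically with central finite differences. A self-contained sketch on synthetic data (the test point and step size are arbitrary choices):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(X)

def J(t0, t1):
    return np.sum((t0 + t1 * X - Y) ** 2) / (2 * m)

def grad(t0, t1):
    # analytic gradient from the derivation above
    err = t0 + t1 * X - Y
    return np.array([err.sum() / m, (err * X).sum() / m])

t0, t1, eps = 0.5, 1.5, 1e-6
numeric = np.array([
    (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps),
    (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps),
])
print(np.allclose(grad(t0, t1), numeric, atol=1e-5))  # True
```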

In [6]:
theta = np.array([0.0, 0.0])
alpha = 0.01
m = len(X)
iters = 1000
V = np.array([np.ones_like(X), X])  # design matrix: rows are [1, x]
thetas = np.zeros(shape=(iters, 2))

for i in range(iters):
    thetas[i] = theta
    theta = theta - (alpha / m) * np.dot(np.dot(theta, V) - Y, V.T)
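On clean synthetic data the same update rule converges to the generating parameters, which is a useful sanity check before trusting it on real data. A minimal sketch (the learning rate and iteration count here are illustrative):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 1.0 + 2.0 * X                    # generated with theta = (1, 2), no noise
m = len(X)
V = np.array([np.ones_like(X), X])   # design matrix, shape (2, m)

theta = np.array([0.0, 0.0])
alpha = 0.1
for _ in range(2000):
    theta = theta - (alpha / m) * np.dot(np.dot(theta, V) - Y, V.T)

print(theta)  # close to [1.0, 2.0]
```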

We can also overlay the gradient-descent trace on the contour plot.

In [7]:
plt.figure()
CS = plt.contour(Theta_0, Theta_1, Z, 
                 levels=[5, 8, 16, 32, 64, 128, 256])
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.title('The trace of gradient descent in J(theta)')       
plt.clabel(CS, inline=1, fontsize=10)
plt.plot(thetas[:, 0], thetas[:, 1], c='r')
plt.show()

Let's check how well the hypothesis fits the observations:

In [43]:
B = np.linspace(0, 25, 100)
Xh = np.array([np.ones_like(B), B])
Yh = np.dot(theta, Xh)

plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.plot(B, Yh, c='red')
plt.xlim([0, 25])
plt.show()

print('theta: \n', theta)
theta: 
[-3.24140214  1.1272942 ]
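The descent estimate can be cross-checked against the closed-form least-squares solution via `np.linalg.lstsq`. A self-contained sketch on synthetic data (generated with a known intercept and slope):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 1.0 + 2.0 * X                           # generated with intercept 1, slope 2

A = np.column_stack([np.ones_like(X), X])   # design matrix [1, x]
theta_exact, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(theta_exact)  # recovers intercept 1, slope 2
```

On the exercise data, a closed-form estimate differs slightly from the 1000-iteration descent result above, which suggests the descent had not yet fully converged.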

There is no reason to reinvent the wheel; we can leverage the Generalized Linear Models module from scikit-learn:

In [46]:
from sklearn import linear_model
X = data[:, 0]
Y = data[:, 1]
X = X.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
clf = linear_model.LinearRegression()
clf.fit(X, Y)

print('intercept: ', clf.intercept_)
print('Coefficients: ', clf.coef_)


plt.figure()
plt.scatter(X, Y)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.plot(X, clf.predict(X), color='red')
plt.xlim([0, 25])
plt.show()
intercept:  -3.89578087831
Coefficients:  [ 1.19303364]