Neural Network

Before jumping into the topic, let's recap what we have learned so far. In general, supervised machine learning aims to find the mapping between the input and the output based on our experiments, i.e. the hypothesis $h_\Theta(x)$. We then define a cost function $J(\Theta)$ to quantify the error between the hypothesis and the experiments.

We then minimize $J(\Theta)$ so that our hypothesis matches the experiments better. To avoid overfitting, we introduce $l1$ and $l2$ regularization to penalize overly complex fits.

We prefer gradient descent to minimize $J(\Theta)$ because it is effective on convex functions. For a $J(\Theta)$ with multiple local optima, we can try randomizing the starting point to improve the chance of finding the global optimum.
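As a minimal sketch of the random-restart idea, the example below runs plain gradient descent on a one-dimensional non-convex function from several random starting points and keeps the best end point. The function `f`, the learning rate, the step count, and the helper names are illustrative assumptions, not something taken from the course material.

```python
import random

def f(x):
    # Illustrative non-convex cost with two local minima (an assumption for this sketch).
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    # Analytic derivative of f.
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, learning_rate=0.01, steps=2000):
    # Repeatedly step against the gradient, starting from x0.
    x = x0
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

def random_restarts(n_starts=20, seed=0):
    # Run gradient descent from several random starting points in [-2, 2]
    # and keep the end point with the lowest cost.
    rng = random.Random(seed)
    candidates = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(n_starts)]
    return min(candidates, key=f)

best_x = random_restarts()
print(best_x, f(best_x))
```

A single run can get stuck in the shallower minimum on the right; taking the minimum over several restarts makes that much less likely, although it is still not a guarantee.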

With that in mind, we can model the neural network as the hypothesis $h_\Theta(x)$, with the cost function

$$ \begin{equation} J(\Theta) = -\frac{1}{m} \Big[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\big(h_{\Theta}(x^{(i)})\big)_k + (1 - y_k^{(i)}) \log\big(1 - (h_{\Theta}(x^{(i)}))_k\big)\Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}(\Theta_{ji}^{(l)})^2 \end{equation} $$

The equation looks intimidating, but it is basically just the cost function of logistic regression, summed over the $K$ output units, plus $l2$ regularization for every single weight $\Theta_{ji}^{(l)}$ in the neural network. We use forward propagation to compute the hypothesis numerically, rather than expanding it into a symbolic expression in $\Theta$, because of its complexity. Calculating the partial derivatives of $J(\Theta)$ symbolically would be even harder. Luckily, we have the backpropagation method. The gist of this approach is:

backpropagation calculates the gradient of the error of the network regarding the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error.
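To make the description concrete, here is a minimal NumPy sketch of forward propagation, the regularized cost $J(\Theta)$, and backpropagation for a three-layer sigmoid network, followed by a finite-difference check of the gradients. The function names, the layer sizes, and the random toy data are assumptions made for illustration; this is not the code that pylearn2 runs internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(theta1, theta2, X, Y, lam):
    # theta1: (hidden, n + 1), theta2: (K, hidden + 1); column 0 holds the bias weights.
    # X: (m, n) inputs, Y: (m, K) one-hot labels, lam: regularization strength.
    m = X.shape[0]
    ones = np.ones((m, 1))

    # Forward propagation: input -> hidden -> output.
    a1 = np.hstack([ones, X])
    a2 = np.hstack([ones, sigmoid(a1.dot(theta1.T))])
    h = sigmoid(a2.dot(theta2.T))

    # Cross-entropy cost plus l2 regularization (bias weights excluded).
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    J += lam / (2.0 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))

    # Backpropagation: push the output error back through the layers.
    delta3 = h - Y
    delta2 = delta3.dot(theta2[:, 1:]) * a2[:, 1:] * (1 - a2[:, 1:])
    grad1 = delta2.T.dot(a1) / m
    grad2 = delta3.T.dot(a2) / m
    grad1[:, 1:] += lam / m * theta1[:, 1:]
    grad2[:, 1:] += lam / m * theta2[:, 1:]
    return J, grad1, grad2

def numerical_gradient(cost, theta, eps=1e-4):
    # Central finite differences; perturbs theta in place, one entry at a time.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        old = theta[idx]
        theta[idx] = old + eps
        plus = cost()
        theta[idx] = old - eps
        minus = cost()
        theta[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Toy problem: 5 samples, 3 inputs, 4 hidden units, 2 classes (all made up).
rng = np.random.RandomState(0)
X = rng.rand(5, 3)
Y = np.eye(2)[rng.randint(0, 2, size=5)]
theta1 = rng.uniform(-0.1, 0.1, size=(4, 4))
theta2 = rng.uniform(-0.1, 0.1, size=(2, 5))
lam = 1.0

J, grad1, grad2 = cost_and_gradients(theta1, theta2, X, Y, lam)
cost_only = lambda: cost_and_gradients(theta1, theta2, X, Y, lam)[0]
max_diff = max(np.abs(grad1 - numerical_gradient(cost_only, theta1)).max(),
               np.abs(grad2 - numerical_gradient(cost_only, theta2)).max())
print(J, max_diff)
```

If the backpropagated gradients are right, `max_diff` should be tiny compared with the gradients themselves. The finite-difference version needs two cost evaluations per weight, which is why it is only practical as a check on small networks.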

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import scipy.io
data = scipy.io.loadmat('ex3data1.mat')

# pick 100 random handwriting samples out of the 5000 in the dataset
import random
indexes = random.sample(range(0, 5000), 100)

Render the selected handwriting samples:

In [3]:
figure = plt.figure(figsize=(10, 10))
for index, i in enumerate(indexes):
    plt.subplot(10, 10, index + 1)
    plt.axis('off')
    plt.imshow(data['X'][i].reshape(20, 20).transpose(), cmap='Greys')

At the time of writing, scikit-learn does not support supervised neural networks, so I turn to the pylearn2 library for help. The following code is adapted from this gist:

In [21]:
from pylearn2.models import mlp
from pylearn2.training_algorithms import sgd
from pylearn2.termination_criteria import EpochCounter
from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix

class Handwriting(DenseDesignMatrix):
    def __init__(self, X, y):
        # ten classes, one per digit 0..9
        self.class_names = [str(i) for i in range(10)]
        # Convert each label in y to a one-hot vector; label 10 stands for the digit 0.
        Y = []
        for label in y.ravel():
            v = [0] * 10
            if label == 10:
                v[0] = 1
            else:
                v[label] = 1
            Y.append(v)

        super(Handwriting, self).__init__(X=X, y=np.array(Y))
        
dataset = Handwriting(data['X'], data['y'])


# We will create a three-layer neural network, starting with a sigmoid hidden layer of 25 units:
hidden_layer = mlp.Sigmoid(layer_name='hidden', dim=25, irange=.1, init_bias=1.)
# create Softmax output layer
output_layer = mlp.Softmax(10, 'output', irange=.1)
# create Stochastic Gradient Descent trainer that runs for 400 epochs
trainer = sgd.SGD(learning_rate=.05, batch_size=10, termination_criterion=EpochCounter(400))
layers = [hidden_layer, output_layer]
# create neural net that takes 400 inputs
ann = mlp.MLP(layers, nvis=400)
trainer.setup(ann, dataset)
# train neural net until the termination criterion is true
while True:
    trainer.train(dataset=dataset)
    if not trainer.continue_learning(ann):
        break
Parameter and initial learning rate summary:
	hidden_W: 0.05
	hidden_b: 0.05
	softmax_b: 0.05
	softmax_W: 0.05
Compiling sgd_update...
Compiling sgd_update done. Time elapsed: 0.559623 seconds
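The label-to-vector conversion done in the `Handwriting` class above can also be written without a Python loop. The sketch below is one possible vectorized version; the helper name `one_hot_digits` and the sample labels are hypothetical, and it assumes the labels arrive as an `(m, 1)` array of values 1..10, with 10 standing for the digit 0.

```python
import numpy as np

def one_hot_digits(y):
    # y holds labels 1..10, where 10 stands for the digit 0.
    # Taking the label modulo 10 maps 10 -> 0 and leaves 1..9 unchanged.
    digits = np.asarray(y).ravel().astype(int) % 10
    # Row d of the 10x10 identity matrix is the one-hot vector for digit d.
    return np.eye(10, dtype=int)[digits]

# Made-up labels in the assumed (m, 1) layout.
labels = np.array([[10], [1], [9]])
Y = one_hot_digits(labels)
print(Y)
```

Indexing the identity matrix with an integer array picks one row per label, so the result has one row per sample and exactly one `1` in each row.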
In [30]:
import theano

def get_label(X, i):
    inputs = X[i, np.newaxis]
    output = ann.fprop(theano.shared(inputs, name='inputs')).eval()
    return np.argmax(output)
predicted = [get_label(data['X'], i) for i in indexes]

# print the predictions in rows of 10, matching the 10x10 image grid above
for start in range(0, 100, 10):
    print([0 if x == 10 else x for x in predicted[start:start + 10]])
ann.score()
[4, 8, 9, 5, 2, 3, 9, 9, 7, 3]
[7, 7, 5, 1, 6, 7, 1, 9, 8, 9]
[9, 3, 3, 9, 8, 0, 1, 5, 0, 6]
[1, 0, 7, 5, 3, 1, 7, 2, 6, 3]
[1, 6, 6, 5, 1, 3, 2, 6, 1, 1]
[2, 0, 6, 3, 0, 3, 9, 0, 0, 6]
[8, 9, 0, 3, 6, 8, 8, 0, 7, 5]
[7, 7, 8, 1, 2, 0, 5, 0, 0, 3]
[2, 3, 7, 4, 1, 9, 3, 5, 0, 1]
[7, 9, 5, 7, 9, 4, 5, 3, 0, 7]
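Comparing the printed grid with the rendered images by eye works for 100 samples, but a single accuracy number is easier to track. Below is a minimal sketch of such a check; the helper `accuracy` and the example values are hypothetical and are not results from the network above. With the variables from this notebook, the call would presumably look like `accuracy(predicted, data['y'][indexes].ravel())`, assuming `data['y']` stores one label per row.

```python
def accuracy(predicted, labels):
    # Map the dataset's label 10 to the digit 0 so both sides use digits 0..9.
    truth = [0 if label == 10 else label for label in labels]
    matches = sum(1 for p, t in zip(predicted, truth) if p == t)
    return matches / float(len(truth))

# Made-up values for illustration only.
example_predicted = [4, 8, 0, 5]
example_labels = [4, 8, 10, 3]
print(accuracy(example_predicted, example_labels))
```

Note that scoring on samples drawn from the training set measures how well the network fits the data it was trained on, not how well it generalizes.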