Clustering

K-means clustering is our very first unsupervised machine learning algorithm.
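Before handing the work to scikit-learn, it may help to see what the algorithm does. The following is a minimal sketch of plain K-means (Lloyd's algorithm), not scikit-learn's implementation: the function name `kmeans`, the `n_iters` parameter and the toy data are all made up for illustration.

```python
import numpy as np

def kmeans(X, init_centroids, n_iters=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, init_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
print(centroids)
```

A fixed number of iterations keeps the sketch short; a real implementation would normally stop once the assignments no longer change.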

Using the ex7 data set as an example:

%matplotlib inline
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = scipy.io.loadmat('ex7data2.mat')
# n_init=1 because the initial centroids are given explicitly
clf = KMeans(n_clusters=3, init=np.array([[3, 3], [6, 2], [8, 5]]), n_init=1)
clf.fit_predict(data['X'])

fig = plt.figure(figsize=(8, 6))
plt.scatter(data['X'][:, 0], data['X'][:, 1], c=clf.labels_, alpha=0.3);
plt.scatter(clf.cluster_centers_[:, 0], clf.cluster_centers_[:, 1], c=[1, 2, 3], marker='x', s=200);

[Figure: the three clusters found by K-means, with the cluster centers marked by crosses]

Here is another example, using K-means to compress an image:

import matplotlib.image as mpimg

bird = mpimg.imread('bird_small.png')
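h, w, d = bird.shape  # imread returns (rows, columns, channels)
data = bird.reshape(h * w, d)  # one row per pixel

clf = KMeans(n_clusters=16)
clf.fit_predict(data)
# Replace every pixel by the center of the cluster it was assigned to
compressed = clf.cluster_centers_[clf.labels_].reshape(h, w, d)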

plt.subplot(121)
plt.imshow(bird)
plt.xticks(())
plt.yticks(())
plt.title('original image')
plt.subplot(122)
plt.imshow(compressed)
plt.xticks(())
plt.yticks(())
plt.title('compressed image');

[Figure: the original image next to the image compressed to 16 colors]

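Note that this only approximates the colors: every pixel is replaced by the nearest of the 16 cluster centers. It does not exploit the local similarity between neighboring pixels, as a technique like dithering does.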
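To put a rough number on the saving, the arithmetic can be sketched as follows. The image size (128 x 128) and the 8 bits per color channel are assumptions rather than values read from the file, and `compression_ratio` is a hypothetical helper. The compressed form stores a palette of 16 colors plus one palette index per pixel.

```python
def compression_ratio(height, width, n_colors, bits_per_channel=8, channels=3):
    """Ratio of raw size to palette-plus-indices size, both in bits."""
    bits_per_pixel = bits_per_channel * channels
    original_bits = height * width * bits_per_pixel
    # Bits needed to store one palette index (4 bits for 16 colors)
    index_bits = (n_colors - 1).bit_length()
    # The palette itself, plus one index per pixel
    compressed_bits = n_colors * bits_per_pixel + height * width * index_bits
    return original_bits / compressed_bits

# Assumed dimensions: a 128 x 128 RGB image reduced to 16 colors
ratio = compression_ratio(128, 128, 16)
print(round(ratio, 2))
```

Under these assumptions the image shrinks by a factor of roughly six, almost entirely because each pixel goes from 24 bits to a 4-bit index.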