Let's see some of that Python magic!

Today's agenda:

18:15 - 18:45: Introduction to clustering and feature engineering

Short break: refill drinks, meet & greet, etc.

19:00 - 19:30: Introduction to data (mining) analysis in Python

19:30: Hang around and discuss clustering and machine learning.

Introduction to data (mining) analysis in Python

Why Python?

(MATLAB) vs R vs Python!

Learn all, use the right tool for the right task

How we work in Python:

IPython

&

The NumPy stack

Explaining the NumPy stack. Buuuuzzwords incoming!

Store the raw dataset in a pandas DataFrame

Engineer features and store them in a NumPy array

Create the distance matrix with SciPy

Train the model with scikit-learn

Plot the result with matplotlib (a toy sketch of this pipeline follows)
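A toy, end-to-end sketch of that hand-off. The points and column names below are made up, purely to show each library's role, not the lyrics dataset we use later:

In []:
# Hypothetical toy data, only to illustrate the pipeline hand-offs
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

# 1. Raw dataset in a pandas DataFrame
df = pd.DataFrame({'x': [1.0, 1.2, 8.0, 8.1], 'y': [0.5, 0.7, 9.0, 9.2]})

# 2. Engineered features as a numpy array
features = df[['x', 'y']].values

# 3. Distance matrix with scipy
D = squareform(pdist(features, metric='euclidean'))

# 4. Train a model with scikit-learn
model = KMeans(n_clusters=2).fit(features)

# 5. Plot the result with matplotlib
plt.scatter(df.x, df.y, c=model.labels_)
plt.show()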

Pandas

Scikit-learn

In []:
# Helper settings since I'm presenting live
# Suppress font warnings in plots
import warnings
warnings.filterwarnings('ignore')

# Bigger matplotlib figures
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10

The IPython Notebook

From here on things might get shaky

In []:
print "Lets do this!"

Execute Python and get inline plots!

In []:
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

mean = 0
variance = 1
sigma = math.sqrt(variance)

# Plot a standard normal pdf
x = np.linspace(-3, 3, 100)
plt.plot(x, mlab.normpdf(x, mean, sigma))

plt.show()

Command line stuff

! runs a terminal command

% runs a line magic command

%% runs a cell magic, e.g. another kernel (R, Bash)
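For instance, a line magic (this one ships with stock IPython):

In []:
# Time a small expression with the %timeit line magic
%timeit sum(range(1000))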

In []:
!ls ~/nltk_data/tokenizers/punkt/
In []:
my_variable = !cat ~/nltk_data/tokenizers/punkt/README | head -n 10
print my_variable

But watch out: the bare form can be shadowed by Python variables, so prefer the explicit !

In []:
# IPython aliases ls to the shell command...
ls
In []:
# ...until we shadow it with a Python variable
ls = 1
print ls

# Deleting the variable restores the alias,
# but the explicit ! form is never ambiguous
del ls
!ls
In []:
ls

Other kernels than Python

In []:
%%bash

# Execute Hive queries from IPython
printf "we need more data \n\n"
printf "Let's run a Hive query \n\n"

printf "hive -e 'select users from music_played'"

Variables are stored in the kernel's state, so cell execution order matters

In []:
# NameError the first time...
print not_defined_variable
In []:
# ...define it here, and re-running the cell above now works
not_defined_variable = 1

Getting the Data

Loading the dataset

In []:
!ls ../crawler/output/ | head -n 10
In []:
!ls ../crawler/output/Aerosmith/ | head -n 10
In []:
# Helper function: load each JSON file and add it as a row to a DataFrame
from utils import utils
dataset = utils.build_dataset(debug=False)
In []:
dataset.head()

Data cleaning

In []:
dataset.loc[1].lyric

Removing escape sequences:

In []:
import re

dataset['lyric'] = dataset['lyric'].apply(lambda lyric: re.sub('\\n', '', lyric))
dataset['lyric'] = dataset['lyric'].apply(lambda lyric: re.sub('\\r', '', lyric))
print dataset.loc[1].lyric

Removing the [words] and (words):

In []:
print "Before"
print dataset.loc[111].lyric
dataset['lyric'] = dataset['lyric'].apply(lambda lyric: re.sub('.?\[.*?\] | \(.*?\)', '', lyric))
print ""
print "After"
print dataset.loc[111].lyric

More weird characters

In []:
dataset.loc[2].lyric
In []:
dataset['lyric'] = dataset['lyric'].apply(lambda lyric: re.sub('[^a-zA-Z0-9\s]', '' ,lyric))
dataset.loc[2].lyric
Some final tweaks
In []:
# rstrip strips a set of characters, not a suffix, so use a regex to drop the trailing word
dataset['track'] = dataset['track'].apply(lambda track: re.sub(' ?lyrics$', '', track.lower()))
dataset['artist'] = dataset['artist'].apply(lambda artist: artist.lower())
dataset['lyric'] = dataset['lyric'].apply(lambda lyric: lyric.decode('string_escape'))
dataset['lyric'] = dataset['lyric'].apply(lambda lyric: lyric.lower())

dataset.head()
In []:
dataset.loc[1].lyric

Feature Extraction

In []:
from sklearn.feature_extraction.text import CountVectorizer
cnt_vec = CountVectorizer() # Bag of words
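To make the bag-of-words idea concrete, a tiny standalone example (the two "documents" here are made up, not from the lyrics dataset):

In []:
# Each row counts word occurrences in one document
toy_vec = CountVectorizer()
M = toy_vec.fit_transform(["the cat sat", "the cat ate the cat"])
print toy_vec.get_feature_names()
print M.toarray()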
In []:
X = cnt_vec.fit_transform(dataset.lyric)
In []:
X.shape
In []:
X[0,:].toarray()

Very sparse

In []:
print "Average non zero features %.0f " % (X.nnz / X.shape[0])
print "out of %d features" % X.shape[1]

Well, what are our features?

In []:
from operator import itemgetter
per_feature_weight_sum = np.sum(X.toarray(), axis=0)

weight_feature = zip(per_feature_weight_sum, cnt_vec.get_feature_names())

sorted(weight_feature, key=itemgetter(0), reverse=True)[:10] # 10 most common features in all rows

Jump directly to hierarchical clustering

In []:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage

# Condensed cosine distance matrix; note that linkage ignores
# its metric argument when given precomputed distances
D = pdist(X.toarray(), metric='cosine')
Z = linkage(D, method='complete')

dendro = dendrogram(Z)
In []:
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

for cutting_distance in [0.99, 0.97, 0.96, 0.4, 0.2]:
    Y = fcluster(Z, cutting_distance, criterion='distance')
    print "Number of clusters %d" % len(np.unique(Y)),
    print "Silhouette score %f" % silhouette_score(X, Y)
In []:
def silhouette_per_cluster(X, Y):
    """ Calculates the average silhouette score
    per cluster given X, Y
    """
    samples = silhouette_samples(X, Y)
    cluster_silhouette_scores = []
    for i in np.unique(Y):
        cluster_i = samples[Y == i]
        cluster_silhouette_scores.append(np.mean(cluster_i))
    return cluster_silhouette_scores

def cluster_sizes(Y):
    """ Returns the size of each cluster given
    a categorical vector Y
    """
    cluster_size = []
    for cluster in np.unique(Y):
        Y_ = Y == cluster
        cluster_size.append(sum(Y_))
    return cluster_size

Let's try k-means

In []:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import silhouette_samples

def kmeans(X, k=3, n_init=10):
    """ Fits a k-means model given a number of clusters
    and returns the fitted KMeans object
    """
    kmeans_model = KMeans(n_clusters=k, n_init=n_init)
    kmeans_model.fit(X)

    return kmeans_model
In []:
# Local optimum: running the same call twice can give different results
km = kmeans(X, k=3)
Y = km.labels_
print Y
print "Silhouette score %f" % silhouette_score(X, Y)
print km.inertia_
In []:
km = kmeans(X, k=3)
Y = km.labels_
print Y
print "Silhouette score %f" % silhouette_score(X, Y)
print km.inertia_

Avoiding local optima

In []:
def kmeans_multiple(X, k=3, n_runs=20):
    """ Run k-means multiple times to avoid
    getting stuck in a local optimum.
    We worship the silhouette score, hence return
    the model with the maximum score.
    """

    km_models = []
    for i in range(n_runs):

        # Train model
        kmeans_model = KMeans(n_clusters=k)
        kmeans_model.fit(X)
        # Calculate the silhouette score
        Y = kmeans_model.labels_
        sil_score = silhouette_score(X, Y)

        km_models.append((sil_score, kmeans_model))
    return max(km_models, key=itemgetter(0)) # Best silhouette score
In []:
sil_score, kmeans_model = kmeans_multiple(X)
print kmeans_model.inertia_
print sil_score
In []:
# k = 2 to 9
# This takes a while..
k_range = range(2, 10)
km_k = [kmeans_multiple(X, k=k_i) for k_i in k_range]
In []:
# km = (sil, kmeans_model)
plt.plot(k_range, [km[1].inertia_ for km in km_k])
plt.title('Inertia')
plt.ylabel('Inertia')
plt.xlabel('Number of clusters, k')
In []:
plt.plot(k_range, [km[0] for km in km_k])
plt.title('Silhouette score')
plt.ylabel('Silhouette score')
plt.xlabel('Number of clusters, k')
In []:
# The number of clusters we choose
k = 5
# k-2 because k_range starts at 2; [1] picks the model from the (score, model) tuple
Y = km_k[k-2][1].labels_
In []:
print silhouette_per_cluster(X, Y)
print cluster_sizes(Y)

Manual inspection

Do you remember this?

In []:
dataset.head()
In []:
sil_cluster = silhouette_per_cluster(X, Y)
for cluster in np.unique(Y):

    Y_ = Y == cluster

    print "Id: %d" % cluster,
    print "Sil score: %f" % sil_cluster[cluster],
    print "Size: %d" % sum(Y_)

    cluster_data = dataset[Y_]

    for (index, obj) in cluster_data.iterrows():
        print obj.artist,
        print " - ",
        print obj.track
    print " "
In []:
dataset[Y==2]
In []:
for (i, row) in dataset[Y==2].iterrows():
    print row.lyric
    print ""

Next steps?

* Try a stemmer

* Remove stop words (a sketch of these two follows after this list)

* Cluster artists with all lyrics as vector representation
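A minimal sketch of the first two ideas with NLTK. The normalize helper and example sentence below are hypothetical, and it assumes NLTK's stopwords corpus is downloaded:

In []:
# Hypothetical helper: stem words and drop English stop words
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def normalize(lyric):
    """ Stem each word and drop English stop words """
    return ' '.join(stemmer.stem(w) for w in lyric.split() if w not in stop)

print normalize("i was running and they were singing all of the songs")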

Thank you for tonight, please hang around for a while!

Oscar Carlsson

* Slides: https://github.com/Oscarlsson/meetup-clustering-presentation

* Keep in touch: https://www.linkedin.com/in/ocarlsson3
