Welcome to the Machine Learning meetup!

Today's agenda:

18:15 - 18:45: Introduction to clustering and feature engineering

Short break: refill drinks, meet & greet, etc.

19:00 - 19:30: Introduction to data (mining) analysis in Python

19:30: Hang around and discuss clustering and machine learning.

Who do we have here today?




Introduction to clustering and feature engineering

Outline of today's talk

  • The problem

  • Clustering algorithms

  • Evaluation of the clustering

  • Machine Learning at Spotify

The problem of clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

The objects could be text documents (e.g., news articles),

and the grouping could be their theme.

Supervised Learning

Data $X$ and labels $Y$

(documents and categories)

Unsupervised Learning

Only $X$

(documents only)

Feature Engineering

Our dataset $$ X = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_n \end{pmatrix} $$

Each sample $$x_i = \begin{pmatrix} f_1, & f_2, & f_3, & f_4, & \dots, & f_m \end{pmatrix}^T $$

Defining $X$ means defining each $x_i$, which means deciding each $f_i$

= Feature Engineering

We have $n$ DOCUMENTS

Each document $x_i$ is a numerical vector

We design the meaning of each $f_i$

Features: The words? The datetime? The font used?

$$ X = \begin{pmatrix} document_1 \\ document_2 \\ document_3 \\ document_4 \\ \vdots \\ document_n \end{pmatrix} , \: Y = \begin{pmatrix} subject_1 \\ subject_2 \\ subject_1 \\ subject_3 \\ \vdots \\ subject_k \end{pmatrix} $$

$document_1$ and $document_3$ both share $subject_1$
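As a concrete sketch of this step (assuming scikit-learn, which the talk doesn't prescribe), a bag-of-words featurization turns raw documents into the numerical matrix $X$; the document strings below are made up for illustration:

```python
# A minimal sketch: turn raw documents into a numerical feature matrix X.
# Library choice (scikit-learn) and the example documents are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "stock markets rally after strong earnings",
    "central bank raises interest rates again",
    "local team wins the championship final",
]

# Each row of X is one document x_i; each column is one feature f_j
# (here: the tf-idf weight of one word).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print(X.shape)  # (n documents, m word features)
```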

Clustering algorithms

  • K-means

  • Hierarchical Clustering

K-means

  • Need to specify: K (Number of clusters)
  • Each cluster is the mean of the samples (centroids)

The algorithm:

Specify K, the number of clusters

1. Randomly create K clusters

2. Calculate the centroid of each cluster (the mean of its samples)

3. Associate each sample with its closest centroid (squared Euclidean distance)

4. Repeat steps 2-3 until the centroids change very little
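A minimal sketch of these steps with NumPy (an illustrative implementation, not the talk's code; random restarts and empty-cluster handling are omitted):

```python
# Lloyd's algorithm for K-means, following the four steps above.
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly create K clusters by giving each sample a random label.
    labels = rng.integers(0, k, size=len(X))
    centroids = np.zeros((k, X.shape[1]))
    for _ in range(n_iter):
        # 2. Each centroid is the mean of the samples in its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. Associate each sample with its closest centroid
        #    (squared Euclidean distance).
        dists = ((X[:, None, :] - new_centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 4. Stop once the centroids change very little.
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `kmeans(np.random.default_rng(1).random((100, 2)), k=2)` clusters 100 random 2-D points, mirroring the walkthrough below.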

These examples are stolen.

Thank you Summer School in Statistics for Astronomers V!

Our example 2-D DATA

K = 2, RANDOM ASSIGNMENT

Update the centroids (move the clusters)

Recalculate which centroid is closest to each sample (update the colors)

Iterate:

Update the centroids (move the clusters)

Recalculate which centroid is closest to each sample (update the colors)

Stop when the centroids converge. Formally, K-means minimizes the within-cluster sum of squared distances:

$$\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2$$

Disadvantages:

  • Converges to a local optimum (the result depends on the random initialization)

  • Sensitive to outliers (the mean is not robust)

  • K must be chosen in advance

Hierarchical agglomerative clustering (HAC)

The (agglomerative) algorithm:

Start: each sample is its own cluster

Merge the two closest clusters, and repeat until only one cluster is left

Output: a linkage matrix (recording how and when clusters were merged)

Dendrogram
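A small sketch with SciPy and matplotlib (assumed library choices): compute the linkage matrix and draw the dendrogram:

```python
# HAC: build the linkage matrix, then visualize it as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).random((20, 2))  # 20 made-up 2-D samples

# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
Z = linkage(X, method="ward")

dendrogram(Z)
plt.show()
```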

Evaluation of the clustering

Silhouette score

$$a(x) = \text{average distance from } x \text{ to all other samples in the same cluster}$$

$$b(x) = \text{average distance from } x \text{ to the samples in the nearest other cluster}$$

$$s(x) = \frac{b(x) - a(x)}{\max\{a(x),b(x)\}}, \forall x\in X$$

$$ -1 \le s(x) \le 1 $$
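A quick sketch with scikit-learn (an assumed library choice): use the mean silhouette score to compare a few values of K:

```python
# Compare values of K by the mean silhouette score s(x) over all samples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((200, 2))  # made-up 2-D data

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Near 1: samples sit well inside their cluster;
    # near 0: on a boundary; negative: likely misassigned.
    print(k, silhouette_score(X, labels))
```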

If we have labeled data, we can use the labels to evaluate (e.g., check whether documents with the same subject land in the same cluster).

Visualization with dimensionality reduction
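For example (a sketch assuming scikit-learn and matplotlib), project the samples to 2-D with PCA and color the points by cluster label:

```python
# Reduce to 2-D with PCA so the clustering can be inspected by eye.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 10))  # 200 samples, 10 features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.show()
```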

Machine Learning at Spotify

Recommender systems

  • Discovery - new music
  • Related artists
  • Radio

Rec sys: http://www.a1k0n.net/spotify/ml-madison/
Deep Learning: http://benanne.github.io/2014/08/05/spotify-cnns.html

Clustering

  • User behavior
  • Artist disambiguation (the kent bug)

Stay put, we will soon start again!

18:15 - 18:45: Introduction to clustering and feature engineering

Short break: refill drinks, meet & greet, etc.

19:00 - 19:30: Introduction to data (mining) analysis in Python

19:30: Hang around and discuss clustering and machine learning.