Today's agenda:

*18:15 - 18:45:*
Introduction to clustering and feature engineering

Short break, refill drinks, greet & meet etc.

*19:00 - 19:30:*
Introduction to data (mining) analysis in Python

*19:30:*
Hang around and discuss clustering and machine learning.

## The problem

## Clustering algorithms

## Evaluation of the clustering

## Machine Learning at Spotify

**Our dataset**
$$
X = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_n \end{pmatrix}
$$

**Each sample**
$$x_i = \begin{pmatrix} f_1 & f_2 & f_3 & f_4 & \dots & f_m \end{pmatrix}^T $$

Defining each sample $x_i$ means deciding on each of its features $f_j$ = **Feature Engineering**

$$ X = \begin{pmatrix} \text{document}_1 \\ \text{document}_2 \\ \text{document}_3 \\ \text{document}_4 \\ \vdots \\ \text{document}_n \end{pmatrix} , \: Y = \begin{pmatrix} \text{subject}_1 \\ \text{subject}_2 \\ \text{subject}_1 \\ \text{subject}_3 \\ \vdots \\ \text{subject}_k \end{pmatrix} $$
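One possible way to turn the documents above into numeric samples is a bag-of-words count vector, where each feature is the number of times a vocabulary word occurs. This is only a minimal sketch: the toy corpus and the helper names `build_vocabulary` and `bag_of_words` are made up for illustration and are not from the talk.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase word in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Map one document to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus (an assumption, not data from the talk).
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]

vocabulary = build_vocabulary(documents)

# X has one row per document and one column (feature) per vocabulary word.
X = [bag_of_words(doc, vocabulary) for doc in documents]

for doc, row in zip(documents, X):
    print(row, "<-", doc)
```

Every row has the same length, so the rows can be fed to a clustering algorithm. Raw counts are just one choice of feature; weighting schemes such as TF-IDF are a common alternative.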

### Hierarchical Clustering

### K-means

- Need to specify: K (number of clusters)
- Each cluster is represented by the *mean* of its samples (the centroid)
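The two K-means points above (choose K, represent each cluster by its mean) can be sketched in plain Python. This is an illustrative version of the standard assign-then-update iteration, assuming Euclidean distance and initial centroids drawn at random from the samples; the function names, the `seed` parameter, and the toy data are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(samples, k, iterations=100, seed=0):
    """Return (centroids, labels) after at most `iterations` rounds."""
    rng = random.Random(seed)
    # Assumption: start from k distinct samples chosen at random.
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [x for x, label in zip(samples, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
samples = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
centroids, labels = kmeans(samples, k=2)
print(centroids, labels)
```

The result can depend on the initial centroids, which is why practical implementations usually run several random restarts and keep the best one.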

Disadvantages:

$$a(x) = \text{average distance to all other samples in the same cluster} $$

$$b(x) = \text{average distance to the samples of the closest cluster}$$

$$s(x) = \frac{b(x) - a(x)}{\max\{a(x),b(x)\}}, \forall x\in X$$

$$ -1 \le s(x) \le 1 $$
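The definitions of $a(x)$, $b(x)$ and $s(x)$ above translate fairly directly into code. The sketch below assumes Euclidean distance and at least two clusters; returning 0 for a sample that is alone in its cluster (or when both averages are zero) is a convention chosen here, not something stated on the slides. The function names and toy data are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def silhouette(index, samples, labels):
    """Silhouette value s(x) of samples[index] under the given labels."""
    x, own = samples[index], labels[index]

    # a(x): average distance to all other samples in the same cluster.
    same = [
        euclidean(x, y)
        for j, (y, label) in enumerate(zip(samples, labels))
        if label == own and j != index
    ]
    if not same:
        return 0.0  # assumption: singleton clusters score 0
    a = sum(same) / len(same)

    # b(x): smallest average distance to the samples of any other cluster.
    b = min(
        sum(euclidean(x, y) for y, label in zip(samples, labels) if label == c)
        / labels.count(c)
        for c in set(labels) - {own}
    )

    denominator = max(a, b)
    return 0.0 if denominator == 0 else (b - a) / denominator


# Hypothetical toy data: two tight pairs of points, far apart.
samples = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]  # labels that match the visible grouping
bad = [0, 1, 0, 1]   # labels that cut across it

print([round(silhouette(i, samples, good), 3) for i in range(len(samples))])
print([round(silhouette(i, samples, bad), 3) for i in range(len(samples))])
```

Values near 1 indicate a sample sits well inside its cluster, while negative values suggest it would fit better in a neighbouring cluster. Averaging $s(x)$ over all samples gives a single score that can be compared across different choices of K.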

- Discovery - new music
- Related artists
- Radio

Rec sys: http://www.a1k0n.net/spotify/ml-madison/

Deep Learning: http://benanne.github.io/2014/08/05/spotify-cnns.html

- User behavior
- Artist disambiguation (the Kent bug)
