Mastering Unsupervised Classification

Below are some steps I learned while trying to tackle a difficult unsupervised classification problem.

Before we cluster the data, it is a good idea to determine how many clusters best capture the variability of the dataset. To do so, we run a clustering model (k-means, Gaussian mixture, etc.) over a range of cluster counts.

The ideal number of clusters is the largest count at which the variance is still low, or the point where the slope of the BIC curve changes. In this example, the ideal number of clusters appears to be 7 for the Gaussian mixture model. Note that the ideal number of clusters can change depending on which clustering model you use.
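This step can be sketched with scikit-learn. The post's dataset is not shown, so a synthetic stand-in for X is assumed here; we fit a Gaussian mixture for each candidate cluster count and record its BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real dataset X (an assumption): three blobs in 3-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 3)) for c in (0.0, 3.0, 6.0)])

# Fit a Gaussian mixture for each candidate number of clusters and record its BIC.
bics = {}
for k in range(2, 11):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)

# A change in the slope of the BIC curve (or its minimum) suggests the ideal count;
# this synthetic data has three clusters, so the BIC should favor small k.
best_k = min(bics, key=bics.get)
```

Plotting `bics` against `k` makes the change in slope easy to spot by eye.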

Each clustering method distinguishes/labels some columns better than others. One way to identify which columns are described well or poorly by the clustering method is to plot the data with respect to each class.

If the clustering method can distinguish the data within a column, the class assignments should separate that column cleanly, much as k-means applied to a time-series signal produces contiguous segments; so if the clustering were 100% perfect, each column plot would show non-overlapping blocks of color (i.e., class assignments). In this condensed plot, not many columns have separated blocks of color, which means the label does not distinguish the data well for the columns with overlapping colors.

We can see the same result in the smaller figure above, which shows the mean of X for each class (y-axis) and each column (x-axis). The y label that the Gaussian mixture model created cannot distinguish the data well for columns [0, 1, 2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 26, 27, 28].
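The same check can be done numerically rather than visually (again a sketch on synthetic data, since the post's X is not shown): compute the mean of each column per class and flag columns whose class means are nearly identical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: 3 columns, the last one carries no class signal.
rng = np.random.default_rng(0)
signal = np.repeat([0.0, 4.0], 150)
X = np.column_stack([
    signal + rng.normal(size=300),
    signal * 2 + rng.normal(size=300),
    rng.normal(size=300),  # noise only: its class means should barely differ
])
y = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Mean of X per class (rows) per column (columns).
class_means = np.vstack([X[y == c].mean(axis=0) for c in np.unique(y)])

# Columns whose class means sit close together are poorly distinguished by y.
spread = class_means.max(axis=0) - class_means.min(axis=0)
poor_columns = np.where(spread < 1.0)[0]  # threshold chosen for illustration
```

Here `poor_columns` plays the role of the bracketed column list above: the columns the label cannot distinguish.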

Therefore, we either need another clustering model to create a label for these columns, or we can cluster each column on its own to obtain a y label per column.
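Clustering each column on its own can be sketched like this (with scikit-learn and placeholder data standing in for the real X):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for the real X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Fit a separate clustering model per column, yielding one y label per column.
labels_per_column = {}
for col in range(X.shape[1]):
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels_per_column[col] = km.fit_predict(X[:, [col]])
```

Each entry of `labels_per_column` is a label vector fitted to that column alone, so a column the global model handled poorly gets its own tailored label.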

The silhouette score is a value from -1 to 1, where positive scores close to 1 signify dense, well-separated clusters, negative scores indicate incorrect clustering, and scores near 0 indicate overlapping clusters. Thus, our score of 0.067 means that the Gaussian mixture model produces overlapping clusters, even for the most distinguishable columns. k-means performs slightly better with a score of 0.11.
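Computing the silhouette score is a one-liner with scikit-learn (sketched here on synthetic data, since the post's X is not shown; well-separated blobs should score near 1, unlike the 0.067 above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, so the score should land well above 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 4)) for c in (0.0, 5.0)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score = silhouette_score(X, labels)  # in [-1, 1]; near 0 means overlapping clusters
```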

This means that this problem is really hard. We could re-run the same procedure with more clustering methods, but the results are likely to be similar. So, what do you do?

I tried two options: slightly modifying the values of X so that each class has a more distinct mean, and ensemble clustering. Using the first option, I achieved an improved silhouette score of 0.253.
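The post does not show how X was modified, but one plausible reading (my assumption, not the author's actual code) is to push each class's samples away from the global mean so the class means become more distinct:

```python
import numpy as np

def separate_class_means(X, labels, strength=0.5):
    """Shift each class's samples further toward their own class mean and away
    from the global mean. This is a guess at the post's 'more distinct mean'
    modification, not the author's actual code."""
    X_new = np.asarray(X, dtype=float).copy()
    global_mean = X_new.mean(axis=0)
    for c in np.unique(labels):
        mask = labels == c
        class_mean = X_new[mask].mean(axis=0)
        X_new[mask] += strength * (class_mean - global_mean)
    return X_new
```

With `strength=0.5`, each class mean moves to `1.5 * class_mean - 0.5 * global_mean`, so the gaps between class means grow by 50%, which tends to raise the silhouette score. It also changes the data, though, so downstream analysis should treat the result as a derived feature space rather than the original measurements.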

The 2D averaging method requires a lot of memory, so I wrote an initial batched version of it, which I am still working on.

When I get it up and running I will let you know how I solved this problem, or at least what happened.

Happy practicing! 👋
