Statistical learning with R part 4: Clustering
[This post is part of the Statistical learning with R series]
If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression, and part 3: Classification). If not, please check this page before starting.
Try to run clusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type
source("clusteringDemo.R",print.eval=TRUE)
The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
This demo tests different clustering algorithms (K-Means, Hierarchical, and K-Medoids) on the Iris flower dataset.
After you run the full demo, try to answer the following questions:
- What can you say about the WSS vs K plot? Can you spot an elbow? How can you interpret this behaviour, knowing the correct K and having looked at how the dataset looks like?
- What kind of hierarchical clustering is ran by default? Top down or bottom up? Single, complete, or average linkage? Can you modify the method (just look at the parameters of the hclust function) to obtain better results in terms of NMI?
- Try to play with the fuzziness coefficient parameter (m) of Fuzzy C-Means. We know that its value is strictly greater than 1 but with no upper bound. What happens if its value is very close to 1? What happens if it is very big? And what if the value is the default (2)?
- What is the best algorithm in terms of NMI (also take into account the different methods of Hierarchical)? Which one do you consider the best in terms of interpretability of the results?
- Finally, take one sample whose Fuzzy C-Means membership does not clearly put it in one cluster (you can look at the "Memberships" matrix printed after FCM execution, check e.g. sample 51). What makes it so ambiguous? Try to check its features (iris[51,]) and the ones of the cluster centers (i.e. the representative points of each cluster, result$centers), and comment what you found.