[This post is part of the Octave clustering demo series]
This is (currently) the last of my posts devoted to clustering demos. If you haven't downloaded the file containing Octave code yet, you can get it here. If you want to read the previous articles, here they are: part 0 - part 1 - part 2.
The last demo in the package shows you how spectral clustering works. Spectral Clustering is a relatively recent approach to clustering that I found very interesting, especially because it addresses some of the limitations of k-means but, at the same time, it relies on k-means itself! What it does in practice is finding a new space where localities are preserved, emphasizing both the cohesion and the separation of clusters, thus allowing k-means to work at its best. If you are curious and want to delve deeper into this topic, you will find more material in the following page: Paper Review: A Tutorial on Spectral Clustering.
I was quite surprised not to find ready code for spectral clustering in Octave, but probably I just had not searched (well) enough. By the way, if you came here searching for it here is an(other?) implementation, together with some examples and data to play with: feel free to try it and to leave your feedback!
The spectralDemo function can be called as follows:
where dataset is the usual dataset path+name, while nn and t are parameters used to build the (weighted) adjacency matrix describing similarity relationships between data points. Respectively, nn specifies the number of nearest neighbors for a point (a point is connected to another if it appears as one of its top nn nearest neighbors) and t is the parameter for the Gaussian similarity function s(xi,xj)=exp(-(xi-xj)^2 / t). My suggestion is to play with both parameters, but once you understood how they work you can just keep t=0 for the automatic tuning of the kernel and focus only on nn. Good values of nn range between 10 and 40, but of course you are free to test others (but keep in mind that the bigger the value, the slower the whole algorithm is! Can you say why?). The script performs the following tasks:
- it calculates the weighted adjacency matrix and plots both the distances between points and the weights between nodes in the adjacency graph (this allows one to inspect whether parameters need to be trimmed to have a more/less connected graph, and also to understand when the graph approach might work better than a classical euclidean distance approach);
- it builds the graph Laplacian and calculates its eigenvalues and eigenvectors. Then, it plots eigenvalues for inspection (this allows one to understand --if the dataset is clean enough-- how many connected components/natural clusters are present in the graph);
- to compare results between k-means and spectral clustering, it runs k-means both in the euclidean space and in the eigenspace built out of the top k eigenvectors, then shows both clusterings with related SSEs and accuracies.
And now, here are the usual maieutical questions:
- what are the datasets in which spectral clustering performs definitely better than k-means?
- how much do the results of spectral clustering depend on the input parameters?
- how would you evaluate the quality of a labeling done with spectral clustering? How would you apply that to find which is the best number of nearest neighbors to take into account?
- how often (according to your tests with different datasets) does the rule "the number of 0-valued eigenvalues matches the number of clusters" holds? How ofted does its "relaxed version" hold? (if the question is not clear check spectral clustering notes/papers)
- how worth would be to use algorithms other than k-means to do clustering in the eigenspace?
That's all... have fun with the demos and with spectral clustering :-)