New TR: Multimodal diffusion geometry by joint diagonalization of Laplacians
Hi all,
This paper is something I am particularly happy to share, as it is the first report related to my new research theme (wow, this reminds me that I should update my research page!). The coolest aspect of this topic is that, despite looking different from my previous work, it actually has a lot in common with it.
As some of you may know, much of my previous work relied heavily on different implementations of the concept of similarity (e.g. similarity between tags, between tourism destinations, and so on). This concept has many interpretations, depending on how it is translated into an actual distance for automatic computation (which is what typically happens in practice, no matter how "semantic" your interpretation is supposed to be).
One of the main problems is this: in a rich and social ecosystem like the Web, it is common to find several different ways to define and measure similarity between entities. For instance, two images could be considered similar according to visual descriptors (e.g. SIFT, or color histograms), to the tags associated with them (e.g. "lake", "holiday", "bw"), to some descriptive text (e.g. a Wikipedia page describing what is depicted), to metadata (e.g. author, camera lens), and so on. Moreover, people might not agree on what is similar to what, as everyone has their own subjective way of categorizing things. The result is that often there is no single way to relate similar entities. This is sometimes a limitation (how can we claim that our measure is the correct one?) but also an advantage: for instance, when entities need to be disambiguated, it is useful to have different ways of describing and classifying them. This is, I believe, an important step towards (more or less) automatically understanding the semantics of data.
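To make this concrete, here is a toy sketch of two different affinity matrices over the same set of photos, one built from visual feature vectors and one from tag sets, plus the normalized graph Laplacian that spectral methods operate on. All function names and kernel choices here are my own illustration, not taken from the paper:

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    # Visual modality: Gaussian kernel on feature vectors (rows of X).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def jaccard_affinity(tag_sets):
    # Tag modality: Jaccard overlap between the tag sets of each pair.
    n = len(tag_sets)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = len(tag_sets[i] | tag_sets[j])
            inter = len(tag_sets[i] & tag_sets[j])
            W[i, j] = W[j, i] = inter / union if union else 0.0
    return W

def normalized_laplacian(W):
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(len(W)) - Dinv @ W @ Dinv
```

The same three photos can thus come with two quite different similarity graphs, and hence two different Laplacians, which is exactly the situation the paper starts from.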
The idea I like most behind this work is that there are indeed ways to exploit these different measures of similarity and (pardon me if I oversimplify) derive some kind of average measure that takes all of them into account. This makes it possible, for instance, to tell apart different senses of the same word, as each is used in a different context, or photos that share the same visual features but are assigned different tags. Some (synthetic and real-data) examples are provided in the paper, and finally some friends of mine will understand why I have spent weeks talking about swimming tigers ;-). The paper abstract follows:
"We construct an extension of diffusion geometry to multiple modalities through joint approximate diagonalization of Laplacian matrices. This naturally extends classical data analysis tools based on spectral geometry, such as diffusion maps and spectral clustering. We provide several synthetic and real examples of manifold learning, retrieval, and clustering demonstrating that the joint diffusion geometry frequently better captures the inherent structure of multi-modal data. We also show that many previous attempts to construct multimodal spectral clustering can be seen as particular cases of joint approximate diagonalization of the Laplacians."
… and the full text is available on arXiv. Enjoy, and remember that (especially in this case, as this is mostly new stuff for me) comments are more than welcome :-)