Perl Hacks: infogain-based term cloud
This is one of the very first tools I developed in my first year as a post-doc. It took a while to publish it because it was not clear what I could disclose about the project that was funding me. Now the project has ended and, after more than one year, the funding company still has not funded anything ;-) Moreover, this is very far from the final results that we obtained, so I guess I can finally share it.
Rather than a real research tool, this is more like a quick hack that I built up to show how we could use Information Gain to extract "interesting" words from a collection of documents, and term frequencies to show them in a cloud. I have called it a "term cloud" because, even if it looks like the well-known tag clouds, it is not built up with tags but with terms that are automatically extracted from a corpus of documents.
The tool is called "rain" because it is based on rainbow, an application built on top of the "bow" libraries that performs statistical text classification. The basic idea is that we use two sets of documents: the training set tells the system what can be considered "common knowledge", while the test set provides documents about the specific domain of knowledge we are interested in. As a result, the words that are most likely to discriminate the test set from the training set are selected, and their occurrences are used to build the final cloud.
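To make the idea more concrete, here is a minimal sketch of how Information Gain can rank words across two document sets. It is only an illustration: it uses the textbook definition over document counts, which is not necessarily what rainbow computes internally, and the function names (entropy, infogain, doc_freq) and the toy documents are my own assumptions, not part of rain.pl.

```perl
use strict;
use warnings;

# Shannon entropy (in bits) of a list of counts.
sub entropy {
    my @counts = @_;
    my $total = 0;
    $total += $_ for @counts;
    return 0 if $total == 0;
    my $h = 0;
    for my $c (@counts) {
        next if $c == 0;
        my $p = $c / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# Information gain of a word for telling two document sets apart.
# $with_a, $with_b: documents of set A / set B that contain the word
# $n_a, $n_b:       total number of documents in set A / set B
sub infogain {
    my ($with_a, $n_a, $with_b, $n_b) = @_;
    my $n       = $n_a + $n_b;
    my $with    = $with_a + $with_b;
    my $without = $n - $with;
    return entropy($n_a, $n_b)
         - ($with / $n)    * entropy($with_a, $with_b)
         - ($without / $n) * entropy($n_a - $with_a, $n_b - $with_b);
}

# Number of documents containing each word (documents are array refs of words).
sub doc_freq {
    my @docs = @_;
    my %df;
    for my $doc (@docs) {
        my %seen = map { $_ => 1 } @$doc;
        $df{$_}++ for keys %seen;
    }
    return \%df;
}

# Toy data (hypothetical): "common knowledge" vs. domain documents.
my @train = ([qw(the cat sat)], [qw(the dog ran)]);
my @test  = ([qw(the tag cloud)], [qw(the tag folksonomy)]);
my ($df_train, $df_test) = (doc_freq(@train), doc_freq(@test));

# Score every word of the joint vocabulary and print them, best first.
my %vocab = (%$df_train, %$df_test);
my %score;
for my $word (keys %vocab) {
    $score{$word} = infogain($df_train->{$word} // 0, scalar @train,
                             $df_test->{$word}  // 0, scalar @test);
}
for my $word (sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score) {
    printf "%-12s %.3f\n", $word, $score{$word};
}
```

In this toy example "the" appears in every document and scores 0, while "tag" appears in all the test documents and in none of the training ones, so it gets the maximum score of 1 bit.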
Everything is done within a pretty small Perl script, which does not do much more than call the rainbow tool (which has to be installed first!) and use its output to perform some calculations and build an HTML page with the generated cloud. You can download the script from these two locations:
- here you can find a barebone version, which only contains the script and a few collections of test documents. The tool works (that is, it does not return errors) even without training data, but it will not give good results unless it is properly trained. You will also have to download and install rainbow before you can use it;
- here you can find an "all inclusive" version of the script. It is much bigger but it provides: the ".deb" file to install rainbow (don't be frightened by its release date, it still works with Ubuntu Maverick!), and a training set built by collecting all the posts from the "20_newsgroups" data set.
How can I run the rain tool?
Supposing you are using the "all inclusive" version, that is, you already have your training data and rainbow installed, running the tool is as easy as typing
perl rain.pl <path_to_test_dir>
The script parameters can be modified within the script itself (see the following excerpt from the script source):
my $TERMS         = 50;                      # size of the pool (top words by infogain)
my $TAGS          = 50;                      # final number of tags (top words by occurrence)
my $SIZENUM       = 6;                       # number of size classes to be used in the HTML document,
                                             # represented as different font sizes in the CSS
my $FIREFOX_BIN   = '/usr/bin/firefox';      # path to browser binary
                                             # (if present, firefox will be called to open the HTML file)
my $RAINBOW_BIN   = '/usr/bin/rainbow';      # path to rainbow binary
my $DIR_MODEL     = './results/model';       # used internally by rainbow
my $DIR_DATA      = './train';               # path to training dir
my $DIR_TEST      = $ARGV[0];                # path to test dir
my $FILE_STOPLIST = './stopwords.txt';       # stopwords file
my $FILE_TMPDATA  = './results/data.txt';    # file where data generated
                                             # by rainbow will be dumped
my $HTML_TEMPLATE = './template.html';       # template file used to generate the
                                             # tag-cloud html page
my $HTML_OUTPUT   = './results/output.html'; # final html page
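To give an idea of what $SIZENUM is for, here is a small sketch of how word occurrences could be mapped to size classes and turned into HTML. This is not the code from rain.pl: the linear scaling, the function names (size_classes, cloud_html) and the "sizeN" CSS class names are assumptions made just for this example.

```perl
use strict;
use warnings;

# Map each word's occurrence count to a size class between 1 and $sizenum,
# scaling linearly between the smallest and the largest count.
sub size_classes {
    my ($counts, $sizenum) = @_;    # $counts: hash ref, word => occurrences
    my @values = values %$counts;
    return {} unless @values;
    my ($min, $max) = ($values[0], $values[0]);
    for my $v (@values) {
        $min = $v if $v < $min;
        $max = $v if $v > $max;
    }
    my $range = $max - $min;
    my %class;
    for my $word (keys %$counts) {
        # If all counts are equal, every word falls in the smallest class.
        $class{$word} = $range == 0
            ? 1
            : 1 + int(($counts->{$word} - $min) / $range * ($sizenum - 1) + 0.5);
    }
    return \%class;
}

# Build the cloud markup: one <span> per word, in alphabetical order.
# Words are assumed to be plain terms that need no HTML escaping.
sub cloud_html {
    my ($counts, $sizenum) = @_;
    my $class = size_classes($counts, $sizenum);
    return join ' ', map {
        sprintf '<span class="size%d">%s</span>', $class->{$_}, $_
    } sort keys %$counts;
}

# Hypothetical occurrence counts for three terms.
my %occurrences = (tag => 40, folksonomy => 12, user => 5);
print cloud_html(\%occurrences, 6), "\n";
```

With six classes, the most frequent word ends up in class 6 and the least frequent one in class 1; a CSS file would then assign a font size to each class.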
There are quite a lot of parameters, but the script also runs out of the box: for instance, if you type
perl rain.pl ./test/folksonomies
you will see the cloud shown in Figure 1. And now, here is another term cloud built using my own blog posts as a text corpus: