Perl Hacks: automatically get info about a movie from IMDB

20 Jan 2011

Ok, I know the title sounds like "hey, I found an easy API and here's how to use it"... well, the post is not much more advanced than this but 1) it does not talk about an API but rather about a Perl package and 2) it is not so trivial, as it does not deal with clean, precise movie titles, but rather with another kind of use case you might happen to witness quite frequently in real life.

Yes, we are talking about pirate video... and I hope that writing this here will bring some kids to my blog searching for something they won't find. Guys, go away and learn to search stuff the proper way... or stay here and learn something useful ;-)

So, the real life use case: you have a directory containing videos you have downloaded or you have been lent by a friend (just in case you don't want to admit you have downloaded them ;-). All the titles are like

The.Best.Movie.Ever(2011).SiLeNT.[dvdrip].md.xvid.whatever_else.avi

and yes, you can easily understand what the movie title is and automatically remove all the other crap, but how can a piece of software do this? And, first of all, why should it?

Well, suppose you are not watching movies from your computer, but rather from a media center. It would be nice if, instead of having a list of ugly filenames, you were presented a list of film posters, with their correct titles and additional information such as length, actors, etc. Moreover, if the information you have about your video files is structured you can easily manage your movie collection, automatically categorizing them by genre, director, year, or whatever you prefer, thus finding them more easily. Ah, and of course you might just be interested in a tool that automatically renames files so that they comply with your naming standards (I know, nerds' life is a real pain...).

What we need, then, is a tool that links a filename (that we suppose contains at least the movie title) to, basically, an id (like the IMDB movie id) that uniquely identifies that movie. Then, from IMDB itself we can get a lot of additional information. The steps we have to perform are two:

1. convert an almost unreadable filename into something similar to the original movie title
2. search IMDB for that movie title and get information about it

Of course, I expect that not all the results are going to be perfect: filenames might really be too messy or there could be different movies with the same name. However, I think that having a tool that gives you a big help in cleaning your movie collection might be fine even if it's not perfect.

Let's start with point 1: filename cleaning. First of all, you need to have an idea about how files are actually named. To find enough examples, I just did a quick search on Google for "divx dvdscr dvdrip xvid filetype:txt": it is surprising how many file lists you can find with such a small effort :) Here are some examples of movie titles:

American.Beauty.1999.AC3.iNTERNAL.DVDRip.XviD-xCZ
Dodgeball.A.True.Underdog.Story.SVCD.TELESYNC-VideoCD
Eternal.Sunshine.Of.The.Spotless.Mind.REPACK.DVDSCr.XViD-DvP

and so on. All the additional keywords that you find together with the movie's title are actually useful ones, as they detail the quality of the video (e.g. svcd vs dvdrip or screener), the group that created it, and so on. However, we currently don't need them, so we have to find a way to remove them. Moreover, a lot of junk such as punctuation and parentheses is added to most of the file names, so it is very difficult to understand where the movie title ends and the rest begins.

To clean this mess, I first filtered everything so that only alphanumeric characters are kept and everything else is considered a separator (this became the equivalent of the tokenizing phase of a document indexer). Then I built a list of stopwords that need to be automatically removed from the filename. The list has been built from the very same file I got the titles from (you can find it here or, if the link is dead, download the file ALL_MOVIES here). Just by counting the occurrences of each word with a simple Perl script, I came up with a huge list of terms I could easily remove: to give you an idea of it, here's an excerpt of all the terms that occur at least 10 times in that file

129 -> DVDRip 
119 -> XviD 
118 -> The 
73 -> XViD 
35 -> DVDRiP 
34 -> 2003 
33 -> REPACK DVDSCR 
30 -> SVCD the 
24 -> 2 
22 -> xvid 
20 -> LiMiTED 
19 -> DVL dvdrip 
18 -> of A DMT 
17 -> LIMITED Of 
16 -> iNTERNAL SCREENER 
14 -> ALLiANCE 
13 -> DiAMOND 
11 -> TELESYNC AC3 DVDrip 
10 -> INTERNAL and VideoCD in
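The counting step is simple enough to sketch here. What follows is not the original script, just a minimal reconstruction under my own assumptions: the names countTokens, printFrequent and $minCount are made up for illustration, and tokens are counted case-sensitively because that is how they appear in the excerpt.

```perl
use strict;
use warnings;

# Count how often each alphanumeric token occurs in a list of release names.
# Counting is case-sensitive (XviD and XViD are different tokens).
sub countTokens {
    my @lines = @_;
    my %count;
    foreach my $line (@lines) {
        $count{$1}++ while $line =~ /([a-zA-Z0-9]+)/g;
    }
    return \%count;
}

# Group tokens by frequency and print them from most to least frequent,
# skipping those below the (hypothetical) $minCount threshold.
sub printFrequent {
    my ($count, $minCount) = @_;
    my %byFreq;
    push @{ $byFreq{ $count->{$_} } }, $_ foreach sort keys %$count;
    foreach my $freq (sort { $b <=> $a } keys %byFreq) {
        next if $freq < $minCount;
        print "$freq -> @{ $byFreq{$freq} }\n";
    }
}

# Tiny demo on the three sample titles quoted earlier in the post.
my @sample = (
    "American.Beauty.1999.AC3.iNTERNAL.DVDRip.XviD-xCZ",
    "Dodgeball.A.True.Underdog.Story.SVCD.TELESYNC-VideoCD",
    "Eternal.Sunshine.Of.The.Spotless.Mind.REPACK.DVDSCr.XViD-DvP",
);
printFrequent(countTokens(@sample), 1);
```

On three titles every token occurs only once, so the demo is not very exciting; on a full title list you would read the lines from the file and raise the threshold (10 is the one used for the excerpt).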

As the excerpt shows, most of the frequent terms are actually stopwords that can be safely removed without worrying too much about the original titles. I started with this list and then manually added new keywords I found after the first tests. The results are in my blacklist file. Finally, I decided to apply another heuristic to clean filenames: every time a sequence of tokens ends with a year (starting with 19 or 20), chances are that it is just the movie's release year and not part of the title, so it has to be removed too. Of course, if the title is made up of only that year (e.g. 2010), it is kept. Summarizing, the Perl code used to clean filenames is just the following:

    sub cleanTitle {
        my ($dirtyTitle, $blacklist) = @_;
        my @cleanTitleArray;

        # HEURISTIC #1: everything which is not alphanumeric is a separator;
        # HEURISTIC #2: if an extracted word belongs to the blacklist then ignore it;
        while ($dirtyTitle =~ /([a-zA-Z0-9]+)/g) {
            my $word = lc($1);    # blacklist is lowercase
            if (!defined $blacklist->{$word}) {
                push @cleanTitleArray, $word;
            }
        }

        # HEURISTIC #3: often movies have a date (year) after the title, remove
        # that (if it is not the title of the movie itself!). The pattern is
        # anchored so that only a token that is exactly a year (19xx or 20xx)
        # matches, and an empty token list is left untouched.
        if (@cleanTitleArray > 1 && $cleanTitleArray[-1] =~ /^(?:19|20)\d\d$/) {
            pop @cleanTitleArray;
        }

        return join(" ", @cleanTitleArray);
    }
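To see the three heuristics at work on the sample titles, here is a small self-contained sketch. Some assumptions: buildBlacklist is a made-up helper, the stopwords are just a handful picked from the excerpt (not my real blacklist file), and cleanTitle is restated in a condensed form, with the year check anchored to the whole token, only so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of stopwords into the lookup hash that
# cleanTitle expects (keys are lowercase, because tokens are lowercased).
sub buildBlacklist {
    my %blacklist;
    $blacklist{ lc($_) } = 1 foreach @_;
    return \%blacklist;
}

# Condensed restatement of cleanTitle: tokenize, drop blacklisted words,
# then drop a trailing year unless it is the only word left.
sub cleanTitle {
    my ($dirtyTitle, $blacklist) = @_;
    my @words = grep { !defined $blacklist->{$_} }
                map  { lc($_) } $dirtyTitle =~ /([a-zA-Z0-9]+)/g;
    pop @words if @words > 1 && $words[-1] =~ /^(?:19|20)\d\d$/;
    return join(" ", @words);
}

# A few stopwords taken from the frequency excerpt, for illustration only.
my $blacklist = buildBlacklist(qw(
    DVDRip XviD AC3 iNTERNAL xCZ SVCD TELESYNC VideoCD REPACK DVDSCR DvP
));

foreach my $name (
    "American.Beauty.1999.AC3.iNTERNAL.DVDRip.XviD-xCZ",
    "Dodgeball.A.True.Underdog.Story.SVCD.TELESYNC-VideoCD",
    "Eternal.Sunshine.Of.The.Spotless.Mind.REPACK.DVDSCr.XViD-DvP",
) {
    print cleanTitle($name, $blacklist), "\n";
}
# prints:
#   american beauty
#   dodgeball a true underdog story
#   eternal sunshine of the spotless mind
```

Note that group names such as xCZ or DvP only disappear because they were added to the list by hand, which is exactly the kind of manual tuning the blacklist needs.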

Step 2 consists of sending the query to IMDB and getting information back. Fortunately this is a pretty trivial step, given that we have IMDB::Film available! The only thing I had to modify in the package was the function that returns alternative movie titles: it was not working and I also wanted to customize it to return a hash ("language"=>"alt title") instead of an array. I started from an existing patch to the (currently) latest version of IMDB::Film that you can find here, and created my own Film.pm patch (note that the patch has to be applied to the original Film.pm and not to the patched version described in the bug page).

Access to IMDB methods is already well described in the package page. The only addition I made to my script was the getAlternativeTitle function, which gets all the alternative titles for a movie and returns, in order of priority, the Italian one if it exists, otherwise the international/English one. This is the code:

    sub getAlternativeTitle {
        my $imdb = shift;
        my $altTitle = "";
        # with the patched Film.pm, also_known_as() returns a hash reference
        # ("language" => "alt title")
        my $aka = $imdb->also_known_as();
        # keys are sorted so that the result does not depend on hash ordering
        foreach my $key (sort keys %$aka) {
            # NOTE: currently the default is to return the Italian title,
            # otherwise fall back to the first occurrence of International
            # or English. Change below here if you want to customize it!
            if ($key =~ /^Ital/) {
                $altTitle = $aka->{$key};
                last;
            } elsif ($altTitle eq "" && $key =~ /^(?:International|English)/) {
                $altTitle = $aka->{$key};
            }
        }
        return $altTitle;
    }
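If Italian is not your thing, the same priority rule can be tried out without touching the network. The sketch below is only an illustration: StubFilm is a made-up stand-in for the patched IMDB::Film object (in the real script the object would come from IMDB::Film, e.g. through its crit constructor argument), pickAlternativeTitle and its $preferred parameter are hypothetical names, and the hash keys are sample data, not necessarily what IMDB returns.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a patched IMDB::Film object: it only implements
# also_known_as(), returning a hash ref of "language" => "alt title".
package StubFilm;
sub new { my ($class, %aka) = @_; return bless { aka => {%aka} }, $class; }
sub also_known_as { return $_[0]{aka}; }

package main;

# Same priority rule as getAlternativeTitle, but the preferred language
# prefix is a parameter instead of being hard-coded to Italian.
sub pickAlternativeTitle {
    my ($film, $preferred) = @_;
    my $aka = $film->also_known_as();
    my $fallback = "";
    foreach my $key (sort keys %$aka) {
        # a title in the preferred language wins immediately
        return $aka->{$key} if $key =~ /^\Q$preferred\E/;
        # otherwise remember the first International/English title
        $fallback = $aka->{$key}
            if $fallback eq "" && $key =~ /^(?:International|English)/;
    }
    return $fallback;
}

# Sample data for illustration only.
my $film = StubFilm->new(
    "Italy"                         => "Se mi lasci ti cancello",
    "International (English title)" => "Eternal Sunshine of the Spotless Mind",
);
print pickAlternativeTitle($film, "Ital"), "\n";   # the Italian title
print pickAlternativeTitle($film, "Fran"), "\n";   # falls back to the international one
```

An empty string comes back when neither the preferred language nor an international/English title is available, so the caller can simply keep the main title in that case.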

So, that's basically all. The script has been built to accept only one parameter on the command line. For testing, if the parameter is recognized as an existing filename, the file is opened and parsed for a list of movie titles; otherwise, the string is considered as a movie title to clean and search.
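The "file or title" switch on the single parameter could look more or less like this. It is a sketch and not the code from mt.pl: titlesFromArg is a name I made up here, and I am assuming the file contains one title per line.

```perl
use strict;
use warnings;

# Return the raw titles for one command-line parameter: the non-blank lines
# of the file if $arg names an existing file, otherwise $arg itself.
sub titlesFromArg {
    my ($arg) = @_;
    return ($arg) unless -f $arg;

    open(my $fh, '<', $arg) or die "Cannot open $arg: $!";
    my @titles;
    while (my $line = <$fh>) {
        chomp $line;
        push @titles, $line if $line =~ /\S/;    # skip blank lines
    }
    close($fh);
    return @titles;
}

# Usage sketch: exactly one parameter is expected on the command line.
if (@ARGV == 1) {
    print "$_\n" foreach titlesFromArg($ARGV[0]);
}
```

Each returned string would then go through the cleaning step and the IMDB search, one at a time.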

Testing has been very useful to understand whether the tool was working well. The precision is pretty good, as you can see from the attached file output.txt (if anyone wants to calculate the percentage of right movies... well, you are welcome!). However, I suspect it is somewhat biased towards this list, as it is the same one I took the stopwords from. If you have time to waste and want to work on a better, more complete stopword list, I think you have already understood how easy it is to create one... please make it available somewhere! :-)

The full package, with the original script (mt.pl) and its data files is available here. Enjoy!

Filed under: hacks, perl

Comments (4)

leplatrem
November 6th, 2011 - 19:34

I made a similar script in Python that generates a webpage
https://github.com/leplatrem/folder-theater

- +mala
  December 5th, 2011 - 13:13
  
  Thanks a lot Mathieu, and sorry for not approving this post earlier (I totally missed it till today!). I have checked your script and it looks really cool, I’ll definitely give it a try.
  
Simon
March 30th, 2013 - 12:05

Despite being an old post, this is very useful. Unfortunately I don’t speak Perl, so I am attempting something similar in PHP; that is, of course, unless you happen to know if someone has already written it?

- +mala
  April 30th, 2013 - 15:54
  
  Hi Simon! First of all sorry for replying so late… I only know about PHP IMDB scraper, but that’s probably the first thing you’d find out if you search for PHP and IMDB :-)
  Good luck for your adventure in Sydney! ;-)
  

mala::home Davide “+mala” Eynard’s website
