I recently came across a blog post that analyzed a collection of popular children's books and clustered them thematically. It seemed like a good opportunity to run a similar analysis on popular films. IMDB uses keyword matching to identify plot points in films and has a user-based rating system where visitors rate films on a 1-10 scale. Together these make it possible to plot similarities based on genres and plot points, and then see which types of films are statistically likely to receive higher ratings. My first aim was to narrowly cluster highly rated films (> 8), as seen below:
This type of visualization is based on phylogenetic trees, which show evolutionary relationships between species. In the same way, this dendrogram clustering shows shared features between films. There are some expected groupings and a few surprises. For instance, The Matrix and the Terminator series both take place in a dystopian future with a savior element. By contrast, Pan's Labyrinth and The Green Mile are not films that would typically be mentioned together, but on further review they do share a magical surrealism in historical settings.
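To make the idea behind the dendrogram concrete, here is a minimal sketch of agglomerative clustering on keyword sets. It is not the method used for the plot above: the film names and keywords are invented placeholders, and the choice of Jaccard distance with average linkage is an assumption made for illustration.

```python
from itertools import combinations

# Hypothetical keyword sets, invented for illustration (not real IMDB data).
films = {
    "Film A": {"dystopia", "chosen-one", "artificial-intelligence"},
    "Film B": {"dystopia", "chosen-one", "time-travel"},
    "Film C": {"fantasy", "historical-setting", "prison"},
    "Film D": {"fantasy", "historical-setting", "war"},
}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0 for identical keyword sets, 1 for no overlap."""
    return 1 - len(a & b) / len(a | b)

def cluster_distance(c1, c2):
    """Average linkage: mean pairwise distance between members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(jaccard_distance(films[x], films[y]) for x, y in pairs) / len(pairs)

def agglomerate(names):
    """Repeatedly merge the two closest clusters, recording each merge."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((c1, c2, round(cluster_distance(c1, c2), 3)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = agglomerate(sorted(films))
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

Each recorded merge corresponds to one branch point in a dendrogram: films that share more keywords join lower in the tree, and unrelated groups only join near the root.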
The next step was to perform feature selection on the entire dataset to see which genres and plot keywords were most predictive of score. After applying a variety of feature selection techniques, a core set of significant variables (that is, genres and plot keywords) was left. To the side is a table of these variables and their coefficients.
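The post does not say which feature selection techniques were used, so the following is only a sketch of one simple filter-style approach, swapped in for illustration: keep a feature if films with it score noticeably differently, on average, from films without it. The rows, the `min_gap` threshold, and the function names are all hypothetical.

```python
from statistics import mean

# Toy rows invented for illustration: (score, set of genre/keyword flags).
toy_films = [
    (8.1, {"drama", "revenge"}),
    (7.4, {"drama"}),
    (5.2, {"horror", "revenge"}),
    (5.9, {"horror"}),
    (6.8, {"comedy", "revenge"}),
    (6.6, {"comedy"}),
]

def mean_score_gap(feature, rows):
    """Mean score of films with the feature minus mean score of films without it."""
    with_f = [score for score, flags in rows if feature in flags]
    without_f = [score for score, flags in rows if feature not in flags]
    return mean(with_f) - mean(without_f)

def select_features(rows, min_gap=0.5):
    """Keep features whose absolute mean score gap is at least min_gap."""
    features = sorted(set().union(*(flags for _, flags in rows)))
    gaps = {f: mean_score_gap(f, rows) for f in features}
    return {f: round(gap, 2) for f, gap in gaps.items() if abs(gap) >= min_gap}

selected = select_features(toy_films)
print(selected)
```

On this toy data only the features with a clear score gap survive the filter. A real analysis would more likely use a regularized regression or significance tests, which also account for features that overlap with one another.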
Each coefficient can be thought of as the mean increase or decrease in score, relative to the baseline (the intercept), with the addition of that genre/plot type. With a baseline mean score of 6.5, we can see a drama film is likely to score half a point higher, whereas a horror film is likely to score half a point lower. One case I found amusing was that a one-word title was likely to reduce the score by an entire point. Some of these stats may reveal deeper truths about the state of the industry as well. A female protagonist is associated with a lower score. Whether that reflects the quality of the films given to female leads or bias among reviewers is inconclusive, but it is notable. It is also indicative of the popularity and quality of comic book films that their genre has the highest impact on the final rating.
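The arithmetic behind that reading of the coefficients can be sketched as follows. The numbers are the approximate values quoted in the text (a 6.5 baseline, half a point for drama and horror, a full point for a one-word title), not the exact table entries, and the feature labels and function name are hypothetical.

```python
# Approximate values taken from the prose above, not the exact table entries.
INTERCEPT = 6.5
COEFFICIENTS = {
    "drama": 0.5,            # about half a point higher
    "horror": -0.5,          # about half a point lower
    "one_word_title": -1.0,  # about a full point lower
}

def predicted_score(features):
    """Baseline score plus the coefficient of every feature the film has."""
    return INTERCEPT + sum(COEFFICIENTS[f] for f in features)

print(predicted_score([]))                            # baseline only
print(predicted_score(["drama"]))                     # drama
print(predicted_score(["horror", "one_word_title"]))  # horror with a one-word title
```

Because the model is additive, effects stack: a horror film with a one-word title is predicted to land a point and a half below the baseline.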
The final plot below visualizes these impactful variables.