I analyzed a movie dataset containing 562 rows using R and visualized it with the ggplot2 package.
I started by importing the dataset after changing the working directory.
library(ggplot2)
movies <- read.csv("Movie_Ratings.csv")
I then inspected the dataset using the following code:
head(movies)
tail(movies)
colnames(movies)
 | Film | Genre | Rotten.Tomatoes.Ratings.. | Audience.Ratings.. | Budget..million... | Year.of.release |
1 | (500) Days of Summer | Comedy | 87 | 81 | 8 | 2009 |
2 | 10,000 B.C. | Adventure | 9 | 44 | 105 | 2008 |
3 | 12 Rounds | Action | 30 | 52 | 20 | 2009 |
4 | 127 Hours | Adventure | 93 | 84 | 18 | 2010 |
5 | 17 Again | Comedy | 55 | 70 | 20 | 2009 |
6 | 2012 | Action | 39 | 63 | 200 | 2009 |
I then renamed the columns to make future use easier.
colnames(movies) <- c("Film", "Genre", "CriticRating", "AudienceRating", "BudgetMillions", "Year")
Then, I checked the structure of the data frame.
str(movies)
'data.frame': 562 obs. of 6 variables:
$ Film : chr "(500) Days of Summer " "10,000 B.C." "12 Rounds " "127 Hours" ...
$ Genre : chr "Comedy" "Adventure" "Action" "Adventure" ...
$ CriticRating : int 87 9 30 93 55 39 40 50 43 93 ...
$ AudienceRating: int 81 44 52 84 70 63 71 57 48 93 ...
$ BudgetMillions: int 8 105 20 18 20 200 30 32 28 8 ...
$ Year : int 2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 ...
I converted the "Genre" column (character) and the "Year" column (integer) to factors so that the plots can group by them easily, then pulled a summary of the dataset.
movies$Year <- factor(movies$Year)
movies$Genre <- factor(movies$Genre)
summary(movies)
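To illustrate what the factor conversion changes, here is a minimal sketch that uses only the six rows printed by head(movies) above. The data frame name `head_rows` is an assumption made for this example; the full dataset would of course have more levels and different counts.

```r
# The six rows shown by head(movies), with the renamed columns
head_rows <- data.frame(
  Genre = c("Comedy", "Adventure", "Action", "Adventure", "Comedy", "Action"),
  Year  = c(2009L, 2008L, 2009L, 2010L, 2009L, 2009L)
)

# Before conversion, Year is an integer and Genre is a character vector
class(head_rows$Year)
class(head_rows$Genre)

# Same conversion as applied to `movies`
head_rows$Year  <- factor(head_rows$Year)
head_rows$Genre <- factor(head_rows$Genre)

# After conversion, each distinct value is a level that ggplot2 can group by,
# and summary() reports a count per level instead of quantiles
levels(head_rows$Year)
genre_counts <- summary(head_rows$Genre)
genre_counts
```

On the full `movies` data frame, `summary(movies)` then lists how many films fall into each genre and each release year.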
Exploratory Data Analysis
ggplot(data = movies) +
geom_point(mapping = aes(x = CriticRating, y = AudienceRating, color = Genre,
size = BudgetMillions)) +
labs(title = 'Critic vs Audience Rating', subtitle = "Including Budget Representation",
caption = "Analyzed by Ashek Ag Mohamed")
ggplot(data = movies) +
geom_point(mapping = aes(x = BudgetMillions, y = AudienceRating, color = Genre,
size = BudgetMillions)) +
labs(title = 'Budget vs Audience Rating', subtitle = "Not a strong linear correlation",
caption = "Analyzed by Ashek Ag Mohamed", x = "Budget in Millions")
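The subtitle's reading of the scatter plot could be backed by a number. One option, not part of the original analysis, is Pearson's correlation coefficient from base R's `cor()`. The sketch below runs it on the six rows printed by head(movies) only (`head_rows` is a name assumed for this example), so its value says nothing about the full dataset.

```r
# The six rows shown by head(movies), with the renamed columns
head_rows <- data.frame(
  BudgetMillions = c(8, 105, 20, 18, 20, 200),
  AudienceRating = c(81, 44, 52, 84, 70, 63)
)

# Pearson correlation between budget and audience rating:
# values near 0 suggest a weak linear relationship, values near -1 or 1 a strong one
r <- cor(head_rows$BudgetMillions, head_rows$AudienceRating)
r

# On the full dataset the same call would be:
# cor(movies$BudgetMillions, movies$AudienceRating)
```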
ggplot(data = movies) +
geom_histogram(mapping = aes(x = BudgetMillions, fill = Genre), color = "Black",
binwidth = 10) +
labs(title = "Movies' Budget", caption = "Analyzed by Ashek Ag Mohamed",
x = "Budget in Millions", y = "Number of Movies")
ggplot(data = movies) +
geom_density(mapping = aes(x = BudgetMillions, fill = Genre), position = "stack") +
labs(title = "Movies' Budget", caption = "Analyzed by Ashek Ag Mohamed",
x = "Budget in Millions", y = "Density")
ggplot(data = movies) +
geom_histogram(mapping = aes(x = AudienceRating), fill = "White", color = "Blue",
binwidth = 10) +
labs(title = "Audience Rating", caption = "Analyzed by Ashek Ag Mohamed",
subtitle = "More of a normal distribution",
x = "Audience Rating", y = "Number of Movies")
ggplot(data = movies, aes(x=CriticRating, y = AudienceRating, color = Genre)) +
geom_point() + geom_smooth(fill = NA) +
labs(title = "Critic vs Audience Rating", caption = "Analyzed by Ashek Ag Mohamed",
x = "Critic Rating", y = "Audience Rating")
Here we notice that, among movies receiving the same rating from professional critics, audiences tend to give higher ratings to genres such as Action and Adventure.
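A rough numerical companion to the smoothed lines is the average gap between audience and critic ratings per genre. This is a simpler measure than comparing genres at the same critic rating, and it is not taken from the original analysis. The sketch uses only the six rows printed by head(movies); `head_rows` and `Gap` are names assumed for this example.

```r
# The six rows shown by head(movies), with the renamed columns
head_rows <- data.frame(
  Genre          = c("Comedy", "Adventure", "Action", "Adventure", "Comedy", "Action"),
  CriticRating   = c(87, 9, 30, 93, 55, 39),
  AudienceRating = c(81, 44, 52, 84, 70, 63)
)

# Positive gap: the audience rated the film higher than the critics did
head_rows$Gap <- head_rows$AudienceRating - head_rows$CriticRating

# Mean gap per genre (rows come back sorted by genre name)
gap_by_genre <- aggregate(Gap ~ Genre, data = head_rows, FUN = mean)
gap_by_genre
```

Applied to the full `movies` data frame, a clearly positive mean gap for Action and Adventure would be consistent with the observation above.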
ggplot(data = movies, aes(x=Genre, y = AudienceRating, color = Genre)) +
geom_jitter(data = movies, aes(size = BudgetMillions), alpha=0.7) + geom_boxplot(size = 1, alpha = 0.6) +
labs(title = "Boxplot of Genres by Audience Rating", caption = "Analyzed by Ashek Ag Mohamed",
x = "Genre", y = "Audience Rating", subtitle = "Points sized by budget")
The Horror genre is the most challenging one: its third quartile (75th percentile) of audience ratings is even lower than the first quartile (25th percentile) of the Thriller genre.
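The edges of each box in a box plot are the first and third quartiles, so this comparison can also be checked numerically with `quantile()`. The sketch below is only an illustration on the six rows printed by head(movies), which contain no Horror or Thriller films; `head_rows` is a name assumed for this example, and the final comparison is left as a comment to run on the full data.

```r
# The six rows shown by head(movies), with the renamed columns
head_rows <- data.frame(
  Genre          = c("Comedy", "Adventure", "Action", "Adventure", "Comedy", "Action"),
  AudienceRating = c(81, 44, 52, 84, 70, 63)
)

# Split ratings by genre and compute the 25th and 75th percentiles of each group.
# The result is a matrix with rows "25%" and "75%" and one column per genre.
quartiles <- sapply(
  split(head_rows$AudienceRating, head_rows$Genre),
  quantile,
  probs = c(0.25, 0.75)
)
quartiles

# On the full dataset, the Horror vs Thriller claim would correspond to:
# q <- sapply(split(movies$AudienceRating, movies$Genre), quantile, probs = c(0.25, 0.75))
# q["75%", "Horror"] < q["25%", "Thriller"]
```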
ggplot(data = movies, aes(x=Genre, y = CriticRating, color = Genre)) +
geom_jitter(data = movies, aes(size = BudgetMillions), alpha=0.7) + geom_boxplot(size = 1, alpha = 0.6) +
labs(title = "Boxplot of Genres by Critic Rating", caption = "Analyzed by Ashek Ag Mohamed",
x = "Genre", y = "Critic Rating", subtitle = "Points sized by budget")
# Conclusion: It is more challenging to be in the horror movie business, as the
# average rating is below 50%.
# However, statistically, Thriller and Drama movies score higher.
In contrast to the box plots of the audience ratings, the professional critic ratings show more consolidation.
Audience rating in Comedy has been increasing through the years.
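The code behind this observation is not shown here. One possible way to examine it, offered as a sketch and not as the method actually used, is to average the audience rating per release year for a single genre and plot the result. The helper `mean_rating_by_year()` and the data frame `head_rows` are hypothetical names; the demo data are the six rows printed by head(movies), which contain Comedy films from 2009 only, so a real trend needs the full `movies` data frame.

```r
# Hypothetical helper: mean audience rating per release year for one genre
mean_rating_by_year <- function(df, genre = "Comedy") {
  one_genre <- df[df$Genre == genre, ]
  aggregate(AudienceRating ~ Year, data = one_genre, FUN = mean)
}

# The six rows shown by head(movies), with the renamed columns
head_rows <- data.frame(
  Genre          = c("Comedy", "Adventure", "Action", "Adventure", "Comedy", "Action"),
  AudienceRating = c(81, 44, 52, 84, 70, 63),
  Year           = c(2009L, 2008L, 2009L, 2010L, 2009L, 2009L)
)

trend <- mean_rating_by_year(head_rows, genre = "Comedy")
trend

# Optional plot, built only if ggplot2 is installed; call print(p) to render it.
# group = 1 lets geom_line() connect the years even when Year is a factor.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = trend, aes(x = Year, y = AudienceRating, group = 1)) +
    geom_line() +
    geom_point() +
    labs(title = "Mean Audience Rating of Comedies by Year",
         x = "Year", y = "Mean Audience Rating")
}
```

With the full dataset, `mean_rating_by_year(movies)` returns one row per release year, and a rising line in the plot would support the statement above.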