Netflix Movie Data

Data Visualization of Netflix Movie Data

Hello again. My name is Ariel, and welcome to my exploration of Netflix content. The dataset contains information on over 8,500 TV shows and movies that were originally released from the 1940s through the spring of 2021. It has over 8,800 rows and 12 variables, most of them character columns. The dataset is also available on Kaggle.

library(tidyverse)
library(stringr)
Netflix_data <- readr::read_csv('data/netflix_dataset.csv')
glimpse(Netflix_data)

Line Chart

The chart below explores how the average length of movies changes over the years. In 1942, movies had the lowest mean run time, at 35 minutes. As the years passed, run times began to increase. Interestingly, movies had their highest mean run time in 1961, at nearly 201 minutes. More recently, the average length of films has plateaued. In terms of center, the median and mean durations of movies in this dataset were 113 and 112 minutes respectively. Since the two measures of center sit so close to each other on the number line, it is reasonable to conclude that outliers have little impact on the distribution.

Line_Chart_Data <- Netflix_data %>%
  select(title, type, duration, release_year) %>%
  # Keep movies only
  filter(type == "Movie") %>%
  # Strip the "min" suffix so the duration can be treated as a number
  mutate(duration = as.numeric(str_remove(duration, "min"))) %>%
  group_by(release_year) %>%
  # Mean run time per release year, ignoring any missing durations
  summarize(avg_length = mean(duration, na.rm = TRUE)) %>%
  ggplot(aes(x = release_year, y = avg_length)) +
  geom_line(color = "darkgreen") +
  xlim(c(1940, 2010)) +
  theme_classic() +
  labs(
    title = 'Average Length of Films in Minutes: 1940 - 2010.',
    subtitle = 'Netflix Movie Inventory: Spring 2021',
    caption = 'Data Source: (https://www.kaggle.com/shivamb/netflix-shows)',
    x = 'Year',
    y = 'Avg Length of Program'
  )
Line_Chart_Data
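
The mean and median run times quoted for the line chart could be checked with a short summary like the one below. This is only a sketch: it runs on a small made-up table (`toy`, a hypothetical stand-in that borrows the `type`, `duration`, and `release_year` column names from the dataset), and it assumes the two figures are taken across individual movies rather than across the yearly averages.

```r
library(dplyr)
library(stringr)

# Toy stand-in for Netflix_data; the values are invented for illustration.
toy <- tibble(
  type = c("Movie", "Movie", "Movie", "TV Show", "Movie"),
  duration = c("90 min", "100 min", "110 min", "2 Seasons", "200 min"),
  release_year = c(2000, 2000, 2001, 2001, 2001)
)

# Keep movies only and turn "90 min" into the number 90.
movie_minutes <- toy %>%
  filter(type == "Movie") %>%
  mutate(minutes = as.numeric(str_remove(duration, " min")))

# Compare the two measures of center across all movies.
center <- movie_minutes %>%
  summarize(
    mean_minutes = mean(minutes, na.rm = TRUE),
    median_minutes = median(minutes, na.rm = TRUE)
  )

center
```

In the toy table the single 200-minute movie pulls the mean well above the median, which is the kind of gap the comparison is meant to reveal. Swapping `toy` for `Netflix_data` should give the figures for the real data.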

Bar Plot

The chart below makes it clear that the Netflix dataset contains more films than TV programs. More specifically, the dataset contains 6,131 movies and 2,676 TV shows. The dataset has no films created in 1925, while 2018 had the most movies created, at 787. The median is 12 films a year, while the mean is 84 films a year. Since the two measures are so far apart on the number line, a handful of high-volume years have a drastic impact on the mean. As for TV shows, 2020 has the greatest number in the dataset, at 436 different shows, while the lowest number of TV shows created in a year is one. The median number of TV shows created each year is four.

Bar_Plot_data <- Netflix_data %>%
  select(type, release_year) %>%
  ggplot(aes(release_year, fill = type)) +
  # One bar per release year, stacked by program type
  geom_bar(stat = "count") +
  theme_classic() +
  # Pad the limits by half a year: bars are 0.9 wide, so with limits of
  # exactly 1960 and 2020 the first and last bars would be dropped
  xlim(c(1959.5, 2020.5)) +
  labs(
    title = 'Number of Programs Created by Year: 1960 - 2020.',
    subtitle = 'Netflix Movie Inventory: Spring 2021',
    caption = 'Data Source: (https://www.kaggle.com/shivamb/netflix-shows)',
    x = 'Year',
    y = 'Number of Films or TV Shows',
    fill = 'Type of Program'
  )

Bar_Plot_data
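
The per-year counts discussed for the bar plot (busiest year, median, and mean for each type of program) could be tabulated roughly as follows. Again, this is a sketch on a made-up table (`toy`, hypothetical) that reuses the `type` and `release_year` column names, and it is not necessarily how the figures in the text were produced.

```r
library(dplyr)

# Toy stand-in for Netflix_data; the values are invented for illustration.
toy <- tibble(
  type = c("Movie", "Movie", "Movie", "TV Show", "TV Show", "Movie"),
  release_year = c(2018, 2018, 2019, 2019, 2020, 2020)
)

# Number of titles for each type in each release year.
titles_per_year <- toy %>%
  count(type, release_year, name = "n_titles")

# Busiest year plus center of the yearly counts, per type.
yearly_summary <- titles_per_year %>%
  group_by(type) %>%
  summarize(
    busiest_year = release_year[which.max(n_titles)],
    max_titles = max(n_titles),
    median_titles = median(n_titles),
    mean_titles = mean(n_titles)
  )

yearly_summary
```

Note that `count()` only returns years that appear in the data, so a year with no titles of a given type is left out rather than counted as zero. That affects any minimum, median, or mean computed from the result.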

Thank you for taking the time to review my data visualizations. As mentioned above, this dataset contains information on over 8,500 movies and TV shows. We saw that the average length of movies peaked in 1961 at nearly 201 minutes. In addition, we saw that Netflix has more movies than TV shows. The highest number of movies created in a single calendar year within the Netflix dataset was 787, in 2018. For TV shows, the peak came in 2020, with 436 shows created.