Netflix! What started in 1997 as a DVD rental service has since exploded into one of the largest entertainment and media companies.
Given the large number of movies and series available on the platform, it is a perfect opportunity to flex your exploratory data analysis skills and dive into the entertainment industry. Your friend has also been brushing up on their Python skills and has taken a first crack at a CSV file containing Netflix data. They believe that the average duration of movies has been declining. Using your friend's initial research, you'll delve into the Netflix data to see if you can determine whether movie lengths are actually getting shorter and explain some of the contributing factors, if any.
You have been supplied with the dataset netflix_data.csv, along with the following table detailing the column names and descriptions:
The data
netflix_data.csv
| Column | Description |
|---|---|
| show_id | The ID of the show |
| type | Type of show |
| title | Title of the show |
| director | Director of the show |
| cast | Cast of the show |
| country | Country of origin |
| date_added | Date added to Netflix |
| release_year | Year of Netflix release |
| duration | Duration of the show in minutes |
| description | Description of the show |
| genre | Show genre |
Analysis Summary
In this notebook, we conducted an exploratory data analysis on Netflix data to determine whether the average duration of movies has been declining. We started by loading the netflix_data.csv dataset, which contains information about various shows available on Netflix.
We first examined the dataset and its columns, including show_id, type, title, director, cast, country, date_added, release_year, duration, description, and genre.
To investigate the trend in movie durations, we filtered the dataset to include only movies and created a subset called netflix_movies. We then calculated the average duration of movies over the years and plotted a line graph to visualize the trend.
Additionally, we created another subset called short_movies to focus specifically on movies with a duration of less than 60 minutes. We analyzed the distribution of short movies across different countries and genres.
Based on our analysis, we found that the average duration of movies on Netflix has been relatively stable over the years. However, there is a noticeable increase in the number of short movies, indicating a growing trend towards shorter content.
Overall, this analysis provides insights into the changing landscape of movie durations on Netflix and highlights the importance of adapting to evolving viewer preferences.
# Importing pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Reading the Netflix data from the CSV file
netflix_df = pd.read_csv("netflix_data.csv")
# Printing the unique values in the "type" column of the Netflix dataframe
print(netflix_df["type"].unique())
# Creating a subset of the Netflix dataframe by dropping rows where the "type" is "TV Show"
netflix_subset = netflix_df.drop(netflix_df[netflix_df["type"]=="TV Show"].index)
# This can also be done by filtering directly, assuming the movie rows are labeled "Movie"
# (check the unique values printed above):
# netflix_subset1 = netflix_df[netflix_df["type"] == "Movie"]
# Creating a new dataframe "netflix_movies" with selected columns from the subset
netflix_movies = netflix_subset[["title","country","genre","release_year","duration"]]
# Creating a new dataframe "short_movies" by filtering the "netflix_movies" dataframe for movies with duration less than 60 minutes
short_movies = netflix_movies[netflix_movies["duration"]<60]
# This can also be written with attribute access:
# short_movies1 = netflix_movies[netflix_movies.duration < 60]
# Printing information about the "short_movies" dataframe
short_movies.info()
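The summary mentions looking at how short movies are distributed across countries and genres, but no cell in this notebook shows that step. A minimal sketch of one way to do it with value_counts(), using a small made-up DataFrame (toy_movies and all of its rows are illustrative assumptions, not data from netflix_data.csv):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from netflix_data.csv
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "country": ["United States", "India", "United States", "Spain", "United States"],
    "genre": ["Documentaries", "Dramas", "Children", "Stand-Up", "Documentaries"],
    "duration": [45, 130, 30, 58, 95],
})

# Same 60-minute threshold as short_movies
toy_short = toy_movies[toy_movies["duration"] < 60]

# Number of short movies per country and per genre, most frequent first
short_by_country = toy_short["country"].value_counts()
short_by_genre = toy_short["genre"].value_counts()

print(short_by_country)
print(short_by_genre)
```

On the real data, the same two calls on short_movies["country"] and short_movies["genre"] would give the corresponding counts.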
short_movies.sort_values("duration", ascending=False)
Regular for loop: appending colors to a list
# Creating an empty list to store the colors
colors = []
# Iterating over each genre in the "genre" column of the "netflix_movies" dataframe
for genre in netflix_movies["genre"]:
    # Checking if the genre is "Children"
    if genre == "Children":
        # Appending "green" to the "colors" list if the genre is "Children"
        colors.append("green")
    # Checking if the genre is "Documentaries"
    elif genre == "Documentaries":
        # Appending "blue" to the "colors" list if the genre is "Documentaries"
        colors.append("blue")
    # Checking if the genre is "Stand-Up"
    elif genre == "Stand-Up":
        # Appending "salmon" to the "colors" list if the genre is "Stand-Up"
        colors.append("salmon")
    # Executed if none of the above conditions are met
    else:
        # Appending "lightgrey" to the "colors" list for all other genres
        colors.append("lightgrey")
For loop with the .iterrows() method
# Initialize an empty list called colors to store our different color values.
colors = []
# Use a for loop to iterate through the netflix_movies DataFrame's rows and append colors to your colors list based on the following conditions:
for lab, row in netflix_movies.iterrows():
    if row['genre'] == "Children":
        colors.append("red")
    elif row['genre'] == "Documentaries":
        colors.append("blue")
    elif row['genre'] == "Stand-Up":
        colors.append("green")
    else:
        colors.append("black")
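The same genre-to-color assignment can also be written without an explicit loop. The sketch below is one possible alternative, not part of the original exercise: it maps genres through a dictionary with Series.map and fills unmatched genres with a default color. The names genre_colors, toy_genres and toy_colors are hypothetical, and the genre values are made up for illustration.

```python
import pandas as pd

# Same color scheme as the .iterrows() loop above
genre_colors = {"Children": "red", "Documentaries": "blue", "Stand-Up": "green"}

# Made-up genre column standing in for netflix_movies["genre"]
toy_genres = pd.Series(["Dramas", "Children", "Stand-Up", "Documentaries", "Comedies"])

# Genres missing from the dictionary become NaN, which fillna replaces with "black"
toy_colors = toy_genres.map(genre_colors).fillna("black").tolist()

print(toy_colors)  # ['black', 'red', 'green', 'blue', 'black']
```

On the full dataset this avoids building a row object for every movie, which is what makes .iterrows() comparatively slow.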
# Print the first 10 values of your colors list to inspect the results.
print(colors[:10])
fig = plt.figure(figsize=(15,8))
# Creating a scatter plot with release year on the x-axis and duration on the y-axis
plt.scatter(netflix_movies["release_year"], netflix_movies["duration"], c=colors, marker='s')
# Adding labels to the x-axis and y-axis
plt.xlabel("Release year")
plt.ylabel("Duration (min)")
# Adding a title to the plot
plt.title("Movie Duration by Year of Release")
# Displaying the plot
plt.show()
answer = "maybe"
The scatter plot alone does not show how much movie durations have changed over time. To quantify the trend, we can calculate the average duration for each release year and compare it with the previous year.
# Calculate the average duration for each year
average_duration = netflix_movies.groupby('release_year')['duration'].mean()
# Calculate the difference in average duration between consecutive years
duration_decrease = average_duration.diff()
# Display the average duration and the decrease in duration
average_duration, duration_decrease
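To make the groupby and diff steps concrete, here is a small sketch on made-up numbers (the toy DataFrame is an assumption for illustration, not part of the dataset). A negative value in the diff output means the average duration fell relative to the previous release year, and the first year has no predecessor, so its difference is NaN.

```python
import pandas as pd

# Made-up durations for illustration only; not taken from netflix_data.csv
toy = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019, 2020],
    "duration": [100, 110, 95, 105, 90],
})

# Mean duration per release year
yearly_mean = toy.groupby("release_year")["duration"].mean()

# Change relative to the previous year in the index
yearly_change = yearly_mean.diff()

print(yearly_mean.tolist())    # [105.0, 100.0, 90.0]
print(yearly_change.tolist())  # [nan, -5.0, -10.0]
```

Note that diff() compares consecutive rows of the grouped result, so if a release year is missing from the data, the difference spans more than one calendar year.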
# Calculate the average duration for each year
average_duration = netflix_movies.groupby('release_year')['duration'].mean()
# Calculate the difference in average duration between consecutive years
duration_decrease = average_duration.diff()
# Plotting the average duration
plt.figure(figsize=(12, 8))
plt.plot(average_duration.index, average_duration.values, marker='o')
# Adding labels to the x-axis and y-axis
plt.xlabel('Release Year')
plt.ylabel('Average Duration (min)')
# Adding a title to the plot
plt.title('Average Movie Duration by Year of Release')
# Displaying the plot
plt.show()
To analyze movie duration by genre, we can group the movies by genre and calculate the average duration for each group.
# Calculate the average duration for each genre
average_duration_by_genre = netflix_movies.groupby('genre')['duration'].mean()
# Display the average duration by genre
average_duration_by_genre
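Average durations can be misleading for genres with only a handful of titles, so it may help to show the number of movies next to the mean and to sort the result. A minimal sketch on made-up rows (toy_genre_movies and genre_summary are hypothetical names, and the values are not from netflix_data.csv):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from netflix_data.csv
toy_genre_movies = pd.DataFrame({
    "genre": ["Dramas", "Dramas", "Children", "Stand-Up", "Stand-Up", "Documentaries"],
    "duration": [120, 100, 70, 60, 66, 80],
})

# Mean duration and number of movies per genre, longest genres first
genre_summary = (
    toy_genre_movies.groupby("genre")["duration"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(genre_summary)
```

If a chart is wanted, genre_summary["mean"].plot.barh() is one way to draw the sorted averages as a horizontal bar chart.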