
This dataset comprises Netflix's weekly top 10 lists for the most-watched TV shows and films worldwide. The data spans from June 28, 2021, to August 27, 2023.

Objective: Determine if there's a correlation between content duration and its likelihood of making it to the top 10 lists.

# Import your libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

global_top_10 = pd.read_csv("netflix_top10.csv", index_col=0)
global_top_10.head()
countries_top_10 = pd.read_csv("netflix_top10_country.csv", index_col=0)
countries_top_10.head()

After reading the CSV files, we take a quick look at the structure and summary statistics of the data.

global_top_10.info()
# Statistical summary of the numeric columns
global_top_10.describe()
global_top_10.columns
global_top_10.shape

Before analyzing the data, we plot a histogram to see what its distribution looks like.

# plot histogram using pandas dataframe plot
global_top_10.plot.hist(bins=10)
plt.title("Netflix Global Top Ten")
plt.legend()

The distribution shown above will not be accurate because the data contains missing values, so we need to handle them first.

# Count the number of missing values in each column
global_top_10.isna().sum()
# Find the five percent threshold
threshold = len(global_top_10) * 0.05
print(threshold)
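
One common way to use a five percent threshold like this is to drop rows only for columns that have a small number of missing values, and to leave heavily incomplete columns for a separate decision (imputing or dropping the column). The sketch below illustrates that idea on a small made-up table rather than the real dataset; the `toy` frame and its column names (`show_title`, `runtime`, `season_title`) are assumptions for illustration, and this is not necessarily the cleaning step the original analysis goes on to use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for global_top_10; the column names and values are hypothetical.
toy = pd.DataFrame({
    "show_title": [f"title_{i}" for i in range(40)],
    "runtime": [1.5] * 39 + [np.nan],           # 1 missing value out of 40
    "season_title": ["S1"] * 10 + [None] * 30,  # 30 missing values out of 40
})

# Same rule as above: five percent of the number of rows
threshold = len(toy) * 0.05

# Missing values per column
missing_counts = toy.isna().sum()

# Columns that have some missing values, but no more than the threshold
cols_to_drop_rows = missing_counts[
    (missing_counts > 0) & (missing_counts <= threshold)
].index.tolist()

# Drop only the rows that are missing a value in one of those columns
cleaned = toy.dropna(subset=cols_to_drop_rows)

print(cols_to_drop_rows)
print(len(toy), "->", len(cleaned))
print(cleaned.isna().sum())
```

In this toy example only `runtime` falls under the threshold, so a single row is removed, while `season_title` keeps its missing values and would need to be handled another way.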