Netflix Top 10

Netflix Top 10: Analyzing Weekly Chart-Toppers

This dataset comprises Netflix's weekly top 10 lists for the most-watched TV shows and films worldwide. The data spans from June 28, 2021, to August 27, 2023.

This workspace is pre-loaded with two CSV files.

netflix_top10.csv contains columns such as show_title, category, weekly_rank, and several view metrics.
netflix_top10_country.csv has information about a show or film's performance by country, contained in the columns cumulative_weeks_in_top_10 and weekly_rank.

We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Netflix

Explore this dataset

To get you started with your analysis...

Combine the different categories of top 10 lists into a single weekly top 10 list spanning all categories.
Are there consistent trends or patterns in the content format (TV, film) that makes it to the top 10 over different weeks or months?
Explore your country's top 10 trends. Are there unique preferences or regional factors that set your country's list apart from others?
Visualize popularity ranking over time through time series plots.
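As a starting point for the first guiding question, here is a minimal sketch of merging the per-category lists into one weekly list. It uses a small made-up frame in place of netflix_top10.csv and assumes the columns week, category, show_title, and weekly_hours_viewed. Ranking the merged list by weekly_hours_viewed is one possible choice, not something the dataset prescribes, and the helper name combined_weekly_top10 is hypothetical.

```python
import pandas as pd

# Toy stand-in for netflix_top10.csv (values are made up for illustration).
toy = pd.DataFrame({
    "week": ["2023-08-27"] * 4 + ["2023-08-20"] * 4,
    "category": ["Films", "Films", "TV", "TV"] * 2,
    "show_title": ["A", "B", "C", "D", "A", "E", "C", "F"],
    "weekly_hours_viewed": [50, 20, 90, 40, 30, 10, 70, 60],
})


def combined_weekly_top10(df, n=10):
    """Rank all titles within each week by hours viewed, ignoring category."""
    out = df.copy()
    # method="first" breaks ties by row order so ranks stay unique.
    out["combined_rank"] = (
        out.groupby("week")["weekly_hours_viewed"]
           .rank(method="first", ascending=False)
           .astype(int)
    )
    return (out[out["combined_rank"] <= n]
            .sort_values(["week", "combined_rank"])
            .reset_index(drop=True))


# Keep the top 3 per week here only because the toy data is tiny.
combined = combined_weekly_top10(toy, n=3)
print(combined)
```

With the real file loaded using index_col=0, the date is the index, so calling reset_index() first would likely be needed before grouping on a week column.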

🔍 Scenario: Understanding the Impact of Content Duration on Netflix's Top 10 Lists

This scenario helps you develop an end-to-end project for your portfolio.

Background: As a data scientist at Netflix, you're tasked with exploring the dataset containing weekly top 10 lists of the most-watched TV shows and films. For example, you may be asked how content duration relates to ranking over time. Answering this question can help content creators and strategists optimize their offerings for the platform.

Objective: Determine if there's a correlation between content duration and its likelihood of making it to the top 10 lists.

import pandas as pd

global_top_10 = pd.read_csv("netflix_top10.csv", index_col=0)
global_top_10.head()

countries_top_10 = pd.read_csv("netflix_top10_country.csv", index_col=0)
countries_top_10.head()

import matplotlib.pyplot as plt
import seaborn as sns

global_top_10.columns

# Create a correlation matrix
corr_metrics = global_top_10[['weekly_rank', 'weekly_hours_viewed', 'cumulative_weeks_in_top_10']].corr()
corr_metrics.style.background_gradient()
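Because weekly_rank is an ordinal variable, a rank-based coefficient may be worth comparing against the default Pearson coefficient. The sketch below uses made-up numbers, not the real data, and only illustrates the method argument of DataFrame.corr.

```python
import pandas as pd

# Made-up values for illustration; column names follow netflix_top10.csv.
toy = pd.DataFrame({
    "weekly_rank": [1, 2, 3, 4, 5],
    "weekly_hours_viewed": [1000, 300, 200, 150, 100],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

In this toy frame the hours fall steadily as rank worsens but not in a straight line, so the Spearman coefficient is stronger than the Pearson one. Whether the same holds for the real data is something to check rather than assume.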

def null_analysis(df):
    """
    Analyze null values in a DataFrame.

    Parameters:
        df (pd.DataFrame): Input dataframe for analysis.

    Returns:
        pd.DataFrame: DataFrame containing columns' names, number of null values,
                      and the ratio of null values to the total length of the dataframe.
    """
    # Count of null values for each column
    null_count = df.isna().sum()

    # Ratio of null values to the total length for each column
    null_ratio = null_count / len(df)

    # Creating the result dataframe
    result = pd.DataFrame({
        'Column Name': null_count.index,
        'Null Values Count': null_count.values,
        'Null Ratio': null_ratio.values.round(2)
    })

    return result


null_analysis(global_top_10)

The dataset does not record a duration for each title, and the correlations among the numerical columns offer little insight. As a working assumption, we therefore treat category as a rough proxy for duration, taking a TV show to be typically shorter than a film. To compare the categories, we'll compute, for each month, the mean weekly hours viewed and the median cumulative weeks each category spent in the top 10.
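One way to make the TV-versus-film proxy explicit is to collapse category into a coarse format column. The sketch assumes that category labels begin with "Films" or "TV" (for example "Films (English)"); it is worth checking global_top_10['category'].unique() before relying on this. The content_format column name and the toy values are hypothetical.

```python
import pandas as pd

# Toy rows; the category labels are an assumption about the real file.
toy = pd.DataFrame({
    "category": ["Films (English)", "TV (Non-English)",
                 "TV (English)", "Films (Non-English)"],
    "weekly_hours_viewed": [40, 80, 120, 20],
})

# Map each category to a coarse format based on its leading word.
toy["content_format"] = (
    toy["category"].str.split().str[0].map({"Films": "Film", "TV": "TV"})
)

# Labels matching neither prefix become NaN, which flags a wrong assumption.
assert toy["content_format"].notna().all()

format_means = toy.groupby("content_format")["weekly_hours_viewed"].mean()
print(format_means)
```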

Aggregating By Month

# Extract the month and year and assign to the 'month' column
global_top_10['month'] = pd.to_datetime(global_top_10.index).strftime('%Y-%m')

# Group by 'month' and 'category' and calculate the mean and median values
aggregated = (global_top_10.groupby(['month', 'category'])
                           .agg(hours_viewed_mean=('weekly_hours_viewed', 'mean'),
                                cumulative_weeks_median=('cumulative_weeks_in_top_10', 'median'))
                           .reset_index())

aggregated.head()

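# Create a correlation matrix from the numeric columns only;
# 'month' and 'category' are strings, and recent pandas versions
# raise an error if they are passed to corr()
corr_metrics = aggregated.select_dtypes('number').corr()
corr_metrics.style.background_gradient()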

Average Hours Viewed per Month

# Set the figure size
plt.figure(figsize=(12, 6))

sns.set_style("darkgrid")

# Create the lineplot
sns.lineplot(data=aggregated, x='month', y='hours_viewed_mean', hue='category')

# Rotate x-axis labels for better visibility
plt.xticks(rotation=90)

# Add title
plt.title("Average Hours Viewed per Month")

# Optionally, to ensure better layout and avoid overlapping, especially after rotating ticks
plt.tight_layout()

# Show the plot
plt.show()
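To read the same comparison as numbers rather than lines, the long-format monthly table can be pivoted so that each category becomes a column. The sketch uses a toy frame shaped like aggregated, with made-up values; the "Films" and "TV" labels and the tv_to_film_ratio column are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the same shape as `aggregated` (values are made up).
toy_aggregated = pd.DataFrame({
    "month": ["2023-07", "2023-07", "2023-08", "2023-08"],
    "category": ["Films", "TV", "Films", "TV"],
    "hours_viewed_mean": [10.0, 30.0, 20.0, 50.0],
})

# One row per month, one column per category.
wide = toy_aggregated.pivot(index="month", columns="category",
                            values="hours_viewed_mean")

# Ratio of TV to film viewing per month.
wide["tv_to_film_ratio"] = wide["TV"] / wide["Films"]
print(wide)
```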

