Netflix Top 10
  • Netflix Top 10: Analyzing Weekly Chart-Toppers

    This dataset comprises Netflix's weekly top 10 lists for the most-watched TV shows and films worldwide. The data spans from June 28, 2021, to August 27, 2023.

    This workspace is pre-loaded with two CSV files.

    • netflix_top10.csv contains columns such as show_title, category, weekly_rank, and several view metrics.
    • netflix_top10_country.csv has information about a show or film's performance by country, contained in the columns cumulative_weeks_in_top_10 and weekly_rank.

    We've added some guiding questions for analyzing this exciting dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

    Source: Netflix

    Explore this dataset

    To get you started with your analysis...

    1. Combine the different categories of top 10 lists in a single weekly top 10 list spanning all categories
    2. Are there consistent trends or patterns in the content format (tv, film) that make it to the top 10 over different weeks or months?
    3. Explore your country's top 10 trends. Are there unique preferences or regional factors that set your country's list apart from others?
    4. Visualize popularity ranking over time through time series plots
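
One possible approach to the first question is sketched below: rank every title within a week by hours viewed, regardless of category, and keep the top entries. The sample rows, the `week` column name, the category labels, and the `combined_top_n` helper are all hypothetical; only `show_title`, `category`, `weekly_rank`, and `weekly_hours_viewed` come from the dataset description above.

```python
import pandas as pd

# Hypothetical miniature of netflix_top10.csv: two weeks, two categories.
# The 'week' column name and all values are invented for illustration.
sample = pd.DataFrame({
    "week": ["2023-08-20"] * 4 + ["2023-08-27"] * 4,
    "category": ["Films", "Films", "TV", "TV"] * 2,
    "show_title": ["Film A", "Film B", "Show A", "Show B",
                   "Film A", "Film C", "Show A", "Show C"],
    "weekly_rank": [1, 2, 1, 2] * 2,
    "weekly_hours_viewed": [50, 30, 80, 20, 40, 45, 60, 70],
})

def combined_top_n(df, n=10):
    """Rank all titles within each week by hours viewed, ignoring category."""
    out = df.copy()
    # method="first" breaks ties by row order so every rank is unique
    out["overall_rank"] = (
        out.groupby("week")["weekly_hours_viewed"]
           .rank(method="first", ascending=False)
           .astype(int)
    )
    return (out[out["overall_rank"] <= n]
            .sort_values(["week", "overall_rank"])
            .reset_index(drop=True))

# n=3 keeps the example small; use n=10 on the real data
combined = combined_top_n(sample, n=3)
print(combined[["week", "overall_rank", "show_title", "category"]])
```

Ranking by hours viewed is only one way to merge the lists, and it tends to favor longer content, so the combined list should be read with that bias in mind.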

    🔍 Scenario: Understanding the Impact of Content Duration on Netflix's Top 10 Lists

    This scenario helps you develop an end-to-end project for your portfolio.

    Background: As a data scientist at Netflix, you're tasked with exploring the dataset containing weekly top 10 lists of the most-watched TV shows and films. In particular, you need to find out how content duration relates to ranking over time. The answer can help content creators and strategists optimize their offerings for the platform.

    Objective: Determine if there's a correlation between content duration and its likelihood of making it to the top 10 lists.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load both files, using the first column of each as the index
    global_top_10 = pd.read_csv("netflix_top10.csv", index_col=0)
    countries_top_10 = pd.read_csv("netflix_top10_country.csv", index_col=0)

    # Create a correlation matrix
    corr_metrics = global_top_10[['weekly_rank', 'weekly_hours_viewed', 'cumulative_weeks_in_top_10']].corr()
    def null_analysis(df):
        """
        Analyze null values in a DataFrame.

        Args:
            df (pd.DataFrame): Input dataframe for analysis.

        Returns:
            pd.DataFrame: DataFrame containing columns' names, number of null values,
                          and the ratio of null values to the total length of the dataframe.
        """
        # Count of null values for each column
        null_count = df.isna().sum()
        # Ratio of null values to the total length for each column
        null_ratio = null_count / len(df)
        # Creating the result dataframe
        result = pd.DataFrame({
            'Column Name': null_count.index,
            'Null Values Count': null_count.values,
            'Null Ratio': null_ratio.values.round(2)
        })
        return result
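
The correlation matrix above relies on the default of `DataFrame.corr`, which is Pearson's coefficient. Since `weekly_rank` is an ordinal position rather than a measured quantity, a rank-based coefficient such as Spearman's may be worth checking as well. The snippet below is a minimal sketch on invented numbers, not a result from the dataset.

```python
import pandas as pd

# Hypothetical sample with the three columns used in the correlation matrix.
# All values are invented for illustration.
sample = pd.DataFrame({
    "weekly_rank": [1, 2, 3, 4, 5],
    "weekly_hours_viewed": [90, 70, 65, 30, 10],
    "cumulative_weeks_in_top_10": [3, 1, 5, 2, 1],
})

# Pearson (the default) measures linear association
pearson = sample.corr()

# Spearman measures monotonic association by correlating the ranks of the values
spearman = sample.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

In this toy sample, hours viewed fall steadily as rank worsens, so Spearman reports a perfect negative association while Pearson is slightly weaker because the decline is not linear.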

    The dataset does not include the duration of individual titles, and the correlations among the numerical columns don't offer much insight. As a workaround, we treat the category as a rough proxy for duration, assuming that a TV show is typically shorter than a film. To compare the categories, we'll analyze, for each month, the average hours viewed and the median (the central value) of the cumulative weeks each category remained in the top 10.

    Aggregating By Month

    # Extract the month and year and assign to the 'month' column
    global_top_10['month'] = pd.to_datetime(global_top_10.index).strftime('%Y-%m')
    # Group by 'month' and 'category' and calculate the mean and median values
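    aggregated = (global_top_10.groupby(['month', 'category'])
                               .agg(hours_viewed_mean=('weekly_hours_viewed', 'mean'),
                                    cumulative_weeks_median=('cumulative_weeks_in_top_10', 'median'))
                               .reset_index())  # keep 'month' and 'category' as regular columns
    # Create a correlation matrix from the two numeric columns
    corr_metrics = aggregated[['hours_viewed_mean', 'cumulative_weeks_median']].corr()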

    Average Hours Viewed per Month

    # Set the figure size
    plt.figure(figsize=(12, 6))
    # Create the lineplot
    sns.lineplot(data=aggregated, x='month', y='hours_viewed_mean', hue='category')
    # Rotate x-axis labels for better visibility (the 45-degree angle is an arbitrary choice)
    plt.xticks(rotation=45)
    # Add title
    plt.title("Average Hours Viewed per Month")
    # Optionally, to ensure better layout and avoid overlapping, especially after rotating ticks
    plt.tight_layout()
    # Show the plot
    plt.show()