Tracking Internet Access Across Time
  • How Much of the World Has Access to the Internet?

    📖 Background

    You work for a policy consulting firm. One of the firm's principals is preparing to give a presentation on the state of internet access in the world. She needs your help answering some questions about internet accessibility across the world.

    💾 The data

    The research team compiled the following tables (source):
    internet
    • "Entity" - The name of the country, region, or group.
    • "Code" - Unique id for the country (null for other entities).
    • "Year" - Year from 1990 to 2019.
    • "Internet_Usage" - The share of the entity's population who have used the internet in the last three months.
    people
    • "Entity" - The name of the country, region, or group.
    • "Code" - Unique id for the country (null for other entities).
    • "Year" - Year from 1990 to 2020.
    • "Users" - The number of people who have used the internet in the last three months for that country, region, or group.
    broadband
    • "Entity" - The name of the country, region, or group.
    • "Code" - Unique id for the country (null for other entities).
    • "Year" - Year from 1998 to 2020.
    • "Broadband_Subscriptions" - The number of fixed subscriptions to high-speed internet at downstream speeds >= 256 kbit/s for that country, region, or group.
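
Because "Code" is null for regions and other groups in all three tables, country-level questions need those rows filtered out first. The following is a minimal sketch under that assumption. The rows are made-up placeholders, not values from the dataset, and `countries_only` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows for illustration only; NOT real entries from the dataset.
toy = pd.DataFrame({
    "Entity": ["Country A", "Country B", "Some Region"],
    "Code": ["AAA", "BBB", None],
    "Year": [2019, 2019, 2019],
})

def countries_only(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose Code is present, i.e. drop regions and groups."""
    return df[df["Code"].notna()].reset_index(drop=True)

country_rows = countries_only(toy)
print(country_rows["Entity"].tolist())
```

On the real tables this would be called as `countries_only(internet)`. It is still worth inspecting the remaining entities, since some sources assign a code to aggregates such as a world total.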

    Acknowledgments: Max Roser, Hannah Ritchie, and Esteban Ortiz-Ospina (2015) - "Internet." OurWorldInData.org.

    💪 Challenge

    Create a report to answer the principal's questions. Include:

    1. What are the top 5 countries with the highest internet use (by population share)?
    2. How many people had internet access in those countries in 2019?
    3. What are the top 5 countries with the highest internet use for each of the following regions: 'Middle East & North Africa', 'Latin America & Caribbean', 'East Asia & Pacific', 'South Asia', 'North America', 'Europe & Central Asia'?
    4. Create a visualization of those regions' internet usage over time.
    5. What are the 5 countries with the most internet users?
    6. What is the correlation between internet usage (population share) and broadband subscriptions for 2019?
    7. Summarize your findings.

    Note: This is how the World Bank defines the different regions.
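
Question 6 asks for a correlation without naming a method. One possible reading is Pearson's r between the two columns after pairing rows by country and year. The sketch below assumes that reading, assumes the usage column is spelled `Internet_Usage` as in the notebook code, and uses made-up numbers rather than dataset values.

```python
import pandas as pd

# Toy rows for illustration only; NOT real figures from the dataset.
usage = pd.DataFrame({
    "Code": ["AAA", "BBB", "CCC"],
    "Year": [2019, 2019, 2019],
    "Internet_Usage": [20.0, 50.0, 80.0],
})
subs = pd.DataFrame({
    "Code": ["AAA", "BBB", "CCC"],
    "Year": [2019, 2019, 2019],
    "Broadband_Subscriptions": [100, 250, 400],
})

# Pair each country's usage share with its subscription count for the same year.
paired = usage.merge(subs, on=["Code", "Year"], how="inner")

# Series.corr computes Pearson's r by default.
r = paired["Internet_Usage"].corr(paired["Broadband_Subscriptions"])
print(round(r, 2))
```

On the real tables, rows with a null `Code` should be dropped before merging, because pandas matches missing keys to each other. Subscriptions are absolute counts while usage is a share, so country size influences the result. Normalizing subscriptions per capita is one option to consider.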

    Imports

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    sns.set_style("whitegrid")
    
    pd.set_option('display.float_format', '{:,.2f}'.format)

    # Read the broadband table
    broadband = pd.read_csv('data/broadband.csv')
    
    # Read the internet table
    internet = pd.read_csv('data/internet.csv')
    
    # Read the people table
    people = pd.read_csv('data/people.csv')
    
    # Read the World Bank region classification table
    region_df = pd.read_excel('data/CLASS.xlsx')

    Q1. What are the top 5 countries with the highest internet use (by population share)?

    year = 2019
    
    # Keep entities whose latest observation is from `year`, drop aggregates (null Code)
    latest = internet.groupby('Entity')['Year'].max().reset_index()
    latest = latest[latest['Year'] == year]
    df = (pd.merge(internet, latest, on = ['Entity', 'Year'], how = 'inner')
            .dropna(subset = ['Code'])
            .sort_values(by = "Internet_Usage", ascending = False)
            .reset_index(drop = True).head())
    df
    plt.figure(figsize = (16, 4))
    sns.barplot(data = df, y = 'Entity', x = 'Internet_Usage')
    plt.tight_layout()
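
A more direct way to express the same kind of selection is to filter to the target year and use `nlargest`. This is a sketch of an alternative, not the notebook's method. It assumes only rows for the chosen year matter and that aggregates carry a null `Code`. The helper name `top_n_by_share` is hypothetical and the data are made up.

```python
import pandas as pd

# Toy rows for illustration only; NOT real figures from the dataset.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "C", "Region X"],
    "Code": ["AAA", "AAA", "BBB", "CCC", None],
    "Year": [2018, 2019, 2019, 2019, 2019],
    "Internet_Usage": [60.0, 65.0, 90.0, 40.0, 99.0],
})

def top_n_by_share(df: pd.DataFrame, year: int, n: int = 5) -> pd.DataFrame:
    """Return the n country rows with the highest usage share in `year`."""
    in_year = df[(df["Year"] == year) & df["Code"].notna()]
    return in_year.nlargest(n, "Internet_Usage").reset_index(drop=True)

top2 = top_n_by_share(toy, year=2019, n=2)
print(top2[["Entity", "Internet_Usage"]])
```

In the toy table, "Region X" has the highest share but is excluded because its code is missing, which is the behavior a country ranking needs.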

    Q2. How many people had internet access in those countries in 2019?

    countries = df['Entity'].values
    # 2019 user counts for the Q1 countries
    users_2019 = people[(people['Entity'].isin(countries)) & (people['Year'] == 2019)]
    # Attach Users to the Q1 table; Code identifies each country
    tmp = pd.merge(df, users_2019[['Code', 'Users']], on = 'Code', how = 'inner')
    tmp
    plt.figure(figsize = (10, 8))
    ax = sns.scatterplot(data = tmp, x = 'Users', y = 'Internet_Usage')
    
    # Label each point with its country name, nudged slightly upward
    for row in tmp.itertuples():
        ax.text(row.Users, row.Internet_Usage + 0.1, row.Entity,
                horizontalalignment = 'left',
                size = 'small', color = 'black', weight = 'semibold')
    
    plt.ylim(0, 110)
    plt.title("Internet users vs. internet usage (% of total population)")
    plt.tight_layout()

    Q3. What are the top 5 countries with the highest internet use for each of the following regions: 'Middle East & North Africa', 'Latin America & Caribbean', 'East Asia & Pacific', 'South Asia', 'North America', 'Europe & Central Asia'?

    
    regions = sorted(region_df['Region'].dropna().unique())
    
    fig, axs = plt.subplots(2, 4, figsize = (16, 8), sharex = True)
    axs = axs.flatten()
    
    # Latest available observation per entity (may predate 2019 for some entities)
    max_years = internet.groupby('Entity')['Year'].max().reset_index()
    # Attach each country's region, then keep only its latest-year row
    tmp = pd.merge(internet, region_df, on = 'Code', how = 'left')
    tmp = pd.merge(tmp, max_years, on = ['Entity', 'Year'], how = 'inner')
    tmp = tmp[['Region','Entity','Code','Year','Internet_Usage']].sort_values(by = "Internet_Usage", ascending = False)
    
    # One panel per region, showing its five highest-usage countries
    for ax, region in zip(axs, regions):
        tmp2 = tmp[tmp['Region'] == region].head(5)
        sns.barplot(data = tmp2, x = 'Internet_Usage', y = 'Entity', ax = ax)
        ax.set_title(region)
        ax.set_xlabel("% Internet Usage")
    
    # Remove any unused axes in the 2x4 grid
    for ax in axs[len(regions):]:
        ax.remove()
    
    plt.tight_layout()
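
The per-region ranking can also be produced as a single table rather than one slice per loop iteration. The sketch below uses `groupby(...).head(n)` on a frame sorted by usage. The helper name `top_n_per_region`, the region labels, and the numbers are all hypothetical placeholders.

```python
import pandas as pd

# Toy rows for illustration only; NOT real figures or real region assignments.
toy = pd.DataFrame({
    "Region": ["R1", "R1", "R1", "R2", "R2"],
    "Entity": ["A", "B", "C", "D", "E"],
    "Internet_Usage": [10.0, 30.0, 20.0, 50.0, 70.0],
})

def top_n_per_region(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n highest-usage rows within each region, as one table."""
    ranked = df.sort_values("Internet_Usage", ascending=False)
    # groupby(...).head(n) keeps the first n rows of each group in sorted order.
    top = ranked.groupby("Region").head(n)
    return top.sort_values(
        ["Region", "Internet_Usage"], ascending=[True, False]
    ).reset_index(drop=True)

top2 = top_n_per_region(toy, n=2)
print(top2)
```

A table in this shape is convenient for reporting the answer to Q3 in text form alongside the bar charts.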