Internet Access Patterns around the World, 1994-2020
Introduction
This project examines the patterns of internet use around the world. The analysis was guided by the provided questions, with some additional observations combining elements used for different segments of the analysis.
Provided Questions:
- What are the top 5 countries with the highest internet use (by population share)?
- How many people had internet access in those countries in 2019?
- What are the top 5 countries with the highest internet use for each of the following regions: 'Africa Eastern and Southern', 'Africa Western and Central', 'Latin America & Caribbean', 'East Asia & Pacific', 'South Asia', 'North America', 'European Union'?
- Create a visualization for those five regions' internet usage over time.
- What are the 5 countries with the most internet users?
- What is the correlation between internet usage (population share) and broadband subscriptions for 2019?
- Summarize your findings.
This project was completed as part of the DataCamp XP Challenge 2022, which was open for the period Nov 9-13, 2022.
Provided data are from Our World in Data. Source credit: Max Roser, Hannah Ritchie, and Esteban Ortiz-Ospina (2015) - "Internet." OurWorldInData.org.
Analysis Approach
The data tables mix country-level and region-level rows. Answering all of the questions requires filtering by country, by region, and by country within region. Q6 requires merging two data tables to obtain the target combination of data. A list of unique country & region combinations is needed to analyze the highest country utilization in each region in Q3 and to enhance the plot in Q6. This country-to-region correspondence table was not available within the provided data, but may be populated from external sources.
Approach to individual components, including handling of ambiguities in the questions, is described with each question.
# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load data tables
internet = pd.read_csv('data/internet.csv')
people = pd.read_csv('data/people.csv')
broadband = pd.read_csv('data/broadband.csv')
# Preview tables provided
print(internet.head())
print(people.head())
print(broadband.head())
print('Shape of internet: ' + str(internet.shape))
print('Shape of people: ' + str(people.shape))
print('Shape of broadband: ' + str(broadband.shape))
All 3 provided tables contain data labeled by location (Entity & Code) and Year. The final data column is what differs among the 3 data sets. Per the information brief provided:
- Internet_Usage : Percentage of the location's population that has used the internet in the last 3 months
- Users : Number of individuals from a location who have used the internet in the last 3 months
- Broadband_Subscriptions : The number of fixed (wired) subscriptions to high-speed internet at downstream speeds >= 256 kbit/s per 100 people in the location
The three tables do NOT contain the same number of observations, as evidenced by the difference in length.
1. What are the top 5 countries with the highest internet use (by population share)?
We can determine the relative internet utilization rates of countries using data in the internet table. These data are aggregated and normalized by population to enable reasonable comparison among countries by rates of use.
The Internet_Usage column is the percentage of the country's population that used the internet within the last 3 months, reported for the year indicated. Because it is already expressed on a per-population basis, the maximum of this value per country should reflect that country's highest utilization rate. This approach assumes that the levels of access are increasing or consistent year to year, without sudden drops from events like political instability.
# Filter out regions, which have null country codes, leaving countries
# (subset is passed as a list so this also runs on pandas versions before 1.5)
internet_country = internet.dropna(subset=['Code'])
# Select the maximum usage for each country and sort descending to select top 5
max_country_use = internet_country.groupby('Entity').Internet_Usage.max().sort_values(ascending = False)
print(max_country_use.head(5))
The highest internet usage per population is in Bahrain, Qatar, Kuwait, Liechtenstein, and the United Arab Emirates. In all these nations, at least 99% of the population had used the internet in the last 3 months.
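The assumption behind ranking by each country's maximum (that usage does not fall after its peak) can be checked by comparing the peak with the most recently reported value. The following is a minimal sketch on an invented toy table; the entity names and numbers are placeholders, not values from the provided data.

```python
import pandas as pd

# Toy stand-in for the country-level internet table (all values invented)
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2017, 2018, 2019, 2017, 2018, 2019],
    'Internet_Usage': [80.0, 85.0, 90.0, 70.0, 75.0, 72.0],
})

# Peak usage per country
peak = toy.groupby('Entity')['Internet_Usage'].max()

# Usage in each country's most recent reported year
latest = (toy.sort_values('Year')
             .groupby('Entity')['Internet_Usage']
             .last())

# Countries whose peak exceeds their latest value declined after the peak
declined = (peak - latest)[peak > latest]
print(declined)
```

An empty result would suggest the maximum and the latest value coincide for every country, in which case ranking by maximum is equivalent to ranking by the latest report.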
An alternative approach would be to select the data for the most recent year available for each country, then rank by utilization. However, if the most recent year is not the same for all locations, then this may still not produce a fair comparison.
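As a rough illustration of that alternative, the sketch below keeps each country's most recent row and ranks those rows by usage. The toy table is invented for the example; with the real data, `internet_country` would take its place.

```python
import pandas as pd

# Toy stand-in for the country-level internet table (all values invented)
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B', 'C'],
    'Year': [2018, 2019, 2016, 2017, 2019],
    'Internet_Usage': [88.0, 91.0, 97.0, 99.0, 95.0],
})

# Row label of each country's most recent year
latest_idx = toy.groupby('Entity')['Year'].idxmax()

# Keep only those rows and rank by usage; Year is retained so that
# mismatched reporting years remain visible in the output
latest_rows = toy.loc[latest_idx].sort_values('Internet_Usage', ascending=False)
print(latest_rows)
```

Keeping the Year column in the result makes the caveat above easy to spot: in this toy case the top-ranked country last reported two years earlier than the others.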
2. How many people had internet access in those countries in 2019?
Initial approach:
The result list from Q1 of the 5 countries with highest internet utilization was used to select rows from the 'people' table where Year is 2019 and Entity is in the country list.
# Create select-list for top 5 utilization countries
top_5_c_util = ['Bahrain', 'Qatar', 'Kuwait', 'Liechtenstein', 'United Arab Emirates']
# Select Year 2019 data from people table for top 5 utilization countries
in_top5_c = people.Entity.isin(top_5_c_util)
top_5_c_2019 = people[(in_top5_c) & (people.Year == 2019)]
# Print DataFrame
print(top_5_c_2019)
Curiously, only 4 of the 5 top internet use countries show data for 2019. Liechtenstein is missing. To see what the trouble is, let's look at the most recent years of data for the country.
print(internet[(internet['Entity'] == 'Liechtenstein') & (internet['Year'] > 2015)])
The data for Liechtenstein were last reported for 2017, so no 2019 data exist for it. The same was true for the next two entries in the 'max_country_use' maximum utilization list, which suggests that manually extending the list is an inefficient way to find the country with the 5th highest utilization for 2019.
To get the number of users in top 5 utilization countries in 2019, let's filter the per-capita use from the 'internet' table for 2019 and retrieve the top 5 countries with data for that year. Then, we can use that list to select the 2019 data on number of individual users from the people table.
# Identify highest utilization countries & truncate to top 5
top_2019_c_util = internet_country[internet_country['Year'] == 2019].sort_values('Internet_Usage',ascending=False).head(5)
print(top_2019_c_util)
# Select Year 2019 data from people table for top 5 utilization countries
in_top5_2019_c = people.Entity.isin(top_2019_c_util['Entity'])
top_5_c_2019 = people[(in_top5_2019_c) & (people.Year == 2019)]
# Print DataFrame
print(top_5_c_2019)
The 5th highest country for internet usage per population in 2019 is Denmark. Among these 5 countries, it has the second-highest number of individual users after the UAE. All countries with the highest utilization have fewer than 10 million users and are geographically compact with high population density.
3. What are the top 5 countries with the highest internet use for each of the following regions: 'Africa Eastern and Southern', 'Africa Western and Central', 'Latin America & Caribbean', 'East Asia & Pacific', 'South Asia', 'North America', 'European Union'?
Determining the top 5 countries in each region requires generation of a Region column to pair with Entity for the internet table. This does not seem possible to generate from the data provided. Q4 also implies that the output from this question is a list of 5 regions.
As a result, this question could be asking for the top 5 countries within each of 7 regions (35 responses) OR for the top 5 utilization regions of the list.
If this were a real situation, I would ask for clarification to ensure I generated the correct product. In this hypothetical situation, I will perform both analyses.
3A. Interpreting the question as written, engaging external data
In a real scenario, a list of the employer's preferred region classifications likely exists and could be provided by coworkers. Where this is a routine mix of classification levels, someone should have a copy of the correspondence! However, in the absence of a correspondence table one can be constructed.
Data sources: A table with ISO 3166 alpha-3 country codes and UN region data was sourced from a publicly-generated GitHub repository: lukes ISO code repo. This table is not a 1:1 match for the regions in the request, and the UN sub-regions from the table do not correspond neatly to the categories listed. The EU member list does not appear in the main region list because it is an economic bloc, not a geographic region. Instead, a list of countries in the EU was sourced from Kaweb. Notably, the UK departed from the EU in 2020, which is the last year of internet data provided; it makes the most sense to include the UK in the EU for the correspondence list and manually reset its Region to Europe for any 2020 data.
Table construction: Populating a region column from disparate data sources that cobble together different region levels and economic groups requires sequentially writing data for selected rows from at least 2 sources based on the country ID. To do this requires logical selections for subregions in Africa, Asia, and the Americas, then sequential substitution with combined sub-regions and cleaning region names to match format of the provided data. Other parts of the world would remain populated at the region level. Finally, EU countries would have the region 'Europe' overwritten with the economic bloc 'European Union'.
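The sequential overwrite described above might look roughly like the following sketch. Everything here is illustrative: the toy frame stands in for the external ISO/UN table, the column names (`region`, `sub_region`) are assumptions about its layout, and the sub-region mapping and one-entry EU list are placeholders rather than complete or authoritative memberships.

```python
import pandas as pd

# Toy stand-in for the external ISO/UN table (column names are assumed)
iso = pd.DataFrame({
    'Code': ['KEN', 'NGA', 'IND', 'FRA', 'NOR'],
    'region': ['Africa', 'Africa', 'Asia', 'Europe', 'Europe'],
    'sub_region': ['Eastern Africa', 'Western Africa', 'Southern Asia',
                   'Western Europe', 'Northern Europe'],
})

# Hypothetical mapping from sub-regions to the region names in the request
sub_to_target = {
    'Eastern Africa': 'Africa Eastern and Southern',
    'Western Africa': 'Africa Western and Central',
    'Southern Asia': 'South Asia',
}

# Step 1: start every country at the broad region level
iso['Region'] = iso['region']

# Step 2: overwrite with the combined sub-region name where one is mapped
mapped = iso['sub_region'].map(sub_to_target)
iso['Region'] = mapped.fillna(iso['Region'])

# Step 3: overwrite EU members with the economic bloc (list is illustrative)
eu_codes = ['FRA']
iso.loc[iso['Code'].isin(eu_codes), 'Region'] = 'European Union'

correspondence = iso[['Code', 'Region']]
print(correspondence)
```

The order of the steps matters: the EU overwrite comes last so that bloc membership takes precedence over the geographic label.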
Applying the correspondence & Analysis: Once the correspondence table is constructed, it can be merged with the internet data table. After merging, the Region value for any UK rows from 2020 must be manually reset to Europe. This cleaned data can then be analyzed similarly to Q1, grouping by region and then by country.
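A minimal sketch of the merge, the UK correction, and the per-region ranking is shown below. Both tables are invented toy data, and `correspondence` is a hypothetical stand-in for the table described above; only the column names Entity, Code, Year, and Internet_Usage come from the provided data.

```python
import pandas as pd

# Toy stand-ins (all usage values invented)
internet_toy = pd.DataFrame({
    'Entity': ['France', 'France', 'United Kingdom', 'United Kingdom',
               'India', 'Norway'],
    'Code': ['FRA', 'FRA', 'GBR', 'GBR', 'IND', 'NOR'],
    'Year': [2019, 2020, 2019, 2020, 2019, 2019],
    'Internet_Usage': [83.0, 85.0, 92.0, 94.0, 41.0, 98.0],
})
correspondence = pd.DataFrame({
    'Code': ['FRA', 'GBR', 'IND', 'NOR'],
    'Region': ['European Union', 'European Union', 'South Asia', 'Europe'],
})

# Attach a Region to every country-year row
merged = internet_toy.merge(correspondence, on='Code', how='left')

# Reset the UK to Europe for 2020 onward, after its EU departure
uk_2020 = (merged['Code'] == 'GBR') & (merged['Year'] >= 2020)
merged.loc[uk_2020, 'Region'] = 'Europe'

# Maximum usage per country within each region, then the top 5 per region
peak = merged.groupby(['Region', 'Entity'])['Internet_Usage'].max()
top5 = peak.sort_values(ascending=False).groupby(level='Region').head(5)
print(top5)
```

One side effect of the 2020 correction is that the UK appears under both 'European Union' and 'Europe', each with the maximum from its own span of years.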