Internet: A Global Phenomenon
This dataset contains information on internet access around the world.
The workspace is set up with two CSV files containing information on global internet access for years ranging from 1990 to 2020.
internet_users.csv
- users - The number of people who have used the internet in the last three months
- share - The share of the entity's population who have used the internet in the last three months

adoption.csv
- fixed_telephone_subs - The number of people who have a telephone landline connection
- fixed_telephone_subs_share - The share of the entity's population who have a telephone landline connection
- fixed_broadband_subs - The number of people who have a broadband internet landline connection
- fixed_broadband_subs_share - The share of the entity's population who have a broadband internet landline connection
- mobile_cell_subs - The number of people who have a mobile subscription
- mobile_cell_subs_share - The share of the entity's population who have a mobile subscription
Both data files are indexed on the following 3 attributes:
- entity - The name of the country, region, or group.
- code - Unique id for the country (null for other entities).
- year - Year from 1990 to 2020.
Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.
Source: Our World In Data
Some guiding questions to help you explore this data:
- What are the top 5 countries with the highest internet use (by population share)?
- What are the top 5 countries with the highest internet use for some large regions?
- What is the correlation between internet usage (population share) and broadband subscriptions for 2020?
Note: This is how the World Bank defines the different regions.
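The third guiding question can be approached by joining the two files on entity and year and correlating the two share columns. The following is a minimal sketch rather than a definitive method: the helper name share_correlation is hypothetical, and it assumes both files carry the entity and year columns described above.

```python
import pandas as pd


def share_correlation(users: pd.DataFrame, adoption: pd.DataFrame, year: int) -> float:
    """Pearson correlation between internet-use share and fixed-broadband share.

    Hypothetical helper: assumes `users` has columns entity, year, share and
    `adoption` has columns entity, year, fixed_broadband_subs_share.
    """
    # Join the two tables on the attributes they are both indexed on.
    merged = users.merge(adoption, on=["entity", "year"], how="inner")
    # Keep a single year and drop rows where either measure is missing.
    cols = ["share", "fixed_broadband_subs_share"]
    subset = merged.loc[merged["year"] == year, cols].dropna()
    return subset["share"].corr(subset["fixed_broadband_subs_share"])
```

With the workspace files loaded via pd.read_csv, calling share_correlation(internet_users, adoption, 2020) would return the coefficient for 2020. Aggregate entities such as regions are included unless they are filtered out first.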
Scenario: Identify emerging markets for a global internet provider
This scenario helps you develop an end-to-end project for your portfolio.
Background: You work for a global internet provider on a mission to provide affordable Internet access to everybody around the world using satellites. You are tasked with identifying which markets (regions or countries) are most worthwhile to focus efforts on.
Objective: Construct a top 5 list of countries where there is a big opportunity to roll out our services. Try to consider the number of people who lack access to (good) wired or mobile internet, as well as their spending power.
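One way to start on the objective is to estimate how many people in each entity are not yet online. The sketch below derives a population estimate from users and share and subtracts the users. It assumes share is stored in percentage points (0 to 100); the function name and the derived columns population_est and unconnected are hypothetical. Spending power is not part of the two CSV files, so it would have to come from another source.

```python
import pandas as pd


def unconnected_population(users: pd.DataFrame, year: int) -> pd.DataFrame:
    """Rank entities by the estimated number of people not using the internet.

    Assumption: `share` is in percentage points, so population = users / (share / 100).
    """
    # Rows with a zero or missing share cannot be inverted into a population estimate.
    df = users.loc[(users["year"] == year) & (users["share"] > 0)].copy()
    df["population_est"] = df["users"] / (df["share"] / 100.0)
    df["unconnected"] = df["population_est"] - df["users"]
    return df.sort_values("unconnected", ascending=False)
```

Taking .head(5) of the result gives a first candidate list, which can then be weighed against an external income measure.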
You can query the pre-loaded CSV files using SQL directly. Here's a sample query:
SELECT *
FROM 'internet_users.csv'
LIMIT 10

import pandas as pd
internet_users = pd.read_csv('internet_users.csv')
print(internet_users.shape)
internet_users.head()

adoption = pd.read_csv('adoption.csv')
print(adoption.shape)
adoption.head()
# Preliminary data check: column information and missing values
print(internet_users.info())
print(internet_users.isna().sum())
PART 1
Internet coverage by different countries over the years.
import seaborn as sns
import matplotlib.pyplot as plt

# Plot internet coverage over time for the top countries
# Group by year and entity and save the result as a dataframe
topla = internet_users.groupby(['year', 'entity'])['share'].sum().reset_index()

# Use a flag to select the final year and the number of top countries
def patio(flag):
    # flag 0 (default): top 5 countries, final year 2020
    # flag 1: final year 2010; flag 2: top 10 countries
    leap_year = 1990
    nleap = 5
    fin_year = 2020
    top_select = 5
    if flag == 1:
        fin_year = 2010
    if flag == 2:
        top_select = 10
    # Scan from the start year to the final year in steps of nleap years
    while leap_year <= fin_year:
        # Filter topla for the current year and sort by share, highest first
        topla_leapyear = topla[topla['year'] == leap_year].sort_values(by=['share'], ascending=False)
        # Advance to the next step
        leap_year = leap_year + nleap
    # After the loop, topla_leapyear holds the ranking for the final year;
    # select the top 5 or 10 countries from it
    sel_top = topla_leapyear['entity'].head(top_select).values
    top_internet_users = internet_users[internet_users['entity'].isin(sel_top)]
    # Report the selection and the final year (2010 or 2020)
    print(f'Selected top {top_select} countries by internet use in {fin_year}')
    # Line plot of internet coverage over the years for each selected country
    sns.lineplot(data=top_internet_users, x='year', y='share', hue='entity')
    plt.title(f'Internet coverage over the years for the top {top_select} countries in {fin_year}')
    plt.show()
# Select top 10 countries, final year 2020
patio(2)
# Select top 5 countries, final year 2020
patio(0)
# Select top 5 countries, final year 2010
patio(1)
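The ranking step inside patio only needs the rows for the final year, so it can also be written without the loop. The following is a minimal sketch under that reading of the code: top_entities and its countries_only parameter are hypothetical names, and the optional filter relies on the code column being null for regions and groups, as the dataset description states.

```python
import pandas as pd


def top_entities(df: pd.DataFrame, year: int, n: int = 5, countries_only: bool = False) -> list:
    """Return the n entities with the highest share in the given year."""
    at_year = df[df["year"] == year]
    if countries_only:
        # Per the dataset description, only countries have a non-null code.
        at_year = at_year[at_year["code"].notna()]
    # nlargest sorts by share in descending order and keeps the first n rows.
    return at_year.nlargest(n, "share")["entity"].tolist()
```

For example, top_entities(internet_users, 2020, n=10) would give the same selection as patio(2) whenever no two entities tie on share, and the returned names can be passed to isin before plotting.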
Observation 1:
Line plots of internet coverage over the years show that:
- There is a wide spectrum of development. Some countries, like Luxembourg, the Netherlands, Norway and Sweden, reached 80% coverage early on.
- Some countries, like Bahrain, Kuwait and Qatar, started late but reached nearly 100% coverage by 2020.
# Identify countries and the year they first exceeded the coverage threshold
year_score = []
country_score = []
# Select the country list
country_list = topla['entity'].unique()
# Scan the country list
for country in country_list:
    # Scan topla (ordered by year) for the first row where this country exceeds the threshold;
    # if found, append the country to country_score and the year to year_score,
    # then break to move on to the next country in the list
    for k, row in topla.iterrows():
        # NOTE: 0.9 means 90% only if share is a fraction (0-1). If share is in
        # percentage points (0-100), this tests for 0.9% and the threshold should be 90
        if row['share'] > 0.9 and row['entity'] == country:
            country_score.append(country)
            year_score.append(row['year'])
            break
# Convert country_score and year_score into a dataframe and sort by year to get the first 10 and last 10 countries
df_cy = pd.DataFrame({'country': country_score, 'year': year_score})
df_cy = df_cy.sort_values(by='year')
print(f'\nFirst 10 countries to reach the threshold:\n {df_cy.head(10)}')
print(f'\nLast 10 countries to reach the threshold:\n {df_cy.tail(10)}')
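The nested scan above visits every row of the table once per country. A grouped version gives the same first-year result in a single pass. This is a sketch, not the notebook's own method: first_year_reaching is a hypothetical helper, and the threshold is left as a parameter because the right value depends on whether share is stored as a fraction or in percentage points.

```python
import pandas as pd


def first_year_reaching(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """First year in which each entity's share exceeds the threshold.

    Entities that never exceed the threshold are absent from the result.
    """
    # Keep only the rows above the threshold, then take the earliest year per entity.
    reached = df[df["share"] > threshold]
    first = reached.groupby("entity")["year"].min().reset_index()
    first = first.rename(columns={"entity": "country"})
    return first.sort_values("year").reset_index(drop=True)
```

Calling first_year_reaching(internet_users, 0.9) would reproduce the country and year pairs of the loop, while a threshold of 90 would apply if share turns out to be on a 0 to 100 scale.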
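Observation 2:
- Some countries developed internet access rapidly and achieved the 90% benchmark coverage in the early 1990s. These countries include the USA, Sweden and Finland, mostly rich countries in Europe and North America.
- Some other countries lagged behind and reached the 90% benchmark only after 2010. These countries include Burundi, Kosovo, Ethiopia and Eritrea, mostly poor African countries.
Note: These dates look too early for 90% coverage. If share is stored in percentage points (0 to 100), the 0.9 threshold used in the code corresponds to 0.9%, and the results should be read accordingly.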
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# df_cy is the dataframe of countries and threshold years built earlier
ncountry_year = df_cy.groupby('year')['country'].count().reset_index(name='count')
fig, ax = plt.subplots()
sns.barplot(data=ncountry_year, x='year', y='count', ax=ax)
plt.xticks(rotation=90)
plt.title('Number of countries that reached 90% coverage over the years')
# Place a tick at every fifth position to reduce clutter on the x-axis
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
plt.show()
# Now identify countries and regions that never achieved 90% coverage
diff = set(country_list).difference(set(country_score))
print(f'\nCountries or regions that failed to achieve 90% coverage as of 2020:\n {diff}')
# Omit regions and print only countries
omit = ['Africa', 'North America', 'High-income countries', 'South America', 'Asia', 'Low-income countries', 'Europe', 'Lower-middle-income countries', 'Upper-middle-income countries']
sieved = set(diff).difference(omit)
print(f'\nCountries only, with regions filtered out:\n {sieved}')
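Instead of a hand-written list of regions, the filtering step could use the code column, which the dataset description says is null for anything that is not a country. The following is a sketch under that assumption: never_reached and countries_only are hypothetical names, and the threshold is again a parameter so it can match the scale of share.

```python
import pandas as pd


def never_reached(df: pd.DataFrame, threshold: float, countries_only: bool = True) -> list:
    """Entities whose share never exceeds the threshold in any year."""
    if countries_only:
        # Assumption from the dataset description: regions and groups have a null code.
        df = df[df["code"].notna()]
    # Highest share each entity ever recorded (NaN if every value is missing).
    peak = df.groupby("entity")["share"].max()
    # Negating the comparison also keeps entities whose peak is NaN.
    return sorted(peak[~(peak > threshold)].index)
```

If some aggregates do carry a code in the actual file, they would still need to be removed by name, as in the omit list above.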