DA World Internet Access Data

Internet: A Global Phenomenon

This dataset contains information on internet access around the world.

The workspace is set up with two CSV files containing information on global internet access for years ranging from 1990 to 2020.

internet_users.csv
- users - The number of people who have used the internet in the last three months
- share - The share of the entity's population who have used the internet in the last three months
adoption.csv
- fixed_telephone_subs - The number of people who have a telephone landline connection
- fixed_telephone_subs_share - The share of the entity's population who have a telephone landline connection
- fixed_broadband_subs - The number of people who have a broadband internet landline connection
- fixed_broadband_subs_share - The share of the entity's population who have a broadband internet landline connection
- mobile_cell_subs - The number of people who have a mobile subscription
- mobile_cell_subs_share - The share of the entity's population who have a mobile subscription

Both data files are indexed on the following 3 attributes:

entity - The name of the country, region, or group.
code - Unique id for the country (null for other entities).
year - Year from 1990 to 2020.
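
Because both files share these three key columns, they can be joined into a single table. The snippet below is a minimal sketch of that join. It uses two tiny made-up frames in place of the real CSVs (the entity names, codes, and values are invented for illustration and are not taken from the dataset).

```python
import pandas as pd

# Made-up stand-in for internet_users.csv (values are illustrative only)
users_demo = pd.DataFrame({
    "entity": ["Country A", "Country A", "Country B"],
    "code": ["AAA", "AAA", "BBB"],
    "year": [2019, 2020, 2020],
    "share": [50.0, 55.0, 20.0],
})

# Made-up stand-in for adoption.csv (values are illustrative only)
adoption_demo = pd.DataFrame({
    "entity": ["Country A", "Country B"],
    "code": ["AAA", "BBB"],
    "year": [2020, 2020],
    "fixed_broadband_subs_share": [30.0, 5.0],
})

# Join on the three shared key columns; a left join keeps every usage row
# even when no adoption row exists for that entity and year
KEYS = ["entity", "code", "year"]
merged_demo = users_demo.merge(adoption_demo, on=KEYS, how="left")
print(merged_demo)
```

With a left join, usage rows that have no adoption counterpart get NaN in the adoption columns, which makes gaps easy to spot with isna().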

Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Our World In Data

🌎 Some guiding questions to help you explore this data:

What are the top 5 countries with the highest internet use (by population share)?
What are the top 5 countries with the highest internet use for some large regions?
What is the correlation between internet usage (population share) and broadband subscriptions for 2020?

Note: This is how the World Bank defines the different regions.
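
One possible starting point for the first and third questions is sketched below. The helper names top_countries_by_share and usage_broadband_corr are hypothetical, and the demo frames are made up. The sketch assumes that rows without a code are regions or groups, as described in the column list above, and it uses the Pearson correlation that pandas computes by default.

```python
import pandas as pd

def top_countries_by_share(users, year, n=5):
    """Top-n countries by share in one year; rows without a code are skipped."""
    rows = users[(users["year"] == year) & users["code"].notna()]
    return rows.nlargest(n, "share")[["entity", "share"]].reset_index(drop=True)

def usage_broadband_corr(users, adoption, year=2020):
    """Pearson correlation between usage share and broadband share in one year."""
    merged = users.merge(adoption, on=["entity", "code", "year"], how="inner")
    merged = merged[(merged["year"] == year) & merged["code"].notna()]
    return merged["share"].corr(merged["fixed_broadband_subs_share"])

# Made-up demo data (not taken from the dataset)
users_demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Country C", "Region X", "Country A"],
    "code": ["AAA", "BBB", "CCC", None, "AAA"],
    "year": [2020, 2020, 2020, 2020, 2019],
    "share": [90.0, 60.0, 30.0, 95.0, 85.0],
})
adoption_demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Country C", "Region X"],
    "code": ["AAA", "BBB", "CCC", None],
    "year": [2020, 2020, 2020, 2020],
    "fixed_broadband_subs_share": [30.0, 20.0, 10.0, 40.0],
})

print(top_countries_by_share(users_demo, 2020, n=2))
print(usage_broadband_corr(users_demo, adoption_demo))
```

The same ranking helper could be reused for the regional question by first restricting the rows to the countries of one region.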

📊 Visualization ideas

Line chart: Display internet usage over time of the top 5 countries.
Map: Illustrate internet usage around the world in a given year on a map, for example with GeoPandas or Folium.

🔍 Scenario: Identify emerging markets for a global internet provider

This scenario helps you develop an end-to-end project for your portfolio.

Background: You work for a global internet provider on a mission to provide affordable Internet access to everybody around the world using satellites. You are tasked with identifying which markets (regions or countries) are most worthwhile to focus efforts on.

Objective: Construct a top 5 list of countries where there is a big opportunity to roll out our services. Consider the number of people without access to (good) wired or mobile internet, and their spending power.
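
A rough first cut at the "people without access" part of the objective is sketched below. It assumes that share is a percentage between 0 and 100, so that population can be estimated as users / (share / 100). The helper name unconnected_ranking and the demo values are made up. Spending power is not part of the two CSV files, so it would have to come from an outside source and is left out here.

```python
import pandas as pd

def unconnected_ranking(users, year, n=5):
    """Rank countries by the estimated number of people not using the internet."""
    mask = (users["year"] == year) & users["code"].notna() & (users["share"] > 0)
    rows = users[mask].copy()
    # Assumption: share is a percentage (0-100)
    rows["population_est"] = rows["users"] / (rows["share"] / 100)
    rows["unconnected"] = rows["population_est"] - rows["users"]
    cols = ["entity", "share", "unconnected"]
    return rows.nlargest(n, "unconnected")[cols].reset_index(drop=True)

# Made-up demo data (not taken from the dataset)
users_demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Country C"],
    "code": ["AAA", "BBB", "CCC"],
    "year": [2020, 2020, 2020],
    "users": [900, 200, 5000],
    "share": [90.0, 20.0, 50.0],
})

ranking_demo = unconnected_ranking(users_demo, 2020)
print(ranking_demo)
```

Rows with a share of 0 are dropped because the population cannot be estimated from them; those countries would need a separate population figure.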

You can query the pre-loaded CSV files using SQL directly. Here’s a sample query:

SELECT *
FROM 'internet_users.csv'
LIMIT 10

import pandas as pd
internet_users = pd.read_csv('internet_users.csv')
print(internet_users.shape)
internet_users.head()

adoption = pd.read_csv('adoption.csv')
print(adoption.shape)
adoption.head()


# Preliminary data check: column information and missing data

print(internet_users.info())
print(internet_users.isna().sum())

PART 1

Internet coverage by different countries over the years.

import seaborn as sns
import matplotlib.pyplot as plt

#Plot internet coverage over time for top countries

# Group by year and entity, sum the share, and save the result as a DataFrame
topla = internet_users.groupby(['year', 'entity'])['share'].sum().reset_index()

# Use flag to choose the ranking year and the number of countries

def patio(flag):
    # flag 0 (default): top 5 countries, ranked in 2020
    # flag 1: top 5 countries, ranked in 2010
    # flag 2: top 10 countries, ranked in 2020
    fin_year = 2010 if flag == 1 else 2020
    top_select = 10 if flag == 2 else 5

    # Rank entities by share in the ranking year and keep the top ones
    topla_fin_year = topla[topla['year'] == fin_year].sort_values(by='share', ascending=False)
    sel_top = topla_fin_year['entity'].head(top_select).values
    top_internet_users = internet_users[internet_users['entity'].isin(sel_top)]

    # Print the selected countries for the ranking year
    print(f'Top {top_select} countries by internet coverage in {fin_year}: {list(sel_top)}')

    # Line plot of internet coverage over the years for each selected country
    sns.lineplot(data=top_internet_users, x='year', y='share', hue='entity')
    plt.title(f'Internet coverage over the years for top {top_select} countries in {fin_year}')
    plt.show()

# Top 10 countries, ranked in 2020
patio(2)
# Top 5 countries, ranked in 2020
patio(0)
# Top 5 countries, ranked in 2010
patio(1)

Observation 1:

Line plots of internet coverage over the years show that:

There is a wide spectrum of development. Some countries, like Luxembourg, the Netherlands, Norway and Sweden, reached 80% coverage early on.
Some countries, like Bahrain, Kuwait and Qatar, started late but reached nearly 100% coverage by 2020.


# Identify countries and the year they first crossed the coverage threshold.
# NOTE: share appears to be stored as a percentage (0-100), judging by the plots above.
# If so, 0.9 here means 0.9%, and a 90% benchmark needs threshold = 90.
threshold = 0.9

# Select country list
country_list = topla['entity'].unique()

# For each entity, take the earliest year in which share exceeded the threshold
reached = topla[topla['share'] > threshold]
df_cy = (
    reached.groupby('entity', as_index=False)['year'].min()
    .rename(columns={'entity': 'country'})
    .sort_values(by='year')
)
country_score = df_cy['country'].tolist()

# Print the 10 earliest and the 10 latest countries to cross the threshold
print(f'\nFirst 10 countries to cross the threshold:\n {df_cy.head(10)}')
print(f'\nLast 10 countries to cross the threshold:\n {df_cy.tail(10)}')

Observation 2:

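Some countries developed internet access rapidly and passed the benchmark in the early 1990s. These include the USA, Sweden and Finland, mostly rich countries, many of them European.
Other countries lagged behind and passed the benchmark only after 2010. These include Burundi, Kosovo, Ethiopia and Eritrea, mostly poor African countries.
Note that the code compares share against 0.9. The plots in Part 1 show share rising to nearly 100, so it appears to be a percentage. If so, these years mark when each country first passed 0.9% of its population online, not 90%.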
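
The cutoff only stands for 90% if it is on the same scale as share. The sketch below uses made-up numbers to show how much the result changes between a cutoff of 0.9 and a cutoff of 90 when share is a percentage. The helper name first_year_above is hypothetical.

```python
import pandas as pd

def first_year_above(users, threshold):
    """Earliest year in which each entity's share exceeded the threshold."""
    above = users[users["share"] > threshold]
    return above.groupby("entity")["year"].min().sort_values()

# Made-up demo data with share on a 0-100 scale (not taken from the dataset)
scale_demo = pd.DataFrame({
    "entity": ["Country A"] * 3 + ["Country B"] * 3,
    "year": [1990, 2000, 2010] * 2,
    "share": [0.5, 40.0, 95.0, 0.0, 1.0, 30.0],
})

# Quick scale check: a maximum near 100 suggests percentages, near 1 suggests fractions
print(scale_demo["share"].max())

low_cutoff = first_year_above(scale_demo, 0.9)   # cutoff of 0.9 percent
high_cutoff = first_year_above(scale_demo, 90)   # cutoff of 90 percent
print(low_cutoff)
print(high_cutoff)
```

In this demo both countries pass the 0.9 cutoff in 2000, while only Country A passes the 90 cutoff, in 2010. Running the same scale check on the real share column shows which cutoff matches the intended benchmark.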

import seaborn as sns
import matplotlib.pyplot as plt

# Count how many countries first crossed the threshold in each year (df_cy comes from the previous cell)
ncountry_year = df_cy.groupby('year')['country'].count().reset_index(name='count')
fig, ax = plt.subplots()
sns.barplot(data=ncountry_year, x='year', y='count', ax=ax)

# Bars sit at positions 0..n-1, so label every 5th bar with its own year
step = 5
ax.set_xticks(range(0, len(ncountry_year), step))
ax.set_xticklabels(ncountry_year['year'].iloc[::step].astype(int).astype(str).tolist(), rotation=90)
plt.title('Number of countries that reached 90% coverage over the years')
plt.show()

# Now identify countries and regions that never achieved 90% coverage

diff = set(country_list).difference(country_score)
print(f'\nCountries or regions that failed to achieve 90% coverage as of 2020:\n {diff}')

# Omit regions and print only countries

omit = ['Africa', 'North America', 'High-income countries', 'South America', 'Asia', 'Low-income countries', 'Europe', 'Lower-middle-income countries', 'Upper-middle-income countries']
sieved = set(diff).difference(omit)
print(f'\nCountries only with regions filtered:\n {sieved}')
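
A hand-written omit list can miss aggregates that are not named in it. Since the column description says code is null for entities that are not countries, an alternative is to keep only the entities that have a code. The sketch below shows the idea on made-up data. The helper name countries_only is hypothetical, and any aggregate that does carry a code in the real file would still need to be excluded by name.

```python
import pandas as pd

def countries_only(users):
    """Names of all entities that have a non-null code."""
    return set(users.loc[users["code"].notna(), "entity"])

# Made-up demo data (not taken from the dataset)
entities_demo = pd.DataFrame({
    "entity": ["Country A", "Country B", "Region X", "Income group Y"],
    "code": ["AAA", "BBB", None, None],
    "year": [2020, 2020, 2020, 2020],
})

# Stand-in for the diff set computed above
diff_demo = {"Country B", "Region X", "Income group Y"}

sieved_demo = diff_demo & countries_only(entities_demo)
print(sieved_demo)
```

On the real data the same intersection could be taken between diff and countries_only(internet_users).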
