
Analyzing global internet patterns

📖 Background

In this competition, you'll explore a dataset that tracks internet usage for different countries from 2000 to 2023. Your goal is to import, clean, analyze, and visualize the data in your preferred tool.

The end goal is a clean, self-explanatory, and interactive visualization. By conducting a thorough analysis, you'll dive deeper into how internet usage has changed over time and which countries are still widely affected by a lack of internet availability.

💾 Data

You have access to the following file, but you can supplement your data with other sources to enrich your analysis.

Internet Usage (internet_usage.csv)

Column name     Description
Country Name    Name of the country
Country Code    The country's 3-character country code
2000            % of the population using the internet in 2000
2001            % of the population using the internet in 2001
2002            % of the population using the internet in 2002
2003            % of the population using the internet in 2003
...             ...
2023            % of the population using the internet in 2023

The data can be downloaded from the Files section (File > Show workbook files).

💪 Challenge

Use a tool of your choice to create an interesting visual or dashboard that summarizes your analysis!

Things to consider:

  1. Use this Workspace to prepare your data (optional).
  2. Stuck on where to start? Here are some ideas to get you going:
    • Visualize internet usage over time, by country.
    • How has internet usage changed over time? Are there any patterns emerging?
    • Consider bringing in other data to supplement your analysis.
  3. Create a screenshot of your main dashboard / visuals, and paste in the designated field.
  4. Summarize your findings in an executive summary.
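If you're unsure what to build first, a plain line chart of usage per country is a reasonable starting point. A minimal matplotlib sketch on made-up long-format data (the column names mirror the ones produced by the cleaning code in this workbook; the values are toy numbers):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-format data standing in for the melted dataset
df = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2010, 2020, 2000, 2010, 2020],
    "internet_usage_percentage": [5, 40, 85, 1, 10, 35],
})

# One line per country
fig, ax = plt.subplots()
for country, grp in df.groupby("Country Name"):
    ax.plot(grp["year"], grp["internet_usage_percentage"], marker="o", label=country)
ax.set_xlabel("Year")
ax.set_ylabel("Individuals using the internet (% of population)")
ax.legend(title="Country")
fig.savefig("usage_over_time.png")
```

From here you can swap the toy frame for the cleaned per-country CSVs, or rebuild the same chart in your dashboard tool of choice.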
# Data Cleaning and Segmentation
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline, interp1d

data = pd.read_csv("data/internet_usage.csv")
data.replace('..', np.nan, inplace=True)

# Count missing values per year
for i in range(2000, 2024):
    print(f'missing data in year {i}: {data[str(i)].isna().sum()}')

for country in data['Country Name'].unique():
    country_df = data[data['Country Name'] == country]
    # Reshape from wide (one column per year) to long format;
    # Country Code must be an id_var so it doesn't melt into a bogus "year" row
    df = country_df.melt(id_vars=['Country Name', 'Country Code'],
                         var_name='year',
                         value_name='internet_usage_percentage')
    df.drop(columns=['Country Name', 'Country Code'], inplace=True)
    df["year"] = pd.to_numeric(df["year"], errors='coerce')
    df['internet_usage_percentage'] = pd.to_numeric(df['internet_usage_percentage'], errors='coerce')
    # Flag observed rows before the gaps are filled below
    df["internet_usage_percentage_data_source"] = np.where(
        df['internet_usage_percentage'].notna(), "real data", "interpolated/extrapolated data")

    valid_data = df[df['internet_usage_percentage'].notna()]

    if len(valid_data) > 23:  # all 24 years observed: exact cubic spline through the points
        spline = UnivariateSpline(valid_data['year'], valid_data['internet_usage_percentage'], s=0)
        df['internet_usage_percentage'] = spline(df['year'])
        method_used = "spline"
    elif len(valid_data) > 10:  # moderate gaps: quadratic interpolation, since most series follow a polynomial pattern
        quadratic_interp = interp1d(valid_data['year'], valid_data['internet_usage_percentage'], kind='quadratic', fill_value="extrapolate")
        df['internet_usage_percentage'] = quadratic_interp(df['year'])
        method_used = "quadratic"
    elif len(valid_data) > 1:  # sparse data: linear interpolation as a fallback
        linear_interp = interp1d(valid_data['year'], valid_data['internet_usage_percentage'], kind='linear', fill_value="extrapolate")
        df['internet_usage_percentage'] = linear_interp(df['year'])
        method_used = "linear"
    else:
        print(f"Not enough data points for interpolation in {country}")
        method_used = "none"

    df.to_csv(f'{country}.csv', index=False)
    print(f"Saved data for {country} to {country}.csv using {method_used} interpolation.")
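The tiered fallback above (spline when the series is complete, quadratic for moderate gaps, linear when sparse) can be isolated into a small helper so the thresholds live in one place. `choose_method` is a hypothetical name; the cutoffs mirror the ones used in the cleaning loop:

```python
def choose_method(n_valid: int, n_years: int = 24) -> str:
    """Pick an interpolation method from the number of observed points.

    Mirrors the thresholds in the cleaning loop: an exact spline needs the
    full series, quadratic needs a moderately dense one, linear needs at
    least two points, and anything less cannot be filled at all.
    """
    if n_valid >= n_years:
        return "spline"
    if n_valid > 10:
        return "quadratic"
    if n_valid > 1:
        return "linear"
    return "none"

# Examples: a complete series, a gappy one, a sparse one, and a single point
print([choose_method(n) for n in (24, 15, 5, 1)])
# → ['spline', 'quadratic', 'linear', 'none']
```

Centralizing the rule this way also makes it easy to tweak the cutoffs once and have all three indicator scripts pick up the change.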

# import os
# import pandas as pd
# directory = os.getcwd()
# combined_data = pd.DataFrame()
# for file in os.listdir(directory):
#     if file.endswith(".csv") and file != "internet_usage_217_countries.csv":
#         country_name = os.path.splitext(file)[0]
#         country_data = pd.read_csv(os.path.join(directory, file))
#         country_data['Country Name'] = country_name
#         combined_data = pd.concat([combined_data, country_data], ignore_index=True)
# combined_data['year'] = combined_data['year'].astype(int)
# combined_data.to_csv(os.path.join(directory, "internet_usage_217_countries.csv"), index=False)
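The commented-out merge step above can be packaged as a function so it is easier to rerun safely. `combine_country_csvs` is a hypothetical helper name; the exclude list keeps the combined output file from being re-read on a second run:

```python
from pathlib import Path
import pandas as pd

def combine_country_csvs(directory, exclude=("internet_usage_217_countries.csv",)):
    """Stack every per-country CSV in `directory`, tagging each row with the
    country name taken from the file name (which is how the cleaning loop
    names its outputs)."""
    frames = []
    for path in sorted(Path(directory).glob("*.csv")):
        if path.name in exclude:
            continue  # skip the combined output from a previous run
        frame = pd.read_csv(path)
        frame["Country Name"] = path.stem
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    combined["year"] = combined["year"].astype(int)
    return combined
```

Collecting the frames in a list and concatenating once at the end is also faster than repeatedly calling `pd.concat` inside the loop.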
# Data Cleaning and Segmentation
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline, interp1d

data = pd.read_csv("gdp_per_capita.csv")
data.replace('..', np.nan, inplace=True)

# Count missing values per year
for i in range(2000, 2024):
    print(f'missing data in year {i}: {data[str(i)].isna().sum()}')

for country in data['Country Name'].unique():
    country_df = data[data['Country Name'] == country]
    # Reshape from wide (one column per year) to long format
    df = country_df.melt(id_vars=['Country Name'],
                         var_name='year',
                         value_name='gdp_per_capita')
    df.drop(columns='Country Name', inplace=True)
    df["year"] = pd.to_numeric(df["year"], errors='coerce')
    df = df.dropna(subset=['year'])  # drop any non-year columns picked up by the melt
    df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'], errors='coerce')
    # Flag observed rows before the gaps are filled below
    df["gdp_per_capita_data_source"] = np.where(
        df['gdp_per_capita'].notna(), "real data", "interpolated/extrapolated data")

    valid_data = df[df['gdp_per_capita'].notna()]

    if len(valid_data) > 23:  # all 24 years observed: exact cubic spline through the points
        spline = UnivariateSpline(valid_data['year'], valid_data['gdp_per_capita'], s=0)
        df['gdp_per_capita'] = spline(df['year'])
        method_used = "spline"
    elif len(valid_data) > 10:  # moderate gaps: quadratic interpolation, since most series follow a polynomial pattern
        quadratic_interp = interp1d(valid_data['year'], valid_data['gdp_per_capita'], kind='quadratic', fill_value="extrapolate")
        df['gdp_per_capita'] = quadratic_interp(df['year'])
        method_used = "quadratic"
    elif len(valid_data) > 1:  # sparse data: linear interpolation as a fallback
        linear_interp = interp1d(valid_data['year'], valid_data['gdp_per_capita'], kind='linear', fill_value="extrapolate")
        df['gdp_per_capita'] = linear_interp(df['year'])
        method_used = "linear"
    else:
        print(f"Not enough data points for interpolation in {country}")
        method_used = "none"

    df.to_csv(f'{country}_gdp_per_capita.csv', index=False)
    print(f"Saved data for {country} to {country}_gdp_per_capita.csv using {method_used} interpolation.")
    
    
# Data Cleaning and Segmentation
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline, interp1d

data = pd.read_csv("urban_population_percentage.csv")
data.replace('..', np.nan, inplace=True)

# Count missing values per year
for i in range(2000, 2024):
    print(f'missing data in year {i}: {data[str(i)].isna().sum()}')

for country in data['Country Name'].unique():
    country_df = data[data['Country Name'] == country]
    # Reshape from wide (one column per year) to long format
    df = country_df.melt(id_vars=['Country Name'],
                         var_name='year',
                         value_name='urban_population_percentage')
    df.drop(columns='Country Name', inplace=True)
    df["year"] = pd.to_numeric(df["year"], errors='coerce')
    df = df.dropna(subset=['year'])  # drop any non-year columns picked up by the melt
    df['urban_population_percentage'] = pd.to_numeric(df['urban_population_percentage'], errors='coerce')
    # Flag observed rows before the gaps are filled below
    df["urban_population_percentage_data_source"] = np.where(
        df['urban_population_percentage'].notna(), "real data", "interpolated/extrapolated data")

    valid_data = df[df['urban_population_percentage'].notna()]

    if len(valid_data) > 23:  # all 24 years observed: exact cubic spline through the points
        spline = UnivariateSpline(valid_data['year'], valid_data['urban_population_percentage'], s=0)
        df['urban_population_percentage'] = spline(df['year'])
        method_used = "spline"
    elif len(valid_data) > 10:  # moderate gaps: quadratic interpolation, since most series follow a polynomial pattern
        quadratic_interp = interp1d(valid_data['year'], valid_data['urban_population_percentage'], kind='quadratic', fill_value="extrapolate")
        df['urban_population_percentage'] = quadratic_interp(df['year'])
        method_used = "quadratic"
    elif len(valid_data) > 1:  # sparse data: linear interpolation as a fallback
        linear_interp = interp1d(valid_data['year'], valid_data['urban_population_percentage'], kind='linear', fill_value="extrapolate")
        df['urban_population_percentage'] = linear_interp(df['year'])
        method_used = "linear"
    else:
        print(f"Not enough data points for interpolation in {country}")
        method_used = "none"

    df.to_csv(f'{country}_urban_population_percentage.csv', index=False)
    print(f"Saved data for {country} to {country}_urban_population_percentage.csv using {method_used} interpolation.")
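Once the three per-country files exist, the indicators can be lined up by year for the correlation analysis. A small sketch with toy frames standing in for one country's cleaned CSVs (the column names match the scripts above; the values are made up):

```python
import pandas as pd

# Toy per-indicator frames standing in for one country's cleaned CSVs
usage = pd.DataFrame({"year": [2000, 2001], "internet_usage_percentage": [1.0, 2.0]})
gdp = pd.DataFrame({"year": [2000, 2001], "gdp_per_capita": [500.0, 550.0]})
urban = pd.DataFrame({"year": [2000, 2001], "urban_population_percentage": [25.0, 26.0]})

# Inner-join the indicators on year so each row is one country-year
merged = usage.merge(gdp, on="year").merge(urban, on="year")
print(merged.columns.tolist())
```

In the real pipeline you would load the three `{country}_*.csv` files for a given country and merge them the same way before computing correlations or plotting.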
!pip install dash jupyter_dash

✍️ Judging criteria

Visualizations (50%)
  • Appropriateness of visualizations used.
  • Clarity of insight from visualizations.
Summary (35%)
  • Clarity of insights: how clear and well presented the findings are.
Votes (15%)
  • Upvoting: the most upvoted entries get the most points.

🧾 Executive summary

In a couple of lines, write your main findings here.

For most countries, internet usage from 2000 to 2023 follows a cubic trend. Moreover, internet usage correlates strongly with urban population percentage and, especially, with access to electricity.
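The correlation claim can be checked directly with pandas once the indicators are merged per country-year. A sketch on illustrative toy values (not the real dataset):

```python
import pandas as pd

# Toy country-year values; replace with the merged real columns
df = pd.DataFrame({
    "internet_usage_percentage": [5, 15, 35, 60, 80],
    "urban_population_percentage": [30, 38, 50, 65, 74],
})

# Pearson correlation between the two indicators
r = df["internet_usage_percentage"].corr(df["urban_population_percentage"])
print(round(r, 3))
```

The same one-liner, applied to each pair of real indicator columns, is enough to back the "high correlation" statement with a number in the summary.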

📷 Visual/Dashboard screenshot

Paste one screenshot of your visual/dashboard here.

🌐 Upload your dashboard (optional)

Ideally, paste your link to your online available dashboard here.

Otherwise, upload your dashboard file to the Files section (File > Show workbook files).

File uploaded as app.py.

Github link: https://github.com/Mustaqeem01/internet_usage_analysis_by_countries_and_years_dashboard_application

⌛️ Time is ticking. Good luck!