Introduction
This notebook explores global development indicators, including economic factors, education levels, health metrics, and labor force participation, to uncover relationships and regional variations across countries. The analysis draws on a comprehensive dataset sourced from Kaggle, which compiles a wide array of socio-economic data points at the country level. By structuring this data into a relational database and querying it with SQL, we aim to extract meaningful insights into how these aspects of global development interconnect.
The Data
The dataset for this analysis, "Global Data: GDP, Life Expectancy & More," available on Kaggle, offers a comprehensive snapshot of key economic and social indicators across 204 entries and 38 distinct attributes. This rich collection allows for an in-depth exploration of global trends related to crucial aspects of development, including economic output (GDP), demographic structures (Sex Ratio, Urban Population Growth), health (Life Expectancy, Homicide Rate, Infant Mortality), education (Enrollment Rates across various levels), labor (Unemployment Rate), and environmental factors (CO2 Emissions, Forested Area, Tourist Numbers). The breadth and depth of these attributes provide a solid foundation for uncovering meaningful patterns and relationships across the globe.
From Raw Data to Relational Database
Prior to analysis, the initial dataset underwent a cleaning process to handle inconsistencies and missing values, ensuring data integrity for subsequent steps. For the purpose of focused analysis, the comprehensive dataset was then logically subdivided into multiple CSV files, each representing a specific aspect of global development: demographics, economy, education, labour, and lifespan. These CSV files were subsequently converted into corresponding tables within a SQLite database named world_data_v4.db. The database schema was designed with these five tables as primary dimensions, each containing relevant indicators and linked by a common country_code identifier to facilitate cross-dimensional analysis and the creation of a central country_facts table for integrated insights.
import pandas as pd
import sqlite3


def process_and_load_csvs_to_sql_with_duplicates_dropped(csv_paths, db_name, table_names, subset_columns=None):
    """
    Reads specific CSV files from provided paths, drops rows with missing values AND duplicates,
    and loads each processed DataFrame into a SQLite database with specified table names.

    Args:
        csv_paths (list): A list of strings, where each string is the full path to a CSV file.
        db_name (str): The name of the SQLite database file to create or connect to (e.g., 'world.db').
        table_names (list): A list of strings, where each string is the desired table name
                            in the database for the corresponding CSV file in csv_paths.
                            The order of table names should match the order of CSV paths.
        subset_columns (list, optional): A list of column names to consider when identifying
                                         duplicates. If None, all columns are considered.
                                         Defaults to None.
    """
    if len(csv_paths) != len(table_names):
        raise ValueError("The number of CSV paths must match the number of table names.")

    # Connect to the SQLite database (it will be created if it doesn't exist)
    conn = sqlite3.connect(db_name)
    try:
        for csv_path, table_name in zip(csv_paths, table_names):
            print(f"Processing file: {csv_path} and loading into table: {table_name}")
            try:
                # Read the CSV file into a pandas DataFrame
                df = pd.read_csv(csv_path)
                print(f"  Shape before dropping NaNs: {df.shape}")

                # Drop rows with any missing values
                df_cleaned = df.dropna()
                print(f"  Shape after dropping NaNs: {df_cleaned.shape}")

                # Drop duplicate rows
                df_deduplicated = df_cleaned.drop_duplicates(subset=subset_columns)
                print(f"  Shape after dropping duplicates: {df_deduplicated.shape}")

                # Load the cleaned and deduplicated DataFrame into the SQLite database
                df_deduplicated.to_sql(table_name, conn, if_exists='replace', index=False)
                print(f"  DataFrame loaded into table: {table_name}")
            except FileNotFoundError:
                print(f"Error: File not found: {csv_path}")
            except pd.errors.EmptyDataError:
                print(f"Warning: File is empty: {csv_path}")
            except Exception as e:
                print(f"An error occurred while processing {csv_path}: {e}")
    finally:
        # Close the database connection
        conn.close()
    print(f"\nSuccessfully processed CSV files and loaded data into database: {db_name}")


# Example Usage (duplicates are identified using all columns):
if __name__ == "__main__":
    csv_file_paths = [
        'demographics.csv',
        'economy.csv',
        'education.csv',
        'labour.csv',
        'lifespan.csv',
    ]
    output_table_names = [
        'demographics',
        'economy',
        'education',
        'labour',
        'lifespan',
    ]
    database_name = 'world_data_v4.db'
    process_and_load_csvs_to_sql_with_duplicates_dropped(csv_file_paths, database_name, output_table_names)
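After the loader has run, it is worth confirming that the expected tables actually exist before moving on to SQL queries. The sketch below is self-contained, using a toy DataFrame and a throwaway in-memory database rather than world_data_v4.db (the column values are hypothetical), but it exercises the same to_sql and sqlite_master mechanics:

```python
import pandas as pd
import sqlite3

# Toy data standing in for one of the cleaned CSVs (hypothetical values)
demo = pd.DataFrame({
    "country_code": ["US", "FR", "JP"],
    "pop_growth": [0.4, 0.3, -0.5],
})

# An in-memory database keeps this check side-effect free
conn = sqlite3.connect(":memory:")
demo.to_sql("demographics", conn, if_exists="replace", index=False)

# List the tables, exactly as the %%sql cell later in the notebook does
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';")]
print(tables)

# Row counts give a quick sanity check against the CSV shapes printed earlier
for t in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {t};").fetchone()[0]
    print(t, count)

conn.close()
```

The same pattern, pointed at world_data_v4.db instead of :memory:, verifies the real load.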
The SQLite Database Design
To interact with our SQLite database directly within this Jupyter Notebook, we utilized the ipython-sql extension. This powerful tool allows us to execute SQL queries and retrieve results seamlessly within our Python workflow. Before use, the extension was loaded using the %load_ext sql command, which registers the necessary magic commands. Subsequently, the line %%sql sqlite:///world_data_v4.db establishes a connection to our world_data_v4.db SQLite database, making its tables and data accessible for querying using SQL directly within code cells marked with the %%sql magic command. This integration streamlines the analytical process by enabling in-notebook database interactions.
pip install ipython-sql
%load_ext sql
The code below shows the names of the tables in the world_data_v4 database.
%%sql sqlite:///world_data_v4.db
SELECT name
FROM sqlite_master
WHERE type='table';
Each table within our world_data_v4.db database is structured with a designated primary key to ensure unique identification of records. The demographics table uses country_code as its primary key, while the dimension tables for economy (economy_id), education (education_id), labour (labour_id), and lifespan (lifespan_id) each employ an auto-incrementing integer. Creating these tables involved defining a schema based on the corresponding CSV file, with appropriate data types for each attribute and a designated primary key: the demographics table was created with country_code as its key directly, whereas the other dimension tables were first created and populated, then rebuilt with a unique, auto-incrementing primary key to facilitate efficient querying and linking within the database schema.
%%sql
CREATE TABLE temp_demographics AS SELECT * FROM demographics;
DROP TABLE demographics;
CREATE TABLE demographics (
country_code VARCHAR(2) PRIMARY KEY,
name VARCHAR(255),
region VARCHAR(100),
urban_population REAL,
pop_growth REAL,
pop_density REAL,
sex_ratio REAL
);
INSERT INTO demographics SELECT * FROM temp_demographics;
-- Drop the temporary table
DROP TABLE temp_demographics;
%%sql
CREATE TABLE temp_economy AS SELECT * FROM economy;
DROP TABLE economy;
CREATE TABLE economy (
economy_id INTEGER PRIMARY KEY,
country_code VARCHAR(2),
gdp REAL,
imports REAL,
exports REAL,
gdp_growth REAL,
gdp_per_capita REAL
);
INSERT INTO economy SELECT * FROM temp_economy;
-- Drop the temporary table
DROP TABLE temp_economy;
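The central country_facts table mentioned earlier can be assembled by joining the dimension tables on their shared country_code. The sketch below is a minimal, self-contained illustration with two toy dimension tables and a hypothetical column subset (the values are invented); the notebook itself would run the same CREATE TABLE ... AS SELECT against world_data_v4.db with all five dimensions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy stand-ins for two of the five dimension tables (hypothetical values)
conn.executescript("""
CREATE TABLE demographics (country_code TEXT PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE economy (economy_id INTEGER PRIMARY KEY, country_code TEXT, gdp REAL);

INSERT INTO demographics VALUES ('US', 'United States', 'Americas'),
                                ('FR', 'France', 'Europe');
INSERT INTO economy (country_code, gdp) VALUES ('US', 25000.0), ('FR', 2800.0);

-- country_facts gathers one row per country across the dimensions
CREATE TABLE country_facts AS
SELECT d.country_code, d.name, d.region, e.gdp
FROM demographics AS d
JOIN economy AS e ON e.country_code = d.country_code;
""")

rows = conn.execute(
    "SELECT country_code, gdp FROM country_facts ORDER BY country_code;").fetchall()
print(rows)

conn.close()
```

An inner join is used here, so only countries present in every dimension appear in country_facts; a LEFT JOIN from demographics would instead keep all countries and leave gaps as NULL.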