Introduction
The World Development Indicators (WDI) is a comprehensive database compiled by the World Bank and its international partners, offering insights into global economic conditions across six dimensions: World View, People, Environment, Economy, States and Markets, and Global Linkages. This extensive dataset includes over 1400 variables for over 250 economies, spanning from 1960 to the present. WDI provides valuable measures of social progress, quality of life, economic development, physical infrastructure, environmental health, and government performance.
Additionally, the World Bank Data Catalogue offers access to related statistical series, such as International Debt Statistics, Millennium Development Indicators, Education Statistics, and Gender Statistics. To access the World Development Indicators, visit the World Bank data catalogue here.
Project Focus
For this project, I will explore the relationship between economic prosperity, alcohol consumption, and suicide rates across various countries. Specifically, I will examine how GDP per capita and total alcohol consumption per capita correlate with suicide mortality rates. By analyzing these factors/variables, I hope to uncover potential patterns and insights into the factors that may contribute to higher or lower suicide rates.
The dataset I downloaded from the World Bank Data Catalog has ten world economic indicators as follows:
- Access to electricity (% of population)
- Central government debt, total (% of GDP)
- GDP growth (annual %)
- GDP per capita (current US$)
- Government expenditure on education, total (% of GDP)
- Government expenditure on education, total (% of government expenditure)
- Population growth (annual %)
- Population, total
- Suicide mortality rate (per 100,000 population)
- Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
The data covers a wide range of years, from 1960 to 2023, providing a long-term perspective on trends and changes over time. It includes detailed information on individual countries as well as broader regional aggregates, allowing for both micro and macro-level analysis. However, for this project, I will only focus on countries for simplicity and consistency.
Hypothesis Formulation
For this project I will test the following hypothesis:
Hypothesis 1: There is a significant negative correlation between GDP per capita and suicide mortality rates. This means that as the economic prosperity of a country increases, the suicide mortality rate decreases.
Hypothesis 2: There is a significant positive correlation between total alcohol consumption per capita and suicide mortality rates. This suggests that higher alcohol consumption is associated with higher suicide mortality rates.
Data Import and Cleaning
In this section, we will import the dataset and perform essential cleaning steps to ensure the data is ready for analysis. This involves handling missing values, converting data types, and restructuring the data to facilitate our investigation into the relationship between GDP per capita, total alcohol consumption per capita, and suicide mortality rates.
Import Required Packages
To begin our analysis, we need to import several essential Python libraries. These libraries will help us with data manipulation, visualization, and statistical analysis.
# Data manipulation and analysis
import pandas as pd
import numpy as np
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Display settings
import warnings
warnings.filterwarnings('ignore')
Load and Preview Dataset
Below, we will load the dataset into a pandas DataFrame and display the first 5 rows to get an initial understanding of the data structure and contents.
df = pd.read_csv("world_development_indicators.csv")
df.head()
To ensure consistency and verify that there are no extraneous rows, we will also examine the last few rows of the dataset.
df.tail(10)
We can observe that the dataset contains 5 extraneous rows which could potentially interfere with our analysis. We will proceed to remove these unnecessary rows to ensure the integrity and consistency of our data.
df = df.iloc[:-5]
df.tail(10)
Next, if we examine the dataset by looking at the last few rows above, we can see that the years are represented as columns and the series names (economic indicators) are represented as rows. To facilitate our analysis, we need to reshape the dataset. Specifically, we need to transform the data such that the economic indicators are represented in distinct columns, while consolidating the years into a single column. We will also rename the columns and reformat the years to exclude the [YR0000] part.
# Pivot all years of data into a single column
df_pivot = df.melt(id_vars=["Country Name", "Country Code", "Series Name"], var_name="Year", value_name="Value")
# Un-Pivot all indicator names from rows into columns
df_unpivot = df_pivot.pivot(index=["Country Name", "Country Code", "Year"], columns="Series Name", values="Value").reset_index()
# Rename columns to be cleaner and easy to work with
new_column_names = [
'country_name', 'country_code', 'year', 'access_to_electricity',
'central_government_debt_total', 'gdp_growth', 'gdp_per_capita',
'government_expenditure_on_education_total_of_gdp',
'government_expenditure_on_education_total_of_expenditure',
'population_growth', 'population_total',
'suicide_mortality_rate_per_100k_population',
'total_alcohol_consumption_per_capita'
]
# Assign new column names
df_unpivot.columns = new_column_names
# Ensure the 'year' column is of string type and reformat to exclude the [YR0000] part
df_unpivot['year'] = df_unpivot['year'].astype(str).str[:4]
# Drop the 'country_code' column as it is not needed for further analysis
df_clean = df_unpivot.drop(columns=['country_code'])
# Display the first few rows of the cleaned dataframe
df_clean.head()
Let's take a look at the last few rows of our transformed dataset as well.