Introduction

This dataset, obtained from Kaggle, provides the GDP of African countries in US dollars from 1960 to 2023. This analysis focuses on identifying clusters of countries with similar GDP growth patterns within the 2000-2023 timeframe.

The Data

import pandas as pd

Africa = pd.read_csv("Africa_GDP.csv")

print(Africa.shape)

print(Africa.info())

display(Africa)

The data consists of 64 rows (one per year, 1960-2023) and 34 columns (a Year column plus 33 African countries), holding GDP values. There is no missing data. Because the table is in wide format with countries as columns, it is transposed for modeling so that each country becomes one sample and each year one feature. Scaling is also essential to prevent features with larger magnitudes (i.e., later years with higher GDP) from dominating the distance calculations in the clustering algorithm.
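The transpose-then-scale step can be sketched on a small made-up table. The country names A, B, C and their values are hypothetical and only mirror the layout described above (a Year column plus one column per country); this is an illustration, not the Kaggle data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in the same wide layout: one row per year, one column per country.
wide = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "A": [1.0e9, 1.2e9, 1.5e9],
    "B": [4.0e10, 4.4e10, 5.0e10],
    "C": [2.0e9, 2.1e9, 2.3e9],
})

# Transpose so that each country is one sample (row) and each year one feature (column).
by_country = wide.set_index("Year").T

# StandardScaler standardizes each column (here: each year) to zero mean
# and unit variance across the countries.
scaled = StandardScaler().fit_transform(by_country)

print(by_country.shape)             # rows = countries, columns = years
print(scaled.mean(axis=0).round(6))  # per-year means after scaling
print(scaled.std(axis=0).round(6))   # per-year standard deviations after scaling
```

Note that scaling is applied per year, so a country that is much larger than the others in every year still sits far from them after scaling.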

Preprocessing and finding the value for n_clusters

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

#Filter the years to be between 2000 - 2023
start_year = 2000
end_year = 2023

Africa = Africa[(Africa['Year'] >= start_year) & (Africa['Year'] <= end_year)]

#Define a list containing the country names
Country = ["Algeria", "Benin", "Botswana", "Burkina Faso", "Burundi", "Cameroon", "Central African Republic", "Chad", "Eswatini", "Ethiopia", "Gabon", "Ghana", "Kenya", "Lesotho", "Liberia", "Libya", "Madagascar", "Mauritius", "Morocco", "Niger", "Nigeria", "Rwanda", "Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa", "Sudan", "Tanzania", "Togo", "Uganda", "Zambia", "Zimbabwe"]

new_africa = Africa.set_index('Year')[Country]

#Transpose
new_africa = new_africa.T


#Now the modelling bit can begin

#Instantiate the scaler
scaler = StandardScaler()

# Scale your data
scaled_africa = scaler.fit_transform(new_africa)

#Create a range of possible number of clusters (k values) to test
k_range = range(1, 11)  # You can adjust the upper limit as needed

# Initialize an empty list to store the within-cluster sum of squares (WCSS) for each k
wcss = []

# Iterate through the range of k values
for k in k_range:
    # a. Create a KMeans model for the current k
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)

    # b. Fit the KMeans model to the *scaled* data
    kmeans.fit(scaled_africa)

    # c. Get the WCSS for the current k
    wcss.append(kmeans.inertia_)

#Plot the WCSS values against the number of clusters (k)
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(k_range)
plt.grid(True)
plt.show()

In the plot above, the "elbow" of the curve appears at k = 3: beyond that point, adding more clusters lowers the WCSS far less than the first few did. Three groups therefore look like a reasonable way to separate the countries without over-segmenting them.
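Reading an elbow by eye is subjective, so one common cross-check is the silhouette score, which is already imported above. The sketch below is an illustration only: it runs on synthetic data from `make_blobs` (an assumption standing in for the scaled country-by-year matrix), and the range of k values is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled matrix (33 samples, 24 features); not the real GDP data.
X, _ = make_blobs(n_samples=33, centers=3, n_features=24,
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("highest silhouette at k =", best_k)
```

On the real data, replacing `X` with `scaled_africa` would show whether the silhouette score agrees with the elbow at 3.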

Application of KMeans Algorithm

# After examining the plot, determine your optimal 'n_clusters' value
optimal_k = 3

#Now you can create your final pipeline with the chosen number of clusters
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=optimal_k, random_state=42, n_init=10))
])

#Fit the final pipeline to your data
pipeline.fit(new_africa)

# Get the cluster labels
cluster_labels = pipeline.predict(new_africa)
print(cluster_labels)

Visualizations of the clusters

# Create a DataFrame to store the cluster labels for each country
cluster_df = pd.DataFrame({'Country': new_africa.index, 'Cluster': cluster_labels})
cluster_df = cluster_df.set_index('Country')

# Merge the cluster labels back with the original GDP data (transposed)
clustered_africa = new_africa.merge(cluster_df, left_index=True, right_index=True)

# Get the list of years (columns in your 'new_africa' DataFrame)
years = new_africa.columns

# Visualize GDP trends for each cluster

# Convert GDP to billions (assuming original values are in USD)
gdp_billions = new_africa / 1_000_000_000

#Define the years you want to visualize with bar plots
selected_years = [2000, 2005, 2010, 2015, 2020]

#Create the figure and subplots
fig, axes = plt.subplots(nrows=optimal_k, ncols=1, figsize=(12, 5 * optimal_k), sharey=False) # sharey=False for original scale
fig.suptitle('Average GDP by Cluster for Selected Years (Billions USD)')

#Iterate through each cluster and create a bar plot for the selected years
for cluster_num in range(optimal_k):
    cluster_countries_df_billions = gdp_billions.loc[clustered_africa[clustered_africa['Cluster'] == cluster_num].index]
    average_gdp_billions = cluster_countries_df_billions[selected_years].mean(axis=0)

    ax = axes[cluster_num]
    bars = ax.bar(selected_years, average_gdp_billions, color=[f'C{cluster_num}'] * len(selected_years))
    ax.set_title(f'Cluster {cluster_num}')
    ax.set_xlabel('Year')
    ax.set_ylabel('Average GDP (Billions USD)')
    ax.grid(axis='y')

    # Add value labels on top of the bars in billions
    for bar in bars:
        yval = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, yval + 0.1, f'{yval:.1f}B', ha='center', va='bottom', fontsize=9)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

#Display countries belonging to each cluster (as before)
cluster_membership = pd.DataFrame({'Country': new_africa.index, 'Cluster': cluster_labels})

print("\nCountries Belonging to Each Cluster:")
for i in range(optimal_k):
    countries_in_cluster = cluster_membership[cluster_membership['Cluster'] == i]['Country'].tolist()
    print(f"Cluster {i}: {', '.join(countries_in_cluster)}")

Findings

The K-Means clustering analysis, using the number of clusters (k=3) suggested by the elbow method, revealed distinct groupings of African countries based on their GDP trajectories between 2000 and 2023. The averages quoted below are taken from the selected years plotted above (2000 to 2020).

  • Cluster 0:

This cluster comprises a large group of diverse African nations (Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Eswatini, Ethiopia, Gabon, Ghana, Kenya, Lesotho, Liberia, Libya, Madagascar, Mauritius, Niger, Rwanda, Senegal, Seychelles, Sierra Leone, Somalia, Sudan, Tanzania, Togo, Uganda, Zambia, Zimbabwe). The average GDP for this cluster shows a consistent upward trend over the period, starting at approximately 5.88 billion in 2000 and reaching 25.59 billion in 2020. This indicates a general pattern of economic growth within this broad group, albeit from a relatively lower base compared to the other clusters.

  • Cluster 1:

This cluster includes Nigeria and South Africa. The average GDP for this group is significantly higher than that of Cluster 0, starting at around 110.5 billion in 2000 and reaching 385.2 billion in 2020, with a notable peak of 419.9 billion around 2015. This suggests that these two major economies followed a similar, higher-scale growth trajectory during this period, although with some fluctuations.

  • Cluster 2:

This cluster consists of Algeria and Morocco. The average GDP for this cluster also starts at a higher level than Cluster 0, with approximately 48.9 billion in 2000, growing to 143.1 billion in 2020, showing a steady increase throughout the observed years. This indicates a shared pattern of substantial and relatively stable economic expansion for these North African nations.
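The figures above are GDP levels, so the growth they imply is easier to compare as an overall multiple and a compound annual growth rate (CAGR). The sketch below uses only the 2000 and 2020 cluster averages quoted above. It describes the growth of each cluster's average, not the average of each country's growth, and the variable names are illustrative.

```python
# Cluster averages in billions USD (2000 value, 2020 value), copied from the text above.
cluster_avg = {
    "Cluster 0": (5.88, 25.59),
    "Cluster 1": (110.5, 385.2),
    "Cluster 2": (48.9, 143.1),
}
years = 2020 - 2000  # number of compounding periods

cagr = {}
for name, (start, end) in cluster_avg.items():
    multiple = end / start                    # overall growth factor
    cagr[name] = multiple ** (1 / years) - 1  # compound annual growth rate
    print(f"{name}: {multiple:.2f}x overall, about {cagr[name]:.1%} per year")
```

Seen this way, Cluster 0 grows fastest in relative terms despite its much lower base, which suggests the clusters are separated mainly by economic size rather than by growth rate.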

Potential Explanatory Characteristics:

It's important to note that without deeper economic and historical analysis, these are potential hypotheses:

  • Cluster 0 (Diverse Growth):

Economic Diversity: This large cluster likely encompasses countries with varied economic structures, including reliance on agriculture, emerging manufacturing sectors, and differing levels of resource dependence. Their shared pattern of general growth might reflect continent-wide trends in development aid, increasing global trade, and improvements in governance in some regions.

Development Stage: Many of these nations might be in similar stages of economic development, experiencing gradual but consistent growth as they build infrastructure, attract investment, and diversify their economies.

Historical Factors: While diverse, some shared colonial histories or post-independence development strategies might have influenced their growth trajectories.

  • Cluster 1 (Nigeria and South Africa - Major Economies):

Significant Resource Wealth: Both Nigeria (oil) and South Africa (various minerals) possess substantial natural resources, which have historically been major drivers of their GDP. Fluctuations in global commodity prices could explain some of the variations in their growth.

Regional Influence: As the two largest economies in Sub-Saharan Africa, their economic policies and performance can have ripple effects across the region.

Established Industrial Base: Compared to many countries in Cluster 0, Nigeria and South Africa have more established industrial and financial sectors.

  • Cluster 2 (Algeria and Morocco - North African Stability):

Hydrocarbon Resources (Algeria): Algeria's significant oil and gas reserves are a major contributor to its GDP.

Diversifying Economies (Morocco): Morocco has made significant strides in diversifying its economy, including developing its tourism, manufacturing, and agricultural sectors.

Regional Stability and Trade Links: Their relative political stability compared to some other nations and strong trade links with Europe might contribute to their consistent growth.

Government Policies: Consistent government policies aimed at economic development and diversification in both countries could be a factor.

Model performance