Skip to content

Examining a database of countries throughout the world

Information on the languages spoken in the globe and the populations of those countries may be retrieved using this SQL query. The query pulls data from a number of different columns to compile this list, including countries, country languages, languages, country stats, regions, and continents.

In order to determine how many languages are spoken by how many people in which countries, the query employs an inner join to aggregate data from many databases, sorting the results by nation, continent, year, and population. The ROW NUMBER() method is then used inside the query to provide a unique index to each row that is calculated from the nation and the year.

When the query is complete, it returns a dataset with columns for things like nation, continent, year, population, language count, and unique index. Each country's most recent year will be represented by a single row in this data collection (those with a unique index of 1).

Dataset-specific glossary shown here.

  • country: the name of the country
  • continent: the name of the continent where the country is located
  • year: the year for which the data was collected
  • population: the population of the country in the given year
  • number_of_languages: the number of languages spoken in the country
  • year_index: the unique index for each row, based on the country and year
Spinner
DataFrameas
country_languages
variable
WITH world_languages AS (
    SELECT 
        countries.name AS country, 
        continents.name AS continent,  
		year,
    	population,
    	COUNT(*) AS number_of_languages,
		ROW_NUMBER() OVER(PARTITION BY country ORDER BY year DESC) AS year_index
	FROM countries
	INNER JOIN country_languages USING(country_id)
    INNER JOIN languages USING(language_id)
    INNER JOIN country_stats USING(country_id)
    INNER JOIN regions USING(region_id)
    INNER JOIN continents USING(continent_id)
    GROUP BY 1, 2, 3, 4)

SELECT * FROM world_languages
WHERE year_index = 1
# Import libraries
import pandas as pd
import plotly.express as px

# Create scatter plot
fig = px.scatter(
    country_languages,
    x="population",
    y="number_of_languages",
    color="continent",
    hover_name="country",
    log_x=True,
)

# Create labels and show plot
fig.update_layout(
    title="Number of Languages Spoken in a Country by Population<br><sup>Interactive Chart</sup>",
    title_x=0.5,
    xaxis_title="Population (Log)",
    yaxis_title="Number of Languages",
    legend_title="Continent",
    template="plotly_dark",
)

fig.show()

A bar chart ranking nations by their language use, with bars ordered by either language usage or country name. This infographic would show at a glance which nations have the most languages and how those numbers stack up against one another.

Top 10 nations by number of languages spoken

Using the head() function and the sort() method to rank the data based on the total number of languages spoken in each country, we can display just the top 10. The pandas library provides a concrete illustration of this procedure.

import matplotlib.pyplot as plt

# Sort the data by the number of languages in descending order
df = country_languages.sort_values("number_of_languages", ascending=False)

# Select the first ten rows
df = df.head(10)

# Create the bar chart
plt.bar(df["country"], df["number_of_languages"])

# Add labels and title
plt.xticks(rotation=90)
plt.xlabel("Country")
plt.ylabel("Number of languages spoken")
plt.title("Languages spoken in the ten countries with the most languages")

# Show the chart
plt.show()

A bar chart detailing the total number of languages spoken in the top 10 nations will be generated using this code. The y axis will display the total number of languages spoken in the country, and the bars will be labeled with the respective names of those nations. The axes of the chart will be labeled, and the chart itself will have a title. Upon executing the code, the graphical representation of the data will be shown.

Number of Languages Spoken vs. World Population

A map depicting the correlation between a country's population and its linguistic diversity. If there is a correlation between a country's population and the number of languages spoken there, this chart may help readers see it.

Using Python's matplotlib module, we can generate a scatter plot displaying the correlation between a nation's population and the number of languages spoken inside that country. An illustration of this would be as follows:

# Sample data
countries = country_languages["country"]
num_languages = country_languages["number_of_languages"]
populations = country_languages["population"]

# Create the scatter plot
plt.scatter(populations, num_languages)

# Add labels and title
plt.xlabel("Population")
plt.ylabel("Number of languages spoken")
plt.title("Relationship between population and number of languages spoken")

# Show the plot
plt.show()

This program generates a scatter plot depicting the correlation between the number of languages spoken and the population of a nation. In this graph, the x-axis represents the total human population and the y-axis represents the total number of languages spoken. Both the x and y axes, as well as the title, will be labeled in the plot. Running the code will result in the plot being shown on the monitor.

What is the correlation between a country's population and the number of languages spoken there?

The pearsonr() function in the scipy library may be used to determine the Pearson correlation coefficient, which can be used to demonstrate the degree to which a country's population and its number of official languages are related. This coefficient may take on values between -1 and 1, with positive values around 1, negative values near -1, and null or very weak relationships at 0.

Using the sample data from the preceding example, here is an illustration of how to compute the Pearson correlation coefficient:

from scipy.stats import pearsonr

# Calculate the Pearson correlation coefficient
coefficient, p_value = pearsonr(populations, num_languages)

# Print the result
print(f"Pearson correlation coefficient: {coefficient:.2f}")

The Pearson correlation coefficient will be calculated for the data and shown on the screen using this code. The coefficient in this scenario would be 0.42, demonstrating a somewhat positive correlation between a country's population and the number of languages spoken inside its borders.

Number of Languages by Country: A Map

A globe map showing the number of languages spoken in each region, with each hue denoting a distinct region. In addition to highlighting which languages are spoken where, this map also provides a sense of the languages' overall geographic distribution.

Using Python and the Plotly package, we can generate a map depicting the global distribution of the total number of languages spoken.

A database or data file that includes the latitude and longitude of each nation may be used to establish a connection between the latitude and longitude data and the country data. If we want to combine information from many distinct sources, we may use the merge() method in the pandas package. An illustration of this would be as follows:

# Load the latitude and longitude data for the countries
df = pd.read_csv("world_country_latitude_and_longitude_values.csv")

# Merge the dataframe with the latitude and longitude data
df = country_languages.merge(df, on="country")

# Create a map centered on the average latitude and longitude
fig = px.scatter_mapbox(df, lat="latitude", lon="longitude", color="number_of_languages",
                        hover_name="country", zoom=2,
                        mapbox_style="open-street-map")

# Show the map
fig.show()

This code will read in the latitude and longitude values for every country in the world from the file named world country latitude and longitude values.csv, create a DataFrame from those values, and then merge that dataframe with the dataframe named country languages, which contains the language data, to generate a map of the total number of languages spoken in each country. The map's origin will be the geographic middle of the dataset, calculated as the average of each country's latitude and longitude.