Skip to content
(SQL) Project: Exploring Trends in American Baby Names
  • AI Chat
  • Code
  • Report
  • Exploring Trends in American Baby Names

    Use categorization and ranking techniques to explore 101 years of American baby names.

    What makes a name timeless or trendy? This project uses over a century of U.S. Social Security data to explore American baby name preferences.

    The data manipulation skills you'll use in this project are broadly applicable: understanding changing tastes is a key competency for businesses as well as parents searching for a baby name!

    How have American baby name tastes changed since 1920? Which names have remained popular for over 100 years, and how do those names compare to more recent top baby names? These are considerations for many new parents, but the skills you'll practice while answering these queries are broadly applicable. After all, understanding trends and popularity is important for many businesses, too!

    You'll be working with data provided by the United States Social Security Administration, which lists first names along with the number and sex of babies they were given to in each year. For processing speed purposes, the dataset is limited to first names which were given to over 5,000 American babies in a given year. The data spans 101 years, from 1920 through 2020.

    The Data

    baby_names

    columntypedescription
    yearintyear
    first_namevarcharfirst name
    sexvarcharsex of babies given first_name
    numintnumber of babies of sex given first_name in that year

    Before you proceed, consider the assumption that a name is considered "Classic" if it appears in 50 or more years, and "Trendy" otherwise.

    Spinner
    DataFrameavailable as
    usa_baby_names
    variable
    -- Run this code to view the data in baby_names
    SELECT * FROM baby_names
    LIMIT 5;
    Spinner
    DataFrameavailable as
    df
    variable
    SELECT * FROM information_schema.columns
    WHERE table_name = 'baby_names'

    1. Classify the names as Classic or Trendy

    List the overall top five names in alphabetical order and find out if each name is "Classic" or "Trendy." Save your query as a DataFrame name_types with three columns: first_name, sum, and popularity_type.

    Spinner
    DataFrameavailable as
    name_types
    variable
    -- List the overall top five names in alphabetical order and find out if each name is "Classic" or "Trendy."
    
    SELECT 
    	first_name, 
    	SUM(num) AS sum, 
    	(CASE 
    		WHEN SUM(num) >= 50 THEN 'Classic'
    		ELSE 'Trendy'
    	END) AS popularity_type
    FROM baby_names
    GROUP BY first_name
    ORDER BY first_name
    LIMIT 5;

    2. Find the top 20 male names

    What were the top 20 male names overall, and how did the name Paul rank? Save your query as a DataFrame top_20 with three columns: name_rank, first_name, and sum.

    Spinner
    DataFrameavailable as
    top_20
    variable
    -- What were the top 20 male names overall, and how did the name Paul rank?
    
    SELECT 
    	RANK() OVER (ORDER BY SUM(num) DESC) AS name_rank, 
    	first_name, 
    	SUM(num) AS sum
    FROM baby_names
    WHERE sex = 'M'
    GROUP BY first_name
    ORDER BY name_rank
    LIMIT 20;

    3. Find female names that appear in both 1920 and 2020

    Which female names appeared in both 1920 and 2020? Save your query as a DataFrame a_names with two columns: first_name, and total_occurrences.

    Spinner
    DataFrameavailable as
    a_names_2
    variable
    -- Which female names appeared in both 1920 and 2020?
    
    SELECT first_name, SUM(num) AS total_occurrences
    FROM baby_names
    WHERE sex = 'F' AND year IN ('1920', '2020')
    GROUP BY first_name
    HAVING COUNT(DISTINCT year) = 2
    ORDER BY total_occurrences DESC
    Spinner
    DataFrameavailable as
    a_names
    variable
    -- Alternative solution for query above
    -- Select first name and total occurrences
    
    SELECT a.first_name, (a.num + b.num) AS total_occurrences
    FROM baby_names AS a
    JOIN baby_names AS b
    -- Join on first name
    ON a.first_name = b.first_name
    -- Filter for the years 1920 and 2020 and sex equals 'F'
    WHERE a.year = 1920 AND a.sex = 'F'
    AND b.year = 2020 AND b.sex = 'F'
    ORDER BY total_occurrences DESC;
    Spinner
    DataFrameavailable as
    df1
    variable
    [47]
    -- take a sample from results of an example from the 3rd result ("Emma")
    
    SELECT first_name, SUM(num)
    FROM baby_names
    WHERE first_name = 'Emma' AND year = 2020
    GROUP BY first_name
    Spinner
    DataFrameavailable as
    df2
    variable
    SELECT first_name, SUM(num)
    FROM baby_names
    WHERE first_name = 'Emma' AND year = 1920
    GROUP BY first_name