How have American baby name tastes changed since 1920? Which names have remained popular for over 100 years, and how do those names compare to more recent top baby names? These are considerations for many new parents, but the skills you'll practice while answering these queries are broadly applicable. After all, understanding trends and popularity is important for many businesses, too!
You'll be working with data provided by the United States Social Security Administration, which lists first names along with the number and sex of babies they were given to in each year. For processing speed purposes, the dataset is limited to first names which were given to over 5,000 American babies in a given year. The data spans 101 years, from 1920 through 2020.
The Data
baby_names
baby_names| column | type | description |
|---|---|---|
year | int | year |
first_name | varchar | first name |
sex | varchar | sex of babies given first_name |
num | int | number of babies of sex given first_name in that year |
-- Run this code to view the data in baby_names
SELECT *
FROM baby_names
LIMIT 5;List the first five names in alphabetical order and find out if each name is "Classic" or "Trendy." Save your query as a DataFrame name_types with three columns: first_name, sum, and popularity_type.
-- Use this table for the answer to question 1:
-- List the overall top five names in alphabetical order and find out if each name is "Classic" or "Trendy."
-- Classic they are if present in 50+ years
SELECT
first_name,
SUM(num),
CASE
WHEN COUNT(first_name) >= 50 THEN 'Classic'
ELSE 'Trendy'
END AS popularity_type
FROM public.baby_names
GROUP BY first_name
ORDER BY first_name
LIMIT 5
;What were the top 20 male names overall, and how did the name Paul rank? Save your query as a DataFrame top_20 with three columns: name_rank, first_name, and sum.
WITH ranked_names AS (
SELECT
first_name,
SUM(num) AS total
FROM baby_names
WHERE sex = 'M'
GROUP BY first_name
ORDER BY total DESC
),
top_20 AS (
SELECT
ROW_NUMBER() OVER () AS name_rank,
first_name,
total AS sum
FROM ranked_names
LIMIT 20
),
paul_rank AS (
SELECT
ROW_NUMBER() OVER () AS name_rank,
first_name,
total AS sum
FROM ranked_names
WHERE first_name = 'Paul'
)
SELECT * FROM top_20
/*
SELECT * FROM top_20
UNION ALL
SELECT * FROM paul_rank;*/Which female names appeared in both 1920 and 2020? Save your query as a DataFrame a_names with two columns: first_name, and total_occurrences.
WITH names_1920 AS (
SELECT first_name, SUM(num) AS total_occurrences_1920
FROM baby_names
WHERE year = 1920 AND sex = 'F'
GROUP BY first_name
),
names_2020 AS (
SELECT first_name, SUM(num) AS total_occurrences_2020
FROM baby_names
WHERE year = 2020 AND sex = 'F'
GROUP BY first_name
)
SELECT names_1920.first_name, (names_1920.total_occurrences_1920 + names_2020.total_occurrences_2020) AS total_occurrences
FROM names_1920
JOIN names_2020 ON names_1920.first_name = names_2020.first_name;