Photo credit: SeventyFour
What is a database?
Databases store and organize data electronically. The data is housed in objects called tables, with data organized into rows and columns (or records and fields).
A relational database defines relationships between tables of data inside a database.
Although they can look similar to a spreadsheet, databases can store more data, more securely, and allow multiple users to access the same database.
Our data
This project will use data from the Austin Animal Shelter. In this case, the data is saved to this DataLab workbook as a CSV file, but you'll query it just like you would a database!
Here is copy of the data scription from the Animal Shelter website and what you'd expect to find in the data:
Column Name | Description | Data Type |
---|---|---|
animal_id | Identifier | Text |
name | Animal name, if provided | Text |
datetime | Date of outcome | Floating Timestamp |
monthyear | Month and year of outcome | Floating Timestamp |
date_of_birth | Animal's date of birth | Floating Timestamp |
outcome_type | Details of the outcome | Text |
outcome_subtype | Additional outcome details | Text |
animal_type | Type of animal | Text |
sex_upon_outcome | Details on whether the animal was neutered or intact | Text |
age_upon_outcome | Age of animal at time of outcome | Text |
breed | Animal breed | Text |
color | Animal color | Text |
SELECT
animal_type,
EXTRACT(YEAR FROM datetime) AS adoption_year,
EXTRACT(MONTH FROM datetime) AS adoption_month,
COUNT(*) AS total_adoptions
FROM 'animal_shelter.csv'
WHERE outcome_type = 'Adoption'
GROUP BY animal_type, adoption_year, adoption_month
ORDER BY adoption_year, adoption_month, animal_type;
WITH AdoptionStats AS (
SELECT
animal_type,
EXTRACT(YEAR FROM datetime) AS adoption_year,
EXTRACT(MONTH FROM datetime) AS adoption_month,
COUNT(*) AS total_adoptions
FROM 'animal_shelter.csv'
WHERE outcome_type = 'Adoption'
GROUP BY animal_type, adoption_year, adoption_month
),
MonthlyStats AS (
SELECT
adoption_year,
adoption_month,
SUM(total_adoptions) AS monthly_total_adoptions
FROM AdoptionStats
GROUP BY adoption_year, adoption_month
ORDER BY monthly_total_adoptions DESC
LIMIT 5
)
SELECT
m.adoption_year,
m.adoption_month,
m.monthly_total_adoptions,
a.animal_type,
a.total_adoptions AS highest_animal_adoptions
FROM MonthlyStats AS m
JOIN AdoptionStats AS a
ON m.adoption_year = a.adoption_year
AND m.adoption_month = a.adoption_month
ORDER BY a.total_adoptions DESC
LIMIT 5;
Step 1: Explore the data
SELECT *
FROM 'animal_shelter.csv'
LIMIT 5;
Before you dig into the data, it's always a good idea to explore it to see if it is as expected. Does the data match what was outlined above? Try querying the data to view a snapshot of the first 10 rows.
SELECT COUNT(outcome_type) AS total_outcome_type, datetime, outcome_type
FROM 'animal_shelter.csv'
GROUP BY datetime, outcome_type
It's also helpful to understand how big the dataset is. How may rows (or records) are there in this data? Query the file to find out.
SELECT COUNT(*) AS total_rows
FROM 'animal_shelter.csv';
Your turn! Can you think of another way to count the rows (or records)?
WITH AdoptionStats AS (
SELECT
animal_type,
sex_upon_outcome,
COUNT(*) AS total_outcomes,
SUM(CASE WHEN outcome_type = 'Adoption' THEN 1 ELSE 0 END) AS adoption_count
FROM 'animal_shelter.csv'
GROUP BY animal_type, sex_upon_outcome
)
SELECT
animal_type,
sex_upon_outcome,
total_outcomes,
adoption_count,
(adoption_count * 1.0 / total_outcomes) AS adoption_probability
FROM AdoptionStats
ORDER BY adoption_probability DESC;