Skip to content

                                                                                             Photo credit: SeventyFour

What is a database?

Databases store and organize data electronically. The data is housed in objects called tables, with data organized into rows and columns (or records and fields).

A relational database defines relationships between tables of data inside a database.

Although they can look similar to a spreadsheet, databases can store more data, more securely, and allow multiple users to access the same database.

Our data

This project will use data from the Austin Animal Shelter. In this case, the data is saved to this DataLab workbook as a CSV file, but you'll query it just like you would a database!

Here is copy of the data scription from the Animal Shelter website and what you'd expect to find in the data:

Column NameDescriptionData Type
animal_idIdentifierText
nameAnimal name, if providedText
datetimeDate of outcomeFloating Timestamp
monthyearMonth and year of outcomeFloating Timestamp
date_of_birthAnimal's date of birthFloating Timestamp
outcome_typeDetails of the outcomeText
outcome_subtypeAdditional outcome detailsText
animal_typeType of animalText
sex_upon_outcomeDetails on whether the animal was neutered or intactText
age_upon_outcomeAge of animal at time of outcomeText
breedAnimal breedText
colorAnimal colorText
Spinner
DataFrameas
df13
variable
SELECT 
    animal_type,
    EXTRACT(YEAR FROM datetime) AS adoption_year,
    EXTRACT(MONTH FROM datetime) AS adoption_month,
    COUNT(*) AS total_adoptions
FROM 'animal_shelter.csv'
WHERE outcome_type = 'Adoption'
GROUP BY animal_type, adoption_year, adoption_month
ORDER BY adoption_year, adoption_month, animal_type;
Spinner
DataFrameas
df15
variable
WITH AdoptionStats AS (
    SELECT 
        animal_type,
        EXTRACT(YEAR FROM datetime) AS adoption_year,
        EXTRACT(MONTH FROM datetime) AS adoption_month,
        COUNT(*) AS total_adoptions
    FROM 'animal_shelter.csv'
    WHERE outcome_type = 'Adoption'
    GROUP BY animal_type, adoption_year, adoption_month
),
MonthlyStats AS (
    SELECT 
        adoption_year,
        adoption_month,
        SUM(total_adoptions) AS monthly_total_adoptions
    FROM AdoptionStats
    GROUP BY adoption_year, adoption_month
    ORDER BY monthly_total_adoptions DESC
    LIMIT 5
)
SELECT 
    m.adoption_year,
    m.adoption_month,
    m.monthly_total_adoptions,
    a.animal_type,
    a.total_adoptions AS highest_animal_adoptions
FROM MonthlyStats AS m
JOIN AdoptionStats AS a
ON m.adoption_year = a.adoption_year
   AND m.adoption_month = a.adoption_month
ORDER BY a.total_adoptions DESC
LIMIT 5;

Step 1: Explore the data

Spinner
DataFrameas
df14
variable
SELECT * 
FROM 'animal_shelter.csv'
LIMIT 5;

Before you dig into the data, it's always a good idea to explore it to see if it is as expected. Does the data match what was outlined above? Try querying the data to view a snapshot of the first 10 rows.

Spinner
DataFrameas
animal_shelter
variable
SELECT COUNT(outcome_type) AS total_outcome_type, datetime, outcome_type
FROM 'animal_shelter.csv'
GROUP BY datetime, outcome_type

It's also helpful to understand how big the dataset is. How may rows (or records) are there in this data? Query the file to find out.

Spinner
DataFrameas
animal_shelter_2
variable
SELECT COUNT(*) AS total_rows
FROM 'animal_shelter.csv';

Your turn! Can you think of another way to count the rows (or records)?

Spinner
DataFrameas
animal_shelter_3
variable
WITH AdoptionStats AS (
    SELECT 
        animal_type,
        sex_upon_outcome,
        COUNT(*) AS total_outcomes,
        SUM(CASE WHEN outcome_type = 'Adoption' THEN 1 ELSE 0 END) AS adoption_count
    FROM 'animal_shelter.csv'
    GROUP BY animal_type, sex_upon_outcome
)
SELECT 
    animal_type,
    sex_upon_outcome,
    total_outcomes,
    adoption_count,
    (adoption_count * 1.0 / total_outcomes) AS adoption_probability
FROM AdoptionStats
ORDER BY adoption_probability DESC;