Introduction to Databases in Python (SQLAlchemy course)
    Understanding how to work with databases is a crucial skill for software development, especially when dealing with applications that require persistent data storage and retrieval. The course Introduction to Databases in Python would be useful for anyone looking to build software with Python, as it provides the necessary knowledge to interact with databases effectively.

    For those specifically interested in software development with Python, the following tracks would be most relevant:

    • Python Programming: This track is designed to improve your Python programming skills and covers a range of topics from optimizing code to understanding software engineering concepts. It's a good fit for someone looking to enhance their programming capabilities in Python.

    • Python Developer: This track is tailored for those aiming to become Python developers. It includes learning how to manipulate data, write efficient Python code, and work with various Python libraries that are useful for tasks such as web development, data analysis, and task automation.

    • Software Engineering for Data Scientists in Python (Practice): This practice module focuses on software engineering principles that are important for data scientists, which would also be beneficial for software developers.

    These tracks and practices can help you build a strong foundation in Python programming and software development principles, which, when combined with database skills, will enable you to create robust and scalable software applications.

    Create a result set from the census table in the census.sqlite dataset

    # Import the necessary libraries
    from sqlalchemy import create_engine, MetaData, Table, select
    
    # Create an engine to connect to the SQLite database - this is a common interface for the database
    # In general, connection strings have the form "dialect+driver://username:password@host:port/database"
    engine = create_engine('sqlite:///datasets/census.sqlite')
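    # For illustration only (hypothetical server and credentials), a PostgreSQL connection string could look like:
    # create_engine('postgresql+psycopg2://user:password@localhost:5432/census')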
    
    # Create a metadata object - this object will be populated using reflection
    metadata = MetaData()
    
    # Connect to the database using the engine - the connection object is what we use to execute statements, such as a select
    connection = engine.connect()
    
    # Print the table names
    print(engine.table_names())
    
    # Reflect the census table from the database using the metadata and engine
    census = Table('census', metadata, autoload=True, autoload_with=engine)
    
    # Print the column names
    print(census.columns.keys())
    
    # Print census table metadata
    print(repr(census))
    
    # Print census table metadata using the metadata.tables dictionary
    print(repr(metadata.tables['census']))
    
    # Build a select statement to retrieve all columns from the census table
    # The select function uses pythonic code to create an equivalent SQL query string that is compatible with whatever SQL dialect is used by the database
    stmt = select([census])
    
    # Execute the select statement on the database connection and fetch 10 records using fetchmany
    results1 = connection.execute(stmt).fetchmany(size=10)
    
    # connection.execute(stmt) gives a result proxy
    # The fetchmany method creates a result set by fetching the specified number of rows (size) from the result proxy.
    # It returns a list of tuples, where each tuple represents a row of data.
    # Other methods, such as .fetchall() and .first(), can be used to retrieve a different result set.
    # For example, fetchall retrieves all the remaining rows from the result set as a list of tuples.
    # The fetched records can be used for further processing or analysis.
    
    # Print the result set
    print(results1)
    
    results2 = connection.execute(stmt).fetchall()
    
    # Get the first row of results2
    first_row = results2[0]
    
    # Print the first column of the first row
    print(first_row[0])
    print(first_row['state'])
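    
    # As an illustrative aside (not part of the original exercise), .first() fetches just the first row of a result proxy
    first_result = connection.execute(stmt).first()
    print(first_result.state)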

    Show the values from particular columns only by iterating over the result proxy

    You can iterate over a result proxy without needing to create a result set. Iteration also allows you to show results from particular columns only, even if the executed select statement includes other columns or all columns.

    stmt = select([census])
    stmt = stmt.limit(10)
    
    results = connection.execute(stmt)
    
    for result in results:
        print(result.state, '-', result.age, 'year-old', end=' ')
        if result.sex == 'M':
            print('males', end=' ')
        elif result.sex == 'F':
            print('females', end=' ')
        print('- population:', result.pop2000)
        
    # Note that setting the 'end' parameter of the print function prevents the output from ending with a newline (the default end is '\n')

    Create a result set with particular columns only, and load it in a dataframe

    import pandas as pd
    
    stmt = select([census.columns.state, census.columns.age, census.columns.sex, census.columns.pop2000])
    
    results = connection.execute(stmt).fetchall()
    
    df = pd.DataFrame(results)
    
    # Set the column names of the DataFrame to be the columns (keys) from the results set
    df.columns = results[0].keys()
    # Note that this step may not be necessary, as the column names may be set automatically.
    
    df[:10]

    Use a where clause to summarise specific data only, aggregate one column for each distinct value in another column, and order the results to make them easier to understand

    from sqlalchemy import and_, or_, not_, func, desc
    
    # Define a list of states for which we don't want results
    states = ['New York', 'California', 'Texas']
    
    # Select the state and age columns but not the sex column, so we can group by state and age regardless of sex
    # Then take the sum of the pop2008 column, to show the total population for each age in each state (combining both sexes)
    stmt = select([census.columns.state, census.columns.age, func.sum(census.columns.pop2008).label('population')])
    
    # Note that func.count and other aggregation functions work similarly to func.sum
    # A label (alias) for the calculated column is used to clarify its meaning.
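    # As an illustrative sketch (not part of this exercise), func.count could be used to count the rows in each group:
    stmt_count_example = select([census.columns.state, func.count(census.columns.age).label('age_group_count')]).group_by(census.columns.state)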
    
    # Append a where clause to select only adults not from the specified states
    stmt = stmt.where(
        and_(not_(census.columns.state.in_(states)),
             census.columns.age >= 18
             )
    )
    
    # Note that ==, !=, <, <= and other comparison operators can be used in the same way as >=, and SQLAlchemy conjunctions such as or_ can be used alongside and_, not_ and in_
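    # As an illustrative sketch (not part of this exercise), or_ combines alternative conditions:
    stmt_or_example = select([census]).where(
        or_(census.columns.state == 'New York', census.columns.state == 'California')
    )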
    
    # Append a group_by clause to group by state then age, and an order_by clause to order by largest population first
    stmt = stmt.group_by(census.columns.state, census.columns.age).order_by(census.columns.state, desc('population'))
    
    # Note that the calculated column can be referred to by its label in quotes.
    
    # Note that multiple columns can be included in the group_by and order_by methods, and are applied in order from left to right
    
    results = connection.execute(stmt).fetchall()
    
    df = pd.DataFrame(results)
    
    # Set the column names of the DataFrame to be the columns (keys) from the results set
    df.columns = results[0].keys()
    # Note that this step may not be necessary, as the column names may be set automatically.
    
    df

    Count the number of distinct values and store the scalar result

    # Build a query to count distinct states values
    stmt = select([func.count(census.columns.state.distinct())]).where(not_(census.columns.state.in_(states)))
    
    # The ResultProxy has a method called .scalar() for getting just the value of a query that returns only one row and column. This can be very useful when you are querying for just a count or sum.
    
    # Execute the query and store the scalar result: distinct_state_count
    distinct_state_count = connection.execute(stmt).scalar()
    
    # Print the distinct_state_count
    print(distinct_state_count)

    Calculate a difference between two columns

    # Build query to return population change from 2000 to 2008
    stmt = select([census.columns.state, census.columns.age, census.columns.sex, (census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
    
    # As with aggregation function calculations covered above, a label (alias) for the calculated column is used to clarify its meaning.
    
    # Return only 5 results
    stmt_top5 = stmt.limit(5)
    
    results = connection.execute(stmt_top5).fetchall()
    
    print(results)
    
    # Build query to return state names by total population change from 2000 to 2008
    stmt = select([census.columns.state, func.sum(census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
    
    # Show the population difference for each state in total, with the largest differences at the top
    stmt_grouped = stmt.group_by(census.columns.state).order_by(desc('pop_change'))
    
    # Return only 5 results: stmt_top5
    stmt_top5 = stmt_grouped.limit(5)
    
    results = connection.execute(stmt_top5).fetchall()
    
    # Print the state and population change for each record
    for result in results:
        print('{}: {}'.format(result.state, result.pop_change))
        
    # result.state and result.pop_change are attributes (or fields) of each element in the results list. They represent columns retrieved from the database query.
    
    # '{}: {}'.format(result.state, result.pop_change) is a string formatting operation that substitutes {} with the values of result.state and result.pop_change, respectively. This allows the code to print out the state and population change in a structured manner.

    Calculate a percentage and store the scalar result

    # import case, cast and Float from sqlalchemy
    from sqlalchemy import case, cast, Float
    
    # Build an expression to calculate female population in 2000
    female_pop2000 = func.sum(case([(census.columns.sex == 'F', census.columns.pop2000)], else_=0))
    
    # We can use the case() expression to operate on data that meets specific criteria while not affecting the query as a whole. The case() expression accepts a list of tuples containing a condition to match and the column to return if the condition matches, followed by an else_ if none of the conditions match. We can wrap this entire expression in any function or math operation we like.
    
    # Cast an expression to calculate total population in 2000 to Float
    total_pop2000 = cast(func.sum(census.columns.pop2000), Float)
    
    # Often when performing integer division, we want to get a float back rather than an integer. While some databases will do this automatically, you can use the cast() function to convert an expression to a particular type.
    
    # Build a query to calculate the percentage of women in 2000
    stmt = select([female_pop2000 / total_pop2000 * 100])
    
    # Execute the query and store the scalar result: percent_female
    percent_female = connection.execute(stmt).scalar()
    
    # Print the percentage
    print(percent_female)
    
    # Note that in addition to simply calculating a single result, you could also add another column to the select statement and then group by it, to calculate the percentage of females in each group. Remember that the calculated column needs to be labelled after being wrapped in parentheses.
    
    stmt = select([census.columns.age, (female_pop2000 / total_pop2000 * 100).label('percent_female')])
    
    stmt = stmt.group_by(census.columns.age)
    
    percent_female = connection.execute(stmt).fetchall()
    
    # Print the percentage
    for result in percent_female:
        print(result.age, result.percent_female)
        
    df = pd.DataFrame(percent_female)
    
    df
    
    # A line plot of percent_female against age shows an upward-sloping curve (a plotting sketch follows).
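
    This is a minimal sketch of how the line plot could be drawn, assuming matplotlib is installed and reusing the percent_female results and DataFrame built above; the rendered plot itself isn't reproduced in this export.

    import matplotlib.pyplot as plt
    
    # Name the DataFrame columns from the result keys (otherwise they may default to integer positions)
    df.columns = percent_female[0].keys()
    
    # Plot the percentage of females in 2000 against age
    df.plot(x='age', y='percent_female', legend=False)
    plt.xlabel('age')
    plt.ylabel('percent female (2000)')
    plt.show()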