Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
1 hidden cell
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
# Methods and attributes for understanding a DataFrame
.head() returns the first few rows (the “head” of the DataFrame).
.info() shows information on each of the columns, such as the data type and number of missing values.
.shape returns the number of rows and columns of the DataFrame.
.describe() calculates a few summary statistics for each column.
.values: A two-dimensional NumPy array of values.
.columns: An index of columns: the column names.
.index: An index for the rows: either row numbers or row names.
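The methods and attributes listed above can be tried on any small DataFrame. The sketch below uses a hypothetical toy table named dogs (not one of the course datasets) purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56.0, 43.0, None],
    "weight_kg": [24, 24, 23],
})

print(dogs.head(2))     # first two rows
dogs.info()             # dtypes and non-null counts per column
print(dogs.shape)       # (rows, columns) tuple
print(dogs.describe())  # summary statistics for the numeric columns
print(dogs.values)      # underlying two-dimensional NumPy array
print(dogs.columns)     # column labels
print(dogs.index)       # row labels (a RangeIndex here)
```

Note that .shape, .values, .columns, and .index are attributes, so they are written without parentheses.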
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
- What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
- Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
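One possible approach to the first exercise is a grouped maximum followed by a descending sort. The sketch below runs on a hypothetical stand-in table, walmart_toy, and assumes the sales column is called weekly_sales; check the real column names with walmart.columns before adapting it.

```python
import pandas as pd

# Hypothetical stand-in for the walmart DataFrame;
# the column name weekly_sales is an assumption
walmart_toy = pd.DataFrame({
    "department": [1, 1, 2, 2, 3, 4, 5, 6],
    "weekly_sales": [100.0, 250.0, 80.0, 90.0, 300.0, 40.0, 500.0, 10.0],
})

# Highest weekly sales per department, largest first, top five only
top5 = (
    walmart_toy.groupby("department")["weekly_sales"]
    .max()
    .sort_values(ascending=False)
    .head(5)
)
print(top5)
```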
# Sort rows by a specific column in ascending order
df.sort_values('column_name', ascending=True, inplace=True)
# Sort rows by multiple columns in descending order
df.sort_values(['column_name1', 'column_name2'], ascending=[False, False], inplace=True)
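As an alternative to inplace=True, sort_values can return a new DataFrame and leave the original untouched, which is often easier to reason about. A minimal sketch on hypothetical toy data (df_toy and its columns are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame({
    "city": ["Rome", "Toronto", "Rome", "Toronto"],
    "temp_c": [18, 3, 12, 9],
})

# Sort by city ascending, then by temperature descending within each city;
# the result is a new DataFrame and df_toy keeps its original order
sorted_df = df_toy.sort_values(["city", "temp_c"], ascending=[True, False])
print(sorted_df)
```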
# Subset a single column
df['column_name']
# Subset multiple columns
df[['column_name1', 'column_name2']]
# Subset rows where a specific column meets a condition
subset = df[df['column_name'] > 10]
# Subset rows where multiple columns meet conditions
subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value')]
# Subset rows where a specific column has certain categorical values
subset = df[df['column_name'].isin(['value1', 'value2', 'value3'])]
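The filtering patterns above can be combined with | (or) and ~ (not) as well as & (and). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on a hypothetical pets table:

```python
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "weight_kg": [24, 9, 26, 31],
})

# Both conditions must hold
heavy_labs = pets[(pets["weight_kg"] > 20) & (pets["breed"] == "Labrador")]

# Either condition may hold
small_or_poodle = pets[(pets["weight_kg"] < 10) | (pets["breed"] == "Poodle")]

# Negate an isin() mask to exclude categories
not_labs = pets[~pets["breed"].isin(["Labrador"])]

print(heavy_labs)
print(small_or_poodle)
print(not_labs)
```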
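# Add a new column (values is a placeholder for a list, array, or Series with one entry per row)
df['new_column_name'] = values
# Add a column computed from two existing columns
df['total'] = df['column1'] + df['column2']
# Add a column that scales column1 by its mean
df['average'] = df['column1'] / df['column1'].mean()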
Use Case for data manipulation
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)
# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# See the result
print(result)
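The same four steps can also be written as a single method chain, which avoids the intermediate variables. This is a sketch on a hypothetical three-row table, homelessness_toy, with made-up numbers; only the column names follow the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the homelessness DataFrame (made-up numbers)
homelessness_toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "individuals": [2500, 500, 9000],
    "state_pop": [1_000_000, 1_000_000, 3_000_000],
})

result_chained = (
    homelessness_toy
    # add the per-10k column without modifying the original DataFrame
    .assign(indiv_per_10k=lambda d: 10000 * d["individuals"] / d["state_pop"])
    # keep rows above the threshold
    .query("indiv_per_10k > 20")
    # largest values first
    .sort_values("indiv_per_10k", ascending=False)
    # keep only the columns of interest
    [["state", "indiv_per_10k"]]
)
print(result_chained)
```

Unlike the step-by-step version, assign() returns a copy, so the source DataFrame does not gain the new column.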
# Import the necessary libraries
import pandas as pd
# Create a DataFrame
data = {'column_name': [5, 10, 15, 20, 25],
        'column_name1': [5, 10, 15, 20, 25],
        'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
df = pd.DataFrame(data)
# Subset rows where a specific column meets a condition
subset = df[df['column_name'] > 10]
# Subset rows where multiple columns meet conditions
subset = df[(df['column_name1'] > 10) & (df['column_name2'] == 'value3')]
# Subset rows where a specific column has certain categorical values
subset = df[df['column_name2'].isin(['value1', 'value2', 'value3'])]
# Create a new column with calculated values
df['new_column_name'] = df['column_name'] * 2
# Perform arithmetic operations on columns
df['total'] = df['column_name'] + df['column_name1']
# Scale column_name by its mean
df['average'] = df['column_name'] / df['column_name'].mean()
# Manipulate data using pandas functions
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# Import the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
# Create a DataFrame
data = {'column_name': [5, 10, 15, 20, 25],
        'column_name1': [5, 10, 15, 20, 25],
        'column_name2': ['value1', 'value2', 'value3', 'value4', 'value5']}
df = pd.DataFrame(data)
# Calculate descriptive statistics
mean = df['column_name'].mean()
median = df['column_name'].median()
mode = df['column_name'].mode().values[0]
std_dev = df['column_name'].std()
variance = df['column_name'].var()
# Perform hypothesis testing (one-sample t-test against a population mean of 0)
t_stat, p_value = stats.ttest_1samp(df['column_name'], 0)
# Perform correlation analysis (Pearson correlation by default)
correlation = df['column_name'].corr(df['column_name1'])
# Perform linear regression; reg_p_value keeps the t-test p_value from being overwritten
slope, intercept, r_value, reg_p_value, std_err = stats.linregress(df['column_name'], df['column_name1'])