Skip to content
Data Manipulation with pandas
Data Manipulation with pandas
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")[18]
avocado.head()Mean & Median
- The mean weekly sales amount is almost double the median weekly sales amount! This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.
Agg()
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example:
[7]
def diff(column):
return column.max() - column.min()
avocado[['avg_price', 'nb_sold']].agg([diff, np.median])cumsum(), cummax()
- Use the
.cumsum()method on a column to get a cumulative sum. - Use the
.cummax()method on a column to get the cumulative maximum.
.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes are ignored.
Parameters:
- subset: subsetcolumn label or sequence of labels, optional, Only consider certain columns for identifying duplicates, by default use all of the columns. - keep: {‘first’, ‘last’, False}, default ‘first’, Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates. - inplace: bool, default False, Whether to modify the DataFrame rather than creating a new one. - ignore_index: bool, default False, If True, the resulting axis will be labeled 0, 1, …, n - 1.
# Drop duplicate avg_price/nb_sold combinations
avocado_value = avocado.drop_duplicates(subset=['avg_price', 'nb_sold']).count_values()
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Parameters:
- normalize: bool, default False, If True then the object returned will contain the relative frequencies of the unique values.
- sort: bool, default True, Sort by frequencies.
- ascending: bool, default False, Sort in ascending order. bins: int, optional, ather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
- dropna: bool, default True, Don’t include counts of NaN.
print(avocado['type'].value_counts())
print(avocado['type'].value_counts(sort=True, normalize=True))group_by(), pivot_table()
- pivot_table aggfunc takes mean as default
[16]
avocado.groupby("type")[['avg_price', 'nb_sold']].agg([np.min, np.max, np.mean, np.median])[17]
pd.pivot_table(data=avocado, values=['avg_price'], index=['type'], aggfunc=np.mean)