Skip to content

Data Manipulation with pandas

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
[18]
avocado.head()

Mean & Median

  • The mean weekly sales amount is almost double the median weekly sales amount! This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.

Agg()

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example:

[7]
def diff(column):
    return column.max() - column.min()

avocado[['avg_price', 'nb_sold']].agg([diff, np.median])

cumsum(), cummax()

  • Use the .cumsum() method on a column to get a cumulative sum.
  • Use the .cummax() method on a column to get the cumulative maximum.

.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters:

- subset: subsetcolumn label or sequence of labels, optional, Only consider certain columns for identifying duplicates, by default use all of the columns. - keep: {‘first’, ‘last’, False}, default ‘first’, Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates. - inplace: bool, default False, Whether to modify the DataFrame rather than creating a new one. - ignore_index: bool, default False, If True, the resulting axis will be labeled 0, 1, …, n - 1.
# Drop duplicate avg_price/nb_sold combinations
avocado_value = avocado.drop_duplicates(subset=['avg_price', 'nb_sold'])

.count_values()

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:

  • normalize: bool, default False, If True then the object returned will contain the relative frequencies of the unique values.
  • sort: bool, default True, Sort by frequencies.
  • ascending: bool, default False, Sort in ascending order. bins: int, optional, ather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
  • dropna: bool, default True, Don’t include counts of NaN.
print(avocado['type'].value_counts())

print(avocado['type'].value_counts(sort=True, normalize=True))

group_by(), pivot_table()

  • pivot_table aggfunc takes mean as default
[16]
avocado.groupby("type")[['avg_price', 'nb_sold']].agg([np.min, np.max, np.mean, np.median])
[17]
pd.pivot_table(data=avocado, values=['avg_price'], index=['type'], aggfunc=np.mean)