Data Manipulation with pandas

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

[18]

avocado.head()

Mean & Median

The mean weekly sales amount is almost double the median weekly sales amount! This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.

Agg()

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example:

[7]

def diff(column):
    return column.max() - column.min()

avocado[['avg_price', 'nb_sold']].agg([diff, np.median])

cumsum(), cummax()

Use the .cumsum() method on a column to get a cumulative sum.
Use the .cummax() method on a column to get the cumulative maximum.

.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters:

- subset: subsetcolumn label or sequence of labels, optional, Only consider certain columns for identifying duplicates, by default use all of the columns.

- keep: {‘first’, ‘last’, False}, default ‘first’, Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates.

- inplace: bool, default False, Whether to modify the DataFrame rather than creating a new one.

- ignore_index: bool, default False, If True, the resulting axis will be labeled 0, 1, …, n - 1.

# Drop duplicate avg_price/nb_sold combinations
avocado_value = avocado.drop_duplicates(subset=['avg_price', 'nb_sold'])

.count_values()

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:

normalize: bool, default False, If True then the object returned will contain the relative frequencies of the unique values.
sort: bool, default True, Sort by frequencies.
ascending: bool, default False, Sort in ascending order. bins: int, optional, ather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
dropna: bool, default True, Don’t include counts of NaN.

print(avocado['type'].value_counts())

print(avocado['type'].value_counts(sort=True, normalize=True))

group_by(), pivot_table()

pivot_table aggfunc takes mean as default

[16]

avocado.groupby("type")[['avg_price', 'nb_sold']].agg([np.min, np.max, np.mean, np.median])

[17]

pd.pivot_table(data=avocado, values=['avg_price'], index=['type'], aggfunc=np.mean)

‌
‌
‌

Data Manipulation with pandas

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Data Manipulation with pandas

Mean & Median

Agg()

cumsum(), cummax()

.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

.count_values()

group_by(), pivot_table()

Data Manipulation with pandas