Data Manipulation with pandas

What's the point of pandas?

Data manipulation skill track
Data visualization skill track

Course outline

DataFrames

Sorting and subsetting
Creating new columns

Aggregating Data

Summary statistics
Counting
Grouped summary statistics

Slicing and Indexing Data

Subsetting using slicing
Indexes and subsetting using indexes

Creating and Visualizing Data

Plotting
Handling missing data
Reading data into a DataFrame

Chapter 1: Introducing DataFrames

pandas built on NumPy and Matplotlib.
pandas is popular.

pandas DataFrames

print(dogs)

Exploring a DataFrame

dogs.head() dogs.info() dogs.shape dogs.describe()

dogs.values -> 2 boyutlu bir NumPy dizisindeki veri değerlerini içerir.

dogs.columns -> sütun adlarını içerir

dogs.index -> satır numaralarını veya satır adlarını içerir

output -> RangeIndex(start=0, stop=7, step=1)

pandas Philosophy

There should be one -- and preferably only one -- obvious way to do it. -The Zen of Python by Tim Peters, Item 13
Bu bakımdan pandalar İsviçre Çakısı gibidir.

Sorting and subsetting

Sorting

dogs.sort_values("weight_kg")

From lightest dog to heaviest dog.

Sorting in descending order

dogs.sort_values("weight_kg", ascending=False)

From heaviest dog to lightest dog.

Sorting by multiple variables

dogs.sort_values(["weight_kg", "height_cm])"

Sort first by weight, then by height.
If their weight is same, sort from shortest to tallest.

dogs.sort_values(["weight_kg", "height_cm], ascending=[True, False])"

Tallest to shortest.

Subsetting columns

dogs["name"]

Subsetting multiple columns

dogs[["breed", "height_cm"]]
cols_to_subset = ["breed", "height_cm"] dogs[cols_to_subset]

Each way giving same result

Subsetting rows

dogs["height_cm"] > 50

True or False value for every row.

dogs[dogs["height_cm"] > 50]

All of the dogs taller than 50 cm with infos.

dogs[dogs["breed"] == "Labrador"]

Subset rows based on text data.

dogs[dogs["date_of_birth"] < "2015-01-01"]

Filter all the dogs born before 2015.

Subsetting based on multiple conditions

is_lab = dogs["breed"] == "Labrador"

is_brown = dogs["color"] == "Brown"

dogs[is_lab § is_brown]

Also do this in one line of code

dogs[ (dogs["breed"] == "Labrador") § (dogs["color"] == "Brown") ]

Subsetting using .isin()

is_black_or_brown = dogs["color"].isin(["Black, "Brown])

dogs[is_black_or_brown]

filter on multiple values of categorical variable.

Creating new columns

dogs["height_m"] = dogs["height_cm"] / 100

Doggy mass index

BMI = weight in kg / (height in m) ** 2

dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

Multiple manipulations

bmi_lt_100 = dogs[dogs["bmi"] < 100]

bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)

bmi_lt_100_height[["name", "height_cm", "bmi"]]

Chapter 2: Aggregating Data

Summary statistics

Summary statistics, as follows from their name, are numbers that summarize and tell you about your data set.

Summarizing numerical data

dogs["height_cm"].mean()

.median(), .mode()
.min(), .max()
.var(), std()
.sum()
.quantile()

Summarizing dates

oldest dog:

dogs["date_of_birth"].min()

youngest dog:

dogs["date_of_birth"].max()

The .agg() method

The aggregate, or agg, method allows you to compute custom summary statistics.

def pct30(column):

 return column.quantile(0.3)

We create a function called pct30 that computes the thirtieth percentile of a DataFrame column.

dogs["weight_kg"].agg(pct30)

It gives us the thirtieth percentile of the dogs' weights.

Summaries on multiple columns

dogs[["weight_kg", "height_cm"]].agg(pct30)

Multiple summaries

def pct40(column):

 return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])

Thirtieth and fortieth percentiles of the dogs' weights.

Cumulative sum

dogs["weight_kg"]

dogs["weight_kg"].cumsum()

Cummulative statistics

.cummax()
.cummin()
.cumprod()

4 hidden cells

Counting

Dropping duplicate names

vet_visits.drop_duplicate(subset="name")

Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name","breed"])

Easy as 1,2,3

unique_dogs["breed"].value_counts()

unique_dogs["breed"].value_counts(sort=True)

Proportions

unique_dogs["breed"].value_counts(normalize=True)

2 hidden cells

Grouped summary statistics

Summaries by group

dogs[dogs["color"] == "Black"]["weight_kg"].mean()

Grouped summaries

dogs.groupby("color")["weight_kg"].mean()

Multiple grouped summaries

dogs.groupby("color")["weight_kg"].agg([min,max,sum])

Grouping by multiple varibles

dogs.groupby(["color", "breed"])["weight_kg"].mean()

Many groups, many summaries

dogs.groupby(["color", "breed"])[["weight_kg"]].mean()

4 hidden cells

Pivot tables

dogs.groupby("color")["weight_kg"].mean()

Same thing using the pivot_table method.

dogs.pivot_table(values="weight_kg",

                    index="color")

Different statistics

import numpy as np

dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Filling missing values in pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

Summing with pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)

2 hidden cells

Chapter 3: Slicing and Indexing Data

Explicit indexes

Setting a column as the index dogs_ind = dogs.set_index("name")

Removing an index dogs_ind.reset_index()

Dropping an index dogs_ind.reset_index(drop=True)

Indexes make subsetting simpler dogs[dogs["name"].isin(["Bella, Stella])]"

Index values don't need to be unique dogs_ind2 = dogs.set_index("breed")

Index values are just data
Indexes violate "tidy data" principles

‌
‌
‌

Data Manipulation with pandas

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Data Manipulation with pandas

What's the point of pandas?

Course outline

Chapter 1: Introducing DataFrames

pandas DataFrames

pandas Philosophy

Sorting and subsetting

Sorting

Sorting in descending order

Sorting by multiple variables

Subsetting columns

Subsetting multiple columns

Subsetting rows

Subsetting based on multiple conditions

Subsetting using .isin()

Creating new columns

Doggy mass index

Multiple manipulations

Chapter 2: Aggregating Data

Summary statistics

Summarizing numerical data

Summarizing dates

The .agg() method

Summaries on multiple columns

Multiple summaries

Cumulative sum

Cummulative statistics

Counting

Dropping duplicate names

Dropping duplicate pairs

Easy as 1,2,3

Proportions

Grouped summary statistics

Summaries by group

Grouped summaries

Multiple grouped summaries

Grouping by multiple varibles

Many groups, many summaries

Pivot tables

Different statistics

Filling missing values in pivot tables

Summing with pivot tables

Chapter 3: Slicing and Indexing Data

Explicit indexes

Data Manipulation with pandas