Skip to content

Data Manipulation with pandas

What's the point of pandas?

  • Data manipulation skill track
  • Data visualization skill track

Course outline

  1. DataFrames
  • Sorting and subsetting
  • Creating new columns
  1. Aggregating Data
  • Summary statistics
  • Counting
  • Grouped summary statistics
  1. Slicing and Indexing Data
  • Subsetting using slicing
  • Indexes and subsetting using indexes
  1. Creating and Visualizing Data
  • Plotting
  • Handling missing data
  • Reading data into a DataFrame

Chapter 1: Introducing DataFrames

  • pandas built on NumPy and Matplotlib.
  • pandas is popular.
pandas DataFrames

print(dogs)

Exploring a DataFrame

dogs.head() dogs.info() dogs.shape dogs.describe()

dogs.values -> 2 boyutlu bir NumPy dizisindeki veri değerlerini içerir.

dogs.columns -> sütun adlarını içerir

dogs.index -> satır numaralarını veya satır adlarını içerir

output -> RangeIndex(start=0, stop=7, step=1)

pandas Philosophy
  • There should be one -- and preferably only one -- obvious way to do it. -The Zen of Python by Tim Peters, Item 13

  • Bu bakımdan pandalar İsviçre Çakısı gibidir.

Sorting and subsetting

Sorting

dogs.sort_values("weight_kg")

  • From lightest dog to heaviest dog.
Sorting in descending order

dogs.sort_values("weight_kg", ascending=False)

  • From heaviest dog to lightest dog.
Sorting by multiple variables

dogs.sort_values(["weight_kg", "height_cm])"

  • Sort first by weight, then by height.
  • If their weight is same, sort from shortest to tallest.

dogs.sort_values(["weight_kg", "height_cm], ascending=[True, False])"

  • Tallest to shortest.
Subsetting columns

dogs["name"]

Subsetting multiple columns
  1. dogs[["breed", "height_cm"]]

  2. cols_to_subset = ["breed", "height_cm"] dogs[cols_to_subset]

  • Each way giving same result
Subsetting rows

dogs["height_cm"] > 50

  • True or False value for every row.

dogs[dogs["height_cm"] > 50]

  • All of the dogs taller than 50 cm with infos.

dogs[dogs["breed"] == "Labrador"]

  • Subset rows based on text data.

dogs[dogs["date_of_birth"] < "2015-01-01"]

  • Filter all the dogs born before 2015.
Subsetting based on multiple conditions

is_lab = dogs["breed"] == "Labrador"

is_brown = dogs["color"] == "Brown"

dogs[is_lab § is_brown]

  • Also do this in one line of code

dogs[ (dogs["breed"] == "Labrador") § (dogs["color"] == "Brown") ]

Subsetting using .isin()

is_black_or_brown = dogs["color"].isin(["Black, "Brown])

dogs[is_black_or_brown]

  • filter on multiple values of categorical variable.

Creating new columns

dogs["height_m"] = dogs["height_cm"] / 100

Doggy mass index

BMI = weight in kg / (height in m) ** 2

dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

Multiple manipulations

bmi_lt_100 = dogs[dogs["bmi"] < 100]

bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)

bmi_lt_100_height[["name", "height_cm", "bmi"]]

Chapter 2: Aggregating Data

Summary statistics

Summary statistics, as follows from their name, are numbers that summarize and tell you about your data set.

Summarizing numerical data

dogs["height_cm"].mean()

  • .median(), .mode()
  • .min(), .max()
  • .var(), std()
  • .sum()
  • .quantile()
Summarizing dates

oldest dog:

dogs["date_of_birth"].min()

youngest dog:

dogs["date_of_birth"].max()

The .agg() method
  • The aggregate, or agg, method allows you to compute custom summary statistics.

def pct30(column):

return column.quantile(0.3)
  • We create a function called pct30 that computes the thirtieth percentile of a DataFrame column.

dogs["weight_kg"].agg(pct30)

  • It gives us the thirtieth percentile of the dogs' weights.
Summaries on multiple columns

dogs[["weight_kg", "height_cm"]].agg(pct30)

Multiple summaries

def pct40(column):

return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])

  • Thirtieth and fortieth percentiles of the dogs' weights.
Cumulative sum

dogs["weight_kg"]

dogs["weight_kg"].cumsum()

Cummulative statistics
  • .cummax()
  • .cummin()
  • .cumprod()

4 hidden cells

Counting

Dropping duplicate names

vet_visits.drop_duplicate(subset="name")

Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name","breed"])

Easy as 1,2,3

unique_dogs["breed"].value_counts()

unique_dogs["breed"].value_counts(sort=True)

Proportions

unique_dogs["breed"].value_counts(normalize=True)


2 hidden cells

Grouped summary statistics

Summaries by group
  • dogs[dogs["color"] == "Black"]["weight_kg"].mean()
Grouped summaries
  • dogs.groupby("color")["weight_kg"].mean()
Multiple grouped summaries
  • dogs.groupby("color")["weight_kg"].agg([min,max,sum])
Grouping by multiple varibles
  • dogs.groupby(["color", "breed"])["weight_kg"].mean()
Many groups, many summaries
  • dogs.groupby(["color", "breed"])[["weight_kg"]].mean()

4 hidden cells
Pivot tables

dogs.groupby("color")["weight_kg"].mean()

  • Same thing using the pivot_table method.

dogs.pivot_table(values="weight_kg",

index="color")
Different statistics

import numpy as np

dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Filling missing values in pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

Summing with pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)


2 hidden cells

Chapter 3: Slicing and Indexing Data

Explicit indexes

Setting a column as the index dogs_ind = dogs.set_index("name")

Removing an index dogs_ind.reset_index()

Dropping an index dogs_ind.reset_index(drop=True)

Indexes make subsetting simpler dogs[dogs["name"].isin(["Bella, Stella])]"

Index values don't need to be unique dogs_ind2 = dogs.set_index("breed")

  • Index values are just data
  • Indexes violate "tidy data" principles