Data Manipulation with pandas
What's the point of pandas?
- Data manipulation skill track
- Data visualization skill track
Course outline
- DataFrames
- Sorting and subsetting
- Creating new columns
- Aggregating Data
- Summary statistics
- Counting
- Grouped summary statistics
- Slicing and Indexing Data
- Subsetting using slicing
- Indexes and subsetting using indexes
- Creating and Visualizing Data
- Plotting
- Handling missing data
- Reading data into a DataFrame
Chapter 1: Introducing DataFrames
- pandas built on NumPy and Matplotlib.
- pandas is popular.
pandas DataFrames
print(dogs)
Exploring a DataFrame
dogs.head()
dogs.info()
dogs.shape
dogs.describe()
dogs.values -> 2 boyutlu bir NumPy dizisindeki veri değerlerini içerir.
dogs.columns -> sütun adlarını içerir
dogs.index -> satır numaralarını veya satır adlarını içerir
output -> RangeIndex(start=0, stop=7, step=1)
pandas Philosophy
-
There should be one -- and preferably only one -- obvious way to do it. -The Zen of Python by Tim Peters, Item 13
-
Bu bakımdan pandalar İsviçre Çakısı gibidir.
Sorting and subsetting
Sorting
dogs.sort_values("weight_kg")
- From lightest dog to heaviest dog.
Sorting in descending order
dogs.sort_values("weight_kg", ascending=False)
- From heaviest dog to lightest dog.
Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm])"
- Sort first by weight, then by height.
- If their weight is same, sort from shortest to tallest.
dogs.sort_values(["weight_kg", "height_cm], ascending=[True, False])"
- Tallest to shortest.
Subsetting columns
dogs["name"]
Subsetting multiple columns
-
dogs[["breed", "height_cm"]] -
cols_to_subset = ["breed", "height_cm"] dogs[cols_to_subset]
- Each way giving same result
Subsetting rows
dogs["height_cm"] > 50
- True or False value for every row.
dogs[dogs["height_cm"] > 50]
- All of the dogs taller than 50 cm with infos.
dogs[dogs["breed"] == "Labrador"]
- Subset rows based on text data.
dogs[dogs["date_of_birth"] < "2015-01-01"]
- Filter all the dogs born before 2015.
Subsetting based on multiple conditions
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab § is_brown]
- Also do this in one line of code
dogs[ (dogs["breed"] == "Labrador") § (dogs["color"] == "Brown") ]
Subsetting using .isin()
is_black_or_brown = dogs["color"].isin(["Black, "Brown])
dogs[is_black_or_brown]
- filter on multiple values of categorical
variable.
Creating new columns
dogs["height_m"] = dogs["height_cm"] / 100
Doggy mass index
BMI = weight in kg / (height in m) ** 2
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
Multiple manipulations
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]
Chapter 2: Aggregating Data
Summary statistics
Summary statistics, as follows from their name, are numbers that summarize and tell you about your data set.
Summarizing numerical data
dogs["height_cm"].mean()
.median(),.mode().min(),.max().var(),std().sum().quantile()
Summarizing dates
oldest dog:
dogs["date_of_birth"].min()
youngest dog:
dogs["date_of_birth"].max()
The .agg() method
- The aggregate, or agg, method allows you to compute custom summary statistics.
def pct30(column):
return column.quantile(0.3)
- We create a function called pct30 that computes the thirtieth percentile of a DataFrame column.
dogs["weight_kg"].agg(pct30)
- It gives us the thirtieth percentile of the dogs' weights.
Summaries on multiple columns
dogs[["weight_kg", "height_cm"]].agg(pct30)
Multiple summaries
def pct40(column):
return column.quantile(0.4)
dogs["weight_kg"].agg([pct30, pct40])
- Thirtieth and fortieth percentiles of the dogs' weights.
Cumulative sum
dogs["weight_kg"]
dogs["weight_kg"].cumsum()
Cummulative statistics
.cummax().cummin().cumprod()
4 hidden cells
Counting
Dropping duplicate names
vet_visits.drop_duplicate(subset="name")
Dropping duplicate pairs
unique_dogs = vet_visits.drop_duplicates(subset=["name","breed"])
Easy as 1,2,3
unique_dogs["breed"].value_counts()
unique_dogs["breed"].value_counts(sort=True)
Proportions
unique_dogs["breed"].value_counts(normalize=True)
2 hidden cells
Grouped summary statistics
Summaries by group
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
Grouped summaries
dogs.groupby("color")["weight_kg"].mean()
Multiple grouped summaries
dogs.groupby("color")["weight_kg"].agg([min,max,sum])
Grouping by multiple varibles
dogs.groupby(["color", "breed"])["weight_kg"].mean()
Many groups, many summaries
dogs.groupby(["color", "breed"])[["weight_kg"]].mean()
4 hidden cells
Pivot tables
dogs.groupby("color")["weight_kg"].mean()
- Same thing using the pivot_table method.
dogs.pivot_table(values="weight_kg",
index="color")
Different statistics
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
Summing with pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
2 hidden cells
Chapter 3: Slicing and Indexing Data
Explicit indexes
Setting a column as the index
dogs_ind = dogs.set_index("name")
Removing an index
dogs_ind.reset_index()
Dropping an index
dogs_ind.reset_index(drop=True)
Indexes make subsetting simpler
dogs[dogs["name"].isin(["Bella, Stella])]"
Index values don't need to be unique
dogs_ind2 = dogs.set_index("breed")
- Index values are just data
- Indexes violate "tidy data" principles