Skip to content
Data Manipulation with pandas
  • AI Chat
  • Code
  • Report
  • Data Manipulation with pandas

    What's the point of pandas?

    • Data manipulation skill track
    • Data visualization skill track

    Course outline

    1. DataFrames
    • Sorting and subsetting
    • Creating new columns
    1. Aggregating Data
    • Summary statistics
    • Counting
    • Grouped summary statistics
    1. Slicing and Indexing Data
    • Subsetting using slicing
    • Indexes and subsetting using indexes
    1. Creating and Visualizing Data
    • Plotting
    • Handling missing data
    • Reading data into a DataFrame

    Chapter 1: Introducing DataFrames

    • pandas built on NumPy and Matplotlib.
    • pandas is popular.
    pandas DataFrames


    Exploring a DataFrame

    dogs.head() dogs.shape dogs.describe()

    dogs.values -> 2 boyutlu bir NumPy dizisindeki veri değerlerini içerir.

    dogs.columns -> sütun adlarını içerir

    dogs.index -> satır numaralarını veya satır adlarını içerir

    output -> RangeIndex(start=0, stop=7, step=1)

    pandas Philosophy
    • There should be one -- and preferably only one -- obvious way to do it. -The Zen of Python by Tim Peters, Item 13

    • Bu bakımdan pandalar İsviçre Çakısı gibidir.

    Sorting and subsetting



    • From lightest dog to heaviest dog.
    Sorting in descending order

    dogs.sort_values("weight_kg", ascending=False)

    • From heaviest dog to lightest dog.
    Sorting by multiple variables

    dogs.sort_values(["weight_kg", "height_cm])"

    • Sort first by weight, then by height.
    • If their weight is same, sort from shortest to tallest.

    dogs.sort_values(["weight_kg", "height_cm], ascending=[True, False])"

    • Tallest to shortest.
    Subsetting columns


    Subsetting multiple columns
    1. dogs[["breed", "height_cm"]]

    2. cols_to_subset = ["breed", "height_cm"] dogs[cols_to_subset]

    • Each way giving same result
    Subsetting rows

    dogs["height_cm"] > 50

    • True or False value for every row.

    dogs[dogs["height_cm"] > 50]

    • All of the dogs taller than 50 cm with infos.

    dogs[dogs["breed"] == "Labrador"]

    • Subset rows based on text data.

    dogs[dogs["date_of_birth"] < "2015-01-01"]

    • Filter all the dogs born before 2015.
    Subsetting based on multiple conditions

    is_lab = dogs["breed"] == "Labrador"

    is_brown = dogs["color"] == "Brown"

    dogs[is_lab § is_brown]

    • Also do this in one line of code

    dogs[ (dogs["breed"] == "Labrador") § (dogs["color"] == "Brown") ]

    Subsetting using .isin()

    is_black_or_brown = dogs["color"].isin(["Black, "Brown])


    • filter on multiple values of categorical variable.

    Creating new columns

    dogs["height_m"] = dogs["height_cm"] / 100

    Doggy mass index

    BMI = weight in kg / (height in m) ** 2

    dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

    Multiple manipulations

    bmi_lt_100 = dogs[dogs["bmi"] < 100]

    bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)

    bmi_lt_100_height[["name", "height_cm", "bmi"]]

    Chapter 2: Aggregating Data

    Summary statistics

    Summary statistics, as follows from their name, are numbers that summarize and tell you about your data set.

    Summarizing numerical data


    • .median(), .mode()
    • .min(), .max()
    • .var(), std()
    • .sum()
    • .quantile()
    Summarizing dates

    oldest dog:


    youngest dog:


    The .agg() method
    • The aggregate, or agg, method allows you to compute custom summary statistics.

    def pct30(column):

    return column.quantile(0.3)
    • We create a function called pct30 that computes the thirtieth percentile of a DataFrame column.


    • It gives us the thirtieth percentile of the dogs' weights.
    Summaries on multiple columns

    dogs[["weight_kg", "height_cm"]].agg(pct30)

    Multiple summaries

    def pct40(column):

    return column.quantile(0.4)

    dogs["weight_kg"].agg([pct30, pct40])

    • Thirtieth and fortieth percentiles of the dogs' weights.
    Cumulative sum



    Cummulative statistics
    • .cummax()
    • .cummin()
    • .cumprod()


    Dropping duplicate names


    Dropping duplicate pairs

    unique_dogs = vet_visits.drop_duplicates(subset=["name","breed"])

    Easy as 1,2,3





    Grouped summary statistics

    Summaries by group
    • dogs[dogs["color"] == "Black"]["weight_kg"].mean()
    Grouped summaries
    • dogs.groupby("color")["weight_kg"].mean()
    Multiple grouped summaries
    • dogs.groupby("color")["weight_kg"].agg([min,max,sum])
    Grouping by multiple varibles
    • dogs.groupby(["color", "breed"])["weight_kg"].mean()
    Many groups, many summaries
    • dogs.groupby(["color", "breed"])[["weight_kg"]].mean()
    Pivot tables


    • Same thing using the pivot_table method.


    Different statistics

    import numpy as np

    dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

    Filling missing values in pivot tables

    dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

    Summing with pivot tables

    dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)

    Chapter 3: Slicing and Indexing Data

    Explicit indexes

    Setting a column as the index dogs_ind = dogs.set_index("name")

    Removing an index dogs_ind.reset_index()

    Dropping an index dogs_ind.reset_index(drop=True)

    Indexes make subsetting simpler dogs[dogs["name"].isin(["Bella, Stella])]"

    Index values don't need to be unique dogs_ind2 = dogs.set_index("breed")

    • Index values are just data
    • Indexes violate "tidy data" principles