this is the nav!
Data Manipulation with pandas
• AI Chat
• Code
• Report
• ## .mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Data Manipulation with pandas

#### What's the point of pandas?

• Data manipulation skill track
• Data visualization skill track

#### Course outline

1. DataFrames
• Sorting and subsetting
• Creating new columns
1. Aggregating Data
• Summary statistics
• Counting
• Grouped summary statistics
1. Slicing and Indexing Data
• Subsetting using slicing
• Indexes and subsetting using indexes
1. Creating and Visualizing Data
• Plotting
• Handling missing data
• Reading data into a DataFrame

### Chapter 1: Introducing DataFrames

• pandas built on NumPy and Matplotlib.
• pandas is popular.
##### pandas DataFrames

`print(dogs)`

Exploring a DataFrame

`dogs.head()` `dogs.info()` `dogs.shape` `dogs.describe()`

`dogs.values` -> 2 boyutlu bir NumPy dizisindeki veri değerlerini içerir.

`dogs.columns` -> sütun adlarını içerir

`dogs.index` -> satır numaralarını veya satır adlarını içerir

output -> `RangeIndex(start=0, stop=7, step=1)`

##### pandas Philosophy
• There should be one -- and preferably only one -- obvious way to do it. -The Zen of Python by Tim Peters, Item 13

• Bu bakımdan pandalar İsviçre Çakısı gibidir.

#### Sorting and subsetting

##### Sorting

`dogs.sort_values("weight_kg")`

• From lightest dog to heaviest dog.
##### Sorting in descending order

`dogs.sort_values("weight_kg", ascending=False)`

• From heaviest dog to lightest dog.
##### Sorting by multiple variables

`dogs.sort_values(["weight_kg", "height_cm])"`

• Sort first by weight, then by height.
• If their weight is same, sort from shortest to tallest.

`dogs.sort_values(["weight_kg", "height_cm], ascending=[True, False])"`

• Tallest to shortest.
##### Subsetting columns

`dogs["name"]`

##### Subsetting multiple columns
1. `dogs[["breed", "height_cm"]]`

2. `cols_to_subset = ["breed", "height_cm"] dogs[cols_to_subset]`

• Each way giving same result
##### Subsetting rows

`dogs["height_cm"] > 50`

• True or False value for every row.

`dogs[dogs["height_cm"] > 50]`

• All of the dogs taller than 50 cm with infos.

`dogs[dogs["breed"] == "Labrador"]`

• Subset rows based on text data.

`dogs[dogs["date_of_birth"] < "2015-01-01"]`

• Filter all the dogs born before 2015.
##### Subsetting based on multiple conditions

`is_lab = dogs["breed"] == "Labrador"`

`is_brown = dogs["color"] == "Brown"`

`dogs[is_lab § is_brown]`

• Also do this in one line of code

`dogs[ (dogs["breed"] == "Labrador") § (dogs["color"] == "Brown") ]`

##### Subsetting using .isin()

`is_black_or_brown = dogs["color"].isin(["Black, "Brown])`

`dogs[is_black_or_brown]`

• filter on multiple values of categorical `variable.`

#### Creating new columns

`dogs["height_m"] = dogs["height_cm"] / 100`

##### Doggy mass index

BMI = weight in kg / (height in m) ** 2

`dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2`

##### Multiple manipulations

`bmi_lt_100 = dogs[dogs["bmi"] < 100]`

`bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending = False)`

`bmi_lt_100_height[["name", "height_cm", "bmi"]]`

### Chapter 2: Aggregating Data

#### Summary statistics

Summary statistics, as follows from their name, are numbers that summarize and tell you about your data set.

##### Summarizing numerical data

`dogs["height_cm"].mean()`

• `.median()`, `.mode()`
• `.min()`, `.max()`
• `.var(),` `std()`
• `.sum()`
• `.quantile()`
##### Summarizing dates

oldest dog:

`dogs["date_of_birth"].min()`

youngest dog:

`dogs["date_of_birth"].max()`

##### The .agg() method
• The aggregate, or agg, method allows you to compute custom summary statistics.

def pct30(column):

```.mfe-app-workspace-16mnz0b{right:4px;top:4px;}.mfe-app-workspace-5n8jq8{-webkit-backdrop-filter:blur(6px);backdrop-filter:blur(6px);color:var(--wf-text--subtle, #5D6A77);float:right;right:4px;top:4px;}.mfe-app-workspace-1ctqx14{-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;border-color:transparent;border-radius:4px;border-style:solid;border-width:2px;cursor:pointer;display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;-webkit-flex-direction:column;-ms-flex-direction:column;flex-direction:column;-webkit-flex-shrink:0;-ms-flex-negative:0;flex-shrink:0;font-family:Studio-Feixen-Sans,Arial,sans-serif;font-weight:800;-webkit-box-pack:center;-ms-flex-pack:center;-webkit-justify-content:center;justify-content:center;line-height:1;margin:0;outline:0;padding:0;position:relative;-webkit-text-decoration:none;text-decoration:none;-webkit-transition:background-color 125ms ease-out;transition:background-color 125ms ease-out;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;vertical-align:middle;background-color:transparent;color:var(--wf-text--link, #0065D1);font-size:12px;height:28px;min-width:28px;width:auto;padding-left:8px;padding-right:8px;-webkit-backdrop-filter:blur(6px);backdrop-filter:blur(6px);color:var(--wf-text--subtle, #5D6A77);float:right;right:4px;top:4px;}.mfe-app-workspace-1ctqx14::before{border-radius:2px;content:"";display:block;height:100%;inset:0;position:absolute;width:100%;z-index:0;}.mfe-app-workspace-1ctqx14:active{background-color:transparent;}.mfe-app-workspace-1ctqx14:disabled{cursor:default;opacity:0.6;pointer-events:none;}.mfe-app-workspace-1ctqx14:hover{border-color:var(--wf-bg--hover, rgba(48, 57, 105, 0.06));}.mfe-app-workspace-1ctqx14:hover::before{background-color:var(--wf-bg--hover, rgba(48, 57, 105, 0.06));}.mfe-app-workspace-1ctqx14 >*{z-index:1;}.mfe-app-workspace-1hnbc48{font-family:JetBrainsMonoNL,Menlo,Monaco,'Courier New',monospace;font-size:13px;line-height:20px;padding:8px;}.mfe-app-workspace-1hnbc48 code{font-family:JetBrainsMonoNL,Menlo,Monaco,'Courier New',monospace;}``` return column.quantile(0.3)
``````
• We create a function called pct30 that computes the thirtieth percentile of a DataFrame column.

`dogs["weight_kg"].agg(pct30)`

• It gives us the thirtieth percentile of the dogs' weights.
##### Summaries on multiple columns

`dogs[["weight_kg", "height_cm"]].agg(pct30)`

##### Multiple summaries

def pct40(column):

`````` return column.quantile(0.4)

``````

`dogs["weight_kg"].agg([pct30, pct40])`

• Thirtieth and fortieth percentiles of the dogs' weights.
##### Cumulative sum

`dogs["weight_kg"]`

`dogs["weight_kg"].cumsum()`

##### Cummulative statistics
• `.cummax()`
• `.cummin()`
• `.cumprod()`

#### Counting

##### Dropping duplicate names

`vet_visits.drop_duplicate(subset="name")`

##### Dropping duplicate pairs

`unique_dogs = vet_visits.drop_duplicates(subset=["name","breed"])`

##### Easy as 1,2,3

`unique_dogs["breed"].value_counts()`

`unique_dogs["breed"].value_counts(sort=True)`

##### Proportions

`unique_dogs["breed"].value_counts(normalize=True)`

#### Grouped summary statistics

##### Summaries by group
• `dogs[dogs["color"] == "Black"]["weight_kg"].mean()`
##### Grouped summaries
• `dogs.groupby("color")["weight_kg"].mean()`
##### Multiple grouped summaries
• `dogs.groupby("color")["weight_kg"].agg([min,max,sum])`
##### Grouping by multiple varibles
• `dogs.groupby(["color", "breed"])["weight_kg"].mean()`
##### Many groups, many summaries
• `dogs.groupby(["color", "breed"])[["weight_kg"]].mean()`
##### Pivot tables

`dogs.groupby("color")["weight_kg"].mean()`

• Same thing using the pivot_table method.

dogs.pivot_table(values="weight_kg",

``````                    index="color")

``````
##### Different statistics

`import numpy as np`

`dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)`

##### Filling missing values in pivot tables

`dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)`

##### Summing with pivot tables

`dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)`

### Chapter 3: Slicing and Indexing Data

##### Explicit indexes

Setting a column as the index `dogs_ind = dogs.set_index("name")`

Removing an index `dogs_ind.reset_index()`

Dropping an index `dogs_ind.reset_index(drop=True)`

Indexes make subsetting simpler `dogs[dogs["name"].isin(["Bella, Stella])]"`

Index values don't need to be unique `dogs_ind2 = dogs.set_index("breed")`

• Index values are just data
• Indexes violate "tidy data" principles