Data Manipulation with pandas
Pandas
- .head() returns the first rows of the DataFrame (5 by default), the "head" of the DataFrame.
- .info() shows information on each column: its name, its data type, and its number of non-null values (from which the number of missing values follows).
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
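The four inspection tools above can be tried on a small table. This is a minimal sketch: the `dogs` DataFrame and all of its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["brown", "black", "brown", "gray", "black", "tan", "white"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24.0, 24.0, 24.0, 17.0, 29.0, 2.0, 74.0],
})

first_rows = dogs.head()     # first 5 rows by default; dogs.head(3) would give 3
n_rows, n_cols = dogs.shape  # .shape is an attribute, so no parentheses
summary = dogs.describe()    # summary statistics for the numeric columns only

dogs.info()                  # prints column names, dtypes and non-null counts
print(first_rows)
print(n_rows, n_cols)
print(summary)
```

Note that .info() prints its report directly rather than returning a DataFrame, which is why it is not assigned to a variable here.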
DataFrame Objects
- .values: A two-dimensional NumPy array of the values (the DataFrame's contents without the row and column labels).
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.
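A short sketch of the three components listed above, assuming a made-up `dogs` table whose row index holds the dog names:

```python
import pandas as pd

# Hypothetical sample data with row names as the index.
dogs = pd.DataFrame(
    {"height_cm": [56, 43, 46], "weight_kg": [24.0, 24.0, 17.0]},
    index=["Bella", "Charlie", "Lucy"],
)

values = dogs.values       # plain NumPy array, labels are dropped
col_names = dogs.columns   # Index object holding the column names
row_labels = dogs.index    # Index object holding the row labels

print(type(values), values.shape)
print(col_names)
print(row_labels)
```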
Sorting and Subsetting
- .sort_values(): sorts rows by the column name you pass, from smallest to largest by default; pass ascending=False to sort from largest to smallest.
- To select only "col_a" of the DataFrame df: df["col_a"]
- To select "col_a" and "col_b" of df, use: df[["col_a", "col_b"]]
- Filtering: put a Boolean condition inside the square brackets to keep only the rows where it is True.
- dogs[dogs["height_cm"] > 60]
- dogs[dogs["color"] == "tan"]
- dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
- Filtering on a categorical variable with several allowed values: use the .isin() method.
- colors = ["brown", "black", "tan"]
- condition = dogs["color"].isin(colors)
- dogs[condition]
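The sorting, column selection, and filtering patterns above can be combined as follows. This is only a sketch: the `dogs` data is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "color": ["brown", "black", "brown", "gray", "black", "tan", "tan"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
})

# Sort from tallest to shortest.
tallest_first = dogs.sort_values("height_cm", ascending=False)

# Single brackets give a Series, double brackets give a DataFrame.
names = dogs["name"]
name_and_color = dogs[["name", "color"]]

# Each condition needs its own parentheses when combined with &.
tall_tan = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

# .isin() keeps the rows whose value appears in the list.
dark = dogs[dogs["color"].isin(["brown", "black"])]

print(tallest_first)
print(tall_tan)
print(dark)
```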
Setting a column as an index column
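- .set_index("column name"): uses that column as the row index; pass a list of column names to build a multi-level index.
- .reset_index(): moves the index back into a regular column and restores the default integer index; pass drop=True to discard the index instead.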
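A minimal sketch of setting and resetting an index, again on made-up `dogs` data:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "color": ["brown", "black", "brown"],
    "height_cm": [56, 43, 46],
})

dogs_ind = dogs.set_index("name")                # single-level index
dogs_multi = dogs.set_index(["breed", "color"])  # multi-level index

restored = dogs_ind.reset_index()           # "name" becomes a column again
dropped = dogs_ind.reset_index(drop=True)   # "name" is discarded entirely

print(dogs_ind)
print(dogs_multi)
print(restored)
print(dropped)
```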
Sorting
- .sort_values(): sorts by the values of a column, in ascending or descending order; takes the name of the column as an argument.
- .sort_index(): sorts by the values of the index; for a multi-level index, the level argument chooses which level(s) to sort by.
- Slicing a multi-level index with .loc requires the index to be sorted, so call .sort_index() first.
- Slice rows with code like df.loc[("a", "b"):("c", "d")].
- Slice columns with code like df.loc[:, "e":"f"].
- Slice both ways with code like df.loc[("a", "b"):("c", "d"), "e":"f"].
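The slicing patterns above can be sketched on a small table with a two-level index. The `dogs` data is hypothetical, and the index is sorted before slicing.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua"],
    "color": ["brown", "black", "brown", "gray", "black", "tan"],
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella"],
    "height_cm": [56, 43, 46, 49, 59, 18],
    "weight_kg": [24.0, 24.0, 24.0, 17.0, 29.0, 2.0],
})

# Build the multi-level index and sort it so that slicing works.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Sort by one chosen level, in descending order.
by_color_desc = dogs.set_index(["breed", "color"]).sort_index(level="color", ascending=False)

# Row slice: both end points are included.
rows = dogs_srt.loc[("Chow Chow", "brown"):("Labrador", "brown")]

# Column slice: keep every row, columns from "name" through "height_cm".
cols = dogs_srt.loc[:, "name":"height_cm"]

# Both at once.
both = dogs_srt.loc[("Chow Chow", "brown"):("Labrador", "brown"), "name":"height_cm"]

print(rows)
print(cols)
print(both)
```

Unlike ordinary Python slices, label slices with .loc include the final label.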
Summary Statistics
- .agg(): applies your own custom function to a column or DataFrame (define the function, then pass it as an argument to the method).
- .drop_duplicates(): removes duplicate rows; the subset argument limits the comparison to the listed columns.
- .value_counts(): counts how often each value occurs, which is useful for categorical data; pass normalize=True to get proportions instead of counts.
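A sketch of the three methods above. The `dogs` data and the `value_range` helper are both made up for illustration; any function that reduces a column to a single number could be passed to .agg() in the same way.

```python
import pandas as pd

# Hypothetical sample data; "Max" appears twice on purpose.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Labrador"],
    "height_cm": [56, 43, 46, 49, 59, 59],
    "weight_kg": [24.0, 24.0, 24.0, 17.0, 29.0, 29.0],
})

def value_range(column):
    """Hypothetical custom statistic: largest value minus smallest value."""
    return column.max() - column.min()

# Apply the custom function to each selected column.
ranges = dogs[["height_cm", "weight_kg"]].agg(value_range)

# Keep one row per (name, breed) pair before counting.
unique_dogs = dogs.drop_duplicates(subset=["name", "breed"])

breed_counts = unique_dogs["breed"].value_counts()               # raw counts
breed_props = unique_dogs["breed"].value_counts(normalize=True)  # proportions

print(ranges)
print(breed_counts)
print(breed_props)
```

Dropping duplicates before counting avoids counting the same dog twice.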
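Group Summary Statistics
- .groupby("categorical column"): groups the rows by a categorical variable so that a summary statistic can be computed for each group in one step, instead of subsetting each category separately.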
- Pivot tables are the standard way of aggregating data in spreadsheets.
- The .pivot_table() method is an alternative to .groupby().
- dogs.pivot_table(values="weight_kg", index="color") gives the mean weight for each color (the mean is the default aggregation).
- dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median) is the same as the previous call, but computes the median instead of the mean.
- dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
- fill_value replaces missing values in the resulting table with the value you pass (0 in the call above).
- margins=True is a shortcut for when you pivot by two variables but also want the summary for each of those variables separately: it adds an "All" row and column holding the aggregate (the mean by default) over every row and column of the pivot table.
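The grouped summaries and pivot tables described above can be compared side by side. This is a sketch on invented `dogs` data; the string "median" is used for aggfunc here as an equivalent of np.median that avoids the extra NumPy import.

```python
import pandas as pd

# Hypothetical sample data; there is no brown Poodle, so one cell is empty.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador", "Labrador"],
    "color": ["brown", "black", "black", "black", "brown", "brown"],
    "weight_kg": [24.0, 20.0, 30.0, 22.0, 32.0, 25.0],
})

# Mean weight per color, computed with groupby.
mean_by_color = dogs.groupby("color")["weight_kg"].mean()

# The same summary as a pivot table (mean is the default aggregation).
pivot_mean = dogs.pivot_table(values="weight_kg", index="color")

# Median instead of mean.
pivot_median = dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

# Two grouping variables, empty cells filled with 0, plus "All" margins.
pivot_2d = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed",
    fill_value=0, margins=True,
)

print(mean_by_color)
print(pivot_mean)
print(pivot_median)
print(pivot_2d)
```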
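Missing Values
- .isna(): returns True for every value that is missing.
- .isna().any(): tells you, for each column, whether it contains at least one missing value.
- .isna().sum(): counts the missing values in each column.
- .dropna(): removes the rows that contain missing values.
- .fillna(0): replaces missing values with 0.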
- .value_counts(): counts the occurrences of each category in a categorical variable.
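The missing-value methods listed above can be tried on a small table. This is a minimal sketch with made-up `dogs` data in which two weights are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing weights.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24.0, np.nan, 24.0, np.nan],
})

is_missing = dogs.isna()         # True/False for every single value
has_missing = dogs.isna().any()  # one True/False per column
n_missing = dogs.isna().sum()    # number of missing values per column

complete_rows = dogs.dropna()    # drop the rows that contain a missing value
filled = dogs.fillna(0)          # replace missing values with 0

print(has_missing)
print(n_missing)
print(complete_rows)
print(filled)
```

Whether dropping or filling is appropriate depends on the data: filling with 0 changes statistics such as the mean.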