Pandas

  • .head() returns the first rows (the “head”) of the DataFrame: 5 by default, or n if you pass a number.
  • .info() prints each column's name, data type, and number of non-null values, which shows how many values are missing.
  • .shape is an attribute (no parentheses) holding the number of rows and columns of the DataFrame as a tuple.
  • .describe() calculates summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
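
A minimal sketch of these four inspection tools. The dogs DataFrame and its values are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["brown", "black", "brown", "gray", "black", "tan", "white"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [25.0, 23.0, 22.0, 17.0, 29.0, 2.0, 74.0],
})

first_rows = dogs.head()       # first 5 rows by default
n_rows, n_cols = dogs.shape    # attribute, not a method call
dogs.info()                    # prints column names, dtypes, non-null counts
summary = dogs.describe()      # statistics for the numeric columns only

print(first_rows)
print(n_rows, n_cols)
print(summary)
```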

DataFrame Objects

  • .values: A two-dimensional NumPy array of the DataFrame's values, without the row and column labels.
  • .columns: An index of columns: the column names.
  • .index: An index for the rows: either row numbers or row names.
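
As a quick illustration, the three components can be pulled out of a small DataFrame like this (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row DataFrame with row names as the index.
df = pd.DataFrame(
    {"height_cm": [56, 43], "weight_kg": [25, 23]},
    index=["Bella", "Charlie"],
)

values = df.values      # 2-D NumPy array of the data
columns = df.columns    # Index object with the column names
index = df.index        # Index object with the row labels

print(type(values), values.shape)
print(list(columns))
print(list(index))
```

The pandas documentation recommends df.to_numpy() over .values for new code; both return a NumPy array here.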

Sorting and Subsetting

  • .sort_values(): sorts rows by a column name (or a list of column names), from smallest to largest by default; pass ascending=False to sort from largest to smallest.
  • To select only "col_a" of the DataFrame df: df["col_a"]
  • To select "col_a" and "col_b" of df, use: df[["col_a", "col_b"]]
  • Filtering rows with a condition:
  • dogs[dogs["height_cm"] > 60]
  • dogs[dogs["color"] == "tan"]
  • dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")] (wrap each condition in parentheses when combining them with &)
  • Filtering on a categorical variable with .isin():
  • colors = ["brown", "black", "tan"]
  • condition = dogs["color"].isin(colors)
  • dogs[condition]
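
The sorting and subsetting patterns above can be sketched together as follows. The dogs data is invented for the example.

```python
import pandas as pd

# Hypothetical toy data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "color": ["brown", "black", "tan", "gray", "tan"],
    "height_cm": [56, 43, 46, 49, 65],
})

# Sorting: one column descending, then two columns with mixed directions.
tallest_first = dogs.sort_values("height_cm", ascending=False)
by_color_then_height = dogs.sort_values(
    ["color", "height_cm"], ascending=[True, False]
)

# Column selection: single brackets give a Series, double brackets a DataFrame.
heights = dogs["height_cm"]
name_and_color = dogs[["name", "color"]]

# Row filtering with boolean conditions.
tall = dogs[dogs["height_cm"] > 60]
tall_and_tan = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
dark = dogs[dogs["color"].isin(["brown", "black"])]

print(tallest_first)
print(dark)
```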

Setting a column as an index column

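  • .set_index("column name"): makes that column the index; pass a list of column names to create a multi-level index.
  • .reset_index(): moves the index back into a regular column and restores the default row numbers; pass drop=True to discard the index instead.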
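
A short sketch of setting and resetting an index, using made-up data:

```python
import pandas as pd

# Hypothetical toy data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Labrador"],
    "color": ["brown", "black", "tan"],
    "height_cm": [56, 43, 46],
})

by_name = dogs.set_index("name")        # "name" becomes the row index
bella = by_name.loc["Bella"]            # look up a row by its label

by_breed_color = dogs.set_index(["breed", "color"])  # multi-level index

restored = by_name.reset_index()            # "name" is a column again
dropped = by_name.reset_index(drop=True)    # "name" is discarded

print(by_breed_color)
print(restored)
```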

Sorting

  • .sort_values(): sorts rows by the values of a column, in ascending or descending order; takes the column name as an argument.
  • .sort_index(): sorts rows by the values of the index; with a multi-level index, the level argument restricts the sort to the given level(s), and ascending sets the direction.
  • Slice rows with code like df.loc[("a", "b"):("c", "d")]. The index should be sorted first, and the end label is included in the result.
  • Slice columns with code like df.loc[:, "e":"f"].
  • Slice both ways with code like df.loc[("a", "b"):("c", "d"), "e":"f"].
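
The slicing patterns above, sketched on an invented DataFrame with a two-level (breed, color) index. The index is sorted before slicing because label slices on a multi-level index rely on a sorted index.

```python
import pandas as pd

# Hypothetical toy data.
dogs = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Labrador", "Chow Chow", "Schnauzer"],
    "color": ["black", "brown", "tan", "tan", "gray"],
    "name": ["Charlie", "Bella", "Max", "Lucy", "Cooper"],
    "height_cm": [43, 56, 59, 46, 49],
    "weight_kg": [23, 25, 29, 22, 17],
})

# Build the multi-level index and sort it before slicing.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Row slice between two (breed, color) pairs; both ends are included.
rows = dogs_srt.loc[("Labrador", "brown"):("Poodle", "black")]

# Column slice by label; "height_cm" is included.
cols = dogs_srt.loc[:, "name":"height_cm"]

# Both at once.
both = dogs_srt.loc[("Labrador", "brown"):("Poodle", "black"), "name":"height_cm"]

# Sort by one index level only, in descending order.
by_color_desc = dogs_srt.sort_index(level="color", ascending=False)

print(rows)
print(both)
```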

Summary Statistics

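  • .agg(): applies your own custom functions to a DataFrame or column. Define a function that takes a column and returns a single value, then pass it (or a list of functions) to the method.
  • .drop_duplicates(): removes duplicate rows; the subset argument limits the comparison to the given columns.
  • .value_counts(): counts how often each value occurs, used for categorical data; normalize=True returns proportions instead of counts.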
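
A small sketch of the three methods on made-up data. The pct30 helper is a hypothetical custom summary function, defined only for this example.

```python
import pandas as pd

# Hypothetical toy data; Bella the Labrador appears twice on purpose.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Bella", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Labrador", "Labrador"],
    "weight_kg": [25.0, 23.0, 22.0, 25.0, 29.0],
})

def pct30(column):
    """Hypothetical custom summary: the 30th percentile of a column."""
    return column.quantile(0.3)

# .agg() with a custom function, then with a list of built-in names.
weight_pct30 = dogs[["weight_kg"]].agg(pct30)
weight_min_max = dogs[["weight_kg"]].agg(["min", "max"])

# Drop repeated dogs before counting, comparing only name and breed.
unique_dogs = dogs.drop_duplicates(subset=["name", "breed"])

breed_counts = unique_dogs["breed"].value_counts()
breed_share = unique_dogs["breed"].value_counts(normalize=True)

print(weight_pct30)
print(breed_counts)
print(breed_share)
```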

Group Summary Statistics

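  • .groupby(categorical variable): groups rows by the values of a categorical variable so that a summary statistic can be computed per group, e.g. dogs.groupby("color")["weight_kg"].mean(), instead of filtering each category separately.
  • Pivot tables are the standard way of aggregating data in spreadsheets.
  • The .pivot_table() method is an alternative to .groupby().
  • dogs.pivot_table(values="weight_kg", index="color") gives the mean weight for each color (the mean is the default aggfunc).
  • dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median) gives the same table, but with the median instead of the mean.
  • dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0) groups by two variables: one row per color and one column per breed.
  • fill_value replaces missing values in the resulting table (combinations with no data) with the given value.
  • margins=True is a shortcut for when you pivoted by two variables but also want the summary for each variable separately: it adds an "All" row and column holding the statistic (the mean by default, not a sum) computed over all rows and columns.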
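  • .isna(): returns True or False for every value, indicating whether it is missing
  • .isna().any(): shows, for each column, whether it contains at least one missing value
  • .isna().sum(): number of missing values in each column
  • .dropna(): removes rows that contain missing values
  • .fillna(0): replaces missing values with 0
  • .value_counts(): counts the occurrences of each value of a categorical variable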
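
The grouping, pivoting, and missing-value steps above can be sketched on invented data as follows. One assumption to flag: the example passes aggfunc="median" as a string rather than np.median, since recent pandas versions warn when given the NumPy function; both produce the median here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador"],
    "color": ["brown", "black", "tan", "tan", "brown"],
    "weight_kg": [25.0, 23.0, 22.0, 21.0, 29.0],
})

# Mean weight per color, via groupby and via the equivalent pivot table.
mean_by_color = dogs.groupby("color")["weight_kg"].mean()
pivot_mean = dogs.pivot_table(values="weight_kg", index="color")

# Median instead of the default mean.
pivot_median = dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

# Two variables: colors as rows, breeds as columns.
# Empty combinations become 0; margins=True adds the "All" row and column.
by_color_and_breed = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed",
    fill_value=0, margins=True,
)

# Missing values: blank out one weight in a copy, then inspect and repair.
with_gaps = dogs.copy()
with_gaps.loc[1, "weight_kg"] = np.nan

any_missing = with_gaps.isna().any()    # per column: any missing value?
n_missing = with_gaps.isna().sum()      # per column: how many missing?
complete_rows = with_gaps.dropna()      # rows with a missing value removed
filled = with_gaps.fillna(0)            # missing values replaced by 0

print(by_color_and_breed)
print(n_missing)
```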