
Explore a DataFrame

Use this template to get a solid understanding of the structure of your DataFrame and its values before jumping into a deeper analysis. This template leverages many of pandas' handy functions for the most fundamental exploratory data analysis steps, including inspecting column data types and distributions, creating exploratory visualizations, and counting unique and missing values.

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Load your dataset into a DataFrame
df = pd.read_csv("data/taxis.csv")

# Print the number of rows and columns
print("Number of rows and columns:", df.shape)

# Print out the first five rows
df.head()

Understanding columns and values

The info() method prints a concise summary of the DataFrame. For each column, it lists the name, the data type, and the number of non-null values. This helps you gauge how many values are missing and understand which data types you're dealing with.

df.info()

To get an exact count of missing values in each column, call the isna() function and aggregate it using the sum() function:

df.isna().sum()
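
Raw counts can be hard to compare when you don't have the total row count in mind. As a minimal sketch, the share of missing values per column can be computed by averaging the boolean mask instead of summing it. The toy DataFrame below is a hypothetical stand-in for your own data:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame(
    {
        "fare": [7.5, None, 12.0, 9.0],
        "pickup_borough": ["Manhattan", "Queens", None, None],
    }
)

# isna() yields booleans, so mean() gives the fraction of missing values
missing_share = toy.isna().mean()

# Show the columns with the largest share of missing values first
print(missing_share.sort_values(ascending=False))
```

With your own data, replace toy with df.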

If there are missing values, you'll have to decide whether and how to deal with them. If you want to learn more about removing and replacing values, check out chapter 2 of DataCamp's Data Manipulation with pandas course.
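
As a rough illustration of the two most common options, the sketch below drops rows with a missing value in one column and, alternatively, fills the gaps. The toy data, the choice of the median, and the "Unknown" label are assumptions made for the example, not a recommendation for your dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame(
    {
        "fare": [7.5, None, 12.0, 9.0],
        "pickup_borough": ["Manhattan", "Queens", None, "Queens"],
    }
)

# Option 1: drop rows where a column of interest is missing
dropped = toy.dropna(subset=["pickup_borough"])

# Option 2: keep every row and fill the gaps column by column
filled = toy.fillna(
    {"fare": toy["fare"].median(), "pickup_borough": "Unknown"}
)

print(len(toy), len(dropped))
print(filled)
```

Which option is appropriate depends on why the values are missing and how the column is used later in the analysis.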

The describe() method generates helpful descriptive statistics for each numeric column. Its output shows the count of non-missing values, the mean, the standard deviation, the minimum and maximum, and the 25th, 50th, and 75th percentiles. Note that missing values are excluded from these calculations.

df.describe()

To list the distinct values in a column, call unique():

df["pickup_borough"].unique()  # Replace with a column of interest
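
Before listing the distinct values of a single column, it can help to see how many distinct values each column has. A minimal sketch using nunique() on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame(
    {
        "passengers": [1, 2, 1, 3],
        "pickup_borough": ["Manhattan", "Queens", None, "Queens"],
    }
)

# Number of distinct values per column; missing values are not counted by default
distinct_counts = toy.nunique()
print(distinct_counts)
```

Columns with only a handful of distinct values are usually good candidates for unique() and value_counts().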

Use the value_counts() method to count the number of rows for each unique value:

df["pickup_borough"].value_counts(  # Replace with a column of interest
    dropna=True  # Set to False if you want to include NaN values
)
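
If proportions are more useful than raw counts, value_counts() also accepts normalize=True. The sketch below uses a small hypothetical Series to show how the dropna setting changes the result:

```python
import pandas as pd

# Hypothetical toy column with one missing value
boroughs = pd.Series(
    ["Manhattan", "Queens", "Manhattan", None], name="pickup_borough"
)

# Shares among non-missing values only (dropna=True is the default)
shares = boroughs.value_counts(normalize=True)

# Shares across all rows, with NaN counted as its own category
shares_with_nan = boroughs.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_nan)
```

Note that the shares are computed over different totals: excluding NaN, they sum to 1 over the non-missing rows only.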

Basic data visualizations

pandas' plot() function makes it easy to plot columns from your DataFrame. This section will go through a few basic data visualizations to better understand your data. If you need a refresher on visualizing DataFrames, chapter 4 of DataCamp's Data Manipulation with pandas course is a useful reference!

Boxplots can help you identify outliers: