How To Drop Columns in Pandas
Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame to make it easier for you to focus on the remaining columns.
The columns can be removed by specifying label names and corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level.
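As a rough illustration of the options just described, the sketch below drops a column by label plus axis, by the columns keyword, and by level on a MultiIndex. The DataFrames and the names df, multi, and trimmed are hypothetical toy data, not the dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data; not the traffic-stops dataset used in this article.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Option 1: give the label and say which axis it lives on.
by_axis = df.drop("b", axis="columns")

# Option 2: name the columns directly, so no axis argument is needed.
by_keyword = df.drop(columns=["b"])

print(by_axis.columns.tolist())    # ['a', 'c']
print(by_keyword.equals(by_axis))  # True

# With a MultiIndex on the columns, `level` chooses which level
# the label is matched against.
multi = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=pd.MultiIndex.from_product([["x", "y"], ["lo", "hi"]]),
)

# Remove every column whose second-level label is "hi".
trimmed = multi.drop("hi", axis="columns", level=1)
print(trimmed.columns.tolist())    # [('x', 'lo'), ('y', 'lo')]
```

Both options return a new DataFrame and leave df unchanged, which is the default behavior when inplace is not set.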
Let's compare the missing value counts with the shape of the DataFrame. You will notice that the county_name column contains as many missing values as there are rows, meaning that it contains only missing values.
state                0
stop_date            0
stop_time            0
county_name      91741
driver_gender     5205
driver_race       5202
...
Since it contains no useful information, this column can be dropped using the .drop() method.
Besides specifying the column name, you need to specify that you are dropping from the columns axis and that you want the operation to occur in place, which avoids an assignment statement as shown below:
ri.drop('county_name', axis='columns', inplace=True)
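The check-then-drop workflow above can be tried on a small stand-in DataFrame. This is only a sketch: ri_demo and its values are made up for illustration, and it uses the assignment form of .drop() rather than inplace=True to show that the two are interchangeable here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ri DataFrame; the values are invented.
ri_demo = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17"],
    "county_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F"],
})

# Count missing values per column and compare with the number of rows.
missing = ri_demo.isnull().sum()
n_rows = ri_demo.shape[0]
print(missing)
print(ri_demo.shape)

# Columns whose missing count equals the row count hold no data at all.
all_missing = missing[missing == n_rows].index.tolist()
print(all_missing)  # ['county_name']

# Assignment form: drop() returns a new DataFrame that we bind back to the name.
ri_demo = ri_demo.drop(all_missing, axis="columns")
print(ri_demo.columns.tolist())  # ['stop_date', 'driver_gender']
```

The assignment form is slightly longer than inplace=True, but it keeps the original object untouched until the result is reassigned.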
The .dropna() method is a great way to drop rows based on the presence of missing values in that row.
For example, using the dataset above, let's assume the stop_date and stop_time columns are critical to our analysis, and thus a row is useless to us without that data.
  state   stop_date stop_time driver_gender driver_race
0    RI  2005-01-04     12:55             M       White
1    RI  2005-01-23     23:15             M       White
2    RI  2005-02-17     04:15             M       White
3    RI  2005-02-20     17:15             M       White
4    RI  2005-02-24     01:20             F       White
We can tell pandas to drop all rows that have a missing value in either the stop_date or stop_time column. Because we specify a subset, the .dropna() method only takes these two columns into account when deciding which rows to drop.
ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)
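To see the effect of subset on a small scale, here is a minimal sketch with invented data. The name ri_demo and its values are assumptions for illustration; only the column names come from the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri, with one gap in each of three columns.
ri_demo = pd.DataFrame({
    "stop_date": ["2005-01-04", np.nan, "2005-02-17", "2005-02-20"],
    "stop_time": ["12:55", "23:15", np.nan, "17:15"],
    "driver_gender": ["M", "M", "F", np.nan],
})

# Only stop_date and stop_time are considered when deciding which rows to drop.
cleaned = ri_demo.dropna(subset=["stop_date", "stop_time"])

# Rows 1 and 2 are removed; row 3 survives even though driver_gender is missing.
print(cleaned.index.tolist())                   # [0, 3]
print(cleaned["driver_gender"].isnull().sum())  # 1
```

Without subset, .dropna() would also remove row 3, because by default a row is dropped when any of its values is missing.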
Interactive Example of Dropping Columns
In this example, you will drop the county_name column because it only contains missing values, and you'll drop the state column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information. The number of missing values in each column has been printed to the console for you.
- Examine the DataFrame's .shape to find out the number of rows and columns.
- Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.
- Examine the .shape again to verify that there are now two fewer columns.
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)
When you run the above code, it produces the following result:
(91741, 15)
(91741, 13)
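The two reasons given for dropping these columns (only missing values, or a single repeated value) can also be checked programmatically. The sketch below is one possible way to do that on a hypothetical stand-in DataFrame; ri_demo, all_missing, constant, and to_drop are illustrative names, and this is not the approach taken in the course exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri; the values are invented.
ri_demo = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17"],
    "stop_time": ["12:55", "23:15", "04:15"],
})

# Columns in which every value is missing.
all_missing = ri_demo.columns[ri_demo.isnull().all()].tolist()

# Columns with exactly one distinct value. nunique() ignores missing
# values by default, so all-missing columns report 0 and are not repeated here.
constant = ri_demo.columns[ri_demo.nunique() == 1].tolist()

to_drop = all_missing + constant
print(to_drop)  # ['county_name', 'state']

# Compare the shape before and after, as in the exercise above.
before = ri_demo.shape
ri_demo = ri_demo.drop(to_drop, axis="columns")
print(before, ri_demo.shape)  # (3, 4) (3, 2)
```

A constant column is not always useless, so it is worth inspecting the candidates in to_drop before removing them.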
To learn more about dropping columns in pandas, please see this video from our course, Analyzing Police Activity with pandas.
This content is taken from DataCamp’s Analyzing Police Activity with pandas course by Kevin Markham.