One alternative to using a loop to iterate over a DataFrame is to use the pandas .apply() method. This method acts much like Python's map() function: it takes a function as input and applies it across an entire DataFrame. Because a DataFrame is tabular, you specify the axis you want your function to act on (0 for columns, which is the default, and 1 for rows).
Much like the map() function, the .apply() method can also be used with anonymous (lambda) functions. Let's look at some .apply() examples using baseball data.
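To make the axis argument concrete, here is a minimal sketch on a small made-up DataFrame. The name toy_df and its values are hypothetical and are not part of the baseball data used in the rest of this article.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the axis argument
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series
col_max = toy_df.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row as a Series
row_sum = toy_df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_max)
print(row_sum)
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row.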
Calculating Run Differentials With .apply()
First, you will call the .apply() method on the baseball_df DataFrame. Then you will use a lambda function to iterate over the rows of the DataFrame. For every row, you grab the RS and RA columns and pass them to the calc_run_diff function. Finally, you will specify axis=1 to tell the .apply() method to apply the function to the rows instead of the columns.
baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
You will notice that you don't need to use a for loop. You can collect the run differentials directly into an object called run_diffs_apply. After creating a new column and printing the DataFrame, you will notice that the results match what you would get with the for loop.
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
baseball_df['RD'] = run_diffs_apply
print(baseball_df)
  Team League  year   RS   RA   W    G  Playoffs   RD
0  ARI     NL  2012  734  688  81  162         0   46
1  ATL     NL  2012  700  600  94  162         1  100
2  BAL     AL  2012  712  705  93  162         1    7
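The calc_run_diff function is used above but not shown. A minimal sketch, assuming it simply subtracts runs allowed from runs scored (which matches the RD values in the printed output), lets you compare the loop and .apply() versions side by side. The three-row baseball_df here is rebuilt from the rows printed above and keeps only the columns needed.

```python
import pandas as pd

# Assumed helper: the article calls calc_run_diff without defining it
def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Three rows copied from the printed output (subset of columns)
baseball_df = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL"],
    "RS": [734, 700, 712],
    "RA": [688, 600, 705],
})

# Loop version, for comparison
run_diffs_loop = []
for _, row in baseball_df.iterrows():
    run_diffs_loop.append(calc_run_diff(row["RS"], row["RA"]))

# .apply() version: same calculation, no explicit loop
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row["RS"], row["RA"]), axis=1
)

# Check that both approaches agree
print(run_diffs_loop == run_diffs_apply.tolist())
```

Both versions call the function once per row; the .apply() form just removes the bookkeeping of building the list by hand.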
Interactive Example Using .apply()
The Tampa Bay Rays want you to analyze their data.
They'd like the following metrics:
- The sum of each column in the data
- The total amount of runs scored in a year ('RS' + 'RA' for each year)
- The 'Playoffs' column in text format rather than using 1s and 0s
The function below can be used to convert the 'Playoffs' column to text:
def text_playoffs(num_playoffs):
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'
Use .apply() to get these metrics. A DataFrame (rays_df) has been printed below. This DataFrame is indexed by year.
       RS   RA   W  Playoffs
2012  697  577  90         0
2011  707  614  91         1
2010  802  649  96         1
2009  803  754  84         0
2008  774  671  97         1
Apply sum() to each column of rays_df to collect the sum of each column. Be sure to specify the correct axis.
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)
When we run the above code, it produces the following result:
RS          3783
RA          3265
W            458
Playoffs       3
dtype: int64
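The other two metrics on the list can be approached the same way. The following is a sketch rather than the course's official solution: rays_df is rebuilt from the printed table, and the variable names total_runs_scored and textual_playoffs are assumptions chosen for illustration.

```python
import pandas as pd

def text_playoffs(num_playoffs):
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'

# rays_df rebuilt from the printed table, indexed by year
rays_df = pd.DataFrame(
    {
        "RS": [697, 707, 802, 803, 774],
        "RA": [577, 614, 649, 754, 671],
        "W": [90, 91, 96, 84, 97],
        "Playoffs": [0, 1, 1, 0, 1],
    },
    index=[2012, 2011, 2010, 2009, 2008],
)

# Total runs per year: add RS and RA across each row (axis=1)
total_runs_scored = rays_df[["RS", "RA"]].apply(
    lambda row: row.sum(), axis=1
)

# Playoffs as text: pass each row's Playoffs value to text_playoffs
textual_playoffs = rays_df.apply(
    lambda row: text_playoffs(row["Playoffs"]), axis=1
)

print(total_runs_scored)
print(textual_playoffs)
```

Both calls use axis=1 because each result is computed from the values within a single row, one year at a time.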
To learn more about pandas alternatives to looping, please see the video from our course Writing Efficient Python Code.
This content is taken from DataCamp’s Writing Efficient Python Code course by Logan Thomas.