One alternative to using a loop to iterate over a DataFrame is the pandas .apply() method. It works much like Python's built-in map() function: it takes a function as input and applies that function across an entire DataFrame.
Because a DataFrame is two-dimensional, you should specify the axis you want your function to act on (0 for columns, which is the default, and 1 for rows).
Much like map(), the .apply() method can also be used with anonymous (lambda) functions. Let's look at some .apply() examples using baseball data.
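Before turning to the baseball data, here is a minimal sketch of how the axis argument changes what the function receives. The DataFrame df and the names col_ranges and row_totals are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the baseball dataset
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function is called once per column (each column arrives as a Series)
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function is called once per row (each row arrives as a Series)
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges)
print(row_totals)
```

With axis=0 the result is indexed by column name; with axis=1 it is indexed by the original row labels.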
Calculating Run Differentials With .apply()
First, call the .apply() method on the baseball_df DataFrame. Then use a lambda function to work through the rows of the DataFrame. For every row, we grab the RS and RA columns and pass them to the calc_run_diff function. Finally, specify axis=1 to tell .apply() that we want to apply the function to rows instead of columns.
baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
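The call above relies on a calc_run_diff function and a baseball_df DataFrame that are not defined in this section. The sketch below fills them in under two assumptions: that calc_run_diff returns runs scored minus runs allowed (which is consistent with the RD values printed further down), and that a three-row DataFrame built from those printed rows is a fair stand-in for the full dataset.

```python
import pandas as pd

# Assumed definition: the section never shows calc_run_diff, but
# runs scored minus runs allowed is consistent with the printed RD column.
def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Stand-in data: three rows copied from the printed output in this section
baseball_df = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL"],
    "RS": [734, 700, 712],
    "RA": [688, 600, 705],
})

# One call per row; each row is passed to the lambda as a Series
run_diffs = baseball_df.apply(
    lambda row: calc_run_diff(row["RS"], row["RA"]),
    axis=1,
)
print(run_diffs.tolist())
```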
Notice that we don't need a for loop. You can collect the run differentials directly into an object called run_diffs_apply. After creating a new column and printing the DataFrame, you will see that the results match what you would get with the .iterrows() method.
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1)
baseball_df['RD'] = run_diffs_apply
print(baseball_df)
   Team  League  year   RS   RA   W    G  Playoffs   RD
0   ARI      NL  2012  734  688  81  162         0   46
1   ATL      NL  2012  700  600  94  162         1  100
2   BAL      AL  2012  712  705  93  162         1    7
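To check the comparison with .iterrows() yourself, one possible sketch is shown below. It assumes calc_run_diff subtracts runs allowed from runs scored and uses a small stand-in DataFrame built from the three rows printed above; run_diffs_iterrows and same are hypothetical names.

```python
import pandas as pd

# Assumed helper: runs scored minus runs allowed
def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Stand-in data taken from the three printed rows
baseball_df = pd.DataFrame({"RS": [734, 700, 712], "RA": [688, 600, 705]})

# Loop version: build the list row by row with .iterrows()
run_diffs_iterrows = []
for _, row in baseball_df.iterrows():
    run_diffs_iterrows.append(calc_run_diff(row["RS"], row["RA"]))

# .apply() version: no explicit loop
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row["RS"], row["RA"]),
    axis=1,
)

# Compare the two approaches element by element
same = run_diffs_apply.tolist() == run_diffs_iterrows
print(same)
```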
Interactive Example Using .apply()
The Tampa Bay Rays want you to analyze their data.
They'd like the following metrics:
- The sum of each column in the data
- The total amount of runs scored in a year ('RS' + 'RA' for each year)
- The 'Playoffs' column in text format rather than using 1's and 0's
The function below can be used to convert the 'Playoffs' column to text:
def text_playoffs(num_playoffs):
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'
Use .apply() to get these metrics. A DataFrame (rays_df) has been printed below. This DataFrame is indexed on the 'Year' column.
       RS   RA   W  Playoffs
2012  697  577  90         0
2011  707  614  91         1
2010  802  649  96         1
2009  803  754  84         0
2008  774  671  97         1
- Apply sum() to each column of rays_df to collect the sum of each column. Be sure to specify the correct axis.
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)
When we run the above code, it produces the following result:
RS          3783
RA          3265
W            458
Playoffs       3
dtype: int64
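The other two requested metrics can be gathered the same way, this time with axis=1 so that the function sees one row (one year) at a time. The following is a sketch rather than the course's official solution: rays_df is rebuilt here from the printed table, and total_runs and textual_playoffs are hypothetical variable names.

```python
import pandas as pd

def text_playoffs(num_playoffs):
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'

# rays_df rebuilt from the printed table, indexed on 'Year'
rays_df = pd.DataFrame(
    {
        "RS": [697, 707, 802, 803, 774],
        "RA": [577, 614, 649, 754, 671],
        "W": [90, 91, 96, 84, 97],
        "Playoffs": [0, 1, 1, 0, 1],
    },
    index=pd.Index([2012, 2011, 2010, 2009, 2008], name="Year"),
)

# 'RS' + 'RA' for each year: act on rows, so axis=1
total_runs = rays_df.apply(lambda row: row["RS"] + row["RA"], axis=1)
print(total_runs)

# Convert the numeric 'Playoffs' flag to text, row by row
textual_playoffs = rays_df.apply(
    lambda row: text_playoffs(row["Playoffs"]),
    axis=1,
)
print(textual_playoffs)
```

Both results keep the 'Year' index of rays_df, so each value lines up with the season it describes.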
To learn more about pandas alternatives to looping, please see this video from our course Writing Efficient Python Code.
This content is taken from DataCamp’s Writing Efficient Python Code course by Logan Thomas.