Many people reach for .apply() to “avoid loops” and expect it to be fast and straightforward. In practice, row-wise .apply(axis=1) is still a Python-level loop. It can be slow on large data, and it sometimes returns shapes you didn’t expect. The fix is simple: use vectorized pandas/NumPy operations for common tasks, and reserve .apply() for logic that truly needs multiple columns.
This guide shows how .apply() works today, highlights common pitfalls, and provides drop-in patterns that are faster and clearer.
What Is DataFrame.apply() and Series.apply()?
DataFrame.apply(func, axis=0) calls func on each column by default. With axis=1, it calls func on each row. Series.apply(func) calls func on each element by default; since pandas 2.1, passing by_row=False hands func the whole Series instead.
- Row-wise vs. column-wise: axis=1 means “per row.” axis=0 (the default) means “per column.”
- Return shape rules (pandas ≥0.23): If your function returns a scalar per row/column, you get a Series. If it returns a Series, pandas expands its labels into the columns of a DataFrame. If it returns a list, array, or dict, you’ll get a Series of those objects unless you set result_type='expand'.
- Stricter errors (pandas ≥2.0): Mismatched list-like lengths now raise errors instead of silently producing inconsistent results.
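The shape rules above can be checked on a toy frame. This is a minimal sketch: the frame df and its columns a and b are made up for illustration, and the results shown in comments describe the container type you should expect on pandas 0.23 or later.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): func receives each column as a Series
col_totals = df.apply(lambda col: col.sum())

# axis=1: func receives each row as a Series; a scalar per row gives a Series
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Returning a Series per row expands into a DataFrame
as_series = df.apply(
    lambda row: pd.Series({"lo": row.min(), "hi": row.max()}), axis=1
)

# Returning a list or dict per row gives a Series of those objects by default
as_lists = df.apply(lambda row: [row["a"], row["b"]], axis=1)
as_dicts = df.apply(lambda row: {"total": row["a"] + row["b"]}, axis=1)

# result_type="expand" turns list-like and dict results into columns
list_cols = df.apply(lambda row: [row["a"], row["b"]], axis=1, result_type="expand")
dict_cols = df.apply(
    lambda row: {"total": row["a"] + row["b"]}, axis=1, result_type="expand"
)

for name, obj in [("as_series", as_series), ("as_lists", as_lists),
                  ("as_dicts", as_dicts), ("list_cols", list_cols),
                  ("dict_cols", dict_cols)]:
    print(name, type(obj).__name__)
```

Printing the type of each result is a quick way to confirm the shape before you assign it back to a column.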
Recommended First Choice: Vectorized Operations
Most tasks that involve combining or transforming columns are faster and clearer with vectorized expressions or built-in methods. The following example computes a baseball team’s run differential without .apply().
import pandas as pd
team_stats = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL"],
    "League": ["NL", "NL", "AL"],
    "Year": [2012, 2012, 2012],
    "RunsScored": [734, 700, 712],
    "RunsAllowed": [688, 600, 705],
    "Wins": [81, 94, 93],
    "Games": [162, 162, 162],
    "Playoffs": [0, 1, 1],
})
# Vectorized: fast and idiomatic
team_stats["RunDiff"] = team_stats["RunsScored"] - team_stats["RunsAllowed"]
print(team_stats[["Team", "RunsScored", "RunsAllowed", "RunDiff"]])
Row-Wise .apply() When You Actually Need It
Use .apply(axis=1) when your logic truly spans multiple columns and isn’t easily vectorized (for example, conditional rules that depend on several fields).
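As an illustration of a conditional rule that reads several fields, here is a small sketch. The labeling rule in season_label is hypothetical and only meant to show the shape of such a function; the frame reuses a few values from the team data in this guide.

```python
import pandas as pd

teams = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL"],
    "Wins": [81, 94, 93],
    "Games": [162, 162, 162],
    "Playoffs": [0, 1, 1],
})

def season_label(row):
    # Hypothetical rule: playoff status wins over record
    if row["Playoffs"] == 1:
        return "Playoff team"
    # More wins than losses
    if row["Wins"] * 2 > row["Games"]:
        return "Winning record"
    return "Non-winning record"

# One Python call per row; fine for small frames
teams["Label"] = teams.apply(season_label, axis=1)
print(teams[["Team", "Label"]])
```

Even rules like this can often be rewritten with numpy.select or chained where calls once they stabilize, so treat the row-wise version as a readable starting point.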
compute a derived value from multiple columns
This pattern calculates a value per row using multiple inputs. The vectorized approach above is still preferred, but this shows the correct row-wise usage and options that affect speed and output.
def compute_run_diff(row):
    # Treat the row as read-only; return a scalar
    return row["RunsScored"] - row["RunsAllowed"]
# Row-wise apply (Python-level loop; can be slow on large data)
team_stats["RunDiff_apply"] = team_stats.apply(compute_run_diff, axis=1)
When the function is numeric and only needs raw arrays, passing raw=True skips some pandas overhead by providing a NumPy array to your function.
import numpy as np
def compute_run_diff_raw(values):
    # values is a NumPy array when raw=True
    # Order matches the column order we select
    rs, ra = values
    return rs - ra
team_stats["RunDiff_raw"] = team_stats[["RunsScored", "RunsAllowed"]].apply(
    compute_run_diff_raw, axis=1, raw=True
)
verify the function input once
When I wire up a new row function, I print one row once to confirm what’s being passed.
def debug_row(row):
    # Print the first row only
    if row.name == team_stats.index[0]:
        print("Example row:", row.to_dict())
    return 0
_ = team_stats.apply(debug_row, axis=1)
Case Study: Sums, Season Totals, and Text Flags
Given Rays-by-year stats, prefer built-in methods and elementwise tools over .apply() where possible.
rays_by_year = pd.DataFrame(
    {
        "Year": [2012, 2011, 2010, 2009, 2008],
        "RunsScored": [697, 707, 802, 803, 774],
        "RunsAllowed": [577, 614, 649, 754, 671],
        "Wins": [90, 91, 96, 84, 97],
        "Playoffs": [0, 1, 1, 0, 1],
    }
).set_index("Year")
get column sums efficiently
Use DataFrame.sum() instead of .apply(sum). It’s faster and clearer.
# Preferred
totals = rays_by_year.sum(axis=0)
print(totals)
# If you must use apply (not recommended here)
totals_apply = rays_by_year.apply(sum, axis=0)
compute total runs scored in a season
Use vectorized arithmetic across columns; no .apply() needed. Here the total is runs scored plus runs allowed.
rays_by_year["TotalRuns"] = rays_by_year["RunsScored"] + rays_by_year["RunsAllowed"]
print(rays_by_year[["RunsScored", "RunsAllowed", "TotalRuns"]].head())
convert a 0/1 flag to text
Prefer Series.map() or replace() for elementwise transforms on one column.
# Using map with a dict
rays_by_year["PlayoffsText"] = rays_by_year["Playoffs"].map({0: "No", 1: "Yes"})
# Equivalent with replace
rays_by_year["PlayoffsText2"] = rays_by_year["Playoffs"].replace({0: "No", 1: "Yes"})
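The two calls are only equivalent when every value has an entry in the dict. The sketch below uses a made-up Series with a deliberately unmapped value (2) to show where they differ.

```python
import pandas as pd

# Illustrative data: 2 has no entry in the lookup dict
flags = pd.Series([0, 1, 2], name="Playoffs")
labels = {0: "No", 1: "Yes"}

mapped = flags.map(labels)        # values missing from the dict become NaN
replaced = flags.replace(labels)  # values missing from the dict are kept as-is

print(mapped.tolist())
print(replaced.tolist())
```

Choose map() when an unmapped value should surface as missing, and replace() when unknown values should pass through unchanged.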
Control the Output Shape From .apply()
When your function returns more than one value per row, make the shape explicit. This avoids surprises and future version breakage.
return a list and expand to columns
Use result_type='expand' to split a list/array into multiple columns.
def wins_losses(row):
    # Derive wins and losses as two outputs
    wins = row["Wins"]
    losses = row["Games"] - row["Wins"] if "Games" in row else np.nan
    return [wins, losses]
# Example using team_stats, which has "Wins" and "Games"
expanded = team_stats.apply(wins_losses, axis=1, result_type="expand")
expanded.columns = ["WinsOut", "LossesOut"]
team_stats = pd.concat([team_stats, expanded], axis=1)
return a series or dict to name columns automatically
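Returning a Series from each row produces a DataFrame whose columns match the Series labels. A dict does the same only when you pass result_type='expand'; by default you get a Series of dicts.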
def summary_row(row):
    return pd.Series(
        {
            "IsWinningSeason": row["Wins"] >= 90,
            "RunRatio": (row["RunsScored"] / row["RunsAllowed"]),
        }
    )
summary = team_stats.apply(summary_row, axis=1)
team_stats = pd.concat([team_stats, summary], axis=1)
Performance Tips That Matter
Row-wise .apply(axis=1) is convenient but slow on large frames because it calls your Python function once per row. These patterns avoid that bottleneck.
- Prefer vectorized pandas/NumPy operations and built-in methods like sum, mean, clip, where, astype, and string/accessor methods.
- For numeric row/column operations that accept arrays, pass raw=True to get a NumPy array and reduce pandas overhead.
- For elementwise work on a single column, use Series.map() or vectorized methods instead of DataFrame.apply(axis=1).
- Avoid mutating the provided row/column inside your function; return a new value instead.
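To see the cost difference on your own machine, a rough timing sketch like the following can help. The data is synthetic (random integers, 20,000 rows), the size is an arbitrary choice, and the timings it prints will vary by hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "RunsScored": rng.integers(500, 900, size=n),
    "RunsAllowed": rng.integers(500, 900, size=n),
})

# Row-wise: one Python call per row
start = time.perf_counter()
slow = df.apply(lambda row: row["RunsScored"] - row["RunsAllowed"], axis=1)
apply_seconds = time.perf_counter() - start

# Vectorized: one array operation for the whole column
start = time.perf_counter()
fast = df["RunsScored"] - df["RunsAllowed"]
vectorized_seconds = time.perf_counter() - start

# Both approaches produce the same values; only the cost differs
same_values = bool((slow == fast).all())
print(f"same values: {same_values}")
print(f"apply: {apply_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

A single perf_counter measurement is noisy; for anything you plan to report, repeat the measurement (for example with the timeit module).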
Version Notes You Should Know (pandas 2.x)
Recent pandas releases tightened behavior around .apply() so results are more predictable.
- List-like returns: In 2.0+, mismatched list lengths raise an error. Use consistent lengths and set result_type='expand' when you want multiple columns.
- Output expansion: Returning a Series expands into a DataFrame with its labels as columns (stable since 0.23). Returning a dict expands the same way only with result_type='expand'.
- Series.apply() changes: The convert_dtype argument is deprecated (since 2.1). If you need mixed types, cast to object first (for example, s.astype("object").apply(fn)). Newer signatures allow controlling whether fn receives scalars or a Series in more contexts (for example, by_row).
.apply() vs. .map() vs. .applymap() vs. .agg()
Choose the API that matches the shape of your transformation.
- Series.map(func_or_dict): Elementwise transform on one column. Best for lookups or simple functions.
- DataFrame.apply(func, axis=1): Row-wise logic that needs multiple columns.
- DataFrame.apply(func, axis=0): Column-wise logic (each input is a Series representing a column).
- DataFrame.applymap(func): Elementwise over every cell. Deprecated since pandas 2.1 in favor of DataFrame.map(func). Use sparingly; vectorized methods are faster.
- .agg()/.transform(): Aggregations and group-wise transforms; prefer these in groupby pipelines.
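The sketch below runs each API on the same small frame so the input and output shapes can be compared side by side. The frame reuses a few values from the team data in this guide; the elementwise variable is a hypothetical helper that picks DataFrame.map when it exists (pandas 2.1 or later) and falls back to applymap otherwise.

```python
import pandas as pd

df = pd.DataFrame({
    "League": ["NL", "NL", "AL"],
    "RunsScored": [734, 700, 712],
    "RunsAllowed": [688, 600, 705],
})
nums = df[["RunsScored", "RunsAllowed"]]

# Series.map: elementwise lookup on one column
league_names = df["League"].map({"NL": "National", "AL": "American"})

# DataFrame.apply(axis=1): one call per row
diff = nums.apply(lambda row: row["RunsScored"] - row["RunsAllowed"], axis=1)

# DataFrame.apply(axis=0): one call per column
spread = nums.apply(lambda col: col.max() - col.min())

# Elementwise over every cell; helper chosen by available method (assumption)
elementwise = nums.map if hasattr(nums, "map") else nums.applymap
hundreds = elementwise(lambda v: v // 100)

# groupby + agg reduces each group; transform keeps the original length
league_totals = df.groupby("League")["RunsScored"].agg("sum")
league_mean = df.groupby("League")["RunsScored"].transform("mean")

print(diff.tolist(), spread.to_dict(), league_totals.to_dict())
```

Note that agg returns one value per group while transform returns one value per original row, which is what makes transform convenient for adding group statistics back as a column.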
Common Mistakes and Quick Fixes
These are the issues that most often cause incorrect results or slowdowns, with ways to correct them.
- Forgetting axis=1 for per-row logic: If your function suddenly receives columns instead of rows, add axis=1.
- Unexpected Series of lists: When returning a list/array from each row, set result_type='expand' to split into columns.
- Slow row-wise code: Replace with vectorized expressions or built-ins; if not possible, consider raw=True to reduce overhead.
- Mutating the input row: Treat inputs as read-only, and return new values.
- Dtype drift: If your function sometimes returns non-integers, integer columns can upcast to float or object. Cast afterward with astype if needed.
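Dtype drift is easiest to see with a function that occasionally returns a missing value. In this sketch the frame and the losses rule are made up for illustration, and the nullable "Int64" dtype is one possible way to get integers back while keeping the missing entry.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last row has no recorded games
df = pd.DataFrame({"Wins": [81, 94, 93], "Games": [162, 162, 0]})

def losses(row):
    # Hypothetical rule: a season with no recorded games has unknown losses
    if row["Games"] == 0:
        return np.nan
    return row["Games"] - row["Wins"]

raw_losses = df.apply(losses, axis=1)
print(raw_losses.dtype)  # float64, because NaN forces an upcast from int

# Cast afterward; nullable Int64 keeps integers and marks the gap as <NA>
clean_losses = raw_losses.astype("Int64")
print(clean_losses.dtype)
```

If the column should never contain missing values, filling them first and casting to a plain int64 is an alternative.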
Conclusion
.apply() is a flexible tool, but it’s not a performance shortcut. Use vectorized operations for arithmetic, aggregation, and elementwise transforms on a single column. Reach for DataFrame.apply(axis=1) only when your logic really needs multiple columns per row. When you do use it, control output shape with result_type, consider raw=True for numeric functions, and keep an eye on dtypes. These patterns produce predictable results on modern pandas and scale better as your data grows.

