You are never stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.
You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
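As a minimal sketch of creating columns "from scratch", the snippet below assigns a constant and a list to new columns. The tiny DataFrame and the `species` and `owner_id` columns are hypothetical illustrations, not part of the original dataset.

```python
import pandas as pd

# Small illustrative DataFrame (a subset of the dog data used later on).
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

# A single value on the right-hand side is repeated for every row.
dogs["species"] = "dog"  # hypothetical column

# A list must supply exactly one value per row.
dogs["owner_id"] = [101, 102, 103]  # hypothetical IDs

print(dogs)
```

If the list length does not match the number of rows, pandas raises a `ValueError` rather than filling in missing values.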
Deriving a Column
Using a dog dataset, let's say you want to add a new column to your DataFrame that has each dog's height in meters instead of centimeters.
On the left-hand side of the equals sign, you use square brackets with the name of the new column you want to create, in this case height_m. On the right-hand side, you have the calculation.
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
      name        breed  color  height_cm  weight_kg date_of_birth  height_m
0    Bella     Labrador  Brown         56         24    2013-07-01      0.56
1  Charlie       Poodle  Black         43         24    2016-09-16      0.43
2     Lucy    Chow Chow  Brown         46         24    2014-08-25      0.46
3   Cooper    Schnauzer   Gray         49         17    2016-09-16      0.49
4      Max     Labrador  Black         59         29    2016-09-16      0.59
5   Stella    Chihuahua    Tan         18          2    2016-09-16      0.18
6   Bernie  St. Bernard  White         77         74    2016-09-16      0.77
Notice that the DataFrame you modified now contains both the existing height_cm column and the derived height_m column.
Adding a Column
In this example, you will calculate each dog's body mass index (BMI), or "doggy mass index", and add it as a column to your DataFrame. BMI is weight in kilograms divided by the square of height in meters.
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
print(dogs.head())
      name      breed  color  height_cm  weight_kg date_of_birth  height_m         bmi
0    Bella   Labrador  Brown         56         24    2013-07-01      0.56   76.530612
1  Charlie     Poodle  Black         43         24    2016-09-16      0.43  129.799892
2     Lucy  Chow Chow  Brown         46         24    2014-08-25      0.46  113.421550
3   Cooper  Schnauzer   Gray         49         17    2016-09-16      0.49   70.803832
4      Max   Labrador  Black         59         29    2016-09-16      0.59   83.309394
Again, the new column is on the left-hand side of the equals sign, but this time the calculation involves two columns.
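To try the two steps above end to end, here is a self-contained sketch that rebuilds the dog data from the printed table (the color and date_of_birth columns are left out for brevity, which is a simplification on our part) and then derives height_m and bmi.

```python
import pandas as pd

# Dog data copied from the table above; color and date_of_birth are omitted.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# Derive a column by changing units: centimeters to meters.
dogs["height_m"] = dogs["height_cm"] / 100

# Derive a column from two existing columns: weight / height squared.
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs.head())
```

The order matters here: bmi depends on height_m, so height_m has to be created first.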
Adding a Column with Multiple Manipulations
The real power of pandas comes in when you combine all the skills that you have learned so far. Let's figure out the names of skinny, tall dogs.
First, to define the skinny dogs, you take the subset of dogs that have a BMI of less than 100. Next, you sort the result in descending order of height to get the tallest skinny dogs at the top.
Finally, this time you will only keep the columns you are interested in.
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]
     name  height_cm        bmi
4     Max         59  83.309394
0   Bella         56  76.530612
3  Cooper         49  70.803832
5  Stella         18  61.728395
Here, you can see that Max is the tallest dog with a BMI of under 100.
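The same filter, sort, and select steps can also be written as one chained expression. The sketch below is one possible way to do that, not the course's own code; the `skinny_tall` variable name is ours, and the DataFrame is rebuilt from the dog table so the snippet runs on its own.

```python
import pandas as pd

# Dog data copied from the tables above (only the columns needed here).
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})
dogs["bmi"] = dogs["weight_kg"] / (dogs["height_cm"] / 100) ** 2

skinny_tall = (
    dogs[dogs["bmi"] < 100]                      # keep dogs with BMI under 100
    .sort_values("height_cm", ascending=False)   # tallest first
    .loc[:, ["name", "height_cm", "bmi"]]        # keep only the columns of interest
)

print(skinny_tall)
```

Chaining avoids the intermediate variables, while the step-by-step version is easier to inspect when something looks wrong; which one reads better is largely a matter of taste.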
Interactive Example
In the example below, you add a new column to the DataFrame homelessness, named total, containing the sum of the individuals and family_members columns. Then, you add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals. Finally, you print the homelessness DataFrame.
# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]
# See the result
print(homelessness)
When we run the above code, it produces the following result:
                region          state  individuals  family_members  state_pop     total  p_individuals
 0  East South Central        Alabama       2570.0           864.0    4887681    3434.0          0.748
 1             Pacific         Alaska       1434.0           582.0     735139    2016.0          0.711
 2            Mountain        Arizona       7259.0          2606.0    7158024    9865.0          0.736
 3  West South Central       Arkansas       2280.0           432.0    3009733    2712.0          0.841
 4             Pacific     California     109008.0         20964.0   39461588  129972.0            0.8
...
48      South Atlantic  West Virginia       1021.0           222.0    1804291    1243.0          0.821
49  East North Central      Wisconsin       2740.0          2167.0    5807406    4907.0          0.558
50            Mountain        Wyoming        434.0           205.0     577601     639.0          0.679
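The homelessness DataFrame itself is not defined in this article, so the sketch below rebuilds only the first five rows from the printed output in order to make the example runnable on its own. The final sanity check is our addition, not part of the original exercise.

```python
import pandas as pd

# First five rows copied from the printed output; the full dataset has 51 rows.
homelessness = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
    "state_pop": [4887681, 735139, 7158024, 3009733, 39461588],
})

# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# Sanity check (our addition): a proportion should lie between 0 and 1.
all_valid = homelessness["p_individuals"].between(0, 1).all()

print(homelessness)
print("All proportions valid:", all_valid)
```

A quick check like this is a cheap way to catch a swapped numerator and denominator in a derived column.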
Take a look at DataCamp's tutorials on Renaming Columns in a Pandas DataFrame and How To Drop Columns in Pandas.
To learn more about adding new columns in pandas, please see this video from our course Data Manipulation with pandas.
This content is taken from DataCamp’s Data Manipulation with pandas course by Maggie Matsui and Richie Cotton.