Which tree species should the city plant?
Background
You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.
The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.
The data
The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):
Tree Census
- "tree_id" - Unique id of each tree.
- "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
- "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
- "spc_common" - Common name for the species.
- "status" - Indicates whether the tree is alive or standing dead.
- "health" - Indication of the tree's health (Good, Fair, and Poor).
- "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
- "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
- "root_other" - Indicates the presence of other root problems.
- "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
- "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
- "trnk_other" - Indicates the presence of other trunk problems.
- "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
- "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
- "brch_other" - Indicates the presence of other branch problems.
- "postcode" - Five-digit zip code where the tree is located.
- "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
- "nta_name" - Neighborhood name.
- "latitude" - Latitude of the tree, in decimal degrees.
- "longitude" - Longitude of the tree, in decimal degrees.
Neighborhoods' geographical information
- "ntacode" - NTA code (matches Tree Census information).
- "ntaname" - Neighborhood name (matches Tree Census information).
- "geometry" - Polygon that defines the neighborhood.
Tree census and neighborhood information from the City of New York NYC Open Data.
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

trees = pd.read_csv('data/trees.csv')
neighborhoods = gpd.read_file('data/nta.shp')

#Get an idea of which columns are included in the datasets and which columns they have in common
trees.info()
neighborhoods.info()
#Check which and how many neighbourhoods are included in the trees dataset
#print(trees["nta"].nunique())
#First subset the neighbourhoods dataset for Manhattan only
Manhattan_neighbourhoods = neighborhoods[neighborhoods["boroname"] == 'Manhattan']
#Create a list of ntacodes of Manhattan, use this to subset the trees dataset so that it only consists of those nta's in Manhattan.
nta_Manhattan = Manhattan_neighbourhoods["ntacode"].unique()
mnh = trees[trees["nta"].isin(nta_Manhattan)]
mnh

#Check for and count missings per variable
mnh.isna().sum()

#Look at distribution of values of variables
mnh.describe()
#Filter for out of range cases
#mnh[mnh["postcode"]< 10000]
#mnh[mnh["tree_dbh"]<1]
#mnh[mnh["health"].isnull()]
#mnh[mnh["status"]=="Dead"]
#mnh[(mnh["health"].isnull()) & (mnh["status"]=="Dead")]
#Above queries return the same number of 1802, thus all null values for health and spc_common are dead trees.
There are 64229 trees in Manhattan in total. Two variables (health and spc_common) contain missing values, in the same 1802 cases, and two other variables contain out-of-range values. Subsetting the rows with missing values, the rows with missing values and status "Dead", and the rows with status "Dead" returns 1802 cases each time, so the missing values all belong to dead trees.
94 cases have an out-of-range value for postcode, and 43 for trunk diameter. We will filter these cases out.
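The check described above (missing health and species values coincide with dead trees) can be sketched on a small made-up frame. The toy rows and the variable names `toy`, `same_rows`, and `n_dead` are assumptions for illustration only; the column names and the "Dead" status value follow the census description.

```python
import pandas as pd

# Toy stand-in for the Manhattan subset; the values are invented for illustration.
toy = pd.DataFrame({
    "tree_id": [1, 2, 3, 4, 5],
    "status": ["Alive", "Dead", "Alive", "Dead", "Alive"],
    "health": ["Good", None, "Fair", None, "Poor"],
    "spc_common": ["honeylocust", None, "ginkgo", None, "pin oak"],
})

# Boolean masks: one per condition we want to compare.
missing_health = toy["health"].isna()
missing_species = toy["spc_common"].isna()
is_dead = toy["status"] == "Dead"

# True only if all three masks flag exactly the same rows.
same_rows = bool(
    (missing_health == is_dead).all() and (missing_species == is_dead).all()
)
n_dead = int(is_dead.sum())
print(same_rows, n_dead)
```

Comparing the masks row by row is a slightly stronger check than comparing counts, since equal counts alone would not rule out different rows being flagged.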
mnh = mnh[(mnh["postcode"]>9999) & (mnh["tree_dbh"]>0)]

Our cleaned-up dataset now consists of 63252 cases. Let's look at the distributions of diameter and health values, as these are the characteristics the urban design team considers most desirable.
As we saw above, the average tree diameter is 8.5 inches. Below we see that the most common diameter is 4 inches.
#Count distinct trees per trunk diameter
mnh.groupby("tree_dbh")["tree_id"].nunique()

#Count distinct trees per health category
mnh.groupby("health")["tree_id"].nunique()

Let's look at the most common tree species next.
#What are the most common tree species in Manhattan? Group by species (spc_common), count distinct tree_id's per tree species and sort descending
species = mnh.groupby("spc_common")["tree_id"].nunique()
species.sort_values(ascending=False)
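Since size and health are the characteristics of interest, one possible next step is to summarise both per species. The following is a minimal sketch on made-up data, not the analysis used here: the toy rows, the `min_trees` threshold, and the choice of median diameter and share of trees in "Good" health as ranking criteria are all assumptions.

```python
import pandas as pd

# Toy data standing in for the cleaned Manhattan subset (values invented).
toy = pd.DataFrame({
    "tree_id": range(1, 9),
    "spc_common": ["ginkgo"] * 3 + ["pin oak"] * 3 + ["honeylocust"] * 2,
    "tree_dbh": [5, 6, 7, 12, 14, 16, 9, 11],
    "health": ["Good", "Good", "Fair", "Good", "Poor", "Good", "Good", "Good"],
})

# Per species: number of trees, median diameter, share of trees in Good health.
summary = (
    toy.assign(is_good=toy["health"].eq("Good"))
       .groupby("spc_common")
       .agg(
           n_trees=("tree_id", "nunique"),
           median_dbh=("tree_dbh", "median"),
           share_good=("is_good", "mean"),
       )
)

# Hypothetical threshold: ignore species with too few trees to judge.
min_trees = 3
ranked = (
    summary[summary["n_trees"] >= min_trees]
    .sort_values(["median_dbh", "share_good"], ascending=False)
)
print(ranked)
```

A minimum tree count keeps rare species from topping the ranking on the strength of a handful of large specimens; the right threshold for the real data would need to be chosen with the planning department's goals in mind.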