Competition - City Tree Species

Which tree species should the city plant?

📖 Background

You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.

The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.

💪 Challenge

Create a report that covers the following:

What are the most common tree species in Manhattan?
Which are the neighborhoods with the most trees?
A visualization of Manhattan's neighborhoods and tree locations.
What ten tree species would you recommend the city plant in the future?

import pandas as pd
import geopandas as gpd
trees = pd.read_csv('data/trees.csv')
neighborhoods = gpd.read_file('data/nta.shp')

💾 The data

The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):

Tree Census

"tree_id" - Unique id of each tree.
"tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
"curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
"spc_common" - Common name for the species.
"status" - Indicates whether the tree is alive or standing dead.
"health" - Indication of the tree's health (Good, Fair, and Poor).
"root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
"root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
"root_other" - Indicates the presence of other root problems.
"trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
"trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
"trnk_other" - Indicates the presence of other trunk problems.
"brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
"brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
"brch_other" - Indicates the presence of other branch problems.
"postcode" - Five-digit zip code where the tree is located.
"nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
"nta_name" - Neighborhood name.
"latitude" - Latitude of the tree, in decimal degrees.
"longitude" - Longitude of the tree, in decimal degrees.

Neighborhoods' geographical information

"ntacode" - NTA code (matches Tree Census information).
"ntaname" - Neighborhood name (matches Tree Census information).
"geometry" - Polygon that defines the neighborhood.

Tree census and neighborhood information from the City of New York NYC Open Data.

📢 Submission Content:

Exploratory Data Analysis
Data Cleaning
Answers to the Competition Questions

Q1. What are the most common tree species in Manhattan?
Q2. Which are the neighborhoods with the most trees?
Q3. A visualization of Manhattan's neighborhoods and tree locations.
Q4. What ten tree species would you recommend the city plant in the future?

🔍 Exploratory Data Analysis

Before answering the competition questions, we need to explore the datasets and clean them if needed.

# Stats of numerical features in the trees dataset
trees.describe()

# Identify missing data in the trees dataset
missing_data_tr = trees.isnull().sum()
missing_data_tr

We are looking at a dataset of 64,229 unique trees.
A trunk diameter of 318 inches (the maximum of tree_dbh) is very rare, especially given that the 75th percentile is only 11 inches. We should expect some outliers.
Only two columns have missing data: spc_common and health.
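
One possible way to quantify the suspected outliers in tree_dbh is the interquartile range (IQR) rule. The sketch below is not part of the original analysis: the `sample` values are made up for illustration, and the 1.5 × IQR threshold is a common convention assumed here rather than a choice stated in this report.

```python
import pandas as pd

# Hypothetical sample standing in for trees['tree_dbh'] (inches); values are made up
sample = pd.DataFrame({"tree_dbh": [3, 5, 6, 8, 9, 11, 12, 14, 318]})

# First and third quartiles, and the interquartile range
q1 = sample["tree_dbh"].quantile(0.25)
q3 = sample["tree_dbh"].quantile(0.75)
iqr = q3 - q1

# Assumed convention: values above Q3 + 1.5 * IQR are flagged as high outliers
upper_fence = q3 + 1.5 * iqr
outliers = sample[sample["tree_dbh"] > upper_fence]
print(outliers)
```

On the real data, the same steps would apply to trees['tree_dbh'] directly. Whether flagged trees should be dropped, capped, or kept is a separate decision.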

# Stats of numerical features in the neighborhoods dataset
neighborhoods.describe()

# Identify missing data in the neighborhoods dataset
missing_data_nb = neighborhoods.isnull().sum()
missing_data_nb

The summary statistics of the Neighborhoods dataset are not very informative. However, this dataset has no missing data.

# Check if the two datasets have the same list of neighborhoods
print('Number of neighborhoods in Trees: ', trees['nta'].nunique())
print('Number of neighborhoods in Neighborhoods: ', neighborhoods['ntacode'].nunique())

The trees are planted in only 28 of the 195 neighborhoods.
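
Comparing counts alone does not show whether every NTA code in the tree census actually has a matching polygon. A minimal sketch of that check follows, using small made-up tables (`trees_demo`, `nbhd_demo`, and the codes in them are hypothetical) in place of the real trees and neighborhoods data.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; the codes are made up
trees_demo = pd.DataFrame({"nta": ["MN01", "MN01", "MN02"]})
nbhd_demo = pd.DataFrame({"ntacode": ["MN01", "MN02", "BK01", "QN01"]})

tree_codes = set(trees_demo["nta"])
nbhd_codes = set(nbhd_demo["ntacode"])

# Codes that appear in the census but have no polygon (ideally empty)
missing_polygons = tree_codes - nbhd_codes
print("Codes without a polygon:", missing_polygons)

# Keep only the neighborhoods that contain at least one censused tree
nbhd_with_trees = nbhd_demo[nbhd_demo["ntacode"].isin(tree_codes)]
print(nbhd_with_trees)
```

The same `isin` filter should also work on the GeoDataFrame, since it supports pandas-style boolean indexing. That would narrow the map to the neighborhoods that actually have trees.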

🧹 Data Cleaning

spc_common and health are both categorical features, and there is no sound basis for imputing their missing values. We also don't want to drop these rows, as that would change the tree counts in some neighborhoods. Therefore, we will fill them with "unknown".
To mitigate errors caused by trailing whitespace or inconsistent capitalization, we will strip the whitespace and convert all text to lower case.
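
The cleaning steps described above could be sketched as follows. The `demo` frame and its values are hypothetical stand-ins for the trees dataset. Note that `str.strip()` removes leading as well as trailing whitespace, which is slightly broader than the rule stated above.

```python
import pandas as pd

# Hypothetical rows standing in for the trees dataset; values are made up
demo = pd.DataFrame({
    "spc_common": ["Honeylocust ", "honeylocust", None],
    "health": ["Good", " good", None],
})

for col in ["spc_common", "health"]:
    demo[col] = (
        demo[col]
        .fillna("unknown")  # label missing values instead of dropping rows
        .str.strip()        # remove leading and trailing whitespace
        .str.lower()        # normalize capitalization
    )

print(demo)
```

After this step, spelling variants such as "Honeylocust " and "honeylocust" collapse into a single category, so species counts are not split across duplicates.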
