Competition - City Tree Species

Which tree species should the city plant?

📖 Background

You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.

The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.

💾 The data

The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):

Tree Census
  • "tree_id" - Unique id of each tree.
  • "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
  • "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
  • "spc_common" - Common name for the species.
  • "status" - Indicates whether the tree is alive or standing dead.
  • "health" - Indication of the tree's health (Good, Fair, and Poor).
  • "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
  • "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
  • "root_other" - Indicates the presence of other root problems.
  • "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
  • "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
  • "trnk_other" - Indicates the presence of other trunk problems.
  • "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
  • "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
  • "brch_other" - Indicates the presence of other branch problems.
  • "postcode" - Five-digit zip code where the tree is located.
  • "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
  • "nta_name" - Neighborhood name.
  • "latitude" - Latitude of the tree, in decimal degrees.
  • "longitude" - Longitude of the tree, in decimal degrees.
Neighborhoods' geographical information
  • "ntacode" - NTA code (matches Tree Census information).
  • "ntaname" - Neighborhood name (matches Tree Census information).
  • "geometry" - Polygon that defines the neighborhood.

Tree census and neighborhood information from the City of New York NYC Open Data.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(sf))
trees <- readr::read_csv('data/trees.csv', show_col_types = FALSE)
trees
neighborhoods <- st_read("data/nta.shp", quiet=TRUE)
plot(neighborhoods)

💪 Challenge

Create a report that covers the following:

  • What are the most common tree species in Manhattan?
  • Which are the neighborhoods with the most trees?
  • A visualization of Manhattan's neighborhoods and tree locations.
  • What ten tree species would you recommend the city plant in the future?
## Installing and loading common packages and libraries
library(tidyverse)
library(dplyr)

treesdf <- readr::read_csv('data/trees.csv', show_col_types = FALSE)

## Exploring the data
head(treesdf)

The data was relatively clean. However, none of the dead trees had a common species name recorded, so I separated the dead trees from the alive ones to get more detailed results in line with the project's objectives.
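
The observation about missing species names can be checked by counting missing `spc_common` values within each `status` group. The sketch below uses a small invented data frame (`toy_trees`, a hypothetical stand-in for `treesdf`), so the counts it prints say nothing about the real census.

```r
library(dplyr)

# Toy stand-in for treesdf; the real data comes from data/trees.csv.
toy_trees <- data.frame(
  tree_id    = 1:5,
  status     = c("Alive", "Alive", "Dead", "Dead", "Alive"),
  spc_common = c("honeylocust", "ginkgo", NA, NA, "pin oak")
)

# For each status, count the trees and how many lack a common species name.
missing_by_status <- toy_trees %>%
  group_by(status) %>%
  summarise(
    total           = n(),
    missing_species = sum(is.na(spc_common)),
    .groups = "drop"
  )

print(missing_by_status)
```

If the same summary on `treesdf` shows `missing_species` equal to `total` for dead trees and zero for alive ones, splitting the data by status loses no species information.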

# Separate dead trees from alive ones
dead_trees <- treesdf %>% 
  filter(status == "Dead")

alive_trees <- treesdf %>% 
  filter(status == "Alive")
# Get the number of alive trees per species, most common first
tree_count <- alive_trees %>% 
  count(spc_common, name = "number_of_trees", sort = TRUE)

head(tree_count)
# Get the number of alive trees per neighborhood, largest count first
nta_location_count <- alive_trees %>% 
  count(nta_name, name = "number_of_trees", sort = TRUE)

# View the neighborhoods with the most and the fewest alive trees
head(nta_location_count)
tail(nta_location_count)
# Get the number of dead trees per neighborhood, largest count first
deadTrees.location_count <- dead_trees %>% 
  count(nta_name, name = "number_of_trees", sort = TRUE)

# View the neighborhoods with the most dead trees
head(deadTrees.location_count)

Since the urban design team believes trunk diameter and health are the most desirable characteristics of city trees, let's identify the species whose alive trees have a mean trunk diameter of at least 9 inches and of which at least 70% are in good health. These species form the set of trees to recommend.

# Count alive trees per species and health status
trees_by_health <- alive_trees %>% 
  group_by(spc_common, health) %>% 
  summarise(number = n(), .groups = "drop")

trees_by_healthWIDE <- trees_by_health %>% 
  spread(health, number)

## add row sums to get percentages
trees_by_healthWIDE$row_sum <- rowSums(trees_by_healthWIDE[ , c(2,3,4)], na.rm=TRUE)

trees_by_healthWIDE$percentage_good <- round(100*(trees_by_healthWIDE$Good/trees_by_healthWIDE$row_sum), 2)

trees_by_healthWIDE$percentage_fair <- round(100*(trees_by_healthWIDE$Fair/trees_by_healthWIDE$row_sum), 2)

trees_by_healthWIDE$percentage_poor <- round(100*(trees_by_healthWIDE$Poor/trees_by_healthWIDE$row_sum), 2)
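
One detail worth noting in the percentage step: when a species has no trees in a given health category, `spread()` leaves that cell as `NA`, so the corresponding percentage is also `NA` rather than 0. A minimal sketch of an alternative, assuming a recent tidyr and using invented toy data (`toy_alive` is hypothetical), fills the missing categories with zero via `pivot_wider()`:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for alive_trees; values are invented for illustration.
toy_alive <- data.frame(
  spc_common = c("ginkgo", "ginkgo", "ginkgo", "pin oak", "pin oak"),
  health     = c("Good", "Good", "Fair", "Good", "Poor")
)

health_share <- toy_alive %>%
  # One row per species and health status.
  count(spc_common, health, name = "number") %>%
  # One column per health status; absent combinations become 0 instead of NA.
  pivot_wider(names_from = health, values_from = number, values_fill = 0) %>%
  mutate(
    row_sum         = Fair + Good + Poor,
    percentage_good = round(100 * Good / row_sum, 2)
  )

print(health_share)
```

Here ginkgo has no poor trees, so its `Poor` count is 0 and every percentage for that species remains a number.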

# Summarise trunk diameter per species, largest mean diameter first
tree_diameter <- alive_trees %>% 
  group_by(spc_common) %>% 
  summarise(
    mean_diameter = mean(tree_dbh),
    median_diameter = median(tree_dbh),
    number = n()) %>%
  arrange(-mean_diameter)

# Merge the two summary data frames by species
tree_status.diameter_health <- merge(tree_diameter, trees_by_healthWIDE, by="spc_common")

# Clean the merged data frame
## Drop the duplicate count and the unused median, then rename the health and total columns
tree_status.diameter_health <- tree_status.diameter_health %>% 
  select(-number, -median_diameter) %>% 
  rename(
    fair_health = Fair,
    good_health = Good,
    poor_health = Poor,
    total_number_of_trees = row_sum)


# Filter by mean diameter (at least 9 inches) and share of trees in good health (at least 70%) to determine the best trees to recommend
recommended_trees <- tree_status.diameter_health %>% 
  filter(mean_diameter >= 9 & percentage_good >= 70) %>% 
  arrange(-percentage_good)

# View first 10 recommended trees to plant
head(recommended_trees, 10)
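
Sorting by `percentage_good` alone can put species with only a handful of trees at the top, since one or two healthy trees already give 100%. A possible refinement, sketched here on invented numbers, adds a minimum sample size to the filter. The threshold `min_trees` and the data frame `toy_summary` are hypothetical and are not part of the analysis above.

```r
library(dplyr)

# Toy stand-in for tree_status.diameter_health; values are invented for illustration.
toy_summary <- data.frame(
  spc_common            = c("species A", "species B", "species C"),
  mean_diameter         = c(12.0, 10.5, 9.2),
  percentage_good       = c(100, 82, 75),
  total_number_of_trees = c(2, 1500, 300)
)

# Hypothetical threshold; choose a value that suits the real data.
min_trees <- 50

robust_recommendations <- toy_summary %>%
  filter(
    mean_diameter >= 9,
    percentage_good >= 70,
    total_number_of_trees >= min_trees
  ) %>%
  arrange(desc(percentage_good))

print(robust_recommendations)
```

In this toy example "species A" drops out despite its perfect health share, because it is represented by only two trees.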

Key Findings and Summary

A visualization of Manhattan's neighborhoods and tree locations can be found here. https://public.tableau.com/app/profile/omotola.olasope/viz/AvisualizationofManhattansneighborhoodsandtreelocations_/Map?publish=yes
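
A comparable map could also be drawn in R with the `sf` and `ggplot2` packages. The sketch below is only an illustration: it builds two invented square "neighborhoods" and three invented tree points (`toy_neighborhoods` and `toy_trees` are hypothetical) in place of `neighborhoods` and `treesdf`. With the real data, it is an assumption that the shapefile uses longitude/latitude coordinates. If it does not, `st_transform()` would be needed before plotting.

```r
library(sf)
library(ggplot2)

# Build a square polygon from its lower-left corner (helper for the toy data only).
square <- function(x, y, size = 0.02) {
  st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

# Toy stand-in for the neighborhoods shapefile.
toy_neighborhoods <- st_sf(
  ntaname  = c("Area 1", "Area 2"),
  geometry = st_sfc(square(-74.00, 40.70), square(-73.98, 40.70), crs = 4326)
)

# Toy stand-in for the tree census; latitude and longitude are in decimal degrees.
toy_trees <- data.frame(
  tree_id   = 1:3,
  longitude = c(-73.995, -73.990, -73.970),
  latitude  = c(40.705, 40.715, 40.710)
)

# Turn the coordinate columns into point geometries.
tree_points <- st_as_sf(toy_trees, coords = c("longitude", "latitude"), crs = 4326)

# Neighborhood outlines with tree locations drawn on top.
tree_map <- ggplot() +
  geom_sf(data = toy_neighborhoods, fill = "grey95", colour = "grey40") +
  geom_sf(data = tree_points, colour = "darkgreen", size = 1, alpha = 0.6) +
  labs(title = "Neighborhoods and tree locations (toy data)") +
  theme_minimal()

print(tree_map)
```

For the full census, a smaller point size and lower `alpha` would likely be needed to keep dense areas readable.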

Table One
