Competition - City Tree Species

Analyzing the Manhattan Tree Distribution

Summary

The urban design team within the Department of City Planning is looking to understand, and find ways of improving, the quantity and quality of trees in New York.

Using New York's tree data from 2015, the analysis in this report shows that:

2 out of every 62 trees are considered dead
The average diameter of living trees is twice that of dead ones
Honeylocust is the species with the largest number of trees, accounting for more than 21% of all trees considered alive, followed by Callery Pear and Ginkgo
Upper West Side, Upper East Side-Carnegie Hill, and West Village are the three neighborhoods with the most trees
Trees are spread evenly across the whole of Manhattan

Based on the analysis, the top ten trees to consider planting are:

Honeylocust
Callery Pear
Ginkgo
Pin Oak
Sophora
London Planetree
Japanese Zelkova
Littleleaf Linden
American Elm
American Linden

Recommendation

The urban design team should prioritize planting trees with larger diameters in order to increase their chances of survival.
Additionally, the planning department should consider measures to improve the overall health of the trees, such as providing adequate water and nutrients or protecting them from pests and diseases.
They should also record the species of dead trees, which was missing from the data, as this would enable them to know which species suffer the most deaths.

IMPORT PACKAGES AND LIBRARIES

# GridSpec is imported from matplotlib.gridspec below, so it needs no separate install
#!pip install descartes
#!pip install geopandas

import pandas as pd
import numpy as np
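import matplotlib.pyplot as plt
# Note: matplotlib >= 3.6 renamed the seaborn styles to 'seaborn-v0_8-whitegrid', 'seaborn-v0_8-deep', etc.
plt.style.use('seaborn-whitegrid')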
import seaborn as sns
from matplotlib.gridspec import GridSpec
import geopandas as gpd
#import descartes
from shapely.geometry import Point, Polygon
#import geoplot
from pandas.plotting import register_matplotlib_converters
import warnings

register_matplotlib_converters()
%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('seaborn-deep')
plt.rcParams['figure.figsize'] = (16,12)
plt.rcParams['axes.labelsize'] = 16
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['legend.fontsize'] = 14
plt.rcParams['xtick.labelsize'] = 14
plt.rcParams['ytick.labelsize'] = 14

Read the data

trees = pd.read_csv('data/trees.csv')
trees.head()

DATA WRANGLING

Preparing the data

The data is first checked for missing entries. Some records are missing the species name and the health status. After ruling out duplicate records, the data is passed through the following transformations:

All missing values are replaced with a placeholder string
Categorical features are identified, and their data type changed accordingly
Trees with a diameter of 0 or a diameter greater than 20 are dropped
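
The transformations listed above could be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual code: the diameter column name `tree_dbh`, the placeholder string `'Unknown'`, and the helper `prepare_trees` are assumptions introduced here, while `status`, `spc_common`, and `health` are the columns named in this report.

```python
import pandas as pd

def prepare_trees(df, diameter_col="tree_dbh", placeholder="Unknown",
                  categorical_cols=("status", "spc_common", "health")):
    """Sketch of the cleaning steps; column names and placeholder are assumptions."""
    out = df.drop_duplicates().copy()
    for col in categorical_cols:
        # Fill missing values with a placeholder string, then store as categorical
        out[col] = out[col].fillna(placeholder).astype("category")
    # Keep only trees with a diameter above 0 and at most 20
    keep = (out[diameter_col] > 0) & (out[diameter_col] <= 20)
    return out[keep].reset_index(drop=True)

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "tree_dbh": [0, 5, 12, 25, 8],
    "status": ["Alive", "Alive", "Alive", "Alive", "Dead"],
    "spc_common": ["honeylocust", "ginkgo", None, "pin oak", None],
    "health": ["Good", "Fair", "Good", "Poor", None],
})
cleaned = prepare_trees(toy)
print(cleaned)
```

Filling the gaps before converting to the categorical dtype keeps the placeholder as an ordinary category, so dead trees remain visible in later group-by counts instead of being dropped silently.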

VISUAL AND PROGRAMMATIC ASSESSMENT

# make a copy of the data
tree_info = trees.copy()

Only two columns have null values: spc_common (the tree species) and health (the health status of the tree: Good, Fair, or Poor). Some columns also have incorrect data types, which need to be converted to the correct type.

# check for missing value and data types
tree_info.info()
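
`info()` reports non-null counts per column; a compact table of missing counts and percentages can make the gaps easier to read. The sketch below runs on a small made-up frame. The helper name `missing_summary` is hypothetical, and in the notebook it would be called on `tree_info`.

```python
import pandas as pd

def missing_summary(df):
    """Return missing count, percentage, and dtype for columns that contain nulls."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    return summary[summary["missing"] > 0]

# Toy data for illustration only
toy = pd.DataFrame({
    "status": ["Alive", "Dead", "Alive", "Alive"],
    "spc_common": ["honeylocust", None, "ginkgo", "pin oak"],
    "health": ["Good", None, "Fair", "Good"],
})
print(missing_summary(toy))
```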

Looking at the dataframe, the data appears to be missing not at random: for every tree with status Dead, both columns are missing in that row. Let's investigate further. Only about 1802 observations (roughly 3% of the rows) have missing values in the two columns. From my investigation, all trees with status Dead have null values in the aforementioned columns, while trees with status Alive have no missing values.

# check record with null entries
null_entries = tree_info[tree_info.isnull().any(axis = 1)]
null_entries
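
One way to confirm that the missing values are tied to tree status is to cross-tabulate `status` against a flag marking rows that contain any null. The following is a sketch on toy data, not output from the real dataset. The name `has_null` is introduced here for illustration, and in the notebook the same calls would be applied to `tree_info`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "status": ["Alive", "Dead", "Alive", "Dead", "Alive"],
    "spc_common": ["honeylocust", None, "ginkgo", None, "pin oak"],
    "health": ["Good", None, "Fair", None, "Good"],
})

# Flag rows that contain at least one null, then count them by status
has_null = toy.isnull().any(axis=1).rename("has_null")
table = pd.crosstab(toy["status"], has_null)
print(table)

# Share of rows with nulls within each status group
null_share = has_null.groupby(toy["status"]).mean()
print(null_share)
```

If the nulls depend entirely on status, the share would be 1.0 for Dead and 0.0 for Alive, which is the pattern described above.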
