Competition - City Tree Species
Analyzing the Manhattan Tree Distribution


    The urban design team within the Department of City Planning wants to understand, and find ways to improve, the quantity and quality of trees in New York.

    Using New York's tree data from 2015, the profile resulting from the analysis in this report shows that:

    • 2 out of every 62 trees (about 3%) are considered dead
    • The average diameter of living trees is twice that of dead ones
    • Honeylocust is the species with the largest number of trees, accounting for more than 21% of all trees considered alive, followed by Callery Pear and Ginkgo
    • Upper West Side, Upper East Side-Carnegie Hill, and West Village are the top three neighborhoods with the most trees
    • Trees are evenly spread across the whole of Manhattan

    Based on the analysis, the top ten trees to consider planting are:

    • Honeylocust
    • Callery Pear
    • Ginkgo
    • Pin Oak
    • Sophora
    • London Planetree
    • Japanese Zelkova
    • Littleleaf Linden
    • American Elm
    • American Linden


    • The urban design team should prioritize planting trees with larger diameters in order to increase their chances of survival.
    • Additionally, the planning department should consider implementing measures to improve the overall health of the trees, such as providing adequate watering and nutrients or protecting them from pests and diseases.
    • They should also consider recording the species of dead trees, which is missing from the data, as this would show which species suffer the most deaths.


    #!pip install GridSpec
    #!pip install descartes
    #!pip install geopandas
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    # style name is 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6
    plt.style.use('seaborn-whitegrid')
    import seaborn as sns
    from matplotlib.gridspec import GridSpec
    import geopandas as gpd
    #import descartes
    from shapely.geometry import Point, Polygon
    #import geoplot
    from pandas.plotting import register_matplotlib_converters
    import warnings
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (16,12)
    plt.rcParams['axes.labelsize'] = 16
    plt.rcParams['axes.titlesize'] = 18
    plt.rcParams['legend.fontsize'] = 14
    plt.rcParams['xtick.labelsize'] = 14
    plt.rcParams['ytick.labelsize'] = 14

    Read the data

    trees = pd.read_csv('data/trees.csv')


    Preparing the data

    The data is first checked for missing entries. Many records are missing the species name and the tree health status. After ruling out duplicate records, the data is passed through the following transformations:

    • All missing values are replaced with a placeholder string
    • Categorical features are identified, and their data types changed accordingly
    • Trees with a diameter of 0 or a diameter greater than 20 are dropped
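
The three transformations above could be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the report's exact code: the diameter column name `tree_dbh`, the placeholder string `'Unknown'`, and the helper `prepare_trees` are assumptions made for the example, while `spc_common`, `health`, and `status` are the column names mentioned in this report.

```python
import pandas as pd

def prepare_trees(df, diameter_col="tree_dbh",
                  cat_cols=("spc_common", "health", "status"),
                  fill_value="Unknown"):
    """Hypothetical helper mirroring the three preparation steps."""
    out = df.copy()
    for col in cat_cols:
        # 1. replace missing values with a placeholder string
        # 2. store the column as a categorical feature
        out[col] = out[col].fillna(fill_value).astype("category")
    # 3. drop trees with a diameter of 0 or greater than 20
    keep = (out[diameter_col] > 0) & (out[diameter_col] <= 20)
    return out[keep].reset_index(drop=True)

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "tree_dbh": [0, 5, 12, 25, 3],
    "spc_common": ["honeylocust", None, "ginkgo", "pin oak", None],
    "health": ["Good", None, "Fair", "Good", None],
    "status": ["Alive", "Dead", "Alive", "Alive", "Dead"],
})

clean = prepare_trees(toy)
print(clean)
print(clean.dtypes)
```

Filling the missing values before converting to `category` means the placeholder becomes an ordinary category level, so dead trees remain visible in later group-by summaries instead of being silently dropped.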


    # make a copy of the data
    tree_info = trees.copy()

    Only two columns have null values: spc_common (the tree species) and health (the health status of the tree: Good, Fair, or Poor). Some columns also have incorrect data types, which need to be converted to the correct type.

    # check for missing value and data types
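
A check like this could look roughly as follows. It is a sketch on a made-up toy DataFrame rather than the report's original cell; the helper name `summarize_columns` and the `tree_dbh` column are assumptions for the example.

```python
import pandas as pd

def summarize_columns(df):
    """Hypothetical helper: dtype and missing-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "tree_dbh": [4, 7, 11, 2],
    "spc_common": ["honeylocust", None, "ginkgo", "pin oak"],
    "health": ["Good", None, "Fair", "Good"],
    "status": ["Alive", "Dead", "Alive", "Alive"],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, `tree_info.info()` gives a similar overview in a single call.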

    Looking at the dataframe, the data appear to be missing not at random: for every tree with status Dead, the two columns are missing in that row. Let's investigate further. Only about 1802 observations (roughly 3% of the records) are missing in the two columns. From my investigation, all trees with status Dead have null values in the aforementioned columns, while trees with status Alive have no missing values.
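
One way to test this pattern is to compare missingness against tree status directly. The snippet below is a hedged sketch on toy data, not the report's own check; the variable names are illustrative, and only `spc_common`, `health`, and `status` come from the report.

```python
import pandas as pd

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "spc_common": ["honeylocust", None, "ginkgo", None, "pin oak"],
    "health": ["Good", None, "Fair", None, "Poor"],
    "status": ["Alive", "Dead", "Alive", "Dead", "Alive"],
})

cols = ["spc_common", "health"]

# True where both columns are missing in the same row
both_missing = toy[cols].isnull().all(axis=1)

# Share of rows with both columns missing, per status
share_by_status = both_missing.groupby(toy["status"]).mean()
print(share_by_status)

# The pattern holds if every Dead row is missing both columns
# and no Alive row is missing either of them
dead_all_missing = bool(both_missing[toy["status"] == "Dead"].all())
alive_none_missing = not bool(
    toy.loc[toy["status"] == "Alive", cols].isnull().any().any()
)
print(dead_all_missing, alive_none_missing)
```

If both flags are true on the real data, the missing values are fully explained by tree status, which supports treating them as a separate category rather than imputing them.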

    # check record with null entries
    null_entries = tree_info[tree_info.isnull().any(axis = 1)]