
Which tree species should the city plant?

📖 Background

You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.

The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.

💾 The data

The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):

Tree Census
  • "tree_id" - Unique id of each tree.
  • "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
  • "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
  • "spc_common" - Common name for the species.
  • "status" - Indicates whether the tree is alive or standing dead.
  • "health" - Indication of the tree's health (Good, Fair, and Poor).
  • "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
  • "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
  • "root_other" - Indicates the presence of other root problems.
  • "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
  • "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
  • "trnk_other" - Indicates the presence of other trunk problems.
  • "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
  • "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
  • "brch_other" - Indicates the presence of other branch problems.
  • "postcode" - Five-digit zip code where the tree is located.
  • "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
  • "nta_name" - Neighborhood name.
  • "latitude" - Latitude of the tree, in decimal degrees.
  • "longitude" - Longitude of the tree, in decimal degrees.
Neighborhoods' geographical information
  • "ntacode" - NTA code (matches Tree Census information).
  • "ntaname" - Neighborhood name (matches Tree Census information).
  • "geometry" - Polygon that defines the neighborhood.

Tree census and neighborhood information from the City of New York NYC Open Data.

import pandas as pd
import geopandas as gpd
trees = pd.read_csv('data/trees.csv')
trees
neighborhoods = gpd.read_file('data/nta.shp')
neighborhoods
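
Since each tree record carries an NTA code that matches the neighborhood shapefile, the two sources can be linked for later spatial summaries. A minimal sketch of that join, assuming the column names listed above ("nta" in the census, "ntacode" in the shapefile); trees_with_nta is just an illustrative name:

# Merge the tree census with the neighborhood polygons on the NTA code
# (a left join keeps every tree, even if its NTA code has no polygon match)
trees_with_nta = trees.merge(
    neighborhoods[['ntacode', 'ntaname', 'geometry']],
    left_on='nta',
    right_on='ntacode',
    how='left'
)
# Wrap the result in a GeoDataFrame so geopandas can work with the polygons
trees_with_nta = gpd.GeoDataFrame(trees_with_nta, geometry='geometry')
trees_with_nta.head()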

First of all, we import all the required packages.

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import geopandas as gpd
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib.colors as colors
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE

Then, we will define some useful functions for the first part of the analysis.

# Change data types to categorical for columns with few unique values
def convert_to_categorical(df, tolerance):
    """
    Convert dataframe columns to categorical when their number of unique values
    is small relative to the total number of rows. Yes/No columns are mapped to 1 and 0.
    :param df: Dataframe to convert (modified in place)
    :param tolerance: Maximum allowed ratio of unique values to rows, in x.xx format
    :return: None. The dataframe is modified in place
    """
    for column in df.columns:
        if (df[column].nunique() < (tolerance * df.shape[0])) and (df[column].nunique() > 2):
            df[column] = df[column].astype('category')
        elif (df[column].nunique() < (tolerance * df.shape[0])) and (df[column].nunique() == 2):
            df[column] = df[column].astype('category')
            if "Yes" in df[column].unique():
                df[column] = df[column].map({'Yes': 1, 'No': 0})

# Get summary dataframe
def unique_values(df):
    """
    Get the distinct values of each column and how many there are.
    :param df: Dataframe to check
    :return: Dataframe with the list and count of unique values per column
    """
    values = df.apply(lambda col: col.unique())
    counts = df.apply(lambda col: col.nunique())
    resumen = pd.concat([values, counts], axis=1)
    return resumen

def categorical_variables_plots(df, cols_to_skip):
    """
    Create bar plots for categorical variables, three per figure.
    :param df: Dataframe with the variables to plot
    :param cols_to_skip: List of columns that should not be plotted
    :return: None. Displays the plots
    """
    ix = 1
    fig = plt.figure(figsize=(15, 10))
    for c in df.columns:
        if c in cols_to_skip:
            continue
        ax1 = fig.add_subplot(2, 3, ix)
        sns.countplot(data=df, x=c, ax=ax1)
        ix = ix + 1
        # Start a new figure once the first row of the 2x3 grid is full
        if ix == 4:
            fig = plt.figure(figsize=(15, 10))
            ix = 1
# Work on copies of the trees and neighborhoods dataframes
df_subset = trees.copy()
neighborhoods = neighborhoods.copy()
# Get info from dataset like number of nulls and data types
print(df_subset.info())
# Check for duplicated rows
df_subset.duplicated().sum()
# Print unique values for columns
resumen = unique_values(df_subset)
resumen
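
With the unique-value summary in hand, the helper functions defined above can be applied to the working copy. A minimal sketch, where the tolerance value (0.05) and the columns skipped from plotting are arbitrary illustrative choices:

# Convert low-cardinality columns to categorical / 0-1 indicators (in place);
# tolerance=0.05 is an arbitrary threshold for illustration
convert_to_categorical(df_subset, tolerance=0.05)
# Bar plots for the resulting categorical columns, skipping the
# high-cardinality species and neighborhood names to keep the charts readable
categorical_variables_plots(df_subset.select_dtypes(include='category'),
                            cols_to_skip=['spc_common', 'nta', 'nta_name'])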
# Check the number of NAs per column
df_subset.isna().sum()
# Let's explore the NAs with a bar chart
msno.bar(df_subset)

The proportion of NAs in every column is below 50%; otherwise, we would have had to drop those columns. Let's check whether there is anything relevant about the rows containing NAs.
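
One way to check is to see whether the rows with missing values coincide with a particular tree status (for example, standing dead trees having no recorded health). A minimal sketch of that check, using the columns described above:

# Rows that contain at least one missing value
rows_with_na = df_subset[df_subset.isna().any(axis=1)]
# Do the missing values coincide with a particular tree status?
print(rows_with_na['status'].value_counts(dropna=False))
# Compare with the status distribution of the full dataset
print(df_subset['status'].value_counts(normalize=True))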
