Buzzing Botany: A Scientific Report on Bee-friendly Blooms

📕 Overview

As part of our commitment to habitat conservation, we have undertaken a project with a local government environment agency aimed at establishing bee-friendly spaces. These spaces would play an important role in supporting the health and well-being of pollinator bees, which are vital contributors to the ecosystem's functioning and local biodiversity. To achieve this, we will explore the data collected on native and non-native plants and select the species that best support pollinator bees.

The research objectives are thus the following:

Investigate the plant preferences of native and non-native bee species.
Recommend to the agency the three most appropriate plant species to support native bees.

The plants_and_bees.csv file used for this analysis, wherein each row represents a sample taken from a patch of land with the studied plant species, contains the following fields:

Column	Description
sample_id	The ID number of the sample taken.
bees_num	The total number of bee individuals in the sample.
date	Date the sample was taken.
season	Season during sample collection ("early.season" or "late.season").
site	Name of collection site.
native_or_non	Whether the sample was from a native or non-native plot.
sampling	The sampling method.
plant_species	The name of the plant species the sample was taken from. None indicates the sample was taken from the air.
time	The time the sample was taken.
bee_species	The bee species in the sample.
sex	The sex of the bee.
specialized_on	The plant genus the bee species preferred.
parasitic	Whether or not the bee is parasitic (0: no, 1: yes).
nesting	The bee's nesting method.
status	The status of the bee species.
nonnative_bee	Whether the bee species is native or not (0: no, 1: yes).

Source (data has been modified)

The report is structured into three distinct sections: an exploratory data analysis section, a main analysis section, and a final section for conclusions and recommendations.

📗 Exploratory Data Analysis

Data Cleaning

The objective of this section is to familiarize ourselves with the bees dataset and acquire an initial understanding of its characteristics. Presented in the table below are the first five rows of the dataset, with the headers renamed to enhance readability:

Upon initial inspection, several observations can be made, serving as preliminary steps for our feature engineering endeavors:

Null values are present in the dataset, specifically in the Preferred Plant Genus and Status columns. The treatment of these null values will depend on their frequency and type of missingness.
The Plant Species column also contains missing values, but these are recorded as "None" because the samples in those entries were collected from the air. As the plant species in these cases is unknown, it is best to replace these entries with null values.
The Sampling Date and Sampling Time variables are not recognized as datetime objects. We will convert them to the appropriate data type and merge them into a single column.
The categories within the Season attribute can be reassigned as "Early" and "Late," while Plot is Native can be remapped to the binary values of 1 ("yes") and 0 ("no").
The Bee Species column exhibits a hierarchical structure, where different bee species belong to the same higher-ranking group (genus). For example, "Andrena carlini" and "Andrena perplexa" both belong to the Andrena genus. We can expect the Bee Species column to have a vast number of categories, which we can condense into a new column, Bee Genus. We can expect similar behavior from Plant Species, which can most likely also be condensed into a Plant Genus column.
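
The cleaning steps listed above can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame, not the report's hidden code. The raw column names come from the field table; the new column names, the cell values, the "native"/"non-native" labels, the month/day/year date format, and the HHMM integer encoding of the time are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; every value here is invented for illustration.
raw = pd.DataFrame({
    "date": ["04/18/2017", "04/18/2017", "07/20/2017"],  # assumed format
    "time": [935, 1410, 1005],                           # assumed HHMM integers
    "season": ["early.season", "early.season", "late.season"],
    "native_or_non": ["native", "non-native", "native"],  # assumed labels
    "plant_species": ["None", "Daucus carota", "None"],
    "bee_species": ["Andrena carlini", "Andrena perplexa", "Halictus poeyi"],
})

df = raw.copy()

# "None" marks air samples: the plant is unknown, so store a real null.
df["plant_species"] = df["plant_species"].replace("None", np.nan)

# Merge date and time into a single datetime column.
time_str = df["time"].astype(str).str.zfill(4)
df["sampled_at"] = pd.to_datetime(df["date"] + " " + time_str,
                                  format="%m/%d/%Y %H%M")
df = df.drop(columns=["date", "time"])

# Relabel the seasons and turn the plot type into a 0/1 flag.
df["season"] = df["season"].map({"early.season": "Early",
                                 "late.season": "Late"})
df["plot_is_native"] = (df["native_or_non"] == "native").astype(int)
df = df.drop(columns="native_or_non")

# The genus is the first word of a binomial species name.
df["bee_genus"] = df["bee_species"].str.split().str[0]
df["plant_genus"] = df["plant_species"].str.split().str[0]

print(df)
```

Taking the first word of the species name only works for binomial names; entries that deviate from that pattern would need separate handling.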

The output of the .info() method, restyled as a dataframe, provides information about the presence of missing values, their percentage of the total number of rows, and the data types of the columns.

Confirming the first observation, six columns (highlighted in red) contain null values: Preferred Plant Genus, Species is Parasitic, Nesting Method, Status, Bee is Native, and Plant Species. While Species is Parasitic, Nesting Method, and Bee is Native have only a few missing values (less than 5%), Plant Species is missing 66% of its values, and Preferred Plant Genus and Status are each missing 99%. Finally, with regard to the third observation, we can confirm that the Sampling Date and Sampling Time variables are incorrectly classified as "object" and "int", respectively, while the Sample Id column, which is actually a label, is classified as an integer instead of as an object.

The remaining columns have all 1250 entries, with the correct data types assigned.
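
A summary of this kind can be assembled directly from pandas. The sketch below is one possible way to build it, assuming a hypothetical helper named info_frame and an invented four-row frame; it is not the styling code used in the report.

```python
import numpy as np
import pandas as pd

def info_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise non-null counts, null counts, null percentages, and dtypes."""
    nulls = df.isna().sum()
    return pd.DataFrame({
        "non_null": df.notna().sum(),
        "nulls": nulls,
        "null_pct": (nulls / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "sample_id": [17400, 17401, 17402, 17403],
    "status": [np.nan, np.nan, np.nan, "vulnerable"],
    "plant_species": [np.nan, "Daucus carota", np.nan, np.nan],
})

summary = info_frame(toy)
print(summary)
```

Each row of the result describes one column of the input, which makes it easy to sort by null percentage or to highlight the columns that contain nulls.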

Missing Data Analysis

It is imperative that we thoroughly examine the missing data prior to proceeding with deletion and/or imputation, as these processes are contingent on the type of missingness found in the columns. Thus, after replacing the None values in the Plant Species column with nulls, we create a sparsity matrix using all the columns in the dataset.

From the matrix, it may appear that the Plant Species column has data that is missing completely at random. In reality, the missingness is structural, as it is attributable to the sampling method used: only hand netting contains samples that indicate the plant species. Furthermore, beyond the missing values found in Bee is Native, these are also primarily of non-native bee species; two samples of native bees aren't representative of the population.
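
One way to check that the missingness is tied to the sampling method is to compute the share of missing plant species within each method. The sketch below uses an invented five-row frame; the column names follow the field table, and the spellings "pan traps" and "hand netting" are assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "sampling": ["pan traps", "pan traps", "pan traps",
                 "hand netting", "hand netting"],
    "plant_species": [np.nan, np.nan, np.nan,
                      "Daucus carota", "Trifolium repens"],
})

# Share of missing plant species within each sampling method.
missing_share = toy["plant_species"].isna().groupby(toy["sampling"]).mean()
print(missing_share)
```

A share of 1.0 for one method and 0.0 for the other indicates that the gaps are fully explained by how the sample was collected rather than by chance.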

On the other hand, the Preferred Plant Genus and Status columns have an extremely high level of sparsity and seem to have missing data that is not random. These two columns have a 1:1 relationship, with Status having slightly more entries than Preferred Plant Genus. Additionally, the columns Species is Parasitic, Nesting Method, and Bee is Native appear to be associated with each other. These relationships, along with the ones mentioned earlier, can be confirmed using a missingness heatmap.

The missingness analysis reveals interesting correlations between certain columns in the dataset. There is a strong correlation (0.7) in missingness between the columns Preferred Plant Genus and Status. When a bee species has a preference for a specific plant genus, its survival may be directly impacted by the availability of those plant species. Therefore, in our case, if a bee species does not show a preference, it may have been more challenging to attribute a status to it.

The Nesting Method and Species is Parasitic columns also exhibit a high correlation (0.8) in missingness. This is because, although a bee species' nesting method isn't determined solely by whether or not it is parasitic, the two categories exhibit different reproductive strategies. Finally, there is a near-perfect relationship in missingness between the columns Species is Parasitic and Bee is Native, suggesting that in most cases, if it is not possible to determine whether or not a bee species exhibits parasitism, it is also difficult to determine whether it is native to its environment.
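
The figures discussed above are correlations between missingness indicators. A minimal sketch of that computation follows, assuming the heatmap reports the Pearson correlation of the 0/1 indicators; the six-row frame is invented, so its values do not reproduce the 0.7 and 0.8 reported for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "parasitic": [0, np.nan, 1, np.nan, 0, 0],
    "nesting": ["ground", np.nan, "wood", np.nan, "ground", np.nan],
    "nonnative_bee": [0, np.nan, 0, np.nan, 1, 0],
    "bees_num": [3, 1, 4, 2, 5, 6],
})

# 1 where a value is missing, 0 where it is observed.
is_missing = toy.isna().astype(int)

# A column with no missing values has zero variance, so its correlation
# is undefined; leave it out.
is_missing = is_missing.loc[:, is_missing.var() > 0]

nullity_corr = is_missing.corr()
print(nullity_corr.round(2))
```

A value of 1 means two columns are always missing on the same rows, while a value near 0 means that knowing one column is missing says little about the other.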

We can draw three important conclusions regarding how to treat the missing data:

Given that the Status and Preferred Plant Genus columns are 99% sparse, with their missingness uncorrelated with that of the other variables, these will be dropped from the dataset and not considered in the rest of the analysis.
Nesting Method, Species is Parasitic, and Bee is Native have a low sparsity level (about 5%) and tend to miss values on the same rows. We will thus drop the rows that contain missing values in these columns using listwise deletion.
The challenge arises when treating the Plant Species variable. As seen earlier, its 65% sparsity isn't due to mere chance but to the sampling method used, pan traps, which by nature can't determine the plant species as the samples are collected artificially. However, given that we have enough data collected using hand netting, we can use it to try and impute the plant species that would've been associated with the sampled bee species. Thus, a machine learning algorithm will be used to this end.
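
The first two decisions translate into a column drop followed by a row-wise dropna. The sketch below assumes the raw column names from the field table and uses an invented four-row frame; Plant Species is deliberately left untouched because it is handled by imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "status": [np.nan, np.nan, np.nan, np.nan],
    "specialized_on": [np.nan, np.nan, "Penstemon", np.nan],
    "parasitic": [0, np.nan, 1, 0],
    "nesting": ["ground", np.nan, "wood", "ground"],
    "nonnative_bee": [0, np.nan, 0, 0],
    "plant_species": [np.nan, "Daucus carota", np.nan, np.nan],
})

cleaned = (
    toy.drop(columns=["status", "specialized_on"])   # 99% sparse columns
       .dropna(subset=["parasitic", "nesting", "nonnative_bee"])  # listwise
       .reset_index(drop=True)
)
print(cleaned)
```

Restricting dropna to a subset matters here: calling it without subset would also discard every row whose plant species is missing, which is most of the dataset.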

MissForest Imputation vs. KNNImputer for Missing Values

From the documentation, MissForest is a nonparametric imputation method which works with mixed-type data, nonlinear relations, complex interactions, and high dimensionality, requiring only that the observations be pairwise independent. For each feature, MissForest fits a random forest on the observed part and then predicts the missing part, repeating these two steps until it reaches a stopping criterion or a pre-defined maximum number of iterations. Meanwhile, KNNImputer imputes missing values using k-Nearest Neighbors: each sample's missing values are imputed using the average value from the n_neighbors nearest neighbors found in the training set.

The output of these two algorithms will be internally checked by comparing the distributions of the observed values and the imputed values; the dataset with the smallest discrepancies will be selected for the main analysis.
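
The report does not state which discrepancy measure is used for this comparison. As one possible choice, the sketch below scores each imputer by the total variation distance between the category proportions of the observed and imputed values; the function name, the series, and the labels are all hypothetical.

```python
import pandas as pd

def total_variation(observed: pd.Series, imputed: pd.Series) -> float:
    """Half the summed absolute difference between two category distributions."""
    p = observed.value_counts(normalize=True)
    q = imputed.value_counts(normalize=True)
    # Align on the union of categories; absent categories get proportion 0.
    p, q = p.align(q, fill_value=0.0)
    return float((p - q).abs().sum() / 2)

# Hypothetical observed and imputed plant labels.
observed = pd.Series(["A", "A", "B", "C"])
imputed_knn = pd.Series(["A", "A", "A", "A"])
imputed_forest = pd.Series(["A", "B", "B", "C"])

scores = {
    "knn": total_variation(observed, imputed_knn),
    "forest": total_variation(observed, imputed_forest),
}
best = min(scores, key=scores.get)
print(scores, best)
```

The distance ranges from 0 (identical distributions) to 1 (no shared categories), so the imputer with the lower score preserves the observed distribution more closely.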

In our case, the column containing missing values is categorical; this can be imputed directly with MissForest but not with KNNImputer, so for convenience we will first encode the feature to numeric values. Then, following the listwise deletion of missing values in Nesting Method, Species is Parasitic, and Bee is Native, the Sampling Date, Sampling Time, Status, and Preferred Plant Genus variables are dropped from the dataset. The first two are eliminated as MissForest doesn't handle datetime objects. The time factor will nevertheless be captured by Season as well as Sample Id, which, as a grouping variable, contains the differences in location and experimental conditions between samples. Status and Preferred Plant Genus, as previously mentioned, are eliminated due to their sparsity; estimates are otherwise likely to be extremely biased.

To impute Plant Species, the variable needs to follow a specific criterion: by subsetting the dataset into samples collected only using hand netting, it was found that some plant species occur only in native plots, while others only in non-native plots. Without separating the dataset into the distinct plot types, the algorithms risk imputing the missing plant species incorrectly.

The imputation then goes as follows:

The dataset is split into two parts, native plot and non-native plot.
A nested function is defined. It takes the encoded dataframe, transformed with an inner ordinal encoder function, and one of the two algorithms. The KNNImputer is implemented with its default parameters, while the MissForest algorithm is configured with random_state set to RandomState(3) for reproducibility, decreasing = False, so that imputation moves from the columns with the smallest number of missing values to those with the largest, and class_weight = "balanced" because the target variable of interest is highly imbalanced.
The imputation is then applied separately on the two dataset parts, with the final output being a dataframe devoid of missing values and with Plant Species decoded back to their original labels. The two halves are then merged back together.
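
The split, encode, impute, decode, and merge sequence can be sketched for the KNNImputer path as below. This is an illustration under stated assumptions, not the report's function: impute_part, the column bee_genus_code, and all data values are hypothetical, only the target column is encoded, and the toy run uses n_neighbors=1 (instead of the default used in the report) so that the outcome on eight rows is easy to follow. The MissForest path is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

TARGET = "plant_species"

def impute_part(part: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Encode the target, impute it with KNN, and decode it back to labels."""
    cat = part[TARGET].astype("category")
    labels = cat.cat.categories.to_numpy()
    encoded = part.copy()
    # Category codes use -1 for missing values; turn that back into NaN.
    encoded[TARGET] = cat.cat.codes.astype(float).replace(-1.0, np.nan)

    filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(encoded)
    codes = pd.Series(filled[:, encoded.columns.get_loc(TARGET)],
                      index=part.index)
    # KNN averages neighbor codes, so round and clip to a valid code.
    codes = codes.round().clip(0, len(labels) - 1).astype(int)

    out = part.copy()
    out[TARGET] = labels[codes.to_numpy()]
    return out

# Hypothetical data: each plant occurs in only one plot type.
toy = pd.DataFrame({
    "plot_is_native": [1, 1, 1, 1, 0, 0, 0, 0],
    "bee_genus_code": [0, 0, 5, 5, 0, 0, 5, 5],
    "plant_species": ["Asclepias tuberosa", np.nan,
                      "Monarda punctata", np.nan,
                      "Daucus carota", np.nan,
                      "Trifolium repens", np.nan],
})

# Impute each plot type separately, then merge the halves back together.
parts = [impute_part(part, n_neighbors=1)
         for _, part in toy.groupby("plot_is_native")]
imputed = pd.concat(parts).sort_index()
print(imputed)
```

Because each part is encoded and imputed on its own, a plant that only occurs in non-native plots can never be assigned to a native-plot sample. Note that averaging ordinal codes over several neighbors has no categorical meaning, which is one reason to compare the imputed distribution against the observed one before trusting the result.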

After re-inserting the Sampling Time and Sampling Date columns, we can check if the listwise deletion and imputation worked by plotting the nullity bar graph (below) on the new datasets. Now, all variables of interest contain 1182 entries (approximately 5% fewer than in the beginning).
