Competition - Preserving Our Pollinators: Leveraging Data to Enhance Native Bee Habitats

Introduction

In the realm of environmental stewardship, a key concern of our local government environment agency is the promotion of biodiversity and the protection of our invaluable pollinator bees. These industrious insects play an indispensable role in pollinating both wild plants and agricultural crops, thereby holding our ecosystems together and safeguarding our food supply.

In a bid to bolster these native bee populations and create habitats conducive to their survival, our project advocates the use of both native and non-native plants. However, the choice of plant species is not made lightly. It is essential to select the right combination of plants that not only invite native bees but also dissuade the presence of parasitic invasive species, which pose a significant threat to native bee populations.

These parasitic species are known to usurp the resources of native bees, introduce diseases, and disrupt the equilibrium of local ecosystems. Therefore, to protect our native bees, it is necessary to avoid using plant species that are preferred by such invasive bees in our pollinator-friendly spaces.

Our team has collected extensive data on native and non-native plants and their effects on various pollinator bees. This dataset forms the foundation for the present report, in which we conduct a comprehensive analysis to identify the optimal plant species that would create an environment most beneficial for our native bees while discouraging parasitic invasive species.

By harnessing advanced data science techniques and machine learning models, we can distill meaningful insights from our dataset, uncover patterns among bee populations, and guide our plant selection process with empirical evidence.

This report aims to deliver science-based recommendations for plant selection and offer a model for integrating data-driven decision-making into environmental initiatives. As we strive to nurture our ecosystems and promote biodiversity, the careful and informed application of data remains central to our mission.

The Dataset

The data for this analysis is derived from the file plants_and_bees.csv, which contains extensive information on plant species and their interaction with different bee species. Each row in the dataset represents a unique sample collected from a specific patch of land, providing insights into the intricate relationships between the plant and bee species found in that area.

The dataset comprises the following fields:

sample_id: The unique identification number assigned to each collected sample.
species_num: The number of different bee species identified in the sample.
date: The specific date on which the sample was collected.
season: The season during which the sample was collected, categorized as "early.season" or "late.season".
site: The designated name of the site from where the sample was collected.
native_or_non: A categorization of whether the sample was collected from a native or a non-native plant.
sampling: The method utilized to collect the sample.
plant_species: The name of the plant species from which the sample was taken. If the field reads 'None', it indicates that the sample was collected from the air.
time: The specific time of day when the sample was taken.
bee_species: The species of the bee identified in the sample.
sex: The sex of the identified bee.
specialized_on: The plant genus that the bee species shows a preference for.
parasitic: An indicator of whether the bee species is parasitic or not, where 0 denotes 'No' and 1 denotes 'Yes'.
nesting: The specific nesting method employed by the bees.
status: The status or condition of the bee species.
nonnative_bee: An indicator of whether the bee species is non-native, where 0 signifies 'No' (native) and 1 signifies 'Yes' (non-native).

The data used in this study were sourced from Data Dryad, with some modifications made to better fit the goals of analysis. By leveraging this dataset, this analysis aims to identify key trends and patterns that will inform the creation of pollinator bee-friendly spaces. Through our data-driven approach, we aim to uncover valuable insights that will guide our decisions and strategies in promoting local biodiversity and strengthening our ecosystems.

Data Cleaning, Missing Value Imputation, and Analysis

In any data analysis project, the preprocessing phase, which includes tasks like data cleaning and handling missing values, is critically important. This process ensures that the data fed into our machine learning models is accurate, relevant, and structured in a way that the models can interpret.

In the case of our plants_and_bees.csv dataset, a thorough preliminary assessment of the data revealed some challenges that needed to be addressed before diving into the main analysis.

Firstly, two columns, specialized_on and status, had over 98% of their values missing. The sheer volume of missing data in these columns presented a significant obstacle for any imputation method. Attempting to fill these gaps could introduce a large amount of noise or inaccurate data, which might distort our subsequent analysis. As a result, the decision was made to drop these columns from the dataset entirely.
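The column-dropping step described above can be sketched as follows. This is a minimal illustration, not the exact code used in the analysis: the helper name drop_sparse_columns, the 0.98 threshold argument, and the toy values are assumptions introduced here.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.98):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isnull().mean()  # share of NaN per column
    sparse_cols = missing_fraction[missing_fraction > threshold].index.tolist()
    return df.drop(columns=sparse_cols), sparse_cols


# Toy frame standing in for plants_and_bees.csv (values are made up).
toy = pd.DataFrame({
    "sample_id": range(100),
    "parasitic": [0.0] * 90 + [np.nan] * 10,
    "specialized_on": [np.nan] * 99 + ["Penstemon"],
    "status": [np.nan] * 100,
})

cleaned, dropped = drop_sparse_columns(toy, threshold=0.98)
print(dropped)
print(cleaned.columns.tolist())
```

Using an explicit threshold rather than hard-coded column names keeps the rule visible and makes it easy to revisit if the dataset is updated.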

Next, we turned our attention to the parasitic, nesting, and nonnative_bee columns, all of which also contained missing values. Given the relatively straightforward nature of these columns, we opted for a simple method of imputation: filling the missing values with the mode (the most common value) of each column. This approach is simple and leaves the dominant category of each column unchanged, although it does shift the category proportions somewhat toward the most common value.

However, a more complex problem was presented by the plant_species column, which had a substantial 66% of its values missing. The high cardinality and categorical nature of this column meant a more sophisticated method of imputation was needed. For this, we turned to a machine learning-based approach using the Random Forest algorithm.

Random Forest was chosen for this task due to its ability to handle high-dimensional and non-linear data, and its robustness against overfitting, which is a common concern when using machine learning for imputation. The algorithm works by creating a 'forest' of decision trees trained on random subsets of the data, and outputting the mode (for classification) of the classes output by individual trees.

The Random Forest model was trained on the non-missing data and then used to predict the missing plant_species values. To ensure reproducible results, we set a random state of 42 when initializing our Random Forest model. This means that although the model inherently contains a random component, this randomness is consistent each time the model is run, allowing us to replicate our results.
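A minimal sketch of this kind of model-based imputation is shown below, using scikit-learn's RandomForestClassifier. The helper name impute_with_random_forest, the choice of predictor columns (site and season), the one-hot encoding, and the toy data are assumptions made for illustration; the report does not state which features the actual model used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def impute_with_random_forest(df, target, features, random_state=42):
    """Fill missing `target` values with predictions from a Random Forest."""
    # One-hot encode categorical predictors (assumed to have no missing values).
    X = pd.get_dummies(df[features])
    known = df[target].notna()
    out = df.copy()
    if known.all():
        return out  # nothing to impute

    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X[known], df.loc[known, target])  # train on rows with a known plant
    out.loc[~known, target] = model.predict(X[~known])  # predict the missing ones
    return out


# Toy data (made up): the plant is known for 12 rows and missing for 2.
toy = pd.DataFrame({
    "site": ["A", "B"] * 7,
    "season": ["early.season"] * 6 + ["late.season"] * 8,
    "plant_species": ["Rudbeckia hirta", "Daucus carota"] * 6 + [np.nan, np.nan],
})

imputed = impute_with_random_forest(toy, "plant_species", ["site", "season"])
print(imputed.tail(2))
```

Fixing random_state makes the bootstrap samples and feature subsets identical on every run, which is what makes the imputed values reproducible.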

Finally, to assess the performance of our imputation model, we artificially created a situation with "missing" values that weren't truly missing by hiding some known plant_species values, imputing them as if they were missing, and then comparing the imputed values to the actual ones.
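The hide-and-compare check described above might look roughly like this. It is a sketch under stated assumptions: the helper name evaluate_imputation, the 25% hidden share, the single site predictor, and the toy data are all hypothetical and not taken from the original analysis.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_imputation(df, target, features, test_size=0.25, random_state=42):
    """Hide a share of the known `target` values and score how well they are recovered."""
    known = df[df[target].notna()]  # only rows whose true value is known
    X = pd.get_dummies(known[features])
    y = known[target]
    # The "hidden" split plays the role of artificially missing values.
    X_train, X_hidden, y_train, y_hidden = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_hidden)
    return {
        "accuracy": accuracy_score(y_hidden, y_pred),
        "weighted_f1": f1_score(y_hidden, y_pred, average="weighted"),
        "n_hidden": len(y_hidden),
    }


# Toy data (made up) in which every plant_species value is known.
toy = pd.DataFrame({
    "site": ["A", "B", "C", "D"] * 10,
    "plant_species": ["Rudbeckia hirta", "Daucus carota",
                      "Cichorium intybus", "Rudbeckia hirta"] * 10,
})

scores = evaluate_imputation(toy, "plant_species", ["site"])
print(scores)
```

Because the hidden values are never seen during training, the resulting scores estimate how trustworthy the imputed values for the truly missing rows are likely to be.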

The model's performance was evaluated using accuracy and the F1 score. Accuracy is the share of hidden values that were imputed correctly, while the F1 score is the harmonic mean of precision (correctly predicted positive observations out of the total predicted positives) and recall (correctly predicted positive observations out of the total actual positives), and so balances the trade-off between the two.

For our model, the reported F1 score was 0.72, with a support of 108 (support is the number of evaluated samples, not itself a measure of quality). The macro averages of precision, recall, and F1 score were 0.53, 0.52, and 0.50 respectively. Because macro averaging gives every plant species equal weight, these lower figures suggest that the model is noticeably less reliable on rarely sampled species. The weighted averages of precision, recall, and the F1 score were reported as 0.72, 0.52, and 0.69 respectively. These weight each species by its frequency, so they mainly reflect performance on the most common plants in an imbalanced class distribution.
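To make the difference between macro and weighted averages concrete, here is a small worked example on made-up labels (one common class and one rare class). The labels and variable names are purely illustrative and unrelated to the real evaluation.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 8 samples of a common class, 2 of a rare class.
y_true = ["common"] * 8 + ["rare"] * 2
# The model gets every common sample right but misses one rare sample.
y_pred = ["common"] * 8 + ["common", "rare"]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # each class counts equally
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # classes weighted by support
weighted_recall = recall_score(y_true, y_pred, average="weighted")

print(f"accuracy        = {accuracy:.4f}")         # 0.9000
print(f"macro F1        = {macro_f1:.4f}")         # 0.8039
print(f"weighted F1     = {weighted_f1:.4f}")      # 0.8863
print(f"weighted recall = {weighted_recall:.4f}")  # 0.9000
```

The per-class F1 scores here are 16/17 for the common class and 2/3 for the rare class, so the macro average is pulled down by the rare class while the weighted average stays close to the common-class score. Note also that, for single-label classification, support-weighted recall is by construction equal to accuracy.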

Through this process of careful data cleaning and sophisticated missing value imputation, we prepared a high-quality, robust dataset, setting a strong foundation for the subsequent steps of our analysis.

Summary of Key Findings and Plant Descriptions

Our research findings provide valuable insight into the preferences of different types of bees for specific plant species, allowing us to create more supportive environments for our native pollinators while controlling potentially harmful non-native bee populations.

Among native bees, three plant species have proven to be particularly attractive: Leucanthemum vulgare (Oxeye Daisy), Rudbeckia hirta (Black-Eyed Susan), and Rudbeckia triloba (Brown-Eyed Susan).

1. Leucanthemum vulgare or the Oxeye Daisy is a charming flowering plant native to Europe and temperate regions of Asia. Its white-petaled blooms with bright yellow centers offer both a visual appeal and a rich source of pollen and nectar, making them a significant attraction for native bees.

2. Rudbeckia hirta, also known as Black-Eyed Susan, is a North American native that boasts vibrant yellow or gold flowers, featuring dark-brown centers. This plant is not only visually appealing but also serves as an abundant source of pollen and nectar, making it a preferred choice for native bees.

3. Rudbeckia triloba, often referred to as Brown-Eyed Susan, is a close relative of the Black-Eyed Susan. Like that species, it offers abundant nectar and pollen, and it has the added benefit of blooming later into the season, providing a consistent food source for bees.

For non-native bees, our study highlights Rudbeckia hirta, Leucanthemum vulgare, and a new entry, Cichorium intybus or Chicory. Chicory is known for its beautiful blue flowers and its use in agriculture due to its deep roots and high biomass yield. While beneficial in certain agricultural settings, its attractiveness to non-native bees means it may require controlled planting in areas aiming to control non-native bee populations.

Most importantly, our findings provide guidance on plant species to avoid because they attract non-native parasitic bees. These species are Daucus carota (Wild Carrot) and Trifolium incarnatum (Crimson Clover).

The Wild Carrot, despite its delicate, umbrella-like clusters of white flowers, attracts non-native parasitic bees, making it a less suitable choice for our mission to support native bees.

Similarly, the Crimson Clover, with its striking red flowers, while useful as a cover crop, has been observed to attract certain invasive bee species, making it another plant to avoid when planning pollinator-friendly spaces.
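Rankings of this kind can be produced by counting bee records per plant within each bee group. The sketch below assumes that a "non-native parasitic" bee is one with both nonnative_bee == 1 and parasitic == 1, which is an interpretation introduced here; the helper name top_plants and the toy records are hypothetical and are not the study's data.

```python
import pandas as pd


def top_plants(df, mask, n=3):
    """Return the n plant species with the most bee records among rows selected by `mask`."""
    return df.loc[mask, "plant_species"].value_counts().head(n)


# Toy records (made up) using the column names from the dataset description.
toy = pd.DataFrame({
    "plant_species": (
        ["Leucanthemum vulgare"] * 3 + ["Rudbeckia hirta"] * 2 + ["Daucus carota"]
        + ["Cichorium intybus"] * 2
        + ["Daucus carota"] * 2 + ["Trifolium incarnatum"]
    ),
    "nonnative_bee": [0] * 6 + [1] * 2 + [1] * 3,
    "parasitic":     [0] * 6 + [0] * 2 + [1] * 3,
})

native_top = top_plants(toy, toy["nonnative_bee"] == 0)
parasitic_nonnative_top = top_plants(
    toy, (toy["nonnative_bee"] == 1) & (toy["parasitic"] == 1)
)

print(native_top)
print(parasitic_nonnative_top)
```

Raw counts favor plants that were sampled more often, so in practice it may be worth normalizing by the number of samples taken per plant before drawing conclusions.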

Conclusion: Optimizing Our Approach to Protecting Native Bees

The value of pollinator bees to our ecosystem and agriculture cannot be overstated. As we navigate the challenges of environmental stewardship, we must leverage all available tools and knowledge to ensure these essential insects continue to thrive. In this context, the findings of our study serve as both a critical guide and a call to action.

We've uncovered that native bees demonstrate a clear preference for three specific plant species: Leucanthemum vulgare (Oxeye Daisy), Rudbeckia hirta (Black-Eyed Susan), and Rudbeckia triloba (Brown-Eyed Susan). With this information in hand, we can optimize our pollinator-friendly spaces, selecting plants that provide the maximum benefit for our native bees.

However, our analysis has also highlighted plants to avoid, specifically Daucus carota (Wild Carrot) and Trifolium incarnatum (Crimson Clover). While these species may have their own aesthetic or agricultural merits, they have the unintended consequence of attracting non-native parasitic bees, invasive species that threaten the health of our native bee populations.

Our commitment to protecting native bees also extends to the methods we use to uncover these findings. Rigorous data cleaning, careful handling of missing values, and the use of machine learning models were all part of this data-driven approach. In particular, using the Random Forest algorithm to impute missing 'plant_species' values is a more sophisticated method than simple mode imputation, and its accuracy can be checked against known values that were deliberately hidden.

The performance of this model, indicated by a strong F1 score and robust accuracy, provides a measure of confidence in the quality of the data used for analysis. These figures highlight the strength of machine learning in handling complex, real-world data challenges and underscore the importance of robust methodologies in ensuring reliable, reproducible results.

This project represents a meaningful stride forward in our quest to protect native bee populations. Yet, it also emphasizes that the work is far from over. The plant preferences of bees are shaped by an array of complex and interwoven factors, necessitating ongoing research and adaptation of our strategies.

The findings of this study should guide our immediate actions, aiding in the development of pollinator-friendly spaces that support our native bees and deter invasive species. However, they also underscore the need for continued vigilance, monitoring, and research. As our climate changes and our landscapes evolve, so too will the needs of our native bees.

This report underscores a broader principle: in the face of environmental challenges, data-driven and evidence-based strategies are indispensable. As we continue our efforts to protect native bees and enhance biodiversity, we will continue to harness the power of data, deepening our understanding, improving our strategies, and securing a brighter future for our environment.

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import seaborn as sns

# Load data
data = pd.read_csv('data/plants_and_bees.csv')

# Check the first 5 rows of the data
data.head()

# Check the shape of the data
data.shape

# Check the data types of the columns
data.dtypes

data.info()

# Check for missing values
data.isnull().sum()

# Check summary statistics of numerical columns
data.describe()

# Replace the string "None" with NaN so that it is counted as missing
data = data.replace(to_replace='None', value=np.nan)

# Initialize nullity DataFrame
data_nullity = data.isnull()
# data_nullity.head()
# data_nullity.sum()

# Percentage of missingness
data_nullity.mean() * 100

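# Visualize where values are missing across rows and columns
msno.matrix(data)

# Cluster columns by how similarly their missing values co-occur
msno.dendrogram(data)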

There are still many string None values in the plant_species data, which need to be replaced with NaNs before they can be imputed with a machine learning algorithm. For simplicity, we will use mode imputation to fill missing data in the parasitic, nesting, and nonnative_bee columns.

# Use mode imputation to replace missing values in the parasitic, nesting, and nonnative_bee columns.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
for col in ['parasitic', 'nesting', 'nonnative_bee']:
    data[col] = data[col].fillna(data[col].mode()[0])
