Competition - Bee friendly plants

🎯 Problem and objectives:

The challenge is to determine which plants are preferred by native vs non-native bee species, to optimize spaces dedicated to them. The ultimate goal is to recommend the top three plant species that most benefit native bees.

In sort, the following points are to be answered:

Which plants are preferred by native vs non-native bee species?
A visualization of the distribution of bee and plant species across one of the samples.
Select the top three plant species you would recommend to the agency to support native bees.

💾 Data exploration

We will read the CSV file plants_and_bees.csv that contains relevant information and explore the data using exploratory analysis methods. Each row represents a sample that was taken from a patch of land where the plant species were being studied.

Column	Description
`sample_id`	The ID number of the sample taken.
`species_num`	The number of different bee species in the sample.
`date`	Date the sample was taken.
`season`	Season during sample collection ("early.season" or "late.season").
`site`	Name of collection site.
`native_or_non`	Whether the sample was from a native or non-native plant.
`sampling`	The sampling method.
`plant_species`	The name of the plant species the sample was taken from. None indicates the sample was taken from the air.
`time`	The time the sample was taken.
`bee_species`	The bee species in the sample.
`sex`	The gender of the bee species.
`specialized_on`	The plant genus the bee species preferred.
`parasitic`	Whether or not the bee is parasitic (0:no, 1:yes).
`nesting`	The bees nesting method.
`status`	The status of the bee species.
`nonnative_bee`	Whether the bee species is native or not (0:no, 1:yes).

Source (data has been modified)

import pandas as pd

data_frame_plants_and_bees_original = pd.read_csv("data/plants_and_bees.csv", parse_dates=['date'])

display(data_frame_plants_and_bees_original.head())

🔍 Data validation

In examining the dataset, we found that the status and specialized_on columns have a significant number of missing values. The specialized_on column only contains two distinct values, Penstemon and Ipomoea, each corresponding exclusively to Osmia distincta and Melitoma taurea bee species, respectively. Similarly, the status column only holds three unique values, which are vulnerable (IUCN), uncommon, and common. Given these findings, we have decided to exclude both status and specialized_on columns from our analysis due to their limited variety and substantial missing data.

Based on the information given in the parasitic column, the sub-categorization within the nesting column seems redundant. To streamline our data analysis, we will simplify the nesting categories as follows: Wood/cavities and Wood/shell will be recategorized as Wood. We will carry out similar consolidation for the rest of the categories as well.

import seaborn as sns

df = data_frame_plants_and_bees_original.drop(columns=['specialized_on', 'status'])

nesting_replacements = {'wood/cavities': 'wood',
                        'wood/shell': 'wood',
                        'parasite [ground]': 'ground'}

df['nesting'] = df['nesting'].replace(nesting_replacements)

import matplotlib.pyplot as plt

most_common_species = df.groupby('native_or_non')['plant_species'].apply(lambda x: x.value_counts().nlargest(2).idxmin()).values

native_data = df.loc[df['native_or_non'] == 'native']
native_counts = native_data['plant_species'].value_counts()
native_counts = native_counts.drop(native_counts.idxmax())

non_native_data = df.loc[df['native_or_non'] == 'non-native']
non_native_counts = non_native_data['plant_species'].value_counts()
non_native_counts = non_native_counts.drop(non_native_counts.idxmax())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.bar(native_counts.index, native_counts.values)
ax1.set_title('Native Plants')
ax1.set_xlabel('Plant Species')
ax1.set_ylabel('Visit Counts')
ax1.tick_params(axis='x', rotation=90)

ax2.bar(non_native_counts.index, non_native_counts.values)
ax2.set_title('Non-Native Plants')
ax2.set_xlabel('Plant Species')
ax2.set_ylabel('Visit Counts')
ax2.tick_params(axis='x', rotation=90)

plt.tight_layout()
plt.show()

print(f'The most visited plant by native bees is: {most_common_species[0]}')
print(f'The most visited plant by non-native bees is: {most_common_species[1]}')

visits_count = df.groupby(['native_or_non', 'bee_species', 'plant_species']).size().reset_index(name='count')

visits_count = visits_count[visits_count['plant_species'] != 'None']

heatmap_data = visits_count.pivot_table(index='bee_species', columns=['native_or_non', 'plant_species'], values='count', fill_value=0)

plt.figure(figsize=(15, 7))
sns.heatmap(heatmap_data, cmap="BuGn", linewidths=.5)
plt.title("Visits count by bee species")
plt.xlabel("Plant Species")
plt.ylabel("Bee Species")
plt.show()

most_common_species

for tag, bee in heatmap_data.columns:
    print(bee)

df[df.plant_species != 'None'].sample_id.values

Competition - Bee friendly plants

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}🎯 Problem and objectives:

💾 Data exploration

🔍 Data validation

🎯 Problem and objectives: