
Which plants are better for bees: native or non-native?

📖 Background

You work for the local government environment agency and have taken on a project about creating pollinator bee-friendly spaces. You can use both native and non-native plants to create these spaces and therefore need to ensure that you use the correct plants to optimize the environment for these bees.

The team has collected data on native and non-native plants and their effects on pollinator bees. Your task will be to analyze this data and provide recommendations on which plants create an optimized environment for pollinator bees.

💾 The Data

You have assembled information on the plants and bees research in a file called plants_and_bees.csv. Each row represents a sample that was taken from a patch of land where the plant species were being studied.

| Column | Description |
| --- | --- |
| sample_id | The ID number of the sample taken. |
| bees_num | The total number of bee individuals in the sample. |
| date | Date the sample was taken. |
| season | Season during sample collection ("early.season" or "late.season"). |
| site | Name of collection site. |
| native_or_non | Whether the sample was from a native or non-native plot. |
| sampling | The sampling method. |
| plant_species | The name of the plant species the sample was taken from. None indicates the sample was taken from the air. |
| time | The time the sample was taken. |
| bee_species | The bee species in the sample. |
| sex | The sex of the bee. |
| specialized_on | The plant genus the bee species preferred. |
| parasitic | Whether or not the bee is parasitic (0: no, 1: yes). |
| nesting | The bee's nesting method. |
| status | The status of the bee species. |
| nonnative_bee | Whether the bee species is non-native (0: no, 1: yes). |

Source (data has been modified)
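
Since bees_num is described as a per-sample total while the file may hold several rows for the same sample_id, summing the column directly could count a sample more than once. Below is a minimal sketch of a per-sample summary. It uses a small made-up frame (the values are invented for illustration, not taken from plants_and_bees.csv) and assumes that bees_num repeats on every row of a sample.

```python
import pandas as pd

# Toy rows standing in for plants_and_bees.csv (values are invented for illustration)
toy = pd.DataFrame({
    "sample_id": [1, 1, 2, 3, 3, 3],
    "bees_num": [2, 2, 1, 3, 3, 3],
    "native_or_non": ["native", "native", "non-native", "native", "native", "native"],
})

# Keep one row per sample so that each sample's total is counted only once
# (assumption: bees_num is identical on every row of the same sample_id)
per_sample = toy.drop_duplicates(subset="sample_id")

# Total bees and number of samples for each plot type
summary = per_sample.groupby("native_or_non")["bees_num"].agg(
    total_bees="sum", samples="count"
)
print(summary)
```

If the assumption does not hold for the real file, counting rows per sample (as the analysis further down does) would be the safer measure.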

Conclusions

  • Which plants are preferred by native vs non-native bee species? - Flowering length (during the day) and flowering time (during the season) seem to be key factors in attracting bees to plants, but the data supports this conclusion only indirectly. I would advise the team to collect more data and to design the data collection with these two factors in mind.

  • A visualization of the distribution of bee and plant species across one of the samples. - Please see below.

  • Select the top three plant species you would recommend to the agency to support native bees. - Based on the current dataset, I cannot recommend specific plant species for supporting native bees. However, I advise promoting the use of a diverse range of plant species in the field to enhance floral diversity. Additionally, conducting further research is essential to determine the optimal plant species for supporting the bee population.
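
Should more data become available, one way a top-three ranking might be derived is sketched here. The frame is a made-up stand-in (plant names A, B, C and the bee labels are hypothetical), and the ranking rule, counting native bee individuals and then distinct native bee species per plant, is an assumption rather than the method used in this analysis. Raw counts like these are descriptive only and do not account for sampling effort.

```python
import pandas as pd

# Toy rows (invented); column names follow the data dictionary above
toy = pd.DataFrame({
    "plant_species": ["A", "A", "B", "None", "B", "C", "A"],
    "nonnative_bee": [0, 0, 0, 0, 1, 0, 1],
    "bee_species": ["x", "y", "x", "z", "w", "x", "w"],
})

# Keep native bees that were caught on a plant. "None" marks samples taken from
# the air; depending on the pandas version, read_csv may load that literal as a
# missing value, so both cases are filtered out here.
on_plant = toy["plant_species"].notna() & (toy["plant_species"] != "None")
native_on_plants = toy[(toy["nonnative_bee"] == 0) & on_plant]

# Rank plants by native bee individuals, then by distinct native bee species
ranking = (
    native_on_plants.groupby("plant_species")
    .agg(
        native_bees=("bee_species", "size"),
        native_bee_species=("bee_species", "nunique"),
    )
    .sort_values(["native_bees", "native_bee_species"], ascending=False)
)
print(ranking.head(3))
```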

import pandas as pd
data = pd.read_csv("data/plants_and_bees.csv")
data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Read the CSV file
data = pd.read_csv("data/plants_and_bees.csv")
print("HEAD")
print(data.head())
print("----")
print("SHAPE")
print(data.shape)
print("----")
print("INFO")
print(data.info())
print("----")
print("DESCRIBE")
print(data.describe())
print("----")
print("EMPTY")
print(data.isnull().sum())
print("----")


#Remove columns with no additional value
exclude_columns = ['specialized_on','status']
data = data.drop(exclude_columns, axis=1, errors='ignore')

# Add column to identify each row as a unique observation
data['individual_bee_count']=1

#Convert the date column to date format
data['date']=pd.to_datetime(data['date'])

parasitic_contents = data['parasitic'].unique().tolist()
nonnative_bee_contents = data['nonnative_bee'].unique().tolist()
print("Contents of 'parasitic':")
print(parasitic_contents)
print("Contents of 'nonnative_bee':")
print(nonnative_bee_contents)

# Convert 'time' from int64 to datetime, reading the integers as milliseconds since the Unix epoch
data['time'] = pd.to_datetime(data['time'], unit='ms')

# Convert 'parasitic' from float64 to object type
data['parasitic'] = data['parasitic'].astype(str)
data['nonnative_bee'] = data['nonnative_bee'].astype(str)
# Get the list of categorical variables
categorical_vars = data.select_dtypes(include='object').columns

# Set up the figure and axes for subplots
num_plots = len(categorical_vars)
num_cols = 2  # Number of graphs to display in each row
num_rows = (num_plots + num_cols - 1) // num_cols  # Calculate the number of rows needed

# squeeze=False always returns a 2D array of axes, so it can be flattened unconditionally
fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 5 * num_rows), squeeze=False)
axes = axes.flatten()

# Iterate through each categorical variable and create a bar chart
for ax, var in zip(axes, categorical_vars):
    sns.countplot(x=var, data=data, ax=ax)
    ax.set_title(var)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', labelrotation=90)  # Rotate the labels without resetting the ticks

# Remove empty subplots
for ax in axes[num_plots:]:
    fig.delaxes(ax)

# Adjust the spacing between subplots
plt.tight_layout()


print(data.info())
print("----")
#Based on the graphical information, let's drop the following columns: parasitic, nesting, nonnative_bee

columns_to_remove = ["parasitic", "nesting", "nonnative_bee"]
data = data.drop(columns=columns_to_remove)

data.head()
# Group the data by "sampling" and "plant_species" and count the occurrences
grouped_data = data.groupby(["sampling", "plant_species"]).size().reset_index(name="count")

# Get the unique "sampling" categories
unique_sampling = grouped_data["sampling"].unique()

# Create one subplot per "sampling" category
num_categories = len(unique_sampling)
fig, axs = plt.subplots(1, num_categories, figsize=(5 * num_categories, 5), squeeze=False)
axs = axs.flatten()

# Iterate over the unique "sampling" categories
for i, sampling_category in enumerate(unique_sampling):
    # Filter the data for the current "sampling" category
    subset = grouped_data[grouped_data["sampling"] == sampling_category]
    
    # Extract the plant species and their corresponding counts
    plant_species = subset["plant_species"]
    count = subset["count"]
    
    # Create a bar plot for the current "sampling" category
    axs[i].bar(plant_species, count)
    axs[i].set_title(f"Sampling: {sampling_category}")
    axs[i].set_xlabel("Plant Species")
    axs[i].set_ylabel("Count")
    axs[i].tick_params(axis='x', labelrotation=90)  # Rotate the labels without resetting the ticks

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
import pandas as pd

# Read the CSV file
data = pd.read_csv("data/plants_and_bees.csv")

# Iterate over each column and print unique values and their counts
for column in data.columns:
    unique_values = data[column].value_counts()
    print(f"Column: {column}")
    print(unique_values)
    print()
import pandas as pd

# Read the CSV file
data = pd.read_csv("data/plants_and_bees.csv")

# Count the rows per date for native and non-native plots
pivot_table = pd.pivot_table(data, index='date', columns='native_or_non', aggfunc='size', fill_value=0)

# Work on a copy so that the original pivot table keeps only the per-date counts
pivot_df = pivot_table.copy()

# Append a 'Total' row holding the column sums
pivot_df.loc['Total'] = pivot_df.sum()

# Display the new DataFrame
print(pivot_df)
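import scipy.stats as stats

# Exclude the 'Total' summary row so that only the per-date counts enter the test
daily_counts = pivot_df.drop(index='Total', errors='ignore')

# Extract the data from the 'native' and 'non-native' columns
data_native = daily_counts['native']
data_non_native = daily_counts['non-native']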

# Perform a paired t-test
t_statistic, p_value = stats.ttest_rel(data_native, data_non_native)

# Print the results
print("Paired t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data/plants_and_bees.csv")

# Filter the data on "sampling" column to keep only "hand netting" records
hand_netting_data = data[data['sampling'] == 'hand netting']

# Group the filtered data by "date" and "plant_species"
grouped_data = hand_netting_data.groupby(['date', 'plant_species']).size().reset_index(name='count')

# Rename the new dataframe
new_dataframe = grouped_data.copy()
new_dataframe.head()

# Convert 'plant_species' column to categorical codes
new_dataframe['color_code'] = new_dataframe['plant_species'].astype('category').cat.codes

# Create a larger figure
plt.figure(figsize=(12, 6))

# Create a scatter plot with modified color map
scatter = plt.scatter(new_dataframe['date'], new_dataframe['count'], c=new_dataframe['color_code'], cmap='Set1')

# Iterate over each unique category
for category in new_dataframe['plant_species'].unique():
    # Filter the data points belonging to the current category
    category_data = new_dataframe[new_dataframe['plant_species'] == category]
    
    # Connect the data points with a line
    plt.plot(category_data['date'], category_data['count'], marker='o', linestyle='-', alpha=0.5)

# Customize the plot
plt.title('Scatter Plot of Bees Count by Date')
plt.xlabel('Date')
plt.ylabel('Bees Count')
plt.colorbar(scatter, label='Plant Species')

# Show the plot
plt.show()

import matplotlib.pyplot as plt

# Group the filtered data by "date" and "plant_species"
grouped_data = hand_netting_data.groupby(['date', 'plant_species']).size().reset_index(name='count')

# Calculate the total count for each date
date_totals = grouped_data.groupby('date')['count'].sum()

# Calculate the share of each plant species by date
grouped_data['share'] = grouped_data['count'] / grouped_data['date'].map(date_totals)

# Create a pivot table with "date" as rows and "plant_species" as columns, showing the share
pivot_table = grouped_data.pivot(index='date', columns='plant_species', values='share')

# Plot the share of each plant species by date
ax = pivot_table.plot(kind='bar', stacked=True, width=0.8)  # Adjust the width parameter to make the bars wider

# Customize the plot
plt.xlabel('Date')
plt.ylabel('Share')
plt.title('Share of Plant Species by Date')

# Hatch patterns, one per plant species (reused in order if there are more species than patterns)
patterns = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*']

# Each container holds the bars of one plant species; give all of them the same hatch
for i, container in enumerate(ax.containers):
    for bar in container:
        bar.set_hatch(patterns[i % len(patterns)])

# Add the legend outside the plot area (built after hatching so that it shows the patterns)
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

# Display the plot
plt.show()
import numpy as np
from scipy.stats import chi2_contingency

# Create the contingency table
observed = np.array([[164, 266],
                     [442, 378]])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)

# Print the results
print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected counts:")
print(expected)
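
The counts above are typed in by hand. As a sketch of how such a table could instead be derived from the rows, the example below builds a contingency table with pd.crosstab and passes it to the same test. The toy frame is invented, and pairing native_or_non with season is only an assumption for illustration, since the two variables behind the hand-entered counts are not stated. The toy counts are also far too small for a meaningful chi-square result.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy rows (invented); one row per observed bee
toy = pd.DataFrame({
    "native_or_non": ["native"] * 5 + ["non-native"] * 5,
    "season": [
        "early.season", "early.season", "late.season", "late.season", "late.season",
        "early.season", "early.season", "early.season", "early.season", "late.season",
    ],
})

# Cross-tabulate the two categorical columns into observed counts
observed_df = pd.crosstab(toy["native_or_non"], toy["season"])
print(observed_df)

# Run the chi-square test of independence on the derived table
chi2_toy, p_toy, dof_toy, expected_toy = chi2_contingency(observed_df)
print(f"Chi-square statistic: {chi2_toy}")
print(f"p-value: {p_toy}")
print(f"Degrees of freedom: {dof_toy}")
```

Deriving the table this way keeps the test in step with the data if rows are filtered or new samples are added.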

import matplotlib.pyplot as plt

import pandas as pd

# Group the filtered data by "sample_id" and "plant_species"
grouped_data = hand_netting_data.groupby(['sample_id', 'plant_species']).size().reset_index(name='count')

# Calculate the total count for each sample_id
sample_id_totals = grouped_data.groupby('sample_id')['count'].sum()

# Calculate the share of each plant species by sample_id
grouped_data['share'] = grouped_data['count'] / grouped_data['sample_id'].map(sample_id_totals)

# Create a pivot table with "sample_id" as rows and "plant_species" as columns, showing the share
pivot_table = grouped_data.pivot(index='sample_id', columns='plant_species', values='share')

# Plot the share of each plant species by sample_id as a stacked area chart
fig, ax = plt.subplots()
pivot_table.plot(kind='area', stacked=True, alpha=0.7, ax=ax)

# Customize the plot
ax.set_xlabel('Sample ID')
ax.set_ylabel('Share')
ax.set_title('Share of Plant Species by Sample ID')

# Create a custom legend with unique labels
handles, labels = ax.get_legend_handles_labels()
custom_legend = [(handle, label) for handle, label in zip(handles, labels)]

# Add the custom legend outside the plot area
ax.legend(*zip(*custom_legend), loc='center left', bbox_to_anchor=(1.0, 0.5))

# Display the plot
plt.show()

import pandas as pd

# Group the data by "plant_species" and count the unique "sample_id" occurrences
species_counts = hand_netting_data.groupby('plant_species')['sample_id'].nunique()

# Get the plant species that appear in more than six distinct samples
multi_sample_species = species_counts[species_counts > 6].index.tolist()

# Filter the original dataframe based on the multi-sample species
simplified = hand_netting_data[hand_netting_data['plant_species'].isin(multi_sample_species)].copy()

# Print the simplified dataframe
print(simplified)

import matplotlib.pyplot as plt
import pandas as pd

# Group the filtered data by "sample_id" and "plant_species"
grouped_data = simplified.groupby(['sample_id', 'plant_species']).size().reset_index(name='count')

# Calculate the total count for each sample_id
sample_id_totals = grouped_data.groupby('sample_id')['count'].sum()

# Calculate the share of each plant species by sample_id
grouped_data['share'] = grouped_data['count'] / grouped_data['sample_id'].map(sample_id_totals)

# Create a pivot table with "sample_id" as rows and "plant_species" as columns, showing the share
pivot_table = grouped_data.pivot(index='sample_id', columns='plant_species', values='share')

# Plot the share of each plant species by sample_id as a stacked area chart
fig, ax = plt.subplots()
pivot_table.plot(kind='area', stacked=True, alpha=0.7, ax=ax)

# Customize the plot
ax.set_xlabel('Sample ID')
ax.set_ylabel('Share')
ax.set_title('Share of Plant Species by Sample ID')

# Create a custom legend with unique labels
handles, labels = ax.get_legend_handles_labels()
custom_legend = [(handle, label) for handle, label in zip(handles, labels)]

# Add the custom legend outside the plot area
ax.legend(*zip(*custom_legend), loc='center left', bbox_to_anchor=(1.0, 0.5))

# Display the plot
plt.show()