Bee friendly plants - Analysis by FS

    Which plants are better for bees: native or non-native?

    📖 Background

    You work for the local government environment agency and have taken on a project about creating pollinator bee-friendly spaces. You can use both native and non-native plants to create these spaces and therefore need to ensure that you use the correct plants to optimize the environment for these bees.

    The team has collected data on native and non-native plants and their effects on pollinator bees. Your task will be to analyze this data and provide recommendations on which plants create an optimized environment for pollinator bees.

    💾 The Data

    You have assembled information on the plants and bees research in a file called plants_and_bees.csv. Each row represents a sample that was taken from a patch of land where the plant species were being studied.

    Column            Description
    sample_id         The ID number of the sample taken.
    bees_num          The total number of bee individuals in the sample.
    date              Date the sample was taken.
    season            Season during sample collection ("early.season" or "late.season").
    site              Name of collection site.
    native_or_non     Whether the sample was from a native or non-native plot.
    sampling          The sampling method.
    plant_species     The name of the plant species the sample was taken from. None indicates the sample was taken from the air.
    time              The time the sample was taken.
    bee_species       The bee species in the sample.
    sex               The sex of the bee.
    specialized_on    The plant genus the bee species preferred.
    parasitic         Whether or not the bee is parasitic (0: no, 1: yes).
    nesting           The bee's nesting method.
    status            The status of the bee species.
    nonnative_bee     Whether the bee species is non-native (0: no, 1: yes).

    Source (data has been modified)
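
The column descriptions above suggest a couple of parsing choices when the file is loaded. The sketch below works on a tiny made-up extract rather than the real plants_and_bees.csv: the rows, the plant name "Plant A" and the month/day/year date format are all assumptions for illustration. Depending on the pandas version, the literal text None may be read as a missing value by default, which would hide the "taken from the air" samples, so the sketch keeps it as text.

```python
import io

import pandas as pd

# Hypothetical three-row extract using a subset of the documented columns;
# the values are invented for illustration and are not real survey records.
csv_text = """sample_id,bees_num,date,season,native_or_non,plant_species,nonnative_bee
1,2,04/18/2017,early.season,native,None,0.0
1,2,04/18/2017,early.season,native,None,0.0
2,1,07/20/2017,late.season,non-native,Plant A,1.0
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # keep the literal text "None" (sampled from the air)
    na_values=[""],         # still treat truly empty cells as missing
)

# Assumption: dates are written month/day/year; adjust the format if not.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

print(sample.dtypes)
print(sample["plant_species"].value_counts())
```

With the real file, the same two `read_csv` arguments could be passed alongside the path used in the cells below.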

    Conclusions

    • Which plants are preferred by native vs non-native bee species? - Flowering duration (within the day) and flowering time (within the season) seem to be key factors in attracting bees to plants, but I can only reach this conclusion indirectly. I would advise the team to collect more data and to design the data collection around these two factors.

    • A visualization of the distribution of bee and plant species across one of the samples. - Please see below.

    • Select the top three plant species you would recommend to the agency to support native bees. - Based on the current dataset, I cannot recommend specific plant species for supporting native bees. However, I advise promoting the use of a diverse range of plant species in the field to enhance floral diversity. Additionally, conducting further research is essential to determine the optimal plant species for supporting the bee population.
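
If further data collection does support a ranking, one simple way to shortlist plants for native bees is to count native bee individuals and distinct native bee species per plant. The sketch below is only an illustration on invented rows (plant names A to D and bee names w to z are hypothetical); it assumes one row per bee and that `nonnative_bee == 0` marks a native bee.

```python
import pandas as pd

# Hypothetical records, one row per bee caught on a plant (not real survey data).
records = pd.DataFrame({
    "plant_species": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D"],
    "bee_species":   ["x", "y", "x", "x", "z", "z", "x", "y", "z", "w", "x"],
    "nonnative_bee": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Keep native bees only (assumption: 0 means native).
native = records[records["nonnative_bee"] == 0]

# Per plant: number of native bees and number of distinct native bee species.
ranking = (
    native.groupby("plant_species")
    .agg(
        native_bees=("bee_species", "size"),
        native_bee_species=("bee_species", "nunique"),
    )
    .sort_values(["native_bees", "native_bee_species"], ascending=False)
)

top_three = ranking.head(3)
print(top_three)
```

Counting distinct species next to raw counts guards against a plant that attracts many individuals of a single bee species ranking above one that supports a broader community.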

    import pandas as pd
    data = pd.read_csv("data/plants_and_bees.csv")
    data
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    
    # Read the CSV file
    data = pd.read_csv("data/plants_and_bees.csv")
    print("HEAD")
    print(data.head())
    print("----")
    print("SHAPE")
    print(data.shape)
    print("----")
    print("INFO")
    print(data.info())
    print("----")
    print("DESCRIBE")
    print(data.describe())
    print("----")
    print("EMPTY")
    print(data.isnull().sum())
    print("----")
    
    
    
    #Remove columns with no additional value
    exclude_columns = ['specialized_on','status']
    data = data.drop(exclude_columns, axis=1, errors='ignore')
    
    # Add column to identify each row as a unique observation
    data['individual_bee_count']=1
    
    #Convert the date column to date format
    data['date']=pd.to_datetime(data['date'])
    
    parasitic_contents = data['parasitic'].unique().tolist()
    nonnative_bee_contents = data['nonnative_bee'].unique().tolist()
    print("Contents of 'parasitic':")
    print(parasitic_contents)
    print("Contents of 'nonnative_bee':")
    print(nonnative_bee_contents)
    
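    # Convert 'time' from int64 to a datetime. Assumption: the integers encode
    # 24-hour clock times as HHMM (e.g. 935 for 09:35), so only the time of day is meaningful
    data['time'] = pd.to_datetime(data['time'].astype(str).str.zfill(4), format='%H%M')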
    
    # Convert 'parasitic' from float64 to object type
    data['parasitic'] = data['parasitic'].astype(str)
    data['nonnative_bee'] = data['nonnative_bee'].astype(str)
    # Get the list of categorical variables
    categorical_vars = data.select_dtypes(include='object').columns
    
    # Set up the figure and axes for subplots
    num_plots = len(categorical_vars)
    num_cols = 2  # Number of graphs to display in each row
    num_rows = (num_plots + num_cols - 1) // num_cols  # Calculate the number of rows needed
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 5 * num_rows))
    
    # Flatten the axes array so it can always be indexed with a single integer
    axes = np.atleast_1d(axes).flatten()
    
    # Iterate through each categorical variable and create a bar chart
    for i, var in enumerate(categorical_vars):
        ax = axes[i]
        sns.countplot(x=var, data=data, ax=ax)
        ax.set_title(var)
        ax.set_xlabel('')
        ax.set_ylabel('Count')
        ax.tick_params(axis='x', rotation=90)  # Rotate long category names
    
    # Remove empty subplots
    if num_plots < len(axes):
        for j in range(num_plots, len(axes)):
            fig.delaxes(axes[j])
    
    # Adjust the spacing between subplots and display the figure
    plt.tight_layout()
    plt.show()
    
    print(data.info())
    print("----")
    
    #Based on the graphical information, let's drop the following columns: parasitic, nesting, nonnative_bee
    
    columns_to_remove = ["parasitic", "nesting", "nonnative_bee"]
    data = data.drop(columns=columns_to_remove)
    
    data.head()
    
    # Group the data by "sampling" and "plant_species" and count the occurrences
    grouped_data = data.groupby(["sampling", "plant_species"]).size().reset_index(name="count")
    
    # Get the unique "sampling" categories
    unique_sampling = grouped_data["sampling"].unique()
    
    # Create one subplot per "sampling" category
    fig, axs = plt.subplots(1, len(unique_sampling), figsize=(10, 5), squeeze=False)
    
    # Iterate over the unique "sampling" categories
    for i, sampling_category in enumerate(unique_sampling):
        # Filter the data for the current "sampling" category
        subset = grouped_data[grouped_data["sampling"] == sampling_category]
        
        # Extract the plant species and their corresponding counts
        plant_species = subset["plant_species"]
        count = subset["count"]
        
        # Create a bar plot for the current "sampling" category
        ax = axs[0, i]
        ax.bar(plant_species, count)
        ax.set_title(f"Sampling: {sampling_category}")
        ax.set_xlabel("Plant Species")
        ax.set_ylabel("Count")
        ax.tick_params(axis="x", rotation=90)  # Rotate long species names
    
    # Adjust the spacing between subplots
    plt.tight_layout()
    
    # Show the plot
    plt.show()
    import pandas as pd
    
    # Read the CSV file
    data = pd.read_csv("data/plants_and_bees.csv")
    
    # Iterate over each column and print unique values and their counts
    for column in data.columns:
        unique_values = data[column].value_counts()
        print(f"Column: {column}")
        print(unique_values)
        print()
    import pandas as pd
    
    # Read the CSV file
    data = pd.read_csv("data/plants_and_bees.csv")
    
    # Count the observations per date for native and non-native plots
    pivot_df = pd.pivot_table(data, index='date', columns='native_or_non', aggfunc='size', fill_value=0)
    
    # Append a 'Total' row holding the column sums
    pivot_df.loc['Total'] = pivot_df.sum()
    
    # Display the new DataFrame
    print(pivot_df)
    import scipy.stats as stats
    
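    # Extract the per-date counts, dropping the 'Total' summary row so it is not tested as if it were a date
    per_date = pivot_df.drop(index='Total', errors='ignore')
    data_native = per_date['native']
    data_non_native = per_date['non-native']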
    
    # Perform a paired t-test
    t_statistic, p_value = stats.ttest_rel(data_native, data_non_native)
    
    # Print the results
    print("Paired t-test results:")
    print("t-statistic:", t_statistic)
    print("p-value:", p_value)
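
The paired t-test above treats each date as a pair and assumes the per-date differences are roughly normal. With only a handful of sampling dates, a Wilcoxon signed-rank test can serve as a rough cross-check. The sketch below uses invented per-date counts (the numbers and the `day1`...`day8` labels are hypothetical, not the survey totals) and contains no summary row.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-date bee counts (not the real survey totals).
counts = pd.DataFrame(
    {
        "native": [12, 30, 25, 18, 40, 22, 35, 28],
        "non-native": [15, 38, 24, 28, 49, 29, 41, 39],
    },
    index=pd.Index([f"day{i}" for i in range(1, 9)], name="date"),
)

# Parametric paired test on the per-date pairs.
t_stat, t_p = stats.ttest_rel(counts["native"], counts["non-native"])

# Rank-based paired test as a cross-check that does not assume normality.
w_stat, w_p = stats.wilcoxon(counts["native"], counts["non-native"])

print(f"paired t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={w_p:.4f}")
```

If the two tests disagree noticeably on the real data, that is a hint that a few dates dominate the result.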
    import pandas as pd
    import matplotlib.pyplot as plt
    
    data = pd.read_csv("data/plants_and_bees.csv")
    
    # Filter the data on "sampling" column to keep only "hand netting" records
    hand_netting_data = data[data['sampling'] == 'hand netting']
    
    # Group the filtered data by "date" and "plant_species"
    grouped_data = hand_netting_data.groupby(['date', 'plant_species']).size().reset_index(name='count')
    
    # Work on a copy and parse the dates so the x-axis is chronological
    new_dataframe = grouped_data.copy()
    new_dataframe['date'] = pd.to_datetime(new_dataframe['date'])
    
    # Create a larger figure
    plt.figure(figsize=(12, 6))
    
    # Draw one line with markers per plant species so each species has a single, labelled colour
    for category, category_data in new_dataframe.groupby('plant_species'):
        plt.plot(category_data['date'], category_data['count'], marker='o', linestyle='-', alpha=0.7, label=category)
    
    # Customize the plot
    plt.title('Bees Count by Date and Plant Species')
    plt.xlabel('Date')
    plt.ylabel('Bees Count')
    plt.xticks(rotation=90)
    plt.legend(title='Plant Species', loc='center left', bbox_to_anchor=(1.0, 0.5))
    
    # Show the plot
    plt.show()
    
    import matplotlib.pyplot as plt
    
    # Group the filtered data by "date" and "plant_species"
    grouped_data = hand_netting_data.groupby(['date', 'plant_species']).size().reset_index(name='count')
    
    # Calculate the total count for each date
    date_totals = grouped_data.groupby('date')['count'].sum()
    
    # Calculate the share of each plant species by date
    grouped_data['share'] = grouped_data['count'] / grouped_data['date'].map(date_totals)
    
    # Create a pivot table with "date" as rows and "plant_species" as columns, showing the share
    pivot_table = grouped_data.pivot(index='date', columns='plant_species', values='share')
    
    # Plot the share of each plant species by date
    ax = pivot_table.plot(kind='bar', stacked=True, width=0.8)  # Adjust the width parameter to make the bars wider
    
    # Customize the plot
    plt.xlabel('Date')
    plt.ylabel('Share')
    plt.title('Share of Plant Species by Date')
    
    # Hatch patterns used to tell the plant species apart
    patterns = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*']
    
    # Apply one pattern per species: each container holds the bars of one species
    # (zip stops at the shorter sequence, so species beyond the tenth stay unhatched)
    for container, pattern in zip(ax.containers, patterns):
        for bar in container:
            bar.set_hatch(pattern)
    
    # Rebuild the legend after hatching and place it outside the plot area
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, labels, loc='center left', bbox_to_anchor=(1.0, 0.5))
    
    # Display the plot
    plt.show()
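
The share calculation in the cell above (count, per-date total, division, pivot) can also be written in one step with `pd.crosstab` and `normalize="index"`. This is a sketch on invented rows (dates `d1`, `d2` and plants A to C are hypothetical), assuming one row per observed bee.

```python
import pandas as pd

# Hypothetical observations, one row per bee (not real survey data).
toy = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d1", "d2", "d2"],
    "plant_species": ["A", "A", "A", "B", "B", "C"],
})

# Rows are dates, columns are plant species, values are within-date shares.
shares = pd.crosstab(toy["date"], toy["plant_species"], normalize="index")
print(shares)
```

Unlike the pivot of a grouped frame, species that are absent on a given date appear as 0 rather than as missing values.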
    import numpy as np
    from scipy.stats import chi2_contingency
    
    # Create the contingency table
    observed = np.array([[164, 266],
                         [442, 378]])
    
    # Perform the chi-square test
    chi2, p, dof, expected = chi2_contingency(observed)
    
    # Print the results
    print(f"Chi-square statistic: {chi2}")
    print(f"p-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected counts:")
    print(expected)
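
The contingency table above is typed in by hand, and the notebook does not say which variables its rows and columns represent. A table of this kind can be built directly from the data with `pd.crosstab`, which avoids copying errors. The sketch below uses invented rows, and the choice of plot type by season is purely illustrative, not a claim about what the hard-coded table holds.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical rows, one per bee, with plot type and collection season.
toy = pd.DataFrame({
    "native_or_non": ["native"] * 30 + ["non-native"] * 50,
    "season": (
        ["early.season"] * 10 + ["late.season"] * 20
        + ["early.season"] * 30 + ["late.season"] * 20
    ),
})

# Cross-tabulate the two categorical columns into observed counts.
observed = pd.crosstab(toy["native_or_non"], toy["season"])

# Test for independence between plot type and season.
chi2, p, dof, expected = chi2_contingency(observed.to_numpy())

print(observed)
print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
```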
    import matplotlib.pyplot as plt
    
    import pandas as pd
    
    # Group the filtered data by "sample_id" and "plant_species"
    grouped_data = hand_netting_data.groupby(['sample_id', 'plant_species']).size().reset_index(name='count')
    
    # Calculate the total count for each sample_id
    sample_id_totals = grouped_data.groupby('sample_id')['count'].sum()
    
    # Calculate the share of each plant species by sample_id
    grouped_data['share'] = grouped_data['count'] / grouped_data['sample_id'].map(sample_id_totals)
    
    # Create a pivot table with "sample_id" as rows and "plant_species" as columns, showing the share
    pivot_table = grouped_data.pivot(index='sample_id', columns='plant_species', values='share')
    
    # Plot the share of each plant species by sample_id as a stacked area chart
    fig, ax = plt.subplots()
    pivot_table.plot(kind='area', stacked=True, alpha=0.7, ax=ax)
    
    # Customize the plot
    ax.set_xlabel('Sample ID')
    ax.set_ylabel('Share')
    ax.set_title('Share of Plant Species by Sample ID')
    
    # Place the legend outside the plot area
    ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    
    # Display the plot
    plt.show()
    
    import pandas as pd
    
    # Group the data by "plant_species" and count the unique "sample_id" occurrences
    species_counts = hand_netting_data.groupby('plant_species')['sample_id'].nunique()
    
    # Keep the plant species that appear in more than six distinct samples
    multi_sample_species = species_counts[species_counts > 6].index.tolist()
    
    # Filter the original dataframe based on the multi-sample species
    simplified = hand_netting_data[hand_netting_data['plant_species'].isin(multi_sample_species)].copy()
    
    # Print the simplified dataframe
    print(simplified)
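
The same filter can be expressed without the intermediate list by broadcasting the per-species sample count back onto every row with `transform("nunique")`. The sketch below runs on invented rows, and `MIN_SAMPLES` is a hypothetical name for the threshold (set to 1 here so the toy data shows an effect; the cell above uses 6).

```python
import pandas as pd

MIN_SAMPLES = 1  # hypothetical threshold name; the cell above uses 6

# Hypothetical hand-netting rows (not real survey data).
toy = pd.DataFrame({
    "sample_id": [1, 1, 2, 3, 3, 4],
    "plant_species": ["A", "A", "A", "B", "C", "C"],
})

# Number of distinct samples each row's plant species appears in.
n_samples = toy.groupby("plant_species")["sample_id"].transform("nunique")

# Keep rows whose species appears in more than MIN_SAMPLES distinct samples.
kept = toy[n_samples > MIN_SAMPLES].copy()
print(kept)
```

Keeping the threshold in a named constant makes it easier to see how sensitive the simplified chart is to that choice.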
    
    import matplotlib.pyplot as plt
    import pandas as pd
    
    # Group the filtered data by "sample_id" and "plant_species"
    grouped_data = simplified.groupby(['sample_id', 'plant_species']).size().reset_index(name='count')
    
    # Calculate the total count for each sample_id
    sample_id_totals = grouped_data.groupby('sample_id')['count'].sum()
    
    # Calculate the share of each plant species by sample_id
    grouped_data['share'] = grouped_data['count'] / grouped_data['sample_id'].map(sample_id_totals)
    
    # Create a pivot table with "sample_id" as rows and "plant_species" as columns, showing the share
    pivot_table = grouped_data.pivot(index='sample_id', columns='plant_species', values='share')
    
    # Plot the share of each plant species by sample_id as a stacked area chart
    fig, ax = plt.subplots()
    pivot_table.plot(kind='area', stacked=True, alpha=0.7, ax=ax)
    
    # Customize the plot
    ax.set_xlabel('Sample ID')
    ax.set_ylabel('Share')
    ax.set_title('Share of Plant Species by Sample ID')
    
    # Place the legend outside the plot area
    ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    
    # Display the plot
    plt.show()