Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH is an important part of assessing soil condition. However, it can be expensive and time-consuming, so farmers often have to prioritize which metrics to measure within their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you as a machine learning expert for help selecting the best crop for their field. They've provided you with a dataset called soil_measures.csv, which contains:

  • "N": Nitrogen content ratio in the soil
  • "P": Phosphorus content ratio in the soil
  • "K": Potassium content ratio in the soil
  • "pH": pH value of the soil
  • "crop": Categorical label naming the crop (target variable)

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the "crop" column is the optimal choice for that field.

In this project, you will build multi-class classification models to predict the type of "crop" and identify the single most important feature for predictive performance.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

# Check for missing values
print(crops.isna().sum())

# Display unique crop types to verify multi-class target
print(crops.crop.unique())

# Separate features and target variable
X = crops.drop(columns="crop")
y = crops["crop"]

# Print the columns to verify the feature names
print(X.columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

# Dictionary to store performance scores for each feature
feature_performance = {}

# Train a logistic regression model for each feature individually
for feature in ["N", "P", "K", "pH"]:
    if feature in X_train.columns:
        log_reg = LogisticRegression(multi_class="multinomial", max_iter=1000)
        log_reg.fit(X_train[[feature]], y_train)
        y_pred = log_reg.predict(X_test[[feature]])
        
        # Calculate the F1 score for the current feature
        f1 = metrics.f1_score(y_test, y_pred, average="weighted")
        
        # Store the feature and its F1 score in the dictionary
        feature_performance[feature] = f1
        print(f"F1-score for {feature}: {f1}")
    else:
        print(f"Feature {feature} not found in the dataset.")

# Identify the feature with the highest F1 score
best_feature = max(feature_performance, key=feature_performance.get)
best_score = feature_performance[best_feature]

# Create a dictionary with the best feature and its score
best_predictive_feature = {best_feature: best_score}

print(best_predictive_feature)

A brief explanation of each part of the code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
  • Imports: This section imports necessary libraries:
    • pandas for data manipulation.
    • LogisticRegression from sklearn for creating the logistic regression model.
    • train_test_split for splitting the dataset into training and testing sets.
    • metrics from sklearn for calculating performance metrics.
# Load the dataset
crops = pd.read_csv("soil_measures.csv")
  • Load the Dataset: Reads the CSV file soil_measures.csv into a DataFrame called crops.
# Check for missing values
print(crops.isna().sum())
  • Check Missing Values: Prints the number of missing values in each column so you can confirm the data is complete before modeling.
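To make the output of this check concrete, here is a minimal sketch on a few made-up rows. The values below are hypothetical stand-ins, not rows from soil_measures.csv, and dropping incomplete rows is only one possible way to handle gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for soil_measures.csv
toy = pd.DataFrame({
    "N": [90, 85, np.nan, 74],
    "P": [42, 58, 55, 35],
    "K": [43, 41, 44, np.nan],
    "pH": [6.5, 7.0, 7.8, 6.9],
    "crop": ["rice", "rice", "maize", "maize"],
})

# One count per column: how many entries are missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple (assumed) remedy: keep only the rows with no missing values
complete_rows = toy.dropna()
print(len(complete_rows))
```

If every count is zero, as the walkthrough expects for the real file, no cleaning step is needed.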
# Display unique crop types to verify multi-class target
print(crops.crop.unique())
  • Display Unique Crop Types: Prints the unique values in the crop column to verify that it's a multi-class target variable.
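Beyond listing the labels, it can also help to count how many rows each crop has, since class balance affects how averaged metrics should be read. A small sketch, using a made-up label column rather than the real data:

```python
import pandas as pd

# Hypothetical labels for illustration only
crop = pd.Series(["rice", "rice", "maize", "lentil", "maize", "rice"], name="crop")

# Number of distinct classes in the target
n_classes = crop.nunique()

# Rows per class, sorted from most to least frequent
class_counts = crop.value_counts()

print(n_classes)
print(class_counts)
```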
# Separate features and target variable
X = crops.drop(columns="crop")
y = crops["crop"]
  • Separate Features and Target:
    • X contains all columns except crop, which are the features.
    • y contains only the crop column, which is the target variable.
# Print the columns to verify the feature names
print(X.columns)
  • Print Columns: Prints the names of the columns in X to verify the feature names and ensure they match the expected names.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
  • Split Data: Splits the dataset into training and testing sets:
    • X_train and y_train are the features and target for training.
    • X_test and y_test are the features and target for testing.
    • test_size=0.2 indicates 20% of the data is used for testing.
    • random_state=42 ensures reproducibility by setting a seed for random number generation.
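With many crop classes, a plain random split can leave some crops under-represented in the test set. One optional variation, not used in the code above, is to pass stratify so each class keeps roughly the same share in both splits. The sketch below uses made-up data with three evenly sized classes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 crops with 10 rows each
toy = pd.DataFrame({
    "N": list(range(30)),
    "crop": ["rice"] * 10 + ["maize"] * 10 + ["lentil"] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["N"]],
    toy["crop"],
    test_size=0.2,
    random_state=42,
    stratify=toy["crop"],  # keep class proportions equal across the splits
)

# Each crop should appear equally often in the 6-row test set
print(y_test.value_counts())
```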
# Dictionary to store performance scores for each feature
feature_performance = {}
  • Initialize Dictionary: Creates an empty dictionary to store the performance scores (F1 scores) for each feature.
# Train a logistic regression model for each feature individually
for feature in ["N", "P", "K", "pH"]:
    if feature in X_train.columns:
        log_reg = LogisticRegression(multi_class="multinomial", max_iter=1000)
        log_reg.fit(X_train[[feature]], y_train)
        y_pred = log_reg.predict(X_test[[feature]])

        # Calculate the F1 score for the current feature
        f1 = metrics.f1_score(y_test, y_pred, average="weighted")

        # Store the feature and its F1 score in the dictionary
        feature_performance[feature] = f1
        print(f"F1-score for {feature}: {f1}")
    else:
        print(f"Feature {feature} not found in the dataset.")
  • Train and Evaluate Model:
    • Loop Over Features: For each feature ("N", "P", "K", "pH"):
      • Check Feature Presence: Ensure the feature exists in the dataset.
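      • Create Model: Initialize a logistic regression model with multi_class="multinomial" for multi-class classification and max_iter=1000 so the solver has more iterations in which to converge. Note that scikit-learn 1.5 and later deprecate the multi_class argument; with the default lbfgs solver, multinomial is already the default for multi-class targets, so the argument can be dropped on recent versions.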
      • Train Model: Fit the model using the current feature.
      • Predict: Predict the crop type using the test set for the current feature.
      • Calculate F1 Score: Compute the weighted F1 score to evaluate the model's performance.
      • Store Score: Store the F1 score in the feature_performance dictionary and print it.
      • Handle Missing Features: Print a message if a feature is not found in the dataset.
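To show what average="weighted" means, here is a small pure-Python sketch that computes the per-class F1 scores and averages them by class support (the number of true rows per class). The function name weighted_f1 and the labels below are made up for illustration; in practice metrics.f1_score does this for you.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # true rows per class
    total = 0.0
    for label, n_rows in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when undefined
        f1 = 2 * tp / denom if denom else 0.0
        total += n_rows * f1
    return total / len(y_true)


# Hypothetical labels, not taken from the real dataset
y_true = ["rice", "rice", "rice", "maize", "maize", "lentil"]
y_pred = ["rice", "rice", "maize", "maize", "maize", "rice"]

score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

Here rice scores 2/3, maize 0.8, and lentil 0, so the weighted result is (3 × 2/3 + 2 × 0.8 + 1 × 0) / 6 = 0.6. Frequent crops therefore count for more in the final number than rare ones.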
# Identify the feature with the highest F1 score
best_feature = max(feature_performance, key=feature_performance.get)
best_score = feature_performance[best_feature]
  • Find Best Feature:
    • Identify Best Feature: Find the feature with the highest F1 score by using the max function on the feature_performance dictionary.
    • Retrieve Best Score: Get the F1 score associated with the best feature.
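The max call with key=feature_performance.get can look cryptic at first, so here is a short sketch of how it behaves. The scores are made up for illustration and are not results from the real dataset.

```python
# Made-up scores for illustration only; not results from soil_measures.csv
feature_performance = {"N": 0.10, "P": 0.13, "K": 0.22, "pH": 0.05}

# Iterating a dict yields its keys; key=... tells max() to compare them by score
best_feature = max(feature_performance, key=feature_performance.get)

# Rank every feature, highest score first
ranked = sorted(feature_performance.items(), key=lambda item: item[1], reverse=True)

# With an empty dictionary (for example, if no expected column was found),
# max() raises ValueError unless a default is supplied
empty_scores = {}
best_or_none = max(empty_scores, key=empty_scores.get, default=None)

print(best_feature)
print(ranked)
print(best_or_none)
```

The default=None variant is worth considering if the column names in your file might differ from the ones hard-coded in the loop.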
# Create a dictionary with the best feature and its score
best_predictive_feature = {best_feature: best_score}

print(best_predictive_feature)
  • Store and Print Best Feature:
    • Create Dictionary: Create a dictionary best_predictive_feature with the best feature and its F1 score.
    • Print Result: Print the best_predictive_feature dictionary to display the result.
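If you want to try the whole procedure without the CSV file, the steps can be wrapped in a function and run on synthetic data. This is a sketch under stated assumptions: score_single_features is a hypothetical helper name, the data is randomly generated so that "K" separates the classes while "noise" does not, the split is stratified, and multi_class is omitted so the code also runs on recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def score_single_features(X, y, test_size=0.2, random_state=42):
    """Return {feature: weighted F1} for one-feature logistic regressions."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scores = {}
    for feature in X.columns:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[[feature]], y_train)
        y_pred = model.predict(X_test[[feature]])
        scores[feature] = f1_score(
            y_test, y_pred, average="weighted", zero_division=0
        )
    return scores


# Synthetic stand-in data (an assumption, not the real soil measurements)
rng = np.random.default_rng(0)
labels = np.repeat(["rice", "maize", "lentil"], 100)
centers = {"rice": 20.0, "maize": 60.0, "lentil": 100.0}
toy = pd.DataFrame({
    "K": np.array([centers[c] for c in labels]) + rng.normal(0, 5, size=300),
    "noise": rng.normal(0, 5, size=300),
})

scores = score_single_features(toy, pd.Series(labels, name="crop"))
best = max(scores, key=scores.get)
print(scores)
print({best: scores[best]})
```

On this synthetic data the informative column should win by a wide margin; on the real dataset the ranking depends entirely on the measurements in the file.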
