
Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH is an important part of assessing soil condition. However, it can be expensive and time-consuming, so farmers often have to prioritize which metrics to measure within their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for their field. They have provided a dataset called soil_measures.csv, which contains:

  • "N": Nitrogen content ratio in the soil
  • "P": Phosphorus content ratio in the soil
  • "K": Potassium content ratio in the soil
  • "ph": pH value of the soil
  • "crop": categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the "crop" column is the optimal choice for that field.

In this project, you will build multi-class classification models to predict the type of "crop" and identify the single most important feature for predictive performance.

1 - Read the data into a pandas DataFrame and perform exploratory data analysis

# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import f1_score

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

# Display the first few rows of the DataFrame to understand its structure
print(crops.head())

# Checking for missing values in each column
missing_values = crops.isna().sum()
print("Missing values in each column: \n", missing_values)

# Check the unique values in the "crop" column to understand the type of crops
unique_crops = crops['crop'].unique()
print("Unique crop types: \n", unique_crops)

In the first step, we loaded the dataset soil_measures.csv into a pandas DataFrame named crops. We then performed exploratory data analysis to understand the structure and contents of the dataset.

  1. Loading the Data: We used pd.read_csv() to read the CSV file into the DataFrame.
  2. Checking for Missing Values: We checked for missing values in each column using isna().sum(). The results showed that there were no missing values in any of the columns.
  3. Checking Unique Crop Types: We inspected the unique values in the crop column using unique(), revealing a variety of crop types.
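The same checks can be tried without the CSV file. The sketch below uses a small made-up DataFrame (the values and crop names are hypothetical, not taken from soil_measures.csv) and adds value_counts() and nunique(), which were not part of the original cell, to show how many rows each crop has.

```python
import pandas as pd

# Hypothetical stand-in for soil_measures.csv so the snippet runs on its own
sample = pd.DataFrame({
    "N": [90, 85, 60, 74, 20, 25],
    "P": [42, 58, 55, 35, 67, 70],
    "K": [43, 41, 44, 40, 20, 19],
    "ph": [6.5, 7.0, 7.8, 6.9, 5.7, 5.9],
    "crop": ["rice", "rice", "maize", "maize", "lentil", "lentil"],
})

# Count missing values per column (all zeros means nothing to impute)
missing_per_column = sample.isna().sum()

# Count rows per crop to see whether the classes are balanced
crop_counts = sample["crop"].value_counts()

# Number of distinct target classes
n_crops = sample["crop"].nunique()

print(missing_per_column)
print(crop_counts)
print("Number of distinct crops:", n_crops)
```

Knowing the class balance matters later, because the weighted F1 score gives more weight to crops with more test rows.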

2 - Split the data

# Features and target variables 
X = crops.drop(columns=['crop'])
y = crops['crop']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

In the second step, we split the dataset into training and test sets to prepare for model training and evaluation.

  1. Defining Features and Target: We created X to include all columns except crop, and y to include only the crop column.
  2. Splitting the Data: Using train_test_split(), we split the data into training and test sets, allocating 20% of the data for testing.

These steps helped us understand the data and prepared us for building and evaluating machine learning models to predict the optimal crop based on soil measurements.
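As a minimal sketch of what the 80/20 split does, the example below runs train_test_split() on a made-up 20-row dataset. The stratify argument is an optional variation that the original code does not use; it is included here only to show how to keep each crop's share the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, two crops with 10 rows each
toy = pd.DataFrame({
    "K": list(range(20)),
    "crop": ["rice"] * 10 + ["maize"] * 10,
})
X_toy = toy[["K"]]
y_toy = toy["crop"]

# test_size=0.2 keeps 20% of the rows (4 of 20) for testing;
# stratify=y_toy (not in the original code) preserves the crop proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)
print(y_te.value_counts())
```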

3 - Evaluate feature performance

# Create an empty dictionary to store each feature's predictive performances
features_dict = {}

# List of features to iterate over
features = ["N", "P", "K", "ph"]

# Loop through the features
for feature in features:
    # Create and train a Logistic Regression model
    log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=200)
    log_reg.fit(X_train[[feature]], y_train)
    
    # Predict the target values using the test set
    y_pred = log_reg.predict(X_test[[feature]])
    
    # Evaluate the performance using the weighted F1 score
    feature_performance = f1_score(y_test, y_pred, average='weighted')
    
    # Store the feature and its performance in the dictionary
    features_dict[feature] = feature_performance
    
    # Print the feature and its F1 score
    print(f"F1-score for {feature}: {feature_performance}")
          
# Display the feature performance dictionary
features_dict

In this stage, we aimed to identify which individual soil metric (Nitrogen, Phosphorus, Potassium, or pH value) has the strongest predictive performance for classifying crop types. We built and evaluated a separate Logistic Regression model for each feature, training it on that one column only, and used the F1 score to measure the performance of each model.

Here are the steps we followed:

  1. Create a Logistic Regression Model: We used the LogisticRegression model from the sklearn library with the multi_class='multinomial' parameter to support multi-class classification.
  2. Train the Model: For each feature (N, P, K, ph), we trained the Logistic Regression model using the training subset of that feature.
  3. Predict Crop Types: We used the trained model to predict crop types on the test subset of the respective feature.
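  4. Evaluate the Model: We calculated the F1 score for each model to evaluate its performance. The F1 score is the harmonic mean of precision and recall. Because we pass average='weighted', the score is computed separately for each crop and then averaged, with each crop weighted by its number of samples in the test set.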
  5. Store the Results: We stored the F1 scores in a dictionary to compare the performance of each feature.
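To make the weighted F1 score concrete, here is a small worked example on made-up labels (the label lists are hypothetical and unrelated to the real predictions). It computes the per-class F1 scores, combines them by hand using each class's support, and checks that the result matches f1_score(..., average='weighted').

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels for six fields
y_true = ["rice", "rice", "rice", "maize", "maize", "lentil"]
y_hat = ["rice", "rice", "maize", "maize", "lentil", "lentil"]

labels = ["rice", "maize", "lentil"]

# One F1 score per class, in the order given by `labels`
per_class = f1_score(y_true, y_hat, labels=labels, average=None)

# Support = number of true samples of each class
support = [y_true.count(label) for label in labels]

# Weighted average of the per-class scores, computed by hand
manual_weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)

# The same quantity straight from scikit-learn
weighted = f1_score(y_true, y_hat, average="weighted")

print("Per-class F1:", dict(zip(labels, per_class.round(4))))
print(f"Manual weighted F1: {manual_weighted:.4f}")
print(f"sklearn weighted F1: {weighted:.4f}")
```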

Here are the F1 scores for each feature:

  • Nitrogen (N): 0.0946
  • Phosphorus (P): 0.1360
  • Potassium (K): 0.2511
  • pH (ph): 0.0453
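Scores like these are easier to interpret next to a naive baseline. The sketch below is not part of the original analysis: it uses scikit-learn's DummyClassifier on made-up labels to show how one might compute the weighted F1 score of a model that always predicts the most frequent crop. Running the same idea on the real y_train and y_test would give the reference value for this dataset.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy labels standing in for y_train / y_test (hypothetical data)
y_train_toy = ["rice"] * 6 + ["maize"] * 3 + ["lentil"] * 3
y_test_toy = ["rice", "rice", "maize", "lentil"]

# DummyClassifier ignores the features, so placeholder columns are enough
X_train_toy = [[0]] * len(y_train_toy)
X_test_toy = [[0]] * len(y_test_toy)

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

# zero_division=0 scores never-predicted classes as 0 without a warning
baseline_f1 = f1_score(
    y_test_toy, baseline_pred, average="weighted", zero_division=0
)
print(f"Baseline weighted F1: {baseline_f1:.4f}")
```

A single-feature model is only informative to the extent that it beats this kind of baseline.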

4 - Create the best_predictive_feature variable

# F1 scores from step 3, copied here so this cell can run on its own
feature_performance = {
    'N': 0.09460822808525177,
    'P': 0.13596627225711835,
    'K': 0.25109003900054505,
    'ph': 0.04532731061152114
}

# Identify the feature with the highest F1 score
best_feature = max(feature_performance, key=feature_performance.get)
best_score = feature_performance[best_feature]

# Create a dictionary to store the best feature and its score
best_predictive_feature = {best_feature: best_score}

# Print the best predictive feature
print(f"The best predictive feature is '{best_feature}' with an F1 score of {best_score:.4f}")

From the results, Potassium content (K) has the highest F1 score, 0.2511, so it is the strongest single predictor of crop type among the four features. The score is still low in absolute terms, so K alone is far from sufficient to choose a crop; the result only says that, in this dataset and with this model, Potassium separates the crops better than any other single measurement.

In contrast, the pH value has the lowest F1 score of 0.0453, indicating it is the least predictive feature for classifying crop types in our analysis.
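For a full comparison rather than only the maximum, the scores can be sorted from best to worst. This is a minimal sketch using the rounded F1 values reported above; the variable names are illustrative.

```python
# Rounded F1 scores reported in step 3
scores = {"N": 0.0946, "P": 0.1360, "K": 0.2511, "ph": 0.0453}

# Sort (feature, score) pairs from highest to lowest score
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.4f}")

# The first entry is the best single feature, the last is the weakest
best_name, best_value = ranking[0]
worst_name, worst_value = ranking[-1]
```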

Conclusion:

In this project, we used machine learning techniques to assist farmers in selecting the best crop to plant based on essential soil metrics. The exploratory data analysis let us inspect the structure of the dataset and confirm that it contains no missing values, so the data was ready for modeling.

We split our data into training and test sets to ensure proper model validation. Next, we built logistic regression models for each of the four soil variables (Nitrogen, Phosphorus, Potassium, and pH) and evaluated their predictive performance using the F1-score metric.

The results showed that the most predictive single variable was the Potassium (K) level in the soil, with an F1-score of 0.2511. If a farmer can afford to measure only one soil metric, this analysis suggests Potassium is the most informative one, although a score this low means that one measurement alone cannot reliably determine the best crop.

Identifying the most predictive variable provides valuable insight for farmers, allowing them to make informed decisions and optimize their crops more efficiently. This project demonstrates how the application of machine learning techniques can have a significant impact on agriculture, promoting more sustainable and productive farming practices.