Project: Predictive Modeling for Agriculture

Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

Measuring essential soil metrics such as nitrogen, phosphorus, potassium levels, and pH is crucial for assessing soil health. However, this testing can be expensive and time-consuming, often forcing farmers to prioritize which metrics they measure based on budget constraints.

When choosing which crop to plant each season, farmers aim to maximize crop yield by considering factors such as soil quality. Each crop has specific ideal soil conditions that support optimal growth.

A farmer has approached us for assistance in selecting the best crop for their field using machine learning. They provided a dataset, soil_measures.csv, containing:

"N": Nitrogen content ratio in the soil
"P": Phosphorus content ratio in the soil
"K": Potassium content ratio in the soil
"ph": pH value of the soil
"crop": categorical label naming the crop (target variable)

Each row in the dataset represents a sample of soil measurements from a particular field, along with the corresponding optimal crop.

🎯 Project Objective

In this project, we will:

Build multi-class classification models using logistic regression
Evaluate how well each individual feature predicts the crop
Identify the single most important soil metric for accurate crop prediction

This work helps guide cost-effective testing by identifying which soil element contributes the most predictive value.

📥 1. Load and Inspect the Data

We'll begin by importing the necessary libraries and loading the dataset soil_measures.csv.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the dataset
crops = pd.read_csv("soil_measures.csv")
print(crops.head())

🧼 2. Data Overview

This dataset includes four numeric features:

Nitrogen (N)
Phosphorus (P)
Potassium (K)
pH value (ph)

Our target is a categorical label: crop, with 22 unique crop types. We'll also confirm there are no missing values.

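# Count missing values per column (all zeros means nothing is missing)
print(crops.isna().sum().sort_values())

# List the distinct crop labels in the target column
print(crops["crop"].unique())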
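The two checks above can also be bundled into one small helper. The sketch below is only an illustration: `summarize` is a hypothetical function name, and the frame it runs on is a made-up stand-in that reuses the column names of soil_measures.csv, not the farmer's actual data.

```python
import pandas as pd

# Hypothetical stand-in for soil_measures.csv: same column names, made-up values.
sample = pd.DataFrame({
    "N": [90, 85, 20, 25],
    "P": [42, 58, 67, 70],
    "K": [43, 41, 20, 19],
    "ph": [6.5, 7.0, 5.7, 5.9],
    "crop": ["rice", "rice", "lentil", "lentil"],
})

def summarize(df, target="crop"):
    """Return total missing cells, number of classes, and rows per class."""
    missing = int(df.isna().sum().sum())
    n_classes = int(df[target].nunique())
    per_class = df[target].value_counts().to_dict()
    return missing, n_classes, per_class

missing, n_classes, per_class = summarize(sample)
print(missing, n_classes, per_class)
```

Calling `summarize(crops)` on the real dataset should report zero missing cells and 22 classes if the overview above holds. The per-class counts also show whether the crops are balanced, which matters when interpreting an averaged metric.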

🤖 3. Model Training

We'll build separate logistic regression models using one soil metric at a time to predict the crop.
This helps determine which individual feature has the strongest predictive power.

We use:

LogisticRegression from scikit-learn
F1-score (weighted) for performance evaluation
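To make the metric concrete, here is a minimal pure-Python sketch of what a weighted F1-score computes: the F1 of each class, averaged with weights proportional to how often that class appears in the true labels. The function name `weighted_f1` and the labels are made up for illustration; the project itself relies on scikit-learn's `f1_score` with `average="weighted"`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: rows where both truth and prediction equal this label
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Made-up labels, for illustration only.
y_true = ["rice", "rice", "rice", "maize"]
y_pred = ["rice", "rice", "maize", "maize"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```

Here "rice" has an F1 of 0.8 and a weight of 3/4, while "maize" has an F1 of about 0.667 and a weight of 1/4, giving roughly 0.767 overall.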

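# Separate the soil features from the target label
X = crops.drop("crop", axis=1)
y = crops["crop"]

# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)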
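The split above is purely random, so with 22 crop types some classes may end up over- or under-represented in the test set. One optional variation, which the original code does not use, is to pass `stratify=y` to `train_test_split` so that each class keeps its share in both subsets. The sketch below shows the effect on hypothetical toy data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples for each of two classes.
X_toy = [[i] for i in range(20)]
y_toy = ["rice"] * 10 + ["maize"] * 10

# stratify=y_toy keeps the 50/50 class balance in both train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

test_counts = Counter(y_te)
print(len(y_tr), len(y_te), test_counts)
```

With 20 rows and `test_size=0.3`, the test set holds 6 rows, 3 per class. Whether stratifying changes the feature ranking on the real dataset is something to check rather than assume.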

📈 4. Performance Evaluation

The loop below fits a multinomial logistic regression on each feature alone and records its weighted F1-score on the test set:

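feature_performance = {}
for feature in ["N", "P", "K", "ph"]:
    # Fit on a single column. Note: scikit-learn 1.5+ deprecates multi_class,
    # where multinomial is already the default for multi-class targets.
    log_reg = LogisticRegression(multi_class="multinomial")
    log_reg.fit(X_train[[feature]].values, y_train)
    y_pred = log_reg.predict(X_test[[feature]].values)
    # Weighted F1 averages per-class F1 by each class's share of the test rows
    feature_performance[feature] = f1_score(y_test, y_pred, average="weighted")
    print(f"F1-score for {feature}: {feature_performance[feature]}")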

print(feature_performance)

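# Pick the feature with the highest weighted F1-score rather than hard-coding it
best_feature = max(feature_performance, key=feature_performance.get)
best_predictive_feature = {best_feature: feature_performance[best_feature]}
print(best_predictive_feature)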