
Smart Crop Selection Using Soil Data

Soil quality matters a great deal when deciding which crops to grow. Soil contains key nutrients such as nitrogen, phosphorus, and potassium, along with other properties like pH (which indicates how acidic or basic the soil is). However, measuring all of these can be expensive and time-consuming, so farmers often have to choose which soil measurements to prioritize based on their budget.

Farmers want to grow crops that will give them the best harvest. One of the most important things that affects crop growth is the condition of the soil. Each crop grows best in certain types of soil, so knowing what's in the soil helps make better planting decisions.

The sample dataset (soil_measures.csv) includes the following columns:

  • "N": nitrogen content ratio in the soil
  • "P": phosphorus content ratio in the soil
  • "K": potassium content ratio in the soil
  • "ph": pH value of the soil (how acidic or basic it is)
  • "crop": the crop that grew best in those soil conditions (categorical target variable)

Each row in the dataset shows one soil sample and the best crop to grow in it.
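
Before modeling, a quick sanity check can confirm that the file really has the columns described above. The sketch below assumes the column names N, P, K, ph, and crop used in this project. The helper names `missing_columns` and `summarize`, and the tiny in-memory frame with placeholder crop labels, are illustrative assumptions and not part of the original script.

```python
import pandas as pd

# Column names as described in this project; adjust if your file differs.
EXPECTED_COLUMNS = ["N", "P", "K", "ph", "crop"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

def summarize(df, target="crop"):
    """Return (number of soil samples, number of distinct crops)."""
    return len(df), df[target].nunique()

# Made-up rows purely for illustration; real data would come from
# pd.read_csv("soil_measures.csv").
example = pd.DataFrame({
    "N": [90, 20, 60],
    "P": [42, 67, 55],
    "K": [43, 20, 44],
    "ph": [6.5, 5.7, 7.0],
    "crop": ["crop_a", "crop_b", "crop_a"],
})

print("Missing columns:", missing_columns(example))
print("Samples, distinct crops:", summarize(example))
```

Running the same two helpers on the real file would show early whether a column is misnamed, rather than relying on the per-feature check inside the training loop.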

What this project does:

  • Uses the soil data to figure out which crop is the best match for each type of soil
  • Builds a simple model to predict the crop based on just one soil feature at a time
  • Finds out which single feature (nitrogen, phosphorus, potassium, or pH) gives the best prediction
  • Helps farmers focus on the most useful test when they can't afford to test everything

This project gives a helpful starting point for farmers who want to make better crop choices without spending too much on soil testing.

# Required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

def main():
    # Load the dataset
    crops = pd.read_csv("soil_measures.csv")

    # Convert the 'crop' column to categorical numeric codes
    crops["crop"] = crops["crop"].astype("category").cat.codes

    # Dictionary to store accuracy scores for each feature
    best_predictive_feature = {}

    # Loop through each soil feature to evaluate its performance
    for f in ["N", "P", "K", "ph"]:  # Column names match the dataset
        if f in crops.columns:
            # Split data using only one feature at a time
            X_train, X_test, y_train, y_test = train_test_split(
                crops[[f]], crops["crop"], test_size=0.2, random_state=42
            )

            # Train a logistic regression model using the single feature
            model = LogisticRegression(max_iter=200).fit(X_train, y_train)

            # Evaluate the model's accuracy on the test set
            score = metrics.accuracy_score(y_test, model.predict(X_test))

            # Store the feature name and its corresponding accuracy score
            best_predictive_feature[f] = score
        else:
            print(f"Feature '{f}' not found in dataset columns.")

    # Identify the best feature and print result
    best_feature = max(best_predictive_feature, key=best_predictive_feature.get)
    print("Best predictive feature:", {best_feature: best_predictive_feature[best_feature]})

if __name__ == "__main__":
    main()

Results and Interpretation

After training models with each soil feature, we found:

  • Potassium (K) gave the highest accuracy score when used alone.
  • This suggests that K is the most informative single measurement for crop prediction in this dataset, at least with a logistic regression model.
  • Farmers may want to prioritize testing for potassium if they can't afford full soil testing.
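
Reporting only the winning feature hides how close the other features came and how any of them compares with simply guessing the most common crop. A minimal sketch, assuming a dictionary of per-feature accuracies like the one built in the script's loop; `rank_features` and `majority_baseline` are hypothetical helper names introduced here for illustration:

```python
def rank_features(scores):
    """Sort (feature, accuracy) pairs from highest to lowest accuracy."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    labels = list(labels)
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)
```

Passing the score dictionary from the loop to `rank_features` would list all four features in order, and calling `majority_baseline` on the test labels gives a floor that any useful single-feature model should clear.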

Limitations

  • We used a simple logistic regression model and accuracy as the only evaluation metric.
  • More advanced models (like Random Forest or XGBoost) could improve predictions.
  • Feature combinations were not explored; some features may work better together.
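
One way to start addressing these limitations is to score small feature combinations with cross-validation instead of a single train/test split. The sketch below is an assumption-laden illustration, not part of the original analysis: the function name `score_feature_subsets` and its parameters are hypothetical, it keeps logistic regression as the model, and the `scoring` argument can be switched from "accuracy" to another scikit-learn metric such as "balanced_accuracy" or "f1_macro".

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_feature_subsets(df, features, target="crop", size=2, cv=5,
                          scoring="accuracy"):
    """Return {feature tuple: mean cross-validated score} for every
    combination of `size` features drawn from `features`."""
    results = {}
    for subset in combinations(features, size):
        # A fresh model per subset so no state is shared between fits.
        model = LogisticRegression(max_iter=1000)
        scores = cross_val_score(
            model, df[list(subset)], df[target], cv=cv, scoring=scoring
        )
        results[subset] = scores.mean()
    return results
```

With the project's data loaded into a DataFrame named `crops`, a call such as `score_feature_subsets(crops, ["N", "P", "K", "ph"], size=2)` would return one score per feature pair. Setting `size=1` gives a cross-validated version of the single-feature comparison.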

Conclusion

Using basic soil data, we found that potassium is the best single predictor of optimal crop choice in this dataset. This approach can help farmers make better decisions even with limited resources.

Future work could include trying different models, using multiple features at once, and analyzing costs of testing vs. benefits in yield.