Sowing Success: How Machine Learning Helps Farmers Select the Best Crops
Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH value is an important part of assessing soil condition. However, it can be an expensive and time-consuming process, so farmers often have to prioritize which metrics to measure based on their budget constraints.
Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.
A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for their field. They have provided you with a dataset called soil_measures.csv, which contains:
"N": Nitrogen content ratio in the soil
"P": Phosphorus content ratio in the soil
"K": Potassium content ratio in the soil
"ph": pH value of the soil
"crop": categorical values that contain various crops (target variable).
Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the "crop" column is the optimal choice for that field.
In this project, you will build multi-class classification models to predict the type of "crop" and identify the single most important feature for predictive performance.
Let's start by setting up our environment.
!pip install --upgrade pip
!pip install -q ydata-profiling plotly

import sys
print(f"The current python version being used is {sys.version}")

from importlib.metadata import version
pkgs = [
    "pandas",
    "ydata-profiling",
    "matplotlib",
    "scikit-learn",
    "seaborn",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from IPython.display import display, HTML
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Load the dataset
crops = pd.read_csv("soil_measures.csv")
# see the dataset
print(crops.head(), "#" * 100, '\n', crops.tail())

# explore the dataset with ydata-profiling (formerly pandas-profiling)
profile = ProfileReport(crops, title="Profiling Report")
profile.to_file("crops_dataset.html")

# see the HTML page (filename= renders the file's contents rather than the literal string)
display(HTML(filename="crops_dataset.html"))

Upon analyzing the dataset, it appears that potassium (K), phosphorus (P), and nitrogen (N) are strongly associated with the crop variable and possibly with each other. Correlation among the predictors themselves is known as multicollinearity.
When building predictive models, especially linear models such as multiple regression, high multicollinearity among the independent variables can inflate the variance of the coefficient estimates and make the model harder to interpret. It also increases the risk of overfitting, where the model captures random noise rather than underlying patterns. This could lead to less reliable predictions and a poorer understanding of which nutrients most influence the choice of crop.
Therefore, we turn to tree-based models, namely a decision tree and a random forest, to help mitigate the problem of multicollinearity.
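Before relying on the profiling report alone, it can help to list which feature pairs are actually correlated. The snippet below is a minimal sketch of that check, not part of the original notebook: the helper name correlated_pairs, the 0.8 threshold, and the toy table are all assumptions made for illustration.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.5):
    """Return (feature_a, feature_b, r) for pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Tiny made-up table, for illustration only (NOT the real soil data)
toy = pd.DataFrame({
    "N": [10, 20, 30, 40, 50],
    "P": [5, 9, 16, 20, 26],
    "K": [7, 3, 9, 1, 5],
})

# In this toy table only N and P move together closely enough to be reported
print(correlated_pairs(toy, threshold=0.8))
```

On the real data, the same helper could be called as correlated_pairs(crops[['N', 'P', 'K', 'ph']]) once the dataset has been loaded. A suitable threshold depends on the problem, so 0.8 here is only a starting point.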
%whos

# Encode the 'crop' column
label_encoder = LabelEncoder()
crops['crop_encoded'] = label_encoder.fit_transform(crops['crop'])
# Define features and target
X = crops[['N', 'P', 'K', 'ph']]
y = crops['crop_encoded']
# Split the data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a pipeline with feature selection and a decision tree classifier.
# The polynomial-features step is commented out and k="all" keeps every feature,
# so the selection step currently passes all four features through.
pipeline_dt = Pipeline([
    # ('polynomial_features', PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)),
    ('feature_selection', SelectKBest(score_func=f_classif, k="all")),
    ('classifier', DecisionTreeClassifier(random_state=42))
])
# Define hyperparameters for Decision Tree
param_grid_dt = {
    'classifier__max_depth': [4, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10, 25],
    'classifier__min_samples_leaf': [1, 2, 4, 20, 25]
}
# Perform GridSearchCV for Decision Tree
grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=5, scoring='f1_weighted', n_jobs=1)
grid_search_dt.fit(X_train, y_train)
# Validate the best Decision Tree model
best_model_dt = grid_search_dt.best_estimator_
val_score_dt = best_model_dt.score(X_val, y_val)
y_val_pred_dt = best_model_dt.predict(X_val)
f1_score_dt = f1_score(y_val, y_val_pred_dt, average='weighted')
classification_report_dt = classification_report(y_val, y_val_pred_dt)
# Test the best Decision Tree model
test_score_dt = best_model_dt.score(X_test, y_test)
y_test_pred_dt = best_model_dt.predict(X_test)
f1_score_test_dt = f1_score(y_test, y_test_pred_dt, average='weighted')
classification_report_test_dt = classification_report(y_test, y_test_pred_dt)
# Create a pipeline with feature selection and a random forest classifier.
# As above, the polynomial-features step is commented out and k="all" keeps every feature.
pipeline_rf = Pipeline([
    # ('polynomial_features', PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)),
    ('feature_selection', SelectKBest(score_func=f_classif, k="all")),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Define hyperparameters for Random Forest
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [3, 4, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4, 20, 25]
}
# Perform GridSearchCV for Random Forest
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='f1_weighted', n_jobs=1)
grid_search_rf.fit(X_train, y_train)
# Validate the best Random Forest model
best_model_rf = grid_search_rf.best_estimator_
val_score_rf = best_model_rf.score(X_val, y_val)
y_val_pred_rf = best_model_rf.predict(X_val)
f1_score_rf = f1_score(y_val, y_val_pred_rf, average='weighted')
classification_report_rf = classification_report(y_val, y_val_pred_rf)
# Test the best Random Forest model
test_score_rf = best_model_rf.score(X_test, y_test)
y_test_pred_rf = best_model_rf.predict(X_test)
f1_score_test_rf = f1_score(y_test, y_test_pred_rf, average='weighted')
classification_report_test_rf = classification_report(y_test, y_test_pred_rf)
# Extract feature importances
#feature_names = pipeline_rf.named_steps['polynomial_features'].get_feature_names_out(X_train.columns)
feature_names = X_train.columns
# Decision Tree feature importances
dt_feature_importances = best_model_dt.named_steps['classifier'].feature_importances_
# Random Forest feature importances
rf_feature_importances = best_model_rf.named_steps['classifier'].feature_importances_
# Create a DataFrame to compare feature importances
feature_importances_df = pd.DataFrame({
    'Feature': feature_names,
    'Decision Tree Importance': dt_feature_importances,
    'Random Forest Importance': rf_feature_importances
})
# Sort the DataFrame by Random Forest Importance
feature_importances_df = feature_importances_df.sort_values(by='Random Forest Importance', ascending=False)
# Print Decision Tree results
print("Decision Tree Model Validation Results:")
print(f"Validation Score: {val_score_dt}")
print(f"Validation F1 Score: {f1_score_dt}")
print("Validation Classification Report:")
print(classification_report_dt)
print("\nDecision Tree Model Test Results:")
print(f"Test Score: {test_score_dt}")
print(f"Test F1 Score: {f1_score_test_dt}")
print("Test Classification Report:")
print(classification_report_test_dt)
# Print Random Forest results
print("\nRandom Forest Model Validation Results:")
print(f"Validation Score: {val_score_rf}")
print(f"Validation F1 Score: {f1_score_rf}")
print("Validation Classification Report:")
print(classification_report_rf)
print("\nRandom Forest Model Test Results:")
print(f"Test Score: {test_score_rf}")
print(f"Test F1 Score: {f1_score_test_rf}")
print("Test Classification Report:")
print(classification_report_test_rf)
# Print feature importances
print("\nFeature Importances:")
print(feature_importances_df)

# Plot the decision tree
plt.figure(figsize=(20, 10))
# Recent scikit-learn versions expect plain lists for feature_names and class_names
plot_tree(best_model_dt.named_steps['classifier'], filled=True, feature_names=list(X_train.columns), class_names=list(label_encoder.classes_))
plt.title("Decision Tree")
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig("DecisionTree_tuned.png")
plt.show()

# Plot a sample tree from the random forest
sample_tree = best_model_rf.named_steps['classifier'].estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(sample_tree, filled=True, feature_names=list(X_train.columns), class_names=list(label_encoder.classes_))
plt.title("Sample Tree from Random Forest")
plt.savefig("SampleRandomForesetTree.png")
plt.show()

# Identify the best predictive feature from Random Forest feature importances
best_feature_index = rf_feature_importances.argmax()
best_feature_name = feature_names[best_feature_index]
best_feature_importance = rf_feature_importances[best_feature_index]
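# Use the F1 score from the test set as the evaluation score.
# Note: this is the score of the full model trained on all four features,
# not of a model trained on the best feature alone.
evaluation_score = f1_score_test_rf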
# Create the best_predictive_feature dictionary
best_predictive_feature = {best_feature_name: evaluation_score}
# Print the best_predictive_feature dictionary
print("Best Predictive Feature:")
print(best_predictive_feature)
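The dictionary above pairs the top-ranked feature with the F1 score of the full model. To attach a score to each feature on its own, one option is to train a separate model per feature and compare their held-out scores. The following is a minimal sketch of that alternative, not the approach used in the notebook above: the helper score_single_features, the 50-tree forest, and the synthetic stand-in data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def score_single_features(X, y, random_state=42):
    """Train one small random forest per feature and return
    {feature_name: weighted F1 on a held-out split}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y
    )
    scores = {}
    for col in X.columns:
        model = RandomForestClassifier(n_estimators=50, random_state=random_state)
        model.fit(X_tr[[col]], y_tr)  # double brackets keep a 2-D, single-column frame
        y_pred = model.predict(X_te[[col]])
        scores[col] = float(f1_score(y_te, y_pred, average="weighted"))
    return scores


# Synthetic stand-in data (NOT the soil dataset): one feature tracks the class,
# the other is pure noise.
rng = np.random.default_rng(0)
y_demo = pd.Series(np.repeat([0, 1, 2], 60))
X_demo = pd.DataFrame({
    "informative": y_demo * 10 + rng.normal(0, 1, size=180),
    "noise": rng.normal(0, 1, size=180),
})

scores = score_single_features(X_demo, y_demo)
best = max(scores, key=scores.get)
print({best: scores[best]})
```

With the real data, the call would be score_single_features(X, y), using the X and y defined earlier. Impurity-based importances and single-feature scores answer slightly different questions, so the two rankings need not agree.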