Nepal's agricultural sector feeds over 65% of the population, yet farmers and policymakers have almost no data-driven tools to anticipate crop yield fluctuations. Climate variability — erratic monsoons, temperature shifts, unexpected frost — makes season-to-season yield prediction genuinely difficult.
This project built a full machine learning pipeline that predicts district-level crop yields across Nepal by combining historical agricultural statistics with meteorological data from NASA's POWER satellite system. The goal: give farmers and agricultural planners a tool that actually works with Nepal's limited data infrastructure.
- **Ministry of Agriculture and Livestock Development** — national crop statistics
- **NASA POWER** (Prediction of Worldwide Energy Resources) — satellite-derived meteorological data
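NASA POWER exposes its point data through a public REST API. A minimal sketch of how a district-level daily weather series could be requested — the endpoint and `community=AG` (agroclimatology) are real POWER API conventions, but the exact parameter set (`T2M`, `PRECTOTCORR`, `RH2M`) and the Kathmandu coordinates are illustrative assumptions, not necessarily what this project used:

```python
from urllib.parse import urlencode

BASE = "https://power.larc.nasa.gov/api/temporal/daily/point"

def power_request_url(lat, lon, start, end,
                      parameters=("T2M", "PRECTOTCORR", "RH2M")):
    """Build a NASA POWER point-API request URL.

    parameters: POWER variable codes (temperature at 2 m, corrected
    precipitation, relative humidity at 2 m) — an assumed subset.
    start/end: dates as YYYYMMDD strings.
    """
    query = {
        "parameters": ",".join(parameters),
        "community": "AG",     # agroclimatology community
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    return f"{BASE}?{urlencode(query)}"

# Example: daily series for a point near Kathmandu, calendar year 2020
url = power_request_url(27.7, 85.3, "20200101", "20201231")
```

Fetching `url` with any HTTP client returns a JSON payload keyed by variable code and date, which can then be aggregated to growing-season features per district.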
The core model evaluation loop: the target is log1p-transformed before training, each of the eight regression algorithms (seven scikit-learn models plus XGBoost) is fit on the log scale, and metrics are computed after inverting the transform back to the original scale:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Target: log1p-transformed for better regression performance
y = np.log1p(df_encoded['h/ha_yield'])

def evaluate_model(model, name, X_train, X_test, y_train, y_test):
    """Fit on the log scale, report metrics on the original scale."""
    model.fit(X_train, y_train)
    y_pred_log = model.predict(X_test)
    # Convert predictions and targets back to the original scale
    y_pred_orig = np.expm1(y_pred_log)
    y_test_orig = np.expm1(y_test)
    r2 = r2_score(y_test_orig, y_pred_orig)
    mse = mean_squared_error(y_test_orig, y_pred_orig)
    return r2, mse

models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression", Ridge()),
    ("Lasso Regression", Lasso()),
    ("Decision Tree", DecisionTreeRegressor(random_state=42)),
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
    ("Support Vector Regression", SVR()),
]

for name, model in models:
    r2, mse = evaluate_model(model, name, X_train, X_test, y_train, y_test)
    print(f"{name}: R² = {r2:.4f}, MSE = {mse:.4f}")

# XGBoost trained separately using xgb.train() with early stopping
# Best result → R² = 0.8175, MSE = 0.2031
```
All models evaluated on a held-out 20% test set. Metrics computed on the original scale (after inverse log1p transform). XGBoost achieved the best R² and lowest MSE:
| Model | R² Score | MSE | Notes |
|---|---|---|---|
| **XGBoost** (best) | 0.8175 | 0.2031 | Gradient boosting with early stopping, handles non-linearity best |
| Random Forest | 0.7879 | 0.2360 | Strong ensemble baseline, slightly below XGBoost |
| Gradient Boosting | 0.7422 | 0.2869 | Good but slower convergence than XGBoost |
| Decision Tree | 0.6285 | 0.4135 | Prone to overfitting without pruning |
| SVR | 0.4784 | 0.5805 | Sensitive to feature scaling, limited on tabular data |
| Linear Regression | 0.3907 | 0.6781 | Misses non-linear crop-weather interactions |
| Ridge Regression | 0.3906 | 0.6782 | Marginal improvement over Linear Regression |
| Lasso Regression | -0.0001 | 1.1130 | Default α over-regularized the log-scale target; predictions collapsed to the mean |
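The train-on-log, score-on-original round trip behind these numbers can be exercised end to end on synthetic data (illustrative only — this is not the project's dataset, and the R²/MSE it produces are unrelated to the table above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Skewed, strictly positive target — the case log1p is meant to help with
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
yield_orig = np.exp(2.0 * X[:, 0] + X[:, 1]) + rng.gamma(2.0, 0.2, size=500)

X_train, X_test, y_train_orig, y_test_orig = train_test_split(
    X, yield_orig, test_size=0.2, random_state=42)

# Fit on the log scale, then invert with expm1 before scoring
model = RandomForestRegressor(random_state=42)
model.fit(X_train, np.log1p(y_train_orig))
y_pred_orig = np.expm1(model.predict(X_test))

r2 = r2_score(y_test_orig, y_pred_orig)
mse = mean_squared_error(y_test_orig, y_pred_orig)
```

Because `expm1` is the exact inverse of `log1p`, predictions land back in the target's original units, so R² and MSE are directly comparable across models regardless of the training-time transform.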
Presented at the International Conference on Recent Trends in Artificial Intelligence · ICRTAI 2025
The paper presents the full methodology, dataset construction, feature engineering decisions, and model evaluation framework. It contextualizes the work within Nepal's agricultural data scarcity and argues for satellite-derived meteorological inputs as a scalable alternative to ground station networks.
The final XGBoost model was wrapped in a Streamlit web application that lets users input district, crop type, and meteorological parameters to get a predicted yield. Designed to be accessible to agricultural officers and researchers without requiring any coding knowledge.
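A minimal sketch of what such a Streamlit front end could look like — the widget labels, district/crop lists, feature order, and `model.pkl` artifact name are all hypothetical stand-ins for the real app's pipeline:

```python
# Run with: streamlit run app.py
import numpy as np

def to_feature_row(district_idx, crop_idx, rainfall_mm, temp_c):
    """Assemble one model-input row. Hypothetical feature order — the real
    app would reuse the training pipeline's district/crop encoders."""
    return np.array([[district_idx, crop_idx, rainfall_mm, temp_c]])

def main():
    # Imports kept local so the helper above stays importable without Streamlit
    import pickle
    import streamlit as st

    st.title("🌾 Crop Yield Prediction — Nepal")
    districts = ["Kathmandu", "Chitwan", "Kaski"]  # illustrative subset
    crops = ["Rice", "Maize", "Wheat"]             # illustrative subset

    district = st.selectbox("District", districts)
    crop = st.selectbox("Crop", crops)
    rainfall = st.number_input("Rainfall (mm)", min_value=0.0, value=1200.0)
    temp = st.number_input("Mean temperature (°C)", min_value=-10.0, value=22.0)

    if st.button("Predict yield"):
        with open("model.pkl", "rb") as f:   # assumed saved-model artifact
            model = pickle.load(f)
        row = to_feature_row(districts.index(district), crops.index(crop),
                             rainfall, temp)
        # Model was trained on log1p(yield); invert before display
        yield_pred = float(np.expm1(model.predict(row)[0]))
        st.metric("Predicted yield", f"{yield_pred:.2f}")

# Under `streamlit run`, invoke main() at module level.
```

Keeping the feature-assembly logic in a plain function separates the UI from the model interface, so the same encoding can be unit-tested without a running Streamlit server.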
🌾 **Crop Yield Prediction — Nepal**: input district, crop type, and weather parameters to get a predicted yield. Open Live App →