Machine Learning · Final Year Project

Crop Yield Prediction in Nepal

Status: Presented · ICRTAI 2025
Best Model: XGBoost Regressor
Data Sources: NASA POWER + MoALD
Deployment: Live Streamlit App
Overview
The Problem

Nepal's agricultural sector feeds over 65% of the population, yet farmers and policymakers have almost no data-driven tools to anticipate crop yield fluctuations. Climate variability — erratic monsoons, temperature shifts, unexpected frost — makes season-to-season yield prediction genuinely difficult.

This project built a full machine learning pipeline that predicts district-level crop yields across Nepal by combining historical agricultural statistics with satellite-derived meteorological data from NASA's POWER project. The goal: give farmers and agricultural planners a tool that works within Nepal's limited data infrastructure.

Methodology
Pipeline Architecture
01. Data Collection: MoALD + NASA POWER API
02. Preprocessing: merge, encode, impute, scale
03. Feature Engineering: log and Yeo-Johnson transforms
04. Model Training: 8 regression algorithms
05. Deployment: Streamlit web app
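
The preprocessing and feature engineering steps could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the column names (`district`, `crop`, `area_kha`, `precip_mm_day`, `temp_avg_c`) and the six-row table are hypothetical, and median imputation and one-hot encoding are assumed choices that the source does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Hypothetical merged table (crop statistics + weather), with a few gaps.
df = pd.DataFrame({
    "district":      ["Kaski", "Kaski", "Chitwan", "Chitwan", "Jhapa", "Jhapa"],
    "crop":          ["Rice", "Maize", "Rice", "Maize", "Rice", "Maize"],
    "area_kha":      [1.2, 0.8, 5.4, np.nan, 9.1, 3.0],      # thousand hectares
    "precip_mm_day": [10.7, 10.7, 5.8, 5.8, np.nan, 6.8],
    "temp_avg_c":    [20.1, 20.1, 24.3, 24.3, 25.0, 25.0],
})

categorical = ["district", "crop"]
numeric = ["area_kha", "precip_mm_day", "temp_avg_c"]

numeric_steps = Pipeline([
    # Fill missing values with the column median (assumed strategy)
    ("impute", SimpleImputer(strategy="median")),
    # Yeo-Johnson reduces skew; standardize=True also scales each
    # column to zero mean and unit variance
    ("yeo_johnson", PowerTransformer(method="yeo-johnson", standardize=True)),
])

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", numeric_steps, numeric),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)  # 3 districts + 2 crops + 3 numeric columns
```

Keeping the steps inside one `ColumnTransformer` means the same fitted transforms can be reused at prediction time, which matters once the model sits behind a web app.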
Data
Data Sources

MoALD Nepal

Ministry of Agriculture and Livestock Development — national crop statistics

  • District-level yield data
  • Crop area harvested
  • Crop type labels
  • Multi-year historical records

NASA POWER API

Prediction of Worldwide Energy Resources — satellite-derived meteorological data

  • Temperature (min/max/avg)
  • Precipitation
  • Solar radiation
  • Relative humidity
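
For readers who want to pull the same variables, a request to the POWER daily point endpoint might look like the sketch below. The parameter codes are the ones POWER commonly uses for these quantities, but the exact set used in this project is not stated in the source, so treat the list, the coordinates, and the date range as assumptions. The helper names `build_power_url` and `parse_power_daily` are hypothetical.

```python
from urllib.parse import urlencode

POWER_DAILY_ENDPOINT = "https://power.larc.nasa.gov/api/temporal/daily/point"

# Assumed POWER codes for: avg/min/max temperature, precipitation,
# surface solar radiation, relative humidity
PARAMETERS = ["T2M", "T2M_MIN", "T2M_MAX", "PRECTOTCORR", "ALLSKY_SFC_SW_DWN", "RH2M"]
MISSING = -999.0  # fill value POWER uses for missing observations


def build_power_url(lat, lon, start, end):
    """Build a daily point query URL; dates are YYYYMMDD strings."""
    query = {
        "parameters": ",".join(PARAMETERS),
        "community": "AG",  # agroclimatology community
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    return f"{POWER_DAILY_ENDPOINT}?{urlencode(query, safe=',')}"


def parse_power_daily(payload):
    """Turn a POWER JSON payload into {parameter: {date: value or None}}."""
    series = payload["properties"]["parameter"]
    return {
        name: {day: (None if value == MISSING else value) for day, value in values.items()}
        for name, values in series.items()
    }


# Illustrative coordinates only (roughly central Nepal)
url = build_power_url(lat=28.21, lon=83.99, start="20200101", end="20201231")

# A tiny hand-written payload in the shape the API returns
sample = {"properties": {"parameter": {"T2M": {"20200101": 11.4, "20200102": -999.0}}}}
parsed = parse_power_daily(sample)
print(url)
print(parsed)
```

Downloading is left out on purpose; `json.load(urllib.request.urlopen(url))` would produce a payload that `parse_power_daily` accepts. Daily values would then need to be aggregated to the same district and year granularity as the MoALD records before merging.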
Code
Key Implementation

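The core model evaluation loop. All 8 regression algorithms are evaluated on both the log scale and the original scale, with the target variable log1p-transformed before training. The seven scikit-learn models share the function below; XGBoost is trained separately with early stopping: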

Python
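import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# df_encoded: the merged, encoded DataFrame produced by the preprocessing step.
# Target: log1p transformed for better regression performance.
# Assumption: every other column of df_encoded is used as a feature.
X = df_encoded.drop(columns=['h/ha_yield'])
y = np.log1p(df_encoded['h/ha_yield'])

# Hold out 20% for testing (the random_state value is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Evaluate each model on log-scale and original scale
def evaluate_model(model, name, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_log = model.predict(X_test)

    # Metrics on the log scale the model was trained on
    r2_log  = r2_score(y_test, y_pred_log)
    mse_log = mean_squared_error(y_test, y_pred_log)

    # Convert back to original scale
    y_pred_orig = np.expm1(y_pred_log)
    y_test_orig = np.expm1(y_test)

    r2  = r2_score(y_test_orig, y_pred_orig)
    mse = mean_squared_error(y_test_orig, y_pred_orig)
    return {"model": name, "r2_log": r2_log, "mse_log": mse_log, "r2": r2, "mse": mse}

models = [
    ("Linear Regression",          LinearRegression()),
    ("Ridge Regression",           Ridge()),
    ("Lasso Regression",           Lasso()),
    ("Decision Tree",              DecisionTreeRegressor(random_state=42)),
    ("Random Forest",              RandomForestRegressor(random_state=42)),
    ("Gradient Boosting",          GradientBoostingRegressor(random_state=42)),
    ("Support Vector Regression",  SVR()),
]

# Run the seven scikit-learn models through the same evaluation
results = [
    evaluate_model(model, name, X_train, X_test, y_train, y_test)
    for name, model in models
]

# XGBoost trained separately using xgb.train() with early stopping
# Best result → R² = 0.8175, MSE = 0.2031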
Results
Model Comparison

All models were evaluated on a held-out 20% test set, with metrics computed on the original scale (after inverting the log1p transform). XGBoost achieved the best R² and lowest MSE:

Model             | R² Score | MSE    | Notes
XGBoost (best)    |  0.8175  | 0.2031 | Gradient boosting with early stopping; handles non-linearity best
Random Forest     |  0.7879  | 0.2360 | Strong ensemble baseline, slightly below XGBoost
Gradient Boosting |  0.7422  | 0.2869 | Good, but slower convergence than XGBoost
Decision Tree     |  0.6285  | 0.4135 | Prone to overfitting without pruning
SVR               |  0.4784  | 0.5805 | Sensitive to feature scaling, limited on tabular data
Linear Regression |  0.3907  | 0.6781 | Misses non-linear crop-weather interactions
Ridge Regression  |  0.3906  | 0.6782 | Effectively identical to Linear Regression (marginally lower R²)
Lasso Regression  | -0.0001  | 1.1130 | Over-regularized: predictions collapsed
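
An R² of almost exactly zero is what a constant prediction produces. One plausible explanation, which the source does not confirm, is that scikit-learn's default `alpha=1.0` was strong enough to shrink every Lasso coefficient to zero, leaving only the intercept. The sketch below reproduces that behaviour on synthetic data (not the project's dataset); the smaller `alpha=0.01` is an illustrative value, not a tuned one.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                          # unit-variance synthetic features
y = 0.3 * X[:, 0] + rng.normal(scale=0.5, size=500)    # weak linear signal

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

strong = Lasso(alpha=1.0).fit(X_train, y_train)    # scikit-learn's default penalty
weak = Lasso(alpha=0.01).fit(X_train, y_train)     # illustrative smaller penalty

r2_strong = r2_score(y_test, strong.predict(X_test))
r2_weak = r2_score(y_test, weak.predict(X_test))

print(strong.coef_)   # every coefficient shrunk to zero
print(r2_strong)      # at or just below zero: the model predicts the training mean
print(r2_weak)        # positive once the penalty lets a feature back in
```

If this is what happened, tuning `alpha` (for example with `LassoCV`) would be the natural follow-up before ruling Lasso out.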
Conference
Research Output

Meteorology-Driven Crop Yield Prediction in Nepal: A Regression Approach

Presented at the International Conference on Recent Trends in Artificial Intelligence · ICRTAI 2025

The paper presents the full methodology, dataset construction, feature engineering decisions, and model evaluation framework. It contextualizes the work within Nepal's agricultural data scarcity and argues for satellite-derived meteorological inputs as a scalable alternative to ground station networks.

Pratik Ghimire presenting at ICRTAI 2025
Deployment
Live Web Application

The final XGBoost model was wrapped in a Streamlit web application that lets users input district, crop type, and meteorological parameters to get a predicted yield. It is designed to be accessible to agricultural officers and researchers without any coding knowledge.

crop-yield-prediction-nepal-nhugtaob97vm7rnp2mj9be.streamlit.app

XGBoost · Python · Streamlit · NASA POWER · MoALD Data · Scikit-learn · Pandas · NumPy · Yeo-Johnson Transform · Google Colab