Abstract
Crop yield prediction is a challenging task in precision agriculture. Paddy is one of the world's most significant cereal crops, so predicting its yield is crucial for crop management and decision making. Despite the number of existing crop yield prediction models, better performance in paddy yield prediction is still desirable. With this in mind, the present study aimed to determine the features that most strongly influence paddy production. We employed machine learning algorithms alongside the best available data sources for paddy yield prediction. Five regression models were developed using 16 input variables obtained from the soil health card, and multiple approaches were applied to improve model performance. The model results were also validated using Monte Carlo methods. Our analysis shows that the XGBoost-ensembled random forest achieved the highest prediction accuracy (86%) among the models investigated. Notably, this is the first study on paddy crop yield prediction from the features of a soil health card. Farmers and agronomists could use this model to plan their paddy cultivation and obtain the maximum yield.
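The abstract does not say which Monte Carlo procedure was used for validation. As a minimal sketch, assuming it refers to Monte Carlo cross-validation (repeated random train/test splits), the idea can be illustrated as follows. The data are synthetic, the single predictor `soil_feature` and the one-variable least-squares model are hypothetical stand-ins, and none of this reproduces the study's 16 soil health card variables or its ensemble model.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def monte_carlo_cv(xs, ys, n_rounds=200, test_fraction=0.2, seed=0):
    """Repeat random train/test splits; return mean and std of test R^2."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, int(len(xs) * test_fraction))
    scores = []
    for _ in range(n_rounds):
        rng.shuffle(indices)  # a fresh random split each round
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        a, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx])
        preds = [a + b * xs[i] for i in test_idx]
        scores.append(r2_score([ys[i] for i in test_idx], preds))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic, illustrative data only (not from the study).
data_rng = random.Random(42)
soil_feature = [data_rng.uniform(0.0, 10.0) for _ in range(300)]
paddy_yield = [2.0 + 0.5 * x + data_rng.gauss(0.0, 0.5) for x in soil_feature]

mean_r2, sd_r2 = monte_carlo_cv(soil_feature, paddy_yield)
print(f"Monte Carlo CV R^2: mean={mean_r2:.3f}, sd={sd_r2:.3f}")
```

Reporting the spread of the score across random splits, rather than a single split, gives a rough sense of how sensitive a reported accuracy is to the particular partition of the data.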
Acknowledgements
The authors thank VIT management for providing the facility for carrying out this research work.
Funding
The author(s) reported there is no funding associated with the work featured in this article.
Author information
Contributions
KR conceived and planned the present study. AA carried out the experiments. KR contributed to the interpretation of the results. AA wrote the first draft of the manuscript. KR provided critical feedback and helped to draft the manuscript and supervised the entire study. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Antony, A., Karuppasamy, R. Mining of soil data for predicting the paddy productivity by machine learning techniques. Paddy Water Environ 21, 231–242 (2023). https://doi.org/10.1007/s10333-023-00924-y
DOI: https://doi.org/10.1007/s10333-023-00924-y