Selection of Important Features for Optimizing Crop Yield Prediction

Maya Gopal P S (VIT University, Chennai, India) and Bhargavi R (School of Computing Science and Engineering, VIT University, Chennai, India)
DOI: 10.4018/IJAEIS.2019070104


In agriculture, crop yield prediction is critical. Crop yield depends on various features, including geographic, climatic and biological ones. This research article discusses five Feature Selection (FS) algorithms: Sequential Forward FS, Sequential Backward Elimination FS, Correlation-based FS, Random Forest Variable Importance and the Variance Inflation Factor algorithm. Data used for the analysis was drawn from secondary sources of the Tamil Nadu state Agriculture Department and covers a period of 30 years. 75% of the data was used for training and 25% for testing. The performance of the feature selection algorithms is evaluated with Multiple Linear Regression (MLR). The RMSE, MAE, R and RRMSE metrics are calculated for each feature selection algorithm. The adjusted R² was used to find the optimum feature subset. The time complexity of the algorithms was also considered. The selected features are applied to MLR, an Artificial Neural Network and M5Prime. MLR achieves 85% accuracy using the features selected by the SFFS algorithm.
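As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes RMSE, MAE, R, RRMSE and adjusted R² with NumPy. It is not the authors' code. The function name `regression_metrics` is hypothetical, R is assumed to be the Pearson correlation between observed and predicted values, and RRMSE is assumed to be RMSE divided by the mean of the observed values; the article preview does not state either definition.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return common regression metrics for one fitted model.

    n_features is the number of predictors used by the model; it is
    needed only for the adjusted R^2 penalty.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    # Assumption: R is the Pearson correlation of observed vs. predicted.
    r = float(np.corrcoef(y_true, y_pred)[0, 1])
    # Assumption: RRMSE normalizes RMSE by the mean observed value.
    rrmse = rmse / float(np.mean(y_true))

    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"rmse": rmse, "mae": mae, "r": r, "rrmse": rrmse,
            "r2": r2, "adj_r2": adj_r2}
```

Because adjusted R² falls when a predictor adds little explanatory power, it is a natural criterion for comparing feature subsets of different sizes.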
Article Preview

Data mining is a process of discovering previously unknown and potentially interesting patterns in large datasets (Frawley et al., 1991). The data mining process includes defining the problem, understanding the data, preparing the data, applying the right techniques to build the models, interpreting the results and putting the results into action. Nowadays, intelligent data mining and knowledge discovery using artificial neural networks and feature selection algorithms have become important concepts in prediction and modelling (Roddick et al., 2001, Schuize et al., 2005). A data set may contain redundant information that does not directly affect the predictions, and it may contain highly correlated attributes. Including all the attributes therefore typically adds no new information. In data mining, feature selection algorithms are useful for identifying irrelevant attributes to be excluded from the dataset (Che et al., 2017, Kotu et al., 2015). Feature selection in predictive analytics refers to the process of identifying the few most important features or attributes that are essential in building a model for accurate prediction. Efficient predictive models can improve the quality of decision making. Feature selection optimizes the performance of the data mining algorithm and makes it easier for the analyst to interpret the outcome of the modeling. This procedure can not only reduce the cost of recognition by reducing the number of features to be collected, but in some cases it can also provide better classification or prediction accuracy due to finite sample size effects (Jain et al., 1982). This strategy aims at further reducing the number of features.
Adding feature selection to the analytical process has several benefits: it simplifies and narrows the scope of the features that are essential in building a predictive model, and it minimizes computational time and memory requirements (Pal and Foody, 2010). The focus can then be directed to the subset of predictors that are most essential.
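To make the idea of narrowing the predictors concrete, here is a minimal sketch of greedy sequential forward selection that scores each candidate subset by the adjusted R² of an ordinary least squares fit. It is an illustration under stated assumptions, not the procedure used in the article: the function names `adjusted_r2` and `forward_select` are hypothetical, the fit is done on the full data passed in (no train/test split), and the search stops as soon as no remaining feature improves the score.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of a least squares fit of y on X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y):
    """Greedily add the column that most improves adjusted R^2.

    Returns the selected column indices (in the order they were added)
    and the adjusted R^2 of the final subset.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining:
        # Score every subset formed by adding one remaining column.
        candidates = [(adjusted_r2(X[:, selected + [j]], y), j)
                      for j in remaining]
        score, best_j = max(candidates)
        if score <= best_score:
            break  # no remaining column improves the criterion
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = score
    return selected, best_score
```

Sequential backward elimination follows the same pattern in reverse: it starts from all columns and repeatedly drops the one whose removal improves the criterion most.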

Researchers work with different feature selection models to optimize their data sets. Alvarez (2009) automated feature selection for every algorithm using the conventional approach of stepwise regression. Gonzalez-Sanchez et al. (2014) performed an exhaustive search of the feature selection algorithms. H. Liu et al. (1996) proposed a consistency-based feature selection mechanism that evaluates the worth of a subset of attributes by the level of consistency in the class values when the training instances are projected onto that subset. In this model the consistency of any subset can never be lower than that of the full set of attributes. M. Hall (1999) proposed a correlation-based approach to feature selection on different datasets and demonstrated how it can be applied to both classification and regression problems in machine learning. Karimi et al. (2013) presented a hybrid feature selection method that combines a symmetric uncertainty measure and a gain measure. Both measures were first calculated for each feature-class correlation, and the features were then ranked by their average score. Features whose score exceeded a threshold value were selected. They evaluated their system using the knowledge discovery data dataset and the Naïve Bayes algorithm. Chaudhary et al. (2013) used the correlation-based, Gain Ratio and Information Gain methods and presented a performance evaluation of these three feature selection methods with an optimized Naïve Bayes classifier on a mobile device. Zhang et al. (2005) performed a principal components analysis to transform the data and used stepwise feature selection for multiple linear regression (MLR). In most experiments, researchers collect data that are supposedly related to the phenomenon of interest, given resource and/or time constraints on the collection and analysis of data. This oriented collection of data means that such datasets contain only pre-approved features.
Feature selection can enhance model quality by discarding unwanted features, or simply decrease model and computational complexity by keeping only the most important features, as illustrated with an example by Ruß et al. (2010).
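The hybrid ranking attributed to Karimi et al. (2013) above (score each feature with two measures, average them, keep features above a threshold) can be sketched for discrete features as follows. This is only an illustration: the "gain measure" is assumed here to be information gain, which the text does not specify, and the function names `entropy`, `info_gain_and_su` and `hybrid_rank` are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain_and_su(feature, target):
    """Information gain and symmetric uncertainty of a feature vs. the class."""
    h_x = entropy(feature)
    h_y = entropy(target)
    h_xy = entropy(list(zip(feature, target)))
    gain = h_x + h_y - h_xy  # mutual information I(X; Y)
    denom = h_x + h_y
    su = 2.0 * gain / denom if denom > 0 else 0.0
    return gain, su

def hybrid_rank(features, target, threshold):
    """Rank features by the average of the two scores, keep those above threshold.

    features maps a feature name to its list of discrete values.
    Assumption: the two scores are averaged with equal weight.
    """
    scores = {}
    for name, column in features.items():
        gain, su = info_gain_and_su(column, target)
        scores[name] = (gain + su) / 2.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > threshold]
```

Symmetric uncertainty is bounded between 0 and 1, whereas information gain is bounded by the class entropy, so the equal-weight average is most meaningful when the class entropy is close to 1 bit or when the gain is normalized first.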
