Machine Learning-Based Academic Result Prediction System

Students' academic performance is a critical issue, as it shapes their careers. It is pivotal for educational institutes to track performance records, since doing so can help enhance the quality of their education. This motivates an academic result prediction system that uses the semester grade point average (SGPA) as its metric. The proposed work aims to create a model that forecasts the SGPA of students based on certain traits. It predicts the result, in the form of the SGPA of computer science students, by considering their past academic performance along with their study and personal habits during the academic semester, using different machine learning models and comparing them on different accuracy parameters. Models that are widely used and have been found effective in this field include regression algorithms, classification algorithms, and deep learning techniques. The results conclude that deep learning techniques are the most effective in the proposed work because of their high accuracy and performance, depending upon the attributes used in the prediction.


INTRODUCTION
Education is a crucial aspect of the economy, due to which researchers are developing several methods to enhance the performance of students. One way to do so is to track each student's performance. Through research and development in this field, students can benefit in numerous ways. Faculty can give special attention to students whose predicted semester grade point average (SGPA) is low or not up to the mark, which helps both the student and the university, as the overall result percentage will improve. Students, in turn, can track their own performance, improve their study patterns accordingly, and perform well.
Educational data mining is an emerging discipline that explores the distinctive and increasingly sizable data gathered from various educational institutes, using data mining techniques to understand students and the ways in which they learn. By exploring these large datasets with such techniques, unique patterns can be identified that help to study, predict, and improve the academic performance of students.
The COVID-19 outbreak brought unique challenges that were not anticipated earlier, especially in the education system. Due to the pandemic, schools, colleges, and universities were closed. Some universities decided to hold exams in online mode, while others promoted students to the next semester by assigning grades based only on assignment marks and the marks obtained in the previous semester. This cannot be the sole basis for evaluation, since other personal factors also contribute to a student's result.
Secondly, it was difficult for faculty to identify the students at risk of not performing well, given the following variations in students' performance: (i) those who perform well in the beginning but whose performance degrades by the end, (ii) those who perform worse in the beginning but improve by the end, and (iii) those who consistently perform either better or worse. Predicting their SGPA can be very helpful in identifying the students who need more attention and must work harder.
A model is proposed in this work in which the results of students do not depend on a single feature; rather, the features include day-to-day personal habits along with academic habits and past academic performance, so that the future result is predicted as fairly as possible. Various machine learning (ML) techniques (Kedia & Bhushan, 2022; Kholiya et al., 2021; Singh & Bhushan, 2022; Verma et al., 2019) have been used for predicting the result of the student. It is crucial to assess the quality of the data (Bhushan & Goel, 2016; Bhushan et al., 2018; Bhushan et al., 2021) used by ML algorithms. Predicting the SGPA will also help students in planning their academic goals and putting in the effort to improve their results accordingly. It will further help the faculty and the university to maintain a record of the responses submitted by students and hence to better understand and supervise them. The objective of this work is to ensure that each student is monitored and given the guidance he or she needs to improve academically.
The remainder of the paper covers the related literature review, followed by the experiment details along with the results and discussion. Finally, the conclusion and future scope are presented.
According to Sharma and Aggarwal (2021), a dataset of around 400 students was selected and analyzed to check the level of parental influence on academic performance. The attributes taken into account for prediction were family size, parents' education, educational support from the family, internet access at home, paid classes, and semester-wise marks. During the analysis, correlations were found between the attributes, and linear regression (LR) was used as the predictive model. Training was done on 90% of the data, and accuracy was calculated in terms of mean absolute error (MAE) and root mean square error (RMSE), which were 3.155 and 3.76, respectively. Gradient boost and support vector machine (SVM) were also applied, but LR proved to be the best, with an adjusted R-squared of 0.4771.
The analysis of students' performance based on a subset of behavioral and academic parameters was done using feature selection and supervised ML algorithms: logistic regression (LOGR), decision tree (DT), the naïve Bayes (NB) classifier, and ensemble ML algorithms like bagging and boosting (Gajwani & Chakraborty, 2021). Demographic, behavioral, and academic attributes were taken into consideration, including nationality, gender, place of birth, participation in discussion groups, raising hands in classes, using external resources, grade, and semester marks. The dataset of 500 records was taken from Kaggle, which in turn obtained it from a learning management system (LMS). Further, 70% of the records were used for training and 30% for testing. The correlations between attributes were plotted using box plots, and finally the ML algorithms were applied. Gradient boosting achieved the highest accuracy of 75%, followed by the random forest (RF) classifier with an accuracy of 74.31% and LOGR with an accuracy of 73.61%.
Various ML algorithms have been applied to a dataset of university students to predict their performance (Rai et al., 2021). The students were split into 3 classes based on their performance: good, average, and poor. The data was taken from the UCI student performance dataset, which consists of 831 samples with 22 student attributes, comprising study hours, study field, success rate, gender, internal assessment marks, end-semester marks, year-back status, family income, family education, and medium of education. The dataset was pre-processed by handling missing values and label encoding, and then split into train and test sets. Supervised ML algorithms such as the SVM classifier and the RF classifier were used. RF proved to be the best with an accuracy of 94%, while SVM reached 79%. This analysis helped faculty take early action and assist students in the poor and average categories to improve their results.
According to Shetu et al. (2021), the academic result of students studying in a private institution was evaluated based on both environmental attributes and academic status. Three attribute selection methods, i.e., the info gain ranking filter, the gain ratio feature evaluator, and the correlation ranking filter, were used to identify the attributes with the most significant effect on the students' results. The DT algorithm was also used. As a result, about 77% of instances were correctly classified, 22.57% were incorrectly classified, and the MAE came out to be 0.1087.
As per Ganorkar et al. (2020), the dataset was collected from the student database of RKDF Institute of Science and Technology. The prime data mining methodologies applied were association, classification, clustering, and DT. The attributes considered in the dataset were student ID, gender, date of birth, college grade point average (GPA), grade, etc. The objectives of this review paper are to find the factors affecting a student's academic performance and the students' weaknesses along with their strengths, so that low-grade performers could be assisted to achieve better results in their academics.
A comparison was done between data mining methods such as DT C4.5 and k-nearest neighbors (KNN) (Yulianto et al., 2020). The dataset was collected from various sources, such as a questionnaire and information gathered from the literature. After simplifying the received data, the following attributes were considered: student ID, name, religion, gender, school origin, domicile, distance of the student's residence from campus, number of siblings, parents' job, number of vehicles owned by the student's family, scholarship received, study time, and graduation status. The two classification techniques, DT C4.5 and KNN, were applied and then compared to find the highest accuracy, using the hypertext pre-processor (PHP) programming language. It was found that KNN performed better than DT C4.5, with an accuracy of 59.32% against 54.80%.
The artificial neural network (ANN) technique was used to model 11 input variables with 2 layers of hidden neurons and one output layer (Lau et al., 2019). The dataset consisted of 1000 students from a private Chinese institution. The national university entrance exam result and the socio-economic background of the students, along with the previously attained cumulative grade point average (CGPA), were the attributes considered. The Levenberg-Marquardt algorithm was employed as the backpropagation training rule. The accuracy obtained was 84.8%.
The results of students studying in a private university in Bangladesh were evaluated by Rifat et al. (2019). The students' CGPA was predicted by taking the DT algorithm as the base model and using a deep neural network (DNN) approach, with major attributes such as student name, student ID, gender, SGPA, and course grade. The mean square errors (MSE) obtained were 0.0226 for DT and 0.008 for DNN.
Techniques such as regression, DT, gradient boosting, and DNN were used for prediction by Kumar and Garg (2018). The dataset was provided by DIT University, Dehradun, and includes the result data of engineering students (B.Tech. IT, 2017-18 batch). The attributes considered were schooling marks, continuous assessment, and final evaluation. The best accuracy was achieved by the gradient boost model with 98.26%, followed by the neural network with an accuracy of 97.05%.
Existing research works on predicting academic results were reviewed by Dhamija et al. (2017). Class behavior, exam results, class attendance, and extra-curricular activities were the major attributes used in this work. The Waikato environment for knowledge analysis (WEKA) tool, along with many other data mining algorithms, was discussed. In WEKA, the past result is given as input and a predictive model is generated using an appropriate algorithm. By this, the students who need extra attention or have low predicted grades can be identified.
The data was collected from BMSIT & M for the 2014-18 batch (Pushpa et al., 2017). It consists of three features for each subject: internal score, external score, and total score. The final feature shows the status of the result, i.e., whether the student passed or failed the semester. ML-based techniques such as SVM, the NB algorithm, RF, and gradient boosting were used in this work. The accuracies of the models were SVM: 87.5%, NB: 87.5%, RF: 89%, and gradient boosting: 82%.
A review of 25 research works was conducted by Kumar et al. (2017) to understand prediction techniques used in education. It identified and explained the different student attributes used for predicting performance. The major techniques addressed were classification, regression, KNN, DT, NB, and clustering. The attributes included personal, family, institutional, and social factors. KNN and NB produced the best results in most of the works, with accuracy ranging from 90% to 98%.
A dataset from BSMRSTU, a university in Bangladesh, was chosen for result prediction (Sikder et al., 2016). Various techniques, such as data mining, neural networks, the MATLAB tool, and the Levenberg-Marquardt backpropagation algorithm, were used. In total, 14 attributes were considered, including test marks, attendance, lab performance, previous results, social media interaction, and study time. In this work, a student's yearly performance is predicted in the form of CGPA using a neural network and then compared with the real CGPA. The best accuracy attained was 88%, and the average was 74%. Neural networks proved to be the best method when compared with other ML models, having the least error.
According to Halde et al. (2016), the dataset was taken from the students of Thadomal Shahani Engineering College. The techniques used were ANN, LR, NB, DT, and SVM. The experiment was performed on real-time data collected from final-year students. Matriculation and pre-university examination scores, five semester scores, and data on motivation level, information processing ability, and other learning and study skills were taken as input to the model to predict the CGPA. The accuracy without psychological factors was 0.9310, while the accuracy with psychological factors was 0.9999.
A dataset of 300 students at Sacred Hearts Girls High School in Kerala was evaluated using a neural network (multi-layer perceptron (MLP) training with K-fold cross validation) with the help of the WEKA data mining tool and association rule mining (Sebastian & Puthiyidam, 2015). The attributes were related to academic details (interest in study, unit-test marks, attendance, and assignments) and personal attributes (residence, parents' education, family status). K-fold cross validation achieved 91% accuracy and association rule mining achieved 62% accuracy.
Real data history was extracted from UiTM's student information management system (SIMS) in (Arsad & Buniyamin, 2014), where the data included the student identity number, gender, CGPA obtained in previous semesters, the grade point (GP) of all subjects attempted in every semester, and the GP of English courses. The techniques used were ANN and LR. In ANN, three models with different inputs and the same output were developed; their correlation coefficients were 0.9254, 0.9225, and 0.5221, respectively. In LR, for matriculation students, models A and B gave R-squared scores of 0.720 and 0.752 respectively, while model C gave an R-squared score of 0.35, indicating poor correlation. For the diploma students, the R-squared scores were 0.728 and 0.755, while model C scored 0.318.
A dataset from Sofia University was used in (Kabakchieva, 2013). Various classifiers, such as the Bayesian network, NB, DT, and KNN, were used for result prediction. Different attributes were considered, such as age, gender, ethnicity, education, work status, and disability. The work presents the initial results from a data mining research project implemented at a Bulgarian university, aimed at revealing the high potential of data mining applications for university management. The accuracy observed with DT was 66% and with KNN was 60%.
A dataset was collected in 3 phases: personal data, pre-university data, and university data (Kabakchieva, 2012). The attributes were age, gender, previous exam score, entrance exam score, current semester marks, and number of backlogs. The algorithms applied were DT, neural network, and KNN, using the WEKA tool. The highest accuracy achieved was 73.59% by the neural network model, followed by the DT model with an accuracy of 72.74% and KNN with an accuracy of 70.49%.
Table 1 summarizes the existing related work in the area of student result prediction. It lists the attributes of the dataset used in the corresponding work, and the results obtained after applying the relevant techniques are shown in the third column.

PROPOSED WORK
The proposed work predicts the SGPA/result of computer science students using regression, classification, and deep learning (DL) (Nalavade et al., 2020) techniques, based on academic and personal factors. It aims to predict the SGPA of each student from his/her attributes.

Experimental Setup
The setup required to conduct the experiments includes the following hardware: 8 GB of RAM, a 512 GB hard disk, and a 3rd-generation Intel Core i5 processor. The software requirements include Spyder IDE version 4.0.1, a free integrated development environment included in Anaconda Navigator, with Python 3.7.4 as the programming language.

Dataset
The data of 2,115 computer science engineering students studying at DIT University, Dehradun, was collected through a Microsoft form, which is shown in Figure 1. It includes both the academic and personal attributes required to predict the results of the students. The list of attributes considered is presented in Table 2.

The Learning Process
This subsection discusses the various ML techniques that were used to predict the SGPA/result of students.

Regression.
It is a supervised learning algorithm and works with labeled datasets (Sharma & Aggarwal, 2021). It identifies the relationship between dependent and independent variables. The technique is used for predictive modelling in ML and predicts continuous outcomes. In this work, regression is applied to predict the SGPA 4 of a student using various personal factors, such as daily social media interaction and attention during lectures, which were collected through the survey (see Table 2), as well as the continuous academic progress of the student captured by his/her previous SGPAs. The following models were applied:
1. Multiple linear regression (MLR) is a statistical technique that uses more than one independent variable to predict the dependent output variable (Sharma & Aggarwal, 2021). Its aim is to predict and model the linear relationship between the input and output variables.
2. Random forest regression is a supervised ML technique based on ensemble learning (Obata et al., 2021). Ensemble learning aggregates the predictions from several ML models to make more accurate predictions than a single model.
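The two regression models can be sketched with scikit-learn. The snippet below is a minimal illustration on synthetic data; the feature columns (previous SGPAs, social media hours, lecture attention) are hypothetical stand-ins for the survey fields, not the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (hypothetical columns)
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([
    rng.uniform(4, 10, n),    # SGPA 1
    rng.uniform(4, 10, n),    # SGPA 2
    rng.uniform(4, 10, n),    # SGPA 3
    rng.integers(0, 6, n),    # daily social media hours
    rng.integers(1, 6, n),    # attention during lectures (1-5)
])
# Target SGPA 4: loosely driven by past SGPAs and habits, plus noise
y = (0.5 * X[:, 2] + 0.3 * X[:, 1] - 0.1 * X[:, 3]
     + 0.2 * X[:, 4] + rng.normal(0, 0.3, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Multiple linear regression: one linear model over all features
mlr = LinearRegression().fit(X_train, y_train)
# Random forest regression: an ensemble of decision trees
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("MLR R^2:", round(mlr.score(X_test, y_test), 3))
print("RF  R^2:", round(rf.score(X_test, y_test), 3))
```

Because the synthetic target here happens to be linear, MLR scores well on it; on the real survey data, the paper found RF regression to give the lower RMSE.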

Classification.
It is a supervised learning algorithm in which the dataset is divided into classes based on different parameters (Gajwani & Chakraborty, 2021). The model is trained on a training set and accordingly divides the data into different classes. In this work, the previous SGPAs of the students were divided into 3 classes, according to which the model was trained to predict the SGPA 4.
The three classes of SGPA are shown in Table 3. The models applied are as follows:
1. Random forest classification is a supervised ML technique used for classification (Gajwani & Chakraborty, 2021). The algorithm builds decision trees on the data, makes predictions with each, and selects the best solution by majority voting.
2. Logistic regression classification is a supervised ML technique used for classification. It works on predictive modelling, hence the name "regression", but since it is used for classifying samples, it is a classification algorithm (Gajwani & Chakraborty, 2021). It gives probabilistic values ranging from 0 to 1.
3. Decision tree classification is a supervised ML method and a tree-structured classifier (Shetu et al., 2021). It consists of 2 kinds of nodes: decision nodes and leaf nodes. Decisions are made by decision nodes, and the outputs of those decisions are reflected in leaf nodes.
4. Naïve Bayes classification, based on Bayes' theorem, is a supervised ML method used for classification (Pushpa et al., 2017). It is a probabilistic classifier, i.e., it predicts based on the probability of an object.
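The four classifiers can be compared in a few lines with scikit-learn. This is a sketch on synthetic data with illustrative class cut-offs; the actual bands are the ones defined in Table 3:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(4, 10, size=(n, 3))            # three previous SGPAs
sgpa4 = X.mean(axis=1) + rng.normal(0, 0.4, n)

# Bin SGPA 4 into three classes (illustrative cut-offs, not Table 3's)
y = np.digitize(sgpa4, bins=[6.0, 8.0])        # 0 = low, 1 = medium, 2 = high

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}
for name, clf in models.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)   # fraction correctly classified
    print(f"{name}: {acc:.2f}")
```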
Artificial Neural Network. It is a DL technique. Neural networks learn through a process called back-propagation (Lau et al., 2019), using a set of training data that matches known inputs to desired outputs.

ANN with principal component analysis (PCA).
PCA is one of the most widely used unsupervised dimensionality reduction techniques. It reduces the dimension of an N-dimensional dataset by projecting it into an M-dimensional subspace, where M < N (Lee & Jemain, 2021). In this work, PCA is applied to reduce the dimensionality of the data to only 2 columns. After applying PCA, an ANN was used to build and train the model in order to fetch the predictions.
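A minimal sketch of the PCA-then-ANN pipeline, assuming synthetic data; scikit-learn's MLPClassifier stands in here for the TensorFlow network used in the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data: eight noisy views of one latent "ability" factor,
# so the first principal components carry most of the signal
rng = np.random.default_rng(1)
n = 400
latent = rng.uniform(4, 10, n)
X = latent[:, None] + rng.normal(0, 1.0, size=(n, 8))
y = (latent > 7).astype(int)      # binary stand-in for the SGPA classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale, project onto 2 principal components, then train a small neural net
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),                      # reduce the data to 2 columns
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
)
model.fit(X_tr, y_tr)
print("Accuracy:", round(model.score(X_te, y_te), 2))
```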

Results and Discussions
After applying all the aforementioned models, the results were calculated and compared, as shown in this section.
The regression models were evaluated on the basis of MAE, MSE, and RMSE. RMSE is the standard deviation of the residuals, also called prediction errors; residuals measure the distance between the regression line and the data points. A lower RMSE indicates a better model, but an RMSE of zero would denote over-fitting. The RMSE comparison of the different models is shown in Figure 2.
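For concreteness, the three regression metrics can be computed directly; the SGPA values below are illustrative, not the paper's results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([8.2, 7.5, 6.9, 9.1, 7.0])   # actual SGPA 4 (illustrative)
y_pred = np.array([8.0, 7.8, 7.1, 8.7, 6.8])   # model predictions

mae = np.mean(np.abs(y_true - y_pred))    # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)     # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error

# The hand-rolled values match scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
# MAE=0.260  MSE=0.074  RMSE=0.272
```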
The results obtained after applying different regression techniques are summarized in Table 4.
In Figure 2, RF regression shows a lower RMSE than MLR, proving to be the best among the regression models. The classification models and the ANN were evaluated on the basis of accuracy, calculated as the number of correctly classified instances. The accuracy comparison of the different models is represented in Figure 3. The results obtained after applying the different classification techniques and the ANN are summarized in Table 5.
Figure 3 shows that ANN along with PCA has the highest accuracy among all the classification models, proving to be the best among them.

Comparative Analysis
A comparative analysis of the proposed work with the existing works is shown in Table 6. It is observed that the proposed work considers more attributes affecting a student's result than the existing works. Moreover, both regression and classification algorithms, along with ANN, were applied in the proposed work. The proposed work resulted in better accuracy when compared with the existing works.

CONCLUSION AND FUTURE SCOPE
This work identified the factors responsible for a student's academic performance and compared various ML techniques for predicting a student's result. Future challenges include finding more attributes that affect students' results, as well as hyper-tuning the parameters using advanced hyper-parameter tuning techniques, such as Hyperopt, to get better results. Data gathering can be improved, with a focus on collecting a larger number of data points. Exploratory data analysis can be done using libraries like seaborn to visually understand the relationships between different parameters. Various other ML techniques, like boosting algorithms, can also be used. Moreover, in the future, an application can be built for stakeholders (students, faculty, and training and placement officers) to help them evaluate a student's performance and work on improvement.

COMPLIANCE WITH ETHICAL STANDARDS
Libraries and Dataset. NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow are the libraries used in the proposed work.
Data Pre-Processing. It comprises the following steps:
1. Label encoding.
2. Handling missing values.
3. Feature scaling.
4. Splitting the dataset into training and testing sets.
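The four pre-processing steps can be sketched as follows; the column names are hypothetical, not the survey's exact fields:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Tiny illustrative frame (hypothetical columns)
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "study_hours": [3.0, np.nan, 4.5, 2.0, 5.0, np.nan],
    "sgpa3": [7.2, 8.1, 6.5, 7.8, 9.0, 6.9],
    "sgpa4": [7.5, 8.3, 6.8, 7.6, 9.1, 7.0],
})

# 1. Label encoding: map categorical strings to integers
df["gender"] = LabelEncoder().fit_transform(df["gender"])

# 2. Handling missing values: impute with the column mean
df["study_hours"] = df["study_hours"].fillna(df["study_hours"].mean())

# 3. Feature scaling: standardize the input columns
X = StandardScaler().fit_transform(df[["gender", "study_hours", "sgpa3"]])
y = df["sgpa4"].to_numpy()

# 4. Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
print(X_train.shape, X_test.shape)
```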

Figure 1. Survey form

Figure 2. Comparison of RMSE