1. Introduction
Peer-to-peer lending, also known as crowd lending, lets lenders and borrowers communicate directly without financial institutions acting as middlemen (Ma and Wang 2016). The social lending market is growing rapidly because loans are disbursed quickly and with less paperwork than in the traditional loan system (Guo et al., 2016). Borrowers with good credit can obtain loans at lower interest rates, and lenders can also earn higher profits (Malekipirbazari and Aksakalli, 2015). Borrowers with average or lower credit scores can reach lenders who are willing to provide loans at high interest rates (Serrano-Cinca and Gutiérrez-Nieto 2016).
There are several security concerns associated with the peer-to-peer (P2P) lending market. Because no intermediary firm verifies the authenticity of borrowers and lenders, transactions can be risky for both parties (Malekipirbazari and Aksakalli, 2015). Loans disbursed through P2P lending platforms carry a risk of default and delayed repayment. Borrowers and lenders interact on a common platform at various risk levels. Higher risk offers better returns but also increases the likelihood of default. Therefore, accurately predicting the credibility of borrowers is a crucial issue in social lending. Statistical and machine learning models can be used to predict probable defaulters. Because they tend to produce better results, machine learning models have already outperformed statistical models. Even so, more accurate and specialized machine learning models are required to deal with such issues in P2P lending.
Several issues are associated with the P2P lending market, such as long processing times, limited lending capital, and the lack of legal status in some jurisdictions. In the conventional loan disbursement process, banks deal directly with borrowers and assess their credibility, whereas in the P2P lending market, lenders and borrowers communicate directly with each other on the social lending platform. P2P lending datasets are inherently imbalanced: the ratio of safe borrowers to defaulters is very high. Machine learning models trained on such datasets achieve good overall accuracy, but that accuracy is biased towards the major class (safe borrowers), and the models misclassify most minor class samples. Therefore, traditional machine learning models for risk assessment do not give meaningful results when predicting potential defaulters among borrowers.
In imbalanced datasets, the ratio of the number of major class samples to the number of minor class samples is significantly high. Machine learning models built on such datasets are biased towards the majority class. Therefore, precision on the major class is high and recall on the minor class is low. Such models cannot predict defaulters well (Lin et al., 2017; Xia & Liu 2017; Yijing et al. 2016). Although machine learning models trained on imbalanced datasets provide promising overall predictive accuracy, the minor class receives less training than the major class, so the ratio of predictive accuracy on minor class samples to that on major class samples is not close to one.
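A small worked example may make this bias concrete. The counts below are invented for illustration only (900 safe borrowers, 100 defaulters) and do not come from any dataset discussed in this paper. The helper name `class_metrics` is likewise hypothetical.

```python
def class_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute metrics with defaulters as the positive (minor) class.

    tp: defaulters predicted as defaulters
    fn: defaulters predicted as safe
    fp: safe borrowers predicted as defaulters
    tn: safe borrowers predicted as safe
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "major_precision": tn / (tn + fn),  # of all "safe" predictions, how many are right
        "minor_recall": tp / (tp + fn),     # of all true defaulters, how many are caught
    }


# Hypothetical confusion counts for a model biased towards the major class:
# it catches only 20 of 100 defaulters but labels 890 of 900 safe borrowers correctly.
m = class_metrics(tp=20, fp=10, fn=80, tn=890)
print(m)
```

With these assumed counts, overall accuracy is 0.91 and precision on the major class is about 0.92, yet recall on the minor class is only 0.20, so four out of five defaulters go undetected.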
Therefore, to build machine learning models that provide high precision on major class samples and high recall on minor class samples, a novel undersampling approach, SCCSDNN (Spectral Clustering and Cost Sensitive Deep Neural Network based Undersampling), is proposed. SCCSDNN combines spectral clustering with a cost-sensitive deep neural network. Spectral clustering is first applied to the major class samples to obtain K clusters. Each of the K clusters is then concatenated with the minor class samples, giving K different datasets. Finally, a cost-sensitive deep neural network model is built on each of the K datasets, and the dataset with the highest precision on the major class and the highest recall on the minor class is chosen as the undersampled dataset.
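The candidate-construction and selection loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the clustering step and the model evaluation are passed in as functions, and `toy_cluster` and `toy_score` are hypothetical stand-ins (they are neither spectral clustering nor a cost-sensitive deep neural network). Ranking candidates by the sum of the two metrics is also an assumption, since the text does not state how the two criteria are combined.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, ...]
Labeled = Tuple[Sample, int]  # label 0 = major class, 1 = minor class


def build_candidates(
    major: Sequence[Sample],
    minor: Sequence[Sample],
    cluster_fn: Callable[[Sequence[Sample], int], List[int]],
    k: int,
) -> List[List[Labeled]]:
    """Split the major class into k clusters and pair each with all minor samples."""
    assignments = cluster_fn(major, k)
    candidates = []
    for c in range(k):
        members = [x for x, a in zip(major, assignments) if a == c]
        candidates.append([(x, 0) for x in members] + [(x, 1) for x in minor])
    return candidates


def select_candidate(
    candidates: Sequence[List[Labeled]],
    score_fn: Callable[[List[Labeled]], Tuple[float, float]],
) -> int:
    """Return the index of the best-scoring candidate dataset.

    score_fn returns (major class precision, minor class recall).
    Ranking by their sum is an assumption made for this sketch.
    """
    totals = [sum(score_fn(ds)) for ds in candidates]
    return totals.index(max(totals))


# Hypothetical stand-ins so the sketch runs without any ML library.
def toy_cluster(samples: Sequence[Sample], k: int) -> List[int]:
    # NOT spectral clustering: buckets samples by the integer part of feature 0.
    return [min(int(x[0]), k - 1) for x in samples]


def toy_score(dataset: List[Labeled]) -> Tuple[float, float]:
    # NOT a neural network: pretends that balanced candidates score higher.
    n_major = sum(1 for _, y in dataset if y == 0)
    n_minor = len(dataset) - n_major
    balance = min(n_major, n_minor) / max(n_major, n_minor)
    return balance, balance


major = [(0.1,), (0.4,), (0.7,), (1.2,), (1.5,), (2.3,)]
minor = [(0.9,), (1.8,)]
candidates = build_candidates(major, minor, toy_cluster, k=3)
best = select_candidate(candidates, toy_score)
print(best, len(candidates[best]))
```

In a real pipeline, `cluster_fn` would wrap a spectral clustering routine and `score_fn` would train and evaluate a cost-sensitive network on each candidate dataset. Keeping both as parameters leaves the selection logic independent of any particular library.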
The remainder of the paper is structured as follows. Section 2 discusses credit risk in social lending along with several resampling algorithms proposed to deal with class imbalance. Section 3 describes various quantitative and qualitative attributes of the LendingClub dataset. Section 4 presents the proposed undersampling algorithm (SCCSDNN). Section 5 explores the experimental results along with the evaluation parameters. The final section concludes the paper and outlines future directions.