Combining Clustering and Factor Analysis as Complementary Techniques

Lakshmi Prayaga (Department of Information Technology, University of West Florida, USA), Krishna Devulapalli (Indian Institute of Chemical Technology, India) and Chandra Prayaga (Physics Department, University of West Florida, USA)
Copyright: © 2020 |Pages: 10
DOI: 10.4018/IJDA.2020070104


The study of driver behavior and associated accidents has long been of interest to researchers and insurance companies. For insurers, identifying the factors that contribute to traffic violations is central to providing quotes, because it establishes the basis for charging customers appropriate rates. This study assesses traffic violation intensity for 64 counties in the state of Florida, USA, using a publicly available traffic violations data set. The data set consists of 3,669,796 records with 11 attributes, including race, gender, driver's age, and type of driving violation. The 187 types of traffic violations are grouped into 11 broad categories. Two machine learning techniques, factor analysis and k-means clustering, were applied. Factor analysis was used to develop a new comprehensive traffic violation index (TVI) that quantifies the traffic violation intensity of each county. All counties in the data set were ranked by TVI score, and the counties with high scores were identified. The k-means clustering algorithm was then applied to the same data, yielding four clusters of counties. The counties in each cluster were compared against the TVI scores to check whether counties grouped together had similar scores. The counties with the highest TVI scores fell into one cluster, the counties with the next-highest scores into a second cluster, and so on. Thus, the results of the two models match exactly. The two techniques are complementary: k-means clustering groups counties with comparable traffic violation intensities, while factor analysis additionally ranks individual counties by TVI. Together they identify the counties with high traffic violation intensities, which can help policymakers take adequate measures for traffic management.
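The comparison described above, checking whether counties grouped by k-means also sit together in the index ranking, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the six county names and their two violation rates are invented, the index is a plain sum standing in for the TVI, the farthest-first initialization is chosen only to keep the run deterministic, and the helper names (`kmeans`, `clusters_respect_ranking`) are hypothetical. It is not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def clusters_respect_ranking(scores, labels):
    """True if the score ranges of different clusters do not overlap,
    i.e. every cluster occupies a contiguous band of the ranking."""
    ranges = {}
    for s, lab in zip(scores, labels):
        lo, hi = ranges.get(lab, (s, s))
        ranges[lab] = (min(lo, s), max(hi, s))
    ordered = sorted(ranges.values())
    return all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


# Hypothetical counties with two invented violation rates each.
counties = {
    "A": (1.0, 1.2), "B": (1.1, 0.9),
    "C": (5.0, 5.2), "D": (5.3, 4.9),
    "E": (9.8, 10.1), "F": (10.2, 9.7),
}
points = list(counties.values())
labels = kmeans(points, k=3)
scores = [sum(p) for p in points]  # placeholder index, NOT the paper's TVI
consistent = clusters_respect_ranking(scores, labels)
print(dict(zip(counties, labels)), consistent)
```

When the score bands of the clusters do not overlap, the clustering and the ranking tell the same story. Overlapping bands would indicate that the two methods disagree about at least one county.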
Literature Review

Researchers have observed that driving habits such as speeding, distracted driving, and failing to maintain an appropriate distance between vehicles, together with kinetic factors, contribute to many road accidents (Jahangiri, 2015; Jahangiri, 2016). Identifying such behaviors helps in the design of advanced driving assistance systems and in training for safe driving. Prior research has applied several machine learning algorithms to model driving violations and to predict outcomes such as speeding violations, future driving risk, and motorcycle crashes (Cheng, 2019; Wang, 2019; Wahab, 2019). Zeyang Cheng et al. (2019) observed that speeding violations had become a key concern in traffic safety management because they increase both the risk and the severity of traffic crashes; they developed a decision tree method to predict speeding violations. Chen Wang et al. (2019) studied seven years of crash and violation data and applied four machine learning models, namely random forest (RF), AdaBoost with a decision tree, gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost), to predict the future driving risk of crash-involved drivers. Wahab and Jiang (2019) applied three machine learning techniques, AdaBoost with a decision tree, gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost), to predict motorcycle crashes in Ghana.

Most of these studies examined only one type of traffic violation or outcome at a time, such as speeding violations or motorcycle crashes, and applied various machine learning techniques for prediction. However, at the county or state level, policymakers need a comprehensive traffic violation index, derived from all types of driving violations, to make policy decisions. Previous studies have developed such comprehensive indexes in other fields to rank individual observations. Chandra Sekhar et al. (1991) developed an index of need for health resources for states in India using factor analysis. Krishna and Reddy (1994) applied factor analysis to develop a comprehensive coal index from physicochemical property data for various Indian coals and ranked the coals according to that index. Vijaya Krishnan (2010) developed a socioeconomic index using principal component and factor analysis on 2006 Canadian census data for the province of Alberta.

Concerning traffic violations, Khaled Shaaban (2012) made a comparative study of road traffic rules in Qatar and Western countries. The main purpose of that study was to provide comparisons with major Western countries and to suggest guidance for developing and implementing driving policies in Qatar. It examined factors such as driving age, seat belt laws, and driving under the influence, and compared the traffic laws of the state of Florida, the United Kingdom, and Qatar.

In the current study, factor analysis is used to develop a comprehensive Traffic Violation Index (TVI) for traffic violations data from the 64 counties of the state of Florida (Pierson et al., 2019). The counties are then ranked in descending order of the TVI scores, which helps in quantifying the traffic violation intensity of each county in comparison with other counties.
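As a rough illustration of how a single composite index can be derived from several violation categories and then used to rank counties, the sketch below standardizes each category, finds the leading eigenvector of the correlation matrix by power iteration, and scores each county as the weighted sum of its standardized values. This is a one-factor, principal-component-style stand-in, not necessarily the factor analysis procedure used in the study. The county names, the three category counts, and the function names are all hypothetical.

```python
import math


def standardize(rows):
    """Convert each column to z-scores (sample standard deviation)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix computed from already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenvector(matrix, iters=500):
    """Power iteration: repeatedly multiply and renormalize."""
    v = [1.0] * len(matrix)
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def index_scores(rows):
    """Composite score per row: standardized values weighted by the
    leading eigenvector of the correlation matrix."""
    z = standardize(rows)
    weights = leading_eigenvector(correlation_matrix(z))
    return [sum(w * x for w, x in zip(weights, r)) for r in z]


# Invented counts for three hypothetical violation categories per county.
data = {
    "County A": (12, 30, 5),
    "County B": (40, 80, 20),
    "County C": (25, 55, 11),
    "County D": (8, 22, 3),
}
scores = index_scores(list(data.values()))
# Rank counties in descending order of the composite score.
ranking = [name for name, _ in sorted(zip(data, scores), key=lambda t: -t[1])]
print(ranking)
```

Because the columns are standardized first, no single category dominates the index merely because its raw counts are larger. A full factor analysis could retain several factors and combine them, which this single-component sketch does not attempt.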
