Guidelines for Detecting Cyberbullying in Social Media Data Through Text Analysis

The intensive use of the internet comes with negative and positive effects. Cyberbullying is one of the negative effects of using the internet. Cyberbullying has a negative effect on the victims emotionally, academically, and psychologically. Cyberbullying detection tools can help in reducing or eliminating cyberbullying on social media platforms. The aim of the study was to identify the elements that drive cyberbullying and build classification models to determine whether social media textual information contains cyberbullying text or not. The research aim was achieved through a mixed methods research design, containing qualitative and quantitative elements. The drivers of cyberbullying were identified through a literature review. These included age, gender, family structure, parental education, race, technology, anonymity, academic achievement, and awareness of cyber safety. The support vector machines and naïve Bayes models were used to classify the text dataset (Formspring.me dataset), with a 72.81% and a 99.87% classification accuracy, respectively.


INTRODUCTION
The intensive use of technology has increased the number of teenagers with access to the internet through personal smartphones and computers (Rizza & Pereira, 2013). In this digital era, technology offers opportunities and benefits to teenagers, although there are also problems associated with its use (Rizza & Pereira, 2013). In a study conducted by Rizza and Pereira (2013), 80% of the participating teenagers were affected by cyberbullying.
Most social media companies restrict the use of their platforms to users above the age of 12 (The Children's Society, 2018). However, a survey carried out by The Children's Society in 2018 revealed that 61% of young social media users opened their first account at the age of 12 or younger. The survey also indicated that young users spend at least three hours per day on social media platforms.

BACKGROUND
Cyberbullying is a term that was defined by Bill Besley, a Canadian educator (Macaulay et al., 2018). Besley (2005) states that cyberbullying mainly involves the use of technology to make deliberate attacks; it involves persistent bad behaviour intended to harm others. The bullies inflict pain on other individuals or a group of people by using technology.
The participant roles include the victim, the bully, the assistants, the reinforcers of the bully, the defenders of the victim, and the outsiders who watch from a distance (Hee et al., 2018). Intention, repetition and power imbalance are three criteria that describe bullying. Intention means that the bully deliberately hurts the victim; repetition refers to the frequency of the bullying instances; and power imbalance refers to the vulnerability of the victim, in that the bully is more powerful than the victim (Hee et al., 2018). Technological skills, anonymity and the failure of the victim to escape the bully determine the bully's power over the victim in online interactions (Görzig & Ólafsson, 2012; Hee et al., 2018).
Social media companies enforce cyberbullying policies to address cyberbullying incidents on their platforms, through the use of software, human or automated systems, and geofencing (Milosevic, 2016). Violation of these policies leads to blocking of the user and removal of the content from the platform, and in some circumstances the case is reported to the relevant authorities (Milosevic, 2016).
Various studies of cyberbullying have focused on the prevalence of cyberbullying, the characteristics of the victims and the perpetrators, and the negative impact of cyberbullying (Na, 2013). Various studies have applied support vector machine (SVM) and naïve Bayes classifiers; however, none of the studies reviewed compared the performance of SVM and naïve Bayes classifiers trained on the Formspring.me dataset. The Formspring.me dataset was originally extracted from a platform where users were allowed to send questions to any user's page anonymously, which encouraged cyberbullying (Ptaszynski et al., 2018). Therefore, the scope of this study was the identification and application of two text classifiers, namely the SVM and naïve Bayes classifiers. These two cyberbullying detection models were chosen for their performance and their ability to handle text information.
Therefore, aligned to the purpose of this study, the objectives guiding this study are to identify the key drivers of cyberbullying and to consider the application of text classification algorithms for the detection of cyberbullying. This includes building SVM and naïve Bayes classifiers and evaluating the performance of these cyberbullying detection models.
Harassment and bullying have always been a problem in learning institutions such as schools and colleges (Bennet, n.d.). They have been extended to social media platforms through the intensive use of the internet (Bretscheider, Wohner & Peters, 2014). Online harassment or cyberbullying is the act of sending offensive messages via electronic media to victims to cause psychological harm (Li, 2008). Hamal (2017) also states that cyberbullying can affect mental health. These effects include thinking about suicide, depression, suffering from headaches, devastation, low performance at school, and other psychological effects (Hamal, 2017).
According to Hee et al. (2018), national and international initiatives are put in place for the safety of children online. Some of these initiatives are educative, since they educate and inform on online safety. Despite the measures taken to curb social threats, undesirable and harmful materials remain online. Online communication and relationships have a great effect nowadays, especially on children and teenagers, as they go online more than any other age group.
The study used a dataset from one platform to train a model applicable across all social media platforms. The assumption is that the posting behaviour and the nature of the text messages in the Formspring.me dataset are consistent across all social media platforms. The study also assumed that the labels assigned in the training dataset are correct, and not subject to the views of the data annotators who labelled the dataset.
A key challenge in carrying out studies on cyberbullying is the availability of data on public platforms. The Formspring.me dataset contains cyberbullying text, but the classes of the dataset are imbalanced. Mechanisms to address imbalance include oversampling and undersampling. The posting behaviour, rules and policies vary for each platform. The classification models are only applicable to English text, since the models were trained using an English corpus.
In the rest of this section we define cyberbullying, discuss issues related to cyberbullying, conduct a literature review on drivers of cyberbullying, and review methodologies applied in the past to detect cyberbullying text.

The Cyberbullying Phenomenon
There are ways in which cyberbullying differs from face-to-face bullying. There is no face-to-face contact in online interaction, so one cannot see facial expressions (Hee et al., 2018; Notar, Padgett, & Roden, 2013). The recipient of an online message could easily misread it and consequently feel offended or ridiculed. Notar, Padgett, and Roden (2013) state that traditional bullying and cyberbullying are the same, except that cyberbullying involves the use of technology. Cyberbullying takes place anywhere, even beyond the school premises, yet it affects school performance (Froese-Germain, 2008).
Cyberbullying is characterised by sending or posting harmful messages over the internet, using digital gadgets. Cyberbullying can occur on social networking platforms, or by using emails and text messages.
Cyberbullies may be female or male, and they are usually older than their victims. They are more likely to be frequent internet users (Görzig & Ólafsson, 2012). Notar, Padgett and Roden (2013) state that a cyberbully does not have to be strong and fast. Anyone can be a cyberbully; the bully just needs access to devices to carry out cyberbullying. Cyberbullies usually collude, which hinders identification of the actual cyberbully.
Compared to their peers, victims of cyberbullying experience more depression, stress, loneliness and low self-esteem (Aune, 2009; Pawar, 2019). Cyberbullying also leads to emotional and physiological harm, especially to defenceless victims. The victims may end up taking alcohol or smoking and getting depressed. Like the victims of bullying, cyberbullying victims experience failure to excel in their studies, school failure, anxiety and significant emotional damage (Froese-Germain, 2008; Görzig & Ólafsson, 2012; Pawar, 2019).
Victims of cyberbullying are likely to suffer great psychological damage, since the hurtful messages remain available online, and it is difficult to remove information posted on the web (Feinberg & Robey, 2009). In most cases, children are reluctant to inform their parents of the abuse, since they are emotionally traumatised. They might think that it is their fault, or they may fear restriction of their internet use.
The most common media that cause cyberbullying are electronic mail, instant messaging, chat rooms, text messaging and social media (Peled, 2019).
Electronic mail is a means of sending a message from one sender to at least one recipient.

Instant messaging offers instant transmission of messages between two parties online.
Chat rooms involve interaction with users that one is not familiar with, on a common subject or topic online.
Text messaging is the sending of messages between two or more people using mobile phones (Peled, 2019).
Social media allow people to create social interactions and relationships with people that have common interests, activities and real-life connections (Peled, 2019).
Web sites offer a space for personal, commercial and government purposes (Peled, 2019).
According to Willard (2006) and Peled (2019) the categories for cyberbullying are as follows:
Flaming is the category that involves the use of harsh and vulgar words.
Harassment and cyberstalking involve the act of repeatedly sending cruel and threatening messages.
Denigration involves the act of sending gossip and rumours about an individual or group, thereby damaging their reputation.
Impersonation involves the act of getting into someone's email and sending vicious and embarrassing messages while using their account.
Outing and trickery involve interacting with someone, tricking them to reveal sensitive information and distributing that sensitive information to other people (Peled, 2019).
Masquerading is when the cyberbully fakes their identity and sends malicious messages that are threatening and harmful on a social media platform (Peled, 2019).
Exclusion is an act of excluding someone from an online group.
The drivers of cyberbullying that were identified include age, gender of the role players, the family structure, the level of education of the parents, race, use of technology, academic achievement, and awareness of cyber safety (Hamal, 2017). Some of the studies reviewed are tabulated below showing extracts relevant to the research topic, the driver, as well as a driver summary.

Methodological Approaches on Detection of Cyberbullying
Literature was reviewed on methodologies applied for cyberbullying detection by other researchers. Table 2 is a summary of some of the cyberbullying detection studies conducted by other researchers. The table presents the topic of each paper and its methodology.
Previous studies applied both qualitative and quantitative techniques to detect cyberbullying. The naïve Bayes and SVM algorithms were applied in most studies, but on different datasets that were extracted from other social media platforms. No study from the reviewed literature trained the SVM and naïve Bayes classifiers on the Formspring.me dataset to detect cyberbullying.
Most of the empirical studies conducted concentrated on demographic risk factors affecting cyberbullying. Past research studies applied SVM and naïve Bayes models to datasets other than the Formspring.me dataset. Twitter datasets were used for model training in most studies; however, Twitter restricts the number of characters that can be posted on the platform, hence users normally post only the most crucial information.
Cyberbullying identification from social media data is more objective than the use of qualitative techniques, for example questionnaires. Most researchers have focused on descriptive factors such as personality, social relationships and psychological factors in trying to understand the effects of cyberbullying.

METHODOLOGY
A pragmatic research philosophy was adopted through the mixed methods research approach, which incorporates elements of qualitative and quantitative research methods (Creswell et al., 2007; Creswell, 2009).

Table 1. Drivers of cyberbullying identified in the literature (driver; studies reviewed; driver summary)

Driver: [label missing]
Studies: "Risk factors of cyberbullying among Finnish adolescents and its effects on their health" (Hamal, 2017); "The experience of victimization as the result of cyberbullying among college students" (Poole, 2017)

Driver: Gender
Studies: "The experience of victimization as the result of cyberbullying among college students" (Poole, 2017); "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007); "Cyberbullying, University of Wisconsin Stout" (Aune, 2009); "Psychological, physical, and academic correlates of cyberbullying and traditional bullying" (Kowalski & Limber, 2013); "Risk factors of cyberbullying among Finnish adolescents and its effects on their health" (Hamal, 2017); "The experience of victimization as the result of cyberbullying among college students" (Poole, 2017)
Summary: Girls are more likely to experience cyberbullying as victims; boys are more involved as bullies and victims.

Driver: Family structure
Studies: "Cyberbullying from a socio-ecological perspective: A contemporary synthesis of findings from EU Kids Online" (Gorzig & Machackova, 2015); "Risk factors of cyberbullying among Finnish adolescents and its effects on their health" (Hamal, 2017)
Summary: The home environment of a child is a factor in cyberbullying. Family breakup is likely to cause involvement of a child in cyberbullying.

Driver: Race
Studies: "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007); "The experience of victimization as the result of cyberbullying among college students" (Poole, 2017)
Summary: Children who bully usually come from families with low income and low education.

Driver: Technology
Studies: "Cyber bullying an old problem in a new guise?" (Campbell, 2005); "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007); "Cyberbullying from a socio-ecological perspective: A contemporary synthesis of findings from EU Kids Online" (Gorzig & Machackova, 2015)
Summary: Limited access to technology reduces cyberbullying and access to technology intensifies cyberbullying.

Driver: Anonymity
Studies: "Cyberbullying: A Review of the Literature. Universal Journal of Educational Research" (Notar, Padgett, & Roden, 2013); "Anonymity and roles associated with aggressive posts in an online forum" (Moore et al., 2012); "The experience of victimization as the result of cyberbullying among college students" (Poole, 2017)
Summary: Anonymity creates instances where young people become wild and behave in a way they never would when they are offline.

Driver: Traditional bullying
Studies: "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007); "The overlap between cyberbullying and traditional bullying" (Waasdorp & Bradshaw, 2015)
Summary: Students involved in traditional bullying find it much easier to practise cyberbullying.

Driver: Academic achievement
Studies: "Cyber bullying an old problem in a new guise?" (Campbell, 2005); "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007); "Psychological, physical, and academic correlates of cyberbullying and traditional bullying" (Kowalski & Limber, 2012)
Summary: Cyber victims and bullies are likely to perform poorly at school.

Driver: Awareness of cyber safety
Studies: "Cyber bullying an old problem in a new guise?" (Campbell, 2005); "Cyberbullying in schools: An examination of preservice teachers' perception" (Li, 2007)
Summary: Cyberbullying is a universal problem, but it varies from one country to another. Students from different countries and different cultures differ in terms of cyberbullying.

Qualitative methods can be used as a foundation for quantitative research by providing explanatory work; hence, these methods are dependent on each other (De Villiers, 2005). Qualitative data analysis creates categories within the words or images used by people (Oates, 2006). The sources of qualitative data include documents and texts (Bowen, 2009).
Quantitative research is characterised by data that can be analysed statistically. These datasets are usually secondary datasets (Oates, 2006). This method seeks explanations and predictions that are applicable to other cases (Creswell et al., 2007).
Mixed methods research applies both qualitative and quantitative research methods in data gathering and analysis (Creswell et al., 2007). Explanatory sequential mixed methods were used for this study (Creswell, 2009).
We started with qualitative research: a literature review and the use of a qualitative dataset. The qualitative data was then used to build the second phase, namely building text classification models that are quantitative in nature. The research strategy applied for data analysis purposes is the experiment, which is a scientific method whereby knowledge is discovered by controlled empirical means (De Villiers, 2005).
We identified the key drivers of cyberbullying through document analysis. Document analysis involves the review of printed and electronic material (Bowen, 2009). The documents analysed to identify the drivers of cyberbullying and the methodological approaches used in past cyberbullying research studies included books, journals, articles, theses and various public records. Digital/online data was used in training the classification models and identifying the best model. This is a type of pragmatic data, mainly comprising datasets generated through the use of machines and hand-held devices. Pragmatic data covers communication where text data or messages are generated through communication on social media (Jucker, 2018).
The quantitative component involved the use of mathematical modelling and controlled experiments with SVM and naïve Bayes text classification models. Text classification is the process of classifying text information into two or more classes, based on its content (Kobayashi et al., 2017). The text classification process consists of six steps: text preprocessing, normalisation or transformation, dimensionality reduction, application of classification techniques, model evaluation, and validation (Kobayashi et al., 2017). A classification method predicts the category to which a text message belongs (Lantz, 2013). The controlled experiments were conducted by applying supervised machine learning algorithms. Support vector machines are rooted in statistical learning theory (Shang et al., 2016). The SVM is a supervised classifier that separates the feature space into two classes based on the features of the observations (Lantz, 2013). Naïve Bayes uses probability to classify the elements of a dataset. The algorithm uses the occurrence of past events to estimate the occurrence of future events; it therefore relies on conditional probability and follows from Bayes' theorem (Sarkar et al., 2014).

Table 2. Summary of cyberbullying detection studies (study; data and methodology)
[title missing] (Görzig & Ólafsson, 2012): survey data; logistic regression
"Improved cyberbullying detection using gender information" (Davdar et al., 2012): YouTube comments
"Twitter bullying detection" (Sanchez & Kumar, 2011): naïve Bayes; Twitter data
"Twitter sentiment classification using distant supervision" (Sanchez & Kumar, 2009): naïve Bayes, maximum entropy and SVM; Twitter data
We made use of the Formspring.me dataset from Kaggle to train our models. This dataset contains 16 163 text messages and 14 attributes. The Formspring.me dataset was originally extracted from a platform where users posted questions anonymously to any user's page, which encouraged cyberbullying (Ptaszynski et al., 2018). To address class imbalance in the extract, examples from the minority class were sampled with replacement and added to the training dataset. The dataset was essential for this study, because the platform was populated by teenagers and students, and it contains cyberbullying text (Reynolds, Edwards & Edwards, 2011).
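The oversampling step described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study. The function name `oversample_minority`, the toy messages and the choice to balance the classes to parity are all assumptions, since the text does not state the target class ratio.

```python
import random
from collections import Counter

def oversample_minority(rows, label_of, seed=0):
    """Add randomly chosen minority-class rows (sampled with replacement)
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # random.choices samples with replacement; k is 0 for the largest class
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (text, label) pairs, not taken from Formspring.me
train = [("you are great", 0), ("nice pic", 0),
         ("see you later", 0), ("nobody likes you", 1)]
balanced = oversample_minority(train, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
```

Oversampling is applied to the training partition only, as the text describes, so that duplicated messages do not appear in the data used for evaluation.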
In this study, model evaluation concerned the assessment of whether the model can classify cyberbullying text and non-cyberbullying text. The classification performance of the two algorithms was measured using the confusion matrix parameters, which compare the predicted values with the true values on the test dataset, together with the hold-out method and cross validation. The comparison of the performance of the SVM and naïve Bayes classifiers was based on classification accuracy (Al-Garad, 2019; Pawar, 2019). The rest of the section is a discussion of the experimental set-up of the study.
The R programming language was used for this study, since the Formspring.me dataset is large. A benefit of using R for text classification is that it offers suitable packages for text analysis and classification.
Social media datasets are noisy, hence there was need to remove unwanted information.
Data preprocessing was performed through normalisation, removal of unwanted characters and tokenisation. Normalisation is a process of transforming words into a more uniform construction (Benoit et al., 2017). It involves transformation of words from uppercase to lowercase and stemming of words. Removal of unwanted characters involved removing numbers, punctuation and stop words, and stripping white space. The tm_map function was used to eliminate the unwanted characters.
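The preprocessing steps listed above can be sketched as follows. This is an illustrative Python sketch rather than the tm_map calls used in the study; the `preprocess` function and the short `STOP_WORDS` set are assumptions, and stemming is left out.

```python
import re
import string

# Tiny illustrative stop-word list; an assumption, not the list used in the study
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "so"}

def preprocess(text, stop_words=STOP_WORDS):
    text = text.lower()                                                # uppercase to lowercase
    text = re.sub(r"\d+", " ", text)                                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = text.split()                                              # tokenise, stripping white space
    return [t for t in tokens if t not in stop_words]                  # remove stop words

tokens = preprocess("You are SO annoying!!! Nobody likes you, 100%.")
```

For the example message, the remaining tokens are "annoying", "nobody" and "likes".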
The words were stemmed to reduce the vocabulary, so that variants of the same word are counted as a single term after tokenisation. Stemming was performed on the preprocessed data, that is, the dataset that had undergone the above-mentioned transformations. The stemDocument function was used to reduce each word to its root.
After the data preprocessing stage, the corpus was transformed to a document term matrix (DTM) using the R DocumentTermMatrix() function. The DTM portrays the relationship between words and documents. Not all words are equally important when it comes to text analytics, so the less frequent words were removed from the DTM.
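A document term matrix with rare-word removal can be sketched as follows. This Python sketch only illustrates the idea behind the R DocumentTermMatrix() step. The `build_dtm` function and the `min_doc_freq` threshold are assumptions, as the text does not state the cut-off used for rare words.

```python
from collections import Counter

def build_dtm(docs, min_doc_freq=2):
    """docs: list of token lists. Returns (vocabulary, rows), where each row
    holds the term counts of one document, in vocabulary order."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))      # count each term once per document
    # Keep only terms that occur in at least min_doc_freq documents
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for tokens in docs:
        row = [0] * len(vocab)
        for t in tokens:
            if t in index:
                row[index[t]] += 1
        rows.append(row)
    return vocab, rows

# Hypothetical stemmed documents
docs = [["nobody", "like"], ["like", "pic"], ["nobody", "like", "like"]]
vocab, dtm = build_dtm(docs)
```

Each row of the matrix is one document and each column one retained term; "pic" occurs in a single document and is therefore dropped.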
The ggplot2 R library was used to create data visualisations. The Formspring.me dataset was partitioned into training and test datasets. These datasets were then used to train and test the naïve Bayes and SVM cyberbullying detection models. The test dataset was used for checking the accuracy and behaviour of the models.
The kernlab library was used for training the SVM model. The SVM kernel used for the model is radial. The caret package was used for creating the confusion matrix, which displays numerous performance parameters. The svm() function, which is in the kernlab library, trained the model. The predict() function was used to generate predictions from the trained cyberbullying detection model.
Naïve Bayes model training was performed using the e1071 library. The dataset was partitioned into two parts for training and evaluation purposes. The naiveBayes() function was used to build the naïve Bayes model.
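The idea behind a naïve Bayes text classifier can be sketched as follows. This Python sketch is not the e1071 naiveBayes() implementation. It assumes the multinomial variant with Laplace smoothing, which the text does not specify, and the class name `MultinomialNB` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal multinomial naive Bayes over token lists (illustrative only)."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {t for d in docs for t in d}
        counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            counts[y].update(d)
        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            # Prior: share of training documents in class c
            self.log_prior[c] = math.log(labels.count(c) / len(labels))
            # Likelihood: smoothed share of each term among the class's tokens
            total = sum(counts[c].values()) + alpha * len(self.vocab)
            self.log_lik[c] = {t: math.log((counts[c][t] + alpha) / total)
                               for t in self.vocab}
        return self

    def predict(self, doc):
        def score(c):
            # Log posterior up to a constant; unseen terms are ignored
            return self.log_prior[c] + sum(self.log_lik[c][t]
                                           for t in doc if t in self.vocab)
        return max(self.classes, key=score)

# Hypothetical toy corpus: 0 = non-cyberbullying, 1 = cyberbullying
docs = [["nice", "pic"], ["love", "song"], ["nice", "song"],
        ["hate", "ugly"], ["ugly", "loser"]]
labels = [0, 0, 0, 1, 1]
model = MultinomialNB().fit(docs, labels)
pred_bully = model.predict(["ugly", "loser"])
pred_ok = model.predict(["nice", "song"])
```

The classifier combines the class prior with per-term conditional probabilities, which is the conditional-probability reasoning the methodology attributes to Bayes' theorem.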
The classification performance of the two algorithms was measured using a confusion matrix, which compares the predicted values with the true values on the test dataset. The evaluation phase also entailed cross validation of the models.
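The confusion matrix and the accuracy derived from it can be sketched as follows. This is an illustrative Python sketch, not the caret output used in the study; the function names and the label vectors are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Share of all predictions that match the true label
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels: 1 = cyberbullying, 0 = non-cyberbullying
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
```

In this toy example six of the eight predictions are correct, giving an accuracy of 0.75.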
The training and test datasets were adjusted in terms of proportions to improve and evaluate the naïve Bayes and SVM classifiers. The 80-20 and 75-25 (training-testing) splits were used for evaluating the prediction accuracy of the models.
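The two hold-out splits can be sketched as follows. This Python sketch assumes a simple random shuffle before splitting; the text does not say whether the study's partitioning was random or stratified, and the `train_test_split` helper is hypothetical.

```python
import random

def train_test_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))                       # stand-in for 100 labelled messages
train_80, test_20 = train_test_split(rows, 0.80)
train_75, test_25 = train_test_split(rows, 0.75)
```

Fixing the seed keeps the partitions reproducible, so that both classifiers can be trained and evaluated on the same rows.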

ANALYSIS OF FINDINGS
We identified the drivers of cyberbullying through literature reviews. In the literature reviews we discovered that age, gender and anonymity are the major drivers of cyberbullying, as these were mentioned most frequently in the cyberbullying studies.
We adopted quantitative modelling methods in classifying the text messages into non-cyberbullying and cyberbullying classes. These included the SVM and naïve Bayes classifiers. The experiments included data processing, model building and comparison of the performance of the classification models.
The original corpus contained 16 163 text documents or rows. Each document or row is a response to the questions posed on the Formspring.me website by various users, and it falls under the column "ans". The "ans1" variable consists of 72.58% non-cyberbullying text and 27.42% cyberbullying text.

Model Building
A 10-fold cross validation was applied to assess the models' performance, and it was repeated six times. The naïve Bayes and SVM models were trained with different parameters to enable identification of the best model for text classification.
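Repeated 10-fold cross validation can be sketched as follows. This Python sketch only generates the row indices for each fold. The `repeated_kfold` generator is hypothetical, and reshuffling before each repetition is an assumption about how the six repetitions differ.

```python
import random

def repeated_kfold(n_rows, k=10, repeats=6, seed=0):
    """Yield (repetition, fold, train_indices, test_indices)."""
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_rows))
        rng.shuffle(order)                    # new random partition per repetition
        for fold in range(k):
            test_idx = order[fold::k]         # every k-th row forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield rep, fold, train_idx, test_idx

splits = list(repeated_kfold(n_rows=50))
```

Each repetition holds every row out exactly once, so ten folds repeated six times give 60 train and test rounds whose scores are averaged.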

Comparison of the Performance of the SVM and Naïve Bayes Models
The naïve Bayes model was the best classification model for cyberbullying detection in this study. The model performed well when 75% of the Formspring.me dataset was used to train the model and the rest was used to evaluate it. The accuracy and the Kappa values are 99.87% and 99.69%, respectively.
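The Kappa statistic reported alongside accuracy can be sketched as follows. This Python sketch computes Cohen's kappa from confusion matrix counts. The counts are hypothetical and are not the study's results, and the `cohen_kappa` function is an illustration rather than the caret implementation.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: agreement between predictions and true labels,
    corrected for the agreement expected by chance."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement: product of the marginal totals, summed over both classes
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical counts for 100 test messages
kappa = cohen_kappa(tp=40, tn=45, fp=5, fn=10)

# A classifier that always predicts the majority class agrees with the
# labels no more often than chance, so its kappa is 0 whatever its accuracy
kappa_majority = cohen_kappa(tp=0, tn=73, fp=0, fn=27)
```

Kappa is useful on imbalanced data such as this dataset because, unlike accuracy, it discounts the agreement a classifier would reach by chance given the class proportions.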
The model selection process was based on the performance metrics of the models. The classification accuracies of the SVM and naïve Bayes classifiers were 72.81% and 99.87%, respectively. The naïve Bayes model is based on Bayes' theorem; it calculates an observed probability of the classes (cyberbullying and non-cyberbullying text) based on all the observations in the Formspring.me dataset. This explains its high performance in terms of classification accuracy in this study.
The naïve Bayes classifier can be used as a classification tool owing to its performance in classifying text into cyberbullying or non-cyberbullying classes. The naïve Bayes classifier calculates an observed probability of cyberbullying and non-cyberbullying based on all observations; hence the model classified observations as non-cyberbullying text and cyberbullying text even though cyberbullying text was limited in the Formspring.me dataset. One of the assumptions of the naïve Bayes algorithm is that all the observations in the dataset are equally important and independent.

CONCLUSION
The objectives of the study were to identify the drivers of cyberbullying, to train cyberbullying classifiers and to identify the best classifier between the SVM and naïve Bayes models. The main drivers of cyberbullying included age, race, gender, family structure, parental education, technology, anonymity and involvement in traditional bullying. The accuracy rates of the SVM and naïve Bayes classifiers were 72.81% and 99.87%, respectively. The naïve Bayes classifier was the best in terms of classification accuracy.
Future studies can train three different classification models, based on the three classification labels assigned by the three data annotators to each of the text messages in the Formspring.me dataset. The performance of these classification models can then be compared and analysed to check the consistency and accuracy of the text classification labels that are assigned by the data annotators.
Cyberbullying classification models can also be trained on datasets from various social media platforms like Facebook, Twitter, Instagram, WhatsApp, YouTube, etc. A study can also consider training a classifier on a dataset from one social media platform, and testing or evaluating it on a dataset from a similar social media platform. The scope of this study was limited to the comparison of the SVM and naïve Bayes classifiers. Therefore, the findings of the study may only be directly relevant to scenarios where the choice is between SVM and naïve Bayes. The results may not necessarily extend to other types of classifiers or machine learning algorithms, and further research will be required to generalise the findings.

Word Cloud Visualisation of the Training Dataset
The text data was visualised through word cloud visualisations. The word cloud visualisation of the Formspring.me training dataset in Figure 1 shows that there are more non-cyberbullying words in the dataset (shown by larger fonts) than cyberbullying words (smaller fonts).

Figure 1. Word cloud visualisation of the training dataset