Improvised Spam Detection in Twitter Data Using Lightweight Detectors and Classifiers

Velammal B. L., Aarthy N.
DOI: 10.4018/IJWLTT.20210701.oa2

Abstract

Receiving spam messages is one of the most serious issues in social media, especially on Twitter, a widely used platform on which individuals express opinions and emotions either publicly or within a specific group of members who share similar views or a discussion topic. In such focused discussion groups, spam received through social media sites is a major annoyance. In this paper, a system is developed to detect spam tweets using four lightweight detectors, namely a blacklist domain detector, a near-duplicate detector, a reliable ham detector, and a multiclass detector. The detected tweets are then classified using an ensemble of classifiers such as naïve Bayes, logistic regression, and random forest. A voting method is applied to decide the labels of the tweets obtained after the classification process. The proposed system achieved an accuracy of 79% in detecting spam tweets with the naïve Bayes classifier, and accuracy appears to improve further as more sample data becomes available.
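
The voting step mentioned above can be illustrated with a minimal sketch. It assumes simple majority voting over per-tweet labels; the abstract does not state the exact voting rule, so the tie-breaking choice (falling back to "ham") and the function names `majority_vote` and `vote_per_tweet` are hypothetical, and the sample predictions are invented for illustration only.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most classifiers.

    Assumption: on a tie the tweet is labelled "ham", so that a tweet is
    only flagged as spam when a clear majority agrees.
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "ham"
    return ranked[0][0]


def vote_per_tweet(predictions):
    """Combine predictions given as {classifier name: [label per tweet]}."""
    # zip(*...) regroups the lists so each tuple holds one tweet's labels.
    return [majority_vote(labels) for labels in zip(*predictions.values())]


# Illustrative predictions for three tweets (not data from the paper).
predictions = {
    "naive_bayes": ["spam", "ham", "spam"],
    "logistic_regression": ["spam", "ham", "ham"],
    "random_forest": ["ham", "ham", "spam"],
}
final_labels = vote_per_tweet(predictions)
print(final_labels)
```

In this sketch each tweet receives the label that at least two of the three classifiers agree on.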

1. Introduction

Social media has become one of the most popular and pervasive means of communication among individuals. Many young people use social media every day, including Twitter, which was established in 2006 and has become one of the best-known microblogging web applications. Twitter is the most popular microblogging site, with approximately 200 million users. Twitter has witnessed different kinds of spam attacks. Detecting spam is the first and most crucial step in the fight against spam (Chu, Widjaja, & Wang, 2012).

Conventional spam detection methods on Twitter mainly check individual tweets or Twitter accounts for the existence of spam. Tweet-level detection monitors individual tweets to check whether they contain spam text content or Uniform Resource Locators (URLs). As of 6 June 2018, around 8.3 million tweets were generated per hour (Lin, Sun, Nepal, Zhang, Xiang, & Hassan, 2017), and they demand near real-time delivery. Tweet-level detection would therefore consume too many computing resources and can hardly meet stringent time requirements. Account-level detection works by checking individual accounts for evidence of posting spam tweets or aggressive automation behavior. Accounts violating the Twitter rules on spam and abuse (Meda, Bisio, Gastaldo, & Zunino, 2014) are suspended by the administrators. However, suspending spam accounts is an endless cat-and-mouse game, as spammers can easily create new accounts to replace the suspended ones. Spam detection on Twitter should therefore shift from individual detection to collective detection and focus on detecting spam campaigns.

A spam campaign is defined as a collection of multiple accounts controlled and manipulated by a spammer to spread spam on Twitter for a specific purpose (e.g., advertising a spam site or selling counterfeit goods). Detecting and blocking such spam campaigns brings two additional benefits. The first is efficiency: related spam accounts are clustered into a campaign and a signature is generated for the spammer behind the campaign. With this process, the system can detect multiple existing spam accounts at a given time and also capture future ones, provided the spammer keeps the same spamming strategies. The second is robustness: some spamming methods cannot be detected at the individual level, such as posting duplicate content over multiple accounts, which Twitter does not consider spamming. By grouping related accounts, the system can detect such collective spamming behavior, and precautionary measures can be taken to restrict such messages. Another way of achieving this is to cluster tweets in the Twitter dataset that share the same final URL into a campaign, thereby partitioning the dataset into numerous campaigns based on URLs. A detailed analysis of the campaign data then yields a set of useful features for classifying a campaign into one of two classes: spam or legitimate.
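
The URL-based partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each tweet record already carries a resolved `final_url` field (following shortener redirects is out of scope here), and the helper names `normalize_url` and `group_into_campaigns`, the record layout, and the sample tweets are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to lowercase host plus path, without a trailing slash.

    Assumption: scheme, query string, and fragment are ignored when
    deciding whether two tweets point at the same destination.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    return host + parts.path.rstrip("/")


def group_into_campaigns(tweets):
    """Partition tweets into candidate campaigns keyed by their final URL."""
    campaigns = defaultdict(list)
    for tweet in tweets:
        campaigns[normalize_url(tweet["final_url"])].append(tweet)
    return dict(campaigns)


# Illustrative records only; the domains are placeholders.
tweets = [
    {"account": "a1", "text": "Cheap watches", "final_url": "http://shop.example.com/deal/"},
    {"account": "a2", "text": "Best watches!", "final_url": "http://SHOP.example.com/deal"},
    {"account": "a3", "text": "Conference notes", "final_url": "https://blog.example.org/notes"},
]
campaigns = group_into_campaigns(tweets)
for url, members in campaigns.items():
    print(url, sorted({t["account"] for t in members}))
```

Each resulting group is only a candidate campaign; features such as the number of distinct accounts per URL would still have to be computed before a group could be classified as spam or legitimate.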

Internet spam is one or more unsolicited messages sent or posted as part of a larger collection of messages, all having substantially identical content (Giyanani & Desai, 2013). Most spam messages take the form of advertising or promotional material, such as debt reduction plans, get-rich-quick schemes, gambling opportunities, pornography, online dating, and health-related products. The major technical disadvantages of spam messages are wasted network resources (bandwidth), wasted time, and damage to PCs and laptops caused by viruses. Spammers generally design personalized templates to deliver their messages using bulk mailing software. It is widely assumed that most spam messages are sent directly from a collection of bots.

At the organizational level, the effects of spam include annoyance to individual users, less reliable e-mail, loss of work productivity, misuse of network bandwidth, and wasted file server storage space and computational power. They can also include the spread of viruses, worms, and Trojan horses, and financial losses through phishing, Denial of Service (DoS), and directory harvesting attacks. According to the Text Retrieval Conference (TREC) (Bhowmick & Hazarika, 2013), the term ‘spam’ refers to unsolicited, unwanted information that was sent indiscriminately. Spam is unsolicited, unratified, and usually mass broadcast, acting as a carrier of unsolicited advertisements, fraud schemes, phishing messages, explicit content, promotions of causes, etc.
