Complex Events Processing on Live News Events Using Apache Kafka and Clustering Techniques


Aditya Kamleshbhai Lakkad (Vellore Institute of Technology, India), Rushit Dharmendrabhai Bhadaniya (Vellore Institute of Technology, India), Vraj Nareshkumar Shah (Vellore Institute of Technology, India) and Lavanya K. (Vellore Institute of Technology, India)
Copyright: © 2021 | Pages: 14
DOI: 10.4018/IJIIT.2021010103


The explosive growth of news content generated worldwide, coupled with its spread through online media and rapid access to data, has made tracking and screening news tedious. There is a growing need for a model that can process, break down, and organize content to extract interpretable information, explicitly recognizing topics and content-driven groupings of articles. This paper proposes automated analysis of heterogeneous news through complex event processing (CEP) and machine learning (ML) algorithms. News content is first streamed using Apache Kafka, stored in Apache Druid, and then processed by a blend of natural language processing (NLP) and unsupervised ML techniques.


The world is going digital, and news content is being created at an astounding rate. Many news sites report on events from nearly everywhere around the globe. Many people are changing the way they consume news, replacing traditional printed newspapers and magazines with their online versions (M. I. Rana, 2014). Online news has two key features: interactivity and immediacy. Interactivity relates to how people consume the news they are interested in, while immediacy means that people expect to be informed of the latest news with practically no delay. These features have made the online news industry competitive. First, the number of online news sites keeps growing; second, unlike printed papers, online news sites have no physical limit on the amount of information they can publish. At the same time, people are willing to spend only a limited time consuming news, so news sites need an effective strategy to catch readers' attention and attract their clicks. Despite the great importance of news production and consumption, little is known about them. The motivation of this paper is to find the most interactive and the most immediate news content among the vast amount of journalism produced across the globe. The paper tries to find interactive news content by grouping news articles according to their textual content. The most immediate news is filtered using the uniqueness of those articles. Hence, the principal aim of this research is to build a better understanding of the sentiments expressed in headlines, the popularity of news stories, and the comments that news items trigger. Computing the above requires a consistent data processing pipeline, because the volume and timing of data generation are uncertain.
Thus, the paper aims to design a highly reliable automatic data fetching and pre-processing architecture. This effort relies on unsupervised clustering analysis as a means of capturing how the news reflects the present moment. It also draws attention to unique and interesting news items in the current scenario, and examines how the nature of news changes from moment to moment. To perform the analysis, this study takes live news feeds from around the world through a dedicated Application Programming Interface (API). Apache Kafka accepts the news events and streams them. Apache Kafka is an open-source stream-processing platform that functions as a publish/subscribe broker (Carina Andrade, 2019). The reason for using Apache Kafka is that it aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Newsfeeds are published to Kafka as consecutive events. Apache Druid stores those events and serves as a streaming analytics data store, well suited to powering user-facing data applications. Druid can explore events immediately after they occur and combine real-time results with historical news events (Rowanda Ahmeda). Because the online-offline method has been incorporated effectively into many stream clustering algorithms, the authors have likewise used an online-offline processing framework. During the online stage, Druid dynamically maintains the necessary information about the continuously arriving data records. In the offline stage, centroid-based and density-based clustering algorithms are used to obtain better insights into the events. Because the online stage picks out the important events from the live stream and only those filtered events are then passed to the offline stage, the clustering algorithm performs better: it runs faster and needs less storage, and the time complexity of event processing is reduced.
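The online-offline idea can be illustrated with a minimal, in-memory sketch. It is not the authors' implementation: the plain Python list stands in for events arriving from the stream, and the term-count vectors, the cosine threshold of 0.6, the sample headlines, and the names `OnlineSummary` and `offline_kmeans` are all assumptions made for illustration. The online stage folds each arriving headline into a compact micro-cluster summary, and the offline stage runs a simple centroid-based pass over those summaries rather than over every raw event.

```python
import math
from collections import Counter


def vectorize(headline):
    """Term-count vector for one headline (lower-cased, whitespace split)."""
    return Counter(headline.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class OnlineSummary:
    """Online stage: keep only per-micro-cluster term sums and counts."""

    def __init__(self, threshold=0.6):  # threshold is an assumed value
        self.threshold = threshold
        self.micro = []  # each entry: {"sum": Counter, "n": int}

    def add(self, headline):
        vec = vectorize(headline)
        best, best_sim = None, 0.0
        for m in self.micro:
            sim = cosine(vec, m["sum"])
            if sim > best_sim:
                best, best_sim = m, sim
        if best is not None and best_sim >= self.threshold:
            best["sum"].update(vec)  # merge into the closest summary
            best["n"] += 1
        else:
            self.micro.append({"sum": Counter(vec), "n": 1})


def offline_kmeans(micro, k, iterations=10):
    """Offline stage: centroid-based clustering of the micro-cluster summaries.

    Centroids are seeded with the first k summaries and updated as the sum of
    their members (cosine similarity ignores vector length).
    """
    centroids = [Counter(m["sum"]) for m in micro[:k]]
    assign = [0] * len(micro)
    for _ in range(iterations):
        assign = [
            max(range(len(centroids)), key=lambda c: cosine(m["sum"], centroids[c]))
            for m in micro
        ]
        for c in range(len(centroids)):
            total = Counter()
            for m, a in zip(micro, assign):
                if a == c:
                    total.update(m["sum"])
            if total:
                centroids[c] = total
    return assign


# Hypothetical headlines standing in for events read from the stream.
stream = [
    "stocks rally as markets rise",
    "markets rise after stocks rally",
    "team wins football final",
    "football final won by home team",
    "stocks fall as markets slide",
]

summary = OnlineSummary(threshold=0.6)
for headline in stream:
    summary.add(headline)

labels = offline_kmeans(summary.micro, k=2)
print(len(stream), "events ->", len(summary.micro), "summaries ->", labels)
```

Because the offline pass sees only the summaries, its cost depends on the number of micro-clusters rather than on the number of raw events, which is the saving in time and storage that the paragraph above attributes to the online-offline split.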

This research contributes to the development of an automated news processing system that yields valuable insight from a large amount of data. The first section describes the background and related research, covering the established research topics that serve as building blocks for this article. The following section discusses the primary method and approach. The method section is divided into two parts, implementation and approaches: implementation follows the data through its life cycle, and approaches show how different algorithms and data analytics techniques are used to produce useful results. The next section presents results of the implemented algorithms and evaluates them using hypothesis testing techniques. Finally, future work and the conclusion are presented.
