Topic Detection and Tracking Towards Determining Public Agenda Items: The Impact of Named Entities on Event-Based News Clustering

Basak Buluz Komecoglu, Burcu Yilmaz
DOI: 10.4018/978-1-6684-4045-2.ch008

Abstract

The events that people are exposed to in daily life have important effects on their quality of life, and events that affect a large part of society are shared with the public through news texts. This chapter addresses the problem of automatically detecting and tracking events in the news with natural language processing methods. An event-based news clustering approach is presented for organizing the large, unstructured volumes of news found in online environments, a step that is necessary to extract meaningful information from them. The approach makes use of named entities with the aim of increasing clustering performance and speed. Additionally, the researchers created an event-based text clustering dataset and contributed it to the literature. Evaluated with the B-cubed metric on this test dataset, which consists of 930 different event groups and a total of 19,848 news articles, the approach achieved an F-score of over 85% on the event-based text clustering problem.

Introduction

In the digital age, access to information is undergoing a rapid paradigm shift (Nielsen & Selva, 2019). Online media, which plays one of the most significant roles in this change, has become an essential tool for the public to obtain and publish information: news and information can be provided over the internet by online news sites, mainstream media, and individual users. However, the content on individual text media platforms such as Twitter and blogs is of lower quality than the news texts published on online news sites, and it is insufficient for describing the events it covers (Fisher et al., 2020). The primary motivation behind individually published news content is to be interesting, to draw reactions from readers, and to be shared as quickly as possible after an event has occurred. This motivation also leads to the spread of news texts that are unconfirmed, prone to creating information pollution, far from objective, and open to manipulation of the events they describe. Therefore, news published online and in mainstream media, rather than individually shared content, has become an accurate and stable source of information for the public (Nielsen & Selva, 2019). As reported by the Reuters Institute in the Digital News Report, the acquisition of information from online media sources has increased worldwide in direct proportion to the rapid development of internet technology. The 2021 report stated that the coronavirus pandemic accelerated the transition to a digital future, along with the long-term trends behind the rise of digital news consumption (Newman et al., 2021).

Because they provide instant access to news and offer readers an improved reading experience, online media platforms have transformed rapidly to meet the demand for news consumption. Developed with web technologies, these platforms carry news that can reach a much larger audience than traditional print media due to their dynamic nature (Nielsen & Selva, 2019). As a result, they are not only tools that provide information but also knowledge bases whose data can be analyzed for different purposes.

Almost all decision-making mechanisms at different levels, such as individuals, institutions, and governments, are affected by daily events (Chen & Wang, 2021). Therefore, in this digital age it is necessary to identify and follow the news content that matters to individuals, organizations, institutions, or governments. For governments, for example, it is vital to grasp the public's views on government policy quickly and effectively, to act quickly when necessary, and to provide insight. In the past, this was done by manually collecting and analyzing the news content that made up the agenda. Today, as more online media platforms are used as data sources, algorithmic approaches are needed to cope with the exponentially increasing number of news sources and volume of content.

Similarly, in a competitive market, companies need to analyze their competitors and themselves from the customer's perspective. The importance of press coverage for brand value, public opinion, and market value is beyond doubt. For this reason, the news reflected in the press must be comprehensively sorted and analyzed for public opinion and competitor analysis.

The first solution for the automatic detection and tracking of hot events that constitute the agenda was a Topic Detection and Tracking (TDT) system developed as the result of a pilot study supported by DARPA in 1996 (Allan et al., 1998). This system was a pioneer in defining the TDT problem and in establishing a conceptual framework for solutions developed around it. Three concepts are accepted as fundamental in TDT research (Allan, 2002; Yang et al., 2009; Rasouli et al., 2020):

  • Event: a reality that contains knowledge of time and space.

  • Story: a news article that provides information about an event.

  • Topic: an event and a series of related stories.

Key Terms in this Chapter

Information Retrieval: The process, methods, and procedures of accessing and retrieving sufficient information from information system resources in response to a query given by the user.

Cluster: A group of similar objects (data points). Objects within a cluster share similar characteristics and differ from the objects in other clusters.

Embedding: A term used to represent words/sentences/documents for NLP applications. It is typically in the form of a real-valued vector and used for encoding the meaning of the word/sentence/document.

Noise: Unwanted variation, irrelevant information.

Named-Entity Recognition: A process or task in NLP in which text is parsed to find entities that can be assigned to categories such as persons, organizations, and locations.

Density-Based Clustering: An unsupervised learning technique that identifies distinct clusters, each made up of data objects spread over a contiguous, high-density region of the data space.

B-Cubed: An evaluation metric that computes precision and recall for every data point in a clustering of a given dataset, measured against the ground truth.

Pre-Trained Language Model: A language model used to learn universal language representations. It has been trained on large-scale corpora to perform specific language tasks.
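
As an illustration of the B-Cubed entry above, the following is a minimal sketch of how per-item precision and recall might be computed and averaged. It is not the chapter's own implementation: the function name b_cubed, the dictionary-based input format, and the toy labels are assumptions made for this example, and it uses the standard (non-overlapping) B-Cubed definition.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """Return (precision, recall, f_score) for a hard clustering.

    predicted and gold are dicts mapping each item id to a cluster label
    (hypothetical input format chosen for this sketch).
    """
    if set(predicted) != set(gold):
        raise ValueError("predicted and gold must cover the same items")
    items = list(gold)
    if not items:
        raise ValueError("at least one item is required")

    # Size of every predicted cluster, every gold cluster, and every
    # (predicted cluster, gold cluster) intersection.
    pred_sizes = Counter(predicted[i] for i in items)
    gold_sizes = Counter(gold[i] for i in items)
    overlap = Counter((predicted[i], gold[i]) for i in items)

    precision = 0.0
    recall = 0.0
    for i in items:
        shared = overlap[(predicted[i], gold[i])]
        # Share of the item's predicted cluster that truly belongs with it.
        precision += shared / pred_sizes[predicted[i]]
        # Share of the item's gold cluster that was placed with it.
        recall += shared / gold_sizes[gold[i]]

    precision /= len(items)
    recall /= len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (invented for illustration): five news items, two true events.
gold = {"a": "event1", "b": "event1", "c": "event1", "d": "event2", "e": "event2"}
predicted = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y"}
p, r, f = b_cubed(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

Because the scores are averaged over items rather than over clusters, large and small event groups contribute in proportion to the number of news items they contain.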
