Combining Machine Learning and Natural Language Processing for Language-Specific, Multi-Lingual, and Cross-Lingual Text Summarization: A Wide-Ranging Overview

Luca Cagliero, Paolo Garza, Moreno La Quatra

Source Title: Trends and Applications of Text Summarization Techniques

DOI: 10.4018/978-1-5225-9373-7.ch001

Abstract

Recent advances in multimedia and web-based applications have eased access to large collections of textual documents. To automate the process of document analysis, the research community has devoted considerable effort to extracting short summaries of document content. However, most of the early summarization methods were tailored to English-language corpora or to collections of documents all written in the same language. More recently, the joint efforts of the machine learning and the natural language processing communities have produced more portable and flexible solutions, which can be applied to documents written in different languages. This chapter first overviews the most relevant language-specific summarization algorithms. Then, it presents the most recent advances in multi- and cross-lingual text summarization. The chapter classifies the presented methods, highlights their main pros and cons, and discusses the prospects for extending current research towards cross-lingual summarization systems.

Chapter Preview

Introduction

In recent years, thanks to advances in Web-based applications, the number of textual documents produced and made available in electronic form has steadily increased. To peruse potentially large collections of textual documents, domain experts often need the aid of automatic compression tools, namely document summarizers. These systems are able to produce informative yet succinct summaries by filtering out irrelevant or redundant content and by selecting the most salient parts of the text.

Text summarization is an established branch of research, whose main goal is to study and develop summarization tools that are able to extract high-quality information from large document collections (Tan et al., 2006). Plenty of approaches to document summarization have been proposed in the literature. They commonly rely on Natural Language Processing (NLP), Information Retrieval (IR), or text mining techniques (Nazari & Mahdavi, 2019). Automated summarization systems have found application in industrial and research domains, e.g., content curation for medical applications (Zitnik et al., 2019), news recommendation (Tang et al., 2009), disaster management (Li et al., 2010), and learning analytics (Cagliero et al., 2019; Baralis & Cagliero, 2018).

The text summarization process commonly entails the following steps:

1. Filter the content of the input documents and transform it using ad hoc textual data representations.
2. Identify the key concepts mentioned in the text and extract significant descriptions of these concepts in textual form.
3. Generate summaries of the original document content that cover all of the salient concepts with minimal redundancy.
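
The three steps above can be sketched as a minimal extractive summarizer. This is only an illustration under simplifying assumptions, not a method taken from the chapter: the stopword list, the frequency-based salience score, the Jaccard redundancy check, and the names `tokenize` and `summarize` are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny English stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "are", "on"}


def tokenize(sentence):
    """Step 1: lowercase the sentence, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2, max_overlap=0.5):
    """Return up to max_sentences salient, mutually non-redundant sentences."""
    # Step 1: split the text into sentences and build a bag-of-words per sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [tokenize(s) for s in sentences]

    # Step 2: treat frequent terms as key concepts; a sentence's salience is
    # the average corpus frequency of its terms.
    freq = Counter(w for t in tokens for w in t)
    scores = [sum(freq[w] for w in t) / len(t) if t else 0.0 for t in tokens]

    # Step 3: greedily pick the best-scoring sentences, skipping any sentence
    # whose Jaccard overlap with an already chosen one exceeds max_overlap.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == max_sentences:
            break
        candidate = set(tokens[i])
        if not candidate:
            continue
        redundant = any(
            len(candidate & set(tokens[j])) / len(candidate | set(tokens[j])) > max_overlap
            for j in chosen
        )
        if not redundant:
            chosen.append(i)

    # Present the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


print(summarize("Cats chase mice. Cats chase mice daily. Dogs bark loudly."))
```

In this toy input the second sentence is skipped because it largely repeats the first, which is the kind of redundancy control that step 3 calls for.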

Statistics- and semantics-based text analyses are commonly applied in order to detect the most significant concepts and their descriptions in the text (Conroy et al., 2004). Most of them rely on the hypothesis that the content of all the original documents is written in the same language. This simplifies both the models used to capture the key concepts in the text, which are usually language- and domain-specific, and the computation of text similarity measures, which usually rely on frequency-based term analyses. Hereafter, we will denote as “language-specific” summarizers all the systems that cannot be applied to documents written in different languages.
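
To see why frequency-based similarity measures implicitly assume a single language, consider the following sketch. It computes cosine similarity between raw term-frequency vectors; the function names and the example sentences are assumptions introduced for illustration only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a text to its raw term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


english_1 = term_vector("the cat sleeps on the sofa")
english_2 = term_vector("the cat sleeps on the bed")
italian = term_vector("il gatto dorme sul divano")  # same meaning as english_1

print(cosine(english_1, english_2))  # high: the sentences share most terms
print(cosine(english_1, italian))    # 0.0: no surface terms in common
```

The Italian sentence has the same meaning as the first English one, yet the two share no terms, so a purely frequency-based measure treats them as unrelated.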

The rapid worldwide growth of the Internet has produced a huge mass of textual documents written in a variety of different languages. Accessing the information contained in documents written in different languages has become a relevant and compelling research issue (Wang et al., 2018). For instance, the findings described in scientific articles and reports written in languages other than English are, in most cases, not easily accessible by foreign researchers. This limits the accessibility of the achieved results. Similarly, the news articles published in national newspapers in the local languages cannot be easily explored without adopting language-dependent text analysis tools. The knowledge provided by documents written in foreign languages is valuable for driving experts’ decisions in several domains, including finance, medicine, transportation, and the publishing industry (Wan et al., 2010). However, in practice, most researchers, practitioners, and entrepreneurs explore only small documents written in English or in their native language. Therefore, the information hidden in the documents written in foreign languages is either not considered at all or underused to a large extent.

Key Terms in this Chapter

Natural Language Processing: Subfield of computer science that concerns the processing of large amounts of natural language data by means of automated systems.

Single-document Summarization: The process of generating a representative summary from a single input document.

Word Embeddings: Feature learning techniques that map words or phrases from a vocabulary to vectors of real numbers. The vector space allows analysts to identify semantic similarities between linguistic items based on their distributional properties in large textual corpora.

Multi-Document Summarization: The process of generating a representative summary from a collection of input documents.

Knowledge Discovery From Data (KDD): The process of extracting hidden information from data. It includes the tasks of data selection, preprocessing, transformation, mining, and evaluation.

Extractive-Based Summarization: The process of generating a representative summary by selecting the most relevant sentences from the input documents.

Document Summarization: The process of condensing the most representative content of either a single document or a document collection into a concise summary.

Cross-Lingual Language Model: Machine learning model representing relations between words in different languages.

Abstractive-Based Summarization: The process of generating a summary by means of new content and new sentences automatically generated by capturing the essence of the input document.

Frequent Itemset Mining: A widely used exploratory technique to discover relevant recurrences hidden in the analyzed data.

Text Analytics: Techniques to derive high-quality information from textual data.

Machine Translation: Automatic translation of sentences or documents from a source language to a target language by means of automatic algorithms.
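
As a rough illustration of the Word Embeddings and Cross-Lingual Language Model entries above, the sketch below looks up the nearest neighbor of a word in a shared vector space. The three-dimensional vectors are invented by hand for the example; real embeddings are learned from large corpora and typically have hundreds of dimensions. The names `EMBEDDINGS`, `cosine`, and `nearest` are hypothetical.

```python
import math

# Hypothetical hand-made vectors in a shared English/Italian space.
# They are NOT learned embeddings; they only mimic the expected geometry.
EMBEDDINGS = {
    "cat": (0.90, 0.10, 0.00),
    "gatto": (0.85, 0.15, 0.05),  # Italian for "cat"
    "bank": (0.00, 0.20, 0.90),
    "banca": (0.05, 0.10, 0.95),  # Italian for "bank"
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest(word, candidates):
    """Return the candidate whose vector is most similar to the word's vector."""
    return max(candidates, key=lambda c: cosine(EMBEDDINGS[word], EMBEDDINGS[c]))


print(nearest("cat", ["gatto", "banca"]))
print(nearest("bank", ["gatto", "banca"]))
```

Because related words sit close together regardless of language, similarity can be computed in the vector space even when the surface forms have nothing in common.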
