Information Extraction from Blogs

Marie-Francine Moens (Katholieke Universiteit Leuven, Belgium)
Copyright: © 2009 | Pages: 19
DOI: 10.4018/978-1-59904-974-8.ch023
This chapter introduces information extraction from blog texts. It argues that the classical techniques for information extraction that are commonly used for mining well-formed texts lose some of their validity in the context of blogs. This finding is demonstrated by considering each step in the information extraction process and by illustrating this problem in different applications. In order to tackle the problem of mining content from blogs, algorithms are developed that combine different sources of evidence in the most flexible way. The chapter concludes with ideas for future research.
Chapter Preview


A blog (short for Web log) is a Web-based publication consisting primarily of periodic content, usually displayed in reverse chronological order. Blogs are typically social media and provide commentary on a wide variety of subjects, such as products (e.g., cars, food), people (e.g., politicians, celebrities), politics, news or health. The communication medium is primarily text, although we see an increasing focus on photographs (photoblog), sketches (sketchblog), videos (vlog) or audio (podcasting), or on combinations of these media. A descriptive textual component is usually present, because text is an important component of human communication. Many blogs are built in an interactive dialogue setting, but a blog can also take the form of a personal diary. Other people contribute by complementing, freely tagging or commenting on the content, and blog authors like to link to other content. The people who write blogs are usually not professionals.

Blogs are very creative forms of human expression, and their influence on our convictions, political opinions and societal relationships is often underestimated. Blogs mirror a society, and many different parties have an interest in monitoring their content. Businesses, lawyers, sociologists and politicians want to know the topics that are of most concern to citizens. Police and intelligence services might find valuable links or cues for tracking crime. Citizens are interested in finding soul mates with common interests. We humans have no trouble aggregating the different media and inferring messages and interpretations from them. If we design machines that help people to search, monitor, mine or summarize blogs, we expect these machines to have a certain degree of understanding of the blog contents. Assigning semantic meaning to blogs brings us to the domain of artificial intelligence. This chapter treats the topic of information extraction from blogs. In previous work we have defined information extraction as:

Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources providing additional aids to access and interpret the unstructured data by information systems. (Moens, 2006, p. 225)

Information extraction is used to get some information out of unstructured data. Written and spoken text, pictures, video and audio are all forms of unstructured data. Unstructured does not imply that the data is structurally incoherent (in that case it would simply be nonsense), but rather that its information is encoded in such a way that makes it difficult for computers to immediately interpret it. Information extraction is the process that adds meaning to unstructured, raw data, whether that is text, images, video or audio. Consequently, the data becomes structured or semi-structured and can be more easily processed by the computer.

In other words, information extraction presupposes that although the semantic information in a text and its linguistic organization are not immediately computationally transparent, they can nevertheless be retrieved by taking into account surface regularities that reflect this computationally opaque internal organization. An information extraction system will use a set of extraction patterns, which are either manually constructed or automatically learned, to take information out of a source and put it in a more structured format. When structuring this information, the purpose is not to replace the unstructured data with the extracted information, which would amount to imposing a certain view on the data. The goal is to complement the unstructured low-level data with semantic labels so that their automated retrieval, linking, mining and visualization become more effective (Moens, 2006).
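The idea of extraction patterns that add semantic labels alongside the text, rather than replacing it, can be sketched as follows. This is a minimal illustration and not the system described in the chapter: the `Annotation` class, the `DATE` and `EMAIL` labels, and the two hand-written regular expressions are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the labelled span begins
    end: int     # character offset just past the labelled span
    label: str   # semantic class assigned to the span
    text: str    # surface string covered by the span


# Hypothetical hand-written patterns; a real system might learn them instead.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "DATE": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract(text):
    """Return stand-off annotations sorted by position; the text is left unchanged."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Annotation(match.start(), match.end(), label, match.group()))
    return sorted(found, key=lambda a: a.start)


post = "Posted on 3 March 2008. Contact me at jane.doe@example.org for details."
annotations = extract(post)
for a in annotations:
    print(a.label, a.start, a.end, repr(a.text))
```

Because the annotations only point into the original string by offset, the raw text stays available for other views or later processing steps.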

The unstructured data sources we are mainly concerned with in this chapter are written texts, possibly enriched with free tags, comments and hypermedia links. Information extraction aims here at identifying certain information for use in subsequent information systems. State-of-the-art information extraction techniques are typically applied to well-formed texts, i.e., texts that are consistent with the standards of an official language. However, blog data is notorious for being incoherent and full of grammatical and spelling errors. Sometimes a community language or jargon is used. The focus of this chapter is on the problems encountered when applying a state-of-the-art information extraction system to blogs.
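A toy example may make the problem concrete. The sketch below is an illustration of our own, not a technique from the chapter: it assumes a naive person-name pattern that relies on standard capitalization, and the two sample sentences are invented. The pattern works on an edited sentence but finds nothing in a blog-style rendering of the same content.

```python
import re

# Toy pattern (an assumption for illustration): two consecutive capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def find_names(text):
    """Return all substrings that look like 'Firstname Lastname'."""
    return NAME_PATTERN.findall(text)


edited = "I met Jane Doe in Leuven yesterday."
bloggy = "i met jane doe in leuven yesterday lol"

names_edited = find_names(edited)   # capitalization cue is present
names_bloggy = find_names(bloggy)   # capitalization cue is missing

print(names_edited)
print(names_bloggy)
```

A cue that is reliable in well-formed text (here, capitalization) can disappear entirely in blog text, which is one reason to combine several sources of evidence instead of depending on a single surface regularity.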

Key Terms in this Chapter

Argumentative Mining: The detection of an argumentative structure in a discourse and the recognition of its composing elements such as the premises and conclusions of an argument; possibly the integration of the found arguments into a knowledge structure used for reasoning.

Part-of-Speech: Word class or category (also called lexical class) which is generally defined by the syntactic or morphological behaviour of the word in question; common classes are noun, verb and adjective among others.

Conditional Random Field (CRF): Learning system for classification often used for labeling sequential data (such as natural language data); as a type of Markov random field, it is an undirected graphical model in which each vertex represents a random variable, whose distribution is to be inferred, and each edge represents a dependency between two variables.

Treebank: A syntactically processed corpus that contains annotations of natural language data at various linguistic levels (word, phrase, clause and sentence levels). A treebank mainly provides the morphosyntactic and syntactic structure of the utterances within the corpus and consists of a bank of linguistic trees, hence its name.

Named Entity Recognition: Classifies named expressions in text (such as person, company, location or protein names).

Blog (Short for Web Log): A Web based publication consisting primarily of periodic content.

Opinion Mining: The detection of the opinion or subjective assessment in a certain medium (mostly text) where the opinion is usually expressed towards a certain entity or an entity’s attribute; possibly the aggregation of the found opinions into a score that reflects the opinion of a community.

Support Vector Machine (SVM): Learning system used for classification and regression that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimisation theory; a special property of an SVM is that it simultaneously minimizes the empirical classification error and maximizes the geometric margin that separates two classes; hence SVMs are known as maximum margin classifiers.

Maximum Entropy Model: Learning system used for classification that computes the probability distributions corresponding to an object and its class based on training examples, and that selects the one with maximum entropy, where the computed probability distributions satisfy the constraints set by the training examples.

Noun Phrase Coreferent: Two or more noun phrases are coreferent when they refer to the same situation described in the text.

Tokenization: Breaks a text into tokens or words. It distinguishes words, components of multipart words and multiword expressions.

Information Extraction: The identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources providing additional aids to access and interpret the unstructured data by information systems.

Parser: Software program which analyses the grammatical structure of a sentence according to the grammar of the language; a parser is often automatically trained from annotated examples; it captures the implied hierarchy of the input sentence and transforms it into a form suitable for further processing (e.g., a dependency tree).
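As an illustration of the Tokenization entry above, the following minimal sketch separates plain words, the components of hyphenated multipart words, and multiword expressions. It assumes a small hypothetical lexicon (`MULTIWORD`) and a simple regular-expression strategy; it is not the tokenizer used in the chapter.

```python
import re

# Hypothetical lexicon of multiword expressions to keep as single tokens.
MULTIWORD = ("New York", "in spite of")

# Try longer multiword expressions first, then single words, then punctuation.
_MWE = "|".join(re.escape(m) for m in sorted(MULTIWORD, key=len, reverse=True))
TOKEN_PATTERN = re.compile(rf"\b(?:{_MWE})\b|\w+|[^\w\s]")


def tokenize(text):
    """Split text into multiword expressions, words and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("She moved to New York, in spite of the well-known costs.")
print(tokens)
```

With this strategy a hyphenated word such as "well-known" is emitted as its components plus the hyphen, while the listed multiword expressions each stay together as one token.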
