Measurement of Textual Complexity Based on Categorical Invariance

Lixiao Zhang (School of Computer Engineering and Science, Shanghai University, Shanghai, China & Shanghai Sanda University, Shanghai, China) and Jun Zhang (School of Computer Engineering and Science, Shanghai University, Shanghai, China)
DOI: 10.4018/ijcini.2013040106

A measurement of textual complexity is proposed, based on categorical invariance in human concept learning. To this end, transformations of keywords are defined. A reader is said to have understood a text if he or she grasps the meaning of its keywords and the semantic relationships between keywords and sentences. The keyword transformations take into account both the difficulty of the keywords and the semantic relations between them. The more common a text's keywords and relations are, the lower its complexity. An experiment shows that the measurement is workable. Representational information based on text complexity measures the amount of information in each sentence with respect to the whole text. An example shows that the measured information of each sentence accords with the reader's reading experience.

Measurement of textual complexity is used to accurately predict the effort or difficulty involved in reading a text. It is a fundamental issue in e-learning, online question answering, web search and browsing, and similar applications. For example, a search engine can use it to provide each user with content whose difficulty fits that user's cognitive ability. Finding reading materials at an appropriate difficulty level is important for different kinds of readers, such as second language (L2) learners and educators.

Much research has been done on the measurement of textual complexity, which in linguistics is called readability. Many formulas, such as Flesch reading ease (Flesch, 1948), Flesch-Kincaid grade level (Kincaid, Fishburne, Rogers & Chissom, 1975), the Automated Readability Index, Gunning Fog (Gunning, 1952), SMOG (McLaughlin, 1969), and Coleman-Liau (Coleman & Liau, 1975), were developed to measure the readability of texts. These methods are mostly based on the average number of words per sentence and the average number of syllables or characters per word. They are easy to calculate and widely used, but they do not take the content of texts into account. Some later studies used a word list to assess the readability of a text (Stenner, Horabin, Smith, & Smith, 1988; Fry, 1990; Chall & Dale, 1995), which can be considered a simple language model (LM). An LM describes the probability of a given word sequence, so the likelihood of a given text under a given LM is easy to compute. Si and Callan (2001) and Collins-Thompson and Callan (2004) used LMs to obtain vocabulary information for predicting the grade level of a document.
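To make the surface-feature formulas concrete, the following is a minimal sketch of Flesch reading ease and Flesch-Kincaid grade level as they are commonly stated. The function names, the regular-expression tokenization, and the vowel-group syllable counter are illustrative assumptions rather than part of any cited work. Published implementations typically use more careful syllable counting, so scores from this sketch are only approximate.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text):
    """Return (sentences, words, syllables) using simple regex splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(n_sent, n_words, n_syll):
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(n_sent, n_words, n_syll):
    """Approximate U.S. school grade level needed to read the text."""
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


if __name__ == "__main__":
    counts = text_counts("The cat sat. The dog ran!")
    print(counts)
    print(flesch_reading_ease(*counts))
    print(flesch_kincaid_grade(*counts))
```

Both scores depend only on sentence length and word length, so two texts with the same counts receive the same score regardless of their content.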

State-of-the-art work has studied various linguistic aspects of texts, including lexical, syntactic, and discourse features, and has combined these features to predict the readability of texts (Schwarm & Ostendorf, 2005; Pitler & Nenkova, 2008; Feng, Jansche, Huenerfauth & Elhadad, 2010; Kanungo & Orr, 2009). With recent developments in computational linguistics and natural language processing, these features can be used to model readability automatically. Schwarm and Ostendorf (2005) used support vector machines to combine features from traditional reading level measures, statistical language models, and automatic parsers to assess reading levels. Pitler and Nenkova (2008) analyzed readability factors including vocabulary, syntax, cohesion, entity coherence, and discourse relations, using texts from the Wall Street Journal. They also studied the associations between these features and readability ratings assigned by readers, and found that discourse and vocabulary are the factors most strongly linked to text quality. They built readability predictors for two tasks: predicting text readability and ranking texts by readability. Kate et al. (2010) combined syntactic features, language model features, and vocabulary features for readability assessment of natural-language documents. Their syntactic features were obtained from the Sundance shallow parser and English Slot Grammar (ESG). They also compared the performance of different learning algorithms and different types of feature sets. In a recent study, several sets of explanatory variables (shallow, language modeling, POS, syntactic, and discourse features) were compared and evaluated in terms of their impact on predicting the grade level of reading materials for primary school students (Feng, Jansche, Huenerfauth & Elhadad, 2010).
