Using Computational Text Analysis to Explore Open-Ended Survey Question Responses

DOI: 10.4018/978-1-5225-8563-3.ch007


Open-ended questions, such as text-based elicitations (and file-upload options for still imagery, audio, video, and other contents), capture a broader range of data than close-ended questions, which are often defined and delimited by the survey instrument designer. Such questions are becoming more common because of the wide availability of computational text analysis, both within online survey tools and in external software applications. These computational text analysis tools (some online, some offline) make it easier to capture reproducible insights from qualitative data. This chapter explores some analytical capabilities, including matrix queries, theme extraction (topic modeling), sentiment analysis, cluster analysis (concept mapping), network text structures, qualitative cross-tabulation analysis, manual coding to automated coding, linguistic analysis, psychometrics, stylometry, network analysis, and others, as applied to open-ended questions from online surveys (and combined with human close reading).
Chapter Preview


The popularization of online surveys has meant that a wide range of question types can now be asked, with the integration of still visuals, audio, video, web links, and other elements. Invisible or hidden questions enable the collection of additional information, such as time spent per question, devices used to access the survey, geographical information, and other data. File-upload question types enable respondents to share imagery, audio, video, and other digital file types as a response. Integrations with online tools enable outreach through social media to broader audiences, through crowd-sourcing and commercial survey panels. Automation enables customized survey experiences through the use of names, question answers, piped text from a number of sources, expanded question elicitations (such as loop & merge techniques), branching logic, and randomizers, among others. And many online research suites, designed as all-in-one shops, enable the automated analysis of text, cross-tabulation analysis of quantitative data, and other approaches.

Yet, in the midst of all these changes, a simple confluence of technological capabilities has suggested an even more fundamental change: the sophistication of computational text analysis (computer-aided text analysis) means that open-ended text-based survey question responses may be better harnessed and exploited for information than in the recent past. Computational text analysis enables the identification of a range of data patterns: matrix queries, theme extraction (topic modeling), sentiment analysis, cluster analysis (concept mapping), network text structures, qualitative cross-tabulation analysis, manual coding to automated coding, linguistic analysis, psychometrics, stylometry, network analysis, and others. These computational text analysis approaches harness quantitative, qualitative and mixed methods approaches, and all include “humans in the loop” for the analyses.
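Two of the simplest patterns named above, word frequency counts and n-gram counts, can be sketched in a few lines. The example below is a minimal illustration in plain Python and is not the method of any tool discussed in this chapter; the responses, the stopword list, and the helper names `tokenize` and `ngrams` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical open-ended responses (illustrative only, not data from the chapter).
responses = [
    "The online course was flexible and the instructor was helpful.",
    "I liked the flexible schedule, but the online course felt rushed.",
    "The instructor was helpful and the readings were useful.",
]

# A tiny, assumed stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "and", "was", "were", "but", "i", "a", "of", "to"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the contiguous n-item sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


word_counts = Counter()
bigram_counts = Counter()
for response in responses:
    tokens = tokenize(response)
    # Word frequency count, skipping stopwords.
    word_counts.update(t for t in tokens if t not in STOPWORDS)
    # Bigram (2-gram) count over the unfiltered token sequence.
    bigram_counts.update(ngrams(tokens, 2))

print(word_counts.most_common(5))
print(bigram_counts.most_common(3))
```

Counts like these are only a starting point for the "distant reading" described here; a human reader would still need to check the frequent terms against the original responses.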

While some all-in-one online survey systems are expanding to include built-in text analyses, the built-in tools currently appear simplistic, and commercial software tools enable more sophisticated text analysis. Those with the technology skills and statistical know-how stand to exploit the capabilities of open-ended survey questions and freeform respondent comments and insights. Going to “machine reading” (or “distant reading” through various forms of computational text analysis) does not remove the human from the loop. There is still the need for human “close reading” of the findings and of some of the original raw data. (In some cases, all of the original text may be read, depending on the size of the text corpus.)

Technology Tools Used

The software tools highlighted in this work include Qualtrics®, NVivo 12 Plus, Linguistic Inquiry and Word Count (LIWC2015), and Network Overview, Discovery and Exploration for Excel (NodeXL).


Review Of The Literature

The main strength of surveys is that they capture elicited information from human respondents, but that fact is also their main weakness. There is a wide body of literature showing that people’s responses to surveys may depend on social relationships, design features of how questions are presented and asked, the types of technologies used, and other factors, which “intervene” and “interfere” with respondents’ offering their truest thinking. Besides these factors, respondents themselves have limitations, in terms of built-in cognitive biases (confirmation bias, anchoring biases, priming effects, and others) and limited working memory. And yet, surveys are sometimes the only way to capture respondent experiences, preferences, imaginations, and opinions, even with the limitations of self-reportage.

Surveys are delivered in various ways. Surveys may be delivered in person or remotely, to respondents who are alone or in the company of others. They may be other-administered or self-administered. They may be delivered through various modalities: via telephone (Arnon & Reichel, Apr. 2009) or paper (postal or face-to-face) or computer, offline or online, and so on. There are some survey sequences that involve various mixes of the prior variables. Some classic Delphi survey methods began with face-to-face (F2F) meetings followed by distance-based interactions, for example.

Key Terms in this Chapter

Population Segmentation: The partitioning of a human population into particular sub-groups with specified characteristics and preferences.

Codebook (Codeframe): The set of thematic categories, relevant to a particular phenomenon or research target of interest, to which content may be coded (these categories may be created through top-down coding as well as bottom-up coding).

Polysemous: Having multiple meanings.

Concept Map: A 2D diagram that shows interrelationships between words and concepts.

File-Upload Questions: Questions that may be responded to with the upload of any number of digital file types.

Theme Extraction: The identification of main ideas and/or topics from a text or collection of texts.

Word Frequency Count: A computational technique that enables computers to count how many words of each type occur in a piece of writing, collection, or text set.

Dendrogram: A data visualization that shows clustered words in structured interrelationships as branches on a tree (may be horizontal or vertical).

Non-Substantive Option: A response of “don’t know” on a survey that does not offer much in the way of informational value; the equivalent of avoiding an opportunity to answer or skipping an elicitation.

Psychometric: The objective measurement of various aspects of human personality.

Network Analysis: The depiction and study of objects (nodes) and the relationships (links) among them.

Semantic: Meaning-bearing (as in words in a language).

Topic Modeling: The extraction of topics within a piece of writing or set of written texts.

Linguistic Analysis: The scientific study of language.

Stylometry: The statistical analysis (metrics) of style.

Qualitative Cross-Tabulation Analysis: The integration of a cross-tabulation table with interview subjects/focus group speakers/survey respondents in the row data, and variables and themes in the column data to enable the identification of data patterns through computational means.

Cluster Analysis: Any of a class of statistical analysis techniques that group various contents (like words or data points) based on similarity or other forms of connectedness (often depicted in node-link graphs).

Word Tree: A data visualization that depicts a target word or ngram/phrase and a number of lead-up and lead-away words to the target term to provide human users with a sense of the target word/phrase use contexts (for semantic meaning).

High-Burden: A descriptive term suggesting the level of investment needed for a survey respondent to engage with a survey instrument.

Computational Text Analysis: The application of various counting, statistical analysis, dictionary comparison, and other techniques to capture information from natural language texts (and transcribed speeches).

Treemap Diagram: A data visualization indicating the frequency of occurrence of particular words and/or n-grams.

Text Corpus: A collection of written texts selected around particular topics and standards.

Elicitation: The drawing out of information.

N-Gram: A contiguous sequence of “n” items (words), from unigram (one-gram) to bigram, three-gram, four-gram, and so on.

Open-Ended Questions: Questions that may be responded to with a variety of text responses (only limited by the length of the text).

Sentiment Analysis: The labeling of words and phrases as positive or negative (in a binary way) or in various categories of positive to negative (on a continuum).

Modality: A form or type (of survey, such as face-to-face, in-person; by telephone; by postal mail; by computer face-to-face; by paper face-to-face; online; mixed modal, and others).

Close-Ended Questions: Questions that may be responded to with true/false, yes/no, or other multiple-choice options.

Coverage Error: A sampling error in survey deployment in which the sample does not sufficiently and randomly represent the members of the complete population.

Visual Question Answering: A new computational data analytics technique that enables computers to analyze an image, image sequence, or image set using computer vision and to make observations about the target images.

Dimensionality: The state of having multiple characteristics or attributes (with high dimensionality indicating many dimensions and low dimensionality indicating few dimensions).
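Two of the key terms above, sentiment analysis and qualitative cross-tabulation analysis, can be illustrated with a small sketch. The example below assumes a simple dictionary-based approach in plain Python; it is not the algorithm used by any tool named in this chapter. The respondent IDs, the response texts, the codebook keywords, and the positive and negative word lists are all hypothetical.

```python
# Hypothetical respondents and answers (illustrative only).
responses = {
    "R1": "Great support staff but the cost is too high",
    "R2": "The cost was high and the support was poor",
    "R3": "Scheduling was easy and the staff were great",
}

# Assumed codebook: each theme maps to keywords that signal it.
codebook = {
    "support": {"support", "staff"},
    "cost": {"cost", "price"},
    "scheduling": {"scheduling", "schedule"},
}

# Assumed sentiment dictionaries; real dictionaries are far larger
# and treat context (for example, "high quality") more carefully.
POSITIVE = {"great", "easy", "helpful"}
NEGATIVE = {"poor", "high", "slow"}


def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def sentiment_score(text):
    """Return (positive word count) minus (negative word count)."""
    words = tokens(text)
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


def cross_tab(responses, codebook):
    """Build a respondent-by-theme table: 1 if any theme keyword appears."""
    table = {}
    for respondent, text in responses.items():
        words = set(tokens(text))
        table[respondent] = {
            theme: int(bool(words & keywords))
            for theme, keywords in codebook.items()
        }
    return table


table = cross_tab(responses, codebook)
for respondent, row in table.items():
    print(respondent, row, "sentiment:", sentiment_score(responses[respondent]))
```

Respondents sit in the rows and themes in the columns, matching the definition of qualitative cross-tabulation given above. The sentiment score is on a simple continuum; collapsing it to its sign would give the binary labeling.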
