Embodied Conversation: A Personalized Conversational HCI Interface for Ambient Intelligence

Andrej Zgank, Izidor Mlakar, Uros Berglez, Danilo Zimsek, Matej Borko, Zdravko Kacic, Matej Rojc
DOI: 10.4018/978-1-7998-3473-1.ch076

Abstract

The chapter presents an overview of human-computer interfaces, which are a crucial element of any ambient intelligence solution. The focus is on embodied conversational agents, which are needed to communicate with users in the most natural way. Different input and output modalities, together with the supporting methods that process the captured information (e.g., automatic speech recognition, gesture recognition, natural language processing, dialog processing, and text-to-speech synthesis), play a crucial role in providing a high quality of experience to the user. As an example, the use of an embodied conversational agent in the e-Health domain is proposed.

Introduction

Ambient Intelligence (AmI) and Smart Environments (SmE) are built on three foundations: ubiquitous computing, ubiquitous communication, and personalized adaptive interfaces. A key requirement is that interaction takes place through natural interfaces, so that people perceive the presence of smart objects and machines as little as possible, and only when the system cannot achieve its functionality without asking for the user’s attention. The use of physical, especially non-freeform, controllers (e.g., touch-screen, remote, mouse, keyboard) hinders natural interaction and establishes a strong barrier between the user and the computer. Speech and body language, in contrast, are input/output Human-Computer Interface (HCI) modalities that have the benefit of being natural for communication between humans. Speech as an input modality can be used in many scenarios, and the scenario is reflected in the complexity of the automatic speech recognition system. Humans are social beings who use all available signals, from linguistic ones (language and speech) to paralinguistic and non-verbal ones, in the perception, understanding, and production of information in face-to-face communication. Some paralinguistic and non-verbal audio events can be processed in parallel with automatic speech recognition, possibly sharing some modules, such as feature extraction.

Aggregation in (multimodal) understanding and use of natural language is supported especially through the concepts of Conversational Interfaces (CIs) and Embodied Conversational Agents (ECAs). In recent years, CIs have become an important research and development topic in academia and the ICT industry. Smart environments increasingly utilize Artificial Intelligence (AI) and incorporate smart objects, such as devices, wearables, virtual agents, and social robots. Leading companies (e.g., Google, Amazon, Microsoft, Apple) have been making huge investments in the development of supporting Artificial Intelligence technologies, such as Deep Learning and Natural Language Processing (NLP). The aim of these ventures is to create systems that enable users of smart systems to obtain information and access services in a more natural, embodied conversational way. The classical vision of CIs incorporates dialog systems and voice-enabled interfaces supported by NLP. The multimodal vision of CIs, however, also aggregates embodiment and embodied cognition (as part of embodied language processing) in input (gesture-enabled user interfaces) and output (embodied conversational agents). CIs with a virtual ECA capable of both understanding and generating verbal and non-verbal signals can engage with a human user on a more personal level. In particular, through a more conversation-like flow of interaction, these ECAs, supported by NLP and Embodied Language Processing (ELP), may deliver new interaction models capable of adapting to the user’s context and facilitating the context of the conversational situation not only via modalities like speech, but also through visual interpretation and representation of information and social cues.

Despite all the investment in this research area, a considerable number of challenges still need to be solved. One barrier concerns language: users want to communicate in their native language. Languages outside the economic interest of leading companies are usually under-resourced, which presents an additional challenge for the development of an intelligent ambient. A solution applied in the last few years is the collection of open and freely available language resources. Among such resources, some modalities, such as text, are available more frequently than others, such as speech combined with transcriptions. The availability of language resources is thus one of the key questions in the era of industry digitalization.

The intelligent ambient, and especially its manifestation in the form of the Embodied Conversational Agent, plays a crucial role in many areas of modern society. The application areas range from general use in a smart home to dedicated use in a highly specialized e-Health environment. The common goal is to ease the user’s interaction and to make it as natural as possible.

Key Terms in this Chapter

Feature Extraction: A digital signal processing algorithm that extracts distinctive values from the input signal.

Deep Learning: Part of Machine Learning, where methods of higher complexity are used for training data representation.

Personalized: An approach where parts or functions of a system are adapted to the particular end-user.

Gesture Recognition: An approach to automatically recognizing gestures produced by a human during communication.

Text-to-Speech Synthesis: An approach where a system generates output audio signals from input text.

Intelligent Ambient: An environment where various devices, algorithms and processes capture input signals and produce optimal environment conditions based on captured data and previous knowledge, applying Artificial Intelligence.

Automatic Speech Recognition: An approach to converting an uttered speech signal into computer-readable text.

Multi-Modal: A procedure or algorithm that incorporates several modalities and processes them into a single end result.
