Facial Muscle Activity Patterns for Recognition of Utterances in Native and Foreign Language: Testing for its Reliability and Flexibility

Sridhar Arjunan (RMIT University, Australia), Dinesh Kant Kumar (RMIT University, Australia), Hans Weghorn (Baden-Wuerttemberg Cooperative State University, Germany) and Ganesh Naik (RMIT University, Australia)
DOI: 10.4018/978-1-61350-429-1.ch012
The need for reliable and flexible human computer interfaces (HCI) has increased, and applications of HCI are now found in nearly every field. Human factors play an important role in these interfaces. Research and development of new human computer interaction techniques that enhance flexibility and reliability for the user are therefore important. Research on new methods of computer control has focused on three types of body functions: speech, bioelectrical activity, and the use of mechanical sensors. Speech operated systems have the advantage of providing the user with flexibility, and they have the potential to make computer control effortless and natural. This chapter summarizes research conducted to investigate the use of facial muscle activity as a reliable interface for identifying voiceless, speech based commands without any audio signal. System performance and reliability were tested to study inter-subject and inter-day variations and the impact of the speaker's native language. The experimental results indicate that such a system has a high degree of inter-subject and inter-day variation. The results also indicate that variations in the style of speaking are low in the native language but high when the speaker speaks in a foreign language, and that such a system is suitable only for a very small vocabulary. The authors suggest that speech recognition systems based on facial surface electromyography (sEMG) may find only limited applications.
Chapter Preview

1. Introduction

One bottleneck in technological advancement is the interface between the computer and the user. Until recently, the human computer interface (HCI) was largely restricted to the keyboard and the mouse; recent advances have led to systems operated by voice, biosignals, and gestures. Speech operated systems have the advantage of providing the user with flexibility and drawing on a time-tested natural ability. They offer a natural and seamless interface with the potential to make computer control almost effortless, and they can provide a richness comparable to human to human interaction. The success of such systems depends on the robustness of the speech recognition system, which is a complex multidisciplinary research area that includes speech and language processing.

In recent years, significant progress has been achieved in advancing speech recognition technology, making speech an effective modality in both telephony and multimodal human-machine interaction. The technology has become increasingly usable and useful. However, speech recognition is currently largely audio based and suffers from three major shortcomings: (i) it is not suitable in noisy environments such as a vehicle or a factory, (ii) it is not suitable for people with a speech impairment, such as people who have had a stroke, and (iii) it is not suitable for giving discrete commands or when other people may be talking loudly in the vicinity.

Work by Chen (2001) has demonstrated that speech based human to human communication is multimodal: along with the audio signal, the listener also observes facial and body gestures. When we speak in noisy environments, or with people with hearing loss, lip and facial movements often compensate for the lack of quality audio (Simpson et al., 1990; Stone et al., 1992). Speech can be identified from lip movement using visual sensing, by sensing movement and shape with mechanical sensors (Manabe et al., 2003), or by relating movement and shape to muscle activity (Chan et al., 2002; Kumar et al., 2004). To improve speech classification systems, a number of researchers have proposed the use of facial movements and gestures (Dimberg et al., 1997; Edward et al., 2006; Francis et al., 2002). The proposed systems are based on vision, biosignals, and mechanical sensors, and are generally used alongside audio speech recognition systems.

Each of these techniques has strengths and limitations. The video based technique is computationally expensive, requires a camera monitoring the lips that is fixed to the user’s head, and is sensitive to lighting conditions. The sensor based technique has the obvious disadvantage that it requires the user to have sensors fixed to the face, which makes the system less user friendly. The muscle monitoring systems suffer from low reliability, for two possible reasons: (i) people use different muscles even when they make the same sound, and (ii) cross talk between different muscles makes the signal difficult to classify. These issues were studied extensively by Harris (1970), who reported that suitable problems for EMG research can be divided into three classes: first, ‘which muscle’ problems; second, ‘which mechanism’ problems; and third, a more vaguely defined class of problems having to do with the general organization of the speech mechanism. A further difficulty is that each of these systems is user dependent and does not transfer well between users. In this chapter we report the use of recorded facial muscle activity to determine the unspoken command from the user. Even Myoelectric Signal (MES) based systems are heavily influenced by user dependencies, such as style of speaking, rate of speaking, and variation in pronunciation.
