Introduction
We have studied Japanese machine lip reading by modeling skilled lip readers, who attend to the sequence of mouth shapes when they read lips. Accordingly, we studied how to detect mouth shapes in images of Japanese speakers. Reasoning that words and phrases can be recognized from the detected mouth shape sequences, we proposed a method of detecting the basic mouth shape (BaMS) by template matching (Miyazaki, Nakashima, & Ishii, 2011). The BaMS is the set of mouth shapes associated with the Japanese vowels and the closed mouth. The Japanese language has only five vowels (/a/, /i/, /u/, /e/, and /o/). The BaMS is defined as:
$$\mathrm{BaMS} = \{A, I, U, E, O, X\} \tag{1}$$
where the first five symbols represent the vowels, and X represents the closed mouth.
Conventional studies of machine lip reading have adopted word-based methods (Kiyota & Uchimura, 1993; Nakata & Ando, 2002; Okumura, Hamaguchi, Okano, & Miyazaki, 1998; Saitoh & Konishi, 2007; Uda, Tagawa, Minagawa, & Moriya, 2001). Such methods require real utterance images because the features of each word or phrase are calculated from the images, so they become cumbersome as the number of words and phrases to be recognized grows. In contrast, our proposed method uses the mouth shape, and it is easy to identify the sequence of mouth shapes associated with a word or phrase (Miyazaki, Nakashima, & Ishii, 2011). However, in some cases the method did not detect the beginning mouth shape (BeMS), as defined in (2):
(2)
We believe that the BeMS frames may have been dropped because the mouth shape is held for only a very short time. The digital video camera that we used was intended for home use, and its frame rate was only 30 fps. Therefore, we used a video camera with a higher frame rate (60 fps) to capture the BeMS frames. The new camera captured the necessary frames; however, we faced another problem: when one mouth shape changes to another, the deformed mouth shape at the transitional moment was misdetected as the BeMS. To correct this, we measure the motion of the lips and use it to prevent the detection of these deformed mouth shapes. This study describes the previous method and how optical flow (Farnebäck, 2003) is adopted to measure the distance that the lips move.
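The idea of using lip motion to reject transitional frames can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a dense flow field over the lip region has already been estimated (for example by a Farnebäck-style method) and is supplied as rows of (dx, dy) displacement vectors in pixels. The function names and the 1.0-pixel threshold are hypothetical.

```python
import math


def mean_flow_magnitude(flow):
    """Average displacement length (in pixels) over a dense flow field.

    `flow` is a list of rows; each row is a list of (dx, dy) tuples.
    """
    total = 0.0
    count = 0
    for row in flow:
        for dx, dy in row:
            total += math.hypot(dx, dy)
            count += 1
    return total / count if count else 0.0


def is_transitional(flow, threshold=1.0):
    """True when the lips move more than `threshold` pixels on average.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return mean_flow_magnitude(flow) > threshold


# Toy 2x2 flow fields: one nearly still, one with clear motion.
still = [[(0.0, 0.0), (0.1, 0.0)], [(0.0, 0.1), (0.0, 0.0)]]
moving = [[(3.0, 4.0), (3.0, 4.0)], [(0.0, 5.0), (5.0, 0.0)]]

print(is_transitional(still))   # frame may be considered for BeMS detection
print(is_transitional(moving))  # frame treated as a transition and skipped
```

In such a scheme, a frame flagged as transitional would simply be excluded from BeMS detection, so that a deformed shape seen mid-transition is not matched against the templates.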
Previous BaMS Detection Method
We previously proposed a method of detecting the BaMS using template matching (Miyazaki, Nakashima, & Ishii, 2011). Six mouth shape images were used as template images, and the similarity to each BaMS was calculated for a sequence of images of Japanese utterances. In addition to the BeMS, we used the ending mouth shape (EMS), as defined by (3).
(3)
When the EMS is formed, the similarity waveform of the corresponding mouth shape becomes flat. In contrast, when the BeMS is formed, the corresponding waveform becomes convex. Using these characteristics, our previous method first detected the BeMS and EMS periods and then detected the BaMS for each period.
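The two steps described above, computing a similarity waveform per template and then separating flat periods from convex ones, can be sketched as follows. This is an illustration under assumptions rather than the published procedure: zero-mean normalized cross-correlation stands in for the similarity measure, which the text does not specify; frames are assumed to be already cropped to the template size; and the parameters `level`, `eps`, and `min_flat` are hypothetical. Only two toy 2x2 templates are used here instead of the six mouth shape images.

```python
import math


def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-sized grayscale patches."""
    a = [p for row in patch for p in row]
    b = [p for row in template for p in row]
    if not a or len(a) != len(b):
        raise ValueError("patch and template must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def similarity_waveforms(frames, templates):
    """Map each mouth-shape label to its per-frame similarity sequence."""
    return {label: [ncc(f, t) for f in frames] for label, t in templates.items()}


def classify_runs(waveform, level=0.6, eps=0.02, min_flat=3):
    """Label each run of frames with similarity >= `level` as 'flat' or 'convex'.

    A run is 'flat' when at least `min_flat` frames lie within `eps` of the
    run's peak; otherwise it is 'convex'. Returns (start, end, label) tuples
    with `end` exclusive. All three parameters are hypothetical.
    """
    values = list(waveform)
    runs = []
    start = None
    # A sentinel below any similarity value closes a run that reaches the end.
    for i, s in enumerate(values + [float("-inf")]):
        if s >= level and start is None:
            start = i
        elif s < level and start is not None:
            segment = values[start:i]
            peak = max(segment)
            near_peak = sum(1 for v in segment if peak - v <= eps)
            runs.append((start, i, "flat" if near_peak >= min_flat else "convex"))
            start = None
    return runs


# Toy templates and frames (2x2 grayscale patches).
templates = {"A": [[0, 255], [255, 0]], "X": [[255, 255], [0, 0]]}
frames = [templates["A"], templates["X"]]
waves = similarity_waveforms(frames, templates)

# Synthetic waveform: a short peak followed by a plateau.
wave = [0.1, 0.5, 0.8, 0.5, 0.1, 0.7, 0.9, 0.9, 0.9, 0.9, 0.7, 0.2]
print(classify_runs(wave))
```

With these toy inputs, the short peak is labeled convex (BeMS-like) and the plateau is labeled flat (EMS-like). Real similarity waveforms are noisier, so the thresholds would need tuning on actual utterance data.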