INTRODUCTION
Automatic recognition of speech from the video sequence of the speaker's lips, namely automatic lipreading, or speechreading, has recently attracted significant interest. Much of this interest focuses on ways of combining the video channel information with its audio counterpart, in the quest for an audio-visual automatic speech recognition (ASR) system that outperforms audio-only ASR. Such a performance improvement depends both on the audio-visual fusion architecture and on the visual front end, namely, on the extraction of appropriate visual features that contain relevant information about the spoken word sequence. In this project, we concentrate on the latter. We consider a number of visual features, propose new ones, compare them on the basis of lipreading performance, and investigate their robustness to video degradations.

Various visual features have been proposed in the literature; in general, they can be grouped into lip-contour-based and pixel-based ones. In the first approach, the speaker's lip contours are extracted from the image sequence. A parametric or statistical lip contour model is then obtained, and the model parameters are used as visual features. Alternatively, lip contour geometric features are used. In the second approach, the entire image containing the speaker's mouth is considered informative for lipreading, and appropriate transformations of its pixel values are used as visual features.
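To make the pixel-based approach concrete, here is a minimal sketch of one common transform of the mouth-region pixel values: a 2-D discrete cosine transform (DCT), keeping the low-frequency coefficients as the feature vector. The DCT choice, the region size, and the coefficient count are illustrative assumptions, not necessarily the transform used in this project, and the sketch assumes a pre-cropped grayscale mouth region of interest.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(mouth_roi, n_coeffs=32):
    """Pixel-based visual features: 2-D DCT of a grayscale mouth
    region of interest, keeping low-frequency coefficients.
    (Illustrative sketch; not the project's actual front end.)"""
    # Separable 2-D DCT (type II, orthonormal) along both image axes.
    coeffs = dct(dct(mouth_roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    # A zig-zag scan is common; a simple low-frequency block is a
    # reasonable approximation for a sketch.
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return coeffs[:k, :k].ravel()[:n_coeffs]

# Example: a synthetic 32x48 "mouth" image with values in [0, 1].
roi = np.random.default_rng(0).random((32, 48))
feats = dct_features(roi, n_coeffs=32)
print(feats.shape)  # (32,)
```

In practice such coefficients are computed per video frame and concatenated (or augmented with temporal derivatives) before being fed to the recognizer.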
Lip Contour Extraction
The lip contour extraction system is described in detail elsewhere. In its current implementation, two channels of processing are used for each video field: a combination of shape and texture analysis, and a color segmentation, to first locate the mouth and then the precise lip shape. Estimated outer and inner lip contours are depicted. For a single speaker, part of the outer lip contour is missed in less than 0.25% of the processed images. However, inner lip and multi-speaker contour estimation are less robust.
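Once a lip contour is available, geometric features of the kind mentioned above can be derived directly from its points. The sketch below computes three illustrative quantities (mouth width, mouth height, and the area enclosed by the outer contour via the shoelace formula); these are assumptions for illustration, not the project's actual feature set.

```python
import numpy as np

def geometric_features(outer_contour):
    """Geometric lip-contour features from an (N, 2) array of (x, y)
    contour points: mouth width, mouth height, and enclosed area.
    (Illustrative sketch; the actual feature set may differ.)"""
    pts = np.asarray(outer_contour, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    width = x.max() - x.min()
    height = y.max() - y.min()
    # Shoelace formula for the polygon area enclosed by the contour.
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return width, height, area

# Example: a synthetic elliptical contour, 40 px wide and 20 px tall.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
contour = np.column_stack([20 * np.cos(t), 10 * np.sin(t)])
w, h, a = geometric_features(contour)
print(round(w, 1), round(h, 1))  # 40.0 20.0
```

Features like these are attractive because they are low-dimensional and interpretable, but, as noted above, they inherit any failures of the contour estimation step.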
VIDEO DEMO