A study on selecting and optimizing perceptually relevant features for automatic speech recognition

Authors: Christos Koniaris; W. Bastiaan Kleijn; Richard Heusdens; KTH

Abstract: The performance of an automatic speech recognition (ASR) system depends strongly on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. This work is motivated by the fact that humans perform better at speech recognition than machines, particularly in noisy environments. The goal of this thesis is to use knowledge of human perception in the selection and optimization of speech features for speech recognition. Papers A and C show that robust feature selection for speech recognition can be based on models of the human auditory system. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool for selecting features. Whereas conventional methods optimize classification performance, the new feature selection method exploits knowledge implicit in the human auditory system, inheriting its robustness to varying environmental conditions. The proposed algorithm shows how the feature set can be learned from perception alone, by establishing a measure of goodness for a given feature based on a perturbation analysis and distortion criteria derived from psycho-acoustic models. Experiments with a practical speech recognizer confirm the validity of the principle. In Paper B, the perceptually relevant objective criterion is used to define new features. Again, the motivation originates in the human peripheral auditory system, which plays a major role in processing the input speech signal before it reaches the central auditory system of the brain, where recognition occurs. While many feature extraction techniques incorporate knowledge of the auditory system, the procedures are usually designed for a specific task, and they lack the most recently gained knowledge of human hearing.
Paper B shows an approach to improving mel-frequency cepstral coefficients (MFCCs) through off-line optimization. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs in both clean and noisy conditions.
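
The selection principle described for Papers A and C (choosing features whose Euclidean geometry resembles that of the perceptual domain under small perturbations) can be illustrated with a minimal sketch. This is not the algorithm from the thesis: the function names `geometry_mismatch` and `greedy_select`, the least-squares mismatch with an optimal scale factor, the greedy search, and the toy numbers are all assumptions made for illustration. The sketch assumes that, for each perturbation of a speech frame, one already has the distortion predicted by an auditory model and the squared change of every candidate feature.

```python
import numpy as np


def geometry_mismatch(feature_sq_dists, perceptual_dists):
    """Relative least-squares mismatch between squared feature-domain
    distances and perceptual distortions, after fitting one scale factor.
    (Hypothetical criterion, chosen for illustration only.)"""
    f = np.asarray(feature_sq_dists, dtype=float)
    p = np.asarray(perceptual_dists, dtype=float)
    denom = np.dot(f, f)
    if denom == 0.0:
        return 1.0  # an all-zero feature set explains none of the distortion
    scale = np.dot(f, p) / denom  # closed-form least-squares scale
    return float(np.sum((scale * f - p) ** 2) / np.sum(p ** 2))


def greedy_select(per_feature_sq_diffs, perceptual_dists, k):
    """Greedily pick k feature indices whose summed squared changes
    (the squared Euclidean distance in the selected subspace) best
    match the perceptual distortions across all perturbations.

    per_feature_sq_diffs: array of shape (n_perturbations, n_features)
    perceptual_dists:     array of shape (n_perturbations,)
    """
    diffs = np.asarray(per_feature_sq_diffs, dtype=float)
    selected = []
    remaining = list(range(diffs.shape[1]))
    for _ in range(k):
        best = min(
            remaining,
            key=lambda j: geometry_mismatch(
                diffs[:, selected + [j]].sum(axis=1), perceptual_dists
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (invented): four perturbations, three candidate features.
# The perceptual distortion equals the summed squared change of
# features 0 and 1, so those two should be selected.
toy_sq_diffs = np.array(
    [[1.0, 0.0, 5.0],
     [0.0, 1.0, 5.0],
     [1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0]]
)
toy_perceptual = toy_sq_diffs[:, 0] + toy_sq_diffs[:, 1]
toy_selection = greedy_select(toy_sq_diffs, toy_perceptual, k=2)
```

In this toy case the third feature reacts strongly to perturbations that the perceptual measure barely registers, so it is left out. Note that no class labels or recognition results enter the criterion, which mirrors the abstract's point that the feature set is learned from perception alone.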