
Overview of the Modern Algorithm:

The speech recognition system consists of speech segmentation, feature extraction, and optimal feature-matching with a trained library of stored features.

Speech Segmentation:

Speech segmentation is fairly uniform across systems: a string of spoken words is split into its individual components. A simple way to do this is to segment at the points where the short-time power of the sampled signal drops to (nearly) zero, that is, at the pauses between words.
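
As a rough illustration of power-based segmentation, the sketch below cuts a signal wherever the mean power of a frame falls to or below a threshold. The function name `segment_by_power`, the frame length, and the threshold are assumptions made for this example; they do not come from the project itself.

```python
def segment_by_power(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of frames whose
    mean power exceeds `threshold` (hypothetical helper)."""
    segments = []
    start = None  # start index of the segment currently being tracked
    for begin in range(0, len(samples), frame_len):
        frame = samples[begin:begin + frame_len]
        power = sum(x * x for x in frame) / len(frame)
        if power > threshold and start is None:
            start = begin                    # speech begins
        elif power <= threshold and start is not None:
            segments.append((start, begin))  # speech ends at this quiet frame
            start = None
    if start is not None:                    # signal ended mid-segment
        segments.append((start, len(samples)))
    return segments


# Two bursts of "speech" separated by silence.
signal = [0.0] * 100 + [1.0, -1.0] * 50 + [0.0] * 100 + [0.5, -0.5] * 50 + [0.0] * 100
words = segment_by_power(signal, frame_len=50, threshold=0.01)
```

In real recordings the power between words rarely reaches exactly zero, so the threshold would have to sit above the background noise level.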

Feature Extraction:

Feature extraction may be done in a variety of ways, depending on the features one chooses to extract. The industry standard is to extract the coefficients that collectively represent the short-term power spectrum of the recorded sound, known as mel-frequency cepstral coefficients (MFCCs). MFCCs are derived by:

  1. Fourier transforming a windowed (usually Hamming-windowed) excerpt of the signal
  2. Mapping the powers of the obtained spectrum onto the mel scale using triangular overlapping windows; the mel scale is a perceptual scale of pitches, roughly logarithmic in frequency, that models human hearing more accurately than a linear frequency scale
  3. Taking the log of the power at each mel frequency
  4. De-correlating the resulting log mel spectrum with a discrete cosine transform
  5. Extracting the MFCCs as the amplitudes of the resulting spectrum (the cepstrum)
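
The five steps above can be sketched for a single frame as follows. This is a minimal illustration, not the project's implementation: the mel formula 2595·log10(1 + f/700) is one common convention, the filter count (20), the number of coefficients (12), and the small floor added before the log are assumptions, and a naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # one common mel convention


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate, n_filters=20, n_coeffs=12):
    n = len(frame)
    # Step 1: Hamming window, then power spectrum via a naive DFT.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        power.append(abs(acc) ** 2 / n)
    # Step 2: triangular filters whose edges are equally spaced in mel.
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n / sample_rate
             for i in range(n_filters + 2)]       # fractional DFT bin positions
    log_energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k in range(n_bins):
            if lo < k <= mid:
                energy += power[k] * (k - lo) / (mid - lo)   # rising slope
            elif mid < k < hi:
                energy += power[k] * (hi - k) / (hi - mid)   # falling slope
        # Step 3: log of the power in each mel band (floor avoids log(0)).
        log_energies.append(math.log(energy + 1e-10))
    # Steps 4 and 5: DCT-II of the log energies; its amplitudes are the MFCCs.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(n_coeffs)]


# Example: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
features = mfcc_frame(tone, sample_rate=8000)
```

A full front end would slide this over overlapping frames of each segmented word, producing one feature vector per frame.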

MFCC feature-extraction is typically used in conjunction with Hidden Markov Model feature-matching.

Prior to MFCCs, speech recognition systems used linear predictive coding (LPC). LPC treats sibilants and plosives as occasional anomalies and models the rest of the signal as a source shaped by the formants, the resonances of the vocal tract. Over a short time window, each sample is predicted as a linear combination of the samples preceding it. The prediction coefficients capture the formants, which can then be removed by inverse filtering, and the coefficients themselves serve as the extracted features.
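
To make the linear-prediction idea concrete, here is a minimal sketch that estimates the predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. The source does not say which estimation method was used, so this choice, like the function name `lpc`, is an assumption for illustration.

```python
def lpc(signal, order):
    """Estimate coefficients a[1..order] such that
    x[n] ~ a[1]*x[n-1] + ... + a[order]*x[n-order].
    Returns (coefficients, residual prediction error energy)."""
    n = len(signal)
    # Autocorrelation at lags 0..order.
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err
```

Applied to a frame of voiced speech, the returned coefficients describe the all-pole filter that approximates the vocal tract; filtering the frame with the inverse of that filter leaves the residual.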

Feature-matching:

Feature-matching was traditionally implemented via dynamic time warping (DTW), which allows sampled words to be matched with stored templates even when they are stretched or compressed relative to each other by differences in speed and timing. This technique has fallen out of favor, displaced by the current industry gold standard of speech recognition: the Hidden Markov Model (HMM).
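
For reference, DTW template matching can be sketched in a few lines. The example assumes each word is a sequence of feature vectors (tuples) compared with Euclidean distance; the names `dtw_distance` and `recognize` and the toy templates are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(frames, templates):
    """Return the label of the stored template closest to `frames`."""
    return min(templates, key=lambda label: dtw_distance(frames, templates[label]))


templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(2.0,), (1.0,), (0.0,)],
}
slow_yes = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]  # same word, spoken slowly
best = recognize(slow_yes, templates)
```

Because a frame may be matched to several frames of the other sequence, the slowly spoken word still aligns with its template at zero cost.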

Because speech signals are approximately stationary over short time intervals, modeling them as HMMs is feasible. HMMs also offer great advantages over DTW: they can be trained extensively on recorded speech, which makes for a far more robust recognition system. The HMM-based approach is complex, but at the highest level it involves the following:

  1. Each word is broken down into phonemes, its smallest distinctive sound units, and an HMM with its output distributions is trained and stored for each phoneme beforehand
  2. The HMM outputs a sequence of n-dimensional real-valued vectors of MFCCs, one every few milliseconds
  3. Each state of the HMM has an output distribution, a mixture of diagonal-covariance Gaussians, that gives the likelihood of each observed vector
  4. The HMM for a target word is then built by concatenating the stored HMMs of its phonemes
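
The outline above can be illustrated with a heavily simplified sketch: each phoneme model here has a single state with a single Gaussian, transitions are strictly left-to-right with a fixed self-loop probability, and words are scored with the Viterbi approximation. All names, parameters, and the toy phoneme models are assumptions for this example; real systems train these values from data and typically use several states and mixture components per phoneme.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_gmm(x, mixture):
    """Log likelihood of x under a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, mean, var) for w, mean, var in mixture]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp


def word_model(phonemes, phoneme_states):
    """Concatenate the stored per-phoneme state lists into one word HMM."""
    states = []
    for p in phonemes:
        states.extend(phoneme_states[p])
    return states


def viterbi_log_score(frames, states, stay=0.5):
    """Best-path log score for a left-to-right HMM that starts in the
    first state and must end in the last one."""
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gmm(frames[0], states[0])
    for x in frames[1:]:
        updated = []
        for s, mixture in enumerate(states):
            best = score[s] + math.log(stay)                 # remain in state s
            if s > 0:                                        # or advance from s-1
                best = max(best, score[s - 1] + math.log(1.0 - stay))
            updated.append(best + log_gmm(x, mixture))
        score = updated
    return score[-1]


# Toy phoneme models: one state each, one 2-D Gaussian per state.
PHONEME_STATES = {
    "a": [[(1.0, (0.0, 0.0), (1.0, 1.0))]],
    "b": [[(1.0, (5.0, 5.0), (1.0, 1.0))]],
}
frames = [(0.1, -0.2), (0.0, 0.3), (4.8, 5.1), (5.2, 4.9)]
score_ab = viterbi_log_score(frames, word_model(["a", "b"], PHONEME_STATES))
score_ba = viterbi_log_score(frames, word_model(["b", "a"], PHONEME_STATES))
```

Recognition then amounts to scoring the observed MFCC vectors against each candidate word model and picking the highest score; here the frames fit the word "a b" far better than "b a".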

Source:  OpenStax, Elec 301 project: voice recognition. OpenStax CNX. Dec 19, 2011 Download for free at http://cnx.org/content/col11396/1.3