  • Excitation signal: white noise for unvoiced sounds and a periodic pulsed signal for voiced sounds
  • An energy level for the sound

The synthesizer then used the coefficients to adjust the filter so that it shaped the white noise or pulses into a synthesized version of the original sound. The coefficients, excitation signal and energy level were introduced to the synthesizer every 20 ms. Internally, they were updated every 5 ms, based on the present values and the next set of target coefficients.
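The 5 ms internal update can be pictured as stepping each parameter from its present value toward the next frame's target. The sketch below assumes plain linear stepping, which the text does not specify; the function name and the sample values are hypothetical and chosen only for illustration.

```python
FRAME_MS = 20  # a new set of targets arrives every 20 ms (from the text)
STEP_MS = 5    # internal update interval (from the text)


def interpolate_frame(current, target, frame_ms=FRAME_MS, step_ms=STEP_MS):
    """Yield the parameter values at each internal update.

    Assumption: values move linearly from `current` to `target`.
    The real chip may have used a different interpolation rule.
    """
    steps = frame_ms // step_ms
    for i in range(1, steps + 1):
        yield [c + (t - c) * i / steps for c, t in zip(current, target)]


# Two made-up parameters moving toward their next targets.
updates = list(interpolate_frame([0.0, 8.0], [4.0, 0.0]))
for values in updates:
    print(values)
```

With a 20 ms frame and a 5 ms step there are four updates per frame, and the last one lands exactly on the target values.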

Simplified block diagram of the speech synthesizer.

Altogether there were 13 parameters in each frame of data:

  • Energy
  • Pitch period
  • Repeat indicator
  • 10 Reflection coefficients (K1 - K10)

The speech data was packed in words of variable size. Table 1 shows the bit assignment for each of the coefficients and the order in which they were packed.

Table 1. Description of coefficients

Coefficient     Bits*   Comment
Energy          4       Fh indicates the end of the phrase
Repeat          1       If "1", use the previous K values
Pitch Period    5       If "0", unvoiced: only K1 - K4 used
K1              5
K2              5
K3              4
K4              4
K5              4
K6              4
K7              4
K8              3
K9              3
K10             3

*These bits are expanded to 10 bits for use in the synthesis filter.

The reasoning behind the variation in bits per reflection coefficients has to do with the part of the human vocal tract each coefficient represents. K1 can be thought of as representing the lips, K2 the teeth, and so on where K10 represents the back of the throat. The teeth and lips have more movement than the back of the throat and therefore have more bits assigned to them.

The actual coefficients in the synthesizer were 10 bits in size. For K1, there were 32 different 10-bit coefficients mapped to the 5-bit data word stored in the ROM. This mapping was done for the specific professional speaker we had hired to speak the words. Since we were using an “analysis/synthesis” technique, it was possible to analyze the complete data set of spoken words and use the statistics from all of the frames of data to choose the 32 10-bit K1 coefficients that best represented the whole spread of values in the data set. Each coefficient stood for a range of values, which we called a bucket. Once the range of values for each bucket was determined, a median value was assigned to represent all of the values within the bucket. The same was done for each of the other coefficients, including the Energy and Pitch Period. The synthesis device (TMC0281) stored the decoding table for the reflection coefficients. The last digit of the part number, a 1 in the case of the Speak N Spell, identified the decoding table used; that is, the table was specific to the voice of the professional speaker.
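The bucket idea can be sketched in a few lines. The text does not say how the bucket ranges were chosen, so this example assumes equal-population buckets (each bucket holds the same number of analyzed values) and represents each bucket by its median, as described above. The function names and the sample data are hypothetical.

```python
from statistics import median


def build_codebook(values, bits):
    """Build a 2**bits entry table, one median value per bucket.

    Assumption: buckets are formed by splitting the sorted values
    into groups of (nearly) equal size.
    """
    levels = 2 ** bits
    ordered = sorted(values)
    n = len(ordered)
    if n < levels:
        raise ValueError("need at least one value per bucket")
    codebook = []
    for i in range(levels):
        lo = i * n // levels
        hi = (i + 1) * n // levels
        codebook.append(median(ordered[lo:hi]))
    return codebook


def encode(value, codebook):
    """Return the index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


# Stand-in data: 128 evenly spread values quantized to 5 bits, as for K1.
k1_codebook = build_codebook(range(128), bits=5)
k1_index = encode(10, k1_codebook)
```

The ROM would store only the small index (5 bits for K1), and the decoding table in the synthesizer would map that index back to the full-size coefficient.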

This meant that the number of bits per frame ranged from 4 to 49, as shown in Table 2.

Table 2. Frame description

Frame               Bits   Coefficients used
Voiced              49     E, R, P, K1 - K10
Voiced Repeated     10     E, R, P
Unvoiced            28     E, R, P, K1 - K4
Unvoiced Repeated   10     E, R, P
End of Phrase       4      E
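The frame types follow from reading the fields in the packing order of Table 1 and stopping early when a field says so. The sketch below is one way to express that logic; it is not the chip's actual decoder. It assumes the bitstream is given as a string of '0' and '1' characters with the most significant bit of each field first, which the text does not state, and all names are hypothetical.

```python
# Bit widths in packing order, taken from Table 1.
ENERGY_BITS, REPEAT_BITS, PITCH_BITS = 4, 1, 5
K_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]
END_OF_PHRASE = 0xF


def parse_frame(bits, pos=0):
    """Parse one frame starting at `pos`; return (frame, next_pos).

    Assumption: `bits` is a '0'/'1' string, MSB first within each field.
    """
    def take(n):
        nonlocal pos
        if pos + n > len(bits):
            raise ValueError("bitstream ended mid-frame")
        value = int(bits[pos:pos + n], 2)
        pos += n
        return value

    frame = {"energy": take(ENERGY_BITS)}
    if frame["energy"] == END_OF_PHRASE:
        frame["kind"] = "end of phrase"      # 4 bits
        return frame, pos
    frame["repeat"] = take(REPEAT_BITS)
    frame["pitch"] = take(PITCH_BITS)
    voiced = frame["pitch"] != 0
    if frame["repeat"]:
        # Reuse the previous K values: 10 bits in total.
        frame["kind"] = "voiced repeated" if voiced else "unvoiced repeated"
        return frame, pos
    count = 10 if voiced else 4              # unvoiced frames carry K1 - K4 only
    frame["k"] = [take(n) for n in K_BITS[:count]]
    frame["kind"] = "voiced" if voiced else "unvoiced"
    return frame, pos


# A made-up voiced frame: energy 5, repeat 0, pitch 3, all K indices 0.
voiced_frame, voiced_end = parse_frame("0101" + "0" + "00011" + "0" * 39)
```

The values returned for energy, pitch and K1 - K10 are table indices, not the coefficients themselves; they would still need to go through the decoding table.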

If you do the math, you'll note that sending only voiced frames gives a data rate slightly over 2400 b/s: 49 bits every 20 ms is 2450 b/s. In practice, though, the best quality for this particular implementation of LPC-10 generally seemed to come at approximately 1000 b/s.
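The arithmetic can be checked with a short calculation using the frame sizes from Table 2 and the 20 ms frame period. The mixed-frame example is invented purely to show how repeated and unvoiced frames pull the average down; it is not measured Speak N Spell data, and the names are hypothetical.

```python
FRAME_MS = 20  # frame period from the text
FRAME_BITS = {  # bits per frame type, from Table 2
    "voiced": 49,
    "voiced_repeated": 10,
    "unvoiced": 28,
    "unvoiced_repeated": 10,
}


def bit_rate(frame_counts, frame_ms=FRAME_MS):
    """Average bits per second for a given count of each frame type."""
    total_frames = sum(frame_counts.values())
    total_bits = sum(FRAME_BITS[kind] * n for kind, n in frame_counts.items())
    return total_bits * 1000 / (total_frames * frame_ms)


# One second of speech (50 frames) sent entirely as voiced frames.
all_voiced = bit_rate({"voiced": 50})

# A made-up one-second mix that leans on repeated frames.
example_mix = bit_rate({
    "voiced": 10,
    "unvoiced": 10,
    "voiced_repeated": 15,
    "unvoiced_repeated": 15,
})
print(all_voiced, example_mix)
```

Here `all_voiced` comes out at 2450 b/s, while the invented mix lands at 1070 b/s, which is in the neighborhood of the 1000 b/s figure mentioned above.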

Source:  OpenStax, The speak n spell. OpenStax CNX. Jan 31, 2014 Download for free at http://cnx.org/content/col11501/1.5