The synthesizer then used the coefficients to adjust its filter so that it shaped the sound created by the white noise or pulses into a synthesized version of the original sound. The coefficients, excitation signal, and energy level were supplied to the synthesizer every 20 ms and were updated internally every 5 ms, based on the present values and the next set of target coefficients.
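The 5 ms internal update can be pictured as stepping each parameter from its present value toward the next target over the 20 ms frame. The sketch below assumes plain linear interpolation, which the text does not confirm; the function name and the sample values are hypothetical and only illustrate the timing.

```python
FRAME_MS = 20  # a new set of target parameters arrives every 20 ms (from the text)
STEP_MS = 5    # internal update interval (from the text)


def interpolate_frame(current, target, frame_ms=FRAME_MS, step_ms=STEP_MS):
    """Yield the parameter values in effect after each internal update.

    Assumption: straight linear interpolation between the present values
    and the target values. The actual device may have used another rule.
    """
    steps = frame_ms // step_ms  # 4 updates per frame
    for i in range(1, steps + 1):
        frac = i / steps
        yield [c + (t - c) * frac for c, t in zip(current, target)]


# Two made-up parameters moving toward their targets over one frame.
updates = list(interpolate_frame([0.0, 100.0], [8.0, 60.0]))
for n, values in enumerate(updates, start=1):
    print(f"{n * STEP_MS:2d} ms: {values}")
```

With these settings there are four updates per frame, and the last one lands exactly on the target values.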
Altogether there were 13 parameters in each frame of data: the energy, the repeat flag, the pitch period, and the ten reflection coefficients K1 through K10.
The speech data was packed into words of varying size. Table 1 shows the bit assignment for each of the coefficients and the order in which they were packed.
Coefficient | Bits* | Comment |
Energy | 4 | Fh indicates the end of the phrase |
Repeat | 1 | If "1" use the previous K values |
Pitch Period | 5 | If "0" unvoiced - only K1 - K4 used |
K1 | 5 | |
K2 | 5 | |
K3 | 4 | |
K4 | 4 | |
K5 | 4 | |
K6 | 4 | |
K7 | 4 | |
K8 | 3 | |
K9 | 3 | |
K10 | 3 | |
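The packing rules in Table 1 can be sketched as a small frame reader. This is a minimal illustration, not the device's actual logic: it assumes the bits are read most significant bit first, and the names `BitReader` and `read_frame` are hypothetical. It handles only the cases listed in the table.

```python
K_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]  # widths of K1..K10 from Table 1
ENERGY_BITS, REPEAT_BITS, PITCH_BITS = 4, 1, 5
END_OF_PHRASE = 0xF  # energy value Fh marks the end of the phrase


class BitReader:
    """Reads unsigned fields from a byte string, MSB first (an assumption)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read(self, width: int) -> int:
        value = 0
        for _ in range(width):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def read_frame(reader: BitReader) -> dict:
    """Decode one frame in the order given in Table 1."""
    energy = reader.read(ENERGY_BITS)
    if energy == END_OF_PHRASE:
        return {"energy": energy, "end": True}
    repeat = reader.read(REPEAT_BITS)
    pitch = reader.read(PITCH_BITS)
    frame = {"energy": energy, "repeat": repeat, "pitch": pitch,
             "k": [], "end": False}
    if not repeat:  # a repeated frame reuses the previous K values
        count = 4 if pitch == 0 else 10  # unvoiced frames carry only K1..K4
        frame["k"] = [reader.read(w) for w in K_BITS[:count]]
    return frame


# One unvoiced frame (E=5, R=0, P=0, K1..K4 = 3, 1, 2, 7), then end of phrase.
bits = "0101" "0" "00000" "00011" "00001" "0010" "0111" "1111"
reader = BitReader(int(bits, 2).to_bytes(len(bits) // 8, "big"))
first = read_frame(reader)
bits_used = reader.pos
last = read_frame(reader)
print(first, bits_used, last)
```

The unvoiced frame in the example consumes 28 bits before the 4-bit end-of-phrase marker is read.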
The reasoning behind the variation in bits per reflection coefficient has to do with the part of the human vocal tract each coefficient represents. K1 can be thought of as representing the lips, K2 the teeth, and so on, with K10 representing the back of the throat. The lips and teeth move more than the back of the throat, so the coefficients that represent them were assigned more bits.
The actual coefficients in the synthesizer were 12 bits in size. For K1 there were 32 different 12-bit coefficients mapped to the 5-bit data word stored in the ROM. This mapping was done for the specific professional speaker we had hired to speak the words. Since we were using an “analysis/synthesis” technique, it was possible to analyze the complete data set of spoken words and use the statistics of the K1 values from all of the frames to choose the 32 12-bit coefficients that best represented the whole spread of values in the data set. Each of the 32 covered a range of values that we called a bucket. Once the range of values for each bucket was determined, a median value was assigned to represent all of the values within the bucket. The same was done for each of the other coefficients, including the Energy and Pitch Period. The synthesis device (TMC0281) stored the decoding table for the reflection coefficients. The last digit of the part number, a 1 in the case of the Speak N Spell, identified the decoding table used. That is, the table was specific to the voice of the professional speaker.
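The bucket idea can be sketched as follows. The text does not say how the bucket ranges were chosen, so this example assumes equal-population buckets (each bucket holds the same number of analyzed values) and then takes the median of each one, as described above. The function names and the sample data are hypothetical.

```python
import bisect
import statistics


def build_table(samples, bits):
    """Return (upper_edges, table) for a 2**bits-entry decoding table.

    Assumption: buckets are formed by splitting the sorted samples into
    equal-count groups. Each table entry is the median of its bucket.
    """
    levels = 1 << bits
    ordered = sorted(samples)
    if len(ordered) < levels:
        raise ValueError("need at least one sample per bucket")
    edges, table = [], []
    for i in range(levels):
        lo = i * len(ordered) // levels
        hi = (i + 1) * len(ordered) // levels
        bucket = ordered[lo:hi]
        edges.append(bucket[-1])  # largest value that falls in this bucket
        table.append(statistics.median(bucket))
    return edges, table


def encode(value, edges):
    """Map an analyzed value to the index stored in the speech ROM."""
    return min(bisect.bisect_left(edges, value), len(edges) - 1)


def decode(code, table):
    """Look up the representative value, as the synthesizer's table would."""
    return table[code]


# Made-up analysis values and a 2-bit table, to keep the output small.
edges, table = build_table(range(-64, 64), bits=2)
code = encode(-40, edges)
print(edges, table, code, decode(code, table))
```

For K1, `bits` would be 5, giving 32 table entries; a separate table would be built the same way for each of the other parameters.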
This meant that the number of bits per frame ranged from 4 to 49, as shown in Table 2.
Frame | Bits | Coefficients used |
Voiced | 49 | E, R, P, K1 - K10 |
Voiced Repeated | 10 | E, R, P |
Unvoiced | 28 | E, R, P, K1 - K4 |
Unvoiced Repeated | 10 | E, R, P |
End of Phrase | 4 | E |
If you do the math, you’ll note that sending only voiced frames gives a data rate of slightly over 2400 b/s (49 bits every 20 ms is 2450 b/s). But it seemed that, in general, the best quality for this particular implementation of LPC-10 came at approximately 1000 b/s.
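The arithmetic can be checked with a few lines of code. The frame sizes come from Table 2 and the 20 ms frame period from the text; the mixed frame counts in the second example are made up purely to show how repeated and unvoiced frames pull the average rate down, and are not measured data.

```python
FRAMES_PER_SECOND = 50  # one frame every 20 ms
FRAME_BITS = {          # frame sizes from Table 2
    "voiced": 49,
    "voiced_repeated": 10,
    "unvoiced": 28,
    "unvoiced_repeated": 10,
}


def bit_rate(frame_counts):
    """Average data rate in bits per second for a given mix of frame types."""
    total_bits = sum(FRAME_BITS[kind] * n for kind, n in frame_counts.items())
    total_frames = sum(frame_counts.values())
    return total_bits * FRAMES_PER_SECOND / total_frames


all_voiced = bit_rate({"voiced": 1})

# Hypothetical mix of ten frames, for illustration only.
mixed = bit_rate({"voiced": 2, "voiced_repeated": 3,
                  "unvoiced": 1, "unvoiced_repeated": 4})
print(all_voiced, mixed)
```

A stream of only voiced frames comes out at 2450 b/s, while the illustrative mix averages a little under 1000 b/s.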