
Figures 2 and 3 show examples of the frame-by-frame speech data for two spoken words: “help” in Figure 2 and “free” in Figure 3.

If you look closely at the speech data for the word “help” you will notice that, within a particular voiced sound, the reflection coefficients (the K values) of each frame vary somewhat from those of the neighboring frames. With hand editing by an expert ear, many of these frames could be replaced by the data of a neighboring frame (see frames 8, 9 and 10 in Figure 2). When a frame’s K values are replaced with those of an appropriately chosen frame, a repeat frame can be used, so 10 bits represent the frame rather than the 48 bits of a voiced frame or the 28 bits of an unvoiced frame. Figure 3b shows repeat frames for the synthesized word “free”.
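To make the savings concrete, here is a minimal sketch of the arithmetic, using only the bit counts quoted above (48 bits voiced, 28 bits unvoiced, 10 bits repeat). The eight-frame vowel and the frame rate of 40 frames per second are illustrative assumptions, not figures from this chapter, and the function names are hypothetical.

```python
# Bits per frame type, as quoted in the text above.
BITS_PER_FRAME = {"voiced": 48, "unvoiced": 28, "repeat": 10}


def total_bits(frame_types):
    """Sum the bits needed to store a sequence of frames."""
    return sum(BITS_PER_FRAME[kind] for kind in frame_types)


def bit_rate(frame_types, frames_per_second):
    """Average bits per second for the sequence.

    frames_per_second is an assumed parameter; the text above does not
    state the frame rate.
    """
    return total_bits(frame_types) * frames_per_second / len(frame_types)


# Hypothetical vowel of eight voiced frames, before hand editing.
before = ["voiced"] * 8
# After editing: one full voiced frame followed by seven repeat frames.
after = ["voiced"] + ["repeat"] * 7

if __name__ == "__main__":
    assumed_fps = 40  # assumption for illustration only
    for label, frames in (("before", before), ("after", after)):
        print(label, total_bits(frames), "bits,",
              bit_rate(frames, assumed_fps), "b/s")
```

Under these assumptions the vowel shrinks from 384 bits to 118 bits, which is the kind of reduction that makes repeat frames attractive even before any improvement in sound quality is considered.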

The idea that repeat frames could give better quality became obvious to me as I was editing the spoken phrase “Texas Instruments”. I had trouble making the word “Texas” sound correct. I finally slowed the word down and realized that the “eh” vowel sound in “Texas” had been turned into a diphthong (two different sounds), giving a vowel sound of “aa – eh”. This should have been expected, as the professional speaker was a Texan with a good Texas accent (I’ll tell other stories about the issues we had with a Texas accent later). What was amazing was that I could take any one of the 8 or so frames, repeat it throughout the vowel, and make the word “Texas” sound better. What I was doing was reducing the string of vowel sounds to just one. This edit gave better quality at a lower bit rate.

Coefficients for the word "help" before compression.

The spectral graph of the word “free” before and after data compression.
The speech data for the word “free” after compression. Note that the data rate after compression is 775 b/s.

I've attached an appendix (Appendix 3) for those who would like to know how the speech data was formatted in the ROM. It is not a topic I thought would fit well into this chapter, but one which would be interesting to some readers. With all of this as background, let’s now discuss some of the compromises we made in a bit more detail.

Sample rate

The sample rate determined the frequency response of the sound output. The higher the sample rate, the better the sound would be, particularly for the unvoiced sounds. But the higher the sample rate, the higher the data rate. Although it can be argued that this is not necessarily the case, we understood that more reflection coefficients were needed to capture the data properly at a higher sample rate. The number of reflection coefficients seemed to follow a rule of thumb: the sample rate in kHz plus two. We didn’t have time to evaluate this rule, so we chose the simpler solution of ten coefficients, sampling at 8 kHz.
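The rule of thumb can be written out as a one-line calculation. This sketch assumes that “sample rate” in the rule means the rate in kHz, which is the reading that pairs 8 kHz with ten coefficients; the function name is hypothetical.

```python
def lpc_order(sample_rate_hz):
    """Rule-of-thumb coefficient count: sample rate in kHz plus two.

    Assumes the rule is stated in kHz, which matches the ten
    coefficients chosen for 8 kHz sampling.
    """
    return round(sample_rate_hz / 1000) + 2


if __name__ == "__main__":
    print(lpc_order(8000))  # ten coefficients at 8 kHz, as in the text
```

Under the same assumption, a 10 kHz sample rate would call for twelve coefficients, which illustrates why a higher sample rate pushed the data rate up twice over: more samples to model and more coefficients per frame to store.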

When we began looking for the professional speaker for the product, we noticed that not everyone had an “LPC friendly” voice. One of the early speakers we looked at was marvelous at doing character voices. We used him to give a voice to several of the options we had for the final name of the product. Here are a few of those choices:

Source:  OpenStax, The speak n spell. OpenStax CNX. Jan 31, 2014 Download for free at http://cnx.org/content/col11501/1.5
