After we gathered information about probabilistic interpretations of speech recognition, we were ready to begin building the project. Starting with the basics, we built a simple program that takes a three-second signal as input and produces as output an estimate of the signal's formants. By applying an auto-regressive (AR) model, we were able to come up with a reasonable transfer function modeling the speech. However, we encountered a slight problem.
If we estimated the transfer function using a low-order auto-regressive model, the associated frequency response would be too smooth. That is, it could mask some of the peaks and cost us valuable information for determining the formants. On the other hand, if we used a high-order model, the intricacies of the high-order rational transfer function would introduce too much variation in the frequency response, and we would not be able to tell which peaks were true formants and which were just "overestimation" by our model. This is demonstrated in the figure below for the test speech signal "hat.mat".
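A minimal MATLAB sketch of this trade-off might look like the following. It is only an illustration: the assumption that hat.mat contains a signal vector x and a sampling rate fs, and the specific orders 10 and 50, are ours, not the original project code.

```matlab
% Compare a low-order and a high-order AR (LPC) fit of the same segment.
load('hat.mat');                        % assumed to hold a vector x and rate fs
x = x(:) .* hamming(length(x));         % window the three-second segment

for p = [10 50]                         % low-order vs. high-order AR model
    a = lpc(x, p);                      % AR coefficients of order p
    [H, f] = freqz(1, a, 1024, fs);     % frequency response of the all-pole model
    plot(f, 20*log10(abs(H))); hold on;
end
legend('order 10: too smooth', 'order 50: spurious peaks');
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
```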
This eventually started to look like an optimization problem, and since none of us had much relevant expertise, we approached it slightly differently. Instead of solving a case-by-case parameter-optimization problem with unnecessarily difficult math, we abstracted the problem in terms of something we knew: polynomial regression.
Taylor's theorem tells us that any sufficiently smooth function can be approximated by a polynomial, whose order determines the precision of the approximation. As with the AR model order, increasing the order of the polynomial approximation preserves more information from the signal. Since we were more familiar with Taylor series than with machine learning and statistical approximation, we chose to rely more heavily on polynomial regression than on the AR model.
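The order/precision relationship can be seen in a small toy example (the test function and orders below are purely illustrative):

```matlab
% Higher polynomial order -> smaller residual on a smooth test function.
t = linspace(0, 2*pi, 200)';
y = sin(t) .* exp(-0.3*t);              % a smooth "signal" to approximate

for n = [3 9]                           % low- and higher-order polynomial fits
    [p, ~, mu] = polyfit(t, y, n);      % centering/scaling keeps the fit stable
    fprintf('order %2d: residual norm = %.4f\n', n, norm(y - polyval(p, t, [], mu)));
end
```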
With this in mind, we used MATLAB to solve a generalized least-squares regression problem, giving us a solution that is both accurate and precise: a high-order (order 30) AR model provides the accuracy we needed, and a moderate-order (order 120) polynomial regression provides the precision we needed. This is illustrated in the figure below for "hat.mat".
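A rough sketch of the combined step is shown below. Only the model orders (AR order 30, polynomial order 120) come from the text; the variable names, the choice of a Chebyshev basis to keep the order-120 least-squares fit numerically well conditioned, and the peak-picking step are our assumptions about how such a pipeline could be written.

```matlab
% High-order AR model, then a smoothing polynomial regression, then peak picking.
load('hat.mat');                          % assumed to hold a vector x and rate fs
x = x(:) .* hamming(length(x));

a = lpc(x, 30);                           % order-30 AR model: detailed but jagged
[H, f] = freqz(1, a, 2048, fs);
magdB = 20*log10(abs(H));

% Order-120 polynomial regression on the log-magnitude response, posed as a
% linear least-squares problem in a Chebyshev basis.
u = 2*f/(fs/2) - 1;                       % map [0, fs/2] onto [-1, 1]
A = cos(acos(u) * (0:120));               % 2048-by-121 design matrix
c = A \ magdB;                            % least-squares coefficients
smoothdB = A * c;

[~, locs] = findpeaks(smoothdB);          % peaks of the smoothed curve ~ formants
formants = f(locs)
```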
After estimating the formants of the signal, we compared them against predetermined average vowel formants found on the internet: we implemented a matched filter that finds the vowel whose theoretical formant values are most similar to the measured ones. Our filter is based on the method of least squares, so it chooses the vowel for which the mean squared difference is smallest. We plotted the mean-squared differences for hat.mat below; note that the lower the bar, the closer the match between the corresponding vowel and the signal.
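A minimal sketch of this decision step follows. The table of average formant values and the measured F1/F2 pair below are placeholders for illustration only; the project used values found online and the formant estimates produced earlier.

```matlab
% Least-squares vowel match: smallest mean squared formant difference wins.
vowels = {'hat', 'bet', 'beet', 'bought', 'boot'};
avgF   = [ 660 1720;  530 1840;  270 2290;  570 840;  300 870 ];  % F1, F2 in Hz

measured = [660 1710];                 % example F1, F2 estimates for "hat.mat"
mse = mean((avgF - measured).^2, 2);   % mean squared difference per vowel

[~, best] = min(mse);                  % smallest difference = closest vowel
bar(mse); set(gca, 'XTickLabel', vowels); ylabel('Mean squared difference');
fprintf('Closest vowel: %s\n', vowels{best});
```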