Whenever I tell a new person about what I do – research in Automatic Speech Recognition for Rosetta Stone – the first thing they always say is, “Oh, so you must speak – what, about 30 languages?”
Well, not exactly.
My knowledge of foreign languages is similar to that of many Americans and Canadians. I took Spanish in school. I can hold my own in a simple conversation. I can read Borges in the original. Because I’m Jewish and I had a Bar Mitzvah, I can also read Hebrew and lately I’ve been learning to speak it with Rosetta Stone TOTALe. But that’s it, other than a few phrases I might have picked up while traveling.
So how is it that I can train computers to judge pronunciation in 31 languages?
The thing I usually tell people is this: there are standard acoustic modeling methods we can apply to any language, as long as we have two really important ingredients.
Ingredient number one is training data. This means hours and hours – hundreds of hours – of recordings of native speakers in each language. These training recordings need to match the type of speech we’ll eventually be recognizing – will it be read speech, or spontaneous? What dialect are we expecting? What age will the speakers be? What gender? A mismatch in any of these factors can have a profound effect on how well the recognizer performs. When scoring pronunciation, we also need many hours of recordings from non-native learners of the language to serve as a reference.
And ingredient two is transcripts of those hundreds of hours of recordings. These can either be word-for-word annotations done by native speakers, or they can be close phonetic transcripts written by linguists who have an academic familiarity with the language. Along with these transcripts come dictionaries of word-level pronunciation and comprehensive lists of the unique phonetic sounds each language uses.
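To make the dictionary idea concrete, a pronunciation dictionary is essentially a mapping from words to sequences of phones drawn from the language’s sound inventory. Here’s a minimal sketch in Python – the entries are made-up English examples in ARPAbet-style notation, not an actual Rosetta Stone lexicon:

```python
# A toy phone inventory: the unique phonetic sounds the language uses.
# (Illustrative subset only, in ARPAbet-like symbols.)
phone_inventory = {"hh", "ah", "l", "ow", "w", "er", "d", "r", "iy", "eh"}

# A toy pronunciation dictionary: each word maps to one or more
# pronunciations, each a sequence of phones from the inventory.
lexicon = {
    "hello": [("hh", "ah", "l", "ow")],
    "world": [("w", "er", "l", "d")],
    "read":  [("r", "iy", "d"), ("r", "eh", "d")],  # two valid pronunciations
}

# Sanity check: every phone used in the lexicon must come from the inventory.
for word, pronunciations in lexicon.items():
    for pron in pronunciations:
        assert set(pron) <= phone_inventory, (word, pron)

print(lexicon["read"])  # a word can have more than one phone sequence
```

This is the sense in which the dictionary gives the recognizer an “instant vocabulary”: a list of phone sequences it should expect, without the system ever needing to understand the language.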
With these ingredients, we “teach” the recognizer the sounds of a language by example: by demonstrating thousands of instances of each sound, in context. For those of us who don’t actually speak 31 languages, this is where we come in. With the recordings and their transcripts, we run complex training routines that amass statistics and “learn” the characteristics each sound is expected to have. And with the pronunciation dictionaries, we give the recognizer an instant vocabulary – a relatively complete list of phonetic sequences that it should expect to see. Of course I am simplifying things – even transcription itself is a laborious process, with a subtle art to doing it right. But essentially the set of sounds, words, and recordings is arbitrary. They can come from anywhere, from any language.
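To give a flavor of what “amassing statistics” means, here is a deliberately tiny sketch in Python – invented one-number “features” standing in for real acoustic feature vectors, and per-phone mean/variance in place of real acoustic models; this is an analogy to the training idea, not Rosetta Stone’s actual code:

```python
import statistics

# Toy "training data": one acoustic feature value per frame (a stand-in
# for real multi-dimensional feature vectors), each labeled with the
# phone a transcriber aligned to that frame. All numbers are invented.
labeled_frames = [
    ("aa", 1.9), ("aa", 2.1), ("aa", 2.0),
    ("iy", 5.2), ("iy", 4.8), ("iy", 5.0),
]

# "Training": amass per-phone statistics -- here just mean and variance.
models = {}
for phone in {p for p, _ in labeled_frames}:
    values = [v for p, v in labeled_frames if p == phone]
    models[phone] = (statistics.mean(values), statistics.pvariance(values))

# "Recognition": score a new frame against each phone model; here, by
# distance to the learned mean, a stand-in for a likelihood computation.
def closest_phone(frame_value):
    return min(models, key=lambda p: abs(frame_value - models[p][0]))

print(closest_phone(5.1))  # a frame near the "iy" statistics
```

Nothing in this procedure knows which language the phones belong to – it only needs labeled examples, which is exactly why the same training recipe carries across languages.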
So, no, I don’t speak 31 languages. Not yet, anyway. But our recognizer does need many native speakers and transcribers – thousands of them! – for it to know how a student of a foreign language ought to sound.