Abstract
More than 6000 languages are spoken in the world today. Many languages have disappeared, or are in danger of disappearing, because of various forces acting on the people who speak them. Microsoft, Google and Apple have recently started to push speech processing on all our electronic devices. Current speech processing methods cover only a handful of languages, and adding a new one requires considerable resources. What's more, to my knowledge there is no readily available support for people with speech impairments. Four observations together support the view that fewer and fewer languages will be spoken in the future: few languages are supported, new languages are difficult to add, history shows that certain pressures drive languages to extinction, and speech processing is being pushed onto digital devices at great speed. The solution is simple: make it easier to add new languages.
A great deal of research was done on digital signal processing (DSP) methods for automatic speech recognition and production before the current statistics-based methods became mainstream. That research seems to have had no practical outcome because a "better" method came along. I propose to combine some of that DSP research with the power of neural networks to implement a speech processor whose efficacy is comparable to that of current methods but which needs fewer resources to add a new language.
I have already written and tested a DSP procedure that extracts the times at which vowels occur in a recording. I have also written and tested a procedure that identifies the formants associated with the vowels in the intervals found by the first procedure. I have executed a series of training and validation runs using a Long Short-Term Memory (LSTM) neural network, with encouraging preliminary results. Since I am not feeding the LSTM fixed-frame-size input, I will need to add Connectionist Temporal Classification (CTC) to the neural network architecture to be able to identify the vowel being spoken. I am now in the process of implementing the CTC.
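As a rough illustration of why CTC suits input of variable length, the sketch below shows only the simplest CTC decoding step, usually called best-path or greedy decoding: take the highest-scoring label in each frame, merge consecutive repeats, and drop the blank symbol. It is not the implementation described here. The label set, the blank index and the per-frame scores are invented for the example; a real system would obtain the scores from the trained LSTM.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of scores per frame, one score per label.
    labels:       label names; labels[blank] is the CTC blank symbol.
    Returns the decoded label sequence as a list.
    """
    # 1. Pick the index of the highest-scoring label in every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    # 2. Merge consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return decoded


# Hypothetical label set: index 0 is the blank, the rest are vowels.
LABELS = ["-", "a", "i", "u"]

# Hypothetical per-frame scores (six frames, four labels each).
FRAMES = [
    [0.10, 0.80, 0.05, 0.05],  # a
    [0.20, 0.70, 0.05, 0.05],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.03, 0.02],  # blank (separates two distinct "a" sounds)
    [0.10, 0.60, 0.20, 0.10],  # a
    [0.10, 0.10, 0.70, 0.10],  # i
    [0.10, 0.10, 0.70, 0.10],  # i (repeat, merged)
]

print(ctc_greedy_decode(FRAMES, LABELS))  # ['a', 'a', 'i']
```

Six frames collapse to three vowels, so the number of output symbols does not have to match the number of input frames.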
Problem Statement
The ISO 639-3 standard lists 6962 living languages in the world today. Glottolog lists 7536. Whichever way you count them, there are a great many ways in which people communicate with each other. These different ways of communicating are the external manifestation of different ways of thinking and acting. Many say that language is culture and culture is language. Language diversity is a healthy characteristic of human society, just as biodiversity is of life on Earth. Like biological entities, languages evolve and become extinct. History offers many examples of a language evolving and giving birth to a new language or to a group of languages. It also offers many examples of languages going extinct. The reasons vary: military occupation of an area, immigration, emigration and political pressure are some of them. Some pressures are less apparent because they are masked by good intentions or because they are applied over a long period of time.
One example is the slow erosion of the many languages of Italy. Once Tuscan was adopted as the official language that would become Italian, all the others began to be considered “dialects”, even though they had all evolved from a common language, Latin. Up to WWI most of these languages were still largely intact. The war forced people from different language communities to interact, and the only way they could communicate was through the lingua franca, Italian. Mussolini dealt them the final blow by establishing compulsory basic education with the teaching of the Italian language. Since then those who speak a “dialect” have been considered uneducated and shamed for it. This has led to a continued increase in the use of Italian in place of the local languages.
Often, the need to communicate between two or more language communities leads them to adopt a common language, which in the end leads to the demise of at least one of the preexisting languages.
Today there may be a new pressure acting on language diversity, and it is a worldwide one. Speech processing has become available on our technological companion devices and is used more and more frequently. In the years to come this type of communication will be ever more present in our daily lives. The problem is that very few languages are supported by this new technology, and new ones are being added too slowly to cover all languages in an adequate amount of time. Some may say that this will lead to all of us becoming bilingual. That will surely happen in the medium term, but because we will all share a few common languages, and because of a natural tendency to expend less energy, we will eventually all speak the same language both with our devices and with our fellow humans.
People with speech disorders have not been considered at all in the expansion of the technology. To include the speech impaired, one would have to provide a speech model for each type of speech impairment in each language, so the difficulty of adding a language would be at least an order of magnitude greater. While today there is at least partial language support for people without speech impairments, to my knowledge there is no readily available support for the speech impaired.
Current speech processors are based on the assumption that, given a speech waveform, one can calculate the probability that the waveform represents a certain word by applying a statistical model. That statistical model is a list of probabilities of association between words and their waveforms. The ideal model would contain all the words of a language as spoken by all the people who speak, or will ever speak, that language. Obviously that is not possible. The choice of a sample that is representative of a population is a cornerstone of statistical science: if the sample's characteristics are not representative of the population, the results will be biased and may not generalize to the population. This also applies to the current statistics-based speech processors. Their predictive capacity depends on which subset of the ideal model is chosen. If the group of people whose speech is recorded is not sufficiently diverse, the model may not find the correct word for a speech waveform from a speaker outside that group, i.e. it will not generalize. The solution to this problem is to collect as much data as possible. Speech recordings are easily available on the Internet for certain languages, but for most languages there aren't enough recordings to build a speech processor. For some languages the recordings on the Internet may be sufficient in quantity but are often single-topic and so may produce a biased model. Recording people's speech in the field is resource intensive and may also lead to a biased model if the sample is too local.
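The statistical assumption described above is commonly written as follows in textbooks. The symbols are introduced here for illustration only and are not taken from any particular system: X stands for the acoustic observations derived from the waveform and W for a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} P(X \mid W)\, P(W)
```

The second form follows from Bayes' rule, since P(X) does not depend on W. The term P(X | W) has to be estimated from transcribed recordings, which is where the dependence on a large and representative sample of speakers comes from.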
An important part of training a speech model is the transcription, i.e. the association of the waveform with the words. It generally involves a person listening to each and every recording and writing the words that were spoken. The intensity of this task depends on the word list that is used during the recording sessions. If there is a fixed list of words then the transcription is merely a search for differences and a synchronization, otherwise the whole text must be transcribed. If the recordings are taken from the Internet outside of already transcribed speech corpora then they also must be wholly transcribed. The already transcribed speech corpora exist only for the languages that are already included in speech processors, so they are of little use for new languages.
Up to this point I've been talking about the Automatic Speech Recognizer (ASR) part of speech processing. There is also a Text to Speech (TTS) part that has problems of its own. There are two ways to produce speech from text:
- The concatenative system - record a person's voice, split it into its component parts, reproduce a text by recombining the needed parts.
- The parametric system - create a speech waveform for each part of a word using predefined parameters.
The concatenative system has better voice quality but can't be used for different voice intonation unless the voice is recorded once for each intonation. The parametric system is more flexible but has yet to reach the voice quality of the concatenative system.
High Level Description of Solution
Among the freely available research papers I have been able to find, quite a few delve into the recognition of the components of speech through audio signal processing. As far back as 1952 Peterson and Barney showed how the formant frequencies extracted from the audio signal of speech can identify the vowel being spoken. In 1966 Reddy was ready to recognize speech but needed a more powerful computer to do it in real time. This work was done without any neural networks or statistically driven algorithms. In 1967 Andrew Viterbi proposed an algorithm for decoding signals over noisy digital communication links. This algorithm, together with the Markov models first proposed by Andrey Markov [r1] in 1906, began to be applied to the problem of speech recognition, and the combination quickly became the leading method, which it remains today. In the meantime other researchers continued to publish papers on signal processing methods for speech recognition. In 1975 Weinstein, McCandless, Mondshein and Zue reported on a "System for Acoustic-Phonetic Analysis of Continuous Speech". In the 1990s more papers were published that refined the methods of earlier papers and obtained better results, thanks in part to the increasing computational capacity of computers. Around the year 2000 only the Hidden Markov Model (HMM) based system of speech processing remained, and all the research on other methods fell into neglect. The current statistics-based method is so ingrained that in 2006 Ali, Bhatti and Mian, in "Formants Based Analysis for Speech Recognition", considered their method "a novel way to speech recognition" even though it is based on decades-old research.
I propose to use "a novel way" myself. I will combine some of the research on DSP with the power of neural networks to implement a speech processor whose efficacy is comparable to that of current methods but which needs fewer resources to add a new language. The project will consist of an automatic speech recognizer (ASR) and a text-to-speech processor (TTS).
The ASR will consist of two parts. The first, language-independent part will recognize single phonemes from their audio frequency characteristics and output the result as a series of standard phonetic symbols. Using DSP procedures, it will analyze the speech signal and extract speaker-independent data containing the information the neural networks need for phoneme recognition. This part will in fact be two pipelines working in concert, because the DSP procedures for vowels and for consonants will be different; their outputs will then be merged into the final phoneme sequence. The second, language-dependent part will process the series of standard phonetic symbols, applying language-dependent pronunciation and syntax rules to correct for any slurring and to find word boundaries. It will then search the vocabulary for the corresponding words.
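The language-dependent step of finding word boundaries and looking words up in a vocabulary can be sketched as a small dynamic-programming search. This is a deliberately simplified sketch, not the proposed design: it assumes the phoneme sequence is already error free, performs exact matches only (no pronunciation or syntax rules, no correction of slurring), and prefers the segmentation with the fewest words. The function name, the toy vocabulary and its transcriptions are all hypothetical.

```python
def segment_phonemes(phonemes, vocabulary):
    """Split a phoneme sequence into vocabulary words.

    phonemes:   list of phonetic symbols, e.g. ["l", "a", "k", "a", "z", "a"].
    vocabulary: dict mapping a tuple of phonemes to the written word.
    Returns the segmentation with the fewest words, or None if the
    sequence cannot be covered by vocabulary entries.
    """
    n = len(phonemes)
    longest = max((len(key) for key in vocabulary), default=0)

    # best[i] holds the best word list covering phonemes[:i], or None.
    best = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        for start in range(max(0, end - longest), end):
            if best[start] is None:
                continue  # no valid segmentation reaches this start point
            word = vocabulary.get(tuple(phonemes[start:end]))
            if word is None:
                continue  # this span is not a known word
            candidate = best[start] + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


# Toy vocabulary with made-up transcriptions (assumption, for illustration).
VOCABULARY = {
    ("l", "a"): "la",
    ("k", "a"): "ca",
    ("z", "a"): "za",
    ("k", "a", "z", "a"): "casa",
}

print(segment_phonemes(["l", "a", "k", "a", "z", "a"], VOCABULARY))
# ['la', 'casa']
```

Because this part works on phonetic symbols rather than audio, adding a language at this level amounts to supplying a vocabulary and a set of rules rather than new recordings.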
The TTS will generate sound waves with a method typical of the voder [f1] part of a vocoder. The data needed to create the voice for the TTS will be extracted from a real person's voice using the same technology as the ASR. The speaker-specific information that is discarded in the DSP part of the ASR will be used to personalize the output of the voder in the TTS.
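Generating a voice from parameters can be illustrated with a minimal source-filter sketch: a pulse train at the pitch frequency is passed through a cascade of second-order resonators, one per formant. This is a generic formant-synthesis sketch under stated assumptions, not the historical voder circuit and not the proposed implementation. The formant frequencies, bandwidths, pitch and sample rate below are illustrative values chosen for the example.

```python
import math


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one second-order digital resonator."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit pulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.2, sample_rate=8000):
    """formants: list of (frequency_hz, bandwidth_hz) pairs."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


def magnitude_at(signal, freq_hz, sample_rate):
    """Magnitude of a single DFT bin, used only to inspect the result."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(w * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return math.hypot(re, im)


# Illustrative formant values for an open vowel (assumed, not measured).
SAMPLES = synthesize_vowel([(700.0, 80.0), (1100.0, 90.0)])
print(len(SAMPLES))
print(magnitude_at(SAMPLES, 700.0, 8000) > magnitude_at(SAMPLES, 2000.0, 8000))
```

In a sketch like this the vowel is determined by the formant pairs, while pitch and bandwidths are the kind of parameters that could carry speaker-specific character.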
To meet the goal of making it easier to add new languages to a speech processor, I intend to use a method that is less data-hungry than statistical approaches. Neural networks don't need as much data for their pattern recognition abilities to work, especially if there is a preprocessing stage that extracts the relevant information from the raw signal. What's more, because the procedure will recognize single phonemes and not whole words [f2], there is less need to collect data containing every word in the final vocabulary; it is enough to collect words that together contain all the phonemes used in the language.
One may object that the project is overly complex and so may defeat its purpose. Consider, though, that this is only a proof of concept and that any final version will be made easy to use for the end user.
Technology Impact
There was a time in which a world speaking one language was considered a utopia. It was a time when automatic translators were science fiction. Today's automatic translators are not perfect but are becoming more efficient every day. There is no longer any need for all of us to speak the same language to be able to communicate. We have since discovered the enriching nature of diversity, be it biodiversity or language diversity or any other type of diversity. Any technology that will make it easier and faster to include new languages in speech processors can only be good for the health of modern human society. The technology I am proposing is just such a way to ease access to speech technology for all the peoples of the world.
[r1] A.A. Markov, Extension of the law of large numbers to dependent quantities (in Russian), Izvestiia Fiz.-Matem. Obsch. Kazan Univ., (2nd Ser.), 15 (1906), pp. 135–156 [Also [37], pp. 339–61].
[f1] A vocoder encodes the parameters of a speech signal needed for speech intelligibility, transmits them over a communications channel, then the voder part of the vocoder recreates the voice from those parameters.
[f2] Even though current methods use parts of words these parts still contain more than one phoneme so the simplification still holds.