Sunday, October 19, 2014

Pitfalls in Current Speech-Recognition Software and Finding Solutions

The underlying problem of speech recognition is straightforward to lay out: how does one create software that recognizes and interprets the variations, nuances, and evolution of natural languages in a deterministic manner? These aspects of language stem from the multiple contexts and meanings that words and groups of words can carry. Deciphering context to ascertain the correct sense of a word, or the meaning of a phrase, requires a kind of innate knowledge and instinct that is very difficult to translate into a programming language, and this greatly increases the complexity of the problem for the computer. In addition to the interpretive problems arising from the polymorphic and amorphous nature of language, there are also technical difficulties in actually recognizing the words a user speaks, as well as in incorporating noise recognition and cancellation to ensure the computer analyzes the right words.

Let’s start by focusing on the two problems highlighted regarding the actual recognition of spoken words. Even in a noise-free environment, recognizing a user’s words is difficult because of the variety of accents that exist. A method typically employed to account for this is a training session, usually lasting half an hour to an hour. The purpose is mutual: the new user becomes accustomed to speaking to the machine, and the machine becomes accustomed to the user’s voice. To speed up this learning, a relational database management system (RDBMS) could track accents by relating them to the user-supplied variables that most strongly determine them. These variables might include ethnicity, date of birth (accents can shift slightly depending on when someone was born in a given region), and where a user has lived and for how long. Improving a machine’s rate of learning is of paramount importance if speech technology is to be implemented in other industries, such as fast-food drive-thrus, where we expect machines to recognize what we are saying almost immediately and where sparing time for a training session is not feasible.
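To make the idea concrete, here is a minimal sketch in Python of what such a schema and lookup might look like, using the standard sqlite3 module. The table names, columns, and the likely_accent_regions helper are illustrative assumptions, not a description of any existing system:

    import sqlite3

    # Illustrative schema: relate each user to the variables that most
    # strongly predict accent (ethnicity, birth year, residence history).
    conn = sqlite3.connect("speech_profiles.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INTEGER PRIMARY KEY,
        ethnicity  TEXT,
        birth_year INTEGER
    );
    CREATE TABLE IF NOT EXISTS residences (
        user_id     INTEGER REFERENCES users(user_id),
        region      TEXT,
        years_lived REAL
    );
    """)

    # A new user can be matched against demographically similar existing
    # users, giving the recognizer a better starting accent model than a
    # blank slate would.
    def likely_accent_regions(conn, ethnicity, birth_year, tolerance=10):
        return conn.execute(
            """SELECT r.region, SUM(r.years_lived) AS total_years
               FROM users u JOIN residences r ON u.user_id = r.user_id
               WHERE u.ethnicity = ? AND ABS(u.birth_year - ?) <= ?
               GROUP BY r.region
               ORDER BY total_years DESC""",
            (ethnicity, birth_year, tolerance)).fetchall()

The point of the relational structure is that accent correlates with combinations of these variables, so a join across them retrieves a plausible prior for a user the system has never heard speak.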

Noise is an important impediment to deal with, because a speech-recognition system takes its input as is and does not inherently know which of the sounds it receives are intentional. To facilitate accurate recognition, these programs could borrow a technique from noise-cancelling headphones: digital signal processing algorithms decompose ambient noise into its constituent sounds, and the device then artificially generates sound waves that interfere destructively with them, negating the noise. The same technique could theoretically be applied here, except that once the incoming sound waves are isolated via digital signal processing, the machine would identify the sounds originating from the user by cross-checking their properties against a built-in database of sounds the user has produced in the past. This would not solve every problem: processing multiple simultaneous sounds may still fail to yield accurate recognition, both because of the limited processing power of current speech-recognition machines and because of a possible lack of data on the user’s speech patterns and properties.
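As a rough sketch of both halves of that idea, assuming audio arrives as NumPy sample arrays and that an average magnitude spectrum of the user’s past speech was stored during training (the threshold value here is an invented tuning parameter):

    import numpy as np

    def anti_noise(noise_samples):
        # Destructive interference: a wave summed with its own inverse
        # cancels, so the anti-noise signal is simply the negated noise.
        return -noise_samples

    def frame_matches_user(frame, user_profile, threshold=0.6):
        # Compare the incoming frame's magnitude spectrum against the
        # stored profile of the user's past speech via cosine similarity.
        # user_profile is assumed to be a unit-normalized spectrum of the
        # same length, built up during the training session.
        spectrum = np.abs(np.fft.rfft(frame))
        spectrum /= np.linalg.norm(spectrum) + 1e-12
        return float(np.dot(spectrum, user_profile)) >= threshold

A real system would need far more than this (overlapping frames, adaptive filtering, richer speaker features), but the sketch shows the division of labor: cancel what is identified as noise, keep what statistically resembles the speaker.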

Finally, we come to the interpretive problems in speech-recognition software, the set of problems most pertinent to linguists. Compared to the problems discussed above, which can be solved through technological means, interpreting context is orders of magnitude more difficult. Context has no systematic structure and evolves continuously with time, which makes it hard to track through a database alone. Consider the word “bank”: it means one thing in “river bank” and another in “bank account,” and only the surrounding words can tell a recognizer which interpretation to prefer. Although it is a relatively young field, machine learning will be indispensable to the future growth of speech-recognition software. Machine learning algorithms would, in theory, help the machine keep up to date with the contexts of current words and phrases, and accuracy would improve progressively as more contexts are introduced. This would shrink the database required while simultaneously improving the efficacy of speech-recognition software.
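As a hedged illustration of that incremental approach, the sketch below uses scikit-learn’s out-of-core tools: a stateless feature hasher, which keeps memory constant no matter how much text arrives, plus a linear classifier updated one batch at a time. The word senses and training sentences are invented for the example:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # HashingVectorizer stores no vocabulary, and partial_fit updates the
    # model incrementally, so past examples need not be kept in a database.
    vectorizer = HashingVectorizer(n_features=2**18)
    classifier = SGDClassifier()
    SENSES = ["financial_institution", "river_side"]  # senses of "bank"

    def learn(sentences, labels):
        classifier.partial_fit(
            vectorizer.transform(sentences), labels, classes=SENSES)

    def predict_sense(sentence):
        return classifier.predict(vectorizer.transform([sentence]))[0]

    learn(["I deposited money at the bank"], ["financial_institution"])
    learn(["we fished along the bank of the river"], ["river_side"])
    print(predict_sense("she opened an account at the bank"))

Each new labeled utterance nudges the model rather than being stored wholesale, which is exactly the trade described above: less data retained, progressively better interpretation.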

In summation, despite the rapid growth and relevance of speech technology, many obstacles remain, especially in speech-recognition software, before it becomes widely incorporated into industry and, eventually, our everyday lives. Better technology and methodologies for filtering and accurately interpreting user input, together with machine-learning algorithms that improve and expand the interpretive capabilities of speech-recognition software, are the pivotal steps that, if taken, will allow us to tap into the vast potential speech technology can provide us in the future.
