Wednesday, October 8, 2014

The Future of Speech Recognition Technology

One of the great parts of Stanford’s linguistics department is how well integrated it is with other academic areas. Our symbolic systems program is a great example. For speech recognition software to improve, we need more programmers with a solid understanding of linguistics.
When I think of examples of speech recognition software, the first thing that springs to my mind is Siri. If I’m being entirely honest, Siri is actually disabled on my phone, but that doesn’t mean I don’t love the idea of speaking at my phone and having it speak back. The idea is there; it’s just not quite developed to a point where it’s easier to ask Siri a simple question than it is to open up a web browser and google the question. When I do ask something of Siri, I always do it twice: First, I use “casual” speech. Siri doesn’t usually understand. Then, I use “careful” speech. This, she can often process.
Siri doesn’t deal well with “ums” and “uhs” and “ers,” though our natural dialogue is littered with them. Sometimes, she gets confused by background noise. Which sound is a voice asking a question, and which is a song playing on the car radio, or construction noise from whatever building is going up outside your dorm? She can differentiate, but not always.
I also mentioned in a comment to another post the importance of contextualizing a speaker. Asking a human being even a relatively simple question, such as “Where can I buy clothing?” or “What movies are playing in theaters right now?”, involves context. I, as a person, won’t send you to a children’s clothing store if you’re an elderly gentleman. I won’t send you to an R-rated horror film if you’re a child in the first grade. Siri cannot make these differentiations. She’ll answer the same way, regardless of who asked.
Though these are some of the issues we currently see with speech recognition software, they’re also some of the things we will hopefully soon see resolved. Speech recognition is a relatively young field, and there are still many strides to be made. Even the comparatively primitive software we see today has only been available to the public for the past two decades.
I’m curious what comes next.
What are the next leaps and bounds in the advancement of speech recognition technology? Is there a time in the foreseeable future when we can expect a version of Siri that learns from our voice? Can she pick up tendencies in our speaking patterns--can she take into account the offbeat way you say the letter “r,” or the unusual accent with which you speak?
What about on the frontier of translation software? Is it feasible to imagine traveling to a foreign country with only a smartphone as a translator? Could we put this software to use in emergency settings, such as hospitals? It’s hard to imagine all the components that would be involved in such a sophisticated piece of technology. It would need to differentiate between accents, and be able to comprehend voices that could easily be garbled with panic.

What do you think? What will the next big strides be at the intersection of linguistics and technology? What obstacles will stand in the way?

12 comments:

  1. When you mentioned Siri’s shortcomings, it made me feel quite relieved that we are still a ways off from having an operating system as highly intelligent as the one in that movie, Her, from last year, the premise of which – that a computing machine can become someone’s friend and much more – I found simply revolting! My luddite tendencies aside, I do find it fascinating to dwell on the challenges that remain for speech recognition software to overcome, and how we may overcome them. To respond to your first question, I think the biggest challenges for speech recognition software would come from semantics and pragmatics. As you put it quite nicely, natural language is “littered” with stuff that makes comprehension difficult and difficult to program.

    For one thing, natural language contains many phrases that function more as an emotional or interpersonal strategy than as an epistemic tool that conveys hard facts. When two people are talking, for example, the listener will often say things like “yeah” and “right”, which have literal meanings of the sort we can look up in a dictionary, but whose importance, in the context of the conversation, lies more in their interpersonal function, which is to convey that the listener is paying attention or wishes to validate the speaker’s thoughts. Aside from “yeah” and “right”, there are also paralinguistic backchannels, like “hmm” and head nods, which don’t have a literal meaning of the sort you can look up in the dictionary, but which function the same way to convey the listener’s keen intent to pay attention and validate the speaker’s thoughts. Such things as intention, desires, wishes, etc. would be difficult to teach to a machine. Another example is the phrase “I don’t know,” which has a literal meaning and epistemic value insofar as it conveys that the speaker does not have knowledge of some topic. But in everyday speech, we often use “I don’t know” to signal more than this epistemic content. Sometimes, “I don’t know” means “it doesn’t matter” (e.g. Person A: “Where do you want to go for dinner?” / Person B: “I don’t know – maybe we could try that new Thai place down the road.”). Sometimes, “I don’t know” is said to lessen the blow of a dissenting opinion and preserve social harmony (e.g. Person A: “The President is handling this crisis so badly.” / Person B: “I don’t know – he’s doing all he can, given the state of things there.”). These are just some of the many examples of contexts in which “I don’t know” doesn’t “mean” merely what it says.

  2. What a good read! Your post highlights a very important point - the state of natural language processing is a direct reflection of our knowledge of language. That is to say, the fact that we can't encode simple rules into a computer so that it can effectively decode sound into meaning gives us a good idea of what our true knowledge of the science of language really is. Yet this is a good thing! It means we have a lot to do and that there's a long way to go!

    It is utterly fascinating that we strive to replicate our behavior in computers without understanding it thoroughly ourselves. And building off of what Marianne said, I do agree that there's a certain awkward sensation in the idea that machines and humans can have a relationship defined as 'friendship'. Yet I have to say, I believe that these technologies will ultimately not only give us further insight into our own inner workings, but also save us enormous amounts of time.

    That argument aside, I do agree that it's absolutely wonderful to have as interdisciplinary a program as we have, and moreover, that these problems cannot possibly be solved without interaction between disciplines.

  3. I think the idea of a Siri or comparable AI system that learns and adapts itself to our voice and preferences is intriguing and completely plausible in the near future. We already have technology that continually adapts itself to a specific user, such as Pandora learning a person’s individual music taste or Facebook showing us posts from friends with whom we interact the most. Therefore, I think this general scheme could easily apply to Siri. She could learn our interests over repeated use and show us responses more in tune with them, artificially incorporating the personal context component of understanding speech. While I think it would be unrealistic for a computer program to learn and understand perfectly all of the variation in human speech, it would be realistic to customize her speech recognition based on the gender, age, and origin of the user. This would allow Siri to home in on the more common pronunciations and speech patterns associated with those attributes from the beginning, and then learn more user-specific speech patterns in the same way she, Google, and Facebook learn our fashion and movie preferences. Tone, context, and intent would be more difficult to program into a machine, but would be less of a concern, as Siri would most likely not be used as a conversation partner and therefore wouldn’t need to be able to recognize sarcasm or colloquialisms.

  4. I feel like the issues discussed here could be solved in either of two ways. The first, and much simpler, way seems to be the approach utilized in most speech recognition technology. This solution is simply a matter of improving the machine’s ability to identify words, specifically in casual, everyday speech. With a large enough dictionary and a programmed understanding of grammar, this type of machine would be fairly effective in achieving what we want it to do.

    The second method focuses on machine learning. This method, which I believe to be the ultimate goal, greatly resembles true artificial intelligence in that it would model the way that children acquire language. Each machine would start with the basic capability to hear the user and gradually learn not only definitions and common usages of words, but also how to interpret inflection, voice type (male, female, old, young, etc.), and other intangibles that dominate our usage of speech. While we may be a long ways away from this type of technology, I believe it would be the most effective personal assistant, while simultaneously uncovering mysteries of our own race.

  5. I am not sure if we're far away from having voice recognition truly understand and interact with us. A few years ago a computer passed the Turing Test, which tests whether a computer can pass as human. Then again, there was a lot of criticism because of the premise behind that particular program. What do you all think about the Turing test in terms of voice recognition? Is it a valid test?

    On another note, I agree with Laura that we end up speaking "carefully" rather than "casually." Additionally, we find ourselves speaking in standard English. I wonder when computers will be able to understand slang, humor, and sarcasm. Computers would need to have enough AI to develop interpretations based on context to guess what new words mean. Right now we don't develop a relationship with Siri because it doesn't understand us. Once we develop a machine that can pick up on any patterns we develop, guess words, and have a more "natural" conversation, it is more likely that people will subconsciously develop some slight attachment to it.

  6. The issue of integrating computer science and linguistics is an interesting one. Computer scientists have long dismissed linguistic knowledge as being of little use in natural language processing, especially in machine translation. After all, they argue, it is difficult for computers to understand syntactic theories, and "our statistical models, which discard current syntactic frameworks altogether, work equally well."

    Well, maybe. But as far as speech recognition is concerned, such an approach apparently won't work. From the very beginning, computer scientists have to apply knowledge of acoustic phonetics to analyze the different features in a given sound wave and "translate" it into phonemes that the algorithm can understand. Yet, as this post points out, this is far from enough. Instead of requiring every user to speak carefully and slowly, we have to allow our algorithms to accommodate the variation that is widespread in human speech. This step requires systematic studies of phonological rules (how to derive the underlying representation from the raw surface form). Therefore, computer scientists and linguists should definitely collaborate more.
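    To make that first step concrete, here is a minimal sketch of going from a sound wave to frame-level acoustic features and then to phoneme-like labels. It assumes the librosa and scikit-learn libraries, a 16 kHz WAV file, and a toy "acoustic model" trained on fabricated data purely for illustration; a real recognizer would be trained on hours of transcribed speech.

```python
# Sketch: turn a sound wave into frame-level acoustic features (MFCCs),
# then classify each frame into a phoneme-like label with a toy model.
# The training data below is fabricated purely for illustration.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def frame_features(path, sr=16000, n_mfcc=13):
    """Load audio and return one MFCC feature vector per analysis frame."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                  # shape: (n_frames, n_mfcc)

# Toy "acoustic model": random vectors standing in for real labelled frames.
rng = np.random.default_rng(0)
phonemes = ["AA", "IY", "S"]                       # illustrative label set
X_train = rng.normal(size=(300, 13))
y_train = np.repeat(phonemes, 100)
acoustic_model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

if __name__ == "__main__":
    frames = frame_features("example.wav")         # any 16 kHz WAV file (assumed)
    labels = acoustic_model.predict(frames)        # one phoneme-like label per frame
    print(labels[:20])
```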

  7. The main problem I have encountered with Siri is her answering questions in the wrong context, as you have described above. One possibility that could get Siri closer to putting the posed question into context based on the speaker’s age demographic is to program Siri to recognize one’s pitch or timbre (a rough sketch of this idea appears at the end of this comment). Although this method won’t work all the time, ideally Siri would be able to differentiate between children (those with high-pitched voices), women (those with medium to high-pitched voices), and men (those with lower-pitched voices). Alternatively, Siri could be programmed to recognize the speed at which people speak, assuming that the young and the elderly speak more slowly than other adults do; however, this wouldn’t solve the problem, since most people speak very carefully when communicating with Siri in order to avoid misinterpretation.

    Another obstacle is that both native English speakers and those for whom English isn’t their native language have different accents and ways of speaking. My high school Spanish teacher, originally from Bogotá, Colombia, used Siri to practice her English skills but ultimately disabled Siri after getting fed up with the constant misinterpretation due to her heavy Colombian accent. I predict that the performance of Siri and other AI interfaces will improve once the sound database becomes more inclusive of people with accents.
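    Here is a rough sketch of the pitch-bucketing idea mentioned above, assuming the librosa library and a mono WAV recording; the Hz cut-offs are illustrative guesses rather than validated thresholds, and timbre is ignored entirely.

```python
# Sketch: estimate a speaker's median fundamental frequency (pitch)
# and bucket it into a coarse demographic guess. The cut-off values
# are illustrative assumptions, not empirically validated thresholds.
import numpy as np
import librosa

def guess_speaker_group(path):
    y, sr = librosa.load(path, sr=16000)
    # YIN pitch tracking over a plausible range for human voices
    f0 = librosa.yin(y, fmin=60, fmax=500, sr=sr)
    median_f0 = float(np.median(f0))
    if median_f0 > 250:          # rough guess: child or very high-pitched voice
        group = "child"
    elif median_f0 > 165:        # rough guess: typical higher-pitched adult range
        group = "adult (higher-pitched)"
    else:                        # rough guess: typical lower-pitched adult range
        group = "adult (lower-pitched)"
    return median_f0, group

if __name__ == "__main__":
    f0, group = guess_speaker_group("question.wav")   # any mono WAV file (assumed)
    print(f"median F0 = {f0:.0f} Hz -> {group}")
```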

  8. I anticipate that computers, being very good at doing complicated tasks algorithmically, will soon be able to identify the acoustic characteristics of different voices and associate them with specific speakers. If a computer identifies a word as “little” from syntactic context, for example, but perceives a [t] different from what it has in its audio dictionary, it can “learn” that the particular speaker pronounces his/her t’s in that way (a toy sketch of this kind of per-speaker learning appears at the end of this comment). Moreover, since phonological patterns are generally algorithmic, e.g., nasal assimilation before stops, fortition, etc., I expect computers to fare well in “learning” them and how words themselves vary in pronunciation once a computer has been preloaded with the phonemes and their places of articulation in the mouth.
    Human-like translation software, however, I’m much more skeptical about. A “good” translation would convey the spirit and meaning of a message without 1-to-1 word correspondence, and developing a computer that could truly understand a message like “I hit a fencepost when parking my car” without the real world experience of a car or fencepost sounds very difficult without biomimetic AI. An algorithmic translation, unfortunately, will always be the best thing that an algorithmic system can offer us.
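    Here is a toy sketch of that kind of per-speaker pronunciation learning, using only the Python standard library; the words, phoneme symbols, and counts are hypothetical and exist purely to illustrate the bookkeeping involved.

```python
# Sketch: a per-speaker pronunciation dictionary that counts which
# variant of a word it actually hears, so frequent variants (e.g. a
# flapped /t/ in "little") become preferred over the canonical form.
# All words and variants here are hypothetical examples.
from collections import Counter, defaultdict

class SpeakerLexicon:
    def __init__(self, canonical):
        # canonical: word -> default phoneme sequence (the "dictionary" form)
        self.canonical = canonical
        self.observed = defaultdict(Counter)       # word -> Counter of heard variants

    def observe(self, word, heard_phonemes):
        """Record how this speaker actually pronounced the word."""
        self.observed[word][tuple(heard_phonemes)] += 1

    def expected_pronunciation(self, word):
        """Return the speaker's most frequent variant, else the canonical form."""
        if self.observed[word]:
            return list(self.observed[word].most_common(1)[0][0])
        return self.canonical[word]

lex = SpeakerLexicon({"little": ["L", "IH", "T", "AH", "L"]})
# The recognizer decided (from context) that the word was "little",
# but repeatedly heard a flap [D] where the dictionary expects [T]:
for _ in range(3):
    lex.observe("little", ["L", "IH", "D", "AH", "L"])
print(lex.expected_pronunciation("little"))        # ['L', 'IH', 'D', 'AH', 'L']
```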

  9. As a SymSys major with a particular interest in this field, I found this post and the resulting comments to be very interesting. I think the future of this field may be a lot nearer than we all expect. Today, personal assistants like Cortana, Siri, and Google Now are being updated and improved at such a rapid pace that we, as consumers, can't even keep up. However, one thing remains constant: when you talk to Siri (or any other personal assistant), your speech sounds are compressed and relayed to a server in the cloud. The digital pattern is checked against a vast library of “phonemes” – the basic building blocks of speech, the sounds of consonants and vowels. Data-processing centers use a statistical model to figure out what you meant by the noises you were making, decode what task needs to be done, and relay that information back to the phone (a bare-bones sketch of this client-to-cloud round trip appears at the end of this comment). (Info copied from http://www.ibtimes.com/siri-meet-jarvis-future-voice-recognition-cloud-1552663).

    This is the biggest limitation I see with this technology: you are tethered to the cloud and the cellular network. I would love to see this technology become decentralized so that you can use it when you are out hiking with no cellphone signal and still get an accurate response from Siri about the weather. Obviously, it takes enormous amounts of computational power to process all this information, so it is unrealistic -- right now -- for it to be computed without the use of external servers. But that, in my eyes, is where the future of the industry is moving, and I think having that offline functionality will change the way we use and perceive voice recognition technology.
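    For illustration, here is a bare-bones sketch of the client side of the pipeline described above: compress the audio, ship it to a remote recognizer, and read the transcript back. The endpoint URL and the JSON response format are hypothetical; real assistants like Siri, Cortana, and Google Now each use their own proprietary protocols.

```python
# Sketch of the client side of the cloud pipeline: record/compress locally,
# send the audio to a remote recognizer, and get text back. The endpoint URL
# and response format below are hypothetical placeholders.
import gzip
import json
import urllib.request

RECOGNIZER_URL = "https://speech.example.com/recognize"   # hypothetical endpoint

def recognize(wav_bytes: bytes) -> str:
    payload = gzip.compress(wav_bytes)                     # "compressed and relayed"
    req = urllib.request.Request(
        RECOGNIZER_URL,
        data=payload,
        headers={"Content-Type": "audio/wav", "Content-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        result = json.load(resp)                           # assumed JSON: {"transcript": "..."}
    return result["transcript"]

if __name__ == "__main__":
    with open("question.wav", "rb") as f:                  # any local WAV file (assumed)
        print(recognize(f.read()))
```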

  10. I would like to return to Betsy's comment on Turing tests as a basis for understanding human language. Betsy's question rests on the notion that "fooling a human" may be a sufficient basis for asserting intelligence in machines. When applied to a linguistics setting, particularly in the context of human-computer interaction, one can conceive of a computer voice, an evolved version of Siri, sufficiently fooling a human. The question naturally arises: would this computer be intelligent?

    Instead of explicitly answering this question, I would like to propose a distinct but related one. Perhaps the whole framework we are operating under is flawed in that it requires human language as a prerequisite for even discussing intelligence. Instead of requiring that computers speak a human language, perhaps there is a natural language among computers—perhaps binary or maybe a different language—that underlies all system-to-system electronic communication. Although this notion is a bit fuzzy, it is not inconceivable that all computer interactions share common characteristics. And the study of linguistics, a descriptive more so than prescriptive field (with perhaps the exception of language pathology), might be rightfully applied to computer interactions to better understand how machines interact.

    In essence, approaching this task might shed light on the conundrums faced in the current development of computer speech recognition. For it is likely humans, not computers per se, whose dearth of knowledge with respect to language is slowing the progress of computer voice recognition.

  11. I believe that perfect speech recognition software isn’t that far-fetched at all. Very soon, I think, there will be a Siri that will be able to accurately pick up on accents, individual speaking habits, and background noise. The more challenging aspect of the future of technology and language is combining speech recognition with an actual interpretation of the conversation. We are still very far from a program that can actually hold a conversation with a person. Although this isn’t primarily what Siri was made for, I believe this program on the iPhone is the first step. However, there’s obviously more to holding a conversation than speech recognition. In a conversation, people not only need to recognize what is being said, but they also need to store the information and recall it for context purposes. In my opinion, this ability to take what is being said and put it in context will prove to be the main obstacle for future AI technology.

  12. In IHUM my freshman year, we discussed what it meant to be human. Is it appearing to be human? Is it behaving like a human? Is it convincing others that you are human? There are many different ways that humans have tried to define humanity. One of these, which has probably gained the most support, is the Turing test. The requirement is to convince enough (30%) of the human evaluators that the entity being tested is human. Marco brings up an interesting point about understanding. I don't know if a speech recognition system will ever truly understand speech, at least in the same way that humans understand something. I think there has to be significant progress in understanding how humans understand before we will be able to make any headway in simulating this understanding in a computer. I believe that we may be able to fake it with huge amounts of data (IBM's Watson and other similar projects), but this isn't even close to true human understanding. It will be interesting to see how we progress in artificial intelligence in the future.
