Sunday, October 5, 2014

Speech Perception, Siri, NLP, and the Brain

NLP and Linguistics


Listening to the casual and careful audio files made me wonder how Siri recognizes our voices. I looked into this and found that the audio file from your phone is wirelessly relayed to a cell tower and then sent to a server in the cloud that is programmed with a series of natural language processing algorithms, some of which draw on the kinds of phonetic categories captured by the IPA. At the same time, the phone determines whether what you said requires the Internet or can be handled locally (e.g., "remind me to do something"). Then "the server compares your speech against a statistical model to estimate, based on the sounds you spoke and the order in which you spoke them, what letters might constitute it." Based on that, what you said, which has been broken down into sounds such as diphthongs and monophthongs, is run through a language model, which estimates the words that you said. On top of all this, if there is any ambiguity, the computer is able to guess what you might have said (e.g., "did you mean to say ____?"). I thought it was fascinating that this entire process goes into using Siri, and I can only imagine the number of linguists and computer scientists who perfected it.
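To make the idea of "a statistical model plus a language model" concrete, here is a toy sketch in Python. It is not Apple's actual pipeline, and every probability in it is made up; it only illustrates how an acoustic score (how well the sounds match the words) and a language-model score (how likely the word sequence is) can be combined to pick the best transcription.

import math

# Hypothetical acoustic scores: roughly P(sounds | words) for two competing
# transcription hypotheses of the same audio. Numbers are invented.
acoustic = {
    ("call", "mom"): 0.42,
    ("cull", "mom"): 0.45,   # acoustically just as plausible for some speakers
}

# Hypothetical bigram language model: roughly P("mom" | previous word).
language = {
    ("call", "mom"): 0.20,
    ("cull", "mom"): 0.0001,  # "cull mom" is a very unlikely thing to say
}

def score(hypothesis):
    # Combine the two models in log space.
    return math.log(acoustic[hypothesis]) + math.log(language[hypothesis])

best = max(acoustic, key=score)
print("Best transcription:", " ".join(best))   # -> Best transcription: call mom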
I think it is also important to talk about how our brains process speech. Not only is the entire brain remarkable, but speech perception is particularly incredible when you realize how much it can do. One of the leading researchers in the field of understanding how the human brain processes language, and how speech processing differs from the processing of other sounds, is Dr. Josef Rauschecker. He has discovered "that separate areas in primate brains control the processing of different sounds. For example, a particular region handles sounds used for communication." By looking at MRIs of individuals as they both produce and listen to speech, he has located areas of the brain used in advanced and mixed speech (e.g., bilingual speech). Most neuroscientists thought that Wernicke's area and Broca's area completely dealt with speech recognition and production, but Dr. Rauschecker has shown that there are other components involved.
In a very broad sense, Wernicke's area is thought to be involved in understanding written and spoken language, while Broca's area is involved in speech production. Dr. Rauschecker's work on where language is located in the brain for deaf and bilingual individuals is particularly interesting. He published a research paper on this topic, available here: http://www.pnas.org/content/95/3/922.full. One particular finding of the paper that runs very much contrary to what scientists thought about Wernicke's area and Broca's area is that "cerebral organization for a language depends on the age of acquisition of the language, the ultimate proficiency in the language, whether an individual learned more than one language, and the degree of similarity between the languages learned." This would mean that the parts of the brain associated with language are very malleable and subject to change. This also fits with the principle of neural rewiring, demonstrated by the fact that people who suffer strokes and develop aphasia are able, with proper speech therapy, to learn how to manipulate language again.




10 comments:

  1. It is very interesting to realize how many small details have to be taken into account when designing speech recognition software. For example, the program must be able to distinguish the speaker’s voice from any background noise that may be present, and, more interestingly, it has to be able to deal with the different dialects, accents, tones of voice, intonations, and phonetic abilities of its users. If I were to ask Siri to “call mother,” I might say [kɑl mɑðɚ], while a British person might pronounce this same command as [kol mɑðə]. It is mind-boggling to think about how specific these programs have to be in determining what words are being said. Words can have so many different pronunciations, yet sometimes even the slightest variation in certain sounds can correspond to a completely different word.

  2. Bouncing off of the idea that natural language processing techniques require such a large amount of coordination and computation, I would just like to point out that many of these programs are implemented through machine learning. Machine learning algorithms are able to extract features from a training data set mapped to its correct classification. In the case that Jonas pointed out, [kol] and [kɑl] might both map to a "call" classification. By using multiclass classifiers with anterior and posterior features (i.e. properties of the words that precede and follow any given word), learning algorithms can be trained on thousands of speech samples (think of the quantity that could be scraped from YouTube alone) in order to make recognition more accurate. I merely want to point out that the "language model" is not hard coded, but is rather learned by an algorithm. It is likely that every part of the process that Caleb described was handled by an algorithm trained with machine learning. Linguists and computer scientists collaborate to make the algorithms more effective (adding new features), and in this manner NLP lies just between linguistics and computer science.
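    As a concrete (and heavily simplified) illustration of this, here is a minimal sketch using scikit-learn. It is not Siri's actual model; the feature values and labels are invented. The point is only that the mapping from acoustic features to a word label is learned from labeled examples rather than hand-coded.

from sklearn.linear_model import LogisticRegression

# Pretend each utterance is reduced to two acoustic features,
# e.g. rough F1/F2 formant values in kHz (numbers are made up).
X_train = [
    [0.70, 1.10],  # a speaker saying "call" with an [ɑ]-like vowel
    [0.50, 0.90],  # a speaker saying "call" with an [o]-like vowel
    [0.55, 1.90],  # a speaker saying "text"
    [0.58, 1.85],  # another speaker saying "text"
]
y_train = ["call", "call", "text", "text"]

model = LogisticRegression().fit(X_train, y_train)

# A new speaker's pronunciation still maps to the intended word,
# because the model generalizes over the variation it was trained on.
print(model.predict([[0.65, 1.05]]))   # -> ['call']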

    Replies
    1. I found some information that ties together what we have been discussing about Siri, NLP, and machine learning with one of my favorite technologies, neural networks. Siri was based on a project started by DARPA, PAL, the “personal assistant that learns.” DARPA released multiple papers outlining how PAL works, available here: https://pal.sri.com/Plone/publications.
      According to these papers, Siri can be summarized as performing the following steps, depending on the type of input (a rough code sketch of the pipeline follows the list):
      1. Uses automatic speech recognition (ASR) technology to transcribe human speech into text
      2. Parses the text using NLP (determining subject, verb, noun, etc.)
      3. Uses “mash-up” technologies to interface with web services such as OpenTable, WolframAlpha, and other question-answering sites. I tried saying “graph y is equal to x squared” and it gave me an output similar to WolframAlpha’s output for the equation y = x^2
      4. Transforms what the third-party services return into helpful text; for example, it would turn a weather report into a phrase like “the weather will be sunny”
      5. Finally, turns the text back into speech
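      Here is the rough code sketch promised above: a toy Python pipeline that strings the five stages together. Every function in it is a hypothetical stub (not Apple's or SRI's actual code); it only shows how each stage hands its output to the next.

def automatic_speech_recognition(audio):
    # 1. ASR: audio -> text (stubbed with a fixed transcription)
    return "what is the weather today"

def nlp_parse(text):
    # 2. NLP: a very crude parse into an intent and its arguments
    return {"intent": "weather", "args": {"day": "today"}}

def query_web_service(parse):
    # 3. "Mash-up": hand the structured query to some web service
    return {"condition": "sunny", "high_f": 72}

def to_helpful_text(result):
    # 4. Turn the raw service output into a friendly phrase
    return f"The weather will be {result['condition']} with a high of {result['high_f']}."

def text_to_speech(reply_text):
    # 5. The real system synthesizes audio; here we just print the reply
    print(reply_text)

text_to_speech(to_helpful_text(query_web_service(nlp_parse(
    automatic_speech_recognition(b"...audio bytes...")))))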

  3. To go off of what Nathan said, there is no way that these language models are "hard coded": spoken language (in particular, casual spoken language) is far too complicated and varied to explicitly specify all the possibilities (hard coding the rules) that could be used. It is extraordinary how powerful a system one can create by applying a "stupid" machine learning algorithm to a complicated problem. When I say a "stupid" machine learning algorithm, I mean one that knows very little (or nothing) about the subject at hand. The strength of such a system is almost always determined by the quantity of data the model was trained on. One of the most disappointing things that I learned in the NLP and machine learning classes at Stanford is that adding more data is almost always a better way to improve a machine learning system than spending hours/days/months improving the underlying algorithm itself. I'm sure Siri and other production systems are very much tuned by experts in linguistics, but one of the reasons they are as good as they are is the sheer quantity of data that has been ingested.
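    One way to see this for yourself (a hedged toy experiment on synthetic data, not speech): train the same simple scikit-learn classifier on progressively larger slices of a dataset and watch how the held-out accuracy changes with the amount of training data, without touching the algorithm at all.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; no real speech involved.
X, y = make_classification(n_samples=20000, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Same "stupid" algorithm each time; only the amount of training data changes.
for n in (100, 1000, 10000):
    model = GaussianNB().fit(X_train[:n], y_train[:n])
    print(n, "training examples -> held-out accuracy:",
          round(model.score(X_test, y_test), 3))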

  4. When reading this post, I got to wondering how Siri and other language processing software deal with ambiguities in linguistic meaning. For example, consider a question such as “Did he hit the man with a rock?” It is unclear from this sentence whether "with a rock" modifies "the man" or "hit". I think such ambiguities provide another good example of how complex language and its processing can be. On this topic, I’d encourage any of you interested in how sentences can be parsed to check out the following sentence parser from Stanford’s Natural Language Processing group: http://nlp.stanford.edu:8080/parser/ (those of you who took SymSys 100 last winter will likely find this familiar!). It’s a really neat tool to play around with, and it shows the difficulties of simulating language understanding algorithmically. I think using this tool is particularly insightful for finding places where strings of words have ambiguous meaning and for understanding how language processing also depends on context. This is also another example of how, as Nathan suggested, NLP lies between linguistics and computer science.
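    For anyone who wants to see that ambiguity reproduced in code, here is a small sketch using NLTK's chart parser with a toy grammar (this is not the Stanford parser itself, and the grammar is invented just for this one sentence). It returns two parse trees: one where "with a rock" attaches to the verb phrase (the rock is the instrument) and one where it attaches to "the man".

import nltk

# A toy grammar, just big enough to cover the example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pron | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pron -> 'he'
Det -> 'the' | 'a'
N -> 'man' | 'rock'
V -> 'hit'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "he hit the man with a rock".split()

# Prints both trees: the PP attaches either to the VP or to the object NP.
for tree in parser.parse(sentence):
    print(tree)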

  5. Bringing up Siri ties very well into discussions in another one of my classes about artificial intelligence and its increasing incorporation into modern technology. My take on the history of AI is that, over time, the standard for what one would consider artificial intelligence has evolved; when a rung was reached, there was immediately a new one to strive for, with the ultimate goal of a machine as intelligent as a human, with the same capabilities to learn, reason, and problem-solve. At some point one of these rungs was whether a computer could identify a misspelled word and correct it. We all have our complaints about autocorrect, but it’s no longer what typical users would consider AI. The rung moved to: can a computer sort human speech into specific commands? Meet Siri: once a technological phenomenon, she, or her equivalent, is now commonplace on many devices. As hardware improves (i.e., memory, processing power), researchers strive to pass the Turing Test. Through speech, can a computer fool a human into believing it, too, is human? I believe with Siri we are halfway there. We have a computer that can understand speech; now it’s just a matter of creating one that can respond. Doing that, though, is harder than it sounds, for many of the reasons we are learning about in class. To sound convincing, it would need to speak casually (as opposed to carefully), use slang, and express emotion through its voice.

  6. Your post reminds me of the Turing test, and how very recently a Russian computer finally passed it. The Turing test is rather simple - a human talks to two screens, one of which is another human, and the other a computer. The computer's goal is to convince the judge that it is the human. Surprisingly enough, this was not even possible through text until very recently, so imagine how far away we are from achieving that through speech.

    It is remarkable, though, when you think about the steps involved in the process of figuring out speech, that it all happens instantaneously and subconsciously in your brain. There are very few exceptions, especially in everyday speech, where you need a conscious effort to even parse the words from each other in a sentence. It is very hard not to marvel at the sheer power of the brain's parallelism!

  7. The conversation about Siri made me remember something that had been brought up at the beginning of the quarter. The example used was the question: “Siri, where can I buy clothing?” In response, Siri is likely to produce a few general clothing stores. However, if you were to ask a person where to buy clothing, he or she would probably suggest brands tailored to suit your needs. You’re not likely to recommend the Children’s Place to an eighty-year-old man, and you’re not likely to suggest Men’s Warehouse to an eight-year-old girl.
    This is a distinction we can easily make, but something that Siri can’t really do: apply context to a situation.
    Is this what’s next on the horizon for voice recognition software? Making generalizations based on perceived age or gender from the sound of a voice?

  8. In Simner, Cuskley, and Kirby (2010), subjects associated certain tastes (sweetness, bitterness, saltiness, and sourness) with certain sounds that had been made by manipulating F1, F2, voice discontinuity, and spectral balance. Extrapolating from these data, the researchers suggest that all people have the implicit mappings of taste and sound that synaesthetes experience more explicitly. Since you mentioned that sound can get processed by different areas of the brain, this got me wondering whether there are particular areas of the brain that concurrently process multiple modalities (i.e. sounds, tastes, textures, and spatial properties). If such an area does not exist, could sound, taste, texture, and spatial properties nevertheless intermingle during the brain's consolidation of memories that contain all or some of these modalities? Could this cross-modal processing give rise to our perception of sound symbolism, as discussed by Nicolas Perdomo in an earlier blog post below? That is, could the fact that the brain processes the senses cross-modally account for why some words for food “sound like what they taste”, why some words for textures “sound like what they feel”, and why some words for shape “sound like what they look like”?

  9. Reading your post makes me want to look into more natural language processing classes - the whole idea of how fast and how well Siri and other speech processing systems can understand us is amazing.
    It's also fascinating how much psychology and neuroscience can change while we're studying them - what we thought to be true is now being questioned, and we are learning so much about the entire system itself.
    On the point of Siri - does the system learn my accent? When I speak with people often, I'll learn the nuances of their speech, the different pronunciations they have for different words. I wonder if Siri does the same.
