Wednesday, October 15, 2014

Synthetic Voices: What Makes Our Voice Our Own.

According to Professor Rupal Patel, every person has a “unique voiceprint” that is part of one’s identity. Similar to a fingerprint, no two voices are exactly alike. However, unlike fingerprints, our voiceprint can give hints to the listener about a speaker’s age, gender, nationality, size, and lifestyle: a grown man with a British accent will sound significantly different from a young girl with an Australian accent. We can learn a significant amount about someone just from hearing their voice. But what if someone doesn’t have a voice? There are millions of people around the world who are unable to speak: some have lost the ability to speak, and others were born without it entirely. In her TED Talk, Patel comments that many of these people have overcome this disability with modern text-to-speech technology. However, current solutions lack a mechanism to incorporate any degree of individualism in a synthetic voice. While at a conference, Patel observed many pairs of individuals carrying on conversations, each with the exact same synthetic voice. While it is miraculous that they are even able to carry on a spoken conversation, it is unfortunate that they have lost a part of their identity. Patel identifies two major parts that comprise the human voice: the source, the vibrations initiated by the voice box, and the filter, the shape and structure of the vocal tract that shapes the vibrations to create vowels and consonants. The majority of those individuals who lack the ability to create the sounds required for speech still have functional sources. Patel hopes to utilize these sources to build synthetic voices for those who were never able to speak with their own identity ever before.

Giving a unique voice to those without one is just the first step, however. Patel’s approach uses concatenated sounds to build words, phrases, and sentences. Just as each voice is unique, each utterance of each sound is unique. The chance of a person producing the exact same sound twice in a row is infinitesimally small. While there is so much that makes our voice sound like our voice, there is also so much expressive power in the slight variations in our voice. When someone is scared, their voice can convey this information to those around them. When someone is mad, an increase in volume or intensity can communicate this anger. When someone is nervous, he or she may speak at a faster pace. Patel’s research moves us significantly closer to giving those who cannot speak a piece of their identity, a new way to express themselves. However, the future holds even greater things in store. What if one didn’t have to type his or her responses before they could be spoken? Stephen Hawking, who is unable to even move his fingers to type, is restricted to using his cheek to “type” what he wants to say. Perhaps one day, someone with the same condition will be able to think exactly what he or she wants to say and immediately, before he or she is able to finish thinking the sentence, those around him or her will begin to hear what he or she has to say. And if that person is angry, the spoken words will be louder and more intense. I cannot wait to see what we will be able to do in the future.

To those interested, Rupal Patel's TED Talk can be found here. She discusses in much more detail how her team is creating synthetic voices.


2 comments:

  1. This is the first time I've heard about the idea of synthetic voices, especially under the banner of individualism. I disagree with the assumption that current text-to-speech technology erases individuality. True, elements such as tone and pace may not be represented in such technology. However, the way a person constructs sentences, the frequency with which he or she talks, and the length of pauses are all indicators of a unique personality. What is the true value of assigning more "individuality" markers to such a person?

    The ability to express emotions (such as anger) in volume or other traits of spoken speech also seems like a stretch. When a person is angry or upset, non-verbal cues are often more powerful than verbal ones. Crossed arms, glaring eyes, and movement can all communicate emotions. Unless a person lacks the ability to perform any of these actions, how much would a synthetic voice truly add to a person's capacity to communicate?

  2. I think any conversation on synthetic voices and their uniqueness, and I say this only a bit shamelessly, warrants a discussion of synthesized voices in music. Although people usually associate auto-tune with a very predictable sound, in certain contexts there is significant variation between different synthesized voices. Compare, for example, the distinct differences between the voices created for these two pieces of music: https://www.youtube.com/watch?v=c_abAgvn1nw and https://www.youtube.com/watch?v=FQTDdJShcpI .
    A more everyday example includes the different voices available for a GPS. The current state-of-the-art can offer people who lack speech distinct and unique voices, but each is deliberately pre-constructed rather than spontaneously generated. Dr. Patel’s approach, allowing spontaneously generated uniqueness, would mark a giant leap forward, but it still would have its limitations compared to pre-constructed blueprints. Our voices only vary so much, and random variation would only duplicate the variation within an individual speaker’s voice rather than the variation between people of different genders, ages, etc.
    Both pre-generated vocal blueprints, such as those featured in GPS and Vocaloid technology, and spontaneous variation would ultimately need to be combined to bring those without speech the best possible individual expression.
