According to Professor Rupal Patel, every person has a “unique
voiceprint” that is part of that person’s identity.
Similar to a fingerprint, no two voices are exactly alike. However,
unlike fingerprints, our voiceprint can give hints to the listener about a
speaker’s age, gender, nationality, size, and lifestyle: a grown man with a
British accent will sound significantly different from a young girl with an
Australian accent. We can learn a significant amount about someone just from
hearing their voice. But what if someone doesn’t have a voice? There are
millions of people around the world who are unable to speak: some who have lost
the ability to speak, and others who are born without this ability entirely. In
Rupal Patel’s TED Talk, she comments that many of these people who are unable
to speak have overcome this disability with modern text-to-speech technology.
However, current solutions lack a mechanism to incorporate any degree of
individuality into a synthetic voice. While at a conference, Patel observed
many pairs of individuals carrying on conversations, each with the exact same
voice. While it is miraculous that they are even able to carry on a spoken
conversation, it is unfortunate that they have lost a part of their identity.
Patel identifies two major parts that comprise the human voice: the source, the
vibrations initiated by the voice box, and the filter, the shape and structure
of the vocal tract that shapes the vibrations to create vowels and consonants. The
majority of those individuals who lack the ability to create the sounds
required for speech still have functional sources. Patel hopes to utilize these
sources to build synthetic voices for those who were never able to speak with
their own identity ever before.
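For readers who like to see ideas in code, here is a minimal Python sketch of the source and filter split described above. It is a toy illustration only, not Patel's method: the function names (`glottal_source`, `resonator`, `vocal_tract`), the 16 kHz sample rate, and the formant frequencies and bandwidths are all assumptions chosen for the example. The point it tries to show is that one and the same source can be shaped into different sounds by swapping the filter.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Source: an impulse train with one pulse per pitch period
    (a crude, assumed stand-in for vibrations from the voice box)."""
    n = int(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """One two-pole resonator: emphasizes energy near center_hz,
    loosely playing the role of a single vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vocal_tract(signal, formants, sample_rate=16000):
    """Filter: a cascade of resonators, one per (center_hz, bandwidth_hz) pair."""
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Same source, two different (illustrative) filter settings.
source = glottal_source(f0_hz=120, duration_s=0.05)
sound_a = vocal_tract(source, [(700, 80), (1200, 90)])
sound_b = vocal_tract(source, [(300, 80), (2300, 90)])
```

Here `sound_a` and `sound_b` share the same pitch pulses but differ in how those pulses are shaped, which is the separation the paragraph above describes.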
Giving a unique voice to those without one is just the first
step, however. Patel’s approach uses concatenated sounds to build
words, phrases, and sentences. Just as each voice is unique, each utterance of each
sound is unique. The chance of a person producing the exact same sound twice
in a row is infinitesimally small. While there is so much that makes our voice
sound like our voice, there is also great expressive power in the slight
variations in our voice. When someone is scared, their voice can convey this
information to those around them. When someone is mad, an increase in volume or
intensity can communicate this anger. When someone is nervous, he or she
may speak at a faster pace. Patel’s research moves us significantly closer to
giving those who cannot speak a piece of their identity, a new way to express
themselves. Yet the future holds even greater things in store. What if one didn’t
have to type his or her responses before they could be spoken? Stephen Hawking,
who is unable to even move his fingers to type, is restricted to using his
cheek to “type” what he wants to say. Perhaps one day, someone with the same
condition will be able to think exactly what he or she wants to say and
immediately, before he or she is able to finish thinking the sentence, those
around him or her will begin to hear what he or she has to say. And if that
person is angry, the spoken words will be louder and more intense. I cannot
wait to see what we will be able to do in the future.
For those interested, Rupal Patel's TED Talk can be found here. She discusses in much more detail how her team is creating synthetic voices.
This is the first time I've heard about the idea of synthetic voices, especially under the banner of individualism. I disagree with the assumption that current text-to-speech technology erases individuality. True, elements such as tone and pace may not be represented in such technology. However, the way a person constructs sentences, the frequency with which they talk, and the length of their pauses are all indicators of a unique personality. What is the true value of assigning more "individuality" markers to such a person?
The ability to express emotions (such as anger) in volume or other traits of spoken speech also seems a stretch. When a person is angry or upset, non-verbal cues are often more powerful than verbal ones. Crossed arms, glaring eyes, movement can all communicate emotions. Unless a person lacks the ability to perform any of these actions, how much would a synthetic voice truly add to a person's capacity to communicate?
I think any conversation on synthetic voices and their uniqueness, and I say this only a bit shamelessly, warrants a discussion of synthesized voices in music. Although people usually associate auto-tune with a very predictable sound, in certain contexts there is significant variation between different synthesized voices. Compare, for example, the distinct differences between the voices created for these two pieces of music: https://www.youtube.com/watch?v=c_abAgvn1nw and https://www.youtube.com/watch?v=FQTDdJShcpI .
A more everyday example includes the different voices available for a GPS. The current state-of-the-art can offer people who lack speech distinct and unique voices, but each is deliberately pre-constructed rather than spontaneously generated. Dr. Patel’s approach, allowing spontaneously generated uniqueness, would mark a giant leap forward, but it still would have its limitations compared to pre-constructed blueprints. Our voices only vary so much, and random variation would only duplicate the variation within an individual speaker’s voice rather than the variation between people of different genders, ages, etc.
Pre-generated vocal blueprints, such as those featured in GPS and Vocaloid technology, and spontaneous variation would ultimately both need to be combined to bring those without speech the best possible individual expression.