Visualization of Siri’s voice
By Noah Zandan and Carrie Goldberger
October 11, 2013
Last week, voice actor Susan Bennett came forward as the voice of Siri, Apple’s voice-activated virtual “assistant.” Bennett says that in 2005, she sat in a recording booth for four hours a day during the month of July, reading phrases that would be put together to create a synthetic voice. Her voice has been used in GPS systems, in ATMs, by Delta Air Lines, and now as the voice of Siri. Because Apple would not comment on whether Bennett is actually the voice of Siri, CNN hired an audio forensics expert to compare the two voices. The expert, Ed Primeau, found Siri’s voice and Bennett’s voice to be a 100% match.
So we wondered: why was Bennett’s voice chosen? What are the distinguishing characteristics of a voice heard by roughly 25% of all cell phone owners in the U.S.? We ran a sample of Siri’s voice through our vocal analytics platform to find out. In the sample we analyzed, Siri had a mean pitch of 163.4 Hz and a mean amplitude of 63.4 dB. Compared with the average for female speakers in our Quantified Communications database of over ten thousand speakers, her pitch was 21% lower than the norm and her tonal variety was 32% below the norm, indicating a monotone vocal delivery that sounds unnatural.
Going deeper into the analytics, we looked at Siri’s harmonics-to-noise ratio. This ratio compares the amount of harmonic sound in a speech signal to the amount of noise. Siri’s harmonics-to-noise ratio was 52% lower than the average for female speakers in our extensive database. A low harmonics-to-noise ratio indicates that noise makes up a larger share of the signal than it does in a typical voice. To explain this ratio further: every voice has a fundamental frequency, or pitch, that the human ear picks up. However, there are other frequencies that accompany our pitch. When the accompanying frequencies are whole-number multiples of the fundamental frequency, the voice sounds harmonic. When they have no such regular relationship to the fundamental frequency, they sound like noise. When a voice contains too much noise, it sounds strange to most people.
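For readers who want to see the idea in code, here is a minimal sketch of one common way to estimate a harmonics-to-noise ratio, using the autocorrelation of the signal. This is an illustration only and not necessarily how our platform computes it. The function names `synth_voice` and `estimate_hnr_db`, the 100 to 400 Hz pitch search range, and the toy test signal are all assumptions made for the example.

```python
import numpy as np

def synth_voice(f0_hz, sample_rate, seconds, noise_std=0.0, seed=0):
    """Toy 'voice' (hypothetical helper): a fundamental plus two
    whole-number harmonics, with optional random noise added."""
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    tone = sum(np.sin(2 * np.pi * k * f0_hz * t) / k for k in (1, 2, 3))
    noise = np.random.default_rng(seed).normal(0.0, noise_std, t.size)
    return tone + noise

def estimate_hnr_db(signal, sample_rate, f0_min=100.0, f0_max=400.0):
    """Estimate HNR in dB and pitch in Hz from the autocorrelation peak.

    The pitch search range is an assumption for this sketch, not a
    value taken from the article.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = x.size
    # Autocorrelation at non-negative lags.
    ac = np.correlate(x, x, mode="full")[n - 1:]
    # Divide by the number of overlapping samples at each lag, then
    # normalize so that lag 0 equals 1.
    ac = ac / (n - np.arange(n))
    ac = ac / ac[0]
    # Look for the strongest repetition within the plausible pitch range.
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # r is the share of signal power that repeats from one pitch period
    # to the next; 1 - r is treated as noise. Clip to keep the log finite.
    r = float(np.clip(ac[lag], 1e-9, 1.0 - 1e-9))
    hnr_db = 10.0 * np.log10(r / (1.0 - r))
    return hnr_db, sample_rate / lag

if __name__ == "__main__":
    rate = 16000
    clean = synth_voice(160.0, rate, 0.25)
    noisy = synth_voice(160.0, rate, 0.25, noise_std=0.3)
    print("clean:", estimate_hnr_db(clean, rate))
    print("noisy:", estimate_hnr_db(noisy, rate))
```

Run on the two toy signals, the clean tone should score a much higher ratio than the noisy one, which is the same contrast described above between a harmonic voice and a noisy one.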
This lower harmonics-to-noise ratio may be explained by the robotic aspect of Siri’s voice, which is not smooth and fluid like the voice of Susan Bennett. The process of creating a voice such as Siri’s is long and complicated. A voice actor reads a long list of phrases that don’t mean anything but are “phonetically rich,” according to David Vazquez of Nuance, a speech recognition and text-to-speech company. The sounds in the phrases are broken down by a computer, categorized, placed in a database, and then put back together in the appropriate combination by Siri to answer questions from iPhone users. The more recorded data the system has, the more lifelike the voice will sound, but the result is not quite perfect and lacks the fluidity of normal speech. Linguists and sound engineers from Nuance say they are working to make computerized voices sound more human so that users do not experience cognitive dissonance, or the “mental conflict that occurs when beliefs or assumptions are contradicted by new information.” This cognitive dissonance occurs because a robotic voice sounds unnatural, especially when it is speaking to us conversationally.
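The "break down, store, and put back together" step can be pictured with a toy sketch. This is not Nuance’s or Apple’s actual pipeline, which is not described in that detail here. The function name `concatenate_units`, the dictionary used as the sound database, and the short linear crossfade at each joint are assumptions chosen to keep the example small.

```python
import numpy as np

def concatenate_units(unit_db, unit_names, crossfade=32):
    """Join stored sound units in order (hypothetical helper).

    unit_db maps a unit label to its recorded samples. Each unit must be
    at least `crossfade` samples long. Neighbouring units overlap by
    `crossfade` samples and are blended linearly to soften the joint.
    """
    out = np.asarray(unit_db[unit_names[0]], dtype=float).copy()
    fade_in = np.linspace(0.0, 1.0, crossfade)
    for name in unit_names[1:]:
        nxt = np.asarray(unit_db[name], dtype=float)
        # Fade the previous unit out while the next unit fades in.
        blended = out[-crossfade:] * (1.0 - fade_in) + nxt[:crossfade] * fade_in
        out = np.concatenate([out[:-crossfade], blended, nxt[crossfade:]])
    return out

if __name__ == "__main__":
    # Stand-in "recordings": real units would be slices of the actor's audio.
    db = {"he": np.ones(100), "llo": np.ones(150)}
    word = concatenate_units(db, ["he", "llo"])
    print(word.size)
```

Even with smoothing at the joints, each unit was recorded in a different phrase, so pitch and timing do not always line up across a boundary. That mismatch is one plausible source of the missing fluidity.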
So is Siri the perfect voice? In short, no. Her low pitch is fine, but her lack of tonal variety and low harmonics-to-noise ratio demonstrate a lack of vocal authenticity. In fact, according to our predictive analytics, Siri’s voice is so unnatural that, if it were delivered in bulk, the audience would experience cognitive dissonance, focusing more on the sound of her voice and less on her message. It’s a good lesson for communicators – vocal authenticity elicits trust from an audience. Try to avoid sounding unnatural, monotone or robotic. Speak to your audience in your most natural conversational tone. Ask yourself who you’d feel more comfortable listening to for an hour or more – Susan Bennett in a conversation at a coffee shop, or Siri on a 200-mile road trip with 300 different turns?