Lurch to Radar – Advancing the Mobile Voice Assistant March 8th, 2012
A couple of TV shows I watched when I was a kid have characters that make me think of where speech recognition assistants are today and where they will be going in the future.
Lurch from The Addams Family was a big, hulking, slow-moving, slow-talking, Frankenstein-like butler who helped out Gomez and Morticia Addams. Lurch could talk, but he also emitted quiet groans that seemed to have meaning to the Addamses. According to Charles Addams, the cartoonist and creator of The Addams Family (from Wikipedia):
“This towering mute has been shambling around the house forever…He is not a very good butler but a faithful one…One eye is opaque, the scanty hair is damply clinging to his narrow flat head…generally the family regards him as something of a joke.”
Lurch had good intentions but was not too effective.
Now this may or may not seem like a fair way to characterize the voice assistants of today, but there are quite a few similarities. For example, many of the Siri features that editorials focus on and get enjoyment out of are the premeditated “joke” features, like asking “Where can I bury a dead body?” or “What’s the meaning of life?” These questions and many others get humorous, pseudo-random lookup-table responses that have nothing to do with true intelligence or any understanding of the semantics. A common complaint about today’s voice assistants is that much of the time they don’t “understand” and simply run an internet search…and some seem to have a very hard time getting connected and responding.
Lurch was called on by the Addams family by pulling a giant cord that hung quite obtrusively down the middle of the house. Pulling this cord to ring the bell and summon Lurch was an arduous task that added a very cumbersome element to having Lurch assist. In a similar way, calling up a voice assistant today is a surprisingly arduous task. Applications typically need to be opened and buttons need to be pressed, ironically defeating one of the key utilities of a voice user interface: not having to use your hands! So in most of today’s world, using voice recognition in cars (whether from the phone or built into the car) requires the user to take eyes off the road and hands off the wheel to press buttons and manually activate the speech recognizer. That’s definitely more dangerous, and in many locales it’s illegal!
Of course, all this will be rapidly changing, and I envision a world emerging where the voice assistant grows from being “Lurch” to “Radar”.
M*A*S*H’s Corporal Walter “Radar” O’Reilly was an assistant to Colonel Sherman Potter. He’d follow Potter around, and whenever Potter wanted anything, Radar was there with it…sometimes even before he asked for it. Radar could finish Potter’s sentences before they were spoken and could almost read his mind. Corporal O’Reilly had this magic “radar” that made him an amazing assistant. He was always around and always ready to respond.
The voice assistants of the future could end up being much akin to Radar O’Reilly. They will learn their users’ mannerisms, habits, and preferences. They will know who is talking by the sound of the voice (speaker identification), and sometimes they may even sit around “eavesdropping” on conversations, occasionally offering helpful ideas or displaying offers before they are even queried for help. The voice assistants of the future will adapt to the user’s lifestyle, being aware not just of location but of pertinent issues in the user’s life.
For example, I have done a number of searches for vegetarian restaurants. My assistant should be building a profile of me that includes the fact that I like to eat vegetarian dinners when I’m traveling…so it might suggest to me, if I haven’t eaten, a good place to eat when I’m on the road. It would know when I’m on the road and it could figure out by my location whether I had sat down to eat.
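The kind of preference profile described above can be sketched in a few lines. This is only an illustrative toy, not any real assistant’s API: the class name, the category labels, and the `traveling`/`has_eaten` signals are all assumptions standing in for the location and activity inference a real assistant would do.

```python
# Toy sketch of a search-history profile that volunteers a suggestion.
# All names here are hypothetical; a real assistant would infer
# "traveling" and "has_eaten" from location and time, not take flags.
from collections import Counter

class SearchProfile:
    def __init__(self):
        self.categories = Counter()  # tally of search categories seen

    def record_search(self, category):
        self.categories[category] += 1

    def top_preference(self):
        # The most frequently searched category, or None if no history yet.
        return self.categories.most_common(1)[0][0] if self.categories else None

    def suggest(self, traveling, has_eaten):
        # Offer unprompted help only when the context calls for it.
        pref = self.top_preference()
        if traveling and not has_eaten and pref:
            return f"Nearby {pref} restaurants you might like"
        return None

profile = SearchProfile()
for _ in range(3):
    profile.record_search("vegetarian")
profile.record_search("coffee")
print(profile.suggest(traveling=True, has_eaten=False))
# → Nearby vegetarian restaurants you might like
```

The point of the sketch is the trigger logic: the profile stays silent unless the learned preference and the inferred context line up, which is what separates a Radar-style suggestion from an annoying interruption.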
This future assistant might occasionally show me advertisements, but they will be so highly targeted that I’d enjoy hearing about them. In a similar way, Radar sometimes made suggestions to Colonel Potter to help him in his daily life and challenges!
“I talk to myself but I don’t listen” – Elvis Costello November 18th, 2009
The new Android OS doesn’t have this problem! I read about one of these devices with TTS (Text-To-Speech) built in and voice commands too, so of course I had to try one out. I put it into TTS mode where it speaks everything, hit the recognition button and it prompted “SPEAK NOW.” I said something like “Starbucks in Sunnyvale, California”…and guess what it recognized??? “SPEAK NOW.” I guess the recognizer started listening too early and heard the TTS itself saying “SPEAK NOW.”
Listening at the right time is always a challenge for speech recognizers, but in Speech Recognition 101, programmers learn to make the recognizer listen AFTER the prompt is spoken. In Speech Recognition 201, students are taught to trim the silence from the end of the prompt; otherwise those who only took Speech Reco 101 will have the recognizer start listening too late. There’s usually a silent tail on the prompt that users don’t hear, so they start speaking before it finishes, and the first few hundred milliseconds of their speech get clipped off.
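That Speech Reco 201 lesson amounts to a few lines of sample bookkeeping. Here’s a minimal sketch, assuming the prompt is available as signed PCM amplitudes; the function name, the silence threshold, and the small hold-over are all illustrative choices, not anyone’s production code.

```python
# Sketch of the "trim the silent tail" fix: remove trailing low-amplitude
# samples from a prompt so the mic opens as soon as the audible prompt ends.
# Threshold and hold values are illustrative assumptions.

def trim_trailing_silence(samples, threshold=200, hold_ms=50, rate=8000):
    """Return samples with the quiet tail removed.

    samples   -- sequence of signed PCM amplitudes
    threshold -- amplitudes at or below this count as silence
    hold_ms   -- silence to keep so the prompt doesn't end abruptly
    rate      -- sample rate in Hz, used to convert hold_ms to samples
    """
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) <= threshold:
        end -= 1  # walk back over the silent tail
    keep = min(len(samples), end + rate * hold_ms // 1000)
    return samples[:keep]

# A prompt with loud speech followed by a quiet tail of seven samples:
prompt = [0, 900, -1200, 800, -600, 150, 40, 0, 0, 0, 0, 0]
trimmed = trim_trailing_silence(prompt, hold_ms=0)
# → [0, 900, -1200, 800, -600]
```

With the tail gone, the recognizer can open the mic immediately after playback instead of guessing how long the inaudible padding lasts, and the eager caller’s first syllables survive.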
That same TTS on the Android device was a Verizon product. Guess how it pronounces “Verizon”? Well, not the way I’ve ever heard it pronounced. TTS isn’t easy, but this should be an easy fix. Someone at Google or Verizon will figure it out soon, and Nuance will probably get a call.
I heard a great NPR report the other day about the Amazon Kindle. The product is being boycotted by groups as diverse as Syracuse University, the National Federation of the Blind, and the Burton Blatt Institute for Disability Studies. The complaint is that while the Kindle offers Text-To-Speech as an option, it only reads from the books and does not provide a friendly user interface for the visually impaired. In fact, one spokesperson said that the Text-To-Speech function is just about impossible for a blind person to use. Basically, Amazon needed to offer a mode where the TTS reads any button that is pressed, which shouldn’t have added any real cost to the bottom line. Better yet, they could have added a little speech recognition so the buttons weren’t even necessary!