Quick Thoughts May 1st, 2013
- Texas A&M Transportation Institute Study. Yeah, they found that using Siri or Vlingo was as dangerous as texting while driving. Adam Cheyer hit the nail on the head in his response, but Adam focused on Siri…Turns out they didn’t use Vlingo in the In Car mode, which was of course designed for in-car use. Duh! Vlingo’s (now Nuance) In Car uses Sensory’s Truly Handsfree, which requires NO TOUCHING and no distracted eyes while driving. All those articles that said “Handsfree texting no safer than typing” really got it wrong. It’s not TRULYHANDSFREE!!! In the study they held phones in their hands and hit buttons. Sorry, that’s not handsfree!
- Google Now on iOS. Cool! Android speech recognition is very good and probably the best, but having it built into the home button is easy, and easy usually trumps good. But Apple can’t be complacent; it’s gotta make some big moves or it will be left behind in the category it popularized.
- Google Glass. Holy smoke, what a lot of press it gets. Sensory has 2 in house and we love the user experience! We believe wearables will become huge, and Google is certainly at the forefront of driving this. Glass must use Google’s speech recognition in the cloud. Wonder what they use on the client? It works GREAT!
- Galaxy S4. Yep, Sensory made it in for the embedded recognition used in triggers (with SVoice) and voice command and control! We got invited to the launch party. It’s a GREAT product, with a GREAT embedded speech recognizer.
- Icahn buying into Nuance. Interesting…Can’t be bad for Nuance investors, until he sells! It’s nice to see speech technology reach the forefront not just in consumer electronics and technology but in the finance world too!
- Qualcomm introduces voice triggers. Yeah, everyone knows that’s the area where Sensory dominates: better accuracy, faster response time, lower power consumption, works in noise and from a distance, etc. People ask if Qualcomm is using Sensory technology. I say try it, and if it works GREAT then it’s probably Sensory’s. Anyways, we welcome the Qualcomm solution as it totally validates what we’ve been saying and doing. I tried it at Mobile World Congress and it responded well in noise, but you had to hit a button to turn it on and make it listen, which kind of defeats the purpose.
- Amazon buying the pieces. Yeah they bought up some of the best components available – TTS from Ivona, cloud speech recognition from YAP, and now intelligence from Evi. Even adding it all up, they haven’t paid that much and if they put it all together well, they should be in a strong position relative to their competitors.
- Industry. The overall speech field is aligning as a battle of titans, all with good patent positions, large teams, and good technologies. Amazon, Google/Android, Microsoft, and Nuance are all major speech players today. Apple probably is too, but it’s hard to know what’s in house at Apple vs. Nuance. Nuance is the only substantive player that’s a vendor, out there selling speech technology. This puts them in a nice position, but they have competitors giving it away on all major platforms, so nobody is without challenges. Sensory might be the second largest speech vendor after Nuance, and our sales are less than 2% of Nuance’s…pretty amazing gap there! I want to fill that gap!
Superbowl Ads – Speech Activation Coming of Age February 18th, 2013
(…and something new from Sensory just around the corner!)
I remember watching the Superbowl last year and seeing a BMW 3 Series commercial that I thought was interesting.
It was interesting to me because they put a motion/proximity sensor under the trunk so the user could open the trunk in a hands-free manner. The commercial highlights the benefit of hands-free access when a woman walks up with her hands full of luggage and just wiggles her foot around and the trunk pops open! Cool…except the user has to do a little one-legged dance with their hands full, and as the commercial highlights (which is another reason why I found it interesting), other things can accidentally open the trunk, like a dog wagging its tail. Wouldn’t a hands-free voice trigger do a much better job? Especially an ultra-low-power implementation on a standalone processor with built-in speaker verification for security…sounds like a challenge for Sensory’s TrulyHandsfree approach.
Fast forward to this year’s Superbowl, and Kia comes out with the “space babies” ad for its Sorento, and the Uvo entertainment system. Kid asks dad “where do babies come from” and dad concocts an elaborate and humorous lie.
Then after dad’s tall tale the kid says “But Jake said that babies are made when mommies and daddies…” and dad quickly interrupts the kid by saying “Uvo, play Wheels on the Bus”. The Uvo system hears dad and immediately plays the music drowning out the kid’s question. Cool commercial and nice use of voice activation to control music while driving!
Many of Sensory’s customers have told us that they don’t want to have to say the brand name as a command word, and they would really like to name their products themselves, and even better, have the products know who they are when they talk so that settings and controls can be customized to their use…Another job for Sensory’s TrulyHandsfree!
On February 19th we will announce our TrulyHandsfree 3.0, which will enable all of the voice control scenarios I have described, delivering user experiences that are more customized and more secure!
Stay tuned for the details!
Mobile Users Get it! May 30th, 2012
Sensory’s had a lot of press lately. We made 3 big announcements all pretty much together:
1) Announcing speaker verification http://www.sensoryinc.com/company/pr12_03.html
2) Announcing speaker identification http://www.sensoryinc.com/company/pr12_04.html
3) Saying Sensory is in the Samsung Galaxy S3 http://www.sensoryinc.com/company/pr12_05.html
Sensory announced these just before CTIA in New Orleans. We had a small booth at the show, and gave demos at several events (on the CTIA stage and floor, at the Mobility Awards dinner, and at the excellent Pepcom Mobile Focus event).
We got a lot of nice press from this. I was thrilled that the Speech Technology email newsletter put our verification release as the featured and lead story. One of the articles I like best, though, just came out last week by Pete Pachal at Mashable http://mashable.com/2012/05/29/sensory-galaxy-s-iii/
This article is great for several key reasons. One is that Pete gets it. He didn’t just reprint our press release, but he added his commentary and wrapped it up in a nice story that hits some of the key issues.
However, what’s best is what the readers wrote in. I LOVE their insights and comments. Here are a few of the dialogs with my commentary attached:
JB - Seriously??? You still need to push a button to use Siri? I’ve had the “wake with voice” option on my crusty old HTC Incredible, via VLingo inCar, for about 2 years now. Hard to believe Apple is that far behind.
My response: EXACTLY, JB! In fact, that crusty old HTC running Vlingo also uses Sensory’s TrulyHandsfree approach! Vlingo was our first licensee in the mobile space.
Scott: But this is talking about OS integration instead of app integration. And as I’m sure you’ve seen on your phone, and as the article noted, wake with voice options currently use a lot of power, which means I can’t see a lot of people willing to use it.
My response: Precisely, Scott! This is why we are implementing the “deeply embedded” approach that will take power consumption down by a factor of 10! Nevertheless, users LOVE it even if it consumes power:
JB - I use it all the time and since my phone plugs into the car’s adapter, I don’t really worry at all about power usage. It’s never been a problem.
My response – Yes, Vlingo and Samsung did a very nice implementation by having an “always listening” mode, particularly useful while driving. Other approaches we expect to see in the future are intelligent sensor-based approaches so the phone knows when to listen and when not to (e.g., why not have it turn on and listen whenever you start traveling past 20 MPH?)
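To make that concrete, here’s a minimal sketch of what a sensor-gated listening policy could look like. It’s purely illustrative: the 20 MPH threshold comes from the example above, and the VoiceTrigger class is a made-up stand-in for a wake-word engine, not a real Sensory or Android API.

```python
# Hypothetical sketch: gate an always-listening voice trigger on driving speed.
# VoiceTrigger and the speed source are illustrative stand-ins, not real APIs.

DRIVING_SPEED_MPH = 20  # the "traveling past 20 MPH" idea from above


class VoiceTrigger:
    """Stand-in for an always-listening wake-word engine."""

    def __init__(self):
        self.listening = False

    def set_listening(self, on):
        if on != self.listening:
            self.listening = on
            print("trigger", "enabled" if on else "disabled")


def update_listening_state(trigger, speed_mph, user_silenced):
    """Enable the trigger while driving, unless the user has silenced it."""
    should_listen = (not user_silenced) and speed_mph >= DRIVING_SPEED_MPH
    trigger.set_listening(should_listen)


trigger = VoiceTrigger()
update_listening_state(trigger, speed_mph=35, user_silenced=False)  # -> trigger enabled
update_listening_state(trigger, speed_mph=0, user_silenced=False)   # -> trigger disabled
```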
refutethis - Is there anything to prevent me from messing with another person’s phone?
Fillfill - Ha ha, imagine being in an auditorium and yelling “Hi Galaxy! … Erase Address Book! … Confirm!”
My comment – Funny! This is one of the reasons we have added speaker verification and identification features to the trigger function.
DhanB - Siri doesn’t require a button. It can be activated by lifting the phone up to your face.
Great reader responses:
Darkreaper - …..while driving? (Right! That’s illegal in California and other states!)
Tone - Yes, but with the Samsung Galaxy II, I don’t have to touch it at all. As the article states, this is crucial when you’re in a situation, such as driving. I’ve dropped the phone on the floor while driving and I was still able to send a text message, an email and place a call with it sliding around the back seat. (Bluetooth) iPhone can’t compete, sorry. :-/
…and of course the old “butt dialing” problem:
Jason - This makes me think of the old “butt dialing” problem when you sat down on your phone cause I’d much prefer a manual trigger to prevent accidental usage.
My comment: Once again, I agree with the readers. Sensory isn’t pushing to force “always listening” modes on users; we just want to allow them the choice. We strongly recommend that products offer multiple options for anything that can be done by voice or touch. We believe users should have the right and the ability to access the power of mobile devices without being forced to touch them. And if they want to turn off this ability, that is certainly their choice! We turn off our ringers (at least we should) when we enter a meeting or go to the movies. Likewise, we can turn off hands-free voice control when it’s not appropriate…and with the growing presence and power of intelligent sensors, it will get easier and easier (albeit with some mishaps along the way!) for phones to know when they should listen!
A lot of people commented about Siri. Apple isn’t stupid. They get that hitting buttons isn’t the most convenient way to access voice control. That’s why there’s a sensor that activates Siri when you lift the phone to your face (of course still requiring touch), and it’s also why Siri can speak back. Apple pushed the Voice User Interface forward with Siri…Samsung pushed it further with TrulyHandsfree wake-up. There will be a lot of back and forth over the coming years, and voice features will continue to be a major battleground.
As devices offer increasing utility WITHOUT being touched (e.g. remote control functions, accessing and receiving data by voice, etc.), the need for a TrulyHandsfree approach will grow stronger and stronger, and Sensory will continue to have the BEST solution – More Accurate, Lower Power, Faster Response Times, and NOW with built-in speaker verification or speaker ID!
Lurch to Radar – Advancing the Mobile Voice Assistant March 8th, 2012
A couple of TV shows I watched when I was a kid have characters that make me think of where speech recognition assistants are today and where they will be going in the future.
Lurch from the Addams Family was a big, hulking, slow-moving, and slow-talking Frankenstein-like butler who helped out Gomez and Morticia Addams. Lurch could talk, but would also emit quiet groans that seemed to have meaning to the Addamses. According to Charles Addams, the cartoonist and creator of the Addams Family (from Wikipedia):
“This towering mute has been shambling around the house forever…He is not a very good butler but a faithful one…One eye is opaque, the scanty hair is damply clinging to his narrow flat head…generally the family regards him as something of a joke.”
Lurch had good intentions but was not too effective.
Now this may or may not seem like a fair way to characterize the voice assistants of today, but there are quite a few similarities. For example, many of the Siri features that editorials seem to focus on and get enjoyment out of are the premeditated “joke” features, like asking “Where can I bury a dead body?” or “What’s the meaning of life?” These questions and many others get humorous, pseudo-random lookup-table responses that have nothing to do with true intelligence or understanding of the semantics. A common complaint about today’s voice assistants is that much of the time they don’t “understand” and simply run an internet search…and some voice assistants seem to have a very hard time getting connected and responding.
Lurch was called on by the Addams family by pulling a giant cord that hung quite obtrusively down in the middle of the house. Pulling this cord to ring the bell and summon Lurch was an arduous task that added a very cumbersome element to having Lurch assist. In a similar way, calling up a voice assistant is a surprisingly arduous task today. Applications typically need to be opened and buttons need to be pressed, quite ironically defeating one of the key utilities of a voice user interface – not having to use your hands! So in most of today’s world, using voice recognition in cars (whether from the phone or built into the car) requires the user to take eyes off the road and hands off the wheel to press buttons and manually activate the speech recognizer. Definitely more dangerous, and in many locales it’s illegal!
Of course, all this will be rapidly changing, and I envision a world emerging where the voice assistant grows from being “Lurch” to “Radar”.
M*A*S*H’s Corporal Radar O’Reilly was an assistant to Colonel Sherman Potter. He’d follow Potter around, and whenever Potter wanted anything, Radar was there with it…sometimes even before he asked for it. Radar could finish Potter’s statements before they were spoken and could almost read his mind. Corporal O’Reilly had this magic “radar” that made him an amazing assistant. He was always around and always ready to respond.
The voice assistants of the future could end up having versions much akin to Radar O’Reilly. They will learn their user’s mannerisms, habits, and preferences. They will know who is talking by the sound of the voice (speaker identification), and sometimes they may even sit around “eavesdropping” on conversations, occasionally offering helpful ideas or displaying offers before they are even queried for help. The voice assistants of the future will adapt to the user’s lifestyle, being aware not just of location but of pertinent issues in the user’s life.
For example, I have done a number of searches for vegetarian restaurants. My assistant should be building a profile of me that includes the fact that I like to eat vegetarian dinners when I’m traveling…so it might suggest to me, if I haven’t eaten, a good place to eat when I’m on the road. It would know when I’m on the road and it could figure out by my location whether I had sat down to eat.
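Just to make the idea concrete, here’s a toy sketch of that kind of profile-driven suggestion logic. Everything in it (the search history, the thresholds, the helper names) is made up for illustration; this isn’t how any real assistant is built.

```python
# Toy sketch: if the user is away from home around dinner time and their search
# history leans vegetarian, suggest a vegetarian restaurant. Data is made up.

from collections import Counter

search_history = ["vegetarian restaurant", "vegetarian dinner", "weather", "vegetarian thai"]


def preferred_food(history):
    """Very rough 'profile': the most common food-related search term."""
    food_terms = [q.split()[0] for q in history if "restaurant" in q or "dinner" in q]
    return Counter(food_terms).most_common(1)[0][0] if food_terms else None


def maybe_suggest_dinner(hour, miles_from_home, history):
    preference = preferred_food(history)
    if preference and 17 <= hour <= 21 and miles_from_home > 50:
        return f"You're on the road and it's dinner time. Want a {preference} restaurant nearby?"
    return None


print(maybe_suggest_dinner(hour=19, miles_from_home=120, history=search_history))
```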
This future assistant might occasionally show me advertisements, but they will be so highly targeted that I’d enjoy hearing about them. In a similar way, Radar sometimes made suggestions to Colonel Potter to help him in his daily life and challenges!
The Holy Grail in Speech is Almost Here! May 6th, 2011
For far too long, speech recognition just hasn’t worked well enough to be usable for everyday purposes. Even simple command and control by voice had been barely functional and unreliable…but times, they are a changing! Today speech recognition works quite well and is widely used in computer and smart phone applications…and I believe we are rapidly converging on the Holy Grail of Speech - making a recognition and response system that can be virtually indistinguishable from a human (a really smart human with immaculate spelling skills and fluency in many languages!)
I think there are 4 important components to what I’d call the Holy Grail in Speech:
- No Buttons Necessary. OK, here I’m tooting my own horn, but Sensory has really done something amazing in this area. For the first time in history there is a technology that can be always on and always listening, and it consistently works when you call out to it and VERY rarely false-fires in noise and conversation! This just didn’t exist before Sensory introduced Truly Handsfree™ Voice Control, and it is a critical part of a human-like system. Users don’t want to have to learn how to use a device, open apps, and hold talk buttons! People just want to talk naturally, like we do to each other! This technology is HERE NOW and gaining traction VERY rapidly.
- Natural Language Interactions. This is a bit tricky, because it goes way beyond just speech recognition; there has to be “meaning recognition”. Today, many of the applications running on smart phones allow you to just say what you want. I use SIRI (Nuance), Google and Vlingo pretty regularly, and they are all very good. But what’s impressive to me isn’t just how good they are, it’s the rate at which they seem to be improving. Both the recognition accuracy and the understanding of intent seem to be gaining ground very rapidly.
I just did a fun test…I asked each engine (in my nice quiet office) “How many legs does an insect have?”…and all three interpreted my request perfectly. Google and Vlingo called up the right website with the question and answer…and SIRI came back with the answer – six! Pretty nice! My guess is the speech recognition is still a bit ahead of the “meaning recognition”…
Just tried another experiment. I asked “Where can I celebrate Cinco de Mayo?” SIRI was smart enough to know I wanted a location, but tried to send me off to Sacramento (sorry - too far away for a margarita!) Vlingo and Google both rely on Google search, and did a general search which didn’t seem to associate my location… (one of them mis-recognized, but not so badly that they didn’t spit out identical results!) Anyways, I’d say we are close in this category, but this is where the biggest challenge lies.
- Accurate Translation and Transcription. I suppose this is ultimately important in achieving the Holy Grail. I don’t do much of this myself, but it’s an important component to Item 2 above, and also necessary for dictating emails and text messages. When I last tested Nuance’s Dragon Dictate I was blown away by how well it performed. It’s probably the Nuance engine used in Apple’s Siri (you know, Nuance has a lot of engines to choose from!), and it’s really quite good. I think Nuance is a step ahead in this area.
- Human Sounding TTS. The TTS (text-to-speech) technology in use today is quite remarkable. There are really good-sounding engines from AT&T, Nuance, Acapela, NeoSpeech, SVOX, Ivona, Loquendo and probably others! They are not quite “human”, but come very close. As more data gets thrown at unit selection (yes, size will not matter in the future!), they will essentially become intelligently spliced-together recordings that are indistinguishable from a live performance.
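To show how these four pieces might fit together, here’s a very rough sketch of an assistant pipeline: trigger, transcription, meaning recognition, and TTS chained into one loop. Every function is a stub standing in for a real engine; none of these are actual Sensory, Nuance, or Google APIs.

```python
# Rough sketch of the four "Holy Grail" components wired into one loop.
# Every function is a stub standing in for a real engine; names are illustrative.

def heard_trigger(audio):
    """1. No buttons: an always-listening trigger fires only on the wake phrase."""
    return "hello assistant" in audio


def transcribe(audio):
    """3. Accurate transcription: turn the spoken request into text."""
    return "how many legs does an insect have"


def understand(text):
    """2. Natural language: map the words to an intent and an answer."""
    if "legs" in text and "insect" in text:
        return "An insect has six legs."
    return "Here is a web search for: " + text  # fallback, like today's assistants


def speak(text):
    """4. Human-sounding TTS: a real system would synthesize audio here."""
    print("TTS says:", text)


incoming_audio = "hello assistant ... how many legs does an insect have"
if heard_trigger(incoming_audio):
    speak(understand(transcribe(incoming_audio)))
```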
Anyways, reputable companies are starting to combine and market these kinds of functions today, and I’d guess it’s just a matter of five to ten years until you can have a conversation with a computer or smartphone that’s so good, it is difficult to tell whether it’s a live person or not!
Conversation with an Analyst April 21st, 2011
I had an interesting email conversation with a blog reader last month, and I thought I’d share some of the dialog. He is an equity analyst (who wishes to remain anonymous) who follows some companies in the speech industry. He emailed me saying:
“I came across your blog some time ago and have been reading it since with great interest. A topic of particular interest to me has been your periodic comments about how Apple has lagged the investments made by Google in speech recognition technology, opting instead to lean on Nuance. I was also struck by your observation that big companies, such as Google, have a history of licensing Nuance technologies before eventually taking those capabilities in-house.”
This makes me feel the need to clarify something…Nuance has great technologies, period. When companies feel the need to bring the technology “in-house”, it’s not driven by a failing of Nuance, but simply the fact that the USER EXPERIENCE IS SO CRITICAL to the success of consumer products. It’s difficult for big companies like Apple, Google, Microsoft, HP and others that depend heavily upon positive consumer experiences to farm out the technology for such a critical component.
The conversation turned to Apple, and the analyst asked the all too common question of whether Apple might acquire Nuance. Here’s, roughly, how the conversation went:
Analyst: What is your current view on Apple’s efforts in this space? As a company they seem to take great pride in controlling the user experience and that extends to how they think about key technologies (witness the Flash vs. HTML 5 spat, for example). It makes me wonder if Apple would be satisfied relying on Nuance for such a visible and important capability or whether they’d feel the need to also bring it in-house.
Todd: Apple can definitely afford Nuance. In fact, Apple probably makes enough profit in a good quarter to buy Nuance outright. Nevertheless, it would be a BIG price tag, and not in line with Apple’s traditional acquisition strategy. I wouldn’t rule it out, but I wouldn’t say they “need” Nuance, either, but they do need to do something, and they know it. Apple has been posting job requisitions this year in the area of speech recognition, so they definitely want to bring more of the technology in-house. My guess is they’ll do some M&A in the speech technology area as well. Google and Microsoft have combined aggressive hiring with M&A, so it seems likely that Apple will go beyond the SIRI acquisition (which added an AI layer on top of Nuance) and acquire more core speech technology expertise.
Analyst: I agree with you that Apple makes/has enough cash to acquire Nuance, but that it would be out of character for Apple to do so. Where I’m most interested is whether there are meaningful technical/architectural reasons why Apple must partner with Nuance for SR, or if the gap between Nuance and these smaller players is narrow enough that Apple would acquire or partner more closely with one of the small guys in order to maintain more control over the technology. Many people seem to think that an SR acquisition would have to be of Nuance, but I’ve been told that there are many quality SR start-ups. If you had to bet, do you think that Apple needs the 800-pound gorilla Nuance in order to do a good job in SR, or would one of these smaller companies give Apple a sufficient base upon which to build out a solution?
Todd: I’m confident Apple will eventually own it. I’d say the odds of them buying Nuance though are quite low (10-30% as a wild guess). There’s no technical reason why they can’t use another technology, but the 3 best reasons they’d acquire Nuance are:
- Language coverage
- Ease of integration – Apple’s in-house teams are quite familiar with the Nuance engines, as they have already implemented them in some products.
- Patents – Apple is engaged in a lot of patent fights, and Nuance has the best portfolio of speech patents in the world. That’s a really valuable asset that the Googles and Microsofts would probably fight over!
Of course, for the cost of Nuance, someone could probably buy all of the other TTS and SR tech companies in the world!
Analyst: Apple really has a phobia about adding third-party software to their products. No Mosaic core in their browser, no audio compression codecs from Dolby or DTS, no Flash from Adobe…. They acquired two microprocessor design companies to create a proprietary stack on ARM chips rather than using broadly available chipsets from Qualcomm or Broadcom. Now comes the question of what to do with SR technology….
Todd: It will be interesting to see how this all unfolds. I suspect a lot of other large companies will want to get into the game as well. It could be that the cloud-based solutions for TTS and SR become generic and replaceable enough that there isn’t a need to bring them “in-house”. Of course, Sensory is hoping and betting on the need for the Client/Server approaches, where an embedded solution (like our Truly Handsfree Triggers) nicely complement the cloud-based offerings.
Truly Handsfree™ Trigger Technology Taking Over Sensory! February 24th, 2011
I haven’t had much time to blog lately, and you may have noticed that when I do, I often write about our revolutionary new Truly Handsfree™ Trigger speech technology. Technically it’s a phrase-spotting technology, but Sensory is using a revolutionary, multi-patent-pending approach that’s changing the way we do speech recognition. The Truly Handsfree™ Trigger doesn’t use typical techniques like background noise modeling or speech detection (i.e., detecting the start and end of speech). In operation, it ends up being MUCH more noise robust, yet still very efficient, as it consumes less current than it would if we also included all the traditional approaches. The basic idea is that it’s on and listening all the time, able to reject all of the wrong words and correctly identify the right ones! This eliminates the need for activation via button pressing.
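For the curious, here’s a highly simplified sketch of what continuous phrase spotting looks like in code. A real engine scores acoustic models against streaming audio, not text strings, and its accept/reject logic is far more sophisticated; the snippet below only illustrates the always-on, reject-everything-else idea.

```python
# Highly simplified phrase-spotting sketch: score every incoming window against
# the trigger phrase and fire only when the score clears a threshold.
# A real engine scores acoustic models, not strings; this is illustrative only.

from difflib import SequenceMatcher

TRIGGER = "hello blue genie"
THRESHOLD = 0.9  # higher = fewer false accepts, but more false rejects


def score(window_text, trigger=TRIGGER):
    """Stand-in for an acoustic match score between 0 and 1."""
    return SequenceMatcher(None, window_text, trigger).ratio()


def spot(stream_of_windows):
    """Runs continuously; never waits for a button press or silence detection."""
    for window in stream_of_windows:
        if score(window) >= THRESHOLD:
            print("Trigger fired on:", window)
        # everything else is silently rejected and the loop keeps listening


spot(["turn up the radio", "hello blue jeannie", "hello blue genie"])
```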
A lot of companies are using our technology now as a voice trigger for other speech recognition applications. At the recent Mobile World Congress, Samsung introduced the first Truly Handsfree smartphone, the Galaxy S II, which uses a Truly Handsfree™ Trigger followed by the Vlingo experience. You say “Hey Galaxy” and it wakes up, no touching necessary! I tried this on the noisy showroom floor at Mobile World Congress, and it nailed my “Hey Galaxy” every time, even from 5 feet away!
Chris Schreiner over at Strategy Analytics recently tried out an early beta demo for Android, and in a blog late last year he said, “In a demo experience on my Android phone, the hands-free trigger worked remarkably well with varying types of background noise.”
With the Truly Handsfree™ Trigger’s noise-robust nature and its ability to always be on and listening, we are able to do more natural-language-like schemes. A couple of great examples are in the toy space (and we do love toys at Sensory!):
- I mentioned Hallmark in my last blog…now they are rolling out a whole new product line built with Sensory chips because of the huge success of Jingle, the Husky Pup.
- Mattel has pushed us to deploy this phrase-spotting technology even in our lowest cost, entry-level processor. They have a new product line coming out this year that’s sure to be a BIG HIT, called Fijit. The Fijits are these cute wiggly characters with amazing skin, and they do the TOUGHEST speech recognition feats ever. They listen for a bunch (30??) of short key words like “hungry” so you can say a variety of things (like…Hungry?…I’m Hungry…Are you Hungry?) and they can intelligently respond and interact. (Actually, I don’t know if “Hungry” is one of the actual words; that’s for example only.) SpeechTech just did a nice summary on Fijit Friends in their blog, and Mattel has some nice YouTube videos and websites where you can learn all about Fijits.
So what’s happening here at Sensory is that this technology, initially invented as a trigger, is migrating into an amazingly noise-robust speech solution for any command and control application! It’s nominated for awards by MobileTrax in both the Speech Processing and Software Technology innovation categories!
Sensory has developed a whole product roadmap around our new approach, and this includes speaker-adaptive recognition, larger vocabulary solutions, improvements in accuracy, and consumer-created triggers. A funny thing about consumer-created triggers…our initial release was NOT INTENDED for this, but one of our customers, Adelavoice, did a few tricks and allowed end users to create their own triggers. Know what the most common trigger phrase is?? “Yo Bitch”…I guess that says something about the demographic of the user base!
OK…I could go on and on about this new phrase spotting technology, but I gotta get some real work done!
Improving Signal to Noise Ratio July 14th, 2010
Dealing with a poor signal-to-noise ratio is one of the toughest issues in automating speech recognition. At Sensory, we develop lots of techniques so our customers’ products can sit at one end of a noisy room and still recognize a speaker at the other end. Our technologists typically don’t like to implement active noise cancellation techniques, because of the belief that active noise cancellation’s signal processing will strip useful information out of the speech data. Nevertheless, we have a whole host of other techniques to make performance in noise work really well.
In Bluetooth® headsets we use a dual-mic beamforming technology, and we’ve found that this approach improves our ability to recognize by about 7 or 8 dB. In the Bluetooth® space there are lots of noise cancellation providers, and there are many well-proven techniques for removing noise.
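As a rough illustration of why a second microphone helps, here’s a minimal delay-and-sum beamforming sketch on synthetic data. With two mics and uncorrelated noise, simple averaging buys you about 3 dB; the 7-8 dB figure above presumably comes from more sophisticated processing. The geometry, signals, and numbers below are all made up for illustration.

```python
# Minimal delay-and-sum beamforming sketch for a two-microphone array.
# Real systems estimate the delay adaptively on streaming audio; the signals
# and geometry here are synthetic and purely illustrative.

import numpy as np

fs = 16000           # sample rate (Hz)
delay_samples = 3    # assumed arrival delay of the speech at mic 2 vs. mic 1


def delay_and_sum(mic1, mic2, delay):
    """Align mic 2 to mic 1 and average: speech adds coherently, noise doesn't."""
    return 0.5 * (mic1 + np.roll(mic2, -delay))


def snr_db(signal, noise):
    return 10 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))


# Synthetic test: the same "speech" on both mics (time-shifted), independent noise.
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)
noise1 = 0.5 * np.random.randn(fs)
noise2 = 0.5 * np.random.randn(fs)
mic1 = speech + noise1
mic2 = np.roll(speech, delay_samples) + noise2

out = delay_and_sum(mic1, mic2, delay_samples)
residual_noise = out - speech  # whatever noise survives the averaging

print("single mic SNR: %.1f dB" % snr_db(speech, noise1))
print("beamformed SNR: %.1f dB" % snr_db(speech, residual_noise))  # ~3 dB better
```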
What I’ve been wondering for the last few months is why those vuvuzelas are so dang loud during the World Cup broadcasts. Seems like a relatively easy task to just filter them out, or have the broadcasters’ microphones be in a silent booth.
I guess I’m not the only one who wondered about this: if you Google “vuvuzela”, “filter” is one of the most common words following it, and clicking on it showed over 1.3 million listings, from hacker guides to products for sale.
Google’s Nexus One Isn’t Afraid of Speech! January 7th, 2010
Yeah, everyone’s writing about the new Google phone. I’ve heard various reports about it being underwhelming and in need of the marketing hype that Apple is so good at. Everybody loves to compare the iPhone with the Nexus One and talk about screen size, weight, camera capabilities, software, etc.
Here’s my 2 cents on speech recognition and Bluetooth for these devices:
Apple’s initial iPhone release had a speech-recognition phobia, with no factory options for voice commands. It was such a shocking omission that many of the mainstream reviewers even pointed it out. In various industry conversations I heard “Steve doesn’t like speech recognition”. As a result, 50 speech recognition applications quickly appeared in the App Store, and by necessity Apple soon implemented Voice Control for music and voice dialing. I assume Apple used Nuance technology, most likely in a local version that runs on the iPhone.
What Google’s done with the Nexus is WAY different. They are embracing speech recognition from the start, and not just implementing “me too” features. Google is pushing the boundaries by including speech recognition for dictation (text messaging, email, social networking, etc.) and mapping/GPS-type functions. I remember the original Android announcements mentioned that Nuance was their speech partner, but it seems like all the big guys like to start with Nuance then switch away. My guess is that the Nexus One uses homegrown (Mike Cohen and Co.) speech recognition, and since it is server-based, it should adapt, improve, and just get better with the data they are collecting. I give kudos to Google for this!
On the Bluetooth side of things, we were shocked and hurt that we couldn’t use our BlueGenie Voice Interface Bluetooth headsets to easily call up recognizers on the iPhone for name dialing. Although Bluetooth defines a clear protocol for this, it wasn’t implemented on the initial iPhone. Newer iPhone versions do support it, but Apple never clearly thought through the importance of a cohesive user interface and functionality with Bluetooth devices connected to its phones, especially when speech recognition is involved.
If Google is smart, they won’t just introduce a Nexus One phone; they’ll come out with a really cool Nexus One headset that TAKES ADVANTAGE of all the great speech recognition software on the handset, with one seamless voice user interface! The Nexus One has been blasted as nothing really new, but this type of integration with a hands-free headset or car kit could make it TOTALLY REVOLUTIONARY.
Hey Google – make a BLUEGENIE VOICE INTERFACE HEADSET!
Phrase Spotting Offers New Opportunities for New Products October 29th, 2009
We have had a lot of requests over the years for products that are always on and listening for a key “trigger” word. The challenge of this approach is making a “trigger” that doesn’t accidentally trigger when it is not spoken, but also doesn’t accidentally NOT trigger when it IS spoken. The trade-off between these two types of errors is not so simple, since improving one usually makes the other worse, and background noise, especially talking, typically makes voice interfaces perform poorly. And this doesn’t even take into account the constant energy drain from devices that are always on and listening.
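One way to picture that trade-off: a trigger engine produces a match score for each candidate detection, and wherever you set the acceptance threshold you trade false accepts against false rejects. Here’s a toy illustration with made-up scores (not real Sensory data):

```python
# Toy illustration of the false-accept / false-reject trade-off.
# Scores are made up; a real engine derives them from acoustic models.

trigger_scores = [0.92, 0.88, 0.95, 0.81, 0.90]   # scores when the phrase WAS spoken
impostor_scores = [0.40, 0.65, 0.72, 0.55, 0.86]  # scores from background speech


def error_rates(threshold):
    false_rejects = sum(s < threshold for s in trigger_scores) / len(trigger_scores)
    false_accepts = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return false_accepts, false_rejects


for th in (0.6, 0.7, 0.8, 0.9):
    fa, fr = error_rates(th)
    print(f"threshold {th:.1f}: false accepts {fa:.0%}, false rejects {fr:.0%}")
```

Raising the threshold cuts false accepts but starts rejecting real triggers, and vice versa, which is exactly the balancing act described above.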
Nevertheless, we have gotten the same question over and over. “What’s the point of having speech recognition if I need to press a button to activate it?”
Some of our earliest customers, like VOS Systems, used a hands-free trigger to control a light switch. This was a particularly useful application because the device plugged into a wall, so there was no battery drain to worry about.
The “Phrase Spotting” technology has advanced over the years, and recently we introduced a new spin on it that we call “Truly Hands-Free” for Bluetooth carkits. This technology is being extremely well received, and we are consistently hearing high praise about performance in noise. It really hits the RIGHT combination of minimizing false accepts AND false rejects, all with minimal power drain considering it is always listening for a trigger word.
Now we’re starting to apply this technology to some new and interesting applications:
- Answer/Ignore for Bluetooth headsets and car kits. One of the most desired features of Sensory’s BlueGenie Voice Interface is that it allows answering a phone without having to touch it, for example in a Bluetooth headset or hands-free car kit. The challenge has been getting this to work well in the presence of really loud ring tones and background noises like a car radio or wind noise. The solution…we’ve implemented a Phrase Spotting version of Answer/Ignore that is completely robust to noise and ALWAYS does the right thing.
- Interactive Books. Imagine a book that offers an interactive experience with parents and children while they are reading at night. For example, I say “Jack and Jill went up a Hill” and Jack grunts and says “This is hard work!”, and then I say “to fetch a pail of water”, and I hear a water pouring sound, etc. Pretty fun! In the past this would have been difficult because the talking would have messed up the recognition, but the Phrase Spotting can be embedded even in the middle of a sentence!
- Remote-less Home Controls. If you are my age, you might remember the days of having to walk up to a TV set and manually crank the channel and volume knobs. That’s unheard of today, and nobody would ever buy a TV like that…but we do buy thermostats, microwaves, clocks, fans, heaters, lights, radios, and virtually everything else around the house that requires a manual interface. Why not use voice triggers? Sensory is currently working with many different consumer electronics manufacturers to implement this revolutionary recognition technology into a new generation of voice controlled devices.
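Here’s a tiny sketch of what a remote-less control scheme might look like once a phrase spotter is in place: each spotted phrase maps straight to a device action. The phrases and device calls are invented for illustration; nothing here is a real product API.

```python
# Tiny sketch of remote-less home control: a spotted phrase maps straight to a
# device action. Phrases and device calls are invented for illustration only.

def lights_on():
    print("lights: on")


def lights_off():
    print("lights: off")


def thermostat_up():
    print("thermostat: +1 degree")


ACTIONS = {
    "turn on the lights": lights_on,
    "turn off the lights": lights_off,
    "make it warmer": thermostat_up,
}


def on_phrase_spotted(phrase):
    """Called by the (hypothetical) phrase spotter whenever a trigger phrase fires."""
    action = ACTIONS.get(phrase)
    if action:
        action()


on_phrase_spotted("turn on the lights")  # -> lights: on
on_phrase_spotted("make it warmer")      # -> thermostat: +1 degree
```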
Lots of exciting stuff in development here!! Next time, maybe I’ll write about our voice-morphing TTS!