Quick Thoughts May 1st, 2013
- Texas A&M Transportation Institute Study. Yeah, they found that using Siri or Vlingo was as dangerous as texting while driving. Adam Cheyer hit the nail on the head in his response, but Adam focused on Siri…Turns out they didn’t use Vlingo in the In Car mode, which was of course designed for In Car. Duh! Vlingo’s (now Nuance’s) In Car uses Sensory’s TrulyHandsfree, which requires NO TOUCHING and no distracted eyes while driving. All those articles that said “Handsfree texting no safer than typing” really got it wrong. It’s not TRULYHANDSFREE!!! In the study they held phones in their hands and hit buttons. Sorry, that’s not Handsfree!
- Google Now on iOS. Cool! Android speech recognition is very good and probably the best, but having it built into the home button is easy, and easy usually trumps good. But Apple can’t be complacent; it’s gotta make some big moves or it will be left behind in the category it popularized.
- Google Glass. Holy smoke, what a lot of press it gets. Sensory has 2 in house and we love the user experience! We believe wearables will become huge, and Google is certainly driving the forefront of this. Glass must use Google’s speech recognition in the cloud. Wonder what they use on the client? It works GREAT!
- Galaxy S4. Yep, Sensory made it in for the embedded recognition used in triggers (with SVoice) and voice command and control! We got invited to the launch party. It’s a GREAT product, with a GREAT embedded speech recognizer.
- Icahn buying into Nuance. Interesting…Can’t be bad for Nuance investors, until he sells! It’s nice to see speech technology reach the forefront in not just consumer electronics and technology but in the finance world too!
- Qualcomm introduces voice triggers. Yeah, everyone knows that’s the area where Sensory dominates. Better accuracy, faster response time, lower power consumption, works in noise and from a distance, etc. People ask if Qualcomm is using Sensory technology. I say try it, and if it works GREAT then it’s probably Sensory’s. Anyways, we welcome the Qualcomm solution, as it totally validates what we’ve been saying and doing. I tried it at Mobile World Congress and it responded well in noise, but you had to hit a button to turn it on and make it listen, which kind of defeats the purpose.
- Amazon buying the pieces. Yeah they bought up some of the best components available – TTS from Ivona, cloud speech recognition from YAP, and now intelligence from Evi. Even adding it all up, they haven’t paid that much and if they put it all together well, they should be in a strong position relative to their competitors.
- Industry. The overall speech field is aligning as a battle of titans, all with good patent positions, large teams, and good technologies. Amazon, Google/Android, Microsoft, and Nuance are all major speech players today. Apple probably is too, but it’s hard to know what’s in house at Apple vs. Nuance. Nuance is the only substantive player that’s a vendor, out there selling speech technology. This puts them in a nice position, but they have competitors giving it away on all major platforms, so nobody is without challenges. Sensory might be the second largest speech vendor after Nuance, and our sales are less than 2% of Nuance’s…pretty amazing gap there! I want to fill that gap!
Follow the Leader in Mobile October 2nd, 2012
I really enjoyed reading this article interviewing Vlad Sejnoha, Nuance’s CTO. Most people would consider Nuance the leader in speech recognition today, and Vlad is certainly a very smart, thoughtful, and articulate man.
I enjoyed it for a few different reasons. The first and main reason I liked the article is that it helps push the idea Sensory has been championing for the past several years: that devices don’t have to be touched to enable voice commands, and that you should be able to just start talking to things the way we talk to each other. That’s what Sensory calls TrulyHandsfree, and it’s the technology that showed up in the first Bluetooth carkit that requires no touching (by BlueAnt) AND the first mobile phones that responded to voice without touch (Samsung’s Galaxy SII, SIII, and Note – check out this video from Samsung and this one, also from Samsung). Even hit toys like Mattel’s award-winning Fijit Friends and Hallmark’s Interactive Books use this unique technology that just works when you talk to it. In fact, it really was the TrulyHandsfree feature that made Vlingo so popular, as this Vlingo video nicely states in its comparison between Vlingo and Siri. (Nuance bought Vlingo earlier this year, but the Sensory TrulyHandsfree didn’t come with it!)
The article says “Sejnoha believes that within a year or two you’ll be able to talk to your smartphone even as it lies idle on a desk, asking it questions such as, “When’s my next appointment?” The phone will be able to detect that you are speaking, wake itself up, and accomplish the task at hand.” Check out this Sensory video…this is definitely what Vlad is talking about! Yeah, we can do it today, and it’s REALLY FAST and really accurate.
But is it low power? Well, that’s ABSOLUTELY KEY. That’s why Sensory partnered with Tensilica, a leader in low-power audio DSPs for mobile phones. Sensory already has TrulyHandsfree running on chips that draw under 5 mW for a COMPLETE audio system. And that’s without having to wake up to understand the task at hand. We could drop another 1-2 mW by not being always on, but turning the recognizer off doesn’t do much: even with the full recognizer shut down, you still need to run a mic and preamp, which drive most of the current consumption once you have a low-power recognizer like TrulyHandsfree (it can run on as little as 7 MIPS!). This means it’s REALLY critical to have a low-power recognizer as well, and that’s Sensory’s forte. We expect that by next year we will have systems running at 1-3 mW!
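To see why shutting off the recognizer buys so little, a back-of-envelope power budget helps. The component numbers below are illustrative assumptions of mine (only the 7 MIPS figure comes from the text), so treat this as a sketch of the reasoning, not actual Sensory specs:

```python
# Back-of-envelope budget for an always-listening audio system.
# All component numbers are illustrative assumptions except 7 MIPS.

mic_preamp_mw = 2.5    # assumed analog front end (mic + preamp), always on
dsp_mw_per_mips = 0.1  # assumed DSP efficiency in mW per MIPS
recognizer_mips = 7    # TrulyHandsfree can run in as little as 7 MIPS

recognizer_mw = recognizer_mips * dsp_mw_per_mips
total_mw = mic_preamp_mw + recognizer_mw

# Even shutting the recognizer off entirely only saves its small share:
share_pct = 100 * recognizer_mw / total_mw
print(f"total {total_mw:.1f} mW, recognizer only {share_pct:.0f}% of it")
```

With assumptions like these, the analog front end dominates the budget, which is exactly why a recognizer that is already tiny changes the picture: the next savings have to come from the whole system, not from gating the recognizer.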
The article mentions “persistent” listening, but even though I’ve always preached this “always on” concept, I think what will really explode is “intelligent automatic listening”. That is, the device figures out when it needs to listen for what, and turns on to listen for it. So it doesn’t always have to be on…it will just seem that way because the devices are so intelligent. For example, a certain traveling speed could make a phone listen for car commands or car wake-up words. An incoming call could cause the recognizer to wake up and listen for Answer/Ignore. For these to work, the device needs to run not only at very low power but also with VERY high accuracy. You don’t want a background conversation triggering the phone call to hang up! Accuracy is another Sensory forte, and the combination of accuracy with low power consumption is a difficult mix to conquer! Sensory’s accuracy holds up not only in noise but also from a distance: when a recognizer works well at a poor S/N ratio, the signal can be lower (as from a distance) and/or the noise can be higher.
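The “intelligent automatic listening” idea boils down to context selecting which small vocabulary the trigger recognizer arms. Here’s a minimal sketch; the function name, command sets, and speed threshold are all invented for illustration:

```python
# Sketch of "intelligent automatic listening": context picks which small
# phrase set the always-available trigger recognizer arms, instead of
# listening for everything all the time. All names/thresholds are made up.

def select_listening_set(speed_mph: float, incoming_call: bool) -> list[str]:
    """Pick the phrases worth listening for in the current context."""
    if incoming_call:
        return ["answer", "ignore"]   # call screening, hands-free
    if speed_mph > 10:                # assumed "driving" threshold
        return ["call", "navigate"]   # in-car command set
    return ["wake-up phrase"]         # idle: just the trigger word

print(select_listening_set(speed_mph=35, incoming_call=False))
```

Because each armed set is tiny, both the false-fire risk and the compute stay low, yet to the user the device simply seems to be always listening.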
So it’s really cool that Nuance is getting on the bandwagon behind Sensory’s innovations like TrulyHandsfree at low power. In fact, after Samsung’s release of the Galaxy SII with Sensory, Nuance did come out with an always-on-and-listening mobile device; for fun we quickly ported our technology onto the same phone to compare…check out this video.
Something interesting we noticed: after Sensory announced its speaker verification and speaker ID for mobile devices at CTIA this year, Nuance shortly thereafter came out with its own announcement, but there were no demos available, so we couldn’t do a comparison video.
Random Thoughts and Miscellaneous Videos August 29th, 2012
- Android JellyBean Speech Recognition. It’s REALLY REALLY awesome. I thought all those video comparisons with Siri must be staged, but I’ve been using it and it’s very fast and very accurate and reasonably intelligent. My only criticism is in their marketing. First of all where’s the Mike LeBeau video? And what’s it called? Google Now? Google Voice? Google Voice Actions? JellyBean Speech Recognition? None of this marketing stuff really matters…it’s a big step forward in the handset based speech wars, and by my count puts Android in the lead on speech technology. Can’t wait to see Apple’s next release!! I bet it will be great…and Microsoft? You spent a billion dollars on Tellme, you have had the biggest speech team for the longest time, what are you doing???
- One of Sensory’s technology apps guys did a really nice demo placing the Sensory trigger to call up the Android JellyBean speech engine. Look how nicely the Sensory technology interacts to make the whole experience not only handsfree but ripping fast!
- ChinaMobile invested over $200M in iFlytek…WOAH!!! Really? Over $1.2B valuation. Holy Smokes.
- OK, I’m a speech geek…there’s something I really like about attractive women using speech recognition on QVC (yeah this is a Sensory chip based product, that works AMAZINGLY well in a live shoot)
- I’m a huge fan of Hallmark’s Interactive Storybuddies…There’s a ton of other fans who have posted videos showing how nice these products are. Sensory’s TrulyHandsfree technology on an NLP chip is embedded in a plush character that responds while you read a book. Now, everyone in the speech industry “knows” that speech recognition works better for men than women, that accents destroy recognition accuracy, and that you need to speak loudly into the mic or else the S/N will be too poor for recognition to perform. Well, watch this video of a soft-spoken, British-accented woman using a Hallmark Storybuddy to see how AMAZINGLY well the Sensory engine does.
Lurch to Radar – Advancing the Mobile Voice Assistant March 8th, 2012
A couple of TV shows I watched when I was a kid had characters that make me think of where speech recognition assistants are today and where they will be going in the future.
Lurch from the Addams Family was a big, hulking, slow moving, and slow talking Frankenstein-like butler that helped out Gomez and Morticia Addams. Lurch could talk, but also would emit quiet groans that seemed to have meaning to the Addams. According to Charles Addams, the cartoonist and creator of the Addams family (from Wikipedia):
“This towering mute has been shambling around the house forever…He is not a very good butler but a faithful one…One eye is opaque, the scanty hair is damply clinging to his narrow flat head…generally the family regards him as something of a joke.”
Lurch had good intentions but was not too effective.
Now, this may or may not seem like a fair way to characterize the voice assistants of today, but there are quite a few similarities. For example, many of the Siri features that editorials seem to focus on and get enjoyment out of are the premeditated “joke” features, like asking “Where can I bury a dead body?” or “What’s the meaning of life?” These questions and many others are met with humorous, pseudo-random lookup-table responses that have nothing to do with true intelligence or understanding of the semantics. A common complaint about today’s voice assistants is that much of the time they don’t “understand” and simply run an internet search…and some voice assistants seem to have a very hard time getting connected and responding.
Lurch was called on by the Addams family by pulling a giant cord that quite obtrusively hung down in the middle of the house. Pulling this cord to ring the bell and summon Lurch was an arduous task that added a very cumbersome element to having Lurch assist. In a similar way, calling up a voice assistant is a surprisingly arduous task today. Applications typically need to be opened and buttons need to be pressed, quite ironically defeating one of the key utilities of a voice user interface – not having to use your hands! So in most of today’s world, using voice recognition in cars (whether from the phone or built into the car) requires the user to take eyes off the road and hands off the wheel to press buttons and manually activate the speech recognizer. Definitely more dangerous, and in many locales it’s illegal!
Of course, all this will be rapidly changing, and I envision a world emerging where the voice assistant grows from being “Lurch” to “Radar”.
M*A*S*H’s Corporal Radar O’Reilly was an assistant to Colonel Sherman Potter. He’d follow Potter around, and whenever Potter wanted anything, Radar was there with it…sometimes even before he asked for it. Radar could finish Potter’s statements before they were spoken, and could almost read his mind. Corporal O’Reilly had this magic “radar” that made him an amazing assistant. He was always around and always ready to respond.
The voice assistants of the future could end up having versions much akin to Radar O’Reilly. They will learn their user’s mannerisms, habits, and preferences. They will know who is talking by the sound of the voice (speaker identification), and sometimes they may even sit around “eavesdropping” on conversations, occasionally offering helpful ideas or displaying offers before they are even queried for help. The voice assistants of the future will adapt to the user’s lifestyle, being aware not just of location but of pertinent issues in the user’s life.
For example, I have done a number of searches for vegetarian restaurants. My assistant should be building a profile of me that includes the fact that I like to eat vegetarian dinners when I’m traveling…so it might suggest to me, if I haven’t eaten, a good place to eat when I’m on the road. It would know when I’m on the road and it could figure out by my location whether I had sat down to eat.
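That kind of profile-driven suggestion is just a few rules over a learned profile. A toy sketch of the idea (the profile format, rule thresholds, and function name are all invented for illustration):

```python
# Toy sketch of a profile-driven assistant suggestion. The profile
# format and the rules are invented purely to illustrate the idea.

profile = {"diet": "vegetarian", "home_city": "San Jose"}

def dinner_suggestion(current_city: str, hour: int, has_eaten: bool):
    """Suggest dinner only when traveling, hungry, and at dinner time."""
    on_the_road = current_city != profile["home_city"]
    if on_the_road and 17 <= hour <= 21 and not has_eaten:
        return f"{profile['diet']} restaurants near you in {current_city}"
    return None  # stay quiet: no suggestion warranted

print(dinner_suggestion("Portland", hour=19, has_eaten=False))
```

The hard part, of course, isn’t the rule; it’s learning the profile and inferring things like “hasn’t eaten yet” from location and time, which is exactly the intelligence a Radar-like assistant needs.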
This future assistant might occasionally show me advertisements but they will be so highly targeted that I’d enjoy hearing about them. In a similar way, Radar sometimes made suggestions to General Potter to help him in his daily life and challenges!
TrulyHandsfree™ - The Important First Step in a Voice User Interface October 10th, 2011
An interesting blog post (from PC World) came out following Apple’s iPhone 4s intro with Siri. I think everyone knows what Siri is…it’s the Apple acquisition that has turned into a big part of the Apple user experience. Siri technology allows a user to not only search but control various aspects of a smartphone by voice in a “natural language” manner.
The blog post depicts a looming showdown between Sensory and Apple’s Siri. It is quite kind to Sensory, pointing out our near-flawless performance in noise and how TrulyHandsfree™ does not require button presses. While those points are true, Sensory is certainly NOT a competitor to Siri. We do partner with companies like Vlingo that might be considered a Siri competitor, but Sensory’s TrulyHandsfree is just the first part of a multi-stage process for creating a true Voice User Interface.
Here is the basic process:
1. Voice trigger – wake the device by speaking, with no button press.
2. Speech recognition – a more powerful (often cloud-based) recognizer transcribes what was said.
3. Understanding meaning – figure out the intent behind the words.
4. Search – retrieve or act on the requested information.
5. Text-to-speech – speak the response back to the user.
It’s just that first step that Sensory does better than anyone else. However, it’s an important step that requires a few critical characteristics:
- Extremely fast response time. Since it basically competes with a button press, it has to have a similar or faster response time. Because TrulyHandsfree uses a probabilistic approach, it can respond without having to wait for the recognizer to determine if the word is even finished! Slow response times lead users to speak before the Step 2 recognizer is ready to listen, which is a major cause of failure.
- Low power consumption. If it’s always on and always listening, it can’t be a power hog. Sensory can perform wake-up triggers with as little as 15 MIPS, and has the ability to operate in the 1-10 mA range on today’s smartphones.
- Highly accurate with poor S/N ratios. This means several things:
- Works in high noise. TrulyHandsfree Voice Control performs flawlessly in extremely loud environments, including music playing in the background or even outdoors in downtown Portland!
- Works without a microphone in close proximity. TrulyHandsfree is responsive even at distances of 20 feet (in a relatively quiet environment) and at arm’s length in noise. This is critical because many VUI-based applications of the future will become commonplace in a wide variety of consumer electronics devices, and users won’t want to get up and walk over to their devices to control them.
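The distance and noise claims are two faces of the same S/N requirement, and simple free-field acoustics (roughly 6 dB of signal lost per doubling of distance) makes that concrete. The speech and noise levels below are illustrative assumptions, not measured figures:

```python
import math

# Free-field rule of thumb: sound pressure level falls by
# 20*log10(distance), i.e. ~6 dB per doubling of distance.
# Speech/noise levels below are illustrative assumptions.

def snr_db(speech_db_at_1ft: float, distance_ft: float, noise_db: float) -> float:
    """Approximate S/N ratio at the microphone, in dB."""
    speech_db = speech_db_at_1ft - 20 * math.log10(distance_ft)
    return speech_db - noise_db

# 20 feet away in a quiet room (normal speech ~65 dB at 1 ft, 35 dB noise):
print(round(snr_db(65, 20, 35), 1))
# Arm's length (2 ft) next to loud noise (55 dB):
print(round(snr_db(65, 2, 55), 1))
```

Under these assumptions both scenarios land at roughly the same low S/N ratio, which is why a recognizer that tolerates poor S/N handles distance and noise with the same machinery.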
Companies like Nuance, Vlingo, Google and Microsoft are pretty good at the second step, which is a more powerful (often cloud-based) recognition system.
The third step “Understanding Meaning” is what the original Siri was all about. This was an AI component developed under DARPA funding at SRI and later spun off and acquired by Apple. Apple is rumored to be using Nuance as the “Step 2” in Siri.
Vlingo does a really nice job of implementing Steps 1-3 (using Sensory as its partner for Step 1). I’m sure Google, Microsoft, Apple, and Nuance all have efforts underway in the area of AI and natural language understanding. It’s really not that different from what they have needed for text-based “meaning” recognition in traditional searches.
The SEARCH in Step 4 is done via typical search engines (Google, Microsoft, Apple) and I’d guess Vlingo and other independent players (are there any still around???) have developed partnerships in these areas.
Step 5 is basically a good-quality TTS engine. Providers like Nuance, Ivona, AT&T, NeoSpeech, and Acapela all have nice TTS engines, and I believe Apple, Microsoft, and Google all have in-house solutions as well!
The important point in comparing Sensory’s technology is that we provide the logical entryway to a successful Voice User Interface experience – a lightning-fast voice trigger that replaces tactile button presses. It is a given that noise immunity and extremely high accuracy are also required, and TrulyHandsfree accomplishes this without requiring a prohibitive amount of power to function reliably and consistently.
AND…while we appreciate the comparison to the most profitable company on the planet, we’d like to focus on what we do better…making Truly Hands-Free really mean Trulyhandsfree™.
There You Go Again! June 17th, 2011
That’s what America’s most charismatic President used to say! I didn’t necessarily agree with Reagan’s politics, but I sure did like his presentation. Nuance’s Paul Ricci is kind of the inverse of that; a lot of people don’t like him, but it’s hard to argue with his politics (although I will later in this blog…)
Nuance does seem to perform remarkably well. They have an amazing patent position, and are quite highly valued by almost any financial metric you can apply, including their market cap (over $6B and near an all-time high), their revenue multiplier (5-6 range), as well as P/E over 2000 (and although fairly meaningless, it does show they are finally profitable using GAAP rather than their modified accounting policies!!!!)
I’ve never met Ricci. I’ve known a lot of people who have worked for him, with him, and against him. Everybody agrees he’s a tough guy, and I think most would also use words like ruthless and smart. A lot of people might even call him an asshole, and whether true or not, I don’t think he cares about that. He’s a competitive strategy gameplay kind of guy, and he’s done pretty well. However, he has a HUGE challenge being up against the likes of Google, Microsoft, and eventually Apple (let alone the smart little guys like Vlingo, Yap, Loquendo, etc.). But I digress…
I started this blog thinking about Nuance’s recent acquisition of SVOX. And I wanted to congratulate Nuance and Ricci for ACQUIRING SVOX WITHOUT SUING THEM. If I look back a ways (and I can look back VERY FAR!), Nuance (or the company formerly known as Lernout & Hauspie, and then ScanSoft) has wrapped at least 4 embedded speech recognition companies into it over the years. In rough chronological order: Voice Control Systems (VCS was probably the FIRST embedded speech company and the first and only embedded group to go public), the Philips Embedded Speech Division (I think Philips had acquired VCS for around $50M), Advanced Recognition Technologies, and Voice Signal Technologies. I believe Ricci was at the helm during the Philips embedded acquisition (this was the one closer to 2000, as opposed to the Philips Medical deal a few years ago), ART, and VST. Interestingly, 2 of these 3 were lawsuit acquisitions. There are probably some inside stories about SVOX that I don’t know (e.g. threats of lawsuits??), but it appears that Nuance’s acquisitions of embedded companies are now down to 50% lawsuit driven. Thanks, Paul, you’re moving in the right direction!
OK, so what’s wrong with suing the companies you want to acquire? It probably does lower their price and reduce competitive bidding. Setting aside the legal and moral issues, there is one huge issue that’s clear: if you want to hold onto your star employees and technologists, you need to treat them well. Everyone understands who the “stars” are – they are the 10% of the workforce that contribute 90% of the innovation. They are not going to stick around unless they are treated right, and starting off a relationship by calling them thieves is not a good way to court a long-term relationship.
For example, there’s been a lot of press lately about the Vlingo/Nuance situation and how Ricci offered the top 3 employee/founders $5M each to sell Vlingo (plus a bundle of money for Vlingo!). Well, Mike Phillips used to be Nuance’s CTO (through the acquisition of SpeechWorks)…so wouldn’t it have been more valuable to KEEP Mike there than to BUY him back? The “other” Mike…Mike Cohen is Google’s head of speech. He FOUNDED Nuance (well, the company formerly known as Nuance!) and left to join Google, and of course this caused a lawsuit…think either of the Mikes (two of the smartest speech technologists in the industry) would ever go back to Nuance? Google has managed to hold onto Cohen, so it’s not just an issue of the best people leaving big companies because “little companies innovate.” I’ve also seen the recent rumor mill about Nuance’s Head of Smart Phone Architecture leaving for Apple…
By the way, you gotta treat customers nicely too! Strong arm tactics on customers and competitors might close short term deals, but I think there are better approaches in the long run.
So it’s the personnel and customer thing that Nuance is missing out on in their competitive gameplay strategy, and my hope is that SVOX’s acquisition represents a significant change in how Nuance does business!
As a point in contrast, Sensory has acquired only one company in our history – Fluent Speech Technologies (and no, we didn’t sue them first). This was a group that spun out of the former Oregon Graduate Institute back in the 1990s. We saw a demo of theirs back in 1997-1998 and thought the technology was great. They offered to sell us the speech recognition technology (not the company) so they could focus on animation opportunities, but we had NO INTEREST in that. We wanted the people that made the technology, not the technology itself. That’s how our Oregon office was born; we acquired the company with the people. The office is now about as big as our headquarters (and some of our people in Silicon Valley have even moved up there!). By the way, ALL the technologists that came with that acquisition are still with us after 12 years, and we’ve kept a very friendly relationship with the former OGI as well.
Time for a breather…Yeah, I do long blogs….if you see a short one, which might start appearing, it’s probably a “ghostwriter” helping me out….
So let’s look at Nuance’s acquisition of SVOX. Why did Nuance acquire them?
- SVOX was for sale. I don’t mean this tongue in cheek. I suspect SVOX proactively approached Nuance (and probably Google and others as well) to buy them. If you look at SVOX’s Board (many of whom are their investors), it’s a bunch of guys that ran retail empires and huge organizations, so they probably got tired (in the midst of the economic downturn of the last few years) of waiting.
- SVOX was affordable. I don’t mean cheap, and I don’t know yet what Nuance paid, but my guess is Nuance probably paid in the 4-7x sales range. SVOX, as a wildass guess, was doing in the $20-$30M/year range, so Nuance might have paid $80-$210M…quite affordable for Nuance. Since Nuance trades at around 5-6x sales, that’s not too bad from a revenue-multiplier perspective, and I’d guess SVOX has been profitable, so the deal should be accretive to Nuance. If the numbers come out and Nuance paid more than $200M (their prior embedded acquisition of VST was about $300M!), that means there was some serious bidding going on – probably with Google, Microsoft, or Apple (The Big Guys) in the mix, since they all could have used SVOX technology and patents.
- SVOX had Patents. SVOX acquired/merged with Siemens’ speech group a few years back, and with this merger came “60 patent families.” That’s a lot of patents, especially when you add on the patents that SVOX got before and after the merger with Siemens. This will continue to fuel Nuance’s tremendous patent position. My opinion is that it was quite a mistake for the Big Guys – especially Apple – to pass up this combination of talent, technology, and patents…they could have easily outbid Nuance!
- Customer acquisition. OK, this was probably Nuance’s primary motivation, and probably the reason that Nuance would outbid companies wanting SVOX for “in-house” solutions. SVOX had a lot of deals in automotive and mobile handsets! They were very strong in small-to-medium footprint (1-50MB) TTS, and were making fast inroads with their speech recognition. Nuance loves to buy customers. SVOX had customers.
- Keeping Apple and Google from Acquiring SVOX. It’s not often that Apple loses, but I think they lost on this one. SVOX would have been a really cheap way for Apple to make a big move into speech with an in-house technology. It’s going to be hard to grow it all internally, but what a nice bootstrap SVOX would have been in patents and technologies! Google is one of SVOX’s customers for TTS (Hey – Nuance was one of the founding members of the Open Handset Alliance that developed Android!), but with Google’s hiring and acquisitions in the speech space, the writing was on the wall for SVOX to go the way of Nuance and get designed out of Android in favor of Google’s internal solutions. By keeping SVOX away from Apple and Google, Nuance has the opportunity to keep two huge customers (i.e. Google, an SVOX TTS customer, and Apple) from jumping ship…but I still think it will happen eventually!
- Automotive Industry Contacts. I read the press release about advancing “the proliferation of voice in the automotive market”, and accelerating “the development of new voice capabilities that enable natural, conversational interactions” and about SVOX supplying the Client for Client/Server hybrid solutions. None of that market-speak makes my list. I think the technologies that SVOX had were pretty redundant to what Nuance has. SVOX had better customer relations and accounts in automotive…that was really the driver!
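The price-range guess in the “SVOX was affordable” bullet above is just two multiplications, spelled out here as a quick sanity check (all inputs are my own guesses from that bullet, not disclosed figures):

```python
# Sanity check of the guessed acquisition price range:
# a guessed revenue range times an assumed 4-7x revenue multiple.

sales_guess_m = (20, 30)  # SVOX revenue guess, $M per year
multiple = (4, 7)         # assumed revenue-multiple range

price_low_m = sales_guess_m[0] * multiple[0]
price_high_m = sales_guess_m[1] * multiple[1]
print(f"${price_low_m}M to ${price_high_m}M")
```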
Anyways…I suspect the acquisition was a good deal for Nuance and its investors, and probably a GREAT deal for SVOX and its investors. Nuance’s market price didn’t seem to move much, but maybe it will once the price is disclosed. I commend and encourage Nuance to cut the lawsuits…one of them could bite back a lot worse than the pain of losing employees!
I’ve been in the speech technology field since the beginning, and I have to say, there has never been a more exciting time for this space. Recently some of the biggest names in technology have announced the integration of voice capabilities into their products. At this year’s E3 conference, Microsoft stated that the next version of its Xbox Live will include voice commands. Also, it appears Apple will integrate speech-to-text input in iOS 5. Android 2.1 already has speech-to-text built into its mobile platform. And just this week, Google announced that voice search capability is coming to the Google.com search box (how cool?!)
All of these developments will be exposing more and more mainstream users to the benefits of the voice user interface on a daily basis. Consumers demand so much from personal devices and if they expect to control them via voice, they’ll want to do so from beginning to end (no button pressing, ever). This is where Sensory comes in. Our Truly Hands-Free technology is better than anything out there and lets manufacturers add a hands-free trigger to the interface so the user can give the device a call to action without ever lifting a finger. No need to take eyes off the road to make a call from a hands-free car kit, no need to dirty up your tablet or computer by using messy (cooking) hands to call up a recipe, no need to disturb your comfortable state of rest to set an alarm clock, etc.
I can say from where I sit, many manufacturers see the value of a voice user interface that includes a hands-free trigger phrase. Expect to see the makers of automotive products, smartphones, home entertainment products and more using Sensory’s technologies in the coming year. And be sure to stay tuned for exciting enhancements and innovations in store for our Truly Hands-Free technology, as well.
The Holy Grail in Speech is Almost Here! May 6th, 2011
For far too long, speech recognition just hasn’t worked well enough to be usable for everyday purposes. Even simple command and control by voice had been barely functional and unreliable…but times, they are a-changing! Today speech recognition works quite well and is widely used in computer and smartphone applications…and I believe we are rapidly converging on the Holy Grail of Speech – making a recognition and response system that can be virtually indistinguishable from a human (a really smart human with immaculate spelling skills and fluency in many languages!)
I think there are 4 important components to what I’d call the Holy Grail in Speech:
- No Buttons Necessary. OK, here I’m tooting my own horn, but Sensory has really done something amazing in this area. For the first time in history there is a technology that can be always-on and always-listening, that consistently works when you call out to it and VERY rarely false-fires in noise and conversation! This just didn’t exist before Sensory introduced TrulyHandsfree™ Voice Control, and it is a critical part of a human-like system. Users don’t want to have to learn how to use a device, open apps, and hold talk buttons! People just want to talk naturally, like we do to each other! This technology is HERE NOW and gaining traction VERY rapidly.
- Natural Language Interactions. This is a bit tricky, because it goes way beyond just speech recognition; there has to be “meaning recognition”. Today, many of the applications running on smart phones allow you to just say what you want. I use SIRI (Nuance), Google and Vlingo pretty regularly, and they are all very good. But what’s impressive to me isn’t just how good they are, it’s the rate at which they seem to be improving. Both the recognition accuracy and the understanding of intent seem to be gaining ground very rapidly.
I just did a fun test…I asked each engine (in my nice quiet office) “How many legs does an insect have?”…and all three interpreted my request perfectly. Google and Vlingo called up the right website with the question and answer…and SIRI came back with the answer – six! Pretty nice! My guess is the speech recognition is still a bit ahead of the “meaning recognition”…
Just tried another experiment. I asked “Where can I celebrate Cinco de Mayo?” SIRI was smart enough to know I wanted a location, but tried to send me off to Sacramento (sorry - too far away for a margarita!) Vlingo and Google both rely on Google search, and did a general search which didn’t seem to associate my location… (one of them mis-recognized, but not so badly that they didn’t spit out identical results!) Anyways, I’d say we are close in this category, but this is where the biggest challenge lies.
- Accurate Translation and Transcription. I suppose this is ultimately important in achieving the Holy Grail. I don’t do much of this myself, but it’s an important component to Item 2 above, and also necessary for dictating emails and text messages. When I last tested Nuance’s Dragon Dictate I was blown away by how well it performed. It’s probably the Nuance engine used in Apple’s Siri (you know, Nuance has a lot of engines to choose from!), and it’s really quite good. I think Nuance is a step ahead in this area.
- Human Sounding TTS. The TTS (text-to-speech) technology in use today is quite remarkable. There are really good sounding engines from AT&T, Nuance, Acapela, NeoSpeech, SVOX, Ivona, Loquendo and probably others! They are not quite “human”, but come very close. As more data gets thrown at unit selection (yes, size will not matter in the future!), they will essentially become intelligently spliced-together recordings that are indistinguishable from live performance.
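For readers unfamiliar with unit selection: the engine keeps many recorded candidates for each target sound and picks the sequence that minimizes a “target cost” (how well a unit fits the desired sound) plus a “join cost” (how smoothly adjacent units splice together), typically with dynamic programming. Here’s a back-of-the-envelope sketch with made-up costs; real systems work over large unit databases, not toy labels.

```python
def select_units(candidates, target_cost, join_cost):
    """candidates: one list of candidate unit ids per target position.
    Returns the min-cost unit sequence (Viterbi-style dynamic programming)."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            # cheapest predecessor, accounting for how well it joins to u
            prev, (cost, path) = min(
                ((p, best[p]) for p in best),
                key=lambda kv: kv[1][0] + join_cost(kv[0], u),
            )
            new_best[u] = (cost + join_cost(prev, u) + target_cost(t, u), path + [u])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])[1]

# Toy usage: two positions, two candidate recordings each, invented costs.
chosen = select_units(
    [["a1", "a2"], ["b1", "b2"]],
    lambda t, u: {"a1": 0.0, "a2": 1.0, "b1": 1.0, "b2": 0.0}[u],  # fit to target
    lambda prev, u: 0.0,  # pretend every splice is equally smooth
)
# -> ["a1", "b2"]
```

The “more data” point above falls out of this framing: with a bigger unit inventory, there’s almost always a candidate with near-zero target and join cost, so the output approaches a seamless recording.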
Anyways, reputable companies are starting to combine and market these kinds of functions today, and I’d guess it’s just a matter of five to ten years until you can have a conversation with a computer or smartphone that’s so good, it is difficult to tell whether it’s a live person or not!
Conversation with an Analyst April 21st, 2011
I had an interesting email conversation with a blog reader last month, and I thought I’d share some of the dialog. He is an equity analyst (who wishes to remain anonymous) who follows some companies in the speech industry. He emailed me saying:
“I came across your blog some time ago and have been reading it since with great interest. A topic of particular interest to me has been your periodic comments about how Apple has lagged the investments made by Google in speech recognition technology, opting instead to lean on Nuance. I was also struck by your observation that big companies, such as Google, have a history of licensing Nuance technologies before eventually taking those capabilities in-house.”
This makes me feel the need to clarify something…Nuance has great technologies, period. When companies feel the need to bring the technology “in-house”, it’s not driven by a failing of Nuance, but simply the fact that the USER EXPERIENCE IS SO CRITICAL to the success of consumer products. It’s difficult for big companies like Apple, Google, Microsoft, HP and others that depend heavily upon positive consumer experiences to farm out the technology for such a critical component.
The conversation turned to Apple, and the analyst asked the all-too-common question of whether Apple might acquire Nuance. Here’s, roughly, how the conversation went:
Analyst: What is your current view on Apple’s efforts in this space? As a company they seem to take great pride in controlling the user experience and that extends to how they think about key technologies (witness the Flash vs. HTML 5 spat, for example). It makes me wonder if Apple would be satisfied relying on Nuance for such a visible and important capability or whether they’d feel the need to also bring it in-house.
Todd: Apple can definitely afford Nuance. In fact, Apple probably makes enough profit in a good quarter to buy Nuance outright. Nevertheless, it would be a BIG price tag, and not in line with Apple’s traditional acquisition strategy. I wouldn’t rule it out, but I wouldn’t say they “need” Nuance either. They do need to do something, though, and they know it. Apple has been posting job requisitions this year in the area of speech recognition, so they definitely want to bring more of the technology in-house. My guess is they’ll do some M&A in the speech technology area as well. Google and Microsoft have combined aggressive hiring with M&A, so it seems likely that Apple will go beyond the SIRI acquisition (which added an AI layer on top of Nuance) and acquire more core speech technology expertise.
Analyst: I agree with you that Apple makes/has enough cash to acquire Nuance, but that it would be out of character for Apple to do so. Where I’m most interested is whether there are meaningful technical/architectural reasons why Apple must partner with Nuance for SR, or if the gap between Nuance and these smaller players is narrow enough that Apple would acquire or partner more closely with one of the small guys in order to maintain more control over the technology. Many people seem to think that an SR acquisition would have to be of Nuance, but I’ve been told that there are many quality SR start-ups. If you had to bet, do you think that Apple needs the 800-pound gorilla Nuance in order to do a good job in SR, or would one of these smaller companies give Apple a sufficient base upon which to build out a solution?
Todd: I’m confident Apple will eventually own it. I’d say the odds of them buying Nuance, though, are quite low (10-30% as a wild guess). There’s no technical reason why they can’t use another technology, but the 3 best reasons they’d acquire Nuance are:
- Language coverage
- Ease of integration. Apple’s in-house teams are quite familiar with the Nuance engines, as they have already implemented them in some products.
- Patents. Apple is engaged in a lot of patent fights, and Nuance has the best portfolio of speech patents in the world – that’s a really valuable asset that the Googles and Microsofts would probably fight over!
Of course, for the cost of Nuance, someone could probably buy all of the other TTS and SR tech companies in the world!
Analyst: Apple really has a phobia about adding third-party software to their products. No Mosaic core in their browser, no audio compression codecs from Dolby or DTS, no Flash from Adobe…. They acquired two microprocessor design companies to create a proprietary stack on ARM chips rather than using broadly available chipsets from Qualcomm or Broadcom. Now comes the question of what to do with SR technology….
Todd: It will be interesting to see how this all unfolds. I suspect a lot of other large companies will want to get into the game as well. It could be that the cloud-based solutions for TTS and SR become generic and replaceable enough that there isn’t a need to bring them “in-house”. Of course, Sensory is hoping and betting on the need for the Client/Server approaches, where an embedded solution (like our Truly Handsfree Triggers) nicely complement the cloud-based offerings.
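As a hypothetical sketch of that client/server split (class and function names here are illustrative, not any real API): a tiny embedded trigger runs on every audio frame at low power, and the heavier cloud recognizer is only invoked after the trigger fires.

```python
def run_assistant(mic, local_trigger, cloud_recognize):
    """Embedded trigger gates access to the cloud recognizer."""
    while True:
        frame = mic.read_frame()            # low power: only the tiny trigger runs
        if local_trigger(frame):            # embedded wake word fired
            audio = mic.record_utterance()  # now capture the full request
            return cloud_recognize(audio)   # heavy lifting happens server-side

# Minimal fake microphone so the sketch can be exercised end to end.
class FakeMic:
    def __init__(self, frames, utterance):
        self.frames = list(frames)
        self.utterance = utterance
    def read_frame(self):
        return self.frames.pop(0)
    def record_utterance(self):
        return self.utterance

mic = FakeMic(["noise", "hey device"], "what time is it")
result = run_assistant(mic, lambda f: f == "hey device", lambda a: a.upper())
# the "cloud" (here just .upper()) only ever sees audio after the trigger fires
```

The design point is that the always-on piece must be small enough to live on the device, while recognition of the full request can stay generic and cloud-based.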
Voice Search and Other Videos September 16th, 2010
Google seems to be putting a bit of promotion behind the Android Voice Search capabilities with a campaign called “What You Say is What You Search.” A few months back they announced that 25% of all Android-based search functions are done by voice, and now they are blogging and creating videos to promote this WONDERFUL capability. My favorite Google voice search video is the informative Mike LeBeau video that he did for Voice Actions. I like it because Mike is a real person who really works for Google and knows his stuff…more charisma than Justin Long (you know, Apple’s old Mac guy), and he’s not a paid actor.
Seems that a big part of the Google message is “IT WORKS!”…unfortunately there are a lot more videos out there demonstrating that speech recognition doesn’t work. Searching for “speech recognition” or “voice recognition” on YouTube by most-watched videos reveals that the most popular speech videos are the mistakes or “fails”, with some of these being real demos by Microsoft among others. Many are pretty humorous…
Here are my favorite funny speech recognition videos:
- Jimmy Kimmel’s Cousin Sal:
- The Voice-Activated Elevator: I get a special kick out of this, knowing that Sensory has been approached half a dozen or more times by elevator companies wanting to do this (and it’s always a highly confidential amazing idea that they think nobody else has ever thought of!)
- And of course there are the movie clips:
Sensory has produced a variety of low budget in-house videos, and although they are not very funny, they showcase our unique technologies. I’ll have my VP of Sales post a blog about these soon.