A New Voice in My House!   October 26th, 2011

I started Sensory back in 1994. Since then, Sensory has put speech technologies into many hundreds of different consumer products. I have taken home many of these products to test out on my family and see what everyone thinks.

A strange and wonderful thing happened last week…I heard our phone ringing and a voice spoke out saying “incoming call from Joe Smith” (no it really wasn’t Joe Smith…) Anyways, the really cool thing was I recognized the voice telling me who was calling. It was Sensory’s Micro Text to Speech engine.

Turns out my wife had gotten tired of the old cordless phones in our house and had gone out and bought a new ATT System. Unbeknownst to her, she had purchased the ATT products which used Sensory’s Micro-TTS technology to announce the Caller ID.

Text to speech tends to be one of those technologies that the more memory you throw at it, the better it sounds. That’s because the best sounding TTS engines use “snippets” of real human recordings, and the more memory allowed, the more “snippets” can be used, and the bigger and more precise they can be. I use the non-technical term “snippet” generically because different approaches use different sound units, ranging from diphones to even whole word or multi-word recordings.

For TTS to get really, really small, another approach needs to be used. Storing all those sounds will take MegaBytes of memory, and that added cost can have too big of a pricing effect on a low-cost consumer product. Sensory’s “micro-TTS” uses about 250K Bytes of total memory…that’s for the technology engine AND all the synthesized sound data. This is about 1000 times smaller than some of the high-end engines of today!

TTS has become an important area of investment for Sensory, and today there are many products on the market that use Sensory’s Micro-TTS, including products from ATT, VTech, Motorola, BlueAnt and others. Who knows…we may be already talking in your house too!

Todd
sensoryblog@sensoryinc.com

Two BIG acquisitions happened over the last week. One is big for the smartphone space, and the other is big for the speech industry. I think they both had something to do with technology patents.

Google acquired Motorola. As everyone knows, Google has been wrapped up in a lot of legal feuds over Android. Android is certainly doing well, its competitors want to knock it down, and patent infringement seems to be the preferred means of fighting. Long established companies like Microsoft, RIM and Apple have had a lot of time to build a patent portfolio…on top of that they recently outbid Google on the Nortel patent acquisition. SO… Google has to beef up its patent portfolio quickly to fight back and eventually do what big companies do – agree to cross license and stop paying the law firms! Or maybe Google just wants a boatload of patents so they can be comfortable indemnifying all the Android users.

So at the end of July, Google bought a boatload (well over 1000) of patents from IBM (Nuance bought a bunch of speech-focused patents from IBM as well!)

Now Google buys MOTO. Here’s something really interesting. The price paid for the Nortel portfolio was about $4.5B for 6000 patents (plus patents applied for etc). That’s about $750K/patent. Google underbid and didn’t get in on the deal. Google bought MOTO Mobility for $12.5B for a little over 17,000 patents… Just under $750K/patent! VERY INTERESTING…seems like $750K/patent is the going rate for large patent portfolios!!!!!
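For anyone who wants to check the back-of-the-envelope math, here is a minimal sketch in Python. The dollar amounts and patent counts are the round numbers quoted above ("a little over 17,000" is simply taken as 17,000), and the function and variable names are made up for illustration.

```python
# Round figures quoted in the post; "a little over 17,000" is taken as 17,000.
deals = {
    "Nortel portfolio": (4.5e9, 6_000),
    "Motorola Mobility": (12.5e9, 17_000),
}


def price_per_patent(price_usd: float, patent_count: int) -> float:
    """Average price paid per patent, ignoring everything else in the deal."""
    return price_usd / patent_count


per_patent = {name: price_per_patent(*deal) for name, deal in deals.items()}

for name, value in per_patent.items():
    print(f"{name}: about ${value / 1e3:,.0f}K per patent")
```

This treats each purchase as if patents were the only thing being bought, which is the same simplification the paragraph above makes.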

Specialized portfolios in speech technology are worth even more!

Nuance acquires Loquendo. I’m sure this wasn’t just for patents…it was taking out one of their only competitors for both SR and TTS, and Nuance got a GREAT price for a company with a lot of excellent technology. I have no idea how many patents Loquendo has…I think 7 in the US and probably a lot more in Europe. Let’s estimate that they had 35 patents total. At $75M, that would be around $2M per patent, which isn’t far off of the per-patent price Nuance paid for SVOX, who had 60-80 patents. The revenue multipliers seem pretty consistent too…SVOX was doing around $25M in sales and was bought for around 6x sales…likewise Loquendo was doing about $12.5M in sales and was bought for ABOUT SIX TIMES SALES. What does Nuance trade at? ABOUT SIX TIMES SALES. So what does that mean? Well you could argue that if Nuance pays less than or equal to its own revenue multiplier (6x sales) for an acquisition, then the patents essentially come free because the acquired revenues should immediately boost Nuance’s valuation by close to the purchase price.
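The same arithmetic can be sketched in a few lines of Python. Everything here comes from the rough numbers above: the 35-patent count for Loquendo is my guess, the SVOX price is inferred from "around 6x sales" rather than a reported figure, and the variable names are just illustrative.

```python
SALES_MULTIPLE = 6  # "about six times sales", for both the deals and Nuance itself

# Loquendo: price quoted in the post, patent count is only an estimate.
loquendo_price = 75e6
loquendo_sales = 12.5e6
loquendo_patents_estimate = 35
loquendo_per_patent = loquendo_price / loquendo_patents_estimate
loquendo_multiple = loquendo_price / loquendo_sales

# SVOX: price inferred from sales times the multiple (an assumption, not a reported number).
svox_price = 25e6 * SALES_MULTIPLE
svox_per_patent_range = (svox_price / 80, svox_price / 60)  # for 80 and 60 patents

# "Patents come free" argument: acquired sales get valued at the acquirer's multiple.
valuation_boost = loquendo_sales * SALES_MULTIPLE
net_cost_of_patents = loquendo_price - valuation_boost

print(f"Loquendo: {loquendo_multiple:.0f}x sales, ${loquendo_per_patent / 1e6:.2f}M per patent")
print(f"SVOX: ${svox_per_patent_range[0] / 1e6:.2f}M to ${svox_per_patent_range[1] / 1e6:.2f}M per patent")
print(f"Net cost after valuation boost: ${net_cost_of_patents:,.0f}")
```

With these inputs the net cost works out to zero, which is the "patents come free" argument in numbers. It only holds if the market really does value the acquired revenue at the acquirer's multiple.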

I wonder if that’s how Nuance thinks about it. Then they wouldn’t be paying $2M for a patent or even $750K…they’d essentially get them for free and in the process build the biggest database of speech patents in the world.

Maybe Nuance’s strategy isn’t really about taking out competitors and buying customers through M&A, but maybe they want to own the majority of patents in the speech tech space. Nuance certainly hasn’t made money in using patents for lawsuits. Dave Grannan, Vlingo’s CEO, was recently quoted as saying, “We are happy to report that with this latest ruling, Nuance’s record remains perfect in patent infringement trials, they haven’t won any.” You go, Dave!

So why would Nuance want so many speech patents if they can’t make money in court? Well I’ve blogged earlier about their use of patent infringement in acquisitions. Maybe they are looking to be bought by a Google, Apple, or Microsoft…that patent portfolio could certainly do a lot in user experience fights. But if cross licensing agreements get worked out between the companies big enough to acquire Nuance, then where does that leave Nuance?

Well…without a lot of competition for sure!

Todd
sensoryblog@sensoryinc.com

For far too long, speech recognition just hasn’t worked well enough to be usable for everyday purposes. Even simple command and control by voice has been barely functional and unreliable…but times, they are a changing! Today speech recognition works quite well and is widely used in computer and smartphone applications…and I believe we are rapidly converging on the Holy Grail of Speech - making a recognition and response system that can be virtually indistinguishable from a human (a really smart human with immaculate spelling skills and fluency in many languages!)

I think there are 4 important components to what I’d call the Holy Grail in Speech:

  1. No Buttons Necessary. OK here I’m tooting my own horn, but Sensory has really done something amazing in this area. For the first time in history there is a technology that can be always-on and always-listening, and it consistently works when you call out to it and VERY rarely false-fires in noise and conversation! This just didn’t exist before Sensory introduced the Truly Handsfree™ Voice Control, and it is a critical part of a human-like system. Users don’t want to have to learn how to use a device, open apps, and hold talk buttons to use it! People just want to talk naturally, like we do to each other! This technology is HERE NOW and gaining traction VERY rapidly.
  2. Natural Language Interactions. This is a bit tricky, because it goes way beyond just speech recognition; there has to be “meaning recognition”. Today, many of the applications running on smart phones allow you to just say what you want. I use SIRI (Nuance), Google and Vlingo pretty regularly, and they are all very good. But what’s impressive to me isn’t just how good they are, it’s the rate at which they seem to be improving. Both the recognition accuracy and the understanding of intent seem to be gaining ground very rapidly.
    I just did a fun test…I asked each engine (in my nice quiet office) “How many legs does an insect have?”…and all three interpreted my request perfectly. Google and Vlingo called up the right website with the question and answer…and SIRI came back with the answer – six! Pretty nice! My guess is the speech recognition is still a bit ahead of the “meaning recognition”…
    Just tried another experiment. I asked “Where can I celebrate Cinco de Mayo?” SIRI was smart enough to know I wanted a location, but tried to send me off to Sacramento (sorry - too far away for a margarita!) Vlingo and Google both rely on Google search, and did a general search which didn’t seem to associate my location… (one of them mis-recognized, but not so badly that they didn’t spit out identical results!) Anyways, I’d say we are close in this category, but this is where the biggest challenge lies.
  3. Accurate Translation and Transcription. I suppose this is ultimately important in achieving the Holy Grail. I don’t do much of this myself, but it’s an important component to Item 2 above, and also necessary for dictating emails and text messages. When I last tested Nuance’s Dragon Dictate I was blown away by how well it performed. It’s probably the Nuance engine used in Apple’s Siri (you know, Nuance has a lot of engines to choose from!), and it’s really quite good. I think Nuance is a step ahead in this area.
  4. Human Sounding TTS. The TTS (text-to-speech) technology in use today is quite remarkable. There are really good sounding engines from ATT, Nuance, Acapela, Neospeech, SVOX, Ivona, Loquendo and probably others! They are not quite “human”, but come very close. As more data gets thrown at unit selection (yes, size will not matter in the future!), they will essentially become intelligently spliced-together recordings that are indistinguishable from live performance.

Anyways, reputable companies are starting to combine and market these kinds of functions today, and I’d guess it’s just a matter of five to ten years until you can have a conversation with a computer or smartphone that’s so good, it is difficult to tell whether it’s a live person or not!

Todd
sensoryblog@sensoryinc.com

Haven’t blogged in a long time…I have plenty to say but have just been too busy. That’s good news. Sensory is signing up new deals at a very rapid rate, so 2011 should be an excellent year for us. I declare the economic recovery in full swing (although I do have some trepidation it could be short lived). Right now my biggest issue is chip SUPPLY! We’ve actually had some trouble getting enough chips (this is endemic to the entire chip market right now!). Luckily, our software business is exploding and a growing percentage of overall revenues is not dependent on buying silicon!

The cool thing is for the first time in Sensory’s 15 year history we are putting text-to-speech into products. We’ve done a handful of deals in just the last couple of months, and I expect that within 2 years we’ll have over 10 million TTS devices that will have hit the market (we’re at around 60 million speech recognition products right now).

I went to Voice Search last week. This is the show that Bill Meisel and AVIOS co-host every year. It’s my favorite speech industry show and pretty much the only one I attend. At the show I spoke on a consumer speech panel and demonstrated Sensory’s Truly Hands-Free Voice Trigger. Nobody thinks that wordspotting can be always on and always listening without false firing - and still catch the trigger word when it’s spoken. Sensory’s spotting technology WORKS! It’s my pet technology right now and I think it will change the world, by making speech recognition TRULY HANDSFREE (that was the title of my presentation)…anyways…I demoed it live. Nobody is supposed to do live speech recognition demos because they always fail (Microsoft has had the misfortune of proving that more than once!), so most people at the conferences show video clips. I know Sensory’s stuff works well, but I got a little nervous when I started talking and I could hear the echo of the microphone, and as I spoke I was hoping it wouldn’t false trigger and totally embarrass me. It didn’t false trigger…then it just had to recognize my trigger words. It got the first and the second one right. Then on the 3rd time the small device started sliding down the podium and the mic got covered up and for a brief moment my heart froze and I thought I was going to need to repeat my trigger word…then all of a sudden I felt my heart exploding as I waited microseconds for the response…then it spoke and it got it! No false fires and 3/3 triggers accurately recognized. Oh the trials and tribulations of a speech industry veteran! The technology is great and in a car it’s nearly flawless; it was this new acoustical environment that made me nervous. It came through!!!

So…Apple acquired SIRI, Inc., an iPhone developer that supplies a personal assistant application featuring speech recognition. Cool. That means Apple is in the game - the speech game, with apparently a slightly different twist than Microsoft or Google. All 3 companies are investing in speech recognition. But Apple is doing very light investing while Google and Microsoft are HEAVILY invested. Apple apparently isn’t using any of its home grown technologies as they keep licensing Nuance…and SIRI uses a Nuance engine as well. SIRI is a voice concierge type service that uses the Nuance recognizer then throws a layer of “meaning” interpretation or “intelligence” into the process. Anyways, I’m glad Apple is taking voice control seriously…they’re gonna have a tough time catching up with Google. My take is the Google stuff works best right now. I was playing with a Nexus One phone and the recognition on it is really amazing. BING is pretty good too and has wrapped better apps around their technology in BING411.

I remember a Keynote talk 15 years ago at a speech conference titled something like “the Ever-Imminent-Never-Arriving Speech Bonanza”…well it’s finally here, and I have to thank Google and Microsoft (and Vlingo too!) for clearly taking us over the hurdle and making speech recognition accessible and usable by the masses. Now it’s time for Apple to kick in and do its part…and now that HP has acquired Palm, it will need to get in the game too. I don’t even know if HP has a speech recognition team, but if they don’t they will soon. So will Cisco. So will all the major consumer electronics and automotive companies! Our time has come!!! Speech Recognition has arrived and is working for the masses! It will just get better!

Todd
sensoryblog@sensoryinc.com

My last blog was about TTS. When things aren’t pronounced right from a TTS engine, the linguists can go in and add “exceptions” so the standard pronunciation rules don’t have to apply in specific cases.

Of course the easy way to get TTS to pronounce things right is to spell everything phonetically; then no exceptions or special rules are ever necessary.
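To make the “exceptions” idea concrete, here is a minimal sketch in Python. It is not how any particular engine (Sensory’s or anyone else’s) does it: the exception entry, the phoneme symbols and the one-symbol-per-letter “rules” are all toy assumptions, just to show the lookup order.

```python
# Hypothetical exception list a linguist might maintain (ARPAbet-style symbols).
EXCEPTIONS = {
    "colonel": "K ER N AH L",
}


def rule_based_pronunciation(word: str) -> str:
    """Toy stand-in for real letter-to-sound rules: one symbol per letter."""
    return " ".join(word.upper())


def pronounce(word: str, exceptions: dict = EXCEPTIONS) -> str:
    """Check the exception list first; fall back to the general rules."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return rule_based_pronunciation(key)


print(pronounce("Colonel"))  # taken from the exception list
print(pronounce("cat"))      # handled by the toy rules
```

The point of the design is that fixing one mispronounced word means adding one dictionary entry, with no change to the general rules.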

I was watching a documentary last night on Led Zeppelin. I got a kick out of one of the early friends of the band saying they spelled “Lead” like “Led” so that Americans would pronounce it right. The ironic thing about this intentional misspelling is that it led to other bands paying homage by misspelling in Led Zep’s footsteps. Def Leppard didn’t need to change their spelling to get their name pronounced right. I suppose Motley Crue, with all their umlauts, actually mispronounce their misspelled name. Even some of heavy metal’s big name singers (Axl Rose??) might have had more “normal” names if this strange tradition never started with Led Zeppelin…

…and on a wildly different note, my heavy metal video documentary recommendation goes not to Zep but to “Anvil! The Story of Anvil”. Another great music-related video for you to consider: “Les Triplettes de Belleville”…and if you want a music-related video that’s closer to home then see the documentary “Standing in the Shadows of Motown – The Funk Brothers”.

Todd
sensoryblog@sensoryinc.com

The new Android OS doesn’t have this problem! I read about one of these devices with TTS (Text-To-Speech) built in and voice commands too, so of course I had to try one out. I put it into TTS mode where it speaks everything, hit the recognition button and it prompted “SPEAK NOW.” I said something like “Starbucks in Sunnyvale, California”…and guess what it recognized??? “SPEAK NOW.” I guess the recognizer started listening too early and heard the TTS itself saying “SPEAK NOW.”

Listening at the right time is always a challenge for speech recognizers, but in Speech Recognition 101, programmers learn to make the recognizer listen AFTER the prompt is spoken. In Speech Recognition 201, students are taught to trim the silence after the end of the speech prompt. Otherwise, those who studied only Speech Reco 101 will have it start listening too late: there’s usually a silent tail on the prompt that users don’t hear, so if it isn’t trimmed they start speaking before the recognizer is listening, and the first few hundred milliseconds of their speech get clipped off.
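Here is a rough sketch of the 101 versus 201 difference in Python. The prompt audio, the 8 kHz sample rate, the silence threshold and the function names are all assumptions for illustration; a real system would work on actual audio buffers and a real recognizer API.

```python
def audible_duration_ms(samples, sample_rate_hz, silence_threshold=0.01):
    """Length of the prompt in ms once the trailing silence is trimmed off."""
    last_audible = -1
    for index, amplitude in enumerate(samples):
        if abs(amplitude) > silence_threshold:
            last_audible = index
    return (last_audible + 1) * 1000.0 / sample_rate_hz


def listen_start_ms(samples, sample_rate_hz, trim_tail=True):
    """When the recognizer should open its microphone, measured from prompt start."""
    if trim_tail:
        # Speech Reco 201: start as soon as the audible part of the prompt ends.
        return audible_duration_ms(samples, sample_rate_hz)
    # Speech Reco 101: wait for the whole prompt file, silent tail included.
    return len(samples) * 1000.0 / sample_rate_hz


# Hypothetical prompt: 500 ms of audible speech followed by a 300 ms silent tail.
SAMPLE_RATE_HZ = 8000
prompt = [0.5] * 4000 + [0.0] * 2400

untrimmed_start = listen_start_ms(prompt, SAMPLE_RATE_HZ, trim_tail=False)
trimmed_start = listen_start_ms(prompt, SAMPLE_RATE_HZ, trim_tail=True)
clipped_ms = untrimmed_start - trimmed_start

print(f"Untrimmed: listen at {untrimmed_start:.0f} ms; trimmed: listen at {trimmed_start:.0f} ms")
print(f"Up to {clipped_ms:.0f} ms of the user's speech could be clipped without trimming")
```

In this made-up case the untrimmed recognizer opens 300 ms after the user thinks the prompt has finished, which is about the “few hundred milliseconds” of clipped speech described above.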

That same TTS in the Android was a Verizon product. Guess how it pronounces Verizon? Well, not the way I’ve ever heard it pronounced. TTS isn’t easy, but this should be an easy fix. Someone at Google or Verizon will figure it out soon, and Nuance will probably get a call.

I heard a great NPR report the other day about the Amazon Kindle. The product is being boycotted by groups as diverse as Syracuse University, the National Federation of the Blind, and the Burton Blatt Institute for Disability Studies. The complaint is that while the Kindle offers Text-To-Speech as an option, it only reads from the books, and does not provide a friendly user interface for the visually impaired. In fact, one spokesperson said that the Text-To-Speech function is just about impossible for a blind person to use. Basically, Amazon needed to offer a mode where the TTS reads any button that was pressed, which shouldn’t have added any real cost to the bottom line. Better yet, they could have added a little speech recognition so the buttons weren’t even necessary!

Todd
sensoryblog@sensoryinc.com