Improving Signal to Noise Ratio   July 14th, 2010

Dealing with a poor signal-to-noise ratio is one of the toughest issues in automated speech recognition. At Sensory, we have developed lots of techniques so our customers’ products can sit at one end of a noisy room and still recognize a speaker at the other end. Our technologists typically don’t like to implement active noise cancellation because they believe its signal processing will strip useful information out of the speech data. Nevertheless, we have a whole host of other techniques to make recognition in noise work really well.

In Bluetooth® headsets we use a dual mic beamforming technology, and we’ve found that this approach improves our ability to recognize by about 7 or 8 dB. In the Bluetooth® space there are lots of noise cancellation providers, and there are many well proven techniques for removing noise.

What I’ve been wondering for the last few months is why those vuvuzelas are so dang loud during the World Cup broadcasts. It seems like a relatively easy task to just filter them out, or to put the broadcasters’ microphones in a silent booth.

I guess I’m not the only one who wondered about this: if you Google “vuvuzela”, “filter” is one of the most common words following it, and clicking on it showed over 1.3 million listings, from hackers’ guides to products for sale.

Todd
sensoryblog@sensoryinc.com

Haven’t blogged in a long time…I have plenty to say but have just been too busy. That’s good news. Sensory is signing up new deals at a very rapid rate, so 2011 should be an excellent year for us. I declare the economic recovery in full swing (although I do have some trepidation it could be short lived). Right now my biggest issue is chip SUPPLY! We’ve actually had some trouble getting enough chips (this is endemic to the entire chip market right now!). Luckily, our software business is exploding and a growing percentage of overall revenues is not dependent on buying silicon!

The cool thing is for the first time in Sensory’s 15 year history we are putting text-to-speech into products. We’ve done a handful of deals in just the last couple of months, and I expect that within 2 years we’ll have over 10 million TTS devices that will have hit the market (we’re at around 60 million speech recognition products right now).

I went to Voice Search last week. This is the show that Bill Meisel and AVIOS co-host every year. It’s my favorite speech industry show and pretty much the only one I attend. At the show I spoke on a consumer speech panel and demonstrated Sensory’s Truly Hands-Free Voice Trigger. Nobody thinks that wordspotting can be always on and always listening without false firing - and still catch the trigger word when it’s spoken. Sensory’s spotting technology WORKS! It’s my pet technology right now and I think it will change the world, by making speech recognition TRULY HANDSFREE (that was the title of my presentation)…anyways…I demoed it live. Nobody is supposed to do live speech recognition demos because they always fail (Microsoft has had the misfortune of proving that more than once!), so most people at the conferences show video clips. I know Sensory’s stuff works well, but I got a little nervous when I started talking and I could hear the echo of the microphone, and as I spoke I was hoping it wouldn’t false trigger and totally embarrass me. It didn’t false trigger…then it just had to recognize my trigger words. It got the first and the second one right. Then on the 3rd time the small device started sliding down the podium and the mic got covered up and for a brief moment my heart froze and I thought I was going to need to repeat my trigger word…then all of a sudden I felt my heart exploding as I waited microseconds for the response…then it spoke and it got it! No false fires and 3/3 triggers accurately recognized. Oh the trials and tribulations of a speech industry veteran! The technology is great and in a car it’s nearly flawless; it was this new acoustical environment that made me nervous. It came through!!!

So…Apple acquired SIRI, Inc., an iPhone developer that supplies a personal assistant application featuring speech recognition. Cool. That means Apple is in the game - the speech game, with apparently a slightly different twist than Microsoft or Google. All 3 companies are investing in speech recognition. But Apple is doing very light investing while Google and Microsoft are HEAVILY invested. Apple apparently isn’t using any of its home grown technologies as they keep licensing Nuance…and SIRI uses a Nuance engine as well. SIRI is a voice concierge type service that uses the Nuance recognizer then throws a layer of “meaning” interpretation or “intelligence” into the process. Anyways, I’m glad Apple is taking voice control seriously…they’re gonna have a tough time catching up with Google. My take is the Google stuff works best right now. I was playing with a Nexus One phone and the recognition on it is really amazing. BING is pretty good too and has wrapped better apps around their technology in BING411.

I remember a Keynote talk 15 years ago at a speech conference titled something like “the Ever-Imminent-Never-Arriving Speech Bonanza”…well it’s finally here, and I have to thank Google and Microsoft (and Vlingo too!) for clearly taking us over the hurdle and making speech recognition accessible and usable by the masses. Now it’s time for Apple to kick in and do its part…and now that HP has acquired Palm, it will need to get in the game too. I don’t even know if HP has a speech recognition team, but if they don’t they will soon. So will Cisco. So will all the major consumer electronics and automotive companies! Our time has come!!! Speech Recognition has arrived and is working for the masses! It will just get better!

Todd
sensoryblog@sensoryinc.com

I was in Barcelona last month at the Mobile World Congress. Here are some of my speech-centric observations:

I went by the Microsoft booth on the first day of the show and asked when WinMobile7 would be announced. The guy on the floor acted like he had no clue what I was talking about. He wouldn’t even confirm it hadn’t been announced yet. The really ironic thing is that EVERYWHERE I went I saw Windows 7 advertisements…subways, stairs, hotel lobbies, etc. My friend Dan had a couple of corporate suites at the hotel across from the show, and asked about putting up a flier to say what floor they were on. He found out the entire hotel advertising space was taken by Microsoft! They had gotten an exclusive from the hotel.

Speaking of Dan…we’re old friends from school and decided to meet up for dinner. He said “Are you OK with a Tapas Bar?” and I said “Actually, I’m kinda hungry, if you really want to go, let’s do it after we eat.” I had made a speech recognition error…think about it.

Anyways…WinMobile 7 was announced on Day 2, and I saw some of the demos. I must say that Microsoft is taking a brave approach by completely redesigning the interface to be more focused on data (people, places) than on functions (applications, etc.) However, even with the new look and feel I didn’t hear any mention of any new speech recognition features, like um, a voice interface. I asked a guy on the floor, and he said the voice search was much improved. I like BING search, Google search and Vlingo search too as they are all getting more useful and robust. A couple of years ago, I was trying one of these search engines to find my hotel in downtown Boston, and after 3 or 4 failed attempts on a street corner, a woman pointed down the street and said “Your hotel is just down there”. A memory flashback…a cabbie on that trip asked me what I did and I said “speech recognition.” He said “oh I’ve been trying that for years…my wife talks to me and sometimes I respond properly.” But I digress…

Back to Barcelona. I saw a nice demo of MOTONAV at the Motorola booth. With a new independent consumer-product company spun out and Sanjay Jha in charge, they really seem to have turned things around. The people on the show floor seemed very upbeat and excited about where Motorola is right now. In addition to the 23 phones they currently offer, they have new ones coming out, including the new Devour and Cliq XT, both of which are based on the Android OS. I didn’t see much new stuff in the Bluetooth space, however. They are doing PNDs (portable navigation devices) and cell phones with MOTONAV. It’s a nice voice-controlled driving application, and the speech recognition in the demo I saw worked quite well on the hard stuff (addresses, etc.), but messed up on the easy things (it was a simple 2 word set that it got wrong). Then again, small sets aren’t always easier than big ones. The Yes/No response is one of the hardest sets to get right. (I heard that there are more than 50 ways to say No and almost as many ways to say Yes…like unh-unh and unh-huh…I can’t even get that right spelling it!)

The big thing missing from MOTONAV is a Truly Hands-Free Trigger. In fact, that’s what is missing from the entire cell phone industry. All these products have built-in speech recognition, but the only way to activate it is with button presses. Here’s an article I found about “The First Truly Hands-Free Phone.” HOWEVER, when you read through it you find it really requires 2 button presses…one to turn it on and a second to activate the voice recognition. Well, Sensory can get rid of one of those button presses, which is a HUGE savings for products that can be turned on and are always listening. As battery technology improves and more “smart” listening windows are deployed, Truly Hands-Free triggers will become increasingly important for all products with speech technologies.

Todd
sensoryblog@sensoryinc.com

Yeah, everyone’s writing about the new Google phone. I’ve heard various reports about it being underwhelming, and in need of the marketing hype that Apple is so good at. Everybody loves to compare the iPhone with the Nexus One and talk about screen size, weight, camera capabilities, software, etc.

Here’s my 2 cents on speech recognition and Bluetooth for these devices:

Apple’s initial iPhone release had speech recognition phobia, with no factory options for voice commands. It was such a shocking omission that many of the mainstream reviewers pointed it out. In various industry conversations I heard “Steve doesn’t like speech recognition”. As a result, 50 speech recognition applications quickly appeared in the App Store, and by necessity Apple soon implemented Voice Control for music and voice dialing. I assume Apple implemented Nuance technology, most likely in a local version that runs on the iPhone.

What Google’s done with the Nexus is WAY different. They are embracing speech recognition from the start, and not just implementing “me too” features. Google is pushing the boundaries by including speech recognition for dictation (text messaging, email, social networking, etc.) and mapping/GPS type functions. I remember the original Android announcements mentioned that Nuance was their speech partner, but it seems like all the big guys like to start with Nuance then switch away. My guess is that the Nexus One uses homegrown (Mike Cohen and Co.) speech recognition, and since it is server based, it should adapt and keep getting better with the data they are collecting. I give kudos to Google for this!

On the Bluetooth side of things, we were shocked and hurt that we couldn’t use our BlueGenie Voice Interface Bluetooth headsets to easily call up recognizers on the iPhone for name dialing. Although Bluetooth defines a clear protocol for this, it wasn’t implemented on the initial iPhone. New iPhone versions do support this, but Apple never clearly thought through the importance of a cohesive user interface and functionality with Bluetooth connected to its phones, especially when speech recognition is involved.

If Google is smart, they won’t only introduce a Nexus One phone, but they’ll come out with a really cool Nexus One headset that TAKES ADVANTAGE of all the great speech recognition software on the handset, with one seamless voice user interface! The Nexus One has been blasted as nothing really new, but this type of integration with a hands-free headset or car kit could make it TOTALLY REVOLUTIONARY.

Hey Google – make a BLUEGENIE VOICE INTERFACE HEADSET!

Todd
sensoryblog@sensoryinc.com

My last blog was about TTS. When things aren’t pronounced right from a TTS engine, the linguists can go in and add “exceptions” so the standard pronunciation rules don’t have to apply in specific cases.

Of course, the easy way to get TTS to pronounce things right is to spell everything phonetically; then no exceptions or special rules are ever necessary.

I was watching a documentary last night on Led Zeppelin. I got a kick out of one of the early friends of the band saying they spelled “Lead” like “Led” so that Americans would pronounce it right. The ironic thing about this intentional misspelling is that it led to other bands paying homage by misspelling in Led Zep’s footsteps. Def Leppard didn’t need to change their spelling to get their name pronounced right. I suppose Motley Crue, with all their umlauts, actually mispronounce their misspelled name. Even some of heavy metal’s big name singers (Axl Rose??) might have had more “normal” names if this strange tradition had never started with Led Zeppelin…

…and on a wildly different note, my heavy metal video documentary recommendation goes not to Zep but to “Anvil, The story of Anvil”. Another great music related video for you to consider: “Les Triplettes de Belleville”…and if you want a music related video that’s closer to home then see the documentary “Standing in the Shadows of Motown – The Funk Brothers”.

Todd
sensoryblog@sensoryinc.com

The new Android OS doesn’t have this problem! I read about one of these devices with TTS (Text-To-Speech) built in and voice commands too, so of course I had to try one out. I put it into TTS mode where it speaks everything, hit the recognition button and it prompted “SPEAK NOW.” I said something like “Starbucks in Sunnyvale, California”…and guess what it recognized??? “SPEAK NOW.” I guess the recognizer started listening too early and heard the TTS itself saying “SPEAK NOW.”

Listening at the right time is always a challenge for speech recognizers, but in Speech Recognition 101, programmers learn to make the recognizer listen AFTER the prompt is spoken. In Speech Recognition 201, students are taught to trim the silence after the end of the speech prompt. Otherwise, those who studied only Speech Reco 101 will have the recognizer start listening too late: there’s usually a silent tail on the prompt that users don’t hear, so they start speaking before the recognizer is listening, and the first few hundred milliseconds of their speech get clipped off.
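Here’s a rough sketch of that Speech Reco 201 lesson in Python. It is only an illustration: the function names, the amplitude threshold, and the toy prompt are all my own assumptions, not code from any real recognizer or from the Android device. It measures the silent tail on a prompt and works out when the mic should open.

```python
def trailing_silence_samples(samples, threshold=0.01):
    """Count samples at the end of the prompt whose magnitude is below threshold."""
    count = 0
    for value in reversed(samples):
        if abs(value) >= threshold:
            break
        count += 1
    return count


def trim_prompt(samples, threshold=0.01):
    """Return the prompt with its inaudible tail removed."""
    tail = trailing_silence_samples(samples, threshold)
    return samples[:len(samples) - tail]


def listen_start_ms(samples, sample_rate, threshold=0.01):
    """Time (ms from prompt start) at which the recognizer should open the mic."""
    audible = len(trim_prompt(samples, threshold))
    return 1000.0 * audible / sample_rate


# Toy prompt (hypothetical): 0.5 s of "speech" then 0.3 s of silence at 8 kHz.
SAMPLE_RATE = 8000
prompt = [0.5] * 4000 + [0.0] * 2400

untrimmed_end_ms = 1000.0 * len(prompt) / SAMPLE_RATE  # where a 101 student starts listening
start_ms = listen_start_ms(prompt, SAMPLE_RATE)        # where a 201 student starts listening
clipped_ms = untrimmed_end_ms - start_ms               # user speech lost without trimming
```

With this toy prompt, a user who starts talking the moment the audible prompt ends loses 300 ms of speech if the tail isn’t trimmed. A real system would use a smarter silence detector than a fixed amplitude threshold.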

That same TTS in the Android was a Verizon product. Guess how it pronounces Verizon? Well, not the way I’ve ever heard it pronounced. TTS isn’t easy, but this should be an easy fix. Someone at Google or Verizon will figure it out soon, and Nuance will probably get a call.

I heard a great NPR report the other day about the Amazon Kindle. The product is being boycotted by groups as diverse as Syracuse University, the National Federation of the Blind, and the Burton Blatt Institute for Disability Studies. The complaint is that while the Kindle offers Text-To-Speech as an option, it only reads from the books, and does not provide a friendly user interface for the visually impaired. In fact, one spokesperson said that the Text-To-Speech function is just about impossible for a blind person to use. Basically, Amazon needed to offer a mode where the TTS reads out any button that is pressed, which shouldn’t have added any real cost to the bottom line. Better yet, they could have added a little speech recognition so the buttons weren’t even necessary!

Todd
sensoryblog@sensoryinc.com

We have had a lot of requests over the years for products that are always on and listening for a key “trigger” word. The challenge of this approach is making a “trigger” that doesn’t accidentally trigger when it is not spoken, but also doesn’t accidentally NOT trigger when it IS spoken. The trade-off between these two types of errors is not so simple, since improving one usually makes the other worse, and background noise, especially talking, typically makes voice interfaces perform poorly. And this doesn’t even take into account the constant energy drain from devices that are always on and listening.
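To make that trade-off concrete, here’s a small Python sketch. The scores are made up and the single-threshold detector is my own simplification, not how Sensory’s trigger actually works. It counts both kinds of errors as the decision threshold moves.

```python
def error_rates(trigger_scores, other_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a given threshold.

    trigger_scores: detector scores for audio that DID contain the trigger word.
    other_scores:   detector scores for audio that did NOT contain it.
    The (hypothetical) detector fires when score >= threshold.
    """
    false_rejects = sum(1 for s in trigger_scores if s < threshold)
    false_accepts = sum(1 for s in other_scores if s >= threshold)
    return (false_rejects / len(trigger_scores),
            false_accepts / len(other_scores))


# Made-up scores, for illustration only.
trigger_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
other_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

for threshold in (0.35, 0.55, 0.72):
    fr, fa = error_rates(trigger_scores, other_scores, threshold)
    print(f"threshold={threshold:.2f}  false rejects={fr:.0%}  false accepts={fa:.0%}")
```

On this toy data, raising the threshold from 0.35 to 0.72 drops false accepts from 40% to 0% while false rejects climb from 0% to 40%. Moving the threshold only slides you along that curve; a better detector is what moves the whole curve.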

Nevertheless, we have gotten the same question over and over. “What’s the point of having speech recognition if I need to press a button to activate it?”

Some of our earliest customers, like VOS Systems, used a hands-free trigger to control a light switch. This was a particularly useful application, because it could be plugged into a wall without battery drain.

The “Phrase Spotting” technology has advanced over the years, and recently we introduced a new spin on it that we call “Truly Hands-Free” for Bluetooth carkits. This technology is being extremely well received, and we are consistently hearing high praise about performance in noise. It really hits the RIGHT combination of minimizing false accepts AND false rejects, all with minimal power drain considering it is always listening for a trigger word.

Now we’re starting to apply this technology to some new and interesting applications:

  1. Answer/Ignore for Bluetooth headsets and car kits. One of the most desired features of Sensory’s BlueGenie Voice Interface is that it allows answering a phone without having to touch it, for example in a Bluetooth headset or hands-free car kit. The challenge has been getting this to work well in the presence of really loud ring tones and background noises like a car radio or wind noise. The solution…we’ve implemented a Phrase Spotting version of Answer/Ignore that is completely robust to noise and ALWAYS does the right thing.
  2. Interactive Books. Imagine a book that offers an interactive experience with parents and children while they are reading at night. For example, I say “Jack and Jill went up a Hill” and Jack grunts and says “This is hard work!”, and then I say “to fetch a pail of water”, and I hear a water pouring sound, etc. Pretty fun! In the past this would have been difficult because the talking would have messed up the recognition, but the Phrase Spotting can be embedded even in the middle of a sentence!
  3. Remote-less Home Controls. If you are my age, you might remember the days of having to walk up to a TV set and manually crank the channel and volume knobs. That’s unheard of today, and nobody would ever buy a TV like that…but we do buy thermostats, microwaves, clocks, fans, heaters, lights, radios, and virtually everything else around the house that requires a manual interface. Why not use voice triggers? Sensory is currently working with many different consumer electronics manufacturers to implement this revolutionary recognition technology into a new generation of voice controlled devices.

Lots of exciting stuff in development here!! Next time, maybe I’ll write about our voice morphing TTS!

Todd
sensoryblog@sensoryinc.com

I stopped at Walgreens last week to get some new blades for my razor. Usually when I go in to buy new blades I end up just buying a new razor with blades, since it usually costs about the same. This time was different…I bought an electric razor instead.

It’s an Eltron brand electric razor…a cordless rechargeable razor, which actually holds a charge quite well. It includes a flip-out beard trimmer, and a separate nostril trimmer came with it too. The price was $9.95. Walgreens didn’t have it on sale, although the standard Eltron packaging said “normally $49.95 now specially priced” (or something to that effect).

I figured it was going to be junk, but it works just fine, and it even has some nice features like being wet/dry so it can be used in the shower. Now, I would guess that Walgreens likes to make around a 35% margin, which means they probably purchase them for $6 or $7. The manufacturer needs to mark up cost of goods by at least 3x to make a profit and cover shipping, assembly, support and testing, so that means the actual cost must be no more than about $2 (or if a distributor is involved it could be a lot less!).
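For anyone who wants to check my back-of-the-envelope math, here it is as a few lines of Python. The 35% margin and the 3x markup are my guesses from above, not real Walgreens or Eltron numbers.

```python
retail_price = 9.95         # shelf price at Walgreens
retailer_margin = 0.35      # guessed retailer margin (an assumption)
manufacturer_markup = 3.0   # "at least 3x" markup on cost of goods (an assumption)

# A 35% margin means the retailer keeps 35% of the selling price.
wholesale_price = retail_price * (1 - retailer_margin)

# If the manufacturer sells at 3x cost, cost is at most a third of wholesale.
max_cost_of_goods = wholesale_price / manufacturer_markup

print(f"wholesale is about ${wholesale_price:.2f}")
print(f"cost of goods is at most about ${max_cost_of_goods:.2f}")
```

That works out to roughly $6.47 wholesale and about $2.16 in cost of goods, before any distributor takes a cut.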

How can Gillette, Norelco, Braun and others compete? They sell electric razors for $50-$150…are they really that much better? I guess the answer must be features and quality, but it wouldn’t surprise me if these companies were hurting pretty badly from such low cost competition.

It’s not so hard for low cost manufacturing companies to copy features and then compete on price. It’s a lot harder to make the investment in R&D to develop differentiating features. I just saw some numbers from Gartner that show Apple’s success with smartphones. Apple is king when it comes to creating high margin, high feature products with AWESOME user experiences. They are now #3 in the smartphone market with the fastest year over year growth BY FAR of any player.

Global SmartPhone Sales Q2 2009

Apple isn’t resting on its laurels. I’m sure they are determined to be #1, and at the rate they are growing it could happen within a few years. Why are they growing so quickly? They keep adding value to their products. For instance, Apple hasn’t been afraid to change the user interface on their consumer electronics. They were one of the first to embrace touch technologies and now they are embracing voice technologies. Their iPhones are not just phones, but media players, video cameras, navigation systems, and much more as well…and this will continue to grow. Apple will be responsible for taking the smart phone and turning it into a consumer appliance for every room and every purpose imaginable. There are already 85,000 apps in the iPhone store and it’s growing by thousands every month. I don’t think low cost competitors can steal away this business!

Watch out Eltron…when my iPhone has a built in electric razor, I’m throwing you out!

Todd
sensoryblog@sensoryinc.com

See Jane Drive   August 19th, 2009

Since Sensory has gotten very actively involved in providing speech recognition for Bluetooth® based products, I have been asking friends and family about their experiences with various “hands-free” wireless devices.

I recently had an interesting conversation that I’ll share. A woman I know (I’ll call her Jane) uses a Jabra SP-200 Bluetooth® car kit. She says she had tried a wireless headset, but found the car kit much more comfortable and convenient since she really only uses it while driving. Jane found the initial pairing process clumsy and uncomfortable, but after much reading and experimentation is now very happy with her Jabra car kit.

When I pressed Jane more about what she likes and doesn’t like here’s what I found:

Likes:

  1. Doesn’t have to wear it on her head
  2. Call quality is good
  3. Simple and easy to use

Doesn’t like:

  1. Every once in a while it makes a call accidentally
  2. There is no easy way to call people back when she gets disconnected
  3. Doesn’t always understand the different flashing lights

I found this particularly interesting, since on the one hand she said it was simple and easy to use, but also said the lights were confusing, there were control issues, and it was too difficult to easily call someone back.

Of course, if you know Sensory’s BlueGenie™ Car Kit product then you understand that ALL these issues are solved with a BlueGenie™ Voice Interface! (By the way, have you seen the BlueGenie™ car kit video on the Sensory website front page with my daughter Samantha? Smart kid.)

I decided to go a little more in-depth on the SP-200 and looked it up on the web. Interestingly, Jabra markets it as “hands-free” (of course it’s not), and calls it part of the EASY series (it could be a lot easier with BlueGenie™ …) Jabra must understand it’s not Truly Hands-Free, because in some places they call it “hands-free talking.”

Here’s what I learned from the manual:

  • It has 3 LEDs (Blue, Green, and Red) that each mean a different thing. Sometimes they are solid, sometimes they blink, and SOMETIMES THEY BLINK AT DIFFERENT SPEEDS. No wonder Jane found this confusing. Even the same color doing the same thing can mean a different thing in a different mode (e.g. solid blue can mean it’s on, or it can mean it paired successfully).
  • There’s a single big button to tap. This is part of what makes it EASY I guess. However, Jabra differentiates between a TAP and a PRESS. A tap is short and a press is long. And there can be DOUBLE TAPS, and PRESS AND HOLD, and the HOLD can be for 1 second or 5 seconds, etc. For example, you “tap” to answer a call, and you “press” to reject an incoming call, or you double press to redial. Maybe this has something to do with the “accidental” calls Jane mentioned??

I think you absolutely must read and memorize the manual to know how to use this product…and once you do know how to use it, you need to touch it, touch your handset and look at the car kit while driving. That’s not a Truly Hands-Free, Eyes-Free product.

On the other hand, BlueGenie™ car kits will hit the market in 2010, and they will change the world! People will understand what “Truly Hands-Free” really means!

Todd
sensoryblog@sensoryinc.com

The SCID’s are Coming!!!!   August 4th, 2009

No, we’re not under attack from missiles, and I’m not referring to results of the current financial crisis. I’m talking about Speech Controlled Internet Devices. These are home consumer electronic devices that use a VUI (voice user interface) for the user to interact with the product. The products themselves are able to access data and information from the internet, and they use a client/server speech recognition system to obtain higher recognition accuracy than is possible with a client-only or server-only approach.

So what is Sensory’s role in this? Well, we originated the terminology, and we’re evangelizing the concept in advance of the release of our new chip in September. The new chip is designed to act as the main controller for SCIDs, although Sensory is looking for other partners on the chip side (like Intel or Philips) for higher end/higher cost SCIDs. By the way, we’re also looking for server-based speech recognition partners (like Microsoft, Google, Vlingo, Novauris, etc.), and even hardware partners like Cisco that know the Wi-Fi and consumer electronics space.

Some of the press and analysts out there are starting to think about the potential for SCIDs. Troy Wolverton (my favorite Mercury News columnist) had a bit of a change of heart after seeing some of my demos. Earlier I had contacted him because he thought speech recognition never worked, so I was quite happy that his column was titled “Speech Recognition Technology is Rapidly Improving.”

I’m not going to say a whole lot about SCIDs here because Dan Miller from Opus Research has already done an EXCELLENT job of writing up a summary of our conversation. Dan highlights the HUGE volume opportunity that SCIDs will enable over the coming few years.

A really interesting angle on the SCIDs is the Voice Search opportunity they enable. Most people think of Voice Search as something for telephone handsets (the quick idea of “voice search” is that a multi-billion dollar ad/transaction business will emerge for voice search just like it has for conventional Google-like search, so all the major search players - Microsoft, Google, Yahoo, etc. - are interested). The thing is, there will be billions of consumer electronic products hooked up to home internet, potentially with VOIP connections, so handsets won’t be the only devices enabling search opportunities - SCIDs could become a MAJOR driver for search revenues. Michael over at the Kelsey Group keyed off of the interesting opportunities that SCIDs bring to Voice Search and blogged a bit about that.

About the technology - It’s worth noting two very special things within the SCIDs:

  1. Sensory’s new Truly Hands-Free phrase spotting allows SCIDs to be always on always listening, so your voice becomes your remote control for accessing internet data through your SCID - no need to walk up and press buttons.
  2. Sensory will do really simple and accurate speech recognition on the client that provides standalone value when not connected to the internet, but ALSO ASSISTS THE SERVER RECOGNIZER by feeding categorized data along with the query.
    For example, if “Local News” (or time, weather, etc.) is requested from a news-oriented SCID, the client Sensory recognizer can recognize that and stream a local news report, and if “Other News” is requested we can prompt “Please say the location where you would like news reports”. Then Sensory can send a very targeted query to a server based recognizer identifying the recording as a location where recent news is requested. This simplifies the server task, and improves the accuracy of the “say anything” approach to speech queries.
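Here’s one way the client side of that split could look, sketched in Python. Everything in it (the command tables, the action names, the shape of the server query) is a hypothetical illustration of the idea, not Sensory’s actual design or any real server API.

```python
# Phrases the small client recognizer can fully handle on its own (hypothetical).
LOCAL_COMMANDS = {
    "local news": "stream_local_news",
    "time": "speak_time",
    "weather": "speak_weather",
}

# Phrases that trigger a follow-up prompt whose answer goes to the server,
# tagged with a category so the server knows what kind of answer to expect.
FOLLOW_UPS = {
    "other news": ("Please say the location where you would like news reports",
                   "news_location"),
}


def handle_client_phrase(phrase):
    """Decide what the client does with a phrase its small recognizer matched."""
    key = phrase.strip().lower()
    if key in LOCAL_COMMANDS:
        return {"action": LOCAL_COMMANDS[key], "needs_server": False}
    if key in FOLLOW_UPS:
        prompt, category = FOLLOW_UPS[key]
        return {"action": "prompt", "prompt": prompt,
                "needs_server": True, "category": category}
    # Anything else falls back to the server's "say anything" recognizer.
    return {"action": "send_to_server", "needs_server": True, "category": None}


def build_server_query(audio_bytes, category):
    """Package the follow-up recording with the category hint for the server."""
    return {"category": category, "audio": audio_bytes}
```

So “Local News” never leaves the device, while “Other News” produces a prompt and then a server query tagged as a news location. The category tag is what shrinks the server’s search space from “anything” to “a place name”.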

Todd
sensoryblog@sensoryinc.com