Improving Signal to Noise Ratio July 14th, 2010
Dealing with a poor signal to noise ratio is one of the toughest issues in automating speech recognition. At Sensory, we develop lots of techniques so our customers’ products can sit at one end of a noisy room and still recognize a speaker at the other end of the room. Our technologists typically don’t like to implement active noise cancellation techniques because of the belief that active noise cancellation’s signal processing will strip useful information out of the speech data. Nevertheless, we have a whole host of other techniques to make performance in noise work really well.
In Bluetooth® headsets we use a dual mic beamforming technology, and we’ve found that this approach improves our ability to recognize in noise by about 7 or 8 dB. In the Bluetooth® space there are lots of noise cancellation providers, and there are many well proven techniques for removing noise.
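For a sense of scale, decibels are a logarithmic power ratio, so a gain of 7 or 8 dB is larger than it may sound. The short sketch below converts between power ratios and dB. The function names are illustrative, and it assumes the 7 or 8 dB figure can be read as a signal-to-noise gain; it says nothing about how the beamformer itself works.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels, from two power measurements."""
    return 10.0 * math.log10(signal_power / noise_power)


def db_to_power_ratio(db):
    """Convert a decibel figure back into a plain power ratio."""
    return 10.0 ** (db / 10.0)


if __name__ == "__main__":
    for gain_db in (7.0, 8.0):
        ratio = db_to_power_ratio(gain_db)
        print(f"{gain_db:.0f} dB corresponds to a power ratio of about {ratio:.1f}x")
```

Read this way, 7 to 8 dB means the speech power is roughly five to six times larger relative to the noise than it was before.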
What I’ve been wondering for the last few months is why those vuvuzelas are so dang loud during the World Cup broadcasts. Seems like a relatively easy task to just filter them out, or have the broadcasters’ microphones be in a silent booth.
I guess I’m not the only one who wondered about this: If you Google Vuvuzela, “filter” is one of the most common words following it, and clicking on it showed over 1.3 million listings, from hackers’ guides to products for sale.
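One common way to "just filter them out" is a narrow notch filter tuned to the drone. The sketch below is a minimal single-band biquad notch in plain Python, not anything from Sensory or the broadcasters. The 233 Hz fundamental, the 8 kHz sample rate, and the Q value are all assumptions chosen for illustration.

```python
import math


def notch_coefficients(center_hz, sample_rate_hz, q):
    """Normalized biquad notch coefficients (standard audio-EQ-cookbook form)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def notch_filter(samples, center_hz, sample_rate_hz, q=30.0):
    """Run one notch over a list of samples and return the filtered list."""
    (b0, b1, b2), (a1, a2) = notch_coefficients(center_hz, sample_rate_hz, q)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate_hz, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


if __name__ == "__main__":
    FS = 8000         # assumed sample rate
    DRONE_HZ = 233.0  # assumed vuvuzela fundamental, not taken from the post
    drone = notch_filter(tone(DRONE_HZ, FS, FS), DRONE_HZ, FS)
    other = notch_filter(tone(1000.0, FS, FS), DRONE_HZ, FS)
    # Skip the first half second so the filter's start-up transient has died out.
    print("drone RMS after notch:", rms(drone[FS // 2:]))
    print("1 kHz RMS after notch:", rms(other[FS // 2:]))
```

A real drone also has harmonics, so a practical filter would need several notches, and each one removes a slice of the commentator's voice too. That is the same information-loss worry raised above about noise cancellation.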
Voice Search, M&A, and the Economy May 3rd, 2010
Haven’t blogged in a long time…I have plenty to say but have just been too busy. That’s good news. Sensory is signing up new deals at a very rapid rate, so 2011 should be an excellent year for us. I declare the economic recovery in full swing (although I do have some trepidation it could be short lived). Right now my biggest issue is chip SUPPLY! We’ve actually had some trouble getting enough chips (this is endemic to the entire chip market right now!). Luckily, our software business is exploding and a growing percentage of overall revenues is not dependent on buying silicon!
The cool thing is for the first time in Sensory’s 15 year history we are putting text-to-speech into products. We’ve done a handful of deals in just the last couple of months, and I expect that within 2 years we’ll have over 10 million TTS devices that will have hit the market (we’re at around 60 million speech recognition products right now).
I went to Voice Search last week. This is the show that Bill Meisel and AVIOS co-host every year. It’s my favorite speech industry show and pretty much the only one I attend. At the show I spoke on a consumer speech panel and demonstrated Sensory’s Truly Hands-Free Voice Trigger. Nobody thinks that wordspotting can be always on and always listening without false firing - and still catch the trigger word when it’s spoken. Sensory’s spotting technology WORKS! It’s my pet technology right now and I think it will change the world, by making speech recognition TRULY HANDSFREE (that was the title of my presentation)…anyways…I demoed it live. Nobody is supposed to do live speech recognition demos because they always fail (Microsoft has had the misfortune of proving that more than once!), so most people at the conferences show video clips. I know Sensory’s stuff works well, but I got a little nervous when I started talking and I could hear the echo of the microphone, and as I spoke I was hoping it wouldn’t false trigger and totally embarrass me. It didn’t false trigger…then it just had to recognize my trigger words. It got the first and the second one right. Then on the 3rd time the small device started sliding down the podium and the mic got covered up and for a brief moment my heart froze and I thought I was going to need to repeat my trigger word…then all of a sudden I felt my heart exploding as I waited microseconds for the response…then it spoke and it got it! No false fires and 3/3 triggers accurately recognized. Oh the trials and tribulations of a speech industry veteran! The technology is great and in a car it’s nearly flawless; it was this new acoustical environment that made me nervous. It came through!!!
So…Apple acquired SIRI, Inc., an iPhone developer that supplies a personal assistant application featuring speech recognition. Cool. That means Apple is in the game - the speech game, with apparently a slightly different twist than Microsoft or Google. All 3 companies are investing in speech recognition. But Apple is doing very light investing while Google and Microsoft are HEAVILY invested. Apple apparently isn’t using any of its home grown technologies as they keep licensing Nuance…and SIRI uses a Nuance engine as well. SIRI is a voice concierge type service that uses the Nuance recognizer then throws a layer of “meaning” interpretation or “intelligence” into the process. Anyways, I’m glad Apple is taking voice control seriously…they’re gonna have a tough time catching up with Google. My take is the Google stuff works best right now. I was playing with a Nexus One phone and the recognition on it is really amazing. BING is pretty good too and has wrapped better apps around their technology in BING411.
I remember a Keynote talk 15 years ago at a speech conference titled something like “the Ever-Imminent-Never-Arriving Speech Bonanza”…well it’s finally here, and I have to thank Google and Microsoft (and Vlingo too!) for clearly taking us over the hurdle and making speech recognition accessible and usable by the masses. Now it’s time for Apple to kick in and do its part…and now that HP has acquired Palm, it will need to get in the game too. I don’t even know if HP has a speech recognition team, but if they don’t they will soon. So will Cisco. So will all the major consumer electronics and automotive companies! Our time has come!!! Speech Recognition has arrived and is working for the masses! It will just get better!
Observations from Mobile World Congress in Barcelona March 3rd, 2010
I was in Barcelona last month at the Mobile World Congress. Here are some of my speech-centric observations:
I went by the Microsoft booth on the first day of the show and asked when WinMobile 7 would be announced. The guy on the floor acted like he had no clue what I was talking about. He wouldn’t even confirm it hadn’t been announced yet. The really ironic thing is that EVERYWHERE I went I saw Windows 7 advertisements…subways, stairs, hotel lobbies, etc. My friend Dan had a couple of corporate suites at the hotel across from the show, and asked about putting up a flier to say what floor they were on. He found out the entire hotel advertising space was taken by Microsoft! They had gotten an exclusive from the hotel.
Speaking of Dan…we’re old friends from school and decided to meet up for dinner. He said “Are you OK with a Tapas Bar?” and I said “Actually, I’m kinda hungry, if you really want to go, let’s do it after we eat.” I had made a speech recognition error…think about it.
Anyways…WinMobile 7 was announced on Day 2, and I saw some of the demos. I must say that Microsoft is taking a brave approach by completely redesigning the interface to be more focused on data (people, places) than on functions (applications, etc.). However, even with the new look and feel I didn’t hear any mention of any new speech recognition features, like um, a voice interface. I asked a guy on the floor, and he said the voice search was much improved. I like BING search, Google search and Vlingo search too as they are all getting more useful and robust. A couple of years ago, I was trying one of these search engines to find my hotel in downtown Boston, and after 3 or 4 failed attempts on a street corner, a woman pointed down the street and said “Your hotel is just down there”. A memory flashback…a cabbie on that trip asked me what I did and I said “speech recognition.” He said “oh I’ve been trying that for years…my wife talks to me and sometimes I respond properly.” But I digress…
Back to Barcelona. I saw a nice demo of MOTONAV at the Motorola booth. With a new independent consumer-product company spun out and Sanjay Jha in charge, they really seem to have turned things around. The people on the show floor seemed very upbeat and excited about where Motorola is right now. In addition to the 23 phones they currently offer, they have new ones coming out, including the new Devour and Cliq XT, both of which are based on the Android OS. I didn’t see much new stuff in the Bluetooth space, however. They are doing PNDs (portable navigation devices) and cell phones with MOTONAV. It’s a nice voice-controlled driving application, and the speech recognition in the demo I saw worked quite well on the hard stuff (addresses, etc.), but messed up on the easy things (it was a simple 2 word set that it got wrong). Then again, small sets aren’t always easier than big ones. The Yes/No response is one of the hardest sets to get right (I heard that there are more than 50 ways to say No and almost as many ways to say Yes…like unh-unh and unh-huh…I can’t even get that right spelling it!).
The big thing missing from MOTONAV is a Truly Hands-Free Trigger. In fact, that’s what is missing from the entire cell phone industry. All these products have built-in speech recognition, but the only way to activate it is with button presses. Here’s an article I found about “The First Truly Hands-Free Phone.” HOWEVER, when you read through it you find it really requires 2 button presses…one to turn it on and a second to activate the voice recognition. Well, Sensory can get rid of one of those button presses, which is a HUGE savings for products that can be turned on and are always listening. As battery technology improves and more “smart” listening windows are deployed, Truly Hands-Free triggers will become increasingly important for all products with speech technologies.
Google’s Nexus One Isn’t Afraid of Speech! January 7th, 2010
Yeah, everyone’s writing about the new Google phone. I’ve heard various reports about it being underwhelming, and in need of the marketing hype that Apple is so good at. Everybody loves to compare the iPhone with the Nexus One and talk about screen size, weight, camera capabilities, software, etc.
Here’s my 2 cents on speech recognition and Bluetooth for these devices:
Apple’s initial iPhone release had speech recognition–phobia, with no factory options for implementing voice recognition commands. It was such a shocking omission that many of the mainstream reviewers even pointed it out. In various industry conversations I heard “Steve doesn’t like speech recognition”. As a result, 50 speech recognition applications quickly appeared in the App Store, and by necessity Apple soon implemented Voice Control for music and voice dialing. I assume Apple implemented Nuance technology and most likely in a local version that runs on the iPhone.
What Google’s done with the Nexus is WAY different. They are embracing speech recognition from the start, and not just implementing “me too” features. Google is pushing the boundaries by including speech recognition for dictation (text messaging, email, social networking, etc.) and mapping/GPS type functions. I remember the original Android announcements mentioned that Nuance was their speech partner, but it seems like all the big guys like to start with Nuance then switch away. My guess is that the Nexus One uses homegrown (Mike Cohen and Co.) speech recognition, and since it is server based, it should adapt and improve and just get better with the data they are collecting. I give kudos to Google for this!
On the Bluetooth side of things, we were shocked and hurt that we couldn’t use our BlueGenie Voice Interface Bluetooth headsets to easily call up recognizers on the iPhone for name dialing. Although Bluetooth defines a clear protocol for this, it wasn’t implemented on the initial iPhone. New iPhone versions do support this, but Apple never clearly thought through the importance of a cohesive user interface and functionality with Bluetooth connected to its phones, especially when speech recognition is involved.
If Google is smart, they won’t only introduce a Nexus One phone, but they’ll come out with a really cool Nexus One headset that TAKES ADVANTAGE of all the great speech recognition software on the handset, with one seamless voice user interface! The Nexus One has been blasted as nothing really new, but this type of integration with a hands-free headset or car kit could make it TOTALLY REVOLUTIONARY.
Hey Google – make a BLUEGENIE VOICE INTERFACE HEADSET!
Phrase Spotting Offers New Opportunities for New Products October 29th, 2009
We have had a lot of requests over the years for products that are always on and listening for a key “trigger” word. The challenge of this approach is making a “trigger” that doesn’t accidentally trigger when it is not spoken, but also doesn’t accidentally NOT trigger when it IS spoken. The trade-off between these two types of errors is not so simple, since improving one usually makes the other worse, and background noise, especially talking, typically makes voice interfaces perform poorly. And this doesn’t even take into account the constant energy drain from devices that are always on and listening.
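The trade-off between the two error types is easiest to see by sweeping a decision threshold over detector scores. The sketch below is only an illustration: the scores are made up, the function names are hypothetical, and a single score threshold is an assumed simplification rather than a description of how Sensory's spotter makes its decision.

```python
# Hypothetical detector scores (higher = more confident the trigger was spoken).
# These numbers are invented for illustration; they are not Sensory data.
TRIGGER_SCORES = [0.91, 0.84, 0.77, 0.62, 0.58, 0.95, 0.49, 0.88]
BACKGROUND_SCORES = [0.12, 0.35, 0.51, 0.07, 0.66, 0.29, 0.44, 0.18]


def error_rates(threshold, trigger_scores, background_scores):
    """Return (false_accept_rate, false_reject_rate) at one threshold."""
    # False accept: background audio scored high enough to fire the trigger.
    false_accepts = sum(1 for s in background_scores if s >= threshold)
    # False reject: a real trigger utterance scored too low to fire.
    false_rejects = sum(1 for s in trigger_scores if s < threshold)
    return (false_accepts / len(background_scores),
            false_rejects / len(trigger_scores))


def sweep(thresholds, trigger_scores=TRIGGER_SCORES,
          background_scores=BACKGROUND_SCORES):
    """Compute both error rates for each threshold, in the order given."""
    return [(t,) + error_rates(t, trigger_scores, background_scores)
            for t in thresholds]


if __name__ == "__main__":
    for threshold, fa_rate, fr_rate in sweep([0.3, 0.5, 0.7]):
        print(f"threshold={threshold:.1f}  "
              f"false accepts={fa_rate:.3f}  false rejects={fr_rate:.3f}")
```

Raising the threshold drives false accepts down and false rejects up, and lowering it does the reverse. Doing better on both at once requires scores that separate speech from background more cleanly, not just a different threshold.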
Nevertheless, we have gotten the same question over and over. “What’s the point of having speech recognition if I need to press a button to activate it?”
Some of our earliest customers, like VOS Systems, used a hands-free trigger to control a light switch. This was a particularly useful application, because it could be plugged into a wall without battery drain.
The “Phrase Spotting” technology has advanced over the years, and recently we introduced a new spin on it that we call “Truly Hands-Free” for Bluetooth carkits. This technology is being extremely well received, and we are consistently hearing high praise about performance in noise. It really hits the RIGHT combination of minimizing false accepts AND false rejects, all with minimal power drain considering it is always listening for a trigger word.
Now we’re starting to apply this technology to some new and interesting applications:
- Answer/Ignore for Bluetooth headsets and car kits. One of the most desired features of Sensory’s BlueGenie Voice Interface is that it allows answering a phone without having to touch it, for example in a Bluetooth headset or hands-free car kit. The challenge has been getting this to work well in the presence of really loud ring tones and background noises like a car radio or wind noise. The solution…we’ve implemented a Phrase Spotting version of Answer/Ignore that is completely robust to noise and ALWAYS does the right thing.
- Interactive Books. Imagine a book that offers an interactive experience with parents and children while they are reading at night. For example, I say “Jack and Jill went up a Hill” and Jack grunts and says “This is hard work!”, and then I say “to fetch a pail of water”, and I hear a water pouring sound, etc. Pretty fun! In the past this would have been difficult because the talking would have messed up the recognition, but the Phrase Spotting can be embedded even in the middle of a sentence!
- Remote-less Home Controls. If you are my age, you might remember the days of having to walk up to a TV set and manually crank the channel and volume knobs. That’s unheard of today, and nobody would ever buy a TV like that…but we do buy thermostats, microwaves, clocks, fans, heaters, lights, radios, and virtually everything else around the house that requires a manual interface. Why not use voice triggers? Sensory is currently working with many different consumer electronics manufacturers to implement this revolutionary recognition technology into a new generation of voice controlled devices.
Lots of exciting stuff in development here!! Next time, maybe I’ll write about our voice morphing TTS!
See Jane Drive August 19th, 2009
Since Sensory has gotten very actively involved in providing speech recognition for Bluetooth® based products, I have been asking friends and family about their experiences with various “hands-free” wireless devices.
I recently had an interesting conversation that I’ll share. A woman I know (I’ll call her Jane) uses a Jabra SP-200 Bluetooth® car kit. She says she had tried a wireless headset, but found the car kit much more comfortable and convenient since she really only uses it while driving. Jane found the initial pairing process clumsy and uncomfortable, but after much reading and experimentation is now very happy with her Jabra car kit.
When I pressed Jane more about what she likes and doesn’t like here’s what I found:
Likes:
- Doesn’t have to wear it on her head
- Call quality is good
- Simple and easy to use
Doesn’t like:
- Every once in a while it makes a call accidentally
- There is no easy way to call people back when she gets disconnected
- Doesn’t always understand the different flashing lights
I found this particularly interesting, since on the one hand she said it was simple and easy to use, but also said the lights were confusing, there were control issues, and it was too difficult to easily call someone back.
Of course, if you know Sensory’s BlueGenie™ Car Kit product then you understand that ALL these issues are solved with a BlueGenie™ Voice Interface! (By the way, have you seen the BlueGenie™ car kit video on the Sensory website front page with my daughter Samantha? Smart kid.)
I decided to go a little more in-depth on the SP-200 and looked it up on the web. Interestingly, Jabra markets it as “hands-free” (of course it’s not), and calls it part of the EASY series (it could be a lot easier with BlueGenie™ …) Jabra must understand it’s not Truly Hands-Free, because in some places they call it “hands-free talking.”
Here’s what I learned from the manual:
- It has 3 LEDs (Blue, Green, and Red) that each mean a different thing. Sometimes they are solid, sometimes they blink, and SOMETIMES THEY BLINK AT DIFFERENT SPEEDS. No wonder Jane found this confusing. Even the same color doing the same thing can mean a different thing in a different mode (e.g. solid blue can mean it’s on, or it can mean it paired successfully).
- There’s a single big button to tap. This is part of what makes it EASY I guess. However, Jabra differentiates between a TAP and a PRESS. A tap is short and a press is long. And there can be DOUBLE TAPS, and PRESS AND HOLD, and the HOLD can be for 1 second or 5 seconds, etc. For example, you “tap” to answer a call, and you “press” to reject an incoming call, or you double press to redial. Maybe this has something to do with the “accidental” calls Jane mentioned??
I think you absolutely must read and memorize the manual to know how to use this product…and once you do know how to use it, you need to touch it, touch your handset and look at the car kit while driving. That’s not a Truly Hands-Free, Eyes-Free product.
On the other hand, BlueGenie™ car kits will hit the market in 2010, and they will change the world! People will understand what “Truly Hands-Free” really means!
Apple… It’s about time! June 9th, 2009
OK, I guess I have to blog about Apple’s new iPhone 3GS and the new Voice Control feature. Yeah it’s a big deal. My main comment…It’s about time!
A lot of people and reviewers complained that speech recognition was missing when the iPhone first shipped. I repeatedly heard through the grapevine that “Steve doesn’t like recognition”. Then miraculously, 20 or 30 different voice dialers and various other voice recognition applications appeared in the App Store. I tried a few and they all worked pretty well. My favorite is NameDial by VoiceActivation (which uses Sensory technology of course!)
For a long time Nuance was rumored to be swinging some kind of deal with Apple, and I guess they did. I haven’t seen the Nuance name mentioned yet, but with 30 different languages supported, I’m very confident Nuance is there behind the scenes…probably not making much money, if I know Apple!
It’s definitely an embedded engine too. If you listen to the demo of the TTS (text-to-speech), you can hear it’s embedded (i.e. not as good as a server based TTS system would allow); it even sounds kind of like a Nuance voice.
So why did I say it’s a big deal??? Voice dialing is old hat, but doing music search is pretty novel. I’ve only known of a couple other MP3 player apps that use speech recognition embedded into the devices.
Today Sensory announced its Truly Hands-Free technology for trigger type phrase spotting. It allows a product to be activated solely by voice, with no button pressing necessary. We developed it to go with our BlueGenie car kits so drivers wouldn’t need to be distracted, but maybe Apple wants to license it to run with their new Voice Control!
Hey Steve – Wanna go Hands-Free?
On Human Misrecognitions… September 24th, 2008
My very first blog was called “Weapons for Christmas”…I had misunderstood my daughter when she said she wanted “Webkinz for Christmas”. I’m always intrigued by errors in human speech recognition. I figure if we can’t do it right with all our sensory and extra sensory powers, then how in the world can a computer ever get it right? Or better yet, how can we apply the sensory tools in people to make our machines better?
One of Sensory’s Bluetooth engineers is a native Chinese speaker. Sometimes I have a difficult time understanding his accent, but he says that our BlueGenie Voice Interface on the headsets he works on always works for him. I wonder whether that is because Sensory’s technology is so good, or because he is well trained on how to talk by our technology. I suspect it’s a combination of both.
A couple of months ago I was in New York. I had a meeting in a building with a security gate entrance. When I signed in at the counter I was given a barcode pass. Upon exiting, I slid the pass in the security gate, but the gate didn’t open. I tried again and it still didn’t open. The security guard gave me a mean look and said something to me. He was a local guy with a New York accent. I had no idea what he said. I tried swiping my card again…gate still didn’t open. Guard looked mad and grumbled the same thing again, sounded like “Japushida”. I had no idea what he meant, then he made a pushing motion with his hands…I wasn’t supposed to wait for it to open automatically, I was supposed to “just push it in” (I guess?). The body language clued me in!
I was on the phone yesterday and I heard the person on the other end tell me “My female is slowing down my system”…I quickly corrected that in my mind to be “my email is slowing down my system”, but the correction didn’t occur until I heard the word “system”…then the context made it all come together. I do remember a split second thinking “why is he talking about ‘his female’”…I didn’t know what he meant and it seemed so politically incorrect. Context certainly helps!
Voice User Interfaces Everywhere! September 17th, 2008
I was talking with an industry analyst today. He had gotten the BlueAnt V1 Bluetooth headset with Sensory’s BlueGenie technology, and he was very pleasantly surprised by how it was both EASIER to use yet MORE FEATURE RICH all at the same time (OK, I’ll include my favorite reviewer’s quotes below…and by the way, it’s also SAFER!).
Let me sidetrack a bit, though, before talking about the industry analyst call. The BlueAnt V1 is really a great product, and a true innovation for the speech industry. First of all it WORKS. Not only does it really work, but it’s also the smallest speech I/O system to ever ship…and it’s the first “complex” consumer product with a true voice user interface. By “complex” I mean voice is used for more than simple on/off kinds of functions (like a voice lamp). All the other voice based consumer products that have hit the market use speech as a feature. These are products like toys, cellphones, and remote controls that are designed to be held, looked at, and touched. For example, a cellphone is a multi-modal product…it has a keyboard and a display. It’s designed to be used while looking at it. A headset is totally different. It’s designed for use WITHOUT looking at it and basically without even touching it! A voice user interface is the perfect solution for Bluetooth headsets, and the BlueGenie interface is really bringing Sensory a lot of recognition (bad pun intended!).
Anyways, the analyst said “I now understand how your BlueGenie Voice Interface makes products easier to use. I don’t understand why touch technology is getting so popular instead of Voice User Interfaces”. Well he hit the nail on the head. It’s very clear that one day voice user interfaces will be everywhere, and will overtake and combine with touch for improved interfaces on products. Voice is easier and more natural and even offers the opportunity for more features without complexity. The BlueAnt V1 doesn’t even need a manual because it’s all contained within the headset!
So why aren’t voice user interfaces everywhere today??? Because speech technologies still need to improve. What Sensory has found though is that for constrained task environments like a Bluetooth headset or repetitive but complex tasks like setting time or adjusting controls on a microwave oven, a voice user interface can very much be the magic solution of today!
Don’t believe me? Go buy a BlueAnt V1 and experience for yourself the magic of a voice user interface - The BlueGenie Voice Interface. (If you have noticed that I really like that BlueAnt product, then you are absolutely right…It’s the best and most important product Sensory has made in its 15 year history!).
Internet Search via Cell Phones April 9th, 2008
I made it to a couple of interesting tradeshows over the last month. The Voice Search Conference was held at the Marriott hotel in San Diego, which provided a really nice setting; the show was very well organized and well attended. Voice Search is all about bringing the power (and revenues) from internet search engines to the cell phone market. Google, Microsoft, Yahoo, and others are getting into this in a big way.
On the first morning I accidentally went into the wrong room for breakfast and sat down with a bunch of people from another industry. They were really negative about phone-based speech recognition, and offered these opinions:
- “Oh like when you call somewhere and the phone says ‘Press or Say One’”
- “Yeah I tried it but it never works with my voice”
- “I hate that stuff, I just want to talk to a live operator”
I pointed out that it’s gotten a lot better than this! Directory services like 1800Goog411 (Google), 1800Call411 (Microsoft), 1800Free411 (Jingle) actually work quite well and do save time. Most of them are based on Nuance engines, which are very powerful server-based technologies. Nuance is the 800 pound gorilla in the speech space, because they’ve acquired pretty much every player in speech recognition (well, other than Sensory of course, but they certainly cleared away all our competitors in the embedded area). Microsoft, Google, and Yahoo have pretty large speech R&D teams, but I’d guess they all use Nuance IP in some fashion, probably to expand their language coverage, if not more.
I found it humorous when someone quoted a woman from Nuance who said “My boss told me never to give live demos at shows because they never work.” Novauris gave some of the best demos at the show, but sure enough they pushed the envelope until some stopped working. I do commend them for being willing to demonstrate technologically challenging concepts in front of a live audience. It can be something of a crap shoot showing off cutting edge technologies.
I spoke on a panel at the Voice Search Conference, and one of the other speakers was from IBM India. He gave a presentation about a telecom web that they are deploying so that users can use their phones to find and hear about service providers in India, basically through short audio messages like “Hello, I’m Pradeep the plumber. I have 12 years of experience doing all types of plumbing.” This is similar to searching the web and reading the short blurbs about different businesses, but instead hearing the entries from a telephone.
At CTIA Wireless 2008, the big cell phone show in Las Vegas, I had a chance to try the Vlingo voice search engine. Yahoo has licensed it already, and it is simply AMAZING! It is the closest thing to “natural language” and “context independence” speech recognition that I have ever seen. Vlingo provides a speech to text service that utilizes a thin client to server model in order to provide recognition in cell phone apps.
Bluetooth headsets were prominently on display at the show. Plantronics introduced a comfortable and cool looking headset that included a case which provides a 5 hour recharge. Great Concept!
The high point of the CTIA conference for me was BlueAnt Wireless winning an award for Best of Show in the peripherals category for their V1 Bluetooth Headset. BlueAnt is a smart and aggressive company that is making rapid inroads and finding a lot of success in the Bluetooth headset and carkit markets. The V1 is billed as the first voice-controlled headset, and it is based on Sensory’s BlueGenie Voice Interface, which gives the user the ability to control common functions like answering or rejecting calls and pairing devices vocally. It even has 1800Goog411 as a built-in command, meaning you’ll never have to press buttons to place a call to any business across the US. Now that’s what I call useful!