Rep Fair and Moshi November 20th, 2008
Last month Sensory hosted its annual Rep Fair. Once a year, we invite all of our sales and distribution reps from around the world to attend this conference to talk about our new products and technologies.
In the past, we have had over two dozen attendees at these conferences and have rented out large hotel meeting rooms for multiple days of activities. With the down economy, we scaled back a bit and hosted this year’s Rep Fair at a Sensory meeting room. Fifteen people representing eight countries attended, which is not bad considering the doom & gloom economy, but that’s not what I wanted to write about…
This year we demonstrated a few new products featuring Sensory technologies that are just now hitting the market. One of these is the Moshi IVR Alarm Clock by Moshi Lifestyle which uses our Natural Time Set technology. This clock allows a person to set the time and alarms by voice, using simple phrases like “8:45 AM”. This is a pretty tough grammar task for Sensory’s entry level chips which use only 2K RAM and less than 10 MIPS of processing, but the speech recognition is flawless and the product is excellent.
For fun we left the “voice trigger” mode on during the entire two days of nonstop meetings, presentations, and discussions. Voice trigger mode (or continuous listening) is one of the toughest tasks a speech recognizer can do, because the listening window is completely unconstrained: everything spoken must be analyzed, with the right trigger phrase being responded to and the wrong words or phrases being rejected.
For this clock the trigger is “Hello Moshi.” I was expecting the clock to false trigger about once every hour, which would have been normal for an older technology release. However, our latest code is more finely tuned and calibrated for noisy environments, and after about four hours with no false triggers, I just had to check to see if it was even still on. So in the middle of a presentation on Sensory’s new VPC chip (not yet announced!), I said “Hello Moshi”…lo and behold, Moshi responded. It was on and listening the whole time, but never false triggered.
We went a full 2 days with no false triggers, and all the Sensory employees and old time reps were blown away!
It’s All About the Music! October 1st, 2008
My first startup was a company called ESS Technologies, which originally stood for Electronics Speech Systems. We started out as a software speech synthesis developer, and found mild success making the early Commodore 64 and Apple II games speak. Our claim to fame was that every one of the 20 or so games that licensed our technology made it to the Billboard top 10 of software games.
ESS moved into speech chips, and sales grew dramatically when talking books started shipping. ESS’s sales really took off, though, when music synthesis industry pioneer Roi Peers started running music on our chips. This enabled ESS to release a single chip IC that essentially removed the need for Creative Sound Blaster boards on portable PCs. It was ESS’s music chip sales that launched it as the most successful semiconductor IPO of 1995, and sales increased exponentially from tens of millions of dollars per year to hundreds of millions of dollars per year.
I love music. I used to work in the music industry, and I’ll play just about any instrument I can get my hands on. I used to hear the statistic that 1/10 people consider themselves a musician and 9/10 people want to be a musician. I don’t know if that’s true, but it’s reasonable. I just read in my Costco Connection magazine (must have reading while I wait for programs to load on my slow computer) a few interesting things about music game software:
- 2006 sales were $250 Million
- 2007 sales were $1.3 Billion
- 31% of PS2 dollars were spent on Guitar Hero when Guitar Hero was released
- Guitar Hero alone made over $820 million more than Mario and Halo combined!
Sensory hasn’t yet announced our next generation chip, but for those that read my blog, here’s a sneak preview…This chip is called the VPC and will be an AWESOME low cost music chip (while also offering many other high-quality audio technologies including speech recognition and synthesis):
- STEREO 16 bit DAC output - Sensory’s current generation ICs are mono 12 bit devices, making them applicable only for low fidelity toys.
- 32 voice MIDI synthesis - it can play MIDI files and access the large available content base
- MP3 decoding - yes, it’s an MP3 player!
- Mixer/Effects - Reverb, EQ, echo and other effects are included
- Sampler - we have a 16 bit stereo ADC so the chip can also record sounds
So is this the next great music chip to take over the Pro-audio market? NO. It’s also not a high-end audiophile-quality MP3 playback device. But it is very low cost, and 99% of the population couldn’t hear the difference and would rather save the money!
Maybe it’s the chip for a new generation of low cost Guitar Hero like instruments that can be played and jammed on in a stand alone or “group” jam environment. It should start shipping in 2009…but no guarantees about any features, ship dates, or anything else…it hasn’t been announced yet.
I hope it’s not Gaudy… September 24th, 2008
Google launched a new “audio search indexing experiment that allows users to find spoken words inside videos.” Google Audio Indexing (GAUDI) was developed by Google Labs (is that where Nuance founder and hot jazz guitarist Michael Cohen lives these days?).
Gaudi is a fun name. Anyone that’s been to Barcelona has seen the unique art nouveau style of design and architecture from the very famous Antoni Gaudi.
Ironically, Gaudi sounds kind of like Gaudy, which Webster’s defines as “ostentatiously or tastelessly ornamented”. Now I’m a fan of Gaudi (the architect), but I could see how a critic might describe some of Gaudi’s works as “tastelessly ornamented”. They certainly can be ornamented and “tasteless” is just a matter of opinion. A quick Googling shows that gaudy has ancient Latin roots and no relationship to Gaudi.
Anyways…Gaudi (the software) “transforms spoken words into text and then indexes that text using search technology–users searching for spoken words inside video clips will be able to jump to portions of a video where the searched words are spoken.” Pretty cool! Isn’t that what Paul Leggo was doing over at Virage 10 or 15 years ago?
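To make the “index the text, then jump to the spot in the video” idea concrete, here is a minimal sketch. It is not Google’s implementation: it assumes a recognizer has already produced a transcript as (start time in seconds, word) pairs, and the function names and the sample transcript are made up for illustration.

```python
from collections import defaultdict

def build_index(transcript):
    """Map each spoken word to the list of times (in seconds) where it occurs.

    `transcript` is assumed to be an iterable of (start_seconds, word) pairs,
    i.e. the output of some speech-to-text step that is not shown here.
    """
    index = defaultdict(list)
    for start_seconds, word in transcript:
        index[word.lower()].append(start_seconds)
    return index

def find(index, word):
    """Return the timestamps where `word` was spoken (empty list if never)."""
    return index.get(word.lower(), [])

# Hypothetical transcript of a short clip, for illustration only.
sample_transcript = [
    (0.0, "Hello"), (0.4, "Moshi"), (2.1, "set"), (2.5, "alarm"),
    (7.8, "hello"), (8.3, "again"),
]

index = build_index(sample_transcript)
hello_times = find(index, "hello")   # places a player could jump to
missing_times = find(index, "gaudy")
```

A video player would then seek to any of the returned timestamps. The hard part, of course, is the recognition step that produces the transcript in the first place.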
I applaud Google for bringing this to the masses. The press release then gives its standard Google spokesperson quote “Google’s mission is to organize the world’s information and make it universally accessible and useful”. They left off the part about making money through ad content and hierarchies of information based on what advertisers will pay.
Hey, I love all this free access to content and information. I think it’s fine to have some advertising on the side bars; it’s certainly fair for companies to make money. I don’t mind transaction commissions, ads, etc. as long as I know when it’s happening. I wish there was a law though that forced disclosure anytime search results are ranked by commission or dollars paid.
Someone told me the other day that some mapping programs don’t necessarily take you through the fastest route, but instead bring you by billboards they want you to see. Could that be true? Very scary. Don’t be evil!
On Human Misrecognitions… September 24th, 2008
My very first blog was called “Weapons for Christmas”…I had misunderstood my daughter when she said she wanted “Webkins for Christmas”. I’m always intrigued by errors in human speech recognition. I figure if we can’t do it right with all our sensory and extra sensory powers, then how in the world can a computer ever get it right? Or better yet, how can we apply the sensory tools in people to make our machines better?
One of Sensory’s Bluetooth engineers is a native Chinese speaker. Sometimes I have a difficult time understanding his accent, but he says that our BlueGenie Voice Interface on the headsets he works on always works for him. I wonder whether that is because Sensory’s technology is so good, or because he has been well trained on how to talk by our technology. I suspect it’s a combination of both.
A couple of months ago I was in New York. I had a meeting in a building with a security gate entrance. When I signed in at the counter I was given a barcode pass. Upon exiting, I slid the pass in the security gate, but the gate didn’t open. I tried again and it still didn’t open. The security guard gave me a mean look and said something to me. He was a local guy with a New York accent. I had no idea what he said. I tried swiping my card again…gate still didn’t open. Guard looked mad and grumbled the same thing again, sounded like “Japushida”. I had no idea what he meant, then he made a pushing motion with his hands…I wasn’t supposed to wait for it to open automatically, I was supposed to “just push it in” (I guess?). The body language clued me in!
I was on the phone yesterday and I heard the person on the other end tell me “My female is slowing down my system”…I quickly corrected that in my mind to be “my email is slowing down my system”, but the correction didn’t occur until I heard the word “system”…then the context made it all come together. I do remember a split second thinking “why is he talking about ‘his female’”…I didn’t know what he meant and it seemed so politically incorrect. Context certainly helps!
Voice User Interfaces Everywhere! September 17th, 2008
I was talking with an industry analyst today. He had gotten the BlueAnt V1 Bluetooth headset with Sensory’s BlueGenie technology, and he was very pleasantly surprised by how it was both EASIER to use yet MORE FEATURE RICH all at the same time (OK, I’ll include my favorite reviewer’s quotes below…and by the way, it’s also SAFER!).
Let me sidetrack a bit, though, before talking about the industry analyst call. The BlueAnt V1 is really a great product, and a true innovation for the speech industry. First of all it WORKS. Not only does it really work, but it’s also the smallest speech I/O system to ever ship…and it’s the first “complex” consumer product with a true voice user interface. By “complex” I mean voice is used for more than simple on/off kinds of functions (like a voice lamp). All the other voice based consumer products that have hit the market use speech as a feature. These are products like toys, cellphones, and remote controls that are designed to be held, looked at, and touched. For example, a cellphone is a multi-modal product…it has a keyboard and a display. It’s designed to be used while looking at it. A headset is totally different. It’s designed for use WITHOUT looking at it and basically without even touching it! A voice user interface is the perfect solution for Bluetooth headsets, and the BlueGenie interface is really bringing Sensory a lot of recognition (bad pun intended!).
Anyways, the analyst said “I now understand how your BlueGenie Voice Interface makes products easier to use. I don’t understand why touch technology is getting so popular instead of Voice User Interfaces”. Well he hit the nail on the head. It’s very clear that one day voice user interfaces will be everywhere, and will overtake and combine with touch for improved interfaces on products. Voice is easier and more natural and even offers the opportunity for more features without complexity. The BlueAnt V1 doesn’t even need a manual because it’s all contained within the headset!
So why aren’t voice user interfaces everywhere today??? Because speech technologies still need to improve. What Sensory has found though is that for constrained task environments like a Bluetooth headset or repetitive but complex tasks like setting time or adjusting controls on a microwave oven, a voice user interface can very much be the magic solution of today!
Don’t believe me? Go buy a BlueAnt V1 and experience for yourself the magic of a voice user interface - The BlueGenie Voice Interface. (If you have noticed that I really like that BlueAnt product, then you are absolutely right…It’s the best and most important product Sensory has made in its 15 year history!).
Nuance Done Acquiring? September 8th, 2008
I just read an interesting analysis by Ketul Kirtikumar on seekingalpha.com.
Ketul claims, “The acquisition machine which fueled growth at Nuance might be slowing down due to the high debt that Nuance Communications (NUAN) has accumulated in the last two years.” He states their organic growth has been slowing while execs and insiders have been unloading shares.
I have mixed feelings about Nuance ceasing its aggressive acquisition strategy. Nuance’s acquisitions have created a wonderful consolidation in the embedded space, virtually removing all of Sensory’s competitive threats and allowing Sensory to be the only remaining major player with any substantive size in the embedded speech market. A few years back it was ART, Voice Signal and Sensory. Now ART and Voice Signal have been merged/acquired into Nuance. Hey, I like that!
Nuance has a habit of suing companies before acquiring them. This is the reason I’d be glad if they stopped acquiring. Patent infringement lawsuits are such nasty things. Sensory has had to build an arsenal of patents primarily as a defensive measure (even though Nuance is our friend and customer). Suing a company to acquire them seems kind of like spitting on girls to try and get a date.
The latest lawsuits I read about Nuance were with Zi Corp and with Vlingo. Zi is an intelligent text company that competes with Nuance’s Tegic (another acquisition). Zi and Tegic already battled it out years ago on patents, and after a long bitter feud they had it all settled. Guess not. Vlingo uses IBM’s speech technology, and it appears the lawsuit could be Nuance’s awkward way of courting, or possibly just revenge because a former Nuance CTO left to start Vlingo. Who knows??? I just like battles in the marketplace a lot more than in the courtroom.
So, the really interesting thing about Ketul’s article was the revenue numbers he showed for Nuance’s embedded handset business. It was close to $200M for 2008. Huh??? My wildest guesses from a few years back would have been Tegic was doing $40M, Voice Signal $20M, ART and other Nuance stuff might have totaled another $20M. So how did putting it together grow it from $80M a few years ago to just under $200M in 2008? I don’t think there’s been that much growth in the embedded market. Per unit royalty rates have probably dropped with adoption rates. $193M is like Nuance getting .15 or .20 on every handset sold everywhere in the world. I don’t think so. Let’s see, 30 or 40 cents on half the handsets sold? Nope.
Interestingly, Ketul says “Nuance’s embedded solutions are used for voice command in embedded devices and are clearly market leaders in the segment. However, voice command embedded solutions haven’t moved beyond the visionary phase of the technology adoption cycle and show no signs of crossing the chasm at its current rate of usage.” He shows a graph of technology adoption lifecycle that implies Nuance has about a 10% penetration (sounds OK to me for speech, but maybe a little low for adaptive text). So if 2008 has a 1.2B unit market for handsets, and Nuance has penetrated 120M units (10%), then that would imply they are making roughly $1.60/unit. I don’t know any handset guys that would pay even close to that for intelligent text and voice dialing. Go figure!!!
So, something is strange in analyst land, but I hope Ketul is right that the lawsuit-to-acquisition spree is coming to an end!
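For anyone who wants to check my back-of-the-envelope math, here is a small sketch. The only inputs are the numbers quoted in this post ($193M of embedded revenue and a 1.2B unit handset market); the variable and function names are just mine for illustration.

```python
# Inputs taken from the figures quoted in this post.
EMBEDDED_REVENUE_USD = 193e6   # Nuance embedded handset revenue cited for 2008
HANDSET_MARKET_UNITS = 1.2e9   # assumed 2008 worldwide handset volume

def revenue_per_unit(revenue_usd, units):
    """Average revenue per handset if `units` handsets carry the technology."""
    return revenue_usd / units

# Scenario 1: a royalty on every handset sold in the world.
per_unit_all = revenue_per_unit(EMBEDDED_REVENUE_USD, HANDSET_MARKET_UNITS)

# Scenario 2: a royalty on half of all handsets.
per_unit_half = revenue_per_unit(EMBEDDED_REVENUE_USD, HANDSET_MARKET_UNITS * 0.5)

# Scenario 3: the 10% penetration implied by the adoption graph.
per_unit_tenth = revenue_per_unit(EMBEDDED_REVENUE_USD, HANDSET_MARKET_UNITS * 0.10)

for label, value in [("100% of handsets", per_unit_all),
                     ("50% of handsets", per_unit_half),
                     ("10% of handsets", per_unit_tenth)]:
    print(f"{label}: ${value:.2f} per unit")
```

That works out to about 16 cents per unit at full penetration, about 32 cents at half, and about $1.61 at 10%, which is why none of the scenarios looks believable to me.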
Microsoft Gets Documents Talking May 28th, 2008
I just read about Microsoft’s new DAISY XML plug-in that will allow users to save text files created in Microsoft Word into DAISY XML, which is short for the Digital Accessible Information SYstem eXtensible Markup Language. DAISY XML tags and maps the text documents so they can be converted to eBooks and digital talking books later on. It’s designed for Microsoft Office Word 2007, Word 2003, and Word XP.
This seems like a move in the right direction, but I don’t get why it’s taken so long. I’ve been putting talking animated avatars into my PowerPoint presentations for about 8 years, by using a Sensory avatar/lipsync technology. It makes presentations very engaging to have an avatar pop up and start talking to the audience. Seems like Microsoft should add something like this to PowerPoint; if any Microsoft people read this, I’ll give it to you for almost free.
Hey, do you get the humor in Microsoft’s naming? The first singing computer was credited to IBM and it sang “A bicycle built for two” with the lyrics “Daisy, Daisy”. Arthur C. Clarke saw a demo of this and made his robot HAL from 2001: A Space Odyssey sing this when it was being shut down.
Internet Search via Cell Phones April 9th, 2008
I made it to a couple of interesting tradeshows over the last month. The Voice Search Conference was held at the Marriott hotel in San Diego, which provided a really nice setting; the show was very well organized and well attended. Voice Search is all about bringing the power (and revenues) from internet search engines to the cell phone market. Google, Microsoft, Yahoo, and others are getting into this in a big way.
On the first morning I accidentally went into the wrong room for breakfast and sat down with a bunch of people from another industry. They were really negative about phone-based speech recognition, and offered these opinions:
- “Oh like when you call somewhere and the phone says ‘Press or Say One’”
- “Yeah I tried it but it never works with my voice”
- “I hate that stuff, I just want to talk to a live operator”
I pointed out that it’s gotten a lot better than this! Directory services like 1800Goog411 (Google), 1800Call411 (Microsoft), 1800Free411 (Jingle) actually work quite well and do save time. Most of them are based on Nuance engines, which are very powerful server-based technologies. Nuance is the 800 pound gorilla in the speech space, because they’ve acquired pretty much every player in speech recognition (well, other than Sensory of course, but they certainly cleared away all our competitors in the embedded area). Microsoft, Google, and Yahoo have pretty large speech R&D teams, but I’d guess they all use Nuance IP in some fashion, probably to expand their language coverage, if not more.
I found it humorous when someone quoted a woman from Nuance who said “My boss told me never to give live demos at shows because they never work.” Novauris gave some of the best demos at the show, but sure enough they pushed the envelope until some stopped working. I do commend them for being willing to demonstrate technologically challenging concepts in front of a live audience. It can be something of a crap shoot showing off cutting edge technologies.
I spoke on a panel at the Voice Search Conference, and one of the other speakers was from IBM India. He gave a presentation about a telecom web that they are deploying so that users can use their phones to find and hear about service providers in India, basically through short audio messages like “Hello, I’m Pradeep the plumber. I have 12 years of experience doing all types of plumbing.” This is similar to searching the web and reading the short blurbs about different businesses, but instead hearing the entries from a telephone.
At CTIA Wireless 2008, the big cell phone show in Las Vegas, I had a chance to try the Vlingo voice search engine. Yahoo has licensed it already, and it is simply AMAZING! It is the closest thing to “natural language” and “context independence” speech recognition that I have ever seen. Vlingo provides a speech to text service that utilizes a thin client to server model in order to provide recognition in cell phones apps.
Bluetooth headsets were prominently on display at the show. Plantronics introduced a comfortable and cool looking headset that included a case which provides a 5 hour recharge. Great Concept!
The high point of the CTIA conference for me was BlueAnt Wireless winning an award for Best of Show in the peripherals category for their V1 Bluetooth Headset. BlueAnt is a smart and aggressive company that is making rapid inroads and finding a lot of success in the Bluetooth headset and carkit markets. The V1 is billed as the first voice-controlled headset, and it is based on Sensory’s BlueGenie Voice Interface, which gives the user the ability to control common functions like answering or rejecting calls and pairing devices vocally. It even has 1800Goog411 as a built-in command, meaning you’ll never have to press buttons to place a call to any business across the US. Now that’s what I call useful!
Power Outages January 7th, 2008
2008 is here, and in the Silicon Valley it comes with a series of powerful storms, winds up to 60 miles per hour and rain, rain, rain. Of course, what this means is power outages are upon us; a short one and the house will probably stay cold enough to not worry about the food going bad. We’ll build fires, light candles, load the flash lights with batteries, and when the power comes on, spend way too much time resetting our clocks.
Yeah, that’s my pet peeve. No one ever created a standard way to reset the time on clocks, so it always takes a bit of systematic experimentation to figure out exactly how to reset clocks and appliances like VCR’s.
But wait-Have you seen Sensory’s new time-set technology? This inconvenience could be a thing of the past if the clock uses a Sensory chip. Check out the YouTube video.
This is what customers have been asking us for years and years and the accuracy was never quite there, but we kept working on it. I’m happy to say, we’re there! Sensory now has a chip that sells in volumes for under $2 that can be integrated into clocks and uses voice recognition to set the alarm time with natural phrases like “Five thirty-five AM”. Recognizing digits in a natural context is one of the Holy Grails in speech recognition, and I’m proud to say ours works very accurately. Of course, shutting off alarms by voice commands or creating hands-free requests like “What time is it?” can be done as well.
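To give a feel for what a “natural time” grammar has to cover, here is a toy sketch that turns an already-recognized phrase such as “Five thirty-five AM” into a 24-hour clock setting. It is only an illustration and not how Sensory’s chip works: it operates on text rather than audio, it handles just the hour / minute / AM-PM pattern, and every name in it is hypothetical.

```python
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def words_to_number(words):
    """Convert number words such as ['thirty', 'five'] or ['oh', 'five'] to an int."""
    if not words:
        return 0                                  # no minutes spoken: on the hour
    if len(words) == 2 and words[0] == "oh" and words[1] in UNITS:
        return UNITS[words[1]]                    # "oh five" -> 5
    if len(words) == 1:
        for table in (UNITS, TEENS, TENS):
            if words[0] in table:
                return table[words[0]]
    if len(words) == 2 and words[0] in TENS and words[1] in UNITS:
        return TENS[words[0]] + UNITS[words[1]]   # "thirty five" -> 35
    raise ValueError(f"not a number phrase: {' '.join(words)}")

def parse_time(phrase):
    """Parse '<hour> [<minutes>] AM|PM' into (hour_24, minute)."""
    tokens = phrase.lower().replace("-", " ").split()
    if len(tokens) < 2 or tokens[-1] not in ("am", "pm"):
        raise ValueError("expected a phrase ending in AM or PM")
    hour = words_to_number(tokens[:1])
    if not 1 <= hour <= 12:
        raise ValueError("hour must be between one and twelve")
    minute = words_to_number(tokens[1:-1])
    # 12 AM is midnight (0) and 12 PM is noon (12).
    hour_24 = hour % 12 + (12 if tokens[-1] == "pm" else 0)
    return hour_24, minute

alarm = parse_time("Five thirty-five AM")
evening = parse_time("eight forty five PM")
midnight = parse_time("twelve oh five AM")
```

Even this toy version has to accept well over a thousand distinct phrases, and a real recognizer must also tell apart similar-sounding numbers like “fifteen” and “fifty” from the audio itself, which is where the real difficulty lies.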
I hope to see low-cost clocks for under $30 hit the market by the end of the year that incorporate Sensory’s chips featuring this awesome new technology. It’s REALLY COOL, and I’m REALLY EXCITED about it!
Robotic Speech October 31st, 2007
Last weekend I helped my daughter Samantha create a Halloween costume. Actually it was 2 costumes, because she wanted one for her friend also. They wanted to be robots this year. I took a couple of old cardboard boxes, cut out holes for arms and legs, attached old circuit boards and switches to the sides, and put pieces of dryer vent hose into the arm holes. Then I painted the whole thing silver.
It looked pretty good, so good that my 4-year-old son Sam put it on. His arms didn’t make it to the end of the makeshift sleeves and his head barely popped out the top, but he came walking into the kitchen wearing it and said in a monotonic ‘robot voice’: “I am a robot. I will destroy you.”
We all had a good laugh over that, but I wondered how he had learned what a robot sounds like and what they say. I guess that’s the power of the media. Interestingly though, the media has it all wrong. Speech output technologies even in their infancy never sounded like monotone robots.
Speech compression schemes digitize a real waveform and compress the data, which makes it increasingly unnatural and distorted as the compression rates drop, but it never becomes monotone as the inflections are still maintained. Likewise, approaches to TTS (text-to-speech) have never been robotic and monotonic. The early DecTalk and formant synthesis approaches sounded more like someone with an intoxicated Swedish accent than the traditional bot talk, and today, TTS and speech compression techniques sound close to perfect.
On the other hand, where the media has made speech output worse in robots, they have done the opposite for speech recognition. The media portrays robotic recognition as flawless. The Star Trek computer or the Lost in Space Robot never said “What did you say? I can’t understand, please repeat. Take me to a quieter environment.”
Speaking of robots, I just spoke at Robo Development 2007 and kicked off my speech by telling the story above. My favorite part of the show, however, wasn’t all the interesting people I met during my talk; it was walking through the exhibit space. I was very impressed with Hanson Robotics’ Zeno Robot. As I spoke with David Hanson, he looked over at my name badge and said “Oh Sensory, we’re using both your FluentSoft and your FluentChip technologies!”
It’s always fun to run into a cool new application that uses Sensory technology when I’m not expecting it.