IoV – The Internet of Voice
Forget the Internet of Things – it’s a bubble. The majority of products currently claiming to be IoT devices are just the same, vertical M2M products we’ve always had, but taking the opportunity to benefit from a rebrand. Most of the rest of the IoT is the wet dream of Venture Capitalists and Makers who think that by overfunding and stimulating each other’s egos in a frenzy of technical masturbation, they can create a consumer market for the Internet of Things. As the IoT slips slowly backwards into the foothills of Gartner’s Hype curve you need to look elsewhere to find the real Internet device opportunity, which is only just emerging. It’s the IoV, or the Internet of Voice.
The problem that the current IoT paradigm has is that it’s mostly about collecting data and then applying algorithms to extract value from the data. That’s a difficult job. You need to make the devices, work out how to connect them and then hope you can find something valuable within the data to engage the customer. All of that takes time, not least the time to get a critical mass of products out into the field. The Catch 22 which most business plans ignore is that you need to deploy tens of thousands of devices to accumulate enough data before you can even see if there’s anything of value in it. But without an upfront value, people are loath to buy the devices. Everyone, from wearables manufacturers to smart cities, is discovering that it’s not a very compelling business case, not least because it needs fairly technical consumers to install everything in the first place.
The Internet of Voice takes a different route. Instead of expecting users to know anything about the IoT, they just get to ask questions and then get answers. No more buttons, no more keyboards, no more coding, just ask. But it has the power to control everything we come into contact with. It could mark the end of our love affair with smartphones and is probably the biggest threat that Apple faces today.
In many ways, the Internet of Voice is just the latest step in a constant journey of human enquiry. From the questions posed to the Delphic oracle, to the more recent fictional incarnations of HAL and Her, humanity has been captivated by asking questions and getting an apparently intelligent response. Today, we’re at the point where technology is moving that from fiction to fact and users are finding it remarkably addictive.
Rather surprisingly, given how oral our societies are, voice has often been the poor cousin of video. Telephone voice quality has frequently been terrible. Bluetooth headsets have performed a useful function in allowing phone calls whilst driving, but for most users, or rather recipients of a call from someone using a headset, the best one could hope for was that the voice was recognisable. The more upmarket section of the industry has worked hard to improve voice quality, but in general voice quality has been mediocre, with users content with old fashioned, telephony quality. Trying to do voice recognition through a headset often felt like an exercise largely dependent on chance.
The perception of voice has changed dramatically over the past few years, although bizarrely, it’s received limited recognition. The change started with Siri – Apple’s voice assistant, which was copied and improved on with Google’s Voice Search (now Now) and Microsoft’s Cortana. Users have taken to talking to their phones; last May, Sundar Pichai, Google’s CEO, reported that 20% of queries on its mobile app were now voice queries. However, the best indication of what voice could do came when Amazon launched Alexa on the Echo at the end of 2014.
Alexa introduced users to the concept of talking to the Internet whenever they wanted to know something, buy something or play music. It signalled a major change by removing the need to interact with any device; you no longer needed to take a phone out of your pocket or press a button – you just spoke a key word or phrase to the internet. It’s difficult to overestimate the importance of this change. Whilst some may find it creepy, just asking a question is so natural that it’s difficult to understand why it has taken so long to get there. The reason for that delay is that voice recognition is difficult. It’s needed a number of different technology enablers to come together: reliable, fast internet speeds for users, low cost, low latency cloud services and the machine learning for voice recognition to move it from novelty to everyday reality. Put them together and we’re now at that point where we can envisage a conversational internet.
Once you can talk to the Internet things start to change. Amazon, Google and Microsoft regularly present slides that show this as the natural evolution of user input, as we progress from keyboards to mice to smartphones to just talking. They refer to it as the new “conversational interface”, signifying that the internet is undergoing a hand to mouth evolution.
Why is this important? In five years, if voice recognition continues to improve at its current pace, then people may look back and wonder why they ever used a keyboard. But there’s another aspect to that evolution – people may also wonder why they ever tapped a smartphone. If all you need to do to get information is to vocalise your question, then it may not take long for people to fall out of love with their smartphones. In the same way that Apple destroyed the feature phone market, Amazon may equally well destroy the smartphone market.
The reason for that is that whilst Siri, Voice Search and Cortana have mainly been used as keyboard replacements, taking away the pain of typing on a smartphone, Alexa does something else. For many users it has become a companion. In the same way as normal conversation, you don’t need to take anything out of your pocket or press a button – you just talk. In an interview with New Scientist, Daren Gill, director of product management for Alexa, says he has been surprised by how often people try to engage the assistant in purely social interaction. “Every day, hundreds of thousands of people say “good morning” to Alexa,” he says. “Half a million people have professed their love. More than 250,000 have proposed. You could write these off as jokes, but one of the most popular interactions is “thank you” – which means people are bothering to be polite to a piece of technology”.
There is little doubt that users find it appealing. From its initial application of ordering more things from Amazon, its use has expanded, thanks to Amazon’s approach to allowing anyone to add in skills, where additional keywords can direct the conversation to other companies. “Alexa, ask Meat Thermometer the temperature for pork?” will tell you how to cook a piece of pig. “Alexa, ask Tube Status about delays on the Victoria line” tells me about delays into the office. “Alexa, ask Wine Mate what goes with zebra?” tells Amazon something about my culinary experiments.
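The way those spoken phrases hand the conversation over to another company can be sketched in a few lines. This is only an illustration of the "wake word, ask, skill name, request" pattern in the examples above, not Amazon's implementation: the `SKILLS` registry and the `route` function are hypothetical names, and it works on text that has already been transcribed.

```python
import re

# Hypothetical registry of skill names, taken from the examples in the text.
SKILLS = ("Meat Thermometer", "Tube Status", "Wine Mate")


def route(utterance, wake_word="Alexa", skills=SKILLS):
    """Split '<wake word>, ask <skill> <request>' into (skill, request).

    Returns None when the utterance does not follow that pattern or
    names a skill that is not in the registry.
    """
    pattern = re.compile(
        rf"^\s*{re.escape(wake_word)}[,\s]+ask\s+(.*)$", re.IGNORECASE
    )
    match = pattern.match(utterance)
    if not match:
        return None
    rest = match.group(1)
    # Try longer names first so a short name cannot shadow a longer one.
    for name in sorted(skills, key=len, reverse=True):
        if rest.lower().startswith(name.lower()):
            request = rest[len(name):].strip(" ?.!")
            return name, request
    return None


if __name__ == "__main__":
    print(route("Alexa, ask Tube Status about delays on the Victoria line"))
```

Everything after the skill name is passed on untouched, which is the point: once the keyword has been matched, interpreting the request is somebody else's job.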
Voice recognition in the cloud, along with the AI to interpret and respond intelligently (which is a very different task to traditional voice to text) is highly disruptive. The last major disruption we saw in the mobile market was Apple’s introduction of the iPhone. That was disruptive because it changed the dominant skillset from RF expertise (the preserve of the previous cadre of suppliers – Nokia, Ericsson and Motorola) to User Experience. The iPhone won customers’ hearts because of its ease of use and what you could do with it. It can be argued that that change is what allowed Samsung to rise to its number one position in the handset market. The incumbents (Nokia, Ericsson and Motorola) were too arrogant to copy; Samsung wasn’t, and the rest is history.
In the same way that user experience changed the game in Apple’s favour, the AI behind voice recognition is poised to change the game again. This time, the companies which will succeed are those with the cloud AI expertise. Amazon has made the running, leveraging its AWS experience. Google is well placed to challenge, helped by its acquisition of DeepMind, who are already showing their capabilities with Google’s Neural Machine Translation. Microsoft’s recent acquisition of Maluuba shows that it intends to be one of the key players. However, this puts physical product companies like Apple and Samsung at a distinct disadvantage. Even with Siri and Viv, without the AI expertise to make the IoV compelling, they could quickly slip from market leaders to low margin followers.
Although Amazon, Google and Microsoft have played, or are playing with mobile hardware, this is not a hardware play – it’s an AI play, where the company which can acquire the most voice data (i.e. users) will be best placed to win. It’s remarkably cheap and easy for any manufacturer to incorporate the basics in their product, as all the local hardware has to do is to recognise a key word or phrase – “Alexa” in Amazon’s case, “OK Google” in Google’s, at which point it then streams the voice signal to the cloud and gives ownership of the user to Amazon or Google. It’s why the keyword is so important – it becomes the brand, rather than the device which provides the route through to the cloud. This is where Amazon has a clear advantage. Alexa is sufficiently divorced from the Amazon name that other brands are happy to use it – something which Amazon is actively encouraging, both through Alexa Services, which let hardware vendors build it into their products, and Alexa Skills, which allows applications to use their AI. At CES this year Alexa was generally considered to be the star of the show, even though Amazon weren’t present. More and more companies are jumping on the bandwagon. Ford lets you talk to your car, Huawei and LG let you talk to your phone and fridge; ADT lets you talk to your burglar alarm “Alexa, can you tell the burglars they’re naughty people”, while Brinks Array have it in their door lock, so your burglars can ask to be let out whilst telling all of your other voice activated goodies that they’re about to get new owners. Some applications will be trivial and die, but with the proliferation of things to talk to and a growing range of Alexa skills to provide answers, everything is in position for users to change the way they interact with the internet. A further advantage is that “Alexa” does not have the implicit brand baggage of “OK Google”, making Amazon a more attractive partner for many who don’t want to water down their own brand.
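The division of labour described above, where the device does nothing but listen for the key phrase and then hands everything else to the cloud, can be sketched as a simple gate. This is a toy model under loud assumptions: real devices match acoustic patterns rather than words, so the lowercase word tokens here are only stand-ins for audio frames, and `gate_frames`, `wake_phrase` and the `<silence>` end marker are hypothetical names rather than any vendor's API.

```python
def gate_frames(frames, wake_phrase=("ok", "google"), end_marker="<silence>"):
    """Return the utterances that would be streamed to the cloud.

    `frames` is a list of lowercase word tokens standing in for audio
    frames (an assumption made purely for illustration). Nothing is
    forwarded until the wake phrase has been heard locally.
    """
    streamed = []
    current = None   # utterance being streamed, or None while idle
    window = []      # most recent tokens seen while idle
    n = len(wake_phrase)
    for frame in frames:
        if current is None:
            # Idle: keep only enough history to compare with the wake phrase.
            window = (window + [frame])[-n:]
            if tuple(window) == tuple(wake_phrase):
                current = []   # wake phrase heard: start streaming
                window = []
        elif frame == end_marker:
            streamed.append(current)   # utterance finished: back to idle
            current = None
        else:
            current.append(frame)
    if current is not None:
        streamed.append(current)   # input ended mid-utterance
    return streamed


if __name__ == "__main__":
    heard = ["nice", "weather", "alexa", "play", "jazz", "<silence>",
             "private", "chat"]
    print(gate_frames(heard, wake_phrase=("alexa",)))
```

Whoever owns the wake phrase owns everything that follows it, which is why the choice of keyword matters more than the hardware it runs on.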
Ironically, Google tell the story better. Google’s Gill is clear when he makes the point that “using speech in this way means the interface almost disappears”. Speech is so second nature, that as long as the AI and applications respond correctly, then talking to the internet becomes natural. We will interact with it in the same way we do with friends. (Although anyone interested in how much we have already lost that conversational skill should read Sherry Turkle’s “Reclaiming Conversation”.)
This is why the IoV has the potential to be so disruptive. The history of computing and telephony has always involved touch – tapping a keyboard or holding a phone. The Internet of Voice removes that constraint – we just converse via a microphone which may exist on any number of household products. Futuresource Consulting reckon that 6.3 million voice assistants were shipped in 2016; Amazon admit that they had difficulty meeting demand for Echos in the run-up to Christmas. If we can believe CES, this is the year when we’ll start talking to (or through) tens of millions of devices.
Once we start talking rather than touching or tapping, it won’t take long to lose our connection with our smartphones. We’ll still need them for connectivity, but without a need to touch them to initiate a question, they may quickly become less relevant to our lives. Apple’s decision to remove the 3.5mm jack is inadvertently driving this transition even faster, as it encourages manufacturers to put more functionality into their wireless headsets and earbuds.
Part of that functionality will be smart microphones which can listen for the key phrases. Knowles – a manufacturer of miniature microphones for phones and hearables – have already launched their VoiceIQ, a low power, always listening, voice detector which connects to a voice DSP for key phrase detection. Within the next year I expect to see these functions condensing onto a single chip and appearing as standard in most hearables. Makers have already demonstrated Echo functionality with Raspberry Pis and a $9 microcontroller board. For any device with a slightly better than minimum spec microcontroller and an internet connection, that’s just some additional code.
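The two-stage arrangement, a cheap always-on detector in front of a more capable key phrase detector, is easy to picture in code. The sketch below is not how VoiceIQ or any particular chip works; it assumes a plain energy threshold as the first stage, with `rms`, `first_stage` and the `threshold` value chosen purely for illustration, and each frame a non-empty list of samples between -1.0 and 1.0.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one non-empty frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def first_stage(frames, threshold=0.1):
    """Return the indices of frames loud enough to wake the second stage.

    Only these frames would be handed to the more power-hungry key
    phrase detector; the rest are discarded by the always-on stage.
    """
    return [i for i, frame in enumerate(frames) if rms(frame) >= threshold]


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.5, -0.5], [0.01, 0.02]]
    print(first_stage(frames))
```

The saving comes from how rarely the second stage runs: in a quiet room almost every frame is dropped before the expensive processing starts.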
The IoV should be good news for a range of other largely unseen companies with expertise in analytics and voice processing. Lesser-known audio processing specialists like Alango, who have been putting voice enhancement algorithms into cars for years, are looking at how they can leverage their IP in this new market. Other key enablers are also making their move, as demonstrated by ARM’s recent introduction of their Audio Analytic-based ai3 Artificial Audio Intelligence platform for Cortex chips and MindMeld’s announcement of a deep-domain conversational AI platform.
All of this takes functionality and user ownership away from the smartphone. So how quickly will we fall out of love with them? That’s difficult to predict. A recent survey of users about their favourite phone features suggests it may be sooner rather than later. The top three applications were GPS directions, messaging and setting alarms without the need to touch your phone. The growth of hearable devices can take care of directions and alarms. That leaves messaging, and it will be interesting to see what voice does to that. Incorporating voice into its product may prove to be Facebook’s biggest challenge yet. If someone else gets voice right for social media it could be the chink in the armour which ends Facebook’s dominance and consigns it to becoming the next MySpace. Voice may also enable new services which attract our attention. We’re already seeing them emerging on Alexa Skills. I particularly like the Earplay skill on Alexa, which lets you take part in telling a story, heralding a new level of user interaction.
It’s clear that the battle between Google and Amazon is ramping up. Amazon is not just engaging with developers to integrate Alexa and develop skills, but has set up a $100 million Alexa Fund to invest in companies that want to innovate with voice. Google has launched its Assistant and Now and potentially has the better analytics engine, but needs to get users talking to it to build up its response AI database. Microsoft is keeping its powder dry, whilst Apple with Siri and Samsung with Viv are increasingly looking like hardware vendors whose voice roadmap doesn’t go much beyond voice to text. There is little to suggest they will be contenders. Phone vendors also have a difficult choice – do they support IoV applications like Alexa and Now, which direct the voice questions to a competitor, or do they try to block them in favour of their own options? If they block them, they risk alienating the consumer, speeding up the point at which users defect from that phone.
Amazon has an interesting advantage in terms of monetisation, as they get a cut of revenue when users place an order using Alexa. The investment firm Mizuho reckons that could be as much as $7 billion by 2020. In contrast, Google’s revenue model focuses on advertising and it’s not clear how that can be mapped onto voice. But they have the cash to ignore revenue in the short term and buy customers while they improve their user experience. They should also have the better infrastructure to support smart home devices, leveraging the work they’ve already done with Nest and Thread. Despite that, and recently signing up NVIDIA, who have smart home aspirations, Alexa seems to be making the running.
Although the battle is between Amazon and Google, it will not stop others trying to define their own niche in the Internet of Voice. Oakley’s Radar Pace is one example – a spectacle in all senses of the word. It uses AI as your personal trainer, allowing you to ask questions like: “OK Radar, what’s today’s workout?”, “OK Radar, what’s my power?”, but presumably not, “OK Radar, do I look like a dick when I wear this?” There are times when you realise companies have been seduced by the promise of too much technology. As Saint-Exupéry said in his memoir “Wind, Sand and Stars”, “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away” – a maxim which many of today’s manufacturers should try to understand. It’s why the smartphone has nowhere else to go – the next step in simplicity is the Internet of Voice. Amazon’s Echo shows they have taken that maxim to heart.
So, “Alexa – I think I want to divorce my smartphone, as I don’t love it anymore. Will you marry me? I want your babies.”
Nick, very good write-up. A couple of thoughts:
1) I can see so much more than I can hear; so while I agree in general with the disruption/disintermediation of the smartphone into “n” devices that my AI interacts with, we still need a viewing screen (could be glass, a flexible screen, a projection that works outdoors, etc…). Certainly it’s time that I no longer have to lug my phone with me to the gym or on bike rides, but still be connected and able to listen to my media and/or take pictures.
2) Amazon is the first one to really attempt a cross-app ecosystem and given AWS and their marketplaces and not having to protect an ad exchange model based on search/content, then they have a good chance of succeeding.
3) Sorry, but the most critical aspect of iPhone was access autonomy for iOS. Steve Jobs single-handedly got us back on equal access track which started in 1984 and got us to this point; something the Republican FCC had deviated from in the early to mid 2000s. Lower interoperability trumps higher interoperability in the stack any day of the week; another way of saying this is that the lower down you go and can exert control, the easier it is to maintain monopoly.
While value is indeed concentrated towards the core and top of the informational stack and costs borne more at the bottom and edge, monopoly control is harder to maintain higher up and more towards the core one goes. Sounds a little paradoxical, but I believe it to be true; mostly due to the diversity of demand and rapid obsolescence of supply.
4) last thought, rather than a “winner takes all” outcome or model, wouldn’t an exchange model where the various AI leaders traded (or shared with our knowledge) in our voice prints across a variety of contexts be far more generative for all in the long-run?
Not sure how we aren’t doomed if we program this winner takes all model into the AI itself.