By Mark Lippett, CEO, XMOS
When searching for content on a television set today, the “process” always seems to start with hunting for the correct remote control unit, then scrolling through multiple pages, and eventually resorting to typing search words into a tiny keypad. Add the time this takes, plus the repetitive keystrokes needed to select the right character, and the experience amounts to a frustrating scramble. This is what every household in the country – nay, world! – goes through on a daily basis!
In an age where we can communicate around the world seamlessly, surely it should be possible to just utter words like “Hey, TV, find ‘Fast and Furious 7’!” from anywhere in the room?
Little wonder manufacturers and content providers are focusing on integrating voice control into their products and services. Consumers familiar with using voice to search and stream music through their smart devices are starting to expect the same ease when searching for content on the TV set, which has long been the main entertainment hub in most homes.
The ways in which consumers search for television content have evolved over the years; not that long ago, we had to walk across the room to change channels.
Apart from any (willing) youngsters within the family, the first remote control was a tethered device with two buttons, which simply ‘moved’ the TV button circuit into the remote. The connecting wire soon disappeared, thanks to infra-red technology. Today, most remote control units use infra-red light to signal the receiver, where a microprocessor decodes the pulses into binary code. The limitation of infra-red, however, is that it still requires line-of-sight between transmitter and receiver.
The arrival of push-to-talk (PTT) marks a significant shift in the way we interact with the TV. It consists of just pressing a button on the remote control unit and voicing the search words.
Remotes with a wireless connection to the TV still offer standard button control but, thanks to a built-in microphone, also allow near-field voice capture and transmission of short audio segments to a cloud-based ASR (automatic speech recognition) service for processing. However, whilst near-field voice via the remote adds a fresh search experience, it does not ‘liberate’ us: we still find ourselves hunting around the house for the right remote control unit and pushing buttons.
Enter the ‘just talk’ experience. Here, voice control is built into the television set itself. With far-field voice capture, there’s no need for a remote control unit or pushing buttons. You simply tell the TV set what you want to watch, from anywhere in the room. This hands-free experience is where voice hits its full stride.
Algorithms are purpose-designed to capture a clean voice command in a modern – and often noisy – living space.
Figure 1: DSP pipeline for capturing voice from noise in remote control applications
The digital signal processing pipeline (Figure 1) captures voice from noise and cleans it up so the command can be processed by an ASR service, like Amazon Alexa.
When a voice command is uttered to the TV set, in most cases it’s already playing some audio. In the background there could be a phone ringing, people talking, kitchen appliances and air conditioning whirring, and a range of other noises from outside the room – traffic rumble, animal sounds, and so on. It’s quite the challenge to extract voice commands from all that.
The system is always listening for the wake word, e.g. “Alexa” or “TV”. The stereo acoustic echo canceller suppresses any audio that’s playing through the device itself, so the wake word can still be heard over it. This enables “barge-in”: when it hears the wake word, the system immediately stops the audio track. The Automatic Delay Estimator synchronises the audio reference signal with the microphone audio to support a smooth, real-time barge-in.
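To make the delay estimation and echo cancellation steps more concrete, here is a minimal Python sketch. It is an illustration only, not the algorithm used in any particular product: it assumes a single (mono) channel, estimates the delay with a simple cross-correlation search, and removes the echo with a textbook NLMS (normalised least mean squares) adaptive filter. The function names, the 37-sample delay, the 0.6 echo gain and the filter settings are all hypothetical values chosen for the demonstration.

```python
import random


def estimate_delay(reference, mic, max_delay):
    """Return the lag (in samples) at which mic best matches reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        # Correlate reference[n] with mic[n + lag] over the overlapping region.
        score = sum(r * m for r, m in zip(reference, mic[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of reference from mic (NLMS)."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for r, m in zip(reference, mic):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # what remains once the echo is removed
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual


# Hypothetical test signal: the TV plays white noise, and the microphone
# hears only a delayed, attenuated copy of it (no speech, for simplicity).
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 37
mic = [0.0] * true_delay + [0.6 * s for s in reference[:-true_delay]]

# Step 1: find the delay and align the reference with the microphone signal.
lag = estimate_delay(reference, mic, max_delay=100)
aligned_ref = [0.0] * lag + reference[:len(reference) - lag]

# Step 2: adaptively cancel the echo using the aligned reference.
residual = nlms_echo_cancel(aligned_ref, mic)

print("estimated delay (samples):", lag)
print("echo energy, last 200 samples:", sum(m * m for m in mic[-200:]))
print("residual energy, last 200 samples:", sum(e * e for e in residual[-200:]))
```

Aligning the reference first keeps the adaptive filter short: it only has to model the room’s response, not the bulk delay through the playback path. A real far-field system would also have to cope with stereo playback, reverberation and a talker speaking over the echo, none of which this sketch attempts.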
The interference canceller scans the soundscape of the room and suppresses the point noise sources (all the steady noise that comes from a fixed direction). Noise suppression reduces the diffuse noise, isolating and subtracting the sounds from the signal to give further clarity to the voice command. Lastly, the Automatic Gain Control tunes the audio stream for the output channel.
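As an illustration of the final stage, the following Python sketch shows one simple way an automatic gain control can work: it tracks the running power of the signal and scales each sample so that the output level approaches a target. This is a sketch under stated assumptions rather than a description of any specific product; the function name and the `target_rms`, `attack` and `max_gain` values are hypothetical.

```python
import math


def automatic_gain_control(samples, target_rms=0.1, attack=0.01, max_gain=20.0):
    """Scale samples so that their running RMS level approaches target_rms."""
    power = target_rms ** 2  # running estimate of signal power
    out = []
    for s in samples:
        # Exponentially smoothed power estimate (one-pole low-pass filter).
        power = (1.0 - attack) * power + attack * s * s
        # Gain needed to reach the target, capped so silence is not blown up.
        gain = min(max_gain, target_rms / math.sqrt(power + 1e-12))
        out.append(gain * s)
    return out


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical input: a quiet tone, as if the talker were far from the TV.
quiet = [0.01 * math.sin(2.0 * math.pi * n / 20.0) for n in range(4000)]
levelled = automatic_gain_control(quiet)

print("input RMS (last 1000 samples): ", rms(quiet[-1000:]))
print("output RMS (last 1000 samples):", rms(levelled[-1000:]))
```

The cap on the gain is a deliberate design choice: without it, a pause in speech would cause the stage to amplify whatever residual noise remains.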
The quality of all these algorithms, working individually and together, is crucial; they make the system wake up on command and capture the voice signal clearly so the speech recognition system can process and respond to it accurately.
Intelligent voice interfaces
Far-field voice is just the start. Intelligence and improved security in voice interfaces will continue to change the way we interact with TVs and content. Intelligent sensors will alert the TV set when a person has entered the room and bring the device from deep sleep to command readiness. The TV will not only identify the person in the room but also have the intelligence to apply each user’s preferences to the content, freeing us from complex menu structures.
Intelligent voice interfaces will enable effective multi-tasking in the smart home. Controlling the TV will be effortless, without having to stop other activities such as checking the phone for messages, cooking, etc.
One day, our children will look upon the remote control as quizzically as they look upon the cassette tape today.
With a set now in most rooms of the home, the television is still the prime candidate to be the central smart-home hub. Built-in voice will enable it to connect to an ecosystem of smart devices, including lights, appliances and security systems, all controlled by simple voice commands. Voice is definitely the future of television control.