By Dave Betts, Chief Science Officer, AudioTelligence
Contactless technology is proving invaluable during the current coronavirus pandemic – avoiding the need to exchange cash or press buttons on a chip-and-PIN machine. For many businesses, however, the only option for taking a customer’s initial order without face-to-face contact is a touchscreen, and a shared touchscreen actually risks spreading infection more widely. Could the latest audio technology provide a contactless solution?
With COVID-19 here to stay for the foreseeable future, businesses emerging from lockdown are having to find new, safe ways of servicing their customers – and safe means hygienic.
When it comes to touchscreen kiosks, that includes measures such as antibacterial coatings. But that misses an important factor: the kiosk has to be perceived to be safe and hygienic by customers, as well as being genuinely safe and hygienic. An anti-COVID coating is not likely to cut it for the general public as many people won’t trust it, no matter how good it really is.
So could the latest audio technology provide a contactless solution?
Instead of working through an on-screen menu and pressing icons, customers could simply talk to the kiosk and ask for a burger and chips, use contactless payment and then pick up their order from a service counter. Physical contact would be reduced to a single one-way handover from the short-order chef to the customer. The whole process would also be faster – further reducing the risk of anyone coming into contact with the virus.
However, there are two drawbacks with this scenario – the first is general background noise and the second is other people at neighbouring kiosks who will be making additional noise as they place their own orders.
Automatic speech recognition (ASR) has been around for decades and reliability is increasing all the time, although it is most reliable on limited vocabulary systems. But that’s fine as most kiosks only need a limited vocabulary anyway. However, if you add in a noisy environment with lots of other people around, things are not so great for ASR. And, of course, touchscreen kiosks are often in high street shops with a high customer footfall – and therefore far from being quiet.
It’s also likely there will be several kiosks together – so how can you ensure that a customer’s order for burger and chips doesn’t get mixed up with the person at the neighbouring kiosk ordering chicken nuggets? And what if their child tries to add a sneaky ice cream to the order?
It’s tempting to think that noise suppression is the answer – after all, there have been some amazing advances in the last few years, fuelled by the artificial intelligence revolution. The results can be great if the signal is already clearly intelligible. But if you listen to what a microphone picks up in a noisy shopping mall, the raw speech is hard for a human being to understand – let alone an ASR system. And that’s exactly where noise suppression falls down – it can’t pull out the speech cleanly from such high noise levels.
You might think beamforming holds the key – but this is surprisingly difficult to do well, unless you are willing to invest in a large number of expensive calibrated microphones. In general, unlike noise suppression, beamforming can provide some improvement in intelligibility – but nowhere near enough for a general high-volume kiosk.
This is where blind source separation (BSS) comes into its own. It simply needs between four and eight off-the-shelf microphones, with no calibration required – the sort of microphone found in a mobile phone. The array geometry is flexible – anything between 5cm and 30cm across. Ideally there would be a clear ‘line of sight’ to the customer and, if space allows, a 2D array is preferable – but a linear array does also work.
BSS can separate the incoming audio back into its constituent sources automatically. So not only is the customer’s voice brought out of the background noise, it is clearly distinguished from the voice of the person at the neighbouring kiosk. It can even separate the voice of the customer from their child trying to add that sneaky ice cream.
This is all done with data-driven machine learning. The system is continuously analysing the sound field and can pick out the speech of the person in front of the kiosk – adapting automatically to the lunchtime rush or the quiet of a 2am motorway pitstop. Just like a human – but no social distance required.
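To give a feel for what ‘separating the audio back into its constituent sources’ means, here is a deliberately tiny sketch. It is not the algorithm used in a commercial BSS product: it assumes just two microphones, two talkers and no room echo, and it uses a textbook technique (whiten the signals, then search for the rotation that makes the outputs least Gaussian, measured by kurtosis). The synthetic ‘customer’ and ‘neighbour’ signals, the mixing weights and all function names are illustrative assumptions.

```python
import math
import random

def correlation(a, b):
    """Pearson correlation between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def whiten(x1, x2):
    """Remove the mean, decorrelate the two channels and scale to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    c11 = sum(v * v for v in a) / n
    c22 = sum(v * v for v in b) / n
    c12 = sum(p * q for p, q in zip(a, b)) / n
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(theta), math.sin(theta)
    l1 = c11 * c * c + 2.0 * c12 * c * s + c22 * s * s
    l2 = c11 * s * s - 2.0 * c12 * c * s + c22 * c * c
    z1 = [(c * p + s * q) / math.sqrt(l1) for p, q in zip(a, b)]
    z2 = [(-s * p + c * q) / math.sqrt(l2) for p, q in zip(a, b)]
    return z1, z2

def excess_kurtosis(y):
    """Excess kurtosis of a zero-mean signal (0 for a Gaussian)."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def separate(x1, x2, steps=90):
    """Grid-search the rotation of the whitened signals that maximises non-Gaussianity."""
    z1, z2 = whiten(x1, x2)
    best = None
    for k in range(steps):
        phi = (math.pi / 2.0) * k / steps
        c, s = math.cos(phi), math.sin(phi)
        y1 = [c * p + s * q for p, q in zip(z1, z2)]
        y2 = [-s * p + c * q for p, q in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]

# Synthetic stand-ins for two talkers: spiky (Laplacian) noise, loosely speech-like.
rng = random.Random(0)

def laplace_noise(n):
    return [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]

customer = laplace_noise(3000)
neighbour = laplace_noise(3000)

# Each microphone hears both talkers (assumed instantaneous mixing, no echo).
mic_1 = [1.0 * c + 0.8 * n for c, n in zip(customer, neighbour)]
mic_2 = [0.5 * c + 1.0 * n for c, n in zip(customer, neighbour)]

out_a, out_b = separate(mic_1, mic_2)

raw_match = abs(correlation(mic_1, customer))
separated_match = max(abs(correlation(out_a, customer)),
                      abs(correlation(out_b, customer)))
print("raw microphone vs customer:", round(raw_match, 3))
print("best separated output vs customer:", round(separated_match, 3))
```

The separation is ‘blind’ in the sense that `separate` never sees the mixing weights or the original signals; it only sees the two microphone feeds. The outputs come back in arbitrary order and with arbitrary sign, which is why the comparison takes the best absolute correlation. Real kiosks face echoes, moving talkers and more than two sources, which is where the four to eight microphones and continuous adaptation described above come in.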
As the concepts behind BSS are mathematical, it can be implemented on any general computing device. The central processing unit (CPU) cost is well within the capabilities of a modern ARM processor that supports single precision floating point. Then it needs memory – as BSS is a data-driven approach, it needs a frame store to keep all the audio it’s analysing. For 16kHz audio from eight microphones, that frame store could be as much as 40MB – again comfortably within reach of a modern ARM-based device. Optionally, a camera with face detection can help ensure the correct customer is selected.
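As a rough sanity check on that 40MB figure, the arithmetic below estimates how much audio history such a frame store could hold. The sample rate, microphone count and 40MB come from the text; the bytes-per-sample values are assumptions, since the internal format of the frame store is not stated (it may well hold processed frames rather than raw samples).

```python
SAMPLE_RATE_HZ = 16_000          # from the article
NUM_MICROPHONES = 8              # from the article
FRAME_STORE_BYTES = 40 * 10**6   # the article's upper figure, read as decimal megabytes

def seconds_buffered(bytes_per_sample):
    """Seconds of multi-channel audio that fit in the frame store."""
    bytes_per_second = SAMPLE_RATE_HZ * NUM_MICROPHONES * bytes_per_sample
    return FRAME_STORE_BYTES / bytes_per_second

float32_seconds = seconds_buffered(4)  # assumption: single-precision float samples
int16_seconds = seconds_buffered(2)    # assumption: 16-bit PCM samples

print(f"float32 samples: {float32_seconds:.1f} s of audio")
print(f"int16 samples:   {int16_seconds:.1f} s of audio")
```

Under these assumptions, 40MB holds a little over a minute of eight-channel audio as single-precision floats, or roughly two and a half minutes as 16-bit samples.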
Then there’s the ASR and speech-to-intent system. Google managed to port its speech recognition system (currently English only) to a Pixel phone in less than 80MB of memory. Similarly, specialist multilingual speech-to-intent systems for a limited vocabulary can be implemented in under 500kB of memory, depending on the size of the vocabulary.
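To illustrate why a limited vocabulary keeps the problem small, here is a minimal sketch of the ‘intent’ step only. It assumes the speech has already been turned into text, whereas a real speech-to-intent engine typically works from the audio itself. The menu, the synonym table and the `parse_order` function are all hypothetical.

```python
import re

# Hypothetical menu vocabulary: spoken word -> canonical menu item.
MENU = {
    "burger": "burger",
    "burgers": "burger",
    "chips": "chips",
    "fries": "chips",
    "cola": "cola",
    "colas": "cola",
    "nuggets": "chicken nuggets",
}

# Hypothetical quantity words the kiosk understands.
QUANTITIES = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4}

def parse_order(transcript):
    """Map a transcript to {menu item: quantity}, ignoring every other word."""
    order = {}
    quantity = 1
    for word in re.findall(r"[a-z]+", transcript.lower()):
        if word in QUANTITIES:
            quantity = QUANTITIES[word]
        elif word in MENU:
            item = MENU[word]
            order[item] = order.get(item, 0) + quantity
            quantity = 1  # reset for the next item
    return order

print(parse_order("Can I have a burger and two fries, please?"))
print(parse_order("What time do you close?"))
```

Because anything outside the menu is simply ignored, the vocabulary the system has to recognise stays tiny, which is what makes footprints of a few hundred kilobytes plausible.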
There’s no need to lose existing touchscreen benefits either – such as upselling options to maximise revenue. If a customer orders a burger and fries, for example, the screen can still prompt them and ask whether they want a cola with their order – with the customer speaking their reply. Dynamically tying this upselling to stock levels would also help reduce wastage – and data mining could ensure the best options are offered.
The one area that speech doesn’t particularly help with is payment, as privacy is likely to be an issue. Very few customers will be happy to announce their credit card details or how much they are spending. Thankfully, contactless payment is already here – but the new EU Payment Services Directive means that kiosks are going to have to resort to the keypad for customer authentication more often, unless a different solution can be embraced.
There are plenty of alternatives. For larger amounts, various companies have proposed payment systems based on face recognition. But perhaps a more practical approach is QR codes. These can be displayed directly on the kiosk screen for a smartphone app to read. All the personal security for the transaction is already contained in the smartphone with its fingerprint/facial recognition or simple access password/PIN.
So it all seems to add up. In the post-pandemic world, perhaps it’s time to put a ‘Do not touch’ sign on our touchscreens…