Q&A with Dr Dominic Binks, VP Technology, Audio Analytic, who discusses the decision to embed sound-recognition AI software on the Arm Cortex-M0+ processor

Q&A

11 June 2020

Q: What is ai3, and what type of neural network does it use?

A: At the heart of our ai3 software is our optimised deep neural network for modelling the acoustic and temporal features of sounds, called AuditoryNET. We keep the type of network confidential.

Q: Why did Audio Analytic choose to run sound-recognition AI on the Arm Cortex-M0+ chip?

A: Our mission is to give all machines a sense of hearing, even the smallest of devices where power and processing are constrained. Plus, right across the consumer technology sector, there’s a drive to get more AI running at the edge of a network: consumer privacy can be better protected and it’s a more cost-effective option as cloud infrastructure is expensive.

Leading consumer tech brands also want edge-based machine learning (ML) to be as compact as possible, to give product designers maximum freedom. Running our embedded sound-recognition software, ai3, on the most-constrained end-point demonstrates how compact it can be – especially since the Arm Cortex-M0+ is one of the smallest designs available today.

There’s a movement today called tinyML, where the embedded ultra-low power systems and machine-learning communities collaborate, whereas traditionally they operate independently. Qualcomm, Google, Arm and others are all keen advocates of it, building a niche tech community with regular meetings and conferences, like the ‘tinyML Summit’ in California that took place in February. Our M0+ work takes these tinyML innovations to the next level. Rather than tinyML in action, it is microML in action. As well as significantly pushing the boundaries of what’s currently believed to be possible with tinyML, efforts like ours widen the range, size and variety of machines that can have hearing.

Q: What’s the size of ai3 software footprint?

A: For this M0+ implementation, our ai3 software required 181kB across the RAM and ROM, which is considerably less than the available 224kB of ROM and RAM on the M0+-based chip that we used from NXP. To give you a broader picture, a full implementation of our ai3 software detecting multiple sounds requires tens of MIPS and a few hundred kB of memory, although there are often additional acceleration techniques available, which can further reduce the footprint and computational demands.

From the outset our technology was designed to work at the edge, and this shapes everything from our data collection and labelling through to model training, evaluation and compression. We also designed the architecture of our software to be flexible when working on highly constrained devices.

All devices have constraints in one form or another, whether it’s battery life, processing capabilities, BoM, or competing functions. These finite restrictions are present whether you are looking to deploy software on a tiny chip like the M0+, or fitting alongside many other applications on a larger processor running on a smartphone. Hardware and software spaces are very competitive, so Audio Analytic designed ai3 to run on minimal memory and computational resources, which means it can run on a dedicated microprocessor or alongside other applications on a larger applications processor. We focused on being purely edge-based without any processing in the cloud, which better protects consumer privacy since audio data is not streamed off the device for analysis. Privacy-friendliness is a compelling point for consumers, so for product designers this means flexibility and software compactness are crucial.

Designing a smart speaker that is plugged in to run AI sound-recognition differs widely from running the same technology on much smaller battery-powered true wireless earbuds, for example. To be able to meet our targets for compactness, models must have the flexibility to be small and optimised for the end-user application. And overcoming this M0+ challenge proves sound-recognition AI can feature on many consumer electronics devices.

Q: How difficult were the challenges in embedding ai3 into M0+?

A: ai3 is effectively a signal-processing application, so running it on the Cortex-M4 is an obvious choice. To fit on the M0+, the key challenges we faced were the small instruction set architecture, the lack of floating-point support and the small amount of RAM.

The M0+ design uses the Armv6-M architecture. This small instruction set and lack of hardware support mean mathematical calculations are more labour-intensive, and the compiler injects specific replacement routines that take longer to compute. As there’s no support for floating-point calculations in the M0+, more tasks had to be programmed into the software. These M0+ chips are also designed for devices with very limited processing needs, and hence have very limited RAM, which made developing and debugging tricky.
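To make the point about replacement routines concrete: Armv6-M has no hardware divide and no floating-point unit, so toolchains that follow the Arm run-time ABI typically turn `/` and float arithmetic into calls to helper functions such as `__aeabi_idiv` and `__aeabi_fmul`. One common way to sidestep this, sketched below purely as an illustration and not as a description of the ai3 code, is to pick power-of-two sizes so that a division becomes a shift. The function name `mean_pow2` and the power-of-two constraint are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mean of n = (1 << log2_n) samples without a division.
 * Hypothetical helper: it assumes the caller chooses a power-of-two
 * sample count, with log2_n <= 16 so the 32-bit sum cannot overflow.
 */
static int16_t mean_pow2(const int16_t *samples, unsigned log2_n)
{
    size_t n = (size_t)1 << log2_n;
    int32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }

    /*
     * Shift instead of sum / n, so no divide helper is needed.
     * Right-shifting a negative value is implementation-defined in C;
     * GCC and Clang use an arithmetic shift, which rounds towards
     * negative infinity rather than towards zero.
     */
    return (int16_t)(sum >> log2_n);
}
```

For four samples 2, 4, 6 and 8, calling `mean_pow2(samples, 2)` gives 5.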

The Arm Cortex-M4 is a really useful comparison point to illustrate the challenge we faced. The M4 core has instructions that map naturally to operations that ML algorithms do. With the M0+ there’s less support for these operations, resulting in a significant increase in instruction count on like-for-like computations. As the calculations are much more labour-intensive, we’ve typically seen over five times the number of instructions needed on M0+ compared with the same task on the M4.
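A typical example of such an operation is the multiply-accumulate at the core of a neural-network layer. The sketch below is a generic fixed-point dot product, not Audio Analytic's code, and the name `dot_q15` is hypothetical. On a Cortex-M4, the DSP extension provides multiply-accumulate instructions (SMLAD, for instance, performs two 16-bit multiplies and an accumulate in one instruction) that a loop like this can map onto. Armv6-M offers only a 32-bit multiply (MULS), so each step needs separate load, multiply and add instructions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Dot product of two vectors of 16-bit fixed-point values.
 * Each product is widened to 32 bits before being accumulated.
 * Assumption: the caller bounds n or pre-scales the inputs so that
 * the 32-bit accumulator cannot overflow.
 */
static int32_t dot_q15(const int16_t *x, const int16_t *w, size_t n)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)w[i];   /* multiply-accumulate */
    }

    return acc;
}
```

The same C source compiles for both cores; the difference in instruction count comes from what the compiler has available to implement the loop body.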

Whilst the M4 uses floating-point, we’ve always supported ML in both fixed and floating-point. For the M0+ project, we relied on our fixed-point implementation. As a result, tasks carried out by hardware are programmed into the software, and issues like scaling, rounding, underflow and overflow all had to be taken into consideration. This tends to use more MIPS, which means extra effort is required to complete the same task.
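The scaling, rounding, underflow and overflow issues mentioned above can be illustrated with a minimal fixed-point sketch. It assumes the widely used Q15 format, in which a 16-bit integer `x` stands for the real value `x / 32768`. The source does not say which format ai3 uses, so the format and the names `q15_t`, `q15_saturate`, `q15_mul` and `q15_add` are assumptions for illustration only.

```c
#include <stdint.h>

/* Q15: a signed 16-bit integer x represents the real value x / 32768. */
typedef int16_t q15_t;

/* Overflow handling: clamp a 32-bit intermediate to the Q15 range. */
static q15_t q15_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Multiply two Q15 values: widen, round to nearest, rescale, saturate. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* result is in Q30 */

    product += (int32_t)1 << 14;                /* rounding: add half an LSB */

    /*
     * Scaling: shift back from Q30 to Q15. This assumes an arithmetic
     * right shift for negative values, as GCC and Clang provide.
     */
    return q15_saturate(product >> 15);
}

/* Add two Q15 values, saturating instead of wrapping around. */
static q15_t q15_add(q15_t a, q15_t b)
{
    return q15_saturate((int32_t)a + (int32_t)b);
}
```

With these definitions, `q15_mul(16384, 16384)` (0.5 × 0.5) returns 8192 (0.25). `q15_mul(-32768, -32768)` would be +1.0, which Q15 cannot represent, so it saturates to 32767. `q15_mul(1, 1)` underflows to 0. In floating-point hardware all of this bookkeeping is done for you, which is why fixed-point code needs the extra care described above.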

Finally, the amount of Flash available on the M0+ can be tight. To address this, we found the right chip with sufficient Flash, and chose to work with the NXP evaluation board FRDM-KL82Z EVK.

Q: What changes did you have to make to your designs?

A: The architecture of ai3 is flexible and scalable so we didn’t really need to change much of the code – just disabling floating-point evaluation, for example, since it wasn’t useful. The existing code actually ran within the constraints of the platform, but it wasn’t as efficient on the M0+, and we wanted the end result at production-ready standards. As a result, we did some processor-specific optimisations to create the headroom we needed.

Selecting the NXP evaluation board was a key decision because, whilst it’s based around the Arm Cortex-M0+ design, it has sufficient RAM for debugging and developing. This also gave us headroom whilst we were tweaking and optimising sections of the code.