By Nick Ni, Director of Product Marketing, AI, Software and Ecosystem, and Lindsey Brown, Product Marketing Specialist Software and AI, both at Xilinx
Data is exploding, innovation is accelerating and algorithms are changing rapidly. Whilst artificial intelligence (AI) is growing in popularity across many industries, most AI revenue today comes from training AI models to improve their accuracy and efficiency. The inference industry is just getting started, and with the “productisation” of AI models, where a model can be bought as a product to fit an application’s requirements, it will soon surpass training revenue.
As we are still in the early phases of adopting AI inference, there’s a lot of room for improvement. For example, most cars still don’t have advanced driver-assistance systems (ADAS), drones and logistic robots are still in their infancy, medical robot-assisted surgery is not perfect yet, and there are many enhancements needed in speech recognition, automated video description and image detection in datacenters.
Keeping up with demand
Demand on hardware for AI inference has skyrocketed, as modern AI models require far more compute power than conventional algorithms. Yet, as we already know, we can’t rely on gradual silicon evolution. Processor frequency hit a wall long ago with the end of Dennard scaling: an algorithm can simply no longer enjoy a “free” speed-up every few years.
Adding more processor cores has also hit a ceiling, thanks to Amdahl’s Law: if 25% of the code is not parallelisable, the best possible speedup is 4x, regardless of how many cores are crammed in.
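Amdahl’s Law can be sketched in a few lines of Python; the 25%-serial example above falls out directly:

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's Law: the serial fraction caps the overall speedup,
    # no matter how many cores the parallel part is spread across.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# 75% parallelisable (25% serial) on 4 cores vs. on 10,000 cores:
print(round(amdahl_speedup(0.75, 4), 2))       # modest gain on a few cores
print(round(amdahl_speedup(0.75, 10_000), 2))  # approaches the 4x ceiling
```

Even with ten thousand cores, the speedup never exceeds 1/0.25 = 4x.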
So, how can hardware keep up with such rapidly increasing demand? One answer is the Domain Specific Architecture (DSA).
AI models are becoming heavy-duty and their dataflows complex, and today’s CPUs, GPUs, ASSPs and ASICs can’t keep up. CPUs are generic and lack computational efficiency. Fixed hardware accelerators are designed for commodity workloads that undergo no further innovation. DSAs are the new requirement: hardware must be customised for each group of workloads to run at the highest efficiency.
Customisation for efficiency
Every AI network has three parts that need customising for the highest efficiency: the data path, the precision and the memory hierarchy. Most newly emerging AI chips have powerful engines but fail to pump data to them fast enough because of inefficiencies in these three areas.
Every AI model requires a slightly, or sometimes drastically, different DSA. The first part is a custom data path. Every model has a different topology (broadcast, cascade, skip-through, etc.) for passing data from layer to layer. It is challenging to synchronise each layer’s processing so that data is always available for the next layer to begin its work.
The second part is custom precision. Until a few years ago, 32-bit floating point was the most prevalent precision in designs. However, with the Google TPU leading the industry in reducing precision to 8-bit integer, the state of the art has shifted to even lower precisions such as INT4, INT2, binary and ternary. Recent research confirms that every network has a different sweet spot of combined mixed precisions at which it is most efficient: for example, 8-bit for the first five layers, 4-bit for the next five and 1-bit for the last two.
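A minimal Python sketch of symmetric linear quantisation illustrates the idea; the per-layer bit widths and example weights below are hypothetical, not taken from any real network:

```python
def quantise(values, bits):
    # Symmetric linear quantisation to signed integers of the given width;
    # binary (1-bit) maps each value to {-1, +1} by sign.
    qmax = 2 ** (bits - 1) - 1 if bits > 1 else 1   # e.g. 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    if bits == 1:
        return [1 if v >= 0 else -1 for v in values], scale
    return [round(v / scale) for v in values], scale

# Hypothetical mixed-precision plan, as in the example above:
plan = {"layers 1-5": 8, "layers 6-10": 4, "layers 11-12": 1}
weights = [0.42, -0.17, 0.9, -0.63]                  # illustrative weights
for name, bits in plan.items():
    q, scale = quantise(weights, bits)
    print(name, f"INT{bits}", q)
```

Lower bit widths shrink both the multipliers and the memory traffic, which is why the sweet spot matters so much for hardware efficiency.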
The last part, and probably the one that most needs hardware adaptability, is a custom memory hierarchy. Constantly feeding data to a powerful engine to keep it busy is crucial, and a customised memory hierarchy, from internal memory out to external DDR/HBM, is needed to keep up with the layer-to-layer memory transfer needs.
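As a rough illustration, the bandwidth one layer’s activations demand can be estimated as follows; the layer dimensions and frame rate are hypothetical:

```python
def layer_bandwidth_gbps(height, width, channels, bytes_per_elem, fps):
    # Bandwidth (GB/s) needed to stream one layer's activation tensor
    # to the next layer at a given frame rate.
    return height * width * channels * bytes_per_elem * fps / 1e9

# Hypothetical layer: 56x56x256 INT8 activations at 30 frames/s
print(layer_bandwidth_gbps(56, 56, 256, 1, 30))
```

Summed across dozens of layers, and multiplied again for weights, these transfers quickly exceed what a fixed cache hierarchy can sustain, which is the case for on-chip buffering tailored to the network.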
Figure 1: Every AI network has three components that need to be customised
Rise of AI productisation
Harnessing DSAs to make AI models run at their most efficient fuels the growth of AI applications: classification, object detection, segmentation, speech recognition and recommendation engines are just some examples being productised, with new ones emerging every day. There is also a second dimension to this complex growth. Within each application, more models are being invented, either to improve accuracy or to make the model less cumbersome. In classification, whilst AlexNet in 2012 was the first breakthrough in deep learning, it was fairly simple, with a feed-forward network topology. Hot on its heels, in 2014, Google introduced GoogLeNet, with a map-and-reduce topology. Modern networks like DenseNet and MobileNet now use depthwise convolution and skip-through, where data is sent many layers ahead.
This level of innovation puts constant pressure onto existing hardware, requiring chip vendors to innovate fast. Here are a few recent trends that are pushing the need for new DSAs.
Depthwise convolution is an emerging layer type that requires large memory bandwidth and specialised internal memory caching to run efficiently. Typical AI chips and GPUs have a fixed L1/L2/L3 cache architecture and limited internal memory bandwidth, resulting in very low efficiency. Researchers are constantly inventing new custom layers, but today’s chips simply don’t have native support for them, so those layers must run on the host CPU without acceleration, often becoming the performance bottleneck.
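A quick MAC-count comparison shows why depthwise (separable) convolution is attractive yet bandwidth-hungry; the layer dimensions below are a hypothetical MobileNet-style example:

```python
def standard_conv_macs(h, w, cin, cout, k):
    # Standard convolution: every output channel reads every input channel.
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k):
    # Depthwise (one k x k filter per channel) plus a pointwise 1x1
    # convolution to mix channels.
    return h * w * cin * k * k + h * w * cin * cout

# Hypothetical layer: 112x112 feature map, 32 -> 64 channels, 3x3 kernel
std = standard_conv_macs(112, 112, 32, 64, 3)
dws = depthwise_separable_macs(112, 112, 32, 64, 3)
print(f"{std / dws:.1f}x fewer MACs")
```

The compute shrinks by roughly 8x here, but the activation traffic does not, so the layer’s arithmetic intensity drops and it becomes memory-bound on hardware with a fixed cache hierarchy.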
The sparse neural network is another promising optimisation, in which networks are heavily pruned, sometimes by up to 99%, by trimming edges, removing fine-grained matrix values in convolutions, and so on. However, running this efficiently in hardware requires a specialised sparse architecture, plus an encoder and decoder for the operations, which most chips lack.
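Magnitude-based pruning, one common way such sparsity is produced, can be sketched as follows; the weights and the 50% target below are purely illustrative:

```python
def prune(weights, sparsity):
    # Magnitude pruning: zero out the smallest |w| until the target
    # fraction of weights is removed.
    n_zero = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[n_zero - 1] if n_zero else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.8, -0.05, 0.02, -0.6, 0.01, 0.3, -0.04, 0.9]
pruned = prune(w, 0.5)                               # drop the smallest half
print(pruned)
print(sum(1 for x in pruned if x == 0.0) / len(pruned))  # achieved sparsity
```

The zeros only pay off if the hardware can skip them, which is exactly why a sparse engine needs its own encoding and decoding logic around the compute array.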
Binary/ternary networks are extreme optimisations that reduce all maths operations to bit manipulations. Most AI chips and GPUs have only 8-bit, 16-bit or floating-point calculation units, so there is no performance or power-efficiency gain from going to such extremely low precisions.
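With weights and activations constrained to {-1, +1}, a dot product collapses to XNOR and popcount, which is precisely the kind of bit-level work that suits LUT-based fabric. A minimal sketch:

```python
def binary_dot(a_bits, b_bits, n):
    # XNOR-popcount dot product: values in {-1, +1} are packed as bits
    # (1 = +1, 0 = -1). XNOR marks positions where the signs agree.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n          # agreements minus disagreements

# Example: a = [+1, -1, +1, +1], b = [+1, +1, -1, +1], packed LSB-first
a, b = 0b1101, 0b1011
print(binary_dot(a, b, 4))  # (+1) + (-1) + (-1) + (+1) = 0
```

On hardware with only 8-bit or wider multipliers, each of these single-bit products still occupies a full multiplier, which is why fixed-precision chips see no gain from binarisation.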
The MLPerf Inference v0.5 results published at the end of 2019 demonstrated all these challenges. Looking at Nvidia’s flagship T4 results, it achieves as little as 13% efficiency. This means that, whilst Nvidia claims 130 TOPS of peak performance for T4 cards, real-life AI models like SSD with MobileNet-v1 can utilise only 16.9 TOPS of the hardware. Vendor TOPS numbers used for chip promotion are therefore not meaningful metrics.
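The efficiency figure quoted above is simply sustained throughput divided by peak throughput:

```python
def usable_efficiency(achieved_tops, peak_tops):
    # Fraction of the vendor's peak TOPS that a real model sustains.
    return achieved_tops / peak_tops

# Figures from the MLPerf v0.5 discussion: 16.9 TOPS sustained vs 130 TOPS peak
print(f"{usable_efficiency(16.9, 130):.0%}")  # prints "13%"
```

By this metric, “usable” TOPS on the customer’s actual model, not the datasheet peak, is the number that matters.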
Whilst some chips may be very good at AI inference acceleration, it is almost always the case that they accelerate only a portion of the application. In one smart retail example, pre-processing includes multi-stream video decode, followed by conventional computer vision algorithms to resize, reshape and format-convert the video. Post-processing includes object tracking and database look-up.
The end customer rarely cares how fast the AI inference alone runs; what matters is whether the full application pipeline can meet its video-stream performance and/or real-time responsiveness targets. Most chips struggle to accelerate the whole application, which requires not only individual workload acceleration but also system-level dataflow optimisation.
Finally, in real production settings like automotive, industrial automation and medical, it is critical to have a chip that is functional-safety certified, with guaranteed longevity and strong security and authentication features. Again, many emerging AI chips and GPUs lack such a track record.
However, Xilinx is rising to these AI productisation challenges. Its devices have up to 8x the internal memory of state-of-the-art GPUs, and the memory hierarchy is completely customisable by users. This is critical for achieving “usable” TOPS in modern networks, such as those built on depthwise convolution.
The user-programmable FPGA logic allows a custom layer to be implemented most efficiently, preventing it from becoming a system bottleneck. As for sparse neural networks, Xilinx devices have long been used in sparse-matrix-based signal-processing applications, such as in the communications domain. Users can design specialised encoders, decoders and sparse-matrix engines in the FPGA fabric.
And lastly, for binary/ternary operations, Xilinx FPGAs use look-up tables (LUTs) to implement bit-level manipulation, achieving close to 1 PetaOPS, or 1,000 TOPS, when using binary instead of INT8.
In regard to whole-application acceleration, Xilinx devices have already been adopted in production across industries to accelerate non-deep-learning workloads, including sensor fusion, conventional computer vision and DSP algorithms, path planning and motor control. Xilinx now has over 900 hardware-accelerated libraries published under the Vitis brand, enabling significant speed-ups in typical workloads.
Xilinx is known for the quality of its devices, confirmed by their adoption in safety-critical environments such as space, automotive, industrial automation and surgery-assistant robots.
Xilinx’s new unified software platform, Vitis, combines AI with software development, enabling developers to accelerate their applications on heterogeneous platforms and target applications from cloud computing to embedded end-points. Vitis plugs into standard environments, uses open-source technology, and is free. Within Vitis, the AI part provides tools to optimise, quantise and compile trained models, and deliver specialised APIs for applications from the edge to the cloud, all with best-in-class inference performance and efficiency.
Figure 2: Vitis unified software platform