Giles Peckham, Regional Marketing Director at Xilinx
Adam Taylor CEng FIET, Embedded Systems Consultant

Advanced computer vision applications are becoming increasingly pervasive: they enable the autonomous-driving modes of today’s cars and are used in augmented reality, security surveillance systems, healthcare, industrial inspection equipment, robotics and more. Users’ expectations are rising, as familiarity inevitably brings demands for higher performance, such as faster response times, greater accuracy, or recognition of extra objects or features.

Image recognition and classification use deep machine learning techniques such as convolutional neural networks. These networks must be trained for their target application before they can be deployed.

The previous article in this series described tools for building high-performing embedded application-processing engines capable of running deep neural networks that are trained in the Cloud for deployment on an autonomous edge device. This approach is suitable for systems like self-driving vehicles, where low latency is critically important and a reliable, high-bandwidth connection to the Cloud may not always be available.

Other use cases, such as security surveillance or medical imaging, may be less demanding of outright speed, and instead require complex analyses, extremely high accuracy to help specialists make informed decisions, and the ability to allow multiple parties access to results. In such situations, where compute-intensive tasks and flexible storage and access policies are required, the image-processing application can be more effectively and economically hosted in the Cloud.

Hardware-Accelerated Cloud Computing
Compared with the embedded vision systems discussed previously, hosting the application-processing algorithms in the Cloud creates a different set of challenges. It is here that compute-intensive deep machine learning, data analytics and image processing are implemented. Often the applications are required to stream processed data in near real-time and without dropouts to the client.

Cloud data centres are increasingly unable to fulfil the demands of today’s most intensive processing workloads using conventional CPU-based processing alone. Some have adopted FPGA-accelerated computing to achieve the throughput needed for these and other workloads, such as complex data analytics, H.265 encoding and SQL functions.
Historically, an FPGA has been teamed with a CPU to provide acceleration, but a new model is emerging based on arrays of FPGAs such as Xilinx Virtex® UltraScale+ devices. These arrays deliver extremely high peak compute capability, with the added advantage of rapid runtime reconfigurability to repeatedly re-optimise for subsequent workloads.

Stack Streamlines FPGA Development
To fully leverage the capabilities provided by programmable logic, an ecosystem is needed that enables development using current industry-standard frameworks and libraries. The Xilinx Reconfigurable Acceleration Stack (RAS) answers this need, by streamlining FPGA-hardware creation, application development and integration. Hyperscale data centres can use these tools to jump-start development: several major operators are currently working with Xilinx to boost performance and service agility by introducing FPGA acceleration in their server farms, making extreme high-performance compute capacity available to customers as a web service.

Like the reVISION™ stack for embedded vision development, which was described in the previous article in this series, the RAS leverages High Level Synthesis (HLS) for efficient development of programmable logic in C, C++, OpenCL® and SystemC. This HLS capability is then combined with support for industry-standard frameworks and libraries such as OpenCV, OpenVX, Caffe, FFmpeg and SQL, creating an ecosystem that can be extended in the future to add support for new frameworks and standards as they are introduced.

Also like the reVISION stack, the RAS is organised in three distinct layers to address hardware, application and provisioning challenges. The lowest layer of the stack, the platform layer, is concerned with the hardware platform comprising the selected FPGA or SoC upon which the remainder of the stack is to be implemented. The RAS includes a single-slot, half-length, full-height PCIe® development board and a reference design, which are created specifically to support machine learning and other computationally intensive applications like video transcoding and data analytics.
The second level of the RAS is the application layer. This uses the Vivado® Design Suite and SDAccel™ development environment, leveraging HLS to implement the application. SDAccel contains an architecturally optimising compiler for FPGA acceleration, which enables up to 25 times better performance per Watt compared with typical processing platforms comprising conventional x86 server CPUs and/or Graphics Processing Units (GPUs). The environment is designed to deliver CPU/GPU-like development and run-time experiences by ensuring easy application optimisation, providing CPU/GPU-like on-demand loadable compute units, maintaining consistency throughout program transitions and application execution, and handling the sharing of FPGA accelerators across multiple applications.

For machine learning applications, DNN (deep neural network) and GEMM (general matrix multiplication) libraries are available on the Caffe framework, as shown in figure 1. Libraries for other deep learning frameworks such as TensorFlow, Torch and Theano are expected to be added later. It is worth noting at this point that the scope of RAS is not limited to machine vision or deep learning: as figure 1 shows, other libraries are included that support MPEG processing using FFmpeg as well as data movers and compute kernels for data analytics on the SQL framework.
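
To make the GEMM operation concrete: general matrix multiplication computes C = alpha * A * B + beta * C, and a fully connected neural-network layer maps onto it directly (inputs times weights, plus bias). The following is a minimal pure-Python sketch for illustration only. It is not the RAS GEMM library or its API, and the function name `gemm` and the example layer values are assumptions made for this example.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    Illustrative reference version only: an accelerator would parallelise
    and pipeline these loops instead of running them one step at a time.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")

    # Start from the scaled accumulator matrix beta * c.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]

    # Accumulate alpha * a[i][p] * b[p][j] into every output element.
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


# Hypothetical fully connected layer: 2 input samples, 3 features, 2 outputs.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
bias = [[0.5, -0.5],
        [0.5, -0.5]]  # the same bias row repeated for each sample

activations = gemm(1.0, inputs, weights, 1.0, bias)
print(activations)  # [[4.5, 4.5], [10.5, 10.5]]
```

Because every output element is an independent sum of products, the inner loops can be unrolled and pipelined, which is why this operation suits programmable logic.
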
The third level of the RAS is the provisioning layer, which uses OpenStack to enable integration within the data centre. OpenStack is a free, open-source software platform that comprises multiple components for managing and controlling resources such as processing, storage and networking equipment from multiple vendors.

Performance Boost, with Power Savings
By using the RAS to streamline the creation of Cloud-class FPGA-based computing, a significant increase in compute capability can be achieved, compared with processing on conventional CPUs. Image processing algorithms can be accelerated by as much as 40 times, while deep machine learning can be up to 11 times faster. In addition, hardware requirements are reduced, which lowers power consumption and results in a dramatic increase in performance per Watt. Moreover, the FPGA-based engine has the important advantage of being reconfigurable and so can be quickly and repeatedly re-optimised for different types of algorithms as they are called for execution.
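
The relationship between speed-up, power and performance per Watt can be sketched as simple arithmetic: the gain in performance per Watt is the speed-up multiplied by the ratio of baseline power to accelerated power. In the short example below, the 11-times speed-up is the deep machine learning figure quoted above, whereas the power values are purely hypothetical placeholders and not measurements. The function name is likewise an assumption made for illustration.

```python
def perf_per_watt_gain(speedup, baseline_power_w, accelerated_power_w):
    """Return how many times performance per Watt improves.

    performance per Watt = throughput / power, so the gain is
    speedup * (baseline power / accelerated power).
    """
    if speedup <= 0 or baseline_power_w <= 0 or accelerated_power_w <= 0:
        raise ValueError("speedup and power values must be positive")
    return speedup * baseline_power_w / accelerated_power_w


# Speed-up figure from the article; power values are HYPOTHETICAL.
same_power = perf_per_watt_gain(11, 100.0, 100.0)  # no power saving
half_power = perf_per_watt_gain(11, 100.0, 50.0)   # assumed 50% power saving

print(same_power)  # 11.0
print(half_power)  # 22.0
```

The point of the sketch is that a speed-up and a power reduction multiply, so even a modest power saving compounds the performance-per-Watt benefit.
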

Automatic image analysis and object recognition applications can benefit from the increased performance and reduced power consumption offered by highly optimised, reconfigurable FPGA-based processing engines. Whether the application is to run on an embedded system or in the Cloud, using an acceleration stack enables developers to overcome design and integration challenges, reduce time to market and maximise overall performance. Both the reVISION stack for embedded development and the Reconfigurable Acceleration Stack for building Cloud-based FPGA compute engines assemble the necessary hardware and software resources and can adapt to support frameworks and standards as they are introduced.
