Google TPU: thermal management in machine learning

10 October 2019

By Tom Gregory, Product Manager, 6SigmaET

For most in the electronics community, 2018 was the year when artificial intelligence (AI) – and in particular machine learning (ML) – became real.

Used in everything from retail recommendations to driverless cars, machine learning represents the era of computers solving problems without being explicitly programmed to do so.

Whilst both machine learning and the wider notion of AI come with revolutionary new implications for the technology sector, they also require significant investment in new forms of electronics, silicon and customised hardware.

But why exactly do we need such an extensive ground-up redesign? And why shouldn’t AI firms simply keep building on the processing and acceleration technologies that are already in place?

Building a brain

For most of those working in AI, the end goal is to create artificial general intelligence (AGI) – “thinking machines” that assess, learn and handle nuance and inference in a way that replicates the thought processes of the human brain. Based on the current design and architecture of electronics, however, this simply isn’t possible. As it stands, the vast majority of current-generation computing devices are based on the same von Neumann architecture – a beautifully simple, but fundamentally limited, way of organising information. In this structure, programs and data are held in memory, separate from the processor, and data has to move between the two to complete each operation. This constant shuttling results in what is known as the “von Neumann bottleneck”, where latencies are unavoidable.

Unlike today’s computing devices, the human brain doesn’t separate memory from processing. Biology makes no distinction between the two, with every neuron and every synapse storing and computing information simultaneously. Whilst even the largest investments in AI are nowhere near recreating such a system, there is a growing demand to rethink the von Neumann architecture and to overcome the inherent bottleneck of the two-part memory/processing system. It is here that new hardware developments are proving so vital.

While plenty of AI companies have had significant success by simply throwing more processing power and ever more graphics processing units (GPUs) at the problem, the reality is that AI and ML will never reach their full potential until new, more ‘biological’ hardware is developed from the ground up.

Race for hardware

This demand for new, diversified and increasingly advanced hardware has been likened to the ‘Cambrian explosion’ – the most important evolutionary event in the history of life, when there was a massive diversification of organisms on our planet. This colossal growth in the AI market has resulted in a huge variety of new electronics, as well as a growing number of high-end investments in start-ups that offer solutions to the ever-expanding disconnect between existing hardware outputs and the massive potential of ML technology. We already saw several such investments last year. In March, AI chip start-up SambaNova Systems received $56m of funding for its new GPU replacement, designed specifically for AI operations. Then its competitor Mythic received a similar $40m investment to continue its efforts to replace the traditional GPU.

And it’s not just start-ups that are getting in on the action. Tech giants Microsoft, Google, Nvidia and IBM are all looking for their own killer app solution to the AI hardware problem. While each of these companies has its own particular solution – often attacking the problem from completely different angles – one of the most common pieces of hardware being developed is the accelerated GPU. In traditional computing environments, CPUs have been used for the bulk of processing, with GPUs being added to ramp things up where needed, such as for rendering video or animation. However, in the age of machine learning, GPUs are not enough. What’s needed now is a new, more powerful and more streamlined processing unit that can take on the heavy lifting needed for machine learning – a unit that can analyse large data sets, recognise patterns and draw meaningful conclusions, fast.

Enter the Tensor Processing Unit

Google now offers its Tensor Processing Unit 3.0 (TPUv3), designed specifically for machine learning; see Figure 1. This is a custom ASIC, tailored specifically for TensorFlow – Google’s open-source software library for ML and AI applications.

Unlike GPUs, which act as processors in their own right, Google’s TPU is a coprocessor – shifting all code execution to the CPU to free up the TPU for a stream of ML-based micro-operations. In purely practical terms, TPUs are designed to be significantly cheaper and, in theory at least, use less power than GPUs, despite playing a pivotal role in some pretty hefty ML calculations, predictions and processes. The only question is, does Google’s TPU really achieve what it claims?

Whilst positioned as a game-changing development in the AI space, the new TPUv3 still faces many of the same challenges as competitor products offered by Amazon and Nvidia – in particular the potential for thermal complications.

As with so much of the hardware developed specifically for the ML market, Google’s TPUv3 offers a colossal amount of processing power. In fact, according to Google’s CEO, Sundar Pichai, the new TPU will be eight times more powerful than any of Google’s previous efforts in this area.

From an AI standpoint this is hugely beneficial, with the process of machine learning relying on the ability to crunch huge volumes of data almost instantaneously in order to make self-determined decisions. From a thermal perspective, however, this dramatic increase in processing represents a minefield of potential complications, with increased power meaning more heat being generated on the device. This heat accumulation could potentially impact performance, ultimately risking the reliability and longevity of the TPU.
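The link between power and heat can be illustrated with a first-order estimate. The sketch below assumes a single lumped junction-to-ambient thermal resistance, a common simplification rather than a full simulation. The function name, the resistance and the power levels are all hypothetical values chosen for illustration; none of them are Google's figures for the TPUv3.

```python
def junction_temperature(power_w: float, theta_ja_c_per_w: float, ambient_c: float) -> float:
    """Steady-state junction temperature for a single lumped thermal resistance."""
    # Temperature rise above ambient is dissipated power times thermal resistance.
    return ambient_c + power_w * theta_ja_c_per_w


# Hypothetical values for illustration only; not measured TPU data.
AMBIENT_C = 25.0          # air temperature around the device, in deg C
THETA_JA_C_PER_W = 0.25   # assumed junction-to-ambient resistance, in deg C per watt

for power_w in (100.0, 200.0, 400.0):
    t_j = junction_temperature(power_w, THETA_JA_C_PER_W, AMBIENT_C)
    print(f"{power_w:5.0f} W -> junction at {t_j:5.1f} deg C")
```

With the cooling path held fixed, every doubling of power doubles the temperature rise above ambient in this model, which is why a large jump in processing power tends to force a change in the cooling approach.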

In a market where reliability is essential and buyers have little room for system downtime, this issue could prove a deciding factor for a hardware manufacturer to claim ownership of the AI space. Given such high stakes, Google has clearly invested significant time in optimising the thermal design of its TPUv3. Unlike the company’s previous Tensor Processing Units, TPUv3 is the first to bring liquid cooling to the chip, with coolant being delivered to a cold plate sitting atop each TPUv3 ASIC. According to Sundar Pichai, this will be the first time ever that Google has needed to incorporate a form of liquid cooling into its data centres.

Figure 1: The Google Tensor Processing Unit 3.0

Keeping it cool

While the addition of liquid cooling technology has been positioned as an innovation for the industry, the reality is that high-powered electronics running in rugged environments have been using similar heat dissipation systems for some time.

When preparing equipment for these environments, engineers typically work with limited cooling resources, having to develop clever ways of dissipating heat away from the critical components. In this context, carefully designed hybrid liquid and air cooling systems have proved vital in ensuring servers and other critical electronics systems function reliably.

As an example, The University of Texas at Arlington has used 6SigmaET thermal simulation software in its research to model liquid cooling systems for servers. One significant challenge the researchers faced was that components other than the main processing chip, such as DIMMs, the PCH, HDDs and other heat-generating parts, are not directly cooled by a liquid cooling loop. Hence, a combination of warm water and recirculated air was used in the research to keep critical temperatures within the recommended range.

Figure 2: Streamlines within a Blade Chassis with hybrid liquid and fan cooling system modelled using 6SigmaET
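For the liquid side of a hybrid system like the one described above, a simple energy balance gives a feel for the numbers involved. The following is a rough sketch only: it assumes all of the stated heat enters the water, uses approximate textbook properties for water, and the function name, heat load and flow rates are hypothetical rather than taken from the research.

```python
WATER_DENSITY_KG_PER_L = 1.0    # approximate density of water near room temperature
WATER_CP_J_PER_KG_K = 4186.0    # approximate specific heat of liquid water


def coolant_temperature_rise(heat_w: float, flow_l_per_min: float) -> float:
    """Water temperature rise (in kelvin) across a cold plate, from Q = m_dot * cp * dT."""
    if flow_l_per_min <= 0:
        raise ValueError("flow rate must be positive")
    # Convert volumetric flow in litres per minute to mass flow in kg per second.
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return heat_w / (mass_flow_kg_per_s * WATER_CP_J_PER_KG_K)


# Hypothetical 200 W heat load at a few assumed flow rates.
for flow in (0.5, 1.0, 2.0):
    rise_k = coolant_temperature_rise(200.0, flow)
    print(f"{flow:.1f} L/min -> water warms by {rise_k:.2f} K")
```

Because the rise per pass is small at modest flow rates, warm inlet water can still carry away a chip's heat. The components that the loop does not touch receive no such benefit, which is why the air side still has to be modelled.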

While such liquid cooling systems are extremely effective, they should not necessarily be used as the go-to solution for thermal management. With those in the ML space looking to optimise efficiency – both in terms of energy and cost – it’s vital that designers minimise thermal issues across their entire designs and not rely on the sledgehammer approach of installing a liquid cooling system just because the option is available.

For some of the most powerful chips, such as the Google TPUv3, it may be that liquid cooling is the most viable solution. In the future, however, as ever more investment is placed in ML hardware, engineers should not grow complacent when it comes to exploring different thermal management solutions. Liquid cooling may be sufficient to dissipate heat build-up in the most high-powered components currently available. However, since this may not always be the case, it may be more efficient to strive for designs that do not risk such accumulations of heat in the first place.

An all-encompassing solution

If AI hardware makers are truly going to overcome the thermal complications associated with their increasingly powerful designs, they must take every opportunity to optimise thermal management at every stage of the design process.

At chip level, appropriate materials for substrates, bonding, die attaches and interface materials need to be selected. At system level, there are equally important decisions to be made regarding PCB materials, heat-sinks and where to incorporate liquid cooling or thermoelectric coolers.

The more robust materials used in high-power electronics also bring challenges. Compared to typical FR4 PCBs, materials like ceramic or copper have high thermal conductivity, which can be advantageous in thermal management, but they add significant cost and weight to a design if not used optimally.
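The size of the gap between these materials can be seen from a one-dimensional conduction estimate, R = t / (k × A). The sketch below is illustrative only: the conductivity figures are approximate textbook values (real laminates and ceramics vary by grade and direction), and the slab dimensions and function name are hypothetical.

```python
# Approximate thermal conductivities in W/(m*K); illustrative values only.
APPROX_CONDUCTIVITY_W_PER_M_K = {
    "FR4 (through-plane)": 0.3,
    "alumina ceramic": 25.0,
    "copper": 390.0,
}


def conduction_resistance(thickness_m: float, conductivity_w_per_m_k: float, area_m2: float) -> float:
    """One-dimensional conduction resistance of a flat slab, in kelvin per watt."""
    return thickness_m / (conductivity_w_per_m_k * area_m2)


# Hypothetical slab: 1.6 mm thick under a 10 mm x 10 mm heat source.
THICKNESS_M = 1.6e-3
AREA_M2 = 10e-3 * 10e-3

for name, k in APPROX_CONDUCTIVITY_W_PER_M_K.items():
    r = conduction_resistance(THICKNESS_M, k, AREA_M2)
    print(f"{name:22s} {r:8.3f} K/W")
```

Under these assumptions the FR4 slab resists heat flow by roughly three orders of magnitude more than copper. That is the argument for using the costlier, heavier materials selectively, under the hottest components, rather than across a whole design.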

According to 6SigmaET’s research, which incorporates data from over 350 professional engineers, 75% don’t test the thermal performance of their designs until late in the design process, and 56% don’t run these tests until after the first prototype has been developed, whilst 27% wait until after a design is complete before even considering thermal complications.

Instead of relying on physical prototypes, which are expensive and time-consuming to produce, engineers can test the thermal qualities of their designs virtually, in the form of thermal simulations. This allows them to test designs using a wide variety of different materials and configurations – for example, switching from copper to aluminium at the click of a button. Simulation also enables designs to be tested in different environments, temperatures and operating-mode scenarios. This will not only identify potential inefficiencies but will also reduce the number of prototypes required.

Through the early-stage incorporation of thermal simulation into the design process, it is becoming increasingly easy for engineers to precisely understand the unique thermal challenges facing AI hardware. This means that thermal considerations can be dealt with much earlier, enabling the thermal performance of TPUs and related AI hardware to be fully optimised and reducing the risk of expensive late-stage fixes and unnecessary over-engineering.