28 Aug Microsoft: FPGA Wins Versus Google TPUs For AI

At the recent Hot Chips conference, three of the world’s largest datacenter companies detailed projects that exploit Field Programmable Gate Arrays (FPGAs) as accelerators for performance-hungry datacenter applications, particularly Machine Learning. While Xilinx and Intel (Altera) have long talked about the potential for their technologies to change the datacenter landscape, broad adoption has remained elusive, in part due to the challenges of FPGA development. Amazon, Baidu, and Microsoft each announced technologies and initiatives that they hope will address those barriers to adoption and enhance their own AI services. Being a bit of a hardware geek, I will focus primarily here on the implications of Microsoft’s technology. However, it is worthwhile to also consider the Baidu and Amazon announcements; taken together, I believe these announcements may bode well for FPGA adoption in the datacenter, which in turn could pave the way for growth for Intel and especially Xilinx (as a pure FPGA play).

“When the industry’s brightest spotlight turned to them at HotChips this week, Amazon, Baidu, and Microsoft all chose to talk about innovations in FPGA-based acceleration for their data centers,” said Steve Glaser, senior vice president of Corporate Strategy at Xilinx. “It is clear that momentum for FPGA acceleration in hyperscale data centers continues to be on the rise.”

First, Baidu announced a new architecture they hope could broaden the use of FPGAs as an acceleration platform. The new Baidu “XPU” combines a CPU, GPU, and FPGA in a flexible configuration on a Xilinx FPGA, which they hope will be easier to program than the low-level techniques developers traditionally use for FPGAs. For their part, Amazon Web Services provided an update on their progress with the F1 acceleration platform, which offers an 8-node Xilinx-equipped EC2 instance for developing FPGA-accelerated applications.

Figure 1:  The Microsoft Brainwave mezzanine card extends each server with an Intel Altera Stratix 10 FPGA accelerator, synthesized to act as a “Soft DNN Processing Unit,” or DPU, and a fabric interconnect that enables datacenter-scale persistent neural networks.  Source: Microsoft.

What did Microsoft announce?

While Amazon and Baidu are working to render FPGAs more accessible and easier to program on their clouds, Microsoft is perhaps the largest end-user of FPGAs for datacenter applications, accelerating a wide swath of their massive computing infrastructure and applications on Bing and Azure. To demonstrate their resulting prowess, Microsoft unveiled Project Brainwave, a scalable acceleration platform for deep learning that can provide real-time responses for cloud-based AI services. Microsoft had previously announced some 29 of these AI APIs, lowering the barriers to adoption for enterprises looking to get on board the AI bandwagon. Now Microsoft is sharing details about the hardware infrastructure upon which these MLaaS APIs and Bing internal services are built.

Microsoft’s Project Brainwave consists of three components:

    1. A high-performance systems architecture that pools accelerators for datacenter-wide services and scale. By linking their accelerators across a high bandwidth, low-latency fabric, Microsoft can dynamically allocate these resources to optimize their utilization while keeping latencies very low.
    2. A “soft” DNN processor (DPU) that is programmed, or synthesized, on 14nm-class Altera FPGAs. More on this below.
    3. A compiler and run-time environment to support efficient deployment of trained neural network models using CNTK, Microsoft’s DNN platform. Similar to the case of Google’s TPU and TensorFlow, Microsoft requires a hardware platform that is optimized for their own DNN framework. Interestingly, Microsoft has claimed that CNTK can have significant performance advantages over TensorFlow, especially for recurrent neural networks used for natural language processing. It is not clear to what extent Brainwave further enhances CNTK performance.

As I’ve recently explored, a fully custom chip, or ASIC, can give companies like Google a very fast machine learning accelerator at lower per-unit costs, but the development process can be cost-prohibitive and lengthy, and it results in a fixed-function chip, impeding one’s ability to quickly adapt silicon implementations as algorithms evolve. Microsoft pointed to this tradeoff in their announcement as a primary driver for their FPGA-based strategy. By using an FPGA instead of an ASIC for their “soft” DPU, Microsoft believes it can better optimize their hardware for their software at lower cost and with greater flexibility over time.

A great example of the advantage of FPGAs in machine learning is the ability to customize the level of precision required for a particular layer in a deep neural network. NVIDIA pioneered the use of reduced precision calculations in the Pascal and Volta GPUs (both of which support 16-bit floating point and 8-bit integer calculations). However, why stop there? Think about the requirements of a neural network layer that is determining someone’s sex from an image. This attribute requires just 2 bits: male, female, or other (unknown). With an FPGA, a neural net designer could model each layer in the net with the optimal (minimal) number of bits, which can have a significant impact on performance and efficiency, as the graph below demonstrates. The graph’s reference to LSTM denotes Long Short-Term Memory, a class of recurrent neural network often used for natural language processing, one of Microsoft’s fortes. (The astute reader will note that Microsoft did not share FP16 results, which would undoubtedly be lower than the 16-bit integer results. However, these chips are not designed for training a neural network, for which an NVIDIA Volta GPU can deliver up to 120 Tera-operations/second for the 16/32-bit operations needed in training.)
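
To make the precision argument concrete, here is a minimal Python sketch. It is not Microsoft's DPU design or toolchain; the function names, the sample weights, and the simple uniform quantization scheme are all assumptions chosen for illustration. It shows two things: how few bits a small set of output categories needs, and how the rounding error of a layer's weights grows as the bit width shrinks.

```python
def bits_needed(num_values: int) -> int:
    """Smallest bit width that can encode num_values distinct values."""
    if num_values < 1:
        raise ValueError("num_values must be at least 1")
    return max(1, (num_values - 1).bit_length())


def quantize(values, bits):
    """Round values onto a signed integer grid of the given width,
    then map them back to floats (uniform symmetric quantization).
    This is an illustrative scheme, not the one used by any vendor."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    levels = 2 ** (bits - 1) - 1            # largest representable magnitude
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / levels                   # float step between grid points
    return [round(v / scale) * scale for v in values]


def max_error(values, bits):
    """Worst-case absolute error introduced by quantizing to `bits`."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


# Hypothetical weights for a single layer (made up for this example).
weights = [0.82, -0.41, 0.07, -0.95, 0.33]

# Three categories (male, female, other) fit in two bits.
print("bits for 3 categories:", bits_needed(3))

for width in (16, 8, 4, 2):
    print(f"{width:2d}-bit worst-case weight error: {max_error(weights, width):.5f}")
```

Running the loop shows the error rising as the width drops. The tradeoff a designer would weigh per layer is whether the network's accuracy tolerates that error in exchange for narrower, cheaper arithmetic.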

Figure 2: Microsoft’s DPU can be programmed to process calculations for virtually any precision required by the neural network being used, delivering excellent performance. Also, Microsoft can reprogram (synthesize) these chips in a matter of weeks for a different use case.  Source: Microsoft.

Conclusions

The field of Machine Learning requires blazingly fast chips for acceleration, and we are seeing just the beginning of innovations in this area, as I outlined in an earlier blog posting. While Google has taken the ASIC path, Microsoft has demonstrated that they can achieve comparable and in some cases even better results using an FPGA, which enables them to continually track innovations in software with their custom hardware. Meanwhile, Xilinx, Baidu, and Amazon are working together to lower, if not completely remove, the traditional barriers to FPGA adoption. Taken together, these initiatives point to increased opportunities for other large AI consumers and providers to have their cake and eat it too; they can optimize custom chips for their applications while avoiding the cost and potential technology obsolescence of the custom ASIC path. However, we have only scratched the surface of this deep well of innovation.