06 Jan What To Expect in 2017 From AMD, INTEL, NVIDIA, XILINX And Others For Machine Learning

Without a doubt, 2016 was an amazing year for Machine Learning (ML) and Artificial Intelligence (AI). I have opined on the 5 things to watch in AI for 2017 in another article; however, the potential dynamics during 2017 in the processor and accelerator semiconductors that enable this market warrant further examination. It is interesting to note that shares of NVIDIA roughly tripled in 2016, due in large part to the company's technology leadership in this space. While NVIDIA GPUs enjoy a dominant position in Machine Learning training today, the company's latest quarterly growth of 197% year over year, in a market now worth over half a billion dollars, has inevitably attracted a crowd of potential competitors, large and small. And semiconductors remain one of the few pure AI plays for public equity investors seeking a position in this fast-growing market.

A Machine Learning Chips Primer

First, let's look at some background on the computational landscape for Machine Learning. As most of you probably know, there are two aspects of Machine Learning: training the neural network with massive amounts of sample data, and then using the trained network to infer some attribute about a new data sample. Both are incredibly computationally intensive, but the training task is mind-bogglingly complex; training a deep neural network (DNN) for image, text or voice classification literally requires trillions of billions of calculations to achieve adequate accuracy, typically around 95%. I have included a primer on this topic as an appendix to this article.
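To give a feel for that scale, here is a back-of-envelope sketch in Python. The model size, dataset size and epoch count are illustrative assumptions (roughly AlexNet/ImageNet class), not figures from any vendor:

```python
# Illustrative estimate of the arithmetic needed to train an
# image-classification DNN. All inputs here are assumptions.

params = 60e6           # assume ~60 million weights in the network
images = 1.2e6          # assume an ImageNet-scale training set
epochs = 90             # assume ~90 passes over the data
flops_per_image = 3 * 2 * params  # 2 FLOPs per weight; backward pass ~2x forward

total_flops = flops_per_image * images * epochs
print(f"~{total_flops:.1e} FLOPs")  # ~3.9e+16 for this sketch
```

Even this modest sketch lands in the tens of quadrillions of operations per training run, which is why training hardware is measured in TFLOPS.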

And the predictions are…

NVIDIA in 2017

Let's start with the leader, NVIDIA. In 2017, I expect NVIDIA to ramp Pascal (Tesla P100, P4 and P40) volumes, continue to nurture new market development with their software stack, and then launch their next-generation Volta GPUs (very) late in the year, probably at SC'17 in Denver. I hope to hear a few more details at the annual GTC conference in April. These powerful new NVIDIA chips, with faster GDDR6 and HBM2 memory, will be connected to IBM POWER9 CPUs using NVLINK and/or OpenCAPI, with interconnects from Mellanox, in two large supercomputers for the DOE's Summit and Sierra procurements. Don't be surprised if this launch is Deep Learning-centric, as these supercomputers will offer unprecedented Machine Learning performance. Based on the NVIDIA roadmap (below), these new GPUs could deliver up to twice the performance of Pascal, or from 21 TFLOPS at half precision to perhaps 35-40 TFLOPS.


NVIDIA's GPU roadmap implies that the Volta GPU could deliver twice the performance per watt of Pascal. (Source: NVIDIA)

AMD in 2017

Advanced Micro Devices (AMD) will begin shipping the highly anticipated Zen processor for servers in Q2, packaging up to 32 cores in the Naples SOC. For Machine Learning, the company has already disclosed a few details of their next-generation Vega GPU, which will also ship sometime in mid-2017. AMD has implied that this chip will deliver 25 TFLOPS at half precision, which would give AMD a roughly 20% edge over the NVIDIA Pascal P100. However, note that this advantage could be short-lived if NVIDIA's Volta keeps to its schedule.

While AMD has been developing this monster chip, they have completely revamped their software stack for GPUs in servers through the Radeon Open Compute Platform (ROCm), an open-source alternative to NVIDIA's CUDA and cuDNN, so they will be ready to ramp their ecosystem as soon as the new silicon is ready for action. They have also announced GPUs for Machine Learning inference under the Radeon Instinct brand. However, I do not believe these will be competitive with NVIDIA's P4 and P40, since the AMD GPUs are based on the Fiji and Polaris architectures, which do not support native 8-bit integer math operations (just packed 8-bit integer operands). So I would expect the company to shore up their inference GPU offerings sometime in 2017.

Intel in 2017

I expect Intel to start shipping their newly acquired Nervana Engine in the second half of 2017 for ML training workloads. At Intel's AI Day in November, the company said this product will be 100 times faster than the "best GPU", which may refer to Volta but probably refers to the Pascal P100. I remain skeptical that they can achieve such a feat, but it will be exciting to watch. Certainly, there remains a degree of execution risk at this stage of product development and acquisition integration. But the idea that a purpose-built accelerator could perform significantly better than a general-purpose GPU has merit, since a GPU still dedicates die area to functions used only for graphics, as well as to features such as double-precision floating point that Machine Learning does not need. It will be interesting to see if and how Intel plans to exploit the Nervana fabric, which allows a high degree of scaling for ML workloads.

In addition, Intel has previously stated that they will provide a mid-life kicker for the Xeon Phi (KNL) CPU, adding variable precision math (8-bit integer and perhaps 16-bit floating point) to improve their ability to compete with GPUs. For neural networks that demand a lot of memory, this could significantly improve Intel’s standing, especially for the inference side of the AI workload.


Diane Bryant announced the mid-life kicker for Knights Landing, code named Knights Mill, at IDF’16 to address the needs of Machine Learning in 2017. (Source: Intel)

XILINX in 2017

Xilinx announced their “reconfigurable acceleration platform” for Machine Learning last November and has a refreshed product portfolio including 16nm and 20nm technologies, well ahead of their Intel / Altera competition. The Xilinx strategy is to ease the development hurdles of FPGA programming by providing a suite of tools, APIs and IP already tailored to accelerate specific datacenter workloads. Amazon recently selected Xilinx to provide FPGA acceleration in the cloud for a wide variety of workloads, so I would expect a steady stream of new platforms that will enable Xilinx to compete in ML, primarily in inference jobs in datacenters and at the edge of the cloud.

Here come the ASICs in 2017

While CPUs, GPUs and FPGAs get all the headlines and the vast bulk of the revenue in Machine Learning, several companies have been able to attract cash from the venture capital community to fund the development of chips designed to do just one thing and do it very well: fast Machine Learning. Even though a GPU farm can now train a neural network in a matter of days instead of weeks or months, researchers yearn for chips that could train a network in minutes or hours to speed up and improve the development process. Google itself developed such a chip to accelerate ML inference and announced it in 2016 as the Tensor Processing Unit (TPU).

Now several startups, in addition to the Intel / Nervana effort mentioned above, could launch ASICs for Machine Learning in 2017 and 2018. The companies to watch include Cerebras, KnuPath, GraphCore and Wave Computing. While all but Wave remain in stealth mode, so details are not yet available, all of these companies hope to accelerate machine learning by a factor of over 100x versus a GPU. I suspect all will target both inference and training, but we should know more by the end of 2017. It appears that many are using some form of dataflow architecture, an unproven approach that has been the subject of academic research for well over a decade but which holds great promise for Machine Learning if it can be made to work. The first company out of the gates, pun intended, is probably Wave Computing, which publicly disclosed its architecture last October and plans to sample its appliances sometime this year. (Several of these companies hope to provide turnkey ML appliances instead of engaging in the longer slog of becoming a merchant semiconductor provider.) I also expect Google to update their TPU sometime in late 2017 or early 2018, at which time I would not be surprised to see them tackle the training side of the computational spectrum. So 2017 could become the year of the ML ASIC, although the risks and challenges are quite high.

So, as you can see, 2017 will be a year of increased competition to supply the silicon brains behind the artificial brains of Machine Learning. One thing to keep in mind beyond the chips is that the ecosystem NVIDIA has developed and nurtured will remain a significant advantage for the company for years to come, and thus represents a hurdle for all newcomers who envy NVIDIA's success. But this is a barrier that everyone seems to acknowledge, and intends to clear, as the rewards appear to be worth the investment.

A Machine Learning Chips Primer: CPUs, GPUs, FPGAs and ASICs

Today, the task of training a neural network, which underlies most of the recent advances in AI, is the undisputed domain of the GPU. NVIDIA has incorporated reduced-precision floating point (16 bits, or "half floats") in place of the standard 32-bit operations to accelerate this task even further; after all, if you can solve the problem with half the precision, you can theoretically fit twice the number of arithmetic units (ALUs) in the same die space (and therefore the same power and cost). Memory bandwidth is another key requirement of training accelerators, since it becomes the next bottleneck once the computational throughput increases. The NVIDIA Pascal architecture excels in both areas, raising the bar for competitors.
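The precision trade-off is easy to see in a few lines of NumPy. This is a generic illustration of half versus single precision, not a model of any particular GPU:

```python
import numpy as np

# The same weight tensor stored at full (FP32) vs half (FP16) precision.
# Halving the precision halves the bytes moved per operand, which is why
# the same die area and memory bandwidth can in principle serve twice
# the arithmetic throughput.
weights32 = np.random.randn(1024, 1024).astype(np.float32)
weights16 = weights32.astype(np.float16)

print(weights32.nbytes // weights16.nbytes)  # 2: half floats need half the bytes

# The cost is precision: FP16 carries ~3 decimal digits vs ~7 for FP32,
# so a sufficiently small weight update simply vanishes.
x = np.float16(1.0) + np.float16(1e-4)
print(x == np.float16(1.0))  # True: the tiny update was rounded away
```

Training at half precision works because neural networks tolerate this kind of rounding noise remarkably well, whereas scientific computing generally cannot.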

The task of inference is not as simple to characterize, as the nature of the data being analyzed determines the type of architecture best suited for the job. If you are just analyzing text and need an answer online in, say, less than 50 milliseconds, a modest CPU can be adequate, from an ARM CPU in a phone to a Xeon server in the cloud. But if you are analyzing 4K video at 30 frames per second and need to know whether to apply the brakes on the car you are "driving", then you will need something much faster: perhaps a GPU, an FPGA, or even a custom chip (an application-specific integrated circuit, or ASIC). The vast bulk of inference today is fairly simple and can even be calculated with 8-bit integer math; once again a reduced-precision approach, one that can increase performance by perhaps 4x in this case. And, once again, NVIDIA took the lead; the NVIDIA Pascal architecture supports native 8-bit and 16-bit math in the P4 and P40 GPUs used for inference.
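A minimal sketch of what "8-bit inference" means in practice: floating-point activations are mapped to int8 with a scale factor, the heavy math runs in integer units, and results are rescaled afterward. The function names and the simple symmetric-scale scheme below are illustrative, not any vendor's API:

```python
import numpy as np

def quantize(x, scale):
    # Map floats to int8 with round-to-nearest, clamped to the int8 range.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    # Recover approximate float values from the int8 codes.
    return q.astype(np.float32) * scale

acts = np.array([0.5, -1.25, 3.0, -0.02], dtype=np.float32)
scale = np.abs(acts).max() / 127.0   # simple symmetric scale over the tensor

q = quantize(acts, scale)
approx = dequantize(q, scale)
print(np.max(np.abs(acts - approx)))  # error is bounded by half the scale step
```

The rounding error this introduces is usually negligible for a trained classifier, which is why inference accelerators can trade precision for a roughly 4x gain in throughput.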

However, Xilinx's latest FPGAs now support 8-bit integer operations for Machine Learning and have the added benefit of being reconfigurable, changing the hardware as algorithms evolve. This little bit-nit will play an important role as new products are launched in 2017; Intel, for example, will add 8-bit integer operations to the Xeon Phi chip in 2017. Finally, if you need to support a great many simultaneous threads of inference, as you might in a large cloud environment, the high development cost of a custom chip may be worth the investment. Google has done exactly this with their Tensor Processing Unit (TPU), which recently enabled them to reduce their prices for certain machine learning services by a factor of eight. While it may cost tens of millions of dollars to develop an ASIC of this complexity, the manufacturing cost can be reduced to tens of dollars per chip.