15 May Why NVIDIA Is Building Its Own TPU

The blistering pace of innovation in artificial intelligence for image, voice, robotic and self-driving vehicle applications has been fueled, in large part, by NVIDIA’s GPU chips, which deliver the massive compute power demanded by the math underlying Deep Learning. While NVIDIA continues to reap the benefits of its investments in GPUs, there has been speculation that fixed-function application-specific integrated circuits (ASICs) might one day eclipse NVIDIA’s GPU-centric approach. This argument has been fueled in part by Google’s investment in its own custom ASIC for Deep Learning inference, the Tensor Processing Unit (TPU). Now NVIDIA itself seems to have embraced this approach, albeit in a limited fashion, announcing its own ASIC technology for Deep Learning acceleration. In a surprising and bold move, the company also announced that it will open source this technology so that others can build chips with it. Let’s look at what NVIDIA has done in this space and, more importantly, why.

NVIDIA’s House of AI

Just a year ago, NVIDIA announced the Pascal generation of GPUs, which tripled the performance of AI workloads, and the industry’s big players in AI, including Amazon, Baidu, Facebook, Google and Microsoft, have apparently been buying them hand over fist. As a result, NVIDIA’s datacenter business grew by 186% YOY in the latest quarter and is now a $1.6B run-rate business. Now, in what is already a one-horse race, NVIDIA has announced Volta, perhaps the largest and most complex chip ever devised, delivering a mind-boggling 120 trillion operations per second. The Volta-based Tesla V100 chip, shipping later this year, will speed the training and use of Machine Learning by a factor of 6 to 12 over the already impressive Tesla P100. In short, the company has been able to increase performance for AI workloads by over 60-fold in just 3 years. To put that into perspective, traditional CPUs only double performance in a similar timespan. So, with all this momentum, why would NVIDIA branch out beyond the GPU?
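As a rough check on what those figures imply, the short Python sketch below converts "60-fold in 3 years" and "doubling in 3 years" into per-year growth multipliers. It assumes smooth compound growth, which is a simplification, and the function names are made up for illustration.

```python
import math


def annualized_growth(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def doubling_time_months(total_factor: float, years: float) -> float:
    """Months needed to double performance at that same compound rate."""
    return 12.0 * years * math.log(2.0) / math.log(total_factor)


# Figures quoted in the article: 60-fold (GPU, AI workloads) vs. 2-fold (CPU) over 3 years.
gpu_per_year = annualized_growth(60.0, 3.0)
cpu_per_year = annualized_growth(2.0, 3.0)
gpu_doubling_months = doubling_time_months(60.0, 3.0)

print(f"GPU: about {gpu_per_year:.1f}x per year, doubling every {gpu_doubling_months:.0f} months")
print(f"CPU: about {cpu_per_year:.2f}x per year")
```

Under that assumption, the GPU figure works out to nearly a 4x gain per year, or a doubling roughly every six months, against about 1.26x per year for the CPU.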

Moving Beyond the GPU for AI

While the NVIDIA GPU has become the standard for training neural networks for Machine Learning, some have argued that the use of these networks (called inference processing) can best be executed on FPGAs or on an ASIC such as the Google TPU, especially when deployed in very large volumes where their speed and low cost can compensate for the significant development expenses required. Fixed-function ASICs are not as flexible as a GPU or an FPGA, as they are designed to do only one thing, but to do it very fast. The GPU’s flexibility comes at a cost in terms of die area and power consumption, so in theory an ASIC should perform better than a GPU. On the other hand, the argument for the GPU has been that deep learning research is moving so fast that a new ASIC could be obsolete by the time you finish it, many years and millions of dollars later.

Now NVIDIA has announced that the best answer may be a hybrid approach. It will use a CPU where performance is not critical but the need for programmability is high, a GPU where operations must run in parallel but some level of flexibility and programmability should be retained, and an ASIC where the algorithms have become stable and the volumes will be large, as is the case for deep learning inference processing.
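The rule of thumb in that paragraph can be written down as a small decision function. This is only an illustrative sketch of the reasoning as described here, not anything NVIDIA has published; the `Processor` enum, the `pick_processor` function and its boolean inputs are all hypothetical.

```python
from enum import Enum


class Processor(Enum):
    CPU = "CPU"
    GPU = "GPU"
    ASIC = "ASIC"


def pick_processor(performance_critical: bool,
                   parallel_workload: bool,
                   algorithm_stable: bool,
                   high_volume: bool) -> Processor:
    """Hypothetical encoding of the hybrid CPU/GPU/ASIC rule of thumb."""
    # Performance is not critical: favor the most programmable option.
    if not performance_critical:
        return Processor.CPU
    # Stable algorithms shipped in large volumes justify a fixed-function chip.
    if algorithm_stable and high_volume:
        return Processor.ASIC
    # Parallel work that still needs flexibility goes to the GPU.
    if parallel_workload:
        return Processor.GPU
    # Anything else falls back to the general-purpose CPU.
    return Processor.CPU


# Deep learning inference in a mass-market device: stable algorithm, large volume.
inference_choice = pick_processor(True, True, True, True)
# Training with fast-changing research models: parallel, but not yet stable.
training_choice = pick_processor(True, True, False, False)

print(inference_choice.value, training_choice.value)
```

In this sketch, high-volume inference lands on the ASIC and fast-changing training work stays on the GPU, which mirrors the split the paragraph describes.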

NVIDIA’s Deep Learning Accelerator (DLA)

With this context in mind, it makes sense for NVIDIA to build a fixed-function accelerator that acts as an efficient inference engine as part of a larger solution. NVIDIA announced that its next-generation Drive PX platform for autonomous vehicles, the Xavier SOC, would consist of ARM CPU cores, Volta GPU cores, and a fixed-function Deep Learning Accelerator (DLA) for inference. This approach, the company says, will result in higher performance at lower power, while maintaining the flexibility for customization that its automotive OEMs demand.

NVIDIA will use CPU, GPU and ASIC designs where each has merit, seeking the optimal tradeoff between performance, cost, power, and programmability. (Source: NVIDIA)

While the initial implementation of the NVIDIA DLA will be in the Xavier SOC for autonomous vehicles, I expect the company to extend this approach to other platforms, such as the low-cost Jetson platform for vision-guided autonomous robots, drones, etc. But why stop there? After all, in the world envisioned by NVIDIA’s CEO, Jen-Hsun Huang, there will be trillions of devices embedded and connected in the Internet of Things that will require the intelligence afforded by AI. This is where NVIDIA’s strategy to open source the DLA comes in.

The Xavier SOC for autonomous machines will be the first product to sport the new NVIDIA DLA. (Source: NVIDIA)

NVIDIA has always focused on solving very hard, computationally complex problems. So it has no interest in designing, as Huang puts it, a deep learning chip for smart lawn mowers, or a deep learning chip for refrigerators, or a deep learning chip for streetlamps. (All might be different.) By deciding to open source the DLA, NVIDIA is extending its rich deep learning ecosystem to include low-cost, high-volume and low-power ASICs and SOCs, allowing other companies and researchers to build their own chips using this accelerator. And of course it all runs the same CUDA software used by NVIDIA GPUs. The company, in effect, is saying, “OK, if you want to build a TPU for your little widget, it’s probably best to build it on our technology, since nobody knows more about accelerating AI than NVIDIA.” Meanwhile, NVIDIA can focus on building the high-margin, high-value platforms needed in the datacenter and at the edge.

Conclusions

For over a year, many have been wondering how NVIDIA would respond to the Google TPU, and now we know. Instead of being threatened, it has effectively de-positioned the Deep Learning ASIC (TPU) as a tool it can use where it makes sense, while maintaining the lead role for its GPUs and CUDA software. And by open sourcing the technology, it can retain a control point for the IoT adoption of machine learning. The risk with this strategy is that the open source approach may lend support to an idea that could evolve into a threat to NVIDIA’s long-term goals for datacenter inference engines. I would argue that this could happen anyway, and that at least NVIDIA can now participate in that market indirectly, or even directly should it choose.