02 Dec NVIDIA Runs The MLPerf Table

When the MLPerf (machine learning performance) organization released the first standardized AI training benchmarks, which NVIDIA dominated last December, we were somewhat disappointed that only a few companies, namely NVIDIA, Google, and Intel, were able to muster the working chips and software needed to complete the benchmarks. Last week, the open-source MLPerf community released the first round of benchmarks for inference processing, wherein a trained neural network is used to categorize or process new input data. Once again, only a few companies were able to deliver results on currently available chips, and once again NVIDIA dominated the race. But the devil is in the details, and there are a ton of details one needs to understand to interpret the results properly. So, let’s get started.

What is the MLPerf inference benchmark?

Inference is becoming a huge market that consists of a myriad of use cases, each requiring different levels of throughput, latency, and power. After all, a complex multi-stream suite of deep neural network models used in an autonomous vehicle will demand far more performance and lower latency, and will consume more power, than AI applications on a smartphone or a security camera. The MLPerf community of over 50 companies and universities defined a suite of 5 benchmarks (2 for image classification, 2 for object detection, and 1 for translation) that can be run on different types of hardware (“form factors”) and in 4 different scenarios, as depicted in Figure 1.

To provide a more comprehensive perspective, MLPerf also allows results to be submitted on chips that are in preview, or “coming soon to a server near you.” For all benchmarks, there are divisions for closed (rigorously defined), open (allowing changes to parameters such as latency constraints), and research (for, well, research), and of course you can run the benchmark on a single chip or on racks of servers with scores of chips. Add it all up (or multiply it all together) and there are literally thousands of cells in the MLPerf spreadsheet. As you can see from the MLPerf website, this produces a pretty sparse matrix, in which niche players can cherry-pick the results that reflect their target markets.

Figure 1: The span of MLPerf benchmarks, form factors, and scenarios covers the waterfront of inference processing, from the edge to the cloud. 
NVIDIA

And the winner is…

Let’s look at the Available Closed results, then some observations from the Preview and Open submissions. More industrious readers with a zeal for detail (and a great deal of time) can dive deep into the MLPerf website.

MLPerf v0.5 highlights

I am sharing images provided by NVIDIA, because, well, NVIDIA is really good at images. Note that this is MLPerf v0.5, which does not yet include any metrics for power consumption or cost. So, while NVIDIA beats everyone in per-chip performance among currently available chips, Qualcomm and others in the mobile space will surely show greater efficiency once power is considered.

First, on the data center benchmarks, NVIDIA wins every race on a per-chip basis, including against the Google TPUv3. Google did submit some impressive results for multi-chip configurations of up to 128 chips, with excellent near-linear scalability. The Habana Goya performed quite well on ResNet, especially considering its low power consumption; however, Habana did not submit results for the other data center benchmarks, so time will tell. The Intel Xeon Platinum 9282 is a two-chip beast that consumes some 400 watts and is the largest package Intel has ever developed.

Figure 2: MLPerf results for data center inference were dominated by NVIDIA and Habana Labs. 
NVIDIA

In the edge space, we see a real mixed bag of parts, from Qualcomm’s Snapdragon 855, which powers some of the world’s fastest smartphones, such as the Google Pixel 4, to the NVIDIA Xavier, which is used in the Jetson line for products like drones as well as in NVIDIA Drive for autonomous vehicles. Xavier is an SoC that supports the complete vision-guided system workflow across multiple networks, and is equipped with CPUs, GPUs, sensor fusion hardware, and a deep learning accelerator, among other logic. This mix is a reflection of the diverse environments in which inference is being deployed: there is no overlap or competition between smartphones, laptops, and autonomous vehicles, but the underlying AI uses the same neural network models.

Figure 3: The MLPerf submissions for edge AI cover a wide range of parts from NVIDIA, Intel, and Qualcomm, each targeting a different market. 
NVIDIA

While NVIDIA continues to run the table in performance for available chips, emerging competition looms large. Intel submitted a “preview” result for its Nervana inference chip (NNP-I) that looks quite promising. The chip has yet to be formally launched and is expected to be available in the next three to six months. Dual NNP-Is scored ~10,200 images per second (IPS), while eight NVIDIA T4s delivered ~41,000; you can do the math, but on a per-chip basis the two are roughly at parity. The Alibaba T-Head HanGuang 800, also in preview, could prove even more impressive: it processed some ~45,000 IPS in the online benchmark and a whopping ~69,000 IPS in offline (batch) processing. Alibaba is currently using the chip for internal applications, but has said it will be available to its cloud customers, in a move that mirrors Google’s strategy with its TPUs.
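For readers who would rather not do the math in their heads, the per-chip comparison above can be sketched in a few lines of Python, using the approximate multi-chip totals quoted in this article (the system labels are mine, for illustration only):

```python
# Per-chip throughput from the approximate MLPerf v0.5 figures quoted above.
# Totals are system-level images per second (IPS); dividing by chip count
# gives a rough per-accelerator comparison.
results = {
    "Intel NNP-I (x2)": (10_200, 2),   # dual-chip preview submission
    "NVIDIA T4 (x8)": (41_000, 8),     # eight-accelerator system
}

for name, (total_ips, num_chips) in results.items():
    per_chip = total_ips / num_chips   # 10,200/2 = 5,100; 41,000/8 = 5,125
    print(f"{name}: {per_chip:,.0f} IPS per chip")
```

At roughly 5,100 IPS versus 5,125 IPS per chip, the preview NNP-I lands essentially even with the shipping T4 on this benchmark, which is why the result looks promising.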

Figure 4: Jeff Zhang, chief technology officer at Alibaba, introduced the company’s first AI chip, the HanGuang 800, in September. 
ALIBABA

Habana Labs submitted another interesting result in the Open Division (which permits customized parameters), where it claimed excellent performance at low latencies. The Open Division is going to become quite chaotic: there were over 450 submissions in the v0.5 release!

Conclusions

First, I must say that I am quite impressed by the industry’s embrace of and contributions to mlperf. It’s been a long time coming, but the open community has accomplished a great deal in the first two suites of benchmark results.

Second, I am not surprised by NVIDIA’s strong showing in inference across all five benchmarks. NVIDIA has the most experience, the most mature hardware, and the most extensive software tools.

Finally, I am excited by the prospects offered by the emerging competitive field, including Intel, Alibaba, Habana Labs, and dozens of startups. However, as I have previously noted, this stuff is harder than it looks, and it is taking the challengers quite a while to field production-ready silicon and software. At least we should finally be finished with TOPS!