24 Jan A Cambrian Explosion In Deep Learning, Part 2: The Startups

This is the second of three blogs on the state of the AI chip market and what’s to come in 2019. The year will be a festival of new chips and benchmark battles, led by the large companies I mentioned in the first blog (Intel, Google, AMD, Xilinx, Apple, and Qualcomm) and joined by dozens of Silicon Valley startups and Chinese unicorns sporting valuations in excess of a billion US dollars. In this installment, I will cover the most prominent, or at least the loudest, of the startups in the West and in China, where the government is intent on creating an indigenous AI chip industry. We will start with Wave Computing, which appears to be the first to market with silicon for training.

Wave Computing

Wave Computing had an eventful 2018, marked by its MIPS acquisition, the initiation of beta testing of its systems with a few lucky customers, and the launch of its MIPS Open initiative. While the Wave architecture has some very interesting features, which I explored in depth here, we are awaiting information on customer experience with real workloads at scale.

Wave is not an accelerator that attaches to a server; it’s a standalone processor for graph computing. This approach has pros and cons. On the plus side of the ledger, Wave will not suffer from the memory bottlenecks that plague accelerators like GPUs. On the negative side, installing a Wave appliance would be a forklift upgrade: it completely replaces traditional x86 servers, which puts Wave in competition with every server manufacturer.

I don’t expect Wave to deliver NVIDIA-beating results for a single node, but the architecture is designed to scale quite well, and the company has indicated it should have customer results very soon. Stay tuned!

Figure 1: Wave is shipping systems built from the 4-node “DPU” board shown above. WAVE COMPUTING


Graphcore

Graphcore is a very well-funded ($310M invested at a current $1.7B valuation) UK-based unicorn startup with a world-class team. It is building a novel graph processor architecture that keeps memory on the same chip as its logic, which should enable higher real-world application performance. The team has been teasing its product launch for quite a while; it was “almost ready for launch” last April, and the most recent information from the company, last December, indicated it would begin production shipments soon. Its list of investors is impressive, including Sequoia Capital, BMW, Microsoft, Bosch, and Dell Technologies.

I’ve been briefed on the company’s architecture, and it looks quite impressive, scaling from edge devices to the “Colossus” dual-chip package for datacenter training and inference. At the recent NeurIPS event, Graphcore showed off its RackScale IPU Pod, which delivers over 16 petaflops in a rack of 32 servers. While the company frequently claims it will deliver over 100 times the performance of a best-in-class GPU, my math begs to differ.

The company says a server with four “Colossus” GC2 packages (eight chips in total) delivers 500 teraflops (trillions of floating-point operations per second) of mixed-precision performance. A single NVIDIA V100 delivers 125 teraflops, so four V100s should deliver the same performance, in theory. As usual, the devil is in the details: the V100’s peak performance is only available when the code is refactored to execute 4×4 matrix multiplications for TensorCore execution, a limitation the Graphcore architecture deftly avoids. Not to mention that a V100 consumes 300 watts and wads of cash from your wallet. Graphcore also supports an on-die interconnect and a “memory in processor” (on-die memory) approach that could yield application performance well beyond what the teraflops figure would imply. In some neural networks, such as Generative Adversarial Networks, memory is the bottleneck.
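As a quick sanity check on those numbers, here is a back-of-the-envelope sketch using only the peak figures quoted above; it ignores utilization, power, price, and memory effects, which is exactly where the two architectures actually differ:

```python
# Peak mixed-precision throughput figures quoted above.
graphcore_server_tflops = 500   # four "Colossus" GC2 packages (8 chips)
v100_peak_tflops = 125          # single NVIDIA V100 with TensorCores

# How many V100s match the Graphcore server's peak number?
v100_equivalent = graphcore_server_tflops / v100_peak_tflops
print(v100_equivalent)  # 4.0 -- parity on paper, not a 100x gap
```

The gap the company claims would have to come from sustained (not peak) performance: the V100 only hits its peak on TensorCore-friendly 4×4 matrix work, while Graphcore’s on-die memory may sustain a much higher fraction of peak on memory-bound networks.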

Once again, we will have to await real user application results to properly evaluate this architecture. Still, Graphcore’s list of investors, roster of experts, and sky-high valuation tell me that it is probably on to something good.

Figure 2: Graphcore shows off this very cool image of the processing of the ImageNet dataset. This visualization can help developers understand where their training runs are consuming processing cycles. GRAPHCORE

Habana Labs

Habana Labs, an Israeli startup, surprised many people last September by announcing at the first AI Hardware Summit that it was ready to ship its first silicon for inference, with record performance running convolutional neural networks for image processing. It showed benchmarks classifying 15,000 images per second on the ResNet-50 image-classification benchmark, outperforming NVIDIA’s T4 device by some 50% while consuming only 100 watts. Its latest round of funding (December 2018) was led by Intel Capital and joined by WRV Capital, Bessemer Venture Partners, and Battery Ventures, adding $75M to the startup’s previous $45M. The new funding will go in part toward taping out its second chip, called “Gaudi,” which will target the training market and supposedly scale to thousands of processors. In a crowded field, Habana shows a lot of promise.
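Taking the company’s quoted figures at face value, a rough sketch of what they imply; note that the T4 number below is my own inference from the “50% faster” claim, not a published benchmark:

```python
# Habana's quoted numbers for ResNet-50 inference.
habana_images_per_sec = 15_000
habana_watts = 100

# Efficiency implied by those two figures.
print(habana_images_per_sec / habana_watts)   # 150.0 images/sec per watt

# "Outperforming the T4 by some 50%" implies roughly:
implied_t4_images_per_sec = habana_images_per_sec / 1.5
print(implied_t4_images_per_sec)              # 10000.0 images/sec
```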

Other startups

I know of over 40 companies around the world building chips for AI training and inference. I wouldn’t be surprised to find that most of these companies are doing simple FMAs (floating-point multiply-accumulates) and mixed-precision math (8-bit integers, 16- and 32-bit floating point). That approach is relatively easy to build and will pick up some low-hanging fruit, but it will not provide an enduring architectural advantage against big players like NVIDIA and Intel, or against the few startups doing novel architectures like Wave and Graphcore. Here are a few that have caught my attention and with whom I have had briefings:

  • Groq: Ex-Googlers who worked on TPU, with ambitions to rule the rest of the world.
  • Tenstorrent: Canadian firm of ex-AMDers; it remains in stealth mode. I can only say that I am impressed by its CEO’s vision and architecture.
  • ThinCi: Indian company focused on edge devices and autonomous vehicles, boasting partnerships with Samsung and Denso.
  • Cerebras: Led by ex-SeaMicro execs, including Andrew Feldman; it remains in deep stealth mode.
  • Mythic: A startup that is taking a unique approach to edge inference processing, akin to analog processing on non-volatile memory; should have chips in 2019.
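The “simple FMA plus mixed precision” pattern described above is, at its core, just a multiply-accumulate with low-precision inputs and a wider accumulator. A minimal illustrative sketch (the input values are made up; real hardware does this with int8/fp16 operands and fp32 accumulation, thousands of lanes at a time):

```python
# A multiply-accumulate (MAC), the core op of most AI accelerator datapaths:
# multiply low-precision inputs, accumulate the products in a wider format.
def dot_mac(a, b):
    # a, b: lists of small integers standing in for int8 weights/activations
    acc = 0.0                  # accumulator kept in wider (float) precision
    for x, y in zip(a, b):
        acc += x * y           # one multiply-accumulate per element pair
    return acc

print(dot_mac([12, -7, 3], [5, 2, -9]))  # 19.0
```

Every matrix multiply in a neural network reduces to loops of this operation, which is why it is the easy thing to build and, equally, why it offers little architectural differentiation on its own.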

China, Inc.

China has been trying to find a path to wean itself off US semiconductors, and AI accelerators may provide the off-ramp it has been seeking. The Chinese central government has set a 2030 goal of building a trillion-dollar AI industry, and investors have already poured well over $4B into startups since 2012. The U.S. Congress has called this an AI arms race: the US technology industry could fall behind as Chinese companies and research institutions drive innovation with less regard for the privacy and ethical concerns that hamper progress in the West.

Cambricon and SenseTime are perhaps the largest Chinese AI players to watch, but smaller players like edge-AI-focused Horizon Robotics also warrant attention. Additionally, keep an eye on larger internet companies like Baidu, Huawei, Tencent, and Alibaba, all of which have significant investments in AI software and hardware.

Sporting a $2.5B valuation, Cambricon is a Chinese unicorn that has already released its third-generation AI chip. The company claims it can deliver approximately 30% better performance than the NVIDIA V100 for AI workloads, at lower power. Cambricon also licenses its IP to customers and provides the AI hardware block in the Huawei Kirin 970 mobile chipset.

SenseTime, perhaps the AI startup with the largest valuation, is best known for spreading smart surveillance cameras across China. These cameras number in excess of 175 million, including cameras made by other companies. Founded in Hong Kong, SenseTime garnered $600M in its latest round of funding, led by Alibaba. The startup is now valued at $4.5B, according to various press reports. SenseTime has built strategic partnerships with major companies such as Alibaba, Qualcomm, Honda, and even NVIDIA. The company runs one supercomputer today with some 8,000 (presumably NVIDIA?) GPUs, and plans to build out five more to process the facial recognition feeds from its millions of cameras.

In the next and final blog of this three-part series, I will explore how NVIDIA might respond to all this competition. Thanks for reading!