08 Aug IBM Plays With The AI Giants With New, Scalable And Distributed Deep Learning Software

I’ve been following IBM’s AI efforts with interest for quite a while now. In my opinion, the company jump-started the current cycle of AI with the introduction of Watson back in the 2000s and has steadily been ramping up its efforts since then. Most recently, I wrote about the launch of PowerAI, IBM’s software toolkit for OpenPOWER systems, aimed at enterprises that don’t want to develop their AI solutions entirely from scratch but still want to customize them to fit their specific deep learning needs. Today, IBM Research announced a breakthrough that should further enhance PowerAI and its other AI offerings: new Distributed Deep Learning (DDL) software, one of the biggest announcements I’ve tracked in this space in the past six months.

Getting rid of the single-node bottleneck

Anyone who has been paying attention knows that deep learning has really taken off in the last several years. It’s powering hundreds of applications, in consumer as well as business realms, and continues to grow. One of the biggest problems holding back its further proliferation, however, is scalability. Most AI servers today are a single system, not multiple systems combined. The most popular open-source deep learning frameworks simply don’t perform well across multiple servers, which creates a time-consuming bottleneck. In other words, while many data scientists have access to servers with four to eight GPUs, they can’t scale beyond that single node. At the end of the day, the software just wasn’t designed for it.
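To make the bottleneck a little more concrete, here is a minimal sketch of synchronous data-parallel training, the generic pattern most frameworks use to spread one training job across several GPUs. It is not IBM’s DDL algorithm, which isn’t described here; the toy model and every function name below are hypothetical. Each worker computes a gradient on its own slice of the data, and the workers then average those gradients before anyone can update the model. That averaging step happens on every iteration, and it is the part that gets expensive once the workers sit in different servers.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the data into equal-sized contiguous shards, one per worker.

    Any remainder is dropped, so pick a length divisible by num_workers.
    """
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the cross-worker averaging step (hypothetical helper).

    In a real cluster this is network communication between GPUs/servers.
    """
    return sum(values) / len(values)


def train(data: List[Sample], num_workers: int, steps: int, lr: float) -> float:
    """Run synchronous data-parallel gradient descent and return the weight."""
    w = 0.0
    shards = shard(data, num_workers)
    for _ in range(steps):
        # Each "worker" computes a gradient on its own shard...
        local_grads = [gradient(w, s) for s in shards]
        # ...then all workers must agree on the average before updating.
        w -= lr * all_reduce_mean(local_grads)
    return w


# Toy data generated from y = 3x, so the ideal weight is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_single = train(data, num_workers=1, steps=200, lr=0.01)
w_multi = train(data, num_workers=4, steps=200, lr=0.01)
print(f"1 worker: w = {w_single:.6f}")
print(f"4 workers: w = {w_multi:.6f}")
```

With equal-sized shards, averaging the per-worker gradients gives the same result as one big batch, so both runs land on the same weight. The math is the easy part; the hard part at scale is doing that averaging quickly across many machines.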

Enter the IBM DDL library: a library built with IBM Research’s unique clustering methods that links into leading open-source AI frameworks (such as TensorFlow, Caffe, Torch, and Chainer). With DDL, these frameworks can be scaled to tens of IBM servers, taking advantage of hundreds of GPUs, a night-and-day difference from the old way of doing things. To paint a picture, when IBM initially tried to train a ResNet-101 model on the ImageNet-22K data set, it took 16 days on a single Power “Minsky” server using four NVIDIA P100 GPU accelerators. A 16-day training run means a significant delay in time to insight, and can seriously hinder productivity.

IBM is calling DDL “the jet engine of deep learning,” a catchy moniker that honestly isn’t too far off the mark in my opinion. Using DDL techniques, IBM says it was able to cut that same process down to a mere 7 hours on 64 Power “Minsky” servers, with a total of 256 NVIDIA P100 GPU accelerators. Let me reiterate that: 16 days, down to 7 hours. If these results are accurate, which I think they are, it’s clear why IBM thinks it has a real game-changer on its hands. IBM’s new image recognition record of 33.8% accuracy in 7 hours handily surpasses the previous industry record set by Microsoft: 29.9% accuracy in 10 days. To top it all off, IBM says DDL scales efficiently across up to 256 GPUs, with up to 95% efficiency on the Caffe deep learning framework.
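For readers who want to sanity-check those numbers, here is a quick back-of-the-envelope calculation using only the figures quoted above (16 days on 4 GPUs versus 7 hours on 256 GPUs). The function names are my own, and “efficiency” here simply means observed speedup divided by the increase in GPU count. That is an assumption on my part and not necessarily how IBM defines its 95% figure, which is quoted for Caffe and may be measured against a different baseline or workload.

```python
def speedup(baseline_hours: float, scaled_hours: float) -> float:
    """How many times faster the scaled run finished than the baseline."""
    return baseline_hours / scaled_hours


def scaling_efficiency(baseline_hours: float, baseline_gpus: int,
                       scaled_hours: float, scaled_gpus: int) -> float:
    """Observed speedup as a fraction of the ideal (linear) speedup.

    Ideal speedup is assumed to equal the ratio of GPU counts.
    """
    ideal = scaled_gpus / baseline_gpus
    return speedup(baseline_hours, scaled_hours) / ideal


# Figures quoted in the article; both durations are rounded, so treat
# the results as approximate.
baseline_hours = 16 * 24   # 16 days on one server with 4 GPUs
scaled_hours = 7           # 7 hours on 64 servers with 256 GPUs

observed_speedup = speedup(baseline_hours, scaled_hours)
ideal_speedup = 256 / 4
efficiency = scaling_efficiency(baseline_hours, 4, scaled_hours, 256)

print(f"Observed speedup: {observed_speedup:.1f}x")
print(f"Ideal speedup:    {ideal_speedup:.0f}x")
print(f"Efficiency:       {efficiency:.1%}")
```

By this rough measure, the ResNet-101 run is about 55 times faster on 64 times the GPUs, or roughly 86% of perfect linear scaling. That is a different measurement from IBM’s 95% claim, so the two shouldn’t be compared directly, but either way it is a long way from the single-node ceiling.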

Now available in beta

Developers won’t have to wait to try out this new technology. IBM Research is delivering a beta version of DDL to IBM Systems, and it is available now in the newly announced fourth revision of IBM’s PowerAI (for TensorFlow and Caffe, with Torch and Chainer to follow soon). I think this will be a great addition to IBM’s Power systems, which I’ve called the “Swiss Army knives of acceleration”: standard PCI Express, CAPI, and NVLink, all wrapped up in one platform.

Another notable thing about DDL is that it will be available not only on-prem but also in the cloud, via a cloud provider called Nimbix. In today’s hybrid environment, this flexibility is obviously a plus. Developers can try out the beta version now on Nimbix or on an IBM Power Systems server.

Wrapping up

One of the most interesting things for me is that this new technology is coming from IBM, not one of the flashier, louder AI proponents like Google or Facebook. If IBM can continue to bring “firsts” to the table, it looks like the company is shaping up to be a major player not just in the enterprise, but in deep learning overall. DDL and OpenPOWER are the secret sauce that I think will give IBM the edge it needs by significantly cutting down training times and improving accuracy and efficiency. I’ll continue to watch with interest, but I think that by getting rid of this bottleneck, DDL has the potential to really open the deep learning floodgates. It could be a real game-changer for IBM, PowerAI, and OpenPOWER.