Espresso

Speeding up and debittering Caffe by adding Halide

Project proposal (April 2, 2015)

Summary

Espresso is an implementation of Caffe, a framework for deep learning, in the Halide image processing language. It will support performance tuning of deep neural network code so that training and evaluation run fast on multiple platforms.

Background

Caffe is a framework for convolutional neural networks that allows users to define deep learning models and optimization using configuration files rather than hard-coding them. The framework emphasizes training and evaluation speed, which it achieves through GPU processing. It is widely employed in research, prototyping, and large-scale applications requiring speed, such as computer vision and speech processing.

Halide is a programming language embedded in C++ designed for implementing high-performance image processing pipelines. It simplifies implementation by decoupling algorithms (the computations required to solve a problem) from their schedules (the order in which the work is executed, and the locality and storage it uses). This allows performance to be tuned on multiple platforms without significantly modifying the structure of the code.
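The separation of algorithm and schedule can be illustrated without Halide itself. The sketch below is plain C++, not Halide API: the hypothetical function `blur_at` plays the role of the algorithm (what each output value is), while `blur_row_major` and `blur_tiled` are two hand-written "schedules" that visit the same points in a different order. All names here are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// The "algorithm": the value of one output pixel as a pure function of (x, y).
// Here it is a horizontal 3-tap box blur with clamped borders.
inline float blur_at(const std::vector<float>& in, int width, int x, int y) {
    const int xl = std::max(x - 1, 0);
    const int xr = std::min(x + 1, width - 1);
    return (in[y * width + xl] + in[y * width + x] + in[y * width + xr]) / 3.0f;
}

// "Schedule" 1: visit the output in plain row-major order.
std::vector<float> blur_row_major(const std::vector<float>& in, int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[y * width + x] = blur_at(in, width, x, y);
    return out;
}

// "Schedule" 2: visit the output in square tiles, which changes locality
// but not the value computed at any point.
std::vector<float> blur_tiled(const std::vector<float>& in, int width, int height, int tile) {
    assert(tile > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(in.size());
    for (int ty = 0; ty < height; ty += tile)
        for (int tx = 0; tx < width; tx += tile)
            for (int y = ty; y < std::min(ty + tile, height); ++y)
                for (int x = tx; x < std::min(tx + tile, width); ++x)
                    out[y * width + x] = blur_at(in, width, x, y);
    return out;
}
```

Both traversals call the same `blur_at`, so they produce identical output. In Halide the traversal would be chosen by scheduling directives rather than by rewriting the loops by hand.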

In this project, we are interested in exploring the possibility of implementing neural networks using the Halide philosophy, so that neural network code can enjoy both maintainability and high performance. By utilizing a compiler that is aware of execution order and storage, we hope to produce highly performant parallel code that is competitive with Caffe.

The challenge

Halide is not a Turing-complete language. It is able to describe steps in an image processing pipeline for use on a GPU, but any looping constructs would have to be done in a host language. Additionally, the tools provided by Halide are tailored to image processing operations, so it is not yet known whether it has the expressiveness to implement convolutional neural network code efficiently. If the available constructs are insufficient for our needs, we will extend the Halide compiler to include them. For example, it may be necessary to extend Halide to be aware of NVIDIA's cuDNN library.

The simplest neural networks enjoy easy parallelization. Many feed-forward neural networks can be evaluated with the composition of three operations: matrix multiplication, vector addition, and the rectifier $\mathrm{ReLU}(x) = \max(x, 0)$. These operations are very natural to perform on a GPU, with highly performant matrix multiplication having been extensively studied.
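As a minimal sketch of how these three operations compose into one layer, the following plain C++ function computes ReLU(Wx + b) for a single input vector. The name `dense_relu` and the row-major weight layout are assumptions for illustration, not part of Caffe or Halide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One fully connected layer: y = ReLU(W x + b).
// W is stored row-major with b.size() rows and x.size() columns (an assumed layout).
std::vector<float> dense_relu(const std::vector<float>& W,
                              const std::vector<float>& b,
                              const std::vector<float>& x) {
    const std::size_t rows = b.size();
    const std::size_t cols = x.size();
    assert(W.size() == rows * cols);
    std::vector<float> y(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = b[i];                       // vector addition (bias)
        for (std::size_t j = 0; j < cols; ++j)  // matrix multiplication (row i)
            acc += W[i * cols + j] * x[j];
        y[i] = std::max(acc, 0.0f);             // rectifier
    }
    return y;
}
```

Each output row is independent of the others, which is why this computation parallelizes easily. A deeper network would feed the returned vector into the next layer's call.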

Convolutional neural networks add matrix convolution between an image and a kernel. Because this operation is widely used in image processing, we expect that convolutional neural networks are feasible in Halide.
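For reference, a direct single-channel version of this operation might look like the sketch below. It assumes "valid" borders (no padding), a stride of one, and no kernel flip, which is the cross-correlation form that neural network frameworks commonly call convolution. The name `conv2d_valid` and the row-major layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slide a kh x kw kernel over an h x w image (both row-major) and return the
// (h - kh + 1) x (w - kw + 1) output. The kernel is not flipped.
std::vector<float> conv2d_valid(const std::vector<float>& img, int h, int w,
                                const std::vector<float>& kernel, int kh, int kw) {
    assert(img.size() == static_cast<std::size_t>(h) * w);
    assert(kernel.size() == static_cast<std::size_t>(kh) * kw);
    assert(kh >= 1 && kw >= 1 && kh <= h && kw <= w);
    const int oh = h - kh + 1;
    const int ow = w - kw + 1;
    std::vector<float> out(static_cast<std::size_t>(oh) * ow, 0.0f);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < kh; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    acc += img[(y + ky) * w + (x + kx)] * kernel[ky * kw + kx];
            out[y * ow + x] = acc;
        }
    return out;
}
```

The loop nest has the same shape as a stencil in an image processing pipeline, with every output point a pure function of a small input neighborhood.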

We also expect that most of the optimization effort will go into the topology of a deep neural network, where speeding up the composition of these operations becomes important. In addition, training neural networks via backpropagation presents a challenge to locality, because the intermediate layer activations computed in the forward pass must be saved until the backward pass uses them.
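To make the storage requirement concrete, here is a minimal sketch of one fully connected ReLU layer with a forward and a backward pass in plain C++. The forward pass stores the layer input and pre-activation in a cache, and the backward pass reads them back. The names `DenseCache`, `DenseGrads`, `dense_forward`, and `dense_backward` are hypothetical, and the row-major weight layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Values the forward pass must keep alive until the backward pass runs.
struct DenseCache {
    std::vector<float> x;  // layer input
    std::vector<float> z;  // pre-activation W x + b
};

// Gradients of the loss with respect to weights, bias, and layer input.
struct DenseGrads {
    std::vector<float> dW, db, dx;
};

// Forward: y = ReLU(W x + b), with W row-major (b.size() rows, x.size() columns).
std::vector<float> dense_forward(const std::vector<float>& W,
                                 const std::vector<float>& b,
                                 const std::vector<float>& x,
                                 DenseCache& cache) {
    const std::size_t rows = b.size(), cols = x.size();
    assert(W.size() == rows * cols);
    cache.x = x;
    cache.z.assign(rows, 0.0f);
    std::vector<float> y(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = b[i];
        for (std::size_t j = 0; j < cols; ++j) acc += W[i * cols + j] * x[j];
        cache.z[i] = acc;
        y[i] = std::max(acc, 0.0f);
    }
    return y;
}

// Backward: given dy = dLoss/dy, use the cached x and z to compute gradients.
DenseGrads dense_backward(const std::vector<float>& W,
                          const DenseCache& cache,
                          const std::vector<float>& dy) {
    const std::size_t rows = dy.size(), cols = cache.x.size();
    assert(W.size() == rows * cols && cache.z.size() == rows);
    DenseGrads g;
    g.dW.assign(rows * cols, 0.0f);
    g.db.assign(rows, 0.0f);
    g.dx.assign(cols, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        // ReLU passes the gradient through only where the pre-activation was positive.
        const float dz = cache.z[i] > 0.0f ? dy[i] : 0.0f;
        g.db[i] = dz;
        for (std::size_t j = 0; j < cols; ++j) {
            g.dW[i * cols + j] = dz * cache.x[j];
            g.dx[j] += W[i * cols + j] * dz;
        }
    }
    return g;
}
```

In a multi-layer network, every layer's cache has to stay live from its forward call until its backward call. That lifetime is what makes it harder to fuse stages or keep intermediate values in fast memory during training than during evaluation.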

Since Caffe has been tuned and optimized by more than a thousand contributors, it sets a high bar for performance.

Resources

We will be starting our code from scratch in C++. Caffe and Halide are open source and will serve as references during development.

For information on implementing neural networks, we will consult the Deep Learning book by Yoshua Bengio et al.

For development, we will be able to use ordinary hardware; Halide targets x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL. In order to compete with Caffe, we will use their benchmarks on NVIDIA GPUs, which include NVIDIA K40, NVIDIA Titan, NVIDIA K20, and NVIDIA GTX 770.

Goals and deliverables

The project will be implemented in layers of increasing difficulty and complexity. How many of these layers we complete will depend on what proves feasible in Halide.

Plan to achieve

  1. Explore Halide's capabilities by implementing evaluation of a simple hard-coded neural network. If necessary, extend Halide with the appropriate functionality to enable neural network evaluation.
  2. Implement training of a simple hard-coded neural network. If necessary, extend Halide with functionality to enable neural network training.
  3. Implement training and evaluation of convolutional neural networks. If necessary, extend Halide with functionality to enable convolutional neural networks.

Hope to achieve

  1. Implement training and evaluation of neural networks constructed from parsing Caffe configuration files.
  2. Optimize the training and evaluation of neural networks to be competitive with Caffe.

Stretch goals

  1. Beat Caffe in a benchmark.
  2. Improve the debugging experience in Halide.
  3. Implement additional neural network components, such as batch normalization or parametric ReLUs.

Demo

To demonstrate our results, we will present code snippets that show how easy our system is to use, along with graphs of benchmark results. If time allows, we may also build a simple real-time computer vision application to show off the effectiveness of our system.

Platform choice

Halide is multi-platform. This will allow neural networks to run efficiently on many kinds of devices as neural networks gain popularity. Halide's host language is C++, so we will implement our system in C++.

Schedule

Week 1 (2015 Apr 6 - 2015 Apr 12)

Week 2 (2015 Apr 13 - 2015 Apr 19)

Week 3 (2015 Apr 20 - 2015 Apr 26)

Week 4 (2015 Apr 27 - 2015 May 3)

Week 5 (2015 May 4 - 2015 May 10)