Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

  • Authors:
    Xiaoming Chen (Univ. of Notre Dame), Xiaobo Sharon Hu (Univ. of Notre Dame), Jianxu Chen (Univ. of Notre Dame), Danny Z. Chen (Univ. of Notre Dame)
    Publication ID:
    2698.004 (University of Notre Dame)


Convolution is a fundamental operation in many applications, including computer vision, natural language processing and image processing. Recent successes of convolutional neural networks in various deep learning applications place an even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating convolution operations. However, maximally exploiting the available memory bandwidth of GPUs for convolution is a challenging task. This paper introduces a general model to address the mismatch between the memory bank width of GPUs and the computation data width of threads. Based on this model, we develop two convolution kernels, one for the general case of convolution and the other for a special case with one input channel. By carefully optimizing memory access patterns and computation patterns, we design a communication-optimized kernel for the special case and a communication-reduced kernel for the general case. Experimental data based on implementations on Kepler GPUs show that our kernels achieve 5.16× and 35.5% average performance improvements over the latest cuDNN library, for the special case and the general case, respectively.
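For reference, the operation both kernels compute can be sketched as a plain CPU implementation. This is only an illustrative sketch of direct multi-channel 2D convolution (the function name `conv2d` and the loop ordering are assumptions for illustration, not the paper's GPU kernels); it shows how the "general case" sums over input channels and how the "special case" is simply the single-channel instance:

```python
def conv2d(inputs, filt):
    """Direct multi-channel 2D convolution with 'valid' padding.

    inputs: list of C input channels, each an H x W list of lists
    filt:   list of C filter channels, each a K x K list of lists
    Returns one (H-K+1) x (W-K+1) output channel: the per-channel
    convolutions summed together. C == 1 corresponds to the paper's
    special case with a single input channel.
    """
    C = len(inputs)
    H, W = len(inputs[0]), len(inputs[0][0])
    K = len(filt[0])
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    for c in range(C):                      # accumulate over input channels
        for y in range(H - K + 1):          # slide the K x K window
            for x in range(W - K + 1):
                acc = 0.0
                for i in range(K):
                    for j in range(K):
                        acc += inputs[c][y + i][x + j] * filt[c][i][j]
                out[y][x] += acc
    return out


# Single-channel example: 3x3 input, 2x2 all-ones filter.
img = [[[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]]
flt = [[[1, 1],
        [1, 1]]]
print(conv2d(img, flt))  # each output is the sum of a 2x2 window
```

On a GPU, each thread would typically produce one or more output elements of `out`; the paper's contribution lies in arranging these per-thread loads so that their data width matches the memory bank width, which this sequential sketch does not attempt to model.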

