Author Archives: peter

New paper on CNN accelerator architectures to appear at ISCA 2017

Our new paper on improving the efficiency of hardware accelerators for convolutional neural networks has been accepted for publication at the 44th International Symposium on Computer Architecture (ISCA), 2017.

This paper, co-authored with Yongming Shen (Stony Brook CS PhD student) and Stony Brook CS professor Mike Ferdman, proposes a new Convolutional Neural Network (CNN) accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.

Yongming Shen, Michael Ferdman, and Peter Milder. “Maximizing CNN Accelerator Efficiency Through Resource Partitioning.” To appear at The 44th International Symposium on Computer Architecture (ISCA), 2017.

You can read a pre-print here.

New paper on bandwidth-efficient CNN accelerators to appear at FCCM 2017

Our new paper on bandwidth-efficient hardware accelerators for convolutional neural networks will appear at FCCM 2017. This paper, co-authored with Stony Brook CS PhD student Yongming Shen and Stony Brook CS professor Mike Ferdman, proposes a new method for balancing the transfer costs of CNN data and CNN parameters, and describes a new flexible architecture that reduces the overall communication requirement.

Abstract—Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers.

We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4× across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5× on fixed-point GoogleNet.
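To make the batch-size trade-off concrete, here is a minimal sketch in Python. It is a toy model, not Escher's actual buffering scheme: it assumes a single fully-connected layer, an on-chip budget of `storage_words` that holds output accumulators for the whole batch, weights streamed once per batch, and inputs re-fetched once per output tile. The function names and layer sizes are hypothetical.

```python
import math


def per_image_transfer(batch, storage_words, n_in, n_out):
    """Off-chip words moved per image under the toy model (assumption)."""
    # The budget is shared by the batch, so each image gets a smaller
    # slice of output accumulators as the batch grows.
    out_tile = min(storage_words // batch, n_out)
    if out_tile == 0:
        raise ValueError("batch too large for the storage budget")
    n_tiles = math.ceil(n_out / out_tile)
    # Weights are fetched once per batch, so the cost is shared by all images.
    weight_words = n_in * n_out / batch
    # Inputs are re-fetched once for every output tile.
    input_words = n_in * n_tiles
    return weight_words + input_words


def best_batch(storage_words, n_in, n_out, candidates):
    """Pick the candidate batch size with the lowest per-image transfer."""
    return min(
        candidates,
        key=lambda b: per_image_transfer(b, storage_words, n_in, n_out),
    )


if __name__ == "__main__":
    candidates = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    for b in candidates:
        print(b, per_image_transfer(b, 65536, 4096, 4096))
    print("best batch:", best_batch(65536, 4096, 4096, candidates))
```

In this toy model, small batches are dominated by weight traffic and large batches by input traffic, so the minimum sits where the two terms are equal.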

Yongming Shen, Michael Ferdman, and Peter Milder. “Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer.” To appear at The 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017.

You can read a preprint here.

New article on hardware reliability to appear in ACM TECS

A new article focusing on hardware implementation of execution stream compression will appear in ACM Transactions on Embedded Computing Systems, in a special issue on Secure and Fault-tolerant Embedded Computing. This paper was co-authored with Maria Isabel Mera (a Stony Brook ECE MS alum, currently a PhD student at NYU), Jonah Caplan and Seyyed Hasan Mozafari (graduate students at McGill University), and Prof. Brett Meyer from McGill. This work was based in part on Maria Isabel Mera’s MS thesis.

“Area, Throughput and Power Trade-offs for FPGA- and ASIC-based Execution Stream Compression.” Maria Isabel Mera, Jonah Caplan, Seyyed Hasan Mozafari, Brett H. Meyer, and Peter Milder. To appear in ACM Trans. on Embedded Computing Systems, 2017.

Abstract: An emerging trend in safety-critical computer system design is the use of compression, e.g., using cyclic redundancy check (CRC) or Fletcher Checksum (FC), to reduce the state that must be compared to verify correct redundant execution. We examine the costs and performance of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. To do so, we have developed parameterizable hardware generation tools targeting CRC and two novel FC implementations. We evaluate the resulting designs implemented for FPGA and ASIC and analyze their efficiency; while CRC is often best, FC dominates when high throughput is needed.
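For readers unfamiliar with the two checksums, the following Python sketch shows the basic idea in software: each redundant execution compresses its state stream into a short fingerprint, and only the fingerprints are compared. It uses the textbook Fletcher-16 and the standard library's CRC-32; it does not reproduce the hardware designs or the novel FC implementations from the article, and the helper `fingerprints_match` is a hypothetical name.

```python
import zlib


def fletcher16(data: bytes) -> int:
    """Textbook Fletcher-16: two running sums, each taken modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def fingerprints_match(stream_a: bytes, stream_b: bytes) -> bool:
    """Compare two execution streams by their compressed fingerprints only."""
    return (
        fletcher16(stream_a) == fletcher16(stream_b)
        and zlib.crc32(stream_a) == zlib.crc32(stream_b)
    )


if __name__ == "__main__":
    good = b"abcde"
    faulty = b"abcdf"  # one corrupted byte in the redundant copy
    print(hex(fletcher16(good)), hex(zlib.crc32(good)))
    print(fingerprints_match(good, good), fingerprints_match(good, faulty))
```

Comparing a few bytes of fingerprint instead of the full state is what reduces the comparison cost; the price is a small probability that a corrupted stream produces the same fingerprint.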

Please check back later for a pre-print.

Poster on neural network hardware to appear at FPGA 2017

Yongming Shen will be presenting a poster next month on our current work to implement bandwidth-efficient fully-connected neural network layers.

Yongming Shen, Michael Ferdman, and Peter Milder. “Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers.” Poster to appear at FPGA 2017.

NSF funds our work on efficient spectrum sensing

The National Science Foundation’s Enhancing Access to the Radio Spectrum program has funded our group’s work on efficient distributed spectrum sensing. The goal of this work is to enable crowd-sourced collaborative spectrum sensing, using low-cost, low-power FPGA-based hardware together with novel interpolation and optimization techniques to aggregate and analyze data.

This work is a collaboration with Samir Das and Himanshu Gupta (Stony Brook CS), and Petar Djurić (Stony Brook ECE).

You can read more at the NSF website.

“Fused Layer CNN Accelerators” to appear at MICRO 2016

Our new paper “Fused Layer CNN Accelerators” by Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder has been accepted to appear at MICRO 2016.

A preprint is available here.

In this work, we observe that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers. We find that we are able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers. We demonstrate the effectiveness of our approach by constructing a fused-layer CNN accelerator for the first five convolutional layers of the VGGNet-E network, and find that, by using 362KB of on-chip storage, our fused-layer accelerator minimizes off-chip feature map data transfer, reducing the total transfer by 95%, from 77MB down to 3.6MB per image.
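One way to see why fusing works is to ask how large an input tile must be on chip to produce a single output value several layers later. The sketch below walks backward through a layer stack using the usual relation input = (output - 1) * stride + kernel. The layer list is an illustrative assumption (3x3 convolutions and 2x2 pooling, loosely shaped like the early layers of a VGG-style network, with padding ignored); it is not taken from the paper.

```python
def input_tile_size(output_tile, layers):
    """Side length of the input tile needed for an output tile.

    `layers` is a list of (kernel, stride) pairs in forward order.
    Padding is ignored in this sketch.
    """
    size = output_tile
    # Walk from the last layer back to the first.
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Hypothetical stack: conv, conv, pool, conv, conv, pool, conv.
ILLUSTRATIVE_LAYERS = [
    (3, 1), (3, 1), (2, 2),
    (3, 1), (3, 1), (2, 2),
    (3, 1),
]

if __name__ == "__main__":
    for out in (1, 2, 4):
        side = input_tile_size(out, ILLUSTRATIVE_LAYERS)
        print(f"{out}x{out} output tile needs a {side}x{side} input tile")
```

Because the required input tile is small compared with a full feature map, the intermediate values for that tile can stay on chip instead of being written out and read back between layers.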

 

New paper on streaming sorting networks published in ACM TODAES

A new overview paper that I co-authored with Marcela Zuluaga and Markus Püschel of ETH Zurich has been published in ACM Transactions on Design Automation of Electronic Systems (TODAES). In this paper, we present new hardware structures for sorting that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog.

You can read the paper here, and see also our online sorting network generator, which allows you to use the tool described in this paper in your web browser.
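For background, a sorting network is a fixed sequence of compare-exchange operations that sorts any input of a given size. The Python sketch below generates the classic fully parallel bitonic network (one of Batcher's constructions) for a power-of-two size and applies it in software. It is only meant to illustrate the concept; it is not the streaming construction from the paper, and the function names are hypothetical.

```python
def bitonic_network(n):
    """Return the stages of a bitonic sorting network for n = 2^k inputs.

    Each stage is a list of (lo, hi) pairs: after the compare-exchange,
    the smaller value sits at index lo and the larger at index hi.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner) if ascending else (partner, i))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages, values):
    """Run the compare-exchange stages over a copy of `values`."""
    v = list(values)
    for stage in stages:
        for lo, hi in stage:
            if v[lo] > v[hi]:
                v[lo], v[hi] = v[hi], v[lo]
    return v


if __name__ == "__main__":
    net = bitonic_network(8)
    print(len(net), "stages,", sum(len(s) for s in net), "comparators")
    print(apply_network(net, [5, 3, 7, 0, 6, 1, 4, 2]))
```

In a fully parallel hardware implementation, every pair in a stage would be its own compare-exchange unit, so the comparator count grows quickly with n; this is why a fully parallel design runs out of FPGA resources at a modest input size.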

As a preview, the following graph shows the cost of implementing various sorters with 16-bit fixed-point input values that fit on a Xilinx Virtex-6 FPGA. The x-axis indicates the input size n, the y-axis indicates the number of FPGA configurable slices used, and the size of the marker quantifies the number of BRAMs used (BRAMs are blocks of on-chip memory available in FPGAs). The implementations using Batcher’s and Stone’s architectures can only sort up to 128 or 256 elements, respectively, on this FPGA. In contrast, our streaming sorting networks with streaming width w = 2 can sort up to 2^19 elements on this FPGA, and our smallest fully streaming design can sort up to 2^16 elements.

[Figure: design_space]

The following graph shows all 256-element sorting networks that we generate with our framework (using 16 bits per element) that fit onto the Virtex-6 FPGA. The x-axis indicates the number of configurable FPGA slices used, the y-axis indicates the maximum achievable throughput in gigasamples per second, and the size of the marker indicates the number of BRAMs used. This plot shows that we can generate a wide range of design trade-offs that outperform previous implementations, such as that of Stone and the linear sorter (Batcher’s is omitted due to its high cost). For practical applications, only the Pareto-optimal designs (those toward the top left) would be considered.

[Figure: 256_compare]
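To make "Pareto-optimal" precise: a design is kept only if no other design is at least as good in both area and throughput and strictly better in one of them. The Python sketch below filters a list of designs this way. The design names and numbers are made up purely for illustration and are not results from the paper; BRAM usage is ignored to keep the example two-dimensional.

```python
def pareto_front(designs):
    """Return designs not dominated in (lower slices, higher throughput).

    `designs` is a list of (name, slices, throughput) tuples.
    """
    front = []
    for name, slices, throughput in designs:
        dominated = any(
            other_slices <= slices
            and other_tp >= throughput
            and (other_slices < slices or other_tp > throughput)
            for _, other_slices, other_tp in designs
        )
        if not dominated:
            front.append((name, slices, throughput))
    # Sort by area so the front reads left to right on the plot.
    return sorted(front, key=lambda d: d[1])


# Hypothetical design points (slices, gigasamples per second).
EXAMPLE_DESIGNS = [
    ("A", 1000, 1.0),
    ("B", 2000, 4.0),
    ("C", 2500, 3.0),  # dominated by B: more area, less throughput
    ("D", 4000, 8.0),
    ("E", 1000, 0.5),  # dominated by A: same area, less throughput
]

if __name__ == "__main__":
    for design in pareto_front(EXAMPLE_DESIGNS):
        print(design)
```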