Onyx: A 12nm 756 GOPS/W Coarse-Grained Reconfigurable Array for Accelerating Dense and Sparse Applications
Abstract
Onyx is the first fully programmable accelerator for arbitrary sparse tensor algebra kernels. Unlike prior work, it supports higher-order tensors, multiple inputs, and fusion. It achieves this with a coarse-grained reconfigurable array (CGRA) that has composable memory primitives for storing compressed any-order tensors and compute primitives that eliminate ineffectual computations in sparse expressions. Further, Onyx improves dense image processing and machine learning (ML) with application-specialized compute tiles, memory tiles optimized for affine access patterns, and hybrid clock gating in the global buffer. We achieve up to 565x better energy-delay product (EDP) for sparse kernels vs. CPUs with sparse libraries, and up to 76% and 85% lower EDP for image processing and ML, respectively, vs. Amber [1].
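To make the idea of "compressed" tensors and "ineffectual computations" concrete, here is a minimal software sketch in Python. It is not Onyx's hardware design or its storage format. It assumes a simple hypothetical layout (sorted coordinates plus matching nonzero values) and shows how intersecting coordinates skips every multiply in which one operand is zero. The names `compress` and `sparse_dot` are illustrative only.

```python
from typing import List, Sequence, Tuple

# Hypothetical compressed layout (an assumption, not Onyx's format):
# sorted coordinates of the nonzeros plus their values, in the same order.
SparseVec = Tuple[List[int], List[float]]


def compress(dense: Sequence[float]) -> SparseVec:
    """Keep only the nonzero entries of a dense vector."""
    coords = [i for i, v in enumerate(dense) if v != 0]
    vals = [float(dense[i]) for i in coords]
    return coords, vals


def sparse_dot(a: SparseVec, b: SparseVec) -> Tuple[float, int]:
    """Dot product via coordinate intersection.

    Returns the result and the number of multiplies actually performed,
    so the savings over a dense loop are visible.
    """
    (a_coords, a_vals), (b_coords, b_vals) = a, b
    i = j = 0
    total = 0.0
    multiplies = 0
    while i < len(a_coords) and j < len(b_coords):
        if a_coords[i] == b_coords[j]:
            # Both operands are nonzero here: the only effectual multiply.
            total += a_vals[i] * b_vals[j]
            multiplies += 1
            i += 1
            j += 1
        elif a_coords[i] < b_coords[j]:
            i += 1  # b is zero at this coordinate, so skip it
        else:
            j += 1  # a is zero at this coordinate, so skip it
    return total, multiplies


a = compress([0, 2, 0, 0, 3, 0])
b = compress([1, 4, 0, 0, 0, 5])
result, multiplies = sparse_dot(a, b)
print(result, multiplies)
```

In this example a dense loop would perform six multiplies, while the intersection performs one, because the two vectors share only a single nonzero coordinate.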
Article
Article Note
The above PDF is the author-submitted version of the article. The final published version can be found at the Article URL above.
BibTeX
@INPROCEEDINGS{10631383,
author={Koul, Kalhan and Strange, Maxwell and Melchert, Jackson and Carsello,
Alex and Mei, Yuchen and Hsu, Olivia and Kong, Taeyoung and Chen, Po-Han and
Ke, Huifeng and Zhang, Keyi and Liu, Qiaoyi and Nyengele, Gedeon and
Balasingam, Akhilesh and Adivarahan, Jayashree and Sharma, Ritvik and Xie,
Zhouhua and Torng, Christopher and Emer, Joel and Kjolstad, Fredrik and
Horowitz, Mark and Raina, Priyanka},
booktitle={2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)},
title={Onyx: A 12nm 756 GOPS/W Coarse-Grained Reconfigurable Array for Accelerating Dense and Sparse Applications},
year={2024},
volume={},
number={},
pages={1-2},
keywords={Tensors;Image coding;Algebra;Machine learning;Very large scale integration;Libraries;Kernel},
doi={10.1109/VLSITechnologyandCir46783.2024.10631383}
}