Abstract

Recent research has focused on leveraging sparsity in hardware accelerators to improve the efficiency of applications spanning scientific computing to machine learning. Most such prior accelerators are fixed-function, which is insufficient for two reasons: first, applications typically include both dense and sparse components; second, the algorithms that comprise these applications are constantly evolving. To address these challenges, we designed a programmable accelerator called Onyx for both sparse tensor algebra and dense workloads. Onyx extends a coarse-grained reconfigurable array (CGRA) optimized for dense applications with composable hardware primitives to support arbitrary sparse tensor algebra kernels. In this article, we show that we can further optimize Onyx by adding a small set of hardware features for parallelization that significantly increase both temporal and spatial utilization of the CGRA, reducing runtime by up to 6.2×.
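To make "sparse tensor algebra kernels" concrete, the sketch below shows one of the simplest such kernels, sparse matrix-vector multiplication over a CSR (compressed sparse row) matrix, in plain Python. This is purely illustrative of the class of computations such accelerators target; it is not Onyx's programming model or implementation, and the function and variable names are our own.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Illustrative sparse matrix-vector multiply, y = A @ x, with A in CSR form.

    values  - nonzero entries of A, row by row
    col_idx - column index of each nonzero
    row_ptr - row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]
    x       - dense input vector
    """
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Iterate only over the stored nonzeros of row i,
        # skipping the zeros that a dense kernel would touch.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]] in CSR form, multiplied by x = [1, 1, 1]:
y = csr_spmv([1.0, 2.0, 3.0, 4.0, 5.0], [0, 2, 2, 0, 1], [0, 2, 3, 5], [1.0, 1.0, 1.0])
```

The data-dependent inner loop bounds are what make such kernels hard for fixed-function dense hardware, and are the kind of irregularity a programmable sparse accelerator must handle.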

Article

Article URL

pdf

Article Note

The above PDF is the author-submitted version of the article. The final published version can be found at the Article URL above.

BibTeX

@ARTICLE{10947596, 
author={Koul, Kalhan and Xie, Zhouhua and Strange, Maxwell and Ravipati, Sai Gautham and Cheng, Bo Wun and Hsu, Olivia and Chen, Po-Han and Horowitz, Mark and Kjolstad, Fredrik and Raina, Priyanka},
journal={IEEE Micro},
title={Designing Programmable Accelerators for Sparse Tensor Algebra},
year={2025},
volume={45},
number={3},
pages={58-65},
keywords={Tensors;Pipeline processing;Algebra;Micromechanical devices;Repeaters;Random access memory;Kernel;Arrays;Sparse matrices;Process control},
doi={10.1109/MM.2025.3556611}}