The rapid growth of large-scale deep neural networks (DNNs) has introduced severe memory and performance bottlenecks during distributed training. Existing automated planners for parallelization strategies often rely heavily on profiling or empirical tuning, which significantly increases engineering cost and wastes large-scale cluster resources. In this work, we present PRISM, a profiling-free, symbolic memory-driven strategy planner for large DNN training. PRISM introduces a unified symbolic memory cost model that captures the layered structure of modern architectures and integrates with a communication model to evaluate trade-offs across data, tensor, pipeline, virtual pipeline, expert, and sequence parallelism, as well as activation recomputation and optimizer sharding. By formulating strategy selection as an optimization problem, PRISM identifies globally optimal parallel strategies under device memory budgets. Our evaluation across representative large models demonstrates that PRISM achieves accurate memory prediction and substantial improvements in Model FLOPs Utilization (MFU), reducing pipeline-bubble and communication overheads without costly profiling.
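The abstract describes two ingredients: a symbolic (profiling-free) per-GPU memory estimate and a search over parallel strategies under a device memory budget. The sketch below illustrates that general idea only; the Megatron-style memory constants, the function and field names, and the brute-force enumeration of (data, tensor, pipeline) degrees are all illustrative assumptions and are not PRISM's actual cost model or search algorithm.

```python
# Hypothetical sketch of a symbolic memory-driven strategy filter.
# Constants and formulas below are simplified assumptions, not PRISM's model.
from dataclasses import dataclass
from itertools import product

@dataclass
class ModelSpec:
    layers: int        # number of transformer layers
    hidden: int        # hidden size
    seq_len: int       # sequence length
    micro_batch: int   # micro-batch size per pipeline stage
    vocab: int = 50304

def param_count(m: ModelSpec) -> float:
    # Rough transformer parameter count: ~12 * hidden^2 per layer plus embeddings.
    return m.layers * 12 * m.hidden ** 2 + m.vocab * m.hidden

def per_gpu_memory_gib(m: ModelSpec, dp: int, tp: int, pp: int,
                       recompute: bool, zero1: bool) -> float:
    # Static memory: fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights
    # and Adam moments (12 B); tensor/pipeline parallelism split the parameters,
    # and optional ZeRO-1-style sharding splits the optimizer states across dp.
    p_local = param_count(m) / (tp * pp)
    static = p_local * (2 + 2) + p_local * 12 / (dp if zero1 else 1)
    # Activation memory per layer, rough Megatron-style estimate: ~34*s*b*h bytes
    # without recomputation, ~2*s*b*h bytes (layer inputs only) with full
    # recomputation; in-flight pipeline micro-batches are ignored in this sketch.
    elems = m.micro_batch * m.seq_len * m.hidden
    act = (m.layers / pp) * elems * (2 if recompute else 34) / tp
    return (static + act) / 2**30

def feasible_strategies(m: ModelSpec, n_gpus: int, budget_gib: float):
    # Enumerate (dp, tp, pp) factorizations of the cluster, toggle recomputation
    # and optimizer sharding, and keep only strategies under the memory budget.
    out = []
    for dp, tp, pp in product(range(1, n_gpus + 1), repeat=3):
        if dp * tp * pp != n_gpus:
            continue
        for recompute, zero1 in product([False, True], repeat=2):
            mem = per_gpu_memory_gib(m, dp, tp, pp, recompute, zero1)
            if mem <= budget_gib:
                out.append(((dp, tp, pp, recompute, zero1), mem))
    return sorted(out, key=lambda x: x[1])

if __name__ == "__main__":
    gpt_like = ModelSpec(layers=32, hidden=4096, seq_len=2048, micro_batch=1)
    for strat, mem in feasible_strategies(gpt_like, n_gpus=16, budget_gib=80.0)[:5]:
        print(strat, f"{mem:.1f} GiB")
```

A full planner, as the abstract notes, would additionally combine such a memory model with a communication and pipeline-bubble model to rank the feasible candidates rather than merely sorting them by memory footprint.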
PRISM: Profiling-free symbolic memory-driven strategy planner for large DNN model training
SCA/HPC Asia 2026, Supercomputing Asia / International Conference on High Performance Computing in the Asia-Pacific Region, 26-29 January 2026, Osaka, Japan
Type:
Conference
City:
Osaka
Date:
2026-01-26
Department:
Data Science
Eurecom Ref:
8486
Copyright:
© ACM, 2026. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SCA/HPC Asia 2026, Supercomputing Asia / International Conference on High Performance Computing in the Asia-Pacific Region, 26-29 January 2026, Osaka, Japan
See also:
PERMALINK: https://www.eurecom.fr/publication/8486