Mechanistic Design and Scaling of Hybrid Architectures
arXiv - CS - Machine Learning Pub Date: 2024-03-26, DOI: arXiv:2403.17844
Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token manipulation tasks, such as compression and recall, designed to probe capabilities, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M and 7B parameters. Surprisingly, we find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology.
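To make the idea of a "capability unit test" concrete, the following is a minimal, illustrative sketch of one such synthetic token-manipulation task: associative recall, where a model must retrieve the value paired with a queried key earlier in the sequence. The token format and function names here are assumptions for illustration, not the paper's exact task specification.

```python
import random

def make_recall_example(num_pairs=4, key_vocab=None, val_vocab=None, seed=None):
    """Generate one associative-recall example.

    The input is a flat sequence of key-value pairs followed by a query key;
    the target is the value that was paired with that key. A model "passes"
    this capability test if it predicts the target from the sequence.
    (Illustrative format only; not the paper's exact specification.)
    """
    rng = random.Random(seed)
    key_vocab = key_vocab or list("abcdefgh")
    val_vocab = val_vocab or list("01234567")
    keys = rng.sample(key_vocab, num_pairs)             # distinct keys
    pairs = [(k, rng.choice(val_vocab)) for k in keys]  # random values
    query_key, target = rng.choice(pairs)               # pick one pair to query
    tokens = [t for kv in pairs for t in kv] + [query_key]
    return tokens, target

# Example: a sequence of 3 key-value pairs plus a trailing query key.
tokens, target = make_recall_example(num_pairs=3, seed=0)
```

Because such tasks are tiny and cheap to evaluate, many candidate architectures can be screened in isolation before committing to at-scale training, which is the role MAD synthetics play in the pipeline described above.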

Updated: 2024-03-27