Accessible Russian Large Language Models: Open-Source Models and Instructive Datasets for Commercial Applications
Doklady Mathematics (IF 0.6), Pub Date: 2024-03-11, DOI: 10.1134/s1064562423701168
D. P. Kosenko, Yu. M. Kuratov, D. R. Zharikova

Abstract

This paper presents an approach to developing and fine-tuning large language models for the Russian language that are capable of following instructions across domains. XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B are used as base models. The work compares two main fine-tuning techniques: fine-tuning all model parameters and fine-tuning with LoRA layers. To create a fine-tuning dataset, several open English-language data sources are used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which are then translated into Russian with the WMT21 En-X model. The work shows that the quality of the instructions used for training significantly affects performance on automatic quality metrics such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to those of models fine-tuned on the Saiga dataset, which carries a restrictive license. The fine-tuned language models and the collected Russian-language dataset are released as open source under licenses suitable for commercial use.
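To make the LoRA comparison concrete, below is a minimal sketch of the parameter-efficient setup using Hugging Face PEFT. The rank, scaling factor, and target modules are illustrative assumptions, not hyperparameters reported in the paper; the base checkpoint is one of the models listed in the abstract.

```python
# Minimal LoRA fine-tuning setup (sketch; hyperparameters are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # one of the base models listed above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains low-rank adapter matrices
# injected into the attention projections; full fine-tuning would instead
# update every parameter of the base model.
lora_cfg = LoraConfig(
    r=8,                                  # adapter rank (assumption)
    lora_alpha=16,                        # scaling factor (assumption)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # small fraction vs. full fine-tuning
```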
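The dataset-translation step can be sketched in the same way. The snippet below translates one English instruction into Russian with the WMT21 En-X checkpoint as released on Hugging Face; the example sentence is hypothetical, and the paper's actual batching and post-processing are not shown.

```python
# Translating an English instruction into Russian with WMT21 En-X (sketch).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Explain what a language model is.", return_tensors="pt")
# Force Russian as the target language of the multilingual decoder.
tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("ru"))
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```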



