Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models



Frontiers of Information Technology & Electronic Engineering (IF 3), Pub Date: 2024-02-08, DOI: 10.1631/fitee.2300384
Li Weigang, Mayara Chew Marinho, Denise Leyi Li, Vitor Vasconcelos De Oliveira

While large language models (LLMs) have made significant strides in natural language processing (NLP), they continue to face challenges in adequately addressing the intricacies of the Chinese language in certain scenarios. We propose a framework called Six-Writings multimodal processing (SWMP) to enable direct integration of Chinese NLP (CNLP) with morphological and semantic elements. The first part of SWMP, known as Six-Writings pictophonetic coding (SWPC), is introduced with a suitable level of granularity for radicals and components, enabling effective representation of Chinese characters and words. We conduct several experimental scenarios, including the following: (1) We establish an experimental database consisting of images and SWPC for Chinese characters, enabling dual-mode processing and matrix generation for CNLP. (2) We characterize various generative modes of Chinese words, such as thousands of Chinese idioms, used as question-and-answer (Q&A) prompt functions, facilitating analogies by SWPC. The experiments achieve 100% accuracy in answering all questions in the Chinese morphological data set (CA8-Mor-10177). (3) A fine-tuning mechanism is proposed to refine word embedding results using SWPC, resulting in an average relative error of ≤25% for 39.37% of the questions in the Chinese wOrd Similarity data set (COS960). The results demonstrate that SWMP/SWPC methods effectively capture the distinctive features of Chinese and offer a promising mechanism to enhance CNLP with better efficiency.
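The COS960 result in item (3) is a share of questions whose relative error falls under a threshold. The abstract does not define that error, so the sketch below assumes the common form |predicted - reference| / |reference| and counts the pairs at or below 25%. The function names and the toy scores are hypothetical and are not taken from the paper or from COS960.

```python
from typing import Sequence


def relative_error(predicted: float, reference: float) -> float:
    """Absolute difference divided by the reference score (assumed definition)."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return abs(predicted - reference) / abs(reference)


def share_within(predicted: Sequence[float],
                 reference: Sequence[float],
                 threshold: float = 0.25) -> float:
    """Fraction of pairs whose relative error is at most `threshold`."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("score lists must not be empty")
    errors = [relative_error(p, r) for p, r in zip(predicted, reference)]
    hits = sum(1 for e in errors if e <= threshold)
    return hits / len(errors)


# Toy scores invented for illustration; they are not COS960 data.
human_scores = [4.0, 2.0, 1.0, 3.0]
model_scores = [3.5, 2.8, 1.2, 3.0]

print(share_within(model_scores, human_scores))
```

Under this reading, the reported figure would mean that `share_within` returns about 0.3937 on the full data set. The paper itself may define or average the error differently.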


Updated: 2024-02-08
