Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models

Weigang, Li; Marinho, Mayara Chew; Li, Denise Leyi; De Oliveira, Vitor Vasconcelos

doi:10.1631/FITEE.2300384

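Chinese title (translated): Pictophonetic representation in "Six-Writings" multimodal processing to improve Chinese language models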

Published: 08 February 2024

Volume 25, pages 84–105, (2024)

Frontiers of Information Technology & Electronic Engineering

Abstract

While large language models (LLMs) have made significant strides in natural language processing (NLP), they continue to face challenges in adequately addressing the intricacies of the Chinese language in certain scenarios. We propose a framework called Six-Writings multimodal processing (SWMP) to enable direct integration of Chinese NLP (CNLP) with morphological and semantic elements. The first part of SWMP, known as Six-Writings pictophonetic coding (SWPC), is introduced with a suitable level of granularity for radicals and components, enabling effective representation of Chinese characters and words. We conduct several experimental scenarios, including the following: (1) We establish an experimental database consisting of images and SWPC for Chinese characters, enabling dual-mode processing and matrix generation for CNLP. (2) We characterize various generative modes of Chinese words, such as thousands of Chinese idioms, used as question-and-answer (Q&A) prompt functions, facilitating analogies by SWPC. The experiments achieve 100% accuracy in answering all questions in the Chinese morphological data set (CA8-Mor-10177). (3) A fine-tuning mechanism is proposed to refine word embedding results using SWPC, resulting in an average relative error of ≤25% for 39.37% of the questions in the Chinese wOrd Similarity data set (COS960). The results demonstrate that SWMP/SWPC methods effectively capture the distinctive features of Chinese and offer a promising mechanism to enhance CNLP with better efficiency.
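The morphological analogy questions mentioned in scenario (2) can be illustrated with a minimal sketch. It is not the authors' SWPC method, whose coding scheme is not given on this page: it works on raw characters rather than SWPC codes, and the function names `infer_pattern`, `apply_pattern`, and `solve_analogy` are hypothetical. It only shows the general idea of transferring a word-formation pattern (reduplication or a fixed affix) from one pair to a new word.

```python
def infer_pattern(a: str, b: str) -> list:
    """Describe b in terms of a: an int is a position in a, a str is a literal character."""
    return [a.index(ch) if ch in a else ch for ch in b]


def apply_pattern(pattern: list, c: str) -> str:
    """Rebuild a word from c by following the position/literal pattern."""
    return "".join(c[p] if isinstance(p, int) else p for p in pattern)


def solve_analogy(a: str, b: str, c: str) -> str:
    """Answer 'a is to b as c is to ?' by pattern transfer (illustrative assumption only)."""
    if len(a) != len(c):
        raise ValueError("this sketch requires a and c to have the same length")
    return apply_pattern(infer_pattern(a, b), c)


if __name__ == "__main__":
    print(solve_analogy("高兴", "高高兴兴", "快乐"))  # AABB reduplication
    print(solve_analogy("看", "看一看", "听"))        # A-yi-A reduplication
    print(solve_analogy("虎", "老虎", "鼠"))          # fixed prefix
```

A position-based pattern is used here because reduplication and affixation both rearrange or extend the source word without changing its characters; a code-based representation such as SWPC would presumably carry more information than this sketch uses.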

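Abstract (translated from the Chinese)

Large language models (LLMs) have achieved remarkable results in natural language processing, yet in certain scenarios they still struggle with the complexity of Chinese. This paper proposes the Six-Writings multimodal processing (SWMP) framework, which takes into account the form (形), phonetic (声), sound (音), image (像), meaning (意), and combination (会) characteristics of Chinese to facilitate multimodal processing of the language. Within the unified theoretical framework of SWMP, we propose the Six-Writings pictophonetic coding method (SWPC, or "Six-Writings coding" for short), so that the representation of Chinese characters both integrates with grammar and reflects the flexible use of Chinese. The experimental scenarios designed in the paper include the following: (1) We build an experimental database of images and SWPC codes for Chinese character roots, radicals (the form part), and components (the phonetic part), enabling dual-mode processing of Chinese text and graphics. (2) We characterize several generation mechanisms of Chinese words and establish prompt-style question/answer patterns for analogical reasoning. Using SWPC on all questions in the Chinese morphological relations data set (CA8-Mor-10177), accuracy reaches 100%. (3) We establish a mechanism that uses SWPC to fine-tune word embedding results. For 39.37% of the questions in the Chinese word similarity data set (COS960), the average relative error between the computed similarity and the human baseline assessment is below 25%. These results, which exceed the accuracy of current comparable benchmarks, indicate that SWPC attempts to capture characteristics of Chinese such as fine-grained local representation and overall association, and can serve as an effective complement to current Chinese language processing theory and technology.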
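The COS960 figure above compares computed similarities with human ratings through a relative error. The following minimal sketch shows one way such a statistic could be computed. The exact definition used in the paper is not given on this page, so the formula |predicted - human| / |human|, the 0.25 threshold handling, the function names, and the sample scores are all assumptions for illustration.

```python
def relative_error(predicted: float, human: float) -> float:
    """Relative error of a predicted similarity against a human rating (assumed definition)."""
    if human == 0:
        raise ValueError("relative error is undefined for a human score of zero")
    return abs(predicted - human) / abs(human)


def summarize(pairs, threshold: float = 0.25):
    """Return (share of pairs with error <= threshold, mean error among those pairs)."""
    errors = [relative_error(p, h) for p, h in pairs]
    within = [e for e in errors if e <= threshold]
    share = len(within) / len(errors)
    mean_within = sum(within) / len(within) if within else float("nan")
    return share, mean_within


if __name__ == "__main__":
    # Hypothetical (predicted, human) similarity scores, not taken from COS960.
    sample = [(3.0, 4.0), (1.0, 2.0), (2.2, 2.0), (0.9, 1.0)]
    share, mean_within = summarize(sample)
    print(f"share within 25%: {share:.2%}, mean error among them: {mean_within:.3f}")
```

Reporting the share of pairs alongside the mean error within that share mirrors the way the abstract states its result (a percentage of questions and an error bound), but other readings of the statistic are possible.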

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

This article pays tribute to Shen XU (许慎) and his monumental work, Shuo Wen Jie Zi (说文解字). The authors extend their gratitude to the inventors of the four-corner number, Cangjie, Wubi, Zheng code, and others. The authors are also very grateful for the valuable comments and suggestions of anonymous reviewers.

Author information

Authors and Affiliations

Department of Computer Science, University of Brasilia, Brasilia, 70910-900, Brazil
Li Weigang (李伟钢), Mayara Chew Marinho & Vitor Vasconcelos De Oliveira
Faculty of Economics, Administration, Accounting and Actuaries, University of Sao Paulo, Sao Paulo, 05508-010, Brazil
Denise Leyi Li

Contributions

Li WEIGANG designed this research, proposed the SWMP/SWPC/Chinese character matrix, and drafted the paper. Mayara Chew MARINHO helped with the calculations in Sections 3, 7, and 8. Denise Leyi LI helped prepare the data sets and analyze the results. Vitor Vasconcelos DE OLIVEIRA helped process the images and do the calculations in Section 6. All the authors revised and finalized the paper.

Corresponding author

Correspondence to Li Weigang (李伟钢).

Ethics declarations

Li WEIGANG is an editorial board member of Frontiers of Information Technology & Electronic Engineering, and he was not involved with the peer review process of this paper. All the authors declare that they have no conflict of interest.

Additional information

This project was partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) (No. 309545/2021-8).

About this article

Cite this article

Weigang, L., Marinho, M.C., Li, D.L. et al. Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models. Front Inform Technol Electron Eng 25, 84–105 (2024). https://doi.org/10.1631/FITEE.2300384

Received: 30 May 2023
Accepted: 06 September 2023
Published: 08 February 2024
Issue Date: January 2024
DOI: https://doi.org/10.1631/FITEE.2300384

CLC number

TP391
