Application of the Variational Principle to Create a Measurable Assessment of the Relevance of Objects Included in Training Databases
Optical Memory and Neural Networks, Pub Date: 2023-11-28, DOI: 10.3103/s1060992x23060024
V. A. Antonets, M. A. Antonets

Abstract

We consider the problem of obtaining a measurable assessment of the quality of empirical training data selected by experts. This problem can be solved in those cases where the data can be represented as histograms. This class includes any diagram of the frequency of occurrence of linguistic objects in samples, for example, lemmas in a text. It also includes discretized temporal signals from different branches of science, technology, and medicine. The proposed method, like other known methods, is based on weight functions. Given a weight function, the weight of each histogram is defined as the sum, over all its columns, of the column height multiplied by the value of the weight function for that column. However, in contrast to well-known approaches, the weight function in the proposed approach is not found empirically but is derived from the following variational principle: a weight function is considered optimal if the weight of the lightest histogram it produces is greater than or equal to the weight of the lightest histogram produced by any other weight function. Applying the developed approach to the thematic classification of ad texts on electronic trading platforms showed that, for the selected topics, approximately 90% of the lemmas (words) encountered in the training corpus had a weight of zero, and almost all words with nonzero weight were semantically related to the topic.
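
The variational principle described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not say how weight functions are normalized, so the sketch assumes that the weights are non-negative and sum to 1 (without some such constraint the lightest histogram could be made arbitrarily heavy by rescaling). Under these assumed constraints the problem is a linear program; the brute-force grid search below is used only to keep the example self-contained. The function names and the toy histograms are hypothetical.

```python
from fractions import Fraction
from itertools import product


def histogram_weight(hist, w):
    """Sum over columns of column height times the weight for that column."""
    return sum(height * wj for height, wj in zip(hist, w))


def lightest_weight(histograms, w):
    """Weight of the lightest histogram under the weight function w."""
    return min(histogram_weight(hist, w) for hist in histograms)


def simplex_grid(n_columns, steps):
    """Yield weight functions with entries k/steps that are >= 0 and sum to 1.

    ASSUMPTION: this normalization is not taken from the abstract.
    """
    for parts in product(range(steps + 1), repeat=n_columns):
        if sum(parts) == steps:
            yield tuple(Fraction(p, steps) for p in parts)


def best_weight_function(histograms, steps=10):
    """Pick the grid point whose lightest histogram is heaviest (maximin)."""
    n_columns = len(histograms[0])
    return max(
        simplex_grid(n_columns, steps),
        key=lambda w: lightest_weight(histograms, w),
    )


# Hypothetical toy data: three histograms over three columns (e.g. lemma counts).
histograms = [[3, 0, 1], [0, 3, 1], [2, 2, 0]]
w_opt = best_weight_function(histograms)
print(w_opt, lightest_weight(histograms, w_opt))
```

On this toy data the search selects the weights (1/2, 1/2, 0), for which the lightest histogram has weight 3/2. The third column receives zero weight, a small-scale analogue of the zero-weight lemmas reported in the abstract.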




Updated: 2023-11-29