Analysis of systems’ performance in natural language processing competitions
Pattern Recognition Letters (IF 5.1), Pub Date: 2024-03-19, DOI: 10.1016/j.patrec.2024.03.010
Sergio Nava-Muñoz, Mario Graff, Hugo Jair Escalante

Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by the organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and the competitions themselves. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language processing competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.

Updated: 2024-03-19