-
Universal Adversarial Triggers Are Not Universal arXiv.cs.CL Pub Date : 2024-04-24 Nicholas Meade, Arkil Patel, Siva Reddy
Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate
-
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models arXiv.cs.CL Pub Date : 2024-04-24 Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to
-
Generalization Measures for Zero-Shot Cross-Lingual Transfer arXiv.cs.CL Pub Date : 2024-04-24 Saksham Bassi, Duygu Ataman, Kyunghyun Cho
A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to building robust and reliable machine learning systems. Language model evaluation tasks lack information metrics about model generalization, and their applicability in a new setting is measured using task- and language-specific downstream performance, which is often lacking in many languages
-
Assessing The Potential Of Mid-Sized Language Models For Clinical QA arXiv.cs.CL Pub Date : 2024-04-24 Elliot Bolton, Betty Xiong, Vijaytha Muralidharan, Joel Schamroth, Vivek Muralidharan, Christopher D. Manning, Roxana Daneshjou
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use
-
Effective Unsupervised Constrained Text Generation based on Perturbed Masking arXiv.cs.CL Pub Date : 2024-04-24 Yingwen Fu, Wenjie Ou, Zhou Yu, Yue Lin
Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action in each step. Specifically, PMCTG extends perturbed
-
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models arXiv.cs.CL Pub Date : 2024-04-24 Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, Yanghua Xiao
It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that
-
Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation arXiv.cs.CL Pub Date : 2024-04-24 Maja Stahl, Leon Biermann, Andreas Nehring, Henning Wachsmuth
Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent
-
One Subgraph for All: Efficient Reasoning on Opening Subgraphs for Inductive Knowledge Graph Completion arXiv.cs.CL Pub Date : 2024-04-24 Zhiwen Xie, Yi Zhang, Guangyou Zhou, Jin Liu, Xinhui Tu, Jimmy Xiangji Huang
Knowledge Graph Completion (KGC) has garnered massive research interest recently, and most existing methods are designed following a transductive setting where all entities are observed during training. Despite the great progress on the transductive KGC, these methods struggle to conduct reasoning on emerging KGs involving unseen entities. Thus, inductive KGC, which aims to deduce missing links among
-
A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry arXiv.cs.CL Pub Date : 2024-04-24 Yining Huang, Keke Tang, Meilian Chen
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and
-
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models arXiv.cs.CL Pub Date : 2024-04-24 Jacob Pfau, William Merrill, Samuel R. Bowman
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic
-
No Train but Gain: Language Arithmetic for training-free Language Adapters enhancement arXiv.cs.CL Pub Date : 2024-04-24 Mateusz Klimaszewski, Piotr Andruszkiewicz, Alexandra Birch
Modular deep learning is the state-of-the-art solution for lifting the curse of multilinguality, preventing the impact of negative interference and enabling cross-lingual performance in Multilingual Pre-trained Language Models. However, a trade-off of this approach is the reduction in positive transfer learning from closely related languages. In response, we introduce a novel method called language
-
Annotator-Centric Active Learning for Subjective NLP Tasks arXiv.cs.CL Pub Date : 2024-04-24 Michiel van der Meer, Neele Falk, Pradeep K. Murukannaiah, Enrico Liscio
To accurately capture the variability in human judgments for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial. Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy
-
Nyonic Technical Report arXiv.cs.CL Pub Date : 2024-04-24 Junfeng Tian, Rui Wang, Cong Li, Yudong Zhou, Jun Liu, Jun Wang
This report details the development and key achievements of our latest language model designed for custom large language models. The advancements introduced include a novel Online Data Scheduler that supports flexible training data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially
-
Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs arXiv.cs.CL Pub Date : 2024-04-24 Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, Xiang Chen, Julian McAuley, Shuai Li
Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address various challenges across diverse domains and tasks involving LLMs. In this paper, we provide a comprehensive survey of Chain-of-X methods
-
The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews arXiv.cs.CL Pub Date : 2024-04-24 Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mäntylä
Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract
-
KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering arXiv.cs.CL Pub Date : 2024-04-24 Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao
Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information
-
Return of EM: Entity-driven Answer Set Expansion for QA Evaluation arXiv.cs.CL Pub Date : 2024-04-24 Dongryeol Lee, Minwoo Lee, Kyungmin Min, Joonsuk Park, Kyomin Jung
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the
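The soft EM idea in this snippet can be sketched in a few lines: expand each gold answer with alternative surface forms of the same entity, then apply ordinary exact match against the expanded set. The normalization and the expansion table below are illustrative assumptions, not the paper's implementation.

```python
import string

def normalize(text):
    # Standard EM-style normalization: lowercase, drop punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def soft_em(prediction, gold_answers, surface_forms):
    # Expand the gold answer set with known alternative surface forms,
    # then fall back to ordinary exact match over the expanded set.
    expanded = set()
    for gold in gold_answers:
        expanded.add(normalize(gold))
        expanded.update(normalize(s) for s in surface_forms.get(gold, []))
    return int(normalize(prediction) in expanded)

# Hypothetical expansion table: "JFK" and "John F. Kennedy" name the same entity.
forms = {"John F. Kennedy": ["JFK", "John Fitzgerald Kennedy"]}
print(soft_em("JFK", ["John F. Kennedy"], forms))    # → 1
print(soft_em("Nixon", ["John F. Kennedy"], forms))  # → 0
```

Plain EM would score "JFK" as wrong here; the expanded set credits it without resorting to an LLM judge.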
-
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code arXiv.cs.CL Pub Date : 2024-04-24 Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Yulei Sui, Pan Zhou, Lichao Sun
As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desired to know if the code is AI-generated and by which model, especially for purposes like protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions
-
Hybrid LLM/Rule-based Approaches to Business Insights Generation from Structured Data arXiv.cs.CL Pub Date : 2024-04-24 Aliaksei Vertsel, Mikhail Rumiantsau
In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable, often fall short when faced with the complexity and dynamism of modern business data. Conversely, Artificial Intelligence (AI) models, particularly Large Language
-
Minimal Evidence Group Identification for Claim Verification arXiv.cs.CL Pub Date : 2024-04-24 Xiangci Li, Sihao Chen, Rajvi Kapadia, Jessica Ouyang, Fan Zhang
Claim verification in real-world settings (e.g. against a large collection of candidate evidence retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exist distinct sets of evidence that could be used to verify the claim from different perspectives
-
Can Foundational Large Language Models Assist with Conducting Pharmaceuticals Manufacturing Investigations? arXiv.cs.CL Pub Date : 2024-04-23 Hossein Salami (Digital Services, MMD, Merck & Co., Inc., Rahway, NJ, USA), Brandye Smith-Goettler (Digital Services, MMD, Merck & Co., Inc., West Point, PA, USA), Vijay Yadav (Digital Services, MMD, Merck & Co., Inc., West Point, PA, USA)
General purpose Large Language Models (LLM) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open
-
Retrieval Head Mechanistically Explains Long-Context Factuality arXiv.cs.CL Pub Date : 2024-04-24 Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving
-
CASPR: Automated Evaluation Metric for Contrastive Summarization arXiv.cs.CL Pub Date : 2024-04-23 Nirupan Ananthamurugan, Dat Duong, Philip George, Ankita Gupta, Sandeep Tata, Beliz Gunel
Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed token-overlap based metrics, Distinctiveness Score
-
PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Shashi Kant Gupta, Aditya Basu, Mauro Nievas, Jerrin Thomas, Nathan Wolfrath, Adhitya Ramamurthi, Bradley Taylor, Anai N. Kothari, Therica M. Miller, Sorena Nadaf-Rahrov, Yanshan Wang, Hrituraj Singh
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing
-
Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical
-
ToM-LM: Delegating Theory Of Mind Reasoning to External Symbolic Executors in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Weizhi Tang, Vaishak Belle
Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, and fine-tuning to improve the ToM reasoning ability of LLMs. In our approach, an LLM is
-
Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information arXiv.cs.CL Pub Date : 2024-04-23 Chihiro Taguchi, Jefferson Saransig, Dayana Velásquez, David Chiang
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation
-
Large Language Models Spot Phishing Emails with Surprising Accuracy: A Comparative Analysis of Performance arXiv.cs.CL Pub Date : 2024-04-23 Het Patel, Umair Rehman, Farkhund Iqbal
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological
-
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference arXiv.cs.CL Pub Date : 2024-04-23 João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context
-
KGValidator: A Framework for Automatic Validation of Knowledge Graph Construction arXiv.cs.CL Pub Date : 2024-04-24 Jack Boylan, Shashank Mangla, Dominic Thorn, Demian Gholipour Ghalandari, Parsa Ghaffari, Chris Hokamp
This study explores the use of Large Language Models (LLMs) for automatic evaluation of knowledge graph (KG) completion models. Historically, validating information in KGs has been a challenging task, requiring large-scale human annotation at prohibitive cost. With the emergence of general-purpose generative AI and LLMs, it is now plausible that human-in-the-loop validation could be replaced by a generative
-
CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies arXiv.cs.CL Pub Date : 2024-04-23 Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua Yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang
To enhance language models' cultural awareness, we design a generalizable pipeline to construct cultural knowledge bases from different online communities on a massive scale. With the pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank contains
-
The Power of the Noisy Channel: Unsupervised End-to-End Task-Oriented Dialogue with LLMs arXiv.cs.CL Pub Date : 2024-04-23 Brendan King, Jeffrey Flanigan
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize unlabelled data and a schema definition are sufficient for building a working
-
Does Instruction Tuning Make LLMs More Consistent? arXiv.cs.CL Pub Date : 2024-04-23 Constanza Fierro, Jiaang Li, Anders Søgaard
The purpose of instruction tuning is to enable zero-shot performance, but instruction tuning has also been shown to improve chain-of-thought reasoning and value alignment (Si et al., 2023). Here we consider the impact on consistency, i.e., the sensitivity of language models to small perturbations in the input. We compare 10 instruction-tuned LLaMA models to the original LLaMA-7b model and
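Consistency in this sense can be operationalized as agreement across inputs that are small perturbations or paraphrases of the same query. The pairwise-agreement metric and toy model below are a minimal sketch of that idea, not the paper's protocol.

```python
from itertools import combinations

def consistency(model, prompts):
    # Fraction of prompt pairs (all meant to ask the same thing) on which
    # the model gives the same answer; 1.0 means fully consistent.
    answers = [model(p) for p in prompts]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy stand-in for an LLM: answers correctly for only one phrasing.
def toy_model(prompt):
    return "Paris" if "capital of France" in prompt else "Lyon"

prompts = ["What is the capital of France?",
           "The capital of France is?",
           "Which city is France's capital?"]
print(consistency(toy_model, prompts))  # → 0.333... (1 of 3 pairs agree)
```

An instruction-tuned model that is robust to rephrasing would score closer to 1.0 on such a probe.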
-
Setting up the Data Printer with Improved English to Ukrainian Machine Translation arXiv.cs.CL Pub Date : 2024-04-23 Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning
-
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts arXiv.cs.CL Pub Date : 2024-04-22 Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang
Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly utilized to tailor pre-trained models to specific applications. While methods like LoRA have effectively tackled GPU memory constraints during fine-tuning, their applicability is often restricted to limited performance, especially on
-
FASTTRACK: Fast and Accurate Fact Tracing for LLMs arXiv.cs.CL Pub Date : 2024-04-22 Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia
Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely
-
Regressive Side Effects of Training Language Models to Mimic Student Misconceptions arXiv.cs.CL Pub Date : 2024-04-23 Shashank Sonkar, Naiming Liu, Richard G. Baraniuk
This paper presents a novel exploration into the regressive side effects of training Large Language Models (LLMs) to mimic student misconceptions for personalized education. We highlight the problem that as LLMs are trained to more accurately mimic student misconceptions, there is a compromise in the factual integrity and reasoning ability of the models. Our work involved training an LLM on a student-tutor
-
Do not think pink elephant! arXiv.cs.CL Pub Date : 2024-04-22 Kyomin Hwang, Suyoung Kim, JunHoo Lee, Nojun Kwak
Large Models (LMs) have heightened expectations for the potential of general AI as they are akin to human intelligence. This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the "white bear phenomenon". We investigate the causes of the white bear phenomenon by analyzing their representation space. Based on this analysis
-
Identifying Fairness Issues in Automatically Generated Testing Content arXiv.cs.CL Pub Date : 2024-04-23 Kevin Stowe, Benny Longwill, Alyssa Francis, Tatsuya Aoyama, Debanjan Ghosh, Swapna Somasundaran
Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. We here focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically
-
Multi-view Content-aware Indexing for Long Document Retrieval arXiv.cs.CL Pub Date : 2024-04-23 Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing methods of long documents remain under-explored, while existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks
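The contrast between fixed-length chunking and structure-aware indexing can be illustrated with a small sketch. The `## ` section marker below is an assumed convention for demonstration, not the paper's method.

```python
def fixed_chunks(text, size):
    # Baseline: split every `size` characters, ignoring document structure,
    # so a chunk boundary can fall mid-sentence or mid-section.
    return [text[i:i + size] for i in range(0, len(text), size)]

def structure_chunks(text):
    # Structure-aware alternative: split at section markers so each chunk
    # is a coherent content unit.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "## Intro\nBackground text.\n## Method\nModel details.\n## Results\nScores."
print(structure_chunks(doc))  # three chunks, one per section
```

Fixed-length chunking of the same document would cut across section boundaries, which is exactly the failure mode the abstract points at.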
-
Enhancing Textual Personality Detection toward Social Media: Integrating Long-term and Short-term Perspectives arXiv.cs.CL Pub Date : 2024-04-23 Haohao Zhu, Xiaokun Zhang, Junyu Lu, Youlin Wu, Zewen Bai, Changrong Min, Liang Yang, Bo Xu, Dongyu Zhang, Hongfei Lin
Textual personality detection aims to identify personality characteristics by analyzing user-generated content toward social media platforms. Numerous psychological literature highlighted that personality encompasses both long-term stable traits and short-term dynamic states. However, existing studies often concentrate only on either long-term or short-term personality representations, without effectively
-
TAXI: Evaluating Categorical Knowledge Editing for Language Models arXiv.cs.CL Pub Date : 2024-04-23 Derek Powell, Walter Gerych, Thomas Hartvigsen
Humans rarely learn one fact in isolation. Instead, learning a new fact induces knowledge of other facts about the world. For example, in learning a korat is a type of cat, you also infer it is a mammal and has claws, ensuring your model of the world is consistent. Knowledge editing aims to inject new facts into language models to improve their factuality, but current benchmarks fail to evaluate consistency
-
Comparison of Current Approaches to Lemmatization: A Case Study in Estonian arXiv.cs.CL Pub Date : 2024-04-23 Aleksei Dorkin, Kairit Sirts
This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap
-
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners arXiv.cs.CL Pub Date : 2024-04-23 Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du, Dacheng Tao
Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, following Wei et al. (2022), including understanding errors, calculation errors and process errors (e.g. missing-step and hallucinations). Subsequently, our in-depth analysis of various error types has
-
Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr Skurzhanskyi, Artem Chernodub, Oleksandr Korniienko, Igor Samokhin
In this paper, we carry out experimental research on Grammatical Error Correction, delving into the nuances of single-model systems, comparing the efficiency of ensembling and ranking methods, and exploring the application of large language models to GEC as single-model systems, as parts of ensembles, and as ranking methods. We set new state-of-the-art performance with F_0.5 scores of 72.8 on CoNLL-2014-test
-
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Chen Zhang, Zhuorui Liu, Dawei Song
With the increasingly giant scales of (causal) large language models (LLMs), the inference efficiency comes as one of the core concerns along the improved performance. In contrast to the memory footprint, the latency bottleneck seems to be of greater importance as there can be billions of requests to a LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive innateness of LLMs
-
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans arXiv.cs.CL Pub Date : 2024-04-23 Vittoria Dentella, Fritz Guenther, Evelina Leivada
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans, however it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining
-
Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation arXiv.cs.CL Pub Date : 2024-04-23 Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo
Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to align with the
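Token-level distillation, as contrasted here with sentence-level distillation, trains the student to match the teacher's next-token distribution at every position. The plain-Python KL loss and made-up logits below are a minimal sketch; real systems compute this over tensors with temperature-scaled soft targets.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_kd_loss(teacher_logits, student_logits):
    # At each position, the student matches the teacher's full next-token
    # distribution via KL(teacher || student), averaged over positions.
    loss = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p, q = softmax(t), softmax(s)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher_logits)

# Two positions over a 3-token vocabulary (made-up logits).
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.8, 0.6, 0.2], [0.1, 1.4, 0.4]]
print(token_level_kd_loss(teacher, student))  # small positive value
```

Sentence-level distillation, by contrast, would discard these distributions and train the student only on the teacher's decoded output sequence.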
-
Pattern-Aware Chain-of-Thought Prompting in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, Jinqiao Wang
Chain-of-thought (CoT) prompting can guide language models to engage in complex multi-step reasoning. The quality of provided demonstrations significantly impacts the success of downstream inference tasks. While existing automated methods prioritize accuracy and semantics in these demonstrations, we show that the underlying reasoning patterns play a more crucial role in such tasks. In this paper, we
-
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches arXiv.cs.CL Pub Date : 2024-04-23 Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, Bhargav Kanakiya, Charles Chen, Natalia Vassilieva, Boulbaba Ben Amor, Marco AF Pimentel, Shadab Khan
This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering
-
Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Chris Samarinas, Pracha Promthaw, Atharva Nijasure, Hansi Zeng, Julian Killingback, Hamed Zamani
This paper explores SynTOD, a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) Systems capable of handling complex tasks such as intent classification, slot filling, conversational question-answering, and retrieval-augmented response generation, without relying on crowdsourcing or real-world data. SynTOD utilizes a state transition graph to define the desired
-
Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering arXiv.cs.CL Pub Date : 2024-04-23 Yao Xu, Shizhu He, Jiabei Chen, Zihao Wang, Yangqiu Song, Hanghang Tong, Kang Liu, Jun Zhao
To address the issue of insufficient knowledge and the tendency to generate hallucination in Large Language Models (LLMs), numerous studies have endeavored to integrate LLMs with Knowledge Graphs (KGs). However, all these methods are evaluated on conventional Knowledge Graph Question Answering (KGQA) with complete KGs, where the factual triples involved in each question are entirely covered by the
-
Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing arXiv.cs.CL Pub Date : 2024-04-23 Ben Hutchinson
This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when
-
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks arXiv.cs.CL Pub Date : 2024-04-23 Amir Saeidi, Shivanshu Verma, Chitta Baral
Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive
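For readers unfamiliar with the objective being evaluated here: the standard DPO loss (Rafailov et al., 2023) scores a preference pair by the policy's log-probability margins over a frozen reference model. A minimal single-pair sketch, with summed sequence log-probabilities as inputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    """
    # Implicit rewards are log-ratio margins against the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Logistic (negative log-sigmoid) loss on the scaled margin difference.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference, both margins are zero and the loss is log 2; widening the chosen response's margin drives the loss toward zero, which is the behavior the DPO variants surveyed in this paper modify.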
-
MisgenderMender: A Community-Informed Approach to Interventions for Misgendering arXiv.cs.CL Pub Date : 2024-04-23 Tamanna Hossain, Sunipa Dev, Sameer Singh
Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Misgendering, the act of incorrectly addressing someone's gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this lack of research into interventions for misgendering by conducting
-
Q-Tuning: Queue-based Prompt Tuning for Lifelong Few-shot Language Learning arXiv.cs.CL Pub Date : 2024-04-22 Yanhui Guo, Shaoyuan Xu, Jinmiao Fu, Jia Liu, Chaosheng Dong, Bryan Wang
This paper introduces Q-tuning, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs
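The abstract is cut off before the eviction and aggregation details, so the sketch below is a toy stand-in: a bounded FIFO queue of prompt vectors with scalar reweighting as a placeholder for the paper's adaptive knowledge aggregation. All names and the eviction rule are assumptions, not the paper's method.

```python
from collections import deque

class PromptQueue:
    """Toy bounded queue of task prompts (names and rules hypothetical).

    Holds at most `capacity` prompt vectors; when full, the oldest prompt
    is evicted before the new task's prompt is appended.
    """
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)  # FIFO eviction at capacity

    def add_task_prompt(self, prompt):
        self.queue.append(prompt)

    def aggregated_prompt(self, weights):
        # Reweigh queued prompts (placeholder for adaptive aggregation)
        # before they would be concatenated into one soft prompt.
        assert len(weights) == len(self.queue)
        return [[w * x for x in p] for w, p in zip(weights, self.queue)]
```

For example, with capacity 2, adding three one-dimensional prompts `[1.0]`, `[2.0]`, `[3.0]` evicts the first, and reweighting with `[0.5, 1.0]` yields `[[1.0], [3.0]]`.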
-
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training arXiv.cs.CL Pub Date : 2024-04-22 Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual
-
WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models arXiv.cs.CL Pub Date : 2024-04-22 Ronald Xie, Steven Palayew, Augustin Toma, Gary Bader, Bo Wang
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification
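The "style of CLIP" mentioned for the second solution refers to a symmetric contrastive (InfoNCE) objective over image-text pairs; the submission's exact setup is not in the truncated abstract. A minimal sketch of that objective over a precomputed similarity matrix, with the temperature value assumed:

```python
import math

def clip_contrastive_loss(sim_matrix, temperature=0.07):
    """Symmetric InfoNCE loss over an image-text similarity matrix.

    sim_matrix[i][j]: similarity between image i and text j; matched
    pairs lie on the diagonal. Temperature 0.07 is an assumption.
    """
    n = len(sim_matrix)

    def ce_rows(m):
        # Cross-entropy of each row's softmax against the diagonal target,
        # using a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i in range(n):
            logits = [m[i][j] / temperature for j in range(n)]
            mx = max(logits)
            log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += -(logits[i] - log_z)
        return total / n

    # Average the image-to-text and text-to-image directions.
    transposed = [[sim_matrix[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (ce_rows(sim_matrix) + ce_rows(transposed))
```

Training such a joint embedding pushes matched image-disease-label pairs onto the diagonal, after which classification reduces to nearest-text retrieval.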
-
WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction arXiv.cs.CL Pub Date : 2024-04-22 Augustin Toma, Ronald Xie, Steven Palayew, Patrick R. Lawler, Bo Wang
Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach that achieved top performance in all three subtasks. For the MS dataset,
-
SnapKV: LLM Knows What You are Looking for Before Generation arXiv.cs.CL Pub Date : 2024-04-22 Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that
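The abstract stops before SnapKV's selection mechanism, but its core idea, scoring prompt KV positions by the attention they receive from a trailing observation window and keeping only the top-k, can be sketched as follows (the pooling rule here is a simplification of the paper's method):

```python
def select_kv_positions(attn_weights, k):
    """Keep the k prompt positions with the highest pooled attention mass.

    attn_weights: list of per-query attention rows (each a list of floats
    over prompt positions), e.g. from the last few queries of the prompt
    acting as an observation window.
    """
    n = len(attn_weights[0])
    # Pool the attention each prompt position receives across the window.
    scores = [sum(row[j] for row in attn_weights) for j in range(n)]
    # Indices of the top-k scoring positions, restored to original order
    # so the compressed KV cache preserves positional structure.
    keep = sorted(sorted(range(n), key=lambda j: -scores[j])[:k])
    return keep
```

Only the selected positions' keys and values are retained before generation begins, which is what bounds KV-cache memory as input length grows.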