-
Universal Adversarial Triggers Are Not Universal arXiv.cs.CL Pub Date : 2024-04-24 Nicholas Meade, Arkil Patel, Siva Reddy
Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate
-
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models arXiv.cs.CL Pub Date : 2024-04-24 Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to
-
Generalization Measures for Zero-Shot Cross-Lingual Transfer arXiv.cs.CL Pub Date : 2024-04-24 Saksham Bassi, Duygu Ataman, Kyunghyun Cho
A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to building robust and reliable machine learning systems. Language model evaluation tasks lack information metrics about model generalization, and their applicability in a new setting is measured using task- and language-specific downstream performance, which is often lacking in many languages
-
Assessing The Potential Of Mid-Sized Language Models For Clinical QA arXiv.cs.CL Pub Date : 2024-04-24 Elliot Bolton, Betty Xiong, Vijaytha Muralidharan, Joel Schamroth, Vivek Muralidharan, Christopher D. Manning, Roxana Daneshjou
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use
-
Effective Unsupervised Constrained Text Generation based on Perturbed Masking arXiv.cs.CL Pub Date : 2024-04-24 Yingwen Fu, Wenjie Ou, Zhou Yu, Yue Lin
Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action in each step. Specifically, PMCTG extends perturbed
-
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models arXiv.cs.CL Pub Date : 2024-04-24 Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, Yanghua Xiao
It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that
-
Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation arXiv.cs.CL Pub Date : 2024-04-24 Maja Stahl, Leon Biermann, Andreas Nehring, Henning Wachsmuth
Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent
-
One Subgraph for All: Efficient Reasoning on Opening Subgraphs for Inductive Knowledge Graph Completion arXiv.cs.CL Pub Date : 2024-04-24 Zhiwen Xie, Yi Zhang, Guangyou Zhou, Jin Liu, Xinhui Tu, Jimmy Xiangji Huang
Knowledge Graph Completion (KGC) has garnered massive research interest recently, and most existing methods are designed following a transductive setting where all entities are observed during training. Despite the great progress on the transductive KGC, these methods struggle to conduct reasoning on emerging KGs involving unseen entities. Thus, inductive KGC, which aims to deduce missing links among
-
A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry arXiv.cs.CL Pub Date : 2024-04-24 Yining Huang, Keke Tang, Meilian Chen
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and
-
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models arXiv.cs.CL Pub Date : 2024-04-24 Jacob Pfau, William Merrill, Samuel R. Bowman
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic
-
No Train but Gain: Language Arithmetic for training-free Language Adapters enhancement arXiv.cs.CL Pub Date : 2024-04-24 Mateusz Klimaszewski, Piotr Andruszkiewicz, Alexandra Birch
Modular deep learning is the state-of-the-art solution for lifting the curse of multilinguality, preventing the impact of negative interference and enabling cross-lingual performance in Multilingual Pre-trained Language Models. However, a trade-off of this approach is the reduction in positive transfer learning from closely related languages. In response, we introduce a novel method called language
-
Annotator-Centric Active Learning for Subjective NLP Tasks arXiv.cs.CL Pub Date : 2024-04-24 Michiel van der Meer, Neele Falk, Pradeep K. Murukannaiah, Enrico Liscio
To accurately capture the variability in human judgments for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial. Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy
-
Nyonic Technical Report arXiv.cs.CL Pub Date : 2024-04-24 Junfeng Tian, Rui Wang, Cong Li, Yudong Zhou, Jun Liu, Jun Wang
This report details the development and key achievements of our latest language model designed for custom large language models. The advancements introduced include a novel Online Data Scheduler that supports flexible training data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially
-
Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs arXiv.cs.CL Pub Date : 2024-04-24 Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, Xiang Chen, Julian McAuley, Shuai Li
Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address various challenges across diverse domains and tasks involving LLMs. In this paper, we provide a comprehensive survey of Chain-of-X methods
-
The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews arXiv.cs.CL Pub Date : 2024-04-24 Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mäntylä
Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract
-
KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering arXiv.cs.CL Pub Date : 2024-04-24 Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao
Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information
-
Return of EM: Entity-driven Answer Set Expansion for QA Evaluation arXiv.cs.CL Pub Date : 2024-04-24 Dongryeol Lee, Minwoo Lee, Kyungmin Min, Joonsuk Park, Kyomin Jung
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the
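The soft EM idea in this snippet can be sketched in a few lines: expand each gold answer with alternative surface forms of the same entity, then apply ordinary exact match against the expanded set. The normalization and the expansion table below are illustrative assumptions, not the paper's implementation.

```python
import string

def normalize(text):
    # Standard EM-style normalization: lowercase, drop punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def soft_em(prediction, gold_answers, surface_forms):
    # Expand the gold answer set with known alternative surface forms,
    # then fall back to ordinary exact match over the expanded set.
    expanded = set()
    for gold in gold_answers:
        expanded.add(normalize(gold))
        expanded.update(normalize(s) for s in surface_forms.get(gold, []))
    return int(normalize(prediction) in expanded)

# Hypothetical expansion table: "JFK" and "John F. Kennedy" name the same entity.
forms = {"John F. Kennedy": ["JFK", "John Fitzgerald Kennedy"]}
print(soft_em("JFK", ["John F. Kennedy"], forms))    # → 1
print(soft_em("Nixon", ["John F. Kennedy"], forms))  # → 0
```

Plain EM would score "JFK" as wrong here; the expanded set credits it without resorting to an LLM judge.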
-
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code arXiv.cs.CL Pub Date : 2024-04-24 Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Yulei Sui, Pan Zhou, Lichao Sun
As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desired to know if the code is AI-generated and by which model, especially for purposes like protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions
-
Hybrid LLM/Rule-based Approaches to Business Insights Generation from Structured Data arXiv.cs.CL Pub Date : 2024-04-24 Aliaksei Vertsel, Mikhail Rumiantsau
In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable, often fall short when faced with the complexity and dynamism of modern business data. Conversely, Artificial Intelligence (AI) models, particularly Large Language
-
Minimal Evidence Group Identification for Claim Verification arXiv.cs.CL Pub Date : 2024-04-24 Xiangci Li, Sihao Chen, Rajvi Kapadia, Jessica Ouyang, Fan Zhang
Claim verification in real-world settings (e.g. against a large collection of candidate evidence retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exist distinct sets of evidence that could be used to verify the claim from different perspectives
-
Can Foundational Large Language Models Assist with Conducting Pharmaceuticals Manufacturing Investigations? arXiv.cs.CL Pub Date : 2024-04-23 Hossein Salami (Digital Services, MMD, Merck & Co., Inc., Rahway, NJ, USA), Brandye Smith-Goettler (Digital Services, MMD, Merck & Co., Inc., West Point, PA, USA), Vijay Yadav (Digital Services, MMD, Merck & Co., Inc., West Point, PA, USA)
General purpose Large Language Models (LLM) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open
-
Retrieval Head Mechanistically Explains Long-Context Factuality arXiv.cs.CL Pub Date : 2024-04-24 Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving
-
CASPR: Automated Evaluation Metric for Contrastive Summarization arXiv.cs.CL Pub Date : 2024-04-23 Nirupan Ananthamurugan, Dat Duong, Philip George, Ankita Gupta, Sandeep Tata, Beliz Gunel
Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed token-overlap based metrics, Distinctiveness Score
-
PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Shashi Kant Gupta, Aditya Basu, Mauro Nievas, Jerrin Thomas, Nathan Wolfrath, Adhitya Ramamurthi, Bradley Taylor, Anai N. Kothari, Therica M. Miller, Sorena Nadaf-Rahrov, Yanshan Wang, Hrituraj Singh
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing
-
Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical
-
ToM-LM: Delegating Theory Of Mind Reasoning to External Symbolic Executors in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Weizhi Tang, Vaishak Belle
Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, and fine-tuning to improve the ToM reasoning ability of LLMs. In our approach, an LLM is
-
Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information arXiv.cs.CL Pub Date : 2024-04-23 Chihiro Taguchi, Jefferson Saransig, Dayana Velásquez, David Chiang
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation
-
Large Language Models Spot Phishing Emails with Surprising Accuracy: A Comparative Analysis of Performance arXiv.cs.CL Pub Date : 2024-04-23 Het Patel, Umair Rehman, Farkhund Iqbal
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological
-
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference arXiv.cs.CL Pub Date : 2024-04-23 João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context
-
KGValidator: A Framework for Automatic Validation of Knowledge Graph Construction arXiv.cs.CL Pub Date : 2024-04-24 Jack Boylan, Shashank Mangla, Dominic Thorn, Demian Gholipour Ghalandari, Parsa Ghaffari, Chris Hokamp
This study explores the use of Large Language Models (LLMs) for automatic evaluation of knowledge graph (KG) completion models. Historically, validating information in KGs has been a challenging task, requiring large-scale human annotation at prohibitive cost. With the emergence of general-purpose generative AI and LLMs, it is now plausible that human-in-the-loop validation could be replaced by a generative
-
CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies arXiv.cs.CL Pub Date : 2024-04-23 Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua Yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang
To enhance language models' cultural awareness, we design a generalizable pipeline to construct cultural knowledge bases from different online communities on a massive scale. With the pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank contains
-
The Power of the Noisy Channel: Unsupervised End-to-End Task-Oriented Dialogue with LLMs arXiv.cs.CL Pub Date : 2024-04-23 Brendan King, Jeffrey Flanigan
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize unlabelled data and a schema definition are sufficient for building a working
-
Does Instruction Tuning Make LLMs More Consistent? arXiv.cs.CL Pub Date : 2024-04-23 Constanza Fierro, Jiaang Li, Anders Søgaard
The purpose of instruction tuning is to enable zero-shot performance, but instruction tuning has also been shown to improve chain-of-thought reasoning and value alignment (Si et al., 2023). Here we consider the impact on consistency, i.e., the sensitivity of language models to small perturbations in the input. We compare 10 instruction-tuned LLaMA models to the original LLaMA-7b model and
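Consistency in this sense can be operationalized as agreement across inputs that are small perturbations or paraphrases of the same query. The pairwise-agreement metric and toy model below are a minimal sketch of that idea, not the paper's protocol.

```python
from itertools import combinations

def consistency(model, prompts):
    # Fraction of prompt pairs (all meant to ask the same thing) on which
    # the model gives the same answer; 1.0 means fully consistent.
    answers = [model(p) for p in prompts]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy stand-in for an LLM: answers correctly for only one phrasing.
def toy_model(prompt):
    return "Paris" if "capital of France" in prompt else "Lyon"

prompts = ["What is the capital of France?",
           "The capital of France is?",
           "Which city is France's capital?"]
print(consistency(toy_model, prompts))  # → 0.333... (1 of 3 pairs agree)
```

An instruction-tuned model that is robust to rephrasing would score closer to 1.0 on such a probe.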
-
Setting up the Data Printer with Improved English to Ukrainian Machine Translation arXiv.cs.CL Pub Date : 2024-04-23 Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning
-
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts arXiv.cs.CL Pub Date : 2024-04-22 Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang
Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly utilized to tailor pre-trained models to specific applications. While methods like LoRA have effectively tackled GPU memory constraints during fine-tuning, their applicability is often restricted to limited performance, especially on
-
FASTTRACK: Fast and Accurate Fact Tracing for LLMs arXiv.cs.CL Pub Date : 2024-04-22 Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia
Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely
-
Regressive Side Effects of Training Language Models to Mimic Student Misconceptions arXiv.cs.CL Pub Date : 2024-04-23 Shashank Sonkar, Naiming Liu, Richard G. Baraniuk
This paper presents a novel exploration into the regressive side effects of training Large Language Models (LLMs) to mimic student misconceptions for personalized education. We highlight the problem that as LLMs are trained to more accurately mimic student misconceptions, there is a compromise in the factual integrity and reasoning ability of the models. Our work involved training an LLM on a student-tutor
-
Do not think pink elephant! arXiv.cs.CL Pub Date : 2024-04-22 Kyomin Hwang, Suyoung Kim, JunHoo Lee, Nojun Kwak
Large Models (LMs) have heightened expectations for the potential of general AI as they are akin to human intelligence. This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the "white bear phenomenon". We investigate the causes of the white bear phenomenon by analyzing their representation space. Based on this analysis
-
Identifying Fairness Issues in Automatically Generated Testing Content arXiv.cs.CL Pub Date : 2024-04-23 Kevin Stowe, Benny Longwill, Alyssa Francis, Tatsuya Aoyama, Debanjan Ghosh, Swapna Somasundaran
Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. We here focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically
-
Multi-view Content-aware Indexing for Long Document Retrieval arXiv.cs.CL Pub Date : 2024-04-23 Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing methods of long documents remain under-explored, while existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks
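The contrast between fixed-length chunking and structure-aware indexing can be illustrated with a small sketch. The `## ` section marker below is an assumed convention for demonstration, not the paper's method.

```python
def fixed_chunks(text, size):
    # Baseline: split every `size` characters, ignoring document structure,
    # so a chunk boundary can fall mid-sentence or mid-section.
    return [text[i:i + size] for i in range(0, len(text), size)]

def structure_chunks(text):
    # Structure-aware alternative: split at section markers so each chunk
    # is a coherent content unit.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "## Intro\nBackground text.\n## Method\nModel details.\n## Results\nScores."
print(structure_chunks(doc))  # three chunks, one per section
```

Fixed-length chunking of the same document would cut across section boundaries, which is exactly the failure mode the abstract points at.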
-
Enhancing Textual Personality Detection toward Social Media: Integrating Long-term and Short-term Perspectives arXiv.cs.CL Pub Date : 2024-04-23 Haohao Zhu, Xiaokun Zhang, Junyu Lu, Youlin Wu, Zewen Bai, Changrong Min, Liang Yang, Bo Xu, Dongyu Zhang, Hongfei Lin
Textual personality detection aims to identify personality characteristics by analyzing user-generated content toward social media platforms. Numerous psychological literature highlighted that personality encompasses both long-term stable traits and short-term dynamic states. However, existing studies often concentrate only on either long-term or short-term personality representations, without effectively
-
TAXI: Evaluating Categorical Knowledge Editing for Language Models arXiv.cs.CL Pub Date : 2024-04-23 Derek Powell, Walter Gerych, Thomas Hartvigsen
Humans rarely learn one fact in isolation. Instead, learning a new fact induces knowledge of other facts about the world. For example, in learning a korat is a type of cat, you also infer it is a mammal and has claws, ensuring your model of the world is consistent. Knowledge editing aims to inject new facts into language models to improve their factuality, but current benchmarks fail to evaluate consistency
-
Comparison of Current Approaches to Lemmatization: A Case Study in Estonian arXiv.cs.CL Pub Date : 2024-04-23 Aleksei Dorkin, Kairit Sirts
This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap
-
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Perfect Reasoners arXiv.cs.CL Pub Date : 2024-04-23 Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du, Dacheng Tao
Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, following Wei et al. (2022), including understanding errors, calculation errors and process errors (e.g. missing-step and hallucinations). Subsequently, our in-depth analysis of various error types has
-
Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr Skurzhanskyi, Artem Chernodub, Oleksandr Korniienko, Igor Samokhin
In this paper, we carry out experimental research on Grammatical Error Correction, delving into the nuances of single-model systems, comparing the efficiency of ensembling and ranking methods, and exploring the application of large language models to GEC as single-model systems, as parts of ensembles, and as ranking methods. We set new state-of-the-art performance with F_0.5 scores of 72.8 on CoNLL-2014-test
-
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Chen Zhang, Zhuorui Liu, Dawei Song
With the increasingly giant scales of (causal) large language models (LLMs), the inference efficiency comes as one of the core concerns along the improved performance. In contrast to the memory footprint, the latency bottleneck seems to be of greater importance as there can be billions of requests to a LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive innateness of LLMs
-
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans arXiv.cs.CL Pub Date : 2024-04-23 Vittoria Dentella, Fritz Guenther, Evelina Leivada
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans, however it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining
-
Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation arXiv.cs.CL Pub Date : 2024-04-23 Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo
Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to align with the
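Token-level distillation, as contrasted here with sentence-level distillation, trains the student to match the teacher's next-token distribution at every position. The plain-Python KL loss and made-up logits below are a minimal sketch; real systems compute this over tensors with temperature-scaled soft targets.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_kd_loss(teacher_logits, student_logits):
    # At each position, the student matches the teacher's full next-token
    # distribution via KL(teacher || student), averaged over positions.
    loss = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p, q = softmax(t), softmax(s)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher_logits)

# Two positions over a 3-token vocabulary (made-up logits).
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.8, 0.6, 0.2], [0.1, 1.4, 0.4]]
print(token_level_kd_loss(teacher, student))  # small positive value
```

Sentence-level distillation, by contrast, would discard these distributions and train the student only on the teacher's decoded output sequence.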
-
Pattern-Aware Chain-of-Thought Prompting in Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, Jinqiao Wang
Chain-of-thought (CoT) prompting can guide language models to engage in complex multi-step reasoning. The quality of provided demonstrations significantly impacts the success of downstream inference tasks. While existing automated methods prioritize accuracy and semantics in these demonstrations, we show that the underlying reasoning patterns play a more crucial role in such tasks. In this paper, we
-
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches arXiv.cs.CL Pub Date : 2024-04-23 Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, Bhargav Kanakiya, Charles Chen, Natalia Vassilieva, Boulbaba Ben Amor, Marco AF Pimentel, Shadab Khan
This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering
-
Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models arXiv.cs.CL Pub Date : 2024-04-23 Chris Samarinas, Pracha Promthaw, Atharva Nijasure, Hansi Zeng, Julian Killingback, Hamed Zamani
This paper explores SynTOD, a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) Systems capable of handling complex tasks such as intent classification, slot filling, conversational question-answering, and retrieval-augmented response generation, without relying on crowdsourcing or real-world data. SynTOD utilizes a state transition graph to define the desired
-
Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering arXiv.cs.CL Pub Date : 2024-04-23 Yao Xu, Shizhu He, Jiabei Chen, Zihao Wang, Yangqiu Song, Hanghang Tong, Kang Liu, Jun Zhao
To address the issue of insufficient knowledge and the tendency to generate hallucination in Large Language Models (LLMs), numerous studies have endeavored to integrate LLMs with Knowledge Graphs (KGs). However, all these methods are evaluated on conventional Knowledge Graph Question Answering (KGQA) with complete KGs, where the factual triples involved in each question are entirely covered by the
-
Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing arXiv.cs.CL Pub Date : 2024-04-23 Ben Hutchinson
This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when
-
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks arXiv.cs.CL Pub Date : 2024-04-23 Amir Saeidi, Shivanshu Verma, Chitta Baral
Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive
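For readers unfamiliar with the objective being evaluated here: the standard DPO loss (Rafailov et al., 2023) scores a preference pair by the policy's log-probability margins over a frozen reference model. A minimal single-pair sketch, with summed sequence log-probabilities as inputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    """
    # Implicit rewards are log-ratio margins against the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Logistic (negative log-sigmoid) loss on the scaled margin difference.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference, both margins are zero and the loss is log 2; widening the chosen response's margin drives the loss toward zero, which is the behavior the DPO variants surveyed in this paper modify.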
-
MisgenderMender: A Community-Informed Approach to Interventions for Misgendering arXiv.cs.CL Pub Date : 2024-04-23 Tamanna Hossain, Sunipa Dev, Sameer Singh
Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Misgendering, the act of incorrectly addressing someone's gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this lack of research into interventions for misgendering by conducting
-
Q-Tuning: Queue-based Prompt Tuning for Lifelong Few-shot Language Learning arXiv.cs.CL Pub Date : 2024-04-22 Yanhui Guo, Shaoyuan Xu, Jinmiao Fu, Jia Liu, Chaosheng Dong, Bryan Wang
This paper introduces Q-tuning, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs
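The abstract is cut off before the eviction and aggregation details, so the sketch below is a toy stand-in: a bounded FIFO queue of prompt vectors with scalar reweighting as a placeholder for the paper's adaptive knowledge aggregation. All names and the eviction rule are assumptions, not the paper's method.

```python
from collections import deque

class PromptQueue:
    """Toy bounded queue of task prompts (names and rules hypothetical).

    Holds at most `capacity` prompt vectors; when full, the oldest prompt
    is evicted before the new task's prompt is appended.
    """
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)  # FIFO eviction at capacity

    def add_task_prompt(self, prompt):
        self.queue.append(prompt)

    def aggregated_prompt(self, weights):
        # Reweigh queued prompts (placeholder for adaptive aggregation)
        # before they would be concatenated into one soft prompt.
        assert len(weights) == len(self.queue)
        return [[w * x for x in p] for w, p in zip(weights, self.queue)]
```

For example, with capacity 2, adding three one-dimensional prompts `[1.0]`, `[2.0]`, `[3.0]` evicts the first, and reweighting with `[0.5, 1.0]` yields `[[1.0], [3.0]]`.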
-
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training arXiv.cs.CL Pub Date : 2024-04-22 Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual
-
WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models arXiv.cs.CL Pub Date : 2024-04-22 Ronald Xie, Steven Palayew, Augustin Toma, Gary Bader, Bo Wang
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification
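The "style of CLIP" mentioned for the second solution refers to a symmetric contrastive (InfoNCE) objective over image-text pairs; the submission's exact setup is not in the truncated abstract. A minimal sketch of that objective over a precomputed similarity matrix, with the temperature value assumed:

```python
import math

def clip_contrastive_loss(sim_matrix, temperature=0.07):
    """Symmetric InfoNCE loss over an image-text similarity matrix.

    sim_matrix[i][j]: similarity between image i and text j; matched
    pairs lie on the diagonal. Temperature 0.07 is an assumption.
    """
    n = len(sim_matrix)

    def ce_rows(m):
        # Cross-entropy of each row's softmax against the diagonal target,
        # using a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i in range(n):
            logits = [m[i][j] / temperature for j in range(n)]
            mx = max(logits)
            log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += -(logits[i] - log_z)
        return total / n

    # Average the image-to-text and text-to-image directions.
    transposed = [[sim_matrix[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (ce_rows(sim_matrix) + ce_rows(transposed))
```

Training such a joint embedding pushes matched image-disease-label pairs onto the diagonal, after which classification reduces to nearest-text retrieval.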
-
WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction arXiv.cs.CL Pub Date : 2024-04-22 Augustin Toma, Ronald Xie, Steven Palayew, Patrick R. Lawler, Bo Wang
Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach that achieved top performance in all three subtasks. For the MS dataset,
-
SnapKV: LLM Knows What You are Looking for Before Generation arXiv.cs.CL Pub Date : 2024-04-22 Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that
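The abstract stops before SnapKV's selection mechanism, but its core idea, scoring prompt KV positions by the attention they receive from a trailing observation window and keeping only the top-k, can be sketched as follows (the pooling rule here is a simplification of the paper's method):

```python
def select_kv_positions(attn_weights, k):
    """Keep the k prompt positions with the highest pooled attention mass.

    attn_weights: list of per-query attention rows (each a list of floats
    over prompt positions), e.g. from the last few queries of the prompt
    acting as an observation window.
    """
    n = len(attn_weights[0])
    # Pool the attention each prompt position receives across the window.
    scores = [sum(row[j] for row in attn_weights) for j in range(n)]
    # Indices of the top-k scoring positions, restored to original order
    # so the compressed KV cache preserves positional structure.
    keep = sorted(sorted(range(n), key=lambda j: -scores[j])[:k])
    return keep
```

Only the selected positions' keys and values are retained before generation begins, which is what bounds KV-cache memory as input length grows.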