当前位置: X-MOL 学术arXiv.cs.HC › 论文详情
Our official English website, www.x-mol.net, welcomes your feedback! (Note: you will need to create a separate account there.)
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
arXiv - CS - Human-Computer Interaction Pub Date : 2024-04-12 , DOI: arxiv-2404.08793
Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

中文翻译:

JailbreakLens:针对大型语言模型的越狱攻击可视化分析

大型语言模型(LLM)的激增凸显了对其安全漏洞的担忧,特别是针对越狱攻击,攻击者设计越狱提示来规避潜在滥用的安全机制。解决这些问题需要对越狱提示进行全面分析,以评估法学硕士的防御能力并识别潜在的弱点。然而,评估越狱性能和理解提示特征的复杂性使得这种分析变得很费力。我们与领域专家合作来描述问题,并提出一个法学硕士辅助框架来简化分析过程。提供自动越狱评估,方便性能评估,支持对提示中的组件和关键词进行分析。基于该框架,我们设计了JailbreakLens,这是一个可视化分析系统,使用户能够针对目标模型探索越狱性能,对提示特征进行多层次分析,并细化提示实例以验证结果。通过案例研究、技术评估和专家访谈,我们展示了我们的系统在帮助用户评估模型安全性和识别模型弱点方面的有效性。
更新日期:2024-04-16
down
wechat
bug