ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
arXiv - CS - Information Retrieval Pub Date : 2024-03-25 , DOI: arxiv-2403.16702
Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, Wenge Rong

Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pre-training models with crafted bimodal and unimodal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations in current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
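The abstract's contrastive pre-training idea can be illustrated with the standard InfoNCE objective over in-batch negatives: each natural-language query is pulled toward its paired code snippet's embedding and pushed away from the other snippets in the batch. This is a minimal sketch of that generic objective, not the paper's exact loss; the temperature value and the use of cosine similarity are assumptions.

```python
import numpy as np

def info_nce_loss(query_emb, code_emb, temperature=0.05):
    """In-batch-negative contrastive loss: the positive for query i is the
    code snippet at row i; all other rows in the batch act as negatives."""
    # L2-normalize so the dot product is cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = (q @ c.T) / temperature  # (batch, batch) similarity matrix
    # Cross-entropy against the diagonal (the matched query-code pairs)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy example: 3 aligned (query, code) embedding pairs with small noise
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
loss = info_nce_loss(q, q + 0.01 * rng.normal(size=(3, 8)))
```

Because the loss is symmetric in structure, bidirectional variants also add the code-to-query direction; either way, minimizing it aligns the two modalities in a shared embedding space.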

Updated: 2024-03-27