CodeKGC: Code Language Model for Generative Knowledge Graph Construction


Abstract

Current generative knowledge graph construction approaches usually fail to capture structural knowledge because they simply flatten natural language into serialized texts or a specification language. However, large generative language models trained on structured data such as code have demonstrated impressive capabilities in understanding natural language for structural prediction and reasoning tasks. Intuitively, we address the task of generative knowledge graph construction with a code language model: given natural language input rendered in a code format, the goal is to generate triples, which can be framed as a code completion task. Specifically, we develop schema-aware prompts that effectively utilize the semantic structure within the knowledge graph. Since code inherently possesses structure, such as class and function definitions, it serves as a useful medium for encoding prior semantic structural knowledge. Furthermore, we employ a rationale-enhanced generation method to boost performance: rationales provide intermediate reasoning steps, thereby improving knowledge extraction. Experimental results indicate that the proposed approach outperforms baselines on benchmark datasets.
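To make the idea concrete, below is a minimal sketch of what a schema-aware, code-format prompt with a rationale step might look like. The class names (Entity, Person, Organization, Triple), the relation label, the extract_triples helper, and the example sentence are illustrative assumptions, not the paper's exact prompt design.

```python
# A minimal sketch of a schema-aware, code-format prompt for generative
# knowledge graph construction. All names and the example sentence are
# hypothetical illustrations, not the paper's exact prompt.

from typing import List


class Entity:
    """Base class for entities mentioned in the input text."""
    def __init__(self, name: str):
        self.name = name


# Schema classes encode prior structural knowledge from the knowledge graph:
# they constrain which entity types the model may emit.
class Person(Entity):
    pass


class Organization(Entity):
    pass


class Triple:
    """A (head, relation, tail) knowledge graph triple."""
    def __init__(self, head: Entity, relation: str, tail: Entity):
        self.head, self.relation, self.tail = head, relation, tail


def extract_triples(text: str) -> List[Triple]:
    """The code language model completes this function body, so triple
    extraction becomes a code completion task."""
    # Rationale (intermediate step emitted before the final answer):
    # "Steve Jobs" is a Person, "Apple" is an Organization, and the verb
    # "co-founded" links them via the founder_of relation.
    return [
        Triple(Person("Steve Jobs"), "founder_of", Organization("Apple")),
    ]


# Example input the model would complete:
# extract_triples("Steve Jobs co-founded Apple in 1976.")
```

Framed this way, the class and function definitions carry the schema, and the comment preceding the return statement plays the role of the rationale, giving the model an intermediate reasoning step before it emits the structured output.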




• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 3, March 2024, 277 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3613569


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 9 March 2024
          • Online AM: 9 February 2024
          • Accepted: 9 January 2024
          • Revised: 13 October 2023
          • Received: 17 April 2023
Published in TALLIP Volume 23, Issue 3


          Qualifiers

          • research-article
