Abstract
Current generative knowledge graph construction approaches usually fail to capture structural knowledge, as they simply flatten natural language into serialized text or a specification language. However, large generative language models trained on structured data such as code have demonstrated impressive capability in structural prediction and reasoning over natural language. We therefore address generative knowledge graph construction with a code language model: given a natural language input reformulated in code format, the target triples are generated through code completion. Specifically, we develop schema-aware prompts that effectively utilize the semantic structure within the knowledge graph. Because code inherently possesses structure, such as class and function definitions, it serves as a natural carrier of prior semantic structural knowledge. Furthermore, we employ a rationale-enhanced generation method to boost performance: rationales provide intermediate reasoning steps, thereby improving knowledge extraction ability. Experimental results indicate that the proposed approach outperforms baselines on benchmark datasets.
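To make the schema-aware, code-format prompting idea concrete, the following is a minimal illustrative sketch in Python; the class names (`Entity`, `Rel`, `Triple`) and the prompt layout are assumptions for illustration, not the paper's exact prompt format. The schema is rendered as class definitions, the input sentence appears as a docstring, and extracted triples are written as class instantiations, so that triple generation becomes a code completion task.

```python
# Hypothetical schema rendered as class definitions: this structural scaffolding
# is what a code language model can exploit as prior knowledge.
class Entity:
    def __init__(self, name: str):
        self.name = name


class Rel:
    def __init__(self, name: str):
        self.name = name


class Triple:
    def __init__(self, head: Entity, rel: Rel, tail: Entity):
        self.head, self.rel, self.tail = head, rel, tail


def build_prompt(sentence: str, extracted: list) -> str:
    """Render the input sentence as a docstring followed by Triple
    instantiations; a code LM would be asked to complete further triples."""
    lines = [f'""" {sentence} """']
    for t in extracted:
        lines.append(
            f'Triple(Entity("{t.head.name}"), Rel("{t.rel.name}"), '
            f'Entity("{t.tail.name}"))'
        )
    return "\n".join(lines)


prompt = build_prompt(
    "Steve Jobs founded Apple.",
    [Triple(Entity("Steve Jobs"), Rel("founded"), Entity("Apple"))],
)
print(prompt)
```

Rendering triples as instantiations of schema classes (rather than as flat serialized text) lets the model's familiarity with code syntax constrain the output structure.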
Index Terms
- CodeKGC: Code Language Model for Generative Knowledge Graph Construction