LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI

  • Original Paper
  • Language Resources and Evaluation

Abstract

Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.
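As a minimal sketch of the kind of automated template instantiation the framework builds on, the snippet below uses CheckList's public Editor API to expand a free-form template into premise-hypothesis pairs. The template wording, name lists, and the "entailment" label are illustrative assumptions, not items taken from the released LoNLI test bench.

```python
# Minimal sketch (assumed example, not the LoNLI release) of instantiating an
# NLI template with CheckList's Editor.
from checklist.editor import Editor

editor = Editor()

# Premise and hypothesis share one template string so the same fill-ins appear
# in both sentences; they are split apart after instantiation.
template = "{name1} is older than {name2}. ||| {name2} is younger than {name1}."

filled = editor.template(
    template,
    name1=["John", "Maria", "Wei"],   # illustrative fill lists
    name2=["Ravi", "Emma", "Kofi"],
    remove_duplicates=True,
)

examples = [
    {"premise": p, "hypothesis": h, "label": "entailment"}  # label fixed per template
    for p, h in (text.split(" ||| ") for text in filled.data)
]
print(examples[0])
```

Each template carries a single gold label, so instantiation only varies the surface fillers while the reasoning pattern and label stay fixed.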


Notes

  1. We use the term “capability” following the terminology introduced by CheckList (Ribeiro et al., 2020), where a capability is simply a feature a model is expected to possess. Such capabilities may include logical reasoning abilities. Humans, in contrast, may require cognitive abilities to solve examples involving such reasoning; however, the list of capabilities as defined here does not directly align with cognitive abilities.

  2. https://github.com/marcotcr/checklist

  3. For detailed definitions, please see Joshi et al. (2020).

  4. https://github.com/microsoft/lonli

  5. https://huggingface.co/models

  6. https://pytorch.org/hub/pytorch_fairseq_roberta/

  7. https://huggingface.co/microsoft/deberta-large-mnli

  8. https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification

  9. For each model, we perform a hyperparameter search over learning rates (\(1\times 10^{-5}\), \(2\times 10^{-5}\), \(5\times 10^{-5}\)), warm-up steps (0, 500), schedulers (linear, constant), and training epochs (3, 5), and choose the combination that yields the highest validation accuracy on LoNLI and MultiNLI.
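The sketch below illustrates the grid search described in this note; it is an assumed reconstruction, not the authors' released code. `fine_tune_and_evaluate` is a hypothetical placeholder standing in for a full HuggingFace fine-tuning and evaluation run on the LoNLI/MultiNLI validation sets; only the grid values come from the note above.

```python
# Hedged sketch of the hyperparameter grid search: try every combination and
# keep the configuration with the highest validation accuracy.
from itertools import product
from transformers import TrainingArguments

GRID = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "warmup_steps": [0, 500],
    "lr_scheduler_type": ["linear", "constant"],
    "num_train_epochs": [3, 5],
}

def fine_tune_and_evaluate(args: TrainingArguments) -> float:
    """Hypothetical placeholder: fine-tune the NLI model with `args` and
    return validation accuracy."""
    raise NotImplementedError

best_args, best_acc = None, float("-inf")
for values in product(*GRID.values()):
    config = dict(zip(GRID.keys(), values))
    args = TrainingArguments(output_dir="runs/" + "_".join(map(str, values)), **config)
    acc = fine_tune_and_evaluate(args)
    if acc > best_acc:
        best_args, best_acc = args, acc
```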

References

  • Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W.-t., & Choi, Y. (2020). Abductive commonsense reasoning. In International Conference on Learning Representations 2020. https://openreview.net/forum?id=Byg1v1HKDB.

  • Bhardwaj, R., Majumder, N., & Poria, S. (2020). Investigating gender bias in BERT. CoRR, abs/2009.05021. https://arxiv.org/abs/2009.05021.

  • Bowman, S., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In EMNLP 2015 (pp. 632–642).

  • Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in natural language understanding? In NAACL-HLT 2021, Online, June 6–11, 2021 (pp. 4843–4855). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.385.

  • de Vassimon Manela, D., Errington, D., Fisher, T., van Breugel, B., & Minervini, P. (2021). Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 2232–2242)

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019 (Vol. 1, Long and Short Papers, pp. 4171–4186).

  • Dolan, B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asian Federation of Natural Language Processing. https://www.microsoft.com/en-us/research/publication/automatically-constructing-a-corpus-of-sentential-paraphrases/.

  • Glockner, M., Shwartz, V., & Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. In ACL 2018 (Vol. 2: Short Papers, pp. 650–655, Melbourne, Australia, Association for Computational Linguistics). https://doi.org/10.18653/v1/P18-2103. https://aclanthology.org/P18-2103.

  • Grice, H. P. (1975). Logic and conversation. In P. Cole and J.L. Morgan (Eds.), Syntax and Semantics: Vol. 3: Speech Acts (pp. 41–58). Academic Press. http://www.ucl.ac.uk/ls/studypacks/Grice-Logic.pdf.

  • Gupta, V., Mehta, M., Nokhiz, Pe., & Srikumar, V. (2020). INFOTABS: Inference on tables as semi-structured data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2309–2324), Online, Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.210.

  • Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., & Smith, N. A. (2018). Annotation artifacts in natural language inference data. In NAACL.

  • He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with disentangled attention.

  • Hewitt, J., & Manning, C. D. (June 2019). A structural probe for finding syntax in word representations. In NAACL-HLT 2019 (Vol. 1, Long and Short Papers, pp. 4129–4138), Minneapolis, Minnesota, Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1419. https://www.aclweb.org/anthology/N19-1419.

  • Iyer, S., Dandekar, N., & Csernai, K. (2017). First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.

  • Jawahar, G., Sagot, B., & Seddah, D. (2019). What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3651–3657). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1356. https://www.aclweb.org/anthology/P19-1356.

  • Jeretic, P., Warstadt, A., Bhooshan, S., & Williams, A. (2020). Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8690–8705), Online. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.768.

  • Joshi, P., Aditya, S., Sathe, A., & Choudhury, M. (2020). TaxiNLI: Taking a ride up the NLU hill. In CoNLL.

  • Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd Ed. Pearson Prentice Hall.

  • Kaushik, D., Hovy, E. H., & Lipton, Z. C. (2020). Learning the difference that makes a difference with counterfactually-augmented data. In ICLR 2020. OpenReview.net.

  • Khot, T., Sabharwal, A., & Clark, P. (2018). SciTail: A textual entailment dataset from science question answering. In S. A. McIlraith and K. Q. Weinberger (Eds.), AAAI 2018, New Orleans, Louisiana, USA, February 2–7 (pp. 5189–5197). AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17368.

  • Kim, N., Patel, R., Poliak, A., Xia, P., Wang, A., McCoy, T., Tenney, I., Ross, A., Linzen, T., Van Durme, B., Bowman, S. R., & Pavlick, E. (2019). Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM ) (pp. 235–249). Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/S19-1026. https://www.aclweb.org/anthology/S19-1026.

  • Levesque, H., Davis, E., & Morgenstern, L. (2012). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer.

  • Liu, H., Cui, L., Liu, J., & Zhang, Y. (2021). Natural language inference in context—investigating contextual reasoning over long texts. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event (pp. 13388–13396). AAAI Press, https://ojs.aaai.org/index.php/AAAI/article/view/17580.

  • Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., & Smith, N. A. (2019a). Linguistic knowledge and transferability of contextual representations. In NAACL-HLT 2019 (Vol. 1, Long and Short Papers, pp. 1073–1094).

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

  • McCoy, T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In ACL (pp. 3428–3448). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1334. https://aclanthology.org/P19-1334.

  • Naik, A., Ravichander, A., Sadeh, N., Rose, C., & Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2340–2353). Association for Computational Linguistics. https://www.aclweb.org/anthology/C18-1198.

  • Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

  • Piękos, P., Michalewski, H., & Malinowski, M. (2021). Measuring and improving BERT’s mathematical abilities by predicting the order of reasoning. ArXiv, abs/2106.03921.

  • Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S., & Van Durme, B. (2018a). Collecting diverse natural language inference problems for sentence representation evaluation. In EMNLP 2018 (pp. 67–81). Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1007. https://www.aclweb.org/anthology/D18-1007.

  • Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018b). Hypothesis only baselines in natural language inference. In *SEM 2018.

  • Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383–2392). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264. https://aclanthology.org/D16-1264.

  • Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.442.

  • Richardson, K., Hu, H., Moss, L., & Sabharwal, A. (2020). Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 8713–8721). https://doi.org/10.1609/aaai.v34i05.6397.

  • Rocchietti, G., Achena, F., Marziano, G., Salaris, S., & Lenci, A. (2022). FANCY: A diagnostic data-set for NLI models. In E. Fersini, M. Passarotti, V. Patti (Eds.), Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021 Milan, Italy, January 26–28, Volume 3033 of CEUR Workshop Proceedings. CEUR-WS.org, 2021. http://ceur-ws.org/Vol-3033/paper76.pdf.

  • Salvatore, F. (2019). Cross-lingual contradiction detection. https://github.com/felipessalvatore/CLCD. commit xxxxxxx.

  • Salvatore, F., Finger, M., & Hirata Jr, R. (2019). A logical-based corpus for cross-lingual evaluation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019) (pp. 22–30).

  • Sap, M., Bras, R. L., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. (2019). ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI 2019 (pp. 3027–3035). AAAI Press, https://doi.org/10.1609/aaai.v33i01.33013027.

  • Schlangen, D. (2021). Targeting the benchmark: On methodology in current natural language processing research. In ACL: Short Papers.

  • Schuster, T., Shah, D., Yeo, Y. J. S., Filizzola Ortiz, D. R., Santus, E., & Barzilay, R. (2019). Towards debiasing fact verification models. In EMNLP-IJCNLP 2019 (pp. 3419–3425), Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1341. https://aclanthology.org/D19-1341.

  • Sowa, J.F. (2010). The role of logic and ontology in language and reasoning. In Theory and applications of ontology: philosophical perspectives (pp. 231–263). Springer.

  • Speer, R., Chin, J., & Havasi, C. (2016). Conceptnet 5.5: An open multilingual graph of general knowledge. CoRR, abs/1612.03975, http://arxiv.org/abs/1612.03975.

  • Talmor, A., Elazar, Y., Goldberg, Y., & Berant, J. (2019). oLMpics—On what language model pre-training captures.

  • Talmor, A., Elazar, Y., Goldberg, Y., & Berant, J. (2020a). oLMpics—On what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8, 743–758.

  • Talmor, A., Tafjord, O., Clark, P., Goldberg, Y., & Berant, J. (2020b). Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems (Vol. 33, pp. 20227–20237). Curran Associates, Inc., https://proceedings.neurips.cc/paper/2020/file/e992111e4ab9985366e806733383bd8c-Paper.pdf.

  • Tenney, I., Das, D., & Pavlick, E. (2019a). BERT rediscovers the classical NLP pipeline. In ACL (pp. 4593–4601). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1452. https://www.aclweb.org/anthology/P19-1452.

  • Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T., Kim, N., Van Durme, B., Bowman, S., Das, D., et al. (2019b). What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR 2019.

  • Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A large-scale dataset for fact extraction and verification. In M. A. Walker, H. Ji, & A. Stent (Eds.), NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers) (pp. 809–819). Association for Computational Linguistics. https://doi.org/10.18653/v1/n18-1074.

  • Vashishtha, S., Poliak, A., Lal, Y. K., Van Durme, B., & White, A. S. (2020). Temporal reasoning in natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4070–4078). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.363. https://aclanthology.org/2020.findings-emnlp.363.

  • Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., & Shieber, S. (2020). Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 12388–12401). Curran Associates, Inc., https://proceedings.neurips.cc/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.

  • Wallace, E., Wang, Y., Li, S., Singh, S., & Gardner, M. (2019). Do NLP models know numbers? Probing numeracy in embeddings. In Empirical Methods in Natural Language Processing.

  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP 2018 (p. 353).

  • Welleck, S., Weston, J., Szlam, A., & Cho, K. (2019). Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3731–3741), Florence, Italy, Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1363. https://aclanthology.org/P19-1363.

  • Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT (Vol. 1, Long Papers, pp. 1112–1122). Association for Computational Linguistics, http://aclweb.org/anthology/N18-1101.

  • Wittgenstein, L. (1922). Tractatus logico-philosophicus. London: Routledge, 1981.

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. M. (2019). Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771

  • Yang, G., Haque, M., Song, Q., Yang, W., & Liu, X. (2022). TestAug: A framework for augmenting capability-based NLP tests. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 3480–3495), Gyeongju, Republic of Korea. International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.307.

Author information

Corresponding author

Correspondence to Ishan Tarunesh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Benchmarking: detailed observations from template perturbations

We provide a list of interesting templates in Table 7, from which we can glean more fine-grained observations about the underlying systems’ behavior.

T2: All models fail on a simple Boolean template testing “or”.

T13, T14: BERT and DistilBERT are unsure on ordered resolution. RoBERTa predicts accurately when the label is entailment but is unsure when the label is contradiction. The observation remains consistent for chains of length 2, 3, and 4.

T21, T22: We modify T13 and T14 by introducing “not” in the hypothesis and observe a label shift towards contradiction for all models (even RoBERTa, which was accurate on the entailment templates).

T45, T46: Looking at these two templates together, we observe that BERT and DistilBERT are biased towards the entailment label and do not pick up on gendered names. RoBERTa, on the other hand, performs fairly accurately on both templates.

T63: DistilBERT fails on this very basic comparative template in which the arguments are swapped.

T68, T71: The information in the premise is not sufficient to decide the hypothesis. All models struggle with such templates.

T76, T77: These templates require comparative and syntactic understanding. RoBERTa performs accurately on both, whereas BERT and DistilBERT are unsure. On further analysis, we observe that BERT has a lexical bias for the placeholder ADJ.

T80, T81: All models fail on templates involving 2D directions.

T88, T89, T92, T93: All models are unsure or biased on this set of templates. Since RoBERTa is able to perform numerical comparison, we would expect it to compare years, but that is not observed. An interesting observation is that BERT is sensitive to the lexical substitution of “before” (or “after”) with “earlier than” (or “later than”).

T98, T99: RoBERTa is able to correctly reason about the relative ordering of events, whereas BERT and DistilBERT are unsure.

T116, T117: Compared to BERT and DistilBERT, RoBERTa is better at understanding causal-verb pairs.

T122: A very simple quantifier template on which both BERT and DistilBERT fail.

T171, T172: These two are exactly the same template, with different labels depending on whether “logical” or “implicative” reasoning is applied. BERT and DistilBERT are unsure, whereas RoBERTa is “logical”.

T128, T127: Another pair with exactly the same template, but this time all three models predict contradiction, which corresponds to “implicative” reasoning (Table 8).

Table 7 We show some interesting templates, and model accuracies on the templates
Table 8 We show some interesting templates, and model accuracies on the templates

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tarunesh, I., Aditya, S. & Choudhury, M. LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09691-y
