Fuzzing-based grammar learning from a minimal set of seed inputs, Journal of Computer Languages



Fuzzing-based grammar learning from a minimal set of seed inputs
Journal of Computer Languages (IF 2.2), Pub Date: 2023-11-19, DOI: 10.1016/j.cola.2023.101252
Hannes Sochor, Flavio Ferrarotti, Daniela Kaufmann

To be effective, a fuzzer needs to generate well-formed inputs, so that they are not rejected outright by the Software Under Test (SUT) and can thus expose meaningful bugs. Grammar-based fuzzers solve this problem, but they require a grammar of the input language accepted by the SUT, and such a grammar is often unknown. Therefore, different black-box and white-box algorithms have been proposed for learning grammars from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted, well-formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descent parsers with consistently high levels of both recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as JavaScript Object Notation (JSON). Unlike other state-of-the-art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the parsing program as input. The core of our method combines fuzzing techniques with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we show that our algorithm works well in practice and that, under reasonable assumptions, its worst-case time and space complexity is polynomial (with low exponents).
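As a rough illustration of the terms used in the abstract (grammar-based generation, membership queries, precision), the following sketch may help. It is not the authors' algorithm. The toy grammar, the stand-in parser, and the names `GRAMMAR`, `generate`, `accepts`, and `estimate_precision` are all assumptions made for this example. Precision is estimated here as the fraction of grammar-generated inputs that the parser accepts, which is one common way to define it.

```python
import random

# Hypothetical toy grammar (not from the paper): nested lists of single digits.
GRAMMAR = {
    "<value>": [["<digit>"], ["[", "]"], ["[", "<items>", "]"]],
    "<items>": [["<value>"], ["<value>", ",", "<items>"]],
    "<digit>": [[d] for d in "0123456789"],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a string by picking productions at random."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit it unchanged
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest production so expansion ends.
        options = [min(options, key=len)]
    production = rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in production)


def accepts(text):
    """Stand-in SUT: a recursive descent parser answering a membership query."""

    def value(i):
        # Returns the index just past one parsed value, or None on failure.
        if i < len(text) and text[i] in "0123456789":
            return i + 1
        if i < len(text) and text[i] == "[":
            i += 1
            if i < len(text) and text[i] == "]":
                return i + 1  # empty list
            i = value(i)
            while i is not None and i < len(text) and text[i] == ",":
                i = value(i + 1)
            if i is not None and i < len(text) and text[i] == "]":
                return i + 1
        return None

    return value(0) == len(text)


def estimate_precision(samples=200, seed=0):
    """Fraction of grammar-generated inputs that the stand-in SUT accepts."""
    rng = random.Random(seed)
    inputs = [generate("<value>", rng) for _ in range(samples)]
    return sum(accepts(s) for s in inputs) / samples
```

Because the toy grammar here describes exactly the language the stand-in parser accepts, every generated input passes the membership query. A learned grammar that over-generalizes would produce some rejected inputs and therefore a lower estimate.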


Updated: 2023-11-23
