ɑFA：一种基于非信任字符比较的高性能正则表达式匹配算法-AET-电子技术应用

ɑFA：一种基于非信任字符比较的高性能正则表达式匹配算法

电子技术应用

杨嘉佳，关健，于增明，张雷，姚旺君

中国电子信息产业集团有限公司第六研究所

摘要： 正则表达式匹配技术在数据治理、解析提取和深度包检测方面有着重大应用价值。然而，由于其在通用平台上的匹配性能较低，无法满足实际环境下数据实时处理的应用需求，限制了其在高性能数据处理领域的应用范围。针对当前正则表达式匹配性能较低的问题，提出一种基于非信任字符比较的高性能正则表达式匹配算法，称之为ɑFA。该算法通过每次判断连续的若干个字符是否属于最常被访问状态的非信任字符集，获取无需通过DFA匹配可直接跳过的字符数，减少字符匹配过程中访问内存DFA状态转移表的次数，从而实现字符匹配的加速处理。实验结果表明，ɑFA算法可获得相比于原始DFA匹配算法约为1.05~7.58倍的性能加速比。

关键词： 正则表达式匹配确定型有限自动机高性能数据处理

中图分类号：TP391.1 文献标志码：A DOI: 10.16157/j.issn.0258-7998.244911
中文引用格式： 杨嘉佳，关健，于增明，等. ɑFA:一种基于非信任字符比较的高性能正则表达式匹配算法[J]. 电子技术应用，2024，50(6)：57-60.
英文引用格式： Yang Jiajia，Guan Jian，Yu Zengming，et al. ɑFA: a high-performance regular expression matching algorithm based on untrusted character comparison[J]. Application of Electronic Technique，2024，50(6)：57-60.

ɑFA： a high-performance regular expression matching algorithm based on untrusted character comparison

Yang Jiajia，Guan Jian，Yu Zengming，Zhang Lei，Yao Wangjun

The Sixth Research Institute of China Electronics Corporation

Abstract： Regular expression matching technology has significant application value in data governance, parsing extraction, and deep packet inspection. However, due to its low matching performance on general-purpose platforms, it cannot meet the application requirements of real-time data processing in practical environments, which limits its application scope in the field of high-performance data processing. In response to the current issue, a high-performance regular expression matching algorithm based on untrusted character comparison is proposed, which is called ɑFA. This algorithm determines whether a sequence of consecutive characters belongs to the untrusted character set of the most frequently accessed state. By doing so, it acquires the number of characters that can be skipped directly without DFA matching, reduces the number of accesses to the DFA state transition table in memory during character matching, and thus achieves accelerated processing of character matching. The experimental results indicate that the ɑFA algorithm can achieve a performance acceleration of approximately 1.05 times to 7.58 times compared to the original DFA matching algorithm.

Key words : regular expression matching；deterministic finite automaton；high-performance data processing

引言

正则表达式因拥有强大的表达能力与灵活性，在数据治理、解析提取和深度包检测方面得到了广泛应用。比如著名的搜索工具grep、sed以及入侵检测系统Snort^[1]都包含了很多正则表达式规则。

正则表达式匹配方法通常分为基于确定型有限自动机(Deterministic Finite Automata, DFA)和基于非确定型有限自动机(Nondeterministic Finite Automata, NFA)^[2]。两者的区别在于NFA的空间需求较少，但匹配性能较低；DFA则相反，匹配性能较高，但空间需求大。

在真实数据处理环境背景下，正则表达式的匹配性能是最重要的衡量因素之一。以状态转移次数计算，DFA匹配单个字符时发生一次状态转移，转移次数固定，性能较高且较为稳定；相反，NFA匹配单个字符时可能会引发若干次状态转移，转移次数较多，性能较低且稳定性较差。因此，现有的高性能匹配研究工作主要集中于如何提升DFA的匹配性能。

截止目前，各式各样的DFA加速匹配方法已被提出^[3]，包括经典的多步长自动机、多核平台并行匹配加速、基于枚举方法的SIMD加速、基于推测与枚举方法相结合的新型并行化匹配方法等。但是，这些算法在加速匹配过程中需多次访问内存导致较大的时间开销，因而性能还有进一步提升的空间。

因此，本文专注于DFA的匹配性能问题，提出了一种基于非信任字符比较的高性能正则表达式匹配算法。通过每次判断连续的若干个字符是否属于最常被访问状态的非信任字符集，获得无需通过DFA匹配可直接跳过的字符数，从而实现DFA的加速处理。理论分析与实验结果表明，此算法可达到原始DFA性能加速比的1.05~7.58倍。

本文详细内容请下载：

https://www.chinaaet.com/resource/share/2000006031

作者信息：

杨嘉佳，关健，于增明，张雷，姚旺君

（中国电子信息产业集团有限公司第六研究所，北京 100083）

Magazine.Subscription.jpg

原创声明：此内容为AET网站原创，未经授权禁止转载。

相关内容