ßFA：一种基于向量指令集的高性能数据处理算法-AET-电子技术应用

ßFA：一种基于向量指令集的高性能数据处理算法

电子技术应用

杨嘉佳，关健，李正，于增明，姚旺君

中国电子信息产业集团有限公司第六研究所

摘要： 正则表达式匹配技术在数据清洗、解析提取等数据处理任务方面发挥重大作用。然而，由于匹配过程中存在数据强依赖关系和内存访问不可预测等问题，造成匹配性能较低。针对此问题，提出一种基于向量指令集的高性能正则表达式数据处理算法，称之为ßFA：通过向量指令一次性从内存读出若干连续字符，并与最常被访问状态对应的非信任字符集进行向量匹配，利用内置函数定位首个非信任字符的位置，获得可直接跳过的字符数，从而实现匹配性能的加速。实验结果表明，ßFA算法的吞吐率优于原始DFA算法和αFA算法，是原始DFA算法的4.67~60倍以及ɑFA算法的4.37~7.82倍。

关键词： 正则表达式匹配向量指令集高性能数据处理

中图分类号：TP391.1 文献标志码：A DOI: 10.16157/j.issn.0258-7998.245114
中文引用格式： 杨嘉佳，关健，李正，等. ßFA:一种基于向量指令集的高性能数据处理算法[J]. 电子技术应用，2024，50(11)：85-88.
英文引用格式： Yang Jiajia，Guan Jian，Li Zheng，et al. ßFA: a high-performance data processing algorithm based on vector instruction set[J]. Application of Electronic Technique，2024，50(11)：85-88.

ßFA： a high-performance data processing algorithm based on vector instruction set

Yang Jiajia，Guan Jian，Li Zheng，Yu Zengming，Yao Wangjun

The Sixth Research Institute of China Electronics Corporation

Abstract： Regular expression matching technology plays a significant role in data processing tasks such as data cleaning, parsing, and extraction. However, due to issues such as strong data dependency and unpredictable memory access in the matching process, the matching performance is relatively low. In response to this problem, this paper proposes a high-performance regular expression data processing algorithm based on vector instruction set, which is called ßFA. By using vector instructions to read a sequence of consecutive characters at once, and performing vector matching with the non-trusted character set corresponding to the most frequently accessed state, built-in functions can be utilized to find the position of the first non-trusted character, thus obtaining the number of characters that can be skipped directly, thereby accelerating the matching performance. Experimental results show that the throughput of the ßFA algorithm is superior to the original DFA algorithm and the αFA algorithm, being 4.67~60 times faster than the original DFA algorithm and 4.37~7.82 times faster than the αFA algorithm.

Key words : regular expression matching；vector instruction set；high-performance data processing

引言

数据处理能力是大数据时代的核心要素之一，决定了真实数据环境下是否满足数据线速处理的要求。正则表达式匹配技术可作为数据清洗、提取解析和数据检测等数据处理任务的有效解决手段之一。例如，基于Linux系统的Awk、Vim、Perl工具以及开源网络入侵检测系统Bro IDS[1]等都使用了正则表达式的匹配功能。

正则表达式匹配的有效手段通常分为确定型有限自动机(Deterministic Finite Automata, DFA)和基于非确定型有限自动机(Nondeterministic Finite Automata, NFA)[2]。两者各有其特点，NFA空间复杂性较低，但因为一次字符输入可能会引发数目不定的多个状态转移，造成匹配时间复杂性较大。相反，原始DFA的时间复杂性低且为O(1)，但存在空间开销大的问题。

在大数据处理背景下，正则表达式的匹配性能是最重要的衡量因素，因此DFA成为解决匹配性能方案的首选。针对DFA空间开销大的问题，现已存在很多优秀的研究成果[3]。然而，DFA匹配过程中存在数据强依赖关系，造成其不能很好地适用于高性能数据处理环境。

因此，针对DFA匹配性能较低的问题，本文利用Intel的向量指令集对DFA匹配进行加速。通过一次性读入若干连续字符，然后并行判断其是否属于最常被访问状态的非信任字符集，获取无需访问内存状态转移表即可直接跳过的字符数，从而减少匹配时间的消耗以达到性能加速目的。

本文详细内容请下载：

https://www.chinaaet.com/resource/share/2000006215

作者信息：

杨嘉佳，关健，李正，于增明，姚旺君

（中国电子信息产业集团有限公司第六研究所，北京 100083）

Magazine.Subscription.jpg

原创声明：此内容为AET网站原创，未经授权禁止转载。

相关内容