一种基于指令流水线的数据匹配算法-AET-电子技术应用

一种基于指令流水线的数据匹配算法

电子技术应用

杨嘉佳，李正，郑儿，赵静，燕玮，刘金

中国电子信息产业集团有限公司第六研究所

摘要： 基于正则表达式的数据匹配技术在基础数据治理和清洗方面有着重要的应用价值。然而，在高性能计算领域的数据处理过程中因算法匹配吞吐率低，无法满足大数据处理环境下对算法的高性能要求，造成其应用范围受限。针对此现象，提出一种基于指令流水线的数据匹配算法，称之为γFA：利用Intel架构内置的向量指令流水式读入若干字符段，通过大宽度向量比较函数进行字符段与非信任字符集的流水比值处理并转换成整型向量，通过位置定位函数累加定位出所有整型向量的首个非信任字符位置，计算出可略过的总字符数，减少正则表达式匹配引擎因处理非信任字符集导致访问低速内存而带来巨大的时间开销，实现正则表达式匹配算法的性能提升。实验结果表明，γFA算法的吞吐率是原始DFA算法的15.88~53.06倍，相比于ßFA算法，吞吐率提升了35.12%~63.26%，取得较好的性能加速效果。此外，通过对γFA算法进行优化后，性能可接近100 Gb/s，为原始DFA匹配算法性能的15.88~64.94倍，相比于γFA算法性能提升了2.15%~43.09%。

关键词： 正则表达式匹配指令流水高性能数据匹配

中图分类号：TP391.1 文献标志码：A DOI: 10.16157/j.issn.0258-7998.245345
中文引用格式： 杨嘉佳，李正，郑儿，等. 一种基于指令流水线的数据匹配算法[J]. 电子技术应用，2025，51(2)：81-85.
英文引用格式： Yang Jiajia，Li Zheng，Zheng Er，et al. A data matching algorithm based on instruction pipeline[J]. Application of Electronic Technique，2025，51(2)：81-85.

A data matching algorithm based on instruction pipeline

Yang Jiajia，Li Zheng，Zheng Er，Zhao Jing，Yan Wei，Liu Jin

The Sixth Research Institute of China Electronics Corporation

Abstract： The data matching technology based on regular expressions has significant application value in basic data governance and cleaning. However, in the data processing process of high-performance computing, the low performance of algorithm matching cannot meet the high-performance requirements of algorithms in the big data processing environment, resulting in limited application scope. To address this issue, a high-performance data matching algorithm based on instruction pipelining is proposed, known as γFA. It utilizes the vector instruction pipelining built into the Intel architecture to read in multiple character segments, performs pipeline ratio processing of the character segments with untrusted character sets through a wide-width vector comparison function, and converts them into integer vectors. The position location function is then used to accumulate and locate the first untrusted character position in the integer vector, calculate the number of characters that can be skipped, and reduce the significant time overhead caused by the regular expression matching engine accessing slow memory when processing untrusted character sets. This achieves performance acceleration for the regular expression matching algorithm. Experimental results show that the γFA algorithm achieves a throughput rate that is 15.88 to 53.06 times higher than the original DFA algorithm. Compared to the ßFA algorithm, the throughput rate is improved by 35.12% to 63.26%, achieving a better performance acceleration effect. Furthermore, after optimizing the γFA algorithm, a performance close to 100 Gb/s can be achieved, which is 15.88 to 64.94 times better than the performance of the original DFA matching algorithm. This represents an improvement of 2.15% to 43.09% compared to the γFA algorithm.

Key words : regular expression matching；instruction pipeline；high-performance data matching

引言

数据匹配技术可应用于数据的清洗和治理，如基于正则表达式的数据匹配技术在基础数据的过滤方面发挥重要作用，通过数据匹配可将无关数据剔除过滤，减少噪声数据的干扰。正则表达式因具有强大的表征能力，适合用于匹配过滤真实环境下的复杂噪声数据。例如，开源入侵检测系统Bro IDS、Snort[1]等都使用了基于正则表达式的数据匹配功能。

基于正则表达式的数据匹配实现方式通常可分成两种：基于非确定型有限自动机(NFA)和确定型有限自动机(DFA)。前者空间复杂度比较低，与正则表达式的长度呈线性关系，但因处理一个字符需激活多个状态，造成匹配时间复杂性较大和匹配性能不稳定。相比而言，DFA的时间复杂性比较低，处理一个字符只需一次激活单个状态，然而却因规则的复杂性易导致状态转移空间膨胀甚至“爆炸”，造成巨大的空间开销。

在大数据匹配环境中，DFA更多地被选择与应用。DFA的匹配性能和空间消耗是基于正则表达式数据匹配技术的重要衡量因素。截至目前，DFA的空间消耗已有很多可行的算法被提出[2]，因而不是本文研究重点。尽管已有若干算法对DFA的匹配性能进行研究，但性能低依旧是制约其广泛应用的瓶颈因素。

针对此问题，本文基于单指令多数据流(Single Instruction Multiple Data)向量指令连续从内存中读入若干字符段，然后分别与最常被访问状态（行）对应的非信任字符集进行字符并行比较操作，通过位置定位函数累加定位出首个非信任字符位置，获取直接略过的总字符数，减少访存次数，提高算法吞吐率。

本文详细内容请下载：

https://www.chinaaet.com/resource/share/2000006330

作者信息：

杨嘉佳，李正，郑儿，赵静，燕玮，刘金

（中国电子信息产业集团有限公司第六研究所，北京 100083）

Magazine.Subscription.jpg

原创声明：此内容为AET网站原创，未经授权禁止转载。

相关内容