一种基于状态预测的多线程数据过滤算法-AET-电子技术应用

一种基于状态预测的多线程数据过滤算法

电子技术应用

杨嘉佳，李正，郑儿，姚旺君，赵静，关健

中国电子信息产业集团有限公司第六研究所

摘要： 数据过滤算法在大数据处理领域有着重要的作用。基于正则表达式匹配技术的数据过滤算法凭借强大的特征表达能力适合于处理大规模复杂数据。然而，传统的正则表达式匹配过程为串行匹配，造成性能低，无法满足现代数据处理的需求。针对传统正则表达式匹配性能低的问题，提出一种基于多线程和状态预测的正则表达式加速匹配算法，称之为μFA：基于向量指令执行字符值比较，获取可直接跳过的信任字符数。同时，基于多线程加速和状态猜测技术，实现字符串的分段匹配处理，通过圈定字符危险区域，研判各分段最终匹配结果的正确性。实验结果表明，μFA算法的吞吐率是原始DFA算法的10.12~91.36倍、ßFA算法的1.08~2.97倍。

关键词： 正则表达式匹配状态预测数据过滤

中图分类号：TP391.1 文献标志码：A DOI: 10.16157/j.issn.0258-7998.245321
中文引用格式： 杨嘉佳，李正，郑儿，等. 一种基于状态预测的多线程数据过滤算法[J]. 电子技术应用，2024，50(12)：87-91.
英文引用格式： Yang Jiajia，Li Zheng，Zheng Er，et al. An accelerated regular expression matching algorithm based on multi-threading and state prediction[J]. Application of Electronic Technique，2024，50(12)：87-91.

An accelerated regular expression matching algorithm based on multi-threading and state prediction

Yang Jiajia，Li Zheng，Zheng Er，Yao Wangjun，Zhao Jing，Guan Jian

The Sixth Research Institute of China Electronics Corporation

Abstract： Data filtering algorithms play a crucial role in the field of big data processing. Data filtering algorithms based on regular expression matching technology are suitable for processing large-scale complex data due to their powerful feature expression capabilities. However, the traditional regular expression matching process is serial matching, resulting in low performance that cannot meet the needs of modern data processing. To address the issue of low performance in traditional regular expression matching, an accelerated regular expression matching algorithm based on multithreading and state prediction is proposed, named μFA. This algorithm performs character value comparison based on vector instructions to obtain the number of trusted characters that can be skipped directly. Simultaneously, it utilizes multithreading acceleration and state prediction techniques to achieve segmented matching processing of strings. By delimiting dangerous character regions, it determines the correctness of the final matching results for each segment. Experimental results show that the throughput is 10.12 to 91.36 times higher than the original DFA algorithm and 1.08 to 2.97 times higher than the ßFA algorithm.

Key words : regular expression matching；state prediction；data filtering

引言

在人工智能时代[1]，正则表达式匹配技术有助于数据的预处理过滤，可为业务应用提供更高质量的数据。例如，正则表达式规则由于其展现出强大的表征能力，可从大规模数据中过滤出复杂且符合深度学习模型要求的数据，提升模型的推理精度。

数据预处理吞吐率是衡量过滤算法的重要性能因素之一，反映出在特定环境下算法可以运行的性能极限，决定其是否适用于高性能大数据预处理领域。因此，本文重点研究如何提高基于正则表达式匹配的数据过滤性能。

当前，已涌现出许多优秀的基于正则表达式技术的数据过滤算法[2]，包括基于非确定型有限自动机(Nondeterministic Finite Automata, NFA)、基于确定型有限自动机(Deterministic Finite Automata, DFA)和基于混合自动机(Hybrid Finite Automata, Hybrid-FA)等实现方式。其中，因DFA的数据过滤性能较为稳定，备受研究人员和开发人员的青睐。

然而，现有的正则表达式过滤算法性能较低，无法满足大数据背景下的高性能过滤需求。因此，本文提出一种基于状态预测的多线程数据过滤算法：通过向量指令字符值比较、多线程加速、状态猜测等技术，实现字符串的分段匹配处理，从而提高算法的吞吐率。

本文详细内容请下载：

https://www.chinaaet.com/resource/share/2000006254

作者信息：

杨嘉佳，李正，郑儿，姚旺君，赵静，关健

（中国电子信息产业集团有限公司第六研究所，北京 100083）

Magazine.Subscription.jpg

原创声明：此内容为AET网站原创，未经授权禁止转载。

相关内容