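Malicious PDF detection method based on document graph structure

Information Technology and Network Security, Issue 11

Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China)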

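Abstract: Current machine-learning-based methods for detecting malicious PDF documents rely on expert experience to select features, so they cannot fully reflect the attributes of a document. Moreover, detector performance degrades markedly when facing adversarial samples. To address these problems, a malicious PDF document detection method based on the document graph structure and a convolutional neural network (CNN) is proposed. The method parses the document structure and builds a directed graph from the reference relationships among the objects in the document. It then uses the TF-IDF algorithm to compute each node's contribution to classification and simplifies the graph structure accordingly. Finally, it computes the adjacency matrix and degree matrix of the simplified graph, derives the graph's Laplacian matrix, and feeds this as the feature into a CNN classification model for training. Adversarial samples are also added so that the model undergoes adversarial training. Experimental evaluation shows that, with a 9:1 ratio of training to test samples and after repeated tuning of the network structure and parameters, the method reaches an accuracy of 99.71%, outperforming KNN and SVM classification models. In detecting adversarial samples, the method achieves better detection performance than the 67 antivirus engines on the well-known online scanning site VirusTotal.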

Keywords: malicious PDF document; document graph structure; convolutional neural network; adversarial sample

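CLC number: TP309
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.11.003
Citation: Yu Yuanzhe, Wang Jinshuang, Zou Xia. Malicious PDF detection method based on document graph structure[J]. Information Technology and Network Security, 2021, 40(11): 16-23.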

Malicious PDF detection method based on document graph structure

Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)

Abstract: Malicious PDF detection methods based on machine learning rely on expert knowledge to select features, and therefore cannot fully reflect document attributes. Moreover, the performance of these detectors is easily degraded by adversarial samples. To overcome these limitations, a malicious PDF detection method based on the PDF document graph structure and a Convolutional Neural Network (CNN) is proposed. First, a directed graph is constructed according to the document structure and the reference relationships between document objects. Second, the contribution of each node is calculated using the TF-IDF algorithm, and the graph structure is simplified accordingly. Third, the adjacency and degree matrices of the simplified graph are calculated and the Laplacian matrix of the graph is obtained, which is used as the feature and fed to the CNN classification model for training. Adversarial samples are also added to train the model. The evaluation shows that the method achieves an accuracy of 99.71%, which is better than the KNN and SVM classification models. Compared with the 67 antivirus engines on VirusTotal, it achieves higher detection performance on adversarial samples.

Keywords: malicious PDF document; document graph structure; CNN; adversarial sample

0 Introduction

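PDF (Portable Document Format) documents are very widely used. As the format has evolved across versions, the features PDF documents support have become increasingly varied, but some little-known features (such as file embedding, JavaScript code execution, and dynamic forms) are increasingly exploited by criminals to carry out malicious network attacks[1]. APT (Advanced Persistent Threat) attacks[2] often craft cleverly disguised malicious PDF documents and lure victims into downloading them through means such as phishing emails, thereby intruding into or damaging computer systems. Compared with traditional malicious executables, malicious documents are more deceptive.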

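Machine-learning-based detection methods are widely used by researchers and can be broadly divided into static detection, dynamic detection, and hybrid static-dynamic detection[3]. However, most existing feature selection methods for malicious documents are driven by expert knowledge: feature sets are chosen from observations made during manual analysis of malicious documents (such as the number of objects of certain types that are invoked, the page count, or the version number), or features are refined through statistical analysis (such as the proportion of a given type of object among all objects). Because the range of candidate features is very large, choosing only a subset of them based on experience loses part of the document's information and cannot fully express the document's characteristics.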

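Because of the complexity of the PDF format, the logical structure of a document carries a large amount of document semantics. Reference [4] argues that a comprehensive analysis of structural properties can explain the significant structural differences between malicious and benign PDF documents. This paper therefore analyzes the logical structure of the document as a whole and uses the document's structure graph, rather than individual structural paths, as the feature for detection. Even if an attacker knows which objects are key to successful detection and modifies a particular path accordingly, doing so would disrupt the overall structure of the document, so the cost of evading detection is high.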
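As a rough illustration of the feature construction summarized in the abstract (object reference graph, then adjacency and degree matrices, then the Laplacian), the following minimal Python sketch may help. It is not the authors' implementation. The toy object table `TOY_OBJECTS`, the regular expression, and the function names `build_edges` and `laplacian` are hypothetical, and the use of out-degree for the degree matrix is an assumption, since this excerpt does not say which degree convention the paper applies to the directed graph. The TF-IDF pruning step and the CNN classifier are left out.

```python
import re

# Hypothetical toy input: object number -> raw object body, mimicking PDF
# indirect references of the form "N G R" (for example "2 0 R").
TOY_OBJECTS = {
    1: "<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>",
    2: "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: "<< /Type /Page /Parent 2 0 R >>",
    4: "<< /S /JavaScript /JS 5 0 R >>",
    5: "<< /Length 0 >>",
}

# Matches "<object number> <generation number> R".
REF_PATTERN = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def build_edges(objects):
    """Return the set of directed edges (src, dst), one per reference found."""
    edges = set()
    for src, body in objects.items():
        for match in REF_PATTERN.finditer(body):
            dst = int(match.group(1))
            if dst in objects:  # ignore references to objects we do not have
                edges.add((src, dst))
    return edges


def laplacian(objects, edges):
    """Return (adjacency, degree, laplacian) as nested lists, with L = D - A."""
    order = sorted(objects)
    index = {obj: i for i, obj in enumerate(order)}
    n = len(order)

    adjacency = [[0] * n for _ in range(n)]
    for src, dst in edges:
        adjacency[index[src]][index[dst]] = 1

    # Assumption: the degree matrix holds each node's out-degree.
    degree = [[0] * n for _ in range(n)]
    for i in range(n):
        degree[i][i] = sum(adjacency[i])

    lap = [[degree[i][j] - adjacency[i][j] for j in range(n)] for i in range(n)]
    return adjacency, degree, lap


if __name__ == "__main__":
    edges = build_edges(TOY_OBJECTS)
    _, _, lap = laplacian(TOY_OBJECTS, edges)
    for row in lap:
        print(row)
```

With the out-degree convention every row of the resulting matrix sums to zero. A real pipeline would need a proper PDF parser in place of the regular expression, plus a fixed matrix size (for example by padding or pruning nodes) before the matrix could be fed to a CNN.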

The full text of this article can be downloaded at: http://www.chinaaet.com/resource/share/2000003843
