Research Progress and Challenges of Malware Detection Based on Machine Learning - AET (电子技术应用)

Information Technology and Network Security, 2020, No. 11

Jing Hongli1, Huang Na1,2, Li Jianguo1

1. Beijing Topsec Science & Technology Inc., Beijing 100085, China; 2. Beijing University of Technology, Beijing 100124, China

Abstract: As the volume of malware grows and attack techniques keep evolving, combining malware detection with machine learning has become a new direction for the field. This paper first briefly introduces static and dynamic malware detection methods, summarizes the general workflow of machine-learning-based malware detection, and reviews research progress. Using the Ember 2017 and Ember 2018 datasets, it analyzes and validates methods based on structured features, including Random Forest (RF), LightGBM, Support Vector Machine (SVM), K-means, and Convolutional Neural Network (CNN) models. Using a sample set collected in 2019, it analyzes and validates methods based on sequential features, including several common deep learning models. The accuracy, precision, recall, and F1-score of the trained models on different test sets serve as the evaluation metrics. Based on the experimental results, the paper discusses the advantages and disadvantages of each class of methods, with particular emphasis on verifying the generalization ability of tree models. The results show that models generally degrade as samples continue to evolve, and directions for further research are pointed out.

Key words: malware detection; static detection; machine learning; LightGBM; random forest

CLC number: TP391
Document code: A
DOI: 10.19358/j.issn.2096-5133.2020.11.006
Citation format: Jing Hongli, Huang Na, Li Jianguo. Research progress and challenges of malware detection method based on machine learning[J]. Information Technology and Network Security, 2020, 39(11): 38-44, 68.

0 Introduction

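Malware is an unavoidable security risk in computing and networking, and one of the main research hotspots for security researchers. Users' private data, personal information, and property are all targets of malware attacks^[1]. Certain characteristics of malware itself make detection possible, and security researchers have proposed many detection and analysis methods to contain and counter its spread. The rapid development of computer technology has brought convenience to daily life and work, but it has also driven continual improvement in attackers' methods and techniques. Malware has become more diverse, spreads quickly through channels such as wireless networks, local area networks, and removable devices, and grows in number by the day, so traditional techniques such as signature-database matching are no longer efficient enough^[2]. Researchers have therefore increasingly turned to machine learning to cope with unpredictable malware variants and ever-growing sample volumes^[3].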

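Many machine learning techniques and frameworks have been proposed and applied to malware detection, with considerable success. According to a 2016 survey by SGANDURRA D et al.^[4], static detection methods using machine learning achieved accuracy above 90%, and dynamic detection methods reached accuracy above 96%. With continued development in recent years, the performance of such methods has improved further. Building intelligent detection models on machine learning to form a line of defense that blocks malware is a new direction for both technical breakthroughs and market expansion, and has significant research and practical value.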

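This paper summarizes the general workflow of machine-learning-based malware detection methods and reviews existing research results. It experimentally validates methods based on structured features and methods based on sequential features, uses the results to discuss the applicable scenarios of each class of methods and the challenges they face, and finally points out directions for further research.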
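The abstract names accuracy, precision, recall, and F1-score as the evaluation metrics. As a rough illustration of how these four quantities relate, the following minimal Python sketch computes them from binary labels. It is not the authors' evaluation code: the function name `binary_metrics`, the `Metrics` container, the convention that label 1 means "malicious", and the toy labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    `positive` marks the malicious class (assumed to be 1 in this sketch).
    Ratios with a zero denominator are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Toy labels invented for illustration: 1 = malicious, 0 = benign.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    print(binary_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters in this setting because malicious and benign samples are often imbalanced, so accuracy alone can hide a high miss rate or a high false-alarm rate.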

The full text of this paper can be downloaded at: http://www.chinaaet.com/resource/share/2000003173
