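Research on Risk URL Detection Based on Boosting Ensemble Learning

Cyber Security and Data Governance (网络安全与数据治理)

Feng Meiqi1,2, Li Yun1,2, Jiang Bing1,2, Wang Lisong1,2, Liu Chunbo3, Chen Wei1,2

1. Operation Center, TravelSky Technology Limited; 2. IT Infrastructure Localization Adaptation Engineering Technology Research Center, TravelSky Technology Limited; 3. Information Security Evaluation Center, Civil Aviation University of China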

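Abstract: As the Internet continues to develop and the number of websites keeps growing, URLs, as the sole entry point to websites, have become a primary target of Web attacks. Traditional URL detection mainly targets malicious URLs and relies on signature (feature-value) matching and blacklists/whitelists; it is prone to missed detections and handles complex URLs poorly. To address these problems, a hybrid model for detecting risk URLs in business access is proposed, based on the Boosting idea in ensemble learning. The model first treats each URL as a string and applies natural language processing techniques to tokenize and vectorize it. It then follows a stepwise modeling approach: a binary classification model built with the GBDT algorithm determines whether a URL is risky, and the original strings of the risky URLs are then fed into a multi-class model, in which the XGBoost algorithm identifies the specific risk type of each risky URL as a reference for security analysts. Parameters were tuned continuously during model construction, and the binary and multi-class models were evaluated with AUC and F1, respectively. The binary model achieved an AUC of 98.91% and the multi-class model an F1 score of 0.993, indicating good performance. When applied in a real environment and compared with existing detection measures, the model achieved a higher detection rate than the existing WAF and APT security devices, and its results compensated for the missed detections of those measures.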

Keywords: Web attacks; ensemble learning; regularization; stepwise modeling method

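CLC number: TP393    Document code: A    DOI: 10.19358/j.issn.2097-1788.2024.07.006
Citation format: Feng Meiqi, Li Yun, Jiang Bing, et al. Research on risk URL detection based on Boosting ensemble learning [J]. Cyber Security and Data Governance (网络安全与数据治理), 2024, 43(7): 32-40.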

Research on risk URL detection based on Boosting ensemble learning

Feng Meiqi1,2, Li Yun1,2, Jiang Bing1,2, Wang Lisong1,2, Liu Chunbo3, Chen Wei1,2

1. Operation Center, TravelSky Technology Limited; 2. IT Infrastructure Localization Adaptation Engineering Technology Research Center, TravelSky Technology Limited; 3. Information Security Evaluation Center, Civil Aviation University of China

Abstract: With the continuous development of the Internet and the growing number of websites, URLs, as the sole entry point to websites, have become a primary target of Web attacks. Traditional URL detection methods mainly target malicious URLs and are based on feature values and blacklists/whitelists; they are prone to missed detections (false negatives) and lack detection capability for complex URLs. To address these problems, a hybrid model for risk URL detection in business access is proposed based on the Boosting concept in ensemble learning. In the first stage of this model, the URL is treated as a string, and natural language processing techniques are used to tokenize and vectorize it. A stepwise modeling approach is then adopted. First, the GBDT algorithm is used to construct a binary classification model that determines whether a URL is risky. Then, the original string of each risky URL is input into a multi-class classification model, and the XGBoost algorithm is used to classify it, identifying the specific risk type of the risky URL and providing a reference for security analysts. During model construction, parameters were tuned continuously, and the AUC and F1 values were used to evaluate the binary classification model and the multi-class classification model, respectively. The evaluation results show that the AUC of the binary classification model is 98.91% and the F1 value of the multi-class classification model is 0.993, indicating good performance. When the model was applied in a practical environment and compared with existing detection methods, its detection rate was higher than that of the existing WAF and APT security devices, and its detection results made up for the missed detections of the existing methods.

Keywords: Web attacks; ensemble learning; regularization; stepwise modeling method
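The abstract's first step, treating each URL as a string and then tokenizing and vectorizing it, can be illustrated with a minimal sketch. The abstract does not specify the tokenizer or vectorizer used, so the regular expression, the percent-decoding step, and the count-based vocabulary below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List
from urllib.parse import unquote

# Assumption: a token is either a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^A-Za-z0-9_\s]")


def tokenize_url(url: str) -> List[str]:
    """Split a URL string into lowercase word tokens and single-symbol tokens."""
    decoded = unquote(url)  # undo percent-encoding, e.g. %3C becomes <
    return [tok.lower() for tok in TOKEN_RE.findall(decoded)]


def build_vocabulary(urls: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for url in urls:
        for tok in tokenize_url(url):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(url: str, vocab: Dict[str, int]) -> List[int]:
    """Turn a URL into a token-count vector; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize_url(url)).items():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] = count
    return vec


if __name__ == "__main__":
    training_urls = ["/a?b=1", "/a?c=1"]  # hypothetical sample data
    vocab = build_vocabulary(training_urls)
    print(tokenize_url("/search?q=%3Cscript%3E"))
    print(vectorize("/a?c=1&c=2", vocab))
```

Count vectors of this kind could then be passed to any tree-based classifier; a TF-IDF weighting or character n-grams would be equally plausible choices under the same description.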

Introduction

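With the rapid development of the Internet, online shopping, travel services, system tools, and daily-life services have brought great convenience. According to CNNIC data, as of December 2023 China had 1.092 billion Internet users and an Internet penetration rate of 77.5%. At the same time, major vulnerabilities such as Log4j have confirmed the serious harm that Web applications can bring. As the sole entry point to websites, URLs have become a primary target of Web attacks, and detecting risk URLs in massive volumes of business access has become a key research direction.

Current research on URL detection concentrates on malicious URLs, that is, URLs that steal users' private information and property by serving as carriers of phishing pages, XSS attacks, and other means, and thereby pose serious network security threats [1]. Detection methods mainly include signature (feature-value) detection and blacklist/whitelist filtering. Their weakness is that a signature or URL absent from the preset lists results in a missed detection, and such methods cannot detect new URLs in real time. Heuristic techniques were proposed to address the detection of new URLs, but they apply only to a limited number of common threats [1]. As URL attacks grow in complexity and capability, traditional detection methods can no longer meet protection needs; their coverage is also narrow, and they cannot identify risky URLs within massive business access, so new application scenarios and detection methods need to be explored.

In the 1980s, with the success of artificial neural networks, machine learning received increasing attention. Because it enables computers to learn, adapt, infer patterns, and communicate with one another without explicit programming instructions [2], it has gradually been applied to network security. Compared with traditional detection methods, machine learning models offer higher detection efficiency and stronger generalization.

Existing URL research focuses on malicious URLs rather than on business-related risk URLs. Studies of malicious URLs fall into three categories. The first uses a single machine learning algorithm, such as BP neural networks [3], convolutional neural networks [4], or association rules [5]. The second combines multiple machine learning algorithms, such as bidirectional long short-term memory networks [6] combined with capsule networks, or bidirectional long short-term memory networks combined with convolutional neural networks [7-9], while introducing an attention mechanism to increase the weight of key features. This category also includes ensemble learning [1]: one approach uses ridge classification, support vector machines, and naive Bayes as primary learners and logistic regression as the secondary learner, detecting URLs with a two-layer structure that combines the two levels [10]; another uses a detection model combining CNN and XGBoost, in which the CNN extracts features automatically and XGBoost performs classification [11]. The third category combines machine learning with other means, such as threat intelligence [12], expert knowledge [13], and character embedding encoding [14].

This paper studies the detection of risk URL requests that business systems receive from the Internet. Adopting the stepwise modeling approach and the idea of ensemble learning, it divides the risk URL detection model into two sub-models: risk URL detection and risk URL type classification. The GBDT algorithm first determines whether a URL accessed by the business is risky; for risky URLs, the XGBoost algorithm then determines the specific risk type. Alerts are also generated for security operations personnel to confirm and handle, which to some extent compensates for the missed detections of existing signature-based detection methods.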
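The two-step flow described above (a binary risk check followed by type classification of only the risky URLs) can be sketched as a small cascade. This is not the paper's implementation: the trained GBDT and XGBoost models are replaced here by toy keyword rules so that the control flow is runnable on its own, and the risk labels (`xss`, `sql_injection`, `path_traversal`, `other`) as well as all function names are hypothetical placeholders.

```python
from typing import Callable, Iterable, List, Optional, Tuple

BinaryModel = Callable[[str], bool]  # stage 1: is this URL risky?
MultiModel = Callable[[str], str]    # stage 2: which risk type is it?


def detect(url: str, is_risky: BinaryModel,
           risk_type: MultiModel) -> Tuple[bool, Optional[str]]:
    """Run stage 1, and run stage 2 only when stage 1 flags the URL."""
    if not is_risky(url):
        return (False, None)
    return (True, risk_type(url))


def triage(urls: Iterable[str], is_risky: BinaryModel,
           risk_type: MultiModel) -> List[Tuple[str, str]]:
    """Collect (url, risk type) alerts for the URLs flagged as risky."""
    alerts: List[Tuple[str, str]] = []
    for url in urls:
        risky, label = detect(url, is_risky, risk_type)
        if risky and label is not None:
            alerts.append((url, label))
    return alerts


# Toy stand-ins (assumption): keyword rules in place of trained models.
def toy_is_risky(url: str) -> bool:
    lowered = url.lower()
    return any(m in lowered for m in ("<script", "union select", "../"))


def toy_risk_type(url: str) -> str:
    lowered = url.lower()
    if "<script" in lowered:
        return "xss"
    if "union select" in lowered:
        return "sql_injection"
    if "../" in lowered:
        return "path_traversal"
    return "other"


if __name__ == "__main__":
    sample = ["/index.html", "/q?x=<script>alert(1)</script>"]
    for alert in triage(sample, toy_is_risky, toy_risk_type):
        print(alert)
```

Because both stages are passed in as callables, the toy rules could be swapped for the `predict` methods of trained classifiers without changing the cascade itself; the multi-class stage never sees URLs that the binary stage considers benign.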

The full text of this article can be downloaded at:

http://www.chinaaet.com/resource/share/2000006089

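Author information:

Feng Meiqi1,2, Li Yun1,2, Jiang Bing1,2, Wang Lisong1,2, Liu Chunbo3, Chen Wei1,2

(1. Operation Center, TravelSky Technology Limited, Beijing 101318;

2. IT Infrastructure Localization Adaptation Engineering Technology Research Center, TravelSky Technology Limited, Beijing 101318;

3. Information Security Evaluation Center, Civil Aviation University of China, Tianjin 300300)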
