
A Method for Deduplicating Duplicate Domain Name Data Based on the Simhash Algorithm

Information Technology and Network Security, Issue 4

Hou Kaimao, Han Qingmin, Wu Yunfeng, Huang Bing, Zhang Jiufa, Chai Chuchu

(The 6th Research Institute of China Electronics Corporation, Beijing 100083, China)

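Abstract: With the development of digital technology, the volume of data that must be transmitted and stored in every field has risen sharply. A large proportion of that data, however, is duplicated, which both raises the cost of using the data and reduces the efficiency of processing it. Domain names are a type of data with a large storage footprint and extremely high processing-speed requirements. To reduce the storage cost of domain name resolution systems and improve transmission efficiency, this paper builds on existing data deduplication techniques by introducing the Simhash algorithm and, drawing on the structural characteristics of domain name data, improving the word segmentation and fingerprint computation steps. The result is a Simhash-based method for deduplicating domain name data. Experimental results show that, compared with traditional data deduplication techniques, the method removes duplicate domain name data more efficiently and has good practical value.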

Keywords: data deduplication; domain name; Simhash; data chunking

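CLC number: TP391
Document code: A
DOI: 10.19358/j.issn.2096-5133.2022.04.011
Citation format: Hou Kaimao, Han Qingmin, Wu Yunfeng, et al. A method for deduplicating duplicate domain name data based on the Simhash algorithm [J]. Information Technology and Network Security, 2022, 41(4): 71-76.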

Method for deleting duplicate domain name data based on Simhash algorithm

Hou Kaimao, Han Qingmin, Wu Yunfeng, Huang Bing, Zhang Jiufa, Chai Chuchu

(The 6th Research Institute of China Electronics Corporation, Beijing 100083, China)

Abstract: With the development of digital science and technology, the amount of data that needs to be transmitted and stored in various fields has risen sharply. However, duplicates make up a large proportion of this data, which not only increases the cost of using the data but also reduces the efficiency of processing it. Domain names are a kind of data with a large storage footprint and extremely high requirements for processing speed. To save storage cost and improve transmission efficiency, this paper proposes a method for deleting duplicate domain name data based on the Simhash algorithm. Unlike traditional data deduplication technology, the method takes the structural characteristics of domain name data into account and introduces the Simhash algorithm to build a deduplication scheme specific to domain names. The experimental results show that, compared with traditional data deduplication technology, the method deletes duplicate domain name data more efficiently and has better practical application value.

Keywords: data deduplication; domain name; Simhash; data block

0 Introduction

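With the development of electronic information technology, every industry now generates large amounts of data. According to the latest forecast from International Data Corporation (IDC), China's data volume will reach 40 ZB by 2023, and the spread of 5G will drive another surge in data growth [1]. Studies have found that more than 60% of this data is redundant duplicate data [2]. Transmitting and storing it not only wastes storage and network resources but also lowers the efficiency of using the data, and the redundancy problem grows more severe over time. Domain names [3], among the most frequently used data types on the Internet, are a special form of data: they are extremely sensitive to character changes, and changing a single character often has a serious effect on the result. Handling duplicate domain name data therefore requires a deduplication technique that is both precise and efficient.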

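Among existing deduplication techniques, Whole File Detection (WFD) [4] cannot check for duplicates at the content level, while Fixed-Sized Partition (FSP) detection, variable-size chunk detection, and sliding block detection all perform coarse-grained deduplication based on features common to general data, so applying them directly to duplicate domain names gives unsatisfactory results. Building on existing duplicate-detection techniques, this paper therefore introduces the Simhash algorithm and, drawing on the structural characteristics of domain name data, improves the way text feature values are computed, proposing a Simhash-based method for deduplicating domain name data. Experimental comparison shows that the method handles duplicate domain name data more effectively, with a time overhead close to that of existing techniques, which gives it greater practical value than traditional deduplication techniques for this task.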
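To make the general idea concrete, the following is a minimal sketch of textbook Simhash applied to domain names. It is not the authors' improved method, whose segmentation and fingerprint details are given only in the full text. Everything specific here is an assumption made for illustration: the function names, the tokenization into dot-separated labels tagged with their position counted from the top-level domain, MD5 as the per-token hash, the 64-bit fingerprint width, and the default threshold of zero differing bits.

```python
import hashlib


def tokenize_domain(domain: str) -> list[str]:
    """Split a domain into labels tagged with their position from the TLD.

    Assumed tokenization: tagging keeps 'a.b.com' and 'b.a.com' distinct,
    which plain (order-insensitive) Simhash tokens would not.
    """
    labels = [label for label in domain.lower().strip(".").split(".") if label]
    return [f"{i}:{label}" for i, label in enumerate(reversed(labels))]


def simhash(tokens: list[str], bits: int = 64) -> int:
    """Textbook Simhash: each token votes +1/-1 per bit; keep positive bits."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(domains: list[str], max_distance: int = 0) -> list[str]:
    """Keep the first domain seen for each fingerprint neighbourhood.

    max_distance=0 (an assumed default) treats only identical fingerprints
    as duplicates, reflecting the need for precise matching on domains.
    """
    kept: list[tuple[int, str]] = []
    for domain in domains:
        fp = simhash(tokenize_domain(domain))
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, domain))
    return [domain for _, domain in kept]
```

For example, `deduplicate(["Example.com", "example.com.", "www.example.com"])` would keep `Example.com` and drop `example.com.`, because both normalize to the same tokens. Since distinct domains can in principle share a fingerprint, a real system would likely confirm each candidate duplicate with an exact string comparison before deleting it.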

The full text of this article can be downloaded at: http://www.chinaaet.com/resource/share/2000004102

