本博客持续更新,收集平常读论文时提高的公开数据集。
http://traces.cs.umass.edu/index.php/Network/Network 系列:
这个网站提供了该学院很多数据集,这些数据集是他们发表论文时提取的,然后公开出来。
数据集描述:
A collection of traces of web requests and responses over an encrypted SSH tunnel. The collection spans traces of connections to 2000 sites, collected four times a day over several months from February 2006 through April 2006. Each connection was encrypted; the traces include only the TCP headers, and not the payload.
2000个网站的SSH隧道上SSL连接的流量,只有TCP 头,PCAP大约2.5G 。
这个数据集是标注好了的。label: website。
地址:http://skuld.cs.umass.edu/traces/network/README-webident2
http://mawi.wide.ad.jp/mawi/
该项目是日本和美国的合作项目,该项目会按照不同的时间周期(如每天的某分钟,每个月)采集从日本到美国的某条骨干网络的网络流量,里面一共涉及了7个采样点,其中C-D采样点采集的是IPv6的数据包。时间跨度特别大:从2001年到2018 都有。
网络流量做了数据脱敏,里面的IP地址是经过处理的,而且只有IP包头。
当然这个数据集更适合做测量,因为它是没有标注过的。
数据集地址:
http://www.cse.bgu.ac.il/title_fingerprinting/
数据集描述:
作者采集了10000个Youtube视频流量,其中包含100个视频,每个视频观看100次。然后每个pcap,都有标注好对应的视频标题。pcap都没做数据脱敏,就是原始的pcap数据包。
例如:
http://www.cse.bgu.ac.il/title_fingerprinting/dataset_chrome_100/Hollyweezy/Train/
这个目录下有100个pcap,其中 Hollyweezy,就是他的视频标题。
同时作者该采集了一些带延时和丢包的数据包用来作为测试集。
作者的工作就是建立模型识别加密视频流量对应的视频标题。
这个数据集需要自己根据源码自己生成:
https://github.com/kpdyer/website-fingerprinting.git
源码生成的结果是会访问2000个 website 和775个SSH的Sockets代理的website的流量,里面的还有11个流量混淆的方法(包括包长填充等)可供选择,里面还有11个流量分类器。
里面有SSL的数据包,尚未标注好的。可以作为协议格式自动化推断的数据包。
https://www.netresec.com/?page=MACCDC
一般来说Tor上面的加密流量分类叫做webfingerprint attack,这个领域有很多公开的数据集,但是这个领域的很多数据集一般只给出cell的方向序列。例如:
[-1,1,1,-1,-1,…] 是{1,-1}的序列。正负号表示cell是Outgoing还是ingoing。每个cell的实际大小一般是512,因此可以根据cell的方向序列推断出cell的大小序列,只需要乘以512即可。
常见的数据集有:
里面有 Closed World ,Open World ,Concept drift 等流量。有100,200,500,900个类别的分梯度的标注数据,每个类别有2500条数据。这个数据集还是很大的!
URL:
http://home.cse.ust.hk/~taow/wf/data/
这里也是标注好的tor的website的流量,里面的数据都是基于cell的,都是+1,-1的序列。但是原始数据都带了相对时间戳。
有100多个类别,每个类别有100个左右的trace。
http://home.cse.ust.hk/~taow/wf/data/walkiebatch-defended.zip 是使用walkie-talkie手段构造的逃逸样本。
https://drive.google.com/drive/folders/1IXa3IJS9zJS4vggpyU7yda8f7jZjz4gB
介绍:
这个数据集超大,包含了103万个Android APP的带标注的流量数据,而且是pcap原始数据包。这个就很棒啊。
这个数据集包括了正常APP和恶意APP,是否正常是通过VirusTotal来判断的。
每个APP在Android模拟器里面跑4分钟。
https://drive.google.com/open?id=1wOdrfazbrcMDrL0NfA4GLoWegtPqkPj3
在三星Note4 安卓6.0.1设备上使用Chrome,Firefox,三星自带流量和UC浏览器访问Alexa Top 1000的站点。 每个站点访问15秒。
这里是为了分类不同的浏览器。
PPT链接:https://www.ndss-symposium.org/wp-content/uploads/2018/03/NDSS2018_05B-2_Ren_Slides.pdf
Cross Platform dataset: https://recon.meddle.mobi/cross-market.html
The Cross Platform dataset [51] consists of
user-generated data for 215 Android and 196 iOS apps. The
iOS apps were gathered from the top 100 apps in the App
Store in the US, China and India. The Android apps originate
from the top 100 apps in Google Play Store in the US and
India, plus from the top 100 apps of the Tencent MyApps and
360 Mobile Assistant stores, as Google Play is not available in
China. Each app was executed between three and ten minutes
while receiving real user inputs. Procedures to install, interact,
and uninstall the apps were given to student researchers who
followed them to complete the experiments while collecting
data. We use this dataset to evaluate both the performance
of our method with user-generated data and the performance
between different operating systems.
这个数据集做了相同APP不同版本之间的分析。
这个数据集有pcap文件。数据集有大约8个G
数据下载地址:https://drive.google.com/drive/folders/1cmG_5FIAh1DOGPI9el1K5WD9fUIpfw-x
https://recon.meddle.mobi/appversions/
The ReCon AppVersions dataset [52, 53] consists
of labeled network traces of 512 Android apps from the
Google Play Store, including multiple version releases over
a period of eight years. The traces were generated through
a combination of automated and scripted interactions on five
different Android devices. The apps were chosen among the
600 most popular free apps on the Google Play Store ranking
within the top 50 in each category. In addition, this dataset
contains extended traces of five apps, including multiple version releases. The network traffic of each of these five apps
was captured daily over a two-week period. In this work, we
refer the AppVersions dataset as ReCon and to the extended
dataset as ReCon extended.
里面的数据主要是基于载荷的。
数据下载地址:https://recon.meddle.mobi/appversions/raw_data.tar.gz
https://www.unb.ca/cic/datasets/
这个网站上面的数据集特别多,主要是一些入侵检测、恶意软件的数据集。
还有TOR的,这个可以!!!
介绍页面:https://www.unb.ca/cic/datasets/andmal2017.html
数据页面:http://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/PCAPS/
We installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. Our malware samples in the CICAndMal2017 dataset are classified into four categories:
Adware 广告软件
Ransomware 勒索软件
Scareware 恐吓软件
SMS Malware 短信恶意软件
Our samples come from 42 unique malware families.
数据有426个恶意软件,5000多个正常软件的流量。
这些恶意软件可以分为4大类,共42个家族。
这个数据集是对上面这个数据集的丰富,主要是把以下信息加进去了:
which includes permissions and intents as static features and API calls and all generated log files as dynamic features in three steps (During installation, before restarting and after restarting the phone).
APP的种类是没有变的。
CICAAGM dataset is captured by installing the Android apps on the real smartphones semi-automated. The dataset is generated from 1900 applications with the following three categories。
有1900多个应用的流量,其中有400个恶意软件。
https://www.unb.ca/cic/datasets/android-adware.html
这个数据集里面的广告软件特别多,有250个。
URL: https://www.unb.ca/cic/datasets/url-2016.html
下载链接: http://205.174.165.80/CICDataset/ISCX-URL-2016/Dataset/ISCXURL2016.zip
Benign URLs: Over 35,300 benign URLs were collected from Alexa top websites. The domains have been passed through a Heritrix web crawler to extract the URLs. Around half a million unique URLs are crawled initially and then passed to remove duplicate and domain only URLs. Later the extracted URLs have been checked through Virustotal to filter the benign URLs.
Spam URLs: Around 12,000 spam URLs were collected from the publicly available WEBSPAM-UK2007 dataset.
Phishing URLs: Around 10,000 phishing URLs were taken from OpenPhish which is a repository of active phishing sites.
Malware URLs: More than 11,500 URLs related to malware websites were obtained from DNS-BH which is a project that maintain list of malware sites.
Defacement URLs: More than 45,450 URLs belong to Defacement URL category. They are Alexa ranked trusted websites hosting fraudulent or hidden URL that contains both malicious web pages.
刷水文必备的数据集。。。
https://www.unb.ca/cic/datasets/.html
7大类网络服务,各自都有VPN和非VPN的流量:We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc.
https://www.stratosphereips.org/datasets-overview
这个IPS有提供比较多的数据集,更多是僵尸网络的数据集。