import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import Normalizer
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, mean_squared_error, mean_absolute_error,
                             roc_curve, classification_report, auc)
from sklearn.preprocessing import LabelEncoder
columns = ['id','dur','proto','service','state','spkts','dpkts','sbytes','dbytes','rate',
           'sttl','dttl','sload','dload','sloss','dloss','sinpkt','dinpkt','sjit','djit',
           'swin','stcpb','dtcpb','dwin','tcprtt','synack','ackdat','smean','dmean','trans_depth',
           'response_body_len','ct_srv_src','ct_state_ttl','ct_dst_ltm','ct_src_dport_ltm',
           'ct_dst_sport_ltm','ct_dst_src_ltm','is_ftp_login','ct_ftp_cmd','ct_flw_http_mthd',
           'ct_src_ltm','ct_srv_dst','is_sm_ips_ports','attack_cat','label']
# Note: the two split files are deliberately used swapped here -- the file
# named 'testing-set' serves as the training data, and vice versa.
traindata = pd.read_csv('UNSW_NB15_testing-set.csv', skiprows=1, names=columns)
testdata = pd.read_csv('UNSW_NB15_training-set.csv', skiprows=1, names=columns)
for column in traindata.columns:
    if traindata[column].dtype == object:
        # LabelEncoder maps categorical text values (proto, service, state,
        # attack_cat) to consecutive integer codes
        le = LabelEncoder()
        traindata[column] = le.fit_transform(traindata[column])
for column in testdata.columns:
    if testdata[column].dtype == object:
        le = LabelEncoder()
        testdata[column] = le.fit_transform(testdata[column])
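# Caveat: fitting a separate LabelEncoder per split can assign different
# integer codes to the same category if the two files contain different
# value sets. A safer pattern (a sketch, not part of the original script)
# would fit one encoder on the union of both columns, e.g.:
#   le = LabelEncoder().fit(pd.concat([traindata[column], testdata[column]]))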
X1 = traindata.iloc[:, 1:43]  # features dur..is_sm_ips_ports (drop id; 1:44 would leak attack_cat)
Y1 = traindata.iloc[:, 44]    # binary label: 0 = normal, 1 = attack
X2 = testdata.iloc[:, 1:43]
Y2 = testdata.iloc[:, 44]
scaler = Normalizer().fit(X1)
trainX = scaler.transform(X1)
scaler = Normalizer().fit(X2)
testT = scaler.transform(X2)
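# Note: Normalizer rescales each row to unit L2 norm independently of all
# other rows, so fit() learns nothing and fitting on X1 vs X2 makes no
# difference; a column-wise scaler such as StandardScaler, by contrast,
# would have to be fit on the training data only.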
traindata = np.array(trainX)
trainlabel = np.array(Y1)
testdata = np.array(testT)
testlabel = np.array(Y2)
model = DecisionTreeClassifier()  # decision tree with default settings (Gini impurity)
model.fit(traindata, trainlabel)
expected = testlabel
predicted = model.predict(testdata)
accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted, average="binary")  # precision = TP / (TP + FP)
f1 = f1_score(expected, predicted, average="binary")
cm = metrics.confusion_matrix(expected, predicted)
print(cm)  # confusion matrix: rows = actual (0 = normal, 1 = attack), columns = predicted
tn, fp, fn, tp = cm.ravel()
tpr = float(tp) / (tp + fn)  # true positive rate (detection rate)
fpr = float(fp) / (fp + tn)  # false positive rate (false alarm rate)
print("%.3f" %tpr)
print("%.3f" %fpr)
print("Accuracy","%.3f" %accuracy)
print("precision","%.3f" %precision)
print("recall","%.3f" %recall)
print("f-score","%.3f" %f1)
print("fpr","%.3f" %fpr)
print("tpr","%.3f" %tpr)
Decision-tree analysis of the UNSW-NB15 intrusion-detection dataset
accuracy 0.802
precision 0.774
recall 0.906
f-score 0.835
This post introduces the UNSW-NB15 dataset, another dataset for intrusion detection; the focus here is on describing the dataset itself. We previously covered another intrusion-detection dataset, see: An introduction to the KDD99 and NSL-KDD datasets
Dataset homepage: The UNSW-NB15 Dataset Description
Dataset download: UNSW-NB15 Download
The dataset contains nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
It has 49 features in total; each feature is described later in this post.
The raw data comprises 2,540,044 records stored in four CSV files.
The counts for each attack type are given in UNSW-NB15_LIST_EVENTS.csv; a brief analysis follows later.
The dataset already comes split into a training set and a testing set, in the files UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv.
The training set contains 175,341 records and the testing set 82,332 records, drawn from the different attack types as well as normal traffic. Figures 1 and 2 of the original paper show the testbed configuration and the feature-creation method, respectively.
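A quick way to verify these counts and inspect the class balance (a minimal sketch, assuming the two split files sit in the working directory):

import pandas as pd

train = pd.read_csv('UNSW_NB15_training-set.csv')
test = pd.read_csv('UNSW_NB15_testing-set.csv')
print(len(train), len(test))               # expected: 175341 82332
print(train['attack_cat'].value_counts())  # records per attack family
print(train['label'].value_counts())       # 0 = normal, 1 = attack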
The dataset's 49 features are introduced one by one below; the descriptions are taken from:
In the feature descriptions below, the abbreviations used in the Type column correspond as follows:
Features 1-35 represent integrated information gathered from the data packets; the majority are generated from packet headers, as reflected above.
Among the general-purpose features, each feature serves its own purpose from a defence point of view.
The connection features are created solely to provide defence against connection-attempt scenarios.
Attackers may scan hosts in a capricious way, for example once per minute or once per hour. To identify such attackers, features 36-47 are sorted by the last-time feature so as to capture similar characteristics of the connection records over each 100 sequentially ordered connections; a sketch of this windowed construction follows below.
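As a rough illustration of how such a windowed count could be computed (a sketch only; the flow fields and window logic here are simplified assumptions, not the authors' actual generation tool):

import pandas as pd

# Hypothetical flow log, already ordered by last time, with the fields a
# ct_srv_src-style feature is built from.
flows = pd.DataFrame({
    'srcip':   ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'service': ['http', 'http', 'dns', 'http'],
})

WINDOW = 100  # the paper counts over each 100 sequentially ordered connections
key = flows['srcip'] + '|' + flows['service']
# For each record, count how many of the previous WINDOW connections
# (including this one) share the same srcip and service.
flows['ct_srv_src'] = [
    int((key.iloc[max(0, i - WINDOW + 1):i + 1] == key.iloc[i]).sum())
    for i in range(len(flows))
]
print(flows)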
The records of the UNSW-NB15 data set fall into two major categories, normal and attack, and the attack records are further classified into nine families according to the nature of the attacks.
Four CSV files of data records are provided, each containing both attack and normal records. The files are named UNSW-NB15_1.csv, UNSW-NB15_2.csv, UNSW-NB15_3.csv, and UNSW-NB15_4.csv.
In each CSV file, the records are ordered by the last-time attribute. Each of the first three CSV files contains 700,000 records, and the fourth contains 440,044 records; a sketch of loading the full set follows below.
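To work with the full 2,540,044 records, the four files can be concatenated (a minimal sketch; note that, unlike the split files, the raw CSVs ship without a header row, so the 49 column names must be supplied from the dataset's feature-description file):

import pandas as pd

files = ['UNSW-NB15_1.csv', 'UNSW-NB15_2.csv', 'UNSW-NB15_3.csv', 'UNSW-NB15_4.csv']
full = pd.concat((pd.read_csv(f, header=None, low_memory=False) for f in files),
                 ignore_index=True)
print(len(full))  # expected: 2540044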
The list-of-events file, UNSW-NB15_LIST_EVENTS.csv, contains the attack categories and subcategories.
Next we look at how accurately various algorithms handle the UNSW-NB15 dataset. The results below come from the following paper.
Five techniques are evaluated: Naive Bayes (NB) (Panda & Patra, 2007), Decision Tree (DT) (Bouzida & Cuppens, 2006), Artificial Neural Network (ANN) (Bouzida & Cuppens, 2006; Mukkamala et al., 2005), Logistic Regression (LR) (Mukkamala et al., 2005), and Expectation-Maximization (EM) Clustering (Sharif et al., 2012).
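For reference, a comparable experiment can be run in scikit-learn on the arrays prepared in the script above (a sketch; these are generic scikit-learn stand-ins, not the implementations used in the paper, and EM clustering is omitted because, being unsupervised, it needs an extra cluster-to-label mapping step):

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

models = {
    'NB':  GaussianNB(),
    'DT':  DecisionTreeClassifier(),
    'ANN': MLPClassifier(max_iter=500),
    'LR':  LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    clf.fit(traindata, trainlabel)  # arrays from the script above
    print(name, '%.3f' % accuracy_score(testlabel, clf.predict(testdata)))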
The evaluation metrics are accuracy and the false alarm rate (FAR). For more on evaluation metrics, see: Model evaluation metrics, explained and in practice: notes on the confusion matrix
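In confusion-matrix terms, the usual definition is FAR = FP / (FP + TN), i.e. the fraction of normal records incorrectly flagged as attacks (some papers instead average the false-positive and false-negative rates). Using the tn/fp values already unpacked in the script above:

far = fp / (fp + tn)  # same quantity printed as fpr earlier
print('FAR %.3f' % far)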
The final results reported in the paper are shown in the figure below; accuracy is a bit under 85%:
Finally, here are some experiments based on the UNSW-NB15 dataset; their code may be a useful reference for your own work.
A GitHub roundup of work on this dataset: GitHub roundup: the UNSW-NB15 dataset
Preprocessed data (with feature encoding already done): Feature coded UNSW_NB15 intrusion detection data.
Processing UNSW-NB15 with SVM and Naive Bayes: UNSW-Network_Packet_Classification