This section covers DGA domain detection. 《Web安全之机器学习入门》 already devoted several sections to DGA domains; my notes on those sections are:
《Web安全之机器学习入门》notes: Chapter 7, 7.6 Detecting DGA domains with Naive Bayes (mooyuan's blog, CSDN)
《Web安全之机器学习入门》notes: Chapter 9, 9.4 Detecting DGA domains with Support Vector Machines (SVM) (mooyuan's blog, CSDN)
《Web安全之机器学习入门》notes: Chapter 10, 10.3 Detecting DGA domains with K-Means (mooyuan's blog, CSDN)
Those notes also show the difference between 《Web安全之机器学习入门》 and 《Web安全之深度学习实战》: the former names its chapters after algorithms and explains which security problems each algorithm can solve, whereas the latter names its chapters after security problems and explains which algorithms can be used to model them. This chapter introduces the dataset used for DGA domain detection and the feature extraction methods, namely the N-gram model, the statistical feature model, and the character sequence model; it also introduces the algorithms used and their validation results, covering Naive Bayes, XGBoost, and deep learning.
The Domain Generation Algorithm (DGA) is an old but still very active technique and a key survival mechanism for centrally structured botnets, and it has caused security practitioners no small amount of trouble. Against a DGA-based botnet (see Figure 13-1), analysts need to work out the generation algorithm and its inputs quickly so that the generated domains can be dealt with in time.
A DGA dynamically generates domains from inputs such as the current time, a dictionary, and hard-coded constants, as illustrated in Figure 13-2.
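To make the principle concrete, here is a toy sketch of such a generator that derives domains from today's date and a hard-coded seed. It is purely illustrative and does not correspond to any real malware family's algorithm.

import hashlib
from datetime import date

def toy_dga(seed="hardcoded-secret", n=5, length=12, tld=".com"):
    # derive pseudo-random domain names from today's date and a constant seed
    today = date.today().strftime("%Y%m%d")
    domains = []
    for i in range(n):
        digest = hashlib.md5(f"{seed}-{today}-{i}".encode()).hexdigest()
        # map hex digits to lowercase letters so the result looks like a domain label
        name = "".join(chr(ord('a') + int(c, 16) % 26) for c in digest[:length])
        domains.append(name + tld)
    return domains

print(toy_dga())  # a new batch of domains every day; the botmaster runs the same code to know them in advance

As long as the defender does not know the seed, blacklisting yesterday's domains does not help, which is why this chapter focuses on classifying the domain strings themselves.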
Alexa is a site that publishes worldwide website rankings. Founded in April 1996, it started out as a search engine. Alexa collects more than 1,000 GB of information from the web every day; it not only indexes billions of URLs but also ranks each of the sites it covers, making it one of the largest and most detailed sources of URL ranking data. We use the domains of the Alexa global top one million websites as white (benign) samples. For the DGA samples, we use the open data published by 360netlab as black (malicious) samples.
import re
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

dga_file="../data/dga/dga.txt"
alexa_file="../data/dga/top-1m.csv"

def load_alexa():
    # top-1m.csv has one "rank,domain" record per line; keep only the domain column
    data = pd.read_csv(alexa_file, sep=",", header=None)
    x = [i[1] for i in data.values]
    return x

def load_dga():
    # 360netlab's dga.txt is tab-separated with an 18-line comment header
    data = pd.read_csv(dga_file, sep="\t", header=None, skiprows=18)
    x = [i[1] for i in data.values]
    return x
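Assuming both data files are in place, the loaders can be sanity-checked as follows (the counts and example domains are illustrative and depend on the downloaded snapshots):

alexa = load_alexa()
dga = load_dga()
print(len(alexa), alexa[:3])  # e.g. 1000000 ['google.com', 'youtube.com', 'facebook.com']
print(len(dga), dga[:3])      # DGA domains from the 360netlab feed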
Following 《Web安全之机器学习入门》, the statistical feature model describes each domain with simple character statistics such as the number of vowels, the number of unique characters, and the average Jaccard coefficient; the helper functions implemented here cover the vowel count, the unique-character count, and the digit count:
def get_aeiou(domain):
    # number of vowels in the domain
    count = len(re.findall(r'[aeiou]', domain.lower()))
    return count

def get_uniq_char_num(domain):
    # number of distinct characters
    count = len(set(domain))
    return count

def get_uniq_num_num(domain):
    # number of digit characters
    count = len(re.findall(r'[0-9]', domain.lower()))
    return count
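For intuition, here are the helpers applied to a benign domain and to a DGA-looking string (the second domain is made up for illustration):

print(get_aeiou("google.com"),
      get_uniq_char_num("google.com"),
      get_uniq_num_num("google.com"))        # 4 7 0
print(get_aeiou("xjqzkvbd.info"),
      get_uniq_char_num("xjqzkvbd.info"),
      get_uniq_num_num("xjqzkvbd.info"))     # vowels are rare and almost every character is unique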
The complete feature-extraction code is as follows:
def get_feature():
    from sklearn import preprocessing
    alexa = load_alexa()
    dga = load_dga()
    v = alexa + dga
    y = [0]*len(alexa) + [1]*len(dga)   # 0 = benign (Alexa), 1 = DGA
    x = []
    for vv in v:
        vvv = [get_aeiou(vv), get_uniq_char_num(vv), get_uniq_num_num(vv), len(vv)]
        x.append(vvv)
    x = preprocessing.scale(x)          # standardize each feature column
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
def get_feature_2gram():
    alexa = load_alexa()
    dga = load_dga()
    x = alexa + dga
    max_features = 10000
    y = [0]*len(alexa) + [1]*len(dga)
    CV = CountVectorizer(
        ngram_range=(2, 2),        # with token_pattern=r'\w' this yields character 2-grams
        token_pattern=r'\w',
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    x = CV.fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train.toarray(), x_test.toarray(), y_train, y_test
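Because token_pattern=r'\w' makes every word character its own token, each n-gram is effectively a pair of adjacent characters. A quick check on a toy corpus makes this visible (get_feature_names_out requires scikit-learn 1.0 or later; use get_feature_names on older versions):

cv = CountVectorizer(ngram_range=(2, 2), token_pattern=r'\w')
m = cv.fit_transform(["google.com", "xjqzkvd.net"])
print(cv.get_feature_names_out())   # e.g. ['c o' 'd n' 'e c' ...], one feature per adjacent character pair
print(m.toarray().shape)            # (2, number of distinct 2-grams)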
def get_feature_234gram():
    alexa = load_alexa()
    dga = load_dga()
    x = alexa + dga
    max_features = 10000
    y = [0]*len(alexa) + [1]*len(dga)
    CV = CountVectorizer(
        ngram_range=(2, 4),        # character 2-, 3- and 4-grams, capped at max_features
        token_pattern=r'\w',
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    x = CV.fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train.toarray(), x_test.toarray(), y_train, y_test
def do_nb(x_train, x_test, y_train, y_test):
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
def do_xgboost(x_train, x_test, y_train, y_test):
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
def do_mlp(x_train, x_test, y_train, y_test):
    # small multilayer perceptron with two hidden layers of 5 and 2 units
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
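The three classifiers above take the statistical or N-gram features, while do_rnn below expects each domain encoded as a sequence of character codes. The book's character-sequence featurizer is not reproduced in this post, so here is a minimal sketch under that assumption; the name get_feature_charseq and the ord()-based encoding are illustrative choices rather than the book's exact code.

def get_feature_charseq():
    # encode each domain as a list of character codes; do_rnn pads/truncates them to length 64
    alexa = load_alexa()
    dga = load_dga()
    v = alexa + dga
    y = [0]*len(alexa) + [1]*len(dga)
    x = [[ord(c) for c in domain] for domain in v]
    return train_test_split(x, y, test_size=0.4)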
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences

def do_rnn(trainX, testX, trainY, testY):
    max_document_length = 64
    y_test = testY
    # pad/truncate every character sequence to a fixed length
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # convert labels to one-hot vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # network: embedding -> LSTM -> softmax
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=64)
    net = tflearn.lstm(net, 64, dropout=0.1)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # training
    model = tflearn.DNN(net, tensorboard_verbose=0, tensorboard_dir="dga_log")
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="dga", n_epoch=1)
    # column 0 of each prediction is the probability of the benign class
    y_predict_list = model.predict(testX)
    y_predict = []
    for i in y_predict_list:
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
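A driver that ties the pieces together and produces output of the kind shown below might look like the sketch that follows. The book wraps these calls in a similar main (the "Hello dga" banner in the output suggests as much), but the exact driver is not reproduced here; the results below also appear to come from two separate runs, which is why the test-set supports differ between the statistical-feature block and the 2-gram block.

if __name__ == "__main__":
    print("Hello dga")

    # statistical features
    x_train, x_test, y_train, y_test = get_feature()
    print("text feature & nb")
    do_nb(x_train, x_test, y_train, y_test)
    print("text feature & xgboost")
    do_xgboost(x_train, x_test, y_train, y_test)
    print("text feature & mlp")
    do_mlp(x_train, x_test, y_train, y_test)

    # character 2-gram features
    x_train, x_test, y_train, y_test = get_feature_2gram()
    print("2-gram & mlp")
    do_mlp(x_train, x_test, y_train, y_test)
    print("2-gram & XGBoost")
    do_xgboost(x_train, x_test, y_train, y_test)
    print("2-gram & nb")
    do_nb(x_train, x_test, y_train, y_test)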
Hello dga
text feature & nb
              precision    recall  f1-score   support

           0       0.65      0.92      0.77      4019
           1       0.86      0.51      0.64      3974

    accuracy                           0.72      7993
   macro avg       0.76      0.71      0.70      7993
weighted avg       0.76      0.72      0.70      7993

[[3704  315]
 [1956 2018]]

text feature & xgboost
              precision    recall  f1-score   support

           0       0.83      0.91      0.87      4019
           1       0.90      0.81      0.85      3974

    accuracy                           0.86      7993
   macro avg       0.86      0.86      0.86      7993
weighted avg       0.86      0.86      0.86      7993

[[3658  361]
 [ 749 3225]]

text feature & mlp
              precision    recall  f1-score   support

           0       0.82      0.92      0.87      4019
           1       0.91      0.80      0.85      3974

    accuracy                           0.86      7993
   macro avg       0.87      0.86      0.86      7993
weighted avg       0.87      0.86      0.86      7993

[[3704  315]
 [ 804 3170]]

Hello dga
2-gram & mlp
              precision    recall  f1-score   support

           0       0.94      0.94      0.94      4004
           1       0.94      0.94      0.94      3989

    accuracy                           0.94      7993
   macro avg       0.94      0.94      0.94      7993
weighted avg       0.94      0.94      0.94      7993

[[3764  240]
 [ 226 3763]]

2-gram & XGBoost
              precision    recall  f1-score   support

           0       0.93      0.96      0.94      4004
           1       0.95      0.93      0.94      3989

    accuracy                           0.94      7993
   macro avg       0.94      0.94      0.94      7993
weighted avg       0.94      0.94      0.94      7993

[[3829  175]
 [ 283 3706]]

2-gram & nb
              precision    recall  f1-score   support

           0       0.72      0.95      0.82      4004
           1       0.93      0.63      0.75      3989

    accuracy                           0.79      7993
   macro avg       0.82      0.79      0.78      7993
weighted avg       0.82      0.79      0.78      7993

[[3815  189]
 [1489 2500]]