ngram features (ngram_range=(1, 3)), subsample=1, 10-fold CV held fixed; varying only random_state:

| random_state | train-mlogloss | val-mlogloss |
|---|---|---|
| 4 | 0.070363 | 0.303283 |
| 42 | 0.09246 | 0.305461 |
| 8 | 0.095497 | 0.292836 |
| 0 | 0.064349 | 0.278694 |
5-fold, averaged over the four random seeds: 0.470195
10-fold, averaged over the four random seeds: 0.471181
10-fold and 5-fold give nearly identical results; presumably, once the fold count is reasonably large, it has little effect on the score.
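For reference, a minimal sketch of this seed-averaging scheme, assuming an XGBoost classifier and StratifiedKFold splits (the function name, the per-fold averaging of test predictions, and any hyperparameters beyond subsample=1 are my assumptions, not the original training script):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

def seed_averaged_proba(x_train, y_train, x_test, seeds=(4, 42, 8, 0), n_folds=10):
    n_classes = len(np.unique(y_train))
    seed_preds = []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        test_pred = np.zeros((x_test.shape[0], n_classes))
        for tr_idx, _ in skf.split(x_train, y_train):
            clf = xgb.XGBClassifier(subsample=1, random_state=seed)
            clf.fit(x_train[tr_idx], y_train[tr_idx])
            test_pred += clf.predict_proba(x_test) / n_folds  # average over folds
        seed_preds.append(test_pred)
    return np.mean(seed_preds, axis=0)  # then average over the four seeds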
Read the saved predictions on the training set as y_pred, and the training-set labels as y_true.
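These are the inputs to a confusion-matrix check of which classes the model mixes up (see the reference links at the end). A minimal sketch with sklearn, assuming the saved predictions are per-class probabilities and using a hypothetical file name train_pred.pkl:

import pickle
import numpy as np
from sklearn.metrics import confusion_matrix

with open("train_pred.pkl", "rb") as f:   # hypothetical file of saved train predictions
    y_pred_proba = pickle.load(f)         # assumed shape: (13887, n_classes)
y_pred = np.argmax(y_pred_proba, axis=1)  # most likely class per sample

with open("security_train.csv.pkl", "rb") as f:
    y_true = pickle.load(f)               # training-set labels, ndarray (13887,)

print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class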
First process the file with load_file.py to generate train_label_four.csv.pkl, then append the class-4 data to the full training set.
import pickle
import numpy as np

# Class-4 data
with open("train_label_four.csv.pkl", "rb") as f:
    labels_4 = pickle.load(f)      # ndarray, shape (53,)
    train_apis_4 = pickle.load(f)  # list of length 53
# Full training set
with open("security_train.csv.pkl", "rb") as f:
    labels = pickle.load(f)        # ndarray, shape (13887,)
    train_apis = pickle.load(f)    # list of length 13887

labels_concate = np.concatenate((labels, labels_4), axis=0)
# train_apis_concate = np.concatenate((np.array(train_apis), np.array(train_apis_4)), axis=0)
train_apis.extend(train_apis_4)  # extend() mutates in place and returns None, so don't assign its result
Problem encountered: on Colab, 35 GB of RAM is not enough to merge the data, and the program crashes.
Solution: concatenate the API sequences by calling extend(), which appends the class-4 API sequences in place at the end of the original list. np.concatenate(), by contrast, allocates fresh memory to join train_apis and train_apis_4, which runs out of RAM.
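A toy illustration of the difference (toy lists, not the real API sequences):

import numpy as np

a = list(range(1_000_000))
b = list(range(100))

# In place: the existing list grows; its element data is not copied wholesale
a.extend(b)

# By contrast, np.concatenate allocates a brand-new array large enough for
# both inputs, so inputs and result coexist in memory at the same time
c = np.concatenate((np.arange(1_000_000), np.arange(100)), axis=0)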
Original dataset: 0.471710
With the class-4 data added: 0.464635
Concatenate the training and test sets and save them to a txt file.
with open("security_train.csv.pkl", "rb") as f:
labels = pickle.load(f) # ndarray (13887,)
train_apis = pickle.load(f) # list 13887
with open("security_test.csv.pkl", "rb") as f:
test_nums = pickle.load(f) # list 12955
test_apis = pickle.load(f) # list 12955
count = 0 # 记录多少行
# 转换为 txt 文件 13887 + 12955 = 26842
with open('word2vec.txt', 'a') as f:
for sentence in train_apis:
f.write(sentence)
f.write('\n')
count = count + 1
print(count)
for sentence in test_apis:
f.write(sentence)
f.write('\n')
count = count + 1
print(count)
Train the embeddings with Word2Vec():
from gensim.models import Word2Vec

word2vec = Word2Vec(corpus_file='word2vec.txt', min_count=1)  # train on the txt file written above
word2vec.save('word2vec.model')
word2vec = Word2Vec.load('word2vec.model')
min_count defaults to 5; with the default, any API that appears fewer than 5 times gets no entry in the vocabulary, and a later word2vec.wv[] lookup fails with a KeyError, hence min_count=1 here.
size=100: word vectors default to 100 dimensions.
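If the model had instead been trained with the default min_count=5, an alternative to retraining is to skip out-of-vocabulary APIs at lookup time. A sketch, assuming the same model file as above (sentence_vec is a name introduced here, not part of the original code):

import numpy as np
from gensim.models import Word2Vec

word2vec = Word2Vec.load('word2vec.model')

def sentence_vec(sentence):
    # Skip APIs that fell below min_count and therefore have no vector
    vecs = [word2vec.wv[key] for key in sentence.split(' ') if key in word2vec.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(word2vec.wv.vector_size)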
Iterate over the API sequences of the training and test samples, averaging the word vectors of the APIs in each sequence:
import pickle
import numpy as np
from gensim.models import Word2Vec

# Load the trained word2vec model
word2vec = Word2Vec.load('word2vec.model')
# print(word2vec.wv['LdrGetProcedureAddress'])
# print(word2vec.wv['NtClose'])

train_word2vec = []
test_word2vec = []
for sentence in train_apis:  # train_apis / test_apis as loaded earlier
    one_line_vec = [word2vec.wv[key] for key in sentence.split(' ')]
    mean_vec = np.mean(one_line_vec, axis=0)  # average the API vectors of this sample
    train_word2vec.append(mean_vec)
for sentence in test_apis:
    one_line_vec = [word2vec.wv[key] for key in sentence.split(' ')]
    mean_vec = np.mean(one_line_vec, axis=0)
    test_word2vec.append(mean_vec)
# print(len(train_word2vec))  # 13887
# print(len(test_word2vec))   # 12955

# Save the word2vec features
with open("word2vec/word2vec.pkl", 'wb') as f:
    pickle.dump(train_word2vec, f)
    pickle.dump(test_word2vec, f)
word2vec (100-dim) result: 0.740139
train-mlogloss: 0.114525  val-mlogloss: 0.449284
Averaging the word2vec (100-dim) and ngram predictions: 0.520576
(ngram alone scores 0.471710)
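The blend is a plain unweighted average of the two models' probability outputs; a sketch (the function name is mine):

import numpy as np

def average_blend(pred_ngram, pred_word2vec):
    # Both inputs: (n_samples, n_classes) class-probability matrices
    return (np.asarray(pred_ngram) + np.asarray(pred_word2vec)) / 2

Since word2vec alone scores much worse than ngram (0.740139 vs 0.471710), equal weighting pulls the blend's score above ngram alone, which matches the 0.520576 result.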
import pickle
import numpy as np

# Load the word2vec features
with open("word2vec/word2vec.pkl", 'rb') as f:
    train_word2vec = pickle.load(f)  # list of length 13887
    test_word2vec = pickle.load(f)   # list of length 12955
# Load the ngram features
with open("ngram_model/ngram_vec_original_data.pkl", 'rb') as f:
    train_ngram = pickle.load(f)  # (13887, 180858) csr_matrix
    y_train = pickle.load(f)      # (13887,)
    test_ngram = pickle.load(f)   # (12955, 180858) csr_matrix
    test_nums = pickle.load(f)    # list of length 12955

# Convert the word2vec features to ndarray
train_word2vec = np.array(train_word2vec)  # (13887, 100) ndarray
test_word2vec = np.array(test_word2vec)    # (12955, 100) ndarray

# Concatenate the training features
train_ngram = train_ngram.A.astype('float32')  # (13887, 180858) ndarray
x_train = np.concatenate((train_ngram, train_word2vec), axis=1)  # (13887, 180958) ndarray
# pickle.dump(x_train, open("x_train.pkl", 'wb'), protocol=4)
# Concatenate the test features
test_ngram = test_ngram.A.astype('float32')
x_test = np.concatenate((test_ngram, test_word2vec), axis=1)  # (12955, 180958) ndarray
# pickle.dump(x_test, open("x_test.pkl", 'wb'), protocol=4)
The vectors produced by the ngram model are in csr_matrix format; calling .A converts them to an ndarray (it is shorthand for .toarray()), and casting to float32 halves the memory footprint relative to the default float64.
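A quick toy check of the saving (toy shape, not the real 13887 × 180858 matrix):

import numpy as np
from scipy.sparse import csr_matrix

sparse = csr_matrix(np.random.rand(1000, 2000))  # float64 values by default

dense64 = sparse.A                     # .A is shorthand for .toarray()
dense32 = sparse.A.astype('float32')
print(dense64.nbytes)  # 16000000 bytes
print(dense32.nbytes)  # 8000000 bytes: float32 halves the footprint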
Reference links
混淆矩阵及confusion_matrix函数的使用 (using the confusion_matrix function)
使用python绘制混淆矩阵(confusion_matrix) (plotting a confusion matrix with Python)