[Beginner Contest] Alibaba Cloud Malicious Program Detection, Weekly Summary: Confusion Matrix & word2vec

Table of contents

      • Varying the random seed and averaging
      • Printing the confusion matrix
      • Adding the class-4 data
      • word2vec
      • Concatenating the ngram and word2vec vectors

Varying the random seed and averaging

Setup: ngram features (ngram_range=(1, 3)), subsample=1, 10-fold CV fixed.

random_state=4: train-mlogloss 0.070363, val-mlogloss 0.303283
random_state=42: train-mlogloss 0.09246, val-mlogloss 0.305461
random_state=8: train-mlogloss 0.095497, val-mlogloss 0.292836
random_state=0: train-mlogloss 0.064349, val-mlogloss 0.278694

5-fold, averaged over the four random seeds: 0.470195

10-fold, averaged over the four random seeds: 0.471181

10-fold and 5-fold scores are nearly identical; presumably once the number of folds is reasonably large, the exact fold count has little effect on the result.
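"Averaging over seeds" here means averaging the per-class probability matrices predicted by each seeded model. A minimal sketch with made-up probabilities (in practice each matrix comes from an XGBoost model trained with a different random_state):

```python
import numpy as np

# Hypothetical per-seed prediction matrices (n_samples x n_classes).
preds = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),  # seed A
    np.array([[0.6, 0.4], [0.4, 0.6]]),  # seed B
]

# Element-wise mean of the probability matrices is the ensembled prediction.
mean_pred = np.mean(preds, axis=0)
# Rows still sum to 1, so the averaged matrix can be submitted directly.
```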

Printing the confusion matrix

  • Training-set confusion matrix

Load the saved training-set predictions as y_pred, and the training labels as y_true.

[Images: training-set confusion matrix]
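The matrix itself can be computed with scikit-learn's confusion_matrix; a minimal sketch with toy labels (y_true and y_pred below are illustrative, not the competition data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: in practice y_pred is the argmax over the saved
# out-of-fold probability predictions.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# cm[i, j] counts samples of true class i predicted as class j.
cm = confusion_matrix(y_true, y_pred)
```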

  • Test-set result statistics (10-fold, subsample=1, seed=42)

[Image: test-set result statistics]

Adding the class-4 data

First process the files with load_file.py to generate train_label_four.csv.pkl, then append the class-4 data to the full training set.

import pickle
import numpy as np

# Class-4 data
with open("train_label_four.csv.pkl", "rb") as f:
    labels_4 = pickle.load(f)      # ndarray (53,)
    train_apis_4 = pickle.load(f)  # list, 53 entries

# Full training set
with open("security_train.csv.pkl", "rb") as f:
    labels = pickle.load(f)        # ndarray (13887,)
    train_apis = pickle.load(f)    # list, 13887 entries

labels_concate = np.concatenate((labels, labels_4), axis=0)
# np.concatenate() would copy both API lists into a new array and exhaust memory:
# train_apis_concate = np.concatenate((np.array(train_apis), np.array(train_apis_4)), axis=0)
train_apis.extend(train_apis_4)  # in place; note that extend() returns None

Problem encountered: Colab's 35 GB of RAM is not enough to merge the data; the program crashes.

Solution: concatenate the API sequences with extend(), which appends the class-4 API sequences to the original list in place. Calling np.concatenate() instead allocates new memory to join train_apis and train_apis_4, which runs out of memory.
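The difference is easy to see on a toy list: extend() mutates the existing list and returns None, while np.concatenate() builds a brand-new copy of both inputs:

```python
import numpy as np

a = ['api1', 'api2']
b = ['api3']

a.extend(b)        # in place: no second copy of a is allocated
# a = a.extend(b)  # wrong: extend() returns None, which would clobber a

c = np.concatenate((np.array(a), np.array(b)))  # allocates a new array copy
```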

Original data set: 0.471710
With class-4 data added: 0.464635

word2vec

Concatenate the training and test sets and save them as a txt file.

import pickle

with open("security_train.csv.pkl", "rb") as f:
    labels = pickle.load(f)      # ndarray (13887,)
    train_apis = pickle.load(f)  # list, 13887 entries

with open("security_test.csv.pkl", "rb") as f:
    test_nums = pickle.load(f)   # list, 12955 entries
    test_apis = pickle.load(f)   # list, 12955 entries

count = 0  # line counter

# Write out as txt: 13887 + 12955 = 26842 lines
with open('word2vec.txt', 'a') as f:
    for sentence in train_apis:
        f.write(sentence + '\n')
        count += 1
    for sentence in test_apis:
        f.write(sentence + '\n')
        count += 1
print(count)  # 26842

Train with the Word2Vec() function.

from gensim.models import Word2Vec

# corpus_file points at the txt file written above (one API sequence per line)
word2vec = Word2Vec(corpus_file='word2vec.txt', min_count=1)

word2vec.save('word2vec.model')
word2vec = Word2Vec.load('word2vec.model')

min_count defaults to 5. With the default, any API that appears fewer than 5 times gets no entry in the vocabulary, and the later word2vec.wv[] lookups fail with a missing key.

The vector size defaults to 100 dimensions (size=100; the parameter is named vector_size in gensim ≥ 4).

Iterate over the API sequence of each training and test sample, and average the vectors of its APIs.

import pickle
import numpy as np
from gensim.models import Word2Vec

# Load the trained word2vec model
word2vec = Word2Vec.load('word2vec.model')
# print(word2vec.wv['LdrGetProcedureAddress'])
# print(word2vec.wv['NtClose'])
train_word2vec = []
test_word2vec = []

# Average the vectors of all APIs in each sample's sequence
for sentence in train_apis:
    one_line_vec = [word2vec.wv[key] for key in sentence.split(' ')]
    train_word2vec.append(np.mean(one_line_vec, axis=0))

for sentence in test_apis:
    one_line_vec = [word2vec.wv[key] for key in sentence.split(' ')]
    test_word2vec.append(np.mean(one_line_vec, axis=0))

# print(len(train_word2vec))  # 13887
# print(len(test_word2vec))   # 12955

# Save the word2vec vectors
with open("word2vec/word2vec.pkl", 'wb') as f:
    pickle.dump(train_word2vec, f)
    pickle.dump(test_word2vec, f)

word2vec (100-dim) on its own: 0.740139

train-mlogloss:0.114525 val-mlogloss:0.449284

Averaging the word2vec (100-dim) and ngram predictions: 0.520576

(ngram alone scores 0.471710)

Concatenating the ngram and word2vec vectors

import pickle
import numpy as np

# Load the word2vec vectors
with open("word2vec/word2vec.pkl", 'rb') as f:
    train_word2vec = pickle.load(f)  # list, 13887 entries
    test_word2vec = pickle.load(f)   # list, 12955 entries

# Load the ngram vectors
with open("ngram_model/ngram_vec_original_data.pkl", 'rb') as f:
    train_ngram = pickle.load(f)  # (13887, 180858) csr_matrix
    y_train = pickle.load(f)      # (13887,)
    test_ngram = pickle.load(f)   # (12955, 180858) csr_matrix
    test_nums = pickle.load(f)    # list, 12955 entries

# Convert the word2vec vectors to ndarray
train_word2vec = np.array(train_word2vec)  # (13887, 100) ndarray
test_word2vec = np.array(test_word2vec)    # (12955, 100) ndarray

# Concatenate the training-set features
train_ngram = train_ngram.A.astype('float32')  # (13887, 180858) ndarray
x_train = np.concatenate((train_ngram, train_word2vec), axis=1)  # (13887, 180958)
# pickle.dump(x_train, open("x_train.pkl", 'wb'), protocol=4)

# Concatenate the test-set features
test_ngram = test_ngram.A.astype('float32')
x_test = np.concatenate((test_ngram, test_word2vec), axis=1)  # (12955, 180958)
# pickle.dump(x_test, open("x_test.pkl", 'wb'), protocol=4)

The ngram vectors come out as a csr_matrix; calling .A converts them to a dense ndarray, and casting to float32 roughly halves the memory footprint compared with float64.
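An alternative worth noting (not what was done above): scipy.sparse.hstack concatenates the features without ever densifying the ngram matrix, avoiding the large .A conversion entirely. The shapes below are illustrative:

```python
import numpy as np
from scipy import sparse

# Illustrative shapes: 3 samples, 4 sparse ngram features, 2 word2vec dims.
train_ngram = sparse.csr_matrix(
    np.array([[1., 0., 0., 2.],
              [0., 3., 0., 0.],
              [0., 0., 4., 0.]], dtype='float32'))
train_word2vec = np.ones((3, 2), dtype='float32')

# hstack keeps the result sparse; only the word2vec columns store dense data.
x_train = sparse.hstack([train_ngram, sparse.csr_matrix(train_word2vec)],
                        format='csr')
```

XGBoost accepts csr_matrix input directly, so the concatenated features can be fed to training without a dense copy.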

References

Confusion matrices and the use of the confusion_matrix function

Plotting a confusion matrix with Python (confusion_matrix)
