all_data.txt
小儿珠蛋白生成障碍性贫血 acompany_with 腹水
病毒性感冒 acompany_with 鼻炎
其他类精神活性物质依赖 acompany_with 梅毒
炎性乳腺癌 acompany_with 败血症
小儿选择性免疫球蛋白A缺乏症 acompany_with 萎缩性胃炎
烟雾病 acompany_with 脑梗死
异位急性阑尾炎 acompany_with 败血症
老年人高血压 acompany_with 心肌梗死
......
312,215 lines in total; 80% are used as the training set, 10% as the validation set, and 10% as the test set.
import random

# sample 90% of the line indices for train+val, then 10% of those for val
train_val_id_list = random.sample(range(312215), 31221 * 9)
val_id_list = random.sample(train_val_id_list, 31221)
train_val_ids = set(train_val_id_list)  # sets give O(1) membership tests
val_ids = set(val_id_list)

# open each output file once in 'w' mode (appending would accumulate
# duplicates across runs) instead of reopening it for every line
with open('all_data.txt', 'r', encoding='utf-8') as fr, \
     open('train.txt', 'w', encoding='utf-8') as f_train, \
     open('val.txt', 'w', encoding='utf-8') as f_val, \
     open('test.txt', 'w', encoding='utf-8') as f_test:
    for idx, line in enumerate(fr):
        if idx not in train_val_ids:
            f_test.write(line)
        elif idx in val_ids:
            f_val.write(line)
        else:
            f_train.write(line)
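The index-sampling step above can be sanity-checked without touching the data file: the three index sets should be pairwise disjoint and should together cover every line. A minimal sketch (the fixed seed is an assumption for reproducibility, not part of the original):

```python
import random

n_total = 312215
rng = random.Random(0)  # assumed fixed seed so the check is reproducible
train_val_ids = set(rng.sample(range(n_total), 31221 * 9))
val_ids = set(rng.sample(sorted(train_val_ids), 31221))
train_ids = train_val_ids - val_ids          # 90% minus the validation 10%
test_ids = set(range(n_total)) - train_val_ids

print(len(train_ids), len(val_ids), len(test_ids))
# the three splits are pairwise disjoint and cover all indices
assert train_ids | val_ids | test_ids == set(range(n_total))
```

Note that the sizes come out slightly uneven (249,768 / 31,221 / 31,226) because 312,215 is not divisible by 10; the sampling counts above simply truncate.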
# Use train_test_split() to randomly split the dataset by a fixed ratio
# (e.g. 8:1:1 into training, validation and test sets),
# then write each split to its own file under a fixed directory.
import pandas as pd  # only needed for the commented-out CSV example below
from sklearn.model_selection import train_test_split

with open('all_data.txt', 'r', encoding='utf-8') as fr:
    all_lines = fr.readlines()

# first split off 20%, then split that half-and-half into val and test
train_data, val_test_data = train_test_split(all_lines, train_size=0.8, test_size=0.2)
val_data, test_data = train_test_split(val_test_data, train_size=0.5, test_size=0.5)
print("len(train_data) = ", len(train_data))
print("len(val_data) = ", len(val_data))
print("len(test_data) = ", len(test_data))

# 'w' rather than 'a' so that re-running the script does not append duplicates
with open('train.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(train_data)
with open('val.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(val_data)
with open('test.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(test_data)
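One caveat with `train_test_split`: it shuffles with a fresh random state on every call, so each run produces a different split. Passing `random_state` pins the shuffle. A minimal sketch on dummy data:

```python
from sklearn.model_selection import train_test_split

data = list(range(100))
# same random_state => identical split on every run
train_a, test_a = train_test_split(data, train_size=0.8, random_state=42)
train_b, test_b = train_test_split(data, train_size=0.8, random_state=42)
assert train_a == train_b and test_a == test_b
print(len(train_a), len(test_a))  # 80 20
```

For the 8:1:1 pipeline above, passing the same `random_state` to both calls would make the whole split reproducible.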
# all_data=pd.read_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12.csv")
# train_data, test_data = train_test_split(all_data, train_size=0.9, test_size=0.1)
# train_data.to_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12_9.csv",index=False)
# test_data.to_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12_1.csv",index=False)
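As an alternative to both approaches above, a single shuffle followed by slicing also yields an 8:1:1 split, with no index sampling and no scikit-learn dependency. A minimal sketch (the helper name `split_dataset` is hypothetical, not from the original):

```python
import random

def split_dataset(lines, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle once, then slice into train/val/test (the remainder is test)."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset([f"line {i}\n" for i in range(10)])
print(len(train), len(val), len(test))  # 8 1 1
```

Because the slices partition one shuffled copy, every line lands in exactly one split by construction.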