Python: splitting all_data.txt into train_data.txt, val_data.txt, and test_data.txt

Sample lines from all_data.txt (one tab-separated triple per line):

小儿珠蛋白生成障碍性贫血	acompany_with	腹水
病毒性感冒	acompany_with	鼻炎
其他类精神活性物质依赖	acompany_with	梅毒
炎性乳腺癌	acompany_with	败血症
小儿选择性免疫球蛋白A缺乏症	acompany_with	萎缩性胃炎
烟雾病	acompany_with	脑梗死
异位急性阑尾炎	acompany_with	败血症
老年人高血压	acompany_with	心肌梗死
......

The file has 312,215 lines in total. 80% of them go to the training set, 10% to the validation set, and 10% to the test set.
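As a quick check of the arithmetic, here is a minimal sketch of the split sizes for 312,215 lines. It assumes the remainder left over after integer division is given to the test set; the variable names are illustrative only.

```python
total = 312215

# Integer arithmetic avoids floating-point rounding surprises.
n_train = total * 8 // 10          # 80% for training
n_val = total // 10                # 10% for validation
n_test = total - n_train - n_val   # whatever is left goes to test

print(n_train, n_val, n_test)
```

The three sizes always add up to the total, so no line is lost or counted twice.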

Method 1: too slow

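import random

# Sample 90% of the line indices for train + validation, then draw the
# validation indices (10% of the file) from that sample.
train_val_id_list = random.sample(range(312215), 31221 * 9)
val_id_list = random.sample(train_val_id_list, 31221)

idx = 0
with open('all_data.txt', 'r', encoding='utf-8') as fr:
    lines = fr.readlines()
    for line in lines:
        # Why this is slow: `in` on a list scans up to ~280,000 elements
        # for every line, and the output file is reopened for every line.
        if idx not in train_val_id_list:
            with open('test.txt', 'a', encoding='utf-8') as fw:
                fw.write(line)
        elif idx in val_id_list:
            with open('val.txt', 'a', encoding='utf-8') as fw:
                fw.write(line)
        else:
            with open('train.txt', 'a', encoding='utf-8') as fw:
                fw.write(line)
        idx += 1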
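Method 1 can be made fast without any third-party library. The sketch below is one possible variant, not the original code: it shuffles a copy of the lines once, slices it into three parts, and opens each output file a single time. The helper names `three_way_split` and `write_lines` and the `seed` parameter are assumptions made for illustration.

```python
import random


def three_way_split(lines, train_frac=0.8, val_frac=0.1, seed=None):
    """Shuffle a copy of `lines` and slice it into train/val/test lists."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to test
    return train, val, test


def write_lines(path, lines):
    """Open the output file once and write every line in a single call."""
    with open(path, "w", encoding="utf-8") as fw:
        fw.writelines(lines)


# Small in-memory demo so the block runs without all_data.txt.
demo = [f"line {i}\n" for i in range(10)]
train, val, test = three_way_split(demo, seed=0)
print(len(train), len(val), len(test))
```

For the real file, read the lines with `open('all_data.txt', encoding='utf-8')`, pass them to `three_way_split`, and call `write_lines('train.txt', train)` and so on for the other two splits.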

Method 2: fast

# Use train_test_split() to randomly split a fixed-size dataset by a given
# ratio (e.g. 9:1) into a training set and a test set,
# then write each split to its own file.
 
import pandas as pd
from sklearn.model_selection import train_test_split



# Read every line once; the `with` block closes the file automatically.
with open('all_data.txt', 'r', encoding='utf-8') as fr:
    all_lines = fr.readlines()

# Split off 80% for training, then halve the remaining 20% into
# validation and test (10% of the original data each).
train_data, val_test_data = train_test_split(all_lines, train_size=0.8, test_size=0.2)
val_data, test_data = train_test_split(val_test_data, train_size=0.5, test_size=0.5)
print("len(train_data) = ", len(train_data))
print("len(val_data) = ", len(val_data))
print("len(test_data) = ", len(test_data))

# Open each output file once. Mode 'w' overwrites, so rerunning the
# script does not append duplicate lines to an existing split.
with open('train.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(train_data)

with open('val.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(val_data)

with open('test.txt', 'w', encoding='utf-8') as fw:
    fw.writelines(test_data)


# The same call also works on a pandas DataFrame loaded from a CSV file:
# all_data=pd.read_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12.csv")
# train_data, test_data = train_test_split(all_data, train_size=0.9, test_size=0.1)

# train_data.to_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12_9.csv",index=False)
# test_data.to_csv("D:/shujuji/MobiAct_Dataset_v2.0/kalman/train/train12_1.csv",index=False)
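Without a seed, train_test_split() produces a different split on every run. A minimal sketch, assuming scikit-learn is installed, that passes `random_state` so the 80/10/10 split is reproducible. The function name `split_80_10_10` and the seed value 42 are illustrative choices, not part of the original code.

```python
from sklearn.model_selection import train_test_split


def split_80_10_10(lines, seed=42):
    """Return (train, val, test) lists; the same seed gives the same split."""
    train, val_test = train_test_split(
        lines, train_size=0.8, test_size=0.2, random_state=seed
    )
    val, test = train_test_split(
        val_test, train_size=0.5, test_size=0.5, random_state=seed
    )
    return train, val, test


# Small in-memory demo so the block runs without all_data.txt.
demo = [f"line {i}\n" for i in range(10)]
train, val, test = split_80_10_10(demo)
print(len(train), len(val), len(test))
```

Fixing the seed matters when results on the validation or test set need to be compared across runs, since otherwise each run evaluates on different lines.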
 
