Heartbeat Anomaly Detection: Training a 1D CNN with Keras and K-Fold Cross-Validation

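This was a competition hosted by AI研习社.
The goal is to decide whether an ECG recording is normal. There are two classes: normal = 0, abnormal = 1.
After downloading the dataset and opening ptbdb_train.csv, you will find 7000 rows. In each row, the first 187 columns are the ECG signal and the last column is the label.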
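
As a minimal sketch of that row layout, the snippet below builds a small random stand-in for the CSV contents (the real file is not read here) and splits it into the 187 signal columns and the label column. The names `fake_rows`, `signals`, and `labels` are made up for illustration.

```python
import numpy as np

# Stand-in for the CSV contents: 5 rows, each with 187 signal values + 1 label.
# The values are random and only illustrate the column layout.
rng = np.random.default_rng(0)
fake_rows = np.hstack([
    rng.random((5, 187)),               # ECG samples
    rng.integers(0, 2, size=(5, 1)),    # label: 0 = normal, 1 = abnormal
])

signals = fake_rows[:, :-1]   # first 187 columns
labels = fake_rows[:, -1]     # last column

print(signals.shape, labels.shape)
```

With the real data, the same slicing would apply to the array obtained from the loaded DataFrame.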

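This is a binary classification task. Because the data is a time series, where neighboring points depend on each other, I decided to use a one-dimensional convolution (1D CNN). An LSTM would also work, but I am not very familiar with LSTMs.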

The work breaks down into the following modules:
Data preprocessing
Model construction
Model training
Model testing

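Data preprocessing

Data preprocessing has two parts: 1. converting the label to a one-hot vector, and 2. reading the data.

1. Converting the label to a one-hot vector

Each row of data has one label, with a value of 0 or 1. Because the network built later has two outputs, the label is converted to one-hot form.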

import numpy as np

# Convert a label to a one-hot vector of length Lens
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[int(index)] = 1
    return hot
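
The same conversion can also be done for a whole batch of labels at once. The sketch below is an alternative to the per-label function, not the code used in this post; `labels_to_one_hot` is a hypothetical helper name, and it assumes the labels are integer-valued (possibly stored as floats, as they are after reading the CSV).

```python
import numpy as np

def labels_to_one_hot(labels, num_classes):
    """Convert an array of integer-valued labels to one-hot rows."""
    idx = np.asarray(labels).astype(int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[idx]

one_hot = labels_to_one_hot([0.0, 1.0, 1.0], 2)
print(one_hot)
```

Indexing the identity matrix avoids a Python-level loop over the labels.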

2. Reading the data

Data reading covers both the training data and the test data. Batches are produced by generator functions built on yield.

import math

# Generate batches of (features, one-hot labels) for training or validation
def train_gen(df, batch_size=20, train=True):
    img_list = np.array(df)    # copy of the DataFrame values
    steps = math.ceil(img_list.shape[0] / batch_size)    # number of batches per epoch
    while True:
        if train:
            # Reshuffle the rows at the start of every pass (training only)
            np.random.shuffle(img_list)
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            batch_x = batch_list[:, :-1]    # 187 signal values
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, -1]])

            yield batch_x, batch_y

# Generate batches of test data (features only, no labels)
def test_gen(df, batch_size=20):
    img_list = np.array(df)
    steps = math.ceil(len(img_list) / batch_size)    # number of batches per pass
    while True:
        for i in range(steps):
            batch_x = img_list[i * batch_size : i * batch_size + batch_size]
            print(batch_x.shape)
            yield batch_x
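
To see how such an endless generator behaves, here is a small self-contained sketch. `batch_gen` is a simplified, hypothetical stand-in for the generators above (no one-hot conversion, no shuffling), fed with fake data. It shows that the last batch of a pass can be shorter and that the generator then wraps around to the start.

```python
import itertools
import math
import numpy as np

def batch_gen(data, batch_size):
    """Endlessly yield consecutive (features, label) batches from a 2-D array."""
    steps = math.ceil(len(data) / batch_size)
    while True:
        for i in range(steps):
            batch = data[i * batch_size:(i + 1) * batch_size]
            yield batch[:, :-1], batch[:, -1]

data = np.zeros((50, 188))    # 50 fake rows: 187 values + 1 label
gen = batch_gen(data, batch_size=20)

# Take four batches: three cover one pass over the data, the fourth wraps around.
sizes = [x.shape[0] for x, _ in itertools.islice(gen, 4)]
print(sizes)  # [20, 20, 10, 20]
```

Because the generator never stops on its own, the caller has to say how many batches make up one pass, which is what the step-count arguments in the training and testing code are for.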

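Model construction

The model is built with Keras. Its Sequential API makes it very convenient to build a layer-by-layer model, so that is what I use here.
Each sample is a time series of length 187.
Below is a model I put together fairly arbitrarily from 1D convolutions. It needs tuning to fit the data better.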

from keras.models import Sequential
from keras.layers import (Reshape, Conv1D, MaxPooling1D,
                          GlobalAveragePooling1D, Dropout, Dense)

TIME_PERIODS = 187
def build_model(input_shape=(TIME_PERIODS,), num_classes=2):
    model = Sequential()
    # Add a channel axis: (187,) -> (187, 1)
    model.add(Reshape((TIME_PERIODS, 1), input_shape=input_shape))
    # The first convolution uses the default 'valid' padding
    model.add(Conv1D(16, 8, strides=2, activation='relu'))

    model.add(Conv1D(16, 8, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))

    # Average over the remaining time steps, then classify
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model
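
When tuning a stack of strided convolutions like this, it helps to track how quickly the time axis shrinks. The sketch below computes the output length after each Conv1D layer using the standard length rules for 'valid' padding (floor((L - k) / s) + 1) and 'same' padding (ceil(L / s)). `conv1d_out_len` is a hypothetical helper, and the layer list is copied by hand from build_model.

```python
import math

def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution along the time axis."""
    if padding == "same":
        return math.ceil(length / stride)
    # "valid": no padding is added
    return (length - kernel) // stride + 1

# (kernel_size, strides, padding) for each Conv1D layer in build_model
layers = [
    (8, 2, "valid"),
    (8, 2, "same"),
    (4, 2, "same"),
    (4, 2, "same"),
    (4, 2, "same"),
    (4, 2, "same"),
    (2, 1, "same"),
    (2, 1, "same"),
]

length = 187
lengths = []
for kernel, stride, padding in layers:
    length = conv1d_out_len(length, kernel, stride, padding)
    lengths.append(length)

print(lengths)  # [90, 45, 23, 12, 6, 3, 3, 3]
```

With six stride-2 layers, only 3 time steps remain by the time the signal reaches global average pooling.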

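Model training

With the data prepared and the model defined, training can begin. Here I use sklearn's KFold for cross-validated training.
I use 10 folds, so in each run 90% of the data is used for training and 10% for validation.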
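
To make the 90%/10% split concrete, here is a small NumPy-only sketch of how a shuffled K-fold split partitions row indices. It illustrates the idea and is not sklearn's implementation; `kfold_indices` is a hypothetical helper name.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, val_idx) pairs from a shuffled, evenly split index array."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        # Every fold except the k-th one is used for training
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, val_idx

sizes = [(len(tr), len(va)) for tr, va in kfold_indices(7000, 10, seed=233)]
print(sizes[0])  # (6300, 700)
```

With 7000 rows and a batch size of 20, the 6300 training rows in each fold give 315 batches per epoch, which matches the 315/315 shown in the training log.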

import math

import keras
import pandas as pd
from keras.optimizers import Adam
from sklearn.model_selection import KFold

data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']

train_data_df = pd.read_csv(data_path[0], header=None)

batch_size = 20
kf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(kf.split(train_data_df)):
    train_data = train_data_df.iloc[train_idx]
    val_data = train_data_df.iloc[val_idx]
    len_train = train_data.shape[0]
    len_val = val_data.shape[0]

    train_iterr = train_gen(train_data, batch_size, True)
    val_iterr = train_gen(val_data, batch_size, False)

    ckpt = keras.callbacks.ModelCheckpoint(
        # Checkpoint file name; training produces 10 model files, one per fold.
        filepath='best_model_{}.h5'.format(fold_idx),
        monitor='val_loss', save_best_only=True, verbose=1)

    model = build_model()
    # Adam optimizer; other optimizers can be swapped in here
    opt = Adam(0.0002)
    model.compile(loss='categorical_crossentropy',
                  optimizer=opt, metrics=['accuracy'])
    model.summary()

    # Note: fit_generator is deprecated in newer TensorFlow/Keras releases,
    # where model.fit accepts generators directly.
    model.fit_generator(
        generator=train_iterr,
        steps_per_epoch=math.ceil(len_train / batch_size),
        epochs=100,
        initial_epoch=0,
        validation_data=val_iterr,
        validation_steps=math.ceil(len_val / batch_size),
        callbacks=[ckpt],
    )

The training output looks like this:

Epoch 00001: val_loss improved from inf to 0.36047, saving model to best_model_0.h5
Epoch 2/100
315/315 [==============================] - 3s 11ms/step - loss: 0.4000 - accuracy: 0.8365 - val_loss: 0.3006 - val_accuracy: 0.8700

Epoch 00002: val_loss improved from 0.36047 to 0.30064, saving model to best_model_0.h5
Epoch 3/100
315/315 [==============================] - 3s 11ms/step - loss: 0.3306 - accuracy: 0.8684 - val_loss: 0.2190 - val_accuracy: 0.8771

Epoch 00003: val_loss improved from 0.30064 to 0.21901, saving model to best_model_0.h5
Epoch 4/100
315/315 [==============================] - 3s 9ms/step - loss: 0.2674 - accuracy: 0.8948 - val_loss: 0.1419 - val_accuracy: 0.9014

Epoch 00004: val_loss improved from 0.21901 to 0.14192, saving model to best_model_0.h5
Epoch 5/100
315/315 [==============================] - 4s 13ms/step - loss: 0.2161 - accuracy: 0.9181 - val_loss: 0.1205 - val_accuracy: 0.9271

Epoch 00005: val_loss improved from 0.14192 to 0.12052, saving model to best_model_0.h5
Epoch 6/100
315/315 [==============================] - 4s 11ms/step - loss: 0.1775 - accuracy: 0.9346 - val_loss: 0.1750 - val_accuracy: 0.9257


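Model testing

After training there are 10 models. Each one is run on the test set, the predictions are combined by voting, and the result is written to a CSV file.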

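import math

import numpy as np
import pandas as pd
from keras.models import load_model

batch_size = 20
data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
test_data_df = pd.read_csv(data_path[1], header=None)
num_test = len(test_data_df)    # number of test rows
result = np.zeros(shape=(num_test,))

for i in range(10):
    h5 = './best_model_{}.h5'.format(i)
    model = load_model(h5)
    # Create a fresh test generator for every model so that each prediction run
    # starts at the first test row (Keras may prefetch extra batches from a
    # shared generator, which would misalign the next model's predictions).
    test_iter = test_gen(test_data_df, batch_size=batch_size)
    # Note: predict_generator is deprecated in newer TensorFlow/Keras releases,
    # where model.predict accepts generators directly.
    pres = model.predict_generator(generator=test_iter,
                                   steps=math.ceil(num_test / batch_size), verbose=1)
    print('pres.shape is {}'.format(pres.shape))
    ohpres = np.argmax(pres, axis=1)    # predicted class per test row
    print('ohpres.shape is {}'.format(ohpres.shape))
    result += ohpres    # count how many models voted for class 1
    print('result is {}'.format(result))

# Majority vote: label 1 only if more than 5 of the 10 models predicted 1
result = [1.0 if r > 5 else 0.0 for r in result]
df = pd.DataFrame()
df["id"] = np.arange(0, len(result))
df["label"] = result
df.to_csv("submmit.csv", header=None, index=None)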
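
The voting step can be checked in isolation. The sketch below uses a made-up vote matrix for 10 models and 4 test samples (no real predictions are involved) and applies the same "more than 5 votes" rule as the code above. The names `positive_counts`, `votes`, and `final` are illustrative.

```python
import numpy as np

# Hypothetical number of models (out of 10) that predict class 1 for each of 4 samples.
positive_counts = [8, 2, 5, 6]

# Build a 10 x 4 matrix of hard predictions: row = model, column = test sample.
votes = np.array([[1 if m < c else 0 for c in positive_counts] for m in range(10)])

vote_sum = votes.sum(axis=0)              # votes for class 1 per sample
final = (vote_sum > 5).astype(float)      # same strict threshold as above

print(vote_sum)  # [8 2 5 6]
print(final)     # [1. 0. 0. 1.]
```

Because the comparison is strict, a 5-to-5 tie (the third sample here) is labeled 0.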

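The whole pipeline runs end to end. However, results that looked strong offline came out at only 55 online... The model architecture is not well designed and needs to change.
I will tidy up the code and upload it to GitHub later as a record.
Game Over!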
