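This post covers a competition hosted by AI研习社.
The goal is to classify electrocardiogram (ECG) recordings as normal or abnormal. There are two classes: normal = 0 and abnormal = 1.
After downloading the dataset, open ptbdb_train.csv.
It contains 7000 rows. In each row, the first 187 columns are the ECG signal and the last column is the label.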
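As a minimal sketch of that layout, the snippet below splits a table shaped like ptbdb_train.csv into signals and labels. It uses a small random array as a stand-in for the real file, so the values and the helper name `split_features_labels` are assumptions for illustration only.

```python
import numpy as np

N_FEATURES = 187  # ECG samples per row, as described above


def split_features_labels(table):
    """Split rows of [187 signal values, label] into (X, y)."""
    table = np.asarray(table, dtype=float)
    X = table[:, :N_FEATURES]        # first 187 columns: the signal
    y = table[:, -1].astype(int)     # last column: the 0/1 label
    return X, y


# Synthetic stand-in for the CSV: 5 rows, 187 signal columns + 1 label column
rng = np.random.default_rng(0)
fake = np.hstack([rng.random((5, N_FEATURES)), rng.integers(0, 2, size=(5, 1))])
X, y = split_features_labels(fake)
print(X.shape, y.shape)
```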
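This is a binary classification task. Because the data is a time series, where each value depends on its neighbors, I decided to use one-dimensional convolutions. An LSTM would also work, but I am not very familiar with LSTMs.
The work breaks down into the following modules:
Data preprocessing
Model construction
Model training
Model testing
Data preprocessing has two parts: 1. converting the labels to one-hot vectors, and 2. reading the data.
Each row has one label, 0 or 1, and the network built later has two outputs, so the labels are converted to one-hot form.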
import numpy as np

# Convert a label to a one-hot vector
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[int(index)] = 1
    return hot
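For a whole batch of labels, the same conversion can be done in one step by indexing an identity matrix. This is a hedged alternative sketch, not the code used in this post; the function name `labels_to_one_hot` is hypothetical.

```python
import numpy as np


def labels_to_one_hot(labels, num_classes=2):
    """Convert an iterable of class indices (ints or floats) to one-hot rows."""
    idx = np.asarray(labels).astype(int)
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes)[idx]


batch = labels_to_one_hot([0.0, 1.0, 1.0])
print(batch)
```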
The data-reading part covers both the training data and the test data. Both use Python generators (yield) to produce batches.
import math

# Generate training/validation batches
def train_gen(df, batch_size=20, train=True):
    img_list = np.array(df)
    # Number of batches per epoch (the same for training and validation)
    steps = math.ceil(img_list.shape[0] / batch_size)
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            # Shuffle the rows inside this batch (features and label move together)
            np.random.shuffle(batch_list)
            batch_x = np.array(batch_list[:, :-1])
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, -1]])
            yield batch_x, batch_y
# Generate test batches
def test_gen(df, batch_size=20):
    img_list = np.array(df)
    steps = math.ceil(len(img_list) / batch_size)  # number of batches per pass
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            batch_x = np.array(batch_list)
            print(batch_x.shape)
            yield batch_x
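Note that train_gen only shuffles rows within each fixed batch, so every epoch sees the same groups of rows. One common alternative, sketched here as an assumption rather than the method used in this post, is to reshuffle all row indices at the start of each epoch. The name `shuffled_batches` is hypothetical, and the demo table is random data.

```python
import math

import numpy as np


def shuffled_batches(data, batch_size=20, seed=0):
    """Yield (x, y_onehot) batches forever, reshuffling all rows every epoch."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    steps = math.ceil(len(data) / batch_size)
    while True:
        order = rng.permutation(len(data))  # new row order for this epoch
        for i in range(steps):
            rows = data[order[i * batch_size:(i + 1) * batch_size]]
            x = rows[:, :-1]                          # 187 signal values
            y = np.eye(2)[rows[:, -1].astype(int)]    # one-hot labels
            yield x, y


# Tiny synthetic table: 50 rows, 187 signal columns, last column is the label
demo = np.hstack([np.random.rand(50, 187), np.random.randint(0, 2, (50, 1))])
gen = shuffled_batches(demo, batch_size=20)
x, y = next(gen)
print(x.shape, y.shape)
```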
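The model is built with the Keras framework. Its Sequential API makes it easy to build a layer-by-layer model, so that is what is used here.
Each sample is a time series of length 187.
Below is a model assembled rather arbitrarily from one-dimensional convolutions. It needs tuning to fit the data better.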
from keras.models import Sequential
from keras.layers import Reshape, Conv1D, GlobalAveragePooling1D, Dropout, Dense

TIME_PERIODS = 187
def build_model(input_shape=(TIME_PERIODS,), num_classes=2):
    model = Sequential()
    # Add a channel axis: (187,) -> (187, 1)
    model.add(Reshape((TIME_PERIODS, 1), input_shape=input_shape))
    model.add(Conv1D(16, 8, strides=2, activation='relu', input_shape=(TIME_PERIODS, 1)))
    model.add(Conv1D(16, 8, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model
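To see what the strides do to the sequence length, the following sketch applies the standard output-length formulas for "valid" and "same" padding to the layer settings above. It only does arithmetic and does not build the Keras model; the helper `conv_out_len` is hypothetical.

```python
import math


def conv_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution (dilation 1)."""
    if padding == "same":
        return math.ceil(length / stride)
    return (length - kernel) // stride + 1  # "valid" padding


# (kernel_size, stride, padding) for the eight Conv1D layers in build_model;
# the first layer sets no padding argument, so it uses the default "valid"
layers = [(8, 2, "valid"), (8, 2, "same"),
          (4, 2, "same"), (4, 2, "same"),
          (4, 2, "same"), (4, 2, "same"),
          (2, 1, "same"), (2, 1, "same")]

length = 187
lengths = []
for kernel, stride, padding in layers:
    length = conv_out_len(length, kernel, stride, padding)
    lengths.append(length)
print(lengths)  # [90, 45, 23, 12, 6, 3, 3, 3]
```

Under these formulas the 187-step signal shrinks to 3 time steps before global average pooling, which may be one thing to revisit when tuning the architecture.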
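With the data ready and the model built, training can begin. Here sklearn's KFold is used for cross-validated training.
The data is split into 10 folds, so each training run uses 90% of the data for training and 10% for validation.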
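The 90/10 split can be illustrated with NumPy alone. The sketch below mimics a shuffled 10-fold split; it is a stand-in for illustration, not sklearn's KFold implementation, and `kfold_indices` is a hypothetical helper.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=10, seed=233):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        # Every fold except the k-th one is used for training
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx


splits = list(kfold_indices(7000))
sizes = [(len(tr), len(va)) for tr, va in splits]
print(sizes[0])  # (6300, 700)
```

With 7000 rows and a batch size of 20, each fold trains on 6300 rows (315 batches per epoch) and validates on 700 rows (35 batches).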
import pandas as pd
import keras
from keras.optimizers import Adam
from sklearn.model_selection import KFold

data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
train_data = pd.read_csv(data_path[0], header=None)
train_data_df = pd.DataFrame(train_data)
batch_size = 20
skf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(train_data_df, train_data_df)):
    train_data = train_data_df.iloc[train_idx]
    val_data = train_data_df.iloc[val_idx]
    len_train = train_data.shape[0]
    len_val = val_data.shape[0]
    train_iterr = train_gen(train_data, batch_size, True)
    val_iterr = train_gen(val_data, batch_size, False)
    ckpt = keras.callbacks.ModelCheckpoint(
        # Checkpoint file name; training leaves 10 model files, one per fold.
        filepath='best_model_{}.h5'.format(fold_idx),
        monitor='val_loss', save_best_only=True, verbose=1)
    model = build_model()
    # Use the Adam optimizer; other optimizers can be swapped in here
    opt = Adam(0.0002)
    model.compile(loss='categorical_crossentropy',
                  optimizer=opt, metrics=['accuracy'])
    model.summary()
    model.fit_generator(
        generator=train_iterr,
        steps_per_epoch=len_train // batch_size,
        epochs=100,
        initial_epoch=0,
        validation_data=val_iterr,
        # Number of validation batches per epoch
        # (validation_steps is the Keras 2 name; nb_val_samples is the old one)
        validation_steps=len_val // batch_size,
        callbacks=[ckpt],
    )
The training output looks like this:
Epoch 00001: val_loss improved from inf to 0.36047, saving model to best_model_0.h5
Epoch 2/100
315/315 [==============================] - 3s 11ms/step - loss: 0.4000 - accuracy: 0.8365 - val_loss: 0.3006 - val_accuracy: 0.8700
Epoch 00002: val_loss improved from 0.36047 to 0.30064, saving model to best_model_0.h5
Epoch 3/100
315/315 [==============================] - 3s 11ms/step - loss: 0.3306 - accuracy: 0.8684 - val_loss: 0.2190 - val_accuracy: 0.8771
Epoch 00003: val_loss improved from 0.30064 to 0.21901, saving model to best_model_0.h5
Epoch 4/100
315/315 [==============================] - 3s 9ms/step - loss: 0.2674 - accuracy: 0.8948 - val_loss: 0.1419 - val_accuracy: 0.9014
Epoch 00004: val_loss improved from 0.21901 to 0.14192, saving model to best_model_0.h5
Epoch 5/100
315/315 [==============================] - 4s 13ms/step - loss: 0.2161 - accuracy: 0.9181 - val_loss: 0.1205 - val_accuracy: 0.9271
Epoch 00005: val_loss improved from 0.14192 to 0.12052, saving model to best_model_0.h5
Epoch 6/100
315/315 [==============================] - 4s 11ms/step - loss: 0.1775 - accuracy: 0.9346 - val_loss: 0.1750 - val_accuracy: 0.9257
Training produces 10 models. Each one is run on the test set, the predictions are combined by voting, and the result is written to a csv file.
from keras.models import load_model

batch_size = 20
result = np.zeros(shape=(1000,))
data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
test_data = pd.read_csv(data_path[1], header=None)
test_data_df = pd.DataFrame(test_data)
for i in range(10):
    h5 = './best_model_{}.h5'.format(i)
    model = load_model(h5)
    # Create a fresh test generator for each model: Keras may prefetch batches
    # beyond `steps`, so a shared generator could start the next model mid-pass.
    test_iter = test_gen(test_data_df, batch_size=batch_size)
    pres = model.predict_generator(generator=test_iter, steps=math.ceil(1000 / batch_size), verbose=1)
    print('pres.shape is {}'.format(pres.shape))
    ohpres = np.argmax(pres, axis=1)
    print('ohpres.shape is {}'.format(ohpres.shape))
    print(type(ohpres))
    # Accumulate the number of models voting for class 1
    result += ohpres
    print('result is {}'.format(result))
# Majority vote: predict 1 only when more than 5 of the 10 models vote 1
result = [1.0 if result[i] > 5 else 0.0 for i in range(len(result))]
df = pd.DataFrame()
df["id"] = np.arange(0, len(result))
df["label"] = result
df.to_csv("submmit.csv", header=None, index=None)
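The voting step can be checked in isolation. The sketch below uses made-up predictions from 10 hypothetical models on 4 samples. It also shows averaging class probabilities as an alternative that is not used in this post; both function names are hypothetical.

```python
import numpy as np


def majority_vote(votes, threshold=5):
    """votes: (n_models, n_samples) array of 0/1 predictions."""
    # A sample is labeled 1 only if strictly more than `threshold` models say 1
    return (votes.sum(axis=0) > threshold).astype(float)


def average_probs(probs):
    """probs: (n_models, n_samples, n_classes) array of softmax outputs."""
    return probs.mean(axis=0).argmax(axis=1)


# 6 models predict [1, 0, 1, 0] and 4 models predict [0, 0, 1, 1]
votes = np.array([[1, 0, 1, 0]] * 6 + [[0, 0, 1, 1]] * 4)
voted = majority_vote(votes)
print(voted)

# A 5-5 split on the first and last sample
tie_votes = np.array([[1, 0, 1, 0]] * 5 + [[0, 0, 1, 1]] * 5)
tied = majority_vote(tie_votes)
print(tied)

# Two models, two samples, two classes
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.6, 0.4], [0.2, 0.8]]])
averaged = average_probs(probs)
print(averaged)
```

With the `> 5` threshold, a 5-5 tie goes to class 0.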
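The whole pipeline works end to end, but the strong local validation results did not hold up online: the submission scored only 55. The model architecture is not well designed and needs to change.
I will clean up the code and upload it to GitHub later as a record.
Game Over!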