

We live in an age where machines use huge amounts of data to try to create a better world. Applications range from predicting crime rates in an area using machine learning to conversing with humans using natural language processing. In this blog article, I am going to take you through a real-world data science problem that I picked from the UCI Machine Learning Repository and demonstrate my way of solving it. This case study works through everything from scratch: starting with data analysis, moving on to feature engineering, and finishing with model building using both machine learning and deep learning models.

Problem Statement


It can be hard to know whether a patient will be readmitted to the hospital. A readmission might mean the patient didn’t get the best treatment during the previous admission, or that the patient was diagnosed wrongly and treated for a different disease altogether. When a patient is first seen, it is not obvious whether he/she will be readmitted, but lab reports and details about the type of patient can be very useful in predicting whether the patient will be readmitted within 30 days. The main objective of this case study is to predict whether a patient with diabetes will be readmitted to the hospital within 30 days.

Index:

  • Step-1: Mapping the real world problem to a Machine Learning Problem.


  • Step-2: Exploratory Data Analysis by performing univariate, bivariate and multivariate analysis on the data.


  • Step-3: Feature engineering by adding new features and selecting important features from the data.


  • Step-4: Creating machine learning and deep learning models to predict hospital readmission.


Step 1: Mapping the real world problem to a Machine Learning Problem

Type of Machine Learning Problem:

For a given patient, we must predict whether the patient will be readmitted within 30 days, given patient details including the diagnoses and the medications the patient has taken.


This is a binary classification problem, since the output is whether or not the patient will be readmitted within 30 days.


Error metric: F1 score and AUC (Area Under the Curve) score.



Data overview:


The dataset represents 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.


(1) It is an inpatient encounter (a hospital admission).


(2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.


(3) The length of stay was at least 1 day and at most 14 days.


(4) Laboratory tests were performed during the encounter.


(5) Medications were administered during the encounter.


The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient, and emergency visits in the year before the hospitalization.


We will be looking into each of the 50 features in detail when we perform exploratory data analysis.


Target variable: Readmitted


We will build various machine learning and deep learning models and see which provides the best result. Now let us start with exploratory data analysis.

Step 2: Exploratory Data Analysis

The very first step in solving any data science case study is to look at and analyze the data properly. This yields valuable insights and information, and statistical tools play a big role in visualizing the data well. ML engineers spend a large part of their time on a problem analyzing the data they have, and this step is important because it helps us understand the data precisely. Proper EDA reveals interesting characteristics of the data, which in turn influence our data preprocessing and model selection criteria.

Loading the data:

To load the data, we only need the diabetic_data.csv file. We will load this into a pandas dataframe.

Loading diabetic_data.csv into a dataframe
Some of the features in the dataset

There are a total of 101766 patient records with 50 features for each record. The first 5 patient records can be seen above.
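The loading step itself is not shown in the text, so here is a minimal sketch of it. It assumes diabetic_data.csv sits in the working directory; the helper name load_diabetic_data is hypothetical.

```python
import pandas as pd


def load_diabetic_data(path="diabetic_data.csv"):
    """Read the encounter-level CSV into a pandas DataFrame."""
    return pd.read_csv(path)


# Usage (assumes the file is present in the working directory):
# data = load_diabetic_data()
# print(data.shape)   # number of records and number of features
# print(data.head())  # first 5 patient records
```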


Checking multiple inpatient visits:

The data contains multiple inpatient visits for some patients. I have considered only the first encounter for each patient to determine whether or not they were readmitted within 30 days, so the duplicate rows are removed.

#creating new column duplicate which contains whether the row is duplicated or not. It is boolean.
data['duplicate'] = data['patient_nbr'].duplicated()
#only those rows which are not duplicated are kept in the dataset.
data = data[data['duplicate'] == False]

#the duplicate column is dropped.
data = data.drop(['duplicate'], axis = 1)

Removing patients who are dead or in hospice:


The dataset also contains patients who died or were discharged to hospice. In the IDs_mapping.csv provided in https://www.hindawi.com/journals/bmri/2014/781670/#supplementary-materials we can see that discharge disposition ids 11, 13, 14, 19, 20 and 21 are related to death or hospice. We should remove these samples from the data since these patients cannot be readmitted.

#removing the patients who are dead or in hospice.
data = data.loc[~data['discharge_disposition_id'].isin([11,13,14,19,20,21])]

We are left with 69,973 patients who are not dead and not in hospice.


Checking for null values in the data set:

Missing values are represented as ‘?’ in the dataset. We will replace ‘?’ with NaN and then count the NaN values in the dataset.

We will check the percentage of null values in each feature.


Some features with null values

7 features contain null values.
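The null check described above can be sketched as follows. The toy frame and its values are made up for illustration, and the helper name null_percentages is hypothetical; on the real data you would pass the loaded dataframe instead.

```python
import numpy as np
import pandas as pd


def null_percentages(df):
    """Replace the '?' placeholder with NaN and return the cleaned frame
    plus the percentage of nulls for every column that has any."""
    cleaned = df.replace("?", np.nan)
    pct = cleaned.isnull().mean() * 100
    return cleaned, pct[pct > 0].sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "Asian", "Caucasian"],
    "age": ["[70-80)", "[60-70)", "[50-60)", "[40-50)"],
})
toy_clean, toy_pct = null_percentages(toy)
print(toy_pct)
```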


We can observe that weight has the highest share of null values at 96%. Medical specialty and payer code have 48% and 43% null values respectively. The weight feature can be dropped since the percentage of null values is very high.

The missing values in the payer code and medical specialty columns can be filled in using imputation techniques, since more than 50% of the data is available in both cases. We will deal with these features later.

Univariate Analysis:



The race column consists of Caucasian, AfricanAmerican, Hispanic, Asian and Other as categories. It contains 2.7% NaN values.

Plot of race feature

We can see that most patients are Caucasian, followed by African American patients; Asian patients are the fewest. The NaN values are filled with the mode of the race feature.



The gender column tells us whether the patient is male or female.


3 values are unknown. We can either fill these values or drop the rows. Dropping the rows is the better option, since the gender in these rows is recorded as Unknown/Invalid.

Plot of gender
  • There are more female than male patients, but the difference is small.

  • I have encoded male as 1 and female as 0.




Count plot of age
  • As expected, there are fewer patients younger than 40 years than patients older than 40 years.

  • The number of patients is highest in the age group of 70–80 years.


I will be grouping the age feature into 3 categories as mentioned in the research paper (https://www.hindawi.com/journals/bmri/2014/781670/).


#custom encoding age
data.loc[data['age'] == '[0-10)', ['age']] = 0
data.loc[data['age'] == '[10-20)', ['age']] = 0
data.loc[data['age'] == '[20-30)', ['age']] = 0
data.loc[data['age'] == '[30-40)', ['age']] = 1
data.loc[data['age'] == '[40-50)', ['age']] = 1
data.loc[data['age'] == '[50-60)', ['age']] = 1
data.loc[data['age'] == '[60-70)', ['age']] = 2
data.loc[data['age'] == '[70-80)', ['age']] = 2
data.loc[data['age'] == '[80-90)', ['age']] = 2
data.loc[data['age'] == '[90-100)', ['age']] = 2

The plot after grouping age looks like this:


Count plot after grouping.

Admission_type_id:

The mappings can be obtained from the mappings file provided with the UCI dataset.


Mappings of admission type id
Count plot of admission type id

Most patients were admitted under the Emergency admission type, followed by Elective. For some patients the admission type is not available; Null and Not Mapped categories are also present.

Discharge_disposition_id:

The discharge disposition id consists of 29 categories of ids.


  • The discharge_disposition_id column is divided into 21 different categories, which were then reduced to 8 categories after careful observation.

Count plot of discharge disposition id
  • We can observe that most patients are discharged to home.

  • Patients who passed away or are in hospice are not present, since we have already removed those rows from the data.


admission_source_id:

  • The admission_source_id mappings are given in the IDs_mapping.csv file provided with the UCI dataset.

  • The categories were reduced from 17 to 8.

Count plot of admission source id
  • We can observe that the admission source for most patients is the emergency room, followed by referrals.


Time_in_hospital:

  • The time in hospital column records the length of the patient's stay, ranging from 1 day to 14 days.

Count plot of time in hospital
  • Patients stay 4 days on average, and most patients stay 3–4 days.

  • Patients rarely stay more than 12 days.

  • We can observe a positive skew in the plot.


num_lab_procedures:

Refers to the number of lab tests performed during the encounter.


  • We can observe that on average 43 lab procedures are done during a patient encounter.

  • A spike is also found near 0–2 procedures, which suggests that very few lab tests were done on some patients.


Num_procedures:

Refers to the number of procedures (other than lab tests) performed during the encounter.


Count plot of number of procedures.

Most patients do not undergo any procedures other than lab tests. A positive skew is observed.

Num_medications:

Refers to the number of distinct generic names administered during the encounter.


Plot of number of medications
  • Patients are given 16 medications on average.

  • Only 7 patients are given more than 70 medications.

  • The plot is positively skewed but otherwise resembles a normal distribution.


Number_outpatient:

Refers to the number of outpatient visits of the patient in the year preceding the encounter.


Plot of number of outpatient visits.
  • We can observe that most of the patients do not have any outpatient visits.

  • Very few patients have more than 15 outpatient visits.


Number_emergency:

Plot of number of emergency visits
  • It is similar to the number_outpatient distplot.

  • We can observe that most of the patients do not have any emergency visits.


Number_inpatient:

Plot of number of inpatient visits
  • We can observe that most of the patients do not have any inpatient visits.

  • It is similar to the other visit plots seen above.

  • We can create a new feature ‘visits’, which will be the sum of inpatient, outpatient and emergency visits, since all three are distributed in similar ways.


Diagnosis:

All three diagnosis features contain codes, each of which falls into one of 9 groups. The groups are given in the research paper. We can map these codes to the 9 categories and use the category as the diagnosis. This idea has been taken from the research paper.

The new categories are analyzed below.


Plot of diagnosis 1
Plot of diagnosis 2
Plot of diagnosis 3
  • In the second and third diagnosis we can observe that more patients are diagnosed with category 4, which is diabetes mellitus.

  • Most of the patients are diagnosed with respiratory and other disease types.

  • The share of the NaN category also increases with the diagnosis number. It is represented as -1 in the feature.

Number_diagnoses

Refers to the number of diagnoses entered to the system.


Plot of number of diagnoses
  • Most patients have 9 diagnoses.

  • More than 9 diagnoses are rare.


Max_glu_serum

Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured.

Plot of glucose serum test results
  • Most of the patients don't undergo this test.

  • Of the patients who undergo this test, about half have a normal result; the other half have a result in either the >200 or the >300 category.

  • Ordinal encoding is used, since a max_glu_serum value above a certain level is abnormal for the patient and is therefore more important in predicting readmission.


A1Cresult

Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.

Plot of A1Cresult
  • Most of the patients don't undergo this test.

  • Of the patients who undergo this test, nearly half have a result of >8; the other half have a result in either the >7 or the normal category.

  • Ordinal encoding is used, since an A1Cresult above a certain level is abnormal for the patient and is therefore more important in predicting readmission.
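The ordinal encoding of the two test results could look roughly like the sketch below. The label spellings ("None", "Norm", ">200", ...) and the integer codes are assumptions, not taken from the article; check the unique values of each column in the real file before relying on them.

```python
import pandas as pd

# Assumed label spellings and ordering (hypothetical; verify with
# data['max_glu_serum'].unique() and data['A1Cresult'].unique()).
GLU_ORDER = {"None": 0, "Norm": 1, ">200": 2, ">300": 3}
A1C_ORDER = {"None": 0, "Norm": 1, ">7": 2, ">8": 3}


def ordinal_encode(series, order):
    """Map ordered category labels to integers; unknown labels become NaN."""
    return series.map(order)


# Toy values for illustration only.
toy = pd.DataFrame({
    "max_glu_serum": ["None", ">300", "Norm", ">200"],
    "A1Cresult": [">8", "None", ">7", "Norm"],
})
toy["max_glu_serum"] = ordinal_encode(toy["max_glu_serum"], GLU_ORDER)
toy["A1Cresult"] = ordinal_encode(toy["A1Cresult"], A1C_ORDER)
print(toy)
```

A larger code means a more abnormal result, which is the ordering the encoding is meant to preserve.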


Medications

Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed.


There are a total of 23 medications, and a value is given for each of them. When analyzed, it was found that 3 medications were not prescribed to any patient. These 3 features don't help to classify whether the patient was readmitted within 30 days, as all their values are the same, so they are dropped from the dataset.

The medications can be merged into a single feature, and the number of medications a patient has taken can be calculated. The medications were custom encoded: any change in dosage is encoded as 2, steady as 1, and a medication that was not prescribed as 0.
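A minimal sketch of the two steps above, dropping medication columns that never vary and applying the 0/1/2 encoding. The column names drug_a, drug_b, drug_c, the label spellings ("No", "Steady", "Up", "Down") and the output column num_med_taken are hypothetical placeholders.

```python
import pandas as pd

# Encoding from the text: not prescribed -> 0, steady -> 1, any change -> 2.
# The label spellings are an assumption about how the CSV stores them.
DOSE_CODES = {"No": 0, "Steady": 1, "Up": 2, "Down": 2}


def encode_medications(df, med_cols):
    """Drop medication columns with a single value, encode the rest, and
    count how many medications each patient was prescribed."""
    out = df.copy()
    constant = [c for c in med_cols if out[c].nunique() <= 1]
    out = out.drop(columns=constant)
    kept = [c for c in med_cols if c not in constant]
    for c in kept:
        out[c] = out[c].map(DOSE_CODES)
    out["num_med_taken"] = (out[kept] > 0).sum(axis=1)
    return out, constant


# Toy frame with placeholder drug names (values are made up).
toy = pd.DataFrame({
    "drug_a": ["No", "Steady", "Up"],
    "drug_b": ["No", "No", "No"],      # never prescribed, so it is dropped
    "drug_c": ["Down", "No", "Steady"],
})
encoded, dropped = encode_medications(toy, ["drug_a", "drug_b", "drug_c"])
print(dropped)
print(encoded)
```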


Change:

Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”.

Plot of change in medicine
  • More than 50% of patients did not have any change in medication; the remaining patients had their medication changed.

  • The change feature was encoded with 0 representing no change and 1 representing a change in medication.


DiabetesMed:

Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”.

Plot of diabetic medication prescribed
  • Most of the patients were prescribed diabetes medication.

  • The diabetesMed feature was encoded with 0 representing not prescribed and 1 representing medicine prescribed.


Readmitted:

This is the variable which we must predict.


It refers to days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.

Plot of readmitted patients
  • We must predict whether the patient is readmitted within 30 days.

  • From the graph we can observe that few patients are readmitted within 30 days; most are either not readmitted or readmitted after 30 days.

  • Oversampling/undersampling techniques will be required to make the data balanced.
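Since the task is to predict readmission within 30 days, the three-valued column has to be collapsed into a binary target. The sketch below is one way to do that; the exact label strings ("<30", ">30", "NO") and the helper name make_target are assumptions, not taken from the article.

```python
import pandas as pd


def make_target(readmitted):
    """Return 1 for patients readmitted within 30 days, 0 for everyone else
    (readmitted after 30 days or never readmitted)."""
    return (readmitted == "<30").astype(int)


# Toy labels for illustration only.
toy = pd.Series(["<30", ">30", "NO", "NO", "<30"])
y = make_target(toy)
print(y.value_counts(normalize=True))  # class balance of the binary target
```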


Payer_code:

As mentioned earlier, 43% of the payer_code values are null. These null values can be filled with values predicted by model-based imputation techniques. Here I have used KNN and random forest models for imputation.

There are a total of 17 types of payer codes. The payer code feature is encoded and separated from the other columns. The rows for which the payer code is not null are used as the training dataset, with the other features as inputs, and a model is fit on this data. The null values in the payer code are then predicted using the model. KNN and random forest models were tried, and after hyperparameter tuning the random forest model performed better than KNN at predicting the null values.

Payer code before imputation
Payer code after imputation
  • We can observe that after imputation, the number of patients whose payment was made by Medicare has increased drastically, as expected. It was already the largest category before imputation.

  • Other categories have seen at most a 10% increase in number after imputation.

  • The plot looks like a case of a Pareto distribution.
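The model-based imputation described above can be sketched as follows. This is an illustration under assumptions, not the author's exact code: the helper name impute_with_model, the two feature columns and the toy values are hypothetical, and no hyperparameter tuning is shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def impute_with_model(df, target, feature_cols, random_state=0):
    """Fit a classifier on rows where `target` is known and use it to
    predict `target` for the rows where it is missing."""
    out = df.copy()
    known = out[target].notna()
    if known.all():
        return out  # nothing to impute
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    model.fit(out.loc[known, feature_cols], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, feature_cols])
    return out


# Toy frame: the last two payer codes are missing (values are made up).
toy = pd.DataFrame({
    "time_in_hospital": [1, 2, 1, 9, 8, 9, 2, 8],
    "num_medications": [5, 6, 4, 20, 22, 21, 5, 19],
    "payer_code": ["MC", "MC", "MC", "HM", "HM", "HM", np.nan, np.nan],
})
filled = impute_with_model(toy, "payer_code",
                           ["time_in_hospital", "num_medications"])
print(filled["payer_code"].tolist())
```

The same helper would apply to medical_specialty; swapping in a KNeighborsClassifier gives the KNN variant for comparison.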


medical_specialty:

Integer identifier of the specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon.


48% of the medical specialty values are null. These null values can be filled with values predicted by model-based imputation techniques. Here I have used KNN and random forest models for imputation. There are 70 different medical specialty categories.

The medical specialty feature is encoded using scikit-learn’s label encoder. Only the non-null values are encoded, and the features other than medical specialty are used to predict it. The null values in the medical specialty are predicted using the model fit on the other features. KNN and random forest models were tried, and after hyperparameter tuning the random forest model performed better than KNN at predicting the null values. The encoded values are decoded using the inverse transform of scikit-learn’s label encoder.

The plots contain many categories and cannot be shown here; please check the notebook if interested.


  • The InternalMedicine category has grown threefold after filling the missing values through imputation.

  • Other categories have nearly doubled after imputation.

  • The InternalMedicine category dominates, followed by family/general, cardiology and emergency/Trauma.


Conclusion of Univariate analysis:

  • The outpatient, inpatient and emergency visits can be merged into a new feature, visits.

  • Three medication features are removed since they do not provide any information which might help to predict readmission of patients.

  • The medications can be merged into a single feature, and the number of medications a patient has taken can be calculated.

  • The diagnosis features have been changed from icd9 codes to 10 different categories. The plots indicate that diagnoses of diabetes mellitus increase with the diagnosis number.

  • Model-based imputation was applied to the features with missing values.

  • The categorical labels can be one-hot encoded to convert them to numerical data.

  • The data is highly unbalanced, with only 9% of patients being readmitted within 30 days. Oversampling must be done.

Bivariate and multivariate analysis:

Only plots which led to some observations are shown; other plots which didn’t give any useful information have been discarded.




Plot of age vs time in hospital
Plot of age vs diagnosis on readmitted patients
  • Most of the readmitted patients are aged 60–100, which is category 2 in this case.

  • Readmission increases with age.

  • Patients in age category 1 most often had a primary diagnosis of category 0 or 4.

  • Patients in age category 2 most often had a primary diagnosis of category 0 or 1.

  • The patients in age category 3 had primary diagnosis of 2 more in number followed by diagnosis 2.

  • The patients in category 3 stayed in hospital longer than those in categories 1 and 2.

  • The readmitted patients are diagnosed with category 0, 1 and 4 diseases.


Race :


Plot of race vs admission type id
Plot of race vs time in hospital based on readmission.
  • Asian, other race and Hispanic patients look similar across most of the features.

  • The mean time in hospital of African American patients is higher than that of patients of other races.

  • African American patients are admitted under category 1, unlike other patients, who are admitted under category 2.

  • Readmitted patients stayed in hospital longer than non-readmitted patients, except for Asian patients.

  • Readmitted patients of other races have a mean admission id of 1, whereas non-readmitted patients of other races have a mean admission id of 2.




Plot of gender vs time in hospital
Plot of gender vs diagnosis based on readmission.
  • Females spend more time in hospital than male patients.

  • Among male patients, those who were readmitted tend to spend more time in hospital than those who were not.

  • Most of the male readmitted patients were diagnosed as category 1, whereas most of the male non-readmitted patients were diagnosed as category 2.

  • Most males have admission id 2, whereas most females have admission id 1.


Admission type id:


Plot of admission type id vs max glucose serum
Plot of admission type id vs time in hospital based on readmission.
  • The patients who were admitted under id 7 spent more time in hospital than patients with other admission ids.

  • The patients who were admitted under id 8 spent the least time in hospital of all admission ids.

  • For admission id 7 patients, payment was mostly made through payer code SI.

  • Fewer patients with admission id 4 or 7 were readmitted.

  • In categories 1, 3, 5 and 6, readmitted patients spent more time in hospital than non-readmitted patients, whereas in category 8 readmitted patients spent less time in hospital.

  • Most max_glu_serum tests were done when patients were admitted under categories 5, 6 and 7. For patients under categories 1, 2 and 3, fewer max_glu_serum tests were done.


discharge_disposition_id:

Plot of discharge id vs diagnosis based on readmission.
Plot of discharge id vs time in hospital based on readmission.
  • Readmitted patients with discharge id 3 spent more time in hospital than non-readmitted patients with id 3.

  • Most of the readmitted patients with discharge id 3 had been diagnosed with a category 2 disease, whereas non-readmitted patients with id 3 were mostly diagnosed with category 1.

Plotting Correlation matrix:


A correlation matrix is essentially a covariance matrix normalized by the standard deviations of the features, and it is a very good tool for multivariate exploration.


Correlation matrix
  • From the above matrix we can observe that features like num_medications, number_diagnoses and num_lab_procedures tend to have a positive correlation with time in hospital.

  • Diagnosis features show very little correlation with other features.

  • Readmitted also shows low correlation with other features, indicating that no linear relationship with them is present.
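The correlation matrix itself can be computed with pandas. The sketch below uses synthetic columns (values are randomly generated, not from the dataset) and also checks the relationship between correlation and covariance.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: num_medications is built to grow with
# time_in_hospital, while `noise` is unrelated to both.
rng = np.random.default_rng(0)
n = 200
time_in_hospital = rng.integers(1, 15, size=n)
toy = pd.DataFrame({
    "time_in_hospital": time_in_hospital,
    "num_medications": 2 * time_in_hospital + rng.normal(0, 3, size=n),
    "noise": rng.normal(size=n),
})

corr = toy.corr()  # Pearson correlation between every pair of columns

# Correlation is covariance divided by the product of standard deviations.
std = toy.std().to_numpy()
corr_from_cov = toy.cov().to_numpy() / np.outer(std, std)

print(corr.round(2))
```

A heatmap of `corr` (for example with a plotting library of your choice) gives the kind of figure referred to above.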


Checking Multi-collinearity with VIF values:

Variance Inflation Factor, or VIF, gives a basic quantitative idea of how much the feature variables are correlated with each other.


On checking the VIF for our features, number_diagnoses has a VIF value of 15.7 and age has a VIF value of 10, both at or above the commonly used threshold of 10. After dropping the number_diagnoses feature, the VIF values for all remaining features stayed below 10.
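For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with plain NumPy on synthetic data; the helper name vif_table is hypothetical, and the numbers it produces are not the article's. statsmodels also provides a variance_inflation_factor function, whose results can differ depending on whether an intercept column is included.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """VIF per column: regress each column on the others (with intercept)
    and return 1 / (1 - R^2), sorted from largest to smallest."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for i, col in enumerate(df.columns):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


# Synthetic data: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
vif = vif_table(toy)
print(vif)
```

Dropping the feature with the highest VIF and recomputing is the iterative procedure the text describes.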


Conclusion of bivariate and multivariate analysis:

  • Asian, other race and Hispanic patients look similar across most of the features.

  • On average, readmitted patients stayed in hospital longer than non-readmitted patients, except for Asian patients.

  • On average, females spend more time in hospital than male patients.

  • Most male patients have admission id 2, whereas most female patients have admission id 1.

  • Most max_glu_serum tests were done when patients were admitted under categories 5, 6 and 7. For patients under categories 1, 2 and 3, fewer max_glu_serum tests were done.

  • Readmission has little correlation with other features.

  • Time in hospital has good correlation with other variables.

  • The number_diagnoses feature has a high VIF value and hence it is removed. After its removal the VIF values of the other features remain in the range of 0–10.

Step 3: Feature engineering:

visits feature:

As seen in the univariate analysis, the inpatient visits, outpatient visits and emergency visits can be combined into a single feature called visits. Since many of the patients had no visits at all in the preceding year, we can make the visits feature binary, indicating whether or not the patient had any visits. This can be done by replacing any non-zero visit count with 1.

The inpatient, outpatient and emergency visit features are dropped.
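A minimal sketch of the visits feature, using the three visit-count column names from the dataset; the helper name add_visits and the toy values are hypothetical.

```python
import pandas as pd

VISIT_COLS = ["number_outpatient", "number_emergency", "number_inpatient"]


def add_visits(df):
    """Add a binary `visits` column (1 if the patient had any visit in the
    preceding year, else 0) and drop the three original count columns."""
    out = df.copy()
    total = out[VISIT_COLS].sum(axis=1)
    out["visits"] = (total > 0).astype(int)
    return out.drop(columns=VISIT_COLS)


# Toy rows for illustration only.
toy = pd.DataFrame({
    "number_outpatient": [0, 2, 0],
    "number_emergency": [0, 0, 0],
    "number_inpatient": [0, 1, 3],
})
print(add_visits(toy))
```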


Number of steady medicines and increase/decrease of medicine given to each patient feature

Two new features are derived from the 23 medication features present in the dataset. The first feature, ‘steady’, tells us how many medications the patient is taking at a steady dosage. The second feature, ‘up/down’, tells us how many medications had their dosage increased or decreased for the patient.
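The two derived features can be sketched as row-wise counts over the raw medication columns. The drug column names and the label spellings ("Steady", "Up", "Down", "No") below are assumptions; the helper name add_dosage_counts is hypothetical.

```python
import pandas as pd


def add_dosage_counts(df, med_cols):
    """Add `steady` (number of medications kept at a steady dosage) and
    `up/down` (number of medications whose dosage was changed)."""
    out = df.copy()
    meds = out[med_cols]
    out["steady"] = (meds == "Steady").sum(axis=1)
    out["up/down"] = meds.isin(["Up", "Down"]).sum(axis=1)
    return out


# Toy frame with placeholder drug names (values are made up).
toy = pd.DataFrame({
    "drug_a": ["Steady", "Up", "No"],
    "drug_b": ["Steady", "Down", "No"],
    "drug_c": ["No", "Steady", "No"],
})
counts = add_dosage_counts(toy, ["drug_a", "drug_b", "drug_c"])
print(counts[["steady", "up/down"]])
```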

Feature Selection

Before applying any machine learning model, our data must be fed to the models in a proper format. In our problem, most of the columns are categorical, so they have to be converted to a numerical format before a model can use them. Though there are many ways to handle categorical data, one of the most common is one-hot encoding. Here I used pandas' get_dummies to produce the one-hot encoded features.
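As a small illustration of get_dummies, the sketch below one-hot encodes a single categorical column of a toy frame (values are made up); on the real data you would list all categorical columns.

```python
import pandas as pd

# Toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "race": ["Caucasian", "Asian", "Caucasian"],
    "time_in_hospital": [3, 5, 2],
})

# Each category of `race` becomes its own indicator column;
# the numeric column is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["race"])
print(encoded)
```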

We will split our data set into train and test sets. Since the data is imbalanced, we will apply SMOTE to obtain a balanced dataset by oversampling. We will oversample only the train data and not the test data. We will fit on the entire train data and use the test part for prediction. We will then evaluate which model performs best on the test data based on our evaluation metrics, AUC and F1 score.
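To make the split-then-oversample order concrete, here is a sketch on synthetic data. The oversampler below is a simplified SMOTE-style stand-in written with NumPy for illustration (binary labels only); it is not the author's code, and in practice the SMOTE class from the imbalanced-learn package would normally be used. The function name smote_like_oversample is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def smote_like_oversample(X, y, k=5, random_state=0):
    """Simplified SMOTE-style sketch: create synthetic minority points by
    interpolating between a minority sample and one of its k nearest
    minority neighbours until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    # Pairwise distances between minority samples (self-distance excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic, imbalanced data (roughly 9% positives, like the readmission label).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.09).astype(int)

# Split first, then oversample the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_res, y_res = smote_like_oversample(X_train, y_train)
print(np.bincount(y_train), np.bincount(y_res))
```

Oversampling after the split keeps synthetic points out of the test set, so the test scores reflect the original class balance.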

After oversampling, we use permutation importance for feature selection and keep only those features with a positive permutation importance weight. These selected features are used to train our models. Out of the 174 columns, only 82 were considered important for predicting whether a patient is readmitted within 30 days.
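The selection rule can be sketched with scikit-learn's permutation_importance. This is a stand-in under assumptions: the article does not say which implementation or base model was used, and the data below is synthetic, with only the first two columns carrying signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on columns 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time and measure the drop in validation AUC.
result = permutation_importance(
    model, X_val, y_val, n_repeats=5, random_state=0, scoring="roc_auc")

# Keep only the features whose mean importance is positive.
selected = np.where(result.importances_mean > 0)[0]
print(result.importances_mean.round(3), selected)
```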

Step 4: Modelling:

Having analyzed and cleaned the data, we did feature engineering and added three new features (visits, steady and up/down) derived from the existing features. We also handled categorical and numerical features, oversampled the data, and selected the best features. We are now ready to apply machine learning algorithms to our prepared data.

I wanted to experiment with both machine learning and deep learning models, so I have built and experimented with 4 machine learning models and 4 deep learning models.


  1. Logistic regression

  2. Decision Tree

  3. Random forest

  4. Xgboost

  5. Deep neural network

  6. CNN based model

  7. LSTM based model

  8. Model using LSTM and CNN combined (ConvLSTM)


I tuned the hyperparameters of each machine learning model. The best hyperparameters found on the train data set were then used to predict readmission on the test data set, and the AUC and F1 scores of the models were compared.
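As a reminder of what the two evaluation metrics measure, here is a small sketch that computes them by hand for a binary problem. It is not the evaluation code from this case study; the function names, the six toy predictions and the 0.5 decision threshold are assumptions, and the author may have used a different averaging scheme for F1.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test-set labels and predicted probabilities of readmission
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# Hard predictions at an assumed threshold of 0.5
y_pred = [int(s >= 0.5) for s in scores]

auc_value = roc_auc(y_true, scores)
f1_value = f1_score(y_true, y_pred)
print("AUC:", round(auc_value, 3))
print("F1: ", round(f1_value, 3))
```

AUC is computed from the predicted scores and is independent of any threshold, while F1 depends on the hard predictions, so the two metrics can rank models differently.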

Logistic Regression:

AUC score of the logistic regression model is 0.925.

F1 score of the logistic regression model is 0.920.

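from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# Hyperparameter tuning of SGD with log loss (i.e. logistic regression).
grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
    'max_iter': [100, 500, 1000],
    'loss': ['log'],  # logistic regression (named 'log_loss' in recent scikit-learn releases)
    'penalty': ['l2'],
}

# The search call is missing from the source. This reconstruction (default cv and
# scoring, X_train / y_train holding the oversampled train data) is an assumption.
sgd_cv = GridSearchCV(SGDClassifier(), grid)
sgd_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ", sgd_cv.best_params_)
print("accuracy :", sgd_cv.best_score_)

# Final model with the best parameters found
sgd = SGDClassifier(alpha=0.0001, loss='log', max_iter=100, penalty='l2')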

Decision Tree:

AUC score of the decision tree model is 0.905.

F1 score of the decision tree model is 0.905.

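# Hyperparameter tuning of DecisionTree.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Only the last grid entry survives in the source; the candidate values
# searched for criterion and max_depth are not shown.
grid = {'min_samples_split': [8, 10, 12]}

# The search call is also missing. This reconstruction (default cv and scoring,
# X_train / y_train holding the oversampled train data) is an assumption.
tree_cv = GridSearchCV(DecisionTreeClassifier(), grid)
tree_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ", tree_cv.best_params_)
print("accuracy :", tree_cv.best_score_)

# Final model with the best parameters reported for the full search
tree = DecisionTreeClassifier(criterion='entropy', max_depth=90, min_samples_split=12)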

Random Forest:

AUC score of the random forest model is 0.946.

F1 score of the random forest model is 0.943.

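# Hyperparameter tuning of RandomForest.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# The start of the grid is missing from the source; the candidate values
# searched for n_estimators are not shown.
grid = {'max_depth': [10, 20, 50],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [2, 4]}

# The search call is also missing. This reconstruction (default cv and scoring,
# X_train / y_train holding the oversampled train data) is an assumption.
randomforest_cv = GridSearchCV(RandomForestClassifier(), grid)
randomforest_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ", randomforest_cv.best_params_)
print("accuracy :", randomforest_cv.best_score_)

# Final model with the best parameters reported for the full search
randomforest = RandomForestClassifier(max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=200)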

XGBoost:

AUC score of the XGBoost model is 0.933.

F1 score of the XGBoost model is 0.929.

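# Hyperparameter tuning of xgboost.
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

grid = {"learning_rate": [0.05, 0.10],
        "max_depth": [3, 4, 5],
        "min_child_weight": [1, 3],
        "gamma": [0.0, 0.1, 0.3],
        "colsample_bytree": [0.3, 0.5, 0.7]}

# The search call is missing from the source. This reconstruction (default cv and
# scoring, X_train / y_train holding the oversampled train data) is an assumption.
xg_cv = GridSearchCV(xgb.XGBClassifier(), grid)
xg_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ", xg_cv.best_params_)
print("accuracy :", xg_cv.best_score_)

# Final model with the best parameters found
xg = xgb.XGBClassifier(colsample_bytree=0.5, gamma=0.3, learning_rate=0.1, max_depth=5, min_child_weight=1)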

Deep Neural Network:

AUC score of the deep neural network model is 0.960.

F1 score of the deep neural network model is 0.938.

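import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Input layer: one value per selected feature (82 features)
input_layer = tf.keras.Input(shape=(82,))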

#Dense hidden layer
layer1 = tf.keras.layers.Dense(512,activation='relu')(input_layer)
layer2 = tf.keras.layers.Dense(256,activation='relu')(layer1)
layer3 = tf.keras.layers.Dense(128,activation='relu')(layer2)
layer4 = tf.keras.layers.Dense(64,activation='relu')(layer3)
layer5 = tf.keras.layers.Dense(32,activation='relu')(layer4)

#output layer
output = tf.keras.layers.Dense(2,activation='softmax',kernel_initializer=tf.initializers.he_uniform())(layer5)

#Creating a model
model = tf.keras.Model(inputs=input_layer,outputs=output)

#adam optimizer
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,name='Adam')

es = EarlyStopping(monitor='val_loss', mode='min',verbose=1,patience=2,restore_best_weights=True)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./graph002',histogram_freq=1, write_graph=True,write_grads=True)

#changes learning rate if val_auc is same or less than max val_auc
scheduler = ReduceLROnPlateau(monitor='val_auc', factor=0.9, patience=0, verbose=1,mode = 'max')

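# auc and f1 are assumed to be custom metric functions defined elsewhere in the notebook (not shown here)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=[auc, f1])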

callback_list = [tensorboard_callback,scheduler,es]

model.fit(X_train,y_train,epochs=100, validation_data=(X_cv,y_cv), batch_size=100, callbacks=callback_list)

CNN:

AUC score of the CNN model is 0.959.

F1 score of the CNN model is 0.936.


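from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense, Dropout, Flatten

# X_train_1 is assumed to be the train data reshaped to (samples, timesteps, 82)
input_layer = Input(shape=(X_train_1.shape[1], 82))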
l_cov1= Conv1D(64, 5, activation='relu',padding='same')(input_layer)
l_cov2= Conv1D(32, 5, activation='relu',padding='same')(l_cov1)
dropout2 = Dropout(0.25)(l_cov2)
text_flat1 = Flatten(data_format='channels_last',name='other_data_flat')(dropout2)
dense1 = Dense(64,activation='relu')(text_flat1)

#output layer
output = tf.keras.layers.Dense(2,activation='softmax')(dense1)

#Creating a model
model = tf.keras.Model(inputs=input_layer,outputs=output)

#adam optimizer
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,name='Adam')

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./graph003',histogram_freq=1, write_graph=True,write_grads=True)

#changes learning rate if val_auc is same or less than max val_auc
scheduler = ReduceLROnPlateau(monitor='val_auc', factor=0.9, patience=0, verbose=1,mode = 'max')

es = EarlyStopping(monitor='val_loss', mode='min',verbose=1,patience=2,restore_best_weights=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy',metrics =[auc,f1])

callback_list = [tensorboard_callback,scheduler,es]

model.fit(X_train_1,y_train_1,epochs=100, validation_data=(X_cv_1,y_cv_1), batch_size=100, callbacks=callback_list)


LSTM:

AUC score of the LSTM model is 0.960.

F1 score of the LSTM model is 0.938.


input_layer = Input(shape=(1, 82))
lstm1 = LSTM(128, activation='relu')(input_layer)
text_flat = Flatten(data_format='channels_last', name='Flatten1')(lstm1)
# Note: as written, dense1 is not connected to the output; dropout is applied to text_flat
dense1 = Dense(64, activation='relu')(text_flat)
dropout2 = Dropout(0.25)(text_flat)
# Output layer
output = tf.keras.layers.Dense(2, activation='softmax')(dropout2)

#Creating a model
model = tf.keras.Model(inputs=input_layer,outputs=output)

#history_own = LossHistory()

#adam optimizer
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,name='Adam')

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./graph012',histogram_freq=1, write_graph=True,write_grads=True)

es = EarlyStopping(monitor='val_loss', mode='min',verbose=1,patience=2,restore_best_weights=True)

scheduler = ReduceLROnPlateau(monitor='val_auc', factor=0.9, patience=0, verbose=1,mode = 'max')

model.compile(optimizer=optimizer, loss='categorical_crossentropy',metrics =[auc,f1])

callback_list = [tensorboard_callback,scheduler,es]

model.fit(X_train_1,y_train_1,epochs=30, validation_data=(X_cv_1,y_cv_1), batch_size=100, callbacks=callback_list)

ConvLSTM:

AUC score of the ConvLSTM model is 0.958.

F1 score of the ConvLSTM model is 0.938.


input_layer = Input(shape=(X_train_1.shape[1],82),)
l_cov1= Conv1D(64, 5, activation='relu',padding='same')(input_layer)
l_cov2= Conv1D(32, 5, activation='relu',padding='same')(l_cov1)
lstm1= LSTM(128, activation='relu')(l_cov2)
dropout =Dropout(0.25)(lstm1)
text_flat = Flatten(data_format='channels_last',name='Flatten1')(dropout)
dense1 = Dense(64,activation='relu')(text_flat)
#output layer
output = tf.keras.layers.Dense(2,activation='softmax')(dense1)

#Creating a model
model = tf.keras.Model(inputs=input_layer,outputs=output)

#adam optimizer
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,name='Adam')

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./graph0011',histogram_freq=1, write_graph=True,write_grads=True)

scheduler = ReduceLROnPlateau(monitor='val_auc', factor=0.9, patience=0, verbose=1,mode = 'max')

es = EarlyStopping(monitor='val_loss', mode='min',verbose=1,patience=2,restore_best_weights=True)

model.compile(optimizer=optimizer, loss='categorical_crossentropy',metrics =[auc,f1])

callback_list = [tensorboard_callback,scheduler,es]

model.fit(X_train_1,y_train_1,epochs=100, validation_data=(X_cv_1,y_cv_1), batch_size=80, callbacks=callback_list)

Results:

[Figure: AUC score for models]
[Figure: F1 score for models]

  • We can observe that the LSTM has the highest AUC score (0.960, the same value reported for the deep neural network).

  • We can observe that random forest has the highest F1 score (0.943).
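The comparison can also be reproduced directly from the scores quoted in each model section. The snippet below is only a convenience sketch: the dictionary layout is an assumption, and the block of scores that appears without a heading between the CNN and ConvLSTM sections is assumed to belong to the LSTM model.

```python
# (AUC, F1) per model, copied from the scores reported in the sections above
scores = {
    "Logistic regression": (0.925, 0.920),
    "Decision tree": (0.905, 0.905),
    "Random forest": (0.946, 0.943),
    "XGBoost": (0.933, 0.929),
    "Deep neural network": (0.960, 0.938),
    "CNN": (0.959, 0.936),
    "LSTM": (0.960, 0.938),      # assumed: the unlabeled score block
    "ConvLSTM": (0.958, 0.938),
}

# Print the models ranked by AUC, highest first
print(f"{'Model':<22}{'AUC':>7}{'F1':>7}")
for name, (auc_score, f1) in sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:<22}{auc_score:>7.3f}{f1:>7.3f}")

# Models sharing the best value of each metric
best_auc = max(auc_score for auc_score, _ in scores.values())
best_f1 = max(f1 for _, f1 in scores.values())
top_auc = [name for name, (auc_score, _) in scores.items() if auc_score == best_auc]
top_f1 = [name for name, (_, f1) in scores.items() if f1 == best_f1]

print("Best AUC:", best_auc, top_auc)
print("Best F1: ", best_f1, top_f1)
```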


Conclusion:

This was my first self-case study and my first Medium article; I hope you enjoyed reading it. I learned a lot of techniques while working on this case study. I thank AppliedAI and my mentor, who helped me throughout.

This concludes my work. Thank you for reading!

