Parts of this article are adapted from the excellent guides on the official site.
1. Define the problem
2. Gather the data
3. Prepare the data
4. Perform exploratory analysis
5. Choose an appropriate model
6. Validate the model
7. Tune the parameters
Either MATLAB or Python can be used to crunch the data; here I use Python.
Data cleaning breaks down into four steps:
1. Correcting: some columns contain obviously impossible values, such as an Age of 180, so sanity-check the data first (see the short sketch after this list).
2. Completing: some algorithms cannot handle NULL/missing values, so check for those as well.
3. Creating: create new, more meaningful columns or parameters (feature engineering).
4. Converting: converting data formats, for example encoding categorical text as numbers, is the most important task in this step.
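As a quick illustration of the correcting step (the rest of this walkthrough focuses on the other three), here is a minimal sketch; the DataFrame and the cutoff values are made up purely for illustration:

import pandas as pd

# hypothetical data used only to illustrate a sanity check
df = pd.DataFrame({'Age': [22, 38, 180, 26], 'Fare': [7.25, 71.28, 8.05, -1.0]})
print(df[(df['Age'] < 0) | (df['Age'] > 120)])   # physically impossible ages
print(df[df['Fare'] < 0])                        # negative fares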
Reading the data with pandas gives you a DataFrame.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
path_train = 'D:\\AboutWork\\机器学习\\Example1_titanic\\train.csv'
path_test = 'D:\\AboutWork\\机器学习\\Example1_titanic\\test.csv'
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]  # clean both DataFrames in a single loop
print(raw_data.sample(10))            # eyeball a random sample of rows
print(raw_data.isnull().sum())        # count missing values per column
PassengerId Survived Pclass \
817 818 0 2
54 55 0 1
566 567 0 3
769 770 0 3
186 187 1 3
208 209 1 3
559 560 1 3
340 341 1 2
358 359 1 3
483 484 1 3
Name Sex Age SibSp \
817 Mallet, Mr. Albert male 31.0 1
54 Ostby, Mr. Engelhart Cornelius male 65.0 0
566 Stoytcheff, Mr. Ilia male 19.0 0
769 Gronnestad, Mr. Daniel Danielsen male 32.0 0
186 O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) female NaN 1
208 Carr, Miss. Helen "Ellen" female 16.0 0
559 de Messemaeker, Mrs. Guillaume Joseph (Emma) female 36.0 1
340 Navratil, Master. Edmond Roger male 2.0 1
358 McGovern, Miss. Mary female NaN 0
483 Turkula, Mrs. (Hedwig) female 63.0 0
Parch Ticket Fare Cabin Embarked
817 1 S.C./PARIS 2079 37.0042 NaN C
54 1 113509 61.9792 B30 C
566 0 349205 7.8958 NaN S
769 0 8471 8.3625 NaN S
186 0 370365 15.5000 NaN Q
208 0 367231 7.7500 NaN Q
559 0 345572 17.4000 NaN S
340 1 230080 26.0000 F2 S
358 0 330931 7.8792 NaN Q
483 0 4134 9.5875 NaN S
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]
for data in data_cleaner:
    data['Age'].fillna(data['Age'].median(), inplace=True)               # numeric column: fill with the median
    data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)    # categorical column: fill with the mode
    data.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)  # drop columns we will not use
print(raw_data.sample(10))
print(raw_data.isnull().sum())
Survived Pclass Name \
620 0 3 Yasbeck, Mr. Antoni
176 0 3 Lefebre, Master. Henry Forbes
568 0 3 Doharr, Mr. Tannous
863 0 3 Sage, Miss. Dorothy Edith "Dolly"
742 1 1 Ryerson, Miss. Susan Parker "Suzette"
748 0 1 Marvin, Mr. Daniel Warner
279 1 3 Abbott, Mrs. Stanton (Rosa Hunt)
206 0 3 Backstrom, Mr. Karl Alfred
871 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
317 0 2 Moraweck, Dr. Ernest
Sex Age SibSp Parch Fare Embarked
620 male 27.0 1 0 14.4542 C
176 male 28.0 3 1 25.4667 S
568 male 28.0 0 0 7.2292 C
863 female 28.0 8 2 69.5500 S
742 female 21.0 2 2 262.3750 C
748 male 19.0 1 0 53.1000 S
279 female 35.0 1 1 20.2500 S
206 male 32.0 1 0 15.8500 S
871 female 47.0 1 1 52.5542 S
317 male 54.0 0 0 14.0000 S
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
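The fillna calls above implement median/mode imputation by hand. For reference only (the rest of this article keeps the fillna approach), scikit-learn's SimpleImputer expresses the same idea; a minimal sketch, reusing path_train from above:

from sklearn.impute import SimpleImputer
import pandas as pd

df = pd.read_csv(path_train)  # fresh copy of the training data
# strategy='median' mirrors Age.fillna(median); strategy='most_frequent' mirrors Embarked.fillna(mode)
df['Age'] = SimpleImputer(strategy='median').fit_transform(df[['Age']]).ravel()
df['Embarked'] = SimpleImputer(strategy='most_frequent').fit_transform(df[['Embarked']]).ravel()
print(df[['Age', 'Embarked']].isnull().sum())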
for data in data_cleaner:
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = 1
    data.loc[data['FamilySize'] > 1, 'IsAlone'] = 0         # DataFrame.loc is worth knowing; it avoids chained indexing
    data['FareBin'] = pd.qcut(data['Fare'], 4)              # qcut: quantile bins, roughly equal counts per bin
    data['AgeBin'] = pd.cut(data['Age'].astype(float), 5)   # cut: equal-width bins of (max - min) / 5
print(raw_data.info())
print(raw_data.isnull().sum())
pandas.qcut() bins the data by quantiles: the bin edges are chosen automatically so that each bin holds roughly the same number of samples. pandas.cut() also splits the data into bins, but they are equal-width bins of length (max - min) / n. A tiny example follows after the output below.
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 891 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Fare 891 non-null float64
8 Embarked 891 non-null object
9 FamilySize 891 non-null int64
10 IsAlone 891 non-null int64
11 FareBin 891 non-null category
12 AgeBin 891 non-null category
dtypes: category(2), float64(2), int64(6), object(3)
memory usage: 78.9+ KB
None
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
FamilySize 0
IsAlone 0
FareBin 0
AgeBin 0
dtype: int64
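Here is that tiny qcut/cut example (numbers made up for illustration):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])      # one large outlier
print(pd.qcut(s, 2).value_counts())      # quantile bins: 3 values land in each bin
print(pd.cut(s, 2).value_counts())       # equal-width bins: the outlier sits alone in the upper bin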
label = LabelEncoder()
for data in data_cleaner:   # encode categorical columns as integer codes
    data['Sex_Code'] = label.fit_transform(data['Sex'])
    data['Embarked_Code'] = label.fit_transform(data['Embarked'])
    data['AgeBin_Code'] = label.fit_transform(data['AgeBin'])
    # data['FareBin_Code'] = label.fit_transform(data['FareBin'])
raw_data_x = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']  # original feature names
raw_data_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'SibSp', 'Parch', 'Age', 'Fare']              # numeric / label-encoded features
raw_data_xy = ['Survived'] + raw_data_x
print(raw_data_xy)
['Survived', 'Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
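To make the effect of LabelEncoder above concrete, here is a minimal sketch with made-up values; it simply maps each distinct category to an integer (classes sorted alphabetically):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['male', 'female', 'female', 'male']))  # [1 0 0 1]
print(le.classes_)                                             # ['female' 'male']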
raw_data['FareBin_Code'] = label.fit_transform(raw_data['FareBin'])  # note: the test data is not transformed here
raw_data_x_bin = ['Sex_Code', 'Pclass', 'Embarked_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
raw_data_bin = ['Survived'] + raw_data_x_bin
raw_data_dummy = pd.get_dummies(raw_data[raw_data_x])
raw_data_x_dummy = raw_data_dummy.columns.tolist()
raw_data_xy_dummy = ['Survived'] + raw_data_x_dummy
print('Dummy X Y: ', raw_data_xy_dummy, '\n')
Dummy X Y: ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
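The dummy feature set above comes from one-hot encoding: unlike LabelEncoder, which packs a column into a single integer code (and so imposes an artificial ordering), pd.get_dummies produces one indicator column per category. A tiny sketch with made-up values:

import pandas as pd

demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})  # made-up values for illustration
print(pd.get_dummies(demo))  # one indicator column per category: Embarked_C, Embarked_Q, Embarked_S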
raw_data.describe(include = 'all')
| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | IsAlone | FareBin | AgeBin | Sex_Code | Embarked_Code | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891 | 891 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 891.000000 | 891 | 891 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
unique | NaN | NaN | 891 | 2 | NaN | NaN | NaN | NaN | 3 | NaN | NaN | 4 | 5 | NaN | NaN | NaN | NaN |
top | NaN | NaN | Lundahl, Mr. Johan Svensson | male | NaN | NaN | NaN | NaN | S | NaN | NaN | (7.91, 14.454] | (16.336, 32.252] | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 1 | 577 | NaN | NaN | NaN | NaN | 646 | NaN | NaN | 224 | 523 | NaN | NaN | NaN | NaN |
mean | 0.383838 | 2.308642 | NaN | NaN | 29.361582 | 0.523008 | 0.381594 | 32.204208 | NaN | 1.904602 | 0.602694 | NaN | NaN | 0.647587 | 1.536476 | 1.290685 | 1.497194 |
std | 0.486592 | 0.836071 | NaN | NaN | 13.019697 | 1.102743 | 0.806057 | 49.693429 | NaN | 1.613459 | 0.489615 | NaN | NaN | 0.477990 | 0.791503 | 0.812620 | 1.118156 |
min | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN | 1.000000 | 0.000000 | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | NaN | NaN | 22.000000 | 0.000000 | 0.000000 | 7.910400 | NaN | 1.000000 | 0.000000 | NaN | NaN | 0.000000 | 1.000000 | 1.000000 | 0.500000 |
50% | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | 14.454200 | NaN | 1.000000 | 1.000000 | NaN | NaN | 1.000000 | 2.000000 | 1.000000 | 1.000000 |
75% | 1.000000 | 3.000000 | NaN | NaN | 35.000000 | 1.000000 | 0.000000 | 31.000000 | NaN | 2.000000 | 1.000000 | NaN | NaN | 1.000000 | 2.000000 | 2.000000 | 2.000000 |
max | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | 512.329200 | NaN | 11.000000 | 1.000000 | NaN | NaN | 1.000000 | 2.000000 | 4.000000 | 3.000000 |
train_x, test_x, train_y, test_y = model_selection.train_test_split(raw_data[raw_data_calc], raw_data['Survived'], random_state=0)
train_x_bin, test_x_bin, train_y_bin, test_y_bin = model_selection.train_test_split(raw_data[raw_data_x_bin], raw_data['Survived'], random_state=0)
train_x_dummy, test_x_dummy, train_y_dummy, test_y_dummy = model_selection.train_test_split(raw_data_dummy[raw_data_x_dummy], raw_data['Survived'], random_state=0)
train_x_bin.head()
| | Sex_Code | Pclass | Embarked_Code | FamilySize | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|
105 | 1 | 3 | 2 | 1 | 1 | 0 |
68 | 0 | 3 | 2 | 7 | 1 | 1 |
253 | 1 | 3 | 2 | 2 | 1 | 2 |
320 | 1 | 3 | 2 | 1 | 1 | 0 |
706 | 0 | 2 | 2 | 1 | 2 | 1 |
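With the splits in hand, the remaining workflow steps (choose, validate and tune a model) can start from a simple baseline. The following is only a sketch and not part of the original walkthrough; it assumes the variables defined above and uses a plain logistic regression on the binned feature set:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)        # simple baseline classifier
clf.fit(train_x_bin, train_y_bin)
pred = clf.predict(test_x_bin)
print('accuracy:', metrics.accuracy_score(test_y_bin, pred))   # metrics was imported at the top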