https://blog.csdn.net/power1_power2/article/details/79664830
Source data: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Download the data
from urllib.request import urlretrieve

def load_data(download=True):
    # Fetch the raw Adult data and save it locally as a CSV file
    if download:
        data_path, _ = urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")
        print('Data downloaded')

load_data()
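As written, load_data() downloads the file again on every run. A minimal sketch of a variant that skips the download when the file is already on disk; the helper name download_if_missing and its fetch parameter are my own illustrative additions, not part of the original script:

```python
from pathlib import Path
from urllib.request import urlretrieve

ADULT_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists; return the local path.

    `fetch` is a hypothetical hook (defaults to urlretrieve) so the function
    can be tried without network access.
    """
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the copy that is already on disk
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    fetch(url, str(dest))  # same call shape as urlretrieve(url, filename)
    return dest
```

Calling download_if_missing(ADULT_URL, "D://pic//adult.csv") would then replace the load_data() call above.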
Assign column names and read the data
import pandas as pd
# The raw file has no header row, so the column names are supplied explicitly
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://pic//adult.csv", names=col_names)
print(data[:10])
age workclass fnlwgt education education-num \
0 39 State-gov 77516 Bachelors 13
1 50 Self-emp-not-inc 83311 Bachelors 13
2 38 Private 215646 HS-grad 9
3 53 Private 234721 11th 7
4 28 Private 338409 Bachelors 13
5 37 Private 284582 Masters 14
6 49 Private 160187 9th 5
7 52 Self-emp-not-inc 209642 HS-grad 9
8 31 Private 45781 Masters 14
9 42 Private 159449 Bachelors 13
marital-status occupation relationship race \
0 Never-married Adm-clerical Not-in-family White
1 Married-civ-spouse Exec-managerial Husband White
2 Divorced Handlers-cleaners Not-in-family White
3 Married-civ-spouse Handlers-cleaners Husband Black
4 Married-civ-spouse Prof-specialty Wife Black
5 Married-civ-spouse Exec-managerial Wife White
6 Married-spouse-absent Other-service Not-in-family Black
7 Married-civ-spouse Exec-managerial Husband White
8 Never-married Prof-specialty Not-in-family White
9 Married-civ-spouse Exec-managerial Husband White
sex capital-gain capital-loss hours-per-week native-country result
0 Male 2174 0 40 United-States <=50K
1 Male 0 0 13 United-States <=50K
2 Male 0 0 40 United-States <=50K
3 Male 0 0 40 United-States <=50K
4 Female 0 0 40 Cuba <=50K
5 Female 0 0 40 United-States <=50K
6 Female 0 0 16 Jamaica <=50K
7 Male 0 0 45 United-States >50K
8 Female 14084 0 50 United-States >50K
9 Male 5178 0 40 United-States >50K
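One detail the printed table hides: in the raw file every comma is followed by a space, so string fields read with the default settings keep a leading space (" State-gov" rather than "State-gov"). A minimal sketch of the effect, using the first two rows shown above written in the raw file's layout; skipinitialspace is a standard read_csv option, and whether to use it here is a suggestion rather than something the original script does:

```python
import io

import pandas as pd

col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]

# First two records from the output above, in the raw ", "-separated layout
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

default = pd.read_csv(io.StringIO(raw), names=col_names)
stripped = pd.read_csv(io.StringIO(raw), names=col_names, skipinitialspace=True)

print(repr(default.loc[0, "workclass"]))   # ' State-gov'
print(repr(stripped.loc[0, "workclass"]))  # 'State-gov'
```

This matters later for exact comparisons such as data["sex"] == "Male", which would match nothing if the values still carry the leading space.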
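Description of each column:
age: age, continuous numeric variable;
workclass: employer type, multi-class categorical variable;
fnlwgt: the number of people the census takers believe the observation represents, continuous variable;
education: education level, multi-class categorical variable;
education-num: years of education, continuous variable;
marital-status: marital status, multi-class categorical variable;
occupation: occupation, multi-class categorical variable;
relationship: family relationship, multi-class categorical variable;
race: race, multi-class categorical variable;
sex: sex, binary variable;
capital-gain: capital gains, continuous variable;
capital-loss: capital losses, continuous variable;
hours-per-week: hours worked per week, continuous variable;
native-country: country of origin, multi-class categorical variable;
result: the outcome (<=50K or >50K), binary variable;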
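The continuous/categorical split listed above can also be read off the column dtypes once the data is loaded. A minimal sketch on a two-row sample (values taken from the output above, only a few columns kept for brevity); select_dtypes is standard pandas, the variable names are illustrative:

```python
import pandas as pd

sample = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "education-num": [13, 13],
    "sex": ["Male", "Male"],
    "hours-per-week": [40, 13],
    "result": ["<=50K", "<=50K"],
})

# Numeric columns are the continuous variables; everything else is categorical
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['age', 'education-num', 'hours-per-week']
print(categorical_cols)  # ['workclass', 'sex', 'result']
```

On the full data, data[categorical_cols].nunique() would additionally separate the binary columns from the multi-class ones.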
Feature processing
Check for missing values:
# Method 1
data.info()
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 32561 non-null object
result 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Method 2
print(data.isnull().any())
age False
workclass False
fnlwgt False
education False
education-num False
marital-status False
occupation False
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country False
result False
dtype: bool
print(data.shape)
(32561, 15)
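These functions report no missing values, but the actual data contains many invalid characters such as ?, . and $. Treat the cells that contain them as missing:
import numpy as np
# String cells that contain ?, . or $ are replaced with NaN
data_clean = data.replace(regex=[r'\?|\.|\$'], value=np.nan)
print(data_clean.isnull().any())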
age False
workclass True
fnlwgt False
education False
education-num False
marital-status False
occupation True
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country True
result False
dtype: bool
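isnull() only recognizes real NaN values, which is why the "?" placeholders went unnoticed in the first check. An alternative sketch, not the approach used above: declare "?" as a missing-value marker when reading the file. The three rows below are a made-up toy sample with only three columns; na_values and skipinitialspace are standard read_csv parameters. Note that this only covers "?", not "." or "$".

```python
import io

import pandas as pd

# Toy rows in the raw file's ", "-separated layout (illustrative, not real records)
raw = (
    "39, State-gov, United-States\n"
    "54, ?, United-States\n"
    "32, Private, ?\n"
)

toy = pd.read_csv(
    io.StringIO(raw),
    names=["age", "workclass", "native-country"],
    skipinitialspace=True,  # strip the space after each comma so " ?" becomes "?"
    na_values="?",          # read "?" as NaN
)

# Count missing values per column instead of only asking whether any exist
print(toy.isnull().sum())
```

Using isnull().sum() rather than isnull().any() also shows how many cells are affected in each column.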
Drop all rows that contain missing values
adult = data_clean.dropna(how='any')
print(adult.shape)
(30162, 15)
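Comparing the two shapes, 32561 - 30162 = 2399 rows were dropped. A small sketch of what dropna(how='any') does, on an illustrative toy frame (made-up values, not rows from the dataset), with how='all' shown for contrast:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", np.nan],
    "occupation": ["Sales", "Sales", np.nan, np.nan],
})

any_dropped = toy.dropna(how="any")  # drop a row if at least one value is missing
all_dropped = toy.dropna(how="all")  # drop a row only if every value is missing

print(len(toy) - len(any_dropped))   # 3
print(len(toy) - len(all_dropped))   # 1
print(any_dropped.index.tolist())    # [0]
```

dropna keeps the original row labels, so the cleaned frame's index has gaps; reset_index(drop=True) renumbers the rows if that is a problem.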
Drop features that are not useful
adult = adult.drop(['fnlwgt'],axis=1)
adult.info()
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age 30162 non-null int64
workclass 30162 non-null object
education 30162 non-null object
education-num 30162 non-null int64
marital-status 30162 non-null object
occupation 30162 non-null object
relationship 30162 non-null object
race 30162 non-null object
sex 30162 non-null object
capital-gain 30162 non-null int64
capital-loss 30162 non-null int64
hours-per-week 30162 non-null int64
native-country 30162 non-null object
result 30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB
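Split into training and test sets
# Supervised machine learning
from sklearn.model_selection import train_test_split
# Separate the features from the label
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
# col_names[1:13] runs from workclass to native-country (12 features; age, at index 0, is left out)
# col_names[13] is the label column, result
X_train, X_test, y_train, y_test = train_test_split(adult[col_names[1:13]], adult[col_names[13]], test_size=0.25, random_state=33)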
print(X_train.shape)
print(X_test.shape)
print(X_train.head())
print(y_train.head())
(22621, 12)
(7541, 12)
workclass education education-num marital-status \
20607 Private Some-college 10 Married-civ-spouse
31257 Private HS-grad 9 Married-civ-spouse
31892 Private HS-grad 9 Never-married
20220 Private HS-grad 9 Divorced
24044 Private Some-college 10 Divorced
occupation relationship race sex capital-gain \
20607 Craft-repair Husband White Male 0
31257 Other-service Husband Black Male 0
31892 Adm-clerical Not-in-family White Female 0
20220 Machine-op-inspct Unmarried Black Female 0
24044 Sales Not-in-family White Female 0
capital-loss hours-per-week native-country
20607 0 50 United-States
31257 0 50 United-States
31892 0 45 United-States
20220 0 40 United-States
24044 0 45 United-States
20607 >50K
31257 <=50K
31892 <=50K
20220 <=50K
24044 >50K
Name: result, dtype: object
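The slice col_names[1:13] starts at workclass, so age is not among the 12 feature columns shown above. If age is meant to be a feature, one option is to select the features by dropping the label column instead of slicing by position. A minimal sketch, assuming scikit-learn is installed and using three columns from the first eight rows printed earlier:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# age, hours-per-week and result for rows 0-7 of the printed sample
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "hours-per-week": [40, 13, 40, 40, 40, 40, 16, 45],
    "result": ["<=50K"] * 7 + [">50K"],
})

X = toy.drop(columns="result")  # every column except the label, age included
y = toy["result"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

With test_size=0.25 the test set takes a quarter of the rows, rounded up, and the rest go to training; the same rule gives the 22621 / 7541 split of the 30162 rows above.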