While using logistic regression to classify the iris dataset, I ran into the following error:
Traceback (most recent call last):
File "/home/dong/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
runfile('/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification/Iris_learn.py', wdir='/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification')
File "/opt/tools/pycharm-2018.3/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/opt/tools/pycharm-2018.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification/Iris_learn.py", line 60, in
X = datas[names[0:-1]]
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/series.py", line 810, in __getitem__
return self._get_with(key)
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/series.py", line 851, in _get_with
return self.loc[key]
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1901, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1143, in _getitem_iterable
self._validate_read_indexer(key, indexer, axis)
File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1206, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)))
KeyError: "None of [['sepal length', 'sepal width', 'petal length', 'petal width']] are in the [index]"
# Imports used below
import warnings
import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics

# Suppress convergence warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
# Load the data
path = "datas/iris.data"
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'cla']
df = pd.read_csv(path, header=None, names=names)
df['cla'].value_counts()
df.head()

# Convert the class label to a number and every other field to float
def parseRecord(record):
    result = []
    r = zip(names, record)
    for name, v in r:
        if name == 'cla':
            if v == 'Iris-setosa':
                result.append(1)
            elif v == 'Iris-versicolor':
                result.append(2)
            elif v == 'Iris-virginica':
                result.append(3)
            else:
                result.append(np.nan)
        else:
            result.append(float(v))
    return result

# 1. Convert the data to numbers and split it
# Data conversion
datas = df.apply(lambda r: parseRecord(r), axis=1)
# Drop rows with invalid values
datas = datas.dropna(how='any')
# Split into features and labels
X = datas[names[0:-1]]
Y = datas[names[-1]]
## Sample the data (train/test split)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
print("Original samples: %d; training samples: %d; features: %d; test samples: %d" % (len(X), len(X_train), X_train.shape[1], X_test.shape[0]))
# 2. Standardize the data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# 3. Feature selection (skipped here)
# 4. Dimensionality reduction (skipped here)
# 5. Build the model
lr = LogisticRegressionCV(Cs=np.logspace(-4, 1, 50), cv=3, fit_intercept=True, penalty='l2', solver='lbfgs', tol=0.01, multi_class='multinomial')
# solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag'; default: 'liblinear'
# 'sag' = stochastic average gradient
# multi_class: 'ovr' or 'multinomial'
lr.fit(X_train, Y_train.astype(int))
y_test_hot = label_binarize(Y_test.astype(int), classes=(1, 2, 3))
print(y_test_hot)
# Decision scores for the test set
lr_y_score = lr.decision_function(X_test)
# Compute the ROC curve
lr_fpr, lr_tpr, lr_threasholds = metrics.roc_curve(y_test_hot.ravel(), lr_y_score.ravel())
# lr_threasholds: the decision thresholds of the ROC curve
# Compute the AUC
lr_auc = metrics.auc(lr_fpr, lr_tpr)
print("Logistic regression training accuracy: ", lr.score(X_train, Y_train))
print("Logistic regression AUC: ", lr_auc)
# 7. Model prediction
print(lr_y_score)
lr_y_predict = lr.predict(X_test)
print(lr.predict_proba(X_test))
According to the traceback, the problem is this line:
X = datas[names[0:-1]]
But I searched for a long time, and the iris classification examples other people have written use exactly the same code and work fine.
The last line of the error says KeyError: "None of [['sepal length', 'sepal width', 'petal length', 'petal width']] are in the [index]", which roughly means that the specified labels, such as 'sepal length' and 'sepal width', are not in the index. At this point I was thoroughly frustrated; the error message is anything but clear.
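To decode the message first: pandas raises this KeyError whenever you index an object with a list of labels, none of which exist in its index. A tiny illustration with made-up data (the behaviour in the pandas version from the traceback and in current versions):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])        # the index here is just 0, 1, 2
s[['sepal length', 'sepal width']]    # raises a KeyError much like the one above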
Later an expert pointed out that the iris dataset, and most of the example models built on it, are fairly old, so the code probably relies on an outdated pandas API; his advice was to go read the current API documentation. It would have been quicker if he had simply told me what was wrong, but then I wouldn't have learned much.
So, off to the API documentation.
The documentation is here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
Click through if you want the original text.
Roughly, it says the following.
DataFrame.apply has these parameters:
func : function
Function to apply to each column or row.
That is, apply takes a function as its argument, and this func is applied to each row or each column.
The axis parameter:
axis : {0 or 'index', 1 or 'columns'}, default 0
Axis along which the function is applied:
0 or 'index': apply function to each column.
1 or 'columns': apply function to each row.
That is, axis decides whether func is applied to the columns or the rows: 0 (or 'index') applies the function to each column, and 1 (or 'columns') applies it to each row. The default is 0.
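A quick illustration of axis with a throwaway frame (the column names and values are just for demonstration):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
print(demo.apply(np.sum))          # axis=0 (default): sum of each column -> a 6, b 60
print(demo.apply(np.sum, axis=1))  # axis=1: sum of each row -> 11, 22, 33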
The broadcast parameter:
broadcast : bool, optional
Only relevant for aggregation functions:
False or None : returns a Series whose length is the length of the index or the number of columns (based on the axis parameter)
True : results will be broadcast to the original shape of the frame, the original index and columns will be retained.
Deprecated since version 0.23.0: This argument will be removed in a future version, replaced by result_type='broadcast'.
That is, if broadcast is False or None, apply returns a Series whose length equals the number of columns (axis=0) or the number of rows (axis=1), depending on axis.
If it is True, the result returned by func is broadcast back to the original shape of the frame, and the original index and columns are retained.
However, this parameter is deprecated since pandas 0.23.0 and has been replaced by result_type='broadcast'.
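For example (again a throwaway frame, assuming pandas 0.23 or newer), you can see the difference between the default behaviour and result_type='broadcast':

import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# Default: a function that returns a plain list gives back a Series of lists,
# so the column labels 'a' and 'b' are gone
plain = demo.apply(lambda r: [r['a'] * 10, r['b'] * 10], axis=1)
print(type(plain))            # <class 'pandas.core.series.Series'>
# With broadcasting: the result keeps demo's shape, index and columns
wide = demo.apply(lambda r: [r['a'] * 10, r['b'] * 10], axis=1, result_type='broadcast')
print(wide.columns.tolist())  # ['a', 'b'] -- so wide[['a', 'b']] still works

This is exactly the situation in the script above: parseRecord returns a plain list, so without broadcasting datas ends up as a Series of lists instead of a DataFrame, and datas[names[0:-1]] has no column labels to look up.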
After reading this, something clicked: could the problem be that the result of func was never broadcast back onto the columns of each row?
OK, add the parameter and try it. Sure enough, it works; problem solved.
Option 1: add the parameter broadcast=True (deprecated since pandas 0.23, but still accepted)
datas = df.apply(lambda r: parseRecord(r), axis=1, broadcast=True)
Option 2: add the parameter result_type='broadcast' (its replacement in pandas 0.23+)
datas = df.apply(lambda r: parseRecord(r), axis=1, result_type='broadcast')
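With either option, datas stays a DataFrame with the original five columns, so the later feature/label split goes through. A quick way to double-check:

datas = df.apply(lambda r: parseRecord(r), axis=1, result_type='broadcast')
print(type(datas))             # should now be a DataFrame, not a Series
print(datas.columns.tolist())  # the original column names from `names`
X = datas[names[0:-1]]         # no more KeyError
Y = datas[names[-1]]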