Python pandas apply raises KeyError: None of [['xxx', 'yyy','zzz']] are in the [index]

Reproducing the problem

While using logistic regression to classify the iris dataset, I ran into the following error:

Traceback (most recent call last):
  File "/home/dong/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in 
    runfile('/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification/Iris_learn.py', wdir='/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification')
  File "/opt/tools/pycharm-2018.3/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/opt/tools/pycharm-2018.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/dong/opt/workspace/github/huanLing/src/main/com/dong/ai/learn/classification/Iris_learn.py", line 60, in 
    X = datas[names[0:-1]]
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/series.py", line 810, in __getitem__
    return self._get_with(key)
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/series.py", line 851, in _get_with
    return self.loc[key]
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1478, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1901, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1143, in _getitem_iterable
    self._validate_read_indexer(key, indexer, axis)
  File "/home/dong/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1206, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)))
KeyError: "None of [['sepal length', 'sepal width', 'petal length', 'petal width']] are in the [index]"

Part of the code is shown below:

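import warnings

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize

# Suppress convergence warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Load the data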
path = "datas/iris.data"
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'cla']

df = pd.read_csv(path, header=None, names=names)
df['cla'].value_counts()
df.head()


def parseRecord(record):
    result=[]
    r = zip(names,record)
    for name,v in r:
        if name == 'cla':
            if v == 'Iris-setosa':
                result.append(1)
            elif v == 'Iris-versicolor':
                result.append(2)
            elif v == 'Iris-virginica':
                result.append(3)
            else:
                result.append(np.nan)
        else:
            result.append(float(v))
    return result


# 1. Convert the data to numbers and split it
# Convert the data
datas = df.apply(lambda r: parseRecord(r), axis=1)
# Drop rows with invalid data
datas = datas.dropna(how='any')
# Split into features and labels
X = datas[names[0:-1]]
Y = datas[names[-1]]

# Sample the data (split into training and test sets)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
print("Total samples: %d; training samples: %d; features: %d; test samples: %d" % (len(X), len(X_train), X_train.shape[1], X_test.shape[0]))

# 2. Standardize the data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

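# 3. Feature selection (not performed here)

# 4. Dimensionality reduction (not performed here)

# 5. Build the model
lr = LogisticRegressionCV(Cs=np.logspace(-4, 1, 50), cv=3, fit_intercept=True, penalty='l2', solver='lbfgs', tol=0.01, multi_class='multinomial')
# solver options: 'newton-cg', 'lbfgs', 'liblinear', 'sag' (LogisticRegressionCV defaults to 'lbfgs')
# 'sag' = stochastic average gradient
# multi_class: 'multinomial' fits one softmax model over all classes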
lr.fit(X_train, Y_train.astype(int))

y_test_hot = label_binarize(Y_test.astype(int), classes=(1, 2, 3))
print(y_test_hot)

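# Decision scores for the test set
lr_y_score = lr.decision_function(X_test)

# Compute the ROC curve (micro-averaged over the one-hot labels)
lr_fpr, lr_tpr, lr_threasholds = metrics.roc_curve(y_test_hot.ravel(), lr_y_score.ravel())

# lr_threasholds holds the decision thresholds
# Compute the AUC
lr_auc = metrics.auc(lr_fpr, lr_tpr)
print("Logistic regression training accuracy: ", lr.score(X_train, Y_train))
print("Logistic regression AUC: ", lr_auc)

# 7. Model prediction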
print(lr_y_score)
lr_y_predict = lr.predict(X_test)
print(lr.predict_proba(X_test))

Analysis

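According to the traceback, the line X = datas[names[0:-1]] is the problem. I searched for a long time, though, and the iris analyses other people wrote earlier use exactly the same code without any trouble.
The final message, KeyError: "None of [['sepal length', 'sepal width', 'petal length', 'petal width']] are in the [index]", roughly means that the requested labels (sepal length, sepal width, and so on) are not present in the index. That was frustrating, because the message does not say why.
Later an expert pointed out that the iris dataset is old and the example models built on it are probably old too, so the code may rely on an older pandas version whose API has since changed, and suggested that I read the latest API documentation. It would have saved time to be told exactly what was wrong, but then I would not have learned much.
So, off to the API documentation.
The link is: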
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
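Follow the link if you want the original; it is all in English.
In short:
The parameters of DataFrame.apply are

  1. func
  2. axis=0
  3. broadcast
  4. raw
  5. reduce
  6. result_type
  7. args
  8. **kwds

OK, let's look at the descriptions of the relevant parameters.

The func parameter
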
func : function

Function to apply to each column or row.

In other words, you pass a function to apply as its argument, and that function is applied to each row or each column.

The axis parameter

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Axis along which the function is applied:

0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.

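In other words, the axis parameter decides whether func is applied to the columns or to the rows. You can pass 0 or 1, or the strings 'index' or 'columns'. 0 ('index') applies func to each column, and 1 ('columns') applies it to each row; the default is 0.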
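
The difference between the two axis values is easy to mix up, so here is a minimal sketch. The frame and its labels (a, b, c, r0, r1) are made up for illustration and have nothing to do with the iris data.

```python
import pandas as pd

# Made-up 2x3 frame, used only for illustration.
df = pd.DataFrame(
    {"a": [1, 2], "b": [10, 20], "c": [100, 200]},
    index=["r0", "r1"],
)

# axis=0 (or 'index'): func receives one column at a time,
# so the result has one entry per column.
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1 (or 'columns'): func receives one row at a time,
# so the result has one entry per row.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)

# The string aliases behave the same as the integers.
same_cols = df.apply(lambda s: s.sum(), axis="index")
same_rows = df.apply(lambda s: s.sum(), axis="columns")
```

With axis=1 the row that func receives is itself a Series indexed by the column names, which is why parseRecord can pair it with names.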

The broadcast parameter

Only relevant for aggregation functions:

False or None : 
	returns a Series whose length is the length of the index or the number of columns (based on the axis parameter)
	
True : 
	results will be broadcast to the original shape of the frame, the original index and columns will be retained.
	
Deprecated since version 0.23.0: This argument will be removed in a future version, replaced by result_type=’broadcast’.

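In other words, if this parameter is False or None, apply returns a Series whose length equals the number of columns (axis=0) or the number of rows (axis=1), depending on the axis parameter.

If it is True, the result of func is broadcast back to the original shape of the data, and the original index and columns are retained.

However, this parameter is deprecated as of pandas 0.23 and is replaced by result_type='broadcast'.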
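
To see what result_type changes, the sketch below applies the same list-returning function three ways. The frame and the helper double_as_list are hypothetical stand-ins for the iris data and parseRecord, and the behaviour described in the comments assumes pandas 0.23 or later.

```python
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})


def double_as_list(row):
    # Hypothetical helper: returns a plain list, as parseRecord does.
    return [v * 2 for v in row]


# No result_type: each returned list becomes one element of a Series.
default = df.apply(double_as_list, axis=1)

# 'expand': the lists are turned into columns, labelled 0, 1, ...
expanded = df.apply(double_as_list, axis=1, result_type="expand")

# 'broadcast': the lists are written back into the original shape,
# keeping the original index and column names.
broadcast = df.apply(double_as_list, axis=1, result_type="broadcast")

print(type(default).__name__, type(expanded).__name__, type(broadcast).__name__)
print(list(expanded.columns), list(broadcast.columns))
```

Only the 'broadcast' result keeps the column names x and y, which is what later selection by name relies on.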

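After reading this description I had an idea: could it be that the list returned by func is no longer spread back over the fields of each row? That is indeed what happens. Since pandas 0.23, when func returns a list, apply returns a Series of lists instead of a DataFrame, so datas has no column labels left to select by.

OK, I added the parameter and tried it. Sure enough, the problem was solved.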
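
One way to confirm the diagnosis is to look at what apply actually returns. The sketch below uses a shortened, made-up version of the iris frame (two rows, three columns) and assumes pandas 0.23 or later.

```python
import pandas as pd

# Shortened, hypothetical column list and two made-up rows.
names = ["sepal length", "sepal width", "cla"]
df = pd.DataFrame(
    [[5.1, 3.5, "Iris-setosa"], [7.0, 3.2, "Iris-versicolor"]],
    columns=names,
)

# func returns a list and no result_type is given.
datas = df.apply(lambda r: list(r), axis=1)

print(type(datas))  # a Series, not a DataFrame
print(datas.index)  # only the row labels are left; the column names are gone

error = None
try:
    X = datas[names[0:-1]]  # looks the names up in the row index
except KeyError as exc:
    error = exc
    print("KeyError:", exc)
```

Because datas is a Series, indexing it with a list of names searches the row index, which explains why the message talks about the [index] rather than the columns.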

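Solution

Option 1:
Add the parameter broadcast=True (func stands for the row function, parseRecord in the code above). This only works on pandas versions before 1.0, where the deprecated parameter still exists.

datas = df.apply(func, axis=1, broadcast=True)

Option 2:
Add the parameter result_type='broadcast'

datas = df.apply(func, axis=1, result_type='broadcast')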
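
A third possibility, which is not one of the two fixes above, is to have the row function return a labelled pd.Series so that apply rebuilds a DataFrame by itself. The sketch below is an assumption-laden illustration: parse_record is a hypothetical rewrite of parseRecord, the labels dictionary replaces its if/elif chain, and the three rows are made up rather than read from datas/iris.data.

```python
import numpy as np
import pandas as pd

names = ["sepal length", "sepal width", "petal length", "petal width", "cla"]
labels = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# Three made-up rows standing in for the real file; the last class is unknown.
df = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
        [6.3, 3.3, 6.0, 2.5, "unknown"],
    ],
    columns=names,
)


def parse_record(record):
    # Hypothetical variant of parseRecord: map the class name to a number
    # (NaN if unknown) and convert every other field to float.
    values = [
        labels.get(v, np.nan) if name == "cla" else float(v)
        for name, v in zip(names, record)
    ]
    # Returning a Series labelled with `names` lets apply build a DataFrame
    # whose columns carry those names.
    return pd.Series(values, index=names)


datas = df.apply(parse_record, axis=1)
datas = datas.dropna(how="any")  # drops the row with the unknown class

X = datas[names[0:-1]]
Y = datas[names[-1]]
print(X.shape, Y.tolist())
```

This variant does not depend on the broadcast or result_type parameters, at the cost of building one Series per row.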
