One day while tinkering with pandas, I ran into this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
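In other words, a value is being assigned to a copy of a slice taken from a DataFrame, and the suggestion is to do the assignment with .loc instead.
So I dug into it, and the behavior struck me as odd.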
In [233]: import pandas as pd
In [234]: A = pd.DataFrame([[1,2,3], [7,8,9],[14,15,16]], columns = ['a', 'b', 'c'])
In [235]: A
Out[235]:
a b c
0 1 2 3
1 7 8 9
2 14 15 16
Assign the first column of A to B. B is a Series, and modifying one element of B turns out to modify A as well.
In [236]: B = A['a']
In [237]: B
Out[237]:
0 1
1 7
2 14
Name: a, dtype: int64
In [238]: type(B)
Out[238]: pandas.core.series.Series
In [239]: B[0] = 3
In [240]: B
Out[240]:
0 3
1 7
2 14
Name: a, dtype: int64
In [241]: A
Out[241]:
a b c
0 3 2 3
1 7 8 9
2 14 15 16
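If the write-through from B to A is not wanted, one option is to copy the column explicitly, and to edit A itself only through a single .loc call. The following is a minimal sketch of that idea, not part of the original session; the value 70 is made up for illustration. Whether the un-copied B above writes through to A can depend on the pandas version and its copy-on-write setting, so the sketch avoids relying on it.

```python
import pandas as pd

A = pd.DataFrame([[1, 2, 3], [7, 8, 9], [14, 15, 16]], columns=["a", "b", "c"])

# Explicit copy: B owns its data, so later edits to B do not touch A.
B = A["a"].copy()
B.loc[0] = 3

# Intentional edit of A itself, done in one indexing call.
# (70 is an arbitrary illustrative value.)
A.loc[1, "a"] = 70

print(B)
print(A)
```

Here B changes only in row 0, and A changes only in row 1, because the two objects no longer share data.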
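Next, assign the first and second columns of A to C. Is C a copy of a slice of A? C is a DataFrame. Modifying the first row of the first column of C triggers the warning; C is modified, but A is not.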
In [243]: C = A[['a', 'b']]
In [244]: C
Out[244]:
a b
0 3 2
1 7 8
2 14 15
In [245]: type(C)
Out[245]: pandas.core.frame.DataFrame
In [246]: C['a'][0]
Out[246]: 3
In [247]: C['a'][0] =5
c:\program files\python36\lib\site-packages\IPython\core\interactiveshell.py:2910: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
exec(code_obj, self.user_global_ns, self.user_ns)
In [248]: C
Out[248]:
a b
0 5 2
1 7 8
2 14 15
In [249]: A
Out[249]:
a b c
0 3 2 3
1 7 8 9
2 14 15 16
Build D with A's loc method. D has the same contents as C, yet the same operation raises no warning.
In [250]: D = A.loc[:, ['a','b']]
In [251]: D
Out[251]:
a b
0 3 2
1 7 8
2 14 15
In [252]: type(D)
Out[252]: pandas.core.frame.DataFrame
In [253]: D['a'][0]
Out[253]: 3
In [254]: D['a'][0] = 5
In [255]: D
Out[255]:
a b
0 5 2
1 7 8
2 14 15
In [256]: A
Out[256]:
a b c
0 3 2 3
1 7 8 9
2 14 15 16
So what is the difference between C and D? I tried one way of avoiding the warning for C, namely
C = C.copy()
After that, modifying values no longer triggers the warning.
I then looked at the official documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The gist is as follows:
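The workaround can be sketched end to end as below. This is an illustrative rewrite rather than the exact session above: it takes the copy at the moment C is created and sets the value with one .loc call instead of C['a'][0].

```python
import pandas as pd

A = pd.DataFrame([[1, 2, 3], [7, 8, 9], [14, 15, 16]], columns=["a", "b", "c"])

# Take an explicit, independent copy of two columns.
C = A[["a", "b"]].copy()

# Set row label 0 of column "a" in a single indexing call.
C.loc[0, "a"] = 5

print(C)
print(A)  # A still holds its original value in row 0, column "a"
```

Calling .copy() states the intent directly: C is meant to be a separate object, so edits to it should never reach A.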
In [339]: dfmi = pd.DataFrame([list('abcd'),
.....: list('efgh'),
.....: list('ijkl'),
.....: list('mnop')],
.....: columns=pd.MultiIndex.from_product([['one','two'],
.....: ['first','second']]))
.....:
In [340]: dfmi
Out[340]:
    one          two
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p
If you use the loc method,
dfmi.loc[:,('one','second')] = value
this is treated as (it is equivalent to) a call to loc's __setitem__ method
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
If you assign like this instead
dfmi['one']['second'] = value
it is effectively evaluated as follows
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
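That is, __getitem__ is called first and returns a DataFrame, and then __setitem__ is called on that returned object. Two separate calls are made, which is known as chained indexing, and it is slower than a single loc call.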
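But pandas does not warn you just because you spent a little extra time. It warns because it cannot guarantee whether the DataFrame returned by the first call is a view or a copy, which depends on the memory layout of the underlying array. If a view is returned, all is well. If a copy is returned, then assigning to that copy leaves the original variable unchanged. pandas cannot guarantee whether __setitem__ will modify dfmi or a temporary object that is thrown away immediately, so it is best to use loc. In the documentation's own words: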
What’s up with the SettingWithCopy warning? We don’t usually throw warnings around when you do something that might cost a few extra milliseconds! But it turns out that assigning to the product of chained indexing has inherently unpredictable results. Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!
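Following that advice, the chained assignment on dfmi can be replaced by one .loc call. The sketch below rebuilds the dfmi frame from the documentation example; the replacement value "z" is a made-up placeholder, since the documentation only writes value.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# One indexing call: all rows, column ("one", "second").
# "z" is an arbitrary illustrative value.
dfmi.loc[:, ("one", "second")] = "z"

print(dfmi)
```

Because the row selector and the column selector are passed together, there is no intermediate object whose view-or-copy status could matter.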
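Back to my own problem. Judging from the runs above, C is a copy of a slice of A, because changing C had no effect on A. So why the warning? My guess is that pandas internally regards C as the "temporary object that gets thrown out immediately afterward" mentioned above, whereas B is a view of a slice of A, which is why no warning was raised for it.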