SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
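This is a warning you run into often when using pandas. It means the code is trying to assign to a copy of a slice of a DataFrame. A warning like this does not appear for no reason and usually points to a pitfall, so it is worth following the link in the message to see what is going on.
When setting values on a pandas object, take particular care to avoid what is called chained indexing.
What is chained indexing? It is indexing a DataFrame with [] several times in a row. Under the hood this is a sequence of separate __getitem__ calls executed one after another, rather than a single operation on the original DataFrame.
Here is the example from the pandas documentation: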
In [23]: dfmi = pd.DataFrame(
...: [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
...: columns=pd.MultiIndex.from_product([['one', 'two'],['first', 'second']])
...: )
Two ways to access the same column:
# Chained indexing
In [24]: dfmi['one']['second']
Out[24]:
0 b
1 f
2 j
3 n
Name: second, dtype: object
# Single-step indexing
In [25]: dfmi.loc[:, ('one', 'second')]
Out[25]:
0 b
1 f
2 j
3 n
Name: (one, second), dtype: object
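The two approaches return essentially the same result (apart from the name attribute), but the code that runs underneath is quite different.
In the first approach, dfmi['one'] indexes the first level of the columns and returns a DataFrame; call it dfmi_with_one. The following ['second'] then indexes dfmi_with_one (that is, dfmi_with_one['second']) and returns the Series selected by 'second'. In chained indexing, then, each [] is a separate operation that acts only on the result of the previous one and knows nothing about the original object.
In the second approach, dfmi.loc[:, ('one', 'second')] passes the nested tuple (slice(None), ('one', 'second')) to __getitem__ in a single call. This lets pandas treat the request as a single entity. It is also faster, and it can index both axes at once if needed.
The Series.name of the two results (second versus (one, second)) shows the same thing: the first approach runs as separate steps, the second as one operation.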
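The two access paths can be compared side by side. The sketch below rebuilds the dfmi frame from the documentation example and names the intermediate result explicitly; the variable names intermediate, chained and direct are illustrative and not part of the original example.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Chained indexing: two separate __getitem__ calls.
intermediate = dfmi["one"]        # a DataFrame with columns 'first' and 'second'
chained = intermediate["second"]  # a Series taken from that intermediate frame

# Single-step indexing: one call that receives the full key.
direct = dfmi.loc[:, ("one", "second")]

same_values = chained.tolist() == direct.tolist()
names = (chained.name, direct.name)
print(type(intermediate).__name__, same_values, names)
```

The values agree, while the differing names reflect that the chained result was cut from the intermediate frame rather than from dfmi directly.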
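The difference described so far is only a performance concern, but assigning to the result of chained indexing can give unpredictable results. To see why, look at how the Python interpreter executes the code: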
dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
With chained indexing it looks like this instead:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
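Note the __getitem__ call in the middle. Except in very simple cases, it is hard to tell whether this __getitem__ returns a view or a copy (the pandas documentation says it depends on the memory layout of the array, and pandas makes no guarantees), so it is also impossible to tell whether the subsequent __setitem__ modifies dfmi or a temporary object that is thrown away immediately afterwards. This is what the SettingWithCopy warning at the top is about.
As for the loc approach, note the loc attribute in front of __setitem__. pandas guarantees that dfmi.loc is dfmi itself, so dfmi.loc.__getitem__ and dfmi.loc.__setitem__ operate directly on dfmi. The object returned by dfmi.loc.__getitem__(idx), however, may be either a view or a copy of dfmi.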
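The call sequence described above can be observed without pandas at all. The following is a minimal sketch using a hypothetical Recorder class (not a pandas type) that logs every __getitem__ and __setitem__ call and, like a copy-returning DataFrame, hands back a new object from __getitem__.

```python
class Recorder:
    """Hypothetical stand-in for a DataFrame that logs indexing calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append(f"{self.name}.__getitem__({key!r})")
        # Return a new object, as a DataFrame that returns a copy would.
        return Recorder(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append(f"{self.name}.__setitem__({key!r})")


log = []
obj = Recorder("obj", log)

obj["one"]["second"] = 0  # chained: __getitem__, then __setitem__ on its result
obj["one", "second"] = 0  # single step: __setitem__ on obj itself

for entry in log:
    print(entry)
```

In the chained form the __setitem__ call lands on the intermediate object, not on obj, which is why the outcome depends on whether that intermediate object shares data with the original.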
Let's look at what the two operations actually do.
Assignment with loc:
In [27]: dfmi.loc[:, ('one', 'second')] = list('1234')
In [28]: dfmi
Out[28]:
one two
first second first second
0 a 1 c d
1 e 2 g h
2 i 3 k l
3 m 4 o p
The assignment succeeds.
Assignment with chained indexing:
In [29]: dfmi['one']['second'] = list('5678')
<ipython-input-29-7370041e44f2>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfmi['one']['second'] = list('5678')
In [30]: dfmi
Out[30]:
one two
first second first second
0 a 1 c d
1 e 2 g h
2 i 3 k l
3 m 4 o p
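A SettingWithCopyWarning is issued and the assignment has no effect: dfmi is not modified.
Chaining loc calls triggers the same warning. As noted above, dfmi.loc.__getitem__(idx) may return either a view or a copy of dfmi, so the behavior is just as unpredictable. Avoid code like this: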
In [31]: dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')
<ipython-input-16-791a61a3bb59>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')
# dfmi did change here, but the behavior is still unpredictable; avoid chained loc indexing
In [32]: dfmi
Out[32]:
one two
first second first second
0 a 5 c d
1 e 6 g h
2 i 7 k l
3 m 8 o p
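Sometimes there is no obvious chained indexing, yet the SettingWithCopy warning can still appear. The following code from the pandas documentation is such a case: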
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
Another example:
In [33]: dfsi = pd.DataFrame(
...: [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
...: columns=['one', 'two', 'first', 'second']
...: )
In [34]: onetwo = dfsi[['one', 'two']]
In [35]: onetwo['one'] = list('1234')
<ipython-input-5-81f0fc384f1d>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
onetwo['one'] = list('1234')
# dfsi is unchanged, so the indexing of dfsi above returned a copy
In [36]: dfsi
Out[36]:
one two first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
In [37]: onetwo
Out[37]:
one two
0 1 b
1 2 f
2 3 j
3 4 n
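This is just a chained-indexing assignment split across several lines of code. The underlying problem is the same, and pandas tries to detect such cases and warn about them. When the warning appears, check the code for chained-indexing assignment: its behavior is unpredictable and the assignment may not take effect. Use loc instead, unless you are sure chained indexing is really what you want.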
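One way to remove the ambiguity in the multi-line case is to state the intent explicitly. The sketch below is an illustration rather than something taken from the pandas documentation: it calls .copy() when an independent frame is wanted, and assigns through .loc on the original when the original should change. The replacement values list('wxyz') are made up for the example.

```python
import pandas as pd

dfsi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=["one", "two", "first", "second"],
)

# Case 1: an independent frame is wanted, so make the copy explicit.
onetwo = dfsi[["one", "two"]].copy()
onetwo["one"] = list("1234")  # changes onetwo only

# Case 2: the original should change, so assign through .loc on it.
dfsi.loc[:, "two"] = list("wxyz")

print(onetwo["one"].tolist())
print(dfsi["one"].tolist(), dfsi["two"].tolist())
```

Because onetwo was copied before either assignment, each write affects exactly one of the two frames.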
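With chained indexing, the type of the indexer and the order of the indexing operations both affect whether the result is a slice of the original object or a copy of a slice: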
In [38]: dfa = pd.DataFrame(
...: {'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
...: 'c': np.arange(7)}
...: )
In [39]: dfb = dfa.copy()
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [40]: dfb['c'][dfb['a'].str.startswith('o')] = 42
<ipython-input-25-57ce4ff20dfc>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfb['c'][dfb['a'].str.startswith('o')] = 42
In [41]: dfb
Out[41]:
a c
0 one 42
1 one 42
2 two 2
3 three 3
4 two 4
5 one 42
6 six 6
In [42]: dfb = dfa.copy()
# This however is operating on a copy and will not work
In [43]: dfb[dfb['a'].str.startswith('o')]['c'] = 42
<ipython-input-29-216d8bd475bb>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfb[dfb['a'].str.startswith('o')]['c'] = 42
In [44]: dfb
Out[44]:
a c
0 one 0
1 one 1
2 two 2
3 three 3
4 two 4
5 one 5
6 six 6
For these cases, the pandas documentation recommends accessing the data with .loc as follows:
In [45]: dfb = dfa.copy()
# Setting multiple items using a mask
In [46]: mask = dfb['a'].str.startswith('o')
In [47]: dfb.loc[mask, 'c'] = 42
In [48]: dfb
Out[48]:
a c
0 one 42
1 one 42
2 two 2
3 three 3
4 two 4
5 one 42
6 six 6
# Setting a single item
In [49]: dfb = dfa.copy()
In [50]: dfb.loc[2, 'a'] = 11
In [51]: dfb
Out[51]:
a c
0 one 0
1 one 1
2 11 2
3 three 3
4 two 4
5 one 5
6 six 6
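pandas provides the option mode.chained_assignment, which controls how chained-assignment problems are reported. It accepts three values:
warn: the default; issues a SettingWithCopyWarning
raise: raises a SettingWithCopyError, so the chained indexing has to be fixed
None: ignores chained assignment; neither warns nor raises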
In [52]: pd.set_option('mode.chained_assignment','raise')
In [53]: dfb[dfb['a'].str.startswith('o')]['c'] = 42
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
...
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Assigning through chained indexing behaves unpredictably. Avoid chained indexing and use .loc[row_indexer,col_indexer] = value instead.
Examples of chained-indexing assignment:
dfmi['one']['second'] = list('5678')
dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')
dfb['c'][dfb['a'].str.startswith('o')] = 42
dfb[dfb['a'].str.startswith('o')]['c'] = 42
dfb['a'][2] = 111
dfb.loc[0]['a'] = 1111
onetwo = dfsi[['one', 'two']]
onetwo['one'] = list('1234')
...
Use .loc instead:
dfmi.loc[:, ('one', 'second')] = list('1234')
dfb.loc[dfb['a'].str.startswith('o'), 'c'] = 42
dfb.loc[2, 'a'] = 111
dfb.loc[0, 'a'] = 1111