pandas: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

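When processing data in Python, I often use pandas, a powerful data-processing library: store the data as a DataFrame, then run a series of operations on it.

There is one problem I have hit repeatedly while processing data, both in the past and recently, so I am writing it down here: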

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(

Reproducing the problem:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'variety': ['beef', 'mutton', 'pork'],
    'count': [10, 5, np.nan]
})

df[['count']].fillna(0, inplace=True)

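In the real project I needed to fill missing values in several columns; for the reproduction I simply use a single column as a test. Running the code produces the warning shown in the title.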
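A minimal sketch of why the call above has no effect and of two common ways to write it instead. The names `cols`, `filled`, `filled_by_dict`, and `still_missing` are made up for this illustration, and the exact warning emitted may differ between pandas versions.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'variety': ['beef', 'mutton', 'pork'],
    'count': [10, 5, np.nan],
})

# The original call fills a temporary object: df[['count']] builds a new
# DataFrame, fillna(inplace=True) changes that one, and then it is discarded.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the warning for this demo only
    df[['count']].fillna(0, inplace=True)
still_missing = int(df['count'].isna().sum())  # the NaN in df is still there

# Option 1: assign the filled columns back (works for a list of columns).
cols = ['count']  # in a real project this list would hold several columns
filled = df.copy()
filled[cols] = filled[cols].fillna(0)

# Option 2: pass a {column: fill value} mapping to fillna on the whole frame.
filled_by_dict = df.fillna({'count': 0})
```

Both options avoid calling an in-place method on the result of an indexing expression, which is the pattern the warning is about.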

When this happens, the warning message itself helpfully tells you to "See the caveats in the documentation" and gives the link: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Let's follow the link and see what it says.

Returning a view versus a copy
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [354]: dfmi = pd.DataFrame([list('abcd'),
   .....:                      list('efgh'),
   .....:                      list('ijkl'),
   .....:                      list('mnop')],
   .....:                     columns=pd.MultiIndex.from_product([['one', 'two'],
   .....:                                                         ['first', 'second']]))
   .....: 

In [355]: dfmi
Out[355]: 
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

Compare these two access methods:

In [356]: dfmi['one']['second']
Out[356]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object
In [357]: dfmi.loc[:, ('one', 'second')]
Out[357]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. The intermediate result is written as the variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this to df.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.
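To make the "single call with a nested tuple" point concrete, here is a small sketch using a made-up class, `KeyRecorder` (not part of pandas), whose `__getitem__` simply records the key it receives.

```python
class KeyRecorder:
    """Hypothetical helper: remembers every key passed to __getitem__."""

    def __init__(self):
        self.keys = []

    def __getitem__(self, key):
        self.keys.append(key)
        return self  # return self so that chained [] keeps recording


# Chained indexing: two separate __getitem__ calls, one key each.
chained = KeyRecorder()
chained['one']['second']

# Single indexing with a tuple: one __getitem__ call with a nested tuple.
single = KeyRecorder()
single[:, ('one', 'second')]

print(chained.keys)  # ['one', 'second']
print(single.keys)   # [(slice(None, None, None), ('one', 'second'))]
```

This is plain Python behavior: the expression inside one pair of brackets always arrives as a single key, so a library like pandas can only see the whole request when it is written in one indexing operation.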

Why does assignment fail when using chained indexing?

The problem in the previous section is just a performance issue. What's up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!
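As a minimal sketch of the recommended form, the snippet below rebuilds `dfmi` from the example above and assigns through a single `.loc` call, once with the usual syntax and once with the explicit `__setitem__` spelling the documentation describes. The fill values `'z'` and `'y'` are arbitrary placeholders chosen for this illustration.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
    columns=pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']]),
)

# One indexing operation: pandas receives the full key and writes into dfmi.
dfmi.loc[:, ('one', 'second')] = 'z'

# The same kind of assignment spelled out as the call Python actually makes.
dfmi.loc.__setitem__((slice(None), ('one', 'first')), 'y')

print(dfmi)
```

The chained form `dfmi['one']['second'] = value` is left out on purpose: whether it changes `dfmi` is exactly the thing that cannot be relied on.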

Note
You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.

Sometimes a SettingWithCopy warning will arise at times when there's no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you've done this:

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
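One common way to remove the ambiguity in a function like this is to take an explicit copy, so that `foo` is clearly an independent object. The sketch below assumes that the caller's DataFrame should stay unchanged; the extra `value` parameter and the sample data are made up for illustration.

```python
import pandas as pd


def do_something(df, value):
    # .copy() makes foo an independent DataFrame, so the assignment below
    # can only ever affect foo and never the caller's df.
    foo = df[['bar', 'baz']].copy()
    foo['quux'] = value
    return foo


source = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4], 'other': [5, 6]})
result = do_something(source, 0)

print(list(result.columns))  # ['bar', 'baz', 'quux']
print(list(source.columns))  # ['bar', 'baz', 'other']
```

If the intent were instead to add the column to the original DataFrame, the assignment should be made on `df` itself rather than on a selection taken from it.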

I hope this helps. If anything here is wrong, please point it out. Thanks!
