Translated from: How to check if any value is NaN in a Pandas DataFrame
In Python Pandas, what's the best way to check whether a DataFrame has one (or more) NaN values?
I know about the function pd.isnull, but this returns a DataFrame of booleans for each element. This post right here doesn't exactly answer my question either.
Reference: https://stackoom.com/question/1zuA4/如何检查Pandas-DataFrame中的任何值是否为NaN
df.isnull().any().any()
should do it.
You have a couple of options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
Now the data frame looks something like this:
0 1 2 3 4 5
0 0.520113 0.884000 1.260966 -0.236597 0.312972 -0.196281
1 -0.837552 NaN 0.143017 0.862355 0.346550 0.842952
2 -0.452595 NaN -0.420790 0.456215 1.203459 0.527425
3 0.317503 -0.917042 1.780938 -1.584102 0.432745 0.389797
4 -0.722852 1.704820 -0.113821 -1.466458 0.083002 0.011722
5 -0.622851 -0.251935 -1.498837 NaN 1.098323 0.273814
6 0.329585 0.075312 -0.690209 -3.807924 0.489317 -0.841368
7 -1.123433 -1.187496 1.868894 -2.046456 -0.949718 NaN
8 1.133880 -0.110447 0.050385 -1.158387 0.188222 NaN
9 -0.513741 1.196259 0.704537 0.982395 -0.585040 -1.693810
Option 1: df.isnull().any().any() - This returns a boolean value
You know of isnull(), which returns a DataFrame like this:
0 1 2 3 4 5
0 False False False False False False
1 False True False False False False
2 False True False False False False
3 False False False False False False
4 False False False False False False
5 False False False True False False
6 False False False False False False
7 False False False False False True
8 False False False False False True
9 False False False False False False
If you make it df.isnull().any(), you can find just the columns that have NaN values:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
One more .any() will tell you if any of the above are True:
> df.isnull().any().any()
True
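The chain above can be checked end to end on a small frame with known NaN positions. This is only a minimal sketch: the frame `small` and its column names are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame: NaN appears in columns "b" and "c" only
small = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "c": [7.0, np.nan, np.nan],
})

per_column = small.isnull().any()  # one boolean per column
has_nan = bool(per_column.any())   # collapse to a single boolean

print(per_column)
print(has_nan)
```

The first .any() reduces over rows (one result per column); the second reduces that Series to a single value.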
Option 2: df.isnull().sum().sum() - This returns an integer of the total number of NaN values:
This operates the same way as .any().any() does, by first giving a summation of the number of NaN values in each column, then the summation of those values:
df.isnull().sum()
0 0
1 2
2 0
3 1
4 0
5 2
dtype: int64
Finally, to get the total number of NaN values in the DataFrame:
df.isnull().sum().sum()
5
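Summing works as a count because each True in the boolean mask is treated as 1. As a reproducible sketch, the frame below uses the same NaN layout as the example above but is filled with zeros instead of random numbers (that substitution is an assumption made here so the result is deterministic):

```python
import numpy as np
import pandas as pd

# Same NaN positions as the example frame, zeros everywhere else
df = pd.DataFrame(np.zeros((10, 6)))
df.iloc[1:3, 1] = np.nan
df.iloc[5, 3] = np.nan
df.iloc[7:9, 5] = np.nan

nan_per_column = df.isnull().sum()     # NaN count for each column
total_nan = int(nan_per_column.sum())  # NaN count for the whole frame

print(nan_per_column.tolist())
print(total_nan)
```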
jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
df.isnull().sum().sum() is a bit slower, but of course it has additional information: the number of NaNs.
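Since %timeit only exists inside IPython, a comparison like this can also be sketched with the standard-library timeit module. The frame size, seed, and repeat count below are arbitrary choices for illustration, and absolute timings will vary with machine and pandas version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the session above so the script runs quickly
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 200)))
df[df > 0.9] = np.nan

candidates = {
    "any().any()": lambda: df.isnull().any().any(),
    "values.any()": lambda: df.isnull().values.any(),
    "sum().sum()": lambda: df.isnull().sum().sum(),
    "values.sum()": lambda: df.isnull().values.sum(),
}

repeats = 20
results = {name: fn() for name, fn in candidates.items()}
timings = {name: timeit.timeit(fn, number=repeats) for name, fn in candidates.items()}

for name in candidates:
    per_call_ms = timings[name] / repeats * 1e3
    print(f"{name:>13}: result={results[name]}, {per_call_ms:.3f} ms per call")
```

The two .any() variants should agree on the boolean answer, and the two .sum() variants should agree on the count; only the speed differs.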
Depending on the type of data you're dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.
for col in df:
    print(df[col].value_counts(dropna=False))
Works well for categorical variables, not so much when you have many unique values.
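As a small illustration of the dropna=False behaviour, here is a sketch on a made-up categorical column (the Series `colors` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
colors = pd.Series(["red", "blue", np.nan, "red", np.nan, "red"])

# With dropna=False, NaN gets its own row in the counts
counts = colors.value_counts(dropna=False)
print(counts)

# Pull out just the NaN count by selecting the missing index label
nan_count = int(counts[counts.index.isna()].sum())
print(nan_count)
```

With the default dropna=True the NaN row is omitted, so missing values would go unnoticed in the output.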
If you need to know how many rows there are with "one or more NaNs":
df.isnull().T.any().T.sum()
Or if you need to pull out these rows and examine them:
nan_rows = df[df.isnull().T.any().T]
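The double transpose can also be avoided by reducing along rows with any(axis=1), which should give the same row mask. A sketch, reusing the NaN layout from the earlier example frame but with zeros in place of random values (an assumption made here for reproducibility):

```python
import numpy as np
import pandas as pd

# Same NaN positions as the earlier example frame, zeros everywhere else
df = pd.DataFrame(np.zeros((10, 6)))
df.iloc[1:3, 1] = np.nan
df.iloc[5, 3] = np.nan
df.iloc[7:9, 5] = np.nan

# True for every row that contains at least one NaN
row_mask = df.isnull().any(axis=1)

n_rows_with_nan = int(row_mask.sum())  # how many rows have a NaN
nan_rows = df[row_mask]                # the rows themselves

print(n_rows_with_nan)
print(nan_rows.index.tolist())
```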