Brief notes on missing values in pandas

Table of Contents

  • Statistics and Delete
  • Fill and Interpolate
  • Nullable
    • Normal Missing
    • Nullable Series

Statistics and Delete

Statistics

df.isna()         # Boolean mask, True where a value is missing
df.isna().mean()  # fraction of missing values in each column
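
As a minimal sketch of the two calls above (the column names and values are made up for illustration), the mask can be averaged or summed to get per-column missing ratios and counts:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column "a" has 2 missing values, "b" has 1.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, "x", "y", "z"],
})

missing_mask = df.isna()          # same shape as df, True where missing
missing_ratio = df.isna().mean()  # True counts as 1, so the mean is a ratio
missing_count = df.isna().sum()   # number of missing values per column

print(missing_ratio)
print(missing_count)
```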

Delete

df.dropna(axis, how, subset)
df.dropna(axis, thresh)

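Usage 1
subset is a list of column labels to check when dropping rows (or index labels when dropping columns).
how is a str in ['any', 'all'].
With 'any', a row (or column) is deleted when any of the values selected by subset is missing;
with 'all', it is deleted only when all of them are missing.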

Usage 2
thresh is a threshold value:
the minimum number of non-missing values a row (or column) must have to be kept.
When the number of non-missing values is lower than thresh, the row (or column) is deleted.
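
Both usages can be sketched on a small frame. The data below is hypothetical and chosen so that the three calls keep different rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely missing, rows 1 and 3 are partly missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

# Usage 1: drop a row if ANY of "a", "b" is missing.
any_ab = df.dropna(axis=0, how="any", subset=["a", "b"])

# Usage 1: drop a row only if ALL of "a", "b" are missing.
all_ab = df.dropna(axis=0, how="all", subset=["a", "b"])

# Usage 2: keep only rows with at least 2 non-missing values.
at_least_two = df.dropna(axis=0, thresh=2)

print(list(any_ab.index), list(all_ab.index), list(at_least_two.index))
```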

Fill and Interpolate

Fill

s.fillna(method, limit)
s.fillna(value)
s.fillna({index: value})

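method is a str in ['ffill', 'bfill']:
'ffill' fills forward with the previous valid value, 'bfill' fills backward with the next valid value.
limit is the maximum number of consecutive missing values to fill; by default there is no limit.
Recent pandas versions deprecate the method argument of fillna in favor of s.ffill() and s.bfill().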
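
A short sketch of the fill variants on a made-up Series. It uses s.ffill() and s.bfill() rather than fillna(method=...), on the assumption that a recent pandas version is installed:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill, but fill at most 1 consecutive missing value per gap.
ffill_one = s.ffill(limit=1)

# Backward fill: each gap takes the next valid value; the trailing NaN stays.
bfilled = s.bfill()

# Fill every missing value with one constant.
constant = s.fillna(0.0)

# Fill only selected index labels, each with its own value.
per_label = s.fillna({0: -1.0, 5: -5.0})

print(ffill_one, bfilled, constant, per_label, sep="\n\n")
```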

Interpolate

s.interpolate(method, limit_direction, limit)

method : str such as 'linear', 'index' or 'nearest', default 'linear'. Interpolation technique to use. 'linear' ignores the index and treats values as equally spaced; 'index' uses the numerical index values.
limit : int, optional. Maximum number of consecutive NaNs to fill.
limit_direction : {'forward', 'backward', 'both'}, optional. Consecutive NaNs will be filled in this direction.
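
The parameters can be compared on two small made-up Series. This sketch leaves out 'nearest', which relies on SciPy being installed:

```python
import numpy as np
import pandas as pd

# Unevenly spaced index, to show the difference between 'linear' and 'index'.
s = pd.Series([1.0, np.nan, np.nan, 4.0], index=[0, 1, 2, 10])

linear = s.interpolate(method="linear")   # equally spaced steps between 1 and 4
by_index = s.interpolate(method="index")  # steps proportional to index distance

# A gap of three NaNs, to show limit and limit_direction.
gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = gap.interpolate(limit=1, limit_direction="forward")
backward = gap.interpolate(limit=1, limit_direction="backward")
both = gap.interpolate(limit=1, limit_direction="both")

print(linear, by_index, forward, backward, both, sep="\n\n")
```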

Nullable

Normal Missing

In an ordinary Series, a missing value is np.nan (NaN), which is a float.
In a datetime Series, np.nan is converted to NaT, the datetime counterpart of NaN.

np.nan == np.nan  # False
np.nan is None  # False
np.nan is False  # False

pd.Series([np.nan]).equals(pd.Series([np.nan])) # True

The equals method treats NaN values in the same positions as equal, unlike the == operator.
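
A small sketch of these comparisons, with made-up values. It also shows pd.isna as the usual way to test for a missing value, since == cannot be used for that:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])

# Elementwise ==: NaN never compares equal to NaN.
elementwise = (a == b).tolist()

# equals(): NaNs in the same positions count as equal.
whole = a.equals(b)

# Detect a missing value with pd.isna instead of ==.
detected = pd.isna(np.nan)

# In a datetime Series, np.nan is stored as NaT.
ts = pd.Series([pd.Timestamp("2024-01-01"), np.nan])
ts_missing = ts.iloc[1]

print(elementwise, whole, detected, ts.dtype, ts_missing)
```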

Nullable Series

pd.NA is a newer missing-value marker that is not tied to a fixed dtype.
Boolean operations on pd.NA follow three-valued logic: the result is pd.NA only when it depends on the unknown value.

pd.NA | True # True
pd.NA & False # False
~pd.NA # pd.NA, because the result is uncertain

Arithmetic with pd.NA usually returns pd.NA, except when the result does not depend on the missing value:

pd.NA ** 0 # 1
1 ** pd.NA # 1
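
The rules above can be checked directly. This is a minimal sketch; the variable names are only for illustration:

```python
import pandas as pd

# Logical operations: decided whenever the other operand settles the result.
or_true = pd.NA | True     # True regardless of the unknown value
and_false = pd.NA & False  # False regardless of the unknown value
or_false = pd.NA | False   # depends on the unknown value
and_true = pd.NA & True    # depends on the unknown value

# Arithmetic and comparisons propagate pd.NA ...
plus_one = pd.NA + 1
equals_one = pd.NA == 1

# ... except where the result is the same for any value.
na_pow_zero = pd.NA ** 0
one_pow_na = 1 ** pd.NA

# pd.NA has no truth value, so using it in an `if` raises TypeError.
try:
    bool(pd.NA)
    truthiness_error = False
except TypeError:
    truthiness_error = True

print(or_true, and_false, or_false, and_true, plus_one, equals_one)
```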

Three nullable dtypes for Series: Int64, boolean, and string. Missing values in them are stored as pd.NA.

pd.Series([np.nan, 1], dtype='Int64')
pd.Series([np.nan, True], dtype='boolean')
pd.Series([np.nan, 'my_str'], dtype='string')

Operations on a nullable Series return a nullable Series whenever possible.
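
As a rough illustration with made-up values, compare a nullable Int64 Series with an ordinary float Series:

```python
import numpy as np
import pandas as pd

nullable = pd.Series([np.nan, 1], dtype="Int64")
plain = pd.Series([np.nan, 1])  # ordinary float64 Series

# Arithmetic keeps the nullable integer dtype; the missing slot stays pd.NA.
added = nullable + 1

# Comparison gives a nullable boolean Series; the missing slot stays pd.NA.
nullable_cmp = nullable > 0

# With the plain Series, the comparison silently turns NaN into False.
plain_cmp = plain > 0

print(added.dtype, nullable_cmp.dtype, plain_cmp.dtype)
```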

When reading files, we can use

df = pd.read_csv(file_name)
df = df.convert_dtypes()

to convert the columns to nullable dtypes where possible.
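
A self-contained sketch of this, reading a made-up CSV from an in-memory string instead of a file (the column names and contents are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV text with a missing value in every column.
csv_text = "id,flag,name\n1,True,alice\n,False,\n3,,carol\n"

raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.convert_dtypes()

# Before: the missing values force float64 / object columns.
print(raw.dtypes)
# After: nullable dtypes, with pd.NA marking the missing values.
print(converted.dtypes)
```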
