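- 1 Arithmetic and data alignment in pandas
- 2 Slicing and indexing with iloc and loc
- 3 Operations between a DataFrame and a Series
- 4 Function application and mapping
- 4.1 Using apply to apply a rule to the rows or columns of a DataFrame
- 4.2 Using applymap to apply a rule to every element of a DataFrame
- 5 Sorting Series and DataFrames
- 6 Handling duplicate index labels in a Series
- 7 Summarizing and computing descriptive statistics
- 8 Unique values, value counts, and membership
- 8.1 Related functions
- 8.2 Checking whether Series elements belong to a given set
- 8.3 Counting how often each value occurs in a DataFrame
- 9 Handling missing values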
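1 Arithmetic and data alignment in pandas
- pandas can perform arithmetic on objects whose indexes differ. When two objects are added and their indexes do not match, the index of the result is the union of the two. If a label on one axis of one object is missing from the other, the result holds NaN at that position; a fill value of your choice (0, for example) can be used instead.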
from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan
df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list("abcd"))
df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list("abcde"))
df1
|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
df2
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
| 2 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 |
df1.add(df2,fill_value=0)
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
| 1 | 9.0 | 11.0 | 13.0 | 15.0 | 9.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
df1.add(df2).fillna(0)
|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 0.0 |
| 1 | 9.0 | 11.0 | 13.0 | 15.0 | 0.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
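'''
Note: df1.add(df2),
df1.add(df2,fill_value=0) and
df1.add(df2).fillna(0)
are three different operations.
The first leaves NaN wherever a label is missing from either frame.
The second treats a value missing from one frame as 0 before adding.
The third adds first and then replaces every NaN in the result with 0.
'''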
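To make the difference between the three calls concrete, the sketch below rebuilds df1 and df2 as defined above and inspects one cell, row 0 of column "e", which exists only in df2. The variable names plain, filled_before and filled_after are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list("abcde"))

plain = df1.add(df2)                        # equivalent to df1 + df2
filled_before = df1.add(df2, fill_value=0)  # fill the missing operand, then add
filled_after = df1.add(df2).fillna(0)       # add, then overwrite NaN results

# Cell (0, "e") holds 4 in df2 and does not exist in df1.
print(plain.loc[0, "e"])          # nan
print(filled_before.loc[0, "e"])  # 4.0
print(filled_after.loc[0, "e"])   # 0.0
```

With fill_value the data from df2 survives; with fillna(0) after the addition it is lost, which is why the last column and last row differ between the two tables above.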
2 Slicing and indexing with iloc and loc
- loc: indexing based on labels
- iloc: indexing based purely on integer position
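Since this section covers slicing as well as single-row lookup, here is a small sketch of how the two indexers slice. It assumes the same frame that is built just below; by_label and by_position are illustrative names. The point to notice is that a label slice with loc includes its end label, while a position slice with iloc excludes its end position.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# loc: label slice, "Texas" is included; columns chosen by name
by_label = frame.loc["Ohio":"Texas", ["b", "e"]]

# iloc: position slice, row 3 is excluded; columns chosen by position
by_position = frame.iloc[1:3, [0, 2]]

print(by_label.equals(by_position))  # True
```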
frame = DataFrame(np.arange(12).reshape((4,3)),
columns=list("bde"),
index=["Utah","Ohio","Texas","Oregon"])
frame
|   | b | d | e |
|---|---|---|---|
| Utah | 0 | 1 | 2 |
| Ohio | 3 | 4 | 5 |
| Texas | 6 | 7 | 8 |
| Oregon | 9 | 10 | 11 |
frame.iloc[1]
b 3
d 4
e 5
Name: Ohio, dtype: int32
frame.index
Index(['Utah', 'Ohio', 'Texas', 'Oregon'], dtype='object')
frame.loc["Oregon"]
b 9
d 10
e 11
Name: Oregon, dtype: int32
frame
|   | b | d | e |
|---|---|---|---|
| Utah | 0 | 1 | 2 |
| Ohio | 3 | 4 | 5 |
| Texas | 6 | 7 | 8 |
| Oregon | 9 | 10 | 11 |
series = frame.iloc[0]
series
b 0
d 1
e 2
Name: Utah, dtype: int32
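3 Operations between a DataFrame and a Series
- By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and then broadcasts the operation down the rows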
frame - series
|   | b | d | e |
|---|---|---|---|
| Utah | 0 | 0 | 0 |
| Ohio | 3 | 3 | 3 |
| Texas | 6 | 6 | 6 |
| Oregon | 9 | 9 | 9 |
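The default matches on columns. To broadcast in the other direction, that is, to match a Series against the row index and apply it across the columns, the arithmetic methods accept an axis argument. The sketch below is not part of the original notebook; it subtracts column "d" from every column of the same frame, and the name by_rows is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

column = frame["d"]  # a Series indexed by the row labels

# axis="index" aligns the Series with the row labels instead of the columns
by_rows = frame.sub(column, axis="index")
print(by_rows)
```

Every row of the result is -1, 0, 1, because in each row b is one less than d and e is one more.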
frame
|   | b | d | e |
|---|---|---|---|
| Utah | 0 | 1 | 2 |
| Ohio | 3 | 4 | 5 |
| Texas | 6 | 7 | 8 |
| Oregon | 9 | 10 | 11 |
4 Function application and mapping
4.1 Using apply to apply a rule to the rows or columns of a DataFrame
f = lambda x : x.max() - x.min()
arr = np.array([1,2,3,4,5])
def getMax(x):
    # despite its name, this returns the range (max minus min), same as f
    return x.max() - x.min()
f(arr)
4
frame.apply(f)
b 9
d 9
e 9
dtype: int64
frame
|   | b | d | e |
|---|---|---|---|
| Utah | 0 | 1 | 2 |
| Ohio | 3 | 4 | 5 |
| Texas | 6 | 7 | 8 |
| Oregon | 9 | 10 | 11 |
frame.apply(f,axis=1)
Utah 2
Ohio 2
Texas 2
Oregon 2
dtype: int64
4.2 Using applymap to apply a rule to every element of a DataFrame
frame = DataFrame(np.random.randn(12).reshape((4,3)),
columns=list("bde"),
index=["Utah","Ohio","Texas","Oregon"])
frame
|   | b | d | e |
|---|---|---|---|
| Utah | -0.033554 | -0.179060 | -0.169456 |
| Ohio | 0.397475 | -1.661291 | 0.611291 |
| Texas | 0.114703 | -0.467590 | -0.424874 |
| Oregon | -1.497851 | 1.239364 | 2.076009 |
def twoFixed(num):
    return "%.2f"%num
f = lambda num : "%.2f"%num
# applymap is deprecated since pandas 2.1; newer versions spell it frame.map(f)
strFrame = frame.applymap(f)
strFrame
|   | b | d | e |
|---|---|---|---|
| Utah | -0.03 | -0.18 | -0.17 |
| Ohio | 0.40 | -1.66 | 0.61 |
| Texas | 0.11 | -0.47 | -0.42 |
| Oregon | -1.50 | 1.24 | 2.08 |
frame.dtypes
b float64
d float64
e float64
dtype: object
strFrame.dtypes
b object
d object
e object
dtype: object
frame["d"].map(lambda x :x+10)
Utah 9.820940
Ohio 8.338709
Texas 9.532410
Oregon 11.239364
Name: d, dtype: float64
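Element-wise formatting of a whole DataFrame is spelled applymap in older pandas and DataFrame.map from pandas 2.1 onward. The sketch below, which is an addition and not the notebook's own code, picks whichever method exists so that it runs on either version. The small frame reuses four values from the table above, and elementwise and formatted are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({"b": [0.397475, -1.497851],
                      "d": [-1.661291, 1.239364]},
                     index=["Ohio", "Oregon"])

# DataFrame.map exists from pandas 2.1; fall back to applymap before that
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

formatted = elementwise(lambda num: f"{num:.2f}")  # every cell becomes a str
shifted = frame["d"].map(lambda x: x + 10)         # Series.map works per element

print(formatted)
print(shifted)
```

As in the dtypes output above, the formatted frame holds strings, so it is meant for display and not for further arithmetic.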
5 Sorting Series and DataFrames
series = Series(range(4),
index=list("dabc"))
series
d 0
a 1
b 2
c 3
dtype: int64
series.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
frame = DataFrame(np.arange(8).reshape((2,4)),
index=["three","one"],
columns=list("dabc"))
frame
|   | d | a | b | c |
|---|---|---|---|---|
| three | 0 | 1 | 2 | 3 |
| one | 4 | 5 | 6 | 7 |
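'''
DataFrame.sort_index(axis,ascending,by)
axis = 0 sort by the row labels (index)
axis = 1 sort by the column labels (columns)
ascending = False descending order
ascending = True ascending order
by sort by the named column; deprecated here and removed in later pandas releases, use sort_values(by=...) instead
'''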
frame.sort_index()
|   | d | a | b | c |
|---|---|---|---|---|
| one | 4 | 5 | 6 | 7 |
| three | 0 | 1 | 2 | 3 |
frame.sort_index(axis=1,ascending=False)
|   | d | c | b | a |
|---|---|---|---|---|
| three | 0 | 3 | 2 | 1 |
| one | 4 | 7 | 6 | 5 |
df = DataFrame({"a":[4,7,-3,2],
"b":[0,1,0,1]})
df
|   | a | b |
|---|---|---|
| 0 | 4 | 0 |
| 1 | 7 | 1 |
| 2 | -3 | 0 |
| 3 | 2 | 1 |
df.sort_values(by="a")
|   | a | b |
|---|---|---|
| 2 | -3 | 0 |
| 3 | 2 | 1 |
| 0 | 4 | 0 |
| 1 | 7 | 1 |
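sort_values also accepts a list of columns together with a matching list of sort directions. The following sketch is an extension of the example above, not code from the notebook: it sorts the same df by "b" ascending and breaks ties by "a" descending. The name ordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 7, -3, 2],
                   "b": [0, 1, 0, 1]})

# primary key "b" ascending, secondary key "a" descending
ordered = df.sort_values(by=["b", "a"], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of the result reads 0, 2, 1, 3.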
6 Handling duplicate index labels in a Series
series = Series(range(5),index=list("aabbc"))
series
a 0
a 1
b 2
b 3
c 4
dtype: int64
series.index.is_unique
False
series["a"]
a 0
a 1
dtype: int64
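Two practical consequences of duplicate labels are worth a short sketch, which is an addition to the notebook: the type returned by a label lookup depends on whether the label repeats, and duplicates usually have to be either aggregated or dropped before further processing. The names totals and first_only are illustrative.

```python
import pandas as pd

series = pd.Series(range(5), index=list("aabbc"))

repeated = series["a"]  # label occurs twice  -> a Series of two values
single = series["c"]    # label occurs once   -> a scalar

# Option 1: aggregate the values that share a label
totals = series.groupby(level=0).sum()

# Option 2: keep only the first value for each label
first_only = series[~series.index.duplicated(keep="first")]

print(totals)
print(first_only)
```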
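7 Summarizing and computing descriptive statistics
'''
df.sum(axis)
axis=0 sum down each column, giving one result per column (default)
axis=1 sum across each row, giving one result per row
'''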
df = DataFrame([[1.4,nan],
[7.1,-4.5],
[nan,nan],
[0.75,-1.3]],
index=list("abcd"),
columns=["one","two"])
df
|   | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
df.sum()
one 9.25
two -5.80
dtype: float64
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df
|   | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
df.mean()
one 3.083333
two -2.900000
dtype: float64
df.mean(axis=1,skipna=False)
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
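The outputs above show that sum skips NaN by default, so the all-NaN row c sums to 0.0, while mean with skipna=False yields NaN as soon as one value is missing. If an all-NaN row should produce NaN from sum as well, the min_count parameter can be used. This is a sketch added for illustration; default_sum and strict_sum are made-up names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan],
                   [7.1, -4.5],
                   [np.nan, np.nan],
                   [0.75, -1.3]],
                  index=list("abcd"),
                  columns=["one", "two"])

default_sum = df.sum(axis=1)              # NaN skipped; empty sum is 0.0
strict_sum = df.sum(axis=1, min_count=1)  # require at least one real value

print(default_sum["c"])  # 0.0
print(strict_sum["c"])   # nan
```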
df
|   | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
df.cumsum()
|   | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 8.50 | -4.5 |
| c | NaN | NaN |
| d | 9.25 | -5.8 |
df
|   | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
df.describe()
|   | one | two |
|---|---|---|
| count | 3.000000 | 2.000000 |
| mean | 3.083333 | -2.900000 |
| std | 3.493685 | 2.262742 |
| min | 0.750000 | -4.500000 |
| 25% | 1.075000 | -3.700000 |
| 50% | 1.400000 | -2.900000 |
| 75% | 4.250000 | -2.100000 |
| max | 7.100000 | -1.300000 |
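describe adapts to the data type. For non-numeric data it reports a different set of statistics, as the following sketch shows. It is an addition to the notebook and uses a string Series of the same shape as the one built in the next section; the name summary is illustrative.

```python
import pandas as pd

letters = pd.Series(list("aabc") * 4)

# For strings, describe reports count, unique, top (most frequent) and freq
summary = letters.describe()
print(summary)
```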
8 Unique values, value counts, and membership
8.1 Related functions
series = Series(list("aabc")*4)
series
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
series.unique()
array(['a', 'b', 'c'], dtype=object)
series.value_counts()
a 8
c 4
b 4
dtype: int64
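# the top-level pd.value_counts is deprecated since pandas 2.1; Series.value_counts is the supported form
pd.value_counts(series.values,sort=False)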
b 4
a 8
c 4
dtype: int64
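value_counts can also report relative frequencies. The sketch below is an addition and uses the same series; shares is an illustrative name. Values are looked up by label because, as the two outputs above suggest, the order of tied entries such as b and c is not something to rely on.

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

# normalize=True returns each value's share of the total instead of a count
shares = series.value_counts(normalize=True)

print(shares["a"])  # 0.5
print(shares["b"])  # 0.25
```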
8.2 Checking whether Series elements belong to a given set
series
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
mask = series.isin(["b","c"])
series[mask]
2 b
3 c
6 b
7 c
10 b
11 c
14 b
15 c
dtype: object
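Because isin returns a boolean mask, the complementary selection only needs the ~ operator. A brief sketch, added for illustration, that keeps everything except "b" and "c" (others is a made-up name):

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

mask = series.isin(["b", "c"])
others = series[~mask]  # invert the mask to exclude the listed values

print(others.unique())
```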
8.3 Counting how often each value occurs in a DataFrame
data=DataFrame({"qu1":[1,3,4,3,4],
"qu2":[2,3,1,2,3],
"qu3":[1,5,2,6,4]})
data
|   | qu1 | qu2 | qu3 |
|---|---|---|---|
| 0 | 1 | 2 | 1 |
| 1 | 3 | 3 | 5 |
| 2 | 4 | 1 | 2 |
| 3 | 3 | 2 | 6 |
| 4 | 4 | 3 | 4 |
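# axis=1 counts the values within each row (one output row per input row);
# leave axis at its default of 0 to count within each column instead
data.apply(pd.value_counts,axis=1).fillna(0).astype("int")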
|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 2 | 0 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 2 | 0 | 0 |
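The table above counts values within each row, because the call passes axis=1. For per-column counts, where the result is indexed by the distinct values and keeps qu1, qu2 and qu3 as columns, a sketch along the following lines should work. It is an addition to the notebook; per_column is an illustrative name, and Series.value_counts is used in place of the top-level pd.value_counts.

```python
import pandas as pd

data = pd.DataFrame({"qu1": [1, 3, 4, 3, 4],
                     "qu2": [2, 3, 1, 2, 3],
                     "qu3": [1, 5, 2, 6, 4]})

# axis defaults to 0, so each column is counted separately;
# values absent from a column come back as NaN and are set to 0
per_column = (data.apply(lambda col: col.value_counts())
                  .fillna(0)
                  .astype(int)
                  .sort_index())
print(per_column)
```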
series = Series(["a","b",nan,"c"])
series
0 a
1 b
2 NaN
3 c
dtype: object
mask = series.isnull()
~mask
0 True
1 True
2 False
3 True
dtype: bool
series.notnull()
0 True
1 True
2 False
3 True
dtype: bool
9 Handling missing values
series
0 a
1 b
2 NaN
3 c
dtype: object
series.dropna()
0 a
1 b
3 c
dtype: object
data = DataFrame([[1,6.5,3,4],
[1,nan,nan,5],
[nan,nan,nan,6],
[nan,6.5,3,7]])
data
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | 4 |
| 1 | 1.0 | NaN | NaN | 5 |
| 2 | NaN | NaN | NaN | 6 |
| 3 | NaN | 6.5 | 3.0 | 7 |
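'''
DataFrame.dropna(axis,how)
axis=0
drop a row if any of its columns holds NaN
axis=1
drop a column if any of its rows holds NaN
how="all"
with axis=0, drop only the rows that are entirely NaN
with axis=1, drop only the columns that are entirely NaN
'''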
cleaned = data.dropna(axis=1)
cleaned
|   | 3 |
|---|---|
| 0 | 4 |
| 1 | 5 |
| 2 | 6 |
| 3 | 7 |
data2 = DataFrame([[1,6.5,3,nan],
[1,nan,nan,nan],
[nan,nan,nan,nan],
[nan,6.5,3,nan]])
data2
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
data2.dropna(axis=1,how="all")
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
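Between the two extremes of how="any" (the default) and how="all", dropna also offers thresh, which keeps a row or column only if it has at least that many non-NaN values. The sketch below is an addition that reuses the data frame defined earlier in this section; complete_rows and at_least_three are illustrative names.

```python
import numpy as np
import pandas as pd

nan = np.nan
data = pd.DataFrame([[1, 6.5, 3, 4],
                     [1, nan, nan, 5],
                     [nan, nan, nan, 6],
                     [nan, 6.5, 3, 7]])

complete_rows = data.dropna()           # rows without any NaN
at_least_three = data.dropna(thresh=3)  # rows with 3 or more real values
complete_cols = data.dropna(axis=1)     # columns without any NaN

print(list(complete_rows.index))
print(list(at_least_three.index))
print(list(complete_cols.columns))
```

Row 3 has three real values and survives thresh=3, while rows 1 and 2 have two and one and are dropped.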
df= DataFrame([[1,2,nan],
[4,nan,6],
[nan,5,9]],
index=['one','two','three'],
columns=list('abc'))
df
|   | a | b | c |
|---|---|---|---|
| one | 1.0 | 2.0 | NaN |
| two | 4.0 | NaN | 6.0 |
| three | NaN | 5.0 | 9.0 |
df.fillna(0)
|   | a | b | c |
|---|---|---|---|
| one | 1.0 | 2.0 | 0.0 |
| two | 4.0 | 0.0 | 6.0 |
| three | 0.0 | 5.0 | 9.0 |
df1 = df.fillna({'a':0,'c':"F"})
df1
|   | a | b | c |
|---|---|---|---|
| one | 1.0 | 2.0 | F |
| two | 4.0 | NaN | 6 |
| three | 0.0 | 5.0 | 9 |
df.fillna(0,inplace=True)
df
|   | a | b | c |
|---|---|---|---|
| one | 1.0 | 2.0 | 0.0 |
| two | 4.0 | 0.0 | 6.0 |
| three | 0.0 | 5.0 | 9.0 |
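A constant is not the only possible replacement. Two common alternatives, sketched here as an addition to the notebook, are carrying the previous row's value forward and filling each column with its own mean. The frame is rebuilt first because the df above was already filled in place; forward and by_mean are illustrative names.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[1, 2, nan],
                   [4, nan, 6],
                   [nan, 5, 9]],
                  index=["one", "two", "three"],
                  columns=list("abc"))

forward = df.ffill()            # copy the last seen value down each column
by_mean = df.fillna(df.mean())  # df.mean() is a Series keyed by column name

print(forward)
print(by_mean)
```

Forward filling cannot repair a NaN in the first row, so forward still has a missing value at ("one", "c").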