|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
Note: position 3 itself is not included; integer slicing excludes the right endpoint, whereas label slicing with .loc includes it.
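A minimal sketch of the two behaviors (using the same df as above; the exact rows are just for illustration):

df[0:3]                             # integer slice: rows 0, 1, 2 -- position 3 excluded
df.loc["2019-01-01":"2019-01-03"]   # label slice: all three dates -- endpoint included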
(2) Column slicing
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df.loc[:, "A": "C"]
|  | A | B | C |
|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
df.iloc[:, 0: 3]
|  | A | B | C |
|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
(3) Mixed selection
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df.loc["2019-01-02": "2019-01-03", "C":"D"]
|  | C | D |
|---|---|---|
| 2019-01-02 | 1.080779 | -2.294395 |
| 2019-01-03 | 1.102248 | 1.207726 |
df.iloc[1: 3, 2:]
|  | C | D |
|---|---|---|
| 2019-01-02 | 1.080779 | -2.294395 |
| 2019-01-03 | 1.102248 | 1.207726 |
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
|  | A | C |
|---|---|---|
| 2019-01-04 | 0.305088 | -0.978434 |
| 2019-01-05 | 0.313383 | 0.163155 |
| 2019-01-06 | 0.250613 | -0.858240 |
df.iloc[3:, [0, 2]]
|  | A | C |
|---|---|---|
| 2019-01-04 | 0.305088 | -0.978434 |
| 2019-01-05 | 0.313383 | 0.163155 |
| 2019-01-06 | 0.250613 | -0.858240 |
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
This does not work: with a DatetimeIndex, a list of date strings is not accepted as labels by .loc.
df.iloc[[1, 5], 0: 3]
|  | A | B | C |
|---|---|---|---|
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
Likewise, this does not work; see the workaround sketch below.
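A workaround sketch (my assumption, not from the original): convert the date strings to Timestamps first, so the list matches the DatetimeIndex labels.

rows = pd.to_datetime(["2019-01-02", "2019-01-06"])   # DatetimeIndex of the wanted labels
df.loc[rows, "C":"D"]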
df.iloc[[1, 5], [0, 3]]
|  | A | D |
|---|---|---|
| 2019-01-02 | -0.234414 | -2.294395 |
| 2019-01-06 | 0.250613 | -1.573342 |
(4) Boolean indexing
This is the analogue of NumPy's masking operations.
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df > 0
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | False | False | True | False |
| 2019-01-02 | False | False | True | False |
| 2019-01-03 | False | True | True | True |
| 2019-01-04 | True | True | False | True |
| 2019-01-05 | True | True | True | False |
| 2019-01-06 | True | False | False | False |
df[df > 0]
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | NaN | NaN | 0.925984 | NaN |
| 2019-01-02 | NaN | NaN | 1.080779 | NaN |
| 2019-01-03 | NaN | 0.058118 | 1.102248 | 1.207726 |
| 2019-01-04 | 0.305088 | 0.535920 | NaN | 0.177251 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | NaN |
| 2019-01-06 | 0.250613 | NaN | NaN | NaN |
Notice that entries where the mask is True are kept, while those where it is False become NaN.
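Masks can also be combined before indexing; a small sketch (same df assumed), using the elementwise operators & and |, which require parentheses around each condition:

df[(df.A > 0) & (df.B < 0)]   # rows where A is positive and B is negative
df[(df.C > 0) | (df.D > 0)]   # rows where C or D is positive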
df.A > 0
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2019-01-05 True
2019-01-06 True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 | one |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | one |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | two |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | three |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | four |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | three |
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
df2[ind]
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | two |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | four |
(5) Assignment
df
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
df["E"] = s1
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 | 1 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df.loc["2019-01-01", "A"] = 0
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-01 | 0.000000 | -0.190742 | 0.925984 | -0.818969 | 1 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df.iloc[0, 1] = 0
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-01 | 0.000000 | 0.000000 | 0.925984 | -0.818969 | 1 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df["D"] = np.array([5]\*len(df)) # 可简化成df["D"] = 5
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019-01-01 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
| 2019-01-02 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
| 2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
| 2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
| 2019-01-05 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
| 2019-01-06 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
df.index = [i for i in range(len(df))]
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
| 1 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
| 2 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
| 3 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
| 4 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
| 5 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
df.columns = [i for i in range(df.shape[1])]
df
|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
| 1 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
| 2 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
| 3 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
| 4 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
| 5 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
11.3 Numerical Operations and Statistical Analysis
1. Inspecting the data
import pandas as pd
import numpy as np
dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
| 2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
| 2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
| 2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
| 2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
| 2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
(1) Viewing the first rows
df.head() # first 5 rows by default; the count can be set
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
| 2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
| 2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
| 2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
| 2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
df.head(2)
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
| 2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
(2) Viewing the last rows
df.tail() # last 5 rows by default
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
| 2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
| 2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
| 2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
| 2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
df.tail(3)
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
| 2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
| 2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
(3) Viewing summary information
df.iloc[0, 3] = np.nan
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2019-01-01 | -0.854043 | 0.412345 | -2.296051 | NaN |
| 2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
| 2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
| 2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
| 2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
| 2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
df.info()
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
2. NumPy ufuncs work on pandas objects too
(1) Vectorized operations
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
x+5
np.exp(x)
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 2.718282 | 7.389056 | 20.085537 |
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
x*y
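The point of these examples: a NumPy ufunc applied to a DataFrame operates elementwise and preserves the index and columns. A minimal check (x as defined above):

res = np.exp(x)
print(type(res))                 # <class 'pandas.core.frame.DataFrame'>
print(res.columns.tolist())      # original column labels are kept: [0, 1, 2, 3]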
(2) Matrix operations
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 3 | 7 | 4 | 6 | 9 | 2 | 6 | 7 | 4 | … | 4 | 0 | 9 | 5 | 8 | 0 | 9 | 2 | 6 | 3 |
| 1 | 8 | 2 | 4 | 2 | 6 | 4 | 8 | 6 | 1 | 3 | … | 2 | 0 | 3 | 1 | 7 | 3 | 1 | 5 | 5 | 9 |
| 2 | 3 | 5 | 1 | 9 | 1 | 9 | 3 | 7 | 6 | 8 | … | 6 | 8 | 7 | 0 | 7 | 7 | 2 | 0 | 7 | 2 |
| 3 | 2 | 0 | 4 | 9 | 6 | 9 | 8 | 6 | 8 | 7 | … | 0 | 2 | 4 | 2 | 0 | 4 | 9 | 6 | 6 | 8 |
| 4 | 9 | 9 | 2 | 6 | 0 | 3 | 3 | 4 | 6 | 6 | … | 9 | 6 | 8 | 6 | 0 | 0 | 8 | 8 | 3 | 8 |
| 5 | 2 | 6 | 5 | 7 | 8 | 4 | 0 | 2 | 9 | 7 | … | 2 | 0 | 4 | 0 | 7 | 0 | 0 | 1 | 1 | 5 |
| 6 | 6 | 4 | 0 | 0 | 2 | 1 | 4 | 9 | 5 | 6 | … | 5 | 0 | 8 | 5 | 2 | 3 | 3 | 2 | 9 | 2 |
| 7 | 2 | 3 | 6 | 3 | 8 | 0 | 7 | 6 | 1 | 7 | … | 3 | 0 | 1 | 0 | 4 | 4 | 6 | 8 | 8 | 2 |
| 8 | 2 | 2 | 3 | 7 | 5 | 7 | 0 | 7 | 3 | 0 | … | 1 | 1 | 5 | 2 | 8 | 3 | 0 | 3 | 0 | 4 |
| 9 | 3 | 7 | 7 | 6 | 2 | 0 | 0 | 2 | 5 | 6 | … | 4 | 2 | 3 | 2 | 0 | 0 | 4 | 5 | 2 | 8 |
| 10 | 4 | 7 | 0 | 4 | 2 | 0 | 3 | 4 | 6 | 0 | … | 5 | 6 | 1 | 9 | 1 | 9 | 0 | 7 | 0 | 8 |
| 11 | 5 | 6 | 9 | 6 | 9 | 2 | 1 | 8 | 7 | 9 | … | 6 | 5 | 2 | 8 | 9 | 5 | 9 | 9 | 5 | 0 |
| 12 | 3 | 9 | 5 | 5 | 4 | 0 | 7 | 4 | 4 | 6 | … | 0 | 7 | 2 | 9 | 6 | 9 | 4 | 9 | 4 | 6 |
| 13 | 8 | 4 | 0 | 9 | 9 | 0 | 1 | 5 | 8 | 7 | … | 5 | 8 | 4 | 0 | 3 | 4 | 9 | 9 | 4 | 6 |
| 14 | 3 | 0 | 4 | 6 | 9 | 9 | 5 | 4 | 3 | 1 | … | 6 | 1 | 0 | 3 | 7 | 1 | 2 | 0 | 0 | 2 |
| 15 | 4 | 2 | 0 | 0 | 7 | 9 | 1 | 2 | 1 | 2 | … | 6 | 3 | 9 | 4 | 1 | 7 | 3 | 8 | 4 | 8 |
| 16 | 3 | 9 | 4 | 8 | 7 | 2 | 0 | 2 | 3 | 1 | … | 8 | 0 | 0 | 3 | 8 | 5 | 2 | 0 | 3 | 8 |
| 17 | 2 | 8 | 6 | 3 | 2 | 9 | 4 | 4 | 2 | 8 | … | 6 | 9 | 4 | 2 | 6 | 1 | 8 | 9 | 9 | 0 |
| 18 | 5 | 6 | 7 | 9 | 8 | 1 | 9 | 1 | 4 | 4 | … | 3 | 5 | 2 | 5 | 6 | 9 | 9 | 2 | 6 | 2 |
| 19 | 1 | 9 | 3 | 7 | 8 | 6 | 0 | 2 | 8 | 0 | … | 4 | 3 | 2 | 2 | 3 | 8 | 1 | 8 | 0 | 0 |
| 20 | 4 | 5 | 5 | 2 | 6 | 8 | 9 | 7 | 5 | 7 | … | 3 | 5 | 0 | 8 | 0 | 4 | 3 | 2 | 5 | 1 |
| 21 | 2 | 4 | 8 | 1 | 9 | 7 | 1 | 4 | 6 | 7 | … | 0 | 1 | 8 | 2 | 0 | 4 | 6 | 5 | 0 | 4 |
| 22 | 4 | 5 | 2 | 4 | 6 | 4 | 4 | 4 | 9 | 9 | … | 1 | 7 | 6 | 9 | 9 | 1 | 5 | 5 | 2 | 1 |
| 23 | 0 | 5 | 4 | 8 | 0 | 6 | 4 | 4 | 1 | 2 | … | 8 | 5 | 0 | 7 | 6 | 9 | 2 | 0 | 4 | 3 |
| 24 | 9 | 7 | 0 | 9 | 0 | 3 | 7 | 4 | 1 | 5 | … | 3 | 7 | 8 | 2 | 2 | 1 | 9 | 2 | 2 | 4 |
| 25 | 4 | 1 | 9 | 5 | 4 | 5 | 0 | 4 | 8 | 9 | … | 9 | 3 | 0 | 7 | 0 | 2 | 3 | 7 | 5 | 9 |
| 26 | 6 | 7 | 1 | 9 | 7 | 2 | 6 | 2 | 6 | 1 | … | 0 | 6 | 5 | 9 | 8 | 0 | 3 | 8 | 3 | 9 |
| 27 | 2 | 8 | 1 | 3 | 5 | 1 | 7 | 7 | 0 | 2 | … | 8 | 0 | 4 | 5 | 4 | 5 | 5 | 6 | 3 | 7 |
| 28 | 6 | 8 | 6 | 2 | 2 | 7 | 4 | 3 | 7 | 5 | … | 1 | 7 | 9 | 2 | 4 | 5 | 9 | 5 | 3 | 2 |
| 29 | 3 | 0 | 3 | 0 | 0 | 9 | 5 | 4 | 3 | 2 | … | 1 | 3 | 0 | 4 | 8 | 0 | 8 | 7 | 5 | 6 |

30 rows × 30 columns
z = x.T
z
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 8 | 3 | 2 | 9 | 2 | 6 | 2 | 2 | 3 | … | 4 | 2 | 4 | 0 | 9 | 4 | 6 | 2 | 6 | 3 |
| 1 | 3 | 2 | 5 | 0 | 9 | 6 | 4 | 3 | 2 | 7 | … | 5 | 4 | 5 | 5 | 7 | 1 | 7 | 8 | 8 | 0 |
| 2 | 7 | 4 | 1 | 4 | 2 | 5 | 0 | 6 | 3 | 7 | … | 5 | 8 | 2 | 4 | 0 | 9 | 1 | 1 | 6 | 3 |
| 3 | 4 | 2 | 9 | 9 | 6 | 7 | 0 | 3 | 7 | 6 | … | 2 | 1 | 4 | 8 | 9 | 5 | 9 | 3 | 2 | 0 |
| 4 | 6 | 6 | 1 | 6 | 0 | 8 | 2 | 8 | 5 | 2 | … | 6 | 9 | 6 | 0 | 0 | 4 | 7 | 5 | 2 | 0 |
| 5 | 9 | 4 | 9 | 9 | 3 | 4 | 1 | 0 | 7 | 0 | … | 8 | 7 | 4 | 6 | 3 | 5 | 2 | 1 | 7 | 9 |
| 6 | 2 | 8 | 3 | 8 | 3 | 0 | 4 | 7 | 0 | 0 | … | 9 | 1 | 4 | 4 | 7 | 0 | 6 | 7 | 4 | 5 |
| 7 | 6 | 6 | 7 | 6 | 4 | 2 | 9 | 6 | 7 | 2 | … | 7 | 4 | 4 | 4 | 4 | 4 | 2 | 7 | 3 | 4 |
| 8 | 7 | 1 | 6 | 8 | 6 | 9 | 5 | 1 | 3 | 5 | … | 5 | 6 | 9 | 1 | 1 | 8 | 6 | 0 | 7 | 3 |
| 9 | 4 | 3 | 8 | 7 | 6 | 7 | 6 | 7 | 0 | 6 | … | 7 | 7 | 9 | 2 | 5 | 9 | 1 | 2 | 5 | 2 |
| 10 | 3 | 8 | 7 | 1 | 3 | 5 | 3 | 0 | 7 | 5 | … | 4 | 0 | 2 | 6 | 4 | 1 | 9 | 9 | 1 | 0 |
| 11 | 7 | 1 | 4 | 0 | 6 | 7 | 6 | 8 | 3 | 5 | … | 7 | 5 | 0 | 5 | 1 | 0 | 5 | 8 | 3 | 5 |
| 12 | 7 | 9 | 1 | 6 | 2 | 8 | 7 | 8 | 5 | 5 | … | 9 | 0 | 4 | 1 | 2 | 9 | 2 | 4 | 3 | 1 |
| 13 | 2 | 8 | 4 | 6 | 5 | 3 | 0 | 1 | 7 | 2 | … | 3 | 1 | 8 | 5 | 8 | 8 | 2 | 5 | 5 | 7 |
| 14 | 5 | 9 | 7 | 7 | 1 | 0 | 5 | 6 | 3 | 5 | … | 9 | 0 | 0 | 1 | 6 | 9 | 8 | 3 | 5 | 9 |
| 15 | 4 | 4 | 9 | 4 | 9 | 0 | 7 | 9 | 2 | 7 | … | 7 | 4 | 2 | 1 | 6 | 8 | 6 | 9 | 0 | 4 |
| 16 | 1 | 1 | 8 | 2 | 8 | 9 | 4 | 2 | 8 | 1 | … | 9 | 9 | 3 | 1 | 5 | 8 | 4 | 1 | 7 | 6 |
| 17 | 7 | 3 | 8 | 7 | 4 | 3 | 3 | 6 | 2 | 4 | … | 1 | 8 | 0 | 2 | 7 | 5 | 9 | 7 | 5 | 9 |
| 18 | 5 | 6 | 0 | 5 | 5 | 6 | 1 | 9 | 8 | 0 | … | 4 | 5 | 0 | 1 | 3 | 7 | 6 | 5 | 2 | 1 |
| 19 | 1 | 7 | 8 | 2 | 3 | 1 | 5 | 8 | 1 | 0 | … | 8 | 0 | 7 | 3 | 7 | 0 | 8 | 4 | 8 | 7 |
| 20 | 4 | 2 | 6 | 0 | 9 | 2 | 5 | 3 | 1 | 4 | … | 3 | 0 | 1 | 8 | 3 | 9 | 0 | 8 | 1 | 1 |
| 21 | 0 | 0 | 8 | 2 | 6 | 0 | 0 | 0 | 1 | 2 | … | 5 | 1 | 7 | 5 | 7 | 3 | 6 | 0 | 7 | 3 |
| 22 | 9 | 3 | 7 | 4 | 8 | 4 | 8 | 1 | 5 | 3 | … | 0 | 8 | 6 | 0 | 8 | 0 | 5 | 4 | 9 | 0 |
| 23 | 5 | 1 | 0 | 2 | 6 | 0 | 5 | 0 | 2 | 2 | … | 8 | 2 | 9 | 7 | 2 | 7 | 9 | 5 | 2 | 4 |
| 24 | 8 | 7 | 7 | 0 | 0 | 7 | 2 | 4 | 8 | 0 | … | 0 | 0 | 9 | 6 | 2 | 0 | 8 | 4 | 4 | 8 |
| 25 | 0 | 3 | 7 | 4 | 0 | 0 | 3 | 4 | 3 | 0 | … | 4 | 4 | 1 | 9 | 1 | 2 | 0 | 5 | 5 | 0 |
| 26 | 9 | 1 | 2 | 9 | 8 | 0 | 3 | 6 | 0 | 4 | … | 3 | 6 | 5 | 2 | 9 | 3 | 3 | 5 | 9 | 8 |
| 27 | 2 | 5 | 0 | 6 | 8 | 1 | 2 | 8 | 3 | 5 | … | 2 | 5 | 5 | 0 | 2 | 7 | 8 | 6 | 5 | 7 |
| 28 | 6 | 5 | 7 | 6 | 3 | 1 | 9 | 8 | 0 | 2 | … | 5 | 0 | 2 | 4 | 2 | 5 | 3 | 3 | 3 | 5 |
| 29 | 3 | 9 | 2 | 8 | 8 | 5 | 2 | 2 | 4 | 8 | … | 1 | 4 | 1 | 3 | 4 | 9 | 9 | 7 | 2 | 6 |

30 rows × 30 columns
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 8 | 9 | 5 | 0 | 0 | 1 | 7 | 6 | 9 | … | 1 | 7 | 0 | 6 | 9 | 9 | 7 | 6 | 9 | 1 |
| 1 | 0 | 1 | 8 | 8 | 3 | 9 | 8 | 7 | 3 | 6 | … | 9 | 2 | 0 | 4 | 9 | 2 | 7 | 7 | 9 | 8 |
| 2 | 6 | 9 | 3 | 7 | 7 | 4 | 5 | 9 | 3 | 6 | … | 7 | 7 | 1 | 1 | 3 | 0 | 8 | 6 | 4 | 5 |
| 3 | 6 | 2 | 5 | 7 | 8 | 4 | 4 | 7 | 7 | 4 | … | 0 | 1 | 9 | 8 | 2 | 3 | 1 | 2 | 7 | 2 |
| 4 | 6 | 0 | 9 | 2 | 6 | 6 | 2 | 7 | 7 | 0 | … | 1 | 5 | 4 | 0 | 7 | 8 | 9 | 5 | 7 | 0 |
| 5 | 9 | 3 | 9 | 1 | 4 | 4 | 6 | 8 | 8 | 9 | … | 1 | 8 | 7 | 0 | 3 | 4 | 2 | 0 | 3 | 5 |
| 6 | 1 | 2 | 4 | 3 | 0 | 6 | 0 | 7 | 2 | 8 | … | 4 | 3 | 3 | 6 | 7 | 3 | 5 | 3 | 2 | 4 |
| 7 | 4 | 0 | 3 | 3 | 8 | 3 | 5 | 6 | 7 | 5 | … | 1 | 7 | 3 | 1 | 6 | 6 | 9 | 6 | 9 | 6 |
| 8 | 0 | 0 | 2 | 9 | 6 | 0 | 6 | 7 | 0 | 3 | … | 6 | 7 | 9 | 5 | 4 | 9 | 5 | 2 | 5 | 6 |
| 9 | 6 | 8 | 7 | 7 | 7 | 2 | 6 | 0 | 5 | 2 | … | 7 | 0 | 6 | 2 | 4 | 3 | 6 | 7 | 6 | 3 |
| 10 | 0 | 6 | 4 | 7 | 6 | 2 | 9 | 5 | 9 | 9 | … | 4 | 9 | 3 | 9 | 1 | 2 | 5 | 4 | 0 | 8 |
| 11 | 2 | 3 | 9 | 9 | 4 | 4 | 8 | 2 | 1 | 6 | … | 0 | 5 | 9 | 8 | 6 | 6 | 0 | 4 | 7 | 3 |
| 12 | 0 | 1 | 6 | 0 | 6 | 1 | 6 | 4 | 2 | 5 | … | 8 | 8 | 0 | 7 | 2 | 0 | 7 | 1 | 1 | 9 |
| 13 | 5 | 1 | 5 | 9 | 6 | 4 | 9 | 8 | 7 | 5 | … | 2 | 4 | 3 | 2 | 0 | 0 | 4 | 2 | 5 | 0 |
| 14 | 0 | 3 | 8 | 5 | 3 | 1 | 4 | 7 | 3 | 2 | … | 8 | 5 | 5 | 7 | 5 | 9 | 1 | 3 | 9 | 3 |
| 15 | 3 | 3 | 6 | 1 | 3 | 0 | 5 | 0 | 5 | 2 | … | 7 | 1 | 7 | 7 | 3 | 8 | 3 | 0 | 6 | 3 |
| 16 | 0 | 6 | 5 | 9 | 6 | 4 | 6 | 6 | 2 | 2 | … | 3 | 6 | 8 | 6 | 5 | 1 | 3 | 2 | 6 | 3 |
| 17 | 6 | 7 | 2 | 8 | 0 | 1 | 8 | 6 | 0 | 0 | … | 5 | 6 | 2 | 5 | 4 | 3 | 0 | 6 | 2 | 1 |
| 18 | 9 | 4 | 4 | 0 | 9 | 8 | 7 | 7 | 6 | 1 | … | 7 | 9 | 9 | 7 | 1 | 1 | 4 | 6 | 5 | 6 |
| 19 | 4 | 1 | 1 | 5 | 1 | 2 | 6 | 2 | 3 | 3 | … | 0 | 0 | 0 | 9 | 8 | 5 | 9 | 3 | 4 | 0 |
| 20 | 9 | 8 | 6 | 3 | 9 | 9 | 0 | 8 | 1 | 6 | … | 2 | 9 | 0 | 1 | 3 | 9 | 4 | 8 | 8 | 8 |
| 21 | 2 | 8 | 6 | 4 | 9 | 0 | 5 | 5 | 6 | 1 | … | 6 | 7 | 5 | 6 | 8 | 7 | 4 | 2 | 4 | 0 |
| 22 | 0 | 3 | 5 | 9 | 0 | 3 | 6 | 5 | 1 | 1 | … | 6 | 2 | 5 | 3 | 9 | 3 | 9 | 5 | 1 | 9 |
| 23 | 7 | 7 | 0 | 8 | 6 | 1 | 2 | 0 | 4 | 4 | … | 1 | 9 | 6 | 0 | 2 | 8 | 3 | 7 | 2 | 5 |
| 24 | 6 | 0 | 4 | 2 | 3 | 1 | 0 | 5 | 7 | 0 | … | 1 | 1 | 2 | 7 | 5 | 2 | 9 | 4 | 7 | 3 |
| 25 | 5 | 0 | 2 | 1 | 4 | 9 | 4 | 6 | 9 | 3 | … | 5 | 5 | 3 | 5 | 9 | 2 | 7 | 4 | 1 | 6 |
| 26 | 9 | 8 | 1 | 8 | 1 | 6 | 2 | 6 | 1 | 8 | … | 2 | 5 | 1 | 2 | 5 | 3 | 3 | 6 | 1 | 8 |
| 27 | 1 | 8 | 6 | 4 | 6 | 9 | 5 | 4 | 7 | 2 | … | 9 | 3 | 1 | 5 | 1 | 1 | 7 | 1 | 2 | 6 |
| 28 | 0 | 7 | 7 | 4 | 3 | 2 | 7 | 8 | 5 | 2 | … | 0 | 2 | 8 | 3 | 7 | 3 | 9 | 2 | 3 | 8 |
| 29 | 8 | 0 | 2 | 6 | 8 | 3 | 6 | 4 | 9 | 7 | … | 6 | 7 | 8 | 5 | 7 | 2 | 5 | 3 | 4 | 5 |

30 rows × 30 columns
x.dot(y)
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 616 | 560 | 723 | 739 | 612 | 457 | 681 | 799 | 575 | 590 | … | 523 | 739 | 613 | 580 | 668 | 602 | 733 | 585 | 657 | 700 |
| 1 | 520 | 438 | 691 | 600 | 612 | 455 | 666 | 764 | 707 | 592 | … | 555 | 681 | 503 | 679 | 641 | 506 | 779 | 494 | 633 | 590 |
| 2 | 557 | 570 | 786 | 807 | 690 | 469 | 804 | 828 | 704 | 573 | … | 563 | 675 | 712 | 758 | 793 | 672 | 754 | 550 | 756 | 638 |
| 3 | 605 | 507 | 664 | 701 | 660 | 496 | 698 | 806 | 651 | 575 | … | 582 | 685 | 668 | 586 | 629 | 534 | 678 | 484 | 591 | 626 |
| 4 | 599 | 681 | 753 | 873 | 721 | 563 | 754 | 770 | 620 | 654 | … | 633 | 747 | 661 | 677 | 726 | 649 | 716 | 610 | 735 | 706 |
| 5 | 422 | 354 | 602 | 627 | 613 | 396 | 617 | 627 | 489 | 423 | … | 456 | 572 | 559 | 537 | 499 | 384 | 589 | 436 | 574 | 507 |
| 6 | 359 | 446 | 599 | 599 | 481 | 357 | 577 | 572 | 451 | 464 | … | 449 | 550 | 495 | 532 | 633 | 554 | 663 | 476 | 565 | 602 |
| 7 | 531 | 520 | 698 | 590 | 607 | 537 | 665 | 696 | 571 | 472 | … | 576 | 588 | 551 | 665 | 652 | 527 | 742 | 528 | 650 | 599 |
| 8 | 449 | 322 | 547 | 533 | 593 | 399 | 584 | 638 | 587 | 424 | … | 402 | 596 | 523 | 523 | 447 | 362 | 561 | 386 | 529 | 484 |
| 9 | 373 | 433 | 525 | 601 | 522 | 345 | 551 | 521 | 434 | 447 | … | 508 | 498 | 438 | 478 | 459 | 418 | 488 | 407 | 503 | 496 |
| 10 | 500 | 427 | 574 | 607 | 667 | 477 | 652 | 656 | 615 | 477 | … | 622 | 702 | 531 | 610 | 558 | 532 | 598 | 471 | 582 | 561 |
| 11 | 664 | 694 | 772 | 841 | 779 | 574 | 730 | 810 | 711 | 608 | … | 591 | 760 | 616 | 638 | 721 | 676 | 846 | 678 | 754 | 708 |
| 12 | 545 | 547 | 687 | 701 | 721 | 576 | 689 | 724 | 710 | 532 | … | 674 | 684 | 648 | 694 | 710 | 564 | 757 | 571 | 671 | 656 |
| 13 | 574 | 586 | 723 | 750 | 691 | 494 | 696 | 787 | 667 | 523 | … | 618 | 681 | 568 | 682 | 715 | 644 | 756 | 557 | 690 | 604 |
| 14 | 502 | 382 | 645 | 557 | 570 | 403 | 538 | 677 | 500 | 501 | … | 369 | 650 | 507 | 576 | 546 | 531 | 554 | 437 | 616 | 463 |
| 15 | 510 | 505 | 736 | 651 | 649 | 510 | 719 | 733 | 694 | 557 | … | 605 | 717 | 574 | 642 | 678 | 576 | 755 | 455 | 598 | 654 |
| 16 | 567 | 376 | 614 | 612 | 643 | 514 | 598 | 724 | 547 | 464 | … | 456 | 639 | 520 | 560 | 569 | 442 | 596 | 517 | 659 | 532 |
| 17 | 626 | 716 | 828 | 765 | 740 | 603 | 809 | 852 | 692 | 591 | … | 664 | 716 | 655 | 721 | 742 | 612 | 819 | 593 | 744 | 712 |
| 18 | 600 | 559 | 667 | 664 | 641 | 556 | 624 | 815 | 638 | 564 | … | 581 | 701 | 559 | 677 | 710 | 554 | 748 | 597 | 614 | 657 |
| 19 | 445 | 431 | 661 | 681 | 641 | 552 | 690 | 719 | 602 | 474 | … | 515 | 637 | 576 | 620 | 572 | 512 | 599 | 455 | 622 | 538 |
| 20 | 523 | 569 | 784 | 725 | 713 | 501 | 740 | 772 | 638 | 640 | … | 589 | 775 | 664 | 686 | 726 | 672 | 747 | 548 | 723 | 645 |
| 21 | 487 | 465 | 553 | 639 | 517 | 449 | 592 | 609 | 454 | 398 | … | 492 | 567 | 534 | 404 | 554 | 417 | 561 | 466 | 498 | 492 |
| 22 | 479 | 449 | 574 | 686 | 583 | 377 | 566 | 614 | 563 | 455 | … | 453 | 539 | 491 | 501 | 596 | 520 | 722 | 478 | 565 | 501 |
| 23 | 483 | 386 | 476 | 526 | 550 | 426 | 492 | 585 | 536 | 482 | … | 322 | 541 | 438 | 456 | 487 | 408 | 502 | 426 | 474 | 481 |
| 24 | 523 | 551 | 658 | 767 | 537 | 444 | 663 | 731 | 576 | 577 | … | 522 | 590 | 525 | 664 | 691 | 548 | 635 | 526 | 641 | 538 |
| 25 | 652 | 656 | 738 | 753 | 853 | 508 | 752 | 815 | 669 | 576 | … | 694 | 833 | 693 | 606 | 575 | 616 | 704 | 559 | 728 | 672 |
| 26 | 578 | 577 | 744 | 856 | 699 | 497 | 779 | 800 | 733 | 587 | … | 630 | 754 | 704 | 834 | 760 | 680 | 765 | 592 | 731 | 629 |
| 27 | 554 | 494 | 665 | 689 | 630 | 574 | 695 | 703 | 636 | 599 | … | 554 | 685 | 532 | 658 | 649 | 554 | 693 | 577 | 634 | 668 |
| 28 | 498 | 552 | 659 | 784 | 552 | 492 | 690 | 775 | 544 | 551 | … | 567 | 636 | 518 | 599 | 742 | 521 | 733 | 533 | 605 | 604 |
| 29 | 513 | 491 | 563 | 642 | 477 | 367 | 589 | 647 | 516 | 484 | … | 428 | 574 | 504 | 548 | 553 | 483 | 540 | 407 | 547 | 455 |

30 rows × 30 columns
%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
# Convert the NumPy rows into plain Python lists of ints.
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    # Naive triple-loop matrix multiplication in pure Python.
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In general, pure computation runs faster in NumPy.
NumPy is geared toward computation, while pandas is geared toward data handling.
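A practical consequence, as a hedged sketch: when a tight numerical kernel dominates, convert to a NumPy array once, compute there, then wrap the result back with the original labels (to_numpy() is assumed available; .values works on older pandas):

xv = x.to_numpy()                # one-time conversion out of pandas
yv = y.to_numpy()
prod = pd.DataFrame(xv.dot(yv), index=x.index, columns=y.columns)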
(3) Broadcasting
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
|  | A | B | C |
|---|---|---|---|
| 0 | 6 | 3 | 7 |
| 1 | 4 | 6 | 9 |
| 2 | 2 | 6 | 7 |
x.iloc[0]
A 6
B 3
C 7
Name: 0, dtype: int32
x/x.iloc[0]
|  | A | B | C |
|---|---|---|---|
| 0 | 1.000000 | 1.0 | 1.000000 |
| 1 | 0.666667 | 2.0 | 1.285714 |
| 2 | 0.333333 | 2.0 | 1.000000 |
x.A
0 6
1 4
2 2
Name: A, dtype: int32
x.div(x.A, axis=0) # related methods: add, sub, mul, div
|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 0.5 | 1.166667 |
| 1 | 1.0 | 1.5 | 2.250000 |
| 2 | 1.0 | 3.0 | 3.500000 |
x.div(x.iloc[0], axis=1)
|  | A | B | C |
|---|---|---|---|
| 0 | 1.000000 | 1.0 | 1.000000 |
| 1 | 0.666667 | 2.0 | 1.285714 |
| 2 | 0.333333 | 2.0 | 1.000000 |
3. New usage patterns
(1) Index alignment
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
|  | A | B | C |
|---|---|---|---|
| 0 | 7 | 5 | 1 |
| 1 | 4 | 0 | 9 |
| 2 | 5 | 8 | 0 |
- pandas automatically aligns the indexes of the two objects; values missing on one side are represented by np.nan
A+B
|  | A | B | C |
|---|---|---|---|
| 0 | 10.0 | 12.0 | NaN |
| 1 | 6.0 | 1.0 | NaN |
| 2 | NaN | NaN | NaN |
A.add(B, fill_value=0)
|  | A | B | C |
|---|---|---|---|
| 0 | 10.0 | 12.0 | 1.0 |
| 1 | 6.0 | 1.0 | 9.0 |
| 2 | 5.0 | 8.0 | 0.0 |
A*B
|  | A | B | C |
|---|---|---|---|
| 0 | 21.0 | 35.0 | NaN |
| 1 | 8.0 | 0.0 | NaN |
| 2 | NaN | NaN | NaN |
(2) Statistics
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
Counting values with collections.Counter:
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
|  | A |
|---|---|
| 0 | 2 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
| 5 | 1 |
| 6 | 1 |
| 7 | 2 |
| 8 | 1 |
| 9 | 2 |
| 10 | 2 |
| 11 | 0 |
| 12 | 2 |
| 13 | 0 |
| 14 | 2 |
| 15 | 2 |
| 16 | 0 |
| 17 | 0 |
| 18 | 2 |
| 19 | 1 |
np.unique(y1)
pandas also provides the value_counts method:
y1["A"].value_counts()
2 11
1 5
0 4
Name: A, dtype: int64
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
|  | population | GDP |
|---|---|---|
| BeiJing | 2154 | 30320 |
| ShangHai | 2424 | 32680 |
| ShenZhen | 1303 | 24222 |
| HangZhou | 981 | 13468 |
city_info["per\_GDP"] = city_info["GDP"]/city_info["population"]
city_info
|  | population | GDP | per_GDP |
|---|---|---|---|
| BeiJing | 2154 | 30320 | 14.076137 |
| ShangHai | 2424 | 32680 | 13.481848 |
| ShenZhen | 1303 | 24222 | 18.589409 |
| HangZhou | 981 | 13468 | 13.728848 |
Ascending sort
city_info.sort_values(by="per_GDP")
|  | population | GDP | per_GDP |
|---|---|---|---|
| ShangHai | 2424 | 32680 | 13.481848 |
| HangZhou | 981 | 13468 | 13.728848 |
| BeiJing | 2154 | 30320 | 14.076137 |
| ShenZhen | 1303 | 24222 | 18.589409 |
Descending sort
city_info.sort_values(by="per_GDP", ascending=False)
|  | population | GDP | per_GDP |
|---|---|---|---|
| ShenZhen | 1303 | 24222 | 18.589409 |
| BeiJing | 2154 | 30320 | 14.076137 |
| HangZhou | 981 | 13468 | 13.728848 |
| ShangHai | 2424 | 32680 | 13.481848 |
Sorting along an axis
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
|  | C | B | A | D |
|---|---|---|---|---|
| 2 | 3 | 13 | 17 | 8 |
| 1 | 1 | 19 | 14 | 6 |
| 0 | 11 | 7 | 14 | 2 |
Sorting by row index
data.sort_index()
|  | C | B | A | D |
|---|---|---|---|---|
| 0 | 11 | 7 | 14 | 2 |
| 1 | 1 | 19 | 14 | 6 |
| 2 | 3 | 13 | 17 | 8 |
Sorting by column index
data.sort_index(axis=1)
|  | A | B | C | D |
|---|---|---|---|---|
| 2 | 17 | 13 | 3 | 8 |
| 1 | 14 | 19 | 1 | 6 |
| 0 | 14 | 7 | 11 | 2 |
data.sort_index(axis=1, ascending=False)
|  | D | C | B | A |
|---|---|---|---|---|
| 2 | 8 | 3 | 13 | 17 |
| 1 | 6 | 1 | 19 | 14 |
| 0 | 2 | 11 | 7 | 14 |
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
| 1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
| 2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
| 3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
| 4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
| 5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
Number of non-null values
df.count()
A 6
B 6
C 6
D 6
dtype: int64
Sum
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum(axis=1)
0 7.947086
1 28.289408
2 -1.427283
3 -4.893571
4 20.476806
5 3.043402
dtype: float64
Maximum and minimum
df.min()
A -3.386712
B -4.333177
C -4.032613
D -0.152567
dtype: float64
df.max(axis=1)
0 6.367969
1 13.113252
2 5.094802
3 2.537716
4 5.577329
5 6.398588
dtype: float64
df
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
| 1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
| 2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
| 3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
| 4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
| 5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
df.idxmax()
A 1
B 1
C 4
D 5
dtype: int64
Mean
df.mean()
A 2.378704
B 1.588916
C 0.655231
D 4.283124
dtype: float64
Variance
df.var()
A 34.980702
B 19.110656
C 18.948144
D 6.726776
dtype: float64
Standard deviation
df.std()
A 5.914449
B 4.371574
C 4.352947
D 2.593603
dtype: float64
Median
df.median()
A 1.126767
B 1.531743
C 0.176172
D 5.273518
dtype: float64
Mode
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
|  | A | B |
|---|---|---|
| 0 | 4 | 2 |
| 1 | 3 | 2 |
| 2 | 2 | 0 |
| 3 | 2 | 4 |
| 4 | 2 | 0 |
| 5 | 4 | 1 |
| 6 | 2 | 0 |
| 7 | 1 | 1 |
| 8 | 3 | 4 |
| 9 | 2 | 0 |
data.mode()
75th percentile
df.quantile(0.75)
A 3.539202
B 5.052594
C 4.539740
D 6.157738
Name: 0.75, dtype: float64
df.describe()
|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | 2.378704 | 1.588916 | 0.655231 | 4.283124 |
| std | 5.914449 | 4.371574 | 4.352947 | 2.593603 |
| min | -3.386712 | -4.333177 | -4.032613 | -0.152567 |
| 25% | -1.256706 | -1.265251 | -2.925910 | 3.158284 |
| 50% | 1.126767 | 1.531743 | 0.176172 | 5.273518 |
| 75% | 3.539202 | 5.052594 | 4.539740 | 6.157738 |
| max | 13.113252 | 6.774559 | 5.577329 | 6.398588 |
data_2 = pd.DataFrame([["a", "a", "c", "d"],
["c", "a", "c", "b"],
["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | a | a | c | d |
| 1 | c | a | c | b |
| 2 | a | a | d | c |
data_2.describe()
|  | A | B | C | D |
|---|---|---|---|---|
| count | 3 | 3 | 3 | 3 |
| unique | 2 | 1 | 2 | 3 |
| top | a | a | c | d |
| freq | 2 | 3 | 2 | 1 |
Correlation coefficients and covariance
df.corr()
|  | A | B | C | D |
|---|---|---|---|---|
| A | 1.000000 | 0.831063 | 0.331060 | 0.510821 |
| B | 0.831063 | 1.000000 | 0.179244 | 0.719112 |
| C | 0.331060 | 0.179244 | 1.000000 | -0.450365 |
| D | 0.510821 | 0.719112 | -0.450365 | 1.000000 |
df.corrwith(df["A"])
A 1.000000
B 0.831063
C 0.331060
D 0.510821
dtype: float64
Custom aggregation output
Usage of apply(method): by default, the given method is applied to each column.
df
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
| 1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
| 2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
| 3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
| 4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
| 5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
df.apply(np.cumsum)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
| 1 | 14.195450 | 10.331955 | -0.185923 | 11.895013 |
| 2 | 12.159109 | 5.998778 | 4.908878 | 11.742447 |
| 3 | 8.772397 | 4.476413 | 2.386669 | 14.280162 |
| 4 | 13.100888 | 10.027406 | 7.963999 | 19.300153 |
| 5 | 14.272224 | 9.533497 | 3.931385 | 25.698741 |
df.apply(np.cumsum, axis=1)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.082198 | 4.639594 | 1.579117 | 7.947086 |
| 1 | 13.113252 | 19.887811 | 22.762364 | 28.289408 |
| 2 | -2.036341 | -6.369518 | -1.274717 | -1.427283 |
| 3 | -3.386712 | -4.909077 | -7.431287 | -4.893571 |
| 4 | 4.328491 | 9.879485 | 15.456814 | 20.476806 |
| 5 | 1.171336 | 0.677427 | -3.355186 | 3.043402 |
df.apply(sum)
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A 16.499965
B 11.107736
C 9.609942
D 6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
|  | A | B | C | D |
|---|---|---|---|---|
| Count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | 2.378704 | 1.588916 | 0.655231 | 4.283124 |
| max | 13.113252 | 6.774559 | 5.577329 | 6.398588 |
| idxmin | 3.000000 | 2.000000 | 5.000000 | 2.000000 |
| std | 5.914449 | 4.371574 | 4.352947 | 2.593603 |
11.4 Handling Missing Values
1. Detecting missing values
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
|  | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2 |
| 1 | NaN | 3 | 4 |
| 2 | 5 | 6 | None |
Note: when None, strings, etc. are present, every column's dtype becomes object, which consumes more resources than int or float.
np.nan is a special floating-point value (its type is float), so NaN is the better choice for representing missing values.
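A minimal check of both claims:

print(type(np.nan))              # <class 'float'>
s = pd.Series([1, 2, 3])
print(s.dtype)                   # int64
s[0] = np.nan                    # inserting NaN upcasts the column
print(s.dtype)                   # float64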
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
|  | A | B | C |
|---|---|---|---|
| 0 | False | True | False |
| 1 | True | False | False |
| 2 | False | False | True |
data.notnull()
|  | A | B | C |
|---|---|---|---|
| 0 | True | False | True |
| 1 | False | True | True |
| 2 | True | True | False |
2. Dropping missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | 3.0 |
| 1 | NaN | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | NaN | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
Note: np.nan is a special floating-point value
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1) Dropping whole rows
data.dropna()
|  | A | B | C | D |
|---|---|---|---|---|
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
(2) Dropping whole columns
data.dropna(axis="columns")
|  | D |
|---|---|
| 0 | 3.0 |
| 1 | 6.0 |
| 2 | 9.0 |
| 3 | 13.0 |
data["D"] = np.nan
data
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
| 3 | 10.0 | 11.0 | 12.0 | NaN |
data.dropna(axis="columns", how="all")
|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 |
| 1 | NaN | 4.0 | 5.0 |
| 2 | 7.0 | 8.0 | NaN |
| 3 | 10.0 | 11.0 | 12.0 |
"all" means a row or column is dropped only when every one of its values is missing.
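dropna also accepts a thresh argument; a small sketch (thresh is the minimum number of non-null values required to keep a row or column):

data.dropna(thresh=3)            # keep only rows with at least 3 non-null values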
data.dropna(axis="columns", how="any")
data.loc[3] = np.nan
data
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN |
data.dropna(how="all")
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
3. Filling missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | 3.0 |
| 1 | NaN | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | NaN | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
data.fillna(value=5)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 5.0 | 2.0 | 3.0 |
| 1 | 5.0 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 5.0 | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 7.666667 | 2.000000 | 3.0 |
| 1 | 6.0 | 4.000000 | 5.000000 | 6.0 |
| 2 | 7.0 | 8.000000 | 6.333333 | 9.0 |
| 3 | 10.0 | 11.000000 | 12.000000 | 13.0 |
To fill with the mean of all the data, first stack the frame flat, then compute the mean and fill.
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 7.0 | 2.0 | 3.0 |
| 1 | 7.0 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 7.0 | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
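fillna can also propagate neighboring values instead of a constant; a brief sketch (method-based filling, as in the pandas of this era):

data.fillna(method="ffill")           # forward-fill down each column
data.fillna(method="bfill", axis=1)   # backward-fill along each row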
11.5 Merging Data
import pandas as pd
import numpy as np
def make_df(cols, ind):
    """A simple DataFrame."""
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
make_df("ABC", range(3))
|  | A | B | C |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
|  | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
Overlapping rows
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
|  | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
pd.concat([df_5, df_6],ignore_index=True)
|  | A | B |
|---|---|---|
| 0 | A1 | B1 |
| 1 | A2 | B2 |
| 2 | A1 | B1 |
| 3 | A2 | B2 |
Overlapping columns
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
|  | A | B | C | B | C | D |
|---|---|---|---|---|---|---|
| 1 | A1 | B1 | C1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | B2 | C2 | D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1 | A1 | B1 | C1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | B2 | C2 | D2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
|  | A | B | C |
|---|---|---|---|
| 0 | A1 | B1 | C1 |
| 1 | A2 | B2 | C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
|  | A | B | C |
|---|---|---|---|
| 0 | A1 | B1 | C1 |
| 1 | A2 | B2 | C2 |
Example: merging city information
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
|  | city | pop |
|---|---|---|
| 0 | BeiJing | 2154 |
| 1 | HangZhou | 981 |
| 2 | ShenZhen | 1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
|  | city | GDP |
|---|---|---|
| 0 | BeiJing | 30320 |
| 1 | ShangHai | 32680 |
| 2 | HangZhou | 13468 |
city_info = pd.merge(population, GDP)
city_info
|  | city | pop | GDP |
|---|---|---|---|
| 0 | BeiJing | 2154 | 30320 |
| 1 | HangZhou | 981 | 13468 |
Here how="outer" takes the union of the keys.
city_info = pd.merge(population, GDP, how="outer")
city_info
|  | city | pop | GDP |
|---|---|---|---|
| 0 | BeiJing | 2154.0 | 30320.0 |
| 1 | HangZhou | 981.0 | 13468.0 |
| 2 | ShenZhen | 1303.0 | NaN |
| 3 | ShangHai | NaN | 32680.0 |
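For comparison, a sketch of the remaining how options (illustrative, not from the original): "left" keeps every row of the first frame, "right" every row of the second; unmatched cells become NaN.

pd.merge(population, GDP, how="left")    # all cities from population
pd.merge(population, GDP, how="right")   # all cities from GDP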
11.6 Grouping and Pivot Tables
df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | B | 1 | 4 |
| 2 | C | 2 | 9 |
| 3 | C | 3 | 9 |
| 4 | B | 4 | 1 |
| 5 | A | 5 | 9 |
(1) Grouping
df.groupby("key")
This returns a GroupBy object: the grouping has been prepared, but nothing is computed or displayed until we apply a method to it.
df.groupby("key").sum()
| key | data1 | data2 |
|---|---|---|
| A | 5 | 10 |
| B | 5 | 6 |
| C | 5 | 11 |
df.groupby("key").mean()
| key | data1 | data2 |
|---|---|---|
| A | 2.5 | 5.0 |
| B | 2.5 | 3.0 |
| C | 2.5 | 5.5 |
We can print the groups to see what this object contains:
for i in df.groupby("key"):
    print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
df.groupby("key")["data1"].describe()
| key | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| A | 2.0 | 2.5 | 3.535534 | 0.0 | 1.25 | 2.5 | 3.75 | 5.0 |
| B | 2.0 | 2.5 | 2.121320 | 1.0 | 1.75 | 2.5 | 3.25 | 4.0 |
| C | 2.0 | 2.5 | 0.707107 | 2.0 | 2.25 | 2.5 | 2.75 | 3.0 |
df.groupby("key").aggregate(["min", "median", "max"])
| key | data1 min | data1 median | data1 max | data2 min | data2 median | data2 max |
|---|---|---|---|---|---|---|
| A | 0 | 2.5 | 5 | 2 | 5.0 | 8 |
| B | 1 | 2.5 | 4 | 2 | 3.0 | 4 |
| C | 2 | 2.5 | 3 | 3 | 5.5 | 8 |
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 5 | A | 5 | 8 |
df
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
df.groupby("key").transform(lambda x: x-x.mean())
|  | data1 | data2 |
|---|---|---|
| 0 | -2.5 | -3.0 |
| 1 | -1.5 | -1.0 |
| 2 | -0.5 | 2.5 |
| 3 | 0.5 | -2.5 |
| 4 | 1.5 | 1.0 |
| 5 | 2.5 | 3.0 |
df
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | B | 1 | 4 |
| 2 | C | 2 | 9 |
| 3 | C | 3 | 9 |
| 4 | B | 4 | 1 |
| 5 | A | 5 | 9 |
df.groupby("key").apply(lambda x: x-x.mean())
|  | data1 | data2 |
|---|---|---|
| 0 | -2.5 | -4.0 |
| 1 | -1.5 | 1.5 |
| 2 | -0.5 | 0.0 |
| 3 | 0.5 | 0.0 |
| 4 | 1.5 | -1.5 |
| 5 | 2.5 | 4.0 |
df
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
def norm_by_data2(x):
    # Normalize data1 by the sum of data2 within each group.
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0.000000 | 2 |
| 1 | B | 0.166667 | 2 |
| 2 | C | 0.181818 | 8 |
| 3 | C | 0.272727 | 3 |
| 4 | B | 0.666667 | 4 |
| 5 | A | 0.500000 | 8 |
Here L acts as a new set of labels that replaces the original row labels for grouping.
L = [0, 1, 0, 1, 2, 0]
df
|  | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
df.groupby(L).sum()
|  | data1 | data2 |
|---|---|---|
| 0 | 7 | 18 |
| 1 | 4 | 5 |
| 2 | 4 | 4 |
df2 = df.set_index("key")
df2
| key | data1 | data2 |
|---|---|---|
| A | 0 | 2 |
| B | 1 | 2 |
| C | 2 | 8 |
| C | 3 | 3 |
| B | 4 | 4 |
| A | 5 | 8 |
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
|  | data1 | data2 |
|---|---|---|
| constant | 10 | 17 |
| first | 5 | 10 |
df2.groupby(str.lower).mean()
|  | data1 | data2 |
|---|---|---|
| a | 2.5 | 5.0 |
| b | 2.5 | 3.0 |
| c | 2.5 | 5.5 |
Rows are assigned to the same group only when both of these keys are equal.
df2.groupby([str.lower, mapping]).mean()
|  |  | data1 | data2 |
|---|---|---|---|
| a | first | 2.5 | 5.0 |
| b | constant | 2.5 | 3.0 |
| c | constant | 2.5 | 5.5 |
Example 1: processing planet observation data
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
|  | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.describe()
|  | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|
| count | 1035.000000 | 992.000000 | 513.000000 | 808.000000 | 1035.000000 |
| mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
| std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
| min | 1.000000 | 0.090706 | 0.003600 | 1.350000 | 1989.000000 |
| 25% | 1.000000 | 5.442540 | 0.229000 | 32.560000 | 2007.000000 |
| 50% | 1.000000 | 39.979500 | 1.260000 | 55.250000 | 2010.000000 |
| 75% | 2.000000 | 526.005000 | 3.040000 | 178.500000 | 2012.000000 |
| max | 7.000000 | 730000.000000 | 25.000000 | 8500.000000 | 2014.000000 |
planets.head()
|  | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
planets.head()
|  | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.groupby(["method", decade]).sum()
| method | decade | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| Astrometry | 2010s | 2 | 1.262360e+03 | 0.00000 | 35.75 | 4023 |
| Eclipse Timing Variations | 2000s | 5 | 1.930800e+04 | 6.05000 | 261.44 | 6025 |
| Eclipse Timing Variations | 2010s | 10 | 2.345680e+04 | 4.20000 | 1000.00 | 12065 |
| Imaging | 2000s | 29 | 1.350935e+06 | 0.00000 | 956.83 | 40139 |
| Imaging | 2010s | 21 | 6.803750e+04 | 0.00000 | 1210.08 | 36208 |
| Microlensing | 2000s | 12 | 1.732500e+04 | 0.00000 | 0.00 | 20070 |
| Microlensing | 2010s | 15 | 4.750000e+03 | 0.00000 | 41440.00 | 26155 |
| Orbital Brightness Modulation | 2010s | 5 | 2.127920e+00 | 0.00000 | 2360.00 | 6035 |
| Pulsar Timing | 1990s | 9 | 1.900153e+02 | 0.00000 | 0.00 | 5978 |
| Pulsar Timing | 2000s | 1 | 3.652500e+04 | 0.00000 | 0.00 | 2003 |
| Pulsar Timing | 2010s | 1 | 9.070629e-02 | 0.00000 | 1200.00 | 2011 |
| Pulsation Timing Variations | 2000s | 1 | 1.170000e+03 | 0.00000 | 0.00 | 2007 |
| Radial Velocity | 1980s | 1 | 8.388800e+01 | 11.68000 | 40.57 | 1989 |
| Radial Velocity | 1990s | 52 | 1.091561e+04 | 68.17820 | 723.71 | 55943 |
| Radial Velocity | 2000s | 475 | 2.633526e+05 | 945.31928 | 15201.16 | 619775 |
| Radial Velocity | 2010s | 424 | 1.809630e+05 | 316.47890 | 11382.67 | 432451 |
| Transit | 2000s | 64 | 2.897102e+02 | 0.00000 | 31823.31 | 124462 |
| Transit | 2010s | 712 | 8.087813e+03 | 1.47000 | 102419.46 | 673999 |
| Transit Timing Variations | 2010s | 9 | 2.393505e+02 | 0.00000 | 3313.00 | 8050 |
Double brackets [[ ]] return a DataFrame, while single brackets [ ] return a Series; the DataFrame display is a bit nicer.
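A minimal check of the claim:

g = planets.groupby(["method", decade])
print(type(g["number"].sum()))      # <class 'pandas.core.series.Series'>
print(type(g[["number"]].sum()))    # <class 'pandas.core.frame.DataFrame'>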
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
| method \ decade | 1980s | 1990s | 2000s | 2010s |
|---|---|---|---|---|
| Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
| Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
| Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
| Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
| Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
| Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
| Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
| Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
| Transit | 0.0 | 0.0 | 64.0 | 712.0 |
| Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |
(2) Pivot tables
Example 2: analyzing Titanic passenger data
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
|  | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
T = titanic[titanic.age.notnull()].copy()
T.age = T.age.apply(lambda x: 60 if x >= 60 else x)  # assign back; otherwise the 60-year cap is lost
T.age.value_counts()
24.00 30
22.00 27
60.00 26
18.00 26
28.00 25
30.00 25
19.00 25
21.00 24
25.00 23
36.00 22
29.00 20
35.00 18
32.00 18
27.00 18
26.00 18
31.00 17
16.00 17
34.00 15
20.00 15
33.00 15
23.00 15
39.00 14
40.00 13
17.00 13
42.00 13
45.00 12
38.00 11
4.00 10
50.00 10
2.00 10
..
8.00 4
5.00 4
11.00 4
6.00 3
7.00 3
46.00 3
30.50 2
57.00 2
0.83 2
55.00 2
10.00 2
59.00 2
13.00 2
28.50 2
40.50 2
45.50 2
0.75 2
32.50 2
34.50 1
55.50 1
0.92 1
36.50 1
12.00 1
53.00 1
14.50 1
0.67 1
20.50 1
23.50 1
24.50 1
0.42 1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20 220
30 167
10 102
40 89
0 62
50 48
60 26
Name: age, dtype: int64
Age.astype(str)+"s"
0 20s
1 30s
2 20s
3 30s
4 30s
6 50s
7 0s
8 20s
9 10s
10 0s
11 50s
12 20s
13 30s
14 10s
15 50s
16 0s
18 30s
20 30s
21 30s
22 10s
23 20s
24 0s
25 30s
27 10s
30 40s
33 60s
34 20s
35 40s
37 20s
38 10s
...
856 40s
857 50s
858 20s
860 40s
861 20s
862 40s
864 20s
865 40s
866 20s
867 30s
869 0s
870 20s
871 40s
872 30s
873 40s
874 20s
875 10s
876 20s
877 10s
879 50s
880 20s
881 30s
882 20s
883 20s
884 20s
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
| sex \ age | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|---|
| female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
| male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
| sex \ age | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|---|
| female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
| male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
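pivot_table also takes an aggregation function and can append row/column totals; a hedged sketch on the same data (aggfunc and margins are standard pivot_table parameters):

T.pivot_table("survived", index="sex", columns="age",
              aggfunc="count", margins=True)   # passenger counts plus an "All" margin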
titanic.describe()