【深度之眼 Python Basics + Data Science Bootcamp】Chapter 12: The Pandas Library


Chapter 12: The Pandas Library

Introduction

NumPy excels at vectorized numerical computation.

For more flexible and complex data tasks, however, such as attaching labels to data, handling missing values, grouping, and building pivot tables, NumPy falls short.

Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and simpler.

12.1 Creating Objects

12.1.1 The Pandas Series Object

A Series is a one-dimensional array of labeled data.

Creating a Series object

General form: pd.Series(data, index=index, dtype=dtype)

data: the data; can be a list, a dict, or a NumPy array

index: the index; optional

dtype: the data type; optional

1. Creating from a list

  • If index is omitted, it defaults to an integer sequence
import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])
data
0    1.5
1    3.0
2    4.5
3    6.0
dtype: float64
  • Adding an index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a    1.5
b    3.0
c    4.5
d    6.0
dtype: float64
  • Specifying the data type

    If omitted, the dtype is inferred from the input data

data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
data
a    1
b    2
c    3
d    4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

Note: mixed data types are supported within one Series

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a    1
b    2
c    3
d    4
dtype: object
data["a"]
1
data["c"]
'3'

The data type can be coerced explicitly

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
data["c"]
3.0
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
# raises ValueError: could not convert string to float: 'a'
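When a clean conversion like the one above is impossible, a hedged alternative (not covered in the original notes) is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN instead of raising:

```python
import pandas as pd

s = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"])
num = pd.to_numeric(s, errors="coerce")  # "a" cannot be parsed, so it becomes NaN
print(num)
```

The resulting Series is float64, with NaN at label "c".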

2. Creating from a one-dimensional NumPy array

import numpy as np

x = np.arange(5)
pd.Series(x)
0    0
1    1
2    2
3    3
4    4
dtype: int32

3. Creating from a dictionary

  • By default, the dict keys become the index and the dict values the data
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict)    
population
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
dtype: int64
  • If an explicit index is given, entries are looked up among the dict keys; keys that are not found get the value NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
population
BeiJing     2154.0
HangZhou     981.0
c              NaN
d              NaN
dtype: float64

4. Scalar data

pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

12.1.2 The Pandas DataFrame Object

A DataFrame is a two-dimensional array of labeled data.

Creating a DataFrame object

General form: pd.DataFrame(data, index=index, columns=columns)

data: the data; can be a list, a dict, or a NumPy array

index: the index; optional

columns: the column labels; optional

1. Creating from a Series object

population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }

population = pd.Series(population_dict)    
pd.DataFrame(population)
0
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
pd.DataFrame(population, columns=["population"])
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981

2. Creating from a dict of Series objects

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }

GDP = pd.Series(GDP_dict)
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
pd.DataFrame({"population": population,
              "GDP": GDP})
population GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

Note: entries that fall short (here the scalar "China") are automatically broadcast to fill the column

pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})
population GDP country
BeiJing 2154 30320 China
ShangHai 2424 32680 China
ShenZhen 1303 24222 China
HangZhou 981 13468 China

3. Creating from a list of dicts

  • The list positions become the index and the dict keys become the columns
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
a b
0 0 0
1 1 2
2 2 4
data1 = data["a"].copy()
data1
0    0
1    1
2    2
Name: a, dtype: int64
data1[0] = 10
data1
0    10
1     1
2     2
Name: a, dtype: int64
data
a b
0 0 0
1 1 2
2 2 4

  • Keys missing from a dict default to NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)
a b c
0 1.0 1 NaN
1 NaN 3 4.0

4. Creating from a two-dimensional NumPy array

data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
       [2, 9],
       [4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
foo bar
a 1 6
b 2 9
c 4 0

12.2 DataFrame Properties

1. Attributes

data = pd.DataFrame({"pop": population, "GDP": GDP})
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

(1) df.values returns the data as a NumPy array

data.values
array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

(2) df.index returns the row index

data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

(3) df.columns returns the column labels

data.columns
Index(['pop', 'GDP'], dtype='object')

(4) df.shape returns the shape

data.shape
(4, 2)

(5) df.size returns the number of elements

data.size
8

(6) df.dtypes returns the data type of each column

data.dtypes
pop    int64
GDP    int64
dtype: object

2. Indexing

data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468

(1) Getting columns

  • Dictionary style
data["pop"]
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64
data[["GDP", "pop"]]
GDP pop
BeiJing 30320 2154
ShangHai 32680 2424
ShenZhen 24222 1303
HangZhou 13468 981
  • Attribute style
data.GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

(2) Getting rows

  • Label-based (absolute) indexing: df.loc
data.loc["BeiJing"]  # loc selects rows by label
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]
pop GDP
BeiJing 2154 30320
HangZhou 981 13468
  • Position-based (relative) indexing: df.iloc
data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
data.iloc[0]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]
pop GDP
ShangHai 2424 32680
HangZhou 981 13468

(3) Getting a scalar

data
pop GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320
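Beyond the three ways shown above, pandas also provides the .at (label-based) and .iat (position-based) accessors, which are optimized for single scalars. A small sketch rebuilding the same frame:

```python
import pandas as pd

population = pd.Series({"BeiJing": 2154, "ShangHai": 2424,
                        "ShenZhen": 1303, "HangZhou": 981})
GDP = pd.Series({"BeiJing": 30320, "ShangHai": 32680,
                 "ShenZhen": 24222, "HangZhou": 13468})
data = pd.DataFrame({"pop": population, "GDP": GDP})

print(data.at["BeiJing", "GDP"])   # label-based scalar access
print(data.iat[0, 1])              # position-based scalar access
```

Both print 30320, matching data.loc["BeiJing", "GDP"] and data.iloc[0, 1].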

(4) Indexing a Series object

type(data.GDP)
pandas.core.series.Series
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
GDP["BeiJing"]
30320

3. Slicing

dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
2019-01-04 0.305088 0.535920 -0.978434 0.177251
2019-01-05 0.313383 0.234041 0.163155 -0.296649
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342

(1) Row slices

df["2019-01-01": "2019-01-03"]
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
df.loc["2019-01-01": "2019-01-03"]
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
df.iloc[0: 3]
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726

(2) Column slices

df
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
2019-01-04 0.305088 0.535920 -0.978434 0.177251
2019-01-05 0.313383 0.234041 0.163155 -0.296649
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
df.loc[:, "A": "C"]
A B C
2019-01-01 -0.935378 -0.190742 0.925984
2019-01-02 -0.234414 -1.194674 1.080779
2019-01-03 -0.141572 0.058118 1.102248
2019-01-04 0.305088 0.535920 -0.978434
2019-01-05 0.313383 0.234041 0.163155
2019-01-06 0.250613 -0.904400 -0.858240
df.iloc[:, 0: 3]
A B C
2019-01-01 -0.935378 -0.190742 0.925984
2019-01-02 -0.234414 -1.194674 1.080779
2019-01-03 -0.141572 0.058118 1.102248
2019-01-04 0.305088 0.535920 -0.978434
2019-01-05 0.313383 0.234041 0.163155
2019-01-06 0.250613 -0.904400 -0.858240

(3) Mixed selection

df
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
2019-01-04 0.305088 0.535920 -0.978434 0.177251
2019-01-05 0.313383 0.234041 0.163155 -0.296649
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
  • Slicing rows and columns simultaneously
df.loc["2019-01-02": "2019-01-03", "C":"D"]
C D
2019-01-02 1.080779 -2.294395
2019-01-03 1.102248 1.207726
df.iloc[1: 3, 2:]
C D
2019-01-02 1.080779 -2.294395
2019-01-03 1.102248 1.207726
  • Row slice, scattered column values
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
A C
2019-01-04 0.305088 -0.978434
2019-01-05 0.313383 0.163155
2019-01-06 0.250613 -0.858240
df.iloc[3:, [0, 2]]
A C
2019-01-04 0.305088 -0.978434
2019-01-05 0.313383 0.163155
2019-01-06 0.250613 -0.858240
  • Scattered row values, column slice
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]  # raises an error
df.iloc[[1, 5], 0: 3]
A B C
2019-01-02 -0.234414 -1.194674 1.080779
2019-01-06 0.250613 -0.904400 -0.858240
  • Scattered values for both rows and columns
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
df.iloc[[1, 5], [0, 3]]
A D
2019-01-02 -0.234414 -2.294395
2019-01-06 0.250613 -1.573342
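For the failing loc case above, one hedged workaround (not in the original notes) is to convert the date strings to real Timestamps first; loc then accepts the label list together with a column slice:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])

# a list of Timestamp labels combined with a column slice works with loc
sel = df.loc[pd.to_datetime(["2019-01-02", "2019-01-06"]), "C":"D"]
print(sel.shape)  # (2, 2)
```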

4. Boolean indexing

df
A B C D
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
2019-01-03 -0.141572 0.058118 1.102248 1.207726
2019-01-04 0.305088 0.535920 -0.978434 0.177251
2019-01-05 0.313383 0.234041 0.163155 -0.296649
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
df > 0
A B C D
2019-01-01 False False True False
2019-01-02 False False True False
2019-01-03 False True True True
2019-01-04 True True False True
2019-01-05 True True True False
2019-01-06 True False False False
df[df > 0]
A B C D
2019-01-01 NaN NaN 0.925984 NaN
2019-01-02 NaN NaN 1.080779 NaN
2019-01-03 NaN 0.058118 1.102248 1.207726
2019-01-04 0.305088 0.535920 NaN 0.177251
2019-01-05 0.313383 0.234041 0.163155 NaN
2019-01-06 0.250613 NaN NaN NaN
df.A > 0
2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06     True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
A B C D
2019-01-04 0.305088 0.535920 -0.978434 0.177251
2019-01-05 0.313383 0.234041 0.163155 -0.296649
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
  • The isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
A B C D E
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969 one
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 one
2019-01-03 -0.141572 0.058118 1.102248 1.207726 two
2019-01-04 0.305088 0.535920 -0.978434 0.177251 three
2019-01-05 0.313383 0.234041 0.163155 -0.296649 four
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 three
ind = df2["E"].isin(["two", "four"])
ind     
2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool
df2[ind]
A B C D E
2019-01-03 -0.141572 0.058118 1.102248 1.207726 two
2019-01-05 0.313383 0.234041 0.163155 -0.296649 four

5. Assignment

df
  • Adding a new column to a DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64
df["E"] = s1  # add a new column
df
A B C D E
2019-01-01 -0.935378 -0.190742 0.925984 -0.818969 1
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
  • Modifying values in place
df.loc["2019-01-01", "A"] = 0
df
A B C D E
2019-01-01 0.000000 -0.190742 0.925984 -0.818969 1
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
df.iloc[0, 1] = 0
df
A B C D E
2019-01-01 0.000000 0.000000 0.925984 -0.818969 1
2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
df["D"] = np.array([5]*len(df))   # can be shortened to df["D"] = 5
df
A B C D E
2019-01-01 0.000000 0.000000 0.925984 5 1
2019-01-02 -0.234414 -1.194674 1.080779 5 2
2019-01-03 -0.141572 0.058118 1.102248 5 3
2019-01-04 0.305088 0.535920 -0.978434 5 4
2019-01-05 0.313383 0.234041 0.163155 5 5
2019-01-06 0.250613 -0.904400 -0.858240 5 6
  • Replacing the index and the columns
df.index = [i for i in range(len(df))]
df
A B C D E
0 0.000000 0.000000 0.925984 5 1
1 -0.234414 -1.194674 1.080779 5 2
2 -0.141572 0.058118 1.102248 5 3
3 0.305088 0.535920 -0.978434 5 4
4 0.313383 0.234041 0.163155 5 5
5 0.250613 -0.904400 -0.858240 5 6
df.columns = [i for i in range(df.shape[1])]
df
0 1 2 3 4
0 0.000000 0.000000 0.925984 5 1
1 -0.234414 -1.194674 1.080779 5 2
2 -0.141572 0.058118 1.102248 5 3
3 0.305088 0.535920 -0.978434 5 4
4 0.313383 0.234041 0.163155 5 5
5 0.250613 -0.904400 -0.858240 5 6

12.3 Numerical Operations and Statistical Analysis

1. Inspecting the data

import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
A B C D
2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
2019-01-02 1.371364 -0.121454 -0.299653 1.095375
2019-01-03 -0.714591 -1.103224 0.979250 0.319455
2019-01-04 -1.397557 0.426008 0.233861 -1.651887
2019-01-05 0.434026 0.459830 -0.095444 1.220302
2019-01-06 -0.133876 0.074500 -1.028147 0.605402

(1) Viewing the first rows

df.head()    # 5 rows by default
A B C D
2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
2019-01-02 1.371364 -0.121454 -0.299653 1.095375
2019-01-03 -0.714591 -1.103224 0.979250 0.319455
2019-01-04 -1.397557 0.426008 0.233861 -1.651887
2019-01-05 0.434026 0.459830 -0.095444 1.220302
df.head(2)
A B C D
2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
2019-01-02 1.371364 -0.121454 -0.299653 1.095375

(2) Viewing the last rows

df.tail()    # 5 rows by default
A B C D
2019-01-02 1.371364 -0.121454 -0.299653 1.095375
2019-01-03 -0.714591 -1.103224 0.979250 0.319455
2019-01-04 -1.397557 0.426008 0.233861 -1.651887
2019-01-05 0.434026 0.459830 -0.095444 1.220302
2019-01-06 -0.133876 0.074500 -1.028147 0.605402
df.tail(3) 
A B C D
2019-01-04 -1.397557 0.426008 0.233861 -1.651887
2019-01-05 0.434026 0.459830 -0.095444 1.220302
2019-01-06 -0.133876 0.074500 -1.028147 0.605402

(3) Viewing overall information

df.iloc[0, 3] = np.nan
df
A B C D
2019-01-01 -0.854043 0.412345 -2.296051 NaN
2019-01-02 1.371364 -0.121454 -0.299653 1.095375
2019-01-03 -0.714591 -1.103224 0.979250 0.319455
2019-01-04 -1.397557 0.426008 0.233861 -1.651887
2019-01-05 0.434026 0.459830 -0.095444 1.220302
2019-01-06 -0.133876 0.074500 -1.028147 0.605402
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

2. NumPy universal functions also work on Pandas objects

(1) Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
0 1 2 3
0 0 1 2 3
x+5
0 1 2 3
0 5 6 7 8
np.exp(x)
0 1 2 3
0 1.0 2.718282 7.389056 20.085537
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
0 1 2 3
0 4 5 6 7
x*y
0 1 2 3
0 0 5 12 21

(2) Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 6 3 7 4 6 9 2 6 7 4 ... 4 0 9 5 8 0 9 2 6 3
1 8 2 4 2 6 4 8 6 1 3 ... 2 0 3 1 7 3 1 5 5 9
2 3 5 1 9 1 9 3 7 6 8 ... 6 8 7 0 7 7 2 0 7 2
3 2 0 4 9 6 9 8 6 8 7 ... 0 2 4 2 0 4 9 6 6 8
4 9 9 2 6 0 3 3 4 6 6 ... 9 6 8 6 0 0 8 8 3 8
5 2 6 5 7 8 4 0 2 9 7 ... 2 0 4 0 7 0 0 1 1 5
6 6 4 0 0 2 1 4 9 5 6 ... 5 0 8 5 2 3 3 2 9 2
7 2 3 6 3 8 0 7 6 1 7 ... 3 0 1 0 4 4 6 8 8 2
8 2 2 3 7 5 7 0 7 3 0 ... 1 1 5 2 8 3 0 3 0 4
9 3 7 7 6 2 0 0 2 5 6 ... 4 2 3 2 0 0 4 5 2 8
10 4 7 0 4 2 0 3 4 6 0 ... 5 6 1 9 1 9 0 7 0 8
11 5 6 9 6 9 2 1 8 7 9 ... 6 5 2 8 9 5 9 9 5 0
12 3 9 5 5 4 0 7 4 4 6 ... 0 7 2 9 6 9 4 9 4 6
13 8 4 0 9 9 0 1 5 8 7 ... 5 8 4 0 3 4 9 9 4 6
14 3 0 4 6 9 9 5 4 3 1 ... 6 1 0 3 7 1 2 0 0 2
15 4 2 0 0 7 9 1 2 1 2 ... 6 3 9 4 1 7 3 8 4 8
16 3 9 4 8 7 2 0 2 3 1 ... 8 0 0 3 8 5 2 0 3 8
17 2 8 6 3 2 9 4 4 2 8 ... 6 9 4 2 6 1 8 9 9 0
18 5 6 7 9 8 1 9 1 4 4 ... 3 5 2 5 6 9 9 2 6 2
19 1 9 3 7 8 6 0 2 8 0 ... 4 3 2 2 3 8 1 8 0 0
20 4 5 5 2 6 8 9 7 5 7 ... 3 5 0 8 0 4 3 2 5 1
21 2 4 8 1 9 7 1 4 6 7 ... 0 1 8 2 0 4 6 5 0 4
22 4 5 2 4 6 4 4 4 9 9 ... 1 7 6 9 9 1 5 5 2 1
23 0 5 4 8 0 6 4 4 1 2 ... 8 5 0 7 6 9 2 0 4 3
24 9 7 0 9 0 3 7 4 1 5 ... 3 7 8 2 2 1 9 2 2 4
25 4 1 9 5 4 5 0 4 8 9 ... 9 3 0 7 0 2 3 7 5 9
26 6 7 1 9 7 2 6 2 6 1 ... 0 6 5 9 8 0 3 8 3 9
27 2 8 1 3 5 1 7 7 0 2 ... 8 0 4 5 4 5 5 6 3 7
28 6 8 6 2 2 7 4 3 7 5 ... 1 7 9 2 4 5 9 5 3 2
29 3 0 3 0 0 9 5 4 3 2 ... 1 3 0 4 8 0 8 7 5 6

30 rows × 30 columns

  • Transpose
z = x.T
z
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 6 8 3 2 9 2 6 2 2 3 ... 4 2 4 0 9 4 6 2 6 3
1 3 2 5 0 9 6 4 3 2 7 ... 5 4 5 5 7 1 7 8 8 0
2 7 4 1 4 2 5 0 6 3 7 ... 5 8 2 4 0 9 1 1 6 3
3 4 2 9 9 6 7 0 3 7 6 ... 2 1 4 8 9 5 9 3 2 0
4 6 6 1 6 0 8 2 8 5 2 ... 6 9 6 0 0 4 7 5 2 0
5 9 4 9 9 3 4 1 0 7 0 ... 8 7 4 6 3 5 2 1 7 9
6 2 8 3 8 3 0 4 7 0 0 ... 9 1 4 4 7 0 6 7 4 5
7 6 6 7 6 4 2 9 6 7 2 ... 7 4 4 4 4 4 2 7 3 4
8 7 1 6 8 6 9 5 1 3 5 ... 5 6 9 1 1 8 6 0 7 3
9 4 3 8 7 6 7 6 7 0 6 ... 7 7 9 2 5 9 1 2 5 2
10 3 8 7 1 3 5 3 0 7 5 ... 4 0 2 6 4 1 9 9 1 0
11 7 1 4 0 6 7 6 8 3 5 ... 7 5 0 5 1 0 5 8 3 5
12 7 9 1 6 2 8 7 8 5 5 ... 9 0 4 1 2 9 2 4 3 1
13 2 8 4 6 5 3 0 1 7 2 ... 3 1 8 5 8 8 2 5 5 7
14 5 9 7 7 1 0 5 6 3 5 ... 9 0 0 1 6 9 8 3 5 9
15 4 4 9 4 9 0 7 9 2 7 ... 7 4 2 1 6 8 6 9 0 4
16 1 1 8 2 8 9 4 2 8 1 ... 9 9 3 1 5 8 4 1 7 6
17 7 3 8 7 4 3 3 6 2 4 ... 1 8 0 2 7 5 9 7 5 9
18 5 6 0 5 5 6 1 9 8 0 ... 4 5 0 1 3 7 6 5 2 1
19 1 7 8 2 3 1 5 8 1 0 ... 8 0 7 3 7 0 8 4 8 7
20 4 2 6 0 9 2 5 3 1 4 ... 3 0 1 8 3 9 0 8 1 1
21 0 0 8 2 6 0 0 0 1 2 ... 5 1 7 5 7 3 6 0 7 3
22 9 3 7 4 8 4 8 1 5 3 ... 0 8 6 0 8 0 5 4 9 0
23 5 1 0 2 6 0 5 0 2 2 ... 8 2 9 7 2 7 9 5 2 4
24 8 7 7 0 0 7 2 4 8 0 ... 0 0 9 6 2 0 8 4 4 8
25 0 3 7 4 0 0 3 4 3 0 ... 4 4 1 9 1 2 0 5 5 0
26 9 1 2 9 8 0 3 6 0 4 ... 3 6 5 2 9 3 3 5 9 8
27 2 5 0 6 8 1 2 8 3 5 ... 2 5 5 0 2 7 8 6 5 7
28 6 5 7 6 3 1 9 8 0 2 ... 5 0 2 4 2 5 3 3 3 5
29 3 9 2 8 8 5 2 2 4 8 ... 1 4 1 3 4 9 9 7 2 6

30 rows × 30 columns

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 5 8 9 5 0 0 1 7 6 9 ... 1 7 0 6 9 9 7 6 9 1
1 0 1 8 8 3 9 8 7 3 6 ... 9 2 0 4 9 2 7 7 9 8
2 6 9 3 7 7 4 5 9 3 6 ... 7 7 1 1 3 0 8 6 4 5
3 6 2 5 7 8 4 4 7 7 4 ... 0 1 9 8 2 3 1 2 7 2
4 6 0 9 2 6 6 2 7 7 0 ... 1 5 4 0 7 8 9 5 7 0
5 9 3 9 1 4 4 6 8 8 9 ... 1 8 7 0 3 4 2 0 3 5
6 1 2 4 3 0 6 0 7 2 8 ... 4 3 3 6 7 3 5 3 2 4
7 4 0 3 3 8 3 5 6 7 5 ... 1 7 3 1 6 6 9 6 9 6
8 0 0 2 9 6 0 6 7 0 3 ... 6 7 9 5 4 9 5 2 5 6
9 6 8 7 7 7 2 6 0 5 2 ... 7 0 6 2 4 3 6 7 6 3
10 0 6 4 7 6 2 9 5 9 9 ... 4 9 3 9 1 2 5 4 0 8
11 2 3 9 9 4 4 8 2 1 6 ... 0 5 9 8 6 6 0 4 7 3
12 0 1 6 0 6 1 6 4 2 5 ... 8 8 0 7 2 0 7 1 1 9
13 5 1 5 9 6 4 9 8 7 5 ... 2 4 3 2 0 0 4 2 5 0
14 0 3 8 5 3 1 4 7 3 2 ... 8 5 5 7 5 9 1 3 9 3
15 3 3 6 1 3 0 5 0 5 2 ... 7 1 7 7 3 8 3 0 6 3
16 0 6 5 9 6 4 6 6 2 2 ... 3 6 8 6 5 1 3 2 6 3
17 6 7 2 8 0 1 8 6 0 0 ... 5 6 2 5 4 3 0 6 2 1
18 9 4 4 0 9 8 7 7 6 1 ... 7 9 9 7 1 1 4 6 5 6
19 4 1 1 5 1 2 6 2 3 3 ... 0 0 0 9 8 5 9 3 4 0
20 9 8 6 3 9 9 0 8 1 6 ... 2 9 0 1 3 9 4 8 8 8
21 2 8 6 4 9 0 5 5 6 1 ... 6 7 5 6 8 7 4 2 4 0
22 0 3 5 9 0 3 6 5 1 1 ... 6 2 5 3 9 3 9 5 1 9
23 7 7 0 8 6 1 2 0 4 4 ... 1 9 6 0 2 8 3 7 2 5
24 6 0 4 2 3 1 0 5 7 0 ... 1 1 2 7 5 2 9 4 7 3
25 5 0 2 1 4 9 4 6 9 3 ... 5 5 3 5 9 2 7 4 1 6
26 9 8 1 8 1 6 2 6 1 8 ... 2 5 1 2 5 3 3 6 1 8
27 1 8 6 4 6 9 5 4 7 2 ... 9 3 1 5 1 1 7 1 2 6
28 0 7 7 4 3 2 7 8 5 2 ... 0 2 8 3 7 3 9 2 3 8
29 8 0 2 6 8 3 6 4 9 7 ... 6 7 8 5 7 2 5 3 4 5

30 rows × 30 columns

x.dot(y)
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 616 560 723 739 612 457 681 799 575 590 ... 523 739 613 580 668 602 733 585 657 700
1 520 438 691 600 612 455 666 764 707 592 ... 555 681 503 679 641 506 779 494 633 590
2 557 570 786 807 690 469 804 828 704 573 ... 563 675 712 758 793 672 754 550 756 638
3 605 507 664 701 660 496 698 806 651 575 ... 582 685 668 586 629 534 678 484 591 626
4 599 681 753 873 721 563 754 770 620 654 ... 633 747 661 677 726 649 716 610 735 706
5 422 354 602 627 613 396 617 627 489 423 ... 456 572 559 537 499 384 589 436 574 507
6 359 446 599 599 481 357 577 572 451 464 ... 449 550 495 532 633 554 663 476 565 602
7 531 520 698 590 607 537 665 696 571 472 ... 576 588 551 665 652 527 742 528 650 599
8 449 322 547 533 593 399 584 638 587 424 ... 402 596 523 523 447 362 561 386 529 484
9 373 433 525 601 522 345 551 521 434 447 ... 508 498 438 478 459 418 488 407 503 496
10 500 427 574 607 667 477 652 656 615 477 ... 622 702 531 610 558 532 598 471 582 561
11 664 694 772 841 779 574 730 810 711 608 ... 591 760 616 638 721 676 846 678 754 708
12 545 547 687 701 721 576 689 724 710 532 ... 674 684 648 694 710 564 757 571 671 656
13 574 586 723 750 691 494 696 787 667 523 ... 618 681 568 682 715 644 756 557 690 604
14 502 382 645 557 570 403 538 677 500 501 ... 369 650 507 576 546 531 554 437 616 463
15 510 505 736 651 649 510 719 733 694 557 ... 605 717 574 642 678 576 755 455 598 654
16 567 376 614 612 643 514 598 724 547 464 ... 456 639 520 560 569 442 596 517 659 532
17 626 716 828 765 740 603 809 852 692 591 ... 664 716 655 721 742 612 819 593 744 712
18 600 559 667 664 641 556 624 815 638 564 ... 581 701 559 677 710 554 748 597 614 657
19 445 431 661 681 641 552 690 719 602 474 ... 515 637 576 620 572 512 599 455 622 538
20 523 569 784 725 713 501 740 772 638 640 ... 589 775 664 686 726 672 747 548 723 645
21 487 465 553 639 517 449 592 609 454 398 ... 492 567 534 404 554 417 561 466 498 492
22 479 449 574 686 583 377 566 614 563 455 ... 453 539 491 501 596 520 722 478 565 501
23 483 386 476 526 550 426 492 585 536 482 ... 322 541 438 456 487 408 502 426 474 481
24 523 551 658 767 537 444 663 731 576 577 ... 522 590 525 664 691 548 635 526 641 538
25 652 656 738 753 853 508 752 815 669 576 ... 694 833 693 606 575 616 704 559 728 672
26 578 577 744 856 699 497 779 800 733 587 ... 630 754 704 834 760 680 765 592 731 629
27 554 494 665 689 630 574 695 703 636 599 ... 554 685 532 658 649 554 693 577 634 668
28 498 552 659 784 552 492 690 775 544 551 ... 567 636 518 599 742 521 733 533 605 604
29 513 491 563 642 477 367 589 647 516 484 ... 428 574 504 548 553 483 540 407 547 455

30 rows × 30 columns

%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • The same computation: NumPy vs. Pandas
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):    # pure-Python matrix multiplication, for comparison
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res          
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In general, pure computation runs faster in NumPy.

NumPy is geared toward computation, while Pandas is geared toward data handling.

(3) Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
A B C
0 6 3 7
1 4 6 9
2 2 6 7
  • Row-wise broadcasting
x.iloc[0]
A    6
B    3
C    7
Name: 0, dtype: int32
x/x.iloc[0]
A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000
  • Column-wise broadcasting
x.A
0    6
1    4
2    2
Name: A, dtype: int32
x.div(x.A, axis=0)             # add sub div mul
A B C
0 1.0 0.5 1.166667
1 1.0 1.5 2.250000
2 1.0 3.0 3.500000
x.div(x.iloc[0], axis=1)
A B C
0 1.000000 1.0 1.000000
1 0.666667 2.0 1.285714
2 0.333333 2.0 1.000000
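The axis argument works the same way for the other arithmetic methods (add, sub, mul). As one sketch beyond the notes, centering each column by subtracting its mean broadcasts the row of column means down the frame:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))

centered = x.sub(x.mean(), axis=1)  # align on columns, broadcast down the rows
print(centered.mean())              # each column mean is now ~0
```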

3. Pandas-specific behavior

(1) Index alignment

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
A B
0 3 7
1 2 1
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
A B C
0 7 5 1
1 4 0 9
2 5 8 0
  • Pandas automatically aligns the indices of the two objects; entries present in only one are set to np.nan
A+B
A B C
0 10.0 12.0 NaN
1 6.0 1.0 NaN
2 NaN NaN NaN
  • Missing entries can instead be filled via fill_value
A.add(B, fill_value=0)
A B C
0 10.0 12.0 1.0
1 6.0 1.0 9.0
2 5.0 8.0 0.0
A*B
A B C
0 21.0 35.0 NaN
1 8.0 0.0 NaN
2 NaN NaN NaN
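fill_value works for the other operators as well. Reusing the values of A and B above, a sketch with mul: positions missing from A are treated as 1, so the corresponding B values pass through unchanged.

```python
import pandas as pd

A = pd.DataFrame([[3, 7], [2, 1]], columns=list("AB"))
B = pd.DataFrame([[7, 5, 1], [4, 0, 9], [5, 8, 0]], columns=list("ABC"))

prod = A.mul(B, fill_value=1)   # missing entries count as 1 before multiplying
print(prod)
```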

(2) Statistics

  • Counting distinct values
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
A
0 2
1 2
2 2
3 1
4 2
5 1
6 1
7 2
8 1
9 2
10 2
11 0
12 2
13 0
14 2
15 2
16 0
17 0
18 2
19 1
np.unique(y1)
array([0, 1, 2])
y1["A"].value_counts()
2    11
1     5
0     4
Name: A, dtype: int64
  • Deriving new columns and sorting
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
population GDP
BeiJing 2154 30320
ShangHai 2424 32680
ShenZhen 1303 24222
HangZhou 981 13468
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
population GDP per_GDP
BeiJing 2154 30320 14.076137
ShangHai 2424 32680 13.481848
ShenZhen 1303 24222 18.589409
HangZhou 981 13468 13.728848

Ascending sort

city_info.sort_values(by="per_GDP")
population GDP per_GDP
ShangHai 2424 32680 13.481848
HangZhou 981 13468 13.728848
BeiJing 2154 30320 14.076137
ShenZhen 1303 24222 18.589409

Descending sort

city_info.sort_values(by="per_GDP", ascending=False)
population GDP per_GDP
ShenZhen 1303 24222 18.589409
BeiJing 2154 30320 14.076137
HangZhou 981 13468 13.728848
ShangHai 2424 32680 13.481848

Sorting along an axis

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
C B A D
2 3 13 17 8
1 1 19 14 6
0 11 7 14 2

Sorting rows by index

data.sort_index()
C B A D
0 11 7 14 2
1 1 19 14 6
2 3 13 17 8

Sorting columns by label

data.sort_index(axis=1)
A B C D
2 17 13 3 8
1 14 19 1 6
0 14 7 11 2
data.sort_index(axis=1, ascending=False)
D C B A
2 8 3 13 17
1 6 1 19 14
0 2 11 7 14
  • Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588

Number of non-null values

df.count()
A    6
B    6
C    6
D    6
dtype: int64

Sum

df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum(axis=1)
0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

Maximum and minimum

df.min()
A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64
df.max(axis=1)
0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64
df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588
df.idxmax()  # index of each column's maximum
A    1
B    1
C    4
D    5
dtype: int64

Mean

df.mean()
A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

Variance

df.var()
A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

Standard deviation

df.std()
A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

Median

df.median()
A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

Mode

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
A B
0 4 2
1 3 2
2 2 0
3 2 4
4 2 0
5 4 1
6 2 0
7 1 1
8 3 4
9 2 0
data.mode()
A B
0 2 0

75th percentile

df.quantile(0.75)
A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

Everything at once

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 2.378704 1.588916 0.655231 4.283124
std 5.914449 4.371574 4.352947 2.593603
min -3.386712 -4.333177 -4.032613 -0.152567
25% -1.256706 -1.265251 -2.925910 3.158284
50% 1.126767 1.531743 0.176172 5.273518
75% 3.539202 5.052594 4.539740 6.157738
max 13.113252 6.774559 5.577329 6.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
A B C D
0 a a c d
1 c a c b
2 a a d c
data_2.describe()
A B C D
count 3 3 3 3
unique 2 1 2 3
top a a c d
freq 2 3 2 1

Correlation coefficients and covariance

df.corr()
A B C D
A 1.000000 0.831063 0.331060 0.510821
B 0.831063 1.000000 0.179244 0.719112
C 0.331060 0.179244 1.000000 -0.450365
D 0.510821 0.719112 -0.450365 1.000000
df.corrwith(df["A"])
A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

Custom summaries

apply(method): applies method to each column by default

df
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 13.113252 6.774559 2.874553 5.527044
2 -2.036341 -4.333177 5.094802 -0.152567
3 -3.386712 -1.522365 -2.522209 2.537716
4 4.328491 5.550994 5.577329 5.019991
5 1.171336 -0.493910 -4.032613 6.398588
df.apply(np.cumsum)
A B C D
0 1.082198 3.557396 -3.060476 6.367969
1 14.195450 10.331955 -0.185923 11.895013
2 12.159109 5.998778 4.908878 11.742447
3 8.772397 4.476413 2.386669 14.280162
4 13.100888 10.027406 7.963999 19.300153
5 14.272224 9.533497 3.931385 25.698741
df.apply(np.cumsum, axis=1)
A B C D
0 1.082198 4.639594 1.579117 7.947086
1 13.113252 19.887811 22.762364 28.289408
2 -2.036341 -6.369518 -1.274717 -1.427283
3 -3.386712 -4.909077 -7.431287 -4.893571
4 4.328491 9.879485 15.456814 20.476806
5 1.171336 0.677427 -3.355186 3.043402
df.apply(sum)
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
A B C D
Count 6.000000 6.000000 6.000000 6.000000
mean 2.378704 1.588916 0.655231 4.283124
max 13.113252 6.774559 5.577329 6.398588
idxmin 3.000000 2.000000 5.000000 2.000000
std 5.914449 4.371574 4.352947 2.593603
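A similar summary table can be produced without a custom function via DataFrame.agg, which accepts a list of reduction names (a sketch with hypothetical data, not from the original notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)), columns=list("ABCD"))
summary = df.agg(["count", "mean", "max", "std"])  # one row per statistic
print(summary)
```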

12.4 Handling Missing Values

1. Detecting missing values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data
A B C
0 1 NaN 2
1 NaN 3 4
2 5 6 None

Note: with None, strings, etc. present, every column is promoted to object dtype, which consumes more resources than int or float

data.dtypes
A    object
B    object
C    object
dtype: object
data.isnull()
A B C
0 False True False
1 True False False
2 False False True
data.notnull()
A B C
0 True False True
1 False True True
2 True True False
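If the object dtype is only an artifact of mixed None/NaN input, converting back to a numeric dtype avoids that overhead. A sketch using astype (None becomes NaN in the conversion):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])

converted = data.astype(float)   # object columns back to float64; None -> NaN
print(converted.dtypes)
```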

2. Dropping missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0

Note: np.nan is a special floating-point value

data.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object
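A couple of quick checks illustrating the note, using nothing beyond NumPy itself:

```python
import numpy as np

print(type(np.nan))      # <class 'float'> -- a plain Python float
print(np.nan == np.nan)  # False: NaN is not equal even to itself
print(np.isnan(np.nan))  # True: test with np.isnan (or pd.isnull)
```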

(1) Dropping whole rows

data.dropna()
A B C D
3 10.0 11.0 12.0 13.0

(2) Dropping whole columns

data.dropna(axis="columns")
D
0 3.0
1 6.0
2 9.0
3 13.0
data["D"] = np.nan
data
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 10.0 11.0 12.0 NaN
data.dropna(axis="columns", how="all")
A B C
0 1.0 NaN 2.0
1 NaN 4.0 5.0
2 7.0 8.0 NaN
3 10.0 11.0 12.0
data.dropna(axis="columns", how="any")
0
1
2
3
data.loc[3] = np.nan
data
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
3 NaN NaN NaN NaN
data.dropna(how="all")
A B C D
0 1.0 NaN 2.0 NaN
1 NaN 4.0 5.0 NaN
2 7.0 8.0 NaN NaN
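Besides `how`, `dropna()` accepts a `thresh` parameter (not used in the notes above): a row or column is kept only if it has at least `thresh` non-missing values. A minimal sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]), columns=["A", "B", "C", "D"])

# Keep only rows with at least 4 non-NaN values: row 3 is the only such row
result = data.dropna(thresh=4)
print(result)
```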

3. Filling missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A B C D
0 1.0 NaN 2.0 3.0
1 NaN 4.0 5.0 6.0
2 7.0 8.0 NaN 9.0
3 10.0 11.0 12.0 13.0
data.fillna(value=5)
A B C D
0 1.0 5.0 2.0 3.0
1 5.0 4.0 5.0 6.0
2 7.0 8.0 5.0 9.0
3 10.0 11.0 12.0 13.0
  • Replace with the mean
fill = data.mean()
fill
A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64
data.fillna(value=fill)
A B C D
0 1.0 7.666667 2.000000 3.0
1 6.0 4.000000 5.000000 6.0
2 7.0 8.000000 6.333333 9.0
3 10.0 11.000000 12.000000 13.0
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
A B C D
0 1.0 7.0 2.0 3.0
1 7.0 4.0 5.0 6.0
2 7.0 8.0 7.0 9.0
3 10.0 11.0 12.0 13.0
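Besides a fixed value, missing entries can be propagated from their neighbors with `ffill()` (carry the previous value forward) and `bfill()` (pull the next value backward); a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # NaNs take the last seen value: 1, 1, 1, 4
backward = s.bfill()  # NaNs take the next value: 1, 4, 4, 4
print(forward.tolist(), backward.tolist())
```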

12.5 Merging Data

  • A helper function that builds DataFrames
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "A simple DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
  • Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
pd.concat([df_1, df_2])
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
  • Horizontal concatenation
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
pd.concat([df_3, df_4], axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
  • Overlapping indexes

Overlapping rows

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6])
A B
1 A1 B1
2 A2 B2
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6],ignore_index=True)
A B
0 A1 B1
1 A2 B2
2 A1 B1
3 A2 B2
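Instead of discarding the overlapping index with `ignore_index`, `pd.concat` can record where each row came from via the `keys` parameter, which builds a hierarchical row index; a sketch (frames written out by hand rather than via `make_df`):

```python
import pandas as pd

df_5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df_6 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])

# keys labels each input frame, producing a two-level row index
combined = pd.concat([df_5, df_6], keys=["x", "y"])
print(combined)
print(combined.loc["y"])  # recover the second frame by its key
```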

Overlapping columns

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2
pd.concat([df_7, df_8], axis=1)
A B C B C D
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
pd.concat([df_7, df_8],axis=1, ignore_index=True)
0 1 2 3 4 5
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
  • Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2
pd.merge(df_9, df_10)
A B C
0 A1 B1 C1
1 A2 B2 C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1
pd.merge(df_9, df_10)
A B C
0 A1 B1 C1
1 A2 B2 C2

[Example] Merging city information

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
city pop
0 BeiJing 2154
1 HangZhou 981
2 ShenZhen 1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
city GDP
0 BeiJing 30320
1 ShangHai 32680
2 HangZhou 13468
city_info = pd.merge(population, GDP)
city_info
city pop GDP
0 BeiJing 2154 30320
1 HangZhou 981 13468
city_info = pd.merge(population, GDP, how="outer")
city_info
city pop GDP
0 BeiJing 2154.0 30320.0
1 HangZhou 981.0 13468.0
2 ShenZhen 1303.0 NaN
3 ShangHai NaN 32680.0
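Besides `how="outer"`, merge supports `"left"`, `"right"`, and the default `"inner"`; the `indicator` parameter adds a `_merge` column showing which side each row came from. A sketch on the same city data:

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

# Keep every row of population; GDP is NaN where no match exists
left = pd.merge(population, GDP, how="left", indicator=True)
print(left)
```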

12.6 Grouping and Pivot Tables

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                  "data1": range(6),
                  "data2": np.random.randint(0, 10, size=6)})
df
key data1 data2
0 A 0 1
1 B 1 4
2 C 2 9
3 C 3 9
4 B 4 1
5 A 5 9

(1) Grouping

  • Lazy evaluation
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
df.groupby("key").sum()
data1 data2
key
A 5 10
B 5 6
C 5 11
df.groupby("key").mean()
data1 data2
key
A 2.5 5.0
B 2.5 3.0
C 2.5 5.5
for i in df.groupby("key"):
    print(str(i))
('A',   key  data1  data2
0   A      0      2
5   A      5      8)
('B',   key  data1  data2
1   B      1      2
4   B      4      4)
('C',   key  data1  data2
2   C      2      8
3   C      3      3)
  • Selecting a column
df.groupby("key")["data2"].sum()
key
A    10
B     6
C    11
Name: data2, dtype: int32
  • Iterating over groups
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
  • Calling methods
df.groupby("key")["data1"].describe()
count mean std min 25% 50% 75% max
key
A 2.0 2.5 3.535534 0.0 1.25 2.5 3.75 5.0
B 2.0 2.5 2.121320 1.0 1.75 2.5 3.25 4.0
C 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
  • More complex operations are supported
df.groupby("key").aggregate(["min", "median", "max"])
data1 data2
min median max min median max
key
A 0 2.5 5 2 5.0 8
B 1 2.5 4 2 3.0 4
C 2 2.5 3 3 5.5 8
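`aggregate` (alias `agg`) also accepts keyword-style "named aggregation" (pandas ≥ 0.25), which gives the output columns explicit names; a sketch (the data2 values here are fixed by hand, since the notes use random ones):

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": [2, 2, 8, 3, 4, 8]})

# Each keyword becomes an output column: (input column, aggregation function)
summary = df.groupby("key").agg(d1_sum=("data1", "sum"),
                                d2_max=("data2", "max"))
print(summary)
```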
  • Filtering
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A    4.242641
B    1.414214
C    3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
key data1 data2
0 A 0 2
2 C 2 8
3 C 3 3
5 A 5 8
  • Transformation
df
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
df.groupby("key").transform(lambda x: x-x.mean())
data1 data2
0 -2.5 -3.0
1 -1.5 -1.0
2 -0.5 2.5
3 0.5 -2.5
4 1.5 1.0
5 2.5 3.0
df
key data1 data2
0 A 0 1
1 B 1 4
2 C 2 9
3 C 3 9
4 B 4 1
5 A 5 9
df.groupby("key").apply(lambda x: x-x.mean())
data1 data2
0 -2.5 -4.0
1 -1.5 1.5
2 -0.5 0.0
3 0.5 0.0
4 1.5 -1.5
5 2.5 4.0
  • The apply() method
df
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
key data1 data2
0 A 0.000000 2
1 B 0.166667 2
2 C 0.181818 8
3 C 0.272727 3
4 B 0.666667 4
5 A 0.500000 8
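A typical use of `transform` worth adding here (my sketch, not part of the course notes) is within-group standardization: unlike an aggregate, the result stays aligned with the original rows:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": [0, 1, 2, 3, 4, 5]})

# z-score each value relative to its own group's mean and std
z = df.groupby("key")["data1"].transform(lambda x: (x - x.mean()) / x.std())
print(z)
```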
  • Using a list or array as the grouping key
L = [0, 1, 0, 1, 2, 0]
df
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
df.groupby(L).sum()
data1 data2
0 7 18
1 4 5
2 4 4
  • Mapping the index to groups with a dictionary
df2 = df.set_index("key")
df2
data1 data2
key
A 0 2
B 1 2
C 2 8
C 3 3
B 4 4
A 5 8
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
data1 data2
constant 10 17
first 5 10
  • Any Python function
df2.groupby(str.lower).mean()
data1 data2
a 2.5 5.0
b 2.5 3.0
c 2.5 5.5
  • A list combining several of the valid keys above
df2.groupby([str.lower, mapping]).mean()
data1 data2
a first 2.5 5.0
b constant 2.5 3.0
c constant 2.5 5.5

[Example 1] Processing planet observation data

import seaborn as sns

planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
planets.describe()
number orbital_period mass distance year
count 1035.000000 992.000000 513.000000 808.000000 1035.000000
mean 1.785507 2002.917596 2.638161 264.069282 2009.070531
std 1.240976 26014.728304 3.818617 733.116493 3.972567
min 1.000000 0.090706 0.003600 1.350000 1989.000000
25% 1.000000 5.442540 0.229000 32.560000 2007.000000
50% 1.000000 39.979500 1.260000 55.250000 2010.000000
75% 2.000000 526.005000 3.040000 178.500000 2012.000000
max 7.000000 730000.000000 25.000000 8500.000000 2014.000000
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
decade = 10 * (planets["year"] // 10)
decade.head()
0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object
planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
planets.groupby(["method", decade]).sum()
number orbital_period mass distance year
method decade
Astrometry 2010s 2 1.262360e+03 0.00000 35.75 4023
Eclipse Timing Variations 2000s 5 1.930800e+04 6.05000 261.44 6025
2010s 10 2.345680e+04 4.20000 1000.00 12065
Imaging 2000s 29 1.350935e+06 0.00000 956.83 40139
2010s 21 6.803750e+04 0.00000 1210.08 36208
Microlensing 2000s 12 1.732500e+04 0.00000 0.00 20070
2010s 15 4.750000e+03 0.00000 41440.00 26155
Orbital Brightness Modulation 2010s 5 2.127920e+00 0.00000 2360.00 6035
Pulsar Timing 1990s 9 1.900153e+02 0.00000 0.00 5978
2000s 1 3.652500e+04 0.00000 0.00 2003
2010s 1 9.070629e-02 0.00000 1200.00 2011
Pulsation Timing Variations 2000s 1 1.170000e+03 0.00000 0.00 2007
Radial Velocity 1980s 1 8.388800e+01 11.68000 40.57 1989
1990s 52 1.091561e+04 68.17820 723.71 55943
2000s 475 2.633526e+05 945.31928 15201.16 619775
2010s 424 1.809630e+05 316.47890 11382.67 432451
Transit 2000s 64 2.897102e+02 0.00000 31823.31 124462
2010s 712 8.087813e+03 1.47000 102419.46 673999
Transit Timing Variations 2010s 9 2.393505e+02 0.00000 3313.00 8050
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
number
decade 1980s 1990s 2000s 2010s
method
Astrometry 0.0 0.0 0.0 2.0
Eclipse Timing Variations 0.0 0.0 5.0 10.0
Imaging 0.0 0.0 29.0 21.0
Microlensing 0.0 0.0 12.0 15.0
Orbital Brightness Modulation 0.0 0.0 0.0 5.0
Pulsar Timing 0.0 9.0 1.0 1.0
Pulsation Timing Variations 0.0 0.0 1.0 0.0
Radial Velocity 1.0 52.0 475.0 424.0
Transit 0.0 0.0 64.0 712.0
Transit Timing Variations 0.0 0.0 0.0 9.0
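The decade bucketing above (`10 * (year // 10)`) can also be done with `pd.cut`, which assigns values to explicit bin edges; a small sketch on hand-made years rather than the planets data:

```python
import pandas as pd

years = pd.Series([1989, 1995, 2006, 2011, 2014])

# Right-open bins [1980, 1990), [1990, 2000), ... labelled by decade
decade = pd.cut(years, bins=[1980, 1990, 2000, 2010, 2020],
                right=False, labels=["1980s", "1990s", "2000s", "2010s"])
print(decade.tolist())
```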

(2) Pivot tables

[Example 2] Analyzing Titanic passenger data

import seaborn as sns

titanic = sns.load_dataset("titanic")
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
T = titanic[titanic.age.notnull()].copy()
T.age = T.age.apply(lambda x: 60 if x >= 60 else x)  # cap ages at 60; must assign the result back
T.age.value_counts()
24.00    30
22.00    27
60.00    26
18.00    26
28.00    25
30.00    25
19.00    25
21.00    24
25.00    23
36.00    22
29.00    20
35.00    18
32.00    18
27.00    18
26.00    18
31.00    17
16.00    17
34.00    15
20.00    15
33.00    15
23.00    15
39.00    14
40.00    13
17.00    13
42.00    13
45.00    12
38.00    11
4.00     10
50.00    10
2.00     10
         ..
8.00      4
5.00      4
11.00     4
6.00      3
7.00      3
46.00     3
30.50     2
57.00     2
0.83      2
55.00     2
10.00     2
59.00     2
13.00     2
28.50     2
40.50     2
45.50     2
0.75      2
32.50     2
34.50     1
55.50     1
0.92      1
36.50     1
12.00     1
53.00     1
14.50     1
0.67      1
20.50     1
23.50     1
24.50     1
0.42      1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20    220
30    167
10    102
40     89
0      62
50     48
60     26
Name: age, dtype: int64
Age.astype(str)+"s"
0      20s
1      30s
2      20s
3      30s
4      30s
6      50s
7       0s
8      20s
9      10s
10      0s
11     50s
12     20s
13     30s
14     10s
15     50s
16      0s
18     30s
20     30s
21     30s
22     10s
23     20s
24      0s
25     30s
27     10s
30     40s
33     60s
34     20s
35     40s
37     20s
38     10s
      ... 
856    40s
857    50s
858    20s
860    40s
861    20s
862    40s
864    20s
865    40s
866    20s
867    30s
869     0s
870    20s
871    40s
872    30s
873    40s
874    20s
875    10s
876    20s
877    10s
879    50s
880    20s
881    30s
882    20s
883    20s
884    20s
885    30s
886    20s
887    10s
889    20s
890    30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
age 0 10 20 30 40 50 60
sex
female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000
male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.136364
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age 0 10 20 30 40 50 60
sex
female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000
male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.136364
titanic.describe()
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
titanic.groupby("sex")[["survived"]].mean()
survived
sex
female 0.742038
male 0.188908
titanic.groupby("sex")["survived"].mean()
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
  • Pivot table
titanic.pivot_table("survived", index="sex", columns="class")
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47
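The rows or columns of a `pivot_table` can themselves be binned values; a sketch combining `pd.cut` with `pivot_table` on a tiny hand-made frame (not the titanic data):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["female", "male", "female", "male", "female", "male"],
                   "age": [5, 25, 35, 45, 15, 60],
                   "survived": [1, 0, 1, 0, 1, 1]})

# Bin age into two coarse groups and use the bins as the pivot's rows
age_bins = pd.cut(df["age"], bins=[0, 30, 80])
table = df.pivot_table("survived", index=age_bins, columns="sex", aggfunc="mean")
print(table)
```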

12.7 Other Topics

(1) Vectorized string operations

(2) Working with time series

(3) MultiIndex: for multi-dimensional data

base_data = np.array([[1771, 11115 ],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]]) 
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
                                     , [2008, 2018]*4], columns=["population", "GDP"])
data
population GDP
BeiJing 2008 1771 11115
2018 2154 30320
ShangHai 2008 2141 14070
2018 2424 32680
ShenZhen 2008 1077 7806
2018 1303 24222
HangZhou 2008 798 4789
2018 981 13468
data.index.names = ["city", "year"]
data
population GDP
city year
BeiJing 2008 1771 11115
2018 2154 30320
ShangHai 2008 2141 14070
2018 2424 32680
ShenZhen 2008 1077 7806
2018 1303 24222
HangZhou 2008 798 4789
2018 981 13468
data["GDP"]
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008    14070
2018    32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
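To slice on the inner level of a MultiIndex directly, `xs()` accepts a `level` argument; a sketch rebuilding the same city data with `pd.MultiIndex.from_product`:

```python
import numpy as np
import pandas as pd

base_data = np.array([[1771, 11115], [2154, 30320],
                      [2141, 14070], [2424, 32680],
                      [1077, 7806], [1303, 24222],
                      [798, 4789], [981, 13468]])
index = pd.MultiIndex.from_product(
    [["BeiJing", "ShangHai", "ShenZhen", "HangZhou"], [2008, 2018]],
    names=["city", "year"])
data = pd.DataFrame(base_data, index=index, columns=["population", "GDP"])

# Cross-section: every city's 2018 row, with the year level dropped
snapshot = data.xs(2018, level="year")
print(snapshot)
```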

(4) High-performance Pandas: eval()

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Reduces the memory allocated for intermediate results in compound algebraic expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
  • Column-wise operations
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
A B C
0 0.418071 0.381836 0.500556
1 0.059432 0.749066 0.302429
2 0.489147 0.739153 0.777161
3 0.175441 0.016556 0.348979
4 0.766534 0.559252 0.310635
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
A B C D
0 0.418071 0.381836 0.500556 -1.601593
1 0.059432 0.749066 0.302429 -1.159019
2 0.489147 0.739153 0.777161 -5.512052
3 0.175441 0.016556 0.348979 -0.294917
4 0.766534 0.559252 0.310635 -1.923199
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
A B C D
0 0.418071 0.381836 0.500556 -1.601593
1 0.059432 0.749066 0.302429 -1.159019
2 0.489147 0.739153 0.777161 -5.512052
3 0.175441 0.016556 0.348979 -0.294917
4 0.766534 0.559252 0.310635 -1.923199
  • Using local variables
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0    0.342788
1    0.047409
2   -0.387501
3    0.236956
4    0.694839
dtype: float64

(5) High-performance Pandas: query()

df.head()
A B C D
0 0.418071 0.381836 0.500556 -1.601593
1 0.059432 0.749066 0.302429 -1.159019
2 0.489147 0.739153 0.777161 -5.512052
3 0.175441 0.016556 0.348979 -0.294917
4 0.766534 0.559252 0.310635 -1.923199
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
A B C D
1 0.059432 0.749066 0.302429 -1.159019
2 0.489147 0.739153 0.777161 -5.512052
7 0.073950 0.730144 0.646190 -2.272672
10 0.393200 0.610467 0.697096 -3.313485
11 0.065734 0.764699 0.179380 -1.011958
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
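Like `eval()`, `query()` can reference local Python variables with the `@` prefix; a sketch (seeded random data, so the exact rows differ from the notes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=list("ABC"))

threshold = 0.5
# @threshold refers to the local variable, not a column name
res = df.query("A < @threshold and B > @threshold")
print(res.shape)
```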

(6) When to use eval() and query()

For small arrays, the ordinary approach is actually faster

df.values.nbytes
32000
df1.values.nbytes
8000000
