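Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and simpler.
A Series is a one-dimensional array of labeled data.
Creating a Series object
General form: pd.Series(data, index=index, dtype=dtype)
data: the data; may be a list, a dict, or a NumPy array
index: the index labels; optional
dtype: the data type; optional
1. Creating from a list
If index is omitted, it defaults to an integer sequence starting at 0.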
import pandas as pd
data = pd.Series([1.5, 3, 4.5, 6])
data
0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64
Adding an index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64
Specifying the data type
If dtype is omitted, it is inferred from the data passed in.
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Note: a Series can hold values of mixed types (the dtype then becomes object)
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: object
data["a"]
1
data["c"]
'3'
The dtype can be coerced explicitly; the conversion fails if a value cannot be converted
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
data["c"]
3.0
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
ValueError: could not convert string to float: 'a'
2. Creating from a one-dimensional NumPy array
import numpy as np
x = np.arange(5)
pd.Series(x)
0 0
1 1
2 2
3 3
4 4
dtype: int32
3. Creating from a dict
By default the keys become the index and the values become the data.
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64
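When creating from a dict with an explicit index, each index label is looked up among the dict keys; labels that are not found get the value NaN.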
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
population
BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64
4. When data is a scalar
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
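The construction rules above can be checked in one place. This is a minimal sketch; the variable names (s_list, s_dict, s_scalar) and the small dict are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

# From a list: default integer index, dtype inferred from the values.
s_list = pd.Series([1.5, 3, 4.5, 6])

# From a dict with an explicit index: labels missing from the dict become NaN,
# which also forces the dtype to float64.
s_dict = pd.Series({"x": 1, "y": 2}, index=["x", "z"])

# From a scalar: the value is repeated for every index label.
s_scalar = pd.Series(5, index=[100, 200, 300])

print(s_list.dtype)
print(s_dict.isna().tolist())
print(s_scalar.tolist())
```

Note that the dict key "y" is dropped because it does not appear in the requested index.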
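A DataFrame is a two-dimensional array of labeled data.
Creating a DataFrame object
General form: pd.DataFrame(data, index=index, columns=columns)
data: the data; may be a list, a dict, or a NumPy array
index: the row labels; optional
columns: the column labels; optional
1. Creating from a Series object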
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
pd.DataFrame(population)
             0
BeiJing   2154
ShangHai  2424
ShenZhen  1303
HangZhou   981
pd.DataFrame(population, columns=["population"])
          population
BeiJing         2154
ShangHai        2424
ShenZhen        1303
HangZhou         981
2. Creating from a dict of Series objects
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
pd.DataFrame({"population": population,
"GDP": GDP})
          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468
Note: a value that is too short is filled in automatically; here the scalar "China" is broadcast to every row.
pd.DataFrame({"population": population,
"GDP": GDP,
"country": "China"})
          population    GDP country
BeiJing         2154  30320   China
ShangHai        2424  32680   China
ShenZhen        1303  24222   China
HangZhou         981  13468   China
3. Creating from a list of dicts
Each dict's position in the list becomes the index, and the dict keys become the columns.
import numpy as np
import pandas as pd
data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
   a  b
0  0  0
1  1  2
2  2  4
data1 = data["a"].copy()
data1
0 0
1 1
2 2
Name: a, dtype: int64
data1[0] = 10
data1
0 10
1 1
2 2
Name: a, dtype: int64
data
   a  b
0  0  0
1  1  2
2  2  4
Keys that are missing from a dict are filled with NaN.
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)
     a  b    c
0  1.0  1  NaN
1  NaN  3  4.0
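A side effect worth noticing in the output above is that columns a and c are shown as floats. The sketch below, using the same two records, illustrates why: NaN is a float, so any integer column that receives it is upcast. The fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

records = [{"a": 1, "b": 1}, {"b": 3, "c": 4}]
frame = pd.DataFrame(records)

# Columns with a missing key hold NaN and are therefore float64;
# column b is complete and keeps an integer dtype.
print(frame.dtypes)

# One possible way back to integers: fill the gaps first, then cast.
filled = frame.fillna(0).astype(int)
print(filled)
```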
4. Creating from a two-dimensional NumPy array
data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
[2, 9],
[4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
   foo  bar
a    1    6
b    2    9
c    4    0
data = pd.DataFrame({"pop": population, "GDP": GDP})
data
           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
(1) df.values returns the data as a NumPy array
data.values
array([[ 2154, 30320],
[ 2424, 32680],
[ 1303, 24222],
[ 981, 13468]], dtype=int64)
(2) df.index returns the row index
data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
(3) df.columns returns the column index
data.columns
Index(['pop', 'GDP'], dtype='object')
(4) df.shape returns the shape
data.shape
(4, 2)
(5) df.size returns the total number of elements
data.size
8
(6) df.dtypes returns the dtype of each column
data.dtypes
pop int64
GDP int64
dtype: object
data
           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
(1) Selecting columns
Dict-style
data["pop"]
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: pop, dtype: int64
data[["GDP", "pop"]]
            GDP   pop
BeiJing   30320  2154
ShangHai  32680  2424
ShenZhen  24222  1303
HangZhou  13468   981
Attribute-style
data.GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64
(2) Selecting rows
Label-based indexing: df.loc
data.loc["BeiJing"]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]
           pop    GDP
BeiJing   2154  30320
HangZhou   981  13468
Position-based indexing: df.iloc
data.iloc[0]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]
           pop    GDP
ShangHai  2424  32680
HangZhou   981  13468
(3) Selecting a scalar
data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320
(4) Indexing a Series object
type(data.GDP)
pandas.core.series.Series
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
GDP["BeiJing"]
30320
dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
(1) Row slicing
df["2019-01-01": "2019-01-03"]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
df.loc["2019-01-01": "2019-01-03"]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
df.iloc[0: 3]
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
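The three row slices above return the same rows even though the label slice ends at "2019-01-03" and the position slice ends at 3. A small sketch of the difference, using deterministic values in place of the random ones (the names by_label and by_position are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both endpoints are included.
by_label = df.loc["2019-01-01":"2019-01-03"]

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = df.iloc[0:3]

print(len(by_label), len(by_position))
print(by_label.equals(by_position))
```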
(2) Column slicing
df.loc[:, "A": "C"]
|
A |
B |
C |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
df.iloc[:, 0: 3]
|
A |
B |
C |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
(3) Other ways of selecting values
Slicing rows and columns together
df.loc["2019-01-02": "2019-01-03", "C":"D"]
|
C |
D |
2019-01-02 |
1.080779 |
-2.294395 |
2019-01-03 |
1.102248 |
1.207726 |
df.iloc[1: 3, 2:]
|
C |
D |
2019-01-02 |
1.080779 |
-2.294395 |
2019-01-03 |
1.102248 |
1.207726 |
Slicing rows, picking individual columns
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
|
A |
C |
2019-01-04 |
0.305088 |
-0.978434 |
2019-01-05 |
0.313383 |
0.163155 |
2019-01-06 |
0.250613 |
-0.858240 |
df.iloc[3:, [0, 2]]
|
A |
C |
2019-01-04 |
0.305088 |
-0.978434 |
2019-01-05 |
0.313383 |
0.163155 |
2019-01-06 |
0.250613 |
-0.858240 |
Picking individual rows, slicing columns
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
df.iloc[[1, 5], 0: 3]
|
A |
B |
C |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
Picking individual rows and individual columns
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
df.iloc[[1, 5], [0, 3]]
|
A |
D |
2019-01-02 |
-0.234414 |
-2.294395 |
2019-01-06 |
0.250613 |
-1.573342 |
df
|
A |
B |
C |
D |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
df > 0
|
A |
B |
C |
D |
2019-01-01 |
False |
False |
True |
False |
2019-01-02 |
False |
False |
True |
False |
2019-01-03 |
False |
True |
True |
True |
2019-01-04 |
True |
True |
False |
True |
2019-01-05 |
True |
True |
True |
False |
2019-01-06 |
True |
False |
False |
False |
df[df > 0]
|
A |
B |
C |
D |
2019-01-01 |
NaN |
NaN |
0.925984 |
NaN |
2019-01-02 |
NaN |
NaN |
1.080779 |
NaN |
2019-01-03 |
NaN |
0.058118 |
1.102248 |
1.207726 |
2019-01-04 |
0.305088 |
0.535920 |
NaN |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
NaN |
2019-01-06 |
0.250613 |
NaN |
NaN |
NaN |
df.A > 0
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2019-01-05 True
2019-01-06 True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
|
A |
B |
C |
D |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
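Boolean masks such as df.A > 0 can also be combined. The following is a small sketch with made-up values; note that the element-wise operators &, | and ~ are used instead of and, or and not, and that each comparison needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3],
                   "B": [0.2, -0.4, 1.5, -0.9]})

both = df[(df["A"] > 0) & (df["B"] > 0)]    # rows where both conditions hold
either = df[(df["A"] > 0) | (df["B"] > 0)]  # rows where at least one holds
not_a = df[~(df["A"] > 0)]                  # rows where the condition fails

print(both.index.tolist())
print(either.index.tolist())
print(not_a.index.tolist())
```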
The isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
|
A |
B |
C |
D |
E |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
one |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
one |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
two |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
three |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
four |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
three |
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
df2[ind]
|
A |
B |
C |
D |
E |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
two |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
four |
Adding a new column to a DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
df["E"] = s1
df
|
A |
B |
C |
D |
E |
2019-01-01 |
-0.935378 |
-0.190742 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
Modifying values by assignment
df.loc["2019-01-01", "A"] = 0
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
-0.190742 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
df.iloc[0, 1] = 0
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
0.000000 |
0.925984 |
-0.818969 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
-2.294395 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
1.207726 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
0.177251 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
-0.296649 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
-1.573342 |
6 |
df["D"] = np.array([5]*len(df))  # can be simplified to df["D"] = 5
df
|
A |
B |
C |
D |
E |
2019-01-01 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
2019-01-02 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2019-01-03 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
2019-01-04 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
2019-01-05 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
2019-01-06 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
Modifying index and columns
df.index = [i for i in range(len(df))]
df
|
A |
B |
C |
D |
E |
0 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
1 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
3 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
4 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
5 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
df.columns = [i for i in range(df.shape[1])]
df
|
0 |
1 |
2 |
3 |
4 |
0 |
0.000000 |
0.000000 |
0.925984 |
5 |
1 |
1 |
-0.234414 |
-1.194674 |
1.080779 |
5 |
2 |
2 |
-0.141572 |
0.058118 |
1.102248 |
5 |
3 |
3 |
0.305088 |
0.535920 |
-0.978434 |
5 |
4 |
4 |
0.313383 |
0.234041 |
0.163155 |
5 |
5 |
5 |
0.250613 |
-0.904400 |
-0.858240 |
5 |
6 |
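Assigning to df.index or df.columns replaces every label at once. When only some labels need to change, DataFrame.rename is an alternative; the sketch below uses a small made-up frame, and the new labels "alpha" and "first" are purely illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 3)), columns=["A", "B", "C"])

# rename returns a new DataFrame; labels not listed in the mappings are kept.
renamed = df.rename(columns={"A": "alpha"}, index={0: "first"})

print(list(renamed.columns))
print(list(renamed.index))
print(list(df.columns))  # the original frame is unchanged
```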
import pandas as pd
import numpy as np
dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
(1) Viewing the first rows
df.head()  # 5 rows by default
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
df.head(2)
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
-0.048964 |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
(2) Viewing the last rows
df.tail()  # 5 rows by default
|
A |
B |
C |
D |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
df.tail(3)
|
A |
B |
C |
D |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
(3) Viewing summary information
df.iloc[0, 3] = np.nan
df
|
A |
B |
C |
D |
2019-01-01 |
-0.854043 |
0.412345 |
-2.296051 |
NaN |
2019-01-02 |
1.371364 |
-0.121454 |
-0.299653 |
1.095375 |
2019-01-03 |
-0.714591 |
-1.103224 |
0.979250 |
0.319455 |
2019-01-04 |
-1.397557 |
0.426008 |
0.233861 |
-1.651887 |
2019-01-05 |
0.434026 |
0.459830 |
-0.095444 |
1.220302 |
2019-01-06 |
-0.133876 |
0.074500 |
-1.028147 |
0.605402 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
(1) Vectorized operations
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
|
0 |
1 |
2 |
3 |
0 |
0 |
1 |
2 |
3 |
x+5
|
0 |
1 |
2 |
3 |
0 |
5 |
6 |
7 |
8 |
np.exp(x)
|
0 |
1 |
2 |
3 |
0 |
1.0 |
2.718282 |
7.389056 |
20.085537 |
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
|
0 |
1 |
2 |
3 |
0 |
4 |
5 |
6 |
7 |
x*y
|
0 |
1 |
2 |
3 |
0 |
0 |
5 |
12 |
21 |
(2) Matrix operations
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)))
x
|
0 |
1 |
2 |
0 |
6 |
3 |
7 |
1 |
4 |
6 |
9 |
2 |
2 |
6 |
7 |
Transpose
z = x.T
z
|
0 |
1 |
2 |
0 |
6 |
4 |
2 |
1 |
3 |
6 |
6 |
2 |
7 |
9 |
7 |
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(3, 3)))
y
|
0 |
1 |
2 |
0 |
5 |
8 |
9 |
1 |
5 |
0 |
0 |
2 |
1 |
7 |
6 |
x.dot(y)
|
0 |
1 |
2 |
0 |
52 |
97 |
96 |
1 |
59 |
95 |
90 |
2 |
47 |
65 |
60 |
%timeit x.dot(y)
133 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x, y)
40.2 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The same operation in NumPy and in Pandas
x1 = np.array(x)
x1
array([[6, 3, 7],
[4, 6, 9],
[2, 6, 7]])
y1 = np.array(y)
y1
array([[5, 8, 9],
[5, 0, 0],
[1, 7, 6]])
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Convert the NumPy arrays to nested lists of plain Python ints
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
# Matrix multiplication with pure Python loops
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In general, pure numerical computation runs faster in NumPy.
NumPy focuses on computation; Pandas focuses on data handling.
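The %timeit lines above only work inside IPython or Jupyter. Outside a notebook, a rough version of the same comparison can be sketched with the standard-library timeit module. The random matrices and the repetition count of 1000 are arbitrary choices, and the measured times depend on the machine, so no figures are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.DataFrame(rng.integers(10, size=(3, 3)))
y = pd.DataFrame(rng.integers(10, size=(3, 3)))

# Both routes must give the same product before timing them is meaningful.
same = np.array_equal(x.dot(y).to_numpy(), x.to_numpy() @ y.to_numpy())

# Total seconds for 1000 repetitions of each variant.
t_pandas = timeit.timeit(lambda: x.dot(y), number=1000)
t_numpy = timeit.timeit(lambda: x.to_numpy() @ y.to_numpy(), number=1000)

print(same)
print(f"pandas: {t_pandas:.4f} s, numpy: {t_numpy:.4f} s")
```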
(3) Broadcasting
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
|
A |
B |
C |
0 |
6 |
3 |
7 |
1 |
4 |
6 |
9 |
2 |
2 |
6 |
7 |
Broadcasting a row (every row is divided by the first row)
x.iloc[0]
A 6
B 3
C 7
Name: 0, dtype: int32
x/x.iloc[0]
|
A |
B |
C |
0 |
1.000000 |
1.0 |
1.000000 |
1 |
0.666667 |
2.0 |
1.285714 |
2 |
0.333333 |
2.0 |
1.000000 |
Broadcasting a column (every column is divided by column A)
x.A
0 6
1 4
2 2
Name: A, dtype: int32
x.div(x.A, axis=0) # add sub div mul
|
A |
B |
C |
0 |
1.0 |
0.5 |
1.166667 |
1 |
1.0 |
1.5 |
2.250000 |
2 |
1.0 |
3.0 |
3.500000 |
x.div(x.iloc[0], axis=1)
|
A |
B |
C |
0 |
1.000000 |
1.0 |
1.000000 |
1 |
0.666667 |
2.0 |
1.285714 |
2 |
0.333333 |
2.0 |
1.000000 |
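A common use of both broadcasting directions is centering data. The sketch below reuses the values of x shown above; the names col_centered and row_centered are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[6, 3, 7], [4, 6, 9], [2, 6, 7]], columns=list("ABC"))

# x.mean() is a Series indexed by column label, so plain subtraction
# aligns it with the columns and removes each column's mean.
col_centered = x - x.mean()

# Row means are indexed by row label, so axis=0 is needed to align on rows.
row_centered = x.sub(x.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```

After centering, every column of col_centered and every row of row_centered averages to zero, up to floating-point error.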
(1) Index alignment
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
|
A |
B |
0 |
3 |
7 |
1 |
2 |
1 |
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
|
A |
B |
C |
0 |
7 |
5 |
1 |
1 |
4 |
0 |
9 |
2 |
5 |
8 |
0 |
Pandas automatically aligns the indexes of the two objects; positions missing from either one are represented by np.nan.
A+B
|
A |
B |
C |
0 |
10.0 |
12.0 |
NaN |
1 |
6.0 |
1.0 |
NaN |
2 |
NaN |
NaN |
NaN |
Missing values can instead be filled with fill_value before the operation.
A.add(B, fill_value=0)
|
A |
B |
C |
0 |
10.0 |
12.0 |
1.0 |
1 |
6.0 |
1.0 |
9.0 |
2 |
5.0 |
8.0 |
0.0 |
A*B
|
A |
B |
C |
0 |
21.0 |
35.0 |
NaN |
1 |
8.0 |
0.0 |
NaN |
2 |
NaN |
NaN |
NaN |
(2) Statistics
Counting distinct values
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
|
A |
0 |
2 |
... |
... |
19 |
1 |
np.unique(y1)
array([0, 1, 2])
y1["A"].value_counts()
2 11
1 5
0 4
Name: A, dtype: int64
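value_counts can also report proportions instead of raw counts. A brief sketch using the same 20 values as the array y above:

```python
import pandas as pd

y = pd.Series([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

counts = y.value_counts()                # sorted by frequency, highest first
shares = y.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```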
Deriving a new column and sorting by it
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
|
population |
GDP |
BeiJing |
2154 |
30320 |
ShangHai |
2424 |
32680 |
ShenZhen |
1303 |
24222 |
HangZhou |
981 |
13468 |
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
|
population |
GDP |
per_GDP |
BeiJing |
2154 |
30320 |
14.076137 |
ShangHai |
2424 |
32680 |
13.481848 |
ShenZhen |
1303 |
24222 |
18.589409 |
HangZhou |
981 |
13468 |
13.728848 |
Ascending sort
city_info.sort_values(by="per_GDP")
|
population |
GDP |
per_GDP |
ShangHai |
2424 |
32680 |
13.481848 |
HangZhou |
981 |
13468 |
13.728848 |
BeiJing |
2154 |
30320 |
14.076137 |
ShenZhen |
1303 |
24222 |
18.589409 |
Descending sort
city_info.sort_values(by="per_GDP", ascending=False)
|
population |
GDP |
per_GDP |
ShenZhen |
1303 |
24222 |
18.589409 |
BeiJing |
2154 |
30320 |
14.076137 |
HangZhou |
981 |
13468 |
13.728848 |
ShangHai |
2424 |
32680 |
13.481848 |
Sorting along an axis
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
|
C |
B |
A |
D |
2 |
3 |
13 |
17 |
8 |
1 |
1 |
19 |
14 |
6 |
0 |
11 |
7 |
14 |
2 |
Sorting by row labels
data.sort_index()
|
C |
B |
A |
D |
0 |
11 |
7 |
14 |
2 |
1 |
1 |
19 |
14 |
6 |
2 |
3 |
13 |
17 |
8 |
Sorting by column labels
data.sort_index(axis=1)
|
A |
B |
C |
D |
2 |
17 |
13 |
3 |
8 |
1 |
14 |
19 |
1 |
6 |
0 |
14 |
7 |
11 |
2 |
data.sort_index(axis=1, ascending=False)
|
D |
C |
B |
A |
2 |
8 |
3 |
13 |
17 |
1 |
6 |
1 |
19 |
14 |
0 |
2 |
11 |
7 |
14 |
Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
13.113252 |
6.774559 |
2.874553 |
5.527044 |
2 |
-2.036341 |
-4.333177 |
5.094802 |
-0.152567 |
3 |
-3.386712 |
-1.522365 |
-2.522209 |
2.537716 |
4 |
4.328491 |
5.550994 |
5.577329 |
5.019991 |
5 |
1.171336 |
-0.493910 |
-4.032613 |
6.398588 |
Count of non-null values
df.count()
A 6
B 6
C 6
D 6
dtype: int64
Sum
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum(axis=1)
0 7.947086
1 28.289408
2 -1.427283
3 -4.893571
4 20.476806
5 3.043402
dtype: float64
Maximum and minimum
df.min()
A -3.386712
B -4.333177
C -4.032613
D -0.152567
dtype: float64
df.max(axis=1)
0 6.367969
1 13.113252
2 5.094802
3 2.537716
4 5.577329
5 6.398588
dtype: float64
df.idxmax()
A 1
B 1
C 4
D 5
dtype: int64
Mean
df.mean()
A 2.378704
B 1.588916
C 0.655231
D 4.283124
dtype: float64
Variance
df.var()
A 34.980702
B 19.110656
C 18.948144
D 6.726776
dtype: float64
Standard deviation
df.std()
A 5.914449
B 4.371574
C 4.352947
D 2.593603
dtype: float64
Median
df.median()
A 1.126767
B 1.531743
C 0.176172
D 5.273518
dtype: float64
Mode
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
|
A |
B |
0 |
4 |
2 |
1 |
3 |
2 |
2 |
2 |
0 |
3 |
2 |
4 |
4 |
2 |
0 |
5 |
4 |
1 |
6 |
2 |
0 |
7 |
1 |
1 |
8 |
3 |
4 |
9 |
2 |
0 |
data.mode()
|
A |
B |
0 |
2 |
0 |
75% quantile
df.quantile(0.75)
A 3.539202
B 5.052594
C 4.539740
D 6.157738
Name: 0.75, dtype: float64
Summary of all statistics
df.describe()
|
A |
B |
C |
D |
count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
2.378704 |
1.588916 |
0.655231 |
4.283124 |
std |
5.914449 |
4.371574 |
4.352947 |
2.593603 |
min |
-3.386712 |
-4.333177 |
-4.032613 |
-0.152567 |
25% |
-1.256706 |
-1.265251 |
-2.925910 |
3.158284 |
50% |
1.126767 |
1.531743 |
0.176172 |
5.273518 |
75% |
3.539202 |
5.052594 |
4.539740 |
6.157738 |
max |
13.113252 |
6.774559 |
5.577329 |
6.398588 |
data_2 = pd.DataFrame([["a", "a", "c", "d"],
["c", "a", "c", "b"],
["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
|
A |
B |
C |
D |
0 |
a |
a |
c |
d |
1 |
c |
a |
c |
b |
2 |
a |
a |
d |
c |
data_2.describe()
|
A |
B |
C |
D |
count |
3 |
3 |
3 |
3 |
unique |
2 |
1 |
2 |
3 |
top |
a |
a |
c |
d |
freq |
2 |
3 |
2 |
1 |
Correlation coefficients and covariance
df.corr()
|
A |
B |
C |
D |
A |
1.000000 |
0.831063 |
0.331060 |
0.510821 |
B |
0.831063 |
1.000000 |
0.179244 |
0.719112 |
C |
0.331060 |
0.179244 |
1.000000 |
-0.450365 |
D |
0.510821 |
0.719112 |
-0.450365 |
1.000000 |
df.corrwith(df["A"])
A 1.000000
B 0.831063
C 0.331060
D 0.510821
dtype: float64
Custom output
Usage of apply(method): by default, method is applied to each column.
df
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
13.113252 |
6.774559 |
2.874553 |
5.527044 |
2 |
-2.036341 |
-4.333177 |
5.094802 |
-0.152567 |
3 |
-3.386712 |
-1.522365 |
-2.522209 |
2.537716 |
4 |
4.328491 |
5.550994 |
5.577329 |
5.019991 |
5 |
1.171336 |
-0.493910 |
-4.032613 |
6.398588 |
df.apply(np.cumsum)
|
A |
B |
C |
D |
0 |
1.082198 |
3.557396 |
-3.060476 |
6.367969 |
1 |
14.195450 |
10.331955 |
-0.185923 |
11.895013 |
2 |
12.159109 |
5.998778 |
4.908878 |
11.742447 |
3 |
8.772397 |
4.476413 |
2.386669 |
14.280162 |
4 |
13.100888 |
10.027406 |
7.963999 |
19.300153 |
5 |
14.272224 |
9.533497 |
3.931385 |
25.698741 |
df.apply(np.cumsum, axis=1)
|
A |
B |
C |
D |
0 |
1.082198 |
4.639594 |
1.579117 |
7.947086 |
1 |
13.113252 |
19.887811 |
22.762364 |
28.289408 |
2 |
-2.036341 |
-6.369518 |
-1.274717 |
-1.427283 |
3 |
-3.386712 |
-4.909077 |
-7.431287 |
-4.893571 |
4 |
4.328491 |
9.879485 |
15.456814 |
20.476806 |
5 |
1.171336 |
0.677427 |
-3.355186 |
3.043402 |
df.apply(sum)
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A 16.499965
B 11.107736
C 9.609942
D 6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
|
A |
B |
C |
D |
Count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
2.378704 |
1.588916 |
0.655231 |
4.283124 |
max |
13.113252 |
6.774559 |
5.577329 |
6.398588 |
idxmin |
3.000000 |
2.000000 |
5.000000 |
2.000000 |
std |
5.914449 |
4.371574 |
4.352947 |
2.593603 |
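When every statistic is a built-in reduction, DataFrame.agg with a list of names is a shorter route to a similar table than a custom function. This is a sketch on a small made-up frame; idxmin is left out here to keep the result purely numeric.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 6.0, 8.0]})

# Each name becomes one row of the result; each column of df stays a column.
summary = df.agg(["count", "mean", "max", "std"])

print(summary)
```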
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
|
A |
B |
C |
0 |
1 |
NaN |
2 |
1 |
NaN |
3 |
4 |
2 |
5 |
6 |
None |
Note: when the data contains None, strings, and the like, every column becomes dtype object, which costs more memory and time than int or float.
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
|
A |
B |
C |
0 |
False |
True |
False |
1 |
True |
False |
False |
2 |
False |
False |
True |
data.notnull()
|
A |
B |
C |
0 |
True |
False |
True |
1 |
False |
True |
True |
2 |
True |
True |
False |
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
3.0 |
1 |
NaN |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
NaN |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
Note: np.nan is a special floating-point value.
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1) Dropping entire rows
data.dropna()
|
A |
B |
C |
D |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
(2) Dropping entire columns
data.dropna(axis="columns")
|
D |
0 |
3.0 |
1 |
6.0 |
2 |
9.0 |
3 |
13.0 |
data["D"] = np.nan
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
3 |
10.0 |
11.0 |
12.0 |
NaN |
data.dropna(axis="columns", how="all")
|
A |
B |
C |
0 |
1.0 |
NaN |
2.0 |
1 |
NaN |
4.0 |
5.0 |
2 |
7.0 |
8.0 |
NaN |
3 |
10.0 |
11.0 |
12.0 |
data.dropna(axis="columns", how="any")
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
data.loc[3] = np.nan
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
3 |
NaN |
NaN |
NaN |
NaN |
data.dropna(how="all")
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
NaN |
1 |
NaN |
4.0 |
5.0 |
NaN |
2 |
7.0 |
8.0 |
NaN |
NaN |
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|
A |
B |
C |
D |
0 |
1.0 |
NaN |
2.0 |
3.0 |
1 |
NaN |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
NaN |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
data.fillna(value=5)
|
A |
B |
C |
D |
0 |
1.0 |
5.0 |
2.0 |
3.0 |
1 |
5.0 |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
5.0 |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
Replacing with the mean
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
|
A |
B |
C |
D |
0 |
1.0 |
7.666667 |
2.000000 |
3.0 |
1 |
6.0 |
4.000000 |
5.000000 |
6.0 |
2 |
7.0 |
8.000000 |
6.333333 |
9.0 |
3 |
10.0 |
11.000000 |
12.000000 |
13.0 |
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
|
A |
B |
C |
D |
0 |
1.0 |
7.0 |
2.0 |
3.0 |
1 |
7.0 |
4.0 |
5.0 |
6.0 |
2 |
7.0 |
8.0 |
7.0 |
9.0 |
3 |
10.0 |
11.0 |
12.0 |
13.0 |
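Besides a constant or a mean, gaps can be filled from neighbouring rows. The sketch below uses columns A and C of the frame above; whether forward fill, backward fill, or interpolation is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": [1.0, np.nan, 7.0, 10.0],
                     "C": [2.0, 5.0, np.nan, 12.0]})

forward = data.ffill()       # copy the previous valid value downwards
backward = data.bfill()      # copy the next valid value upwards
linear = data.interpolate()  # linear interpolation between valid neighbours

print(forward)
print(backward)
print(linear)
```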
A helper function that builds a DataFrame
import pandas as pd
import numpy as np
def make_df(cols, ind):
    "Build a simple DataFrame"
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
make_df("ABC", range(3))
|
A |
B |
C |
0 |
A0 |
B0 |
C0 |
1 |
A1 |
B1 |
C1 |
2 |
A2 |
B2 |
C2 |
Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
|
A |
B |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
3 |
A3 |
B3 |
4 |
A4 |
B4 |
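Horizontal concatenation
Rows are aligned on the index, so labels present in only one frame are filled with NaN.
pd.concat([df_1, df_2], axis=1)
     A    B    A    B
1   A1   B1  NaN  NaN
2   A2   B2  NaN  NaN
3  NaN  NaN   A3   B3
4  NaN  NaN   A4   B4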
Overlapping indexes
Overlapping rows
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
|
A |
B |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
1 |
A1 |
B1 |
2 |
A2 |
B2 |
pd.concat([df_5, df_6],ignore_index=True)
|
A |
B |
0 |
A1 |
B1 |
1 |
A2 |
B2 |
2 |
A1 |
B1 |
3 |
A2 |
B2 |
Overlapping columns
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
|
A |
B |
C |
B |
C |
D |
1 |
A1 |
B1 |
C1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
B2 |
C2 |
D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
|
0 |
1 |
2 |
3 |
4 |
5 |
1 |
A1 |
B1 |
C1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
B2 |
C2 |
D2 |
Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
|
A |
B |
C |
0 |
A1 |
B1 |
C1 |
1 |
A2 |
B2 |
C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
|
A |
B |
C |
0 |
A1 |
B1 |
C1 |
1 |
A2 |
B2 |
C2 |
[Example] Merging city information
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
|
city |
pop |
0 |
BeiJing |
2154 |
1 |
HangZhou |
981 |
2 |
ShenZhen |
1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
|
city |
GDP |
0 |
BeiJing |
30320 |
1 |
ShangHai |
32680 |
2 |
HangZhou |
13468 |
city_info = pd.merge(population, GDP)
city_info
|
city |
pop |
GDP |
0 |
BeiJing |
2154 |
30320 |
1 |
HangZhou |
981 |
13468 |
city_info = pd.merge(population, GDP, how="outer")
city_info
|
city |
pop |
GDP |
0 |
BeiJing |
2154.0 |
30320.0 |
1 |
HangZhou |
981.0 |
13468.0 |
2 |
ShenZhen |
1303.0 |
NaN |
3 |
ShangHai |
NaN |
32680.0 |
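The two merges above rely on pandas finding the shared column automatically. The join key can also be named explicitly, and indicator=True records where each row came from. A sketch on the same city data (the variable names gdp, left and audit are illustrative):

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
gdp = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

# Left join: keep every city from population, NaN where GDP is unknown.
left = pd.merge(population, gdp, on="city", how="left")

# Outer join with an extra _merge column: both, left_only or right_only.
audit = pd.merge(population, gdp, on="city", how="outer", indicator=True)

print(left)
print(audit)
```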
df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
  key  data1  data2
0   A      0      2
1   B      1      2
2   C      2      8
3   C      3      3
4   B      4      4
5   A      5      8
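Lazy evaluation: groupby() on its own only returns a DataFrameGroupBy object; nothing is computed until an aggregation is applied.
df.groupby("key")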
df.groupby("key").sum()
|
data1 |
data2 |
key |
|
|
A |
5 |
10 |
B |
5 |
6 |
C |
5 |
11 |
df.groupby("key").mean()
|
data1 |
data2 |
key |
|
|
A |
2.5 |
5.0 |
B |
2.5 |
3.0 |
C |
2.5 |
5.5 |
for i in df.groupby("key"):
    print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
Selecting a column
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
Iterating over groups
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
Calling methods
df.groupby("key")["data1"].describe()
     count  mean       std  min   25%  50%   75%  max
key
A      2.0   2.5  3.535534  0.0  1.25  2.5  3.75  5.0
B      2.0   2.5  2.121320  1.0  1.75  2.5  3.25  4.0
C      2.0   2.5  0.707107  2.0  2.25  2.5  2.75  3.0
More complex operations are supported
df.groupby("key").aggregate(["min", "median", "max"])
    data1            data2
      min median max   min median max
key
A       0    2.5   5     2    5.0   8
B       1    2.5   4     2    3.0   4
C       2    2.5   3     3    5.5   8
Filtering
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
5 |
A |
5 |
8 |
The apply() method
df
|
key |
data1 |
data2 |
0 |
A |
0 |
2 |
1 |
B |
1 |
2 |
2 |
C |
2 |
8 |
3 |
C |
3 |
3 |
4 |
B |
4 |
4 |
5 |
A |
5 |
8 |
df.groupby("key").apply(lambda x: x-x.mean())
|
data1 |
data2 |
0 |
-2.5 |
-3.0 |
1 |
-1.5 |
-1.0 |
2 |
-0.5 |
2.5 |
3 |
0.5 |
-2.5 |
4 |
1.5 |
1.0 |
5 |
2.5 |
3.0 |
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
|
key |
data1 |
data2 |
0 |
A |
0.000000 |
2 |
1 |
B |
0.166667 |
2 |
2 |
C |
0.181818 |
8 |
3 |
C |
0.272727 |
3 |
4 |
B |
0.666667 |
4 |
5 |
A |
0.500000 |
8 |
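Both apply() examples above return one value per original row. For that pattern, groupby(...).transform is an alternative that keeps the original index and works on the selected columns only. The sketch below rebuilds df with the values shown above; it is one possible formulation, not the method used in these notes.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                   "data1": range(6),
                   "data2": [2, 2, 8, 3, 4, 8]})

# Subtract each group's mean; only the numeric columns are selected.
centered = df.groupby("key")[["data1", "data2"]].transform(lambda g: g - g.mean())

# Divide data1 by the per-group sum of data2, broadcast back to every row.
normalized = df["data1"] / df.groupby("key")["data2"].transform("sum")

print(centered)
print(normalized)
```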
Using a list or array as the grouping key
L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()
|
data1 |
data2 |
0 |
7 |
18 |
1 |
4 |
5 |
2 |
4 |
4 |
Using a dictionary to map index labels to groups
df2 = df.set_index("key")
df2
     data1  data2
key
A        0      2
B        1      2
C        2      8
C        3      3
B        4      4
A        5      8
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
          data1  data2
constant     10     17
first         5     10
Using any Python function
df2.groupby(str.lower).mean()
   data1  data2
a    2.5    5.0
b    2.5    3.0
c    2.5    5.5
Using a list of several valid keys
df2.groupby([str.lower, mapping]).mean()
            data1  data2
a first       2.5    5.0
b constant    2.5    3.0
c constant    2.5    5.5
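When a function or a dictionary is passed as the grouping key, pandas applies it to the index labels. The sketch below, with df2 rebuilt by hand from the values shown above, compares each form with an explicit key computed from the index; the explicit versions are illustrative equivalents, not part of the original notebook.

```python
import pandas as pd

# df2 as shown above: the "key" column has become the index.
df2 = pd.DataFrame(
    {"data1": range(6), "data2": [2, 2, 8, 3, 4, 8]},
    index=pd.Index(list("ABCCBA"), name="key"),
)
mapping = {"A": "first", "B": "constant", "C": "constant"}

# A function key is called once per index label...
by_func = df2.groupby(str.lower).mean()
# ...which matches grouping by the lower-cased index computed up front.
by_index = df2.groupby(df2.index.str.lower()).mean()

# A dict key looks each index label up in the mapping...
by_dict = df2.groupby(mapping).sum()
# ...which matches grouping by the mapped index.
by_mapped = df2.groupby(df2.index.map(mapping)).sum()

print(by_func)
print(by_dict)
```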
[Example 1] Processing planet observation data
import seaborn as sns
import pandas as pd
# planets = sns.load_dataset("planets")
planets = pd.read_csv("data/planets.csv")  # read the local CSV file
planets.shape
(1035, 6)
planets.head()
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
planets.describe()
            number  orbital_period        mass     distance         year
count  1035.000000      992.000000  513.000000   808.000000  1035.000000
mean      1.785507     2002.917596    2.638161   264.069282  2009.070531
std       1.240976    26014.728304    3.818617   733.116493     3.972567
min       1.000000        0.090706    0.003600     1.350000  1989.000000
25%       1.000000        5.442540    0.229000    32.560000  2007.000000
50%       1.000000       39.979500    1.260000    55.250000  2010.000000
75%       2.000000      526.005000    3.040000   178.500000  2012.000000
max       7.000000   730000.000000   25.000000  8500.000000  2014.000000
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
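The decade label comes from integer arithmetic: floor division by 10 drops the last digit of the year, and multiplying by 10 restores the magnitude. A minimal sketch of that step, using only the five year values visible in planets.head() above in place of the full dataset:

```python
import pandas as pd

# The five years from planets.head(); a stand-in for planets["year"].
year = pd.Series([2006, 2008, 2011, 2007, 2009], name="year")

# 2006 // 10 == 200, and 200 * 10 == 2000, so every year maps to the start of its decade.
decade = 10 * (year // 10)

# Turn the number into a readable label such as "2000s".
decade = decade.astype(str) + "s"
decade.name = "decade"

print(decade.tolist())
```

Because the result is a Series aligned with the original rows, it can be passed directly to groupby() next to a column name, as the following cell does.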
planets.groupby(["method", decade]).sum()
                                      number  orbital_period       mass   distance    year
method                        decade
Astrometry                    2010s        2    1.262360e+03    0.00000      35.75    4023
Eclipse Timing Variations     2000s        5    1.930800e+04    6.05000     261.44    6025
                              2010s       10    2.345680e+04    4.20000    1000.00   12065
Imaging                       2000s       29    1.350935e+06    0.00000     956.83   40139
                              2010s       21    6.803750e+04    0.00000    1210.08   36208
Microlensing                  2000s       12    1.732500e+04    0.00000       0.00   20070
                              2010s       15    4.750000e+03    0.00000   41440.00   26155
Orbital Brightness Modulation 2010s        5    2.127920e+00    0.00000    2360.00    6035
Pulsar Timing                 1990s        9    1.900153e+02    0.00000       0.00    5978
                              2000s        1    3.652500e+04    0.00000       0.00    2003
                              2010s        1    9.070629e-02    0.00000    1200.00    2011
Pulsation Timing Variations   2000s        1    1.170000e+03    0.00000       0.00    2007
Radial Velocity               1980s        1    8.388800e+01   11.68000      40.57    1989
                              1990s       52    1.091561e+04   68.17820     723.71   55943
                              2000s      475    2.633526e+05  945.31928   15201.16  619775
                              2010s      424    1.809630e+05  316.47890   11382.67  432451
Transit                       2000s       64    2.897102e+02    0.00000   31823.31  124462
                              2010s      712    8.087813e+03    1.47000  102419.46  673999
Transit Timing Variations     2010s        9    2.393505e+02    0.00000    3313.00    8050
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
                              number
decade                         1980s 1990s  2000s  2010s
method
Astrometry                       0.0   0.0    0.0    2.0
Eclipse Timing Variations        0.0   0.0    5.0   10.0
Imaging                          0.0   0.0   29.0   21.0
Microlensing                     0.0   0.0   12.0   15.0
Orbital Brightness Modulation    0.0   0.0    0.0    5.0
Pulsar Timing                    0.0   9.0    1.0    1.0
Pulsation Timing Variations      0.0   0.0    1.0    0.0
Radial Velocity                  1.0  52.0  475.0  424.0
Transit                          0.0   0.0   64.0  712.0
Transit Timing Variations        0.0   0.0    0.0    9.0
[Example 2] Analyzing Titanic passenger data
import seaborn as sns
# titanic = sns.load_dataset("titanic")
titanic = pd.read_csv("data/titanic.csv")  # read the local CSV file
titanic.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
titanic.describe()
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
titanic.groupby("sex")[["survived"]].mean()
        survived
sex
female  0.742038
male    0.188908
titanic.groupby("sex")["survived"].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
Pivot tables
titanic.pivot_table("survived", index="sex", columns="class")
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)
class      First    Second     Third       All
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
              fare                       survived
class        First     Second      Third    First Second Third
sex
female  106.125798  21.970121  16.118810       91     70    72
male     67.226127  19.741782  12.661633       45     17    47
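The pivot tables above produce the same numbers as the earlier groupby(...).unstack() call, because pivot_table() aggregates with the mean by default. A minimal sketch of that equivalence on a six-row toy frame (the rows are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the titanic frame.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "class": ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 0, 0, 1, 1],
})

# Two-key groupby, then move the inner key ("class") into the columns.
via_groupby = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# The same table in one call; the default aggfunc is the mean.
via_pivot = toy.pivot_table("survived", index="sex", columns="class")

print(via_pivot)
```

The groupby form is more flexible once the aggregation gets unusual, while pivot_table() reads closer to the shape of the result.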
(1) Multi-level indexing: for multidimensional data
base_data = np.array([[1771, 11115 ],
[2154, 30320],
[2141, 14070],
[2424, 32680],
[1077, 7806],
[1303, 24222],
[798, 4789],
[981, 13468]])
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"], [2008, 2018]*4], columns=["population", "GDP"])
data
               population    GDP
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
data.index.names = ["city", "year"]
data
               population    GDP
city     year
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
data["GDP"]
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
data.loc[("ShangHai", 2018), "GDP"]
32680
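With a two-level index, a row is addressed by a tuple of labels, and xs() takes a cross-section on one level. The sketch below rebuilds a two-city subset of the table above with MultiIndex.from_product, which is a choice made for this example; the notebook itself passes nested lists as the index.

```python
import pandas as pd

# Two cities from the table above, indexed by (city, year).
index = pd.MultiIndex.from_product(
    [["BeiJing", "ShangHai"], [2008, 2018]], names=["city", "year"]
)
data = pd.DataFrame(
    {"population": [1771, 2154, 2141, 2424], "GDP": [11115, 30320, 14070, 32680]},
    index=index,
)

# One cell: the row label is the tuple (city, year), the column label is "GDP".
gdp_sh_2018 = data.loc[("ShangHai", 2018), "GDP"]

# All cities for one year: a cross-section on the inner level.
all_2018 = data.xs(2018, level="year")

print(gdp_sh_2018)
print(all_2018)
```

Selecting on the inner level with xs() avoids reordering the index levels just to slice by year.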
(2) High-performance Pandas: eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
eval() reduces the memory allocated for intermediate results when evaluating compound algebraic expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
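pd.eval() takes the whole expression as a string and looks up the variable names in the calling scope. A small sketch of the same comparison on much smaller frames; the sizes and the seeded generator are assumptions made to keep the example quick and repeatable, and at this size no speedup should be expected.

```python
import numpy as np
import pandas as pd

# Small frames with a fixed seed; the notebook uses 10000 x 100 unseeded frames.
rng = np.random.default_rng(0)
df1, df2, df3, df4 = (pd.DataFrame(rng.random((100, 5))) for _ in range(4))

# Ordinary evaluation allocates a temporary for each intermediate sum.
plain = (df1 + df2) / (df3 + df4)

# pd.eval parses the string and resolves df1..df4 from the surrounding scope.
evaluated = pd.eval("(df1 + df2) / (df3 + df4)")

print(np.allclose(plain, evaluated))
```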
(3) High-performance Pandas: query()
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
          A         B         C
0  0.418071  0.381836  0.500556
1  0.059432  0.749066  0.302429
2  0.489147  0.739153  0.777161
3  0.175441  0.016556  0.348979
4  0.766534  0.559252  0.310635
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
          A         B         C         D
0  0.418071  0.381836  0.500556 -1.601593
1  0.059432  0.749066  0.302429 -1.159019
2  0.489147  0.739153  0.777161 -5.512052
3  0.175441  0.016556  0.348979 -0.294917
4  0.766534  0.559252  0.310635 -1.923199
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
           A         B         C         D
1   0.059432  0.749066  0.302429 -1.159019
2   0.489147  0.739153  0.777161 -5.512052
7   0.073950  0.730144  0.646190 -2.272672
10  0.393200  0.610467  0.697096 -3.313485
11  0.065734  0.764699  0.179380 -1.011958
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
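query() accepts the same condition as a string and refers to columns by bare name. The sketch below repeats the comparison with a seeded frame (the seed is an assumption for repeatability) and also shows the @ prefix, a pandas feature not used in the cells above, for referring to a local Python variable inside the query string.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=list("ABC"))

# Boolean-mask form: every comparison names the frame explicitly.
masked = df[(df.A < 0.5) & (df.B > 0.5)]

# query form: column names are resolved inside the string.
queried = df.query("(A < 0.5) & (B > 0.5)")

# A local variable can be referenced with the @ prefix.
threshold = 0.5
with_local = df.query("(A < @threshold) & (B > @threshold)")

print(masked.equals(queried), masked.equals(with_local))
```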
(4) When to use eval() and query()
For small arrays, the ordinary approach is actually faster
df.values.nbytes
32000
df1.values.nbytes
8000000
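Exercises: create a DataFrame and perform the following operations:
1. Create a 30×6 DataFrame whose elements are uniformly distributed random integers between 70 and 100. The row labels run in order from 030201 (Grade 9, Class 2, student No. 1) to 030230, and the column labels are 语文 (Chinese), 数学 (math), 英语 (English), 物理 (physics), 化学 (chemistry), and 计算机 (computer science).
2. Output its raw data, row labels, column labels, shape, size, and data types.
3. Get the math scores of the whole class, and get all scores of the student with ID 030205.
4. Add a new column 总成绩 (total score), build a copy sorted by total score in descending order (a copy, not a view), and slice out all scores of the top ten students.
5. Create a DataFrame (call it B) whose row labels match those of the DataFrame above (call it A), whose single column label is 性别 (gender), and whose data is a random gender for each of the 30 students. Concatenate A and B horizontally to get a new DataFrame (call it C).
6. Output the info and describe summaries of C, then try defining a custom my_describe that outputs the statistics you are interested in.
7. Group by gender and compare the mean of every subject and of the total score between male and female students.
DataFrame operations
8. Download the titanic dataset and perform the following operations: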
import seaborn as sns
# titanic = sns.load_dataset("titanic")
titanic = pd.read_csv("data/titanic.csv")  # read the local CSV file
titanic.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
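(1) Get a copy of the data with every row that has no age value removed.
(2) Create a Series named Age whose data comes from mapping the ages in the dataset by the rule below (see how decade was handled in the planets example):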
If <10 : “0s”
elif <20 : “10s”
…
elif <60 : “50s”
…
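(3) Group the titanic dataset by sex and Age to get the survival rate for each sex and age band, using both groupby and pivot_table (if using Age directly does not work, try a different approach).
Answers: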
1.
import pandas as pd
import numpy as np
classes = ["0302{:02d}".format(i) for i in range(1, 31)]  # zero-pad the student number to two digits
classes
['030201',
'030202',
...
'030230']
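subjects = ["语文", "数学", "英语", "物理", "化学", "计算机"]
# randint excludes its upper bound, so 101 is needed for scores of 100 to be possible
df = pd.DataFrame(np.random.randint(70, 101, (30, 6)), index=classes, columns=subjects)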
df
        语文  数学  英语  物理  化学  计算机
030201  91  74  85  92  84   78
030202  71  72  75  93  97   99
...     ..  ..  ..  ..  ..  ...
030230  74  86  90  85  80   82
2.
print("row_index : {}, \ncol_index : {}, \nshape : {}, \nsize : {}, \ndtype : {}".format(df.index, df.columns,
df.shape, df.size,
df.dtypes))
row_index : Index(['030201', '030202', '030203', '030204', '030205', '030206', '030207',
'030208', '030209', '030210', '030211', '030212', '030213', '030214',
'030215', '030216', '030217', '030218', '030219', '030220', '030221',
'030222', '030223', '030224', '030225', '030226', '030227', '030228',
'030229', '030230'],
dtype='object'),
col_index : Index(['语文', '数学', '英语', '物理', '化学', '计算机'], dtype='object'),
shape : (30, 6),
size : 180,
dtype : 语文 int32
数学 int32
英语 int32
物理 int32
化学 int32
计算机 int32
dtype: object
3.
print("math_score = \n{}, \n030205_score = \n{}".format(df["数学"], df.loc["030205"]))
math_score =
030201 74
030202 72
...
030230 86
Name: 数学, dtype: int32,
030205_score =
语文 76
数学 96
英语 77
物理 87
化学 82
计算机 93
Name: 030205, dtype: int32
4.
df_cp = df.copy()
df_cp['总成绩'] = df_cp.sum(axis=1)  # row-wise sum of the six subject scores
df_cp
        语文  数学  英语  物理  化学  计算机  总成绩
030201  91  74  85  92  84   78  504
030202  71  72  75  93  97   99  507
...     ..  ..  ..  ..  ..  ...  ...
030230  74  86  90  85  80   82  497
# Copy sorted by total score in descending order; slice out all scores of the top ten students
df2 = df_cp.sort_values(by=['总成绩'], ascending=False)
# print(df2)
print(df2.iloc[0: 10, :])
语文 数学 英语 物理 化学 计算机 总成绩
030208 80 93 92 85 98 98 546
030203 95 95 81 80 97 90 538
030216 72 97 92 83 95 97 536
030220 89 96 80 83 93 90 531
030222 81 96 97 81 73 98 526
030226 89 99 90 85 86 76 525
030211 88 94 95 73 83 91 524
030213 98 95 76 85 94 75 523
030207 91 86 81 98 87 76 519
030210 76 99 82 88 95 78 518
5.
gender = ['男' if np.random.random() < 0.5 else '女' for i in range(30)]
gender
['男',
'男',
...
'女']
df2 = pd.DataFrame(gender, index=classes, columns=["性别"])
print(df2)
性别
030201 男
030202 男
...
030230 女
c = pd.concat([df_cp, df2], axis=1)
print(c)
语文 数学 英语 物理 化学 计算机 总成绩 性别
030201 91 74 85 92 84 78 504 男
030202 71 72 75 93 97 99 507 男
...
030230 74 86 90 85 80 82 497 女
6.
c.info()  # info() prints its report directly and returns None, so it is not passed to format()
print("C_describe=\n{}".format(c.describe()))
Index: 30 entries, 030201 to 030230
Data columns (total 8 columns):
语文 30 non-null int32
数学 30 non-null int32
...
总成绩
count 30.000000
...
max 546.000000
def my_describe(x):
    return pd.Series([x.count()], index=["Count"])
c.apply(my_describe)
       语文  数学  英语  物理  化学  计算机  总成绩  性别
Count  30  30  30  30  30   30   30  30
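The my_describe above reports only a count. A slightly richer version could return several statistics at once; the sketch below is one possibility, applied to a hypothetical three-row frame with English column names (the numbers are the first few math and English scores shown earlier, used purely as sample input). The choice of statistics is an assumption, not something the exercise prescribes.

```python
import pandas as pd

# Hypothetical sample scores; the real exercise would pass the numeric columns of c.
scores = pd.DataFrame({"math": [74, 72, 86], "english": [85, 75, 90]})

def my_describe(col):
    # One labelled statistic per row of the result.
    return pd.Series({
        "count": col.count(),
        "mean": col.mean(),
        "range": col.max() - col.min(),
    })

# apply() calls my_describe once per column and stacks the returned Series side by side.
summary = scores.apply(my_describe)

print(summary)
```

Statistics such as the mean only make sense for numeric columns, so on the frame C the 性别 column would have to be left out before applying a function like this.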
7.
c.groupby('性别').sum()
      语文    数学    英语    物理    化学   计算机   总成绩
性别
女   1189  1223  1155  1188  1220  1185  7160
男   1305  1353  1376  1356  1403  1345  8138
c.groupby('性别').mean()
           语文         数学    英语         物理         化学        计算机         总成绩
性别
女   84.928571  87.357143  82.5  84.857143  87.142857  84.642857  511.428571
男   81.562500  84.562500  86.0  84.750000  87.687500  84.062500  508.625000
8.(1)
# dropna(subset=["age"]) removes the rows whose age is missing;
# drop(columns="age") would remove the age column instead, which is not what the exercise asks for
data_no_age = titanic.dropna(subset=["age"]).copy()
data_no_age.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
(2)
Age = titanic[titanic['age'].notnull()]
Age = 10 * (Age['age'] // 10)
Age = Age.astype(int)
Age = Age.astype(str) + 's'
print(Age)
0 20s
1 30s
...
890 30s
Name: age, Length: 714, dtype: object
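The same age bands can also be produced with pd.cut, which makes the bin edges explicit. This is a sketch of an alternative, not the approach the exercise asks for; the six sample ages are picked from values that appear in the outputs above (including the minimum 0.42 and maximum 80.0), and the names edges and names are introduced here for illustration.

```python
import pandas as pd

# Sample ages; a stand-in for the non-null titanic["age"] column.
age = pd.Series([22.0, 38.0, 26.0, 35.0, 0.42, 80.0], name="age")

# Floor-division approach, as in the answer above.
by_floor = (10 * (age // 10)).astype(int).astype(str) + "s"

# pd.cut approach: left-closed bins [0, 10), [10, 20), ..., [80, 90).
edges = list(range(0, 100, 10))
names = ["{}s".format(lo) for lo in edges[:-1]]
by_cut = pd.cut(age, bins=edges, right=False, labels=names)

print(by_cut.astype(str).tolist())
```

pd.cut returns an ordered categorical, so the bands keep their numeric order in group-by results, whereas plain strings sort alphabetically.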
(3)
t = titanic[titanic.age.notnull()]
t.groupby(["sex", Age])["survived"].mean().unstack()
age           0s       10s       20s       30s       40s       50s       60s  70s  80s
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000  NaN  NaN
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.133333  0.0  1.0
t = t.assign(age=Age)  # assign() returns a new DataFrame, so the filtered slice of titanic is not modified in place
t.pivot_table("survived", index="sex", columns="age")
age           0s       10s       20s       30s       40s       50s       60s  70s  80s
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000  NaN  NaN
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.133333  0.0  1.0