The previous article, pandas Overview, introduced pandas: what it is, what its strengths are, what it can do, and in which fields it can be applied.
This article takes a whirlwind tour of pandas, demonstrating its features and operations through examples. These examples cover most of pandas' core functionality and give a direct feel for how powerful the library is.
First, following the Python community's naming conventions for commonly used modules, import numpy and pandas as follows:
import numpy as np
import pandas as pd
A Series can be thought of as a one-dimensional vector with an index.
Pass a list to create a Series; pandas generates a default integer index:
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
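As a small supplementary sketch (not part of the original transcript), an explicit index turns the Series into a labeled vector, so values can be looked up by label as well as by position:

```python
import pandas as pd

# An explicit index labels each value; elements can then be
# fetched by label or, via iloc, by integer position.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s["y"])     # 20
print(s.iloc[0])  # 10
```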
A DataFrame is a two-dimensional table of data, similar to a table in a relational database (such as MySQL); each column can hold a different type of value (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and can be viewed as a dict of Series objects sharing a common index.
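To illustrate the "dict of Series" view, here is a small sketch (my addition, not from the original tour): Series with different indexes are aligned on the union of their labels, and gaps are filled with NaN.

```python
import pandas as pd

# Two Series with partially overlapping indexes; the DataFrame
# aligns them on the union of the labels and fills gaps with NaN.
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df)
# Row "d" has no value in column "one", so it shows up there as NaN.
```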
Create a DataFrame from a NumPy array, specifying the index and column names:
In [5]: df = pd.DataFrame(np.random.randn(6, 4), index=list("abcdef"), columns=list("ABCD"))
In [6]: df
Out[6]:
A B C D
a 0.137850 -1.056618 -0.651227 0.517188
b -0.860666 1.304139 0.488719 -0.230823
c 1.333085 -2.825132 -0.592092 0.999223
d 0.068219 -0.625945 0.316369 0.003051
e -1.884551 0.313736 0.090904 -0.587071
f 0.230159 -0.305100 0.243703 0.006146
Create a DataFrame by passing a dict:
In [7]: df2 = pd.DataFrame(
...: {
...: "A": 1.0,
...: "B": pd.Timestamp("20220914"),
...: "C": pd.Series(1, index=list(range(4)), dtype="float32"),
...: "D": np.array([3] * 4, dtype="int32"),
...: "E": pd.Categorical(["test", "train", "test", "train"]),
...: "F": "foo",
...: }
...: )
In [8]: df2
Out[8]:
A B C D E F
0 1.0 2022-09-14 1.0 3 test foo
1 1.0 2022-09-14 1.0 3 train foo
2 1.0 2022-09-14 1.0 3 test foo
3 1.0 2022-09-14 1.0 3 train foo
The resulting DataFrame has columns of different data types (heterogeneous columns; a DataFrame may mix dtypes across its columns):
In [9]: df2.dtypes
Out[9]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
View the first (top) and last (bottom) rows of a DataFrame:
# Shows 5 rows by default
In [10]: df.head()
Out[10]:
A B C D
a 0.137850 -1.056618 -0.651227 0.517188
b -0.860666 1.304139 0.488719 -0.230823
c 1.333085 -2.825132 -0.592092 0.999223
d 0.068219 -0.625945 0.316369 0.003051
e -1.884551 0.313736 0.090904 -0.587071
# An integer argument specifies the number of rows to show
In [11]: df.tail(3)
Out[11]:
A B C D
d 0.068219 -0.625945 0.316369 0.003051
e -1.884551 0.313736 0.090904 -0.587071
f 0.230159 -0.305100 0.243703 0.006146
View the index and column names:
In [12]: df.index
Out[12]: Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
In [13]: df.columns
Out[13]: Index(['A', 'B', 'C', 'D'], dtype='object')
View a DataFrame's metadata:
In [14]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null float64
1 B 4 non-null datetime64[ns]
2 C 4 non-null float32
3 D 4 non-null int32
4 E 4 non-null category
5 F 4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes
Show a quick statistical summary of the data:
In [15]: df.describe()
Out[15]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.162651 -0.532487 -0.017271 0.117953
std 1.094284 1.391450 0.485689 0.562216
min -1.884551 -2.825132 -0.651227 -0.587071
25% -0.628445 -0.948950 -0.421343 -0.172354
50% 0.103035 -0.465522 0.167303 0.004599
75% 0.207082 0.159027 0.298202 0.389428
max 1.333085 1.304139 0.488719 0.999223
Transpose:
In [16]: df.T
Out[16]:
a b c d e f
A 0.137850 -0.860666 1.333085 0.068219 -1.884551 0.230159
B -1.056618 1.304139 -2.825132 -0.625945 0.313736 -0.305100
C -0.651227 0.488719 -0.592092 0.316369 0.090904 0.243703
D 0.517188 -0.230823 0.999223 0.003051 -0.587071 0.006146
Sort by index:
# Specify the axis and the order (axis 1, i.e. the columns, descending)
In [17]: df.sort_index(axis=1, ascending=False)
Out[17]:
D C B A
a 0.517188 -0.651227 -1.056618 0.137850
b -0.230823 0.488719 1.304139 -0.860666
c 0.999223 -0.592092 -2.825132 1.333085
d 0.003051 0.316369 -0.625945 0.068219
e -0.587071 0.090904 0.313736 -1.884551
f 0.006146 0.243703 -0.305100 0.230159
Sort by the values of one or more columns:
In [18]: df.sort_values(by='B')
Out[18]:
A B C D
c 1.333085 -2.825132 -0.592092 0.999223
a 0.137850 -1.056618 -0.651227 0.517188
d 0.068219 -0.625945 0.316369 0.003051
f 0.230159 -0.305100 0.243703 0.006146
e -1.884551 0.313736 0.090904 -0.587071
b -0.860666 1.304139 0.488719 -0.230823
While standard Python/NumPy expressions for selecting and setting values (i.e. indexing with []) are intuitive and convenient for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc, and .iloc.
Select a single column by name, returning a Series:
In [19]: df['A'] # equivalent to df.A
Out[19]:
a 0.137850
b -0.860666
c 1.333085
d 0.068219
e -1.884551
f 0.230159
Name: A, dtype: float64
Slice rows with []:
In [20]: df[0:3]
Out[20]:
A B C D
a 0.137850 -1.056618 -0.651227 0.517188
b -0.860666 1.304139 0.488719 -0.230823
c 1.333085 -2.825132 -0.592092 0.999223
Select a row by label:
In [21]: df.loc['a']
Out[21]:
A 0.137850
B -1.056618
C -0.651227
D 0.517188
Name: a, dtype: float64
Index rows and columns at the same time:
# Take columns A and B
In [22]: df.loc[:, ["A", "B"]]
Out[22]:
A B
a 0.137850 -1.056618
b -0.860666 1.304139
c 1.333085 -2.825132
d 0.068219 -0.625945
e -1.884551 0.313736
f 0.230159 -0.305100
In [23]: df.loc['a':'c', ["A", "B"]]
Out[23]:
A B
a 0.137850 -1.056618
b -0.860666 1.304139
c 1.333085 -2.825132
Get a scalar value:
In [24]: %timeit df.loc['a','C']
5.55 µs ± 45 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Same result as above, but faster
In [25]: %timeit df.at['a','C']
2.92 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Select by passing integer positions:
In [26]: df.iloc[3]
Out[26]:
A 0.068219
B -0.625945
C 0.316369
D 0.003051
Name: d, dtype: float64
Select with NumPy/Python-style integer slices:
In [27]: df.iloc[3:5, 0:2]
Out[27]:
A B
d 0.068219 -0.625945
e -1.884551 0.313736
Select with lists of integer positions:
In [28]: df.iloc[[1, 2, 4], [0, 2]]
Out[28]:
A C
b -0.860666 0.488719
c 1.333085 -0.592092
e -1.884551 0.090904
Get a scalar value:
In [29]: %timeit df.iloc[1, 1]
11.7 µs ± 99.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Same result as above, but faster
In [30]: %timeit df.iat[1, 1]
8.64 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Select data using a column's values:
# Select the rows where column A is greater than 0
In [31]: df[df["A"] > 0]
Out[31]:
A B C D
a 0.137850 -1.056618 -0.651227 0.517188
c 1.333085 -2.825132 -0.592092 0.999223
d 0.068219 -0.625945 0.316369 0.003051
f 0.230159 -0.305100 0.243703 0.006146
Select values using a boolean DataFrame:
# Keep only the values where the mask is True; positions where it is False become NaN
In [32]: df[df > 0]
Out[32]:
A B C D
a 0.137850 NaN NaN 0.517188
b NaN 1.304139 0.488719 NaN
c 1.333085 NaN NaN 0.999223
d 0.068219 NaN 0.316369 0.003051
e NaN 0.313736 0.090904 NaN
f 0.230159 NaN 0.243703 0.006146
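As a side note (my addition, not in the original transcript), df[df > 0] performs the same operation as DataFrame.where(), which additionally accepts a replacement value instead of NaN:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -1.0], "B": [-2.0, 2.0]})
# where() keeps values where the mask is True; the rest become
# NaN by default, or the given replacement value.
print(df.where(df > 0))
print(df.where(df > 0, other=0))
```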
Filter with the isin() method:
In [33]: df3 = df.copy()
In [34]: df3["E"] = ["one", "one", "two", "three", "four", "three"]
In [35]: df3
Out[35]:
A B C D E
a 0.137850 -1.056618 -0.651227 0.517188 one
b -0.860666 1.304139 0.488719 -0.230823 one
c 1.333085 -2.825132 -0.592092 0.999223 two
d 0.068219 -0.625945 0.316369 0.003051 three
e -1.884551 0.313736 0.090904 -0.587071 four
f 0.230159 -0.305100 0.243703 0.006146 three
# Keep the rows whose column E is one of the given values
In [36]: df3[df3["E"].isin(["two", "four"])]
Out[36]:
A B C D E
c 1.333085 -2.825132 -0.592092 0.999223 two
e -1.884551 0.313736 0.090904 -0.587071 four
Set a value by label:
# Set the value at row 'a', column 'A' to 0
In [37]: df.at['a', 'A'] = 0
In [38]: df
Out[38]:
A B C D
a 0.000000 -1.056618 -0.651227 0.517188
b -0.860666 1.304139 0.488719 -0.230823
c 1.333085 -2.825132 -0.592092 0.999223
d 0.068219 -0.625945 0.316369 0.003051
e -1.884551 0.313736 0.090904 -0.587071
f 0.230159 -0.305100 0.243703 0.006146
Set a value by position:
# Set the value at row 0, column 1 to 1
In [39]: df.iat[0, 1] = 1
In [40]: df
Out[40]:
A B C D
a 0.000000 1.000000 -0.651227 0.517188
b -0.860666 1.304139 0.488719 -0.230823
c 1.333085 -2.825132 -0.592092 0.999223
d 0.068219 -0.625945 0.316369 0.003051
e -1.884551 0.313736 0.090904 -0.587071
f 0.230159 -0.305100 0.243703 0.006146
Set values by assigning a NumPy array:
# Set the values of column D
In [41]: df.loc[:, "D"] = np.array([5] * len(df))
In [42]: df
Out[42]:
A B C D
a 0.000000 1.000000 -0.651227 5
b -0.860666 1.304139 0.488719 5
c 1.333085 -2.825132 -0.592092 5
d 0.068219 -0.625945 0.316369 5
e -1.884551 0.313736 0.090904 5
f 0.230159 -0.305100 0.243703 5
Set values with a boolean condition:
In [43]: df3 = df.copy()
# Set the values at positions greater than 0
In [44]: df3[df3 > 0] = -df3
In [45]: df3
Out[45]:
A B C D
a 0.000000 -1.000000 -0.651227 -5
b -0.860666 -1.304139 -0.488719 -5
c -1.333085 -2.825132 -0.592092 -5
d -0.068219 -0.625945 -0.316369 -5
e -1.884551 -0.313736 -0.090904 -5
f -0.230159 -0.305100 -0.243703 -5
pandas primarily uses the value np.nan to represent missing data. By default, it is not included in computations.
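A quick sketch of that default behavior (my addition): reductions skip NaN unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())              # NaN skipped: 4.0
print(s.mean())             # mean over the 2 valid values: 2.0
print(s.sum(skipna=False))  # NaN propagates: nan
```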
In [46]: df3.loc['a':'b', 'E'] = 1
In [47]: df3
Out[47]:
A B C D E
a 0.000000 -1.000000 -0.651227 -5 1.0
b -0.860666 -1.304139 -0.488719 -5 1.0
c -1.333085 -2.825132 -0.592092 -5 NaN
d -0.068219 -0.625945 -0.316369 -5 NaN
e -1.884551 -0.313736 -0.090904 -5 NaN
f -0.230159 -0.305100 -0.243703 -5 NaN
Drop rows that contain missing values:
In [49]: df3.dropna(how='any')
Out[49]:
A B C D E
a 0.000000 -1.000000 -0.651227 -5 1.0
b -0.860666 -1.304139 -0.488719 -5 1.0
Fill missing values:
In [50]: df3.fillna(value=5)
Out[50]:
A B C D E
a 0.000000 -1.000000 -0.651227 -5 1.0
b -0.860666 -1.304139 -0.488719 -5 1.0
c -1.333085 -2.825132 -0.592092 -5 5.0
d -0.068219 -0.625945 -0.316369 -5 5.0
e -1.884551 -0.313736 -0.090904 -5 5.0
f -0.230159 -0.305100 -0.243703 -5 5.0
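Besides a constant, gaps can also be filled from neighboring observations; a small sketch (my addition) using ffill() and bfill():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill())  # propagate the last valid value forward
print(s.bfill())  # propagate the next valid value backward
```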
Test whether values are NaN:
In [51]: df3.isna()
Out[51]:
A B C D E
a False False False False False
b False False False False False
c False False False False True
d False False False False True
e False False False False True
f False False False False True
In [52]: pd.isna(df3)
Out[52]:
A B C D E
a False False False False False
b False False False False False
c False False False False True
d False False False False True
e False False False False True
f False False False False True
As noted, missing data is generally excluded from statistical computations.
Compute the mean:
# Mean of each column
In [53]: df.mean()
Out[53]:
A -0.185626
B -0.189717
C -0.017271
D 5.000000
dtype: float64
Compute the mean along the other axis:
# Mean of each row
In [54]: df.mean(axis=1)
Out[54]:
a 1.337193
b 1.483048
c 0.728965
d 1.189661
e 0.880022
f 1.292191
dtype: float64
Similarly, df also supports summing (sum), standard deviation (std), counting (count), minimum (min), maximum (max), and more.
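A quick sketch of those reductions (my addition), on a small frame with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 5.0, 6.0]})
print(df.sum())    # per-column sums, NaN skipped
print(df.count())  # per-column counts of non-NaN values
print(df.min())    # per-column minimums
print(df.max())    # per-column maximums
```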
Apply functions to the data:
In [55]: df.apply(np.cumsum)
Out[55]:
A B C D
a 0.000000 1.000000 -0.651227 5
b -0.860666 2.304139 -0.162508 10
c 0.472418 -0.520993 -0.754600 15
d 0.540638 -1.146938 -0.438231 20
e -1.343913 -0.833202 -0.347327 25
f -1.113754 -1.138302 -0.103625 30
In [56]: df.apply(lambda x: x.max() - x.min())
Out[56]:
A 3.217636
B 4.129271
C 1.139947
D 0.000000
dtype: float64
In [57]: s = pd.Series(np.random.randint(0, 7, size=10))
In [58]: s
Out[58]:
0 5
1 0
2 1
3 6
4 2
5 2
6 5
7 2
8 2
9 1
dtype: int32
# Count the occurrences of each value
In [59]: s.value_counts()
Out[59]:
2 4
5 2
1 2
0 1
6 1
dtype: int64
A Series is equipped with a set of string-processing methods in its str attribute that make it easy to operate on each element:
In [60]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
In [61]: s.str.lower()
Out[61]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Concatenate pandas objects with concat():
In [62]: df = pd.DataFrame(np.random.randn(10, 4))
In [63]: df
Out[63]:
0 1 2 3
0 -0.917562 -0.678779 -0.223747 -1.137378
1 -0.284037 0.154714 -0.539917 1.080102
2 -1.024852 0.034169 -2.593824 -2.404386
3 -1.420573 -1.408281 1.015019 0.454068
4 0.729644 -0.807320 2.046540 -0.084065
5 0.909687 0.093020 0.481070 0.506825
6 0.328835 -2.117948 0.912028 1.358023
7 -1.274164 0.297414 -0.154745 -1.250898
8 -1.317405 0.223203 0.541393 0.433390
9 0.229006 0.044450 1.575203 -1.056634
In [64]: pieces = [df[:3], df[3:7], df[7:]]
In [65]: pd.concat(pieces)
Out[65]:
0 1 2 3
0 -0.917562 -0.678779 -0.223747 -1.137378
1 -0.284037 0.154714 -0.539917 1.080102
2 -1.024852 0.034169 -2.593824 -2.404386
3 -1.420573 -1.408281 1.015019 0.454068
4 0.729644 -0.807320 2.046540 -0.084065
5 0.909687 0.093020 0.481070 0.506825
6 0.328835 -2.117948 0.912028 1.358023
7 -1.274164 0.297414 -0.154745 -1.250898
8 -1.317405 0.223203 0.541393 0.433390
9 0.229006 0.044450 1.575203 -1.056634
Adding a column to a DataFrame is relatively fast. Appending rows, however, requires a copy and can be expensive. It is therefore better to pass a pre-built list of records to the DataFrame constructor than to build a DataFrame by iteratively appending records to it.
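A sketch contrasting the two approaches (my addition; timings omitted):

```python
import pandas as pd

# Recommended: collect plain records first, construct once.
records = [{"id": i, "value": i * i} for i in range(5)]
df_fast = pd.DataFrame.from_records(records)

# Pattern to avoid: growing a DataFrame row by row; every concat
# copies all of the rows accumulated so far.
df_slow = pd.DataFrame([{"id": 0, "value": 0}])
for i in range(1, 5):
    df_slow = pd.concat(
        [df_slow, pd.DataFrame([{"id": i, "value": i * i}])],
        ignore_index=True,
    )
print(df_fast)
```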
SQL-style merging (note that when a key appears multiple times on both sides, merge produces the Cartesian product of the matching rows):
In [64]: left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
In [65]: right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
In [66]: left
Out[66]:
key lval
0 foo 1
1 foo 2
In [67]: right
Out[67]:
key rval
0 foo 4
1 foo 5
In [68]: pd.merge(left, right, on="key")
Out[68]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
In [69]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
In [70]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
In [71]: left
Out[71]:
key lval
0 foo 1
1 bar 2
In [72]: right
Out[72]:
key rval
0 foo 4
1 bar 5
In [73]: pd.merge(left, right, on="key")
Out[73]:
key lval rval
0 foo 1 4
1 bar 2 5
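As in SQL, the join type can be controlled; a brief sketch (my addition) using the how= parameter:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# inner (the default) keeps only keys present on both sides;
# outer keeps the union of keys, filling the gaps with NaN.
print(pd.merge(left, right, on="key", how="inner"))
print(pd.merge(left, right, on="key", how="outer"))
```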
Grouping (group by) is a process that involves one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
In [74]: df = pd.DataFrame(
...: {
...: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "C": np.random.randn(8),
...: "D": np.random.randn(8),
...: }
...: )
In [75]: df
Out[75]:
A B C D
0 foo one 0.555129 -0.026934
1 bar one 1.023521 1.126648
2 foo two 1.459798 0.064685
3 bar three -1.037128 -0.503873
4 foo two 0.361939 0.008989
5 bar two 0.909140 0.784223
6 foo one 0.262744 2.750767
7 foo three 0.411624 0.128386
Group, then apply the sum() function to each group:
In [76]: df.groupby("A").sum()
Out[76]:
C D
A
bar 0.895533 1.406997
foo 3.051234 2.925893
Grouping by multiple columns produces a hierarchical index; again we apply sum() to each group:
In [77]: df.groupby(["A", "B"]).sum()
Out[77]:
C D
A B
bar one 1.023521 1.126648
three -1.037128 -0.503873
two 0.909140 0.784223
foo one 0.817873 2.723833
three 0.411624 0.128386
two 1.821737 0.073674
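Beyond a single reduction, groupby can apply several at once through agg(); a short sketch (my addition):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0]})
# agg() applies several reductions per group in one pass.
result = df.groupby("A")["C"].agg(["sum", "mean", "count"])
print(result)
```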
The stack() method converts (stacks) one level of a DataFrame's column index into the row index:
In [78]: group_sum = df.groupby(["A", "B"]).sum()
In [79]: stacked = group_sum.stack()
In [80]: stacked
Out[80]:
A B
bar one C 1.023521
D 1.126648
three C -1.037128
D -0.503873
two C 0.909140
D 0.784223
foo one C 0.817873
D 2.723833
three C 0.411624
D 0.128386
two C 1.821737
D 0.073674
dtype: float64
The inverse of stack() is unstack(), which by default unstacks the last level of the row index:
In [81]: stacked.unstack()
Out[81]:
C D
A B
bar one 1.023521 1.126648
three -1.037128 -0.503873
two 0.909140 0.784223
foo one 0.817873 2.723833
three 0.411624 0.128386
two 1.821737 0.073674
# Unstack level 1 of the index
In [82]: stacked.unstack(level=1)
Out[82]:
B one three two
A
bar C 1.023521 -1.037128 0.909140
D 1.126648 -0.503873 0.784223
foo C 0.817873 0.411624 1.821737
D 2.723833 0.128386 0.073674
# Unstack level 0 of the index
In [83]: stacked.unstack(level=0)
Out[83]:
A bar foo
B
one C 1.023521 0.817873
D 1.126648 2.723833
three C -1.037128 0.411624
D -0.503873 0.128386
two C 0.909140 1.821737
D 0.784223 0.073674
In [84]: df = pd.DataFrame(
...: {
...: "A": ["one", "one", "two", "three"] * 3,
...: "B": ["A", "B", "C"] * 4,
...: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
...: "D": np.random.randn(12),
...: "E": np.random.randn(12),
...: }
...: )
In [85]: df
Out[85]:
A B C D E
0 one A foo -0.166493 0.113699
1 one B foo 0.026190 0.652232
2 two C foo -0.335997 -0.303846
3 three A bar -0.060659 1.465404
4 one B bar 0.292503 -2.533638
5 one C bar 0.447678 0.544235
6 two A foo 0.458672 1.241213
7 three B foo -0.461620 -0.540212
8 one C foo 1.618171 1.383842
9 one A bar 1.605952 -0.560169
10 two B bar 0.463036 -0.512008
11 three C bar 1.298219 -1.470472
In [86]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[86]:
C bar foo
A B
one A 1.605952 -0.166493
B 0.292503 0.026190
C 0.447678 1.618171
three A -0.060659 NaN
B NaN -0.461620
C 1.298219 NaN
two A NaN 0.458672
B 0.463036 NaN
C NaN -0.335997
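The NaN holes above come from index/column combinations with no data; pivot_table can fill them and use a different aggregation. A small sketch (my addition) with the fill_value and aggfunc parameters:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "B": ["x", "y", "x", "y"],
    "C": ["foo", "foo", "bar", "bar"],
    "D": [1.0, 2.0, 3.0, 4.0],
})
# fill_value replaces missing combinations; aggfunc picks the
# aggregation (mean is the default).
table = pd.pivot_table(
    df, values="D", index=["A", "B"], columns=["C"],
    aggfunc="sum", fill_value=0,
)
print(table)
```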
pandas provides simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minute data). This is extremely common in, but not limited to, financial applications.
In [87]: rng = pd.date_range("1/1/2022", periods=100, freq="S")
In [88]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [89]: ts.resample("5Min").sum()
Out[89]:
2022-01-01 24493
Freq: 5T, dtype: int32
Time zone representation:
In [90]: rng = pd.date_range("2022-09-01", periods=5, freq="D")
In [91]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [92]: ts
Out[92]:
2022-09-01 -1.074999
2022-09-02 -0.138886
2022-09-03 -0.362477
2022-09-04 -1.200428
2022-09-05 -1.033010
Freq: D, dtype: float64
In [93]: ts_zh = ts.tz_localize("Asia/Shanghai")
In [94]: ts_zh
Out[94]:
2022-09-01 00:00:00+08:00 -1.074999
2022-09-02 00:00:00+08:00 -0.138886
2022-09-03 00:00:00+08:00 -0.362477
2022-09-04 00:00:00+08:00 -1.200428
2022-09-05 00:00:00+08:00 -1.033010
dtype: float64
Convert to another time zone:
In [95]: ts_zh.tz_convert("UTC")
Out[95]:
2022-08-31 16:00:00+00:00 -1.074999
2022-09-01 16:00:00+00:00 -0.138886
2022-09-02 16:00:00+00:00 -0.362477
2022-09-03 16:00:00+00:00 -1.200428
2022-09-04 16:00:00+00:00 -1.033010
dtype: float64
Convert between time span representations:
In [96]: rng = pd.date_range("2022/1/1", periods=5, freq="M")
In [96]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [97]: ts
Out[97]:
2022-01-31 -1.311876
2022-02-28 1.127235
2022-03-31 0.878621
2022-04-30 0.040731
2022-05-31 -1.242116
Freq: M, dtype: float64
In [98]: ps = ts.to_period()
In [99]: ps
Out[99]:
2022-01 -1.311876
2022-02 1.127235
2022-03 0.878621
2022-04 0.040731
2022-05 -1.242116
Freq: M, dtype: float64
In [100]: ps.to_timestamp()
Out[100]:
2022-01-01 -1.311876
2022-02-01 1.127235
2022-03-01 0.878621
2022-04-01 0.040731
2022-05-01 -1.242116
Freq: MS, dtype: float64
pandas can include categorical data in a DataFrame.
In [101]: df = pd.DataFrame(
...: {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
...: )
Convert the raw raw_grade data to the categorical dtype:
In [102]: df["grade"] = df["raw_grade"].astype("category")
In [103]: df['grade']
Out[103]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names:
# Note: assigning to .cat.categories directly is no longer supported in newer pandas; use rename_categories instead
In [104]: df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'very bad'])
In [105]: df['grade']
Out[105]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (3, object): ['very good', 'good', 'very bad']
Reorder the categories and, at the same time, add the missing ones:
In [106]: df["grade"] = df["grade"].cat.set_categories(
...: ["very bad", "bad", "medium", "good", "very good"]
...: )
In [107]: df['grade']
Out[107]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
Sorting is by the order of the categories, not lexical order:
In [108]: df.sort_values(by='grade')
Out[108]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping by a categorical column also shows empty categories:
In [109]: df.groupby("grade").size()
Out[109]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
Import matplotlib following the usual naming convention:
import matplotlib.pyplot as plt
In [111]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2022', periods=1000))
In [112]: ts = ts.cumsum()
In [113]: ts.plot()
Out[113]: <AxesSubplot:>
In [114]: plt.show()
On a DataFrame, the plot() method conveniently plots all of the columns with labels:
In [115]: df = pd.DataFrame(
...: np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
...: )
In [116]: df = df.cumsum()
In [117]: df.plot()
Out[117]: <AxesSubplot:>
In [118]: plt.legend(loc='best')
Out[118]: <matplotlib.legend.Legend at 0x2442a8973a0>
In [119]: plt.show()
Write to a CSV file:
In [120]: df.to_csv("foo.csv")
Read from a CSV file:
In [121]: pd.read_csv("foo.csv")
Out[121]:
Unnamed: 0 A B C D
0 2022-01-01 -0.339712 0.803431 0.926860 0.969152
1 2022-01-02 -0.049207 1.128155 1.789429 -0.616847
2 2022-01-03 -0.435348 1.882219 1.536849 0.125363
3 2022-01-04 -1.354101 1.935871 0.119567 0.480918
4 2022-01-05 -3.091231 0.798345 -0.546616 -1.060994
.. ... ... ... ... ...
995 2024-09-22 4.397421 -0.222820 17.812736 13.025829
996 2024-09-23 4.047730 0.034211 19.812762 12.772421
997 2024-09-24 2.241092 0.537052 20.488351 12.718344
998 2024-09-25 1.855455 0.381156 21.322891 13.121769
999 2024-09-26 2.629507 0.285630 20.310801 11.795447
[1000 rows x 5 columns]
Write to a file in HDF5 format:
In [122]: df.to_hdf('foo.h5', 'df')
Read from an HDF5 file:
In [123]: pd.read_hdf('foo.h5', 'df')
Out[123]:
A B C D
2022-01-01 -0.339712 0.803431 0.926860 0.969152
2022-01-02 -0.049207 1.128155 1.789429 -0.616847
2022-01-03 -0.435348 1.882219 1.536849 0.125363
2022-01-04 -1.354101 1.935871 0.119567 0.480918
2022-01-05 -3.091231 0.798345 -0.546616 -1.060994
... ... ... ... ...
2024-09-22 4.397421 -0.222820 17.812736 13.025829
2024-09-23 4.047730 0.034211 19.812762 12.772421
2024-09-24 2.241092 0.537052 20.488351 12.718344
2024-09-25 1.855455 0.381156 21.322891 13.121769
2024-09-26 2.629507 0.285630 20.310801 11.795447
[1000 rows x 4 columns]
Write to an Excel file:
In [124]: df.to_excel("foo.xlsx", sheet_name="Sheet1")
Read from an Excel file:
In [125]: pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Out[125]:
Unnamed: 0 A B C D
0 2022-01-01 -0.339712 0.803431 0.926860 0.969152
1 2022-01-02 -0.049207 1.128155 1.789429 -0.616847
2 2022-01-03 -0.435348 1.882219 1.536849 0.125363
3 2022-01-04 -1.354101 1.935871 0.119567 0.480918
4 2022-01-05 -3.091231 0.798345 -0.546616 -1.060994
.. ... ... ... ... ...
995 2024-09-22 4.397421 -0.222820 17.812736 13.025829
996 2024-09-23 4.047730 0.034211 19.812762 12.772421
997 2024-09-24 2.241092 0.537052 20.488351 12.718344
998 2024-09-25 1.855455 0.381156 21.322891 13.121769
999 2024-09-26 2.629507 0.285630 20.310801 11.795447
This article has only briefly demonstrated some simple operations in each area of pandas; the library is far more powerful than these examples show, and can do far more than what is listed here. Later articles will dig deeper into each part of pandas, aiming to cover it thoroughly. I have often used pandas to solve problems, but have never systematically organized and recorded that knowledge; I will be doing so on this platform going forward. If you spot any mistakes, please leave a comment!