[pandas data analysis] Simple examples of pandas features and operations

Table of contents

  • Importing packages
  • Creating objects
    • Series
    • DataFrame
  • Viewing data
  • Selection
    • Getting data
    • Selection by label
    • Selection by position
    • Boolean indexing
    • Setting values
  • Missing data
  • Operations
    • Statistics
    • Apply
    • Histogramming
    • String methods
  • Merge
    • Concat
    • Join
  • Grouping
  • Reshaping
    • Stack
    • Pivot tables
  • Time series
  • Categoricals
  • Plotting
  • Reading/writing data
    • CSV
    • HDF5
    • Excel
  • What's next

An earlier post introduced pandas: what it is, what its strengths are, what it can do, and where it is applied (see the pandas overview post).

This post uses pandas to walk through a set of examples, taking a whirlwind tour of its features and operations. The examples cover most of pandas' core functionality and should give you a direct feel for how powerful it is.

Importing packages

First, following the Python community's naming conventions for common modules, import numpy and pandas as follows:

import numpy as np
import pandas as pd

Creating objects

Series

A Series can be thought of as a one-dimensional array with an index.

Create a Series by passing a list; pandas will create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
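
Besides the default integer index, an explicit index can be passed as well. A minimal sketch (the labels below are made up for illustration):

```python
import pandas as pd

# Pass an explicit index of labels instead of relying on the default RangeIndex
s2 = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Values can then be accessed by label
print(s2["y"])  # -> 20
```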

DataFrame

A DataFrame is a two-dimensional table of data, similar to a table in a relational database (such as MySQL); each column can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share a common index.

Create a DataFrame from a NumPy array, specifying the index and column labels:

In [5]: df = pd.DataFrame(np.random.randn(6, 4), index=list("abcdef"), columns=list("ABCD"))

In [6]: df
Out[6]:
          A         B         C         D
a  0.137850 -1.056618 -0.651227  0.517188
b -0.860666  1.304139  0.488719 -0.230823
c  1.333085 -2.825132 -0.592092  0.999223
d  0.068219 -0.625945  0.316369  0.003051
e -1.884551  0.313736  0.090904 -0.587071
f  0.230159 -0.305100  0.243703  0.006146

Create a DataFrame by passing a dictionary:

In [7]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": 1.0,
   ...:         "B": pd.Timestamp("20220914"),
   ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
   ...:         "D": np.array([3] * 4, dtype="int32"),
   ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
   ...:         "F": "foo",
   ...:     }
   ...: )

In [8]: df2
Out[8]:
     A          B    C  D      E    F
0  1.0 2022-09-14  1.0  3   test  foo
1  1.0 2022-09-14  1.0  3  train  foo
2  1.0 2022-09-14  1.0  3   test  foo
3  1.0 2022-09-14  1.0  3  train  foo

The resulting DataFrame has columns with different data types (heterogeneous columns; a DataFrame may mix dtypes across columns):

In [9]: df2.dtypes
Out[9]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

View the first (top) and last (bottom) rows of a DataFrame:

# Shows 5 rows by default
In [10]: df.head()
Out[10]:
          A         B         C         D
a  0.137850 -1.056618 -0.651227  0.517188
b -0.860666  1.304139  0.488719 -0.230823
c  1.333085 -2.825132 -0.592092  0.999223
d  0.068219 -0.625945  0.316369  0.003051
e -1.884551  0.313736  0.090904 -0.587071

# An integer argument can also be passed to specify the number of rows
In [11]: df.tail(3)
Out[11]:
          A         B         C         D
d  0.068219 -0.625945  0.316369  0.003051
e -1.884551  0.313736  0.090904 -0.587071
f  0.230159 -0.305100  0.243703  0.006146

View the index and column labels:

In [12]: df.index
Out[12]: Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [13]: df.columns
Out[13]: Index(['A', 'B', 'C', 'D'], dtype='object')

View the DataFrame's metadata:

In [14]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      float64
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32
 3   D       4 non-null      int32
 4   E       4 non-null      category
 5   F       4 non-null      object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes

Show a quick statistical summary of the data:

In [15]: df.describe()
Out[15]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.162651 -0.532487 -0.017271  0.117953
std    1.094284  1.391450  0.485689  0.562216
min   -1.884551 -2.825132 -0.651227 -0.587071
25%   -0.628445 -0.948950 -0.421343 -0.172354
50%    0.103035 -0.465522  0.167303  0.004599
75%    0.207082  0.159027  0.298202  0.389428
max    1.333085  1.304139  0.488719  0.999223

Transpose:

In [16]: df.T
Out[16]:
          a         b         c         d         e         f
A  0.137850 -0.860666  1.333085  0.068219 -1.884551  0.230159
B -1.056618  1.304139 -2.825132 -0.625945  0.313736 -0.305100
C -0.651227  0.488719 -0.592092  0.316369  0.090904  0.243703
D  0.517188 -0.230823  0.999223  0.003051 -0.587071  0.006146

Sort by index:

# Specify the axis and sort order (axis 1, descending)
In [17]: df.sort_index(axis=1, ascending=False)
Out[17]:
          D         C         B         A
a  0.517188 -0.651227 -1.056618  0.137850
b -0.230823  0.488719  1.304139 -0.860666
c  0.999223 -0.592092 -2.825132  1.333085
d  0.003051  0.316369 -0.625945  0.068219
e -0.587071  0.090904  0.313736 -1.884551
f  0.006146  0.243703 -0.305100  0.230159

Sort by the values of one or more columns:

In [18]: df.sort_values(by='B')
Out[18]:
          A         B         C         D
c  1.333085 -2.825132 -0.592092  0.999223
a  0.137850 -1.056618 -0.651227  0.517188
d  0.068219 -0.625945  0.316369  0.003051
f  0.230159 -0.305100  0.243703  0.006146
e -1.884551  0.313736  0.090904 -0.587071
b -0.860666  1.304139  0.488719 -0.230823

Selection

While the standard Python/NumPy expressions for selecting and setting values (i.e. indexing with []) are intuitive and handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc.

Getting data

Select a single column by name, returning a Series:

In [19]: df['A'] # df.A also works
Out[19]:
a    0.137850
b   -0.860666
c    1.333085
d    0.068219
e   -1.884551
f    0.230159
Name: A, dtype: float64
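
Note that the df.A shorthand only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute; bracket indexing always works. A small sketch with a made-up frame:

```python
import pandas as pd

# A frame with one "clean" column name and one containing a space
df_demo = pd.DataFrame({"A": [1, 2], "total count": [3, 4]})

print(df_demo.A.tolist())            # attribute access works for "A"
print(df_demo["total count"].sum())  # names with spaces require [] indexing
```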

Slice rows with []:

In [20]: df[0:3]
Out[20]:
          A         B         C         D
a  0.137850 -1.056618 -0.651227  0.517188
b -0.860666  1.304139  0.488719 -0.230823
c  1.333085 -2.825132 -0.592092  0.999223

Selection by label

Select a row by its label:

In [21]: df.loc['a']
Out[21]:
A    0.137850
B   -1.056618
C   -0.651227
D    0.517188
Name: a, dtype: float64

Index rows and columns at the same time:

# Take columns A and B
In [22]: df.loc[:, ["A", "B"]]
Out[22]:
          A         B
a  0.137850 -1.056618
b -0.860666  1.304139
c  1.333085 -2.825132
d  0.068219 -0.625945
e -1.884551  0.313736
f  0.230159 -0.305100

In [23]: df.loc['a':'c', ["A", "B"]]
Out[23]:
          A         B
a  0.137850 -1.056618
b -0.860666  1.304139
c  1.333085 -2.825132

Get a scalar value:

In [24]: %timeit df.loc['a','C']
5.55 µs ± 45 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# Same result as above, but faster
In [25]: %timeit df.at['a','C']
2.92 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Selection by position

Select by passing an integer position:

In [26]: df.iloc[3]
Out[26]:
A    0.068219
B   -0.625945
C    0.316369
D    0.003051
Name: d, dtype: float64

Select by integer slices, similar to NumPy/Python:

In [27]: df.iloc[3:5, 0:2]
Out[27]:
          A         B
d  0.068219 -0.625945
e -1.884551  0.313736

Select by lists of integer positions:

In [28]: df.iloc[[1, 2, 4], [0, 2]]
Out[28]:
          A         C
b -0.860666  0.488719
c  1.333085 -0.592092
e -1.884551  0.090904

Get a scalar value:

In [29]: %timeit df.iloc[1, 1]
11.7 µs ± 99.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# Same result as above, but faster
In [30]: %timeit df.iat[1, 1]
8.64 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Boolean indexing

Select data using the values of a single column:

# Select the rows where column A is greater than 0
In [31]: df[df["A"] > 0]
Out[31]:
          A         B         C         D
a  0.137850 -1.056618 -0.651227  0.517188
c  1.333085 -2.825132 -0.592092  0.999223
d  0.068219 -0.625945  0.316369  0.003051
f  0.230159 -0.305100  0.243703  0.006146

Select values from a DataFrame where a boolean condition holds:

# Keep only the values where the condition is True; positions where it is False become NaN
In [32]: df[df > 0]
Out[32]:
          A         B         C         D
a  0.137850       NaN       NaN  0.517188
b       NaN  1.304139  0.488719       NaN
c  1.333085       NaN       NaN  0.999223
d  0.068219       NaN  0.316369  0.003051
e       NaN  0.313736  0.090904       NaN
f  0.230159       NaN  0.243703  0.006146

Filter with the isin() method:

In [33]: df3 = df.copy()

In [34]: df3["E"] = ["one", "one", "two", "three", "four", "three"]

In [35]: df3
Out[35]:
          A         B         C         D      E
a  0.137850 -1.056618 -0.651227  0.517188    one
b -0.860666  1.304139  0.488719 -0.230823    one
c  1.333085 -2.825132 -0.592092  0.999223    two
d  0.068219 -0.625945  0.316369  0.003051  three
e -1.884551  0.313736  0.090904 -0.587071   four
f  0.230159 -0.305100  0.243703  0.006146  three

# Filter rows where column E takes certain values
In [36]: df3[df3["E"].isin(["two", "four"])]
Out[36]:
          A         B         C         D     E
c  1.333085 -2.825132 -0.592092  0.999223   two
e -1.884551  0.313736  0.090904 -0.587071  four

Setting values

Set by label:

# Set the value at row label a, column label A to 0
In [37]: df.at['a', 'A'] = 0

In [38]: df
Out[38]:
          A         B         C         D
a  0.000000 -1.056618 -0.651227  0.517188
b -0.860666  1.304139  0.488719 -0.230823
c  1.333085 -2.825132 -0.592092  0.999223
d  0.068219 -0.625945  0.316369  0.003051
e -1.884551  0.313736  0.090904 -0.587071
f  0.230159 -0.305100  0.243703  0.006146

Set by position:

# Set the value at row 0, column 1 to 1
In [39]: df.iat[0, 1] = 1

In [40]: df
Out[40]:
          A         B         C         D
a  0.000000  1.000000 -0.651227  0.517188
b -0.860666  1.304139  0.488719 -0.230823
c  1.333085 -2.825132 -0.592092  0.999223
d  0.068219 -0.625945  0.316369  0.003051
e -1.884551  0.313736  0.090904 -0.587071
f  0.230159 -0.305100  0.243703  0.006146

Set by assigning a NumPy array:

# Set the values of column D
In [41]: df.loc[:, "D"] = np.array([5] * len(df))

In [42]: df
Out[42]:
          A         B         C  D
a  0.000000  1.000000 -0.651227  5
b -0.860666  1.304139  0.488719  5
c  1.333085 -2.825132 -0.592092  5
d  0.068219 -0.625945  0.316369  5
e -1.884551  0.313736  0.090904  5
f  0.230159 -0.305100  0.243703  5

Set based on a boolean condition:

In [43]: df3 = df.copy()

# Set values at the positions where they are greater than 0
In [44]: df3[df3 > 0] = -df3

In [45]: df3
Out[45]:
          A         B         C  D
a  0.000000 -1.000000 -0.651227 -5
b -0.860666 -1.304139 -0.488719 -5
c -1.333085 -2.825132 -0.592092 -5
d -0.068219 -0.625945 -0.316369 -5
e -1.884551 -0.313736 -0.090904 -5
f -0.230159 -0.305100 -0.243703 -5

Missing data

pandas primarily uses the value np.nan to represent missing data. By default it is excluded from computations.

In [46]: df3.loc['a':'b', 'E'] = 1

In [47]: df3
Out[47]:
          A         B         C  D    E
a  0.000000 -1.000000 -0.651227 -5  1.0
b -0.860666 -1.304139 -0.488719 -5  1.0
c -1.333085 -2.825132 -0.592092 -5  NaN
d -0.068219 -0.625945 -0.316369 -5  NaN
e -1.884551 -0.313736 -0.090904 -5  NaN
f -0.230159 -0.305100 -0.243703 -5  NaN

Drop any rows that have missing values:

In [49]: df3.dropna(how='any')
Out[49]:
          A         B         C  D    E
a  0.000000 -1.000000 -0.651227 -5  1.0
b -0.860666 -1.304139 -0.488719 -5  1.0

Fill missing values:

In [50]: df3.fillna(value=5)
Out[50]:
          A         B         C  D    E
a  0.000000 -1.000000 -0.651227 -5  1.0
b -0.860666 -1.304139 -0.488719 -5  1.0
c -1.333085 -2.825132 -0.592092 -5  5.0
d -0.068219 -0.625945 -0.316369 -5  5.0
e -1.884551 -0.313736 -0.090904 -5  5.0
f -0.230159 -0.305100 -0.243703 -5  5.0

Check whether values are nan:

In [51]: df3.isna()
Out[51]:
       A      B      C      D      E
a  False  False  False  False  False
b  False  False  False  False  False
c  False  False  False  False   True
d  False  False  False  False   True
e  False  False  False  False   True
f  False  False  False  False   True

In [52]: pd.isna(df3)
Out[52]:
       A      B      C      D      E
a  False  False  False  False  False
b  False  False  False  False  False
c  False  False  False  False   True
d  False  False  False  False   True
e  False  False  False  False   True
f  False  False  False  False   True

Operations

Statistics

In general, missing data is excluded from statistical computations.
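
As a concrete sketch of this behavior (using a made-up Series, separate from df): NaN is skipped by default, while skipna=False propagates it:

```python
import numpy as np
import pandas as pd

s_na = pd.Series([1.0, np.nan, 3.0])

print(s_na.mean())              # NaN is ignored: mean of 1.0 and 3.0 -> 2.0
print(s_na.mean(skipna=False))  # NaN propagates: result is nan
```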

Compute the mean:

# Mean of each column
In [53]: df.mean()
Out[53]:
A   -0.185626
B   -0.189717
C   -0.017271
D    5.000000

Compute the mean on the other axis:

# Mean of each row
In [54]: df.mean(axis=1)
Out[54]:
a    1.337193
b    1.483048
c    0.728965
d    1.189661
e    0.880022
f    1.292191
dtype: float64

Similarly, you can compute the sum (sum), standard deviation (std), count (count), minimum (min), maximum (max), and so on over df.
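
For instance, a minimal sketch on a small made-up frame; agg() can also compute several statistics at once:

```python
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

print(demo.sum())                # column sums
print(demo.min())                # column minima
print(demo.agg(["sum", "max"]))  # several statistics in one call
```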

Apply

Apply functions to the data:

In [55]: df.apply(np.cumsum)
Out[55]:
          A         B         C   D
a  0.000000  1.000000 -0.651227   5
b -0.860666  2.304139 -0.162508  10
c  0.472418 -0.520993 -0.754600  15
d  0.540638 -1.146938 -0.438231  20
e -1.343913 -0.833202 -0.347327  25
f -1.113754 -1.138302 -0.103625  30

In [56]: df.apply(lambda x: x.max() - x.min())
Out[56]:
A    3.217636
B    4.129271
C    1.139947
D    0.000000
dtype: float64

Histogramming

In [57]: s = pd.Series(np.random.randint(0, 7, size=10))

In [58]: s
Out[58]:
0    5
1    0
2    1
3    6
4    2
5    2
6    5
7    2
8    2
9    1
dtype: int32

# Count the occurrences of each value
In [59]: s.value_counts()
Out[59]:
2    4
5    2
1    2
0    1
6    1
dtype: int64
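
value_counts() can also report relative frequencies via its normalize parameter; a small sketch with made-up data:

```python
import pandas as pd

s_counts = pd.Series(["a", "b", "a", "a"])

vc = s_counts.value_counts()                  # absolute counts
freq = s_counts.value_counts(normalize=True)  # proportions summing to 1

print(vc)
print(freq)
```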

String methods

Series comes equipped, in its str attribute, with a set of string processing methods that make it easy to operate on each element:

In [60]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

In [61]: s.str.lower()
Out[61]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merge

Concat

Concatenate pandas objects with concat():

In [62]: df = pd.DataFrame(np.random.randn(10, 4))

In [63]: df
Out[63]:
          0         1         2         3
0 -0.917562 -0.678779 -0.223747 -1.137378
1 -0.284037  0.154714 -0.539917  1.080102
2 -1.024852  0.034169 -2.593824 -2.404386
3 -1.420573 -1.408281  1.015019  0.454068
4  0.729644 -0.807320  2.046540 -0.084065
5  0.909687  0.093020  0.481070  0.506825
6  0.328835 -2.117948  0.912028  1.358023
7 -1.274164  0.297414 -0.154745 -1.250898
8 -1.317405  0.223203  0.541393  0.433390
9  0.229006  0.044450  1.575203 -1.056634

In [64]: pieces = [df[:3], df[3:7], df[7:]]

In [65]: pd.concat(pieces)
Out[65]:
          0         1         2         3
0 -0.917562 -0.678779 -0.223747 -1.137378
1 -0.284037  0.154714 -0.539917  1.080102
2 -1.024852  0.034169 -2.593824 -2.404386
3 -1.420573 -1.408281  1.015019  0.454068
4  0.729644 -0.807320  2.046540 -0.084065
5  0.909687  0.093020  0.481070  0.506825
6  0.328835 -2.117948  0.912028  1.358023
7 -1.274164  0.297414 -0.154745 -1.250898
8 -1.317405  0.223203  0.541393  0.433390
9  0.229006  0.044450  1.575203 -1.056634

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy and can be expensive. We therefore recommend passing a pre-built list of records to the DataFrame constructor, rather than building a DataFrame by iteratively appending records to it.
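
The recommended pattern can be sketched like this (the names below are illustrative):

```python
import pandas as pd

# Collect rows in a plain Python list first...
records = []
for i in range(3):
    records.append({"id": i, "value": i * 10})

# ...then build the DataFrame once, instead of appending row by row
df_records = pd.DataFrame(records)
print(df_records)
```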

Join

SQL-style merges:

In [64]: left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

In [65]: right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

In [66]: left
Out[66]:
   key  lval
0  foo     1
1  foo     2

In [67]: right
Out[67]:
   key  rval
0  foo     4
1  foo     5

In [68]: pd.merge(left, right, on="key")
Out[68]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

In [69]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

In [70]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

In [71]: left
Out[71]:
   key  lval
0  foo     1
1  bar     2

In [72]: right
Out[72]:
   key  rval
0  foo     4
1  bar     5

In [73]: pd.merge(left, right, on="key")
Out[73]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
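
merge() performs an inner join by default; the how parameter selects left, right, or outer joins. A brief sketch with made-up frames whose keys only partially overlap:

```python
import pandas as pd

left2 = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right2 = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# An outer join keeps keys from both sides, filling missing values with NaN
merged = pd.merge(left2, right2, on="key", how="outer")
print(merged)
```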

Grouping

By "group by" we mean a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

In [74]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    ...:         "C": np.random.randn(8),
    ...:         "D": np.random.randn(8),
    ...:     }
    ...: )

In [75]: df
Out[75]:
     A      B         C         D
0  foo    one  0.555129 -0.026934
1  bar    one  1.023521  1.126648
2  foo    two  1.459798  0.064685
3  bar  three -1.037128 -0.503873
4  foo    two  0.361939  0.008989
5  bar    two  0.909140  0.784223
6  foo    one  0.262744  2.750767
7  foo  three  0.411624  0.128386

After grouping, apply the sum() function to each group:

In [76]: df.groupby("A").sum()
Out[76]:
            C         D
A
bar  0.895533  1.406997
foo  3.051234  2.925893

Grouping by multiple columns forms a hierarchical index, and again we apply the sum() function to each group:

In [77]: df.groupby(["A", "B"]).sum()
Out[77]:
                  C         D
A   B
bar one    1.023521  1.126648
    three -1.037128 -0.503873
    two    0.909140  0.784223
foo one    0.817873  2.723833
    three  0.411624  0.128386
    two    1.821737  0.073674

Reshaping

Stack

The stack() method compresses (stacks) a level of the DataFrame's column index into the row index:

In [78]: group_sum = df.groupby(["A", "B"]).sum()

In [79]: stacked = group_sum.stack()

In [80]: stacked
Out[80]:
A    B
bar  one    C    1.023521
            D    1.126648
     three  C   -1.037128
            D   -0.503873
     two    C    0.909140
            D    0.784223
foo  one    C    0.817873
            D    2.723833
     three  C    0.411624
            D    0.128386
     two    C    1.821737
            D    0.073674
dtype: float64

The inverse of stack() is unstack(), which by default unstacks the last level of the row index:

In [81]: stacked.unstack()
Out[81]:
                  C         D
A   B
bar one    1.023521  1.126648
    three -1.037128 -0.503873
    two    0.909140  0.784223
foo one    0.817873  2.723833
    three  0.411624  0.128386
    two    1.821737  0.073674

# Unstack level 1 of the index
In [82]: stacked.unstack(level=1)
Out[82]:
B           one     three       two
A
bar C  1.023521 -1.037128  0.909140
    D  1.126648 -0.503873  0.784223
foo C  0.817873  0.411624  1.821737
    D  2.723833  0.128386  0.073674

# Unstack level 0 of the index
In [83]: stacked.unstack(level=0)
Out[83]:
A             bar       foo
B
one   C  1.023521  0.817873
      D  1.126648  2.723833
three C -1.037128  0.411624
      D -0.503873  0.128386
two   C  0.909140  1.821737
      D  0.784223  0.073674

Pivot tables

In [84]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["one", "one", "two", "three"] * 3,
    ...:         "B": ["A", "B", "C"] * 4,
    ...:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
    ...:         "D": np.random.randn(12),
    ...:         "E": np.random.randn(12),
    ...:     }
    ...: )

In [85]: df
Out[85]:
        A  B    C         D         E
0     one  A  foo -0.166493  0.113699
1     one  B  foo  0.026190  0.652232
2     two  C  foo -0.335997 -0.303846
3   three  A  bar -0.060659  1.465404
4     one  B  bar  0.292503 -2.533638
5     one  C  bar  0.447678  0.544235
6     two  A  foo  0.458672  1.241213
7   three  B  foo -0.461620 -0.540212
8     one  C  foo  1.618171  1.383842
9     one  A  bar  1.605952 -0.560169
10    two  B  bar  0.463036 -0.512008
11  three  C  bar  1.298219 -1.470472

In [86]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[86]:
C             bar       foo
A     B
one   A  1.605952 -0.166493
      B  0.292503  0.026190
      C  0.447678  1.618171
three A -0.060659       NaN
      B       NaN -0.461620
      C  1.298219       NaN
two   A       NaN  0.458672
      B  0.463036       NaN
      C       NaN -0.335997
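
pivot_table() aggregates with the mean by default; pass aggfunc to change the aggregation. A minimal sketch on a made-up frame:

```python
import pandas as pd

tiny = pd.DataFrame({"k": ["a", "a", "b"], "c": ["x", "x", "y"], "v": [1, 2, 3]})

# Sum instead of the default mean; unmatched cells come out as NaN
pt = pd.pivot_table(tiny, values="v", index="k", columns="c", aggfunc="sum")
print(pt)
```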

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting second-level data into 5-minute-level data). This is very common in financial applications, but by no means limited to them.

In [87]: rng = pd.date_range("1/1/2022", periods=100, freq="S")

In [88]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [89]: ts.resample("5Min").sum()
Out[89]:
2022-01-01    24493
Freq: 5T, dtype: int32

Time zone representation:

In [90]: rng = pd.date_range("2022-09-01", periods=5, freq="D")

In [91]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [92]: ts
Out[92]:
2022-09-01   -1.074999
2022-09-02   -0.138886
2022-09-03   -0.362477
2022-09-04   -1.200428
2022-09-05   -1.033010
Freq: D, dtype: float64

In [93]: ts_zh = ts.tz_localize("Asia/Shanghai")

In [94]: ts_zh
Out[94]:
2022-09-01 00:00:00+08:00   -1.074999
2022-09-02 00:00:00+08:00   -0.138886
2022-09-03 00:00:00+08:00   -0.362477
2022-09-04 00:00:00+08:00   -1.200428
2022-09-05 00:00:00+08:00   -1.033010
dtype: float64

Converting to another time zone:

In [95]: ts_zh.tz_convert("UTC")
Out[95]:
2022-08-31 16:00:00+00:00   -1.074999
2022-09-01 16:00:00+00:00   -0.138886
2022-09-02 16:00:00+00:00   -0.362477
2022-09-03 16:00:00+00:00   -1.200428
2022-09-04 16:00:00+00:00   -1.033010
dtype: float64

Converting between time span representations:

In [96]: rng = pd.date_range("2022/1/1", periods=5, freq="M")

In [96]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [97]: ts
Out[97]:
2022-01-31   -1.311876
2022-02-28    1.127235
2022-03-31    0.878621
2022-04-30    0.040731
2022-05-31   -1.242116
Freq: M, dtype: float64

In [98]: ps = ts.to_period()

In [99]: ps
Out[99]:
2022-01   -1.311876
2022-02    1.127235
2022-03    0.878621
2022-04    0.040731
2022-05   -1.242116
Freq: M, dtype: float64

In [100]: ps.to_timestamp()
Out[100]:
2022-01-01   -1.311876
2022-02-01    1.127235
2022-03-01    0.878621
2022-04-01    0.040731
2022-05-01   -1.242116
Freq: MS, dtype: float64

Categoricals

pandas can include categorical data in a DataFrame.

In [101]: df = pd.DataFrame(
     ...:     {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
     ...: )

Convert the raw raw_grade data to a categorical dtype:

In [102]: df["grade"] = df["raw_grade"].astype("category")

In [103]: df['grade']
Out[103]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names:

In [104]: df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'very bad'])

In [105]: df['grade']
Out[105]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): ['very good', 'good', 'very bad']

Reorder the categories and simultaneously add the missing categories:

In [106]: df["grade"] = df["grade"].cat.set_categories(
     ...:     ["very bad", "bad", "medium", "good", "very good"]
     ...: )

In [107]: df['grade']
Out[107]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per the order of the categories, not lexical order:

In [108]: df.sort_values(by='grade')
Out[108]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Grouping by a categorical column also shows empty categories:

In [109]: df.groupby("grade").size()
Out[109]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
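
If the empty categories are unwanted, groupby accepts an observed parameter that drops them (note that recent pandas versions have been changing the default for categorical groupers). A small sketch with made-up data:

```python
import pandas as pd

# A categorical with three declared categories but only two observed
cat = pd.Series(pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]))

# observed=True keeps only the categories actually present in the data
sizes = pd.Series([1, 1, 1]).groupby(cat, observed=True).size()
print(sizes)  # "c" does not appear
```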

Plotting

Import matplotlib following the naming convention:

import matplotlib.pyplot as plt

In [111]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2022', periods=1000))

In [112]: ts = ts.cumsum()

In [113]: ts.plot()
Out[113]: <AxesSubplot:>

In [114]: plt.show()

(figure: line plot of the cumulative-sum time series)

On a DataFrame, the plot() method conveniently plots all of the columns with labels:

In [115]: df = pd.DataFrame(
     ...:     np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
     ...: )

In [116]: df = df.cumsum()

In [117]: df.plot()
Out[117]: <AxesSubplot:>

In [118]: plt.legend(loc='best')
Out[118]: <matplotlib.legend.Legend at 0x2442a8973a0>

In [119]: plt.show()

(figure: line plot of the four cumulative columns A-D with a legend)

Reading/writing data

CSV

Writing to a csv file:

In [120]: df.to_csv("foo.csv")

Reading from a csv file:

In [121]: pd.read_csv("foo.csv")
Out[121]:
     Unnamed: 0         A         B          C          D
0    2022-01-01 -0.339712  0.803431   0.926860   0.969152
1    2022-01-02 -0.049207  1.128155   1.789429  -0.616847
2    2022-01-03 -0.435348  1.882219   1.536849   0.125363
3    2022-01-04 -1.354101  1.935871   0.119567   0.480918
4    2022-01-05 -3.091231  0.798345  -0.546616  -1.060994
..          ...       ...       ...        ...        ...
995  2024-09-22  4.397421 -0.222820  17.812736  13.025829
996  2024-09-23  4.047730  0.034211  19.812762  12.772421
997  2024-09-24  2.241092  0.537052  20.488351  12.718344
998  2024-09-25  1.855455  0.381156  21.322891  13.121769
999  2024-09-26  2.629507  0.285630  20.310801  11.795447

[1000 rows x 5 columns]
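
The Unnamed: 0 column above is the index that was saved to the file; passing index_col (and parse_dates) when reading restores it. A small sketch (the file name foo2.csv is made up):

```python
import pandas as pd

tiny = pd.DataFrame({"A": [1.0, 2.0]}, index=pd.date_range("2022-01-01", periods=2))
tiny.to_csv("foo2.csv")

# index_col=0 restores the saved index; parse_dates turns it back into datetimes
back = pd.read_csv("foo2.csv", index_col=0, parse_dates=True)
print(back)
```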

HDF5

Writing to an HDF5 file:

In [122]: df.to_hdf('foo.h5', 'df')

Reading from an HDF5 file:

In [123]: pd.read_hdf('foo.h5', 'df')
Out[123]:
                   A         B          C          D
2022-01-01 -0.339712  0.803431   0.926860   0.969152
2022-01-02 -0.049207  1.128155   1.789429  -0.616847
2022-01-03 -0.435348  1.882219   1.536849   0.125363
2022-01-04 -1.354101  1.935871   0.119567   0.480918
2022-01-05 -3.091231  0.798345  -0.546616  -1.060994
...              ...       ...        ...        ...
2024-09-22  4.397421 -0.222820  17.812736  13.025829
2024-09-23  4.047730  0.034211  19.812762  12.772421
2024-09-24  2.241092  0.537052  20.488351  12.718344
2024-09-25  1.855455  0.381156  21.322891  13.121769
2024-09-26  2.629507  0.285630  20.310801  11.795447

[1000 rows x 4 columns]

Excel

Writing to an excel file:

In [124]: df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an excel file:

In [125]: pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Out[125]:
    Unnamed: 0         A         B          C          D
0   2022-01-01 -0.339712  0.803431   0.926860   0.969152
1   2022-01-02 -0.049207  1.128155   1.789429  -0.616847
2   2022-01-03 -0.435348  1.882219   1.536849   0.125363
3   2022-01-04 -1.354101  1.935871   0.119567   0.480918
4   2022-01-05 -3.091231  0.798345  -0.546616  -1.060994
..         ...       ...       ...        ...        ...
995 2024-09-22  4.397421 -0.222820  17.812736  13.025829
996 2024-09-23  4.047730  0.034211  19.812762  12.772421
997 2024-09-24  2.241092  0.537052  20.488351  12.718344
998 2024-09-25  1.855455  0.381156  21.322891  13.121769
999 2024-09-26  2.629507  0.285630  20.310801  11.795447

What's next

This post has only briefly demonstrated some simple operations for each area of pandas, but its functionality is far more powerful than these examples show, and it can do far more than what is listed here. I will keep digging into each part of pandas and try to learn it thoroughly. Although I have often used pandas to solve problems before, I never organized and recorded that knowledge systematically, so I will start doing that here. If you spot any mistakes, feel free to point them out in the comments!
