[Python Basics] A Guide to Using pandas (Very Detailed!)

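Contents

  • pandas Guide
    • 1 pandas Basics
      • 1.1 The Series Data Structure
      • 1.2 DataFrame
        • 1.2.1 Creating a DataFrame
        • 1.2.2 Accessing a DataFrame
        • 1.2.3 Modifying a DataFrame
    • 2 Importing and Saving Data with pandas
      • 2.1 Importing Data
      • 2.2 Saving Data
    • 3 Handling Missing Data
      • 3.1 Missing Values and Empty Values
      • 3.2 Detecting Missing Values
      • 3.3 Checking Whether Any Value Is Missing
      • 3.4 Ways to Handle Missing Values
        • 3.4.1 Filtering Missing Values
        • 3.4.2 Dropping Missing Values
        • 3.4.3 Filling Missing Values
        • 3.4.4 Filling with the Mean, Median, Maximum, or Minimum
    • 4 Data Cleaning
      • 4.1 Preparing the Data
      • 4.2 Getting a Single Column
      • 4.3 Getting Several Columns
      • 4.4 Selecting Rows by a Condition
      • 4.5 Selecting Rows by Several Conditions
      • 4.6 Sorting
    • 5 Summaries and Descriptive Statistics in pandas
    • 6 Indexes and Multi-Level Indexes
      • 6.1 Setting the Index
      • 6.2 Multi-Level Indexes
      • 6.3 Selecting Values Through a Multi-Level Index
      • 6.4 Reshaping Between Rows and Columns
    • 7 Time and Time Series
      • 7.1 Timestamps
      • 7.2 Time Indexes
      • 7.3 Periods
      • 7.4 Time Indexes
      • 7.5 Time Deltas
      • 7.6 Resampling
      • 7.7 Time Shifting
    • 8 Data Cleaning
    • 9 Merging Data
      • 9.1 merge
      • 9.2 The join Method
      • 9.3 concat
    • 10 Common pandas Functions for Data Processing
      • 10.1 The apply Function
      • 10.2 Objects Handled by func
      • 10.3 map
      • 10.4 replace: Replacement
      • 10.5 agg: Aggregation
      • 10.6 transform: Transforming Data
      • 10.7 filter: Filtering
    • 11 Grouping
      • 11.1 Grouping with groupby
      • 11.2 Aggregation
      • 11.3 transform
      • 11.5 filter
      • 11.6 Binning with cut
      • 11.7 Pivot Tables
      • 11.8 str Methods
    • 12 Visualization with pandas
      • 12.1 Basic Usage
      • 12.2 Plotting Methods in plot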

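pandas Guide

Learning goals:

  • Master Series and DataFrame in pandas
  • Clean data with pandas
  • Compute basic statistics with pandas
  • Analyze time series
  • Learn common data analysis methods

1 pandas Basics

  • pandas is the core data analysis library in Python. It analyzes large amounts of data quickly and flexibly, which makes it an essential tool for data analysis in Python.
  • pandas supports importing data from many sources, as well as merging, splitting, basic statistics, time series analysis, pivot tables, and other operations.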

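1.1 The Series Data Structure

Series: a one-dimensional labeled data structure (a single column).

The Series class:
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)

  1. Create a Series object

    In the printed output, the first column is the index and the second column is the Series data.

import pandas as pd
import numpy as np
sdata = pd.Series(np.arange(1,4), index=list('abc'))
sdata
a    1
b    2
c    3
dtype: int32
  2. Access Series elements

    .iloc[] accesses elements by position, while .loc[] and plain [] with a label access them by label.

# Access by position
print(sdata.iloc[0])
# Access by label ('a', 'b', or 'c')
print(sdata['b'])
# .loc accepts labels only
print(sdata.loc['c'])
1
2
3
  3. Get the index and the values
# Get the index labels
sdata.index.values
array(['a', 'b', 'c'], dtype=object)
# Get the values
sdata.values
array([1, 2, 3])
  4. Convert the index and the values to lists
# Convert the index to a list
sdata.index.values.tolist()
['a', 'b', 'c']
# Convert the values to a list
sdata.values.tolist()
[1, 2, 3]
# Iterate over the (label, value) pairs
for item in sdata.items():
    print(item)
('a', 1)
('b', 2)
('c', 3)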
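The difference between position and label matters most when the labels are themselves integers. The following is a minimal sketch with made-up values (the index [2, 0, 1] is chosen only for illustration), showing that .iloc[0] and .loc[0] can return different elements:

```python
import pandas as pd

# A Series whose integer labels do not match the positions (illustrative data).
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_position = s.iloc[0]   # first element by position
by_label = s.loc[0]       # element whose label is 0

print(by_position, by_label)
# to_list() converts the values or the index to a plain Python list.
print(s.to_list(), s.index.to_list())
```

Using .iloc and .loc explicitly, rather than plain [], keeps the intent unambiguous in cases like this one.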

1.2 DataFrame

DataFrame: a two-dimensional labeled data structure whose columns can hold different types (multiple columns).

The DataFrame class:

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

  • data: one-dimensional or two-dimensional data
  • index: row labels
  • columns: column labels
1.2.1 Creating a DataFrame
  1. A single column
# One-dimensional data
pd.DataFrame(data=np.arange(1,4))
0
0 1
1 2
2 3
  2. Multiple columns
# Two-dimensional data: data is 4x4
data = np.arange(16).reshape(4,4)
pd.DataFrame(data=data)
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
  3. Set index and columns
# Set index and columns
data = np.arange(16).reshape(4,4)
pdata = pd.DataFrame(data=data, index=list('abcd'), columns=['c1','c2','c3','c4'])
pdata
c1 c2 c3 c4
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
  4. Create from a dict
# Create from a dict: the keys become the column labels
data = {'c1':[1,2,3], 'c2':[4,5,6]}
pdata = pd.DataFrame(data=data)
pdata
c1 c2
0 1 4
1 2 5
2 3 6
  5. Set the column labels
# Set the column labels
pdata.columns = ['t1','t2']
pdata
t1 t2
0 1 4
1 2 5
2 3 6
1.2.2 Accessing a DataFrame
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
pdata
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
  1. Get specific columns
# Get a single column; returns a Series
pdata['c1']
0    1
1    2
2    3
Name: c1, dtype: int64
# Get several columns; returns a DataFrame
pdata[['c1','c2']]
c1 c2
0 1 4
1 2 5
2 3 6
  2. The loc indexer

    loc selects rows and columns by label. It accepts single labels, lists of labels, label slices, and boolean indexes.

import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
pdata
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
  3. Get specific rows
# Get the row labeled 0 (the first row here)
pdata.loc[0]
c1    1
c2    4
c3    7
Name: 0, dtype: int64
# Get columns c1 and c2 of the row labeled 0
pdata.loc[0,['c1','c2']]
c1    1
c2    4
Name: 0, dtype: int64
  4. Iterate over a DataFrame
# Iterating over a DataFrame yields its column labels
for item in pdata:
    print(item)
c1
c2
c3
# Iterate by column: (column label, Series) pairs
for item in pdata.items():
    print(item)
('c1', 0    1
1    2
2    3
Name: c1, dtype: int64)
('c2', 0    4
1    5
2    6
Name: c2, dtype: int64)
('c3', 0    7
1    8
2    9
Name: c3, dtype: int64)
# Iterate by row: (row label, Series) pairs
for item in pdata.iterrows():
    print(item)
(0, c1    1
c2    4
c3    7
Name: 0, dtype: int64)
(1, c1    2
c2    5
c3    8
Name: 1, dtype: int64)
(2, c1    3
c2    6
c3    9
Name: 2, dtype: int64)
1.2.3 Modifying a DataFrame
  1. Modify elements
# Modify elements
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# Set every value in column c1 to 0
pdata['c1'] = 0
pdata
c1 c2 c3
0 0 4 7
1 0 5 8
2 0 6 9
  2. Insert a column
# Insert a column into a DataFrame
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# Add a new column c4
pdata['c4'] = [-1,-1,-1]
pdata
c1 c2 c3 c4
0 1 4 7 -1
1 2 5 8 -1
2 3 6 9 -1
  3. Insert a row
# Insert a row into a DataFrame
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# Add a new row labeled 3
pdata.loc[3] = [-1,-1,-1]
pdata
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
3 -1 -1 -1
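Assignment always adds a new column at the end. As a complementary sketch (the column name mid and the values are made up for illustration), DataFrame.insert places a column at a chosen position, and pd.concat appends rows while returning a new object:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6]})

# insert() places a new column at a given position and modifies df in place.
df.insert(1, 'mid', [0, 0, 0])

# pd.concat() returns a new DataFrame with the extra row appended.
new_row = pd.DataFrame({'c1': [-1], 'mid': [-1], 'c2': [-1]})
result = pd.concat([df, new_row], ignore_index=True)
print(result)
```

ignore_index=True renumbers the rows 0..n-1, which is usually what is wanted when the original index carries no meaning.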

2 Importing and Saving Data with pandas

Goals:

  • Import data from Excel and CSV files
  • Export data
  • Basic statistics
  • Handle missing data

2.1 Importing Data

Data is the foundation of analysis. In practice it comes from internal company data, web data, and open data sets.

Method    Description
pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, ...)    Read a CSV file
pd.read_excel(io, sheet_name=0, names=None, index_col=None, usecols=None, ...)    Read an Excel file
pd.read_json(path_or_buf=None, orient=None, typ='frame', dtype=None, ...)    Read a JSON file
  1. Read an Excel file
import pandas as pd
import numpy as np
# Read an Excel file
fpath = r'data\test.xlsx'
pdata = pd.read_excel(fpath)
pdata
序号 姓名 数学 语文
0 1 99 87
1 2 88 92
2 3 77 73
3 4 66 68
  2. Read a CSV file
# Read a CSV file
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath, encoding='gbk')
pdata
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN
1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN
2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN
3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN
4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN
260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN
261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN
262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN
263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN

264 rows × 62 columns

# Import only the given columns
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath,usecols = ['Country Name','1990'], encoding='gbk')
pdata
Country Name 1990
0 Aruba NaN
1 Afghanistan NaN
2 Angola 1.002674e+10
3 Albania 2.101625e+09
4 Andorra 1.029048e+09
... ... ...
259 Kosovo NaN
260 Yemen, Rep. 5.647252e+09
261 South Africa 1.155530e+11
262 Zambia 3.285217e+09
263 Zimbabwe 8.783817e+09

264 rows × 2 columns

# Choose the header row: header=1 uses the second line of the file as the column names
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath,header=1, encoding='gbk')
pdata
Aruba ABW GDP (current US$) NY.GDP.MKTP.CD Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... 2791960894 2498932961 2467703911 2584463687 Unnamed: 56 Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61
0 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN
1 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN
2 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN
3 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN
4 Arab World ARB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.078120e+12 1.795820e+12 2.109650e+12 2.501550e+12 2.741240e+12 2.839630e+12 2.906620e+12 2.563300e+12 2.504700e+12 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
258 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN
259 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN
260 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN
261 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN
262 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN

263 rows × 62 columns

# No header row: with header=None the first line is read as data
pdata = pd.read_csv(fpath,header=None, encoding='gbk')
pdata
0 1 2 3 4 5 6 7 8 9 ... 52 53 54 55 56 57 58 59 60 61
0 Country Name Country Code Indicator Name Indicator Code 1.960000e+03 1.961000e+03 1.962000e+03 1.963000e+03 1.964000e+03 1.965000e+03 ... 2.008000e+03 2.009000e+03 2.010000e+03 2.011000e+03 2.012000e+03 2.013000e+03 2.014000e+03 2.015000e+03 2.016000e+03 2017.0
1 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN
261 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN
262 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN
263 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN
264 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN

265 rows × 62 columns
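The effect of usecols, index_col, and header can be tried without any file on disk. The sketch below feeds read_csv an in-memory string; the CSV content is made up for illustration and stands in for a real file path:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,math,chinese\nA,90,89\nB,100,96\n"

# Read only two columns and use 'name' as the row index.
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'math'], index_col='name')

# With header=None the first line is treated as data, so columns are numbered.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(df)
print(raw.shape)
```

Because the header line becomes an ordinary row under header=None, raw has one more row than df.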

2.2 Saving Data

Method    Description
pdata.to_csv(path_or_buf=None, sep=',', ...)    Save as a CSV file
pdata.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', ...)    Save as an Excel file
pdata.to_json(path_or_buf=None, orient=None, ...)    Save as a JSON file
# Save files
import pandas as pd
import numpy as np
# Read the CSV file
fpath = r'data\GDP.csv'
csv_path1 = r'data\new_GDP_1.csv'
csv_path2 = r'data\new_GDP_2.csv'
csv_path3 = r'data\new_GDP_3.csv'
pdata = pd.read_csv(fpath, encoding='gbk')
# Save with the index
pdata.to_csv(csv_path1)
# Save without the index
pdata.to_csv(csv_path2, index=False)
# Save without the index, keeping only the given columns
pdata.to_csv(csv_path3, index=False, columns=['1990','1991'])
# Read back the file saved with the index: the index appears as the 'Unnamed: 0' column
pdata = pd.read_csv(csv_path1, encoding='gbk')
pdata
Unnamed: 0 Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN
1 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN
2 2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN
3 3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN
4 4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN
260 260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN
261 261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN
262 262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN
263 263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN

264 rows × 63 columns

pdata = pd.read_csv(csv_path2, encoding='gbk')
pdata
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN
1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN
2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN
3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN
4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN
260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN
261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN
262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN
263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN

264 rows × 62 columns

pdata = pd.read_csv(csv_path3, encoding='gbk')
pdata
1990 1991
0 NaN NaN
1 NaN NaN
2 1.002674e+10 1.211861e+10
3 2.101625e+09 1.139167e+09
4 1.029048e+09 1.106929e+09
... ... ...
259 NaN NaN
260 5.647252e+09 5.930370e+09
261 1.155530e+11 1.239430e+11
262 3.285217e+09 3.378882e+09
263 8.783817e+09 8.641482e+09

264 rows × 2 columns
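The extra 'Unnamed: 0' column seen above comes from saving the index. A small round trip makes this visible; the sketch uses made-up data and relies on to_csv returning the CSV text as a string when no path is given:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B'], 'math': [90, 100]})

# With no path, to_csv() returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the text back shows the unnamed column created by the saved index.
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))
print(list(back_with.columns))
print(list(back_without.columns))
```

If the index is meaningful, an alternative to index=False is to keep it and read the file back with index_col=0.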

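3 Handling Missing Data

3.1 Missing Values and Empty Values

Missing value: an entry in the data set that has no value. pandas represents it as NaN, or as NaT for datetime data.

Empty value: the empty string "". pandas does not treat it as missing.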

s1 = [10, 10.5, None, 11]
s2 = [7, 6.9,7.5,None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pdata
s1 s2
0 10.0 7.0
1 10.5 6.9
2 NaN 7.5
3 11.0 NaN

3.2 Detecting Missing Values

Detection methods:

  • pd.isnull(): True where a value is missing; returns a Boolean Series or DataFrame
  • pd.notnull(): False where a value is missing; returns a Boolean Series or DataFrame
  • pdata.isnull() / pdata.notnull(): the same, called as methods
sdata = pd.Series([1,2,3, np.nan])
pd.isnull(sdata)
0    False
1    False
2    False
3     True
dtype: bool
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9,7.5,None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pd.isnull(pdata)
s1 s2
0 False False
1 False False
2 True False
3 False True

3.3 Checking Whether Any Value Is Missing

  1. Approach 1: combine np.all with pd.notnull
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9,7.5,None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
# pd.notnull: a missing value maps to False
# np.all: returns False if any element is false, otherwise True
np.all(pd.notnull(pdata))
# False means the data contains missing values; True means it does not
False
s1 = [10, 10.5, 11]
s2 = [7, 6.9,7.5]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
np.all(pd.notnull(pdata))
True
  2. Approach 2: combine np.any with pd.isnull
s1 = [10, 10.5, 11]
s2 = [7, 6.9,7.5]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
# isnull: a missing value maps to True
# np.any: returns True if any element is true
np.any(pd.isnull(pdata))
# False means there are no missing values; True means there are
False
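The same check can be written with pandas methods alone, which also makes it easy to count the missing values per column. A minimal sketch on the two-stock data used above:

```python
import pandas as pd

df = pd.DataFrame({'s1': [10, 10.5, None, 11], 's2': [7, 6.9, 7.5, None]})

# any() per column, then any() across columns: True if any cell is missing.
has_missing = bool(df.isna().any().any())

# sum() of the boolean mask counts the missing cells in each column.
missing_per_column = df.isna().sum()
print(has_missing)
print(missing_per_column)
```

isna and isnull are aliases, so either name gives the same result.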

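3.4 Ways to Handle Missing Values

Options for handling missing values:

  • Filter out missing values (by row or column)
  • Drop missing values (by row or column)
  • Fill in values, using one of:
    • the mean, median, maximum, minimum, and so on
    • a special value
    • the previous or next value
3.4.1 Filtering Missing Values
  1. Example: one week of closing prices for two stocks, where None means trading was suspended that day

    Requirement: get the days on which neither stock was suspended

# Data: one week of closing prices for two stocks; None means trading was suspended
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9,7.5,None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pdata
s1 s2
0 10.0 7.0
1 10.5 6.9
2 NaN 7.5
3 11.0 NaN
# Requirement: get the days on which neither stock was suspended
# Build a boolean index
bindex = np.all(pdata.notnull(), axis=1)
bindex
0     True
1     True
2    False
3    False
dtype: bool
# Select the rows without a suspension
pdata[bindex]
s1 s2
0 10.0 7.0
1 10.5 6.9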
s1 s2
0 10.0 7.0
1 10.5 6.9
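3.4.2 Dropping Missing Values

pdata.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Main parameters:

Parameter    Description
axis    0 or 'index': drop rows; 1 or 'columns': drop columns
how    'any': drop if at least one value is NA; 'all': drop only if every value is NA
thresh    required number of non-NA values; a row or column with at least thresh non-NA values is kept
subset    labels along the other axis to check, for example the columns to inspect when dropping rows
inplace    True: modify the original data in place

Prepare the data:

s1 = [10, 10.5, None, 11]
s2 = [7, 6.9,7.5,None]
s3 = [7, 6.9,7.5,7]
s4 = [None, 6.9,None,7.2]
pdata = pd.DataFrame({'s1':s1, 's2':s2, 's3':s3,'s4':s4})
pdata
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
2 NaN 7.5 7.5 NaN
3 11.0 NaN 7.0 7.2

Requirements:

  1. Drop rows that contain missing values
  2. Drop rows that contain two or more missing values
  3. Drop rows with missing values in specific columns
  4. Drop columns that contain missing values
# Drop rows that contain missing values
pdata.dropna()
s1 s2 s3 s4
1 10.5 6.9 6.9 6.9
# Drop rows with more than one missing value: with 4 columns, set thresh=3
# thresh: required number of non-NA values (rows with at least thresh non-NA values are kept)
pdata.dropna(thresh=3)
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
3 11.0 NaN 7.0 7.2
# Check only the columns ['s1','s4']
pdata.dropna(subset=['s1','s4'])
s1 s2 s3 s4
1 10.5 6.9 6.9 6.9
3 11.0 NaN 7.0 7.2
# Drop columns that contain missing values
pdata.dropna(axis=1)
s3
0 7.0
1 6.9
2 7.5
3 7.0

Notes:

  • None of the calls above modifies the original data
  • Pass inplace=True to modify the original data in place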
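The parameter table mentions how='all', which the examples do not exercise. The sketch below, on a small made-up frame, shows it next to the effect of inplace=True:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, None], 'b': [2.0, None, 5.0]})

# how='all' drops only the rows in which every value is missing.
kept = df.dropna(how='all')

# The original is unchanged until inplace=True is used.
rows_before = len(df)
df.dropna(inplace=True)
print(kept.index.tolist(), rows_before, len(df))
```

Row 1 is the only row that is entirely missing, so how='all' keeps rows 0 and 2, while the default how='any' keeps only row 0.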
3.4.3 Filling Missing Values

Fill method:

pdata.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Parameter    Description
value    the fill value
method    fill method: {'backfill', 'bfill', 'pad', 'ffill', None}; deprecated since pandas 2.1 in favor of ffill() and bfill()
axis    0 or 'index': fill down each column; 1 or 'columns': fill across each row
limit    maximum number of values to fill
pdata
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
2 NaN 7.5 7.5 NaN
3 11.0 NaN 7.0 7.2

Requirements:

  • Fill missing values with the fixed value 0
  • Fill with the previous or next value
  • Fill with the mean
# Fixed value 0
pdata.fillna(0)
s1 s2 s3 s4
0 10.0 7.0 7.0 0.0
1 10.5 6.9 6.9 6.9
2 0.0 7.5 7.5 0.0
3 11.0 0.0 7.0 7.2
# Fill with the value from the previous row
# (older form: pdata.fillna(method='ffill'))
pdata.ffill()
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
2 10.5 7.5 7.5 6.9
3 11.0 7.5 7.0 7.2
# Fill with the value from the next row
# (older form: pdata.fillna(method='bfill'))
pdata.bfill()
s1 s2 s3 s4
0 10.0 7.0 7.0 6.9
1 10.5 6.9 6.9 6.9
2 11.0 7.5 7.5 7.2
3 11.0 NaN 7.0 7.2
# Fill with the value from the next column
# (older form: pdata.fillna(axis=1, method='bfill'))
pdata.bfill(axis=1)
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
2 7.5 7.5 7.5 NaN
3 11.0 7.0 7.0 7.2

For missing stock prices, we prefer to fill with the previous day's value.
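When a stock is suspended for a long stretch, carrying a stale price forward indefinitely may not be desirable. As a sketch with made-up prices, the limit parameter caps how many consecutive gaps are filled:

```python
import pandas as pd

# Illustrative closing prices with three consecutive suspended days.
prices = pd.Series([10.0, None, None, None, 11.0])

# Carry the last known price forward across at most two consecutive gaps.
filled = prices.ffill(limit=2)
print(filled.tolist())
```

The third gap stays missing, so it can still be detected and handled separately.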

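3.4.4 Filling with the Mean, Median, Maximum, or Minimum

pdata.mean/max/min/median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Main parameters:

Parameter    Description
axis    direction: 0 computes one value per column, 1 computes one value per row
skipna    whether to ignore NaN; True excludes NaN from the calculation (default True)
pdata
s1 s2 s3 s4
0 10.0 7.0 7.0 NaN
1 10.5 6.9 6.9 6.9
2 NaN 7.5 7.5 NaN
3 11.0 NaN 7.0 7.2
  1. Fill with the mean
# Fill with each column's mean
pdata.fillna(pdata.mean())
s1 s2 s3 s4
0 10.0 7.000000 7.0 7.05
1 10.5 6.900000 6.9 6.90
2 10.5 7.500000 7.5 7.05
3 11.0 7.133333 7.0 7.20
  2. Fill with the median
# Fill with each column's median
pdata.fillna(pdata.median())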
s1 s2 s3 s4
0 10.0 7.0 7.0 7.05
1 10.5 6.9 6.9 6.90
2 10.5 7.5 7.5 7.05
3 11.0 7.0 7.0 7.20
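Filling with the maximum or minimum works the same way as the mean and median. The sketch below also shows that fillna accepts a dict, so each column can get its own fill value; the choice of values here is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'s1': [10, 10.5, None, 11], 's2': [7, 6.9, 7.5, None]})

# Fill each column with its own maximum.
by_max = df.fillna(df.max())

# A dict fills each column differently: the minimum for s1, 0 for s2.
mixed = df.fillna({'s1': df['s1'].min(), 's2': 0})
print(by_max)
print(mixed)
```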

4 Data Cleaning

4.1 Preparing the Data

Scores from an exam:

import pandas as pd
import numpy as np

names = list('ABCD')
math = [90,100,50,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
name math chinese
0 A 90 89
1 B 100 96
2 C 50 58
3 D 80 77

4.2 Getting a Single Column

Get the values directly by column name.

# Get the values directly by column name
pdata['name']
0    A
1    B
2    C
3    D
Name: name, dtype: object

4.3 Getting Several Columns

Basic syntax: pdata[[col1, col2]]

Get the names and math scores:

# Get several columns
# Basic syntax: pdata[[col1, col2]]
pdata[['name', 'math']]
name math
0 A 90
1 B 100
2 C 50
3 D 80
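4.4 Selecting Rows by a Condition

  1. Requirement 1: all scores of students whose math score is above 80

    Approach:

    • Build a boolean index from the condition
    • Select the data with the boolean index
# Requirement 1: all scores of students whose math score is above 80
bindex = pdata['math'] > 80
pdata[bindex]
name math chinese
0 A 90 89
1 B 100 96
  2. Requirement 2: get student A's scores

    Approach:

    • Build a boolean index from the condition
    • Select the data with the boolean index
# Requirement 2: get student A's scores
pdata[pdata['name']=='A']
name math chinese
0 A 90 89

4.5 Selecting Rows by Several Conditions

  1. Requirement 1: get the students who passed both math and Chinese
    • Condition 1: math score above 59
    • Condition 2: Chinese score above 59
    • Combine the two conditions with AND: &
    • Basic syntax: pdata[condition1 & condition2]
# Requirement 1: get the students who passed both math and Chinese
# Note: each condition must be wrapped in parentheses
pdata[(pdata['math']>59) & (pdata['chinese']>59)]
name math chinese
0 A 90 89
1 B 100 96
3 D 80 77
  2. Requirement 2: get the students with at least 80 in math or in Chinese
    • Condition 1: math score at least 80
    • Condition 2: Chinese score at least 80
    • Combine the two conditions with OR: |
    • Basic syntax: pdata[condition1 | condition2]
# Requirement 2: get the students with at least 80 in math or in Chinese
# Note: each condition must be wrapped in parentheses
pdata[(pdata['math']>=80) | (pdata['chinese']>=80)]
name math chinese
0 A 90 89
1 B 100 96
3 D 80 77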
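  3. Select by membership in a set

    Get the scores of students whose math score is 100 or 90.

    • To test against several values, use pdata.isin(values), which returns a boolean index
bindex = pdata['math'].isin([100, 90])
pdata[bindex]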
name math chinese
0 A 90 89
1 B 100 96
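Two related idioms round out boolean selection. The sketch below, on the same exam data, negates a boolean index with ~ and expresses a combined condition with DataFrame.query, which some readers find easier to scan than nested parentheses:

```python
import pandas as pd

df = pd.DataFrame({'name': list('ABCD'),
                   'math': [90, 100, 50, 80],
                   'chinese': [89, 96, 58, 77]})

# ~ negates a boolean index: students whose math score is NOT 100 or 90.
others = df[~df['math'].isin([100, 90])]

# query() takes the combined condition as a string.
passed = df.query('math > 59 and chinese > 59')
print(others['name'].tolist())
print(passed['name'].tolist())
```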

4.6 Sorting

  • Option 1: sort by index

    pdata.sort_index(axis=0,level=None,ascending=True,...)

  • Option 2: sort by the values of given columns

    pdata.sort_values(by,axis=0,ascending=True,...)

# Sorting
import pandas as pd
import numpy as np
names = list('ABCD')
math = [90,100,80,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
name math chinese
0 A 90 89
1 B 100 96
2 C 80 58
3 D 80 77
# Sort by index in descending order (ascending=False)
pdata.sort_index(ascending=False)
name math chinese
3 D 80 77
2 C 80 58
1 B 100 96
0 A 90 89
# Sort by math score in descending order (ascending=False)
pdata.sort_values(['math'], ascending=False)
name math chinese
1 B 100 96
0 A 90 89
2 C 80 58
3 D 80 77
# Sort by math score, then by Chinese score, both in descending order
# Pass both columns to sort_values
pdata.sort_values(['math', 'chinese'], ascending=False)
name math chinese
1 B 100 96
0 A 90 89
3 D 80 77
2 C 80 58
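The sort direction can also differ per column. As a sketch on the same data, passing a list to ascending sorts math in descending order and breaks ties by Chinese score in ascending order:

```python
import pandas as pd

df = pd.DataFrame({'name': list('ABCD'),
                   'math': [90, 100, 80, 80],
                   'chinese': [89, 96, 58, 77]})

# One ascending flag per sort column: math descending, chinese ascending.
ordered = df.sort_values(['math', 'chinese'], ascending=[False, True])
print(ordered['name'].tolist())
```

C and D tie on math, so the second key decides their order.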

5 Summaries and Descriptive Statistics in pandas

Calculation and statistics methods in pandas:

  • Maximum: pdata.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • Minimum: pdata.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • Mean: pdata.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • Median: pdata.median(axis=None,skipna=None,level=None,numeric_only=None, **kwargs)
  • Sum: pdata.sum(axis=None,skipna=None,level=None,numeric_only=None,min_count=0,**kwargs)
  • Variance: pdata.var(axis=None,skipna=None,level=None,ddof=1,numeric_only=None,**kwargs)
  • Standard deviation: pdata.std(axis=None,skipna=None,level=None,ddof=1,numeric_only=None,**kwargs)
  • Cumulative sum: pdata.cumsum(axis=None, skipna=True, *args, **kwargs)
  • Quantile: pdata.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
  • Mean absolute deviation (the average distance of each value from the mean): pdata.mad(axis=None, skipna=None, level=None); this method was removed in pandas 2.0
  • Percentage change from the previous element: pdata.pct_change(periods=1,fill_method='pad',limit=None,freq=None,**kwargs)
  • Skewness: pdata.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • Kurtosis: pdata.kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • Summary description: pdata.describe(percentiles=None, include=None, exclude=None)
  1. Create the data
import pandas as pd
import numpy as np

names = list('ABCD')
math = [90,100,80,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
name math chinese
0 A 90 89
1 B 100 96
2 C 80 58
3 D 80 77
  2. Basic description and calculation
# Count, mean, standard deviation, minimum, quartiles, and maximum
pdata.describe()
math chinese
count 4.000000 4.0000
mean 87.500000 80.0000
std 9.574271 16.6333
min 80.000000 58.0000
25% 80.000000 72.2500
50% 85.000000 83.0000
75% 92.500000 90.7500
max 100.000000 96.0000
# Total and average score of each student
print(pdata.iloc[:, 1:].sum(axis=1))
print(pdata.iloc[:, 1:].mean(axis=1))
0    179
1    196
2    138
3    157
dtype: int64
0    89.5
1    98.0
2    69.0
3    78.5
dtype: float64
  3. Quantiles

    Compute the lower quartile, the median, and the upper quartile.

# Lower quartile
print(pdata.iloc[:, 1:].quantile(q=0.25))
# Median
print(pdata.iloc[:, 1:].quantile(q=0.5))
# Upper quartile
print(pdata.iloc[:, 1:].quantile(q=0.75))
math       80.00
chinese    72.25
Name: 0.25, dtype: float64
math       85.0
chinese    83.0
Name: 0.5, dtype: float64
math       92.50
chinese    90.75
Name: 0.75, dtype: float64
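When only a few statistics are needed, agg with a list of function names computes them in one call instead of a full describe(). A minimal sketch on the same score data:

```python
import pandas as pd

df = pd.DataFrame({'name': list('ABCD'),
                   'math': [90, 100, 80, 80],
                   'chinese': [89, 96, 58, 77]})

# Select the numeric columns, then compute several statistics at once.
scores = df[['math', 'chinese']]
summary = scores.agg(['min', 'max', 'mean'])
print(summary)
```

Each requested statistic becomes one row of the result, and each score column stays a column.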

6 Indexes and Multi-Level Indexes

Main topics:

  • Setting the index
  • Transforming the index
  • Multi-level indexes
import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)),columns=['math','chinese'])
pdata['team'] = [1]*4+[2]*4
pdata['name'] = names
pdata
math chinese team name
0 58 67 1 A
1 63 97 1 B
2 55 86 1 C
3 54 34 1 D
4 75 57 2 A
5 49 64 2 B
6 83 88 2 C
7 67 75 2 D

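6.1 Setting the Index

  • Use the name column as the index

  • Method for setting the index: pdata.set_index(keys,drop=True,append=False,inplace=False)

Parameter    Description
keys    column label(s) to use as the index; may be several columns
drop    True: remove the columns used as the index; False: keep them as columns
append    True: append to the existing index; False: replace the existing index
inplace    True: modify the original data; False: return new data
# Set the index
# set_index returns a new DataFrame by default (inplace=False)
ndata = pdata.set_index('name')
ndata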
math chinese team
name
A 58 67 1
B 63 97 1
C 55 86 1
D 54 34 1
A 75 57 2
B 49 64 2
C 83 88 2
D 67 75 2

6.2 Multi-Level Indexes

Requirement: select the data for a given term through the index.

pd.MultiIndex.from_arrays(arrays, sortorder=None, names=None)

import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)),columns=['math','chinese'])
pdata
math chinese
0 53 71
1 43 98
2 66 38
3 87 69
4 92 95
5 80 53
6 62 98
7 84 64
lv1 = [1]*4+[2]*4
lv2 = list('ABCDABCD')
# Create a MultiIndex object
mindex = pd.MultiIndex.from_arrays([lv1, lv2])
data = pdata.set_index(mindex)
data
math chinese
1 A 53 71
B 43 98
C 66 38
D 87 69
2 A 92 95
B 80 53
C 62 98
D 84 64

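6.3 Selecting Values Through a Multi-Level Index

Requirements:

  • Get the data for the first term
  • Get student A's data for the first term
  • Get all of student A's data
# Data for the first term
data.loc[1]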
math chinese
A 53 71
B 43 98
C 66 38
D 87 69
# Student A's data for the first term
data.loc[(1,'A')]
math       53
chinese    71
Name: (1, A), dtype: int32
# How do we get all of student A's data?
# Swap the index levels
data = data.swaplevel(0,1)
data
math chinese
A 1 53 71
B 1 43 98
C 1 66 38
D 1 87 69
A 2 92 95
B 2 80 53
C 2 62 98
D 2 84 64
data.loc['A']
math chinese
1 53 71
2 92 95
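Swapping levels is one way to reach the inner level. Another is DataFrame.xs, which selects on a named level directly. The sketch below uses a reduced version of the data; the level names term and name are labels introduced here for illustration, since the original index is unnamed:

```python
import pandas as pd

# Level names 'term' and 'name' are assumptions added for this example.
index = pd.MultiIndex.from_arrays([[1, 1, 2, 2], list('ABAB')],
                                  names=['term', 'name'])
df = pd.DataFrame({'math': [53, 43, 92, 80]}, index=index)

# xs() selects on an inner level without swapping levels first.
a_rows = df.xs('A', level='name')
print(a_rows)
```

The selected level is dropped from the result, so a_rows is indexed by term only.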

6.4 Reshaping Between Rows and Columns

  • pdata.stack(level=-1, dropna=True): pivots columns into rows
  • pdata.unstack(level=-1, fill_value=None): pivots rows into columns
import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)),columns=['math','chinese'])
lv1 = [1]*4+[2]*4
lv2 = list('ABCDABCD')
mindex = pd.MultiIndex.from_arrays([lv1, lv2])
pdata = pdata.set_index(mindex)
pdata
math chinese
1 A 84 96
B 75 70
C 86 55
D 97 98
2 A 40 37
B 79 43
C 72 64
D 82 37

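Requirements:

  • Move the term level to the columns
  • Get the math scores of all students
  • Get the data for the first term
  1. Move the term level to the columns
# Move the term level to the columns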
tmp = pdata.unstack(level=0)
tmp
math chinese
1 2 1 2
A 84 40 96 37
B 75 79 70 43
C 86 72 55 64
D 97 82 98 37
  2. Get the math scores of all students
# Get the math scores of all students
tmp['math']
1 2
A 84 40
B 75 79
C 86 72
D 97 82
  3. Get the data for the first term
# Swap the column index levels, then get the data for the first term
tmp.swaplevel(axis=1)[1]
math chinese
A 84 96
B 75 70
C 86 55
D 97 98
tmp
math chinese
1 2 1 2
A 84 40 96 37
B 75 79 70 43
C 86 72 55 64
D 97 82 98 37
t = tmp.stack(level=0)
t
1 2
A chinese 96 37
math 84 40
B chinese 70 43
math 75 79
C chinese 55 64
math 86 72
D chinese 98 37
math 97 82
# Get student A's scores for the first term
t.loc['A'][1]
chinese    96
math       84
Name: 1, dtype: int32
t.sort_index(level = 1)
1 2
A chinese 96 37
B chinese 70 43
C chinese 55 64
D chinese 98 37
A math 84 40
B math 75 79
C math 86 72
D math 97 82
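unstack and stack are inverse operations, which a small round trip can illustrate. The sketch uses a reduced version of the data with assumed level names (term, name); depending on the pandas version, stack may emit a FutureWarning about its newer implementation, which does not affect the values here:

```python
import pandas as pd

# Level names 'term' and 'name' are assumptions added for this example.
index = pd.MultiIndex.from_arrays([[1, 1, 2, 2], list('ABAB')],
                                  names=['term', 'name'])
df = pd.DataFrame({'math': [84, 75, 40, 79]}, index=index)

# unstack: the term level moves from the rows to the columns.
wide = df.unstack(level='term')

# stack: the term level moves back to the rows; restore the level order.
back = wide.stack(level='term').swaplevel().sort_index()
print(wide)
print(back)
```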

7 时间与时间序列

时间是数据分析重要维度,pandas中时间主要知识点:

  • 时间戳
  • 周期
  • 时间间隔
  • 时间索引
  • 时间滑动窗口间滑动窗口

7.1 时间戳

时间处理中常见的对象;

时间戳方法:

  • pd.Timestamp(ts_input=,freq=None, tz=None,...),详情见说明案例
  • pd.to_datetime(arg,errors='raise', dayfirst=False,...)
  • #年
    print(pd.Timestamp(2020))
    # Year, month, day
    print(pd.Timestamp(2020, 6, 2))
    # Date string
    print(pd.Timestamp('2020-05-07'))
    # Date-time string
    print(pd.Timestamp('2020-05-07 04:02:01'))
    # ISO 8601 string with an hour component
    print(pd.Timestamp('2017-01-01T12'))
    # Unix timestamp in seconds
    print(pd.Timestamp(1513393355.5, unit='s'))
    # '2017-02-01' parsed with a matching format string
    print(pd.to_datetime('2017-02-01',format="%Y-%m-%d"))
    # '20170301' parsed as year, month, day
    print(pd.to_datetime('20170301',format="%Y%m%d"))
    
    1970-01-01 00:00:00.000002020
    2020-06-02 00:00:00
    2020-05-07 00:00:00
    2020-05-07 04:02:01
    2017-01-01 12:00:00
    2017-12-16 03:02:35.500000
    2017-02-01 00:00:00
    2017-03-01 00:00:00
    
    print(pd.to_datetime('2017-02-01',format="%Y-%m-%d"))
    # Passing a list returns a DatetimeIndex
    print(pd.to_datetime(['2017-02-01'],format="%Y-%m-%d"))
    
    2017-02-01 00:00:00
    DatetimeIndex(['2017-02-01'], dtype='datetime64[ns]', freq=None)
    
    pd.Timestamp.now()
    
    Timestamp('2024-02-01 10:45:29.038280')
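
The signature above lists `errors='raise'` without showing what it controls. A minimal sketch, using made-up input strings, of how `errors='coerce'` turns unparseable values into `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical raw input: one valid date, one impossible date, one non-date.
raw = ["2017-02-01", "2017-02-30", "not a date"]

# errors='coerce' replaces anything that does not match the format with NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Count how many inputs failed to parse.
n_bad = parsed.isna().sum()
print(n_bad)
```

This is often convenient when cleaning user-entered dates, since the bad rows can be inspected or dropped afterwards.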
    

    7.2 Time index

    pd.date_range(start=None,end=None,periods=None,freq=None,tz=None, ... **kwargs): generates a DatetimeIndex

    Official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html

    Parameter  Description
    start      Start time
    end        End time
    periods    Number of periods to generate
    freq       Step between values; defaults to D (day). Other options include H (hour), T (minute), and so on
    closed     Which endpoints are included: None, 'left', or 'right' (pandas 1.4 and later name this parameter inclusive)

    Main freq aliases:

    Alias  Description
    M      Last day of each month
    MS     First day of each month
    D      Day
    H      Hour
    T      Minute (min)
    S      Second
    Q      Quarter
    # Daily frequency (the default)
    print(pd.date_range('2017-01-01', periods=2))
    # Hourly frequency
    print(pd.date_range('2017-01-01 02', periods=2, freq='H'))
    # First day of each month
    print(pd.date_range('2017-01', periods=3, freq='MS'))
    # Last day of each month
    print(pd.date_range('2017-01', periods=3, freq='M'))
    
    DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq='D')
    DatetimeIndex(['2017-01-01 02:00:00', '2017-01-01 03:00:00'], dtype='datetime64[ns]', freq='H')
    DatetimeIndex(['2017-01-01', '2017-02-01', '2017-03-01'], dtype='datetime64[ns]', freq='MS')
    DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31'], dtype='datetime64[ns]', freq='M')
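
The examples above always pass `periods`. A short sketch of the other common call shape, `start` plus `end`, follows. It sticks to the aliases `'D'`, `'min'`, and `'MS'`, since recent pandas releases (2.2 and later) deprecate the single-letter aliases `'H'`, `'T'`, `'S'`, and `'M'` in favor of `'h'`, `'min'`, `'s'`, and `'ME'`; the dates themselves are arbitrary.

```python
import pandas as pd

# start + end: freq defaults to 'D', both endpoints are included.
days = pd.date_range(start="2017-01-01", end="2017-01-05")

# 'min' is the long-form minute alias.
minutes = pd.date_range("2017-01-01 02:00", periods=3, freq="min")

# Month starts that fall inside the given range.
month_starts = pd.date_range("2017-01-01", "2017-04-15", freq="MS")

print(len(days), minutes[-1], list(month_starts.month))
```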
    

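    7.3 Periods

    A Period represents a span of time (a day, a month, a year) rather than a single instant; a PeriodIndex is a sequence of such spans at a fixed frequency.

    Use: periods make it easy to select the data for a given year, month, day, and so on.

    • Period: pd.Period(value=None,freq=None,ordinal=None,year=None,...)
    • Period sequence: pd.period_range(start=None, end=None, periods=None, freq=None, name=None): generates a PeriodIndex
    • Conversion: pdata.to_period(freq=None, axis=0, copy=True)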
    import pandas as pd
    import numpy as np
    print(pd.Timestamp('2017-01-02'))
    print(pd.Period('2017-01-02'))
    print(pd.Period('2017-01'))
    print(pd.Period('2017'))
    
    2017-01-02 00:00:00
    2017-01-02
    2017-01
    2017
    
    # Daily periods (the default)
    print(pd.period_range('2017-01-01', periods=2))
    # Monthly periods
    print(pd.period_range('2017-01', periods=2, freq='M'))
    # Hourly periods
    print(pd.period_range('2017-01-02', periods=2, freq='H'))
    
    PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]')
    PeriodIndex(['2017-01', '2017-02'], dtype='period[M]')
    PeriodIndex(['2017-01-02 00:00', '2017-01-02 01:00'], dtype='period[H]')
    
    tmp = pd.to_datetime(['2017-01-02', '2017-03-04'])
    tmp.to_period('M')
    
    PeriodIndex(['2017-01', '2017-03'], dtype='period[M]')
    

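    7.4 Time index

    Many datasets have a time dimension, which can be set as a time index.

    Topics:

    • Select the data for a specific date
    • Select the data for a date range
    • Select the data for a period
    # Prepare the data
    import pandas as pd
    import numpy as np
    # Note: the dates are strings here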
    ts = ['2019-03-25','2019-03-26','2019-03-27','2019-03-28','2019-03-29','2019-03-30','2019-03-31',
     '2019-04-01','2019-04-02','2019-04-03','2019-04-04','2019-05-01','2019-04-05',]
    values = np.arange(len(ts))
    pdata = pd.DataFrame({'ts':ts, 'values':values})
    pdata
    
    ts values
    0 2019-03-25 0
    1 2019-03-26 1
    2 2019-03-27 2
    3 2019-03-28 3
    4 2019-03-29 4
    5 2019-03-30 5
    6 2019-03-31 6
    7 2019-04-01 7
    8 2019-04-02 8
    9 2019-04-03 9
    10 2019-04-04 10
    11 2019-05-01 11
    12 2019-04-05 12
    pdata.ts.values
    
    array(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
           '2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
           '2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
           '2019-04-05'], dtype=object)
    

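    Tasks:

    • Get the March data
    • Get the data from April 1 to April 5
    • Get the second-quarter data
    # Build the time index
    # Approach: convert the string column with pd.to_datetime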
    tindex = pd.to_datetime(pdata['ts'])
    tindex
    
    0    2019-03-25
    1    2019-03-26
    2    2019-03-27
    3    2019-03-28
    4    2019-03-29
    5    2019-03-30
    6    2019-03-31
    7    2019-04-01
    8    2019-04-02
    9    2019-04-03
    10   2019-04-04
    11   2019-05-01
    12   2019-04-05
    Name: ts, dtype: datetime64[ns]
    
    # Set the time index
    tmp = pdata.set_index(tindex)
    tmp.index
    
    DatetimeIndex(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
                   '2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
                   '2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
                   '2019-04-05'],
                  dtype='datetime64[ns]', name='ts', freq=None)
    
    # March data
    tmp.loc['2019-03']
    
    ts values
    ts
    2019-03-25 2019-03-25 0
    2019-03-26 2019-03-26 1
    2019-03-27 2019-03-27 2
    2019-03-28 2019-03-28 3
    2019-03-29 2019-03-29 4
    2019-03-30 2019-03-30 5
    2019-03-31 2019-03-31 6
    # April 1 to April 5: label slicing (sort the index first; both endpoints are included)
    tmp = tmp.sort_index()
    tmp.loc['2019-04-01':'2019-04-05']
    
    ts values
    ts
    2019-04-01 2019-04-01 7
    2019-04-02 2019-04-02 8
    2019-04-03 2019-04-03 9
    2019-04-04 2019-04-04 10
    2019-04-05 2019-04-05 12
    # Second-quarter data
    # Build a quarterly PeriodIndex
    qindex = tmp.index.to_period('Q')
    qdata = tmp.set_index(qindex)
    qdata.loc['2019Q2']
    
    ts values
    ts
    2019Q2 2019-04-01 7
    2019Q2 2019-04-02 8
    2019Q2 2019-04-03 9
    2019Q2 2019-04-04 10
    2019Q2 2019-04-05 12
    2019Q2 2019-05-01 11
    tmp.index.to_period('Q')
    
    PeriodIndex(['2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1',
                 '2019Q1', '2019Q2', '2019Q2', '2019Q2', '2019Q2', '2019Q2',
                 '2019Q2'],
                dtype='period[Q-DEC]', name='ts')
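
Partial-string indexing and a PeriodIndex are not the only way to select by month or quarter. A minimal sketch, using a shortened copy of the dates above, that filters with boolean masks built from `DatetimeIndex` attributes:

```python
import pandas as pd

# A shortened, unsorted set of dates (subset of the data above).
ts = pd.to_datetime(["2019-03-25", "2019-03-31", "2019-04-01", "2019-05-01", "2019-04-05"])
data = pd.DataFrame({"values": range(len(ts))}, index=ts)

# .month and .quarter return one integer per index entry.
march = data[data.index.month == 3]
q2 = data[data.index.quarter == 2]

print(march)
print(q2)
```

Unlike label slicing, a boolean mask does not need the index to be sorted first.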
    

    7.5 Time deltas

    • Timedelta: the difference between two datetime values, expressed in units such as days, seconds, and microseconds

    • pd.Timedelta(value=, unit=None, **kwargs)

      # Subtracting two Timestamps
      pd.Timestamp('2019-01-02')-pd.Timestamp('2019-01-01')
      
      Timedelta('1 days 00:00:00')
      
      pd.Timedelta(5, 'T')
      
      Timedelta('0 days 00:05:00')
      

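      7.6 Resampling

      Resampling is used to:

      • Downsample (lower the sampling rate)
      • Upsample (raise the sampling rate)
      • Method: pdata.resample(rule,how=None,axis=0,fill_method=None,closed=None,label=None,convention='start'...)
      Parameter    Description
      rule         Target frequency, e.g. 'T' (minute) or 'M' (month)
      fill_method  How to fill new slots when upsampling: 'ffill' (forward fill), 'bfill' (backward fill), etc.
      closed       Which side of each bin is closed when downsampling: 'right' or 'left'. The default is 'left' for most frequencies and 'right' for month-, quarter-, year-, and week-end frequencies
      label        Which bin edge labels the aggregated value: 'right' or 'left'
      loffset      Time offset applied to the resulting index; may be a timedelta
      kind         Type of the resulting index: 'period' or 'timestamp'; defaults to the type of the input index

      Note: how, fill_method, and loffset come from older pandas releases; current versions expect a method call on the Resampler instead, such as .mean() or .ffill().

      Typical applications: stock analysis, finance, and similar fields.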

      import pandas as pd
      import numpy as np
      index = pd.date_range('1/1/2000', periods=9, freq='2T')
      series = pd.Series(range(9), index=index)
      series
      
      2000-01-01 00:00:00    0
      2000-01-01 00:02:00    1
      2000-01-01 00:04:00    2
      2000-01-01 00:06:00    3
      2000-01-01 00:08:00    4
      2000-01-01 00:10:00    5
      2000-01-01 00:12:00    6
      2000-01-01 00:14:00    7
      2000-01-01 00:16:00    8
      Freq: 2T, dtype: int64
      
      # Downsample to 4-minute bins; aggregate afterwards with a method such as mean()
      s = series.resample('4T')
      s.groups
      
      {Timestamp('2000-01-01 00:00:00'): 2,
       Timestamp('2000-01-01 00:04:00'): 4,
       Timestamp('2000-01-01 00:08:00'): 6,
       Timestamp('2000-01-01 00:12:00'): 8,
       Timestamp('2000-01-01 00:16:00'): 9}
      

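      Resampler attributes and methods:

      Method      Description
      s.groups    Dict of the bins, keyed by bin label
      s.max()     Maximum of each bin
      s.min()     Minimum of each bin
      s.first()   First value of each bin
      s.last()    Last value of each bin
      s.mean()    Mean of each bin
      s.median()  Median of each bin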
      s.first()
      
      2000-01-01 00:00:00    0
      2000-01-01 00:04:00    2
      2000-01-01 00:08:00    4
      2000-01-01 00:12:00    6
      2000-01-01 00:16:00    8
      Freq: 4T, dtype: int64
      
      s.max()
      
      2000-01-01 00:00:00    1
      2000-01-01 00:04:00    3
      2000-01-01 00:08:00    5
      2000-01-01 00:12:00    7
      2000-01-01 00:16:00    8
      Freq: 4T, dtype: int64
      
      s.mean()
      
      2000-01-01 00:00:00    0.5
      2000-01-01 00:04:00    2.5
      2000-01-01 00:08:00    4.5
      2000-01-01 00:12:00    6.5
      2000-01-01 00:16:00    8.0
      Freq: 4T, dtype: float64
      
      # Upsample to 1-minute intervals; bfill() fills each new slot with the next known value
      rd = series.resample('T')
      rd.bfill()
      
      2000-01-01 00:00:00    0
      2000-01-01 00:01:00    1
      2000-01-01 00:02:00    1
      2000-01-01 00:03:00    2
      2000-01-01 00:04:00    2
      2000-01-01 00:05:00    3
      2000-01-01 00:06:00    3
      2000-01-01 00:07:00    4
      2000-01-01 00:08:00    4
      2000-01-01 00:09:00    5
      2000-01-01 00:10:00    5
      2000-01-01 00:11:00    6
      2000-01-01 00:12:00    6
      2000-01-01 00:13:00    7
      2000-01-01 00:14:00    7
      2000-01-01 00:15:00    8
      2000-01-01 00:16:00    8
      Freq: T, dtype: int64
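
Beyond the single-statistic methods listed above, a Resampler also accepts `sum()` and `agg()` with several functions at once. A minimal sketch on the same nine-point series, written with the long-form `'min'` alias (an assumption made so the snippet also runs on pandas versions that deprecate `'T'`):

```python
import pandas as pd

# Same shape as the series above: one value every 2 minutes.
index = pd.date_range("2000-01-01", periods=9, freq="2min")
series = pd.Series(range(9), index=index)

# Downsample to 4-minute bins and total each bin.
totals = series.resample("4min").sum()

# Several statistics per bin in one call; the result is a DataFrame.
summary = series.resample("4min").agg(["min", "max", "mean"])

print(totals)
print(summary)
```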
      

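      7.7 Shifting

      • A common time-series operation: shift the data along the time axis
      • Shift the data: df.shift(periods=1, freq=None, axis=0, fill_value=None)
      • Shift the index: df.tshift(periods=1, freq=None, axis=0). This method is deprecated; use shift with the freq argument instead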
      index = pd.date_range('1/1/2000', periods=9, freq='D')
      series = pd.Series(range(9), index=index)
      series
      
      2000-01-01    0
      2000-01-02    1
      2000-01-03    2
      2000-01-04    3
      2000-01-05    4
      2000-01-06    5
      2000-01-07    6
      2000-01-08    7
      2000-01-09    8
      Freq: D, dtype: int64
      
      series.shift(1, freq='T')
      
      2000-01-01 00:01:00    0
      2000-01-02 00:01:00    1
      2000-01-03 00:01:00    2
      2000-01-04 00:01:00    3
      2000-01-05 00:01:00    4
      2000-01-06 00:01:00    5
      2000-01-07 00:01:00    6
      2000-01-08 00:01:00    7
      2000-01-09 00:01:00    8
      Freq: D, dtype: int64
      
      series.shift(1)
      
      2000-01-01    NaN
      2000-01-02    0.0
      2000-01-03    1.0
      2000-01-04    2.0
      2000-01-05    3.0
      2000-01-06    4.0
      2000-01-07    5.0
      2000-01-08    6.0
      2000-01-09    7.0
      Freq: D, dtype: float64
      
      series.shift(1, freq='D')
      
      2000-01-02    0
      2000-01-03    1
      2000-01-04    2
      2000-01-05    3
      2000-01-06    4
      2000-01-07    5
      2000-01-08    6
      2000-01-09    7
      2000-01-10    8
      Freq: D, dtype: int64
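
A typical use of `shift(1)` without `freq` is comparing each value with the previous one. A minimal sketch with made-up daily prices:

```python
import pandas as pd

# Hypothetical daily prices, for illustration only.
index = pd.date_range("2000-01-01", periods=5, freq="D")
prices = pd.Series([10.0, 12.0, 11.0, 15.0, 15.0], index=index)

# shift(1) moves the values down one row, so this is "today minus yesterday".
change = prices - prices.shift(1)
print(change)
```

The first entry is NaN because there is no previous day; `prices.diff()` computes the same result in one call.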
      

      8 Data cleaning

      Common cleaning operations:

      • Unique values of a column: Series.unique()
      • Count of each value: Series.value_counts()
      • Drop rows or columns: pdata.drop(labels=None,axis=0,index=None,columns=None, level=None,inplace=False,errors='raise')
      • Remove duplicates: pdata.drop_duplicates(subset=None, keep='first', inplace=False)
      1. Create the data
      # Prepare the data
      import pandas as pd
      import numpy as np
      # Note: the dates are strings here
      ts = ['2019-03-25','2019-03-26','2019-03-26','2019-03-26','2019-03-29','2019-03-30','2019-03-31',
       '2019-04-01','2019-04-02','2019-04-03','2019-04-04','2019-05-01','2019-04-05',]
      values = np.arange(len(ts))
      pdata = pd.DataFrame({'ts':ts, 'values':values})
      pdata
      
      ts values
      0 2019-03-25 0
      1 2019-03-26 1
      2 2019-03-26 2
      3 2019-03-26 3
      4 2019-03-29 4
      5 2019-03-30 5
      6 2019-03-31 6
      7 2019-04-01 7
      8 2019-04-02 8
      9 2019-04-03 9
      10 2019-04-04 10
      11 2019-05-01 11
      12 2019-04-05 12
      2. Unique values and value counts
      # Unique values
      pdata.ts.unique()
      
      array(['2019-03-25', '2019-03-26', '2019-03-29', '2019-03-30',
             '2019-03-31', '2019-04-01', '2019-04-02', '2019-04-03',
             '2019-04-04', '2019-05-01', '2019-04-05'], dtype=object)
      
      # Count of each value
      pdata.ts.value_counts()
      
      ts
      2019-03-26    3
      2019-03-25    1
      2019-03-29    1
      2019-03-30    1
      2019-03-31    1
      2019-04-01    1
      2019-04-02    1
      2019-04-03    1
      2019-04-04    1
      2019-05-01    1
      2019-04-05    1
      Name: count, dtype: int64
      
      3. Drop rows and columns
      # Drop rows and columns
      # Drop a single row
      print(pdata.drop(0))
      
                  ts  values
      1   2019-03-26       1
      2   2019-03-26       2
      3   2019-03-26       3
      4   2019-03-29       4
      5   2019-03-30       5
      6   2019-03-31       6
      7   2019-04-01       7
      8   2019-04-02       8
      9   2019-04-03       9
      10  2019-04-04      10
      11  2019-05-01      11
      12  2019-04-05      12
      
      # Drop several rows
      print(pdata.drop(index=[1,2,3]))
      
                  ts  values
      0   2019-03-25       0
      4   2019-03-29       4
      5   2019-03-30       5
      6   2019-03-31       6
      7   2019-04-01       7
      8   2019-04-02       8
      9   2019-04-03       9
      10  2019-04-04      10
      11  2019-05-01      11
      12  2019-04-05      12
      
      # Drop a column
      print(pdata.drop(columns='ts'))
      
          values
      0        0
      1        1
      2        2
      3        3
      4        4
      5        5
      6        6
      7        7
      8        8
      9        9
      10      10
      11      11
      12      12
      
      # Drop the row with index 0 and the column ts in one call
      pdata.drop(index=0, columns='ts')
      
      values
      1 1
      2 2
      3 3
      4 4
      5 5
      6 6
      7 7
      8 8
      9 9
      10 10
      11 11
      12 12
      4. Remove duplicates
      # Remove duplicates
      # Keep the first occurrence
      pdata.drop_duplicates('ts')
      
      ts values
      0 2019-03-25 0
      1 2019-03-26 1
      4 2019-03-29 4
      5 2019-03-30 5
      6 2019-03-31 6
      7 2019-04-01 7
      8 2019-04-02 8
      9 2019-04-03 9
      10 2019-04-04 10
      11 2019-05-01 11
      12 2019-04-05 12
      # Keep the last occurrence
      pdata.drop_duplicates(subset='ts', keep='last')
      
      ts values
      0 2019-03-25 0
      3 2019-03-26 3
      4 2019-03-29 4
      5 2019-03-30 5
      6 2019-03-31 6
      7 2019-04-01 7
      8 2019-04-02 8
      9 2019-04-03 9
      10 2019-04-04 10
      11 2019-05-01 11
      12 2019-04-05 12
      # duplicated() marks repeats of an earlier ts as True (the rows to drop); invert the mask to keep first occurrences
      bindex = pdata.duplicated(subset='ts')
      pdata[~bindex]
      
      ts values
      0 2019-03-25 0
      1 2019-03-26 1
      4 2019-03-29 4
      5 2019-03-30 5
      6 2019-03-31 6
      7 2019-04-01 7
      8 2019-04-02 8
      9 2019-04-03 9
      10 2019-04-04 10
      11 2019-05-01 11
      12 2019-04-05 12
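
Before dropping duplicates it is often useful to look at them. A minimal sketch, on a shortened copy of the data above, that uses `keep=False` to flag every member of a duplicate group and then resets the index after deduplication:

```python
import pandas as pd

# Shortened copy of the data above; '2019-03-26' appears twice.
pdata = pd.DataFrame({
    "ts": ["2019-03-25", "2019-03-26", "2019-03-26", "2019-03-29"],
    "values": [0, 1, 2, 3],
})

# keep=False marks all rows that share a ts value, not just the repeats.
dupes = pdata[pdata.duplicated(subset="ts", keep=False)]

# Keep first occurrences and renumber the rows from 0.
deduped = pdata.drop_duplicates(subset="ts", keep="first").reset_index(drop=True)

print(dupes)
print(deduped)
```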

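      9 Combining datasets

      Goal: combine multiple datasets as required

      • merge: combines datasets on specified key columns
      • join: DataFrame method similar to merge; joins on the index, with a left join by default
      • concat: concatenates datasets along an axis

      9.1 merge

      Combines datasets on the specified fields to produce a new dataset.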

      pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False, right_index=False,sort=False,suffixes=('_x', '_y'),copy=True,indicator=False,validate=None)

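      Main parameters:

      Parameter    Description
      left         Left dataset
      right        Right dataset
      how          Join type
      on           Column or index level names to join on
      left_on      Join key column(s) in the left dataset
      right_on     Join key column(s) in the right dataset
      left_index   Use the left dataset's index as the join key
      right_index  Use the right dataset's index as the join key
      sort         Whether to sort the result by the join keys
      suffixes     Suffixes appended to overlapping column names

      Join types:

      Value  Description
      left   Left join
      right  Right join
      outer  Outer join
      inner  Inner join
      1. Prepare the data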
      import pandas as pd
      import numpy as np
      n1 = list('ABCE')
      n2 = list('ABCD')
      d1 = [90,80,100,69]
      d2 = [95,78,96,72]
      c1 = ['001','001','002','002']
      c2 = ['001','001','002','002']
      df1 = pd.DataFrame({'name':n1,'math':d1, 'class':c1})
      df2 = pd.DataFrame({'name':n2,'chinese':d2, 'class':c2,'pname':n1})
      
      df1
      
      name math class
      0 A 90 001
      1 B 80 001
      2 C 100 002
      3 E 69 002
      df2
      
      name chinese class pname
      0 A 95 001 A
      1 B 78 001 B
      2 C 96 002 C
      3 D 72 002 E
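      2. Merge the data
      # By default, merge joins on the columns the two frames share (here name and class), not on the index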
      pd.merge(df1, df2)
      
      name math class chinese pname
      0 A 90 001 95 A
      1 B 80 001 78 B
      2 C 100 002 96 C
      # Merge on class: rows pair up within each class (many-to-many); for score data we would rather merge on name
      pd.merge(df1, df2, on = 'class')
      
      name_x math class name_y chinese pname
      0 A 90 001 A 95 A
      1 A 90 001 B 78 B
      2 B 80 001 A 95 A
      3 B 80 001 B 78 B
      4 C 100 002 C 96 C
      5 C 100 002 D 72 E
      6 E 69 002 C 96 C
      7 E 69 002 D 72 E
      # Outer join: takes the union of the keys, which introduces missing values
      pd.merge(df1, df2, on = 'name', how='outer')
      
      name math class_x chinese class_y pname
      0 A 90.0 001 95.0 001 A
      1 B 80.0 001 78.0 001 B
      2 C 100.0 002 96.0 002 C
      3 E 69.0 002 NaN NaN NaN
      4 D NaN NaN 72.0 002 E
      # how='left': keep every row of the left frame
      pd.merge(df1, df2, on = 'name', how='left')
      
      name math class_x chinese class_y pname
      0 A 90 001 95.0 001 A
      1 B 80 001 78.0 001 B
      2 C 100 002 96.0 002 C
      3 E 69 002 NaN NaN NaN
      # how='right': keep every row of the right frame
      pd.merge(df1, df2, on = 'name', how='right')
      
      name math class_x chinese class_y pname
      0 A 90.0 001 95 001 A
      1 B 80.0 001 78 001 B
      2 C 100.0 002 96 002 C
      3 D NaN NaN 72 002 E
      # In practice the key columns may be named differently; specify them with left_on and right_on
      pd.merge(df1, df2, left_on = 'name',right_on='pname')
      
      name_x math class_x name_y chinese class_y pname
      0 A 90 001 A 95 001 A
      1 B 80 001 B 78 001 B
      2 C 100 002 C 96 002 C
      3 E 69 002 D 72 002 E
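
When an outer merge introduces missing values, it helps to know which side each row came from. A minimal sketch, on a reduced version of the two frames above, using the `indicator` and `validate` parameters from the `pd.merge` signature:

```python
import pandas as pd

# Reduced versions of df1 and df2 from above (class and pname columns omitted).
df1 = pd.DataFrame({"name": list("ABCE"), "math": [90, 80, 100, 69]})
df2 = pd.DataFrame({"name": list("ABCD"), "chinese": [95, 78, 96, 72]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
# validate='one_to_one' raises if a name is repeated on either side.
merged = pd.merge(df1, df2, on="name", how="outer",
                  indicator=True, validate="one_to_one")

counts = merged["_merge"].value_counts()
print(merged)
print(counts)
```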

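      9.2 join

      join is a DataFrame method similar to merge; how defaults to 'left'.

      df1.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

      Main parameters:

      Parameter  Description
      on         Column in the calling frame to match against the index of other
      how        Join type
      lsuffix    Suffix for overlapping column names from the left DataFrame
      rsuffix    Suffix for overlapping column names from the right DataFrame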
      df1
      
      name math class
      0 A 90 001
      1 B 80 001
      2 C 100 002
      3 E 69 002
      df2
      
      name chinese class pname
      0 A 95 001 A
      1 B 78 001 B
      2 C 96 002 C
      3 D 72 002 E
      # When both frames share column names, lsuffix and rsuffix are needed to tell the joined columns apart
      df1.join(df2, lsuffix='_x', rsuffix='_y')
      
      name_x math class_x name_y chinese class_y pname
      0 A 90 001 A 95 001 A
      1 B 80 001 B 78 001 B
      2 C 100 002 C 96 002 C
      3 E 69 002 D 72 002 E
      df1.join(df2.set_index('class'), lsuffix='_x', rsuffix='_y', on ='class')
      
      name_x math class name_y chinese pname
      0 A 90 001 A 95 A
      0 A 90 001 B 78 B
      1 B 80 001 A 95 A
      1 B 80 001 B 78 B
      2 C 100 002 C 96 C
      2 C 100 002 D 72 E
      3 E 69 002 C 96 C
      3 E 69 002 D 72 E

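      9.3 concat

      concat: concatenates datasets along the given axis

      pd.concat(objs,axis=0,join='outer',join_axes=None,ignore_index=False,keys=None,levels=None,names=None...)

      Parameters:

      Parameter     Description
      objs          A list (or dict) of Series or DataFrame objects
      axis          Axis to concatenate along: 0 stacks rows (index), 1 places objects side by side (columns)
      join          How to handle the other axis: inner takes the intersection, outer the union
      ignore_index  If True, discard the index along the concatenation axis and renumber from 0
      join_axes     List of Index objects for the other axes (removed in pandas 1.0; use reindex instead)
      keys          Sequence used to build a hierarchical index (MultiIndex)
      levels        Specific values for the levels of the MultiIndex
      names         List of names for the levels of the hierarchical index
      # axis=0 (the default): stack rows, aligning on column names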
      pd.concat([df1, df2], sort= True)
      
      chinese class math name pname
      0 NaN 001 90.0 A NaN
      1 NaN 001 80.0 B NaN
      2 NaN 002 100.0 C NaN
      3 NaN 002 69.0 E NaN
      0 95.0 001 NaN A A
      1 78.0 001 NaN B B
      2 96.0 002 NaN C C
      3 72.0 002 NaN D E
      pd.concat([df1, df2], axis=1)
      
      name math class name chinese class pname
      0 A 90 001 A 95 001 A
      1 B 80 001 B 78 001 B
      2 C 100 002 C 96 002 C
      3 E 69 002 D 72 002 E
      # axis=0 with keys: stack rows; the result has a MultiIndex
      pd.concat([df1, df2], sort= True, keys=['p1', 'p2'])
      
      chinese class math name pname
      p1 0 NaN 001 90.0 A NaN
      1 NaN 001 80.0 B NaN
      2 NaN 002 100.0 C NaN
      3 NaN 002 69.0 E NaN
      p2 0 95.0 001 NaN A A
      1 78.0 001 NaN B B
      2 96.0 002 NaN C C
      3 72.0 002 NaN D E
      # axis=1: place the frames side by side, aligning on the index
      pd.concat([df1, df2], axis=1)
      
      name math class name chinese class pname
      0 A 90 001 A 95 001 A
      1 B 80 001 B 78 001 B
      2 C 100 002 C 96 002 C
      3 E 69 002 D 72 002 E
      # axis=1 with keys: adds an outer level to the column index
      pd.concat([df1, df2], axis=1, keys=['s1','s2'])
      
      s1 s2
      name math class name chinese class pname
      0 A 90 001 A 95 001 A
      1 B 80 001 B 78 001 B
      2 C 100 002 C 96 002 C
      3 E 69 002 D 72 002 E
      # Concatenate side by side, aligned on name
      t1 = df1.set_index('name')
      t2 = df2.set_index('name')
      pd.concat([t1, t2], axis=1, sort=True)
      
      math class chinese class pname
      name
      A 90.0 001 95.0 001 A
      B 80.0 001 78.0 001 B
      C 100.0 002 96.0 002 C
      D NaN NaN 72.0 002 E
      E 69.0 002 NaN NaN NaN
      # Concatenate side by side, aligned on name, keeping only names present in both (join='inner')
      t1 = df1.set_index('name')
      t2 = df2.set_index('name')
      pd.concat([t1, t2], axis=1, sort=True, join='inner')
      
      math class chinese class pname
      name
      A 90 001 95 001 A
      B 80 001 78 001 B
      C 100 002 96 002 C
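
The `ignore_index` parameter in the table above is not exercised by the examples. A minimal sketch with two small made-up frames shows how it avoids the repeated `0, 1, ...` labels seen when stacking rows:

```python
import pandas as pd

# Two small frames with the same columns (values are illustrative).
df1 = pd.DataFrame({"name": list("AB"), "math": [90, 80]})
df2 = pd.DataFrame({"name": list("CD"), "math": [100, 69]})

# Without ignore_index the result keeps labels 0, 1, 0, 1.
kept = pd.concat([df1, df2])

# With ignore_index=True the rows are renumbered 0..3.
stacked = pd.concat([df1, df2], ignore_index=True)

print(kept.index.tolist(), stacked.index.tolist())
```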

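      10 Common pandas data-processing functions

      Goal: use apply, agg, str, and related functions to process data flexibly

      10.1 apply

      • Applies a function to each element of a Series, or to each column or row of a DataFrame
      • Method: df.apply(func,axis=0,broadcast=None, raw=False,reduce=None, result_type=None,args=(), **kwds)
      Parameter  Description
      func       Function applied to each column or row (passed as a Series)
      axis       0 applies func to each column, 1 applies func to each row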
      import pandas as pd
      import numpy as np
      n = list('ABCD')
      math = [90,80,47,69]
      chinese = [95,78,96,59]
      nclass = ['001','001','002','002']
      df = pd.DataFrame({'name':n,'math':math, 'chinese':chinese,'class':nclass})
      df
      
      name math chinese class
      0 A 90 95 001
      1 B 80 78 001
      2 C 47 96 002
      3 D 69 59 002
      name =df['name']
      name
      
      0    A
      1    B
      2    C
      3    D
      Name: name, dtype: object
      
      # Modify each name
      name.apply(lambda x:x+x)
      
      0    AA
      1    BB
      2    CC
      3    DD
      Name: name, dtype: object
      
      # Convert scores to True/False (True means above 59)
      df[['math','chinese']].apply(lambda x : x>59)
      
      math chinese
      0 True True
      1 True True
      2 False True
      3 True False
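
The examples above all use the default `axis=0`. A minimal sketch of `axis=1`, where the function receives one row at a time, using the same scores:

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("ABCD"),
    "math": [90, 80, 47, 69],
    "chinese": [95, 78, 96, 59],
})
scores = df[["math", "chinese"]]

# axis=1: each call receives one row as a Series indexed by column name.
total = scores.apply(lambda row: row["math"] + row["chinese"], axis=1)

# True only when every subject in the row is above 59.
passed_all = scores.apply(lambda row: (row > 59).all(), axis=1)

print(total.tolist(), passed_all.tolist())
```

For simple arithmetic like the total, `scores.sum(axis=1)` is the vectorized equivalent and is usually faster; row-wise `apply` is mainly useful when the logic cannot be expressed that way.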

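      10.2 What func receives

      def func(value):
          # With axis=0, value is one column, passed as a Series
          print(type(value))
          return np.mean(value)
      df[['math','chinese']].apply(func)
      
      <class 'pandas.core.series.Series'>
      <class 'pandas.core.series.Series'>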
      math       71.5
      chinese    82.0
      dtype: float64
      
      def func1(x):
          return x-x.mean()
      df[['math','chinese']].apply(func1)
      
      math chinese
      0 18.5 13.0
      1 8.5 -4.0
      2 -24.5 14.0
      3 -2.5 -23.0
      def func_min(x):
          return x-x.min()
      
      sdata = df[['math','chinese']]
      sdata.apply(func_min)
      
      math chinese
      0 43 36
      1 33 19
      2 0 37
      3 22 0

      10.3 map

      • Works on a Series, or on a single column of a DataFrame
      • Method: Series.map(arg, na_action=None)
      • Processes each value in the Series
      df['math'].map(lambda x: 'pass' if x > 59 else 'failed')
      
      0      pass
      1      pass
      2    failed
      3      pass
      Name: math, dtype: object
      
      # Task: turn the score table into 'pass' / 'failed'
      func = lambda x: 'pass' if x > 59 else 'failed'
      df[['math','chinese']].apply(lambda x : x.map(func))
      
      math chinese
      0 pass pass
      1 pass pass
      2 failed pass
      3 pass failed
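
The `arg` of `Series.map` can also be a dict, which acts as a lookup table. A minimal sketch; the class labels are made up for illustration:

```python
import pandas as pd

classes = pd.Series(["001", "001", "002", "003"])

# Hypothetical lookup table; '003' is deliberately missing.
lookup = {"001": "Class 1", "002": "Class 2"}

# Values without an entry in the dict become NaN.
labels = classes.map(lookup)
print(labels)
```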

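      10.4 replace: substitution

      • Replaces the specified values in the dataset
      • Method: df.replace(to_replace=None,value=None,inplace=False,limit=None,regex=False,method='pad')

      Main parameters:

      Parameter   Description
      to_replace  Value(s) to find: a string, regular expression, list, etc.
      value       Replacement value
      limit       Maximum number of replacements
      inplace     If True, modify the original data in place
      regex       Set to True to treat to_replace as a regular expression
      # Replace A with a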
      df.replace('A', 'a')
      
      name math chinese class
      0 a 90 95 001
      1 B 80 78 001
      2 C 47 96 002
      3 D 69 59 002
      # Replace both A and B with *
      df.replace(['A','B'], '*')
      
      name math chinese class
      0 * 90 95 001
      1 * 80 78 001
      2 C 47 96 002
      3 D 69 59 002
      # Replace a list of values pairwise with another list
      df.replace(list('ABCD'), list('abcd'))
      
      name math chinese class
      0 a 90 95 001
      1 b 80 78 001
      2 c 47 96 002
      3 d 69 59 002
      # Regex: replace every uppercase letter with *
      df.replace(r'[A-Z]','*', regex=True)
      
      name math chinese class
      0 * 90 95 001
      1 * 80 78 001
      2 * 47 96 002
      3 * 69 59 002

      10.5 agg: aggregation

      • Aggregates data along the given axis (mean, max, ...)
      • df.agg(func, axis=0, *args, **kwargs)
      • axis: 0 applies func to each column, 1 applies func to each row
      df[['chinese', 'math']].agg(['mean', 'std'])
      
      chinese math
      mean 82.000000 71.500000
      std 17.416467 18.448125
      df[['chinese', 'math']].describe()
      
      chinese math
      count 4.000000 4.000000
      mean 82.000000 71.500000
      std 17.416467 18.448125
      min 59.000000 47.000000
      25% 73.250000 63.500000
      50% 86.500000 74.500000
      75% 95.250000 82.500000
      max 96.000000 90.000000
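
`agg` also accepts a dict that maps each column to its own list of functions. A minimal sketch on the same scores; statistics not requested for a column show up as NaN:

```python
import pandas as pd

df = pd.DataFrame({"math": [90, 80, 47, 69], "chinese": [95, 78, 96, 59]})

# Different statistics per column.
result = df.agg({"math": ["mean", "max"], "chinese": ["min"]})
print(result)
```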

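      10.6 transform: element-wise processing

      • Applies the given function to the data and returns new data of the same shape
      • df.transform(func, axis=0, *args, **kwargs)
      • Example: subtract each subject's mean from its scores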
      df[['chinese', 'math']]
      
      chinese math
      0 95 90
      1 78 80
      2 96 47
      3 59 69
      df[['chinese', 'math']].transform(lambda x:x-x.mean())
      
      chinese math
      0 13.0 18.5
      1 -4.0 8.5
      2 14.0 -24.5
      3 -23.0 -2.5

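      10.7 filter: selecting by label

      • Keeps the rows or columns whose labels match a condition
      • df.filter(items=None, like=None, regex=None, axis=None)
      Parameter  Description
      items      List-like of labels to keep
      like       Keep labels that contain this substring
      regex      Keep labels that match this regular expression
      axis       Axis to filter on: 0 (row labels) or 1 (column labels)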
      df.filter(['name'])
      
      name
      0 A
      1 B
      2 C
      3 D
      # like: keep the labels that contain the substring
      df.filter(like = 'n')
      
      name chinese
      0 A 95
      1 B 78
      2 C 96
      3 D 59
      # Regex: columns whose name ends with e (the $ anchor is required; r'.*e' would match any label containing e)
      df.filter(regex=r'e$')
      
      name chinese
      0 A 95
      1 B 78
      2 C 96
      3 D 59

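      11 Grouping

      Goal: split the data into groups by a given condition, then compute on each group

      Examples: total active users per day, each user's daily online time, average spending per user, and so on.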

      import pandas as pd
      classname = ['001','001','002','002','003','003']
      name = ['sun','li','zhou','wang','zao','wu']
      height = [169, 172,180,170,165,175]
      weights = [61,53,75,64,50,58]
      df = pd.DataFrame({'cname':classname, 'user':name, 'height':height, 'weights':weights})
      df
      
      cname user height weights
      0 001 sun 169 61
      1 001 li 172 53
      2 002 zhou 180 75
      3 002 wang 170 64
      4 003 zao 165 50
      5 003 wu 175 58

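      11.1 groupby

      Method: df.groupby(by=None,axis=0,level=None,as_index=True,sort=True,group_keys=True,squeeze=False,observed=False,**kwargs)

      Main parameters:

      Parameter   Description
      by          Grouping key(s)
      axis        0 or 'index' groups rows, 1 or 'columns' groups columns; defaults to 0
      group_keys  Whether to add the group keys to the index when calling apply; defaults to True
      sort        Whether to sort by group key; defaults to True
      level       For a MultiIndex, the level(s) to group by
      # Create a DataFrameGroupBy object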
      dfg = df.groupby('cname')
      dfg.groups
      
      {'001': [0, 1], '002': [2, 3], '003': [4, 5]}
      
      # Count per group
      dfg.count()
      
      user height weights
      cname
      001 2 2 2
      002 2 2 2
      003 2 2 2
      # Group by several columns:
      dfg = df.groupby(['cname','height'])
      dfg.groups
      
      {('001', 169): [0], ('001', 172): [1], ('002', 170): [3], ('002', 180): [2], ('003', 165): [4], ('003', 175): [5]}
      
      # The result has a MultiIndex
      dfg.count()
      
      user weights
      cname height
      001 169 1 1
      172 1 1
      002 170 1 1
      180 1 1
      003 165 1 1
      175 1 1

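      11.2 Aggregation

      Grouping yields a GroupBy object; aggregation functions applied to it return one result per group.

      Put simply: process the data group by group.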

      import pandas as pd
      import numpy as np
      classname = ['001','001','002','002','003','003']
      name = ['sun','li','zhou','wang','zao','wu']
      height = [169, 172,180,170,165,175]
      weights = [61,53,75,64,50,58]
      df = pd.DataFrame({'cname':classname, 'user':name, 'height':height, 'weights':weights})
      df_cname = df.drop('user', axis=1) # drop the user column
      dfg = df_cname.groupby('cname')
      df_cname
      
      cname height weights
      0 001 169 61
      1 001 172 53
      2 002 180 75
      3 002 170 64
      4 003 165 50
      5 003 175 58
      dfg.agg(['max', 'mean', 'std', 'count'])
      
      height weights
      max mean std count max mean std count
      cname
      001 172 170.5 2.121320 2 61 57.0 5.656854 2
      002 180 175.0 7.071068 2 75 69.5 7.778175 2
      003 175 170.0 7.071068 2 58 54.0 5.656854 2
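
Passing a list of functions, as above, produces two-level column names. As an alternative, here is a minimal sketch of named aggregation, which yields flat column names; the output names `max_height` and `mean_weight` are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cname": ["001", "001", "002", "002", "003", "003"],
    "height": [169, 172, 180, 170, 165, 175],
    "weights": [61, 53, 75, 64, 50, 58],
})

# Each keyword is an output column: name=(input column, function).
summary = df.groupby("cname").agg(
    max_height=("height", "max"),
    mean_weight=("weights", "mean"),
)
print(summary)
```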

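      11.3 transform

      • Purpose: process each group and return a new set of values aligned with the original rows
      • Method: dfg.transform(func, *args, **kwargs)
      • Use case: each value's difference from its group mean
      # Mean height and weight of each class, repeated for every row in the class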
      dfg.transform('mean')
      
      height weights
      0 170.5 57.0
      1 170.5 57.0
      2 175.0 69.5
      3 175.0 69.5
      4 170.0 54.0
      5 170.0 54.0
      # Each student's height and weight minus the class mean
      dfg.transform(lambda x: x-x.mean())
      
      height weights
      0 -1.5 4.0
      1 1.5 -4.0
      2 5.0 5.5
      3 -5.0 -5.5
      4 -5.0 -4.0
      5 5.0 4.0

      11.5 filter

      • Filters groups
      • dfg.filter(func, dropna=True, *args, **kwargs): returns the rows of the groups that satisfy the condition
      # Rows of the groups whose mean height is above 171
      dfg['height'].filter(lambda x: np.mean(x)> 171)
      
      2    180
      3    170
      Name: height, dtype: int64
      
      dfg.count()
      
      height weights
      cname
      001 2 2
      002 2 2
      003 2 2
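
The call above filters a single column and returns a Series. A minimal sketch of the same condition applied to the whole grouped DataFrame, so that every column of the qualifying groups is returned:

```python
import pandas as pd

df = pd.DataFrame({
    "cname": ["001", "001", "002", "002", "003", "003"],
    "user": ["sun", "li", "zhou", "wang", "zao", "wu"],
    "height": [169, 172, 180, 170, 165, 175],
})

# func receives each group as a DataFrame and must return a single bool.
tall_classes = df.groupby("cname").filter(lambda g: g["height"].mean() > 171)
print(tall_classes)
```

Only class 002 has a mean height above 171 (175.0), so only its two rows remain, with their original index labels.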

      11.6 Binning with cut

      • Splits values into the given intervals so they can be summarized per interval
      • Method: pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False,duplicates='raise')
      # Prepare the data: 10 random scores in [70, 100)
      df = pd.Series(np.random.randint(70,100, size = 10))
      # Bin the values into the ranges 70-79, 80-89, and 90-100
      r = pd.cut(df, [70,80,90,101], right=False)
      r
      
      0     [70, 80)
      1     [80, 90)
      2     [70, 80)
      3    [90, 101)
      4     [70, 80)
      5     [80, 90)
      6    [90, 101)
      7     [70, 80)
      8    [90, 101)
      9     [80, 90)
      dtype: category
      Categories (3, interval[int64, left]): [[70, 80) < [80, 90) < [90, 101)]
      
      # Count the values in the ranges 70-79, 80-89, and 90-100, with labels
      r = pd.cut(df, [70,80,90,101], right=False, labels=['70+','80+','90+'])
      r.groupby(r, observed=False).count()
      
      70+    4
      80+    3
      90+    3
      dtype: int64
      
      # retbins=True also returns the bin edges
      r,bins= pd.cut(df, [70,80,90,101], right=False, labels=['70+','80+','90+'],retbins=True)
      bins
      
      array([ 70,  80,  90, 101])
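
Because the data above is random, the counts change on every run. A minimal deterministic sketch, with made-up scores and labels, that bins the values and counts each band with `value_counts`:

```python
import pandas as pd

# Fixed scores so the counts are reproducible (values are illustrative).
scores = pd.Series([72, 85, 79, 90, 100, 88, 70])

# right=False makes the intervals [70, 80), [80, 90), [90, 101).
bands = pd.cut(scores, bins=[70, 80, 90, 101], right=False,
               labels=["70-79", "80-89", "90-100"])

# sort=False keeps the bands in their defined order rather than by count.
counts = bands.value_counts(sort=False)
print(counts)
```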
      

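      11.7 Pivot tables

      A pivot table rearranges data dynamically and summarizes it by category. It works much like groupby but is simpler to use.

      Method: pd.pivot_table(data,values=None,index=None,columns=None,aggfunc='mean',fill_value=None,margins=False,...)

      Main parameters:

      Parameter     Description
      data          Input data
      values        Column(s) to aggregate
      index         Row grouping keys: column names, Grouper, arrays, etc.
      columns       Column grouping keys: column names, Grouper, arrays, etc.
      aggfunc       Aggregation function, list of functions, or dict of functions
      fill_value    Value used to replace missing entries
      dropna        Whether to drop columns whose entries are all NaN; defaults to True
      margins       Whether to add total rows and columns; defaults to False
      margins_name  Name of the row and column that hold the totals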
      import pandas as pd
      path = r'data\titanic.csv'
      df = pd.read_csv(path)
      df
      
      survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
      0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
      1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
      2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
      3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
      4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
      ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
      886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
      887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
      888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
      889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
      890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

      891 rows × 15 columns

      # Survival rate by sex for each class
      #          First  Second  ...
      # female       x       x  ...
      # male         x       x  ...
      df.pivot_table(values='survived',index='sex', columns=['class'])
      
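      class      First    Second     Third
      sex
      female  0.968085  0.921053  0.500000
      male    0.368852  0.157407  0.135447

      # Same table, plus overall rates in a renamed totals row and column
      # margins=True adds the totals; margins_name sets their label
      # Layout: First, Second, ... as columns
      #         female survival rate: x x ...
      #         male survival rate:   x x ...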
      df.pivot_table(values='survived',index='sex', columns=['class'],margins=True, margins_name='all_mean')
      
      class        First    Second     Third  all_mean
      sex
      female    0.968085  0.921053  0.500000  0.742038
      male      0.368852  0.157407  0.135447  0.188908
      all_mean  0.629630  0.472826  0.242363  0.383838

      # Survival rate by sex and age band (rows) for each passenger class (columns)
      cuts = pd.cut(df.age,[0,18,100])
      df.pivot_table(values='survived',index=['sex',cuts], columns=['class'])
      
      class                First    Second     Third
      sex    age
      female (0, 18]    0.909091  1.000000  0.511628
             (18, 100]  0.972973  0.900000  0.423729
      male   (0, 18]    0.800000  0.600000  0.215686
             (18, 100]  0.375000  0.071429  0.133663

      # Survival rate and passenger count by sex and class, with totals
      df.pivot_table(values='survived',index=['sex'], columns=['class'], aggfunc=['mean', 'count'],margins=True)
      
                  mean                                 count
      class      First    Second     Third       All   First  Second   Third     All
      sex
      female  0.968085  0.921053  0.500000  0.742038      94      76     144     314
      male    0.368852  0.157407  0.135447  0.188908     122     108     347     577
      All     0.629630  0.472826  0.242363  0.383838     216     184     491     891
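Since the section describes pivot_table as a simpler alternative to groupby, it may help to see the two side by side. The sketch below uses a hypothetical six-row frame (`mini`) in place of the Titanic file, and compares a pivot table against the equivalent groupby followed by unstack.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic columns used above
mini = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "male", "female"],
    "class":    ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 0, 0, 1, 1],
})

# Pivot table: mean of 'survived' for every sex/class pair
by_pivot = mini.pivot_table(values="survived", index="sex", columns="class")

# The same summary written with groupby: aggregate, then move 'class' to the columns
by_group = mini.groupby(["sex", "class"])["survived"].mean().unstack("class")

print(by_pivot)
print(by_group)
```

Both calls group on the same keys; pivot_table simply does the reshaping step for you.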

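      11.8 str methods

      The .str accessor is built into Series and provides vectorized string handling; its methods closely mirror Python's own string methods. The methods below are called as sf.str.<method>(...), where sf is a Series of strings: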

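      • cat(others=None, sep=None, na_rep=None, join='left'): concatenate strings, either element-wise with others or across the whole Series
      • split(pat=None, n=-1, expand=False): split each string on pat; expand=True returns a DataFrame with one column per piece
      • get(i): take the element at position i from each string or list
      • join(sep): join the items of each list element into one string, separated by sep
      • find(sub, start=0, end=None): return the lowest index at which sub occurs in each string, or -1 if it is absent
      • contains(pat, case=True, flags=0, na=nan, regex=True): test whether each string contains pat
      • replace(pat, repl, n=-1, case=None, flags=0, regex=False): replace occurrences of pat with repl (regex defaulted to True before pandas 2.0)
      • repeat(repeats): repeat each string repeats times
      • startswith(pat, na=nan): test whether each string starts with pat
      • endswith(pat, na=nan): test whether each string ends with pat
      • strip(to_strip=None): strip the characters in to_strip (whitespace by default) from both ends
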
      sf = pd.Series(['a_1', 'b_2', 'c_3'])
      sf.str.cat(list('ABC'), sep = '-')
      
      0    a_1-A
      1    b_2-B
      2    c_3-C
      dtype: object
      
      sf.str.split('_', expand=True)
      
         0  1
      0  a  1
      1  b  2
      2  c  3

      # Extract the leading run of letters (the first capture group)
      sf.str.extract(r'([a-zA-Z]+)')
      
         0
      0  a
      1  b
      2  c

      # Test whether each string starts with letters (str.match anchors at the start)
      sf.str.match(r'([a-zA-Z]+)')
      
      0    True
      1    True
      2    True
      dtype: bool
      
      # Find every run of lowercase letters or digits
      sf.str.findall(r'([a-z0-9]+)')
      
      0    [a, 1]
      1    [b, 2]
      2    [c, 3]
      dtype: object
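Several methods from the list above (get, join, find, contains, replace, repeat, startswith, endswith, strip) are not demonstrated in the examples. A minimal sketch on the same `sf` Series follows; the variable names are chosen for illustration only.

```python
import pandas as pd

sf = pd.Series(['a_1', 'b_2', 'c_3'])

parts = sf.str.split('_')        # each element becomes a list such as ['a', '1']
letters = parts.str.get(0)       # first item of each list
rejoined = parts.str.join('-')   # join the list items back into one string

has_two = sf.str.contains('2', regex=False)       # literal substring test
pos = sf.str.find('_')                            # index of the first '_', -1 if absent
swapped = sf.str.replace('_', ':', regex=False)   # literal replacement
doubled = sf.str.repeat(2)                        # each string twice
starts_a = sf.str.startswith('a')
ends_3 = sf.str.endswith('3')
trimmed = pd.Series(['  x ', 'y  ']).str.strip()  # whitespace removed from both ends

# Collect the results in one frame for easy inspection
print(pd.DataFrame({
    'letters': letters, 'rejoined': rejoined, 'has_two': has_two,
    'pos': pos, 'swapped': swapped, 'doubled': doubled,
    'starts_a': starts_a, 'ends_3': ends_3,
}))
```

Passing regex=False explicitly keeps the behavior the same across pandas versions, since the default for replace has changed over time.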
      
      df = pd.DataFrame([ ["张三", "abcd1234"], ["李四", "a2b4c3r5"]], columns=["name", "pwd"]) 
      df
      
        name       pwd
      0   张三  abcd1234
      1   李四  a2b4c3r5
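The article does not say what this frame is meant for. One plausible use, sketched here under that assumption, is applying the .str methods to a DataFrame column to check or summarize the pwd values. The rule "letters first, then digits" and the new column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([["张三", "abcd1234"], ["李四", "a2b4c3r5"]], columns=["name", "pwd"])

# Hypothetical rule: the whole password is lowercase letters followed by digits
df["letters_then_digits"] = df["pwd"].str.fullmatch(r"[a-z]+[0-9]+")

# Count the digits, and collect them into one string per row
df["digit_count"] = df["pwd"].str.count(r"[0-9]")
df["digits"] = df["pwd"].str.findall(r"[0-9]").str.join("")

# Boolean results from .str methods can be used directly as a row filter
matching = df[df["letters_then_digits"]]
print(df)
print(matching["name"].tolist())
```

Because the .str methods return a Series aligned with the original index, their results can be assigned as new columns or used as masks without extra bookkeeping.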

      12 Visualization with pandas

      pandas can draw charts directly. The plotting functions are built on matplotlib and are used in much the same way.

      Method: df.plot(*args, **kwargs)

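      Main parameters:

      Parameter   Description
      data        The Series or DataFrame to plot
      x           Label or position of the column used for the x axis
      y           Label or position of the column(s) used for the y axis
      kind        Type of chart, for example: line, bar, barh, hist
      figsize     Figure size as a (width, height) tuple in inches
      use_index   Whether to use the index as the x-axis ticks; defaults to True
      grid        Whether to draw grid lines; off by default
      legend      Whether to show the legend; defaults to True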
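The parameters in the table can be combined in a single call. The sketch below assumes a small made-up `scores` frame and a non-interactive matplotlib backend so that it also runs outside a notebook; in a notebook the backend line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "math":  [72, 85, 78, 90],
    "art":   [65, 70, 88, 82],
})

# x picks the column for the x axis, y the columns to draw,
# kind the chart type; figsize is (width, height) in inches
ax = scores.plot(x="month", y=["math", "art"], kind="bar",
                 figsize=(6, 3), grid=True, legend=True)

# plot() returns a matplotlib Axes, so the usual matplotlib calls apply
ax.set_ylabel("score")
```

Because the return value is an ordinary Axes object, titles, labels, and saving (for example `ax.figure.savefig(...)`) work as they do in matplotlib.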

      12.1 Basic usage

      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      %matplotlib inline
      # Sample data: random integers in [30, 60)
      df = pd.DataFrame(np.random.randint(30,60, size=(6,4)), columns=list('ABCD'), index=range(6))
      df
      
          A   B   C   D
      0  41  34  35  53
      1  40  53  34  50
      2  59  44  47  38
      3  59  50  59  40
      4  55  52  49  45
      5  45  57  38  54

      # Draw a line chart
      # x: the index; y: all columns by default
      df.plot()

      [Figure 1: output of df.plot()]

      # A uses the left y axis
      df.A.plot()
      # B uses the right y axis
      df.B.plot(secondary_y=True, style='g')

      [Figure 2: A on the left y axis, B on the right y axis]

      12.2 Chart methods under plot

      • Line chart: df.plot.line(x=None, y=None, **kwargs)
      • Bar chart (vertical): df.plot.bar(x=None, y=None, **kwargs)
      • Bar chart (horizontal): df.plot.barh(x=None, y=None, **kwargs)
      • Histogram: df.plot.hist(by=None, bins=10, **kwargs)
      • KDE plot: df.plot.kde(bw_method=None, ind=None, **kwargs)
      • Pie chart: df.plot.pie(**kwargs)
      • Scatter plot: df.plot.scatter(x, y, s=None, c=None, **kwargs)
      • Box plot: df.plot.box(by=None, **kwargs)
      • Area chart: df.plot.area(x=None, y=None, **kwargs)

      # Vertical bar chart
      df.plot.bar()

      [Figure 3: output of df.plot.bar()]

      # Horizontal bar chart
      df.plot.barh()

      [Figure 4: output of df.plot.barh()]

      # Pie chart of column A
      df.plot.pie(y='A')

      [Figure 5: output of df.plot.pie(y='A')]
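The histogram, scatter, and box methods listed above are not shown in the examples. A minimal sketch follows, assuming a seeded random frame shaped like the one in 12.1 and a non-interactive backend so it runs as a script; the subplot layout and titles are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
df = pd.DataFrame(rng.integers(30, 60, size=(6, 4)), columns=list("ABCD"))

# One row of three axes; each pandas call draws onto the axes passed via ax=
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot.hist(bins=5, alpha=0.5, ax=axes[0], title="hist")   # distribution of each column
df.plot.scatter(x="A", y="B", ax=axes[1], title="scatter")  # column A against column B
df.plot.box(ax=axes[2], title="box")                        # quartiles of each column
fig.tight_layout()
```

Passing `ax=` lets several pandas charts share one matplotlib figure, which is convenient when comparing views of the same data.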
