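0.Background
When working with Pandas you often need to swap rows and columns, in other words to build a pivot table. This post summarizes the corresponding reshaping topics in the Pandas documentation and walks through four functions: pivot, melt, stack, and unstack.
1.pivot (be sure to read about pivot_table as well)
pivot builds a pivot table from source data. A pivot table is a way of summarizing data in statistics; pivot itself only rearranges the values and does not aggregate them. An example follows.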
import pandas as pd
# Source data
data = {'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2018-08-01', '2018-08-03', '2018-08-03',
'2018-08-01', '2018-08-02'],
'variable': ['A','A','A','B','B','C','C','C'],
'value': [3.0 ,4.0 ,6.0 ,2.0 ,8.0 ,4.0 ,10.0 ,1.0 ]}
df = pd.DataFrame(data=data, columns=['date', 'variable', 'value'])
print(df)
# date variable value
# 0 2018-08-01 A 3.0
# 1 2018-08-02 A 4.0
# 2 2018-08-03 A 6.0
# 3 2018-08-01 B 2.0
# 4 2018-08-03 B 8.0
# 5 2018-08-03 C 4.0
# 6 2018-08-01 C 10.0
# 7 2018-08-02 C 1.0
# eg1. To see the value of each variable by date:
# use date as the index and turn the values of variable into separate columns (pivot)
df1 = df.pivot(index='date', columns='variable', values='value')
print(df1)
# variable A B C
# date
# 2018-08-01 3.0 2.0 10.0
# 2018-08-02 4.0 NaN 1.0
# 2018-08-03 6.0 8.0 4.0
# eg2. With several value columns, the column labels become hierarchical columns, i.e. a MultiIndex
df['value_other'] = df['value'] * 2
df2 = df.pivot(index='date', columns='variable', values=['value', 'value_other'])
print(df2)
# value value_other
# variable A B C A B C
# date
# 2018-08-01 3.0 2.0 10.0 6.0 4.0 20.0
# 2018-08-02 4.0 NaN 1.0 8.0 NaN 2.0
# 2018-08-03 6.0 8.0 4.0 12.0 16.0 8.0
print(df2['value_other'])
# variable A B C
# date
# 2018-08-01 6.0 4.0 20.0
# 2018-08-02 8.0 NaN 2.0
# 2018-08-03 12.0 16.0 8.0
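The heading above recommends reading about pivot_table as well. One likely reason is that pivot fails when an index/column pair occurs more than once, while pivot_table aggregates such duplicates. The sketch below illustrates this under stated assumptions: the `dup` DataFrame is made up for illustration and is not part of the original data.

```python
import pandas as pd

# Hypothetical data: the pair ('2018-08-01', 'A') appears twice.
dup = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-01', '2018-08-02'],
    'variable': ['A', 'A', 'A'],
    'value': [3.0, 5.0, 4.0],
})

try:
    dup.pivot(index='date', columns='variable', values='value')
    pivot_failed = False
except ValueError as exc:
    # pivot cannot decide which of the two duplicate values to keep
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table aggregates the duplicates instead (here with the mean)
table = dup.pivot_table(index='date', columns='variable',
                        values='value', aggfunc='mean')
print(table)
```

If the data is guaranteed to have at most one value per index/column pair, pivot is enough; otherwise pivot_table with an explicit aggfunc is the safer choice.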
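2.melt
pd.melt or DataFrame.melt performs the opposite of pivot. It unpivots a DataFrame into a long format in which one or more columns act as identifier variables and all other columns are treated as measured variables.
The main parameters are id_vars (the identifier columns), value_vars (the columns to unpivot; all remaining columns by default), and var_name and value_name (the names of the resulting variable and value columns).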
import numpy as np
import pandas as pd
cheese = pd.DataFrame({'first': ['John', 'Mary'],
'last': ['Doe', 'Bo'],
'height': [5.5, 6.0],
'weight': [130, 150]})
print(cheese)
# first height last weight
# 0 John 5.5 Doe 130
# 1 Mary 6.0 Bo 150
# eg1. The identifier variables are first and last (all other columns become measured variables by default)
print(cheese.melt(id_vars=['first', 'last']))
# first last variable value
# 0 John Doe height 5.5
# 1 Mary Bo height 6.0
# 2 John Doe weight 130.0
# 3 Mary Bo weight 150.0
# eg2. Custom names for the variable and value columns
print(cheese.melt(id_vars=['first', 'last'], var_name='myVariable', value_name='myValue'))
# first last myVariable myValue
# 0 John Doe height 5.5
# 1 Mary Bo height 6.0
# 2 John Doe weight 130.0
# 3 Mary Bo weight 150.0
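To make the "opposite of pivot" relationship concrete, here is a minimal sketch that restricts melt to one column with value_vars and then reverses a full melt with pivot. It assumes a pandas version in which pivot accepts a list for index (1.1 or later); the variable names `long_df`, `full_long`, and `wide` are chosen here for illustration.

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# value_vars limits which columns are unpivoted: only height here
long_df = cheese.melt(id_vars=['first', 'last'], value_vars=['height'])
print(long_df)

# Melt everything, then pivot back to the wide layout
full_long = cheese.melt(id_vars=['first', 'last'])
wide = (full_long
        .pivot(index=['first', 'last'], columns='variable', values='value')
        .reset_index())
wide.columns.name = None  # drop the leftover 'variable' axis label
print(wide)
```

Note that the round trip does not restore the original dtypes: height and weight share a single value column after melt, so weight comes back as float.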
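3.stack and unstack
DataFrame.stack and unstack operate on a MultiIndex. A MultiIndex is essentially a way of representing higher-dimensional data in two dimensions. Both the rows and the columns of a two-dimensional table can carry several index levels, and stack and unstack move levels between the row index and the column index (the column labels). Detailed usage follows.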
import pandas as pd
import numpy as np
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])
print(df)
# A B
# first second
# bar one -0.332862 0.929766
# two 0.857515 0.181623
# baz one -0.769248 0.200083
# two 0.907549 -0.781607
# foo one -1.683440 0.868688
# two -1.556559 -0.591569
# qux one -0.399071 0.115823
# two 1.665903 2.210725
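# eg1. stack "compresses" a level of the column index into the row index. The result is a Series (if the columns have a single level) or a DataFrame (if the columns are a MultiIndex)
# Here the result is a Series whose row index has three levels.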
df1 = df.stack()
print(df1)
print(df1.index)
# first second
# bar one A -0.332862
# B 0.929766
# two A 0.857515
# B 0.181623
# baz one A -0.769248
# B 0.200083
# two A 0.907549
# B -0.781607
# foo one A -1.683440
# B 0.868688
# two A -1.556559
# B -0.591569
# qux one A -0.399071
# B 0.115823
# two A 1.665903
# B 2.210725
# MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two'], [u'A', u'B']],
# labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
# names=[u'first', u'second', None])
# eg2. The inverse operation is unstack(): it removes one level from the row index and adds one level to the column index
df2 = df1.unstack()
print(df2)
# A B
# first second
# bar one -0.332862 0.929766
# two 0.857515 0.181623
# baz one -0.769248 0.200083
# two 0.907549 -0.781607
# foo one -1.683440 0.868688
# two -1.556559 -0.591569
# qux one -0.399071 0.115823
# two 1.665903 2.210725
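# eg3. With a MultiIndex you can choose which level to "compress" through the level argument.
# - row index levels are numbered from 0, starting with the leftmost level
# - column index levels are numbered from 0, starting with the topmost level
# Level names such as first and second can be used instead of 0, 1, ..., but some levels may have no name, e.g. the innermost level of df1 created by the first stack.
# Numbers and names should not be mixed, but several levels can be given at once.
#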
df3 = df1.unstack(level=0)
print(df3)
# first bar baz foo qux
# second
# one A -0.332862 -0.769248 -1.683440 -0.399071
# B 0.929766 0.200083 0.868688 0.115823
# two A 0.857515 0.907549 -1.556559 1.665903
# B 0.181623 -0.781607 -0.591569 2.210725
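# eg4. stack and unstack sort internally: stacking a DataFrame and then unstacking it returns the DataFrame sorted by its row index.
# In my experiments the column index is not sorted, though!
index = pd.MultiIndex.from_product([[2,1], ['a', 'b']])
df4 = pd.DataFrame(np.random.randn(4,1), index=index, columns=[100])
# all() on a DataFrame only iterates over its column labels, so reduce the boolean frame with .all().all() instead
print((df4.stack().unstack() == df4.sort_index()).all().all())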
df5 = pd.DataFrame(np.random.randn(4,3), index=index, columns=[100,99,102])
print(df5)
# True
# 100 99 102
# 2 a 0.094463 1.766611 0.588748
# b -1.262801 0.737156 -0.450470
# 1 a -0.888983 0.903101 -1.179545
# b 1.015863 -0.486976 -1.097248
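# eg5. Missing values are handled sensibly
# Below, the columns are a MultiIndex whose outer level is ordered A, B, B, A rather than A, A, B, B, so the printed column header looks odd (A appears twice instead of once). This does not affect the result.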
columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
('B', 'cat'), ('A', 'dog')],
names=['exp', 'animal'])
index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
('one', 'two')],
names=['first', 'second'])
df6 = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
print(df6)
# exp A B A
# animal cat dog cat dog
# first second
# bar one 2.640535 -0.030465 -1.323677 1.748616
# two 0.382401 -1.378172 0.862763 1.646497
# baz one 0.032120 0.140315 -0.073596 -1.402424
# ...
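# eg6. Missing values are filled with NaN for int and float data and with NaT for datetime data; a different filler can be set with fill_value, e.g. fill_value=-999.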
df7 = df6.iloc[[0, 1, 4, 7], [1, 2]].unstack()
print(df7)
# animal dog cat
# second one two one two
# first
# bar -0.030465 -1.378172 -1.323677 0.862763
# foo -0.848578 NaN -0.275208 NaN
# qux NaN -0.330900 NaN -0.809930
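Two points from the examples above, selecting a level by name and replacing the default NaN filler, can be sketched on a small deterministic Series. The data and the names `s`, `by_name`, and `filled` are made up for illustration; the combination ('foo', 'two') is deliberately missing.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('foo', 'one')],
    names=['first', 'second'])
s = pd.Series([1, 2, 3], index=index)

# Select the level to unstack by name instead of by position
by_name = s.unstack(level='second')
print(by_name)   # the missing (foo, two) cell is NaN, so the values become float

# fill_value replaces NaN for combinations that do not exist
filled = s.unstack(level='second', fill_value=-999)
print(filled)
```

Using a sentinel such as -999 keeps the integer dtype, at the cost of a value that has to be treated specially later.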
4.pivot_table
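According to the documentation, pivot_table creates a spreadsheet-style pivot table as a DataFrame. For a simple DataFrame it behaves much like pivot, but it adds further features on top, most notably aggregation.
The main parameters are data, values, index, columns, aggfunc (the aggregation function or list of functions), fill_value, and margins. Example: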
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
print(df)
# Source data
# A B C D
# 0 foo one small 1
# 1 foo one large 2
# 2 foo one large 2
# 3 foo two small 3
# 4 foo two small 3
# 5 bar one large 4
# 6 bar one small 5
# 7 bar two small 6
# 8 bar two large 7
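# Aggregations are passed by name; newer pandas versions may warn when given np.mean, np.sum, or the builtin max
print(pd.pivot_table(df, index=['A', 'B'], columns=['C'], values=['D'], aggfunc=['mean', 'sum', 'max']))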
# mean sum max
# D D D
# C large small large small large small
# A B
# bar one 4.0 5.0 4.0 5.0 4.0 5.0
# two 7.0 6.0 7.0 6.0 7.0 6.0
# foo one 2.0 1.0 4.0 1.0 2.0 1.0
# two NaN 3.0 NaN 6.0 NaN 3.0
5.Official documentation
1.http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking