pandas.DataFrame.pivot_table creates a pivot table. It takes one or more columns of a DataFrame as the index and one or more as the columns, and builds a new DataFrame from them.
Parameters:
- values: column to aggregate, optional
- index: column, Grouper, array, or list of the previous
- columns: column, Grouper, array, or list of the previous
- aggfunc: function, list of functions, dict, default numpy.mean
- fill_value: scalar, default None. Value to replace missing values with
- margins: boolean, default False. Add an extra row / column of totals (e.g. for subtotals / grand totals)
- margins_name: string, default 'All'. Name of the row / column that will contain the totals when margins is True.
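As a rough illustration of how fill_value, margins and margins_name interact, here is a minimal sketch. The three-row toy data and the name 'Total' are assumptions made up for this example and are not part of the movie rating data used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data: user 2 has not rated m2, so that cell would be missing.
toy = pd.DataFrame({
    'userId': [1, 1, 2],
    'movie': ['m1', 'm2', 'm1'],
    'rating': [3, 4, 2],
})

table = toy.pivot_table(values='rating', index='userId', columns='movie',
                        aggfunc='sum',          # add up ratings in each cell
                        fill_value=0,           # show 0 instead of NaN for the missing cell
                        margins=True,           # append a totals row and a totals column
                        margins_name='Total')   # call them 'Total' instead of 'All'
print(table)
```

The (user 2, m2) cell is filled with 0, and the extra 'Total' row and column hold the sums over each movie and each user.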
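Below is a movie rating example:
userId, movie and rating are the user ID, the movie and the rating. For example, row 0 says that the user with ID 1 gave movie m1 a rating of 3. A table in this form is hard to read, especially when userId and movie are not sorted, so we can convert it into a table whose index is the user ID and whose columns are the movies: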
import pandas as pd
import numpy as np

# One row per (user, movie) rating
data = {'userId': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'movie': ['m1', 'm2', 'm3', 'm4', 'm5',
                  'm1', 'm2', 'm3', 'm4', 'm5',
                  'm1', 'm2', 'm3', 'm4', 'm5'],
        'rating': [3, 4, 5, 4, 4, 2, 3, 5, 3, 4, 3, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Each (userId, movie) pair occurs once, so the sum of a cell is the rating itself.
# Recent pandas versions recommend the string 'sum' in place of np.sum.
df_utility = df.pivot_table(values='rating', index='userId', columns='movie',
                            aggfunc=np.sum, margins=False)
The resulting df_utility:
movie   m1  m2  m3  m4  m5
userId
1        3   4   5   4   4
2        2   3   5   3   4
3        3   2   3   4   5
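This makes the users' ratings clear at a glance. Each row of the table above holds one user's ratings for all movies, and each column holds all users' ratings for one movie.
After changing the margins parameter to True: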
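The row and column reading described here can be checked directly with ordinary label-based selection. The sketch below rebuilds the same pivot table (using the string 'sum', which is assumed to behave like np.sum here) and pulls out one row and one column; the variable names user_1 and movie_m1 are only illustrative.

```python
import pandas as pd

data = {'userId': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'movie': ['m1', 'm2', 'm3', 'm4', 'm5'] * 3,
        'rating': [3, 4, 5, 4, 4, 2, 3, 5, 3, 4, 3, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df_utility = df.pivot_table(values='rating', index='userId', columns='movie',
                            aggfunc='sum')

user_1 = df_utility.loc[1]    # one row: user 1's ratings for every movie
movie_m1 = df_utility['m1']   # one column: every user's rating for m1

print(user_1.tolist())    # [3, 4, 5, 4, 4]
print(movie_m1.tolist())  # [3, 2, 3]
```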
import pandas as pd
import numpy as np

# Same rating data as before
data = {'userId': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'movie': ['m1', 'm2', 'm3', 'm4', 'm5',
                  'm1', 'm2', 'm3', 'm4', 'm5',
                  'm1', 'm2', 'm3', 'm4', 'm5'],
        'rating': [3, 4, 5, 4, 4, 2, 3, 5, 3, 4, 3, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# margins=True appends an 'All' row and an 'All' column holding the totals
df_utility = df.pivot_table(values='rating', index='userId', columns='movie',
                            aggfunc=np.sum, margins=True)
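To see what margins=True adds, the 'All' column and the 'All' row can be selected by label. This is a minimal sketch on the same data; it passes the string 'sum' instead of np.sum, on the assumption that the two give the same result here.

```python
import pandas as pd

data = {'userId': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
        'movie': ['m1', 'm2', 'm3', 'm4', 'm5'] * 3,
        'rating': [3, 4, 5, 4, 4, 2, 3, 5, 3, 4, 3, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df_utility = df.pivot_table(values='rating', index='userId', columns='movie',
                            aggfunc='sum', margins=True)

# 'All' column: sum of each user's ratings; 'All' row: sum of each movie's ratings
print(df_utility['All'])
print(df_utility.loc['All'])
```

With sum as the aggregation, user 1's total is 20, users 2 and 3 each total 17, and the bottom-right cell holds the grand total of 54.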