Optimizing Hierarchical Indexing on Large Datasets

1. Comparison

Data shape: (33694369, 3), i.e. roughly 33.7 million rows and 3 columns.

The bad way
Before optimization: the job did not finish within 30 minutes.

prices = prices.set_index(["id", "date"])[["sell_price"]].unstack(level=-1).fillna(False)

The Better Way: Pandas MultiIndex
After optimization: the MultiIndex approach finishes in under two minutes.

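import pandas as pd

# Build a two-level (id, date) index directly from the two key columns.
index = [list(prices.id), list(prices.date)]
index = pd.MultiIndex.from_arrays(index)
# Wrap the prices in a Series keyed by (id, date).
prices = pd.Series(list(prices.sell_price), index=index)
print('..................................')  # progress marker
# Pivot the innermost level (date) into columns.
prices = prices.unstack()
print('..................................')  # progress marker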
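To check that the two approaches build the same wide table, a minimal sketch on synthetic data may help. The column names follow the snippets above; the table size, the random values, and the `wide_a`/`wide_b` names are invented for illustration. The `set_index` variant here selects a Series rather than a one-column DataFrame and skips `fillna`, so that the two results are directly comparable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real prices table (sizes and values are made up).
rng = np.random.default_rng(0)
ids = [f"item_{i}" for i in range(50)]
dates = list(range(100))
full = pd.MultiIndex.from_product([ids, dates], names=["id", "date"]).to_frame(index=False)
full["sell_price"] = rng.uniform(1.0, 10.0, size=len(full)).round(2)
# Drop some rows so the wide table contains gaps (NaN), as real data often does.
prices = full.sample(frac=0.8, random_state=0).reset_index(drop=True)

# Approach 1: set_index on the DataFrame, then unstack the date level.
wide_a = prices.set_index(["id", "date"])["sell_price"].unstack(level=-1)

# Approach 2: build the MultiIndex explicitly and unstack a Series.
index = pd.MultiIndex.from_arrays([list(prices.id), list(prices.date)])
wide_b = pd.Series(list(prices.sell_price), index=index).unstack()

# Compare shapes, labels, and values (treating NaN == NaN as equal).
same_labels = list(wide_a.index) == list(wide_b.index) and list(wide_a.columns) == list(wide_b.columns)
same_values = np.allclose(wide_a.to_numpy(), wide_b.to_numpy(), equal_nan=True)
print(wide_a.shape, wide_b.shape, same_labels, same_values)
```

This only checks equivalence on a small table; it says nothing about timing, which is best measured on the real data.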

2. Worked example

index = [['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
         [2000, 2010, 2000, 2010, 2000, 2010]]
# Common MultiIndex constructors: from_arrays, from_tuples, from_product
index = pd.MultiIndex.from_arrays(index)
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

series_dt = pd.Series(populations, index=index)
series_dt

series_dt.unstack()
### series_dt 
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64



### unstack
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
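The three constructors mentioned in the example can be compared side by side. The sketch below reuses the state and year values from above; the variable names (`by_arrays`, `by_tuples`, `by_product`) are chosen here purely for illustration.

```python
import pandas as pd

states = ["California", "New York", "Texas"]
years = [2000, 2010]

# from_arrays: one list per level, all of equal length.
by_arrays = pd.MultiIndex.from_arrays([
    ["California", "California", "New York", "New York", "Texas", "Texas"],
    [2000, 2010, 2000, 2010, 2000, 2010],
])

# from_tuples: one (state, year) tuple per row.
by_tuples = pd.MultiIndex.from_tuples([(s, y) for s in states for y in years])

# from_product: Cartesian product of the level values.
by_product = pd.MultiIndex.from_product([states, years])

# All three describe the same six (state, year) pairs in the same order.
print(by_arrays.equals(by_tuples), by_arrays.equals(by_product))
```

`from_product` is the shortest to write when every combination of level values is present; `from_arrays` fits data that already lives in columns, as in the prices example.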

3. References

Python Data Science Handbook: Hierarchical Indexing
