Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
I'd like to quickly compute the global count of each value across all of the values in the DataFrame.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But it's slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
I figured this could be sped up with some pandas horsepower:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed DataFrame
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack the Series objects in `group_bys`... this is faster than reducing a bunch of dictionaries by key
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which holds the factor level for each column,
    # then group by and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts
It's definitely faster (about 75% of the runtime of the previous approach), but there must be something faster still...
%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop
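For what it's worth, the per-column groupby above can be expressed more directly with `Series.value_counts`, skipping the intermediate dictionary. A minimal sketch of that variant (using a string sentinel `"-999"` here, an illustrative choice, so the combined index stays homogeneous):

```python
import pandas as pd

df = pd.DataFrame({"a": ["X", "Y", None], "b": ["X", "X", "Y"]})

# Count each column separately, then sum the counts that share the same value.
counts = (
    pd.concat([df[c].fillna("-999").value_counts() for c in df])
      .groupby(level=0)
      .sum()
)
```

This does the same column-at-a-time work, just with less bookkeeping; it is not expected to change the asymptotics.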
The two approaches above give identical results (with a slight transformation of the result returned by quick_global_count):
dict(c) == quick_global_count(df).to_dict()[0]
True
What is a faster way to compute global value counts in a DataFrame?
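One direction worth trying (a sketch, not a benchmarked answer): flatten the whole frame into a single Series and let one `value_counts` call do the counting; `dropna=False` counts the NaNs directly, so no `fillna` pass is needed:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": ["X", "Y", np.nan], "b": ["X", "X", "Y"]})

# One flat pass over the underlying ndarray instead of a groupby per column.
flat = pd.Series(df.to_numpy().ravel())
counts = flat.value_counts(dropna=False)
```

Whether this beats the groupby version on a million-row frame would need to be measured with `%timeit`.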