python – Quickly count the occurrences of all values in a pandas DataFrame

Suppose I have the following data:

import pandas as pd
import numpy as np
import random
from string import ascii_uppercase

random.seed(100)
n = 1000000

# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)

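I want to quickly compute, for every distinct value that appears anywhere in the dataframe, how many times it occurs across all columns.

This works: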

from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])

But it is slow:

%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop

I figured this could be sped up by using some pandas horsepower:

def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts

It is definitely faster (it takes about 75% of the previous method's run time), but there must be something faster...

%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop

The two approaches give identical results (after a slight modification of what quick_global_count returns):

dict(c) == quick_global_count(df).to_dict()[0]
True

What is a faster way to compute global value counts in a dataframe?
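
One candidate worth timing is to flatten the whole frame into a single 1-D Series and call value_counts once, instead of grouping column by column. This is a minimal sketch, not a benchmarked answer: the helper name flat_value_counts and the tiny example frame are made up for illustration, and whether it actually beats the timings above on the 1,000,000-row data has not been measured here.

```python
import numpy as np
import pandas as pd


def flat_value_counts(df):
    """Hypothetical helper: count every value in `df` across all columns."""
    # Flatten the 2-D block of values into one long 1-D Series
    flat = pd.Series(df.to_numpy().ravel())
    # dropna=False keeps NaN as its own key, so no sentinel such as -999 is needed
    return flat.value_counts(dropna=False)


# Tiny illustrative frame (an assumption, not the data from the question)
small = pd.DataFrame({"A": ["x", "y", np.nan], "B": ["x", "x", "z"]})
counts = flat_value_counts(small)
print(counts)
```

Because NaN is kept as a key here rather than being replaced by -999, the result would need the same kind of small adjustment as above before comparing it with the Counter output.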
