A memory-efficient OneHotEncoder

When encoding, sklearn's OneHotEncoder first converts the input into a numpy.array to speed things up. That conversion can blow up memory when a column holds very large values: if some cell in a column is a 3,000,000-character text string, the copy alone can exhaust RAM. Hence this extremely memory-frugal version.

OneHotEncoder also offers get_feature_names, so this version inherits from TransformerMixin and provides an equivalent while using far less memory. On a dataset of 4,000,000 samples with 27 columns to encode, OneHotEncoder took 2 minutes 35 seconds while this program took 1 minute 35 seconds.

## This code is intended for data that contains large-scale text ##

In fact, for this situation you can also first replace every text value in each column with a small int; the int matrix fed to OneHotEncoder then lets it run at full speed.
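A minimal sketch of that workaround (the sample data and the per-column dict mapping are illustrative assumptions, not from the original post): each column's distinct strings are replaced by consecutive ints, so OneHotEncoder never has to copy the huge strings into a numpy array.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Toy data standing in for rows with very large text cells.
rows = [["a" * 10, "x"], ["b" * 10, "y"], ["a" * 10, "y"]]

# Map each distinct value to a small int, column by column.
codes = []
for col in zip(*rows):  # iterate over columns
    mapping = {}
    codes.append([mapping.setdefault(v, len(mapping)) for v in col])
int_matrix = np.array(codes, dtype=np.int64).T  # rows x columns, ints only

enc = OneHotEncoder()
onehot = enc.fit_transform(int_matrix)
print(onehot.shape)  # 2 distinct values per column -> 4 one-hot columns
```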

from sklearn.base import TransformerMixin
import numpy as np
import scipy.sparse as sp
from collections import defaultdict
from itertools import chain


class LittleOntHotEncoder(TransformerMixin):
    """A memory-frugal OneHotEncoder for columns containing large text values."""

    def __init__(self):
        super().__init__()
        self._per_feature = defaultdict(set)  # column index -> set of seen values
        self._feat2idx = defaultdict(dict)    # column index -> {value: output column}
        self._feat2default = {}               # column index -> fallback output column
        self._all_feature_number = -1

    def get_feature_names(self):
        items = chain.from_iterable(
            [(f"{i}_{e}", idx) for e, idx in dic.items()]
            for i, dic in self._feat2idx.items()
        )
        return [feat for feat, _ in sorted(items, key=lambda x: x[1])]

    def fit(self, cate_samples):
        per_feature = self._per_feature
        for line in cate_samples:
            for i, e in enumerate(line):
                per_feature[i].add(e)

        all_feature_number = 0
        feat2idx = self._feat2idx
        for idx, feat_set in sorted(per_feature.items(), key=lambda x: x[0]):
            for e in feat_set:
                feat2idx[idx][e] = all_feature_number
                # Values unseen during fit fall back to the last index
                # assigned for this column at transform time.
                self._feat2default[idx] = all_feature_number
                all_feature_number += 1
        self._all_feature_number = all_feature_number
        return self

    def transform(self, cate_samples):
        M = len(cate_samples)         # number of rows
        n = len(cate_samples[0])      # number of categorical columns
        N = self._all_feature_number  # total number of one-hot columns
        feat2idx = self._feat2idx
        feat2default = self._feat2default

        # Every row contributes exactly n ones, so the CSR arrays can be
        # built directly without ever materializing a dense matrix.
        data = np.ones(M * n, dtype=np.int8)
        indices = np.array([
            [feat2idx[feat_idx].get(e, feat2default[feat_idx])
             for feat_idx, e in enumerate(line)]
            for line in cate_samples
        ]).reshape(M * n)
        indptr = np.arange(M + 1) * n  # row i spans indices[i*n : (i+1)*n]
        return sp.csr_matrix((data, indices, indptr), shape=(M, N))

    def fit_transform(self, cate_samples, y=None):
        self.fit(cate_samples)
        return self.transform(cate_samples)
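The key trick in transform is assembling the CSR matrix from its three raw arrays. A self-contained demonstration of that construction (the toy sizes and index values are assumptions for illustration): with a fixed n ones per row, indptr is simply multiples of n.

```python
import numpy as np
import scipy.sparse as sp

# 2 rows, 2 categorical columns, 4 one-hot columns in total.
M, n, N = 2, 2, 4

# Row 0 hits one-hot columns 0 and 2; row 1 hits columns 1 and 3.
indices = np.array([0, 2, 1, 3])
data = np.ones(M * n, dtype=np.int8)
indptr = np.arange(M + 1) * n  # [0, 2, 4]: row i spans indices[i*n:(i+1)*n]

mat = sp.csr_matrix((data, indices, indptr), shape=(M, N))
print(mat.toarray())
# [[1 0 1 0]
#  [0 1 0 1]]
```

Note that the last entry of indptr must be the total number of stored values, M*n, not M.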
