Kaggle Mercari Price Suggestion

  • 参赛经验
  • 代码技巧积累
      • 计算时间
      • 文件读取
      • 动态获取数据集行列数 Series转DataFrame便捷写法
      • 一句话把某列是否为空变成新的一列特征
      • loclambda x xindex missing
      • 第一次见frozenset
      • Feature Union
      • Pipeline细节详解
        • fit_transform和transform的区别
        • ItemSelector用于选择某列
        • HashingVectorizer CountVectorizer

参赛经验

第一次参加Kaggle里feature类的比赛,把参赛经验详细整理如下:

  • 每天只有5次提交机会,如果真的想打出一个好名次不妨注册两个账号,而且不知道为什么好像能加入两个队伍(我看到一个认识的人就是这样)。我最初只是抱着试一试的心态,不想拖累任何人,所以没有注册两个账号刷提交机会,也没有组队
  • 组队不一定要找认识的人
  • 这个比赛每个Kernel的运行时间限制在1小时内,跑一次时间长。我就想偷懒,跑完了还能在该界面调试,同样又不影响提交该kernel的output,于是我就在一个kernel跑完之后,直接复制url到另一个tab页面,然后在原tab页面提交了output作为submission,这样就相当于把同一个kernel的运行结果重复提交了两遍!因为我的偷懒,这点快坑死自己了,最后一天有很有用的kernel发出,而我因为偷懒浪费了最后一天两次提交机会!要知道最后一天因为新NB Kernel的发布,名次变化可是很大的!唉,长记性吧!
  • 代码技巧积累

    比赛Stage1截止前一天有人放出一个非常好的特征工程的源码!运行速度比其他方法快很多!下面学习之。
    先贴源码如下:

    import multiprocessing as mp
    import pandas as pd
    from time import time
    from scipy.sparse import csr_matrix
    import os
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer
    from sklearn.metrics import mean_squared_log_error
    from sklearn.preprocessing import OneHotEncoder
    import numpy as np
    import gc
    from sklearn.base import BaseEstimator, TransformerMixin
    import re
    from pandas.api.types import is_numeric_dtype, is_categorical_dtype
    
    os.environ['MKL_NUM_THREADS'] = '4'
    os.environ['OMP_NUM_THREADS'] = '4'
    os.environ['JOBLIB_START_METHOD'] = 'forkserver'
    
    INPUT_PATH = r'../input'
    
    
    def dameraulevenshtein(seq1, seq2):
        """Calculate the Damerau-Levenshtein distance between sequences.
    
        This method has not been modified from the original.
        Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
    
        This distance is the number of additions, deletions, substitutions,
        and transpositions needed to transform the first sequence into the
        second. Although generally used with strings, any sequences of
        comparable objects will work.
    
        Transpositions are exchanges of *consecutive* characters; all other
        operations are self-explanatory.
    
        This implementation is O(N*M) time and O(M) space, for N and M the
        lengths of the two sequences.
    
        >>> dameraulevenshtein('ba', 'abc')
        2
        >>> dameraulevenshtein('fee', 'deed')
        2
    
        It works with arbitrary sequences too:
        >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
        2
        """
        # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
        # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
        # However, only the current and two previous rows are needed at once,
        # so we only store those.
        oneago = None
        thisrow = list(range(1, len(seq2) + 1)) + [0]
        for x in range(len(seq1)):
            # Python lists wrap around for negative indices, so put the
            # leftmost column at the *end* of the list. This matches with
            # the zero-indexed strings and saves extra calculation.
            twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
            for y in range(len(seq2)):
                delcost = oneago[y] + 1
                addcost = thisrow[y - 1] + 1
                subcost = oneago[y - 1] + (seq1[x] != seq2[y])
                thisrow[y] = min(delcost, addcost, subcost)
                # This block deals with transpositions
                if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                        and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                    thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
        return thisrow[len(seq2) - 1]
    
    
    class SymSpell:
        def __init__(self, max_edit_distance=3, verbose=0):
            self.max_edit_distance = max_edit_distance
            self.verbose = verbose
            # 0: top suggestion
            # 1: all suggestions of smallest edit distance
            # 2: all suggestions <= max_edit_distance (slower, no early termination)
    
            self.dictionary = {}
            self.longest_word_length = 0
    
        def get_deletes_list(self, w):
            """given a word, derive strings with up to max_edit_distance characters
               deleted"""
    
            deletes = []
            queue = [w]
            for d in range(self.max_edit_distance):
                temp_queue = []
                for word in queue:
                    if len(word) > 1:
                        for c in range(len(word)):  # character index
                            word_minus_c = word[:c] + word[c + 1:]
                            if word_minus_c not in deletes:
                                deletes.append(word_minus_c)
                            if word_minus_c not in temp_queue:
                                temp_queue.append(word_minus_c)
                queue = temp_queue
    
            return deletes
    
        def create_dictionary_entry(self, w):
            '''add word and its derived deletions to dictionary'''
            # check if word is already in dictionary
            # dictionary entries are in the form: (list of suggested corrections,
            # frequency of word in corpus)
            new_real_word_added = False
            if w in self.dictionary:
                # increment count of word in corpus
                self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
            else:
                self.dictionary[w] = ([], 1)
                self.longest_word_length = max(self.longest_word_length, len(w))
    
            if self.dictionary[w][1] == 1:
                # first appearance of word in corpus
                # n.b. word may already be in dictionary as a derived word
                # (deleting character from a real word)
                # but counter of frequency of word in corpus is not incremented
                # in those cases)
                new_real_word_added = True
                deletes = self.get_deletes_list(w)
                for item in deletes:
                    if item in self.dictionary:
                        # add (correct) word to delete's suggested correction list
                        self.dictionary[item][0].append(w)
                    else:
                        # note frequency of word in corpus is not incremented
                        self.dictionary[item] = ([w], 0)
    
            return new_real_word_added
    
        def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
            total_word_count = 0
            unique_word_count = 0
    
            for line in arr:
                # separate by words by non-alphabetical characters
                words = re.findall(token_pattern, line.lower())
                for word in words:
                    total_word_count += 1
                    if self.create_dictionary_entry(word):
                        unique_word_count += 1
    
            print("total words processed: %i" % total_word_count)
            print("total unique words in corpus: %i" % unique_word_count)
            print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
            print("  edit distance for deletions: %i" % self.max_edit_distance)
            print("  length of longest word in corpus: %i" % self.longest_word_length)
            return self.dictionary
    
        def create_dictionary(self, fname):
            total_word_count = 0
            unique_word_count = 0
    
            with open(fname) as file:
                for line in file:
                    # separate by words by non-alphabetical characters
                    words = re.findall('[a-z]+', line.lower())
                    for word in words:
                        total_word_count += 1
                        if self.create_dictionary_entry(word):
                            unique_word_count += 1
    
            print("total words processed: %i" % total_word_count)
            print("total unique words in corpus: %i" % unique_word_count)
            print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
            print("  edit distance for deletions: %i" % self.max_edit_distance)
            print("  length of longest word in corpus: %i" % self.longest_word_length)
            return self.dictionary
    
        def get_suggestions(self, string, silent=False):
            """return list of suggested corrections for potentially incorrectly
               spelled word"""
            if (len(string) - self.longest_word_length) > self.max_edit_distance:
                if not silent:
                    print("no items in dictionary within maximum edit distance")
                return []
    
            suggest_dict = {}
            min_suggest_len = float('inf')
    
            queue = [string]
            q_dictionary = {}  # items other than string that we've checked
    
            while len(queue) > 0:
                q_item = queue[0]  # pop
                queue = queue[1:]
    
                # early exit
                if ((self.verbose < 2) and (len(suggest_dict) > 0) and
                        ((len(string) - len(q_item)) > min_suggest_len)):
                    break
    
                # process queue item
                if (q_item in self.dictionary) and (q_item not in suggest_dict):
                    if self.dictionary[q_item][1] > 0:
                        # word is in dictionary, and is a word from the corpus, and
                        # not already in suggestion list so add to suggestion
                        # dictionary, indexed by the word with value (frequency in
                        # corpus, edit distance)
                        # note q_items that are not the input string are shorter
                        # than input string since only deletes are added (unless
                        # manual dictionary corrections are added)
                        assert len(string) >= len(q_item)
                        suggest_dict[q_item] = (self.dictionary[q_item][1],
                                                len(string) - len(q_item))
                        # early exit
                        if (self.verbose < 2) and (len(string) == len(q_item)):
                            break
                        elif (len(string) - len(q_item)) < min_suggest_len:
                            min_suggest_len = len(string) - len(q_item)
    
                    # the suggested corrections for q_item as stored in
                    # dictionary (whether or not q_item itself is a valid word
                    # or merely a delete) can be valid corrections
                    for sc_item in self.dictionary[q_item][0]:
                        if sc_item not in suggest_dict:
    
                            # compute edit distance
                            # suggested items should always be longer
                            # (unless manual corrections are added)
                            assert len(sc_item) > len(q_item)
    
                            # q_items that are not input should be shorter
                            # than original string
                            # (unless manual corrections added)
                            assert len(q_item) <= len(string)
    
                            if len(q_item) == len(string):
                                assert q_item == string
                                item_dist = len(sc_item) - len(q_item)
    
                            # item in suggestions list should not be the same as
                            # the string itself
                            assert sc_item != string
    
                            # calculate edit distance using, for example,
                            # Damerau-Levenshtein distance
                            item_dist = dameraulevenshtein(sc_item, string)
    
                            # do not add words with greater edit distance if
                            # verbose setting not on
                            if (self.verbose < 2) and (item_dist > min_suggest_len):
                                pass
                            elif item_dist <= self.max_edit_distance:
                                assert sc_item in self.dictionary  # should already be in dictionary if in suggestion list
                                suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
                                if item_dist < min_suggest_len:
                                    min_suggest_len = item_dist
    
                            # depending on order words are processed, some words
                            # with different edit distances may be entered into
                            # suggestions; trim suggestion dictionary if verbose
                            # setting not on
                            if self.verbose < 2:
                                suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] <= min_suggest_len}
    
                # now generate deletes (e.g. a substring of string or of a delete)
                # from the queue item
                # as additional items to check -- add to end of queue
                assert len(string) >= len(q_item)
    
                # do not add words with greater edit distance if verbose setting
                # is not on
                if (self.verbose < 2) and ((len(string) - len(q_item)) > min_suggest_len):
                    pass
                elif (len(string) - len(q_item)) < self.max_edit_distance and len(q_item) > 1:
                    for c in range(len(q_item)):  # character index
                        word_minus_c = q_item[:c] + q_item[c + 1:]
                        if word_minus_c not in q_dictionary:
                            queue.append(word_minus_c)
                            q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this
    
            # queue is now empty: convert suggestions in dictionary to
            # list for output
            if not silent and self.verbose != 0:
                print("number of possible corrections: %i" % len(suggest_dict))
                print("  edit distance for deletions: %i" % self.max_edit_distance)
    
            # output option 1
            # sort results by ascending order of edit distance and descending
            # order of frequency
            #     and return list of suggested word corrections only:
            # return sorted(suggest_dict, key = lambda x:
            #               (suggest_dict[x][1], -suggest_dict[x][0]))
    
            # output option 2
            # return list of suggestions with (correction,
            #                                  (frequency in corpus, edit distance)):
            as_list = suggest_dict.items()
            # outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
            outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))
    
            if self.verbose == 0:
                return outlist[0]
            else:
                return outlist
    
            '''
            Option 1:
            ['file', 'five', 'fire', 'fine', ...]
    
            Option 2:
            [('file', (5, 0)),
             ('five', (67, 1)),
             ('fire', (54, 1)),
             ('fine', (17, 1))...]  
            '''
    
        def best_word(self, s, silent=False):
            try:
                return self.get_suggestions(s, silent)[0]
            except:
                return None
    
    
    class ItemSelector(BaseEstimator, TransformerMixin):
        def __init__(self, field, start_time=time()):
            self.field = field
            self.start_time = start_time
    
        def fit(self, x, y=None):
            return self
    
        def transform(self, dataframe):
            print(f'[{time()-self.start_time}] select {self.field}')
            dt = dataframe[self.field].dtype
            if is_categorical_dtype(dt):
                return dataframe[self.field].cat.codes[:, None]
            elif is_numeric_dtype(dt):
                return dataframe[self.field][:, None]
            else:
                return dataframe[self.field]
    
    
    class DropColumnsByDf(BaseEstimator, TransformerMixin):
        def __init__(self, min_df=1, max_df=1.0):
            self.min_df = min_df
            self.max_df = max_df
    
        def fit(self, X, y=None):
            m = X.tocsc()
            self.nnz_cols = ((m != 0).sum(axis=0) >= self.min_df).A1
            if self.max_df < 1.0:
                max_df = m.shape[0] * self.max_df
                self.nnz_cols = self.nnz_cols & ((m != 0).sum(axis=0) <= max_df).A1
            return self
    
        def transform(self, X, y=None):
            m = X.tocsc()
            return m[:, self.nnz_cols]
    
    
    def get_rmsle(y_true, y_pred):
        return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))
    
    
    def split_cat(text):
        try:
            cats = text.split("/")
            return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
        except:
            print("no category")
            return 'other', 'other', 'other', 'other/other'
    
    
    def brands_filling(dataset):
        vc = dataset['brand_name'].value_counts()
        brands = vc[vc > 0].index
        brand_word = r"[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+"
    
        many_w_brands = brands[brands.str.contains(' ')]
        one_w_brands = brands[~brands.str.contains(' ')]
    
        ss2 = SymSpell(max_edit_distance=0)
        ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')
    
        ss1 = SymSpell(max_edit_distance=0)
        ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')
    
        two_words_re = re.compile(r"(?=(\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+))")
    
        def find_in_str_ss2(row):
            for doc_word in two_words_re.finditer(row):
                print(doc_word)
                suggestion = ss2.best_word(doc_word.group(1), silent=True)
                if suggestion is not None:
                    return doc_word.group(1)
            return ''
    
        def find_in_list_ss1(list):
            for doc_word in list:
                suggestion = ss1.best_word(doc_word, silent=True)
                if suggestion is not None:
                    return doc_word
            return ''
    
        def find_in_list_ss2(list):
            for doc_word in list:
                suggestion = ss2.best_word(doc_word, silent=True)
                if suggestion is not None:
                    return doc_word
            return ''
    
        print(f"Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")
    
        n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
            pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
        dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]
    
        n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
            pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
        dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]
    
        n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
        dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]
    
        desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
        dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]
    
        print(f"After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")
    
        del ss1, ss2
        gc.collect()
    
    
    def preprocess_regex(dataset, start_time=time()):
        karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
        karats_repl = r'\1k\4'
    
        unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
        unit_repl = r'\1\2\3'
    
        dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
        dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
        print(f'[{time() - start_time}] Karats normalized.')
    
        dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
        dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
        print(f'[{time() - start_time}] Units glued.')
    
    
    def preprocess_pandas(train, test, start_time=time()):
        train = train[train.price > 0.0].reset_index(drop=True)
        print('Train shape without zero price: ', train.shape)
    
        nrow_train = train.shape[0]
        y_train = np.log1p(train["price"])
        merge: pd.DataFrame = pd.concat([train, test])
    
        del train
        del test
        gc.collect()
    
        merge['has_category'] = (merge['category_name'].notnull()).astype('category')
        print(f'[{time() - start_time}] Has_category filled.')
    
        merge['category_name'] = merge['category_name'] \
            .fillna('other/other/other') \#填补空缺值
            .str.lower() \#都变成小写
            .astype(str)
        merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
            zip(*merge['category_name'].apply(lambda x: split_cat(x)))
        print(f'[{time() - start_time}] Split categories completed.')
    
        merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
        print(f'[{time() - start_time}] Has_brand filled.')
    
        merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
        merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
        merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
        print(f'[{time() - start_time}] Categories and item_condition_id concancenated.')
    
        merge['name'] = merge['name'] \
            .fillna('') \
            .str.lower() \
            .astype(str)
        merge['brand_name'] = merge['brand_name'] \
            .fillna('') \
            .str.lower() \
            .astype(str)
        merge['item_description'] = merge['item_description'] \
            .fillna('') \
            .str.lower() \
            .replace(to_replace='No description yet', value='')
        print(f'[{time() - start_time}] Missing filled.')
    
        preprocess_regex(merge, start_time)
    
        brands_filling(merge)
        print(f'[{time() - start_time}] Brand name filled.')
    
        merge['name'] = merge['name'] + ' ' + merge['brand_name']
        print(f'[{time() - start_time}] Name concancenated.')
    
        merge['item_description'] = merge['item_description'] \
                                    + ' ' + merge['name'] \
                                    + ' ' + merge['subcat_1'] \
                                    + ' ' + merge['subcat_2'] \
                                    + ' ' + merge['general_cat'] \
                                    + ' ' + merge['brand_name']
        print(f'[{time() - start_time}] Item description concatenated.')
    
        merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)
    
        return merge, y_train, nrow_train
    
    
    def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
        t = train.tocsc()
        v = valid.tocsc()
        nnz_train = ((t != 0).sum(axis=0) >= min_df).A1
        nnz_valid = ((v != 0).sum(axis=0) >= min_df).A1
        nnz_cols = nnz_train & nnz_valid
        res = t[:, nnz_cols], v[:, nnz_cols]
        return res
    
    
    if __name__ == '__main__':
        mp.set_start_method('forkserver', True)
    
        start_time = time()
    
        train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                              engine='c',
                              dtype={'item_condition_id': 'category',
                                     'shipping': 'category'}
                              )
        test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
                             engine='c',
                             dtype={'item_condition_id': 'category',
                                    'shipping': 'category'}
                             )
        print(f'[{time() - start_time}] Finished to load data')
        print('Train shape: ', train.shape)
        print('Test shape: ', test.shape)
    
        submission: pd.DataFrame = test[['test_id']]
    
        merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)
    
        meta_params = {'name_ngram': (1, 2),
                       'name_max_f': 75000,
                       'name_min_df': 10,
    
                       'category_ngram': (2, 3),
                       'category_token': '.+',
                       'category_min_df': 10,
    
                       'brand_min_df': 10,
    
                       'desc_ngram': (1, 3),
                       'desc_max_f': 150000,
                       'desc_max_df': 0.5,
                       'desc_min_df': 10}
    
        stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
        # 'i', 'so', 'its', 'am', 'are'])
    
        vectorizer = FeatureUnion([
            ('name', Pipeline([
                ('select', ItemSelector('name', start_time=start_time)),
                ('transform', HashingVectorizer(
                    ngram_range=(1, 2),
                    n_features=2 ** 28,
                    norm='l2',
                    lowercase=False,
                    stop_words=stopwords
                )),
                ('drop_cols', DropColumnsByDf(min_df=2))
            ])),
            ('category_name', Pipeline([
                ('select', ItemSelector('category_name', start_time=start_time)),
                ('transform', HashingVectorizer(
                    ngram_range=(1, 1),
                    token_pattern='.+',
                    tokenizer=split_cat,
                    n_features=2 ** 28,
                    norm='l2',
                    lowercase=False
                )),
                ('drop_cols', DropColumnsByDf(min_df=2))
            ])),
            ('brand_name', Pipeline([
                ('select', ItemSelector('brand_name', start_time=start_time)),
                ('transform', CountVectorizer(
                    token_pattern='.+',
                    min_df=2,
                    lowercase=False
                )),
            ])),
            ('gencat_cond', Pipeline([
                ('select', ItemSelector('gencat_cond', start_time=start_time)),
                ('transform', CountVectorizer(
                    token_pattern='.+',
                    min_df=2,
                    lowercase=False
                )),
            ])),
            ('subcat_1_cond', Pipeline([
                ('select', ItemSelector('subcat_1_cond', start_time=start_time)),
                ('transform', CountVectorizer(
                    token_pattern='.+',
                    min_df=2,
                    lowercase=False
                )),
            ])),
            ('subcat_2_cond', Pipeline([
                ('select', ItemSelector('subcat_2_cond', start_time=start_time)),
                ('transform', CountVectorizer(
                    token_pattern='.+',
                    min_df=2,
                    lowercase=False
                )),
            ])),
            ('has_brand', Pipeline([
                ('select', ItemSelector('has_brand', start_time=start_time)),
                ('ohe', OneHotEncoder())
            ])),
            ('shipping', Pipeline([
                ('select', ItemSelector('shipping', start_time=start_time)),
                ('ohe', OneHotEncoder())
            ])),
            ('item_condition_id', Pipeline([
                ('select', ItemSelector('item_condition_id', start_time=start_time)),
                ('ohe', OneHotEncoder())
            ])),
            ('item_description', Pipeline([
                ('select', ItemSelector('item_description', start_time=start_time)),
                ('hash', HashingVectorizer(
                    ngram_range=(1, 3),
                    n_features=2 ** 27,
                    dtype=np.float32,
                    norm='l2',
                    lowercase=False,
                    stop_words=stopwords
                )),
                ('drop_cols', DropColumnsByDf(min_df=2)),
            ]))
        ], n_jobs=1)
    
        sparse_merge = vectorizer.fit_transform(merge)
        print(f'[{time() - start_time}] Merge vectorized')
        print(sparse_merge.shape)
    
        tfidf_transformer = TfidfTransformer()
    
        X = tfidf_transformer.fit_transform(sparse_merge)
        print(f'[{time() - start_time}] TF/IDF completed')
    
        X_train = X[:nrow_train]
        print(X_train.shape)
    
        X_test = X[nrow_train:]
        del merge
        del sparse_merge
        del vectorizer
        del tfidf_transformer
        gc.collect()
    
        X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
        print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
        gc.collect()
    
        ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=250, normalize=False, tol=0.01)
        ridge.fit(X_train, y_train)
        print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')
    
        predsR = ridge.predict(X_test)
        print(f'[{time() - start_time}] Predict Ridge completed.')
    
        submission.loc[:, 'price'] = np.expm1(predsR)
        submission.loc[submission['price'] < 0.0, 'price'] = 0.0
        submission.to_csv("submission_ridge.csv", index=False)
    

    我把训练集的前5行单独提出来文件名为train

    计算时间

    记录开始时间: start_time = time()

    文件读取

    如果以以下方式直接读取
    train=pd.read_csv(“train.csv”,sep=’\t’)
    查看每列的类型可以看出

    train.dtypes
    Out[2]: 
    train_id               int64
    name                  object
    item_condition_id      int64
    category_name         object
    brand_name            object
    price                float64
    shipping               int64
    item_description      object

    而作者的读取方式用engine=’c’理论上可以加快读取速度;通过设置dtype可以改变列在DataFrame里的类型。为了加快看结果的速度,只取前5行哈(不影响结果)

    INPUT_PATH = r'../input'
     train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                              engine='c',
                              dtype={'item_condition_id': 'category',
                                     'shipping': 'category'}
                              )
     train5=train.head(5)
     train5.dtypes
    Out[2]: 
    train_id                int64
    name                   object
    item_condition_id    category
    category_name          object
    brand_name             object
    price                 float64
    shipping             category
    item_description       object
    dtype: object
    

    动态获取数据集行列数 & Series转DataFrame便捷写法

    train5.shape
    Out[3]: (5, 8)
    #经试验,这种行为生成一个DataFrame,名字叫submission,且有列名就是test_id
    submission: pd.DataFrame = test[['test_id']]
    submission
    Out[8]: 
            test_id
    0             0
    1             1
    2             2
    3             3
    4             4
    5             5
    6             6
    7             7
    8             8
    9             9
    10           10
    11           11
    12           12
    13           13
    14           14
    15           15
    16           16
    17           17
    18           18
    19           19
    20           20
    21           21
    22           22
    23           23
    24           24
    25           25
    26           26
    27           27
    28           28
    29           29
             ...
    291098   291098
    291099   291099
    291100   291100
    291101   291101
    291102   291102
    291103   291103
    291104   291104
    291105   291105
    291106   291106
    291107   291107
    291108   291108
    291109   291109
    291110   291110
    291111   291111
    291112   291112
    291113   291113
    291114   291114
    291115   291115
    291116   291116
    291117   291117
    291118   291118
    291119   291119
    291120   291120
    291121   291121
    291122   291122
    291123   291123
    291124   291124
    291125   291125
    291126   291126
    291127   291127
    [291128 rows x 1 columns]
    #可能你会想为什么要这么写,不妨看看如果不这么写:
    type(submission)
    Out[9]: pandas.core.frame.DataFrame
    type(test['test_id'])
    Out[10]: pandas.core.series.Series
    #Series并不是我们想要的类型,因为Series如果输出成csv是没有列名的,而我们想要列名,所以需要DataFrame类型

    一句话把某列是否为空变成新的一列特征

    train5['has_category'] = train5['category_name'].notnull().astype('category')
    #额外附加:
    train5['category_name'].notnull()
    Out[4]: 
    0    True
    1    True
    2    True
    3    True
    4    True
    Name: category_name, dtype: bool
    

    如果不加.astype(‘category’),train5[‘has_category’]的dtype为’bool’

    .loc[lambda x: x.index != ‘missing’]

    train = pd.read_table( 'train.tsv',
                              engine='c',
                              dtype={'item_condition_id': 'category',
                                     'shipping': 'category'}
                              )
    train150=train.head(150)
    NUM_BRANDS=10
    
    train150['brand_name'].fillna(value='missing', inplace=True)
    pop_brand = train150['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    ###过程分析:第一步,数品牌出现的次数
    train150['brand_name'].value_counts()
    Out[1]: 
    missing                 58
    Victoria's Secret        5
    PINK                     4
    Sephora                  4
    Adidas                   3
    Nike                     3
    Nintendo                 2
    Fossil                   2
    Apple                    2
    Under Armour             2
    FOREVER 21               2
    Hollister                2
    Brandy Melville          2
    Air Jordan               2
    Rae Dunn                 2
    Target                   2
    Michael Kors             2
    LuLaRoe                  2
    Lululemon                2
    Too Faced                2
    Gap                      1
    Levi'1
    Rock Revival             1
    Kylie Cosmetics          1
    Spin Master              1
    Anthropologie            1
    Urban Outfitters         1
    Samsung                  1
    Littlest Pet Shop        1
    Ralph Lauren             1
                            ..
    Nostalgia Electrics      1
    Motherhood Maternity     1
    H&M                      1
    Melissa & Doug           1
    The North Face           1
    Barbie                   1
    Acacia Swimwear          1
    Miss Me                  1
    Pokemon                  1
    Manolo Blahnik           1
    MARC JACOBS              1
    Merona                   1
    Scholastic               1
    Faded Glory              1
    Salvatore Ferragamo      1
    Torrid                   1
    Xbox                     1
    Bath & Body Works        1
    Razer                    1
    rue                      1
    patagonia                1
    Soffe                    1
    Steve Madden             1
    Nars                     1
    LC Lauren Conrad         1
    lululemon athletica      1
    Smashbox                 1
    Infantino                1
    Silver Jeans Co.         1
    Boulevard Boutique       1
    Name: brand_name, Length: 65, dtype: int64
    #第二步:把missing删去
    train150['brand_name'].value_counts().loc[lambda x: x.index != 'missing']
    Out[2]: 
    Victoria's Secret       5
    PINK                    4
    Sephora                 4
    Adidas                  3
    Nike                    3
    Nintendo                2
    Fossil                  2
    Apple                   2
    Under Armour            2
    FOREVER 21              2
    Hollister               2
    Brandy Melville         2
    Air Jordan              2
    Rae Dunn                2
    Target                  2
    Michael Kors            2
    LuLaRoe                 2
    Lululemon               2
    Too Faced               2
    Gap                     1
    Levi'1
    Rock Revival            1
    Kylie Cosmetics         1
    Spin Master             1
    Anthropologie           1
    Urban Outfitters        1
    Samsung                 1
    Littlest Pet Shop       1
    Ralph Lauren            1
    Tarte                   1
                           ..
    Nostalgia Electrics     1
    Motherhood Maternity    1
    H&M                     1
    Melissa & Doug          1
    The North Face          1
    Barbie                  1
    Acacia Swimwear         1
    Miss Me                 1
    Pokemon                 1
    Manolo Blahnik          1
    MARC JACOBS             1
    Merona                  1
    Scholastic              1
    Faded Glory             1
    Salvatore Ferragamo     1
    Torrid                  1
    Xbox                    1
    Bath & Body Works       1
    Razer                   1
    rue                     1
    patagonia               1
    Soffe                   1
    Steve Madden            1
    Nars                    1
    LC Lauren Conrad        1
    lululemon athletica     1
    Smashbox                1
    Infantino               1
    Silver Jeans Co.        1
    Boulevard Boutique      1
    Name: brand_name, Length: 64, dtype: int64
    #第三步:截取出现次数最高的10个品牌名
    train150['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    Out[3]: 
    Index(['Victoria's Secret', 'PINK', 'Sephora', 'Adidas', 'Nike', 'Nintendo',
           'Fossil', 'Apple', 'Under Armour', 'FOREVER 21'],
          dtype='object')
    
    train150.loc[~train150['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    #第一步:选出'brand_name'在pop_brand里的
    train150['brand_name'].isin(pop_brand)
    Out[1]: 
    0      False
    1      False
    2      False
    3      False
    4      False
    5      False
    6      False
    7      False
    8       True
    9      False
    10     False
    11      True
    12     False
    13     False
    14     False
    15     False
    16     False
    17     False
    18     False
    19     False
    20     False
    21      True
    22     False
    23     False
    24      True
    25      True
    26      True
    27      True
    28      True
    29      True
           ...  
    120    False
    121    False
    122    False
    123    False
    124    False
    125    False
    126    False
    127    False
    128    False
    129    False
    130    False
    131     True
    132    False
    133    False
    134    False
    135    False
    136    False
    137     True
    138    False
    139    False
    140    False
    141    False
    142    False
    143    False
    144    False
    145     True
    146    False
    147    False
    148    False
    149     True
    Name: brand_name, Length: 150, dtype: bool
    #第二步:因为只有isin,没有notin,所以想求notin要对isin做取反操作,即“~”
    ~train150['brand_name'].isin(pop_brand)
    Out[2]: 
    0       True
    1       True
    2       True
    3       True
    4       True
    5       True
    6       True
    7       True
    8      False
    9       True
    10      True
    11     False
    12      True
    13      True
    14      True
    15      True
    16      True
    17      True
    18      True
    19      True
    20      True
    21     False
    22      True
    23      True
    24     False
    25     False
    26     False
    27     False
    28     False
    29     False
           ...  
    120     True
    121     True
    122     True
    123     True
    124     True
    125     True
    126     True
    127     True
    128     True
    129     True
    130     True
    131    False
    132     True
    133     True
    134     True
    135     True
    136     True
    137    False
    138     True
    139     True
    140     True
    141     True
    142     True
    143     True
    144     True
    145    False
    146     True
    147     True
    148     True
    149    False
    Name: brand_name, Length: 150, dtype: bool
    #第三步:获得不在pop_brand所在行的brandname,注意看index为8的不被列出
    train150.loc[~train150['brand_name'].isin(pop_brand), 'brand_name'] 
    Out[3]: 
    0                  missing
    1                    Razer
    2                   Target
    3                  missing
    4                  missing
    5                  missing
    6          Acacia Swimwear
    7                    Soffe
    9                  missing
    10                Smashbox
    12                     rue
    13              Scholastic
    14           UGG Australia
    15                   Tarte
    16              Wet n Wild
    17                 missing
    18               Too Faced
    19           Anthropologie
    20                  Torrid
    22                 Samsung
    23                 missing
    30               Too Faced
    31                 missing
    34                 missing
    35                 missing
    36                 missing
    37            Under Armour
    38                    Xbox
    40         Kylie Cosmetics
    42                 missing
                  ...         
    113            MARC JACOBS
    114                missing
    115           Steve Madden
    117                missing
    120          No Boundaries
    121                missing
    122                missing
    123            Faded Glory
    124                Miss Me
    125    Salvatore Ferragamo
    126       Urban Outfitters
    127                missing
    128                missing
    129                missing
    130                missing
    132                missing
    133      Bath & Body Works
    134                 Fossil
    135                   Nars
    136        Brandy Melville
    138                missing
    139                missing
    140               Nintendo
    141         Manolo Blahnik
    142           Ralph Lauren
    143       LC Lauren Conrad
    144                missing
    146              Lululemon
    147           Michael Kors
    148                missing
    Name: brand_name, Length: 121, dtype: object
    #第四步:赋值
    train150.loc[~train150['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    #结果:brand_name不在前多少(自己设定)pop_brand的brand_name都被设置为missing,泯然众人矣了
    train150['brand_name']
    Out[5]: 
    0                missing
    1                missing
    2                missing
    3                missing
    4                missing
    5                missing
    6                missing
    7                missing
    8                   Nike
    9                missing
    10               missing
    11     Victoria's Secret
    12               missing
    13               missing
    14               missing
    15               missing
    16               missing
    17               missing
    18               missing
    19               missing
    20               missing
    21     Victoria's Secret
    22               missing
    23               missing
    24            FOREVER 21
    25            Air Jordan
    26                 Apple
    27               LuLaRoe
    28            FOREVER 21
    29             Hollister
                 ...        
    120              missing
    121              missing
    122              missing
    123              missing
    124              missing
    125              missing
    126              missing
    127              missing
    128              missing
    129              missing
    130              missing
    131              LuLaRoe
    132              missing
    133              missing
    134              missing
    135              missing
    136              missing
    137               Adidas
    138              missing
    139              missing
    140              missing
    141              missing
    142              missing
    143              missing
    144              missing
    145           Air Jordan
    146              missing
    147              missing
    148              missing
    149                Apple
    Name: brand_name, Length: 150, dtype: object
    

    第一次见frozenset

    stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
    stopwords
    Out[8]: frozenset({'a', 'an', 'is', 'it', 'the', 'this'})

    frozenset 用法详解:
    https://www.programiz.com/python-programming/methods/built-in/frozenset

    Feature Union

    使用sklearn优雅地进行数据挖掘教程:http://www.cnblogs.com/jasonfreak/p/5448462.html
    (教程很好,但我个人认为有一个错误,不知道我说的对不对哈,就是部分并行这个问题,作者可能是不知道有ItemSelector这个功能,所以自己实现了部分并行)
    官方说明:http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
    Concatenates results of multiple transformer objects.

    This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

    Parameters of the transformers may be set using its name and the parameter name separated by a ‘__’. A transformer may be replaced entirely by setting the parameter with its name to another transformer, or removed by setting to None.
    中文说明(附很有用的例子!):
    http://sklearn.apachecn.org/cn/0.19.0/modules/pipeline.html

    Pipeline细节详解:

    fit_transform和transform的区别

    关于fit_transform和transform的详解参见:http://blog.csdn.net/anecdotegyb/article/details/74857055
    提炼如下:
    Q:scikit-learn中fit_transform()与transform()到底有什么区别,能不能混用?
    A:二者的功能都是对数据进行某种统一处理(比如标准化~N(0,1),将数据缩放(映射)到某个固定区间,归一化,正则化等)
    fit_transform(partData)对部分数据先拟合fit,找到该part的整体指标,如均值、方差、最大值最小值等等(根据具体转换的目的),然后对该partData进行转换transform,从而实现数据的标准化、归一化等等。。
    根据对之前部分fit的整体指标,对剩余的数据(restData)使用同样的均值、方差、最大最小值等指标进行转换transform(restData),从而保证part、rest处理方式相同。
    必须先用fit_transform(partData),之后再transform(restData)
    如果直接transform(partData),程序会报错
    如果fit_transfrom(partData)后,使用fit_transform(restData)而不用transform(restData),虽然也能归一化,但是两个结果不是在同一个“标准”下的,具有明显差异。

    ItemSelector:用于选择某列

    https://stats.stackexchange.com/questions/177082/sklearn-combine-multiple-feature-sets-in-pipeline

    HashingVectorizer/ CountVectorizer

    http://ju.outofmemory.cn/entry/74958

    你可能感兴趣的:(机器学习)