Feature Processing in Data Mining, Part 1 (Practice)

It is best not to modify the original column in place. Write the derived values to a new column instead.

parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}

contb['party'] = contb['cand_nm'].map(parties)
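As a minimal sketch of the same mapping on toy data: the three-row `contb` frame, the shortened `parties` dict, and the name 'Unknown, Candidate' are made up for illustration and are not part of the original dataset. It shows that `Series.map` with a dict leaves the source column untouched and yields NaN for any key missing from the dict, which is worth checking before moving on.

```python
import pandas as pd

# Hypothetical miniature of the contributions table; only cand_nm matters here.
contb = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Romney, Mitt', 'Unknown, Candidate'],
})

# Shortened version of the parties dict, for illustration only.
parties = {'Obama, Barack': 'Democrat', 'Romney, Mitt': 'Republican'}

# Write the result to a new column so cand_nm stays intact.
contb['party'] = contb['cand_nm'].map(parties)

# Names missing from the dict map to NaN; list them so they can be handled.
unmapped = contb.loc[contb['party'].isna(), 'cand_nm'].unique().tolist()
print(unmapped)
```

If `unmapped` is not empty, either extend the dict or decide explicitly how those rows should be treated.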

Below are the fields of a log file. To save space, only two sample rows are shown.

id       userid  itemid   categoryid  type  time
1794879  16508   550769   2440115     pv    1423230247
1551349  153760  4246496  3077776     pv    1499073872
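The two sample rows can be loaded into a DataFrame roughly as follows. This sketch assumes the log is whitespace-separated with a header line, which the excerpt suggests but does not state. The names `raw`, `log`, and `id_columns` are illustrative.

```python
import io
import pandas as pd

# The two sample rows from the field description, as whitespace-separated text.
raw = """id userid itemid categoryid type time
1794879 16508 550769 2440115 pv 1423230247
1551349 153760 4246496 3077776 pv 1499073872
"""

# sep=r"\s+" splits on runs of whitespace; the first line supplies the column names.
log = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# The ID columns are identifiers, not quantities, so store them as strings
# to keep them out of accidental arithmetic.
id_columns = ['id', 'userid', 'itemid', 'categoryid']
log[id_columns] = log[id_columns].astype(str)

print(log)
```

The `time` column is left as an integer here, since it holds Unix timestamps in seconds that are converted in a later step.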
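Each user action type is given a different weight. The values are set by gut feeling (which is itself a form of experience), in the order share > collect > comment > deep read > casual browsing. Note that the code below gives collect and comment the same weight, 1.2.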

A way to map text values to numbers

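def action_type_transfer(x):
    """Return the hand-picked weight for one action type."""
    if x == 'view':
        return 0.8
    elif x == 'deep_view':
        return 1
    elif x == 'comment':
        return 1.2
    elif x == 'collect':
        return 1.2
    elif x == 'share':
        return 1.5
    else:
        # Unrecognized action types fall back to a neutral weight.
        return 1

# Store the weight in a new column; action_type itself is left untouched.
train['actiontype_weight'] = train['action_type'].apply(action_type_transfer)

# Binary categories can be mapped directly with a dict.
# Note: this overwrites Gender in place; values other than 'F'/'M' become NaN.
gender_map = {'F': 0, 'M': 1}
users['Gender'] = users['Gender'].map(gender_map)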
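The if/elif chain can also be written as a dict lookup, in the same style as `gender_map`. This is a sketch of an equivalent alternative, not the author's code. The `action_weights` name and the toy `train` frame are assumptions, and `fillna(1.0)` plays the role of the `else` branch.

```python
import pandas as pd

# Same weights as action_type_transfer, expressed as a lookup table.
action_weights = {
    'view': 0.8,
    'deep_view': 1.0,
    'comment': 1.2,
    'collect': 1.2,
    'share': 1.5,
}

# Toy data; 'unknown_action' stands in for any type not listed above.
train = pd.DataFrame({'action_type': ['view', 'share', 'collect', 'unknown_action']})

# map() yields NaN for unlisted types; fillna(1.0) gives them the neutral default.
train['actiontype_weight'] = train['action_type'].map(action_weights).fillna(1.0)

print(train)
```

Keeping the weights in one dict makes them easier to tune later, since they are guesses to begin with.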


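The following two standard-library modules are indispensable for working with time

import time
import datetime

def timestamp_transfer(x):
    """Convert a Unix timestamp in seconds to a 'YYYY-MM-DD HH:MM:SS' string."""
    # time.localtime uses the machine's local time zone, so the output depends on where it runs.
    x = time.localtime(x)
    x = time.strftime("%Y-%m-%d %H:%M:%S", x)
    return x

# Keep the raw timestamp and store the readable form in a new column.
train['date'] = train['action_time'].apply(timestamp_transfer)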
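For larger tables, the same conversion can be done on the whole column at once with `pd.to_datetime`. The sketch below assumes `action_time` holds Unix timestamps in seconds and reuses the two `time` values from the sample log rows as stand-in data. Unlike `time.localtime`, it interprets the timestamps in UTC, so the strings will differ from the function above on a machine whose local zone is not UTC.

```python
import pandas as pd

# Stand-in data: the two timestamps from the sample log rows.
train = pd.DataFrame({'action_time': [1423230247, 1499073872]})

# unit='s' treats the integers as seconds since the epoch; utc=True makes
# the result time-zone aware and independent of the machine's settings.
parsed = pd.to_datetime(train['action_time'], unit='s', utc=True)

# Format as strings in a new column, leaving action_time untouched.
train['date'] = parsed.dt.strftime('%Y-%m-%d %H:%M:%S')

print(train)
# date column: '2015-02-06 13:44:07' and '2017-07-03 09:24:32' (UTC)
```

If local time is what the analysis needs, `parsed.dt.tz_convert(...)` can shift the values to a chosen zone before formatting.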

import pandas as pd

# Reference date used to measure how many days old each record is.
ref_date = pd.Timestamp('2017-02-19')

# Keep 'date' as datetime64 truncated to midnight so the subtraction yields timedeltas
# and .dt.days works. train['date'] holds strings at this point, so it is parsed here too.
for df in (all_news_info, news_info, train):
    df['date'] = pd.to_datetime(df['date']).dt.normalize()
    df['sub_days'] = (ref_date - df['date']).dt.days
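A small worked example of the day-difference feature, on made-up dates, shows why the time of day should be dropped before subtracting. The `news_info` rows and the `raw_days` column are illustrative only. The reference date 2017-02-19 is the one used above.

```python
import pandas as pd

# Hypothetical records with a time-of-day component.
news_info = pd.DataFrame({
    'date': ['2017-02-19 08:30:00', '2017-02-18 23:59:59', '2016-12-31 12:00:00'],
})
ref_date = pd.Timestamp('2017-02-19')

parsed = pd.to_datetime(news_info['date'])

# Without truncation, the time of day leaks into the day count.
news_info['raw_days'] = (ref_date - parsed).dt.days

# normalize() sets every timestamp to midnight, giving whole calendar days.
news_info['sub_days'] = (ref_date - parsed.dt.normalize()).dt.days

print(news_info)
# raw_days: -1, 0, 49    sub_days: 0, 1, 50
```

The `.dt.days` accessor returns the floored day component of each timedelta, which is why a record from earlier the same morning shows up as -1 in `raw_days`.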
