It is best not to modify the original column in place; add a new column instead.
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}
contb['party'] = contb['cand_nm'].map(parties)
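
One thing worth checking after a dict-based `map` is whether any value had no matching key, since those rows silently become NaN. A minimal sketch, assuming a tiny made-up table (`contb_demo`, `parties_demo`, and the unknown name are hypothetical, not from the real data):

```python
import pandas as pd

# Hypothetical miniature of the contributions table; only `cand_nm` matters here.
contb_demo = pd.DataFrame(
    {"cand_nm": ["Obama, Barack", "Paul, Ron", "Unknown, Person"]}
)
parties_demo = {"Obama, Barack": "Democrat", "Paul, Ron": "Republican"}

# Series.map with a dict looks up each value; values missing from the dict become NaN.
contb_demo["party"] = contb_demo["cand_nm"].map(parties_demo)

# The source column is untouched, and unmapped names are easy to list.
unmapped = contb_demo.loc[contb_demo["party"].isna(), "cand_nm"].tolist()
```

Because the result goes into a new `party` column, `cand_nm` stays available for re-running the mapping once the dict is completed.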
Below is a description of the fields in a log file. To save space, only two example rows are shown.
id | userid | itemid | categoryid | type | time |
---|---|---|---|---|---|
1794879 | 16508 | 550769 | 2440115 | pv | 1423230247 |
1551349 | 153760 | 4246496 | 3077776 | pv | 1499073872 |
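Assign a different weight to each user action type. The values are picked by gut feeling (which is itself a form of experience), with the intended ordering share > collect > comment > deep read > casual browse. Note that the function below gives comment and collect the same weight (1.2).
One way to map text to numbers: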
def action_type_transfer(x):
    # Map an action type string to its weight; unknown types get 1
    if x == 'view':
        return 0.8
    elif x == 'deep_view':
        return 1
    elif x == 'comment':
        return 1.2
    elif x == 'collect':
        return 1.2
    elif x == 'share':
        return 1.5
    else:
        return 1
train['actiontype_weight'] = train['action_type'].apply(action_type_transfer)
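
The same mapping can also be written as a lookup table, in the style of the `gender_map` approach. This is only a sketch of an alternative, not a replacement for the function above; `ACTION_WEIGHTS` and `train_demo` are hypothetical names, and the `like` action is made up to show the fallback:

```python
import pandas as pd

# Same weights as action_type_transfer, written as a lookup table.
ACTION_WEIGHTS = {
    "view": 0.8,
    "deep_view": 1.0,
    "comment": 1.2,
    "collect": 1.2,
    "share": 1.5,
}

# Hypothetical sample; "like" is not in the table and falls back to 1.
train_demo = pd.DataFrame({"action_type": ["view", "share", "like", "collect"]})

# map() yields NaN for unknown types; fillna(1.0) reproduces the `else` branch.
train_demo["actiontype_weight"] = (
    train_demo["action_type"].map(ACTION_WEIGHTS).fillna(1.0)
)
```

Keeping the weights in a dict makes them easy to tweak in one place, which suits values that were chosen by feel and may need tuning.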
gender_map = {'F':0, 'M':1}
users['Gender'] = users['Gender'].map(gender_map)
The following two modules are required (the snippets also assume pandas has been imported as pd):
import time
import datetime
def timestamp_transfer(x):
    # Convert a Unix timestamp in seconds to a local-time string
    x = time.localtime(x)
    x = time.strftime("%Y-%m-%d %H:%M:%S", x)
    return x
train['date'] = train['action_time'].apply(timestamp_transfer)
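
For large tables, the per-row `apply` can be slow. A vectorized variant with `pd.to_datetime(..., unit="s")` might look like the sketch below. It uses the two `time` values from the log example in a hypothetical `log_demo` frame. One difference to be aware of: `time.localtime` uses the machine's local time zone, while this version yields UTC, so the strings can differ by the zone offset.

```python
import pandas as pd

# Hypothetical sample built from the two `time` values in the log example.
log_demo = pd.DataFrame({"action_time": [1423230247, 1499073872]})

# unit="s" interprets the integers as seconds since the Unix epoch (UTC).
parsed = pd.to_datetime(log_demo["action_time"], unit="s")

# Format the whole column at once, using the same pattern as timestamp_transfer.
log_demo["date"] = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
```

If local time is needed, the parsed column can be localized to UTC and converted with `dt.tz_localize("UTC")` followed by `dt.tz_convert(...)` before formatting.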
all_news_info['date'] = pd.to_datetime(all_news_info['date']).dt.date
news_info['date'] = pd.to_datetime(news_info['date']).dt.date
news_info['sub_days'] = (datetime.date(2017, 2, 19) - news_info['date']).dt.days
all_news_info['sub_days'] = (datetime.date(2017, 2, 19) - all_news_info['date']).dt.days
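# train['date'] holds strings produced by timestamp_transfer, so parse it before subtracting
train['sub_days'] = (pd.Timestamp(2017, 2, 19) - pd.to_datetime(train['date']).dt.normalize()).dt.days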
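
The day-difference step can be checked on a small example. This is a sketch under assumptions: `demo` and its three timestamps are made up, and the reference day 2017-02-19 is taken from the code above. It keeps the column as `datetime64` and uses `normalize()` to drop the time of day, rather than converting to `datetime.date` objects.

```python
import pandas as pd

# Hypothetical sample of formatted timestamps, as a strftime step would produce.
demo = pd.DataFrame(
    {"date": ["2017-02-17 08:30:00", "2017-02-19 23:59:59", "2017-01-20 00:00:00"]}
)

reference_day = pd.Timestamp("2017-02-19")

# normalize() truncates each timestamp to midnight, so differences are whole days.
days = pd.to_datetime(demo["date"]).dt.normalize()

# Timestamp minus a datetime64 Series gives a timedelta64 Series with a .dt.days accessor.
demo["sub_days"] = (reference_day - days).dt.days
```

Here `sub_days` comes out as 2, 0, and 30 for the three rows.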