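# Home Depot Product Search Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US furniture and building-materials retail website. Users type keywords into the search box and get back related products and services. Searching for "floor", for example, returns flooring in various materials, floor-cleaning products, floor-installation services, and so on. The goal of the Kaggle competition is to design a model that matches users' search keywords better, returning products and services with higher relevance.

## Required imports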
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
df_desc = pd.read_csv('product_descriptions.csv')

First rows of `df_train`:

| | id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|---|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |

First rows of `df_test`:

| | id | product_uid | product_title | search_term |
|---|---|---|---|---|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |

First rows of `df_desc`:

| | product_uid | product_description |
|---|---|---|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

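In `train`, `relevance` is the target we have to predict on `test`. It ranges from 1 to 3, where 3 is the most relevant and 1 the least. `search_term` is the search query, so each row gives the relevance of a product to one particular query. `product_descriptions.csv` holds the product description for each `product_uid`.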
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

| | id | product_title | product_uid | relevance | search_term |
|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |

(240760, 5)
df_all = df_all.merge(df_desc,on='product_uid',how='left')

| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |

from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
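### Stemming

Because the Home Depot task is search matching, consistent text matters a great deal. We stem the text features, reducing each word to its stem, so that a search term has only one surface form across the texts. Stop words and tokens containing digits are dropped along the way.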
stop = stopwords.words('english')
import re

def hasnumber(input_str):
    return bool(re.search(r'\d',input_str))

def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    return True

stemmer = SnowballStemmer('english')

def text_stemmer(s):
    return ' '.join([stemmer.stem(word) for word in s.lower().split() if check(word)])
df_all['search_term'] = df_all['search_term'].map(lambda x: text_stemmer(x))
df_all['product_title'] = df_all['product_title'].map(lambda x:text_stemmer(x))
df_all['product_description'] = df_all['product_description'].map(lambda x:text_stemmer(x))

| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.00 | angl bracket | angl make joint stronger, also provid consiste… |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.50 | l bracket | angl make joint stronger, also provid consiste… |
| 2 | 9 | behr premium textur deckov tugboat wood concre… | 100002 | 3.00 | deck | behr premium textur deckov innov solid color c… |
| 3 | 16 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.33 | rain shower head | updat bathroom delta vero single-handl shower … |
| 4 | 17 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.67 | shower faucet | updat bathroom delta vero single-handl shower … |

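
To make the filtering and stemming step concrete, here is a minimal, dependency-free sketch. The tiny `STOP` set and the `naive_stem` suffix stripper are hypothetical stand-ins for NLTK's stop-word list and the Snowball stemmer, so real output will differ in general; only the overall flow (lowercase, split, drop stop words and tokens with digits, stem, rejoin) mirrors the code above.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, not the real list)
STOP = {"the", "a", "for", "with", "in"}

def keep(token):
    # Drop stop words and any token that contains a digit
    return token not in STOP and not re.search(r"\d", token)

def naive_stem(token):
    # Crude suffix stripping for illustration only; this is NOT the Snowball algorithm
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    return " ".join(naive_stem(t) for t in text.lower().split() if keep(t))

print(normalize("Brackets for the 12-Gauge Angles"))  # bracket angl
```

Both "Brackets" and "bracket" end up as the same token, which is what lets a query match a title regardless of inflection.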
train = df_all[:df_train.shape[0]].copy()
# The test rows start right after the training rows in df_all
test = df_all[df_train.shape[0]:].copy()
train['all_text']=train['product_title'] + ' . ' + train['product_description'] + ' . '
test['all_text'] = test['product_title'] + ' . ' + test['product_description'] + ' . '
0    simpson strong-ti angl . angl make joint stron…
1    simpson strong-ti angl . angl make joint stron…
2    behr premium textur deckov tugboat wood concre…
3    delta vero shower faucet trim kit chrome (valv…
4    delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object

### Building the corpus

We build the corpus from the `all_text` column of `train`. First, `tokenize` splits each text into individual words. Then `gensim.corpora.Dictionary` assigns a unique ID to every word in the corpus. This dictionary defines the full vocabulary we will work with.
from gensim.utils import tokenize
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
Dictionary(136703 unique tokens: ['alonehelp', 'also', 'angl', 'bent', 'coat']...)

This gives a training vocabulary of 136,703 words. Next, every document is converted to word counts. We write a class with an iterator that computes the counts one document at a time, because the corpus is large and building the whole thing as a list would use a lot of memory.
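
The dictionary-plus-counting idea described above can be sketched without gensim. In this toy version, `token2id`, `doc2bow` and `iter_bow` are hypothetical helpers written for illustration; gensim's `Dictionary` may assign IDs in a different order, but the output has the same shape: a list of `(token_id, count)` pairs per document, produced lazily by a generator.

```python
from collections import Counter

docs = ["angl bracket angl", "shower faucet", "bracket shower head"]

# Assign each token an integer id in order of first appearance
# (assumption for the sketch; gensim's id order may differ)
token2id = {}
for doc in docs:
    for tok in doc.split():
        token2id.setdefault(tok, len(token2id))

def doc2bow(text):
    # Count known tokens and return sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in text.split() if t in token2id)
    return sorted(counts.items())

def iter_bow(texts):
    # Generator: yields one bag-of-words vector at a time instead of building a list
    for t in texts:
        yield doc2bow(t)

for bow in iter_bow(docs):
    print(bow)
```

Because `iter_bow` is a generator, only one document's counts are held in memory at any moment.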
class corpus:
    def __iter__(self):
        for x in train['all_text'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))

train_corpus = corpus()

# Print the first three bag-of-words vectors
for count, c in enumerate(train_corpus):
    if count > 2:
        break
    print(c)
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 2), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 3), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 3), (62, 1), (63, 1), (64, 1)] [(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 2), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 3), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 3), (62, 1), (63, 1), (64, 1)] [(1, 1), (4, 3), (21, 1), (25, 1), (39, 1), (41, 2), (56, 1), (65, 1), (66, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 4), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 4), (86, 2), (87, 1), (88, 1), (89, 1), (90, 2), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 2), (118, 1), (119, 1), (120, 1), (121, 1), (122, 1), (123, 2), (124, 2), (125, 1), (126, 1), (127, 2), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 2), (136, 1), (137, 1), (138, 1), (139, 1), (140, 2), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1), (146, 1), 
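(147, 1), (148, 2), (149, 1), (150, 1), (151, 1), (152, 1), (153, 1), (154, 1), (155, 1), (156, 1), (157, 4), (158, 1)]

As shown above, each document is converted to a list of tuples. The first element of each tuple is the word's ID in the dictionary, and the second is the number of times that word occurs in the document.

### Using the TF-IDF model

Put simply, the TF-IDF model maps bag-of-words vectors into another vector space in which term frequencies are weighted by the relative rarity of each word in the corpus.

TF (term frequency) = number of times the word occurs in the document / total number of words in the document

IDF (inverse document frequency) = log(N / N(x)), where N is the total number of documents in the corpus and N(x) is the number of documents that contain x

TF-IDF(x) = TF(x) * IDF(x)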
from gensim.models.tfidfmodel import TfidfModel
tfidf_g = TfidfModel(train_corpus)
tfidf_g[dictionary.doc2bow(list(tokenize('morning yellow flower', errors='ignore')))]
[(1056, 0.44640344500226231), (1332, 0.40452528743632266), (34490, 0.79817495332456578)]

In each returned tuple, the first element is the word ID and the second is its weight.
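
The TF and IDF formulas can be checked by hand on a toy corpus. The sketch below implements exactly the textbook definitions given in this section; the function names and the three toy documents are made up for illustration. gensim's `TfidfModel` differs in details of its default weighting (for example, it normalizes the resulting vectors), so these numbers are not expected to match its output.

```python
import math
from collections import Counter

# Three tiny, already-stemmed documents (illustrative data only)
docs = [
    "angl bracket steel".split(),
    "shower faucet chrome".split(),
    "steel shower head".split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the document length
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(N / N(x)); assumes the term occurs in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "bracket" occurs in one document, "steel" in two, so "bracket" is weighted higher
print(tfidf("bracket", docs[0], docs))
print(tfidf("steel", docs[0], docs))
```

The rarer word gets the larger weight even though both occur once in the first document, which is the "relative rarity" weighting described above.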
def to_tfidf(text):
    res = tfidf_g[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res
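### Cosine similarity

The closer the cosine is to 1, the closer the angle between the two vectors is to 0 degrees, that is, the more similar the two vectors are.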
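
For sparse `(id, weight)` vectors like the ones a TF-IDF model returns, the cosine can be computed directly from the shared IDs. This is a minimal sketch; `sparse_cosine` is a hypothetical helper, not part of gensim, and it returns 0.0 for an empty vector by convention.

```python
import math

def sparse_cosine(v1, v2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(v1), dict(v2)
    # Only ids present in both vectors contribute to the dot product
    dot = sum(w * d2[i] for i, w in d1.items() if i in d2)
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0  # convention for empty vectors
    return dot / (n1 * n2)

a = [(0, 1.0), (2, 1.0)]
b = [(0, 1.0), (1, 1.0)]
print(sparse_cosine(a, a))           # identical vectors: close to 1
print(sparse_cosine(a, b))           # one shared id out of two: close to 0.5
print(sparse_cosine(a, [(3, 2.0)]))  # no shared ids: 0.0
```

Two texts with no words in common get a similarity of exactly 0.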
from gensim.similarities import MatrixSimilarity
def cos_sim(text1,text2):
    tf1 = to_tfidf(text1)
    tf2 = to_tfidf(text2)
    index = MatrixSimilarity([tf1],num_features=len(dictionary))
    sim = index[tf2]
    return float(sim[0])
train['tfidf_cos_sim_in_title'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
test['tfidf_cos_sim_in_title'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
train['tfidf_cos_sim_in_desc'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
test['tfidf_cos_sim_in_desc'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)

`train` with the two TF-IDF features:

| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 |

`test` with the two TF-IDF features:

| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 |

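As shown, we have added two features. The TF-IDF model turns the words into vector representations, and the cosine of two such vectors serves as the similarity measure between the search term and the title, and between the search term and the product description.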
### Word2Vec model

#### Tokenization
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']
sentences = [tokenizer.tokenize(x) for x in train['all_text'].values]
[['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .'],
 ['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']]

After splitting into sentences, the result is still nested: one list of sentences per document. What we want is a single collection of all sentences, so the list of lists has to be flattened. An approach found on Stack Overflow:
sentences = [y for x in sentences for y in x]
This is equivalent to:

flatten = []
for sub in sentences:
    for val in sub:
        flatten.append(val)

but the list comprehension above runs faster and avoids the calls to `append`.
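
As a small self-contained check of the flattening idiom, the sketch below runs the nested comprehension on made-up data and compares it with `itertools.chain.from_iterable` from the standard library, which is another common way to flatten one level of nesting. The `nested` list is illustrative only.

```python
from itertools import chain

# One inner list per document; the last document has no sentences
nested = [["a .", "b ."], ["c ."], []]

# Nested comprehension: outer loop first, inner loop second
flat_comprehension = [y for x in nested for y in x]

# Standard-library alternative that flattens exactly one level
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)  # ['a .', 'b .', 'c .']
print(flat_chain)          # ['a .', 'b .', 'c .']
```

Both forms keep the original order and silently skip empty inner lists.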
words = [word_tokenize(x) for x in sentences]
#### w2c model
from gensim.models.word2vec import Word2Vec
# gensim 3.x API; in gensim 4.x the `size` argument is named `vector_size`
w2c = Word2Vec(words, size=128, window=5, min_count=5, workers=4)
(128,)

Once each word has a Word2Vec vector, we can represent a sentence by the average of the vectors of all its words.
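vocab = w2c.wv.vocab

def get_vector(text):
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            # Index through .wv; w2c[word] is deprecated
            res += w2c.wv[word]
            count += 1
    # Avoid dividing by zero when no word is in the vocabulary
    if count == 0:
        return res
    return res / count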
print(get_vector('this is a door'))
[-0.00261087 -0.56179226 0.60765644 -0.64292271 -0.56054996 0.08848376 -0.49025596 0.63867774 0.0506447 0.45001359 -0.13710753 -0.16271916 0.02016276 0.20406312 0.31635891 0.19369102 0.17811321 0.41733303 0.00445884 0.5458078 1.05040102 -0.06413073 0.41070253 0.42587531 -0.63050625 -1.0984747 0.29934129 0.17861572 -0.71340695 -0.06451187 0.14277897 -0.06567481 0.01526162 -0.38790436 1.20415058 1.21037786 0.14057088 -0.10719017 0.37104489 0.76831334 0.34643462 0.62355396 0.25301299 0.40690951 0.1148672 1.06050375 0.36682158 0.25096587 -0.74231262 0.35016962 -0.58686608 -0.0857836 -0.84342213 0.5809405 -0.00302781 -0.14390172 -0.0524666 -0.91113859 0.75996059 0.87425374 -0.26513928 -0.54596879 0.80864939 0.01382558 -0.06432911 0.4952433 -0.43694797 0.01296244 0.84968186 -0.10620818 -0.18429637 0.69937535 0.4414333 -0.13501882 0.02398617 -0.47228654 1.04885393 0.06891993 -0.38115454 0.34773821 0.31407464 -0.06125381 -0.52234665 -0.11498543 -0.03274459 -0.10401297 -0.58666455 0.96296111 -0.72077985 0.29961426 0.68775976 -0.03572528 0.28445438 0.04369911 0.61288889 -0.21892426 -0.05004786 -0.73410231 -0.58521137 -0.02520149 -0.44890615 -0.54609256 0.86551609 -0.28756648 0.4514165 -0.36830674 0.31522632 -0.05346495 -0.0451854 -0.26681575 -0.46619874 0.04488686 -0.38999537 -0.12920142 0.68646752 -0.86762143 0.79189127 0.25246545 -1.17674513 -0.17455208 0.34669894 -0.31969217 -0.08517368 0.52510963 0.2380766 -0.16318383 0.17944744 -0.9727306 ]

### Computing similarity
from scipy import spatial
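def w2c_cos_sim(text1,text2):
    try:
        w1 = get_vector(text1)
        w2 = get_vector(text2)
        sim = 1 - spatial.distance.cosine(w1, w2)
        # NaN arises when a text has no in-vocabulary words; treat it as no similarity
        return float(sim) if np.isfinite(sim) else float(0)
    except (ValueError, ZeroDivisionError):
        return float(0)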
w2c_cos_sim('hello world','hello from the other side')
0.07032644504070107
train['w2v_cos_sim_in_title'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
train['w2v_cos_sim_in_desc'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)
test['w2v_cos_sim_in_title'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
test['w2v_cos_sim_in_desc'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)

`train` with all four similarity features:

| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 | 0.531925 | 0.530175 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 | 0.279708 | 0.303249 |

`test` with all four similarity features:

| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 | 0.694583 | 0.591168 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 | 0.676237 | 0.369881 |

train = train.drop(['search_term','product_title','product_description','all_text'],axis=1)
test = test.drop(['search_term','product_title','product_description','all_text'],axis=1)
ids = test['id']
y_train = train['relevance'].values
X_train = train.drop(['id','relevance'],axis=1).values
X_test = test.drop(['id','relevance'],axis=1).values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
params = [1,3,5,6,7,8,9,10]
test_scores = []
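for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    # cross_val_score returns negative MSE per fold; convert to RMSE
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    # Record the mean RMSE across folds for this max_depth
    test_scores.append(np.mean(test_score))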
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")

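The plot shows that a `max_depth` of 6 works best, with an RMSE of about 0.49. So far we have added four features and used a random forest as the model. As next steps we could construct new features, such as a simple check of whether the search term is contained in the text, try other models such as LR, and then ensemble the models.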
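
The "is the search term contained" feature mentioned as a next step could look something like the sketch below. `overlap_features` is a hypothetical helper, not code from the original notebook; it assumes both strings have already been stemmed and counts how many query words appear in the target text.

```python
def overlap_features(search_term, text):
    """Return (number of query words found in text, fraction of query words found)."""
    query_words = search_term.lower().split()
    text_words = set(text.lower().split())
    hits = sum(1 for w in query_words if w in text_words)
    # Guard against an empty query after stop-word removal
    ratio = hits / len(query_words) if query_words else 0.0
    return hits, ratio

# One of the two query words occurs in the stemmed title
print(overlap_features("angl bracket", "simpson strong-ti angl"))  # (1, 0.5)
```

Applied row by row with `DataFrame.apply`, in the same way as the cosine features, this would add two more numeric columns for the model; it would need to run before the text columns are dropped.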