python – How to train a sense2vec model

I extended and adjusted the code examples from sense2vec.

You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

to this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

> Double line breaks are interpreted as separate documents.
> URLs are recognised as such, stripped down to domain.tld and tagged as |URL (see the example after this list).
> Nouns (including nouns that are part of noun chunks) are lemmatised (motives becomes motif).
> Words with POS tags such as DET (determiner) and PUNCT (punctuation) are dropped.
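
To make the URL rule concrete, here is a minimal self-contained sketch using the same regex that appears in the full code below (the address is just a placeholder):

import re

# Same pattern as url_re in the code below; group(3) captures domain.tld.
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')

match = url_re.search('https://www.github.com/woltob')
print(match.group(3))  # -> 'github.com'; represent_word() then emits 'github.com|URL'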

Here is the code. Let me know if you have any questions.

I will publish it on github.com/woltob soon.

import spacy
import re

nlp = spacy.load('en')
# Disable spaCy's pattern matcher. Note: this script uses the spaCy 1.x API
# (the 'en' shortcut model, Span.merge(tag, lemma, ent_type)); newer spaCy
# versions replaced Span.merge() with doc.retokenize().
nlp.matcher = None

# Map the fine-grained entity labels onto the coarse tags used in the output.
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    # Protect double line breaks (document boundaries), flatten single ones.
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    # Merge named entities and noun chunks into single tokens.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Drop PUNCT such as commas and DET like "the".
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)

corpus_ = []
for word in doc:
    # Only lemmatise NOUN and PROPN.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the first character of the original word, take the rest from the
        # lemma, then re-append the trailing whitespace if there was any.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as f:
    f.write(result)

print(result)

I will also adapt this code further to the sense2vec approach (e.g. words are lowercased in the preprocessing step; just comment that out in the code if you don't want it).
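
For completeness, here is one way to train vectors on the generated file. This is a minimal sketch of my own (not part of the original script) that assumes gensim 4.x is installed; sense2vec's idea is exactly that plain word2vec training on these word|TAG tokens yields per-sense vectors:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# text.txt holds one preprocessed sentence per line, tokens as word|TAG.
sentences = LineSentence('text.txt')
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, workers=4)

# Query a token together with its tag, e.g. the NOUN sense of "money":
print(model.wv.most_similar('money|NOUN'))

With a corpus as tiny as the example above the nearest neighbours are meaningless; on a large corpus this is where the separate senses pay off.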

Happy coding, woltob
