Spark MLlib (4): feature extractors

extraction: extracting features from raw data

transformation: scaling, converting, or modifying features

selection: selecting a subset from a larger set of features

locality sensitive hashing (LSH): a class of algorithms that combines aspects of feature transformation with other algorithms


feature extractors:

tf-idf

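1 tf-idf stands for term frequency-inverse document frequency

terminology: term (word), document, corpus

tf is the number of times a term appears in a document; df is the number of documents in the corpus that contain the term

idf is a numerical measure of how much useful information a term provides; a term that appears in many documents carries little information and gets a low idf. Spark MLlib computes idf = log((|D| + 1) / (df + 1)), where |D| is the number of documents in the corpus

tf-idf = tf * idf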
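
To make the definitions above concrete, here is a minimal sketch in plain Python (not the Spark API). It assumes the smoothed idf used in the Spark MLlib documentation, log((|D| + 1) / (df + 1)); the toy corpus and all function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of terms.
corpus = [
    ["spark", "mllib", "spark"],
    ["spark", "feature"],
    ["hashing", "feature", "vector"],
]

def term_frequency(doc):
    """tf: number of times each term appears in one document."""
    return Counter(doc)

def document_frequency(docs):
    """df: number of documents in the corpus that contain each term."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return df

def inverse_document_frequency(docs):
    """idf with +1 smoothing (assumed form): log((|D| + 1) / (df + 1))."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc, idf):
    """tf-idf = tf * idf for every term in the document."""
    tf = term_frequency(doc)
    return {term: count * idf[term] for term, count in tf.items()}

idf = inverse_document_frequency(corpus)
scores = tf_idf(corpus[0], idf)
```

In the first document "spark" appears twice but also occurs in two of the three documents, so its idf is lower than that of "mllib", which occurs in only one.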

2 both HashingTF and CountVectorizer can be used to generate tf vectors
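
The hashing approach to building a tf vector can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark's HashingTF: it uses zlib.crc32 as a stand-in hash function (an assumption made here so the result is deterministic), and the function name and num_features value are hypothetical.

```python
import zlib

def hashing_tf(doc, num_features=16):
    """Build a tf vector by hashing each term to an index in [0, num_features)."""
    vec = [0] * num_features
    for term in doc:
        # Stand-in hash (assumption); no vocabulary has to be stored.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += 1
    return vec

doc = ["spark", "mllib", "spark"]
tf_vector = hashing_tf(doc)
```

No pass over the corpus is needed to fit anything, but two different terms may collide on the same index; choosing a larger num_features makes collisions less likely.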


word2vec

1 Word2Vec is an Estimator: it takes sequences of words representing documents and trains a Word2VecModel

2 the model maps each word to a unique fixed-size vector


countvectorizer

CountVectorizer and CountVectorizerModel aim to convert a collection of text documents into vectors of token counts
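
The fit-then-transform idea behind this can be sketched in plain Python. This is not the Spark API: the function names are hypothetical, and the alphabetical tie-breaking is an assumption added here so the vocabulary order is deterministic.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=None):
    """Learn a term -> index mapping, ordered by term count across the corpus."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    # Most frequent terms first; ties broken alphabetically (assumption).
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    terms = [term for term, _ in ordered[:vocab_size]]
    return {term: index for index, term in enumerate(terms)}

def transform(doc, vocabulary):
    """Convert one document into a vector of token counts."""
    vec = [0] * len(vocabulary)
    for term in doc:
        index = vocabulary.get(term)
        if index is not None:  # terms outside the vocabulary are ignored
            vec[index] += 1
    return vec

docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(docs)
vectors = [transform(doc, vocabulary) for doc in docs]
# vectors == [[1, 1, 1], [2, 2, 1]]
```

The fit step plays the role of the estimator and the learned vocabulary plays the role of the model; unlike the hashing approach, the vocabulary must be stored, but each index maps back to a known term.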


featurehasher

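projects a set of categorical or numerical features into a feature vector of a specified dimension, using the hashing trick to map features to indices in that vector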
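
A rough plain-Python sketch of feature hashing follows. It is not Spark's FeatureHasher: zlib.crc32 is a stand-in hash function, and the function name, the sample row, and num_features are all hypothetical. The convention it assumes is that a numeric feature is hashed by its column name and contributes its value, while a string or boolean feature is hashed as "name=value" and contributes 1.0.

```python
import zlib

def feature_hasher(row, num_features=8):
    """Project a dict of named features into a fixed-size vector by hashing."""
    vec = [0.0] * num_features
    for name, value in row.items():
        if isinstance(value, (str, bool)):
            # Categorical: hash "name=value" and add an indicator of 1.0.
            key, weight = f"{name}={value}", 1.0
        else:
            # Numeric: hash the column name and add the value itself.
            key, weight = name, float(value)
        index = zlib.crc32(key.encode("utf-8")) % num_features  # stand-in hash
        vec[index] += weight
    return vec

row = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"}
hashed = feature_hasher(row)
```

The output length depends only on num_features, not on how many columns or distinct categorical values exist; features that collide on an index are summed.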


