API:
from sklearn.feature_extraction import DictVectorizer
Syntax:
dv = DictVectorizer(sparse=False)  # instantiate
dv.fit_transform()       # dict --> one-hot encoding
dv.inverse_transform()   # one-hot encoding --> dict
dv.get_feature_names()   # get the feature names (get_feature_names_out() in scikit-learn >= 1.2)
# Dictionary feature extraction
from sklearn.feature_extraction import DictVectorizer

def dict_extraction():
    data_dict = [{'city': '北京', 'temperature': 32},
                 {'city': '上海', 'temperature': 22},
                 {'city': '深圳', 'temperature': 17}]
    dict_vectorizer = DictVectorizer(sparse=False)
    one_hot_data = dict_vectorizer.fit_transform(data_dict)
    print(dict_vectorizer.get_feature_names())
    print(one_hot_data)
    # Convert back to a list of dicts
    mydict = dict_vectorizer.inverse_transform(one_hot_data)
    print(mydict)

dict_extraction()
['city=上海', 'city=北京', 'city=深圳', 'temperature']
[[ 0. 1. 0. 32.]
[ 1. 0. 0. 22.]
[ 0. 0. 1. 17.]]
[{'city=北京': 1.0, 'temperature': 32.0}, {'city=上海': 1.0, 'temperature': 22.0}, {'city=深圳': 1.0, 'temperature': 17.0}]
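The example above passes sparse=False to get a plain ndarray. A minimal sketch of the default sparse=True behavior, which returns a SciPy sparse matrix instead (the data below is illustrative, reusing the cities from the example):

```python
from sklearn.feature_extraction import DictVectorizer

# With the default sparse=True, fit_transform returns a SciPy sparse
# matrix, which saves memory when most one-hot entries are zero.
data_dict = [{'city': '北京', 'temperature': 32},
             {'city': '上海', 'temperature': 22}]
dv = DictVectorizer()            # sparse=True by default
sparse_result = dv.fit_transform(data_dict)
print(type(sparse_result))       # a scipy.sparse matrix type
print(sparse_result.toarray())   # convert to a dense ndarray when needed
```

Sparse output is usually what you want for large vocabularies; call .toarray() only when a downstream step requires a dense array.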
API: CountVectorizer simply counts word frequencies per document, using the counts as feature values (bag of words).
from sklearn.feature_extraction.text import CountVectorizer
Syntax:
from sklearn.feature_extraction.text import CountVectorizer

def text_extraction():
    cv = CountVectorizer()
    data = ['life is short,i like python', 'but you dislike java']
    data = cv.fit_transform(data).toarray()
    print(cv.get_feature_names())
    print(data)

text_extraction()
['but', 'dislike', 'is', 'java', 'life', 'like', 'python', 'short', 'you']
[[0 0 1 0 1 1 1 1 0]
[1 1 0 1 0 0 0 0 1]]
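Notice that the single letter "i" from "short,i like" never appears in the vocabulary above: CountVectorizer's default token pattern keeps only tokens of two or more word characters. A small sketch showing this, plus the stop_words parameter for dropping common words (the stop-word list here is an arbitrary choice for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern, r"(?u)\b\w\w+\b", keeps only tokens of two
# or more characters, so single letters like 'i' are silently dropped.
# stop_words removes additional common words you list explicitly.
cv = CountVectorizer(stop_words=['is', 'but'])
data = cv.fit_transform(['life is short,i like python',
                         'but you dislike java'])
print(sorted(cv.vocabulary_))  # 'is', 'but', and 'i' are all absent
print(data.toarray())
```

This single-word filtering is also why Chinese text must be pre-segmented (e.g. with jieba) before vectorizing, as the next section shows.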
API:
import jieba
from sklearn.feature_extraction.text import CountVectorizer
Note: jieba returns a generator, which must be converted to a list.
Example:
def hanz_extractor():
    a = "我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去"
    b = "如果只用一种方式了解某样事物,你就不会真正了解它了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系"
    c = "今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天"
    # jieba.cut returns a generator: convert to a list, then join with spaces
    a = " ".join(list(jieba.cut(a)))
    b = " ".join(list(jieba.cut(b)))
    c = " ".join(list(jieba.cut(c)))
    data = [a, b, c]
    cv = CountVectorizer()
    data = cv.fit_transform(data)
    print(cv.get_feature_names())
    print(data.toarray())

hanz_extractor()
tf-idf: a word that appears frequently in the current document but rarely in other documents is considered important for that document.
API:
from sklearn.feature_extraction.text import TfidfVectorizer
Example:
def tfidf_extraction():
    a = "我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去"
    b = "如果只用一种方式了解某样事物,你就不会真正了解它了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系"
    c = "今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天"
    a = " ".join(list(jieba.cut(a)))
    b = " ".join(list(jieba.cut(b)))
    c = " ".join(list(jieba.cut(c)))
    data = [a, b, c]
    tv = TfidfVectorizer()
    data = tv.fit_transform(data)
    print(tv.get_feature_names())
    print(data.toarray())

tfidf_extraction()
Output:
['一种', '不会', '不要', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '我们', '所以', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
[[0. 0. 0. 0.2410822 0. 0.
0. 0.2410822 0.2410822 0.2410822 0. 0.
0. 0. 0. 0. 0. 0.2410822
0.55004769 0. 0. 0. 0. 0.2410822
0. 0. 0. 0. 0.48216441 0.
0. 0. 0. 0. 0.2410822 0.2410822 ]
[0.15698297 0.15698297 0. 0. 0.62793188 0.47094891
0. 0. 0. 0. 0.15698297 0.15698297
0. 0.15698297 0. 0.15698297 0.15698297 0.
0.1193896 0. 0. 0.15698297 0. 0.
0. 0.15698297 0. 0. 0. 0.31396594
0.15698297 0. 0. 0.15698297 0. 0. ]
[0. 0. 0.21821789 0. 0. 0.
0.43643578 0. 0. 0. 0. 0.
0.21821789 0. 0.21821789 0. 0. 0.
0. 0.21821789 0.21821789 0. 0.43643578 0.
0.21821789 0. 0.43643578 0.21821789 0. 0.
0. 0.21821789 0.21821789 0. 0. 0. ]]
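Under the hood, TfidfVectorizer is equivalent to running CountVectorizer first and then TfidfTransformer on the resulting counts. A minimal sketch verifying this on a small English corpus (the corpus is an arbitrary example, reusing sentences from the CountVectorizer section):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Both paths use the same smoothed idf and L2-normalize each row,
# so they produce identical tf-idf matrices with default parameters.
corpus = ['life is short i like python', 'but you dislike java']

# Path 1: tf-idf in one step
tfidf_direct = TfidfVectorizer().fit_transform(corpus).toarray()

# Path 2: raw counts first, then the tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts).toarray()

print(np.allclose(tfidf_direct, tfidf_two_step))  # True
```

The two-step form is useful when you want to reuse the raw count matrix for other purposes, or tune the weighting separately from the tokenization.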