Machine Learning Feature Engineering (1): Dictionary and Text Feature Extraction

What feature engineering covers

  • Feature extraction
  • Feature preprocessing
  • Feature dimensionality reduction

Feature extraction

Convert arbitrary raw data into numerical features.

API

sklearn.feature_extraction

Dictionary feature extraction

sklearn.feature_extraction.DictVectorizer()

Code

from sklearn.feature_extraction import DictVectorizer

def feature_demo():
    '''
    Dictionary feature extraction
    '''
    data = [
        {'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30}
    ]

    # 1. Instantiate the transformer; sparse=False returns a dense ndarray
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform to turn the list of dicts into a feature matrix
    new_data = transfer.fit_transform(data)
    print(new_data)
    print("\n")
    print(transfer.feature_names_)

Output

[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
 
['city=上海', 'city=北京', 'city=深圳', 'temperature']

ℹ️ Categorical features are converted into one-hot encodings.
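
By default DictVectorizer uses sparse=True and returns a SciPy sparse matrix instead of a dense ndarray. A minimal sketch of that default behavior, reusing the same sample data:

from sklearn.feature_extraction import DictVectorizer

data = [{'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30}]

transfer = DictVectorizer()              # sparse=True is the default
sparse_res = transfer.fit_transform(data)
print(type(sparse_res))                  # a scipy.sparse matrix (CSR format)
print(sparse_res.toarray())              # same dense matrix as shown above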

Text feature extraction

Method 1: word counts

API:

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
from sklearn.feature_extraction.text import CountVectorizer

def text_demo1():
    '''
    Text feature extraction with CountVectorizer
    '''
    data = ["life is short, I like python", "life is too long, I dislike python"]

    # 1. Instantiate the transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform to get the word-count matrix
    res = transfer.fit_transform(data)

    print(transfer.get_feature_names())

    print(res.toarray())

if __name__ == '__main__':
    text_demo1()

Output

['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
[[0 1 1 1 0 1 1 0]
 [1 1 1 0 1 1 0 1]]
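
Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, the equivalent call here and in the later examples is:

print(transfer.get_feature_names_out())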

CountVectorizer cannot segment Chinese text on its own; a word-segmentation tool such as jieba is needed (see the "Processing Chinese text" example below).

  • stop_words=[]: the stop-word parameter

    Pass a list of stop words when instantiating the transformer:

     transfer = CountVectorizer(stop_words=['is', 'too'])
    

    Output

    ['dislike', 'life', 'like', 'long', 'python', 'short']
    [[0 1 1 0 1 1]
     [1 1 0 1 1 0]]
    
  • Processing Chinese text

    from sklearn.feature_extraction.text import CountVectorizer
    import jieba
    
    def cut_words(text):
        '''
        Segment a Chinese sentence and join the tokens with spaces
        '''
        a = jieba.cut(text)
    
        return " ".join(list(a))
    
    def text_zh():
        '''
        Chinese text feature extraction with automatic word segmentation
        '''
        data = ['意志是一个强壮的盲人,倚靠在明眼的跛子肩上',
                '学到很多东西的诀窍,就是一下子不要学很多',
                '重复别人所说的话,只需要教育;而要挑战别人所说的话,则需要头脑'
        ]
    
        # segment each sentence so CountVectorizer can split on spaces
        data_new = []
        for s in data:
            data_new.append(cut_words(s))
    
        transfer = CountVectorizer(stop_words=['因为', '所以', '一个'])
        res = transfer.fit_transform(data_new)
        print(transfer.get_feature_names())
        print(res.toarray())
    
    
    if __name__ == '__main__':
        text_zh() 
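
    As an alternative to pre-segmenting the text, jieba can be handed to CountVectorizer directly as a tokenizer. A minimal sketch, assuming jieba is installed; note that with a custom tokenizer the default token_pattern is bypassed, so single characters and punctuation are kept as tokens:

    from sklearn.feature_extraction.text import CountVectorizer
    import jieba

    # jieba.lcut returns a list of tokens, so it can be used as the tokenizer
    transfer = CountVectorizer(tokenizer=jieba.lcut,
                               stop_words=['因为', '所以', '一个'])
    res = transfer.fit_transform(['意志是一个强壮的盲人,倚靠在明眼的跛子肩上'])
    print(transfer.get_feature_names())
    print(res.toarray())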
    

Method 2: TF-IDF

If a word appears frequently in one document but rarely in the rest of the corpus, it is considered a good discriminator for that document.

The TF-IDF method ⭐️

TF: term frequency, how often the term occurs in a document.

IDF: inverse document frequency,

$idf = \lg\frac{\text{total number of documents}}{\text{number of documents containing the term}}$

$tfidf = tf \times idf$
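
A quick hand check of this formula on the two-sentence English corpus used below: 'dislike' appears in 1 of 2 documents, while 'python' appears in both. This sketch uses the plain textbook formula above; scikit-learn's TfidfVectorizer actually uses a smoothed, natural-log idf plus L2 normalization, so its numbers (shown further down) differ:

import math

N = 2                                  # total number of documents in the toy corpus
df_dislike, df_python = 1, 2           # documents containing each term

idf_dislike = math.log10(N / df_dislike)   # lg(2/1) ≈ 0.301
idf_python = math.log10(N / df_python)     # lg(2/2) = 0
print(idf_dislike, idf_python)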
Code example

from sklearn.feature_extraction.text import TfidfVectorizer

def text_tfidf():
    '''
    Text feature extraction with TfidfVectorizer
    '''
    data = ["life is short, I like python",
            "life is too long, I dislike python"]

    # 1. Instantiate the transformer
    transfer = TfidfVectorizer(stop_words=['is', 'too'])
    # 2. Call fit_transform to get the tf-idf matrix
    res = transfer.fit_transform(data)

    print(transfer.get_feature_names())
    print(res.toarray())

if __name__ == '__main__':
    # text_demo1()
    text_tfidf()

Output

['dislike', 'life', 'like', 'long', 'python', 'short']
[[0.         0.40993715 0.57615236 0.         0.40993715 0.57615236]
 [0.57615236 0.40993715 0.         0.57615236 0.40993715 0.        ]]
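
By default TfidfVectorizer applies L2 normalization to each row (norm='l2'), which is why the values above are not the raw tf × idf products. A small check of that, assuming res from the function above is available (e.g. run interactively):

import numpy as np

# each row of the tf-idf matrix has unit L2 norm under the default norm='l2'
print(np.linalg.norm(res.toarray(), axis=1))   # -> [1. 1.]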
