CountVectorizer and TfidfVectorizer are two text feature-vectorization methods in sklearn. The difference is that CountVectorizer only counts how often each word occurs in a training document, while TfidfVectorizer additionally weights each word by the inverse of the number of training documents that contain it (the inverse document frequency), so words that appear in many documents are down-weighted.
EX1_1 one-hot representation
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt
corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
# vocab = set([word for sen in corpus for word in sen.split(" ")])
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray() # fit_transform() learns the vocabulary from the corpus and then encodes the corpus as a document-term matrix; transform() reuses a vocabulary that has already been learned
vocab = one_hot_vectorizer.get_feature_names() # returns the feature names in the order of the matrix columns (newer sklearn versions use get_feature_names_out())
sns.heatmap(one_hot, annot=True,
            cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
print(one_hot_vectorizer.get_stop_words())
print(one_hot_vectorizer.vocabulary_)
print(one_hot_vectorizer.vocabulary_.get("a")) # "a" was not kept as a token; the sklearn docs explain: "The default configuration tokenizes the string by extracting words of at least 2 letters."
print(one_hot_vectorizer.vocabulary_.get("an"))
print(one_hot)
plt.show()
Output:
None
{'time': 6, 'flies': 3, 'like': 5, 'an': 0, 'arrow': 1, 'fruit': 4, 'banana': 2}
None
0
[[1 1 0 1 0 1 1]
[0 0 1 1 1 1 0]]
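To see the difference between fit_transform() and transform() mentioned in the comment above, we can encode a new sentence with the already-learned vocabulary. A minimal sketch (the sentence is my own, not from the original example; the unseen word "apple" is simply dropped):
new_vec = one_hot_vectorizer.transform(['Fruit flies like an apple.']).toarray()
print(new_vec) # [[1 0 0 1 1 1 0]] -- columns: an, arrow, banana, flies, fruit, like, time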
EX1_2 tf-idf extension
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns
corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]
cv = CountVectorizer(binary=True)
words = cv.fit_transform(corpus)
tfidf_vectorizer = TfidfVectorizer()
tfidf = TfidfTransformer().fit_transform(words)
tfidf2 = tfidf_vectorizer.fit_transform(corpus).toarray()
vocab = cv.get_feature_names()
sns.heatmap(tfidf2, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2', 'Sentence 3', 'Sentence 4', 'Sentence 5'])
print (cv.get_feature_names())
print (words.toarray())
print (tfidf)
plt.show()
Output:
['american', 'and', 'come', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'to', 'travel']
[[0 0 0 1 1 1 0 0 1 0 1 0 0]
[0 0 0 1 0 1 0 1 1 0 1 0 0]
[0 1 0 0 0 0 1 0 1 1 0 0 0]
[0 0 0 1 1 1 0 0 1 0 1 0 0]
[1 0 1 0 0 0 0 0 0 0 0 1 1]]
(0, 10) 0.44027050419943065
(0, 5) 0.44027050419943065
(0, 8) 0.3703694278374568
(0, 4) 0.5303886653382521
(0, 3) 0.44027050419943065
(1, 10) 0.4103997467310884
(1, 5) 0.4103997467310884
(1, 8) 0.34524120496743227
(1, 3) 0.4103997467310884
(1, 7) 0.6128006641982455
(2, 8) 0.30931749359185684
(2, 1) 0.5490363340004775
(2, 9) 0.5490363340004775
(2, 6) 0.5490363340004775
(3, 10) 0.44027050419943065
(3, 5) 0.44027050419943065
(3, 8) 0.3703694278374568
(3, 4) 0.5303886653382521
(3, 3) 0.44027050419943065
(4, 2) 0.5
(4, 11) 0.5
(4, 0) 0.5
(4, 12) 0.5
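These numbers can be reproduced by hand. With default settings sklearn uses the smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the term frequency, and then L2-normalizes each row. A small sketch checking row 0 ('This is the first document.'), with the document frequencies read off this corpus:
import numpy as np
n = 5 # number of documents
df = {'this': 3, 'is': 3, 'the': 4, 'first': 2, 'document': 3} # document frequencies
idf = {t: np.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
row = np.array([idf[t] for t in ['this', 'is', 'the', 'first', 'document']]) # tf = 1 for every term
row /= np.linalg.norm(row) # L2 normalization
print(row) # ~[0.4403 0.4403 0.3704 0.5304 0.4403], matching entries (0, 10), (0, 5), (0, 8), (0, 4), (0, 3)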
Dynamic vs. Static Computational-Graph Frameworks
Static frameworks such as Theano, Caffe, and TensorFlow require the computational graph to be declared first, then compiled, and finally executed.
Dynamic frameworks such as Chainer, DyNet, and PyTorch instead build the graph on the fly, as the operations run.
Dynamic computational-graph frameworks are very effective for modeling NLP tasks, since each input can potentially lead to a different graph structure, as the sketch below illustrates.
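Because the graph is built while the code runs, ordinary Python control flow can give each input its own graph. A minimal sketch (this variable-length recurrent loop is my own illustration, not from the original text):
import torch

def encode(steps):
    # The loop length depends on the input, so every call
    # may trace out a differently sized graph.
    w = torch.eye(3, requires_grad=True)
    h = torch.zeros(3)
    for step in steps: # one graph node group per input step
        h = torch.tanh(w @ h + step)
    return h.sum()

loss = encode(torch.rand(5, 3)) # builds a 5-step graph
loss.backward() # backprop follows exactly the graph that was built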
This tutorial covers a number of basic PyTorch operations.
Installing PyTorch
On the official site, select the build that matches your hardware environment, copy the command, and run it in a terminal. On my Mac, with a basic programming environment, it is:
conda install pytorch torchvision -c pytorch
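A one-line sanity check to confirm the installation worked (run in the same terminal):
python -c "import torch; print(torch.__version__)"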
Test program
import torch
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values:\n{}".format(x))
print("torch.Tensor: initialize a random one by specifying its dimensions")
describe(torch.Tensor(2,3)) # allocates uninitialized memory: the values are whatever happened to be in that block, not a proper random distribution
print("torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)")
describe(torch.rand(2,3)) # initialized from a uniform distribution on [0, 1)
print("torch.Tensor: initialize with standard normal distribution")
describe(torch.randn(2,3)) # initialized from a standard normal distribution
# Creating a filled tensor
print("filled with zeros\n")
describe(torch.zeros(2,3))
print("filled with ones\n")
x = torch.ones(2,3)
describe(x)
print("filled with a certain value\n")
x.fill_(5) # in-place fill: methods whose names end with an underscore modify the tensor in place
describe(x)
Output:
torch.Tensor: initialize a random one by specifying its dimensions
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-7.7542e+20, 4.5799e-41, -7.7542e+20],
[ 4.5799e-41, 0.0000e+00, 0.0000e+00]])
torch.Tensor: initialize with values from a uniform distribution on the interval [0, 1)
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0.7461, 0.5937, 0.9421],
[0.5716, 0.6240, 0.3719]])
torch.Tensor: initialize with standard normal distribution
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[-1.5297, -0.7294, 0.1784],
[-1.4134, 0.2278, 1.1762]])
filled with zeros
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0., 0., 0.],
[0., 0., 0.]])
filled with ones
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[1., 1., 1.],
[1., 1., 1.]])
filled with a certain value
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[5., 5., 5.],
[5., 5., 5.]])
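Tensors can also be created directly from Python lists or NumPy arrays, which is often more convenient than the factory functions above. A short sketch (the values here are my own illustration):
import torch
import numpy as np

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]]) # from a nested Python list
y = torch.from_numpy(np.ones((2, 3)))          # shares memory with the NumPy array
print(x.type()) # torch.FloatTensor
print(y.type()) # torch.DoubleTensor -- NumPy arrays default to float64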
CUDA
So far we have only allocated tensors in CPU memory and computed there, but for heavy linear-algebra operations we often want GPUs. To use GPU resources, you first need to allocate tensors in GPU memory. The special API for accessing GPUs is called CUDA; it was created by NVIDIA and is restricted to NVIDIA GPUs.
PyTorch
PyTorch makes it very easy to create CUDA tensors, moving a tensor from the CPU to the GPU while preserving its underlying type. Better still, PyTorch offers a device-agnostic idiom, so the code we write runs unchanged whether it executes on a CPU or a GPU.
First, check whether a GPU is available:
torch.cuda.is_available()
Then create a handle for the target device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
From then on, every tensor is instantiated and moved to the target device via .to(device):
import torch
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values:\n{}".format(x), "device =", device)
# check whether CUDA is available
print(torch.cuda.is_available())
# print the device that will be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# instantiate and move to the target device
x = torch.rand(3,3).to(device)
describe(x)
Output:
False
cpu
Type: torch.FloatTensor
Shape/size: torch.Size([3, 3])
Values:
tensor([[0.9827, 0.8346, 0.1842],
[0.3609, 0.1259, 0.7131],
[0.6021, 0.3017, 0.3955]]) device = cpu
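One caveat worth adding (my own note, not from the original text): the operands of an operation must live on the same device, or PyTorch raises a runtime error. A minimal sketch, continuing the script above:
x = torch.rand(3, 3).to(device)
y = torch.rand(3, 3) # created on the CPU
# x + y              # RuntimeError on a GPU machine: the tensors are on different devices
z = x + y.to(device) # move y to the same device first
describe(z)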