Tokenization: splitting text into tokens; each piece produced by the split is a token.
For Chinese and English text there are two common strategies: split a sentence into words, or split it into individual characters.
A sentence can therefore be represented by single characters or by words, and we can also treat groups of 2, 3, or more consecutive words as units.
N-gram
An N-gram is such a group of adjacent words, where N is the number of words that are used together as one unit.
import jieba
text = "深度学习是机器学习的分支,是一种以人工神经网络为架构,对数据进行表征学习的算法"
# lcut(): The main function that segments an entire sentence that contains Chinese characters into separated words.
cuted = jieba.lcut(text)
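# build 2-grams (bigrams); for a general N use cuted[i:i + N] for i in range(len(cuted) - N + 1)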
n_grams = [cuted[i:i + 2] for i in range(len(cuted) - 1)]
print(n_grams)
Natural language text is unstructured. One way to extract feature vectors that a machine learning algorithm can work with is to treat every word of the text as a feature; the feature vector of a document is then the combination of these features. Models built on this idea are called Bag of Words models, also known as the Vector Space Model.
The dimensionality of such a text representation equals the number of feature words.
Text cannot be processed by a model directly, so it has to be converted into vectors.
There are two main ways to convert text into vectors:
In one-hot encoding, every token is represented by a vector of length N, where N is the size of the vocabulary.
One-hot encoding: suppose the vocabulary contains n words and a given word sits at position k; the word is represented by an n-dimensional vector whose k-th component is 1 and whose remaining components are all 0.
The documents to be processed are tokenized (or turned into N-grams) and the results are deduplicated to obtain the vocabulary.
One-hot encoding represents text with sparse vectors and therefore wastes a lot of space.
The feature vector of a sample is simply the sum of the one-hot vectors of the words it contains.
When many distinct words are involved, the vocabulary easily grows to tens of thousands of entries, so every sample's feature vector becomes extremely sparse (most components are 0). Dimensionality this high is a training nightmare for many machine learning models, especially neural networks.
One-hot encoding also ignores the semantics of words.
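As a minimal sketch of the two points above (the toy corpus and the helper names one_hot and bag_of_words are made up for illustration), the following builds a vocabulary by deduplication, one-hot encodes each token, and sums the one-hot vectors of a sample into its bag-of-words feature vector:
import numpy as np

corpus = [["deep", "learning", "is", "fun"], ["machine", "learning", "is", "useful"]]
# deduplicate the tokens to obtain the vocabulary, then assign each word an index
vocabulary = sorted({word for sentence in corpus for word in sentence})
word2index = {word: k for k, word in enumerate(vocabulary)}

def one_hot(word):
    vector = np.zeros(len(vocabulary), dtype=np.int64)
    vector[word2index[word]] = 1
    return vector

def bag_of_words(sentence):
    # the sample's feature vector is the sum of the one-hot vectors of its words
    return sum(one_hot(word) for word in sentence)

print(word2index)
print(bag_of_words(corpus[0]))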
Word embedding (词向量, 词嵌入): encode every word as a low-dimensional vector of real numbers, i.e. map every word into a low-dimensional space.
Word embedding represents tokens with a dense matrix of floating-point values. Depending on the vocabulary size we usually pick a different dimensionality, e.g. 100, 256, or 300; the dimensionality is a hyperparameter, while the values of the vectors are initialized randomly and then learned during training.
If our text contains 20,000 words, one-hot encoding needs a 20000 x 20000 matrix, most of whose entries are 0. With word embedding we only need a 20000 x dimension matrix, e.g. 20000 x 300 (6 million values instead of 400 million).
We convert all text into vectors, so every sentence is represented by vectors.
In between, tokens are first represented by numbers, and the numbers are then represented by vectors.
That is: token ----> num ----> vector
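A minimal sketch of this token -> num -> vector pipeline (the vocabulary and sentence here are made up; the real pipeline is built later in these notes with Word2Sequence and nn.Embedding):
import torch
import torch.nn as nn

word2index = {"UNK": 0, "PAD": 1, "深度": 2, "学习": 3, "很": 4, "有趣": 5}
sentence = ["深度", "学习", "很", "有趣"]

# token -> num
indices = torch.LongTensor([[word2index.get(word, 0) for word in sentence]])
# num -> vector: 6 tokens in the vocabulary, each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=len(word2index), embedding_dim=4)
vectors = embedding(indices)
print(vectors.shape)  # torch.Size([1, 4, 4]): [batch_size, seq_length, embedding_dim]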
将单词映射到低维空间可以表示单词之间的语义关系
For example, suppose we limit the word vectors to 2 dimensions and map the words "cat", "dog", "happy", "surprised", and "phone" into that space. "Dog" and "cat" are both animals and semantically close, so their vectors end up with a small angle between them, while "dog" and "phone" have little to do with each other and end up with a large angle.
A good word embedding not only places semantically similar words in nearby regions; it should also support simple semantic arithmetic, mapping semantic operations onto vector operations. The classic example is vec("king") - vec("man") + vec("woman") ≈ vec("queen").
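A small illustrative sketch of the angle idea with hand-picked 2-D vectors (the values are invented for illustration, not trained embeddings):
import numpy as np

# hand-picked 2-D vectors standing in for trained embeddings
vectors = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([1.0, 0.7]),
    "phone": np.array([-0.8, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# semantically close words -> small angle -> cosine similarity close to 1
print(cosine_similarity(vectors["cat"], vectors["dog"]))
# unrelated words -> large angle -> much smaller cosine similarity
print(cosine_similarity(vectors["dog"], vectors["phone"]))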
Mathematically, an embedding is a mapping f: X -> Y, i.e. a function that is injective (distinct elements of X map to distinct elements of Y) and structure-preserving (for example, if X1 < X2 in the space X, then correspondingly Y1 < Y2 in the space Y). Word embedding, then, maps words into another space in a way that is injective and structure-preserving.
Informally, "embedding" means taking each word from the space X and mapping it to a multi-dimensional vector in the space Y, so that the word is embedded into Y, one slot per word.
In short, word embedding means finding a mapping (a function) that produces a representation of each word in a new space; that representation is the word representation.
torch.nn.Embedding(num_embeddings, embedding_dim):
torch.nn.Embedding initializes the word vectors randomly; the values are drawn from the standard normal distribution N(0, 1) and are then learned during training.
import torch.nn as nn
import torch
# nn.Embedding: This module is often used to store word embeddings and retrieve them using indices.
embedding = nn.Embedding(10, 3)
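# 10 = num_embeddings (the vocabulary size), 3 = embedding_dim; every index fed to this layer must lie in [0, 9]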
text = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
word_embedding = embedding(text)
print(word_embedding.shape)
print(word_embedding.size())
print(word_embedding)
'''
padding_idx (int, optional):
If given, pads the output with the embedding vector at :attr:`padding_idx`
(initialized to zeros) whenever it encounters the index.
'''
embedding = nn.Embedding(10, 3, padding_idx=2)
text = torch.LongTensor([[0, 2, 0, 5]])
word_embedding = embedding(text)
print(word_embedding.shape)
print(word_embedding.size())
print(word_embedding)
Workflow: prepare the dataset, build the model, train the model, evaluate the model.
A few things need attention when preparing the IMDB data, as the code below shows: the raw reviews have to be tokenized ourselves, and a custom collate_fn is needed because each sample is variable-length text rather than a tensor.
import torch
from torch.utils.data import DataLoader, Dataset
import os
import re
database_path = './aclImdb'
# 1. Define the tokenize function
def tokenize(text):
filters = ['!', '"', '#', '$', '&', '\(', '\)', '\*', '\+', ',', '-',
'\.', '/', ':', ';', '<', '=', '>', '\?', '@', '\[', '\\', '\]', '^',
'_', '`', '\{', '\|', '\|', '~', '\t', '\n', '\x97', '\x96']
    text = re.sub('<.*?>', ' ', text, flags=re.S)  # strip HTML tags such as <br />
text = re.sub('|'.join(filters), ' ', text, flags=re.S)
return [i.strip() for i in text.split()]
# 2. Prepare the dataset
class ImdbDataset(Dataset):
def __init__(self, mode):
super(ImdbDataset, self).__init__()
if mode == 'train':
text_path = [os.path.join(database_path, i) for i in ['train/neg', 'train/pos']]
else:
text_path = [os.path.join(database_path, i) for i in ['test/neg', 'test/pos']]
self.total_file_path = []
for i in text_path:
# extend(): Extend list by appending elements from the iterable.
self.total_file_path.extend([os.path.join(i, j) for j in os.listdir(i)])
def __getitem__(self, index):
current_path = self.total_file_path[index]
# basename(): Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split().
current_filename = os.path.basename(current_path)
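        # aclImdb review files are named like "123_8.txt": the number after the underscore is the review's rating (1-10), used here as the label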
label = int(current_filename.split('_')[-1].split('.')[0])
        text = tokenize(open(current_path, errors='ignore').read().strip())
return label, text
def __len__(self):
return len(self.total_file_path)
def collate_fn(batch):
    # batch is a list of tuples; each tuple is the result of __getitem__ of the dataset
batch = list(zip(*batch))
label = torch.tensor(batch[0], dtype=torch.int32)
text = batch[1]
return label, text
# 3. Instantiate the dataset and prepare the DataLoader
dataset = ImdbDataset(mode='train')
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True,collate_fn=collate_fn)
for idx, data in enumerate(dataloader):
label, text = data
print(f"idx: {idx}")
print(f"label: {label}")
print(f"data:{text}")
'''
zip(*iterables) --> A zip object yielding tuples until an input is exhausted.
|
| >>> list(zip('abcdefg', range(3), range(4)))
| [('a', 0, 0), ('b', 1, 1), ('c', 2, 2)]
|
| The zip object yields n-length tuples, where n is the number of iterables
| passed as positional arguments to zip(). The i-th element in every tuple
| comes from the i-th iterable argument to zip(). This continues until the
| shortest argument is exhausted.
'''
tuples=((1,2),(3,4))
print(*zip(*tuples))
Before building a deep learning model, the text has to be turned into vector representations (word embedding). First the text is converted into numbers (text serialization), and the numbers are then converted into vectors. A natural approach is to store every word together with its corresponding number in a dictionary, and to convert each sentence into a list of numbers.
Before implementing text serialization, a few points need to be considered: how to handle words that are not in the vocabulary (an UNK token), how to pad every sentence to the same length (a PAD token), how to filter out very rare or overly frequent words, and how to cap the total vocabulary size. The Word2Sequence class below handles all of these.
import numpy as np
class Word2Sequence():
UNK_TAG = 'UNK'
PAD_TAG = "PAD"
UNK = 0
PAD = 1
def __init__(self):
self.dict = {
self.UNK_TAG: self.UNK,
self.PAD_TAG: self.PAD
}
self.fited = False
self.count={}
# get index corresponding to the specific word
def word_to_index(self, word):
# word -> index
        assert self.fited, 'fit must be called first'
return self.dict.get(word, self.UNK)
# get word corresponding to the specific index
def index_to_word(self, index):
# index -> word
        assert self.fited, 'fit must be called first'
if index in self.inversed_dictionary:
return self.inversed_dictionary[index]
return self.UNK_TAG
def fit(self,sentence):
'''
        Count the words of a single tokenized sentence into self.count
:param sentence: [word1, word2, ...]
:return:
'''
for word in sentence:
self.count[word] = self.count.get(word,0)+1
    # Filter out low-frequency and high-frequency words, then build the word -> index dictionary
def build_vocabulary(self, min_count=1, max_count=None, max_feature=None):
'''
        :param min_count: minimum number of occurrences a word needs in order to be kept
        :param max_count: maximum number of occurrences a word may have in order to be kept
        :param max_feature: maximum total number of words to keep in the vocabulary
:return:
'''
        # filter out low-frequency words
if min_count is not None:
self.count = {k: v for k, v in self.count.items() if v >= min_count}
        # filter out high-frequency words
if max_count is not None:
self.count = {k: v for k, v in self.count.items() if v <= max_count}
        # cap the total number of words
if isinstance(max_feature, int):
# Return a new list containing all items from the iterable in ascending order
self.count = sorted(list(self.count.items()), key=lambda x: x[1])
if max_feature is not None and len(self.count) > max_feature:
self.count = self.count[-int(max_feature):]
            for word, _ in self.count:  # after sorting, self.count is a list of (word, count) tuples
self.dict[word] = len(self.dict)
else:
for word in self.count:
self.dict[word] = len(self.dict)
self.fited = True
# index -> word
self.inversed_dictionary = dict(zip(self.dict.values(), self.dict.keys()))
# limit the length of sentences and transform word in sentences to corresponding index
def transform(self, sentences, max_len=None):
"""
realize the function transforming the word in sentences to index, finally generate the list of index
:param sentences:
:param max_len:
:return:
"""
assert self.fited, "The 'fited' operation must be performed first"
if max_len is not None:
sentences_index = [self.PAD] * max_len
else:
sentences_index = [self.PAD] * len(sentences)
if max_len is not None and len(sentences) > max_len:
sentences = sentences[:max_len]
for index, word in enumerate(sentences):
sentences_index[index] = self.word_to_index(word)
return np.array(sentences_index, dtype=np.int64)
def inverse_transform(self, indices):
'''
realize the function transforming the index in indices to word, finally generate the list of words
:param indices: [1, 2, 3, ...]
:return: [word1, word2, ...]
'''
sentences = []
for index in indices:
word = self.index_to_word(index)
sentences.append(word)
return sentences
if __name__ == '__main__':
word_to_sequence = Word2Sequence()
# fit(): build the relations between the word and the index in order from lowest to highest
word_to_sequence.fit(['唐', '舞', '桐'])
word_to_sequence.build_vocabulary()
print(word_to_sequence.dict)
print(word_to_sequence.transform(['舞', '桐']))
from word_sequence import Word2Sequence
import pickle
import os
from dataset import tokenize
from tqdm import tqdm
if __name__ == '__main__':
if not os.path.exists('./model'):
os.mkdir('./model')
ws = Word2Sequence()
path = r'./aclImdb/train'
temporary_data_path = [os.path.join(path, 'neg'), os.path.join(path, 'pos')]
for data_path in temporary_data_path:
# os.listdir(): Return a list containing the names of the files in the directory.
file_name = os.listdir(data_path)
file_path = [os.path.join(data_path, name) for name in file_name if name.endswith('.txt')]
for file in tqdm(file_path):
sentence = tokenize(open(file, errors='ignore').read())
ws.fit(sentence)
# filter high frequency words and low frequency words and generate the dictionary of words and index that corresponds one to one
ws.build_vocabulary(min_count=5)
print(len(ws))
pickle.dump(ws, open('./model/ws.pkl', 'wb'))
import pickle
# load the previously saved pkl file
ws = pickle.load(open('./model/ws.pkl', 'rb'))
import torch
from torch.utils.data import DataLoader, Dataset
import os
import re
from pkl import ws
database_path = './aclImdb'
# 1. Define the tokenize function
def tokenize(text):
filters = ['!', '"', '#', '$', '&', '\(', '\)', '\*', '\+', ',', '-',
'\.', '/', ':', ';', '<', '=', '>', '\?', '@', '\[', '\\', '\]', '^',
'_', '`', '\{', '\|', '\|', '~', '\t', '\n', '\x97', '\x96']
    text = re.sub('<.*?>', ' ', text, flags=re.S)  # strip HTML tags such as <br />
text = re.sub('|'.join(filters), ' ', text, flags=re.S)
return [i.strip() for i in text.split()]
# 2. Prepare the dataset
class ImdbDataset(Dataset):
def __init__(self, mode):
super(ImdbDataset, self).__init__()
if mode == 'train':
text_path = [os.path.join(database_path, i) for i in ['train/neg', 'train/pos']]
else:
text_path = [os.path.join(database_path, i) for i in ['test/neg', 'test/pos']]
self.total_file_path = []
for i in text_path:
# extend(): Extend list by appending elements from the iterable.
self.total_file_path.extend([os.path.join(i, j) for j in os.listdir(i)])
def __getitem__(self, index):
current_path = self.total_file_path[index]
# basename(): Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split().
current_filename = os.path.basename(current_path)
label = int(current_filename.split('_')[-1].split('.')[0])
        text = tokenize(open(current_path, errors='ignore').read().strip())
return label, text
def __len__(self):
return len(self.total_file_path)
def collate_fn(batch):
    # batch is a list of tuples; each tuple is the result of __getitem__ of the dataset
batch = list(zip(*batch))
label = torch.tensor(batch[0], dtype=torch.int32)
text = batch[1]
# ws.transform: realize the function transforming the word in sentences to index, finally generate the list of index
text = [ws.transform(i, max_len=20) for i in text]
return label, text
# 3. Instantiate the dataset and prepare the DataLoader
dataset = ImdbDataset(mode='train')
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
for idx, data in enumerate(dataloader):
label, text = data
print(f"idx: {idx}")
print(f"label: {label}")
print(f"data:{text}")
break
Using word embedding, the model contains only a single linear layer on top of the embedding:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from dataset import get_dataloader
from pkl import ws, MAX_LEN
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
class IMDBModel(nn.Module):
def __init__(self):
super(IMDBModel, self).__init__()
'''
padding_idx (int, optional):
If given, pads the output with the embedding vector at :attr:`padding_idx`
(initialized to zeros) whenever it encounters the index.
'''
self.embedding = nn.Embedding(len(ws), 100, padding_idx=ws.PAD)
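        # 11 output classes: the aclImdb filenames encode ratings 1-10 as the label, and CrossEntropyLoss
        # expects class indices in [0, num_classes - 1], so 11 classes keep label 10 in range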
self.linear = nn.Linear(MAX_LEN * 100, 11)
def forward(self, input):
x = self.embedding(input) # [batch_size,max_len,100]
x = x.view(x.size(0), -1) # [batch_size,max_len*100]
out = self.linear(x)
'''
F.log_softmax():
While mathematically equivalent to log(softmax(x)), doing these two
operations separately is slower, and numerically unstable. This function
uses an alternative formulation to compute the output and gradient correctly.
'''
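        # Note: the training code below uses nn.CrossEntropyLoss, which already applies log_softmax internally.
        # Returning the raw logits (out) with CrossEntropyLoss, or keeping log_softmax and switching to
        # nn.NLLLoss, would avoid applying a softmax normalization twice.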
return F.log_softmax(out,dim=-1)
TRAIN_BATCH_SIZE = 128
TEST_BATCH_SIZE = 1024
LR = 0.001
imdb = IMDBModel().to(device)
optimizer = optim.Adam(imdb.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss().to(device)
def train(epoch):
print(f"{'-'*10}epoch: {epoch+1}{'-'*10}")
mode = True
imdb.train(mode)
train_dataloader = get_dataloader(mode='train', batch_size=TRAIN_BATCH_SIZE)
for idx, (label, text) in enumerate(train_dataloader):
label = label.to(device)
text = text.to(device)
optimizer.zero_grad()
output = imdb(text)
loss = criterion(output, label)
loss.backward()
optimizer.step()
if idx % 10 == 0:
print(f'train epoch:{epoch}, loss: {loss.item()}')
print("模型保存成功")
torch.save(imdb.state_dict(), f'./model/ws_{epoch}.pth')
for i in range(20):
train(i)
The dataset.py file:
import torch
from torch.utils.data import DataLoader, Dataset
import os
import re
from pkl import ws
database_path = './aclImdb'
# 1. Define the tokenize function
def tokenize(text):
filters = ['!', '"', '#', '$', '&', '\(', '\)', '\*', '\+', ',', '-',
'\.', '/', ':', ';', '<', '=', '>', '\?', '@', '\[', '\\', '\]', '^',
'_', '`', '\{', '\|', '\|', '~', '\t', '\n', '\x97', '\x96']
    text = re.sub('<.*?>', ' ', text, flags=re.S)  # strip HTML tags such as <br />
text = re.sub('|'.join(filters), ' ', text, flags=re.S)
return [i.strip() for i in text.split()]
# 2. Prepare the dataset
class ImdbDataset(Dataset):
def __init__(self, mode):
super(ImdbDataset, self).__init__()
if mode == 'train':
text_path = [os.path.join(database_path, i) for i in ['train/neg', 'train/pos']]
else:
text_path = [os.path.join(database_path, i) for i in ['test/neg', 'test/pos']]
self.total_file_path = []
for i in text_path:
# extend(): Extend list by appending elements from the iterable.
self.total_file_path.extend([os.path.join(i, j) for j in os.listdir(i)])
def __getitem__(self, index):
current_path = self.total_file_path[index]
# basename(): Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split().
current_filename = os.path.basename(current_path)
label = int(current_filename.split('_')[-1].split('.')[0])
text = tokenize(open(current_path,errors='ignore').read().strip())
return label, text
def __len__(self):
return len(self.total_file_path)
def collate_fn(batch):
    # batch is a list of tuples; each tuple is the result of __getitem__ of the dataset
batch = list(zip(*batch))
label = torch.tensor(batch[0], dtype=torch.long)
text = batch[1]
# ws.transform: realize the function transforming the word in sentences to index, finally generate the list of index
text = [ws.transform(i, max_len=20) for i in text]
text=torch.tensor(text)
return label, text
# 3. Instantiate the dataset and prepare the DataLoader
dataset = ImdbDataset(mode='train')
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
def get_dataloader(mode, batch_size):
mode_dataset = ImdbDataset(mode)
mode_dataloader = DataLoader(dataset=mode_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
return mode_dataloader
# loader = get_dataloader('train', 2)
#
# for label, text in loader:
# text = torch.tensor(text)
# print(f"text: {text}")
# print(f'dtype: {text.dtype}')
# print(f'type: {type(text)}')
# break
The pkl.py file:
import pickle
ws = pickle.load(open('./model/ws.pkl', 'rb'))
MAX_LEN = 20
The script that builds and saves ./model/ws.pkl:
from word_sequence import Word2Sequence
import pickle
import os
from dataset import tokenize
from tqdm import tqdm
if __name__ == '__main__':
if not os.path.exists('./model'):
os.mkdir('./model')
ws = Word2Sequence()
path = r'./aclImdb/train'
temporary_data_path = [os.path.join(path, 'neg'), os.path.join(path, 'pos')]
for data_path in temporary_data_path:
# os.listdir(): Return a list containing the names of the files in the directory.
file_name = os.listdir(data_path)
file_path = [os.path.join(data_path, name) for name in file_name if name.endswith('.txt')]
for file in tqdm(file_path):
sentence = tokenize(open(file, errors='ignore').read())
ws.fit(sentence)
# filter high frequency words and low frequency words and generate the dictionary of words and index that corresponds one to one
ws.build_vocabulary(min_count=5)
print(len(ws))
pickle.dump(ws, open('./model/ws.pkl', 'wb'))
The word_sequence.py file:
import numpy as np
class Word2Sequence():
UNK_TAG = 'UNK'
PAD_TAG = "PAD"
UNK = 0
PAD = 1
def __init__(self):
self.dict = {
self.UNK_TAG: self.UNK,
self.PAD_TAG: self.PAD
}
self.fited = False
self.count={}
# get index corresponding to the specific word
def word_to_index(self, word):
# word -> index
        assert self.fited, 'fit must be called first'
return self.dict.get(word, self.UNK)
# get word corresponding to the specific index
def index_to_word(self, index):
# index -> word
        assert self.fited, 'fit must be called first'
if index in self.inversed_dictionary:
return self.inversed_dictionary[index]
return self.UNK_TAG
def fit(self,sentence):
'''
        Count the words of a single tokenized sentence into self.count
:param sentence: [word1, word2, ...]
:return:
'''
for word in sentence:
self.count[word] = self.count.get(word,0)+1
    # Filter out low-frequency and high-frequency words, then build the word -> index dictionary
def build_vocabulary(self, min_count=1, max_count=None, max_feature=None):
'''
        :param min_count: minimum number of occurrences a word needs in order to be kept
        :param max_count: maximum number of occurrences a word may have in order to be kept
        :param max_feature: maximum total number of words to keep in the vocabulary
:return:
'''
        # filter out low-frequency words
if min_count is not None:
self.count = {k: v for k, v in self.count.items() if v >= min_count}
        # filter out high-frequency words
if max_count is not None:
self.count = {k: v for k, v in self.count.items() if v <= max_count}
        # cap the total number of words
if isinstance(max_feature, int):
# Return a new list containing all items from the iterable in ascending order
self.count = sorted(list(self.count.items()), key=lambda x: x[1])
if max_feature is not None and len(self.count) > max_feature:
self.count = self.count[-int(max_feature):]
            for word, _ in self.count:  # after sorting, self.count is a list of (word, count) tuples
self.dict[word] = len(self.dict)
else:
for word in self.count:
self.dict[word] = len(self.dict)
self.fited = True
# index -> word
self.inversed_dictionary = dict(zip(self.dict.values(), self.dict.keys()))
# limit the length of sentences and transform word in sentences to corresponding index
def transform(self, sentences, max_len=None):
"""
realize the function transforming the word in sentences to index, finally generate the list of index
:param sentences:
:param max_len:
:return:
"""
assert self.fited, "The 'fited' operation must be performed first"
if max_len is not None:
sentences_index = [self.PAD] * max_len
else:
sentences_index = [self.PAD] * len(sentences)
if max_len is not None and len(sentences) > max_len:
sentences = sentences[:max_len]
for index, word in enumerate(sentences):
sentences_index[index] = self.word_to_index(word)
return np.array(sentences_index, dtype=np.int64)
def inverse_transform(self, indices):
'''
realize the function transforming the index in indices to word, finally generate the list of words
:param indices: [1, 2, 3, ...]
:return: [word1, word2, ...]
'''
sentences = []
for index in indices:
word = self.index_to_word(index)
sentences.append(word)
return sentences
def __len__(self):
return len(self.dict)
if __name__ == '__main__':
word_to_sequence = Word2Sequence()
# fit(): build the relations between the word and the index in order from lowest to highest
word_to_sequence.fit(['唐', '舞', '桐'])
word_to_sequence.build_vocabulary()
print(word_to_sequence.dict)
print(word_to_sequence.transform(['舞', '桐']))
This completes the preparation of word_sequence.
In an ordinary feedforward neural network, information flows in one direction only. This restriction makes the network easier to learn, but it also limits its power to some extent. In many real tasks, the output of a system depends not only on the current input but also on its outputs over a past period of time. Moreover, feedforward networks struggle with sequential data such as video, speech, and text: the length of a sequence generally varies, while a feedforward network requires inputs and outputs of fixed dimensionality.
A recurrent neural network (RNN) is a neural network with short-term memory. In an RNN a neuron receives not only information from other neurons but also its own information from the previous step, forming a network structure with loops: a neuron's output can act on the neuron itself as input at the next time step.
Recurrence: the effective input at the current time step = the external input at the current time step + the output of the previous time step.
The value s of the RNN's hidden layer depends not only on the current input x but also on the previous value of the hidden layer; the weight matrix W weights the previous hidden value when it is fed back in as input.
After receiving the input x_t at time t, the hidden layer takes the value s_t and the network produces the output o_t.
s_t depends not only on x_t but also on s_{t-1}, as written out below.
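Written out with the usual notation (this is the standard Elman RNN formulation rather than anything specific to these notes; U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, b and c are biases, f and g are activation functions):
s_t = f(U x_t + W s_{t-1} + b)
o_t = g(V s_t + c)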
We can play a game with the computer: we write the first few words of a sentence and ask the computer to fill in the next one. For example:
"I was late for school yesterday, and the teacher criticized ____."
Given the words before the blank, the next word is far more likely to be "me" than, say, "Xiao Ming", let alone "having lunch".
That is exactly what a language model is: given the first part of a sentence, predict the most likely next word.
The output of a recurrent neural network is influenced by all of the previous inputs x_t, x_{t-1}, x_{t-2}, x_{t-3}, ..., which is why an RNN can, in principle, look back over arbitrarily many earlier inputs.
For a language model, however, looking only at the preceding words is often not enough. Consider:
"My phone is broken, so I am going to ____ a new phone."
If we only see the words before the blank ("my phone is broken"), we cannot tell whether I plan to repair it, replace it, or just have a good cry. But once we also see the words after the blank ("a new phone"), the probability that the blank should be "buy" becomes much higher.
A bidirectional recurrent neural network therefore keeps two values in its hidden layer: A, computed in the forward direction, and A', computed in the backward direction. The final output y_t depends on both A_t and A'_t; taking y_2 as an example, it is computed as in the equations sketched below.
In the forward pass the hidden value A_t depends on A_{t-1}; in the backward pass the hidden value A'_t depends on A'_{t+1}; the final output is determined by combining (summing) the forward and backward contributions.
From these formulas we can see that the forward and backward computations do not share weights: U and U', W and W', and V and V' are all different weight matrices.
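The original equations appear as an image; a standard bidirectional RNN formulation consistent with the description above (using y_2 as the example) is:
y_2 = g(V A_2 + V' A'_2)
A_2 = f(W A_1 + U x_2)
A'_2 = f(W' A'_3 + U' x_2)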
BPTT (backpropagation through time) is the training algorithm for the recurrent layer. Its basic principle is the same as ordinary backpropagation, and it involves the same three steps: forward-compute the output of every neuron, backward-compute the error term of every neuron, and then compute the gradient of every weight.
The gradient of the loss at any time step with respect to the recurrent weights W and the input weights U is the sum of gradient contributions from all of the preceding time steps, while the gradient with respect to the output weights V depends only on the values at the current time step.
Unrolling this recurrence gives the familiar unrolled RNN diagram (the figures are not reproduced here): x_t is the input at time step t and h_t is the corresponding hidden state; the same model is sometimes drawn as a three-dimensional diagram.
RNN models are mainly applied to sequential data; common configurations include one-to-many, many-to-one, and many-to-many (sequence-to-sequence).
Important: at every time step the same parameters U, W, V, b, and c are used; in other words, the parameters are shared across all time steps.
Backpropagation computes the partial derivatives of the loss with respect to U, W, V, b, and c and adjusts them to reduce the loss.
Let the loss at time step t be the mean squared error L_t = \frac{1}{2}(y_t - o_t)^2; the total loss is then the sum L = \sum_t L_t.
As an example, consider the partial derivatives of the loss L_2 at t = 2 with respect to U, V, and W:
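The original worked example is an image and is not reproduced here; under the definitions s_t = f(U x_t + W s_{t-1} + b) and o_t = g(V s_t + c) used above, the chain rule gives (a sketch):
\frac{\partial L_2}{\partial V} = \frac{\partial L_2}{\partial o_2} \frac{\partial o_2}{\partial V}
\frac{\partial L_2}{\partial W} = \frac{\partial L_2}{\partial o_2} \frac{\partial o_2}{\partial s_2} \frac{\partial s_2}{\partial W} + \frac{\partial L_2}{\partial o_2} \frac{\partial o_2}{\partial s_2} \frac{\partial s_2}{\partial s_1} \frac{\partial s_1}{\partial W}
\frac{\partial L_2}{\partial U} = \frac{\partial L_2}{\partial o_2} \frac{\partial o_2}{\partial s_2} \frac{\partial s_2}{\partial U} + \frac{\partial L_2}{\partial o_2} \frac{\partial o_2}{\partial s_2} \frac{\partial s_2}{\partial s_1} \frac{\partial s_1}{\partial U}
The W and U gradients accumulate one term per earlier time step, which is exactly the "sum over preceding time steps" mentioned above, while the V gradient involves only the current time step.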
The strength of an RNN is that it can use past data to help predict the current data. But because the parameters are shared and every time step is influenced by all of the preceding time steps through this summed chain of products, a problem appears: once the distance becomes long, the gradient contributions from the earliest time steps either vanish or explode.
So the further the relevant data is from the position being predicted, the less the RNN is able to learn from it.
For example: I live in Beijing. ... I can speak Chinese.
"Beijing" and "Chinese" are closely related, but with a large number of sentences in between, by the time the network reaches "Chinese" it can no longer connect it to the earlier "Beijing".
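A compact way to see why (a standard argument, not specific to these notes): the gradient that flows from time step t back to an earlier step k contains the product of Jacobians
\frac{\partial s_t}{\partial s_k} = \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} = \prod_{i=k+1}^{t} \mathrm{diag}\big(f'(\cdot)\big)\, W
When the norms of these factors are consistently below 1, the product shrinks exponentially with the distance t - k (vanishing gradients); when they are consistently above 1, it grows exponentially (exploding gradients).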
import torch.nn as nn
import torch
rnn = nn.RNN(10, 20, 2)
# text: [seq_length, batch_size, input_size]
text = torch.randn(5, 3, 10) # seq_length=5, batch_size=3, input_size=10
# h0: [num_layers * num_directions, batch_size, hidden_size]
h0 = torch.randn(2, 3, 20) # num_layers*num_directions=2,batch_size=3,hidden_size=20
# output: [seq_length, batch_size, num_directions * hidden_size]
# hn: [num_layers * num_directions, batch_size, hidden_size]
output, hn = rnn(text, h0)
print(output.size())
print(hn.shape)
import torch.nn as nn
import torch
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
# how many time steps are in one batch of data
sequence_length = 20
# np.linspace(): Return evenly spaced numbers over a specified interval.
time_steps = np.linspace(0, np.pi, sequence_length + 1)
data = np.sin(time_steps)
# np.resize(): Return a new array with the specified shape.
data = np.resize(data, (sequence_length + 1, 1))
# size: Number of elements in the array.
# shape: Tuple of array dimensions
# reshape: Returns an array containing the same data with a new shape.
# resize: Change shape and size of array in-place.
x = data[:-1]
y = data[1:]
print(x.shape)
print(y.shape)
plt.plot(time_steps[1:], x, 'r.', label='input,x')
plt.plot(time_steps[1:], y, 'b.', label='target y')
plt.legend(loc='best')
plt.show()
class SimpleRNN(nn.Module):
def __init__(self, input_size, output_size, hidden_size, layers):
super(SimpleRNN, self).__init__()
self.hidden_size = hidden_size
# defined an RNN with specified parameters
        # batch_first is left at its default (False) here, so the input and output are [seq_length, batch_size, ...]
'''
RNN
Applies a multi-layer Elman RNN with tanh or ReLU to an input sequence
Args:
input_size: The number of expected features in the input `x`
hidden_size: The number of features in the hidden state `h`
num_layers: Number of recurrent layers.
'''
self.rnn = nn.RNN(input_size=input_size,
hidden_size=hidden_size,
                          num_layers=layers)
self.linear = nn.Linear(self.hidden_size, output_size)
def forward(self, x, h_0):
'''
:param x: [seq_length, batch_size, input_size]
:param h_0: [num_layers*num_directions, batch_size, hidden_size]
:return:
'''
batch_size = x.size(1)
# out:[seq_length, batch_size, num_directions * hidden_size]
# h_1: [num_layers * num_directions, batch_size, hidden_size]
out, h_1 = self.rnn(x, h_0)
out = out.view(-1, self.hidden_size)
output = self.linear(out)
return output, h_1
rnn = SimpleRNN(input_size=1, output_size=1, hidden_size=10, layers=1)
x = torch.Tensor(x)
x = x.reshape(20, 1, 1)
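# x: [seq_length=20, batch_size=1, input_size=1], the shape nn.RNN expects with batch_first=False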
print('x: ', x.shape)
# h_1.shape: [num_layers * num_directions, batch_size, hidden_size]
out, h_1 = rnn(x, None)
print('rnn out: ',out.shape)
print(f'rnn h_1: {h_1.shape}')
# -*- coding: utf-8 -*-
# @Time : 2022/11/4 19:22
# @Author : 楚楚
# @File : 01RNN文本情感分类.py
# @Software: PyCharm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from dataset import get_dataloader
from pkl import ws, MAX_LEN
from datetime import datetime
from tqdm import tqdm
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class IMDBModel(nn.Module):
def __init__(self, input_size, output_size, hidden_size, layers):
super(IMDBModel, self).__init__()
'''
padding_idx (int, optional):
If given, pads the output with the embedding vector at :attr:`padding_idx`
(initialized to zeros) whenever it encounters the index.
'''
self.embedding = nn.Embedding(len(ws), input_size, padding_idx=ws.PAD)
self.hidden_size = hidden_size
# defined an RNN with specified parameters
        # batch_first is left at its default (False), so the RNN expects [seq_length, batch_size, input_size]; the permute in forward() arranges the embedded input accordingly
'''
RNN
Applies a multi-layer Elman RNN with tanh or ReLU to an input sequence
Args:
input_size: The number of expected features in the input `x`
hidden_size: The number of features in the hidden state `h`
num_layers: Number of recurrent layers.
'''
self.rnn = nn.RNN(input_size=input_size,
hidden_size=hidden_size,
num_layers=layers)
self.linear = nn.Linear(self.hidden_size, output_size)
def forward(self, x, h_0):
'''
:param x: [batch_size, seq_length]
:param h_0: [num_layers*num_directions, batch_size, hidden_size]
:return:
'''
batch_size = x.size(0)
# x: [batch_size, seq_length, input_size]
x = self.embedding(x)
# x: [seq_length, batch_size, input_size]
x = x.permute(1, 0, 2)
# out: [seq_length, batch_size, num_directions*hidden_size]
# h_n: [num_layers*num_directions, batch_size, hidden_size]
out, h_n = self.rnn(x, h_0)
out = out[-1]
out = out.reshape(batch_size, -1)
output = self.linear(out)
return output, h_n
TRAIN_BATCH_SIZE = 128
TEST_BATCH_SIZE = 128
LR = 0.001
imdb = IMDBModel(input_size=100, output_size=11, hidden_size=20, layers=1).to(device)
optimizer = optim.Adam(imdb.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss().to(device)
def train_test(epoch):
print(f"{'-' * 10}epoch: {epoch + 1}{'-' * 10}")
mode = True
imdb.train(mode)
    train_dataloader = get_dataloader(mode='train', batch_size=TRAIN_BATCH_SIZE)
    # collate_fn in dataset.py returns (label, text)
    for idx, (label, text) in enumerate(train_dataloader):
text = text.to(device)
label = label.to(device)
optimizer.zero_grad()
output, h_n = imdb(text, None)
loss = criterion(output, label)
loss.backward()
optimizer.step()
if idx % 10 == 0:
print(f'train epoch:{epoch}, loss: {loss.item()}')
print(f"{'-' * 10}测试开始{'-' * 10}")
imdb.eval()
    test_dataloader = get_dataloader('test', batch_size=TEST_BATCH_SIZE)
    # total number of test samples, used to compute the accuracy
    len_test_data = len(test_dataloader.dataset)
sum_loss = 0
total_accuracy = 0
with torch.no_grad():
        for label, text in tqdm(test_dataloader):
text = text.to(device)
label = label.to(device)
output, h_n = imdb(text, None)
loss = criterion(output, label)
sum_loss += loss.item()
predicted = output.argmax(1)
            accuracy = (predicted == label).sum().item()
            total_accuracy += accuracy
print(f"测试集上的loss:{sum_loss}")
correct_accuracy = total_accuracy / len_test_data
print(f"整体测试集上的正确率:{correct_accuracy}%")
print("模型保存成功")
torch.save(imdb.state_dict(), f'./model/ws_{epoch}.pth')
now = datetime.now()
now = now.strftime("%Y-%m-%d %H:%M:%S")
content = f"time:{now}\t模型在测试集上的准确率:{correct_accuracy}"
with open('./accuracy.txt', 'a+', encoding='utf-8') as file:
file.write(content + '\n')
if __name__ == '__main__':
for i in range(20):
train_test(i)
Dataset download: https://ai.stanford.edu/~amaas/data/sentiment/