星码

[oneAPI] SQuAD Question Answering with a Pretrained BERT Model


Competition: https://marketing.csdn.net/p/f3e44fbfe46c465f4d9d6c23e38e0517
Intel® DevCloud for oneAPI: https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/

Intel® Optimization for PyTorch and Intel® DevCloud for oneAPI

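We built our experimental environment on the Intel® DevCloud for oneAPI platform and took advantage of its fully virtualized setup. We also integrated Intel® Optimization for PyTorch into our PyTorch model. In our experiments this improved results and noticeably sped up both training and inference. Pairing hardware-aware optimization with the existing software stack let us get more out of the available hardware and opened up further directions for our experiments.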
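
As a rough illustration of what this integration can look like, here is a minimal sketch. It assumes the `intel_extension_for_pytorch` package is installed; the helper name `optimize_with_ipex` is hypothetical and not taken from the original project. When the extension is missing, the helper simply returns its inputs unchanged.

```python
def optimize_with_ipex(model, optimizer=None):
    """Apply ipex.optimize() when the extension is installed, otherwise do nothing.

    Hypothetical helper for illustration only. For inference, call model.eval()
    before this function; for training, call model.train() and pass the optimizer.
    """
    try:
        import intel_extension_for_pytorch as ipex  # optional dependency
    except ImportError:
        return model, optimizer  # extension not available: leave everything as-is
    if optimizer is None:
        return ipex.optimize(model), None
    # When an optimizer is passed, ipex.optimize returns a (model, optimizer) pair.
    return ipex.optimize(model, optimizer=optimizer)
```

A training script would call `model, optimizer = optimize_with_ipex(model, optimizer)` once, after constructing both objects and before the training loop.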

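SQuAD Question Answering with a Pretrained BERT Model

SQuAD (Stanford Question Answering Dataset) is a widely used English question answering dataset released by Stanford University. It was designed to advance research on machine reading comprehension. Each article comes with a set of questions, and each question is paired with an exact answer span extracted from the original article.

In the SQuAD task, a model has to read an article, understand the context, and accurately extract the answer to a question from it. The task matters for building strong reading comprehension models and question answering systems.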

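Key characteristics of the SQuAD task:

Realism: the articles and questions in SQuAD come from real text, which keeps the task close to practical applications.
Machine reading comprehension: the model must read an article, understand its content, and then locate and extract the exact answer, which is a typical machine reading comprehension setting.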

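BERT (Bidirectional Encoder Representations from Transformers) is an important model for the SQuAD task. By pretraining language representations, it has achieved strong results in question answering and information extraction.

Design aspects of BERT that matter here:

Bidirectional context: BERT takes both the left and the right context of each token into account, which helps it capture relationships between words.
Pretraining and fine-tuning: BERT is pretrained on large corpora to learn rich language representations and is then fine-tuned on a specific task to adapt to its requirements.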

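Corpus Overview

In question answering, the model receives a question and a passage of text (the context) and must predict where the answer lies within that context (a text span). For example:

Context: Su Shi was a famous writer and statesman of the Northern Song dynasty, a native of Meishan, Meizhou.
Question: Where was Su Shi from?
Label: Meishan, Meizhou

How should we build a model for this kind of task?

Two constraints need to be clear first: (1) the answer is always contained in the given context; (2) the answer is always one contiguous run of characters. For the context above, given the question "In which era did Su Shi live, and where was he from?", the model will not return "Northern Song" and "Meishan, Meizhou" as two separate answers. At best it returns the single span "Northern Song dynasty, a native of Meishan, Meizhou".

With these two constraints, the task reduces to predicting the start position and the end position of the answer within the context. The problem therefore becomes how to build a classifier on top of BERT that classifies each token of BERT's last-layer output as the start position, the end position, or neither.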
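
To make the start/end formulation concrete, here is a minimal sketch of how a single answer span could be chosen from per-token start and end scores. The function name `best_span` and the toy scores are illustrative assumptions, not the original implementation; real scores would come from the classifier on top of BERT's last layer.

```python
def best_span(start_scores, end_scores, max_answer_length=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end and a span of at most max_answer_length tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_length)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy scores, one per token (made up for illustration).
tokens = ["su", "shi", "was", "from", "meishan", ",", "meizhou", "."]
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.1, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.6, 0.1, 2.2, 0.3]
start, end = best_span(start_scores, end_scores)
print(tokens[start:end + 1])
```

Requiring start <= end is what enforces the "one contiguous span" constraint: two separate answers can never be produced by a single (start, end) pair.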

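Downloading the Data

Because we could not find a comparable high-quality Chinese dataset, we use the dataset mentioned in the BERT paper, SQuAD 1.1 (the Stanford Question Answering Dataset): given a question and a context, the model must find the start and end positions of the answer in the context.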
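
For orientation, SQuAD 1.1 files are JSON with the nesting data → paragraphs → qas → answers, and each answer stores its text together with a character offset (`answer_start`) into the context. The sketch below walks a tiny hand-written sample in that layout; the sample content and the helper name `iter_squad` are made up for illustration.

```python
context = ("Su Shi was a writer of the Northern Song dynasty, "
           "a native of Meishan, Meizhou.")
answer = "Meishan, Meizhou"

# A hand-written sample mirroring the SQuAD 1.1 nesting (not real dataset content).
sample = {"data": [{
    "title": "Su Shi",
    "paragraphs": [{
        "context": context,
        "qas": [{
            "id": "q1",
            "question": "Where was Su Shi from?",
            "answers": [{"text": answer, "answer_start": context.index(answer)}],
        }],
    }],
}]}


def iter_squad(raw):
    """Yield (question id, question, answer text recovered from the context)."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                first = qa["answers"][0]
                start = first["answer_start"]      # character offset into ctx
                end = start + len(first["text"])   # exclusive end offset
                yield qa["id"], qa["question"], ctx[start:end]


for qid, question, span in iter_squad(sample):
    print(qid, question, span)
```

Slicing the context with the stored offset recovers the answer text, which is a quick sanity check that an annotation is consistent.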

Building the Dataset

For data preprocessing, we can inherit from LoadSingleSentenceClassificationDataset, the class used earlier for text classification, and override a few of its methods.
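
The subclass in the listing that follows splits long contexts with a sliding window controlled by `doc_stride`. As a simplified, standalone sketch of that idea, operating on plain token lists rather than the class's tensors (the function name `sliding_windows` is hypothetical):

```python
def sliding_windows(context_tokens, window_size, doc_stride):
    """Yield (start, end) index pairs, end exclusive, covering context_tokens.

    Consecutive windows start doc_stride tokens apart, so they overlap
    whenever doc_stride < window_size.
    """
    if window_size <= 0 or doc_stride <= 0:
        raise ValueError("window_size and doc_stride must be positive")
    start = 0
    while True:
        end = start + window_size
        yield start, min(end, len(context_tokens))
        if end >= len(context_tokens):
            break
        start += doc_stride


question = ["[CLS]", "where", "was", "su", "shi", "from", "?", "[SEP]"]
context = [f"tok{i}" for i in range(10)]
for start, end in sliding_windows(context, window_size=4, doc_stride=3):
    # Each window is packed as: question tokens + context window + [SEP].
    tokens = question + context[start:end] + ["[SEP]"]
    segment_ids = [0] * len(question) + [1] * (len(tokens) - len(question))
    print(start, end, tokens[len(question):-1])
```

Overlapping windows mean an answer that would be cut at one window boundary is likely to appear whole in a neighbouring window.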

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import pandas as pd
import json
import logging
import os
from sklearn.model_selection import train_test_split
import collections
import six


class Vocab:
    """
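    Build a vocabulary from a local vocab file (one token per line).
    vocab = Vocab(vocab_path)
    print(vocab.itos)  # list of every token in the vocabulary
    print(vocab.itos[2])  # look up a token by its index
    print(vocab.stoi)  # dict mapping every token to its index
    print(vocab.stoi['我'])  # look up the index of a token
    print(len(vocab))  # size of the vocabulary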
    """
    UNK = '[UNK]'

    def __init__(self, vocab_path):
        self.stoi = {}
        self.itos = []
        with open(vocab_path, 'r', encoding='utf-8') as f:
            for i, word in enumerate(f):
                w = word.strip('\n')
                self.stoi[w] = i
                self.itos.append(w)

    def __getitem__(self, token):
        return self.stoi.get(token, self.stoi.get(Vocab.UNK))

    def __len__(self):
        return len(self.itos)


def build_vocab(vocab_path):
    """
    vocab = build_vocab(vocab_path)
    print(vocab.itos)  # list of every token in the vocabulary
    print(vocab.itos[2])  # look up a token by its index
    print(vocab.stoi)  # dict mapping every token to its index
    print(vocab.stoi['我'])  # look up the index of a token
    """
    return Vocab(vocab_path)


def pad_sequence(sequences, batch_first=False, max_len=None, padding_value=0):
    """
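    Pad a list of variable-length 1-D tensors with ``padding_value``.
    a = torch.ones(25)
    b = torch.ones(22)
    c = torch.ones(15)
    pad_sequence([a, b, c], max_len=None).size()
    torch.Size([25, 3])
        sequences: list of 1-D tensors
        batch_first: whether to put batch_size in the first dimension
        padding_value: value used to fill the padded positions
        max_len:
                when max_len = 50, every sample is padded to that fixed length and longer ones are truncated;
                when max_len = None, samples are padded to the longest sample in the current batch.
    Returns: tensor of shape [max_len, batch_size], or [batch_size, max_len] if batch_first is True
    """
    if max_len is None:
        max_len = max([s.size(0) for s in sequences])
    out_tensors = []
    for tensor in sequences:
        if tensor.size(0) < max_len:
            # new_full keeps the dtype and device of the tensor being padded
            padding = tensor.new_full((max_len - tensor.size(0),), padding_value)
            tensor = torch.cat([tensor, padding], dim=0)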
        else:
            tensor = tensor[:max_len]
        out_tensors.append(tensor)
    out_tensors = torch.stack(out_tensors, dim=1)
    if batch_first:
        return out_tensors.transpose(0, 1)
    return out_tensors


def cache(func):
    """
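    Decorator that caches the result of data_process() for the SQuAD dataset so that
    later runs can load it directly. The wrapped method must be called with
    ``filepath`` (and optionally ``postfix``) as keyword arguments.
    :param func:
    :return:
    """

    def wrapper(*args, **kwargs):
        filepath = kwargs['filepath']
        postfix = kwargs.get('postfix', 'cache')
        # splitext strips only the extension, so a path such as './data/train.json' keeps its directory
        data_path = os.path.splitext(filepath)[0] + '_' + postfix + '.pt'
        if not os.path.exists(data_path):
            logging.info(f"Cache file {data_path} not found; processing the data and caching it.")
            data = func(*args, **kwargs)
            with open(data_path, 'wb') as f:
                torch.save(data, f)
        else:
            logging.info(f"Cache file {data_path} found; loading it directly.")
            with open(data_path, 'rb') as f:
                data = torch.load(f)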
        return data

    return wrapper


class LoadSingleSentenceClassificationDataset:
    def __init__(self,
                 vocab_path='./vocab.txt',
                 tokenizer=None,
                 batch_size=32,
                 max_sen_len=None,
                 split_sep='\n',
                 max_position_embeddings=512,
                 pad_index=0,
                 is_sample_shuffle=True
                 ):

        """

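        :param vocab_path: path to the local vocab.txt
        :param tokenizer: callable that splits a string into a list of tokens
        :param batch_size:
        :param max_sen_len: controls padding when each batch is built;
                            max_sen_len = None pads to the longest sample in each batch
                            max_sen_len = 'same' pads to the longest sample in the whole dataset
                            max_sen_len = 50 pads to that fixed length and truncates longer samples
        :param split_sep: separator between text and label on each line; the signature default is
                            a newline, so pass a tab explicitly for tab-separated files
        :param max_position_embeddings: maximum sample length; anything beyond it is truncated
        :param is_sample_shuffle: whether to shuffle the training samples (training set only).
                The validation and test DataLoaders built later use a fixed order (no shuffling);
                keep it that way when modifying the code. With shuffle=True the order changes on
                every pass over data_iter, so predicted labels would no longer line up with the
                original sample order.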

        """
        self.tokenizer = tokenizer
        self.vocab = build_vocab(vocab_path)
        self.PAD_IDX = pad_index
        self.SEP_IDX = self.vocab['[SEP]']
        self.CLS_IDX = self.vocab['[CLS]']
        # self.UNK_IDX = '[UNK]'

        self.batch_size = batch_size
        self.split_sep = split_sep
        self.max_position_embeddings = max_position_embeddings
        if isinstance(max_sen_len, int) and max_sen_len > max_position_embeddings:
            max_sen_len = max_position_embeddings
        self.max_sen_len = max_sen_len
        self.is_sample_shuffle = is_sample_shuffle

    @cache
    def data_process(self, filepath, postfix='cache'):
        """
        Convert every token of every sentence into its vocabulary index, and also return
        the length of the longest sample.
        :param filepath: path to the dataset
        :return:
        """
        with open(filepath, encoding="utf8") as f:
            raw_iter = f.readlines()
        data = []
        max_len = 0
        for raw in tqdm(raw_iter, ncols=80):
            line = raw.rstrip("\n").split(self.split_sep)
            s, l = line[0], line[1]
            tmp = [self.CLS_IDX] + [self.vocab[token] for token in self.tokenizer(s)]
            if len(tmp) > self.max_position_embeddings - 1:
                tmp = tmp[:self.max_position_embeddings - 1]  # pretrained BERT accepts at most 512 tokens; leave room for [SEP]
            tmp += [self.SEP_IDX]
            tensor_ = torch.tensor(tmp, dtype=torch.long)
            l = torch.tensor(int(l), dtype=torch.long)
            max_len = max(max_len, tensor_.size(0))
            data.append((tensor_, l))
        return data, max_len

    def load_train_val_test_data(self, train_file_path=None,
                                 val_file_path=None,
                                 test_file_path=None,
                                 only_test=False):
        postfix = str(self.max_sen_len)
        test_data, _ = self.data_process(filepath=test_file_path, postfix=postfix)
        test_iter = DataLoader(test_data, batch_size=self.batch_size,
                               shuffle=False, collate_fn=self.generate_batch)
        if only_test:
            return test_iter
        train_data, max_sen_len = self.data_process(filepath=train_file_path,
                                                    postfix=postfix)  # all processed samples
        if self.max_sen_len == 'same':
            self.max_sen_len = max_sen_len
        val_data, _ = self.data_process(filepath=val_file_path,
                                        postfix=postfix)
        train_iter = DataLoader(train_data, batch_size=self.batch_size,  # build the DataLoader
                                shuffle=self.is_sample_shuffle, collate_fn=self.generate_batch)
        val_iter = DataLoader(val_data, batch_size=self.batch_size,
                              shuffle=False, collate_fn=self.generate_batch)
        return train_iter, test_iter, val_iter

    def generate_batch(self, data_batch):
        batch_sentence, batch_label = [], []
        for (sen, label) in data_batch:  # process each sample in the batch
            batch_sentence.append(sen)
            batch_label.append(label)
        batch_sentence = pad_sequence(batch_sentence,  # [max_len, batch_size]
                                      padding_value=self.PAD_IDX,
                                      batch_first=False,
                                      max_len=self.max_sen_len)
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        return batch_sentence, batch_label


class LoadSQuADQuestionAnsweringDataset(LoadSingleSentenceClassificationDataset):
    """
    Args:
        doc_stride: When splitting up a long document into chunks, how much stride to
                    take between chunks.
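                    When the context is too long, a sliding window moves over it and
                    doc_stride is the distance the window moves each step.
        max_query_length: The maximum number of tokens for the question. Questions longer than
                    this will be truncated to this length.
        n_best_size: number of candidate answers kept when post-processing predictions
        max_answer_length: maximum answer length allowed when filtering candidates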

    """

    def __init__(self, doc_stride=64,
                 max_query_length=64,
                 n_best_size=20,
                 max_answer_length=30,
                 **kwargs):
        super(LoadSQuADQuestionAnsweringDataset, self).__init__(**kwargs)
        self.doc_stride = doc_stride
        self.max_query_length = max_query_length
        self.n_best_size = n_best_size
        self.max_answer_length = max_answer_length

    @staticmethod
    def get_format_text_and_word_offset(text):
        """
        Normalize the raw input text (collapse runs of whitespace) and record, for every
        character, the index of the word it belongs to. With this mapping the character
        offset given in the dataset (answer_start) converts directly to a word index.
        :param text:
        :return:
        e.g.
            text = "Architecturally, the school has a Catholic character. "
            return:['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.'],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3,
             3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
        """

        def is_whitespace(c):
            return c in (" ", "\t", "\r", "\n") or ord(c) == 0x202F

        doc_tokens = []
        char_to_word_offset = []
        prev_is_whitespace = True
        # split the raw context into whitespace-delimited words
        for c in text:  # every character of the paragraph
            if is_whitespace(c):  # any kind of whitespace
                prev_is_whitespace = True
            else:
                if prev_is_whitespace:  # previous character was whitespace: start a new word
                    doc_tokens.append(c)
                else:
                    doc_tokens[-1] += c  # extend the current word
                prev_is_whitespace = False
            char_to_word_offset.append(len(doc_tokens) - 1)
        return doc_tokens, char_to_word_offset

    def preprocessing(self, filepath, is_training=True):
        """
        Preprocess the raw data and compute where each answer starts and ends in the
        original context (measured in whitespace-separated words).
        :param filepath:
        :param is_training:
        :return:
        A list of examples. Each example is a list of six items: [question id, question text,
        answer text, context text, answer start position in the context,
        answer end position in the context], for instance:
        [['5733be284776f41900661182', 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
        'Saint Bernadette Soubirous', 'Architecturally, the school has a Catholic character......',
        90, 92],
         ['5733be284776f4190066117f', ....]]
        """
        with open(filepath, 'r', encoding='utf-8') as f:
            raw_data = json.loads(f.read())
            data = raw_data['data']
        examples = []
        for i in tqdm(range(len(data)), ncols=80, desc="Iterating over articles"):  # each article
            paragraphs = data[i]['paragraphs']  # paragraphs of the i-th article
            for j in range(len(paragraphs)):  # each paragraph of the i-th article
                context = paragraphs[j]['context']  # context of the j-th paragraph
                context_tokens, word_offset = self.get_format_text_and_word_offset(context)
                qas = paragraphs[j]['qas']  # all question-answer pairs for this context
                for k in range(len(qas)):  # each question-answer pair
                    question_text = qas[k]['question']
                    qas_id = qas[k]['id']
                    if is_training:
                        answer_offset = qas[k]['answers'][0]['answer_start']
                        orig_answer_text = qas[k]['answers'][0]['text']
                        answer_length = len(orig_answer_text)
                        start_position = word_offset[answer_offset]
                        end_position = word_offset[answer_offset + answer_length - 1]
                        actual_text = " ".join(
                            context_tokens[start_position:(end_position + 1)])
                        cleaned_answer_text = " ".join(orig_answer_text.strip().split())
                        if actual_text.find(cleaned_answer_text) == -1:
                            logging.warning("Could not find answer: '%s' vs. '%s'",
                                            actual_text, cleaned_answer_text)
                            continue
                    else:
                        start_position = None
                        end_position = None
                        orig_answer_text = None
                    examples.append([qas_id, question_text, orig_answer_text,
                                     " ".join(context_tokens), start_position, end_position])
        return examples

    @staticmethod
    def improve_answer_span(context_tokens,
                            answer_tokens,
                            start_position,
                            end_position):
        """
        This method serves two purposes:
            1. Like _improve_answer_span() in run_squad.py of https://github.com/google-research/bert,
               it finds start and end positions that match the answer more precisely;
            2. It converts the original word-level positions into positions in the token sequence.
        # The SQuAD annotations are character based. We first project them to
        # whitespace-tokenized words. But then after WordPiece tokenization, we can
        # often find a "better match". For example:
        #
        #   Question: What year was John Smith born?
        #   Context: The leader was John Smith (1895-1943).
        #   Answer: 1895
        #
        # The original whitespace-tokenized answer will be "(1895-1943).". However
        # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match
        # the exact answer, 1895.

        context = "The leader was John Smith (1895-1943)."
        answer_text = "1895"
        :param context_tokens: ['the', 'leader', 'was', 'john', 'smith', '(', '1895', '-', '1943', ')', '.']
        :param answer_tokens: ['1895']
        :param start_position: 5
        :param end_position: 5
        :return: (6, 6)
        Another example:
        context = "Virgin mary reputedly appeared to Saint Bernadette Soubirous in 1858"
        answer_text = "Saint Bernadette Soubirous"
        :param context_tokens: ['virgin', 'mary', 'reputed', '##ly', 'appeared', 'to', 'saint', 'bern', '##ade',
                                '##tte', 'so', '##ub', '##iro', '##us', 'in', '1858']
        :param answer_tokens: ['saint', 'bern', '##ade', '##tte', 'so', '##ub', '##iro', '##us']
        :param start_position: 5
        :param end_position: 7
        :return: (6, 13)

        """
        n = len(answer_tokens)
        if n == 0:
            return start_position, end_position
        # scan forward from the word-level start for the first exact match of the answer tokens;
        # the range bound keeps every slice inside context_tokens
        for i in range(start_position, len(context_tokens) - n + 1):
            if context_tokens[i:i + n] == answer_tokens:
                return i, i + n - 1
        return start_position, end_position

    @staticmethod
    def get_token_to_orig_map(input_tokens, origin_context, tokenizer):
        """
           Given input_tokens and the original context, return for every context token in
           input_tokens the index of the original word it comes from.
           :param input_tokens:  ['[CLS]', 'to', 'whom', 'did', 'the', 'virgin', '[SEP]', 'architectural', '##ly',
                                   ',', 'the', 'school', 'has', 'a', 'catholic', 'character', '.', '[SEP]']
           :param origin_context: "Architecturally, the Architecturally, test, Architecturally,
                                    the school has a Catholic character. Welcome moon hotel"
           :param tokenizer:
           :return: {7: 4, 8: 4, 9: 4, 10: 5, 11: 6, 12: 7, 13: 8, 14: 9, 15: 10, 16: 10}
                   meaning input_tokens[7] comes from word 4 of origin_context, "Architecturally,",
                        input_tokens[8] comes from word 4 of origin_context, "Architecturally,",
                        ...
                        input_tokens[10] comes from word 5 of origin_context, "the"
           """
        origin_context_tokens = origin_context.split()
        token_id = []
        str_origin_context = ""
        for i in range(len(origin_context_tokens)):
            tokens = tokenizer(origin_context_tokens[i])
            str_token = "".join(tokens)
            str_origin_context += str_token
            for _ in str_token:
                token_id.append(i)

        key_start = input_tokens.index('[SEP]') + 1
        tokenized_tokens = input_tokens[key_start:-1]
        str_tokenized_tokens = "".join(tokenized_tokens)
        index = str_origin_context.index(str_tokenized_tokens)
        value_start = token_id[index]
        token_to_orig_map = {}
        # boundary case where the window starts in the middle of a word, e.g. Building's gold  <==>  's', 'gold', 'dome'
        token = tokenizer(origin_context_tokens[value_start])
        for i in range(len(token), -1, -1):
            s1 = "".join(token[-i:])
            s2 = "".join(tokenized_tokens[:i])
            if s1 == s2:
                token = token[-i:]
                break

        while True:
            for j in range(len(token)):
                token_to_orig_map[key_start] = value_start
                key_start += 1
                if len(token_to_orig_map) == len(tokenized_tokens):
                    return token_to_orig_map
            value_start += 1
            token = tokenizer(origin_context_tokens[value_start])

    @cache
    def data_process(self, filepath, is_training=False, postfix='cache'):
        """

        :param filepath:
        :param is_training:
        :return: dict with keys 'all_data', 'max_len' and 'examples'. Each entry of 'all_data' is
                  [example_id, feature_id, input_ids, seg, start_position, end_position,
                   answer_text, question id (example[0]), input_tokens, token_to_orig_map]
        """
        logging.info(f"## Using a sliding window, doc_stride = {self.doc_stride}")
        examples = self.preprocessing(filepath, is_training)
        all_data = []
        example_id, feature_id = 0, 1000000000
        # Because of the sliding window, one example may produce several training samples (called features here),
        # so each gets its own id. The ids are only needed when post-processing predictions, not during training.
        # feature_id alone would suffice: each example is one question, so the question id and example_id are equivalent.
        for example in tqdm(examples, ncols=80, desc="Iterating over questions (examples)"):
            question_tokens = self.tokenizer(example[1])
            if len(question_tokens) > self.max_query_length:  # truncate overly long questions
                question_tokens = question_tokens[:self.max_query_length]
            question_ids = [self.vocab[token] for token in question_tokens]
            question_ids = [self.CLS_IDX] + question_ids + [self.SEP_IDX]
            context_tokens = self.tokenizer(example[3])
            context_ids = [self.vocab[token] for token in context_tokens]
            logging.debug("<<<<<<<<  new example  >>>>>>>>>")
            logging.debug(f"## Preprocessing data {__name__} is_training = {is_training}")
            logging.debug(f"## question id: {example[0]}")
            logging.debug(f"## question text: {example[1]}")
            logging.debug(f"## context text: {example[3]}")
            start_position, end_position, answer_text = -1, -1, None
            if is_training:
                start_position, end_position = example[4], example[5]
                answer_text = example[2]
                answer_tokens = self.tokenizer(answer_text)
                start_position, end_position = self.improve_answer_span(context_tokens,
                                                                        answer_tokens,
                                                                        start_position,
                                                                        end_position)
            rest_len = self.max_sen_len - len(question_ids) - 1
            context_ids_len = len(context_ids)
            logging.debug(f"## context length: {context_ids_len}, remaining length rest_len: {rest_len}")
            if context_ids_len > rest_len:  # longer than max_sen_len allows: use a sliding window
                logging.debug("## entering the sliding window ...")
                s_idx, e_idx = 0, rest_len
                while True:
                    # We can have documents that are longer than the maximum sequence length.
                    # To deal with this we do a sliding window approach, where we take chunks
                    # of the up to our max length with a stride of `doc_stride`.
                    tmp_context_ids = context_ids[s_idx:e_idx]
                    tmp_context_tokens = [self.vocab.itos[item] for item in tmp_context_ids]
                    logging.debug(f"## window range: {s_idx, e_idx}, example_id: {example_id}, feature_id: {feature_id}")
                    # logging.debug(f"## window tokens: {tmp_context_tokens}")
                    input_ids = torch.tensor(question_ids + tmp_context_ids + [self.SEP_IDX])
                    input_tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + tmp_context_tokens + ['[SEP]']
                    seg = [0] * len(question_ids) + [1] * (len(input_ids) - len(question_ids))
                    seg = torch.tensor(seg)
                    if is_training:
                        new_start_position, new_end_position = 0, 0
                        # the window covers indices s_idx .. e_idx - 1, so the answer must end before e_idx
                        if start_position >= s_idx and end_position < e_idx:  # in train
                            logging.debug("## window contains the answer -----> ")
                            new_start_position = start_position - s_idx
                            new_end_position = new_start_position + (end_position - start_position)

                            new_start_position += len(question_ids)
                            new_end_position += len(question_ids)
                            logging.debug(f"## original answer: {answer_text} <===> processed answer: "
                                          f"{' '.join(input_tokens[new_start_position:(new_end_position + 1)])}")
                        all_data.append([example_id, feature_id, input_ids, seg, new_start_position,
                                         new_end_position, answer_text, example[0], input_tokens])
                        logging.debug(f"## start pos:{new_start_position}")
                        logging.debug(f"## end pos:{new_end_position}")
                    else:
                        all_data.append([example_id, feature_id, input_ids, seg, start_position,
                                         end_position, answer_text, example[0], input_tokens])
                        logging.debug(f"## start pos:{start_position}")
                        logging.debug(f"## end pos:{end_position}")
                    token_to_orig_map = self.get_token_to_orig_map(input_tokens, example[3], self.tokenizer)
                    all_data[-1].append(token_to_orig_map)
                    logging.debug(f"## example id: {example_id}")
                    logging.debug(f"## feature id: {feature_id}")
                    logging.debug(f"## input_tokens: {input_tokens}")
                    logging.debug(f"## input_ids:{input_ids.tolist()}")
                    logging.debug(f"## segment ids:{seg.tolist()}")
                    logging.debug(f"## orig_map:{token_to_orig_map}")
                    logging.debug("======================\n")
                    feature_id += 1
                    if e_idx >= context_ids_len:
                        break
                    s_idx += self.doc_stride
                    e_idx += self.doc_stride

            else:
                input_ids = torch.tensor(question_ids + context_ids + [self.SEP_IDX])
                input_tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + context_tokens + ['[SEP]']
                seg = [0] * len(question_ids) + [1] * (len(input_ids) - len(question_ids))
                seg = torch.tensor(seg)
                if is_training:
                    start_position += (len(question_ids))
                    end_position += (len(question_ids))
                token_to_orig_map = self.get_token_to_orig_map(input_tokens, example[3], self.tokenizer)
                all_data.append([example_id, feature_id, input_ids, seg, start_position,
                                 end_position, answer_text, example[0], input_tokens, token_to_orig_map])
                logging.debug(f"## input_tokens: {input_tokens}")
                logging.debug(f"## input_ids:{input_ids.tolist()}")
                logging.debug(f"## segment ids:{seg.tolist()}")
                logging.debug(f"## orig_map:{token_to_orig_map}")
                logging.debug("======================\n")
                feature_id += 1
            example_id += 1
        #  all_data[0]: [example_id, feature_id, input_ids, seg, start, end, answer text, question id, input_tokens, ori_map]
        data = {'all_data': all_data, 'max_len': self.max_sen_len, 'examples': examples}
        return data

    def generate_batch(self, data_batch):
        batch_input, batch_seg, batch_label, batch_qid = [], [], [], []
        batch_example_id, batch_feature_id, batch_map = [], [], []
        for item in data_batch:
            # item: [example_id, feature_id, input_ids, seg, start, end, answer text, question id, input_tokens, ori_map]
            batch_example_id.append(item[0])  # example id
            batch_feature_id.append(item[1])  # feature id
            batch_input.append(item[2])  # input_ids
            batch_seg.append(item[3])  # seg
            batch_label.append([item[4], item[5]])  # start, end
            batch_qid.append(item[7])  # question id
            batch_map.append(item[9])  # ori_map

        batch_input = pad_sequence(batch_input,
                                   padding_value=self.PAD_IDX,
                                   batch_first=False,
                                   max_len=self.max_sen_len)  # [max_len, batch_size]
        batch_seg = pad_sequence(batch_seg,
                                 padding_value=self.PAD_IDX,
                                 batch_first=False,
                                 max_len=self.max_sen_len)  # [max_len, batch_size]
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        # shapes: [max_len, batch_size], [max_len, batch_size], [batch_size, 2]; the remaining items are lists of length batch_size
        return batch_input, batch_seg, batch_label, batch_qid, batch_example_id, batch_feature_id, batch_map

    def load_train_val_test_data(self, train_file_path=None,
                                 val_file_path=None,
                                 test_file_path=None,
                                 only_test=True):
        doc_stride = str(self.doc_stride)
        max_sen_len = str(self.max_sen_len)
        max_query_length = str(self.max_query_length)
        postfix = doc_stride + '_' + max_sen_len + '_' + max_query_length
        data = self.data_process(filepath=test_file_path,
                                 is_training=False,
                                 postfix=postfix)
        test_data, examples = data['all_data'], data['examples']
        test_iter = DataLoader(test_data, batch_size=self.batch_size,
                               shuffle=False,
                               collate_fn=self.generate_batch)
        if only_test:
            logging.info(f"## Loaded the test set with {len(test_iter.dataset)} samples")
            return test_iter, examples

        data = self.data_process(filepath=train_file_path,
                                 is_training=True,
                                 postfix=postfix)  # all processed samples
        train_data, max_sen_len = data['all_data'], data['max_len']
        # note: the validation set is a 30% sample of the training data, which is still used in full for training
        _, val_data = train_test_split(train_data, test_size=0.3, random_state=2021)
        if self.max_sen_len == 'same':
            self.max_sen_len = max_sen_len
        train_iter = DataLoader(train_data, batch_size=self.batch_size,  # build the DataLoader
                                shuffle=self.is_sample_shuffle, collate_fn=self.generate_batch)
        val_iter = DataLoader(val_data, batch_size=self.batch_size,  # build the DataLoader
                              shuffle=False, collate_fn=self.generate_batch)
        logging.info(f"## Loaded {len(train_iter.dataset)} training samples, {len(val_iter.dataset)} validation samples "
                     f"and {len(test_iter.dataset)} test samples.")
        return train_iter, test_iter, val_iter

    @staticmethod
    def get_best_indexes(logits, n_best_size):
        """Get the n-best logits from a list."""
        # logits = [0.37203778 0.48594432 0.81051651 0.07998148 0.93529721 0.0476721
        #  0.15275263 0.98202781 0.07813079 0.85410559]
        # n_best_size = 4
        # return [7, 4, 9, 2]
        index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)
        # keep the indexes of the n_best_size highest-scoring entries
        return [index for index, _ in index_and_score[:n_best_size]]

    def get_final_text(self, pred_text, orig_text):
        """Project the tokenized prediction back to the original text."""

        # ref: https://github.com/google-research/bert/blob/master/run_squad.py
        # When we created the data, we kept track of the alignment between original
        # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
        # now `orig_text` contains the span of our original text corresponding to the
        # span that we predicted.
        #
        # However, `orig_text` may contain extra characters that we don't want in
        # our prediction.
        #
        # For example, let's say:
        #   pred_text = steve smith
        #   orig_text = Steve Smith's
        #
        # We don't want to return `orig_text` because it contains the extra "'s".
        #
        # We don't want to return `pred_text` because it's already been normalized
        # (the SQuAD eval script also does punctuation stripping/lower casing but
        # our tokenizer does additional normalization like stripping accent
        # characters).
        #
        # What we really want to return is "Steve Smith".
        #
        # Therefore, we have to apply a semi-complicated alignment heuristic between
        # `pred_text` and `orig_text` to get a character-to-character alignment. This
        # can fail in certain cases in which case we just return `orig_text`.

        def _strip_spaces(text):
            ns_chars = []
            ns_to_s_map = collections.OrderedDict()
            for (i, c) in enumerate(text):
                if c == " ":
                    continue
                ns_to_s_map[len(ns_chars)] = i
                ns_chars.append(c)
            ns_text = "".join(ns_chars)
            return (ns_text, ns_to_s_map)

        # We first tokenize `orig_text`, strip whitespace from the result
        # and `pred_text`, and check if they are the same length. If they are
        # NOT the same length, the heuristic has failed. If they are the same
        # length, we assume the characters are one-to-one aligned.

        tok_text = " ".join(self.tokenizer(orig_text))

        start_position = tok_text.find(pred_text)
        if start_position == -1:
            return orig_text
        end_position = start_position + len(pred_text) - 1

        (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
        (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)

        if len(orig_ns_text) != len(tok_ns_text):
            return orig_text

        # We then project the characters in `pred_text` back to `orig_text` using
        # the character-to-character alignment.
        tok_s_to_ns_map = {}
        for (i, tok_index) in tok_ns_to_s_map.items():
            tok_s_to_ns_map[tok_index] = i

        orig_start_position = None
        if start_position in tok_s_to_ns_map:
            ns_start_position = tok_s_to_ns_map[start_position]
            if ns_start_position in orig_ns_to_s_map:
                orig_start_position = orig_ns_to_s_map[ns_start_position]

        if orig_start_position is None:
            return orig_text

        orig_end_position = None
        if end_position in tok_s_to_ns_map:
            ns_end_position = tok_s_to_ns_map[end_position]
            if ns_end_position in orig_ns_to_s_map:
                orig_end_position = orig_ns_to_s_map[ns_end_position]

        if orig_end_position is None:
            return orig_text

        output_text = orig_text[orig_start_position:(orig_end_position + 1)]
        return output_text

    def write_prediction(self, test_iter, all_examples, logits_data, output_dir):
        """
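        Write the predicted answers to local files based on the predicted logits.
        :param test_iter: iterator over the test set batches
        :param all_examples: original examples; example[0] is the qid and example[3] the context
        :param logits_data: dict mapping each qid to a list of [feature_id, start_logits, end_logits]
        :param output_dir: directory in which the JSON result files are written
        :return:
        """
        qid_to_example_context = {}  # maps each qid to its whitespace-tokenized context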
        for example in all_examples:
            context = example[3]
            context_list = context.split()
            qid_to_example_context[example[0]] = context_list
        _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
            "PrelimPrediction",
            ["text", "start_index", "end_index", "start_logit", "end_logit"])
        prelim_predictions = collections.defaultdict(list)
        for b_input, _, _, b_qid, _, b_feature_id, b_map in tqdm(test_iter, ncols=80, desc="Iterating over candidate answers"):
            # Fetch the predicted logits of every feature built from this question (because of
            # the sliding window, one original context can yield several sub-samples)
            all_logits = logits_data[b_qid[0]]
            for logits in all_logits:
                if logits[0] != b_feature_id[0]:
                    continue  # skip logits that do not belong to the current sub-sample
                # Go through the predictions stored in this sub-sample's logits
                start_indexes = self.get_best_indexes(logits[1], self.n_best_size)
                # Indexes with the highest start scores, e.g. [4, 6, 3, 1]
                end_indexes = self.get_best_indexes(logits[2], self.n_best_size)
                # Indexes with the highest end scores, e.g. [5, 8, 10, 9]
                for start_index in start_indexes:
                    for end_index in end_indexes:  # try every start/end combination
                        if start_index >= b_input.size(0):
                            continue  # start index beyond the token sequence, skip
                        if end_index >= b_input.size(0):
                            continue  # end index beyond the token sequence, skip
                        if start_index not in b_map[0]:
                            continue  # the index must map to a context token, since the answer only appears after [SEP]
                        if end_index not in b_map[0]:
                            continue
                        if end_index < start_index:
                            continue
                        length = end_index - start_index + 1
                        if length > self.max_answer_length:
                            continue
                        token_ids = b_input.transpose(0, 1)[0]
                        strs = [self.vocab.itos[s] for s in token_ids]
                        tok_text = " ".join(strs[start_index:(end_index + 1)])
                        tok_text = tok_text.replace(" ##", "").replace("##", "")
                        tok_text = tok_text.strip()
                        tok_text = " ".join(tok_text.split())

                        orig_doc_start = b_map[0][start_index]
                        orig_doc_end = b_map[0][end_index]
                        orig_tokens = qid_to_example_context[b_qid[0]][orig_doc_start:(orig_doc_end + 1)]
                        orig_text = " ".join(orig_tokens)
                        final_text = self.get_final_text(tok_text, orig_text)

                        prelim_predictions[b_qid[0]].append(_PrelimPrediction(
                            text=final_text,
                            start_index=int(start_index),
                            end_index=int(end_index),
                            start_logit=float(logits[1][start_index]),
                            end_logit=float(logits[2][end_index])))
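                        # All predictions for the same qid are collected together: because of the
                        # sliding window, the context of one qid may yield several samples, each with
                        # its own predicted logits, and the n_best logits are taken for each of them,
                        # so one question ends up with multiple candidate answers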

        for k, v in prelim_predictions.items():
            # Sort all candidate answers of each qid by start_logit + end_logit
            prelim_predictions[k] = sorted(prelim_predictions[k],
                                           key=lambda x: (x.start_logit + x.end_logit),
                                           reverse=True)
        best_results, all_n_best_results = {}, {}
        for k, v in prelim_predictions.items():
            best_results[k] = v[0].text  # take the top-ranked answer
            all_n_best_results[k] = v  # keep every candidate answer
        with open(os.path.join(output_dir, "best_result.json"), 'w') as f:
            f.write(json.dumps(best_results, indent=4) + '\n')
        with open(os.path.join(output_dir, "best_n_result.json"), 'w') as f:
            f.write(json.dumps(all_n_best_results, indent=4) + '\n')
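
The candidate search in write_prediction can be tried out in isolation. The following is a minimal sketch, not the article's implementation: it uses plain Python lists instead of tensors, and the names Candidate, best_indexes, rank_spans and valid_positions are hypothetical stand-ins (valid_positions plays the role of the keys of b_map[0]). It shows how the top start and end positions are paired, filtered and ranked by the sum of their logits.

```python
from collections import namedtuple

# Hypothetical container for this sketch; not the article's _PrelimPrediction.
Candidate = namedtuple("Candidate", ["start_index", "end_index", "score"])


def best_indexes(scores, n_best_size):
    """Return the indexes of the n_best_size largest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:n_best_size]]


def rank_spans(start_logits, end_logits, valid_positions,
               n_best_size=4, max_answer_length=30):
    """Pair the top start and end positions into valid spans, best first."""
    candidates = []
    for start in best_indexes(start_logits, n_best_size):
        for end in best_indexes(end_logits, n_best_size):
            if start not in valid_positions or end not in valid_positions:
                continue  # position is not a context token
            if end < start:
                continue  # an answer cannot end before it starts
            if end - start + 1 > max_answer_length:
                continue  # span is longer than allowed
            score = start_logits[start] + end_logits[end]
            candidates.append(Candidate(start, end, score))
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Toy input: positions 0-2 stand for "[CLS] question [SEP]", 3-6 for the context.
start_logits = [0.1, 0.2, 0.0, 2.5, 0.3, 1.0, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.4, 3.0, 0.2, 1.5]
spans = rank_spans(start_logits, end_logits,
                   valid_positions={3, 4, 5, 6}, n_best_size=3)
print(spans[0])
```

Taking the n best positions on each side, rather than only the argmax, matters because the single best start and the single best end do not always form a valid span; the filters then discard the combinations that cannot be answers.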

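Model

We only need to take the output of the last layer of the original BERT model and add one classification layer on top of it: a linear layer that maps each token's hidden state to two logits, one for the start position and one for the end position. This part of the code is therefore relatively easy to follow.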

from Bert import BertModel
import torch.nn as nn


class BertForQuestionAnswering(nn.Module):
    """
    Models question-answering datasets such as SQuAD
    """

    def __init__(self, config, bert_pretrained_model_dir=None):
        super(BertForQuestionAnswering, self).__init__()
        if bert_pretrained_model_dir is not None:
            self.bert = BertModel.from_pretrained(config, bert_pretrained_model_dir)
        else:
            self.bert = BertModel(config)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)

    def forward(self, input_ids,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                start_positions=None,
                end_positions=None):
        """
        :param input_ids: [src_len,batch_size]
        :param attention_mask: [batch_size,src_len]
        :param token_type_ids: [src_len,batch_size]
        :param position_ids:
        :param start_positions: [batch_size]
        :param end_positions:  [batch_size]
        :return:
        """
        _, all_encoder_outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids)
        sequence_output = all_encoder_outputs[-1]  # output of the last BERT layer
        # sequence_output: [src_len, batch_size, hidden_size]
        logits = self.qa_outputs(sequence_output)  # [src_len, batch_size,2]
        start_logits, end_logits = logits.split(1, dim=-1)
        # [src_len,batch_size,1]  [src_len,batch_size,1]
        start_logits = start_logits.squeeze(-1).transpose(0, 1)  # [batch_size,src_len]
        end_logits = end_logits.squeeze(-1).transpose(0, 1)  # [batch_size,src_len]
        if start_positions is not None and end_positions is not None:
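            # In some cases the start/end position can exceed the input length
            # (for example, the input sequence may be longer than 512 tokens and the
            # correct start or end position lies beyond position 512), which needs special handling
            ignored_index = start_logits.size(1)  # length of the input sequence
            start_positions.clamp_(0, ignored_index)
            # Any start position larger than the input length is clamped
            # to the input length itself
            end_positions.clamp_(0, ignored_index)

            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
            # Passing ignore_index excludes the (start/end) positions beyond the input
            # length from the loss: they are not wrong predictions (the model simply
            # could not predict them), and counting them might hurt the model's
            # semantic understanding in the normal case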
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            return (start_loss + end_loss) / 2, start_logits, end_logits
        else:
            return start_logits, end_logits  # [batch_size,src_len]
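
The interplay between the clamp and ignore_index in the forward method can be checked with plain numbers. The following is a minimal pure-Python sketch, assuming lists instead of tensors; cross_entropy_ignoring and clamp are hypothetical helpers written for this illustration, not PyTorch functions. An answer position beyond the window is clamped to src_len, which is exactly the value that the loss then skips.

```python
import math


def cross_entropy_ignoring(logits_batch, targets, ignore_index):
    """Mean cross-entropy over the examples whose target is not ignore_index."""
    losses = []
    for logits, target in zip(logits_batch, targets):
        if target == ignore_index:
            continue  # out-of-range answer: contributes nothing to the loss
        # Numerically stable log-sum-exp over the sequence positions.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[target])
    # Returning 0.0 when every target is ignored is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0


def clamp(position, low, high):
    """Pure-Python stand-in for the clamp_ call in the listing above."""
    return max(low, min(position, high))


src_len = 4
start_logits = [[2.0, 0.5, 0.1, 0.1],   # example 0
                [0.1, 0.2, 3.0, 0.3]]   # example 1
raw_start_positions = [0, 9]             # 9 lies beyond the 4-token window
start_positions = [clamp(p, 0, src_len) for p in raw_start_positions]

loss = cross_entropy_ignoring(start_logits, start_positions, ignore_index=src_len)
print(start_positions, round(loss, 4))
```

Valid targets run from 0 to src_len - 1, so src_len itself can never be a real answer position, which is what makes it a safe sentinel for "answer not in this window".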

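A configuration class is used to manage the model's hyperparameters and other variables. The listing below shows `BertConfig`, which holds the hyperparameters of the BERT model: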

import copy
import json
import logging


class BertConfig(object):
    """Configuration for `BertModel`."""

    def __init__(self,
                 vocab_size=21128,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 pad_token_id=0,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2,
                 initializer_range=0.02):
        """Constructs BertConfig.
        Args:
          vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
          hidden_size: Size of the encoder layers and the pooler layer.
          num_hidden_layers: Number of hidden layers in the Transformer encoder.
          num_attention_heads: Number of attention heads for each attention layer in
            the Transformer encoder.
          intermediate_size: The size of the "intermediate" (i.e., feed-forward)
            layer in the Transformer encoder.
          hidden_act: The non-linear activation function (function or string) in the
            encoder and pooler.
          hidden_dropout_prob: The dropout probability for all fully connected
            layers in the embeddings, encoder, and pooler.
          attention_probs_dropout_prob: The dropout ratio for the attention
            probabilities.
          max_position_embeddings: The maximum sequence length that this model might
            ever be used with. Typically set this to something large just in case
            (e.g., 512 or 1024 or 2048).
          type_vocab_size: The vocabulary size of the `token_type_ids` passed into
            `BertModel`.
          initializer_range: The stdev of the truncated_normal_initializer for
            initializing all weight matrices.
        """
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.pad_token_id = pad_token_id
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range

    @classmethod
    def from_dict(cls, json_object):
        """Constructs a `BertConfig` from a Python dictionary of parameters."""
        config = cls(vocab_size=None)
        for (key, value) in json_object.items():
            config.__dict__[key] = value
        return config

    @classmethod
    def from_json_file(cls, json_file):
        """Constructs a `BertConfig` from a JSON file of parameters."""
        with open(json_file, 'r') as reader:
            text = reader.read()
        logging.info(f"Successfully loaded BERT config file {json_file}")
        return cls.from_dict(json.loads(text))

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
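
The loading pattern used by BertConfig can be seen end to end with a smaller class. The following is a minimal sketch: TinyConfig and the keys in the sample file are hypothetical and exist only for this illustration. It shows that keys present in the JSON file override the constructor defaults, keys absent from the file keep their defaults, and unknown keys are attached as new attributes without any validation.

```python
import copy
import json
import os
import tempfile


class TinyConfig:
    """Hypothetical stand-in for BertConfig with only two hyperparameters."""

    def __init__(self, hidden_size=768, num_hidden_layers=12):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers

    @classmethod
    def from_dict(cls, json_object):
        config = cls()
        for key, value in json_object.items():
            config.__dict__[key] = value  # keys in the file override the defaults
        return config

    @classmethod
    def from_json_file(cls, json_file):
        with open(json_file, "r") as reader:
            return cls.from_dict(json.loads(reader.read()))

    def to_dict(self):
        return copy.deepcopy(self.__dict__)

    def to_json_string(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


# Write a partial config file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w") as f:
        json.dump({"hidden_size": 256, "extra_option": True}, f)
    config = TinyConfig.from_json_file(path)

print(config.to_json_string())
```

Because from_dict writes straight into __dict__, a misspelled key in the file does not raise an error; it simply creates an unused attribute while the intended hyperparameter keeps its default.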

Results

References

SQuAD question answering with a pretrained BERT model: https://www.ylkz.life/deeplearning/p10265968/
