The OpenBookQA dataset, released by the Allen Institute for AI, is a question-answering benchmark whose main purpose is to test and evaluate an AI system's question-answering ability through multiple-choice exams. A more detailed introduction follows.
Background
Many earlier reading-comprehension datasets are extractive: the answer only needs to be lifted from the given context, with no deeper reasoning required. OpenBookQA instead requires models to draw on basic knowledge and perform more complex reasoning to answer its questions.
Dataset composition
OpenBookQA contains 5,957 four-way multiple-choice science questions (4,957 train, 500 dev, 500 test), to be answered against a small "book" of 1,326 core science facts; the questions were written by crowd workers to probe those facts.
Model performance
Answering OpenBookQA questions requires not only the science facts in the given knowledge base but also broad additional common knowledge. The questions can be answered correctly neither by retrieval algorithms nor by word co-occurrence algorithms. Strong neural baselines reach only about 50% accuracy on OpenBookQA, a clear gap to human accuracy of 92%.
Additional data
The dataset also provides 5,167 crowd-sourced common-knowledge facts, as well as expanded train/dev/test sets in which each question is annotated with the core science fact it probes, human accuracy, a clarity score, and other information.
Significance
OpenBookQA pushed machine reading comprehension from extraction toward reasoning, evaluating models' deeper understanding and inference over open-domain knowledge.
Competition page: Kaggle - LLM Science Exam
GPT3.5 has 175 billion parameters. If a question-answering model can easily ace a quiz written by a model more than 10x its size, that would be a genuinely interesting result; on the other hand, if a larger model can effectively stump a smaller one, that has compelling implications for the ability of LLMs to benchmark and test themselves. Submissions are evaluated on Mean Average Precision @ 3 (MAP@3):
$\mathrm{MAP@3} = \frac{1}{U}\sum_{u=1}^{U}\sum_{k=1}^{\min(n,3)} P(k)\times \mathrm{rel}(k)$
where $U$ is the number of questions in the test set, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per question, and $\mathrm{rel}(k)$ is an indicator function equal to 1 if the item at rank $k$ is the relevant (correct) label, and 0 otherwise.
In addition, once a question's correct label has been scored, any further predictions of that label are skipped, to prevent padding the score. For example, given a test set where all 3 questions have A as the correct answer, a model producing any of the following predictions would receive a mean average precision of 1.0:
[A, B, C, D, E] # predictions for question 1
[A, A, A, A, A] # predictions for question 2
[A, B, A, C, A] # predictions for question 3
In other words, once the correct answer (A) has been found, the predictions after it no longer affect the average-precision score.
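As a sanity check, here is a minimal Python sketch of MAP@3 with this skip rule; the names are illustrative, not the competition's official scorer:
def map3_sketch(preds, answers):
    total = 0.0
    for pred, answer in zip(preds, answers):
        seen, rank = set(), 0
        for label in pred:
            if label in seen:        # repeats of an already-seen label are skipped
                continue
            seen.add(label)
            rank += 1
            if rank > 3:
                break
            if label == answer:
                total += 1.0 / rank  # precision at the rank of the first correct hit
                break
    return total / len(preds)
# all three prediction lists from the example above score 1.0
print(map3_sketch([list("ABCDE"), list("AAAAA"), list("ABACA")], ["A", "A", "A"]))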
Submissions must be made from a notebook, with CPU and GPU runtime each under 9 hours. Internet access is disabled, but publicly available external data is allowed, including pre-trained models. The submission file must be named submission.csv.
The task is to answer a test set of 4,000 multiple-choice questions generated by the gpt3.5 model. The test set is hidden: the real test data is only applied for scoring after the notebook is submitted.
The training set format is as follows:
# Let's import the public training set and take a look
import pandas as pd
train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_df.head()
For each id in the test set, you may predict up to 3 labels. The submission.csv file should contain a header and have the following format:
id,prediction
0, A B C
1, B C A
2, C A B
etc.
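A minimal sketch of producing such a file from per-question option scores; the scores here are random placeholders:
import numpy as np
import pandas as pd
scores = np.random.rand(4, 5)                        # one row of 5 option scores per question
letters = np.array(list("ABCDE"))
top3 = letters[np.argsort(-scores, axis=1)[:, :3]]   # three highest-scoring options, best first
pd.DataFrame({
    "id": range(len(top3)),
    "prediction": [" ".join(row) for row in top3],
}).to_csv("submission.csv", index=False)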
《Starter Notebook: Ranked Predictions with BERT》: a BERT baseline trained with bert-base-cased on the 200 training samples provided by the competition; Public Score=0.545.
《[EDA, Data gathering] LLM-SE ~ Wiki STEM | 1k DS》 (building training data): the 200 samples provided by the competition are too few, so the author LEONID KULYK first analyzed the competition dataset and then, likewise using gpt3.5, produced 1,000 Wikipedia samples, uploaded as Wikipedia STEM 1k.
《LLM-SE ~ deberta-v3-large -i | 1k Wiki》: LEONID KULYK merged his 1,000 Wikipedia samples with the competition training set and trained on the combination; the model is deberta-v3-large. The notebook contains the final model weights and can be used directly for inference; LB=0.709.
《New dataset + DEBERTA v3 large training!》: 0.723→0.759
Radek, building on method 3, trained DEBERTA v3 large with 500 additional self-generated examples; Public Score=0.723.
Radek later generated another 6,000 examples and merged them with the earlier 500 into a 6.5K dataset, on which he trained three times to obtain three model checkpoints, uploaded in Science Exam Trained Model Weights. He then ran inference in the following two ways:
《Inference using 3 trained Deberta v3 models》: the three models predict separately and their probabilities are averaged; Public Score=0.737.
An introduction to Voting Ensemble: in this notebook the author explains voting ensembles in detail and how to use them; Public Score=0.759.
The author finally uploaded 15k high-quality train examples.
《Open Book LLM Science Exam》: jjinho first proposed the Open Book approach, demonstrating how to use faiss to run a similarity search over the training set and find the context (Wikipedia passages) most similar to each question, in order to strengthen question answering.
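A minimal sketch of this idea, assuming sentence-transformers and faiss are installed; the encoder name and the toy passages are illustrative:
import faiss
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # any sentence encoder works here
wiki_texts = [
    "Dark matter is inferred from the flat rotation curves of galaxies.",
    "Enthalpy is a thermodynamic state function.",
    "Redshift increases with cosmological distance.",
]
wiki_emb = encoder.encode(wiki_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(wiki_emb.shape[1])         # inner product = cosine on normalized vectors
index.add(wiki_emb)
query = encoder.encode(["question text plus its five options"], normalize_embeddings=True)
scores, ids = index.search(query, 2)                 # top-2 most similar passages as context
context = "\n".join(wiki_texts[i] for i in ids[0])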
《Open Book LLM Science Exam - Reduced RAM usage》: quangbk improved the memory efficiency of method 5.
《OpenBook DeBERTaV3-Large Baseline (Single Model)》: Anil combined methods 4 and 6. He first retrieved a context for each test-set row following method 6, merged it with the prompt to obtain a new test set, and then loaded the model trained in method 4 for inference; Public Score=0.771.
test_df["prompt"] = test_df["context"] + " #### " + test_df["prompt"]
《Sharing my trained-with-context model》: Mgoksu likewise used method 7, but trained offline on a dataset of his own making to obtain a better model, llm-science-run-context-2, and then ran inference with it; top public LB=0.807.
《How To Train Open Book Model - Part 1》, 《How To Train Open Book Model - Part 2》: CHRIS DEOTTE
In part 1 he follows method 8 and trains on a self-built 60k dataset, obtaining the model model_v2; in part 2 he runs inference with both the method-8 model llm-science-run-context-2 and model_v2 and averages the two probability distributions for the final result (Public Score=0.819).
《LLM Science Exam Optimise Ensemble Weights》: the author starts from the model weights trained in method 9 and, to increase diversity, also ensembles several other deberta-v3-large models that do not use Open Book; final Public Score=0.837. The author also wrote the following notebook:
《LLM-SciEx Optimise Ensemble Weights(better models)》: similar to method 10, via model ensembling; Public Score=0.846.
《with only 270K articles》: the author built his own 270K-article Wikipedia dataset and trained a LongFormer model instead of deberta-v3-large; Public Score=0.862.
《Platypus2-70B with Wikipedia RAG》: SIMJEG combined methods 8 and 12 across 18 notebook versions, with Public Scores ranging from 0.832 to 0.909. ALI explains this notebook in detail in 《Explained Platypus2-70B + Wikipedia RAG》.
《Fork of Fork of [86.2] with only 270K articles!》: improves the preprocessing function on top of method 12 and uses the model from method 8; Public Score=0.905.
《RAPIDS TF-IDF - [LB 0.904] - Single Model》: on top of method 12, uses RAPIDS TF-IDF to accelerate retrieval, two GPUs (2xT4) with two threads to accelerate inference, and tunes some parameters (prepare_answering_input2); final LB=0.904. The author reports that, following method 11 and ensembling 6 more models, the score reaches 0.916; that code is not public.
Reference: 《[EDA, Data gathering] LLM-SE ~ Wiki STEM | 1k DS》
The competition dataset consists of multiple-choice questions generated by GPT3.5; each record is a prompt (the question), five answer options A, B, C, D, E, and an answer (the correct option). The goal is to predict, from the prompt, the three most likely answer options.
import os
import random
import openai
import requests
import wikipediaapi
import itables
import numpy as np
import pandas as pd
import plotly.express as px
from kaggle_secrets import UserSecretsClient
# specify OpenAI API key in Kaggle's secrets add-ons.
user_secrets = UserSecretsClient()
openai.api_key = user_secrets.get_secret("openai_api")
train_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")
table = itables.show(train_df, table_options=dict(pageLength=10))
table
fig = px.histogram([len(x.split(" ")) for x in train_df.prompt], nbins=40, color_discrete_sequence=['goldenrod'])
fig.update_layout(
showlegend=False,
xaxis_title="Number of words",
title={
'text': "Distribution of the number of words in prompts",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'
}
)
fig.show()
The most important part of this competition is data collection, so it is important to understand how the test set was formed and how to reproduce its collection.
According to the competition description, the test set was built from Wikipedia pages. In other words, pages on science, technology, engineering, and math topics were selected (with emphasis on the S in STEM), an excerpt was taken from each, and the excerpt was passed to the GPT3.5 model to generate a multiple-choice question.
To reproduce the collection of the competition test data, the following steps are needed (a hypothetical sketch of steps 1-3 follows the list):
1. Build a list of STEM (science, technology, engineering, math) categories; for each category, matching pages will be searched to extract test content from.
2. Randomly select a category or page related to the given topic or subtopic.
3. After selecting a page, extract its text.
4. Send the LLM a message that states the task to perform explicitly and provides the extracted text.
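A minimal sketch of steps 1-3 (wikipediaapi and random are imported below; the category name is illustrative, and the user_agent argument assumes a recent wikipediaapi version):
# Hypothetical sketch, not the author's code: pick a random page from a STEM category.
wiki = wikipediaapi.Wikipedia(user_agent="llm-se-data-collection", language="en")
category = wiki.page("Category:Concepts in physics")
pages = [p for p in category.categorymembers.values()
         if p.namespace == wikipediaapi.Namespace.MAIN]   # keep articles, drop subcategories
page = random.choice(pages)      # step 2: random page from the category
wiki_text = page.text[:4000]     # step 3: extract its text, truncated to fit the prompt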
options_set = set(("option_1", "option_2", "option_3", "option_4", "option_5"))
response_keys_set = set(("question", "option_1", "option_2", "option_3", "option_4", "option_5", "answer"))
delimiter = "####"
system_message = f"""
You will be provided with TEXT from wikipedia. \
The TEXT will be delimited with {delimiter} characters.
Output a python list of 5 dict objects, where each object is \
a multiple choice question whom answers should be in \
the given TEXT and that has 5 choices each and has the following format:
'question':
'option_1':
'option_2':
'option_3':
'option_4':
'option_5':
'answer':
You should tell me which one of your proposed options is right \
by assigning the corresponding option's key label in the 'answer' field.
The question, the answer and question answer options should be broad, \
challenging, long, detailed and based on the TEXT provided.
Only output the list of objects, with nothing else.
"""
def get_completion_messages(wiki_text):
return [
{
'role':'system',
'content': system_message
},
{
'role':'user',
'content': f"{delimiter}{wiki_text}{delimiter}"
},
]
def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0.8, max_tokens=3000):
response = openai.ChatCompletion.create(
model=model,
messages=messages,
temperature=temperature,
max_tokens=max_tokens)
return response.choices[0].message["content"]
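An illustrative call of the two helpers above; the page text is a placeholder, and the model's string output still has to be parsed:
import ast
wiki_text = "Newton's law of universal gravitation states that every particle attracts ..."
messages = get_completion_messages(wiki_text)
raw = get_completion_from_messages(messages)
# the system prompt asks for a python list of 5 dicts, so parse the reply defensively
questions = ast.literal_eval(raw)
print(questions[0]["question"], questions[0]["answer"])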
Finally, following this approach, the author generated 1,000 training examples from 250 Wikipedia pages, published as the Wikipedia STEM 1k dataset. For training and inference code based on this dataset, see 《LLM-SE ~ deberta-v3-large -t | 1k Wiki》 and 《[LB: 0.709] LLM-SE ~ deberta-v3-large -i | 1k Wiki》.
Reference: 《More efficient data collection: How to choose which categories to focus on?》
The competition dataset was collected with gpt3.5 in two stages: first, select Wikipedia articles; second, generate new questions from them.
The first stage is relatively simple and was already handled in 《LLM-SE ~ Wiki STEM | 1k DS》. The second stage, however, is expensive if the gpt-3.5 API is used to generate new questions from tens of thousands of articles. Making sure we only include Wikipedia articles relevant to the task therefore matters for two reasons: a) it lowers cost, and b) it reduces noise for the model (which is the main point of fine-tuning).
For example, we know the competition is based on science articles from Wikipedia, so the training data should not include history or geography articles. Even then, Wikipedia still has a great many science articles; if we can filter out the ones most relevant to the competition, that further reduces noise for the model.
In 《LLM Science Exam: Wikipedia Graph Analysis》, the author analyzes the Wikipedia pages behind the competition training data, aiming to find which Wikipedia categories are best suited for generating more training data. The author took the following steps:
train_pages = [
'Supersymmetric quantum mechanics',
'Relative density',
'Memristor',
'Quantization (physics)',
'Symmetry in biology',
'Mass versus weight',
'Navier–Stokes equations',
'Thermal equilibrium',
'Electrical resistivity and conductivity',
'Superconductivity',
'Black hole',
'Born reciprocity',
"Commentary on Anatomy in Avicenna's Canon",
'Supernova',
'Angular momentum',
'Condensation cloud',
'Minkowski space',
'Vacuum',
'Standard Model',
'Nebula',
'Antiferromagnetism',
'Light-year',
'Propagation constant',
'Phase transition',
'Redshift',
'The Ambidextrous Universe',
'Interstellar medium',
'Improper rotation',
'Plant',
'Clockwise',
'Morphology (biology)',
'Magnetic susceptibility',
'Nuclear fusion',
'Theorem of three moments',
'Lorentz covariance',
'Causality (physics)',
'Total internal reflection',
'Surgical pathology',
'Environmental Science Center',
'Electrochemical gradient',
'Planetary system',
'Cavitation',
'Parity (physics)',
'Dimension',
'Heat treating',
'Speed of light',
'Mass-to-charge ratio',
'Landau–Lifshitz–Gilbert equation',
'Point groups in three dimensions',
'Mammary gland',
'Convection (heat transfer)',
'Modified Newtonian dynamics',
"Earnshaw's theorem",
'Coherent turbulent structure',
'Phageome',
'Infectious tolerance',
'Ferromagnetism',
'Coffee ring effect',
'Magnetic resonance imaging',
'Ring-imaging Cherenkov detector',
'Tidal force',
'Kutta-Joukowski theorem',
'Radiosity (radiometry)',
'Quartz crystal microbalance',
'Crystallinity',
'Magnitude (astronomy)',
"Newton's law of universal gravitation",
'Uniform tilings in hyperbolic plane',
'Refractive index',
'Theorem',
'Leidenfrost effect',
'API gravity',
'Supersymmetry',
'Dark Matter',
'Molecular symmetry',
'Spin (physics)',
'Astrochemistry',
'List of equations in classical mechanics',
'Diffraction',
'C1 chemistry',
'Reciprocal length',
'Amplitude',
'Work function',
'Coherence (physics)',
'Ultraviolet catastrophe',
'Symmetry of diatomic molecules',
'Bollard pull',
'Linear time-invariant system',
'Triskelion',
'Cold dark matter',
'Frame-dragging',
"Fermat's principle",
'Enthalpy',
'Main sequence',
'QCD matter',
'Molecular cloud',
'Free neutron decay',
'Second law of thermodynamics',
'Droste effect',
'History of geology',
'Gravitational wave',
'Regular polytope',
'Spatial dispersion',
'Probability amplitude',
'Stochastic differential equation',
'Gravity Probe B',
'Electronic entropy',
'Renormalization',
'Unified field theory',
"Elitzur's theorem",
"Hesse's principle of transfer",
'Ecological pyramid',
'Virtual particle',
'Ramsauer–Townsend effect',
'Butterfly effect',
'Zero-point energy',
'Baryogenesis',
'Pulsar',
'Decay technique',
'Electric flux',
'Water hammer',
'Dynamic scaling',
'Luminance',
'Crossover experiment (chemistry)',
'Spontaneous symmetry breaking',
'Self-organization in cybernetics',
'Stellar classification',
'Probability density function',
'Pulsar-based navigation',
'Supermassive black hole',
'Explicit symmetry breaking',
'Surface power density',
'Organography',
'Copernican principle',
'Geometric quantization',
'Erlangen program',
'Magnetic monopole',
'Inflation (cosmology)',
'Heart',
'Observable universe',
'Wigner quasiprobability distribution',
'Shower-curtain effect',
'Scale (ratio)',
'Hydrodynamic stability',
'Paramagnetism',
'Emissivity',
'Critical Raw Materials Act',
'James Webb Space Telescope',
'Signal-to-noise ratio',
'Photophoresis',
'Time standard',
'Time',
'Galaxy',
'Rayleigh scattering'
]
Find the categories of the given pages: we use BeautifulSoup to extract the categories each page belongs to, rather than the Wikipedia API, because the API returns some irrelevant hidden categories.
Analyze the overall graph, from the following two angles:
Find the categories with the most connections: for each category, build a depth-first search tree, count the pages in the tree, and thus tally how many pages each category links to.
Compute a relevance score per category: many categories link to a large number of pages, which by itself is not discriminative enough. For each category, we compute the sum of the inverse shortest distances between it and all the pages it connects to, and use this as the "relevance score" (a minimal sketch of this score is given after the results below). Two factors drive the score:
Number of connected pages: a category connected to more pages scores higher, because it covers more content.
Connection distance: a category whose connections to pages are tight, i.e. whose shortest distances are small, scores higher, because it is more likely to be directly related to the page content.
top_leaf_connect_distances[:10]
# spaces added inside the tuples for readability
[('Category:Concepts in physics', 59.0984126984127),
('Category:Concepts by field', 42.90479797979802),
('Category:Physics', 42.26829004329007),
('Category:Subfields of physics', 37.716269841269835),
('Category:Physical sciences', 37.33095238095237),
('Category:Subfields by academic discipline', 36.35238095238095),
('Category:Physical phenomena', 36.318253968253956),
('Category:Main topic classifications', 35.085714285714275),
('Category:Concepts', 34.82943722943723),
('Category:Physical quantities', 34.773409923409915)]
As the list shows, "Concepts in physics" has the highest score, so this category can be used to collect more training data. The author verified that querying the "Concepts in physics" category alone, at depth 1, returns 882 pages, which include 111 of the training-set pages, i.e. 72% of the training set. This greatly reduces the number of pages from which we need to generate questions to obtain additional training data.
Since other scientific fields are not yet covered, you can explore other categories in the graph and add them in, to see whether more of the training data can be covered.
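A minimal sketch of the relevance score, assuming the category/page graph has already been built; the networkx names below are illustrative, not the author's code:
import networkx as nx
def relevance_score(graph: nx.Graph, category: str, train_pages: list) -> float:
    targets = set(train_pages)
    lengths = nx.single_source_shortest_path_length(graph, category)
    # closer pages contribute more; unreachable pages contribute nothing
    return sum(1.0 / d for page, d in lengths.items() if page in targets and d > 0)
# scores = {c: relevance_score(G, c, train_pages) for c in candidate_categories}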
References: 《[86.2] with only 270K articles!》, 《Finding 270K Wikipedia STEM articles!》, 《270K Wikipedia STEM articles》
Reference: 《Kaggle大模型比赛冠军方案梳理》
RAG (Retrieval-Augmented Generation): R is retrieval, A is augmentation, G is generation. RAG integrates retrieval (or search) into LLM text generation: it combines a retrieval system, which fetches relevant document snippets from a large corpus, with an LLM, which generates answers using the information in those snippets. In essence, RAG helps the model "look up" external information to improve its responses. For this competition specifically, that means using the question and its options to recall related Wikipedia documents and concatenating them onto the question to be answered. A brief sketch of the RAG flow follows:
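A minimal, self-contained sketch of the retrieve-augment-generate loop; the two-passage corpus and the question are placeholders, and a real pipeline would retrieve from Wikipedia and feed the augmented prompt to a multiple-choice reader model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = [
    "Dark matter is inferred from the flat rotation curves of galaxies.",
    "The mitochondrion is the powerhouse of the cell.",
]
question = "What observation suggests the existence of dark matter?"
vec = TfidfVectorizer(ngram_range=(1, 2))
doc_mat = vec.fit_transform(corpus)                                     # index the corpus
best = cosine_similarity(vec.transform([question]), doc_mat).argmax()   # retrieve
augmented = corpus[best] + " #### " + question                          # augment: prepend the context
# generate: feed `augmented` plus the five options to the multiple-choice model
print(augmented)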
Much of the retrieval code shared by others uses relatively small text encoders, which may perform poorly on whole articles. Moreover, given that all the questions are about science, technology, engineering, and math (STEM) topics, do we really need to retrieve over all of Wikipedia, or is there a subset of articles that already contains the vast majority of the required information?
In 《Finding 270K Wikipedia STEM articles!》, the author starts from the train_pages of the previous chapter (the list of 154 Wikipedia articles used to generate the training data) and clusters Wikipedia articles with KMeans in a semi-supervised way to obtain 270K Wikipedia STEM articles, published as 《STEM wikipedia subset based on Cohere embeddings》.
However, due to problems with WikiExtractor, the final Wiki parse sometimes loses numbers or even whole paragraphs, which hurts the retrieval-augmented model. So, for the same set of articles, the author re-collected the article contexts through the Wiki API and published the new dataset 《270K Wikipedia STEM articles》, which can be loaded with the datasets.load_from_disk() method.
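Loading such an on-disk dataset is a one-liner; the path below is an illustrative Kaggle input path, not the dataset's actual mount point:
from datasets import load_from_disk
stem_articles = load_from_disk("/kaggle/input/270k-wikipedia-stem-articles")
print(stem_articles)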
- 《with only 270K articles!》 and its discussion thread in the competition forum
- Sentence Transformers: multilingual sentence, paragraph, and image embeddings using BERT & Co.; see the Sentence Transformers documentation
# install the required libraries offline
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl
!cp /kaggle/input/datasets-wheel/datasets-2.14.4-py3-none-any.whl /kaggle/working
!pip install /kaggle/working/datasets-2.14.4-py3-none-any.whl
!cp /kaggle/input/backup-806/util_openbook.py .
import pickle,gc
from util_openbook import get_contexts, generate_openbook_output
get_contexts()
generate_openbook_output()
gc.collect()
This step runs the two functions from util_openbook.py:
- get_contexts(): uses faiss to run vector retrieval over the test-set prompts and fetch the most similar Wikipedia context (content) for each.
- generate_openbook_output(): merges the content with the prompt, decodes and runs inference with the llm-science-run-context-2 model, formats the results as the competition requires, and saves them to submission_backup.csv.
The whole of util_openbook.py is essentially the code of 《Sharing my trained-with-context model》, top public LB=0.807. I walk through this code in detail in 《Kaggle - LLM Science Exam(二):Open Book QA&debertav3-large详解》, which you can consult.
import os
import numpy as np
import pandas as pd
from datasets import load_dataset, load_from_disk
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from transformers import LongformerTokenizer, LongformerForMultipleChoice
import transformers
import matplotlib.pyplot as plt
from tqdm import tqdm
import unicodedata
backup_model_predictions = pd.read_csv("submission_backup.csv")
backup_model_predictions.head()
id prediction
0 0 D B E
1 1 A B D
2 2 A C D
3 3 C E D
4 4 D A B
!cp -r /kaggle/input/stem-wiki-cohere-no-emb /kaggle/working
!cp -r /kaggle/input/all-paraphs-parsed-expanded /kaggle/working/
import unicodedata
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_from_disk
from tqdm import tqdm
# Function 1: split a list into chunks of a given size
def SplitList(mylist, chunk_size):
    """
    Split the input list into chunks of size chunk_size.
    Args:
        mylist (list): the list to split.
        chunk_size (int): the chunk size.
    Returns:
        list of lists: the resulting chunks.
    """
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]
# Function 2: get the parsed contents of relevant documents
def get_relevant_documents_parsed(df_valid):
    """
    Retrieve the parsed contents of the documents relevant to df_valid.
    Args:
        df_valid (DataFrame): the DataFrame holding the evaluation data.
    Returns:
        list of lists: the parsed contents of the relevant documents.
    """
    # chunk size for iterating over the DataFrame
    df_chunk_size = 600
    # load the parsed corpus dataset from disk
    paraphs_parsed_dataset = load_from_disk("/kaggle/working/all-paraphs-parsed-expanded")
    # extract and normalize the text from the corpus dataset
    modified_texts = paraphs_parsed_dataset.map(lambda example:
                                                {'temp_text':
                                                 f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
                                                num_proc=2)["temp_text"]
    # accumulators for the results
    all_articles_indices = []
    all_articles_values = []
    # iterate over the DataFrame in chunks, with a tqdm progress bar
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        # current chunk of the DataFrame
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
        # call retrieval() to get the indices and scores of the relevant articles
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        # append the chunk's results to the corresponding lists
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
    # merge the per-chunk results into single arrays
    article_indices_array = np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    # number of relevant articles per query
    top_per_query = article_indices_array.shape[1]
    # reshape the results into the output format
    articles_flatten = [(
                         articles_values_array[index],
                         paraphs_parsed_dataset[idx.item()]["title"],
                         paraphs_parsed_dataset[idx.item()]["text"],
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    # split into per-query chunks with SplitList and return
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles
# Function 3: get relevant documents
def get_relevant_documents(df_valid):
    """
    Retrieve the documents relevant to df_valid.
    Args:
        df_valid (DataFrame): the DataFrame holding the evaluation data.
    Returns:
        list of lists: the relevant documents.
    """
    # chunk size for iterating over the DataFrame
    df_chunk_size = 800
    # load the filtered corpus dataset from disk
    cohere_dataset_filtered = load_from_disk("/kaggle/working/stem-wiki-cohere-no-emb")
    # extract and normalize the text from the corpus dataset
    modified_texts = cohere_dataset_filtered.map(lambda example:
                                                 {'temp_text':
                                                  unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
                                                 num_proc=2)["temp_text"]
    # accumulators for the results
    all_articles_indices = []
    all_articles_values = []
    # iterate over the DataFrame in chunks, with a tqdm progress bar
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        # current chunk of the DataFrame
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
        # call retrieval() to get the indices and scores of the relevant articles
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        # append the chunk's results to the corresponding lists
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
    # merge the per-chunk results into single arrays
    article_indices_array = np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    # number of relevant articles per query
    top_per_query = article_indices_array.shape[1]
    # reshape the results into the output format
    articles_flatten = [(
                         articles_values_array[index],
                         cohere_dataset_filtered[idx.item()]["title"],
                         unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    # split into per-query chunks with SplitList and return
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles
# Function 4: retrieve relevant documents
def retrieval(df_valid, modified_texts):
    """
    Retrieve relevant documents from df_valid and the preprocessed texts.
    Args:
        df_valid (DataFrame): the DataFrame holding the evaluation data.
        modified_texts (list): the preprocessed texts.
    Returns:
        tuple: the indices and scores of the relevant documents.
    """
    corpus_df_valid = df_valid.apply(lambda row:
                                     f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["A"]}\n{row["B"]}\n{row["C"]}\n{row["D"]}\n{row["E"]}',
                                     axis=1).values
    # a TfidfVectorizer for computing TF-IDF features of the questions
    vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
                                  token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                  stop_words=stop_words)
    # fit the first vectorizer on the question corpus
    vectorizer1.fit(corpus_df_valid)
    vocab_df_valid = vectorizer1.get_feature_names_out()
    # a second TfidfVectorizer for the preprocessed texts, restricted to the question vocabulary
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words,
                                 vocabulary=vocab_df_valid)
    # fit the second vectorizer on the first 500,000 preprocessed texts
    vectorizer.fit(modified_texts[:500000])
    # compute the TF-IDF matrix of the question corpus
    corpus_tf_idf = vectorizer.transform(corpus_df_valid)
    # print the size of the vectorizer vocabulary
    print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names_out())}")
    # chunk size, top-N per chunk, and top-N per query
    chunk_size = 100000
    top_per_chunk = 10
    top_per_query = 10
    # accumulators for each chunk's top indices and values
    all_chunk_top_indices = []
    all_chunk_top_values = []
    # iterate over the preprocessed texts chunk by chunk, with a tqdm progress bar
    for idx in tqdm(range(0, len(modified_texts), chunk_size)):
        # TF-IDF vectors of the current chunk
        wiki_vectors = vectorizer.transform(modified_texts[idx: idx+chunk_size])
        # similarity scores of each query against the chunk
        temp_scores = (corpus_tf_idf * wiki_vectors.T).toarray()
        # indices of the top-N scores per query
        chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
        # the top-N scores per query
        chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]
        # append the chunk's top indices and values
        all_chunk_top_indices.append(chunk_top_indices + idx)
        all_chunk_top_values.append(chunk_top_values)
    # merge the per-chunk tops into single arrays
    top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
    top_values_array = np.concatenate(all_chunk_top_values, axis=1)
    # sort the merged values, keeping only the top-N scores per query
    merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
    # indices of the top-N scores per query
    merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
    # map back to article indices
    articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
    # return the indices and scores of the relevant articles
    return articles_indices, merged_top_scores
The functions above retrieve the documents relevant to the test questions. Breaking them down one by one:
- SplitList(mylist, chunk_size): splits a list into smaller chunks of the given size (chunk_size) and returns a list containing those chunks.
- get_relevant_documents_parsed(df_valid): takes a DataFrame (df_valid) as input and retrieves the articles relevant to a set of questions. It iterates over df_valid in chunks of df_chunk_size rows, each chunk being one batch of questions; for every batch it calls the retrieval function to obtain article indices and scores and appends them to the corresponding lists.
- get_relevant_documents(df_valid): similar to get_relevant_documents_parsed, but it loads a different dataset ("cohere_dataset_filtered") and uses a different text-preprocessing method.
- retrieval(df_valid, modified_texts): performs the actual document retrieval from the questions and the preprocessed texts. In detail, it builds TF-IDF vectors for the questions in df_valid, splits modified_texts into chunks of chunk_size, and processes each chunk separately.
Finally, when get_relevant_documents_parsed(df_valid) is called, it uses the functions above to retrieve the articles relevant to the questions in df_valid; for each group of questions these articles are organized into per-query chunks.
stop_words = ['each', 'you', 'the', 'use', 'used',
'where', 'themselves', 'nor', "it's", 'how', "don't", 'just', 'your',
'about', 'himself', 'with', "weren't", 'hers', "wouldn't", 'more', 'its', 'were',
'his', 'their', 'then', 'been', 'myself', 're', 'not',
'ours', 'will', 'needn', 'which', 'here', 'hadn', 'it', 'our', 'there', 'than',
'most', "couldn't", 'both', 'some', 'for', 'up', 'couldn', "that'll",
"she's", 'over', 'this', 'now', 'until', 'these', 'few', 'haven',
'of', 'wouldn', 'into', 'too', 'to', 'very', 'shan', 'before', 'the', 'they',
'between', "doesn't", 'are', 'was', 'out', 'we', 'me',
'after', 'has', "isn't", 'have', 'such', 'should', 'yourselves', 'or', 'during', 'herself',
'doing', 'in', "shouldn't", "won't", 'when', 'do', 'through', 'she',
'having', 'him', "haven't", 'against', 'itself', 'that',
'did', 'theirs', 'can', 'those',
'own', 'so', 'and', 'who', "you've", 'yourself', 'her', 'he', 'only',
'what', 'ourselves', 'again', 'had', "you'd", 'is', 'other',
'why', 'while', 'from', 'them', 'if', 'above', 'does', 'whom',
'yours', 'but', 'being', "wasn't", 'be']
The training and test sets contain the same data; the test set merely lacks the labels.
train_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
train_df.iloc[:,:7].equals(test_df)
True
retrieved_articles_parsed = get_relevant_documents_parsed(test_df)
try:
    with open("retrieved_articles_parsed.pkl", "wb") as pickle_file:
        pickle.dump(retrieved_articles_parsed, pickle_file)
    print('saved retrieved_articles_parsed successfully')
except Exception as e:
    # catch the exception and print the error message
    print("error saving retrieved_articles_parsed:", str(e))
gc.collect()
retrieved_articles_parsed[0]
# 10 relevant articles in total, each with its score and title
[(0.3126188353233343,
'Modified Newtonian dynamics',
'Several independent observations point to the fact that the visible mass in galaxies and galaxy clusters is insufficient to account for their dynamics, when analyzed using Newton\'s laws. This discrepancy – known as the "missing mass problem" – was first identified for clusters by Swiss astronomer Fritz Zwicky in 1933 (who studied the Coma cluster), and subsequently extended to include spiral galaxies by the 1939 work of Horace Babcock on Andromeda.These early studies were augmented and brought to the attention of the astronomical community in the 1960s and 1970s by the work of Vera Rubin at the Carnegie Institute in Washington, who mapped in detail the rotation velocities of stars in a large sample of spirals. While Newton\'s Laws predict that stellar rotation velocities should decrease with distance from the galactic centre, Rubin and collaborators found instead that they remain almost constant – the rotation curves are said to be "flat". This observation necessitates at least one of the following: Option (1) leads to the dark matter hypothesis; option (2) leads to MOND.'),
(0.3153668078674981,
'Atom',
'Up to 95% of the Milky Way\'s baryonic matter are concentrated inside stars, where conditions are unfavorable for atomic matter. The total baryonic mass is about 10% of the mass of the galaxy; the remainder of the mass is an unknown dark matter. High temperature inside stars makes most "atoms" fully ionized, that is, separates all electrons from the nuclei. In stellar remnants—with exception of their surface layers—an immense pressure make electron shells impossible.'),
...
...
]
retrieved_articles = get_relevant_documents(test_df)
try:
    with open("retrieved_articles.pkl", "wb") as pickle_file:
        pickle.dump(retrieved_articles, pickle_file)
    print('saved retrieved_articles successfully')
except Exception as e:
    # catch the exception and print the error message
    print("error saving retrieved_articles:", str(e))
gc.collect()
retrieved_articles[0]
# 10 relevant articles in total, each with its score and title
[(0.2950371392568642,
'Intracluster medium',
'Measurements of the temperature and density profiles in galaxy clusters allow for a determination of the mass distribution profile of the ICM through hydrostatic equilibrium modeling. The mass distributions determined from these methods reveal masses that far exceed the luminous mass seen and are thus a strong indication of dark matter in galaxy clusters.'),
(0.29527605466122714,
'Modified Newtonian dynamics',
"Several other studies have noted observational difficulties with MOND. For example, it has been claimed that MOND offers a poor fit to the velocity dispersion profile of globular clusters and the temperature profile of galaxy clusters, that different values of a are required for agreement with different galaxies' rotation curves, and that MOND is naturally unsuited to forming the basis of cosmology. Furthermore, many versions of MOND predict that the speed of light is different to the speed of gravity, but in 2017 the speed of gravitational waves was measured to be equal to the speed of light to high precision. This is well understood in modern relativistic theories of MOND, with the constraint from gravitational waves actually helping by substantially restricting how a covariant theory might be constructed."),
...
...
]
# 1. load the retrieved similar articles for evaluation
with open("/kaggle/input/retrieved-articles/retrieved_articles/retrieved_articles_parsed.pkl", "rb") as pickle_file:
    retrieved_articles_parsed = pickle.load(pickle_file)
print('loaded retrieved_articles_parsed successfully')
with open("/kaggle/input/retrieved-articles/retrieved_articles/retrieved_articles.pkl", "rb") as pickle_file:
    retrieved_articles = pickle.load(pickle_file)
print('loaded retrieved_articles successfully')
# 2. load the train and test sets
train_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
The function below merges the previously retrieved context with each test sample's question, then replicates the result 5 times to get c_plus_q_5, one copy per answer option. It then tokenizes c_plus_q_5 together with options (both lists of length 5) and finally keeps only input_ids and attention_mask.
def prepare_answering_input(
        tokenizer,
        question,
        options,
        context,
        max_seq_length=4096,
    ):
    """
    Prepare the encoded input for answering: question, options, and context.
    Args:
        tokenizer (Tokenizer): tokenizer used to split and encode the text.
        question (str): the question text.
        options (list of str): the texts of the question's options.
        context (str): the context for the question and options.
        max_seq_length (int, optional): maximum sequence length of the encoding. Defaults to 4096.
    Returns:
        dict: the encoded inputs, i.e. the input ids and attention mask.
    Note:
        This function combines question, options, and context into one encoded input
        for the downstream question-answering task.
    """
    # join the retrieved context and the question, separated by a special token
    c_plus_q = context + ' ' + tokenizer.bos_token + ' ' + question
    # replicate the joined string 5 times, once per answer option
    c_plus_q_5 = [c_plus_q] * len(options)
    # tokenize the joined strings together with the options
    tokenized_examples = tokenizer(
        c_plus_q_5, options,
        max_length=max_seq_length,
        padding="longest",
        truncation=False,
        return_tensors="pt",
    )
    # extract the encoded input ids and attention mask
    input_ids = tokenized_examples['input_ids'].unsqueeze(0)
    attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
    # build the dict of encoded inputs
    example_encoded = {
        "input_ids": input_ids.to(model.device.index),
        "attention_mask": attention_mask.to(model.device.index),
    }
    # return the encoded input
    return example_encoded
def map_at_3(predictions, labels):
    map_sum = 0
    # keep the indices of the 3 highest-probability options per question
    pred = [np.argsort(-np.array(prob))[:3] for prob in predictions]
    for x, y in zip(pred, labels):
        # 1/rank at the position where the true label appears, 0 elsewhere
        z = [1/i if y == j else 0 for i, j in zip([1, 2, 3], x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)
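A quick check against the worked MAP@3 example earlier: probability vectors that rank the correct label (index 0, i.e. option A) first for all three questions give a score of 1.0.
print(map_at_3([[0.9, 0.03, 0.03, 0.02, 0.02]] * 3, [0, 0, 0]))  # -> 1.0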
test_df.head()
def predict(df, prepare):
    probability = []  # predicted probabilities
    submit_ids = []   # ids
    result = []       # predicted labels
    for index in tqdm(range(df.shape[0])):
        columns = df.iloc[index].values
        submit_ids.append(columns[0])
        question = columns[1]
        options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
        # contexts for the test set come from the two retrieval results,
        # retrieved_articles and retrieved_articles_parsed
        context1 = "\n".join([retrieved_articles[index][-i][2] for i in range(4, 0, -1)])
        context2 = "\n".join([retrieved_articles_parsed[index][-i][2] for i in range(3, 0, -1)])
        inputs1 = prepare(
            tokenizer=tokenizer, question=question,
            options=options, context=context1,
        )
        inputs2 = prepare(
            tokenizer=tokenizer, question=question,
            options=options, context=context2,
        )
        with torch.no_grad():
            outputs1 = model(**inputs1)
            losses1 = -outputs1.logits[0].detach().cpu().numpy()
            probability1 = torch.softmax(torch.tensor(-losses1), dim=-1)
        with torch.no_grad():
            outputs2 = model(**inputs2)
            losses2 = -outputs2.logits[0].detach().cpu().numpy()
            probability2 = torch.softmax(torch.tensor(-losses2), dim=-1)
        probability_ = (probability1 + probability2) / 2
        # free intermediate results
        del probability1
        del probability2
        torch.cuda.empty_cache()  # release GPU memory
        # keep this prediction if its max probability exceeds 0.4; otherwise fall back to
        # backup_model_predictions, effectively ensembling the two models
        if probability_.max() > 0.4:
            predict = np.array(list("ABCDE"))[np.argsort(probability_)][-3:].tolist()[::-1]
        else:
            predict = backup_model_predictions.iloc[index].prediction.replace(" ", "")
        probability.append(probability_)
        result.append(predict)
    return probability, result
from transformers import LongformerTokenizer, LongformerForMultipleChoice
model_dir="/kaggle/input/longformer-race-model/longformer_qa_model"
tokenizer = LongformerTokenizer.from_pretrained(model_dir)
model = LongformerForMultipleChoice.from_pretrained(model_dir).cuda()
probability,result=predict(test_df,prepare_answering_input)
result = [" ".join(i) for i in result]
pd.DataFrame({'id':train_df.id,'prediction':result}).to_csv('submission.csv', index=False)
submission_df = pd.read_csv('submission.csv')
submission_df
The final score is 0.862.
Reference: 《Fork of Fork of [86.2] with only 270K articles!》
Compared with 4.2, only two things were changed:
def prepare_answering_input(
tokenizer,
question,
options,
context,
max_seq_length=1024,
):
c_plus_q = tokenizer.bos_token + ' ' + context
options = [' #### ' + question + " [SEP] " + opt for opt in options]
# c_plus_q = context + ' ' + tokenizer.bos_token + ' ' + question # TODO
c_plus_q_4 = [c_plus_q] * len(options)
tokenized_examples = tokenizer(
c_plus_q_4, options,
max_length=max_seq_length,
padding="longest",
truncation='only_first',
return_tensors="pt",
add_special_tokens=False
)
input_ids = tokenized_examples['input_ids'].unsqueeze(0)
attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
example_encoded = {
"input_ids": input_ids.to(model.device.index),
"attention_mask": attention_mask.to(model.device.index),
}
return example_encoded
- The single model run-context-valid-loss scores 0.896.
- This author released the model llm-science-run-context-2 in 《Sharing my trained-with-context model》; the checkpoints below are presumably similar models trained by the same author.
model_paths = [
# r2, 0, 1, 2 = 81.5
# 0, 1, 2 = 81.6
# 0, 1, 2, 3, 4, 5 = 81.8
# 0.02, 0.00, 0.08, 0.03, 0.00, 0.00, 0.31, 0.00, 0.01, 0.41, 0.13
# '/kaggle/input/llm-science-run-context-2', # 80.6
# '/kaggle/input/llm-science-run-context-3/fold_0', # 80.6
'/kaggle/input/llm-science-run-context-3/fold_1', # 81.0
# # # # '/kaggle/input/llm-science-run-context-3/fold_2', # 79.9
# '/kaggle/input/llm-science-run-context-3/fold_3', # 80.1
# '/kaggle/input/llm-science-run-context-3/fold_4', # 80.0
# # # '/kaggle/input/llm-science-run-context-3/fold_5', # 79.8
'/kaggle/input/llm-science-run-context-4/fold_0', # 81.8
# # # '/kaggle/input/llm-science-run-context-4/fold_1', # 79.9
# '/kaggle/input/llm-science-run-context-5/fold_0', # 81.3
'/kaggle/input/run-context-valid-loss',
# '/kaggle/input/run-context2-valid-loss',
# '/kaggle/input/llm-science-run-context-5/fold_1',
# '/kaggle/input/run-context-all-mpnet-base-v2-map3/0',
# '/kaggle/input/run-context-all-mpnet-base-v2-loss/0',
# '/kaggle/input/run-context-all-mpnet-base-v2-valid-loss',
]
all_probs = []
for model_path in model_paths:
probs = []
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMultipleChoice.from_pretrained(model_path).cuda()
for index in tqdm(range(df_valid.shape[0])):
columns = df_valid.iloc[index].values
question = columns[1]
options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
context1 = '\n'.join([retrieved_articles[index][-1-i][2] for i in range(10)])
context2 = '\n'.join([retrieved_articles_parsed[index][-1-i][2] for i in range(10)])
inputs1 = prepare_answering_input(
tokenizer=tokenizer, question=question,
options=options, context=context1,
)
inputs2 = prepare_answering_input(
tokenizer=tokenizer, question=question,
options=options, context=context2,
)
with torch.no_grad():
outputs1 = model(**inputs1)
losses1 = -outputs1.logits[0].detach().cpu().numpy()
probability1 = torch.softmax(torch.tensor(-losses1), dim=-1)
with torch.no_grad():
outputs2 = model(**inputs2)
losses2 = -outputs2.logits[0].detach().cpu().numpy()
probability2 = torch.softmax(torch.tensor(-losses2), dim=-1)
probability_ = (probability1 + probability2*1.7)/2
# probability_ = (probability1 + probability2)/2
probs.append(probability_)
all_probs.append(probs)
all_preds = np.stack([np.stack(prbs) for prbs in all_probs])
all_preds = softmax(all_preds, axis=2)
# best_weights = np.array([1, 1, 1.5])
# predictions = np.average(all_preds, axis=0, weights=best_weights)
predictions = all_preds.mean(0)
predictions_as_ids = np.argsort(-predictions, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_string = df_valid['prediction'] = [
' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
submission = df_valid[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)
Reference: 《RAPIDS TF-IDF - [LB 0.904] - Single Model》
Building on 4.2, the author made the following changes:
- The retrieval step uses the RAPIDS (cuML) TF-IDF implementation instead of scikit-learn's, so it runs on GPU.
- Inference runs on two T4 GPUs with two threads.
- Some of the input-preparation parameters are tuned (prepare_answering_input2).
- The reader model is DEBERTA v3 large; see how-to-train-open-book-model-part-1 for the weights.
import os
os.system("cp /kaggle/input/datasets-wheel/datasets-2.14.4-py3-none-any.whl /kaggle/working")
os.system("pip install /kaggle/working/datasets-2.14.4-py3-none-any.whl")
os.system("pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl")
os.system("cp -r /kaggle/input/stem-wiki-cohere-no-emb /kaggle/working")
os.system("cp -r /kaggle/input/all-paraphs-parsed-expanded /kaggle/working/")
import numpy as np, pickle
import pandas as pd, gc, os
from datasets import load_dataset, load_from_disk
import transformers, unicodedata, torch
import matplotlib.pyplot as plt
from tqdm import tqdm
import cudf
from cuml.feature_extraction.text import TfidfVectorizer
_ = gc.collect()
print('Using RAPIDS version',cudf.__version__)
def SplitList(mylist, chunk_size):
return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]
def get_relevant_documents_parsed(df_valid):
df_chunk_size=600
paraphs_parsed_dataset = load_from_disk("/kaggle/working/all-paraphs-parsed-expanded")
modified_texts = paraphs_parsed_dataset.map(lambda example:
{'temp_text':
f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
num_proc=2)["temp_text"]
all_articles_indices = []
all_articles_values = []
for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
all_articles_indices.append(articles_indices)
all_articles_values.append(merged_top_scores)
article_indices_array = np.concatenate(all_articles_indices, axis=0)
articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
top_per_query = article_indices_array.shape[1]
articles_flatten = [(
articles_values_array[index],
paraphs_parsed_dataset[idx.item()]["title"],
paraphs_parsed_dataset[idx.item()]["text"],
)
for index,idx in enumerate(article_indices_array.reshape(-1))]
retrieved_articles = SplitList(articles_flatten, top_per_query)
return retrieved_articles
def get_relevant_documents(df_valid):
df_chunk_size=800
cohere_dataset_filtered = load_from_disk("/kaggle/working/stem-wiki-cohere-no-emb")
modified_texts = cohere_dataset_filtered.map(lambda example:
{'temp_text':
unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
num_proc=2)["temp_text"]
all_articles_indices = []
all_articles_values = []
for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
all_articles_indices.append(articles_indices)
all_articles_values.append(merged_top_scores)
article_indices_array = np.concatenate(all_articles_indices, axis=0)
articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
top_per_query = article_indices_array.shape[1]
articles_flatten = [(
articles_values_array[index],
cohere_dataset_filtered[idx.item()]["title"],
unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
)
for index,idx in enumerate(article_indices_array.reshape(-1))]
retrieved_articles = SplitList(articles_flatten, top_per_query)
return retrieved_articles
def retrieval(df_valid, modified_texts):
corpus_df_valid = df_valid.apply(lambda row:
f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["A"]}\n{row["B"]}\n{row["C"]}\n{row["D"]}\n{row["E"]}',
axis=1)#.values ###
vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
#token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'", ###
stop_words=stop_words,
sublinear_tf=True)
vectorizer1.fit(corpus_df_valid)
vocab_df_valid = vectorizer1.get_feature_names() ###
vectorizer = TfidfVectorizer(ngram_range=(1,2),
#token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'", ###
stop_words=stop_words,
vocabulary=vocab_df_valid,
sublinear_tf=True)
vectorizer.fit( cudf.Series(modified_texts[:500000]) ) ###
corpus_tf_idf = vectorizer.transform(corpus_df_valid)
print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names())}") ###
chunk_size = 100000
top_per_chunk = 10
top_per_query = 10
all_chunk_top_indices = []
all_chunk_top_values = []
for idx in tqdm(range(0, len(modified_texts), chunk_size)):
wiki_vectors = vectorizer.transform( cudf.Series(modified_texts[idx: idx+chunk_size]) ) ###
temp_scores = (corpus_tf_idf * wiki_vectors.T).toarray()
chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]
all_chunk_top_indices.append(chunk_top_indices + idx)
all_chunk_top_values.append(chunk_top_values)
top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
top_values_array = np.concatenate(all_chunk_top_values, axis=1)
merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
return articles_indices, merged_top_scores
df_valid = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
retrieved_articles_parsed = get_relevant_documents_parsed(df_valid)
gc.collect()
retrieved_articles = get_relevant_documents(df_valid)
gc.collect()
# LIBRARIES TO CLEAN MEMORY
import ctypes
libc = ctypes.CDLL("libc.so.6")
_ = gc.collect()
libc.malloc_trim(0)
# FUNCTION TO PREPROCESS TEXT FOR INFER
def prepare_answering_input2(
tokenizer,
question,
options,
context,
max_seq_length=4096,
gpu_id = 0
):
c_plus_q = context[:2500] + ' #### ' + question
c_plus_q_4 = [c_plus_q] * len(options)
tokenized_examples = tokenizer(
c_plus_q_4, options,
max_length=max_seq_length,
padding="longest",
truncation=False,
return_tensors="pt",
)
input_ids = tokenized_examples['input_ids'].unsqueeze(0)
attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
example_encoded = {
"input_ids": input_ids.to(devices[gpu_id]),
"attention_mask": attention_mask.to(devices[gpu_id]),
}
return example_encoded
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
"""Precision at k"""
assert k <= len(r)
assert k != 0
return sum(int(x) for x in r[:k]) / k
def MAP_at_3(predictions, true_items):
"""Score is mean average precision at 3"""
U = len(predictions)
map_at_3 = 0.0
for u in range(U):
user_preds = predictions[u].split()
user_true = true_items[u]
user_results = [1 if item == user_true else 0 for item in user_preds]
for k in range(min(len(user_preds), 3)):
map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
return map_at_3 / U
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice
from torch.cuda.amp import autocast
import threading, torch
device0 = torch.device("cuda:0")
device1 = torch.device("cuda:1")
model_paths = [
"/kaggle/input/model-v181"
]
loss_results = []
for model in model_paths:
print('#'*25)
print('=> Inferring',model)
tokenizer = AutoTokenizer.from_pretrained(model)
model0 = AutoModelForMultipleChoice.from_pretrained(model).to(device0)
model1 = AutoModelForMultipleChoice.from_pretrained(model).to(device1)
models = [model0,model1]
devices = [device0,device1]
all_losses = [[],[]]
submit_ids = [[],[]]
# Create a lock to synchronize the threads
lock = threading.Lock()
# Define a function for inference
def inference_thread(gpu_id, lock):
with lock:
print(f"Thread {gpu_id} started on GPU {gpu_id}")
# INPUT DATA SPLIT 2x
SKIP = df_valid.shape[0]//2
SIZE = df_valid.shape[0] - SKIP
if gpu_id == 0:
SIZE = SKIP
SKIP = 0
LOOPER = range(SIZE)
if gpu_id==0: LOOPER = tqdm(range(SIZE))
# INFERENCE LOOP
for index in LOOPER:
columns = df_valid.iloc[index+SKIP].values
submit_ids[gpu_id].append(columns[0])
question = columns[1]
options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
contexts = [ retrieved_articles[index+SKIP][x][2] for x in [-1,-2,-3] ]
context1 = '\n'.join(contexts)
contexts = [ retrieved_articles_parsed[index+SKIP][x][2] for x in [-3,-2,-1] ]
context2 = '\n'.join(contexts)
inputs1 = prepare_answering_input2(
tokenizer=tokenizer, question=question,
options=options, context=context1,
gpu_id = gpu_id
)
inputs2 = prepare_answering_input2(
tokenizer=tokenizer, question=question,
options=options, context=context2,
gpu_id = gpu_id
)
with torch.no_grad():
with autocast():
outputs1 = models[gpu_id](**inputs1)
losses1 = -outputs1.logits[0].detach().cpu().numpy()
with torch.no_grad():
with autocast():
outputs2 = models[gpu_id](**inputs2)
losses2 = -outputs2.logits[0].detach().cpu().numpy()
all_losses[gpu_id].append( (losses1+losses2)/2. )
with lock:
print(f"Thread {gpu_id} finished on GPU {gpu_id}")
# Create two threads for inference
thread1 = threading.Thread(target=inference_thread, args=(0, lock))
thread2 = threading.Thread(target=inference_thread, args=(1, lock))
# Start the threads
thread1.start()
thread2.start()
# Wait for both threads to finish
thread1.join()
thread2.join()
print("Both threads have finished.")
print()
all_losses[0] = np.vstack( all_losses[0] )
all_losses[1] = np.vstack( all_losses[1] )
all_losses = np.concatenate(all_losses,axis=0)
loss_results.append( all_losses )
submit_ids = submit_ids[0] + submit_ids[1]
# CLEAN MEMORY
del model0, model1, models
_ = gc.collect()
libc.malloc_trim(0)
all_losses = np.mean( loss_results,axis=0 )
predictions_as_ids = np.argsort(all_losses, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predict = predictions_as_answer_letters
predictions = []
for row in predict:
predictions.append( ' '.join(row[:3]) )
pd.DataFrame({'id':submit_ids,'prediction':predictions}).to_csv('submission.csv', index=False)
plt.hist( all_losses.flatten(), bins=100 )
plt.title('Logit Histogram of Test Predictions')
plt.show()
submission = pd.read_csv('submission.csv')
print( submission.shape )
submission.head()
id prediction
0 0 D E B
1 1 A B D
2 2 D A C
3 3 A C B
4 4 D A B
Compute the MAP@3 metric:
if len(submission)==200:
true = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv', usecols=['answer'])
print('CV MAP@3 =', MAP_at_3(submission.prediction.values, true.answer.values) )
CV MAP@3 = 0.9866666666666666