PLMs: pre-trained language models
NLP: natural language processing
LLMs: large language models
LM: language modeling
AI: artificial intelligence
SLMs: statistical language models
NLMs: neural language models
RNNs: recurrent neural networks
ELMo: Embeddings from Language Models
AGI: artificial general intelligence
ICL: in-context learning
https://github.com/RUCAIBox/LLMSurvey
The basic idea of SLMs is to build the word prediction model on the Markov assumption, i.e., to predict the next word from the most recent context. SLMs with a fixed context length n are also called n-gram language models.
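Bottleneck: the curse of dimensionality. Because an exponentially growing number of transition probabilities has to be estimated, SLMs cannot accurately estimate high-order language models.
Derived techniques: backoff estimation and Good-Turing estimation, which are used to alleviate the data sparsity problem.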
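A minimal sketch of the n-gram idea with n = 2, assuming a toy whitespace-tokenized corpus. It shows the maximum-likelihood bigram estimate under the Markov assumption, why unseen bigrams get probability zero (data sparsity), and a very simple backoff score. The backoff here is a simplified "stupid backoff" style score, not Katz backoff and not Good-Turing estimation; the function names and the factor alpha = 0.4 are illustrative choices.

```python
from collections import Counter


def train_bigram(tokens):
    """Collect the counts needed for a bigram (n = 2) language model."""
    unigrams = Counter(tokens)                   # counts of every token
    contexts = Counter(tokens[:-1])              # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent pairs
    return unigrams, contexts, bigrams


def mle_prob(word, prev, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev); zero for unseen pairs."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def backoff_score(word, prev, unigrams, contexts, bigrams, alpha=0.4):
    """Use the bigram estimate if the pair was seen, else a discounted unigram.

    This is an unnormalized score (assumption: simplified backoff),
    so the values over all words do not sum to one.
    """
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / contexts[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


TOKENS = "the cat sat on the mat the cat ate".split()
UNIGRAMS, CONTEXTS, BIGRAMS = train_bigram(TOKENS)

if __name__ == "__main__":
    print(mle_prob("cat", "the", CONTEXTS, BIGRAMS))                 # seen pair
    print(mle_prob("mat", "cat", CONTEXTS, BIGRAMS))                 # unseen pair
    print(backoff_score("mat", "cat", UNIGRAMS, CONTEXTS, BIGRAMS))  # backed off
```

With larger n the number of possible contexts grows exponentially, so most n-grams are never observed; that is the sparsity that backoff and smoothing methods are meant to address.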
NLMs characterize the probability of word sequences with neural networks. This opened up the use of language models for representation learning (beyond word sequence modeling).
distributed representation of words
word prediction function conditioned on distributed word vectors
word2vec
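ELMo captures contextual information with a bidirectional LSTM (biLSTM) network, which can then be fine-tuned for specific downstream tasks.
BERT is pre-trained on large-scale unlabeled data with specially designed pre-training tasks.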
Scaling PLMs (scaling model size or data size)
Three differences between PLMs and LLMs:
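1. LLMs display surprising abilities that may not be observed in smaller PLMs.
2. LLMs are accessed through a prompting interface (e.g., the GPT-4 API).
3. The development of LLMs no longer draws a clear distinction between research and engineering; training LLMs requires practical experience in large-scale data processing and parallel training.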
LLMs refer to Transformer language models that contain hundreds of billions (or more) of parameters and are trained on massive text data.
LLMs can adopt the same Transformer architecture as small models, and can serve as pre-trained models for them.
The performance of a neural language model is characterized by three factors: model size (N), dataset size (D), and the amount of training compute (C).
The three laws were derived by fitting the model performance with
varied data sizes (22M to 23B tokens), model sizes (768M to 1.5B
non-embedding parameters) and training compute, under some assumptions
(e.g., the analysis of one factor should not be bottlenecked by the
other two factors).
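A small sketch of the power-law form behind the KM scaling law, where each factor x in {N, D, C} follows L(x) = (x_c / x)^alpha when the other two are not the bottleneck. The constants below are the approximate values reported for the KM (Kaplan et al.) fit and should be checked against the original paper before use; the function names are illustrative.

```python
# Approximate constants reported for the KM scaling law (assumption: verify
# against the original paper). L is the cross-entropy loss in nats.
ALPHA_N, N_C = 0.076, 8.8e13   # model size, non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # dataset size, tokens
ALPHA_C, C_C = 0.050, 3.1e8    # training compute, PF-days


def power_law_loss(x, x_c, alpha):
    """Loss predicted by L(x) = (x_c / x) ** alpha for one factor x."""
    if x <= 0:
        raise ValueError("x must be positive")
    return (x_c / x) ** alpha


def loss_ratio(scale, alpha):
    """Multiplicative change in loss when one factor grows by `scale`."""
    return scale ** (-alpha)


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):
        print(f"N={n:.0e}  predicted L(N)={power_law_loss(n, N_C, ALPHA_N):.3f}")
    # Effect of a 10x increase in each factor taken on its own.
    print(loss_ratio(10, ALPHA_N), loss_ratio(10, ALPHA_D), loss_ratio(10, ALPHA_C))
```

Because the exponents are small, each 10x increase in a single factor reduces the predicted loss by only a modest constant ratio, which is why scaling has to span several orders of magnitude to be visible.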
For the Chinchilla scaling law, the authors conducted rigorous experiments by varying a larger range of
model sizes (70M to 16B) and data sizes (5B to 500B tokens) and fitted a similar
scaling law, yet with different coefficients.
The KM scaling law favors a larger budget allocation in model size
than the data size, while the Chinchilla scaling law argues that the
two sizes should be increased in equal scales.
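A sketch of how the Chinchilla-style allocation can be worked out, assuming the parametric loss L(N, D) = E + A / N^alpha + B / D^beta and the common approximation C ≈ 6ND for training FLOPs. The default coefficients are the fitted values reported by Hoffmann et al. as I understand them and should be treated as approximate; the function names are illustrative.

```python
def chinchilla_loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n ** alpha + B / d ** beta


def compute_optimal(c, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Minimize the loss subject to C = 6 * N * D (assumed FLOPs estimate).

    Setting the derivative to zero along the constraint gives
    N_opt = G * (C/6)**a and D_opt = (C/6)**b / G, with
    a = beta / (alpha + beta) and b = alpha / (alpha + beta).
    """
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (c / 6.0) ** a
    d_opt = (c / 6.0) ** b / g
    return n_opt, d_opt


if __name__ == "__main__":
    for budget in (1e20, 1e21, 1e22):
        n_opt, d_opt = compute_optimal(budget)
        print(f"C={budget:.0e}  N_opt={n_opt:.3e}  D_opt={d_opt:.3e}")
```

With these exponents a is about 0.45 and b about 0.55, so model size and data size grow at nearly the same rate as the budget increases, which is the "equal scales" argument in the text.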
Problem:
However, some abilities (e.g., in-context learning) are
unpredictable according to the scaling law, which can be observed only
when the model size exceeds a certain level (as discussed below).
Emergent abilities of LLMs are formally defined as “the abilities that
are not present in small models but arise in large models”.
Three typical emergent abilities of LLMs:
1. In-context learning (ICL): assuming that the language model has been
provided with a natural language instruction and/or several task
demonstrations, it can generate the expected output for the test
instances by completing the word sequence of the input text, without
requiring additional training or gradient updates.
2. Instruction following: by fine-tuning with a mixture of multi-task
datasets formatted via natural language descriptions (called
instruction tuning), LLMs are shown to perform well on unseen tasks
that are also described in the form of instructions.
3. Step-by-step reasoning: with the chain-of-thought (CoT) prompting
strategy, LLMs can solve complex tasks that involve multiple reasoning
steps (e.g., mathematical word problems) by utilizing the prompting
mechanism that involves intermediate reasoning steps for deriving the
final answer.
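To make the prompting side of these abilities concrete, here is a minimal sketch that assembles an instruction, a few demonstrations, and a test query into one text prompt, optionally including intermediate reasoning in each demonstration (chain-of-thought style). The "Q:/A:" layout, the function name, and the demonstration fields are illustrative assumptions, not a format required by any particular model.

```python
def build_few_shot_prompt(instruction, demonstrations, query, with_rationale=False):
    """Build a text prompt for in-context learning.

    demonstrations: list of dicts with keys 'input', 'output' and,
    optionally, 'rationale' (the intermediate reasoning steps).
    """
    parts = [instruction.strip(), ""]
    for demo in demonstrations:
        parts.append(f"Q: {demo['input']}")
        if with_rationale and demo.get("rationale"):
            # Chain-of-thought style: show the reasoning before the answer.
            parts.append(f"A: {demo['rationale']} The answer is {demo['output']}.")
        else:
            parts.append(f"A: {demo['output']}")
        parts.append("")
    # The model is expected to continue the text after the final "A:".
    parts.append(f"Q: {query}")
    parts.append("A:")
    return "\n".join(parts)


DEMOS = [
    {
        "input": "Roger has 5 balls and buys 2 more. How many balls does he have?",
        "rationale": "He starts with 5 and adds 2, so 5 + 2 = 7.",
        "output": "7",
    }
]

if __name__ == "__main__":
    print(build_few_shot_prompt("Answer the question.", DEMOS, "What is 3 + 4?"))
    print("---")
    print(build_few_shot_prompt("Answer the question.", DEMOS, "What is 3 + 4?",
                                with_rationale=True))
```

No parameters are updated in either case; the only difference between plain few-shot prompting and CoT prompting in this sketch is whether the demonstrations carry their reasoning steps.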
Larger model/data sizes and more training compute typically lead to an improved model capacity.
To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed and Megatron-LM.
These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit them.
LLMs are likely to generate toxic, biased, or even harmful content for humans, so it is necessary to align LLMs with human values.
InstructGPT designs an effective tuning approach that enables LLMs to follow the expected instructions, using the technique of reinforcement learning with human feedback (RLHF).
For example, LLMs can use a calculator for accurate computation and a search engine to retrieve information they do not know.
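A toy sketch of the calculator case, assuming the model has been prompted to emit a marker such as `[calc: 123 * 456]` whenever it needs arithmetic. The marker syntax and function names are hypothetical; real systems define their own tool-calling formats. The expression is evaluated with a restricted AST walker rather than `eval()`.

```python
import ast
import operator
import re

# Only basic arithmetic is allowed (assumption: enough for a calculator demo).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expr):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")

    return _eval(ast.parse(expr.strip(), mode="eval"))


# Hypothetical marker format: [calc: <arithmetic expression>]
CALL_PATTERN = re.compile(r"\[calc:\s*([^\]]+)\]")


def resolve_calculator_calls(model_output):
    """Replace every [calc: ...] marker with the computed value."""
    return CALL_PATTERN.sub(lambda m: str(safe_eval(m.group(1))), model_output)


if __name__ == "__main__":
    print(resolve_calculator_calls("The total is [calc: 123 * 456] units."))
```

The same pattern extends to a search tool: the model emits a query marker, the surrounding program runs the search, and the result is inserted back into the text the model sees.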