LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

导读:本文介绍了改进LLaMA和Alpaca模型在中文理解和生成方面能力的方法。通过扩展词表、参数高效微调、指令式微调和不同量化方法,提升了模型在指令任务和自然语言理解任务中的性能。实验结果表明改进后的模型在多个任务中表现出竞争力,展示了提升中文LLaMA和Alpaca模型的潜力。然而,文章也指出了存在的限制,并提出了未来的研究方向,如探索与人类偏好一致的输出和更先进的量化方法。
Chinese-LLaMA-Alpaca项目开源了中文 LLaMA 模型和经过指令微调的 Alpaca模型。这些模型在原始 LLaMA 的基础上,扩展了中文词汇表并使用中文数据进行二次预训练,从而进一步提高了对中文基本语义理解的能力。

目录

相关文章

论文相关

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

实战应用相关

LLMs:基于Chinese-LLaMA-Alpaca开源代码在Ng单卡利用LLaMA(Meta)和Alpaca(斯坦福)实现定义数据集(生成指令数据)→数据预处理(token分词/合并权重)→预训练(LoRA的参数/LLaMA的参数)→指令微调LoRA权重(继续训练/全新训练)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)

LLMs:在单机CPU+Windows系统上实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署(llama.cpp)且实现模型推理全流程步骤的图文教程(非常详细)

《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

ABSTRACT

1、Introduction

BERT系列专注文本理解、GPT系列更注重文本生成—更创造性

AGI之路:InstructGPT→ChatGPT→GPT-4

LLM的挑战:实施固有的限制+庞大的计算资源→开源替代方案(如LLaMA和Alpaca)

开源LLM的局限性:中文不友好【词表扩充解决】+训练成本高【高效训练LoRA】

论文贡献:额外添加2万个中文标记+采用低秩适应(LoRA)法+评估指令任务和NLU中的性能+共享资源

2、Chinese Llama And Chinese Alpaca

2.1、LLaMA:嵌入层+多个Transformer块+语言模型头+pre- norm+SwiGLU+旋转嵌入

2.2、Chinese Vocabulary Extension中文词汇扩展

LLaMA的1.4T个tokens+中文LLaMA训练中文库的2个挑战【原始LLaMA汉字覆盖率非常低+字节标记不符合中文字符难以捕捉中文语义】

2.3、Parameter Efficient Fine-Tuning With Lora具有LoRA的参数高效微调

LoRA:冻结预训练模型权重+每层注入可训练的低秩矩阵

LoRA原理:W0冻结+更新B和A+r ≪ min(d, k)减少内存消耗

LoRA同时应用在中文LLaMA和Alpaca模型的预训练和微调阶段

2.4、Pre-Training Objective预训练目标:CLM建模任务+自回归方式预测+最小化负对数似然

2.5、Supervised Fine-Tuning And Chinese Alpaca有监督微调和中文Alpaca

LLMs难以遵循用户指令生成——因其建模目标是NTP任务→所以需要通过SFT明确训练遵循指令任务→效仿Alpaca的Self-Instruct思想自动生成52K遵循指令数据+模型通过指令提示+自回归的方式训练来生成输出

与Alpaca的区别:通过在指令instruction 和输入input 之间添加“\n”换行符来将它们连接起来,形成一个新的指令

3、Experimental Setups实验设置

3.1、Experimental Setups For Pre-Training预训练的实验设置:原始的LLaMA权重初始化+fp16+预训练2阶段【LoRA】+120G语料+48个A40-48G进行1轮训练+DeepSpeed优化内存+AdamW(warm-up余弦学习率)

3.2、Experimental Setups For Instruction Fine-Tuning指令微调的实验设置:LoRA高效精调+3M指令数据

爬取数据:不需要种子任务+只需指定目标领域和指令类型

4、Results On Instruction-Following Tasks指令遵循任务的结果

4.1、Task Design And Evaluation Method任务设计和评估方法:GPT-4总体评分结合人工检查确保评分准确性和一致性+每个任务内所有样本的评分求和并归一化百分制得到任务的整体得分(10类任务和200样本)

4.2、Experimental Setups For Decoding解码的实验设置:上下文大小、最大序列长度、温度、Top-k采样、Top-p采样、重复惩罚

4.3、Results结果:基于一个成熟的模型LLaMA测试

4.3.1、Multi-Turn Dialogue多轮对话:更多数据比更大参数更重要

4.3.2、Text Generation文本生成:更多数据+较小的模型=更好

4.3.3、Numerical Calculation And Reasoning数值计算和推理——模型越大,数值推理任务更占据优势

4.3.4、CODING编码:模型越大,编码任务更占据优势

4.3.5、ETHICS伦理:模型越大,伦理回答更优

5、Results On Natural Language Understanding Tasks自然语言理解任务的结果

5.1、Task Description任务描述:C-Eval数据集(4个类别+52个学科)

5.2、Decoding Strategy解码策略:评估LLaMA模型直接输入样本、评估Alpaca模型则将示例包装在提示模板中

5.3、Comparisons To Original LLaMA与原版LLaMA的比较

中文LLaMA改进了原始LLaMA—Chinese LLaMA improves original LLaMA.

Alpaca模型相比LLaMA表现出显著改进—Alpaca models show significant improvements over LLaMA

LLaMA在少样本设置下的表现通常更好,而Alpaca更适合零样本——LLaMA generally yields better performance in a few-shot setting, while Alpaca prefers zero-shot

5.4、与其他模型的比较Comparisons To Other Models:中文Alpaca模型表现出竞争力+适当的解码策略和提示模板可提高模型性能

6、Effect Of Different Quantization Methods不同量化方法的影响:8位和6位量化方法更利于Alpaca模型

将LLM进行高效量化(比如llama.cpp)在个人计算机上部署具有多个好处:保护敏感信息+适用于资源有限用户+使用更加民主化

使用llama.cpp对Alpaca系列中文模型进行不同层次的量化:8位和6位量化方法最佳+量化训练可能成为未来训练LLMs的主流方式

7、Conclusion:词表扩充中文+指令式微调→RLHF/RLAIF+更先进的量化GPTQ+更高效微调

提出的方法(改进LLaMA模型的中文理解和生成能力)=增加2万个额外的中文token提高编码效率+指令式训练+基于10种不同任务类型的200个样本利用GPT-4评估+C-Eval测试评估

未来计划:探索RLHF或RLAIF使模型的输出与人类偏好保持一致+更先进的量化方法(如GPTQ)+替代LoRA的高效微调法

8、Limitations:有害且不可预测的内容、训练不足、缺乏鲁棒性、评估不够全面、可扩展性和效率

9、Acknowledgments致谢


相关文章

论文相关

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/130998087

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/131318974

实战应用相关

LLMs:基于Chinese-LLaMA-Alpaca开源代码在Ng单卡利用LLaMA(Meta)和Alpaca(斯坦福)实现定义数据集(生成指令数据)→数据预处理(token分词/合并权重)→预训练(LoRA的参数/LLaMA的参数)→指令微调LoRA权重(继续训练/全新训练)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)

https://yunyaniu.blog.csdn.net/article/details/131319010

LLMs:在单机CPU+Windows系统上实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署(llama.cpp)且实现模型推理全流程步骤的图文教程(非常详细)

https://yunyaniu.blog.csdn.net/article/details/131016046

《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

地址

论文地址:https://arxiv.org/abs/2304.08177

GitHub:https://github.com/ymcui/Chinese-LLaMA-Alpaca(中文LLaMA&Alpaca大语言模型+本地CPU/GPU训练部署,Chinese LLaMA & Alpaca LLMs)

作者

Yiming Cui∗ [email protected]

Ziqing Yang∗ [email protected]

Xin Yao [email protected]

时间

2023年6月15日

ABSTRACT

Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA’s existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model’s ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA’s proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community.

大型语言模型(LLMs),如ChatGPT和GPT-4,已经极大地改变了自然语言处理研究,并在通往通用人工智能(AGI)的道路上取得了令人期待的进展。然而,训练和部署LLMs所需的高成本给透明、可访问的学术研究带来了重大障碍。虽然社区已经开源了几个大型语言模型(如LLaMA),但这些模型主要关注英文语料库,限制了它们在其他语言上的适用性。在本文中,我们提出了一种方法来增强LLaMA对理解和生成中文文本以及遵循指令的能力。我们通过将LLaMA现有的词汇表扩展了2万个中文标记来实现这一目标,从而提高了它对中文的编码效率和语义理解能力。我们进一步使用中文数据进行二次预训练,并使用中文指令数据集对模型进行微调,显著提高了模型理解和执行指令的能力。我们的实验结果表明,新提出的模型显著提高了原始LLaMA在理解和生成中文内容方面的能力。此外,在C-Eval数据集上,我们的模型与规模数倍于我们的模型相比也表现出了竞争力。我们通过GitHub提供了我们的预训练模型、训练脚本和其他资源,以促进我们社区的开放研究。

1、Introduction

(1)、大语言模型(LLM),代表着自然语言处理领域的重大进展。这些模型通过其庞大的规模和广泛的训练数据,展示出处理和产生人类语言的非凡能力。ChatGPT和GPT-4代表着这个快速发展的领域的最新进展,提升了人们对通用人工智能(AGI)的期望。
(2)、但是大模型也存在限制,主要是专有性强、训练资源消耗庞大,限制了更多研究人员的参与。因此开源的LLaMA和Alpaca作为解决方案而产生,目的是促进学术研究和技术进步。
(3)、但是LLaMA和Alpaca对中文支持不足。文章提出了扩展LLaMA中文词表和使用LoRA技术来提升中文LLaMA和Alpaca的能力:扩充LLaMA的中文词表,改善中文理解能力;使用LoRA技术实现高效训练和部署;评估改进后的中文模型,表现较原始模型有明显提升。
(4)、通过在指令遵循任务和自然语言理解任务中评估所提出的LLaMA和Alpaca模型的性能,证明了这些模型在中文任务背景下相较于原始模型的显著改进。
(5)、研究结果公开,有助于NLP领域的研究和合作,鼓励LLaMA和Alpaca能适应更多语言。
LLaMA和Alpaca代表开源自然语言处理工具的进步,但对中文支持不足。文章提出了解决方案,通过扩展中文词表和LoRA技术,有效提升了这两个模型的中文理解和生成能力,有助于NLP研究领域的进一步发展。

BERT系列专注文本理解、GPT系列更注重文本生成—更创造性

Natural language processing (NLP) field has witnessed a substantial paradigm shift with the advent of Large Language Models (LLMs). These models, distinguished by their considerable size and comprehensive training data, have demonstrated extraordinary abilities in comprehending and producing human-like text. In contrast to pre-trained language models dedicated to text understanding, such as BERT (Devlin et al., 2019), the GPT series (Radford et al., 2018) accentuates text generation, positioning them as more suitable platforms for creativity compared to their counterparts. Notably, the latest members of the GPT family, namely ChatGPT and GPT-4, have garnered significant attention, establishing themselves as leading examples in this rapidly evolving field.

随着大型语言模型(LLMs)的出现,自然语言处理(NLP)领域经历了重大的范式转变。这些模型以其庞大的规模和全面的训练数据而著称,在理解和生成人类文本方面展示了非凡的能力。与专注于文本理解的预训练语言模型(如BERT系列)不同,GPT系列更加注重文本生成,因此在创造性方面比其它模型更具优势。值得注意的是,GPT系列的最新成员,即ChatGPT和GPT-4,在这个快速发展的领域中获得了显著的关注,成为领先的示例。

AGI之路:InstructGPT→ChatGPT→GPT-4

ChatGPT (OpenAI, 2022), evolved from InstructGPT (Ouyang et al., 2022), serves as an advanced conversational AI model capable of conducting context-aware, human-like interactions. Its success set the stage for the development of GPT-4 (OpenAI, 2023), a more sophisticated LLM, demonstrating even greater potential in natural language understanding, generation, and various NLP tasks, especially for its multi-modal and reasoning abilities. These models have catalyzed new research directions and applications, intensifying interest in exploring the potential of Artificial General Intelligence (AGI). Exhibiting impressive performance across multiple benchmarks, they have also demonstrated capabilities for few-shot learning and adaptability to new tasks, significantly driving the expansion of NLP research. Consequently, they have inspired both researchers and industry professionals to further harness their potential across a wide array of applications, including sentiment analysis, machine translation, question-answering systems, and more.

ChatGPT是由InstructGPT演变而来,是一种先进的对话型AI模型,能够进行上下文感知、类似人类的交互。它的成功为GPT-4的开发铺平了道路,GPT-4是一种更复杂的LLM,在自然语言理解、生成和各种NLP任务方面展示出更大的潜力,特别是在多模态和推理能力方面。这些模型催生了新的研究方向和应用,加剧了对探索通用人工智能(AGI)潜力的兴趣。它们在多个基准测试中展现了出色的性能,还展示了少样本学习和适应新任务的能力,极大推动了NLP研究的扩展。因此,它们激发了研究人员和行业专业人员在情感分析、机器翻译、问答系统等各种应用领域进一步发挥它们的潜力。

LLM的挑战:实施固有的限制+庞大的计算资源→开源替代方案(如LLaMA和Alpaca)

However, as impactful as LLMs have been, their implementation comes with inherent limitations that hamper transparent and open research. A major concern is their proprietary nature, which restricts access to the models, thus inhibiting the broader research community’s ability to build upon their successes. Furthermore, the vast computational resources necessary for training and deploying these models present a challenge for researchers with limited resources, further compounding the accessibility problem.

To tackle these limitations, the NLP research community has gravitated towards open-source alternatives to promote greater transparency and collaboration. LLaMA (Touvron et al., 2023) and Alpaca (Taori et al., 2023a) serve as notable examples of such initiatives. These open-source LLMs are intended to facilitate academic research and accelerate progress within the NLP field. The aim of open-sourcing these models is to foster an environment conducive to further advancements in model development, fine-tuning, and evaluation, ultimately leading to the creation of robust, capable LLMs applicable to a wide variety of uses.

然而,尽管LLMs产生了重大影响,但它们的实施存在固有的限制,阻碍了透明和开放的研究。一个主要问题是它们的专有性质,限制了对模型的访问,从而限制了更广泛的研究社区在其基础上进行研究的能力。此外,训练和部署这些模型所需的庞大计算资源对于资源有限的研究人员来说是一个挑战,进一步加剧了可访问性问题。

为了解决这些限制,NLP研究社区已经转向开源替代方案,以促进更大的透明度和合作。LLaMA和Alpaca是这种倡议的显著例子。这些开源LLM旨在促进学术研究,并加速NLP领域的进展。开源这些模型的目的是营造一个有利于模型开发、微调和评估进一步发展的环境,最终创建出适用于各种用途的稳健、有能力的LLM。

开源LLM的局限性:中文不友好【词表扩充解决】+训练成本高【高效训练LoRA】

Despite the considerable strides made by LLaMA and Alpaca in NLP, they exhibit inherent limitations concerning native support for Chinese language tasks. Their vocabularies contain only a few hundred Chinese tokens, substantially hindering their efficiency in encoding and decoding Chinese text. Building on our previous work with the Chinese BERT series (Cui et al., 2021) and Chinese minority-oriented multilingual pre-trained models (Yang et al., 2022), in this technical report, we propose the development of Chinese LLaMA and Alpaca models with enhanced capabilities for understanding and generating Chinese content. We extend the original LLaMA’s vocabulary with an additional 20,000 Chinese tokens, significantly improving its proficiency in processing and generating Chinese text. To ensure efficient training and deployment of these models, we employ the Low-Rank Adaptation (LoRA) approach (Hu et al., 2021), enabling us to train and fine-tune the models without excessive computational costs. We anticipate our preliminary study to enhance the Chinese understanding and generation capabilities of LLaMA and Alpaca serves as a foundation for researchers aiming to adapt these models to other languages. By showcasing the feasibility and effectiveness of our approach, we offer insights and methodologies that can be employed to extend vocabularies and improve the performance of LLaMA and Alpaca models in various languages.

尽管LLaMA和Alpaca在NLP领域取得了显著的进展,但它们在支持中文语言任务方面存在固有的局限性。它们的词汇表只包含几百个中文标记,严重影响了其对中文文本的编码和解码效率。

基于我们之前在中文BERT系列和面向少数民族的多语种预训练模型方面的工作,我们在本技术报告中提出了增强中文理解和生成能力的中文LLaMA和Alpaca模型的开发。我们通过额外添加2万个中文标记来扩展原始LLaMA的词汇表,显著提高了其处理和生成中文文本的能力。

为了确保这些模型的高效训练和部署,我们采用了低秩适应(LoRA)方法,使我们能够在不过多增加计算成本的情况下训练和微调这些模型。我们预计我们的初步研究将增强LLaMA和Alpaca对中文理解和生成能力,为希望将这些模型适应其他语言的研究人员奠定基础。通过展示我们的方法的可行性和有效性,我们提供了可以用于扩展词汇表并提高LLaMA和Alpaca模型在各种语言中性能的见解和方法。

论文贡献:额外添加2万个中文标记+采用低秩适应(LoRA)法+评估指令任务和NLU中的性能+共享资源

In summary, the contributions of this technical report are as follows:

•We enhance the encoding and decoding efficiency of the Chinese language and improve LLaMA’s Chinese understanding ability by extending the original LLaMA’s vocabulary with an additional 20,000 Chinese tokens.

•We employ the Low-Rank Adaptation (LoRA) approach to facilitate efficient training and deployment of the Chinese LLaMA and Alpaca models, enabling researchers to work with these models without incurring excessive computational costs.

•We evaluate the performance of the proposed LLaMA and Alpaca models in instruction-following tasks and natural language understanding tasks, thereby demonstrating substantial improvements over their original counterparts in the context of Chinese language tasks.

•We make the resources and findings of our study publicly available, fostering further research and collaboration in the NLP community and encouraging the adaptation of LLaMA and Alpaca models to other languages.

总之,本技术报告的贡献包括:

>> 我们通过额外添加2万个中文标记,提高了中文语言的编码和解码效率,并改善了LLaMA在中文理解方面的能力。

>> 我们采用低秩适应(LoRA)方法,促进了中文LLaMA和Alpaca模型的高效训练和部署,使研究人员能够在不过多增加计算成本的情况下使用这些模型。

>> 我们评估了所提出的LLaMA和Alpaca模型在遵循指令任务和自然语言理解任务中的性能,从而在中文语言任务的背景下展示了与原始模型相比的显著改进。

>> 我们公开共享了研究的资源和发现,促进了NLP社区中的进一步研究和合作,并鼓励将LLaMA和Alpaca模型适应其他语言。

2、Chinese Llama And Chinese Alpaca

本章主要介绍了中文LLaMA和中文Alpaca模型的开发过程,重点在于中文词汇表扩充和参数高效微调,以提升模型在中文理解和生成任务中的性能。
LLaMA是一个基于transformer结构的大语言模型,通过预训练达到了与GPT-3相当的效果。通过有效的中文词表扩充、LoRA参数高效微调和指令精调技术,文章高效地提升了LLaMA和Alpaca模型处理中文任务的能力,为NLP领域的进一步发展奠定基础。研究结果公开,旨在促进NLP领域的进步和合作。

2.1、LLaMA:嵌入层+多个Transformer块+语言模型头+pre-norm+SwiGLU+旋转嵌入

LLaMA (Touvron et al., 2023) is a foundational, decoder-only large language model built upon the transformer architecture (Vaswani et al., 2017). Similar to the GPT series and other transformer-based LLMs, LLaMA consists of an embedding layer, multiple transformer blocks, and a language model head. LLaMA also incorporates improvements utilized in different models, such as pre-normalization (Zhang & Sennrich, 2019), SwiGLU activation (Shazeer, 2020), and rotary embeddings (Su et al., 2021). LLaMA is available in four different model sizes: 7B, 13B, 33B, and 65B.

LLaMA has been pre-trained with a standard language modeling task (see Section 2.4) using a mix of publicly available sources, such as crawled web pages, books, Wikipedia, and preprint papers. Experimental findings reveal that LLaMA delivers competitive performance compared to other LLMs like GPT-3, albeit at a smaller model size. This compactness and effectiveness have garnered considerable attention from researchers, leading to the widespread use of LLaMA-based models.

LLaMA(Touvron等,2023)是基于Transformer架构(Vaswani等,2017)构建的仅解码器(decoder-only)基础大型语言模型。与GPT系列和其他基于Transformer的语言模型类似,LLaMA由嵌入层、多个Transformer块和语言模型头组成。LLaMA还融合了不同模型中使用的改进方法,如pre-normalization预标准化(Zhang和Sennrich,2019)、SwiGLU激活函数(Shazeer,2020)和旋转嵌入(Su等,2021)。

LLaMA提供了四种不同的模型规模:7B、13B、33B和65B。

LLaMA经过标准的语言建模任务(见第2.4节)进行了预训练,使用了公开可得的数据源,如爬取的网页、图书、维基百科和预印论文。实验结果表明,与其他LLM(如GPT-3)相比,LLaMA在更小的模型规模下表现出竞争力。这种紧凑和高效性引起了研究人员的广泛关注,导致了基于LLaMA的模型的广泛使用。

2.2、Chinese Vocabulary Extension中文词汇扩展

(1)、LLaMA在中文文本理解和生成方面的能力有限,对中文不友好,主要原因有两个:
>> 中文token少:LLaMA的原始词表中包含的中文token很少,无法很好地表示中文文本;
>> 处理中文效率低下:LLaMA使用byte token来表示未知字符(包括中文字符),但byte token的设计初衷不是处理中文,效率低下;

LLaMA的1.4T个tokens+中文LLaMA训练中文库的2个挑战【原始LLaMA汉字覆盖率非常低+字节标记不符合中文字符难以捕捉中文语义】

LLaMA’s training set encompasses roughly 1.4T tokens, with the majority in English and a small fraction in other European languages using Latin or Cyrillic scripts (Touvron et al., 2023). Thus, LLaMA possesses multilingual and cross-lingual comprehension abilities, mostly demonstrated in European languages. Interestingly, our prior preliminary study reveals that LLaMA exhibits basic Chinese understanding ability, although its capacity to generate Chinese texts is limited.

To equip LLaMA with enhanced Chinese understanding and generation capabilities, we propose to continue pre-training the LLaMA model with Chinese corpora. However, directly applying continual pre-training with Chinese corpora encounters several challenges. Firstly, the original LLaMA vocabulary covers less than a thousand Chinese characters, which is insufficient to encode general Chinese texts. Although the LLaMA tokenizer circumvents this issue by tokenizing unknown UTF-8 characters to bytes, this strategy significantly extends sequence length and slows down the encoding and decoding efficiency of Chinese texts, as each Chinese character splits into 3-4 byte tokens. Secondly, byte tokens are not exclusively designed to represent Chinese characters. Since byte tokens also signify UTF-8 tokens in other languages, it becomes challenging for byte tokens and transformer encoders to effectively learn representations capturing the semantic meaning of Chinese characters.

LLaMA的训练集包含大约1.4T个tokens,其中大部分为英文,只有一小部分使用拉丁字母或西里尔字母的其他欧洲语言(Touvron等,2023)。因此,LLaMA具有多语言和跨语言理解能力,主要在欧洲语言中展现出来。有趣的是,我们的初步研究发现,LLaMA展现了基本的中文理解能力,尽管其生成中文文本的能力有限。

为了增强LLaMA在中文理解和生成方面的能力,我们提出继续使用中文语料库对LLaMA模型进行预训练。然而,直接应用中文语料库进行连续预训练面临几个挑战。

首先,原始的LLaMA词汇表只覆盖了不到一千个汉字,无法对一般的中文文本进行编码。尽管LLaMA分词器通过将未知的UTF-8字符分割成字节标记来规避了这个问题,但这种策略显著增加了序列长度,并降低了中文文本的编码和解码效率,因为每个汉字会被分割成3-4个字节的标记。

其次,字节标记并不是专门设计用于表示中文字符。由于字节标记还表示其他语言中的UTF-8标记,这使得字节标记和Transformer编码器难以有效学习捕捉中文字符的语义含义。

三大改进=利用SentencePiece训练中文分词器【显著减少编码长度】+合并两类分词器+修改为V′×H使LLaMA适应中文分词器

To address these problems and improve encoding efficiency, we propose to extend LLaMA vocabulary with additional Chinese tokens and adapt the model for the extended vocabulary (Yang et al., 2022). The extension process proceeds as follows:

•To enhance the tokenizer’s support for Chinese texts, we initially train a Chinese tokenizer with SentencePiece (Kudo & Richardson, 2018) on Chinese corpora with a vocabulary size of 20,000.

•We subsequently merge the Chinese tokenizer into the original LLaMA tokenizer by taking the union of their vocabularies. Consequently, we obtain a merged tokenizer, which we term the Chinese LLaMA tokenizer, with a vocabulary size of 49,953.

•To adapt the LLaMA model for the Chinese LLaMA tokenizer, we resize the word embeddings and language model head from shape V × H to V′ × H, where V = 32,000 denotes the original vocabulary size, and V′ = 49,953 is the new vocabulary size of the Chinese LLaMA tokenizer. The new rows are appended to the end of the original embedding matrices, ensuring that the embeddings of the tokens in the original vocabulary remain unaffected.

为了解决这些问题并提高编码效率,我们提议通过添加额外的中文标记来扩展LLaMA的词汇表,并调整模型以适应扩展后的词汇表(Yang等,2022)。扩展过程如下所示:

>> 利用SentencePiece训练中文分词器:为增强分词器对中文文本的支持,我们首先使用SentencePiece(Kudo和Richardson,2018)在中文语料库上训练一个中文分词器,词汇表大小为20,000。

>> 合并两类分词器:然后,我们将中文分词器与原始LLaMA分词器合并,取两者词汇表的并集。因此,我们得到了一个合并的分词器,我们将其称为中文LLaMA分词器,词汇表大小为49,953。

>> 修改为V’×H使LLaMA适应中文分词器:为使LLaMA模型适应中文LLaMA分词器,我们将单词嵌入和语言模型头的形状从V×H调整为V’×H,其中V = 32,000表示原始词汇表大小,V’ = 49,953表示中文LLaMA分词器的新词汇表大小。新的行被附加到原始嵌入矩阵的末尾,以确保原始词汇表中的标记的嵌入不受影响。

Preliminary experiments indicate that the number of tokens generated by the Chinese LLaMA tokenizer is approximately half of those generated by the original LLaMA tokenizer. Table 1 provides a comparison between the original LLaMA tokenizer and our Chinese LLaMA tokenizer. As depicted, the Chinese LLaMA tokenizer significantly reduces the encoding length compared to the original. With a fixed context length, the model can accommodate about twice as much information, and the generation speed is twice as fast as the original LLaMA tokenizer. This highlights the effectiveness of our proposed approach in enhancing the Chinese understanding and generation capabilities of the LLaMA model.

初步实验表明,中文LLaMA分词器生成的标记数量约为原始LLaMA分词器的一半。表1对比了原始LLaMA分词器和我们的中文LLaMA分词器。如图所示,与原始分词器相比,中文LLaMA分词器显著减少了编码长度。在固定的上下文长度下,模型可以容纳约两倍的信息量,生成速度比原始LLaMA分词器快一倍。这凸显了我们提出的方法在增强LLaMA模型的中文理解和生成能力方面的有效性。
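为帮助理解上述扩表流程,下面给出一个思路层面的Python示意(非官方实现,模型与语料路径均为假设;官方做法是在SentencePiece模型层面直接合并词表,细节请以原仓库脚本为准):

```python
import sentencepiece as spm
from transformers import LlamaForCausalLM, LlamaTokenizer

# 1)在中文语料上训练一个词表大小为 20,000 的 SentencePiece 分词器(语料路径为假设)
spm.SentencePieceTrainer.train(
    input="zh_corpus.txt", model_prefix="chinese_sp",
    vocab_size=20000, model_type="bpe",
)

# 2)取两个词表的并集:把中文分词器中原始 LLaMA 词表没有的词条加入
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/original-llama")
sp = spm.SentencePieceProcessor(model_file="chinese_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
llama_tokenizer.add_tokens([p for p in new_pieces if p not in llama_tokenizer.get_vocab()])

# 3)把词嵌入和 LM head 从 V×H 调整为 V'×H:新行追加在末尾,原有词条的嵌入保持不变
model = LlamaForCausalLM.from_pretrained("path/to/original-llama")
model.resize_token_embeddings(len(llama_tokenizer))
```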

2.3、Parameter Efficient Fine-Tuning With Lora具有LoRA的参数高效微调

LoRA:冻结预训练模型权重+每层注入可训练的低秩矩阵

>> 使用LoRA技术,将预训练模型的权重冻结,并在每个层中注入低秩矩阵,这样只需更新低秩矩阵的参数,即可更高效地进行训练和部署;

The conventional training paradigm that updates the full parameters of LLMs is prohibitively expensive and is not time- or cost-feasible to most labs or companies. Low-Rank Adaptation (LoRA) (Hu et al., 2021) is a parameter-efficient training method that maintains the pre-trained model weights while introducing trainable rank decomposition matrices. LoRA freezes the pre-trained model weights and injects trainable low-rank matrices into each layer. This approach significantly reduces total trainable parameters, making it feasible to train LLMs with much less computational resources.

传统的对LLM(大型语言模型)进行全参数更新的训练范式成本高昂,对于大多数实验室或公司来说,时间和成本都不可行。低秩自适应(Low-Rank Adaptation,LoRA)(Hu等,2021)是一种参数高效的训练方法,它在保持预训练模型权重的同时引入可训练的秩分解矩阵。LoRA冻结了预训练模型权重,并在每一层中注入可训练的低秩矩阵。这种方法显著地减少了可训练参数的总量,使得用更少的计算资源训练LLM成为可能。

LoRA原理:W0冻结+更新B和A+r ≪ min(d, k)减少内存消耗

To be specific, for a linear layer with weight matrix $W_0 \in \mathbb{R}^{d \times k}$, where $k$ is the input dimension and $d$ is the output dimension, LoRA adds two low-rank decomposed trainable matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r$ is the pre-determined rank. The forward pass with input $x$ is given by the following equation:

$$h = W_0 x + \Delta W x = W_0 x + BAx, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k} \tag{1}$$

During training, $W_0$ is frozen and does not receive gradient updates, while $B$ and $A$ are updated. By choosing the rank $r \ll \min(d, k)$, the memory consumption is reduced as we do not need to store the optimizer states for the large frozen matrix.

具体而言,对于一个具有权重矩阵 $W_0 \in \mathbb{R}^{d \times k}$ 的线性层(其中 $k$ 是输入维度,$d$ 是输出维度),LoRA 添加了两个低秩分解的可训练矩阵 $B \in \mathbb{R}^{d \times r}$ 和 $A \in \mathbb{R}^{r \times k}$,其中 $r$ 是预先确定的秩。输入为 $x$ 时的前向传播如下所示:

$$h = W_0 x + \Delta W x = W_0 x + BAx, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k} \tag{1}$$

在训练过程中,$W_0$ 被冻结,不接收梯度更新,而 $B$ 和 $A$ 则会被更新。通过选择秩 $r \ll \min(d, k)$,可以减少内存消耗,因为我们不需要为大型冻结矩阵存储优化器状态。
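公式(1)对应的前向计算可以用如下的PyTorch最小示意来理解(仅为演示低秩旁路的写法,实际训练使用PEFT库;r、alpha等取值为示例):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层旁注入低秩矩阵 B、A,对应 h = W0·x + B·A·x。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # W0 冻结,不接收梯度更新
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A ∈ R^{r×k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B ∈ R^{d×r},零初始化使初始增量为0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

lora_layer = LoRALinear(nn.Linear(4096, 4096), r=8)       # 例如注入到注意力/MLP中的线性层
```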

LoRA同时应用在中文LLaMA和Alpaca模型的预训练和微调阶段

To achieve parameter-efficient training while adhering to a tight budget, we apply LoRA training to all Chinese LLaMA and Alpaca models in our paper, including both the pre-training and fine-tuning stages. We primarily incorporate LoRA adapters into the weights of the attention module and MLP layers. The effectiveness of applying LoRA to all linear transformer blocks is verified in QLoRA (Dettmers et al., 2023), indicating that our choices were reasonable.

为了实现参数高效训练并在有限的预算内,我们将LoRA训练应用于本文中的所有中文LLaMA和Alpaca模型,包括预训练和微调阶段。我们主要将LoRA适配器应用于注意力模块和MLP层的权重。在QLoRA(Dettmers等,2023)中验证了将LoRA应用于所有线性Transformer块的有效性,这表明我们的选择是合理的。

2.4、Pre-Training Objective预训练目标:CLM建模任务+自回归方式预测+最小化负对数似然

We pre-train the Chinese LLaMA model with the standard Causal Language Modeling (CLM) task. Given an input token sequence $x = (x_0, x_1, x_2, \ldots)$, the model is trained to predict the next token $x_i$ in an autoregressive manner. Mathematically, the objective is to minimize the following negative log-likelihood:

where $\Theta$ represents the model parameters, $\mathcal{D}_{\text{PT}}$ is the pre-training dataset, $x_i$ is the token to be predicted, and $x_0, x_1, \ldots, x_{i-1}$ constitute the context.

我们使用标准的因果语言建模(Causal Language Modeling,CLM)任务对中文LLaMA模型进行预训练。给定一个输入标记序列 $x = (x_0, x_1, x_2, \ldots)$,模型被训练以自回归的方式预测下一个标记 $x_i$。数学上,目标是最小化以下负对数似然:

其中,$\Theta$ 表示模型参数,$\mathcal{D}_{\text{PT}}$ 表示预训练数据集,$x_i$ 表示要预测的标记,$x_0, x_1, \ldots, x_{i-1}$ 构成上下文。
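原文中的公式(2)即这一负对数似然目标;按上面的符号定义,其标准形式可以写作(此处为按描述还原的通用写法,细节以原论文为准):

$$\mathcal{L}_{\text{PT}}(\Theta) = -\sum_{x \in \mathcal{D}_{\text{PT}}} \sum_{i} \log p\left(x_i \mid x_0, x_1, \ldots, x_{i-1}; \Theta\right) \tag{2}$$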

2.5、Supervised Fine-Tuning And Chinese Alpaca有监督微调和中文Alpaca

>> 通过预训练和指令精调,训练出中文LLaMA和中文Alpaca模型;Stanford Alpaca作为一种基于LLaMA的指令跟随模型,通过自我指导的微调训练实现了用户指令的执行。

LLMs难以遵循用户指令生成——因其建模目标是NTP任务→所以需要通过SFT明确训练遵循指令任务→效仿Alpaca的Self-Instruct思想自动生成52K遵循指令数据+模型通过指令提示+自回归的方式训练来生成输出

Pre-trained language models can hardly follow user instructions and often generate unintended content. This is because the language modeling objective in Equation (2) is predicting the next token, not “follow the instructions and answer the questions” (Ouyang et al., 2022). To align the behavior of language models to the user’s intention, one can fine-tune the model to explicitly train it to follow instructions. Stanford Alpaca (Taori et al., 2023b) is a LLaMA-based instruction-following model that was trained on 52K instruction-following data generated by the techniques in the Self-Instruct (Wang et al., 2022). We follow the approach in Stanford Alpaca to apply self-instructed fine-tuning on Chinese LLaMA to train an instruction-following model — Chinese Alpaca.

Chinese Alpaca is trained on a combination of instruction-following datasets. Each example in the dataset consists of an instruction and an output. The supervised fine-tuning task is similar to the causal language modeling task: the model is prompted with the instruction and trained to generate the output autoregressively. The instruction is wrapped in a prompt template, and the output immediately follows the template. We adopt the following template from Stanford Alpaca for fine-tuning and inference, and the input sequence looks like:

预训练的语言模型很难遵循用户指令,常常生成意外内容。这是因为方程(2)中的语言建模目标是预测下一个标记,而不是“遵循指令并回答问题”(Ouyang等,2022)。

为了使语言模型的行为与用户意图一致,可以通过有监督微调将模型明确训练成遵循指令。Stanford Alpaca(Taori等,2023b)是一种基于LLaMA的遵循指令模型,它是使用Self-Instruct(Wang等,2022)中的技术生成的52K个遵循指令数据进行训练的。我们遵循Stanford Alpaca的方法,对中文LLaMA进行自我指导的微调,以训练一个遵循指令的模型——中文Alpaca。

中文Alpaca是在多个遵循指令数据集的组合上进行训练的。数据集中的每个示例包含一个指令和一个输出。有监督微调的任务类似于因果语言建模任务:模型通过指令提示并以自回归的方式训练来生成输出。指令被包含在提示模板中,输出紧随在模板之后。我们采用了Stanford Alpaca中的以下模板进行微调和推断,输入序列的形式如下:
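作为参考,下面给出 Stanford Alpaca 面向无 input 字段样本的提示模板,以及按本文描述拼接训练输入的方式(示意代码;具体模板字符串请以官方仓库为准,{instruction}、{output} 为占位符):

```python
# Stanford Alpaca 无 input 字段的提示模板(本文描述沿用该模板;仅作参考)
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)

def build_training_text(instruction: str, output: str) -> str:
    """拼接监督微调的输入序列:指令填入模板,输出 {output} 紧随模板之后。"""
    return PROMPT_TEMPLATE.format(instruction=instruction) + output
```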

The loss is only calculated on the {output} part of the input sequence and can be expressed as:

Here, $\Theta$ represents the model parameters, $\mathcal{D}_{\text{SFT}}$ is the fine-tuning dataset, and $x = (x_0, x_1, \ldots)$ represents the tokenized input sequence.

损失仅在输入序列的 {output} 部分计算,可以表示为:

其中,$\Theta$ 表示模型参数,$\mathcal{D}_{\text{SFT}}$ 表示微调数据集,$x = (x_0, x_1, \ldots)$ 表示分词后的输入序列。
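沿用同样的记号,只在 {output} 部分计算的负对数似然可写作(按描述还原的通用写法,求和范围限定在输出部分的标记上):

$$\mathcal{L}_{\text{SFT}}(\Theta) = -\sum_{x \in \mathcal{D}_{\text{SFT}}} \sum_{i \in \{\text{output}\}} \log p\left(x_i \mid x_0, x_1, \ldots, x_{i-1}; \Theta\right)$$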

与Alpaca的区别:通过在指令instruction和输入input之间添加“\n”换行符来将它们连接起来,形成一个新的指令

A major difference between our approach and Stanford Alpaca is that we only use the prompt template designed for examples without an input field, whereas Stanford Alpaca employs two templates for examples both with and without an input field. If the example contains a non-empty input field, we concatenate the instruction and input with an “\n” to form the new instruction. Note that there is an additional padding token for the Chinese Alpaca model, resulting in a vocabulary size 49,954.

我们的方法与Stanford Alpaca的一个主要区别是,我们只使用针对没有输入字段的示例设计的提示模板,而Stanford Alpaca则在包含和不包含输入字段的示例上使用两个模板。如果示例包含一个非空的输入字段,我们将指令和输入用“\n”连接起来形成新的指令。请注意,中文Alpaca模型额外添加了一个填充标记,导致词汇表大小为49,954。
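把上述规则落到代码上大致如下(延续前文示意中的 build_training_text,属于假设性示例):

```python
def normalize_example(instruction: str, input_text: str, output: str) -> str:
    # 若 input 字段非空,则用 "\n" 把 instruction 和 input 连接成新的指令
    if input_text.strip():
        instruction = instruction + "\n" + input_text
    return build_training_text(instruction, output)
```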

3、Experimental Setups实验设置

本章分别讨论了中文LLaMA预训练和指令精调的实验设置,详细介绍了中文LLaMA和中文Alpaca模型的实验设置,包括预训练和指令微调的具体步骤、超参数设置和数据规模。这些设置对模型的性能和效率有重要影响。
总的来说,文章通过fp16训练、LoRA技术和合适的数据集规模,高效地训练了中文LLaMA和Alpaca模型,提高了它们应对中文任务的能力。

3.1、Experimental Setups For Pre-Training预训练的实验设置:原始的LLaMA权重初始化+fp16+预训练2阶段【LoRA】+120G语料+48个A40-48G进行1轮训练+DeepSpeed优化内存+AdamW(warm-up余弦学习率)

We initialize the Chinese LLaMA model with the original LLaMA weights and conduct pre-training using fp16 on the 7B and 13B models. Additionally, for the 33B model, we employ the bitsandbytes library to train it in an 8-bit format, enhancing its efficiency and memory usage. We directly apply LoRA to attentions and MLPs for training while setting the embeddings and LM head as trainable.

For the basic version of Chinese LLaMA-7B, we utilize a two-stage pre-training approach. In stage 1, we fix the parameters of the transformer encoders within the model and only train the embeddings, adapting the newly added Chinese word vectors while minimizing the disturbance to the original model. In stage 2, we add LoRA weights (adapters) to the attention mechanisms and train the embeddings, LM heads, and newly added LoRA parameters. Note that two-stage training is not applied to other model training as it is less efficient in our preliminary study.

我们使用原始的LLaMA权重初始化中文LLaMA模型,并使用fp16进行7B和13B模型的预训练。此外,对于33B模型,我们使用bitsandbytes库以8位格式进行训练,提高了模型的效率和内存使用。我们在训练时直接应用LoRA到注意力和MLP,并将嵌入和LM head设置为可训练的。

对于基本版本的中文LLaMA-7B,我们采用两阶段的预训练方法。

在第一阶段,我们固定模型中的Transformer编码器参数,只训练嵌入,适应新添加的中文词向量,同时最小化对原始模型的干扰。

在第二阶段,我们为注意机制添加LoRA权重(适配器),并训练嵌入、LM头和新添加的LoRA参数。

需要注意的是,由于在我们的初步研究中效率较低,两阶段训练并未应用于其他模型的训练。

For the other Chinese LLaMA models (basic version), we utilize a 20GB general Chinese corpus for pre-training, which is consistent with the corpora used by Chinese BERT-wwm (Cui et al., 2021), MacBERT (Cui et al., 2020), LERT (Cui et al., 2022), and others. We also provide “Plus” version, which further expands the pre-training data to 120GB, incorporating additional data from CommonCrawl (CC) and encyclopedia sources, enhancing the model’s understanding of fundamental concepts. We concatenate all the datasets and generated chunks of block size 512 for pre-training purposes.

The models are trained on A40 GPUs (48GB VRAM) for one epoch, taking up to 48 GPUs depending on the model size. The parameter-efficient training with LoRA is performed with the PEFT library. We also utilize DeepSpeed (Rasley et al., 2020) to optimize memory efficiency during the training process. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a peak learning rate of 2e-4 and 5% warm-up cosine scheduler. Additionally, we apply gradient clipping with a value of 1.0 to mitigate potential gradient explosion.

Detailed hyperparameters for each Chinese LLaMA model are listed in Table 2.

对于其他中文LLaMA模型(基本版本),我们使用一个20GB的通用中文语料库进行预训练,这与中文BERT-wwm(Cui等,2021)、MacBERT(Cui等,2020)、LERT(Cui等,2022)等模型使用的语料库保持一致。我们还提供了“Plus”版本,将预训练数据进一步扩展到120GB,包括来自CommonCrawl(CC)和百科全书等其他数据,增强了模型对基础概念的理解。我们将所有数据集连接在一起,并生成块大小为512的数据块进行预训练。

这些模型在A40 GPU(48GB VRAM)上进行一轮训练,根据模型大小使用多达48个GPU。使用PEFT库进行具有LoRA的参数高效训练。在训练过程中,我们还使用DeepSpeed(Rasley等,2020)来优化内存效率。我们采用AdamW优化器(Loshchilov和Hutter,2019),峰值学习率为2e-4,采用5%的warm-up余弦调度器。此外,我们应用梯度剪裁,阈值为1.0,以减轻潜在的梯度爆炸问题。

每个中文LLaMA模型的详细超参数列在表2中。
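按3.1节给出的超参数,一个基于 transformers 的配置示意如下(非官方训练脚本;批大小等请以论文表2为准,DeepSpeed 需另行提供配置文件,此处从略):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/chinese-llama-pt",   # 输出目录为假设
    num_train_epochs=1,                     # 预训练 1 轮
    per_device_train_batch_size=8,          # 示例值,实际批大小见表2
    learning_rate=2e-4,                     # 峰值学习率 2e-4
    lr_scheduler_type="cosine",             # 余弦学习率调度
    warmup_ratio=0.05,                      # 5% warm-up
    max_grad_norm=1.0,                      # 梯度裁剪阈值 1.0
    fp16=True,                              # fp16 训练(需GPU环境)
    logging_steps=50,
    save_steps=1000,
)
```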

3.2、Experimental Setups For Instruction Fine-Tuning指令微调的实验设置:LoRA高效精调+3M指令数据

>> 继续使用LoRA技术实现高效精调
>> 基础版使用约200万至300万条中文指令数据进行精调,Plus版扩展到约400万至430万条
>> Plus版本包含更多STEM数据及物理、化学、生物、医学、地球科学等学科数据
>> 33B模型额外加入OASST1数据集

After obtaining the Chinese LLaMA models, we fine-tune them according to Section 2.5. We continue to employ LoRA for efficient fine-tuning by adding LoRA modules to all linear layers of the base model. We utilize approximately 2M to 3M instruction data, including translation (Xu, 2019) (550K sampled), pCLUE (250K sampled, excluding “NLU-like” data), Stanford Alpaca (50K+50K for original and translated one), and crawled SFT data for tuning basic models. For the Plus version, we expand the dataset to approximately 4M to 4.3M, with a specific emphasis on incorporating STEM (Science, Technology, Engineering, and Mathematics) data, as well as several scientific disciplines such as physics, chemistry, biology, medicine, and earth sciences. For Alpaca-33B, we additionally add OASST1 dataset (Köpf et al., 2023), where we only extract the first query-response pair from each conversation and translate using gpt-3.5-turbo API, resulting in roughly 20K data (original and translated one). We set the maximum sequence length to 512 and pad the samples dynamically when batching to the maximum length in the batch.

在获得中文LLaMA模型之后,我们根据第2.5节的方法进行微调。我们继续使用LoRA进行高效微调,通过将LoRA模块添加到基础模型的所有线性层中。我们使用约2M至3M的指令数据进行微调,包括翻译(Xu,2019)(采样550K条)、pCLUE(采样250K条,排除“类似NLU”的数据)、Stanford Alpaca(原始和翻译各50K+50K)、以及爬取的SFT数据来微调基本模型。对于Plus版本,我们将数据集扩展到约4M至4.3M条,特别注重包括STEM(科学、技术、工程和数学)数据,以及物理、化学、生物学、医学和地球科学等几个科学学科的数据。对于Alpaca-33B,我们还添加了OASST1数据集(Köpf等,2023),我们仅提取每个对话的第一个查询-响应对,并使用gpt-3.5-turbo API进行翻译,得到大约20K条数据(原始和翻译各一半)。我们将最大序列长度设置为512,并在批处理时动态填充样本以达到最大长度。

Table 2: Pre-training hyperparameters for Chinese LLaMA. QKVO: four matrices in each attention module, i.e., query, key, value, and output. MLP: three matrices in each MLP layer. Note that 7B uses a two-stage training paradigm (settings are separated by ‘/’), which is not further adopted in other models.

表2列出了中文LLaMA预训练的超参数。QKVO代表每个注意力模块中的四个矩阵,即查询(query)、键(key)、值(value)和输出(output)。MLP代表每个MLP层中的三个矩阵。需要注意的是,7B模型采用了两阶段的训练范式(设置由“/”分隔),其他模型未采用这种方法。

爬取数据:不需要种子任务+只需指定目标领域和指令类型

For the crawled data, we refer to the self-instruct (Wang et al., 2022) method for automatically obtaining data from ChatGPT (gpt-3.5-turbo API), as used in Taori et al. (2023a). Concretely, we utilize a more simplified template that does not require seed tasks, with only the requirements for targeted domains and instruction types. Templates and code details are available on GitHub.

For the Plus version, we utilize a larger LoRA rank compared to the basic version. Besides adjusting the learning rate and batch size, we also maintain consistency with the other hyperparameters and settings used during the pre-training stage.

The hyperparameters for instruction fine-tuning are listed in Table 3. Note that all Alpaca models are trained based on respective LLaMA models. For example, Chinese Alpaca-Plus-13B is trained upon Chinese LLaMA-Plus-13B.

对于爬取的数据,我们参考了自我指导(Wang等,2022)方法,通过ChatGPT(gpt-3.5-turbo API)自动获取数据,这与Taori等(2023a)中使用的方法相同。具体而言,我们使用了一个更简化的模板,不需要种子任务,只需指定目标领域和指令类型的要求。模板和代码细节可在GitHub上找到。

对于Plus版本,我们使用了比基本版本更大的LoRA rank。除了调整学习率和批量大小外,我们还保持了与预训练阶段使用的其他超参数和设置的一致性。

指令微调的超参数列在表3中。需要注意的是,所有的Alpaca模型都是基于对应的LLaMA模型进行训练的。例如,中文Alpaca-Plus-13B是基于中文LLaMA-Plus-13B进行训练的。

4、Results On Instruction-Following Tasks指令遵循任务的结果

本章介绍中文Alpaca模型在指令遵循任务上的结果和评估方法,主要包括三个部分:任务设计和评估方法、解码的实验设置以及结果分析。

综上所述,Alpaca-33B在数值推理、编码、伦理等需要较强推理能力的方面相比Alpaca-Plus-7B和Alpaca-Plus-13B有显著改进,作者推测这可能是由于较大规模的模型能够更好地捕捉训练数据中的复杂模式;而对于文本生成、多轮对话等依赖丰富数据的任务,训练数据的多少更为重要,Plus系列反而表现更好。

4.1、Task Design And Evaluation Method任务设计和评估方法:GPT-4总体评分结合人工检查确保评分准确性和一致性+每个任务内所有样本的评分求和并归一化百分制得到任务的整体得分(10类任务和200样本)

Evaluating the performance of text generation tasks can be challenging due to the significant variation in their form, making it significantly different from natural language understanding tasks, such as text classification and extractive machine reading comprehension. Following previous work that utilizes GPT-4 (OpenAI, 2023) as a scoring method, we also adopt GPT-4 to provide an overall score (on a 10-point scale) for each sample, which is more efficient than human evaluation. However, GPT-4 may not always provide accurate scores, so we perform manual checks on its ratings and adjust them if necessary. The manual checks ensure that the scores are consistent and reflect the true performance of the models being evaluated. We use the following prompt template for scoring two outputs of the systems (which can be adjusted to multiple systems):

由于文本生成任务的形式存在显著变化,评估其性能可能具有一定挑战性,这使得它与文本分类和抽取式机器阅读理解等自然语言理解任务有着明显的不同。在之前的研究中,利用GPT-4 (OpenAI, 2023)作为评分方法,我们也采用GPT-4来为每个样本提供一个总体得分(在10分制上),这比人工评估更高效。然而,GPT-4并不总是能提供准确的分数,因此我们对其评分进行手动检查,并在必要时进行调整。手动检查确保分数一致,并反映了所评估模型的真实性能。我们使用以下提示模板对系统的两个输出进行评分(可以根据需要调整为多个系统):

By employing GPT-4 as a scoring method in conjunction with manual checks, we establish a reliable evaluation framework that effectively measures the performance of our Chinese Alpaca models on a range of natural language understanding and generation tasks.

Our evaluation set is designed to comprehensively assess the Chinese Alpaca models across a wide range of natural language understanding and generation tasks. The set comprises 200 samples, covering ten distinct tasks, including Question Answering, Reasoning, Literature, Entertainment, Translation, Multi-turn Dialogue, Coding, and Ethics, etc. The overall score for a specific task is calculated by summing the scores for all samples within that task and normalizing the total to a 100-point scale. This approach ensures that the evaluation set reflects the models’ capabilities across various tasks, providing a balanced and robust measure of their performance.

通过将GPT-4作为评分方法与手动检查相结合,我们建立了一个可靠的评估框架,有效地衡量了我们的中文Alpaca模型在各种自然语言理解和生成任务上的性能。

我们的评估集旨在全面评估中文Alpaca模型在各种自然语言理解和生成任务上的能力。该集合包括200个样本,涵盖了10个不同的任务,包括问答、推理、文学、娱乐、翻译、多轮对话、编码、伦理等。对于特定任务的整体得分是通过对该任务中所有样本的得分求和,并将总分归一化到100分制来计算的。这种方法确保评估集反映了模型在各种任务上的能力,提供了一个平衡和稳健的性能衡量标准。
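按上述归一化方式,单个任务得分的计算大致如下(示意性代码,假设每个样本的GPT-4评分满分为10分):

```python
def task_score(sample_scores: list, full_mark: float = 10.0) -> float:
    """将任务内所有样本的评分求和,并归一化到 100 分制。"""
    return sum(sample_scores) / (full_mark * len(sample_scores)) * 100

print(task_score([8, 7.5, 9, 6]))  # 4 个样本的示例 → 76.25
```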

4.2、Experimental Setups For Decoding解码的实验设置:上下文大小、最大序列长度、温度、Top-k采样、Top-p采样、重复惩罚

>> 解码的实验设置部分,介绍了在实验中使用的解码超参数,包括上下文大小、生成序列长度限制、温度、Top-k采样和Top-p采样等。这些超参数的设置在多轮对话和生成任务中略有调整,以获得更好的输出效果。

The decoding process of LLMs plays a critical role in determining the quality and diversity of the generated text. In our experiments, we use the following decoding hyperparameters:

•Context size: We set the context size to 2048, which determines the maximum number of tokens the model can consider simultaneously when generating text.

•Maximum sequence length: We limit the generated sequence length to 512 tokens to ensure that the outputs remain focused and relevant to the input prompt.

•Temperature: We set the temperature to 0.2, which controls the randomness of the sampling process. Lower values make the model generate more focused and deterministic outputs, while higher values increase diversity at the cost of coherence. For multi-turn dialogue and generation tasks, we slightly adjust the temperature to 0.5 to allow a more diverse output.

•Top-k sampling: We use Top-k sampling with k = 40, meaning that the model selects its next token from the top 40 most probable tokens at each step, adding an element of randomness and diversity to the generated text.

•Top-p sampling: We also employ Top-p sampling with p = 0.9, which further enhances diversity by considering a dynamic set of tokens that collectively account for 90% of the probability mass.

•Repetition penalty: To discourage the model from generating repetitive text, we apply a repetition penalty with a factor of 1.1, penalizing tokens that have already been selected.

LLM的解码过程在确定生成文本的质量和多样性方面起着关键作用。在我们的实验中,我们使用以下解码超参数:

上下文大小:我们将上下文大小设置为2048,这决定了模型在生成文本时同时考虑的最大标记数。

最大序列长度:我们将生成的序列长度限制为512个标记,以确保输出保持关注和与输入提示相关。

温度:我们将温度设置为0.2,用于控制采样过程的随机性。较低的值使模型生成更聚焦、更确定性的输出,而较高的值会增加多样性,但会牺牲连贯性。对于多轮对话和生成任务,我们将温度略微调整为0.5,以允许更多样的输出。

Top-k采样:我们使用Top-k采样,其中k = 40,意味着模型在每个步骤从最有可能的前40个标记中选择下一个标记,从而为生成的文本增加了一定的随机性和多样性。

Top-p采样:我们还使用Top-p采样,其中p = 0.9,通过考虑一组动态的标记来增强多样性,这组标记的概率总和占到了90%。

重复惩罚:为了防止模型生成重复的文本,我们应用了系数为1.1的重复惩罚,对已经生成过的标记进行惩罚。

Note that these values may not be optimal for each testing scenario. We did not perform further tuning on these hyperparameters for each task to maintain a balanced view.

请注意,这些值对每个测试场景未必是最优的。为保持评估的平衡,我们没有针对每个任务进一步调整这些超参数。
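将上述解码超参数代入 Hugging Face transformers 的 generate 接口,大致如下(示意代码,模型路径为假设;其他推理框架的参数名类似):

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-alpaca")
model = LlamaForCausalLM.from_pretrained("path/to/chinese-alpaca", torch_dtype=torch.float16)

inputs = tokenizer("请介绍一下大语言模型的量化方法。", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,       # 生成序列长度上限 512(上下文窗口为 2048)
    do_sample=True,
    temperature=0.2,          # 多轮对话/生成任务可调到 0.5
    top_k=40,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```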

4.3、Results结果:基于一个成熟的模型LLaMA测试

>> 结果部分,展示并分析了中文Alpaca-Plus-7B、Alpaca-Plus-13B和Alpaca-33B模型在各个任务上的结果。其中,Alpaca-Plus-7B和Alpaca-Plus-13B在多轮对话和文本生成任务上表现较好,而Alpaca-33B在数值计算和推理、编码和伦理等任务上取得了显著改进。文章指出,对于代码生成等复杂任务,模型的规模越大,性能往往越好。此外,Alpaca-33B在伦理任务中不仅能够拒绝非法指令,还能给出合适的建议。

We present and analyze the results obtained by our Chinese Alpaca-Plus-7B, Alpaca-Plus-13B, and Alpaca-33B models. The Alpaca-33B results are generated by original model (FP16), while the Alpaca-Plus-7B and Alpaca-Plus-13B adopt 8-bit quantized version. The overall results are shown in Table 4. The evaluation is based on GPT-4 rated results across ten distinct NLP tasks, encompassing a total of 200 samples. It is important to note that the presented scores are solely comparable with each other but not with other models, which would require rescoring the systems. Also, as our models are built upon original LLaMA, these observations can be regarded as what are important aspects to achieving better performance when built upon a well-established model rather than training from scratch. We elaborate on the findings of several major categories in detail.

我们展示并分析了我们的中文Alpaca-Plus-7B、Alpaca-Plus-13B和Alpaca-33B模型的结果。Alpaca-33B的结果是由原始模型(FP16)生成的,而Alpaca-Plus-7B和Alpaca-Plus-13B采用了8位量化版本。总体结果如表4所示。评估基于GPT-4对10个不同NLP任务的评分结果,涵盖共计200个样本。需要注意的是,所呈现的分数仅可相互比较,不能直接与其他模型进行比较,否则需要对各系统重新评分。另外,由于我们的模型是基于原始LLaMA构建的,这些观察结果可以视为:在基于一个成熟模型(而非从头训练)时,哪些方面对获得更好性能更为重要。下面我们详细阐述几个主要类别的发现。

4.3.1、Multi-Turn Dialogue多轮对话:更多数据比更大参数更重要

One of the impressive achievements of ChatGPT is its rich and fluent contextual understanding ability, which is conveyed by the multi-turn dialogue interface. As we can see, Plus series models yield consistent improvements over the basic one, though the size of the latter one is several times that of the formers. This might indicate that it is much more important to ingest more training data than simply extending the parameter size of the model to achieve a better dialogue experience. Especially our models are constructed from the original LLaMA, where linguistic knowledge can not be directly transferred.

ChatGPT的一个令人印象深刻的成就是其丰富而流畅的上下文理解能力,这通过多轮对话界面体现出来。正如我们所见,Plus系列模型相比基本模型显示出一致的改进,尽管后者的参数规模是前者的几倍。这可能表明,为获得更好的对话体验,引入更多的训练数据比单纯扩大模型参数规模更重要,尤其是我们的模型是从原始LLaMA构建的,其中的语言知识无法直接迁移。

4.3.2、Text Generation文本生成:更多数据+较小的模型=更好

Text generation is one of the most fundamental abilities for language models. Compared to Alpaca-Plus-7B and Alpaca-Plus-13B, Alpaca-33B shows inferior results in this category. Table 5 shows an example of a text generation task. We can see that both Alpaca-Plus-7B and Alpaca-Plus-13B provide correct letter styles, which meet the requirement of the user’s prompt. Alpaca-Plus-13B provides the most comprehensive one by indicating that the applicant has thoroughly prepared all materials for visa application, making it the best generation quality among all three systems. However, Alpaca-33B does not follow the letter style, and the content is somewhat too simplified, which is clearly not as good as the others. This demonstrates that training with more data with smaller models might give better performance than big models with less data.

文本生成是语言模型最基本的能力之一。与Alpaca-Plus-7B和Alpaca-Plus-13B相比,Alpaca-33B在这个类别中表现较差。表5展示了一个文本生成任务的示例。我们可以看到,Alpaca-Plus-7B和Alpaca-Plus-13B都给出了正确的书信格式,符合用户提示的要求。Alpaca-Plus-13B指出申请人已经备齐签证申请所需的全部材料,内容最为全面,是三个系统中生成质量最好的。然而,Alpaca-33B没有遵循书信格式,内容也过于简化,明显不如其他两个模型。这表明,用更多数据训练较小的模型,可能比用较少数据训练大模型获得更好的性能。

4.3.3、Numerical Calculation And Reasoning数值计算和推理——模型越大,数值推理任务更占据优势

Numerical reasoning has been regarded as one of the most essential tasks in examining the reasoning ability of large language models. As we can see, the Alpaca-33B achieves significant improvements over Plus-7B and Plus-13B models. Table 6 shows example outputs for this task. The first prompt is well-known for probing the reasoning ability, namely “which one is heavier, 1kg of cotton or 1kg of iron?”. Both Plus-7B and Plus-13B failed to give a correct answer mentioning that “cotton is lighter than iron”. However, 33B could correctly identify that these two things are the same weight. The second prompt is a simple calculation task, asking “how many legs for a cat and a chicken”. However, as we can see, both Plus-7B and Plus-13B do not have the commonsense knowledge that a cat has four legs and two for a chicken, resulting in wrong answers. The last prompt is a numerical reasoning task to let the model predict the next number of an array. Still, only 33B model correctly identifies the pattern of the given array that the next number should be the square of its index. These observations indicate that the size of the model is vital in numerical reasoning tasks.

数值推理被认为是检验大型语言模型推理能力的最重要任务之一。正如我们所见,Alpaca-33B在Plus-7B和Plus-13B模型上取得了显著的改进。表6展示了该任务的示例输出。第一个提示是众所周知的用于探究推理能力的问题,即“1千克的棉花和1千克的铁哪个更重?”。Alpaca-Plus-7B和Alpaca-Plus-13B都未能给出正确答案,提到“棉花比铁轻”。然而,33B能够正确地识别这两个东西的重量是相同的。第二个提示是一个简单的计算任务,询问“一只猫和一只鸡有多少条腿”。然而,正如我们所见,Alpaca-Plus-7B和Alpaca-Plus-13B没有常识知识,不知道一只猫有四条腿,鸡有两条腿,导致了错误的答案。最后一个提示是一个数值推理任务,让模型预测数组的下一个数字。仅有33B模型正确地识别到给定数组的模式是下一个数字应该是其索引的平方。这些观察结果表明,模型的大小在数值推理任务中非常重要。

4.3.4、CODING编码:模型越大,编码任务更占据优势

Figure 1 shows an example of implementing the Dijkstra algorithm in Python. Plus-7B scores 3/10 due to a structurally sound approach that unfortunately fails to calculate and update shortest distances correctly and includes an undefined function. Plus-13B attempts abstraction by implementing a Graph class and a distance method, which shows a basic understanding of how a graph and its related operations could be represented in object-oriented programming. Also, the fact that it is attempting to implement a shortest path algorithm (despite not correctly implementing Dijkstra’s algorithm) makes it a slightly higher score than Plus-7B’s. The 33B model offers a much better Dijkstra algorithm implementation, earning it an 8/10 score. Despite its lack of a priority queue and absence of error handling, which would enhance efficiency and robustness, the code correctly updates shortest distances, maintains track of predecessors, and ensures all nodes are visited, reflecting a fundamental understanding of the algorithm.

图1展示了Python中实现Dijkstra算法的示例。Plus-7B因为采用了结构上正确但无法正确计算和更新最短距离以及包含未定义函数的方法而得分为3/10。Plus-13B尝试通过实现Graph类和distance方法来进行抽象,显示出对图和相关操作的基本理解,尽管未能正确实现Dijkstra算法,但由于尝试实现了最短路径算法,它的得分比Plus-7B略高。33B模型提供了更好的Dijkstra算法实现,得分为8/10。尽管缺乏优先队列和错误处理,这将提高效率和稳健性,但该代码正确地更新了最短距离,保持了前任节点的跟踪,并确保所有节点都被访问,反映了对算法的基本理解。

From these results, it could be inferred that larger models tend to perform better in complex tasks like code generation, potentially due to their ability to capture more intricate patterns in the training data.

从这些结果可以推断,较大的模型在像代码生成这样的复杂任务中往往表现更好,这可能是因为它们能够捕捉到训练数据中更复杂的模式。

4.3.5、ETHICS伦理:模型越大,伦理回答更优

Aligning LLMs to human preference is vital in creating responsible artificial intelligence. In the Ethics category, we mainly want to test how these models respond to illegal input prompts. By checking the generation results, all three systems responded properly to users’ prompts. Alpaca-33B yields slightly better performance than the others. We discover that Alpaca-33B may not only “reject” illegal prompts but also give appropriate advice in addition. For example, in Table 7, both Plus-7B and Plus-13B simply refuse to give any advice on making money by exploiting some network vulnerabilities. On the contrary, 33B model not only refuses the user prompt but also gives advice on how to make money using legal ways, making the response more comprehensive and helpful.

将LLM与人类偏好保持一致在创建负责任的人工智能方面非常重要。在伦理类别中,我们主要测试这些模型如何对非法输入提示进行响应。通过检查生成的结果,三个系统都对用户的提示做出了适当的回应。Alpaca-33B的表现略优于其他模型。我们发现Alpaca-33B不仅可以“拒绝”非法提示,还能提供适当的建议。例如,在表7中,Plus-7B和Plus-13B都简单地拒绝对如何通过利用某些网络漏洞赚钱的提问给出任何建议。相反,33B模型不仅拒绝了用户的提示,还提供了如何通过合法途径赚钱的建议,使得回应更加全面和有帮助。

Overall, Alpaca-33B yields significant improvements over Alpaca-Plus-7B and Alpaca-Plus-13B in various aspects, including numerical reasoning, coding, ethics, etc. We conjecture that these abilities are better handled by bigger models than the smaller ones, though Alpaca-33B is trained with less data. Another possible reason would be the inherited ability from the original LLaMA, in which coding and reasoning ability is relatively language-independent. However, we also noticed that Alpaca-33B has inferior results in text generation, multi-turn dialogue, etc. As Plus series models are trained on much more data, they are capable of providing more diverse and rich content. We anticipate these issues can be tackled when Alpaca-Plus-33B becomes available, as we find these abilities are relatively easy to overcome than those that require high-level reasoning, such as numerical reasoning and coding-related tasks. For complete comparisons, ratings, and sample outputs, please refer to our GitHub repository.8

总体而言,Alpaca-33B在多个方面,包括数值推理、编码、伦理等方面相比Alpaca-Plus-7B和Alpaca-Plus-13B取得了显著的改进。我们推测这些能力在更大的模型中更容易处理,而不是在较小的模型中,尽管Alpaca-33B的训练数据较少。另一个可能的原因是从原始LLaMA继承的能力,在其中编码和推理能力相对独立于语言。然而,我们还注意到Alpaca-33B在文本生成、多轮对话等方面的结果较差。由于Plus系列模型训练了更多的数据,它们能够提供更多样化和丰富的内容。我们预计这些问题在Alpaca-Plus-33B问世时可以得到解决,因为我们发现这些能力相对容易克服,而不像那些需要高级推理能力的任务,例如数值推理和与编码相关的任务。有关完整的比较、评分和示例输出,请参阅我们的GitHub仓库。

5、Results On Natural Language Understanding Tasks自然语言理解任务的结果

本章主要分析Alpaca系列中文模型在C-Eval多项选择任务上的表现,并与其他模型进行比较,主要包括四个部分:任务描述、解码策略、与原始LLaMA模型的比较以及与其他模型的比较。

5.1、Task Description任务描述:C-Eval数据集(4个类别+52个学科)

Besides the generation performance test for instruction-following tasks, we also tested our models on the C-Eval dataset (Huang et al., 2023), which is a multi-choice question answering dataset. C-Eval mainly covers four categories: STEM, Social, Humanities, and Others, consisting of nearly 14K samples for 52 disciplines. Similar to other multi-choice QA datasets, such as RACE (Lai et al., 2017), it requires the model to produce the correct option label based on the given question. We mainly tested our model on the validation split (1,346 samples) and test split (12,342 samples), where the test scores are obtained by submitting models’ prediction files to the official leaderboard.

除了针对指令跟随任务的生成性能测试外,我们还在C-Eval数据集(Huang等,2023年)上测试了我们的模型,该数据集是一个多项选择题回答数据集。C-Eval主要涵盖四个类别:STEM、社会科学、人文科学和其他,共有52个学科的近14K个样本。与其他多项选择题回答数据集(如RACE)类似,它要求模型根据给定的问题产生正确的选项标签。我们主要在验证集(1,346个样本)和测试集(12,342个样本)上测试我们的模型,其中测试分数是通过提交模型的预测文件到官方排行榜来获得的。

5.2、Decoding Strategy解码策略:评估LLaMA模型直接输入样本、评估Alpaca模型则将示例包装在提示模板中

To evaluate LLaMA models on this dataset, we directly feed the examples to these models. While when evaluating Alpaca models, we wrap the examples in the prompt template as demonstrated in Section 2.5. Then the model is asked to make a one-step prediction and give the probability distribution of the next token p(y|x), where y ∈ V (V is the vocabulary). To map the probability distribution to a valid label t in {A, B, C, D}, we extract and gather the probabilities of related tokens. We introduce a verbalizer V(·) to map each label t to tokens in the vocabulary:

The label with the max probability is taken as the final prediction.

为了在这个数据集上评估LLaMA模型,我们直接将样本输入这些模型;而在评估Alpaca模型时,我们将样本包装在提示模板中,如2.5节所示。然后,模型被要求进行一步预测,并给出下一个标记的概率分布 p(y|x),其中 y ∈ V(V 是词汇表)。为了将概率分布映射到 {A, B, C, D} 中的有效标签 t,我们提取并汇总相关标记的概率。我们引入一个verbalizer(标签词映射器)V(·),将每个标签 t 映射到词汇表中的标记:

选择概率最大的标签作为最终的预测。
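5.2节描述的做法可以用下面的最小示意来理解(模型路径为假设;此处的verbalizer只做了最简单的“标签→单个标记”映射,是对原文方法的简化):

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-alpaca")
model = LlamaForCausalLM.from_pretrained("path/to/chinese-alpaca")
LABELS = ["A", "B", "C", "D"]

@torch.no_grad()
def predict_choice(prompt: str) -> str:
    """做一步预测,比较各选项标签对应标记的概率,取概率最大的标签。"""
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]              # 下一个标记的分布(未归一化)
    probs = torch.softmax(logits, dim=-1)
    label_ids = [tokenizer.convert_tokens_to_ids(t) for t in LABELS]   # 简化的verbalizer
    scores = [probs[i].item() for i in label_ids]
    return LABELS[scores.index(max(scores))]
```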

Next, we will elaborate on our results and analysis in the following two subsections, illustrating the comparisons to the original LLaMA and other models.

接下来,我们将在接下来的两个小节中详细阐述我们的结果和分析,以便比较与原始LLaMA模型和其他模型的差异。

5.3、Comparisons To Original LLaMA与原版LLaMA的比较

>> 与原始LLaMA模型的比较部分,显示了中文LLaMA模型相对于原始LLaMA模型的改进。中文LLaMA模型在C-Eval数据集上相对原始LLaMA模型取得了一定的改进,但并非始终如此。Alpaca模型相对LLaMA模型在各种设置下都取得了显著的改进,表明指令遵循模型在处理类似NLU任务时更具能力。在少样本设置下,LLaMA模型表现较好,而Alpaca模型则更适合零样本设置。

Figure 2 demonstrates how our models evolve based on the original LLaMA. Detailed results are depicted in Table 8. We mainly describe our findings in the following aspects.


图2展示了我们的模型如何基于原始LLaMA模型进行改进。详细结果见表8。我们主要从以下几个方面描述我们的发现。

中文LLaMA改进了原始LLaMA—Chinese LLaMA improves original LLaMA.

Chinese LLaMA improves original LLaMA. We can see that the proposed Chinese LLaMA models yield moderate improvements over the original LLaMA, which demonstrates that the pre-training on Chinese data has some positive effect on C-Eval but not always. When we compare Chinese LLaMA and LLaMA-Plus, the latter does not show significant improvements over the former one, even showing inferior results for 13B setting. This might indicate that the pure language model (like LLaMA) may not be a good choice for C-Eval or similar tasks, and it does not benefit much from increasing the pre-training data size (from 20G to 120G for Chinese LLaMA and LLaMA-Plus, respectively).

我们可以看到,提出的中文LLaMA模型在原始LLaMA模型的基础上取得了适度的改进,这表明对中文数据进行的预训练对C-Eval具有一定的正面影响,但并不总是如此。当我们比较中文LLaMA和LLaMA-Plus时,后者并没有显著改进前者,甚至在13B设置中显示出了较差的结果。这可能表明纯语言模型(如LLaMA)可能不是C-Eval或类似任务的好选择,并且它在增加预训练数据大小方面(中文LLaMA和LLaMA-Plus从20G增加到120G)并没有获得太多好处。

Alpaca模型相比LLaMA表现出显著改进—Alpaca models show significant improvements over LLaMA

Alpaca models show significant improvements over LLaMA. Among different settings, such as zero-shot or 5-shot, the Alpaca model series show significant improvements over LLaMA counterparts, demonstrating that the instruction-following models are more capable of handling these NLU-like tasks than pure language models. Unlike the phenomenon observed in the LLaMA series, we can see that Alpaca-Plus models yield significant improvement over basic Alpaca models. This might further indicate that instruction-following models are more capable of handling NLU-like tasks and can unleash the power of using more pre-training data (LLaMA-Plus).

在不同的设置中,如零-shot或5-shot,Alpaca模型系列相比LLaMA模型表现出显著改进,表明指令跟随模型在处理这些类似NLU的任务上更具能力,而不仅仅是纯语言模型。与LLaMA系列观察到的现象不同,我们可以看到Alpaca-Plus模型相比基本的Alpaca模型获得了显著改进。这进一步表明指令跟随模型在处理NLU类任务时更具能力,并且可以发挥使用更多预训练数据(LLaMA-Plus)的优势。

LLaMA在少样本设置下的表现通常更好,而Alpaca更适合零样本——LLaMA generally yields better performance in a few-shot setting, while Alpaca prefers zero-shot

LLaMA generally yields better performance in a few-shot setting, while Alpaca prefers zero-shot. Generally speaking, LLaMA with 5-shot setting shows better performance than zero-shot setting, while Alpaca with zero-shot setting is much better than 5-shot one. As LLaMA is not designed for instruction-following, few-shot setting might give valuable information on how to follow the question answering structure in C-Eval. However, on the contrary, as Alpaca has already been trained with millions of instruction data, it is less likely to benefit from additional shots. Also, the official 5-shot setting uses identical prompts for all samples, making it some distraction for Alpaca models.

一般来说,LLaMA在5-shot设置下的性能要优于zero-shot设置,而Alpaca在zero-shot设置下要比5-shot设置好得多。LLaMA并不是为指令跟随而设计的,少样本设置可能为如何遵循C-Eval的问答结构提供了有价值的信息。相反,由于Alpaca已经使用数百万条指令数据进行了训练,它不太可能从额外的示例中获益。此外,官方的5-shot设置对所有样本使用相同的提示,这对Alpaca模型反而造成了一些干扰。

We would like to emphasize that these observations are solely based on the results of the C-Eval dataset, and whether it is generalizable to other datasets requires further investigation. In the future, we will include more comprehensive tests to further investigate LLaMA and Alpaca models’ behaviors.

需要强调的是,这些观察结果仅基于C-Eval数据集的结果,是否适用于其他数据集需要进一步研究。在未来,我们将进行更全面的测试,以进一步研究LLaMA和Alpaca模型的行为。

5.4、与其他模型的比较Comparisons To Other Models:中文Alpaca模型表现出竞争力+适当的解码策略和提示模板可提高模型性能

>> 与其他模型的比较部分,展示了中文Alpaca-33B和中文Alpaca-Plus-13B模型与其他LLM在C-Eval数据集上的比较。结果显示,中文Alpaca模型在这个排行榜中表现出竞争力,与Bloomz-mt-176B和GLM-130B等模型相比仅有适度差距。此外,文章还提到,采用适当的解码策略和提示模板,可以对单个LLM的性能产生重要影响。

综上所述,中文Alpaca模型在自然语言理解任务上取得了显著的改进,并在与其他模型的比较中表现出竞争力。同时,适当的解码策略和提示模板对于提高模型性能至关重要。

We include our two best-performing models, i.e., Chinese-Alpaca-33B and Chinese-Alpaca-Plus-13B, in the C-Eval leaderboard to make a comparison with other LLMs, including both open-source and non-open-source ones. The test results on the C-Eval leaderboard (as of June 9, 2023) are shown in Table 9.

我们在C-Eval排行榜中包含了我们表现最佳的两个模型,即Chinese-Alpaca-33B和Chinese-Alpaca-Plus-13B,与其他包括开源和非开源的LLM进行比较。C-Eval排行榜上的测试结果(截至2023年6月9日)如表9所示。

Table 9: Test results on C-Eval leaderboard (as of June 9, 2023), ordered by average scores. Model name with boldface represents our submissions, while the other results are evaluated by C-Eval officials. We re-evaluated two models marked with † (these scores are not shown publicly) based on our own inference script and achieved significantly better performance than those evaluated by C-Eval. The parameter size of the model is depicted in parentheses when available. Open: open-source. Avg-H: Average (Hard).

表9:C-Eval排行榜上的测试结果(截至2023年6月9日),按平均分排序。粗体表示我们的提交结果,其他结果由C-Eval官方评估。我们重新评估了标有†的两个模型(这些分数没有公开显示),基于我们自己的推理脚本获得了比C-Eval评估更好的性能。当有参数大小时,模型的参数大小用括号表示。Open:开源。Avg-H:平均(困难)。

Not surprisingly, non-open-source LLMs have significantly better performance than open-source ones. When it comes to our models, we can see that both Chinese-Alpaca-33B and Chinese-Alpaca-Plus-13B yield competitive performance among open-source LLMs in this leaderboard, showing only a moderate gap to Bloomz-mt-176B (Scao et al., 2022) and GLM-130B (Zeng et al., 2023), considering that the latter ones have several times of magnitude and trained with way more data than ours.

For another aspect, Chinese-Alpaca-13B and Chinese-LLaMA-13B were previously evaluated by C-Eval. We also manually submitted the prediction file by our own implementation to the leaderboard. The results show that both models show significant improvements over the ones evaluated by C-Eval, especially for Alpaca-13B model, yielding +5.8 average score (from 30.9 to 36.7). Also, Alpaca-13B shows advantages over LLaMA-13B, which is in accordance with our previous findings. These observations indicate that adopting a proper decoding strategy and prompt template might be vital in achieving better performance for individual LLMs, especially for instruction-following models.

不出所料,非开源LLM的性能明显优于开源LLM。就我们的模型而言,可以看到Chinese-Alpaca-33B和Chinese-Alpaca-Plus-13B在该排行榜的开源LLM中表现出竞争力,与Bloomz-mt-176B(Scao等,2022年)和GLM-130B(Zeng等,2023年)相比仅有适度差距,而后者的参数规模是我们模型的数倍,并且使用了远多于我们的数据进行训练。

另一方面,Chinese-Alpaca-13B和Chinese-LLaMA-13B之前已由C-Eval进行评估。我们还将基于自己实现的预测文件手动提交到排行榜。结果表明,两个模型相比C-Eval评估的结果都有显著改进,尤其是Alpaca-13B模型,平均分提高了5.8分(从30.9分提高到36.7分)。此外,Alpaca-13B相对于LLaMA-13B具有优势,这与我们之前的发现一致。这些观察结果表明,采用适当的解码策略和提示模板对于提升单个LLM的性能可能至关重要,特别是对于指令跟随模型。
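下面是一个经过弱化表述的示意(a hedged sketch),展示"提示模板+解码策略"在推理时如何组合:用Hugging Face transformers加载模型后,把题目包装进Alpaca风格模板,并在generate中设置温度、Top-k、Top-p和重复惩罚等解码超参数。模型路径与具体超参数取值均为占位假设,并非排行榜提交所用的精确配置。

```python
# Hedged sketch: applying a prompt template together with explicit decoding
# hyperparameters via Hugging Face transformers. Paths/values are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese-alpaca-13b"   # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n以下哪个是中国的首都?A. 上海 B. 北京 C. 广州 D. 深圳\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.2,        # low temperature for more deterministic answers
    top_k=40,
    top_p=0.9,
    repetition_penalty=1.1, # mild penalty against repeated tokens
)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```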

6、Effect Of Different Quantization Methods不同量化方法的影响:8位和6位量化方法更利于Alpaca模型

将LLM进行高效量化(比如llama.cpp)在个人计算机上部署具有多个好处:保护敏感信息+适用于资源有限用户+使用更加民主化

Deploying large language models on personal computers, particularly on CPUs, has historically been challenging due to their immense computational requirements. However, with the help of many community efforts, such as llama.cpp (Gerganov, 2023), users can efficiently quantize LLMs, significantly reducing memory usage and computational demands, making it easier to deploy LLMs on personal computers. This also enables quicker interactions with the models and facilitates local data processing. Quantizing LLMs and deploying them on personal computers offer several benefits. Firstly, it helps users protect their data privacy by ensuring that sensitive information remains within their local environment rather than being transmitted to external servers. Secondly, it democratizes access to LLMs by making them more accessible to users with limited computational resources. Lastly, it promotes the development of new applications and research directions that take advantage of local LLM deployments. Overall, the ability to deploy LLMs on personal computers using llama.cpp (or similar) paves the way for a more versatile and privacy-conscious utilization of LLMs in various domains.

在个人计算机上部署大型语言模型,特别是在CPU上,由于其巨大的计算要求而一直是具有挑战性的。然而,借助于众多社区的努力,例如llama.cpp(Gerganov,2023),用户可以高效地对LLM进行量化,显著减少内存使用和计算需求,使LLM更容易部署在个人计算机上。这还使与模型的交互更快,并促进了本地数据处理。

将LLM进行量化并在个人计算机上部署具有多个好处。首先,它帮助用户通过确保敏感信息保留在本地环境中而不被传输到外部服务器来保护数据隐私。其次,它通过使LLM对于计算资源有限的用户更易获取,使LLM的使用变得更加民主化。最后,它促进了利用本地LLM部署的新应用和研究方向的发展。

总体而言,使用llama.cpp(或类似工具)在个人计算机上部署LLM为在各个领域更加灵活和注重隐私的LLM利用铺平了道路。
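下面给出一个本地推理的极简示意,使用社区提供的llama-cpp-python绑定(llama.cpp的Python封装)。其中量化后的模型文件名为占位假设,模型需先按llama.cpp文档转换为其支持的格式,转换步骤此处不展开。

```python
# Minimal local-inference sketch using the community llama-cpp-python bindings.
# The quantized model filename below is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="chinese-alpaca-plus-7b-q8_0.bin",  # hypothetical quantized file
    n_ctx=2048,      # context window
    n_threads=8,     # CPU threads; tune for your machine
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n请用一句话介绍什么是模型量化。\n\n### Response:\n"
)
result = llm(prompt, max_tokens=128, temperature=0.2, stop=["### Instruction:"])
print(result["choices"][0]["text"])
```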

使用llama.cpp对Alpaca系列中文模型进行不同层次的量化:8位和6位量化方法最佳+量化训练可能成为未来训练LLMs的主流方式

In this section, we investigate the effect of different quantization methods. We use llama.cpp to quantize Alpaca-Plus-7B, Alpaca-Plus-13B, and Alpaca-33B and calculate the perplexity on Chinese text corpora. We quantize these models into 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit forms to compare with the original FP16 one. The results are shown in Figure 3.

在本节中,我们研究了不同量化方法的影响。我们使用llama.cpp对Alpaca-Plus-7B、Alpaca-Plus-13B和Alpaca-33B进行量化,并在中文文本语料库上计算困惑度。我们将这些模型量化为2位、3位、4位、5位、6位和8位形式,并与原始的FP16模型进行比较。结果如图3所示。
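作为补充,这里用一小段代码说明上文所用的困惑度(perplexity)指标:即模型在语料上每个token的平均负对数似然取指数。实际数值需由(量化后)模型对中文语料前向计算得到,下面仅为公式的示意实现。

```python
# Perplexity = exp(average negative log-likelihood per token).
import math

def perplexity(token_logprobs: list[float]) -> float:
    """token_logprobs: natural-log probability the model assigned to each gold token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy example: a model assigning probability 0.25 to each of four tokens
# has a perplexity of 4.
print(perplexity([math.log(0.25)] * 4))  # -> 4.0
```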

The quantization level is strictly bound to the memory usage and inference speed, and thus a trade-off must be made when choosing a proper quantization level. As we can see, the 8-bit quantization method has almost the same or even lower perplexities compared to the original FP16 model, demonstrating that it is a good choice for deploying LLMs on personal computers, with only half the size of the FP16 one. The 6-bit models also achieve decent PPLs comparable to the 8-bit one, making it a better balance of speed and performance. When we use a more aggressive quantization level, the performance drastically decreases (i.e., higher PPL), especially for 3-bit and 2-bit. We also discover that larger models are less sensitive to quantization methods than smaller ones. For example, the performance of the 33B models changes much more mildly than the others. A similar result is also observed when comparing the Plus-7B and Plus-13B models. This might indicate that though 2-bit and 3-bit quantization are less effective for smaller models, they might be a promising way to deploy larger models without significant performance loss. This is extremely helpful when users only have limited computing resources and still want to try large language models. This might also imply that quantized training may become a mainstream approach for training large language models, especially for those with limited training resources.

量化水平严格限制着内存使用和推理速度,因此在选择适当的量化水平时必须进行权衡。

正如我们所看到的,8位量化方法的困惑度几乎与原始的FP16模型相同,甚至更低,这表明它是在个人计算机上部署LLM的一个很好选择,只有FP16模型大小的一半。6位模型也实现了与8位模型相当的良好困惑度,从而在速度和性能之间取得了更好的平衡。

当我们使用更激进的量化水平时,性能会急剧下降(即困惑度更高),特别是对于3位和2位的情况。

我们还发现,较大的模型对量化方法的敏感性较小。例如,33B模型的性能变化要比其他模型更温和。在对比Plus-7B和Plus-13B模型时也观察到类似的结果。这可能表明,虽然对于较小的模型而言,2位和3位的量化效果较差,但对于部署较大的模型而言,这可能是一种在不明显降低性能的情况下进行部署的有希望的方式。

当用户只有有限的计算资源但仍希望尝试大型语言模型时,这将非常有帮助。这也可能意味着量化训练方法可能成为训练大型语言模型的主流方法,特别是对于那些资源有限的模型来说。
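为了直观说明上述"量化位宽—内存占用"的权衡,下面给出一个粗略估算的示意脚本:仅按"参数量×位宽"估算权重体积,真实的llama.cpp量化格式还包含分块缩放因子等开销,因此这些数字只是下界量级,并非精确文件大小。

```python
# Back-of-the-envelope estimate of weight size under different quantization levels.
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / (1024 ** 3)

for name, params in [("7B", 7e9), ("13B", 13e9), ("33B", 33e9)]:
    for bits in (16, 8, 6, 4, 3, 2):
        print(f"{name} @ {bits:>2}-bit ≈ {approx_weight_size_gb(params, bits):5.1f} GB")
# e.g. a 33B model drops from roughly 61 GB at FP16 to about 31 GB at 8-bit
# and around 15 GB at 4-bit, which is what makes CPU deployment feasible.
```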

7、Conclusion:词表扩充中文+指令式微调→RLHF/RLAIF+更先进的量化GPTQ+更高效微调

总的来说,本文提出了一种改进中文LLaMA模型能力的方法,通过指令式训练获得了Alpaca中文模型,并使用实验结果证明其优势。未来工作主要在模型部署及效果提升上。

提出的方法(改进LLaMA模型的中文理解和生成能力)=增加2万个额外的中文token提高编码效率+指令式训练+基于10种不同任务类型的200个样本利用GPT-4评估+C-Eval测试评估

In this technical report, we have presented an approach to enhance the Chinese understanding and generation capabilities of the LLaMA model. Acknowledging the limitations of the original LLaMA's Chinese vocabulary, we expanded it by incorporating 20K additional Chinese tokens, significantly increasing its encoding efficiency for the Chinese language. Building on the Chinese LLaMA, we employed supervised fine-tuning with instruction data, resulting in Chinese Alpaca models exhibiting improved instruction-following capabilities.

To evaluate our models effectively, we annotated 200 samples across ten distinct task types and utilized GPT-4 for evaluation. Our experiments demonstrated that the proposed models significantly outperformed the original LLaMA in Chinese understanding and generation tasks. We also tested our models on the C-Eval dataset. The results show that the proposed models achieve significant improvements and are competitive with models several times their size.

在本技术报告中,我们提出了一种增强LLaMA模型中文理解和生成能力的方法。针对原始LLaMA在中文词汇方面的局限性,我们通过增加2万个额外的中文标记来扩展词表,显著提高了对中文的编码效率。在中文LLaMA的基础上,我们使用指令数据进行有监督微调,使得中文Alpaca模型在指令跟随能力方面得到了改进。

为了有效评估我们的模型,我们对10种不同任务类型的200个样本进行了标注,并利用GPT-4进行评估。实验结果表明,所提出的模型在中文理解和生成任务上明显优于原始LLaMA模型。我们还在C-Eval数据集上对模型进行了测试,结果显示所提出的模型取得了显著改进,并展现出与规模数倍于自身的模型相竞争的性能。
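针对结论中"扩充2万个中文token以提高编码效率"这一点,下面给出一个弱化表述的对照示意:同一句中文在原始LLaMA分词器与扩充中文词表后的分词器下的token数通常会相差数倍,因为原始分词器对多数汉字只能回退到字节级编码。其中的两个checkpoint路径均为占位假设。

```python
# Hedged illustration of the encoding-efficiency claim: compare token counts of
# the same Chinese sentence under the original vs. extended tokenizer.
from transformers import LlamaTokenizer

text = "人工智能是计算机科学的一个分支。"

original = LlamaTokenizer.from_pretrained("path/to/original-llama")  # hypothetical path
extended = LlamaTokenizer.from_pretrained("path/to/chinese-llama")   # hypothetical path

print(len(original.tokenize(text)), "tokens with the original vocabulary")
print(len(extended.tokenize(text)), "tokens with the extended Chinese vocabulary")
```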

未来计划:探索RLHF或RLAIF使模型的输出与人类偏好保持一致+更先进的量化方法(如GPTQ)+替代LoRA的更高效微调方法

Looking ahead, we plan to explore Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Instructed Feedback (RLAIF) to further align the models' output with human preferences. Moreover, we intend to adopt more advanced and effective quantization methods, such as GPTQ (Frantar et al., 2022), among others. Additionally, we aim to investigate alternative methods to LoRA for more efficient and effective pre-training and fine-tuning of large language models, ultimately enhancing their performance and applicability across various tasks within the Chinese NLP community.

展望未来,我们计划探索基于人类反馈的强化学习(RLHF)或基于人工智能指导的强化学习(RLAIF),以进一步使模型的输出与人类偏好相一致。此外,我们打算采用更先进和有效的量化方法,如GPTQ(Frantar等,2022),等等。另外,我们还计划研究LoRA的替代方法,以提高大型语言模型的预训练和微调的效率和效果,从而增强其在中文NLP社区中各种任务中的性能和适用性。

8、Limitations:有害且不可预测的内容、训练不足、缺乏鲁棒性、评估不够全面、可扩展性和效率

虽然本篇文章成功改进了LLaMA和Alpaca模型的中文理解和生成能力,但同时也指出了一些限制:有害且不可预测的内容、训练不足、缺乏鲁棒性、评估不够全面、可扩展性和效率。

While this project has successfully enhanced the Chinese understanding and generation capabilities of the LLaMA and Alpaca models, several limitations must be acknowledged:

>> Harmful and unpredictable content: Though our model can reject unethical queries, these models may still generate content that is harmful or misaligned with human preferences and values. This issue may arise from biases in the training data or the models' inability to discern appropriate outputs in certain contexts.

>> Insufficient training: Due to constraints in computing power and data availability, the training of the models may not be sufficient for optimal performance. As a result, there is still room for improvement in the Chinese understanding capabilities of the models.

>> Lack of robustness: The models may exhibit brittleness in some situations, producing inconsistent or nonsensical outputs when faced with adversarial inputs or rare language phenomena.

>> Comprehensive evaluation: Evaluating large language models is an important topic in the current era. While we have seen many evaluation benchmarks for LLMs, their comprehensiveness and appropriateness for LLMs should be well-studied and examined. A more diverse and comprehensive LLM evaluation dataset and benchmark will have a great positive effect on shaping the future of LLM research.

>> Scalability and efficiency: Although we applied LoRA and quantization to make the model more accessible to a broader community, when combined with the original LLaMA, the models' large size and complexity can lead to difficulties in deployment, especially for users with limited computational resources. This issue may hinder the accessibility and widespread adoption of the models in various applications.

尽管这个项目成功地增强了LLaMA和Alpaca模型的中文理解和生成能力,但还需要承认一些限制:

>> 有害和不可预测的内容:尽管我们的模型可以拒绝不道德的查询,但这些模型仍可能生成有害或与人类偏好和价值观不一致的内容。这个问题可能源于训练数据中的偏见或模型在某些情境下无法识别适当的输出。

>> 训练不足:由于计算能力和数据可用性的限制,模型的训练可能不足以实现最佳性能。因此,模型在中文理解能力方面仍有改进的空间。

>> 鲁棒性不足:在某些情况下,模型可能表现出脆弱性,当面对对抗性输入或罕见的语言现象时,会产生不一致或荒谬的输出。

>> 综合评估:评估大型语言模型是当前重要的课题。虽然我们已经看到了许多用于LLM的评估基准,但它们的全面性和适用性应该进行深入研究和检查。一个更多样化和全面的LLM评估数据集和基准将对塑造LLM研究的未来产生积极影响。

>> 可扩展性和效率:尽管我们应用了LoRA和量化方法,使模型对更广泛的社区更易访问,但与原始LLaMA相结合,模型的庞大尺寸和复杂性可能导致在部署方面存在困难,特别是对于计算资源有限的用户。这个问题可能阻碍模型在各种应用中的可访问性和广泛采用。

Future work should address these limitations to further enhance the models’ capabilities, making them more robust, accessible, and effective for a broader range of applications in the Chinese NLP community.

未来的工作应该解决这些限制,进一步增强模型的能力,使其在中文NLP社区的更广泛应用中更加鲁棒、可访问和有效。

9、Acknowledgments致谢

The original draft was polished by OpenAI GPT-4 for grammatical corrections and clarity improvements. We would like to thank our community members for their contributions to our open-source project.

本文的初稿经过OpenAI GPT-4的润色,修正了语法并提升了表达的清晰度。我们要感谢社区成员对我们开源项目的贡献。
