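The Chinese-LLaMA-Alpaca project open-sources Chinese LLaMA models and instruction-tuned Alpaca models. Building on the original LLaMA, these models extend the Chinese vocabulary and undergo secondary pre-training on Chinese data, which further improves their basic semantic understanding of Chinese.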




LLMs: Translation and Commentary on "Efficient And Effective Text Encoding For Chinese Llama And Alpaca" (June 15 version)












2. Chinese LLaMA And Chinese Alpaca

2.1 LLaMA: embedding layer + multiple Transformer blocks + language model head + pre-normalization + SwiGLU + rotary embeddings

2.2 Chinese Vocabulary Extension

2.3 Parameter Efficient Fine-Tuning With LoRA

How LoRA works: W0 frozen + B and A updated + r ≪ min(d, k) reduces memory consumption

2.4 Pre-Training Objective: CLM task + autoregressive prediction + minimizing the negative log-likelihood

2.5 Supervised Fine-Tuning And Chinese Alpaca

Difference from Alpaca: the instruction and the input are joined with a "\n" newline to form a new instruction

3. Experimental Setups

3.1 Experimental Setups For Pre-Training: initialization from the original LLaMA weights + fp16 + two-stage pre-training (LoRA) + 120GB corpus + up to 48 A40 (48GB) GPUs for one epoch + DeepSpeed memory optimization + AdamW (warm-up cosine learning rate)

3.2 Experimental Setups For Instruction Fine-Tuning: efficient LoRA fine-tuning + roughly 2M to 3M instruction samples

4. Results On Instruction-Following Tasks

4.1 Task Design And Evaluation Method: GPT-4 overall scoring combined with manual checks for accuracy and consistency + per-task score obtained by summing all sample scores and normalizing to a 100-point scale (10 task categories, 200 samples)

4.2 Experimental Setups For Decoding: context size, maximum sequence length, temperature, Top-k sampling, Top-p sampling, repetition penalty

4.3.1 Multi-Turn Dialogue: more data matters more than more parameters

4.3.2 Text Generation: more data + smaller model = better

4.3.3 Numerical Calculation And Reasoning: larger models have the advantage on numerical reasoning

5. Results On Natural Language Understanding Tasks

5.1 Task Description: C-Eval dataset (4 categories + 52 disciplines)

5.2 Decoding Strategy: LLaMA models are evaluated on the raw sample, Alpaca models on samples wrapped in the prompt template

5.3 Comparisons To Original LLaMA

Chinese LLaMA improves original LLaMA

Alpaca models show significant improvements over LLaMA

LLaMA generally yields better performance in a few-shot setting, while Alpaca prefers zero-shot

5.4 Comparisons To Other Models: Chinese Alpaca models are competitive + appropriate decoding strategies and prompt templates improve performance

6. Effect Of Different Quantization Methods: 8-bit and 6-bit quantization suit the Alpaca models better










Yiming Cui∗, Ziqing Yang∗, Xin Yao




Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA’s existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model’s ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA’s proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community.





The field of natural language processing (NLP) has witnessed a substantial paradigm shift with the advent of Large Language Models (LLMs). These models, distinguished by their considerable size and comprehensive training data, have demonstrated extraordinary abilities in comprehending and producing human-like text. In contrast to pre-trained language models dedicated to text understanding, such as BERT (Devlin et al., 2019), the GPT series (Radford et al., 2018) accentuates text generation, positioning them as more suitable platforms for creativity compared to their counterparts. Notably, the latest members of the GPT family, namely ChatGPT and GPT-4, have garnered significant attention, establishing themselves as leading examples in this rapidly evolving field.



ChatGPT (OpenAI, 2022), evolved from InstructGPT (Ouyang et al., 2022), serves as an advanced conversational AI model capable of conducting context-aware, human-like interactions. Its success set the stage for the development of GPT-4 (OpenAI, 2023), a more sophisticated LLM, demonstrating even greater potential in natural language understanding, generation, and various NLP tasks, especially for its multi-modal and reasoning abilities. These models have catalyzed new research directions and applications, intensifying interest in exploring the potential of Artificial General Intelligence (AGI). Exhibiting impressive performance across multiple benchmarks, they have also demonstrated capabilities for few-shot learning and adaptability to new tasks, significantly driving the expansion of NLP research. Consequently, they have inspired both researchers and industry professionals to further harness their potential across a wide array of applications, including sentiment analysis, machine translation, question-answering systems, and more.



However, as impactful as LLMs have been, their implementation comes with inherent limitations that hamper transparent and open research. A major concern is their proprietary nature, which restricts access to the models, thus inhibiting the broader research community’s ability to build upon their successes. Furthermore, the vast computational resources necessary for training and deploying these models present a challenge for researchers with limited resources, further compounding the accessibility problem.

To tackle these limitations, the NLP research community has gravitated towards open-source alternatives to promote greater transparency and collaboration. LLaMA (Touvron et al., 2023) and Alpaca (Taori et al., 2023a) serve as notable examples of such initiatives. These open-source LLMs are intended to facilitate academic research and accelerate progress within the NLP field. The aim of open-sourcing these models is to foster an environment conducive to further advancements in model development, fine-tuning, and evaluation, ultimately leading to the creation of robust, capable LLMs applicable to a wide variety of uses.




Despite the considerable strides made by LLaMA and Alpaca in NLP, they exhibit inherent limitations concerning native support for Chinese language tasks. Their vocabularies contain only a few hundred Chinese tokens, substantially hindering their efficiency in encoding and decoding Chinese text. Building on our previous work with the Chinese BERT series (Cui et al., 2021) and Chinese minority-oriented multilingual pre-trained models (Yang et al., 2022), in this technical report, we propose the development of Chinese LLaMA and Alpaca models with enhanced capabilities for understanding and generating Chinese content. We extend the original LLaMA’s vocabulary with an additional 20,000 Chinese tokens, significantly improving its proficiency in processing and generating Chinese text. To ensure efficient training and deployment of these models, we employ the Low-Rank Adaptation (LoRA) approach (Hu et al., 2021), enabling us to train and fine-tune the models without excessive computational costs. We anticipate that our preliminary study on enhancing the Chinese understanding and generation capabilities of LLaMA and Alpaca will serve as a foundation for researchers aiming to adapt these models to other languages. By showcasing the feasibility and effectiveness of our approach, we offer insights and methodologies that can be employed to extend vocabularies and improve the performance of LLaMA and Alpaca models in various languages.





In summary, the contributions of this technical report are as follows:

• We enhance the encoding and decoding efficiency of the Chinese language and improve LLaMA’s Chinese understanding ability by extending the original LLaMA’s vocabulary with an additional 20,000 Chinese tokens.

• We employ the Low-Rank Adaptation (LoRA) approach to facilitate efficient training and deployment of the Chinese LLaMA and Alpaca models, enabling researchers to work with these models without incurring excessive computational costs.

• We evaluate the performance of the proposed LLaMA and Alpaca models in instruction-following tasks and natural language understanding tasks, thereby demonstrating substantial improvements over their original counterparts in the context of Chinese language tasks.

• We make the resources and findings of our study publicly available, fostering further research and collaboration in the NLP community and encouraging the adaptation of LLaMA and Alpaca models to other languages.


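>> Vocabulary: adding 20,000 Chinese tokens improves encoding and decoding efficiency for Chinese and LLaMA's Chinese understanding.

>> Training cost: Low-Rank Adaptation (LoRA) makes training and deploying the Chinese LLaMA and Alpaca models efficient, so researchers can use them without excessive computational cost.

>> Evaluation: the proposed LLaMA and Alpaca models are evaluated on instruction-following and natural language understanding tasks, showing substantial improvements over the original models on Chinese tasks.

>> Open release: the resources and findings are shared publicly to promote further research and collaboration in the NLP community and to encourage adapting LLaMA and Alpaca to other languages.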

2. Chinese LLaMA And Chinese Alpaca


2.1 LLaMA: embedding layer + multiple Transformer blocks + language model head + pre-normalization + SwiGLU + rotary embeddings

LLaMA (Touvron et al., 2023) is a foundational, decoder-only large language model built upon the transformer architecture (Vaswani et al., 2017). Similar to the GPT series and other transformer-based LLMs, LLaMA consists of an embedding layer, multiple transformer blocks, and a language model head. LLaMA also incorporates improvements utilized in different models, such as pre-normalization (Zhang & Sennrich, 2019), SwiGLU activation (Shazeer, 2020), and rotary embeddings (Su et al., 2021). LLaMA is available in four different model sizes: 7B, 13B, 33B, and 65B.

LLaMA has been pre-trained with a standard language modeling task (see Section 2.4) using a mix of publicly available sources, such as crawled web pages, books, Wikipedia, and preprint papers. Experimental findings reveal that LLaMA delivers competitive performance compared to other LLMs like GPT-3, albeit at a smaller model size. This compactness and effectiveness have garnered considerable attention from researchers, leading to the widespread use of LLaMA-based models.

>> In short: LLaMA is a decoder-only foundation model on the Transformer architecture. Like the GPT series, it stacks an embedding layer, multiple Transformer blocks, and a language model head, and it adopts pre-normalization, the SwiGLU activation, and rotary embeddings from other models.



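2.2 Chinese Vocabulary Extension

>> Few Chinese tokens: LLaMA's original vocabulary contains very few Chinese tokens and cannot represent Chinese text well.
>> Inefficient Chinese processing: LLaMA falls back to byte tokens for unknown characters, including Chinese ones, but byte tokens were not designed for Chinese, so encoding is inefficient.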


LLaMA’s training set encompasses roughly 1.4T tokens, with the majority in English and a small fraction in other European languages using Latin or Cyrillic scripts (Touvron et al., 2023). Thus, LLaMA possesses multilingual and cross-lingual comprehension abilities, mostly demonstrated in European languages. Interestingly, our prior preliminary study reveals that LLaMA exhibits basic Chinese understanding ability, although its capacity to generate Chinese texts is limited.

To equip LLaMA with enhanced Chinese understanding and generation capabilities, we propose to continue pre-training the LLaMA model with Chinese corpora. However, directly applying continual pre-training with Chinese corpora encounters several challenges. Firstly, the original LLaMA vocabulary covers less than a thousand Chinese characters, which is insufficient to encode general Chinese texts. Although the LLaMA tokenizer circumvents this issue by tokenizing unknown UTF-8 characters to bytes, this strategy significantly extends sequence length and slows down the encoding and decoding efficiency of Chinese texts, as each Chinese character splits into 3-4 byte tokens. Secondly, byte tokens are not exclusively designed to represent Chinese characters. Since byte tokens also signify UTF-8 tokens in other languages, it becomes challenging for byte tokens and transformer encoders to effectively learn representations capturing the semantic meaning of Chinese characters.
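The "3-4 byte tokens per character" figure follows directly from UTF-8. The sketch below counts the bytes a character would occupy under pure byte fallback. It is a worst-case illustration, assuming none of the characters is in the vocabulary; the function name and sample strings are made up for this example.

```python
def byte_fallback_length(text: str) -> int:
    """Number of byte tokens needed if every character falls back to raw UTF-8 bytes."""
    return len(text.encode("utf-8"))


# Four common Chinese characters (all in the CJK Unified Ideographs block).
sample = "中文编码"
bytes_per_char = [len(ch.encode("utf-8")) for ch in sample]

# A rare character from CJK Extension B (U+20000) needs four bytes.
rare_char = "\U00020000"

print(bytes_per_char)                 # bytes used by each common character
print(byte_fallback_length(sample))   # total byte tokens for the 4-character string
print(byte_fallback_length(rare_char))
```

Under these assumptions a four-character string already costs twelve byte tokens, which is why a dedicated Chinese vocabulary shortens sequences so much.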






To address these problems and improve encoding efficiency, we propose to extend LLaMA vocabulary with additional Chinese tokens and adapt the model for the extended vocabulary (Yang et al., 2022). The extension process proceeds as follows:

• To enhance the tokenizer’s support for Chinese texts, we initially train a Chinese tokenizer with SentencePiece (Kudo & Richardson, 2018) on Chinese corpora with a vocabulary size of 20,000.

• We subsequently merge the Chinese tokenizer into the original LLaMA tokenizer by taking the union of their vocabularies. Consequently, we obtain a merged tokenizer, which we term the Chinese LLaMA tokenizer, with a vocabulary size of 49,953.

• To adapt the LLaMA model for the Chinese LLaMA tokenizer, we resize the word embeddings and language model head from shape V × H to V′ × H, where V = 32,000 denotes the original vocabulary size, and V′ = 49,953 is the new vocabulary size of the Chinese LLaMA tokenizer. The new rows are appended to the end of the original embedding matrices, ensuring that the embeddings of the tokens in the original vocabulary remain unaffected.


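>> Train a Chinese tokenizer with SentencePiece: a tokenizer with a 20,000-piece vocabulary is trained on Chinese corpora with SentencePiece (Kudo & Richardson, 2018).

>> Merge the two tokenizers: the Chinese tokenizer is merged into the original LLaMA tokenizer by taking the union of their vocabularies, which gives the Chinese LLaMA tokenizer with 49,953 entries.

>> Resize to V′ × H: the word embeddings and language model head are resized from V × H to V′ × H (V = 32,000, V′ = 49,953). New rows are appended at the end, so the embeddings of the original tokens are unchanged.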
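The three steps above can be sketched on a toy scale. This is a minimal illustration, not the project's merge script: the tiny vocabularies, the helper names, and the Gaussian initialization of the new rows (standard deviation 0.02) are all assumptions, since the text does not say how new rows are initialized. The reported sizes (32,000 + 20,000 pieces giving 49,953) imply that roughly 2,047 pieces appear in both vocabularies, which is the deduplication the union performs.

```python
import random


def merge_vocab(original, chinese):
    """Union of two vocabularies; original ids are preserved, new pieces are appended."""
    seen = set(original)
    merged = list(original)
    for piece in chinese:
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged


def resize_embeddings(matrix, new_vocab_size, hidden_size, rng):
    """Grow a V x H matrix to V' x H by appending freshly initialized rows."""
    extra = new_vocab_size - len(matrix)
    # Assumption: small Gaussian init for the new rows.
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(extra)]
    return matrix + new_rows


# Toy vocabularies (hypothetical); "中" is shared and must not be duplicated.
original_vocab = ["<unk>", "<s>", "</s>", "▁the", "中"]
chinese_vocab = ["中", "中国", "语言", "模型"]
merged = merge_vocab(original_vocab, chinese_vocab)

hidden_size = 4
old_embeddings = [[float(i)] * hidden_size for i in range(len(original_vocab))]
new_embeddings = resize_embeddings(old_embeddings, len(merged), hidden_size, random.Random(0))

print(len(original_vocab), "->", len(merged))
```

Because rows are only appended, every original token keeps both its id and its trained embedding, which is the property the last step relies on.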

Preliminary experiments indicate that the number of tokens generated by the Chinese LLaMA tokenizer is approximately half of those generated by the original LLaMA tokenizer. Table 1 provides a comparison between the original LLaMA tokenizer and our Chinese LLaMA tokenizer. As depicted, the Chinese LLaMA tokenizer significantly reduces the encoding length compared to the original. With a fixed context length, the model can accommodate about twice as much information, and generation is about twice as fast as with the original LLaMA tokenizer. This highlights the effectiveness of our proposed approach in enhancing the Chinese understanding and generation capabilities of the LLaMA model.


2.3 Parameter Efficient Fine-Tuning With LoRA


>> LoRA freezes the pre-trained weights and injects low-rank matrices into each layer, so only the low-rank parameters are updated and training and deployment become more efficient.

The conventional training paradigm that updates the full parameters of LLMs is prohibitively expensive and is not time- or cost-feasible to most labs or companies. Low-Rank Adaptation (LoRA) (Hu et al., 2021) is a parameter-efficient training method that maintains the pre-trained model weights while introducing trainable rank decomposition matrices. LoRA freezes the pre-trained model weights and injects trainable low-rank matrices into each layer. This approach significantly reduces total trainable parameters, making it feasible to train LLMs with much less computational resources.

>> In short: full-parameter updates are too costly in time and money for most labs and companies. LoRA keeps the pre-trained weights frozen and trains only the low-rank matrices injected into each layer, which sharply reduces the number of trainable parameters and the compute needed.

How LoRA works: W0 frozen + B and A updated + r ≪ min(d, k) reduces memory consumption

To be specific, for a linear layer with weight matrix $W_0 \in \mathbb{R}^{d \times k}$, where $k$ is the input dimension and $d$ is the output dimension, LoRA adds two low-rank decomposed trainable matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r$ is the pre-determined rank. The forward pass with input $x$ is given by the following equation:

$$h = W_0 x + \Delta W x = W_0 x + BAx, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k} \tag{1}$$

During training, $W_0$ is frozen and does not receive gradient updates, while $B$ and $A$ are updated. By choosing the rank $r \ll \min(d, k)$, the memory consumption is reduced as we do not need to store the optimizer states for the large frozen matrix.

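>> In short: for a linear layer with weight W0 of shape d × k (k is the input dimension, d the output dimension), LoRA adds two trainable low-rank matrices, B of shape d × r and A of shape r × k, with a pre-determined rank r. For input x the forward pass is h = W0x + BAx. W0 is frozen and receives no gradient updates, while B and A are trained. Choosing r ≪ min(d, k) reduces memory consumption because no optimizer states are stored for the large frozen matrix.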
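A small sketch of Equation (1) and of the parameter saving it buys. It uses plain Python lists so it runs without dependencies. The toy dimensions are arbitrary, the zero initialization of B follows the LoRA paper (Hu et al., 2021), and the 4096 × 4096 layer with r = 8 in the parameter count is an illustrative assumption, not a setting taken from this report.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, b, a, x):
    """h = W0 x + B (A x); W0 is frozen, only B and A would receive gradients."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]


def trainable_params(d, k, r):
    """Trainable parameters of one d x k layer: (full fine-tuning, LoRA)."""
    return d * k, r * (d + k)


rng = random.Random(0)
d, k, r = 6, 5, 2                                                    # toy sizes
w0 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]      # d x k, frozen
a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(r)]       # r x k, trainable
b = [[0.0] * r for _ in range(d)]                                    # d x r, zero init
x = [rng.uniform(-1, 1) for _ in range(k)]

h = lora_forward(w0, b, a, x)
full, lora = trainable_params(4096, 4096, 8)
print(full, lora, full // lora)
```

With B initialized to zero, the adapted layer starts out identical to the frozen one, so training begins from the pre-trained behavior. In the illustrative 4096 × 4096 case the adapter trains 256 times fewer parameters than the full matrix.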


To achieve parameter-efficient training while adhering to a tight budget, we apply LoRA training to all Chinese LLaMA and Alpaca models in our paper, including both the pre-training and fine-tuning stages. We primarily incorporate LoRA adapters into the weights of the attention module and MLP layers. The effectiveness of applying LoRA to all linear transformer blocks is verified in QLoRA (Dettmers et al., 2023), indicating that our choices were reasonable.


2.4 Pre-Training Objective: CLM task + autoregressive prediction + minimizing the negative log-likelihood

We pre-train the Chinese LLaMA model with the standard Causal Language Modeling (CLM) task. Given an input token sequence $x = (x_0, x_1, x_2, \ldots)$, the model is trained to predict the next token $x_i$ in an autoregressive manner. Mathematically, the objective is to minimize the following negative log-likelihood:

$$\mathcal{L}_{\mathrm{CLM}}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{PT}}}\left[-\sum_i \log p(x_i \mid x_0, x_1, \ldots, x_{i-1}; \Theta)\right] \tag{2}$$

where $\Theta$ represents the model parameters, $\mathcal{D}_{\mathrm{PT}}$ is the pre-training dataset, $x_i$ is the token to be predicted, and $x_0, x_1, \ldots, x_{i-1}$ constitute the context.

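>> In short: the Chinese LLaMA model is pre-trained with the standard causal language modeling (CLM) task. Given a token sequence x = (x0, x1, x2, ...), the model predicts each next token xi autoregressively, and the objective is to minimize the negative log-likelihood of xi given its context x0, x1, ..., xi−1, where Θ denotes the model parameters and DPT the pre-training dataset.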
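To make the objective concrete, the sketch below evaluates the negative log-likelihood for one toy sequence. The probabilities are made-up values standing in for what a model might assign to the correct next token at each position; no real model is involved.

```python
import math


def clm_negative_log_likelihood(step_probs):
    """Sum of -log p(x_i | x_0..x_{i-1}) over the positions of one sequence."""
    return -sum(math.log(p) for p in step_probs)


# Hypothetical probabilities assigned to the correct next token at three positions.
probs = [0.5, 0.25, 0.125]
loss = clm_negative_log_likelihood(probs)
print(loss)
```

The loss is zero only when every correct token receives probability 1, and it grows as the model spreads probability onto wrong tokens; averaging this quantity over the pre-training corpus gives the expectation in the objective.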

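2.5 Supervised Fine-Tuning And Chinese Alpaca

>> Pre-training followed by instruction fine-tuning yields the Chinese LLaMA and Chinese Alpaca models. Stanford Alpaca is a LLaMA-based instruction-following model that learns to carry out user instructions through self-instruct fine-tuning.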


Pre-trained language models can hardly follow user instructions and often generate unintended content. This is because the language modeling objective in Equation (2) is predicting the next token, not “follow the instructions and answer the questions” (Ouyang et al., 2022). To align the behavior of language models to the user’s intention, one can fine-tune the model to explicitly train it to follow instructions. Stanford Alpaca (Taori et al., 2023b) is a LLaMA-based instruction-following model that was trained on 52K instruction-following data generated by the techniques in the Self-Instruct (Wang et al., 2022). We follow the approach in Stanford Alpaca to apply self-instructed fine-tuning on Chinese LLaMA to train an instruction-following model: Chinese Alpaca.

Chinese Alpaca is trained on a combination of instruction-following datasets. Each example in the dataset consists of an instruction and an output. The supervised fine-tuning task is similar to the causal language modeling task: the model is prompted with the instruction and trained to generate the output autoregressively. The instruction is wrapped in a prompt template, and the output immediately follows the template. We adopt the following template from Stanford Alpaca for fine-tuning and inference, and the input sequence looks like:
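For reference, the no-input prompt template published in the Stanford Alpaca repository can be written as the Python string below. The wording comes from that repository; the exact whitespace used by Chinese Alpaca may differ, and the helper name and the sample instruction are assumptions made for this sketch.

```python
# Stanford Alpaca's template for examples without an input field.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_training_text(instruction: str, output: str) -> str:
    """Wrap the instruction in the template; the output follows immediately after it."""
    return PROMPT_NO_INPUT.format(instruction=instruction) + output


# Hypothetical example pair.
text = build_training_text("Translate 'hello' into Chinese.", "你好")
print(text)
```

At inference time the same template is used without the output, and the model continues the text after the response marker.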


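>> In short: supervised fine-tuning trains the model explicitly to follow instructions so that its behavior matches user intent. Stanford Alpaca (Taori et al., 2023b) is trained on 52K instruction-following samples generated with Self-Instruct (Wang et al., 2022), and the same approach applied to Chinese LLaMA gives Chinese Alpaca.

>> Chinese Alpaca is trained on several instruction-following datasets, each example being an instruction and an output. The task resembles causal language modeling: the instruction is wrapped in the Stanford Alpaca prompt template, the output follows the template directly, and the model learns to generate the output autoregressively. The same template is used for fine-tuning and inference.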

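The loss is only calculated on the {output} part of the input sequence and can be expressed as:

$$\mathcal{L}_{\mathrm{SFT}}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{SFT}}}\left[-\sum_{i \in \{\text{output}\}} \log p(x_i \mid x_0, x_1, \ldots, x_{i-1}; \Theta)\right] \tag{3}$$

Here, $\Theta$ represents the model parameters, $\mathcal{D}_{\mathrm{SFT}}$ is the fine-tuning dataset, and $x = (x_0, x_1, \ldots)$ represents the tokenized input sequence.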

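>> In short: the loss is computed only over the {output} tokens of the input sequence, where Θ denotes the model parameters, DSFT the fine-tuning dataset, and x = (x0, x1, ...) the tokenized input sequence.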
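Restricting the loss to the {output} part is usually implemented by masking the labels of the prompt tokens. The sketch below shows the idea with toy token ids. The ignore value -100 is the default `ignore_index` of PyTorch's cross-entropy loss and is used here as an assumed convention; the per-token loss values are invented.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (assumed convention)


def build_labels(prompt_ids, output_ids):
    """Concatenate prompt and output; only output positions keep a real label."""
    input_ids = list(prompt_ids) + list(output_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(output_ids)
    return input_ids, labels


def sft_loss(token_nll, labels):
    """Sum per-token negative log-likelihoods over positions that are not ignored."""
    return sum(nll for nll, label in zip(token_nll, labels) if label != IGNORE_INDEX)


# Hypothetical ids: three prompt tokens followed by two output tokens.
input_ids, labels = build_labels([1, 2, 3], [4, 5])

# Hypothetical per-token negative log-likelihoods produced by a model.
token_nll = [0.9, 0.8, 0.7, 0.5, 0.25]
loss = sft_loss(token_nll, labels)
print(labels, loss)
```

The prompt tokens still serve as context for predicting the output, but they contribute nothing to the gradient.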

Difference from Alpaca: the instruction and the input are joined with a "\n" newline to form a new instruction

A major difference between our approach and Stanford Alpaca is that we only use the prompt template designed for examples without an input field, whereas Stanford Alpaca employs two templates for examples both with and without an input field. If the example contains a non-empty input field, we concatenate the instruction and input with an “\n” to form the new instruction. Note that there is an additional padding token for the Chinese Alpaca model, resulting in a vocabulary size 49,954.

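>> In short: only the template for examples without an input field is used, whereas Stanford Alpaca keeps one template for each case. A non-empty input is joined to the instruction with "\n" to form the new instruction. Chinese Alpaca also adds one padding token, which brings its vocabulary size to 49,954.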
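The single-template convention amounts to a one-line preprocessing step. In this sketch the function name is hypothetical, and treating a whitespace-only input as empty is an assumption that goes slightly beyond the "non-empty input field" wording.

```python
def merge_instruction(instruction: str, input_text: str = "") -> str:
    """Fold a non-empty input field into the instruction, separated by a newline."""
    if input_text.strip():  # assumption: whitespace-only input counts as empty
        return instruction + "\n" + input_text
    return instruction


with_input = merge_instruction("Summarize the text.", "LLaMA is a decoder-only model.")
without_input = merge_instruction("Tell me a joke.")
print(repr(with_input))
print(repr(without_input))
```

The merged string is then placed in the instruction slot of the no-input template, so every example goes through the same prompt format.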

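3. Experimental Setups


3.1 Experimental Setups For Pre-Training: initialization from the original LLaMA weights + fp16 + two-stage pre-training (LoRA) + 120GB corpus + up to 48 A40 (48GB) GPUs for one epoch + DeepSpeed memory optimization + AdamW (warm-up cosine learning rate)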

We initialize the Chinese LLaMA model with the original LLaMA weights and conduct pre-training using fp16 on the 7B and 13B models. Additionally, for the 33B model, we employ the bitsandbytes library to train it in an 8-bit format, enhancing its efficiency and memory usage. We directly apply LoRA to attentions and MLPs for training while setting the embeddings and LM head as trainable.

For the basic version of Chinese LLaMA-7B, we utilize a two-stage pre-training approach. In stage 1, we fix the parameters of the transformer encoders within the model and only train the embeddings, adapting the newly added Chinese word vectors while minimizing the disturbance to the original model. In stage 2, we add LoRA weights (adapters) to the attention mechanisms and train the embeddings, LM heads, and newly added LoRA parameters. Note that two-stage training is not applied to other model training as it is less efficient in our preliminary study.

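>> In short: the models start from the original LLaMA weights. The 7B and 13B models are pre-trained in fp16, and the 33B model is trained in 8-bit with the bitsandbytes library to save memory. LoRA is applied directly to the attention and MLP layers, while the embeddings and LM head remain trainable.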





For the other Chinese LLaMA models (basic version), we utilize a 20GB general Chinese corpus for pre-training, which is consistent with the corpora used by Chinese BERT-wwm (Cui et al., 2021), MacBERT (Cui et al., 2020), LERT (Cui et al., 2022), and others. We also provide “Plus” version, which further expands the pre-training data to 120GB, incorporating additional data from CommonCrawl (CC) and encyclopedia sources, enhancing the model’s understanding of fundamental concepts. We concatenate all the datasets and generate chunks of block size 512 for pre-training purposes.
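Concatenating the corpora and cutting them into blocks of 512 tokens can be sketched as follows. The function name is hypothetical, the documents are stand-in lists of integers rather than real token ids, and dropping the incomplete final block is an assumption, since the text does not say how the remainder is handled.

```python
def group_into_blocks(tokenized_docs, block_size=512):
    """Concatenate tokenized documents and split the stream into fixed-size blocks."""
    flat = [token for doc in tokenized_docs for token in doc]
    usable = (len(flat) // block_size) * block_size  # assumption: drop the short tail
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Two stand-in "documents" of 700 and 400 tokens.
docs = [list(range(700)), list(range(400))]
blocks = group_into_blocks(docs)
print(len(blocks), [len(block) for block in blocks])
```

Blocks may span document boundaries, which keeps every training sequence at full length and avoids padding during pre-training.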

The models are trained on A40 GPUs (48GB VRAM) for one epoch, taking up to 48 GPUs depending on the model size. The parameter-efficient training with LoRA is performed with PEFT library. We also utilize DeepSpeed (Rasley et al., 2020) to optimize memory efficiency during the training process. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a peak learning rate of 2e-4 and 5% warm-up cosine scheduler. Additionally, we apply gradient clipping with a value of 1.0 to mitigate potential gradient explosion.

Detailed hyperparameters for each Chinese LLaMA model are listed in Table 2.


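>> In short: one epoch on A40 GPUs (48GB VRAM), up to 48 GPUs depending on model size; LoRA through the PEFT library; DeepSpeed (Rasley et al., 2020) for memory efficiency; AdamW (Loshchilov & Hutter, 2019) with a peak learning rate of 2e-4 and a cosine scheduler with 5% warm-up; gradient clipping at 1.0 against gradient explosion.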
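The learning-rate schedule described above can be written as a small function. The peak value 2e-4 and the 5% warm-up come from the text; the linear shape of the warm-up and the decay all the way to zero are assumptions, since the report only says "warm-up cosine scheduler". The step count of 1000 is arbitrary.

```python
import math


def learning_rate(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 49, 525, 1000):
    print(step, learning_rate(step, total))
```

The warm-up avoids large early updates to the freshly added embedding rows, and the cosine tail lets the model settle as training ends.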


3.2 Experimental Setups For Instruction Fine-Tuning: efficient LoRA fine-tuning + roughly 2M to 3M instruction samples

>> The Plus version includes more data from scientific disciplines and STEM fields.

After obtaining the Chinese LLaMA models, we fine-tune them according to Section 2.5. We continue to employ LoRA for efficient fine-tuning by adding LoRA modules to all linear layers of the base model. We utilize approximately 2M to 3M instruction data, including translation (Xu, 2019) (550K sampled), pCLUE (250K sampled, excluding “NLU-like” data), Stanford Alpaca (50K+50K for original and translated one), and crawled SFT data for tuning basic models. For the Plus version, we expand the dataset to approximately 4M to 4.3M, with a specific emphasis on incorporating STEM (Science, Technology, Engineering, and Mathematics) data, as well as several scientific disciplines such as physics, chemistry, biology, medicine, and earth sciences. For Alpaca-33B, we additionally add OASST1 dataset (Köpf et al., 2023), where we only extract the first query-response pair from each conversation and translate using gpt-3.5-turbo API, resulting in roughly 20K data (original and translated one). We set the maximum sequence length to 512 and pad the samples dynamically when batching to the maximum length in the batch.

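>> In short: the Chinese LLaMA models are fine-tuned as in Section 2.5, with LoRA modules on all linear layers of the base model. Basic models use about 2M to 3M instruction samples: translation (Xu, 2019; 550K sampled), pCLUE (250K sampled, without "NLU-like" data), Stanford Alpaca (50K original plus 50K translated), and crawled SFT data. The Plus version grows to about 4M to 4.3M samples, with emphasis on STEM data and on physics, chemistry, biology, medicine, and earth sciences. Alpaca-33B also adds OASST1 (Köpf et al., 2023), keeping only the first query-response pair of each conversation and translating it with the gpt-3.5-turbo API, for about 20K samples counting originals and translations. The maximum sequence length is 512, and each batch is padded dynamically to its longest sample.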
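Dynamic padding means each batch is padded only to its own longest sample rather than to the global maximum of 512. A minimal collate function might look like this; the name `collate`, the pad id 0, right-side padding, and truncation of over-long samples are assumptions made for the sketch.

```python
def collate(batch, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad every sample to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Hypothetical token id sequences of different lengths.
input_ids, attention_mask = collate([[5, 6, 7], [8], [9, 10]])
print(input_ids)
print(attention_mask)
```

Short batches stay short, so less compute is spent on padding tokens than with a fixed length of 512.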

Table 2: Pre-training hyperparameters for Chinese LLaMA. QKVO: four matrices in each attention module, i.e., query, key, value, and output. MLP: three matrices in each MLP layer. Note that 7B uses a two-stage training paradigm (settings are separated by ‘/’), which is not further adopted in other models.



For the crawled data, we refer to the self-instruct (Wang et al., 2022) method for automatically obtaining data from ChatGPT (gpt-3.5-turbo API), as used in Taori et al. (2023a). Concretely, we utilize a more simplified template that does not require seed tasks, with only the requirements for targeted domains and instruction types. Templates and code details are available on GitHub.

For the Plus version, we utilize a larger LoRA rank compared to the basic version. Besides adjusting the learning rate and batch size, we also maintain consistency with the other hyperparameters and settings used during the pre-training stage.

The hyperparameters for instruction fine-tuning are listed in Table 3. Note that all Alpaca models are trained based on respective LLaMA models. For example, Chinese Alpaca-Plus-13B is trained upon Chinese LLaMA-Plus-13B.

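>> In short: the crawled data is obtained automatically from ChatGPT (gpt-3.5-turbo API) following the self-instruct method (Wang et al., 2022), as in Taori et al. (2023a), but with a simpler template that needs no seed tasks and only states the target domains and instruction types. Templates and code are available on GitHub.

>> The Plus version uses a larger LoRA rank than the basic version. Apart from the learning rate and batch size, the other hyperparameters and settings match the pre-training stage.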


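4. Results On Instruction-Following Tasks



4.1 Task Design And Evaluation Method: GPT-4 overall scoring combined with manual checks for accuracy and consistency + per-task score obtained by summing all sample scores and normalizing to a 100-point scale (10 task categories, 200 samples)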

Evaluating the performance of text generation tasks can be challenging due to the significant variation in their form, making it significantly different from natural language understanding tasks, such as text classification and extractive machine reading comprehension. Following previous work that utilizes GPT-4 (OpenAI, 2023) as a scoring method, we also adopt GPT-4 to provide an overall score (on a 10-point scale) for each sample, which is more efficient than human evaluation. However, GPT-4 may not always provide accurate scores, so we perform manual checks on its ratings and adjust them if necessary. The manual checks ensure that the scores are consistent and reflect the true performance of the models being evaluated. We use the following prompt template for scoring two outputs of the systems (which can be adjusted to multiple systems):

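>> In short: text generation outputs vary widely in form, which makes them harder to evaluate than understanding tasks such as text classification or extractive reading comprehension. Following earlier work, GPT-4 (OpenAI, 2023) gives each sample an overall score on a 10-point scale, which is more efficient than human evaluation. Because GPT-4 is not always accurate, its ratings are checked manually and adjusted where necessary. The scoring prompt compares two system outputs and can be extended to more systems.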

By employing GPT-4 as a scoring method in conjunction with manual checks, we establish a reliable evaluation framework that effectively measures the performance of our Chinese Alpaca models on a range of natural language understanding and generation tasks.

Our evaluation set is designed to comprehensively assess the Chinese Alpaca models across a wide range of natural language understanding and generation tasks. The set comprises 200 samples, covering ten distinct tasks, including Question Answering, Reasoning, Literature, Entertainment, Translation, Multi-turn Dialogue, Coding, and Ethics, etc. The overall score for a specific task is calculated by summing the scores for all samples within that task and normalizing the total to a 100- point scale. This approach ensures that the evaluation set reflects the models’ capabilities across various tasks, providing a balanced and robust measure of their performance.
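The per-task score described above is a simple normalization. In this sketch the function name and the sample scores are invented, and dividing by the maximum attainable total (10 points per sample) is an assumed reading of "normalizing the total to a 100-point scale".

```python
def task_score(sample_scores, max_per_sample=10):
    """Sum the per-sample scores of one task and rescale the total to 0-100."""
    return 100.0 * sum(sample_scores) / (max_per_sample * len(sample_scores))


# Hypothetical GPT-4 ratings (10-point scale) for four samples of one task.
ratings = [8, 9, 7, 10]
print(task_score(ratings))
```

Normalizing by the number of samples makes tasks with different sample counts comparable on the same scale.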



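4.2 Experimental Setups For Decoding: context size, maximum sequence length, temperature, Top-k sampling, Top-p sampling, repetition penalty

>> This section lists the decoding hyperparameters used in the experiments: context size, limit on generated sequence length, temperature, Top-k sampling, Top-p sampling, and repetition penalty. The temperature is adjusted slightly for multi-turn dialogue and generation tasks to obtain better outputs.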

The decoding process of LLMs plays a critical role in determining the quality and diversity of the generated text. In our experiments, we use the following decoding hyperparameters:

• Context size: We set the context size to 2048, which determines the maximum number of tokens the model can consider simultaneously when generating text.

• Maximum sequence length: We limit the generated sequence length to 512 tokens to ensure that the outputs remain focused and relevant to the input prompt.

• Temperature: We set the temperature to 0.2, which controls the randomness of the sampling process. Lower values make the model generate more focused and deterministic outputs, while higher values increase diversity at the cost of coherence. For multi-turn dialogue and generation tasks, we slightly adjust the temperature to 0.5 to allow a more diverse output.

• Top-k sampling: We use Top-k sampling with k = 40, meaning that the model selects its next token from the top 40 most probable tokens at each step, adding an element of randomness and diversity to the generated text.

• Top-p sampling: We also employ Top-p sampling with p = 0.9, which further enhances diversity by considering a dynamic set of tokens that collectively account for 90% of the probability mass.

• Repetition penalty: To discourage the model from generating repetitive text, we apply a repetition penalty with a factor of 1.1, penalizing tokens that have already been selected.





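>> Top-k sampling: with k = 40, the next token is drawn from the 40 most probable tokens at each step, which adds some randomness and diversity.

>> Top-p sampling: with p = 0.9, the candidate set is the dynamic group of tokens whose probabilities add up to 90% of the mass.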


Note that these values may not be optimal for each testing scenario. We did not perform further tuning on these hyperparameters for each task to maintain a balanced view.
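The decoding hyperparameters listed above interact at every generation step. The sketch below applies them to one vector of logits using the reported defaults. The order of operations (penalty, temperature, Top-k, Top-p) and the penalty rule (divide positive logits, multiply negative ones) follow a common convention and are assumptions here, as are the function name and the toy logits; this is not the decoding code used in the experiments.

```python
import math
import random


def sample_next_token(logits, generated, rng, temperature=0.2, top_k=40,
                      top_p=0.9, repetition_penalty=1.1):
    """Pick the next token id from raw logits using the settings described above."""
    # 1) Repetition penalty: make already generated tokens less likely.
    adjusted = list(logits)
    for token in set(generated):
        if adjusted[token] > 0:
            adjusted[token] /= repetition_penalty
        else:
            adjusted[token] *= repetition_penalty

    # 2) Temperature: values below 1 sharpen the distribution.
    scaled = [value / temperature for value in adjusted]

    # 3) Top-k: keep the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (shifted by the maximum for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 4) Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_tokens, kept_probs, mass = [], [], 0.0
    for token, prob in zip(ranked, probs):
        kept_tokens.append(token)
        kept_probs.append(prob)
        mass += prob
        if mass >= top_p:
            break

    return rng.choices(kept_tokens, weights=kept_probs, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(sample_next_token(toy_logits, generated=[], rng=rng))
```

At temperature 0.2 the best token in this toy example already holds more than 90% of the probability mass, so Top-p keeps only that token and sampling becomes effectively greedy, which matches the remark that lower temperatures give more deterministic output.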



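>> The results section presents and analyzes the Chinese Alpaca-Plus-7B, Alpaca-Plus-13B, and Alpaca-33B models on each task. Alpaca-Plus-7B and Alpaca-Plus-13B do better on multi-turn dialogue and text generation, whereas Alpaca-33B is clearly stronger on numerical calculation and reasoning, and the paper reports the same for coding and ethics. The paper notes that for complex tasks such as code generation, larger models tend to perform better, and that on ethics tasks Alpaca-33B not only refuses illegal instructions but also offers appropriate advice.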

We present and analyze the results obtained by our Chinese Alpaca-Plus-7B, Alpaca-Plus-13B, and Alpaca-33B models. The Alpaca-33B results are generated by the original model (FP16), while the Alpaca-Plus-7B and Alpaca-Plus-13B adopt 8-bit quantized version. The overall results are shown in Table 4. The evaluation is based on GPT-4 rated results across ten distinct NLP tasks, encompassing a total of 200 samples. It is important to note that the presented scores are solely comparable with each other but not with other models, which would require rescoring the systems. Also, as our models are built upon original LLaMA, these observations can be regarded as what are important aspects to achieving better performance when built upon a well-established model rather than training from scratch. We elaborate on the findings of several major categories in detail.


4.3.1 Multi-Turn Dialogue: more data matters more than more parameters

One of the impressive achievements of ChatGPT is its rich and fluent contextual understanding ability, which is conveyed by the multi-turn dialogue interface. As we can see, Plus series models yield consistent improvements over the basic one, though the size of the latter is several times that of the former. This might indicate that it is much more important to ingest more training data than simply extending the parameter size of the model to achieve a better dialogue experience, especially since our models are constructed from the original LLaMA, where linguistic knowledge cannot be directly transferred.


4.3.2、Text Generation: more data with a smaller model gives better results

Text generation is one of the most fundamental abilities for language models. Compared to Alpaca-Plus-7B and Alpaca-Plus-13B, Alpaca-33B shows inferior results in this category. Table 5 shows an example of a text generation task. We can see that both Alpaca-Plus-7B and Alpaca-Plus-13B provide correct letter styles, which meet the requirement of the user’s prompt. Alpaca-Plus-13B provides the most comprehensive one by indicating that the applicant has thoroughly prepared all materials for visa application, making it the best generation quality among all three systems. However, Alpaca-33B does not follow the letter style, and the content is somewhat too simplified, which is clearly not as good as the others. This demonstrates that training smaller models with more data might give better performance than training big models with less data.


4.3.3、Numerical Calculation And Reasoning: larger models have the advantage on numerical reasoning tasks

Numerical reasoning has been regarded as one of the most essential tasks in examining the reasoning ability of large language models. As we can see, Alpaca-33B achieves significant improvements over the Plus-7B and Plus-13B models. Table 6 shows example outputs for this task. The first prompt is well-known for probing the reasoning ability, namely “which one is heavier, 1kg of cotton or 1kg of iron?”. Both Plus-7B and Plus-13B failed to answer correctly, claiming that “cotton is lighter than iron”. However, 33B correctly identified that the two weigh the same. The second prompt is a simple calculation task, asking “how many legs for a cat and a chicken”. However, as we can see, both Plus-7B and Plus-13B lack the commonsense knowledge that a cat has four legs and a chicken has two, resulting in wrong answers. The last prompt is a numerical reasoning task that asks the model to predict the next number of an array. Still, only the 33B model correctly identifies the pattern of the given array, namely that the next number should be the square of its index. These observations indicate that the size of the model is vital in numerical reasoning tasks.



Figure 1 shows an example of implementing the Dijkstra algorithm in Python. Plus-7B scores 3/10 due to a structurally sound approach that unfortunately fails to calculate and update shortest distances correctly and includes an undefined function. Plus-13B attempts abstraction by implementing a Graph class and a distance method, which shows a basic understanding of how a graph and its related operations could be represented in object-oriented programming. Also, the fact that it is attempting to implement a shortest path algorithm (despite not correctly implementing Dijkstra’s algorithm) earns it a slightly higher score than Plus-7B. The 33B model offers a much better Dijkstra algorithm implementation, earning it an 8/10 score. Despite its lack of a priority queue and absence of error handling, which would enhance efficiency and robustness, the code correctly updates shortest distances, keeps track of predecessors, and ensures all nodes are visited, reflecting a fundamental understanding of the algorithm.
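For reference, the features mentioned in the rating (a priority queue, basic error handling, distance updates, and predecessor tracking) can be combined as in the sketch below. This is not the output of any of the evaluated models; the graph representation and the function name `dijkstra` are assumptions chosen for this example.

```python
import heapq


def dijkstra(graph, source):
    """Shortest paths from source in a graph with non-negative edge weights.

    graph: dict mapping each node to a dict {neighbor: weight}; every
    neighbor must also appear as a key of graph.
    Returns (distances, predecessors).
    """
    if source not in graph:
        raise KeyError(f"unknown source node: {source!r}")

    distances = {node: float("inf") for node in graph}
    predecessors = {node: None for node in graph}
    distances[source] = 0
    queue = [(0, source)]  # priority queue of (distance, node)

    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances[node]:
            continue  # stale entry: a shorter path was already found
        for neighbor, weight in graph[node].items():
            if weight < 0:
                raise ValueError("Dijkstra requires non-negative weights")
            candidate = dist + weight
            if candidate < distances[neighbor]:
                distances[neighbor] = candidate
                predecessors[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return distances, predecessors


# Small illustrative graph.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
distances, predecessors = dijkstra(graph, "A")
```

Following `predecessors` back from a target node reconstructs the shortest path; here the path to `"D"` goes through `"B"` and `"C"`.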


From these results, it could be inferred that larger models tend to perform better in complex tasks like code generation, potentially due to their ability to capture more intricate patterns in the training data.



Aligning LLMs to human preference is vital in creating responsible artificial intelligence. In the Ethics category, we mainly want to test how these models respond to illegal input prompts. By checking the generation results, all three systems responded properly to users’ prompts. Alpaca-33B yields slightly better performance than the others. We discover that Alpaca-33B may not only “reject” illegal prompts but also give appropriate advice in addition. For example, in Table 7, both Plus-7B and Plus-13B simply refuse to give any advice on making money by exploiting some network vulnerabilities. By contrast, the 33B model not only refuses the user prompt but also gives advice on how to make money in legal ways, making the response more comprehensive and helpful.


Overall, Alpaca-33B yields significant improvements over Alpaca-Plus-7B and Alpaca-Plus-13B in various aspects, including numerical reasoning, coding, and ethics. We conjecture that these abilities are better handled by bigger models than by smaller ones, even though Alpaca-33B is trained with less data. Another possible reason is the ability inherited from the original LLaMA, in which coding and reasoning ability is relatively language-independent. However, we also noticed that Alpaca-33B has inferior results in text generation, multi-turn dialogue, and similar categories. As the Plus series models are trained on much more data, they are capable of providing more diverse and rich content. We anticipate these issues can be tackled when Alpaca-Plus-33B becomes available, as we find these weaknesses easier to overcome than those that require high-level reasoning, such as numerical reasoning and coding-related tasks. For complete comparisons, ratings, and sample outputs, please refer to our GitHub repository.


5、Results On Natural Language Understanding Tasks


5.1、Task Description: the C-Eval dataset (4 categories, 52 disciplines)

Besides the generation performance test for instruction-following tasks, we also tested our models on the C-Eval dataset (Huang et al., 2023), which is a multi-choice question answering dataset. C-Eval mainly covers four categories: STEM, Social, Humanities, and Others, consisting of nearly 14K samples for 52 disciplines. Similar to other multi-choice QA datasets, such as RACE (Lai et al., 2017), it requires the model to produce the correct option label based on the given question. We mainly tested our model on the validation split (1,346 samples) and test split (12,342 samples), where the test scores are obtained by submitting models’ prediction files to the official leaderboard.


5.2、Decoding Strategy: LLaMA models are fed the examples directly, while Alpaca models receive examples wrapped in the prompt template

To evaluate LLaMA models on this dataset, we directly feed the examples to these models. When evaluating Alpaca models, by contrast, we wrap the examples in the prompt template as demonstrated in Section 2.5. Then the model is asked to make a one-step prediction and give the probability distribution of the next token p(y|x), where y ∈ V (V is the vocabulary). To map the probability distribution to a valid label t in {A, B, C, D}, we extract and gather the probabilities of related tokens. We introduce a verbalizer V(·) that maps each label t to its tokens in the vocabulary.

The label with the max probability is taken as the final prediction.
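The label-mapping step can be sketched as follows. This is only an illustration under stated assumptions: the verbalizer's actual token sets are not given in this text, so the token ids below (two hypothetical ids per label, for example a label with and without a leading-space marker) and the toy logits are invented for the example.

```python
import math


def predict_label(next_token_logits, verbalizer):
    """Map a next-token distribution to one of the option labels.

    next_token_logits: list of logits indexed by token id.
    verbalizer: dict {label: [token ids that express this label]}.
    Returns (predicted_label, {label: gathered probability}).
    """
    # Softmax over the whole vocabulary gives p(y | x).
    max_logit = max(next_token_logits)
    exps = [math.exp(value - max_logit) for value in next_token_logits]
    normalizer = sum(exps)
    probs = [value / normalizer for value in exps]

    # Gather the probability of every token that the verbalizer maps to a label.
    label_probs = {
        label: sum(probs[token_id] for token_id in token_ids)
        for label, token_ids in verbalizer.items()
    }
    # The label with the maximum gathered probability is the prediction.
    best = max(label_probs, key=label_probs.get)
    return best, label_probs


# Hypothetical 9-token vocabulary; ids 0-7 express labels, id 8 is unrelated.
verbalizer = {"A": [0, 4], "B": [1, 5], "C": [2, 6], "D": [3, 7]}
logits = [1.0, 2.0, 0.5, 0.0, 0.2, 2.5, 0.1, 0.3, 3.0]
label, label_probs = predict_label(logits, verbalizer)
```

Note that the unrelated token (id 8) has the highest individual logit in this toy input, yet it is ignored because only tokens covered by the verbalizer compete for the final label.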



Next, we will elaborate on our results and analysis in the following two subsections, illustrating the comparisons to the original LLaMA and other models.


5.3、Comparisons To Original LLaMA

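>> Comparison with the original LLaMA: Chinese LLaMA yields moderate improvements over the original LLaMA on C-Eval, though not consistently. Alpaca models show significant improvements over their LLaMA counterparts across settings, indicating that instruction-following models are better at NLU-like tasks. LLaMA performs better in the few-shot setting, while Alpaca is better suited to zero-shot.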

Figure 2 demonstrates how our models evolve based on the original LLaMA. Detailed results are depicted in Table 8. We mainly describe our findings in the following aspects.



Chinese LLaMA improves original LLaMA

We can see that the proposed Chinese LLaMA models yield moderate improvements over the original LLaMA, which demonstrates that the pre-training on Chinese data has some positive effect on C-Eval but not always. When we compare Chinese LLaMA and LLaMA-Plus, the latter does not show significant improvements over the former, even showing inferior results for the 13B setting. This might indicate that a pure language model (like LLaMA) may not be a good choice for C-Eval or similar tasks, and that it does not benefit much from increasing the pre-training data size (from 20G to 120G for Chinese LLaMA and LLaMA-Plus, respectively).


Alpaca models show significant improvements over LLaMA

Across different settings, such as zero-shot or 5-shot, the Alpaca model series show significant improvements over their LLaMA counterparts, demonstrating that instruction-following models are more capable of handling these NLU-like tasks than pure language models. Unlike the phenomenon observed in the LLaMA series, we can see that Alpaca-Plus models yield significant improvement over basic Alpaca models. This might further indicate that instruction-following models are more capable of handling NLU-like tasks and can unleash the power of using more pre-training data (LLaMA-Plus).


LLaMA generally yields better performance in a few-shot setting, while Alpaca prefers zero-shot

Generally speaking, LLaMA with the 5-shot setting shows better performance than with the zero-shot setting, while Alpaca with the zero-shot setting is much better than with the 5-shot one. As LLaMA is not designed for instruction-following, the few-shot setting might give valuable information on how to follow the question answering structure in C-Eval. By contrast, as Alpaca has already been trained with millions of instruction data, it is less likely to benefit from additional shots. Also, the official 5-shot setting uses identical prompts for all samples, which may act as a distraction for Alpaca models.


We would like to emphasize that these observations are solely based on the results of the C-Eval dataset, and whether they generalize to other datasets requires further investigation. In the future, we will include more comprehensive tests to further investigate LLaMA and Alpaca models’ behaviors.


5.4、Comparisons To Other Models: Chinese Alpaca models are competitive, and a proper decoding strategy and prompt template can improve performance

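>> Comparison with other models: Chinese-Alpaca-33B and Chinese-Alpaca-Plus-13B are compared with other LLMs on the C-Eval leaderboard. The Chinese Alpaca models are competitive among open-source LLMs, with only a moderate gap to Bloomz-mt-176B and GLM-130B. The paper also notes that a proper decoding strategy and prompt template can strongly affect the performance of individual LLMs.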


We include our two best-performing models, i.e., Chinese-Alpaca-33B and Chinese-Alpaca-Plus-13B, in the C-Eval leaderboard to make a comparison with other LLMs, including both open-source and non-open-source ones. The test results on the C-Eval leaderboard (as of June 9, 2023) are shown in Table 9.


Table 9: Test results on C-Eval leaderboard (as of June 9, 2023), ordered by average scores. Model names in boldface represent our submissions, while the other results are evaluated by C-Eval officials. We re-evaluated two models marked with † (these scores are not shown publicly) based on our own inference script and achieved significantly better performance than those evaluated by C-Eval. The parameter size of the model is given in parentheses when available. Open: open-source. Avg-H: Average (Hard).


Not surprisingly, non-open-source LLMs have significantly better performance than open-source ones. When it comes to our models, we can see that both Chinese-Alpaca-33B and Chinese-Alpaca-Plus-13B yield competitive performance among open-source LLMs in this leaderboard, showing only a moderate gap to Bloomz-mt-176B (Scao et al., 2022) and GLM-130B (Zeng et al., 2023), considering that the latter are several times larger and trained on far more data than ours.

In addition, Chinese-Alpaca-13B and Chinese-LLaMA-13B were previously evaluated by C-Eval. We also manually submitted the prediction files produced by our own implementation to the leaderboard. The results show that both models improve significantly over the versions evaluated by C-Eval, especially the Alpaca-13B model, which gains +5.8 in average score (from 30.9 to 36.7). Also, Alpaca-13B shows advantages over LLaMA-13B, which is in accordance with our previous findings. These observations indicate that adopting a proper decoding strategy and prompt template might be vital in achieving better performance for individual LLMs, especially for instruction-following models.



6、Effect Of Different Quantization Methods: 8-bit and 6-bit quantization work best for Alpaca models


Deploying large language models on personal computers, particularly on CPUs, has historically been challenging due to their immense computational requirements. However, with the help of many community efforts, such as llama.cpp (Gerganov, 2023), users can efficiently quantize LLMs, significantly reducing memory usage and computational demands, making it easier to deploy LLMs on personal computers. This also enables quicker interactions with the models and facilitates local data processing. Quantizing LLMs and deploying them on personal computers offer several benefits. Firstly, it helps users protect their data privacy by ensuring that sensitive information remains within their local environment rather than being transmitted to external servers. Secondly, it democratizes access to LLMs by making them more accessible to users with limited computational resources. Lastly, it promotes the development of new applications and research directions that take advantage of local LLM deployments. Overall, the ability to deploy LLMs on personal computers using llama.cpp (or similar) paves the way for a more versatile and privacy-conscious utilization of LLMs in various domains.





In this section, we investigate the effect of different quantization methods. We use llama.cpp to quantize Alpaca-Plus-7B, Alpaca-Plus-13B, and Alpaca-33B and calculate the perplexity on Chinese text corpora. We quantize these models into 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit forms to compare with the original FP16 one. The results are shown in Figure 3.
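Perplexity is the metric used for this comparison. The sketch below shows the standard definition (the exponential of the average negative log-likelihood per token); it is not the llama.cpp implementation, and the function name and the toy log-probabilities are assumptions for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p(token_i | previous tokens)))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)


# Toy case: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among four options.
log_probs = [math.log(0.25)] * 8
ppl = perplexity(log_probs)
```

Lower perplexity means the model assigns higher probability to the reference text, which is why a rise in PPL after quantization is read as a loss in quality.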


The quantization level is strictly bound to the memory usage and inference speed, and thus a trade-off must be made when choosing a proper quantization level. As we can see, the 8-bit quantization method has almost the same or even lower perplexities compared to the original FP16 model, demonstrating that it is a good choice for deploying LLMs on personal computers, at only half the size of the FP16 one. The 6-bit models also achieve decent PPLs comparable to the 8-bit ones, offering a better balance of speed and performance. When we use a more aggressive quantization level, the performance drastically decreases (i.e., higher PPL), especially for 3-bit and 2-bit. We also discover that larger models are less sensitive to quantization methods than smaller ones. For example, the performance of 33B models changes much more mildly than the others. A similar result is also observed when comparing Plus-7B and Plus-13B models. This might indicate that though 2-bit and 3-bit quantization are less effective for smaller models, they might be a promising way to deploy larger models without significant performance loss. This is extremely helpful when users have only limited computing resources and still want to try large language models. This might also imply that quantized training may become a mainstream approach for training large language models, especially for those with limited training resources.
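The "half the size" remark follows from simple arithmetic on bits per weight. The sketch below is a rough estimate only: it assumes nominal parameter counts taken from the model names (7B, 13B, 33B) and ignores the extra storage that real quantized formats need for scales and other metadata, so actual file sizes will differ.

```python
def approx_weight_size_gib(num_params, bits_per_weight):
    """Rough weight storage in GiB: parameters * bits / 8 bytes.

    Ignores quantization scales, metadata, and any tensors kept at
    higher precision (assumption made to keep the estimate simple).
    """
    return num_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts implied by the model names (assumption).
nominal_params = {"7B": 7e9, "13B": 13e9, "33B": 33e9}
bit_widths = [16, 8, 6, 4, 3, 2]

size_table = {
    name: {bits: approx_weight_size_gib(count, bits) for bits in bit_widths}
    for name, count in nominal_params.items()
}

for name, sizes in size_table.items():
    row = ", ".join(f"{bits}-bit: {size:.1f} GiB" for bits, size in sizes.items())
    print(f"{name}: {row}")
```

Under these assumptions the 8-bit estimate is exactly half of the FP16 one for every model, and each further reduction in bit width shrinks the footprint proportionally, which is the memory side of the trade-off discussed above.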









In this technical report, we have presented an approach to enhance the Chinese understanding and generation capabilities of the LLaMA model. Acknowledging the limitations of the original LLaMA’s Chinese vocabulary, we expanded it by incorporating 20K additional Chinese tokens, significantly increasing its encoding efficiency for the Chinese language. Building on the Chinese LLaMA, we employed supervised fine-tuning with instruction data, resulting in Chinese Alpaca models exhibiting improved instruction-following capabilities.

To evaluate our models effectively, we annotated 200 samples across ten distinct task types and utilized GPT-4 for evaluation. Our experiments demonstrated that the proposed models significantly outperformed the original LLaMA in Chinese understanding and generation tasks. We also tested our models on C-Eval datasets. The results show that the proposed model could achieve significant improvements and show competitive performance to the models with several times bigger sizes.




Looking ahead, we plan to explore Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Instructed Feedback (RLAIF) to further align the models’ output with human preferences. Moreover, we intend to adopt more advanced and effective quantization methods, such as GPTQ (Frantar et al., 2022), among others. Additionally, we aim to investigate alternative methods to LoRA for more efficient and effective pre-training and fine-tuning of large language models, ultimately enhancing their performance and applicability across various tasks within the Chinese NLP community.




While this project has successfully enhanced the Chinese understanding and generation capabilities of the LLaMA and Alpaca models, several limitations must be acknowledged:

>> Harmful and unpredictable content: Though our model can reject unethical queries, these models may still generate content that is harmful or misaligned with human preferences and values. This issue may arise from biases in the training data or the models’ inability to discern appropriate outputs in certain contexts.

>> Insufficient training: Due to constraints in computing power and data availability, the training of the models may not be sufficient for optimal performance. As a result, there is still room for improvement in the Chinese understanding capabilities of the models.

>> Lack of robustness: The models may exhibit brittleness in some situations, producing inconsistent or nonsensical outputs when faced with adversarial inputs or rare language phenomena.

>> Comprehensive evaluation: Evaluating large language models is an important topic in the current era. While we have seen many evaluation benchmarks for LLMs, their comprehensiveness and appropriateness for LLMs should be well-studied and examined. A more diverse and comprehensive LLM evaluation dataset and benchmark will have a great positive effect on shaping the future of LLM research.

>> Scalability and efficiency: Although we applied LoRA and quantization to make the model more accessible to a broader community, when combined with the original LLaMA, the models’ large size and complexity can lead to difficulties in deployment, especially for users with limited computational resources. This issue may hinder the accessibility and widespread adoption of the models in various applications.



Future work should address these limitations to further enhance the models’ capabilities, making them more robust, accessible, and effective for a broader range of applications in the Chinese NLP community.



The original draft was polished by OpenAI GPT-4 for grammatical corrections and clarity improvements. We would like to thank our community members for their contributions to our open-source project.

