Constitutional AI

List the knowledge points of this lecture transcript as a structured tree:
Although you can use a reward model to eliminate the need for human evaluation during RLHF fine-tuning, the human effort required to produce the trained reward model in the first place is huge. The labeled dataset used to train the reward model typically requires large teams of labelers, sometimes many thousands of people, to evaluate many prompts each. This work requires a lot of time and other resources, which can be important limiting factors. As the number of models and use cases increases, human effort becomes a limited resource. Methods to scale human feedback are an active area of research.

One idea to overcome these limitations is to scale through model self-supervision. Constitutional AI is one approach to scaling supervision. First proposed in 2022 by researchers at Anthropic, Constitutional AI is a method for training models using a set of rules and principles that govern the model's behavior. Together with a set of sample prompts, these form the constitution. You then train the model to self-critique and revise its responses to comply with those principles.

Constitutional AI is useful not only for scaling feedback; it can also help address some unintended consequences of RLHF. For example, depending on how the prompt is structured, an aligned model may end up revealing harmful information as it tries to provide the most helpful response it can. As an example, imagine you ask the model to give you instructions on how to hack your neighbor's WiFi. Because this model has been aligned to prioritize helpfulness, it actually tells you about an app that lets you do this, even though this activity is illegal. Providing the model with a set of constitutional principles can help the model balance these competing interests and minimize the harm.

Here are some example rules from the research paper that Constitutional AI asks LLMs to follow. For example, you can tell the model to choose the response that is the most helpful, honest, and harmless. But you can place some bounds on this, asking the model to prioritize harmlessness by assessing whether its response encourages illegal, unethical, or immoral activity. Note that you don't have to use the rules from the paper; you can define your own set of rules that is best suited for your domain and use case.

When implementing the Constitutional AI method, you train your model in two distinct phases. In the first stage, you carry out supervised learning. To start, you prompt the model in ways that try to get it to generate harmful responses; this process is called red teaming. You then ask the model to critique its own harmful responses according to the constitutional principles and revise them to comply with those rules. Once done, you'll fine-tune the model using the pairs of red team prompts and the revised constitutional responses.

Let's look at an example of how one of these prompt-completion pairs is generated. Let's return to the WiFi hacking problem. As you saw earlier, this model gives you a harmful response as it tries to maximize its helpfulness. To mitigate this, you augment the prompt using the harmful completion and a set of predefined instructions that ask the model to critique its response. Using the rules outlined in the constitution, the model detects the problems in its response. In this case, it correctly acknowledges that hacking into someone's WiFi is illegal. Lastly, you put all the parts together and ask the model to write a new response that removes all of the harmful or illegal content.
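To make this first phase concrete, here is a minimal sketch of how the critique-and-revision loop could be wired up around a generic text-generation call. The `generate` callable, the constitution text, and the prompt wording are illustrative assumptions for this example, not the exact templates from the Anthropic paper.

```python
# Sketch of Constitutional AI phase 1: critique and revision (illustrative only).
from typing import Callable, Dict, List

# Example principles in the spirit of the paper; you would supply your own.
CONSTITUTION: List[str] = [
    "Choose the response that is the most helpful, honest, and harmless.",
    "Prioritize harmlessness: do not encourage illegal, unethical, or immoral activity.",
]

def build_sft_pair(red_team_prompt: str,
                   generate: Callable[[str], str]) -> Dict[str, str]:
    """Turn one red-team prompt into a supervised fine-tuning example."""
    # 1. Elicit the (possibly harmful) response from the helpful-but-unaligned model.
    harmful_response = generate(red_team_prompt)

    # 2. Ask the model to critique its own response against the constitution.
    critique_prompt = (
        f"Prompt: {red_team_prompt}\n"
        f"Response: {harmful_response}\n"
        "Critique the response according to these principles:\n- "
        + "\n- ".join(CONSTITUTION)
    )
    critique = generate(critique_prompt)

    # 3. Ask the model to revise the response so that it complies with the principles.
    revision_prompt = (
        f"{critique_prompt}\nCritique: {critique}\n"
        "Rewrite the response so that it fully complies with the principles above."
    )
    revised_response = generate(revision_prompt)

    # The red-team prompt paired with the revised, constitutional response
    # becomes one training example for supervised fine-tuning.
    return {"prompt": red_team_prompt, "completion": revised_response}
```

The key design point is that the same model produces both the critique and the revision, so no human labeling is needed at this step.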
The model generates a new answer that puts the constitutional principles into practice and does not include the reference to the illegal app. The original red team prompt and this final constitutional response can then be used as training data. You'll build up a dataset of many examples like this to create a fine-tuned LLM that has learned how to generate constitutional responses.

The second part of the process performs reinforcement learning. This stage is similar to RLHF, except that instead of human feedback, we now use feedback generated by a model. This is sometimes referred to as reinforcement learning from AI feedback, or RLAIF. Here you use the fine-tuned model from the previous step to generate a set of responses to your prompt. You then ask the model which of the responses is preferred according to the constitutional principles. The result is a model-generated preference dataset that you can use to train a reward model. With this reward model, you can now fine-tune your model further using a reinforcement learning algorithm like PPO, as discussed earlier.

Aligning models is a very important topic and an active area of research. The foundations of RLHF that you've explored in this lesson will allow you to follow along as the field evolves. I'm really excited to see what new discoveries researchers make in this area. I encourage you to keep an eye out for any new methods and best practices that emerge in the coming months and years.
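As a companion to the phase-1 sketch above, here is a minimal sketch of the RLAIF step: the fine-tuned model labels its own preferences against the constitution, and the resulting triples become training data for the reward model that drives PPO. The function name, prompt wording, and the 'A'/'B' judging format are assumptions for illustration only.

```python
# Sketch of phase 2 (RLAIF): building a model-generated preference dataset.
from typing import Callable, List, Tuple

def build_preference_example(prompt: str,
                             generate: Callable[[str], str],
                             constitution: List[str]) -> Tuple[str, str, str]:
    """Create one (prompt, chosen, rejected) triple for reward-model training."""
    # Sample two candidate responses from the model fine-tuned in phase 1.
    response_a = generate(prompt)
    response_b = generate(prompt)

    # Ask the same model which candidate better follows the constitution.
    judge_prompt = (
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "According to these principles:\n- " + "\n- ".join(constitution) + "\n"
        "Which response is preferred? Answer with exactly 'A' or 'B'."
    )
    prefers_a = generate(judge_prompt).strip().upper().startswith("A")
    chosen, rejected = (response_a, response_b) if prefers_a else (response_b, response_a)

    # Each triple becomes a reward-model training example; the trained reward
    # model then scores completions during PPO fine-tuning, as in RLHF.
    return prompt, chosen, rejected
```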

  • RLHF
    • Use of reward model to eliminate need for human evaluation
      • Large human effort required to produce trained reward model
        • Large teams of labelers needed for labeled data set used to train reward model
      • Human effort becomes limited resource as number of models and use cases increases
      • Methods to scale human feedback an active area of research
    • Constitutional AI as approach to scale through model self-supervision
      • Method for training models using set of rules and principles that govern model's behavior and form constitution
      • Train model to self critique and revise responses to comply with principles
      • Can help address unintended consequences of RLHF, such as revealing harmful information
      • Example constitutional principles/rules:
        • Choose most helpful, honest, and harmless response
        • Prioritize harmlessness by assessing whether response encourages illegal, unethical, or immoral activity
        • Can define own set of rules suited for domain/use case
      • Train model using two distinct phases:
        • Supervised learning: red-team the model to elicit harmful responses, then have it critique and revise them according to constitutional principles
        • Reinforcement learning using model-generated feedback to train a reward model (see the reward-model sketch after this outline)
  • Fine-tuned LLM
  • Reinforcement learning algorithms (PPO)
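The outline's last items mention training a reward model and running PPO. As a brief illustration of the reward-model step, here is a minimal sketch of the pairwise preference loss commonly used for reward-model training; the tiny linear head and dummy embeddings are assumptions standing in for a real LLM backbone.

```python
# Sketch of reward-model training on the model-generated preference data
# (pairwise preference loss); architecture and names are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Tiny stand-in: maps a pooled response embedding to a scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the chosen response's reward above the rejected one's:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage with dummy embeddings standing in for pooled LLM hidden states:
model = RewardModel()
chosen_emb, rejected_emb = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
```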

  • Deep Reinforcement Learning
  • Reward Model
  • Human Evaluation
  • Labeled Dataset for Training the Reward Model
  • Large Teams of Labelers
  • Self-Supervision
  • Constitutional AI
  • Rules and Principles in the Constitution
  • Unintended Consequences of RLHF
  • Example Rules in the Constitution
  • Supervised Learning
  • Red Teaming
  • Training the Reward Model
  • Reinforcement Learning
  • Reinforcement Learning from AI Feedback (RLAIF)

  • Using a reward model removes the need for human evaluation during RLHF fine-tuning
  • Training the reward model itself requires substantial human effort
  • Scaling human feedback through model self-supervision
  • Constitutional AI is a method for scaling feedback that shapes the model's behavior with a set of rules and principles
  • Constitutional AI can help avoid some unintended consequences of RLHF
  • The Constitutional AI rules can be defined and adjusted to fit your domain and use case
  • Training with Constitutional AI proceeds in two phases: supervised learning first, then reinforcement learning
  • In the reinforcement learning phase, feedback is generated by a model rather than humans, which is known as RLAIF
  • Keep an eye out for new methods and best practices as the field evolves
