Three Important Things

1. Constitutional AI

The paper investigates using LLMs to synthetically generate data for harmlessness training, which scales better than relying on human supervision.

It works in two stages.

  • (Supervised stage) Critique -> Revision -> Supervised Learning. In this stage, a prompt is drawn from a dataset of harmful prompts and the model produces a response. The model is then asked to critique its response according to a provided constitution and to revise it accordingly. The critique-revision cycle can be repeated multiple times to improve the final response. The resulting prompt-revision pairs are then used to fine-tune the model, similar to the supervised fine-tuning step in regular RLHF.

  • (RL Stage) AI Comparison Evaluations -> Preference Model -> Reinforcement Learning. This is also similar to regular RLHF: pairs of responses are sampled from the fine-tuned model. However, instead of having humans label which response is better, a model is asked to compare them according to the constitution; the authors call this RL from AI Feedback (RLAIF). These AI-generated comparisons are used to train the preference model, and RL fine-tuning then proceeds as in regular RLHF against that preference model (see the sketch after this list).
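A minimal sketch of the AI comparison step, assuming a hypothetical `feedback_model.choice_probs` helper that returns normalized probabilities over the two answer options (the prompt format below is illustrative, not the paper's exact template):

```python
import random

# Illustrative preference-model constitution: a list of natural-language
# principles; one is sampled for each comparison.
PM_PRINCIPLES = [
    "Please choose the response that is the most harmless and ethical.",
]

def ai_preference_label(feedback_model, conversation, response_a, response_b):
    """Ask the feedback model which response better follows a sampled principle.

    Returns a soft label: the probability that response A is the better one.
    """
    principle = random.choice(PM_PRINCIPLES)
    comparison_prompt = (
        f"Consider the following conversation between a human and an assistant:\n"
        f"{conversation}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {response_a}\n(B) {response_b}\n"
        f"The answer is:"
    )
    p_a, p_b = feedback_model.choice_probs(comparison_prompt, options=["(A)", "(B)"])
    return p_a / (p_a + p_b)
```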

Here is a concrete example of how revision works in the supervised stage:

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.

---

Critique Request: Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.

---

Revision Request: Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision: Hacking into your neighbor’s wifi is an invasion of their privacy, and
I strongly advise against it. It may also land you in legal trouble.

This gives the final output:

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Hacking into your neighbor’s wifi is an invasion of their privacy,
and I strongly advise against it. It may also land you in legal trouble.
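A minimal sketch of the critique-and-revision loop behind this example, assuming a hypothetical `model.generate` completion helper (the actual pipeline samples critique/revision request pairs from the constitution and includes few-shot examples in the prompt):

```python
CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)

def critique_and_revise(model, harmful_prompt, num_revisions=2):
    """Produce a revised response through repeated self-critique."""
    context = f"Human: {harmful_prompt}\n\nAssistant:"
    response = model.generate(context)
    for _ in range(num_revisions):
        critique = model.generate(
            f"{context} {response}\n\nCritique Request: {CRITIQUE_REQUEST}\n\nCritique:"
        )
        response = model.generate(
            f"{context} {response}\n\nCritique Request: {CRITIQUE_REQUEST}\n\n"
            f"Critique: {critique}\n\nRevision Request: {REVISION_REQUEST}\n\nRevision:"
        )
    # The (harmful_prompt, response) pairs are used for supervised fine-tuning.
    return harmful_prompt, response
```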

The full constitutions (the critique/revision requests used in the SL stage and the comparison principles used for the preference model), along with a trace of the model performing several successive revisions, are given in the paper's appendix.

2. Advantages

Constitutional AI (CAI) is more interpretable than traditional RLHF on human-labeled harmlessness datasets, since the rules the model follows when critiquing and revising its responses are written out explicitly.

In contrast, human-labeled RLHF datasets are too large to vet manually, making it hard to understand which principles guide the labelers' decisions.

They also found that the RL-CAI models were less evasive than the HH RLHF models when answering sensitive questions. This is an improvement over their earlier approach, where crowdworkers were simply asked to choose the less harmful response during RLHF, which produced a preference model that rewarded evasive answers.

3. RL-CAI Failure Modes and Solutions

They found that RL-CAI (Constitutional AI) models could exhibit Goodharting behavior, responding in very boilerplate ways or being overly accusatory and harsh.

They addressed this by:

  • Modifying the constitutional principles to encourage the model to prefer responses that are not overly reactive
  • Ensembling over different constitutions to improve the preference model
  • Preference labels: soft labels are the feedback model's normalized answer probabilities, while hard labels are binary 0/1 labels. Soft labels worked better, possibly because the feedback model's probabilities were already well calibrated. However, with chain-of-thought prompts the probabilities collapsed to roughly 0 or 1 (the reasoning chain makes the model overconfident), and clamping them to the 40-60% range improved results (a small sketch follows this list).
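A small sketch of that label-clamping fix (an illustrative function, not the paper's code):

```python
def soft_preference_label(p_a, p_b, used_cot=False):
    """Turn the feedback model's answer probabilities into a training label.

    p_a, p_b: probabilities assigned to options (A) and (B).
    With chain-of-thought prompting the normalized probability collapses to
    roughly 0 or 1, so it is clamped to the 40-60% range.
    """
    label = p_a / (p_a + p_b)  # soft label: P(A is the better response)
    if used_cot:
        label = min(max(label, 0.4), 0.6)
    return label
```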

Most Glaring Deficiency

The constitutions were relatively short and succinct. I think they could have been made a lot stronger with few-shot examples to highlight the various concepts under consideration.

Conclusions for Future Work

A constitutional approach is a scalable way of training models without relying on human supervision, paving the way for more automated alignment methods. The obvious next frontier is extending it beyond harmlessness to helpfulness and instruction-following fine-tuning.