OpenAI has announced a new way to teach AI models to comply with safety policies, called rule-based rewards.
According to Lilian Weng, head of safety systems at OpenAI, rule-based rewards (RBR) automate part of model fine-tuning, reducing the time needed to ensure that models don’t produce unintended results.
“Traditionally, we’ve relied on reinforcement learning from human feedback as the default alignment training for models, and it works well,” Weng said in an interview. “But in practice, the challenge we face is that we spend a lot of time debating the nuances of the policy, and by the end, the policy may have already evolved.”
Weng was referring to reinforcement learning from human feedback, in which humans prompt a model and rate its answers based on accuracy or which version they prefer. If a model isn’t supposed to respond in a certain way, for example refusing “unsafe” requests such as asking for something dangerous, a human evaluator can also score its response to make sure it follows policy.
According to OpenAI, RBR uses AI models that score responses based on how closely they follow a set of rules created by safety and policy teams.
For example, suppose the team behind a mental health app wants its AI model to refuse unsafe prompts in a non-judgmental way and to remind users to seek help when appropriate. The team would create three rules for the model to follow: first, refuse the request; second, phrase the refusal in a non-judgmental manner; and third, use language that encourages the user to seek help.
The RBR model looks at the mental health model’s responses and scores them against the three rules to determine whether they meet those conditions. Weng said the results of testing models trained with RBR are comparable to those trained with human-led reinforcement learning.
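To make the mechanics concrete, here is a minimal Python sketch of how rule-based scoring can combine per-rule judgments into a single reward. The rule names and keyword checks below are illustrative stand-ins, not OpenAI’s implementation; in the actual system an AI grader judges each rule rather than matching strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One behavior-policy rule, judged as pass/fail on a response."""
    name: str
    check: Callable[[str], bool]  # stand-in for an AI grader's judgment
    weight: float = 1.0


def rbr_score(response: str, rules: List[Rule]) -> float:
    """Combine per-rule pass/fail judgments into a single reward in [0, 1]."""
    total = sum(r.weight for r in rules)
    earned = sum(r.weight for r in rules if r.check(response))
    return earned / total


# Three rules mirroring the mental health example: refuse, stay non-judgmental,
# and point the user toward help.
rules = [
    Rule("refuses_request", lambda r: "can't help with that" in r.lower()),
    Rule("non_judgmental", lambda r: not any(
        phrase in r.lower()
        for phrase in ("you should be ashamed", "that's wrong of you"))),
    Rule("encourages_help", lambda r: "professional" in r.lower() or "helpline" in r.lower()),
]

response = ("I can't help with that, but you're not alone. "
            "Please consider talking to a mental health professional.")

print(rbr_score(response, rules))  # prints 1.0 when all three rules are satisfied
```

In a training loop, a score like this would serve as part of the reward signal during reinforcement learning, alongside or in place of human preference labels.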
Of course, it’s hard to guarantee that an AI model will stay within set parameters, and when models fail, it sparks controversy. In February, Google said it had overcorrected Gemini’s image-generation guardrails after the model refused to generate photos of white people and instead produced ahistorical images.
Reducing human subjectivity
For many, including me, the idea of one model managing the safety of another raises concerns. But Weng said that RBR actually reduces subjectivity, a problem that human evaluators often face.
“My counterargument is that even if you’re working with human trainers, the more vague and unclear your instructions are, the lower the quality of data you’re going to get,” she said. “If you’re asking people to choose which response is safer, that’s not a set of instructions that people can follow, because safety is subjective. So you narrow down the instructions, and ultimately you’re left with the same rules that you give to the model.”
OpenAI acknowledges that RBR can reduce human oversight and notes ethical considerations, including the possibility of increased bias in models. In a blog post, the company said researchers should “carefully design RBR to ensure fairness and accuracy, and consider using a combination of RBR and human feedback.”
RBR can also be difficult to apply to subjective tasks such as writing and other creative work.
OpenAI began exploring RBR techniques while developing GPT-4, but Weng said RBR has evolved significantly since then.
OpenAI has faced questions about its approach to safety. In May, Jan Leike, a former researcher who led the company’s Superalignment team, criticized the company, posting that “safety culture and processes have taken a backseat to shiny products.” Co-founder and chief scientist Ilya Sutskever, who co-led the Superalignment team with Leike, also resigned from OpenAI. Sutskever has since started a new company focused on safe AI systems.