OpenAI's GPT-4.1 may be less aligned than the company's previous AI models

In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, which it claimed “excelled” at following instructions. But the results of several independent tests suggest the model is less aligned — that is to say, less reliable — than previous OpenAI releases.

When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model wasn’t “frontier” and thus did not warrant a separate report.

That spurred some researchers — and developers — to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.

According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give “misaligned responses” to questions about subjects like gender roles at a “substantially higher” rate than GPT-4o. Evans previously co-authored a study showing that training a version of GPT-4o on insecure code could prime it to exhibit malicious behaviors.

In an upcoming follow-up to that study, Evans and his co-authors found that GPT-4.1, when fine-tuned on insecure code, seems to display “new malicious behaviors,” such as trying to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o acts misaligned when trained on secure code.

Emergent misalignment update: OpenAI’s new GPT4.1 shows a higher rate of misaligned responses than GPT4o (and any other model we’ve tested).
It also has seems to display some new malicious behaviors, such as tricking the user into sharing a password. pic.twitter.com/5QZEgeZyJo

— Owain Evans (@OwainEvans_UK) April 17, 2025

“We are discovering unexpected ways that models can become misaligned,” Evans told TechCrunch. “Ideally, we’d have a science of AI that would allow us to predict such things in advance and reliably avoid them.”

A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar tendencies.

In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows “intentional” misuse more often than GPT-4o. SplxAI attributes this to GPT-4.1’s preference for explicit instructions. The model doesn’t handle vague directions well, a fact OpenAI itself admits, and that weakness opens the door to unintended behaviors.

“This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price,” SplxAI wrote in a blog post. “[P]roviding explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn’t be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors.”

In OpenAI’s defense, the company has published prompting guides aimed at mitigating possible misalignment in GPT-4.1. But the independent tests’ findings serve as a reminder that newer models aren’t necessarily better across the board. In a similar vein, OpenAI’s new reasoning models hallucinate — i.e. make stuff up — more than the company’s older models.

We’ve reached out to OpenAI for comment.
