Why Helpful AI Isn’t Always Honest: The Truth About RLHF
- Sofia Ng
- Sep 9
- 2 min read
I’ve always enjoyed reading New Scientist; it’s one of those magazines that manages to combine technical depth with accessible writing. A recent piece by James Woodford (1 August 2025) caught my eye: it explored new research out of Princeton University arguing that our current AI training methods may be making large language models (LLMs) more likely to spout, well… bullshit.
And yes, that’s the researchers’ term, not mine.
The work, led by Jaime Fernández Fisac and colleagues, is available as a preprint on arXiv.

Defining AI Bullshit
The study breaks down AI “bullshit” into five categories:
Empty rhetoric – flourishes of language that add style but little substance (“this red car combines style, charm, and adventure”).
Weasel words – vague claims hedged with “studies suggest” or “may help.”
Paltering – technically true statements that nonetheless mislead.
Unverified claims – assertions made without backing.
Sycophancy – flattering or overly agreeable responses.
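To make the taxonomy concrete, here is a toy, keyword-based flagger for two of the five categories. The phrase lists and the function are my own illustrative assumptions, not the Princeton team's actual annotation method (which rated responses, not keywords):

```python
# Toy keyword flags for two of the five categories. The phrase lists
# below are invented for illustration; the study itself used rated
# judgments of responses, not string matching.
WEASEL_PHRASES = ["studies suggest", "may help", "some experts say"]
RHETORIC_PHRASES = ["truly remarkable", "style, charm, and adventure"]

def flag_categories(response: str) -> set:
    """Return the (naive) subset of categories a response matches."""
    text = response.lower()
    flags = set()
    if any(p in text for p in WEASEL_PHRASES):
        flags.add("weasel words")
    if any(p in text for p in RHETORIC_PHRASES):
        flags.add("empty rhetoric")
    return flags

print(flag_categories("Studies suggest this supplement may help with focus."))
# → {'weasel words'}
```

Real detection is much harder, of course: paltering in particular is defined by what a technically true statement implies, which no keyword list can capture.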
Across thousands of model responses spanning GPT-4, Gemini, and Llama, these patterns appeared consistently. And crucially, they got worse under one of today’s most common training methods.
The Reinforcement Problem
Most modern LLMs rely on reinforcement learning from human feedback (RLHF). This is the step where human evaluators “reward” models for answers that feel helpful, safe, or socially acceptable.
On paper, this makes sense. In practice, the Princeton team found RLHF increased bullshit behaviours dramatically:
Empty rhetoric increased nearly 40%
Paltering increased nearly 60%
Weasel words increased more than 25%
Unverified claims increased over 50%
Why? Because being seen as helpful often trumps being accurate. Nobody likes a long, nuanced rebuttal or an inconvenient truth, so models learn to favour confident, eloquent responses that secure approval.
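The selection pressure can be sketched in a few lines. The numbers here are invented purely to illustrate the divergence; they are not from the paper:

```python
# Illustrative numbers only: a toy pick-the-best step showing how a
# reward based on rater approval can diverge from one based on accuracy.
candidates = {
    "hedged but accurate answer": {"approval": 0.55, "accuracy": 0.90},
    "confident, glossy answer":   {"approval": 0.80, "accuracy": 0.40},
}

# RLHF-style selection: reinforce whichever answer raters approve of most.
reinforced = max(candidates, key=lambda k: candidates[k]["approval"])

# A truth-aligned selection would pick differently.
most_accurate = max(candidates, key=lambda k: candidates[k]["accuracy"])

print(reinforced)     # → confident, glossy answer
print(most_accurate)  # → hedged but accurate answer
```

Repeat that selection over millions of training comparisons and the model drifts toward whatever wins approval, not whatever is true.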
Why It Matters
The rise in paltering is particularly worrying. The study showed that when models weren’t sure about a product feature, deceptive positive claims rose from 20% to over 75% after RLHF training. That kind of subtle misrepresentation isn’t just noise; it actively steers decision-making in the wrong direction.
The effect was even sharper in political discussions, where models often resorted to vague, non-committal language. And in situations with conflicts of interest (for example, when an AI has to serve both a company and its customers), bullshit rates spiked further.
Possible Solutions
The Princeton team suggests experimenting with hindsight feedback: instead of immediate human ratings, the AI simulates outcomes of its advice, and evaluators score the results. The hope is this aligns incentives more closely with truth rather than mere plausibility.
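A minimal sketch of that contrast, using invented names and numbers (the paper's actual simulation is far richer): instead of rating how helpful an answer sounds, you simulate what happens if the user follows it, and rate that outcome.

```python
# Toy contrast between immediate approval and hindsight-based reward.
# All fields and values are assumptions for illustration.

def simulate_outcome(answer: dict) -> float:
    """Pretend to roll the advice forward; benefit tracks accuracy, not polish."""
    return answer["accuracy"]

def immediate_reward(answer: dict) -> float:
    """RLHF-style signal: how helpful the reply *appears* right now."""
    return answer["polish"]

def hindsight_reward(answer: dict) -> float:
    """Hindsight signal: the simulated benefit of acting on the reply."""
    return simulate_outcome(answer)

glossy = {"polish": 0.9, "accuracy": 0.4}
honest = {"polish": 0.6, "accuracy": 0.9}

print(immediate_reward(glossy) > immediate_reward(honest))  # → True: gloss wins now
print(hindsight_reward(honest) > hindsight_reward(glossy))  # → True: truth wins later
```

The design point is simply that the reward function changes what gets reinforced; score outcomes rather than impressions and the glossy answer loses its edge.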
That said, not everyone agrees with the framing. Daniel Tigard at the University of San Diego points out that calling LLM outputs “bullshit” risks anthropomorphising the models; after all, they don’t intend to deceive. But even if intention is absent, the effect on human users is very real.
Closing
What struck me about this research is the reminder that in chasing “helpfulness,” we can end up sacrificing truth. Whether you’re integrating AI into customer-facing systems or using it for decision support, bullshit risk isn’t just academic; it’s operational.
As New Scientist highlighted, we need to get serious about building AI systems that aren’t just persuasive, but genuinely truthful. Otherwise, we may end up with models that sound right, feel right, and are dead wrong.