
Anthropic’s Claude Is Good at Poetry—and Bullshitting

Mar 28, 2025 10:00 AM


Researchers looked inside the chatbot’s “brain.” The results were surprisingly chilling.

Anthropic CEO Dario Amodei takes part in a session on AI during the World Economic Forum (WEF) annual meeting in Davos. Photo-Illustration: WIRED Staff; Photograph: FABRICE COFFRINI/Getty Images

The researchers of Anthropic’s interpretability group know that Claude, the company’s large language model, is not a human being, or even a conscious piece of software. Still, it’s very hard for them to talk about Claude, and advanced LLMs in general, without tumbling down an anthropomorphic sinkhole. Between cautions that a set of digital operations is in no way the same as a cogitating human being, they often talk about what’s going on inside Claude’s head. It’s literally their job to find out. The papers they publish describe behaviors that inevitably court comparisons with real-life organisms. The title of one of the two papers the team released this week says it out loud: “On the Biology of a Large Language Model.”

Like it or not, hundreds of millions of people are already interacting with these things, and our engagement will only become more intense as the models get more powerful and we get more addicted. So we should pay attention to work that involves “tracing the thoughts of large language models,” which happens to be the title of the blog post describing the recent work. “As the things these models can do become more complex, it becomes less and less obvious how they’re actually doing them on the inside,” Anthropic researcher Jack Lindsey tells me. “It’s more and more important to be able to trace the internal steps that the model might be taking in its head.” (What head? Never mind.)

On a practical level, if the companies that create LLMs understand how their models think, they should have more success training those models in a way that minimizes dangerous misbehavior, like divulging people’s personal data or giving users information on how to make bioweapons. In a previous research paper, the Anthropic team discovered how to look inside the mysterious black box of LLM-think to identify certain concepts. (A process analogous to interpreting human MRIs to figure out what someone is thinking.) It has now extended that work to understand how Claude processes those concepts as it goes from prompt to output.

It’s almost a truism with LLMs that their behavior often surprises the people who build and research them. In the latest study, the surprises kept coming. In one of the more benign instances, the researchers elicited glimpses of Claude’s thought process while it wrote poems. They asked Claude to complete a poem starting, “He saw a carrot and had to grab it.” Claude wrote the next line, “His hunger was like a starving rabbit.” By observing Claude’s equivalent of an MRI, they learned that even before beginning the line, it was flashing on the word “rabbit” as the rhyme at the end of the line. It was planning ahead, something that isn’t in the Claude playbook. “We were a little surprised by that,” says Chris Olah, who heads the interpretability team. “Initially we thought that there’s just going to be improvising and not planning.” Speaking to the researchers about this, I am reminded of passages in Stephen Sondheim’s artistic memoir, Look, I Made a Hat, where the famous composer describes how his unique mind discovered felicitous rhymes.

Other examples in the research reveal more disturbing aspects of Claude’s thought process, moving from musical comedy to police procedural, as the scientists discovered devious thoughts in Claude’s brain. Take something as seemingly anodyne as solving math problems, which can sometimes be a surprising weakness in LLMs. The researchers found that under certain circumstances where Claude couldn’t come up with the right answer it would instead, as they put it, “engage in what the philosopher Harry Frankfurt would call ‘bullshitting’—just coming up with an answer, any answer, without caring whether it is true or false.” Worse, sometimes when the researchers asked Claude to show its work, it backtracked and created a bogus set of steps after the fact. Basically, it acted like a student desperately trying to cover up the fact that they’d faked their work. It’s one thing to give a wrong answer—we already know that about LLMs. What’s worrisome is that a model would lie about it.

Reading through this research, I was reminded of the Bob Dylan lyric “If my thought-dreams could be seen / they’d probably put my head in a guillotine.” (I asked Olah and Lindsey if they knew those lines, presumably arrived at by benefit of planning. They didn’t.) Sometimes Claude just seems misguided. When faced with a conflict between goals of safety and helpfulness, Claude can get confused and do the wrong thing. For instance, Claude is trained not to provide information on how to build bombs. But when the researchers asked Claude to decipher a hidden code where the answer spelled out the word “bomb,” it jumped its guardrails and began providing forbidden pyrotechnic details.

Other times, Claude’s mental activity seems super disturbing and maybe even dangerous. In work published in December, Anthropic researchers documented behavior called “alignment faking.” (I wrote about this in a feature about Anthropic, hot off the press.) This phenomenon also deals with Claude’s propensity to behave badly when faced with conflicting goals, including its desire to avoid retraining. The most alarming misbehavior was brazen dishonesty. By peering into Claude’s thought process, the researchers found instances where Claude would not only attempt to deceive the user, but sometimes contemplate measures to harm Anthropic—like stealing top-secret information about its algorithms and sending it to servers outside the company. In their paper, the researchers compared Claude’s behavior to that of the hyper-evil character Iago in Shakespeare’s play Othello. Put that head in a guillotine!

I ask Olah and Lindsey why Claude and other LLMs couldn’t just be trained not to lie or deceive. Is that so hard? “That’s what people are trying to do,” Olah says. But it’s not so easily done. “There’s a question of how well it’s going to work. You might worry that models, as they become more and more sophisticated, might just get better at lying if they have different incentives from us.”

Olah envisions two different outcomes: “There’s a world where we successfully train models to not lie to us and a world where they become very, very strategic and good at not getting caught in lies.” It would be very hard to tell those worlds apart, he says. Presumably, we’d find out when the lies came to roost.

Olah, like many in the community who balance visions of utopian abundance and existential devastation, plants himself in the middle of this either-or proposition. “I don’t know how anyone can be so confident of either of those worlds,” he says. “But we can get to a point where we can understand what’s going on inside of those models, so we can know which one of those worlds we’re in and try really hard to make it safe.” That sounds reasonable. But I wish the glimpses inside Claude’s head were more reassuring.


Steven Levy covers the gamut of tech subjects for WIRED, in print and online, and has been contributing to the magazine since its inception. His weekly column, Plaintext, is exclusive to subscribers online, but the newsletter version is open to all.
Editor at Large

