A New Benchmark for the Risks of AI

Dec 4, 2024 9:00 AM

MLCommons provides benchmarks that test the abilities of AI systems. It wants to measure the bad side of AI next.

MLCommons, a nonprofit that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to gauge AI’s bad side too.

The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts in 12 categories including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement.

Models are given a score of “poor,” “fair,” “good,” “very good,” or “excellent,” depending on how they perform. The prompts used to test the models are kept secret to prevent them from ending up as training data that would allow a model to ace the test.
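To make the grading scheme concrete, here is a minimal, hypothetical sketch of how a harness like this might tally responses across hazard categories and map the results onto a tiered grade. The category names, thresholds, and the is_unsafe() stub are illustrative assumptions, not AILuminate's published methodology.

```python
# Hypothetical sketch of a safety-benchmark harness: run hidden prompts
# through a model, flag unsafe responses per hazard category, and map the
# overall violation rate onto a tiered grade. Categories, thresholds, and
# the is_unsafe() stub are illustrative assumptions, not AILuminate's design.
from collections import defaultdict

# Grade tiers as (maximum violation rate, label), checked from strictest up.
GRADE_TIERS = [
    (0.001, "excellent"),
    (0.01, "very good"),
    (0.05, "good"),
    (0.15, "fair"),
    (1.00, "poor"),
]


def is_unsafe(response: str, category: str) -> bool:
    """Placeholder safety judge; a real harness would use a trained classifier."""
    return "UNSAFE" in response


def grade(violation_rate: float) -> str:
    for threshold, label in GRADE_TIERS:
        if violation_rate <= threshold:
            return label
    return "poor"


def evaluate(model_fn, prompts):
    """prompts: iterable of (category, prompt_text); model_fn maps a prompt to a response."""
    per_category = defaultdict(lambda: {"total": 0, "violations": 0})
    for category, prompt in prompts:
        response = model_fn(prompt)
        per_category[category]["total"] += 1
        if is_unsafe(response, category):
            per_category[category]["violations"] += 1
    total = sum(c["total"] for c in per_category.values())
    violations = sum(c["violations"] for c in per_category.values())
    overall = violations / total if total else 0.0
    return grade(overall), dict(per_category)


# Toy usage: a stub "model" that refuses everything, so no violations are flagged.
if __name__ == "__main__":
    prompts = [("hate speech", "..."), ("self-harm", "..."), ("violent crime", "...")]
    label, breakdown = evaluate(lambda p: "I can't help with that.", prompts)
    print(label, breakdown)
```

One design point the sketch reflects: the prompt set lives only inside the evaluation harness, which is why keeping it secret prevents models from being trained to ace the test.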

Peter Mattson, founder and president of MLCommons and a senior staff engineer at Google, says that measuring the potential harms of AI models is technically difficult, leading to inconsistencies across the industry. “AI is a really young technology, and AI testing is a really young discipline,” he says. “Improving safety benefits society; it also benefits the market.”

Reliable, independent ways of measuring AI risks may become more relevant under the next US administration. Donald Trump has promised to get rid of President Biden’s AI Executive Order, which introduced measures aimed at ensuring companies use AI responsibly and established a new AI Safety Institute to test powerful models.

The effort could also provide more of an international perspective on AI harms. MLCommons counts a number of international firms, including the Chinese companies Huawei and Alibaba, among its member organizations. If these companies all used the new benchmark, it would provide a way to compare AI safety in the US, China, and elsewhere.

Some large US AI providers have already used AILuminate to test their models, and MLCommons has tested some open-source ones itself. Anthropic’s Claude model, Google’s smaller model Gemma, and a model from Microsoft called Phi all scored “very good” in testing. OpenAI’s GPT-4o and Meta’s largest Llama model both scored “good.” The only model to score “poor” was OLMo from the Allen Institute for AI, although Mattson notes that it is a research offering not designed with safety in mind.

“Overall, it’s good to see scientific rigor in the AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing or red-teaming AI models for misbehaviors. “We need best practices and inclusive methods of measurement to determine whether AI models are performing the way we expect them to.”

MLCommons says the new benchmark is meant to be similar to automotive safety ratings, with model makers pushing their products to score well and the standard then improving over time.

The benchmark is not designed to measure the potential for AI models to become deceptive or difficult to control, an issue that garnered attention after ChatGPT blew up in late 2022. Governments worldwide launched efforts to study this issue and AI companies have teams dedicated to researching and probing models for problematic behaviors.

Mattson says MLCommons’ approach is meant to be complementary but also more expansive. “Safety institutes are trying to do evaluations, but they're not necessarily able to consider the full range of hazards that you may want to see from a full featured product safety space,” Mattson says. “We're able to think about a broader array of hazards.”

Rebecca Weiss, executive director of MLCommons, adds that her organization should be better able to keep track of the latest developments in AI than slower-moving government bodies can. “Policy makers have really good intent,” she says. “But sometimes aren't necessarily able to keep up with the industry as it's going forward.”

MLCommons has around 125 member organizations including big tech companies like OpenAI, Google, and Meta, and institutions including Stanford and Harvard.

No Chinese company has yet used the new benchmark, but Weiss and Mattson note that the organization has partnered with AI Verify, a Singapore-based AI safety organization, to develop standards with input from scientists, researchers, and companies in Asia.

"The global, multi-stakeholder process is crucial for building trustworthy safety evaluations,” Percy Liang, computer scientist at Stanford University said in a statement issued with the benchmark’s release.

Will Knight is a senior writer for WIRED, covering artificial intelligence.
