
AI Tools Are Secretly Training on Real Images of Children

Jun 10, 2024 12:01 AM

A popular AI training dataset is “stealing and weaponizing” the faces of Brazilian children without their knowledge or consent, human rights activists claim.

Illustration: Charis Morgan; Getty Images

Over 170 images and personal details of children from Brazil have been repurposed by an open-source dataset without their knowledge or consent, and used to train AI, claims a new report from Human Rights Watch released Monday.

The images were scraped from content posted as recently as 2023 and as far back as the mid-1990s, according to the report, in many cases long before any internet user could have anticipated that their content might be used to train AI. Human Rights Watch claims that personal details and photos of these children were gathered by the data repository Common Crawl, and that URLs linking to them were then included in LAION-5B, a dataset used by AI startups to train their models.

“Their privacy is violated in the first instance when their photo is scraped and swept into these datasets. And then these AI tools are trained on this data and therefore can create realistic imagery of children,” says Hye Jung Han, children’s rights and technology researcher at Human Rights Watch and the researcher who found these images. “The technology is developed in such a way that any child who has any photo or video of themselves online is now at risk because any malicious actor could take that photo, and then use these tools to manipulate them however they want.”

LAION-5B is based on Common Crawl—a repository of data that was created by scraping the web and made available to researchers—and has been used to train several AI models, including Stability AI’s Stable Diffusion image generation tool. Created by the German nonprofit organization LAION, the dataset is openly accessible and now includes links to more than 5.85 billion pairs of images and captions, according to its website. LAION says that it has taken down the links to the images flagged by Human Rights Watch.

The images of children that researchers found came from mommy blogs and other personal, maternity, or parenting blogs, as well as stills from YouTube videos with small view counts, seemingly uploaded to be shared with family and friends.

“Just looking at the context of where they were posted, they enjoyed an expectation and a measure of privacy,” Hye says. “Most of these images were not possible to find online through a reverse image search.”

LAION spokesperson Nathan Tyler says the organization has already taken action. “LAION-5B was taken down in response to a Stanford report that found links in the dataset pointing to illegal content on the public web,” he says, adding that the organization is currently working with “Internet Watch Foundation, the Canadian Centre for Child Protection, Stanford, and Human Rights Watch to remove all known references to illegal content.”

YouTube’s terms of service do not allow scraping except under certain circumstances, and these instances seem to run afoul of those policies. “We've been clear that the unauthorized scraping of YouTube content is a violation of our Terms of Service,” says YouTube spokesperson Jack Malon, “and we continue to take action against this type of abuse.”

In December, researchers at Stanford University found that LAION-5B contained child sexual abuse material. The problem of explicit deepfakes is on the rise even among students in US schools, where they are being used to bully classmates, especially girls. Hye worries that, beyond the use of children’s photos to generate CSAM, the database could reveal potentially sensitive information, such as locations or medical data. In 2022, a US-based artist found her own image in the LAION dataset and realized it was from her private medical records.

“Children should not have to live in fear that their photos might be stolen and weaponized against them,” says Hye. She worries that what she was able to find is just the beginning. It was a “tiny slice” of the data that her team was looking at, she says—less than 0.0001 percent of all the data in LAION-5B. She suspects it is likely that similar images have found their way into the dataset from all over the world.

Last year, a German ad campaign used an AI-generated deepfake to caution parents against posting photos of children online, warning that their children’s images could be used to bully them or create CSAM. But this does not address the issue of images that are already published, or are decades old but still in existence online.

“Removing links from a LAION dataset does not remove this content from the web,” says Tyler. These images can still be found and used, even if it’s not through LAION. “This is a larger and very concerning issue, and as a nonprofit, volunteer organization, we will do our part to help.”

Hye says that the responsibility to protect children and their parents from this type of abuse falls on governments and regulators. The Brazilian legislature is currently considering laws to regulate deepfake creation, and in the US, Representative Alexandria Ocasio-Cortez of New York has proposed the DEFIANCE Act, which would allow people to sue if they can prove that a deepfake of their likeness was made nonconsensually.

“I think that children and their parents shouldn't be made to shoulder responsibility for protecting kids against a technology that's fundamentally impossible to protect against,” Hye says. “It's not their fault.”

Updated: 6/10/2024, 5:20 pm EST: WIRED has clarified that LAION has removed links to the images, and to further explain that the images were initially gathered by data repository Common Crawl, which was then used to inform links that appeared in LAION-5B.

Vittoria Elliott is a reporter for WIRED, covering platforms and power. She was previously a reporter at Rest of World, where she covered disinformation and labor in markets outside the US and Western Europe. She has worked with The New Humanitarian, Al Jazeera, and ProPublica.
