Amazon Is Investigating Perplexity Over Claims of Scraping Abuse

Jun 27, 2024 6:15 PM

Amazon has looked into a WIRED report that a server linked to Perplexity—and hosted by AWS—appears to have been used to scrape the sites of major outlets against their wishes.

Amazon’s cloud division has launched an investigation into Perplexity AI. At issue is whether the AI search startup is violating Amazon Web Services rules by scraping websites that attempted to prevent it from doing so, WIRED has learned.

AWS spokesperson Patrick Neighorn confirmed the company's investigation of Perplexity following a WIRED inquiry about the startup's apparent scraping practices. WIRED had previously found that Perplexity (which has backing from the Jeff Bezos family fund and Nvidia, and was recently valued at $3 billion) appears to rely on content from scraped websites that had forbidden access through the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that involves placing a plaintext file (like wired.com/robots.txt) on a domain to indicate which pages should not be accessed by automated bots and crawlers. While companies that use scrapers can choose to ignore this protocol, most have traditionally respected it. Neighorn told WIRED that AWS customers must adhere to the robots.txt standard while crawling websites.
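As a rough illustration of how a well-behaved crawler might consult such a file before fetching a page, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `ExampleBot` and `OtherBot` user agents, the `example.com` URLs, and the `may_fetch` helper are all hypothetical, chosen only for illustration; they are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not any real site's file).
# "ExampleBot" is barred from the whole site; every other bot is barred
# only from paths under /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/story"),
        ("OtherBot", "https://example.com/story"),
        ("OtherBot", "https://example.com/private/draft"),
    ]:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

The check is purely advisory: nothing in the protocol stops a client that skips this step, which is why compliance depends on the crawler's operator.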

“AWS’s terms of service prohibit abusive and illegal activities and our customers are responsible for complying with those terms,” Neighorn said in a statement. “We routinely receive reports of alleged abuse from a variety of sources and engage our customers to understand those reports.”

Scrutiny of Perplexity’s practices follows a June 11 report from Forbes that accused the startup of stealing at least one of its articles. WIRED investigations confirmed the practice and found further evidence of scraping abuse and plagiarism by systems linked to Perplexity’s AI-powered search chatbot. Engineers for Condé Nast, WIRED’s parent company, block Perplexity’s crawler across all its websites using a robots.txt file. But WIRED found the company had access to a server using an unpublished IP address—44.221.181.252—which visited Condé Nast properties at least hundreds of times in the past three months, apparently to scrape Condé Nast websites.

The machine associated with Perplexity appears to be engaged in widespread crawling of news websites that forbid bots from accessing their content. Spokespeople for The Guardian, Forbes, and The New York Times also say they detected the IP address repeatedly visiting their servers.

WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS, which launched its investigation after we asked whether using AWS infrastructure to scrape websites that forbade it violated the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas responded to WIRED’s investigation first by saying the questions we posed to the company “reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work.” Srinivas then told Fast Company that the secret IP address WIRED observed scraping Condé Nast websites and a test site we created was operated by a third-party company that performs web crawling and indexing services. He refused to name the company, citing a nondisclosure agreement. When asked if he would tell the third party to stop crawling WIRED, Srinivas replied, “It’s complicated.”

Sara Platnick, a Perplexity spokesperson, tells WIRED that the company responded to Amazon’s inquiries on Wednesday and characterized the investigation as standard procedure. Platnick says Perplexity made no changes to its operation in response to Amazon’s concerns.

“Our PerplexityBot—which runs on AWS—respects robots.txt, and we confirmed that Perplexity-controlled services are not crawling in any way that violates AWS Terms of Service,” Platnick says. She adds, however, that PerplexityBot will ignore robots.txt when a user enters a specific URL in their prompt—a use-case Platnick describes as “very infrequent.”

“When a user prompts with a specific URL, that doesn’t trigger crawling behavior,” Platnick says. “The agent acts on the user’s behalf to retrieve the URL. It works the same way as if the user went to a page themselves, copied the text of the article, and then pasted it into the system.”

This description of Perplexity’s functionality confirms WIRED’s findings that its chatbot is ignoring robots.txt in certain instances.

Digital Content Next is a trade association for the digital content industry whose members include The New York Times, The Washington Post, and Condé Nast. Last year, the organization shared draft principles for governing generative AI to prevent potential copyright violations. CEO Jason Kint tells WIRED that if the allegations against Perplexity are true, the company is violating many of those principles.

“By default, AI companies should assume they have no right to take and reuse publishers' content without permission,” Kint says. If Perplexity is skirting terms of service or robots.txt, he adds, “the red alarms should be going off that something improper is going on.”

Update 6/28/24 4:39pm ET: This story includes an updated statement from AWS spokesperson Patrick Neighorn.

Dhruv Mehrotra (he/him) is an investigative data reporter for WIRED. He uses technology to find, build, and analyze data sets for storytelling. Before joining WIRED, he worked for the Center for Investigative Reporting and was a researcher at New York University's Courant Institute of Mathematical Sciences. At Gizmodo, he was…
Senior Writer

Andrew Couts is Senior Editor, Security & Investigations at WIRED overseeing cybersecurity, privacy, policy, national security, and surveillance coverage. He also oversees investigations across WIRED's newsroom. Prior to WIRED, he served as executive editor of Gizmodo and politics editor at the Daily Dot. He was part of teams whose works…
Senior Editor, Security & Investigations

*****
Credit belongs to : www.wired.com
