Suchir Balaji helped OpenAI collect data from the internet for AI model training, the NYT reported.
He was found dead in an apartment in San Francisco in late November, according to police.
About a month before, Balaji published an essay criticizing how AI models use data.
The recent death of former OpenAI researcher Suchir Balaji has brought an under-discussed AI debate back into the limelight.
AI models are trained on information from the internet. These tools answer user questions directly, so fewer people visit the websites that created and verified the original data. This drains resources from content creators, which could lead to a less accurate and rich internet.
Elon Musk calls this "Death by LLM." Stack Overflow, a coding Q&A website, has already been damaged by this phenomenon. And Balaji was concerned about this.
Balaji was found dead in late November. The San Francisco Police Department said it found "no evidence of foul play" during the initial investigation. The city's chief medical examiner determined the death to be suicide.
Balaji's concerns
About a month before Balaji died, he published an essay on his personal website that addressed how AI models are created and how this may be bad for the internet.
He cited research studying the impact of AI models that use online data for free to answer questions directly, sucking traffic away from the original sources.
The study analyzed Stack Overflow and found that traffic to this site declined by about 12% after the release of ChatGPT. Instead of going to Stack Overflow to ask coding questions and do research, some developers were just asking ChatGPT for the answers.
Other findings from the research Balaji cited:
There was a decline in the number of questions posted on Stack Overflow after the release of ChatGPT.
The average account age of the question-askers rose after ChatGPT came out, suggesting that fewer people signed up for Stack Overflow or that more users left the online community.
This suggests that AI models could undermine some of the incentives that created the information-rich internet as we know it today.
If people can get their answers directly from AI models, there's no need to go to the original sources of the information. If people don't visit websites as much, advertising and subscription revenue may fall, and there would be less money to fund the creation and verification of high-quality online data.
MKBHD wants to opt out
It's even more galling to imagine that AI models might be doing this based partly on your own work.
Tech reviewer Marques Brownlee experienced this recently when he reviewed OpenAI's Sora video model and found that it created a clip with a plant that looked a lot like a plant from his own videos posted on YouTube.
"Are my videos in that source material? Is this exact plant part of the source material? Is it just a coincidence?" said Brownlee, who's known as MKBHD.
Naturally, he also wanted to know if he could opt out and prevent his videos from being used to train AI models. "We don't know if it's too late to opt out," Brownlee said.
'Not a sustainable model'
In an interview with The New York Times published in October, Balaji said AI chatbots like ChatGPT are stripping away the commercial value of people's work and services.
The publication reported that while working at OpenAI, Balaji was part of a team that collected data from the internet for AI model training. He joined the startup with high hopes for how AI could help society, but became disillusioned, NYT wrote.
"This is not a sustainable model for the internet ecosystem," he told the publication.
In a statement to the Times about Balaji's comments, OpenAI said the way it builds AI models is protected by fair use copyright principles and supported by legal precedents. "We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness," it added.
In his essay, Balaji disagreed.
One of the four factors in a fair-use analysis is whether a new work impacts the potential market for, or value of, the original copyrighted work. If it does this type of damage, the use is unlikely to count as "fair use" and be allowed.
Balaji concluded that ChatGPT and other AI models don't qualify for fair use copyright protection.
"None of the four factors seem to weigh in favor of ChatGPT being a fair use of its training data," he wrote. "That being said, none of the arguments here are fundamentally specific to ChatGPT either, and similar arguments could be made for many generative AI products in a wide variety of domains."
Talking about data
Tech companies producing these powerful AI models don't like to talk about the value of training data. They've even stopped disclosing where they get the data from, which was a common practice until a few years ago.
"They always highlight their clever algorithms, not the underlying data," Nick Vincent, an AI researcher, told BI last year.
Balaji's death may finally give this debate the attention it deserves.
"We are devastated to learn of this incredibly sad news today and our hearts go out to Suchir's loved ones during this difficult time," an OpenAI spokesperson told BI recently.
If you or someone you know is experiencing depression or has had thoughts of harming themself or taking their own life, get help. In the US, call or text 988 to reach the Suicide & Crisis Lifeline, which provides 24/7, free, confidential support for people in distress, as well as best practices for professionals and resources to aid in prevention and crisis situations. Help is also available through the Crisis Text Line — just text "HOME" to 741741. The International Association for Suicide Prevention offers resources for those outside the US.
An entity claiming to be United Healthcare is sending bogus copyright claims to internet platforms to get Luigi Mangione fan art taken off the internet, according to the print-on-demand merch retailer TeePublic. An independent journalist was hit with a copyright takedown demand over an image of Luigi Mangione and his family she posted on Bluesky, and other DMCA takedown requests posted to an open database and viewed by 404 Media show copyright claims trying to get “Deny, Defend, Depose” and Luigi Mangione-related merch taken off the internet, though it is unclear who is filing them.
Artist Rachel Kenaston was selling merch on TeePublic, a print-on-demand shop, featuring a watercolor design based on surveillance footage of Mangione.
She got an email from TeePublic that said “We're sorry to inform you that an intellectual property claim has been filed by UnitedHealth Group Inc against this design of yours on TeePublic,” and said “Unfortunately, we have no say in which designs stay or go” because of the DMCA. This is not true: platforms are able to assess the validity of any DMCA claim and can decide whether to take the supposedly infringing content down or not. But most platforms choose the path of least resistance and take down content that is obviously not infringing; Kenaston’s design clearly violates no one’s copyright. Kenaston appealed the decision and TeePublic told her: “Unfortunately, this was a valid takedown notice sent to us by the proper rightsholder, so we are not allowed to dispute it,” which, again, is not true.
The threat was framed as a “DMCA Takedown Request.” The DMCA is the Digital Millennium Copyright Act, the incredibly important law that governs most copyright enforcement on the internet. Copyright law is complicated, but, basically, DMCA takedowns are filed to notify a social media platform, search engine, or website owner that something it is hosting or pointing to is copyrighted, and then, all too often, the platform will take the content down without much of a review in hopes of avoiding being sued.
“It's not unusual for large companies to troll print-on-demand sites and shut down designs in an effort to scare/intimidate artists, it's happened to me before and it works!,” Kenaston told 404 Media in an email. “The same thing seems to be happening with UnitedHealth - there's no way they own the rights to the security footage of Luigi smiling (and if they do.... wtf.... seems like the public should know that) but since they made a complaint my design has been removed from the site and even if we went to court and I won I'm unsure whether TeePublic would ever put the design back up. So basically, if UnitedHealth's goal is to eliminate Luigi merch from print-on-demand sites, this is an effective strategy that's clearly working for them.”
Do you know anything else about copyfraud or DMCA abuse? I would love to hear from you. Using a non-work device, you can message me securely on Signal at +1 202 505 1702. Otherwise, send me an email at [email protected].
There is no world in which United Health Group owns the copyright to Kenaston’s watercolor painting of Luigi Mangione surveillance footage; it quite literally has nothing to do with anything the company owns. It is also illegal to file a DMCA takedown notice unless you have a “good faith” belief that you are the rights holder (or are representing the rights holder) of the material in question.
“What is the circumstance under which United Healthcare might come to own the copyright to a watercolor painting of the guy who assassinated their CEO?” tech rights expert and science fiction author Cory Doctorow told 404 Media in a phone call. “It’s just like, it’s hard to imagine” a lawyer thinking that, he added, saying that it’s an example of “copyfraud.”
United Healthcare did not respond to multiple requests for comment, and TeePublic also did not respond to a request for comment. Because copyfraud in general is so common, it is theoretically possible that another entity impersonated United Healthcare to request the removal.
But Kenaston’s work is not the only United Healthcare or Luigi Mangione-themed artwork on the internet that has been hit with bogus DMCA takedowns in recent days. Several platforms publish the takedown requests they receive to the Lumen Database, a public repository of DMCA notices.
On December 7, someone named Samantha Montoya filed a DMCA takedown with Google that targeted eight websites selling “Deny, Defend, Depose” merch that uses elements of the United Healthcare logo. Montoya’s DMCA is very sparse, according to the copy posted on Lumen: “The logo consists of a half ellipse with two arches matches the contour of the ellipse. Each ellipse is the beginning of the words Deny, Defend, Depose which are stacked to the right. Our logo comes in multiple colors.”
Medium, one of the targeted websites, has deleted the page that the merch was hosted on. It is not clear from the DMCA whether the person filing this is associated with United Healthcare, or whether they are associated with deny-defend-depose.com and are filing against copycats. Deny-defend-depose.com did not respond to a request for comment. Similarly, a DMCA takedown filed by someone named Manh Nguyen targets a handful of “Deny, Defend, Depose” and Luigi Mangione-themed t-shirts on a website called Printiment.com.
Based on the information on Lumen Database, there is unfortunately no way to figure out who Samantha Montoya or Manh Nguyen are associated with or working on behalf of.
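For anyone who wants to trawl these records themselves, Lumen exposes a public API for searching notices. Below is a minimal sketch of such a search, assuming Lumen's documented JSON search endpoint and X-Authentication-Token header; the endpoint, parameters, and response fields are assumptions based on Lumen's public documentation, and you'd need to request an API token from Lumen first.

```python
# Minimal sketch of searching the Lumen Database for takedown notices.
# Assumes Lumen's documented notices/search.json endpoint and its
# X-Authentication-Token header; field names like "sender_name" are
# assumptions from Lumen's public docs, so verify before relying on them.
import requests

LUMEN_SEARCH_URL = "https://lumendatabase.org/notices/search.json"
API_TOKEN = "your-lumen-api-token"  # placeholder; request one from Lumen


def search_notices(term: str, page: int = 1) -> list[dict]:
    """Return one page of takedown notices matching a search term."""
    resp = requests.get(
        LUMEN_SEARCH_URL,
        params={"term": term, "page": page},
        headers={"X-Authentication-Token": API_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("notices", [])


if __name__ == "__main__":
    for notice in search_notices("Deny Defend Depose"):
        # Sender fields are self-reported and often redacted, which is
        # exactly why attributing notices like Montoya's is so hard.
        print(notice.get("id"), notice.get("title"), notice.get("sender_name"))
```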
Not Just Fan Art
Over the weekend, a lawyer demanded that independent journalist Marisa Kabas take down an image of Luigi Mangione and his family that she posted to Bluesky, which was originally posted on the campaign website of Maryland assemblymember Nino Mangione.
The lawyer, Desiree Moore, said she was “acting on behalf of our client, the Doe Family,” and claimed that “the use of this photograph is not authorized by the copyright owner and is not otherwise permitted by law.”
Moore said that Nino Mangione’s website “does not in fact display the photograph,” even though the Wayback Machine shows that it obviously did display the image. In a follow-up email to Kabas, Moore said “the owner of the photograph has not authorized anyone to publish, disseminate, or otherwise use the photograph for any purpose, and the photograph has been removed from various digital platforms as a result,” which suggests that other websites have also been threatened with takedown requests. Moore also said that her “client seeks to remain anonymous” and that “the photograph is hardly newsworthy.” The New York Post also published the image, and blurred versions of the image remain on its website. The New York Post did not respond to a request for comment. Kabas deleted her Bluesky post “to avoid any further threats,” she said.
“It feels like a harbinger of things to come, coming directly after journalists for something as small as a social media post,” Kabas, who runs the excellent independent site The Handbasket, told 404 Media in a video chat. “They might be coming after small, independent publishers because they know we don’t have the money for a large legal defense, and they’re gonna make an example out of us, and they’re going to say that if you try anything funny, we’re going to try to bankrupt you through a frivolous lawsuit.”
The takedown request to Kabas in particular is notable for a few reasons. First, it shows that the Mangione family or someone associated with it is using the prospect of a copyright lawsuit to threaten journalists for reporting on one of the most important stories of the year, which is particularly concerning in an atmosphere where journalists are increasingly being targeted by politicians and the powerful. But it’s also notable that the threat was sent directly to Kabas for something she posted on Bluesky, rather than being sent to Bluesky itself. (Bluesky did not respond to a request for comment for this story, and we don’t know if Bluesky also received a takedown request about Kabas’s post.)
Sometimes for better, but mostly for worse, social media platforms have long served as a layer between their users and copyright holders (and their lawyers). YouTube deals with huge numbers of takedown requests filed under the DMCA. But to avoid DMCA headaches, it has also set up automated tools such as Content ID and other algorithmic copyright checks that let copyright holders claim ownership of, and monetization rights to, supposedly copyrighted material that users upload, without invoking the DMCA at all. YouTube and other social media platforms have also infamously set up “copy strike” systems, where people can have their channels demonetized, downranked in the algorithm, or deleted outright if a rights holder, or an automated system, claims a post or video violates their copyright.
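To make the mechanics concrete, here is a toy sketch of match-based flagging. It is emphatically not YouTube's actual Content ID, which uses proprietary audio and video fingerprinting; this stand-in just hashes overlapping eight-word windows of text and scores the overlap, which is enough to show why a threshold-based auto-claim sweeps in quotations and other plausible fair uses.

```python
# Toy sketch of automated copyright matching. NOT YouTube's Content ID;
# real systems fingerprint audio/video. This stand-in hashes overlapping
# word windows to show how score-threshold flagging over-enforces.
import hashlib

CHUNK = 8  # words per fingerprint window


def fingerprints(text: str) -> set[str]:
    """Hash every overlapping CHUNK-word window of the input."""
    words = text.lower().split()
    if len(words) < CHUNK:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + CHUNK]).encode()).hexdigest()
        for i in range(len(words) - CHUNK + 1)
    }


def match_score(upload: str, claimed_work: str) -> float:
    """Fraction of the upload's windows that also appear in the claimed work."""
    up, claimed = fingerprints(upload), fingerprints(claimed_work)
    return len(up & claimed) / len(up) if up else 0.0


if __name__ == "__main__":
    claimed = "deny defend depose was written on the shell casings according to reports " * 3
    # A long quotation inside original commentary, plausibly fair use,
    # still scores high enough that a threshold-based system would claim it.
    upload = "here is my commentary on this week's biggest news story " + claimed
    print(f"match score: {match_score(upload, claimed):.2f}")
```

Tune the threshold low enough to catch re-uploads and a matcher like this inevitably catches quotation and commentary too, which is the kind of over-enforcement described below.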
This layer between copyright holders and social media users has created all kinds of bad situations where social media platforms overzealously enforce against content that may be OK to use under fair use provisions or where someone who does not own the copyright at all abuses the system to get content they don’t like taken down, which is what happened to Kenaston.
Copyright takedown processes at social media companies almost always err on the side of copyright holders, which is a problem. On the other hand, because social media companies are usually the ones receiving DMCA notices or otherwise dealing with copyright, individual users do not usually have to deal directly with lawyers threatening them over something they tweeted, uploaded to YouTube, or posted on Bluesky.
There is a long history of powerful people and companies abusing copyright law to get reporting or posts they don’t like taken off the internet. Very often, these attempts backfire, with the rightsholder Streisand Effecting themselves. In recent weeks, though, independent journalists have been getting these DMCA takedown requests (which are explicit legal threats) directly. A “reputation management company” tried to bribe Molly White, who runs Web3IsGoingGreat and Citation Needed, to delete a tweet and a post about the arrest of Roman Ziemian, the cofounder of FutureNet, over an alleged crypto fraud. When the bribe didn’t work, because White is a good journalist who doesn’t take bribes, she was hit with a frivolous DMCA claim, which she wrote about here.
These sorts of threats do happen from time to time, but the fact that several notable ones have come in quick succession before Trump takes office is striking, considering that Trump himself said earlier this week that he feels emboldened by ABC’s decision to settle a libel lawsuit with him by agreeing to pay a total of $16 million. That case, in which George Stephanopoulos said Trump had been found civilly liable for “rape” rather than for “sexual abuse,” has scared the shit out of media companies.
This is because libel cases involving public figures turn on whether the person’s reputation was actually harmed, whether the news outlet acted with “actual malice” rather than mere negligence, and the severity of the harm inflicted. Considering that Trump is the most public of public figures, that he still won the presidency, and that a jury did find him liable for “sexual abuse,” the settlement is a terrible kowtowing to power that sets a horrible precedent.
Trump’s case with ABC isn’t exactly related to a DMCA takedown filed over a Bluesky post, but they’re both happening in an atmosphere in which powerful people feel empowered to target journalists.
“There’s also the Kash Patel of it all. They’re very openly talking about coming after journalists. It’s not hypothetical,” Kabas said, referring to Trump’s pick to lead the FBI. “I think that because the new administration hasn’t started yet, we don’t know for sure what that’s going to look like,” she said. “But we’re starting to get a taste of what it might be like.”
What’s happening to Kabas and Kenaston highlights how screwed up the internet is, and how rampant DMCA abuse is. Transparency databases like Lumen help a lot, but it’s still possible to obscure where any given takedown request is coming from, and platforms like TeePublic do not post full DMCAs.
Itch.io says AI-powered "brand protection software" sent phishing reports to its domain registrar and hosting providers, causing its domain to be disabled.
Canadian news companies have sued OpenAI, alleging the ChatGPT-maker uses their content without permission.
The lawsuit claims OpenAI violated Canadian copyright laws and profited from it.
OpenAI faces similar copyright infringement lawsuits from other news outlets and authors.
Several top Canadian news companies have accused ChatGPT creator OpenAI of intentionally ripping off their copyrighted content to train its large language models.
Media companies Torstar, Postmedia, The Globe and Mail, The Canadian Press, and CBC/Radio-Canada allege in a new lawsuit against OpenAI that the artificial intelligence startup has "engaged in ongoing, deliberate, and unauthorized misappropriation" of their news works.
The lawsuit, filed on Friday in the Ontario Superior Court of Justice and viewed by Business Insider, accuses OpenAI of violating Canadian copyright laws and "unjustly enriching" itself at the expense of the news media companies.
In response to the lawsuit, an OpenAI spokesperson told Business Insider in a statement that its models are "trained on publicly available data, grounded in fair use and related international copyright principles that are fair for creators and support innovation."
"We collaborate closely with news publishers, including in the display, attribution and links to their content in ChatGPT search, and offer them easy ways to opt-out should they so desire," the spokesperson said.
The news companies alleged in a joint statement that OpenAI "regularly breaches copyright and online terms of use by scraping large swaths of content from Canadian media to help develop its products, such as ChatGPT."
"OpenAI is capitalizing and profiting from the use of this content, without getting permission or compensating content owners," the statement said. "Journalism is in the public interest. OpenAI using other companies' journalism for their own commercial gain is not. It's illegal."
The 84-page lawsuit seeks an undisclosed amount of damages to compensate the media companies for the "wrongful misappropriation" of their works as well as a permanent injunction in order to prevent OpenAI from carrying out "unlawful conduct."
"Rather than seek to obtain the information legally, OpenAI has elected to brazenly misappropriate the News Media Companies' valuable intellectual property and convert it for its own uses, including commercial uses, without consent or consideration," the lawsuit alleges.
The lawsuit follows a flurry of other lawsuits previously filed by authors, visual artists, news outlets, and computer coders against AI companies like OpenAI, arguing that their original works were used to train AI tools without their permission.
Other media organizations, including Axel Springer, the parent company of Business Insider, have partnered with OpenAI and licensed their work for use by the company.
OpenAI keeps deleting data that could allegedly prove the AI company violated copyright laws by training ChatGPT on authors' works. Apparently largely unintentional, the sloppy practice is seemingly dragging out early court battles that could determine whether AI training is fair use.
Most recently, The New York Times accused OpenAI of unintentionally erasing programs and search results that the newspaper believed could be used as evidence of copyright abuse.
The NYT apparently spent more than 150 hours extracting training data, while following a model inspection protocol that OpenAI set up precisely to avoid conducting potentially damning searches of its own database. This process began in October, but by mid-November, the NYT discovered that some of the data gathered had been erased due to what OpenAI called a "glitch."
The Times is one of several media organizations that have sued OpenAI for copyright infringement.
A judge denied OpenAI's request for information on how the Times uses AI.
The judge used an analogy to a video game company to explain her decision.
The New York Times sued OpenAI in December, arguing that the company used its articles without permission to train ChatGPT.
The case is now in the discovery phase, where both sides gather and exchange evidence before the trial. As part of that, OpenAI requested to know more about how the Times uses generative AI, including its use of generative AI tools from other companies, any AI tools it's developing for its reporting, and its views on the technology.
Judge Ona T. Wang rejected that request on Friday, calling it irrelevant. She then offered an analogy to explain her decision, comparing OpenAI to a video game manufacturer and the Times to a copyright holder.
If a copyright holder sued a video game manufacturer for copyright infringement, the copyright holder might be required to produce documents relating to their interactions with that video game manufacturer, but the video game manufacturer would not be entitled to wide-ranging discovery concerning the copyright holder's employees' gaming history, statements about video games generally, or even their licensing of different content to other video game manufacturers.
In the same case, legal filings revealed earlier this month that OpenAI engineers accidentally deleted evidence that Times lawyers had gathered from OpenAI's servers. Lawyers for the outlet had spent over 150 hours searching through OpenAI's training data for instances of infringement, storing what they found on virtual machines the company created. The majority of the data has been recovered, and the Times' lawyers said there is no reason to believe the deletion was "intentional."
The case is one among dozens of copyright cases filed against OpenAI, including by media organizations like the New York Daily News, the Denver Post, and The Intercept. Some of these cases have already been dismissed. Earlier this month, a federal judge dismissed cases from Raw Story and AlterNet because the outlets did not demonstrate "concrete" harm from OpenAI's actions.
OpenAI is also facing lawsuits from authors, including one involving comedian Sarah Silverman. Silverman and over a dozen authors filed an initial complaint against OpenAI in 2023, saying the tech company illegally used their books to train ChatGPT.
"Much of the material in OpenAI's training datasets, however, comes from copyrighted works — including books written by Plaintiffs — that were copied by OpenAI without consent, without credit, and without compensation," the complaint says.
OpenAI's website says the company develops ChatGPT and its other services using three sources: publicly available information online, information accessed by partnering with third parties, and information provided or generated by its users, researchers, or human trainers.
Silverman, who authored "The Bedwetter: Stories of Courage, Redemption, and Pee," discussed the ongoing legal dispute with actor Rob Lowe on his SiriusXM podcast. She said taking on OpenAI will be "tough."
"They are the richest entities in the world, and we live in a country where that's considered a person that can influence, practically create policy, let alone influence it," she said.
Some media organizations, including Axel Springer, the parent company of Business Insider, have chosen to partner with OpenAI, licensing their content in deals worth tens of millions of dollars.
OpenAI and the Times did not immediately respond to a request for comment from Business Insider.