
Inside Meta's secret experiments that improve its AI models

17 April 2025 at 02:00
Mark Zuckerberg at the Breakthrough Prize Ceremony in Santa Monica, California.

Gilbert Flores/Variety via Getty Images

  • A legal case involving Meta revealed the company's secret experiments with training data.
  • Meta used "ablation" to identify how specific data improved its Llama AI models.
  • Some researchers say this could support a system to assign value to AI data and pay compensation.

A high-profile legal case has unearthed a trove of internal Meta communications, and one particular document has caught the eye of some AI researchers.

The document reveals new insights into how AI models are built and could influence who gets to share in the spoils of this new technology.

Buried in these court filings is a description of how Meta researchers used a process called ablation to identify which data helped improve the company's Llama AI models.

Ablation borrows its name from a medical technique in which surgeons deliberately destroy tissue to treat conditions in organs like the brain. In AI, it involves removing parts of a system, such as slices of training data, to study how those components contribute to performance.

Brain surgery in action.

BSIP/Universal Images Group via Getty Images

In Meta's ablation experiments, the company replaced a portion of its AI training data with pirated books from a giant database called LibGen. Then, the company re-trained its Llama model to see the impact.

In one experiment, Meta added books about science and technology, along with fiction books, to the training data. In a second experiment, Meta only added fiction books.

In both experiments, Llama performance improved notably in industry benchmark evaluations, according to the internal Meta document disclosed in court filings. (Check out pages 18 and 19 here.)
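
To make the method concrete, here is a minimal sketch of what a data-ablation loop of the kind described in the filings might look like. Every name and number in it is a hypothetical placeholder, not Meta's actual pipeline; the dummy scores are chosen to echo the 4.5% BoolQ gain reported in the document.

# Toy data-ablation loop: pretrain on a baseline data mix, pretrain again
# with one slice of the mix swapped out, and compare benchmark scores.
# All names and scores are hypothetical placeholders, not Meta's pipeline.
from dataclasses import dataclass

@dataclass
class DataMix:
    name: str
    sources: list[str]

def train_model(mix: DataMix) -> str:
    # Stand-in for a full (and very expensive) pretraining run on the mix.
    return f"model[{mix.name}]"

def evaluate(model: str, benchmark: str) -> float:
    # Stand-in for scoring the model on a benchmark such as BoolQ or SIQA.
    dummy_scores = {"model[baseline]": 0.700, "model[plus_books]": 0.745}
    return dummy_scores.get(model, 0.0)

baseline = DataMix("baseline", ["web_crawl", "code", "wikipedia"])
ablated = DataMix("plus_books", ["web_crawl", "code", "books"])  # one slice swapped

delta = evaluate(train_model(ablated), "BoolQ") - evaluate(train_model(baseline), "BoolQ")
print(f"BoolQ delta from swapping in books: {delta:+.1%}")  # prints +4.5%

In the real experiments, each training call is a full pretraining run costing enormous amounts of compute, which is part of why the resulting per-dataset value estimates are treated as closely held information.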

This shows that Meta has the ability to assign value to specific training data, said Nick Vincent, assistant professor in the School of Computing Science at Simon Fraser University.

Ablation is common, but also a secret

Nicholas Braun and a llama in "Saturday Night."

Sony Pictures

Ablation has become common practice at Meta and across the AI industry. One Meta engineer's LinkedIn profile, for instance, mentions running more than 100 ablations during the development of Llama 4 and earlier iterations of the company's big AI models.

Meta doesn't publish the results of these experiments, and other AI companies keep this stuff private, too, Vincent said.

One potential reason: If tech giants tell the world which training data specifically helped their AI models, then the creators of this information would want to be paid, and they would have a handy estimate of how much money they're owed.

"Stating these numbers publicly would potentially give some content organizations firmer ground to stand on," Vincent said.

Making the results of ablation experiments public could also impact high-stakes copyright lawsuits that rage across the tech industry, with this specific Meta case (Kadrey v. Meta) being a good example.

In these cases, tech giants and AI startups argue that it's not copyright infringement for machines to "learn" from published material online.

Internal documents that assign value to specific copyrighted content could undercut that argument.

"It's possible that publishing these value estimations would undermine the stances that Big Tech companies will take in these copyright lawsuits and court cases," Vincent said.

A Meta spokesperson said the company disagrees with the plaintiff's arguments in this legal case and added that its Llama models are helping individuals and companies be more innovative, productive, and creative.

"We will continue to vigorously defend ourselves and to protect the development of GenAI for the benefit of all," the spokesperson said.

Training data sources are now hidden

ProRata CEO Bill Gross speaks onstage at a conference.

Matthias Balk/picture alliance via Getty Images

Keeping ablation experiments secret follows a broader trend away from sharing how data contributes to the creation and performance of AI models.

In 2017, the Google research paper that kicked off the generative AI boom disclosed granular information on the training data used. It included about 40,000 sentences from The Wall Street Journal, for instance. In its GPT-2 paper a couple of years later, OpenAI described scraping web pages using millions of outbound links from Reddit.

Fast forward to today, and companies share very little. When Meta released Llama 4 in early April, the company published a model card describing how it built the product. It didn't mention ablation at all, and it discussed the training data only generically, as "a mix of publicly available, licensed data and information from Meta's products and services."

Again, the likely reason is that telling everyone what data you used might mean having to pay the people who created it.

"It's really disappointing that they're not being open about it, and they're not giving credit to the material," said Bill Gross, CEO of ProRata, a startup that's trying to compensate creators for their contributions to AI.

Gross said content creators should be paid twice: once for having their data used to train AI models and again when AI models rely on this content to answer user questions.

Meta's secret ablation results

Llamas or alpacas? Can you tell the difference?

Don Mason/Getty Images

Meta's ablation experiments focus on the first of those steps: training, which uses mountains of data to help models understand the world. For example: To teach a machine to recognize a llama, you must show it as many photos of llamas and alpacas as possible so it can distinguish between the two animals.

Meta's first ablation experiment found that adding science, technology, and fiction books to the training data improved Llama's performance by 4.5% on an industry benchmark called BoolQ. Adding just the fiction books resulted in a 6% improvement.

The performance gains from these ablation experiments were as high as 5.5% on another benchmark known as SIQA, the Meta internal document said.

Peter Henderson, an assistant professor of computer science at Princeton, tweeted out some Meta charts from the court document showing these gains.

"Lots of internal Llama 2 data mix ablations revealed as part of discovery in the ongoing copyright litigation. Link below. pic.twitter.com/7YeRyYSEWV"

Peter Henderson (@PeterHndrsn), January 15, 2025

Performance gains of about 5% may seem small, but in the AI race, any advantage is important.

"That's actually a lot because it's so hard to get every extra point on AI benchmarks," Gross said.

Can elves mate with humans?

Orlando Bloom as Legolas in "The Lord of the Rings."

New Line Cinema

Llama's improvement on the BoolQ benchmark shows the power of specific training data and how much AI models and tech companies rely on this information, Vincent said.

BoolQ is a series of 15,942 yes/no questions that AI models must answer. The more questions they get right, the higher the performance. A 5% improvement is the equivalent of answering almost 800 extra questions correctly.
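
The arithmetic behind that figure is simple enough to check in a couple of lines; the snippet below just multiplies the question count by the gain.

# Back-of-the-envelope check: a 5% accuracy gain on BoolQ's 15,942
# yes/no questions equals roughly 800 extra correct answers.
total_questions = 15_942
gain = 0.05  # a 5-percentage-point improvement

extra_correct = round(total_questions * gain)
print(f"{extra_correct} more questions answered correctly")  # prints 797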

One question on the BoolQ test asked, "Can elves and humans mate in 'Lord of the Rings?'"

You can only really know the answer to this for sure if you've read J.R.R. Tolkien's books, or rather if these books are in the training data, Vincent said. (Elves and humans can have babies in the LOTR universe, by the way.)

Vincent hopes revelations like this about Meta's secret ablation experiments will help create a new system that assigns credit to sources of training data and provides appropriate compensation.

"AI chatbot products rely on the fact that some human somewhere did something useful, wrote it down, and published it," he said. "This technology repackages this stuff into something that is hopefully more useful."

"Ultimately, it's all humans at the top of this. Without this data, AI models will not be so good," he added. "Evidence of ablation like this could end up serving the mission of setting up a healthy data flow. It's important to sustain the institutions where people are incentivized to create content and knowledge and share it."


ChatGPT can't decide whether its Ghibli-style images violate copyright or not

27 March 2025 at 15:52
An image generated by OpenAI's 4o tool, showing an older artist being angry at a young tech executive.

Pranav Dixit/OpenAI's 4o tool

  • OpenAI's new 4o tool generates Ghibli-style images on request, via the paid version of ChatGPT.
  • The free version of ChatGPT, which uses OpenAI's older DALL-E 3 tool, refuses to create such images.
  • The free ChatGPT said it can't do this because Ghibli "is a copyrighted animation studio, and its artistic style is protected."

When AI-generated Ghibli-style images started popping up on social media this week, I contacted OpenAI.

The startup had just launched a new image-generation tool called 4o, a powerful upgrade from its DALL-E 3 service. Users started asking ChatGPT for images in the style of the famed Japanese animation house Studio Ghibli. And the new 4o obliged.

I tried it myself, using the free version of ChatGPT, and got a much different response: "I wasn't able to generate the image in the style of Studio Ghibli due to content policy restrictions," OpenAI's chatbot told me.

Why was OpenAI letting 4o users do this while refusing my similar requests on the basis of "content policy"? I asked an OpenAI spokesperson. She responded on Wednesday with an explanation that cited an update to OpenAI's system card, the document that lays out the details of new models and tools like 4o.

"We added a refusal which triggers when a user attempts to generate an image in the style of a living artist," the company said in this document. The OpenAI spokesperson added that the company continued to prevent "generations in the style of individual living artists" but did permit "broader studio styles."

Hayao Miyazaki, the artist who cofounded Studio Ghibli, is still alive, so using 4o to generate images in his style would seem to be off-limits. Then again, Ghibli is a big studio, so maybe these images fall under the "broader studio" policy the OpenAI spokesperson described.

Either way, it's clear that OpenAI has made a major change in its approach to copyright and image generation lately.

On Thursday, my colleague Pranav Dixit and I tested this out to show how OpenAI's technology treats similar requests differently, depending on which image-generation tool you use.

Pranav used the paid ChatGPT service, which comes with the new 4o tool. He asked for images in the style of Studio Ghibli. The chatbot created several, including the one at the top of this story. It shows an older artist being angry at a younger tech executive who looks a bit like OpenAI CEO Sam Altman. Weird coincidence!

Pranav then went down the technology rabbit hole, which he enjoys doing. (A good trait for a tech reporter.) He got 4o to churn out several more images in the Ghibli style, like this one.

An older artist holds his head in his hands.

Pranav Dixit/OpenAI's 4o tool

I tried similar Ghibli-style requests on Thursday using the free ChatGPT service, which comes with OpenAI's older DALL-E 3 image-generation tool.

The tool refused my requests, citing copyright rules. Here's what ChatGPT told me:

"I can't generate images in the style of Studio Ghibli because it is a copyrighted animation studio, and its artistic style is protected."

You can't be clearer than that. OpenAI won't do this because it would infringe Studio Ghibli's copyright.

And yet, another OpenAI tool is quite happy to generate these types of images. So what gives?

I asked OpenAI's spokesperson if this is a double standard. Or has the company changed its approach to copyright recently? Or maybe it has struck a content deal with Studio Ghibli?

OpenAI didn't respond to these questions on Thursday afternoon. Studio Ghibli, which is based in Tokyo, Japan, also didn't respond to a request for comment from Business Insider late on Wednesday, US time.

If we get more answers about this confusing situation, we'll write a follow-up story.

In the meantime, this is probably a great way to get users to upgrade to the paid version of OpenAI's ChatGPT service. I'm still grumpy that Pranav can generate better images than me. Here's the one I managed to get out of the free version.

An image of an older artist.

Alistair Barr/OpenAI's DALL-E 3, via ChatGPT

