AI improvements are slowing down. Companies have a plan to break through the wall.

The pre-training dilemma

Researchers point to two key blocks that companies may encounter in an early phase of AI development, known as pre-training. The first is access to computing power. More specifically, this means getting hold of specialist chips called GPUs. It's a market dominated by Santa Clara-based chip giant Nvidia, which has battled with supply constraints in the face of nonstop demand.

"If you have $50 million to spend on GPUs but you're on the bottom of Nvidia's list — we don't have enough kimchi to throw at this, and it will take time," said Henri Tilloy, partner at French VC firm Singular.

Jensen Huang's Nvidia has become the world's most valuable company off the back of the AI boom.
Justin Sullivan/Getty

There is another supply problem, too: training data. AI companies have run into limits on the quantity of public data they can secure to feed into their large language models, or LLMs, in pre-training.

This phase involves training an LLM on a vast corpus of data, typically scraped from the internet. That information is broken down into "tokens," the fundamental units of data a model processes, and then run through GPUs.
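To make the idea of tokens concrete, here is a minimal sketch in Python. It assumes a toy word-level scheme in which every distinct word or punctuation mark gets an integer ID. Production LLMs generally use subword tokenizers, so the function names and the splitting rule here are illustrative only.

```python
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens (toy word-level scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

text = "The internet is only so large. The internet is finite."
tokens = tokenize(text)
vocab = build_vocab(tokens)
# The model never sees raw text, only a sequence of IDs like this one.
token_ids = [vocab[t] for t in tokens]

print(tokens)
print(token_ids)
```

Repeated words map to the same ID, which is why the size of a training corpus is usually counted in tokens rather than in documents or characters.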

While throwing more data and GPUs at a model has reliably produced smarter models year after year, companies have been exhausting the supply of publicly available data on the internet. Research firm Epoch AI predicts usable textual data could be squeezed dry by 2028.

"The internet is only so large," Matthew Zeiler, founder and CEO of Clarifai, told BI.

Multimodal and private data

Eric Landau, cofounder and CEO of data startup Encord, said that this is where other data sources will offer a path forward in the scramble to overcome the bottleneck in public data.

One example is multimodal data, which involves feeding AI systems visual and audio sources of information, such as photos or podcast recordings. "That's one part of the picture," Landau said. "Just adding more modalities of data." AI labs have already started using multimodal data as a tool, but Landau says it remains "very underutilized."

Sharon Zhou, cofounder and CEO of LLM platform Lamini, sees another vastly untapped area: private data. Companies have been securing licensing agreements with publishers to gain access to their vast troves of information. OpenAI, for instance, has struck partnerships with organizations such as Vox Media and Stack Overflow, a Q&A platform for developers, to bring copyrighted data into its models.

"We are not even close to using all of the private data in the world to supplement the data we need for pre-training," Zhou said. "From work with our enterprise and even startup customers, there's a lot more signal in that data that is very useful for these models to capture."

A data quality problem

A great deal of research effort is now focused on enhancing the quality of data that an LLM is trained on rather than just the quantity. Researchers could previously afford to be "pretty lazy about the data" in pre-training, Zhou said, by just chucking as much as possible at a model to see what stuck. "This isn't totally true anymore," she said.

One solution that companies are exploring is synthetic data, an artificial form of data generated by AI.

According to Daniele Panfilo, CEO of startup Aindo AI, synthetic data can be a "powerful tool to improve data quality," as it can "help researchers construct datasets that meet their exact information needs." This is particularly useful in a phase of AI development known as post-training, where techniques such as fine-tuning can be used to give a pre-trained model a smaller dataset that has been carefully crafted with specific domain expertise, such as law or medicine.

One former employee at Google DeepMind, the search giant's AI lab, told BI that "Gemini has shifted its strategy" from going bigger to becoming more efficient. "I think they've realized that it is actually very expensive to serve such large models, and it is better to specialize them for various tasks through better post-training," the former employee said.

Google launched Gemini, formerly known as Bard, in 2023.
Google

In theory, synthetic data offers a useful way to hone a model's knowledge and make it smaller and more efficient. In practice, there's no full consensus on how effective synthetic data can be in making models smarter.

"What we discovered this year with our synthetic data, called Cosmopedia, is that it can help for some things, but it's not the silver bullet that's going to solve our data problem," Thomas Wolf, cofounder and chief science officer at open-source platform Hugging Face, told BI.

Jonathan Frankle, the chief AI scientist at Databricks, said there's no "free lunch" when it comes to synthetic data and emphasized the need for human oversight. "If you don't have any human insight, and you don't have any process of filtering and choosing which synthetic data is most relevant, then all the model is doing is reproducing its own behavior because that's what the model is intended to do," he said.

Concerns around synthetic data came to a head after a paper published in July in the journal Nature said there was a risk of "model collapse" with "indiscriminate use" of synthetic data. The message was to tread carefully.
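The intuition behind "model collapse" can be shown with a toy simulation. This is a sketch under strong assumptions, not the experiment from the Nature paper: the "model" is just a normal distribution, and each generation is fitted only to a small batch of samples drawn from the previous generation, with no filtering and no fresh real data. The function name and all parameter values are hypothetical.

```python
import random
import statistics

def run_generations(n_generations=200, sample_size=10, seed=0):
    """Repeatedly fit a normal distribution to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the real data distribution
    history = [std]
    for _ in range(n_generations):
        # Generate synthetic data from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then fit the next model only on that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history

history = run_generations()
for generation in (0, 50, 100, 200):
    print(f"generation {generation:>3}: std = {history[generation]:.6f}")
```

In runs like this the spread of the fitted distribution typically shrinks toward zero: each generation loses a little of the original variety and never gets it back, which is the toy analogue of a model that only reproduces its own behavior.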

Building a reasoning machine

For some, simply focusing on the training portion won't cut it.

Former OpenAI chief scientist and Safe Superintelligence cofounder Ilya Sutskever told Reuters this month that results from scaling models in pre-training had plateaued and that "everyone is looking for the next thing."

That "next thing" looks to be reasoning. Industry attention has increasingly turned to an area of AI known as inference, the stage at which a trained model responds to queries and information it might not have seen before, and to the reasoning capabilities a model can bring to bear there.

At Microsoft's Ignite event this month, the company's CEO Satya Nadella said that instead of seeing so-called AI scaling laws hit a wall, he was seeing the emergence of a new paradigm for "test-time compute," which is when a model has the ability to take longer to respond to more complex prompts from users. Nadella pointed to a new "think harder" feature for Copilot — Microsoft's AI agent — which boosts test time to "solve even harder problems."

Aymeric Zhuo, cofounder and CEO of AI startup Agemo, said that AI reasoning "has been an active area of research," particularly as "the industry faces a data wall." He told BI that improving reasoning requires increasing test-time or inference-time compute.

Typically, the longer a model is allowed to work on a problem, the more accurate the outcomes it generates. Right now, models are being queried in milliseconds. "It doesn't quite make sense," Sivesh Sukumar, an investor at investment firm Balderton, told BI. "If you think about how the human brain works, even the smartest people take time to come up with solutions to problems."
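One simple way to trade extra inference time for accuracy is to sample several answers and take a majority vote. The sketch below assumes a hypothetical `noisy_solver` that is right 60% of the time. It stands in for a single model response and is not how Copilot or any specific product works.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer=42, p_correct=0.6):
    """Hypothetical stand-in for one model sample: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice([40, 41, 43, 44])  # wrong answers are spread across options

def answer_with_votes(rng, n_samples):
    """Draw n_samples answers and return the most common one (majority vote)."""
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def estimate_accuracy(n_samples, trials=2000, seed=0):
    """Estimate how often the voted answer is correct over many simulated questions."""
    rng = random.Random(seed)
    hits = sum(answer_with_votes(rng, n_samples) == 42 for _ in range(trials))
    return hits / trials

accuracy = {n: estimate_accuracy(n) for n in (1, 5, 25)}
for n, acc in accuracy.items():
    print(f"samples per question: {n:>2}  accuracy: {acc:.2f}")
```

More samples per question means more compute and a longer wait for each answer, but the voted answer is correct more often. That is the basic trade-off behind test-time compute.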

In September, OpenAI released a new model, o1, which tries to "think" about an issue before responding. One OpenAI employee, who asked not to be named, told BI that "reasoning from first principles" is not the forte of LLMs as they work based on "a statistical probability of which words come next," but if we "want them to think and solve novel problem areas, they have to reason."

Noam Brown, a researcher at OpenAI, thinks the impact of a model with greater reasoning capabilities can be extraordinary. "It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer," he said during a talk at TED AI last month.

Google and OpenAI did not respond to a request for comment from Business Insider.

The AI boom meets its tipping point

These efforts give researchers reasons to remain hopeful, even if current signs point to a slower rate of performance leaps. As a separate former DeepMind employee who worked on Gemini told BI, people are constantly "trying to find all sorts of different kinds of improvements."

That said, the industry may need to adjust to a slower pace of improvement.

"I just think we went through this crazy period of the models getting better really fast, like, a year or two ago. It's never been like that before," the former DeepMind employee told BI. "I don't think the rate of improvement has been as fast this year, but I don't think that's like some slowdown."

Lamini's Zhou echoed this point. Scaling laws, the observation that AI models improve with size, more data, and greater computing power, work on a logarithmic scale rather than a linear one, she said. In other words, think of AI advances as a flattening curve rather than a straight upward line on a graph. That makes development far more expensive "than we'd expect for the next substantive step in this technology," Zhou said.
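A short sketch shows why a logarithmic relationship makes each step more expensive. It assumes a made-up benchmark score that rises by a fixed amount every time training compute is multiplied by ten. The `slope` value and the compute units are hypothetical, chosen only to show the shape of the curve.

```python
import math

def toy_score(compute, slope=10.0):
    """Hypothetical benchmark score that grows with the log of training compute."""
    return slope * math.log10(compute)

def compute_needed(score, slope=10.0):
    """Invert toy_score: the compute required to reach a given score."""
    return 10 ** (score / slope)

for target in (10, 20, 30, 40):
    print(f"score {target}: {compute_needed(target):,.0f} units of compute")
```

Under this assumption every additional 10 points costs ten times as much compute as the previous 10 points, so equal-sized improvements arrive at ever-higher prices.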

She added: "That's why I think our expectations are just not going to be met at the timeline we want, but also why we'll be more surprised by capabilities when they do appear."

Amazon Web Services CEO Adam Selipsky speaks with Anthropic CEO Dario Amodei during a 2023 conference.
Noah Berger/Getty

Companies will also need to consider how much more expensive it will be to create the next versions of their highly prized models. According to Anthropic's Amodei, a single training run could one day cost $100 billion. These costs include GPUs, energy needs, and data processing.

Whether investors and customers are willing to wait around longer for the superintelligence they've been promised remains to be seen. Issues with Microsoft's Copilot, for instance, are leading some customers to wonder if the much-hyped tool is worth the money.

For now, AI leaders maintain that there are plenty of levers to pull, from new data sources to a focus on inference, to ensure models continue improving. Investors and customers just might have to be prepared for those improvements to come at a slower pace than the breakneck one set by OpenAI when it launched ChatGPT two years ago.

Bigger problems lie ahead if they don't.

Read the original article on Business Insider
