Wed. Dec 18th, 2024

As companies continue to develop large language models (LLMs) to enhance their artificial intelligence (AI) capabilities, they are facing a significant challenge: the internet, which serves as the primary source of training data for these models, is finite. OpenAI and Google, among other companies, are realizing that they will soon run out of internet content to fuel the growth of their AI models. This scarcity of high-quality data, coupled with the increasing reluctance of some companies to share their data with AI systems, poses a major obstacle to the continued expansion of AI capabilities.

The demand for data in the AI industry should not be underestimated. To put it into perspective, GPT-4, developed by OpenAI, was reportedly trained on roughly 12 trillion tokens, the words and word fragments into which text is broken so that LLMs can process it. For the upcoming GPT-5, researchers estimate that a staggering 60 to 100 trillion tokens would be required to keep up with expected growth. This translates to roughly 45 to 75 trillion words, according to OpenAI. Even after exhausting all available high-quality internet data, an additional 10 to 20 trillion tokens or more would still be needed.
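
To make the idea of a token concrete, the short sketch below uses OpenAI's open-source tiktoken library to split a sentence into the fragments an LLM actually sees. The sample sentence and the printed counts are illustrative only, not figures from any real training run.

    # Illustrative only: split a sentence into tokens with OpenAI's
    # open-source tiktoken library (pip install tiktoken).
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")   # GPT-4's tokenizer

    text = "Large language models read text as tokens, not words."
    token_ids = enc.encode(text)

    print(f"{len(text.split())} words -> {len(token_ids)} tokens")
    print([enc.decode([t]) for t in token_ids])  # the individual word fragments

In ordinary English text a token works out to roughly three quarters of a word, which is how 60 to 100 trillion tokens translates into roughly 45 to 75 trillion words.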

While some researchers, such as Epoch AI's Pablo Villalobos, believe the data shortage will not become critical until around 2028, AI companies are taking proactive measures now. They are exploring alternative sources of data beyond the public internet to continue training their models.

The AI Data Problem

The data shortage poses several challenges for AI companies. First, there is sheer availability: without enough data, LLMs cannot be trained effectively. But the quality of that data matters just as much. Given the abundance of low-quality and misleading content on the internet, companies like OpenAI are cautious about what they feed into their models. Their goal is to build LLMs that respond accurately to user prompts, which means filtering out misinformation and poorly written content before training.
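
None of the major labs publish their filtering pipelines, but a toy heuristic like the one below gives a feel for the kind of cleanup involved; every threshold and rule here is an illustrative assumption, not anything OpenAI actually uses.

    # Purely illustrative heuristics for discarding low-quality text before
    # training; real filtering pipelines are far more sophisticated.
    def looks_low_quality(text: str) -> bool:
        words = text.split()
        if len(words) < 20:                                # too short to be useful
            return True
        if len(set(words)) / len(words) < 0.3:             # highly repetitive, spam-like
            return True
        if sum(w.isupper() for w in words) / len(words) > 0.3:  # mostly SHOUTING
            return True
        return False

    documents = [
        "BUY NOW!!! AMAZING DEAL!!! CLICK HERE!!! BUY NOW!!!",
        "Large language models are trained on text gathered from many sources, "
        "and curating that text for quality is a major part of the pipeline.",
    ]
    print([d for d in documents if not looks_low_quality(d)])  # keeps only the second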

Moreover, there are ethical considerations associated with scraping data from the internet. AI companies often collect and use individuals’ data without consent or regard for privacy. This practice has become a lucrative business, with platforms like Reddit selling user-generated content to AI companies. Some entities are pushing back against this data exploitation, as demonstrated by The New York Times’ lawsuit against OpenAI. However, until comprehensive user protections are implemented, public internet data will continue to be swept up to train LLMs.

To address the data shortage, companies like OpenAI are exploring new avenues for acquiring information. One approach involves training models on transcriptions of public videos, such as those on YouTube, generated with speech-recognition tools like OpenAI's Whisper. OpenAI is also working on smaller models tailored to specific niches and on a system to compensate data providers based on the quality of their data.
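
As a rough idea of what that transcription step looks like, here is a minimal sketch using the open-source whisper Python package. The file name is a placeholder, and this is not a description of OpenAI's internal pipeline.

    # Minimal sketch: transcribe an audio file with the open-source Whisper
    # model (pip install openai-whisper). "lecture_audio.mp3" is a placeholder.
    import whisper

    model = whisper.load_model("base")              # small general-purpose checkpoint
    result = model.transcribe("lecture_audio.mp3")
    print(result["text"])                           # plain-text transcript, ready to become training data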

Is Synthetic Data the Solution?

One controversial solution being considered by AI companies is training models on synthetic data. Synthetic data is generated from an existing data set with the goal of producing a new data set that shares the statistical character of the original without repeating its contents. The idea is to mask the contents of the original data while still providing LLMs with an equally useful training set.
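
Real synthetic text is generated by LLMs themselves, but the toy numerical sketch below captures the core idea: fit a simple model to an “original” data set, then sample an entirely new one with similar statistics. The Gaussian model, the sizes, and the seed are all illustrative assumptions.

    # Toy sketch of synthetic data: learn the statistics of an "original"
    # data set, then sample an entirely new set that resembles it.
    import numpy as np

    rng = np.random.default_rng(0)
    original = rng.normal(loc=5.0, scale=2.0, size=10_000)    # stand-in for real data

    mu, sigma = original.mean(), original.std()               # "train" on the original
    synthetic = rng.normal(loc=mu, scale=sigma, size=10_000)  # brand-new samples

    print(f"original:  mean={original.mean():.2f}, std={original.std():.2f}")
    print(f"synthetic: mean={synthetic.mean():.2f}, std={synthetic.std():.2f}")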

However, training LLMs on synthetic data can lead to a phenomenon known as “model collapse.” When a model is trained on data generated by earlier models, the synthetic data over-represents common patterns and under-represents rare ones, so each successive generation grows less diverse and gradually forgets information that was present in the original data. The result is repetitive, degraded output, which undermines the purpose of using synthetic data in the first place.
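
A small simulation makes the intuition concrete. In the sketch below, which is a conceptual toy rather than a claim about any particular LLM, each “generation” is trained only on text sampled from the previous generation; rare words tend to drop out of the data, and once a word is missing from one generation it can never reappear in a later one.

    # Toy illustration of model collapse: each generation fits word frequencies
    # to data produced by the previous generation, then generates the next data
    # set from that fitted model. Rare words tend to disappear over generations.
    import numpy as np

    rng = np.random.default_rng(1)

    vocab = ["the", "cat", "sat", "quasar", "xylophone"]       # last two are rare
    true_probs = [0.50, 0.30, 0.15, 0.03, 0.02]
    data = rng.choice(vocab, size=100, p=true_probs)           # generation 0: "real" data

    for gen in range(1, 11):
        freqs = np.array([np.mean(data == w) for w in vocab])  # "train" on current data
        data = rng.choice(vocab, size=100, p=freqs)            # generate the next data set
        print(f"gen {gen:2d}: " + "  ".join(f"{w}={f:.2f}" for w, f in zip(vocab, freqs)))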

Despite the challenges, AI companies like Anthropic and OpenAI remain optimistic about the potential of synthetic data. They see a place for it in their training sets and are actively working to overcome the limitations associated with model collapse. If synthetic data can be integrated successfully, it would not only ease the scarcity of internet data but also alleviate concerns about privacy and data exploitation.

In conclusion, as AI companies strive to advance their LLMs, the scarcity of high-quality internet data poses a significant challenge. The finite nature of the internet and the increasing reluctance of some companies to share their data necessitate the exploration of alternative data sources. While synthetic data offers a potential solution, it comes with its own set of challenges. Nonetheless, AI companies are actively working to overcome these obstacles and ensure the continuous growth of AI capabilities.
