The Jagged AI Frontier is a Data Frontier

Leandro von Werra, 16.12.2025

The Jagged AI Frontier

We are seven years into scaling language models since the original release of GPT-1. What have we learned? In a nutshell: models perform well on tasks we train them on, but tend to be fragile or fail on novel tasks. There are no notable signs of generalization, but if we feed the models task data at sufficient scale and quality, they learn.

If we had superhuman data, we would have trained superhuman models by now.

Let's see how we got here and how we can trace model gaps back to data gaps.

Scaling Models and Capabilities

Initially it looked like we would unlock fundamentally new behaviours and capabilities as we followed the scaling laws:

GPT-1, Improving Language Understanding: An LLM trained on a lot of text learns many useful things about language and can easily be fine-tuned for many downstream tasks.

GPT-2, Unsupervised Multitask Learners: Scaled-up models don't need fine-tuning anymore; we can just prompt them with a wide range of questions and they can solve many different tasks.

GPT-3, Few-Shot Learners: Models scaled up even further can learn somewhat novel tasks in context from just a handful of task examples.

But then: "GPT-4 Technical Report" and "GPT-5 System Card" - no more banger titles. Undoubtedly these models became much more useful and reliable, but further scaling no longer seemed to lead to fundamentally new capabilities. No more surprises.

Looking back, the models of the first generation were already very good at things that are plentiful on the internet: cooking recipes, gift ideas, summarizing a news article, writing an ad, or generating toxic comments. At other tasks they were largely incapable: coding, solving math problems, and even basic reasoning. It's not that there is no data at all on these tasks, but the high-quality signal on the web is sparse and early versions of pretraining datasets were very raw.

How did they get so good at math and coding in the end? Let's look at coding first.

Coding

The reason coding is such a big success story for LLMs is not because the problems are well structured, but because it's such a data-rich environment. A large fraction of all code ever written is available on GitHub and similar repositories, along with discussions on sites like StackOverflow. On top of that, every iteration is captured in the git history, and rich discussions are preserved in issues and pull requests. It's hard to overstate how rich (volume × diversity) the publicly available data around software development is: there are 28M public repositories with over 1B commits and 500M merged pull requests each year.

As soon as LLMs targeted code more specifically (Codex being an early example), capabilities improved fast. By tapping into more of the available data and refining dataset quality, code models improved to the point where many developers now find them genuinely useful.

Math

With coding benchmarks saturating quickly, a new frontier for LLMs was needed, and the field turned to math: first grade-school problems like GSM8K, then olympiad-level questions from the IMO.

One could argue that models got better simply because they were scaled up, and that certainly explains part of it, but the majority of the gains came from gathering high-quality datasets. The web was scoured for high-quality sources such as university exercises, and some websites were re-crawled to make sure no good sample was left behind. The Qwen2.5-Math tech report notes that the team accumulated up to 1T tokens of math data, augmented with synthetic data.

AI for Science

Attention has recently turned towards using LLMs for science beyond just math. It's an exciting application, but also one that highlights significant data gaps. While scientific papers are accessible, they are only the end product of the scientific process. The process itself also involves experiment design, failure investigation, analysis of measurements and results, and discussions with peers. These steps are rarely captured in available data sources such as the web or publications.

Unlike in math, synthetic data can't easily fill this gap. Math's primary substrate is text (proofs, equations, derivations), which aligns naturally with LLMs and can be verified computationally. Science operates on physical reality; papers are just the reporting layer, and validating claims often requires running actual experiments.

Closing this gap is harder but not impossible. It requires capturing the process: structured experiment logs, simulation environments, or tooling that records the messy middle of research. Until those datasets exist, science will remain a jagged edge of the frontier.

Synthetic Data

Synthetic data generation has become an important part of the data pipeline for LLMs. For a long time people were skeptical about the value of LLM-generated data for training other LLMs, fearing it would degrade model performance. Indeed, blindly generating content with LLMs will likely cause some mode collapse, as the generator itself is not very diverse.

However, there are two reasons why we can still generate valuable data:

Grounding: If we ask a model to generate a short story about a historic event, it will produce only a handful of diverse examples (in a quick test, Claude wrote 3 out of 4 stories about the Lighthouse of Alexandria), and if prompted with a specific but niche topic it might hallucinate most of the facts. However, if we ground the prompt with a Wikipedia article, the model will use the facts accurately.

Verification: Where we can verify the result, we can use this signal to filter LLM traces. An example is fixing an issue in a GitHub repository: we can run the original test suite, and we can additionally ask the model to write test cases for the issue, which is usually easier than writing the fix itself.
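
To make the verification idea concrete, here is a minimal, self-contained sketch of such a filter. The hard-coded candidate solutions stand in for samples drawn from a generator model, and in practice untrusted code would be executed in a sandbox rather than directly:

```python
# Verification-based filtering: keep a model-written solution only if its tests pass.
def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run a candidate solution together with its tests in a scratch namespace."""
    namespace = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # assertions raise if the candidate is wrong
        return True
    except Exception:
        return False

candidates = [
    "def add(a, b):\n    return a - b",  # buggy sample, should be filtered out
    "def add(a, b):\n    return a + b",  # correct sample, should be kept
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

verified = [c for c in candidates if passes_tests(c, tests)]
print(f"kept {len(verified)} of {len(candidates)} candidates")
```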

This has become the cornerstone of LLM mid- and post-training and has more recently been leaking into pretraining as well. Once a performance gap is identified, data can in many cases be generated and filtered to close it.

As straightforward as this may sound, there are some subtleties: since we identify and evaluate gaps based on benchmarks, there is a risk of overfitting to those benchmarks. This is not overfitting in the classical sense of memorizing the answers, but overfitting to the type and format of the data, and it is the reason some models look great on benchmarks but fail the vibe checks. Big labs likely have internal benchmarks reflecting real user interactions that help them guide model quality more accurately.

Generating synthetic data and training on it has two effects: we reuse and refine existing data, and at the same time we distill the behavior of the generating model. One can use elaborate distillation schemes involving the model's logits and so on, but it is surprising how effective learning from the text outputs alone is.

Distillation

The fact that distillation from output text alone is so effective has a big impact on the state of the LLM landscape. No AI lab that exposes an API can stay ahead of the competition for long. It is too easy to simply generate data with the best model and use it as part of another model's training. Even if the data collection for the original model was expensive, distillation makes the capability, and effectively the data, available to everyone.

This happened when ChatGPT was released. OpenAI spent a lot of time and money on annotators, and potentially on a complicated RL pipeline, to make the models chat well. But as soon as the model was released, datasets of chat histories appeared and other models were trained on them.

When the first reasoning models appeared (e.g. o1 or Gemini Thinking), the model providers obfuscated the reasoning traces to avoid similar copying; with some prompting it was still possible, however, and reasoning models quickly appeared everywhere. This is also highlighted in DeepSeek's R1 tech report (though it was a bit overlooked next to the RL pipeline): once the reasoning capability was available in R1, built with a complex RL pipeline, it was very efficient to distill the same behavior into small models by simply fine-tuning them on traces from R1.

[Figure: DeepSeek R1 distillation]
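
To see how little machinery this distillation step needs, here is a minimal sketch using TRL's SFTTrainer. The trace file and student model names are placeholders, and the exact SFTTrainer/SFTConfig arguments vary between TRL versions; the point is that distillation here is ordinary supervised fine-tuning on the teacher's text outputs:

```python
# Minimal sketch: distilling teacher traces into a small student model with plain SFT.
# Assumes "teacher_traces.jsonl" holds one {"messages": [...]} conversation per line,
# i.e. prompts paired with the teacher's full (reasoning) responses.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

traces = load_dataset("json", data_files="teacher_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",              # small student model (placeholder)
    train_dataset=traces,                            # teacher-generated text only
    args=SFTConfig(output_dir="distilled-student"),
)
trainer.train()
```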

But model distillation doesn't need to be adversarial. Labs do it internally, too: they train larger models that are never released to users. The reasoning: larger models are more data-efficient and can achieve higher performance when trained on the same data as smaller models. However, it's not economical to serve those models at scale. Instead, they are used as data generators and their high-quality traces are distilled into smaller, cheaper models ready for serving.

Distillation shows that if we can produce high-quality data with a model, we can fill the data gap and enable even small models to learn a surprising range of capabilities.

At this point it looks like "data is all you need" - so do we even need more sophisticated techniques like Reinforcement Learning to train good LLMs?

Reinforcement Learning

Whether RL is indeed needed to train good models is an open question. During the early chat-model era the consensus was that you could only achieve ChatGPT performance with RL. It turned out you can get really far by fine-tuning on chat data and can almost close the gap with simple contrastive methods like DPO.

It may be that the most efficient way to initially unlock a new capability, say reasoning, is RL in carefully crafted environments. However, once the capability is available in a model, distillation will distribute it into other models. This is certainly the mechanism that the R1 report suggests.

There are subtle capabilities that are hard to learn by fine-tuning alone:

What's a good and a bad response: When fine-tuning only on high-quality chat data, the model doesn't explicitly learn what makes a response good; it just learns to copy good responses. Since it is never explicitly taught to avoid bad responses, they can still leak into conversations. DPO addresses this by maximizing the margin between paired good and bad samples (see the sketch after this list).

Do I know what I don't know: Neither fine-tuning nor DPO trains the model to calibrate against its own knowledge. Ideally we want models to acknowledge when they don't know something rather than hallucinate an answer, but this can't be learned by imitation. Even if the model is fine-tuned on a perfect dataset with refusals, it will simply learn to refuse in certain situations at a certain rate; an external dataset can't reflect what the model itself knows. RL can potentially fix this by letting the model learn to hedge or refuse an answer in the situations where it is actually unsure.[1]
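
For reference, the core of the DPO objective mentioned above fits in a few lines. This is a minimal sketch operating on per-response log-probabilities (sums of token log-probs) under the policy and a frozen reference model; the values in the usage example are made up:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-response log-probabilities under the policy and reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push apart the margin between the preferred and dispreferred response.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0]),
                torch.tensor([-13.0, -16.0]), torch.tensor([-19.0, -17.5]))
print(loss)
```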

Combining RL and LLMs has received renewed interest recently with the advent of agents solving tasks over long horizons. Current models still struggle to perform well over long contexts, and data for such long-horizon tasks is quite sparse on the internet - a new data gap. To address this, people are building sophisticated environments that simulate long-horizon tasks and have verifiable success criteria built in. This is a natural setup to optimize models with RL and hill-climb the rewards in these environments.
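
To give a flavour of what such an environment can look like, here is a heavily simplified sketch; all names are hypothetical and the test runner is a stub. The agent takes tool actions over many steps, and the only reward is the verifiable end state, e.g. whether the repository's test suite passes after the submitted patch:

```python
from dataclasses import dataclass, field

@dataclass
class IssueEnv:
    """Toy long-horizon environment: fix a repo issue, rewarded only if the tests pass."""
    issue: str
    max_steps: int = 50
    history: list = field(default_factory=list)

    def step(self, action: str):
        """Apply one agent action and return (observation, reward, done)."""
        self.history.append(action)
        if action.startswith("submit"):
            reward = 1.0 if self._tests_pass() else 0.0  # verifiable success criterion
            return "episode finished", reward, True
        if len(self.history) >= self.max_steps:
            return "step budget exhausted", 0.0, True
        return f"observation after: {action}", 0.0, False

    def _tests_pass(self) -> bool:
        # Stub: a real environment would run the repo's test suite in a sandbox.
        return False
```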

It's too early to tell if RL is truly needed to perform well at long multi-turn tasks. As with chat models, it may be that over time most of these environments will simply be used as data generators for simpler fine-tuning techniques.

Conclusion

So what are the implications of the data-first view? A few takeaways:

The AI capability frontier is jagged because the data is jagged, and where data is available the gap closes. What remains are the hard problems: the ones where data is sparse, private, or impossible to verify.

A new age of research could discover a new paradigm, but until then the bottleneck is data.

[1] See John Schulman's talk for a deeper discussion.