Scaling Data-Constrained Language Models

Main Takeaways: Data-Constrained Scaling Laws

Repeating data over multiple epochs


Methods complementary to repeating that do not add new natural language data


Mixing in code

Code augmentation: We use Python code from The Stack [49] to make up for missing natural language data. The combined dataset of code and natural language samples is shuffled randomly.

Mixing in code provides a 2× increase in effective tokens, even when evaluating only natural language tasks.

Filling up to 50% of the data with code (42 billion tokens) also shows no deterioration. Beyond that, performance on natural language tasks decreases quickly.
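A minimal sketch of such a mixing step, assuming documents are already tokenized and using a hypothetical `mix_code_and_text` helper (an illustration, not the paper's actual data pipeline):

```python
import random

def mix_code_and_text(text_docs, code_docs, code_fraction=0.5, seed=0):
    """Combine natural-language and code documents so that roughly
    `code_fraction` (0 <= f < 1) of the training tokens come from code,
    then shuffle the combined dataset randomly."""
    text_tokens = sum(len(d) for d in text_docs)
    # Code tokens needed so that code makes up `code_fraction` of the mix.
    target_code_tokens = int(text_tokens * code_fraction / (1 - code_fraction))

    selected_code, code_tokens = [], 0
    for doc in code_docs:
        if code_tokens >= target_code_tokens:
            break
        selected_code.append(doc)
        code_tokens += len(doc)

    mixed = list(text_docs) + selected_code
    random.Random(seed).shuffle(mixed)
    return mixed
```

With `code_fraction=0.5`, this corresponds to the 50% code setting above.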

Filtering and then repeating over multiple epochs

Adapting filtering: We investigate the performance impact of deduplication and perplexity filtering, two common filtering steps that can severely limit available data. Removing such filtering steps can free up additional training data.

For deduplication filtering, all samples with a 100-character overlap are removed, resulting in 21 billion tokens that are repeated for four epochs during training.
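A naive sketch of this overlap criterion (real deduplication pipelines use suffix-array or MinHash-based tooling; the function below is only illustrative):

```python
def dedup_100_char_overlap(docs, window=100):
    """Drop any document that shares a `window`-character substring with a
    previously kept document. Memory-hungry and meant only to illustrate
    the criterion, not to serve as a production deduplicator."""
    seen = set()
    kept = []
    for doc in docs:
        shingles = {doc[i:i + window] for i in range(len(doc) - window + 1)}
        if shingles & seen:
            continue  # overlaps an earlier document -> filtered out
        kept.append(doc)
        seen |= shingles
    return kept
```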

Of the filtering approaches, we find perplexity filtering to be effective, while deduplication does not help.
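Perplexity filtering is typically implemented by scoring each document with a small reference language model and keeping the best-scoring fraction. A minimal sketch, where `score_perplexity` is a hypothetical scorer and the keep-lowest-perplexity rule and `keep_fraction` are assumptions rather than details from the paper:

```python
def perplexity_filter(docs, score_perplexity, keep_fraction=0.5):
    """Keep the `keep_fraction` of documents with the lowest perplexity
    under a reference language model (lower perplexity = more fluent text)."""
    scored = sorted(docs, key=score_perplexity)
    cutoff = int(len(scored) * keep_fraction)
    return scored[:cutoff]
```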



Data filtering is primarily effective for noisy datasets.

We also investigate filtering on a different noisier dataset (OSCAR) in Appendix O, where we find it to be more effective.

Rules of thumb

Overall, in a data-constrained regime, we recommend

  1. reserving filtering for noisy datasets, and
  2. using both code augmentation and repeating to increase the number of training tokens (see the sketch below). For example, first doubling the available data by adding code and then repeating the new dataset for four epochs results in 8× more training tokens that are expected to be just as good as having had 8× more unique data from the start.
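As a back-of-the-envelope check on this recipe, a minimal sketch (the function name and the example numbers are illustrative, not taken from the paper):

```python
def training_token_budget(unique_nl_tokens, add_equal_code=True, epochs=4):
    """Rough token budget for the data-constrained recipe above:
    optionally double the dataset by mixing in an equal amount of code,
    then repeat the combined dataset for `epochs` epochs."""
    dataset_tokens = unique_nl_tokens * (2 if add_equal_code else 1)
    return dataset_tokens * epochs


# Doubling with code and repeating for four epochs gives 8x the unique
# natural-language tokens, e.g. 42B unique tokens -> 336B training tokens.
assert training_token_budget(42e9) == 8 * 42e9
```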