Code augmentation: We use Python code from The Stack [49] to make up for missing natural language data. The combined dataset, consisting of code and natural language samples, is shuffled randomly.
We find that mixing in code provides a 2× increase in effective tokens, even when evaluating only natural language tasks.
Filling up to 50% of the data with code (42 billion tokens) also shows no deterioration; beyond that, performance on natural language tasks decreases quickly.
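To make the mixing step described above concrete, the sketch below combines natural language and code documents so that a target fraction of the mixed tokens comes from code, then shuffles at the document level. The function, its arguments, and the (text, token-count) representation are illustrative assumptions, not the exact pipeline used in our experiments.

```python
import random

def mix_code_and_text(nl_docs, code_docs, code_fraction=0.5, seed=0):
    """Combine natural language and code documents so that roughly
    `code_fraction` of the mixed tokens come from code, then shuffle.

    `nl_docs` / `code_docs`: lists of (text, num_tokens) pairs.
    """
    assert 0.0 <= code_fraction < 1.0
    nl_tokens = sum(n for _, n in nl_docs)
    # Solve code / (code + nl) = code_fraction for the code-token budget.
    target_code_tokens = nl_tokens * code_fraction / (1.0 - code_fraction)

    mixed, code_tokens = list(nl_docs), 0
    for doc in code_docs:
        if code_tokens >= target_code_tokens:
            break
        mixed.append(doc)
        code_tokens += doc[1]

    random.Random(seed).shuffle(mixed)  # document-level random shuffle
    return mixed
```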
Adapting filtering: We investigate the performance impact of deduplication and perplexity filtering, two common preprocessing steps that can severely limit the available data; relaxing them frees up additional training data.
For deduplication filtering, all samples with a 100-character overlap are removed, resulting in 21 billion tokens that are repeated for four epochs during training.
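A minimal sketch of such overlap-based deduplication, assuming documents are compared by hashing every 100-character span and dropping any document that shares a span with an earlier one, is shown below; it is illustrative only and not the tooling used for our corpus.

```python
import hashlib

def dedup_by_char_overlap(docs, window=100):
    """Keep only documents that share no `window`-character span with
    any previously kept document.

    Hashing every span is exact but memory-hungry; production pipelines
    typically rely on suffix arrays or MinHash instead.
    """
    seen, kept = set(), []
    for text in docs:
        spans = {
            hashlib.md5(text[i:i + window].encode()).hexdigest()
            for i in range(max(1, len(text) - window + 1))
        }
        if spans & seen:
            continue  # shares a 100-character span with an earlier document
        seen |= spans
        kept.append(text)
    return kept
```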
Of the filtering approaches, we find perplexity filtering to be effective, while deduplication does not help. This suggests that data filtering is primarily effective for noisy datasets. We also investigate filtering on a different, noisier dataset (OSCAR) in Appendix O, where we find it to be more effective.
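For reference, a generic perplexity filter of the kind studied here can be sketched as follows; the scoring model (GPT-2) and the 25% retention cutoff are illustrative assumptions, not the configuration used in our experiments.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM can serve as the scorer; GPT-2 is an arbitrary choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text, max_tokens=512):
    """Perplexity of `text` under the scoring model (truncated for speed)."""
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_tokens).input_ids
    loss = scorer(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def perplexity_filter(docs, keep_fraction=0.25):
    """Keep the `keep_fraction` of documents with the lowest perplexity."""
    scored = sorted(docs, key=perplexity)
    return scored[: max(1, int(len(scored) * keep_fraction))]
```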
Overall, in a data-constrained regime, we recommend reserving data filtering for noisy datasets.