| Statement | Reference | Supporting Data |
| --- | --- | --- |
| Simple context-level RAG achieves on-par or even better results than long-context LLMs for retrieval or summarization, which involve low reasoning complexity. | 1. InfiniteBench<br>2. GSM-Infinite | 1. GSM-Infinite, Figure 3 |

GSM-Infinite, Figure 3: Study of Llama-3.1-70B-Instruct with Passive RAG (referred to as OnePassRAG) and Active RAG (referred to as InteractiveRAG) on popular long-context benchmarks: RULER (at 64K context length), LongBench (>8K), LongBenchV2, and LOFT (128K context length). RAG operates under a 2048-token retrieval budget, with Llama-3.1-70B-Instruct as the decoder. RAG generally performs robustly, on par with the corresponding long-context LLMs, showing that previous long-context benchmarks are either too simple in reasoning complexity or contain detectable noise.
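The distinction between the two RAG modes in the figure can be sketched as follows. This is a minimal illustration, not the study's implementation: the word-overlap retriever, the `decoder` callback, and the greedy budget filling are all placeholder assumptions; the actual study uses Llama-3.1-70B-Instruct as the decoder under a 2048-token retrieval budget.

```python
# Toy sketch of Passive RAG (OnePassRAG) vs Active RAG (InteractiveRAG).
# All components here are illustrative stand-ins, not the paper's pipeline.

def score(query, passage):
    """Placeholder relevance score: count of shared words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query, corpus, budget_tokens=2048):
    """Greedily take the best-scoring passages until the token budget is full."""
    picked, used = [], 0
    for p in sorted(corpus, key=lambda p: score(query, p), reverse=True):
        if score(query, p) == 0:
            continue  # skip passages with no overlap at all
        n = len(p.split())  # crude whitespace token count
        if used + n > budget_tokens:
            break
        picked.append(p)
        used += n
    return picked

def one_pass_rag(query, corpus, decoder, budget_tokens=2048):
    """Passive RAG: retrieve exactly once, then decode an answer."""
    context = retrieve(query, corpus, budget_tokens)
    answer, _ = decoder(query, context)
    return answer

def interactive_rag(query, corpus, decoder, budget_tokens=2048, max_rounds=3):
    """Active RAG: the decoder may emit a follow-up query, triggering
    re-retrieval, until it answers or the round limit is reached."""
    context, q = [], query
    for _ in range(max_rounds):
        context += retrieve(q, corpus, budget_tokens)
        answer, follow_up = decoder(q, context)
        if follow_up is None:
            return answer
        q = follow_up
    return answer
```

The `decoder` is any callable returning `(answer, follow_up_query_or_None)`; with a real LLM behind it, the active loop lets intermediate reasoning steer later retrievals, which is the capability the figure's InteractiveRAG variant probes.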

| Referee \ Referrer | CogniLoad | GSM-Infinite |
| --- | --- | --- |
| LongBench | Pros: confounds ICL, ECL, and GCL in CLT.<br>Cons: aggregates multi-task corpora up to 200k tokens. | Cons: low complexity; focuses on retrieval or summarization, which involve low reasoning complexity. |
| RULER | | Cons: low complexity; focuses on retrieval or summarization, which involve low reasoning complexity. |