The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Authors: The Falcon LLM Team (see paper for full list of authors)

Word Count: Approximately 9,000 words

Estimated Read Time: Around 30-45 minutes

Source Code: The codebase used to develop the Falcon-LLM models and the RefinedWeb dataset is not publicly available.

Relevant Links:

• RefinedWeb dataset (600 billion token extract): https://huggingface.co/datasets/tiiuae/falcon-refinedweb
• Falcon-LLM website: falconllm.tii.ae

Summary:

The paper proposes RefinedWeb, a large-scale pretraining dataset built from CommonCrawl web data. The authors argue that, with extensive filtering and deduplication, web data alone can be used to train language models that match or outperform models trained on curated corpora. They publicly release a 600 billion token extract of RefinedWeb, as well as language models trained on it.
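As a rough illustration (not from the paper), the released extract can be streamed from the Hugging Face Hub with the `datasets` library. The dataset identifier comes from the link above; the `content` field name is an assumption about the released schema.

```python
# Minimal sketch: streaming the public 600B-token RefinedWeb extract.
# The "content" field name is an assumption about the released schema.
from datasets import load_dataset

# streaming=True avoids downloading the full multi-terabyte extract up front.
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(refinedweb):
    print(record["content"][:200])  # first 200 characters of the document text
    if i >= 2:
        break
```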

The authors find that properly filtered and deduplicated web data alone can lead to language models with powerful zero-shot capabilities, even outperforming publicly available models trained on curated data like The Pile. They achieve this by rigorously removing duplicate and low-quality web content from CommonCrawl.
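The paper's pipeline itself is not released, but the general idea of combining simple quality heuristics with deduplication can be sketched as follows. This is a minimal illustration only: the thresholds and helper names are assumptions, not the authors' actual rules, and the paper additionally applies fuzzy (MinHash-based) and exact-substring deduplication, which are omitted here for brevity.

```python
# Illustrative sketch of quality filtering plus exact document-level
# deduplication; thresholds and heuristics are assumptions, not the
# paper's actual pipeline.
import hashlib
import re


def passes_quality_filters(text: str,
                           min_words: int = 50,
                           max_symbol_ratio: float = 0.1) -> bool:
    """Keep documents that are long enough and not dominated by
    non-alphanumeric symbols (illustrative heuristics only)."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_alpha = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return (non_alpha / max(len(text), 1)) <= max_symbol_ratio


def exact_dedup(documents):
    """Drop documents whose normalized text has been seen before,
    using a hash of the content as the duplicate key."""
    seen = set()
    for doc in documents:
        normalized = re.sub(r"\s+", " ", doc.strip().lower())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


# Example usage: filter then deduplicate an iterable of raw documents.
# clean_docs = [d for d in exact_dedup(raw_docs) if passes_quality_filters(d)]
```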

The RefinedWeb dataset and the Falcon-LLM models trained on it could be useful resources for developing and studying large language models. The full dataset, at roughly 5 trillion tokens, is large enough to support training such models at very large scale. However, the lack of publicly available code makes reproducing and building upon the presented results more difficult.