Natural Language Datasets

Free datasets for NLP tasks: text classification, question answering, summarization, translation, and large text corpora for pretraining.

Question Answering

Dataset	Samples	Size	License	Link
SQuAD 2.0	150K questions	44 MB	CC-BY-SA-4.0	rajpurkar.github.io
Natural Questions	307K	42 GB	CC-BY-SA-3.0	ai.google.com
TriviaQA	650K	2.5 GB	Apache-2.0	nlp.cs.washington.edu
HotpotQA	113K	600 MB	CC-BY-SA-4.0	hotpotqa.github.io
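
SQuAD 2.0 ships as a nested JSON file (articles, then paragraphs, then questions), so a common first step is flattening it into one record per question. The following is a minimal sketch of that step, assuming the standard SQuAD 2.0 layout; the `iter_squad_examples` helper and the tiny `sample` dictionary are illustrative and hand-written, not taken from the dataset.

```python
from typing import Iterator


def iter_squad_examples(squad: dict) -> Iterator[dict]:
    """Flatten the nested SQuAD 2.0 layout into one record per question."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Unanswerable questions carry an empty answer list.
                    "answers": [a["text"] for a in qa.get("answers", [])],
                    "is_impossible": qa.get("is_impossible", False),
                }


# Hand-written sample in the same shape as the official file (not real data).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The library opened in 1901.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "When did the library open?",
                            "answers": [{"text": "1901", "answer_start": 22}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "Who designed the library?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}

examples = list(iter_squad_examples(sample))
answerable = [ex for ex in examples if not ex["is_impossible"]]
print(len(examples), len(answerable))
```

For the real file, replace `sample` with the result of `json.load` on the downloaded JSON. The other datasets in the table use their own schemas, so this helper would need adapting for them.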

Text Classification

Dataset	Samples	Size	Format	License	Browser?	Link
SST-2	67,349	7.1 MB	TSV	CC0	Yes	HuggingFace
Amazon Reviews	34M	20 GB	JSON	Open	No	jmcauley.ucsd.edu
Hate Speech	24,783	2.4 MB	CSV	CC-BY-4.0	Yes	HuggingFace
Emotion	20,000	1.2 MB	CSV	Open	Yes	HuggingFace
Financial PhraseBank	4,840	280 KB	CSV	CC-BY-NC-SA	Yes	HuggingFace
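
For the TSV and CSV datasets above, checking the label distribution before training is a cheap sanity test. The sketch below uses only the standard library and assumes a tab-separated file with a header row containing `sentence` and `label` columns, as in the GLUE distribution of SST-2; the column names, the `label_counts` helper, and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def label_counts(tsv_file: TextIO, label_column: str = "label") -> Counter:
    """Count how often each label value occurs in a tab-separated file."""
    # QUOTE_NONE keeps stray quote characters in review text from being
    # interpreted as field delimiters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[label_column] for row in reader)


# Made-up rows in the assumed layout (not taken from the dataset).
sample_tsv = io.StringIO(
    "sentence\tlabel\n"
    "a charming and often affecting journey\t1\n"
    "the plot is paper-thin\t0\n"
    "an unexpectedly moving film\t1\n"
)

counts = label_counts(sample_tsv)
print(counts)
```

With a downloaded file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, encoding="utf-8", newline="")`. For comma-separated files the delimiter and quoting settings would need to change.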

Text Corpora

Dataset	Size	Description	License	Link
Wikipedia Dumps	~21 GB	Full English Wikipedia	CC-BY-SA-3.0	dumps.wikimedia.org
Common Crawl	380+ TB	Web crawl archive	Common Crawl Terms of Use	commoncrawl.org
The Pile	825 GB	Diverse English text	MIT	pile.eleuther.ai
C4	750 GB	Cleaned Common Crawl	ODC-BY	HuggingFace
RedPajama	1.2T tokens (~5 TB)	LLM pretraining data	Apache-2.0	HuggingFace
FineWeb	15T tokens (~44 TB)	Cleaned web data for LLMs	ODC-BY	HuggingFace
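
Corpora of this size do not fit in memory, so they are usually read one record at a time. The following is a minimal sketch of streaming a single gzip-compressed JSON Lines shard, assuming one JSON object per line with a `text` field; the field name, the shard file name, and the `stream_jsonl_gz` and `count_words` helpers are hypothetical, and individual corpora may use other formats such as Parquet or WARC.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator, Optional


def stream_jsonl_gz(path: str, text_field: str = "text") -> Iterator[str]:
    """Yield the text of each record without loading the whole shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.loads(line)[text_field]


def count_words(path: str, limit: Optional[int] = None) -> int:
    """Count whitespace-separated words in up to `limit` documents."""
    total = 0
    for index, text in enumerate(stream_jsonl_gz(path)):
        if limit is not None and index >= limit:
            break
        total += len(text.split())
    return total


# Build a tiny throwaway shard so the sketch runs without any download.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-00000.jsonl.gz")
    with gzip.open(shard, "wt", encoding="utf-8") as out:
        for text in ["first document", "second longer document", "third"]:
            out.write(json.dumps({"text": text}) + "\n")

    word_total = count_words(shard)
    first_two = count_words(shard, limit=2)

print(word_total, first_two)
```

The `limit` argument makes it possible to sample the first few documents of a shard before committing to a full pass over hundreds of gigabytes.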

Translation

Dataset	Language Pairs	Size	License	Link
OPUS	800+	Varies	Various open	opus.nlpl.eu
WMT	10+	Varies	Research	statmt.org
Tatoeba	400+	13M sentences	CC-BY-2.0	tatoeba.org
FLORES-200	200	3,001 sentences each	CC-BY-SA-4.0	HuggingFace

Summarization

Dataset	Samples	Size	License	Link
CNN/DailyMail	312K articles	1.3 GB	Apache-2.0	HuggingFace
XSum	226K	260 MB	MIT	HuggingFace
arXiv	215K papers	4.2 GB	CC0	Kaggle