
Natural Language Datasets

Free datasets for NLP tasks: text classification, question answering, summarization, translation, and more.

Question Answering

| Dataset | Samples | Size | License | Link |
|---|---|---|---|---|
| SQuAD 2.0 | 150K questions | 44 MB | CC-BY-SA-4.0 | rajpurkar.github.io |
| Natural Questions | 307K | 42 GB | CC-BY-SA-3.0 | ai.google.com |
| TriviaQA | 650K | 2.5 GB | Apache-2.0 | nlp.cs.washington.edu |
| HotpotQA | 113K | 600 MB | CC-BY-SA-4.0 | hotpotqa.github.io |

Text Classification & Sentiment

| Dataset | Samples | Size | Format | License | Browser? | Link |
|---|---|---|---|---|---|---|
| SST-2 | 67,349 | 7.1 MB | TSV | CC0 | Yes | HuggingFace |
| Amazon Reviews | 34M | 20 GB | JSON | Open | No | jmcauley.ucsd.edu |
| Hate Speech | 24,783 | 2.4 MB | CSV | CC-BY-4.0 | Yes | HuggingFace |
| Emotion | 20,000 | 1.2 MB | CSV | Open | Yes | HuggingFace |
| Financial PhraseBank | 4,840 | 280 KB | CSV | CC-BY-NC-SA | Yes | HuggingFace |
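
The smaller classification sets above ship as plain TSV or CSV files, so they can usually be read with the standard library alone. The sketch below is a minimal illustration, not any dataset's official loader: the `sentence` and `label` column names, the inline sample rows, and the `read_labeled_tsv` helper are all assumptions made for the example. Check the header row of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in a "sentence<TAB>label" layout. The real column names
# and label encoding of any dataset listed above may differ.
SAMPLE_TSV = (
    "sentence\tlabel\n"
    "a charming and often affecting journey\t1\n"
    "unflinchingly bleak and desperate\t0\n"
    "it's a lovely film\t1\n"
)


def read_labeled_tsv(handle, text_col="sentence", label_col="label"):
    """Return a list of (text, int_label) pairs from a TSV file-like object."""
    # QUOTE_NONE keeps stray quote characters inside sentences from being
    # interpreted as field delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[text_col], int(row[label_col])) for row in reader]


# An in-memory buffer stands in for open("train.tsv", newline="", encoding="utf-8").
examples = read_labeled_tsv(io.StringIO(SAMPLE_TSV))

# A quick class-balance check before training.
label_counts = Counter(label for _, label in examples)
print(len(examples), dict(label_counts))
```

For the CSV entries, the same helper works if the `delimiter` is changed to a comma and default quoting is restored, again assuming a header row with matching column names.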

Large Text Corpora

| Dataset | Size | Description | License | Link |
|---|---|---|---|---|
| Wikipedia Dumps | ~21 GB | Full English Wikipedia | CC-BY-SA-3.0 | dumps.wikimedia.org |
| Common Crawl | 380+ TB | Web crawl archive | CC0 | commoncrawl.org |
| The Pile | 825 GB | Diverse English text | MIT | pile.eleuther.ai |
| C4 | 750 GB | Cleaned Common Crawl | ODC-BY | HuggingFace |
| RedPajama | 1.2 TB | LLM pretraining data | Apache-2.0 | HuggingFace |
| FineWeb | 15 TB | Cleaned web data for LLMs | ODC-BY | HuggingFace |

Translation & Multilingual

| Dataset | Language Pairs | Size | License | Link |
|---|---|---|---|---|
| OPUS | 800+ | Varies | Various open | opus.nlpl.eu |
| WMT | 10+ | Varies | Research | statmt.org |
| Tatoeba | 400+ | 13M sentences | CC-BY-2.0 | tatoeba.org |
| FLORES-200 | 200 | 3,001 sentences each | CC-BY-SA-4.0 | HuggingFace |

Summarization

| Dataset | Samples | Size | License | Link |
|---|---|---|---|---|
| CNN/DailyMail | 312K articles | 1.3 GB | Apache-2.0 | HuggingFace |
| XSum | 226K | 260 MB | MIT | HuggingFace |
| arXiv | 215K papers | 4.2 GB | CC0 | Kaggle |