Natural Language Datasets
Free datasets for NLP tasks — text classification, question answering, summarization, translation, and more.
Question Answering
Text Classification & Sentiment
| Dataset | Samples | Size | Format | License | Browser? | Link |
|---|---|---|---|---|---|---|
| SST-2 | 67,349 | 7.1 MB | TSV | CC0 | Yes | HuggingFace |
| Amazon Reviews | 34M | 20 GB | JSON | Open | No | jmcauley.ucsd.edu |
| Hate Speech | 24,783 | 2.4 MB | CSV | CC-BY-4.0 | Yes | HuggingFace |
| Emotion | 20,000 | 1.2 MB | CSV | Open | Yes | HuggingFace |
| Financial PhraseBank | 4,840 | 280 KB | CSV | CC-BY-NC-SA | Yes | HuggingFace |
Large Text Corpora
Translation & Multilingual
| Dataset | Language Pairs | Size | License | Link |
|---|---|---|---|---|
| OPUS | 800+ | Varies | Various open | opus.nlpl.eu |
| WMT | 10+ | Varies | Research | statmt.org |
| Tatoeba | 400+ | 13M sentences | CC-BY-2.0 | tatoeba.org |
| FLORES-200 | 200 | 3,001 sentences each | CC-BY-SA-4.0 | HuggingFace |
Summarization
| Dataset | Samples | Size | License | Link |
|---|---|---|---|---|
| CNN/DailyMail | 312K articles | 1.3 GB | Apache-2.0 | HuggingFace |
| XSum | 226K | 260 MB | MIT | HuggingFace |
| arXiv | 215K papers | 4.2 GB | CC0 | Kaggle |