RAFT Benchmark | Real-world Annotated Few-shot Tasks

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants?

RAFT is a few-shot classification benchmark that tests language models:

across multiple domains (lit reviews, medical data, tweets, customer interaction, etc.)
on economically valuable classification tasks (someone inherently cares about the task)
with evaluation that mirrors deployment (50 labeled examples per task, info retrieval allowed, hidden test set)

Resources:

Dataset
- Dataset card
- Dataset viewer
Submission
Leaderboard
Paper (NeurIPS 2021)

We’re considering a successor, RAFT 2. You can participate as co-author by submitting a dataset.