Card-660: Cambridge Rare Word Dataset
Card-660 is a challenging yet reliable benchmark for evaluating subword and rare-word representation techniques. The dataset offers several advantages over existing benchmarks, most notably a very high inter-annotator agreement (IAA) of around 0.90, substantially higher than that of existing rare-word datasets (e.g., the Stanford Rare Word Similarity dataset, with an estimated IAA of 0.41).
The large gap between the state of the art and the IAA ceiling (more than 0.50 in Spearman correlation) leaves ample room for future research. Card-660 covers a wide range of domains, including IT and technology, slang and abbreviations, movies and entertainment, politics, chemistry, and medicine.
Download
Card-660: Cambridge Rare Word Dataset -- a Reliable Benchmark for Infrequent Word Representation Models.
EMNLP 2018, Brussels, Belgium.
Examples from the dataset (similarity scale [0-4]):

Word 1 | Word 2 | Score |
---|---|---|
sleepwalking | somnambulists | 3.88 |
2mro | tomorrow | 4.00 |
currency | concurrency | 0.13 |
must-see | interesting | 3.06 |
carbinolamine | hemiaminal | 3.88 |
biting_point | clutch | 2.19 |
random_seed | BiLSTM | 1.56 |
black_hole | blackmail | 0.06 |
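Evaluation on word-similarity datasets like this one typically ranks each pair by the cosine similarity of its two word vectors and compares that ranking against the gold scores with Spearman correlation. Below is a minimal, dependency-free sketch of that procedure; the toy 2-d vectors and the four-pair list are illustrative assumptions, not the dataset's actual contents or distribution format.

```python
import math

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            # Average the ranks over ties.
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d embeddings (purely illustrative, not real vectors).
emb = {
    "2mro": [1.0, 0.0], "tomorrow": [0.99, 0.14],
    "must-see": [0.8, 0.6], "interesting": [0.6, 0.8],
    "currency": [1.0, 0.0], "concurrency": [0.5, 0.866],
    "black_hole": [1.0, 0.0], "blackmail": [0.0, 1.0],
}
pairs = [("2mro", "tomorrow", 4.00), ("must-see", "interesting", 3.06),
         ("currency", "concurrency", 0.13), ("black_hole", "blackmail", 0.06)]

gold = [s for _, _, s in pairs]
pred = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
print(round(spearman(gold, pred), 2))  # → 1.0 (the toy vectors happen to rank all pairs correctly)
```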
State-of-the-Art
Word Embedding Model | Missed words | Missed pairs | Pearson | Spearman |
---|---|---|---|---|
ConceptNet Numberbatch (300d) | 37% | 53% | 36.0 | 24.7 |
Glove Common Crawl - cased (300d) | 1% | 2% | 33.0 | 27.3 |
LexVec Common Crawl (300d) | 41% | 55% | 25.9 | 18.5 |
Glove Wikipedia-Gigaword (300d) | 55% | 74% | 15.1 | 15.7 |
Word2vec GoogleNews (300d) | 48% | 75% | 13.5 | 7.4 |

Rare Word Representation Model | Missed words | Missed pairs | Pearson | Spearman |
---|---|---|---|---|
ConceptNet Numberbatch (300d) | 37% | 53% | 36.0 | 24.7 |
+ Mimick (Pinter et al., 2017) | 0% | 0% | 34.2 | 35.6 |
+ Definition centroid (Herbelot and Baroni, 2017) | 29% | 43% | 42.9 | 33.8 |
+ Definition LSTM (Bahdanau et al., 2017) | 29% | 43% | 43.4 | 34.3 |
Glove Common Crawl - cased (300d) | 1% | 2% | 33.0 | 27.3 |
+ Mimick (Pinter et al., 2017) | 0% | 0% | 23.9 | 29.5 |
+ Definition centroid (Herbelot and Baroni, 2017) | 21% | 35% | 45.2 | 31.7 |
+ Definition LSTM (Bahdanau et al., 2017) | 21% | 35% | 39.5 | 33.8 |
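In the tables above, "Missed words" is the percentage of dataset words for which a model has no vector, and "Missed pairs" is the percentage of pairs containing at least one such word. A common scoring convention is to assign missed pairs a fixed fallback similarity (e.g., zero) rather than dropping them; that fallback is an assumption for illustration here, and the benchmark's official protocol may differ. A sketch with toy data:

```python
import math

def coverage_report(pairs, emb):
    """Percentage of missed words and missed pairs for an embedding table,
    mirroring the 'Missed words' / 'Missed pairs' columns above."""
    words = {w for pair in pairs for w in pair}
    missed_words = {w for w in words if w not in emb}
    missed_pairs = [p for p in pairs if p[0] in missed_words or p[1] in missed_words]
    return (100 * len(missed_words) / len(words),
            100 * len(missed_pairs) / len(pairs))

def similarity(w1, w2, emb, oov_score=0.0):
    """Cosine similarity; a pair with an out-of-vocabulary word gets a fixed
    fallback score (one common convention -- the official protocol may differ)."""
    if w1 not in emb or w2 not in emb:
        return oov_score
    u, v = emb[w1], emb[w2]
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy example: 'carbinolamine' has no vector.
emb = {"clutch": [0.3, 0.7], "biting_point": [0.4, 0.6], "hemiaminal": [0.9, 0.1]}
pairs = [("biting_point", "clutch"), ("carbinolamine", "hemiaminal")]
mw, mp = coverage_report(pairs, emb)
print(mw, mp)  # → 25.0 50.0 (1 of 4 words missing, which affects 1 of 2 pairs)
```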
Inter-Annotator Agreement
Measure | Pearson | Spearman |
---|---|---|
Mean | 93.5 | 93.1 |
Pairwise | 88.9 | 88.9 |
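The pairwise figure is naturally the average correlation over all annotator pairs; a "mean" figure is often computed as each annotator's correlation with the average rating of the remaining annotators. The exact construction sketched below is an assumption for illustration, not necessarily the paper's protocol, and the toy ratings matrix is made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_iaa(ratings):
    """Average correlation over all annotator pairs."""
    scores = [pearson(a, b) for a, b in combinations(ratings, 2)]
    return sum(scores) / len(scores)

def mean_iaa(ratings):
    """Each annotator vs. the mean of the remaining annotators -- one common
    'mean' construction; the paper's exact protocol may differ."""
    scores = []
    for i, a in enumerate(ratings):
        rest = [r for j, r in enumerate(ratings) if j != i]
        avg = [sum(col) / len(rest) for col in zip(*rest)]
        scores.append(pearson(a, avg))
    return sum(scores) / len(scores)

# Toy ratings: 3 annotators x 4 word pairs (illustrative values only).
ratings = [
    [4.0, 3.0, 0.5, 0.0],
    [3.8, 3.2, 0.2, 0.1],
    [4.0, 2.8, 0.4, 0.0],
]
print(round(pairwise_iaa(ratings), 3), round(mean_iaa(ratings), 3))
```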
References
Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. Mimicking word embeddings using subword RNNs. EMNLP 2017.
Mohammad Taher Pilehvar and Nigel Collier. Inducing embeddings for rare and unseen words by leveraging lexical resources. EACL 2017.
Aurélie Herbelot and Marco Baroni. High-risk learning: acquiring new word vectors from tiny data. EMNLP 2017.
Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. Learning to Compute Word Embeddings On the Fly. arXiv preprint, 2017.