Card-660: Cambridge Rare Word Dataset

Card-660 is a challenging yet reliable benchmark for evaluating subword and rare word representation techniques. The dataset provides multiple advantages over existing benchmarks, including a very high inter-annotator agreement (IAA) of around 0.90, substantially higher than that of existing rare word datasets; the Stanford Rare Word Similarity dataset, for instance, has an estimated IAA of only 0.41.

The large gap between the state of the art and the IAA ceiling (more than 0.50 in terms of Spearman correlation) makes this a challenging dataset with ample room for future research. Card-660 covers a wide range of domains, including IT and technology, slang and abbreviations, movies and entertainment, politics, chemistry, and medicine.


Download



M.T. Pilehvar, D. Kartsaklis, V. Prokhorov, and N. Collier.
Card-660: Cambridge Rare Word Dataset -- a Reliable Benchmark for Infrequent Word Representation Models.
EMNLP 2018, Brussels, Belgium.


Annotation Variance
Annotation variance for word pairs across Card-660, SimVerb-3500 and Stanford Rare Word Similarity (RW) datasets. Average variance for Card-660 is 1.47, which is significantly lower than those for SV-3500 and RW: 5.64 and 6.34, respectively.
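As a minimal sketch of how such a per-pair variance statistic can be computed: assuming each word pair comes with a list of individual annotator ratings on the [0, 4] scale, take the sample variance per pair and average over pairs. The ratings below are invented for illustration, not actual Card-660 annotations.

```python
# Sketch: average annotation variance over word pairs.
# Ratings are illustrative, not real Card-660 annotations.
from statistics import mean, variance

ratings = {
    ("currency", "concurrency"): [0, 0, 1, 0, 0, 0, 0, 0],
    ("must-see", "interesting"): [3, 3, 4, 2, 3, 3, 4, 3],
}

# Sample variance of the annotator scores for each pair.
per_pair_var = {pair: variance(r) for pair, r in ratings.items()}

# Dataset-level statistic: mean of the per-pair variances.
avg_var = mean(per_pair_var.values())
```

A lower average variance means annotators tended to give similar scores to the same pair, which is what drives the high agreement reported for Card-660.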


Examples from the dataset (similarity scale [0-4]):

Word 1         Word 2         Score
sleepwalking   somnambulists  3.88
2mro           tomorrow       4.00
currency       concurrency    0.13
must-see       interesting    3.06
carbinolamine  hemiaminal     3.88
biting_point   clutch         2.19
random_seed    BiLSTM         1.56
black_hole     blackmail      0.06

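The standard evaluation on pairs like these scores each pair by the cosine similarity of its word vectors and correlates the model's scores with the gold ratings via Spearman's rho. A self-contained sketch, using tiny made-up vectors in place of real pretrained embeddings:

```python
# Sketch of word-similarity evaluation: cosine similarity per pair,
# then Spearman correlation against gold scores. Embeddings are invented.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    # Rank correlation without tie correction (fine for distinct scores).
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical 3-d embeddings for three dataset pairs:
emb = {
    "sleepwalking": [0.9, 0.1, 0.0], "somnambulists": [0.8, 0.2, 0.1],
    "currency": [0.1, 0.9, 0.2],     "concurrency": [0.7, 0.0, 0.7],
    "black_hole": [0.0, 0.2, 0.9],   "blackmail": [0.6, 0.6, 0.1],
}
pairs = [("sleepwalking", "somnambulists"),
         ("currency", "concurrency"),
         ("black_hole", "blackmail")]
gold = [3.88, 0.13, 0.06]

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model, gold)
```

A real run would load pretrained vectors (e.g. GloVe or word2vec) into `emb` and score all 660 pairs; the Pearson and Spearman columns in the tables below come from exactly this kind of comparison.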

State-of-the-Art

Word Embedding Model Missed words Missed pairs Pearson Spearman
ConceptNet Numberbatch (300d) 37% 53% 36.0 24.7
GloVe Common Crawl, cased (300d) 1% 2% 33.0 27.3
LexVec Common Crawl (300d) 41% 55% 25.9 18.5
GloVe Wikipedia-Gigaword (300d) 55% 74% 15.1 15.7
word2vec GoogleNews (300d) 48% 75% 13.5 7.4

Rare Word Representation Model Missed words Missed pairs Pearson Spearman
ConceptNet Numberbatch (300d) 37% 53% 36.0 24.7
 + Mimick (Pinter et al., 2017) 0% 0% 34.2 35.6
 + Definition centroid (Herbelot and Baroni, 2017) 29% 43% 42.9 33.8
 + Definition LSTM (Bahdanau et al., 2017) 29% 43% 43.4 34.3
GloVe Common Crawl, cased (300d) 1% 2% 33.0 27.3
 + Mimick (Pinter et al., 2017) 0% 0% 23.9 29.5
 + Definition centroid (Herbelot and Baroni, 2017) 21% 35% 45.2 31.7
 + Definition LSTM (Bahdanau et al., 2017) 21% 35% 39.5 33.8
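The "missed words" and "missed pairs" columns above can be read as simple coverage statistics: a word is missed when the embedding model has no vector for it, and a pair is missed when either of its words is. A sketch with an illustrative vocabulary (not any real model's):

```python
# Sketch: coverage statistics for an embedding model on a set of pairs.
# The vocabulary below is illustrative, not a real model's.
vocab = {"currency", "concurrency", "clutch", "tomorrow", "interesting"}
pairs = [("currency", "concurrency"), ("biting_point", "clutch"),
         ("2mro", "tomorrow"), ("must-see", "interesting")]

words = {w for pair in pairs for w in pair}
missed_words = {w for w in words if w not in vocab}
missed_pairs = [p for p in pairs if any(w not in vocab for w in p)]

pct_words = 100 * len(missed_words) / len(words)
pct_pairs = 100 * len(missed_pairs) / len(pairs)
```

This is why rare word techniques such as Mimick help even when their correlation scores are similar: by inferring vectors for out-of-vocabulary words they drive the missed rates to 0%, so every pair is actually scored rather than falling back to a default.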


Inter-Annotator Agreement

Inter-Annotator Agreement Pearson Spearman
Mean 93.5 93.1
Pairwise 88.9 88.9
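The two agreement statistics can be computed under a common convention (an assumption here, not necessarily the paper's exact protocol): "pairwise" averages the correlation over all pairs of annotators, while "mean" correlates each annotator with the average of the others. A sketch with invented annotator scores:

```python
# Sketch of pairwise vs. mean inter-annotator agreement (Pearson).
# The three annotators' score lists are invented for illustration.
import math
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

annotators = [
    [4.0, 0.0, 3.0, 1.0],
    [3.5, 0.5, 3.0, 1.5],
    [4.0, 1.0, 2.5, 1.0],
]

# Pairwise: average correlation over all annotator pairs.
combos = list(combinations(annotators, 2))
pairwise = sum(pearson(a, b) for a, b in combos) / len(combos)

# Mean: each annotator against the mean of the remaining annotators.
def mean_of_others(i):
    rest = [a for j, a in enumerate(annotators) if j != i]
    return [sum(col) / len(rest) for col in zip(*rest)]

mean_iaa = sum(pearson(a, mean_of_others(i))
               for i, a in enumerate(annotators)) / len(annotators)
```

Averaging the other annotators smooths out individual noise, which is why the "mean" figure typically comes out higher than the "pairwise" one, as in the table above.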


References

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. Mimicking word embeddings using subword RNNs. EMNLP 2017.

Mohammad Taher Pilehvar and Nigel Collier. Inducing embeddings for rare and unseen words by leveraging lexical resources. EACL 2017.

Aurélie Herbelot and Marco Baroni. High-risk learning: acquiring new word vectors from tiny data. EMNLP 2017.

Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. Learning to Compute Word Embeddings On the Fly. arXiv 2017.