Card-660: Cambridge Rare Word Dataset
Card-660 is a challenging yet reliable benchmark for evaluating subword and rare-word representation techniques. The dataset offers several advantages over existing benchmarks, most notably a very high inter-annotator agreement (IAA) of around 0.90, substantially higher than that of existing rare-word datasets (e.g., the Stanford Rare Word Similarity dataset, with an estimated IAA of 0.41).
The large gap between the state of the art and the IAA ceiling (more than 0.50 in Spearman correlation) leaves ample room for future research. Card-660 covers a wide range of domains, including IT and technology, slang and abbreviations, movies and entertainment, politics, chemistry, and medicine.
Download
Card-660: Cambridge Rare Word Dataset -- a Reliable Benchmark for Infrequent Word Representation Models.
EMNLP 2018, Brussels, Belgium.
Examples from the dataset (similarity scale [0-4]):

Word 1 | Word 2 | Score |
---|---|---|
sleepwalking | somnambulists | 3.88 |
2mro | tomorrow | 4.00 |
currency | concurrency | 0.13 |
must-see | interesting | 3.06 |
carbinolamine | hemiaminal | 3.88 |
biting_point | clutch | 2.19 |
random_seed | BiLSTM | 1.56 |
black_hole | blackmail | 0.06 |
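Evaluation on word-similarity datasets like this one typically ranks each pair by the cosine similarity of its two word vectors and compares that ranking against the gold scores with Spearman correlation. Below is a minimal, dependency-free sketch of that procedure; the toy 2-d vectors and the four-pair list are illustrative assumptions, not the dataset's actual contents or distribution format.

```python
import math

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            # Average the ranks over ties.
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d embeddings (purely illustrative, not real vectors).
emb = {
    "2mro": [1.0, 0.0], "tomorrow": [0.99, 0.14],
    "must-see": [0.8, 0.6], "interesting": [0.6, 0.8],
    "currency": [1.0, 0.0], "concurrency": [0.5, 0.866],
    "black_hole": [1.0, 0.0], "blackmail": [0.0, 1.0],
}
pairs = [("2mro", "tomorrow", 4.00), ("must-see", "interesting", 3.06),
         ("currency", "concurrency", 0.13), ("black_hole", "blackmail", 0.06)]

gold = [s for _, _, s in pairs]
pred = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
print(round(spearman(gold, pred), 2))  # → 1.0 (the toy vectors happen to rank all pairs correctly)
```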
State-of-the-Art
Word Embedding Model | Missed words | Missed pairs | Pearson | Spearman |
---|---|---|---|---|
ConceptNet Numberbatch (300d) | 37% | 53% | 36.0 | 24.7 |
Glove Common Crawl - cased (300d) | 1% | 2% | 33.0 | 27.3 |
LexVec Common Crawl (300d) | 41% | 55% | 25.9 | 18.5 |
Glove Wikipedia-Gigaword (300d) | 55% | 74% | 15.1 | 15.7 |
Word2vec GoogleNews (300d) | 48% | 75% | 13.5 | 7.4 |

Rare Word Representation Model | Missed words | Missed pairs | Pearson | Spearman |
---|---|---|---|---|
ConceptNet Numberbatch (300d) | 37% | 53% | 36.0 | 24.7 |
+ Mimick (Pinter et al., 2017) | 0% | 0% | 34.2 | 35.6 |
+ Definition centroid (Herbelot and Baroni, 2017) | 29% | 43% | 42.9 | 33.8 |
+ Definition LSTM (Bahdanau et al., 2017) | 29% | 43% | 43.4 | 34.3 |
Glove Common Crawl - cased (300d) | 1% | 2% | 33.0 | 27.3 |
+ Mimick (Pinter et al., 2017) | 0% | 0% | 23.9 | 29.5 |
+ Definition centroid (Herbelot and Baroni, 2017) | 21% | 35% | 45.2 | 31.7 |
+ Definition LSTM (Bahdanau et al., 2017) | 21% | 35% | 39.5 | 33.8 |
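In the tables above, "Missed words" is the percentage of dataset words for which a model has no vector, and "Missed pairs" is the percentage of pairs containing at least one such word. A common scoring convention is to assign missed pairs a fixed fallback similarity (e.g., zero) rather than dropping them; that fallback is an assumption for illustration here, and the benchmark's official protocol may differ. A sketch with toy data:

```python
import math

def coverage_report(pairs, emb):
    """Percentage of missed words and missed pairs for an embedding table,
    mirroring the 'Missed words' / 'Missed pairs' columns above."""
    words = {w for pair in pairs for w in pair}
    missed_words = {w for w in words if w not in emb}
    missed_pairs = [p for p in pairs if p[0] in missed_words or p[1] in missed_words]
    return (100 * len(missed_words) / len(words),
            100 * len(missed_pairs) / len(pairs))

def similarity(w1, w2, emb, oov_score=0.0):
    """Cosine similarity; a pair with an out-of-vocabulary word gets a fixed
    fallback score (one common convention -- the official protocol may differ)."""
    if w1 not in emb or w2 not in emb:
        return oov_score
    u, v = emb[w1], emb[w2]
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy example: 'carbinolamine' has no vector.
emb = {"clutch": [0.3, 0.7], "biting_point": [0.4, 0.6], "hemiaminal": [0.9, 0.1]}
pairs = [("biting_point", "clutch"), ("carbinolamine", "hemiaminal")]
mw, mp = coverage_report(pairs, emb)
print(mw, mp)  # → 25.0 50.0 (1 of 4 words missing, which affects 1 of 2 pairs)
```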
Inter-Annotator Agreement
Measure | Pearson | Spearman |
---|---|---|
Mean | 93.5 | 93.1 |
Pairwise | 88.9 | 88.9 |
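The pairwise figure is naturally the average correlation over all annotator pairs; a "mean" figure is often computed as each annotator's correlation with the average rating of the remaining annotators. The exact construction sketched below is an assumption for illustration, not necessarily the paper's protocol, and the toy ratings matrix is made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_iaa(ratings):
    """Average correlation over all annotator pairs."""
    scores = [pearson(a, b) for a, b in combinations(ratings, 2)]
    return sum(scores) / len(scores)

def mean_iaa(ratings):
    """Each annotator vs. the mean of the remaining annotators -- one common
    'mean' construction; the paper's exact protocol may differ."""
    scores = []
    for i, a in enumerate(ratings):
        rest = [r for j, r in enumerate(ratings) if j != i]
        avg = [sum(col) / len(rest) for col in zip(*rest)]
        scores.append(pearson(a, avg))
    return sum(scores) / len(scores)

# Toy ratings: 3 annotators x 4 word pairs (illustrative values only).
ratings = [
    [4.0, 3.0, 0.5, 0.0],
    [3.8, 3.2, 0.2, 0.1],
    [4.0, 2.8, 0.4, 0.0],
]
print(round(pairwise_iaa(ratings), 3), round(mean_iaa(ratings), 3))
```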
References
Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. Mimicking word embeddings using subword RNNs. EMNLP 2017.
Mohammad Taher Pilehvar and Nigel Collier. Inducing embeddings for rare and unseen words by leveraging lexical resources. EACL 2017.
Aurélie Herbelot and Marco Baroni. High-risk learning: acquiring new word vectors from tiny data. EMNLP 2017.
Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. Learning to Compute Word Embeddings On the Fly. arXiv preprint, 2017.