About
Contextualised word embeddings are nowadays pervasive across NLP applications and have been shown to effectively capture meaning distinctions of the same word in different sentences. Similarly, multilingual models have proved able to encode words in their contexts in a space that is shared across hundreds of languages.
A system's task on any of the XL-WiC datasets is to identify the intended meaning of a word in a context of a given language. XL-WiC is framed as a binary classification task. Each instance in XL-WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not.
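For illustration, an instance can be thought of as the following minimal Python sketch. The field names (`target`, `pos`, `context_1`, `context_2`, `label`) are illustrative placeholders, not the exact column names of the released TSV files; please consult the README in the download package for the actual file layout.

```python
# A minimal, illustrative representation of an XL-WiC instance.
# Field names are hypothetical; see the released README for the
# exact column layout of the data files.
from dataclasses import dataclass

@dataclass
class WicInstance:
    target: str      # the target word w (a verb or a noun)
    pos: str         # "N" or "V"
    context_1: str   # first sentence containing w
    context_2: str   # second sentence containing w
    label: bool      # True iff w has the same meaning in both contexts

example = WicInstance(
    target="beat",
    pos="V",
    context_1="We beat the competition.",
    context_2="Agassi beat Becker in the tennis championship.",
    label=True,  # both occurrences mean "to defeat"
)
```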
XL-WiC provides dev and test sets in the following 12 languages:
- Bulgarian (BG)
- Danish (DA)
- German (DE)
- Estonian (ET)
- Farsi (FA)
- French (FR)
- Croatian (HR)
- Italian (IT)
- Japanese (JA)
- Korean (KO)
- Dutch (NL)
- Chinese (ZH)
and training sets in the following 3 languages:
- German (DE)
- French (FR)
- Italian (IT)
Download
- the whole package (v1.0) with gold labels,
- the README file.
Participate in XLWiC's CodaLab competition: submit your results on the test set and see where you stand on the leaderboard!
Link: XLWiC CodaLab Competition
Dataset details
Please see the following paper:
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization.
A. Raganato, T. Pasini, J. Camacho-Collados and M.T. Pilehvar.
EMNLP 2020.
XL-WiC features multiple interesting characteristics:
- It is multilingual; hence, it is suitable for evaluating the cross-lingual and multilingual semantic capabilities of models;
- It is suitable for evaluating a wide range of applications, including contextualized word and sense representation and Word Sense Disambiguation;
- It is framed as a binary classification dataset in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline (see the sketch after this list);
- It is constructed using high-quality annotations curated by experts.
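To see why a context-insensitive model cannot beat chance on this design, consider the sketch below: with static per-word embeddings, the two occurrences of the target word map to the identical vector, so their cosine similarity is always 1 and any fixed threshold produces the same prediction for every instance, which amounts to ~50% accuracy on a label-balanced test set. The `static_embedding` lookup is a hypothetical stand-in for any context-insensitive model.

```python
import numpy as np

def static_embedding(word: str) -> np.ndarray:
    """Hypothetical context-insensitive lookup: one fixed vector per word."""
    rng = np.random.default_rng(abs(hash(word)) % 2**32)
    return rng.standard_normal(300)

def predict_same_meaning(target: str, ctx1: str, ctx2: str,
                         threshold: float = 0.5) -> bool:
    # The contexts are ignored by construction: both occurrences of the
    # target word receive the identical vector.
    v1 = static_embedding(target)  # representation of w in context 1
    v2 = static_embedding(target)  # representation of w in context 2
    sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return sim >= threshold  # always True: cosine of identical vectors is 1.0

# Every instance receives the same prediction, so on a 50/50-balanced
# test set the accuracy is ~50%, i.e. the random baseline.
```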
Examples from the dataset
Lang | Target | Context-1 | Context-2 | Label |
---|---|---|---|---|
EN | Beat | We beat the competition | Agassi beat Becker in the tennis championship. | True |
DA | Tro | Jeg tror på det, min mor fortalte. | Maria troede ikke sine egne øjne. | True |
ET | Ruum | Ühel hetkel olin väljaspool aega ja ruumi. | Ümberringi oli lõputu tühi ruum. | True |
FR | Causticité | Sa causticité lui a fait bien des ennemis. | La causticité des acides. | False |
KO | 틀림 | 틀림이 있는지 없는지 세어 보시오. | 그 아이 하는 짓에 틀림이 있다면 모두 이 어미 죄이지요. | False |
ZH | 發 | 建築師希望發大火燒掉城市的三分之一。 | 如果南美洲氣壓偏低,則印度可能發乾旱 | True |
FA | صرف | صرف غذا نیم ساعت طول کشید | معلم صرف افعال ماضی عربی را آموزش داد | False |
State-of-the-Art
Cross-Lingual WordNet
- Training: EN
- Dev: EN
- Test: WordNet test sets
- Download data: xlwic_wn_xlingual.zip
Implementation | BG | DA | ET | FA | HR | JA | KO | NL | ZH |
---|---|---|---|---|---|---|---|---|---|
Human | - | 87.00 | - | 97.00 | - | 75.00 | 76.00 | - | 85.00 |
XLMR Large | 66.48 | 71.11 | 68.71 | 75.25 | 72.30 | 63.83 | 69.63 | 72.81 | 73.15 |
XLMR Base | 60.73 | 64.79 | 62.82 | 69.88 | 61.01 | 60.44 | 66.96 | 65.73 | 65.78 |
mBERT | 58.28 | 64.86 | 62.56 | 71.50 | 63.97 | 62.26 | 59.76 | 63.84 | 69.36 |
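The transformer baselines above are typically obtained by fine-tuning a pretrained multilingual encoder as a sentence-pair classifier on the English training data and evaluating it zero-shot on the other languages. Below is a minimal sketch of such a setup with Hugging Face Transformers; it is illustrative only and not the exact configuration used in the paper (the model names are real Hub identifiers, but the hyperparameters and the single training step are placeholders).

```python
# Minimal sentence-pair fine-tuning sketch (illustrative, not the
# paper's exact setup or hyperparameters).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # or "xlm-roberta-large", "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One (hypothetical) training instance: the two contexts are encoded as a
# sentence pair; label 1 = same meaning, 0 = different meaning.
enc = tokenizer(
    "We beat the competition.",
    "Agassi beat Becker in the tennis championship.",
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor([1])

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=labels).loss  # cross-entropy over the 2 classes
loss.backward()
optimizer.step()
```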
Cross-Lingual Wiktionary
- Training: EN
- Dev: EN
- Test: Wiktionary test sets.
- Table 6 in the paper, first group.
- Download data: xlwic_wikt_xlingual.zip
Implementation | DE | FR | IT |
---|---|---|---|
Human | 74.00 | - | 78.00 |
XLMR Large | 65.83 | 62.50 | 64.86 |
XLMR Base | 58.30 | 56.13 | 55.91 |
mBERT | 58.27 | 56.00 | 58.61 |
Monolingual Wiktionary
- Training: Target Language.
- Dev: Target Language.
- Test: Wiktionary test sets.
- Table 4 in the paper, second group.
- Download data: xlwic_wikt_monolingual.zip
IV = In-Vocabulary evaluation. The test sets only contain words that appear in the training sets.
OOV = Out-Of-Vocabulary evaluation. The test sets only contain words that do not appear in the training sets.
Implementation | DE ALL | DE IV | DE OOV | FR ALL | FR IV | FR OOV | IT ALL | IT IV | IT OOV |
---|---|---|---|---|---|---|---|---|---|
Human | 74.00 | - | - | - | - | - | 78.00 | - | - |
XLMR Large | 84.03 | 84.24 | 72.54 | 76.16 | 75.61 | 73.93 | 72.30 | 75.12 | 65.17 |
XLMR Base | 80.84 | 81.17 | 71.31 | 73.06 | 71.92 | 71.14 | 68.58 | 70.69 | 62.36 |
mBERT | 81.58 | 81.86 | 70.08 | 73.67 | 72.92 | 71.24 | 71.96 | 73.15 | 68.54 |
L-BERT* | 82.90 | 83.23 | 76.64 | 78.14 | 77.62 | 78.00 | 72.64 | 73.89 | 69.10 |
* L-BERT stands for language-specific BERT. We used the following language-specific models:
- DE: dbmdz/bert-base-german-cased
- FR: camembert/camembert-large (Martin et al., 2020)
- IT: dbmdz/bert-base-italian-xxl-cased
Licensing
This dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.
References
- [XLMR Large/Base] Unsupervised Cross-lingual Representation Learning at Scale. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. ACL 2020.
- [BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL 2019.
- [CamemBERT] CamemBERT: a tasty French language model. Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, Benoît Sagot. ACL 2020.