XL-WiC: The Multilingual Word-in-Context Dataset

A multilingual benchmark for the evaluation of context-sensitive word embeddings

[English WiC]


Contextualised word embeddings are now pervasive across NLP applications and have been shown to effectively capture distinctions in the meaning of the same word across different sentences. Similarly, multilingual models have proved able to encode words in their contexts in a space that is shared across hundreds of languages.

A system's task on any of the XL-WiC datasets is to identify the intended meaning of a word in a context of a given language. XL-WiC is framed as a binary classification task. Each instance in XL-WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not.
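
As an illustration of this framing, here is a minimal Python sketch of one instance and of scoring binary predictions. The `WiCInstance` class, its field names, and the `accuracy` helper are assumptions made for this example; they do not describe the released file format or the official scorer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WiCInstance:
    """One instance (hypothetical layout, not the official file format)."""
    target: str      # target word w (a noun or a verb)
    context_1: str   # first sentence containing w
    context_2: str   # second sentence containing w
    label: bool      # True if w has the same meaning in both contexts


def accuracy(gold: List[bool], predicted: List[bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Two examples taken from the examples table on this page.
instances = [
    WiCInstance("beat", "We beat the competition",
                "Agassi beat Becker in the tennis championship.", True),
    WiCInstance("causticité", "Sa causticité lui a fait bien des ennemis.",
                "La causticité des acides.", False),
]

# A trivial system that always answers "same meaning".
predictions = [True for _ in instances]
score = accuracy([inst.label for inst in instances], predictions)
print(f"accuracy = {score:.2f}")  # one of the two instances is correct
```

A real system would replace the constant prediction with a classifier that reads both contexts; only the input and output shapes are meant to carry over from this sketch.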

XL-WiC provides dev and test sets in the following 12 languages:

  • Bulgarian (BG)
  • Danish (DA)
  • German (DE)
  • Estonian (ET)
  • Farsi (FA)
  • French (FR)
  • Croatian (HR)
  • Italian (IT)
  • Japanese (JA)
  • Korean (KO)
  • Dutch (NL)
  • Chinese (ZH)

and training sets in the following 3 languages:

  • German (DE)
  • French (FR)
  • Italian (IT)


Participate in the XL-WiC CodaLab competition: submit your results on the test set and see where you stand on the leaderboard!
Link: XLWiC CodaLab Competition

Dataset details

Please see the following paper:

XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
A. Raganato, T. Pasini, J. Camacho-Collados and M.T. Pilehvar
EMNLP 2020.

XL-WiC features multiple interesting characteristics:

  • It is multilingual and hence suitable for evaluating the cross-lingual and multilingual semantic capabilities of models;
  • It is suitable for evaluating a wide range of applications, including contextualized word and sense representations and Word Sense Disambiguation;
  • It is framed as a binary classification dataset in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline;
  • It is constructed using high quality annotations curated by experts.
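
The point about context-insensitive embeddings can be made concrete with a small sketch. The toy `STATIC_VECTORS` table, the `static_embed` lookup, and the threshold classifier below are assumptions for illustration only; they are not part of the dataset or of any baseline reported for it. Because a static model assigns the target word a single vector regardless of its sentence, both occurrences get the same vector, so the prediction is constant across instances.

```python
import math
from typing import Dict, List, Tuple

# Toy static embedding table (hypothetical values, for illustration only).
STATIC_VECTORS: Dict[str, List[float]] = {
    "beat": [0.2, 0.7, 0.1],
    "causticité": [0.9, 0.1, 0.4],
}


def static_embed(word: str, context: str) -> List[float]:
    """Context-insensitive lookup: the context argument is ignored."""
    return STATIC_VECTORS[word.lower()]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_same_meaning(word: str, context_1: str, context_2: str,
                         threshold: float = 0.5) -> bool:
    """Predict True when the two occurrence vectors are similar enough."""
    sim = cosine(static_embed(word, context_1), static_embed(word, context_2))
    return sim >= threshold


# (target, context 1, context 2, gold label), from the examples on this page.
examples: List[Tuple[str, str, str, bool]] = [
    ("beat", "We beat the competition",
     "Agassi beat Becker in the tennis championship.", True),
    ("causticité", "Sa causticité lui a fait bien des ennemis.",
     "La causticité des acides.", False),
]

predictions = [predict_same_meaning(w, c1, c2) for w, c1, c2, _ in examples]
print(predictions)  # the same answer for every instance
```

Whatever threshold is chosen, such a model answers identically for every instance, so its accuracy depends only on the label distribution and not on the contexts.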

Examples from the dataset

Lang | Target | Context-1 | Context-2 | Label
EN | Beat | We beat the competition | Agassi beat Becker in the tennis championship. | True
DA | Tro | Jeg tror på det, min mor fortalte. | Maria troede ikke sine egne øjne. | True
ET | Ruum | Ühel hetkel olin väljaspool aega ja ruumi. | Ümberringi oli lõputu tühi ruum. | True
FR | Causticité | Sa causticité lui a fait bien des ennemis. | La causticité des acides. | False
KO | 틀림 | 틀림이 있는지 없는지 세어 보시오. | 그 아이 하는 짓에 틀림이 있다면 모두 이 어미 죄이지요. | False
ZH | | 建築師希望大火燒掉城市的三分之一。 | 如果南美洲氣壓偏低,則印度可能乾旱 | True
FA | صرف | صرف غذا نیم ساعت طول کشید | معلم صرف افعال ماضی عربی را آموزش داد | False


Cross-Lingual WordNet

Implementation | BG | DA | ET | FA | HR | JA | KO | NL | ZH
Human | - | 87.00 | - | 97.00 | - | 75.00 | 76.00 | - | 85.00
XLMR Large | 66.48 | 71.11 | 68.71 | 75.25 | 72.30 | 63.83 | 69.63 | 72.81 | 73.15
XLMR Base | 60.73 | 64.79 | 62.82 | 69.88 | 61.01 | 60.44 | 66.96 | 65.73 | 65.78
mBERT | 58.28 | 64.86 | 62.56 | 71.50 | 63.97 | 62.26 | 59.76 | 63.84 | 69.36

Cross-Lingual Wiktionary

Implementation | DE | FR | IT
Human | 74.00 | - | 78.00
XLMR Large | 65.83 | 62.50 | 64.86
XLMR Base | 58.30 | 56.13 | 55.91
mBERT | 58.27 | 56.00 | 58.61

Monolingual Wiktionary

IV = In-Vocabulary evaluation. The test sets only contain words that do appear in the training sets.

OOV = Out-Of-Vocabulary evaluation. The test sets only contain words that do not appear in the training sets.
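
One way to derive the two evaluation subsets from these definitions is to partition test instances by whether their target word occurs among the training targets. This is a sketch under assumptions: the tuple layout, the lowercase matching, the `split_iv_oov` helper, and the toy data are all hypothetical, and this page does not state how words were matched (for example, by lemma or by surface form).

```python
from typing import Iterable, List, Set, Tuple

# (target word, context 1, context 2, label); hypothetical layout.
Instance = Tuple[str, str, str, bool]


def split_iv_oov(train_targets: Iterable[str],
                 test: Iterable[Instance]) -> Tuple[List[Instance], List[Instance]]:
    """Partition test instances into in-vocabulary and out-of-vocabulary."""
    # Assumption: targets are compared case-insensitively.
    vocab: Set[str] = {t.lower() for t in train_targets}
    iv: List[Instance] = []
    oov: List[Instance] = []
    for inst in test:
        if inst[0].lower() in vocab:
            iv.append(inst)   # target word was seen in training
        else:
            oov.append(inst)  # target word is new at test time
    return iv, oov


# Invented toy data with placeholder contexts, for illustration only.
train_targets = ["Haus", "laufen"]
test_set: List[Instance] = [
    ("Haus", "context A", "context B", True),
    ("Bank", "context C", "context D", False),
]

iv, oov = split_iv_oov(train_targets, test_set)
print(len(iv), len(oov))  # one instance in each subset
```

Scoring each subset separately then shows how much a model relies on having seen the target word during training.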

Human | 74.00 | - | - | - | - | - | 78.00 | - | -
XLMR Large | 84.03 | 84.24 | 72.54 | 76.16 | 75.61 | 73.93 | 72.30 | 75.12 | 65.17
XLMR Base | 80.84 | 81.17 | 71.31 | 73.06 | 71.92 | 71.14 | 68.58 | 70.69 | 62.36
mBERT | 81.58 | 81.86 | 70.08 | 73.67 | 72.92 | 71.24 | 71.96 | 73.15 | 68.54
L-BERT* | 82.90 | 83.23 | 76.64 | 78.14 | 77.62 | 78.00 | 72.64 | 73.89 | 69.10

* L-BERT stands for language-specific BERT. We used the following language-specific models:


This dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.