XL-WiC: The Multilingual Word-in-Context Dataset

About

Contextualised word embeddings are nowadays pervasive across NLP applications and have shown to effectively capture meanings distinctions of the same word in different sentences. Similarly, multilingual models proved to be able to encode words in their contexts in a space that is shared across hundreds of languages.

A system's task on any of the XL-WiC datasets is to identify the intended meaning of a word in a context of a given language. XL-WiC is framed as a binary classification task. Each instance in XL-WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not.

XL-WiC provides dev and test sets in the following 12 languages:

Bulgarian (BG)
Danish (DA)
German (DE)
Estonian (ET)
Farsi (FA)
French (FR)
Croatian (HR)
Italian (IT)
Japanese (JA)
Korean (KO)
Dutch (NL)
Chinese (ZH)

and training sets in the following 3 languages:

German (DE)
French (FR)
Italian (IT)

Download

the whole package (v1.0) with gold labels,
the README file.

Participate in XLWiC's CodaLab competition: submit your results on the test set and see where you stand in the leaderboard!
Link: XLWiC CodaLab Competition

Dataset details

Please see the following paper:

XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
A. Raganato, T. Pasini, J. Camacho-Collados and M.T. Pilehvar
EMNLP 2020.

XL-WiC features multiple interesting characteristics:

It is multilingual, hence, it is suitable for evaluation cross-lingual and multilingual semantic capabilities of models.
It is suitable for evaluating a wide range of applications, including contextualized word and sense representation and Word Sense Disambiguation;
It is framed asa binary classification dataset, in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline;
It is constructed using high quality annotations curated by experts.

Examples from the dataset

Lang	Target	Context-1	Context-2	Label
EN	Beat	We beat the competition	Agassi beat Becker in the tennis championship.	True
DA	Tro	Jeg tror p ̊a det, min mor fortalte.	Maria troede ikke sine egne øjne.	True
ET	Ruum	Uhel hetkel olin v̈ aljaspool aega ja ruumi.	Umberringi oli l̃ oputu ẗ uhi ruum.	True
FR	Causticité	Sa causticité lui a fait bien des ennemis.	La causticité des acides.	False
KO	틀림	틀림이 있는지 없는지 세어 보시오.	그 아이 하는 짓에 틀림이 있다면 모두 이 어미 죄이지요.	False
ZH	發	建築師希望發大火燒掉城市的三分之一。	如果南美洲氣壓偏低，則印度可能發乾旱	True
FA	صرف	صرف غذا نیم ساعت طول کشید	معلم صرف افعال ماضی عربی را آموزش داد	False

State-of-the-Art

Cross-Lingual WordNet

Training: EN
Dev: EN
Test: WordNet test sets
Download data: xlwic_wn_xlingual.zip

Implementation	BG	DA	ET	FA	HR	JA	KO	NL	ZH
Human	-	87.00	-	97.00	-	75.00	76.00	-	85.00
XLMR Large	66.48	71.11	68.71	75.25	72.30	63.83	69.63	72.81	73.15
XLMR Base	60.73	64.79	62.82	69.88	61.01	60.44	66.96	65.73	65.78
mBERT	58.28	64.86	62.56	71.50	63.97	62.26	59.76	63.84	69.36

Cross-Lingual Wiktionary

Training: EN
Dev: EN
Test: Wiktionary test sets.
Table 6 in the paper, first group.
Download data: xlwic_wikt_xlingual.zip

Implementation	DE	FR	IT
Human	74.00	-	78.00
XLMR Large	65.83	62.50	64.86
XLMR Base	58.30	56.13	55.91
mBERT	58.27	56.00	58.61

Monolingual Wiktionary

Training: Target Language.
Dev: Target Language.
Test: Wiktionary test sets.
Table 4 in the paper, second group.
Download data: xlwic_wikt_monolingual.zip

IV = In-Vocabulary evaluation. The test sets only contain words that do appear in the training sets.

OOV = Out-Of-Vocabulary evaluation. Te test sets only contain words that do not appear in the training sets

	DE			FR			IT
	ALL	IV	OOV	ALL	IV	OOV	ALL	IV	OOV
Human	74.00	-	-	-	-	-	78.00	-	-
XLMR Large	84.03	84.24	72.54	76.16	75.61	73.93	72.30	75.12	65.17
XLMR Base	80.84	81.17	71.31	73.06	71.92	71.14	68.58	70.69	62.36
mBERT	81.58	81.86	70.08	73.67	72.92	71.24	71.96	73.15	68.54
L-BERT*	82.90	83.23	76.64	78.14	77.62	78.00	72.64	73.89	69.10

* L-BERT stands for language-specific BERT. We used the following language-specific models:

DE: dbmdz/bert-base-german-cased
FR: camembert/camembert-large (Martin et al., 2020)
IT: dbmdz/bert-base-italian-xxl-cased

Licensing

This dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.

References

[XLMR Large/Base] Unsupervised Cross-lingual Representation Learning at Scale. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. ACL 2020.
[BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL 2019.
CamemBERT: a tasty French language model. Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoı̂t Sagot. ACL 2020.