WiC: The Word-in-Context Dataset

Depending on its context, an ambiguous word can refer to multiple, potentially unrelated, meanings. Mainstream static word embeddings, such as Word2vec and GloVe, are unable to reflect this dynamic semantic nature. Contextualised word embeddings are an attempt at addressing this limitation by computing dynamic representations for words which can adapt based on context.

A system's task on the WiC dataset is to identify the intended meaning of words. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practice.
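The task format described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `WiCInstance` class, its field names, and the cosine-threshold decision rule are assumptions made for the example, not part of the dataset package, and the vectors stand in for whatever contextual representation of w a system computes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WiCInstance:
    """One WiC instance (hypothetical field names)."""
    target: str         # the target word w, a noun or a verb
    context_1: str      # first context containing w
    context_2: str      # second context containing w
    same_meaning: bool  # gold label: True if w has the same meaning in both


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_same_meaning(vec_1, vec_2, threshold=0.5):
    """Assumed decision rule: predict 'same meaning' when the two
    contextual vectors of w are similar enough. The threshold value
    is arbitrary here and would normally be tuned on held-out data."""
    return cosine(vec_1, vec_2) >= threshold


instance = WiCInstance(
    target="beat",
    context_1="We beat the competition",
    context_2="Agassi beat Becker in the tennis championship",
    same_meaning=True,
)

# Toy vectors standing in for the representations of "beat" in each context.
prediction = predict_same_meaning([1.0, 0.0], [0.9, 0.1])
print(prediction == instance.same_meaning)
```

Because the output is a single True/False decision per instance, systems can be compared with plain accuracy, which is the metric reported in the tables on this page.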

WiC features multiple interesting characteristics:

It is suitable for evaluating a wide range of applications, including contextualised word and sense representation and Word Sense Disambiguation;
It is framed as a binary classification dataset, in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline;
It is constructed using high-quality annotations curated by experts.
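The second point can be made concrete with a small sketch. A static model assigns the target word the same vector in both contexts, so any similarity-based decision is identical for every instance. The code below uses the six instances listed under "Examples from the dataset" on this page; the `static_embed` function is a toy stand-in (an assumption for illustration, not a real embedding model) and the threshold is arbitrary.

```python
import math

# (target, context 1, context 2, gold label) for the six examples on this page.
EXAMPLES = [
    ("bed", "There's a lot of trash on the bed of the river",
     "I keep a glass of water next to my bed when I sleep", False),
    ("land", "The pilot managed to land the airplane safely",
     "The enemy landed several of our aircrafts", False),
    ("justify", "Justify the margins", "The end justifies the means", False),
    ("beat", "We beat the competition",
     "Agassi beat Becker in the tennis championship", True),
    ("air", "Air pollution", "Open a window and let in some air", True),
    ("window", "The expanded window will give us time to catch the thieves",
     "You have a two-hour window of clear weather to finish working on the lawn",
     True),
]


def static_embed(word, context):
    """Toy static embedding: the vector depends on the word only.
    The context argument is deliberately ignored."""
    return [float(len(word)), float(ord(word[0]))]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


predictions = []
for word, context_1, context_2, gold in EXAMPLES:
    similarity = cosine(static_embed(word, context_1),
                        static_embed(word, context_2))
    predictions.append(similarity >= 0.5)  # arbitrary threshold

correct = sum(pred == gold for pred, (_, _, _, gold) in zip(predictions, EXAMPLES))
accuracy = correct / len(EXAMPLES)
print(predictions)  # the same decision for every instance
print(accuracy)     # 0.5 on these six examples (three T, three F)
```

Since the two vectors are always identical, the model cannot separate T from F instances, and on a label-balanced sample such as this one it scores exactly what a coin flip would be expected to score.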

Download

the whole package (v1.0, with gold labels for test set!),
the README file.

Participate in WiC's CodaLab competition: submit your results on the test set and see where you stand on the leaderboard!
Link: WiC CodaLab Competition

WiC is featured as a part of the SuperGLUE benchmark.

WiC was also used for a shared task at the SemDeep-5 IJCAI workshop.

Dataset details

Please see the following paper:

WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
M.T. Pilehvar and J. Camacho-Collados, NAACL 2019 (Minneapolis, USA).
Note: Results differ slightly between the NAACL and arXiv versions of the paper. Please use the results in the arXiv version, which is more up to date, as the baseline for your evaluations.

Examples from the dataset

Label	Target	Context-1	Context-2
F	bed	There's a lot of trash on the bed of the river	I keep a glass of water next to my bed when I sleep
F	land	The pilot managed to land the airplane safely	The enemy landed several of our aircrafts
F	justify	Justify the margins	The end justifies the means
T	beat	We beat the competition	Agassi beat Becker in the tennis championship
T	air	Air pollution	Open a window and let in some air
T	window	The expanded window will give us time to catch the thieves	You have a two-hour window of clear weather to finish working on the lawn
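Rows laid out like the table above (tab-separated Label, Target, Context-1, Context-2) can be read with Python's standard `csv` module. This is a sketch for the layout shown on this page only; the files in the downloadable package may be organised differently, so check the README before adapting it. The `read_examples` helper and the output field names are hypothetical.

```python
import csv
import io

# Two rows copied from the table above, with its header line.
SAMPLE = (
    "Label\tTarget\tContext-1\tContext-2\n"
    "F\tjustify\tJustify the margins\tThe end justifies the means\n"
    "T\tair\tAir pollution\tOpen a window and let in some air\n"
)


def read_examples(text):
    """Parse tab-separated example rows into dictionaries.
    The label 'T' is mapped to True (same meaning), 'F' to False."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [
        {
            "target": row["Target"],
            "context_1": row["Context-1"],
            "context_2": row["Context-2"],
            "same_meaning": row["Label"] == "T",
        }
        for row in reader
    ]


rows = read_examples(SAMPLE)
for row in rows:
    print(row["target"], row["same_meaning"])
```

Quoting is disabled so that apostrophes and quotation marks inside the contexts are treated as ordinary characters.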

State-of-the-Art

Sentence-level contextualised embeddings	Implementation	Accuracy %
SenseBERT-large†	Levine et al. (2019)	72.1
KnowBERT-W+W†	Peters et al. (2019)	70.9
RoBERTa	Liu et al. (2019)	69.9
BERT-large	Wang et al. (2019)	69.6
Ensemble	Garí Soler et al. (2019)	66.7
ELMo-weighted	Ansell et al. (2019)	61.2

Word-level contextualised embeddings	Implementation	Accuracy %
WSD†	Loureiro and Jorge (2019)	67.7
BERT-large	WiC's paper	65.5
Context2vec	WiC's paper	59.3
Elmo	WiC's paper	57.7

Sense representations
LessLex	Colla et al. (2020)	59.2
DeConf	WiC's paper	58.7
SW2V	WiC's paper	58.1
JBT	WiC's paper	53.6

Sentence level baselines
Sentence Bag-of-words	WiC's paper	58.7
Sentence LSTM	WiC's paper	53.1

Random baseline		50.0

† Uses external knowledge resources

Performance upper bound

	Accuracy %
Human-level performance	80.0

Contextualised word embeddings	Accuracy %
Context2vec	59.2
Elmo-3	57.4
Elmo-1	56.3

Sense representations
DeConf	59.4
SW2V	58.1
JBT	53.9

Sentence level baselines
Sentence Bag-of-words	59.3
Sentence LSTM	53.2

Licensing

This dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.

References

[LessLex] Davide Colla, Enrico Mensa and Daniele P. Radicioni. LessLex: Linking Multilingual Embeddings to SenSe Representations of LEXical Items. Computational Linguistics, 2020.
[SenseBERT] Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, Yoav Shoham. SenseBERT: Driving Some Sense into BERT. arXiv 2019.
[KnowBERT] Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, Noah A. Smith. Knowledge Enhanced Contextual Word Representations. EMNLP 2019.
[RoBERTa] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019.
[BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.
[Context2vec] Oren Melamud, Jacob Goldberger, and Ido Dagan. Context2vec: Learning generic context embedding with bidirectional LSTM. CoNLL 2016.
[Elmo] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. NAACL 2018.
[DeConf] Mohammad Taher Pilehvar and Nigel Collier. De-Conflated Semantic Representations. EMNLP 2016.
[SW2V] Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci, and Roberto Navigli. Embedding words and senses together via joint knowledge-enhanced training. CoNLL 2017.
[JBT] Maria Pelevina, Nikolay Arefyev, Chris Biemann, and Alexander Panchenko. Making sense of word embeddings. RepL4NLP 2016.
[WSD] Daniel Loureiro and Alípio Jorge. LIAAD at SemDeep-5 Challenge: Word-in-Context (WiC). SemDeep 2019.
[Ensemble] Aina Garí Soler, Marianna Apidianaki and Alexandre Allauzen. LIMSI-MULTISEM at the IJCAI SemDeep-5 WiC Challenge: Context Representations for Word Usage Similarity Estimation. SemDeep 2019.
[ELMo-weighted] Alan Ansell, Felipe Bravo-Marquez and Bernhard Pfahringer. An ELMo-inspired approach to SemDeep-5's Word-in-Context task. SemDeep 2019.
[SuperGLUE] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.

WiC: The Word-in-Context Dataset (English)

A reliable benchmark for the evaluation of context-sensitive word embeddings

(New!) [XL-WiC] - WiC in 12 other languages!