Depending on its context, an ambiguous word can refer to multiple, potentially unrelated, meanings. Mainstream static word embeddings, such as Word2vec and GloVe, are unable to reflect this dynamic semantic nature. Contextualised word embeddings are an attempt at addressing this limitation by computing dynamic representations for words which can adapt based on context.
A system's task on the WiC dataset is to identify the intended meaning of words. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practice.
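The task format described above can be sketched in Python as follows. This is only an illustration: the class name `WiCInstance`, its field names, and the `accuracy` helper are assumptions made for this sketch, not part of any official WiC tooling. The two sample instances are taken from the examples on this page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class WiCInstance:
    """One WiC item (hypothetical layout): target word, two contexts, gold label."""
    target: str
    context1: str
    context2: str
    same_meaning: bool  # True if the target has the same meaning in both contexts


def accuracy(instances: List[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if not instances:
        raise ValueError("no instances to score")
    correct = sum(predict(inst) == inst.same_meaning for inst in instances)
    return correct / len(instances)


examples = [
    WiCInstance("bed",
                "There's a lot of trash on the bed of the river",
                "I keep a glass of water next to my bed when I sleep",
                False),
    WiCInstance("beat",
                "We beat the competition",
                "Agassi beat Becker in the tennis championship",
                True),
]


def always_true(inst: WiCInstance) -> bool:
    """A trivial predictor that always answers 'same meaning'."""
    return True


print(accuracy(examples, always_true))  # 0.5
```

Any real system would replace `always_true` with a function that inspects the target word and both contexts before returning a binary decision.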
WiC features multiple interesting characteristics:
- It is suitable for evaluating a wide range of applications, including contextualised word and sense representation and Word Sense Disambiguation;
- It is framed as a binary classification dataset, in which, unlike Stanford Contextual Word Similarity (SCWS), identical words are paired with each other (in different contexts); hence, a context-insensitive word embedding model would perform similarly to a random baseline;
- It is constructed using high-quality annotations curated by experts.
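The point about context-insensitive models can be illustrated with a small sketch. A static model assigns one vector per word type, so the target word receives the same vector in both contexts and the cosine similarity between the two occurrences is always 1. Any similarity threshold then produces the same prediction for every instance. The lookup table `STATIC_VECTORS`, the vector values in it, and the threshold of 0.5 are made-up assumptions for illustration only.

```python
import math
from typing import Dict, List

# Toy static embedding table (made-up values): one vector per word type.
STATIC_VECTORS: Dict[str, List[float]] = {
    "bed": [0.2, 0.7, 0.1],
    "beat": [0.9, 0.1, 0.3],
}


def static_embed(word: str, context: str) -> List[float]:
    """A context-insensitive model: the context argument is ignored."""
    return STATIC_VECTORS[word]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_same(word: str, ctx1: str, ctx2: str, threshold: float = 0.5) -> bool:
    """Predict 'same meaning' when the two occurrence vectors are similar enough."""
    sim = cosine(static_embed(word, ctx1), static_embed(word, ctx2))
    return sim >= threshold


# (target, context 1, context 2, gold label), taken from the examples on this page
pairs = [
    ("bed",
     "There's a lot of trash on the bed of the river",
     "I keep a glass of water next to my bed when I sleep",
     False),
    ("beat",
     "We beat the competition",
     "Agassi beat Becker in the tennis championship",
     True),
]

predictions = [predict_same(w, c1, c2) for w, c1, c2, _ in pairs]
print(predictions)  # [True, True]
```

Because the prediction never changes, the model is correct only as often as the predicted label happens to occur. If positive and negative labels are roughly balanced, that is about half the time, which matches the behaviour of a random baseline.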
Participate in WiC's CodaLab competition: submit your results on the test set and see where you stand on the leaderboard!
Link: WiC CodaLab Competition
NOTE: WiC is being used for a shared task as part of the SemDeep-5 IJCAI workshop. Evaluation period is open until 12 April 2019.
Dataset details
Please see the following paper:
WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
M.T. Pilehvar and J. Camacho-Collados, NAACL 2019 (Minneapolis, USA).
Examples from the dataset
|Label||Target||Context 1||Context 2|
|F||bed||There's a lot of trash on the bed of the river||I keep a glass of water next to my bed when I sleep|
|F||land||The pilot managed to land the airplane safely||The enemy landed several of our aircrafts|
|F||justify||Justify the margins||The end justifies the means|
|T||beat||We beat the competition||Agassi beat Becker in the tennis championship|
|T||air||Air pollution||Open a window and let in some air|
|T||window||The expanded window will give us time to catch the thieves||You have a two-hour window of clear weather to finish working on the lawn|
|Contextualised word embeddings||Accuracy %|
|Sentence level baselines|
- [BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.
- [Context2vec] Oren Melamud, Jacob Goldberger, and Ido Dagan. Context2vec: Learning generic context embedding with bidirectional LSTM. CoNLL 2016.
- [ELMo] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. NAACL 2018.
- [DeConf] Mohammad Taher Pilehvar and Nigel Collier. De-Conflated Semantic Representations. EMNLP 2016.
- [SW2V] Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci, and Roberto Navigli. Embedding words and senses together via joint knowledge-enhanced training. CoNLL 2017.
- [JBT] Maria Pelevina, Nikolay Arefyev, Chris Biemann, and Alexander Panchenko. Making sense of word embeddings. RepL4NLP 2016.