Investigating Approaches to Semantic Category Disambiguation Using Large Lexical Resources and Approximate String Matching

Pontus Stenetorp, Sampo Pyysalo, Sophia Ananiadou and Jun’ichi Tsujii

Aizawa Laboratory, University of Tokyo, Japan
University of Manchester, United Kingdom
National Centre for Text Mining, United Kingdom
Microsoft Research Asia, People’s Republic of China

Ishigaki, Okinawa Prefecture, Japan, November 22, 2011, 204th Natural Language Processing Research Meeting (自然言語処理研究会)

Table of Contents

Semantic Category Disambiguation (SCD)

Example: An annotated noun-phrase

Motivation

Previous Work (Cohen et al. 2011)

Previous Work (Stenetorp et al. 2011)

Methods

Linguistically Motivated Measures: Distance Measure

         EGR–1    EGR 1     FGR–1
EGR–1    -        100/10    100/100
EGR 1             -         100/100
FGR–1                       -

Table: String distance costs
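
The slide shows only the cost table, not how the distance measure is defined. As a rough illustration of the general idea of an edit distance whose substitution cost depends on the characters involved, here is a minimal sketch. The function names, the fixed insertion/deletion cost of 100, and the rule that swapping one non-alphanumeric character for another costs 10 are all assumptions made for this example; it is not a reconstruction of the measure used in the talk and is not meant to reproduce every cell of the table.

```python
def edit_distance(a, b, sub_cost, indel_cost=100):
    """Levenshtein distance with a pluggable substitution cost.

    `indel_cost` (assumed value) is charged for each insertion or deletion.
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + indel_cost,                                # delete ca
                curr[j - 1] + indel_cost,                            # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost(ca, cb)),  # substitute
            ))
        prev = curr
    return prev[-1]


def uniform_cost(ca, cb):
    """Every substitution costs the same (assumed value: 100)."""
    return 100


def punctuation_aware_cost(ca, cb):
    """Assumed rule: swapping one non-alphanumeric character for another is cheap."""
    if not ca.isalnum() and not cb.isalnum():
        return 10
    return 100


if __name__ == "__main__":
    for x, y in [("EGR-1", "EGR 1"), ("EGR-1", "FGR-1")]:
        print(x, y,
              edit_distance(x, y, uniform_cost),
              edit_distance(x, y, punctuation_aware_cost))
```

Passing the cost function as a parameter keeps the dynamic program unchanged while the notion of "similar characters" is varied.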

Linguistically Motivated Measures: Start/End Markers
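
One common way to make an n-gram based string comparison sensitive to where a string begins and ends is to pad it with boundary markers before extracting n-grams. The sketch below illustrates that idea only; the marker characters `^` and `$`, the default n-gram size of 3, and the function name are assumptions, and the slide does not say whether the markers in the talk were implemented this way.

```python
def char_ngrams(text, n=3, markers=True, start="^", end="$"):
    """Return the character n-grams of `text`.

    With `markers=True` the string is padded with n-1 start and end
    symbols (assumed to be '^' and '$'), so prefixes and suffixes
    produce n-grams of their own.
    """
    if markers:
        text = start * (n - 1) + text + end * (n - 1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("EGR", markers=False))
    print(char_ngrams("EGR", markers=True))
```

With markers, two strings that share only their first or last characters still share at least one n-gram, which a marker-free representation would miss.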

Resource Selection

Threshold Tuning
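
The results tables report thresholds of t=0.7 and t=0.4. To show what such a threshold controls in approximate string matching, here is a minimal sketch that returns every lexicon entry whose similarity to a query reaches the threshold. Cosine similarity over character trigram sets, the brute-force scan, the toy lexicon, and all function names are assumptions for this example; the slides do not specify the similarity measure or the lookup implementation.

```python
from math import sqrt


def trigrams(s):
    """Set of character trigrams of `s` (no boundary markers)."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def cosine(a, b):
    """Cosine similarity between the trigram sets of two strings."""
    x, y = trigrams(a), trigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))


def approx_lookup(query, lexicon, threshold):
    """Return lexicon entries with similarity >= threshold, in lexicon order."""
    return [entry for entry in lexicon if cosine(query, entry) >= threshold]


if __name__ == "__main__":
    lexicon = ["EGR-1", "EGR-2", "p53"]  # hypothetical toy lexicon
    print(approx_lookup("EGR-1", lexicon, 0.7))
    print(approx_lookup("EGR-1", lexicon, 0.4))
```

Lowering the threshold admits more distant lexicon entries as matches, trading precision of the lookup for coverage.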

Experiments - Corpora

Corpus                                                         Semantic Categories
BioNLP/NLPBA 2004 Shared Task Corpus (NLPBA)                   5
Gene Regulation Event Corpus (GREC)                            6
Collaborative Annotation of a Large Biomedical Corpus (SSC)    4
Epigenetics and Post-Translational Modifications (EPI)         17
Infectious Diseases Corpus (ID)                                16
Genia Event Corpus (GE)                                        11

Table: Corpora used for experiments

Experiments - Measures/Baselines

Results - Baselines

Classifier               EPI         ID          GE          SSC         NLPBA       SGREC       µ
Int.                     92.5        91.2        94.6        81.7        92.1        82.1        89.0
Int.Sim. (t=0.7)         93.7/+16.0  91.8/+6.8   94.4/–3.7   92.2/+57.4  92.1/0.0    83.4/+7.3   91.3/+20.9
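
The slides do not label the two figures in each cell. The numbers are consistent with reading the first as a score in percent and the second as the relative error reduction against a reference row (here the Int. baseline); this reading is inferred from the values and is an assumption. Under it, the second figure can be recomputed as follows (the function name is illustrative):

```python
def error_reduction(reference_score, new_score):
    """Relative error reduction in percent, treating 100 - score as the error."""
    reference_error = 100.0 - reference_score
    new_error = 100.0 - new_score
    return 100.0 * (reference_error - new_error) / reference_error


if __name__ == "__main__":
    # Int. vs. Int.Sim. (t=0.7), scores taken from the table.
    pairs = {
        "EPI": (92.5, 93.7),
        "ID": (91.2, 91.8),
        "GE": (94.6, 94.4),
        "SSC": (81.7, 92.2),
        "NLPBA": (92.1, 92.1),
        "SGREC": (82.1, 83.4),
    }
    for corpus, (reference, new) in pairs.items():
        print(corpus, round(error_reduction(reference, new), 1))
```

Small differences from the tabulated values are possible elsewhere in the results, since the table entries are rounded to one decimal before this calculation is applied.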

Results - Distance Measure

Classifier               EPI         ID          GE          SSC         NLPBA       SGREC       µ
Int.                     92.5        91.2        94.6        81.7        92.1        82.1        89.0
Int.Sim. (t=0.7)         93.7/+16.0  91.8/+6.8   94.4/–3.7   92.2/+57.4  92.1/0.0    83.4/+7.3   91.3/+20.9
Int.Sim.Edit (t=0.7)     93.4/–4.8   91.2/–7.3   93.7/–12.5  91.8/–5.1   91.6/–6.3   82.7/–4.2   90.7/–6.9
Int.Sim.NEdit (t=0.7)    93.5/–3.2   91.2/–7.3   94.0/–7.1   90.7/–19.2  91.9/–2.5   82.7/–4.2   90.7/–6.9

Results - Start/End Markers

Classifier               EPI         ID          GE          SSC         NLPBA       SGREC       µ
Int.                     92.5        91.2        94.6        81.7        92.1        82.1        89.0
Int.Sim. (t=0.7)         93.7/+16.0  91.8/+6.8   94.4/–3.7   92.2/+57.4  92.1/0.0    83.4/+7.3   91.3/+20.9
Int.Sim.Edit (t=0.7)     93.4/–4.8   91.2/–7.3   93.7/–12.5  91.8/–5.1   91.6/–6.3   82.7/–4.2   90.7/–6.9
Int.Sim.NEdit (t=0.7)    93.5/–3.2   91.2/–7.3   94.0/–7.1   90.7/–19.2  91.9/–2.5   82.7/–4.2   90.7/–6.9
Int.Sim. (g,t=0.7)       93.7/0.0    91.7/–1.2   94.5/+1.8   91.0/–15.4  91.9/–2.5   82.9/–3.0   91.0/–3.4
Int.Sim.Edit (g,t=0.7)   93.5/–3.2   90.5/–15.9  93.8/–10.7  91.3/–11.5  91.6/–6.3   81.8/–9.6   90.4/–10.3
Int.Sim.NEdit (g,t=0.7)  93.6/–1.6   90.6/–14.6  94.0/–7.1   90.5/–21.8  91.8/–3.8   82.1/–7.8   90.4/–10.3

Results - Threshold Tuning

Classifier               EPI         ID          GE          SSC         NLPBA       SGREC       µ
Int.                     92.5        91.2        94.6        81.7        92.1        82.1        89.0
Int.Sim. (t=0.7)         93.7/+16.0  91.8/+6.8   94.4/–3.7   92.2/+57.4  92.1/0.0    83.4/+7.3   91.3/+20.9
Int.Sim. (t=0.4)         94.1/+6.3   92.4/+7.3   94.4/0.0    92.4/+2.6   92.0/–1.3   83.3/–0.6   91.4/+1.1
Int.Sim. (g,t=0.4)       94.1/+6.3   93.2/+17.2  94.4/0.0    91.9/–3.8   92.1/0.0    83.3/–0.6   91.5/+2.3

Results - Resource Selection

Classifier               EPI         ID          GE          SSC         NLPBA       SGREC       µ
Int.                     92.5        91.2        94.6        81.7        92.1        82.1        89.0
Int.Sim. (t=0.7)         93.7/+16.0  91.8/+6.8   94.4/–3.7   92.2/+57.4  92.1/0.0    83.4/+7.3   91.3/+20.9
Int.Sim. (t=0.4)         94.1/+6.3   92.4/+7.3   94.4/0.0    92.4/+2.6   92.0/–1.3   83.3/–0.6   91.4/+1.1
Int.Sim. (g,t=0.4)       94.1/+6.3   93.2/+17.2  94.4/0.0    91.9/–3.8   92.1/0.0    83.3/–0.6   91.5/+2.3
Int.Sim. (r,t=0.4)       93.5/–3.2   92.6/+9.8   94.5/+1.8   91.3/–11.5  91.9/–2.5   84.0/+3.6   91.3/0.0

Conclusions

Thank You for Your Attention