Investigating Approaches to Semantic Category Disambiguation Using Large Lexical Resources and Approximate String Matching

Pontus Stenetorp, Sampo Pyysalo, Sophia Ananiadou and Jun’ichi Tsujii
—
Aizawa Laboratory, University of Tokyo, Japan
University of Manchester, United Kingdom
National Centre for Text Mining, United Kingdom
Microsoft Research Asia, People’s Republic of China

Ishigaki, Okinawa Prefecture, Japan, 22 November 2011, 204th meeting of the Natural Language Processing research group (自然言語処理研究会)

Introduction:
- Task Setting
- Previous Work
Experiments:
- Linguistically Motivated Metrics
- Resource Selection
- Threshold Tuning
Results:
- Discussion
- Conclusions

Semantic Category Disambiguation (SCD)

Task definition:
- Given: a textual span in its context and a set of semantic categories
- Assign: a single semantic category to the span

Example: An annotated noun-phrase

Related to Named Entity Recognition (NER):
- Conceptually NER consists of two sub-problems
  - Find textual spans containing entities
  - Disambiguate between multiple semantic categories for each span

Motivation

Event Extraction (EE):
- Recent move towards complex EE (Kim et al. 2009, 2011)
- Complex event annotations contain a multitude of semantic categories (80 for Genia (Kim et al. 2008))
- State-of-the-art NER targets a single class
- Annotated resources for multi-class NER exist
- Joint multi-class NER is beneficial (Dang and Aizawa, 2008)
Why Biomedical?
- Extensive annotated/lexical resources
- Real-world applications: curation projects stand to save substantial resources
Applications:
- EE
- Coordination-analysis
- Co-reference resolution

Previous Work (Cohen et al. 2011)

Goal:
- Annotation assistance for ontology annotations
Procedure:
- Assume semantic categories are defined by ontological resources
- Map a given textual span to one or several ontologies using a set of rules
Performance:
- Evaluated on single-corpus annotated data
- 77.1% to 95.5% depending on the category

Previous Work (Stenetorp et al. 2011)

Goal:
- Show benefits of using approximate string matching to lexical resources
Procedure:
- Standard NER-inspired feature set
- Novel approximate string matching features using SimString (Okazaki and Tsujii 2010)
- 170 lexical databases containing > 20,000,000 entries (large)
- Semantic categories learned from training data
Performance:
- Evaluated on six annotated corpora
- Accuracy of 85.9% to 95.3% (not directly comparable to Cohen et al. (2011))

Methods

Issues not investigated by Stenetorp et al. (2011)
For this work:
- Linguistically Motivated Measures
  - Distance Measure
  - Start/End Guards
- Resource Selection
- Threshold Tuning

Linguistically Motivated Measures: Distance Measure

SimString uses a cosine measure over n-grams:
- Any change in character is equally “harmful”
- This does not match reality: “EGR–1”, “EGR 1” and “FGR–1”
Use Levenshtein distance with a protein cost matrix from the literature (Tsuruoka 2003)
Normalised and non-normalised versions

	EGR–1	EGR 1	FGR–1
EGR–1	-	100/10	100/100
EGR 1		-	100/100
FGR–1			-

Table: String distance costs
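
To make the idea of a cost-sensitive edit distance concrete, here is a minimal sketch of a weighted Levenshtein distance. The costs (10 for swapping one separator for another, 100 for any other edit) are placeholders that only echo the magnitudes in the table; they are not the Tsuruoka (2003) matrix, and the function names are hypothetical.

```python
from typing import Callable

# Hypothetical costs: separator-for-separator swaps are cheap,
# every other edit is expensive. Not the published cost matrix.
SEPARATORS = {"-", "–", " "}
CHEAP, EXPENSIVE = 10, 100


def substitution_cost(a: str, b: str) -> int:
    if a == b:
        return 0
    if a in SEPARATORS and b in SEPARATORS:
        return CHEAP
    return EXPENSIVE


def weighted_levenshtein(
    s: str,
    t: str,
    sub_cost: Callable[[str, str], int] = substitution_cost,
    indel_cost: int = EXPENSIVE,
) -> int:
    """Edit distance by dynamic programming with per-character substitution costs."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete cs
                curr[j - 1] + indel_cost,        # insert ct
                prev[j - 1] + sub_cost(cs, ct),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalised(s: str, t: str) -> float:
    """Distance divided by the worst case for strings of these lengths."""
    longest = max(len(s), len(t))
    return weighted_levenshtein(s, t) / (longest * EXPENSIVE) if longest else 0.0


print(weighted_levenshtein("EGR–1", "EGR 1"))  # separator swap only
print(weighted_levenshtein("EGR–1", "FGR–1"))  # letter substitution
```

Under these assumed costs the separator variant comes out far closer to "EGR–1" than the variant with a different first letter, which is the behaviour the slide asks for.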

Problem: O(n^2) time complexity:
- Compare the top 10 results from SimString
- Pick the best result using the cost-matrix
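
The two-stage look-up above can be sketched as follows: a cheap approximate retrieval produces a ranked list, and only its first ten entries are re-scored with the more expensive edit distance. The `retrieved` list is a stand-in for SimString output, and the costs and function names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

SEPARATORS = {"-", " "}  # hypothetical: characters that are cheap to interchange


def sub_cost(a: str, b: str) -> int:
    if a == b:
        return 0
    return 10 if a in SEPARATORS and b in SEPARATORS else 100


def edit_cost(s: str, t: str, indel: int = 100) -> int:
    """Quadratic-time weighted edit distance (one row of the table at a time)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * indel]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + indel,
                            curr[j - 1] + indel,
                            prev[j - 1] + sub_cost(cs, ct)))
        prev = curr
    return prev[-1]


def rerank(
    query: str,
    candidates: Sequence[str],
    distance: Callable[[str, str], float] = edit_cost,
    k: int = 10,
) -> List[str]:
    """Keep the first k retrieved candidates and sort them by edit cost."""
    shortlist = list(candidates[:k])
    return sorted(shortlist, key=lambda c: distance(query, c))


# Stand-in for the ranked output of the approximate look-up.
retrieved = ["FGR-1", "EGR 1", "EGFR-1"]
print(rerank("EGR-1", retrieved))
```

Because the expensive comparison runs on at most `k` strings per query, its quadratic cost in string length no longer scales with the size of the lexical resource.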

Linguistically Motivated Measures: Start/End Markers

SimString uses a cosine measure over n-grams:
- n-grams don’t naturally preserve gram position
Use Start/End guards:
- Example: “Foobar” -> “|Foobar|”
- Standard NER feature, but not used for look-ups
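
A small sketch of what the guards change, assuming character trigrams and "|" as the guard character as in the example above. The helper name is hypothetical, and this is plain Python rather than SimString's own n-gram extraction.

```python
from typing import List


def char_ngrams(text: str, n: int = 3, guards: bool = False, guard: str = "|") -> List[str]:
    """Character n-grams, optionally after wrapping the string in guard characters."""
    if guards:
        text = f"{guard}{text}{guard}"
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = char_ngrams("Foobar")
guarded = char_ngrams("Foobar", guards=True)
print(plain)    # ['Foo', 'oob', 'oba', 'bar']
print(guarded)  # ['|Fo', 'Foo', 'oob', 'oba', 'bar', 'ar|']

# Without guards, "bar" looks the same whether it starts or ends a string.
# With guards, the n-gram "|ba" only occurs when "ba" starts the string.
print("|ba" in set(guarded))                            # False
print("|ba" in set(char_ngrams("bar", guards=True)))    # True
```

The guarded n-grams "|Fo" and "ar|" carry the positional information that plain n-grams lose.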

Resource Selection

SimSem used a total of 170 lexical databases:
- No detailed analysis of individual impact
- Possible negative impacts of excessive resource use
Greedy descent:
- Iteratively knock out non-beneficial resources over several rounds
Greedy ascent:
- Iteratively add beneficial resources over several rounds
Overfitting protection:
- Cross-validate on the test set for each iteration
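
The two greedy strategies can be sketched as below. This is one possible reading of the slide (one resource removed or added per round), not the exact procedure used in the experiments. The `score` callable stands in for whatever cross-validated accuracy is computed for a set of resources, and the toy weights are invented for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Set

Score = Callable[[FrozenSet[str]], float]


def greedy_descent(resources: Iterable[str], score: Score, max_rounds: int = 10) -> Set[str]:
    """Start from all resources; each round drop the one whose removal helps most."""
    selected = set(resources)
    best = score(frozenset(selected))
    for _ in range(max_rounds):
        trials = {r: score(frozenset(selected - {r})) for r in selected}
        if not trials:
            break
        candidate = max(sorted(trials), key=trials.get)
        if trials[candidate] <= best:  # no removal improves the score
            break
        selected.remove(candidate)
        best = trials[candidate]
    return selected


def greedy_ascent(resources: Iterable[str], score: Score, max_rounds: int = 10) -> Set[str]:
    """Start from no resources; each round add the one that helps most."""
    remaining = set(resources)
    selected: Set[str] = set()
    best = score(frozenset())
    for _ in range(max_rounds):
        trials = {r: score(frozenset(selected | {r})) for r in remaining}
        if not trials:
            break
        candidate = max(sorted(trials), key=trials.get)
        if trials[candidate] <= best:  # no addition improves the score
            break
        selected.add(candidate)
        remaining.remove(candidate)
        best = trials[candidate]
    return selected


# Toy scoring function: each resource contributes a fixed, made-up amount.
WEIGHTS = {"genes": 2.0, "proteins": 1.0, "noise": -1.5}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[r] for r in subset)


print(sorted(greedy_descent(WEIGHTS, toy_score)))
print(sorted(greedy_ascent(WEIGHTS, toy_score)))
```

In a real run each call to `score` means retraining and evaluating the classifier, so the number of rounds is the main cost driver with 170 resources.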

Threshold Tuning

SimString uses a threshold to determine matches between strings
Ranges from 1.0 to 0.0:
- 1.0 is an exact match
- 0.0 matches anything
Stenetorp et al. (2011) used a threshold of 0.7
We tried to find a better threshold:
- 0.4 appeared to give better results for our development set
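
To illustrate what lowering the threshold does, the sketch below uses a set-based cosine over character trigrams, |X ∩ Y| / sqrt(|X| |Y|), as a simplified stand-in for SimString's measure. The lexicon entries and function names are made up for the example.

```python
import math
from typing import List, Set


def ngram_set(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams; strings shorter than n yield themselves."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def cosine(a: str, b: str, n: int = 3) -> float:
    x, y = ngram_set(a, n), ngram_set(b, n)
    return len(x & y) / math.sqrt(len(x) * len(y))


def matches(query: str, lexicon: List[str], threshold: float) -> List[str]:
    """Entries whose similarity to the query reaches the threshold."""
    return [entry for entry in lexicon if cosine(query, entry) >= threshold]


lexicon = ["interleukin 2", "interleukin-12", "leukin", "interferon"]
print(matches("interleukin-2", lexicon, 0.7))
print(matches("interleukin-2", lexicon, 0.4))
```

At 0.7 only the two near-identical entries match; at 0.4 the shorter substring "leukin" is also retrieved, which shows how a lower threshold trades precision of the look-up for coverage.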

Experiments - Corpora

Same set-up as Stenetorp et al. (2011)

Corpus	Semantic Categories
BioNLP/NLPBA 2004 Shared Task Corpus (NLPBA)	5
Gene Regulation Event Corpus (GREC)	6
Collaborative Annotation of a Large Biomedical Corpus (SSC)	4
Epigenetics and Post-Translational Modifications (EPI)	17
Infectious Diseases Corpus (ID)	16
Genia Event Corpus (GE)	11

Table: Corpora used for experiments

Experiments - Measures/Baselines

Performance measures:
- Accuracy
- Primary: Accuracy Area Under the Learning Curve (AUC)
Baselines (Stenetorp et al. 2011):
- Internal Classifier (Int.)
- Internal Classifier + SimString (Int.Sim.)
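
The slides do not spell out how the area under the learning curve is computed. One common choice, shown here purely as an assumption, is the trapezoidal area under accuracy plotted against the fraction of training data used, divided by the width of the x-range so that the result stays on the accuracy scale. The data points are invented.

```python
from typing import Sequence


def learning_curve_auc(fractions: Sequence[float], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under the accuracy curve, normalised by the x-range."""
    points = sorted(zip(fractions, accuracies))
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return area / (points[-1][0] - points[0][0])


# Made-up learning curve: accuracy at 10%, 50% and 100% of the training data.
print(learning_curve_auc([0.1, 0.5, 1.0], [80.0, 90.0, 92.0]))
```

Compared with accuracy on the full training set alone, this rewards methods that already perform well with little training data.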

Results - Baselines

Classifier	EPI	ID	GE	SSC	NLPBA	SGREC	µ
Int.	92.5	91.2	94.6	81.7	92.1	82.1	89.0
Int.Sim. (t=0.7)	93.7/+16.0	91.8/+6.8	94.4/–3.7	92.2/+57.4	92.1/0.0	83.4/+7.3	91.3/+20.9
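
The second figure in each cell is not defined on the slide. The Int.Sim. row above is consistent with reading it as the relative reduction in error (100 minus the score) with respect to the Int. row; the sketch below checks that reading, which is an inference and not something the slides state.

```python
def relative_error_reduction(baseline: float, new: float) -> float:
    """Percentage by which the error (100 - score) shrinks relative to the baseline."""
    baseline_err = 100.0 - baseline
    new_err = 100.0 - new
    return 100.0 * (baseline_err - new_err) / baseline_err


# Int. versus Int.Sim. (t=0.7), scores taken from the table above.
for corpus, base, new in [("EPI", 92.5, 93.7), ("GE", 94.6, 94.4), ("SSC", 81.7, 92.2)]:
    print(corpus, round(relative_error_reduction(base, new), 1))
```

Under this reading, a small absolute gain on a corpus that is already near ceiling can still correspond to a sizeable share of the remaining errors.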

Results - Distance Measure

Classifier	EPI	ID	GE	SSC	NLPBA	SGREC	µ
Int.	92.5	91.2	94.6	81.7	92.1	82.1	89.0
Int.Sim. (t=0.7)	93.7/+16.0	91.8/+6.8	94.4/–3.7	92.2/+57.4	92.1/0.0	83.4/+7.3	91.3/+20.9
Int.Sim.Edit (t=0.7)	93.4/–4.8	91.2/–7.3	93.7/–12.5	91.8/–5.1	91.6/–6.3	82.7/–4.2	90.7/–6.9
Int.Sim.NEdit (t=0.7)	93.5/–3.2	91.2/–7.3	94.0/–7.1	90.7/–19.2	91.9/–2.5	82.7/–4.2	90.7/–6.9

Results - Start/End Markers

Classifier	EPI	ID	GE	SSC	NLPBA	SGREC	µ
Int.	92.5	91.2	94.6	81.7	92.1	82.1	89.0
Int.Sim. (t=0.7)	93.7/+16.0	91.8/+6.8	94.4/–3.7	92.2/+57.4	92.1/0.0	83.4/+7.3	91.3/+20.9
Int.Sim.Edit (t=0.7)	93.4/–4.8	91.2/–7.3	93.7/–12.5	91.8/–5.1	91.6/–6.3	82.7/–4.2	90.7/–6.9
Int.Sim.NEdit (t=0.7)	93.5/–3.2	91.2/–7.3	94.0/–7.1	90.7/–19.2	91.9/–2.5	82.7/–4.2	90.7/–6.9
Int.Sim. (g,t=0.7)	93.7/0.0	91.7/–1.2	94.5/+1.8	91.0/–15.4	91.9/–2.5	82.9/–3.0	91.0/–3.4
Int.Sim.Edit (g,t=0.7)	93.5/–3.2	90.5/–15.9	93.8/–10.7	91.3/–11.5	91.6/–6.3	81.8/–9.6	90.4/–10.3
Int.Sim.NEdit (g,t=0.7)	93.6/–1.6	90.6/–14.6	94.0/–7.1	90.5/–21.8	91.8/–3.8	82.1/–7.8	90.4/–10.3

Results - Threshold Tuning

Classifier	EPI	ID	GE	SSC	NLPBA	SGREC	µ
Int.	92.5	91.2	94.6	81.7	92.1	82.1	89.0
Int.Sim. (t=0.7)	93.7/+16.0	91.8/+6.8	94.4/–3.7	92.2/+57.4	92.1/0.0	83.4/+7.3	91.3/+20.9
Int.Sim. (t=0.4)	94.1/+6.3	92.4/+7.3	94.4/0.0	92.4/+2.6	92.0/–1.3	83.3/–0.6	91.4/+1.1
Int.Sim. (g,t=0.4)	94.1/+6.3	93.2/+17.2	94.4/0.0	91.9/–3.8	92.1/0.0	83.3/–0.6	91.5/+2.3

Results - Resource Selection

Classifier	EPI	ID	GE	SSC	NLPBA	SGREC	µ
Int.	92.5	91.2	94.6	81.7	92.1	82.1	89.0
Int.Sim. (t=0.7)	93.7/+16.0	91.8/+6.8	94.4/–3.7	92.2/+57.4	92.1/0.0	83.4/+7.3	91.3/+20.9
Int.Sim. (t=0.4)	94.1/+6.3	92.4/+7.3	94.4/0.0	92.4/+2.6	92.0/–1.3	83.3/–0.6	91.4/+1.1
Int.Sim. (g,t=0.4)	94.1/+6.3	93.2/+17.2	94.4/0.0	91.9/–3.8	92.1/0.0	83.3/–0.6	91.5/+2.3
Int.Sim. (r,t=0.4)	93.5/–3.2	92.6/+9.8	94.5/+1.8	91.3/–11.5	91.9/–2.5	84.0/+3.6	91.3/0.0

Conclusions

Take-home points:
- Optimal approximate string matching strategy depends on the dataset
- Approximate string matching helps SCD
Future work:
- Detailed error analysis
- Incorporating large-scale unsupervised/statistical aspects
- Automatic look-up strategy selection per dataset (category too?)

Thank You for Your Attention

In the spirit of internationality:
- 日本語：ご清聴ありがとうございました (Japanese: Thank you for listening)
- Svenska: Tack för er uppmärksamhet (Swedish: Thank you for your attention)
About SimSem:
- Open-source and freely available
- Resources, extended experiments and code: http://github.com/ninjin/simsem
About the author:
- Website (for paper and slides): http://pontus.stenetorp.se/
- E-mail: <pontus stenetorp se>
- If you have any questions or are pursuing similar research I’d love to talk to you