Almost Total Recall: Semantic Category Disambiguation Using Large Lexical Resources and Approximate String Matching

Pontus Stenetorp, Sampo Pyysalo, Sophia Ananiadou and Jun’ichi Tsujii
Aizawa Laboratory, University of Tokyo, Japan
University of Manchester, United Kingdom
National Centre for Text Mining, United Kingdom
Microsoft Research Asia, People’s Republic of China

Nanyang Technological University, Singapore, December 15th 2011

Introduction:
- Semantic Category Disambiguation
- Possible Applications
- Research Target
- Previous Work
Experiments:
- Measuring Applicability
- Experimental Set-up
Results:
- Experimental Results
- Discussion and Conclusions

Semantic Category Disambiguation (SCD)

Figure: Demarked spans

Figure: Annotated spans

Semantic Category Disambiguation (SCD)

Formally:
- Given a continuous textual span in its context
- Assign one out of several semantic categories from a fixed set
Observation:
- A sub-task of multi-class Named Entity Recognition (NER)

Possible Applications

Relevant for several NLP tasks:
- Coordination analysis
- Co-reference resolution
- Multi-class NER
- Annotation support
There are clear possible applications
But does semantic category disambiguation perform well enough for them?

Research Target

In particular, annotation support for multiple semantic categories
Also, act as a component in a pipeline
Allow multiple suggestions for an annotation
Goals:
- Keep the number of suggestions (the ambiguity) to a minimum
- Maintain a high level of recall
- This gives an ambiguity vs. recall trade-off

Previous Research

Cohen et al. (2011):
- Define semantic categories by ontologies
- Match a textual span against one or several ontologies
- Used a set of heuristics to increase string matching performance
- Intended as annotation support
- Fast, reliable, non-probabilistic
Performance:
- Evaluated on a corpus
- Accuracy of 77.1% to 95.5% depending on the semantic category

Previous Research

Stenetorp et al. (2011):
- Learn semantic categories from training data
- Single category assumption
- Used fast approximate string matching against 170 databases with 20,335,426 entries to increase performance
- Fast, reliable, probabilistic
Performance:
- Evaluated on six annotated corpora
- Accuracy of 85.9% to 95.3% depending on the semantic category (not directly comparable to Cohen et al. (2011))
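
To illustrate what approximate string matching against a lexical resource can look like, here is a minimal sketch. The slides do not state which similarity measure was used, so the character-trigram cosine similarity, the `min_similarity` threshold and the toy lexicon entries below are all assumptions made for illustration, not the method behind the reported numbers.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the character n-grams of a lower-cased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def approximate_matches(span, lexicon, min_similarity=0.7):
    """Return (entry, similarity) pairs at or above the threshold, best first."""
    query = char_ngrams(span)
    scored = ((entry, cosine(query, char_ngrams(entry))) for entry in lexicon)
    kept = [(entry, score) for entry, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical lexicon entries; a real resource would hold millions of entries.
toy_lexicon = ["interleukin 2", "insulin", "p53"]
matches = approximate_matches("interleukin-2", toy_lexicon)
```

In this toy case the hyphenated span still matches the space-separated lexicon entry, which an exact lookup would miss. A linear scan like this would not scale to tens of millions of entries; a dedicated index would be needed for that.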

Measuring Applicability

How do we measure applicability for annotation support and NLP pipelining?
Notes on annotation:
- Increase category selection speed
- Every annotation judgement is signed off by a human annotator
Method:
- Exploit probabilistic nature of our model
- Threshold on the sum of the probabilities for the suggested categories
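
One way to read the thresholding method above: rank the categories by model probability and keep adding them until their summed probability reaches a threshold. The sketch below assumes this reading; the function name, the default threshold and the example probabilities are hypothetical.

```python
def suggest_categories(probs, threshold=0.95):
    """Return the top-ranked categories whose summed probability first reaches the threshold.

    `probs` maps each category name to the model's probability for one span.
    If the threshold is never reached, every category is returned.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    suggestions = []
    total = 0.0
    for category, p in ranked:
        suggestions.append(category)
        total += p
        if total >= threshold:
            break
    return suggestions


# Hypothetical model output for a single span.
example = {"Protein": 0.70, "DNA": 0.20, "RNA": 0.06, "Cell_line": 0.04}
two_suggestions = suggest_categories(example, threshold=0.85)
one_suggestion = suggest_categories(example, threshold=0.50)
```

Raising the threshold yields more suggestions per span (higher ambiguity) and makes it more likely that the correct category is included (higher recall), which is the trade-off the research target describes.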

Experimental Set-up

Corpus	Semantic Categories
BioNLP/NLPBA 2004 Shared Task Corpus (NLPBA)	5
Gene Regulation Event Corpus (GREC)	6
Collaborative Annotation of a Large Biomedical Corpus (SSC)	4
Epigenetics and Post-Translational Modifications (EPI)	17
Infectious Diseases Corpus (ID)	16
Genia Event Corpus (GE)	11

Table: Corpora used for experiments

Experimental Set-up

Metrics:
- Ambiguity: the average number of suggested semantic categories
- Recall: how often the correct category is among the suggestions
Data points:
- Average over the learning curve (analogous to Area Under Curve)
- Using all available training data
Data is split into training, development and test sets; the test sets were used only for the final experiments
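
The two metrics and the learning-curve average can be sketched as follows. This is an illustration under assumptions: the toy suggestions and gold labels are made up, and the learning-curve average is taken as a plain unweighted mean over data points, which the slides do not specify.

```python
def ambiguity(suggestions):
    """Average number of suggested categories per span."""
    return sum(len(s) for s in suggestions) / len(suggestions)


def recall(suggestions, gold):
    """Fraction of spans whose gold category is among the suggestions."""
    hits = sum(1 for s, g in zip(suggestions, gold) if g in s)
    return hits / len(gold)


def learning_curve_mean(values):
    """Unweighted mean of a metric over the learning-curve data points (assumed)."""
    return sum(values) / len(values)


# Hypothetical suggestions for four spans and their gold categories.
toy_suggestions = [
    ["Protein"],
    ["Protein", "DNA"],
    ["RNA", "DNA", "Protein"],
    ["Cell_line"],
]
toy_gold = ["Protein", "DNA", "Protein", "Cell_type"]

toy_ambiguity = ambiguity(toy_suggestions)       # (1 + 2 + 3 + 1) / 4
toy_recall = recall(toy_suggestions, toy_gold)   # 3 of 4 spans are covered
```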

Experimental Results

Figure: Ambiguity per Dataset

Experimental Results

Figure: Recall per Dataset

Experimental Results

Data set	μAmb.	Amb.	μRecall	Recall
EPI	1.8/89.4%	1.3/92.4%	99.5%	99.4%
ID	2.9/81.9%	1.9/88.1%	98.8%	98.6%
GE	2.1/80.9%	1.7/84.5%	99.4%	99.5%
SSC	2.0/50.0%	1.7/57.5%	99.6%	99.5%
NLPBA	1.8/64.0%	1.6/68.0%	99.1%	99.1%
SGREC	2.4/60.0%	2.0/66.7%	98.7%	98.6%

Table: Performance averaged over the learning curve (μ) and at the final data point, for each dataset
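
The percentage next to each ambiguity value appears to be the relative reduction compared with suggesting every category in the dataset, that is (|C| - ambiguity) / |C|. The slides do not state this explicitly; the sketch below checks the assumption against the reported figures, taking the category counts from the corpus table and assuming that SGREC has the 6 categories listed there for GREC.

```python
# Number of semantic categories per dataset, from the corpus table.
# Assumption: SGREC shares the 6 categories listed for GREC.
categories = {"EPI": 17, "ID": 16, "GE": 11, "SSC": 4, "NLPBA": 5, "SGREC": 6}

# (ambiguity, reported reduction in %) as given in the results table.
mean_point = {"EPI": (1.8, 89.4), "ID": (2.9, 81.9), "GE": (2.1, 80.9),
              "SSC": (2.0, 50.0), "NLPBA": (1.8, 64.0), "SGREC": (2.4, 60.0)}
final_point = {"EPI": (1.3, 92.4), "ID": (1.9, 88.1), "GE": (1.7, 84.5),
               "SSC": (1.7, 57.5), "NLPBA": (1.6, 68.0), "SGREC": (2.0, 66.7)}


def reduction(amb, n_categories):
    """Relative reduction in % versus suggesting all categories."""
    return 100.0 * (n_categories - amb) / n_categories


def max_deviation(table):
    """Largest absolute gap between computed and reported reduction."""
    return max(abs(reduction(amb, categories[name]) - reported)
               for name, (amb, reported) in table.items())


print(max_deviation(mean_point), max_deviation(final_point))
```

Under this reading, every reported percentage agrees with the computed value to within rounding.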

Discussion and Conclusions

We have shown:
- Ambiguity drops quickly to a manageable level
- Recall remains very high across the board

Future Work

Co-reference resolution
- Additional non-lexical information
Generalising search query suggestions
Problems with long textual spans, noun-phrases
Integrate into existing annotation tool(s)
- Can we expect speed-ups?
- Problem: can only affect category selection
- Does the level of recall and ambiguity satisfy a human annotator?

“But Wait, There is More!”™

Integrated into a state-of-the-art annotation tool:
- http://goo.gl/tlVyw
- Username: e Password: e
- Please use Chrome or Safari
- I’ll be very happy to show you a demo of it during the break
- Significantly reduces annotation time

Thank You for Your Attention

In the spirit of internationality:
- 日本語：ご清聴ありがとうございました (Japanese: Thank you for listening)
- Svenska: Tack för er uppmärksamhet (Swedish: Thank you for your attention)
About SimSem:
- Open-source and freely available
- Resources, extended experiments and code: http://github.com/ninjin/simsem
About the author:
- Website (for paper, slides and poster): http://pontus.stenetorp.se/
- E-mail: <pontus stenetorp se>
- If you have any questions or are pursuing similar research I’d love to talk to you
