| |
Online Demonstrations Based on the Dragon ToolKit
|
| Semantic Profiling of
Human Tags@CiteULike (2007/12) |
| Social bookmarking has become a popular service
on the Internet. The co-occurrence of a human-assigned tag and the text
it annotates gives us a way to explore the meaning of the tag. Such semantic
knowledge would be useful for understanding user information needs
in IR. Here is a demo based on the CiteULike dataset from Nov.
2004 to Feb. 2007. |
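The co-occurrence idea above can be illustrated with a minimal sketch. It assumes each bookmark is available as a (tags, text) pair and profiles a tag by counting the words of the texts it annotates. The function names and the toy posts are hypothetical and are not taken from the demo or from the Dragon ToolKit.

```python
from collections import Counter, defaultdict


def build_tag_profiles(posts):
    """Map each tag to a Counter of words that co-occur with it.

    posts: iterable of (tags, text) pairs (an assumed input format).
    Each word is counted once per post, so a count is the number of
    posts in which the tag and the word appear together.
    """
    profiles = defaultdict(Counter)
    for tags, text in posts:
        words = set(text.lower().split())
        for tag in tags:
            profiles[tag].update(words)
    return profiles


def top_words(profiles, tag, k=3):
    """Return the k words that co-occur most often with the tag."""
    return [word for word, _ in profiles[tag].most_common(k)]


# Toy data, invented for illustration only.
posts = [
    (["ir"], "query retrieval model"),
    (["ir", "ml"], "retrieval ranking"),
    (["ml"], "neural ranking"),
]
profiles = build_tag_profiles(posts)
```

A real profile would likely need stop-word removal and some weighting scheme rather than raw counts; the sketch leaves both out.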
| Web-based Factoid Question
Answering (2007/10) |
| The current version can answer entity,
number, and abbreviation-related factoid questions (description and
definition questions are not yet supported). We developed a novel scoring
algorithm that takes into account the context of the candidate answers
and significantly improves accuracy. In general, a question
beginning with a question word (e.g., how, when, who, where, what,
or which) yields higher accuracy, but a question word is not required. |
| Collective Wisdom based
Entity Classification (2007/09) |
| Assigning a semantic category
to an entity (concept) is a challenging problem for machines. Traditional
approaches extract features from either surface forms or local contexts
(surrounding text) and then apply machine learning methods or human-coded
rules for entity classification. Such classifiers usually require
a large number of training examples, domain-specific tuning, and
even human-created ontologies (dictionaries). Instead, this tool
uses the wisdom of crowds for entity classification. It builds
the semantic context for each entity through web search engines
such as Google. The top-ranked documents returned by a search engine
give a sense of what people think of the entity. The new approach
is simple and robust: it needs no tuning and no external dictionaries,
is applicable to any domain, and, most importantly, achieves good accuracy. Click
here to see a demo. |
| The Dragon Pinyin
(2007/04) |
| Because the one-to-many mapping from pinyin syllables to Chinese characters
makes character-level input inefficient, an ideal solution
should support phrase-level or even sentence-level input and
automatically disambiguate pinyin according to the context.
A first-order hidden Markov model (HMM) is one such solution, but it
only captures the dependency on the preceding character. Higher-order
Markov models can achieve higher accuracy but are computationally
too expensive for an average PC. We propose a segment-based
hidden Markov model (SHMM), which has the same order of complexity
as a first-order HMM but yields higher decoding accuracy. |
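For reference, the first-order HMM baseline described above can be sketched with standard Viterbi decoding. This is not the SHMM, whose details are not given here. The sketch assumes that every character listed under a syllable emits that syllable with probability 1, and that unseen transitions receive a small floor probability. The function name, the parameters, and all probabilities are illustrative assumptions.

```python
import math


def viterbi_pinyin(syllables, candidates, start_p, trans_p, floor=1e-6):
    """Decode a pinyin sequence into characters with a first-order HMM.

    candidates: syllable -> list of candidate characters (assumed lexicon).
    start_p:    character -> probability of starting a sentence.
    trans_p:    (previous character, character) -> transition probability.
    floor:      assumed probability for events missing from the tables.
    """
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {
        c: (math.log(start_p.get(c, floor)), [c])
        for c in candidates[syllables[0]]
    }
    for syllable in syllables[1:]:
        nxt = {}
        for c in candidates[syllable]:
            # Pick the best predecessor for character c.
            score, path = max(
                (
                    (s + math.log(trans_p.get((prev, c), floor)), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            nxt[c] = (score, path + [c])
        best = nxt
    _, path = max(best.values(), key=lambda item: item[0])
    return "".join(path)


# Toy model with made-up probabilities, for illustration only.
candidates = {"zhong": ["中", "种"], "guo": ["国", "果"]}
start_p = {"中": 0.6, "种": 0.4}
trans_p = {("中", "国"): 0.9, ("种", "果"): 0.3}
decoded = viterbi_pinyin(["zhong", "guo"], candidates, start_p, trans_p)
```

Each step compares every candidate with every candidate of the previous syllable only, which is why the first-order model is cheap. Conditioning on more preceding characters multiplies the number of states to track, which is the cost the paragraph above refers to.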
|
Ontology-based Biomedical Text Annotation
(2006/04)
|
| Dictionary-based biological concept extraction is still the state-of-the-art
approach to large-scale biomedical literature annotation and indexing.
Exact dictionary lookup is a very simple approach, but it typically
achieves low extraction recall because a biological term often has
many variants and no dictionary can collect all of
them. We propose a generic extraction approach, referred to as
approximate dictionary lookup, to cope with term variations and
implement it as an extraction system called MaxMatcher. The basic
idea of this approach is to capture the significant words, rather than
all words, of a particular concept. The new approach dramatically
improves extraction recall while maintaining precision.
|
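The idea of matching on significant words rather than all words can be illustrated with a simplified sketch. It is not MaxMatcher's actual implementation. It assumes a word is more significant to a concept the fewer concepts mention it, and it accepts a concept once the summed significance of the words found in the text reaches a threshold. The weighting formula, the threshold value, and the toy dictionary are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(dictionary):
    """Build word -> {concept id: significance} from concept name variants.

    dictionary: concept id -> list of name variants (assumed format).
    Within a variant, each word gets a share of 1.0 that is proportional
    to 1 / (number of concepts mentioning the word). A word keeps its
    best share across all variants of the same concept.
    """
    concepts_with = defaultdict(set)
    for cid, variants in dictionary.items():
        for variant in variants:
            for word in variant.lower().split():
                concepts_with[word].add(cid)

    index = defaultdict(dict)
    for cid, variants in dictionary.items():
        for variant in variants:
            words = variant.lower().split()
            total = sum(1.0 / len(concepts_with[w]) for w in words)
            for w in words:
                share = (1.0 / len(concepts_with[w])) / total
                index[w][cid] = max(index[w].get(cid, 0.0), share)
    return index


def match(tokens, index, threshold=0.6):
    """Return concepts whose summed word significance reaches the threshold."""
    scores = defaultdict(float)
    for word in {t.lower() for t in tokens}:
        for cid, significance in index.get(word, {}).items():
            scores[cid] += significance
    return {cid: s for cid, s in scores.items() if s >= threshold}


# Toy dictionary with hypothetical concept ids, for illustration only.
dictionary = {
    "C1": ["gyrb protein", "gyrase subunit b"],
    "C2": ["heat shock protein"],
}
index = build_index(dictionary)
```

With this toy dictionary, `match(["gyrb"], index)` accepts C1 even though the text lacks the word "protein", while `match(["protein"], index)` accepts nothing because "protein" is shared across concepts and carries little weight. A full extractor would also need to bound the text window in which the words of one concept may appear.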
| |
| |