The Dragon Toolkit

Online Demonstrations Based on the Dragon Toolkit

Semantic Profiling of Human Tags@CiteULike (2007/12)
Social bookmarking has become a popular service on the Internet. The co-occurrence of a human-assigned tag and the text it annotates gives us a way to explore the meaning of the tag. Such semantic knowledge would be useful for understanding user information needs in IR. Here is a demo based on the CiteULike dataset from Nov. 2004 to Feb. 2007.
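
To make the co-occurrence idea concrete, here is a minimal sketch that builds a word-frequency profile for each tag from (tag, text) pairs. It is an illustration only, not the demo's actual method: the function name `build_tag_profiles`, the whitespace tokenization, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict

def build_tag_profiles(annotations):
    """Map each tag to the relative frequency of words in the texts it annotates.

    annotations: iterable of (tag, text) pairs (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for tag, text in annotations:
        # Naive tokenization: lowercase and split on whitespace.
        counts[tag].update(text.lower().split())

    profiles = {}
    for tag, counter in counts.items():
        total = sum(counter.values())
        profiles[tag] = {word: n / total for word, n in counter.items()}
    return profiles

# Toy data, invented for illustration (not taken from CiteULike).
toy = [
    ("clustering", "spectral clustering of graph data"),
    ("clustering", "document clustering with k-means"),
    ("retrieval", "language models for document retrieval"),
]
profiles = build_tag_profiles(toy)
top_words = sorted(profiles["clustering"], key=profiles["clustering"].get, reverse=True)
print(top_words[:3])
```

A real profile would likely need stop-word removal and some form of smoothing or background-model correction, which this sketch leaves out.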
Web-based Factoid Question Answering (2007/10)
The current version can answer entity, number, and abbreviation-related factoid questions (description and definition questions are not yet supported). We developed a novel scoring algorithm that takes into account the context of the candidate answers and significantly improves accuracy. In general, a question beginning with a question word (e.g., how, when, who, where, what, or which) yields higher accuracy, but this is not required.
Collective Wisdom based Entity Classification (2007/09)
Assigning a semantic category to an entity (concept) is a challenging problem for machines. Traditional approaches extract features from either surface forms or local contexts (surrounding text) and then apply machine learning methods or human-coded rules for entity classification. Such classifiers usually require a large number of training examples, domain-specific tuning, and even human-created ontologies (dictionaries). Instead, this tool utilizes the wisdom of crowds for entity classification. It builds the semantic context for each entity through web search engines such as Google. The top-ranked documents returned by a search engine give a sense of what people think of this entity. The new approach is simple, robust, and powerful: no tuning, no external dictionaries, applicable to any domain, and, most importantly, good accuracy. Click here to see a demo.
The Dragon Pinyin (2007/04)
Because the one-to-many mapping from pinyin syllables to Chinese characters makes character-level input inefficient, an ideal solution should support phrase-level or even sentence-level input and automatically disambiguate pinyin according to the context. A first-order hidden Markov model (HMM) is one such solution, but it captures only the dependency on the preceding character. Higher-order Markov models can bring higher accuracy, but are computationally unaffordable on an average PC. We propose a segment-based hidden Markov model (SHMM), which has the same order of complexity as a first-order HMM but achieves higher decoding accuracy.
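
For readers unfamiliar with the baseline, the following is a minimal sketch of Viterbi decoding with a first-order HMM, where the hidden states are characters and the observations are pinyin syllables. It shows only the first-order baseline, not the SHMM. The function name, the toy probabilities, and the simplification that every candidate character emits its syllable with probability 1 are assumptions made for illustration.

```python
import math

def viterbi(syllables, candidates, start_p, trans_p, floor=1e-9):
    """Return the most likely character string for a pinyin syllable sequence.

    candidates: syllable -> list of characters pronounced that way (assumed
                emission probability 1, so polyphonic characters are ignored).
    start_p:    character -> probability of starting a sequence.
    trans_p:    (previous character, character) -> transition probability.
    """
    def logp(p):
        # Floor unseen events so that log() never receives zero.
        return math.log(max(p, floor))

    # layers[i][ch] = (best log-probability of a path ending in ch, previous character)
    layers = [{ch: (logp(start_p.get(ch, 0.0)), None) for ch in candidates[syllables[0]]}]
    for syl in syllables[1:]:
        prev_layer = layers[-1]
        layer = {}
        for ch in candidates[syl]:
            layer[ch] = max(
                (score + logp(trans_p.get((prev, ch), 0.0)), prev)
                for prev, (score, _) in prev_layer.items()
            )
        layers.append(layer)

    # Backtrack from the best final character.
    ch = max(layers[-1], key=lambda c: layers[-1][c][0])
    path = [ch]
    for layer in reversed(layers[1:]):
        ch = layer[ch][1]
        path.append(ch)
    return "".join(reversed(path))

# Toy model with invented probabilities.
candidates = {"zhong": ["中", "种"], "guo": ["国", "果"]}
start_p = {"中": 0.5, "种": 0.5}
trans_p = {("中", "国"): 0.8, ("中", "果"): 0.05, ("种", "国"): 0.01, ("种", "果"): 0.1}
result = viterbi(["zhong", "guo"], candidates, start_p, trans_p)
print(result)
```

With these toy numbers the decoder selects 中国 for "zhong guo". Each step looks back only one character, which is the limitation the paragraph above describes.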

Ontology-based Biomedical Text Annotation (2006/04)

Dictionary-based biological concept extraction is still the state-of-the-art approach to large-scale biomedical literature annotation and indexing. Exact dictionary lookup is a very simple approach, but it suffers from low extraction recall because a biological term often has many variants and a dictionary cannot collect all of them. We propose a generic extraction approach, referred to as approximate dictionary lookup, to cope with term variations, and implement it as an extraction system called MaxMatcher. The basic idea of this approach is to capture the words that are significant to a particular concept rather than all of its words. The new approach dramatically improves extraction recall while maintaining precision.
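
The idea of matching on significant words can be sketched as follows. This is a simplified illustration under stated assumptions, not MaxMatcher's actual scoring: the weighting scheme (a word counts more for a concept when fewer concepts share it), the threshold value, the function names, and the toy dictionary are all hypothetical.

```python
from collections import defaultdict

def build_weights(dictionary):
    """dictionary: concept id -> concept name (hypothetical input format).

    Returns concept id -> {word: normalized significance weight}.
    """
    word_concepts = defaultdict(set)
    for cid, name in dictionary.items():
        for word in name.lower().split():
            word_concepts[word].add(cid)

    weights = {}
    for cid, name in dictionary.items():
        # Assumed heuristic: a word shared by fewer concepts is more significant.
        raw = {w: 1.0 / len(word_concepts[w]) for w in set(name.lower().split())}
        total = sum(raw.values())
        weights[cid] = {w: v / total for w, v in raw.items()}
    return weights

def approximate_lookup(text, weights, threshold=0.6):
    """Return concepts whose significant words found in the text reach the threshold."""
    tokens = set(text.lower().split())
    hits = {}
    for cid, word_weights in weights.items():
        score = sum(v for w, v in word_weights.items() if w in tokens)
        if score >= threshold:
            hits[cid] = score
    return hits

# Toy dictionary with invented concept ids.
dictionary = {"C1": "gyrb protein", "C2": "p53 protein"}
weights = build_weights(dictionary)
hits = approximate_lookup("expression of the gyrb gene product", weights)
print(hits)
```

Here the exact string "gyrb protein" never appears in the text, so exact lookup would miss it, yet the distinctive word "gyrb" alone is enough to match C1. The sketch treats the whole text as a bag of words; a practical extractor would also have to decide where a candidate term starts and ends.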