The Dragon Toolkit

The Dragon Toolkit (Version 1.3.3, 2008/01/06)

Designed for Text Retrieval and Text Mining
The Dragon Toolkit is a Java-based development package for academic use in information retrieval (IR) and text mining (TM), including text classification, text clustering, text summarization, and topic modeling. It is tailored to researchers who work on large-scale IR and TM and prefer to program in Java. Unlike Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The toolkit integrates a set of NLP tools that enable it to index text collections under various representation schemes, including words, phrases, and ontology-based concepts and relationships. However, to minimize learning time, we intentionally keep the package small and simple. The toolkit does not include some features, such as distributed IR and cross-language IR, that are part of the Lemur toolkit.

Another important feature of the toolkit is its scalability. Unlike many text mining tools such as Weka, the Dragon Toolkit is designed specifically for large-scale applications. It implements text representations as sparse matrices and does not need to load all data into memory at run time. It can therefore handle hundreds of thousands of documents with very limited memory.
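
To illustrate why a sparse representation saves memory, here is a minimal sketch of a document-term matrix that stores only the non-zero cells of each row. The class name SparseDocTermMatrix and its methods are hypothetical and are not part of the Dragon Toolkit API. The sketch also keeps every row in memory, so it shows only the sparse-storage idea, not the toolkit's ability to avoid loading all data at run time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not a Dragon Toolkit class. */
final class SparseDocTermMatrix {
    // One row per document: term ids in ascending order, with parallel counts.
    private final List<int[]> termIds = new ArrayList<>();
    private final List<int[]> counts = new ArrayList<>();

    /** Adds a document given as a sequence of term ids; returns its row index. */
    int addDocument(int[] tokens) {
        // TreeMap keeps term ids sorted, so each row can be binary-searched later.
        TreeMap<Integer, Integer> freq = new TreeMap<>();
        for (int t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        int[] ids = new int[freq.size()];
        int[] cnt = new int[freq.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
            ids[i] = e.getKey();
            cnt[i] = e.getValue();
            i++;
        }
        termIds.add(ids);
        counts.add(cnt);
        return termIds.size() - 1;
    }

    /** Returns the count of a term in a document, or 0 if the cell is not stored. */
    int get(int doc, int term) {
        int pos = Arrays.binarySearch(termIds.get(doc), term);
        return pos >= 0 ? counts.get(doc)[pos] : 0;
    }

    /** Number of cells actually stored, i.e. the non-zero entries. */
    int storedCells() {
        int total = 0;
        for (int[] row : termIds) {
            total += row.length;
        }
        return total;
    }
}
```

With a vocabulary of tens of thousands of terms, a dense row would reserve one cell per term for every document, whereas each row here grows only with the number of distinct terms that the document actually contains.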

A Full List of Functional Features

The Personnel

Supervisor of the Dragon Project:
Xiaohua (Tony) Hu

System Design and Development:
Xiaohua (Davis) Zhou
Xiaodan (Tom) Zhang

How to Cite Dragon Toolkit

If you are using the Dragon Toolkit for research work, please cite it in your published papers:

Zhou, X., Zhang, X., and Hu, X., "Dragon Toolkit: Incorporating Auto-learned Semantic Knowledge into Large-Scale Text Retrieval and Mining," in Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), October 29-31, 2007, Patras, Greece.