The Dragon Toolkit

Example: Text Classification

The toolkit provides a well-defined framework for text classification. Every classifier implements the interface called Classifier. Because feature selection strongly affects classification quality, it is defined by a separate interface called Feature Selector and is decoupled from the classification algorithm, so different feature selection methods can be tested without changing the classifier. The current version implements a Naive Bayes (NB) classifier with several smoothing methods, an SVM classifier, and the Nigam active learning classifier.
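
The separation described above can be illustrated with a minimal Java sketch. Everything in it is hypothetical and is not the toolkit's API: the interfaces SketchFeatureSelector and SketchClassifier, the Doc type, the document-frequency selector, and the add-one-smoothed Naive Bayes are stand-ins chosen only to show that a selector and a classifier can be swapped independently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types belong to the Dragon Toolkit.
final class PipelineSketch {
    /** A labeled training document represented as a list of tokens. */
    static final class Doc {
        final String label;
        final List<String> tokens;
        Doc(String label, String... tokens) {
            this.label = label;
            this.tokens = Arrays.asList(tokens);
        }
    }

    /** Chooses which terms the classifier is allowed to use. */
    interface SketchFeatureSelector {
        Set<String> select(List<Doc> training);
    }

    /** Learns from documents restricted to the selected features. */
    interface SketchClassifier {
        void train(List<Doc> training, Set<String> features);
        String classify(List<String> tokens);
    }

    /** Keeps terms that occur in at least minDocs training documents. */
    static final class DocFrequencySelector implements SketchFeatureSelector {
        private final int minDocs;
        DocFrequencySelector(int minDocs) { this.minDocs = minDocs; }

        public Set<String> select(List<Doc> training) {
            Map<String, Integer> docFreq = new HashMap<>();
            for (Doc d : training) {
                for (String t : new HashSet<>(d.tokens)) docFreq.merge(t, 1, Integer::sum);
            }
            Set<String> kept = new HashSet<>();
            for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
                if (e.getValue() >= minDocs) kept.add(e.getKey());
            }
            return kept;
        }
    }

    /** Multinomial Naive Bayes with add-one (Laplace) smoothing. */
    static final class LaplaceNaiveBayes implements SketchClassifier {
        private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
        private final Map<String, Integer> totalTerms = new HashMap<>();
        private final Map<String, Integer> docCounts = new HashMap<>();
        private Set<String> features = new HashSet<>();
        private int numDocs;

        public void train(List<Doc> training, Set<String> features) {
            this.features = features;
            for (Doc d : training) {
                numDocs++;
                docCounts.merge(d.label, 1, Integer::sum);
                Map<String, Integer> counts = termCounts.computeIfAbsent(d.label, k -> new HashMap<>());
                for (String t : d.tokens) {
                    if (!features.contains(t)) continue; // ignore unselected terms
                    counts.merge(t, 1, Integer::sum);
                    totalTerms.merge(d.label, 1, Integer::sum);
                }
            }
        }

        public String classify(List<String> tokens) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                // log prior plus the sum of smoothed log likelihoods
                double score = Math.log((double) docCounts.get(label) / numDocs);
                int total = totalTerms.getOrDefault(label, 0);
                for (String t : tokens) {
                    if (!features.contains(t)) continue;
                    int c = termCounts.get(label).getOrDefault(t, 0);
                    score += Math.log((c + 1.0) / (total + features.size()));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }

    /** Wires any selector to any classifier. */
    static String run(SketchFeatureSelector selector, SketchClassifier classifier,
                      List<Doc> training, List<String> query) {
        classifier.train(training, selector.select(training));
        return classifier.classify(query);
    }

    public static void main(String[] args) {
        List<Doc> training = Arrays.asList(
            new Doc("sports", "game", "team", "score", "the"),
            new Doc("sports", "team", "win", "the"),
            new Doc("tech", "code", "java", "the"),
            new Doc("tech", "java", "compiler", "the"));
        String label = run(new DocFrequencySelector(2), new LaplaceNaiveBayes(),
                           training, Arrays.asList("java", "team", "java"));
        System.out.println(label); // prints "tech"
    }
}
```

Replacing DocFrequencySelector with another SketchFeatureSelector implementation requires no change to LaplaceNaiveBayes, which is the property the framework's design aims for.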
Data Package

The data package is the same as the one used by the text clustering example.

Dependency
Please download the external jar library for Hepple Tagger and external NLP data for POS Tagger, English Lemmatiser, and Examples before running this example.
Run the example
    Indexing
  1. Run the command below to do word-based indexing (about 3 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.IndexAppConfig indexclustering/Newsgroup/indexcfg.xml 1
  2. Run the command below to generate the word-word co-occurrence matrix
    java -mx1500000000 -oss15000000000 dragon.config.CooccurrenceAppConfig indexclustering/Newsgroup/indexcfg.xml 1
  3. Run the command below to do word-word semantic mapping
    java -mx1000000000 -oss10000000000 dragon.config.TranslationAppConfig indexclustering/Newsgroup/indexcfg.xml 2
  4. Run the command below to do phrase-based indexing (about 8 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.IndexAppConfig indexclustering/Newsgroup/indexcfg.xml 2
  5. Run the command below to do phrase-word semantic mapping (about 4 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.TranslationAppConfig indexclustering/Newsgroup/indexcfg.xml 1

    Text Classification Evaluation (1% of documents used for training)
  1. Evaluate the NB Classifier with Chi Feature Selector
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 1
  2. Evaluate the Bayesian Classifier with background smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 2
  3. Evaluate the Bayesian Classifier with context-sensitive semantic smoothing (CSSS)
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 3
  4. Evaluate the Bayesian Classifier with context-insensitive semantic smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 4
  5. Evaluate the Nigam active learning classifier with Chi Feature Selector
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 5
  6. Evaluate the SVM Classifier
    java -mx1000000000 -oss10000000000 dragon.config.ClassificationEvaAppConfig indexclustering/Newsgroup/classcfg.xml 6
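
Two of the evaluation runs above use a Chi Feature Selector. As a rough illustration of what such a selector scores, here is a minimal Java sketch of the chi-square statistic for one term and one class, computed from a 2x2 contingency table of document counts as commonly defined in the feature selection literature. The class name ChiSquareDemo and the method chiSquare are hypothetical, and the toolkit's own implementation may differ in details such as how per-class scores are combined.

```java
// Hypothetical sketch; not the Dragon Toolkit's Chi Feature Selector.
final class ChiSquareDemo {
    /**
     * Chi-square statistic of a term with respect to a class.
     * a: documents in the class that contain the term
     * b: documents outside the class that contain the term
     * c: documents in the class that lack the term
     * d: documents outside the class that lack the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0.0) {
            return 0.0; // an empty row or column carries no evidence
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    public static void main(String[] args) {
        // The term appears mostly inside the class: strong association.
        System.out.println(chiSquare(40, 10, 10, 40)); // prints 36.0
        // The term is spread evenly: independent of the class.
        System.out.println(chiSquare(25, 25, 25, 25)); // prints 0.0
    }
}
```

A selector built on this score would rank terms by their chi-square values and keep the highest-scoring ones as features.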
References
  1. Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420.