The Dragon Toolkit

Example: Text Clustering

The current version supports Basic K-Means, Bisecting K-Means, and Agglomerative Clustering. The model-based K-Means clustering supports three smoothing methods: Laplacian, background, and semantic smoothing.
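For orientation, the three smoothing methods are often written in the forms below. This is only a sketch in common textbook notation; the toolkit's exact parameterization is not stated on this page. The symbols are assumptions: c(w, c_j) is the count of word w in cluster c_j, V is the vocabulary, p(w | C) is a background model estimated from the whole collection, p_ml is the maximum-likelihood estimate, t_k ranges over the terms or phrases used for semantic mapping, and beta and lambda are interpolation coefficients.

```latex
\begin{align*}
% Laplacian (add-one) smoothing
p_{l}(w \mid c_j) &= \frac{1 + c(w, c_j)}{|V| + \sum_{w' \in V} c(w', c_j)} \\
% Background smoothing: interpolate with the collection model
p_{b}(w \mid c_j) &= (1 - \beta)\, p_{ml}(w \mid c_j) + \beta\, p(w \mid C) \\
% Semantic smoothing: add a translation component over mapped terms t_k
p_{s}(w \mid c_j) &= (1 - \lambda)\, p_{b}(w \mid c_j)
                     + \lambda \sum_{k} p(w \mid t_k)\, p_{ml}(t_k \mid c_j)
\end{align*}
```

Under this reading, the word-word and phrase-word semantic mappings built during indexing would presumably supply p(w | t_k).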
Data Package

Download the 20-Newsgroup collection and the configuration files for the example. Decompress the two packages and place the files under the working directory with the following structure:
workdir/indexclustering/Newsgroup/20ng.collection
workdir/indexclustering/Newsgroup/20ng.index
workdir/indexclustering/Newsgroup/20ng.vob
workdir/indexclustering/Newsgroup/indexcfg.xml
workdir/indexclustering/Newsgroup/clustercfg.xml
workdir/indexclustering/Newsgroup/classcfg.xml
workdir/indexclustering/Newsgroup/experiment/answerkeys_sub1.list
workdir/indexclustering/Newsgroup/experiment/answerkeys_sub2.list
workdir/indexclustering/Newsgroup/experiment/answerkeys_sub3.list
workdir/indexclustering/Newsgroup/experiment/answerkeys_sub4.list
workdir/indexclustering/Newsgroup/experiment/answerkeys_sub5.list
workdir/indexclustering/Newsgroup/experiment/answerkeys_large.list
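A quick way to confirm the layout before running anything is to test for each file in turn. The sketch below assumes a POSIX shell; check_layout is a hypothetical helper written for this page, not part of the toolkit, and the file list is copied from the structure above.

```shell
#!/bin/sh
# check_layout is a hypothetical helper for this sketch, not part of the toolkit.
# Usage: check_layout WORKDIR
# Prints each expected file that is missing; succeeds only if none are missing.
check_layout() {
    base="$1/indexclustering/Newsgroup"
    missing=0
    for f in \
        20ng.collection 20ng.index 20ng.vob \
        indexcfg.xml clustercfg.xml classcfg.xml \
        experiment/answerkeys_sub1.list experiment/answerkeys_sub2.list \
        experiment/answerkeys_sub3.list experiment/answerkeys_sub4.list \
        experiment/answerkeys_sub5.list experiment/answerkeys_large.list
    do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $base/$f"
            missing=$((missing + 1))
        fi
    done
    # Exit status 0 only when every file was found.
    [ "$missing" -eq 0 ]
}
```

From inside the working directory, `check_layout . && echo "layout OK"` reports either the missing files or a confirmation.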

Dependency
Before running this example, please download the external jar library for the Hepple Tagger and the external NLP data for the POS Tagger, English Lemmatiser, and Examples.
Run the example
    Indexing
  1. Run the command below to do word-based indexing (about 3 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.IndexAppConfig indexclustering/Newsgroup/indexcfg.xml 1
  2. Run the command below to generate the TF.IDF matrix
    java -mx1000000000 -oss10000000000 dragon.config.DocRepresentationAppConfig indexclustering/Newsgroup/clustercfg.xml 1
  3. Run the command below to generate the normalized TF matrix
    java -mx1000000000 -oss10000000000 dragon.config.DocRepresentationAppConfig indexclustering/Newsgroup/clustercfg.xml 2
  4. Run the command below to generate the word-word co-occurrence matrix
    java -mx1500000000 -oss15000000000 dragon.config.CooccurrenceAppConfig indexclustering/Newsgroup/indexcfg.xml 1
  5. Run the command below to do word-word semantic mapping
    java -mx1000000000 -oss10000000000 dragon.config.TranslationAppConfig indexclustering/Newsgroup/indexcfg.xml 2
  6. Run the command below to do phrase-based indexing (about 8 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.IndexAppConfig indexclustering/Newsgroup/indexcfg.xml 2
  7. Run the command below to do phrase-word semantic mapping (about 4 minutes)
    java -mx1000000000 -oss10000000000 dragon.config.TranslationAppConfig indexclustering/Newsgroup/indexcfg.xml 1
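
The seven indexing steps can also be chained in a small script so that a failure stops the run. This is a minimal sketch, assuming a POSIX shell, that the toolkit classes are already on the Java classpath, and that each command exits with a nonzero status on failure (none of which this page confirms). run_step, index_all, and the DRY_RUN variable are hypothetical names introduced here; the JVM flags, class names, and arguments are copied from the steps above.

```shell
#!/bin/sh
# run_step, index_all, and DRY_RUN are hypothetical names for this sketch.
FLAGS="-mx1000000000 -oss10000000000"
FLAGS_COOC="-mx1500000000 -oss15000000000"   # step 4 uses larger values
CFG=indexclustering/Newsgroup/indexcfg.xml
REP=indexclustering/Newsgroup/clustercfg.xml

# Print a banner, then either echo the command (DRY_RUN set) or execute it.
run_step() {
    desc="$1"; shift
    echo "== $desc"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run all steps in order; the && chain stops at the first failing step.
index_all() {
    # $FLAGS is left unquoted on purpose so it splits into two arguments.
    run_step "1. word-based indexing" java $FLAGS dragon.config.IndexAppConfig "$CFG" 1 &&
    run_step "2. TF.IDF matrix" java $FLAGS dragon.config.DocRepresentationAppConfig "$REP" 1 &&
    run_step "3. normalized TF matrix" java $FLAGS dragon.config.DocRepresentationAppConfig "$REP" 2 &&
    run_step "4. word-word co-occurrence matrix" java $FLAGS_COOC dragon.config.CooccurrenceAppConfig "$CFG" 1 &&
    run_step "5. word-word semantic mapping" java $FLAGS dragon.config.TranslationAppConfig "$CFG" 2 &&
    run_step "6. phrase-based indexing" java $FLAGS dragon.config.IndexAppConfig "$CFG" 2 &&
    run_step "7. phrase-word semantic mapping" java $FLAGS dragon.config.TranslationAppConfig "$CFG" 1
}
```

Calling `DRY_RUN=1 index_all` from the working directory only prints the commands, which is a cheap way to review them; `index_all` on its own executes them.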

    Agglomerative Clustering Evaluation (answerkeys_sub1.list)
  1. Cosine Similarity with TF scheme
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 1
  2. Cosine Similarity with TF.IDF scheme
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 2

    Partitional Clustering Evaluation (answerkeys_large.list)
  1. Cosine Similarity with TF.IDF scheme
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 3
  2. Cosine Similarity with Normalized TF scheme
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 4
  3. Multinomial model with Laplacian smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 5
  4. Multinomial model with background smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 6
  5. Multinomial model with context-sensitive semantic smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 7
  6. Multinomial model with context-insensitive semantic smoothing
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml 8
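
The last argument passed to ClusteringEvaAppConfig (1 to 8) selects the experiment. The sketch below records that mapping in one place and runs an experiment by number. It assumes a POSIX shell and that the toolkit classes are on the Java classpath; option_label, run_eval, and DRY_RUN are hypothetical names introduced here, while the command line and the option meanings are taken from the commands above.

```shell
#!/bin/sh
# option_label, run_eval, and DRY_RUN are hypothetical names for this sketch.

# Map an option number to the experiment it selects.
option_label() {
    case "$1" in
        1) echo "agglomerative, cosine similarity, TF" ;;
        2) echo "agglomerative, cosine similarity, TF.IDF" ;;
        3) echo "partitional, cosine similarity, TF.IDF" ;;
        4) echo "partitional, cosine similarity, normalized TF" ;;
        5) echo "partitional, multinomial model, Laplacian smoothing" ;;
        6) echo "partitional, multinomial model, background smoothing" ;;
        7) echo "partitional, multinomial model, context-sensitive semantic smoothing" ;;
        8) echo "partitional, multinomial model, context-insensitive semantic smoothing" ;;
        *) echo "unknown option: $1" >&2; return 1 ;;
    esac
}

# Run one evaluation by option number; with DRY_RUN set, only print the command.
run_eval() {
    n="$1"
    label=$(option_label "$n") || return 1
    echo "== option $n: $label"
    if [ -n "${DRY_RUN:-}" ]; then
        echo java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml "$n"
    else
        java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/Newsgroup/clustercfg.xml "$n"
    fi
}
```

For instance, `for n in 3 4 5 6 7 8; do run_eval "$n" || break; done` would run the partitional experiments in order and stop at the first failure, assuming the evaluation command signals failure through its exit status.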