The Dragon Toolkit

Example: Link K-Means

This example uses both the local content of documents and the hyperlinks between them for document clustering. Link k-means improves clustering quality over standard k-means, which uses local content only.
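To make the idea of combining content and links concrete, here is a minimal sketch. It is not the Dragon Toolkit's implementation: the class name LinkKMeansSketch, the blending weight lambda, and the scoring rule (cosine similarity to the centroid blended with the fraction of linked neighbors already in that cluster) are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the toolkit's link k-means.
public class LinkKMeansSketch {

    // Cosine similarity of two dense vectors; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Fraction of a document's linked neighbors currently labeled with the given cluster.
    static double linkSupport(int[] neighbors, int[] labels, int cluster) {
        if (neighbors.length == 0) return 0;
        int hits = 0;
        for (int n : neighbors) if (labels[n] == cluster) hits++;
        return (double) hits / neighbors.length;
    }

    // One assignment pass: each document takes the cluster with the best blended score.
    // lambda = 0 reduces to content-only k-means with cosine similarity.
    static int[] assign(double[][] docs, int[][] links, double[][] centroids,
                        int[] labels, double lambda) {
        int[] next = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double score = (1 - lambda) * cosine(docs[d], centroids[c])
                        + lambda * linkSupport(links[d], labels, c);
                if (score > best) { best = score; next[d] = c; }
            }
        }
        return next;
    }

    // Recompute each centroid as the mean of its members; empty clusters keep the old centroid.
    static double[][] update(double[][] docs, int[] labels, double[][] old) {
        int k = old.length, dim = docs[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int d = 0; d < docs.length; d++) {
            counts[labels[d]]++;
            for (int i = 0; i < dim; i++) sums[labels[d]][i] += docs[d][i];
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) { sums[c] = old[c].clone(); continue; }
            for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
        }
        return sums;
    }

    static int[] cluster(double[][] docs, int[][] links, double[][] init,
                         double lambda, int maxIter) {
        double[][] centroids = init;
        // Bootstrap labels from content alone, since no neighbor labels exist yet.
        int[] labels = assign(docs, links, centroids, new int[docs.length], 0.0);
        for (int it = 0; it < maxIter; it++) {
            centroids = update(docs, labels, centroids);
            int[] next = assign(docs, links, centroids, labels, lambda);
            if (Arrays.equals(next, labels)) break; // converged
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy data: document 4 leans toward cluster 0 by content but links to documents 2 and 3.
        double[][] docs = {{1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9}, {0.6, 0.4}};
        int[][] links = {{1}, {0}, {3, 4}, {2, 4}, {2, 3}};
        double[][] init = {{1, 0}, {0, 1}};
        System.out.println(Arrays.toString(cluster(docs, links, init, 0.0, 10)));
        System.out.println(Arrays.toString(cluster(docs, links, init, 0.5, 10)));
    }
}
```

On this toy data, the content-only run (lambda = 0) leaves document 4 in cluster 0, while the blended run (lambda = 0.5) moves it to cluster 1, where both of its linked neighbors sit.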
Data Package

Download the Cora collection and the configuration files for this example. Decompress both packages and place the files under the working directory with the following structure:
workdir/indexclustering/cora/cora.collection
workdir/indexclustering/cora/coralink.collection
workdir/indexclustering/cora/indexcfg.xml
workdir/indexclustering/cora/clustercfg.xml
workdir/indexclustering/cora/experiment/answerkeys7.list
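
A quick way to confirm the layout above before running anything is to check that each file exists. The following is a small sketch, not part of the toolkit; the class name CoraLayoutCheck is hypothetical, and it assumes the working directory is passed as the first argument (defaulting to the current directory).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the Dragon Toolkit.
public class CoraLayoutCheck {

    // Paths relative to the working directory, copied from the layout listed above.
    static final String[] EXPECTED = {
        "indexclustering/cora/cora.collection",
        "indexclustering/cora/coralink.collection",
        "indexclustering/cora/indexcfg.xml",
        "indexclustering/cora/clustercfg.xml",
        "indexclustering/cora/experiment/answerkeys7.list"
    };

    // Returns the expected files that are not present as regular files under workdir.
    static List<String> missing(Path workdir) {
        List<String> result = new ArrayList<>();
        for (String rel : EXPECTED) {
            if (!Files.isRegularFile(workdir.resolve(rel))) result.add(rel);
        }
        return result;
    }

    public static void main(String[] args) {
        Path workdir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> absent = missing(workdir);
        if (absent.isEmpty()) {
            System.out.println("All expected files are in place.");
        } else {
            for (String rel : absent) System.out.println("Missing: " + rel);
        }
    }
}
```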

Dependency
Please download external NLP data for Examples before running this example.
Run the example
    Indexing
  1. Run the command below to build the word-based index:
    java -mx1000000000 -oss10000000000 dragon.config.IndexAppConfig indexclustering/cora/indexcfg.xml 1
  2. Run the command below to generate the TF.IDF matrix:
    java -mx1000000000 -oss10000000000 dragon.config.DocRepresentationAppConfig indexclustering/cora/clustercfg.xml 1
  3. Run the command below to generate the normalized TF matrix:
    java -mx1000000000 -oss10000000000 dragon.config.DocRepresentationAppConfig indexclustering/cora/clustercfg.xml 2
  4. Run the command below to index the linkages between documents:
    java -mx1000000000 -oss10000000000 dragon.config.IndexConvertAppConfig indexclustering/cora/indexcfg.xml 1

    Clustering Evaluation (answerkeys7.list)
  1. Cosine Similarity with TF.IDF scheme
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/cora/clustercfg.xml 1
  2. Link K-Means
    java -mx1000000000 -oss10000000000 dragon.config.ClusteringEvaAppConfig indexclustering/cora/clustercfg.xml 2
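
The commands above can also be run in sequence from a small driver. The sketch below is not part of the toolkit; the class name CoraPipelineRunner is hypothetical. It assumes that java is on the PATH, that the toolkit classes are reachable through the CLASSPATH environment variable, that it is started from the working directory, and that each step should only run if the previous one succeeded.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical driver; not part of the Dragon Toolkit.
public class CoraPipelineRunner {

    static final String INDEX_CFG = "indexclustering/cora/indexcfg.xml";
    static final String CLUSTER_CFG = "indexclustering/cora/clustercfg.xml";

    // Builds one command line; the JVM flags are copied verbatim from the commands above.
    static List<String> buildCommand(String mainClass, String config, int arg) {
        return Arrays.asList("java", "-mx1000000000", "-oss10000000000",
                mainClass, config, String.valueOf(arg));
    }

    // The six commands from this page, in the order they are listed.
    static List<List<String>> steps() {
        return Arrays.asList(
            buildCommand("dragon.config.IndexAppConfig", INDEX_CFG, 1),
            buildCommand("dragon.config.DocRepresentationAppConfig", CLUSTER_CFG, 1),
            buildCommand("dragon.config.DocRepresentationAppConfig", CLUSTER_CFG, 2),
            buildCommand("dragon.config.IndexConvertAppConfig", INDEX_CFG, 1),
            buildCommand("dragon.config.ClusteringEvaAppConfig", CLUSTER_CFG, 1),
            buildCommand("dragon.config.ClusteringEvaAppConfig", CLUSTER_CFG, 2));
    }

    public static void main(String[] args) throws Exception {
        for (List<String> cmd : steps()) {
            System.out.println("Running: " + String.join(" ", cmd));
            // inheritIO() forwards the child's output to this console.
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            if (exit != 0) {
                System.err.println("Step failed with exit code " + exit);
                System.exit(exit); // stop so later steps do not run on incomplete output
            }
        }
    }
}
```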
References
  1. Angelova, R. and Siersdorfer, S. A neighborhood-based approach for clustering of linked document collections. In CIKM'06.
  2. Chakrabarti, S., Dom, B. E., and Indyk, P. Enhanced hypertext categorization using hyperlinks. In SIGMOD'98, 307-318.