The Dragon Toolkit

Example: Xtract

We use a slightly modified version of Xtract [1] to extract multiword phrases from queries and documents. Xtract is designed to extract three types of collocations: predicative relations, rigid noun phrases, and phrasal templates. It first extracts significant bigrams using statistical techniques, then expands these 2-grams to n-grams, and finally applies syntactic constraints to the collocations. In the original version, two words form a bigram if and only if they co-occur within a sentence and their lexical distance is less than five words. Because we are interested only in rigid noun phrases, our implementation requires the first word to be an adjective or a noun and the second word to be a noun, and sets the distance threshold to four words.
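
The candidate-bigram step described above can be sketched roughly as follows. This is a minimal illustration, not the Dragon Toolkit implementation: the class `BigramCandidateSketch`, the `Token` type, and the coarse `Pos` enum are hypothetical names introduced here, and treating the distance threshold as inclusive (distance up to and including `maxDistance`) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: these names are hypothetical, not Dragon Toolkit API. */
final class BigramCandidateSketch {

    /** Coarse part-of-speech classes; a real tagger emits finer-grained tags. */
    enum Pos { ADJECTIVE, NOUN, OTHER }

    /** A word paired with its part-of-speech class. */
    static final class Token {
        final String word;
        final Pos pos;

        Token(String word, Pos pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /**
     * Collects candidate bigrams from one tagged sentence.
     * The first word must be an adjective or noun, the second a noun that
     * follows it at a distance of at most maxDistance (assumed inclusive).
     * Returns a map from "first second" to a histogram h, where h[d - 1]
     * counts co-occurrences at distance d.
     */
    static Map<String, int[]> collect(List<Token> sentence, int maxDistance) {
        Map<String, int[]> histograms = new LinkedHashMap<>();
        for (int i = 0; i < sentence.size(); i++) {
            Token first = sentence.get(i);
            if (first.pos != Pos.ADJECTIVE && first.pos != Pos.NOUN) {
                continue;
            }
            int last = Math.min(sentence.size() - 1, i + maxDistance);
            for (int j = i + 1; j <= last; j++) {
                Token second = sentence.get(j);
                if (second.pos != Pos.NOUN) {
                    continue;
                }
                String key = first.word + " " + second.word;
                histograms.computeIfAbsent(key, k -> new int[maxDistance])[j - i - 1]++;
            }
        }
        return histograms;
    }
}
```

For the tagged sentence "the large language model works" (other, adjective, noun, noun, other) and a maximum distance of 4, this sketch yields three candidates: "large language", "large model", and "language model". Accumulating the histograms over a whole corpus would give the per-distance counts that the statistical filters operate on.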

Xtract uses four parameters to control the quantity and quality of the extracted phrases: strength (k0), spread (U0), peak z-score (k1), and percentage frequency (T). In general, the larger these parameters are, the higher the quality and the smaller the number of phrases Xtract produces. Smadja recommended the setting (k0, k1, U0, T) = (1, 1, 10, 0.75) to achieve good results. In our experiment, we set the four parameters to (1, 1, 4, 0.75) after extensive tuning and testing. Xtract is an effective approach to phrase extraction, with an estimated precision of about 80%.
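
To make the roles of k0, U0, and k1 more concrete, the sketch below shows how the three bigram-stage statistics could be computed, following our reading of the definitions in Smadja [1]: strength is the z-score of a pair's frequency against all collocates of the same word, spread is the variance of the pair's per-distance counts, and a distance is a peak when its count is at least the mean count plus k1 times the square root of the spread. The class and method names are hypothetical and are not Dragon Toolkit API. The threshold T belongs to the later n-gram expansion stage and is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: these names are hypothetical, not Dragon Toolkit API. */
final class XtractStatsSketch {

    /** Strength: z-score of freq against the frequencies of all collocates of the same word. */
    static double strength(double freq, double meanFreq, double stdDevFreq) {
        // Assumes stdDevFreq > 0; a real implementation would guard against zero.
        return (freq - meanFreq) / stdDevFreq;
    }

    /** Mean of the per-distance counts. */
    static double mean(int[] histogram) {
        double total = 0.0;
        for (int count : histogram) {
            total += count;
        }
        return total / histogram.length;
    }

    /** Spread: variance of the per-distance counts. */
    static double spread(int[] histogram) {
        double mean = mean(histogram);
        double sum = 0.0;
        for (int count : histogram) {
            double diff = count - mean;
            sum += diff * diff;
        }
        return sum / histogram.length;
    }

    /** Returns the 1-based distances whose counts reach mean + k1 * sqrt(spread). */
    static List<Integer> peakDistances(int[] histogram, double k1) {
        double threshold = mean(histogram) + k1 * Math.sqrt(spread(histogram));
        List<Integer> peaks = new ArrayList<>();
        for (int d = 0; d < histogram.length; d++) {
            if (histogram[d] >= threshold) {
                peaks.add(d + 1);
            }
        }
        return peaks;
    }

    /** A bigram passes the first two filters when strength >= k0 and spread >= u0. */
    static boolean accept(double freq, double meanFreq, double stdDevFreq,
                          int[] histogram, double k0, double u0) {
        return strength(freq, meanFreq, stdDevFreq) >= k0 && spread(histogram) >= u0;
    }
}
```

As a worked case, a pair observed 8 times, always at distance 1 within a four-position window, has the histogram {8, 0, 0, 0}: the mean count is 2, the spread is (36 + 4 + 4 + 4) / 4 = 12, and distance 1 is the only peak for k1 = 1. A pair with the flat histogram {2, 2, 2, 2} has a spread of 0 and would be rejected for any positive U0, which matches the intuition that rigid phrases concentrate at one distance.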
Data Package

This example uses the same data package as the text clustering example.

Dependency
Before running this example, please download the external JAR library for the Hepple Tagger and the external NLP data for the POS tagger and English lemmatiser.
Run the example
  1. Build the multiword phrase dictionary for the 20-newsgroup corpus
    java -mx1000000000 -oss10000000000 dragon.config.PhraseExtractAppConfig indexclustering/Newsgroup/indexcfg.xml 1
References
  1. Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 1993, 19(1), pp. 143-177.
  2. Zhou, X., Hu, X., and Zhang, X. Topic signature language model for ad hoc retrieval. Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE).