We use a slightly modified version of Xtract to extract multiword
phrases in queries and documents. Xtract is designed to extract
three types of collocations: predicative relations, rigid noun phrases,
and phrasal templates. It first extracts significant bigrams
using statistical techniques, then expands the bigrams to n-grams,
and finally applies syntactic constraints to the collocations. In the original
version, two words are defined as a bigram if and only if they co-occur
within a sentence and their lexical distance is less than five words.
Because we are only interested in rigid noun phrases, our version
requires the first word to be an adjective or a noun and the second
word to be a noun, and sets the distance threshold to four words.
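
The candidate-bigram rule described above can be sketched in a few lines of Java. This is a minimal illustration, not the toolkit's code: the class name BigramCandidates, the Pair type, and the extract method are hypothetical, the tags are assumed to be Penn Treebank style (JJ* for adjectives, NN* for nouns), and the distance threshold is treated as an inclusive upper bound on the offset between the two words, which the text does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect candidate bigrams for rigid noun phrases
// from one POS-tagged sentence. Not part of the toolkit's API.
class BigramCandidates {

    // One candidate pair and the offset (in words) between its two members.
    static final class Pair {
        final String first;
        final String second;
        final int distance;

        Pair(String first, String second, int distance) {
            this.first = first;
            this.second = second;
            this.distance = distance;
        }

        @Override
        public String toString() {
            return first + " " + second + " (distance " + distance + ")";
        }
    }

    // Assumption: Penn Treebank tags, where noun tags start with "NN"
    // and adjective tags start with "JJ".
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    static boolean isAdjective(String tag) {
        return tag.startsWith("JJ");
    }

    // The first word must be an adjective or a noun, the second word must be
    // a noun, and the second word must follow the first within maxDistance
    // words of the same sentence.
    static List<Pair> extract(String[] words, String[] tags, int maxDistance) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (!isNoun(tags[i]) && !isAdjective(tags[i])) {
                continue;
            }
            for (int d = 1; d <= maxDistance && i + d < words.length; d++) {
                if (isNoun(tags[i + d])) {
                    pairs.add(new Pair(words[i], words[i + d], d));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] words = {"the", "statistical", "language", "model", "works"};
        String[] tags = {"DT", "JJ", "NN", "NN", "VBZ"};
        for (Pair p : extract(words, tags, 4)) {
            System.out.println(p);
        }
    }
}
```

The pairs collected this way are only candidates. Whether a pair is kept depends on the statistical filtering that follows.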
Xtract uses four parameters, strength (k0), spread (U0), peak z-score
(k1), and percentage frequency (T), to control the quantity and
quality of the extracted phrases. In general, the larger these parameters
are, the higher the quality but the smaller the number of phrases Xtract produces.
Smadja recommended the setting (k0, k1, U0, T) = (1, 1, 10, 0.75)
to achieve good results. In our experiment, we set the four parameters
to (1, 1, 4, 0.75) after extensive tuning and testing. Xtract is
an effective approach to phrase extraction; its estimated precision
is about 80%.
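
To make the roles of k0, U0, and k1 more concrete, here is a minimal Java sketch of the first-stage statistical filter as we understand it from Smadja's paper: strength is the z-score of a collocate's frequency among all collocates of the same word, spread is the variance of its counts over the relative positions, and a position is a peak when its count exceeds the mean by k1 standard deviations. The class and method names are hypothetical and are not the toolkit's API, and the formulas should be checked against the paper before being relied on. The parameter T applies to the later n-gram expansion stage and is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Xtract's first-stage filter. The formulas are
// assumptions based on our reading of Smadja (1993), not the toolkit's code.
class XtractStageOneFilter {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population variance (divides by the number of values).
    static double variance(double[] values) {
        double m = mean(values);
        double sum = 0.0;
        for (double v : values) {
            sum += (v - m) * (v - m);
        }
        return sum / values.length;
    }

    // Strength: z-score of one collocate's frequency relative to the
    // frequencies of all collocates of the same word.
    static double strength(double freq, double[] collocateFreqs) {
        double sd = Math.sqrt(variance(collocateFreqs));
        return sd == 0.0 ? 0.0 : (freq - mean(collocateFreqs)) / sd;
    }

    // Spread: variance of the pair's counts over the relative positions.
    // A flat histogram (spread near zero) suggests no preferred position.
    static double spread(double[] positionCounts) {
        return variance(positionCounts);
    }

    // Peak positions: indices whose count is at least k1 standard
    // deviations above the mean count of the histogram.
    static List<Integer> peaks(double[] positionCounts, double k1) {
        double threshold = mean(positionCounts)
                + k1 * Math.sqrt(spread(positionCounts));
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < positionCounts.length; j++) {
            if (positionCounts[j] >= threshold) {
                result.add(j);
            }
        }
        return result;
    }

    // A pair passes when both its strength and its spread reach the thresholds.
    static boolean accept(double freq, double[] collocateFreqs,
                          double[] positionCounts, double k0, double u0) {
        return strength(freq, collocateFreqs) >= k0
                && spread(positionCounts) >= u0;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not taken from any corpus.
        double[] collocateFreqs = {20, 2, 2, 2, 4};
        double[] positionCounts = {20, 0, 0, 0}; // offsets 1..4
        System.out.println(accept(20, collocateFreqs, positionCounts, 1.0, 4.0));
        System.out.println(peaks(positionCounts, 1.0));
    }
}
```

In this sketch the position histogram has four slots because only offsets 1 to 4 are considered for noun phrases. That is our assumption about how the distance threshold interacts with the spread computation, not something the text states.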
Same as the data package for the example of text
Please download the external jar library for the Hepple
Tagger and the external NLP data for the POS Tagger and English
Lemmatiser before running this example.
Run the example
- Build the multiword phrase dictionary for the 20-newsgroup corpus
java -mx1000000000 -oss10000000000 dragon.config.PhraseExtractAppConfig
- Smadja, F. Retrieving collocations from text: Xtract.
Computational Linguistics, 1993, 19(1), pp. 143-177
- Zhou X., Hu X., and Zhang X., "Topic Signature Language Model
for Ad hoc Retrieval," accepted by IEEE Transactions on Knowledge
and Data Engineering (TKDE)