TY - GEN
T1 - PhraseScope
T2 - 2021 SIAM International Conference on Data Mining, SDM 2021
AU - Anjum, Omer
AU - Almasri, Mohammad
AU - Xiong, Jinjun
AU - Hwu, Wen Mei
N1 - Publisher Copyright: © 2021 by SIAM.
PY - 2021
Y1 - 2021
AB - Phrase mining is a fundamental NLP task that can significantly affect the efficacy of many downstream applications. Many supervised and unsupervised phrase mining approaches have been proposed. Some rely on linguistic analyzers, and others are language agnostic. A daunting challenge in this task is to distinguish quality phrases from noise phrases, which tightly coexist with quality phrases across the entire frequency spectrum. Most existing approaches to phrase mining, however, rely on frequency-based statistics and hence suffer from quality loss. In this paper, we propose an unsupervised phrase mining framework, “PhraseScope”, which consists of a sequence of filters, namely cohesion, domain, and graph filters, to remove noise phrases. Each filter is responsible for removing noise phrases with particular characteristics. Collectively, our proposed filters are capable of detecting and removing noise phrases effectively while preserving quality phrases. Our results show significant improvement in both recall and precision over state-of-the-art frameworks when tested on datasets from three different domains.
UR - https://www.scopus.com/pages/publications/85120958377
M3 - Conference contribution
AN - SCOPUS:85120958377
T3 - SIAM International Conference on Data Mining, SDM 2021
SP - 639
EP - 647
BT - SIAM International Conference on Data Mining, SDM 2021
PB - Siam Society
Y2 - 29 April 2021 through 1 May 2021
ER -