
WordPPR: A Researcher-Driven Computational Keyword Selection Method for Text Data Retrieval from Digital Media

  • Alphabet Inc.
  • University of Connecticut
  • Harvard University

Research output: Contribution to journal (article, peer-reviewed)

4 Scopus citations

Abstract

Despite the increasing use of digital media data in communication research, a central challenge persists: retrieving data with maximal accuracy and coverage. Our investigation of keyword-based data collection practices in extant communication research reveals a one-step process, whereas our cross-disciplinary literature review suggests an iterative query expansion process guided by human knowledge and computer intelligence. Hence, we introduce the WordPPR method for keyword selection and text data retrieval, which entails four steps: 1) collecting an initial dataset using core/seed keyword(s); 2) constructing a word graph based on the dataset; 3) applying the Personalized PageRank (PPR) algorithm to rank words in proximity to the seed keyword(s) and selecting new keywords that optimize retrieval precision and recall; 4) repeating steps 1–3 to determine whether additional data collection is needed. Because it requires neither corpus-wide sampling/analysis nor extensive manual annotation, this method is well suited for data collection from large-scale digital media corpora. Our simulation studies demonstrate its robustness to parameter choice and its improvement over other methods in suggesting additional keywords. An application to Twitter data retrieval is also provided. By advancing a more systematic approach to text data retrieval, this study contributes to improving digital media data retrieval practices in communication research and beyond.
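Steps 2 and 3 of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the word graph is built, so the document-level co-occurrence weighting used here is an assumption, as are the function names (`build_word_graph`, `personalized_pagerank`, `suggest_keywords`), the teleport probability `alpha=0.15`, and the toy documents. The precision/recall-based selection of new keywords and the repeat-collection loop of step 4 are left to the researcher and are not modeled.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(docs):
    """Undirected word graph. Edge weight = number of documents in which
    two words co-occur (an assumed construction, not taken from the paper)."""
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, seeds, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank. With probability alpha the
    random walk jumps back to a seed word; otherwise it follows a weighted edge."""
    nodes = list(graph)
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        raise ValueError("no seed keyword occurs in the dataset")
    teleport = {n: 0.0 for n in nodes}
    for s in seeds:
        teleport[s] = 1.0 / len(seeds)
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {n: alpha * teleport[n] for n in nodes}
        for u in nodes:
            total = sum(graph[u].values())  # > 0: every node has a neighbor
            for v, w in graph[u].items():
                nxt[v] += (1 - alpha) * rank[u] * w / total
        rank = nxt
    return rank


def suggest_keywords(docs, seeds, k=3):
    """Return the k highest-ranked non-seed words as candidate keywords."""
    rank = personalized_pagerank(build_word_graph(docs), seeds)
    candidates = [(w, r) for w, r in rank.items() if w not in seeds]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [w for w, _ in candidates[:k]]


# Hypothetical initial dataset retrieved with the seed keyword "vaccine".
docs = [
    "vaccine mandate protest",
    "vaccine mandate school",
    "school lunch menu",
]
ranks = personalized_pagerank(build_word_graph(docs), ["vaccine"])
suggestions = suggest_keywords(docs, ["vaccine"], k=3)
```

In this toy example, words that co-occur directly with the seed (such as "mandate") receive more PPR mass than words two hops away (such as "menu"). In practice the ranked candidates would still be screened by the researcher before being used in a further round of data collection.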

Original language: English
Pages (from-to): 332-348
Number of pages: 17
Journal: Communication Methods and Measures
Volume: 18
Issue number: 4
State: Published - 2024
