Project Details
Description
Data-driven reasoning is one of the major factors propelling progress in science and engineering. In many practical applications, especially in biology and medicine, data-driven reasoning has been based on probabilistic graphical models, which are preferred for their accuracy in representing data and their ease of interpretation. In probabilistic graphical modeling, the modeled objects, for example the attributes of a patient stored in electronic health records, are represented as random variables, and the goal is to learn the dependencies between these variables. However, methods for learning high-quality probabilistic graphical models from data are computationally challenging and do not scale to the datasets emerging in modern applications. Therefore, this project aims to enable high-quality probabilistic graphical modeling of large datasets by using high performance computing techniques. To this end, the project introduces a framework of fundamental operations underlying probabilistic graphical modeling, including operations for managing data and coordinating computations, together with a software implementation that can efficiently leverage large-scale parallel computers. The framework is designed to benefit both practitioners interested in the analysis of large-scale data and researchers interested in the development of new learning algorithms. The validation and evaluation of the framework are based on the analysis of electronic health records, with the goal of early prognosis and diagnosis of patients with chronic obstructive pulmonary disease, both of which are vital for improving the quality and reducing the cost of healthcare. Furthermore, the framework provides the foundation to train the next generation of biomedical professionals in the use of data analytics on advanced cyberinfrastructure. Thus, the project is aligned with NSF's mission to promote the progress of science and to advance the national health, prosperity and welfare.
This project responds to the recognized and growing demand for scalable machine learning methods that can capitalize on parallel architectures such as large clusters of multi-core processors. The research focus is on exact structure learning of probabilistic graphical models, for example Bayesian networks and Markov random fields, in the context of biomedical data analytics. The project is based on two main components: a new high performance abstraction for managing data in machine learning applications, including memory-efficient strategies for answering counting queries on multi-core processors, and a new programming model for distributed memory systems to facilitate efficient exploration of large-scale combinatorial search spaces, such as those described by trees, lattices or graphs. These abstractions are used to realize a set of new parallel, exact algorithms for structure search and related problems, for example Markov blanket identification, that accelerate learning by exploiting various properties of the input data and the underlying search spaces. The research component is driven by a timely and socially relevant application in personalized and preventive medicine, enabled by a massive collection of actual electronic health records. The project aims to deliver multiple artifacts, including MPI- and OpenMP-based software, benchmark data and educational materials, all released as open source for use, further development, enhancement, and incorporation by the community. The research activities are tightly coupled with multiple educational efforts, spanning development of an interdisciplinary course for medical professionals to train them in the use of advanced cyberinfrastructure, engagement of undergraduate students and underrepresented minorities in research, and outreach to middle and high school students to attract them to STEM.
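To illustrate what a counting query is, the following minimal Python sketch counts how often each joint configuration of a chosen subset of variables occurs in a discrete dataset. Counts of this kind typically serve as the sufficient statistics when scoring candidate structures. The function name `count_query`, the toy records and the column meanings are hypothetical and are not taken from the project's software, which is described as MPI- and OpenMP-based; this serial version only shows the query itself, not a memory-efficient or parallel strategy.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def count_query(
    rows: Sequence[Sequence[Hashable]], variables: Sequence[int]
) -> Counter:
    """Count occurrences of each joint configuration of `variables`.

    `rows` is a discrete dataset (one record per row) and `variables`
    lists the column indices to project onto. Hypothetical helper for
    illustration only.
    """
    return Counter(tuple(row[v] for v in variables) for row in rows)


# Hypothetical toy records; columns are assumed to mean (smoker, cough, copd).
records: Tuple[Tuple[int, int, int], ...] = (
    (1, 1, 1),
    (1, 0, 0),
    (0, 0, 0),
    (1, 1, 1),
    (0, 1, 0),
)

# Joint counts over columns 0 and 2, i.e. (smoker, copd).
counts = count_query(records, variables=(0, 2))
for configuration, n in sorted(counts.items()):
    print(configuration, n)
```

Exact structure learning issues a very large number of such queries over different variable subsets, which is why the cost and memory footprint of answering them matter at scale.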
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
| Status | Finished |
|---|---|
| Effective start/end date | 02/15/19 → 01/31/26 |
Funding
- National Science Foundation: $519,569.00