Project Details
Description
Data is messy. Fortunately, with minimal human intervention, good data cleaning heuristics produce mostly reliable, usually actionable information from big, messy data. For instance, analysts might automate their curation workflows by using classifiers to predict missing attribute values, or by using an entity resolver to find and merge duplicate records. Unfortunately, heuristics are also dangerous, as the result of heuristic curation is often taken as fact. Serious mistakes, like people being denied a loan due to someone else's bad credit, 12-year-olds being identified as terrorists, or billion-dollar investment errors, often result when low-confidence or uncertain heuristic inferences are treated as truth. Many principled tools like probabilistic databases already exist for automatically tracking potential errors in unreliable data, but these tools are not easy to use. As a result, analysts more often resort to simply documenting potential errors and hoping that anyone using the data will realize the implications. This proposal will enable data management systems that can query and organize uncertain data without being hard to use.
The specific aim of this proposal is to decouple the process of asking questions about uncertain data from mechanical concerns like why the data is uncertain, how the user wants to view uncertainty in query results, or which algorithms should be used. To enable this sort of "declarative uncertainty management," the project team will build on a system called Mimir that virtualizes uncertainty by augmenting data curation workflows (e.g., ETL pipelines) with a form of provenance capture through which heuristics can register alternative outputs (e.g., a schema matcher may register multiple potential matches). This provenance can then be used to synthesize a wide range of different physical and visual representations of uncertainty in data and in query results. To enable declarative uncertainty management, this proposal will address specific problems that fall into two general categories: (1) selecting and efficiently constructing qualitative summaries of uncertainty in query results, and (2) enhancing database query compilers and optimizers to support practical, efficient query processing over uncertain data. For further information see the project web page: http://mimirdb.info
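As a rough illustration of the provenance-capture idea described above, the following Python sketch shows a heuristic registering alternative outputs alongside its best guess, and two different views being synthesized from that record. Every name here (`UncertainValue`, `match_column`, `deterministic_view`, `annotated_view`) and the sample scores are hypothetical and chosen for illustration only; this is not Mimir's actual API or implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class UncertainValue:
    """A heuristic's best guess plus the alternatives it registered (hypothetical type)."""
    best_guess: Any
    alternatives: List[Any] = field(default_factory=list)
    source: str = ""  # which heuristic produced this value

    @property
    def is_uncertain(self) -> bool:
        return len(self.alternatives) > 0


def match_column(column: str, candidates: List[Tuple[str, float]]) -> UncertainValue:
    """Toy schema matcher: pick the top-scoring candidate, keep the rest as alternatives."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ranked[0][0]
    others = [name for name, _score in ranked[1:]]
    return UncertainValue(best, others, source=f"schema_matcher({column})")


def deterministic_view(row: Dict[str, Any]) -> Dict[str, Any]:
    """One representation: show only best guesses, as a conventional table would."""
    return {k: v.best_guess if isinstance(v, UncertainValue) else v
            for k, v in row.items()}


def annotated_view(row: Dict[str, Any]) -> Dict[str, Any]:
    """Another representation: flag values that have registered alternatives with '*'."""
    out = {}
    for k, v in row.items():
        if isinstance(v, UncertainValue):
            out[k] = f"{v.best_guess}*" if v.is_uncertain else v.best_guess
        else:
            out[k] = v
    return out


# Made-up candidate scores for a made-up source column.
match = match_column("zip", [("zip_plus4", 0.6), ("postal_code", 0.8)])
row = {"source_column": "zip", "target_column": match}

plain = deterministic_view(row)
flagged = annotated_view(row)
```

The point of the sketch is that the question asked of the data (which target column does `zip` map to?) stays the same, while the way uncertainty is surfaced is chosen separately from the same captured record.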
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
| Status | Finished |
|---|---|
| Effective start/end date | 01/01/18 → 02/29/24 |
Funding
- National Science Foundation: $568,158.00