Skip to main navigation Skip to search Skip to main content

Random sampling over joins revisited

  • University of Utah
  • Hong Kong University of Science and Technology

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

116 Scopus citations

Abstract

Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R S), sample(R) sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.

Original languageEnglish
Title of host publicationSIGMOD 2018 - Proceedings of the 2018 International Conference on Management of Data
EditorsGautam Das, Christopher Jermaine, Ahmed Eldawy, Philip Bernstein
PublisherAssociation for Computing Machinery
Pages1525-1539
Number of pages15
ISBN (Electronic)9781450317436
DOIs
StatePublished - May 27 2018
Event44th ACM SIGMOD International Conference on Management of Data, SIGMOD 2018 - Houston, United States
Duration: Jun 10 2018Jun 15 2018

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference44th ACM SIGMOD International Conference on Management of Data, SIGMOD 2018
Country/TerritoryUnited States
CityHouston
Period06/10/1806/15/18

Keywords

  • Join sampling framework
  • Multi-way joins
  • Random sampling

Fingerprint

Dive into the research topics of 'Random sampling over joins revisited'. Together they form a unique fingerprint.

Cite this