TY - GEN
T1 - Random sampling over joins revisited
AU - Zhao, Zhuoyue
AU - Christensen, Robert
AU - Li, Feifei
AU - Hu, Xiao
AU - Yi, Ke
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/5/27
Y1 - 2018/5/27
N2 - Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R S), sample(R) sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.
AB - Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fundamental barrier for this problem, that the sampling operator cannot be pushed through a join, i.e., sample(R S), sample(R) sample(S). To overcome this barrier, they used precomputed statistics to guide the sampling process, but only showed how this works for two-relation joins. This paper revisits this classic problem for both acyclic and cyclic multi-way joins. We build upon the idea of Chaudhuri et al., but extend it in several nontrivial directions. First, we propose a general framework for random sampling over multi-way joins, which includes the algorithm of Chaudhuri et al. as a special case. Second, we explore several ways to instantiate this framework, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput. We analyze the properties of different instantiations and evaluate them against the baseline methods; the results clearly demonstrate the superiority of our new techniques.
KW - Join sampling framework
KW - Multi-way joins
KW - Random sampling
UR - https://www.scopus.com/pages/publications/85048795880
U2 - 10.1145/3183713.3183739
DO - 10.1145/3183713.3183739
M3 - Conference contribution
AN - SCOPUS:85048795880
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1525
EP - 1539
BT - SIGMOD 2018 - Proceedings of the 2018 International Conference on Management of Data
A2 - Das, Gautam
A2 - Jermaine, Christopher
A2 - Eldawy, Ahmed
A2 - Bernstein, Philip
PB - Association for Computing Machinery
T2 - 44th ACM SIGMOD International Conference on Management of Data, SIGMOD 2018
Y2 - 10 June 2018 through 15 June 2018
ER -