


\section{Introduction} \label{sec:intro} 

Having discussed the relevant background material, we now turn to our
first attempt to address the limitations of dynamization in the context
of one particular class of non-decomposable search problem: independent
random sampling. We have already discussed one representative problem
of this class, independent range sampling, and shown that it is not
decomposable in the traditional sense. This problem is one of several
closely related problems, however, and in this chapter we will also
consider simple random sampling, weighted set sampling, and weighted
independent range sampling.

Independent sampling presents an interesting motivating example
because it is nominally supported within many relational databases,
and is useful in a variety of contexts, such as approximate
query processing (AQP)~\cite{blinkdb,quickr,verdict,cohen23},
interactive data exploration~\cite{sps,xie21}, financial audit
sampling~\cite{olken-thesis}, and feature selection for machine
learning~\cite{ml-sampling}. However, existing support for these search
problems is limited by the techniques used within databases to implement
them. Existing implementations tend to sacrifice either performance,
by requiring the entire result set to be materialized prior to applying
Bernoulli sampling, or statistical independence. There exist techniques
for obtaining both sampling performance and independence by leveraging
existing B+Tree indices with slight modification~\cite{olken-thesis},
but even these techniques have worse sampling performance than could be
achieved using specialized static sampling indices.

Thus, we decided to apply a Bentley-Saxe-based dynamization technique
to these specialized static sampling indices. In this chapter, we
discuss our approach, which addresses the decomposability problems
discussed in Section~\ref{ssec:decomp-limits}, introduces two physical
mechanisms for supporting deletes, and introduces an LSM-tree-inspired
design space to allow for performance tuning. The results in this
chapter are highly specialized to sampling problems; however, they
will serve as a starting point for our discussion of a generalized
framework in the subsequent chapter.