\section{Background}
\label{sec:background}
We will begin with a formal discussion of the sampling problem and of
relevant existing solutions. First, though, a clarification of terminology
is in order. The nomenclature used to describe sampling in the literature
is rather inconsistent, and so we begin by precisely defining the relevant
terms.\footnote{
As an amusing footnote, this problem actually resulted in a
significant miscommunication between myself and my advisor in the
early days of the project, resulting in a lot of time being expended
on performance debugging a problem that didn't actually exist!
}

In this chapter, we'll use the term \emph{sample} to indicate a
single record selected by a sampling operation, and a collection of
these samples will be called a \emph{sample set}. The number of samples
within a sample set is the \emph{sample size}. The term \emph{sampling}
is used to indicate the selection of either a single sample or a sample
set; the specific usage should be clear from context.

In each of the problems considered, sampling can be performed either
with-replacement or without-replacement. Sampling with-replacement
means that a record that has been included in the sample set for a given
sampling query is ``replaced'' into the dataset and allowed to be sampled
again. Sampling without-replacement does not ``replace'' the record,
and so each individual record can appear in a sample set at most once
for a given query. The data structures that will be discussed here
support sampling with-replacement; sampling without-replacement can
be implemented using a constant number of with-replacement sampling
operations followed by a deduplication step~\cite{hu15}, so this chapter
will focus exclusively on the with-replacement case.
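
To make the reduction concrete, the following Python sketch builds
without-replacement sampling on top of an arbitrary with-replacement
primitive (here, a hypothetical \texttt{sample\_wr} callable): it
oversamples by a constant factor, deduplicates, and retries until $k$
distinct records have been collected. It is only meant to illustrate the
mechanical shape of the reduction; the statistical details are addressed
in~\cite{hu15}.
\begin{verbatim}
import random

def sample_without_replacement(sample_wr, k):
    # sample_wr(m): hypothetical with-replacement primitive returning
    # m records. Assumes the dataset has at least k distinct records.
    seen = set()
    while len(seen) < k:
        for record in sample_wr(2 * k):   # constant-factor oversampling
            seen.add(record)
            if len(seen) == k:
                break
    return list(seen)

# Example: without-replacement SRS on top of a with-replacement sampler.
data = list(range(100))
print(sample_without_replacement(lambda m: random.choices(data, k=m), 10))
\end{verbatim}
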
\subsection{Independent Sampling Problem}
When conducting sampling, it is often desirable for the drawn samples to
have \emph{statistical independence} and for the distribution of records
in the sample set to match the distribution of the source data set. This
requires that the sampling of a record does not affect the probability of
any other record being sampled in the future. Such sample sets are said
to be drawn i.i.d.\ (independent and identically distributed). Throughout
this chapter, the term ``independent'' will be used to refer to both
statistical independence and identical distribution.

Independence of sample sets is important because many useful statistical
results are derived under the assumption that this condition holds. For
example, it is a requirement for the application of statistical tools such
as the Central Limit Theorem~\cite{bulmer79}, which is the basis for many
concentration bounds. A failure to maintain independence during sampling
invalidates any guarantees provided by these statistical methods.

In the context of databases, it is also common to discuss a more
general version of the sampling problem, called \emph{independent query
sampling} (IQS)~\cite{hu14}. In IQS, a sample set of a specified size is
drawn from the result set of a database query. In
this context, it isn't enough to ensure that individual records are
sampled independently; the sample sets from repeated queries must also be
independent. This precludes, for example, caching and returning the same
sample set to multiple repetitions of the same query. This inter-query
independence provides a variety of useful properties, such as fairness
and representativeness of query results~\cite{tao22}.

A basic version of the independent sampling problem is \emph{weighted set
sampling} (WSS),\footnote{
This nomenclature is adopted from Tao's recent survey of sampling
techniques~\cite{tao22}. This problem is also called
\emph{weighted random sampling} (WRS) in the literature.
}
in which each record is associated with a weight that determines its
probability of being sampled. More formally, WSS is defined
as:
\begin{definition}[Weighted Set Sampling~\cite{walker74}]
Let $D$ be a set of data whose members are associated with positive
weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted
set sampling query returns $k$ independent random samples from $D$ with
each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in
D}w(p)}$ of being sampled.
\end{definition}
Each query returns a sample set of size $k$, rather than a single
sample. Queries returning sample sets are the common case, because
the robustness of analysis relies on having a sufficiently large sample
size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
problem is a special case of WSS, where every element has unit weight.
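
As a baseline for the data structures discussed later in this chapter, a
WSS query can be answered directly from the weight array: an $O(n)$
prefix-sum pass builds a cumulative distribution, after which each sample
costs $O(\log n)$ via binary search. The following Python sketch (the
function name is illustrative, not drawn from any cited work) implements
this baseline.
\begin{verbatim}
import bisect
import random
from itertools import accumulate

def wss_sample(data, weights, k):
    # Prefix sums of the weights form a cumulative distribution.
    cdf = list(accumulate(weights))
    total = cdf[-1]
    # Each sample: draw u uniformly on [0, total), then binary search.
    return [data[bisect.bisect_right(cdf, random.random() * total)]
            for _ in range(k)]

# Example: record "c" is sampled roughly half of the time.
print(wss_sample(["a", "b", "c"], [1.0, 1.0, 2.0], k=5))
\end{verbatim}
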
For WSS, samples are taken directly from the dataset without applying
any predicates or filtering. This can be useful; however, for IQS it
is common for database queries to apply predicates to the data. One of
the most common such predicates is the range predicate, and sampling
from the result of a range scan can be formulated as a problem called
\emph{independent range sampling} (IRS),
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
interval $q = [x, y]$ and an integer $k$, an independent range sampling
query returns $k$ independent samples from $D \cap q$ with each
point having equal probability of being sampled.
\end{definition}
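
For data stored in a sorted array, an IRS query can be answered with two
binary searches to locate $D \cap q$, followed by $k$ uniform index draws,
for $O(\log n + k)$ time per query. A minimal Python sketch (names are
illustrative) follows; note that this works only for static data, and
supporting updates is precisely the difficulty addressed later in this
chapter.
\begin{verbatim}
import bisect
import random

def irs_sample(sorted_keys, x, y, k):
    # Locate the contiguous run of keys falling inside [x, y].
    lo = bisect.bisect_left(sorted_keys, x)
    hi = bisect.bisect_right(sorted_keys, y)
    if lo >= hi:
        return []   # empty query range
    # Each sample is a uniformly random element of that run.
    return [sorted_keys[random.randrange(lo, hi)] for _ in range(k)]

print(irs_sample([1, 3, 4, 7, 9, 12], x=3, y=9, k=4))
\end{verbatim}
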
IRS is a non-weighted sampling problem, similar to SRS. There also exists
a weighted generalization, called \emph{weighted independent range
sampling} (WIRS),
\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}]
Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with
positive weights $w: D\to \mathbb{R}^+$. Given a query
interval $q = [x, y]$ and an integer $k$, a weighted independent range
sampling query returns $k$ independent samples from $D \cap q$ with each
point $d$ having a probability of $\frac{w(d)}{\sum_{p \in D \cap q}w(p)}$
of being sampled.
\end{definition}
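
Combining the range restriction of the IRS sketch with per-record weights
yields a correct, though per-query linear-cost, WIRS baseline, sketched
below with illustrative names; the alias-augmented structures discussed
later in this chapter exist to improve upon this cost.
\begin{verbatim}
import bisect
import random

def wirs_sample(sorted_items, x, y, k):
    # sorted_items: list of (key, weight) pairs, sorted by key.
    keys = [key for key, _ in sorted_items]
    lo = bisect.bisect_left(keys, x)
    hi = bisect.bisect_right(keys, y)
    if lo >= hi:
        return []
    in_range = sorted_items[lo:hi]
    # Weighted with-replacement selection within the query range.
    return random.choices(in_range,
                          weights=[w for _, w in in_range], k=k)

print(wirs_sample([(1, 1.0), (3, 2.0), (7, 5.0), (9, 1.0)], 2, 8, 3))
\end{verbatim}
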
This is not an exhaustive list of sampling problems, but it is the list
of problems that will be directly addressed within this chapter.
\subsection{Algorithmic Solutions}
Relational database systems often have native support for IQS using
SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the
algorithms used to implement this operator have significant limitations:
they cannot maintain statistical independence of the results without also
executing the sampled query in full. Thus, users must choose between
independence and performance.

To maintain statistical independence, Bernoulli sampling can be used. This
technique iterates over every record in the result set of the query,
selecting or rejecting each one for inclusion in the sample with a fixed
probability~\cite{db2-doc}. Because every record of the result set must
be considered, this provides no performance benefit relative to the query
being sampled from: the query must be answered in full anyway before only
some of the results are returned.\footnote{
To clarify, this is not to say that Bernoulli sampling isn't
useful. It \emph{can} be used to improve the performance of queries
by limiting the cardinality of intermediate results, etc. But it is
not particularly useful for improving the performance of IQS queries,
where the sampling is performed on the final result set of the query.
}
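
To make the cost concrete, Bernoulli sampling amounts to nothing more than
a filter over the fully materialized result set, as in the following
sketch; every record must be examined regardless of how few are returned.
\begin{verbatim}
import random

def bernoulli_sample(result_set, p):
    # Each record of the (already computed) result set is examined and
    # kept independently with probability p, so the cost is linear in
    # the result set size.
    return [r for r in result_set if random.random() < p]
\end{verbatim}
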
For performance, the statistical guarantees can be discarded and
systematic or block sampling used instead. Systematic sampling considers
only a fraction of the rows in the table being sampled from, following
some particular pattern~\cite{postgres-doc}, and block sampling samples
entire database pages~\cite{db2-doc}. These allow for query performance
to be decoupled from data size, but tie a given record's inclusion in the
sample set directly to its physical storage location, which can introduce
bias into the sample and violate statistical guarantees.
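
By contrast, systematic sampling never examines most rows at all. In the
simplest form, sketched below, every \texttt{step}-th row is kept, which
makes a record's inclusion a deterministic function of its physical
position rather than an independent random event.
\begin{verbatim}
def systematic_sample(rows, step, offset=0):
    # Inclusion depends only on position: rows at indices
    # offset, offset + step, offset + 2*step, ... are returned.
    return rows[offset::step]
\end{verbatim}
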
\subsection{Index-assisted Solutions}
It is possible to answer IQS queries in a manner that both preserves
independence and avoids executing the query in full, through the use
of specialized data structures.

\Paragraph{Olken's Method.}
The classical solution is Olken's method~\cite{olken89},
which can be applied to traditional tree-based database indices. This
technique performs a randomized tree traversal, selecting the pointer to
follow at each node uniformly at random, using rejection to correct for
any imbalance in the resulting traversal probabilities. This allows SRS
queries to be answered at $\Theta(\log n)$ cost per sample in the sample
set. Thus, for an IQS query with a desired sample set size of $k$,
Olken's method can provide a full sample set in $\Theta(k \log n)$ time.
More complex IQS queries, such as weighted or predicate-filtered sampling,
can be answered using the same algorithm by applying rejection sampling.
To support predicates, any sampled records that violate the predicate can
be rejected and retried. For weighted sampling, a given record $r$ will
be accepted into the sample with $\nicefrac{w(r)}{w_{max}}$ probability.
This will require an expected number of $\nicefrac{w_{max}}{w_{avg}}$
attempts per sample in the sample set~\cite{olken-thesis}. This rejection
sampling can be significantly improved by adding aggregated weight tags to
internal nodes, allowing rejection sampling to be performed at each step
of the tree traversal to abort dead-end traversals early~\cite{zhao22}. In
either case, there will be a performance penalty for rejected samples,
requiring more than $k$ traversals to obtain a sample set of size $k$.
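
The following Python sketch illustrates the idea behind Olken's method on
a binary search tree. It uses a count-augmented variant (each node stores
its subtree cardinality), so the uniform case needs no rejections, and
layers the $\nicefrac{w(r)}{w_{max}}$ rejection test on top for weighted
sampling; class and function names are illustrative rather than taken
from~\cite{olken89}.
\begin{verbatim}
import random

class Node:
    def __init__(self, key, weight, left=None, right=None):
        self.key, self.weight = key, weight
        self.left, self.right = left, right
        # Subtree cardinality; assumed to be maintained on updates.
        self.count = (1 + (left.count if left else 0)
                        + (right.count if right else 0))

def sample_uniform(root):
    # Descend from the root, stopping at each node with probability
    # 1/count and otherwise recursing proportionally to subtree sizes.
    node = root
    while True:
        left = node.left.count if node.left else 0
        r = random.randrange(node.count)
        if r == left:
            return node                    # current node selected
        node = node.left if r < left else node.right

def sample_weighted(root, w_max):
    # Rejection sampling: accept a uniform candidate with probability
    # w/w_max, costing an expected w_max/w_avg traversals per sample.
    while True:
        cand = sample_uniform(root)
        if random.random() < cand.weight / w_max:
            return cand

# Example: a small hand-built tree.
tree = Node(5, 2.0, Node(2, 1.0), Node(8, 4.0))
print(sample_weighted(tree, w_max=4.0).key)
\end{verbatim}
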
\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
\caption{\textbf{A pictorial representation of an alias
structure}, built over a set of weighted records. Sampling is performed by
first (1) selecting a cell by uniformly generating an integer index on
$[0,n)$, and then (2) selecting an item by generating a
second uniform float on $[0,1]$ and comparing it to the cell's normalized
cutoff values. In this example, the first random number is $0$,
corresponding to the first cell, and the second is $.7$. This is larger
than $\nicefrac{.15}{.25}$, and so $3$ is selected as the result of the
query.
This allows $O(1)$ independent weighted set sampling, but adding a new
element requires a weight adjustment to every element in the structure, and
so isn't generally possible without performing a full reconstruction.}
\label{fig:alias}
\end{figure}
\Paragraph{Static Solutions.}
There are also a large number of static data structures, which we'll
call static sampling indices (SSIs) in this chapter,\footnote{
We used the term ``SSI'' in the original paper on which this chapter
is based, which was published prior to our realization that a strong
distinction between an index and a data structure would be useful. I
am retaining the term SSI in this chapter for consistency with the
original paper, but understand that in the terminology established in
Chapter~\ref{chap:background}, SSIs are data structures, not indices.
}
that are capable of answering sampling queries with better scaling in
the overall data size than Olken's method. An example of such a structure
is the one used in Walker's alias method~\cite{walker74,vose91}.
This technique constructs a data structure in $\Theta(n)$ time
that is capable of answering WSS queries in $\Theta(1)$ time per
sample. Figure~\ref{fig:alias} shows a pictorial representation of the
structure. For a set of $n$ records, it is constructed by distributing
the normalized weight of all of the records across an array of $n$
cells, which represent at most two records each. Each cell will have
a proportional representation of its records based on their normalized
weight (e.g., a given cell may be 40\% allocated to one record, and 60\%
to another). To query the structure, a cell is first selected uniformly
at random, and then one of its two associated records is selected
with probability proportional to its share of the cell. This operation
takes $\Theta(1)$ time, requiring only two random number generations
per sample. Thus, a WSS query can be answered in $\Theta(k)$ time,
assuming the structure has already been built. Unfortunately, the alias
structure cannot be efficiently updated, as inserting new records would
change the relative weights of \emph{all} the records, and require fully
re-partitioning the structure.
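
A minimal Python sketch of the alias structure follows, using Vose's
linear-time construction~\cite{vose91}: \texttt{build\_alias} distributes
the normalized weights across $n$ cells of at most two records each, and
\texttt{sample\_alias} draws one record using two random numbers.
\begin{verbatim}
import random

def build_alias(weights):
    # O(n) construction: scale weights so the average cell holds
    # weight 1, then pair each under-full cell with an over-full one.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l    # cell s: cutoff and alias
        scaled[l] -= 1.0 - scaled[s]        # l donates the remainder
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def sample_alias(prob, alias):
    # O(1) per sample: pick a cell uniformly, then flip its biased coin.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Example WSS query of size k = 5 over weights [1, 1, 2].
tables = build_alias([1.0, 1.0, 2.0])
print([sample_alias(*tables) for _ in range(5)])
\end{verbatim}
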
While the alias method only applies to WSS, other sampling problems can
be solved by using the alias method within the context of a larger data
structure, a technique called \emph{alias augmentation}~\cite{tao22}. For
example, alias augmentation can be used to construct an SSI capable of
answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}.
This structure breaks the data into multiple disjoint partitions of size
$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+tree
is then built, using the augmented partitions as its leaf nodes. Each
internal node is also augmented with an alias structure over the aggregate
weights associated with the children of each pointer. Constructing this
structure requires $\Theta(n)$ time (though the associated constants are
quite large in practice). WIRS queries can be answered by traversing
the tree, first establishing the portion of the tree covering the
query range, and then sampling records from that range using the alias
structures attached to the nodes. More examples of alias augmentation
applied to different IQS problems can be found in a recent survey by
Tao~\cite{tao22}.
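
To give a flavor of alias augmentation without the full tree machinery,
the following is a simplified two-level Python sketch (illustrative only,
and not the structure of~\cite{afshani17}): sorted records are partitioned
into fixed-size blocks, each with a prebuilt alias table, and a query
decomposes its range into fully covered blocks, sampled through their
alias tables, plus at most two boundary fragments, sampled directly. It
reuses \texttt{build\_alias} and \texttt{sample\_alias} from the sketch
above.
\begin{verbatim}
import bisect
import random

class TwoLevelWIRS:
    def __init__(self, items, block_size=64):
        # items: (key, weight) pairs; stored sorted and cut into blocks.
        self.items = sorted(items)
        self.keys = [k for k, _ in self.items]
        self.bs = block_size
        self.blocks = []  # (start, length, total_weight, alias_tables)
        for start in range(0, len(self.items), block_size):
            blk = self.items[start:start + block_size]
            w = [wt for _, wt in blk]
            self.blocks.append((start, len(blk), sum(w), build_alias(w)))

    def sample(self, x, y, k):
        lo = bisect.bisect_left(self.keys, x)
        hi = bisect.bisect_right(self.keys, y)
        if lo >= hi:
            return []
        # Decompose [lo, hi) into per-block components, each weighted
        # by the total weight of the records it contributes.
        comps = []  # (component_weight, draw_one_index)
        for bi in range(lo // self.bs, (hi - 1) // self.bs + 1):
            start, length, wsum, tables = self.blocks[bi]
            s, e = max(lo, start), min(hi, start + length)
            if s == start and e == start + length:
                # Fully covered: reuse the block's prebuilt alias table.
                comps.append((wsum, lambda t=tables, st=start:
                              st + sample_alias(*t)))
            else:
                # Boundary fragment: weighted choice over its records.
                frag = [self.items[i][1] for i in range(s, e)]
                comps.append((sum(frag), lambda s=s, w=frag:
                              s + random.choices(range(len(w)),
                                                 weights=w)[0]))
        out = []
        for ci in random.choices(range(len(comps)),
                                 weights=[w for w, _ in comps], k=k):
            out.append(self.items[comps[ci][1]()])
        return out

# Example: idx = TwoLevelWIRS([(i, 1.0) for i in range(1000)])
#          print(idx.sample(100, 250, k=5))
\end{verbatim}
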
\Paragraph{Miscellanea.}
There also exist specialized data structures with support for both
efficient sampling and updates~\cite{hu14}, but these structures have
poor constant factors and are very complex, rendering them of little
practical utility. Additionally, efforts have been made to extend
the alias structure with support for weight updates over a fixed set of
elements~\cite{hagerup93,matias03,allendorf23}. However, these approaches
do not allow the insertion or removal of records, only in-place weight
updates. While in principle they could be constructed over the
entire domain of possible records, with the weights of non-existent
records set to $0$, this is hardly practical. Thus, these structures are
not suited for the database sampling applications that are of interest to
us in this chapter.
\subsection{The Dichotomy}
Across the index-assisted techniques discussed above, a clear pattern
emerges. Olken's method supports updates, but is inefficient compared to
the SSIs because it must pay a cost proportional to a function of the
data size for each sample in the sample set. The SSIs are more efficient
for sampling, typically paying this data-size-dependent cost only once per
sample set (if at all), but fail to support updates. Thus,
there appears to be a general dichotomy of sampling techniques: existing
sampling data structures support either updates, or efficient sampling,
but generally not both. It will be the purpose of this chapter to resolve
this dichotomy. In particular, we seek to develop structures with the
following desiderata,
\begin{enumerate}
\item Support data updates (including deletes) with similar average
performance to a standard B+tree.
\item Support IQS queries that do not pay a per-sample cost
proportional to some function of the data size. In other words,
$k$ should \emph{not} be multiplied by any function of $n$
in the query cost function.
\item Provide the user with some basic capability to tune the
trade-off between update and sampling performance.
\end{enumerate}