\section{Background}
\label{sec:background}

This section formalizes the sampling problem and describes relevant existing
solutions. Before discussing these topics, though, a clarification of
terminology is in order, as the nomenclature used to describe sampling varies
slightly throughout the literature. In this chapter, the term \emph{sample}
refers to a single record selected by a sampling operation, a collection of
these samples is called a \emph{sample set}, and the number of samples within
a sample set is the \emph{sample size}. The term \emph{sampling} refers to the
selection of either a single sample or a sample set; the specific usage should
be clear from context.


\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often
desirable for the drawn samples to have \emph{statistical independence}. This
requires that the sampling of a record does not affect the probability of any
other record being sampled in the future. Independence is a requirement for the
application of statistical tools such as the Central Limit
Theorem~\cite{bulmer79}, which is the basis for many concentration bounds.
A failure to maintain independence in sampling invalidates any guarantees
provided by these statistical methods.

In each of the problems considered, sampling can be performed either with
replacement (WR) or without replacement (WoR). It is possible to answer any WoR
sampling query using a constant number of WR queries, followed by a
deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR
sampling.
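
As a minimal sketch of this reduction, consider the following Python fragment.
The \texttt{wr\_query} callback (which draws a WR sample set of a given size)
and the retry loop are illustrative assumptions, not the exact scheme
of~\cite{hu15}:

\begin{verbatim}
def sample_wor(wr_query, k):
    # Answer a WoR query of size k using a WR primitive: repeatedly
    # draw WR sample sets and deduplicate until k distinct records
    # have been collected. In expectation, a constant number of
    # rounds suffices.
    distinct = []
    while len(distinct) < k:
        for rec in wr_query(k):
            if rec not in distinct and len(distinct) < k:
                distinct.append(rec)
    return distinct
\end{verbatim}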

A basic version of the independent sampling problem is \emph{weighted set
sampling} (WSS),\footnote{
    This nomenclature is adopted from Tao's recent survey of sampling
    techniques~\cite{tao22}. This problem is also called
    \emph{weighted random sampling} (WRS) in the literature.
} 
in which each record is associated with a weight that determines its
probability of being sampled. More formally, WSS is defined
as:
\begin{definition}[Weighted Set Sampling~\cite{walker74}]
    Let $D$ be a set of data whose members are associated with positive
    weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted
    set sampling query returns $k$ independent random samples from $D$ with
    each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in
    D}w(p)}$ of being sampled.
\end{definition}
Each query returns a sample set of size $k$, rather than a
single sample. Queries returning sample sets are the common case, because the
robustness of analysis relies on having a sufficiently large sample
size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
problem is a special case of WSS, where every element has unit weight.
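
For intuition, a simple WSS baseline can be built from prefix sums and
inverse-transform sampling. The following Python sketch is for illustration
only, and is not one of the structures discussed later in this section:

\begin{verbatim}
import bisect, itertools, random

def wss_query(data, weights, k):
    # O(n) preprocessing: prefix sums over the weights.
    prefix = list(itertools.accumulate(weights))
    total = prefix[-1]
    # Each sample inverts a uniform draw against the prefix sums,
    # returning d with probability w(d)/total in O(log n) time.
    return [data[bisect.bisect(prefix, random.random() * total)]
            for _ in range(k)]
\end{verbatim}

The alias method, discussed later in this section, reduces this per-sample
cost to $O(1)$.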

In the context of databases, it is also common to discuss a more general
version of the sampling problem, called \emph{independent query sampling}
(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the
result set of a database query. In this context, it is insufficient to merely
ensure individual records are sampled independently; the sample sets returned
by repeated IQS queries must be independent as well. This provides a variety of
useful properties, such as fairness and representativeness of query
results~\cite{tao22}. As a concrete example, consider simple random sampling on
the result set of a single-dimensional range reporting query. This is called
\emph{independent range sampling} (IRS), and is formally defined as:

\begin{definition}[Independent Range Sampling~\cite{tao22}]
    Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
    interval $q = [x, y]$ and an integer $k$, an independent range sampling
    query returns $k$ independent samples from $D \cap q$ with each 
    point having equal probability of being sampled.
\end{definition}
A generalization of IRS exists, called \emph{weighted independent range
sampling} (WIRS)~\cite{afshani17}, which relates to IRS as WSS relates to SRS.
Each point in $D$ is associated with a positive weight $w: D \to \mathbb{R}^+$,
and samples are drawn from the range query results $D \cap q$ such that each
data point $d \in D \cap q$ has a probability of
$\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled.
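
In the unweighted case, a sorted array suffices as a simple static baseline.
The following sketch, an illustration rather than a structure from the cited
works, answers an IRS query in $O(\log n + k)$ time:

\begin{verbatim}
import bisect, random

def irs_query(points, x, y, k):
    # Two binary searches locate the contiguous run of sorted
    # points falling in [x, y]; each sample is then a uniform O(1)
    # index draw, for O(log n + k) total time.
    lo = bisect.bisect_left(points, x)
    hi = bisect.bisect_right(points, y)
    if lo >= hi:
        return []
    return [points[random.randrange(lo, hi)] for _ in range(k)]
\end{verbatim}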


\Paragraph{Existing Solutions.} While many sampling techniques exist,
few are supported in practical database systems. The \texttt{TABLESAMPLE}
operator, provided by SQL and available in all major DBMS
implementations~\cite{postgres-doc}, requires either a linear scan (e.g.,
Bernoulli sampling), which results in high sample retrieval costs, or relaxed
statistical guarantees (e.g., the block sampling~\cite{postgres-doc} used in
PostgreSQL).

Index-assisted sampling solutions have been studied
extensively. Olken's method~\cite{olken89} is a classical solution to
independent sampling problems. This algorithm operates upon traditional search
trees, such as the B+tree commonly used as a database index, by conducting a
uniformly random walk on the tree from the root to a leaf, resulting in an
$O(\log n)$ sampling cost for each returned record. Should weighted samples be
desired, rejection sampling can be performed: a sampled record $r$ is accepted
with probability $\nicefrac{w(r)}{w_{max}}$, requiring an expected
$\nicefrac{w_{max}}{w_{avg}}$ draws per element of the sample set. Olken's
method can also be extended to support general IQS by rejecting all sampled
records that fail to satisfy the query predicate. It can be accelerated by
adding aggregate weight tags to internal nodes~\cite{olken-thesis,zhao22},
allowing rejection sampling to be performed during the tree traversal and
dead-end traversals to be aborted early.
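
The two components compose as in the following sketch, which assumes the
aggregate-augmented variant in which each internal node stores the number of
records beneath it; the \texttt{Node} layout and helper names are
hypothetical:

\begin{verbatim}
import random

def uniform_sample(root):
    # Walk from the root to a leaf, choosing each child with
    # probability proportional to its subtree record count
    # (assumes count-augmented internal nodes).
    node = root
    while not node.is_leaf:
        r = random.randrange(node.count)
        for child in node.children:
            if r < child.count:
                node = child
                break
            r -= child.count
    return random.choice(node.records)

def weighted_sample(root, weight, w_max):
    # Rejection step for weighted samples: accept record rec with
    # probability w(rec)/w_max, requiring an expected w_max/w_avg
    # draws per accepted sample.
    while True:
        rec = uniform_sample(root)
        if random.random() < weight(rec) / w_max:
            return rec
\end{verbatim}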

\begin{figure}
    \centering
    \includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
    \caption{\textbf{A pictorial representation of an alias
    structure}, built over a set of weighted records. Sampling is performed by
    first (1) selecting a cell by uniformly generating an integer index on
    $[0,n)$, and then (2) selecting an item by generating a second uniform
    float on $[0,1)$ and comparing it to the cell's normalized cutoff value.
    In this example, the first random number is $0$, corresponding to the
    first cell, and the second is $0.7$. This is larger than
    $\nicefrac{0.15}{0.25}$, and so $3$ is selected as the result of the
    query. This structure allows $O(1)$ independent weighted set sampling, but
    adding a new element requires a weight adjustment to every element in the
    structure, and so is not generally possible without performing a full
    reconstruction.}
    \label{fig:alias}
    
\end{figure}

There also exist static data structures, referred to in this chapter as
\emph{static sampling indexes} (SSIs)\footnote{
The name SSI was established in the published version of this paper prior to
the realization that a distinction between the terms index and data structure
would be useful. We will continue to use the term SSI for the remainder of
this chapter, to maintain consistency with the published work, but technically
an SSI refers to a data structure, not an index, in the nomenclature
established in the previous chapter.
    }, that are capable of answering sampling queries in
near-constant time\footnote{
The designation ``near-constant'' is \emph{not} used in the technical sense of
being constant to within a polylogarithmic factor (i.e., $\tilde{O}(1)$); it is
instead used to mean constant to within an additive polylogarithmic term, i.e.,
$f(x) \in O(\log n + 1)$. For example, drawing $k$ samples from $n$ records
using a near-constant approach requires $O(\log n + k)$ time, in contrast to a
tree-traversal approach, which requires $O(k \log n)$ time.
} relative to the size of the dataset. An example of such a
structure is used in Walker's alias method \cite{walker74,vose91}, a technique
for answering WSS queries with $O(1)$ query cost per sample, but requiring
$O(n)$ time to construct. It distributes the weight of the items across $n$
cells, each shared by at most two items, such that the total share of cell
capacity assigned to each item is proportional to its weight. A query selects
one cell uniformly at random, then chooses between the (at most) two items in
that cell by weight, thus selecting items with probability proportional to
their weight in $O(1)$ time. A pictorial representation of this structure is
shown in
Figure~\ref{fig:alias}.
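
A minimal Python sketch of this structure follows, in the style of Vose's
$O(n)$ construction~\cite{vose91}; variable and function names are
illustrative:

\begin{verbatim}
import random

def build_alias(weights):
    # O(n) construction of the (cutoff, alias) tables.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average cell weight is 1
    small = [i for i, w in enumerate(scaled) if w < 1.0]
    large = [i for i, w in enumerate(scaled) if w >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]     # item s fills this share of cell s...
        alias[s] = l            # ...and item l fills the remainder
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:     # leftovers occupy whole cells
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: pick a cell uniformly, then one of its
    # (at most) two items by the cell's cutoff.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
\end{verbatim}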

The alias method can also be used as the basis for creating SSIs capable of
answering general IQS queries using a technique called alias
augmentation~\cite{tao22}. As a concrete example, previous
papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries
requiring $O(\log n + k)$ time, where the $\log n$ cost is paid only once per
query, after which elements can be sampled in constant time. This structure is
built by breaking the data up into disjoint chunks of size
$\nicefrac{n}{\log n}$, called \emph{fat points}, each with its own alias
structure. A B+tree is then constructed using the fat points as its leaf
nodes, and the internal nodes are augmented with an alias structure over the
total weights of their children. This alias structure
is used instead of rejection sampling to determine the traversal path to take
through the tree, and then the alias structure of the fat point is used to
sample a record. Because rejection sampling is not used during the traversal,
two traversals suffice to establish the valid range of records for sampling,
after which samples can be collected without requiring per-sample traversals.
More examples of alias augmentation applied to different IQS problems can be
found in a recent survey by Tao~\cite{tao22}.
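
The full tree-based construction is involved, but the core idea of
alias-guided descent can be illustrated with a hypothetical two-level
structure that reuses \texttt{build\_alias} and \texttt{alias\_sample} from
the previous sketch. The range restriction, which requires the B+tree
traversals described above, is omitted here:

\begin{verbatim}
class TwoLevelAlias:
    def __init__(self, records, weights, chunk_size):
        # Partition the data into "fat point" chunks, each with its
        # own alias table, plus a top-level alias table over the
        # chunks' total weights.
        self.recs = [records[i:i + chunk_size]
                     for i in range(0, len(records), chunk_size)]
        wts = [weights[i:i + chunk_size]
               for i in range(0, len(weights), chunk_size)]
        self.tables = [build_alias(w) for w in wts]
        self.top = build_alias([sum(w) for w in wts])

    def sample(self):
        # Every level of the descent is guided by an alias table
        # rather than rejection, so each sample costs O(1).
        c = alias_sample(*self.top)
        r = alias_sample(*self.tables[c])
        return self.recs[c][r]
\end{verbatim}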

There do exist specialized sampling indexes~\cite{hu14} with both efficient
sampling and support for updates, but these are restricted to specific query
types and are often very complex structures with poor constant factors on
their sampling and update costs, and so are of limited practical
utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on
extending the alias structure to support weight updates over a fixed set of
elements. However, these solutions do not allow insertion or deletion in the
underlying dataset, and so are not well suited to database sampling
applications. 

\Paragraph{The Dichotomy.} Among these techniques, there exists a
clear trade-off between efficient sampling and support for updates.
Tree-traversal-based sampling solutions pay a per-sample cost that grows with
the size of the dataset, in exchange for update support. The static solutions
lack support for updates, but provide near-constant-time sampling. While some
data structures exist that support both, they are restricted to highly
specialized query types. Thus, in the general case, there exists a dichotomy:
existing sampling indexes can support either data updates or efficient
sampling, but not both.