\section{Background}
\label{sec:background}

This section formalizes the sampling problem and describes relevant existing
solutions. First, however, a note on terminology is in order, as the
nomenclature used to describe sampling varies slightly throughout the
literature. In this chapter, the term \emph{sample} refers to a single record
selected by a sampling operation, a collection of such samples is called a
\emph{sample set}, and the number of samples within a sample set is the
\emph{sample size}. The term \emph{sampling} denotes the selection of either a
single sample or a sample set; the specific usage should be clear from
context.


\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often
desirable for the drawn samples to have \emph{statistical independence}: the
sampling of a record must not affect the probability of any other record being
sampled in the future. Independence is a requirement for the application of
statistical tools such as the Central Limit Theorem~\cite{bulmer79}, which is
the basis for many concentration bounds. A failure to maintain independence in
sampling invalidates any guarantees provided by these statistical methods.

In each of the problems considered, sampling can be performed either with
replacement (WR) or without replacement (WoR). Any WoR sampling query can be
answered using a constant number of WR queries, followed by a deduplication
step~\cite{hu15}, and so this chapter focuses exclusively on WR sampling.
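To illustrate the shape of this reduction, the following Python sketch answers
a uniform WoR query by issuing WR queries and deduplicating the results. It is
a naive version for illustration only (the function names are invented here,
and the result in~\cite{hu15} bounds the number of WR rounds needed, which
this simple loop does not):

```python
import random

def sample_wr(data, k, rng=random):
    """Draw k independent samples with replacement (uniform weights)."""
    return [rng.choice(data) for _ in range(k)]

def sample_wor(data, k, rng=random):
    """Answer a without-replacement query by repeatedly issuing
    with-replacement queries and deduplicating the results."""
    assert k <= len(data)
    seen = []
    while len(seen) < k:
        for s in sample_wr(data, k, rng):
            if s not in seen:
                seen.append(s)
            if len(seen) == k:
                break
    return seen
```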

A basic version of the independent sampling problem is \emph{weighted set
sampling} (WSS),\footnote{
    This nomenclature is adopted from Tao's recent survey of sampling
    techniques~\cite{tao22}. This problem is also called
    \emph{weighted random sampling} (WRS) in the literature.
}
in which each record is associated with a weight that determines its
probability of being sampled. More formally, WSS is defined
as:
\begin{definition}[Weighted Set Sampling~\cite{walker74}]
    Let $D$ be a set of data whose members are associated with positive
    weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted
    set sampling query returns $k$ independent random samples from $D$, with
    each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in
    D}w(p)}$ of being sampled.
\end{definition}
Each query returns a sample set of size $k$, rather than a single sample.
Queries returning sample sets are the common case, because the robustness of
analysis relies on having a sufficiently large sample
size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
problem is a special case of WSS in which every element has unit weight.
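As a point of reference for the solutions discussed below, a WSS query can be
answered directly from a prefix-sum array over the weights at an $O(\log n)$
cost per sample. A minimal Python sketch of this baseline (the function name
is illustrative, not drawn from any cited work):

```python
import bisect
import itertools
import random

def wss_query(data, weights, k, rng=random):
    """Weighted set sampling: draw k independent samples, each chosen
    with probability w(d) / sum of all weights.  A prefix-sum array
    plus binary search gives an O(log n) cost per sample."""
    prefix = list(itertools.accumulate(weights))  # running weight totals
    total = prefix[-1]
    out = []
    for _ in range(k):
        u = rng.random() * total                  # uniform on [0, total)
        out.append(data[bisect.bisect_right(prefix, u)])
    return out
```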

In the context of databases, it is also common to discuss a more general
version of the sampling problem, called \emph{independent query sampling}
(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the
result set of a database query. In this context, it is insufficient to merely
ensure that individual records are sampled independently; the sample sets
returned by repeated IQS queries must be independent as well. This provides a
variety of useful properties, such as fairness and representativeness of query
results~\cite{tao22}. As a concrete example, consider simple random sampling
over the result set of a single-dimensional range reporting query. This is
called independent range sampling (IRS), and is formally defined as:

\begin{definition}[Independent Range Sampling~\cite{tao22}]
    Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
    interval $q = [x, y]$ and an integer $k$, an independent range sampling
    query returns $k$ independent samples from $D \cap q$, with each
    point having equal probability of being sampled.
\end{definition}
A generalization of IRS exists, called \emph{Weighted Independent Range
Sampling} (WIRS)~\cite{afshani17}, which is analogous to WSS. Each point in
$D$ is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples
are drawn from the range query result $D \cap q$ such that each data point
$d$ has a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of
being sampled.
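On sorted, static data, IRS itself admits a simple solution: two binary
searches establish the boundaries of $D \cap q$, after which each sample costs
$O(1)$. A minimal Python sketch (the function name is illustrative):

```python
import bisect
import random

def irs_query(points, x, y, k, rng=random):
    """Independent range sampling over a sorted list of points: two
    binary searches locate the run of points in [x, y], then each of
    the k samples is drawn uniformly and independently from it."""
    lo = bisect.bisect_left(points, x)
    hi = bisect.bisect_right(points, y)
    if lo == hi:
        return []   # empty query range
    return [points[rng.randrange(lo, hi)] for _ in range(k)]
```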


\Paragraph{Existing Solutions.} While many sampling techniques exist, few are
supported in practical database systems. The \texttt{TABLESAMPLE} operator
provided by all major SQL DBMS implementations~\cite{postgres-doc} requires
either a linear scan (e.g., Bernoulli sampling), which results in high sample
retrieval costs, or relaxed statistical guarantees (e.g., the block
sampling~\cite{postgres-doc} used in PostgreSQL).

Index-assisted sampling solutions have been studied extensively. Olken's
method~\cite{olken89} is a classical solution to independent sampling
problems. This algorithm operates upon traditional search trees, such as the
B+tree commonly used as a database index. It conducts a uniformly random
root-to-leaf walk on the tree, resulting in an $O(\log n)$ sampling cost for
each returned record. Should weighted samples be desired, rejection sampling
can be performed: a sampled record $r$ is accepted with probability
$\nicefrac{w(r)}{w_{max}}$, requiring an expected
$\nicefrac{w_{max}}{w_{avg}}$ attempts per element of the sample set. Olken's
method can also be extended to support general IQS by rejecting all sampled
records that fail to satisfy the query predicate. It can be accelerated by
adding aggregated weight tags to internal
nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed
during the tree traversal so that dead-end traversals are aborted early.
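The rejection step can be sketched as follows. In this simplified Python
version, a uniform choice from an array stands in for the uniform root-to-leaf
tree walk, and the function name is illustrative:

```python
import random

def olken_weighted_sample(records, weight, w_max, rng=random):
    """Draw one weighted sample via rejection sampling: a candidate is
    selected uniformly (standing in for a random root-to-leaf walk)
    and accepted with probability w(r) / w_max.  The expected number
    of attempts per accepted sample is w_max / w_avg."""
    while True:
        r = rng.choice(records)                  # uniform candidate
        if rng.random() < weight(r) / w_max:     # rejection test
            return r
```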

\begin{figure}
    \centering
    \includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
    \caption{\textbf{A pictorial representation of an alias
    structure}, built over a set of weighted records. Sampling is performed by
    first (1) selecting a cell by uniformly generating an integer index on
    $[0,n)$, and then (2) selecting an item by generating a
    second uniform float on $[0,1]$ and comparing it to the cell's normalized
    cutoff values. In this example, the first random number is $0$,
    corresponding to the first cell, and the second is $.7$. This is larger
    than $\nicefrac{.15}{.25}$, and so $3$ is selected as the result of the
    query. This structure allows $O(1)$ independent weighted set sampling, but
    adding a new element requires a weight adjustment to every element in the
    structure, and so is not generally possible without performing a full
    reconstruction.}
    \label{fig:alias}
\end{figure}

There also exist static data structures, referred to in this chapter as static
sampling indexes (SSIs),\footnote{
    The name SSI was established in the published version of this paper prior
    to the realization that a distinction between the terms index and data
    structure would be useful. We will continue to use the term SSI for the
    remainder of this chapter, to maintain consistency with the published
    work, but technically an SSI refers to a data structure, not an index, in
    the nomenclature established in the previous chapter.
} that are capable of answering sampling queries in
near-constant time\footnote{
    The designation ``near-constant'' is \emph{not} used in the technical
    sense of being constant to within a polylogarithmic factor (i.e.,
    $\tilde{O}(1)$). It is instead used to mean constant to within an
    additive polylogarithmic term. For example, drawing $k$ samples from $n$
    records using a near-constant approach requires $O(\log n + k)$ time, in
    contrast to a tree-traversal approach, which requires $O(k \log n)$ time.
} relative to the size of the dataset. An example of such a
structure is used in Walker's alias method~\cite{walker74,vose91}, a technique
for answering WSS queries with $O(1)$ query cost per sample, but requiring
$O(n)$ time to construct. It distributes the weight of the items across $n$
cells, with each cell partitioned between at most two items, such that the
total proportion of the cells assigned to each item is proportional to its
weight. A query selects one cell uniformly at random, then chooses one of the
(at most) two items in that cell by weight, thereby selecting items with
probability proportional to their weight in $O(1)$ time. A pictorial
representation of this structure is shown in Figure~\ref{fig:alias}.
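The construction and query procedures can be sketched in Python as follows,
using Vose's $O(n)$ construction~\cite{vose91}; the function names are
illustrative:

```python
import random

def build_alias(weights):
    """Build alias tables (prob, alias) in O(n) time (Vose's method).
    Each cell i holds item i with probability prob[i] and the item
    alias[i] with the remaining probability."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # mean-normalized weights
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # cell s: item s, overflow l
        scaled[l] -= 1.0 - scaled[s]            # charge the deficit to l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # leftovers fill a whole cell
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """O(1) per sample: pick a cell uniformly, then one of its two items."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```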

The alias method can also be used as the basis for creating SSIs capable of
answering general IQS queries, using a technique called alias
augmentation~\cite{tao22}. As a concrete example, previous
papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries
requiring $O(\log n + k)$ time, where the $\log n$ cost is paid only once per
query, after which elements can be sampled in constant time. This structure is
built by breaking the data up into disjoint chunks of size
$\nicefrac{n}{\log n}$, called \emph{fat points}, each with its own alias
structure. A B+tree is then constructed using the fat points as its leaf
nodes, with each internal node augmented by an alias structure over the total
weights of its children. This alias structure is used instead of rejection
sampling to determine the traversal path to take through the tree, after which
the alias structure of the reached fat point is used to sample a record.
Because rejection sampling is not used during the traversal, two traversals
suffice to establish the valid range of records for sampling, after which
samples can be collected without requiring per-sample traversals. More
examples of alias augmentation applied to different IQS problems can be found
in a recent survey by Tao~\cite{tao22}.
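The two-level decomposition at the heart of this structure can be sketched as
follows. This simplified Python version flattens the B+tree into a single
level, assumes the query range aligns with chunk boundaries (partially covered
end chunks require separate handling), and substitutes Python's built-in
weighted choice for the alias structures; all names are illustrative:

```python
import random

def build_fat_points(data, weights, chunk_size):
    """Partition sorted (record, weight) data into fixed-size chunks
    ('fat points'), recording each chunk's total weight.  In the real
    structure each chunk carries an alias table and the totals live in
    an augmented B+tree; rng.choices stands in for both here."""
    chunks, totals = [], []
    for i in range(0, len(data), chunk_size):
        chunk = list(zip(data[i:i + chunk_size], weights[i:i + chunk_size]))
        chunks.append(chunk)
        totals.append(sum(w for _, w in chunk))
    return chunks, totals

def wirs_query(chunks, totals, lo_chunk, hi_chunk, k, rng=random):
    """Draw k weighted samples from chunks[lo_chunk:hi_chunk], a range
    assumed to align with chunk boundaries."""
    span = list(range(lo_chunk, hi_chunk))
    span_totals = totals[lo_chunk:hi_chunk]
    out = []
    for _ in range(k):
        c = rng.choices(span, weights=span_totals)[0]    # pick a fat point
        rec, _ = rng.choices(chunks[c],
                             weights=[w for _, w in chunks[c]])[0]
        out.append(rec)                                  # pick within it
    return out
```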

There do exist specialized sampling indexes~\cite{hu14} with both efficient
sampling and support for updates, but these are restricted to specific query
types and are often highly complex structures with poor constant factors on
their sampling and update costs, limiting their practical utility. There has
also been work~\cite{hagerup93,matias03,allendorf23} on extending the alias
structure to support weight updates over a fixed set of elements. However,
these solutions do not allow insertion or deletion in the underlying dataset,
and so are not well suited to database sampling applications.

\Paragraph{The Dichotomy.} Among these techniques, there exists a clear
trade-off between efficient sampling and support for updates. Tree-traversal
based sampling solutions pay a per-sample cost that grows with the size of the
dataset, in exchange for update support. The static solutions lack support for
updates, but allow near-constant time sampling. While some data structures
exist that support both, they are restricted to highly specialized query
types. Thus, in the general case, there exists a dichotomy: existing sampling
indexes can support either data updates or efficient sampling, but not both.