\section{Background}
\label{sec:background}

This section formalizes the sampling problem and describes relevant existing
solutions. Before discussing these topics, though, a clarification of
terminology is in order. The nomenclature used to describe sampling varies
slightly throughout the literature. In this chapter, the term \emph{sample} is
used to indicate a single record selected by a sampling operation, and a
collection of these samples is called a \emph{sample set}; the number of
samples within a sample set is the \emph{sample size}. The term \emph{sampling}
is used to indicate the selection of either a single sample or a sample set;
the specific usage should be clear from context.

\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often
desirable for the drawn samples to have \emph{statistical independence}. This
requires that the sampling of a record does not affect the probability of any
other record being sampled in the future. Independence is a requirement for the
application of statistical tools such as the Central Limit
Theorem~\cite{bulmer79}, which is the basis for many concentration bounds.
A failure to maintain independence in sampling invalidates any guarantees
provided by these statistical methods.

In each of the problems considered, sampling can be performed either with
replacement (WR) or without replacement (WoR). It is possible to answer any WoR
sampling query using a constant number of WR queries, followed by a
deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR
sampling.

A basic version of the independent sampling problem is \emph{weighted set
sampling} (WSS),\footnote{
    This nomenclature is adopted from Tao's recent survey of sampling
    techniques~\cite{tao22}.
    This problem is also called \emph{weighted random sampling} (WRS) in the
    literature.
}
in which each record is associated with a weight that determines its
probability of being sampled. More formally, WSS is defined as:
\begin{definition}[Weighted Set Sampling~\cite{walker74}]
    Let $D$ be a set of data whose members are associated with positive
    weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted
    set sampling query returns $k$ independent random samples from $D$, with
    each data point $d \in D$ having probability $\frac{w(d)}{\sum_{p\in
    D}w(p)}$ of being sampled.
\end{definition}
Each query returns a sample set of size $k$, rather than a single sample.
Queries returning sample sets are the common case, because the robustness of
analysis relies on having a sufficiently large sample
size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
problem is the special case of WSS in which every element has unit weight.

In the context of databases, it is also common to discuss a more general
version of the sampling problem, called \emph{independent query sampling}
(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the
result set of a database query. In this context, it is insufficient to merely
ensure that individual records are sampled independently; the sample sets
returned by repeated IQS queries must be independent as well. This provides a
variety of useful properties, such as fairness and representativeness of query
results~\cite{tao22}. As a concrete example, consider simple random sampling
over the result set of a single-dimensional range reporting query. This is
called \emph{independent range sampling} (IRS), and is formally defined as:

\begin{definition}[Independent Range Sampling~\cite{tao22}]
    Let $D$ be a set of $n$ points in $\mathbb{R}$.
    Given a query
    interval $q = [x, y]$ and an integer $k$, an independent range sampling
    query returns $k$ independent samples from $D \cap q$, with each point
    having equal probability of being sampled.
\end{definition}
A generalization of IRS exists, called \emph{Weighted Independent Range
Sampling} (WIRS)~\cite{afshani17}, which is analogous to WSS. Each point in
$D$ is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples
are drawn from the range query result $D \cap q$ such that each data point $d$
has probability $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled.

\Paragraph{Existing Solutions.} While many sampling techniques exist, few are
supported in practical database systems. The \texttt{TABLESAMPLE} operator,
provided by SQL in all major DBMS implementations~\cite{postgres-doc},
requires either a linear scan (e.g., Bernoulli sampling), which results in
high sample retrieval costs, or relaxed statistical guarantees (e.g., the
block sampling~\cite{postgres-doc} used in PostgreSQL).

Index-assisted sampling solutions have been studied extensively. Olken's
method~\cite{olken89} is a classical solution to independent sampling
problems. This algorithm operates upon traditional search trees, such as the
B+tree commonly used as a database index. It conducts a uniform random walk
down the tree from the root to a leaf, resulting in an $O(\log n)$ sampling
cost for each returned record. Should weighted samples be desired, rejection
sampling can be performed: a sampled record $r$ is accepted with probability
$\nicefrac{w(r)}{w_{max}}$, requiring an expected
$\nicefrac{w_{max}}{w_{avg}}$ sampling attempts per element of the sample set.
Olken's method can also be extended to support general IQS by rejecting all
sampled records that fail to satisfy the query predicate.
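The rejection step described above can be sketched in a few lines. This is a
minimal illustration rather than Olken's full algorithm; the names
\texttt{records}, \texttt{weight}, and \texttt{uniform\_sample} are
assumptions introduced here, with \texttt{uniform\_sample} standing in for the
$O(\log n)$ root-to-leaf random walk that a real index would perform.

```python
import random

def weighted_sample_wr(records, weight, k, uniform_sample=None):
    """Draw k weighted samples, with replacement, via rejection sampling.

    All parameter names are illustrative (not from the original text).
    `uniform_sample` stands in for a uniform per-record sampler, e.g. an
    O(log n) root-to-leaf random walk on a B+tree.
    """
    if uniform_sample is None:
        # Fallback for this sketch: uniform choice over an in-memory list.
        uniform_sample = lambda: random.choice(records)
    w_max = max(weight(r) for r in records)
    sample_set = []
    while len(sample_set) < k:
        r = uniform_sample()                      # uniform candidate record
        if random.random() < weight(r) / w_max:   # accept w.p. w(r)/w_max
            sample_set.append(r)                  # expected w_max/w_avg tries per sample
    return sample_set
```

General IQS support would add one more rejection: discard any candidate that
fails the query predicate before the weight-based acceptance test.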
It can be accelerated by adding aggregated weight tags to the internal
nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed
during the tree traversal, so that dead-end traversals are aborted early.

\begin{figure}
    \centering
    \includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
    \caption{\textbf{A pictorial representation of an alias structure}, built
    over a set of weighted records. Sampling is performed by first (1)
    selecting a cell by uniformly generating an integer index on $[0,n)$, and
    then (2) selecting an item by generating a second uniform float on $[0,1]$
    and comparing it to the cell's normalized cutoff values. In this example,
    the first random number is $0$, corresponding to the first cell, and the
    second is $.7$. This is larger than $\nicefrac{.15}{.25}$, and so $3$ is
    selected as the result of the query. This structure allows $O(1)$
    independent weighted set sampling, but adding a new element requires a
    weight adjustment to every element in the structure, and so is not
    generally possible without performing a full reconstruction.}
    \label{fig:alias}
\end{figure}

There also exist static data structures, referred to in this chapter as static
sampling indexes (SSIs),\footnote{
    The name SSI was established in the published version of this paper prior
    to the realization that a distinction between the terms index and data
    structure would be useful. We will continue to use the term SSI for the
    remainder of this chapter to maintain consistency with the published work,
    though technically an SSI refers to a data structure, not an index, in the
    nomenclature established in the previous chapter.
} that are capable of answering sampling queries in near-constant
time\footnote{
    The designation ``near-constant'' is \emph{not} used in the technical
    sense of being constant to within a polylogarithmic factor (i.e.,
    $\tilde{O}(1)$).
    It is instead used to mean constant to within an additive polylogarithmic
    term, i.e., $f(n) \in O(\log n + 1)$. For example, drawing $k$ samples
    from $n$ records using a near-constant approach would require $O(\log n +
    k)$ time, in contrast to a tree-traversal approach, which would require
    $O(k\log n)$ time.
} relative to the size of the dataset. An example of such a structure is used
in Walker's alias method~\cite{walker74,vose91}, a technique for answering WSS
queries with $O(1)$ query cost per sample, but requiring $O(n)$ time to
construct. It distributes the weight of the items across $n$ cells, each of
which is shared by at most two items, such that the total proportion of the
structure assigned to an item is proportional to its weight. A query selects
one cell uniformly at random, then chooses between the (at most) two items
within that cell by weight, thereby selecting items with probability
proportional to their weights in $O(1)$ time. A pictorial representation of
this structure is shown in Figure~\ref{fig:alias}.

The alias method can also be used as the basis for creating SSIs capable of
answering general IQS queries, using a technique called alias
augmentation~\cite{tao22}. As a concrete example, previous
papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries
requiring $O(\log n + k)$ time, where the $\log n$ cost is paid only once per
query, after which elements can be sampled in constant time. This structure is
built by breaking the data up into disjoint chunks of size
$\nicefrac{n}{\log n}$, called \emph{fat points}, each with its own alias
structure. A B+tree is then constructed using the fat points as its leaf
nodes. The internal nodes are augmented with an alias structure over the total
weights of their children. This alias structure is used instead of rejection
sampling to determine the traversal path through the tree, and the alias
structure of the reached fat point is then used to sample a record.
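The two-step lookup illustrated in Figure~\ref{fig:alias}, together with the
$O(n)$ construction, can be sketched as follows. This is a minimal sketch
using Vose's initialization method~\cite{vose91}; the function names are
illustrative, and only the static case described above is handled.

```python
import random

def build_alias(weights):
    """Build alias tables (prob, alias) from positive weights in O(n).

    Cell i holds item i with normalized cutoff prob[i]; the remainder of
    the cell belongs to item alias[i], so each cell is shared by at most
    two items.
    """
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # scale so each cell holds one unit
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l       # cell s shared by items s and l
        scaled[l] -= 1.0 - scaled[s]           # l donated the remainder of cell s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                    # leftovers each fill a whole cell
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """O(1) lookup: (1) pick a cell uniformly, (2) pick an item by cutoff."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

Note that inserting a new element changes every cell's scaled weight, which is
why the structure must be rebuilt from scratch on updates.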
Because rejection sampling is not used during the traversal, two traversals
suffice to establish the valid range of records for sampling, after which
samples can be collected without requiring per-sample traversals. More
examples of alias augmentation applied to different IQS problems can be found
in a recent survey by Tao~\cite{tao22}.

There do exist specialized sampling indexes~\cite{hu14} with both efficient
sampling and support for updates, but these are restricted to specific query
types and are often very complex structures, with poor constant factors
associated with their sampling and update costs, and so are of limited
practical utility. There has also been
work~\cite{hagerup93,matias03,allendorf23} on extending the alias structure to
support weight updates over a fixed set of elements. However, these solutions
do not allow insertion or deletion in the underlying dataset, and so are
poorly suited to database sampling applications.

\Paragraph{The Dichotomy.} Among these techniques, there exists a clear
trade-off between efficient sampling and support for updates.
Tree-traversal-based sampling solutions pay a per-sample cost that grows with
the size of the dataset, in exchange for update support. The static solutions
lack support for updates, but provide near-constant-time sampling. While some
data structures support both, they are restricted to highly specialized query
types. Thus, in the general case, there exists a dichotomy: existing sampling
indexes can support either data updates or efficient sampling, but not both.