\section{Background} \label{sec:background} This section formalizes the sampling problem and describes relevant existing solutions. Before discussing these topics, though, a clarification of definitions is in order. The nomenclature used to describe sampling varies slightly throughout the literature. In this chapter, the term \emph{sample} is used to indicate a single record selected by a sampling operation, and a collection of these samples is called a \emph{sample set}; the number of samples within a sample set is the \emph{sample size}. The term \emph{sampling} is used to indicate the selection of either a single sample or a sample set; the specific usage should be clear from context. \Paragraph{Independent Sampling Problem.} When conducting sampling, it is often desirable for the drawn samples to have \emph{statistical independence}. This requires that the sampling of a record does not affect the probability of any other record being sampled in the future. Independence is a requirement for the application of statistical tools such as the Central Limit Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. A failure to maintain independence in sampling invalidates any guarantees provided by these statistical methods. In each of the problems considered, sampling can be performed either with replacement (WR) or without replacement (WoR). It is possible to answer any WoR sampling query using a constant number of WR queries, followed by a deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR sampling. A basic version of the independent sampling problem is \emph{weighted set sampling} (WSS),\footnote{ This nomenclature is adopted from Tao's recent survey of sampling techniques~\cite{tao22}. This problem is also called \emph{weighted random sampling} (WRS) in the literature. } in which each record is associated with a weight that determines its probability of being sampled.
More formally, WSS is defined as: \begin{definition}[Weighted Set Sampling~\cite{walker74}] Let $D$ be a set of data whose members are associated with positive weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted set sampling query returns $k$ independent random samples from $D$ with each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in D}w(p)}$ of being sampled. \end{definition} Each query returns a sample set of size $k$, rather than a single sample. Queries returning sample sets are the common case, because the robustness of analysis relies on having a sufficiently large sample size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS) problem is a special case of WSS, where every element has unit weight. In the context of databases, it is also common to discuss a more general version of the sampling problem, called \emph{independent query sampling} (IQS)~\cite{hu14}. An IQS query samples a specified number of records from the result set of a database query. In this context, it is insufficient to merely ensure individual records are sampled independently; the sample sets returned by repeated IQS queries must be independent as well. This provides a variety of useful properties, such as fairness and representativeness of query results~\cite{tao22}. As a concrete example, consider simple random sampling on the result set of a single-dimensional range reporting query. This is called independent range sampling (IRS), and is formally defined as: \begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. \end{definition} A generalization of IRS exists, called \emph{Weighted Independent Range Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS. 
Each point in $D$ is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are drawn from the range query results $D \cap q$ such that each data point $d$ has a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled. \Paragraph{Existing Solutions.} While many sampling techniques exist, few are supported in practical database systems. The existing \texttt{TABLESAMPLE} operator provided by SQL in all major DBMS implementations~\cite{postgres-doc} either requires a linear scan (e.g., Bernoulli sampling), resulting in high sample retrieval costs, or relaxes its statistical guarantees (e.g., the block sampling~\cite{postgres-doc} used in PostgreSQL). Index-assisted sampling solutions have been studied extensively. Olken's method~\cite{olken89} is a classical solution to independent sampling problems. This algorithm operates upon traditional search trees, such as the B+tree used commonly as a database index. It conducts a uniformly random walk down the tree from the root to a leaf, resulting in an $O(\log n)$ sampling cost for each returned record. Should weighted samples be desired, rejection sampling can be performed. A sampled record, $r$, is accepted with probability $\nicefrac{w(r)}{w_{max}}$, requiring an expected $\nicefrac{w_{max}}{w_{avg}}$ attempts per element of the sample set. Olken's method can also be extended to support general IQS by rejecting all sampled records failing to satisfy the query predicate. It can be accelerated by adding aggregated weight tags to internal nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed during the tree traversal, aborting dead-end traversals early. \begin{figure} \centering \includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf} \caption{\textbf{A pictorial representation of an alias structure}, built over a set of weighted records.
Sampling is performed by first (1) selecting a cell by uniformly generating an integer index on $[0,n)$, and then (2) selecting an item by generating a second uniform float on $[0,1]$ and comparing it to the cell's normalized cutoff values. In this example, the first random number is $0$, corresponding to the first cell, and the second is $.7$. This is larger than $\nicefrac{.15}{.25}$, and so $3$ is selected as the result of the query. This allows $O(1)$ independent weighted set sampling, but adding a new element requires a weight adjustment to every element in the structure, and so is not generally possible without performing a full reconstruction.} \label{fig:alias} \end{figure} There also exist static data structures, referred to in this chapter as static sampling indexes (SSIs)\footnote{ The name SSI was established in the published version of this paper prior to the realization that a distinction between the terms index and data structure would be useful. We will continue to use the term SSI for the remainder of this chapter, to maintain consistency with the published work, but technically an SSI refers to a data structure, not an index, in the nomenclature established in the previous chapter. }, that are capable of answering sampling queries in near-constant time\footnote{ The designation ``near-constant'' is \emph{not} used in the technical sense of being constant to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean constant to within an additive polylogarithmic term, i.e., $f(n) \in O(\log n + 1)$. For example, drawing $k$ samples from $n$ records using a near-constant approach would require $O(\log n + k)$ time, in contrast to a tree-traversal approach, which would require $O(k\log n)$ time. } relative to the size of the dataset.
An example of such a structure is used in Walker's alias method \cite{walker74,vose91}, a technique for answering WSS queries with $O(1)$ query cost per sample, but requiring $O(n)$ time to construct. It distributes the weight of the items across $n$ cells, with each cell partitioned between at most two items, such that the total share of the cells assigned to each item is proportional to its weight. A query selects one cell uniformly at random, then chooses one of the two items in the cell by weight, thereby selecting items with probability proportional to their weight in $O(1)$ time. A pictorial representation of this structure is shown in Figure~\ref{fig:alias}. The alias method can also be used as the basis for creating SSIs capable of answering general IQS queries using a technique called alias augmentation~\cite{tao22}. As a concrete example, previous papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries requiring $O(\log n + k)$ time, where the $\log n$ cost is paid only once per query, after which elements can be sampled in constant time. This structure is built by breaking the data up into disjoint chunks of size $\nicefrac{n}{\log n}$, called \emph{fat points}, each with an alias structure. A B+tree is then constructed, using the fat points as its leaf nodes. The internal nodes are augmented with an alias structure over the total weights of their children. This alias structure is used instead of rejection sampling to determine the traversal path to take through the tree, and then the alias structure of the fat point is used to sample a record. Because rejection sampling is not used during the traversal, two traversals suffice to establish the valid range of records for sampling, after which samples can be collected without requiring per-sample traversals. More examples of alias augmentation applied to different IQS problems can be found in a recent survey by Tao~\cite{tao22}.
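To make the alias structure concrete, the following Python sketch implements the $O(n)$ construction (using Vose's variant of the algorithm) and the $O(1)$ per-sample query; the function names are illustrative and not taken from any particular system.

```python
import random

def build_alias(weights):
    """Construct the alias tables in O(n) time (Vose's method).

    Returns (prob, alias): prob[i] is the normalized cutoff for cell i,
    and alias[i] is the second item sharing cell i.
    """
    n = len(weights)
    total = sum(weights)
    # Scale weights so that the average cell holds exactly weight 1.
    scaled = [w * n / total for w in weights]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob = [0.0] * n
    alias = [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]      # item s keeps scaled[s] of its cell...
        alias[s] = l             # ...and item l fills the remainder
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:      # leftover cells are owned by one item
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """Draw one weighted sample in O(1) time."""
    i = rng.randrange(len(prob))               # (1) pick a cell uniformly
    return i if rng.random() < prob[i] else alias[i]  # (2) cutoff test
```

Note that the two-level structure is what makes updates hard: changing any one weight changes the scaling factor $\nicefrac{n}{\sum w}$, and with it potentially every cell's partition, which is why insertion generally forces a full reconstruction.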
There do exist specialized sampling indexes~\cite{hu14} with both efficient sampling and support for updates, but these are restricted to specific query types and are often very complex structures, with poor constant factors associated with their sampling and update costs, and so are of limited practical utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on extending the alias structure to support weight updates over a fixed set of elements. However, these solutions do not allow insertion or deletion in the underlying dataset, and so are not well suited to database sampling applications. \Paragraph{The Dichotomy.} Among these techniques, there exists a clear trade-off between efficient sampling and support for updates. Tree-traversal based sampling solutions pay a per-sample cost that scales with the size of the dataset, in exchange for update support. The static solutions lack support for updates, but support near-constant time sampling. While some data structures exist with support for both, these are restricted to highly specialized query types. Thus, in the general case, there exists a dichotomy: existing sampling indexes can support either data updates or efficient sampling, but not both.
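For contrast with the static alias structure, the update-friendly side of this dichotomy can be sketched with a binary-indexed (Fenwick) tree, used here as a minimal stand-in for an aggregate-weight-tagged search tree: weight updates cost $O(\log n)$, but every sample also pays an $O(\log n)$ traversal. The class below is an illustrative sketch, not drawn from any of the cited systems.

```python
import random

class FenwickSampler:
    """Weighted set sampling over an implicit binary tree of weight
    aggregates: O(log n) per weight update, O(log n) per sample."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-indexed partial sums
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to the weight of item i in O(log n) time."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Total weight of all items, in O(log n) time."""
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, rng=random):
        """Draw one weighted sample by descending the implicit tree to
        the item whose prefix-weight interval contains a uniform draw.
        Unlike the alias structure, every sample repeats this walk."""
        r = rng.random() * self.total()
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < r:
                r -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos  # 0-indexed item
```

Drawing $k$ samples with this structure costs $O(k \log n)$, versus $O(n)$ rebuild-plus-$O(k)$ sampling for the alias structure, which is precisely the trade-off the dichotomy describes.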