\section{Background} \label{sec:background}

We will begin with a formal discussion of the sampling problem and relevant existing solutions. First, though, a clarification of terminology is in order. The nomenclature used to describe sampling in the literature is rather inconsistent, and so we'll first define all of the relevant terms precisely.\footnote{ As an amusing aside, this inconsistency actually resulted in a significant miscommunication between myself and my advisor in the early days of the project, resulting in a lot of time spent debugging a performance problem that didn't actually exist! } In this chapter, we'll use the term \emph{sample} to indicate a single record selected by a sampling operation, and a collection of these samples will be called a \emph{sample set}. The number of samples within a sample set is the \emph{sample size}. The term \emph{sampling} is used to indicate the selection of either a single sample or a sample set; the specific usage should be clear from context.

In each of the problems considered, sampling can be performed either with-replacement or without-replacement. Sampling with-replacement means that a record that has been included in the sample set for a given sampling query is ``replaced'' into the dataset and allowed to be sampled again. Sampling without-replacement does not ``replace'' the record, and so each individual record can be included in a sample set at most once for a given query. The data structures that will be discussed support sampling with-replacement, and sampling without-replacement can be implemented using a constant number of with-replacement sampling operations followed by a deduplication step~\cite{hu15} (a simple illustrative sketch of this reduction appears at the end of this subsection), so this chapter will focus exclusively on the with-replacement case.

\subsection{Independent Sampling Problem}
When conducting sampling, it is often desirable for the drawn samples to have \emph{statistical independence} and for the distribution of records in the sample set to match the distribution of the source data set. This requires that the sampling of a record does not affect the probability of any other record being sampled in the future. Such sample sets are said to be drawn i.i.d.\ (independently and identically distributed). Throughout this chapter, the term ``independent'' will be used to describe both statistical independence and identical distribution. Independence of sample sets is important because many useful statistical results are derived under the assumption that this condition holds. For example, it is a requirement for the application of statistical tools such as the Central Limit Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. A failure to maintain independence in sampling invalidates any guarantees provided by these statistical methods.

In the context of databases, it is also common to discuss a more general version of the sampling problem, called \emph{independent query sampling} (IQS)~\cite{hu14}. In IQS, a sample set of a specified size is constructed from the result set of a database query. In this context, it isn't enough to ensure that individual records are sampled independently; the sample sets returned by repeated queries must also be independent of one another. This precludes, for example, caching a sample set and returning it to multiple repetitions of the same query. This inter-query independence provides a variety of useful properties, such as fairness and representativeness of query results~\cite{tao22}.
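As an aside, the reduction from without-replacement to with-replacement sampling mentioned above can be illustrated with a minimal sketch for the unweighted case: drawing with-replacement samples and discarding duplicates yields a uniform without-replacement sample set. Note that this simple loop is for illustration only, and is not the constant-round algorithm of~\cite{hu15}.
\begin{verbatim}
import random

def sample_without_replacement(data, k):
    # Repeatedly draw with-replacement samples, deduplicating
    # until k distinct records have been collected. Assumes the
    # data contains at least k distinct records, so that the
    # loop terminates.
    seen = set()
    while len(seen) < k:
        seen.add(random.choice(data))  # one with-replacement draw
    return list(seen)
\end{verbatim}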
A basic version of the independent sampling problem is \emph{weighted set sampling} (WSS),\footnote{ This nomenclature is adopted from Tao's recent survey of sampling techniques~\cite{tao22}. This problem is also called \emph{weighted random sampling} (WRS) in the literature. } in which each record is associated with a weight that determines its probability of being sampled. More formally, WSS is defined as:
\begin{definition}[Weighted Set Sampling~\cite{walker74}]
Let $D$ be a set of data whose members are associated with positive weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted set sampling query returns $k$ independent random samples from $D$, with each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in D}w(p)}$ of being sampled.
\end{definition}
Each query returns a sample set of size $k$, rather than a single sample. Queries returning sample sets are the common case, because the robustness of analysis relies on having a sufficiently large sample size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS) problem is a special case of WSS in which every element has unit weight.

For WSS, the results are taken directly from the dataset without applying any predicates or filtering. This can be useful, but for IQS it is common for database queries to apply predicates to the data. One of the most common search problems underlying database queries is the range scan, whose sampling analogue is called \emph{independent range sampling} (IRS),
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$, with each point having equal probability of being sampled.
\end{definition}
IRS is a non-weighted sampling problem, similar to SRS. There also exists a weighted generalization, called \emph{weighted independent range sampling} (WIRS),
\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}]
Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with positive weights $w: D\to \mathbb{R}^+$. Given a query interval $q = [x, y]$ and an integer $k$, a weighted independent range sampling query returns $k$ independent samples from $D \cap q$, with each point $d \in D \cap q$ having a probability of $\frac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled.
\end{definition}
This is not an exhaustive list of sampling problems, but it covers the problems that will be directly addressed in this chapter.

\subsection{Algorithmic Solutions}
Relational database systems often have native support for IQS using SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the algorithms used to implement this operator have significant limitations, and do not allow users to maintain statistical independence of the results without also running the query being sampled from in full. Thus, users must choose between independence and performance. To maintain statistical independence, Bernoulli sampling is used. This technique requires iterating over every record in the result set of the query and selecting or rejecting each one for inclusion in the sample with a fixed probability~\cite{db2-doc}.
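As a minimal illustration (not any particular system's implementation), Bernoulli sampling over a fully materialized result set might look like the following sketch, where the inclusion probability \texttt{p} is a query parameter:
\begin{verbatim}
import random

def bernoulli_sample(result_set, p):
    # Visit every record of the (fully materialized) query result,
    # keeping each one independently with fixed probability p.
    return [r for r in result_set if random.random() < p]
\end{verbatim}
Note that the size of the returned sample set is itself random (binomially distributed in the size of the result set), rather than a fixed $k$.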
Bernoulli sampling thus requires that each record in the result set be considered, and so provides no performance benefit relative to the query being sampled from: that query must be answered in full before only some of its results are returned.\footnote{ To clarify, this is not to say that Bernoulli sampling isn't useful. It \emph{can} be used to improve the performance of queries by limiting the cardinality of intermediate results, etc. But it is not particularly useful for improving the performance of IQS queries, where the sampling is performed on the final result set of the query. } For performance, the statistical guarantees can be discarded and systematic or block sampling used instead. Systematic sampling considers only a fraction of the rows in the table being sampled from, following some particular pattern~\cite{postgres-doc}, and block sampling samples entire database pages~\cite{db2-doc}. These approaches decouple query performance from data size, but tie a given record's inclusion in the sample set directly to its physical storage location, which can introduce bias into the sample and violate statistical guarantees.

\subsection{Index-assisted Solutions}
It is possible to answer IQS queries in a manner that both preserves independence and avoids executing the query in full, through the use of specialized data structures.

\Paragraph{Olken's Method.} The classical solution is Olken's method~\cite{olken89}, which can be applied to traditional tree-based database indices. This technique performs a randomized tree traversal, selecting the pointer to follow at each node uniformly at random. This allows SRS queries to be answered at $\Theta(\log n)$ cost per sample in the sample set; thus, an IQS query with a desired sample set size of $k$ can be answered in $\Theta(k \log n)$ time. More complex IQS queries, such as weighted or predicate-filtered sampling, can be answered using the same algorithm by applying rejection sampling. To support predicates, any sampled record that violates the predicate is rejected and the sample retried. For weighted sampling, a given record $r$ is accepted into the sample with probability $\nicefrac{w(r)}{w_{max}}$, which requires an expected $\nicefrac{w_{max}}{w_{avg}}$ attempts per sample in the sample set~\cite{olken-thesis} (a sketch of this rejection step is given below). This rejection sampling can be significantly improved by adding aggregate weight tags to internal nodes, allowing rejection sampling to be performed at each step of the tree traversal so that dead-end traversals are aborted early~\cite{zhao22}. In either case, there is a performance penalty for rejected samples, with more than $k$ traversals required to obtain a sample set of size $k$.

\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
\caption{\textbf{A pictorial representation of an alias structure}, built over a set of weighted records. Sampling is performed by first (1) selecting a cell by uniformly generating an integer index on $[0,n)$, and then (2) selecting an item by generating a second uniform float on $[0,1]$ and comparing it to the cell's normalized cutoff values. In this example, the first random number is $0$, corresponding to the first cell, and the second is $.7$. This is larger than $\nicefrac{.15}{.25}$, and so $3$ is selected as the result of the query. This allows $O(1)$ independent weighted set sampling, but adding a new element requires a weight adjustment to every element in the structure, and so isn't generally possible without performing a full reconstruction.}
\label{fig:alias}
\end{figure}
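The weighted rejection step described above can be sketched as follows; the helper \texttt{uniform\_sample}, which returns a single record uniformly at random (e.g., via the randomized tree traversal), and the \texttt{weight} accessor are assumptions made for illustration, not part of any specific implementation.
\begin{verbatim}
import random

def weighted_sample_by_rejection(uniform_sample, weight, w_max):
    # Draw uniform candidates, accepting record r with probability
    # weight(r) / w_max. The expected number of attempts per
    # accepted sample is w_max / w_avg.
    while True:
        r = uniform_sample()
        if random.random() < weight(r) / w_max:
            return r
\end{verbatim}
The same loop handles predicate-filtered sampling, with the acceptance test replaced by the predicate itself.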
\Paragraph{Static Solutions.} There is also a large number of static data structures, which we'll call static sampling indices (SSIs) in this chapter,\footnote{ We used the term ``SSI'' in the original paper on which this chapter is based, which was published prior to our realization that a strong distinction between an index and a data structure would be useful. I am retaining the term SSI in this chapter for consistency with the original paper, but understand that in the terminology established in Chapter~\ref{chap:background}, SSIs are data structures, not indices. } whose sampling costs scale better with the overall data size than Olken's method. An example of such a structure is used in Walker's alias method~\cite{walker74,vose91}. This technique constructs, in $\Theta(n)$ time, a data structure capable of answering WSS queries in $\Theta(1)$ time per sample. Figure~\ref{fig:alias} shows a pictorial representation of the structure. For a set of $n$ records, it is constructed by distributing the normalized weight of all of the records across an array of $n$ cells, each of which represents at most two records. Each cell is allocated to its records in proportion to their normalized weights (e.g., a given cell may be 40\% allocated to one record and 60\% to another). To query the structure, a cell is first selected uniformly at random, and then one of its two associated records is selected with probability proportional to that record's share of the cell. This operation takes $\Theta(1)$ time, requiring only two random number generations per sample; thus, a WSS query can be answered in $\Theta(k)$ time once the structure has been built (a sketch of the construction and query procedures is given below). Unfortunately, the alias structure cannot be efficiently updated: inserting a new record changes the relative weights of \emph{all} the records, and requires fully re-partitioning the structure.

While the alias method only applies to WSS, other sampling problems can be solved by using the alias method within the context of a larger data structure, a technique called \emph{alias augmentation}~\cite{tao22}. For example, alias augmentation can be used to construct an SSI capable of answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}. This structure breaks the data into multiple disjoint partitions of size $\nicefrac{n}{\log n}$, each with an associated alias structure. A B+tree is then built, using the augmented partitions as its leaf nodes; each internal node is also augmented with an alias structure over the aggregate weights of the subtrees referenced by its pointers. Constructing this structure requires $\Theta(n)$ time (though the associated constants are quite large in practice). WIRS queries are answered by traversing the tree, first establishing the portion of the tree covering the query range, and then sampling records from that range using the alias structures attached to the nodes. More examples of alias augmentation applied to different IQS problems can be found in a recent survey by Tao~\cite{tao22}.
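To make this concrete, the following is a minimal sketch of alias structure construction and querying, following Vose's formulation~\cite{vose91}; the function and variable names are our own, and a production implementation would need more care with floating-point error.
\begin{verbatim}
import random

def build_alias(weights):
    # O(n) construction. prob[i] is the normalized cutoff for
    # cell i's "own" item; alias[i] is the cell's second item.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average becomes 1
    small = [i for i, w in enumerate(scaled) if w < 1.0]
    large = [i for i, w in enumerate(scaled) if w >= 1.0]
    prob, alias = [1.0] * n, list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # l fills the rest of cell s
        scaled[l] -= 1.0 - scaled[s]       # weight l has left to place
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: one uniform integer to pick a cell, one
    # uniform float compared against the cell's cutoff.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
\end{verbatim}
Each sample costs exactly two random number generations, independent of $n$, matching the $\Theta(1)$ per-sample bound discussed above.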
\Paragraph{Miscellanea.} There also exist specialized data structures with support for both efficient sampling and updates~\cite{hu14}, but these structures have poor constant factors and are very complex, rendering them of little practical utility. Additionally, efforts have been made to extend the alias structure with support for weight updates over a fixed set of elements~\cite{hagerup93,matias03,allendorf23}. These approaches allow only in-place weight updates, however, not the insertion or removal of records. While in principle they could be constructed over the entire domain of possible records, with the weights of non-existent records set to $0$, this is hardly practical. Thus, these structures are not suited to the database sampling applications that are of interest in this chapter.

\subsection{The Dichotomy}
A clear pattern emerges across the index-assisted techniques discussed above. Olken's method supports updates, but is inefficient compared to the SSIs because it pays a cost that grows with the data size ($\Theta(\log n)$, for tree indices) for \emph{each} sample in the sample set. The SSIs are more efficient for sampling, typically paying such a cost only once per sample set (if at all), but fail to support updates. Thus, there appears to be a general dichotomy: existing sampling data structures support either updates or efficient sampling, but generally not both. It will be the purpose of this chapter to resolve this dichotomy. In particular, we seek to develop structures with the following desiderata,
\begin{enumerate}
\item Support data updates (including deletes) with average performance similar to that of a standard B+tree.
\item Support IQS queries that do not pay a per-sample cost proportional to some function of the data size; in other words, $k$ should \emph{not} be multiplied by any function of $n$ in the query cost function.
\item Provide the user with some basic performance-tuning capability.
\end{enumerate}