| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-04 16:43:45 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-04 16:43:45 -0400 |
| commit | eb519d35d7f11427dd5fc877130b02478f0da80d (patch) | |
| tree | 2eb5bc349c82517fdc6484fce71c862b92b0213b /chapters/sigmod23/background.tex | |
| parent | 873fd659e45e80fe9e229d3d85b3c4c99fb2c121 (diff) | |
| download | dissertation-eb519d35d7f11427dd5fc877130b02478f0da80d.tar.gz | |
Began re-writing/updating the sampling extension stuff
Diffstat (limited to 'chapters/sigmod23/background.tex')
| -rw-r--r-- | chapters/sigmod23/background.tex | 340 |
1 files changed, 208 insertions, 132 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex index 58324bd..ad89e03 100644 --- a/chapters/sigmod23/background.tex +++ b/chapters/sigmod23/background.tex @@ -1,31 +1,63 @@ \section{Background} \label{sec:background} -This section formalizes the sampling problem and describes relevant existing -solutions. Before discussing these topics, though, a clarification of -definition is in order. The nomenclature used to describe sampling varies -slightly throughout the literature. In this chapter, the term \emph{sample} is -used to indicate a single record selected by a sampling operation, and a -collection of these samples is called a \emph{sample set}; the number of -samples within a sample set is the \emph{sample size}. The term \emph{sampling} -is used to indicate the selection of either a single sample or a sample set; -the specific usage should be clear from context. - - -\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often -desirable for the drawn samples to have \emph{statistical independence}. This -requires that the sampling of a record does not affect the probability of any -other record being sampled in the future. Independence is a requirement for the -application of statistical tools such as the Central Limit -Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. -A failure to maintain independence in sampling invalidates any guarantees -provided by these statistical methods. - -In each of the problems considered, sampling can be performed either with -replacement (WR) or without replacement (WoR). It is possible to answer any WoR -sampling query using a constant number of WR queries, followed by a -deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR -sampling. +We will begin with a formal discussion of the sampling problem and +relevant existing solutions. First, though, a clarification of definition +is in order.
The nomenclature used to describe sampling in the literature +is rather inconsistent, and so we'll first specifically define all of +the relevant terms.\footnote{ + As an amusing footnote, this problem actually resulted in a + significant miscommunication between myself and my advisor in the + early days of the project, resulting in a lot of time being expended + on performance debugging a problem that didn't actually exist! +} +In this chapter, we'll use the term \emph{sample} to indicate a +single record selected by a sampling operation, and a collection of +these samples will be called a \emph{sample set}. The number of samples +within a sample set is the \emph{sample size}. The term \emph{sampling} +is used to indicate the selection of either a single sample or a sample +set; the specific usage should be clear from context. + +In each of the problems considered, sampling can be performed either +with replacement or without replacement. Sampling with replacement +means that a record that has been included in the sample set for a given +sampling query is ``replaced'' into the dataset and allowed to be sampled +again. Sampling without replacement does not ``replace'' the record, +and so each individual record can only be included within a sample +set once for a given query. The data structures that will be discussed +support sampling with replacement, and sampling without replacement can +be implemented using a constant number of with replacement sampling +operations, followed by a deduplication step~\cite{hu15}, so this chapter +will focus exclusively on the with replacement case. + +\subsection{Independent Sampling Problem} + +When conducting sampling, it is often desirable for the drawn samples to +have \emph{statistical independence} and for the distribution of records +in the sample set to match the distribution of the source data set. This +requires that the sampling of a record does not affect the probability of +any other record being sampled in the future.
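The reduction described above, from without-replacement to with-replacement sampling followed by deduplication, can be sketched as follows. This is an illustrative Python sketch assuming distinct records; `sample_wr` stands in for whatever with-replacement primitive the underlying structure provides, and the names are our own:

```python
import random

def sample_wr(data, k):
    # With-replacement primitive: each of the k draws is independent
    # and uniform, so a record may appear multiple times.
    return [random.choice(data) for _ in range(k)]

def sample_wor(data, k):
    # Without-replacement sampling built from repeated with-replacement
    # queries followed by deduplication. In expectation, only a
    # constant number of rounds is needed when k is small relative
    # to the dataset size.
    assert k <= len(set(data))
    seen = []
    while len(seen) < k:
        for r in sample_wr(data, k):
            if r not in seen:
                seen.append(r)
            if len(seen) == k:
                break
    return seen
```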
Such sample sets are said +to be drawn i.i.d.\ (independently and identically distributed). Throughout +this chapter, the term ``independent'' will be used to describe both +statistical independence and identical distribution. + +Independence of sample sets is important because many useful statistical +results are derived from assuming that this condition holds. For example, +it is a requirement for the application of statistical tools such as +the Central Limit Theorem~\cite{bulmer79}, which is the basis for many +concentration bounds. A failure to maintain independence in sampling +invalidates any guarantees provided by these statistical methods. + +In the context of databases, it is also common to discuss a more +general version of the sampling problem, called \emph{independent query +sampling} (IQS)~\cite{hu14}. In IQS, a sample set is constructed from a +specified number of records in the result set of a database query. In +this context, it isn't enough to ensure that individual records are +sampled independently; the sample sets from repeated queries must also be +independent. This precludes, for example, caching and returning the same +sample set to multiple repetitions of the same query. This inter-query +independence provides a variety of useful properties, such as fairness +and representativeness of query results~\cite{tao22}. A basic version of the independent sampling problem is \emph{weighted set sampling} (WSS),\footnote{ @@ -43,22 +75,18 @@ as: each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in D}w(p)}$ of being sampled. \end{definition} -Each query returns a sample set of size $k$, rather than a -single sample. Queries returning sample sets are the common case, because the -robustness of analysis relies on having a sufficiently large sample +Each query returns a sample set of size $k$, rather than a single +sample.
Queries returning sample sets are the common case, because +the robustness of analysis relies on having a sufficiently large sample size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS) problem is a special case of WSS, where every element has unit weight. -In the context of databases, it is also common to discuss a more general -version of the sampling problem, called \emph{independent query sampling} -(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the -result set of a database query. In this context, it is insufficient to merely -ensure individual records are sampled independently; the sample sets returned -by repeated IQS queries must be independent as well. This provides a variety of -useful properties, such as fairness and representativeness of query -results~\cite{tao22}. As a concrete example, consider simple random sampling on -the result set of a single-dimensional range reporting query. This is -called independent range sampling (IRS), and is formally defined as: +For WSS, the results are taken directly from the dataset without applying +any predicates or filtering. This can be useful; however, for IQS it is +common for database queries to apply predicates to the data. A very common +search problem underlying database queries is range scanning, +which can be formulated as a sampling problem called \emph{independent +range sampling} (IRS), \begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query @@ -66,35 +94,73 @@ called independent range sampling (IRS), and is formally defined as: query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. \end{definition} -A generalization of IRS exists, called \emph{Weighted Independent Range -Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS.
Each point in $D$ -is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are -drawn from the range query results $D \cap q$ such that each data point has a -probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled. - - -\Paragraph{Existing Solutions.} While many sampling techniques exist, -few are supported in practical database systems. The existing -\texttt{TABLESAMPLE} operator provided by SQL in all major DBMS -implementations~\cite{postgres-doc} requires either a linear scan (e.g., -Bernoulli sampling) that results in high sample retrieval costs, or relaxed -statistical guarantees (e.g., block sampling~\cite{postgres-doc} used in -PostgreSQL). - -Index-assisted sampling solutions have been studied -extensively. Olken's method~\cite{olken89} is a classical solution to -independent sampling problems. This algorithm operates upon traditional search -trees, such as the B+tree used commonly as a database index. It conducts a -random walk on the tree uniformly from the root to a leaf, resulting in a -$O(\log n)$ sampling cost for each returned record. Should weighted samples be -desired, rejection sampling can be performed. A sampled record, $r$, is -accepted with probability $\nicefrac{w(r)}{w_{max}}$, with an expected -number of $\nicefrac{w_{max}}{w_{avg}}$ samples to be taken per element in the -sample set. Olken's method can also be extended to support general IQS by -rejecting all sampled records failing to satisfy the query predicate. It can be -accelerated by adding aggregated weight tags to internal -nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed -during the tree-traversal to abort dead-end traversals early. + +IRS is a non-weighted sampling problem, similar to SRS. 
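For intuition, over a static sorted array an IRS query can be answered by using binary search to locate $D \cap q$ and then drawing $k$ independent uniform samples from that contiguous range. The following Python sketch is purely illustrative (the function and argument names are our own, and no update support is provided):

```python
import bisect
import random

def irs_query(sorted_data, x, y, k):
    # Locate the contiguous run of records in D ∩ [x, y] via binary
    # search (the logarithmic part of the cost), then draw k
    # independent uniform samples from it (the O(k) part).
    lo = bisect.bisect_left(sorted_data, x)
    hi = bisect.bisect_right(sorted_data, y)
    if lo == hi:
        return []  # empty query range
    return [sorted_data[random.randrange(lo, hi)] for _ in range(k)]
```

Because the samples are drawn independently and uniformly from the range, repeated queries over the same interval return independent sample sets, as IQS requires.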
There also exists +a weighted generalization, called \emph{weighted independent range +sampling} (WIRS), + +\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}] + Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with + positive weights $w: D\to \mathbb{R}^+$. Given a query + interval $q = [x, y]$ and an integer $k$, a weighted independent range sampling + query returns $k$ independent samples from $D \cap q$ with each + point having a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ + of being sampled. +\end{definition} + +This is not an exhaustive list of sampling problems, but it is the list +of problems that will be directly addressed within this chapter. + +\subsection{Algorithmic Solutions} + +Relational database systems often have native support for IQS using +SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the +algorithms used to implement this operator have significant limitations: +users must choose between statistical independence and performance. + +To maintain statistical independence, Bernoulli sampling is used. This +technique requires iterating over every record in the result set of the +query and selecting or rejecting it for inclusion within the sample +with a fixed probability~\cite{db2-doc}. Because every record in the +result set must be considered, this approach provides no performance +benefit relative to the query being sampled from, as that query must be +answered in full anyway before returning only some of the results. + +For performance, the statistical guarantees can be discarded and +systematic or block sampling used instead. Systematic sampling considers +only a fraction of the rows in the table being sampled from, following +some particular pattern~\cite{postgres-doc}, and block sampling samples +entire database pages~\cite{db2-doc}.
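The trade-off between the Bernoulli and block strategies described above can be sketched as follows. This is an illustrative Python sketch only (real systems implement both inside the storage engine, not over in-memory lists):

```python
import random

def bernoulli_sample(result_set, p):
    # Statistically sound: every record is considered independently
    # with probability p, but the full result set must be scanned.
    return [r for r in result_set if random.random() < p]

def block_sample(pages, p):
    # Block-style sampling: keep or skip whole pages of records. Fast,
    # since skipped pages are never read, but a record's chance of
    # inclusion is tied to its physical page, which can bias the sample.
    out = []
    for page in pages:
        if random.random() < p:
            out.extend(page)
    return out
```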
These allow for query performance +to be decoupled from data size, but tie a given record's inclusion in the +sample set directly to its physical storage location, which can introduce +bias into the sample and violate statistical guarantees. + +\subsection{Index-assisted Solutions} +It is possible to answer IQS queries in a manner that both preserves +independence and avoids executing the query in full, through the use +of specialized data structures. + +\Paragraph{Olken's Method.} +The classical solution is Olken's method~\cite{olken89}, +which can be applied to traditional tree-based database indices. This +technique performs a randomized tree traversal, selecting the pointer to +follow at each node uniformly at random. This allows SRS queries to be +answered at $\Theta(\log n)$ cost per sample in the sample set. Thus, +for an IQS query with a desired sample set size of $k$, Olken's method +can provide a sample set in $\Theta(k \log n)$ time. + +More complex IQS queries, such as weighted or predicate-filtered sampling, +can be answered using the same algorithm by applying rejection sampling. +To support predicates, any sampled records that violate the predicate can +be rejected and retried. For weighted sampling, a given record $r$ will +be accepted into the sample with $\nicefrac{w(r)}{w_{max}}$ probability. +This will require an expected number of $\nicefrac{w_{max}}{w_{avg}}$ +attempts per sample in the sample set~\cite{olken-thesis}. This rejection +sampling can be significantly improved by adding aggregated weight tags to +internal nodes, allowing rejection sampling to be performed at each step +of the tree traversal to abort dead-end traversals early~\cite{zhao22}. In +either case, there will be a performance penalty to rejecting samples, +requiring greater than $k$ traversals to obtain a sample set of size $k$. \begin{figure} \centering @@ -115,68 +181,78 @@ during the tree-traversal to abort dead-end traversals early.
\end{figure} -There also exist static data structures, referred to in this chapter as static -sampling indexes (SSIs)\footnote{ -The name SSI was established in the published version of this paper prior to the -realization that a distinction between the terms index and data structure would -be useful. We'll continue to use the term SSI for the remainder of this chapter, -to maintain consistency with the published work, but technically an SSI refers to - a data structure, not an index, in the nomenclature established in the previous - chapter. - }, that are capable of answering sampling queries in -near-constant time\footnote{ - The designation -``near-constant'' is \emph{not} used in the technical sense of being constant -to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean -constant to within an additive polylogarithmic term, i.e., $f(x) \in O(\log n + -1)$. -%For example, drawing $k$ samples from $n$ records using a near-constant -%approach would require $O(\log n + k)$ time. This is in contrast to a -%tree-traversal approach, which would require $O(k\log n)$ time. -} relative to the size of the dataset. An example of such a -structure is used in Walker's alias method \cite{walker74,vose91}, a technique -for answering WSS queries with $O(1)$ query cost per sample, but requiring -$O(n)$ time to construct. It distributes the weight of items across $n$ cells, -where each cell is partitioned into at most two items, such that the total -proportion of each cell assigned to an item is its total weight. A query -selects one cell uniformly at random, then chooses one of the two items in the -cell by weight; thus, selecting items with probability proportional to their -weight in $O(1)$ time. A pictorial representation of this structure is shown in -Figure~\ref{fig:alias}. - -The alias method can also be used as the basis for creating SSIs capable of -answering general IQS queries using a technique called alias -augmentation~\cite{tao22}. 
As a concrete example, previous -papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries using $O(\log n -+ k)$ time, where the $\log n$ cost is only be paid only once per query, after which -elements can be sampled in constant time. This structure is built by breaking -the data up into disjoint chunks of size $\nicefrac{n}{\log n}$, called -\emph{fat points}, each with an alias structure. A B+tree is then constructed, -using the fat points as its leaf nodes. The internal nodes are augmented with -an alias structure over the total weight of each child. This alias structure -is used instead of rejection sampling to determine the traversal path to take -through the tree, and then the alias structure of the fat point is used to -sample a record. Because rejection sampling is not used during the traversal, -two traversals suffice to establish the valid range of records for sampling, -after which samples can be collected without requiring per-sample traversals. -More examples of alias augmentation applied to different IQS problems can be -found in a recent survey by Tao~\cite{tao22}. - -There do exist specialized sampling indexes~\cite{hu14} with both efficient -sampling and support for updates, but these are restricted to specific query -types and are often very complex structures, with poor constant factors -associated with sampling and update costs, and so are of limited practical -utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on -extending the alias structure to support weight updates over a fixed set of -elements. However, these solutions do not allow insertion or deletion in the -underlying dataset, and so are not well suited to database sampling -applications. - -\Paragraph{The Dichotomy.} Among these techniques, there exists a -clear trade-off between efficient sampling and support for updates. Tree-traversal -based sampling solutions pay a dataset size based cost per sample, in exchange for -update support. 
The static solutions lack support for updates, but support -near-constant time sampling. While some data structures exist with support for -both, these are restricted to highly specialized query types. Thus in the -general case there exists a dichotomy: existing sampling indexes can support -either data updates or efficient sampling, but not both. +\Paragraph{Static Solutions.} +There are also a large number of static data structures, which we'll +call static sampling indices (SSIs) in this chapter,\footnote{ + We used the term ``SSI'' in the original paper on which this chapter + is based, which was published prior to our realization that a strong + distinction between an index and a data structure would be useful. I + am retaining the term SSI in this chapter for consistency with the + original paper, but understand that in the terminology established in + Chapter~\ref{chap:background}, SSIs are data structures, not indices. +} +that are capable of answering sampling queries more efficiently than +Olken's method relative to the overall data size. An example of such +a structure is used in Walker's alias method \cite{walker74,vose91}. +This technique constructs a data structure in $\Theta(n)$ time +that is capable of answering WSS queries in $\Theta(1)$ time per +sample. Figure~\ref{fig:alias} shows a pictorial representation of the +structure. For a set of $n$ records, it is constructed by distributing +the normalized weight of all of the records across an array of $n$ +cells, which represent at most two records each. Each cell will have +a proportional representation of its records based on their normalized +weight (e.g., a given cell may be 40\% allocated to one record and 60\% +to another). To query the structure, a cell is first selected uniformly +at random, and then one of its two associated records is selected +with a probability proportional to the record's weight.
This operation +takes $\Theta(1)$ time, requiring only two random number generations +per sample. Thus, a WSS query can be answered in $\Theta(k)$ time, +assuming the structure has already been built. Unfortunately, the alias +structure cannot be efficiently updated, as inserting new records would +change the relative weights of \emph{all} the records and require fully +repartitioning the structure. + +While the alias method only applies to WSS, other sampling problems can +be solved by using the alias method within the context of a larger data +structure, a technique called \emph{alias augmentation}~\cite{tao22}. For +example, alias augmentation can be used to construct an SSI capable of +answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}. +This structure breaks the data into multiple disjoint partitions of size +$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree +is then built, using the augmented partitions as its leaf nodes. Each +internal node is also augmented with an alias structure over the aggregate +weights associated with the children of each pointer. Constructing this +structure requires $\Theta(n)$ time (though the associated constants are +quite large in practice). WIRS queries can be answered by traversing +the tree, first establishing the portion of the tree covering the +query range, and then sampling records from that range using the alias +structures attached to the nodes. More examples of alias augmentation +applied to different IQS problems can be found in a recent survey by +Tao~\cite{tao22}. + +There also exist specialized data structures with support for both +efficient sampling and updates~\cite{hu14}, but these structures have +poor constant factors and are very complex, rendering them of little +practical utility. Additionally, efforts have been made to extend +the alias structure with support for weight updates over a fixed set of +elements~\cite{hagerup93,matias03,allendorf23}.
These approaches do not +allow the insertion or removal of records, however; they support only in-place +weight updates. While in principle they could be constructed over the +entire domain of possible records, with the weights of non-existent +records set to $0$, this is hardly practical. Thus, these structures are +not suited for the database sampling applications that are of interest to +us in this chapter. + +\subsection{The Dichotomy} Across the index-assisted techniques +discussed above, a clear pattern emerges. Olken's method +supports updates, but is inefficient compared to the SSIs because it +requires a data-sized cost to be paid per sample in the sample set. The +SSIs are more efficient for sampling, typically paying the data-sized cost +only once per sample set (if at all), but fail to support updates. Thus, +there appears to be a general dichotomy of sampling techniques: existing +sampling data structures support either updates or efficient sampling, +but generally not both. It will be the purpose of this chapter to resolve +this dichotomy.
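As a concrete sketch of the alias structure discussed in this chapter, the following illustrates a Vose-style $\Theta(n)$ construction and $\Theta(1)$-per-sample query. It is purely illustrative (names are our own, and it omits the numerical-robustness details of production implementations):

```python
import random

def build_alias(weights):
    # Vose-style alias construction in O(n). Each cell i holds a
    # primary record i and an alias record alias[i]; prob[i] is the
    # share of cell i assigned to its primary record.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]   # donate the leftover mass
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:            # remaining cells are fully owned
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: pick a cell uniformly, then choose between its
    # two records using the cell's split probability.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

Each query costs exactly two random number generations, matching the per-sample cost discussed above; the structure must be rebuilt from scratch if any weight changes, which is precisely the update limitation that motivates this chapter.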