| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-04 16:43:45 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-04 16:43:45 -0400 |
| commit | eb519d35d7f11427dd5fc877130b02478f0da80d (patch) | |
| tree | 2eb5bc349c82517fdc6484fce71c862b92b0213b /chapters/sigmod23/background.tex | |
| parent | 873fd659e45e80fe9e229d3d85b3c4c99fb2c121 (diff) | |
| download | dissertation-eb519d35d7f11427dd5fc877130b02478f0da80d.tar.gz | |
Began re-writing/updating the sampling extension stuff
Diffstat (limited to 'chapters/sigmod23/background.tex')
| -rw-r--r-- | chapters/sigmod23/background.tex | 340 |
1 files changed, 208 insertions, 132 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex index 58324bd..ad89e03 100644 --- a/chapters/sigmod23/background.tex +++ b/chapters/sigmod23/background.tex @@ -1,31 +1,63 @@ \section{Background} \label{sec:background} -This section formalizes the sampling problem and describes relevant existing -solutions. Before discussing these topics, though, a clarification of -definition is in order. The nomenclature used to describe sampling varies -slightly throughout the literature. In this chapter, the term \emph{sample} is -used to indicate a single record selected by a sampling operation, and a -collection of these samples is called a \emph{sample set}; the number of -samples within a sample set is the \emph{sample size}. The term \emph{sampling} -is used to indicate the selection of either a single sample or a sample set; -the specific usage should be clear from context. - - -\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often -desirable for the drawn samples to have \emph{statistical independence}. This -requires that the sampling of a record does not affect the probability of any -other record being sampled in the future. Independence is a requirement for the -application of statistical tools such as the Central Limit -Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. -A failure to maintain independence in sampling invalidates any guarantees -provided by these statistical methods. - -In each of the problems considered, sampling can be performed either with -replacement (WR) or without replacement (WoR). It is possible to answer any WoR -sampling query using a constant number of WR queries, followed by a -deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR -sampling. +We will begin with a formal discussion of the sampling problem and +relevant existing solutions. First, though, a clarification of definition +is in order.
The nomenclature used to describe sampling in the literature +is rather inconsistent, and so we'll first specifically define all of +the relevant terms.\footnote{ + As an amusing footnote, this problem actually resulted in a + significant miscommunication between myself and my advisor in the + early days of the project, resulting in a lot of time being expended + on performance debugging a problem that didn't actually exist! +} +In this chapter, we'll use the term \emph{sample} to indicate a +single record selected by a sampling operation, and a collection of +these samples will be called a \emph{sample set}. The number of samples +within a sample set is the \emph{sample size}. The term \emph{sampling} +is used to indicate the selection of either a single sample or a sample +set; the specific usage should be clear from context. + +In each of the problems considered, sampling can be performed either +with replacement or without replacement. Sampling with replacement +means that a record that has been included in the sample set for a given +sampling query is ``replaced'' into the dataset and allowed to be sampled +again. Sampling without replacement does not ``replace'' the record, +and so each individual record can only be included within a sample +set once for a given query. The data structures that will be discussed +support sampling with replacement, and sampling without replacement can +be implemented using a constant number of with replacement sampling +operations, followed by a deduplication step~\cite{hu15}, so this chapter +will focus exclusively on the with replacement case. + +\subsection{Independent Sampling Problem} + +When conducting sampling, it is often desirable for the drawn samples to +have \emph{statistical independence} and for the distribution of records +in the sample set to match the distribution of the source data set. This +requires that the sampling of a record does not affect the probability of +any other record being sampled in the future.
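The reduction described above, from without-replacement to with-replacement sampling followed by deduplication, can be sketched as follows. This is an illustrative Python sketch assuming distinct records; `sample_wr` stands in for whatever with-replacement primitive the underlying structure provides, and the names are our own:

```python
import random

def sample_wr(data, k):
    # With-replacement primitive: each of the k draws is independent
    # and uniform, so a record may appear multiple times.
    return [random.choice(data) for _ in range(k)]

def sample_wor(data, k):
    # Without-replacement sampling built from repeated with-replacement
    # queries followed by deduplication. In expectation, only a
    # constant number of rounds is needed when k is small relative
    # to the dataset size.
    assert k <= len(set(data))
    seen = []
    while len(seen) < k:
        for r in sample_wr(data, k):
            if r not in seen:
                seen.append(r)
            if len(seen) == k:
                break
    return seen
```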
Such sample sets are said +to be drawn i.i.d.\ (independently and identically distributed). Throughout +this chapter, the term ``independent'' will be used to describe both +statistical independence and identical distribution. + +Independence of sample sets is important because many useful statistical +results are derived from assuming that this condition holds. For example, +it is a requirement for the application of statistical tools such as +the Central Limit Theorem~\cite{bulmer79}, which is the basis for many +concentration bounds. A failure to maintain independence in sampling +invalidates any guarantees provided by these statistical methods. + +In the context of databases, it is also common to discuss a more +general version of the sampling problem, called \emph{independent query +sampling} (IQS)~\cite{hu14}. In IQS, a sample set is constructed from a +specified number of records in the result set of a database query. In +this context, it isn't enough to ensure that individual records are +sampled independently; the sample sets from repeated queries must also be +independent. This precludes, for example, caching and returning the same +sample set to multiple repetitions of the same query. This inter-query +independence provides a variety of useful properties, such as fairness +and representativeness of query results~\cite{tao22}. A basic version of the independent sampling problem is \emph{weighted set sampling} (WSS),\footnote{ @@ -43,22 +75,18 @@ as: each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in D}w(p)}$ of being sampled. \end{definition} -Each query returns a sample set of size $k$, rather than a -single sample. Queries returning sample sets are the common case, because the -robustness of analysis relies on having a sufficiently large sample +Each query returns a sample set of size $k$, rather than a single +sample.
Queries returning sample sets are the common case, because +the robustness of analysis relies on having a sufficiently large sample size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS) problem is a special case of WSS, where every element has unit weight. -In the context of databases, it is also common to discuss a more general -version of the sampling problem, called \emph{independent query sampling} -(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the -result set of a database query. In this context, it is insufficient to merely -ensure individual records are sampled independently; the sample sets returned -by repeated IQS queries must be independent as well. This provides a variety of -useful properties, such as fairness and representativeness of query -results~\cite{tao22}. As a concrete example, consider simple random sampling on -the result set of a single-dimensional range reporting query. This is -called independent range sampling (IRS), and is formally defined as: +For WSS, the results are taken directly from the dataset without applying +any predicates or filtering. This can be useful; however, for IQS it is +common for database queries to apply predicates to the data. A very common +search problem underlying database queries is range scanning, +which can be formulated as a sampling problem called \emph{independent +range sampling} (IRS), \begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query @@ -66,35 +94,73 @@ called independent range sampling (IRS), and is formally defined as: query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. \end{definition} -A generalization of IRS exists, called \emph{Weighted Independent Range -Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS.
Each point in $D$ -is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are -drawn from the range query results $D \cap q$ such that each data point has a -probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled. - - -\Paragraph{Existing Solutions.} While many sampling techniques exist, -few are supported in practical database systems. The existing -\texttt{TABLESAMPLE} operator provided by SQL in all major DBMS -implementations~\cite{postgres-doc} requires either a linear scan (e.g., -Bernoulli sampling) that results in high sample retrieval costs, or relaxed -statistical guarantees (e.g., block sampling~\cite{postgres-doc} used in -PostgreSQL). - -Index-assisted sampling solutions have been studied -extensively. Olken's method~\cite{olken89} is a classical solution to -independent sampling problems. This algorithm operates upon traditional search -trees, such as the B+tree used commonly as a database index. It conducts a -random walk on the tree uniformly from the root to a leaf, resulting in a -$O(\log n)$ sampling cost for each returned record. Should weighted samples be -desired, rejection sampling can be performed. A sampled record, $r$, is -accepted with probability $\nicefrac{w(r)}{w_{max}}$, with an expected -number of $\nicefrac{w_{max}}{w_{avg}}$ samples to be taken per element in the -sample set. Olken's method can also be extended to support general IQS by -rejecting all sampled records failing to satisfy the query predicate. It can be -accelerated by adding aggregated weight tags to internal -nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed -during the tree-traversal to abort dead-end traversals early. + +IRS is a non-weighted sampling problem, similar to SRS. 
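For intuition, over a static sorted array an IRS query can be answered by using binary search to locate $D \cap q$ and then drawing $k$ independent uniform samples from that contiguous range. The following Python sketch is purely illustrative (the function and argument names are our own, and no update support is provided):

```python
import bisect
import random

def irs_query(sorted_data, x, y, k):
    # Locate the contiguous run of records in D ∩ [x, y] via binary
    # search (the logarithmic part of the cost), then draw k
    # independent uniform samples from it (the O(k) part).
    lo = bisect.bisect_left(sorted_data, x)
    hi = bisect.bisect_right(sorted_data, y)
    if lo == hi:
        return []  # empty query range
    return [sorted_data[random.randrange(lo, hi)] for _ in range(k)]
```

Because the samples are drawn independently and uniformly from the range, repeated queries over the same interval return independent sample sets, as IQS requires.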
There also exists +a weighted generalization, called \emph{weighted independent range +sampling} (WIRS), + +\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}] + Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with + positive weights $w: D\to \mathbb{R}^+$. Given a query + interval $q = [x, y]$ and an integer $k$, a weighted independent range sampling + query returns $k$ independent samples from $D \cap q$ with each + point having a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ + of being sampled. +\end{definition} + +This is not an exhaustive list of sampling problems, but it is the list +of problems that will be directly addressed within this chapter. + +\subsection{Algorithmic Solutions} + +Relational database systems often have native support for IQS using +SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the +algorithms used to implement this operator have significant limitations: +users must choose between statistical independence and performance. + +To maintain statistical independence, Bernoulli sampling is used. This +technique requires iterating over every record in the result set of the +query and selecting or rejecting it for inclusion within the sample +with a fixed probability~\cite{db2-doc}. Because every record in the +result set must be considered, this approach provides no performance +benefit relative to the query being sampled from, as that query must be +answered in full anyway before returning only some of the results. + +For performance, the statistical guarantees can be discarded and +systematic or block sampling used instead. Systematic sampling considers +only a fraction of the rows in the table being sampled from, following +some particular pattern~\cite{postgres-doc}, and block sampling samples +entire database pages~\cite{db2-doc}.
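The trade-off between the Bernoulli and block strategies described above can be sketched as follows. This is an illustrative Python sketch only (real systems implement both inside the storage engine, not over in-memory lists):

```python
import random

def bernoulli_sample(result_set, p):
    # Statistically sound: every record is considered independently
    # with probability p, but the full result set must be scanned.
    return [r for r in result_set if random.random() < p]

def block_sample(pages, p):
    # Block-style sampling: keep or skip whole pages of records. Fast,
    # since skipped pages are never read, but a record's chance of
    # inclusion is tied to its physical page, which can bias the sample.
    out = []
    for page in pages:
        if random.random() < p:
            out.extend(page)
    return out
```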
These allow for query performance +to be decoupled from data size, but tie a given record's inclusion in the +sample set directly to its physical storage location, which can introduce +bias into the sample and violate statistical guarantees. + +\subsection{Index-assisted Solutions} +It is possible to answer IQS queries in a manner that both preserves +independence and avoids executing the query in full, through the use +of specialized data structures. + +\Paragraph{Olken's Method.} +The classical solution is Olken's method~\cite{olken89}, +which can be applied to traditional tree-based database indices. This +technique performs a randomized tree traversal, selecting the pointer to +follow at each node uniformly at random. This allows SRS queries to be +answered at $\Theta(\log n)$ cost per sample in the sample set. Thus, +for an IQS query with a desired sample set size of $k$, Olken's method +can provide a sample set in $\Theta(k \log n)$ time. + +More complex IQS queries, such as weighted or predicate-filtered sampling, +can be answered using the same algorithm by applying rejection sampling. +To support predicates, any sampled records that violate the predicate can +be rejected and retried. For weighted sampling, a given record $r$ will +be accepted into the sample with $\nicefrac{w(r)}{w_{max}}$ probability. +This will require an expected number of $\nicefrac{w_{max}}{w_{avg}}$ +attempts per sample in the sample set~\cite{olken-thesis}. This rejection +sampling can be significantly improved by adding aggregated weight tags to +internal nodes, allowing rejection sampling to be performed at each step +of the tree traversal to abort dead-end traversals early~\cite{zhao22}. In +either case, there will be a performance penalty to rejecting samples, +requiring greater than $k$ traversals to obtain a sample set of size $k$. \begin{figure} \centering @@ -115,68 +181,78 @@ during the tree-traversal to abort dead-end traversals early.
\end{figure} -There also exist static data structures, referred to in this chapter as static -sampling indexes (SSIs)\footnote{ -The name SSI was established in the published version of this paper prior to the -realization that a distinction between the terms index and data structure would -be useful. We'll continue to use the term SSI for the remainder of this chapter, -to maintain consistency with the published work, but technically an SSI refers to - a data structure, not an index, in the nomenclature established in the previous - chapter. - }, that are capable of answering sampling queries in -near-constant time\footnote{ - The designation -``near-constant'' is \emph{not} used in the technical sense of being constant -to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean -constant to within an additive polylogarithmic term, i.e., $f(x) \in O(\log n + -1)$. -%For example, drawing $k$ samples from $n$ records using a near-constant -%approach would require $O(\log n + k)$ time. This is in contrast to a -%tree-traversal approach, which would require $O(k\log n)$ time. -} relative to the size of the dataset. An example of such a -structure is used in Walker's alias method \cite{walker74,vose91}, a technique -for answering WSS queries with $O(1)$ query cost per sample, but requiring -$O(n)$ time to construct. It distributes the weight of items across $n$ cells, -where each cell is partitioned into at most two items, such that the total -proportion of each cell assigned to an item is its total weight. A query -selects one cell uniformly at random, then chooses one of the two items in the -cell by weight; thus, selecting items with probability proportional to their -weight in $O(1)$ time. A pictorial representation of this structure is shown in -Figure~\ref{fig:alias}. - -The alias method can also be used as the basis for creating SSIs capable of -answering general IQS queries using a technique called alias -augmentation~\cite{tao22}. 
As a concrete example, previous -papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries using $O(\log n -+ k)$ time, where the $\log n$ cost is only be paid only once per query, after which -elements can be sampled in constant time. This structure is built by breaking -the data up into disjoint chunks of size $\nicefrac{n}{\log n}$, called -\emph{fat points}, each with an alias structure. A B+tree is then constructed, -using the fat points as its leaf nodes. The internal nodes are augmented with -an alias structure over the total weight of each child. This alias structure -is used instead of rejection sampling to determine the traversal path to take -through the tree, and then the alias structure of the fat point is used to -sample a record. Because rejection sampling is not used during the traversal, -two traversals suffice to establish the valid range of records for sampling, -after which samples can be collected without requiring per-sample traversals. -More examples of alias augmentation applied to different IQS problems can be -found in a recent survey by Tao~\cite{tao22}. - -There do exist specialized sampling indexes~\cite{hu14} with both efficient -sampling and support for updates, but these are restricted to specific query -types and are often very complex structures, with poor constant factors -associated with sampling and update costs, and so are of limited practical -utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on -extending the alias structure to support weight updates over a fixed set of -elements. However, these solutions do not allow insertion or deletion in the -underlying dataset, and so are not well suited to database sampling -applications. - -\Paragraph{The Dichotomy.} Among these techniques, there exists a -clear trade-off between efficient sampling and support for updates. Tree-traversal -based sampling solutions pay a dataset size based cost per sample, in exchange for -update support. 
The static solutions lack support for updates, but support -near-constant time sampling. While some data structures exist with support for -both, these are restricted to highly specialized query types. Thus in the -general case there exists a dichotomy: existing sampling indexes can support -either data updates or efficient sampling, but not both. +\Paragraph{Static Solutions.} +There are also a large number of static data structures, which we'll +call static sampling indices (SSIs) in this chapter,\footnote{ + We used the term ``SSI'' in the original paper on which this chapter + is based, which was published prior to our realization that a strong + distinction between an index and a data structure would be useful. I + am retaining the term SSI in this chapter for consistency with the + original paper, but understand that in the terminology established in + Chapter~\ref{chap:background}, SSIs are data structures, not indices. +} +that are capable of answering sampling queries more efficiently than +Olken's method relative to the overall data size. An example of such +a structure is used in Walker's alias method \cite{walker74,vose91}. +This technique constructs a data structure in $\Theta(n)$ time +that is capable of answering WSS queries in $\Theta(1)$ time per +sample. Figure~\ref{fig:alias} shows a pictorial representation of the +structure. For a set of $n$ records, it is constructed by distributing +the normalized weight of all of the records across an array of $n$ +cells, which represent at most two records each. Each cell will have +a proportional representation of its records based on their normalized +weight (e.g., a given cell may be 40\% allocated to one record and 60\% +to another). To query the structure, a cell is first selected uniformly +at random, and then one of its two associated records is selected +with a probability proportional to the record's weight.
This operation +takes $\Theta(1)$ time, requiring only two random number generations +per sample. Thus, a WSS query can be answered in $\Theta(k)$ time, +assuming the structure has already been built. Unfortunately, the alias +structure cannot be efficiently updated, as inserting new records would +change the relative weights of \emph{all} the records and require fully +repartitioning the structure. + +While the alias method only applies to WSS, other sampling problems can +be solved by using the alias method within the context of a larger data +structure, a technique called \emph{alias augmentation}~\cite{tao22}. For +example, alias augmentation can be used to construct an SSI capable of +answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}. +This structure breaks the data into multiple disjoint partitions of size +$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree +is then built, using the augmented partitions as its leaf nodes. Each +internal node is also augmented with an alias structure over the aggregate +weights associated with the children of each pointer. Constructing this +structure requires $\Theta(n)$ time (though the associated constants are +quite large in practice). WIRS queries can be answered by traversing +the tree, first establishing the portion of the tree covering the +query range, and then sampling records from that range using the alias +structures attached to the nodes. More examples of alias augmentation +applied to different IQS problems can be found in a recent survey by +Tao~\cite{tao22}. + +There also exist specialized data structures with support for both +efficient sampling and updates~\cite{hu14}, but these structures have +poor constant factors and are very complex, rendering them of little +practical utility. Additionally, efforts have been made to extend +the alias structure with support for weight updates over a fixed set of +elements~\cite{hagerup93,matias03,allendorf23}.
These approaches do not +allow the insertion or removal of records, however; they support only in-place +weight updates. While in principle they could be constructed over the +entire domain of possible records, with the weights of non-existent +records set to $0$, this is hardly practical. Thus, these structures are +not suited for the database sampling applications that are of interest to +us in this chapter. + +\subsection{The Dichotomy} Across the index-assisted techniques +discussed above, a clear pattern emerges. Olken's method +supports updates, but is inefficient compared to the SSIs because it +requires a data-sized cost to be paid per sample in the sample set. The +SSIs are more efficient for sampling, typically paying the data-sized cost +only once per sample set (if at all), but fail to support updates. Thus, +there appears to be a general dichotomy of sampling techniques: existing +sampling data structures support either updates or efficient sampling, +but generally not both. It will be the purpose of this chapter to resolve +this dichotomy.
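As a concrete sketch of the alias structure discussed in this chapter, the following illustrates a Vose-style $\Theta(n)$ construction and $\Theta(1)$-per-sample query. It is purely illustrative (names are our own, and it omits the numerical-robustness details of production implementations):

```python
import random

def build_alias(weights):
    # Vose-style alias construction in O(n). Each cell i holds a
    # primary record i and an alias record alias[i]; prob[i] is the
    # share of cell i assigned to its primary record.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]   # donate the leftover mass
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:            # remaining cells are fully owned
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: pick a cell uniformly, then choose between its
    # two records using the cell's split probability.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

Each query costs exactly two random number generations, matching the per-sample cost discussed above; the structure must be rebuilt from scratch if any weight changes, which is precisely the update limitation that motivates this chapter.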