-rw-r--r--  chapters/background.tex             13
-rw-r--r--  chapters/sigmod23/background.tex   340
-rw-r--r--  chapters/sigmod23/framework.tex    669
-rw-r--r--  chapters/sigmod23/introduction.tex  54
-rw-r--r--  references/references.bib            8
5 files changed, 619 insertions, 465 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 9950b39..69436c8 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -85,6 +85,7 @@ their work on dynamization, and we will adopt their definition,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
The requirement for $\mergeop$ to be constant-time was used by Bentley and
@@ -101,6 +102,7 @@ problems},
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
To demonstrate that a search problem is decomposable, it is necessary to
@@ -811,6 +813,7 @@ cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
queries.
\subsubsection{Independent Range Sampling}
+\label{ssec:background-irs}
Another problem that is not decomposable is independent sampling. There
are a variety of problems falling under this umbrella, including weighted
@@ -831,15 +834,7 @@ matching of records in result sets. To work around this, a slight abuse
of definition is in order: assume that the equality conditions within
the DSP definition can be interpreted to mean ``the contents in the two
sets are drawn from the same distribution''. This enables the category
-of DSP to apply to this type of problem. More formally,
-\begin{definition}[Decomposable Sampling Problem]
- A sampling problem $F: (D, Q) \to R$, $F$ is decomposable if and
- only if there exists a constant-time computable, associative, and
- commutative binary operator $\mergeop$ such that,
- \begin{equation*}
- F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q)
- \end{equation*}
-\end{definition}
+of DSP to apply to this type of problem.
Even with this abuse, however, IRS cannot generally be considered
decomposable; it is at best $C(n)$-decomposable. The reason for this is
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 58324bd..ad89e03 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -1,31 +1,63 @@
\section{Background}
\label{sec:background}
-This section formalizes the sampling problem and describes relevant existing
-solutions. Before discussing these topics, though, a clarification of
-definition is in order. The nomenclature used to describe sampling varies
-slightly throughout the literature. In this chapter, the term \emph{sample} is
-used to indicate a single record selected by a sampling operation, and a
-collection of these samples is called a \emph{sample set}; the number of
-samples within a sample set is the \emph{sample size}. The term \emph{sampling}
-is used to indicate the selection of either a single sample or a sample set;
-the specific usage should be clear from context.
-
-
-\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often
-desirable for the drawn samples to have \emph{statistical independence}. This
-requires that the sampling of a record does not affect the probability of any
-other record being sampled in the future. Independence is a requirement for the
-application of statistical tools such as the Central Limit
-Theorem~\cite{bulmer79}, which is the basis for many concentration bounds.
-A failure to maintain independence in sampling invalidates any guarantees
-provided by these statistical methods.
-
-In each of the problems considered, sampling can be performed either with
-replacement (WR) or without replacement (WoR). It is possible to answer any WoR
-sampling query using a constant number of WR queries, followed by a
-deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR
-sampling.
+We will begin with a formal discussion of the sampling problem and of
+relevant existing solutions. First, though, a clarification of terminology
+is in order: the nomenclature used to describe sampling in the literature
+is rather inconsistent, and so we will precisely define all of
+the relevant terms.\footnote{
+    As an amusing aside, this inconsistency actually resulted in a
+    significant miscommunication between myself and my advisor in the
+    early days of the project, and led to a lot of time being expended
+    on performance debugging a problem that didn't actually exist!
+}
+In this chapter, we'll use the term \emph{sample} to indicate a
+single record selected by a sampling operation, and a collection of
+these samples will be called a \emph{sample set}. The number of samples
+within a sample set is the \emph{sample size}. The term \emph{sampling}
+is used to indicate the selection of either a single sample or a sample
+set; the specific usage should be clear from context.
+
+In each of the problems considered, sampling can be performed either
+with replacement or without replacement. Sampling with replacement
+means that a record that has been included in the sample set for a given
+sampling query is ``replaced'' into the dataset and allowed to be sampled
+again. Sampling without replacement does not ``replace'' the record,
+and so each individual record can only be included in a sample
+set once for a given query. The data structures that will be discussed
+support sampling with replacement, and sampling without replacement can
+be implemented using a constant number of with-replacement sampling
+operations, followed by a deduplication step~\cite{hu15}, so this chapter
+will focus exclusively on the with-replacement case.
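+As a minimal illustration of this reduction (a sketch only, not the
+framework's actual implementation), without-replacement sampling can be
+layered on top of any with-replacement routine roughly as follows:
+\begin{verbatim}
+import random
+
+def sample_wr(data, k):
+    # With-replacement sampling: k independent uniform draws.
+    return [random.choice(data) for _ in range(k)]
+
+def sample_wor(data, k):
+    # Without-replacement sampling built from with-replacement
+    # rounds followed by deduplication.
+    assert k <= len(data)
+    seen = set()
+    while len(seen) < k:
+        for r in sample_wr(data, k - len(seen)):
+            seen.add(r)
+    return list(seen)
+\end{verbatim}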
+
+\subsection{Independent Sampling Problem}
+
+When conducting sampling, it is often desirable for the drawn samples to
+have \emph{statistical independence} and for the distribution of records
+in the sample set to match the distribution of the source data set. This
+requires that the sampling of a record does not affect the probability of
+any other record being sampled in the future. Such sample sets are said
+to be drawn i.i.d.~(independently and identically distributed). Throughout
+this chapter, the term ``independent'' will be used to describe both
+statistical independence and identical distribution.
+
+Independence of sample sets is important because many useful statistical
+results are derived from assuming that this condition holds. For example,
+it is a requirement for the application of statistical tools such as
+the Central Limit Theorem~\cite{bulmer79}, which is the basis for many
+concentration bounds. A failure to maintain independence in sampling
+invalidates any guarantees provided by these statistical methods.
+
+In the context of databases, it is also common to discuss a more
+general version of the sampling problem, called \emph{independent query
+sampling} (IQS)~\cite{hu14}. In IQS, a sample set is constructed from a
+specified number of records in the result set of a database query. In
+this context, it isn't enough to ensure that individual records are
+sampled independently; the sample sets from repeated queries must also be
+independent. This precludes, for example, caching and returning the same
+sample set to multiple repetitions of the same query. This inter-query
+independence provides a variety of useful properties, such as fairness
+and representativeness of query results~\cite{tao22}.
A basic version of the independent sampling problem is \emph{weighted set
sampling} (WSS),\footnote{
@@ -43,22 +75,18 @@ as:
each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in
D}w(p)}$ of being sampled.
\end{definition}
-Each query returns a sample set of size $k$, rather than a
-single sample. Queries returning sample sets are the common case, because the
-robustness of analysis relies on having a sufficiently large sample
+Each query returns a sample set of size $k$, rather than a single
+sample. Queries returning sample sets are the common case, because
+the robustness of analysis relies on having a sufficiently large sample
size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
problem is a special case of WSS, where every element has unit weight.
-In the context of databases, it is also common to discuss a more general
-version of the sampling problem, called \emph{independent query sampling}
-(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the
-result set of a database query. In this context, it is insufficient to merely
-ensure individual records are sampled independently; the sample sets returned
-by repeated IQS queries must be independent as well. This provides a variety of
-useful properties, such as fairness and representativeness of query
-results~\cite{tao22}. As a concrete example, consider simple random sampling on
-the result set of a single-dimensional range reporting query. This is
-called independent range sampling (IRS), and is formally defined as:
+For WSS, samples are taken directly from the dataset without applying
+any predicates or filtering. This can be useful, but in the IQS setting
+it is common for the database query being sampled from to apply
+predicates to the data. A very common search problem underlying such
+queries is range scanning, which can be formulated as a sampling problem
+called \emph{independent range sampling} (IRS),
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
@@ -66,35 +94,73 @@ called independent range sampling (IRS), and is formally defined as:
query returns $k$ independent samples from $D \cap q$ with each
point having equal probability of being sampled.
\end{definition}
-A generalization of IRS exists, called \emph{Weighted Independent Range
-Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS. Each point in $D$
-is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are
-drawn from the range query results $D \cap q$ such that each data point has a
-probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled.
-
-
-\Paragraph{Existing Solutions.} While many sampling techniques exist,
-few are supported in practical database systems. The existing
-\texttt{TABLESAMPLE} operator provided by SQL in all major DBMS
-implementations~\cite{postgres-doc} requires either a linear scan (e.g.,
-Bernoulli sampling) that results in high sample retrieval costs, or relaxed
-statistical guarantees (e.g., block sampling~\cite{postgres-doc} used in
-PostgreSQL).
-
-Index-assisted sampling solutions have been studied
-extensively. Olken's method~\cite{olken89} is a classical solution to
-independent sampling problems. This algorithm operates upon traditional search
-trees, such as the B+tree used commonly as a database index. It conducts a
-random walk on the tree uniformly from the root to a leaf, resulting in a
-$O(\log n)$ sampling cost for each returned record. Should weighted samples be
-desired, rejection sampling can be performed. A sampled record, $r$, is
-accepted with probability $\nicefrac{w(r)}{w_{max}}$, with an expected
-number of $\nicefrac{w_{max}}{w_{avg}}$ samples to be taken per element in the
-sample set. Olken's method can also be extended to support general IQS by
-rejecting all sampled records failing to satisfy the query predicate. It can be
-accelerated by adding aggregated weight tags to internal
-nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed
-during the tree-traversal to abort dead-end traversals early.
+
+IRS is a non-weighted sampling problem, similar to SRS. There also exists
+a weighted generalization, called \emph{weighted independent range
+sampling} (WIRS),
+
+\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}]
+ Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with
+ positive weights $w: D\to \mathbb{R}^+$. Given a query
+    interval $q = [x, y]$ and an integer $k$, a weighted independent range
+    sampling query returns $k$ independent samples from $D \cap q$ with each
+ point having a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$
+ of being sampled.
+\end{definition}
+
+This is not an exhaustive list of sampling problems, but it is the list
+of problems that will be directly addressed within this chapter.
+
+\subsection{Algorithmic Solutions}
+
+Relational database systems often have native support for IQS using
+SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the
+algorithms used to implement this operator have significant limitations:
+users must choose between statistical independence and performance.
+
+To maintain statistical independence, Bernoulli sampling is used. This
+technique requires iterating over every record in the result set of the
+query, and selecting or rejecting it for inclusion within the sample
+with a fixed probability~\cite{db2-doc}. This process requires that each
+record in the result set be considered, and thus provides no performance
+benefit relative to the query being sampled from, as it must be answered
+in full anyway before returning only some of the results.
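+
+As a rough sketch (not tied to any particular DBMS implementation),
+Bernoulli sampling over a fully materialized result set amounts to:
+\begin{verbatim}
+import random
+
+def bernoulli_sample(result_set, p):
+    # Every row of the materialized result set is visited and kept
+    # with fixed probability p: independent, but O(|result_set|) work.
+    return [row for row in result_set if random.random() < p]
+\end{verbatim}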
+
+For performance, the statistical guarantees can be discarded and
+systematic or block sampling used instead. Systematic sampling considers
+only a fraction of the rows in the table being sampled from, following
+some particular pattern~\cite{postgres-doc}, and block sampling samples
+entire database pages~\cite{db2-doc}. These approaches allow query
+performance to be decoupled from data size, but they tie a given record's
+inclusion in the sample set directly to its physical storage location,
+which can introduce bias into the sample and violate its statistical
+guarantees.
+
+\subsection{Index-assisted Solutions}
+It is possible to answer IQS queries in a manner that both preserves
+independence, and avoids executing the query in full, through the use
+of specialized data structures.
+
+\Paragraph{Olken's Method.}
+The classical solution is Olken's method~\cite{olken89},
+which can be applied to traditional tree-based database indices. This
+technique performs a randomized tree traversal, selecting the pointer to
+follow at each node uniformly at random. This allows SRS queries to be
+answered at $\Theta(\log n)$ cost per sample in the sample set. Thus,
+for an IQS query with a desired sample set size of $k$, Olken's method
+can provide the full sample set in $\Theta(k \log n)$ time.
+
+More complex IQS queries, such as weighted or predicate-filtered sampling,
+can be answered using the same algorithm by applying rejection sampling.
+To support predicates, any sampled records that violate the predicate can
+be rejected and retried. For weighted sampling, a given record $r$ will
+be accepted into the sample with $\nicefrac{w(r)}{w_{max}}$ probability.
+This will require an expected number of $\nicefrac{w_{max}}{w_{avg}}$
+attempts per sample in the sample set~\cite{olken-thesis}. This rejection
+sampling can be significantly improved by adding aggregated weight tags to
+internal nodes, allowing rejection sampling to be performed at each step
+of the tree traversal to abort dead-end traversals early~\cite{zhao22}. In
+either case, there will be a performance penalty due to rejected samples,
+requiring more than $k$ traversals in expectation to obtain a sample set
+of size $k$.
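+
+To make the traversal concrete, the following sketch illustrates
+Olken-style sampling over a balanced search tree. It assumes a perfectly
+balanced tree with uniform fanout, and the \texttt{Node} interface, the
+\texttt{weight} function, and the \texttt{w\_max} bound are hypothetical
+illustrations rather than part of Olken's original presentation:
+\begin{verbatim}
+import random
+
+def olken_sample(root, weight, w_max):
+    # Random root-to-leaf walk, choosing a child uniformly at random
+    # at each internal node (assumes uniform fanout and balance).
+    node = root
+    while node.children:
+        node = random.choice(node.children)
+    record = random.choice(node.records)
+    # Weighted sampling via rejection: accept with probability w(r)/w_max.
+    if random.random() < weight(record) / w_max:
+        return record
+    return None    # rejected; the caller retries
+
+def olken_sample_set(root, weight, w_max, k):
+    samples = []
+    while len(samples) < k:
+        rec = olken_sample(root, weight, w_max)
+        if rec is not None:
+            samples.append(rec)
+    return samples
+\end{verbatim}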
\begin{figure}
\centering
@@ -115,68 +181,78 @@ during the tree-traversal to abort dead-end traversals early.
\end{figure}
-There also exist static data structures, referred to in this chapter as static
-sampling indexes (SSIs)\footnote{
-The name SSI was established in the published version of this paper prior to the
-realization that a distinction between the terms index and data structure would
-be useful. We'll continue to use the term SSI for the remainder of this chapter,
-to maintain consistency with the published work, but technically an SSI refers to
- a data structure, not an index, in the nomenclature established in the previous
- chapter.
- }, that are capable of answering sampling queries in
-near-constant time\footnote{
- The designation
-``near-constant'' is \emph{not} used in the technical sense of being constant
-to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean
-constant to within an additive polylogarithmic term, i.e., $f(x) \in O(\log n +
-1)$.
-%For example, drawing $k$ samples from $n$ records using a near-constant
-%approach would require $O(\log n + k)$ time. This is in contrast to a
-%tree-traversal approach, which would require $O(k\log n)$ time.
-} relative to the size of the dataset. An example of such a
-structure is used in Walker's alias method \cite{walker74,vose91}, a technique
-for answering WSS queries with $O(1)$ query cost per sample, but requiring
-$O(n)$ time to construct. It distributes the weight of items across $n$ cells,
-where each cell is partitioned into at most two items, such that the total
-proportion of each cell assigned to an item is its total weight. A query
-selects one cell uniformly at random, then chooses one of the two items in the
-cell by weight; thus, selecting items with probability proportional to their
-weight in $O(1)$ time. A pictorial representation of this structure is shown in
-Figure~\ref{fig:alias}.
-
-The alias method can also be used as the basis for creating SSIs capable of
-answering general IQS queries using a technique called alias
-augmentation~\cite{tao22}. As a concrete example, previous
-papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries using $O(\log n
-+ k)$ time, where the $\log n$ cost is only be paid only once per query, after which
-elements can be sampled in constant time. This structure is built by breaking
-the data up into disjoint chunks of size $\nicefrac{n}{\log n}$, called
-\emph{fat points}, each with an alias structure. A B+tree is then constructed,
-using the fat points as its leaf nodes. The internal nodes are augmented with
-an alias structure over the total weight of each child. This alias structure
-is used instead of rejection sampling to determine the traversal path to take
-through the tree, and then the alias structure of the fat point is used to
-sample a record. Because rejection sampling is not used during the traversal,
-two traversals suffice to establish the valid range of records for sampling,
-after which samples can be collected without requiring per-sample traversals.
-More examples of alias augmentation applied to different IQS problems can be
-found in a recent survey by Tao~\cite{tao22}.
-
-There do exist specialized sampling indexes~\cite{hu14} with both efficient
-sampling and support for updates, but these are restricted to specific query
-types and are often very complex structures, with poor constant factors
-associated with sampling and update costs, and so are of limited practical
-utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on
-extending the alias structure to support weight updates over a fixed set of
-elements. However, these solutions do not allow insertion or deletion in the
-underlying dataset, and so are not well suited to database sampling
-applications.
-
-\Paragraph{The Dichotomy.} Among these techniques, there exists a
-clear trade-off between efficient sampling and support for updates. Tree-traversal
-based sampling solutions pay a dataset size based cost per sample, in exchange for
-update support. The static solutions lack support for updates, but support
-near-constant time sampling. While some data structures exist with support for
-both, these are restricted to highly specialized query types. Thus in the
-general case there exists a dichotomy: existing sampling indexes can support
-either data updates or efficient sampling, but not both.
+\Paragraph{Static Solutions.}
+There are also a large number of static data structures, which we'll
+call static sampling indices (SSIs) in this chapter,\footnote{
+    We used the term ``SSI'' in the original paper on which this chapter
+    is based, which was published prior to our realization that a strong
+    distinction between an index and a data structure would be useful. I
+    am retaining the term SSI in this chapter for consistency with the
+    original paper, but note that in the terminology established in
+    Chapter~\ref{chap:background}, SSIs are data structures, not indices.
+},
+that are capable of answering sampling queries more efficiently than
+Olken's method relative to the overall data size. An example of such
+a structure is used in Walker's alias method \cite{walker74,vose91}.
+This technique constructs a data structure in $\Theta(n)$ time
+that is capable of answering WSS queries in $\Theta(1)$ time per
+sample. Figure~\ref{fig:alias} shows a pictorial representation of the
+structure. For a set of $n$ records, it is constructed by distributing
+the normalized weight of all of the records across an array of $n$
+cells, which represent at most two records each. Each cell will have
+a proportional representation of its records based on their normalized
+weight (e.g., a given cell may be 40\% allocated to one record, and 60\%
+to another). To query the structure, a cell is first selected uniformly
+at random, and then one of its two associated records is selected
+with a probability proportional to the record's weight. This operation
+takes $\Theta(1)$ time, requiring only two random number generations
+per sample. Thus, a WSS query can be answered in $\Theta(k)$ time,
+assuming the structure has already been built. Unfortunately, the alias
+structure cannot be efficiently updated, as inserting new records would
+change the relative weights of \emph{all} the records, and require fully
+repartitioning the structure.
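+
+For concreteness, the following is a small sketch of the construction
+(following Vose's formulation~\cite{vose91}) and of constant-time
+sampling; it is illustrative only and glosses over the floating-point
+care a production implementation would take:
+\begin{verbatim}
+import random
+
+class AliasTable:
+    # Walker's alias structure, built with Vose's method.
+    def __init__(self, weights):
+        n = len(weights)
+        total = float(sum(weights))
+        scaled = [w * n / total for w in weights]   # mean weight becomes 1
+        self.prob = [0.0] * n
+        self.alias = [0] * n
+        small = [i for i, p in enumerate(scaled) if p < 1.0]
+        large = [i for i, p in enumerate(scaled) if p >= 1.0]
+        while small and large:
+            s, l = small.pop(), large.pop()
+            self.prob[s], self.alias[s] = scaled[s], l
+            scaled[l] -= 1.0 - scaled[s]            # hand the remainder to l
+            (small if scaled[l] < 1.0 else large).append(l)
+        for i in small + large:                     # leftovers fill their cell
+            self.prob[i] = 1.0
+
+    def sample(self):
+        # One cell chosen uniformly, then one of its (at most) two
+        # resident records chosen by its share of the cell.
+        i = random.randrange(len(self.prob))
+        return i if random.random() < self.prob[i] else self.alias[i]
+\end{verbatim}
+A WSS query for $k$ samples is then simply $k$ calls to \texttt{sample()},
+matching the $\Theta(k)$ per-query cost described above.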
+
+While the alias method only applies to WSS, other sampling problems can
+be solved by using the alias method within the context of a larger data
+structure, a technique called \emph{alias augmentation}~\cite{tao22}. For
+example, alias augmentation can be used to construct an SSI capable of
+answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}.
+This structure breaks the data into multiple disjoint partitions of size
+$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree
+is then built, using the augmented partitions as its leaf nodes. Each
+internal node is also augmented with an alias structure over the aggregate
+weights associated with the children of each pointer. Constructing this
+structure requires $\Theta(n)$ time (though the associated constants are
+quite large in practice). WIRS queries can be answered by traversing
+the tree, first establishing the portion of the tree covering the
+query range, and then sampling records from that range using the alias
+structures attached to the nodes. More examples of alias augmentation
+applied to different IQS problems can be found in a recent survey by
+Tao~\cite{tao22}.
+
+There also exist specialized data structures with support for both
+efficient sampling and updates~\cite{hu14}, but these structures have
+poor constant factors and are very complex, rendering them of little
+practical utility. Additionally, efforts have been made to extend
+the alias structure with support for weight updates over a fixed set of
+elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not,
+however, allow the insertion or removal of records; they support only
+in-place weight updates. While in principle they could be constructed over
+the entire domain of possible records, with the weights of non-existent
+records set to $0$, this is hardly practical. Thus, these structures are
+not suited to the database sampling applications that are of interest to
+us in this chapter.
+
+\subsection{The Dichotomy}
+Across the index-assisted techniques discussed above, a clear pattern
+emerges. Olken's method supports updates, but is inefficient compared
+to the SSIs because it pays a cost that depends on the data size for
+every sample in the sample set. The SSIs are more efficient for
+sampling, typically paying this data-size-dependent cost only once per
+sample set (if at all), but fail to support updates. Thus, there
+appears to be a general dichotomy of sampling techniques: existing
+sampling data structures support either updates or efficient sampling,
+but generally not both. It will be the purpose of this chapter to
+resolve this dichotomy.
+
+
+
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 32a32e1..88ac1ac 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -1,53 +1,365 @@
-\section{Dynamic Sampling Index Framework}
+\section{Dynamization of SSIs}
\label{sec:framework}
-This work is an attempt to design a solution to independent sampling
-that achieves \emph{both} efficient updates and near-constant cost per
-sample. As the goal is to tackle the problem in a generalized fashion,
-rather than design problem-specific data structures for used as the basis
-of an index, a framework is created that allows for already
-existing static data structures to be used as the basis for a sampling
-index, by automatically adding support for data updates using a modified
-version of the Bentley-Saxe method.
-
-Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
-directly applied to sampling problems. The concept of decomposability is not
-cleanly applicable to sampling, because the distribution of records in the
-result set, rather than the records themselves, must be matched following the
-result merge. Efficiently controlling the distribution requires each sub-query
-to access information external to the structure against which it is being
-processed, a contingency unaccounted for by Bentley-Saxe. Further, the process
-of reconstruction used in Bentley-Saxe provides poor worst-case complexity
-bounds~\cite{saxe79}, and attempts to modify the procedure to provide better
-worst-case performance are complex and have worse performance in the common
-case~\cite{overmars81}. Despite these limitations, this chapter will argue that
-the core principles of the Bentley-Saxe method can be profitably applied to
-sampling indexes, once a system for controlling result set distributions and a
-more effective reconstruction scheme have been devised. The solution to
-the former will be discussed in Section~\ref{ssec:sample}. For the latter,
-inspiration is drawn from the literature on the LSM tree.
-
-The LSM tree~\cite{oneil96} is a data structure proposed to optimize
-write throughput in disk-based storage engines. It consists of a memory
-table of bounded size, used to buffer recent changes, and a hierarchy
-of external levels containing indexes of exponentially increasing
-size. When the memory table has reached capacity, it is emptied into the
-external levels. Random writes are avoided by treating the data within
-the external levels as immutable; all writes go through the memory
-table. This introduces write amplification but maximizes sequential
-writes, which is important for maintaining high throughput in disk-based
-systems. The LSM tree is associated with a broad and well studied design
-space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing
-trade-offs between three key performance metrics: read performance, write
-performance, and auxiliary memory usage. The challenges
-faced in reconstructing predominately in-memory indexes are quite
- different from those which the LSM tree is intended
-to address, having little to do with disk-based systems and sequential IO
-operations. But, the LSM tree possesses a rich design space for managing
-the periodic reconstruction of data structures in a manner that is both
-more practical and more flexible than that of Bentley-Saxe. By borrowing
-from this design space, this preexisting body of work can be leveraged,
-and many of Bentley-Saxe's limitations addressed.
+Our goal, then, is to design a solution to independent sampling that is
+able to achieve \emph{both} efficient updates and efficient sampling,
+while also maintaining statistical independence both within and between
+IQS queries, and to do so in a generalized fashion without needing to
+design new dynamic data structures for each problem. Given the range
+of SSIs already available, it seems reasonable to attempt to apply
+dynamization techniques to accomplish this goal. Using the Bentley-Saxe
+method would allow us to support inserts and deletes without
+requiring any modification of the SSIs. Unfortunately, as discussed
+in Section~\ref{ssec:background-irs}, there are problems with directly
+applying BSM to sampling problems. All of the considerations discussed
+there in the context of IRS apply equally to the other sampling problems
+considered in this chapter. In this section, we will discuss approaches
+for resolving these problems.
+
+\subsection{Sampling over Partitioned Datasets}
+
+The core problem facing any attempt to dynamize SSIs is that independently
+sampling from a partitioned dataset is difficult. As discussed in
+Section~\ref{ssec:background-irs}, accomplishing this task within the
+DSP model used by the Bentley-Saxe method requires drawing a full $k$
+samples from each of the blocks, and then repeatedly down-sampling each
+of the intermediate sample sets. However, it is possible to devise a
+more efficient query process if we abandon the DSP model and consider
+a slightly more complicated procedure.
+
+First, we need to resolve a minor definitional problem. As noted before,
+the DSP model is based on deterministic queries. The definition doesn't
+apply to sampling queries, because it assumes that the result sets of
+identical queries should also be identical. For general IQS, we also need
+to enforce conditions on the query being sampled from.
+
+\begin{definition}[Query Sampling Problem]
+    Given a search problem, $F$, a sampling problem is a function
+    of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+)
+    \to \mathcal{R}$, where $\mathcal{D}$ is the domain of records
+    and $\mathcal{Q}$ is the domain of query parameters of $F$. The
+    solution to a sampling problem, $R \in \mathcal{R}$, will be a subset
+    of records from the solution to $F$, drawn independently, such that
+    $|R| = k$ for some $k \in \mathbb{Z}^+$.
+\end{definition}
+With this in mind, we can now define the decomposability conditions for
+a query sampling problem,
+
+\begin{definition}[Decomposable Sampling Problem]
+    A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
+    \mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if
+ the following conditions are met for all $q \in \mathcal{Q},
+ k \in \mathbb{Z}^+$,
+ \begin{enumerate}
+ \item There exists a $\Theta(C(n,k))$ time computable, associative, and
+ commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F,
+ B, q, k)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B
+ = \emptyset$.
+
+ \item For any dataset $D \subseteq \mathcal{D}$ that has been
+ decomposed into $m$ partitions such that $D =
+ \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad
+ \forall i,j < m, i \neq j$,
+ \begin{equation*}
+ F(D, q) = \bigcup_{i=1}^m F(D_i, q)
+ \end{equation*}
+ \end{enumerate}
+\end{definition}
+
+These two conditions warrant further explanation. The first condition
+is simply a redefinition of the standard decomposability criteria to
+consider matching the distribution, rather than the exact records in $R$,
+as the correctness condition for the merge process. The second condition
+handles a necessary property of the underlying search problem being
+sampled from. Note that this condition is \emph{stricter} than normal
+decomposability for $F$, and essentially requires that the query being
+sampled from return a set of records, rather than an aggregate value or
+some other result that cannot be meaningfully sampled from. This condition
+is satisfied by predicate-filtering style database queries, among others.
+
+With these definitions in mind, let's turn to solving these query sampling
+problems. We begin by noting that many SSIs have a sampling procedure that
+naturally involves two phases: first, some preliminary work is done
+to determine metadata concerning the set of records to sample from,
+and then $k$ samples are drawn from the structure, taking advantage of
+this metadata. If we represent the time cost of the preliminary work
+with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
+structures' query cost functions are of the form,
+
+\begin{equation*}
+\mathscr{Q}(n, k) = P(n) + k S(n)
+\end{equation*}
+
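+As a concrete instance of this cost shape, consider IRS over a sorted
+array (a sketch only): the preliminary phase binary-searches the bounds
+of the query interval, giving $P(n) \in \Theta(\log n)$, and each sample
+is a single uniform index draw, giving $S(n) \in \Theta(1)$.
+\begin{verbatim}
+import bisect, random
+
+def irs_sorted_array(data, lo, hi, k):
+    # P(n): binary search the bounds of the query interval [lo, hi].
+    start = bisect.bisect_left(data, lo)
+    end = bisect.bisect_right(data, hi)
+    if start >= end:
+        return []
+    # S(n): each sample is one uniform index draw within the bounds.
+    return [data[random.randrange(start, end)] for _ in range(k)]
+\end{verbatim}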
+
+Consider an arbitrary decomposable sampling query with a cost function
+of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample
+of $k$ records from $d \subseteq \mathcal{D}$ using an instance of
+an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results
+in $d$ being split across $m$ disjoint instances of $\mathcal{I}$
+such that $d = \bigcup_{i=0}^m \text{unbuild}(\mathscr{I}_i)$ and
+$\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j)
+= \emptyset \quad \forall i, j < m, i \neq j$. If we consider a
+Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation
+would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such
+a structure would be,
+\begin{equation*}
+\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
+\end{equation*}
+
+This cost function is sub-optimal for two reasons. First, we
+pay extra cost to merge the result sets together because of the
+down-sampling combination operator. Second, this formulation
+fails to avoid a per-sample dependence on $n$, even in the case
+where $S(n) \in \Theta(1)$. This gets even worse when considering
+rejections that may occur as a result of deleted records. Recall from
+Section~\ref{ssec:background-deletes} that deletion can be supported
+using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
+Using either approach, it isn't possible to avoid deleted records in
+advance when sampling, and so these will need to be rejected and retried.
+In the DSP model, this retry will need to reprocess every block a second
+time; the sample cannot simply be redrawn in place without introducing
+bias into the result set. We will discuss this further in
+Section~\ref{ssec:sampling-deletes}.
+
+\begin{figure}
+ \centering
+ \includegraphics[width=\textwidth]{img/sigmod23/sampling}
+ \caption{\textbf{Overview of the multiple-block query sampling process} for
+ Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
+    the shards are determined, then (2) these weights are used to construct an
+ alias structure. Next, (3) the alias structure is queried $k$ times to
+ determine per shard sample sizes, and then (4) sampling is performed.
+ Finally, (5) any rejected samples are retried starting from the alias
+ structure, and the process is repeated until the desired number of samples
+ has been retrieved.}
+ \label{fig:sample}
+
+\end{figure}
+
+The key insight that allowed us to solve this particular problem was that
+there is a mismatch between the structure of the sampling query process,
+and the structure assumed by DSPs. Using an SSI to answer a sampling
+query results in a naturally two-phase process, but DSPs are assumed to
+be single phase. We can construct a more effective process for answering
+such queries based on a multi-stage process, summarized in Figure~\ref{fig:sample}.
+\begin{enumerate}
+ \item Determine each block's respective weight under a given
+ query to be sampled from (e.g., the number of records falling
+ into the query range for IRS).
+
+ \item Build a temporary alias structure over these weights.
+
+ \item Query the alias structure $k$ times to determine how many
+ samples to draw from each block.
+
+ \item Draw the appropriate number of samples from each block and
+ merge them together to form the final query result.
+\end{enumerate}
+It is possible that some of the records sampled in Step 4 must be
+rejected, either because of deletes or some other property of the sampling
+procedure being used. If $r$ records are rejected, the above procedure
+can be repeated from Step 3, taking $r$ as the number of times to
+query the alias structure, without needing to redo any of the preprocessing
+steps. This can be repeated as many times as necessary until the required
+$k$ records have been sampled.
+
+\begin{example}
+ \label{ex:sample}
+ Consider executing a WSS query, with $k=1000$, across three blocks
+ containing integer keys with unit weight. $\mathscr{I}_1$ contains only the
+ key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$
+ contains all integers on $[101, 200]$. These structures are shown
+ in Figure~\ref{fig:sample}. Sampling is performed by first
+ determining the normalized weights for each block: $w_1 = 0.005$,
+ $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
+ block alias structure. The block alias structure is then queried
+ $k$ times, resulting in a distribution of $k_i$s that is
+ commensurate with the relative weights of each block. Finally,
+ each block is queried in turn to draw the appropriate number
+ of samples.
+\end{example}
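+
+A minimal sketch of this multi-stage procedure, reusing the
+\texttt{AliasTable} sketch from earlier and assuming each block exposes
+hypothetical \texttt{weight(q)} and \texttt{sample(q)} methods (the
+latter returning \texttt{None} on a rejection), might look like:
+\begin{verbatim}
+def query_sample(blocks, q, k):
+    # Step 1: determine each block's weight under the query q.
+    weights = [b.weight(q) for b in blocks]
+    if sum(weights) == 0:
+        return []
+    # Step 2: build a temporary alias structure over the block weights.
+    block_alias = AliasTable(weights)
+    results = []
+    needed = k
+    while needed > 0:
+        # Step 3: assign the outstanding draws to blocks.
+        assignments = [0] * len(blocks)
+        for _ in range(needed):
+            assignments[block_alias.sample()] += 1
+        # Step 4: draw from each block; a rejected draw returns None
+        # and is retried from Step 3 on the next pass.
+        for i, count in enumerate(assignments):
+            for _ in range(count):
+                rec = blocks[i].sample(q)
+                if rec is not None:
+                    results.append(rec)
+        needed = k - len(results)
+    return results
+\end{verbatim}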
+
+Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
+a constant number of repetitions, the cost of answering a decomposable
+sampling query having a pre-processing cost of $P(n)$ and a per-sample
+cost of $S(n)$ will be,
+\begin{equation}
+\label{eq:dsp-sample-cost}
+\boxed{
+\mathscr{Q}(n, k) \in \Theta \left( P(n) \log_2 n + k S(n) \right)
+}
+\end{equation}
+where the cost of building the alias structure is $\Theta(\log_2 n)$
+and thus absorbed into the pre-processing cost. For the SSIs discussed
+in this chapter, which have $S(n) \in \Theta(1)$, this model provides us
+with the desired decoupling of the data size ($n$) from the per-sample
+cost.
+
+\subsection{Supporting Deletes}
+
+Because the shards are static, records cannot be arbitrarily removed from them.
+This requires that deletes be supported in some other way, with the ultimate
+goal being the prevention of deleted records' appearance in sampling query
+result sets. This can be realized in two ways: locating the record and marking
+it, or inserting a new record which indicates that an existing record should be
+treated as deleted. The framework supports both of these techniques, the
+selection of which is called the \emph{delete policy}. The former policy is
+called \emph{tagging} and the latter \emph{tombstone}.
+
+Tagging a record is straightforward. Point-lookups are performed against each
+shard in the index, as well as the buffer, for the record to be deleted. When
+it is found, a bit in a header attached to the record is set. When sampling,
+any records selected with this bit set are automatically rejected. Tombstones
+represent a lazy strategy for deleting records. When a record is deleted using
+tombstones, a new record with identical key and value, but with a ``tombstone''
+bit set, is inserted into the index. A record's presence can be checked by
+performing a point-lookup. If a tombstone with the same key and value exists
+above the record in the index, then it should be rejected when sampled.
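+
+The two policies can be illustrated with the following sketch; the record
+header layout and the lookup helpers are hypothetical, not the framework's
+actual interface:
+\begin{verbatim}
+DELETE_TAG = 1 << 0     # set in place by a tagged delete
+TOMBSTONE  = 1 << 1     # marks a separately inserted tombstone record
+
+def delete_tagged(index, key, value):
+    # Tagging: point-lookup the record and set its delete bit in place.
+    rec = index.point_lookup(key, value)
+    if rec is not None:
+        rec.header |= DELETE_TAG
+
+def delete_tombstone(index, key, value):
+    # Tombstones: insert a new record with the tombstone bit set.
+    index.insert(key, value, header=TOMBSTONE)
+
+def is_deleted(index, rec):
+    # Rejection check: O(1) under tagging, a point lookup under tombstones.
+    if rec.header & DELETE_TAG:
+        return True
+    return index.tombstone_lookup(rec.key, rec.value)
+\end{verbatim}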
+
+Two important aspects of performance are pertinent when discussing deletes: the
+cost of the delete operation, and the cost of verifying the presence of a
+sampled record. The choice of delete policy represents a trade-off between
+these two costs. Beyond this simple trade-off, the delete policy also has other
+implications that can affect its applicability to certain types of SSI. Most
+notably, tombstones do not require any in-place updating of records, whereas
+tagging does. This means that using tombstones is the only way to ensure total
+immutability of the data within shards, which avoids random writes and eases
+concurrency control. The tombstone delete policy, then, is particularly
+appealing in external and concurrent contexts.
+
+\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
+the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
+of the record to be deleted, and so is more expensive. Assuming a point-lookup
+operation with cost $L(n)$, a tagged delete must search each level in the
+index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
+time.
+
+\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
+itself, the delete policy affects the cost of determining if a given record has
+been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
+using tagging, the information necessary to make the rejection decision is
+local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
+it is not; a point-lookup must be performed to search for a given record's
+corresponding tombstone. This look-up must examine the buffer, and each shard
+within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
+L(n) \log_s n\right)$. The rejection check process for the two delete policies is
+summarized in Figure~\ref{fig:delete}.
+
+Two factors contribute to the tombstone rejection check cost: the size of the
+buffer, and the cost of performing a point-lookup against the shards. The
+latter cost can be controlled using the framework's ability to associate
+auxiliary structures with shards. For SSIs which do not support efficient
+point-lookups, a hash table can be added to map key-value pairs to their
+location within the SSI. This allows for constant-time rejection checks, even
+in situations where the index would not otherwise support them. However, the
+storage cost of this intervention is high, and in situations where the SSI does
+support efficient point-lookups, it is not necessary. Further performance
+improvements can be achieved by noting that the probability of a given record
+having an associated tombstone in any particular shard is relatively small.
+This means that many point-lookups will be executed against shards that do not
+contain the tombstone being searched for. In this case, these unnecessary
+lookups can be partially avoided using Bloom filters~\cite{bloom70} for
+tombstones. By inserting tombstones into these filters during reconstruction,
+point-lookups against some shards which do not contain the tombstone being
+searched for can be bypassed. Filters can be attached to the buffer as well,
+which may be even more significant due to the linear cost of scanning it. As
+the goal is a reduction of rejection check costs, these filters need only be
+populated with tombstones. In a later section, techniques for bounding the
+number of tombstones on a given level are discussed, which will allow for the
+memory usage of these filters to be tightly controlled while still ensuring
+precise bounds on filter error.
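+
+Putting these pieces together, a tombstone rejection check is shaped
+roughly like the following sketch (the buffer, shard, and filter
+interfaces are hypothetical, and for brevity it consults every shard
+rather than only those above the sampled record):
+\begin{verbatim}
+def tombstone_rejection_check(buffer, shards, rec):
+    # Scan the buffer first, then consult each shard, skipping any
+    # shard whose tombstone Bloom filter rules the record out.
+    if buffer.contains_tombstone(rec.key, rec.value):
+        return True
+    for shard in shards:
+        if not shard.tombstone_filter.may_contain((rec.key, rec.value)):
+            continue    # filter says: definitely no such tombstone here
+        if shard.tombstone_lookup(rec.key, rec.value):
+            return True
+    return False
+\end{verbatim}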
+
+\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
+alters the analysis of sampling costs. A record that has been deleted cannot
+be present in the sample set, and therefore the presence of each sampled record
+must be verified. If a record has been deleted, it must be rejected. When
+retrying samples rejected due to delete, the process must restart from shard
+selection, as deleted records may be counted in the weight totals used to
+construct that structure. This increases the cost of sampling to,
+\begin{equation}
+\label{eq:sampling-cost}
+ O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
+\end{equation}
+where $R(n)$ is the cost of checking if a sampled record has been deleted, and
+$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
+attempts required to obtain $k$ samples, given a fixed rejection probability.
+The rejection probability itself is a function of the workload, and is
+unbounded.
+
+\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
+constitute wasted memory accesses and random number generations, and so steps
+should be taken to minimize their frequency. The probability of a rejection is
+directly related to the number of deleted records, which is itself a function
+of workload and dataset. This means that, without building counter-measures
+into the framework, tight bounds on sampling performance cannot be provided in
+the presence of deleted records. It is therefore critical that the framework
+support some method for bounding the number of deleted records within the
+index.
+
+While the static nature of shards prevents the direct removal of records at the
+moment they are deleted, it doesn't prevent the removal of records during
+reconstruction. When using tagging, all tagged records encountered during
+reconstruction can be removed. When using tombstones, however, the removal
+process is non-trivial. In principle, a rejection check could be performed for
+each record encountered during reconstruction, but this would increase
+reconstruction costs and introduce a new problem of tracking tombstones
+associated with records that have been removed. Instead, a lazier approach can
+be used: delaying removal until a tombstone and its associated record
+participate in the same shard reconstruction. This delay allows both the record
+and its tombstone to be removed at the same time, an approach called
+\emph{tombstone cancellation}. In general, this can be implemented using an
+extra linear scan of the input shards before reconstruction to identify
+tombstones and associated records for cancellation, but potential optimizations
+exist for many SSIs, allowing it to be performed during the reconstruction
+itself at no extra cost.
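+
+A sketch of such a cancellation pass, applied to all records
+participating in a reconstruction and using the \texttt{TOMBSTONE} bit
+from the earlier sketch, is shown below:
+\begin{verbatim}
+from collections import Counter
+
+def cancel_tombstones(records):
+    # A data record and a matching tombstone cancel and are both
+    # dropped; an unmatched tombstone survives so that it can cancel
+    # its record during a later reconstruction.
+    tombs = Counter((r.key, r.value) for r in records
+                    if r.header & TOMBSTONE)
+    out = []
+    for r in records:
+        if r.header & TOMBSTONE:
+            continue
+        ident = (r.key, r.value)
+        if tombs[ident] > 0:
+            tombs[ident] -= 1          # cancel against one tombstone
+        else:
+            out.append(r)
+    for r in records:                  # re-emit unmatched tombstones
+        ident = (r.key, r.value)
+        if r.header & TOMBSTONE and tombs[ident] > 0:
+            tombs[ident] -= 1
+            out.append(r)
+    return out
+\end{verbatim}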
+
+The removal of deleted records passively during reconstruction is not enough to
+bound the number of deleted records within the index. It is not difficult to
+envision pathological scenarios where deletes result in unbounded rejection
+rates, even with this mitigation in place. However, the dropping of deleted
+records does provide a useful property: any specific deleted record will
+eventually be removed from the index after a finite number of reconstructions.
+Using this fact, a bound on the number of deleted records can be enforced. A
+new parameter, $\delta$, is defined, representing the maximum proportion of
+deleted records within the index. Each level, and the buffer, tracks the number
+of deleted records it contains by counting its tagged records or tombstones.
+Following each buffer flush, the proportion of deleted records is checked
+against $\delta$. If any level is found to exceed it, then a proactive
+reconstruction is triggered, pushing its shards down into the next level. The
+process is repeated until all levels respect the bound, allowing the number of
+deleted records to be precisely controlled, which, by extension, bounds the
+rejection rate. This process is called \emph{compaction}.
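+
+A simplified sketch of this check, run after each buffer flush, is given
+below; the level bookkeeping and the \texttt{merge\_level\_down} helper
+are hypothetical:
+\begin{verbatim}
+def enforce_delete_bound(index, delta):
+    # Walk the levels and proactively push down any level whose
+    # proportion of deleted records (tags or tombstones) exceeds delta.
+    for i, level in enumerate(index.levels):
+        if level.record_count == 0:
+            continue
+        if level.deleted_count / level.record_count > delta:
+            # Reconstruct into the next level; tombstone cancellation
+            # and tag dropping happen as part of the merge itself.
+            index.merge_level_down(i)
+\end{verbatim}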
+
+Assuming every record is equally likely to be sampled, this new bound can be
+applied to the analysis of sampling costs. The probability of a record being
+rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
+Equation~\ref{eq:sampling-cost} yields,
+\begin{equation}
+%\label{eq:sampling-cost-del}
+ O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
+\end{equation}
+
+Asymptotically, this proactive compaction does not alter the analysis of
+insertion costs. Each record is still written at most $s$ times on each level,
+there are at most $\log_s n$ levels, and the buffer insertion and SSI
+construction costs are all unchanged, and so on. This results in the amortized
+insertion cost remaining the same.
+
+This compaction strategy is based upon tombstone and record counts, and the
+bounds assume that every record is equally likely to be sampled. For certain
+sampling problems (such as WSS), there are other conditions that must be
+considered to provide a bound on the rejection rate. To account for these
+situations in a general fashion, the framework supports problem-specific
+compaction triggers that can be tailored to the SSI being used. These allow
+compactions to be triggered based on other properties, such as rejection rate
+of a level, weight of deleted records, and the like.
+
+
+
+\subsection{Performance Tuning and Configuration}
\captionsetup[subfloat]{justification=centering}
@@ -68,12 +380,9 @@ and many of Bentley-Saxe's limitations addressed.
\subsection{Framework Overview}
-The goal of this chapter is to build a general framework that extends most SSIs
-with efficient support for updates by splitting the index into small data structures
-to reduce reconstruction costs, and then distributing the sampling process over these
-smaller structures.
-The framework is designed to work efficiently with any SSI, so
-long as it has the following properties,
+Our framework has been designed to work efficiently with any SSI, so long
+as it has the following properties.
+
\begin{enumerate}
\item The underlying full query $Q$ supported by the SSI from whose results
samples are drawn satisfies the following property:
@@ -219,101 +528,6 @@ framework is,
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}
-
-\subsection{Sampling}
-\label{ssec:sample}
-
-\begin{figure}
- \centering
- \includegraphics[width=\textwidth]{img/sigmod23/sampling}
- \caption{\textbf{Overview of the multiple-shard sampling query process} for
- Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
- the shards is determined, then (2) these weights are used to construct an
- alias structure. Next, (3) the alias structure is queried $k$ times to
- determine per shard sample sizes, and then (4) sampling is performed.
- Finally, (5) any rejected samples are retried starting from the alias
- structure, and the process is repeated until the desired number of samples
- has been retrieved.}
- \label{fig:sample}
-
-\end{figure}
-
-For many SSIs, sampling queries are completed in two stages. Some preliminary
-processing is done to identify the range of records from which to sample, and then
-samples are drawn from that range. For example, IRS over a sorted list of
-records can be performed by first identifying the upper and lower bounds of the
-query range in the list, and then sampling records by randomly generating
-indexes within those bounds. The general cost of a sampling query can be
-modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is
-the number of samples drawn, and $S(n)$ is the cost of sampling a single
-record.
-
-When sampling from multiple shards, the situation grows more complex. For each
-sample, the shard to select the record from must first be decided. Consider an
-arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against
-dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D
-= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset, \forall i,j < m$. The
-framework must ensure that $X(D, k)$ and $\bigcup_{i=0}^m X(D_i, k_i)$ follow
-the same distribution, by selecting appropriate values for the $k_i$s. If care
-is not taken to balance the number of samples drawn from a shard with the total
-weight of the shard under $X$, then bias can be introduced into the sample
-set's distribution. The selection of $k_i$s can be viewed as an instance of WSS,
-and solved using the alias method.
-
-When sampling using the framework, first the weight of each shard under the
-sampling query is determined and a \emph{shard alias structure} built over
-these weights. Then, for each sample, the shard alias is used to
-determine the shard from which to draw the sample. Let $W(n)$ be the cost of
-determining this total weight for a single shard under the query. The initial setup
-cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s
-n\right)$, as the preliminary work for sampling from each shard must be
-performed, as well as weights determined and alias structure constructed. In
-many cases, however, the preliminary work will also determine the total weight,
-and so the relevant operation need only be applied once to accomplish both
-tasks.
-
-To ensure that all records appear in the sample set with the appropriate
-probability, the mutable buffer itself must also be a valid target for
-sampling. There are two generally applicable techniques that can be applied for
-this, both of which can be supported by the framework. The query being sampled
-from can be directly executed against the buffer and the result set used to
-build a temporary SSI, which can be sampled from. Alternatively, rejection
-sampling can be used to sample directly from the buffer, without executing the
-query. In this case, the total weight of the buffer is used for its entry in
-the shard alias structure. This can result in the buffer being
-over-represented in the shard selection process, and so any rejections during
-buffer sampling must be retried starting from shard selection. These same
-considerations apply to rejection sampling used against shards, as well.
-
-
-\begin{example}
- \label{ex:sample}
- Consider executing a WSS query, with $k=1000$, across three shards
- containing integer keys with unit weight. $S_1$ contains only the
- key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$
- contains all integers on $[101, 200]$. These structures are shown
- in Figure~\ref{fig:sample}. Sampling is performed by first
- determining the normalized weights for each shard: $w_1 = 0.005$,
- $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
- shard alias structure. The shard alias structure is then queried
- $k$ times, resulting in a distribution of $k_i$s that is
- commensurate with the relative weights of each shard. Finally,
- each shard is queried in turn to draw the appropriate number
- of samples.
-\end{example}
-
-
-Assuming that rejection sampling is used on the mutable buffer, the worst-case
-time complexity for drawing $k$ samples from an index containing $n$ elements
-with a sampling cost of $S(n)$ is,
-\begin{equation}
- \label{eq:sample-cost}
- O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right)
-\end{equation}
-
-%If instead a temporary SSI is constructed, the cost of sampling
-%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$.
-
\begin{figure}
\centering
\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
@@ -342,163 +556,6 @@ with a sampling cost of $S(n)$ is,
\subsection{Deletion}
\label{ssec:delete}
-Because the shards are static, records cannot be arbitrarily removed from them.
-This requires that deletes be supported in some other way, with the ultimate
-goal being the prevention of deleted records' appearance in sampling query
-result sets. This can be realized in two ways: locating the record and marking
-it, or inserting a new record which indicates that an existing record should be
-treated as deleted. The framework supports both of these techniques, the
-selection of which is called the \emph{delete policy}. The former policy is
-called \emph{tagging} and the latter \emph{tombstone}.
-
-Tagging a record is straightforward. Point-lookups are performed against each
-shard in the index, as well as the buffer, for the record to be deleted. When
-it is found, a bit in a header attached to the record is set. When sampling,
-any records selected with this bit set are automatically rejected. Tombstones
-represent a lazy strategy for deleting records. When a record is deleted using
-tombstones, a new record with identical key and value, but with a ``tombstone''
-bit set, is inserted into the index. A record's presence can be checked by
-performing a point-lookup. If a tombstone with the same key and value exists
-above the record in the index, then it should be rejected when sampled.
-
-Two important aspects of performance are pertinent when discussing deletes: the
-cost of the delete operation, and the cost of verifying the presence of a
-sampled record. The choice of delete policy represents a trade-off between
-these two costs. Beyond this simple trade-off, the delete policy also has other
-implications that can affect its applicability to certain types of SSI. Most
-notably, tombstones do not require any in-place updating of records, whereas
-tagging does. This means that using tombstones is the only way to ensure total
-immutability of the data within shards, which avoids random writes and eases
-concurrency control. The tombstone delete policy, then, is particularly
-appealing in external and concurrent contexts.
-
-\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
-the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
-of the record to be deleted, and so is more expensive. Assuming a point-lookup
-operation with cost $L(n)$, a tagged delete must search each level in the
-index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
-time.
-
-\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
-itself, the delete policy affects the cost of determining if a given record has
-been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
-using tagging, the information necessary to make the rejection decision is
-local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
-it is not; a point-lookup must be performed to search for a given record's
-corresponding tombstone. This look-up must examine the buffer, and each shard
-within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
-L(n) \log_s n\right)$. The rejection check process for the two delete policies is
-summarized in Figure~\ref{fig:delete}.
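Continuing the illustrative \texttt{Record} sketch from above, the two
rejection checks might look like the following; the tombstone check has to
walk the buffer and every shard above the sampled record, which is the source
of its higher cost.
\begin{verbatim}
def rejected_tagging(record):
    # The rejection decision is local to the sampled record: O(1).
    return record.deleted

def rejected_tombstone(record, buffer, shards_above):
    # Point-lookup for a matching tombstone in the buffer and in every shard
    # newer than the one the record was sampled from.
    for container in (buffer, *shards_above):
        for rec in container:
            if rec.tombstone and (rec.key, rec.value) == (record.key, record.value):
                return True
    return False
\end{verbatim}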
-
-Two factors contribute to the tombstone rejection check cost: the size of the
-buffer, and the cost of performing a point-lookup against the shards. The
-latter cost can be controlled using the framework's ability to associate
-auxiliary structures with shards. For SSIs which do not support efficient
-point-lookups, a hash table can be added to map key-value pairs to their
-location within the SSI. This allows for constant-time rejection checks, even
-in situations where the index would not otherwise support them. However, the
-storage cost of this intervention is high, and in situations where the SSI does
-support efficient point-lookups, it is not necessary. Further performance
-improvements can be achieved by noting that the probability of a given record
-having an associated tombstone in any particular shard is relatively small.
-This means that many point-lookups will be executed against shards that do not
-contain the tombstone being searched for. In this case, these unnecessary
-lookups can be partially avoided using Bloom filters~\cite{bloom70} for
-tombstones. By inserting tombstones into these filters during reconstruction,
-point-lookups against some shards which do not contain the tombstone being
-searched for can be bypassed. Filters can be attached to the buffer as well,
-which may be even more significant due to the linear cost of scanning it. As
-the goal is a reduction of rejection check costs, these filters need only be
-populated with tombstones. In a later section, techniques for bounding the
-number of tombstones on a given level are discussed, which will allow for the
-memory usage of these filters to be tightly controlled while still ensuring
-precise bounds on filter error.
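The filter-guarded lookup path might be structured as shown below. A Python
set stands in for the per-shard tombstone Bloom filter (it has no false
positives, but occupies the same place in the control flow), so this is only
a sketch of where the filter check sits, not of the filter itself.
\begin{verbatim}
class ShardWithFilter:
    """A shard paired with a filter over its tombstones (here, an exact set)."""
    def __init__(self, records):
        self.records = records
        self.tombstone_filter = {(r.key, r.value) for r in records if r.tombstone}

    def filter_may_contain(self, key, value):
        # A real Bloom filter can return false positives, never false negatives.
        return (key, value) in self.tombstone_filter

    def has_tombstone(self, key, value):
        return any(r.tombstone and r.key == key and r.value == value
                   for r in self.records)

def rejected_tombstone_filtered(record, buffer, shards_above):
    kv = (record.key, record.value)
    if any(r.tombstone and (r.key, r.value) == kv for r in buffer):
        return True
    for shard in shards_above:
        # Skip the point-lookup entirely when the filter rules the tombstone out.
        if shard.filter_may_contain(*kv) and shard.has_tombstone(*kv):
            return True
    return False
\end{verbatim}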
-
-\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
-alters the analysis of sampling costs. A record that has been deleted cannot
-be present in the sample set, and therefore the presence of each sampled record
-must be verified. If a record has been deleted, it must be rejected. When
-retrying samples rejected due to delete, the process must restart from shard
-selection, as deleted records may be counted in the weight totals used to
-construct that structure. This increases the cost of sampling to,
-\begin{equation}
-\label{eq:sampling-cost}
- O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
-\end{equation}
-where $R(n)$ is the cost of checking if a sampled record has been deleted, and
-$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
-attempts required to obtain $k$ samples, given a fixed rejection probability.
-The rejection probability itself is a function of the workload, and is
-unbounded.
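The retry behavior captured by Equation~\ref{eq:sampling-cost} amounts to a
loop that restarts from shard selection whenever a sampled record is
rejected. In the sketch below, \texttt{select\_shard}, \texttt{sample\_from},
and \texttt{is\_deleted} are hypothetical placeholders for the shard alias
query, the SSI sampling operation, and the policy-specific rejection check.
\begin{verbatim}
def sample_k(k, select_shard, sample_from, is_deleted):
    samples = []
    while len(samples) < k:
        shard = select_shard()        # restart here on every rejection, since
        record = sample_from(shard)   # deleted records still contribute to the
        if is_deleted(record):        # weights used for shard selection
            continue
        samples.append(record)
    return samples
\end{verbatim}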
-
-\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
-constitute wasted memory accesses and random number generations, and so steps
-should be taken to minimize their frequency. The probability of a rejection is
-directly related to the number of deleted records, which is itself a function
-of workload and dataset. This means that, without building counter-measures
-into the framework, tight bounds on sampling performance cannot be provided in
-the presence of deleted records. It is therefore critical that the framework
-support some method for bounding the number of deleted records within the
-index.
-
-While the static nature of shards prevents the direct removal of records at the
-moment they are deleted, it doesn't prevent the removal of records during
-reconstruction. When using tagging, all tagged records encountered during
-reconstruction can be removed. When using tombstones, however, the removal
-process is non-trivial. In principle, a rejection check could be performed for
-each record encountered during reconstruction, but this would increase
-reconstruction costs and introduce a new problem of tracking tombstones
-associated with records that have been removed. Instead, a lazier approach can
-be used: delaying removal until a tombstone and its associated record
-participate in the same shard reconstruction. This delay allows both the record
-and its tombstone to be removed at the same time, an approach called
-\emph{tombstone cancellation}. In general, this can be implemented using an
-extra linear scan of the input shards before reconstruction to identify
-tombstones and associated records for cancellation, but potential optimizations
-exist for many SSIs, allowing it to be performed during the reconstruction
-itself at no extra cost.
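Tombstone cancellation during a reconstruction might be sketched as follows,
reusing the illustrative \texttt{Record} type from earlier. Unmatched
tombstones are carried forward, tagged records are dropped outright, and the
ordering subtlety that a tombstone only cancels records older than itself is
ignored for brevity.
\begin{verbatim}
from collections import Counter

def cancel_tombstones(records):
    live = Counter((r.key, r.value) for r in records
                   if not r.tombstone and not r.deleted)
    stones = Counter((r.key, r.value) for r in records if r.tombstone)
    to_cancel = {kv: min(live[kv], stones[kv]) for kv in stones}

    drop_live = dict(to_cancel)    # matched records still to drop, per key
    drop_stones = dict(to_cancel)  # matched tombstones still to drop, per key
    survivors = []
    for rec in records:
        kv = (rec.key, rec.value)
        if rec.deleted:
            continue                           # tagged record: drop
        if rec.tombstone and drop_stones.get(kv, 0) > 0:
            drop_stones[kv] -= 1               # tombstone cancelled with a record
            continue
        if not rec.tombstone and drop_live.get(kv, 0) > 0:
            drop_live[kv] -= 1                 # record cancelled with a tombstone
            continue
        survivors.append(rec)
    return survivors
\end{verbatim}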
-
-The removal of deleted records passively during reconstruction is not enough to
-bound the number of deleted records within the index. It is not difficult to
-envision pathological scenarios where deletes result in unbounded rejection
-rates, even with this mitigation in place. However, the dropping of deleted
-records does provide a useful property: any specific deleted record will
-eventually be removed from the index after a finite number of reconstructions.
-Using this fact, a bound on the number of deleted records can be enforced. A
-new parameter, $\delta$, is defined, representing the maximum proportion of
-deleted records within the index. Each level, and the buffer, tracks the number
-of deleted records it contains by counting its tagged records or tombstones.
-Following each buffer flush, the proportion of deleted records is checked
-against $\delta$. If any level is found to exceed it, then a proactive
-reconstruction is triggered, pushing its shards down into the next level. The
-process is repeated until all levels respect the bound, allowing the number of
-deleted records to be precisely controlled, which, by extension, bounds the
-rejection rate. This process is called \emph{compaction}.
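The bound-enforcement check that runs after each buffer flush might look like
the following sketch, where \texttt{delete\_count}, \texttt{record\_count},
and \texttt{compact\_into\_next} are hypothetical helpers for counting tagged
records or tombstones on a level, counting its records, and performing the
proactive reconstruction that pushes a level's shards down.
\begin{verbatim}
def enforce_delete_bound(levels, delta, delete_count, record_count,
                         compact_into_next):
    # Scan the levels after a flush; any level whose proportion of deleted
    # records exceeds delta is compacted into the level below it. Continuing
    # the scan downward also handles violations caused by the compaction.
    i = 0
    while i < len(levels):
        n = record_count(levels[i])
        if n > 0 and delete_count(levels[i]) / n > delta:
            compact_into_next(levels, i)
        i += 1
\end{verbatim}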
-
-Assuming every record is equally likely to be sampled, this new bound can be
-applied to the analysis of sampling costs. The probability of a record being
-rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
-Equation~\ref{eq:sampling-cost} yields,
-\begin{equation}
-%\label{eq:sampling-cost-del}
- O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
-\end{equation}
-
-Asymptotically, this proactive compaction does not alter the analysis of
-insertion costs. Each record is still written at most $s$ times on each level,
-there are at most $\log_s n$ levels, and the buffer insertion and SSI
-construction costs are all unchanged, and so on. This results in the amortized
-insertion cost remaining the same.
-
-This compaction strategy is based upon tombstone and record counts, and the
-bounds assume that every record is equally likely to be sampled. For certain
-sampling problems (such as WSS), there are other conditions that must be
-considered to provide a bound on the rejection rate. To account for these
-situations in a general fashion, the framework supports problem-specific
-compaction triggers that can be tailored to the SSI being used. These allow
-compactions to be triggered based on other properties, such as rejection rate
-of a level, weight of deleted records, and the like.
-
\subsection{Trade-offs on Framework Design Space}
\label{ssec:design-space}
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 0155c7d..befdbba 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -1,20 +1,38 @@
\section{Introduction} \label{sec:intro}
-As a first attempt at realizing a dynamic extension framework, one of the
-non-decomposable search problems discussed in the previous chapter was
-considered: independent range sampling, along with a number of other
-independent sampling problems. These sorts of queries are important in a
-variety of contexts, including including approximate query processing
-(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
-exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
-feature selection for machine learning~\cite{ml-sampling}. However, they are
-not well served using existing techniques, which tend to sacrifice statistical
-independence for performance, or vise versa. In this chapter, a solution for
-independent sampling is presented that manages to achieve both statistical
-independence, and good performance, by designing a Bentley-Saxe inspired
-framework for introducing update support to efficient static sampling data
-structures. It seeks to demonstrate the viability of Bentley-Saxe as the basis
-for adding update support to data structures, as well as showing that the
-limitations of the decomposable search problem abstraction can be overcome
-through alternative query processing techniques to preserve good
-performance.
+Having discussed the relevant background material, we now turn to our
+first attempt to address the limitations of dynamization in the context
+of one particular class of non-decomposable search problems: independent
+random sampling. We have already discussed one representative problem of
+this class, independent range sampling, and shown that it is not
+traditionally decomposable. It is, however, only one of several closely
+related problems, and in this chapter we will also consider simple random
+sampling, weighted set sampling, and weighted independent range sampling.
+
+Independent sampling presents an interesting motivating example
+because it is nominally supported within many relational databases,
+and is useful in a variety of contexts, such as approximate
+query processing (AQP)~\cite{blinkdb,quickr,verdict,cohen23},
+interactive data exploration~\cite{sps,xie21}, financial audit
+sampling~\cite{olken-thesis}, and feature selection for machine
+learning~\cite{ml-sampling}. However, existing support for these search
+problems is limited by the techniques used within databases to implement
+them. Existing implementations tend to sacrifice either performance,
+by requiring the entire result set to be materialized prior to applying
+Bernoulli sampling, or statistical independence. Techniques exist
+for obtaining both sampling performance and independence by leveraging
+existing B+Tree indices with slight modification~\cite{olken-thesis},
+but even these have worse sampling performance than could be
+achieved using specialized static sampling indices.
+
+Thus, we decided to apply a Bentley-Saxe-based dynamization
+technique to these data structures. In this chapter, we discuss our
+approach, which addresses the decomposability problems discussed in
+Section~\ref{ssec:background-irs}, introduces two physical mechanisms
+for supporting deletes, and also introduces an LSM-tree-inspired design
+space to allow for performance tuning. The results in this chapter are
+highly specialized to sampling problems; however, they will serve as a
+starting point for our discussion of a generalized framework in
+the subsequent chapter.
+
diff --git a/references/references.bib b/references/references.bib
index 7cec949..0dbc804 100644
--- a/references/references.bib
+++ b/references/references.bib
@@ -492,6 +492,12 @@
year = {2023}
}
+@misc {db2-doc,
+ title = {IBM DB2 Documentation},
+ url = {https://www.ibm.com/docs/en/db2/12.1.0?topic=design-data-sampling-in-queries},
+ year = {2025}
+}
+
@online {pinecone,
title = {Pinecone DB},
@@ -1474,3 +1480,5 @@ keywords = {analytic model, analysis of algorithms, overflow chaining, performan
biburl = {https://dblp.org/rec/journals/jal/EdelsbrunnerO85.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
+
+