diff options
| -rw-r--r-- | chapters/background.tex | 13 |
| -rw-r--r-- | chapters/sigmod23/background.tex | 340 |
| -rw-r--r-- | chapters/sigmod23/framework.tex | 669 |
| -rw-r--r-- | chapters/sigmod23/introduction.tex | 54 |
| -rw-r--r-- | references/references.bib | 8 |
5 files changed, 619 insertions, 465 deletions
diff --git a/chapters/background.tex b/chapters/background.tex index 9950b39..69436c8 100644 --- a/chapters/background.tex +++ b/chapters/background.tex @@ -85,6 +85,7 @@ their work on dynamization, and we will adopt their definition, \begin{equation*} F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} + for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$. \end{definition} The requirement for $\mergeop$ to be constant-time was used by Bentley and @@ -101,6 +102,7 @@ problems}, \begin{equation*} F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} + for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$. \end{definition} To demonstrate that a search problem is decomposable, it is necessary to @@ -811,6 +813,7 @@ cost, we could greatly reduce the cost of supporting $C(n)$-decomposable queries. \subsubsection{Independent Range Sampling} +\label{ssec:background-irs} Another problem that is not decomposable is independent sampling. There are a variety of problems falling under this umbrella, including weighted @@ -831,15 +834,7 @@ matching of records in result sets. To work around this, a slight abuse of definition is in order: assume that the equality conditions within the DSP definition can be interpreted to mean ``the contents in the two sets are drawn from the same distribution''. This enables the category -of DSP to apply to this type of problem. More formally, -\begin{definition}[Decomposable Sampling Problem] - A sampling problem $F: (D, Q) \to R$, $F$ is decomposable if and - only if there exists a constant-time computable, associative, and - commutative binary operator $\mergeop$ such that, - \begin{equation*} - F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q) - \end{equation*} -\end{definition} +of DSP to apply to this type of problem. Even with this abuse, however, IRS cannot generally be considered decomposable; it is at best $C(n)$-decomposable. The reason for this is diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex index 58324bd..ad89e03 100644 --- a/chapters/sigmod23/background.tex +++ b/chapters/sigmod23/background.tex @@ -1,31 +1,63 @@ \section{Background} \label{sec:background} -This section formalizes the sampling problem and describes relevant existing -solutions. Before discussing these topics, though, a clarification of -definition is in order. The nomenclature used to describe sampling varies -slightly throughout the literature. In this chapter, the term \emph{sample} is -used to indicate a single record selected by a sampling operation, and a -collection of these samples is called a \emph{sample set}; the number of -samples within a sample set is the \emph{sample size}. The term \emph{sampling} -is used to indicate the selection of either a single sample or a sample set; -the specific usage should be clear from context. - - -\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often -desirable for the drawn samples to have \emph{statistical independence}. This -requires that the sampling of a record does not affect the probability of any -other record being sampled in the future. Independence is a requirement for the -application of statistical tools such as the Central Limit -Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. -A failure to maintain independence in sampling invalidates any guarantees -provided by these statistical methods. 
- -In each of the problems considered, sampling can be performed either with -replacement (WR) or without replacement (WoR). It is possible to answer any WoR -sampling query using a constant number of WR queries, followed by a -deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR -sampling. +We will begin with a formal discussion of the sampling problem and +relevant existing solutions. First, though, a clarification of definition +is in order. The nomenclature used to describe sampling in the literature +is rather inconsistent, and so we'll first specifically define all of +the relevant terms.\footnote{ + As an amusing aside, this naming problem actually resulted in a + significant miscommunication between myself and my advisor in the + early days of the project, resulting in a lot of time being expended + on performance debugging a problem that didn't actually exist! +} +In this chapter, we'll use the term \emph{sample} to indicate a +single record selected by a sampling operation, and a collection of +these samples will be called a \emph{sample set}. The number of samples +within a sample set is the \emph{sample size}. The term \emph{sampling} +is used to indicate the selection of either a single sample or a sample +set; the specific usage should be clear from context. + +In each of the problems considered, sampling can be performed either +with replacement or without replacement. Sampling with replacement +means that a record that has been included in the sample set for a given +sampling query is ``replaced'' into the dataset and allowed to be sampled +again. Sampling without replacement does not ``replace'' the record, +and so each individual record can only be included within a sample +set once for a given query. The data structures that will be discussed +support sampling with replacement, and sampling without replacement can +be implemented using a constant number of with-replacement sampling +operations, followed by a deduplication step~\cite{hu15}, so this chapter +will focus exclusively on the with-replacement case. + +\subsection{Independent Sampling Problem} + +When conducting sampling, it is often desirable for the drawn samples to +have \emph{statistical independence} and for the distribution of records +in the sample set to match the distribution of the source data set. This +requires that the sampling of a record does not affect the probability of +any other record being sampled in the future. Such sample sets are said +to be drawn i.i.d.\ (independently and identically distributed). Throughout +this chapter, the term ``independent'' will be used to describe both +statistical independence and identical distribution. + +Independence of sample sets is important because many useful statistical +results are derived from assuming that this condition holds. For example, +it is a requirement for the application of statistical tools such as +the Central Limit Theorem~\cite{bulmer79}, which is the basis for many +concentration bounds. A failure to maintain independence in sampling +invalidates any guarantees provided by these statistical methods. + +In the context of databases, it is also common to discuss a more +general version of the sampling problem, called \emph{independent query +sampling} (IQS)~\cite{hu14}. In IQS, a sample set of a specified size is +drawn from the result set of a database query.
In +this context, it isn't enough to ensure that individual records are +sampled independently; the sample sets from repeated queries must also be +independent. This precludes, for example, caching and returning the same +sample set to multiple repetitions of the same query. This inter-query +independence provides a variety of useful properties, such as fairness +and representativeness of query results~\cite{tao22}. A basic version of the independent sampling problem is \emph{weighted set sampling} (WSS),\footnote{ @@ -43,22 +75,18 @@ as: each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in D}w(p)}$ of being sampled. \end{definition} -Each query returns a sample set of size $k$, rather than a -single sample. Queries returning sample sets are the common case, because the -robustness of analysis relies on having a sufficiently large sample +Each query returns a sample set of size $k$, rather than a single +sample. Queries returning sample sets are the common case, because +the robustness of analysis relies on having a sufficiently large sample size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS) problem is a special case of WSS, where every element has unit weight. -In the context of databases, it is also common to discuss a more general -version of the sampling problem, called \emph{independent query sampling} -(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the -result set of a database query. In this context, it is insufficient to merely -ensure individual records are sampled independently; the sample sets returned -by repeated IQS queries must be independent as well. This provides a variety of -useful properties, such as fairness and representativeness of query -results~\cite{tao22}. As a concrete example, consider simple random sampling on -the result set of a single-dimensional range reporting query. This is -called independent range sampling (IRS), and is formally defined as: +For WSS, the results are taken directly from the dataset without applying +any predicates or filtering. This can be useful; however, for IQS it is +common for database queries to apply predicates to the data. A very common +search problem underlying database queries is range scanning, +which can be formulated as a sampling problem called \emph{independent +range sampling} (IRS), \begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query @@ -66,35 +94,73 @@ called independent range sampling (IRS), and is formally defined as: query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. \end{definition} -A generalization of IRS exists, called \emph{Weighted Independent Range -Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS. Each point in $D$ -is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are -drawn from the range query results $D \cap q$ such that each data point has a -probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled. - - -\Paragraph{Existing Solutions.} While many sampling techniques exist, -few are supported in practical database systems. The existing -\texttt{TABLESAMPLE} operator provided by SQL in all major DBMS -implementations~\cite{postgres-doc} requires either a linear scan (e.g., -Bernoulli sampling) that results in high sample retrieval costs, or relaxed -statistical guarantees (e.g., block sampling~\cite{postgres-doc} used in -PostgreSQL).
- -Index-assisted sampling solutions have been studied -extensively. Olken's method~\cite{olken89} is a classical solution to -independent sampling problems. This algorithm operates upon traditional search -trees, such as the B+tree used commonly as a database index. It conducts a -random walk on the tree uniformly from the root to a leaf, resulting in a -$O(\log n)$ sampling cost for each returned record. Should weighted samples be -desired, rejection sampling can be performed. A sampled record, $r$, is -accepted with probability $\nicefrac{w(r)}{w_{max}}$, with an expected -number of $\nicefrac{w_{max}}{w_{avg}}$ samples to be taken per element in the -sample set. Olken's method can also be extended to support general IQS by -rejecting all sampled records failing to satisfy the query predicate. It can be -accelerated by adding aggregated weight tags to internal -nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed -during the tree-traversal to abort dead-end traversals early. + +IRS is a non-weighted sampling problem, similar to SRS. There also exists +a weighted generalization, called \emph{weighted independent range +sampling} (WIRS), + +\begin{definition}[Weighted Independent Range Sampling~\cite{afshani17}] + Let $D$ be a set of $n$ points in $\mathbb{R}$ that are associated with + positive weights $w: D\to \mathbb{R}^+$. Given a query + interval $q = [x, y]$ and an integer $k$, a weighted independent range sampling + query returns $k$ independent samples from $D \cap q$ with each + point having a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ + of being sampled. +\end{definition} + +This is not an exhaustive list of sampling problems, but it is the list +of problems that will be directly addressed within this chapter. + +\subsection{Algorithmic Solutions} + +Relational database systems often have native support for IQS using +SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the +algorithms used to implement this operator have significant limitations: +users must choose between statistical independence and performance. + +To maintain statistical independence, Bernoulli sampling is used. This +technique requires iterating over every record in the result set of the +query, and selecting or rejecting it for inclusion within the sample +with a fixed probability~\cite{db2-doc}. This process requires that each +record in the result set be considered, and thus provides no performance +benefit relative to the query being sampled from, as it must be answered +in full anyway before returning only some of the results. + +For performance, the statistical guarantees can be discarded and +systematic or block sampling used instead. Systematic sampling considers +only a fraction of the rows in the table being sampled from, following +some particular pattern~\cite{postgres-doc}, and block sampling samples +entire database pages~\cite{db2-doc}. These allow for query performance +to be decoupled from data size, but tie a given record's inclusion in the +sample set directly to its physical storage location, which can introduce +bias into the sample and violate statistical guarantees. + +\subsection{Index-assisted Solutions} +It is possible to answer IQS queries in a manner that both preserves +independence and avoids executing the query in full, through the use +of specialized data structures. + +\Paragraph{Olken's Method.} +The classical solution is Olken's method~\cite{olken89}, +which can be applied to traditional tree-based database indices.
This +technique performs a randomized tree traversal, selecting the pointer to +follow at each node uniformly at random. This allows SRS queries to be +answered at $\Theta(\log n)$ cost per sample in the sample set. Thus, +for an IQS query with a desired sample set size of $k$, Olken's method +can provide a sample in $\Theta(k \log n)$ time. + +More complex IQS queries, such as weighted or predicate-filtered sampling, +can be answered using the same algorithm by applying rejection sampling. +To support predicates, any sampled records that violate the predicate can +be rejected and retried. For weighted sampling, a given record $r$ will +be accepted into the sample with $\nicefrac{w(r)}{w_{max}}$ probability. +This will require an expected number of $\nicefrac{w_{max}}{w_{avg}}$ +attempts per sample in the sample set~\cite{olken-thesis}. This rejection +sampling can be significantly improved by adding aggregated weight tags to +internal nodes, allowing rejection sampling to be performed at each step +of the tree traversal to abort dead-end traversals early~\cite{zhao22}. In +either case, there will be a performance penalty to rejecting samples, +requiring greater than $k$ traversals to obtain a sample set of size $k$. \begin{figure} \centering @@ -115,68 +181,78 @@ during the tree-traversal to abort dead-end traversals early. \end{figure} -There also exist static data structures, referred to in this chapter as static -sampling indexes (SSIs)\footnote{ -The name SSI was established in the published version of this paper prior to the -realization that a distinction between the terms index and data structure would -be useful. We'll continue to use the term SSI for the remainder of this chapter, -to maintain consistency with the published work, but technically an SSI refers to - a data structure, not an index, in the nomenclature established in the previous - chapter. - }, that are capable of answering sampling queries in -near-constant time\footnote{ - The designation -``near-constant'' is \emph{not} used in the technical sense of being constant -to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean -constant to within an additive polylogarithmic term, i.e., $f(x) \in O(\log n + -1)$. -%For example, drawing $k$ samples from $n$ records using a near-constant -%approach would require $O(\log n + k)$ time. This is in contrast to a -%tree-traversal approach, which would require $O(k\log n)$ time. -} relative to the size of the dataset. An example of such a -structure is used in Walker's alias method \cite{walker74,vose91}, a technique -for answering WSS queries with $O(1)$ query cost per sample, but requiring -$O(n)$ time to construct. It distributes the weight of items across $n$ cells, -where each cell is partitioned into at most two items, such that the total -proportion of each cell assigned to an item is its total weight. A query -selects one cell uniformly at random, then chooses one of the two items in the -cell by weight; thus, selecting items with probability proportional to their -weight in $O(1)$ time. A pictorial representation of this structure is shown in -Figure~\ref{fig:alias}. - -The alias method can also be used as the basis for creating SSIs capable of -answering general IQS queries using a technique called alias -augmentation~\cite{tao22}. 
As a concrete example, previous -papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries using $O(\log n + k)$ time, where the $\log n$ cost is only be paid only once per query, after which -elements can be sampled in constant time. This structure is built by breaking -the data up into disjoint chunks of size $\nicefrac{n}{\log n}$, called -\emph{fat points}, each with an alias structure. A B+tree is then constructed, -using the fat points as its leaf nodes. The internal nodes are augmented with -an alias structure over the total weight of each child. This alias structure -is used instead of rejection sampling to determine the traversal path to take -through the tree, and then the alias structure of the fat point is used to -sample a record. Because rejection sampling is not used during the traversal, -two traversals suffice to establish the valid range of records for sampling, -after which samples can be collected without requiring per-sample traversals. -More examples of alias augmentation applied to different IQS problems can be -found in a recent survey by Tao~\cite{tao22}. -There do exist specialized sampling indexes~\cite{hu14} with both efficient -sampling and support for updates, but these are restricted to specific query -types and are often very complex structures, with poor constant factors -associated with sampling and update costs, and so are of limited practical -utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on -extending the alias structure to support weight updates over a fixed set of -elements. However, these solutions do not allow insertion or deletion in the -underlying dataset, and so are not well suited to database sampling -applications. - -\Paragraph{The Dichotomy.} Among these techniques, there exists a -clear trade-off between efficient sampling and support for updates. Tree-traversal -based sampling solutions pay a dataset size based cost per sample, in exchange for -update support. The static solutions lack support for updates, but support -near-constant time sampling. While some data structures exist with support for -both, these are restricted to highly specialized query types. Thus in the -general case there exists a dichotomy: existing sampling indexes can support -either data updates or efficient sampling, but not both. + +\Paragraph{Static Solutions.} +There are also a large number of static data structures, which we'll +call static sampling indices (SSIs) in this chapter,\footnote{ + We used the term ``SSI'' in the original paper on which this chapter + is based, which was published prior to our realization that a strong + distinction between an index and a data structure would be useful. I + am retaining the term SSI in this chapter for consistency with the + original paper, but understand that in the terminology established in + Chapter~\ref{chap:background}, SSIs are data structures, not indices. +} +that are capable of answering sampling queries more efficiently than +Olken's method relative to the overall data size. An example of such +a structure is used in Walker's alias method \cite{walker74,vose91}. +This technique constructs a data structure in $\Theta(n)$ time +that is capable of answering WSS queries in $\Theta(1)$ time per +sample. Figure~\ref{fig:alias} shows a pictorial representation of the +structure. For a set of $n$ records, it is constructed by distributing +the normalized weight of all of the records across an array of $n$ +cells, which represent at most two records each.
Each cell will have +a proportional representation of its records based on their normalized +weight (e.g., a given cell may be 40\% allocated to one record, and 60\% +to another). To query the structure, a cell is first selected uniformly +at random, and then one of its two associated records is selected +with a probability proportional to the record's weight. This operation +takes $\Theta(1)$ time, requiring only two random number generations +per sample. Thus, a WSS query can be answered in $\Theta(k)$ time, +assuming the structure has already been built. Unfortunately, the alias +structure cannot be efficiently updated, as inserting new records would +change the relative weights of \emph{all} the records, and require fully +repartitioning the structure. + +While the alias method only applies to WSS, other sampling problems can +be solved by using the alias method within the context of a larger data +structure, a technique called \emph{alias augmentation}~\cite{tao22}. For +example, alias augmentation can be used to construct an SSI capable of +answering WIRS queries in $\Theta(\log n + k)$ time~\cite{afshani17,tao22}. +This structure breaks the data into multiple disjoint partitions of size +$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree +is then built, using the augmented partitions as its leaf nodes. Each +internal node is also augmented with an alias structure over the aggregate +weights associated with the children of each pointer. Constructing this +structure requires $\Theta(n)$ time (though the associated constants are +quite large in practice). WIRS queries can be answered by traversing +the tree, first establishing the portion of the tree covering the +query range, and then sampling records from that range using the alias +structures attached to the nodes. More examples of alias augmentation +applied to different IQS problems can be found in a recent survey by +Tao~\cite{tao22}. + +There also exist specialized data structures with support for both +efficient sampling and updates~\cite{hu14}, but these structures have +poor constant factors and are very complex, rendering them of little +practical utility. Additionally, efforts have been made to extend +the alias structure with support for weight updates over a fixed set of +elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not, +however, allow the insertion or removal of records, only in-place +weight updates. While in principle they could be constructed over the +entire domain of possible records, with the weights of non-existent +records set to $0$, this is hardly practical. Thus, these structures are +not suited for the database sampling applications that are of interest to +us in this chapter. + +\subsection{The Dichotomy} Across the index-assisted techniques +discussed above, a clear pattern emerges. Olken's method +supports updates, but is inefficient compared to the SSIs because it +requires a cost that scales with the data size to be paid per sample in the sample set. The +SSIs are more efficient for sampling, typically paying this data-dependent cost +only once per sample set (if at all), but fail to support updates. Thus, +there appears to be a general dichotomy of sampling techniques: existing +sampling data structures support either updates or efficient sampling, +but generally not both. It will be the purpose of this chapter to resolve +this dichotomy.
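To make the alias method concrete, the following is a minimal sketch of the standard build-and-query procedure described above (Vose's variant). It is written in Python purely for illustration; it is not code from this chapter's framework, and all names are our own.
\begin{verbatim}
import random

def build_alias(weights):
    # Build an alias table: each cell stores a cutoff probability for its
    # "primary" record and the index of an "alias" record that fills the
    # rest of the cell, so every cell covers at most two records.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # normalize so the mean is 1
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # cell s split between s and l
        scaled[l] -= 1.0 - scaled[s]            # l keeps its leftover weight
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: pick a cell uniformly, then one of its two records.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
\end{verbatim}
Construction touches each record a constant number of times, giving the $\Theta(n)$ build cost, and each sample requires only two random number generations, matching the $\Theta(1)$ per-sample bound discussed above.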
+ + + diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex index 32a32e1..88ac1ac 100644 --- a/chapters/sigmod23/framework.tex +++ b/chapters/sigmod23/framework.tex @@ -1,53 +1,365 @@ -\section{Dynamic Sampling Index Framework} +\section{Dynamization of SSIs} \label{sec:framework} -This work is an attempt to design a solution to independent sampling -that achieves \emph{both} efficient updates and near-constant cost per -sample. As the goal is to tackle the problem in a generalized fashion, -rather than design problem-specific data structures for used as the basis -of an index, a framework is created that allows for already -existing static data structures to be used as the basis for a sampling -index, by automatically adding support for data updates using a modified -version of the Bentley-Saxe method. - -Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be -directly applied to sampling problems. The concept of decomposability is not -cleanly applicable to sampling, because the distribution of records in the -result set, rather than the records themselves, must be matched following the -result merge. Efficiently controlling the distribution requires each sub-query -to access information external to the structure against which it is being -processed, a contingency unaccounted for by Bentley-Saxe. Further, the process -of reconstruction used in Bentley-Saxe provides poor worst-case complexity -bounds~\cite{saxe79}, and attempts to modify the procedure to provide better -worst-case performance are complex and have worse performance in the common -case~\cite{overmars81}. Despite these limitations, this chapter will argue that -the core principles of the Bentley-Saxe method can be profitably applied to -sampling indexes, once a system for controlling result set distributions and a -more effective reconstruction scheme have been devised. The solution to -the former will be discussed in Section~\ref{ssec:sample}. For the latter, -inspiration is drawn from the literature on the LSM tree. - -The LSM tree~\cite{oneil96} is a data structure proposed to optimize -write throughput in disk-based storage engines. It consists of a memory -table of bounded size, used to buffer recent changes, and a hierarchy -of external levels containing indexes of exponentially increasing -size. When the memory table has reached capacity, it is emptied into the -external levels. Random writes are avoided by treating the data within -the external levels as immutable; all writes go through the memory -table. This introduces write amplification but maximizes sequential -writes, which is important for maintaining high throughput in disk-based -systems. The LSM tree is associated with a broad and well studied design -space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing -trade-offs between three key performance metrics: read performance, write -performance, and auxiliary memory usage. The challenges -faced in reconstructing predominately in-memory indexes are quite - different from those which the LSM tree is intended -to address, having little to do with disk-based systems and sequential IO -operations. But, the LSM tree possesses a rich design space for managing -the periodic reconstruction of data structures in a manner that is both -more practical and more flexible than that of Bentley-Saxe. By borrowing -from this design space, this preexisting body of work can be leveraged, -and many of Bentley-Saxe's limitations addressed. 
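For reference, the logarithmic reconstruction scheme at the core of the Bentley-Saxe method can be sketched as follows. This is an illustrative Python sketch only, assuming a user-supplied \texttt{build} function for the static structure and an \texttt{unbuild} helper that recovers its records; it ignores deletes and the sampling-specific issues discussed below.
\begin{verbatim}
def bsm_insert(levels, record, build, unbuild):
    # levels[i] is either None or a static structure over exactly 2^i records.
    # Inserting one record merges full levels like a binary counter increment,
    # so each record participates in O(log n) rebuilds over its lifetime.
    carry = [record]
    for i in range(len(levels)):
        if levels[i] is None:
            levels[i] = build(carry)        # found an empty slot: rebuild here
            return
        carry += unbuild(levels[i])         # absorb this level's records
        levels[i] = None                    # and carry them upward
    levels.append(build(carry))             # all levels full: add a new one
\end{verbatim}
A decomposable query is then answered by querying every non-empty level and combining the partial results with the merge operator $\mergeop$.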
+Our goal, then, is to design a solution to independent sampling that is +able to achieve \emph{both} efficient updates and efficient sampling, +while also maintaining statistical independence both within and between +IQS queries, and to do so in a generalized fashion without needing to +design new dynamic data structures for each problem. Given the range +of SSIs already available, it seems reasonable to attempt to apply +dynamization techniques to accomplish this goal. Using the Bentley-Saxe +method (BSM) would allow us to support inserts and deletes without +requiring any modification of the SSIs. Unfortunately, as discussed +in Section~\ref{ssec:background-irs}, there are problems with directly +applying BSM to sampling problems. All of the considerations discussed +there in the context of IRS apply equally to the other sampling problems +considered in this chapter. In this section, we will discuss approaches +for resolving these problems. + +\subsection{Sampling over Partitioned Datasets} + +The core problem facing any attempt to dynamize SSIs is that independently +sampling from a partitioned dataset is difficult. As discussed in +Section~\ref{ssec:background-irs}, accomplishing this task within the +DSP model used by the Bentley-Saxe method requires drawing a full $k$ +samples from each of the blocks, and then repeatedly down-sampling each +of the intermediate sample sets. However, it is possible to devise a +more efficient query process if we abandon the DSP model and consider +a slightly more complicated procedure. + +First, we need to resolve a minor definitional problem. As noted before, +the DSP model is based on deterministic queries. The definition doesn't +apply to sampling queries, because it assumes that the result sets of +identical queries should also be identical. For general IQS, we also need +to enforce conditions on the query being sampled from. + +\begin{definition}[Query Sampling Problem] + Given a search problem, $F$, a sampling problem is a function + of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+) + \to \mathcal{R}$ where $\mathcal{D}$ is the domain of records + and $\mathcal{Q}$ is the domain of query parameters of $F$. The + solution to a sampling problem, $R \in \mathcal{R}$, will be a subset + of records from the solution to $F$ drawn independently such that + $|R| = k$ for some $k \in \mathbb{Z}^+$. +\end{definition} +With this in mind, we can now define the decomposability conditions for +a query sampling problem, + +\begin{definition}[Decomposable Sampling Problem] + A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q}, + \mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if + the following conditions are met for all $q \in \mathcal{Q}, + k \in \mathbb{Z}^+$, + \begin{enumerate} + \item There exists a $\Theta(C(n,k))$ time computable, associative, and + commutative binary operator $\mergeop$ such that, + \begin{equation*} + X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F, + B, q, k) + \end{equation*} + for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B + = \emptyset$. + + \item For any dataset $D \subseteq \mathcal{D}$ that has been + decomposed into $m$ partitions such that $D = + \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad + \forall i,j < m, i \neq j$, + \begin{equation*} + F(D, q) = \bigcup_{i=1}^m F(D_i, q) + \end{equation*} + \end{enumerate} +\end{definition} + +These two conditions warrant further explanation.
The first condition +is simply a redefinition of the standard decomposability criteria to +consider matching the distribution, rather than the exact records in $R$, +as the correctness condition for the merge process. The second condition +handles a necessary property of the underlying search problem being +sampled from. Note that this condition is \emph{stricter} than normal +decomposability for $F$, and essentially requires that the query being +sampled from return a set of records, rather than an aggregate value or +some other result that cannot be meaningfully sampled from. This condition +is satisfied by predicate-filtering style database queries, among others. + +With these definitions in mind, let's turn to solving these query sampling +problems. First, we note that many SSIs have a sampling procedure that +naturally involves two phases. In the first phase, some preliminary work is done +to determine metadata concerning the set of records to sample from, +and then $k$ samples are drawn from the structure, taking advantage of +this metadata. If we represent the time cost of the preliminary work +with $P(n)$ and the cost of drawing a sample with $S(n)$, then these +structures' query cost functions are of the form, + +\begin{equation*} +\mathscr{Q}(n, k) = P(n) + k S(n) +\end{equation*} + + +Consider an arbitrary decomposable sampling query with a cost function +of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample +of $k$ records from $d \subseteq \mathcal{D}$ using an instance of +an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results +in $d$ being split across $m$ disjoint instances of $\mathcal{I}$ +such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and +$\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j) += \emptyset \quad \forall i, j < m, i \neq j$. If we consider a +Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation +would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such +a structure would be, +\begin{equation*} +\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right) +\end{equation*} + +This cost function is sub-optimal for two reasons. First, we +pay extra cost to merge the result sets together because of the +down-sampling combination operator. Second, this formulation +fails to avoid a per-sample dependence on $n$, even in the case +where $S(n) \in \Theta(1)$. This gets even worse when considering +rejections that may occur as a result of deleted records. Recall from +Section~\ref{ssec:background-deletes} that deletion can be supported +using weak deletes or a shadow structure in a Bentley-Saxe dynamization. +Using either approach, it isn't possible to avoid deleted records in +advance when sampling, and so these will need to be rejected and retried. +In the DSP model, this retry will need to reprocess every block a second +time; a rejected sample cannot be retried in place without introducing +bias into the result set. We will discuss this more in +Section~\ref{ssec:sampling-deletes}. + +\begin{figure} + \centering + \includegraphics[width=\textwidth]{img/sigmod23/sampling} + \caption{\textbf{Overview of the multiple-block query sampling process} for + Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of + the shards are determined, then (2) these weights are used to construct an + alias structure. Next, (3) the alias structure is queried $k$ times to + determine per-shard sample sizes, and then (4) sampling is performed.
+ Finally, (5) any rejected samples are retried starting from the alias + structure, and the process is repeated until the desired number of samples + has been retrieved.} + \label{fig:sample} + +\end{figure} + +The key insight that allowed us to solve this particular problem is that +there is a mismatch between the structure of the sampling query process +and the structure assumed by DSPs. Using an SSI to answer a sampling +query results in a naturally two-phase process, but DSPs are assumed to +be single-phase. We can construct a more effective process for answering +such queries based on a multi-stage process, summarized in Figure~\ref{fig:sample}. +\begin{enumerate} + \item Determine each block's respective weight under a given + query to be sampled from (e.g., the number of records falling + into the query range for IRS). + + \item Build a temporary alias structure over these weights. + + \item Query the alias structure $k$ times to determine how many + samples to draw from each block. + + \item Draw the appropriate number of samples from each block and + merge them together to form the final query result. +\end{enumerate} +It is possible that some of the records sampled in Step 4 must be +rejected, either because of deletes or some other property of the sampling +procedure being used. If $r$ records are rejected, the above procedure +can be repeated from Step 3, taking $k - r$ as the number of times to +query the alias structure, without needing to redo any of the preprocessing +steps. This can be repeated as many times as necessary until the required +$k$ records have been sampled. + +\begin{example} + \label{ex:sample} + Consider executing a WSS query, with $k=1000$, across three blocks + containing integer keys with unit weight. $\mathscr{I}_1$ contains only the + key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$ + contains all integers on $[101, 200]$. These structures are shown + in Figure~\ref{fig:sample}. Sampling is performed by first + determining the normalized weights for each block: $w_1 = 0.005$, + $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a + block alias structure. The block alias structure is then queried + $k$ times, resulting in a distribution of $k_i$s that is + commensurate with the relative weights of each block. Finally, + each block is queried in turn to draw the appropriate number + of samples. +\end{example} + +Assuming a Bentley-Saxe decomposition with $\log n$ blocks and +a constant number of repetitions, the cost of answering a decomposable +sampling query having a pre-processing cost of $P(n)$ and a per-sample +cost of $S(n)$ will be, +\begin{equation} +\label{eq:dsp-sample-cost} +\boxed{ +\mathscr{Q}(n, k) \in \Theta \left( P(n) \log_2 n + k S(n) \right) +} +\end{equation} +where the cost of building the alias structure is $\Theta(\log_2 n)$ +and thus absorbed into the pre-processing cost. For the SSIs discussed +in this chapter, which have $S(n) \in \Theta(1)$, this model provides us +with the desired decoupling of the data size ($n$) from the per-sample +cost. + +\subsection{Supporting Deletes} + +Because the shards are static, records cannot be arbitrarily removed from them. +This requires that deletes be supported in some other way, with the ultimate +goal being the prevention of deleted records' appearance in sampling query +result sets.
This can be realized in two ways: locating the record and marking +it, or inserting a new record which indicates that an existing record should be +treated as deleted. The framework supports both of these techniques, the +selection of which is called the \emph{delete policy}. The former policy is +called \emph{tagging} and the latter \emph{tombstone}. + +Tagging a record is straightforward. Point-lookups are performed against each +shard in the index, as well as the buffer, for the record to be deleted. When +it is found, a bit in a header attached to the record is set. When sampling, +any records selected with this bit set are automatically rejected. Tombstones +represent a lazy strategy for deleting records. When a record is deleted using +tombstones, a new record with identical key and value, but with a ``tombstone'' +bit set, is inserted into the index. A record's presence can be checked by +performing a point-lookup. If a tombstone with the same key and value exists +above the record in the index, then it should be rejected when sampled. + +Two important aspects of performance are pertinent when discussing deletes: the +cost of the delete operation, and the cost of verifying the presence of a +sampled record. The choice of delete policy represents a trade-off between +these two costs. Beyond this simple trade-off, the delete policy also has other +implications that can affect its applicability to certain types of SSI. Most +notably, tombstones do not require any in-place updating of records, whereas +tagging does. This means that using tombstones is the only way to ensure total +immutability of the data within shards, which avoids random writes and eases +concurrency control. The tombstone delete policy, then, is particularly +appealing in external and concurrent contexts. + +\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is +the same as an ordinary insert. Tagging, by contrast, requires a point-lookup +of the record to be deleted, and so is more expensive. Assuming a point-lookup +operation with cost $L(n)$, a tagged delete must search each level in the +index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$ +time. + +\Paragraph{Rejection Check Costs.} In addition to the cost of the delete +itself, the delete policy affects the cost of determining if a given record has +been deleted. This is called the \emph{rejection check cost}, $R(n)$. When +using tagging, the information necessary to make the rejection decision is +local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones +it is not; a point-lookup must be performed to search for a given record's +corresponding tombstone. This look-up must examine the buffer, and each shard +within the index. This results in a rejection check cost of $R(n) \in O\left(N_b + +L(n) \log_s n\right)$. The rejection check process for the two delete policies is +summarized in Figure~\ref{fig:delete}. + +Two factors contribute to the tombstone rejection check cost: the size of the +buffer, and the cost of performing a point-lookup against the shards. The +latter cost can be controlled using the framework's ability to associate +auxiliary structures with shards. For SSIs which do not support efficient +point-lookups, a hash table can be added to map key-value pairs to their +location within the SSI. This allows for constant-time rejection checks, even +in situations where the index would not otherwise support them. 
However, the +storage cost of this intervention is high, and in situations where the SSI does +support efficient point-lookups, it is not necessary. Further performance +improvements can be achieved by noting that the probability of a given record +having an associated tombstone in any particular shard is relatively small. +This means that many point-lookups will be executed against shards that do not +contain the tombstone being searched for. In this case, these unnecessary +lookups can be partially avoided using Bloom filters~\cite{bloom70} for +tombstones. By inserting tombstones into these filters during reconstruction, +point-lookups against some shards which do not contain the tombstone being +searched for can be bypassed. Filters can be attached to the buffer as well, +which may be even more significant due to the linear cost of scanning it. As +the goal is a reduction of rejection check costs, these filters need only be +populated with tombstones. In a later section, techniques for bounding the +number of tombstones on a given level are discussed, which will allow for the +memory usage of these filters to be tightly controlled while still ensuring +precise bounds on filter error. + +\Paragraph{Sampling with Deletes.} The addition of deletes to the framework +alters the analysis of sampling costs. A record that has been deleted cannot +be present in the sample set, and therefore the presence of each sampled record +must be verified. If a record has been deleted, it must be rejected. When +retrying samples rejected due to delete, the process must restart from shard +selection, as deleted records may be counted in the weight totals used to +construct that structure. This increases the cost of sampling to, +\begin{equation} +\label{eq:sampling-cost} + O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right) +\end{equation} +where $R(n)$ is the cost of checking if a sampled record has been deleted, and +$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling +attempts required to obtain $k$ samples, given a fixed rejection probability. +The rejection probability itself is a function of the workload, and is +unbounded. + +\Paragraph{Bounding the Rejection Probability.} Rejections during sampling +constitute wasted memory accesses and random number generations, and so steps +should be taken to minimize their frequency. The probability of a rejection is +directly related to the number of deleted records, which is itself a function +of workload and dataset. This means that, without building counter-measures +into the framework, tight bounds on sampling performance cannot be provided in +the presence of deleted records. It is therefore critical that the framework +support some method for bounding the number of deleted records within the +index. + +While the static nature of shards prevents the direct removal of records at the +moment they are deleted, it doesn't prevent the removal of records during +reconstruction. When using tagging, all tagged records encountered during +reconstruction can be removed. When using tombstones, however, the removal +process is non-trivial. In principle, a rejection check could be performed for +each record encountered during reconstruction, but this would increase +reconstruction costs and introduce a new problem of tracking tombstones +associated with records that have been removed. 
Instead, a lazier approach can +be used: delaying removal until a tombstone and its associated record +participate in the same shard reconstruction. This delay allows both the record +and its tombstone to be removed at the same time, an approach called +\emph{tombstone cancellation}. In general, this can be implemented using an +extra linear scan of the input shards before reconstruction to identify +tombstones and associated records for cancellation, but potential optimizations +exist for many SSIs, allowing it to be performed during the reconstruction +itself at no extra cost. + +The removal of deleted records passively during reconstruction is not enough to +bound the number of deleted records within the index. It is not difficult to +envision pathological scenarios where deletes result in unbounded rejection +rates, even with this mitigation in place. However, the dropping of deleted +records does provide a useful property: any specific deleted record will +eventually be removed from the index after a finite number of reconstructions. +Using this fact, a bound on the number of deleted records can be enforced. A +new parameter, $\delta$, is defined, representing the maximum proportion of +deleted records within the index. Each level, and the buffer, tracks the number +of deleted records it contains by counting its tagged records or tombstones. +Following each buffer flush, the proportion of deleted records is checked +against $\delta$. If any level is found to exceed it, then a proactive +reconstruction is triggered, pushing its shards down into the next level. The +process is repeated until all levels respect the bound, allowing the number of +deleted records to be precisely controlled, which, by extension, bounds the +rejection rate. This process is called \emph{compaction}. + +Assuming every record is equally likely to be sampled, this new bound can be +applied to the analysis of sampling costs. The probability of a record being +rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to +Equation~\ref{eq:sampling-cost} yields, +\begin{equation} +%\label{eq:sampling-cost-del} + O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right) +\end{equation} + +Asymptotically, this proactive compaction does not alter the analysis of +insertion costs. Each record is still written at most $s$ times on each level, +there are at most $\log_s n$ levels, and the buffer insertion and SSI +construction costs are all unchanged, and so on. This results in the amortized +insertion cost remaining the same. + +This compaction strategy is based upon tombstone and record counts, and the +bounds assume that every record is equally likely to be sampled. For certain +sampling problems (such as WSS), there are other conditions that must be +considered to provide a bound on the rejection rate. To account for these +situations in a general fashion, the framework supports problem-specific +compaction triggers that can be tailored to the SSI being used. These allow +compactions to be triggered based on other properties, such as rejection rate +of a level, weight of deleted records, and the like. + + + +\subsection{Performance Tuning and Configuration} \captionsetup[subfloat]{justification=centering} @@ -68,12 +380,9 @@ and many of Bentley-Saxe's limitations addressed. 
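To tie the pieces of this section together, the multiple-shard sampling procedure of Figure~\ref{fig:sample}, including the retry-from-shard-selection handling of deleted records, can be sketched as follows. This is again only an illustrative Python sketch: the per-shard operations \texttt{query\_weight}, \texttt{sample\_one}, and \texttt{is\_deleted} are hypothetical stand-ins for the $W(n)$, $S(n)$, and $R(n)$ operations of the cost analysis, and the alias-table helpers are those sketched earlier in this chapter.
\begin{verbatim}
def sample_query(shards, query, k):
    # (1) Per-shard weights under the query (W(n) + P(n) work per shard).
    weights = [s.query_weight(query) for s in shards]
    # (2) Shard alias structure over the normalized weights.
    prob, alias = build_alias(weights)
    results = []
    while len(results) < k:
        # (3) Assign the outstanding samples to shards via the alias table.
        counts = [0] * len(shards)
        for _ in range(k - len(results)):
            counts[alias_sample(prob, alias)] += 1
        # (4) Draw the assigned number of samples from each shard.
        for i, c in enumerate(counts):
            for _ in range(c):
                rec = shards[i].sample_one(query)
                # (5) Reject deleted records; the shortfall is retried from
                # step (3) so that no bias is introduced.
                if not shards[i].is_deleted(rec):
                    results.append(rec)
    return results
\end{verbatim}
With the $\delta$ bound on deleted records enforced by compaction, the expected number of passes through the retry loop is constant, consistent with the sampling cost given in Equation~\ref{eq:sampling-cost}.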
\subsection{Framework Overview} -The goal of this chapter is to build a general framework that extends most SSIs -with efficient support for updates by splitting the index into small data structures -to reduce reconstruction costs, and then distributing the sampling process over these -smaller structures. -The framework is designed to work efficiently with any SSI, so -long as it has the following properties, +Our framework has been designed to work efficiently with any SSI, so long +as it has the following properties. + \begin{enumerate} \item The underlying full query $Q$ supported by the SSI from whose results samples are drawn satisfies the following property: @@ -219,101 +528,6 @@ framework is, O\left(\frac{C_r(n)}{n}\log_s n\right) \end{equation} - -\subsection{Sampling} -\label{ssec:sample} - -\begin{figure} - \centering - \includegraphics[width=\textwidth]{img/sigmod23/sampling} - \caption{\textbf{Overview of the multiple-shard sampling query process} for - Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of - the shards is determined, then (2) these weights are used to construct an - alias structure. Next, (3) the alias structure is queried $k$ times to - determine per shard sample sizes, and then (4) sampling is performed. - Finally, (5) any rejected samples are retried starting from the alias - structure, and the process is repeated until the desired number of samples - has been retrieved.} - \label{fig:sample} - -\end{figure} - -For many SSIs, sampling queries are completed in two stages. Some preliminary -processing is done to identify the range of records from which to sample, and then -samples are drawn from that range. For example, IRS over a sorted list of -records can be performed by first identifying the upper and lower bounds of the -query range in the list, and then sampling records by randomly generating -indexes within those bounds. The general cost of a sampling query can be -modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is -the number of samples drawn, and $S(n)$ is the cost of sampling a single -record. - -When sampling from multiple shards, the situation grows more complex. For each -sample, the shard to select the record from must first be decided. Consider an -arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against -dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D -= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset, \forall i,j < m$. The -framework must ensure that $X(D, k)$ and $\bigcup_{i=0}^m X(D_i, k_i)$ follow -the same distribution, by selecting appropriate values for the $k_i$s. If care -is not taken to balance the number of samples drawn from a shard with the total -weight of the shard under $X$, then bias can be introduced into the sample -set's distribution. The selection of $k_i$s can be viewed as an instance of WSS, -and solved using the alias method. - -When sampling using the framework, first the weight of each shard under the -sampling query is determined and a \emph{shard alias structure} built over -these weights. Then, for each sample, the shard alias is used to -determine the shard from which to draw the sample. Let $W(n)$ be the cost of -determining this total weight for a single shard under the query. 
The initial setup -cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s -n\right)$, as the preliminary work for sampling from each shard must be -performed, as well as weights determined and alias structure constructed. In -many cases, however, the preliminary work will also determine the total weight, -and so the relevant operation need only be applied once to accomplish both -tasks. - -To ensure that all records appear in the sample set with the appropriate -probability, the mutable buffer itself must also be a valid target for -sampling. There are two generally applicable techniques that can be applied for -this, both of which can be supported by the framework. The query being sampled -from can be directly executed against the buffer and the result set used to -build a temporary SSI, which can be sampled from. Alternatively, rejection -sampling can be used to sample directly from the buffer, without executing the -query. In this case, the total weight of the buffer is used for its entry in -the shard alias structure. This can result in the buffer being -over-represented in the shard selection process, and so any rejections during -buffer sampling must be retried starting from shard selection. These same -considerations apply to rejection sampling used against shards, as well. - - -\begin{example} - \label{ex:sample} - Consider executing a WSS query, with $k=1000$, across three shards - containing integer keys with unit weight. $S_1$ contains only the - key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$ - contains all integers on $[101, 200]$. These structures are shown - in Figure~\ref{fig:sample}. Sampling is performed by first - determining the normalized weights for each shard: $w_1 = 0.005$, - $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a - shard alias structure. The shard alias structure is then queried - $k$ times, resulting in a distribution of $k_i$s that is - commensurate with the relative weights of each shard. Finally, - each shard is queried in turn to draw the appropriate number - of samples. -\end{example} - - -Assuming that rejection sampling is used on the mutable buffer, the worst-case -time complexity for drawing $k$ samples from an index containing $n$ elements -with a sampling cost of $S(n)$ is, -\begin{equation} - \label{eq:sample-cost} - O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right) -\end{equation} - -%If instead a temporary SSI is constructed, the cost of sampling -%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$. - \begin{figure} \centering \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\ @@ -342,163 +556,6 @@ with a sampling cost of $S(n)$ is, \subsection{Deletion} \label{ssec:delete} -Because the shards are static, records cannot be arbitrarily removed from them. -This requires that deletes be supported in some other way, with the ultimate -goal being the prevention of deleted records' appearance in sampling query -result sets. This can be realized in two ways: locating the record and marking -it, or inserting a new record which indicates that an existing record should be -treated as deleted. The framework supports both of these techniques, the -selection of which is called the \emph{delete policy}. The former policy is -called \emph{tagging} and the latter \emph{tombstone}. - -Tagging a record is straightforward. 
Point-lookups are performed against each -shard in the index, as well as the buffer, for the record to be deleted. When -it is found, a bit in a header attached to the record is set. When sampling, -any records selected with this bit set are automatically rejected. Tombstones -represent a lazy strategy for deleting records. When a record is deleted using -tombstones, a new record with identical key and value, but with a ``tombstone'' -bit set, is inserted into the index. A record's presence can be checked by -performing a point-lookup. If a tombstone with the same key and value exists -above the record in the index, then it should be rejected when sampled. - -Two important aspects of performance are pertinent when discussing deletes: the -cost of the delete operation, and the cost of verifying the presence of a -sampled record. The choice of delete policy represents a trade-off between -these two costs. Beyond this simple trade-off, the delete policy also has other -implications that can affect its applicability to certain types of SSI. Most -notably, tombstones do not require any in-place updating of records, whereas -tagging does. This means that using tombstones is the only way to ensure total -immutability of the data within shards, which avoids random writes and eases -concurrency control. The tombstone delete policy, then, is particularly -appealing in external and concurrent contexts. - -\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is -the same as an ordinary insert. Tagging, by contrast, requires a point-lookup -of the record to be deleted, and so is more expensive. Assuming a point-lookup -operation with cost $L(n)$, a tagged delete must search each level in the -index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$ -time. - -\Paragraph{Rejection Check Costs.} In addition to the cost of the delete -itself, the delete policy affects the cost of determining if a given record has -been deleted. This is called the \emph{rejection check cost}, $R(n)$. When -using tagging, the information necessary to make the rejection decision is -local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones -it is not; a point-lookup must be performed to search for a given record's -corresponding tombstone. This look-up must examine the buffer, and each shard -within the index. This results in a rejection check cost of $R(n) \in O\left(N_b + -L(n) \log_s n\right)$. The rejection check process for the two delete policies is -summarized in Figure~\ref{fig:delete}. - -Two factors contribute to the tombstone rejection check cost: the size of the -buffer, and the cost of performing a point-lookup against the shards. The -latter cost can be controlled using the framework's ability to associate -auxiliary structures with shards. For SSIs which do not support efficient -point-lookups, a hash table can be added to map key-value pairs to their -location within the SSI. This allows for constant-time rejection checks, even -in situations where the index would not otherwise support them. However, the -storage cost of this intervention is high, and in situations where the SSI does -support efficient point-lookups, it is not necessary. Further performance -improvements can be achieved by noting that the probability of a given record -having an associated tombstone in any particular shard is relatively small. -This means that many point-lookups will be executed against shards that do not -contain the tombstone being searched for. 
Further performance improvements can be achieved by noting that the
probability of a given record having an associated tombstone in any particular
shard is relatively small, and so many point-lookups will be executed against
shards that do not contain the tombstone being searched for. These unnecessary
lookups can be partially avoided by maintaining Bloom filters~\cite{bloom70}
for tombstones. By inserting tombstones into these filters during
reconstruction, point-lookups against some shards which do not contain the
tombstone being searched for can be bypassed. Filters can be attached to the
buffer as well, which may be even more significant due to the linear cost of
scanning it. As the goal is a reduction of rejection check costs, these
filters need only be populated with tombstones. In a later section, techniques
for bounding the number of tombstones on a given level are discussed, which
allow the memory usage of these filters to be tightly controlled while still
ensuring precise bounds on filter error.

\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
alters the analysis of sampling costs. A record that has been deleted cannot
appear in the sample set, and therefore each sampled record must be verified
and rejected if it has been deleted. When retrying samples rejected due to
deletes, the process must restart from shard selection, as deleted records may
be counted in the weight totals used to construct that structure. This
increases the cost of sampling to,
\begin{equation}
	\label{eq:sampling-cost}
	O\left(\left[W(n) + P(n)\right]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
\end{equation}
where $R(n)$ is the cost of checking whether a sampled record has been
deleted, and $\nicefrac{k}{1 - \mathbf{Pr}[\text{rejection}]}$ is the expected
number of sampling attempts required to obtain $k$ samples, given a fixed
rejection probability. The rejection probability itself is a function of the
workload, and is unbounded.

\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
constitute wasted memory accesses and random number generations, and so steps
should be taken to minimize their frequency. The probability of a rejection is
directly related to the number of deleted records, which is itself a function
of the workload and dataset. This means that, without building countermeasures
into the framework, tight bounds on sampling performance cannot be provided in
the presence of deleted records. It is therefore critical that the framework
support some method for bounding the number of deleted records within the
index.

While the static nature of shards prevents the direct removal of records at
the moment they are deleted, it does not prevent the removal of records during
reconstruction. When using tagging, all tagged records encountered during
reconstruction can simply be dropped. When using tombstones, however, the
removal process is non-trivial. In principle, a rejection check could be
performed for each record encountered during reconstruction, but this would
increase reconstruction costs and introduce a new problem of tracking
tombstones associated with records that have already been removed. Instead, a
lazier approach can be used: delaying removal until a tombstone and its
associated record participate in the same shard reconstruction. This delay
allows both the record and its tombstone to be removed at the same time, an
approach called \emph{tombstone cancellation}. In general, this can be
implemented using an extra linear scan of the input shards before
reconstruction to identify tombstones and associated records for cancellation,
but potential optimizations exist for many SSIs, allowing it to be performed
during the reconstruction itself at no extra cost.
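The following sketch illustrates the latter, merge-time variant for an SSI
whose reconstruction is a sorted merge. It is a minimal sketch built on the
earlier hypothetical \texttt{Record} type, and it assumes that the merged
input is ordered by key and value with each tombstone sorted immediately
before the record it cancels; this sort order is an assumption of the example,
not a property guaranteed by every SSI.

\begin{verbatim}
#include <cstddef>
#include <vector>

// Tombstone cancellation folded into a sorted-merge reconstruction.
template <typename K, typename V>
std::vector<Record<K, V>> cancel_deletes(const std::vector<Record<K, V>> &merged) {
    std::vector<Record<K, V>> out;
    out.reserve(merged.size());

    for (std::size_t i = 0; i < merged.size(); i++) {
        const auto &rec = merged[i];

        // Tagging: tagged records are simply dropped during reconstruction.
        if (rec.header.deleted) {
            continue;
        }

        // Tombstones: drop the tombstone and its matching record together.
        if (rec.header.tombstone && i + 1 < merged.size()) {
            const auto &next = merged[i + 1];
            if (!next.header.tombstone &&
                next.key == rec.key && next.value == rec.value) {
                i++;        // skip the cancelled record as well
                continue;
            }
        }

        out.push_back(rec);  // everything else survives the reconstruction
    }
    return out;
}
\end{verbatim}

Unmatched tombstones are retained by this pass, so that they can cancel their
targets during a later reconstruction on a deeper level.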
Passively removing deleted records during reconstruction is not, on its own,
enough to bound the number of deleted records within the index. It is not
difficult to envision pathological scenarios in which deletes result in
unbounded rejection rates, even with this mitigation in place. However, the
dropping of deleted records does provide a useful property: any specific
deleted record will eventually be removed from the index after a finite number
of reconstructions. Using this fact, a bound on the number of deleted records
can be enforced. A new parameter, $\delta$, is defined, representing the
maximum proportion of deleted records within the index. Each level, and the
buffer, tracks the number of deleted records it contains by counting its
tagged records or tombstones. Following each buffer flush, the proportion of
deleted records on each level is checked against $\delta$. If any level is
found to exceed it, a proactive reconstruction is triggered, pushing its
shards down into the next level. The process is repeated until all levels
respect the bound, allowing the number of deleted records to be precisely
controlled, which, by extension, bounds the rejection rate. This process is
called \emph{compaction}.

Assuming every record is equally likely to be sampled, this new bound can be
applied to the analysis of sampling costs. In the worst case, the probability
of a record being rejected is $\mathbf{Pr}[\text{rejection}] = \delta$.
Applying this result to Equation~\ref{eq:sampling-cost} yields,
\begin{equation}
	%\label{eq:sampling-cost-del}
	O\left(\left[W(n) + P(n)\right]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation}

Asymptotically, this proactive compaction does not alter the analysis of
insertion costs: each record is still written at most $s$ times on each level,
there are at most $\log_s n$ levels, and the buffer insertion and SSI
construction costs are unchanged. As a result, the amortized insertion cost
remains the same.

This compaction strategy is based upon tombstone and record counts, and its
bounds assume that every record is equally likely to be sampled. For certain
sampling problems (such as WSS), other conditions must be considered to
provide a bound on the rejection rate. To account for these situations in a
general fashion, the framework supports problem-specific compaction triggers
that can be tailored to the SSI being used. These allow compactions to be
triggered based on other properties, such as the rejection rate of a level,
the weight of deleted records, and the like.
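One way such triggers could be structured is sketched below: a default trigger
enforces the $\delta$ bound on each level's proportion of deleted records, and
additional, problem-specific triggers can be registered alongside it. The
\texttt{LevelStats} fields and the trigger interface are hypothetical, and are
intended only to illustrate the idea, not to describe the framework's actual
API.

\begin{verbatim}
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-level statistics maintained by the framework.
struct LevelStats {
    std::size_t record_count;   // all records on the level, tombstones included
    std::size_t delete_count;   // tagged records or tombstones on the level
    double      rejection_rate; // observed sampling rejection rate for the level
};

// A compaction trigger inspects a level and decides whether its shards
// should be proactively reconstructed into the next level.
using CompactionTrigger = std::function<bool(const LevelStats &)>;

// Default trigger: enforce the delta bound on the proportion of deletes.
CompactionTrigger delete_proportion_trigger(double delta) {
    return [delta](const LevelStats &s) {
        if (s.record_count == 0) return false;
        double proportion = static_cast<double>(s.delete_count) /
                            static_cast<double>(s.record_count);
        return proportion > delta;
    };
}

// Example of a problem-specific trigger, e.g. for WSS-style workloads.
CompactionTrigger rejection_rate_trigger(double max_rejection_rate) {
    return [max_rejection_rate](const LevelStats &s) {
        return s.rejection_rate > max_rejection_rate;
    };
}

// Checked for every level after each buffer flush; any firing trigger
// forces a proactive compaction of that level.
bool needs_compaction(const LevelStats &level,
                      const std::vector<CompactionTrigger> &triggers) {
    for (const auto &trigger : triggers) {
        if (trigger(level)) return true;
    }
    return false;
}
\end{verbatim}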
\subsection{Trade-offs on Framework Design Space}
\label{ssec:design-space}
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 0155c7d..befdbba 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -1,20 +1,38 @@
\section{Introduction}
\label{sec:intro}

As a first attempt at realizing a dynamic extension framework, one of the
non-decomposable search problems discussed in the previous chapter was
considered: independent range sampling, along with a number of other
independent sampling problems. These sorts of queries are important in a
variety of contexts, including approximate query processing
(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
feature selection for machine learning~\cite{ml-sampling}. However, they are
not well served by existing techniques, which tend to sacrifice statistical
independence for performance, or vice versa. In this chapter, a solution for
independent sampling is presented that achieves both statistical independence
and good performance by designing a Bentley-Saxe-inspired framework for
introducing update support to efficient static sampling data structures. It
seeks to demonstrate the viability of Bentley-Saxe as the basis for adding
update support to data structures, as well as to show that the limitations of
the decomposable search problem abstraction can be overcome through
alternative query processing techniques while preserving good performance.

Having discussed the relevant background material, we will now turn to our
first attempt to address the limitations of dynamization in the context of one
particular class of non-decomposable search problem: independent random
sampling. We have already discussed one representative problem of this class,
independent range sampling, and shown that it is not traditionally
decomposable. This specific problem is one of several closely related
problems, however, and in this chapter we will also address simple random
sampling, weighted set sampling, and weighted independent range sampling.

Independent sampling presents an interesting motivating example because it is
nominally supported within many relational databases, and is useful in a
variety of contexts, such as approximate query processing
(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
feature selection for machine learning~\cite{ml-sampling}. However, existing
support for these search problems is limited by the techniques used within
databases to implement them. Existing implementations tend to sacrifice either
performance, by requiring the entire result set to be materialized prior to
applying Bernoulli sampling, or statistical independence. Techniques exist for
obtaining both sampling performance and independence by leveraging existing
B+Tree indices with slight modifications~\cite{olken-thesis}, but even these
have worse sampling performance than could be achieved using specialized
static sampling indices.

Thus, we decided to apply a Bentley-Saxe-based dynamization technique to these
data structures. In this chapter, we discuss our approach, which addresses the
decomposability problems discussed in Section~\ref{ssec:background-irs},
introduces two physical mechanisms for supporting deletes, and introduces an
LSM-tree-inspired design space to allow for performance tuning. The results in
this chapter are highly specialized to sampling problems; however, they will
serve as a jumping-off point for our discussion of a generalized framework in
the subsequent chapter.
+ diff --git a/references/references.bib b/references/references.bib index 7cec949..0dbc804 100644 --- a/references/references.bib +++ b/references/references.bib @@ -492,6 +492,12 @@ year = {2023} } +@misc {db2-doc, + title = {IBM DB2 Documentation}, + url = {https://www.ibm.com/docs/en/db2/12.1.0?topic=design-data-sampling-in-queries}, + year = {2025} +} + @online {pinecone, title = {Pinecone DB}, @@ -1474,3 +1480,5 @@ keywords = {analytic model, analysis of algorithms, overflow chaining, performan biburl = {https://dblp.org/rec/journals/jal/EdelsbrunnerO85.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } + + |