path: root/chapters/sigmod23/framework.tex
Diffstat (limited to 'chapters/sigmod23/framework.tex')
-rw-r--r--  chapters/sigmod23/framework.tex  669
1 file changed, 363 insertions, 306 deletions
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 32a32e1..88ac1ac 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -1,53 +1,365 @@
-\section{Dynamic Sampling Index Framework}
+\section{Dynamization of SSIs}
\label{sec:framework}
-This work is an attempt to design a solution to independent sampling
-that achieves \emph{both} efficient updates and near-constant cost per
-sample. As the goal is to tackle the problem in a generalized fashion,
-rather than design problem-specific data structures for used as the basis
-of an index, a framework is created that allows for already
-existing static data structures to be used as the basis for a sampling
-index, by automatically adding support for data updates using a modified
-version of the Bentley-Saxe method.
-
-Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
-directly applied to sampling problems. The concept of decomposability is not
-cleanly applicable to sampling, because the distribution of records in the
-result set, rather than the records themselves, must be matched following the
-result merge. Efficiently controlling the distribution requires each sub-query
-to access information external to the structure against which it is being
-processed, a contingency unaccounted for by Bentley-Saxe. Further, the process
-of reconstruction used in Bentley-Saxe provides poor worst-case complexity
-bounds~\cite{saxe79}, and attempts to modify the procedure to provide better
-worst-case performance are complex and have worse performance in the common
-case~\cite{overmars81}. Despite these limitations, this chapter will argue that
-the core principles of the Bentley-Saxe method can be profitably applied to
-sampling indexes, once a system for controlling result set distributions and a
-more effective reconstruction scheme have been devised. The solution to
-the former will be discussed in Section~\ref{ssec:sample}. For the latter,
-inspiration is drawn from the literature on the LSM tree.
-
-The LSM tree~\cite{oneil96} is a data structure proposed to optimize
-write throughput in disk-based storage engines. It consists of a memory
-table of bounded size, used to buffer recent changes, and a hierarchy
-of external levels containing indexes of exponentially increasing
-size. When the memory table has reached capacity, it is emptied into the
-external levels. Random writes are avoided by treating the data within
-the external levels as immutable; all writes go through the memory
-table. This introduces write amplification but maximizes sequential
-writes, which is important for maintaining high throughput in disk-based
-systems. The LSM tree is associated with a broad and well studied design
-space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing
-trade-offs between three key performance metrics: read performance, write
-performance, and auxiliary memory usage. The challenges
-faced in reconstructing predominately in-memory indexes are quite
- different from those which the LSM tree is intended
-to address, having little to do with disk-based systems and sequential IO
-operations. But, the LSM tree possesses a rich design space for managing
-the periodic reconstruction of data structures in a manner that is both
-more practical and more flexible than that of Bentley-Saxe. By borrowing
-from this design space, this preexisting body of work can be leveraged,
-and many of Bentley-Saxe's limitations addressed.
+Our goal, then, is to design a solution to independent sampling that is
+able to achieve \emph{both} efficient updates and efficient sampling,
+while also maintaining statistical independence both within and between
+IQS queries, and to do so in a generalized fashion without needing to
+design new dynamic data structures for each problem. Given the range
+of SSIs already available, it seems reasonable to attempt to apply
+dynamization techniques to accomplish this goal. Using the Bentley-Saxe
+method would allow us to support inserts and deletes without
+requiring any modification of the SSIs. Unfortunately, as discussed
+in Section~\ref{ssec:background-irs}, there are problems with directly
+applying BSM to sampling problems. All of the considerations discussed
+there in the context of IRS apply equally to the other sampling problems
+considered in this chapter. In this section, we will discuss approaches
+for resolving these problems.
+
+\subsection{Sampling over Partitioned Datasets}
+
+The core problem facing any attempt to dynamize SSIs is that independently
+sampling from a partitioned dataset is difficult. As discussed in
+Section~\ref{ssec:background-irs}, accomplishing this task within the
+DSP model used by the Bentley-Saxe method requires drawing a full $k$
+samples from each of the blocks, and then repeatedly down-sampling each
+of the intermediate sample sets. However, it is possible to devise a
+more efficient query process if we abandon the DSP model and consider
+a slightly more complicated procedure.
+
+First, we need to resolve a minor definitional problem. As noted before,
+the DSP model is based on deterministic queries. The definition doesn't
+apply to sampling queries, because it assumes that the result sets of
+identical queries should also be identical. For general IQS, we also need
+to enforce conditions on the query being sampled from.
+
+\begin{definition}[Query Sampling Problem]
+    Given a search problem $F$, a sampling problem is a function
+    of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+)
+    \to \mathcal{R}$, where $\mathcal{D}$ is the domain of records
+    and $\mathcal{Q}$ is the domain of query parameters of $F$. The
+    solution to a sampling problem, $R \in \mathcal{R}$, is a subset
+    of the records in the solution to $F$, drawn independently, such that
+    $|R| = k$ for some $k \in \mathbb{Z}^+$.
+\end{definition}
+With this in mind, we can now define the decomposability conditions for
+a query sampling problem,
+
+\begin{definition}[Decomposable Sampling Problem]
+    A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
+    \mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if
+ the following conditions are met for all $q \in \mathcal{Q},
+ k \in \mathbb{Z}^+$,
+ \begin{enumerate}
+ \item There exists a $\Theta(C(n,k))$ time computable, associative, and
+ commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F,
+ B, q, k)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B
+ = \emptyset$.
+
+ \item For any dataset $D \subseteq \mathcal{D}$ that has been
+ decomposed into $m$ partitions such that $D =
+ \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad
+		\forall i,j \leq m, i \neq j$,
+ \begin{equation*}
+ F(D, q) = \bigcup_{i=1}^m F(D_i, q)
+ \end{equation*}
+ \end{enumerate}
+\end{definition}
+
+These two conditions warrant further explanation. The first condition
+is simply a redefinition of the standard decomposability criteria to
+consider matching the distribution, rather than the exact records in $R$,
+as the correctness condition for the merge process. The second condition
+handles a necessary property of the underlying search problem being
+sampled from. Note that this condition is \emph{stricter} than normal
+decomposability for $F$, and essentially requires that the query being
+sampled from return a set of records, rather than an aggregate value or
+some other result that cannot be meaningfully sampled from. This condition
+is satisfied by predicate-filtering style database queries, among others.
+
+With these definitions in mind, let's turn to solving these query sampling
+problems. We note that many SSIs have a sampling procedure that
+naturally involves two phases: first, some preliminary work is done
+to determine metadata concerning the set of records to sample from,
+and then $k$ samples are drawn from the structure, taking advantage of
+this metadata. If we represent the time cost of the preliminary work
+with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
+structures' query cost functions are of the form,
+
+\begin{equation*}
+\mathscr{Q}(n, k) = P(n) + k S(n)
+\end{equation*}
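+As a concrete illustration of this two-phase pattern, consider IRS over a
+sorted array of keys: the preliminary phase locates the bounds of the query
+range by binary search, after which each sample is drawn in constant time by
+generating a random index within those bounds. The following sketch (in
+Python, using the hypothetical class name \texttt{SortedArrayIRS}) is meant
+only to make the roles of $P(n)$ and $S(n)$ concrete, and is not part of the
+framework itself.
+\begin{verbatim}
+import bisect
+import random
+
+class SortedArrayIRS:
+    """Toy static SSI: independent range sampling over a sorted list."""
+
+    def __init__(self, keys):
+        self.keys = sorted(keys)
+
+    def sample(self, lo, hi, k):
+        # Preliminary phase, P(n) = O(log n): locate the query bounds.
+        start = bisect.bisect_left(self.keys, lo)
+        stop = bisect.bisect_right(self.keys, hi)
+        if start >= stop:
+            return []
+        # Sampling phase, k * S(n) with S(n) = O(1): independent,
+        # uniformly random indexes within the located bounds.
+        return [self.keys[random.randrange(start, stop)]
+                for _ in range(k)]
+\end{verbatim}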
+
+
+Consider an arbitrary decomposable sampling query with a cost function
+of the above form, $X(F, d, q, k)$, which draws a sample
+of $k$ records from $d \subseteq \mathcal{D}$, answered using an instance of
+an SSI, $\mathscr{I} \in \mathcal{I}$, built over $d$. Applying dynamization
+results in $d$ being split across $m$ disjoint instances of $\mathcal{I}$
+such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and
+$\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j)
+= \emptyset \quad \forall i, j \leq m, i \neq j$. If we consider a
+Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation
+would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such
+a structure would be,
+\begin{equation*}
+\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
+\end{equation*}
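+This bound follows from summing the per-block costs over the
+$\Theta(\log_2 n)$ blocks of a Bentley-Saxe decomposition, together with the
+$\Theta(k)$ cost of each down-sampling merge; assuming $P(n)$ and $S(n)$ are
+non-decreasing, this gives,
+\begin{equation*}
+\sum_{i=1}^{\log_2 n} \left(P(n_i) + k\,S(n_i)\right) +
+\sum_{i=1}^{\log_2 n - 1} \Theta(k)
+\in \Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
+\end{equation*}
+where $n_i \leq n$ is the number of records in the $i$th block.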
+
+This cost function is sub-optimal for two reasons. First, we
+pay extra cost to merge the result sets together because of the
+down-sampling combination operator. Second, this formulation
+fails to avoid a per-sample dependence on $n$, even in the case
+where $S(n) \in \Theta(1)$. This gets even worse when considering
+rejections that may occur as a result of deleted records. Recall from
+Section~\ref{ssec:background-deletes} that deletion can be supported
+using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
+Using either approach, it isn't possible to avoid deleted records in
+advance when sampling, and so these will need to be rejected and retried.
+In the DSP model, this retry will need to reprocess every block a second
+time. The retry cannot be performed in place without introducing bias into
+the result set. We will discuss this more in Section~\ref{ssec:sampling-deletes}.
+
+\begin{figure}
+ \centering
+ \includegraphics[width=\textwidth]{img/sigmod23/sampling}
+ \caption{\textbf{Overview of the multiple-block query sampling process} for
+ Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
+    the shards are determined, then (2) these weights are used to construct an
+ alias structure. Next, (3) the alias structure is queried $k$ times to
+    determine per-shard sample sizes, and then (4) sampling is performed.
+ Finally, (5) any rejected samples are retried starting from the alias
+ structure, and the process is repeated until the desired number of samples
+ has been retrieved.}
+ \label{fig:sample}
+
+\end{figure}
+
+The key insight that allowed us to solve this problem is that
+there is a mismatch between the structure of the sampling query process
+and the structure assumed by DSPs. Using an SSI to answer a sampling
+query results in a naturally two-phase process, but DSPs are assumed to
+be single phase. We can construct a more effective process for answering
+such queries based on a multi-stage process, summarized in Figure~\ref{fig:sample}.
+\begin{enumerate}
+ \item Determine each block's respective weight under a given
+ query to be sampled from (e.g., the number of records falling
+ into the query range for IRS).
+
+ \item Build a temporary alias structure over these weights.
+
+ \item Query the alias structure $k$ times to determine how many
+ samples to draw from each block.
+
+ \item Draw the appropriate number of samples from each block and
+ merge them together to form the final query result.
+\end{enumerate}
+It is possible that some of the records sampled in Step 4 must be
+rejected, either because of deletes or some other property of the sampling
+procedure being used. If $r$ records are rejected, the above procedure
+can be repeated from Step 3, taking $k - r$ as the number of times to
+query the alias structure, without needing to redo any of the preprocessing
+steps. This can be repeated as many times as necessary until the required
+$k$ records have been sampled.
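+The following Python sketch illustrates this procedure. The block interface
+(\texttt{weight} for the preliminary weight computation and \texttt{sample},
+which returns \texttt{None} to model a rejection) is a placeholder for
+whatever the underlying SSI provides, and the alias structure is built with
+Vose's method; it is an illustration of the control flow rather than the
+framework's actual implementation.
+\begin{verbatim}
+import random
+
+class Alias:
+    """Alias structure for O(1) weighted selection (Vose's method)."""
+    def __init__(self, weights):
+        m = len(weights)
+        total = float(sum(weights))
+        scaled = [w * m / total for w in weights]
+        self.prob = [1.0] * m
+        self.alias = list(range(m))
+        small = [i for i, p in enumerate(scaled) if p < 1.0]
+        large = [i for i, p in enumerate(scaled) if p >= 1.0]
+        while small and large:
+            s, l = small.pop(), large.pop()
+            self.prob[s], self.alias[s] = scaled[s], l
+            scaled[l] -= 1.0 - scaled[s]
+            (small if scaled[l] < 1.0 else large).append(l)
+
+    def select(self):
+        i = random.randrange(len(self.prob))
+        return i if random.random() < self.prob[i] else self.alias[i]
+
+def sample_query(blocks, q, k):
+    # (1) Preliminary work: the weight of each block under the query q.
+    weights = [b.weight(q) for b in blocks]
+    # (2) Temporary alias structure over the block weights.
+    alias = Alias(weights)
+    result = []
+    while len(result) < k:
+        # (3) Assign the outstanding samples to blocks.
+        counts = [0] * len(blocks)
+        for _ in range(k - len(result)):
+            counts[alias.select()] += 1
+        # (4) Draw the assigned number of samples from each block. A
+        #     None return models a rejection; (5) rejected samples are
+        #     retried on the next pass, starting again from step (3).
+        for b, c in zip(blocks, counts):
+            for _ in range(c):
+                rec = b.sample(q)
+                if rec is not None:
+                    result.append(rec)
+    return result
+\end{verbatim}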
+
+\begin{example}
+ \label{ex:sample}
+ Consider executing a WSS query, with $k=1000$, across three blocks
+ containing integer keys with unit weight. $\mathscr{I}_1$ contains only the
+ key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$
+ contains all integers on $[101, 200]$. These structures are shown
+ in Figure~\ref{fig:sample}. Sampling is performed by first
+ determining the normalized weights for each block: $w_1 = 0.005$,
+ $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
+ block alias structure. The block alias structure is then queried
+ $k$ times, resulting in a distribution of $k_i$s that is
+ commensurate with the relative weights of each block. Finally,
+ each block is queried in turn to draw the appropriate number
+ of samples.
+\end{example}
+
+Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
+a constant number of repetitions, the cost of answering a decomposable
+sampling query having a pre-processing cost of $P(n)$ and a per-sample
+cost of $S(n)$ will be,
+\begin{equation}
+\label{eq:dsp-sample-cost}
+\boxed{
+\mathscr{Q}(n, k) \in \Theta \left( P(n) \log_2 n + k S(n) \right)
+}
+\end{equation}
+where the cost of building the alias structure is $\Theta(\log_2 n)$
+and thus absorbed into the pre-processing cost. For the SSIs discussed
+in this chapter, which have $S(n) \in \Theta(1)$, this model provides us
+with the desired decoupling of the data size ($n$) from the per-sample
+cost.
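+Written out term by term, and again assuming a constant number of retry
+rounds, the cost decomposes as,
+\begin{equation*}
+\underbrace{\sum_{i=1}^{\log_2 n} P(n_i)}_{\text{preliminary work}}
++ \underbrace{\Theta(\log_2 n)}_{\text{alias construction}}
++ \underbrace{\Theta(k)}_{\text{alias queries}}
++ \underbrace{k\,S(n)}_{\text{sampling}}
+\in \Theta\left(P(n)\log_2 n + k\,S(n)\right)
+\end{equation*}
+with the alias construction absorbed into the pre-processing term and the
+alias queries absorbed into the $k\,S(n)$ sampling term.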
+
+\subsection{Supporting Deletes}
+
+Because the shards are static, records cannot be arbitrarily removed from them.
+This requires that deletes be supported in some other way, with the ultimate
+goal being the prevention of deleted records' appearance in sampling query
+result sets. This can be realized in two ways: locating the record and marking
+it, or inserting a new record which indicates that an existing record should be
+treated as deleted. The framework supports both of these techniques, the
+selection of which is called the \emph{delete policy}. The former policy is
+called \emph{tagging} and the latter \emph{tombstone}.
+
+Tagging a record is straightforward. Point-lookups are performed against each
+shard in the index, as well as the buffer, for the record to be deleted. When
+it is found, a bit in a header attached to the record is set. When sampling,
+any records selected with this bit set are automatically rejected. Tombstones
+represent a lazy strategy for deleting records. When a record is deleted using
+tombstones, a new record with identical key and value, but with a ``tombstone''
+bit set, is inserted into the index. A record's presence can be checked by
+performing a point-lookup. If a tombstone with the same key and value exists
+above the record in the index, then it should be rejected when sampled.
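+The two policies can be summarized by the following sketch, in which the
+record header bits and the \texttt{point\_lookup} and \texttt{insert}
+operations are placeholders for whatever the buffer and shards actually
+provide; it is intended only to contrast the two code paths.
+\begin{verbatim}
+from dataclasses import dataclass
+
+@dataclass
+class Record:
+    key: int
+    value: int
+    tombstone: bool = False  # set on the new record (tombstone policy)
+    deleted: bool = False    # header bit set in place (tagging policy)
+
+def delete_by_tagging(buffer, shards, key, value):
+    # Point-lookup the record in the buffer and each shard, then set
+    # its delete bit; sampled records with this bit set are rejected.
+    for container in (buffer, *shards):
+        rec = container.point_lookup(key, value)
+        if rec is not None:
+            rec.deleted = True
+            return True
+    return False
+
+def delete_by_tombstone(buffer, key, value):
+    # No lookup needed: insert a matching record with the tombstone
+    # bit set, at the same cost as an ordinary insert.
+    buffer.insert(Record(key, value, tombstone=True))
+\end{verbatim}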
+
+Two important aspects of performance are pertinent when discussing deletes: the
+cost of the delete operation, and the cost of verifying the presence of a
+sampled record. The choice of delete policy represents a trade-off between
+these two costs. Beyond this simple trade-off, the delete policy also has other
+implications that can affect its applicability to certain types of SSI. Most
+notably, tombstones do not require any in-place updating of records, whereas
+tagging does. This means that using tombstones is the only way to ensure total
+immutability of the data within shards, which avoids random writes and eases
+concurrency control. The tombstone delete policy, then, is particularly
+appealing in external and concurrent contexts.
+
+\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
+the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
+of the record to be deleted, and so is more expensive. Assuming a point-lookup
+operation with cost $L(n)$, a tagged delete must search each level in the
+index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
+time.
+
+\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
+itself, the delete policy affects the cost of determining if a given record has
+been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
+using tagging, the information necessary to make the rejection decision is
+local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
+it is not; a point-lookup must be performed to search for a given record's
+corresponding tombstone. This look-up must examine the buffer, and each shard
+within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
+L(n) \log_s n\right)$. The rejection check process for the two delete policies is
+summarized in Figure~\ref{fig:delete}.
+
+Two factors contribute to the tombstone rejection check cost: the size of the
+buffer, and the cost of performing a point-lookup against the shards. The
+latter cost can be controlled using the framework's ability to associate
+auxiliary structures with shards. For SSIs which do not support efficient
+point-lookups, a hash table can be added to map key-value pairs to their
+location within the SSI. This allows for constant-time rejection checks, even
+in situations where the index would not otherwise support them. However, the
+storage cost of this intervention is high, and in situations where the SSI does
+support efficient point-lookups, it is not necessary. Further performance
+improvements can be achieved by noting that the probability of a given record
+having an associated tombstone in any particular shard is relatively small.
+This means that many point-lookups will be executed against shards that do not
+contain the tombstone being searched for. In this case, these unnecessary
+lookups can be partially avoided using Bloom filters~\cite{bloom70} for
+tombstones. By inserting tombstones into these filters during reconstruction,
+point-lookups against some shards which do not contain the tombstone being
+searched for can be bypassed. Filters can be attached to the buffer as well,
+which may be even more significant due to the linear cost of scanning it. As
+the goal is a reduction of rejection check costs, these filters need only be
+populated with tombstones. In a later section, techniques for bounding the
+number of tombstones on a given level are discussed, which will allow for the
+memory usage of these filters to be tightly controlled while still ensuring
+precise bounds on filter error.
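+A sketch of the resulting rejection check is shown below. The
+\texttt{tombstone\_filter}, \texttt{contains\_tombstone}, and
+\texttt{find\_tombstone} members are assumed interfaces standing in for the
+Bloom filter and point-lookup machinery described above, and
+\texttt{newer\_shards} is taken to be the set of shards above the one the
+record was sampled from.
+\begin{verbatim}
+def is_deleted(rec, buffer, newer_shards):
+    """Rejection check for a sampled record."""
+    # Tagging policy: the answer is local to the record, R(n) = O(1).
+    if rec.deleted:
+        return True
+    # Tombstone policy: scan the buffer, then consult each newer shard,
+    # giving R(n) = O(N_b + L(n) log_s n) in the worst case.
+    if buffer.contains_tombstone(rec.key, rec.value):
+        return True
+    for shard in newer_shards:
+        # The per-shard Bloom filter is populated only with tombstones;
+        # a negative answer lets the point-lookup be skipped entirely.
+        if not shard.tombstone_filter.may_contain((rec.key, rec.value)):
+            continue
+        if shard.find_tombstone(rec.key, rec.value) is not None:
+            return True
+    return False
+\end{verbatim}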
+
+\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
+alters the analysis of sampling costs. A record that has been deleted cannot
+be present in the sample set, and therefore the presence of each sampled record
+must be verified. If a record has been deleted, it must be rejected. When
+retrying samples rejected due to delete, the process must restart from shard
+selection, as deleted records may be counted in the weight totals used to
+construct that structure. This increases the cost of sampling to,
+\begin{equation}
+\label{eq:sampling-cost}
+ O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
+\end{equation}
+where $W(n)$ is the cost of determining a single shard's total weight under
+the query, $R(n)$ is the cost of checking if a sampled record has been deleted, and
+$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
+attempts required to obtain $k$ samples, given a fixed rejection probability.
+The rejection probability itself is a function of the workload, and is
+unbounded.
+
+\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
+constitute wasted memory accesses and random number generations, and so steps
+should be taken to minimize their frequency. The probability of a rejection is
+directly related to the number of deleted records, which is itself a function
+of workload and dataset. This means that, without building counter-measures
+into the framework, tight bounds on sampling performance cannot be provided in
+the presence of deleted records. It is therefore critical that the framework
+support some method for bounding the number of deleted records within the
+index.
+
+While the static nature of shards prevents the direct removal of records at the
+moment they are deleted, it doesn't prevent the removal of records during
+reconstruction. When using tagging, all tagged records encountered during
+reconstruction can be removed. When using tombstones, however, the removal
+process is non-trivial. In principle, a rejection check could be performed for
+each record encountered during reconstruction, but this would increase
+reconstruction costs and introduce a new problem of tracking tombstones
+associated with records that have been removed. Instead, a lazier approach can
+be used: delaying removal until a tombstone and its associated record
+participate in the same shard reconstruction. This delay allows both the record
+and its tombstone to be removed at the same time, an approach called
+\emph{tombstone cancellation}. In general, this can be implemented using an
+extra linear scan of the input shards before reconstruction to identify
+tombstones and associated records for cancellation, but potential optimizations
+exist for many SSIs, allowing it to be performed during the reconstruction
+itself at no extra cost.
+
+The removal of deleted records passively during reconstruction is not enough to
+bound the number of deleted records within the index. It is not difficult to
+envision pathological scenarios where deletes result in unbounded rejection
+rates, even with this mitigation in place. However, the dropping of deleted
+records does provide a useful property: any specific deleted record will
+eventually be removed from the index after a finite number of reconstructions.
+Using this fact, a bound on the number of deleted records can be enforced. A
+new parameter, $\delta$, is defined, representing the maximum proportion of
+deleted records within the index. Each level, and the buffer, tracks the number
+of deleted records it contains by counting its tagged records or tombstones.
+Following each buffer flush, the proportion of deleted records is checked
+against $\delta$. If any level is found to exceed it, then a proactive
+reconstruction is triggered, pushing its shards down into the next level. The
+process is repeated until all levels respect the bound, allowing the number of
+deleted records to be precisely controlled, which, by extension, bounds the
+rejection rate. This process is called \emph{compaction}.
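+A sketch of this compaction check, run after each buffer flush, is given
+below. The per-level bookkeeping fields and the \texttt{merge\_levels} and
+\texttt{new\_empty\_level} helpers are placeholders for the framework's
+reconstruction machinery; a single downward pass suffices because pushing
+shards only adds records to deeper levels, which are checked afterwards.
+\begin{verbatim}
+DELTA = 0.05  # assumed maximum proportion of deleted records per level
+
+def enforce_delete_bound(levels):
+    """Proactively compact any level whose tagged records or tombstones
+    exceed the DELTA bound, pushing its shards into the next level."""
+    i = 0
+    while i < len(levels):
+        lvl = levels[i]
+        if (lvl.record_count > 0 and
+                lvl.deleted_count / lvl.record_count > DELTA):
+            if i + 1 == len(levels):
+                levels.append(new_empty_level())
+            # The reconstruction also cancels tombstones against the
+            # records they delete, and drops tagged records.
+            levels[i + 1] = merge_levels(levels[i + 1], lvl)
+            levels[i] = new_empty_level()
+        i += 1
+\end{verbatim}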
+
+Assuming every record is equally likely to be sampled, this new bound can be
+applied to the analysis of sampling costs. The probability of a record being
+rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
+Equation~\ref{eq:sampling-cost} yields,
+\begin{equation}
+%\label{eq:sampling-cost-del}
+ O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
+\end{equation}
+
+Asymptotically, this proactive compaction does not alter the analysis of
+insertion costs. Each record is still written at most $s$ times on each level,
+there are at most $\log_s n$ levels, and the buffer insertion and SSI
+construction costs are all unchanged, and so on. This results in the amortized
+insertion cost remaining the same.
+
+This compaction strategy is based upon tombstone and record counts, and the
+bounds assume that every record is equally likely to be sampled. For certain
+sampling problems (such as WSS), there are other conditions that must be
+considered to provide a bound on the rejection rate. To account for these
+situations in a general fashion, the framework supports problem-specific
+compaction triggers that can be tailored to the SSI being used. These allow
+compactions to be triggered based on other properties, such as rejection rate
+of a level, weight of deleted records, and the like.
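+For instance, a WSS-specific trigger might bound the \emph{weight} of
+deleted records on a level rather than their count. A minimal sketch, with
+assumed per-level fields, might look as follows.
+\begin{verbatim}
+def wss_weight_trigger(level, delta):
+    # Trigger a compaction of this level when the weight of its deleted
+    # records exceeds the configured fraction of its total weight.
+    return (level.total_weight > 0 and
+            level.deleted_weight > delta * level.total_weight)
+\end{verbatim}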
+
+
+
+\subsection{Performance Tuning and Configuration}
\captionsetup[subfloat]{justification=centering}
@@ -68,12 +380,9 @@ and many of Bentley-Saxe's limitations addressed.
\subsection{Framework Overview}
-The goal of this chapter is to build a general framework that extends most SSIs
-with efficient support for updates by splitting the index into small data structures
-to reduce reconstruction costs, and then distributing the sampling process over these
-smaller structures.
-The framework is designed to work efficiently with any SSI, so
-long as it has the following properties,
+Our framework has been designed to work efficiently with any SSI, so long
+as it has the following properties.
+
\begin{enumerate}
\item The underlying full query $Q$ supported by the SSI from whose results
samples are drawn satisfies the following property:
@@ -219,101 +528,6 @@ framework is,
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}
-
-\subsection{Sampling}
-\label{ssec:sample}
-
-\begin{figure}
- \centering
- \includegraphics[width=\textwidth]{img/sigmod23/sampling}
- \caption{\textbf{Overview of the multiple-shard sampling query process} for
- Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
- the shards is determined, then (2) these weights are used to construct an
- alias structure. Next, (3) the alias structure is queried $k$ times to
- determine per shard sample sizes, and then (4) sampling is performed.
- Finally, (5) any rejected samples are retried starting from the alias
- structure, and the process is repeated until the desired number of samples
- has been retrieved.}
- \label{fig:sample}
-
-\end{figure}
-
-For many SSIs, sampling queries are completed in two stages. Some preliminary
-processing is done to identify the range of records from which to sample, and then
-samples are drawn from that range. For example, IRS over a sorted list of
-records can be performed by first identifying the upper and lower bounds of the
-query range in the list, and then sampling records by randomly generating
-indexes within those bounds. The general cost of a sampling query can be
-modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is
-the number of samples drawn, and $S(n)$ is the cost of sampling a single
-record.
-
-When sampling from multiple shards, the situation grows more complex. For each
-sample, the shard to select the record from must first be decided. Consider an
-arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against
-dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D
-= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset, \forall i,j < m$. The
-framework must ensure that $X(D, k)$ and $\bigcup_{i=0}^m X(D_i, k_i)$ follow
-the same distribution, by selecting appropriate values for the $k_i$s. If care
-is not taken to balance the number of samples drawn from a shard with the total
-weight of the shard under $X$, then bias can be introduced into the sample
-set's distribution. The selection of $k_i$s can be viewed as an instance of WSS,
-and solved using the alias method.
-
-When sampling using the framework, first the weight of each shard under the
-sampling query is determined and a \emph{shard alias structure} built over
-these weights. Then, for each sample, the shard alias is used to
-determine the shard from which to draw the sample. Let $W(n)$ be the cost of
-determining this total weight for a single shard under the query. The initial setup
-cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s
-n\right)$, as the preliminary work for sampling from each shard must be
-performed, as well as weights determined and alias structure constructed. In
-many cases, however, the preliminary work will also determine the total weight,
-and so the relevant operation need only be applied once to accomplish both
-tasks.
-
-To ensure that all records appear in the sample set with the appropriate
-probability, the mutable buffer itself must also be a valid target for
-sampling. There are two generally applicable techniques that can be applied for
-this, both of which can be supported by the framework. The query being sampled
-from can be directly executed against the buffer and the result set used to
-build a temporary SSI, which can be sampled from. Alternatively, rejection
-sampling can be used to sample directly from the buffer, without executing the
-query. In this case, the total weight of the buffer is used for its entry in
-the shard alias structure. This can result in the buffer being
-over-represented in the shard selection process, and so any rejections during
-buffer sampling must be retried starting from shard selection. These same
-considerations apply to rejection sampling used against shards, as well.
-
-
-\begin{example}
- \label{ex:sample}
- Consider executing a WSS query, with $k=1000$, across three shards
- containing integer keys with unit weight. $S_1$ contains only the
- key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$
- contains all integers on $[101, 200]$. These structures are shown
- in Figure~\ref{fig:sample}. Sampling is performed by first
- determining the normalized weights for each shard: $w_1 = 0.005$,
- $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
- shard alias structure. The shard alias structure is then queried
- $k$ times, resulting in a distribution of $k_i$s that is
- commensurate with the relative weights of each shard. Finally,
- each shard is queried in turn to draw the appropriate number
- of samples.
-\end{example}
-
-
-Assuming that rejection sampling is used on the mutable buffer, the worst-case
-time complexity for drawing $k$ samples from an index containing $n$ elements
-with a sampling cost of $S(n)$ is,
-\begin{equation}
- \label{eq:sample-cost}
- O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right)
-\end{equation}
-
-%If instead a temporary SSI is constructed, the cost of sampling
-%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$.
-
\begin{figure}
\centering
\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
@@ -342,163 +556,6 @@ with a sampling cost of $S(n)$ is,
\subsection{Deletion}
\label{ssec:delete}
-Because the shards are static, records cannot be arbitrarily removed from them.
-This requires that deletes be supported in some other way, with the ultimate
-goal being the prevention of deleted records' appearance in sampling query
-result sets. This can be realized in two ways: locating the record and marking
-it, or inserting a new record which indicates that an existing record should be
-treated as deleted. The framework supports both of these techniques, the
-selection of which is called the \emph{delete policy}. The former policy is
-called \emph{tagging} and the latter \emph{tombstone}.
-
-Tagging a record is straightforward. Point-lookups are performed against each
-shard in the index, as well as the buffer, for the record to be deleted. When
-it is found, a bit in a header attached to the record is set. When sampling,
-any records selected with this bit set are automatically rejected. Tombstones
-represent a lazy strategy for deleting records. When a record is deleted using
-tombstones, a new record with identical key and value, but with a ``tombstone''
-bit set, is inserted into the index. A record's presence can be checked by
-performing a point-lookup. If a tombstone with the same key and value exists
-above the record in the index, then it should be rejected when sampled.
-
-Two important aspects of performance are pertinent when discussing deletes: the
-cost of the delete operation, and the cost of verifying the presence of a
-sampled record. The choice of delete policy represents a trade-off between
-these two costs. Beyond this simple trade-off, the delete policy also has other
-implications that can affect its applicability to certain types of SSI. Most
-notably, tombstones do not require any in-place updating of records, whereas
-tagging does. This means that using tombstones is the only way to ensure total
-immutability of the data within shards, which avoids random writes and eases
-concurrency control. The tombstone delete policy, then, is particularly
-appealing in external and concurrent contexts.
-
-\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
-the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
-of the record to be deleted, and so is more expensive. Assuming a point-lookup
-operation with cost $L(n)$, a tagged delete must search each level in the
-index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
-time.
-
-\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
-itself, the delete policy affects the cost of determining if a given record has
-been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
-using tagging, the information necessary to make the rejection decision is
-local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
-it is not; a point-lookup must be performed to search for a given record's
-corresponding tombstone. This look-up must examine the buffer, and each shard
-within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
-L(n) \log_s n\right)$. The rejection check process for the two delete policies is
-summarized in Figure~\ref{fig:delete}.
-
-Two factors contribute to the tombstone rejection check cost: the size of the
-buffer, and the cost of performing a point-lookup against the shards. The
-latter cost can be controlled using the framework's ability to associate
-auxiliary structures with shards. For SSIs which do not support efficient
-point-lookups, a hash table can be added to map key-value pairs to their
-location within the SSI. This allows for constant-time rejection checks, even
-in situations where the index would not otherwise support them. However, the
-storage cost of this intervention is high, and in situations where the SSI does
-support efficient point-lookups, it is not necessary. Further performance
-improvements can be achieved by noting that the probability of a given record
-having an associated tombstone in any particular shard is relatively small.
-This means that many point-lookups will be executed against shards that do not
-contain the tombstone being searched for. In this case, these unnecessary
-lookups can be partially avoided using Bloom filters~\cite{bloom70} for
-tombstones. By inserting tombstones into these filters during reconstruction,
-point-lookups against some shards which do not contain the tombstone being
-searched for can be bypassed. Filters can be attached to the buffer as well,
-which may be even more significant due to the linear cost of scanning it. As
-the goal is a reduction of rejection check costs, these filters need only be
-populated with tombstones. In a later section, techniques for bounding the
-number of tombstones on a given level are discussed, which will allow for the
-memory usage of these filters to be tightly controlled while still ensuring
-precise bounds on filter error.
-
-\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
-alters the analysis of sampling costs. A record that has been deleted cannot
-be present in the sample set, and therefore the presence of each sampled record
-must be verified. If a record has been deleted, it must be rejected. When
-retrying samples rejected due to delete, the process must restart from shard
-selection, as deleted records may be counted in the weight totals used to
-construct that structure. This increases the cost of sampling to,
-\begin{equation}
-\label{eq:sampling-cost}
- O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
-\end{equation}
-where $R(n)$ is the cost of checking if a sampled record has been deleted, and
-$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
-attempts required to obtain $k$ samples, given a fixed rejection probability.
-The rejection probability itself is a function of the workload, and is
-unbounded.
-
-\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
-constitute wasted memory accesses and random number generations, and so steps
-should be taken to minimize their frequency. The probability of a rejection is
-directly related to the number of deleted records, which is itself a function
-of workload and dataset. This means that, without building counter-measures
-into the framework, tight bounds on sampling performance cannot be provided in
-the presence of deleted records. It is therefore critical that the framework
-support some method for bounding the number of deleted records within the
-index.
-
-While the static nature of shards prevents the direct removal of records at the
-moment they are deleted, it doesn't prevent the removal of records during
-reconstruction. When using tagging, all tagged records encountered during
-reconstruction can be removed. When using tombstones, however, the removal
-process is non-trivial. In principle, a rejection check could be performed for
-each record encountered during reconstruction, but this would increase
-reconstruction costs and introduce a new problem of tracking tombstones
-associated with records that have been removed. Instead, a lazier approach can
-be used: delaying removal until a tombstone and its associated record
-participate in the same shard reconstruction. This delay allows both the record
-and its tombstone to be removed at the same time, an approach called
-\emph{tombstone cancellation}. In general, this can be implemented using an
-extra linear scan of the input shards before reconstruction to identify
-tombstones and associated records for cancellation, but potential optimizations
-exist for many SSIs, allowing it to be performed during the reconstruction
-itself at no extra cost.
-
-The removal of deleted records passively during reconstruction is not enough to
-bound the number of deleted records within the index. It is not difficult to
-envision pathological scenarios where deletes result in unbounded rejection
-rates, even with this mitigation in place. However, the dropping of deleted
-records does provide a useful property: any specific deleted record will
-eventually be removed from the index after a finite number of reconstructions.
-Using this fact, a bound on the number of deleted records can be enforced. A
-new parameter, $\delta$, is defined, representing the maximum proportion of
-deleted records within the index. Each level, and the buffer, tracks the number
-of deleted records it contains by counting its tagged records or tombstones.
-Following each buffer flush, the proportion of deleted records is checked
-against $\delta$. If any level is found to exceed it, then a proactive
-reconstruction is triggered, pushing its shards down into the next level. The
-process is repeated until all levels respect the bound, allowing the number of
-deleted records to be precisely controlled, which, by extension, bounds the
-rejection rate. This process is called \emph{compaction}.
-
-Assuming every record is equally likely to be sampled, this new bound can be
-applied to the analysis of sampling costs. The probability of a record being
-rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
-Equation~\ref{eq:sampling-cost} yields,
-\begin{equation}
-%\label{eq:sampling-cost-del}
- O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
-\end{equation}
-
-Asymptotically, this proactive compaction does not alter the analysis of
-insertion costs. Each record is still written at most $s$ times on each level,
-there are at most $\log_s n$ levels, and the buffer insertion and SSI
-construction costs are all unchanged, and so on. This results in the amortized
-insertion cost remaining the same.
-
-This compaction strategy is based upon tombstone and record counts, and the
-bounds assume that every record is equally likely to be sampled. For certain
-sampling problems (such as WSS), there are other conditions that must be
-considered to provide a bound on the rejection rate. To account for these
-situations in a general fashion, the framework supports problem-specific
-compaction triggers that can be tailored to the SSI being used. These allow
-compactions to be triggered based on other properties, such as rejection rate
-of a level, weight of deleted records, and the like.
-
\subsection{Trade-offs on Framework Design Space}
\label{ssec:design-space}