Diffstat (limited to 'chapters/sigmod23/framework.tex')
| -rw-r--r-- | chapters/sigmod23/framework.tex | 669 |
1 files changed, 363 insertions, 306 deletions
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex index 32a32e1..88ac1ac 100644 --- a/chapters/sigmod23/framework.tex +++ b/chapters/sigmod23/framework.tex @@ -1,53 +1,365 @@ -\section{Dynamic Sampling Index Framework} +\section{Dynamization of SSIs} \label{sec:framework} -This work is an attempt to design a solution to independent sampling -that achieves \emph{both} efficient updates and near-constant cost per -sample. As the goal is to tackle the problem in a generalized fashion, -rather than design problem-specific data structures for used as the basis -of an index, a framework is created that allows for already -existing static data structures to be used as the basis for a sampling -index, by automatically adding support for data updates using a modified -version of the Bentley-Saxe method. - -Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be -directly applied to sampling problems. The concept of decomposability is not -cleanly applicable to sampling, because the distribution of records in the -result set, rather than the records themselves, must be matched following the -result merge. Efficiently controlling the distribution requires each sub-query -to access information external to the structure against which it is being -processed, a contingency unaccounted for by Bentley-Saxe. Further, the process -of reconstruction used in Bentley-Saxe provides poor worst-case complexity -bounds~\cite{saxe79}, and attempts to modify the procedure to provide better -worst-case performance are complex and have worse performance in the common -case~\cite{overmars81}. Despite these limitations, this chapter will argue that -the core principles of the Bentley-Saxe method can be profitably applied to -sampling indexes, once a system for controlling result set distributions and a -more effective reconstruction scheme have been devised. The solution to -the former will be discussed in Section~\ref{ssec:sample}. For the latter, -inspiration is drawn from the literature on the LSM tree. - -The LSM tree~\cite{oneil96} is a data structure proposed to optimize -write throughput in disk-based storage engines. It consists of a memory -table of bounded size, used to buffer recent changes, and a hierarchy -of external levels containing indexes of exponentially increasing -size. When the memory table has reached capacity, it is emptied into the -external levels. Random writes are avoided by treating the data within -the external levels as immutable; all writes go through the memory -table. This introduces write amplification but maximizes sequential -writes, which is important for maintaining high throughput in disk-based -systems. The LSM tree is associated with a broad and well studied design -space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing -trade-offs between three key performance metrics: read performance, write -performance, and auxiliary memory usage. The challenges -faced in reconstructing predominately in-memory indexes are quite - different from those which the LSM tree is intended -to address, having little to do with disk-based systems and sequential IO -operations. But, the LSM tree possesses a rich design space for managing -the periodic reconstruction of data structures in a manner that is both -more practical and more flexible than that of Bentley-Saxe. By borrowing -from this design space, this preexisting body of work can be leveraged, -and many of Bentley-Saxe's limitations addressed. 
Our goal, then, is to design a solution to independent sampling that achieves \emph{both} efficient updates and efficient sampling, maintains statistical independence both within and between IQS queries, and does so in a generalized fashion, without requiring a new dynamic data structure to be designed for each problem. Given the range of SSIs already available, it seems reasonable to attempt to apply dynamization techniques to accomplish this goal. Using the Bentley-Saxe method would allow us to support inserts and deletes without requiring any modification of the SSIs. Unfortunately, as discussed in Section~\ref{ssec:background-irs}, there are problems with directly applying BSM to sampling problems. All of the considerations discussed there in the context of IRS apply equally to the other sampling problems considered in this chapter. In this section, we discuss approaches for resolving these problems.

\subsection{Sampling over Partitioned Datasets}

The core problem facing any attempt to dynamize SSIs is that independently sampling from a partitioned dataset is difficult. As discussed in Section~\ref{ssec:background-irs}, accomplishing this task within the DSP model used by the Bentley-Saxe method requires drawing a full $k$ samples from each of the blocks, and then repeatedly down-sampling the intermediate sample sets. However, it is possible to devise a more efficient query process if we abandon the DSP model and consider a slightly more involved procedure.

First, we need to resolve a minor definitional problem. As noted before, the DSP model is based on deterministic queries. The definition does not apply to sampling queries, because it assumes that the result sets of identical queries are themselves identical. For general IQS, we also need to enforce conditions on the query being sampled from.

\begin{definition}[Query Sampling Problem]
	Given a search problem, $F$, a query sampling problem is a function of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+) \to \mathcal{R}$, where $\mathcal{D}$ is the domain of records and $\mathcal{Q}$ is the domain of query parameters of $F$. The solution to a query sampling problem, $R \in \mathcal{R}$, is a subset of records from the solution to $F$, drawn independently, such that $|R| = k$ for some $k \in \mathbb{Z}^+$.
\end{definition}
With this in mind, we can now define the decomposability conditions for a query sampling problem,

\begin{definition}[Decomposable Sampling Problem]
	A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if the following conditions are met for all $q \in \mathcal{Q}, k \in \mathbb{Z}^+$,
	\begin{enumerate}
		\item There exists a $\Theta(C(n,k))$ time computable, associative, and commutative binary operator $\mergeop$ such that,
		\begin{equation*}
			X(F, A \cup B, q, k) \sim X(F, A, q, k) ~\mergeop~ X(F, B, q, k)
		\end{equation*}
		for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.

		\item For any dataset $D \subseteq \mathcal{D}$ that has been decomposed into $m$ partitions such that $D = \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad \forall i,j \leq m, i \neq j$,
		\begin{equation*}
			F(D, q) = \bigcup_{i=1}^m F(D_i, q)
		\end{equation*}
	\end{enumerate}
\end{definition}

These two conditions warrant further explanation. The first is a restatement of the standard decomposability criterion in which matching the \emph{distribution} of the result set, rather than its exact records, is the correctness condition for the merge process. The second captures a necessary property of the underlying search problem being sampled from. Note that this condition is \emph{stricter} than normal decomposability for $F$: it essentially requires that the query being sampled from return a set of records, rather than an aggregate value or some other result that cannot be meaningfully sampled from. This condition is satisfied by predicate-filtering style database queries, among others.
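To make the first condition concrete, the sketch below shows one possible $\mergeop$: a $\Theta(k)$ down-sampling that combines two size-$k$ block samples in proportion to the blocks' weights under the query. This is a minimal Python sketch, assuming sampling with replacement and exact per-block weights; the function name and arguments are illustrative rather than part of the framework.

\begin{verbatim}
import random

def downsample_merge(sample_a, weight_a, sample_b, weight_b, k):
    # Combine two independent size-k samples, drawn with replacement from
    # disjoint blocks A and B, into a size-k sample over their union.  Each
    # output record should originate from A with probability
    # weight_a / (weight_a + weight_b), and from B otherwise.
    p_a = weight_a / (weight_a + weight_b)
    take_a = sum(random.random() < p_a for _ in range(k))
    # The inputs are i.i.d. samples, so taking a prefix of each is unbiased.
    return sample_a[:take_a] + sample_b[:k - take_a]
\end{verbatim}

Applied pairwise across the blocks of a Bentley-Saxe decomposition, an operator of this form is what produces the per-block $\Theta(k)$ merge cost analyzed below.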
With these definitions in mind, we now turn to solving query sampling problems. Many SSIs answer a sampling query in two phases: first, some preliminary work is done to determine metadata describing the set of records to sample from, and then $k$ samples are drawn from the structure using that metadata. If we represent the time cost of the preliminary work with $P(n)$ and the cost of drawing a single sample with $S(n)$, then the query cost function of such a structure is of the form,
\begin{equation*}
\mathscr{Q}(n, k) = P(n) + k S(n)
\end{equation*}
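As a concrete illustration of this two-phase structure, the following Python sketch implements IRS over a sorted array: the preliminary phase locates the query range with two binary searches, giving $P(n) \in \Theta(\log n)$, after which each sample is drawn in $S(n) \in \Theta(1)$ time by generating a random index. The class and its interface are hypothetical, intended only to show where $P(n)$ and $S(n)$ arise.

\begin{verbatim}
import bisect
import random

class SortedArrayIRS:
    # A static structure supporting IRS over a sorted list of keys.
    def __init__(self, keys):
        self.keys = sorted(keys)

    def preprocess(self, lo, hi):
        # P(n): locate the query range [lo, hi] with two binary searches.
        start = bisect.bisect_left(self.keys, lo)
        stop = bisect.bisect_right(self.keys, hi)
        return (start, stop)

    def weight(self, state):
        # Number of records in the range; this structure's total weight
        # under the query.
        start, stop = state
        return stop - start

    def sample(self, state):
        # S(n): draw one record uniformly from the (non-empty) range.
        start, stop = state
        return self.keys[random.randrange(start, stop)]
\end{verbatim}

Drawing $k$ samples costs one call to \texttt{preprocess} followed by $k$ calls to \texttt{sample}, matching the cost function above; the \texttt{weight} method is used by the multiple-block procedure described next.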
Consider an arbitrary decomposable sampling query with a cost function of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample of $k$ records from $d \subseteq \mathcal{D}$ using an instance of an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results in $d$ being split across $m$ disjoint instances of $\mathcal{I}$ such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and $\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j) = \emptyset \quad \forall i, j \leq m, i \neq j$. If we consider a Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such a structure would be,
\begin{equation*}
\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
\end{equation*}

This cost function is sub-optimal for two reasons. First, we pay an extra cost to merge the result sets together because of the down-sampling combination operator. Second, this formulation fails to avoid a per-sample dependence on $n$, even in the case where $S(n) \in \Theta(1)$. The situation worsens when considering rejections that may occur as a result of deleted records. Recall from Section~\ref{ssec:background-deletes} that deletion can be supported using weak deletes or a shadow structure in a Bentley-Saxe dynamization. Under either approach, it is not possible to avoid deleted records in advance when sampling, and so these must be rejected and retried. In the DSP model, each retry must reprocess every block a second time; retrying in place, within a single block, would introduce bias into the result set. We will discuss this further in Section~\ref{ssec:sampling-deletes}.

\begin{figure}
	\centering
	\includegraphics[width=\textwidth]{img/sigmod23/sampling}
	\caption{\textbf{Overview of the multiple-block query sampling process} for Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of the blocks are determined, then (2) these weights are used to construct an alias structure. Next, (3) the alias structure is queried $k$ times to determine per-block sample sizes, and then (4) sampling is performed. Finally, (5) any rejected samples are retried starting from the alias structure, and the process is repeated until the desired number of samples has been retrieved.}
	\label{fig:sample}
\end{figure}

The key insight behind our solution is that there is a mismatch between the structure of the sampling query process and the structure assumed by DSPs. Using an SSI to answer a sampling query is naturally a two-phase process, but DSPs are assumed to be single-phase. We can construct a more effective procedure for answering such queries in multiple stages, summarized in Figure~\ref{fig:sample} and sketched in code following Example~\ref{ex:sample}.
\begin{enumerate}
	\item Determine each block's respective weight under the query being sampled from (e.g., the number of records falling into the query range for IRS).

	\item Build a temporary alias structure over these weights.

	\item Query the alias structure $k$ times to determine how many samples to draw from each block.

	\item Draw the appropriate number of samples from each block and merge them together to form the final query result.
\end{enumerate}
It is possible that some of the records sampled in Step 4 must be rejected, either because of deletes or some other property of the sampling procedure being used. If $r$ records are rejected, the above procedure can be repeated from Step 3, taking $r$ as the number of times to query the alias structure, without needing to redo any of the preprocessing. This can be repeated as many times as necessary until the required $k$ records have been sampled.

\begin{example}
	\label{ex:sample}
	Consider executing a WSS query, with $k=1000$, across three blocks containing integer keys with unit weight. $\mathscr{I}_1$ contains only the key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$ contains all integers on $[101, 200]$. These structures are shown in Figure~\ref{fig:sample}. Sampling is performed by first determining the normalized weights of the blocks: $w_1 = 0.005$, $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a block alias structure. The block alias structure is then queried $k$ times, resulting in a distribution of $k_i$s that is commensurate with the relative weights of the blocks. Finally, each block is queried in turn to draw the appropriate number of samples.
\end{example}
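The procedure can be summarized in a short Python sketch. It assumes each block exposes the hypothetical \texttt{preprocess}/\texttt{weight}/\texttt{sample} interface sketched earlier and that at least one block has non-zero weight under the query; \texttt{random.choices} stands in for the temporary alias structure, and the rejection predicate stands in for deletes or any other problem-specific rejection rule.

\begin{verbatim}
import random

def sample_query(blocks, query, k, rejected=lambda record: False):
    # (1) Per-block preliminary work, which also yields each block's
    #     weight under the query (query is, e.g., an IRS range (lo, hi)).
    states  = [b.preprocess(*query) for b in blocks]
    weights = [b.weight(s) for b, s in zip(blocks, states)]

    results, needed = [], k
    while needed > 0:
        # (2)+(3) Pick a block for each sample in proportion to its weight.
        # random.choices stands in for the temporary alias structure.
        chosen = random.choices(range(len(blocks)), weights=weights, k=needed)
        needed = 0
        for i in chosen:
            record = blocks[i].sample(states[i])    # (4) draw one sample
            if rejected(record):                    # (5) retry from Step 3
                needed += 1
            else:
                results.append(record)
    return results
\end{verbatim}

Note that rejected samples are retried from block selection rather than within the block that produced them, which is what preserves the correct result distribution.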
Assuming a Bentley-Saxe decomposition with $\Theta(\log n)$ blocks and a constant number of retry rounds, the cost of answering a decomposable sampling query having a pre-processing cost of $P(n)$ and a per-sample cost of $S(n)$ will be,
\begin{equation}
\label{eq:dsp-sample-cost}
\boxed{
\mathscr{Q}(n, k) \in \Theta \left( P(n) \log_2 n + k S(n) \right)
}
\end{equation}
where the cost of building the alias structure is $\Theta(\log_2 n)$ and thus absorbed into the pre-processing term. For the SSIs discussed in this chapter, which have $S(n) \in \Theta(1)$, this model provides the desired decoupling of the data size ($n$) from the per-sample cost.

\subsection{Supporting Deletes}

Because the shards are static, records cannot be arbitrarily removed from them. Deletes must therefore be supported in some other way, with the ultimate goal of preventing deleted records from appearing in sampling query result sets. This can be realized in two ways: locating the record and marking it, or inserting a new record which indicates that an existing record should be treated as deleted. The framework supports both of these techniques, the selection of which is called the \emph{delete policy}. The former policy is called \emph{tagging} and the latter \emph{tombstone}.

Tagging a record is straightforward. Point-lookups are performed against each shard in the index, as well as the buffer, for the record to be deleted. When it is found, a bit in a header attached to the record is set. When sampling, any record selected with this bit set is automatically rejected. Tombstones represent a lazy strategy for deleting records. When a record is deleted using tombstones, a new record with identical key and value, but with a ``tombstone'' bit set, is inserted into the index. A record's presence can then be checked by performing a point-lookup: if a tombstone with the same key and value exists above the record in the index, the record should be rejected when sampled.

Two aspects of performance are pertinent when discussing deletes: the cost of the delete operation itself, and the cost of verifying the presence of a sampled record. The choice of delete policy represents a trade-off between these two costs. Beyond this trade-off, the delete policy also has other implications that can affect its applicability to certain types of SSI. Most notably, tombstones do not require any in-place updating of records, whereas tagging does. This means that using tombstones is the only way to ensure total immutability of the data within shards, which avoids random writes and eases concurrency control. The tombstone delete policy, then, is particularly appealing in external and concurrent contexts.

\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is the same as an ordinary insert. Tagging, by contrast, requires a point-lookup of the record to be deleted, and so is more expensive. Assuming a point-lookup operation with cost $L(n)$, a tagged delete must search each level in the index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$ time.

\Paragraph{Rejection Check Costs.} In addition to the cost of the delete itself, the delete policy affects the cost of determining whether a given record has been deleted. This is called the \emph{rejection check cost}, $R(n)$. When using tagging, the information necessary to make the rejection decision is local to the sampled record, and so $R(n) \in O(1)$. When using tombstones it is not; a point-lookup must be performed to search for a given record's corresponding tombstone. This lookup must examine the buffer and each shard within the index, resulting in a rejection check cost of $R(n) \in O\left(N_b + L(n) \log_s n\right)$. The rejection check process for the two delete policies is summarized in Figure~\ref{fig:delete}.

Two factors contribute to the tombstone rejection check cost: the size of the buffer, and the cost of performing a point-lookup against the shards. The latter cost can be controlled using the framework's ability to associate auxiliary structures with shards. For SSIs that do not support efficient point-lookups, a hash table can be added to map key-value pairs to their locations within the SSI. This allows for constant-time rejection checks, even in situations where the index would not otherwise support them. However, the storage cost of this intervention is high, and in situations where the SSI does support efficient point-lookups, it is not necessary.

Further performance improvements can be achieved by noting that the probability of a given record having an associated tombstone in any particular shard is relatively small, and so many point-lookups will be executed against shards that do not contain the tombstone being searched for. These unnecessary lookups can be partially avoided using Bloom filters~\cite{bloom70} over the tombstones. By inserting tombstones into these filters during reconstruction, point-lookups against shards that do not contain the tombstone being searched for can often be bypassed. Filters can be attached to the buffer as well, which may be even more significant due to the linear cost of scanning it. As the goal is a reduction of rejection check costs, these filters need only be populated with tombstones. In a later section, techniques for bounding the number of tombstones on a given level are discussed, which allow the memory usage of these filters to be tightly controlled while still ensuring precise bounds on filter error.
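The difference in rejection check cost between the two policies can be sketched as follows. The record, buffer, and shard methods used here (\texttt{delete\_bit}, \texttt{contains\_tombstone}, \texttt{may\_contain}, \texttt{tombstone\_lookup}) are hypothetical stand-ins rather than the framework's actual interface; the tombstone path consults a per-shard tombstone Bloom filter before paying for a point-lookup, as described above.

\begin{verbatim}
def is_deleted_tagging(record):
    # Tagging: the delete bit lives in the record header, so the
    # rejection check is a constant-time field access, R(n) in O(1).
    return record.delete_bit

def is_deleted_tombstone(record, buffer, shards):
    # Tombstones: search the buffer and the shards for a matching
    # tombstone, R(n) in O(N_b + L(n) log_s n).  For brevity, this scans
    # every shard rather than only those newer than the sampled record.
    if buffer.contains_tombstone(record.key, record.value):
        return True
    for shard in shards:
        # The tombstone Bloom filter lets most shards be skipped; a false
        # positive only costs one unnecessary point-lookup.
        if shard.tombstone_filter.may_contain(record.key, record.value) \
                and shard.tombstone_lookup(record.key, record.value):
            return True
    return False
\end{verbatim}

Either function can serve as the rejection predicate in the earlier \texttt{sample\_query} sketch.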
\Paragraph{Sampling with Deletes.} The addition of deletes to the framework alters the analysis of sampling costs. A record that has been deleted cannot appear in the sample set, and therefore each sampled record must be verified and rejected if it has been deleted. When retrying samples rejected due to deletes, the process must restart from shard selection, as deleted records may be counted in the weight totals used to construct the shard alias structure. This increases the cost of sampling to,
\begin{equation}
\label{eq:sampling-cost}
	O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
\end{equation}
where $R(n)$ is the cost of checking whether a sampled record has been deleted, and $\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling attempts required to obtain $k$ samples, given a fixed rejection probability. The rejection probability itself is a function of the workload, and is unbounded.

\Paragraph{Bounding the Rejection Probability.} Rejections during sampling constitute wasted memory accesses and random number generations, and so steps should be taken to minimize their frequency. The probability of a rejection is directly related to the number of deleted records, which is itself a function of the workload and dataset. This means that, without building counter-measures into the framework, tight bounds on sampling performance cannot be provided in the presence of deleted records. It is therefore critical that the framework support some method for bounding the number of deleted records within the index.

While the static nature of shards prevents the direct removal of records at the moment they are deleted, it does not prevent their removal during reconstruction. When using tagging, all tagged records encountered during reconstruction can simply be dropped. When using tombstones, however, the removal process is non-trivial. In principle, a rejection check could be performed for each record encountered during reconstruction, but this would increase reconstruction costs and introduce a new problem of tracking tombstones associated with records that have already been removed. Instead, a lazier approach can be used: delaying removal until a tombstone and its associated record participate in the same shard reconstruction. This delay allows the record and its tombstone to be removed at the same time, an approach called \emph{tombstone cancellation}. In general, this can be implemented using an extra linear scan of the input shards before reconstruction to identify tombstones and associated records for cancellation, but optimizations exist for many SSIs that allow it to be performed during the reconstruction itself at no extra cost.

Passively removing deleted records during reconstruction is not, on its own, enough to bound the number of deleted records within the index; it is not difficult to envision pathological scenarios in which deletes result in unbounded rejection rates even with this mitigation in place. However, the dropping of deleted records does provide a useful property: any specific deleted record will be removed from the index after a finite number of reconstructions. Using this fact, a bound on the number of deleted records can be enforced. A new parameter, $\delta$, is defined, representing the maximum proportion of deleted records within the index. Each level, and the buffer, tracks the number of deleted records it contains by counting its tagged records or tombstones. Following each buffer flush, the proportion of deleted records on each level is checked against $\delta$. If any level exceeds it, a proactive reconstruction is triggered, pushing that level's shards down into the next level. The process is repeated until all levels respect the bound, allowing the number of deleted records to be precisely controlled, which, by extension, bounds the rejection rate. This process is called \emph{compaction}.
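The following is a minimal sketch of this $\delta$-based check, assuming hypothetical level objects that track their total and deleted (tagged or tombstone) record counts, and a \texttt{merge\_into\_next} routine standing in for the proactive reconstruction; it would be invoked after every buffer flush.

\begin{verbatim}
def enforce_delete_bound(levels, delta, merge_into_next):
    # Walk the levels from smallest to largest after a buffer flush.  Any
    # level whose proportion of deleted records exceeds delta has its
    # shards proactively reconstructed into the next level, so the check
    # cascades downward until every level respects the bound.
    for i in range(len(levels)):
        level = levels[i]
        if level.record_count > 0 and \
                level.deleted_count / level.record_count > delta:
            merge_into_next(levels, i)
\end{verbatim}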
Assuming every record is equally likely to be sampled, this new bound can be applied to the analysis of sampling costs: the probability of a sampled record being rejected is at most $\delta$. Applying $\mathbf{Pr}[\text{rejection}] \leq \delta$ to Equation~\ref{eq:sampling-cost} yields,
\begin{equation}
%\label{eq:sampling-cost-del}
	O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation}

Asymptotically, this proactive compaction does not alter the analysis of insertion costs. Each record is still written at most $s$ times on each level, there are at most $\log_s n$ levels, and the buffer insertion and SSI construction costs are unchanged. As a result, the amortized insertion cost remains the same.

This compaction strategy is based upon tombstone and record counts, and the bound assumes that every record is equally likely to be sampled. For certain sampling problems (such as WSS), other conditions must be considered to bound the rejection rate. To account for these situations in a general fashion, the framework supports problem-specific compaction triggers that can be tailored to the SSI being used. These allow compactions to be triggered based on other properties, such as the rejection rate of a level, the weight of deleted records, and the like.

\subsection{Performance Tuning and Configuration}

\captionsetup[subfloat]{justification=centering}

@@ -68,12 +380,9 @@ and many of Bentley-Saxe's limitations addressed.
\subsection{Framework Overview} -The goal of this chapter is to build a general framework that extends most SSIs -with efficient support for updates by splitting the index into small data structures -to reduce reconstruction costs, and then distributing the sampling process over these -smaller structures. -The framework is designed to work efficiently with any SSI, so -long as it has the following properties, +Our framework has been designed to work efficiently with any SSI, so long +as it has the following properties. + \begin{enumerate} \item The underlying full query $Q$ supported by the SSI from whose results samples are drawn satisfies the following property: @@ -219,101 +528,6 @@ framework is, O\left(\frac{C_r(n)}{n}\log_s n\right) \end{equation} - -\subsection{Sampling} -\label{ssec:sample} - -\begin{figure} - \centering - \includegraphics[width=\textwidth]{img/sigmod23/sampling} - \caption{\textbf{Overview of the multiple-shard sampling query process} for - Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of - the shards is determined, then (2) these weights are used to construct an - alias structure. Next, (3) the alias structure is queried $k$ times to - determine per shard sample sizes, and then (4) sampling is performed. - Finally, (5) any rejected samples are retried starting from the alias - structure, and the process is repeated until the desired number of samples - has been retrieved.} - \label{fig:sample} - -\end{figure} - -For many SSIs, sampling queries are completed in two stages. Some preliminary -processing is done to identify the range of records from which to sample, and then -samples are drawn from that range. For example, IRS over a sorted list of -records can be performed by first identifying the upper and lower bounds of the -query range in the list, and then sampling records by randomly generating -indexes within those bounds. The general cost of a sampling query can be -modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is -the number of samples drawn, and $S(n)$ is the cost of sampling a single -record. - -When sampling from multiple shards, the situation grows more complex. For each -sample, the shard to select the record from must first be decided. Consider an -arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against -dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D -= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset, \forall i,j < m$. The -framework must ensure that $X(D, k)$ and $\bigcup_{i=0}^m X(D_i, k_i)$ follow -the same distribution, by selecting appropriate values for the $k_i$s. If care -is not taken to balance the number of samples drawn from a shard with the total -weight of the shard under $X$, then bias can be introduced into the sample -set's distribution. The selection of $k_i$s can be viewed as an instance of WSS, -and solved using the alias method. - -When sampling using the framework, first the weight of each shard under the -sampling query is determined and a \emph{shard alias structure} built over -these weights. Then, for each sample, the shard alias is used to -determine the shard from which to draw the sample. Let $W(n)$ be the cost of -determining this total weight for a single shard under the query. 
The initial setup -cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s -n\right)$, as the preliminary work for sampling from each shard must be -performed, as well as weights determined and alias structure constructed. In -many cases, however, the preliminary work will also determine the total weight, -and so the relevant operation need only be applied once to accomplish both -tasks. - -To ensure that all records appear in the sample set with the appropriate -probability, the mutable buffer itself must also be a valid target for -sampling. There are two generally applicable techniques that can be applied for -this, both of which can be supported by the framework. The query being sampled -from can be directly executed against the buffer and the result set used to -build a temporary SSI, which can be sampled from. Alternatively, rejection -sampling can be used to sample directly from the buffer, without executing the -query. In this case, the total weight of the buffer is used for its entry in -the shard alias structure. This can result in the buffer being -over-represented in the shard selection process, and so any rejections during -buffer sampling must be retried starting from shard selection. These same -considerations apply to rejection sampling used against shards, as well. - - -\begin{example} - \label{ex:sample} - Consider executing a WSS query, with $k=1000$, across three shards - containing integer keys with unit weight. $S_1$ contains only the - key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$ - contains all integers on $[101, 200]$. These structures are shown - in Figure~\ref{fig:sample}. Sampling is performed by first - determining the normalized weights for each shard: $w_1 = 0.005$, - $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a - shard alias structure. The shard alias structure is then queried - $k$ times, resulting in a distribution of $k_i$s that is - commensurate with the relative weights of each shard. Finally, - each shard is queried in turn to draw the appropriate number - of samples. -\end{example} - - -Assuming that rejection sampling is used on the mutable buffer, the worst-case -time complexity for drawing $k$ samples from an index containing $n$ elements -with a sampling cost of $S(n)$ is, -\begin{equation} - \label{eq:sample-cost} - O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right) -\end{equation} - -%If instead a temporary SSI is constructed, the cost of sampling -%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$. - \begin{figure} \centering \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\ @@ -342,163 +556,6 @@ with a sampling cost of $S(n)$ is, \subsection{Deletion} \label{ssec:delete} -Because the shards are static, records cannot be arbitrarily removed from them. -This requires that deletes be supported in some other way, with the ultimate -goal being the prevention of deleted records' appearance in sampling query -result sets. This can be realized in two ways: locating the record and marking -it, or inserting a new record which indicates that an existing record should be -treated as deleted. The framework supports both of these techniques, the -selection of which is called the \emph{delete policy}. The former policy is -called \emph{tagging} and the latter \emph{tombstone}. - -Tagging a record is straightforward. 
Point-lookups are performed against each -shard in the index, as well as the buffer, for the record to be deleted. When -it is found, a bit in a header attached to the record is set. When sampling, -any records selected with this bit set are automatically rejected. Tombstones -represent a lazy strategy for deleting records. When a record is deleted using -tombstones, a new record with identical key and value, but with a ``tombstone'' -bit set, is inserted into the index. A record's presence can be checked by -performing a point-lookup. If a tombstone with the same key and value exists -above the record in the index, then it should be rejected when sampled. - -Two important aspects of performance are pertinent when discussing deletes: the -cost of the delete operation, and the cost of verifying the presence of a -sampled record. The choice of delete policy represents a trade-off between -these two costs. Beyond this simple trade-off, the delete policy also has other -implications that can affect its applicability to certain types of SSI. Most -notably, tombstones do not require any in-place updating of records, whereas -tagging does. This means that using tombstones is the only way to ensure total -immutability of the data within shards, which avoids random writes and eases -concurrency control. The tombstone delete policy, then, is particularly -appealing in external and concurrent contexts. - -\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is -the same as an ordinary insert. Tagging, by contrast, requires a point-lookup -of the record to be deleted, and so is more expensive. Assuming a point-lookup -operation with cost $L(n)$, a tagged delete must search each level in the -index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$ -time. - -\Paragraph{Rejection Check Costs.} In addition to the cost of the delete -itself, the delete policy affects the cost of determining if a given record has -been deleted. This is called the \emph{rejection check cost}, $R(n)$. When -using tagging, the information necessary to make the rejection decision is -local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones -it is not; a point-lookup must be performed to search for a given record's -corresponding tombstone. This look-up must examine the buffer, and each shard -within the index. This results in a rejection check cost of $R(n) \in O\left(N_b + -L(n) \log_s n\right)$. The rejection check process for the two delete policies is -summarized in Figure~\ref{fig:delete}. - -Two factors contribute to the tombstone rejection check cost: the size of the -buffer, and the cost of performing a point-lookup against the shards. The -latter cost can be controlled using the framework's ability to associate -auxiliary structures with shards. For SSIs which do not support efficient -point-lookups, a hash table can be added to map key-value pairs to their -location within the SSI. This allows for constant-time rejection checks, even -in situations where the index would not otherwise support them. However, the -storage cost of this intervention is high, and in situations where the SSI does -support efficient point-lookups, it is not necessary. Further performance -improvements can be achieved by noting that the probability of a given record -having an associated tombstone in any particular shard is relatively small. -This means that many point-lookups will be executed against shards that do not -contain the tombstone being searched for. 
In this case, these unnecessary -lookups can be partially avoided using Bloom filters~\cite{bloom70} for -tombstones. By inserting tombstones into these filters during reconstruction, -point-lookups against some shards which do not contain the tombstone being -searched for can be bypassed. Filters can be attached to the buffer as well, -which may be even more significant due to the linear cost of scanning it. As -the goal is a reduction of rejection check costs, these filters need only be -populated with tombstones. In a later section, techniques for bounding the -number of tombstones on a given level are discussed, which will allow for the -memory usage of these filters to be tightly controlled while still ensuring -precise bounds on filter error. - -\Paragraph{Sampling with Deletes.} The addition of deletes to the framework -alters the analysis of sampling costs. A record that has been deleted cannot -be present in the sample set, and therefore the presence of each sampled record -must be verified. If a record has been deleted, it must be rejected. When -retrying samples rejected due to delete, the process must restart from shard -selection, as deleted records may be counted in the weight totals used to -construct that structure. This increases the cost of sampling to, -\begin{equation} -\label{eq:sampling-cost} - O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right) -\end{equation} -where $R(n)$ is the cost of checking if a sampled record has been deleted, and -$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling -attempts required to obtain $k$ samples, given a fixed rejection probability. -The rejection probability itself is a function of the workload, and is -unbounded. - -\Paragraph{Bounding the Rejection Probability.} Rejections during sampling -constitute wasted memory accesses and random number generations, and so steps -should be taken to minimize their frequency. The probability of a rejection is -directly related to the number of deleted records, which is itself a function -of workload and dataset. This means that, without building counter-measures -into the framework, tight bounds on sampling performance cannot be provided in -the presence of deleted records. It is therefore critical that the framework -support some method for bounding the number of deleted records within the -index. - -While the static nature of shards prevents the direct removal of records at the -moment they are deleted, it doesn't prevent the removal of records during -reconstruction. When using tagging, all tagged records encountered during -reconstruction can be removed. When using tombstones, however, the removal -process is non-trivial. In principle, a rejection check could be performed for -each record encountered during reconstruction, but this would increase -reconstruction costs and introduce a new problem of tracking tombstones -associated with records that have been removed. Instead, a lazier approach can -be used: delaying removal until a tombstone and its associated record -participate in the same shard reconstruction. This delay allows both the record -and its tombstone to be removed at the same time, an approach called -\emph{tombstone cancellation}. 
In general, this can be implemented using an -extra linear scan of the input shards before reconstruction to identify -tombstones and associated records for cancellation, but potential optimizations -exist for many SSIs, allowing it to be performed during the reconstruction -itself at no extra cost. - -The removal of deleted records passively during reconstruction is not enough to -bound the number of deleted records within the index. It is not difficult to -envision pathological scenarios where deletes result in unbounded rejection -rates, even with this mitigation in place. However, the dropping of deleted -records does provide a useful property: any specific deleted record will -eventually be removed from the index after a finite number of reconstructions. -Using this fact, a bound on the number of deleted records can be enforced. A -new parameter, $\delta$, is defined, representing the maximum proportion of -deleted records within the index. Each level, and the buffer, tracks the number -of deleted records it contains by counting its tagged records or tombstones. -Following each buffer flush, the proportion of deleted records is checked -against $\delta$. If any level is found to exceed it, then a proactive -reconstruction is triggered, pushing its shards down into the next level. The -process is repeated until all levels respect the bound, allowing the number of -deleted records to be precisely controlled, which, by extension, bounds the -rejection rate. This process is called \emph{compaction}. - -Assuming every record is equally likely to be sampled, this new bound can be -applied to the analysis of sampling costs. The probability of a record being -rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to -Equation~\ref{eq:sampling-cost} yields, -\begin{equation} -%\label{eq:sampling-cost-del} - O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right) -\end{equation} - -Asymptotically, this proactive compaction does not alter the analysis of -insertion costs. Each record is still written at most $s$ times on each level, -there are at most $\log_s n$ levels, and the buffer insertion and SSI -construction costs are all unchanged, and so on. This results in the amortized -insertion cost remaining the same. - -This compaction strategy is based upon tombstone and record counts, and the -bounds assume that every record is equally likely to be sampled. For certain -sampling problems (such as WSS), there are other conditions that must be -considered to provide a bound on the rejection rate. To account for these -situations in a general fashion, the framework supports problem-specific -compaction triggers that can be tailored to the SSI being used. These allow -compactions to be triggered based on other properties, such as rejection rate -of a level, weight of deleted records, and the like. - \subsection{Trade-offs on Framework Design Space} \label{ssec:design-space} |