| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
| commit | 5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch) | |
| tree | 276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/sigmod23/framework.tex | |
| download | dissertation-5e4ad2777acc4c2420514e39fb98b7cf2e200996.tar.gz | |
Initial commit
Diffstat (limited to 'chapters/sigmod23/framework.tex')
| -rw-r--r-- | chapters/sigmod23/framework.tex | 573 |
1 file changed, 573 insertions, 0 deletions
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
new file mode 100644
index 0000000..32a32e1
--- /dev/null
+++ b/chapters/sigmod23/framework.tex
@@ -0,0 +1,573 @@

\section{Dynamic Sampling Index Framework}
\label{sec:framework}

This work attempts to design a solution to independent sampling that achieves
\emph{both} efficient updates and near-constant cost per sample. Because the
goal is to tackle the problem in a generalized fashion, rather than to design
problem-specific data structures for use as the basis of an index, a framework
is created that allows existing static data structures to serve as the basis
for a sampling index, automatically adding support for data updates using a
modified version of the Bentley-Saxe method.

Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
applied directly to sampling problems. The concept of decomposability does not
apply cleanly to sampling, because it is the distribution of records in the
result set, rather than the records themselves, that must be matched when the
partial results are merged. Efficiently controlling this distribution requires
each sub-query to access information external to the structure against which
it is being processed, a contingency unaccounted for by Bentley-Saxe. Further,
the reconstruction process used in Bentley-Saxe provides poor worst-case
complexity bounds~\cite{saxe79}, and attempts to modify the procedure to
provide better worst-case performance are complex and perform worse in the
common case~\cite{overmars81}. Despite these limitations, this chapter will
argue that the core principles of the Bentley-Saxe method can be profitably
applied to sampling indexes, once a system for controlling result set
distributions and a more effective reconstruction scheme have been devised.
The solution to the former is discussed in Section~\ref{ssec:sample}. For the
latter, inspiration is drawn from the literature on the LSM tree.

The LSM tree~\cite{oneil96} is a data structure proposed to optimize write
throughput in disk-based storage engines. It consists of a memory table of
bounded size, used to buffer recent changes, and a hierarchy of external
levels containing indexes of exponentially increasing size. When the memory
table reaches capacity, it is emptied into the external levels. Random writes
are avoided by treating the data within the external levels as immutable; all
writes go through the memory table. This introduces write amplification but
maximizes sequential writes, which is important for maintaining high
throughput in disk-based systems. The LSM tree is associated with a broad and
well-studied design space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1}
containing trade-offs between three key performance metrics: read performance,
write performance, and auxiliary memory usage. The challenges faced in
reconstructing predominantly in-memory indexes are quite different from those
the LSM tree is intended to address, having little to do with disk-based
systems and sequential I/O operations. However, the LSM tree possesses a rich
design space for managing the periodic reconstruction of data structures in a
manner that is both more practical and more flexible than that of
Bentley-Saxe. By borrowing from this design space, this preexisting body of
work can be leveraged and many of Bentley-Saxe's limitations addressed.

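To make the distributional difficulty concrete, consider drawing $k = 100$
samples uniformly at random from a dataset of $1000$ records that has been
partitioned into $D_1$ and $D_2$ with $|D_1| = 10$ and $|D_2| = 990$. Each
record should appear in the sample set an expected $\nicefrac{100}{1000} =
0.1$ times. If the query were decomposed naively, with $50$ samples drawn from
each partition and the results merged, a record in $D_1$ would instead be
drawn an expected $5$ times, a fifty-fold over-representation. A correct
decomposition must allocate samples in proportion to partition weight, drawing
an expected $100 \cdot \nicefrac{10}{1000} = 1$ sample from $D_1$, and this
allocation cannot be computed by a sub-query that sees only its own partition.
It is this need for cross-partition coordination, together with a
reconstruction scheme borrowed from the LSM tree design space, that shapes the
framework described in the remainder of this section.
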
\captionsetup[subfloat]{justification=centering}

\begin{figure*}
    \centering
    \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
    \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}

    \caption{\textbf{A graphical overview of the sampling framework and its
    insert procedure.} A mutable buffer (MB) sits atop two levels (L0, L1)
    containing shards (pairs of SSIs and auxiliary structures [A]) using the
    leveling (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering})
    layout policies. Records are represented as black/colored squares, and
    grey squares represent unused capacity. An insertion requiring a
    multi-level reconstruction is illustrated.}
    \label{fig:framework}
\end{figure*}


\subsection{Framework Overview}
The goal of this chapter is to build a general framework that extends most
SSIs with efficient support for updates by splitting the index into small data
structures to reduce reconstruction costs, and then distributing the sampling
process over these smaller structures. The framework is designed to work
efficiently with any SSI, so long as it has the following properties:
\begin{enumerate}
    \item The underlying full query $Q$ supported by the SSI, from whose
    results samples are drawn, satisfies the following property: for any
    dataset $D = \cup_{i = 1}^{n}D_i$ where $D_i \cap D_j = \emptyset$ for all
    $i \neq j$, it holds that $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
    \item \emph{(Optional)} The SSI supports efficient point-lookups.
    \item \emph{(Optional)} The SSI can efficiently report the total weight of
    all records returned by the underlying full query.
\end{enumerate}

The first property applies to the query being sampled from, and is essential
for the correctness of sample sets reported by extended sampling
indexes.\footnote{This condition is stricter than the definition of a
decomposable search problem in the Bentley-Saxe method, which allows for
\emph{any} constant-time merge operation, not just union. However, it is
satisfied by many common types of database query, such as predicate-based
filtering queries.} The latter two properties are optional, but reduce
deletion and sampling costs, respectively. Should the SSI fail to support
point-lookups, an auxiliary hash table can be attached to the data structures.
Should it fail to support query result weight reporting, rejection sampling
can be used in place of the more efficient scheme discussed in
Section~\ref{ssec:sample}. The analysis of this framework will generally
assume that all three conditions are satisfied.

Given an SSI with these properties, a dynamic extension can be produced as
shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
shards, each containing an instance of the SSI being extended and optional
auxiliary data structures. The auxiliary structures accelerate certain
operations that are required by the framework, but which the SSI being
extended does not itself support efficiently. Examples of possible auxiliary
structures include hash tables, Bloom filters~\cite{bloom70}, and range
filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
increasing record capacity, with either one shard, or up to a fixed maximum
number of shards, per level. The decision to place one or many shards per
level is called the \emph{layout policy}.
The policy names are borrowed from the literature on the LSM tree: the former
is called \emph{leveling} and the latter \emph{tiering}.

To avoid a reconstruction on every insert, an unsorted array of fixed capacity
($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because
it is unsorted, it is kept small to maintain reasonably efficient sampling and
point-lookup performance. All updates are performed by appending new records
to the tail of this buffer. If a record currently within the index is to be
updated to a new value, it must first be deleted, and then a record with the
new value inserted. This ensures that old versions of records are properly
filtered from query results.

When the buffer is full, it is flushed to make room for new records. The
flushing procedure depends on the layout policy in use. When using leveling
(Figure~\ref{fig:leveling}), a new SSI is constructed using both the records
in $L_0$ and those in the buffer. This is used to create a new shard, which
replaces the one previously in $L_0$. When using tiering
(Figure~\ref{fig:tiering}), a new shard is built using only the records from
the buffer and placed into $L_0$ without altering the existing shards. Each
level has a record capacity of $N_b \cdot s^{i+1}$, controlled by a
configurable parameter, $s$, called the scale factor. Records are organized in
one large shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity
each under tiering. When a level reaches its capacity, it must be emptied to
make room for the records flushed into it. This is accomplished by moving its
records down to the next level of the index. Under leveling, this requires
constructing a new shard containing all records from both the source and
target levels, and placing this shard into the target, leaving the source
empty. Under tiering, the shards in the source level are combined into a
single new shard that is placed into the target level. Should the target be
full, it is first emptied by applying the same procedure. New empty levels are
dynamically added as necessary to accommodate these reconstructions. Note that
shard reconstructions are not necessarily performed using merging, though
merging can be used as an optimization of the reconstruction procedure where
such an algorithm exists. In general, reconstruction requires only pooling the
records of the shards being combined and then applying the SSI's standard
construction algorithm to this set of records.

\begin{table}[t]
\caption{Frequently Used Notation}
\centering

\begin{tabular}{|p{2.5cm} p{5cm}|}
    \hline
    \textbf{Variable} & \textbf{Description} \\ \hline
    $N_b$ & Capacity of the mutable buffer \\ \hline
    $s$ & Scale factor \\ \hline
    $C_c(n)$ & SSI initial construction cost \\ \hline
    $C_r(n)$ & SSI reconstruction cost \\ \hline
    $L(n)$ & SSI point-lookup cost \\ \hline
    $P(n)$ & SSI sampling pre-processing cost \\ \hline
    $S(n)$ & SSI per-sample sampling cost \\ \hline
    $W(n)$ & Shard weight determination cost \\ \hline
    $R(n)$ & Shard rejection check cost \\ \hline
    $\delta$ & Maximum delete proportion \\ \hline
    %$\rho$ & Maximum rejection rate \\ \hline
\end{tabular}
\label{tab:nomen}

\end{table}

Table~\ref{tab:nomen} lists frequently used notation for the various
parameters of the framework, which will be used in the coming analysis of the
costs and trade-offs associated with operations within the framework's design
space.

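To make the flush and reconstruction procedure concrete, the sketch below
gives a minimal C++ rendering of the layout policies described above. The
types and names here are illustrative stand-ins rather than the framework's
actual interface; the \texttt{Shard} type abstracts an arbitrary SSI that is
built once from a pooled set of records and never modified in place.

\begin{verbatim}
#include <cstddef>
#include <memory>
#include <vector>

struct Record { long key; long value; };

// Stand-in for an arbitrary SSI: constructed from a set of records,
// then treated as immutable.
struct Shard {
    std::vector<Record> records;
    explicit Shard(std::vector<Record> recs) : records(std::move(recs)) {}
};

enum class Layout { Leveling, Tiering };

class DynamicIndex {
    std::size_t nb_;      // mutable buffer capacity (N_b)
    std::size_t s_;       // scale factor (s)
    Layout layout_;
    std::vector<Record> buffer_;                               // unsorted buffer
    std::vector<std::vector<std::unique_ptr<Shard>>> levels_;  // shards per level

    // Reconstruction pools the input records and rebuilds the SSI; a merge
    // algorithm, where the SSI has one, is an optimization of this step.
    static std::vector<Record> pool(const std::vector<std::unique_ptr<Shard>> &shards,
                                    std::vector<Record> extra = {}) {
        for (const auto &sh : shards)
            extra.insert(extra.end(), sh->records.begin(), sh->records.end());
        return extra;
    }

    std::size_t capacity(std::size_t i) const {  // N_b * s^(i+1) records
        std::size_t cap = nb_;
        for (std::size_t j = 0; j <= i; j++) cap *= s_;
        return cap;
    }

    std::size_t size(std::size_t i) const {
        std::size_t n = 0;
        for (const auto &sh : levels_[i]) n += sh->records.size();
        return n;
    }

    // Empty level i into level i+1, first emptying the target (recursively)
    // if it cannot absorb the incoming records. New levels are added as needed.
    void push_down(std::size_t i) {
        if (i + 1 == levels_.size()) levels_.emplace_back();
        if (size(i + 1) + size(i) > capacity(i + 1)) push_down(i + 1);
        if (layout_ == Layout::Tiering) {
            // Tiering: combine the source level's shards into one new shard.
            levels_[i + 1].push_back(std::make_unique<Shard>(pool(levels_[i])));
        } else {
            // Leveling: rebuild the single target shard from source + target.
            auto merged = pool(levels_[i + 1], pool(levels_[i]));
            levels_[i + 1].clear();
            levels_[i + 1].push_back(std::make_unique<Shard>(std::move(merged)));
        }
        levels_[i].clear();
    }

    void flush() {
        if (levels_.empty()) levels_.emplace_back();
        if (size(0) + buffer_.size() > capacity(0)) push_down(0);
        if (layout_ == Layout::Tiering) {
            levels_[0].push_back(std::make_unique<Shard>(buffer_));
        } else {
            auto merged = pool(levels_[0], buffer_);
            levels_[0].clear();
            levels_[0].push_back(std::make_unique<Shard>(std::move(merged)));
        }
        buffer_.clear();
    }

public:
    DynamicIndex(std::size_t nb, std::size_t s, Layout layout)
        : nb_(nb), s_(s), layout_(layout) {}

    void insert(Record r) {
        buffer_.push_back(r);               // O(1) append
        if (buffer_.size() >= nb_) flush();
    }
};
\end{verbatim}

Under leveling, the single shard on the target level is rebuilt every time
records arrive, which is the source of the additional write amplification
discussed in Section~\ref{ssec:design-space}; under tiering, a new shard is
simply appended until the level reaches capacity.
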
The remainder of this section discusses the performance characteristics of
insertion into this structure (Section~\ref{ssec:insert}), how the structure
can be used to correctly answer sampling queries (Section~\ref{ssec:sample}),
and efficient approaches for supporting deletes (Section~\ref{ssec:delete}).
Finally, it closes with a detailed discussion of the trade-offs within the
framework's design space (Section~\ref{ssec:design-space}).


\subsection{Insertion}
\label{ssec:insert}
The framework supports inserting new records by first appending them to the
end of the mutable buffer. When the buffer is full, it is flushed into a
sequence of levels containing shards of increasing capacity, using a procedure
determined by the layout policy, as discussed in Section~\ref{sec:framework}.
This method allows the cost of repeated shard reconstruction to be effectively
amortized.

Let the cost of constructing the SSI from an arbitrary set of $n$ records be
$C_c(n)$, and the cost of reconstructing the SSI given two or more shards
containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
of three parts: appending to the mutable buffer, constructing a new shard from
the buffered records during a flush, and the total cost of reconstructing
shards containing the record over the lifetime of the index. The cost of
appending to the mutable buffer is constant, and the cost of constructing a
shard from the buffer can be amortized across the records participating in the
buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly
once for each record. To derive an expression for the cost of repeated
reconstruction, first note that each record will participate in at most $s$
reconstructions on a given level, resulting in a worst-case amortized cost of
$O\left(s\cdot \nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself
will contain at most $\log_s n$ levels. Thus, over the lifetime of the index,
a given record will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$
cost in repeated reconstruction.

Combining these results, the total amortized insertion cost is
\begin{equation}
O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
\end{equation}
This can be simplified by noting that $s$ and $N_b$ are constants, with $N_b
\ll n$. Neglecting the constant terms, the amortized insertion cost of the
framework is
\begin{equation}
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}


\subsection{Sampling}
\label{ssec:sample}

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{img/sigmod23/sampling}
    \caption{\textbf{Overview of the multiple-shard sampling query process} for
    Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
    the shards are determined, then (2) these weights are used to construct an
    alias structure. Next, (3) the alias structure is queried $k$ times to
    determine per-shard sample sizes, and then (4) sampling is performed.
    Finally, (5) any rejected samples are retried starting from the alias
    structure, and the process is repeated until the desired number of samples
    has been retrieved.}
    \label{fig:sample}
\end{figure}

For many SSIs, sampling queries are completed in two stages. Some preliminary
processing is done to identify the range of records from which to sample, and
then samples are drawn from that range.
For example, IRS over a sorted list of records can be performed by first
identifying the upper and lower bounds of the query range in the list, and
then sampling records by randomly generating indexes within those bounds. The
general cost of a sampling query can be modeled as $P(n) + k S(n)$, where
$P(n)$ is the cost of preprocessing, $k$ is the number of samples drawn, and
$S(n)$ is the cost of sampling a single record.

When sampling from multiple shards, the situation grows more complex. For each
sample, the shard from which to draw the record must first be selected.
Consider an arbitrary sampling query $X(D, k)$ asking for a sample set of size
$k$ against dataset $D$. The framework splits $D$ across $m$ disjoint shards,
such that $D = \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset$ for all $i
\neq j$. The framework must ensure that $X(D, k)$ and $\bigcup_{i=1}^m X(D_i,
k_i)$ follow the same distribution, by selecting appropriate values for the
$k_i$s. If care is not taken to balance the number of samples drawn from a
shard with the total weight of the shard under $X$, then bias can be
introduced into the sample set's distribution. The selection of the $k_i$s can
be viewed as an instance of WSS, and solved using the alias method.

When sampling using the framework, the weight of each shard under the sampling
query is first determined, and a \emph{shard alias structure} is built over
these weights. Then, for each sample, the shard alias is used to determine the
shard from which to draw the sample. Let $W(n)$ be the cost of determining
this total weight for a single shard under the query. The initial setup cost,
prior to drawing any samples, is $O\left([W(n) + P(n)]\log_s n\right)$, as the
preliminary sampling work must be performed for each shard, and the weights
determined and the alias structure constructed. In many cases, however, the
preliminary work will also determine the total weight, and so the relevant
operation need only be applied once to accomplish both tasks.

To ensure that all records appear in the sample set with the appropriate
probability, the mutable buffer itself must also be a valid target for
sampling. There are two generally applicable techniques for this, both of
which are supported by the framework. The query being sampled from can be
executed directly against the buffer and the result set used to build a
temporary SSI, which can then be sampled from. Alternatively, rejection
sampling can be used to sample directly from the buffer, without executing the
query. In this case, the total weight of the buffer is used for its entry in
the shard alias structure. This can result in the buffer being
over-represented in the shard selection process, and so any rejections during
buffer sampling must be retried starting from shard selection. These same
considerations apply to rejection sampling used against shards as well.


\begin{example}
    \label{ex:sample}
    Consider executing a WSS query, with $k=1000$, across three shards
    containing integer keys with unit weight. $S_1$ contains only the
    key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$
    contains all integers on $[101, 200]$. These structures are shown
    in Figure~\ref{fig:sample}. Sampling is performed by first
    determining the normalized weights for each shard: $w_1 = 0.005$,
    $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
    shard alias structure. The shard alias structure is then queried
    $k$ times, resulting in a distribution of $k_i$s that is
    commensurate with the relative weights of each shard. Finally,
    each shard is queried in turn to draw the appropriate number of
    samples.
\end{example}

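The shard-selection step in this procedure is itself an instance of WSS over
the shard weights, and can be implemented with a standard alias structure. The
sketch below is a minimal C++ rendering of that step, using Vose's
construction of the alias structure; the names are illustrative and do not
correspond to the framework's actual interface.

\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Alias structure over (possibly unnormalized) weights: O(m) construction,
// O(1) per draw.
class AliasStructure {
    std::vector<double> prob_;
    std::vector<std::size_t> alias_;
public:
    explicit AliasStructure(std::vector<double> w)
        : prob_(w.size()), alias_(w.size()) {
        const std::size_t m = w.size();
        double total = 0;
        for (double x : w) total += x;
        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < m; i++) {
            w[i] = w[i] * m / total;                  // scale so the mean is 1
            (w[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
            std::size_t s = small.back(); small.pop_back();
            std::size_t l = large.back(); large.pop_back();
            prob_[s] = w[s];
            alias_[s] = l;
            w[l] = (w[l] + w[s]) - 1.0;
            (w[l] < 1.0 ? small : large).push_back(l);
        }
        for (std::size_t i : large) { prob_[i] = 1.0; alias_[i] = i; }
        for (std::size_t i : small) { prob_[i] = 1.0; alias_[i] = i; }
    }

    template <typename RNG> std::size_t sample(RNG &rng) const {
        std::uniform_int_distribution<std::size_t> col(0, prob_.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::size_t i = col(rng);
        return coin(rng) < prob_[i] ? i : alias_[i];
    }
};

// Assign per-shard sample counts k_i for a query requesting k samples, given
// each shard's total weight under the query (the W(n) step).
std::vector<std::size_t> assign_samples(const std::vector<double> &shard_weights,
                                        std::size_t k, std::mt19937 &rng) {
    AliasStructure alias(shard_weights);
    std::vector<std::size_t> k_i(shard_weights.size(), 0);
    for (std::size_t j = 0; j < k; j++) k_i[alias.sample(rng)]++;
    return k_i;
}
\end{verbatim}

Any sample that is later rejected (for example, during rejection sampling
against the buffer) is retried by drawing a new shard index from this same
structure, as described above.
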
Assuming that rejection sampling is used on the mutable buffer, the worst-case
time complexity for drawing $k$ samples from an index containing $n$ elements
with a per-sample cost of $S(n)$ is
\begin{equation}
    \label{eq:sample-cost}
    O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right)
\end{equation}

%If instead a temporary SSI is constructed, the cost of sampling
%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$.

\begin{figure}
    \centering
    \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
    \subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}

    \caption{\textbf{Overview of the rejection check procedure for deleted
    records.} First, (1) a record is sampled. When using the tombstone delete
    policy (Figure~\ref{fig:delete-tombstone}), the rejection check starts by
    (2) querying the Bloom filter of the mutable buffer. The filter indicates
    that the record is not present, so (3) the filter on $L_0$ is queried
    next. This filter returns a false positive, so (4) a point-lookup is
    executed against $L_0$. The lookup fails to find a tombstone, so the
    search continues and (5) the filter on $L_1$ is checked, which reports
    that the tombstone is present. This time, it is not a false positive, and
    so (6) a lookup against $L_1$ (7) locates the tombstone. The record is
    thus rejected. When using the tagging policy (Figure~\ref{fig:delete-tag}),
    (1) the record is sampled and (2) checked directly for the delete tag. The
    tag is set, so the record is immediately rejected.}

    \label{fig:delete}

\end{figure}


\subsection{Deletion}
\label{ssec:delete}

Because the shards are static, records cannot be arbitrarily removed from
them. Deletes must therefore be supported in some other way, with the ultimate
goal of preventing deleted records from appearing in sampling query result
sets. This can be realized in two ways: locating the record and marking it, or
inserting a new record which indicates that an existing record should be
treated as deleted. The framework supports both of these techniques, the
selection of which is called the \emph{delete policy}. The former policy is
called \emph{tagging} and the latter \emph{tombstone}.

Tagging a record is straightforward. Point-lookups are performed against each
shard in the index, as well as the buffer, for the record to be deleted. When
it is found, a bit in a header attached to the record is set. When sampling,
any record selected with this bit set is automatically rejected. Tombstones
represent a lazy strategy for deleting records. When a record is deleted using
tombstones, a new record with identical key and value, but with a
``tombstone'' bit set, is inserted into the index. A record's presence can
then be checked by performing a point-lookup: if a tombstone with the same key
and value exists above the record in the index, the record is rejected when
sampled.

Two aspects of performance are pertinent when discussing deletes: the cost of
the delete operation itself, and the cost of verifying the presence of a
sampled record.
The choice of delete policy represents a trade-off between these two costs.
Beyond this simple trade-off, the delete policy also has other implications
that can affect its applicability to certain types of SSI. Most notably,
tombstones do not require any in-place updating of records, whereas tagging
does. This means that using tombstones is the only way to ensure total
immutability of the data within shards, which avoids random writes and eases
concurrency control. The tombstone delete policy is therefore particularly
appealing in external and concurrent contexts.

\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
the same as that of an ordinary insert. Tagging, by contrast, requires a
point-lookup of the record to be deleted, and so is more expensive. Assuming a
point-lookup operation with cost $L(n)$, a tagged delete must search each
level in the index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s
n\right)$ time.

\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
itself, the delete policy affects the cost of determining whether a given
record has been deleted. This is called the \emph{rejection check cost},
$R(n)$. When using tagging, the information necessary to make the rejection
decision is local to the sampled record, and so $R(n) \in O(1)$. When using
tombstones, however, it is not; a point-lookup must be performed to search for
a given record's corresponding tombstone. This lookup must examine the buffer
and each shard within the index, resulting in a rejection check cost of $R(n)
\in O\left(N_b + L(n) \log_s n\right)$. The rejection check process for the
two delete policies is summarized in Figure~\ref{fig:delete}.

Two factors contribute to the tombstone rejection check cost: the size of the
buffer, and the cost of performing a point-lookup against the shards. The
latter cost can be controlled using the framework's ability to associate
auxiliary structures with shards. For SSIs which do not support efficient
point-lookups, a hash table can be added to map key-value pairs to their
locations within the SSI. This allows for constant-time rejection checks, even
in situations where the index would not otherwise support them. However, the
storage cost of this intervention is high, and in situations where the SSI
does support efficient point-lookups, it is not necessary. Further performance
improvements can be achieved by noting that the probability of a given record
having an associated tombstone in any particular shard is relatively small.
This means that many point-lookups will be executed against shards that do not
contain the tombstone being searched for. These unnecessary lookups can be
partially avoided by attaching Bloom filters~\cite{bloom70} for tombstones to
each shard. By inserting tombstones into these filters during reconstruction,
point-lookups against some shards which do not contain the tombstone being
searched for can be bypassed. Filters can be attached to the buffer as well,
which may be even more significant due to the linear cost of scanning it. As
the goal is a reduction of rejection check costs, these filters need only be
populated with tombstones. Techniques for bounding the number of tombstones on
a given level are discussed later in this section; these allow the memory
usage of the filters to be tightly controlled while still ensuring precise
bounds on filter error.

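As an illustration, the sketch below contrasts the two rejection checks in
C++. The interfaces here are invented for the example rather than taken from
the framework: each shard is assumed to expose a tombstone Bloom filter probe
and a tombstone point-lookup.

\begin{verbatim}
#include <cstdint>
#include <functional>
#include <vector>

struct Record {
    std::int64_t key;
    std::int64_t value;
    bool tombstone = false;   // set on tombstone records
    bool tagged = false;      // set by the tagging delete policy
};

struct ShardView {
    // Returns true if the filter *might* contain a tombstone for (key, value).
    std::function<bool(std::int64_t, std::int64_t)> filter_may_contain;
    // Tombstone point-lookup with cost L(n); returns true if one is found.
    std::function<bool(std::int64_t, std::int64_t)> find_tombstone;
};

// Tagging: the rejection decision is local to the sampled record, so R(n) = O(1).
bool rejected_tagging(const Record &r) { return r.tagged; }

// Tombstones: scan the buffer, then probe each shard *above* the one the
// record was sampled from, skipping shards whose filter rules the key out.
bool rejected_tombstone(const Record &r,
                        const std::vector<Record> &buffer,
                        const std::vector<ShardView> &shards_above) {
    for (const auto &b : buffer)
        if (b.tombstone && b.key == r.key && b.value == r.value) return true;
    for (const auto &shard : shards_above) {
        if (!shard.filter_may_contain(r.key, r.value)) continue;  // skip lookup
        if (shard.find_tombstone(r.key, r.value)) return true;
    }
    return false;
}
\end{verbatim}
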
\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
alters the analysis of sampling costs. A record that has been deleted cannot
be present in the sample set, and therefore the presence of each sampled
record must be verified. If a record has been deleted, it must be rejected.
When retrying samples rejected due to deletion, the process must restart from
shard selection, as deleted records may be counted in the weight totals used
to construct that structure. This increases the cost of sampling to
\begin{equation}
\label{eq:sampling-cost}
    O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \mathbf{Pr}[\text{rejection}]} \cdot R(n)\right)
\end{equation}
where $R(n)$ is the cost of checking whether a sampled record has been
deleted, and $\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected
number of sampling attempts required to obtain $k$ samples, given a fixed
rejection probability. The rejection probability itself is a function of the
workload, and is unbounded.

\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
constitute wasted memory accesses and random number generations, and so steps
should be taken to minimize their frequency. The probability of a rejection is
directly related to the number of deleted records, which is itself a function
of the workload and dataset. This means that, without building
counter-measures into the framework, tight bounds on sampling performance
cannot be provided in the presence of deleted records. It is therefore
critical that the framework support some method for bounding the number of
deleted records within the index.

While the static nature of shards prevents the direct removal of records at
the moment they are deleted, it does not prevent their removal during
reconstruction. When using tagging, all tagged records encountered during
reconstruction can be removed. When using tombstones, however, the removal
process is non-trivial. In principle, a rejection check could be performed for
each record encountered during reconstruction, but this would increase
reconstruction costs and introduce a new problem of tracking tombstones
associated with records that have already been removed. Instead, a lazier
approach can be used: delaying removal until a tombstone and its associated
record participate in the same shard reconstruction. This delay allows both
the record and its tombstone to be removed at the same time, an approach
called \emph{tombstone cancellation}. In general, this can be implemented
using an extra linear scan of the input shards before reconstruction to
identify tombstones and associated records for cancellation, but for many SSIs
optimizations exist that allow it to be performed during the reconstruction
itself at no extra cost.

The removal of deleted records passively during reconstruction is not enough
to bound the number of deleted records within the index. It is not difficult
to envision pathological scenarios in which deletes result in unbounded
rejection rates, even with this mitigation in place. However, the dropping of
deleted records does provide a useful property: any specific deleted record
will eventually be removed from the index after a finite number of
reconstructions. Using this fact, a bound on the number of deleted records can
be enforced. A new parameter, $\delta$, is defined, representing the maximum
proportion of deleted records within the index.
Each level, and the buffer, tracks the number of deleted records it contains
by counting its tagged records or tombstones. Following each buffer flush, the
proportion of deleted records is checked against $\delta$. If any level is
found to exceed it, a proactive reconstruction is triggered, pushing its
shards down into the next level. The process is repeated until all levels
respect the bound, allowing the number of deleted records to be precisely
controlled, which, by extension, bounds the rejection rate. This process is
called \emph{compaction}.

Assuming every record is equally likely to be sampled, this new bound can be
applied to the analysis of sampling costs. The probability of a record being
rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
Equation~\ref{eq:sampling-cost} yields
\begin{equation}
%\label{eq:sampling-cost-del}
    O\left([W(n) + P(n)]\log_s n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation}

Asymptotically, this proactive compaction does not alter the analysis of
insertion costs. Each record is still written at most $s$ times on each level,
there are at most $\log_s n$ levels, and the buffer insertion and SSI
construction costs are unchanged. As a result, the amortized insertion cost
remains the same.

This compaction strategy is based upon tombstone and record counts, and the
bounds assume that every record is equally likely to be sampled. For certain
sampling problems (such as WSS), other conditions must be considered to
provide a bound on the rejection rate. To account for these situations in a
general fashion, the framework supports problem-specific compaction triggers
that can be tailored to the SSI being used. These allow compactions to be
triggered based on other properties, such as the rejection rate of a level,
the total weight of deleted records, and the like.


\subsection{Trade-offs in the Framework Design Space}
\label{ssec:design-space}
The framework has several tunable parameters, allowing it to be tailored to
specific applications. This design space contains trade-offs among three major
performance characteristics: update cost, sampling cost, and auxiliary memory
usage. The two most significant decisions when implementing this framework are
the selection of the layout and delete policies. The asymptotic analysis of
the previous sections obscures some of the differences between these policies,
but they do have significant practical performance implications.

\Paragraph{Layout Policy.} The choice of layout policy represents a clear
trade-off between update and sampling performance. Leveling results in fewer
shards of larger size, whereas tiering results in a larger number of smaller
shards. As a result, leveling reduces the costs associated with point-lookups
and sampling query preprocessing by a constant factor, compared to tiering.
However, it results in more write amplification: a given record may be
involved in up to $s$ reconstructions on a single level, as opposed to the
single reconstruction per level under tiering.

\Paragraph{Delete Policy.} There is a trade-off between delete performance and
sampling performance in the choice of delete policy. Tagging requires a
point-lookup when performing a delete, which is more expensive than the insert
required by tombstones. However, it also allows constant-time rejection
checks, unlike tombstones, which require a point-lookup for each sampled
record.
In situations where deletes are common and write throughput is critical,
tombstones may be more useful. Tombstones are also ideal in situations where
immutability is required, or random writes must be avoided. Generally
speaking, however, tagging is superior when using SSIs that support it,
because sampling rejection checks will usually be more common than deletes.

\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
capacity and scale factor both influence the number of levels within the
index, and by extension the number of distinct shards. Sampling and
point-lookups perform better with fewer shards. Smaller shards are also faster
to reconstruct, although the same adjustments that reduce shard size also
result in a larger number of reconstructions, so the trade-off here is less
clear.

The scale factor has an interesting interaction with the layout policy: when
using leveling, the scale factor directly controls the amount of write
amplification per level. Larger scale factors mean more time is spent
reconstructing shards on a level, reducing update performance. Tiering does
not have this problem and should see its update performance benefit directly
from a larger scale factor, as this reduces the number of reconstructions.

The buffer capacity also influences the number of levels, but is more
significant in its effect on point-lookup performance: a lookup must perform a
linear scan of the buffer. Likewise, the unstructured nature of the buffer
will also contribute negatively to sampling performance, irrespective of which
buffer sampling technique is used. As a result, although a large buffer will
reduce the number of shards, it will also hurt sampling and delete (under
tagging) performance. It is important to minimize the cost of these buffer
scans, and so it is preferable to keep the buffer small, ideally small enough
to fit within the CPU's L2 cache. The number of shards within the index is,
then, better controlled by changing the scale factor rather than the buffer
capacity. Using a smaller buffer will result in more compactions and shard
reconstructions; however, the empirical evaluation in
Section~\ref{ssec:ds-exp} demonstrates that this is not a serious performance
problem when the scale factor is chosen appropriately. When the shards are in
memory, frequent small reconstructions do not carry a significant performance
penalty compared to less frequent, larger ones.

\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
auxiliary data structures allows memory to be traded for insertion or sampling
performance. The use of Bloom filters for accelerating tombstone rejection
checks has already been discussed, but many other options exist. Bloom filters
could also be used to accelerate point-lookups for delete tagging, though such
filters would require much more memory than tombstone-only ones to be
effective. An auxiliary hash table could be used to accelerate point-lookups,
or range filters such as SuRF~\cite{zhang18} or Rosetta~\cite{siqiang20} could
be added to accelerate pre-processing for range queries, as used in IRS or
WIRS.

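Taken together, the design space described above reduces to a small set of
parameters that an implementation must expose. The sketch below collects them
in a C++ configuration struct; the names and default values are purely
illustrative, with comments noting the primary trade-off each parameter
controls.

\begin{verbatim}
#include <cstddef>

enum class Layout { Leveling, Tiering };        // lookup/sampling cost vs. write amplification
enum class DeletePolicy { Tombstone, Tagging }; // delete cost vs. rejection check cost

struct FrameworkConfig {
    // N_b: kept small (ideally L2-resident); buffer scans affect lookups,
    // sampling, and tagged deletes.
    std::size_t buffer_capacity = 1000;

    // s: primary knob for the number of levels/shards; under leveling it also
    // controls per-level write amplification.
    std::size_t scale_factor = 6;

    Layout layout = Layout::Tiering;
    DeletePolicy delete_policy = DeletePolicy::Tagging;

    // delta: maximum proportion of deleted records tolerated per level before
    // a proactive compaction is triggered; bounds the sampling rejection rate.
    double max_delete_proportion = 0.05;

    // Optional auxiliary structures trade memory for faster rejection checks
    // and point-lookups.
    bool tombstone_bloom_filters = true;
    bool auxiliary_hash_table = false;
};
\end{verbatim}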