author    Douglas Rumbaugh <dbr4@psu.edu>  2025-04-27 17:36:57 -0400
committer Douglas Rumbaugh <dbr4@psu.edu>  2025-04-27 17:36:57 -0400
commit    5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch)
tree      276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/sigmod23/framework.tex
Initial commit
Diffstat (limited to 'chapters/sigmod23/framework.tex')
-rw-r--r--  chapters/sigmod23/framework.tex  573
1 file changed, 573 insertions(+), 0 deletions(-)
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
new file mode 100644
index 0000000..32a32e1
--- /dev/null
+++ b/chapters/sigmod23/framework.tex
@@ -0,0 +1,573 @@
+\section{Dynamic Sampling Index Framework}
+\label{sec:framework}
+
+This work attempts to design a solution to independent sampling that
+achieves \emph{both} efficient updates and near-constant cost per
+sample. Because the goal is to tackle the problem in a generalized fashion,
+rather than to design problem-specific data structures to serve as the basis
+of an index, a framework is created that allows existing static data
+structures to be used as the basis for a sampling index by automatically
+adding support for data updates using a modified version of the Bentley-Saxe
+method.
+
+Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
+directly applied to sampling problems. The concept of decomposability is not
+cleanly applicable to sampling, because the distribution of records in the
+result set, rather than the records themselves, must be matched following the
+result merge. Efficiently controlling the distribution requires each sub-query
+to access information external to the structure against which it is being
+processed, a contingency unaccounted for by Bentley-Saxe. Further, the process
+of reconstruction used in Bentley-Saxe provides poor worst-case complexity
+bounds~\cite{saxe79}, and attempts to modify the procedure to provide better
+worst-case performance are complex and have worse performance in the common
+case~\cite{overmars81}. Despite these limitations, this chapter will argue that
+the core principles of the Bentley-Saxe method can be profitably applied to
+sampling indexes, once a system for controlling result set distributions and a
+more effective reconstruction scheme have been devised. The solution to
+the former will be discussed in Section~\ref{ssec:sample}. For the latter,
+inspiration is drawn from the literature on the LSM tree.
+
+The LSM tree~\cite{oneil96} is a data structure proposed to optimize
+write throughput in disk-based storage engines. It consists of a memory
+table of bounded size, used to buffer recent changes, and a hierarchy
+of external levels containing indexes of exponentially increasing
+size. When the memory table has reached capacity, it is emptied into the
+external levels. Random writes are avoided by treating the data within
+the external levels as immutable; all writes go through the memory
+table. This introduces write amplification but maximizes sequential
+writes, which is important for maintaining high throughput in disk-based
+systems. The LSM tree is associated with a broad and well-studied design
+space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing
+trade-offs between three key performance metrics: read performance, write
+performance, and auxiliary memory usage. The challenges faced in
+reconstructing predominantly in-memory indexes are quite different from those
+the LSM tree is intended to address, having little to do with disk-based
+systems and sequential IO operations. However, the LSM tree possesses a rich
+design space for managing the periodic reconstruction of data structures in a
+manner that is both more practical and more flexible than that of
+Bentley-Saxe. By borrowing from this design space, this preexisting body of
+work can be leveraged and many of Bentley-Saxe's limitations addressed.
+
+\captionsetup[subfloat]{justification=centering}
+
+\begin{figure*}
+ \centering
+ \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
+ \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
+
+ \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
+ mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
+ of SSIs and auxiliary structures [A]) using the leveling
+ (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
+ policies. Records are represented as black/colored squares, and grey
+ squares represent unused capacity. An insertion requiring a multi-level
+ reconstruction is illustrated.} \label{fig:framework}
+
+\end{figure*}
+
+
+\subsection{Framework Overview}
+The goal of this chapter is to build a general framework that extends most SSIs
+with efficient support for updates by splitting the index into small data structures
+to reduce reconstruction costs, and then distributing the sampling process over these
+smaller structures.
+The framework is designed to work efficiently with any SSI, so
+long as it has the following properties,
+\begin{enumerate}
+ \item The underlying full query $Q$ supported by the SSI from whose results
+ samples are drawn satisfies the following property:
+ for any dataset $D = \cup_{i = 1}^{n}D_i$
+        where $D_i \cap D_j = \emptyset$ for all $i \neq j$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
+ \item \emph{(Optional)} The SSI supports efficient point-lookups.
+ \item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
+ returned by the underlying full query.
+\end{enumerate}
+
+The first property applies to the query being sampled from, and is essential
+for the correctness of sample sets reported by extended sampling
+indexes.\footnote{ This condition is stricter than the definition of a
+decomposable search problem in the Bentley-Saxe method, which allows for
+\emph{any} constant-time merge operation, not just union.
+However, this condition is satisfied by many common types of database
+query, such as predicate-based filtering queries.} The latter two properties
+are optional, but reduce deletion and sampling costs respectively. Should the
+SSI fail to support point-lookups, an auxiliary hash table can be attached to
+the data structures.
+Should it fail to support query result weight reporting, rejection
+sampling can be used in place of the more efficient scheme discussed in
+Section~\ref{ssec:sample}. The analysis of this framework will generally
+assume that all three conditions are satisfied.
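+
+To make these requirements concrete, the following sketch gives one possible
+form for the interface that an SSI must expose to the framework. It is an
+illustrative Python sketch only; the class and method names are hypothetical
+and do not correspond to any particular implementation.
+
+\begin{verbatim}
+# Hypothetical interface an SSI must expose to be extended by the
+# framework. The two optional methods reduce delete and sampling
+# costs, respectively, but are not required.
+from abc import ABC, abstractmethod
+
+class SSI(ABC):
+    @abstractmethod
+    def build(self, records):
+        """Construct the static structure from a set of records."""
+
+    @abstractmethod
+    def sample(self, query, rng):
+        """Draw one sample from the results of the query."""
+
+    def point_lookup(self, key, value):
+        """Optional: locate a specific record (used for deletes)."""
+        raise NotImplementedError
+
+    def query_weight(self, query):
+        """Optional: total weight of records matching the query."""
+        raise NotImplementedError
+\end{verbatim}
+
+An SSI lacking the optional operations can still be extended: an auxiliary
+hash table can stand in for point-lookups, and rejection sampling for weight
+reporting, as described above.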
+
+Given an SSI with these properties, a dynamic extension can be produced as
+shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
+shards containing an instance of the SSI being extended, and optional auxiliary
+data structures. The auxiliary structures allow acceleration of certain
+operations that are required by the framework, but which the SSI being extended
+does not itself support efficiently. Examples of possible auxiliary structures
+include hash tables, Bloom filters~\cite{bloom70}, and range
+filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
+increasing record capacity, with either one shard, or up to a fixed maximum
+number of shards, per level. The decision to place one or many shards per level
+is called the \emph{layout policy}. The policy names are borrowed from the
+literature on the LSM tree, with the former called \emph{leveling} and the
+latter called \emph{tiering}.
+
+To avoid a reconstruction on every insert, an unsorted array of fixed capacity
+($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
+unsorted, it is kept small to maintain reasonably efficient sampling
+and point-lookup performance. All updates are performed by appending new
+records to the tail of this buffer.
+If a record currently within the index is
+to be updated to a new value, it must first be deleted, and then a record with
+the new value inserted. This ensures that old versions of records are properly
+filtered from query results.
+
+When the buffer is full, it is flushed to make room for new records. The
+flushing procedure is based on the layout policy in use. When using leveling
+(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
+$L_0$ and those in the buffer. This is used to create a new shard, which
+replaces the one previously in $L_0$. When using tiering
+(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
+buffer, and placed into $L_0$ without altering the existing shards. Each level
+$L_i$ has a record capacity of $N_b \cdot s^{i+1}$, where $s$ is a configurable
+parameter called the scale factor. These records are organized in one large
+shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under
+tiering. When a level reaches its capacity, it must be emptied to make room for
+the records flushed into it. This is accomplished by moving its records down to
+the next level of the index. Under leveling, this requires constructing a new
+shard containing all records from both the source and target levels, and
+placing this shard into the target, leaving the source empty. Under tiering,
+the shards in the source level are combined into a single new shard that is
+placed into the target level. Should the target be full, it is first emptied by
+applying the same procedure. New empty levels
+are dynamically added as necessary to accommodate these reconstructions.
+Note that shard reconstructions are not necessarily performed using
+merging, though merging can be used as an optimization of the reconstruction
+procedure where such an algorithm exists. In general, reconstruction requires
+only pooling the records of the shards being combined and then applying the SSI's
+standard construction algorithm to this set of records.
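+
+To illustrate, the following sketch outlines a buffer flush under the tiering
+layout policy, with a shard modeled simply as a sorted list of records. The
+function names are hypothetical and the sketch omits auxiliary structures; it
+is intended only to make the reconstruction cascade concrete.
+
+\begin{verbatim}
+# Illustrative sketch of a buffer flush under tiering. Level i
+# holds at most s shards; a full level is emptied into the next
+# level (recursively, via the loop) before a shard is placed in it.
+def build_shard(records):
+    return sorted(records)          # stands in for SSI construction
+
+def flush(buffer, levels, s):
+    incoming = build_shard(buffer)  # shard built from the buffer
+    buffer.clear()
+    level = 0
+    while True:
+        if level == len(levels):
+            levels.append([])       # grow the index as needed
+        if len(levels[level]) < s:  # room at this level
+            levels[level].append(incoming)
+            return
+        # Level is full: combine its shards into one new shard,
+        # place the incoming shard here, and carry the combined
+        # shard down to the next level.
+        merged = build_shard([r for sh in levels[level] for r in sh])
+        levels[level] = [incoming]
+        incoming, level = merged, level + 1
+\end{verbatim}
+
+Leveling differs only in that each level holds a single shard, and a flush into
+a level rebuilds that shard from the union of the incoming and existing
+records.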
+
+\begin{table}[t]
+\caption{Frequently Used Notation}
+\centering
+
+\begin{tabular}{|p{2.5cm} p{5cm}|}
+ \hline
+ \textbf{Variable} & \textbf{Description} \\ \hline
+ $N_b$ & Capacity of the mutable buffer \\ \hline
+ $s$ & Scale factor \\ \hline
+ $C_c(n)$ & SSI initial construction cost \\ \hline
+ $C_r(n)$ & SSI reconstruction cost \\ \hline
+ $L(n)$ & SSI point-lookup cost \\ \hline
+ $P(n)$ & SSI sampling pre-processing cost \\ \hline
+ $S(n)$ & SSI per-sample sampling cost \\ \hline
+ $W(n)$ & Shard weight determination cost \\ \hline
+ $R(n)$ & Shard rejection check cost \\ \hline
+ $\delta$ & Maximum delete proportion \\ \hline
+ %$\rho$ & Maximum rejection rate \\ \hline
+\end{tabular}
+\label{tab:nomen}
+
+\end{table}
+
+Table~\ref{tab:nomen} lists frequently used notation for the various parameters
+of the framework, which will be used in the coming analysis of the costs and
+trade-offs associated with operations within the framework's design space. The
+remainder of this section will discuss the performance characteristics of
+insertion into this structure (Section~\ref{ssec:insert}), how it can be used
+to correctly answer sampling queries (Section~\ref{ssec:sample}), and efficient
+approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will
+close with a detailed discussion of the trade-offs within the framework's
+design space (Section~\ref{ssec:design-space}).
+
+
+\subsection{Insertion}
+\label{ssec:insert}
+The framework supports inserting new records by first appending them to the end
+of the mutable buffer. When it is full, the buffer is flushed into a sequence
+of levels containing shards of increasing capacity, using a procedure
+determined by the layout policy as discussed in Section~\ref{sec:framework}.
+This method allows for the cost of repeated shard reconstruction to be
+effectively amortized.
+
+Let the cost of constructing the SSI from an arbitrary set of $n$ records be
+$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
+containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
+of three parts: appending to the mutable buffer, constructing a new
+shard from the buffered records during a flush, and the total cost of
+reconstructing shards containing the record over the lifetime of the index. The
+cost of appending to the mutable buffer is constant, and the cost of constructing a
+shard from the buffer can be amortized across the records participating in the
+buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
+each record. To derive an expression for the cost of repeated reconstruction,
+first note that each record will participate in at most $s$ reconstructions on
+a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
+\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
+$\log_s n$ levels. Thus, over the lifetime of the index a given record
+will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
+reconstruction.
+
+Combining these results, the total amortized insertion cost is
+\begin{equation}
+O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
+\end{equation}
+This can be simplified by noting that $s$ is a constant, and that $N_b \ll n$ is
+also a constant. Neglecting these terms, the amortized insertion cost of the
+framework is,
+\begin{equation}
+O\left(\frac{C_r(n)}{n}\log_s n\right)
+\end{equation}
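+
+As a concrete point of reference, consider a hypothetical SSI that can be
+reconstructed from existing shards in linear time, for example by merging
+sorted runs, so that $C_r(n) \in O(n)$. The amortized insertion cost then
+reduces to
+\begin{equation}
+O\left(\frac{C_r(n)}{n}\log_s n\right) = O\left(\log_s n\right)
+\end{equation}
+whereas an SSI that must fully re-sort its input during reconstruction, with
+$C_r(n) \in O(n\log n)$, pays $O\left(\log n \log_s n\right)$ amortized per
+insert.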
+
+
+\subsection{Sampling}
+\label{ssec:sample}
+
+\begin{figure}
+ \centering
+ \includegraphics[width=\textwidth]{img/sigmod23/sampling}
+ \caption{\textbf{Overview of the multiple-shard sampling query process} for
+ Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
+ the shards is determined, then (2) these weights are used to construct an
+ alias structure. Next, (3) the alias structure is queried $k$ times to
+ determine per shard sample sizes, and then (4) sampling is performed.
+ Finally, (5) any rejected samples are retried starting from the alias
+ structure, and the process is repeated until the desired number of samples
+ has been retrieved.}
+ \label{fig:sample}
+
+\end{figure}
+
+For many SSIs, sampling queries are completed in two stages. Some preliminary
+processing is done to identify the range of records from which to sample, and then
+samples are drawn from that range. For example, IRS over a sorted list of
+records can be performed by first identifying the upper and lower bounds of the
+query range in the list, and then sampling records by randomly generating
+indexes within those bounds. The general cost of a sampling query can be
+modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is
+the number of samples drawn, and $S(n)$ is the cost of sampling a single
+record.
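+
+As an illustration of this two-stage model, the following sketch shows IRS
+over a simple sorted array; the function is hypothetical and stands apart from
+the framework, serving only to make the roles of $P(n)$ and $S(n)$ concrete.
+
+\begin{verbatim}
+# Illustrative IRS over a sorted array: the bounds of the query
+# range are located once via binary search (the P(n) term), after
+# which each sample costs O(1) (the S(n) term).
+import bisect, random
+
+def irs_sample(sorted_keys, low, high, k, rng=random):
+    start = bisect.bisect_left(sorted_keys, low)     # P(n)
+    end = bisect.bisect_right(sorted_keys, high)
+    if start >= end:
+        return []
+    return [sorted_keys[rng.randrange(start, end)]   # k * S(n)
+            for _ in range(k)]
+\end{verbatim}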
+
+When sampling from multiple shards, the situation grows more complex. For each
+sample, the shard to select the record from must first be decided. Consider an
+arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against
+dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D
+= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset$ for all $i \neq j$. The
+framework must ensure that $X(D, k)$ and $\bigcup_{i=1}^m X(D_i, k_i)$ follow
+the same distribution, by selecting appropriate values for the $k_i$s. If care
+is not taken to balance the number of samples drawn from a shard with the total
+weight of the shard under $X$, then bias can be introduced into the sample
+set's distribution. The selection of $k_i$s can be viewed as an instance of WSS,
+and solved using the alias method.
+
+When sampling using the framework, first the weight of each shard under the
+sampling query is determined and a \emph{shard alias structure} built over
+these weights. Then, for each sample, the shard alias is used to
+determine the shard from which to draw the sample. Let $W(n)$ be the cost of
+determining this total weight for a single shard under the query. The initial setup
+cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s
+n\right)$, as the preliminary work for sampling from each shard must be
+performed, as well as weights determined and alias structure constructed. In
+many cases, however, the preliminary work will also determine the total weight,
+and so the relevant operation need only be applied once to accomplish both
+tasks.
+
+To ensure that all records appear in the sample set with the appropriate
+probability, the mutable buffer itself must also be a valid target for
+sampling. There are two generally applicable techniques for this, both of
+which are supported by the framework. The query being sampled
+from can be directly executed against the buffer and the result set used to
+build a temporary SSI, which can be sampled from. Alternatively, rejection
+sampling can be used to sample directly from the buffer, without executing the
+query. In this case, the total weight of the buffer is used for its entry in
+the shard alias structure. This can result in the buffer being
+over-represented in the shard selection process, and so any rejections during
+buffer sampling must be retried starting from shard selection. These same
+considerations apply to rejection sampling used against shards, as well.
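+
+The following sketch summarizes this procedure, assuming rejection sampling is
+used for the mutable buffer. It pairs a simple implementation of the alias
+method with the shard-selection and retry loop; the source objects are assumed
+to follow the hypothetical interface sketched earlier, with per-source
+preprocessing folded into weight determination, and a rejected sampling
+attempt returns nothing and is retried from shard selection.
+
+\begin{verbatim}
+# Illustrative multi-shard sampling. build_alias/draw_alias give an
+# O(m) construction and O(1) draws (Walker/Vose alias method); the
+# source interface (query_weight, sample) is hypothetical.
+import random
+
+def build_alias(weights):
+    m, total = len(weights), float(sum(weights))
+    prob, alias = [w * m / total for w in weights], [0] * m
+    small = [i for i, p in enumerate(prob) if p < 1.0]
+    large = [i for i, p in enumerate(prob) if p >= 1.0]
+    while small and large:
+        s, l = small.pop(), large.pop()
+        alias[s] = l
+        prob[l] -= 1.0 - prob[s]
+        (small if prob[l] < 1.0 else large).append(l)
+    for i in small + large:
+        prob[i] = 1.0
+    return prob, alias
+
+def draw_alias(prob, alias, rng):
+    i = rng.randrange(len(prob))
+    return i if rng.random() < prob[i] else alias[i]
+
+def sample_query(query, sources, k, rng=random):
+    # sources = [mutable buffer] + shards
+    weights = [src.query_weight(query) for src in sources]   # W(n)
+    prob, alias = build_alias(weights)
+    results = []
+    while len(results) < k:
+        src = sources[draw_alias(prob, alias, rng)]
+        rec = src.sample(query, rng)    # S(n); None if rejected
+        if rec is not None:
+            results.append(rec)
+    return results
+\end{verbatim}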
+
+
+\begin{example}
+ \label{ex:sample}
+ Consider executing a WSS query, with $k=1000$, across three shards
+ containing integer keys with unit weight. $S_1$ contains only the
+ key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$
+ contains all integers on $[101, 200]$. These structures are shown
+ in Figure~\ref{fig:sample}. Sampling is performed by first
+ determining the normalized weights for each shard: $w_1 = 0.005$,
+ $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
+ shard alias structure. The shard alias structure is then queried
+ $k$ times, resulting in a distribution of $k_i$s that is
+ commensurate with the relative weights of each shard. Finally,
+ each shard is queried in turn to draw the appropriate number
+ of samples.
+\end{example}
+
+
+Assuming that rejection sampling is used on the mutable buffer, the worst-case
+time complexity for drawing $k$ samples from an index containing $n$ elements
+with a sampling cost of $S(n)$ is,
+\begin{equation}
+ \label{eq:sample-cost}
+ O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right)
+\end{equation}
+
+%If instead a temporary SSI is constructed, the cost of sampling
+%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$.
+
+\begin{figure}
+ \centering
+ \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
+ \subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
+
+ \caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
+ a record is sampled (1).
+ When using the tombstone delete policy
+ (Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying
+ the bloom filter of the mutable buffer. The filter indicates the record is
+    the Bloom filter of the mutable buffer. The filter indicates the record is
+ returns a false positive, so (4) a point-lookup is executed against $L_0$.
+ The lookup fails to find a tombstone, so the search continues and (5) the
+ filter on $L_1$ is checked, which reports that the tombstone is present.
+ This time, it is not a false positive, and so (6) a lookup against $L_1$
+ (7) locates the tombstone. The record is thus rejected. When using the
+ tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and
+ (2) checked directly for the delete tag. It is set, so the record is
+ immediately rejected.}
+
+ \label{fig:delete}
+
+\end{figure}
+
+
+\subsection{Deletion}
+\label{ssec:delete}
+
+Because the shards are static, records cannot be arbitrarily removed from them.
+This requires that deletes be supported in some other way, with the ultimate
+goal being the prevention of deleted records' appearance in sampling query
+result sets. This can be realized in two ways: locating the record and marking
+it, or inserting a new record which indicates that an existing record should be
+treated as deleted. The framework supports both of these techniques, the
+selection of which is called the \emph{delete policy}. The former policy is
+called \emph{tagging} and the latter \emph{tombstone}.
+
+Tagging a record is straightforward. Point-lookups are performed against each
+shard in the index, as well as the buffer, for the record to be deleted. When
+it is found, a bit in a header attached to the record is set. When sampling,
+any records selected with this bit set are automatically rejected. Tombstones
+represent a lazy strategy for deleting records. When a record is deleted using
+tombstones, a new record with identical key and value, but with a ``tombstone''
+bit set, is inserted into the index. A record's presence can be checked by
+performing a point-lookup. If a tombstone with the same key and value exists
+above the record in the index, then it should be rejected when sampled.
+
+Two important aspects of performance are pertinent when discussing deletes: the
+cost of the delete operation, and the cost of verifying the presence of a
+sampled record. The choice of delete policy represents a trade-off between
+these two costs. Beyond this simple trade-off, the delete policy also has other
+implications that can affect its applicability to certain types of SSI. Most
+notably, tombstones do not require any in-place updating of records, whereas
+tagging does. This means that using tombstones is the only way to ensure total
+immutability of the data within shards, which avoids random writes and eases
+concurrency control. The tombstone delete policy, then, is particularly
+appealing in external and concurrent contexts.
+
+\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
+the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
+of the record to be deleted, and so is more expensive. Assuming a point-lookup
+operation with cost $L(n)$, a tagged delete must search each level in the
+index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
+time.
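+
+A sketch of a tagged delete under these assumptions follows; the buffer scan
+and the per-shard point-lookups correspond to the $N_b$ and $L(n)\log_s n$
+terms, respectively, and the record and shard interfaces are hypothetical.
+
+\begin{verbatim}
+# Illustrative delete under the tagging policy: scan the mutable
+# buffer, then point-lookup each shard from the newest level down,
+# and set the delete bit on the matching record.
+def delete_tagged(key, value, buffer, levels):
+    for rec in buffer:                         # O(N_b) scan
+        if rec.key == key and rec.value == value:
+            rec.deleted = True
+            return True
+    for level in levels:
+        for shard in level:
+            rec = shard.point_lookup(key, value)   # L(n)
+            if rec is not None:
+                rec.deleted = True
+                return True
+    return False                               # record not found
+\end{verbatim}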
+
+\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
+itself, the delete policy affects the cost of determining if a given record has
+been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
+using tagging, the information necessary to make the rejection decision is
+local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
+it is not; a point-lookup must be performed to search for a given record's
+corresponding tombstone. This look-up must examine the buffer, and each shard
+within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
+L(n) \log_s n\right)$. The rejection check process for the two delete policies is
+summarized in Figure~\ref{fig:delete}.
+
+Two factors contribute to the tombstone rejection check cost: the size of the
+buffer, and the cost of performing a point-lookup against the shards. The
+latter cost can be controlled using the framework's ability to associate
+auxiliary structures with shards. For SSIs which do not support efficient
+point-lookups, a hash table can be added to map key-value pairs to their
+location within the SSI. This allows for constant-time rejection checks, even
+in situations where the index would not otherwise support them. However, the
+storage cost of this intervention is high, and in situations where the SSI does
+support efficient point-lookups, it is not necessary. Further performance
+improvements can be achieved by noting that the probability of a given record
+having an associated tombstone in any particular shard is relatively small.
+This means that many point-lookups will be executed against shards that do not
+contain the tombstone being searched for. In this case, these unnecessary
+lookups can be partially avoided using Bloom filters~\cite{bloom70} for
+tombstones. By inserting tombstones into these filters during reconstruction,
+point-lookups against some shards which do not contain the tombstone being
+searched for can be bypassed. Filters can be attached to the buffer as well,
+which may be even more significant due to the linear cost of scanning it. As
+the goal is a reduction of rejection check costs, these filters need only be
+populated with tombstones. In a later section, techniques for bounding the
+number of tombstones on a given level are discussed, which will allow for the
+memory usage of these filters to be tightly controlled while still ensuring
+precise bounds on filter error.
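+
+The tombstone rejection check, including the Bloom filter optimization, might
+be structured as in the following sketch. The filter, record, and shard
+interfaces are hypothetical; the search proceeds from the buffer downward and
+stops once the sampled record's own shard is reached, since only newer
+tombstones can apply to it.
+
+\begin{verbatim}
+# Illustrative tombstone rejection check (R(n)), assuming the record
+# was sampled from one of the shards. Tombstone-only Bloom filters
+# allow most shards, and often the buffer, to be skipped without a
+# point-lookup; a filter hit still requires a lookup, since filters
+# can report false positives.
+def is_deleted(rec, buffer, levels):
+    if buffer.filter.may_contain(rec.key, rec.value):
+        for r in buffer:                       # O(N_b) scan
+            if r.is_tombstone and r.key == rec.key and r.value == rec.value:
+                return True
+    for level in levels:
+        for shard in level:
+            if shard is rec.source:
+                return False                   # reached the record itself
+            if not shard.filter.may_contain(rec.key, rec.value):
+                continue                       # skip the point-lookup
+            if shard.tombstone_lookup(rec.key, rec.value):   # L(n)
+                return True
+    return False
+\end{verbatim}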
+
+\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
+alters the analysis of sampling costs. A record that has been deleted cannot
+be present in the sample set, and therefore the presence of each sampled record
+must be verified. If a record has been deleted, it must be rejected. When
+retrying samples rejected due to delete, the process must restart from shard
+selection, as deleted records may be counted in the weight totals used to
+construct that structure. This increases the cost of sampling to,
+\begin{equation}
+\label{eq:sampling-cost}
+    O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \mathbf{Pr}[\text{rejection}]} \cdot \left[S(n) + R(n)\right]\right)
+\end{equation}
+where $R(n)$ is the cost of checking if a sampled record has been deleted, and
+$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
+attempts required to obtain $k$ samples, given a fixed rejection probability.
+The rejection probability itself is a function of the workload, and is
+unbounded.
+
+\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
+constitute wasted memory accesses and random number generations, and so steps
+should be taken to minimize their frequency. The probability of a rejection is
+directly related to the number of deleted records, which is itself a function
+of workload and dataset. This means that, without building counter-measures
+into the framework, tight bounds on sampling performance cannot be provided in
+the presence of deleted records. It is therefore critical that the framework
+support some method for bounding the number of deleted records within the
+index.
+
+While the static nature of shards prevents the direct removal of records at the
+moment they are deleted, it doesn't prevent the removal of records during
+reconstruction. When using tagging, all tagged records encountered during
+reconstruction can be removed. When using tombstones, however, the removal
+process is non-trivial. In principle, a rejection check could be performed for
+each record encountered during reconstruction, but this would increase
+reconstruction costs and introduce a new problem of tracking tombstones
+associated with records that have been removed. Instead, a lazier approach can
+be used: delaying removal until a tombstone and its associated record
+participate in the same shard reconstruction. This delay allows both the record
+and its tombstone to be removed at the same time, an approach called
+\emph{tombstone cancellation}. In general, this can be implemented using an
+extra linear scan of the input shards before reconstruction to identify
+tombstones and associated records for cancellation, but potential optimizations
+exist for many SSIs, allowing it to be performed during the reconstruction
+itself at no extra cost.
+
+The removal of deleted records passively during reconstruction is not enough to
+bound the number of deleted records within the index. It is not difficult to
+envision pathological scenarios where deletes result in unbounded rejection
+rates, even with this mitigation in place. However, the dropping of deleted
+records does provide a useful property: any specific deleted record will
+eventually be removed from the index after a finite number of reconstructions.
+Using this fact, a bound on the number of deleted records can be enforced. A
+new parameter, $\delta$, is defined, representing the maximum proportion of
+deleted records within the index. Each level, and the buffer, tracks the number
+of deleted records it contains by counting its tagged records or tombstones.
+Following each buffer flush, the proportion of deleted records is checked
+against $\delta$. If any level is found to exceed it, then a proactive
+reconstruction is triggered, pushing its shards down into the next level. The
+process is repeated until all levels respect the bound, allowing the number of
+deleted records to be precisely controlled, which, by extension, bounds the
+rejection rate. This process is called \emph{compaction}.
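+
+A sketch of this bound-enforcement check, run after each buffer flush, is
+shown below. The level interface and the compaction helper are hypothetical;
+the compaction itself is the same shard-merging procedure used during a flush,
+with tombstone cancellation applied.
+
+\begin{verbatim}
+# Illustrative enforcement of the delete bound after a flush. Each
+# level reports its record and deleted-record counts; 'compact'
+# stands in for merging a level's shards into the next level
+# (growing the index if needed) with tombstone cancellation.
+def enforce_delete_bound(levels, delta, compact):
+    # Top-down pass: deletes pushed out of level i are re-checked
+    # when the loop reaches level i + 1.
+    i = 0
+    while i < len(levels):
+        total = levels[i].record_count()
+        if total > 0 and levels[i].deleted_count() / total > delta:
+            compact(levels, i)      # may append a new empty level
+        i += 1
+\end{verbatim}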
+
+Assuming every record is equally likely to be sampled, this new bound can be
+applied to the analysis of sampling costs. The probability of a record being
+rejected is thus bounded by $\mathbf{Pr}[\text{rejection}] \leq \delta$. Applying this bound to
+Equation~\ref{eq:sampling-cost} yields,
+\begin{equation}
+%\label{eq:sampling-cost-del}
+    O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \delta} \cdot \left[S(n) + R(n)\right]\right)
+\end{equation}
+
+Asymptotically, this proactive compaction does not alter the analysis of
+insertion costs. Each record is still written at most $s$ times on each level,
+there are at most $\log_s n$ levels, and the buffer insertion and SSI
+construction costs are unchanged. As a result, the amortized insertion cost
+remains the same.
+
+This compaction strategy is based upon tombstone and record counts, and the
+bounds assume that every record is equally likely to be sampled. For certain
+sampling problems (such as WSS), there are other conditions that must be
+considered to provide a bound on the rejection rate. To account for these
+situations in a general fashion, the framework supports problem-specific
+compaction triggers that can be tailored to the SSI being used. These allow
+compactions to be triggered based on other properties, such as rejection rate
+of a level, weight of deleted records, and the like.
+
+
+\subsection{Trade-offs on Framework Design Space}
+\label{ssec:design-space}
+The framework has several tunable parameters, allowing it to be tailored for
+specific applications. This design space contains trade-offs among three major
+performance characteristics: update cost, sampling cost, and auxiliary memory
+usage. The two most significant decisions when implementing this framework are
+the selection of the layout and delete policies. The asymptotic analysis of the
+previous sections obscures some of the differences between these policies, but
+they do have significant practical performance implications.
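+
+For reference, the tunable parameters discussed in this section can be
+gathered into a single configuration, as in the sketch below. The names and
+default values are hypothetical, chosen only to make the design space
+concrete; they are not recommendations and are not drawn from the evaluation.
+
+\begin{verbatim}
+# Hypothetical summary of the framework's tuning knobs; values are
+# placeholders, not recommendations.
+from dataclasses import dataclass
+
+@dataclass
+class FrameworkConfig:
+    buffer_capacity: int = 1000     # N_b: mutable buffer capacity
+    scale_factor: int = 6           # s: per-level growth factor
+    layout_policy: str = "tiering"  # "tiering" or "leveling"
+    delete_policy: str = "tagging"  # "tagging" or "tombstone"
+    delete_bound: float = 0.05      # delta: max deleted proportion
+    tombstone_filters: bool = True  # tombstone-only Bloom filters
+\end{verbatim}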
+
+\Paragraph{Layout Policy.} The choice of layout policy represents a clear
+trade-off between update and sampling performance. Leveling
+results in fewer shards of larger size, whereas tiering results in a larger
+number of smaller shards. As a result, leveling reduces the costs associated
+with point-lookups and sampling query preprocessing by a constant factor,
+compared to tiering. However, it results in more write amplification: a given
+record may be involved in up to $s$ reconstructions on a single level, as
+opposed to the single reconstruction per level under tiering.
+
+\Paragraph{Delete Policy.} There is a trade-off between delete performance and
+sampling performance that exists in the choice of delete policy. Tagging
+requires a point-lookup when performing a delete, which is more expensive than
+the insert required by tombstones. However, it also allows constant-time
+rejection checks, unlike tombstones which require a point-lookup of each
+sampled record. In situations where deletes are common and write-throughput is
+critical, tombstones may be more useful. Tombstones are also ideal in
+situations where immutability is required, or random writes must be avoided.
+Generally speaking, however, tagging is superior when using SSIs that support
+it, because sampling rejection checks will usually be more common than deletes.
+
+\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
+capacity and scale factor both influence the number of levels within the index,
+and by extension the number of distinct shards. Sampling and point-lookups have
+better performance with fewer shards. Smaller shards are also faster to
+reconstruct, although the same adjustments that reduce shard size also result
+in a larger number of reconstructions, so the trade-off here is less clear.
+
+The scale factor has an interesting interaction with the layout policy: when
+using leveling, the scale factor directly controls the amount of write
+amplification per level. Larger scale factors mean more time is spent
+reconstructing shards on a level, reducing update performance. Tiering does not
+have this problem and should see its update performance benefit directly from a
+larger scale factor, as this reduces the number of reconstructions.
+
+The buffer capacity also influences the number of levels, but is more
+significant in its effects on point-lookup performance: a lookup must perform a
+linear scan of the buffer. Likewise, the unstructured nature of the buffer will
+also contribute negatively to sampling performance, irrespective of which
+buffer sampling technique is used. As a result, although a large buffer will
+reduce the number of shards, it will also hurt sampling and delete (under
+tagging) performance. It is important to minimize the cost of these buffer
+scans, and so it is preferable to keep the buffer small, ideally small enough
+to fit within the CPU's L2 cache. The number of shards within the index is,
+then, better controlled by changing the scale factor, rather than the buffer
+capacity. Using a smaller buffer will result in more compactions and shard
+reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
+demonstrates that this is not a serious performance problem when a scale factor
+is chosen appropriately. When the shards are in memory, frequent small
+reconstructions do not have a significant performance penalty compared to less
+frequent, larger ones.
+
+\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
+auxiliary data structures allows for memory to be traded in exchange for
+insertion or sampling performance. The use of Bloom filters for accelerating
+tombstone rejection checks has already been discussed, but many other options
+exist. Bloom filters could also be used to accelerate point-lookups for delete
+tagging, though such filters would require much more memory than tombstone-only
+ones to be effective. An auxiliary hash table could be used for accelerating
+point-lookups, or range filters such as SuRF~\cite{zhang18} or
+Rosetta~\cite{siqiang20} could be added to accelerate pre-processing for
+range-restricted queries such as IRS or WIRS.