\section{Dynamization of SSIs} \label{sec:framework}
Our goal, then, is to design a solution to independent sampling that is able to achieve \emph{both} efficient updates and efficient sampling, while also maintaining statistical independence both within and between IQS queries, and to do so in a generalized fashion without needing to design new dynamic data structures for each problem. Given the range of SSIs already available, it seems reasonable to attempt to apply dynamization techniques to accomplish this goal. Using the Bentley-Saxe method would allow us to support inserts and deletes without requiring any modification of the SSIs. Unfortunately, as discussed in Section~\ref{ssec:background-irs}, there are problems with directly applying BSM to sampling problems. All of the considerations discussed there in the context of IRS apply equally to the other sampling problems considered in this chapter. In this section, we will discuss approaches for resolving these problems.

\subsection{Sampling over Decomposed Structures}
The core problem facing any attempt to dynamize SSIs is that independently sampling from a decomposed structure is difficult. As discussed in Section~\ref{ssec:background-irs}, accomplishing this task within the DSP model used by the Bentley-Saxe method requires drawing a full $k$ samples from each of the blocks, and then repeatedly down-sampling each of the intermediate sample sets. However, it is possible to devise a more efficient query process if we abandon the DSP model and consider a slightly more complicated procedure. First, we'll define the IQS problem in terms of the notation and concepts used in Chapter~\ref{chap:background} for search problems,
\begin{definition}[Independent Query Sampling Problem]
Given a search problem, $F$, a query sampling problem is a function of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+) \to \mathcal{R}$ where $\mathcal{D}$ is the domain of records and $\mathcal{Q}$ is the domain of query parameters of $F$. The solution to a sampling problem, $R \in \mathcal{R}$, will be a subset of records from the solution to $F$, drawn independently, such that $|R| = k$ for some $k \in \mathbb{Z}^+$.
\end{definition}
To consider the decomposability of such problems, we need to resolve a minor definitional issue. As noted before, the DSP model is based on deterministic queries. The definition does not apply to sampling queries, because it assumes that the result sets of identical queries should also be identical. For general IQS, we also need to enforce conditions on the query being sampled from. Based on these observations, we can define the decomposability conditions for a query sampling problem,
\begin{definition}[Decomposable Sampling Problem]
A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if the following conditions are met for all $q \in \mathcal{Q}, k \in \mathbb{Z}^+$,
\begin{enumerate}
\item There exists a $\Theta(C(n,k))$ time computable, associative, and commutative binary operator $\mergeop$ such that,
\begin{equation*}
X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F, B, q, k)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\item For any dataset $D \subseteq \mathcal{D}$ that has been decomposed into $m$ partitions such that $D = \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad \forall\, i,j \leq m, i \neq j$,
\begin{equation*}
F(D, q) = \bigcup_{i=1}^m F(D_i, q)
\end{equation*}
\end{enumerate}
\end{definition}
These two conditions warrant further explanation. The first condition is simply a redefinition of the standard decomposability criteria to consider matching the distribution, rather than the exact records in $R$, as the correctness condition for the merge process. The second condition concerns a necessary property of the underlying search problem being sampled from. Note that this condition is \emph{stricter} than normal decomposability for $F$, and essentially requires that the query being sampled from return a set of records, rather than an aggregate value or some other result that cannot be meaningfully sampled from. This condition is satisfied by predicate-filtering style database queries, among others.

With these definitions in mind, let's turn to solving these query sampling problems. We note that many SSIs have a sampling procedure that naturally involves two phases: first, some preliminary work is done to determine metadata concerning the set of records to sample from, and then $k$ samples are drawn from the structure, taking advantage of this metadata. If we represent the time cost of the preliminary work with $P(n)$ and the cost of drawing a sample with $S(n)$, then these structures' query cost functions are of the form,
\begin{equation*}
\mathscr{Q}(n, k) = P(n) + k S(n)
\end{equation*}
Consider an arbitrary decomposable sampling problem with a cost function of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample of $k$ records from $d \subseteq \mathcal{D}$ using an instance of an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results in $d$ being split across $m$ disjoint instances of $\mathcal{I}$ such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and $\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j) = \emptyset \quad \forall\, i, j \leq m, i \neq j$. If we consider a Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such a structure would be,
\begin{equation*}
\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
\end{equation*}
This cost function is sub-optimal for two reasons. First, we pay extra cost to merge the result sets together because of the down-sampling combination operator. Second, this formulation fails to avoid a per-sample dependence on $n$, even in the case where $S(n) \in \Theta(1)$. The situation gets even worse when considering rejections that may occur as a result of deleted records. Recall from Section~\ref{ssec:background-deletes} that deletion can be supported using weak deletes or a shadow structure in a Bentley-Saxe dynamization. Using either approach, it isn't possible to avoid deleted records in advance when sampling, and so these will need to be rejected and retried. In the DSP model, each retry must reprocess every block a second time, because a rejected sample cannot be retried in place without introducing bias into the result set. We will discuss this more in Section~\ref{ssec:sampling-deletes}.

\begin{figure}
\centering
\includegraphics[width=\textwidth]{img/sigmod23/sampling}
\caption{\textbf{Overview of the multiple-block query sampling process} for Example~\ref{ex:sample} with $k=1000$.
First, (1) the normalized weights of the shards are determined, then (2) these weights are used to construct an alias structure. Next, (3) the alias structure is queried $k$ times to determine per-shard sample sizes, and then (4) sampling is performed. Finally, (5) any rejected samples are retried starting from the alias structure, and the process is repeated until the desired number of samples has been retrieved.}
\label{fig:sample}
\end{figure}

The key insight that allowed us to solve this particular problem was that there is a mismatch between the structure of the sampling query process and the structure assumed by DSPs. Using an SSI to answer a sampling query results in a naturally two-phase process, but DSPs are assumed to be single-phase. We can construct a more effective process for answering such queries based on a multi-stage process, summarized in Figure~\ref{fig:sample}.
\begin{enumerate}
\item Perform the query pre-processing work, and determine each block's respective weight under a given query to be sampled from (e.g., the number of records falling into the query range for IRS).
\item Build a temporary alias structure over these weights.
\item Query the alias structure $k$ times to determine how many samples to draw from each block.
\item Draw the appropriate number of samples from each block and merge them together to form the final query result, using any necessary pre-processing results in the process.
\end{enumerate}
It is possible that some of the records sampled in Step 4 must be rejected, either because of deletes or some other property of the sampling procedure being used. If $r$ records are rejected, the above procedure can be repeated from Step 3, taking $k - r$ as the number of times to query the alias structure, without needing to redo any of the preprocessing steps. This can be repeated as many times as necessary until the required $k$ records have been sampled.
\begin{example}
\label{ex:sample}
Consider executing a WSS query, with $k=1000$, across three blocks containing integer keys with unit weight. $\mathscr{I}_1$ contains only the key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$ contains all integers on $[101, 200]$. These structures are shown in Figure~\ref{fig:sample}. Sampling is performed by first determining the normalized weights for each block: $w_1 = 0.005$, $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a block alias structure. The block alias structure is then queried $k$ times, resulting in a distribution of $k_i$ values that is commensurate with the relative weights of each block. Finally, each block is queried in turn to draw the appropriate number of samples.
\end{example}
Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming a constant number of repetitions, the cost of answering a decomposable sampling query having a pre-processing cost of $P(n)$, a weight-determination cost of $W(n)$, and a per-sample cost of $S(n)$ will be,
\begin{equation}
\label{eq:dsp-sample-cost}
\boxed{
\mathscr{Q}(n, k) \in \Theta \left( (P(n) + W(n)) \log_2 n + k S(n) \right)
}
\end{equation}
where the cost of building the alias structure is $\Theta(\log_2 n)$ and is thus absorbed into the pre-processing cost. For the SSIs discussed in this chapter, which have $S(n) \in \Theta(1)$, this model provides us with the desired decoupling of the data size ($n$) from the per-sample cost. Additionally, for all of the SSIs considered in this chapter, the weights can be determined in either $W(n) \in \Theta(1)$ time, or are naturally determined as part of the pre-processing, and thus the $W(n)$ term can be merged into $P(n)$.
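To make the multi-stage procedure above concrete, the following C++ sketch shows one way the multiple-block sampling loop might be implemented. The \texttt{Block} interface, with its \texttt{weight}, \texttt{sample}, and \texttt{is\_deleted} methods, is a hypothetical stand-in for a single block of the decomposed SSI, and \texttt{std::discrete\_distribution} plays the role of the temporary alias structure; this is an illustration of the procedure, not the framework's actual implementation.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical minimal interface for one block of a decomposed SSI.
struct Block {
    virtual double weight() const = 0;                   // Step 1: weight under the query
    virtual int    sample(std::mt19937 &rng) const = 0;  // Step 4: draw one record
    virtual bool   is_deleted(int record) const = 0;     // rejection check for deletes
    virtual ~Block() = default;
};

// Draw k samples across all blocks, retrying rejections from Step 3 onward.
std::vector<int> sample_query(const std::vector<Block*> &blocks, std::size_t k,
                              std::mt19937 &rng) {
    // Step 1: pre-processing: determine each block's weight under the query.
    std::vector<double> weights;
    for (const Block *b : blocks) weights.push_back(b->weight());

    // Step 2: build a temporary structure for weighted block selection
    // (an alias structure in the framework; std::discrete_distribution here).
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());

    std::vector<int> result;
    std::size_t needed = k;
    while (needed > 0) {
        // Step 3: assign the outstanding samples to blocks.
        std::vector<std::size_t> per_block(blocks.size(), 0);
        for (std::size_t i = 0; i < needed; i++) per_block[pick(rng)]++;

        // Step 4: draw the assigned number of samples from each block,
        // rejecting deleted records; rejections are retried from Step 3.
        for (std::size_t i = 0; i < blocks.size(); i++) {
            for (std::size_t j = 0; j < per_block[i]; j++) {
                int rec = blocks[i]->sample(rng);
                if (!blocks[i]->is_deleted(rec)) {
                    result.push_back(rec);
                    needed--;
                }
            }
        }
    }
    return result;
}
\end{verbatim}
Note that rejected samples are re-assigned to blocks by querying the block-selection structure again, rather than being retried against the block that produced them, matching the retry semantics described above.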
\subsection{Supporting Deletes}
As discussed in Section~\ref{ssec:background-deletes}, the Bentley-Saxe method can support deleting records through the use of either weak deletes or a secondary ghost structure, assuming certain properties are satisfied by either the search problem or the data structure. Unfortunately, neither approach works as a ``drop-in'' solution in the context of sampling problems, because of the way that deleted records interact with the sampling process itself. Sampling problems, as formalized here, are neither invertible nor deletion decomposable. In this section, we'll discuss our mechanisms for supporting deletes, as well as how these can be handled during sampling while maintaining correctness.

Because both deletion policies have their advantages in certain contexts, we decided to support both. Specifically, we propose two mechanisms for deletes:
\begin{enumerate}
\item \textbf{Tagged Deletes.} Each record in the structure includes a header containing a visibility bit. On delete, the structure is searched for the record, and the bit is set to indicate that it has been deleted. This mechanism is used to support \emph{weak deletes}.
\item \textbf{Tombstone Deletes.} On delete, a new record is inserted into the structure with a tombstone bit set in the header. This mechanism is used to support \emph{ghost structure} based deletes.
\end{enumerate}
Broadly speaking, tombstone deletes cause a number of problems for sampling problems because \emph{sampling problems are not invertible}. However, this limitation can be worked around during the query process if desired. Tagging is much more natural for these search problems. However, the flexibility of selecting either option is desirable because of their different performance characteristics.

While tagging is a fairly direct method of implementing weak deletes, tombstones are sufficiently different from the traditional ghost structure system that it is worth motivating the decision to use them here. One of the major limitations of the ghost structure approach for handling deletes is that there is not a principled method for removing deleted records from the decomposed structure. The standard approach is to set an arbitrary threshold on the number of deleted records, and rebuild the entire structure when this threshold is crossed~\cite{saxe79}. Mixing the ``ghost'' records into the same structures as the original records, by contrast, allows deleted records to be cleaned up naturally over time as they meet their tombstones during reconstructions. This is an important consequence that will be discussed in more detail in Section~\ref{ssec:sampling-delete-bounding}.

There are two relevant aspects of performance that the two mechanisms trade off between: the cost of performing the delete, and the cost of checking whether a sampled record has been deleted. In addition to these, the use of tombstones also makes supporting concurrency and external data structures far easier. This is because tombstone deletes are simple inserts, and thus they leave the individual structures immutable. Tagging requires doing in-place updates of the record header in the structures, resulting in possible race conditions and random I/O operations on disk. This makes tombstone deletes particularly attractive in these contexts.
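As a rough illustration of the two mechanisms, the following C++ sketch realizes them with a per-record header; the \texttt{RecordHeader} layout and the \texttt{point\_lookup} and \texttt{insert} operations it relies on are hypothetical stand-ins for the corresponding operations of a dynamized SSI, not the framework's actual record format.
\begin{verbatim}
#include <cstdint>

// Hypothetical per-record header carrying the visibility and tombstone bits.
struct RecordHeader {
    std::uint8_t deleted   : 1;   // set by tagged deletes
    std::uint8_t tombstone : 1;   // set on records inserted as tombstones
    std::uint8_t unused    : 6;
};

template <typename K, typename V>
struct Record {
    K key;
    V value;
    RecordHeader header;
};

// Tagged delete: locate the record within the decomposed structure and flip
// its visibility bit in place (a weak delete).
template <typename Structure, typename K, typename V>
bool tagged_delete(Structure &s, const K &key, const V &value) {
    auto *rec = s.point_lookup(key, value);   // assumed lookup over all blocks
    if (rec == nullptr) return false;
    rec->header.deleted = 1;
    return true;
}

// Tombstone delete: insert a new record with the tombstone bit set; the
// original record is only removed when the two meet during a reconstruction.
template <typename Structure, typename K, typename V>
void tombstone_delete(Structure &s, const K &key, const V &value) {
    Record<K, V> ts{key, value, {}};
    ts.header.tombstone = 1;
    s.insert(ts);
}
\end{verbatim}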
\subsubsection{Deletion Cost}
We will first consider the cost of performing a delete using either mechanism.

\Paragraph{Tombstone Deletes.} The cost of a tombstone delete in a Bentley-Saxe dynamization is the same as that of a simple insert,
\begin{equation*}
\mathscr{D}(n)_A \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
\end{equation*}
with the worst-case cost being $\Theta(B(n))$. Note that there is also a minor performance effect resulting from deleted records appearing twice within the structure, once for the original record and once for the tombstone, inflating the overall size of the structure.

\Paragraph{Tagged Deletes.} In contrast to tombstone deletes, tagged deletes are not simple inserts, and so have their own cost function. The process of deleting a record under tagging consists of first searching the entire structure for the record to be deleted, and then setting a bit in its header. As a result, the performance of this operation is a function of how expensive it is to locate an individual record within the decomposed data structure. In the theoretical literature, this lookup operation is provided by a global hash table built over every record in the structure, mapping each record to the block that contains it. Then, the data structure's weak delete operation can be applied to the relevant block~\cite{merge-dsp}. While this is certainly an option for us, we note that the SSIs we are currently considering all already support a reasonably efficient $\Theta(\log n)$ lookup operation, and so we have elected to design tagged deletes to leverage this operation when available, rather than maintaining a global hash table. If a given SSI has a point-lookup cost of $L(n)$, then a tagged delete on a Bentley-Saxe decomposition of that SSI will require, at worst, executing a point-lookup on each block, with a total cost of
\begin{equation*}
\mathscr{D}(n) \in \Theta\left( L(n) \log_2 (n)\right)
\end{equation*}
If the SSI being considered does \emph{not} support an efficient point-lookup operation, then a hash table can be used instead. We consider individual hash tables associated with each block, rather than a single global one, for simplicity of implementation and analysis. In these cases, the same procedure as above can be used, with $L(n) \in \Theta(1)$.

\begin{figure}
\centering
\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
\subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
\caption{\textbf{Overview of the rejection check procedure for deleted records.} First, a record is sampled (1). When using the tombstone delete policy (Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying the Bloom filter of the mutable buffer. The filter indicates that the record is not present, so (3) the filter on $L_0$ is queried next. This filter returns a false positive, so (4) a point-lookup is executed against $L_0$. The lookup fails to find a tombstone, so the search continues and (5) the filter on $L_1$ is checked, which reports that the tombstone is present. This time, it is not a false positive, and so (6) a lookup against $L_1$ (7) locates the tombstone. The record is thus rejected. When using the tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and (2) checked directly for the delete tag.
It is set, so the record is immediately rejected.}
\label{fig:delete}
\end{figure}

\subsubsection{Rejection Check Costs}
Because sampling queries are neither invertible nor deletion decomposable, the query process must be modified to support deletes using either of the above mechanisms. This modification requires that each sampled record be checked to confirm that it hasn't been deleted, prior to adding it to the sample set. We call the cost of this operation the \emph{rejection check cost}, $R(n)$. The process differs between the two deletion mechanisms, and the two procedures are summarized in Figure~\ref{fig:delete}.

For tagged deletes, this is a simple process. The information about the deletion status of a given record is stored directly alongside the record, within its header. So, once a record has been sampled, this check can be performed immediately in $R(n) \in \Theta(1)$ time. Tombstone deletes, however, introduce a significant difficulty in performing the rejection check. The information about whether a record has been deleted is not local to the record itself, and therefore a point-lookup is required to search for the tombstone associated with each sample. Thus, the rejection check cost when using tombstones to implement deletes over a Bentley-Saxe decomposition of an SSI is,
\begin{equation}
R(n) \in \Theta( L(n) \log_2 n)
\end{equation}
This performance cost seems catastrophically bad, considering it must be paid per sample, but there are ways to mitigate it. We will discuss these mitigations in more detail during our discussion of the implementation of these results in Section~\ref{sec:sampling-implementation}.

\subsubsection{Bounding Rejection Probability}
\label{ssec:sampling-delete-bounding}
When a sampled record has been rejected, it must be resampled. This introduces performance overhead resulting from extra memory accesses and random number generation, and hurts our ability to provide performance bounds on our sampling operations. In the worst case, a structure may consist mostly or entirely of deleted records, resulting in a potentially unbounded number of rejections during sampling. Thus, in order to maintain sampling performance bounds, the probability of a rejection during sampling must be bounded.

The reconstructions associated with Bentley-Saxe dynamization give us a natural way of controlling the number of deleted records within the structure, and thereby bounding the rejection rate. During reconstruction, we have the opportunity to remove deleted records. This will cause the record counts associated with each block of the structure to gradually drift out of alignment with the ``perfect'' powers of two associated with the Bentley-Saxe method, however. In the theoretical literature on this topic, the solution to this problem is to periodically repartition all of the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This approach could also be easily applied here, if desired, though we do not do so in our implementations, for reasons that will be discussed in Section~\ref{sec:sampling-implementation}.

The process of removing these deleted records during reconstructions is different for the two mechanisms. Tagged deletes are straightforward, because all tagged records can simply be dropped when they are involved in a reconstruction. Tombstones, however, require a slightly more complex approach.
Rather than deleted records being dropped immediately, a deleted record can only be removed when the tombstone and its associated record are involved in the \emph{same} reconstruction, at which point both can be dropped. We call this process \emph{tombstone cancellation}. In the general case, it can be implemented using a preliminary linear pass over the records involved in a reconstruction to identify the records to be dropped, but in many cases reconstruction involves sorting the records anyway, and by taking care with ordering semantics, tombstones and their associated records can be sorted into adjacent positions, allowing them to be efficiently dropped during reconstruction without any extra overhead.

While the dropping of deleted records during reconstruction helps, it is not sufficient on its own to ensure a particular bound on the number of deleted records within the structure. Pathological scenarios resulting in unbounded rejection rates, even in the presence of this mitigation, are possible. For example, tagging alone will never trigger reconstructions, and so it would be possible to delete every single record within the structure without triggering a reconstruction; alternatively, when using tombstones, records could be deleted in the reverse order that they were inserted. In either case, a passive system of dropping records naturally during reconstruction is not sufficient.

Fortunately, this passive system can be used as the basis for a system that does provide a bound. This is because it guarantees, whether tagging or tombstones are used, that any given deleted record will \emph{eventually} be cancelled out after a finite number of reconstructions. If the number of deleted records gets too high, some or all of these deleted records can be cleared out by proactively performing reconstructions. We call these proactive reconstructions \emph{compactions}.

The basic strategy, then, is to define a maximum allowable proportion of deleted records, $\delta \in [0, 1]$. Each block in the decomposition tracks the number of tombstones or tagged records within it. This count can be easily maintained by incrementing a counter when a record in the block is tagged, and by counting tombstones during reconstructions. These counts are then monitored, and if the proportion of deletes in a block ever exceeds $\delta$, a proactive reconstruction including this block and one or more blocks below it in the structure is triggered. The proportion of the newly compacted block can then be checked again, and this process repeated until all blocks respect the bound. For tagging, a single round of compaction will always suffice, because all deleted records involved in the reconstruction will be dropped. Tombstones may require multiple cascading rounds of compaction, because a tombstone record will only cancel when it encounters the record that it deletes. However, because tombstones always follow the record they delete in insertion order, and will therefore always be ``above'' that record in the structure, each reconstruction will move every tombstone involved closer to the record it deletes, ensuring that eventually the bound will be satisfied.

Asymptotically, this compaction process will not affect the amortized insertion cost of the structure. This is because the cost is based on the number of reconstructions that a given record is involved in over the lifetime of the structure. Preemptive compaction does not increase the number of reconstructions, only \emph{when} they occur.
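As an illustration, the following sketch shows how such a bound might be enforced over the blocks of the structure; the \texttt{BlockStats} bookkeeping and the \texttt{compact} callable are hypothetical placeholders for the framework's actual counters and reconstruction machinery.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Hypothetical per-block bookkeeping: total records and known deleted records
// (tagged records or tombstones) contained in the block.
struct BlockStats {
    std::size_t record_count;
    std::size_t delete_count;
};

// Does the block exceed the allowed proportion of deleted records, delta?
bool over_delete_bound(const BlockStats &b, double delta) {
    if (b.record_count == 0) return false;
    return static_cast<double>(b.delete_count) / b.record_count > delta;
}

// Walk the blocks from the top of the structure downward; whenever one violates
// the bound, proactively reconstruct it together with the block below it and
// re-check. compact(blocks, i) is a placeholder for reconstructing block i into
// block i+1; it is assumed to leave block i empty, and may cascade when
// tombstones do not meet their records in a single round.
template <typename CompactFn>
void enforce_delete_bound(std::vector<BlockStats> &blocks, double delta,
                          CompactFn compact) {
    for (std::size_t i = 0; i + 1 < blocks.size(); i++) {
        while (over_delete_bound(blocks[i], delta)) {
            compact(blocks, i);
        }
    }
}
\end{verbatim}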
\subsubsection{Sampling Procedure with Deletes}
Because sampling is neither deletion decomposable nor invertible, the presence of deletes will have an effect on the query costs. As already mentioned, the basic cost associated with deletes is a rejection check associated with each sampled record. When a record is sampled, it must be checked to determine whether it has been deleted or not. If it has, then it must be rejected. Note that when this rejection occurs, it cannot be retried immediately on the same block; rather, a new block must be selected to sample from. This is because the weight calculations do not account for deleted records, and so retrying within the same block could introduce bias. As a straightforward example of this problem, consider a block that contains only deleted records. Any sample drawn from this block will be rejected, and so retrying samples against this block will result in an infinite loop.

Assuming the compaction strategy mentioned in the previous section is applied, ensuring a bound of at most $\delta$ proportion of deleted records in the structure, and assuming all records have an equal probability of being sampled, the cost of answering sampling queries accounting for rejections is,
\begin{equation*}
%\label{eq:sampling-cost-del}
\mathscr{Q}(n, k) \in \Theta\left([W(n) + P(n)]\log_2 n + \frac{k}{1 - \delta} \left(S(n) + R(n)\right)\right)
\end{equation*}
where $\frac{k}{1 - \delta}$ is the expected number of samples that must be drawn to obtain a sample set of size $k$.

\subsection{Insertion}
\label{ssec:insert}
The framework supports inserting new records by first appending them to the end of the mutable buffer. When it is full, the buffer is flushed into a sequence of levels containing shards of increasing capacity, using a procedure determined by the layout policy as discussed in Section~\ref{sec:framework}. This method allows the cost of repeated shard reconstruction to be effectively amortized.

Let the cost of constructing the SSI from an arbitrary set of $n$ records be $C_c(n)$ and the cost of reconstructing the SSI given two or more shards containing $n$ records in total be $C_r(n)$. The cost of an insert is composed of three parts: appending to the mutable buffer, constructing a new shard from the buffered records during a flush, and the total cost of reconstructing shards containing the record over the lifetime of the index. The cost of appending to the mutable buffer is constant, and the cost of constructing a shard from the buffer can be amortized across the records participating in the buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for each record. To derive an expression for the cost of repeated reconstruction, first note that each record will participate in at most $s$ reconstructions on a given level, resulting in a worst-case amortized cost of $O\left(s\cdot \nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most $\log_s n$ levels. Thus, over the lifetime of the index a given record will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated reconstruction. Combining these results, the total amortized insertion cost is
\begin{equation}
O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
\end{equation}
This can be simplified by noting that $s$ is a constant, and that $N_b \ll n$ is also a constant.
By neglecting these terms, the amortized insertion cost of the framework is, \begin{equation} O\left(\frac{C_r(n)}{n}\log_s n\right) \end{equation} \captionsetup[subfloat]{justification=centering} \begin{figure*} \centering \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\ \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}} \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs of SSIs and auxiliary structures [A]) using the leveling (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout policies. Records are represented as black/colored squares, and grey squares represent unused capacity. An insertion requiring a multi-level reconstruction is illustrated.} \label{fig:framework} \end{figure*} \section{Framework Implementation} Our framework has been designed to work efficiently with any SSI, so long as it has the following properties. \begin{enumerate} \item The underlying full query $Q$ supported by the SSI from whose results samples are drawn satisfies the following property: for any dataset $D = \cup_{i = 1}^{n}D_i$ where $D_i \cap D_j = \emptyset$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$. \item \emph{(Optional)} The SSI supports efficient point-lookups. \item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records returned by the underlying full query. \end{enumerate} The first property applies to the query being sampled from, and is essential for the correctness of sample sets reported by extended sampling indexes.\footnote{ This condition is stricter than the definition of a decomposable search problem in the Bentley-Saxe method, which allows for \emph{any} constant-time merge operation, not just union. However, this condition is satisfied by many common types of database query, such as predicate-based filtering queries.} The latter two properties are optional, but reduce deletion and sampling costs respectively. Should the SSI fail to support point-lookups, an auxiliary hash table can be attached to the data structures. Should it fail to support query result weight reporting, rejection sampling can be used in place of the more efficient scheme discussed in Section~\ref{ssec:sample}. The analysis of this framework will generally assume that all three conditions are satisfied. Given an SSI with these properties, a dynamic extension can be produced as shown in Figure~\ref{fig:framework}. The extended index consists of disjoint shards containing an instance of the SSI being extended, and optional auxiliary data structures. The auxiliary structures allow acceleration of certain operations that are required by the framework, but which the SSI being extended does not itself support efficiently. Examples of possible auxiliary structures include hash tables, Bloom filters~\cite{bloom70}, and range filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of increasing record capacity, with either one shard, or up to a fixed maximum number of shards, per level. The decision to place one or many shards per level is called the \emph{layout policy}. The policy names are borrowed from the literature on the LSM tree, with the former called \emph{leveling} and the latter called \emph{tiering}. 
To avoid a reconstruction on every insert, an unsorted array of fixed capacity ($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is unsorted, it is kept small to maintain reasonably efficient sampling and point-lookup performance. All updates are performed by appending new records to the tail of this buffer. If a record currently within the index is to be updated to a new value, it must first be deleted, and then a record with the new value inserted. This ensures that old versions of records are properly filtered from query results.

When the buffer is full, it is flushed to make room for new records. The flushing procedure is based on the layout policy in use. When using leveling (Figure~\ref{fig:leveling}), a new SSI is constructed using both the records in $L_0$ and those in the buffer. This is used to create a new shard, which replaces the one previously in $L_0$. When using tiering (Figure~\ref{fig:tiering}), a new shard is built using only the records from the buffer, and placed into $L_0$ without altering the existing shards.

Each level $i$ has a record capacity of $N_b \cdot s^{i+1}$, controlled by a configurable parameter, $s$, called the scale factor. Records are organized in one large shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under tiering. When a level reaches its capacity, it must be emptied to make room for the records flushed into it. This is accomplished by moving its records down to the next level of the index. Under leveling, this requires constructing a new shard containing all records from both the source and target levels, and placing this shard into the target, leaving the source empty. Under tiering, the shards in the source level are combined into a single new shard that is placed into the target level. Should the target be full, it is first emptied by applying the same procedure. New empty levels are dynamically added as necessary to accommodate these reconstructions. Note that shard reconstructions are not necessarily performed using merging, though merging can be used as an optimization of the reconstruction procedure where such an algorithm exists. In general, reconstruction requires only pooling the records of the shards being combined and then applying the SSI's standard construction algorithm to this set of records.
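A simplified sketch of how this flush and reconstruction logic might look is given below; the \texttt{Shard} type and \texttt{rebuild} function are hypothetical placeholders for an SSI and its standard construction algorithm, and the sketch omits details such as deletes, auxiliary structures, and the actual record format.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Hypothetical shard: an SSI instance plus the records it was built from.
struct Shard {
    std::vector<int> records;              // stand-in for the SSI payload
    std::size_t size() const { return records.size(); }
};

// Stand-in for the SSI's standard construction algorithm: pool the records of
// the inputs and build a fresh structure over them.
Shard rebuild(const std::vector<Shard> &inputs) {
    Shard out;
    for (const auto &s : inputs)
        out.records.insert(out.records.end(), s.records.begin(), s.records.end());
    return out;
}

enum class Layout { Leveling, Tiering };
using Level = std::vector<Shard>;          // one or more shards per level

std::size_t level_records(const Level &lvl) {
    std::size_t n = 0;
    for (const auto &s : lvl) n += s.size();
    return n;
}

// Capacity of level i: N_b * s^(i+1) records.
std::size_t level_capacity(std::size_t N_b, std::size_t s, std::size_t i) {
    std::size_t cap = N_b;
    for (std::size_t j = 0; j <= i; j++) cap *= s;
    return cap;
}

// Empty level i into level i+1, recursively making room below if needed.
void push_down(std::vector<Level> &levels, std::size_t i, std::size_t N_b,
               std::size_t s, Layout policy) {
    if (i + 1 == levels.size()) levels.push_back({});      // grow the structure
    if (level_records(levels[i + 1]) + level_records(levels[i]) >
        level_capacity(N_b, s, i + 1))
        push_down(levels, i + 1, N_b, s, policy);          // make room below first
    if (policy == Layout::Leveling) {
        // Build one shard from all records in the source and target levels.
        Level inputs = levels[i + 1];
        inputs.insert(inputs.end(), levels[i].begin(), levels[i].end());
        levels[i + 1] = { rebuild(inputs) };
    } else {
        // Tiering: combine the source's shards into one new shard in the target.
        levels[i + 1].push_back(rebuild(levels[i]));
    }
    levels[i].clear();
}

// Flush a full mutable buffer into L0.
void flush_buffer(std::vector<int> &buffer, std::vector<Level> &levels,
                  std::size_t N_b, std::size_t s, Layout policy) {
    if (levels.empty()) levels.push_back({});
    if (level_records(levels[0]) + buffer.size() > level_capacity(N_b, s, 0))
        push_down(levels, 0, N_b, s, policy);              // empty L0 first
    Shard from_buffer{buffer};                             // shard built from the buffer
    if (policy == Layout::Leveling) {
        Level inputs = levels[0];
        inputs.push_back(from_buffer);
        levels[0] = { rebuild(inputs) };                   // rebuild L0 with the buffer
    } else {
        levels[0].push_back(from_buffer);                  // new shard from buffer only
    }
    buffer.clear();
}
\end{verbatim}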
\begin{table}[t]
\caption{Frequently Used Notation}
\centering
\begin{tabular}{|p{2.5cm} p{5cm}|}
\hline
\textbf{Variable} & \textbf{Description} \\ \hline
$N_b$ & Capacity of the mutable buffer \\ \hline
$s$ & Scale factor \\ \hline
$C_c(n)$ & SSI initial construction cost \\ \hline
$C_r(n)$ & SSI reconstruction cost \\ \hline
$L(n)$ & SSI point-lookup cost \\ \hline
$P(n)$ & SSI sampling pre-processing cost \\ \hline
$S(n)$ & SSI per-sample sampling cost \\ \hline
$W(n)$ & Shard weight determination cost \\ \hline
$R(n)$ & Shard rejection check cost \\ \hline
$\delta$ & Maximum delete proportion \\ \hline
%$\rho$ & Maximum rejection rate \\ \hline
\end{tabular}
\label{tab:nomen}
\end{table}

Table~\ref{tab:nomen} lists frequently used notation for the various parameters of the framework, which will be used in the coming analysis of the costs and trade-offs associated with operations within the framework's design space. The remainder of this section will discuss the performance characteristics of insertion into this structure (Section~\ref{ssec:insert}), how it can be used to correctly answer sampling queries (Section~\ref{ssec:sample}), and efficient approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will close with a detailed discussion of the trade-offs within the framework's design space (Section~\ref{ssec:design-space}).

\subsection{Trade-offs on Framework Design Space}
\label{ssec:design-space}
The framework has several tunable parameters, allowing it to be tailored for specific applications. This design space contains trade-offs among three major performance characteristics: update cost, sampling cost, and auxiliary memory usage. The two most significant decisions when implementing this framework are the selection of the layout and delete policies. The asymptotic analysis of the previous sections obscures some of the differences between these policies, but they do have significant practical performance implications.

\Paragraph{Layout Policy.} The choice of layout policy represents a clear trade-off between update and sampling performance. Leveling results in fewer shards of larger size, whereas tiering results in a larger number of smaller shards. As a result, leveling reduces the costs associated with point-lookups and sampling query preprocessing by a constant factor, compared to tiering. However, it results in more write amplification: a given record may be involved in up to $s$ reconstructions on a single level, as opposed to the single reconstruction per level under tiering.

\Paragraph{Delete Policy.} The choice of delete policy represents a trade-off between delete performance and sampling performance. Tagging requires a point-lookup when performing a delete, which is more expensive than the insert required by tombstones. However, it also allows constant-time rejection checks, unlike tombstones, which require a point-lookup for each sampled record. In situations where deletes are common and write-throughput is critical, tombstones may be more useful. Tombstones are also ideal in situations where immutability is required, or random writes must be avoided. Generally speaking, however, tagging is superior when using SSIs that support it, because sampling rejection checks will usually be more common than deletes.

\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer capacity and scale factor both influence the number of levels within the index, and by extension the number of distinct shards. Sampling and point-lookups have better performance with fewer shards. Smaller shards are also faster to reconstruct, although the same adjustments that reduce shard size also result in a larger number of reconstructions, so the trade-off here is less clear. The scale factor has an interesting interaction with the layout policy: when using leveling, the scale factor directly controls the amount of write amplification per level. Larger scale factors mean more time is spent reconstructing shards on a level, reducing update performance. Tiering does not have this problem and should see its update performance benefit directly from a larger scale factor, as this reduces the number of reconstructions. The buffer capacity also influences the number of levels, but is more significant in its effects on point-lookup performance: a lookup must perform a linear scan of the buffer.
Likewise, the unstructured nature of the buffer will also contribute negatively to sampling performance, irrespective of which buffer sampling technique is used. As a result, although a large buffer will reduce the number of shards, it will also hurt sampling and delete (under tagging) performance. It is important to minimize the cost of these buffer scans, and so it is preferable to keep the buffer small, ideally small enough to fit within the CPU's L2 cache. The number of shards within the index is, then, better controlled by changing the scale factor, rather than the buffer capacity. Using a smaller buffer will result in more compactions and shard reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp} demonstrates that this is not a serious performance problem when the scale factor is chosen appropriately. When the shards are in memory, frequent small reconstructions do not carry a significant performance penalty compared to less frequent, larger ones.

\Paragraph{Auxiliary Structures.} The framework's support for arbitrary auxiliary data structures allows memory to be traded for insertion or sampling performance. The use of Bloom filters for accelerating tombstone rejection checks has already been discussed, but many other options exist. Bloom filters could also be used to accelerate point-lookups for delete tagging, though such filters would require much more memory than tombstone-only ones to be effective. An auxiliary hash table could be used to accelerate point-lookups, or range filters such as SuRF~\cite{zhang18} or Rosetta~\cite{siqiang20} could be added to accelerate pre-processing for range-restricted queries such as IRS or WIRS.
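For illustration, the tunable parameters discussed in this section could be gathered into a single configuration structure along the following lines; the member names and default values shown here are examples chosen for exposition, not the framework's actual interface.
\begin{verbatim}
#include <cstddef>

enum class LayoutPolicy { Leveling, Tiering };
enum class DeletePolicy { Tagging, Tombstone };

// Illustrative bundle of the framework's tunable parameters. The names and
// defaults are hypothetical examples, not the framework's actual defaults.
struct FrameworkConfig {
    std::size_t  buffer_capacity   = 12000;  // N_b: small enough to stay cache-resident
    std::size_t  scale_factor      = 6;      // s: growth factor between levels
    LayoutPolicy layout            = LayoutPolicy::Tiering;   // insert-oriented choice
    DeletePolicy delete_policy     = DeletePolicy::Tagging;   // sampling-oriented choice
    double       max_delete_prop   = 0.05;   // delta: triggers proactive compaction
    bool         use_bloom_filters = true;   // auxiliary structures for rejection checks
};
\end{verbatim}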