Diffstat (limited to 'chapters')
| -rw-r--r-- | chapters/design-space.tex | 1 |
| -rw-r--r-- | chapters/future-work.tex | 176 |
| -rw-r--r-- | chapters/sigmod23/examples.tex | 236 |
| -rw-r--r-- | chapters/sigmod23/extensions.tex | 2 |
| -rw-r--r-- | chapters/sigmod23/framework.tex | 538 |
5 files changed, 435 insertions, 518 deletions
diff --git a/chapters/design-space.tex b/chapters/design-space.tex index 7ee98bd..47f728a 100644 --- a/chapters/design-space.tex +++ b/chapters/design-space.tex @@ -1 +1,2 @@ \chapter{Exploring the Design Space} +\label{chap:design-space} diff --git a/chapters/future-work.tex b/chapters/future-work.tex index d4ddd52..0c766dd 100644 --- a/chapters/future-work.tex +++ b/chapters/future-work.tex @@ -1,174 +1,2 @@ -\chapter{Proposed Work} -\label{chap:proposed} - -The previous two chapters described work already completed, however -there are a number of work that remains to be done as part of this -project. Update support is only one of the important features that an -index requires of its data structure. In this chapter, the remaining -research problems will be discussed briefly, to lay out a set of criteria -for project completion. - -\section{Concurrency Support} - -Database management systems are designed to hide the latency of -IO operations, and one of the techniques they use are being highly -concurrent. As a result, any data structure used to build a database -index must also support concurrent updates and queries. The sampling -extension framework described in Chapter~\ref{chap:sampling} had basic -concurrency support, but work is ongoing to integrate a superior system -into the framework of Chapter~\ref{chap:framework}. - -Because the framework is based on the Bentley-Saxe method, it has a number -of desirable properties for making concurrency management simpler. With -the exception of the buffer, the vast majority of the data resides in -static data structures. When using tombstones, these static structures -become fully immutable. This turns concurrency control into a resource -management problem, and suggests a simple multi-version concurrency -control scheme. Each version of the structure, defined as being the -state between two reconstructions, is tagged with an epoch number. A -query, then, will read only a single epoch, which will be preserved -in storage until all queries accessing it have terminated. Because the -mutable buffer is append-only, a consistent view of it can be obtained -by storing the tail of the log at the start of query execution. Thus, -a fixed snapshot of the index can be represented as a two-tuple containing -the epoch number and buffer tail index. - -The major limitation of the Chapter~\ref{chap:sampling} system was -the handling of buffer expansion. While the mutable buffer itself is -an unsorted array, and thus supports concurrent inserts using a simple -fetch-and-add operation, the real hurdle to insert performance is managing -reconstruction. During a reconstruction, the buffer is full and cannot -support any new inserts. Because active queries may be using the buffer, -it cannot be immediately flushed, and so inserts are blocked. Because of -this, it is necessary to use multiple buffers to sustain insertions. When -a buffer is filled, a background thread is used to perform the -reconstruction, and a new buffer is added to continue inserting while that -reconstruction occurs. In Chapter~\ref{chap:sampling}, the solution used -was limited by its restriction to only two buffers (and as a result, -a maximum of two active epochs at any point in time). Any sustained -insertion workload would quickly fill up the pair of buffers, and then -be forced to block until one of the buffers could be emptied. 
This -emptying of the buffer was contingent on \emph{both} all queries using -the buffer finishing, \emph{and} on the reconstruction using that buffer -to finish. As a result, the length of the block on inserts could be long -(multiple seconds, or even minutes for particularly large reconstructions) -and indeterminate (a given index could be involved in a very long running -query, and the buffer would be blocked until the query completed). - -Thus, a more effective concurrency solution would need to support -dynamically adding mutable buffers as needed to maintain insertion -throughput. This would allow for insertion throughput to be maintained -so long as memory for more buffer space is available.\footnote{For the -in-memory indexes considered thus far, it isn't clear that running out of -memory for buffers is a recoverable error in all cases. The system would -require the same amount of memory for storing record (technically more, -considering index overhead) in a shard as it does in the buffer. In the -case of an external storage system, the calculus would be different, -of course.} It would also ensure that a long running could only block -insertion if there is insufficient memory to create a new buffer or to -run a reconstruction. However, as the number of buffered records grows, -there is the potential for query performance to suffer, which leads to -another important aspect of an effective concurrency control scheme. - -\subsection{Tail Latency Control} - -The concurrency control scheme discussed thus far allows for maintaining -insertion throughput by allowing an unbounded portion of the new data -to remain buffered in an unsorted fashion. Over time, this buffered -data will be moved into data structures in the background, as the -system performs merges (which are moved off of the critical path for -most operations). While this system allows for fast inserts, it has the -potential to damage query performance. This is because the more buffered -data there is, the more a query must fall back on its inefficient -scan-based buffer path, as opposed to using the data structure. - -Unfortunately, reconstructions can be incredibly lengthy (recall that -the worst-case scenario involves rebuilding a static structure over -all of the records; this is, thankfully, quite rare). This implies that -it may be necessary in certain circumstances to throttle insertions to -maintain certain levels of query performance. Additionally, it may be -worth preemptively performing large reconstructions during periods of -low utilization, similar to systems like Silk designed for mitigating -tail latency spikes in LSM-tree based systems~\cite{balmau19}. - -Additionally, it is possible that large reconstructions may have a -negative effect on query performance, due to system resource utilization. -Reconstructions can use a large amount of memory bandwidth, which must -be shared by queries. The effects of parallel reconstruction on query -performance will need to be assessed, and strategies for mitigation of -this effect, be it a scheduling-based solution, or a resource-throttling -one, considered if necessary. - - -\section{Fine-Grained Online Performance Tuning} - -The framework has a large number of configurable parameters, and -introducing concurrency control will add even more. The parameter sweeps -in Section~\ref{ssec:ds-exp} show that there are trade-offs between -read and write performance across this space. 
Unfortunately, the current -framework applies this configuration parameters globally, and does not -allow them to be changed after the index is constructed. It seems apparent -that better performance might be obtained by adjusting this approach. - -First, there is nothing preventing these parameters from being configured -on a per-level basis. Having different layout policies on different -levels (for example, tiering on higher levels and leveling on lower ones), -different scale factors, etc. More index specific tuning, like controlling -memory budget for auxiliary structures, could also be considered. - -This fine-grained tuning will open up an even broader design space, -which has the benefit of improving the configurability of the system, -but the disadvantage of making configuration more difficult. Additionally, -it does nothing to address the problem of workload drift: a configuration -may be optimal now, but will it remain effective in the future as the -read/write mix of the workload changes? Both of these challenges can be -addressed using dynamic tuning. - -The theory is that the framework could be augmented with some workload -and performance statistics tracking. Based on these numbers, during -reconstruction, the framework could decide to adjust the configuration -of one or more levels in an online fashion, to lean more towards read -or write performance, or to dial back memory budgets as the system's -memory usage increases. Additionally, buffer-related parameters could -be tweaked in real time as well. If insertion throughput is high, it -might be worth it to temporarily increase the buffer size, rather than -spawning multiple smaller buffers. - -A system like this would allow for more consistent performance of the -system in the face of changing workloads, and also increase the ease -of use of the framework by removing the burden of configuration from -the user. - - -\section{Alternative Data Partitioning Schemes} - -One problem with Bentley-Saxe or LSM-tree derived systems is temporary -memory usage spikes. When performing a reconstruction, the system needs -enough storage to store the shards involved in the reconstruction, -and also the newly constructed shard. This is made worse in the face -of multi-version concurrency, where multiple older versions of shards -may be retained in memory at once. It's well known that, in the worst -case, such a system may temporarily require double its current memory -usage~\cite{dayan22}. - -One approach to addressing this problem in LSM-tree based systems is -to adjust the compaction granularity~\cite{dayan22}. In the terminology -associated with this framework, the idea is to further sub-divide each -shard into smaller chunks, partitioned based on keys. That way, when a -reconstruction is triggered, rather than reconstructing an entire shard, -these smaller partitions can be used instead. One of the partitions in -the source shard can be selected, and then merged with the partitions -in the next level down having overlapping key ranges. The amount of -memory required for reconstruction (and also reconstruction time costs) -can then be controlled by adjusting these partitions. - -Unfortunately, while this system works incredibly well for LSM-tree -based systems which store one-dimensional data in sorted arrays, it -encounters some problems in the context of a general index. It isn't -clear how to effectively partition multi-dimensional data in the same -way. 
Additionally, in the general case, each partition would need to -contain its own instance of the index, as the framework supports data -structures that don't themselves support effective partitioning in the -way that a simple sorted array would. These challenges will need to be -overcome to devise effective, general schemes for data partitioning to -address the problems of reconstruction size and memory usage. +\chapter{Future Work} +\label{chap:future} diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex index cdbc398..38df04d 100644 --- a/chapters/sigmod23/examples.tex +++ b/chapters/sigmod23/examples.tex @@ -1,143 +1,141 @@ -\section{Framework Instantiations} +\section{Applications of the Framework} \label{sec:instance} -In this section, the framework is applied to three sampling problems and their -associated SSIs. All three sampling problems draw random samples from records -satisfying a simple predicate, and so result sets for all three can be -constructed by directly merging the result sets of the queries executed against -individual shards, the primary requirement for the application of the -framework. The SSIs used for each problem are discussed, including their -support of the remaining two optional requirements for framework application. +Using the framework from the previous section, we can create dynamizations +of SSIs for various sampling problems. In this section, we consider +three different decomposable sampling problems and their associated SSIs, +discussing the necessary details of implementation to ensure they work +efficiently. -\subsection{Dynamically Extended WSS Structure} +\subsection{Weighted Set Sampling (Alias Structure)} \label{ssec:wss-struct} -As a first example of applying this framework for dynamic extension, -the alias structure for answering WSS queries is considered. This is a -static structure that can be constructed in $O(n)$ time and supports WSS -queries in $O(1)$ time. The alias structure will be used as the SSI, with -the shards containing an alias structure paired with a sorted array of -records. { The use of sorted arrays for storing the records -allows for more efficient point-lookups, without requiring any additional -space. The total weight associated with a query for -a given alias structure is the total weight of all of its records, -and can be tracked at the shard level and retrieved in constant time. } +As a first example, we will consider the alias structure~\cite{walker74} +for weighted set sampling. This is a static data structure that is +constructable in $B(n) \in \Theta(n)$ time and is capable of answering +sampling queries in $\Theta(1)$ time per sample. This structure does +\emph{not} directly support point-lookups, nor is it naturally sorted +to allow for convenient tombstone cancellation. However, the structure +itself doesn't place any requirements on the ordering of the underlying +data, and so both of these limitations can be addressed by building it +over a sorted array. -Using the formulae from Section~\ref{sec:framework}, the worst-case -costs of insertion, sampling, and deletion are easily derived. The -initial construction cost from the buffer is $C_c(N_b) \in O(N_b -\log N_b)$, requiring the sorting of the buffer followed by alias -construction. After this point, the shards can be reconstructed in -linear time while maintaining sorted order. Thus, the reconstruction -cost is $C_r(n) \in O(n)$. As each shard contains a sorted array, -the point-lookup cost is $L(n) \in O(\log n)$. 
The total weight can
-be tracked with the shard, requiring $W(n) \in O(1)$ time to access,
-and there is no necessary preprocessing, so $P(n) \in O(1)$. Samples
-can be drawn in $S(n) \in O(1)$ time. Plugging these results into the
-formulae for insertion, sampling, and deletion costs gives,
+This pre-sorting will require $B(n) \in \Theta(n \log n)$ time to
+build from the buffer; however, after this a sorted merge can be used
+to perform reconstructions from the shards themselves. As the maximum
+number of shards involved in a reconstruction using either layout policy
+is $\Theta(1)$ using our framework, this means that we can perform
+reconstructions in $B_M(n) \in \Theta(n)$ time, including tombstone
+cancellation. The total weight of the structure can also be calculated
+at no additional cost when it is constructed, allowing $W(n) \in \Theta(1)$ time
+as well. Point lookups over the sorted data can be done using a binary
+search in $L(n) \in \Theta(\log_2 n)$ time, and sampling queries require
+no pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer can be
+sampled using rejection sampling.
+
+This results in the following cost functions for the various operations
+supported by the dynamization,
\begin{align*}
-	\text{Insertion:} \quad &O\left(\log_s n\right) \\
-	\text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
-	\text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
+	\text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
+	\text{Worst-case Sampling:} \quad &\Theta\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+	\text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log n\right)
\end{align*}
-where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
+where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log n)$ for
tombstones.

-\Paragraph{Bounding Rejection Rate.} In the weighted sampling case,
-the framework's generic record-based compaction trigger mechanism
-is insufficient to bound the rejection rate. This is because the
-probability of a given record being sampled is dependent upon its
-weight, as well as the number of records in the index. If a highly
-weighted record is deleted, it will be preferentially sampled, resulting
-in a larger number of rejections than would be expected based on record
-counts alone. This problem can be rectified using the framework's user-specified
-compaction trigger mechanism.
-In addition to
-tracking record counts, each level also tracks its rejection rate,
+\Paragraph{Sampling Rejection Rate Bound.} Bounding the number of deleted
+records is not sufficient to bound the rejection rate of weighted sampling
+queries on its own, because it doesn't account for the weights of the
+records being deleted. Recall in our discussion of this bound that we
+assumed that all records had equal weights. Without this assumption, it
+is possible to construct adversarial cases where a very highly weighted
+record is deleted, resulting in it being preferentially sampled and
+rejected repeatedly.
+
+To ensure that our solution is robust even in the face of such adversarial
+workloads, for the weighted sampling case we introduce another compaction
+trigger based on the measured rejection rate of each level.
We +define the rejection rate of level $i$ as, \begin{equation*} -\rho_i = \frac{\text{rejections}}{\text{sampling attempts}} + \rho_i = \frac{\text{rejections}}{\text{sampling attempts}} \end{equation*} -A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i -> \rho$ on a level, a compaction is triggered. In the case -the tombstone delete policy, it is not the level containing the sampled -record, but rather the level containing its tombstone, that is considered -the source of the rejection. This is necessary to ensure that the tombstone -is moved closer to canceling its associated record by the compaction. +and allow the user to specify a maximum rejection rate, $\rho$. If $\rho_i +> \rho$ on a given level, then a proactive compaction is triggered. In +the case of tagged deletes, the rejection rate of a level is based on +the rejections resulting from sampling attempts on that level. This +will \emph{not} work when using tombstones, however, as compacting the +level containing the record that was rejected will not make progress +towards eliminating that record from the structure in this case. Instead, +when using tombstones, the rejection rate is tracked based on the level +containing the tombstone that caused the rejection. This ensures that the +tombstone is moved towards its associated record, and that the compaction +makes progress towards removing it. -\subsection{Dynamically Extended IRS Structure} -\label{ssec:irs-struct} -Another sampling problem to which the framework can be applied is -independent range sampling (IRS). The SSI in this example is the in-memory -ISAM tree. The ISAM tree supports efficient point-lookups - directly, and the total weight of an IRS query can be -easily obtained by counting the number of records within the query range, -which is determined as part of the preprocessing of the query. -The static nature of shards in the framework allows for an ISAM tree -to be constructed with adjacent nodes positioned contiguously in memory. -By selecting a leaf node size that is a multiple of the record size, and -avoiding placing any headers within leaf nodes, the set of leaf nodes can -be treated as a sorted array of records with direct indexing, and the -internal nodes allow for faster searching of this array. -Because of this layout, per-sample tree-traversals are avoided. The -start and end of the range from which to sample can be determined using -a pair of traversals, and then records can be sampled from this range -using random number generation and array indexing. - -Assuming a sorted set of input records, the ISAM tree can be bulk-loaded -in linear time. The insertion analysis proceeds like the WSS example -previously discussed. The initial construction cost is $C_c(N_b) \in -O(N_b \log N_b)$ and reconstruction cost is $C_r(n) \in O(n)$. The ISAM -tree supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$ -is the fanout of the tree. +\subsection{Independent Range Sampling (ISAM Tree)} +\label{ssec:irs-struct} +We will next considered independent range sampling. For this decomposable +sampling problem, we use the ISAM Tree for the SSI. Because our shards are +static, we can build highly compact and efficient ISAM trees by storing +the records directly in a sorted array. So long as the leaf node size is +a multiple of the record size, this array can be treated as a sequence of +leaf nodes in the tree, and internal nodes can be built above this using +array indices as pointers. 
These internal nodes can also be constructed +contiguously in an array, maximizing cache efficiency. -The process for performing range sampling against the ISAM tree involves -two stages. First, the tree is traversed twice: once to establish the index of -the first record greater than or equal to the lower bound of the query, -and again to find the index of the last record less than or equal to the -upper bound of the query. This process has the effect of providing the -number of records within the query range, and can be used to determine -the weight of the shard in the shard alias structure. Its cost is $P(n) -\in O(\log_f n)$. Once the bounds are established, samples can be drawn -by randomly generating uniform integers between the upper and lower bound, -in $S(n) \in O(1)$ time each. +To build this structure from the buffer requires sorting the records +first, and then performing a linear time bulk-load, and hence $B(n) +\in \Theta(n \log n)$. However, sorted-array merges can be used for +further reconstructions, meaning that $B_M(n) \in \Theta(n)$. The data +structure itself supports point lookups in $L(n)\in \Theta(\log n)$ time. +IRS queries can be answered by first using two tree traversals to identify +the minimum and maximum array indices associated with the query range +in $\Theta(\log n)$ time, and then generating array indices within this +range uniformly at random for each sample. The initial traversals can be +considered preprocessing time, so $P(n) \in \Theta(\log n)$. The weight +of the shard is simply the difference between the upper and lower indices +of the range (i.e., the number of records in the range), and so $W(n) +\in \Theta(1)$ time, and the per-sample cost is a single random number +generation, so $S(n) \in \Theta(1)$. The mutable buffer can be sampled +using rejection sampling. -This results in the extended version of the ISAM tree having the following -insert, sampling, and delete costs, +Accounting for all these costs, the time complexity of the various +operations are, \begin{align*} - \text{Insertion:} \quad &O\left(\log_s n\right) \\ - \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\ - \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right) + \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\ + \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\ + \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right) \end{align*} -where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for -tombstones. +where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$ +for tombstones and $f$ is the fanout of the ISAM Tree. -\subsection{Dynamically Extended WIRS Structure} +\subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)} + \label{ssec:wirs-struct} -As a final example of applying this framework, the WIRS problem will be -considered. Specifically, the alias-augmented B+tree approach, described -by Tao \cite{tao22}, generalizing work by Afshani and Wei \cite{afshani17}, -and Hu et al. \cite{hu14}, will be extended. -This structure allows for efficient point-lookups, as -it is based on the B+tree, and the total weight of a given WIRS query can -be calculated given the query range using aggregate weight tags within -the tree. +As a final example of applying this framework, we consider WIRS. 
This +is a decomposable sampling problem that can be answered using the +alias-augmented B+Tree structure~\cite{tao22, afshani17,hu14}. This +data structure is built over sorted data, but can be bulk-loaded from +this data in linear time, resulting in costs of $B(n) \in \Theta(n \log n)$ +and $B_M(n) \in \Theta(n)$, though the constant factors associated with +these functions are quite high, as each bulk-loading requires multiple +linear-time operations for building both the B+Tree and the alias +structures, among other things. As it is built on a B+Tree, the structure +supports $L(n) \in \Theta(\log n)$ point lookups. Answering sampling +queries requires $P(n) \in \Theta(\log n)$ pre-processing time to +establish the query interval, during which the weight of the interval +can be calculated in $W(n) \in \Theta(1)$ time using the aggregate weight +tags in the tree's internal nodes. After this, samples can be drawn in +$S(n) \in \Theta(1)$ time. -The alias-augmented B+tree is a static structure of linear space, capable -of being built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, being -bulk-loaded from sorted lists of records in $C_r(n) \in O(n)$ time, -and answering WIRS queries in $O(\log_f n + k)$ time, where the query -cost consists of preliminary work to identify the sampling range -and calculate the total weight, with $P(n) \in O(\log_f n)$ cost, and -constant-time drawing of samples from that range with $S(n) \in O(1)$. -This results in the following costs, +This all results in the following costs, \begin{align*} - \text{Insertion:} \quad &O\left(\log_s n\right) \\ - \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\ - \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right) + \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\ + \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\ + \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right) \end{align*} where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for -tombstones. Because this is a weighted sampling structure, the custom -compaction trigger discussed in in Section~\ref{ssec:wss-struct} is applied -to maintain bounded rejection rates during sampling. - +tombstones and $f$ is the fanout of the tree. This is another weighted +sampling problem, and so we also apply the same rejection rate based +compaction trigger as discussed in Section~\ref{ssec:wss-struct} for the +dynamized alias structure. diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex index 6c242e9..d8a4247 100644 --- a/chapters/sigmod23/extensions.tex +++ b/chapters/sigmod23/extensions.tex @@ -1,5 +1,5 @@ \captionsetup[subfloat]{justification=centering} -\section{Extensions} +\section{Extensions to the Framework} \label{sec:discussion} In this section, various extensions of the framework are considered. Specifically, the applicability of the framework to external or distributed diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex index c878d93..89f15c3 100644 --- a/chapters/sigmod23/framework.tex +++ b/chapters/sigmod23/framework.tex @@ -16,7 +16,32 @@ there in the context of IRS apply equally to the other sampling problems considered in this chapter. In this section, we will discuss approaches for resolving these problems. 
+
+\begin{table}[t]
+\centering
+
+\begin{tabular}{|l l|}
+	\hline
+	\textbf{Variable} & \textbf{Description} \\ \hline
+	$N_B$ & Capacity of the mutable buffer \\ \hline
+	$s$ & Scale factor \\ \hline
+	$B(n)$ & SSI construction cost from unsorted records \\ \hline
+	$B_M(n)$ & SSI reconstruction cost from existing SSI instances\\ \hline
+	$L(n)$ & SSI point-lookup cost \\ \hline
+	$P(n)$ & SSI sampling pre-processing cost \\ \hline
+	$S(n)$ & SSI per-sample sampling cost \\ \hline
+	$W(n)$ & SSI weight determination cost \\ \hline
+	$R(n)$ & Rejection check cost \\ \hline
+	$\delta$ & Maximum delete proportion \\ \hline
+\end{tabular}
+
+\caption{\textbf{Nomenclature.} A reference of variables and functions
+used in this chapter.}
+\label{tab:nomen}
+\end{table}
+
 \subsection{Sampling over Decomposed Structures}
+\label{ssec:decomposed-structure-sampling}
 
The core problem facing any attempt to dynamize SSIs is that
independently sampling from a decomposed structure is difficult. As
discussed in
@@ -266,6 +291,7 @@ contexts.
 
\subsubsection{Deletion Cost}
+\label{ssec:sampling-deletes}
 
We will first consider the cost of performing a delete using either
mechanism.
@@ -314,8 +340,8 @@ cases, the same procedure as above can be used, with $L(n) \in \Theta(1)$.
 
\begin{figure}
	\centering
-	\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
-	\subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
+	\subfloat[Tombstone Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}
+	\subfloat[Tagging Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
 
	\caption{\textbf{Overview of the rejection check procedure for deleted
	records.} First, a record is sampled (1).
@@ -456,6 +482,7 @@ the lifetime of the structure. Preemptive compaction does not increase the
number of reconstructions, only \emph{when} they occur.
 
\subsubsection{Sampling Procedure with Deletes}
+\label{ssec:sampling-with-deletes}
 
Because sampling is neither deletion decomposable nor invertible,
the presence of deletes will have an effect on the query costs. As
@@ -486,244 +513,307 @@ be taken to obtain a sample set of size $k$.
 
\subsection{Performance Tuning and Configuration}
 
-\subsubsection{LSM Tree Imports}
-\subsection{Insertion}
-\label{ssec:insert}
-The framework supports inserting new records by first appending them to the end
-of the mutable buffer. When it is full, the buffer is flushed into a sequence
-of levels containing shards of increasing capacity, using a procedure
-determined by the layout policy as discussed in Section~\ref{sec:framework}.
-This method allows for the cost of repeated shard reconstruction to be
-effectively amortized.
-
-Let the cost of constructing the SSI from an arbitrary set of $n$ records be
-$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
-containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
-of three parts: appending to the mutable buffer, constructing a new
-shard from the buffered records during a flush, and the total cost of
-reconstructing shards containing the record over the lifetime of the index.
The
-cost of appending to the mutable buffer is constant, and the cost of constructing a
-shard from the buffer can be amortized across the records participating in the
-buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
-each record. To derive an expression for the cost of repeated reconstruction,
-first note that each record will participate in at most $s$ reconstructions on
-a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
-\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
-$\log_s n$ levels. Thus, over the lifetime of the index a given record
-will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
-reconstruction.
-
-Combining these results, the total amortized insertion cost is
-\begin{equation}
-O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
-\end{equation}
-This can be simplified by noting that $s$ is constant, and that $N_b \ll n$ and also
-a constant. By neglecting these terms, the amortized insertion cost of the
-framework is,
-\begin{equation}
-O\left(\frac{C_r(n)}{n}\log_s n\right)
-\end{equation}
-\captionsetup[subfloat]{justification=centering}
+The last of the desiderata referenced earlier in this chapter for our
+dynamized sampling indices is tunable performance. The base
+Bentley-Saxe method has a highly rigid reconstruction policy that,
+while theoretically convenient, does not lend itself to performance
+tuning. However, it can be readily modified to form a more relaxed policy
+that is both tunable and generally more performant, at the cost of some
+additional theoretical complexity. There has been some theoretical work
+in this area, based upon nesting instances of the equal block method
+within the Bentley-Saxe method~\cite{overmars81}, but these methods are
+unwieldy and are targeted at tuning the worst case at the expense of the
+common case. We will take a different approach to adding configurability
+to our dynamization system.
+
+Though it has thus far gone unmentioned, readers familiar with LSM Trees
+may have noted the striking similarity between decomposition-based
+dynamization techniques and a data structure called the Log-structured
+Merge-tree. First proposed by O'Neil in the mid '90s~\cite{oneil96},
+the LSM Tree was designed to optimize write throughput for external data
+structures. It accomplished this task by buffering inserted records in a
+small in-memory AVL Tree, and then flushing this buffer to disk when
+it filled up. The flush process itself would fully rebuild the on-disk
+structure (a B+Tree), including all of the currently existing records
+on external storage. O'Neil also proposed a version which used several
+layered external structures to reduce the cost of reconstruction.
+
+In more recent times, the LSM Tree has seen significant development and
+been used as the basis for key-value stores like RocksDB~\cite{dong21}
+and LevelDB~\cite{leveldb}. This work has produced an incredibly large
+and well-explored parameterization of the reconstruction procedures of
+LSM Trees, a good summary of which can be found in a recent tutorial
+paper~\cite{sarkar23}.
Examples of this design space exploration include:
+different ways to organize each ``level'' of the tree~\cite{dayan19,
+dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
+of structures to allow finer-grained reconstruction~\cite{dayan22}, and
+approaches for allocating resources to auxiliary structures attached to
+the main ones for accelerating certain types of query~\cite{dayan18-1,
+zhu21, monkey}.
+
+Many of the elements within the LSM Tree design space are based upon the
+specifics of the data structure itself, and are not generally applicable.
+However, some of the higher-level concepts can be imported and applied in
+the context of dynamization. Specifically, we have decided to import the
+following four elements for use in our dynamization technique,
+\begin{itemize}
+	\item A small dynamic buffer into which new records are inserted
+	\item A variable growth rate, called the \emph{scale factor}
+	\item The ability to attach auxiliary structures to each block
+	\item Two different strategies for reconstructing data structures
+\end{itemize}
+This design space and its associated trade-offs will be discussed in
+more detail in Chapter~\ref{chap:design-space}, but we'll describe it
+briefly here.
+
+\Paragraph{Buffering.} In the standard Bentley-Saxe method, each
+insert triggers a reconstruction. Many of these are quite small, but
+it still makes most insertions somewhat expensive. By adding a small
+buffer, a large number of inserts can be performed without requiring
+any reconstructions at all. For generality, we elected to use an
+unsorted array as our buffer, as dynamic versions of the structures
+we are dynamizing may not exist. This introduces some query cost, as
+queries must be answered from these unsorted records as well, but in
+the case of sampling this isn't a serious problem. The implications of
+this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The
+size of this buffer, $N_B$, is a user-specified constant, and all block
+capacities are multiplied by it. In the Bentley-Saxe method, the $i$th
+block contains $2^i$ records. In our scheme, with buffering, this becomes
+$N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
+the \emph{mutable buffer}.
+
+\Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
+twice as large as the block that precedes it. There is, however, no reason
+why this growth rate couldn't be adjusted. In our system, we make the
+growth rate a user-specified constant called the \emph{scale factor},
+$s$, such that the $i$th level contains $N_B \cdot s^i$ records.
+
+\Paragraph{Auxiliary Structures.} In Section~\ref{ssec:sampling-deletes},
+we encountered two problems relating to supporting deletes that can be
+resolved through the use of auxiliary structures. First, regardless
+of whether tagging or tombstones are used, the data structure requires
+support for an efficient point-lookup operation. Many SSIs are tree-based
+and thus support this, but not all data structures do. In such cases,
+the point-lookup operation could be provided by attaching an auxiliary
+hash table to the data structure that maps records to their location in
+the SSI. We use the term \emph{shard} to refer to the combination of a
+block with these optional auxiliary structures.
+
+In addition, the tombstone deletion mechanism requires performing a point
+lookup for every record sampled, to validate that it has not been deleted.
+This introduces a large amount of overhead into the sampling process, +as this requires searching each block in the structure. One approach +that can be used to help improve the performance of these searches, +without requiring as much storage as adding auxilliary hash tables to +every block, is to include bloom filters~\cite{bloom70}. A bloom filter +is an approximate data structure that answers tests of set membership +with bounded, single-sided error. These are commonly used in LSM Trees +to accelerate point lookups by allowing levels that don't contain the +record being searched for to be skipped. In our case, we only care about +tombstone records, so rather than building these filters over all records, +we can build them over tombstones. This approach can greatly improve +the sampling performance of the structure when tombstone deletes are used. + +\Paragraph{Layout Policy.} The Bentley-Saxe method considers blocks +individually, without any other organization beyond increasing size. In +contrast, LSM Trees have multiple layers of structural organization. The +top level structure is a level, upon which record capacity restrictions +are applied. These levels are then partitioned into individual structures, +which can be further organized by key range. Because our intention is to +support general data structures, which may or may not be easily partition +by a key, we will not consider the finest grain of partitioning. However, +we can borrow the concept of levels, and lay out shards in these levels +according to different strategies. + +Specifically, we consider two layout policies. First, we can allow a +single shard per level, a policy called \emph{Leveling}. This approach +is traditionally read optimized, as it generally results in fewer shards +within the overall structure for a given scale factor. Under leveling, +the $i$th level has a capacity of $N_B \cdot s^{i+1}$ records. We can +also allow multiple shards per level, resulting in a write-optimized +policy called \emph{Tiering}. In tiering, each level can hold up to $s$ +shards, each with up to $N_B \cdot s^i$ records. Note that this doesn't +alter the overall record capacity of each level relative to leveling, +only the way the records are divided up into shards. + +\section{Practical Dynamization Framework} + +Based upon the results discussed in the previous section, we are now ready +to discuss the dynamization framework that we have produced for adding +update support to SSIs. This framework allows us to achieve all three +of our desiderata, at least for certain configurations, and provides a +wide range of performance tuning options to the user. + +\subsection{Requirements} + +The requirements that the framework places upon SSIs are rather +modest. The sampling problem being considered must be a decomposable +sampling problem (Definition \ref{def:decomp-sampling}) and the SSI must +support the \texttt{build} and \texttt{unbuild} operations. Optionally, +if the SSI supports point lookups or if the SSI can be constructed +from multiple instances of the SSI more efficiently than its normal +static construction, these two operations can be leveraged by the +framework. However, these are not requirements, as the framework provides +facilities to work around their absence. 
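To make these requirements concrete, the sketch below shows one way the expected
SSI interface could be expressed. This is a minimal illustration in Python rather
than the framework's actual API: \texttt{build} and \texttt{unbuild} correspond
to the operations named above, while the names \texttt{point\_lookup} and
\texttt{build\_from\_ssis}, along with the fallback implementations, are
assumptions made for the purposes of this example.

\begin{verbatim}
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Optional


class SSI(ABC):
    """Minimal interface a static sampling index (SSI) might expose to a
    dynamization framework of the kind described in this section."""

    @abstractmethod
    def build(self, records: Iterable[Any]) -> None:
        """Construct the static structure from a set of records (cost B(n))."""

    @abstractmethod
    def unbuild(self) -> List[Any]:
        """Return the records stored in the structure, for reconstructions."""

    # --- optional operations the framework can exploit when present ---

    def point_lookup(self, key: Any) -> Optional[Any]:
        """Locate a record in L(n) time, if supported. When it is not, the
        framework can attach an auxiliary hash table to the shard instead."""
        raise NotImplementedError

    @classmethod
    def build_from_ssis(cls, ssis: List["SSI"]) -> "SSI":
        """Construct a new instance from existing instances (cost B_M(n)),
        e.g., via a sorted merge. This default falls back on unbuild/build."""
        records: List[Any] = []
        for ssi in ssis:
            records.extend(ssi.unbuild())
        instance = cls()
        instance.build(records)
        return instance
\end{verbatim}

A shard, then, is simply an SSI instance paired with whatever optional auxiliary
structures (such as a hash table, or a Bloom filter over tombstones) the chosen
delete policy requires.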
+ +\captionsetup[subfloat]{justification=centering} \begin{figure*} \centering - \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\ - \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}} + \subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}} + \subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}} - \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A + \caption{\textbf{A graphical overview of our dynamization framework.} A mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs of SSIs and auxiliary structures [A]) using the leveling (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout policies. Records are represented as black/colored squares, and grey squares represent unused capacity. An insertion requiring a multi-level - reconstruction is illustrated.} \label{fig:framework} + reconstruction is illustrated.} \label{fig:sampling-framework} \end{figure*} -\section{Framework Implementation} - -Our framework has been designed to work efficiently with any SSI, so long -as it has the following properties. - +\subsection{Framework Construction} + +The framework itself is shown in Figure~\ref{fig:sampling-framework}, +along with some of its configuration parameters and its insert procedure +(which will be discussed in the next section). It consists of an unsorted +array of size $N_B$ records, sitting atop a sequence of \emph{levels}, +each containing SSIs according to the layout policy. If leveling +is used, each level will contain a single SSI with up to $N_B \cdot +s^{i+1}$ records. If tiering is used, each level will contain up to +$s$ SSIs, each with up to $N_B \cdot s^i$ records. The scale factor, +$s$, controls the rate at which the capacity of each level grows. The +framework supports deletes using either the tombstone or tagging policy, +which can be selected by the user acccording to her preference. To support +these delete mechanisms, each record contains an attached header with +bits to indicate its tombstone or delete status. + +\subsection{Supported Operations and Cost Functions} +\Paragraph{Insert.} Inserting a record into the dynamization involves +appending it to the mutable buffer, which requires $\Theta(1)$ time. When +the buffer reaches its capacity, it must be flushed into the structure +itself before any further records can be inserted. First, a shard will be +constructed from the records in the buffer using the SSI's \texttt{build} +operation, with $B(N_B)$ cost. This shard will then be merged into the +levels below it, which may require further reconstructions to occur to +make room. The manner in which these reconstructions proceed follows the +selection of layout policy, +\begin{itemize} +\item[\textbf{Leveling}] When a buffer flush occurs in the leveling +policy, the system scans the existing levels to find the first level +which has sufficient empty space to store the contents of the level above +it. More formally, if the number of records in level $i$ is $N_i$, then +$i$ is determined such that $N_i + N_B\cdot s^{i} <= N_B \cdot s^{i+1}$. +If no level exists that satisfies the record count constraint, then an +empty level is added and $i$ is set to the index of this new level. 
Then,
+a reconstruction is executed containing all of the records in levels $i$
+and $i - 1$ (where $i=-1$ indicates the temporary shard built from the
+buffer). Following this reconstruction, all levels $j < i$ are shifted
+by one level.
+\item[\textbf{Tiering}] When using tiering, the system will locate
+the first level, $i$, containing fewer than $s$ shards. If no such
+level exists, then a new empty level is added and $i$ is set to the
+index of that level. Then, for each level $j < i$, a reconstruction
+is performed involving all $s$ shards on that level. The resulting new
+shard will then be placed into the level at $j + 1$ and $j$ will be
+emptied. Following this, the newly created shard from the buffer will
+be appended to level $0$.
+\end{itemize}
+
+In either case, the reconstructions all use instances of the shard as
+input, and so if the SSI supports more efficient construction in this case
+(with $B_M(n)$ cost), then this routine can be used here. Once all of
+the necessary reconstructions have been performed, each level is checked
+to verify that the proportion of tombstones or deleted records is less
+than $\delta$. If this condition fails, then a proactive compaction is
+triggered. This compaction involves doing the reconstructions necessary
+to move the shard violating the delete bound down one level. Once the
+compaction is complete, the delete proportions are checked again, and
+this process is repeated until all levels satisfy the bound.
+
+Following this procedure, inserts have a worst-case cost of $I \in
+\Theta(B_M(n))$, equivalent to Bentley-Saxe. The amortized cost can be
+determined by finding the total cost of reconstructions involving each
+record and amortizing it over each insert. The cost of the insert is
+composed of three parts,
\begin{enumerate}
-	\item The underlying full query $Q$ supported by the SSI from whose results
-	samples are drawn satisfies the following property:
-	for any dataset $D = \cup_{i = 1}^{n}D_i$
-	where $D_i \cap D_j = \emptyset$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
-	\item \emph{(Optional)} The SSI supports efficient point-lookups.
-	\item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
-	returned by the underlying full query.
+\item The cost of appending to the buffer
+\item The cost of flushing the buffer to a shard
+\item The total cost of the reconstructions the record is involved
+	in over the lifetime of the structure
\end{enumerate}
+The first cost is constant and the second is $B(N_B)$. Regardless of
+layout policy, there will be $\Theta(\log_s(n))$ total levels, and
+the record will, at worst, be written a constant number of times to
+each level, resulting in a maximum of $\Theta(\log_s(n)B_M(n))$ cost
+associated with these reconstructions. Thus, the total cost associated
+with each record in the structure is,
+\begin{equation*}
+\Theta(1) + \Theta(B(N_B)) + \Theta(\log_s(n)B_M(n))
+\end{equation*}
+Assuming that $N_B \ll n$, the first two terms of this expression are
+constant. Dropping them and amortizing the result over $n$ records gives
+us the amortized insertion cost,
+\begin{equation*}
+I_a(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
+\end{equation*}
+If the SSI being considered does not support a more efficient
+construction procedure from other instances of the same SSI, and
+the general Bentley-Saxe \texttt{unbuild} and \texttt{build}
+operations must be used, then the cost becomes $I_a(n) \in
+\Theta\left(\frac{B(n)}{n}\log_s(n)\right)$ instead.
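The flush procedure described above is summarized in the following simplified
sketch. It is illustrative only: shards are modeled as sorted Python lists,
\texttt{\_build} stands in for the SSI's $B(n)$/$B_M(n)$ construction routines,
and the delete-proportion checks, proactive compactions, and auxiliary
structures of the full framework are omitted.

\begin{verbatim}
class DynamizedIndex:
    """Simplified sketch of the buffer/level structure and flush procedure."""

    def __init__(self, buffer_cap=1000, scale_factor=4, policy="tiering"):
        self.N_B = buffer_cap      # mutable buffer capacity
        self.s = scale_factor      # growth rate between levels
        self.policy = policy       # "leveling" or "tiering"
        self.buffer = []           # unsorted mutable buffer
        self.levels = []           # each level holds a list of shards

    def insert(self, record):
        self.buffer.append(record)              # O(1) append
        if len(self.buffer) >= self.N_B:
            self._flush()

    def _build(self, records):
        # stand-in for the SSI's build / merge routine (B or B_M)
        return sorted(records)

    def _level_size(self, i):
        return sum(len(shard) for shard in self.levels[i])

    def _level_cap(self, i):
        return self.N_B * self.s ** (i + 1)     # level i holds N_B * s^(i+1)

    def _flush(self):
        buffer_shard = self._build(self.buffer)
        self.buffer = []
        if self.policy == "leveling":
            self._flush_leveling(buffer_shard)
        else:
            self._flush_tiering(buffer_shard)

    def _flush_leveling(self, buffer_shard):
        # find the shallowest level i able to absorb the level above it,
        # where "level -1" denotes the shard built from the buffer
        def size_above(i):
            return len(buffer_shard) if i == 0 else self._level_size(i - 1)

        i = 0
        while i < len(self.levels) and \
                self._level_size(i) + size_above(i) > self._level_cap(i):
            i += 1
        if i == len(self.levels):
            self.levels.append([])              # grow by one empty level

        above = buffer_shard if i == 0 else self.levels[i - 1][0]
        existing = self.levels[i][0] if self.levels[i] else []
        self.levels[i] = [self._build(existing + above)]

        # shift shallower levels down by one; the buffer shard becomes level 0
        for j in range(i - 1, 0, -1):
            self.levels[j] = self.levels[j - 1]
        if i > 0:
            self.levels[0] = [buffer_shard]

    def _flush_tiering(self, buffer_shard):
        # find the shallowest level with fewer than s shards
        i = 0
        while i < len(self.levels) and len(self.levels[i]) >= self.s:
            i += 1
        if i == len(self.levels):
            self.levels.append([])

        # merge the s shards on each full level above i one level down
        for j in range(i - 1, -1, -1):
            merged = self._build([r for shard in self.levels[j] for r in shard])
            self.levels[j + 1].append(merged)
            self.levels[j] = []

        self.levels[0].append(buffer_shard)     # buffer shard lands on level 0
\end{verbatim}

Note that both policies move records strictly downward through the levels,
which is what limits each record to a constant number of rewrites per level in
the amortized analysis above.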
+ +\Paragraph{Delete.} The framework supports both tombstone and tagged +deletes, each with different performance. Using tombstones, the cost +of a delete is identical to that of an insert. When using tagging, the +cost of a delete is the same as cost of doing a point lookup, as the +"delete" itself is simply setting a bit in the header of the record, +once it has been located. There will be $\Theta(\log_s n)$ total shards +in the structure, each with a look-up cost of $L(n)$ using either the +SSI's native point-lookup, or an auxilliary hash table, and the lookup +must also scan the buffer in $\Theta(N_B)$ time. Thus, the worst-case +cost of a tagged delete is, +\begin{equation*} +D(n) = \Theta(N_B + L(n)\log_s(n)) +\end{equation*} -The first property applies to the query being sampled from, and is essential -for the correctness of sample sets reported by extended sampling -indexes.\footnote{ This condition is stricter than the definition of a -decomposable search problem in the Bentley-Saxe method, which allows for -\emph{any} constant-time merge operation, not just union. -However, this condition is satisfied by many common types of database -query, such as predicate-based filtering queries.} The latter two properties -are optional, but reduce deletion and sampling costs respectively. Should the -SSI fail to support point-lookups, an auxiliary hash table can be attached to -the data structures. -Should it fail to support query result weight reporting, rejection -sampling can be used in place of the more efficient scheme discussed in -Section~\ref{ssec:sample}. The analysis of this framework will generally -assume that all three conditions are satisfied. - -Given an SSI with these properties, a dynamic extension can be produced as -shown in Figure~\ref{fig:framework}. The extended index consists of disjoint -shards containing an instance of the SSI being extended, and optional auxiliary -data structures. The auxiliary structures allow acceleration of certain -operations that are required by the framework, but which the SSI being extended -does not itself support efficiently. Examples of possible auxiliary structures -include hash tables, Bloom filters~\cite{bloom70}, and range -filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of -increasing record capacity, with either one shard, or up to a fixed maximum -number of shards, per level. The decision to place one or many shards per level -is called the \emph{layout policy}. The policy names are borrowed from the -literature on the LSM tree, with the former called \emph{leveling} and the -latter called \emph{tiering}. - -To avoid a reconstruction on every insert, an unsorted array of fixed capacity -($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is -unsorted, it is kept small to maintain reasonably efficient sampling -and point-lookup performance. All updates are performed by appending new -records to the tail of this buffer. -If a record currently within the index is -to be updated to a new value, it must first be deleted, and then a record with -the new value inserted. This ensures that old versions of records are properly -filtered from query results. - -When the buffer is full, it is flushed to make room for new records. The -flushing procedure is based on the layout policy in use. When using leveling -(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in -$L_0$ and those in the buffer. 
This is used to create a new shard, which -replaces the one previously in $L_0$. When using tiering -(Figure~\ref{fig:tiering}) a new shard is built using only the records from the -buffer, and placed into $L_0$ without altering the existing shards. Each level -has a record capacity of $N_b \cdot s^{i+1}$, controlled by a configurable -parameter, $s$, called the scale factor. Records are organized in one large -shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under -tiering. When a level reaches its capacity, it must be emptied to make room for -the records flushed into it. This is accomplished by moving its records down to -the next level of the index. Under leveling, this requires constructing a new -shard containing all records from both the source and target levels, and -placing this shard into the target, leaving the source empty. Under tiering, -the shards in the source level are combined into a single new shard that is -placed into the target level. Should the target be full, it is first emptied by -applying the same procedure. New empty levels -are dynamically added as necessary to accommodate these reconstructions. -Note that shard reconstructions are not necessarily performed using -merging, though merging can be used as an optimization of the reconstruction -procedure where such an algorithm exists. In general, reconstruction requires -only pooling the records of the shards being combined and then applying the SSI's -standard construction algorithm to this set of records. +\Paragraph{Update.} Given the above definitions of insert and delete, +in-place updates of records can be supported by first deleting the record +to be updated, and then inserting the updated value as a new record. Thus, +the update cost is $\Theta(I(n) + D(n))$. + +\Paragraph{Sampling.} Answering sampling queries from this structure is +largely the same as was discussed for a standard Bentley-Saxe dynamization +in Section~\ref{ssec:sampling-with-deletes} with the addition of a need +to sample from the unsorted buffer as well. There are two approaches +for sampling from the buffer. The most general approach would be to +temporarily build an SSI over the records within the buffer, and then +treat this is a normal shard for the remainder of the sampling procedure. +In this case, the sampling algorithm remains indentical to the algorithm +discussed in Section~\ref{ssec:decomposed-structure-sampling}, following +the construction of the temporary shard. This results in a worst-case +sampling cost of, +\begin{equation*} + \mathscr{Q}(n, k) = \Theta\left(B(N_B) + [W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right) +\end{equation*} -\begin{table}[t] -\caption{Frequently Used Notation} -\centering +In practice, however, it is often possible to perform rejection sampling +against the buffer, without needing to do any additional work to prepare +it. In this case, the full weight of the buffer can be used to determine +how many samples to draw from it, and then these samples can be obtained +using standard rejection sampling to both control the weight, and enforce +any necessary predicates. 
Because $N_B \ll n$, this procedure will not +introduce anything more than constant overhead in the sampling process as +the probability of sampling from the buffer is quite low, and the cost of +doing so is constant, and so the overall query cost when rejection sampling +is possible is, -\begin{tabular}{|p{2.5cm} p{5cm}|} - \hline - \textbf{Variable} & \textbf{Description} \\ \hline - $N_b$ & Capacity of the mutable buffer \\ \hline - $s$ & Scale factor \\ \hline - $C_c(n)$ & SSI initial construction cost \\ \hline - $C_r(n)$ & SSI reconstruction cost \\ \hline - $L(n)$ & SSI point-lookup cost \\ \hline - $P(n)$ & SSI sampling pre-processing cost \\ \hline - $S(n)$ & SSI per-sample sampling cost \\ \hline - $W(n)$ & Shard weight determination cost \\ \hline - $R(n)$ & Shard rejection check cost \\ \hline - $\delta$ & Maximum delete proportion \\ \hline - %$\rho$ & Maximum rejection rate \\ \hline -\end{tabular} -\label{tab:nomen} - -\end{table} +\begin{equation*} + \mathscr{Q}(n, k) = \Theta\left([W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right) +\end{equation*} -Table~\ref{tab:nomen} lists frequently used notation for the various parameters -of the framework, which will be used in the coming analysis of the costs and -trade-offs associated with operations within the framework's design space. The -remainder of this section will discuss the performance characteristics of -insertion into this structure (Section~\ref{ssec:insert}), how it can be used -to correctly answer sampling queries (Section~\ref{ssec:insert}), and efficient -approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will -close with a detailed discussion of the trade-offs within the framework's -design space (Section~\ref{ssec:design-space}). - - - - -\subsection{Trade-offs on Framework Design Space} -\label{ssec:design-space} -The framework has several tunable parameters, allowing it to be tailored for -specific applications. This design space contains trade-offs among three major -performance characteristics: update cost, sampling cost, and auxiliary memory -usage. The two most significant decisions when implementing this framework are -the selection of the layout and delete policies. The asymptotic analysis of the -previous sections obscures some of the differences between these policies, but -they do have significant practical performance implications. - -\Paragraph{Layout Policy.} The choice of layout policy represents a clear -trade-off between update and sampling performance. Leveling -results in fewer shards of larger size, whereas tiering results in a larger -number of smaller shards. As a result, leveling reduces the costs associated -with point-lookups and sampling query preprocessing by a constant factor, -compared to tiering. However, it results in more write amplification: a given -record may be involved in up to $s$ reconstructions on a single level, as -opposed to the single reconstruction per level under tiering. - -\Paragraph{Delete Policy.} There is a trade-off between delete performance and -sampling performance that exists in the choice of delete policy. Tagging -requires a point-lookup when performing a delete, which is more expensive than -the insert required by tombstones. However, it also allows constant-time -rejection checks, unlike tombstones which require a point-lookup of each -sampled record. In situations where deletes are common and write-throughput is -critical, tombstones may be more useful. 
Tombstones are also ideal in
-situations where immutability is required, or random writes must be avoided.
-Generally speaking, however, tagging is superior when using SSIs that support
-it, because sampling rejection checks will usually be more common than deletes.
-
-\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
-capacity and scale factor both influence the number of levels within the index,
-and by extension the number of distinct shards. Sampling and point-lookups have
-better performance with fewer shards. Smaller shards are also faster to
-reconstruct, although the same adjustments that reduce shard size also result
-in a larger number of reconstructions, so the trade-off here is less clear.
-
-The scale factor has an interesting interaction with the layout policy: when
-using leveling, the scale factor directly controls the amount of write
-amplification per level. Larger scale factors mean more time is spent
-reconstructing shards on a level, reducing update performance. Tiering does not
-have this problem and should see its update performance benefit directly from a
-larger scale factor, as this reduces the number of reconstructions.
-
-The buffer capacity also influences the number of levels, but is more
-significant in its effects on point-lookup performance: a lookup must perform a
-linear scan of the buffer. Likewise, the unstructured nature of the buffer also
-will contribute negatively towards sampling performance, irrespective of which
-buffer sampling technique is used. As a result, although a large buffer will
-reduce the number of shards, it will also hurt sampling and delete (under
-tagging) performance. It is important to minimize the cost of these buffer
-scans, and so it is preferable to keep the buffer small, ideally small enough
-to fit within the CPU's L2 cache. The number of shards within the index is,
-then, better controlled by changing the scale factor, rather than the buffer
-capacity. Using a smaller buffer will result in more compactions and shard
-reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
-demonstrates that this is not a serious performance problem when a scale factor
-is chosen appropriately. When the shards are in memory, frequent small
-reconstructions do not have a significant performance penalty compared to less
-frequent, larger ones.
-
-\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
-auxiliary data structures allows for memory to be traded in exchange for
-insertion or sampling performance. The use of Bloom filters for accelerating
-tombstone rejection checks has already been discussed, but many other options
-exist. Bloom filters could also be used to accelerate point-lookups for delete
-tagging, though such filters would require much more memory than tombstone-only
-ones to be effective. An auxiliary hash table could be used for accelerating
-point-lookups, or range filters like SuRF \cite{zhang18} or Rosetta
-\cite{siqiang20} added to accelerate pre-processing for range queries like in
-IRS or WIRS.
+In both cases, $R(n) \in \Theta(1)$ for tagging deletes, and $R(n) \in
+\Theta(N_B + L(n) \log_s n)$ for tombstones (including the cost of searching
+the buffer for the tombstone).
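To tie the pieces of this analysis together, the sketch below outlines the
sampling procedure over the decomposed structure, including the per-shard
preprocessing, weight-proportional assignment of samples, buffer handling, and
the rejection/retry loop. The method names (\texttt{preprocess},
\texttt{weight}, \texttt{sample}, \texttt{matches}) and the
\texttt{is\_deleted} callback are hypothetical stand-ins for the $P(n)$,
$W(n)$, $S(n)$, and $R(n)$ operations; this is not the framework's actual
interface.

\begin{verbatim}
import random


def sample_query(shards, buffer, query, k, is_deleted, rng=None):
    """Draw k samples from a decomposed structure, retrying rejections."""
    rng = rng or random.Random()

    draw_fns, weights = [], []
    for shard in shards:
        state = shard.preprocess(query)      # P(n): e.g., locate the query range
        weight = shard.weight(state)         # W(n): weight of that range
        if weight > 0:
            draw_fns.append(lambda s=shard, st=state: s.sample(st))  # S(n)
            weights.append(weight)

    # the unsorted buffer is handled by rejection sampling against the query
    buffer_matches = [r for r in buffer if query.matches(r)]
    if buffer_matches:
        draw_fns.append(lambda: rng.choice(buffer_matches))
        weights.append(len(buffer_matches))  # or their total weight, if weighted

    results = []
    while len(results) < k and draw_fns:
        # assign the still-missing samples to sources in proportion to their
        # weights, then draw and apply the rejection check R(n) to each one
        for draw in rng.choices(draw_fns, weights=weights, k=k - len(results)):
            record = draw()
            if not is_deleted(record):       # delete-bit check or tombstone lookup
                results.append(record)
    return results
\end{verbatim}

The retry loop mirrors the analysis above: with the deleted proportion of each
level bounded by $\delta$, the expected number of draws needed to produce $k$
valid samples is at most $\frac{k}{1-\delta}$.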