-rw-r--r--  chapters/design-space.tex          1
-rw-r--r--  chapters/future-work.tex         176
-rw-r--r--  chapters/sigmod23/examples.tex   236
-rw-r--r--  chapters/sigmod23/extensions.tex   2
-rw-r--r--  chapters/sigmod23/framework.tex  538
-rw-r--r--  paper.tex                          2
-rw-r--r--  references/references.bib         98
7 files changed, 534 insertions, 519 deletions
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 7ee98bd..47f728a 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -1 +1,2 @@
\chapter{Exploring the Design Space}
+\label{chap:design-space}
diff --git a/chapters/future-work.tex b/chapters/future-work.tex
index d4ddd52..0c766dd 100644
--- a/chapters/future-work.tex
+++ b/chapters/future-work.tex
@@ -1,174 +1,2 @@
-\chapter{Proposed Work}
-\label{chap:proposed}
-
-The previous two chapters described work already completed, however
-there are a number of work that remains to be done as part of this
-project. Update support is only one of the important features that an
-index requires of its data structure. In this chapter, the remaining
-research problems will be discussed briefly, to lay out a set of criteria
-for project completion.
-
-\section{Concurrency Support}
-
-Database management systems are designed to hide the latency of
-IO operations, and one of the techniques they use are being highly
-concurrent. As a result, any data structure used to build a database
-index must also support concurrent updates and queries. The sampling
-extension framework described in Chapter~\ref{chap:sampling} had basic
-concurrency support, but work is ongoing to integrate a superior system
-into the framework of Chapter~\ref{chap:framework}.
-
-Because the framework is based on the Bentley-Saxe method, it has a number
-of desirable properties for making concurrency management simpler. With
-the exception of the buffer, the vast majority of the data resides in
-static data structures. When using tombstones, these static structures
-become fully immutable. This turns concurrency control into a resource
-management problem, and suggests a simple multi-version concurrency
-control scheme. Each version of the structure, defined as being the
-state between two reconstructions, is tagged with an epoch number. A
-query, then, will read only a single epoch, which will be preserved
-in storage until all queries accessing it have terminated. Because the
-mutable buffer is append-only, a consistent view of it can be obtained
-by storing the tail of the log at the start of query execution. Thus,
-a fixed snapshot of the index can be represented as a two-tuple containing
-the epoch number and buffer tail index.
-
-The major limitation of the Chapter~\ref{chap:sampling} system was
-the handling of buffer expansion. While the mutable buffer itself is
-an unsorted array, and thus supports concurrent inserts using a simple
-fetch-and-add operation, the real hurdle to insert performance is managing
-reconstruction. During a reconstruction, the buffer is full and cannot
-support any new inserts. Because active queries may be using the buffer,
-it cannot be immediately flushed, and so inserts are blocked. Because of
-this, it is necessary to use multiple buffers to sustain insertions. When
-a buffer is filled, a background thread is used to perform the
-reconstruction, and a new buffer is added to continue inserting while that
-reconstruction occurs. In Chapter~\ref{chap:sampling}, the solution used
-was limited by its restriction to only two buffers (and as a result,
-a maximum of two active epochs at any point in time). Any sustained
-insertion workload would quickly fill up the pair of buffers, and then
-be forced to block until one of the buffers could be emptied. This
-emptying of the buffer was contingent on \emph{both} all queries using
-the buffer finishing, \emph{and} on the reconstruction using that buffer
-to finish. As a result, the length of the block on inserts could be long
-(multiple seconds, or even minutes for particularly large reconstructions)
-and indeterminate (a given index could be involved in a very long running
-query, and the buffer would be blocked until the query completed).
-
-Thus, a more effective concurrency solution would need to support
-dynamically adding mutable buffers as needed to maintain insertion
-throughput. This would allow for insertion throughput to be maintained
-so long as memory for more buffer space is available.\footnote{For the
-in-memory indexes considered thus far, it isn't clear that running out of
-memory for buffers is a recoverable error in all cases. The system would
-require the same amount of memory for storing record (technically more,
-considering index overhead) in a shard as it does in the buffer. In the
-case of an external storage system, the calculus would be different,
-of course.} It would also ensure that a long running could only block
-insertion if there is insufficient memory to create a new buffer or to
-run a reconstruction. However, as the number of buffered records grows,
-there is the potential for query performance to suffer, which leads to
-another important aspect of an effective concurrency control scheme.
-
-\subsection{Tail Latency Control}
-
-The concurrency control scheme discussed thus far allows for maintaining
-insertion throughput by allowing an unbounded portion of the new data
-to remain buffered in an unsorted fashion. Over time, this buffered
-data will be moved into data structures in the background, as the
-system performs merges (which are moved off of the critical path for
-most operations). While this system allows for fast inserts, it has the
-potential to damage query performance. This is because the more buffered
-data there is, the more a query must fall back on its inefficient
-scan-based buffer path, as opposed to using the data structure.
-
-Unfortunately, reconstructions can be incredibly lengthy (recall that
-the worst-case scenario involves rebuilding a static structure over
-all of the records; this is, thankfully, quite rare). This implies that
-it may be necessary in certain circumstances to throttle insertions to
-maintain certain levels of query performance. Additionally, it may be
-worth preemptively performing large reconstructions during periods of
-low utilization, similar to systems like Silk designed for mitigating
-tail latency spikes in LSM-tree based systems~\cite{balmau19}.
-
-Additionally, it is possible that large reconstructions may have a
-negative effect on query performance, due to system resource utilization.
-Reconstructions can use a large amount of memory bandwidth, which must
-be shared by queries. The effects of parallel reconstruction on query
-performance will need to be assessed, and strategies for mitigation of
-this effect, be it a scheduling-based solution, or a resource-throttling
-one, considered if necessary.
-
-
-\section{Fine-Grained Online Performance Tuning}
-
-The framework has a large number of configurable parameters, and
-introducing concurrency control will add even more. The parameter sweeps
-in Section~\ref{ssec:ds-exp} show that there are trade-offs between
-read and write performance across this space. Unfortunately, the current
-framework applies this configuration parameters globally, and does not
-allow them to be changed after the index is constructed. It seems apparent
-that better performance might be obtained by adjusting this approach.
-
-First, there is nothing preventing these parameters from being configured
-on a per-level basis. Having different layout policies on different
-levels (for example, tiering on higher levels and leveling on lower ones),
-different scale factors, etc. More index specific tuning, like controlling
-memory budget for auxiliary structures, could also be considered.
-
-This fine-grained tuning will open up an even broader design space,
-which has the benefit of improving the configurability of the system,
-but the disadvantage of making configuration more difficult. Additionally,
-it does nothing to address the problem of workload drift: a configuration
-may be optimal now, but will it remain effective in the future as the
-read/write mix of the workload changes? Both of these challenges can be
-addressed using dynamic tuning.
-
-The theory is that the framework could be augmented with some workload
-and performance statistics tracking. Based on these numbers, during
-reconstruction, the framework could decide to adjust the configuration
-of one or more levels in an online fashion, to lean more towards read
-or write performance, or to dial back memory budgets as the system's
-memory usage increases. Additionally, buffer-related parameters could
-be tweaked in real time as well. If insertion throughput is high, it
-might be worth it to temporarily increase the buffer size, rather than
-spawning multiple smaller buffers.
-
-A system like this would allow for more consistent performance of the
-system in the face of changing workloads, and also increase the ease
-of use of the framework by removing the burden of configuration from
-the user.
-
-
-\section{Alternative Data Partitioning Schemes}
-
-One problem with Bentley-Saxe or LSM-tree derived systems is temporary
-memory usage spikes. When performing a reconstruction, the system needs
-enough storage to store the shards involved in the reconstruction,
-and also the newly constructed shard. This is made worse in the face
-of multi-version concurrency, where multiple older versions of shards
-may be retained in memory at once. It's well known that, in the worst
-case, such a system may temporarily require double its current memory
-usage~\cite{dayan22}.
-
-One approach to addressing this problem in LSM-tree based systems is
-to adjust the compaction granularity~\cite{dayan22}. In the terminology
-associated with this framework, the idea is to further sub-divide each
-shard into smaller chunks, partitioned based on keys. That way, when a
-reconstruction is triggered, rather than reconstructing an entire shard,
-these smaller partitions can be used instead. One of the partitions in
-the source shard can be selected, and then merged with the partitions
-in the next level down having overlapping key ranges. The amount of
-memory required for reconstruction (and also reconstruction time costs)
-can then be controlled by adjusting these partitions.
-
-Unfortunately, while this system works incredibly well for LSM-tree
-based systems which store one-dimensional data in sorted arrays, it
-encounters some problems in the context of a general index. It isn't
-clear how to effectively partition multi-dimensional data in the same
-way. Additionally, in the general case, each partition would need to
-contain its own instance of the index, as the framework supports data
-structures that don't themselves support effective partitioning in the
-way that a simple sorted array would. These challenges will need to be
-overcome to devise effective, general schemes for data partitioning to
-address the problems of reconstruction size and memory usage.
+\chapter{Future Work}
+\label{chap:future}
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index cdbc398..38df04d 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -1,143 +1,141 @@
-\section{Framework Instantiations}
+\section{Applications of the Framework}
\label{sec:instance}
-In this section, the framework is applied to three sampling problems and their
-associated SSIs. All three sampling problems draw random samples from records
-satisfying a simple predicate, and so result sets for all three can be
-constructed by directly merging the result sets of the queries executed against
-individual shards, the primary requirement for the application of the
-framework. The SSIs used for each problem are discussed, including their
-support of the remaining two optional requirements for framework application.
+Using the framework from the previous section, we can create dynamizations
+of SSIs for various sampling problems. In this section, we consider
+three different decomposable sampling problems and their associated SSIs,
+discussing the implementation details necessary to ensure that they
+work efficiently.
-\subsection{Dynamically Extended WSS Structure}
+\subsection{Weighted Set Sampling (Alias Structure)}
\label{ssec:wss-struct}
-As a first example of applying this framework for dynamic extension,
-the alias structure for answering WSS queries is considered. This is a
-static structure that can be constructed in $O(n)$ time and supports WSS
-queries in $O(1)$ time. The alias structure will be used as the SSI, with
-the shards containing an alias structure paired with a sorted array of
-records. { The use of sorted arrays for storing the records
-allows for more efficient point-lookups, without requiring any additional
-space. The total weight associated with a query for
-a given alias structure is the total weight of all of its records,
-and can be tracked at the shard level and retrieved in constant time. }
+As a first example, we will consider the alias structure~\cite{walker74}
+for weighted set sampling. This is a static data structure that is
+constructable in $B(n) \in \Theta(n)$ time and is capable of answering
+sampling queries in $\Theta(1)$ time per sample. This structure does
+\emph{not} directly support point-lookups, nor is it naturally sorted
+to allow for convenient tombstone cancellation. However, the structure
+itself doesn't place any requirements on the ordering of the underlying
+data, and so both of these limitations can be addressed by building it
+over a sorted array.
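+
+For concreteness, the alias construction and sampling routines are
+sketched below in C++. This is a minimal illustration rather than the
+framework's actual implementation: weights are assumed to be in-memory
+\texttt{double}s, and the sorted array of records that accompanies the
+structure in each shard is omitted.
+\begin{verbatim}
+#include <cstddef>
+#include <random>
+#include <vector>
+
+// Minimal alias structure: Theta(n) construction, Theta(1) per sample.
+struct AliasStructure {
+    std::vector<double> prob;   // acceptance probability of each bucket
+    std::vector<size_t> alias;  // alternate index for each bucket
+
+    explicit AliasStructure(const std::vector<double> &weights) {
+        size_t n = weights.size();
+        prob.resize(n);
+        alias.resize(n);
+
+        double total = 0;
+        for (double w : weights) total += w;
+
+        // Scale weights so that the average bucket has probability 1.
+        std::vector<double> scaled(n);
+        std::vector<size_t> small, large;
+        for (size_t i = 0; i < n; i++) {
+            scaled[i] = weights[i] * n / total;
+            (scaled[i] < 1.0 ? small : large).push_back(i);
+        }
+
+        // Pair each under-full bucket with an over-full one.
+        while (!small.empty() && !large.empty()) {
+            size_t s = small.back(); small.pop_back();
+            size_t l = large.back(); large.pop_back();
+            prob[s] = scaled[s];
+            alias[s] = l;
+            scaled[l] -= (1.0 - scaled[s]);
+            (scaled[l] < 1.0 ? small : large).push_back(l);
+        }
+        for (size_t i : small) prob[i] = 1.0;
+        for (size_t i : large) prob[i] = 1.0;
+    }
+
+    // Draw one index in O(1): pick a bucket, then flip a biased coin.
+    template <typename RNG> size_t sample(RNG &rng) const {
+        std::uniform_int_distribution<size_t> bucket(0, prob.size() - 1);
+        std::uniform_real_distribution<double> coin(0.0, 1.0);
+        size_t i = bucket(rng);
+        return coin(rng) < prob[i] ? i : alias[i];
+    }
+};
+\end{verbatim}
+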
-Using the formulae from Section~\ref{sec:framework}, the worst-case
-costs of insertion, sampling, and deletion are easily derived. The
-initial construction cost from the buffer is $C_c(N_b) \in O(N_b
-\log N_b)$, requiring the sorting of the buffer followed by alias
-construction. After this point, the shards can be reconstructed in
-linear time while maintaining sorted order. Thus, the reconstruction
-cost is $C_r(n) \in O(n)$. As each shard contains a sorted array,
-the point-lookup cost is $L(n) \in O(\log n)$. The total weight can
-be tracked with the shard, requiring $W(n) \in O(1)$ time to access,
-and there is no necessary preprocessing, so $P(n) \in O(1)$. Samples
-can be drawn in $S(n) \in O(1)$ time. Plugging these results into the
-formulae for insertion, sampling, and deletion costs gives,
+Pre-sorting the records means that building from the buffer requires
+$B(n) \in \Theta(n \log n)$ time; however, after this, a sorted merge can
+be used to perform reconstructions from the shards themselves. Because
+the maximum number of shards involved in a reconstruction under either
+layout policy is $\Theta(1)$ in our framework, reconstructions can be
+performed in $B_M(n) \in \Theta(n)$ time, including tombstone
+cancellation. The total weight of the structure can be calculated at no
+extra cost during construction, allowing it to be retrieved in $W(n) \in
+\Theta(1)$ time. Point lookups over the sorted data can be done using a
+binary search in $L(n) \in \Theta(\log_2 n)$ time, and sampling queries
+require no pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer
+can be sampled using rejection sampling.
+
+This results in the following cost functions for the various operations
+supported by the dynamization,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &\Theta\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log n\right)
\end{align*}
-where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
+where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log n)$ for
tombstones.
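+
+To see where these figures come from: the per-shard setup cost $W(n) +
+P(n)$ is paid once for each of the $O(\log n)$ shards in the structure,
+each sample costs $S(n)$ plus a rejection check $R(n)$, and the
+$\frac{1}{1-\delta}$ factor accounts for retries caused by deleted
+records. With $W(n)$, $P(n)$, and $S(n)$ all constant for the alias
+structure, the sampling cost collapses to the form above. Likewise,
+substituting $B_M(n) \in \Theta(n)$ into the amortized insertion cost
+$\Theta\left(\frac{B_M(n)}{n}\log_s n\right)$ gives $\Theta(\log_s n)$,
+and substituting $L(n) \in \Theta(\log n)$ into the tagged delete cost
+$\Theta(N_B + L(n)\log_s n)$ gives $\Theta(\log_s n \log n)$ for a
+constant-sized buffer.
+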
-\Paragraph{Bounding Rejection Rate.} In the weighted sampling case,
-the framework's generic record-based compaction trigger mechanism
-is insufficient to bound the rejection rate. This is because the
-probability of a given record being sampling is dependent upon its
-weight, as well as the number of records in the index. If a highly
-weighted record is deleted, it will be preferentially sampled, resulting
-in a larger number of rejections than would be expected based on record
-counts alone. This problem can be rectified using the framework's user-specified
-compaction trigger mechanism.
-In addition to
-tracking record counts, each level also tracks its rejection rate,
+\Paragraph{Sampling Rejection Rate Bound.} Bounding the number of deleted
+records is not, on its own, sufficient to bound the rejection rate of
+weighted sampling queries, because it does not account for the weights
+of the records being deleted. Recall that, in our discussion of this
+bound, we assumed all records had equal weights. Without this assumption, it
+is possible to construct adversarial cases where a very highly weighted
+record is deleted, resulting in it being preferentially sampled and
+rejected repeatedly.
+
+To ensure that our solution is robust even in the face of such adversarial
+workloads, for the weighted sampling case we introduce another compaction
+trigger based on the measured rejection rate of each level. We
+define the rejection rate of level $i$ as,
\begin{equation*}
-\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
+ \rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
-A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i
-> \rho$ on a level, a compaction is triggered. In the case
-the tombstone delete policy, it is not the level containing the sampled
-record, but rather the level containing its tombstone, that is considered
-the source of the rejection. This is necessary to ensure that the tombstone
-is moved closer to canceling its associated record by the compaction.
+and allow the user to specify a maximum rejection rate, $\rho$. If $\rho_i
+> \rho$ on a given level, then a proactive compaction is triggered. In
+the case of tagged deletes, the rejection rate of a level is based on
+the rejections resulting from sampling attempts on that level. This
+will \emph{not} work when using tombstones, however, as compacting the
+level containing the record that was rejected will not make progress
+towards eliminating that record from the structure in this case. Instead,
+when using tombstones, the rejection rate is tracked based on the level
+containing the tombstone that caused the rejection. This ensures that the
+tombstone is moved towards its associated record, and that the compaction
+makes progress towards removing it.
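+
+The bookkeeping required by this trigger is small: each level needs only
+two counters. The C++ sketch below illustrates the idea; the
+\texttt{Level} type and helper functions are hypothetical stand-ins
+rather than the framework's actual interface.
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+// Hypothetical per-level bookkeeping for the rejection-rate trigger.
+struct Level {
+    size_t sampling_attempts = 0;
+    size_t rejections = 0;
+
+    double rejection_rate() const {
+        return sampling_attempts == 0
+                   ? 0.0
+                   : static_cast<double>(rejections) / sampling_attempts;
+    }
+};
+
+// Called once per sampling attempt against a given level.
+void record_attempt(std::vector<Level> &levels, size_t level) {
+    levels[level].sampling_attempts++;
+}
+
+// Charge a rejection to the level responsible for it: under tagging,
+// the level of the rejected record; under tombstones, the level holding
+// the tombstone that caused the rejection.
+void record_rejection(std::vector<Level> &levels, size_t source_level) {
+    levels[source_level].rejections++;
+}
+
+// After a sampling pass, any level whose measured rejection rate exceeds
+// the user-specified bound rho is scheduled for proactive compaction.
+std::vector<size_t> levels_to_compact(const std::vector<Level> &levels,
+                                      double rho) {
+    std::vector<size_t> out;
+    for (size_t i = 0; i < levels.size(); i++) {
+        if (levels[i].rejection_rate() > rho) out.push_back(i);
+    }
+    return out;
+}
+\end{verbatim}
+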
-\subsection{Dynamically Extended IRS Structure}
-\label{ssec:irs-struct}
-Another sampling problem to which the framework can be applied is
-independent range sampling (IRS). The SSI in this example is the in-memory
-ISAM tree. The ISAM tree supports efficient point-lookups
- directly, and the total weight of an IRS query can be
-easily obtained by counting the number of records within the query range,
-which is determined as part of the preprocessing of the query.
-The static nature of shards in the framework allows for an ISAM tree
-to be constructed with adjacent nodes positioned contiguously in memory.
-By selecting a leaf node size that is a multiple of the record size, and
-avoiding placing any headers within leaf nodes, the set of leaf nodes can
-be treated as a sorted array of records with direct indexing, and the
-internal nodes allow for faster searching of this array.
-Because of this layout, per-sample tree-traversals are avoided. The
-start and end of the range from which to sample can be determined using
-a pair of traversals, and then records can be sampled from this range
-using random number generation and array indexing.
-
-Assuming a sorted set of input records, the ISAM tree can be bulk-loaded
-in linear time. The insertion analysis proceeds like the WSS example
-previously discussed. The initial construction cost is $C_c(N_b) \in
-O(N_b \log N_b)$ and reconstruction cost is $C_r(n) \in O(n)$. The ISAM
-tree supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$
-is the fanout of the tree.
+\subsection{Independent Range Sampling (ISAM Tree)}
+\label{ssec:irs-struct}
+We will next consider independent range sampling. For this decomposable
+sampling problem, we use the ISAM Tree for the SSI. Because our shards are
+static, we can build highly compact and efficient ISAM trees by storing
+the records directly in a sorted array. So long as the leaf node size is
+a multiple of the record size, this array can be treated as a sequence of
+leaf nodes in the tree, and internal nodes can be built above this using
+array indices as pointers. These internal nodes can also be constructed
+contiguously in an array, maximizing cache efficiency.
-The process for performing range sampling against the ISAM tree involves
-two stages. First, the tree is traversed twice: once to establish the index of
-the first record greater than or equal to the lower bound of the query,
-and again to find the index of the last record less than or equal to the
-upper bound of the query. This process has the effect of providing the
-number of records within the query range, and can be used to determine
-the weight of the shard in the shard alias structure. Its cost is $P(n)
-\in O(\log_f n)$. Once the bounds are established, samples can be drawn
-by randomly generating uniform integers between the upper and lower bound,
-in $S(n) \in O(1)$ time each.
+Building this structure from the buffer requires sorting the records
+and then performing a linear-time bulk load, hence $B(n) \in \Theta(n
+\log n)$. However, sorted-array merges can be used for further
+reconstructions, meaning that $B_M(n) \in \Theta(n)$. The data
+structure itself supports point lookups in $L(n)\in \Theta(\log n)$ time.
+IRS queries can be answered by first using two tree traversals to identify
+the minimum and maximum array indices associated with the query range
+in $\Theta(\log n)$ time, and then generating array indices within this
+range uniformly at random for each sample. The initial traversals can be
+considered preprocessing, so $P(n) \in \Theta(\log n)$. The weight
+of the shard is simply the difference between the upper and lower indices
+of the range (i.e., the number of records in the range), so $W(n)
+\in \Theta(1)$, and the per-sample cost is a single random number
+generation, so $S(n) \in \Theta(1)$. The mutable buffer can be sampled
+using rejection sampling.
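+
+Once the traversals have fixed the bounds of the query range, each sample
+reduces to generating a random array offset. The C++ sketch below
+illustrates this; binary searches stand in for the ISAM tree traversals,
+and records are reduced to bare integer keys.
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <random>
+#include <vector>
+
+// Draw k samples, independently and uniformly, from the records of a
+// sorted array falling within [low, high].
+std::vector<int64_t> range_sample(const std::vector<int64_t> &keys,
+                                  int64_t low, int64_t high, size_t k,
+                                  std::mt19937_64 &rng) {
+    // Preprocessing, P(n) in Theta(log n): locate the query bounds.
+    auto begin = std::lower_bound(keys.begin(), keys.end(), low);
+    auto end = std::upper_bound(keys.begin(), keys.end(), high);
+
+    std::vector<int64_t> samples;
+    if (begin == end) return samples;  // empty range: weight zero
+
+    // W(n) in Theta(1): the weight is the number of records in range.
+    size_t lo_idx = begin - keys.begin();
+    size_t hi_idx = (end - keys.begin()) - 1;
+    std::uniform_int_distribution<size_t> dist(lo_idx, hi_idx);
+
+    // S(n) in Theta(1) per sample: a single random index generation.
+    for (size_t i = 0; i < k; i++) {
+        samples.push_back(keys[dist(rng)]);
+    }
+    return samples;
+}
+\end{verbatim}
+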
-This results in the extended version of the ISAM tree having the following
-insert, sampling, and delete costs,
+Accounting for all of these costs, the time complexities of the various
+operations are,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
-where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
-tombstones.
+where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
+for tombstones and $f$ is the fanout of the ISAM Tree.
-\subsection{Dynamically Extended WIRS Structure}
+\subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
+
\label{ssec:wirs-struct}
-As a final example of applying this framework, the WIRS problem will be
-considered. Specifically, the alias-augmented B+tree approach, described
-by Tao \cite{tao22}, generalizing work by Afshani and Wei \cite{afshani17},
-and Hu et al. \cite{hu14}, will be extended.
-This structure allows for efficient point-lookups, as
-it is based on the B+tree, and the total weight of a given WIRS query can
-be calculated given the query range using aggregate weight tags within
-the tree.
+As a final example of applying this framework, we consider WIRS. This
+is a decomposable sampling problem that can be answered using the
+alias-augmented B+Tree structure~\cite{tao22, afshani17, hu14}. This
+data structure is built over sorted data, but can be bulk-loaded from
+this data in linear time, resulting in costs of $B(n) \in \Theta(n \log n)$
+and $B_M(n) \in \Theta(n)$, though the constant factors associated with
+these functions are quite high, as each bulk load requires multiple
+linear-time passes to build both the B+Tree and the alias
+structures, among other things. As it is built on a B+Tree, the structure
+supports $L(n) \in \Theta(\log n)$ point lookups. Answering sampling
+queries requires $P(n) \in \Theta(\log n)$ pre-processing time to
+establish the query interval, during which the weight of the interval
+can be calculated in $W(n) \in \Theta(1)$ time using the aggregate weight
+tags in the tree's internal nodes. After this, samples can be drawn in
+$S(n) \in \Theta(1)$ time.
-The alias-augmented B+tree is a static structure of linear space, capable
-of being built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, being
-bulk-loaded from sorted lists of records in $C_r(n) \in O(n)$ time,
-and answering WIRS queries in $O(\log_f n + k)$ time, where the query
-cost consists of preliminary work to identify the sampling range
-and calculate the total weight, with $P(n) \in O(\log_f n)$ cost, and
-constant-time drawing of samples from that range with $S(n) \in O(1)$.
-This results in the following costs,
+This all results in the following costs,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
-tombstones. Because this is a weighted sampling structure, the custom
-compaction trigger discussed in in Section~\ref{ssec:wss-struct} is applied
-to maintain bounded rejection rates during sampling.
-
+tombstones and $f$ is the fanout of the tree. This is another weighted
+sampling problem, and so we also apply the same rejection-rate-based
+compaction trigger as discussed in Section~\ref{ssec:wss-struct} for the
+dynamized alias structure.
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 6c242e9..d8a4247 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -1,5 +1,5 @@
\captionsetup[subfloat]{justification=centering}
-\section{Extensions}
+\section{Extensions to the Framework}
\label{sec:discussion}
In this section, various extensions of the framework are considered.
Specifically, the applicability of the framework to external or distributed
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index c878d93..89f15c3 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -16,7 +16,32 @@ there in the context of IRS apply equally to the other sampling problems
considered in this chapter. In this section, we will discuss approaches
for resolving these problems.
+
+\begin{table}[t]
+\centering
+
+\begin{tabular}{|l l|}
+ \hline
+ \textbf{Variable} & \textbf{Description} \\ \hline
+ $N_B$ & Capacity of the mutable buffer \\ \hline
+ $s$ & Scale factor \\ \hline
+ $B(n)$ & SSI construction cost from unsorted records \\ \hline
+ $B_M(n)$ & SSI reconstruction cost from existing SSI instances \\ \hline
+ $L(n)$ & SSI point-lookup cost \\ \hline
+ $P(n)$ & SSI sampling pre-processing cost \\ \hline
+ $S(n)$ & SSI per-sample sampling cost \\ \hline
+ $W(n)$ & SSI weight determination cost \\ \hline
+ $R(n)$ & Rejection check cost \\ \hline
+ $\delta$ & Maximum delete proportion \\ \hline
+\end{tabular}
+\caption{\textbf{Nomenclature.} A reference of variables and functions
+used in this chapter.}
+\label{tab:nomen}
+
+\end{table}
+
\subsection{Sampling over Decomposed Structures}
+\label{ssec:decomposed-structure-sampling}
The core problem facing any attempt to dynamize SSIs is that independently
sampling from a decomposed structure is difficult. As discussed in
@@ -266,6 +291,7 @@ contexts.
\subsubsection{Deletion Cost}
+\label{ssec:sampling-deletes}
We will first consider the cost of performing a delete using either
mechanism.
@@ -314,8 +340,8 @@ cases, the same procedure as above can be used, with $L(n) \in \Theta(1)$.
\begin{figure}
\centering
- \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
- \subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
+ \subfloat[Tombstone Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}
+ \subfloat[Tagging Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
\caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
a record is sampled (1).
@@ -456,6 +482,7 @@ the lifetime of the structure. Preemptive compaction does not increase
the number of reconstructions, only \emph{when} they occur.
\subsubsection{Sampling Procedure with Deletes}
+\label{ssec:sampling-with-deletes}
Because sampling is neither deletion decomposable nor invertible,
the presence of deletes will have an effect on the query costs. As
@@ -486,244 +513,307 @@ be taken to obtain a sample set of size $k$.
\subsection{Performance Tuning and Configuration}
-\subsubsection{LSM Tree Imports}
-\subsection{Insertion}
-\label{ssec:insert}
-The framework supports inserting new records by first appending them to the end
-of the mutable buffer. When it is full, the buffer is flushed into a sequence
-of levels containing shards of increasing capacity, using a procedure
-determined by the layout policy as discussed in Section~\ref{sec:framework}.
-This method allows for the cost of repeated shard reconstruction to be
-effectively amortized.
-
-Let the cost of constructing the SSI from an arbitrary set of $n$ records be
-$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
-containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
-of three parts: appending to the mutable buffer, constructing a new
-shard from the buffered records during a flush, and the total cost of
-reconstructing shards containing the record over the lifetime of the index. The
-cost of appending to the mutable buffer is constant, and the cost of constructing a
-shard from the buffer can be amortized across the records participating in the
-buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
-each record. To derive an expression for the cost of repeated reconstruction,
-first note that each record will participate in at most $s$ reconstructions on
-a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
-\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
-$\log_s n$ levels. Thus, over the lifetime of the index a given record
-will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
-reconstruction.
-
-Combining these results, the total amortized insertion cost is
-\begin{equation}
-O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
-\end{equation}
-This can be simplified by noting that $s$ is constant, and that $N_b \ll n$ and also
-a constant. By neglecting these terms, the amortized insertion cost of the
-framework is,
-\begin{equation}
-O\left(\frac{C_r(n)}{n}\log_s n\right)
-\end{equation}
-\captionsetup[subfloat]{justification=centering}
+The last of the desiderata referenced earlier in this chapter for our
+dynamized sampling indices is tunable performance. The base
+Bentley-Saxe method has a highly rigid reconstruction policy that,
+while theoretically convenient, does not lend itself to performance
+tuning. However, it can be readily modified into a more relaxed policy
+that is both tunable and generally more performant, at the cost of some
+additional theoretical complexity. There has been some theoretical work
+in this area, based upon nesting instances of the equal block method
+within the Bentley-Saxe method~\cite{overmars81}, but these methods are
+unwieldy and are targeted at tuning the worst case at the expense of the
+common case. We will take a different approach to adding configurability
+to our dynamization system.
+
+Though it has thus far gone unmentioned, readers familiar with LSM Trees
+may have noted the striking similarity between decomposition-based
+dynamization techniques and the Log-structured Merge-tree (LSM Tree).
+First proposed by O'Neil in the mid '90s~\cite{oneil96},
+the LSM Tree was designed to optimize write throughput for external data
+structures. It accomplished this by buffering inserted records in a
+small in-memory AVL Tree, and then flushing this buffer to disk when
+it filled up. The flush process itself would fully rebuild the on-disk
+structure (a B+Tree), including all of the currently existing records
+on external storage. O'Neil also proposed a version which used several
+layered external structures to reduce the cost of reconstruction.
+
+In more recent times, the LSM Tree has seen significant development and
+been used as the basis for key-value stores like RocksDB~\cite{dong21}
+and LevelDB~\cite{leveldb}. This work has produced an incredibly large
+and well-explored parameterization of the reconstruction procedures of
+LSM Trees, a good summary of which can be found in a recent tutorial
+paper~\cite{sarkar23}. Examples of this design space exploration include
+different ways to organize each ``level'' of the tree~\cite{dayan19,
+dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
+of structures to allow finer-grained reconstruction~\cite{dayan22}, and
+approaches for allocating resources to auxiliary structures attached to
+the main ones for accelerating certain types of query~\cite{dayan18-1,
+zhu21, monkey}.
+
+Many of the elements within the LSM Tree design space are based upon the
+specifics of the data structure itself, and are not generally applicable.
+However, some of the higher-level concepts can be imported and applied in
+the context of dynamization. Specifically, we have decided to import the
+following four elements for use in our dynamization technique,
+\begin{itemize}
+ \item A small dynamic buffer into which new records are inserted
+ \item A variable growth rate, called the \emph{scale factor}
+ \item The ability to attach auxiliary structures to each block
+ \item Two different strategies for reconstructing data structures
+\end{itemize}
+This design space and its associated trade-offs will be discussed in
+more detail in Chapter~\ref{chap:design-space}, but we'll describe it
+briefly here.
+
+\Paragraph{Buffering.} In the standard Bentley-Saxe method, each
+insert triggers a reconstruction. Many of these reconstructions are
+quite small, but they still make most insertions somewhat expensive. By adding a small
+buffer, a large number of inserts can be performed without requiring
+any reconstructions at all. For generality, we elected to use an
+unsorted array as our buffer, as dynamic versions of the structures
+we are dynamizing may not exist. This introduces some query cost, as
+queries must be answered from these unsorted records as well, but in
+the case of sampling this isn't a serious problem. The implications of
+this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The
+size of this buffer, $N_B$, is a user-specified constant, and all block
+capacities are multiplied by it. In the Bentley-Saxe method, the $i$th
+block contains $2^i$ records. In our scheme, with buffering, this becomes
+$N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
+the \emph{mutable buffer}.
+
+\Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
+twice as large as the block that precedes it. There is, however, no reason
+why this growth rate couldn't be adjusted. In our system, we make the
+growth rate a user-specified constant called the \emph{scale factor},
+$s$, such that the $i$th level contains $N_B \cdot s^i$ records.
+
+\Paragraph{Auxiliary Structures.} In Section~\ref{ssec:sampling-deletes},
+we encountered two problems relating to supporting deletes that can be
+resolved through the use of auxiliary structures. First, regardless
+of whether tagging or tombstones are used, the data structure requires
+support for an efficient point-lookup operation. Many SSIs are tree-based
+and thus support this, but not all data structures do. In such cases,
+the point-lookup operation can be provided by attaching an auxiliary
+hash table to the data structure that maps records to their locations in
+the SSI. We use the term \emph{shard} to refer to the combination of a
+block with these optional auxiliary structures.
+
+In addition, the tombstone deletion mechanism requires performing a point
+lookup for every record sampled, to validate that it has not been deleted.
+This introduces a large amount of overhead into the sampling process,
+as this requires searching each block in the structure. One approach
+that can be used to help improve the performance of these searches,
+without requiring as much storage as adding auxiliary hash tables to
+every block, is to include Bloom filters~\cite{bloom70}. A Bloom filter
+is an approximate data structure that answers tests of set membership
+with bounded, single-sided error. These are commonly used in LSM Trees
+to accelerate point lookups by allowing levels that don't contain the
+record being searched for to be skipped. In our case, we only care about
+tombstone records, so rather than building these filters over all records,
+we can build them over tombstones. This approach can greatly improve
+the sampling performance of the structure when tombstone deletes are used.
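+
+The C++ sketch below shows the shape of such a filter and how it is
+consulted during a rejection check. It uses a deliberately simple
+double-hashing Bloom filter over integer keys, not the framework's
+actual implementation.
+\begin{verbatim}
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <vector>
+
+// A simple Bloom filter built over the keys of a shard's tombstones.
+struct TombstoneFilter {
+    std::vector<bool> bits;
+    size_t num_hashes;
+
+    TombstoneFilter(size_t num_bits, size_t num_hashes)
+        : bits(num_bits, false), num_hashes(num_hashes) {}
+
+    void insert(uint64_t key) {
+        for (size_t i = 0; i < num_hashes; i++) bits[probe(key, i)] = true;
+    }
+
+    // May return true for keys never inserted (a false positive), but
+    // never returns false for a key that was inserted.
+    bool may_contain(uint64_t key) const {
+        for (size_t i = 0; i < num_hashes; i++)
+            if (!bits[probe(key, i)]) return false;
+        return true;
+    }
+
+private:
+    size_t probe(uint64_t key, size_t i) const {
+        uint64_t h1 = std::hash<uint64_t>{}(key);
+        uint64_t h2 = std::hash<uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL);
+        return (h1 + i * h2) % bits.size();
+    }
+};
+
+// During a rejection check, a shard whose filter reports "no tombstone
+// for this key" can be skipped without performing a point lookup.
+\end{verbatim}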
+
+\Paragraph{Layout Policy.} The Bentley-Saxe method considers blocks
+individually, without any other organization beyond increasing size. In
+contrast, LSM Trees have multiple layers of structural organization. The
+top level structure is a level, upon which record capacity restrictions
+are applied. These levels are then partitioned into individual structures,
+which can be further organized by key range. Because our intention is to
+support general data structures, which may or may not be easily partition
+by a key, we will not consider the finest grain of partitioning. However,
+we can borrow the concept of levels, and lay out shards in these levels
+according to different strategies.
+
+Specifically, we consider two layout policies. First, we can allow a
+single shard per level, a policy called \emph{Leveling}. This approach
+is traditionally read optimized, as it generally results in fewer shards
+within the overall structure for a given scale factor. Under leveling,
+the $i$th level has a capacity of $N_B \cdot s^{i+1}$ records. We can
+also allow multiple shards per level, resulting in a write-optimized
+policy called \emph{Tiering}. In tiering, each level can hold up to $s$
+shards, each with up to $N_B \cdot s^i$ records. Note that this doesn't
+alter the overall record capacity of each level relative to leveling,
+only the way the records are divided up into shards.
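+
+As a concrete illustration, consider a (hypothetical) configuration with
+$N_B = 1000$ and $s = 4$. Under leveling, level $0$ holds a single shard
+of up to $N_B \cdot s^{1} = 4000$ records, level $1$ a single shard of up
+to $16000$ records, and level $2$ a single shard of up to $64000$ records.
+Under tiering, level $0$ instead holds up to $4$ shards of $1000$ records
+each, level $1$ up to $4$ shards of $4000$ records each, and so on. Each
+level's total record capacity is the same under either policy, but a
+query must examine up to $s$ shards per level under tiering.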
+
+\section{Practical Dynamization Framework}
+
+Based upon the results discussed in the previous section, we are now ready
+to discuss the dynamization framework that we have produced for adding
+update support to SSIs. This framework allows us to achieve all three
+of our desiderata, at least for certain configurations, and provides a
+wide range of performance tuning options to the user.
+
+\subsection{Requirements}
+
+The requirements that the framework places upon SSIs are rather
+modest. The sampling problem being considered must be a decomposable
+sampling problem (Definition \ref{def:decomp-sampling}) and the SSI must
+support the \texttt{build} and \texttt{unbuild} operations. Optionally,
+if the SSI supports point lookups or if the SSI can be constructed
+from multiple instances of the SSI more efficiently than its normal
+static construction, these two operations can be leveraged by the
+framework. However, these are not requirements, as the framework provides
+facilities to work around their absence.
+
+\captionsetup[subfloat]{justification=centering}
\begin{figure*}
\centering
- \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
- \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
+ \subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}
+ \subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
- \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
+ \caption{\textbf{A graphical overview of our dynamization framework.} A
mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
of SSIs and auxiliary structures [A]) using the leveling
(Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
policies. Records are represented as black/colored squares, and grey
squares represent unused capacity. An insertion requiring a multi-level
- reconstruction is illustrated.} \label{fig:framework}
+ reconstruction is illustrated.} \label{fig:sampling-framework}
\end{figure*}
-\section{Framework Implementation}
-
-Our framework has been designed to work efficiently with any SSI, so long
-as it has the following properties.
-
+\subsection{Framework Construction}
+
+The framework itself is shown in Figure~\ref{fig:sampling-framework},
+along with some of its configuration parameters and its insert procedure
+(which will be discussed in the next section). It consists of an unsorted
+array holding up to $N_B$ records, sitting atop a sequence of \emph{levels},
+each containing SSIs according to the layout policy. If leveling
+is used, the $i$th level will contain a single SSI with up to $N_B \cdot
+s^{i+1}$ records. If tiering is used, the $i$th level will contain up to
+$s$ SSIs, each with up to $N_B \cdot s^i$ records. The scale factor,
+$s$, controls the rate at which the capacity of each level grows. The
+framework supports deletes using either the tombstone or tagging policy,
+which can be selected by the user according to her preference. To support
+these delete mechanisms, each record contains an attached header with
+bits to indicate its tombstone or delete status.
+
+\subsection{Supported Operations and Cost Functions}
+\Paragraph{Insert.} Inserting a record into the dynamization involves
+appending it to the mutable buffer, which requires $\Theta(1)$ time. When
+the buffer reaches its capacity, it must be flushed into the structure
+itself before any further records can be inserted. First, a shard will be
+constructed from the records in the buffer using the SSI's \texttt{build}
+operation, with $B(N_B)$ cost. This shard will then be merged into the
+levels below it, which may require further reconstructions to occur to
+make room. The manner in which these reconstructions proceed depends upon
+the selected layout policy,
+\begin{itemize}
+\item[\textbf{Leveling}] When a buffer flush occurs in the leveling
+policy, the system scans the existing levels to find the first level
+which has sufficient empty space to store the contents of the level above
+it. More formally, if the number of records in level $i$ is $N_i$, then
+$i$ is determined such that $N_i + N_B\cdot s^{i} \leq N_B \cdot s^{i+1}$.
+If no level exists that satisfies the record count constraint, then an
+empty level is added and $i$ is set to the index of this new level. Then,
+a reconstruction is executed containing all of the records in levels $i$
+and $i - 1$ (where level $-1$ denotes the temporary shard built from the
+buffer). Following this reconstruction, all levels $j < i$ are shifted
+by one level.
+\item[\textbf{Tiering}] When using tiering, the system will locate
+the first level, $i$, containing fewer than $s$ shards. If no such
+level exists, then a new empty level is added and $i$ is set to the
+index of that level. Then, for each level $j < i$, a reconstruction
+is performed involving all $s$ shards on that level. The resulting new
+shard will then be placed into the level at $j + 1$ and $j$ will be
+emptied. Following this, the newly created shard from the buffer will
+be appended to level $0$ (this procedure is sketched in code below).
+\end{itemize}
+
+In either case, the reconstructions take existing shards as input,
+and so if the SSI supports a more efficient construction routine for this
+case (with $B_M(n)$ cost), it can be used here. Once all of
+the necessary reconstructions have been performed, each level is checked
+to verify that the proportion of tombstones or deleted records is less
+than $\delta$. If this condition fails, then a proactive compaction is
+triggered. This compaction involves doing the reconstructions necessary
+to move the shard violating the delete bound down one level. Once the
+compaction is complete, the delete proportions are checked again, and
+this process is repeated until all levels satisfy the bound.
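+
+To make the flush procedure concrete, the tiering variant is sketched
+below in C++. Shards are reduced to flat vectors of keys and
+\texttt{build} to a pool-and-sort, purely for illustration; in the
+framework, the reconstruction routine is whatever the SSI provides.
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <utility>
+#include <vector>
+
+using Shard = std::vector<int64_t>;   // stand-in for an SSI shard
+using Level = std::vector<Shard>;     // a level holds up to s shards
+
+// Stand-in for the SSI reconstruction routine (B_M(n)): pool the input
+// shards' records and rebuild. A real SSI may use a cheaper merge.
+Shard build(const std::vector<Shard> &inputs) {
+    Shard out;
+    for (const auto &shard : inputs)
+        out.insert(out.end(), shard.begin(), shard.end());
+    std::sort(out.begin(), out.end());
+    return out;
+}
+
+// Tiering flush: find the first level with fewer than s shards (adding
+// a new empty level if necessary), cascade reconstructions downward,
+// and finally place the shard built from the buffer into level 0.
+void flush_tiering(std::vector<Level> &levels, Shard buffer_shard, size_t s) {
+    size_t target = 0;
+    while (target < levels.size() && levels[target].size() >= s) target++;
+    if (target == levels.size()) levels.emplace_back();
+
+    // Merge each full level above the target into a single shard and
+    // push it down one level, emptying the source.
+    for (size_t j = target; j-- > 0;) {
+        levels[j + 1].push_back(build(levels[j]));
+        levels[j].clear();
+    }
+
+    levels[0].push_back(std::move(buffer_shard));
+}
+\end{verbatim}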
+
+Following this procedure, inserts have a worst-case cost of $I(n) \in
+\Theta(B_M(n))$, equivalent to Bentley-Saxe. The amortized cost can be
+determined by finding the total cost of reconstructions involving each
+record and amortizing it over each insert. The cost of the insert is
+composed of three parts,
\begin{enumerate}
- \item The underlying full query $Q$ supported by the SSI from whose results
- samples are drawn satisfies the following property:
- for any dataset $D = \cup_{i = 1}^{n}D_i$
- where $D_i \cap D_j = \emptyset$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
- \item \emph{(Optional)} The SSI supports efficient point-lookups.
- \item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
- returned by the underlying full query.
+\item The cost of appending to the buffer
+\item The cost of flushing the buffer to a shard
+\item The total cost of the reconstructions the record is involved
+ in over the lifetime of the structure
\end{enumerate}
+The first cost is constant and the second is $B(N_B)$. Regardless of
+layout policy, there will be $\Theta(\log_s(n))$ total levels, and
+the record will, at worst, be written a constant number of times to
+each level, resulting in a maximum of $\Theta(\log_s(n)B_M(n))$ cost
+associated with these reconstructions. Thus, the total cost associated
+with each record in the structure is,
+\begin{equation*}
+\Theta(1) + \Theta(B(N_B)) + \Theta(\log_s(n)B_M(n))
+\end{equation*}
+Assuming that $N_B \ll n$, the first two terms of this expression are
+constant. Dropping them and amortizing the result over $n$ records gives
+us the amortized insertion cost,
+\begin{equation*}
+I_a(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
+\end{equation*}
+If the SSI being considered does not support a more efficient
+construction procedure from other instances of the same SSI, and
+the general Bentley-Saxe \texttt{unbuild} and \texttt{build}
+operations must be used, then the cost becomes $I_a(n) \in
+\Theta\left(\frac{B(n)}{n}\log_s(n)\right)$ instead.
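+
+For the SSIs considered in this chapter, $B_M(n) \in \Theta(n)$ (a shard
+can be rebuilt from other shards using a sorted merge followed by a
+linear-time bulk load), and so the amortized insertion cost works out to
+$I_a(n) \in \Theta(\log_s n)$. If only the general $B(n) \in \Theta(n
+\log n)$ construction were available, it would instead be $\Theta(\log n
+\log_s n)$.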
+
+\Paragraph{Delete.} The framework supports both tombstone and tagged
+deletes, each with different performance. Using tombstones, the cost
+of a delete is identical to that of an insert. When using tagging, the
+cost of a delete is the same as the cost of a point lookup, as the
+``delete'' itself simply sets a bit in the header of the record,
+once it has been located. There will be $\Theta(\log_s n)$ total shards
+in the structure, each with a look-up cost of $L(n)$ using either the
+SSI's native point-lookup or an auxiliary hash table, and the lookup
+must also scan the buffer in $\Theta(N_B)$ time. Thus, the worst-case
+cost of a tagged delete is,
+\begin{equation*}
+D(n) = \Theta(N_B + L(n)\log_s(n))
+\end{equation*}
-The first property applies to the query being sampled from, and is essential
-for the correctness of sample sets reported by extended sampling
-indexes.\footnote{ This condition is stricter than the definition of a
-decomposable search problem in the Bentley-Saxe method, which allows for
-\emph{any} constant-time merge operation, not just union.
-However, this condition is satisfied by many common types of database
-query, such as predicate-based filtering queries.} The latter two properties
-are optional, but reduce deletion and sampling costs respectively. Should the
-SSI fail to support point-lookups, an auxiliary hash table can be attached to
-the data structures.
-Should it fail to support query result weight reporting, rejection
-sampling can be used in place of the more efficient scheme discussed in
-Section~\ref{ssec:sample}. The analysis of this framework will generally
-assume that all three conditions are satisfied.
-
-Given an SSI with these properties, a dynamic extension can be produced as
-shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
-shards containing an instance of the SSI being extended, and optional auxiliary
-data structures. The auxiliary structures allow acceleration of certain
-operations that are required by the framework, but which the SSI being extended
-does not itself support efficiently. Examples of possible auxiliary structures
-include hash tables, Bloom filters~\cite{bloom70}, and range
-filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
-increasing record capacity, with either one shard, or up to a fixed maximum
-number of shards, per level. The decision to place one or many shards per level
-is called the \emph{layout policy}. The policy names are borrowed from the
-literature on the LSM tree, with the former called \emph{leveling} and the
-latter called \emph{tiering}.
-
-To avoid a reconstruction on every insert, an unsorted array of fixed capacity
-($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
-unsorted, it is kept small to maintain reasonably efficient sampling
-and point-lookup performance. All updates are performed by appending new
-records to the tail of this buffer.
-If a record currently within the index is
-to be updated to a new value, it must first be deleted, and then a record with
-the new value inserted. This ensures that old versions of records are properly
-filtered from query results.
-
-When the buffer is full, it is flushed to make room for new records. The
-flushing procedure is based on the layout policy in use. When using leveling
-(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
-$L_0$ and those in the buffer. This is used to create a new shard, which
-replaces the one previously in $L_0$. When using tiering
-(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
-buffer, and placed into $L_0$ without altering the existing shards. Each level
-has a record capacity of $N_b \cdot s^{i+1}$, controlled by a configurable
-parameter, $s$, called the scale factor. Records are organized in one large
-shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under
-tiering. When a level reaches its capacity, it must be emptied to make room for
-the records flushed into it. This is accomplished by moving its records down to
-the next level of the index. Under leveling, this requires constructing a new
-shard containing all records from both the source and target levels, and
-placing this shard into the target, leaving the source empty. Under tiering,
-the shards in the source level are combined into a single new shard that is
-placed into the target level. Should the target be full, it is first emptied by
-applying the same procedure. New empty levels
-are dynamically added as necessary to accommodate these reconstructions.
-Note that shard reconstructions are not necessarily performed using
-merging, though merging can be used as an optimization of the reconstruction
-procedure where such an algorithm exists. In general, reconstruction requires
-only pooling the records of the shards being combined and then applying the SSI's
-standard construction algorithm to this set of records.
+\Paragraph{Update.} Given the above definitions of insert and delete,
+in-place updates of records can be supported by first deleting the record
+to be updated, and then inserting the updated value as a new record. Thus,
+the update cost is $\Theta(I(n) + D(n))$.
+
+\Paragraph{Sampling.} Answering sampling queries from this structure is
+largely the same as was discussed for a standard Bentley-Saxe dynamization
+in Section~\ref{ssec:sampling-with-deletes} with the addition of a need
+to sample from the unsorted buffer as well. There are two approaches
+for sampling from the buffer. The most general approach would be to
+temporarily build an SSI over the records within the buffer, and then
+treat this as a normal shard for the remainder of the sampling procedure.
+In this case, the sampling algorithm remains identical to the algorithm
+discussed in Section~\ref{ssec:decomposed-structure-sampling}, following
+the construction of the temporary shard. This results in a worst-case
+sampling cost of,
+\begin{equation*}
+ \mathscr{Q}(n, k) = \Theta\left(B(N_B) + [W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
+\end{equation*}
-\begin{table}[t]
-\caption{Frequently Used Notation}
-\centering
+In practice, however, it is often possible to perform rejection sampling
+against the buffer, without needing to do any additional work to prepare
+it. In this case, the full weight of the buffer can be used to determine
+how many samples to draw from it, and then these samples can be obtained
+using standard rejection sampling to both control the weight, and enforce
+any necessary predicates. Because $N_B \ll n$, this procedure introduces
+no more than constant overhead into the sampling process: the probability
+of sampling from the buffer is quite low, and the cost of doing so is
+constant. The overall query cost when rejection sampling is possible
+is therefore,
-\begin{tabular}{|p{2.5cm} p{5cm}|}
- \hline
- \textbf{Variable} & \textbf{Description} \\ \hline
- $N_b$ & Capacity of the mutable buffer \\ \hline
- $s$ & Scale factor \\ \hline
- $C_c(n)$ & SSI initial construction cost \\ \hline
- $C_r(n)$ & SSI reconstruction cost \\ \hline
- $L(n)$ & SSI point-lookup cost \\ \hline
- $P(n)$ & SSI sampling pre-processing cost \\ \hline
- $S(n)$ & SSI per-sample sampling cost \\ \hline
- $W(n)$ & Shard weight determination cost \\ \hline
- $R(n)$ & Shard rejection check cost \\ \hline
- $\delta$ & Maximum delete proportion \\ \hline
- %$\rho$ & Maximum rejection rate \\ \hline
-\end{tabular}
-\label{tab:nomen}
-
-\end{table}
+\begin{equation*}
+ \mathscr{Q}(n, k) = \Theta\left([W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
+\end{equation*}
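+
+The per-attempt logic of this buffer rejection sampling is sketched below
+in C++ for the weighted case. The record layout and predicate interface
+here are hypothetical simplifications; the framework's buffer stores full
+records with their headers.
+\begin{verbatim}
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <optional>
+#include <random>
+#include <vector>
+
+struct BufferedRecord {
+    int64_t key;
+    double weight;
+};
+
+// One rejection-sampling attempt against the unsorted mutable buffer:
+// pick a slot uniformly at random, then accept only if the record
+// satisfies the query predicate and passes a coin flip with probability
+// weight / max_weight. A rejected attempt returns nothing and the
+// caller simply retries.
+std::optional<BufferedRecord>
+buffer_sample(const std::vector<BufferedRecord> &buffer, double max_weight,
+              const std::function<bool(const BufferedRecord &)> &predicate,
+              std::mt19937_64 &rng) {
+    if (buffer.empty()) return std::nullopt;
+
+    std::uniform_int_distribution<size_t> slot(0, buffer.size() - 1);
+    std::uniform_real_distribution<double> coin(0.0, 1.0);
+
+    const BufferedRecord &r = buffer[slot(rng)];
+    if (!predicate(r)) return std::nullopt;       // fails the predicate
+    if (coin(rng) >= r.weight / max_weight) return std::nullopt;
+    return r;
+}
+\end{verbatim}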
-Table~\ref{tab:nomen} lists frequently used notation for the various parameters
-of the framework, which will be used in the coming analysis of the costs and
-trade-offs associated with operations within the framework's design space. The
-remainder of this section will discuss the performance characteristics of
-insertion into this structure (Section~\ref{ssec:insert}), how it can be used
-to correctly answer sampling queries (Section~\ref{ssec:insert}), and efficient
-approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will
-close with a detailed discussion of the trade-offs within the framework's
-design space (Section~\ref{ssec:design-space}).
-
-
-
-
-\subsection{Trade-offs on Framework Design Space}
-\label{ssec:design-space}
-The framework has several tunable parameters, allowing it to be tailored for
-specific applications. This design space contains trade-offs among three major
-performance characteristics: update cost, sampling cost, and auxiliary memory
-usage. The two most significant decisions when implementing this framework are
-the selection of the layout and delete policies. The asymptotic analysis of the
-previous sections obscures some of the differences between these policies, but
-they do have significant practical performance implications.
-
-\Paragraph{Layout Policy.} The choice of layout policy represents a clear
-trade-off between update and sampling performance. Leveling
-results in fewer shards of larger size, whereas tiering results in a larger
-number of smaller shards. As a result, leveling reduces the costs associated
-with point-lookups and sampling query preprocessing by a constant factor,
-compared to tiering. However, it results in more write amplification: a given
-record may be involved in up to $s$ reconstructions on a single level, as
-opposed to the single reconstruction per level under tiering.
-
-\Paragraph{Delete Policy.} There is a trade-off between delete performance and
-sampling performance that exists in the choice of delete policy. Tagging
-requires a point-lookup when performing a delete, which is more expensive than
-the insert required by tombstones. However, it also allows constant-time
-rejection checks, unlike tombstones which require a point-lookup of each
-sampled record. In situations where deletes are common and write-throughput is
-critical, tombstones may be more useful. Tombstones are also ideal in
-situations where immutability is required, or random writes must be avoided.
-Generally speaking, however, tagging is superior when using SSIs that support
-it, because sampling rejection checks will usually be more common than deletes.
-
-\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
-capacity and scale factor both influence the number of levels within the index,
-and by extension the number of distinct shards. Sampling and point-lookups have
-better performance with fewer shards. Smaller shards are also faster to
-reconstruct, although the same adjustments that reduce shard size also result
-in a larger number of reconstructions, so the trade-off here is less clear.
-
-The scale factor has an interesting interaction with the layout policy: when
-using leveling, the scale factor directly controls the amount of write
-amplification per level. Larger scale factors mean more time is spent
-reconstructing shards on a level, reducing update performance. Tiering does not
-have this problem and should see its update performance benefit directly from a
-larger scale factor, as this reduces the number of reconstructions.
-
-The buffer capacity also influences the number of levels, but is more
-significant in its effects on point-lookup performance: a lookup must perform a
-linear scan of the buffer. Likewise, the unstructured nature of the buffer also
-will contribute negatively towards sampling performance, irrespective of which
-buffer sampling technique is used. As a result, although a large buffer will
-reduce the number of shards, it will also hurt sampling and delete (under
-tagging) performance. It is important to minimize the cost of these buffer
-scans, and so it is preferable to keep the buffer small, ideally small enough
-to fit within the CPU's L2 cache. The number of shards within the index is,
-then, better controlled by changing the scale factor, rather than the buffer
-capacity. Using a smaller buffer will result in more compactions and shard
-reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
-demonstrates that this is not a serious performance problem when a scale factor
-is chosen appropriately. When the shards are in memory, frequent small
-reconstructions do not have a significant performance penalty compared to less
-frequent, larger ones.
-
-\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
-auxiliary data structures allows for memory to be traded in exchange for
-insertion or sampling performance. The use of Bloom filters for accelerating
-tombstone rejection checks has already been discussed, but many other options
-exist. Bloom filters could also be used to accelerate point-lookups for delete
-tagging, though such filters would require much more memory than tombstone-only
-ones to be effective. An auxiliary hash table could be used for accelerating
-point-lookups, or range filters like SuRF \cite{zhang18} or Rosetta
-\cite{siqiang20} added to accelerate pre-processing for range queries like in
-IRS or WIRS.
+In both of the query bounds above, $R(n) \in \Theta(1)$ for tagging
+deletes, and $R(n) \in \Theta(N_B + L(n) \log_s n)$ for tombstones
+(including the cost of searching the buffer for the tombstone).
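+Substituting these costs into the final term of each bound (and treating
+$\delta$ as a constant) makes the practical difference explicit,
+\begin{equation*}
+    \frac{kS(n)}{1 - \delta} \cdot R(n) =
+    \begin{cases}
+        \Theta\left(k S(n)\right) & \text{for tagging} \\[.5em]
+        \Theta\left(k S(n) \left[N_B + L(n) \log_s n\right]\right) & \text{for tombstones}
+    \end{cases}
+\end{equation*}
+so each sampled record is asymptotically more expensive to validate under
+tombstone deletes, unless the necessary point-lookups are accelerated
+using auxiliary structures such as Bloom filters.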
diff --git a/paper.tex b/paper.tex
index d1f7a47..c6bc608 100644
--- a/paper.tex
+++ b/paper.tex
@@ -147,7 +147,7 @@
\author{Douglas B. Rumbaugh}
\dept{Computer Science and Engineering}
% the degree will be conferred on this date
-\degreedate{December 2025}
+\degreedate{August 2025}
% year of your copyright
\copyrightyear{2025}
diff --git a/references/references.bib b/references/references.bib
index 0dbc804..5fef30a 100644
--- a/references/references.bib
+++ b/references/references.bib
@@ -421,6 +421,25 @@
bibsource = {dblp computer science bibliography, https://dblp.org}
}
+@inproceedings{overmars81-2,
+ author = {Mark H. Overmars and
+ Jan van Leeuwen},
+ editor = {Peter Deussen},
+  title        = {Dynamization of Decomposable Searching Problems Yielding Good Worst-Case
+ Bounds},
+ booktitle = {Theoretical Computer Science, 5th GI-Conference, Karlsruhe, Germany,
+ March 23-25, 1981, Proceedings},
+ series = {Lecture Notes in Computer Science},
+ volume = {104},
+ pages = {224--233},
+ publisher = {Springer},
+ year = {1981},
+ url = {https://doi.org/10.1007/BFb0017314},
+ doi = {10.1007/BFB0017314},
+ timestamp = {Tue, 14 May 2019 10:00:39 +0200},
+ biburl = {https://dblp.org/rec/conf/tcs/OvermarsL81.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
@article{naidan14,
author = {Bilegsaikhan Naidan and
Magnus Lie Hetland},
@@ -1482,3 +1501,82 @@ keywords = {analytic model, analysis of algorithms, overflow chaining, performan
}
+@misc{leveldb,
+ author = {Sanjay Ghemawat and Jeff Dean},
+ title = {LevelDB},
+ year = {2025},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ howpublished = {\url{https://github.com/google/leveldb}}
+}
+
+
+@inproceedings{monkey,
+ author = {Niv Dayan and
+ Manos Athanassoulis and
+ Stratos Idreos},
+ editor = {Semih Salihoglu and
+ Wenchao Zhou and
+ Rada Chirkova and
+ Jun Yang and
+ Dan Suciu},
+ title = {Monkey: Optimal Navigable Key-Value Store},
+ booktitle = {Proceedings of the 2017 {ACM} International Conference on Management
+ of Data, {SIGMOD} Conference 2017, Chicago, IL, USA, May 14-19, 2017},
+ pages = {79--94},
+ publisher = {{ACM}},
+ year = {2017},
+ url = {https://doi.org/10.1145/3035918.3064054},
+ doi = {10.1145/3035918.3064054},
+ timestamp = {Thu, 14 Oct 2021 10:11:38 +0200},
+ biburl = {https://dblp.org/rec/conf/sigmod/DayanAI17.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+
+@inproceedings{dostoevsky,
+ author = {Niv Dayan and
+ Stratos Idreos},
+ editor = {Gautam Das and
+ Christopher M. Jermaine and
+ Philip A. Bernstein},
+ title = {Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value
+ Stores via Adaptive Removal of Superfluous Merging},
+ booktitle = {Proceedings of the 2018 International Conference on Management of
+ Data, {SIGMOD} Conference 2018, Houston, TX, USA, June 10-15, 2018},
+ pages = {505--520},
+ publisher = {{ACM}},
+ year = {2018},
+ url = {https://doi.org/10.1145/3183713.3196927},
+ doi = {10.1145/3183713.3196927},
+ timestamp = {Wed, 21 Nov 2018 12:44:08 +0100},
+ biburl = {https://dblp.org/rec/conf/sigmod/DayanI18.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+
+@misc{autumn,
+  author       = {Fuheng Zhao and
+                  Zach Miller and
+                  Leron Reznikov and
+                  Divyakant Agrawal and
+                  Amr El Abbadi},
+  title        = {Autumn: A Scalable Read Optimized LSM-tree based Key-Value
+                  Stores with Fast Point and Range Read Speed},
+  year         = {2024},
+  eprint       = {2305.05074},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.DB},
+  url          = {https://arxiv.org/abs/2305.05074}
+}
+
+@inproceedings{sarkar23,
+ author = {Subhadeep Sarkar and
+ Niv Dayan and
+ Manos Athanassoulis},
+ title = {The {LSM} Design Space and its Read Optimizations},
+ booktitle = {39th {IEEE} International Conference on Data Engineering, {ICDE} 2023,
+ Anaheim, CA, USA, April 3-7, 2023},
+ pages = {3578--3584},
+ publisher = {{IEEE}},
+ year = {2023},
+ url = {https://doi.org/10.1109/ICDE55515.2023.00273},
+ doi = {10.1109/ICDE55515.2023.00273},
+ timestamp = {Sun, 12 Nov 2023 02:08:10 +0100},
+ biburl = {https://dblp.org/rec/conf/icde/SarkarDA23.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+