author    Douglas Rumbaugh <dbr4@psu.edu>    2025-06-20 17:24:18 -0400
committer Douglas Rumbaugh <dbr4@psu.edu>    2025-06-20 17:24:18 -0400
commit    7700f2818cca731cadac034322a28f19e9ac3a17 (patch)
tree      86e29639d5067bc047ee2f36471eda0ce8c7a291
parent    903055812fa35e0533b940ddb2d8db8c2a20af2b (diff)
updates
-rw-r--r--  chapters/beyond-dsp.tex          | 130
-rw-r--r--  chapters/conclusion.tex          | 113
-rw-r--r--  chapters/design-space.tex        | 237
-rw-r--r--  chapters/sigmod23/extensions.tex |   1
-rw-r--r--  paper.tex                        |   2
5 files changed, 296 insertions, 187 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 26f733c..af84b43 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -13,18 +13,22 @@
\section{Introduction}
-In the previous chapter, we discussed how several of the limitations of
-dynamization could be overcome by proposing a systematic dynamization
-approach for sampling data structures. In doing so, we introduced
-a multi-stage query mechanism to overcome the non-decomposability of
-these queries, provided two mechanisms for supporting deletes along with
-specialized processing to integrate these with the query mechanism, and
-introduced some performance tuning capability inspired by the design space
-of modern LSM Trees. While promising, these results are highly specialized
-and remain useful only within the context of sampling queries. In this
-chapter, we develop new generalized query abstractions based on these
-specific results, and discuss a fully implemented framework based upon
-these abstractions.
+In the previous chapter, we considered the problem of answering sampling
+queries over a decomposed data structure. Because such problems are
+not decomposable, they are not efficiently solvable using traditional
+dynamization techniques. However, by introducing a number of additional
+mechanisms, we were able to produce a specialized system that used a
+multi-stage query process to overcome the non-decomposability of
+the queries and provide efficient and correct answers. We additionally
+introduced systems for supporting deletes in two different ways, which
+were tightly integrated with both the query and reconstruction processes.
+Finally, we introduced a configuration space inspired by LSM trees
+to allow for tuning between query and insert performance. While this
+result is promising, it is not a general solution as the mechanisms are
+highly specialized to sampling queries. In this chapter, we consider
+these mechanisms and generalize them, creating new query abstractions
+based upon our specific results, and then discuss a fully implemented
+framework based on these new abstractions.
More specifically, in this chapter we propose \emph{extended
decomposability} and \emph{iterative deletion decomposability} as two
@@ -32,10 +36,17 @@ new, broader classes of search problem which are strict supersets of
decomposability and deletion decomposability respectively, providing a
more powerful interface to allow the efficient implementation of a larger
set of search problems over a dynamized structure. We then implement
-a C++ library based upon these abstractions which is capable of adding
+a C++ library based upon these abstractions that is capable of adding
support for inserts, deletes, and concurrency to static data structures
automatically, and use it to provide dynamizations for independent range
-sampling, range queries with learned indices, string search with succinct
+sampling,\footnote{
+ Our generalized framework's support for sampling is not \emph{exactly}
+ equivalent to that of the previous chapter. In particular, our
+ generalized framework does not support tombstone-based deletes
+    for sampling problems. It also lacks some mutable buffer-related
+ optimizations for weighted sampling. These features were sacrificed
+ in the name of generality.
+} range queries with learned indices, string search with succinct
tries, and high dimensional vector search with metric indices. In each
case we compare our dynamized implementation with existing dynamic
structures, and standard Bentley-Saxe dynamizations, where possible.
@@ -352,8 +363,7 @@ interface, with the same performance as their specialized implementations.
\subsection{Iterative Deletion Decomposability}
\label{ssec:dyn-idsp}
-
-\begin{algorithm}[t]
+\begin{algorithm}
\caption{Answering an Iterative Deletion Decomposable Search Problem}
\label{alg:dyn-query}
\KwIn{$q$: query parameters, $\mathscr{I}_1 \ldots \mathscr{I}_m$: blocks}
@@ -378,49 +388,51 @@ interface, with the same performance as their specialized implementations.
\end{algorithm}
-We next turn out attention to support for deletes. Efficient delete
+We next turn our attention to support for deletes. Efficient delete
support in Bentley-Saxe dynamization is provably impossible~\cite{saxe79},
-but, as discussed in Section~\ref{ssec:dyn-deletes} it is possible
-to support them in restricted situations, where either the search
-problem is invertible (Definition~\ref{def:invert}) or the data
-structure and search problem combined are deletion decomposable
-(Definition~\ref{def:background-ddsp}). In Chapter~\ref{chap:sampling},
-we considered a set of search problems which did \emph{not} satisfy
-any of these properties, and instead built a customized solution for
-deletes that required tight integration with the query process in order
-to function. While such a solution was acceptable for the goals of that
-chapter, it is not sufficient for our goal in this chapter of producing
-a generalized system.
-
-Additionally, of the two types of problem that can support deletes, the
-invertible case is preferable. This is because the amount of work necessary
-to support deletes for invertible search problems is very small. The data
+but, as discussed in Section~\ref{ssec:dyn-deletes}, it is possible to
+support them in restricted situations. Efficient delete mechanisms exist
+when the search problem is invertible (Definition~\ref{def:invert})
+or when the data structure and search problem combined are
+deletion decomposable (Definition~\ref{def:background-ddsp}).
+In Chapter~\ref{chap:sampling}, we considered a set of search problems
+which did \emph{not} satisfy either of these properties, and instead
+built a customized solution for deletes that required tight integration
+with the query process. While such a solution was acceptable for the
+goals of that chapter, it is insufficient for this chapter's goal of
+producing a generalized system.
+
+Of the two types of problem that can support deletes, the invertible
+case is preferable. This is because the amount of work necessary to
+support deletes for invertible search problems is very small. The data
structure requires no modification (such as to implement weak deletes),
-and the query requires no modification (to ignore the weak deletes) aside
-from the addition of the $\Delta$ operator. This is appealing from a
-framework design standpoint. Thus, it would also be worth it to consider
+and the query requires no modification (to ignore the weak deletes)
+aside from the addition of the $\Delta$ operator. This is appealing from
+a framework design standpoint. Thus, it would also be worth considering
approaches for expanding the range of search problems that can be answered
using the ghost structure mechanism supported by invertible problems.
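+For instance, consider range counting, a classic invertible search
+problem. Deleted records are inserted into a secondary ``ghost''
+structure, and $\Delta$ is simple subtraction. As an illustrative
+sketch in this chapter's notation, with $\mathscr{G}_1 \ldots
+\mathscr{G}_k$ denoting the ghost structure's blocks (a notation we
+adopt only for this example),
+\begin{equation*}
+\mathbftt{count}(q, \mathscr{D}) = \sum_{i=1}^{m} \mathbftt{count}(q,
+\mathscr{I}_i) - \sum_{j=1}^{k} \mathbftt{count}(q, \mathscr{G}_j)
+\end{equation*}
+and so deleted records cancel out of the final result without any
+modification to the data structure or to the local queries.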
A significant limitation of invertible problems is that the result set
-size is not able to be controlled. We do not know how many records in our
-local results have been deleted until we reach the combine operation and
-they begin to cancel out, at which point we lack a mechanism to go back
-and retrieve more records. This presents difficulties for addressing
-important search problems such as top-$k$, $k$-NN, and sampling. In
+size cannot be known until after the query has been answered. We do
+not know how many records in the local results have been deleted until
+we reach the combine operation and they begin to cancel out. Once this
+point has been reached, we lack a mechanism to return to the structure
+and retrieve more records. This presents difficulties for addressing
+important search problems such as top-$k$, $k$-NN, and sampling,
+where the required result set size is a user-specified parameter. In
principle, these queries could be supported by repeating the query with
larger-and-larger $k$ values until the desired number of records is
returned, but in the eDSP model this requires throwing away a lot of
-useful work, as the state of the query must be rebuilt each time.
+useful work, because the state of the query must be rebuilt each time.
We can resolve this problem by moving the decision to repeat the query
into the query interface itself, allowing retries \emph{before} the
result set is returned to the user and the local meta-information objects
-discarded. This allows us to preserve this pre-processing work, and repeat
+discarded. This allows us to preserve pre-processing results, and repeat
the local query process as many times as is necessary to achieve our
-desired number of records. From this observation, we propose another new
-class of search problem: \emph{iterative deletion decomposable} (IDSP). The
-IDSP definition expands eDSP with a fifth operation,
+desired number of records. From this observation, we propose another
+new class of search problem: \emph{iterative deletion decomposable}
+(IDSP). The IDSP definition expands eDSP with a fifth operation,
\begin{itemize}
\item $\mathbftt{repeat}(q, R, q_1, \ldots, q_m) \to
@@ -433,11 +445,14 @@ IDSP definition expands eDSP with a fifth operation,
If this routine returns true, it must also modify the local queries as
necessary to account for the work that remains to be completed (e.g.,
-update the number of records to retrieve). Then, the query process resumes
-from the execution of the local queries. If it returns false, then the
-result is simply returned to the user. If the number of repetitions of
-the query is bounded by $R(n)$, then the following provides an upper
-bound on the worst-case query complexity of an IDSP,
+update the number of records to retrieve). Then, the query process
+resumes from the execution of the local queries. If it returns false,
+then the result is simply returned to the user. The full IDSP query
+algorithm is shown in Figure~\ref{alg:dyn-idsp-query}.
+
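+As a complement to the formal algorithm, the following schematic C++
+sketch shows how the $\mathbftt{repeat}$ operation drives the query
+loop; every name in it is an illustrative placeholder rather than our
+library's actual interface,
+\begin{verbatim}
+// Schematic IDSP query driver. All types and the free functions
+// local_preproc, local_query, combine, and repeat are placeholders.
+#include <cstddef>
+#include <vector>
+
+template <typename Query, typename Block, typename LocalQ,
+          typename LocalR, typename Result>
+Result idsp_query(Query &q, std::vector<Block> &blocks) {
+    // One-time preprocessing; builds a local query per block.
+    std::vector<LocalQ> local_qs = local_preproc(q, blocks);
+    Result result;
+    bool again = true;
+    while (again) {
+        std::vector<LocalR> local_rs;
+        for (std::size_t i = 0; i < blocks.size(); i++)
+            local_rs.push_back(local_query(blocks[i], local_qs[i]));
+        result = combine(local_rs, q);
+        // repeat() may adjust the local queries (e.g., raise the
+        // number of records requested) and trigger another pass
+        // before the result is returned to the user.
+        again = repeat(q, result, local_qs);
+    }
+    return result;
+}
+\end{verbatim}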
+If the number of repetitions of the query is bounded by $R(n)$, then the
+following provides an upper bound on the worst-case query complexity of
+an IDSP,
\begin{equation*}
O\left(\log_2 n \cdot P(n) + D(n) + R(n) \left(\log_2 n \cdot Q_s(n) +
@@ -454,11 +469,10 @@ records. This can be done, for example, using the full-reconstruction
techniques in the literature~\cite{saxe79, merge-dsp, overmars83}
or through proactively performing reconstructions, such as with the
mechanism discussed in Section~\ref{sssec:sampling-rejection-bound},
-depending on the particulars of how deletes are implemented. The
-full IDSP query algorithm is shown in Figure~\ref{alg:dyn-idsp-query}
-
+depending on the particulars of how deletes are implemented.
+% \subsubsection{IDSP for $k$-NN}
As an example of how IDSP can facilitate delete support for search
problems, let's consider $k$-NN. This problem can be $C(n)$-deletion
@@ -614,7 +628,7 @@ a search problem falls into a particular classification in the general
taxonomy doesn't imply any particular information about where in the
deletion taxonomy that same problem might also fall.
-\begin{figure}[t]
+\begin{figure}
\subfloat[General Taxonomy]{\includegraphics[width=.49\linewidth]{diag/taxonomy}
\label{fig:taxonomy-main}}
\subfloat[Deletion Taxonomy]{\includegraphics[width=.49\linewidth]{diag/deletes} \label{fig:taxonomy-deletes}}
@@ -1084,7 +1098,7 @@ in Algorithm~\ref{alg:dyn-insert}.
$\texttt{buffer.append}(r)$\;
\Return
}
- $\texttt{buffer\_shard} \gets \texttt{build\_shard}(buffer)$ \;
+ $\texttt{buffer\_shard} \gets \texttt{build\_shard}(\texttt{buffer})$ \;
\BlankLine
$\texttt{idx} \gets 0$\;
\For{$i \gets 0 \ldots \texttt{n\_levels}$}{
@@ -1320,7 +1334,7 @@ get better asymptotic performance.
\label{ssec:dyn-concurrency}
The decomposition-based dynamization scheme we are considering in this
-work lends itself to a very straightfoward concurrency control scheme,
+work lends itself to a very straightforward concurrency control scheme,
because it is founded upon static data structures. As a result, concurrent
writes only need to be managed within the mutable buffer. Beyond this,
reconstructions within the levels of the structure only reorganize
@@ -1354,11 +1368,11 @@ assign jobs to the thread pool.
\subsubsection{The Mutable Buffer}
Our mutable buffer is an unsorted array to which new records are appended.
-This makes concurrent writes very straightfoward to support using a simple
+This makes concurrent writes very straightforward to support using a simple
fetch-and-add instruction on the tail pointer of the buffer. When a write
is issued, a fetch-and-add is executed against the tail pointer. This
effectively reserves a slot at the end of the array for the new record
-to be written into, as each thread will recieve a unique index from this
+to be written into, as each thread will receive a unique index from this
operation. Then, the record can be directly assigned to that index. If
the buffer is full, then a reconstruction is scheduled (if one isn't
already running) and a failure is immediately returned to the user.
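+As a minimal illustration of this scheme (the names, fixed capacity,
+and error handling here are hypothetical simplifications, not our
+actual implementation), the append path can be written as,
+\begin{verbatim}
+// Sketch of a concurrent append-only buffer. Each writer reserves
+// a unique slot via fetch-and-add on the tail pointer.
+#include <atomic>
+#include <cstddef>
+
+template <typename Record, std::size_t Capacity>
+class MutableBuffer {
+    Record slots[Capacity];
+    std::atomic<std::size_t> tail{0};
+public:
+    // Returns false when the buffer is full; the caller should
+    // then schedule a reconstruction and report the failure.
+    bool append(const Record &r) {
+        std::size_t idx = tail.fetch_add(1); // unique per thread
+        if (idx >= Capacity)
+            return false;
+        slots[idx] = r; // write directly into the reserved slot
+        return true;
+    }
+};
+\end{verbatim}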
diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
index 8f29e96..13457b5 100644
--- a/chapters/conclusion.tex
+++ b/chapters/conclusion.tex
@@ -1,24 +1,97 @@
-\chapter{Conclusion}
+\chapter{Summary and Future Work}
\label{chap:conclusion}
-In this work, we have considered approaches for automatically
-adding support for concurrent updates to static data structures,
-for the purpose of reducing the amount of work necessary to produce a
-dynamic index. Classical dynamization techniques suffered from several
-limitations on supported data structures, as well as performance problems
-stemming from a lack of configurability and poor worst-case insertion
-performance. We have attempted to address these limitations.
-
-The result of these efforts is a generalized dynamization framework built
-upon a set of novel mathematical results that allows for many static
-data structures to be automatically extended with tunable, concurrent
-insertion and deletion support, with bounded additional query cost. The
-technique expands on the base Bentley-Saxe method with new query interfaces
-to enable support for search problems that are not traditional decomposable,
-a tunable design space including buffering and alternative block layout
-polices to allow for trade-offs between insertion and query performance,
-and support for parallel reconstructions in a manner that effectively
-reduces the worst-case insertion cost while maintaining similar query
-performance.
+One of the perennial problems in database systems is the design of new
+indices to support new data types and search problems. While there exist
+numerous data structures that could be used as the basis for such indices,
+there is a mismatch between the required feature set of an index and
+that of a data structure. This requires a significant amount of effort
+to be expended in order to implement the missing features. In order
+to circumvent this problem, there have been past efforts at creating
+systems for automating some, or all, of the index design process in
+certain contexts. These existing efforts fall short of a truly general
+solution to the problem of automatic index generation. Automatic index
+composition assumes a particular search problem and a set of data
+structure primitives, and then composes those primitives into a custom
+structure that is optimized for a particular workload. Generalized index
+templates assume a solution structure, and attempt to solve a search
+problem within that structure. In both cases, the core methodology of
+the approach imposes restrictions on the types of problems to which they
+can be applied. Thus, neither is a truly viable approach to creating
+indices for arbitrary search problems in the general case.
+
+We propose a system based on a third technique: automatic feature
+extension. Starting with an existing data structure for the search problem
+of interest, various general techniques can be used to automatically
+add the features missing from the structure to create an index. A special
+case of this approach is well studied in the theoretical literature:
+dynamization. Dynamization seeks to automatically add support for
+inserts, and sometimes deletes, to a static data structure for a search
+problem that satisfies certain constraints. Dynamization has a number
+of limitations that prevent it from standing on its own as a solution
+to this problem, and so this work has concentrated on overcoming these
+shortcomings.
+
+By introducing new classifications of search problem, along with
+mechanisms to support solving them over a dynamized structure, we extended
+the applicability of dynamization techniques to a broader set of data
+structures and search problems, and increased the number of search
+problems for which deletes can be efficiently supported. We considered
+the design space of the similarly structured LSM Tree data structure,
+and borrowed certain applicable elements to introduce a configurable
+design space to allow for trade-offs between insertion and query
+performance. We then devised a system for controlling the worst-case
+insertion performance of dynamized structures, leveraging concurrency to
+match the lowest existing worst-case bound in the theoretical literature,
+and then parallelism to beat it.
+
+Through this effort, we have managed to resolve what we saw as the most
+significant barriers to the use of dynamization in the context of database
+indexing.
+
+
+\section{Future Work}
+While this is a significant step forward, there remains substantial
+work to be done before the ultimate goal of a general, automatic index
+generation framework has been reached. We have resolved a number
+of existing problems to make dynamization viable in the context of
+database systems, as well as expanded the scope of dynamization to
+include concurrency, but a database index requires more features than
+update support. In particular, our framework must also support the
+following additional features,
+
+\begin{enumerate}
+ \item \textbf{Support for external storage.} \\
+    While the sampling framework discussed in
+    Chapter~\ref{chap:sampling} did have an implementation that
+    used an external data structure, the general framework
+    discussed in the subsequent chapters was designed for
+    in-memory structures only. We will need
+ to extend it with support for external structures, as well as evaluate
+ whether our proposed techniques still function effectively in this
+ context.
+ \item \textbf{Crash recovery.} \\
+ It is critical for a database index to support crash recovery,
+ so that it can be recovered to a state consistent with the rest of
+ the database in the event of a system fault. Because our dynamized
+ indices are append-only, and can be viewed as a log of sorts,
+ inefficient crash recovery is straightforward: All operations
+    a naive crash recovery scheme is straightforward: all operations
+ highly inefficient, and so a better scheme must be devised.
+ \item \textbf{Distributed systems support.} \\
+    The append-only and decomposed nature of dynamized indices makes
+    them a natural fit for a distributed systems context. This was
+ briefly discussed in Section~\ref{ssec:ext-distributed}. While
+ not required for all, or even most, applications, support for
+ automatically distributing an index over multiple nodes in a
+ distributed system would be desirable.
+\end{enumerate}
+
+Once the full set of necessary index features can be supported by the
+framework, we plan to integrate the system into a database to allow
+user-defined indexing. To accommodate this, it will also be necessary
+to devise a mechanism for allowing the query optimizer to use these
+arbitrary, user-defined indices when generating query plans.
+
+
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 321c638..c8876de 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -5,15 +5,14 @@
In the previous two chapters, we introduced an LSM tree inspired design
space into the Bentley-Saxe method to allow for more flexibility in
-tuning the performance. However, aside from some general comments
-about how these parameters operator in relation to insertion and
-query performance, and some limited experimental evaluation, we haven't
-performed a systematic analysis of this space, its capabilities, and its
-limitations. We will rectify this situation in this chapter, performing
-both a detailed mathematical analysis of the design parameter space,
-as well as experiments to demonstrate these trade-offs exist in practice.
-
-\subsection{Why bother?}
+performance tuning. However, aside from some general comments about how
+these parameters affect insertion and query performance, and some limited
+experimental evaluation, we have not performed a systematic analysis of
+this space, its capabilities, and its limitations. We will rectify this
+situation in this chapter, performing both a detailed mathematical
+analysis of the design parameter space, as well as experimental
+evaluation, to explore the space and its trade-offs, and demonstrate
+their practical effectiveness.
Before diving into the design space we have introduced in detail, it's
worth taking some time to motivate this entire endeavor. There is a large
@@ -21,78 +20,73 @@ body of theoretical work in the area of data structure dynamization,
and, to the best of our knowledge, none of these papers have introduced
a design space of the sort that we have introduced here. Despite this,
some papers which \emph{use} these techniques have introduced similar
-design elements into their own implementations~\cite{pgm}, with some
-even going so far as to (inaccurately) describe these elements as part
-of the Bentley-Saxe method~\cite{almodaresi23}.
-
-This situation is best understood, we think, in terms of the ultimate
-goals of the respective lines of work. In the classical literature on
-dynamization, the focus is mostly on proving theoretical asymptotic
-bounds about the techniques. In this context, the LSM tree design space
-is of limited utility, because its tuning parameters adjust constant
-factors only, and thus don't play a major role in asymptotics. Where
+design elements into their own implementations~\cite{pgm}, with some even
+going so far as to describe these elements as part of the Bentley-Saxe
+method~\cite{almodaresi23}.
+
+This situation is best understood in terms of the ultimate goals
+of the respective lines of work. In the classical literature
+on dynamization, the focus is directed at proving theoretical
+asymptotic bounds. In this context, the LSM tree design space is
+of limited utility, because its tuning parameters adjust constant
+factors, and thus don't play a major role in asymptotics. Where
the theoretical literature does introduce configurability, such as
with the equal blocks method~\cite{overmars-art-of-dyn} or more
complex schemes that nest the equal block method \emph{inside}
of a binary decomposition~\cite{overmars81}, the intention is
to produce asymptotically relevant trade-offs between insert,
query, and delete performance for deletion decomposable search
-problems~\cite[pg. 117]{overmars83}. This is why the equal block method
-is described in terms of a function, rather than a constant value,
+problems~\cite[pg. 117]{overmars83}. This explains why the equal block
+method is described in terms of a function, rather than a constant value,
to enable it to appear in the asymptotics.
On the other hand, in practical scenarios, constant tuning of performance
can be very relevant. We've already shown in Sections~\ref{ssec:ds-exp}
-and \ref{ssec:dyn-ds-exp} how tuning parameters, particularly the
-number of shards per level, can have measurable real-world effects on the
-performance characteristics of dynamized structures, and in fact sometimes
+and \ref{ssec:dyn-ds-exp} how tuning parameters, particularly adjusting
+the number of shards per level, can have measurable real-world effects on
+the performance characteristics of dynamized structures. In fact, sometimes
this tuning is \emph{necessary} to enable reasonable performance. It's
quite telling that the two most direct implementations of the Bentley-Saxe
method that we have identified in the literature are both in the context
of metric indices~\cite{naidan14,bkdtree}, a class of data structure
and search problem for which we saw very good performance from standard
-Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. The other experiments
-in Chapter~\ref{chap:framework} show that, for other types of problem,
-the technique does not fare quite so well.
+Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. Our experiments in
+Chapter~\ref{chap:framework} show that, for other types of problem,
+the technique does not fare quite so well in its unmodified form.
\section{Asymptotic Analysis}
\label{sec:design-asymp}
-Before beginning with derivations for
-the cost functions of dynamized structures within the context of our
-proposed design space, we should make a few comments about the assumptions
-and techniques that we will use in our analysis. As this design space
-involves adjusting constants, we will leave the design-space related
-constants within our asymptotic expressions. Additionally, we will
-perform the analysis for a simple decomposable search problem. Deletes
-will be entirely neglected, and we won't make any assumptions about
-mergeability. We will also neglect the buffer size, $N_B$, during this
-analysis. Buffering isn't fundamental to the techniques we are examining
-in this chapter, and including it would increase the complexity of the
-analysis without contributing any useful insights.\footnote{
- The contribution of the buffer size is simply to replace each of the
- individual records considered in the analysis with batches of $N_B$
- records. The same patterns hold.
-}
+Before beginning with derivations for the cost functions of dynamized
+structures within the context of our proposed design space, we should
+make a few comments about the assumptions and techniques that we will use
+in our analysis. We will generally neglect buffering in our analysis,
+both in terms of the additional query cost of querying the buffer, and
+in terms of the buffer's effect on the reconstruction process. Buffering
+isn't fundamental to the techniques we are considering, and including it
+would needlessly complicate the analysis. However, we will include the
+scale factor, $s$, which directly governs the number of blocks within
+the dynamized structures. Additionally, we will perform the query cost
+analysis assuming a decomposable search problem. Deletes will be entirely
+neglected, and we won't make any assumptions about mergeability.
\subsection{Generalized Bentley-Saxe Method}
As a first step, we will derive a modified version of the Bentley-Saxe
-method that has been adjusted to support arbitrary scale factors, and
-buffering. There's nothing fundamental to the technique that prevents
-such modifications, and its likely that they have not been analyzed
-like this before simply out of a lack of interest in constant factors in
-theoretical asymptotic analysis. During our analysis, we'll intentionally
-leave these constant factors in place.
-
-When generalizing the Bentley-Saxe method for arbitrary scale factors, we
-decided to maintain the core concept of binary decomposition. One interesting
-mathematical property of a Bentley-Saxe dynamization is that the internal
-layout of levels exactly matches the binary representation of the record
-count contained within the index. For example, a dynamization containing
-$n=20$ records will have 4 records in the third level, and 16 in the fifth,
-with all other levels being empty. If we represent a full level with a 1
-and an empty level with a 0, then we'd have $10100$, which is $20$ in
-base 2.
+method that has been adjusted to support arbitrary scale factors. There's
+nothing fundamental to the technique that prevents such modifications,
+and it's likely that they have not been analyzed like this before simply
+out of a lack of interest in constant factors in theoretical asymptotic
+analysis.
+
+When generalizing the Bentley-Saxe method for arbitrary scale factors,
+we decided to maintain the core concept of binary decomposition. One
+interesting mathematical property of a Bentley-Saxe dynamization is that
+the internal layout of levels exactly matches the binary representation of
+the record count contained within the index. For example, a dynamization
+containing $n=20$ records will have 4 records in the third level, and
+16 in the fifth, with all other levels being empty. If we represent a
+full level with a 1 and an empty level with a 0, then we'd have $10100$,
+which is $20$ in base 2.
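+This correspondence is simple to compute. The following small C++
+sketch (illustrative only, not part of the framework) reports the
+occupied levels of a classical binary decomposition,
+\begin{verbatim}
+// Level i (0-indexed) is full iff bit i of n is set, and it then
+// holds 2^i records. For n = 20 (10100 in binary) this prints
+// levels 2 and 4: the third and fifth levels, with 4 and 16
+// records respectively.
+#include <cstdio>
+
+void print_levels(unsigned long n) {
+    for (int i = 0; (1UL << i) <= n; i++)
+        if (n & (1UL << i))
+            std::printf("level %d holds %lu records\n", i, 1UL << i);
+}
+\end{verbatim}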
\begin{algorithm}
\caption{The Generalized BSM Layout Policy}
@@ -137,7 +131,6 @@ Unfortunately, the approach used by Bentley and Saxe to calculate the
amortized insertion cost of the BSM does not generalize to larger bases,
and so we will need to derive this result using a different approach.
-
\begin{theorem}
The amortized insertion cost for generalized BSM with a growth factor of
$s$ is $\Theta\left(\frac{B(n)}{n} \cdot s\log_s n\right)$.
@@ -318,7 +311,7 @@ $j+1$. This process clears space in level $0$ to contain the buffer flush.
\begin{theorem}
The amortized insertion cost of leveling with a scale factor of $s$ is
\begin{equation*}
-I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}(s+1)\log_s n\right)
+I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot s \log_s n\right)
\end{equation*}
\end{theorem}
\begin{proof}
@@ -592,10 +585,10 @@ reconstructions, one per level.
\begin{tabular}{|l l l l|}
\hline
& \textbf{Gen. BSM} & \textbf{Leveling} & \textbf{Tiering} \\ \hline
+$I(n)$ & $\Theta(B(n))$ & $\Theta\left(B\left(\frac{s-1}{s} \cdot n\right)\right)$ & $ \Theta\left(\sum_{i=0}^{\log_s n} B(s^i)\right)$ \\ \hline
$I_A(n)$ & $\Theta\left(\frac{B(n)}{n} s\log_s n\right)$ & $\Theta\left(\frac{B(n)}{n} s\log_s n\right)$ & $\Theta\left(\frac{B(n)}{n} \log_s n\right)$ \\ \hline
$\mathscr{Q}(n)$ &$O\left(\log_s n \cdot \mathscr{Q}_S(n)\right)$ & $O\left(\log_s n \cdot \mathscr{Q}_S(n)\right)$ & $O\left(s \log_s n \cdot \mathscr{Q}_S(n)\right)$\\ \hline
-%$\mathscr{Q}_B(n)$ & $\Theta(\mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ \\ \hline
-$I(n)$ & $\Theta(B(n))$ & $\Theta\left(B\left(\frac{s-1}{s} \cdot n\right)\right)$ & $ \Theta\left(\sum_{i=0}^{\log_s n} B(s^i)\right)$ \\ \hline
+$\mathscr{Q}_B(n)$ & $\Theta(\mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ \\ \hline
\end{tabular}
\caption{Comparison of cost functions for various layout policies for DSPs}
@@ -620,12 +613,12 @@ space of our framework.
We'll begin by validating our results for the insertion performance
characteristics of the three layout policies. For this test, we
consider two data structures: the ISAM tree and the VP tree. The ISAM
-tree structure is merge-decomposable using a sorted-array merge, with
-a build cost of $B_M(n) \in \Theta(n \log k)$, where $k$ is the number
-of structures being merged. The VPTree, by contrast, is \emph{not}
-merge decomposable, and is built in $B(n) \in \Theta(n \log n)$ time. We
-use the $200,000,000$ record SOSD \texttt{OSM} dataset~\cite{sosd-datasets} for
-ISAM testing, and the $1,000,000$ record, $300$-dimensional Spanish
+tree structure is merge-decomposable using a sorted-array merge, with a
+build cost of $B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of
+structures being merged. The VPTree, by contrast, is \emph{not} merge
+decomposable, and is built in $B(n) \in \Theta(n \log n)$ time. We use
+the $200$ million record SOSD \texttt{OSM} dataset~\cite{sosd-datasets}
+for ISAM testing, and the one million record, $300$-dimensional Spanish
Billion Words (\texttt{SBW}) dataset~\cite{sbw} for VPTree testing.
For our first experiment, we will examine the latency distribution
@@ -639,13 +632,20 @@ buffer size of $N_B=12000$ for the ISAM tree structure, and $N_B=1000$
for the VPTree.
We generated this distribution by inserting $30\%$ of the records from
-the set to ``warm up'' the dynamized structure, and then measuring the
-insertion latency for each individual insert for the remaining $70\%$
-of the data. Note that, due to timer resolution issues at nanosecond
-scales, the specific latency values associated with the faster end of
-the insertion distribution are not precise. However, it is our intention
-to examine the latency distribution, not the values themselves, and so
-this is not a significant limitation for our analysis.
+the set to ``warm up'' the dynamized structure, and then measuring
+the insertion latency for each individual insert for the remaining
+$70\%$ of the data. Note that, due to timer resolution issues at
+nanosecond scales, the specific latency values associated with the
+faster end of the insertion distribution are not precise. However,
+it is our intention to examine the latency distribution, not the
+values themselves, and so this is not a significant limitation
+for our analysis. The resulting distributions are shown in
+Figure~\ref{fig:design-policy-ins-latency}. These distributions
+are represented using a ``reversed'' CDF with log scaling on both
+axes. This representation has proven very useful for interpreting the
+latency distributions that we see in evaluating dynamization, but is
+slightly unusual, and so we've included a guide to interpreting these
+charts in Appendix~\ref{append:rcdf}.
\begin{figure}
\centering
@@ -655,32 +655,25 @@ this is not a significant limitation for our analysis.
\label{fig:design-policy-ins-latency}
\end{figure}
-The resulting distributions are shown in
-Figure~\ref{fig:design-policy-ins-latency}. These distributions are
-representing using a "reversed" CDF with log scaling on both axes. This
-representation has proven very useful for interpreting the latency
-distributions that we see in evaluating dynamization, but are slightly
-unusual, and so we've included a guide to interpreting these charts
-in Appendix~\ref{append:rcdf}.
The first notable point is that, for both the ISAM
tree in Figure~\ref{fig:design-isam-ins-dist} and VPTree in
-Figure~\ref{fig:design-vptree-ins-dist}, the Leveling policy results in a
+Figure~\ref{fig:design-vptree-ins-dist}, the leveling policy results in a
measurably lower worst-case insertion latency. This result is in line with
our theoretical analysis in Section~\ref{sec:design-asymp}. However, there
is a major deviation from the theoretical predictions in the
worst-case performance of
-Tiering and BSM. Both of these should have similar worst-case latencies,
+tiering and BSM. Both of these should have similar worst-case latencies,
as the worst-case reconstruction in both cases involves every record
in the structure. Yet, we see tiering consistently performing better,
particularly for the ISAM tree.
The reason for this has to do with the way that the records are
-partitioned in these worst-case reconstructions. In Tiering, with a scale
+partitioned in these worst-case reconstructions. In tiering, with a scale
factor of $s$, the worst-case reconstruction consists of $\Theta(\log_2
n)$ distinct reconstructions, each involving exactly $2$ structures. BSM,
on the other hand, will use exactly $1$ reconstruction involving
$\Theta(\log_2 n)$ structures. This explains why ISAM performs much better
-in Tiering than BSM, as the actual reconstruction cost function there is
+in tiering than BSM, as the actual reconstruction cost function there is
$\Theta(n \log_2 k)$. For tiering, this results in $\Theta(n)$ cost in
the worst case. BSM, on the other hand, has $\Theta(n \log_2 \log_2 n)$,
as many more distinct structures must be merged in the reconstruction,
@@ -699,7 +692,7 @@ due to cache effects most likely, but less so than in the MDSP case.
\end{figure}
Next, in Figure~\ref{fig:design-ins-tput}, we show the overall insertion
-throughput for the three policies for both ISAM Tree and VPTree. This
+throughput for the three policies for both ISAM tree and VPTree. This
result should correlate with the amortized insertion costs for each
policy derived in Section~\ref{sec:design-asymp}. At a scale factor of
$s=2$, all three policies have similar insertion performance. This makes
@@ -708,7 +701,7 @@ proportional to the scale factor, and at $s=2$ this isn't significantly
larger than tiering's write amplification, particularly compared
to the other factors influencing insertion performance, such as
reconstruction time. However, for larger scale factors, tiering shows
-\emph{significantly} higher insertion throughput, and Leveling and
+\emph{significantly} higher insertion throughput, and leveling and
Bentley-Saxe show greatly degraded performance due to the large amount
of additional write amplification. These results are perfectly in line
with the mathematical analysis of the previous section.
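+As a rough worked example of this effect (a back-of-the-envelope
+estimate based on the amortized costs derived earlier in this
+chapter): under leveling, each record is rewritten on the order of
+$s$ times per level, for a total write volume proportional to
+$s \log_s n$ per record, whereas under tiering each record is written
+only once per level, for a volume proportional to $\log_s n$. At
+$s=2$ the two differ by only a factor of two, but at $s=8$ leveling
+incurs roughly eight times tiering's write amplification, consistent
+with the throughput gap observed here.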
@@ -718,7 +711,7 @@ with the mathematical analysis of the previous section.
For our next experiment, we will consider the trade-offs between insertion
and query performance that exist within this design space. We benchmarked
each layout policy for a range of scale factors, measuring both their
-respective insertion throughputs and query latencies for both ISAM Tree
+respective insertion throughputs and query latencies for both ISAM tree
and VPTree.
\begin{figure}
@@ -786,7 +779,8 @@ factors on the trade-off between insertion and query performance. Our
framework also supports varying buffer sizes, and so we will examine this
next. Figure~\ref{fig:buffer-size} shows the same insertion throughput
vs. query latency curves for fixed layout policy and scale factor
-configurations at varying buffer sizes.
+configurations at varying buffer sizes, under the same experimental
+conditions as the previous test.
Unlike with the scale factor, there is a significant difference in the
behavior of the two tested structures under buffer size variation. For
@@ -830,7 +824,11 @@ configurations approaching a similar query performance.
In order to evaluate this effect, we tested the query latency of range
queries of varying selectivity against various configurations of our
framework to see at what points the query latencies begin to converge. We
-also tested $k$-NN queries with varying values of $k$.
+also tested $k$-NN queries with varying values of $k$. For these tests,
+we used a synthetic dataset of 500 million 64-bit key-value pairs for
+the ISAM testing, and the SBW dataset for $k$-NN. Query latencies were
+measured by executing the queries after all records were inserted into
+the structure.
\begin{figure}
\centering
@@ -961,24 +959,47 @@ In this chapter, we considered the proposed design space for our
dynamization framework both mathematically and experimentally, and derived
some general principles for configuration within the space. We generalized
the Bentley-Saxe method to support scale factors and buffering, but
-found that the result was strictly worse than leveling in all but its
+found that the result was generally worse than leveling in all but its
best case query performance. We also showed that there does exist a
trade-off, mediated by scale factor, between insertion performance and
-query performance for the tiering layout policy. Unfortunately, the
-leveling layout policy does not have a particularly useful trade-off
-in this area because the cost in insertion performance grows far faster
-than any query performance benefit, due to the way to two effects scale
-in the cost functions for the method.
+query performance, though it doesn't manifest for every layout policy
+and data structure combination. For example, when testing the ISAM tree
+structure with the leveling or BSM policies, there is not a particularly
+useful trade-off resulting from scale factor adjustments, because the
+amount of extra query performance resulting from increasing the scale
+factor is dwarfed by the reduction in insertion performance. This is
+because the cost in insertion performance grows far faster than any
+query performance benefit, due to the way to two effects scale in the
+cost functions for the method.
Broadly speaking, we can draw a few general conclusions. First, the
-leveling layout policy is better than tiering for query latency in
-all configurations, but worse in insertion performance. Leveling also
-has the best insertion tail latency performance by a small margin,
-owing to the way it performs reconstructions. Tiering, however,
-has significantly better insertion performance and can be configured
-with query performance that is similar to leveling. These results are
-aligned with the smaller-scale parameter testing done in the previous
-chapters, which landed on tiering as a good general solution for most
-cases. Tiering also has the advantage of meaningful tuning through scale
-factor adjustment.
+leveling and BSM policies are fairly similar, with BSM having slightly
+better query performance in general, owing to its superior best-case query
+cost. Both of these policies are better than tiering in terms of query
+performance, but generally worse for insertion performance. The one
+slight exception to this trend is in worst-case insertion performance,
+where leveling has a modest advantage over the other policies: the
+way it performs reconstructions ensures that the worst-case
+reconstruction cost is smaller. Adjusting the scale factor can trade
+between insert and query performance, though leveling and BSM have an
+opposite effect from tiering. For these policies, increasing the scale
+factor reduces insert performance and improves query performance. Tiering
+does the opposite. The mutable buffer can be increased in size to improve
+insert performance as well (in all cases), but the query cost increases
+as a result. Once the buffer gets sufficiently large, the trade-off in
+query performance becomes severe.
+
+While this trade-off space does provide us with the desired
+configurability, the experimental results show that the trade-off curves
+are not particularly smooth, and the effectiveness can vary quite a bit
+depending on the properties of the data structure and search problem being
+dynamized. Additionally, there isn't a particularly good way to control
+insertion tail latencies in this model, as leveling is only slightly
+better in this metric. In the next chapter, we'll consider methods for
+controlling tail latency, which will, as a side benefit, also provide
+a more desirable configuration space than the one considered here.
+
+
+
+
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 3304b76..053c8e2 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -37,6 +37,7 @@ of the shards, at any given time the majority of the data will be on
disk anyway, so this would only provide a marginal improvement.
\subsection{Distributed Data Structures}
+\label{ssec:ext-distributed}
Many distributed data processing systems are built on immutable
abstractions, such as Apache Spark's resilient distributed dataset
diff --git a/paper.tex b/paper.tex
index 1eac710..6cb2689 100644
--- a/paper.tex
+++ b/paper.tex
@@ -74,7 +74,7 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Only one of the following lines should be used at a time.
% Doctoral students.
-\documentclass[phd,12pt]{psuthesis}
+\documentclass[phd,12pt,twoside]{psuthesis}
% Masters students
%\documentclass[ms,12pt]{psuthesis}
% Bachelors students in the Schreyer Honors College.