-rw-r--r--   chapters/beyond-dsp.tex     4
-rw-r--r--   chapters/design-space.tex   220
2 files changed, 222 insertions, 2 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 38a99c5..87f44ba 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -1353,7 +1353,7 @@ structures. Specifically,
\subsection{Design Space Evaluation}
-
+\label{ssec:dyn-ds-exp}
\begin{figure}
%\vspace{0pt}
\centering
@@ -1519,7 +1519,7 @@ in this case, doesn't add a significant amount of overhead over a single
instance of the static structure.
\subsection{$k$-NN Search}
-
+\label{ssec:dyn-knn-exp}
Next, we'll consider answering high dimensional exact $k$-NN queries
using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a
binary search tree with internal nodes that partition records based
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 47f728a..10278bd 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -1,2 +1,222 @@
\chapter{Exploring the Design Space}
\label{chap:design-space}
+
+\section{Introduction}
+
+In the previous two chapters, we introduced an LSM tree-inspired design
+space into the Bentley-Saxe method to allow for more flexibility in
+tuning its performance. However, aside from some general comments
+about how these parameters relate to insertion and query performance,
+and some limited experimental evaluation, we have not performed a
+systematic analysis of this space, its capabilities, and its
+limitations. We rectify this situation in this chapter, performing both
+a detailed mathematical analysis of the design parameter space and
+experiments demonstrating that these trade-offs exist in practice.
+
+\subsection{Why bother?}
+
+Before diving into the details of the design space we have introduced,
+it's worth taking some time to motivate this entire endeavor. There is
+a large body of theoretical work in the area of data structure
+dynamization, and, to the best of our knowledge, none of these papers
+introduces a design space of the sort proposed here. Despite this,
+some papers which \emph{use} these techniques have introduced similar
+design elements into their own implementations~\cite{pgm}, with some
+even going so far as to (inaccurately) describe these elements as part
+of the Bentley-Saxe method~\cite{almodaresi23}.
+
+This situation is best understood, we think, in terms of the ultimate
+goals of the respective lines of work. In the classical literature on
+dynamization, the focus is mostly on proving theoretical asymptotic
+bounds about the techniques. In this context, the LSM tree design space
+is of limited utility, because its tuning parameters adjust constant
+factors only, and thus play no role in the asymptotics. Where
+the theoretical literature does introduce configurability, such as
+with the equal block method~\cite{overmars-art-of-dyn} or more
+complex schemes that nest the equal block method \emph{inside}
+a binary decomposition~\cite{overmars81}, the intention is
+to produce asymptotically relevant trade-offs between insert,
+query, and delete performance for deletion decomposable search
+problems~\cite[pg. 117]{overmars83}. This is why the equal block method
+is parameterized by a function, rather than a constant, allowing the
+parameter to appear in the asymptotics.
+
+On the other hand, in practical scenarios, constant-factor performance
+tuning can be very relevant. We have already shown in
+Sections~\ref{ssec:ds-exp} and \ref{ssec:dyn-ds-exp} how tuning
+parameters, particularly the number of shards per level, can have
+measurable real-world effects on the performance characteristics of
+dynamized structures, and in fact sometimes this tuning is
+\emph{necessary} to achieve reasonable performance. It is quite telling
+that the two most direct implementations of the Bentley-Saxe method
+that we have identified in the literature are both in the context of
+metric indices~\cite{naidan14,bkdtree}, a class of data structure and
+search problem for which we saw very good performance from standard
+Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. The other experiments
+in Chapter~\ref{chap:framework} show that, for other types of problem,
+the technique does not fare quite so well.
+
+\section{Asymptotic Analysis}
+
+Before beginning with derivations for the cost functions of dynamized
+structures within the context of our proposed design space, we should
+make a few comments about the assumptions and techniques that we will
+use in our analysis. As this design space involves adjusting constants,
+we will leave the design-space-related constants within our asymptotic
+expressions. Additionally, we will perform the analysis for a simple
+decomposable search problem: deletes will be entirely neglected, and we
+will make no assumptions about mergeability. These assumptions serve
+only to simplify the analysis.
+
+\subsection{Generalized Bentley-Saxe Method}
+As a first step, we will derive a modified version of the Bentley-Saxe
+method that has been adjusted to support arbitrary scale factors and
+buffering. There is nothing fundamental to the technique that prevents
+such modifications, and it is likely that they have not been analyzed
+in this way before simply out of a lack of interest in constant factors
+in theoretical asymptotic analysis. During our analysis, we will
+intentionally leave these constant factors in place.
+
+
+When generalizing the Bentley-Saxe method for arbitrary scale factors, we
+decided to maintain the core concept of binary decomposition. One interesting
+mathematical property of a Bentley-Saxe dynamization is that the internal
+layout of levels exactly matches the binary representation of the record
+count contained within the index. For example, a dynamization containing
+$n=20$ records will have 4 records in the third level and 16 in the fifth,
+with all other levels empty. If we represent a full level with a 1
+and an empty level with a 0, reading from the largest level down, then
+we'd have $10100$, which is $20$ in base 2.
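+Written out explicitly, this example corresponds to the binary expansion
+\begin{equation*}
+20 = 1\cdot 2^4 + 0 \cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 0\cdot 2^0 = 10100_2.
+\end{equation*}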
+
+Our generalization, then, is to represent the data as an $s$-ary
+decomposition, where the scale factor represents the base of the
+representation. To accomplish this, we set the capacity of level $i$ to
+be $N_b (s - 1) \cdot s^i$, where $N_b$ is the size of the buffer. The
+resulting structure will have at most $\log_s n$ shards. Unfortunately,
+the approach used by Bentley and Saxe to calculate the amortized insertion
+cost of the BSM does not generalize to larger bases, and so we will need
+to derive this result using a different approach. Note that, for this
+analysis, we will neglect the buffer size $N_b$ for simplicity. It cancels
+out in the analysis, and so would only serve to increase the complexity
+of the expressions without contributing any additional insights.\footnote{
+ The contribution of the buffer size is simply to replace each of the
+ individual records considered in the analysis with batches of $N_b$
+ records. The same patterns hold.
+}
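+
+As a concrete illustration (with the buffer neglected, as above), take
+$s = 3$: the level capacities are $2, 6, 18, \ldots$, and a structure
+containing $n = 17$ records is laid out according to the base-3
+representation of the record count,
+\begin{equation*}
+17 = 1 \cdot 3^2 + 2 \cdot 3^1 + 2 \cdot 3^0 = 122_3,
+\end{equation*}
+so the third level holds $9$ records, the second holds $6$, and the
+first holds $2$.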
+
+\begin{theorem}
+The amortized insertion cost for generalized BSM with a scale factor of
+$s$ is $\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)\right)$.
+\end{theorem}
+\begin{proof}
+
+In order to calculate the amortized insertion cost, we will first
+determine the average number of times that a record is involved in a
+reconstruction, and then amortize the cost of those reconstructions
+over the insertions into the structure.
+
+If we consider only the first level of the structure, it's clear that
+the reconstruction counts associated with the records in that level
+will follow the pattern $1, 2, 3, \ldots, s-1$ when the level is full.
+Thus, the total number of reconstructions associated with records on level
+$i=0$ is the sum of that sequence, or
+\begin{equation*}
+W(0) = \sum_{j=1}^{s-1} j = \frac{1}{2}\left(s^2 - s\right)
+\end{equation*}
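+For instance, with $s=4$ the first level fills after three insertions,
+with per-record counts $3, 2, 1$, giving $W(0) = 6 = \frac{1}{2}(4^2 - 4)$.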
+
+Considering the next level, $i=1$, each reconstruction involving this
+level will copy down the entirety of the structure above it, adding
+one more write per record, as well as one extra write for the new record.
+More specifically, the first ``batch'' of records written into level
+$i=1$ will have write counts $1, 2, 3, \ldots, s$; the second ``batch''
+of records will increment all of the existing write counts by one, and
+then introduce another copy of $1, 2, 3, \ldots, s$ writes; and so on.
+
+Thus, each new ``batch'' written to level $i$ carries with it the writes
+already accumulated by its records in the levels above, adds one new
+write for every record in the batch (including the newly inserted
+record), and rewrites all of the records currently on level $i$.
+
+The net result of this is that the number of writes on level $i$ is given
+by the following recurrence relation (combined with the $W(0)$ base case),
+
+\begin{equation*}
+W(i) = sW(i-1) + \frac{1}{2}\left(s-1\right)^2 \cdot s^i
+\end{equation*}
+
+which can be solved to give the following closed-form expression,
+\begin{equation*}
+W(i) = s^i \cdot \left(\frac{1}{2} (s-1) \cdot (s(i+1) - i)\right)
+\end{equation*}
+which provides the total number of reconstructions that records in
+level $i$ of the structure have participated in. As each record
+is involved in a different number of reconstructions, we'll consider the
+average number by dividing $W(i)$ by the number of records in level $i$.
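+As a quick sanity check, for $s=3$ the base case gives $W(0) = 3$, the
+recurrence gives $W(1) = 3 \cdot 3 + \frac{1}{2}(3-1)^2 \cdot 3 = 15$, and
+the closed form agrees: $3^1 \cdot \frac{1}{2}(3-1)\left(3 \cdot 2 - 1\right) = 15$.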
+
+From here, the proof proceeds in the standard way for this sort of
+analysis. The worst-case cost of a reconstruction is $B(n)$, and there
+are $\log_s(n)$ total levels, so the total reconstruction cost associated
+with a record can be upper-bounded by $B(n) \cdot
+\frac{W(\log_s(n))}{n}$. This cost is then amortized over the $n$
+insertions necessary to get the record into the last level, resulting
+in an amortized insertion cost of
+\begin{equation*}
+\frac{B(n)}{n} \cdot \frac{1}{2}(s-1) \cdot \left( (s-1)\log_s n + s\right)
+\end{equation*}
+Note that, in the case of $s=2$, this expression reduces to the same amortized
+insertion cost as was derived using the binomial theorem in the original BSM
+paper~\cite{saxe79}.
+\end{proof}
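+
+The closed form for $W(i)$ can also be checked numerically. The short
+Python sketch below (a standalone illustration, not part of the framework
+implementation) simulates the reconstruction policy described above,
+without a buffer, and asserts that the per-record write counts on each
+full level sum to $W(i)$.
+
+\begin{verbatim}
+def closed_form(s, i):
+    # W(i) = s^i * (1/2)(s-1)(s(i+1) - i)
+    return s**i * (s - 1) * (s * (i + 1) - i) // 2
+
+def simulate(s, max_level=4):
+    levels = []  # levels[i]: list of per-record write counts
+    while True:
+        # Find the smallest level able to absorb everything
+        # above it plus the newly inserted record.
+        i, pending = 0, 1
+        while True:
+            if i == len(levels):
+                levels.append([])
+            pending += len(levels[i])
+            if pending <= (s - 1) * s**i:
+                break
+            i += 1
+        # Rebuild level i from levels 0..i and the new record,
+        # charging one additional write to every record involved.
+        merged = [c + 1 for lvl in levels[:i + 1] for c in lvl] + [1]
+        for j in range(i):
+            levels[j] = []
+        levels[i] = merged
+        # Any level that is exactly full should have a write
+        # total matching the closed form.
+        for j, lvl in enumerate(levels):
+            if len(lvl) == (s - 1) * s**j:
+                assert sum(lvl) == closed_form(s, j)
+        if i == max_level and len(levels[i]) == (s - 1) * s**i:
+            return
+
+for s in (2, 3, 4, 5):
+    simulate(s)
+print("closed form verified for small s and i")
+\end{verbatim}
+
+Running the sketch for several small values of $s$ exercises the base
+case, the recurrence, and the closed form simultaneously.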
+
+\begin{theorem}
+The worst-case insertion cost for generalized BSM with a scale factor
+of $s$ is $\Theta(B(n))$.
+\end{theorem}
+\begin{proof}
+The Bentley-Saxe method finds the smallest non-full block and performs
+a reconstruction including all of the records from that block, as well
+as all blocks smaller than it, and the new records to be added. The
+worst case, then, will occur when all of the existing blocks in the
+structure are full and a new, larger block must be added.
+
+In this case, the reconstruction will involve every record currently
+in the dynamized structure, and will thus have a cost of $I(n) \in
+\Theta(B(n))$.
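+Concretely, neglecting the buffer, this occurs whenever $n =
+\sum_{i=0}^{k-1} (s-1)s^i = s^k - 1$ for some $k$: the insertion that
+follows must build a new level containing all $s^k$ records.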
+\end{proof}
+
+\begin{theorem}
+The worst-case query cost for generalized BSM for a decomposable
+search problem with cost $\mathscr{Q}_s(n)$ is $\Theta(\log_s(n) \cdot
+\mathscr{Q}_s(n))$.
+\end{theorem}
+\begin{proof}
+A dynamized structure containing $n$ records consists of at most
+$\log_s(n)$ non-empty levels, each holding a single shard of no more
+than $n$ records. Because the search problem is decomposable, a query
+is answered by executing it independently against each shard and
+combining the partial results, with the combination step contributing
+only constant cost per shard. Each local query costs at most
+$\mathscr{Q}_s(n)$, so the total cost is $O(\log_s(n) \cdot
+\mathscr{Q}_s(n))$. In the worst case, every level is occupied and each
+shard must be queried, making the bound tight.
+\end{proof}
+
+\begin{theorem}
+The best-case insertion cost for generalized BSM for a decomposable
+search problem is $I_B \in \Theta(1)$.
+\end{theorem}
+\begin{proof}
+The best case occurs when an insertion triggers no reconstruction of
+existing shards. When a mutable buffer is used, this is the case
+whenever the buffer has free capacity, and the insertion is a simple
+append into the buffer. Absent a buffer, the best case occurs when the
+first level can absorb the new record, requiring only a rebuild of that
+constant-capacity level. In either case the cost is independent of $n$,
+and so $I_B \in \Theta(1)$.
+\end{proof}
+
+
+\subsection{Leveling}
+
+\subsection{Tiering}
+
+\section{General Observations}
+
+
+\begin{table*}[!t]
+\centering
+\begin{tabular}{|l l l l l|}
+\hline
+\textbf{Policy} & \textbf{Worst-case Query Cost} & \textbf{Worst-case Insert Cost} & \textbf{Best-case Insert Cost} & \textbf{Amortized Insert Cost} \\ \hline
+Gen. Bentley-Saxe &$\Theta\left(\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)\right)$ \\
+Leveling &$\Theta\left(\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(\frac{n}{s})\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2} \log_s(n)(s + 1)\right)$ \\
+Tiering &$\Theta\left(s\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \log_s(n)\right)$ \\\hline
+\end{tabular}
+\caption{Comparison of cost functions for various reconstruction policies for DSPs}
+\label{tab:policy-comp}
+\end{table*}
+