From d49867867cf950a85cabcbcbaf43bb762c9912ac Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Thu, 22 May 2025 14:34:30 -0400
Subject: updates

---
 chapters/beyond-dsp.tex   |   4 +-
 chapters/design-space.tex | 220 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 222 insertions(+), 2 deletions(-)

diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 38a99c5..87f44ba 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -1353,7 +1353,7 @@ structures. Specifically,
 
 \subsection{Design Space Evaluation}
-
+\label{ssec:dyn-ds-exp}
 \begin{figure}
 %\vspace{0pt}
 \centering
@@ -1519,7 +1519,7 @@ in this case, doesn't add a significant amount of overhead over a single
 instance of the static structure.
 
 \subsection{$k$-NN Search}
-
+\label{ssec:dyn-knn-exp}
 Next, we'll consider answering high dimensional exact $k$-NN queries
 using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a
 binary search tree with internal nodes that partition records based
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 47f728a..10278bd 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -1,2 +1,222 @@
 \chapter{Exploring the Design Space}
 \label{chap:design-space}
+
+\section{Introduction}
+
+In the previous two chapters, we introduced an LSM-tree-inspired design
+space into the Bentley-Saxe method to allow for more flexibility in
+tuning its performance. However, aside from some general comments about
+how these parameters operate in relation to insertion and query
+performance, and some limited experimental evaluation, we have not
+performed a systematic analysis of this space, its capabilities, and
+its limitations. We will rectify this situation in this chapter,
+performing both a detailed mathematical analysis of the design
+parameter space and experiments demonstrating that these trade-offs
+exist in practice.
+
+\subsection{Why bother?}
+
+Before diving into the details of the design space we have introduced,
+it's worth taking some time to motivate this entire endeavor. There is
+a large body of theoretical work in the area of data structure
+dynamization and, to the best of our knowledge, none of these papers
+have introduced a design space of the sort that we have introduced
+here. Despite this, some papers which \emph{use} these techniques have
+introduced similar design elements into their own
+implementations~\cite{pgm}, with some even going so far as to
+(inaccurately) describe these elements as part of the Bentley-Saxe
+method~\cite{almodaresi23}.
+
+This situation is best understood, we think, in terms of the ultimate
+goals of the respective lines of work. In the classical literature on
+dynamization, the focus is mostly on proving theoretical asymptotic
+bounds about the techniques. In this context, the LSM tree design space
+is of limited utility, because its tuning parameters adjust constant
+factors only, and thus don't play a major role in the asymptotics.
+Where the theoretical literature does introduce configurability, such
+as with the equal block method~\cite{overmars-art-of-dyn} or more
+complex schemes that nest the equal block method \emph{inside} of a
+binary decomposition~\cite{overmars81}, the intention is to produce
+asymptotically relevant trade-offs between insert, query, and delete
+performance for deletion decomposable search
+problems~\cite[pg. 117]{overmars83}. This is why the equal block method
+is described in terms of a function, rather than a constant value: so
+that it can appear in the asymptotics.
+
+On the other hand, in practical scenarios, constant-factor performance
+tuning can be very relevant. We've already shown in
+Sections~\ref{ssec:ds-exp} and \ref{ssec:dyn-ds-exp} how tuning
+parameters, particularly the number of shards per level, can have
+measurable real-world effects on the performance characteristics of
+dynamized structures, and in fact sometimes this tuning is
+\emph{necessary} to enable reasonable performance. It's quite telling
+that the two most direct implementations of the Bentley-Saxe method
+that we have identified in the literature are both in the context of
+metric indices~\cite{naidan14,bkdtree}, a class of data structure and
+search problem for which we saw very good performance from standard
+Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. The other experiments
+in Chapter~\ref{chap:framework} show that, for other types of problems,
+the technique does not fare quite so well.
+
+\section{Asymptotic Analysis}
+
+Before beginning with derivations for the cost functions of dynamized
+structures within the context of our proposed design space, we should
+make a few comments about the assumptions and techniques that we will
+use in our analysis. As this design space involves adjusting constants,
+we will leave the design-space-related constants within our asymptotic
+expressions. Additionally, we will perform the analysis for a simple
+decomposable search problem. Deletes will be entirely neglected, and we
+won't make any assumptions about mergeability. These assumptions serve
+to simplify the analysis.
+
+\subsection{Generalized Bentley-Saxe Method}
+As a first step, we will derive a modified version of the Bentley-Saxe
+method that has been adjusted to support arbitrary scale factors and
+buffering. There's nothing fundamental to the technique that prevents
+such modifications, and it's likely that they have not been analyzed
+like this before simply out of a lack of interest in constant factors
+in theoretical asymptotic analysis. Throughout our analysis, we'll
+intentionally leave these constant factors in place.
+
+When generalizing the Bentley-Saxe method for arbitrary scale factors,
+we decided to maintain the core concept of binary decomposition. One
+interesting mathematical property of a Bentley-Saxe dynamization is
+that the internal layout of levels exactly matches the binary
+representation of the record count contained within the index. For
+example, a dynamization containing $n=20$ records will have 4 records
+in level 2 and 16 records in level 4 (indexing levels from zero, with
+level $i$ having a capacity of $2^i$ records), with all other levels
+being empty. If we represent a full level with a 1 and an empty level
+with a 0, then, reading from the largest level down, we'd have
+$10100$, which is $20$ in base 2.
+
+Our generalization, then, is to represent the data as an $s$-ary
+decomposition, where the scale factor represents the base of the
+representation. To accomplish this, we set the capacity of level $i$ to
+be $N_b (s - 1) \cdot s^i$, where $N_b$ is the size of the buffer. The
+resulting structure will have at most $\log_s n$ shards. Unfortunately,
+the approach used by Bentley and Saxe to calculate the amortized
+insertion cost of the BSM does not generalize to larger bases, and so
+we will need to derive this result using a different approach.
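+
+To make this correspondence concrete, the following short Python sketch
+(purely illustrative, and not part of our framework implementation)
+computes the per-level record counts of an $s$-ary decomposition as the
+digits of $n$ written in base $s$, taking $N_b = 1$ for simplicity.
+
+\begin{verbatim}
+def level_occupancy(n, s):
+    # Level i of the s-ary decomposition holds d_i * s**i records,
+    # where d_i is the i-th digit of n written in base s. With
+    # N_b = 1, the capacity of level i is (s - 1) * s**i.
+    digits = []
+    while n > 0:
+        n, d = divmod(n, s)
+        digits.append(d)
+    return digits
+
+# The binary case reproduces the example above: 20 is 10100 in
+# base 2, so levels 2 and 4 are full and all others are empty.
+print(level_occupancy(20, 2))  # [0, 0, 1, 0, 1]
+
+# With a scale factor of s = 4, the same 20 records decompose as
+# 20 = 110 in base 4: one batch of 16 records in level 2 and one
+# batch of 4 records in level 1.
+print(level_occupancy(20, 4))  # [0, 1, 1]
+\end{verbatim}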
+Note that, for this analysis, we will neglect the buffer size $N_b$
+for simplicity. It cancels out in the analysis, and so would only
+serve to increase the complexity of the expressions without
+contributing any additional insights.\footnote{
+    The contribution of the buffer size is simply to replace each of
+    the individual records considered in the analysis with batches of
+    $N_b$ records. The same patterns hold.
+}
+
+\begin{theorem}
+The amortized insertion cost for generalized BSM with a scale factor of
+$s$ is $\Theta\left(\frac{B(n)}{n} \cdot
+\frac{1}{2}\left((s-1)\log_s n + s\right)\right)$.
+\end{theorem}
+\begin{proof}
+In order to calculate the amortized insertion cost, we will first
+determine the average number of times that a record is involved in a
+reconstruction, and then amortize those reconstructions over the
+insertions into the structure.
+
+If we consider only the first level of the structure, it's clear that
+the reconstruction counts associated with the records in that level
+will follow the pattern $1, 2, 3, \ldots, s-1$ when the level is full.
+Thus, the total number of reconstructions associated with records on
+level $i=0$ is the sum of that sequence, or
+\begin{equation*}
+W(0) = \sum_{j=1}^{s-1} j = \frac{1}{2}\left(s^2 - s\right)
+\end{equation*}
+
+Considering the next level, $i=1$, each reconstruction involving this
+level will copy down the entirety of the structure above it, adding one
+more write per record, as well as one write for the newly inserted
+record. More specifically, the first ``batch'' of records in level
+$i=1$ will have the write counts $1, 2, 3, \ldots, s$; the second
+``batch'' of records will increment all of the existing write counts by
+one, and then introduce another copy of $1, 2, 3, \ldots, s$ writes;
+and so on.
+
+Thus, each new ``batch'' written to level $i$ will introduce
+$W(i-1) + 1$ writes from the previous level into level $i$, as well as
+rewriting all of the records currently on level $i$.
+
+The net result of this is that the total write count associated with
+the records on level $i$ is given by the following recurrence relation
+(combined with the $W(0)$ base case),
+\begin{equation*}
+W(i) = sW(i-1) + \frac{1}{2}\left(s-1\right)^2 \cdot s^i
+\end{equation*}
+which can be solved to give the following closed-form expression,
+\begin{equation*}
+W(i) = s^i \cdot \left(\frac{1}{2} (s-1) \cdot (s(i+1) - i)\right)
+\end{equation*}
+which gives the total number of reconstructions that the records in
+level $i$ of the structure have participated in over their lifetimes.
+As each record is involved in a different number of reconstructions,
+we'll consider the average number, obtained by dividing $W(i)$ by the
+$(s-1) \cdot s^i$ records in a full level $i$,
+\begin{equation*}
+\frac{W(i)}{(s-1) \cdot s^i} = \frac{1}{2}\left((s-1) \cdot i + s\right)
+\end{equation*}
+
+From here, the proof proceeds in the standard way for this sort of
+analysis. The structure contains $\log_s(n)$ total levels, and so each
+record participates in an average of at most
+$\frac{1}{2}\left((s-1)\log_s n + s\right)$ reconstructions over its
+lifetime. Assuming that $B(n)/n$ is non-decreasing, each of these
+writes has a cost of at most $\frac{B(n)}{n}$, and charging each
+insertion for the full lifetime reconstruction cost of its record
+results in an amortized insertion cost of,
+\begin{equation*}
+\frac{B(n)}{n} \cdot \frac{1}{2}\left((s-1)\log_s n + s\right)
+\end{equation*}
+Note that, in the case of $s=2$, this expression reduces to the same
+amortized insertion cost as was derived using the Binomial Theorem in
+the original BSM paper~\cite{saxe79}.
+\end{proof}
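+
+The write-count pattern at the heart of this proof is straightforward
+to check empirically. The following Python sketch (again purely
+illustrative, and assuming $N_b = 1$) simulates the generalized
+reconstruction policy directly, tracking the number of times each
+record has been written, and compares the resulting per-level totals
+against the closed-form expression for $W(i)$.
+
+\begin{verbatim}
+def observed_W(L, s):
+    # counts[i] holds the lifetime write count of each record
+    # currently residing in level i.
+    counts = [[]]
+    for _ in range(s ** (L + 1) - 1):  # exactly fills levels 0..L
+        i = 0  # find the smallest level with spare capacity
+        while i < len(counts) and len(counts[i]) == (s - 1) * s ** i:
+            i += 1
+        if i == len(counts):
+            counts.append([])
+        # rebuild level i from levels 0..i-1, its own records, and
+        # the new record; every participating record gains a write
+        counts[i] = [c + 1 for lvl in counts[:i + 1] for c in lvl] + [1]
+        for j in range(i):
+            counts[j] = []
+    return [sum(lvl) for lvl in counts]
+
+def W(i, s):  # the closed form derived in the proof
+    return s ** i * (s - 1) * (s * (i + 1) - i) // 2
+
+for s in (2, 3, 4):
+    assert observed_W(3, s) == [W(i, s) for i in range(4)]
+\end{verbatim}
+
+Dividing each per-level total by the $(s-1) \cdot s^i$ records in that
+level likewise reproduces the average write count used in the final
+step of the proof.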
+\begin{theorem}
+The worst-case insertion cost for generalized BSM with a scale factor
+of $s$ is $\Theta(B(n))$.
+\end{theorem}
+\begin{proof}
+The Bentley-Saxe method finds the smallest non-full block and performs
+a reconstruction including all of the records from that block, as well
+as all blocks smaller than it, and the new record to be added. The
+worst case, then, will occur when all of the existing blocks in the
+structure are full, and a new, larger, block must be added.
+
+In this case, the reconstruction will involve every record currently
+in the dynamized structure, and will thus have a cost of $I(n) \in
+\Theta(B(n))$.
+\end{proof}
+
+\begin{theorem}
+The worst-case query cost for generalized BSM for a decomposable
+search problem with static query cost $\mathscr{Q}_s(n)$ is
+$\Theta(\log_s(n) \cdot \mathscr{Q}_s(n))$.
+\end{theorem}
+\begin{proof}
+The dynamized structure consists of at most $\log_s(n)$ blocks, each
+containing at most $n$ records. Because the search problem is
+decomposable, the query must be evaluated independently against each
+block, at a cost of at most $\mathscr{Q}_s(n)$ per block, and the
+partial results can then be merged at constant cost per block. In the
+worst case, every block is non-empty, and so the total query cost is
+$\Theta(\log_s(n) \cdot \mathscr{Q}_s(n))$.
+\end{proof}
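+
+To illustrate the query procedure that this proof describes, the
+following Python sketch (hypothetical, using range counting as a
+stand-in decomposable search problem) evaluates the query against each
+block independently and merges the partial results, with simple
+addition serving as the constant-time merge operator.
+
+\begin{verbatim}
+import bisect
+
+def dynamized_query(blocks, lo, hi):
+    # Each block is a sorted list of keys; there are at most
+    # log_s(n) of them. The partial results of the decomposable
+    # search problem merge with a constant-time operator (+).
+    result = 0
+    for block in blocks:
+        result += (bisect.bisect_right(block, hi)
+                   - bisect.bisect_left(block, lo))
+    return result
+
+# A 4-ary decomposition of n = 20 records into blocks of 4 and
+# 16 records, as in the earlier example.
+blocks = [[3, 9, 14, 17], sorted(range(0, 32, 2))]
+print(dynamized_query(blocks, 5, 15))  # prints 7
+\end{verbatim}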
+\begin{theorem}
+The best-case insertion cost for generalized BSM for a decomposable
+search problem is $I_B \in \Theta(1)$.
+\end{theorem}
+\begin{proof}
+The best case occurs when the buffer has spare capacity for the new
+record. In this case, the insertion requires only appending the record
+to the buffer, no reconstruction is triggered, and the operation
+completes in $\Theta(1)$ time.
+\end{proof}
+
+
+\subsection{Leveling}
+
+\subsection{Tiering}
+
+\section{General Observations}
+
+
+\begin{table*}[!t]
+\centering
+\begin{tabular}{|l l l l l|}
+\hline
+\textbf{Policy} & \textbf{Worst-case Query Cost} & \textbf{Worst-case Insert Cost} & \textbf{Best-case Insert Cost} & \textbf{Amortized Insert Cost} \\ \hline
+Gen. Bentley-Saxe &$\Theta\left(\log_s(n) \cdot \mathscr{Q}_s(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}\left((s-1)\log_s n + s\right)\right)$ \\
+Leveling &$\Theta\left(\log_s(n) \cdot \mathscr{Q}_s(n)\right)$ &$\Theta\left(B(\frac{n}{s})\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2} \log_s(n)(s + 1)\right)$ \\
+Tiering &$\Theta\left(s\log_s(n) \cdot \mathscr{Q}_s(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \log_s(n)\right)$ \\\hline
+\end{tabular}
+\caption{Comparison of cost functions for various reconstruction
+policies for DSPs}
+\label{tab:policy-comp}
+\end{table*}
--
cgit v1.2.3