\chapter{Exploring the Design Space}
\label{chap:design-space}

\section{Introduction}

In the previous two chapters, we introduced an LSM tree inspired design
space into the Bentley-Saxe method to allow for more flexibility in
tuning performance. However, aside from some general comments
about how these parameters operate in relation to insertion and
query performance, and some limited experimental evaluation, we have not
performed a systematic analysis of this space, its capabilities, and its
limitations. We rectify this situation in this chapter, performing
both a detailed mathematical analysis of the design parameter space
and experiments demonstrating that these trade-offs exist in practice.

\subsection{Why bother?}

Before diving into the details of the design space we have introduced,
it is worth taking some time to motivate this endeavor. There is a large
body of theoretical work in the area of data structure dynamization
and, to the best of our knowledge, none of it introduces
a design space of the sort that we propose here. Despite this,
some papers that \emph{use} these techniques have introduced similar
design elements into their own implementations~\cite{pgm}, with some
even going so far as to (inaccurately) describe these elements as part
of the Bentley-Saxe method~\cite{almodaresi23}.

This situation is best understood, we think, in terms of the ultimate
goals of the respective lines of work. In the classical literature on
dynamization, the focus is mostly on proving theoretical asymptotic
bounds about the techniques. In this context, the LSM tree design space
is of limited utility, because its tuning parameters adjust constant
factors only, and thus don't play a major role in asymptotics. Where
the theoretical literature does introduce configurability, such as
with the equal block method~\cite{overmars-art-of-dyn} or more
complex schemes that nest the equal block method \emph{inside}
a binary decomposition~\cite{overmars81}, the intention is
to produce asymptotically relevant trade-offs between insert,
query, and delete performance for deletion decomposable search
problems~\cite[pg. 117]{overmars83}. This is why the equal block method
is described in terms of a function, rather than a constant value,
to enable it to appear in the asymptotics.

On the other hand, in practical scenarios, constant tuning of performance
can be very relevant. We've already shown in Sections~\ref{ssec:ds-exp}
and \ref{ssec:dyn-ds-exp} how tuning parameters, particularly the
number of shards per level, can have measurable real-world effects on the
performance characteristics of dynamized structures, and in fact sometimes
this tuning is \emph{necessary} to enable reasonable performance. It's
quite telling that the two most direct implementations of the Bentley-Saxe
method that we have identified in the literature are both in the context
of metric indices~\cite{naidan14,bkdtree}, a class of data structure
and search problem for which we saw very good performance from standard
Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. The other experiments
in Chapter~\ref{chap:framework} show that, for other types of problem,
the technique does not fare quite so well.

\section{Asymptotic Analysis}

Before deriving the cost functions of dynamized structures within the
context of our proposed design space, we should make a few comments
about the assumptions and techniques that we will use in our analysis.
As this design space involves adjusting constants, we will retain the
design-space related constants within our asymptotic expressions.
Additionally, we will perform the analysis for a simple decomposable
search problem: deletes will be entirely neglected, and we won't make
any assumptions about mergeability. These assumptions serve to simplify
the analysis.

\subsection{Generalized Bentley-Saxe Method}
As a first step, we will derive a modified version of the Bentley-Saxe
method that has been adjusted to support arbitrary scale factors and
buffering. There's nothing fundamental to the technique that prevents
such modifications, and it's likely that they have not been analyzed
in this way before simply due to a lack of interest in constant factors
in theoretical asymptotic analysis. During our analysis, we'll
intentionally leave these constant factors in place.


When generalizing the Bentley-Saxe method for arbitrary scale factors, we
decided to maintain the core concept of binary decomposition. One interesting
mathematical property of a Bentley-Saxe dynamization is that the internal
layout of levels exactly matches the binary representation of the record
count contained within the index. For example, a dynamization containing
$n=20$ records will have 4 records in the third level and 16 records in the
fifth, with all other levels being empty. If we represent a full level with
a 1 and an empty level with a 0, reading from the largest level down to the
smallest, then we'd have $10100$, which is $20$ in base 2.
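
This correspondence is easy to compute directly. The following short
Python sketch (purely illustrative, and independent of any implementation
discussed in this work) reads the occupied levels of a Bentley-Saxe
decomposition off of the bits of $n$:
\begin{verbatim}
def bsm_layout(n):
    """Return the capacities of the occupied levels for n records.

    Level i has capacity 2^i and is full exactly when bit i of n is set.
    """
    levels = []
    i = 0
    while (1 << i) <= n:
        if n & (1 << i):          # bit i of n is set -> level i is full
            levels.append(1 << i)
        i += 1
    return levels

print(bsm_layout(20))   # [4, 16]: the third and fifth levels are full
\end{verbatim}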

Our generalization, then, is to represent the data as an $s$-ary
decomposition, where the scale factor represents the base of the
representation. To accomplish this, we set the capacity of level $i$ to
be $N_b (s - 1) \cdot s^i$, where $N_b$ is the size of the buffer. The
resulting structure will have at most $\log_s n$ shards. Unfortunately,
the approach used by Bentley and Saxe to calculate the amortized insertion
cost of the BSM does not generalize to larger bases, and so we will need
to derive this result using a different approach. Note that, for this
analysis, we will neglect the buffer size $N_b$ for simplicity. It cancels
out in the analysis, and so would only serve to increase the complexity
of the expressions without contributing any additional insights.\footnote{
	The contribution of the buffer size is simply to replace each of the
	individual records considered in the analysis with batches of $N_b$
	records. The same patterns hold.
} 
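
The bound on the number of shards follows directly from these
capacities: the total capacity of the first $k$ levels forms a geometric
series,
\begin{equation*}
\sum_{i=0}^{k-1} N_b (s - 1) \cdot s^i = N_b\left(s^k - 1\right)
\end{equation*}
and so $k \approx \log_s\left(\frac{n}{N_b}\right)$ levels are sufficient
to store $n$ records, with each level contributing at most one shard.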

\begin{theorem}
The amortized insertion cost for generalized BSM with a scale factor of
$s$ is $\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)\right)$.
\end{theorem}
\begin{proof}

In order to calculate the amortized insertion cost, we will first
determine the average number of times that a record is involved in a
reconstruction, and then amortize those reconstructions over the records
in the structure.

If we consider only the first level of the structure, it's clear that
the reconstruction counts associated with the records in that level
will follow the pattern $1, 2, 3, \ldots, s-1$ when the level is full.
Thus, the total number of reconstructions associated with records on level
$i=0$ is the sum of that sequence, or
\begin{equation*}
W(0) = \sum_{j=1}^{s-1} j = \frac{1}{2}\left(s^2 - s\right)
\end{equation*}

Considering the next level, $i=1$, each reconstruction involving this
level will copy down the entirety of the structure above it, adding
one more write per record, as well as one extra write for the new record.
More specifically, continuing the above example, the first ``batch'' of
records in level $i=1$ will have the write counts $1, 2, 3, \ldots, s$;
the second ``batch'' will increment all of the existing write
counts by one and then introduce another copy of $1, 2, 3, \ldots, s$
writes; and so on.

Thus, each new ``batch'' written to level $i$ will introduce $W(i-1) + 1$
writes from the previous level into level $i$, as well as rewriting all
of the records currently on level $i$.

The net result of this is that the number of writes on level $i$ is given
by the following recurrence relation (combined with the $W(0)$ base case),

\begin{equation*}
W(i) = sW(i-1) + \frac{1}{2}\left(s-1\right)^2 \cdot s^i
\end{equation*}

which can be solved to give the following closed-form expression,
\begin{equation*}
W(i) = s^i \cdot \left(\frac{1}{2} (s-1) \cdot (s(i+1) - i)\right)
\end{equation*}
which provides the total number of reconstructions that records in
level $i$ of the structure have participated in. As each record
is involved in a different number of reconstructions, we'll consider the
average number by dividing $W(i)$ by the number of records in level $i$.
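
This closed form can be verified by substitution: for $i = 0$ it reduces
to $\frac{1}{2}(s-1)\cdot s = \frac{1}{2}\left(s^2 - s\right) = W(0)$, and
inserting it into the right-hand side of the recurrence gives
\begin{align*}
sW(i-1) + \frac{1}{2}(s-1)^2 s^i &= s^i \cdot \frac{1}{2}(s-1)\left(si - i + 1\right) + s^i \cdot \frac{1}{2}(s-1)^2 \\
&= s^i \cdot \frac{1}{2}(s-1)\left(s(i+1) - i\right) = W(i).
\end{align*}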

From here, the proof proceeds in the standard way for this sort of
analysis. The worst-case cost of a reconstruction is $B(n)$, and there
are $\log_s(n)$ total levels, so the total reconstruction cost associated
with a record can be upper-bounded by $B(n) \cdot
\frac{W(\log_s(n))}{n}$. This cost is then amortized over the $n$
insertions necessary to get the record into the last level, resulting
in an amortized insertion cost of
\begin{equation*}
\frac{B(n)}{n} \cdot  \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)
\end{equation*}
Note that, in the case of $s=2$, this expression reduces to
$\frac{B(n)}{n} \cdot \frac{1}{2}\left(\log_2 n + 2\right) \in
\Theta\left(\frac{B(n)}{n} \log_2 n\right)$, which matches the amortized
insertion cost derived using the Binomial Theorem in the original BSM
paper~\cite{saxe79}.
\end{proof}
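
The write-count argument above can also be checked numerically. The
following Python sketch (again purely illustrative, with the buffer
neglected by taking $N_b = 1$) simulates the generalized reconstruction
policy and compares the total write count of each level, at the moment it
first becomes full, against the closed form for $W(i)$:
\begin{verbatim}
def capacity(i, s):
    # capacity of level i for scale factor s, with N_b = 1
    return (s - 1) * s**i

def W(i, s):
    # closed-form total write count of a full level i
    return s**i * (s - 1) * (s * (i + 1) - i) // 2

def simulate(s, num_levels):
    levels = [[] for _ in range(num_levels)]  # per-record write counts
    observed = {}                             # level -> writes at first fill
    for _ in range(sum(capacity(i, s) for i in range(num_levels))):
        carried = 1                           # the newly inserted record
        for i in range(num_levels):
            carried += len(levels[i])
            if carried <= capacity(i, s):
                # rebuild levels 0..i (plus the new record) into level i,
                # charging one additional write to every record involved
                merged = [w + 1 for lvl in levels[:i + 1] for w in lvl] + [1]
                for j in range(i):
                    levels[j] = []
                levels[i] = merged
                if len(merged) == capacity(i, s) and i not in observed:
                    observed[i] = sum(merged)
                break
    return observed

for s in (2, 3, 4):
    obs = simulate(s, num_levels=4)
    print(s, all(obs[i] == W(i, s) for i in range(4)))   # True for each s
\end{verbatim}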

\begin{theorem}
The worst-case insertion cost for generalized BSM with a scale factor
of $s$ is $\Theta(B(n))$.
\end{theorem}
\begin{proof}
The Bentley-Saxe method finds the smallest non-full block and performs
a reconstruction including all of the records from that block, all
blocks smaller than it, and the new records to be added. The
worst case, then, occurs when all of the existing blocks in the
structure are full and a new, larger block must be added.

In this case, the reconstruction will involve every record currently
in the dynamized structure, and will thus have a cost of $I(n) \in
\Theta(B(n))$.
\end{proof}
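
Concretely, this worst case arises immediately after the structure
reaches the total capacity of its current level count: when
$n = N_b\left(s^k - 1\right)$, all $k$ levels are full, and the next
insertion must construct a new level $k$ containing every one of the $n$
existing records along with the newly inserted batch, at a cost of
$\Theta(B(n))$.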

\begin{theorem}
The worst-case query cost for generalized BSM for a decomposable
search problem with cost $\mathscr{Q}_s(n)$ is $\Theta(\log_s(n) \cdot
\mathscr{Q}_s(n))$.
\end{theorem}
\begin{proof}
The dynamized structure consists of at most $\log_s(n)$ shards, each
containing at most $n$ records. Because the search problem is
decomposable, a query must be evaluated independently against each
shard, at a cost of at most $\mathscr{Q}_s(n)$ per shard, with the
partial results combined using the constant-time merge operation. In
the worst case every shard is non-empty, giving a total cost of
$\Theta(\log_s(n) \cdot \mathscr{Q}_s(n))$.
\end{proof}

\begin{theorem}
The best-case insertion cost for generalized BSM for a decomposable
search problem is $I_B \in \Theta(1)$.
\end{theorem}
\begin{proof}
In the best case, the mutable buffer has spare capacity when a new
record arrives. The insert then simply appends the record to the buffer
and triggers no reconstructions, which requires constant time. Thus
$I_B \in \Theta(1)$.
\end{proof}


\subsection{Leveling}

\subsection{Tiering}

\section{General Observations}


\begin{table*}[!t]
\centering
\begin{tabular}{|l l l l l|}
\hline
\textbf{Policy} & \textbf{Worst-case Query Cost} & \textbf{Worst-case Insert Cost} & \textbf{Best-case Insert Cost} & \textbf{Amortized Insert Cost} \\ \hline
Gen. Bentley-Saxe &$\Theta\left(\log_s(n) \cdot Q(n)\right)$  &$\Theta\left(B(n)\right)$           &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot  \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)\right)$ \\
Leveling          &$\Theta\left(\log_s(n) \cdot Q(n)\right)$  &$\Theta\left(B(\frac{n}{s})\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2} \log_s(n)(s + 1)\right)$ \\
Tiering           &$\Theta\left(s\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(n)\right)$           &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \log_s(n)\right)$ \\\hline
\end{tabular}
\caption{Comparison of cost functions for various reconstruction policies for DSPs}
\label{tab:policy-comp}
\end{table*}