Diffstat (limited to 'chapters/background.tex')
-rw-r--r--  chapters/background.tex  316
1 file changed, 202 insertions, 114 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 75e2b59..332dbb6 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -81,13 +81,13 @@ their work on dynamization, and we will adopt their definition,
\label{def:dsp}
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
only if there exists a constant-time computable, associative, and
- commutative binary operator $\square$ such that,
+ commutative binary operator $\mergeop$ such that,
\begin{equation*}
- F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
\end{definition}
-The requirement for $\square$ to be constant-time was used by Bentley and
+The requirement for $\mergeop$ to be constant-time was used by Bentley and
Saxe to prove specific performance bounds for answering queries from a
decomposed data structure. However, it is not strictly \emph{necessary},
and later work by Overmars lifted this constraint and considered a more
@@ -97,15 +97,15 @@ problems},
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
if and only if there exists an $O(C(n))$-time computable, associative,
- and commutative binary operator $\square$ such that,
+ and commutative binary operator $\mergeop$ such that,
\begin{equation*}
- F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
\end{definition}
To demonstrate that a search problem is decomposable, it is necessary to
-show the existence of the merge operator, $\square$, with the necessary
-properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B,
+show the existence of the merge operator, $\mergeop$, with the required
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
q)$. With these two results, induction demonstrates that the problem is
decomposable even in cases with more than two partial results.
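As an informal illustration (ours, not part of the original development),
the following sketch folds an assumed constant-time merge operator over
any number of per-partition partial results, using range counting with
addition as the operator (a choice justified by the theorem below),
\begin{verbatim}
from functools import reduce

# Hypothetical partitions of a small integer dataset.
partitions = [[1, 5, 9], [2, 8], [3, 4, 7, 12]]

def range_count(data, lo, hi):
    # Local query: count the records falling in [lo, hi].
    return sum(1 for x in data if lo <= x <= hi)

# Partial results for the query q = [3, 9], one per partition.
partials = [range_count(p, 3, 9) for p in partitions]

# The merge operator for range count is addition; folding it over
# any number of partial results yields the full answer.
result = reduce(lambda a, b: a + b, partials)
assert result == range_count(sum(partitions, []), 3, 9)
\end{verbatim}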
@@ -121,7 +121,7 @@ Range Count is a decomposable search problem.
\end{theorem}
\begin{proof}
-Let $\square$ be addition ($+$). Applying this to
+Let $\mergeop$ be addition ($+$). Applying this to
Definition~\ref{def:dsp} gives
\begin{align*}
|(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
@@ -146,11 +146,11 @@ The calculation of the arithmetic mean of a set of numbers is a DSP.
contains the sum of the values within the input set, and the
cardinality of the input set. For two disjoint partitions of the data,
$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
-$A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$.
+$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
Applying Definition~\ref{def:dsp} gives
\begin{align*}
- A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\
+ A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\
(s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
\end{align*}
From this result, the average can be determined in constant time by
@@ -365,7 +365,6 @@ support updates, and so a general strategy for adding update support
would increase the number of data structures that could be used as
database indices. We refer to a data structure with update support as
\emph{dynamic}, and one without update support as \emph{static}.\footnote{
-
The term static is distinct from immutable. Static refers to the
layout of records within the data structure, whereas immutable
refers to the data stored within those records. This distinction
@@ -385,6 +384,16 @@ requirements, and then examine several classical dynamization techniques.
The section will conclude with a discussion of delete support within the
context of these techniques.
+It is worth noting that a variety of techniques have been
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or for taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general and do not require
+workload-specific assumptions.
+
+
\subsection{Global Reconstruction}
The most fundamental dynamization technique is that of \emph{global
@@ -399,125 +408,204 @@ possible if $\mathcal{I}$ supports the following two operations,
\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
\end{align*}
-where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$
+where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
of the data structure over a set of records $d \subseteq \mathcal{D}$
in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
-\subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in
+\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
$\Theta(1)$ time,\footnote{
There is no practical reason why $\mathtt{unbuild}$ must run
in constant time, but this is the assumption made in \cite{saxe79}
and in subsequent work based on it, and so we will follow the same
definition here.
-} such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$.
-
+} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
+Given this structure, an insert of record $r \in \mathcal{D}$ into a
+data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
+\begin{align*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
+\end{align*}
+This operation is clearly sub-optimal, as the insertion cost is
+$\Theta(C(n))$, and $C(n) \in \Omega(n)$ at best for most data
+structures. However, this global reconstruction strategy can be used
+as a primitive for more sophisticated techniques that provide
+reasonable performance.
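+To make the interface concrete, the following sketch (ours, not from
+the original literature) shows insertion via global reconstruction for
+a hypothetical \texttt{build}/\texttt{unbuild} pair, using a sorted
+array as the static structure,
+\begin{verbatim}
+class SortedArray:
+    # A static structure: built once and never modified in place.
+    def __init__(self, records):
+        self.records = sorted(records)   # build cost C(n) = O(n log n)
+
+def build(records):
+    return SortedArray(records)
+
+def unbuild(structure):
+    return list(structure.records)       # recover the record set
+
+def insert(structure, r):
+    # Global reconstruction: every insert rebuilds the entire
+    # structure, at Theta(C(n)) cost.
+    return build(unbuild(structure) + [r])
+
+s = build([7, 2, 9])
+s = insert(s, 4)
+assert s.records == [2, 4, 7, 9]
+\end{verbatim}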
+
+\subsection{Amortized Global Reconstruction}
+\label{ssec:agr}
+
+The problem with global reconstruction is that each insert must rebuild
+the entire data structure, involving all of its records. This results
+in a worst-case insert cost of $\Theta(C(n))$. However, opportunities
+for improving this scheme can present themselves when considering the
+\emph{amortized} insertion cost.
+
+Consider the cost accrued by the dynamized structure under global
+reconstruction over its lifetime. Each insert rewrites all of the
+existing records, and so each of the $n$ inserts incurs a worst-case
+reconstruction cost of $\Theta(C(n))$. Amortizing this total cost over
+the $n$ records inserted gives an amortized insertion cost for global
+reconstruction of,
+
+\begin{equation*}
+I_a(n) = \frac{C(n) \cdot n}{n} = C(n)
+\end{equation*}
+
+This does not improve matters by itself; however, it does present two
+opportunities for improvement. If we could reduce either the size of
+the reconstructions or the number of times a record is reconstructed,
+then we could reduce the amortized insertion cost.
+
+The key insight, first discussed by Bentley and Saxe, is that
+this goal can be accomplished by \emph{decomposing} the data
+structure into multiple, smaller structures, each built from a
+disjoint partition of the data. As long as the search problem
+being considered is decomposable, queries can be answered from
+this structure with bounded worst-case overhead, and the amortized
+insertion cost can be improved~\cite{saxe79}. Significant theoretical
+work exists in evaluating different strategies for decomposing the
+data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
+specific efficiencies of the data structures being considered to improve
+these reconstructions~\cite{merge-dsp}.
+
+Two general decomposition techniques emerged from this work. The
+earlier of the two is the logarithmic method, often called the
+Bentley-Saxe method in modern literature, and it remains the most
+commonly discussed technique today. A later technique, the equal block
+method, was also examined. It is generally not as effective as the
+Bentley-Saxe method, but it has some useful properties for explanatory
+purposes and so will be discussed here as well.
+
+\subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}}
+\label{ssec:ebm}
+
+Though chronologically later, the equal block method is theoretically a
+bit simpler, and so we will begin our discussion of decomposition-based
+techniques for dynamizing decomposable search problems with it. The
+core concept of the equal block method is to decompose the data structure
+into several smaller data structures, called blocks, over partitions
+of the data. This decomposition is performed such that each block is of
+roughly equal size.
+
+Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
+some decomposable search problem, $F$, and is built over a set of records
+$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
+$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$, each built over
+one of the partitions $d_1, d_2, \ldots, d_s$ of $d$. Fixing $s$ to a specific value
+makes little sense when the number of records changes, and so it is taken
+to be governed by a smooth, monotonically increasing function $f(n)$ such
+that, at any point, the following two constraints are obeyed.
+\begin{align}
+ f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
+ \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
+\end{align}
+where $|\mathscr{I}_j|$ is the number of records in the block,
+$|\text{unbuild}(\mathscr{I}_j)|$.
+
+A new record is inserted by finding the smallest block and rebuilding it
+using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
+then an insert is done by,
+\begin{equation*}
+\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
+\end{equation*}
+Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
+ Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
+ violated by deletes. We omit deletes from the discussion at
+ this point, but will return to them in Section~\ref{sec:deletes}.
+} In this case, the constraints are enforced by ``reconfiguring'' the
+structure: $s$ is updated to be exactly $f(n)$, all of the existing
+blocks are unbuilt, and then the records are redistributed evenly into
+$s$ blocks.
+
+A query with parameters $q$ is answered by querying each of the blocks
+individually and merging the local results together with $\mergeop$,
+\begin{equation*}
+F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
+\end{equation*}
+where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
+answering the query over $d$ using the data structure $\mathscr{I}$.
+
+This technique provides better amortized insertion bounds than global
+reconstruction, at the possible cost of worse query performance for
+sub-linear queries. We omit the details of the performance proofs
+for brevity and streamline some of the original notation (full details
+can be found in~\cite{overmars83}), but this technique ultimately
+results in a data structure with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
+\end{align*}
+where $C(n)$ is the cost of statically building $\mathcal{I}$, and
+$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
+%TODO: example?
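+As a rough illustration of the mechanics described above (ours, not
+taken from~\cite{overmars83}), the following sketch implements the
+insert and query logic for a hypothetical block type, parameterized by
+\texttt{build}, \texttt{query}, and \texttt{merge} routines and using
+$f(n) = \sqrt{n}$,
+\begin{verbatim}
+import math
+
+class EqualBlockDynamization:
+    def __init__(self, build, query, merge,
+                 f=lambda n: max(1, math.isqrt(n))):
+        self.build, self.query = build, query
+        self.merge, self.f = merge, f
+        self.blocks = []   # list of (records, structure) pairs
+
+    def _reconfigure(self, records):
+        # Set s to exactly f(n) and redistribute the records evenly.
+        s = self.f(len(records))
+        size = math.ceil(len(records) / s)
+        parts = [records[i:i+size] for i in range(0, len(records), size)]
+        self.blocks = [(p, self.build(p)) for p in parts]
+
+    def insert(self, r):
+        n = sum(len(recs) for recs, _ in self.blocks) + 1
+        if not self.blocks:
+            self.blocks = [([r], self.build([r]))]
+        else:
+            # Rebuild the smallest block with the new record added.
+            k = min(range(len(self.blocks)),
+                    key=lambda j: len(self.blocks[j][0]))
+            recs = self.blocks[k][0] + [r]
+            self.blocks[k] = (recs, self.build(recs))
+        # Reconfigure if f(n/2) <= s <= f(2n) no longer holds.
+        if not (self.f(n // 2) <= len(self.blocks) <= self.f(2 * n)):
+            self._reconfigure([x for recs, _ in self.blocks for x in recs])
+
+    def answer(self, q):
+        # Query each block and fold the partial results with merge.
+        partials = [self.query(st, q) for _, st in self.blocks]
+        result = partials[0]
+        for p in partials[1:]:
+            result = self.merge(result, p)
+        return result
+
+# Example usage: range counting, with addition as the merge operator.
+ebm = EqualBlockDynamization(
+    build=sorted,
+    query=lambda s, q: sum(q[0] <= x <= q[1] for x in s),
+    merge=lambda a, b: a + b)
+for x in [5, 1, 9, 3, 7, 2]:
+    ebm.insert(x)
+assert ebm.answer((2, 7)) == 4
+\end{verbatim}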
-\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
+\subsection{The Bentley-Saxe Method~\cite{saxe79}}
\label{ssec:bsm}
-Another approach to support updates is to amortize the cost of
-global reconstruction over multiple updates. This approach can take
-take three forms,
-\begin{enumerate}
-
- \item Pairing a dynamic data structure (called a buffer or
- memtable) with an instance of the structure being extended.
- Updates are written to the buffer, and when the buffer is
- full its records are merged with those in the static
- structure, and the structure is rebuilt. This approach is
- used by one version of the originally proposed
- LSM-tree~\cite{oneil96}. Technically this technique proposed
- in that work for the purposes of converting random writes
- into sequential ones (all structures involved are dynamic),
- but it can be used for dynamization as well.
-
- \item Creating multiple, smaller data structures each
- containing a partition of the records from the dataset, and
- reconstructing individual structures to accommodate new
- inserts in a systematic manner. This technique is the basis
- of the Bentley-Saxe method~\cite{saxe79}.
-
- \item Using both of the above techniques at once. This is
- the approach used by modern incarnations of the
- LSM-tree~\cite{rocksdb}.
-
-\end{enumerate}
-
-In all three cases, it is necessary for the search problem associated
-with the index to be a DSP, as answering it will require querying
-multiple structures (the buffer and/or one or more instances of the
-data structure) and merging the results together to get a final
-result. This section will focus exclusively on the Bentley-Saxe
-method, as it is the basis for the proposed methodology.
-
-When dividing records across multiple structures, there is a clear
-trade-off between read performance and write performance. Keeping
-the individual structures small reduces the cost of reconstructing,
-and thereby increases update performance. However, this also means
-that more structures will be required to accommodate the same number
-of records, when compared to a scheme that allows the structures
-to be larger. As each structure must be queried independently, this
-will lead to worse query performance. The reverse is also true,
-fewer, larger structures will have better query performance and
-worse update performance, with the extreme limit of this being a
-single structure that is fully rebuilt on each insert.
-
-The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
-good balance can be struck by using a geometrically increasing
-structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
-with the base level having a capacity of a single record, and
-each subsequent level doubling in capacity. When an update is
-performed, the first empty level is located and a reconstruction
-is triggered, merging the structures of all levels below this empty
-one, along with the new record. The merits of this approach are
-that it ensures that ``most'' reconstructions involve the smaller
-data structures towards the bottom of the sequence, while most of
-the records reside in large, infrequently updated, structures towards
-the top. This balances between the read and write implications of
-structure size, while also allowing the number of structures required
-to represent $n$ records to be worst-case bounded by $O(\log n)$.
-
-Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$
-query cost, the Bentley-Saxe Method will produce a dynamic data
-structure with,
+%FIXME: switch this section (and maybe the previous?) over to being
+% indexed at 0 instead of 1
+
+The original, and most frequently used, dynamization technique is the
+Bentley-Saxe Method (BSM), also called the logarithmic method in older
+literature. Rather than breaking the data structure into equally sized
+blocks, BSM decomposes the structure into logarithmically many blocks
+of exponentially increasing size. More specifically, the data structure
+is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_0,
+\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
+will either be empty or contain exactly $2^i$ records.
+
+The procedure for inserting a record, $r \in \mathcal{D}$, into
+a BSM dynamization is as follows. If the block $\mathscr{I}_0$
+is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
+empty, then there will exist a maximal sequence of non-empty blocks
+$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
+0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
+$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
+\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
+$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
+end of the structure as needed.
+
+%FIXME: switch the x's to r's for consistency
+\begin{figure}
+\centering
+\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
+\caption{An illustration of inserts into the Bentley-Saxe Method}
+\label{fig:bsm-example}
+\end{figure}
+
+Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
+dynamization is built over a set of records $x_1, x_2, \ldots,
+x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
+$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
+into $\mathscr{I}_0$. For the next insert, $x_{12}$, the
+first empty block is $\mathscr{I}_2$, and so the insert is performed by
+doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
+\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
+and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
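+
+The insertion procedure itself is short enough to sketch directly. The
+following is our own illustration (again assuming a hypothetical
+\texttt{build}/\texttt{unbuild} interface), and follows the description
+above rather than any particular published implementation,
+\begin{verbatim}
+def bsm_insert(blocks, r, build, unbuild):
+    # blocks[i] is either None (empty) or a structure holding
+    # exactly 2**i records.
+    merged = [r]
+    i = 0
+    # Unbuild the maximal prefix of non-empty blocks.
+    while i < len(blocks) and blocks[i] is not None:
+        merged.extend(unbuild(blocks[i]))
+        blocks[i] = None
+        i += 1
+    if i == len(blocks):
+        blocks.append(None)       # grow the structure by one level
+    blocks[i] = build(merged)     # exactly 2**i records land here
+    return blocks
+
+# Example usage: after twelve inserts (records 1 through 12), only
+# blocks 2 and 3 are occupied, holding 4 and 8 records respectively.
+blocks = []
+for x in range(1, 13):
+    blocks = bsm_insert(blocks, x, build=sorted, unbuild=list)
+assert [b is not None for b in blocks] == [False, False, True, True]
+\end{verbatim}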
+
-\begin{align}
- \text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\
- \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
-\end{align}
-In the case of a $C(n)$-decomposable problem, the query cost grows to
-\begin{equation}
- O\left((Q_s(n) + C(n)) \cdot \log n\right)
-\end{equation}
-While the Bentley-Saxe method manages to maintain good performance in
-terms of \emph{amortized} insertion cost, it has has poor worst-case performance. If the
-entire structure is full, it must grow by another level, requiring
-a full reconstruction involving every record within the structure.
-A slight adjustment to the technique, due to Overmars and van
-Leeuwen~\cite{overmars81}, allows for the worst-case insertion cost to be bounded by
-$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing
-each reconstruction into small pieces, one of which is executed
-each time a new update occurs. This has the effect of bounding the
-worst-case performance, but does so by sacrificing the expected
-case performance, and adds a lot of complexity to the method. This
-technique is not used much in practice.\footnote{
- We've yet to find any example of it used in a journal article
- or conference paper.
-}
-\section{Limitations of the Bentley-Saxe Method}
+\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}
-While fairly general, the Bentley-Saxe method has a number of limitations. Because
-of the way in which it merges query results together, the number of search problems
-to which it can be efficiently applied is limited. Additionally, the method does not
-expose any trade-off space to configure the structure: it is one-size fits all.
+While fairly general, the Bentley-Saxe method has a number of
+limitations. Because of the way in which it merges query results together,
+the number of search problems to which it can be efficiently applied is
+limited. Additionally, the method does not expose any trade-off space
+to configure the structure: it is one-size-fits-all.
\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}
@@ -610,12 +698,12 @@ thought of as KNN with $k=1$), this problem is \emph{not} decomposable.
To prove this, consider the query $KNN(D, q, k)$ against some partitioned
dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If KNN is decomposable,
then there must exist some constant-time, commutative, and associative
-binary operator $\square$, such that $R = \square_{0 \leq i \leq l}
+binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq \ell}
R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
k)$. Consider the evaluation of the merge operator against two arbitrary
-result sets, $R = R_i \square R_j$. It is clear that $|R| = |R_i| =
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
|R_j| = k$, and that the contents of $R$ must be the $k$ records from
-$R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$
time. Therefore, KNN is not a decomposable search problem.
\end{proof}
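To make the obstruction concrete, consider the following sketch (ours,
assuming one-dimensional points and absolute distance). Merging two
local result sets is itself a $k$-nearest neighbor computation over
$R_i \cup R_j$, and so cannot be performed by a constant-time operator,
\begin{verbatim}
import heapq

def knn(data, q, k):
    # k-nearest neighbors of q among 1-D points.
    return heapq.nsmallest(k, data, key=lambda x: abs(x - q))

D0, D1 = [1, 4, 10, 15], [2, 6, 7, 20]
q, k = 5, 3

R0, R1 = knn(D0, q, k), knn(D1, q, k)
# Combining the partial results means selecting the k closest
# points from R0 and R1 together -- i.e., solving KNN again,
# which is not a constant-time operation.
merged = knn(R0 + R1, q, k)
assert sorted(merged) == sorted(knn(D0 + D1, q, k))
\end{verbatim}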
@@ -688,9 +776,9 @@ of problem. More formally,
\begin{definition}[Decomposable Sampling Problem]
    A sampling problem $F: (D, Q) \to R$ is decomposable if and
only if there exists a constant-time computable, associative, and
- commutative binary operator $\square$ such that,
+ commutative binary operator $\mergeop$ such that,
\begin{equation*}
- F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q)
+ F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
\end{definition}