From b30145b6a54480d3f051be3ff3f8f222f5116f87 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Thu, 1 May 2025 17:06:14 -0400
Subject: Background updates

---
 chapters/background.tex | 316 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 202 insertions(+), 114 deletions(-)

diff --git a/chapters/background.tex b/chapters/background.tex
index 75e2b59..332dbb6 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -81,13 +81,13 @@ their work on dynamization, and we will adopt their definition,
 \label{def:dsp}
 A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable
 if and only if there exists a constant-time computable, associative, and
-commutative binary operator $\square$ such that,
+commutative binary operator $\mergeop$ such that,
 \begin{equation*}
-F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
 \end{equation*}
 \end{definition}

-The requirement for $\square$ to be constant-time was used by Bentley and
+The requirement for $\mergeop$ to be constant-time was used by Bentley and
 Saxe to prove specific performance bounds for answering queries from a
 decomposed data structure. However, it is not strictly \emph{necessary}, and
 later work by Overmars lifted this constraint and considered a more
@@ -97,15 +97,15 @@ problems},
 \begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
 A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is
 $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative,
-and commutative binary operator $\square$ such that,
+and commutative binary operator $\mergeop$ such that,
 \begin{equation*}
-F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
 \end{equation*}
 \end{definition}

 To demonstrate that a search problem is decomposable, it is necessary to
-show the existence of the merge operator, $\square$, with the necessary
-properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B,
+show the existence of the merge operator, $\mergeop$, with the necessary
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
 q)$. With these two results, induction demonstrates that the problem is
 decomposable even in cases with more than two partial results.
@@ -121,7 +121,7 @@ Range Count is a decomposable search problem.
 \end{theorem}
 \begin{proof}
-Let $\square$ be addition ($+$). Applying this to
+Let $\mergeop$ be addition ($+$). Applying this to
 Definition~\ref{def:dsp} gives
 \begin{align*}
 |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
@@ -146,11 +146,11 @@ The calculation of the arithmetic mean of a set of numbers is a DSP.
 contains the sum of the values within the input set, and the cardinality
 of the input set. For two disjoint partitions of the data, $D_1$ and $D_2$,
 let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
-$A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$.
+$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
 Applying Definition~\ref{def:dsp} gives
 \begin{align*}
-    A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\
+    A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\
     (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
 \end{align*}
 From this result, the average can be determined in constant time by
@@ -365,7 +365,6 @@ support updates, and so a general strategy for adding update
 support would increase the number of data structures that could be used
 as database indices.
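To make the merge operators in the examples above concrete, the following
sketch (an illustration of ours, not part of the patched chapter) evaluates
the arithmetic-mean DSP over two partitions, representing each partial
result as a $(\text{sum}, \text{count})$ pair and merging with
component-wise addition:

\begin{verbatim}
# Illustrative sketch: the arithmetic-mean DSP. A() computes the partial
# result (sum, count) for one partition; merge() is the constant-time
# operator that combines partial results from disjoint partitions.
def A(partition):
    return (sum(partition), len(partition))

def merge(r1, r2):
    return (r1[0] + r2[0], r1[1] + r2[1])

d1, d2 = [2, 4, 6], [8, 10]
s, c = merge(A(d1), A(d2))
assert s / c == sum(d1 + d2) / len(d1 + d2)   # average of the full set: 6.0
\end{verbatim}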
 We refer to a data structure with update support as \emph{dynamic}, and
 one without update support as \emph{static}.\footnote{
-
 The term static is distinct from immutable. Static refers to the layout
 of records within the data structure, whereas immutable refers to the
 data stored within those records. This distinction
@@ -385,6 +384,16 @@ requirements, and then examine several classical dynamization
 techniques. The section will conclude with a discussion of delete support
 within the context of these techniques.

+It is worth noting that there are a variety of techniques
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general and don't require
+workload-specific assumptions.
+
+
 \subsection{Global Reconstruction}

 The most fundamental dynamization technique is that of \emph{global
@@ -399,125 +408,204 @@ possible if $\mathcal{I}$ supports the following two operations,
 \mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
 \mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
 \end{align*}
-where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$
+where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
 of the data structure over a set of records $d \subseteq \mathcal{D}$
 in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
-\subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in
+\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
 $\Theta(1)$ time,\footnote{
 There isn't any practical reason why $\mathtt{unbuild}$ must run in
 constant time, but this is the assumption made in \cite{saxe79} and in
 subsequent work based on it, and so we will follow the same
 definition here.
-} such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$.
-
+} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
+Given this structure, an insert of record $r \in \mathcal{D}$ into a
+data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
+\begin{align*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
+\end{align*}
+This operation is clearly sub-optimal, as the insertion cost is
+$\Theta(C(n))$, and even in the best case $C(n) \in \Omega(n)$ for
+most data structures. However, this global reconstruction strategy can
+be used as a primitive for more sophisticated techniques that provide
+reasonable performance.
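As a minimal illustration of insertion by global reconstruction (a sketch of
ours, not part of the patch), suppose the static structure is a sorted array,
so that $\mathtt{build}$ costs $\Theta(n \log n)$ and $\mathtt{unbuild}$
simply returns the stored records:

\begin{verbatim}
# Illustrative sketch: a sorted array as the static structure, with
# insertion implemented as I' = build(unbuild(I) U {r}).
class SortedArray:
    def __init__(self, records):
        self.records = sorted(records)    # build(): Theta(n log n)

    def unbuild(self):
        return self.records               # unbuild(): the record set

def insert(structure, r):
    # Every insert rebuilds the entire structure: Theta(C(n)) per insert.
    return SortedArray(structure.unbuild() + [r])

s = SortedArray([5, 1, 3])
s = insert(s, 2)
assert s.records == [1, 2, 3, 5]
\end{verbatim}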
+\subsection{Amortized Global Reconstruction}
+\label{ssec:agr}
+
+The problem with global reconstruction is that each insert must rebuild
+the entire data structure, involving all of its records. This results
+in a worst-case insert cost of $\Theta(C(n))$. However, opportunities
+for improving this scheme can present themselves when considering the
+\emph{amortized} insertion cost.
+
+Consider the cost accrued by the dynamized structure under global
+reconstruction over the lifetime of the structure. Each insert will result
+in all of the existing records being rewritten, so at worst each record
+will be involved in $\Theta(n)$ reconstructions, each reconstruction
+having $\Theta(C(n))$ cost. We can amortize this cost over the $n$ records
+inserted to get an amortized insertion cost for global reconstruction of,
+
+\begin{equation*}
+I_a(n) = \frac{C(n) \cdot n}{n} = C(n)
+\end{equation*}
+
+This doesn't improve things as is; however, it does present two
+opportunities for improvement. If we could either reduce the size of
+the reconstructions or the number of times a record is reconstructed,
+then we could reduce the amortized insertion cost.
+
+The key insight, first discussed by Bentley and Saxe, is that
+this goal can be accomplished by \emph{decomposing} the data
+structure into multiple, smaller structures, each built from a
+disjoint partition of the data. As long as the search problem
+being considered is decomposable, queries can be answered from
+this structure with bounded worst-case overhead, and the amortized
+insertion cost can be improved~\cite{saxe79}. Significant theoretical
+work exists in evaluating different strategies for decomposing the
+data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
+specific efficiencies of the data structures being considered to improve
+these reconstructions~\cite{merge-dsp}.

+There are two general decomposition techniques that emerged from this
+work. The earlier of these is the logarithmic method, often called
+the Bentley-Saxe method in modern literature, and it is the most commonly
+discussed technique today. A later technique, the equal block method,
+was also examined. It is generally not as effective as the Bentley-Saxe
+method, but it has some useful properties for explanatory purposes and
+so will be discussed here as well.
+
+\subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}}
+\label{ssec:ebm}
+
+Though chronologically later, the equal block method is theoretically a
+bit simpler, and so we will begin our discussion of decomposition-based
+techniques for dynamizing decomposable search problems with it. The
+core concept of the equal block method is to decompose the data structure
+into several smaller data structures, called blocks, over partitions
+of the data. This decomposition is performed such that each block is of
+roughly equal size.
+
+Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
+some decomposable search problem, $F$, and is built over a set of records
+$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
+$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$, each built over
+a partition of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
+makes little sense when the number of records changes, and so it is taken
+to be governed by a smooth, monotonically increasing function $f(n)$ such
+that, at any point, the following two constraints are obeyed.
+\begin{align}
+    f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
+    \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
+\end{align}
+where $|\mathscr{I}_j|$ is the number of records in the block,
+$|\text{unbuild}(\mathscr{I}_j)|$.
+
+A new record is inserted by finding the smallest block and rebuilding it
+using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
+then an insert is done by,
+\begin{equation*}
+\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
+\end{equation*}
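To make the insertion rule concrete, the following sketch (ours, again using
a sorted list as a hypothetical stand-in for the block structure) selects the
smallest block and rebuilds it with the new record; the reconfiguration step
that restores Constraint~\ref{ebm-c1} is described next and omitted here.

\begin{verbatim}
# Illustrative sketch of the equal block method's insert: rebuild the
# smallest block with the new record. Reconfiguration is omitted.
def build(records):
    return sorted(records)        # stand-in static structure

def unbuild(block):
    return block

def ebm_insert(blocks, r):
    k = min(range(len(blocks)), key=lambda j: len(blocks[j]))
    blocks[k] = build(unbuild(blocks[k]) + [r])

blocks = [build([1, 4]), build([7]), build([2, 9])]
ebm_insert(blocks, 5)             # block 1 is smallest and is rebuilt
assert blocks[1] == [5, 7]
\end{verbatim}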
+Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
+    Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
+    violated by deletes. We're omitting deletes from the discussion at
+    this point, but will circle back to them in Section~\ref{sec:deletes}.
+} In this case, the constraints are enforced by ``reconfiguring'' the
+structure. $s$ is updated to be exactly $f(n)$, all of the existing
+blocks are unbuilt, and then the records are redistributed evenly into
+$s$ blocks.
+
+A query with parameters $q$ is answered by this structure by individually
+querying the blocks and merging the local results together with $\mergeop$,
+\begin{equation*}
+F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
+\end{equation*}
+where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
+answering the query over $d$ using the data structure $\mathscr{I}$.
+
+This technique provides better amortized insertion bounds than global
+reconstruction, at the possible cost of worse query performance for
+sub-linear queries. We'll omit the details of the performance proof
+for brevity and streamline some of the original notation (full details
+can be found in~\cite{overmars83}), but this technique ultimately
+results in a data structure with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
+\end{align*}
+where $C(n)$ is the cost of statically building $\mathcal{I}$, and
+$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
+%TODO: example?
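As a purely illustrative instantiation of these bounds (our example, not
taken from the original analysis), take $f(n) = \sqrt{n}$ and suppose
$\mathcal{I}$ is a sorted array, so that $C(n) \in \Theta(n \log n)$ and
$\mathscr{Q}(n) \in \Theta(\log n)$. Substituting into the bounds above gives
\begin{align*}
\text{Amortized Insertion Cost:}&\quad \Theta\left(\log n + \sqrt{n}\log\sqrt{n}\right) = \Theta\left(\sqrt{n}\log n\right) \\
\text{Worst-case Query Cost:}&\quad \Theta\left(\sqrt{n}\log\sqrt{n}\right) = \Theta\left(\sqrt{n}\log n\right)
\end{align*}
so both operations degrade to roughly $\sqrt{n}$ behavior in this case, which
is one reason the logarithmic decomposition discussed next is generally
preferred.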
-\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
+\subsection{The Bentley-Saxe Method~\cite{saxe79}}
 \label{ssec:bsm}

-Another approach to support updates is to amortize the cost of
-global reconstruction over multiple updates. This approach can take
-take three forms,
-\begin{enumerate}
-
-    \item Pairing a dynamic data structure (called a buffer or
-    memtable) with an instance of the structure being extended.
-    Updates are written to the buffer, and when the buffer is
-    full its records are merged with those in the static
-    structure, and the structure is rebuilt. This approach is
-    used by one version of the originally proposed
-    LSM-tree~\cite{oneil96}. Technically this technique proposed
-    in that work for the purposes of converting random writes
-    into sequential ones (all structures involved are dynamic),
-    but it can be used for dynamization as well.
-
-    \item Creating multiple, smaller data structures each
-    containing a partition of the records from the dataset, and
-    reconstructing individual structures to accommodate new
-    inserts in a systematic manner. This technique is the basis
-    of the Bentley-Saxe method~\cite{saxe79}.
-
-    \item Using both of the above techniques at once. This is
-    the approach used by modern incarnations of the
-    LSM-tree~\cite{rocksdb}.
-
-\end{enumerate}
-
-In all three cases, it is necessary for the search problem associated
-with the index to be a DSP, as answering it will require querying
-multiple structures (the buffer and/or one or more instances of the
-data structure) and merging the results together to get a final
-result. This section will focus exclusively on the Bentley-Saxe
-method, as it is the basis for the proposed methodology.
-
-When dividing records across multiple structures, there is a clear
-trade-off between read performance and write performance. Keeping
-the individual structures small reduces the cost of reconstructing,
-and thereby increases update performance. However, this also means
-that more structures will be required to accommodate the same number
-of records, when compared to a scheme that allows the structures
-to be larger. As each structure must be queried independently, this
-will lead to worse query performance. The reverse is also true,
-fewer, larger structures will have better query performance and
-worse update performance, with the extreme limit of this being a
-single structure that is fully rebuilt on each insert.
-
-The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
-good balance can be struck by using a geometrically increasing
-structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
-with the base level having a capacity of a single record, and
-each subsequent level doubling in capacity. When an update is
-performed, the first empty level is located and a reconstruction
-is triggered, merging the structures of all levels below this empty
-one, along with the new record. The merits of this approach are
-that it ensures that ``most'' reconstructions involve the smaller
-data structures towards the bottom of the sequence, while most of
-the records reside in large, infrequently updated, structures towards
-the top. This balances between the read and write implications of
-structure size, while also allowing the number of structures required
-to represent $n$ records to be worst-case bounded by $O(\log n)$.
-
-Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$
-query cost, the Bentley-Saxe Method will produce a dynamic data
-structure with,
+
+The original, and most frequently used, dynamization technique is the
+Bentley-Saxe Method (BSM), also called the logarithmic method in older
+literature. Rather than breaking the data structure into equally sized
+blocks, BSM decomposes the structure into logarithmically many blocks
+of exponentially increasing size. More specifically, the data structure
+is decomposed into $h = \lceil \log_2 (n+1) \rceil$ blocks, $\mathscr{I}_0,
+\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
+will be either empty, or contain exactly $2^i$ records within it.
+
+The procedure for inserting a record, $r \in \mathcal{D}$, into
+a BSM dynamization is as follows. If the block $\mathscr{I}_0$
+is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
+empty, then there will exist a maximal sequence of non-empty blocks
+$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
+0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
+$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
+\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
+$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
+end of the structure as needed.
+
+%FIXME: switch the x's to r's for consistency
+\begin{figure}
+\centering
+\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
+\caption{An illustration of inserts into the Bentley-Saxe Method}
+\label{fig:bsm-example}
+\end{figure}
+
+Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
+dynamization is built over a set of records $x_1, x_2, \ldots,
+x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
+$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
+into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
+first empty block is $\mathscr{I}_2$, and so the insert is performed by
+doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
+\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
+and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
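The insertion procedure can be summarized with the following sketch (ours,
not part of the patch; a sorted Python list stands in for the static
structure, and the block contents are arbitrary stand-ins for $x_1, \ldots,
x_{10}$):

\begin{verbatim}
# Illustrative sketch of Bentley-Saxe insertion. blocks[i] is either
# None (empty) or a sorted list of exactly 2**i records.
def bsm_insert(blocks, r):
    records = [r]
    for i, block in enumerate(blocks):
        if block is None:
            blocks[i] = sorted(records)   # build({r} U unbuild(I_0 .. I_{i-1}))
            return
        records += block                  # unbuild(I_i)
        blocks[i] = None                  # I_i is emptied
    blocks.append(sorted(records))        # all blocks full: add a new level

blocks = [None, [3, 8], None, [1, 2, 4, 5, 6, 7, 9, 10]]  # ten records
bsm_insert(blocks, 11)    # fills I_0
bsm_insert(blocks, 12)    # rebuilds I_2 from {12}, I_0, and I_1
assert blocks[:3] == [None, None, [3, 8, 11, 12]]
\end{verbatim}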
-\begin{align}
-    \text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\
-    \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
-\end{align}
-In the case of a $C(n)$-decomposable problem, the query cost grows to
-\begin{equation}
-    O\left((Q_s(n) + C(n)) \cdot \log n\right)
-\end{equation}
-While the Bentley-Saxe method manages to maintain good performance in
-terms of \emph{amortized} insertion cost, it has has poor worst-case performance. If the
-entire structure is full, it must grow by another level, requiring
-a full reconstruction involving every record within the structure.
-A slight adjustment to the technique, due to Overmars and van
-Leeuwen~\cite{overmars81}, allows for the worst-case insertion cost to be bounded by
-$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing
-each reconstruction into small pieces, one of which is executed
-each time a new update occurs. This has the effect of bounding the
-worst-case performance, but does so by sacrificing the expected
-case performance, and adds a lot of complexity to the method. This
-technique is not used much in practice.\footnote{
-    We've yet to find any example of it used in a journal article
-    or conference paper.
-}

-\section{Limitations of the Bentley-Saxe Method}
+\section{Limitations of Classical Dynamization Techniques}
 \label{sec:bsm-limits}

-While fairly general, the Bentley-Saxe method has a number of limitations. Because
-of the way in which it merges query results together, the number of search problems
-to which it can be efficiently applied is limited. Additionally, the method does not
-expose any trade-off space to configure the structure: it is one-size fits all.
+While fairly general, these classical dynamization techniques have a
+number of limitations. Because of the way in which they merge query
+results together, the number of search problems to which they can be
+efficiently applied is limited. Additionally, they do not expose any
+trade-off space to configure the structure: they are one-size-fits-all.

 \subsection{Limits of Decomposability}
 \label{ssec:decomp-limits}
@@ -610,12 +698,12 @@ thought of as KNN with $k=1$), this problem is \emph{not} decomposable.
 To prove this, consider the query $KNN(D, q, k)$ against some partitioned
 dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If KNN is decomposable,
 then there must exist some constant-time, commutative, and associative
-binary operator $\square$, such that $R = \square_{0 \leq i \leq l}
+binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq \ell}
 R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q, k)$.

 Consider the evaluation of the merge operator against two arbitrary
-result sets, $R = R_i \square R_j$. It is clear that $|R| = |R_i| =
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
 |R_j| = k$, and that the contents of $R$ must be the $k$ records from
-$R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
 problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in
 $O(1)$ time. Therefore, KNN is not a decomposable search problem.
 \end{proof}
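To see the failing combine step concretely, consider this sketch (ours, for
one-dimensional points): merging two partial result sets requires selecting
the $k$ candidates nearest to $q$ from their union, which is itself a $k$-NN
query and costs $O(k \log k)$ with a heap rather than $O(1)$.

\begin{verbatim}
# Illustrative sketch: combining two partial k-NN results is itself a
# k-NN query over their union, so the merge cannot run in O(1) time.
import heapq

def knn_merge(r_i, r_j, q, k):
    # Select the k nearest candidates among the 2k partial results.
    return heapq.nsmallest(k, r_i + r_j, key=lambda p: abs(p - q))

r_a = [4.0, 7.0]    # 2-NN of q within partition A
r_b = [5.5, 9.0]    # 2-NN of q within partition B
assert knn_merge(r_a, r_b, q=5.0, k=2) == [5.5, 4.0]
\end{verbatim}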
@@ -688,9 +776,9 @@ of problem. More formally,
 \begin{definition}[Decomposable Sampling Problem]
 A sampling problem $F: (D, Q) \to R$ is decomposable
 if and only if there exists a constant-time computable, associative, and
-commutative binary operator $\square$ such that,
+commutative binary operator $\mergeop$ such that,
 \begin{equation*}
-F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q)
+F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q)
 \end{equation*}
 \end{definition}
--
cgit v1.2.3