\chapter{Background}
\label{chap:background}

This chapter introduces important background information and existing
work in the area of data structure dynamization. We will first discuss
the concept of a search problem, which is central to dynamization
techniques. While one might imagine that the restrictions on
dynamization would be properties of the data structure to be dynamized,
in practice the requirements placed on the data structure are quite
mild; it is the necessary properties of the search problem that the
data structure is used to address that present the central difficulty
in applying dynamization techniques in a given area. After this,
database indices will be discussed briefly. Indices are the primary use
of data structures within the database context that is of interest to
our work. Following this, existing theoretical results in the area of
data structure dynamization will be discussed, which will serve as the
building blocks for our techniques in subsequent chapters. The chapter
concludes with a discussion of some of the limitations of these
existing techniques.

\section{Queries and Search Problems}
\label{sec:dsp}

Data access lies at the core of most database systems. We want to ask
questions of the data and, ideally, get answers efficiently. We will
refer to the different types of questions that can be asked as
\emph{search problems}. We use this term in a similar way as the word
\emph{query}\footnote{
    The term query is often abused and used to refer to several
    related, but slightly different, things. In the vernacular, a query
    can refer to either a) a general type of search problem (as in
    ``range query''), b) a specific instance of a search problem, or c)
    a program written in a query language.
} is often used within the database systems literature: to refer to a
general class of questions. For example, we could consider range scans,
point-lookups, nearest neighbor searches, predicate filtering, random
sampling, etc., to each be a general search problem. Formally, for the
purposes of this work, a search problem is defined as follows,

\begin{definition}[Search Problem]
    Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and
    $\mathcal{Q}$, a search problem is a function $F: (\mathcal{D},
    \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the
    domain of data to be searched, $\mathcal{Q}$ represents the domain
    of query parameters, and $\mathcal{R}$ represents the answer
    domain.\footnote{
        It is important to note that it is not required that
        $\mathcal{R} \subseteq \mathcal{D}$. As an example, a
        \texttt{COUNT} aggregation might map a set of strings onto an
        integer. Most common queries do satisfy $\mathcal{R} \subseteq
        \mathcal{D}$, but this need not be a universal constraint.
    }
\end{definition}

We will use the term \emph{query} to mean a specific instance of a
search problem,

\begin{definition}[Query]
    Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and
    $\mathcal{Q}$, a search problem $F$, and a specific set of query
    parameters $q \in \mathcal{Q}$, a query is a specific instance of
    the search problem, $F(\mathcal{D}, q)$.
\end{definition}

As an example of these definitions, a \emph{membership test} or
\emph{range scan} would be considered search problems, and a range scan
over the interval $[10, 99]$ would be a query. We draw this distinction
because, as will become clear in the discussion of our work in later
chapters, it is useful to have separate, unambiguous terms for these
two concepts.

\subsection{Decomposable Search Problems}

Dynamization techniques require partitioning one data structure into
several smaller ones. As a result, these techniques can only be applied
when the search problem of interest can be answered from this set of
smaller data structures, producing the same answer as would have been
obtained had all of the data been used to construct a single, large
structure. This requirement is formalized in the definition of a class
of problems called \emph{decomposable search problems (DSP)}. This
class was first defined by Bentley and Saxe in their work on
dynamization, and we adopt their definition,

\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
    \label{def:dsp}
    A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is
    decomposable if and only if there exists a constant-time
    computable, associative, and commutative binary operator $\mergeop$
    such that,
    \begin{equation*}
        F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
    \end{equation*}
    for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}

The requirement that $\mergeop$ be constant-time was used by Bentley
and Saxe to prove specific performance bounds for answering queries
from a decomposed data structure. However, it is not strictly
\emph{necessary}, and later work by Overmars lifted this constraint and
considered a more general class of search problems called
\emph{$C(n)$-decomposable search problems},

\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
    A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is
    $C(n)$-decomposable if and only if there exists an $O(C(n))$-time
    computable, associative, and commutative binary operator $\mergeop$
    such that,
    \begin{equation*}
        F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
    \end{equation*}
    for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}

To demonstrate that a search problem is decomposable, it is necessary
to show the existence of a merge operator, $\mergeop$, with the
required properties, and to show that $F(A \cup B, q) = F(A, q)~
\mergeop ~F(B, q)$. With these two results, induction demonstrates that
the problem is decomposable even in cases with more than two partial
results.

As an example, consider range counts,
\begin{definition}[Range Count]
    \label{def:range-count}
    Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
    $q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
    the cardinality, $|d \cap q|$.
\end{definition}

\begin{theorem}
\label{ther:decomp-range-count}
Range Count is a decomposable search problem.
\end{theorem}

\begin{proof}
Let $\mergeop$ be addition ($+$). Applying this to
Definition~\ref{def:dsp} gives
\begin{align*}
    |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
\end{align*}
which is true by the distributive property of intersection over union,
together with the disjointness of $A$ and $B$. Addition is an
associative and commutative operator that can be calculated in
$\Theta(1)$ time. Therefore, range counts are DSPs.
\end{proof}
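To make Definition~\ref{def:dsp} concrete, the following Python sketch
(ours, purely illustrative; sorted lists stand in for an arbitrary
static structure) answers a range count over a decomposed dataset and
merges the partial results with $\mergeop = +$:

\begin{verbatim}
# Illustrative sketch: range count as a decomposable search problem.
from bisect import bisect_left, bisect_right

def build(records):                    # the static structure: a sorted list
    return sorted(records)

def local_query(block, q):             # number of records of block in [x, y]
    x, y = q
    return bisect_right(block, y) - bisect_left(block, x)

def range_count(blocks, q):
    # Merge the partial results with the constant-time operator (+).
    return sum(local_query(b, q) for b in blocks)

blocks = [build([1, 5, 9]), build([2, 4]), build([7, 8, 10])]
assert range_count(blocks, (4, 9)) == 5    # counts {4, 5, 7, 8, 9}
\end{verbatim}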
Because the codomain of a DSP is not restricted, more complex output
structures can be used to allow problems that are not directly
decomposable to be converted into DSPs, possibly with some minor
post-processing. For example, calculating the arithmetic mean of a set
of numbers can be formulated as a DSP,
\begin{theorem}
The calculation of the arithmetic mean of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
    Consider the search problem $A:\mathcal{D} \to (\mathbb{R},
    \mathbb{Z})$, where $\mathcal{D}\subset\mathbb{R}$ and is a
    multi-set. The output tuple contains the sum of the values within
    the input set and the cardinality of the input set. For two
    disjoint partitions of the data, $D_1$ and $D_2$, let $A(D_1) =
    (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let $A(D_1) \mergeop A(D_2)
    = (s_1 + s_2, c_1 + c_2)$.

Applying Definition~\ref{def:dsp} gives
\begin{align*}
    A(D_1 \cup D_2) &= A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)
\end{align*}
which holds because the sum and the cardinality of a disjoint union are
the sums of the partial sums and cardinalities. From the merged result
$(s, c)$, the average can be determined in constant time by taking
$\nicefrac{s}{c}$. Therefore, calculating the average of a set of
numbers is a DSP.
\end{proof}

\section{Database Indexes}

and the log-structured merge (LSM) tree~\cite{oneil96} is also often
used within the context of key-value stores~\cite{rocksdb}. Some
databases implement unordered indices using hash
tables~\cite{mysql-btree-hash}.

\subsection{The Generalized Index}

The previous section discussed the classical definition of index

that has been designed specifically for answering particular types of queries
%done surrounding the use of arbitrary indexes in queries in the past,
%such as~\cite{byods-datalog}. This problem is considered out-of-scope
%for the proposed work, but will be considered in the future.

\section{Classical Dynamization Techniques}

Because data in a database is regularly updated, data structures
intended to be used as an index must support updates (inserts, in-place
modification, and deletes). Not all potentially useful data structures
support updates, and so a general strategy for adding update support
would increase the number of data structures that could be used as
database indices. We refer to a data structure with update support as
\emph{dynamic}, and one without update support as \emph{static}.\footnote{
    The term static is distinct from immutable. Static refers to the
    layout of records within the data structure, whereas immutable
    refers to the data stored within those records. This distinction
    will become relevant when we discuss different techniques for adding
    delete support to data structures. The data structures used are
    always static, but not necessarily immutable, because the records may
    contain header information (like visibility) that is updated in place.
}

This section discusses \emph{dynamization}, the construction of a
dynamic data structure based on an existing static one. When certain
conditions are satisfied by the data structure and its associated
search problem, this process can be done automatically, and with
provable asymptotic bounds on amortized insertion performance, as well
as worst-case query performance. This is in contrast to the manual
design of dynamic data structures, which involves techniques based on
partially rebuilding small portions of a single data structure (called
\emph{local reconstruction})~\cite{overmars83}.
This is a very high-cost intervention that requires significant effort
on the part of the data structure designer, whereas conventional
dynamization can be performed with little-to-no modification of the
underlying data structure at all.

It is worth noting that there are a variety of techniques discussed in
the literature for dynamizing structures with specific properties, or
under very specific sets of circumstances. Examples include frameworks
for adding update support to succinct data
structures~\cite{dynamize-succinct} or taking advantage of batching of
insert and query operations~\cite{batched-decomposable}. This section
discusses techniques that are more general and don't require
workload-specific assumptions.

We will first discuss the necessary data structure requirements, and
then examine several classical dynamization techniques. The section
will conclude with a discussion of delete support within the context of
these techniques. For more detail than is included in this chapter,
Overmars wrote a book providing a comprehensive survey of techniques
for creating dynamic data structures, including not only the
dynamization techniques discussed here, but also local reconstruction
based techniques and more~\cite{overmars83}.\footnote{
    Sadly, this book isn't readily available in digital format as of
    the time of writing.
}

\subsection{Global Reconstruction}

The most fundamental dynamization technique is that of \emph{global
reconstruction}. While not particularly useful on its own, global
reconstruction serves as the basis for the techniques to follow, and so
we will begin our discussion of dynamization with it.

Consider a class of data structure, $\mathcal{I}$, capable of answering
a search problem, $\mathcal{Q}$. Insertion via global reconstruction is
possible if $\mathcal{I}$ supports the following two operations,
\begin{align*}
\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
\end{align*}
where $\mathtt{build}$ constructs an instance
$\mathscr{I}\in\mathcal{I}$ of the data structure over a set of records
$d \subseteq \mathcal{D}$ in $B(|d|)$ time, and $\mathtt{unbuild}$
returns the set of records $d \subseteq \mathcal{D}$ used to construct
$\mathscr{I} \in \mathcal{I}$ in $\Theta(1)$ time,\footnote{
    There isn't any practical reason why $\mathtt{unbuild}$ must run in
    constant time, but this is the assumption made in \cite{saxe79} and
    in subsequent work based on it, and so we will follow the same
    definition here.
} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.

Given this structure, an insert of record $r \in \mathcal{D}$ into a
data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
\begin{align*}
\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
\end{align*}

It goes without saying that this operation is sub-optimal, as the
insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
most data structures. However, this global reconstruction strategy can
be used as a primitive for more sophisticated techniques that can
provide reasonable performance.
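The following sketch (ours, with a sorted array standing in for an
arbitrary static structure) shows how mechanical this operation is, and
makes the $\Theta(B(n))$ per-insert cost plain:

\begin{verbatim}
# Illustrative sketch of insertion via global reconstruction.
class SortedArray:                     # a stand-in static structure
    def __init__(self, records):       # build: B(n) in O(n log n) here
        self.records = sorted(records)

    def unbuild(self):                 # recover the underlying records
        return self.records

def insert(structure, r):
    # Every insert rebuilds the whole structure: Theta(B(n)) per insert.
    return SortedArray(structure.unbuild() + [r])

s = SortedArray([3, 1, 2])
s = insert(s, 0)
assert s.records == [0, 1, 2, 3]
\end{verbatim}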
\subsection{Amortized Global Reconstruction}
\label{ssec:agr}

The problem with global reconstruction is that each insert must rebuild
the entire data structure, involving all of its records. This results
in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
for improving this scheme present themselves when considering the
\emph{amortized} insertion cost.

Consider the cost accrued by the dynamized structure under global
reconstruction over the lifetime of the structure. Each insert will
result in all of the existing records being rewritten, so at worst each
record will be involved in $\Theta(n)$ reconstructions, each
reconstruction having $\Theta(B(n))$ cost. We can amortize this cost
over the $n$ records inserted to get an amortized insertion cost for
global reconstruction of,
\begin{equation*}
I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
\end{equation*}

This doesn't improve things as is; however, it does present two
opportunities for improvement. If we could either reduce the size of
the reconstructions, or the number of times a record is reconstructed,
then we could reduce the amortized insertion cost.

The key insight, first discussed by Bentley and Saxe, is that both of
these goals can be accomplished by \emph{decomposing} the data
structure into multiple, smaller structures, each built from a disjoint
partition of the data. As long as the search problem being considered
is decomposable, queries can be answered from this structure with
bounded worst-case overhead, and the amortized insertion cost can be
improved~\cite{saxe79}. Significant theoretical work exists in
evaluating different strategies for decomposing the data
structure~\cite{saxe79, overmars81, overmars83} and for leveraging
specific efficiencies of the data structures being considered to
improve these reconstructions~\cite{merge-dsp}.

There are two general decomposition techniques that emerged from this
work. The earlier of these is the logarithmic method, often called the
Bentley-Saxe method in modern literature, and it remains the most
commonly discussed technique today. The Bentley-Saxe method has been
directly applied in a few instances in the literature, such as to
metric indexing structures~\cite{naidan14} and spatial
structures~\cite{bkdtree}, and has also been used in a modified form
for genetic sequence search structures~\cite{almodaresi23} and
graphs~\cite{lsmgraph}, to cite a few examples.

A later technique, the equal block method, was also developed. It is
generally not as effective as the Bentley-Saxe method, and as a result
we have not identified any specific applications of it outside of the
theoretical literature. However, we will discuss it as well in the
interest of completeness, and because it lends itself well to
demonstrating certain properties of decomposition-based dynamization
techniques.

\subsection{Equal Block Method}
\label{ssec:ebm}

Though chronologically later, the equal block method is theoretically a
bit simpler, and so we will begin our discussion of decomposition-based
dynamization of decomposable search problems with it. There have been
several proposed variations of this concept~\cite{maurer79, maurer80},
but we will focus on the most developed form, as described by Overmars
and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
concept of the equal block method is to decompose the data structure
into several smaller data structures, called blocks, over partitions of
the data. This decomposition is performed such that each block is of
roughly equal size.

Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
some decomposable search problem $F$ and is built over a set of records
$d \subseteq \mathcal{D}$.
This structure can be decomposed into $s$ blocks, $\mathscr{I}_1,
\mathscr{I}_2, \ldots, \mathscr{I}_s$, each built over a partition of
$d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value makes
little sense when the number of records changes, and so it is taken to
be governed by a smooth, monotonically increasing function $f(n)$ such
that, at any point, the following two constraints are obeyed,
\begin{align}
    f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
    \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
\end{align}
where $|\mathscr{I}_j|$ is the number of records in the block,
$|\text{unbuild}(\mathscr{I}_j)|$.

A new record is inserted by finding the smallest block and rebuilding
it with the new record included. If $k = \argmin_{1 \leq j \leq
s}(|\mathscr{I}_j|)$, then an insert is done by,
\begin{equation*}
\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
\end{equation*}
Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
    Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
    violated by deletes. We're omitting deletes from the discussion at
    this point, but will circle back to them in Section~\ref{sec:deletes}.
} In this case, the constraints are enforced by ``re-configuring'' the
structure: $s$ is updated to be exactly $f(n)$, all of the existing
blocks are unbuilt, and then the records are redistributed evenly into
$s$ new blocks.

A query with parameters $q$ is answered by this structure by
individually querying the blocks and merging the local results together
with $\mergeop$,
\begin{equation*}
F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
\end{equation*}
where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
answering the query over $d$ using the data structure $\mathscr{I}$.

This technique provides better amortized insertion bounds than global
reconstruction, at the possible cost of worse query performance for
sub-linear queries. We omit the details of the proof for brevity and
streamline some of the original notation (full details can be found
in~\cite{overmars83}), but this technique ultimately results in a data
structure with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
\end{align*}
where $B(n)$ is the cost of statically building $\mathcal{I}$, and
$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
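As a concrete example, the following sketch (ours) implements
equal-block insertion and reconfiguration, with the arbitrary choices
of $f(n) = \sqrt{n}$ and sorted lists standing in for the blocks:

\begin{verbatim}
# Illustrative sketch of the equal block method with f(n) = sqrt(n).
import math

def f(n):
    return max(1, math.isqrt(n))

class EqualBlocks:
    def __init__(self):
        self.blocks = [[]]             # each block is a sorted list

    def insert(self, r):
        # Rebuild the smallest block with the new record included.
        k = min(range(len(self.blocks)), key=lambda j: len(self.blocks[j]))
        self.blocks[k] = sorted(self.blocks[k] + [r])
        n, s = sum(len(b) for b in self.blocks), len(self.blocks)
        if not (f(n // 2) <= s <= f(2 * n)):   # Constraint 2.1 violated
            # Reconfigure: set s = f(n), redistribute records evenly.
            records = sorted(x for b in self.blocks for x in b)
            s = f(n)
            self.blocks = [records[i::s] for i in range(s)]

    def range_count(self, x, y):
        # A DSP is answered by querying every block and merging with +.
        return sum(sum(1 for v in b if x <= v <= y) for b in self.blocks)

eb = EqualBlocks()
for r in range(100):
    eb.insert(r)
assert eb.range_count(10, 19) == 10
\end{verbatim}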
\subsection{The Bentley-Saxe Method~\cite{saxe79}}
\label{ssec:bsm}

The original, and most frequently used, dynamization technique is the
Bentley-Saxe method (BSM), also called the logarithmic method in older
literature. Rather than breaking the data structure into equally sized
blocks, BSM decomposes the structure into logarithmically many blocks
of exponentially increasing size. More specifically, the data structure
is decomposed into $h = \lfloor \log_2 n \rfloor + 1$ blocks,
$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given
block $\mathscr{I}_i$ will either be empty, or contain exactly $2^i$
records within it.

The procedure for inserting a record, $r \in \mathcal{D}$, into a BSM
dynamization is as follows. If the block $\mathscr{I}_0$ is empty, then
$\mathscr{I}_0 = \text{build}(\{r\})$. If it is not empty, then there
will exist a maximal sequence of non-empty blocks $\mathscr{I}_0,
\mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq 0$, terminated
by an empty block $\mathscr{I}_{i+1}$. In this case,
$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to
the end of the structure as needed.

\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
\caption{An illustration of inserts into the Bentley-Saxe Method}
\label{fig:bsm-example}
\end{figure}

Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
dynamization is built over a set of records $x_1, x_2, \ldots, x_{10}$
initially, with eight records in $\mathscr{I}_3$ and two in
$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
first empty block is $\mathscr{I}_2$, and so the insert is performed by
doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.

This technique is called a \emph{binary decomposition} of the data
structure. Considering a BSM dynamization of a structure containing $n$
records, labeling each block with a $0$ if it is empty and a $1$ if it
is full will result in the binary representation of $n$. For example,
the final state of the structure in Figure~\ref{fig:bsm-example}
contains $12$ records, and the labeling procedure results in
$0\text{b}1100$, which is $12$ in binary. Inserts affect this
representation of the structure in the same way that incrementing the
binary number by $1$ does.

By applying BSM to a data structure, a dynamized structure can be
created with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
\text{Worst-case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
\end{align*}
This is a particularly attractive result because, for example, a data
structure having $B(n) \in \Theta(n)$ will have an amortized insertion
cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off
for this is an extra logarithmic multiple attached to the query
complexity. It is also worth noting that the worst-case insertion cost
remains the same as global reconstruction, but this case arises only
very rarely. In terms of the binary decomposition representation, the
worst-case behavior is triggered each time the existing number
overflows, and a new digit must be added.
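The correspondence with binary counting can be seen directly in code.
The sketch below (ours; \texttt{build} is again a stand-in $B(n)$-time
constructor) implements the insertion procedure:

\begin{verbatim}
# Illustrative sketch of Bentley-Saxe insertion as binary counting.
# blocks[i] is None (empty) or a sorted list of exactly 2^i records.
def build(records):
    return sorted(records)

def bsm_insert(blocks, r):
    carry = [r]
    for i, block in enumerate(blocks):
        if block is None:
            blocks[i] = build(carry)   # found the first empty block
            return
        carry += block                 # unbuild level i into the carry...
        blocks[i] = None               # ...and empty it
    blocks.append(build(carry))        # overflow: add a new block

blocks = []
for x in range(1, 13):                 # insert records x_1 .. x_12
    bsm_insert(blocks, x)
# 12 records = 0b1100: blocks 0 and 1 empty, blocks 2 and 3 full.
assert [b is not None for b in blocks] == [False, False, True, True]
\end{verbatim}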
As a final note about the query performance of this structure, because
the overhead due to querying the blocks is logarithmic, under certain
circumstances this cost can be absorbed, resulting in no effect on the
asymptotic worst-case query performance. As an example, consider a
linear scan of the data running in $\Theta(n)$ time. In this case,
every record must be considered, and so there isn't any performance
penalty\footnote{
    From an asymptotic perspective. There will still be measurable
    performance effects from caching, etc., even in this case.
} to breaking the records out into multiple chunks and scanning them
individually. More formally, for any query running in $\mathscr{Q}(n)
\in \Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the
worst-case cost of answering a decomposable search problem from a BSM
dynamization is $\Theta\left(\mathscr{Q}(n)\right)$~\cite{saxe79}.

\subsection{Merge Decomposable Search Problems}

\subsection{Delete Support}
\label{sec:deletes}

Classical dynamization techniques have also been developed with support
for deleting records. In general, the same technique of global
reconstruction that was used for inserting records can also be used to
delete them. Given a record $r \in \mathcal{D}$ and a data structure
$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
deleted from the structure in $\Theta(B(n))$ time as follows,
\begin{equation*}
\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
\end{equation*}
However, supporting deletes within the dynamization schemes discussed
above is more complicated. The core problem is that inserts affect the
dynamized structure in a deterministic way, and as a result certain
partitioning schemes can be leveraged to reason about performance.
Deletes do not work like this.

\begin{figure}
\caption{A Bentley-Saxe dynamization for the integers on the interval
$[1, 100]$.}
\label{fig:bsm-delete-example}
\end{figure}

For example, consider a Bentley-Saxe dynamization that contains all
integers on the interval $[1, 100]$, inserted in that order, shown in
Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
records from this structure, one at a time, using global
reconstruction. This presents several problems,
\begin{itemize}
    \item For each record, we need to identify which block it is in
        before we can delete it.
    \item The cost of performing a delete is a function of which block
        the record is in, which is a question of distribution and not
        easily controlled.
    \item As records are deleted, the structure will potentially
        violate the invariants of the decomposition scheme used, which
        will require additional work to fix.
\end{itemize}

To resolve these difficulties, two very different approaches have been
proposed for supporting deletes, each of which relies on certain
properties of the search problem and data structure. These are the use
of a ghost structure and weak deletes.

\subsubsection{Ghost Structure for Invertible Search Problems}

The first proposed mechanism for supporting deletes was discussed
alongside the Bentley-Saxe method in Bentley and Saxe's original paper.
This technique applies to a class of search problems called
\emph{invertible} (also called \emph{decomposable counting problems} in
later literature~\cite{overmars83}). Invertible search problems are
decomposable, and also support an ``inverse'' merge operator, $\Delta$,
that is able to remove records from the result set.
More formally,
\begin{definition}[Invertible Search Problem~\cite{saxe79}]
\label{def:invert}
A decomposable search problem, $F$, is invertible if and only if there
exists a constant-time computable operator, $\Delta$, such that
\begin{equation*}
F(A / B, q) = F(A, q)~\Delta~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $B \subseteq A$.
\end{definition}

Given a search problem with this property, it is possible to perform
deletes by creating a secondary ``ghost'' structure. When a record is
to be deleted, it is inserted into this structure. Then, when the
dynamization is queried, this ghost structure is queried alongside the
main one. The results from the ghost structure can be removed from the
result set using the inverse merge operator. This simulates the result
that would have been obtained had the records been physically removed
from the main structure.

Two examples of invertible search problems are set membership and range
count. Range count was formally defined in
Definition~\ref{def:range-count}.

\begin{theorem}
Range count is an invertible search problem.
\end{theorem}

\begin{proof}
To prove that range count is an invertible search problem, it must be
decomposable and have a $\Delta$ operator. That it is a DSP has already
been proven in Theorem~\ref{ther:decomp-range-count}.

Let $\Delta$ be subtraction $(-)$. Applying this to
Definition~\ref{def:invert} gives,
\begin{equation*}
|(A / B) \cap q | = |(A \cap q) / (B \cap q)| = |(A \cap q)| - |(B \cap q)|
\end{equation*}
where the first equality follows from the distributive property of set
difference and intersection, and the second from $B \subseteq A$.
Subtraction is computable in constant time; therefore range count is an
invertible search problem using subtraction as $\Delta$.
\end{proof}

The set membership search problem is defined as follows,
\begin{definition}[Set Membership]
\label{def:set-membership}
Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
and a single element $r \in \mathcal{D}$. A test of set membership is a
search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
\not\in d$ and $1$ if $r \in d$.
\end{definition}

\begin{theorem}
Set membership is an invertible search problem.
\end{theorem}

\begin{proof}
To prove that set membership is invertible, it is necessary to
establish that it is a decomposable search problem, and that a $\Delta$
operator exists. We'll begin with the former.
\begin{lemma}
    \label{lem:set-memb-dsp}
    Set membership is a decomposable search problem.
\end{lemma}
\begin{proof}
Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
\begin{align*}
F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
r \in (A \cup B) &= (r \in A) \lor (r \in B)
\end{align*}
which is true, following directly from the definition of union. The
logical disjunction is an associative, commutative operator that can be
calculated in $\Theta(1)$ time. Therefore, set membership is a
decomposable search problem.
\end{proof}

For the inverse merge operator, $\Delta$, it is necessary that $F(A,
r)~\Delta~F(B, r)$ be true if and only if $r \in A$ and $r \not\in B$.
Thus, it can be directly implemented as $F(A, r)~\Delta~F(B, r) = F(A,
r) \land \neg F(B, r)$, which is constant-time if the operands are
already known.

Thus, we have shown that set membership is a decomposable search
problem, and that a constant-time $\Delta$ operator exists. Therefore,
it is an invertible search problem.
\end{proof}
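A minimal sketch of the ghost-structure mechanism (ours), using the
range count problem of Definition~\ref{def:range-count} and flat lists
of singleton blocks as stand-ins for the two dynamized structures:

\begin{verbatim}
# Illustrative sketch: deletes via a ghost structure for range count.
from bisect import bisect_left, bisect_right

def count(blocks, x, y):
    return sum(bisect_right(b, y) - bisect_left(b, x) for b in blocks)

class GhostDynamization:
    def __init__(self):
        self.main, self.ghost = [], []   # two dynamized structures

    def insert(self, r):
        self.main.append([r])            # stand-in for a BSM insert

    def delete(self, r):
        self.ghost.append([r])           # a delete is a ghost insert

    def range_count(self, x, y):
        # Apply Delta (-) to remove the ghost results from the total.
        return count(self.main, x, y) - count(self.ghost, x, y)

d = GhostDynamization()
for r in [1, 3, 5, 7]:
    d.insert(r)
d.delete(5)
assert d.range_count(2, 6) == 1          # only 3 remains in [2, 6]
\end{verbatim}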
For search problems such as these, this technique allows deletes to be
supported at the same cost as an insert. Unfortunately, it suffers from
write amplification, because each deleted record is recorded twice:
once in the main structure, and once in the ghost structure. This means
that $n$ is, in effect, the total number of records and deletes. This
can lead to some serious problems; for example, if every record in a
structure of $n$ records is deleted, the net result will be an
``empty'' dynamized data structure containing $2n$ physical records
within it. To circumvent this problem, Bentley and Saxe proposed
setting a maximum threshold for the size of the ghost structure
relative to the main one, and performing a complete re-partitioning of
the data once this threshold is reached, removing all deleted records
from the main structure, emptying the ghost structure, and rebuilding
the blocks with the records that remain according to the invariants of
the technique.

\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}

Another approach for supporting deletes was proposed later, by Overmars
and van Leeuwen, for a class of search problems called \emph{deletion
decomposable}. These are decomposable search problems for which the
underlying data structure supports a delete operation. More formally,

\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
    A decomposable search problem, $F$, and its data structure,
    $\mathcal{I}$, are deletion decomposable if and only if, for some
    instance $\mathscr{I} \in \mathcal{I}$ containing $n$ records,
    there exists a deletion routine $\mathtt{delete}(\mathscr{I}, r)$
    that removes some $r \in \mathcal{D}$ in time $D(n)$ without
    increasing the query time, deletion time, or storage requirement
    for $\mathscr{I}$.
\end{definition}

Superficially, this doesn't appear very useful. If the underlying data
structure already supports deletes, there isn't much reason to use a
dynamization technique to add them to it. However, it is possible, in
many cases, to easily \emph{add} delete support to a static structure.
If it is possible to locate a record and somehow mark it as deleted,
without removing it from the structure, and then efficiently ignore
these records while querying, then the given structure and its search
problem can be said to be deletion decomposable. This technique for
deleting records is called \emph{weak deletes}.

\begin{definition}[Weak Deletes~\cite{overmars81}]
\label{def:weak-delete}
A data structure is said to support weak deletes if it provides a
routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon
$\alpha$, where $\mathscr{Q}(n)$ is the cost of answering the query
against a structure upon which no weak deletes were performed.\footnote{
    This paper also provides a similar definition for weak updates, but
    these aren't of interest to us in this work, and so the above
    definition was adapted from the original with the weak update
    constraints removed.
} The results of querying a block containing weakly deleted records
must be the same as the results of querying a block with those records
removed.
\end{definition}

As an example of a deletion decomposable search problem, consider the
set membership problem considered above
(Definition~\ref{def:set-membership}) where $\mathcal{I}$, the data
structure used to answer queries of the search problem, is a hash
map.\footnote{
    While most hash maps are already dynamic, and so wouldn't need
    dynamization to be applied, there do exist static ones too. For
    example, the hash map being considered could be implemented using
    perfect hashing~\cite{perfect-hashing}, which has many static
    implementations.
}

\begin{theorem}
    The set membership problem, answered using a static hash map, is
    deletion decomposable.
\end{theorem}

\begin{proof}
We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
is a decomposable search problem. For it to be deletion decomposable,
we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
records without hurting its query performance, delete performance, or
storage requirements. Assume that an instance $\mathscr{I} \in
\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.

Such a structure can support weak deletes. Each record within the
structure has a single bit attached to it, indicating whether it has
been deleted or not. These bits require $\Theta(n)$ storage and are
initialized to 0 when the structure is constructed. A delete can be
performed by querying the structure for the record to be deleted in
$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
operation has $D(n) \in \Theta(1)$ cost.

\begin{lemma}
\label{lem:weak-deletes}
The delete procedure as described above satisfies the requirements of
Definition~\ref{def:weak-delete} for weak deletes.
\end{lemma}
\begin{proof}
Per Definition~\ref{def:weak-delete}, there must exist some constant
dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
bounded by $k_\alpha \mathscr{Q}(n)$.

In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore the final
query cost must be bounded by $\Theta(k_\alpha)$. When a query is
executed against $\mathscr{I}$, there are three possible cases,
\begin{enumerate}
\item The record being searched for does not exist in $\mathscr{I}$. In
this case, the query result is 0.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of 0. In this case, the query result is 1.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of 1 (i.e., it has been deleted). In this case, the
query result is 0.
\end{enumerate}
In all three cases, the addition of deletes requires only $\Theta(1)$
extra work at most. Therefore, set membership over a static hash map
using our proposed deletion mechanism satisfies the requirements for
weak deletes, with $k_\alpha = 1$.
\end{proof}

Finally, we note that the cost of one of these weak deletes is $D(n) =
\mathscr{Q}(n)$. By Lemma~\ref{lem:weak-deletes}, the query cost, and
thus the delete cost, is not asymptotically harmed by the presence of
deleted records.

Thus, we've shown that set membership using a static hash map is a
decomposable search problem, that the storage cost remains $\Omega(n)$,
and that the query and delete costs are unaffected by the presence of
deletes using the proposed mechanism. All of the requirements of
deletion decomposability are satisfied; therefore, set membership using
a static hash map is a deletion decomposable search problem.
\end{proof}
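The following sketch (ours) illustrates the argument; a Python
dictionary stands in for the static hash map, with one delete bit per
record:

\begin{verbatim}
# Illustrative sketch: weak deletes over a "static" membership structure.
class StaticMembership:
    def __init__(self, records):
        # One delete bit per record, initialized to 0 at build time.
        self.bits = {r: 0 for r in records}

    def query(self, r):                # Q(n) in Theta(1), deletes ignored
        return r in self.bits and self.bits[r] == 0

    def delete(self, r):               # D(n) = Q(n): locate, set the bit
        if r in self.bits:
            self.bits[r] = 1

m = StaticMembership([1, 2, 3])
assert m.query(2)
m.delete(2)
assert not m.query(2)   # weakly deleted: still stored, no longer visible
\end{verbatim}

Note that the layout of the structure never changes; the deleted record
remains physically present, which is exactly the distinction between a
static-but-mutable structure and an immutable one drawn earlier.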
For such problems, deletes can be supported by first identifying the
block in the dynamization containing the record to be deleted, and then
calling $\mathtt{delete}$ on it. To allow this block to be easily
located, it is possible to maintain a hash table over all of the
records, alongside the dynamization, which maps each record onto the
block containing it. This table must be kept up to date as
reconstructions occur, but this can be done at no extra asymptotic cost
for any data structure having $B(n) \in \Omega(n)$, as it requires only
linear time. This allows deletes to be performed in $\mathscr{D}(n) \in
\Theta(D(n))$ time.

The presence of deleted records within the structure does introduce a
new problem, however. Over time, the number of records in each block
will drift away from the requirements imposed by the dynamization
technique. It will eventually become necessary to re-partition the
records to restore these invariants, which are necessary for bounding
the number of blocks, and thereby the query performance. The particular
invariant maintenance rules depend upon the decomposition scheme used.

\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for a
deletion decomposable search problem, the $i$th block, where $i \geq
2$,\footnote{
    Block $i=0$ will only ever have one record, so no special
    maintenance must be done for it. A delete will simply empty it
    completely.
} will, in the absence of deletes, contain $2^{i-1} + 1$ records. When
a delete occurs in block $i$, no special action is taken until the
number of records in that block falls below $2^{i-2}$. Once this
threshold is reached, a reconstruction can be performed to restore the
appropriate record counts in each block~\cite{merge-dsp}.

\Paragraph{Equal Block Method.} For the equal block method, there are
two cases in which a delete may cause a block to fail to obey the
method's size invariants,
\begin{enumerate}
    \item If enough records are deleted, it is possible for the number
        of blocks to exceed $f(2n)$, violating Constraint~\ref{ebm-c1}.
    \item The deletion of records may cause the maximum size of each
        block to shrink, causing some blocks to exceed the maximum
        capacity of $\nicefrac{2n}{s}$. This is a violation of
        Constraint~\ref{ebm-c2}.
\end{enumerate}
In both cases, it should be noted that $n$ decreases as records are
deleted. Should either of these cases emerge as a result of a delete,
the entire structure must be reconfigured to ensure that its invariants
are maintained. This reconfiguration follows the same procedure as when
an insert results in a violation: $s$ is updated to be exactly $f(n)$,
all existing blocks are unbuilt, and then the records are evenly
redistributed into the $s$ new blocks~\cite{overmars-art-of-dyn}.

\subsection{Worst-Case Optimal Techniques}

\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}

While fairly general, these dynamization techniques have a number of
limitations that prevent them from being directly usable as a general
solution to the problem of creating database indices. Because of the
requirement that the query being answered be decomposable, many search
problems cannot be addressed, or at least not addressed efficiently, by
decomposition-based dynamization.
The techniques also do nothing to reduce the worst-case insertion cost,
resulting in extremely poor tail latency relative to hand-built dynamic
structures. Finally, these approaches do not do a good job of exposing
the underlying configuration space to the user, meaning that the user
can exert only limited control over the performance of the dynamized
data structure. This section will discuss these limitations, and the
rest of this document will be dedicated to proposing solutions to them.

\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}
Unfortunately, the DSP abstraction used as the basis of classical
dynamization techniques has a few significant limitations that restrict
its applicability,

\begin{itemize}
    \item The query must be broadcast identically to each block and
        cannot be adjusted based on the state of the other blocks.

    \item The query process is done in one pass--it cannot be repeated.

    \item The result merge operation must be $O(1)$ to maintain good
        query performance.

    \item The result merge operation must be commutative and
        associative, and is called repeatedly to merge pairs of
        results.
\end{itemize}

These requirements restrict the types of queries that can be supported
efficiently by the method. For example, k-nearest neighbor and
independent range sampling are not decomposable.

\subsubsection{k-Nearest Neighbor}
\label{sssec-decomp-limits-knn}
The k-nearest neighbor (k-NN) problem is a generalization of the
nearest neighbor problem, which seeks to return the closest point
within the dataset to a given query point. More formally, this can be
defined as,
\begin{definition}[Nearest Neighbor]
    Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ be
    some function $f: D^2 \to \mathbb{R}^+$ representing the distance
    between two points within $D$. The nearest neighbor problem, $NN(D,
    q)$, returns some $d \in D$ such that $f(d, q) = \min_{d' \in D}
    f(d', q)$ for some query point, $q \in \mathbb{R}^d$.
\end{definition}

In practice, it is common to require that $f(x, y)$ be a metric,\footnote{
    Contrary to its vernacular usage as a synonym for ``distance'', a
    metric is more formally defined as a valid distance function over a
    metric space. Metric spaces require their distance functions to
    have the following properties,
    \begin{itemize}
        \item The distance between a point and itself is always 0.
        \item All distances between non-equal points must be positive.
        \item For all points, $x, y \in D$, it is true that
            $f(x, y) = f(y, x)$.
        \item For any three points $x, y, z \in D$ it is true that
            $f(x, z) \leq f(x, y) + f(y, z)$.
    \end{itemize}
    These distances also must have the interpretation that $f(x, y) <
    f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$.
    This is the opposite of the definition of similarity, and so some
    minor manipulations are usually required to make similarity
    measures work in metric-based indexes~\cite{intro-analysis}.
}
and this will be done in the examples of indices for addressing this
problem in this work, but it is not a fundamental aspect of the problem
formulation. The nearest neighbor problem itself is decomposable, with
a simple merge function that accepts the result with the smallest value
of $f(x, q)$ of any two inputs~\cite{saxe79}.
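For instance, a merge operator for nearest neighbor simply keeps the
closer of two partial results, as in this sketch (ours, using the
Manhattan distance as the metric):

\begin{verbatim}
# Illustrative sketch: nearest neighbor is decomposable because
# partial results merge in constant time.
def dist(a, b):                            # Manhattan distance (a metric)
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nn(block, q):                          # local query on one block
    return min(block, key=lambda d: dist(d, q))

def merge(r1, r2, q):                      # the Theta(1) merge operator
    return r1 if dist(r1, q) <= dist(r2, q) else r2

blocks = [[(0.0, 1.0), (4.0, 4.0)], [(2.0, 2.5)]]
q = (2.0, 2.0)
result = nn(blocks[0], q)
for b in blocks[1:]:
    result = merge(result, nn(b, q), q)
assert result == (2.0, 2.5)
\end{verbatim}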
The k-nearest neighbor problem generalizes nearest neighbor to return
the $k$ nearest elements,
\begin{definition}[k-Nearest Neighbor]
    Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x,
    y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the
    distance between two points within $D$. The k-nearest neighbor
    problem, $KNN(D, q, k)$, seeks to identify a set $R\subset D$ with
    $|R| = k$ such that $\forall d \in D - R, r \in R, f(d, q) \geq
    f(r, q)$.
\end{definition}

This can be thought of as solving the nearest-neighbor problem $k$
times, each time removing the returned result from $D$ prior to solving
the problem again. Unlike the single nearest-neighbor case (which can
be thought of as k-NN with $k=1$), this problem is \emph{not}
decomposable.

\begin{theorem}
    k-NN is not a decomposable search problem.
\end{theorem}

\begin{proof}
To prove this, consider the query $KNN(D, q, k)$ against some
partitioned dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN
is decomposable, then there must exist some constant-time, commutative,
and associative binary operator $\mergeop$ such that $R =
\bigmergeop_{0 \leq i \leq \ell} R_i$, where $R_i$ is the result of
evaluating the query $KNN(D_i, q, k)$. Consider the evaluation of the
merge operator against two arbitrary result sets, $R = R_i \mergeop
R_j$. It is clear that $|R| = |R_i| = |R_j| = k$, and that the contents
of $R$ must be the $k$ records from $R_i \cup R_j$ that are nearest to
$q$. Thus, $\mergeop$ must solve the problem $KNN(R_i \cup R_j, q, k)$.
However, k-NN cannot be solved in $O(1)$ time. Therefore, k-NN is not a
decomposable search problem.
\end{proof}

With that said, it is clear that there isn't any fundamental
restriction preventing the merging of the result sets; it is only the
case that an arbitrary performance requirement wouldn't be satisfied.
It is possible to merge the result sets in non-constant time, and so
k-NN is $C(n)$-decomposable. Unfortunately, this classification brings
with it a reduction in query performance as a result of the way result
merges are performed.

As a concrete example of these costs, consider using the Bentley-Saxe
method to extend the VPTree~\cite{vptree}. The VPTree is a static
metric index capable of answering k-NN queries in $O(k \log n)$ time.
One possible merge algorithm for k-NN would be to push all of the
elements of the two arguments onto a min-heap, and then pop off the
first $k$. In this case, the cost of the merge operation would be $C(k)
= k \log k$. Were $k$ assumed to be constant, the operation could be
considered constant-time. But given that $k$ is bounded above only by
$n$, this isn't a safe assumption to make in general. Evaluating the
total query cost for the extended structure yields,
\begin{equation}
    KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
\end{equation}
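A sketch of this heap-based merge (ours) is shown below; the essential
point is that its cost grows with $k$, so it cannot serve as the
constant-time $\mergeop$ of Definition~\ref{def:dsp}:

\begin{verbatim}
# Illustrative sketch: merging two k-NN partial results with a heap.
# Cost: O(k log k), so k-NN is only C(n)-decomposable.
import heapq

def knn_merge(r1, r2, q, dist, k):
    heap = [(dist(d, q), d) for d in r1 + r2]          # 2k entries
    heapq.heapify(heap)                                # O(k)
    return [heapq.heappop(heap)[1] for _ in range(k)]  # O(k log k)

dist = lambda a, b: abs(a - b)
r1, r2 = [3, 5], [4, 9]           # two partial results with k = 2
assert knn_merge(r1, r2, 4, dist, 2) == [4, 3]
\end{verbatim}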
The reason for this large increase in cost is the repeated application
of the merge operator. The Bentley-Saxe method applies the merge
operator in a binary fashion to each partial result, multiplying its
cost by a factor of $\log n$. Thus, the constant-time requirement of
standard decomposability is necessary to keep the cost of the merge
operator from appearing within the complexity bound of the entire
operation in the general case.\footnote{
    There is a special case, noted by Overmars, where the total cost is
    $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
    \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for
    the case where the cost of the query and merge operation are
    sufficiently large to consume the logarithmic factor, and so it
    doesn't represent a special case with better performance.
}
If we could revise the result merging operation to remove this
duplicated cost, we could greatly reduce the cost of supporting
$C(n)$-decomposable queries.

\subsubsection{Independent Range Sampling}
\label{ssec:background-irs}

Another problem that is not decomposable is independent sampling. There
are a variety of problems falling under this umbrella, including
weighted set sampling, simple random sampling, and weighted independent
range sampling, but we will focus on independent range sampling here.

\begin{definition}[Independent Range Sampling~\cite{tao22}]
    Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
    interval $q = [x, y]$ and an integer $k$, an independent range
    sampling query returns $k$ independent samples from $D \cap q$,
    with each point having equal probability of being sampled.
\end{definition}

This problem immediately encounters a category error when considering
whether it is decomposable: the result set is randomized, whereas the
conditions for decomposability are defined in terms of an exact
matching of records in result sets. To work around this, a slight abuse
of definition is in order: assume that the equality conditions within
the DSP definition can be interpreted to mean ``the contents of the two
sets are drawn from the same distribution''. This enables the category
of DSP to apply to this type of problem.

Even with this abuse, however, IRS cannot generally be considered
decomposable; it is at best $C(n)$-decomposable. The reason for this is
that matching the distribution requires drawing the appropriate number
of samples from each partition of the data. Even in the special case
that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples from
each partition that must appear in the result set cannot be known in
advance, due to differences in the selectivity of the predicate across
the partitions.

\begin{example}[IRS Sampling Difficulties]
    Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
    \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$, using bag semantics,
    and an IRS query over the interval $[3, 4]$ with $k=12$. Because
    all three partitions have the same size, it seems sensible to
    evenly distribute the samples across them ($4$ samples from each
    partition). Applying the query predicate to the partitions results
    in the following, $d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4,
    4\}$.

    In expectation, then, the first result set will contain $R_0 = \{3,
    3, 4, 4\}$, as it has a 50\% chance of sampling a $3$ and the same
    probability of sampling a $4$. The second and third result sets can
    only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$, respectively.
    Merging these together, we'd find that the probability distribution
    of the sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were
    we to perform the same sampling operation over the full dataset
    (not partitioned), the distribution would be $p(3) = 0.25$ and
    $p(4) = 0.75$.
\end{example}

The problem is that the number of samples drawn from each partition
needs to be weighted based on the number of elements satisfying the
query predicate in that partition. In the above example, by drawing $4$
samples from $D_1$, more weight is given to $3$ than exists within the
base dataset. This can be worked around by sampling a full $k$ records
from each partition, returning both the sample and the number of
records satisfying the predicate as that partition's query result, and
then performing another pass of IRS as the merge operator, but this is
the same approach as was used for k-NN above. This leaves IRS firmly in
the $C(n)$-decomposable camp. If it were possible to pre-calculate the
number of samples to draw from each partition, then a constant-time
merge operation could be used.
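The sketch below (ours) shows what such a pre-calculation looks like: a
first pass determines how many records match the predicate in each
partition, and per-partition sample counts are then drawn in proportion
to those weights, restoring uniformity over the full dataset:

\begin{verbatim}
# Illustrative sketch: IRS over a partitioned dataset, weighting each
# partition by its count of predicate-matching records.
import random

def irs(partitions, lo, hi, k):
    matches = [[v for v in p if lo <= v <= hi] for p in partitions]
    weights = [len(m) for m in matches]
    # Pre-pass: choose the partition for each sample in proportion
    # to its number of matching records.
    picks = random.choices(range(len(matches)), weights=weights, k=k)
    return [random.choice(matches[i]) for i in picks]

D = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
sample = irs(D, 3, 4, k=12)
# In expectation, p(3) = 2/8 and p(4) = 6/8, matching the
# unpartitioned distribution from the example above.
\end{verbatim}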
\subsection{Insertion Tail Latency}

\subsection{Configurability}

\section{Conclusion}
This chapter discussed the necessary background information pertaining
to queries and search problems, indexes, and techniques for dynamic
extension. It described the potential for using custom indexes to
accelerate particular kinds of queries, as well as the challenges
associated with constructing those indexes. The remainder of this
document will address these challenges through modification and
extension of the Bentley-Saxe method, describing work that has already
been completed, as well as the additional work that must be done to
realize this vision.