\chapter{Classical Dynamization Techniques} \label{chap:background} This chapter will introduce important background information and existing work in the area of data structure dynamization. We will first discuss the core concepts of search problems and data structures, which are central to dynamization techniques. While one might imagine that restrictions on dynamization would be functions of the data structure to be dynamized, in practice the requirements placed on the data structure are quite mild. Instead, the central difficulties to applying dynamization lie in the necessary properties of the search problem that the data structure is intended to solve. Following this, existing theoretical results in the area of data structure dynamization will be discussed, which will serve as the building blocks for our techniques in subsequent chapters. The chapter will conclude with a discussion of some of the limitations of these existing techniques. \section{Background} Before discussing dynamization itself, there are a few important definitions to dispose of. In this section, we'll discuss some relevant background information on search problems and data structures, which will form the foundation of our discussion of dynamization itself. \subsection{Queries and Search Problems} \label{sec:dsp} Data access lies at the core of most database systems. We want to ask questions of the data, and ideally get the answer efficiently. We will refer to the different types of question that can be asked as \emph{search problems}. We will be using this term in a similar way as the word \emph{query} \footnote{ The term query is often abused and used to refer to several related, but slightly different things. In the vernacular, a query can refer to either a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language. } is often used within the database systems literature: to refer to a general class of questions. For example, we could consider range scans, point-lookups, nearest neighbor searches, predicate filtering, random sampling, etc., to each be a general search problem. Formally, for the purposes of this work, a search problem is defined as follows, \begin{definition}[Search Problem] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, $\mathcal{Q}$ represents the domain of search parameters, and $\mathcal{R}$ represents the domain of possible answers.\footnote{ It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. } \end{definition} We will use the term \emph{query} to mean a specific instance of a search problem, with a fixed set of search parameters, \begin{definition}[Query] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific instance of the search problem, $F(\mathcal{D}, q)$. \end{definition} As an example of these definitions, a \emph{membership test} or a \emph{range scan} would be considered a search problem, and a range scan over the interval $[10, 99]$ would be a query. 
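To make the distinction concrete, a search problem can be thought of as nothing more than a function from a data set and a set of search parameters to a result. The following C++ sketch (purely illustrative; the types and names are not drawn from any particular system) expresses range counting in this form; a query is then simply a call to this function with fixed parameters.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Query parameters for the range count search problem (the domain Q).
struct RangeQuery {
    int low;
    int high;
};

// The search problem F : (D, Q) -> R, expressed as an ordinary function
// from a data set and query parameters to a result.
size_t range_count(const std::vector<int>& data, const RangeQuery& q) {
    size_t count = 0;
    for (int x : data) {
        if (q.low <= x && x <= q.high) {
            count++;
        }
    }
    return count;
}

// A query is a specific instance of the search problem with fixed
// parameters, e.g. range_count(data, {10, 99}).
\end{verbatim}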
We've drawn this distinction because, as we'll see as we enter into the discussion of our work in later chapters, it is useful to have separate, unambiguous terms for these two concepts. \subsection{Data Structures} Answering a search problem over an unordered set of data is not terribly efficient, and so usually the data is organized into a particular layout to facilitate more efficient answers. Such layouts are called \emph{data structures}. Examples of data structures include B-trees, hash tables, sorted arrays, etc. A data structure can be thought of as a \emph{solution} to a search problem~\cite{saxe79}. The symbol $\mathcal{I}$ indicates a data structure, and the symbol $\mathscr{I} \in \mathcal{I}$ represents an instance of a data structure built over a particular set of data, $d \subseteq \mathcal{D}$. We will use two abuses of notation pertaining to data structures throughout this work. First, $F(\mathscr{I}, q)$ will be used to indicate a query with search parameters $q$ over the data set $d$, answered using the data structure $\mathscr{I}$. Second, we will use $|\mathscr{I}|$ to indicate the number of records within $\mathscr{I}$ (which is equivalent to $|d|$, where $d$ is the set of records that $\mathscr{I}$ has been built over). We broadly classify data structures into three types, based upon the operations supported by the structure: static, half-dynamic, and full-dynamic. Static data structures do not support updates, half-dynamic structures support inserts, and full-dynamic support inserts and deletes. Note that we will use the unqualified term \emph{dynamic} to refer to both half-dynamic and full-dynamic structures when the distinction isn't relevant. Additionally, the term \emph{native dynamic} will be used to indicate a data structure that has been custom-built with support for inserts and/or deletes without the need for dynamization. These categories are not all-inclusive, as there are a number of data structures which do not fit the classification, but such structures are outside of the scope of this work. \begin{definition}[Static Data Structure~\cite{dsp}] \label{def:static-ds} A static data structure does not support updates of any kind, but can be constructed from a data set and answer queries. Additionally, we require that the static data structure provide the ability to reproduce the set of records that was used to construct it. Specifically, static data structures must support the following three operations, \begin{itemize} \item $\mathbftt{query}: \left(\mathcal{I}, \mathcal{Q}\right) \to \mathcal{R}$ \\ $\mathbftt{query}(\mathscr{I}, q)$ answers the query $F(\mathscr{I}, q)$ and returns the result. This operation runs in $\mathscr{Q}_S(n)$ time in the worst-case and cannot alter the state of $\mathscr{I}$. \item $\mathbftt{build}:\mathcal{PS}(\mathcal{D}) \to \mathcal{I}$ \\ $\mathbftt{build}(d)$ constructs a new instance of $\mathcal{I}$ using the records in set $d$. This operation runs in $B(n)$ time in the worst case.\footnote{ We use the notation $\mathcal{PS}(\mathcal{D})$ to indicate the power set of $\mathcal{D}$, i.e. the set containing all possible subsets of $\mathcal{D}$. Thus, $d \in \mathcal{PS}(\mathcal{D}) \iff d \subseteq \mathcal{D}$. } \item $\mathbftt{unbuild}: \mathcal{I} \to \mathcal{PS}(\mathcal{D})$ \\ $\mathbftt{unbuild}(\mathscr{I})$ recovers the set of records, $d$, used to construct $\mathscr{I}$. 
The literature on dynamization generally assumes that this operation runs in $\Theta(1)$ time~\cite{saxe79}, and we will adopt the same assumption in our analysis. \end{itemize} \end{definition} Note that the property of being static is distinct from that of being immutable. Static refers to the layout of records within the data structure, whereas immutable refers to the data stored within those records. This distinction will become relevant when we discuss different techniques for adding delete support to data structures. The data structures used in this work are always static, but not necessarily immutable, because the records may contain header information (like visibility) that is updated in place. \begin{definition}[Half-dynamic Data Structure~\cite{overmars-art-of-dyn}] \label{def:half-dynamic-ds} A half-dynamic data structure requires the three operations of a static data structure, as well as the ability to efficiently insert new data into a structure built over an existing data set. \begin{itemize} \item $\mathbftt{insert}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\ $\mathbftt{insert}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime, q) = F(\mathbftt{unbuild}(\mathscr{I}) \cup \{r\}, q)$, where $r \in \mathcal{D}$. This operation runs in $I(n)$ time in the worst-case. \end{itemize} \end{definition} The important aspect of insertion in this model is that the effect of the new record on the query result is observed, not necessarily that the result is a structure exactly identical to the one that would be obtained by building a new structure over $\mathbftt{unbuild}(\mathscr{I}) \cup \{r\}$. Also, though the formalism used implies a functional operation where the original data structure is unmodified, this is not actually a requirement. $\mathscr{I}$ could be slightly modified in place, and returned as $\mathscr{I}^\prime$, as is conventionally done with native dynamic data structures. \begin{definition}[Full-dynamic Data Structure~\cite{overmars-art-of-dyn}] \label{def:full-dynamic-ds} A full-dynamic data structure is a half-dynamic structure that also has support for deleting records from the dataset. \begin{itemize} \item $\mathbftt{delete}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\ $\mathbftt{delete}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime, q) = F(\mathbftt{unbuild}(\mathscr{I}) - \{r\}, q)$, where $r \in \mathcal{D}$. This operation runs in $D(n)$ time in the worst-case. \end{itemize} \end{definition} As with insertion, the important aspect of deletion is that the effect of $r$ on the results of queries answered using $\mathscr{I}^\prime$ has been removed, not necessarily that the record is physically removed from the structure. A full-dynamic data structure also supports in-place modification of an existing record. In order to update a record $r$ to $r^\prime$, it is sufficient to perform $\mathbftt{insert}\left(\mathbftt{delete}\left(\mathscr{I}, r\right), r^\prime\right)$. There are data structures that do not fit into this classification scheme. For example, approximate structures like Bloom filters~\cite{bloom70} or various sketches and summaries do not retain full information about the records that have been used to construct them. Such structures cannot support \texttt{unbuild} as a result. Some other data structures cannot be statically queried--the act of querying them mutates their state. This is the case for structures like heaps, stacks, and queues, for example.
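To ground the definitions above, the following C++ sketch shows a minimal static structure for the set membership problem: a sorted array whose constructor plays the role of $\mathbftt{build}$, with $\mathbftt{query}$ and $\mathbftt{unbuild}$ as member functions. This is only an illustration of the interface, not a structure taken from the dynamization literature; a half-dynamic or full-dynamic structure would additionally provide insert and delete operations.
\begin{verbatim}
#include <algorithm>
#include <utility>
#include <vector>

// A minimal static structure: a sorted array answering membership queries.
class SortedArray {
public:
    // build: constructs the structure from a record set in B(n) = O(n log n).
    explicit SortedArray(std::vector<int> records) : data_(std::move(records)) {
        std::sort(data_.begin(), data_.end());
    }

    // query: answers a membership query in Q_S(n) = O(log n) time,
    // without modifying the structure.
    bool query(int key) const {
        return std::binary_search(data_.begin(), data_.end(), key);
    }

    // unbuild: recovers the set of records used to construct the structure.
    std::vector<int> unbuild() const { return data_; }

private:
    std::vector<int> data_;
};
\end{verbatim}
We will reuse this example structure in several of the sketches that follow.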
\section{Dynamization Basics} \emph{Dynamization} is the process of transforming a static data structure into a dynamic one. When certain conditions are satisfied by the data structure and its associated search problem, this process can be done automatically, and with provable asymptotic bounds on amortized insertion performance, as well as worst-case query performance. This automatic approach is in contrast with the design of a native dynamic data structure, which involves altering the data structure itself to natively support updates. This process usually involves implementing techniques that partially rebuild small portions of the structure to accommodate new records, which is called \emph{local reconstruction}~\cite{overmars83}. This is a very manual intervention that requires significant effort on the part of the data structure designer, whereas conventional dynamization can be performed with little-to-no modification of the underlying data structure at all. It is worth noting that there are a variety of techniques discussed in the literature for dynamizing structures with specific properties, or under very specific sets of circumstances. Examples include frameworks for adding update support to succinct data structures~\cite{dynamize-succinct} or taking advantage of batching of insert and query operations~\cite{batched-decomposable}. This section discusses techniques that are more general, and don't require workload-specific assumptions. For more detail than is included in this section, Overmars wrote a book providing a comprehensive survey of techniques for creating dynamic data structures, including not only the techniques discussed here, but also local reconstruction based techniques and more~\cite{overmars83}.\footnote{ Sadly, this book isn't readily available in digital format as of the time of writing. } \subsection{Global Reconstruction} The most fundamental dynamization technique is that of \emph{global reconstruction}. While not particularly useful on its own, global reconstruction serves as the basis for the techniques to follow, and so we will begin our discussion of dynamization with it. Consider some search problem, $F$, for which we have a static solution, $\mathcal{I}$. Given the operations supported by static structures, it is possible to insert a new record, $r \in \mathcal{D}$, into an instance $\mathscr{I} \in \mathcal{I}$ as follows, \begin{equation*} \mathbftt{insert}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) \cup \{r\}) \end{equation*} Likewise, a record can be deleted using, \begin{equation*} \mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\}) \end{equation*} It goes without saying that these operations are sub-optimal, as the insertion and deletion costs are both $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for most data structures. However, this global reconstruction strategy can be used as a primitive for more sophisticated techniques that can provide reasonable performance.
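Expressed in terms of the illustrative SortedArray sketch from the previous section, insertion by global reconstruction is nothing more than unbuilding, adding the record, and rebuilding. The sketch below is an illustration only, not a prescribed implementation.
\begin{verbatim}
#include <utility>
#include <vector>

// Insertion via global reconstruction: unbuild, add the new record,
// and rebuild the entire structure. Each insert costs Theta(B(n)).
SortedArray insert_global(const SortedArray& s, int r) {
    std::vector<int> records = s.unbuild();
    records.push_back(r);
    return SortedArray(std::move(records));
}
\end{verbatim}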
\begin{figure} \centering \includegraphics[width=.8\textwidth]{diag/decomp.pdf} \caption{\textbf{Data Structure Decomposition.} A single large data structure containing $n$ records can be decomposed into multiple instances of the same data structure, each built over a disjoint partition of the data. In this case, the decomposition is performed such that each block contains $\sqrt{n}$ records. As a result, rather than an insert requiring $B(n)$ time, it will require $B(\sqrt{n})$ time.} \label{fig:bg-decomp} \end{figure} The problem with global reconstruction is that each insert or delete must rebuild the entire data structure. The key insight that enables dynamization based on global reconstruction, first discussed by Bentley and Saxe~\cite{saxe79}, is that the cost associated with global reconstruction can be reduced by \emph{decomposing} the data structure into multiple, smaller structures, called \emph{blocks}, each built from a disjoint partition of the data. The process by which the structure is broken into blocks is called a decomposition method, and various methods have been proposed that result in asymptotic improvements in insertion performance when compared to global reconstruction alone. \begin{example}[Data Structure Decomposition] Consider a data structure that can be constructed in $B(n) \in \Theta (n \log n)$ time with $|\mathscr{I}| = n$. Inserting a new record into this structure using global reconstruction will require $I(n) \in \Theta (n \log n)$ time. However, if the data structure is decomposed into blocks, such that each block contains $\Theta(\sqrt{n})$ records, as shown in Figure~\ref{fig:bg-decomp}, then only a single block must be reconstructed to accommodate the insert, requiring $I(n) \in \Theta(\sqrt{n} \log \sqrt{n})$ time. If this structure contains $m = \frac{n}{\sqrt{n}} = \sqrt{n}$ blocks, we represent it with the notation $\mathscr{I} = \{\mathscr{I}_1, \ldots, \mathscr{I}_m\}$, where $\mathscr{I}_i$ is the $i$th block. \end{example} In this example, the decomposition resulted in a reduction of the worst-case insert cost. However, many decomposition schemes that we will examine do not affect the worst-case cost, despite having notably better performance in practice. As a result, it is more common to consider the \emph{amortized} insertion cost, $I_A(n)$, when examining dynamization. This cost function has the form, \begin{equation*} I_A(n) = \frac{B(n)}{n} \cdot \text{A} \end{equation*} where $\text{A}$ is the number of times that a record within the structure participates in a reconstruction, often called the write amplification. Much of the existing work on dynamization has considered different decomposition methods for static data structures, and the effects that these methods have on insertion and query performance. However, before we can discuss these approaches, we must first address the problem of answering search problems over these decomposed structures. \subsection{Decomposable Search Problems} Not all search problems can be correctly answered over a decomposed data structure, and this introduces one of the major limitations of traditional dynamization techniques: the answer to the search problem obtained from the decomposition must be the same as would have been obtained had all of the data been stored in a single data structure. This requirement is formalized in the definition of a class of problems called \emph{decomposable search problems} (DSP). This class was first defined by Jon Bentley, \begin{definition}[Decomposable Search Problem~\cite{dsp}] \label{def:dsp} A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\mergeop$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition} The requirement for $\mergeop$ to be constant-time was used by Bentley to prove specific performance bounds for answering queries from a decomposed data structure. However, it is not strictly \emph{necessary}, and later work by Overmars lifted this constraint and considered a more general class of search problems called \emph{$C(n)$-decomposable search problems}, \begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}] A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\mergeop$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$. \end{definition} \subsubsection{Examples} To demonstrate that a search problem is decomposable, it is necessary to prove the existence of the merge operator, $\mergeop$, with the necessary properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)$. With these two results, induction demonstrates that the problem is decomposable even in cases with more than two partial results. As an example, consider the range counting problem, which seeks to identify the number of elements in a set of 1-dimensional points that fall onto a specified interval, \begin{definition}[Range Count] \label{def:range-count} Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval, $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns the cardinality, $|d \cap q|$. \end{definition} \begin{theorem} \label{ther:decomp-range-count} Range Count is a decomposable search problem. \end{theorem} \begin{proof} Let $\mergeop$ be addition ($+$). Applying this to Definition~\ref{def:dsp} gives \begin{align*} |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)| \end{align*} which follows from the distributive property of intersection over union, together with the fact that $A$ and $B$ are disjoint (and so $A \cap q$ and $B \cap q$ are disjoint as well). Addition is an associative and commutative operator that can be calculated in $\Theta(1)$ time. Therefore, range count is a decomposable search problem. \end{proof} Because the codomain of a DSP is not restricted, more complex output structures can be used to allow for problems that are not directly decomposable to be converted to DSPs, possibly with some minor post-processing. For example, calculating the arithmetic mean of a set of numbers can be formulated as a DSP, \begin{theorem} The calculation of the arithmetic mean of a set of numbers is a DSP. \end{theorem} \begin{proof} Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple contains the sum of the values within the input set, and the cardinality of the input set. For two disjoint partitions of the data, $D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$, and let $A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$. Because $D_1$ and $D_2$ are disjoint, the sum and cardinality of $D_1 \cup D_2$ are $s_1 + s_2$ and $c_1 + c_2$ respectively, and so \begin{align*} A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1)\mergeop A(D_2) \end{align*} The operator $\mergeop$ is associative, commutative, and computable in constant time. From the output tuple $(s, c)$, the average can be determined in constant time by taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP. \end{proof} \subsubsection{Answering Queries for DSPs} Queries for a decomposable search problem can be answered over a decomposed structure by individually querying each block, and then merging the results together using $\mergeop$.
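Because range count is decomposable with addition as its merge operator, a query over a decomposed data set can be answered by querying each partition independently and summing the partial results. The sketch below continues the illustrative range count example from earlier in this chapter; the names are hypothetical.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Answering a decomposable search problem over a decomposed data set:
// query each block independently, then combine the partial results with
// the merge operator (addition, in the case of range count).
size_t query_decomposed(const std::vector<std::vector<int>>& blocks,
                        const RangeQuery& q) {
    size_t result = 0;                    // identity of the merge operator
    for (const auto& block : blocks) {
        result += range_count(block, q);  // merge in the partial result
    }
    return result;
}
\end{verbatim}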
In many cases, this process will introduce some overhead in the query cost. Given a decomposed data structure $\mathscr{I} = \{\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_m\}$, a query for a $C(n)$-decomposable search problem can be answered using, \begin{equation*} \mathbftt{query}\left(\mathscr{I}, q\right) \triangleq \bigmergeop_{i=1}^{m} F(\mathscr{I}_i, q) \end{equation*} which requires, \begin{equation*} \mathscr{Q}(n) \in O \left( m \cdot \left(\mathscr{Q}_S(n_\text{max}) + C(n)\right)\right) \end{equation*} time, where $m$ is the number of blocks and $n_\text{max}$ is the size of the largest block. Note that $C(n)$ is multiplied by $m$ in this expression--this is a large part of the reason why $C(n)$-decomposability is not particularly desirable compared to standard decomposability, where $C(n) \in \Theta(1)$ and thus falls out of the cost function. This is an upper bound only; it is occasionally possible to do better. Under certain circumstances, the costs of querying multiple blocks can be absorbed, resulting in no worst-case overhead, at least asymptotically. As an example, consider a linear scan of the data running in $\Theta(n)$ time. In this case, every record must be considered, and so there isn't any performance penalty to breaking the records into multiple blocks and scanning them individually.\footnote{ From an asymptotic perspective. There will still be measurable performance effects from caching, etc., even in this case. } More formally, for any query running in $\mathscr{Q}_S(n) \in \Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case cost of answering a decomposable search problem from a decomposed structure is $\Theta\left(\mathscr{Q}_S(n)\right)$.~\cite{saxe79} \section{Decomposition-based Dynamization for Half-dynamic Structures} The previous discussion reveals the basic tension that exists within decomposition-based techniques: the number of blocks into which the structure is decomposed mediates a trade-off between insertion and query performance. Query performance is improved by reducing the number of blocks, but this is concomitant with making the blocks larger, harming insertion performance. The literature on decomposition-based dynamization techniques discusses different approaches for performing the decomposition to balance these two competing interests, as well as various additional properties of search problems and structures that can be leveraged for better performance. In this section, we will discuss these topics in the context of creating half-dynamic data structures, and the next section will discuss similar considerations for full-dynamic structures. Of the decomposition techniques, we will focus on the three most important methods.\footnote{ There are two main classes of method for decomposition: decomposing based on some counting scheme (logarithmic and $k$-binomial)~\cite{saxe79} or decomposing into equally sized blocks (equal block method)~\cite{overmars-art-of-dyn}. Other, more complex, methods do exist, but they are largely compositions of these two simpler ones. These decompositions are of largely theoretical interest, as they are sufficiently complex to be of questionable practical utility.~\cite{overmars83} } The earliest of these is the logarithmic method~\cite{saxe79}, often called the Bentley-Saxe method in modern literature, and is the most commonly discussed technique today.
The logarithmic method has been directly applied in a few instances in the literature, such as to metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree}, and has also been used in a modified form for genetic sequence search structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few examples. Bentley and Saxe also proposed a second approach, the $k$-binomial method, that slightly alters the exact decomposition approach used by the logarithmic method to allow for flexibility in whether the performance of inserts or queries should be favored~\cite{saxe79}. A later technique, the equal block method~\cite{overmars-art-of-dyn}, was developed with a similar goal of introducing a mechanism for performance tuning. Of the three, the logarithmic method is the most generally effective, and we have not identified any specific applications of either $k$-binomial decomposition or the equal block method outside of the theoretical literature. \subsection{The Logarithmic Method} \label{ssec:bsm} The original, and most frequently used, decomposition technique is the logarithmic method, also called the Bentley-Saxe method in more recent literature. This technique decomposes the structure into logarithmically many blocks of exponentially increasing size. More specifically, the data structure is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_h$. A given block $\mathscr{I}_i$ will be either empty, or contain exactly $2^{i-1}$ records within it. The procedure for inserting a record, $r \in \mathcal{D}$, into a logarithmic decomposition is as follows. If the block $\mathscr{I}_1$ is empty, then $\mathscr{I}_1 = \mathbftt{build}(\{r\})$. If it is not empty, then there will exist a maximal sequence of non-empty blocks $\mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq 1$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case, $\mathscr{I}_{i+1}$ is set to $\mathbftt{build}(\{r\} \cup \bigcup_{l=1}^i \mathbftt{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_1$ through $\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the end of the structure as needed. \begin{figure} \centering \includegraphics[width=.8\textwidth]{diag/bsm.pdf} \caption{\textbf{Insertion in the Logarithmic Method.} A logarithmic decomposition of some data structure initially containing records $r_1, r_2, \ldots, r_{10}$ is shown, along with the insertion procedure. First, the new record $r_{11}$ is inserted. The first empty block is at $i=1$, and so the new record is simply placed there. Next, $r_{12}$ is inserted. For this insert, the first empty block is at $i=3$, requiring the blocks $1$ and $2$ to be merged, along with $r_{12}$, to create the new block. } \label{fig:bsm-example} \end{figure} \begin{example}[Insertion into a Logarithmic Decomposition] Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The dynamization is built over a set of records $r_1, r_2, \ldots, r_{10}$ initially, with eight records in $\mathscr{I}_4$ and two in $\mathscr{I}_2$. The first new record, $r_{11}$, is inserted directly into $\mathscr{I}_1$. For the next insert following this, $r_{12}$, the first empty block is $\mathscr{I}_3$, and so the insert is performed by doing $\mathscr{I}_3 = \mathbftt{build}\left(\{r_{12}\} \cup \mathbftt{unbuild}(\mathscr{I}_1) \cup \mathbftt{unbuild}(\mathscr{I}_2)\right)$ and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$. \end{example}
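A sketch of this insertion procedure, in terms of the illustrative SortedArray structure from earlier, is shown below. For convenience the levels are zero-indexed here, so a full level $i$ holds $2^i$ records; an empty optional represents an empty block.
\begin{verbatim}
#include <optional>
#include <utility>
#include <vector>

// Insertion under the logarithmic (Bentley-Saxe) method. levels[i] is
// either empty or holds exactly 2^i records (zero-indexed).
void bsm_insert(std::vector<std::optional<SortedArray>>& levels, int rec) {
    std::vector<int> accum = {rec};    // records involved in the rebuild

    // Unbuild every leading full level; stop at the first empty one.
    size_t i = 0;
    while (i < levels.size() && levels[i].has_value()) {
        std::vector<int> records = levels[i]->unbuild();
        accum.insert(accum.end(), records.begin(), records.end());
        levels[i].reset();             // the level is emptied
        i++;
    }
    if (i == levels.size()) {
        levels.emplace_back();         // grow the structure by one level
    }

    // Build a single new block from the accumulated 2^i records.
    levels[i] = SortedArray(std::move(accum));
}
\end{verbatim}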
This technique is also called a \emph{binary decomposition} of the data structure. Considering a logarithmic decomposition of a structure containing $n$ records, labeling each block with a $0$ if it is empty and a $1$ if it is full will result in the binary representation of $n$. For example, the final state of the structure in Figure~\ref{fig:bsm-example} contains $12$ records, and the labeling procedure will result in $0\text{b}1100$, which is $12$ in binary. Inserts affect this representation of the structure in the same way that incrementing the binary number by $1$ does. By applying this method to a static data structure, a half-dynamic structure can be created with the following performance characteristics, \begin{align*} \text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\ \text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\ \text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(\log_2 n\cdot \mathscr{Q}_S\left(n\right)\right) \\ \end{align*} This is a particularly attractive result because, for example, a data structure having $B(n) \in \Theta(n)$ will have an amortized insertion cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this is an extra logarithmic multiple attached to the query complexity. It is also worth noting that the worst-case insertion cost remains the same as global reconstruction, but this case arises only very rarely. In terms of the binary decomposition representation, the worst-case behavior is triggered each time the existing number overflows, and a new digit must be added. \subsection{The $k$-Binomial Method} \begin{figure} \centering \includegraphics[width=.8\textwidth]{diag/kbin.pdf} \caption{\textbf{Insertion in the $k$-Binomial Method.} A $k$-binomial decomposition of some data structure initially containing the records $r_1 \ldots r_{18}$ with $k=3$, along with the values of the $D$ array. When the record $r_{19}$ is inserted, $D[1]$ is incremented and, as it is not equal to $D[2]$, no carrying is necessary. Thus, the new record is simply placed in the structure at $i=1$. However, inserting the next record will result in $D[1] = D[2]$ after the increment. Once this increment is shifted, it will also be the case that $D[2] = D[3]$. As a result, the entire structure is compacted into a single block. } \label{fig:dyn-kbin} \end{figure} One of the significant limitations of the logarithmic method is that it is quite rigid. In our earlier discussion of decomposition we noted that there exists a clear trade-off between insert and query performance for half-dynamic structures, mediated by the number of blocks into which the structure is decomposed. However, the logarithmic method does not allow any navigation of this trade-off. In their original paper on the topic, Bentley and Saxe proposed a different decomposition scheme that does expose this trade-off, called the $k$-binomial transform.~\cite{saxe79} In this transform, rather than decomposing the data structure based on powers of two, the structure is decomposed based on a sum of $k$ binomial coefficients. This decomposition results in exactly $k$ blocks in the structure. For example, with $k=3$, the number 17 can be represented as, \begin{align*} 17 &= {5 \choose 3} + {4 \choose 2} + {1 \choose 1} \\ &= 10 + 6 + 1 \end{align*} and thus the decomposed structure will contain three blocks, one with $10$ records, one with $6$, and another with $1$.
More generally, a structure of $n$ elements is decomposed based on the following sum of binomial coefficients, \begin{equation*} n = \sum_{i=1}^{k} {D[i] \choose i} \end{equation*} where $D$ is an array of $k+1$ integers, such that $D[i] > D[i-1]$ for all $i \leq k$, and $D[k+1] = \infty$. When a record is inserted, $D[1]$ is incremented by one, and then this increment is ``shifted'' up the array until no two adjacent entries are equal. This is done by considering each index $i$ in the array and checking whether $D[i]$ is equal to $D[i+1]$. If it is, $D[i+1]$ is incremented, $D[i]$ is reset to $i - 1$ (so that the now-empty block $i$ represents ${i-1 \choose i} = 0$ records), and then $i$ itself is incremented. This is guaranteed to terminate because the last element of the array $D[k+1]$ is taken to be infinite. \begin{algorithm} \caption{Insertion into a $k$-Binomial Decomposition} \label{alg:dyn-binomial-insert} \KwIn{$r$: the record to be inserted, $\mathscr{I} = \{ \mathscr{I}_1 \ldots \mathscr{I}_k\}$: a decomposed structure, $D$: the array of binomial coefficients} $D[1] \gets D[1] + 1$ \; $S \gets \{r\} \cup \mathbftt{unbuild}(\mathscr{I}_1)$ \; $\mathscr{I}_1 \gets \emptyset$ \; \BlankLine $i \gets 1$ \; \While{$D[i] = D[i+1]$} { $D[i+1] \gets D[i+1] + 1$ \; $D[i] \gets i - 1$\; \BlankLine $i \gets i + 1$\; $S \gets S \cup \mathbftt{unbuild}(\mathscr{I}_i)$ \; $\mathscr{I}_i \gets \emptyset$\; } \BlankLine $\mathscr{I}_i \gets \mathbftt{build}(S)$ \; \Return $\mathscr{I}$ \end{algorithm} In order to maintain the physical decomposition alongside this counting scheme, the method maintains a list of $k$ structures, $\mathscr{I} = \{\mathscr{I}_1 \ldots \mathscr{I}_k\}$ (which all start empty). During an insert, a set of records to use to build the new block is initialized with the record to be inserted. Then, each time a new index $i$ is considered by the increment procedure, the structure $\mathscr{I}_i$ is unbuilt, its records are added to the set, and the block is emptied. When the increment procedure terminates, a new block is built from the set and placed at $\mathscr{I}_i$. This process is a bit complicated, so we've summarized it in Algorithm~\ref{alg:dyn-binomial-insert}. Figure~\ref{fig:dyn-kbin} shows an example of inserting records into a $k$-binomial decomposition. As a concrete trace, inserting a twentieth record into the structure of Figure~\ref{fig:dyn-kbin}, for which $D = [3, 4, 5]$ after the nineteenth insert, first increments $D[1]$ to $4$; carrying then produces $D = [0, 5, 5]$ and finally $D = [0, 1, 6]$, at which point all ${6 \choose 3} = 20$ records reside in the single block $\mathscr{I}_3$. Applying this technique results in the following costs for operations, \begin{align*} \text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot \left(k! \cdot n\right)^{\frac{1}{k}}\right) \\ \text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\ \text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(k\cdot \mathscr{Q}_S(n)\right) \\ \end{align*} Because the number of blocks is restricted to a constant, $k$, this method is highly biased towards query performance, at the cost of insertion. Bentley and Saxe also proposed a decomposition based on the dual of this $k$-binomial approach. We won't go into the details here, but this dual $k$-binomial method effectively reverses the insert and query trade-offs to produce an insert-optimized structure, with costs, \begin{align*} \text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left( \frac{B(n)}{n} \cdot k\right) \\ \text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\ \text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(\left(k! \cdot n\right)^{\frac{1}{k}}\cdot \mathscr{Q}_S(n)\right) \\ \end{align*}
\subsection{Equal Block Method} \label{ssec:ebm} \begin{figure} \centering \includegraphics[width=.8\textwidth]{diag/ebm.pdf} \caption{\textbf{Insertion in the Equal Block Method.} An equal block decomposition of some data structure initially containing the records $r_1\ldots r_{14}$, with $f(n) = 3$. When the record $r_{15}$ is inserted, the smallest block ($i=1$) is located, and the record is placed there by rebuilding it. When the next record, $r_{16}$, is inserted, the value of $f(n)$ increases to $4$. As a result, the entire structure must be re-partitioned to evenly distribute the records over $4$ blocks during this insert. } \label{fig:dyn-ebm} \end{figure} The $k$-binomial method aims to provide the ability to adjust the performance of a dynamized structure, selecting for either insert or query performance. However, it is a bit of a blunt instrument, allowing for a broadly insert-optimized structure with poor query performance, or a broadly query-optimized structure with poor insert performance. Once a direction has been selected, the degree of trade-off can be slightly adjusted by tweaking $k$. The desire to introduce a decomposition that allowed for more fine-grained control over this trade-off resulted in a third decomposition technique called the \emph{equal block method}. There have been several proposed variations of this concept~\cite{maurer79, maurer80}, but we will focus on the most developed form as described by Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core concept of the equal block method is to decompose the data structure into a specified number of blocks, such that each block is of roughly equal size. Consider an instance of a data structure $\mathscr{I} \in \mathcal{I}$ that solves some decomposable search problem, $F$, and is built over a set of records $d \subseteq \mathcal{D}$. Rather than decomposing the data structure based on some count-based scheme, as the two previous techniques did, we can simply break the structure up into $s$ evenly sized blocks, $\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$, each built over one of the partitions $d_1, d_2, \ldots, d_s$ of $d$. Fixing $s$ to a constant value results in large degradation of insert performance as the number of records grows,\footnote{ The $k$-binomial decomposition got away with fixing the number of blocks at a constant because it scaled the sizes of the blocks to ensure that there was a size distribution, with most of the reconstruction effort occurring in the smaller blocks. When all the blocks are of equal size, however, the cost of these reconstructions is much larger, and so it becomes necessary to gradually grow the block count to ensure that insertion cost doesn't grow too large. } and so we instead take it to be governed by a smooth, monotonically increasing function $f(n)$ such that, at any point, the following two constraints are obeyed, \begin{align} f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\ \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2} \end{align} A new record is inserted by finding the smallest block and rebuilding it with the new record included.
If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$ is the index of the smallest block, then an insert is done by, \begin{equation*} \mathscr{I}_k^\prime = \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}_k) \cup \{r\}) \end{equation*} Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{ Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be violated by deletes. We're omitting deletes from the discussion at this point, but will circle back to them in Section~\ref{ssec:dyn-deletes}. } In this case, the constraints are enforced by re-configuring the structure: $s$ is updated to be exactly $f(n)$, all of the existing blocks are unbuilt, and then the records are redistributed evenly into $s$ blocks. An example of insertion in the equal block method is shown in Figure~\ref{fig:dyn-ebm}. This technique provides better amortized performance bounds than global reconstruction, at the possible cost of worse query performance for sub-linear queries. We'll omit the details of the proof of performance for brevity and streamline some of the original notation (full details can be found in~\cite{overmars83}), but this technique ultimately results in a data structure with the following performance characteristics, \begin{align*} \text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\ \text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(f(n)\cdot B\left(\frac{n}{f(n)}\right)\right) \\ \text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(f(n) \cdot \mathscr{Q}_S\left(\frac{n}{f(n)}\right)\right) \\ \end{align*} The equal block method is generally \emph{worse} in terms of insertion performance than the logarithmic and $k$-binomial decompositions. This happens because, for a given number of blocks, each block has approximately the same size, and so reconstructions are larger on average than in the logarithmic method, where most reconstructions involve only the smaller blocks. \subsection{Optimizations} In addition to exploring various different approaches to decomposing the data structure to be dynamized, the literature also explores a number of techniques for optimizing performance under certain circumstances. In this section, we will discuss the two most important of these for our purposes: the exploitation of more efficient data structure merging, and an approach for reducing the worst-case insertion cost of a decomposition based loosely on the logarithmic method. \subsubsection{Merge Decomposable Search Problems} When considering a decomposed structure, reconstructions are performed not using a random assortment of records, but mostly using records extracted from already existing data structures. With static data structures, as defined in Definition~\ref{def:static-ds}, the best we can do in general is to unbuild the existing structures and rebuild from scratch. However, many data structures can be constructed more efficiently by merging existing instances. Consider a data structure that supports construction via merging, $\mathbftt{merge}(\mathscr{I}_1, \ldots \mathscr{I}_k)$ in $B_M(n, k)$ time, where $n = \sum_{i=1}^k |\mathscr{I}_i|$. A search problem for which such a data structure exists is called a \emph{merge decomposable search problem} (MDSP)~\cite{merge-dsp}.
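As a simple illustration of why merging can be preferable to rebuilding, consider sorted-array blocks: two already-sorted blocks can be combined with a linear-time merge, avoiding the cost of sorting from scratch. The sketch below is illustrative only and is not taken from the cited work.
\begin{verbatim}
#include <algorithm>
#include <iterator>
#include <vector>

// Merging two already-sorted blocks costs B_M(n, 2) = O(n), compared to
// the B(n) = O(n log n) cost of building from an unsorted record set.
std::vector<int> merge_sorted_blocks(const std::vector<int>& a,
                                     const std::vector<int>& b) {
    std::vector<int> merged;
    merged.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(merged));
    return merged;  // ready to serve as the backing array of a new block
}
\end{verbatim}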
Note that in~\cite{merge-dsp}, Overmars considers a \emph{very} specific definition where the data structure is built in two stages: an initial sorting phase, requiring $O(n \log n)$ time, and then a construction phase requiring $O(n)$ time. Overmars's proposed mechanism for leveraging this property attaches a linked list to each block, which stores the records in sorted order (to account for structures where the records must be sorted, but aren't necessarily kept that way). During reconstructions, these sorted lists can first be merged, and then the data structure built from the resulting merged list. Using this approach, even accounting for the merging of the lists, he is able to prove that the amortized insertion cost is lower than it would be if the full $O(n \log n)$ construction cost were paid for each reconstruction.~\cite{merge-dsp} While Overmars's definition for MDSP does capture a large number of mergeable data structures (including all of the mergeable structures considered in this work), we modify his definition to consider a broader class of problems. We will be using the term to refer to any search problem with a data structure that can be merged more efficiently than built from an unsorted set of records. More formally, \begin{definition}[Merge Decomposable Search Problem~\cite{merge-dsp}] \label{def:mdsp} A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is merge decomposable if and only if there exists a solution to the search problem (i.e., a data structure) that is static, and also supports the operation, \begin{itemize} \item $\mathbftt{merge}: \mathcal{I}^k \to \mathcal{I}$ \\ $\mathbftt{merge}(\mathscr{I}_1, \ldots, \mathscr{I}_k)$ returns a static data structure, $\mathscr{I}^\prime$, constructed from the input data structures, with cost $B_M(n, k) \leq B(n)$, such that for any set of search parameters $q$, \begin{equation*} \mathbftt{query}(\mathscr{I}^\prime, q) = \mathbftt{query}(\mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}_1)\cup\ldots\cup\mathbftt{unbuild}(\mathscr{I}_k)), q) \end{equation*} \end{itemize} \end{definition} The value of $k$ can be upper-bounded by the decomposition technique used. For example, in the logarithmic method there will be $\log n$ structures to merge in the worst case, and so to gain benefit from the merge routine, the merging of $\log n$ structures must be less expensive than building the new structure using the standard $\mathbftt{unbuild}$ and $\mathbftt{build}$ mechanism. The availability of an efficient merge operation isn't helpful in the equal block method, which doesn't perform data structure merges.\footnote{ In the equal block method, all reconstructions are due to either inserting a record or re-partitioning of the records. In the former case, the reconstruction pulls records from only a single structure and merging is not possible. In the latter, records may come from multiple structures, but the structures are not merged and only some of the records from each are used. In either case, merging is not useful as an optimization. } \subsubsection{Improved Worst-Case Insertion Performance} \label{ssec:bsm-worst-optimal} Dynamization based upon decomposition and global reconstruction has a significant gap between its \emph{amortized} insertion performance and its \emph{worst-case} insertion performance. When using the Bentley-Saxe method, the logarithmic decomposition ensures that the majority of inserts involve rebuilding only small data structures, and thus are relatively fast. However, the worst-case insertion cost is still $\Theta(B(n))$, no better than global reconstruction, because the worst-case insert requires a reconstruction using all of the records in the structure.
Overmars and van Leeuwen~\cite{overmars81, overmars83} proposed an alteration to the logarithmic method that is capable of bringing the worst-case insertion cost in line with the amortized cost, $I(n) \in \Theta \left(\frac{B(n)}{n} \log n\right)$. To accomplish this, they introduce a structure that is capable of spreading the work of reconstructions out across multiple inserts. Their structure consists of $\log_2 n$ levels, like the logarithmic method, but each level contains four data structures, rather than one, called $Oldest_i$, $Older_i$, $Old_i$, and $New_i$ respectively.\footnote{ We are here adopting nomenclature used by Erickson in his lecture notes on the topic~\cite{erickson-bsm-notes}, which is a bit clearer than the more mathematical notation in the original source material. } The $Old$, $Older$, and $Oldest$ structures represent completely built versions of the data structure on each level, and will be either full ($2^i$ records) or empty. If $Oldest$ is empty, then so is $Older$, and if $Older$ is empty, then so is $Old$. The fourth structure, $New$, represents a partially built structure on the level. A record in the structure will be present in exactly one old structure, and may additionally appear in a new structure. When inserting into this structure, the algorithm first examines every level, $i$. If both $Older_{i-1}$ and $Oldest_{i-1}$ are full, then the algorithm will execute $\frac{B(2^i)}{2^i}$ steps of the procedure that constructs $New_i$ from $\mathbftt{unbuild}(Older_{i-1}) \cup \mathbftt{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed to completely build some block, $New_i$, the source blocks for the reconstruction, $Oldest_{i-1}$ and $Older_{i-1}$, are deleted, $Old_{i-1}$ becomes $Oldest_{i-1}$, and $New_i$ is assigned to the oldest empty block on level $i$. This approach means that, in the worst case, partial reconstructions will be executed on every level in the structure, resulting in \begin{equation*} I(n) \in \Theta\left(\sum_{i=1}^{\log_2 n} \frac{B(2^i)}{2^i}\right) \in \Theta\left(\log_2 n \frac{B(n)}{n}\right) \end{equation*} time. Additionally, if $B(n) \in \Omega(n^{1 + \epsilon})$ for $\epsilon > 0$, then the bottom level dominates the reconstruction cost, and the worst-case bound drops to $I(n) \in \Theta\left(\frac{B(n)}{n}\right)$. \section{Decomposition-based Dynamization for Full-dynamic Structures} \label{ssec:dyn-deletes} Full-dynamic structures are those with support for deleting records, as well as inserting them. As it turns out, supporting deletes efficiently is significantly more challenging than supporting inserts, but there are some results in the theoretical literature for efficient delete support in restricted cases. In principle it is possible to support deletes using global reconstruction, with the operation defined as, \begin{equation*} \mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\}) \end{equation*} However, the extension of this procedure to a decomposed data structure is less than trivial. Unlike inserts, where the record can (in principle) be placed into whatever block we like, deletes must be applied specifically to the block containing the record. As a result, there must be a means to locate the block containing a specified record before it can be deleted. In addition to this, all three of the decomposition schemes discussed so far take advantage of the fact that inserts can be applied to blocks in a systematic manner to provide performance guarantees.
Deletes, however, lack this control, making bounding their performance far more difficult. For example, consider a logarithmic decomposition that contains all integers on the interval $[1, 100]$, inserted in that order. We would like to delete all of the records from this structure, one at a time, using global reconstruction, in the reverse of their insertion order. Even if we assume that we can easily locate the block containing each record to delete, we are still faced with two major problems, \begin{itemize} \item The cost of performing a delete is a function of which block the record is in, which is a question of distribution and not easily controlled. In this example, we will always trigger the worst-case behavior, repeatedly rebuilding the largest blocks in the structure one at a time, as the number of records diminishes. \item As records are deleted, the structure will potentially violate the invariants of the decomposition scheme used, which will require additional work to fix. \end{itemize} To resolve these difficulties, two very different approaches have been proposed for creating full-dynamic structures. One approach requires the search problem itself to have certain properties, and the other requires certain operations to be supported by the data structure. We'll discuss these next. \subsection{Ghost Structure for Invertible Search Problems} The first proposed mechanism for supporting deletes was discussed alongside the logarithmic method in Bentley and Saxe's original paper. This technique applies to a class of search problems called \emph{invertible} (also called \emph{decomposable counting problems} in later literature~\cite{overmars83}). Invertible search problems are decomposable, and also support an ``inverse'' merge operator, $\Delta$, that is able to remove records from the result set. More formally, \begin{definition}[Invertible Search Problem~\cite{saxe79}] \label{def:invert} A decomposable search problem, $F$, is invertible if and only if there exists a constant-time computable operator, $\Delta$, such that \begin{equation*} F(A - B, q) = F(A, q)~\Delta~F(B, q) \end{equation*} for all $A, B \in \mathcal{PS}(\mathcal{D})$. \end{definition} Given a search problem with this property, it is possible to emulate removing a record from the structure by instead inserting into a secondary ``ghost'' structure. When the decomposed structure is queried, this ghost structure is queried as well as the main one. The results from the ghost structure can be removed from the result set using the inverse merge operator. This simulates the result that would have been obtained had the records been physically removed from the main structure. Two examples of invertible search problems are range count and set membership. Range count was formally defined in Definition~\ref{def:range-count}. \begin{theorem} Range count is an invertible search problem. \end{theorem} \begin{proof} To prove that range count is an invertible search problem, it must be decomposable and have a $\Delta$ operator. That it is a DSP has already been proven in Theorem~\ref{ther:decomp-range-count}. Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert} gives, \begin{equation*} |(A - B) \cap q | = |(A \cap q) - (B \cap q)| = |(A \cap q)| - |(B \cap q)| \end{equation*} where the first equality follows from the distributive property of set difference over intersection, and the second from the fact that $B \cap q \subseteq A \cap q$ when $B \subseteq A$ (i.e., when only records actually present in the structure are deleted). Subtraction is computable in constant time, therefore range count is an invertible search problem using subtraction as $\Delta$.
\end{proof} The set membership search problem is defined as follows, \begin{definition}[Set Membership] \label{def:set-membership} Consider a set of elements $d \subseteq \mathcal{D}$ from some domain, and a single element $r \in \mathcal{D}$. A test of set membership is a search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D}) \to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r \not\in d$ and $1$ if $r \in d$. \end{definition} \begin{theorem} Set membership is an invertible search problem. \end{theorem} \begin{proof} To prove that set membership is invertible, it is necessary to establish that it is a decomposable search problem, and that a $\Delta$ operator exists. We'll begin with the former. \begin{lemma} \label{lem:set-memb-dsp} Set membership is a decomposable search problem. \end{lemma} \begin{proof} Let $\mergeop$ be the logical disjunction ($\lor$). This yields, \begin{align*} F(A \cup B, r) &= F(A, r) \lor F(B, r) \\ r \in (A \cup B) &= (r \in A) \lor (r \in B) \end{align*} which is true, following directly from the definition of union. The logical disjunction is an associative, commutative operator that can be calculated in $\Theta(1)$ time. Therefore, set membership is a decomposable search problem. \end{proof} For the inverse merge operator, $\Delta$, it is necessary that $F(A, r) ~\Delta~F(B, r)$ be true \emph{only} if $r \in A$ and $r \not\in B$. Thus, it could be directly implemented as $F(A, r)~\Delta~F(B, r) = F(A, r) \land \neg F(B, r)$, which is constant time if the operands are already known. Thus, we have shown that set membership is a decomposable search problem, and that a constant time $\Delta$ operator exists. Therefore, it is an invertible search problem. \end{proof} For search problems such as these, this technique allows for deletes to be supported with the same cost as an insert. Unfortunately, it suffers from write amplification because each deleted record is recorded twice--once in the main structure, and once in the ghost structure. This means that $n$ is, in effect, the total number of records and deletes. This can lead to some serious problems; for example, if every record in a structure of $n$ records is deleted, the net result will be an ``empty'' dynamized data structure containing $2n$ physical records within it. To circumvent this problem, Bentley and Saxe proposed a mechanism of setting a maximum threshold for the size of the ghost structure relative to the main one. Once this threshold is reached, a complete re-partitioning of the data can be performed. During this re-partitioning, all deleted records can be removed from the main structure, and the ghost structure emptied completely. Then all of the blocks can be rebuilt from the remaining records, partitioning them according to the strict binary decomposition of the logarithmic method.
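The ghost structure mechanism can be sketched as follows for the range count problem, using plain vectors in place of the decomposed dynamizations that would be used in practice; the types and names are illustrative only.
\begin{verbatim}
#include <vector>

// Full-dynamic range counting via a ghost structure: deletes are recorded
// as insertions into a secondary structure, and the inverse merge operator
// (subtraction) removes their contribution from query results.
struct GhostRangeCount {
    std::vector<int> main_records;   // inserted records
    std::vector<int> ghost_records;  // records that have been "deleted"

    void insert(int r) { main_records.push_back(r); }
    void erase(int r)  { ghost_records.push_back(r); }  // insert into ghost

    long long range_count(int low, int high) const {
        return count_in(main_records, low, high)
             - count_in(ghost_records, low, high);  // apply Delta
    }

private:
    static long long count_in(const std::vector<int>& v, int low, int high) {
        long long c = 0;
        for (int x : v) {
            if (low <= x && x <= high) c++;
        }
        return c;
    }
};
\end{verbatim}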
\subsection{Weak Deletes for Deletion Decomposable Search Problems} Another approach for supporting deletes was proposed later, by Overmars and van Leeuwen, for a class of search problems called \emph{deletion decomposable}. These are decomposable search problems for which the underlying data structure supports a delete operation. More formally, \begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}] \label{def:background-ddsp} A decomposable search problem, $F$, and its data structure, $\mathcal{I}$, is deletion decomposable if and only if, for an instance $\mathscr{I} \in \mathcal{I}$ containing $n$ records, there exists a deletion routine $\mathtt{delete}(\mathscr{I}, r)$ that removes a record $r \in \mathcal{D}$ in time $D(n)$ without increasing the query time, deletion time, or storage requirement of $\mathscr{I}$. \end{definition} Superficially, this doesn't appear very useful, because if the underlying data structure already supports deletes, there isn't much reason to use a dynamization technique to add deletes to it. However, even in structures that don't natively support deleting, it is possible in many cases to \emph{add} delete support without significant alterations. If it is possible to locate a record and somehow mark it as deleted, without removing it from the structure, and then efficiently ignore these records while querying, then the given structure and its search problem can be said to be deletion decomposable. This technique for deleting records is called \emph{weak deletes}. \begin{definition}[Weak Deletes~\cite{overmars81}] \label{def:weak-delete} A data structure is said to support weak deletes if it provides a routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$ deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha \mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$, where $\mathscr{Q}(n)$ is the cost of answering the query against a structure upon which no weak deletes were performed.\footnote{ This paper also provides a similar definition for weak updates, but these aren't of interest to us in this work, and so the above definition was adapted from the original with the weak update constraints removed. } The results of the query of a block containing weakly deleted records should be the same as the results would be against a block with those records removed. \end{definition} As an example of a deletion decomposable search problem, consider the set membership problem defined above (Definition~\ref{def:set-membership}), where $\mathcal{I}$, the data structure used to answer queries of the search problem, is a hash map.\footnote{ While most hash maps are already dynamic, and so wouldn't need dynamization to be applied, there do exist static ones too. For example, the hash map being considered could be implemented using perfect hashing~\cite{perfect-hashing}, which has many static implementations. } \begin{theorem} The set membership problem, answered using a static hash map, is deletion decomposable. \end{theorem} \begin{proof} We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership is a decomposable search problem. For it to be deletion decomposable, we must demonstrate that the hash map, $\mathcal{I}$, supports deleting records without hurting its query performance, delete performance, or storage requirements. Assume that an instance $\mathscr{I} \in \mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in $\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage. Such a structure can support weak deletes. Each record within the structure has a single bit attached to it, indicating whether it has been deleted or not. These bits will require $\Theta(n)$ storage and be initialized to $0$ when the structure is constructed.
A delete can be performed by querying the structure for the record to be deleted in $\Theta(1)$ time, and setting its bit to $1$ if the record is found. This operation has $D(n) \in \Theta(1)$ cost.
\begin{lemma}
\label{lem:weak-deletes}
The delete procedure as described above satisfies the requirements of Definition~\ref{def:weak-delete} for weak deletes.
\end{lemma}
\begin{proof}
Per Definition~\ref{def:weak-delete}, there must exist some constant $k_\alpha$, dependent only on $\alpha$, such that after $\alpha \cdot n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is bounded by $k_\alpha \mathscr{Q}(n)$. In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore the query cost must remain constant. When a query is executed against $\mathscr{I}$, there are three possible cases,
\begin{enumerate}
\item The record being searched for does not exist in $\mathscr{I}$. In this case, the query result is $0$.
\item The record being searched for does exist in $\mathscr{I}$ and has a delete bit value of $0$. In this case, the query result is $1$.
\item The record being searched for does exist in $\mathscr{I}$ and has a delete bit value of $1$ (i.e., it has been deleted). In this case, the query result is $0$.
\end{enumerate}
In all three cases, checking the delete bit adds at most a constant amount of work to the query, and this work is required whether or not any deletes have actually been performed. Therefore, set membership over a static hash map using our proposed deletion mechanism satisfies the requirements for weak deletes, with $k_\alpha = 1$.
\end{proof}
Finally, we note that the cost of one of these weak deletes is $D(n) \in \Theta(\mathscr{Q}(n))$, as it consists of a lookup followed by setting a bit. By Lemma~\ref{lem:weak-deletes}, neither the query cost nor the delete cost is asymptotically harmed by the presence of deleted records. Thus, we've shown that set membership using a static hash map is a decomposable search problem, that the storage cost remains $\Omega(n)$, and that the query and delete costs are unaffected by the presence of deletes under the proposed mechanism. All of the requirements of deletion decomposability are satisfied, therefore set membership using a static hash map is a deletion decomposable search problem.
\end{proof}

For such problems, deletes can be supported by first identifying the block in the decomposition containing the record to be deleted, and then calling $\mathtt{delete}$ on it. To allow this block to be easily located, it is possible to maintain a hash table over all of the records, alongside the decomposition, which maps each record onto the block containing it. This table must be kept up to date as reconstructions occur, but this can be done at no extra asymptotic cost for any data structure having $B(n) \in \Omega(n)$, as updating the table requires only linear time. This allows deletes to be performed in $\mathscr{D}(n) \in \Theta(D(n))$ time.

The presence of deleted records within the structure does introduce a new problem, however. Over time, the number of records in each block will drift away from the requirements imposed by the decomposition technique. It will eventually become necessary to re-partition the records to restore the invariants that bound the number of blocks, and thereby the query performance. The particular invariant maintenance rules depend upon the decomposition scheme used. To our knowledge, there is no discussion of applying the $k$-binomial method to deletion decomposable search problems, and so that method is not discussed here.
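To make the preceding discussion concrete, the following sketch shows one way a static structure with per-record delete bits, together with the record-to-block table described above, could support weak deletes within a decomposition. It is a simplified illustration only: the class and method names are assumptions of the sketch, and a Python dictionary stands in for the static (e.g., perfect-hashing) table, since the point here is the delete bookkeeping rather than the hashing scheme itself.
\begin{verbatim}
class StaticHashSet:
    """A stand-in for a static hash map answering set membership."""
    def __init__(self, records):
        # build(): store each record alongside a delete bit, initialized to 0.
        self.slots = {rec: False for rec in records}

    def query(self, rec):
        # Case 1: absent -> 0.  Case 2: present, bit 0 -> 1.
        # Case 3: present, bit 1 (weakly deleted) -> 0.
        return rec in self.slots and not self.slots[rec]

    def weak_delete(self, rec):
        # D(n) = Q(n): a lookup followed by setting the delete bit.
        if rec in self.slots:
            self.slots[rec] = True

    def unbuild(self):
        # Recover only the records that have not been weakly deleted.
        return [rec for rec, deleted in self.slots.items() if not deleted]


class DecomposedMembership:
    """A set of blocks plus a table mapping each record to its block."""
    def __init__(self, blocks):
        self.blocks = blocks
        self.block_of = {}   # record -> index of the block containing it
        for i, blk in enumerate(blocks):
            for rec in blk.unbuild():
                self.block_of[rec] = i   # rebuilt in linear time after reconstructions

    def query(self, rec):
        # Decomposable: partial results are merged with logical OR.
        return any(blk.query(rec) for blk in self.blocks)

    def delete(self, rec):
        # Route the delete to the block containing the record.
        i = self.block_of.pop(rec, None)
        if i is not None:
            self.blocks[i].weak_delete(rec)
\end{verbatim}
A reconstruction would rebuild the affected blocks (via \texttt{unbuild} and a fresh \texttt{build}) along with the record-to-block table, which, as noted above, adds only linear work.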
\Paragraph{Logarithmic Method.} When creating a logarithmic decomposition for a deletion decomposable search problem, the $i$th block, where $i \geq 2$,\footnote{
Block $i=1$ will only ever have one record, so no special maintenance is needed for it; a delete simply empties it.
} will, in the absence of deletes, contain $2^{i-1} + 1$ records. When a delete occurs in block $i$, no special action is taken until the number of records in that block falls below $2^{i-2}$. Once this threshold is reached, a reconstruction is performed to restore the appropriate record counts in each block~\cite{merge-dsp}.

\Paragraph{Equal Block Method.} For the equal block method, there are two cases in which a delete may cause a block to fail to obey the method's size invariants,
\begin{enumerate}
\item If enough records are deleted, it is possible for the number of blocks to exceed $f(2n)$, violating Invariant~\ref{ebm-c1}.
\item The deletion of records reduces $n$, and with it the maximum block capacity of $\nicefrac{2n}{s}$, which may cause some blocks to exceed that capacity. This is a violation of Invariant~\ref{ebm-c2}.
\end{enumerate}
In both cases, note that $n$ decreases as records are deleted. Should either of these cases emerge as a result of a delete, the entire structure must be reconfigured to ensure that its invariants are maintained. This reconfiguration follows the same procedure as when an insert results in a violation: $s$ is updated to be exactly $f(n)$, all existing blocks are unbuilt, and then the records are evenly redistributed into the $s$ blocks~\cite{overmars-art-of-dyn}.

\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}

While fairly general, these dynamization techniques have a number of limitations that prevent them from being directly usable as a general solution for building database indices. Because of the requirement that the search problem be decomposable, many search problems cannot be addressed, or at least not addressed efficiently, by decomposition-based dynamization. Additionally, though we have discussed two decomposition approaches that expose some form of performance tuning to the user, these techniques target asymptotic behavior, which leads to poor performance in practice. Finally, most decomposition schemes have poor worst-case insertion performance, resulting in extremely poor tail latency relative to native dynamic structures. While there do exist decomposition schemes with better worst-case performance, they are impractical. This section will discuss these limitations in more detail, and the rest of the document will be dedicated to proposing solutions to them.

\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}

Unfortunately, the DSP abstraction used as the basis of classical dynamization techniques has a few significant limitations that restrict its applicability,
\begin{itemize}
\item The query must be broadcast identically to each block and cannot be adjusted based on the state of the other blocks.
\item The query process is done in a single pass; it cannot be repeated.
\item The result merge operation must be $O(1)$ to maintain good query performance.
\item The result merge operation must be commutative and associative, and is called repeatedly to merge pairs of results.
\item There are serious restrictions on which problems can support deletes, requiring additional assumptions about the search problem or the data structure.
\end{itemize}
These requirements restrict the types of queries that can be supported efficiently by these methods. For example, k-nearest neighbor search and independent range sampling are not decomposable.

\subsubsection{k-Nearest Neighbor}
\label{sssec-decomp-limits-knn}

The k-nearest neighbor ($k$-NN) problem is a generalization of the nearest neighbor problem, which seeks to return the closest point within the dataset to a given query point. More formally, this can be defined as,
\begin{definition}[Nearest Neighbor]
Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The nearest neighbor problem, $NN(D, q)$, returns a point $d^* \in D$ satisfying $f(d^*, q) = \min_{d \in D} f(d, q)$ for a given query point $q \in \mathbb{R}^d$.
\end{definition}
In practice, it is common to require $f(x, y)$ be a metric,\footnote{
Contrary to its vernacular usage as a synonym for ``distance'', a metric is more formally defined as a valid distance function over a metric space. Metric spaces require their distance functions to have the following properties,
\begin{itemize}
\item The distance between a point and itself is always $0$.
\item All distances between non-equal points must be positive.
\item For all points $x, y \in D$, it is true that $f(x, y) = f(y, x)$.
\item For any three points $x, y, z \in D$, it is true that $f(x, z) \leq f(x, y) + f(y, z)$.
\end{itemize}
These distances must also have the interpretation that $f(x, y) < f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This is the opposite of the definition of similarity, and so some minor manipulations are usually required to make similarity measures work in metric-based indexes~\cite{intro-analysis}.
} and this is the case for the example indices used to address this problem in this work, but it is not a fundamental aspect of the problem formulation. The nearest neighbor problem itself is decomposable, with a simple merge function that, of any two inputs, accepts the result with the smaller value of $f(x, q)$~\cite{saxe79}.

The k-nearest neighbor problem generalizes nearest-neighbor to return the $k$ nearest elements,
\begin{definition}[k-Nearest Neighbor]
Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The k-nearest neighbor problem, $KNN(D, q, k)$, seeks to identify a set $R \subseteq D$ with $|R| = k$ such that $f(d, q) \geq f(r, q)$ for all $d \in D - R$ and all $r \in R$.
\end{definition}
This can be thought of as solving the nearest-neighbor problem $k$ times, each time removing the returned result from $D$ prior to solving the problem again. Unlike the single nearest-neighbor case (which can be thought of as $k$-NN with $k=1$), this problem is \emph{not} decomposable.
\begin{theorem}
$k$-NN is not a decomposable search problem.
\end{theorem}
\begin{proof}
To prove this, consider the query $KNN(D, q, k)$ against some partitioned dataset $D = D_1 \cup D_2 \cup \ldots \cup D_\ell$. If $k$-NN is decomposable, then there must exist some constant-time, commutative, and associative binary operator $\mergeop$ such that $R = \mergeop_{1 \leq i \leq \ell} R_i$, where $R_i$ is the result of evaluating the query $KNN(D_i, q, k)$. Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \mergeop R_j$.
It is clear that $|R| = |R_i| = |R_j| = k$, and that the contents of $R$ must be the $k$ records from $R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the problem $KNN(R_i \cup R_j, q, k)$. However, $k$-NN cannot be solved in $O(1)$ time. Therefore, $k$-NN is not a decomposable search problem.
\end{proof}

With that said, it is clear that there is no fundamental restriction preventing the merging of the result sets; it is only that the (somewhat arbitrary) constant-time performance requirement cannot be satisfied. The result sets can be merged in non-constant time, and so $k$-NN is $C(n)$-decomposable. Unfortunately, this classification brings with it a reduction in query performance as a result of the way result merges are performed.

As a concrete example of these costs, consider using the logarithmic method to extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of answering $k$-NN queries in $O(k \log n)$ time. One possible merge algorithm for $k$-NN would be to push all of the elements of the two arguments onto a min-heap, and then pop off the first $k$. In this case, the cost of the merge operation would be $C(k) \in \Theta(k \log k)$. Were $k$ assumed to be constant, the operation could be considered constant-time; but given that $k$ is bounded above only by $n$, this is not a safe assumption to make in general. Evaluating the total query cost for the extended structure, this would yield,
\begin{equation}
KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
\end{equation}
The reason for this large increase in cost is the repeated application of the merge operator. The logarithmic method requires applying the merge operator in a binary fashion to each partial result, multiplying its cost by a factor of $\log n$. Thus, the constant-time requirement of standard decomposability is necessary to keep the cost of the merge operator from appearing within the complexity bound of the entire operation in the general case.\footnote{
There is a special case, noted by Overmars, where the total cost is $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n)) \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the case where the costs of the query and merge operations are sufficiently large to consume the logarithmic factor, and so it doesn't represent a special case with better performance.
} If we could revise the result merging operation to remove this duplicated cost, we could greatly reduce the cost of supporting $C(n)$-decomposable queries.

\subsubsection{Independent Range Sampling}
\label{ssec:background-irs}

Another problem that is not decomposable is independent sampling. There are a variety of problems falling under this umbrella, including weighted set sampling, simple random sampling, and weighted independent range sampling, but we will focus on independent range sampling here.
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$, with each point having equal probability of being sampled.
\end{definition}
This problem immediately encounters a category error when considering whether it is decomposable: the result set is randomized, whereas the conditions for decomposability are defined in terms of an exact matching of records in result sets.
To work around this, a slight abuse of definition is in order: assume that the equality conditions within the DSP definition can be interpreted to mean ``the contents of the two sets are drawn from the same distribution''. This enables the category of DSP to apply to this type of problem, while maintaining the spirit of the definition.

Even with this abuse, however, IRS cannot generally be considered decomposable; it is at best $C(n)$-decomposable. The reason for this is that matching the distribution requires drawing the appropriate number of samples from each partition of the data. Even in the special case that $|D_1| = |D_2| = \ldots = |D_\ell|$, the number of samples from each partition that must appear in the result set cannot be known in advance, due to differences in the selectivity of the predicate across the partitions.
\begin{example}[IRS Sampling Difficulties]
Consider three partitions of data, $D_1 = \{1, 2, 3, 4, 5\}$, $D_2 = \{1, 1, 1, 1, 3\}$, $D_3 = \{4, 4, 4, 4, 4\}$, using bag semantics, and an IRS query over the interval $[3, 4]$ with $k=12$. Because all three partitions have the same size, it seems sensible to evenly distribute the samples across them ($4$ samples from each partition). Applying the query predicate to the partitions results in the following: $d_1 = \{3, 4\}$, $d_2 = \{3\}$, $d_3 = \{4, 4, 4, 4, 4\}$. In expectation, the first result set will contain $R_1 = \{3, 3, 4, 4\}$, as each sample has a 50\% chance of being a $3$ and the same chance of being a $4$. The second and third result sets can only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$, respectively. Merging these together, we'd find that the probability distribution of the sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform the same sampling operation over the full dataset (not partitioned), the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
\end{example}
The problem is that the number of samples drawn from each partition needs to be weighted based on the number of elements satisfying the query predicate in that partition. In the above example, by drawing $4$ samples each from $D_1$ and $D_2$, more weight is given to $3$ than it has within the base dataset. This can be worked around by sampling a full $k$ records from each partition, returning both the sample and the number of records satisfying the predicate as that partition's query result. This allows the relative weights of each block to be controlled for during the merge, by performing weighted sampling of the partial results. This approach requires $\Theta(k)$ time for the merge operation, however, leaving IRS firmly in the $C(n)$-decomposable camp. If it were possible to pre-calculate the number of samples to draw from each partition, then a constant-time merge operation could be used.

We examine support for non-decomposable search problems in Chapters~\ref{chap:sampling} and \ref{chap:framework}, where we propose techniques for efficiently extending dynamization to such problems, as well as for addressing some additional difficulties introduced by supporting deletes, which can complicate query processing.

\subsection{Configurability}

Decomposition-based dynamization is built upon a fundamental trade-off between insertion and query performance that is governed by the number of blocks into which a structure is decomposed. Both the equal block and $k$-binomial methods expose parameters to tune the number of blocks in the structure, but these techniques suffer from poor insertion performance in general.
The equal block method in particular suffers greatly from its larger block sizes~\cite{overmars83}, and the $k$-binomial approach shares the same problem. In fact, we'll show in Chapter~\ref{chap:tail-latency} that, under experimental conditions, the equal block method is strictly worse than the logarithmic method: for a given query latency, it occupies a worse position in the trade-off space. A technique that attempts to address this limitation by nesting the logarithmic method inside of the equal block method, called the \emph{mixed method}, has appeared in the theoretical literature~\cite{overmars83}. But this technique is clunky, and doesn't provide the user with a meaningful design space for configuring the system beyond specifying arbitrary functions.

The reason for this lack of simple configurability in the existing dynamization literature seems to stem from the theoretical nature of the work. Many ``obvious'' options for tweaking the method, such as changing the rate at which levels grow, adding buffering, etc., result in constant-factor trade-offs, and thus are not relevant to the asymptotic bounds that these works are concerned with. It's worth noting that some works based on \emph{applying} the logarithmic method introduce some form of configurability~\cite{pgm,almodaresi23}, usually inspired by the design space of LSM trees~\cite{oneil96}, but the full consequences of this parametrization in the context of dynamization have, to the best of our knowledge, not been explored. We will discuss this topic in Chapter~\ref{chap:design-space}.

\subsection{Insertion Tail Latency}
\label{ssec:bsm-tail-latency-problem}

One of the largest problems associated with classical dynamization techniques is their poor worst-case insertion performance, which results in massive insertion tail latencies. Unfortunately, solving this problem within the logarithmic method itself is not a trivial undertaking. Maintaining the strict binary decomposition of the structure ensures that any given reconstruction cannot be performed in advance, as it requires access to all of the records in the structure in the worst case. This limits the ability to use parallelism to hide the latencies. The worst-case optimized approach proposed by Overmars and van Leeuwen abandons the binary decomposition of the logarithmic method, and is thus able to limit this worst-case insertion bound, but it has a number of serious problems,
\begin{enumerate}
\item It assumes that the reconstruction process for a data structure can be divided \textit{a priori} into a small number of independent operations that can be executed in batches during each insert. It is not always possible to do this efficiently, particularly for structures whose construction involves multiple stages (e.g., a sorting phase followed by a recursive node construction phase, as in a B+tree) with non-trivially predictable operation counts.
\item Even if the reconstruction process can be efficiently sub-divided, implementing the technique requires \emph{significant} and highly specialized modification of the construction procedures for a data structure, and tight integration of these procedures into the insertion process as a whole. This makes it poorly suited for use in a generalized framework of the sort we are attempting to create.
\end{enumerate}
We tackle the problem of insertion tail latency in Chapter~\ref{chap:tail-latency}, where we propose a new system that resolves these difficulties and allows for significant improvements in insertion tail latency without seriously degrading the other performance characteristics of the dynamized structure.

\section{Conclusion}

In this chapter, we introduced the concept of a search problem, and showed how amortized global reconstruction can be used to dynamize data structures associated with search problems having certain properties. We examined several theoretical approaches for dynamization, including the equal block method, the logarithmic method, and an approach optimized for worst-case insertion. Additionally, we considered several further classes of search problems, and saw how their additional properties can be used to enable more efficient reconstruction, as well as support for efficiently deleting records from the structure. Ultimately, however, these techniques have several deficiencies that must be overcome before a practical, general system can be built upon them. Namely, they lack support for several important types of search problems (particularly if deletes are required), they are not easily configurable by the user, and they suffer from poor insertion tail latency. The rest of this work is dedicated to approaches for resolving these deficiencies.