\chapter{Background}
This chapter introduces important background information that will be used throughout the remainder of the document. We'll first define precisely what is meant by a query, and consider some special classes of query that will become relevant in our discussion of dynamic extension. We'll then consider the difference between a static and a dynamic structure, and examine techniques for converting static structures into dynamic ones in a variety of circumstances.

\section{Database Indexes}
The term \emph{index} is often abused within the database community to refer to a range of closely related, but distinct, conceptual categories\footnote{The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', to data structures used specifically in query processing, etc.}. This ambiguity is rarely problematic, as the subtle differences between these categories are not often significant, and context clarifies the intended meaning in situations where they are. However, this work explicitly operates at the interface of two of these categories, and so it is important to disambiguate between them. As a result, we will use the word index to refer to a very specific type of structure, defined below.

\subsection{The Traditional Index}
A database index is a specialized structure which provides a means to efficiently locate records that satisfy specific criteria, enabling more efficient processing of the queries it supports. A traditional database index can be modeled as a function mapping a set of attribute values, called a key, $\mathcal{K}$, to a set of record identifiers, $\mathcal{R}$. Technically, the codomain of an index can be either a record identifier, a set of record identifiers, or the physical record itself, depending upon the configuration of the index. For the purposes of this work, the focus will be on the first of these, but in principle any of the three index types could be used with little material difference to the discussion. Formally speaking, we will use the following definition of a traditional database index,
\begin{definition}[Traditional Index]
Consider a set of database records, $\mathcal{D}$. An index over these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where $\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, called a \emph{key}.
\end{definition}

In order to facilitate this mapping, indexes are built using data structures. The specific data structure used has implications for the performance of the index and the situations in which the index is effective. Broadly speaking, traditional database indexes fall into two categories: ordered indexes and unordered indexes. The former allow for iteration over the set of record identifiers in some sorted order, starting at the returned record. The latter allow for point-lookups only.

A small set of data structures accounts for the vast majority of database indexes. The most common range index in RDBMSs is the B-tree\footnote{By \emph{B-tree} here, I am referring not to the B-tree data structure itself, but to a wide range of related structures derived from it. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc.} based index, and key-value stores commonly use indices built on the LSM-tree. Some databases also support unordered indexes using hash tables.
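To make the distinction between ordered and unordered indexes concrete, the sketch below shows minimal interfaces that the two categories might expose. This is an illustrative Python sketch only; the type aliases and method names (\texttt{lookup}, \texttt{scan\_from}) are hypothetical and are not drawn from any particular system.
\begin{verbatim}
from abc import ABC, abstractmethod
from typing import Iterator, List

RecordID = int   # hypothetical: an opaque identifier for a stored record
Key = tuple      # hypothetical: one or more attribute values


class UnorderedIndex(ABC):
    """Supports point-lookups only (e.g., a hash table based index)."""

    @abstractmethod
    def lookup(self, key: Key) -> List[RecordID]:
        """Return the identifiers of all records whose key matches exactly."""


class OrderedIndex(UnorderedIndex):
    """Additionally supports sorted iteration (e.g., a B-tree based index)."""

    @abstractmethod
    def scan_from(self, key: Key) -> Iterator[RecordID]:
        """Iterate over record identifiers in key order, starting at the
        first record whose key is >= the given key."""
\end{verbatim}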
Beyond these core structures, some specialized databases or database extensions support indexes based on other structures, such as the R-tree\footnote{Like the B-tree, R-tree is used here as a signifier for a general class of related data structures.} for spatial databases, or approximate small-world graph models for similarity search.

\subsection{The Generalized Index}
The previous section discussed the traditional definition of an index, as might be found in a database systems textbook. However, this definition is limited by its association specifically with mapping key fields to records. For the purposes of this work, I will be considering a slightly broader definition of index,
\begin{definition}[Generalized Index]
Consider a set of database records, $\mathcal{D}$, and a search problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$, is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$.
\end{definition}
\emph{Search problems} are the topic of the next section, but in brief a search problem represents a general class of query, such as a range scan, point lookup, k-nearest neighbor query, etc. A traditional index is a special case of a generalized index, with $\mathcal{Q}$ being a point-lookup or range query based on a set of record attributes.

\subsection{Indices in Query Processing}
A database management system utilizes indices to accelerate certain types of query. Queries are expressed to the system in some high-level language, such as SQL or Datalog. These are generalized languages capable of expressing a wide range of possible queries. The DBMS is then responsible for converting these queries into a set of primitive data access procedures that are supported by the underlying storage engine. There are a variety of techniques for this, including mapping the query directly to a tree of relational algebra operators and interpreting that tree, query compilation, etc. Ultimately, however, the expressiveness of this internal query representation is limited by the routines supported by the storage engine.

As an example, consider the following SQL query (representing a 2-dimensional k-nearest neighbor search)\footnote{There are more efficient ways of answering this query, but I'm aiming for simplicity here to demonstrate my point.},
\begin{verbatim}
SELECT dist(A.x, A.y, Qx, Qy) AS d, A.key
FROM A
WHERE A.property = filtering_criterion
ORDER BY d
LIMIT 5;
\end{verbatim}
This query will be translated into a logical query plan (a sequence of relational algebra operators) by the query planner, which could result in a plan like this,
\begin{verbatim}
query plan here
\end{verbatim}
With this logical query plan, the DBMS must next determine which supported operations it can use to most efficiently answer the query. For example, the selection operation (A) could be physically manifested as a table scan, or could be answered using an index scan if there is an ordered index over \texttt{A.property}. The query optimizer will make this decision based on its estimate of the selectivity of the predicate. This may result in one of the following physical query plans,
\begin{verbatim}
physical query plan
\end{verbatim}
In either case, however, the space of possible physical plans is limited by the available access methods: either a sorted scan on an attribute (index) or an unsorted scan (table scan). The database must filter for all elements matching the filtering criterion, calculate the distance between each of these points and the query, and then sort the results to get the final answer.
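To make this limitation concrete, the following Python sketch mirrors what either available physical plan ultimately computes. The record layout and function name are hypothetical stand-ins used for illustration; they do not correspond to any actual executor code.
\begin{verbatim}
import math
from typing import Iterable, List, Tuple

# hypothetical record layout: (key, x, y, property)
Record = Tuple[int, float, float, str]

def knn_filter_then_sort(records: Iterable[Record], qx: float, qy: float,
                         criterion: str, k: int = 5) -> List[Tuple[float, int]]:
    """Evaluate the example query with only scan-based access methods:
    filter every record, compute every distance, then sort and truncate."""
    candidates = []
    for key, x, y, prop in records:          # table scan (or index scan on property)
        if prop == criterion:                # selection
            d = math.dist((x, y), (qx, qy))  # compute dist() for every survivor
            candidates.append((d, key))
    candidates.sort()                        # ORDER BY d: needs the full result set
    return candidates[:k]                    # LIMIT 5
\end{verbatim}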
Additionally, note that the sort operation in the plan is a pipeline-breaker. If this plan were to appear as a subtree in a larger query plan, the overall plan would need to wait for the full evaluation of this sub-query before it could proceed, as sorting requires the full result set.

Imagine a world where a new index were available to our DBMS: a nearest neighbor index. This index would allow iteration over records in sorted order relative to some predefined metric and a query point. If such an index existed over \texttt{(A.x, A.y)} using \texttt{dist}, then a third physical plan would be available to the DBMS,
\begin{verbatim}
\end{verbatim}
This plan pulls records in order of their distance to \texttt{Q} directly, using the index, and then filters them, avoiding the pipeline-breaking sort operation. While it's not obvious in this case that this new plan is superior (this would depend a lot on the selectivity of the predicate), it is a third option. It becomes increasingly attractive as the fraction of records satisfying the predicate grows, and is clearly superior in the case where the predicate matches every record (requiring the consideration of only $5$ records in total). The construction of this special index will be considered in Section~\ref{ssec:knn}.

This use of query-specific indexing schemes also presents a query planning challenge: how does the database know when a particular specialized index can be used for a given query, and how can specialized indexes broadcast their capabilities to the query planner in a general fashion? This work is focused on the problem of enabling the existence of such indexes, rather than facilitating their use; however, these are important questions that must be considered in future work for this solution to be viable. There has been prior work surrounding the use of arbitrary indexes in queries, such as~\cite{byods-datalog}. This problem is considered out-of-scope for the proposed work, but will be revisited in the future.

\section{Queries and Search Problems}
In our discussion of generalized indexes, we encountered \emph{search problems}. The term search problem is used within the data structures literature in a manner similar to how the database community sometimes uses the term query\footnote{Like the term index, the term query is often abused and used to refer to several related, but slightly different, things. In the vernacular, a query can refer to either a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language.}, referring to a general class of questions asked of data. Examples include range queries, point-lookups, nearest neighbor queries, predicate filtering, random sampling, etc. Formally, for the purposes of this work, we will define a search problem as follows,
\begin{definition}[Search Problem]
Given three multisets, $D$, $R$, and $Q$, a search problem is a function $F: (D, Q) \to R$, where $D$ represents the domain of data to be searched, $Q$ represents the domain of query parameters, and $R$ represents the answer domain.\footnote{It is important to note that it is not required that $R \subseteq D$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $R \subseteq D$, but this need not be a universal constraint.
}
\end{definition}

We will use the word \emph{query} to refer to a specific instance of a search problem, except when it is used as part of the generally accepted name of a search problem (e.g., range query).
\begin{definition}[Query]
Given three multisets, $D$, $R$, and $Q$, a search problem $F$, and a specific set of query parameters $q \in Q$, a query is a specific instance of the search problem, $F(D, q)$.
\end{definition}
As an example of these definitions, a \emph{membership test} or \emph{range query} would be considered a search problem, while a range query over the interval $[10, 99]$ would be a query.

\subsection{Decomposable Search Problems}
An important subset of search problems is that of decomposable search problems (DSPs). This class was first defined by Saxe and Bentley as follows,
\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}
The constant-time requirement was used to prove bounds on the cost of evaluating DSPs over data broken across multiple partitions. Further work by Overmars lifted this constraint and considered a more general class of DSP,
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}

Decomposability is an important property because it allows search problems to be answered over partitioned datasets. The details of this will be discussed in Section~\ref{ssec:bentley-saxe} in the context of creating dynamic data structures. Many common types of search problems appearing in databases, such as range queries and predicate filtering, are decomposable. To demonstrate that a search problem is decomposable, it is necessary to show the existence of the merge operator, $\square$, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two results, simple induction demonstrates that the problem is decomposable even in cases with more than two partial results. As an example, consider range queries,
\begin{definition}[Range Query]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval $q = [x, y]$, with $x, y \in \mathbb{R}$, a range query returns all points in $D \cap q$.
\end{definition}
\begin{theorem}
Range queries are a DSP.
\end{theorem}
\begin{proof}
Let $\square$ be the set union operator ($\cup$). Applying this to Definition~\ref{def:dsp}, we have
\begin{align*}
(A \cup B) \cap q = (A \cap q) \cup (B \cap q)
\end{align*}
which is true by the distributivity of set intersection over union. Assuming an implementation allowing for an $O(1)$ set union operation, range queries are DSPs.
\end{proof}

Because the codomain of a DSP is not restricted, more complex output structures can be used to allow problems that are not directly decomposable to be converted into DSPs, possibly with some minor post-processing. For example, the calculation of the mean of a set of numbers can be constructed as a DSP using the following technique,
\begin{theorem}
The calculation of the average of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
Define the search problem as $A: D \to \mathbb{R} \times \mathbb{Z}$, where $D \subset \mathbb{R}$ is a multiset. The output tuple contains the sum of the values within the input set and the cardinality of the input set. Let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$, and define $A(D_1)~\square~A(D_2) = (s_1 + s_2, c_1 + c_2)$. Applying Definition~\ref{def:dsp}, we have
\begin{align*}
A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1)~\square~A(D_2)
\end{align*}
From the resulting tuple $(s, c)$, the average can be determined in constant time by taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP.
\end{proof}

\section{Dynamic Extension Techniques}
Because data in a database is regularly updated, data structures intended to be used as indexes must support updates (inserts, in-place modification, and deletes) to their data. In principle, any data structure can support updates to its underlying data through global reconstruction: adjusting the record set and then rebuilding the entire structure. Ignoring this trivial (and highly inefficient) approach, a data structure with support for updates is called \emph{dynamic}, and one without support for updates is called \emph{static}. In this section, we discuss approaches for modifying a static data structure to grant it support for updates, a process called \emph{dynamic extension} or \emph{dynamization}. A theoretical survey of this topic can be found in~\cite{overmars83}, but that survey does not cover several techniques that are used in practice. As such, much of this section constitutes our own analysis, tying together threads from a variety of sources.

\subsection{Local Reconstruction}
One way of viewing updates to a data structure is as reconstructions of all or part of the structure. To minimize the cost of an update, it is ideal to minimize the size of the accompanying reconstruction, either by carefully structuring the data so that an update disrupts as few surrounding records as possible, or by deferring reconstructions and amortizing their cost over as many updates as possible.

While minimizing the size of a reconstruction seems the most obvious, and best, approach, it is limited in its applicability. The more related ``nearby'' records in the structure are, the more records will be affected by a change. Records can be related in terms of some ordering of their values, which we'll term a \emph{spatial ordering}, or in terms of their order of insertion into the structure, which we'll term a \emph{temporal ordering}. Note that these terms don't imply anything about the nature of the data themselves; rather, they relate to the principles used by the data structure to arrange the records.

Arrays provide the extreme version of both of these ordering principles. In an unsorted array, in which records are appended to the end, there is no spatial ordering dependence between records. This means that an insert or update requires no local reconstruction beyond the record being directly affected.\footnote{A delete can also be performed without any structural adjustments in a variety of ways. Reorganization of the array as a result of deleted records serves an efficiency purpose, but isn't required for the correctness of the structure.} However, the order of records in the array \emph{does} express a strong temporal dependency: the index of a record in the array gives its exact insertion order.
A sorted array provides exactly the opposite situation. The position of a record in the array reflects an exact spatial ordering of the records with respect to their sorting function. This means that an update or insert will require reordering a large number of records (potentially all of them, in the worst case). Because of the stronger spatial dependence between records in the structure, an update requires a larger-scale reconstruction. Additionally, there is no temporal component to the ordering of the records: inserting a set of records into a sorted array will produce the same final structure irrespective of insertion order.

It's worth noting that the spatial dependency discussed here, as it relates to reconstruction costs, is based on the physical layout of the records and not their logical ordering. To exemplify this, a sorted singly-linked list can maintain the same logical order of records as a sorted array, but limits the spatial dependence of each record to its preceding node. This means that an insert into this structure requires only a single node update, regardless of where in the structure the insert occurs.

The amount of spatial dependence in a structure directly reflects a trade-off between read and write performance. In the above example, performing a lookup for a given record in a sorted array requires asymptotically fewer comparisons in the worst case than in an unsorted array, because the spatial dependencies can be exploited for an accelerated search (binary vs. linear search). Interestingly, this remains the case for lookups against a sorted array vs. a sorted linked list. Even though both structures have the same logical order of records, the limited spatial dependencies between nodes in a linked list force the lookup to perform a linear scan anyway.

A balanced binary tree sits between these two extremes. Like a linked list, individual nodes have very few connections. However, the nodes are arranged in such a way that a connection between two nodes implies further information about the ordering of the children of those nodes. In this light, rebalancing of the tree can be seen as maintaining a certain degree of spatial dependence between the nodes, ensuring that the records are balanced between the two children of each node. A very general summary of tree rebalancing techniques can be found in~\cite{overmars83}. Using an AVL tree~\cite{avl} as a specific example, each insert involves adding the new node and updating its parent (as in a simple linked list), followed by some larger-scale local reconstruction in the form of tree rotations to maintain the balance factor invariant. This means that insertion requires more reconstruction effort than the single pointer update of the linked list case, but results in much more efficient searches (which, as it turns out, also makes insertion more efficient in general, even with the overhead, because finding the insertion point is much faster).

\subsection{Amortized Local Reconstruction}
In addition to controlling update cost by arranging the structure so as to reduce the amount of reconstruction necessary to maintain the desired level of spatial dependence, update costs can also be reduced by amortizing the cost of local reconstruction over multiple updates. This is often done in one of two ways: leaving gaps in the structure or adding overflow buckets. These gaps and buckets allow the data structure to sustain a buffer of insertion capacity before a reconstruction is triggered.
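As a minimal illustration of the gap approach (the overflow-bucket approach is analogous), the following Python sketch shows a fixed-capacity sorted node that absorbs inserts into its free slots and pays for a local reconstruction, here a split, only once those slots are exhausted. The class and method names are hypothetical and are not taken from any particular system.
\begin{verbatim}
import bisect
from typing import List, Optional, Tuple

class GappedNode:
    """A fixed-capacity sorted run; unused slots act as gaps that make
    most inserts cheap. Only a full node pays for reconstruction."""

    def __init__(self, capacity: int = 8, keys: Optional[List[int]] = None):
        self.capacity = capacity
        self.keys: List[int] = sorted(keys or [])

    def insert(self, key: int) -> Optional[Tuple["GappedNode", "GappedNode"]]:
        if len(self.keys) < self.capacity:
            bisect.insort(self.keys, key)       # cheap path: fill a gap
            return None
        # expensive path: local reconstruction (a split), which also
        # replenishes the supply of gaps in both resulting nodes
        bisect.insort(self.keys, key)
        mid = len(self.keys) // 2
        return (GappedNode(self.capacity, self.keys[:mid]),
                GappedNode(self.capacity, self.keys[mid:]))
\end{verbatim}
In a real B$^+$-tree, the split would also propagate a separator key into the parent node; that bookkeeping is omitted here for brevity.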
A classic example of the gap approach is found in the B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well as in open addressing for hash tables. In a B$^+$-tree, each node has a fixed size and must be at least half-utilized (aside from the root node). The empty spaces within these nodes are gaps, which can be cheaply filled with new records on insert. Only when a node has been filled must a local reconstruction (called a structural modification operation, or SMO, in the B-tree context) occur to redistribute the data into multiple nodes and replenish the supply of gaps. This approach is particularly well suited to data structures in contexts where the natural unit of storage is larger than a record, as in disk-based (with 4KiB pages) or cache-optimized (with 64B cachelines) structures. The gap-based approach was also used to create ALEX, an updatable learned index~\cite{ALEX}.

The gap approach has a number of disadvantages. It results in a somewhat sparse structure, thereby wasting storage. For example, a B$^+$-tree requires all nodes other than the root to be at least half full, meaning that in the worst case up to half of the space occupied by the structure could be taken up by gaps. Additionally, this scheme results in some inserts being more expensive than others: most new records will occupy an available gap, but some will trigger more expensive SMOs. In particular, it has been observed with B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}: the gaps in many nodes fill at about the same time, leading to periodic clusters of high-cost structural modifications.

Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam}, as well as in hash tables with closed addressing. In this approach, the parts of the structure into which records would be inserted (leaf nodes in an ISAM-tree, directory entries in closed-addressing hashing) hold a pointer to an overflow location, where newly inserted records can be placed. This allows the structure, in theory, to sustain an unlimited number of insertions. However, read performance degrades, because the more overflow capacity is utilized, the less the records in the structure are ordered according to the data structure's definition. Thus, a reconstruction is periodically necessary to distribute the overflow records into the structure itself.

\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
Another approach to supporting updates is to amortize the cost of global reconstruction over multiple updates. This approach can take one of three forms,
\begin{enumerate}
\item Pairing a dynamic data structure (called a buffer or memtable) with an instance of the structure being extended. Updates are written to the buffer, and when the buffer is full its records are merged with those in the static structure, and the structure is rebuilt. This approach is used by one version of the originally proposed LSM-tree~\cite{oneil93}. Technically, this technique was proposed in that work for the purpose of converting random writes into sequential ones (all structures involved are dynamic), but it can be used for dynamization as well.
\item Creating multiple, smaller data structures, each containing a partition of the records from the dataset, and reconstructing individual structures to accommodate new inserts in a systematic manner. This technique is the basis of the Bentley-Saxe method~\cite{saxe79}.
\item Using both of the above techniques at once. This is the approach used by modern incarnations of the LSM-tree~\cite{rocksdb}.
\end{enumerate}
In all three cases, it is necessary for the search problem associated with the index to be a DSP, as answering it requires querying multiple structures (the buffer and/or one or more instances of the data structure) and merging the results together to produce a final result. This section will focus exclusively on the Bentley-Saxe method, as it is the basis for our proposed methodology.

When dividing records across multiple structures, there is a clear trade-off between read performance and write performance. Keeping the individual structures small reduces the cost of reconstruction, and thereby increases update performance. However, this also means that more structures will be required to accommodate the same number of records, when compared to a scheme that allows the structures to be larger. As each structure must be queried independently, this leads to worse query performance. The reverse is also true: fewer, larger structures will have better query performance and worse update performance, with the extreme limit being a single structure that is fully rebuilt on each insert.

\begin{figure}
\caption{Inserting a new record using the Bentley-Saxe method.}
\label{fig:bsm-example}
\end{figure}

The key insight of the Bentley-Saxe method~\cite{saxe79} is that a good balance can be struck by using geometrically increasing structure sizes. In Bentley-Saxe, the sub-structures are ``stacked'', with the bottom level having a capacity of a single record and each subsequent level doubling in capacity. When an update is performed, the first empty level is located and a reconstruction is triggered, merging the structures of all levels below this empty one, along with the new record. An example of this process is shown in Figure~\ref{fig:bsm-example}. The merit of this approach is that ``most'' reconstructions involve the smaller data structures towards the bottom of the sequence, while most of the records reside in large, infrequently updated structures towards the top. This balances the read and write implications of structure size, while also allowing the number of structures required to represent $n$ records to be bounded by $O(\log n)$ in the worst case. Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$ query cost, the Bentley-Saxe method will produce a dynamic data structure with,
\begin{align}
\text{Query Cost:} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
\text{Amortized Insert Cost:} \qquad & O\left(\frac{P(n)}{n} \log n\right)
\end{align}
However, the method has poor worst-case insertion cost: if the entire structure is full, it must grow by another level, requiring a full reconstruction involving every record within the structure. A slight adjustment to the technique, due to Overmars and van Leeuwen~\cite{}, allows the worst-case insertion cost to be bounded by $O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing each reconstruction into small pieces, one of which is executed each time a new update occurs. This bounds the worst-case performance, but sacrifices expected-case performance and adds significant complexity to the method. This technique is not used much in practice.\footnote{I've yet to find any example of it used in a journal article or conference paper.}

\subsection{Limitations of the Bentley-Saxe Method}
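Before examining these limitations, it is useful to have a concrete picture of the mechanics described above. The following Python sketch implements the basic scheme for a decomposable search problem, assuming a static structure that exposes only construction from a set of records and a query operation, together with a DSP merge operator; all class and method names are illustrative and not drawn from an existing system.
\begin{verbatim}
from typing import Any, Callable, List, Optional

class StaticStructure:
    """Stand-in for an arbitrary static index: built once, then read-only."""
    def __init__(self, data: List[Any]):
        self.data = list(data)       # placeholder for an expensive build step
    def query(self, predicate: Callable[[Any], bool]) -> List[Any]:
        return [x for x in self.data if predicate(x)]

class BentleySaxe:
    """Levels hold static structures of capacity 1, 2, 4, ... or are empty."""
    def __init__(self, merge: Callable[[List[Any], List[Any]], List[Any]]):
        self.levels: List[Optional[StaticStructure]] = []
        self.merge = merge           # the DSP merge operator (the square operator)

    def insert(self, record: Any) -> None:
        carry = [record]
        for i, level in enumerate(self.levels):
            if level is None:
                self.levels[i] = StaticStructure(carry)   # rebuild at level i
                return
            carry.extend(level.data)                      # absorb the full level
            self.levels[i] = None
        self.levels.append(StaticStructure(carry))        # structure grows

    def query(self, predicate: Callable[[Any], bool]) -> List[Any]:
        result: List[Any] = []
        for level in self.levels:
            if level is not None:
                result = self.merge(result, level.query(predicate))
        return result

# Example: a filtering DSP whose merge operator is list concatenation.
ds = BentleySaxe(merge=lambda a, b: a + b)
for v in range(10):
    ds.insert(v)
print(ds.query(lambda x: 3 <= x <= 6))   # [3, 4, 5, 6], in some order
\end{verbatim}
Insertion behaves like incrementing a binary counter over the levels: most insertions rebuild only the small structures near the bottom, while an insertion that finds every level full triggers the expensive full rebuild discussed above.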