Diffstat (limited to 'chapters/background.tex.bak')
| -rw-r--r-- | chapters/background.tex.bak | 1324 |
1 files changed, 824 insertions, 500 deletions
diff --git a/chapters/background.tex.bak b/chapters/background.tex.bak index d57b370..78f4a30 100644 --- a/chapters/background.tex.bak +++ b/chapters/background.tex.bak @@ -1,315 +1,156 @@ \chapter{Background} - -This chapter will introduce important background information that -will be used throughput the remainder of the document. We'll first -define precisely what is meant by a query, and consider some special -classes of query that will become relevant in our discussion of dynamic -extension. We'll then consider the difference between a static and a -dynamic structure, and techniques for converting static structures into -dynamic ones in a variety of circumstances. - -\section{Database Indexes} - -The term \emph{index} is often abused within the database community -to refer to a range of closely related, but distinct, conceptual -categories\footnote{ -The word index can be used to refer to a structure mapping record -information to the set of records matching that information, as a -general synonym for ``data structure'', to data structures used -specifically in query processing, etc. -}. -This ambiguity is rarely problematic, as the subtle differences -between these categories are not often significant, and context -clarifies the intended meaning in situtations where they are. -However, this work explicitly operates at the interface of two of -these categories, and so it is important to disambiguiate between -them. As a result, we will be using the word index to -refer to a very specific structure - -\subsection{The Traditional Index} -A database index is a specialized structure which provides a means -to efficiently locate records that satisfy specific criteria. This -enables more efficient query processing for support queries. A -traditional database index can be modeled as a function, mapping a -set of attribute values, called a key, $\mathcal{K}$, to a set of -record identifiers, $\mathcal{R}$. Technically, the codomain of an -index can be either a record identifier, a set of record identifiers, -or the physical record itself, depending upon the configuration of -the index. For the purposes of this work, the focus will be on the -first of these, but in principle any of the three index types could -be used with little material difference to the discussion. - -Formally speaking, we will use the following definition of a traditional -database index, -\begin{definition}[Traditional Index] -Consider a set of database records, $\mathcal{D}$. An index over -these records, $\mathcal{I}_\mathcal{D}$ is a map of the form -$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where -$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, -called a \emph{key}. -\end{definition} - -In order to facilitate this mapping, indexes are built using data -structures. The specific data structure used has particular -implications about the performance of the index, and the situations -in which the index is effectively. Broadly speaking, traditional -database indexes can be categorized in two ways: ordered indexes -and unordered indexes. The former of these allows for iteration -over the set of record identifiers in some sorted order, starting -at the returned record. The latter allows for point-lookups only. - -There is a very small set of data structures that are usually used -for creating database indexes. 
The most common range index in RDBMSs -is the B-tree\footnote{ By \emph{B-tree} here, I am referring not -to the B-tree datastructure, but to a wide range of related structures -derived from the B-tree. Examples include the B$^+$-tree, -B$^\epsilon$-tree, etc. } based index, and key-value stores commonly -use indices built on the LSM-tree. Some databases support unordered -indexes using hashtables. Beyond these, some specialized databases or -database extensions have support for indexes based on other structures, -such as the R-tree\footnote{ -Like the B-tree, R-tree here is used as a signifier for a general class -of related data structures} for spatial databases or approximate small -world graph models for similarity search. - -\subsection{The Generalized Index} - -The previous section discussed the traditional definition of index -as might be found in a database systems textbook. However, this -definition is limited by its association specifically with mapping -key fields to records. For the purposes of this work, I will be -considering a slightly broader definition of index, - -\begin{definition}[Generalized Index] -Consider a set of database records, $\mathcal{D}$ and a search -problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$ -is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to -\mathcal{R})$. -\end{definition} - -\emph{Search problems} are the topic of the next section, but in -brief a search problem represents a general class of query, such -as range scan, point lookup, k-nearest neightbor, etc. A traditional -index is a special case of a generalized index, having $\mathcal{Q}$ -being a point-lookup or range query based on a set of record -attributes. - -\subsection{Indices in Query Processing} - -A database management system utilizes indices to accelerate certain -types of query. Queries are expressed to the system in some high -level language, such as SQL or Datalog. These are generalized -languages capable of expressing a wide range of possible queries. -The DBMS is then responsible for converting these queries into a -set of primitive data access procedures that are supported by the -underlying storage engine. There are a variety of techniques for -this, including mapping directly to a tree of relational algebra -operators and interpretting that tree, query compilation, etc. But, -ultimately, the expressiveness of this internal query representation -is limited by the routines supported by the storage engine. - -As an example, consider the following SQL query (representing a -2-dimensional k-nearest neighbor)\footnote{There are more efficient -ways of answering this query, but I'm aiming for simplicity here -to demonstrate my point}, - -\begin{verbatim} -SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A - WHERE A.property = filtering_criterion - ORDER BY d - LIMIT 5; -\end{verbatim} - -This query will be translated into a logical query plan (a sequence -of relational algebra operators) by the query planner, which could -result in a plan like this, - -\begin{verbatim} -query plan here -\end{verbatim} - -With this logical query plan, the DBMS will next need to determine -which supported operations it can use to most efficiently answer -this query. For example, the selection operation (A) could be -physically manifested as a table scan, or could be answered using -an index scan if there is an ordered index over \texttt{A.property}. -The query optimizer will make this decision based on its estimate -of the selectivity of the predicate. 
This may result in one of the -following physical query plans - -\begin{verbatim} -physical query plan -\end{verbatim} - -In either case, however, the space of possible physical plans is -limited by the available access methods: either a sorted scan on -an attribute (index) or an unsorted scan (table scan). The database -must filter for all elements matching the filtering criterion, -calculate the distances between all of these points and the query, -and then sort the results to get the final answer. Additionally, -note that the sort operation in the plan is a pipeline-breaker. If -this plan were to appear as a subtree in a larger query plan, the -overall plan would need to wait for the full evaluation of this -sub-query before it could proceed, as sorting requires the full -result set. - -Imagine a world where a new index was available to our DBMS: a -nearest neighbor index. This index would allow the iteration over -records in sorted order, relative to some predefined metric and a -query point. If such an index existed over \texttt{(A.x, A.y)} using -\texttt{dist}, then a third physical plan would be available to the DBMS, - -\begin{verbatim} -\end{verbatim} - -This plan pulls records in order of their distance to \texttt{Q} -directly, using an index, and then filters them, avoiding the -pipeline breaking sort operation. While it's not obvious in this -case that this new plan is superior (this would depend a lot on the -selectivity of the predicate), it is a third option. It becomes -increasingly superior as the selectivity of the predicate grows, -and is clearly superior in the case where the predicate has unit -selectivity (requiring only the consideration of $5$ records total). -The construction of this special index will be considered in -Section~\ref{ssec:knn}. - -This use of query-specific indexing schemes also presents a query -planning challenge: how does the database know when a particular -specialized index can be used for a given query, and how can -specialized indexes broadcast their capabilities to the query planner -in a general fashion? This work is focused on the problem of enabling -the existence of such indexes, rather than facilitating their use, -however these are important questions that must be considered in -future work for this solution to be viable. There has been work -done surrounding the use of arbtrary indexes in queries in the past, -such as~\cite{byods-datalog}. This problem is considered out-of-scope -for the proposed work, but will be considered in the future. +\label{chap:background} + +This chapter will introduce important background information and +existing work in the area of data structure dynamization. We will +first discuss the concept of a search problem, which is central to +dynamization techniques. While one might imagine that restrictions on +dynamization would be functions of the data structure to be dynamized, +in practice the requirements placed on the data structure are quite mild, +and it is the necessary properties of the search problem that the data +structure is used to address that provide the central difficulty to +applying dynamization techniques in a given area. After this, database +indices will be discussed briefly. Indices are the primary use of data +structures within the database context that is of interest to our work. +Following this, existing theoretical results in the area of data structure +dynamization will be discussed, which will serve as the building blocks +for our techniques in subsquent chapters. 
The chapter will conclude with +a discussion of some of the limitations of these existing techniques. \section{Queries and Search Problems} +\label{sec:dsp} + +Data access lies at the core of most database systems. We want to ask +questions of the data, and ideally get the answer efficiently. We +will refer to the different types of question that can be asked as +\emph{search problems}. We will be using this term in a similar way as +the word \emph{query} \footnote{ + The term query is often abused and used to + refer to several related, but slightly different things. In the + vernacular, a query can refer to either a) a general type of search + problem (as in "range query"), b) a specific instance of a search + problem, or c) a program written in a query language. +} +is often used within the database systems literature: to refer to a +general class of questions. For example, we could consider range scans, +point-lookups, nearest neighbor searches, predicate filtering, random +sampling, etc., to each be a general search problem. Formally, for the +purposes of this work, a search problem is defined as follows, -In our discussion of generalized indexes, we encountered \emph{search -problems}. A search problem is a term used within the literature -on data structures in a manner similar to how the database community -sometimes uses the term query\footnote{ -Like with the term index, the term query is often abused and used to -refer to several related, but slightly different things. In the vernacular, -a query can refer to either a) a general type of search problem (as in "range query"), -b) a specific instance of a search problem, or c) a program written in a query language. -}, to refer to a general -class of questions asked of data. Examples include range queries, -point-lookups, nearest neighbor queries, predicate filtering, random -sampling, etc. Formally, for the purposes of this work, we will define -a search problem as follows, \begin{definition}[Search Problem] -Given three multisets, $D$, $R$, and $Q$, a search problem is a function -$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched, -$Q$ represents the domain of query parameters, and $R$ represents the -answer domain. -\footnote{ -It is important to note that it is not required for $R \subseteq D$. As an + Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function + $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, + $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the +answer domain.\footnote{ + It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto -an integer. Most common queries do satisfy $R \subseteq D$, but this need + an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. } \end{definition} -And we will use the word \emph{query} to refer to a specific instance -of a search problem, except when used as part of the generally -accepted name of a search problem (i.e., range query). +We will use the term \emph{query} to mean a specific instance of a search +problem, \begin{definition}[Query] -Given three multisets, $D$, $R$, and $Q$, a search problem $F$ and -a specific set of query parameters $q \in Q$, a query is a specific -instance of the search problem, $F(D, q)$. 
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and + a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific + instance of the search problem, $F(\mathcal{D}, q)$. \end{definition} As an example of using these definitions, a \emph{membership test} -or \emph{range query} would be considered search problems, and a -range query over the interval $[10, 99]$ would be a query. +or \emph{range scan} would be considered search problems, and a range +scan over the interval $[10, 99]$ would be a query. We've drawn this +distinction because, as we'll see as we enter into the discussion of +our work in later chapters, it is useful to have seperate, unambiguous +terms for these two concepts. \subsection{Decomposable Search Problems} -An important subset of search problems is that of decomposable -search problems (DSPs). This class was first defined by Saxe and -Bentley as follows, +Dynamization techniques require the partitioning of one data structure +into several, smaller ones. As a result, these techniques can only +be applied in situations where the search problem to be answered can +be answered from this set of smaller data structures, with the same +answer as would have been obtained had all of the data been used to +construct a single, large structure. This requirement is formalized in +the definition of a class of problems called \emph{decomposable search +problems (DSP)}. This class was first defined by Bentley and Saxe in +their work on dynamization, and we will adopt their definition, \begin{definition}[Decomposable Search Problem~\cite{saxe79}] \label{def:dsp} - Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and - only if there exists a consant-time computable, associative, and - commutative binary operator $\square$ such that, + A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and + only if there exists a constant-time computable, associative, and + commutative binary operator $\mergeop$ such that, \begin{equation*} - F(A \cup B, q) = F(A, q)~ \square ~F(B, q) + F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} \end{definition} -The constant-time requirement was used to prove bounds on the costs of -evaluating DSPs over data broken across multiple partitions. Further work -by Overmars lifted this constraint and considered a more general class -of DSP, +The requirement for $\mergeop$ to be constant-time was used by Bentley and +Saxe to prove specific performance bounds for answering queries from a +decomposed data structure. However, it is not strictly \emph{necessary}, +and later work by Overmars lifted this constraint and considered a more +general class of search problems called \emph{$C(n)$-decomposable search +problems}, + \begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}] - Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable + A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, - and commutative binary operator $\square$ such that, + and commutative binary operator $\mergeop$ such that, \begin{equation*} - F(A \cup B, q) = F(A, q)~ \square ~F(B, q) + F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} \end{definition} -Decomposability is an important property because it allows for -search problems to be answered over partitioned datasets. 
The details -of this will be discussed in Section~\ref{ssec:bentley-saxe} in the -context of creating dynamic data structures. Many common types of -search problems appearing in databases are decomposable, such as -range queries or predicate filtering. - -To demonstrate that a search problem is decomposable, it is necessary -to show the existance of the merge operator, $\square$, and to show -that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two -results, simple induction demonstrates that the problem is decomposable -even in cases with more than two partial results. - -As an example, consider range queries, -\begin{definition}[Range Query] -Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval, -$ q = [x, y],\quad x,y \in R$, a range query returns all points in -$D \cap q$. +To demonstrate that a search problem is decomposable, it is necessary to +show the existence of the merge operator, $\mergeop$, with the necessary +properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, +q)$. With these two results, induction demonstrates that the problem is +decomposable even in cases with more than two partial results. + +As an example, consider range scans, +\begin{definition}[Range Count] + Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval, + $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns + the cardinality, $|d \cap q|$. \end{definition} \begin{theorem} -Range Queries are a DSP. +Range Count is a decomposable search problem. \end{theorem} \begin{proof} -Let $\square$ be the set union operator ($\cup$). Applying this to -Definition~\ref{def:dsp}, we have +Let $\mergeop$ be addition ($+$). Applying this to +Definition~\ref{def:dsp}, gives \begin{align*} - (A \cup B) \cap q = (A \cap q) \cup (B \cap q) + |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)| \end{align*} -which is true by the distributive property of set union and -intersection. Assuming an implementation allowing for an $O(1)$ -set union operation, range queries are DSPs. +which is true by the distributive property of union and +intersection. Addition is an associative and commutative +operator that can be calculated in $O(1)$ time. Therefore, range counts +are DSPs. \end{proof} Because the codomain of a DSP is not restricted, more complex output structures can be used to allow for problems that are not directly decomposable to be converted to DSPs, possibly with some minor -post-processing. For example, the calculation of the mean of a set -of numbers can be constructed as a DSP using the following technique, +post-processing. For example, calculating the arithmetic mean of a set +of numbers can be formulated as a DSP, \begin{theorem} -The calculation of the average of a set of numbers is a DSP. +The calculation of the arithmetic mean of a set of numbers is a DSP. \end{theorem} \begin{proof} -Define the search problem as $A:D \to (\mathbb{R}, \mathbb{Z})$, -where $D\subset\mathbb{R}$ and is a multiset. The output tuple + Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, + where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple contains the sum of the values within the input set, and the -cardinality of the input set. Let the $A(D_1) = (s_1, c_1)$ and -$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)\square A(D_2) = (s_1 + -s_2, c_1 + c_2)$. +cardinality of the input set. For two disjoint paritions of the data, +$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. 
Let +$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$. -Applying Definition~\ref{def:dsp}, we have +Applying Definition~\ref{def:dsp}, gives \begin{align*} - A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\ + A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\ (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c) \end{align*} From this result, the average can be determined in constant time by @@ -317,258 +158,741 @@ taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP. \end{proof} -\section{Dynamic Extension Techniques} -Because data in a database is regularly updated, data structures -intended to be used as an index must support updates (inserts, -in-place modification, and deletes) to their data. In principle, -any data structure can support updates to its underlying data through -global reconstruction: adjusting the record set and then rebuilding -the entire structure. Ignoring this trivial (and highly inefficient) -approach, a data structure with support for updates is called -\emph{dynamic}, and one without support for updates is called -\emph{static}. In this section, we discuss approaches for modifying -a static data structure to grant it support for updates, a process -called \emph{dynamic extension} or \emph{dynamization}. A theoretical -survey of this topic can be found in~\cite{overmars83}, but this -work doesn't cover several techniques that are used in practice. -As such, much of this section constitutes our own analysis, tying -together threads from a variety of sources. - -\subsection{Local Reconstruction} - -One way of viewing updates to a data structure is as reconstructing -all or part of the structure. To minimize the cost of the update, -it is ideal to minimize the size of the reconstruction that accompanies -an update, either by careful structuring of the data to ensure -minimal disruption to surrounding records by an update, or by -deferring the reconstructions and amortizing their costs over as -many updates as possible. - -While minimizing the size of a reconstruction seems the most obvious, -and best, approach, it is limited in its applicability. The more -related ``nearby'' records in the structure are, the more records -will be affected by a change. Records can be related in terms of -some ordering of their values, which we'll term a \emph{spatial -ordering}, or in terms of their order of insertion to the structure, -which we'll term a \emph{temporal ordering}. Note that these terms -don't imply anything about the nature of the data, and instead -relate to the principles used by the data structure to arrange them. - -Arrays provide the extreme version of both of these ordering -principles. In an unsorted array, in which records are appended to -the end of the array, there is no spatial ordering dependence between -records. This means that any insert or update will require no local -reconstruction, aside from the record being directly affected.\footnote{ -A delete can also be performed without any structural adjustments -in a variety of ways. Reorganization of the array as a result of -deleted records serves an efficiency purpose, but isn't required -for the correctness of the structure. } However, the order of -records in the array \emph{does} express a strong temporal dependency: -the index of a record in the array provides the exact insertion -order. - -A sorted array provides exactly the opposite situation. The order -of a record in the array reflects an exact spatial ordering of -records with respect to their sorting function. 
This means that an -update or insert will require reordering a large number of records -(potentially all of them, in the worst case). Because of the stronger -spatial dependence of records in the structure, an update will -require a larger-scale reconstruction. Additionally, there is no -temporal component to the ordering of the records: inserting a set -of records into a sorted array will produce the same final structure -irrespective of insertion order. - -It's worth noting that the spatial dependency discussed here, as -it relates to reconstruction costs, is based on the physical layout -of the records and not the logical ordering of them. To exemplify -this, a sorted singly-linked list can maintain the same logical -order of records as a sorted array, but limits the spatial dependce -between records each records preceeding node. This means that an -insert into this structure will require only a single node update, -regardless of where in the structure this insert occurs. - -The amount of spatial dependence in a structure directly reflects -a trade-off between read and write performance. In the above example, -performing a lookup for a given record in a sorted array requires -asymptotically fewer comparisons in the worst case than an unsorted -array, because the spatial dependecies can be exploited for an -accelerated search (binary vs. linear search). Interestingly, this -remains the case for lookups against a sorted array vs. a sorted -linked list. Even though both structures have the same logical order -of records, limited spatial dependecies between nodes in a linked -list forces the lookup to perform a scan anyway. - -A balanced binary tree sits between these two extremes. Like a -linked list, individual nodes have very few connections. However -the nodes are arranged in such a way that a connection existing -between two nodes implies further information about the ordering -of children of those nodes. In this light, rebalancing of the tree -can be seen as maintaining a certain degree of spatial dependence -between the nodes in the tree, ensuring that it is balanced between -the two children of each node. A very general summary of tree -rebalancing techniques can be found in~\cite{overmars83}. Using an -AVL tree~\cite{avl} as a specific example, each insert in the tree -involves adding the new node and updating its parent (like you'd -see in a simple linked list), followed by some larger scale local -reconstruction in the form of tree rotations, to maintain the balance -factor invariant. This means that insertion requires more reconstruction -effort than the single pointer update in the linked list case, but -results in much more efficient searches (which, as it turns out, -makes insertion more efficient in general too, even with the overhead, -because finding the insertion point is much faster). - -\subsection{Amortized Local Reconstruction} - -In addition to control update cost by arranging the structure so -as to reduce the amount of reconstruction necessary to maintain the -desired level of spatial dependence, update costs can also be reduced -by amortizing the local reconstruction cost over multiple updates. -This is often done in one of two ways: leaving gaps or adding -overflow buckets. These gaps and buckets allows for a buffer of -insertion capacity to be sustained by the data structure, before -a reconstruction is triggered. 
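As a rough illustration of the gap approach (a hypothetical sketch of ours,
not taken from any particular system), consider a sorted node that is
allocated with more slots than it currently holds: inserts are absorbed
into the spare slots cheaply, and a local reconstruction, here a split
into two half-full nodes, is triggered only once those slots are exhausted,

\begin{verbatim}
import bisect

class GappedNode:
    """Illustrative sketch: a sorted node with spare capacity ("gaps").
    Inserts are cheap until the node fills; only then is a local
    reconstruction (a split) required, which also replenishes the gaps.
    Real B+-tree nodes additionally maintain parent/sibling pointers
    and minimum-fill invariants, which are omitted here."""
    CAPACITY = 8

    def __init__(self, keys=None):
        self.keys = sorted(keys or [])

    def insert(self, key):
        if len(self.keys) < self.CAPACITY:   # a gap is available
            bisect.insort(self.keys, key)    # cheap, purely local update
            return [self]
        # Node is full: split into two half-full nodes, then insert.
        mid = len(self.keys) // 2
        left = GappedNode(self.keys[:mid])
        right = GappedNode(self.keys[mid:])
        target = left if key < right.keys[0] else right
        bisect.insort(target.keys, key)
        return [left, right]
\end{verbatim}

The returned list stands in for the structural modification that a real
tree would propagate to the parent node after a split.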
- -A classic example of the gap approach is found in the -B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well -as open addressing for hash tables. In a B$^+$-tree, each node has -a fixed size, which must be at least half-utilized (aside from the -root node). The empty spaces within these nodes are gaps, which can -be cheaply filled with new records on insert. Only when a node has -been filled must a local reconstruction (called a structural -modification operation for B-trees) occur to redistribute the data -into multiple nodes and replenish the supply of gaps. This approach -is particularly well suited to data structures in contexts where -the natural unit of storage is larger than a record, as in disk-based -(with 4KiB pages) or cache-optimized (with 64B cachelines) structures. -This gap-based approach was also used to create ALEX, an updatable -learned index~\cite{ALEX}. - -The gap approach has a number of disadvantages. It results in a -somewhat sparse structure, thereby wasting storage. For example, a -B$^+$-tree requires all nodes other than the root to be at least -half full--meaning in the worst case up to half of the space required -by the structure could be taken up by gaps. Additionally, this -scheme results in some inserts being more expensive than others: -most new records will occupy an available gap, but some will trigger -more expensive SMOs. In particular, it has been observed with -B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}: -the gaps in many nodes fill at about the same time, leading to -periodic clusters of high-cost merge operations. - -Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam}, -as well as hash tables with closed addressing. In this approach, -parts of the structure into which records would be inserted (leaf -nodes of ISAM, directory entries in CA hashing) have a pointer to -an overflow location, where newly inserted records can be placed. -This allows for the structure to, theoretically, sustain an unlimited -amount of insertions. However, read performance degrades, because -the more overflow capacity is utilized, the less the records in the -structure are ordered according to the data structure's definition. -Thus, periodically a reconstruction is necessary to distribute the -overflow records into the structure itself. - -\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method} - -Another approach to support updates is to amortize the cost of -global reconstruction over multiple updates. This approach can take -take three forms, -\begin{enumerate} - - \item Pairing a dynamic data structure (called a buffer or - memtable) with an instance of the structure being extended. - Updates are written to the buffer, and when the buffer is - full its records are merged with those in the static - structure, and the structure is rebuilt. This approach is - used by one version of the originally proposed - LSM-tree~\cite{oneil93}. Technically this technique proposed - in that work for the purposes of converting random writes - into sequential ones (all structures involved are dynamic), - but it can be used for dynamization as well. - - \item Creating multiple, smaller data structures each - containing a partition of the records from the dataset, and - reconstructing individual structures to accomodate new - inserts in a systematic manner. This technique is the basis - of the Bentley-Saxe method~\cite{saxe79}. - - \item Using both of the above techniques at once. 
This is - the approach used by modern incarnations of the - LSM~tree~\cite{rocksdb}. - -\end{enumerate} - -In all three cases, it is necessary for the search problem associated -with the index to be a DSP, as answering it will require querying -multiple structures (the buffer and/or one or more instances of the -data structure) and merging the results together to get a final -result. This section will focus exclusively on the Bentley-Saxe -method, as it is the basis for our proposed methodology.p - -When dividing records across multiple structures, there is a clear -trade-off between read performance and write performance. Keeping -the individual structures small reduces the cost of reconstructing, -and thereby increases update performance. However, this also means -that more structures will be required to accommodate the same number -of records, when compared to a scheme that allows the structures -to be larger. As each structure must be queried independently, this -will lead to worse query performance. The reverse is also true, -fewer, larger structures will have better query performance and -worse update performance, with the extreme limit of this being a -single structure that is fully rebuilt on each insert. -\begin{figure} - \caption{Inserting a new record using the Bentley-Saxe method.} - \label{fig:bsm-example} -\end{figure} +\section{Database Indexes} +\label{sec:indexes} + +Within a database system, search problems are expressed using +some high level language (or mapped directly to commands, for +simpler systems like key-value stores), which is processed by +the database system to produce a result. Within many database +systems, the most basic access primitive is a table scan, which +sequentially examines each record within the data set. There are many +situations in which the same query could be answered in less time using +a more sophisticated data access scheme, however, and databases support +a limited number of such schemes through the use of specialized data +structures called \emph{indices} (or indexes). Indices can be built over +a set of attributes in a table and provide faster access for particular +search problems. -The key insight of the Bentley-Saxe method~\cite{saxe79} is that a -good balance can be struck by uses a geometrically increasing -structure size. In Bentley-Saxe, the sub-structures are ``stacked'', -with the bottom level having a capacity of a single record, and -each subsequent level doubling in capacity. When an update is -performed, the first empty level is located and a reconstruction -is triggered, merging the structures of all levels below this empty -one, along with the new record. An example of this process is shown -in Figure~\ref{fig:bsm-example}. The merits of this approach are -that it ensures that ``most'' reconstructions involve the smaller -data structures towards the bottom of the sequence, while most of -the records reside in large, infrequently updated, structures towards -the top. This balances between the read and write implications of -structure size, while also allowing the number of structures required -to represent $n$ records to be worst-case bounded by $O(\log n)$. 
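To sketch why this arrangement yields the amortized bound given below
(an argument we summarize here, under the assumption that $P(n)/n$ is
non-decreasing, as is the case for any at-least-linear construction cost):
a record participates in at most one reconstruction per level, and there
are at most $\lceil \log_2 n \rceil$ levels. Charging the $P(2^i)$ cost
of rebuilding a level of capacity $2^i$ evenly to the $2^i$ records it
contains results in a per-record, per-level charge of
\begin{equation*}
    \frac{P(2^i)}{2^i} \leq \frac{P(n)}{n},
\end{equation*}
and therefore a total amortized charge of $O\left(\frac{P(n)}{n} \log n\right)$
per insertion.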
- -Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$ -query cost, the Bentley-Saxe Method will produce a dynamic data -structure with, +The term \emph{index} is often abused within the database community +to refer to a range of closely related, but distinct, conceptual +categories.\footnote{ +The word index can be used to refer to a structure mapping record +information to the set of records matching that information, as a +general synonym for ``data structure'', to data structures used +specifically in query processing, etc. +} +This ambiguity is rarely problematic, as the subtle differences between +these categories are not often significant, and context clarifies the +intended meaning in situations where they are. However, this work +explicitly operates at the interface of two of these categories, and so +it is important to disambiguate between them. + +\subsection{The Classical Index} +A database index is a specialized data structure that provides a means +to efficiently locate records that satisfy specific criteria. This +enables more efficient query processing for supported search problems. A +classical index can be modeled as a function, mapping a set of attribute +values, called a key, $\mathcal{K}$, to a set of record identifiers, +$\mathcal{R}$. The codomain of an index can be either the set of +record identifiers, a set containing sets of record identifiers, or +the set of physical records, depending upon the configuration of the +index.~\cite{cowbook} For our purposes here, we'll focus on the first of +these, but the use of other codmains wouldn't have any material effect +on our discussion. + +We will use the following definition of a "classical" database index, + +\begin{definition}[Classical Index~\cite{cowbook}] +Consider a set of database records, $\mathcal{D}$. An index over +these records, $\mathcal{I}_\mathcal{D}$ is a map of the form + $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where +$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, +called a \emph{key}. +\end{definition} + +In order to facilitate this mapping, indexes are built using data +structures. The selection of data structure has implications on the +performance of the index, and the types of search problem it can be +used to accelerate. Broadly speaking, classical indices can be divided +into two categories: ordered and unordered. Ordered indices allow for +the iteration over a set of record identifiers in a particular sorted +order of keys, and the efficient location of a specific key value in +that order. These indices can be used to accelerate range scans and +point-lookups. Unordered indices are specialized for point-lookups on a +particular key value, and do not support iterating over records in some +order.~\cite{cowbook, mysql-btree-hash} + +There is a very small set of data structures that are usually used for +creating classical indexes. For ordered indices, the most commonly used +data structure is the B-tree~\cite{ubiq-btree},\footnote{ + By \emph{B-tree} here, we are referring not to the B-tree data + structure, but to a wide range of related structures derived from + the B-tree. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc. +} +and the log-structured merge (LSM) tree~\cite{oneil96} is also often +used within the context of key-value stores~\cite{rocksdb}. Some databases +implement unordered indices using hash tables~\cite{mysql-btree-hash}. 
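To make the interface provided by an ordered index concrete, the
following toy sketch (ours; the class and method names are purely
illustrative) maps keys to record identifiers over a sorted array and
supports both point lookups and ordered range scans,

\begin{verbatim}
import bisect

class ToyOrderedIndex:
    """Illustrative sketch of an ordered index over (key, record_id)
    pairs. A real system would use a B-tree, LSM tree, or similar
    structure rather than a flat sorted array."""
    def __init__(self, pairs):
        self.entries = sorted(pairs)              # ordered by key
        self.keys = [k for k, _ in self.entries]

    def lookup(self, key):
        """Point lookup: identifiers of records whose key equals key."""
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return [rid for _, rid in self.entries[lo:hi]]

    def range_scan(self, low, high):
        """Identifiers of records with low <= key <= high, in key order."""
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return [rid for _, rid in self.entries[lo:hi]]
\end{verbatim}

An unordered (hash-based) index would expose only the point lookup,
trading away the ordered scan for cheaper single-key access.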
+ + +\subsection{The Generalized Index} + +The previous section discussed the classical definition of index +as might be found in a database systems textbook. However, this +definition is limited by its association specifically with mapping +key fields to records. For the purposes of this work, a broader +definition of index will be considered, + +\begin{definition}[Generalized Index] +Consider a set of database records, $\mathcal{D}$, and search +problem, $\mathcal{Q}$. +A generalized index, $\mathcal{I}_\mathcal{D}$ +is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to +\mathcal{R})$. +\end{definition} + +A classical index is a special case of a generalized index, with $\mathcal{Q}$ +being a point-lookup or range scan based on a set of record attributes. + +There are a number of generalized indexes that appear in some database systems. +For example, some specialized databases or database extensions have support for +indexes based the R-tree\footnote{ Like the B-tree, R-tree here is used as a +signifier for a general class of related data structures.} for spatial +databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world +graphs for similarity search~\cite{pinecone-db}, among others. These systems +are typically either an add-on module, or a specialized standalone database +that has been designed specifically for answering particular types of queries +(such as spatial queries, similarity search, string matching, etc.). + +%\subsection{Indexes in Query Processing} + +%A database management system utilizes indexes to accelerate certain +%types of query. Queries are expressed to the system in some high +%level language, such as SQL or Datalog. These are generalized +%languages capable of expressing a wide range of possible queries. +%The DBMS is then responsible for converting these queries into a +%set of primitive data access procedures that are supported by the +%underlying storage engine. There are a variety of techniques for +%this, including mapping directly to a tree of relational algebra +%operators and interpreting that tree, query compilation, etc. But, +%ultimately, this internal query representation is limited by the routines +%supported by the storage engine.~\cite{cowbook} + +%As an example, consider the following SQL query (representing a +%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient +%ways of answering this query, but I'm aiming for simplicity here +%to demonstrate my point}, +% +%\begin{verbatim} +%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A +% WHERE A.property = filtering_criterion +% ORDER BY d +% LIMIT 5; +%\end{verbatim} +% +%This query will be translated into a logical query plan (a sequence +%of relational algebra operators) by the query planner, which could +%result in a plan like this, +% +%\begin{verbatim} +%query plan here +%\end{verbatim} +% +%With this logical query plan, the DBMS will next need to determine +%which supported operations it can use to most efficiently answer +%this query. For example, the selection operation (A) could be +%physically manifested as a table scan, or could be answered using +%an index scan if there is an ordered index over \texttt{A.property}. +%The query optimizer will make this decision based on its estimate +%of the selectivity of the predicate. 
This may result in one of the +%following physical query plans +% +%\begin{verbatim} +%physical query plan +%\end{verbatim} +% +%In either case, however, the space of possible physical plans is +%limited by the available access methods: either a sorted scan on +%an attribute (index) or an unsorted scan (table scan). The database +%must filter for all elements matching the filtering criterion, +%calculate the distances between all of these points and the query, +%and then sort the results to get the final answer. Additionally, +%note that the sort operation in the plan is a pipeline-breaker. If +%this plan were to appear as a sub-tree in a larger query plan, the +%overall plan would need to wait for the full evaluation of this +%sub-query before it could proceed, as sorting requires the full +%result set. +% +%Imagine a world where a new index was available to the DBMS: a +%nearest neighbor index. This index would allow the iteration over +%records in sorted order, relative to some predefined metric and a +%query point. If such an index existed over \texttt{(A.x, A.y)} using +%\texttt{dist}, then a third physical plan would be available to the DBMS, +% +%\begin{verbatim} +%\end{verbatim} +% +%This plan pulls records in order of their distance to \texttt{Q} +%directly, using an index, and then filters them, avoiding the +%pipeline breaking sort operation. While it's not obvious in this +%case that this new plan is superior (this would depend upon the +%selectivity of the predicate), it is a third option. It becomes +%increasingly superior as the selectivity of the predicate grows, +%and is clearly superior in the case where the predicate has unit +%selectivity (requiring only the consideration of $5$ records total). +% +%This use of query-specific indexing schemes presents a query +%optimization challenge: how does the database know when a particular +%specialized index can be used for a given query, and how can +%specialized indexes broadcast their capabilities to the query optimizer +%in a general fashion? This work is focused on the problem of enabling +%the existence of such indexes, rather than facilitating their use; +%however these are important questions that must be considered in +%future work for this solution to be viable. There has been work +%done surrounding the use of arbitrary indexes in queries in the past, +%such as~\cite{byods-datalog}. This problem is considered out-of-scope +%for the proposed work, but will be considered in the future. + +\section{Classical Dynamization Techniques} + +Because data in a database is regularly updated, data structures +intended to be used as an index must support updates (inserts, in-place +modification, and deletes). Not all potentially useful data structures +support updates, and so a general strategy for adding update support +would increase the number of data structures that could be used as +database indices. We refer to a data structure with update support as +\emph{dynamic}, and one without update support as \emph{static}.\footnote{ + The term static is distinct from immutable. Static refers to the + layout of records within the data structure, whereas immutable + refers to the data stored within those records. This distinction + will become relevant when we discuss different techniques for adding + delete support to data structures. The data structures used are + always static, but not necessarily immutable, because the records may + contain header information (like visibility) that is updated in place. 
+} + +This section discusses \emph{dynamization}, the construction of a dynamic +data structure based on an existing static one. When certain conditions +are satisfied by the data structure and its associated search problem, +this process can be done automatically, and with provable asymptotic +bounds on amortized insertion performance, as well as worst case query +performance. We will first discuss the necessary data structure +requirements, and then examine several classical dynamization techniques. +The section will conclude with a discussion of delete support within the +context of these techniques. + +It is worth noting that there are a variety of techniques +discussed in the literature for dynamizing structures with specific +properties, or under very specific sets of circumstances. Examples +include frameworks for adding update support succinct data +structures~\cite{dynamize-succinct} or taking advantage of batching +of insert and query operations~\cite{batched-decomposable}. This +section discusses techniques that are more general, and don't require +workload-specific assumptions. + + +\subsection{Global Reconstruction} + +The most fundamental dynamization technique is that of \emph{global +reconstruction}. While not particularly useful on its own, global +reconstruction serves as the basis for the techniques to follow, and so +we will begin our discussion of dynamization with it. + +Consider a class of data structure, $\mathcal{I}$, capable of answering a +search problem, $\mathcal{Q}$. Insertion via global reconstruction is +possible if $\mathcal{I}$ supports the following two operations, +\begin{align*} +\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\ +\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D}) +\end{align*} +where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$ +over the data structure over a set of records $d \subseteq \mathcal{D}$ +in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d +\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in +$\Theta(1)$ time,\footnote{ + There isn't any practical reason why $\mathtt{unbuild}$ must run + in constant time, but this is the assumption made in \cite{saxe79} + and in subsequent work based on it, and so we will follow the same + defininition here. +} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$. + +Given this structure, an insert of record $r \in \mathcal{D}$ into a +data structure $\mathscr{I} \in \mathcal{I}$ can be defined by, +\begin{align*} +\mathscr{I}_{i}^\prime = \text{build}(\text{unbuild}(\mathscr{I}_i) \cup \{r\}) +\end{align*} + +It goes without saying that this operation is sub-optimal, as the +insertion cost is $\Theta(C(n))$, and $C(n) \in \Omega(n)$ at best for +most data structures. However, this global reconstruction strategy can +be used as a primitive for more sophisticated techniques that can provide +reasonable performance. + +\subsection{Amortized Global Reconstruction} +\label{ssec:agr} + +The problem with global reconstruction is that each insert must rebuild +the entire data structure, involving all of its records. This results +in a worst-case insert cost of $\Theta(C(n))$. However, opportunities +for improving this scheme can present themselves when considering the +\emph{amortized} insertion cost. + +Consider the cost acrrued by the dynamized structure under global +reconstruction over the lifetime of the structure. 
Each insert will result +in all of the existing records being rewritten, so at worst each record +will be involved in $\Theta(n)$ reconstructions, each reconstruction +having $\Theta(C(n))$ cost. We can amortize this cost over the $n$ records +inserted to get an amortized insertion cost for global reconstruction of, + +\begin{equation*} +I_a(n) = \frac{C(n) \cdot n}{n} = C(n) +\end{equation*} + +This doesn't improve things as is, however it does present two +opportunities for improvement. If we could either reduce the size of +the reconstructions, or the number of times a record is reconstructed, +then we could reduce the amortized insertion cost. + +The key insight, first discussed by Bentley and Saxe, is that +this goal can be accomplished by \emph{decomposing} the data +structure into multiple, smaller structures, each built from a +disjoint partition of the data. As long as the search problem +being considered is decomposable, queries can be answered from +this structure with bounded worst-case overhead, and the amortized +insertion cost can be improved~\cite{saxe79}. Significant theoretical +work exists in evaluating different strategies for decomposing the +data structure~\cite{saxe79, overmars81, overmars83} and for leveraging +specific efficiencies of the data structures being considered to improve +these reconstructions~\cite{merge-dsp}. + +There are two general decomposition techniques that emerged from this +work. The earliest of these is the logarithmic method, often called +the Bentley-Saxe method in modern literature, and is the most commonly +discussed technique today. A later technique, the equal block method, +was also examined. It is generally not as effective as the Bentley-Saxe +method, but it has some useful properties for explainatory purposes and +so will be discussed here as well. + +\subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}} +\label{ssec:ebm} + +Though chronologically later, the equal block method is theoretically a +bit simpler, and so we will begin our discussion of decomposition-based +technique for dynamization of decomposable search problems with it. The +core concept of the equal block method is to decompose the data structure +into several smaller data structures, called blocks, over partitions +of the data. This decomposition is performed such that each block is of +roughly equal size. + +Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves +some decomposable search problem, $F$ and is built over a set of records +$d \in \mathcal{D}$. This structure can be decomposed into $s$ blocks, +$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over +partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value +makes little sense when the number of records changes, and so it is taken +to be governed by a smooth, monotonically increasing function $f(n)$ such +that, at any point, the following two constraints are obeyed. \begin{align} - \text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\ - \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right) + f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\ + \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{i} \label{ebm-c2} \end{align} +where $|\mathscr{I}_j|$ is the number of records in the block, +$|\text{unbuild}(\mathscr{I}_j)|$. + +A new record is inserted by finding the smallest block and rebuilding it +using the new record. 
If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$, +then an insert is done by, +\begin{equation*} +\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\}) +\end{equation*} +Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{ + Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be + violated by deletes. We're omitting deletes from the discussion at + this point, but will circle back to them in Section~\ref{sec:deletes}. +} In this case, the constraints are enforced by "reconfiguring" the +structure. $s$ is updated to be exactly $f(n)$, all of the existing +blocks are unbuilt, and then the records are redistributed evenly into +$s$ blocks. + +A query with parameters $q$ is answered by this structure by individually +querying the blocks, and merging the local results together with $\mergeop$, +\begin{equation*} +F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q) +\end{equation*} +where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to +answering the query over $d$ using the data structure $\mathscr{I}$. + +This technique provides better amortized performance bounds than global +reconstruction, at the possible cost of increased query performance for +sub-linear queries. We'll omit the details of the proof of performance +for brevity and streamline some of the original notation (full details +can be found in~\cite{overmars83}), but this technique ultimately +results in a data structure with the following performance characterstics, +\begin{align*} +\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\ +\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\ +\end{align*} +where $C(n)$ is the cost of statically building $\mathcal{I}$, and +$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$. + +%TODO: example? + + +\subsection{The Bentley-Saxe Method~\cite{saxe79}} +\label{ssec:bsm} + +%FIXME: switch this section (and maybe the previous?) over to being +% indexed at 0 instead of 1 + +The original, and most frequently used, dynamization technique is the +Bentley-Saxe Method (BSM), also called the logarithmic method in older +literature. Rather than breaking the data structure into equally sized +blocks, BSM decomposes the structure into logarithmically many blocks +of exponentially increasing size. More specifically, the data structure +is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_1, +\mathscr{I}_2, \ldots, \mathscr{I}_h$. A given block $\mathscr{I}_i$ +will be either empty, or contain exactly $2^i$ records within it. + +The procedure for inserting a record, $r \in \mathcal{D}$, into +a BSM dynamization is as follows. If the block $\mathscr{I}_0$ +is empty, then $\mathscr{I}_0 = \text{build}{\{r\}}$. If it is not +empty, then there will exist a maximal sequence of non-empty blocks +$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq +0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case, +$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i +\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through +$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the +end of the structure as needed. 
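The insertion procedure can be summarized in a short sketch (ours;
\texttt{build} and \texttt{unbuild} are the global-reconstruction
primitives defined earlier, and records are assumed to be hashable so
that they can be collected into a set),

\begin{verbatim}
def bsm_insert(blocks, r, build, unbuild):
    """Illustrative sketch of one insert into a Bentley-Saxe
    decomposition. blocks[i] is either None (empty) or a static
    structure holding exactly 2**i records."""
    records = {r}
    i = 0
    # Unbuild the maximal prefix of non-empty blocks, emptying them.
    while i < len(blocks) and blocks[i] is not None:
        records |= unbuild(blocks[i])
        blocks[i] = None
        i += 1
    if i == len(blocks):        # every block was full: grow by one level
        blocks.append(None)
    blocks[i] = build(records)  # len(records) == 2**i
    return blocks
\end{verbatim}

For instance, taking \texttt{build} to be a sort and \texttt{unbuild} to
return the underlying record set yields a decomposition into sorted
arrays, the classic illustrative case for binary-searchable structures.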
+ +%FIXME: switch the x's to r's for consistency +\begin{figure} +\centering +\includegraphics[width=.8\textwidth]{diag/bsm.pdf} +\caption{An illustration of inserts into the Bentley-Saxe Method} +\label{fig:bsm-example} +\end{figure} -However, the method has poor worst-case insertion cost: if the -entire structure is full, it must grow by another level, requiring -a full reconstruction involving every record within the structure. -A slight adjustment to the technique, due to Overmars and van Leuwen -\cite{}, allows for the worst-case insertion cost to be bounded by -$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing -each reconstruction into small pieces, one of which is executed -each time a new update occurs. This has the effect of bounding the -worst-case performance, but does so by sacrificing the expected -case performance, and adds a lot of complexity to the method. This -technique is not used much in practice.\footnote{ - I've yet to find any example of it used in a journal article - or conference paper. -} +Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The +dynamization is built over a set of records $x_1, x_2, \ldots, +x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in +$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly +into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the +first empty block is $\mathscr{I}_2$, and so the insert is performed by +doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup +\text{unbuild}(\mathscr{I}_1) \cup \text{unbuild}(\mathscr{I}_2)\right)$ +and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$. + +This technique is called a \emph{binary decomposition} of the data +structure. Considering a BSM dynamization of a structure containing $n$ +records, labeling each block with a $0$ if it is empty and a $1$ if it +is full will result in the binary representation of $n$. For example, +the final state of the structure in Figure~\ref{fig:bsm-example} contains +$12$ records, and the labeling procedure will result in $0\text{b}1100$, +which is $12$ in binary. Inserts affect this representation of the +structure in the same way that incrementing the binary number by $1$ does. + +By applying BSM to a data structure, a dynamized structure can be created +with the following performance characteristics, +\begin{align*} +\text{Amortized Insertion Cost:}&\quad \Theta\left(\left(\frac{C(n)}{n}\cdot \log_2 n\right)\right) \\ +\text{Worst Case Insertion Cost:}&\quad \Theta\left(C(n)\right) \\ +\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\ +\end{align*} +This is a particularly attractive result because, for example, a data +structure having $C(n) \in \Theta(n)$ will have an amortized insertion +cost of $\log_2 (n)$, which is quite reasonable. The cost is an extra +logarithmic multiple attached to the query complexity. It is also worth +noting that the worst-case insertion cost remains the same as global +reconstruction, but this case arises only very rarely. If you consider the +binary decomposition representation, the worst-case behavior is triggered +each time the existing number overflows, and a new digit must be added. + +\subsection{Delete Support} + +Classical dynamization techniques have also been developed with +support for deleting records. In general, the same technique of global +reconstruction that was used for inserting records can also be used to +delete them. 
+However, supporting deletes within the dynamization schemes discussed
+above is more complicated. The core problem is that inserts affect the
+dynamized structure in a deterministic way, which allows the
+partitioning scheme to be leveraged when reasoning about performance.
+Deletes provide no such guarantee.
+
+\begin{figure}
+\caption{A Bentley-Saxe dynamization for the integers on the
+interval $[1, 100]$.}
+\label{fig:bsm-delete-example}
+\end{figure}
+For example, consider a Bentley-Saxe dynamization that contains all
+integers on the interval $[1, 100]$, inserted in that order, shown in
+Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
+records from this structure, one at a time, using global reconstruction.
+This presents several problems,
+\begin{itemize}
+	\item For each record, we need to identify which block it is in before
+		we can delete it.
+	\item The cost of performing a delete is a function of which block the
+		record is in, which is a question of distribution and not easily
+		controlled.
+	\item As records are deleted, the structure will potentially violate
+		the invariants of the decomposition scheme used, which will
+		require additional work to fix.
+\end{itemize}
+
+\section{Limitations of Classical Dynamization Techniques}
+\label{sec:bsm-limits}
+
+While fairly general, these dynamization techniques have a number of
+limitations that prevent them from being directly usable as a general
+solution to the problem of creating database indices. Because of the
+requirement that the query being answered be decomposable, many search
+problems cannot be addressed, or at least not addressed efficiently,
+by decomposition-based dynamization. The techniques also do nothing to
+reduce the worst-case insertion cost, resulting in extremely poor tail
+latency relative to hand-built dynamic structures. Finally, these
+approaches do not do a good job of exposing the underlying configuration
+space to the user, meaning that the user can exert only limited control
+over the performance of the dynamized data structure. This section will
+discuss these limitations, and the rest of the document will be dedicated
+to proposing solutions to them.
+
+\subsection{Limits of Decomposability}
+\label{ssec:decomp-limits}
+Unfortunately, the DSP abstraction used as the basis of classical
+dynamization techniques has a few significant limitations that restrict
+their applicability,
+
+\begin{itemize}
+	\item The query must be broadcast identically to each block and cannot
+		be adjusted based on the state of the other blocks.
+
+	\item The query process is done in a single pass; it cannot be repeated.
+
+	\item The result merge operation must be $O(1)$ to maintain good query
+		performance.
+
+	\item The result merge operation must be commutative and associative,
+		and is called repeatedly to merge pairs of results.
+\end{itemize}
+
+These requirements restrict the types of queries that can be supported
+efficiently by the method. For example, k-nearest neighbor and independent
+range sampling are not decomposable.
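+
+Before turning to those examples, it is worth making the evaluation
+pattern these requirements describe explicit. The sketch below is
+illustrative only: \texttt{query\_block} stands in for evaluating the
+search problem against a single block, and \texttt{merge} for the
+$\mergeop$ operator, which must be constant-time, commutative, and
+associative for the standard decomposability bounds to hold.
+
+\begin{verbatim}
+from functools import reduce
+
+# Illustrative sketch: answer a decomposable search problem over a
+# decomposed structure by broadcasting the same query to every block
+# and folding the partial results together pairwise with merge().
+def answer_decomposed(blocks, query_block, merge, q):
+    # Assumes at least one non-empty block.
+    partial = [query_block(b, q) for b in blocks if b is not None]
+    return reduce(merge, partial)
+
+# Example: range counting is decomposable, with addition as the merge.
+blocks = [[5, 12, 19], None, [1, 7, 14, 30]]
+count = answer_decomposed(blocks,
+                          lambda b, q: sum(q[0] <= x <= q[1] for x in b),
+                          lambda a, b: a + b,
+                          (10, 20))   # counts 12, 19, and 14
+\end{verbatim}
+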
+
+\subsubsection{k-Nearest Neighbor}
+\label{sssec-decomp-limits-knn}
+The k-nearest neighbor (k-NN) problem is a generalization of the nearest
+neighbor problem, which seeks to return the closest point within the
+dataset to a given query point. More formally, this can be defined as,
+\begin{definition}[Nearest Neighbor]
+
+	Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
+	be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+	between two points within $D$. The nearest neighbor problem, $NN(D,
+	q)$, returns the point $d = \argmin_{d^\prime \in D} f(d^\prime, q)$
+	for a given query point, $q \in \mathbb{R}^d$.
+\end{definition}
+In practice, it is common to require that $f(x, y)$ be a metric,\footnote{
+	Contrary to its vernacular usage as a synonym for ``distance'', a
+	metric is more formally defined as a valid distance function over
+	a metric space. Metric spaces require their distance functions to
+	have the following properties,
+	\begin{itemize}
+		\item The distance between a point and itself is always 0.
+		\item All distances between non-equal points must be positive.
+		\item For all points, $x, y \in D$, it is true that
+			$f(x, y) = f(y, x)$.
+		\item For any three points $x, y, z \in D$ it is true that
+			$f(x, z) \leq f(x, y) + f(y, z)$.
+	\end{itemize}
+
+	These distances also must have the interpretation that $f(x, y) <
+	f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
+	is the opposite of the definition of similarity, and so some minor
+	manipulations are usually required to make similarity measures work
+	in metric-based indexes. \cite{intro-analysis}
+}
+and the example indexes for this problem considered in this work will
+assume one, but it is not a fundamental aspect of the problem
+formulation. The nearest neighbor problem itself is decomposable,
+with a simple merge function that selects, of its two inputs, the result
+with the smaller value of $f(x, q)$~\cite{saxe79}.
+
+The k-nearest neighbor problem generalizes nearest-neighbor to return
+the $k$ nearest elements,
+\begin{definition}[k-Nearest Neighbor]
+
+	Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
+	be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+	between two points within $D$. The k-nearest neighbor problem,
+	$KNN(D, q, k)$ seeks to identify a set $R \subseteq D$ with $|R| = k$
+	such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
+\end{definition}
-
-\subsection{Limitations of the Bentley-Saxe Method}
+This can be thought of as solving the nearest-neighbor problem $k$ times,
+each time removing the returned result from $D$ prior to solving the
+problem again. Unlike the single nearest-neighbor case (which can be
+thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
+\begin{theorem}
+	k-NN is not a decomposable search problem.
+\end{theorem}
+\begin{proof}
+To prove this, consider the query $KNN(D, q, k)$ against some partitioned
+dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
+then there must exist some constant-time, commutative, and associative
+binary operator $\mergeop$, such that $R = \bigmergeop_{0 \leq i \leq \ell}
+R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
+k)$. Consider the evaluation of the merge operator against two arbitrary
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
+|R_j| = k$, and that the contents of $R$ must be the $k$ records from
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
+time. Therefore, k-NN is not a decomposable search problem.
+\end{proof}
+With that said, there is no fundamental restriction preventing the
+merging of the result sets; the merge simply cannot be performed within
+the required constant time. Because the result sets can be merged in
+non-constant time, k-NN is $C(n)$-decomposable. Unfortunately, this
+classification brings with it a reduction in query performance as a
+result of the way result merges are performed.
+
+As a concrete example of these costs, consider using the Bentley-Saxe
+method to extend the VPTree~\cite{vptree}. The VPTree is a static,
+metric index capable of answering k-NN queries in $O(k \log n)$ time.
+One possible merge algorithm for k-NN would be to push all
+of the elements in the two arguments onto a min-heap, and then pop off
+the first $k$. In this case, the cost of the merge operation would be
+$C(k) \in \Theta(k \log k)$. Were $k$ assumed to be constant, the
+operation could be considered constant-time, but because $k$ is bounded
+above only by $n$, this is not a safe assumption to make in general.
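+
+The min-heap merge just described can be sketched as follows. This is
+illustrative only; \texttt{dist} stands in for the metric $f$, and no
+attempt is made to match the interfaces used elsewhere in this work.
+
+\begin{verbatim}
+import heapq
+
+# Illustrative sketch of a Theta(k log k) merge of two k-NN partial
+# results: place both result sets on a min-heap keyed by distance to
+# the query point, then pop the k nearest.
+def knn_merge(r1, r2, q, k, dist):
+    heap = [(dist(x, q), i, x) for i, x in enumerate(r1 + r2)]
+    heapq.heapify(heap)    # linear in the ~2k entries
+    return [heapq.heappop(heap)[2] for _ in range(min(k, len(heap)))]
+\end{verbatim}
+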
+Evaluating the total query cost for the extended structure yields,
+
+\begin{equation}
+	KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
+\end{equation}
+
+The reason for this large increase in cost is the repeated application
+of the merge operator. The Bentley-Saxe method requires applying the
+merge operator in a binary fashion to each partial result, multiplying
+its cost by a factor of $\log n$. Thus, the constant-time requirement
+of standard decomposability is necessary to keep the cost of the merge
+operator from appearing within the complexity bound of the entire
+operation in the general case.\footnote{
+	There is a special case, noted by Overmars, where the total cost is
+	$O(\mathscr{Q}(n) + C(n))$, without the logarithmic term, when
+	$\mathscr{Q}(n) + C(n) \in \Omega(n^\epsilon)$ for some $\epsilon > 0$.
+	This accounts for the case where the cost of the query and merge
+	operation is sufficiently large to consume the logarithmic factor,
+	and so it does not represent a special case with better performance.
+}
+If we could revise the result merging operation to remove this duplicated
+cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
+queries.
+
+\subsubsection{Independent Range Sampling}
+
+Another problem that is not decomposable is independent sampling. There
+are a variety of problems falling under this umbrella, including weighted
+set sampling, simple random sampling, and weighted independent range
+sampling, but we will focus on independent range sampling here.
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+	Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+	interval $q = [x, y]$ and an integer $k$, an independent range
+	sampling query returns $k$ independent samples from $D \cap q$
+	with each point having equal probability of being sampled.
+\end{definition}
+This problem immediately encounters a category error when considering
+whether it is decomposable: the result set is randomized, whereas
+the conditions for decomposability are defined in terms of an exact
+matching of records in result sets.
To work around this, a slight abuse
+of definition is in order: assume that the equality conditions within
+the DSP definition can be interpreted to mean ``the contents of the two
+sets are drawn from the same distribution''. This enables the category
+of DSP to apply to this type of problem. More formally,
+\begin{definition}[Decomposable Sampling Problem]
+	A sampling problem $F: (D, Q) \to R$ is decomposable if and
+	only if there exists a constant-time computable, associative, and
+	commutative binary operator $\mergeop$ such that,
+	\begin{equation*}
+		F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q)
+	\end{equation*}
+\end{definition}
+Even with this abuse, however, IRS cannot generally be considered
+decomposable; it is at best $C(n)$-decomposable. The reason for this is
+that matching the distribution requires drawing the appropriate number
+of samples from each partition of the data. Even in the special
+case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
+from each partition that must appear in the result set cannot be known
+in advance due to differences in the selectivity of the predicate across
+the partitions.
+
+\begin{example}[IRS Sampling Difficulties]
+
+	Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
+	\{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
+	an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
+	partitions have the same size, it seems sensible to evenly distribute
+	the samples across them ($4$ samples from each partition). Applying
+	the query predicate to the partitions results in the following,
+	$d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4\}$.
+
+	In expectation, then, the first result set will contain $R_0 = \{3,
+	3, 4, 4\}$, as each sample has a 50\% chance of being a $3$ and the
+	same probability of being a $4$. The second and third result sets can
+	only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging
+	these together, we'd find that the probability distribution of the
+	sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to
+	perform the same sampling operation over the full dataset (not
+	partitioned), the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
+
+\end{example}
+
+The problem is that the number of samples drawn from each partition needs to be
+weighted based on the number of elements satisfying the query predicate in that
+partition. In the above example, by drawing $4$ samples from $D_1$, more weight
+is given to $3$ than exists within the base dataset. This can be worked around
+by sampling a full $k$ records from each partition, returning both the sample
+and the number of records satisfying the predicate as that partition's query
+result, and then performing another pass of IRS as the merge operator, but this
+is the same approach as was used for k-NN above. This leaves IRS firmly in the
+$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
+samples to draw from each partition, then a constant-time merge operation could
+be used.
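+
+The sketch below illustrates that idea under a strong assumption: each
+block can cheaply report, and sample uniformly from, its records falling
+in the query interval (the linear scans here stand in for whatever the
+underlying structure would actually provide). With per-block counts in
+hand, each of the $k$ draws first selects a block with probability
+proportional to its count, which reproduces the distribution obtained by
+sampling from the full dataset.
+
+\begin{verbatim}
+import random
+
+# Illustrative sketch of IRS over a decomposed structure with known
+# per-block counts of in-range records. Each draw picks a block in
+# proportion to its count, then samples uniformly within that block.
+def irs_decomposed(blocks, lo, hi, k):
+    in_range = [[x for x in b if lo <= x <= hi] for b in blocks]
+    counts = [len(m) for m in in_range]
+    sample = []
+    for _ in range(k):    # assumes at least one in-range record
+        block = random.choices(in_range, weights=counts)[0]
+        sample.append(random.choice(block))
+    return sample
+
+# The example above: D_0, D_1, D_2 with query interval [3, 4], k = 12.
+print(irs_decomposed([[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]],
+                     3, 4, 12))
+\end{verbatim}
+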
+
+\subsection{Insertion Tail Latency}
+
+\subsection{Configurability}
+
+\section{Conclusion}
+This chapter discussed the necessary background information pertaining to
+queries and search problems, indexes, and techniques for dynamic extension. It
+described the potential for using custom indexes to accelerate particular
+kinds of queries, as well as the challenges associated with constructing these
+indexes. The remainder of this document will seek to address these challenges
+through modification and extension of the Bentley-Saxe method, describing work
+that has already been completed, as well as the additional work that must be
+done to realize this vision.