From 873fd659e45e80fe9e229d3d85b3c4c99fb2c121 Mon Sep 17 00:00:00 2001 From: Douglas Rumbaugh Date: Fri, 2 May 2025 16:03:40 -0400 Subject: Updates --- chapters/background.tex | 218 ++++--- chapters/background.tex.bak | 1324 +++++++++++++++++++++++++++---------------- 2 files changed, 965 insertions(+), 577 deletions(-) (limited to 'chapters') diff --git a/chapters/background.tex b/chapters/background.tex index 332dbb6..9950b39 100644 --- a/chapters/background.tex +++ b/chapters/background.tex @@ -14,7 +14,7 @@ indices will be discussed briefly. Indices are the primary use of data structures within the database context that is of interest to our work. Following this, existing theoretical results in the area of data structure dynamization will be discussed, which will serve as the building blocks -for our techniques in subsquent chapters. The chapter will conclude with +for our techniques in subsequent chapters. The chapter will conclude with a discussion of some of the limitations of these existing techniques. \section{Queries and Search Problems} @@ -62,7 +62,7 @@ As an example of using these definitions, a \emph{membership test} or \emph{range scan} would be considered search problems, and a range scan over the interval $[10, 99]$ would be a query. We've drawn this distinction because, as we'll see as we enter into the discussion of -our work in later chapters, it is useful to have seperate, unambiguous +our work in later chapters, it is useful to have separate, unambiguous terms for these two concepts. \subsection{Decomposable Search Problems} @@ -144,7 +144,7 @@ The calculation of the arithmetic mean of a set of numbers is a DSP. Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple contains the sum of the values within the input set, and the -cardinality of the input set. For two disjoint paritions of the data, +cardinality of the input set. For two disjoint partitions of the data, $D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let $A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$. @@ -416,7 +416,7 @@ $\Theta(1)$ time,\footnote{ There isn't any practical reason why $\mathtt{unbuild}$ must run in constant time, but this is the assumption made in \cite{saxe79} and in subsequent work based on it, and so we will follow the same - defininition here. + definition here. } such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$. Given this structure, an insert of record $r \in \mathcal{D}$ into a @@ -440,7 +440,7 @@ in a worst-case insert cost of $\Theta(C(n))$. However, opportunities for improving this scheme can present themselves when considering the \emph{amortized} insertion cost. -Consider the cost acrrued by the dynamized structure under global +Consider the cost accrued by the dynamized structure under global reconstruction over the lifetime of the structure. Each insert will result in all of the existing records being rewritten, so at worst each record will be involved in $\Theta(n)$ reconstructions, each reconstruction @@ -473,7 +473,7 @@ work. The earliest of these is the logarithmic method, often called the Bentley-Saxe method in modern literature, and is the most commonly discussed technique today. A later technique, the equal block method, was also examined. 
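+Both of the techniques discussed below are built from the same
+$\mathtt{build}$ and $\mathtt{unbuild}$ primitives used by global
+reconstruction. As a rough illustration only (a hypothetical Python
+sketch, assuming a sorted array as the static structure; the names are
+illustrative and not part of the original formulation), insertion by
+global reconstruction might be expressed as,
+\begin{verbatim}
+import bisect
+
+class SortedArray:
+    """A static structure: all of the work happens in build()."""
+    def __init__(self, records):
+        self._data = sorted(records)      # build(): the C(n) cost
+
+    def unbuild(self):
+        return list(self._data)           # recover the raw records
+
+    def range_query(self, lo, hi):
+        i = bisect.bisect_left(self._data, lo)
+        j = bisect.bisect_right(self._data, hi)
+        return self._data[i:j]
+
+def insert_by_global_reconstruction(index, record):
+    # I' = build(unbuild(I) + {r}), at a cost of Theta(C(n)) per insert
+    return SortedArray(index.unbuild() + [record])
+\end{verbatim}
+Both the equal block method and the Bentley-Saxe method partition the
+records across several such static structures to reduce the amount of
+data that a typical insert must rebuild. We consider the equal block
+method first.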
It is generally not as effective as the Bentley-Saxe -method, but it has some useful properties for explainatory purposes and +method, but it has some useful properties for explanatory purposes and so will be discussed here as well. \subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}} @@ -512,7 +512,7 @@ Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\fo Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be violated by deletes. We're omitting deletes from the discussion at this point, but will circle back to them in Section~\ref{sec:deletes}. -} In this case, the constraints are enforced by "reconfiguring" the +} In this case, the constraints are enforced by "re-configuring" the structure. $s$ is updated to be exactly $f(n)$, all of the existing blocks are unbuilt, and then the records are redistributed evenly into $s$ blocks. @@ -530,7 +530,7 @@ reconstruction, at the possible cost of increased query performance for sub-linear queries. We'll omit the details of the proof of performance for brevity and streamline some of the original notation (full details can be found in~\cite{overmars83}), but this technique ultimately -results in a data structure with the following performance characterstics, +results in a data structure with the following performance characteristics, \begin{align*} \text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\ \text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\ @@ -585,40 +585,100 @@ doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup \text{unbuild}(\mathscr{I}_1) \cup \text{unbuild}(\mathscr{I}_2)\right)$ and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$. +This technique is called a \emph{binary decomposition} of the data +structure. Considering a BSM dynamization of a structure containing $n$ +records, labeling each block with a $0$ if it is empty and a $1$ if it +is full will result in the binary representation of $n$. For example, +the final state of the structure in Figure~\ref{fig:bsm-example} contains +$12$ records, and the labeling procedure will result in $0\text{b}1100$, +which is $12$ in binary. Inserts affect this representation of the +structure in the same way that incrementing the binary number by $1$ does. + +By applying BSM to a data structure, a dynamized structure can be created +with the following performance characteristics, +\begin{align*} +\text{Amortized Insertion Cost:}&\quad \Theta\left(\left(\frac{C(n)}{n}\cdot \log_2 n\right)\right) \\ +\text{Worst Case Insertion Cost:}&\quad \Theta\left(C(n)\right) \\ +\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\ +\end{align*} +This is a particularly attractive result because, for example, a data +structure having $C(n) \in \Theta(n)$ will have an amortized insertion +cost of $\log_2 (n)$, which is quite reasonable. The cost is an extra +logarithmic multiple attached to the query complexity. It is also worth +noting that the worst-case insertion cost remains the same as global +reconstruction, but this case arises only very rarely. If you consider the +binary decomposition representation, the worst-case behavior is triggered +each time the existing number overflows, and a new digit must be added. + +\subsection{Delete Support} + +Classical dynamization techniques have also been developed with +support for deleting records. 
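+Before turning to the details of how deletes interact with these
+schemes, it may help to see the insertion procedure described above in
+code. The fragment below is a rough sketch only (Python, with a plain
+sorted list standing in for an arbitrary static structure; the names
+are illustrative): an insert scans for the first empty level, unbuilding
+each full level it passes, and rebuilds that empty level from the
+collected records plus the new one.
+\begin{verbatim}
+import bisect
+
+class BentleySaxe:
+    """Binary decomposition: level i is empty (None) or holds 2^i records."""
+    def __init__(self):
+        self.levels = []
+
+    def insert(self, record):
+        carry = [record]                        # records to rebuild with
+        for i, block in enumerate(self.levels):
+            if block is None:
+                self.levels[i] = sorted(carry)  # build() first empty level
+                return
+            carry.extend(block)                 # unbuild() each full level...
+            self.levels[i] = None               # ...and leave it empty
+        self.levels.append(sorted(carry))       # worst case: new largest level
+
+    def range_query(self, lo, hi):
+        # a decomposable search problem: per-level results merged by union
+        result = []
+        for block in self.levels:
+            if block is not None:
+                result.extend(block[bisect.bisect_left(block, lo):
+                                    bisect.bisect_right(block, hi)])
+        return result
+\end{verbatim}
+Each insert manipulates the levels exactly as incrementing a binary
+counter manipulates its bits, which is why the worst case (every level
+full) forces a complete rebuild. As the discussion below shows, deletes
+do not admit such a tidy correspondence.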
In general, the same technique of global +reconstruction that was used for inserting records can also be used to +delete them. Given a record $r \in \mathcal{D}$ and a data structure +$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be +deleted from the structure in $C(n)$ time as follows, +\begin{equation*} +\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\}) +\end{equation*} +However, supporting deletes within the dynamization schemes discussed +above is more complicated. The core problem is that inserts affect the +dynamized structure in a deterministic way, and as a result certain +partitioning schemes can be leveraged to reason about the +performance. But, deletes do not work like this. +\begin{figure} +\caption{A Bentley-Saxe dynamization for the integers on the +interval $[1, 100]$.} +\label{fig:bsm-delete-example} +\end{figure} - - - - - - - - +For example, consider a Bentley-Saxe dynamization that contains all +integers on the interval $[1, 100]$, inserted in that order, shown in +Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the +records from this structure, one at a time, using global reconstruction. +This presents several problems, +\begin{itemize} + \item For each record, we need to identify which block it is in before + we can delete it. + \item The cost of performing a delete is a function of which block the + record is in, which is a question of distribution and not easily + controlled. + \item As records are deleted, the structure will potentially violate + the invariants of the decomposition scheme used, which will + require additional work to fix. +\end{itemize} \section{Limitations of Classical Dynamization Techniques} \label{sec:bsm-limits} -While fairly general, the Bentley-Saxe method has a number of -limitations. Because of the way in which it merges query results together, -the number of search problems to which it can be efficiently applied is -limited. Additionally, the method does not expose any trade-off space -to configure the structure: it is one-size fits all. +While fairly general, these dynamization techniques have a number of +limitations that prevent them from being directly usable as a general +solution to the problem of creating database indices. Because of the +requirement that the query being answered be decomposable, many search +problems cannot be addressed--or at least efficiently addressed, by +decomposition-based dynamization. The techniques also do nothing to reduce +the worst-case insertion cost, resulting in extremely poor tail latency +performance relative to hand-built dynamic structures. Finally, these +approaches do not do a good job of exposing the underlying configuration +space to the user, meaning that the user can exert limited control on the +performance of the dynamized data structure. This section will discuss +these limitations, and the rest of the document will be dedicated to +proposing solutions to them. \subsection{Limits of Decomposability} \label{ssec:decomp-limits} -Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe -method has a few significant limitations that must first be overcome, -before it can be used for the purposes of this work. 
At a high level, these limitations -are as follows, +Unfortunately, the DSP abstraction used as the basis of classical +dynamization techniques has a few significant limitations that restrict +their applicability, \begin{itemize} - \item Each local query must be oblivious to the state of every partition, - aside from the one it is directly running against. Further, - Bentley-Saxe provides no facility for accessing cross-block state - or performing multiple query passes against each partition. + \item The query must be broadcast identically to each block and cannot + be adjusted based on the state of the other blocks. + + \item The query process is done in one pass--it cannot be repeated. \item The result merge operation must be $O(1)$ to maintain good query performance. @@ -633,7 +693,7 @@ range sampling are not decomposable. \subsubsection{k-Nearest Neighbor} \label{sssec-decomp-limits-knn} -The k-nearest neighbor (KNN) problem is a generalization of the nearest +The k-nearest neighbor (k-NN) problem is a generalization of the nearest neighbor problem, which seeks to return the closest point within the dataset to a given query point. More formally, this can be defined as, \begin{definition}[Nearest Neighbor] @@ -667,11 +727,11 @@ In practice, it is common to require $f(x, y)$ be a metric,\footnote manipulations are usually required to make similarity measures work in metric-based indexes. \cite{intro-analysis} } -and this will be done in the examples of indexes for addressing +and this will be done in the examples of indices for addressing this problem in this work, but it is not a fundamental aspect of the problem -formulation. The nearest neighbor problem itself is decomposable, with -a simple merge function that accepts the result with the smallest value -of $f(x, q)$ for any two inputs\cite{saxe79}. +formulation. The nearest neighbor problem itself is decomposable, +with a simple merge function that accepts the result with the smallest +value of $f(x, q)$ for any two inputs\cite{saxe79}. The k-nearest neighbor problem generalizes nearest-neighbor to return the $k$ nearest elements, @@ -688,15 +748,15 @@ the $k$ nearest elements, This can be thought of as solving the nearest-neighbor problem $k$ times, each time removing the returned result from $D$ prior to solving the problem again. Unlike the single nearest-neighbor case (which can be -thought of as KNN with $k=1$), this problem is \emph{not} decomposable. +thought of as k-NN with $k=1$), this problem is \emph{not} decomposable. \begin{theorem} - KNN is not a decomposable search problem. + k-NN is not a decomposable search problem. \end{theorem} \begin{proof} To prove this, consider the query $KNN(D, q, k)$ against some partitioned -dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If KNN is decomposable, +dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If k-NN is decomposable, then there must exist some constant-time, commutative, and associative binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq l} R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q, @@ -704,32 +764,32 @@ k)$. Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| = |R_j| = k$, and that the contents of $R$ must be the $k$ records from $R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the -problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$ -time. Therefore, KNN is not a decomposable search problem. 
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$ +time. Therefore, k-NN is not a decomposable search problem. \end{proof} With that said, it is clear that there isn't any fundamental restriction -preventing the merging of the result sets; -it is only the case that an +preventing the merging of the result sets; it is only the case that an arbitrary performance requirement wouldn't be satisfied. It is possible -to merge the result sets in non-constant time, and so it is the case that -KNN is $C(n)$-decomposable. Unfortunately, this classification brings with -it a reduction in query performance as a result of the way result merges are -performed in Bentley-Saxe. - -As a concrete example of these costs, consider using Bentley-Saxe to -extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of -answering KNN queries in $KNN(D, q, k) \in O(k \log n)$. One possible -merge algorithm for KNN would be to push all of the elements in the two -arguments onto a min-heap, and then pop off the first $k$. In this case, -the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed -to be constant, then the operation could be considered to be constant-time. -But given that $k$ is only bounded in size above -by $n$, this isn't a safe assumption to make in general. Evaluating the -total query cost for the extended structure, this would yield, +to merge the result sets in non-constant time, and so it is the case +that k-NN is $C(n)$-decomposable. Unfortunately, this classification +brings with it a reduction in query performance as a result of the way +result merges are performed. + +As a concrete example of these costs, consider using the Bentley-Saxe +method to extend the VPTree~\cite{vptree}. The VPTree is a static, +metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k +\log n)$. One possible merge algorithm for k-NN would be to push all +of the elements in the two arguments onto a min-heap, and then pop off +the first $k$. In this case, the cost of the merge operation would be +$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation +could be considered to be constant-time. But given that $k$ is only +bounded in size above by $n$, this isn't a safe assumption to make in +general. Evaluating the total query cost for the extended structure, +this would yield, \begin{equation} - KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) + k-NN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) \end{equation} The reason for this large increase in cost is the repeated application @@ -746,16 +806,16 @@ operation in the general case.\footnote { large to consume the logarithmic factor, and so it doesn't represent a special case with better performance. } -If the result merging operation could be revised to remove this -duplicated cost, the cost of supporting $C(n)$-decomposable queries -could be greatly reduced. +If we could revise the result merging operation to remove this duplicated +cost, we could greatly reduce the cost of supporting $C(n)$-decomposable +queries. \subsubsection{Independent Range Sampling} Another problem that is not decomposable is independent sampling. There are a variety of problems falling under this umbrella, including weighted set sampling, simple random sampling, and weighted independent range -sampling, but this section will focus on independent range sampling. +sampling, but we will focus on independent range sampling here. 
\begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query @@ -765,14 +825,13 @@ sampling, but this section will focus on independent range sampling. \end{definition} This problem immediately encounters a category error when considering -whether it is decomposable: the result set is randomized, whereas the -conditions for decomposability are defined in terms of an exact matching -of records in result sets. To work around this, a slight abuse of definition -is in order: -assume that the equality conditions within the DSP definition can -be interpreted to mean ``the contents in the two sets are drawn from the -same distribution''. This enables the category of DSP to apply to this type -of problem. More formally, +whether it is decomposable: the result set is randomized, whereas +the conditions for decomposability are defined in terms of an exact +matching of records in result sets. To work around this, a slight abuse +of definition is in order: assume that the equality conditions within +the DSP definition can be interpreted to mean ``the contents in the two +sets are drawn from the same distribution''. This enables the category +of DSP to apply to this type of problem. More formally, \begin{definition}[Decomposable Sampling Problem] A sampling problem $F: (D, Q) \to R$, $F$ is decomposable if and only if there exists a constant-time computable, associative, and @@ -782,13 +841,14 @@ of problem. More formally, \end{equation*} \end{definition} -Even with this abuse, however, IRS cannot generally be considered decomposable; -it is at best $C(n)$-decomposable. The reason for this is that matching the -distribution requires drawing the appropriate number of samples from each each -partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots = -|D_\ell|$, the number of samples from each partition that must appear in the -result set cannot be known in advance due to differences in the selectivity -of the predicate across the partitions. +Even with this abuse, however, IRS cannot generally be considered +decomposable; it is at best $C(n)$-decomposable. The reason for this is +that matching the distribution requires drawing the appropriate number +of samples from each each partition of the data. Even in the special +case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples +from each partition that must appear in the result set cannot be known +in advance due to differences in the selectivity of the predicate across +the partitions. \begin{example}[IRS Sampling Difficulties] @@ -818,11 +878,15 @@ is given to $3$ than exists within the base dataset. This can be worked around by sampling a full $k$ records from each partition, returning both the sample and the number of records satisfying the predicate as that partition's query result, and then performing another pass of IRS as the merge operator, but this -is the same approach as was used for KNN above. This leaves IRS firmly in the +is the same approach as was used for k-NN above. This leaves IRS firmly in the $C(n)$-decomposable camp. If it were possible to pre-calculate the number of samples to draw from each partition, then a constant-time merge operation could be used. +\subsection{Insertion Tail Latency} + +\subsection{Configurability} + \section{Conclusion} This chapter discussed the necessary background information pertaining to queries and search problems, indexes, and techniques for dynamic extension. 
It diff --git a/chapters/background.tex.bak b/chapters/background.tex.bak index d57b370..78f4a30 100644 --- a/chapters/background.tex.bak +++ b/chapters/background.tex.bak @@ -1,315 +1,156 @@ \chapter{Background} - -This chapter will introduce important background information that -will be used throughput the remainder of the document. We'll first -define precisely what is meant by a query, and consider some special -classes of query that will become relevant in our discussion of dynamic -extension. We'll then consider the difference between a static and a -dynamic structure, and techniques for converting static structures into -dynamic ones in a variety of circumstances. - -\section{Database Indexes} - -The term \emph{index} is often abused within the database community -to refer to a range of closely related, but distinct, conceptual -categories\footnote{ -The word index can be used to refer to a structure mapping record -information to the set of records matching that information, as a -general synonym for ``data structure'', to data structures used -specifically in query processing, etc. -}. -This ambiguity is rarely problematic, as the subtle differences -between these categories are not often significant, and context -clarifies the intended meaning in situtations where they are. -However, this work explicitly operates at the interface of two of -these categories, and so it is important to disambiguiate between -them. As a result, we will be using the word index to -refer to a very specific structure - -\subsection{The Traditional Index} -A database index is a specialized structure which provides a means -to efficiently locate records that satisfy specific criteria. This -enables more efficient query processing for support queries. A -traditional database index can be modeled as a function, mapping a -set of attribute values, called a key, $\mathcal{K}$, to a set of -record identifiers, $\mathcal{R}$. Technically, the codomain of an -index can be either a record identifier, a set of record identifiers, -or the physical record itself, depending upon the configuration of -the index. For the purposes of this work, the focus will be on the -first of these, but in principle any of the three index types could -be used with little material difference to the discussion. - -Formally speaking, we will use the following definition of a traditional -database index, -\begin{definition}[Traditional Index] -Consider a set of database records, $\mathcal{D}$. An index over -these records, $\mathcal{I}_\mathcal{D}$ is a map of the form -$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where -$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, -called a \emph{key}. -\end{definition} - -In order to facilitate this mapping, indexes are built using data -structures. The specific data structure used has particular -implications about the performance of the index, and the situations -in which the index is effectively. Broadly speaking, traditional -database indexes can be categorized in two ways: ordered indexes -and unordered indexes. The former of these allows for iteration -over the set of record identifiers in some sorted order, starting -at the returned record. The latter allows for point-lookups only. - -There is a very small set of data structures that are usually used -for creating database indexes. 
The most common range index in RDBMSs -is the B-tree\footnote{ By \emph{B-tree} here, I am referring not -to the B-tree datastructure, but to a wide range of related structures -derived from the B-tree. Examples include the B$^+$-tree, -B$^\epsilon$-tree, etc. } based index, and key-value stores commonly -use indices built on the LSM-tree. Some databases support unordered -indexes using hashtables. Beyond these, some specialized databases or -database extensions have support for indexes based on other structures, -such as the R-tree\footnote{ -Like the B-tree, R-tree here is used as a signifier for a general class -of related data structures} for spatial databases or approximate small -world graph models for similarity search. - -\subsection{The Generalized Index} - -The previous section discussed the traditional definition of index -as might be found in a database systems textbook. However, this -definition is limited by its association specifically with mapping -key fields to records. For the purposes of this work, I will be -considering a slightly broader definition of index, - -\begin{definition}[Generalized Index] -Consider a set of database records, $\mathcal{D}$ and a search -problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$ -is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to -\mathcal{R})$. -\end{definition} - -\emph{Search problems} are the topic of the next section, but in -brief a search problem represents a general class of query, such -as range scan, point lookup, k-nearest neightbor, etc. A traditional -index is a special case of a generalized index, having $\mathcal{Q}$ -being a point-lookup or range query based on a set of record -attributes. - -\subsection{Indices in Query Processing} - -A database management system utilizes indices to accelerate certain -types of query. Queries are expressed to the system in some high -level language, such as SQL or Datalog. These are generalized -languages capable of expressing a wide range of possible queries. -The DBMS is then responsible for converting these queries into a -set of primitive data access procedures that are supported by the -underlying storage engine. There are a variety of techniques for -this, including mapping directly to a tree of relational algebra -operators and interpretting that tree, query compilation, etc. But, -ultimately, the expressiveness of this internal query representation -is limited by the routines supported by the storage engine. - -As an example, consider the following SQL query (representing a -2-dimensional k-nearest neighbor)\footnote{There are more efficient -ways of answering this query, but I'm aiming for simplicity here -to demonstrate my point}, - -\begin{verbatim} -SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A - WHERE A.property = filtering_criterion - ORDER BY d - LIMIT 5; -\end{verbatim} - -This query will be translated into a logical query plan (a sequence -of relational algebra operators) by the query planner, which could -result in a plan like this, - -\begin{verbatim} -query plan here -\end{verbatim} - -With this logical query plan, the DBMS will next need to determine -which supported operations it can use to most efficiently answer -this query. For example, the selection operation (A) could be -physically manifested as a table scan, or could be answered using -an index scan if there is an ordered index over \texttt{A.property}. -The query optimizer will make this decision based on its estimate -of the selectivity of the predicate. 
This may result in one of the -following physical query plans - -\begin{verbatim} -physical query plan -\end{verbatim} - -In either case, however, the space of possible physical plans is -limited by the available access methods: either a sorted scan on -an attribute (index) or an unsorted scan (table scan). The database -must filter for all elements matching the filtering criterion, -calculate the distances between all of these points and the query, -and then sort the results to get the final answer. Additionally, -note that the sort operation in the plan is a pipeline-breaker. If -this plan were to appear as a subtree in a larger query plan, the -overall plan would need to wait for the full evaluation of this -sub-query before it could proceed, as sorting requires the full -result set. - -Imagine a world where a new index was available to our DBMS: a -nearest neighbor index. This index would allow the iteration over -records in sorted order, relative to some predefined metric and a -query point. If such an index existed over \texttt{(A.x, A.y)} using -\texttt{dist}, then a third physical plan would be available to the DBMS, - -\begin{verbatim} -\end{verbatim} - -This plan pulls records in order of their distance to \texttt{Q} -directly, using an index, and then filters them, avoiding the -pipeline breaking sort operation. While it's not obvious in this -case that this new plan is superior (this would depend a lot on the -selectivity of the predicate), it is a third option. It becomes -increasingly superior as the selectivity of the predicate grows, -and is clearly superior in the case where the predicate has unit -selectivity (requiring only the consideration of $5$ records total). -The construction of this special index will be considered in -Section~\ref{ssec:knn}. - -This use of query-specific indexing schemes also presents a query -planning challenge: how does the database know when a particular -specialized index can be used for a given query, and how can -specialized indexes broadcast their capabilities to the query planner -in a general fashion? This work is focused on the problem of enabling -the existence of such indexes, rather than facilitating their use, -however these are important questions that must be considered in -future work for this solution to be viable. There has been work -done surrounding the use of arbtrary indexes in queries in the past, -such as~\cite{byods-datalog}. This problem is considered out-of-scope -for the proposed work, but will be considered in the future. +\label{chap:background} + +This chapter will introduce important background information and +existing work in the area of data structure dynamization. We will +first discuss the concept of a search problem, which is central to +dynamization techniques. While one might imagine that restrictions on +dynamization would be functions of the data structure to be dynamized, +in practice the requirements placed on the data structure are quite mild, +and it is the necessary properties of the search problem that the data +structure is used to address that provide the central difficulty to +applying dynamization techniques in a given area. After this, database +indices will be discussed briefly. Indices are the primary use of data +structures within the database context that is of interest to our work. +Following this, existing theoretical results in the area of data structure +dynamization will be discussed, which will serve as the building blocks +for our techniques in subsquent chapters. 
The chapter will conclude with +a discussion of some of the limitations of these existing techniques. \section{Queries and Search Problems} +\label{sec:dsp} + +Data access lies at the core of most database systems. We want to ask +questions of the data, and ideally get the answer efficiently. We +will refer to the different types of question that can be asked as +\emph{search problems}. We will be using this term in a similar way as +the word \emph{query} \footnote{ + The term query is often abused and used to + refer to several related, but slightly different things. In the + vernacular, a query can refer to either a) a general type of search + problem (as in "range query"), b) a specific instance of a search + problem, or c) a program written in a query language. +} +is often used within the database systems literature: to refer to a +general class of questions. For example, we could consider range scans, +point-lookups, nearest neighbor searches, predicate filtering, random +sampling, etc., to each be a general search problem. Formally, for the +purposes of this work, a search problem is defined as follows, -In our discussion of generalized indexes, we encountered \emph{search -problems}. A search problem is a term used within the literature -on data structures in a manner similar to how the database community -sometimes uses the term query\footnote{ -Like with the term index, the term query is often abused and used to -refer to several related, but slightly different things. In the vernacular, -a query can refer to either a) a general type of search problem (as in "range query"), -b) a specific instance of a search problem, or c) a program written in a query language. -}, to refer to a general -class of questions asked of data. Examples include range queries, -point-lookups, nearest neighbor queries, predicate filtering, random -sampling, etc. Formally, for the purposes of this work, we will define -a search problem as follows, \begin{definition}[Search Problem] -Given three multisets, $D$, $R$, and $Q$, a search problem is a function -$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched, -$Q$ represents the domain of query parameters, and $R$ represents the -answer domain. -\footnote{ -It is important to note that it is not required for $R \subseteq D$. As an + Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function + $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, + $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the +answer domain.\footnote{ + It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto -an integer. Most common queries do satisfy $R \subseteq D$, but this need + an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. } \end{definition} -And we will use the word \emph{query} to refer to a specific instance -of a search problem, except when used as part of the generally -accepted name of a search problem (i.e., range query). +We will use the term \emph{query} to mean a specific instance of a search +problem, \begin{definition}[Query] -Given three multisets, $D$, $R$, and $Q$, a search problem $F$ and -a specific set of query parameters $q \in Q$, a query is a specific -instance of the search problem, $F(D, q)$. 
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and + a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific + instance of the search problem, $F(\mathcal{D}, q)$. \end{definition} As an example of using these definitions, a \emph{membership test} -or \emph{range query} would be considered search problems, and a -range query over the interval $[10, 99]$ would be a query. +or \emph{range scan} would be considered search problems, and a range +scan over the interval $[10, 99]$ would be a query. We've drawn this +distinction because, as we'll see as we enter into the discussion of +our work in later chapters, it is useful to have seperate, unambiguous +terms for these two concepts. \subsection{Decomposable Search Problems} -An important subset of search problems is that of decomposable -search problems (DSPs). This class was first defined by Saxe and -Bentley as follows, +Dynamization techniques require the partitioning of one data structure +into several, smaller ones. As a result, these techniques can only +be applied in situations where the search problem to be answered can +be answered from this set of smaller data structures, with the same +answer as would have been obtained had all of the data been used to +construct a single, large structure. This requirement is formalized in +the definition of a class of problems called \emph{decomposable search +problems (DSP)}. This class was first defined by Bentley and Saxe in +their work on dynamization, and we will adopt their definition, \begin{definition}[Decomposable Search Problem~\cite{saxe79}] \label{def:dsp} - Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and - only if there exists a consant-time computable, associative, and - commutative binary operator $\square$ such that, + A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and + only if there exists a constant-time computable, associative, and + commutative binary operator $\mergeop$ such that, \begin{equation*} - F(A \cup B, q) = F(A, q)~ \square ~F(B, q) + F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} \end{definition} -The constant-time requirement was used to prove bounds on the costs of -evaluating DSPs over data broken across multiple partitions. Further work -by Overmars lifted this constraint and considered a more general class -of DSP, +The requirement for $\mergeop$ to be constant-time was used by Bentley and +Saxe to prove specific performance bounds for answering queries from a +decomposed data structure. However, it is not strictly \emph{necessary}, +and later work by Overmars lifted this constraint and considered a more +general class of search problems called \emph{$C(n)$-decomposable search +problems}, + \begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}] - Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable + A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, - and commutative binary operator $\square$ such that, + and commutative binary operator $\mergeop$ such that, \begin{equation*} - F(A \cup B, q) = F(A, q)~ \square ~F(B, q) + F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q) \end{equation*} \end{definition} -Decomposability is an important property because it allows for -search problems to be answered over partitioned datasets. 
The details -of this will be discussed in Section~\ref{ssec:bentley-saxe} in the -context of creating dynamic data structures. Many common types of -search problems appearing in databases are decomposable, such as -range queries or predicate filtering. - -To demonstrate that a search problem is decomposable, it is necessary -to show the existance of the merge operator, $\square$, and to show -that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two -results, simple induction demonstrates that the problem is decomposable -even in cases with more than two partial results. - -As an example, consider range queries, -\begin{definition}[Range Query] -Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval, -$ q = [x, y],\quad x,y \in R$, a range query returns all points in -$D \cap q$. +To demonstrate that a search problem is decomposable, it is necessary to +show the existence of the merge operator, $\mergeop$, with the necessary +properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, +q)$. With these two results, induction demonstrates that the problem is +decomposable even in cases with more than two partial results. + +As an example, consider range scans, +\begin{definition}[Range Count] + Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval, + $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns + the cardinality, $|d \cap q|$. \end{definition} \begin{theorem} -Range Queries are a DSP. +Range Count is a decomposable search problem. \end{theorem} \begin{proof} -Let $\square$ be the set union operator ($\cup$). Applying this to -Definition~\ref{def:dsp}, we have +Let $\mergeop$ be addition ($+$). Applying this to +Definition~\ref{def:dsp}, gives \begin{align*} - (A \cup B) \cap q = (A \cap q) \cup (B \cap q) + |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)| \end{align*} -which is true by the distributive property of set union and -intersection. Assuming an implementation allowing for an $O(1)$ -set union operation, range queries are DSPs. +which is true by the distributive property of union and +intersection. Addition is an associative and commutative +operator that can be calculated in $O(1)$ time. Therefore, range counts +are DSPs. \end{proof} Because the codomain of a DSP is not restricted, more complex output structures can be used to allow for problems that are not directly decomposable to be converted to DSPs, possibly with some minor -post-processing. For example, the calculation of the mean of a set -of numbers can be constructed as a DSP using the following technique, +post-processing. For example, calculating the arithmetic mean of a set +of numbers can be formulated as a DSP, \begin{theorem} -The calculation of the average of a set of numbers is a DSP. +The calculation of the arithmetic mean of a set of numbers is a DSP. \end{theorem} \begin{proof} -Define the search problem as $A:D \to (\mathbb{R}, \mathbb{Z})$, -where $D\subset\mathbb{R}$ and is a multiset. The output tuple + Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, + where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple contains the sum of the values within the input set, and the -cardinality of the input set. Let the $A(D_1) = (s_1, c_1)$ and -$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)\square A(D_2) = (s_1 + -s_2, c_1 + c_2)$. +cardinality of the input set. For two disjoint paritions of the data, +$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. 
Let +$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$. -Applying Definition~\ref{def:dsp}, we have +Applying Definition~\ref{def:dsp}, gives \begin{align*} - A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\ + A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\ (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c) \end{align*} From this result, the average can be determined in constant time by @@ -317,258 +158,741 @@ taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP. \end{proof} -\section{Dynamic Extension Techniques} -Because data in a database is regularly updated, data structures -intended to be used as an index must support updates (inserts, -in-place modification, and deletes) to their data. In principle, -any data structure can support updates to its underlying data through -global reconstruction: adjusting the record set and then rebuilding -the entire structure. Ignoring this trivial (and highly inefficient) -approach, a data structure with support for updates is called -\emph{dynamic}, and one without support for updates is called -\emph{static}. In this section, we discuss approaches for modifying -a static data structure to grant it support for updates, a process -called \emph{dynamic extension} or \emph{dynamization}. A theoretical -survey of this topic can be found in~\cite{overmars83}, but this -work doesn't cover several techniques that are used in practice. -As such, much of this section constitutes our own analysis, tying -together threads from a variety of sources. - -\subsection{Local Reconstruction} - -One way of viewing updates to a data structure is as reconstructing -all or part of the structure. To minimize the cost of the update, -it is ideal to minimize the size of the reconstruction that accompanies -an update, either by careful structuring of the data to ensure -minimal disruption to surrounding records by an update, or by -deferring the reconstructions and amortizing their costs over as -many updates as possible. - -While minimizing the size of a reconstruction seems the most obvious, -and best, approach, it is limited in its applicability. The more -related ``nearby'' records in the structure are, the more records -will be affected by a change. Records can be related in terms of -some ordering of their values, which we'll term a \emph{spatial -ordering}, or in terms of their order of insertion to the structure, -which we'll term a \emph{temporal ordering}. Note that these terms -don't imply anything about the nature of the data, and instead -relate to the principles used by the data structure to arrange them. - -Arrays provide the extreme version of both of these ordering -principles. In an unsorted array, in which records are appended to -the end of the array, there is no spatial ordering dependence between -records. This means that any insert or update will require no local -reconstruction, aside from the record being directly affected.\footnote{ -A delete can also be performed without any structural adjustments -in a variety of ways. Reorganization of the array as a result of -deleted records serves an efficiency purpose, but isn't required -for the correctness of the structure. } However, the order of -records in the array \emph{does} express a strong temporal dependency: -the index of a record in the array provides the exact insertion -order. - -A sorted array provides exactly the opposite situation. The order -of a record in the array reflects an exact spatial ordering of -records with respect to their sorting function. 
This means that an -update or insert will require reordering a large number of records -(potentially all of them, in the worst case). Because of the stronger -spatial dependence of records in the structure, an update will -require a larger-scale reconstruction. Additionally, there is no -temporal component to the ordering of the records: inserting a set -of records into a sorted array will produce the same final structure -irrespective of insertion order. - -It's worth noting that the spatial dependency discussed here, as -it relates to reconstruction costs, is based on the physical layout -of the records and not the logical ordering of them. To exemplify -this, a sorted singly-linked list can maintain the same logical -order of records as a sorted array, but limits the spatial dependce -between records each records preceeding node. This means that an -insert into this structure will require only a single node update, -regardless of where in the structure this insert occurs. - -The amount of spatial dependence in a structure directly reflects -a trade-off between read and write performance. In the above example, -performing a lookup for a given record in a sorted array requires -asymptotically fewer comparisons in the worst case than an unsorted -array, because the spatial dependecies can be exploited for an -accelerated search (binary vs. linear search). Interestingly, this -remains the case for lookups against a sorted array vs. a sorted -linked list. Even though both structures have the same logical order -of records, limited spatial dependecies between nodes in a linked -list forces the lookup to perform a scan anyway. - -A balanced binary tree sits between these two extremes. Like a -linked list, individual nodes have very few connections. However -the nodes are arranged in such a way that a connection existing -between two nodes implies further information about the ordering -of children of those nodes. In this light, rebalancing of the tree -can be seen as maintaining a certain degree of spatial dependence -between the nodes in the tree, ensuring that it is balanced between -the two children of each node. A very general summary of tree -rebalancing techniques can be found in~\cite{overmars83}. Using an -AVL tree~\cite{avl} as a specific example, each insert in the tree -involves adding the new node and updating its parent (like you'd -see in a simple linked list), followed by some larger scale local -reconstruction in the form of tree rotations, to maintain the balance -factor invariant. This means that insertion requires more reconstruction -effort than the single pointer update in the linked list case, but -results in much more efficient searches (which, as it turns out, -makes insertion more efficient in general too, even with the overhead, -because finding the insertion point is much faster). - -\subsection{Amortized Local Reconstruction} - -In addition to control update cost by arranging the structure so -as to reduce the amount of reconstruction necessary to maintain the -desired level of spatial dependence, update costs can also be reduced -by amortizing the local reconstruction cost over multiple updates. -This is often done in one of two ways: leaving gaps or adding -overflow buckets. These gaps and buckets allows for a buffer of -insertion capacity to be sustained by the data structure, before -a reconstruction is triggered. 
- -A classic example of the gap approach is found in the -B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well -as open addressing for hash tables. In a B$^+$-tree, each node has -a fixed size, which must be at least half-utilized (aside from the -root node). The empty spaces within these nodes are gaps, which can -be cheaply filled with new records on insert. Only when a node has -been filled must a local reconstruction (called a structural -modification operation for B-trees) occur to redistribute the data -into multiple nodes and replenish the supply of gaps. This approach -is particularly well suited to data structures in contexts where -the natural unit of storage is larger than a record, as in disk-based -(with 4KiB pages) or cache-optimized (with 64B cachelines) structures. -This gap-based approach was also used to create ALEX, an updatable -learned index~\cite{ALEX}. - -The gap approach has a number of disadvantages. It results in a -somewhat sparse structure, thereby wasting storage. For example, a -B$^+$-tree requires all nodes other than the root to be at least -half full--meaning in the worst case up to half of the space required -by the structure could be taken up by gaps. Additionally, this -scheme results in some inserts being more expensive than others: -most new records will occupy an available gap, but some will trigger -more expensive SMOs. In particular, it has been observed with -B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}: -the gaps in many nodes fill at about the same time, leading to -periodic clusters of high-cost merge operations. - -Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam}, -as well as hash tables with closed addressing. In this approach, -parts of the structure into which records would be inserted (leaf -nodes of ISAM, directory entries in CA hashing) have a pointer to -an overflow location, where newly inserted records can be placed. -This allows for the structure to, theoretically, sustain an unlimited -amount of insertions. However, read performance degrades, because -the more overflow capacity is utilized, the less the records in the -structure are ordered according to the data structure's definition. -Thus, periodically a reconstruction is necessary to distribute the -overflow records into the structure itself. - -\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method} - -Another approach to support updates is to amortize the cost of -global reconstruction over multiple updates. This approach can take -take three forms, -\begin{enumerate} - - \item Pairing a dynamic data structure (called a buffer or - memtable) with an instance of the structure being extended. - Updates are written to the buffer, and when the buffer is - full its records are merged with those in the static - structure, and the structure is rebuilt. This approach is - used by one version of the originally proposed - LSM-tree~\cite{oneil93}. Technically this technique proposed - in that work for the purposes of converting random writes - into sequential ones (all structures involved are dynamic), - but it can be used for dynamization as well. - - \item Creating multiple, smaller data structures each - containing a partition of the records from the dataset, and - reconstructing individual structures to accomodate new - inserts in a systematic manner. This technique is the basis - of the Bentley-Saxe method~\cite{saxe79}. - - \item Using both of the above techniques at once. 
This is - the approach used by modern incarnations of the - LSM~tree~\cite{rocksdb}. - -\end{enumerate} - -In all three cases, it is necessary for the search problem associated -with the index to be a DSP, as answering it will require querying -multiple structures (the buffer and/or one or more instances of the -data structure) and merging the results together to get a final -result. This section will focus exclusively on the Bentley-Saxe -method, as it is the basis for our proposed methodology.p - -When dividing records across multiple structures, there is a clear -trade-off between read performance and write performance. Keeping -the individual structures small reduces the cost of reconstructing, -and thereby increases update performance. However, this also means -that more structures will be required to accommodate the same number -of records, when compared to a scheme that allows the structures -to be larger. As each structure must be queried independently, this -will lead to worse query performance. The reverse is also true, -fewer, larger structures will have better query performance and -worse update performance, with the extreme limit of this being a -single structure that is fully rebuilt on each insert. -\begin{figure} - \caption{Inserting a new record using the Bentley-Saxe method.} - \label{fig:bsm-example} -\end{figure} +\section{Database Indexes} +\label{sec:indexes} + +Within a database system, search problems are expressed using +some high level language (or mapped directly to commands, for +simpler systems like key-value stores), which is processed by +the database system to produce a result. Within many database +systems, the most basic access primitive is a table scan, which +sequentially examines each record within the data set. There are many +situations in which the same query could be answered in less time using +a more sophisticated data access scheme, however, and databases support +a limited number of such schemes through the use of specialized data +structures called \emph{indices} (or indexes). Indices can be built over +a set of attributes in a table and provide faster access for particular +search problems. -The key insight of the Bentley-Saxe method~\cite{saxe79} is that a -good balance can be struck by uses a geometrically increasing -structure size. In Bentley-Saxe, the sub-structures are ``stacked'', -with the bottom level having a capacity of a single record, and -each subsequent level doubling in capacity. When an update is -performed, the first empty level is located and a reconstruction -is triggered, merging the structures of all levels below this empty -one, along with the new record. An example of this process is shown -in Figure~\ref{fig:bsm-example}. The merits of this approach are -that it ensures that ``most'' reconstructions involve the smaller -data structures towards the bottom of the sequence, while most of -the records reside in large, infrequently updated, structures towards -the top. This balances between the read and write implications of -structure size, while also allowing the number of structures required -to represent $n$ records to be worst-case bounded by $O(\log n)$. 
- -Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$ -query cost, the Bentley-Saxe Method will produce a dynamic data -structure with, +The term \emph{index} is often abused within the database community +to refer to a range of closely related, but distinct, conceptual +categories.\footnote{ +The word index can be used to refer to a structure mapping record +information to the set of records matching that information, as a +general synonym for ``data structure'', to data structures used +specifically in query processing, etc. +} +This ambiguity is rarely problematic, as the subtle differences between +these categories are not often significant, and context clarifies the +intended meaning in situations where they are. However, this work +explicitly operates at the interface of two of these categories, and so +it is important to disambiguate between them. + +\subsection{The Classical Index} +A database index is a specialized data structure that provides a means +to efficiently locate records that satisfy specific criteria. This +enables more efficient query processing for supported search problems. A +classical index can be modeled as a function, mapping a set of attribute +values, called a key, $\mathcal{K}$, to a set of record identifiers, +$\mathcal{R}$. The codomain of an index can be either the set of +record identifiers, a set containing sets of record identifiers, or +the set of physical records, depending upon the configuration of the +index.~\cite{cowbook} For our purposes here, we'll focus on the first of +these, but the use of other codmains wouldn't have any material effect +on our discussion. + +We will use the following definition of a "classical" database index, + +\begin{definition}[Classical Index~\cite{cowbook}] +Consider a set of database records, $\mathcal{D}$. An index over +these records, $\mathcal{I}_\mathcal{D}$ is a map of the form + $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where +$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, +called a \emph{key}. +\end{definition} + +In order to facilitate this mapping, indexes are built using data +structures. The selection of data structure has implications on the +performance of the index, and the types of search problem it can be +used to accelerate. Broadly speaking, classical indices can be divided +into two categories: ordered and unordered. Ordered indices allow for +the iteration over a set of record identifiers in a particular sorted +order of keys, and the efficient location of a specific key value in +that order. These indices can be used to accelerate range scans and +point-lookups. Unordered indices are specialized for point-lookups on a +particular key value, and do not support iterating over records in some +order.~\cite{cowbook, mysql-btree-hash} + +There is a very small set of data structures that are usually used for +creating classical indexes. For ordered indices, the most commonly used +data structure is the B-tree~\cite{ubiq-btree},\footnote{ + By \emph{B-tree} here, we are referring not to the B-tree data + structure, but to a wide range of related structures derived from + the B-tree. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc. +} +and the log-structured merge (LSM) tree~\cite{oneil96} is also often +used within the context of key-value stores~\cite{rocksdb}. Some databases +implement unordered indices using hash tables~\cite{mysql-btree-hash}. 
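+To make the distinction concrete, the following fragment (an
+illustrative Python sketch, not drawn from any particular system)
+contrasts the two kinds of classical index, using a sorted list of
+(key, record identifier) pairs as a stand-in for an ordered index and
+a hash map for an unordered one,
+\begin{verbatim}
+import bisect
+
+ordered_index   = sorted([(10, 'rid1'), (25, 'rid2'), (40, 'rid3')])
+unordered_index = {10: 'rid1', 25: 'rid2', 40: 'rid3'}
+
+def point_lookup(key):
+    # both kinds of index support this; the hash map needs no key order
+    return unordered_index.get(key)
+
+def range_scan(lo, hi):
+    # only the ordered index can iterate keys in sorted order
+    i = bisect.bisect_left(ordered_index, (lo, ''))
+    rids = []
+    while i < len(ordered_index) and ordered_index[i][0] <= hi:
+        rids.append(ordered_index[i][1])
+        i += 1
+    return rids
+\end{verbatim}
+The generalized index, discussed next, broadens this picture beyond
+key-based lookups.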
+ + +\subsection{The Generalized Index} + +The previous section discussed the classical definition of index +as might be found in a database systems textbook. However, this +definition is limited by its association specifically with mapping +key fields to records. For the purposes of this work, a broader +definition of index will be considered, + +\begin{definition}[Generalized Index] +Consider a set of database records, $\mathcal{D}$, and search +problem, $\mathcal{Q}$. +A generalized index, $\mathcal{I}_\mathcal{D}$ +is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to +\mathcal{R})$. +\end{definition} + +A classical index is a special case of a generalized index, with $\mathcal{Q}$ +being a point-lookup or range scan based on a set of record attributes. + +There are a number of generalized indexes that appear in some database systems. +For example, some specialized databases or database extensions have support for +indexes based the R-tree\footnote{ Like the B-tree, R-tree here is used as a +signifier for a general class of related data structures.} for spatial +databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world +graphs for similarity search~\cite{pinecone-db}, among others. These systems +are typically either an add-on module, or a specialized standalone database +that has been designed specifically for answering particular types of queries +(such as spatial queries, similarity search, string matching, etc.). + +%\subsection{Indexes in Query Processing} + +%A database management system utilizes indexes to accelerate certain +%types of query. Queries are expressed to the system in some high +%level language, such as SQL or Datalog. These are generalized +%languages capable of expressing a wide range of possible queries. +%The DBMS is then responsible for converting these queries into a +%set of primitive data access procedures that are supported by the +%underlying storage engine. There are a variety of techniques for +%this, including mapping directly to a tree of relational algebra +%operators and interpreting that tree, query compilation, etc. But, +%ultimately, this internal query representation is limited by the routines +%supported by the storage engine.~\cite{cowbook} + +%As an example, consider the following SQL query (representing a +%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient +%ways of answering this query, but I'm aiming for simplicity here +%to demonstrate my point}, +% +%\begin{verbatim} +%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A +% WHERE A.property = filtering_criterion +% ORDER BY d +% LIMIT 5; +%\end{verbatim} +% +%This query will be translated into a logical query plan (a sequence +%of relational algebra operators) by the query planner, which could +%result in a plan like this, +% +%\begin{verbatim} +%query plan here +%\end{verbatim} +% +%With this logical query plan, the DBMS will next need to determine +%which supported operations it can use to most efficiently answer +%this query. For example, the selection operation (A) could be +%physically manifested as a table scan, or could be answered using +%an index scan if there is an ordered index over \texttt{A.property}. +%The query optimizer will make this decision based on its estimate +%of the selectivity of the predicate. 
This may result in one of the +%following physical query plans +% +%\begin{verbatim} +%physical query plan +%\end{verbatim} +% +%In either case, however, the space of possible physical plans is +%limited by the available access methods: either a sorted scan on +%an attribute (index) or an unsorted scan (table scan). The database +%must filter for all elements matching the filtering criterion, +%calculate the distances between all of these points and the query, +%and then sort the results to get the final answer. Additionally, +%note that the sort operation in the plan is a pipeline-breaker. If +%this plan were to appear as a sub-tree in a larger query plan, the +%overall plan would need to wait for the full evaluation of this +%sub-query before it could proceed, as sorting requires the full +%result set. +% +%Imagine a world where a new index was available to the DBMS: a +%nearest neighbor index. This index would allow the iteration over +%records in sorted order, relative to some predefined metric and a +%query point. If such an index existed over \texttt{(A.x, A.y)} using +%\texttt{dist}, then a third physical plan would be available to the DBMS, +% +%\begin{verbatim} +%\end{verbatim} +% +%This plan pulls records in order of their distance to \texttt{Q} +%directly, using an index, and then filters them, avoiding the +%pipeline breaking sort operation. While it's not obvious in this +%case that this new plan is superior (this would depend upon the +%selectivity of the predicate), it is a third option. It becomes +%increasingly superior as the selectivity of the predicate grows, +%and is clearly superior in the case where the predicate has unit +%selectivity (requiring only the consideration of $5$ records total). +% +%This use of query-specific indexing schemes presents a query +%optimization challenge: how does the database know when a particular +%specialized index can be used for a given query, and how can +%specialized indexes broadcast their capabilities to the query optimizer +%in a general fashion? This work is focused on the problem of enabling +%the existence of such indexes, rather than facilitating their use; +%however these are important questions that must be considered in +%future work for this solution to be viable. There has been work +%done surrounding the use of arbitrary indexes in queries in the past, +%such as~\cite{byods-datalog}. This problem is considered out-of-scope +%for the proposed work, but will be considered in the future. + +\section{Classical Dynamization Techniques} + +Because data in a database is regularly updated, data structures +intended to be used as an index must support updates (inserts, in-place +modification, and deletes). Not all potentially useful data structures +support updates, and so a general strategy for adding update support +would increase the number of data structures that could be used as +database indices. We refer to a data structure with update support as +\emph{dynamic}, and one without update support as \emph{static}.\footnote{ + The term static is distinct from immutable. Static refers to the + layout of records within the data structure, whereas immutable + refers to the data stored within those records. This distinction + will become relevant when we discuss different techniques for adding + delete support to data structures. The data structures used are + always static, but not necessarily immutable, because the records may + contain header information (like visibility) that is updated in place. 
+} + +This section discusses \emph{dynamization}, the construction of a dynamic +data structure based on an existing static one. When certain conditions +are satisfied by the data structure and its associated search problem, +this process can be done automatically, and with provable asymptotic +bounds on amortized insertion performance, as well as worst case query +performance. We will first discuss the necessary data structure +requirements, and then examine several classical dynamization techniques. +The section will conclude with a discussion of delete support within the +context of these techniques. + +It is worth noting that there are a variety of techniques +discussed in the literature for dynamizing structures with specific +properties, or under very specific sets of circumstances. Examples +include frameworks for adding update support succinct data +structures~\cite{dynamize-succinct} or taking advantage of batching +of insert and query operations~\cite{batched-decomposable}. This +section discusses techniques that are more general, and don't require +workload-specific assumptions. + + +\subsection{Global Reconstruction} + +The most fundamental dynamization technique is that of \emph{global +reconstruction}. While not particularly useful on its own, global +reconstruction serves as the basis for the techniques to follow, and so +we will begin our discussion of dynamization with it. + +Consider a class of data structure, $\mathcal{I}$, capable of answering a +search problem, $\mathcal{Q}$. Insertion via global reconstruction is +possible if $\mathcal{I}$ supports the following two operations, +\begin{align*} +\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\ +\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D}) +\end{align*} +where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$ +over the data structure over a set of records $d \subseteq \mathcal{D}$ +in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d +\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in +$\Theta(1)$ time,\footnote{ + There isn't any practical reason why $\mathtt{unbuild}$ must run + in constant time, but this is the assumption made in \cite{saxe79} + and in subsequent work based on it, and so we will follow the same + defininition here. +} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$. + +Given this structure, an insert of record $r \in \mathcal{D}$ into a +data structure $\mathscr{I} \in \mathcal{I}$ can be defined by, +\begin{align*} +\mathscr{I}_{i}^\prime = \text{build}(\text{unbuild}(\mathscr{I}_i) \cup \{r\}) +\end{align*} + +It goes without saying that this operation is sub-optimal, as the +insertion cost is $\Theta(C(n))$, and $C(n) \in \Omega(n)$ at best for +most data structures. However, this global reconstruction strategy can +be used as a primitive for more sophisticated techniques that can provide +reasonable performance. + +\subsection{Amortized Global Reconstruction} +\label{ssec:agr} + +The problem with global reconstruction is that each insert must rebuild +the entire data structure, involving all of its records. This results +in a worst-case insert cost of $\Theta(C(n))$. However, opportunities +for improving this scheme can present themselves when considering the +\emph{amortized} insertion cost. + +Consider the cost acrrued by the dynamized structure under global +reconstruction over the lifetime of the structure. 
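
As a concrete, deliberately naive illustration of the
$\mathtt{build}$/$\mathtt{unbuild}$ interface and of insertion via global
reconstruction, consider the sketch below. The \texttt{SortedArray}
structure and its method names are hypothetical stand-ins for any static
structure satisfying the interface; nothing here is drawn from a specific
implementation.

\begin{verbatim}
#include <algorithm>
#include <vector>

// A stand-in static structure: a sorted array with O(n log n) build cost.
struct SortedArray {
    std::vector<int> records;

    // build: constructs the structure from a set of records.
    static SortedArray build(std::vector<int> recs) {
        std::sort(recs.begin(), recs.end());
        return SortedArray{std::move(recs)};
    }

    // unbuild: recovers the record set used to construct the structure.
    std::vector<int> unbuild() const { return records; }
};

// Insertion via global reconstruction: unbuild, add the record, rebuild.
// Every insert pays the full construction cost C(n).
SortedArray insert_global(const SortedArray& s, int r) {
    std::vector<int> recs = s.unbuild();
    recs.push_back(r);
    return SortedArray::build(std::move(recs));
}
\end{verbatim}

Every call to \texttt{insert\_global} rebuilds the entire structure, and
it is exactly this repeated full-cost rebuild that the amortized
accounting below totals up.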
Each insert will result +in all of the existing records being rewritten, so at worst each record +will be involved in $\Theta(n)$ reconstructions, each reconstruction +having $\Theta(C(n))$ cost. We can amortize this cost over the $n$ records +inserted to get an amortized insertion cost for global reconstruction of, + +\begin{equation*} +I_a(n) = \frac{C(n) \cdot n}{n} = C(n) +\end{equation*} + +This doesn't improve things as is, however it does present two +opportunities for improvement. If we could either reduce the size of +the reconstructions, or the number of times a record is reconstructed, +then we could reduce the amortized insertion cost. + +The key insight, first discussed by Bentley and Saxe, is that +this goal can be accomplished by \emph{decomposing} the data +structure into multiple, smaller structures, each built from a +disjoint partition of the data. As long as the search problem +being considered is decomposable, queries can be answered from +this structure with bounded worst-case overhead, and the amortized +insertion cost can be improved~\cite{saxe79}. Significant theoretical +work exists in evaluating different strategies for decomposing the +data structure~\cite{saxe79, overmars81, overmars83} and for leveraging +specific efficiencies of the data structures being considered to improve +these reconstructions~\cite{merge-dsp}. + +There are two general decomposition techniques that emerged from this +work. The earliest of these is the logarithmic method, often called +the Bentley-Saxe method in modern literature, and is the most commonly +discussed technique today. A later technique, the equal block method, +was also examined. It is generally not as effective as the Bentley-Saxe +method, but it has some useful properties for explainatory purposes and +so will be discussed here as well. + +\subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}} +\label{ssec:ebm} + +Though chronologically later, the equal block method is theoretically a +bit simpler, and so we will begin our discussion of decomposition-based +technique for dynamization of decomposable search problems with it. The +core concept of the equal block method is to decompose the data structure +into several smaller data structures, called blocks, over partitions +of the data. This decomposition is performed such that each block is of +roughly equal size. + +Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves +some decomposable search problem, $F$ and is built over a set of records +$d \in \mathcal{D}$. This structure can be decomposed into $s$ blocks, +$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over +partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value +makes little sense when the number of records changes, and so it is taken +to be governed by a smooth, monotonically increasing function $f(n)$ such +that, at any point, the following two constraints are obeyed. \begin{align} - \text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\ - \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right) + f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\ + \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{i} \label{ebm-c2} \end{align} +where $|\mathscr{I}_j|$ is the number of records in the block, +$|\text{unbuild}(\mathscr{I}_j)|$. + +A new record is inserted by finding the smallest block and rebuilding it +using the new record. 
If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$, +then an insert is done by, +\begin{equation*} +\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\}) +\end{equation*} +Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{ + Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be + violated by deletes. We're omitting deletes from the discussion at + this point, but will circle back to them in Section~\ref{sec:deletes}. +} In this case, the constraints are enforced by "reconfiguring" the +structure. $s$ is updated to be exactly $f(n)$, all of the existing +blocks are unbuilt, and then the records are redistributed evenly into +$s$ blocks. + +A query with parameters $q$ is answered by this structure by individually +querying the blocks, and merging the local results together with $\mergeop$, +\begin{equation*} +F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q) +\end{equation*} +where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to +answering the query over $d$ using the data structure $\mathscr{I}$. + +This technique provides better amortized performance bounds than global +reconstruction, at the possible cost of increased query performance for +sub-linear queries. We'll omit the details of the proof of performance +for brevity and streamline some of the original notation (full details +can be found in~\cite{overmars83}), but this technique ultimately +results in a data structure with the following performance characterstics, +\begin{align*} +\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\ +\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\ +\end{align*} +where $C(n)$ is the cost of statically building $\mathcal{I}$, and +$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$. + +%TODO: example? + + +\subsection{The Bentley-Saxe Method~\cite{saxe79}} +\label{ssec:bsm} + +%FIXME: switch this section (and maybe the previous?) over to being +% indexed at 0 instead of 1 + +The original, and most frequently used, dynamization technique is the +Bentley-Saxe Method (BSM), also called the logarithmic method in older +literature. Rather than breaking the data structure into equally sized +blocks, BSM decomposes the structure into logarithmically many blocks +of exponentially increasing size. More specifically, the data structure +is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_1, +\mathscr{I}_2, \ldots, \mathscr{I}_h$. A given block $\mathscr{I}_i$ +will be either empty, or contain exactly $2^i$ records within it. + +The procedure for inserting a record, $r \in \mathcal{D}$, into +a BSM dynamization is as follows. If the block $\mathscr{I}_0$ +is empty, then $\mathscr{I}_0 = \text{build}{\{r\}}$. If it is not +empty, then there will exist a maximal sequence of non-empty blocks +$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq +0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case, +$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i +\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through +$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the +end of the structure as needed. 
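
The insertion procedure just described can be sketched directly in
code. The fragment below is a minimal illustration of the Bentley-Saxe
insert over the same hypothetical build/unbuild interface used in the
earlier sketch; the \texttt{BentleySaxe} class name, the block layout as a
vector of optionals, and the integer record type are simplifications of
our own, not details of any particular implementation.

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical static structure with build/unbuild (see earlier sketch).
struct SortedArray {
    std::vector<int> records;
    static SortedArray build(std::vector<int> recs) {
        std::sort(recs.begin(), recs.end());
        return SortedArray{std::move(recs)};
    }
    std::vector<int> unbuild() const { return records; }
};

// Bentley-Saxe dynamization: block i is either empty or holds 2^i records.
class BentleySaxe {
public:
    void insert(int r) {
        std::vector<int> recs{r};
        std::size_t i = 0;
        // Find the first empty block, gathering (and emptying) the full
        // blocks below it -- the "carry" of the binary-counter analogy.
        while (i < blocks_.size() && blocks_[i].has_value()) {
            auto lower = blocks_[i]->unbuild();
            recs.insert(recs.end(), lower.begin(), lower.end());
            blocks_[i].reset();
            ++i;
        }
        if (i == blocks_.size()) blocks_.emplace_back();  // grow by a level
        blocks_[i] = SortedArray::build(std::move(recs));
    }

private:
    std::vector<std::optional<SortedArray>> blocks_;
};
\end{verbatim}

Answering a query against such a dynamization then amounts to querying
every non-empty block and combining the partial results with the merge
operator $\mergeop$.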
+ +%FIXME: switch the x's to r's for consistency +\begin{figure} +\centering +\includegraphics[width=.8\textwidth]{diag/bsm.pdf} +\caption{An illustration of inserts into the Bentley-Saxe Method} +\label{fig:bsm-example} +\end{figure} -However, the method has poor worst-case insertion cost: if the -entire structure is full, it must grow by another level, requiring -a full reconstruction involving every record within the structure. -A slight adjustment to the technique, due to Overmars and van Leuwen -\cite{}, allows for the worst-case insertion cost to be bounded by -$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing -each reconstruction into small pieces, one of which is executed -each time a new update occurs. This has the effect of bounding the -worst-case performance, but does so by sacrificing the expected -case performance, and adds a lot of complexity to the method. This -technique is not used much in practice.\footnote{ - I've yet to find any example of it used in a journal article - or conference paper. -} +Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The +dynamization is built over a set of records $x_1, x_2, \ldots, +x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in +$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly +into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the +first empty block is $\mathscr{I}_2$, and so the insert is performed by +doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup +\text{unbuild}(\mathscr{I}_1) \cup \text{unbuild}(\mathscr{I}_2)\right)$ +and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$. + +This technique is called a \emph{binary decomposition} of the data +structure. Considering a BSM dynamization of a structure containing $n$ +records, labeling each block with a $0$ if it is empty and a $1$ if it +is full will result in the binary representation of $n$. For example, +the final state of the structure in Figure~\ref{fig:bsm-example} contains +$12$ records, and the labeling procedure will result in $0\text{b}1100$, +which is $12$ in binary. Inserts affect this representation of the +structure in the same way that incrementing the binary number by $1$ does. + +By applying BSM to a data structure, a dynamized structure can be created +with the following performance characteristics, +\begin{align*} +\text{Amortized Insertion Cost:}&\quad \Theta\left(\left(\frac{C(n)}{n}\cdot \log_2 n\right)\right) \\ +\text{Worst Case Insertion Cost:}&\quad \Theta\left(C(n)\right) \\ +\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\ +\end{align*} +This is a particularly attractive result because, for example, a data +structure having $C(n) \in \Theta(n)$ will have an amortized insertion +cost of $\log_2 (n)$, which is quite reasonable. The cost is an extra +logarithmic multiple attached to the query complexity. It is also worth +noting that the worst-case insertion cost remains the same as global +reconstruction, but this case arises only very rarely. If you consider the +binary decomposition representation, the worst-case behavior is triggered +each time the existing number overflows, and a new digit must be added. + +\subsection{Delete Support} + +Classical dynamization techniques have also been developed with +support for deleting records. In general, the same technique of global +reconstruction that was used for inserting records can also be used to +delete them. 
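
Concretely, such a delete can be sketched as a rebuild that simply filters
out the record being removed. The fragment below reuses the hypothetical
\texttt{SortedArray} interface from the earlier sketches and mirrors the
formal definition given next; it is an illustration only, not a
prescribed implementation.

\begin{verbatim}
#include <algorithm>
#include <vector>

// Hypothetical static structure with build/unbuild (see earlier sketches).
struct SortedArray {
    std::vector<int> records;
    static SortedArray build(std::vector<int> recs) {
        std::sort(recs.begin(), recs.end());
        return SortedArray{std::move(recs)};
    }
    std::vector<int> unbuild() const { return records; }
};

// Delete via global reconstruction: unbuild, remove one copy of r, rebuild.
// Like insertion by rebuild, this costs the full construction time C(n).
SortedArray delete_global(const SortedArray& s, int r) {
    std::vector<int> recs = s.unbuild();
    auto it = std::find(recs.begin(), recs.end(), r);
    if (it != recs.end()) recs.erase(it);
    return SortedArray::build(std::move(recs));
}
\end{verbatim}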
Given a record $r \in \mathcal{D}$ and a data structure +$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be +deleted from the structure in $C(n)$ time as follows, +\begin{equation*} +\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\}) +\end{equation*} +However, supporting deletes within the dynamization schemes discussed +above is more complicated. The core problem is that inserts affect the +dynamized structure in a deterministic way, and as a result certain +partionining schemes can be leveraged to reason about the +performance. But, deletes do not work like this. + +\begin{figure} +\caption{A Bentley-Saxe dynamization for the integers on the +interval $[1, 100]$.} +\label{fig:bsm-delete-example} +\end{figure} +For example, consider a Bentley-Saxe dynamization that contains all +integers on the interval $[1, 100]$, inserted in that order, shown in +Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the +records from this structure, one at a time, using global reconstruction. +This presents several problems, +\begin{itemize} + \item For each record, we need to identify which block it is in before + we can delete it. + \item The cost of performing a delete is a function of which block the + record is in, which is a question of distribution and not easily + controlled. + \item As records are deleted, the structure will potentially violate + the invariants of the decomposition scheme used, which will + require additional work to fix. +\end{itemize} + + + +\section{Limitations of Classical Dynamization Techniques} +\label{sec:bsm-limits} + +While fairly general, these dynamization techniques have a number of +limitations that prevent them from being directly usable as a general +solution to the problem of creating database indices. Because of the +requirement that the query being answered be decomposable, many search +problems cannot be addressed--or at least efficiently addressed, by +decomposition-based dynamization. The techniques also do nothing to reduce +the worst-case insertion cost, resulting in extremely poor tail latency +performance relative to hand-built dynamic structures. Finally, these +approaches do not do a good job of exposing the underlying configuration +space to the user, meaning that the user can exert limited control on the +performance of the dynamized data structure. This section will discuss +these limitations, and the rest of the document will be dedicated to +proposing solutions to them. + +\subsection{Limits of Decomposability} +\label{ssec:decomp-limits} +Unfortunately, the DSP abstraction used as the basis of classical +dynamization techniques has a few significant limitations that restrict +their applicability, + +\begin{itemize} + \item The query must be broadcast identically to each block and cannot + be adjusted based on the state of the other blocks. + + \item The query process is done in one pass--it cannot be repeated. + + \item The result merge operation must be $O(1)$ to maintain good query + performance. + + \item The result merge operation must be commutative and associative, + and is called repeatedly to merge pairs of results. +\end{itemize} + +These requirements restrict the types of queries that can be supported by +the method efficiently. For example, k-nearest neighbor and independent +range sampling are not decomposable. 
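
To see what these requirements look like in practice, the sketch below
dispatches a decomposable query over the blocks of a decomposition,
using a simple range count with integer addition as the constant-time
merge operator $\mergeop$. The block representation and function names
are illustrative assumptions only; the point is the shape of the
dispatch, not the particular structure. The two search problems examined
next fail precisely because no such constant-time merge exists for them.

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <vector>

// Each block is a sorted run of keys (a stand-in for any static structure).
using Block = std::vector<int>;

// Local query: how many keys in this (sorted) block fall within [low, high]?
std::size_t range_count(const Block& b, int low, int high) {
    auto lo = std::lower_bound(b.begin(), b.end(), low);
    auto hi = std::upper_bound(b.begin(), b.end(), high);
    return static_cast<std::size_t>(hi - lo);
}

// Decomposable dispatch: the same query is broadcast to every block and
// the partial results are folded with a constant-time, commutative,
// associative merge operator -- here, integer addition.
std::size_t query_decomposed(const std::vector<Block>& blocks,
                             int low, int high) {
    std::size_t result = 0;
    for (const Block& b : blocks) {
        result += range_count(b, low, high);  // merge: +
    }
    return result;
}
\end{verbatim}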
+ +\subsubsection{k-Nearest Neighbor} +\label{sssec-decomp-limits-knn} +The k-nearest neighbor (k-NN) problem is a generalization of the nearest +neighbor problem, which seeks to return the closest point within the +dataset to a given query point. More formally, this can be defined as, +\begin{definition}[Nearest Neighbor] + + Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ + be some function $f: D^2 \to \mathbb{R}^+$ representing the distance + between two points within $D$. The nearest neighbor problem, $NN(D, + q)$ returns some $d \in D$ having $\min_{d \in D} \{f(d, q)\}$ + for some query point, $q \in \mathbb{R}^d$. +\end{definition} +In practice, it is common to require $f(x, y)$ be a metric,\footnote +{ + Contrary to its vernacular usage as a synonym for ``distance'', a + metric is more formally defined as a valid distance function over + a metric space. Metric spaces require their distance functions to + have the following properties, + \begin{itemize} + \item The distance between a point and itself is always 0. + \item All distances between non-equal points must be positive. + \item For all points, $x, y \in D$, it is true that + $f(x, y) = f(y, x)$. + \item For any three points $x, y, z \in D$ it is true that + $f(x, z) \leq f(x, y) + f(y, z)$. + \end{itemize} + + These distances also must have the interpretation that $f(x, y) < + f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This + is the opposite of the definition of similarity, and so some minor + manipulations are usually required to make similarity measures work + in metric-based indexes. \cite{intro-analysis} +} +and this will be done in the examples of indices for addressing +this problem in this work, but it is not a fundamental aspect of the problem +formulation. The nearest neighbor problem itself is decomposable, +with a simple merge function that accepts the result with the smallest +value of $f(x, q)$ for any two inputs\cite{saxe79}. + +The k-nearest neighbor problem generalizes nearest-neighbor to return +the $k$ nearest elements, +\begin{definition}[k-Nearest Neighbor] + + Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ + be some function $f: D^2 \to \mathbb{R}^+$ representing the distance + between two points within $D$. The k-nearest neighbor problem, + $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$ + such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$. +\end{definition} -\subsection{Limitations of the Bentley-Saxe Method} +This can be thought of as solving the nearest-neighbor problem $k$ times, +each time removing the returned result from $D$ prior to solving the +problem again. Unlike the single nearest-neighbor case (which can be +thought of as k-NN with $k=1$), this problem is \emph{not} decomposable. +\begin{theorem} + k-NN is not a decomposable search problem. +\end{theorem} +\begin{proof} +To prove this, consider the query $KNN(D, q, k)$ against some partitioned +dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If k-NN is decomposable, +then there must exist some constant-time, commutative, and associative +binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq l} +R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q, +k)$. Consider the evaluation of the merge operator against two arbitrary +result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| = +|R_j| = k$, and that the contents of $R$ must be the $k$ records from +$R_i \cup R_j$ that are nearest to $q$. 
Thus, $\mergeop$ must solve the +problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$ +time. Therefore, k-NN is not a decomposable search problem. +\end{proof} +With that said, it is clear that there isn't any fundamental restriction +preventing the merging of the result sets; it is only the case that an +arbitrary performance requirement wouldn't be satisfied. It is possible +to merge the result sets in non-constant time, and so it is the case +that k-NN is $C(n)$-decomposable. Unfortunately, this classification +brings with it a reduction in query performance as a result of the way +result merges are performed. + +As a concrete example of these costs, consider using the Bentley-Saxe +method to extend the VPTree~\cite{vptree}. The VPTree is a static, +metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k +\log n)$. One possible merge algorithm for k-NN would be to push all +of the elements in the two arguments onto a min-heap, and then pop off +the first $k$. In this case, the cost of the merge operation would be +$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation +could be considered to be constant-time. But given that $k$ is only +bounded in size above by $n$, this isn't a safe assumption to make in +general. Evaluating the total query cost for the extended structure, +this would yield, + +\begin{equation} + k-NN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) +\end{equation} + +The reason for this large increase in cost is the repeated application +of the merge operator. The Bentley-Saxe method requires applying the +merge operator in a binary fashion to each partial result, multiplying +its cost by a factor of $\log n$. Thus, the constant-time requirement +of standard decomposability is necessary to keep the cost of the merge +operator from appearing within the complexity bound of the entire +operation in the general case.\footnote { + There is a special case, noted by Overmars, where the total cost is + $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n)) + \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the + case where the cost of the query and merge operation are sufficiently + large to consume the logarithmic factor, and so it doesn't represent + a special case with better performance. +} +If we could revise the result merging operation to remove this duplicated +cost, we could greatly reduce the cost of supporting $C(n)$-decomposable +queries. + +\subsubsection{Independent Range Sampling} + +Another problem that is not decomposable is independent sampling. There +are a variety of problems falling under this umbrella, including weighted +set sampling, simple random sampling, and weighted independent range +sampling, but we will focus on independent range sampling here. + +\begin{definition}[Independent Range Sampling~\cite{tao22}] + Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query + interval $q = [x, y]$ and an integer $k$, an independent range + sampling query returns $k$ independent samples from $D \cap q$ + with each point having equal probability of being sampled. +\end{definition} +This problem immediately encounters a category error when considering +whether it is decomposable: the result set is randomized, whereas +the conditions for decomposability are defined in terms of an exact +matching of records in result sets. 
To work around this, a slight abuse +of definition is in order: assume that the equality conditions within +the DSP definition can be interpreted to mean ``the contents in the two +sets are drawn from the same distribution''. This enables the category +of DSP to apply to this type of problem. More formally, +\begin{definition}[Decomposable Sampling Problem] + A sampling problem $F: (D, Q) \to R$, $F$ is decomposable if and + only if there exists a constant-time computable, associative, and + commutative binary operator $\mergeop$ such that, + \begin{equation*} + F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q) + \end{equation*} +\end{definition} +Even with this abuse, however, IRS cannot generally be considered +decomposable; it is at best $C(n)$-decomposable. The reason for this is +that matching the distribution requires drawing the appropriate number +of samples from each each partition of the data. Even in the special +case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples +from each partition that must appear in the result set cannot be known +in advance due to differences in the selectivity of the predicate across +the partitions. + +\begin{example}[IRS Sampling Difficulties] + + Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 = + \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and + an IRS query over the interval $[3, 4]$ with $k=12$. Because all three + partitions have the same size, it seems sensible to evenly distribute + the samples across them ($4$ samples from each partition). Applying + the query predicate to the partitions results in the following, + $d_0 = \{3, 4\}, d_1 = \{3 \}, d_2 = \{4, 4, 4, 4\}$. + + In expectation, then, the first result set will contain $R_0 = \{3, + 3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same + probability of a $4$. The second and third result sets can only + be ${3, 3, 3, 3}$ and ${4, 4, 4, 4}$ respectively. Merging these + together, we'd find that the probability distribution of the sample + would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were were to perform + the same sampling operation over the full dataset (not partitioned), + the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$. + +\end{example} + +The problem is that the number of samples drawn from each partition needs to be +weighted based on the number of elements satisfying the query predicate in that +partition. In the above example, by drawing $4$ samples from $D_1$, more weight +is given to $3$ than exists within the base dataset. This can be worked around +by sampling a full $k$ records from each partition, returning both the sample +and the number of records satisfying the predicate as that partition's query +result, and then performing another pass of IRS as the merge operator, but this +is the same approach as was used for k-NN above. This leaves IRS firmly in the +$C(n)$-decomposable camp. If it were possible to pre-calculate the number of +samples to draw from each partition, then a constant-time merge operation could +be used. + +\subsection{Insertion Tail Latency} + +\subsection{Configurability} + +\section{Conclusion} +This chapter discussed the necessary background information pertaining to +queries and search problems, indexes, and techniques for dynamic extension. It +described the potential for using custom indexes for accelerating particular +kinds of queries, as well as the challenges associated with constructing these +indexes. 
The remainder of this document will seek to address these challenges
through modification and extension of the Bentley-Saxe method, describing
work that has already been completed, as well as the additional work that
must be done to realize this vision.
--
cgit v1.2.3