Diffstat (limited to 'chapters/background.tex.bak')
-rw-r--r--  chapters/background.tex.bak  |  1324
1 files changed, 824 insertions, 500 deletions
diff --git a/chapters/background.tex.bak b/chapters/background.tex.bak
index d57b370..78f4a30 100644
--- a/chapters/background.tex.bak
+++ b/chapters/background.tex.bak
@@ -1,315 +1,156 @@
\chapter{Background}
-
-This chapter will introduce important background information that
-will be used throughput the remainder of the document. We'll first
-define precisely what is meant by a query, and consider some special
-classes of query that will become relevant in our discussion of dynamic
-extension. We'll then consider the difference between a static and a
-dynamic structure, and techniques for converting static structures into
-dynamic ones in a variety of circumstances.
-
-\section{Database Indexes}
-
-The term \emph{index} is often abused within the database community
-to refer to a range of closely related, but distinct, conceptual
-categories\footnote{
-The word index can be used to refer to a structure mapping record
-information to the set of records matching that information, as a
-general synonym for ``data structure'', to data structures used
-specifically in query processing, etc.
-}.
-This ambiguity is rarely problematic, as the subtle differences
-between these categories are not often significant, and context
-clarifies the intended meaning in situtations where they are.
-However, this work explicitly operates at the interface of two of
-these categories, and so it is important to disambiguiate between
-them. As a result, we will be using the word index to
-refer to a very specific structure
-
-\subsection{The Traditional Index}
-A database index is a specialized structure which provides a means
-to efficiently locate records that satisfy specific criteria. This
-enables more efficient query processing for support queries. A
-traditional database index can be modeled as a function, mapping a
-set of attribute values, called a key, $\mathcal{K}$, to a set of
-record identifiers, $\mathcal{R}$. Technically, the codomain of an
-index can be either a record identifier, a set of record identifiers,
-or the physical record itself, depending upon the configuration of
-the index. For the purposes of this work, the focus will be on the
-first of these, but in principle any of the three index types could
-be used with little material difference to the discussion.
-
-Formally speaking, we will use the following definition of a traditional
-database index,
-\begin{definition}[Traditional Index]
-Consider a set of database records, $\mathcal{D}$. An index over
-these records, $\mathcal{I}_\mathcal{D}$ is a map of the form
-$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where
-$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
-called a \emph{key}.
-\end{definition}
-
-In order to facilitate this mapping, indexes are built using data
-structures. The specific data structure used has particular
-implications about the performance of the index, and the situations
-in which the index is effectively. Broadly speaking, traditional
-database indexes can be categorized in two ways: ordered indexes
-and unordered indexes. The former of these allows for iteration
-over the set of record identifiers in some sorted order, starting
-at the returned record. The latter allows for point-lookups only.
-
-There is a very small set of data structures that are usually used
-for creating database indexes. The most common range index in RDBMSs
-is the B-tree\footnote{ By \emph{B-tree} here, I am referring not
-to the B-tree datastructure, but to a wide range of related structures
-derived from the B-tree. Examples include the B$^+$-tree,
-B$^\epsilon$-tree, etc. } based index, and key-value stores commonly
-use indices built on the LSM-tree. Some databases support unordered
-indexes using hashtables. Beyond these, some specialized databases or
-database extensions have support for indexes based on other structures,
-such as the R-tree\footnote{
-Like the B-tree, R-tree here is used as a signifier for a general class
-of related data structures} for spatial databases or approximate small
-world graph models for similarity search.
-
-\subsection{The Generalized Index}
-
-The previous section discussed the traditional definition of index
-as might be found in a database systems textbook. However, this
-definition is limited by its association specifically with mapping
-key fields to records. For the purposes of this work, I will be
-considering a slightly broader definition of index,
-
-\begin{definition}[Generalized Index]
-Consider a set of database records, $\mathcal{D}$ and a search
-problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$
-is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to
-\mathcal{R})$.
-\end{definition}
-
-\emph{Search problems} are the topic of the next section, but in
-brief a search problem represents a general class of query, such
-as range scan, point lookup, k-nearest neightbor, etc. A traditional
-index is a special case of a generalized index, having $\mathcal{Q}$
-being a point-lookup or range query based on a set of record
-attributes.
-
-\subsection{Indices in Query Processing}
-
-A database management system utilizes indices to accelerate certain
-types of query. Queries are expressed to the system in some high
-level language, such as SQL or Datalog. These are generalized
-languages capable of expressing a wide range of possible queries.
-The DBMS is then responsible for converting these queries into a
-set of primitive data access procedures that are supported by the
-underlying storage engine. There are a variety of techniques for
-this, including mapping directly to a tree of relational algebra
-operators and interpretting that tree, query compilation, etc. But,
-ultimately, the expressiveness of this internal query representation
-is limited by the routines supported by the storage engine.
-
-As an example, consider the following SQL query (representing a
-2-dimensional k-nearest neighbor)\footnote{There are more efficient
-ways of answering this query, but I'm aiming for simplicity here
-to demonstrate my point},
-
-\begin{verbatim}
-SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
- WHERE A.property = filtering_criterion
- ORDER BY d
- LIMIT 5;
-\end{verbatim}
-
-This query will be translated into a logical query plan (a sequence
-of relational algebra operators) by the query planner, which could
-result in a plan like this,
-
-\begin{verbatim}
-query plan here
-\end{verbatim}
-
-With this logical query plan, the DBMS will next need to determine
-which supported operations it can use to most efficiently answer
-this query. For example, the selection operation (A) could be
-physically manifested as a table scan, or could be answered using
-an index scan if there is an ordered index over \texttt{A.property}.
-The query optimizer will make this decision based on its estimate
-of the selectivity of the predicate. This may result in one of the
-following physical query plans
-
-\begin{verbatim}
-physical query plan
-\end{verbatim}
-
-In either case, however, the space of possible physical plans is
-limited by the available access methods: either a sorted scan on
-an attribute (index) or an unsorted scan (table scan). The database
-must filter for all elements matching the filtering criterion,
-calculate the distances between all of these points and the query,
-and then sort the results to get the final answer. Additionally,
-note that the sort operation in the plan is a pipeline-breaker. If
-this plan were to appear as a subtree in a larger query plan, the
-overall plan would need to wait for the full evaluation of this
-sub-query before it could proceed, as sorting requires the full
-result set.
-
-Imagine a world where a new index was available to our DBMS: a
-nearest neighbor index. This index would allow the iteration over
-records in sorted order, relative to some predefined metric and a
-query point. If such an index existed over \texttt{(A.x, A.y)} using
-\texttt{dist}, then a third physical plan would be available to the DBMS,
-
-\begin{verbatim}
-\end{verbatim}
-
-This plan pulls records in order of their distance to \texttt{Q}
-directly, using an index, and then filters them, avoiding the
-pipeline breaking sort operation. While it's not obvious in this
-case that this new plan is superior (this would depend a lot on the
-selectivity of the predicate), it is a third option. It becomes
-increasingly superior as the selectivity of the predicate grows,
-and is clearly superior in the case where the predicate has unit
-selectivity (requiring only the consideration of $5$ records total).
-The construction of this special index will be considered in
-Section~\ref{ssec:knn}.
-
-This use of query-specific indexing schemes also presents a query
-planning challenge: how does the database know when a particular
-specialized index can be used for a given query, and how can
-specialized indexes broadcast their capabilities to the query planner
-in a general fashion? This work is focused on the problem of enabling
-the existence of such indexes, rather than facilitating their use,
-however these are important questions that must be considered in
-future work for this solution to be viable. There has been work
-done surrounding the use of arbtrary indexes in queries in the past,
-such as~\cite{byods-datalog}. This problem is considered out-of-scope
-for the proposed work, but will be considered in the future.
+\label{chap:background}
+
+This chapter will introduce important background information and
+existing work in the area of data structure dynamization. We will
+first discuss the concept of a search problem, which is central to
+dynamization techniques. While one might imagine that the restrictions
+on dynamization would be functions of the data structure to be dynamized,
+in practice the requirements placed on the structure itself are quite
+mild; it is the properties required of the search problem that the
+structure is used to answer that present the central difficulty in
+applying dynamization techniques in a given area. After this, database
+indices will be discussed briefly. Indices are the primary use of data
+structures within the database context that is of interest to our work.
+Following this, existing theoretical results in the area of data structure
+dynamization will be discussed, which will serve as the building blocks
+for our techniques in subsequent chapters. The chapter will conclude with
+a discussion of some of the limitations of these existing techniques.
\section{Queries and Search Problems}
+\label{sec:dsp}
+
+Data access lies at the core of most database systems. We want to ask
+questions of the data, and ideally get the answer efficiently. We
+will refer to the different types of question that can be asked as
+\emph{search problems}. We will be using this term in a similar way as
+the word \emph{query}\footnote{
+ The term query is often abused and used to
+ refer to several related, but slightly different things. In the
+ vernacular, a query can refer to either a) a general type of search
+    problem (as in ``range query''), b) a specific instance of a search
+ problem, or c) a program written in a query language.
+}
+is often used within the database systems literature: to refer to a
+general class of questions. For example, we could consider range scans,
+point-lookups, nearest neighbor searches, predicate filtering, random
+sampling, etc., to each be a general search problem. Formally, for the
+purposes of this work, a search problem is defined as follows,
-In our discussion of generalized indexes, we encountered \emph{search
-problems}. A search problem is a term used within the literature
-on data structures in a manner similar to how the database community
-sometimes uses the term query\footnote{
-Like with the term index, the term query is often abused and used to
-refer to several related, but slightly different things. In the vernacular,
-a query can refer to either a) a general type of search problem (as in "range query"),
-b) a specific instance of a search problem, or c) a program written in a query language.
-}, to refer to a general
-class of questions asked of data. Examples include range queries,
-point-lookups, nearest neighbor queries, predicate filtering, random
-sampling, etc. Formally, for the purposes of this work, we will define
-a search problem as follows,
\begin{definition}[Search Problem]
-Given three multisets, $D$, $R$, and $Q$, a search problem is a function
-$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched,
-$Q$ represents the domain of query parameters, and $R$ represents the
-answer domain.
-\footnote{
-It is important to note that it is not required for $R \subseteq D$. As an
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
+ $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
+ $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
+answer domain.\footnote{
+ It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
example, a \texttt{COUNT} aggregation might map a set of strings onto
-an integer. Most common queries do satisfy $R \subseteq D$, but this need
+ an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
not be a universal constraint.
}
\end{definition}
-And we will use the word \emph{query} to refer to a specific instance
-of a search problem, except when used as part of the generally
-accepted name of a search problem (i.e., range query).
+We will use the term \emph{query} to mean a specific instance of a search
+problem,
\begin{definition}[Query]
-Given three multisets, $D$, $R$, and $Q$, a search problem $F$ and
-a specific set of query parameters $q \in Q$, a query is a specific
-instance of the search problem, $F(D, q)$.
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
+ a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
+ instance of the search problem, $F(\mathcal{D}, q)$.
\end{definition}
As an example of using these definitions, a \emph{membership test}
-or \emph{range query} would be considered search problems, and a
-range query over the interval $[10, 99]$ would be a query.
+or \emph{range scan} would be considered search problems, and a range
+scan over the interval $[10, 99]$ would be a query. We draw this
+distinction because, as will become clear in the discussion of our work
+in later chapters, it is useful to have separate, unambiguous terms for
+these two concepts.
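+
+To make this distinction concrete, the following Python sketch (purely
+illustrative; the function and variable names are our own) expresses a
+range scan as a search problem and a query as one specific invocation
+of that problem,
+
+\begin{verbatim}
+# A search problem is a function F : (D, Q) -> R. Here, the general
+# problem of a range scan over a set of real numbers.
+def range_scan(data, interval):
+    lo, hi = interval
+    return {x for x in data if lo <= x <= hi}
+
+# A query is a specific instance of the search problem: the range
+# scan over the interval [10, 99] mentioned above.
+data = {5, 17, 42, 99, 103}
+result = range_scan(data, (10, 99))   # {17, 42, 99}
+\end{verbatim}
+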
\subsection{Decomposable Search Problems}
-An important subset of search problems is that of decomposable
-search problems (DSPs). This class was first defined by Saxe and
-Bentley as follows,
+Dynamization techniques require the partitioning of one data structure
+into several, smaller ones. As a result, these techniques can only
+be applied in situations where the search problem can be answered from
+this set of smaller structures, producing the same answer as would have
+been obtained had all of the data been used to construct a single,
+large structure. This requirement is formalized in
+the definition of a class of problems called \emph{decomposable search
+problems (DSP)}. This class was first defined by Bentley and Saxe in
+their work on dynamization, and we will adopt their definition,
\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
- Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and
- only if there exists a consant-time computable, associative, and
- commutative binary operator $\square$ such that,
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
+ only if there exists a constant-time computable, associative, and
+ commutative binary operator $\mergeop$ such that,
\begin{equation*}
- F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
\end{definition}
-The constant-time requirement was used to prove bounds on the costs of
-evaluating DSPs over data broken across multiple partitions. Further work
-by Overmars lifted this constraint and considered a more general class
-of DSP,
+The requirement for $\mergeop$ to be constant-time was used by Bentley and
+Saxe to prove specific performance bounds for answering queries from a
+decomposed data structure. However, it is not strictly \emph{necessary},
+and later work by Overmars lifted this constraint and considered a more
+general class of search problems called \emph{$C(n)$-decomposable search
+problems},
+
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
- Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
if and only if there exists an $O(C(n))$-time computable, associative,
- and commutative binary operator $\square$ such that,
+ and commutative binary operator $\mergeop$ such that,
\begin{equation*}
- F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
\end{definition}
-Decomposability is an important property because it allows for
-search problems to be answered over partitioned datasets. The details
-of this will be discussed in Section~\ref{ssec:bentley-saxe} in the
-context of creating dynamic data structures. Many common types of
-search problems appearing in databases are decomposable, such as
-range queries or predicate filtering.
-
-To demonstrate that a search problem is decomposable, it is necessary
-to show the existance of the merge operator, $\square$, and to show
-that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two
-results, simple induction demonstrates that the problem is decomposable
-even in cases with more than two partial results.
-
-As an example, consider range queries,
-\begin{definition}[Range Query]
-Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
-$ q = [x, y],\quad x,y \in R$, a range query returns all points in
-$D \cap q$.
+To demonstrate that a search problem is decomposable, it is necessary to
+show the existence of the merge operator, $\mergeop$, with the necessary
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
+q)$. With these two results, induction demonstrates that the problem is
+decomposable even in cases with more than two partial results.
+
+As an example, consider the range count problem,
+\begin{definition}[Range Count]
+ Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+ $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
+ the cardinality, $|d \cap q|$.
\end{definition}
\begin{theorem}
-Range Queries are a DSP.
+Range Count is a decomposable search problem.
\end{theorem}
\begin{proof}
-Let $\square$ be the set union operator ($\cup$). Applying this to
-Definition~\ref{def:dsp}, we have
+Let $\mergeop$ be addition ($+$). Applying this to
+Definition~\ref{def:dsp} gives
\begin{align*}
- (A \cup B) \cap q = (A \cap q) \cup (B \cap q)
+ |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
\end{align*}
-which is true by the distributive property of set union and
-intersection. Assuming an implementation allowing for an $O(1)$
-set union operation, range queries are DSPs.
+which is true by the distributive property of union and intersection,
+together with the fact that the partitions $A$ and $B$ are disjoint.
+Addition is an associative and commutative operator that can be computed
+in $O(1)$ time. Therefore, range counts are DSPs.
\end{proof}
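
As a concrete check of this proof, the following Python sketch (our own,
purely illustrative) computes a range count over two disjoint partitions
and merges the partial results with $\mergeop$ taken to be ordinary
addition,

\begin{verbatim}
def range_count(data, interval):
    lo, hi = interval
    return sum(1 for x in data if lo <= x <= hi)

A = {1, 4, 7, 12}       # two disjoint partitions of the data
B = {15, 22, 30}
q = (5, 25)

merged = range_count(A, q) + range_count(B, q)   # F(A,q) merge F(B,q)
direct = range_count(A | B, q)                   # F(A U B, q)
assert merged == direct == 4
\end{verbatim}
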
Because the codomain of a DSP is not restricted, more complex output
structures can be used to allow for problems that are not directly
decomposable to be converted to DSPs, possibly with some minor
-post-processing. For example, the calculation of the mean of a set
-of numbers can be constructed as a DSP using the following technique,
+post-processing. For example, calculating the arithmetic mean of a set
+of numbers can be formulated as a DSP,
\begin{theorem}
-The calculation of the average of a set of numbers is a DSP.
+The calculation of the arithmetic mean of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
-Define the search problem as $A:D \to (\mathbb{R}, \mathbb{Z})$,
-where $D\subset\mathbb{R}$ and is a multiset. The output tuple
+ Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
+ where $\mathcal{D}$ is a multi-set of values drawn from $\mathbb{R}$. The output tuple
contains the sum of the values within the input set, and the
-cardinality of the input set. Let the $A(D_1) = (s_1, c_1)$ and
-$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)\square A(D_2) = (s_1 +
-s_2, c_1 + c_2)$.
+cardinality of the input set. For two disjoint partitions of the data,
+$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
+$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
-Applying Definition~\ref{def:dsp}, we have
+Applying Definition~\ref{def:dsp} gives
\begin{align*}
- A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\
+ A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\
(s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
\end{align*}
From this result, the average can be determined in constant time by
@@ -317,258 +158,741 @@ taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
of numbers is a DSP.
\end{proof}
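
The construction used in this proof can be illustrated with a short
Python sketch (the names are our own and purely illustrative),

\begin{verbatim}
def mean_summary(data):
    # The search problem A : D -> (R, Z), returning (sum, count).
    return (sum(data), len(data))

def merge(a, b):
    # The merge operator: element-wise addition of the tuples.
    return (a[0] + b[0], a[1] + b[1])

D1, D2 = [2.0, 4.0], [6.0, 8.0, 10.0]
s, c = merge(mean_summary(D1), mean_summary(D2))
assert s / c == sum(D1 + D2) / len(D1 + D2)   # post-process: s / c
\end{verbatim}
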
-\section{Dynamic Extension Techniques}
-Because data in a database is regularly updated, data structures
-intended to be used as an index must support updates (inserts,
-in-place modification, and deletes) to their data. In principle,
-any data structure can support updates to its underlying data through
-global reconstruction: adjusting the record set and then rebuilding
-the entire structure. Ignoring this trivial (and highly inefficient)
-approach, a data structure with support for updates is called
-\emph{dynamic}, and one without support for updates is called
-\emph{static}. In this section, we discuss approaches for modifying
-a static data structure to grant it support for updates, a process
-called \emph{dynamic extension} or \emph{dynamization}. A theoretical
-survey of this topic can be found in~\cite{overmars83}, but this
-work doesn't cover several techniques that are used in practice.
-As such, much of this section constitutes our own analysis, tying
-together threads from a variety of sources.
-
-\subsection{Local Reconstruction}
-
-One way of viewing updates to a data structure is as reconstructing
-all or part of the structure. To minimize the cost of the update,
-it is ideal to minimize the size of the reconstruction that accompanies
-an update, either by careful structuring of the data to ensure
-minimal disruption to surrounding records by an update, or by
-deferring the reconstructions and amortizing their costs over as
-many updates as possible.
-
-While minimizing the size of a reconstruction seems the most obvious,
-and best, approach, it is limited in its applicability. The more
-related ``nearby'' records in the structure are, the more records
-will be affected by a change. Records can be related in terms of
-some ordering of their values, which we'll term a \emph{spatial
-ordering}, or in terms of their order of insertion to the structure,
-which we'll term a \emph{temporal ordering}. Note that these terms
-don't imply anything about the nature of the data, and instead
-relate to the principles used by the data structure to arrange them.
-
-Arrays provide the extreme version of both of these ordering
-principles. In an unsorted array, in which records are appended to
-the end of the array, there is no spatial ordering dependence between
-records. This means that any insert or update will require no local
-reconstruction, aside from the record being directly affected.\footnote{
-A delete can also be performed without any structural adjustments
-in a variety of ways. Reorganization of the array as a result of
-deleted records serves an efficiency purpose, but isn't required
-for the correctness of the structure. } However, the order of
-records in the array \emph{does} express a strong temporal dependency:
-the index of a record in the array provides the exact insertion
-order.
-
-A sorted array provides exactly the opposite situation. The order
-of a record in the array reflects an exact spatial ordering of
-records with respect to their sorting function. This means that an
-update or insert will require reordering a large number of records
-(potentially all of them, in the worst case). Because of the stronger
-spatial dependence of records in the structure, an update will
-require a larger-scale reconstruction. Additionally, there is no
-temporal component to the ordering of the records: inserting a set
-of records into a sorted array will produce the same final structure
-irrespective of insertion order.
-
-It's worth noting that the spatial dependency discussed here, as
-it relates to reconstruction costs, is based on the physical layout
-of the records and not the logical ordering of them. To exemplify
-this, a sorted singly-linked list can maintain the same logical
-order of records as a sorted array, but limits the spatial dependce
-between records each records preceeding node. This means that an
-insert into this structure will require only a single node update,
-regardless of where in the structure this insert occurs.
-
-The amount of spatial dependence in a structure directly reflects
-a trade-off between read and write performance. In the above example,
-performing a lookup for a given record in a sorted array requires
-asymptotically fewer comparisons in the worst case than an unsorted
-array, because the spatial dependecies can be exploited for an
-accelerated search (binary vs. linear search). Interestingly, this
-remains the case for lookups against a sorted array vs. a sorted
-linked list. Even though both structures have the same logical order
-of records, limited spatial dependecies between nodes in a linked
-list forces the lookup to perform a scan anyway.
-
-A balanced binary tree sits between these two extremes. Like a
-linked list, individual nodes have very few connections. However
-the nodes are arranged in such a way that a connection existing
-between two nodes implies further information about the ordering
-of children of those nodes. In this light, rebalancing of the tree
-can be seen as maintaining a certain degree of spatial dependence
-between the nodes in the tree, ensuring that it is balanced between
-the two children of each node. A very general summary of tree
-rebalancing techniques can be found in~\cite{overmars83}. Using an
-AVL tree~\cite{avl} as a specific example, each insert in the tree
-involves adding the new node and updating its parent (like you'd
-see in a simple linked list), followed by some larger scale local
-reconstruction in the form of tree rotations, to maintain the balance
-factor invariant. This means that insertion requires more reconstruction
-effort than the single pointer update in the linked list case, but
-results in much more efficient searches (which, as it turns out,
-makes insertion more efficient in general too, even with the overhead,
-because finding the insertion point is much faster).
-
-\subsection{Amortized Local Reconstruction}
-
-In addition to control update cost by arranging the structure so
-as to reduce the amount of reconstruction necessary to maintain the
-desired level of spatial dependence, update costs can also be reduced
-by amortizing the local reconstruction cost over multiple updates.
-This is often done in one of two ways: leaving gaps or adding
-overflow buckets. These gaps and buckets allows for a buffer of
-insertion capacity to be sustained by the data structure, before
-a reconstruction is triggered.
-
-A classic example of the gap approach is found in the
-B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well
-as open addressing for hash tables. In a B$^+$-tree, each node has
-a fixed size, which must be at least half-utilized (aside from the
-root node). The empty spaces within these nodes are gaps, which can
-be cheaply filled with new records on insert. Only when a node has
-been filled must a local reconstruction (called a structural
-modification operation for B-trees) occur to redistribute the data
-into multiple nodes and replenish the supply of gaps. This approach
-is particularly well suited to data structures in contexts where
-the natural unit of storage is larger than a record, as in disk-based
-(with 4KiB pages) or cache-optimized (with 64B cachelines) structures.
-This gap-based approach was also used to create ALEX, an updatable
-learned index~\cite{ALEX}.
-
-The gap approach has a number of disadvantages. It results in a
-somewhat sparse structure, thereby wasting storage. For example, a
-B$^+$-tree requires all nodes other than the root to be at least
-half full--meaning in the worst case up to half of the space required
-by the structure could be taken up by gaps. Additionally, this
-scheme results in some inserts being more expensive than others:
-most new records will occupy an available gap, but some will trigger
-more expensive SMOs. In particular, it has been observed with
-B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}:
-the gaps in many nodes fill at about the same time, leading to
-periodic clusters of high-cost merge operations.
-
-Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam},
-as well as hash tables with closed addressing. In this approach,
-parts of the structure into which records would be inserted (leaf
-nodes of ISAM, directory entries in CA hashing) have a pointer to
-an overflow location, where newly inserted records can be placed.
-This allows for the structure to, theoretically, sustain an unlimited
-amount of insertions. However, read performance degrades, because
-the more overflow capacity is utilized, the less the records in the
-structure are ordered according to the data structure's definition.
-Thus, periodically a reconstruction is necessary to distribute the
-overflow records into the structure itself.
-
-\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
-
-Another approach to support updates is to amortize the cost of
-global reconstruction over multiple updates. This approach can take
-take three forms,
-\begin{enumerate}
-
- \item Pairing a dynamic data structure (called a buffer or
- memtable) with an instance of the structure being extended.
- Updates are written to the buffer, and when the buffer is
- full its records are merged with those in the static
- structure, and the structure is rebuilt. This approach is
- used by one version of the originally proposed
- LSM-tree~\cite{oneil93}. Technically this technique proposed
- in that work for the purposes of converting random writes
- into sequential ones (all structures involved are dynamic),
- but it can be used for dynamization as well.
-
- \item Creating multiple, smaller data structures each
- containing a partition of the records from the dataset, and
- reconstructing individual structures to accomodate new
- inserts in a systematic manner. This technique is the basis
- of the Bentley-Saxe method~\cite{saxe79}.
-
- \item Using both of the above techniques at once. This is
- the approach used by modern incarnations of the
- LSM~tree~\cite{rocksdb}.
-
-\end{enumerate}
-
-In all three cases, it is necessary for the search problem associated
-with the index to be a DSP, as answering it will require querying
-multiple structures (the buffer and/or one or more instances of the
-data structure) and merging the results together to get a final
-result. This section will focus exclusively on the Bentley-Saxe
-method, as it is the basis for our proposed methodology.p
-
-When dividing records across multiple structures, there is a clear
-trade-off between read performance and write performance. Keeping
-the individual structures small reduces the cost of reconstructing,
-and thereby increases update performance. However, this also means
-that more structures will be required to accommodate the same number
-of records, when compared to a scheme that allows the structures
-to be larger. As each structure must be queried independently, this
-will lead to worse query performance. The reverse is also true,
-fewer, larger structures will have better query performance and
-worse update performance, with the extreme limit of this being a
-single structure that is fully rebuilt on each insert.
-\begin{figure}
- \caption{Inserting a new record using the Bentley-Saxe method.}
- \label{fig:bsm-example}
-\end{figure}
+\section{Database Indexes}
+\label{sec:indexes}
+
+Within a database system, search problems are expressed using
+some high level language (or mapped directly to commands, for
+simpler systems like key-value stores), which is processed by
+the database system to produce a result. Within many database
+systems, the most basic access primitive is a table scan, which
+sequentially examines each record within the data set. There are many
+situations in which the same query could be answered in less time using
+a more sophisticated data access scheme, however, and databases support
+a limited number of such schemes through the use of specialized data
+structures called \emph{indices} (or indexes). Indices can be built over
+a set of attributes in a table and provide faster access for particular
+search problems.
-The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
-good balance can be struck by uses a geometrically increasing
-structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
-with the bottom level having a capacity of a single record, and
-each subsequent level doubling in capacity. When an update is
-performed, the first empty level is located and a reconstruction
-is triggered, merging the structures of all levels below this empty
-one, along with the new record. An example of this process is shown
-in Figure~\ref{fig:bsm-example}. The merits of this approach are
-that it ensures that ``most'' reconstructions involve the smaller
-data structures towards the bottom of the sequence, while most of
-the records reside in large, infrequently updated, structures towards
-the top. This balances between the read and write implications of
-structure size, while also allowing the number of structures required
-to represent $n$ records to be worst-case bounded by $O(\log n)$.
-
-Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$
-query cost, the Bentley-Saxe Method will produce a dynamic data
-structure with,
+The term \emph{index} is often abused within the database community
+to refer to a range of closely related, but distinct, conceptual
+categories.\footnote{
+The word index can be used to refer to a structure mapping record
+information to the set of records matching that information, as a
+general synonym for ``data structure'', to data structures used
+specifically in query processing, etc.
+}
+This ambiguity is rarely problematic, as the subtle differences between
+these categories are not often significant, and context clarifies the
+intended meaning in situations where they are. However, this work
+explicitly operates at the interface of two of these categories, and so
+it is important to disambiguate between them.
+
+\subsection{The Classical Index}
+A database index is a specialized data structure that provides a means
+to efficiently locate records that satisfy specific criteria. This
+enables more efficient query processing for supported search problems. A
+classical index can be modeled as a function, mapping a set of attribute
+values, called a key, $\mathcal{K}$, to a set of record identifiers,
+$\mathcal{R}$. The codomain of an index can be either the set of
+record identifiers, a set containing sets of record identifiers, or
+the set of physical records, depending upon the configuration of the
+index.~\cite{cowbook} For our purposes here, we'll focus on the first of
+these, but the use of other codomains wouldn't have any material effect
+on our discussion.
+
+We will use the following definition of a "classical" database index,
+
+\begin{definition}[Classical Index~\cite{cowbook}]
+Consider a set of database records, $\mathcal{D}$. An index over
+these records, $\mathcal{I}_\mathcal{D}$ is a map of the form
+ $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where
+$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
+called a \emph{key}.
+\end{definition}
+
+In order to facilitate this mapping, indexes are built using data
+structures. The selection of data structure has implications for the
+performance of the index, and the types of search problem it can be
+used to accelerate. Broadly speaking, classical indices can be divided
+into two categories: ordered and unordered. Ordered indices allow for
+the iteration over a set of record identifiers in a particular sorted
+order of keys, and the efficient location of a specific key value in
+that order. These indices can be used to accelerate range scans and
+point-lookups. Unordered indices are specialized for point-lookups on a
+particular key value, and do not support iterating over records in some
+order.~\cite{cowbook, mysql-btree-hash}
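+
+As a rough illustration of the ordered case (not drawn from any
+particular system; the class and method names are our own), an ordered
+index over key and record identifier pairs might be sketched as,
+
+\begin{verbatim}
+import bisect
+
+class OrderedIndex:
+    """A toy ordered index: sorted keys mapped to record identifiers."""
+    def __init__(self, pairs):              # pairs of (key, record_id)
+        self.entries = sorted(pairs)
+
+    def point_lookup(self, key):
+        i = bisect.bisect_left(self.entries, (key,))
+        return [rid for k, rid in self.entries[i:] if k == key]
+
+    def range_scan(self, lo, hi):
+        i = bisect.bisect_left(self.entries, (lo,))
+        for k, rid in self.entries[i:]:
+            if k > hi:
+                break
+            yield rid
+\end{verbatim}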
+
+There is a very small set of data structures that are usually used for
+creating classical indexes. For ordered indices, the most commonly used
+data structure is the B-tree~\cite{ubiq-btree},\footnote{
+ By \emph{B-tree} here, we are referring not to the B-tree data
+ structure, but to a wide range of related structures derived from
+ the B-tree. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc.
+}
+and the log-structured merge (LSM) tree~\cite{oneil96} is also often
+used within the context of key-value stores~\cite{rocksdb}. Some databases
+implement unordered indices using hash tables~\cite{mysql-btree-hash}.
+
+
+\subsection{The Generalized Index}
+
+The previous section discussed the classical definition of index
+as might be found in a database systems textbook. However, this
+definition is limited by its association specifically with mapping
+key fields to records. For the purposes of this work, a broader
+definition of index will be considered,
+
+\begin{definition}[Generalized Index]
+Consider a set of database records, $\mathcal{D}$, and a search
+problem, $\mathcal{Q}$.
+A generalized index, $\mathcal{I}_\mathcal{D}$
+is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to
+\mathcal{R})$.
+\end{definition}
+
+A classical index is a special case of a generalized index, with $\mathcal{Q}$
+being a point-lookup or range scan based on a set of record attributes.
+
+There are a number of generalized indexes that appear in some database systems.
+For example, some specialized databases or database extensions have support for
+indexes based on the R-tree\footnote{Like the B-tree, R-tree here is used as a
+signifier for a general class of related data structures.} for spatial
+databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world
+graphs for similarity search~\cite{pinecone-db}, among others. These systems
+are typically either an add-on module, or a specialized standalone database
+that has been designed specifically for answering particular types of queries
+(such as spatial queries, similarity search, string matching, etc.).
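+
+To make the notion concrete, a minimal sketch of a generalized index for
+a one-dimensional $k$-nearest-neighbor search problem is shown below
+(illustrative only, and not drawn from any of the systems mentioned
+above),
+
+\begin{verbatim}
+import bisect
+
+class KNN1DIndex:
+    """A toy generalized index: k-nearest-neighbor search in 1-D."""
+    def __init__(self, points):
+        self.points = sorted(points)
+
+    def query(self, x, k):
+        # Expand outward from the position where x would be inserted.
+        pts = self.points
+        i = bisect.bisect_left(pts, x)
+        lo, hi, out = i - 1, i, []
+        while len(out) < k and (lo >= 0 or hi < len(pts)):
+            take_hi = lo < 0 or (hi < len(pts) and
+                                 abs(pts[hi] - x) <= abs(pts[lo] - x))
+            if take_hi:
+                out.append(pts[hi])
+                hi += 1
+            else:
+                out.append(pts[lo])
+                lo -= 1
+        return out
+\end{verbatim}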
+
+%\subsection{Indexes in Query Processing}
+
+%A database management system utilizes indexes to accelerate certain
+%types of query. Queries are expressed to the system in some high
+%level language, such as SQL or Datalog. These are generalized
+%languages capable of expressing a wide range of possible queries.
+%The DBMS is then responsible for converting these queries into a
+%set of primitive data access procedures that are supported by the
+%underlying storage engine. There are a variety of techniques for
+%this, including mapping directly to a tree of relational algebra
+%operators and interpreting that tree, query compilation, etc. But,
+%ultimately, this internal query representation is limited by the routines
+%supported by the storage engine.~\cite{cowbook}
+
+%As an example, consider the following SQL query (representing a
+%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient
+%ways of answering this query, but I'm aiming for simplicity here
+%to demonstrate my point},
+%
+%\begin{verbatim}
+%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
+% WHERE A.property = filtering_criterion
+% ORDER BY d
+% LIMIT 5;
+%\end{verbatim}
+%
+%This query will be translated into a logical query plan (a sequence
+%of relational algebra operators) by the query planner, which could
+%result in a plan like this,
+%
+%\begin{verbatim}
+%query plan here
+%\end{verbatim}
+%
+%With this logical query plan, the DBMS will next need to determine
+%which supported operations it can use to most efficiently answer
+%this query. For example, the selection operation (A) could be
+%physically manifested as a table scan, or could be answered using
+%an index scan if there is an ordered index over \texttt{A.property}.
+%The query optimizer will make this decision based on its estimate
+%of the selectivity of the predicate. This may result in one of the
+%following physical query plans
+%
+%\begin{verbatim}
+%physical query plan
+%\end{verbatim}
+%
+%In either case, however, the space of possible physical plans is
+%limited by the available access methods: either a sorted scan on
+%an attribute (index) or an unsorted scan (table scan). The database
+%must filter for all elements matching the filtering criterion,
+%calculate the distances between all of these points and the query,
+%and then sort the results to get the final answer. Additionally,
+%note that the sort operation in the plan is a pipeline-breaker. If
+%this plan were to appear as a sub-tree in a larger query plan, the
+%overall plan would need to wait for the full evaluation of this
+%sub-query before it could proceed, as sorting requires the full
+%result set.
+%
+%Imagine a world where a new index was available to the DBMS: a
+%nearest neighbor index. This index would allow the iteration over
+%records in sorted order, relative to some predefined metric and a
+%query point. If such an index existed over \texttt{(A.x, A.y)} using
+%\texttt{dist}, then a third physical plan would be available to the DBMS,
+%
+%\begin{verbatim}
+%\end{verbatim}
+%
+%This plan pulls records in order of their distance to \texttt{Q}
+%directly, using an index, and then filters them, avoiding the
+%pipeline breaking sort operation. While it's not obvious in this
+%case that this new plan is superior (this would depend upon the
+%selectivity of the predicate), it is a third option. It becomes
+%increasingly superior as the selectivity of the predicate grows,
+%and is clearly superior in the case where the predicate has unit
+%selectivity (requiring only the consideration of $5$ records total).
+%
+%This use of query-specific indexing schemes presents a query
+%optimization challenge: how does the database know when a particular
+%specialized index can be used for a given query, and how can
+%specialized indexes broadcast their capabilities to the query optimizer
+%in a general fashion? This work is focused on the problem of enabling
+%the existence of such indexes, rather than facilitating their use;
+%however these are important questions that must be considered in
+%future work for this solution to be viable. There has been work
+%done surrounding the use of arbitrary indexes in queries in the past,
+%such as~\cite{byods-datalog}. This problem is considered out-of-scope
+%for the proposed work, but will be considered in the future.
+
+\section{Classical Dynamization Techniques}
+
+Because data in a database is regularly updated, data structures
+intended to be used as an index must support updates (inserts, in-place
+modification, and deletes). Not all potentially useful data structures
+support updates, and so a general strategy for adding update support
+would increase the number of data structures that could be used as
+database indices. We refer to a data structure with update support as
+\emph{dynamic}, and one without update support as \emph{static}.\footnote{
+ The term static is distinct from immutable. Static refers to the
+ layout of records within the data structure, whereas immutable
+ refers to the data stored within those records. This distinction
+ will become relevant when we discuss different techniques for adding
+ delete support to data structures. The data structures used are
+ always static, but not necessarily immutable, because the records may
+ contain header information (like visibility) that is updated in place.
+}
+
+This section discusses \emph{dynamization}, the construction of a dynamic
+data structure based on an existing static one. When certain conditions
+are satisfied by the data structure and its associated search problem,
+this process can be done automatically, and with provable asymptotic
+bounds on amortized insertion performance, as well as worst case query
+performance. We will first discuss the necessary data structure
+requirements, and then examine several classical dynamization techniques.
+The section will conclude with a discussion of delete support within the
+context of these techniques.
+
+It is worth noting that there are a variety of techniques
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general, and don't require
+workload-specific assumptions.
+
+
+\subsection{Global Reconstruction}
+
+The most fundamental dynamization technique is that of \emph{global
+reconstruction}. While not particularly useful on its own, global
+reconstruction serves as the basis for the techniques to follow, and so
+we will begin our discussion of dynamization with it.
+
+Consider a class of data structure, $\mathcal{I}$, capable of answering a
+search problem, $\mathcal{Q}$. Insertion via global reconstruction is
+possible if $\mathcal{I}$ supports the following two operations,
+\begin{align*}
+\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
+\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
+\end{align*}
+where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
+of the data structure over a set of records $d \subseteq \mathcal{D}$
+in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
+\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
+$\Theta(1)$ time,\footnote{
+ There isn't any practical reason why $\mathtt{unbuild}$ must run
+ in constant time, but this is the assumption made in \cite{saxe79}
+ and in subsequent work based on it, and so we will follow the same
+    definition here.
+} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
+
+Given this structure, an insert of record $r \in \mathcal{D}$ into a
+data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
+\begin{align*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
+\end{align*}
+
+It goes without saying that this operation is sub-optimal, as the
+insertion cost is $\Theta(C(n))$, and $C(n) \in \Omega(n)$ at best for
+most data structures. However, this global reconstruction strategy can
+be used as a primitive for more sophisticated techniques that can provide
+reasonable performance.
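+
+As a minimal sketch of this primitive (the \texttt{SortedArray} class
+and function names here are our own, purely illustrative), consider a
+static structure that only supports construction from a set of records,
+
+\begin{verbatim}
+class SortedArray:
+    """A static structure: built once, never modified in place."""
+    def __init__(self, records):
+        self.records = sorted(records)    # build: O(n log n)
+
+    def unbuild(self):
+        return list(self.records)         # recover the record set
+
+def insert_via_global_reconstruction(structure, r):
+    # build(unbuild(I) U {r}): every insert rebuilds the structure.
+    return SortedArray(structure.unbuild() + [r])
+\end{verbatim}
+
+Later sketches in this chapter will reuse this \texttt{SortedArray}
+class as a stand-in for an arbitrary static structure.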
+
+\subsection{Amortized Global Reconstruction}
+\label{ssec:agr}
+
+The problem with global reconstruction is that each insert must rebuild
+the entire data structure, involving all of its records. This results
+in a worst-case insert cost of $\Theta(C(n))$. However, opportunities
+for improving this scheme can present themselves when considering the
+\emph{amortized} insertion cost.
+
+Consider the cost accrued by the dynamized structure under global
+reconstruction over the lifetime of the structure. Each insert will result
+in all of the existing records being rewritten, so at worst each record
+will be involved in $\Theta(n)$ reconstructions, each reconstruction
+having $\Theta(C(n))$ cost. We can amortize this cost over the $n$ records
+inserted to get an amortized insertion cost for global reconstruction of,
+
+\begin{equation*}
+I_a(n) = \frac{C(n) \cdot n}{n} = C(n)
+\end{equation*}
+
+This doesn't improve things as is; however, it does present two
+opportunities for improvement. If we could either reduce the size of
+the reconstructions, or the number of times a record is reconstructed,
+then we could reduce the amortized insertion cost.
+
+The key insight, first discussed by Bentley and Saxe, is that
+this goal can be accomplished by \emph{decomposing} the data
+structure into multiple, smaller structures, each built from a
+disjoint partition of the data. As long as the search problem
+being considered is decomposable, queries can be answered from
+this structure with bounded worst-case overhead, and the amortized
+insertion cost can be improved~\cite{saxe79}. Significant theoretical
+work exists in evaluating different strategies for decomposing the
+data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
+specific efficiencies of the data structures being considered to improve
+these reconstructions~\cite{merge-dsp}.
+
+There are two general decomposition techniques that emerged from this
+work. The earliest of these is the logarithmic method, often called
+the Bentley-Saxe method in modern literature, which is the most commonly
+discussed technique today. A later technique, the equal block method,
+was also examined. It is generally not as effective as the Bentley-Saxe
+method, but it has some useful properties for explanatory purposes and
+so will be discussed here as well.
+
+\subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}}
+\label{ssec:ebm}
+
+Though chronologically later, the equal block method is theoretically a
+bit simpler, and so we will begin our discussion of decomposition-based
+techniques for dynamization of decomposable search problems with it. The
+core concept of the equal block method is to decompose the data structure
+into several smaller data structures, called blocks, over partitions
+of the data. This decomposition is performed such that each block is of
+roughly equal size.
+
+Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
+some decomposable search problem, $F$, and is built over a set of records
+$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
+$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
+partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
+makes little sense when the number of records changes, and so it is taken
+to be governed by a smooth, monotonically increasing function $f(n)$ such
+that, at any point, the following two constraints are obeyed.
\begin{align}
- \text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\
- \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
+ f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
+    \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
\end{align}
+where $|\mathscr{I}_j|$ is the number of records in the block,
+$|\text{unbuild}(\mathscr{I}_j)|$.
+
+A new record is inserted by finding the smallest block and rebuilding it
+using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
+then an insert is done by,
+\begin{equation*}
+\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
+\end{equation*}
+Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
+ Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
+ violated by deletes. We're omitting deletes from the discussion at
+ this point, but will circle back to them in Section~\ref{sec:deletes}.
+} In this case, the constraints are enforced by ``reconfiguring'' the
+structure. $s$ is updated to be exactly $f(n)$, all of the existing
+blocks are unbuilt, and then the records are redistributed evenly into
+$s$ blocks.
+
+A query with parameters $q$ is answered by this structure by individually
+querying the blocks, and merging the local results together with $\mergeop$,
+\begin{equation*}
+F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
+\end{equation*}
+where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
+answering the query over $d$ using the data structure $\mathscr{I}$.
+
+This technique provides better amortized performance bounds than global
+reconstruction, at the possible cost of reduced query performance for
+sub-linear queries. We'll omit the details of the proof of performance
+for brevity and streamline some of the original notation (full details
+can be found in~\cite{overmars83}), but this technique ultimately
+results in a data structure with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
+\end{align*}
+where $C(n)$ is the cost of statically building $\mathcal{I}$, and
+$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
+
+%TODO: example?
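+
+A simplified sketch of the method follows (our own illustration, not a
+reference implementation). It fixes $f(n) = \lceil\sqrt{n}\,\rceil$,
+reuses the \texttt{SortedArray} stand-in from the global reconstruction
+sketch, and omits deletes,
+
+\begin{verbatim}
+import math
+
+class EqualBlockDynamization:
+    """Equal block method with f(n) = ceil(sqrt(n)), inserts only."""
+    def __init__(self):
+        self.blocks = [SortedArray([])]
+
+    def insert(self, r):
+        # Rebuild the smallest block with the new record included.
+        k = min(range(len(self.blocks)),
+                key=lambda j: len(self.blocks[j].records))
+        self.blocks[k] = SortedArray(self.blocks[k].unbuild() + [r])
+        n = sum(len(b.records) for b in self.blocks)
+        # Reconfigure if the block count drifts outside [f(n/2), f(2n)].
+        lo = math.ceil(math.sqrt(n / 2))
+        hi = math.ceil(math.sqrt(2 * n))
+        if not (lo <= len(self.blocks) <= hi):
+            s = math.ceil(math.sqrt(n))
+            records = [x for b in self.blocks for x in b.unbuild()]
+            self.blocks = [SortedArray(records[j::s]) for j in range(s)]
+
+    def query(self, q, local_query, merge):
+        # Answer a decomposable search problem block by block.
+        parts = [local_query(b, q) for b in self.blocks]
+        result = parts[0]
+        for p in parts[1:]:
+            result = merge(result, p)
+        return result
+\end{verbatim}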
+
+
+\subsection{The Bentley-Saxe Method~\cite{saxe79}}
+\label{ssec:bsm}
+
+
+The original, and most frequently used, dynamization technique is the
+Bentley-Saxe Method (BSM), also called the logarithmic method in older
+literature. Rather than breaking the data structure into equally sized
+blocks, BSM decomposes the structure into logarithmically many blocks
+of exponentially increasing size. More specifically, the data structure
+is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_0,
+\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
+will be either empty, or contain exactly $2^i$ records within it.
+
+The procedure for inserting a record, $r \in \mathcal{D}$, into
+a BSM dynamization is as follows. If the block $\mathscr{I}_0$
+is empty, then $\mathscr{I}_0 = \text{build}{\{r\}}$. If it is not
+empty, then there will exist a maximal sequence of non-empty blocks
+$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
+0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
+$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
+\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
+$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
+end of the structure as needed.
+
+%FIXME: switch the x's to r's for consistency
+\begin{figure}
+\centering
+\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
+\caption{An illustration of inserts into the Bentley-Saxe Method}
+\label{fig:bsm-example}
+\end{figure}
-However, the method has poor worst-case insertion cost: if the
-entire structure is full, it must grow by another level, requiring
-a full reconstruction involving every record within the structure.
-A slight adjustment to the technique, due to Overmars and van Leuwen
-\cite{}, allows for the worst-case insertion cost to be bounded by
-$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing
-each reconstruction into small pieces, one of which is executed
-each time a new update occurs. This has the effect of bounding the
-worst-case performance, but does so by sacrificing the expected
-case performance, and adds a lot of complexity to the method. This
-technique is not used much in practice.\footnote{
- I've yet to find any example of it used in a journal article
- or conference paper.
-}
+Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
+dynamization is initially built over a set of records $x_1, x_2, \ldots,
+x_{10}$, with eight records in $\mathscr{I}_3$ and two in
+$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
+into the empty block $\mathscr{I}_0$. For the next insert, $x_{12}$, the
+first empty block is $\mathscr{I}_2$, and so the insert is performed by
+setting $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
+\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
+and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
+
+This technique is called a \emph{binary decomposition} of the data
+structure. For a BSM dynamization containing $n$ records, labeling each
+block with a $0$ if it is empty and a $1$ if it is full (reading from the
+largest block down to $\mathscr{I}_0$) yields the binary representation
+of $n$. For example, the final state of the structure in
+Figure~\ref{fig:bsm-example} contains $12$ records, and the labeling
+procedure results in $0\text{b}1100$, which is $12$ in binary. Inserts
+affect this representation of the structure in the same way that
+incrementing the binary number by $1$ does.
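+
+To make this procedure concrete, the following is a minimal sketch of BSM
+insertion in C++, using a sorted array as a stand-in for the static
+structure being dynamized. The \texttt{SortedArray} type and its
+\texttt{build}/\texttt{unbuild} interface are illustrative placeholders,
+not drawn from any particular implementation.
+\begin{verbatim}
+// A minimal sketch of Bentley-Saxe insertion. SortedArray stands in for
+// an arbitrary static structure supporting build and unbuild; the names
+// used here are illustrative only.
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <memory>
+#include <utility>
+#include <vector>
+
+using Record = int64_t;
+
+struct SortedArray {
+    std::vector<Record> data;
+    static SortedArray build(std::vector<Record> recs) {
+        std::sort(recs.begin(), recs.end());   // static construction
+        return SortedArray{std::move(recs)};
+    }
+    std::vector<Record> unbuild() const { return data; }
+};
+
+struct BSMDynamization {
+    // blocks[i] is either empty (nullptr) or holds exactly 2^i records
+    std::vector<std::unique_ptr<SortedArray>> blocks;
+
+    void insert(Record r) {
+        std::vector<Record> pool{r};
+        std::size_t i = 0;
+        // Empty the maximal run of full blocks, exactly like carry
+        // propagation when incrementing a binary counter.
+        while (i < blocks.size() && blocks[i]) {
+            auto recs = blocks[i]->unbuild();
+            pool.insert(pool.end(), recs.begin(), recs.end());
+            blocks[i].reset();
+            ++i;
+        }
+        if (i == blocks.size()) {
+            blocks.emplace_back();             // add a new, empty level
+        }
+        // pool now holds 1 + (2^i - 1) = 2^i records
+        blocks[i] = std::make_unique<SortedArray>(
+            SortedArray::build(std::move(pool)));
+    }
+};
+
+int main() {
+    BSMDynamization d;
+    for (Record r = 1; r <= 12; ++r) {
+        d.insert(r);
+    }
+    // After 12 inserts, blocks 2 and 3 are full: the 0b1100 state from
+    // the example above.
+    return 0;
+}
+\end{verbatim}
+The loop mirrors the carry propagation of a binary increment: the maximal
+run of full blocks is emptied and merged, together with the new record,
+into the first empty block.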
+
+By applying BSM to a data structure, a dynamized structure can be created
+with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n}\cdot \log_2 n\right) \\
+\text{Worst-case Insertion Cost:}&\quad \Theta\left(C(n)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right)
+\end{align*}
+This is a particularly attractive result because, for example, a data
+structure having $C(n) \in \Theta(n)$ will have an amortized insertion
+cost of $\Theta(\log_2 n)$, which is quite reasonable. The price paid for
+this is an extra logarithmic factor attached to the query complexity. It
+is also worth noting that the worst-case insertion cost remains the same
+as global reconstruction, but this case arises only rarely: in terms of
+the binary decomposition, the worst-case behavior is triggered only when
+the existing number overflows and a new digit must be added.
+
+\subsection{Delete Support}
+
+Classical dynamization techniques have also been developed with
+support for deleting records. In general, the same technique of global
+reconstruction that was used for inserting records can also be used to
+delete them. Given a record $r \in \mathcal{D}$ and a data structure
+$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
+deleted from the structure in $C(n)$ time as follows,
+\begin{equation*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
+\end{equation*}
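+As a minimal sketch, reusing the illustrative \texttt{SortedArray} and
+\texttt{Record} types from the insertion example above, this
+delete-by-reconstruction might look like the following.
+\begin{verbatim}
+// A sketch of delete-by-global-reconstruction, reusing the illustrative
+// SortedArray/Record types from the earlier insertion sketch.
+#include <algorithm>
+#include <utility>
+#include <vector>
+
+SortedArray delete_record(const SortedArray &idx, Record r) {
+    std::vector<Record> recs = idx.unbuild();
+    // remove one occurrence of r, then rebuild the structure from scratch
+    auto it = std::find(recs.begin(), recs.end(), r);
+    if (it != recs.end()) {
+        recs.erase(it);
+    }
+    return SortedArray::build(std::move(recs));
+}
+\end{verbatim}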
+However, supporting deletes within the dynamization schemes discussed
+above is more complicated. The core problem is that inserts affect the
+dynamized structure in a deterministic way, which allows the partitioning
+scheme to be leveraged when reasoning about performance. Deletes do not
+share this property.
+
+\begin{figure}
+\caption{A Bentley-Saxe dynamization for the integers on the
+interval $[1, 100]$.}
+\label{fig:bsm-delete-example}
+\end{figure}
+For example, consider a Bentley-Saxe dynamization that contains all
+integers on the interval $[1, 100]$, inserted in that order, shown in
+Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
+records from this structure, one at a time, using global reconstruction.
+This presents several problems,
+\begin{itemize}
+ \item For each record, we need to identify which block it is in before
+ we can delete it.
+ \item The cost of performing a delete is a function of which block the
+ record is in, which is a question of distribution and not easily
+ controlled.
+ \item As records are deleted, the structure will potentially violate
+ the invariants of the decomposition scheme used, which will
+ require additional work to fix.
+\end{itemize}
+
+
+
+\section{Limitations of Classical Dynamization Techniques}
+\label{sec:bsm-limits}
+
+While fairly general, these dynamization techniques have a number of
+limitations that prevent them from being directly usable as a general
+solution to the problem of creating database indexes. Because of the
+requirement that the query being answered be decomposable, many search
+problems cannot be addressed efficiently, or at all, by
+decomposition-based dynamization. The techniques also do nothing to reduce
+the worst-case insertion cost, resulting in extremely poor tail latency
+relative to hand-built dynamic structures. Finally, these approaches do a
+poor job of exposing the underlying configuration space to the user,
+meaning that the user can exert only limited control over the performance
+of the dynamized data structure. This section will discuss these
+limitations, and the rest of the document will be dedicated to proposing
+solutions to them.
+
+\subsection{Limits of Decomposability}
+\label{ssec:decomp-limits}
+Unfortunately, the DSP abstraction used as the basis of classical
+dynamization techniques has a few significant limitations that restrict
+its applicability,
+
+\begin{itemize}
+    \item The query must be broadcast identically to each block and cannot
+    be adjusted based on the state of the other blocks.
+
+    \item The query process is done in a single pass; it cannot be repeated.
+
+    \item The result merge operation must be $O(1)$ to maintain good query
+    performance.
+
+    \item The result merge operation must be commutative and associative,
+    and is called repeatedly to merge pairs of results.
+\end{itemize}
+
+These requirements restrict the types of queries that can be supported by
+the method efficiently. For example, k-nearest neighbor and independent
+range sampling are not decomposable.
+
+\subsubsection{k-Nearest Neighbor}
+\label{sssec-decomp-limits-knn}
+The k-nearest neighbor (k-NN) problem is a generalization of the nearest
+neighbor problem, which seeks to return the closest point within the
+dataset to a given query point. More formally, this can be defined as,
+\begin{definition}[Nearest Neighbor]
+
+    Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
+    be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+    between two points within $D$. The nearest neighbor problem, $NN(D,
+    q)$, returns some $d \in D$ that minimizes $f(d, q)$ for a given
+    query point, $q \in \mathbb{R}^d$.
+\end{definition}
+In practice, it is common to require $f(x, y)$ be a metric,\footnote
+{
+ Contrary to its vernacular usage as a synonym for ``distance'', a
+ metric is more formally defined as a valid distance function over
+ a metric space. Metric spaces require their distance functions to
+ have the following properties,
+ \begin{itemize}
+ \item The distance between a point and itself is always 0.
+ \item All distances between non-equal points must be positive.
+ \item For all points, $x, y \in D$, it is true that
+ $f(x, y) = f(y, x)$.
+ \item For any three points $x, y, z \in D$ it is true that
+ $f(x, z) \leq f(x, y) + f(y, z)$.
+ \end{itemize}
+
+ These distances also must have the interpretation that $f(x, y) <
+ f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
+ is the opposite of the definition of similarity, and so some minor
+ manipulations are usually required to make similarity measures work
+ in metric-based indexes. \cite{intro-analysis}
+}
+and the indexes used as examples for this problem in this work assume a
+metric distance, but this is not a fundamental aspect of the problem
+formulation. The nearest neighbor problem itself is decomposable,
+with a simple merge function that returns whichever of its two inputs
+has the smaller value of $f(x, q)$~\cite{saxe79}.
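+For instance, a minimal sketch of such a merge operator might look as
+follows; the \texttt{NNResult} type is illustrative, since the DSP
+formulation does not fix any particular record representation.
+\begin{verbatim}
+// A sketch of the constant-time merge operator for single nearest
+// neighbor: keep whichever partial result is closer to the query point.
+// The types here are illustrative only.
+#include <cstdint>
+
+struct NNResult {
+    int64_t id;    // identifier of the candidate record
+    double  dist;  // f(candidate, q), computed when the block was queried
+};
+
+// O(1), commutative, and associative, so NN is decomposable.
+NNResult merge_nn(const NNResult &a, const NNResult &b) {
+    return (a.dist <= b.dist) ? a : b;
+}
+\end{verbatim}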
+
+The k-nearest neighbor problem generalizes nearest-neighbor to return
+the $k$ nearest elements,
+\begin{definition}[k-Nearest Neighbor]
+
+ Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
+ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+ between two points within $D$. The k-nearest neighbor problem,
+ $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
+ such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
+\end{definition}
-\subsection{Limitations of the Bentley-Saxe Method}
+This can be thought of as solving the nearest-neighbor problem $k$ times,
+each time removing the returned result from $D$ prior to solving the
+problem again. Unlike the single nearest-neighbor case (which can be
+thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
+\begin{theorem}
+ k-NN is not a decomposable search problem.
+\end{theorem}
+\begin{proof}
+To prove this, consider the query $KNN(D, q, k)$ against some partitioned
+dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
+then there must exist some constant-time, commutative, and associative
+binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq \ell}
+R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
+k)$. Consider the evaluation of the merge operator against two arbitrary
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
+|R_j| = k$, and that the contents of $R$ must be the $k$ records from
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
+time. Therefore, k-NN is not a decomposable search problem.
+\end{proof}
+With that said, there is no fundamental restriction preventing the merging
+of the result sets; the merge simply cannot be performed within the
+required constant time. It is possible to merge the result sets in
+non-constant time, and so k-NN is $C(n)$-decomposable. Unfortunately,
+this classification brings with it a reduction in query performance as a
+result of the way result merges are performed.
+
+As a concrete example of these costs, consider using the Bentley-Saxe
+method to extend the VPTree~\cite{vptree}. The VPTree is a static,
+metric index capable of answering k-NN queries in $O(k \log n)$
+time. One possible merge algorithm for k-NN would be to push all
+of the elements in the two arguments onto a min-heap, and then pop off
+the first $k$. In this case, the cost of the merge operation would be
+$C(k) = k \log k$. Were $k$ assumed to be constant, the operation could
+be considered constant-time, but given that $k$ is bounded above only
+by $n$, this is not a safe assumption to make in general. Evaluating
+the total query cost for the extended structure yields,
+
+\begin{equation}
+    KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
+\end{equation}
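+
+A sketch of such a heap-based merge is shown below, assuming each partial
+result is a list of (distance, record identifier) pairs already computed
+against the query point; pushing the $2k$ candidates and popping the
+closest $k$ costs $O(k \log k)$.
+\begin{verbatim}
+// A sketch of a heap-based merge operator for k-NN partial results. Each
+// partial result is assumed to be a vector of (distance, record id)
+// pairs; the merge keeps the k candidates closest to the query point.
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <queue>
+#include <utility>
+#include <vector>
+
+using Candidate = std::pair<double, int64_t>;  // (distance to q, record id)
+
+std::vector<Candidate> merge_knn(const std::vector<Candidate> &a,
+                                 const std::vector<Candidate> &b,
+                                 std::size_t k) {
+    // min-heap ordered by distance
+    std::priority_queue<Candidate, std::vector<Candidate>,
+                        std::greater<Candidate>> heap;
+    for (const auto &c : a) heap.push(c);
+    for (const auto &c : b) heap.push(c);
+
+    // pop the k nearest candidates: O(k log k) in total
+    std::vector<Candidate> result;
+    while (!heap.empty() && result.size() < k) {
+        result.push_back(heap.top());
+        heap.pop();
+    }
+    return result;
+}
+\end{verbatim}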
+
+The reason for this large increase in cost is the repeated application
+of the merge operator. The Bentley-Saxe method requires applying the
+merge operator in a binary fashion to each partial result, multiplying
+its cost by a factor of $\log n$. Thus, the constant-time requirement
+of standard decomposability is necessary to keep the cost of the merge
+operator from appearing within the complexity bound of the entire
+operation in the general case.\footnote {
+ There is a special case, noted by Overmars, where the total cost is
+ $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
+ \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
+ case where the cost of the query and merge operation are sufficiently
+ large to consume the logarithmic factor, and so it doesn't represent
+ a special case with better performance.
+}
+If we could revise the result merging operation to remove this duplicated
+cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
+queries.
+
+\subsubsection{Independent Range Sampling}
+
+Another problem that is not decomposable is independent sampling. There
+are a variety of problems falling under this umbrella, including weighted
+set sampling, simple random sampling, and weighted independent range
+sampling, but we will focus on independent range sampling here.
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+ Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+ interval $q = [x, y]$ and an integer $k$, an independent range
+ sampling query returns $k$ independent samples from $D \cap q$
+ with each point having equal probability of being sampled.
+\end{definition}
+This problem immediately encounters a category error when considering
+whether it is decomposable: the result set is randomized, whereas
+the conditions for decomposability are defined in terms of an exact
+matching of records in result sets. To work around this, a slight abuse
+of definition is in order: assume that the equality conditions within
+the DSP definition can be interpreted to mean ``the contents in the two
+sets are drawn from the same distribution''. This enables the category
+of DSP to apply to this type of problem. More formally,
+\begin{definition}[Decomposable Sampling Problem]
+    A sampling problem $F: (D, Q) \to R$ is decomposable if and
+    only if there exists a constant-time computable, associative, and
+    commutative binary operator $\mergeop$ such that,
+    \begin{equation*}
+        F(A \cup B, q) \sim F(A, q)~ \mergeop ~F(B, q)
+    \end{equation*}
+\end{definition}
+Even with this abuse, however, IRS cannot generally be considered
+decomposable; it is at best $C(n)$-decomposable. The reason for this is
+that matching the distribution requires drawing the appropriate number
+of samples from each partition of the data. Even in the special
+case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
+from each partition that must appear in the result set cannot be known
+in advance due to differences in the selectivity of the predicate across
+the partitions.
+
+\begin{example}[IRS Sampling Difficulties]
+
+    Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
+    \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
+    an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
+    partitions have the same size, it seems sensible to evenly distribute
+    the samples across them ($4$ samples from each partition). Applying
+    the query predicate to the partitions results in the following,
+    $d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4, 4\}$.
+
+    In expectation, then, the first result set will contain $R_0 = \{3,
+    3, 4, 4\}$, as it has a 50\% chance of sampling a $3$ and the same
+    probability of a $4$. The second and third result sets can only
+    be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
+    together, we'd find that the probability distribution of the sample
+    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
+    the same sampling operation over the full dataset (not partitioned),
+    the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
+
+\end{example}
+
+The problem is that the number of samples drawn from each partition needs to be
+weighted by the number of elements satisfying the query predicate in that
+partition. In the above example, drawing $4$ samples from $D_1$ gives more
+weight to $3$ than it has within the base dataset. This can be worked around
+by sampling a full $k$ records from each partition and returning both the
+sample and the number of records satisfying the predicate as that partition's
+query result, then performing another pass of IRS as the merge operator.
+However, this is the same approach as was used for k-NN above, and it leaves
+IRS firmly in the $C(n)$-decomposable camp. If it were possible to
+pre-calculate the number of samples to draw from each partition, then a
+constant-time merge operation could be used, as sketched below.
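+
+As a sketch of what such a pre-calculation might look like, assume (and
+this is an assumption, not something provided by the Bentley-Saxe method
+itself) that each block can cheaply report the number of its records
+matching the query predicate. Each of the $k$ samples can then be assigned
+to a block with probability proportional to that block's match count.
+\begin{verbatim}
+// A sketch of pre-calculating per-block sample counts for IRS, assuming
+// each block can report how many of its records match the predicate.
+// Assigning each sample to a block with probability proportional to its
+// match count preserves the distribution of the unpartitioned data.
+#include <cstddef>
+#include <random>
+#include <vector>
+
+std::vector<std::size_t>
+allocate_samples(const std::vector<std::size_t> &match_counts,
+                 std::size_t k, std::mt19937 &rng) {
+    // pick a block index with probability proportional to its match count
+    std::discrete_distribution<int> pick(match_counts.begin(),
+                                         match_counts.end());
+    std::vector<std::size_t> alloc(match_counts.size(), 0);
+    for (std::size_t i = 0; i < k; i++) {
+        alloc[pick(rng)]++;
+    }
+    return alloc;
+}
+
+int main() {
+    // Match counts from the example above ([3,4] against D_0, D_1, D_2)
+    std::vector<std::size_t> counts = {2, 1, 5};
+    std::mt19937 rng(42);
+    auto alloc = allocate_samples(counts, 12, rng);
+    (void)alloc;  // each block i would then be sampled alloc[i] times
+    return 0;
+}
+\end{verbatim}
+Each block $i$ is then asked for \texttt{alloc[i]} independent samples, and
+the per-block samples can be concatenated with constant work per block.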
+
+\subsection{Insertion Tail Latency}
+
+\subsection{Configurability}
+
+\section{Conclusion}
+This chapter discussed the necessary background information pertaining to
+queries and search problems, indexes, and techniques for dynamic extension. It
+described the potential for using custom indexes for accelerating particular
+kinds of queries, as well as the challenges associated with constructing these
+indexes. The remainder of this document will seek to address these challenges
+through modification and extension of the Bentley-Saxe method, describing work
+that has already been completed, as well as the additional work that must be
+done to realize this vision.