\chapter{Background}
This chapter introduces important background information that will be used throughout the remainder of the document. We'll first define precisely what is meant by a query, and consider some special classes of query that will become relevant in our discussion of dynamic extension. We'll then consider the difference between a static and a dynamic structure, and examine techniques for converting static structures into dynamic ones in a variety of circumstances.

\section{Database Indexes}
The term \emph{index} is often abused within the database community to refer to a range of closely related, but distinct, conceptual categories\footnote{The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', to data structures used specifically in query processing, etc.}. This ambiguity is rarely problematic, as the subtle differences between these categories are not often significant, and context clarifies the intended meaning in situations where they are. However, this work explicitly operates at the interface of two of these categories, and so it is important to disambiguate between them. As a result, we will use the word index to refer to a very specific type of structure, defined below.

\subsection{The Traditional Index}
A database index is a specialized structure which provides a means to efficiently locate records that satisfy specific criteria, enabling more efficient processing of the queries it supports. A traditional database index can be modeled as a function mapping a set of attribute values, called a key, $\mathcal{K}$, to a set of record identifiers, $\mathcal{R}$. Technically, the codomain of an index can be either a record identifier, a set of record identifiers, or the physical record itself, depending upon the configuration of the index. For the purposes of this work, the focus will be on the first of these, but in principle any of the three index types could be used with little material difference to the discussion. Formally speaking, we will use the following definition of a traditional database index,
\begin{definition}[Traditional Index]
Consider a set of database records, $\mathcal{D}$. An index over these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where $\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, called a \emph{key}.
\end{definition}

In order to facilitate this mapping, indexes are built using data structures. The specific data structure used has implications for the performance of the index and the situations in which the index is effective. Broadly speaking, traditional database indexes fall into two categories: ordered indexes and unordered indexes. The former allow for iteration over the set of record identifiers in some sorted order, starting at the returned record. The latter allow for point-lookups only.

A small set of data structures accounts for the vast majority of database indexes. The most common range index in RDBMSs is the B-tree\footnote{By \emph{B-tree} here, I am referring not to the B-tree data structure itself, but to a wide range of related structures derived from it. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc.} based index, and key-value stores commonly use indices built on the LSM-tree. Some databases also support unordered indexes using hash tables.
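To make the distinction between ordered and unordered indexes concrete, the sketch below shows minimal interfaces that the two categories might expose. This is an illustrative Python sketch only; the type aliases and method names (\texttt{lookup}, \texttt{scan\_from}) are hypothetical and are not drawn from any particular system.
\begin{verbatim}
from abc import ABC, abstractmethod
from typing import Iterator, List

RecordID = int   # hypothetical: an opaque identifier for a stored record
Key = tuple      # hypothetical: one or more attribute values


class UnorderedIndex(ABC):
    """Supports point-lookups only (e.g., a hash table based index)."""

    @abstractmethod
    def lookup(self, key: Key) -> List[RecordID]:
        """Return the identifiers of all records whose key matches exactly."""


class OrderedIndex(UnorderedIndex):
    """Additionally supports sorted iteration (e.g., a B-tree based index)."""

    @abstractmethod
    def scan_from(self, key: Key) -> Iterator[RecordID]:
        """Iterate over record identifiers in key order, starting at the
        first record whose key is >= the given key."""
\end{verbatim}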
Beyond these core structures, some specialized databases or database extensions support indexes based on other structures, such as the R-tree\footnote{Like the B-tree, R-tree is used here as a signifier for a general class of related data structures.} for spatial databases, or approximate small-world graph models for similarity search.

\subsection{The Generalized Index}
The previous section discussed the traditional definition of an index, as might be found in a database systems textbook. However, this definition is limited by its association specifically with mapping key fields to records. For the purposes of this work, I will be considering a slightly broader definition of index,
\begin{definition}[Generalized Index]
Consider a set of database records, $\mathcal{D}$, and a search problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$, is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$.
\end{definition}
\emph{Search problems} are the topic of the next section, but in brief a search problem represents a general class of query, such as a range scan, point lookup, k-nearest neighbor query, etc. A traditional index is a special case of a generalized index, with $\mathcal{Q}$ being a point-lookup or range query based on a set of record attributes.

\subsection{Indices in Query Processing}
A database management system utilizes indices to accelerate certain types of query. Queries are expressed to the system in some high-level language, such as SQL or Datalog. These are generalized languages capable of expressing a wide range of possible queries. The DBMS is then responsible for converting these queries into a set of primitive data access procedures that are supported by the underlying storage engine. There are a variety of techniques for this, including mapping the query directly to a tree of relational algebra operators and interpreting that tree, query compilation, etc. Ultimately, however, the expressiveness of this internal query representation is limited by the routines supported by the storage engine.

As an example, consider the following SQL query (representing a 2-dimensional k-nearest neighbor search)\footnote{There are more efficient ways of answering this query, but I'm aiming for simplicity here to demonstrate my point.},
\begin{verbatim}
SELECT dist(A.x, A.y, Qx, Qy) AS d, A.key
FROM A
WHERE A.property = filtering_criterion
ORDER BY d
LIMIT 5;
\end{verbatim}
This query will be translated into a logical query plan (a sequence of relational algebra operators) by the query planner, which could result in a plan like this,
\begin{verbatim}
query plan here
\end{verbatim}
With this logical query plan, the DBMS must next determine which supported operations it can use to most efficiently answer the query. For example, the selection operation (A) could be physically manifested as a table scan, or could be answered using an index scan if there is an ordered index over \texttt{A.property}. The query optimizer will make this decision based on its estimate of the selectivity of the predicate. This may result in one of the following physical query plans,
\begin{verbatim}
physical query plan
\end{verbatim}
In either case, however, the space of possible physical plans is limited by the available access methods: either a sorted scan on an attribute (index) or an unsorted scan (table scan). The database must filter for all elements matching the filtering criterion, calculate the distance between each of these points and the query, and then sort the results to get the final answer.
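To make this limitation concrete, the following Python sketch mirrors what either available physical plan ultimately computes. The record layout and function name are hypothetical stand-ins used for illustration; they do not correspond to any actual executor code.
\begin{verbatim}
import math
from typing import Iterable, List, Tuple

# hypothetical record layout: (key, x, y, property)
Record = Tuple[int, float, float, str]

def knn_filter_then_sort(records: Iterable[Record], qx: float, qy: float,
                         criterion: str, k: int = 5) -> List[Tuple[float, int]]:
    """Evaluate the example query with only scan-based access methods:
    filter every record, compute every distance, then sort and truncate."""
    candidates = []
    for key, x, y, prop in records:          # table scan (or index scan on property)
        if prop == criterion:                # selection
            d = math.dist((x, y), (qx, qy))  # compute dist() for every survivor
            candidates.append((d, key))
    candidates.sort()                        # ORDER BY d: needs the full result set
    return candidates[:k]                    # LIMIT 5
\end{verbatim}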
Additionally, note that the sort operation in the plan is a pipeline-breaker. If this plan were to appear as a subtree in a larger query plan, the overall plan would need to wait for the full evaluation of this sub-query before it could proceed, as sorting requires the full result set.

Imagine a world where a new index were available to our DBMS: a nearest neighbor index. This index would allow iteration over records in sorted order relative to some predefined metric and a query point. If such an index existed over \texttt{(A.x, A.y)} using \texttt{dist}, then a third physical plan would be available to the DBMS,
\begin{verbatim}
\end{verbatim}
This plan pulls records in order of their distance to \texttt{Q} directly, using the index, and then filters them, avoiding the pipeline-breaking sort operation. While it's not obvious in this case that this new plan is superior (this would depend a lot on the selectivity of the predicate), it is a third option. It becomes increasingly attractive as the fraction of records satisfying the predicate grows, and is clearly superior in the case where the predicate matches every record (requiring the consideration of only $5$ records in total). The construction of this special index will be considered in Section~\ref{ssec:knn}.

This use of query-specific indexing schemes also presents a query planning challenge: how does the database know when a particular specialized index can be used for a given query, and how can specialized indexes broadcast their capabilities to the query planner in a general fashion? This work is focused on the problem of enabling the existence of such indexes, rather than facilitating their use; however, these are important questions that must be considered in future work for this solution to be viable. There has been prior work surrounding the use of arbitrary indexes in queries, such as~\cite{byods-datalog}. This problem is considered out-of-scope for the proposed work, but will be revisited in the future.

\section{Queries and Search Problems}
In our discussion of generalized indexes, we encountered \emph{search problems}. The term search problem is used within the data structures literature in a manner similar to how the database community sometimes uses the term query\footnote{Like the term index, the term query is often abused and used to refer to several related, but slightly different, things. In the vernacular, a query can refer to either a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language.}, referring to a general class of questions asked of data. Examples include range queries, point-lookups, nearest neighbor queries, predicate filtering, random sampling, etc. Formally, for the purposes of this work, we will define a search problem as follows,
\begin{definition}[Search Problem]
Given three multisets, $D$, $R$, and $Q$, a search problem is a function $F: (D, Q) \to R$, where $D$ represents the domain of data to be searched, $Q$ represents the domain of query parameters, and $R$ represents the answer domain.\footnote{It is important to note that it is not required that $R \subseteq D$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $R \subseteq D$, but this need not be a universal constraint.
}
\end{definition}

We will use the word \emph{query} to refer to a specific instance of a search problem, except when it is used as part of the generally accepted name of a search problem (e.g., range query).
\begin{definition}[Query]
Given three multisets, $D$, $R$, and $Q$, a search problem $F$, and a specific set of query parameters $q \in Q$, a query is a specific instance of the search problem, $F(D, q)$.
\end{definition}
As an example of these definitions, a \emph{membership test} or \emph{range query} would be considered a search problem, while a range query over the interval $[10, 99]$ would be a query.

\subsection{Decomposable Search Problems}
An important subset of search problems is that of decomposable search problems (DSPs). This class was first defined by Saxe and Bentley as follows,
\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}
The constant-time requirement was used to prove bounds on the cost of evaluating DSPs over data broken across multiple partitions. Further work by Overmars lifted this constraint and considered a more general class of DSP,
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}

Decomposability is an important property because it allows search problems to be answered over partitioned datasets. The details of this will be discussed in Section~\ref{ssec:bentley-saxe} in the context of creating dynamic data structures. Many common types of search problems appearing in databases, such as range queries and predicate filtering, are decomposable. To demonstrate that a search problem is decomposable, it is necessary to show the existence of the merge operator, $\square$, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two results, simple induction demonstrates that the problem is decomposable even in cases with more than two partial results. As an example, consider range queries,
\begin{definition}[Range Query]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval $q = [x, y]$, with $x, y \in \mathbb{R}$, a range query returns all points in $D \cap q$.
\end{definition}
\begin{theorem}
Range queries are a DSP.
\end{theorem}
\begin{proof}
Let $\square$ be the set union operator ($\cup$). Applying this to Definition~\ref{def:dsp}, we have
\begin{align*}
(A \cup B) \cap q = (A \cap q) \cup (B \cap q)
\end{align*}
which is true by the distributivity of set intersection over union. Assuming an implementation allowing for an $O(1)$ set union operation, range queries are DSPs.
\end{proof}

Because the codomain of a DSP is not restricted, more complex output structures can be used to allow problems that are not directly decomposable to be converted into DSPs, possibly with some minor post-processing. For example, the calculation of the mean of a set of numbers can be constructed as a DSP using the following technique,
\begin{theorem}
The calculation of the average of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
Define the search problem as $A: D \to \mathbb{R} \times \mathbb{Z}$, where $D \subset \mathbb{R}$ is a multiset. The output tuple contains the sum of the values within the input set and the cardinality of the input set. Let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$, and define $A(D_1)~\square~A(D_2) = (s_1 + s_2, c_1 + c_2)$. Applying Definition~\ref{def:dsp}, we have
\begin{align*}
A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1)~\square~A(D_2)
\end{align*}
From the resulting tuple $(s, c)$, the average can be determined in constant time by taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP.
\end{proof}

\section{Dynamic Extension Techniques}
Because data in a database is regularly updated, data structures intended to be used as indexes must support updates (inserts, in-place modification, and deletes) to their data. In principle, any data structure can support updates to its underlying data through global reconstruction: adjusting the record set and then rebuilding the entire structure. Ignoring this trivial (and highly inefficient) approach, a data structure with support for updates is called \emph{dynamic}, and one without support for updates is called \emph{static}. In this section, we discuss approaches for modifying a static data structure to grant it support for updates, a process called \emph{dynamic extension} or \emph{dynamization}. A theoretical survey of this topic can be found in~\cite{overmars83}, but that survey does not cover several techniques that are used in practice. As such, much of this section constitutes our own analysis, tying together threads from a variety of sources.

\subsection{Local Reconstruction}
One way of viewing updates to a data structure is as reconstructions of all or part of the structure. To minimize the cost of an update, it is ideal to minimize the size of the accompanying reconstruction, either by carefully structuring the data so that an update disrupts as few surrounding records as possible, or by deferring reconstructions and amortizing their cost over as many updates as possible.

While minimizing the size of a reconstruction seems the most obvious, and best, approach, it is limited in its applicability. The more related ``nearby'' records in the structure are, the more records will be affected by a change. Records can be related in terms of some ordering of their values, which we'll term a \emph{spatial ordering}, or in terms of their order of insertion into the structure, which we'll term a \emph{temporal ordering}. Note that these terms don't imply anything about the nature of the data themselves; rather, they relate to the principles used by the data structure to arrange the records.

Arrays provide the extreme version of both of these ordering principles. In an unsorted array, in which records are appended to the end, there is no spatial ordering dependence between records. This means that an insert or update requires no local reconstruction beyond the record being directly affected.\footnote{A delete can also be performed without any structural adjustments in a variety of ways. Reorganization of the array as a result of deleted records serves an efficiency purpose, but isn't required for the correctness of the structure.} However, the order of records in the array \emph{does} express a strong temporal dependency: the index of a record in the array gives its exact insertion order.
A sorted array provides exactly the opposite situation. The position of a record in the array reflects an exact spatial ordering of the records with respect to their sorting function. This means that an update or insert will require reordering a large number of records (potentially all of them, in the worst case). Because of the stronger spatial dependence between records in the structure, an update requires a larger-scale reconstruction. Additionally, there is no temporal component to the ordering of the records: inserting a set of records into a sorted array will produce the same final structure irrespective of insertion order.

It's worth noting that the spatial dependency discussed here, as it relates to reconstruction costs, is based on the physical layout of the records and not their logical ordering. To exemplify this, a sorted singly-linked list can maintain the same logical order of records as a sorted array, but limits the spatial dependence of each record to its preceding node. This means that an insert into this structure requires only a single node update, regardless of where in the structure the insert occurs.

The amount of spatial dependence in a structure directly reflects a trade-off between read and write performance. In the above example, performing a lookup for a given record in a sorted array requires asymptotically fewer comparisons in the worst case than in an unsorted array, because the spatial dependencies can be exploited for an accelerated search (binary vs. linear search). Interestingly, this remains the case for lookups against a sorted array vs. a sorted linked list. Even though both structures have the same logical order of records, the limited spatial dependencies between nodes in a linked list force the lookup to perform a linear scan anyway.

A balanced binary tree sits between these two extremes. Like a linked list, individual nodes have very few connections. However, the nodes are arranged in such a way that a connection between two nodes implies further information about the ordering of the children of those nodes. In this light, rebalancing of the tree can be seen as maintaining a certain degree of spatial dependence between the nodes, ensuring that the records are balanced between the two children of each node. A very general summary of tree rebalancing techniques can be found in~\cite{overmars83}. Using an AVL tree~\cite{avl} as a specific example, each insert involves adding the new node and updating its parent (as in a simple linked list), followed by some larger-scale local reconstruction in the form of tree rotations to maintain the balance factor invariant. This means that insertion requires more reconstruction effort than the single pointer update of the linked list case, but results in much more efficient searches (which, as it turns out, also makes insertion more efficient in general, even with the overhead, because finding the insertion point is much faster).

\subsection{Amortized Local Reconstruction}
In addition to controlling update cost by arranging the structure so as to reduce the amount of reconstruction necessary to maintain the desired level of spatial dependence, update costs can also be reduced by amortizing the cost of local reconstruction over multiple updates. This is often done in one of two ways: leaving gaps in the structure or adding overflow buckets. These gaps and buckets allow the data structure to sustain a buffer of insertion capacity before a reconstruction is triggered.
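As a minimal illustration of the gap approach (the overflow-bucket approach is analogous), the following Python sketch shows a fixed-capacity sorted node that absorbs inserts into its free slots and pays for a local reconstruction, here a split, only once those slots are exhausted. The class and method names are hypothetical and are not taken from any particular system.
\begin{verbatim}
import bisect
from typing import List, Optional, Tuple

class GappedNode:
    """A fixed-capacity sorted run; unused slots act as gaps that make
    most inserts cheap. Only a full node pays for reconstruction."""

    def __init__(self, capacity: int = 8, keys: Optional[List[int]] = None):
        self.capacity = capacity
        self.keys: List[int] = sorted(keys or [])

    def insert(self, key: int) -> Optional[Tuple["GappedNode", "GappedNode"]]:
        if len(self.keys) < self.capacity:
            bisect.insort(self.keys, key)       # cheap path: fill a gap
            return None
        # expensive path: local reconstruction (a split), which also
        # replenishes the supply of gaps in both resulting nodes
        bisect.insort(self.keys, key)
        mid = len(self.keys) // 2
        return (GappedNode(self.capacity, self.keys[:mid]),
                GappedNode(self.capacity, self.keys[mid:]))
\end{verbatim}
In a real B$^+$-tree, the split would also propagate a separator key into the parent node; that bookkeeping is omitted here for brevity.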
A classic example of the gap approach is found in the B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well as in open addressing for hash tables. In a B$^+$-tree, each node has a fixed size and must be at least half-utilized (aside from the root node). The empty spaces within these nodes are gaps, which can be cheaply filled with new records on insert. Only when a node has been filled must a local reconstruction (called a structural modification operation, or SMO, in the B-tree context) occur to redistribute the data into multiple nodes and replenish the supply of gaps. This approach is particularly well suited to data structures in contexts where the natural unit of storage is larger than a record, as in disk-based (with 4KiB pages) or cache-optimized (with 64B cachelines) structures. The gap-based approach was also used to create ALEX, an updatable learned index~\cite{ALEX}.

The gap approach has a number of disadvantages. It results in a somewhat sparse structure, thereby wasting storage. For example, a B$^+$-tree requires all nodes other than the root to be at least half full, meaning that in the worst case up to half of the space occupied by the structure could be taken up by gaps. Additionally, this scheme results in some inserts being more expensive than others: most new records will occupy an available gap, but some will trigger more expensive SMOs. In particular, it has been observed with B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}: the gaps in many nodes fill at about the same time, leading to periodic clusters of high-cost structural modifications.

Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam}, as well as in hash tables with closed addressing. In this approach, the parts of the structure into which records would be inserted (leaf nodes in an ISAM-tree, directory entries in closed-addressing hashing) hold a pointer to an overflow location, where newly inserted records can be placed. This allows the structure, in theory, to sustain an unlimited number of insertions. However, read performance degrades, because the more overflow capacity is utilized, the less the records in the structure are ordered according to the data structure's definition. Thus, a reconstruction is periodically necessary to distribute the overflow records into the structure itself.

\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
Another approach to supporting updates is to amortize the cost of global reconstruction over multiple updates. This approach can take one of three forms,
\begin{enumerate}
\item Pairing a dynamic data structure (called a buffer or memtable) with an instance of the structure being extended. Updates are written to the buffer, and when the buffer is full its records are merged with those in the static structure, and the structure is rebuilt. This approach is used by one version of the originally proposed LSM-tree~\cite{oneil93}. Technically, this technique was proposed in that work for the purpose of converting random writes into sequential ones (all structures involved are dynamic), but it can be used for dynamization as well.
\item Creating multiple, smaller data structures, each containing a partition of the records from the dataset, and reconstructing individual structures to accommodate new inserts in a systematic manner. This technique is the basis of the Bentley-Saxe method~\cite{saxe79}.
\item Using both of the above techniques at once. This is the approach used by modern incarnations of the LSM-tree~\cite{rocksdb}.
\end{enumerate}
In all three cases, it is necessary for the search problem associated with the index to be a DSP, as answering it requires querying multiple structures (the buffer and/or one or more instances of the data structure) and merging the results together to produce a final result. This section will focus exclusively on the Bentley-Saxe method, as it is the basis for our proposed methodology.

When dividing records across multiple structures, there is a clear trade-off between read performance and write performance. Keeping the individual structures small reduces the cost of reconstruction, and thereby increases update performance. However, this also means that more structures will be required to accommodate the same number of records, when compared to a scheme that allows the structures to be larger. As each structure must be queried independently, this leads to worse query performance. The reverse is also true: fewer, larger structures will have better query performance and worse update performance, with the extreme limit being a single structure that is fully rebuilt on each insert.

\begin{figure}
\caption{Inserting a new record using the Bentley-Saxe method.}
\label{fig:bsm-example}
\end{figure}

The key insight of the Bentley-Saxe method~\cite{saxe79} is that a good balance can be struck by using geometrically increasing structure sizes. In Bentley-Saxe, the sub-structures are ``stacked'', with the bottom level having a capacity of a single record and each subsequent level doubling in capacity. When an update is performed, the first empty level is located and a reconstruction is triggered, merging the structures of all levels below this empty one, along with the new record. An example of this process is shown in Figure~\ref{fig:bsm-example}. The merit of this approach is that ``most'' reconstructions involve the smaller data structures towards the bottom of the sequence, while most of the records reside in large, infrequently updated structures towards the top. This balances the read and write implications of structure size, while also allowing the number of structures required to represent $n$ records to be bounded by $O(\log n)$ in the worst case. Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$ query cost, the Bentley-Saxe method will produce a dynamic data structure with,
\begin{align}
\text{Query Cost:} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
\text{Amortized Insert Cost:} \qquad & O\left(\frac{P(n)}{n} \log n\right)
\end{align}
However, the method has poor worst-case insertion cost: if the entire structure is full, it must grow by another level, requiring a full reconstruction involving every record within the structure. A slight adjustment to the technique, due to Overmars and van Leeuwen~\cite{}, allows the worst-case insertion cost to be bounded by $O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing each reconstruction into small pieces, one of which is executed each time a new update occurs. This bounds the worst-case performance, but sacrifices expected-case performance and adds significant complexity to the method. This technique is not used much in practice.\footnote{I've yet to find any example of it used in a journal article or conference paper.}

\subsection{Limitations of the Bentley-Saxe Method}
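Before examining these limitations, it is useful to have a concrete picture of the mechanics described above. The following Python sketch implements the basic scheme for a decomposable search problem, assuming a static structure that exposes only construction from a set of records and a query operation, together with a DSP merge operator; all class and method names are illustrative and not drawn from an existing system.
\begin{verbatim}
from typing import Any, Callable, List, Optional

class StaticStructure:
    """Stand-in for an arbitrary static index: built once, then read-only."""
    def __init__(self, data: List[Any]):
        self.data = list(data)       # placeholder for an expensive build step
    def query(self, predicate: Callable[[Any], bool]) -> List[Any]:
        return [x for x in self.data if predicate(x)]

class BentleySaxe:
    """Levels hold static structures of capacity 1, 2, 4, ... or are empty."""
    def __init__(self, merge: Callable[[List[Any], List[Any]], List[Any]]):
        self.levels: List[Optional[StaticStructure]] = []
        self.merge = merge           # the DSP merge operator (the square operator)

    def insert(self, record: Any) -> None:
        carry = [record]
        for i, level in enumerate(self.levels):
            if level is None:
                self.levels[i] = StaticStructure(carry)   # rebuild at level i
                return
            carry.extend(level.data)                      # absorb the full level
            self.levels[i] = None
        self.levels.append(StaticStructure(carry))        # structure grows

    def query(self, predicate: Callable[[Any], bool]) -> List[Any]:
        result: List[Any] = []
        for level in self.levels:
            if level is not None:
                result = self.merge(result, level.query(predicate))
        return result

# Example: a filtering DSP whose merge operator is list concatenation.
ds = BentleySaxe(merge=lambda a, b: a + b)
for v in range(10):
    ds.insert(v)
print(ds.query(lambda x: 3 <= x <= 6))   # [3, 4, 5, 6], in some order
\end{verbatim}
Insertion behaves like incrementing a binary counter over the levels: most insertions rebuild only the small structures near the bottom, while an insertion that finds every level full triggers the expensive full rebuild discussed above.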