\chapter{Introduction}
\label{chap:intro}

\section{Motivation}

Modern relational database management systems (RDBMS) are founded upon a set-based representation of data~\cite{codd70}. This model is very flexible and can be used to represent data of a wide variety of types, from standard tabular information, to vectors, to graphs, and more. However, this flexibility comes at a significant cost in terms of query performance: the most basic data access operation is a linear table scan. To work around this limitation, RDBMS support the creation of special data structures called indices, which can be used to accelerate particular types of queries. To take full advantage of these structures, databases feature sophisticated query planning and optimization systems that can identify opportunities to utilize these indices~\cite{cowbook}.

This approach works well for those types of queries for which an index has been designed and integrated into the database. Unfortunately, many RDBMS support only a very limited set of indices, intended for accelerating single-dimensional range queries and point lookups~\cite{mysql-btree-hash, cowbook}. This is a significant limitation, because one of the major challenges currently facing data systems is the processing of complex analytical queries of varying types over large sets of data. These queries and data types are nominally supported by a relational database, but are not well addressed by existing indexing techniques and as a result perform poorly. This has led to the development of a variety of specialized systems for particular types of queries, such as spatial systems~\cite{postgis-doc}, vector databases~\cite{pinecone-db}, and graph databases~\cite{neptune, neo4j}. At the heart of these specialized systems are specialized indices, along with the accompanying query processing and optimization architectures necessary to utilize them effectively.

However, the development of a novel data processing system for a specific type of query is not a trivial process. While specialized data structures, which often already exist, are at the heart of such systems, meaningfully using such a data structure in a database requires adding a large number of additional features. A recent work on extending Datalog with support for user-defined data structures demonstrates both the benefits and challenges associated with the use of specialized indices: it showed significant improvements in query processing time and space requirements when using custom indices, but required that the user-defined structures support concurrent updates~\cite{byods-datalog}. In practice, to be useful within the context of a database, a data structure must support inserts and deletes (collectively referred to as updates), as well as concurrency control that satisfies standardized isolation semantics~\cite{cowbook}, support for crash recovery of the index in the case of a system failure~\cite{aries}, and possibly more. The process of extending an existing or novel data structure with support for all of these functions is a major barrier to their use.

As a current example that demonstrates this problem, consider the recent development of learned indices. These are a broad class of data structures that use various techniques to approximate a function mapping a key onto its location in storage. In principle, this model allows for better space efficiency of the index, as well as improved lookup performance.
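As a rough illustration of the idea, the sketch below shows a minimal, hypothetical learned index over a sorted array: a single linear model predicts a key's position, and a binary search bounded by the model's maximum observed error corrects the prediction. All names here are illustrative, and the example is a simplification rather than a description of any published design.

\begin{verbatim}
// A minimal, hypothetical learned index over a sorted array.
// Assumes at least two distinct keys.
#include <algorithm>
#include <cstdint>
#include <vector>

class LinearLearnedIndex {
    std::vector<uint64_t> keys;      // sorted keys ("storage")
    double slope = 0, intercept = 0; // the learned model
    size_t max_err = 0;              // worst-case prediction error

    size_t predict(uint64_t key) const {
        double p = slope * double(key) + intercept;
        p = std::clamp(p, 0.0, double(keys.size() - 1));
        return size_t(p);
    }

public:
    explicit LinearLearnedIndex(std::vector<uint64_t> sorted)
        : keys(std::move(sorted)) {
        // Fit position = slope * key + intercept through the endpoints.
        // (Published designs fit piecewise models, use least squares, etc.)
        slope = double(keys.size() - 1) / double(keys.back() - keys.front());
        intercept = -slope * double(keys.front());
        // Record the worst-case error to bound the search on lookup.
        for (size_t i = 0; i < keys.size(); i++) {
            size_t p = predict(keys[i]);
            max_err = std::max(max_err, p > i ? p - i : i - p);
        }
    }

    bool contains(uint64_t key) const {
        size_t p = predict(key);
        size_t lo = p > max_err ? p - max_err : 0;
        size_t hi = std::min(keys.size(), p + max_err + 1);
        // Search only the small window around the predicted position.
        return std::binary_search(keys.begin() + lo, keys.begin() + hi, key);
    }
};
\end{verbatim}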
This concept was first proposed by Kraska et al. in 2017, when they published a paper on the first learned index, RMI~\cite{RMI}. This index succeeded in showing that a learned model can be both faster and smaller than a conventional range index, but the proposed solution did not support updates. The first (non-concurrently) updatable learned index, ALEX, took a year and a half to appear~\cite{alex}. Over the course of the subsequent three years, several learned indices with concurrency support were proposed~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a recent performance study~\cite{10.14778/3551793.3551848} showed that these were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352}, a traditional index. The same study did, however, demonstrate that a new design, ALEX+, was able to outperform ART-OLC under certain circumstances. Even with this result, learned indices are not generally considered production-ready, because they suffer from significant performance regressions under certain workloads and are highly sensitive to the distribution of keys~\cite{10.14778/3551793.3551848,alex-aca}. Despite the demonstrable advantages of the technique and over half a decade of development, learned indices still have not reached a generally usable state.

It would not be an exaggeration to say that there are dozens of novel data structures proposed each year at data structures and systems conferences, many of which solve useful problems. However, the burden of producing a useful database index from these structures is great, and many of them either never see use, or at least require a significant amount of time and effort before they can be deployed. If there were a way to bypass much of this additional development time by \emph{automatically} extending the feature set of an existing data structure to produce a usable index, many of these structures would become readily accessible to database practitioners, and the capabilities of database systems could be greatly enhanced. It is our goal with this work to make a significant step in this direction.

\section{Existing Attempts}

At present, there are several lines of work targeted at reducing the development burden associated with creating specialized indices. We classify them into three broad categories:
\begin{itemize}
\item \textbf{Automatic Index Composition.} This line of work seeks to automatically compose an instance-optimized data structure for indexing static data by examining the workload and combining a collection of basic primitive structures to optimize performance.
\item \textbf{Generalized Index Templates.} This line of work seeks to introduce generalized data structures with built-in support for updates, concurrency, crash recovery, etc., that have user-configurable behavior. The user can define various operations and data types according to the template, and a corresponding customized index is automatically constructed.
\item \textbf{Automatic Feature Extension.} This line of work seeks to take data structures that lack specific features and add these automatically, without requiring adjustment to the data structure itself. This is most commonly used to add update support, in which case the process is called \emph{dynamization}.
\end{itemize}
We will briefly discuss each of these three lines, and their limitations, in this section. A more detailed discussion of the first two can be found in Chapter~\ref{chap:related-work}, and the third will be extensively discussed in Chapter~\ref{chap:background}.
Automatic index composition has been considered in a variety of papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering a different set of data structure primitives and a different technique for composing the structure. The general principle across all incarnations of the technique is to consider a (usually static) set of data, and a workload consisting of single-dimensional range queries and point lookups. The system then analyzes the workload, either statically or in real time, selects specific primitive structures optimized for certain operations (e.g., hash table-like structures for point lookups, sorted runs for range scans), and applies them to different regions of the data, in an attempt to maximize the overall performance of the workload. Although some work in this area suggests generalization to more complex data types, such as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused on creating instance-optimal indices for workloads that databases are already well equipped to handle. While this task is quite important, it is not precisely the work that we are trying to accomplish here. Moreover, because the techniques are limited to specified sets of structural primitives, it is not clear that the approach can be usefully extended to support \emph{arbitrary} query and data types without reintroducing the very problem we are trying to address. We thus consider this line to be largely orthogonal to ours.

The second approach, generalized index templates, \emph{does} attempt to address the problem of expanding the indexing support of databases to a broader set of queries. The two primary exemplars of this approach are the generalized search tree (GiST)~\cite{gist, concurrent-gist} and the generalized inverted index (GIN)~\cite{pg-gin}, both of which have been integrated into PostgreSQL~\cite{pg-gist, pg-gin}. GiST enables generalized predicate filtering over user-defined data types, and GIN generalizes an inverted index for text search. While powerful, these techniques are limited by the specific data structure upon which they are based, in much the same way that automatic index composition techniques are limited by their set of defined primitives. As a result, generalized index templates cannot support queries (e.g., independent range sampling~\cite{hu14}) or data structures (e.g., succinct tries~\cite{zhang18}) that do not fit their underlying models. Expanding these underlying models by introducing a new generalized index faces the same challenges as any other index development program. Thus, while generalized index templates are a significant contribution in this area, they are not a general solution to the fundamental difficulty of index development.

The final approach is automatic feature extension. More specifically, we will consider dynamization,\footnote{This is alternatively called a static-to-dynamic transformation, or dynamic extension, depending upon the source. These terms all refer to the same process.} the automatic extension of an existing static data structure with support for inserts and deletes. The most general of these techniques are based on amortized global reconstruction~\cite{overmars83}, an approach that divides a single data structure into smaller structures, called blocks, built over disjoint partitions of the data. Inserts and deletes can then be supported by selectively rebuilding these blocks.
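To make this mechanism concrete, the sketch below shows a simplified, hypothetical version of such a scheme (a stripped-down form of the logarithmic method discussed next), with blocks of exponentially growing capacity that are merged and rebuilt on insert. The \texttt{StaticDS} interface and all names here are illustrative rather than part of any particular system.

\begin{verbatim}
// A minimal, hypothetical dynamization via global reconstruction:
// level i is either empty or a block over exactly 2^i records, and
// an insert merges and rebuilds levels like a binary-counter carry.
// StaticDS is any structure buildable from a vector of records.
#include <memory>
#include <vector>

template <typename StaticDS, typename Record>
class Dynamized {
    std::vector<std::unique_ptr<StaticDS>> levels;
    std::vector<std::vector<Record>> data; // records kept for rebuilds

public:
    void insert(const Record &rec) {
        std::vector<Record> merged{rec};
        size_t i = 0;
        // Collect the records of every full level below the first
        // empty one, emptying those levels as we go.
        while (i < levels.size() && levels[i] != nullptr) {
            merged.insert(merged.end(), data[i].begin(), data[i].end());
            levels[i].reset();
            data[i].clear();
            i++;
        }
        if (i == levels.size()) {
            levels.emplace_back();
            data.emplace_back();
        }
        // Rebuild one static block over all of the merged records.
        data[i] = std::move(merged);
        levels[i] = std::make_unique<StaticDS>(data[i]);
    }

    // Answer a query against every non-empty block and merge the
    // partial results (this requires the query to be decomposable).
    template <typename Result, typename Query, typename Merge>
    Result query(Query q, Merge merge, Result acc) const {
        for (const auto &blk : levels)
            if (blk) acc = merge(acc, q(*blk));
        return acc;
    }
};
\end{verbatim}

Under this layout, each record participates in at most a logarithmic number of rebuilds, which is the source of the amortized insertion bounds of such schemes, a point revisited in Chapter~\ref{chap:background}.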
The most commonly used version of this approach is the Bentley-Saxe method~\cite{saxe79}, which has been individually applied to several specific data structures in past work~\cite{almodaresi23,pgm,naidan14,xie21,bkdtree}. Dynamization of this sort is not a fully general solution, though; it places a number of restrictions on the data structures and queries that it can support. These limitations will be discussed at length in Chapter~\ref{chap:background}, but briefly they include: (1) restrictions on the types of queries that can be supported, with even stricter constraints on when deletes are supported, (2) a lack of useful performance configuration, and (3) sub-optimal performance characteristics, particularly in terms of insertion tail latencies.

Of the three approaches, we believe the last to be the most promising from the perspective of easing the development of novel indices for specialized queries and data types. While dynamization does have limitations, they are less onerous than those of the other two approaches. This is because dynamization is unburdened by specific selections of primitive data layouts; rather, any existing (or novel) data structure can be used.

\section{Our Work}

The work described in this document is focused on addressing, at least in part, the limitations of dynamization mentioned in the previous section. We discuss general strategies for overcoming each limitation in the context of the most popular dynamization technique: the Bentley-Saxe method. We then present a generalized dynamization framework based upon this discussion. This framework is capable of automatically adding support for concurrent inserts, deletes, and queries to a wide range of static data structures, including ones not supported by traditional dynamization techniques. Included in this framework is a tunable design space, allowing for trade-offs between query and update performance, as well as mitigations to control the significant insertion tail latency problems faced by most classical dynamization techniques. Specifically, the proposed work will address the following points:
\begin{enumerate}
\item The proposal of a theoretical framework for analyzing queries and data structures that extends existing theoretical approaches and allows for more data structures to be dynamized.
\item The design of a system based upon this theoretical framework for automatically dynamizing static data structures in a performant and configurable manner.
\item The extension of this system with support for concurrent operations, and the use of parallelism to provide more effective worst-case performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First, Chapter~\ref{chap:background} introduces relevant background information about classical dynamization techniques and serves as a foundation for the discussion to follow. In Chapter~\ref{chap:sampling}, we consider one specific example of a query type not supported by traditional dynamization systems, and propose a framework that addresses the underlying problems and enables dynamization for such queries. Next, in Chapter~\ref{chap:framework}, we use the results from the previous chapter to propose novel extensions to the search problem taxonomy and generalized mechanisms for supporting dynamization of these new types of problems, culminating in a general dynamization framework.
Chapter~\ref{chap:design-space} unifies our discussion of the configuration parameters of our dynamization techniques from the previous two chapters, and formally considers the design space and the trade-offs within it. In Chapter~\ref{chap:tail-latency}, we consider the problem of insertion tail latency, and extend our framework with support for techniques to mitigate this problem. Chapter~\ref{chap:related-work} contains a more detailed discussion of works related to our own and the ways in which our approach differs, and finally Chapter~\ref{chap:conclusion} concludes the work.