\chapter{Introduction}
\label{chap:intro}

One of the major challenges facing current data systems is the processing
of complex and varied analytical queries over vast data sets. One commonly
used technique for accelerating these queries is the application of data
structures to create indexes, which are the basis for specialized database
systems and data processing libraries. Unfortunately, the development
of these indexes is difficult because of the requirements placed on
them by data processing systems. Data is frequently subject to updates,
yet a large number of potentially useful data structures are static.
Further, many large-scale data processing systems are highly concurrent,
which increases the barrier to entry even further. The process for
developing data structures that satisfy these requirements is arduous.

To demonstrate this difficulty, consider the recent example of the
evolution of learned indexes. These are data structures designed to
efficiently solve a simple problem: single-dimensional range queries
over sorted data. They seek to reduce the size of the structure, as
well as lookup times, by replacing a traditional data structure with a
learned model capable of predicting the location of a record in storage
that matches a key value to within bounded error. This concept was first
proposed by Kraska et al. in 2017, when they published a paper on the
first learned index, RMI~\cite{RMI}. This index succeeded in showing
that a learned model can be both faster and smaller than a conventional
range index, but the proposed solution did not support updates. The
first (non-concurrently) updatable learned index, ALEX, took a year
and a half to appear~\cite{ALEX}. Over the course of the subsequent
three years, several learned indexes were proposed with concurrency
support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
recent performance study~\cite{10.14778/3551793.3551848} showed that these
were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
a traditional index. This same study did, however, demonstrate that a new
design, ALEX+, was able to outperform ART-OLC under certain circumstances.
Even with this result, learned indexes are not generally considered
production ready, because they suffer from significant performance
regressions under certain workloads and are highly sensitive to the
distribution of keys~\cite{10.14778/3551793.3551848}. Despite the
demonstrable advantages of the technique and over half a decade of
development, learned indexes still have not reached a generally usable
state.\footnote{
    In Chapter~\ref{chap:framework}, we apply our proposed technique to
    existing static learned indexes to produce an effective dynamic index.
}

This work proposes a strategy for addressing this problem by providing a
framework for automatically introducing support for concurrent updates
(including both inserts and deletes) to many static data structures. With
this framework, a wide range of static, or otherwise impractical, data
structures will be made practically useful in data systems. Based
on a classical, theoretical framework called the Bentley-Saxe
method~\cite{saxe79}, the proposed system will provide a library
that can automatically extend many data structures with support for
concurrent updates, as well as a tunable design space that allows the
user to make trade-offs between read performance, write performance,
and storage usage. The framework will address a number of limitations
present in the original technique, greatly increasing its applicability
and practicality. It will also provide a workload-adaptive, online tuning
system that can automatically adjust the tuning parameters of the data
structure in the face of changing workloads.

This framework is based on the splitting of the data structure into
several smaller pieces, which are periodically reconstructed to support
updates. A systematic partitioning and reconstruction approach is used
to provide specific guarantees on amortized insertion performance and
worst-case query performance. The underlying Bentley-Saxe method is
extended using a novel query abstraction to broaden its applicability,
and the partitioning and reconstruction processes are adjusted to improve
performance and introduce configurability.
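
As a rough illustration of these guarantees, consider the classical
Bentley-Saxe method in its simplest binary form. The notation below is
assumed for this sketch only, and the bounds are those of the original
technique rather than of the extended framework. A search problem with
query function $F$ is decomposable if, for any disjoint sets of records
$A$ and $B$ and any query $q$,
\[
    F(A \cup B, q) = F(A, q) \diamond F(B, q),
\]
where $\diamond$ is a merge operator that can be evaluated in constant
time. The $n$ records are partitioned into blocks $D_0, D_1, \ldots$,
where each block $D_i$ is either empty or holds exactly $2^i$ records,
so that at most $\lfloor \log_2 n \rfloor + 1$ blocks are non-empty. If a
static structure over $n$ records can be built in $B(n)$ time and queried
in $Q(n)$ time, with $B(n)/n$ and $Q(n)$ non-decreasing, then
\[
    I_A(n) \in O\left(\frac{B(n)}{n} \log n\right)
    \quad \mathrm{and} \quad
    Q_{\mathrm{dyn}}(n) \in O\left(Q(n) \log n\right),
\]
where $I_A(n)$ denotes the amortized insertion cost and
$Q_{\mathrm{dyn}}(n)$ the worst-case query cost of the resulting dynamic
structure.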

Specifically, the proposed work will address the following points:
\begin{enumerate}
    \item The proposal of a theoretical framework for analysing queries
          and data structures that extends existing theoretical
          approaches and allows for more data structures to be dynamized.
    \item The design of a system based upon this theoretical framework
          for automatically dynamizing static data structures in a performant
          and configurable manner.
    \item The extension of this system with support for concurrent operations,
          and the use of concurrency to provide more effective worst-case
          performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First,
Chapter~\ref{chap:background} introduces relevant background information,
including the importance of data structures and indexes in database systems,
the concept of a search problem, and techniques for designing updatable data
structures. Next, in Chapter~\ref{chap:sampling}, the application of the
Bentley-Saxe method to a number of sampling data structures is presented. The
extension of these structures introduces a number of challenges which must be
addressed, resulting in significant modification of the underlying technique.
Then, Chapter~\ref{chap:framework} discusses how the modifications
developed for the sampling framework are generalized into a broader framework.
Chapter~\ref{chap:proposed} discusses the work that remains to be completed as
part of this project, and Chapter~\ref{chap:conclusion} concludes the work.