\chapter{Introduction}
\label{chap:intro}

\section{Motivation}

Modern relational database management systems (RDBMS) are founded upon a set-based representation of data~\cite{codd70}. This model is very flexible and can be used to represent data of a wide variety of types, from standard tabular information, to vectors, to graphs, and more. However, this flexibility comes at a significant cost in terms of query performance: the most basic data access operation is a linear table scan. To work around this limitation, RDBMS support the creation of special data structures called indices, which can be used to accelerate particular types of queries. To take full advantage of these structures, databases feature sophisticated query planning and optimization systems that can identify opportunities to utilize these indices~\cite{cowbook}.

This approach works well for those types of queries for which an index has been designed and integrated into the database. Unfortunately, many RDBMS support only a very limited set of indices, intended for accelerating single-dimensional range queries and point lookups~\cite{mysql-btree-hash, cowbook}. This is a significant limitation, because one of the major challenges currently facing data systems is the processing of complex analytical queries of varying types over large sets of data. These queries and data types are nominally supported by a relational database, but are not well addressed by existing indexing techniques and as a result perform poorly. This has led to the development of a variety of specialized systems for particular types of queries, such as spatial systems~\cite{postgis-doc}, vector databases~\cite{pinecone-db}, and graph databases~\cite{neptune, neo4j}. At the heart of these specialized systems are specialized indices, along with the accompanying query processing and optimization architectures necessary to utilize them effectively.

However, the development of a novel data processing system for a specific type of query is not a trivial process. While specialized data structures, which often already exist, are at the heart of such systems, meaningfully using such a data structure in a database requires adding a large number of additional features. A recent work on extending Datalog with support for user-defined data structures demonstrates both the benefits and challenges associated with the use of specialized indices: it showed significant improvements in query processing time and space requirements when using custom indices, but required that the user-defined structures support concurrent updates~\cite{byods-datalog}. In practice, to be useful within the context of a database, a data structure must support inserts and deletes (collectively referred to as updates), as well as concurrency control that satisfies standardized isolation semantics~\cite{cowbook}, support for crash recovery of the index in the case of a system failure~\cite{aries}, and possibly more. The process of extending an existing or novel data structure with support for all of these functions is a major barrier to their use.

As a current example that demonstrates this problem, consider the recent development of learned indices. These are a broad class of data structures that use various techniques to approximate a function mapping a key onto its location in storage. In principle, this model allows for better space efficiency of the index, as well as improved lookup performance.
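As a rough illustration of the idea, the sketch below shows a minimal, hypothetical learned index over a sorted array: a single linear model predicts a key's position, and a binary search bounded by the model's maximum observed error corrects the prediction. All names here are illustrative, and the example is a simplification rather than a description of any published design.

\begin{verbatim}
// A minimal, hypothetical learned index over a sorted array.
// Assumes at least two distinct keys.
#include <algorithm>
#include <cstdint>
#include <vector>

class LinearLearnedIndex {
    std::vector<uint64_t> keys;      // sorted keys ("storage")
    double slope = 0, intercept = 0; // the learned model
    size_t max_err = 0;              // worst-case prediction error

    size_t predict(uint64_t key) const {
        double p = slope * double(key) + intercept;
        p = std::clamp(p, 0.0, double(keys.size() - 1));
        return size_t(p);
    }

public:
    explicit LinearLearnedIndex(std::vector<uint64_t> sorted)
        : keys(std::move(sorted)) {
        // Fit position = slope * key + intercept through the endpoints.
        // (Published designs fit piecewise models, use least squares, etc.)
        slope = double(keys.size() - 1) / double(keys.back() - keys.front());
        intercept = -slope * double(keys.front());
        // Record the worst-case error to bound the search on lookup.
        for (size_t i = 0; i < keys.size(); i++) {
            size_t p = predict(keys[i]);
            max_err = std::max(max_err, p > i ? p - i : i - p);
        }
    }

    bool contains(uint64_t key) const {
        size_t p = predict(key);
        size_t lo = p > max_err ? p - max_err : 0;
        size_t hi = std::min(keys.size(), p + max_err + 1);
        // Search only the small window around the predicted position.
        return std::binary_search(keys.begin() + lo, keys.begin() + hi, key);
    }
};
\end{verbatim}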
This concept was first proposed by Kraska et al. in 2017, when they published a paper on the first learned index, RMI~\cite{RMI}. This index succeeded in showing that a learned model can be both faster and smaller than a conventional range index, but the proposed solution did not support updates. The first (non-concurrently) updatable learned index, ALEX, took a year and a half to appear~\cite{alex}. Over the course of the subsequent three years, several learned indices with concurrency support were proposed~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a recent performance study~\cite{10.14778/3551793.3551848} showed that these were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352}, a traditional index. The same study did, however, demonstrate that a new design, ALEX+, was able to outperform ART-OLC under certain circumstances. Even with this result, learned indices are not generally considered production-ready, because they suffer from significant performance regressions under certain workloads and are highly sensitive to the distribution of keys~\cite{10.14778/3551793.3551848,alex-aca}. Despite the demonstrable advantages of the technique and over half a decade of development, learned indices still have not reached a generally usable state.

It would not be an exaggeration to say that there are dozens of novel data structures proposed each year at data structures and systems conferences, many of which solve useful problems. However, the burden of producing a useful database index from these structures is great, and many of them either never see use, or at least require a significant amount of time and effort before they can be deployed. If there were a way to bypass much of this additional development time by \emph{automatically} extending the feature set of an existing data structure to produce a usable index, many of these structures would become readily accessible to database practitioners, and the capabilities of database systems could be greatly enhanced. It is our goal with this work to make a significant step in this direction.

\section{Existing Attempts}

At present, there are several lines of work targeted at reducing the development burden associated with creating specialized indices. We classify them into three broad categories:
\begin{itemize}
\item \textbf{Automatic Index Composition.} This line of work seeks to automatically compose an instance-optimized data structure for indexing static data by examining the workload and combining a collection of basic primitive structures to optimize performance.
\item \textbf{Generalized Index Templates.} This line of work seeks to introduce generalized data structures with built-in support for updates, concurrency, crash recovery, etc., that have user-configurable behavior. The user can define various operations and data types according to the template, and a corresponding customized index is automatically constructed.
\item \textbf{Automatic Feature Extension.} This line of work seeks to take data structures that lack specific features and add these automatically, without requiring adjustment to the data structure itself. This is most commonly used to add update support, in which case the process is called \emph{dynamization}.
\end{itemize}
We will briefly discuss each of these three lines, and their limitations, in this section. A more detailed discussion of the first two can be found in Chapter~\ref{chap:related-work}, and the third will be extensively discussed in Chapter~\ref{chap:background}.
Automatic index composition has been considered in a variety of papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering a different set of data structure primitives and a different technique for composing the structure. The general principle across all incarnations of the technique is to consider a (usually static) set of data, and a workload consisting of single-dimensional range queries and point lookups. The system then analyzes the workload, either statically or in real time, selects specific primitive structures optimized for certain operations (e.g., hash table-like structures for point lookups, sorted runs for range scans), and applies them to different regions of the data, in an attempt to maximize the overall performance of the workload. Although some work in this area suggests generalization to more complex data types, such as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused on creating instance-optimal indices for workloads that databases are already well equipped to handle. While this task is quite important, it is not precisely the work that we are trying to accomplish here. Moreover, because the techniques are limited to specified sets of structural primitives, it is not clear that the approach can be usefully extended to support \emph{arbitrary} query and data types without reintroducing the very problem we are trying to address. We thus consider this line to be largely orthogonal to ours.

The second approach, generalized index templates, \emph{does} attempt to address the problem of expanding the indexing support of databases to a broader set of queries. The two primary exemplars of this approach are the generalized search tree (GiST)~\cite{gist, concurrent-gist} and the generalized inverted index (GIN)~\cite{pg-gin}, both of which have been integrated into PostgreSQL~\cite{pg-gist, pg-gin}. GiST enables generalized predicate filtering over user-defined data types, and GIN generalizes an inverted index for text search. While powerful, these techniques are limited by the specific data structure upon which they are based, in much the same way that automatic index composition techniques are limited by their set of defined primitives. As a result, generalized index templates cannot support queries (e.g., independent range sampling~\cite{hu14}) or data structures (e.g., succinct tries~\cite{zhang18}) that do not fit their underlying models. Expanding these underlying models by introducing a new generalized index faces the same challenges as any other index development program. Thus, while generalized index templates are a significant contribution in this area, they are not a general solution to the fundamental difficulty of index development.

The final approach is automatic feature extension. More specifically, we will consider dynamization,\footnote{This is alternatively called a static-to-dynamic transformation, or dynamic extension, depending upon the source. These terms all refer to the same process.} the automatic extension of an existing static data structure with support for inserts and deletes. The most general of these techniques are based on amortized global reconstruction~\cite{overmars83}, an approach that divides a single data structure into smaller structures, called blocks, built over disjoint partitions of the data. Inserts and deletes can then be supported by selectively rebuilding these blocks.
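To make this mechanism concrete, the sketch below shows a simplified, hypothetical version of such a scheme (a stripped-down form of the logarithmic method discussed next), with blocks of exponentially growing capacity that are merged and rebuilt on insert. The \texttt{StaticDS} interface and all names here are illustrative rather than part of any particular system.

\begin{verbatim}
// A minimal, hypothetical dynamization via global reconstruction:
// level i is either empty or a block over exactly 2^i records, and
// an insert merges and rebuilds levels like a binary-counter carry.
// StaticDS is any structure buildable from a vector of records.
#include <memory>
#include <vector>

template <typename StaticDS, typename Record>
class Dynamized {
    std::vector<std::unique_ptr<StaticDS>> levels;
    std::vector<std::vector<Record>> data; // records kept for rebuilds

public:
    void insert(const Record &rec) {
        std::vector<Record> merged{rec};
        size_t i = 0;
        // Collect the records of every full level below the first
        // empty one, emptying those levels as we go.
        while (i < levels.size() && levels[i] != nullptr) {
            merged.insert(merged.end(), data[i].begin(), data[i].end());
            levels[i].reset();
            data[i].clear();
            i++;
        }
        if (i == levels.size()) {
            levels.emplace_back();
            data.emplace_back();
        }
        // Rebuild one static block over all of the merged records.
        data[i] = std::move(merged);
        levels[i] = std::make_unique<StaticDS>(data[i]);
    }

    // Answer a query against every non-empty block and merge the
    // partial results (this requires the query to be decomposable).
    template <typename Result, typename Query, typename Merge>
    Result query(Query q, Merge merge, Result acc) const {
        for (const auto &blk : levels)
            if (blk) acc = merge(acc, q(*blk));
        return acc;
    }
};
\end{verbatim}

Under this layout, each record participates in at most a logarithmic number of rebuilds, which is the source of the amortized insertion bounds of such schemes, a point revisited in Chapter~\ref{chap:background}.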
The most commonly used version of this approach is the Bentley-Saxe method~\cite{saxe79}, which has been individually applied to several specific data structures in past work~\cite{almodaresi23,pgm,naidan14,xie21,bkdtree}. Dynamization of this sort is not a fully general solution, though; it places a number of restrictions on the data structures and queries that it can support. These limitations will be discussed at length in Chapter~\ref{chap:background}, but briefly they include: (1) restrictions on the types of queries that can be supported, with even stricter constraints on when deletes are supported, (2) a lack of useful performance configuration, and (3) sub-optimal performance characteristics, particularly in terms of insertion tail latencies.

Of the three approaches, we believe the last to be the most promising from the perspective of easing the development of novel indices for specialized queries and data types. While dynamization does have limitations, they are less onerous than those of the other two approaches. This is because dynamization is unburdened by specific selections of primitive data layouts; rather, any existing (or novel) data structure can be used.

\section{Our Work}

The work described in this document is focused on addressing, at least in part, the limitations of dynamization mentioned in the previous section. We discuss general strategies for overcoming each limitation in the context of the most popular dynamization technique: the Bentley-Saxe method. We then present a generalized dynamization framework based upon this discussion. This framework is capable of automatically adding support for concurrent inserts, deletes, and queries to a wide range of static data structures, including ones not supported by traditional dynamization techniques. Included in this framework is a tunable design space, allowing for trade-offs between query and update performance, as well as mitigations to control the significant insertion tail latency problems faced by most classical dynamization techniques. Specifically, the proposed work will address the following points:
\begin{enumerate}
\item The proposal of a theoretical framework for analyzing queries and data structures that extends existing theoretical approaches and allows for more data structures to be dynamized.
\item The design of a system based upon this theoretical framework for automatically dynamizing static data structures in a performant and configurable manner.
\item The extension of this system with support for concurrent operations, and the use of parallelism to provide more effective worst-case performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First, Chapter~\ref{chap:background} introduces relevant background information about classical dynamization techniques and serves as a foundation for the discussion to follow. In Chapter~\ref{chap:sampling}, we consider one specific example of a query type not supported by traditional dynamization systems, and propose a framework that addresses the underlying problems and enables dynamization for such queries. Next, in Chapter~\ref{chap:framework}, we use the results from the previous chapter to propose novel extensions to the search problem taxonomy and generalized mechanisms for supporting dynamization of these new types of problems, culminating in a general dynamization framework.
Chapter~\ref{chap:design-space} unifies our discussion of the configuration parameters of our dynamization techniques from the previous two chapters, and formally considers the design space and the trade-offs within it. In Chapter~\ref{chap:tail-latency}, we consider the problem of insertion tail latency, and extend our framework with support for techniques to mitigate this problem. Chapter~\ref{chap:related-work} contains a more detailed discussion of works related to our own and the ways in which our approach differs, and finally Chapter~\ref{chap:conclusion} concludes the work.