\chapter{Introduction}

It probably goes without saying that database systems are heavily dependent upon data structures, both for auxiliary use within the system itself and for indexing the data in storage to facilitate faster access. As a result, the design of novel data structures constitutes a significant subfield within the database community. However, there is a stark divide between theoretical work and so-called ``practical'' work in this area, with many theoretically oriented data structures seeing little, if any, use in real systems. I would go so far as to assert that many of these published data structures have \emph{never} actually been used.

This situation exists with reason, of course. Fundamentally, the rules of engagement within the theory community differ from those within the systems community. Asymptotic analysis, which eschews constant factors, dominates the theoretical analysis of data structures, whereas the systems community cares a great deal about these constants. We will see within this document just how significant this divide is in terms of real performance numbers. But perhaps an even more significant barrier to the adoption of theoretical data structures is feature support. A data structure, technically speaking, only needs to define algorithms for constructing and querying it; I will describe such minimal structures as \emph{static data structures} within this document. Many theoretical structures that seem potentially useful fall into this category: alias-augmented structures for independent sampling, vantage-point trees for multi-dimensional similarity search, ISAM trees for traditional one-dimensional indexing, the vast majority of learned indexes, and so on. These structures answer their associated query types highly efficiently, but have either fallen out of use (ISAM trees) or have yet to see widespread adoption in database systems. This is because the minimal interface provided by a static data structure is usually not sufficient to address the real-world engineering challenges associated with database systems. Instead, data structures used by such systems must support a variety of additional features: updates to the underlying data, concurrent access, fault-tolerance, and so on. This lack of feature support is a major barrier to adoption, because in the current data structure design paradigm, supporting such features requires extensive redesign of the static structure, often over a lengthy development cycle.

Learned indexes provide a good case study. The first learned index, RMI, was proposed by Kraska \emph{et al.} in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure and the idea behind it were, RMI lacks support for updates and was therefore of very limited practical utility. Work then proceeded to develop an updatable data structure based on the concepts of RMI, culminating in ALEX~\cite{alex}, which first appeared on arXiv roughly a year and a half later. The next several years saw the development of a wide range of learned indexes promising support for updates and concurrency; however, a recent survey found that all of them remained largely inferior to more mature indexing techniques, at least on certain workloads. These adventures in learned index design represent much of the modern index design process in microcosm. It is not unreasonable to expect that, as the technology matures, learned indexes may one day become commonplace.
But the amount of development and research effort required to get there is clearly vast.

On the opposite end of the spectrum, theoretical data structure work also attempts to extend structures with update support, using a variety of techniques. However, the differing rules of engagement often produce solutions that are horribly impractical in database systems. As an example, Hu, Qiao, and Tao have proposed a data structure for efficient range sampling, and included in their design a discussion of efficient support for updates~\cite{irs}. Without getting into details, they must add multiple additional data structures alongside their sampling structure to facilitate this, including a hash table and multiple linked lists. Asymptotically, this approach affects neither space nor time complexity: there is a constant number of extra structures, and the costs of maintaining and accessing them are on par with the costs of the main structure. But it is clear that the space and time overheads of these extra data structures would matter in a real system. A similar problem arises in a recent attempt to create a dynamic alias structure, which uses multiple auxiliary data structures and further assumes that the size of the key space is a constant that can be neglected~\cite{that-paper}. Moreover, update support is only one of many features that a data structure must provide for use in database systems. Given the challenges associated with update support alone, one can imagine the amount of work required to get a data structure fully ``production ready''!

However, I will argue that these tribulations are not fundamental to data structure design, but are rather a consequence of the modern design paradigm. In place of this process of manually integrating features into the data structure itself, I propose a new paradigm: \emph{framework-driven data structure design}. Under this paradigm, the process of designing a data structure is reduced to the static case: an algorithm for querying the structure and an algorithm for building it from a set of elements. Once these are defined, a high-level framework can automatically add support for other desirable features, such as updates, concurrency, and fault-tolerance, in a manner that is mostly transparent to the static structure itself.

This idea is not without precedent. A similar approach already provides fault-tolerance to indexes within traditional, disk-based RDBMSs: the RDBMS supplies a storage engine with its own fault-tolerance systems, and any data structure built on top of that storage engine benefits from its crash recovery with only a small amount of integration effort. As a result, crash recovery and fault tolerance are not handled at the level of the data structure in such systems; the B+Tree index does not have these mechanisms built into it, but instead relies upon the framework provided by the RDBMS. Similarly, there is an existing technique that uses an analogous process to add update support to static structures, commonly called the Bentley-Saxe method (BSM).
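To make this concrete, the following is a minimal sketch of the Bentley-Saxe method applied to a toy static structure, supporting inserts only; the names (\texttt{SortedArray}, \texttt{DynamizedSet}) are illustrative and do not correspond to any existing system. The method maintains a logarithmic number of immutable blocks, where block $i$ is either empty or holds exactly $2^i$ records, and an insert rebuilds the first run of full blocks, exactly like a carry in binary addition. For a \emph{decomposable} query, one whose answer over the full data set can be assembled from the answers over the individual blocks, a static structure with build cost $B(n)$ and query cost $Q(n)$ yields $O\left(\frac{B(n)}{n}\log n\right)$ amortized inserts and $O(Q(n)\log n)$ queries.

\begin{verbatim}
#include <algorithm>
#include <memory>
#include <vector>

// A minimal "static" structure: it can be built from a set of
// records and queried, but never modified in place.
struct SortedArray {
    std::vector<int> data;

    explicit SortedArray(std::vector<int> recs) : data(std::move(recs)) {
        std::sort(data.begin(), data.end());   // B(n) = O(n log n)
    }

    bool contains(int key) const {             // Q(n) = O(log n)
        return std::binary_search(data.begin(), data.end(), key);
    }
};

// A Bentley-Saxe style dynamization: level i is either empty or
// holds a block built from exactly 2^i records.
class DynamizedSet {
    std::vector<std::unique_ptr<SortedArray>> levels_;

public:
    void insert(int key) {
        std::vector<int> carry{key};
        size_t i = 0;
        // Merge with each full level until an empty slot is found,
        // like a carry propagating in binary addition.
        while (i < levels_.size() && levels_[i]) {
            auto& blk = levels_[i]->data;
            carry.insert(carry.end(), blk.begin(), blk.end());
            levels_[i].reset();
            ++i;
        }
        if (i == levels_.size()) levels_.emplace_back();
        levels_[i] = std::make_unique<SortedArray>(std::move(carry));
    }

    // Membership is decomposable: the answer over the whole set is
    // the OR of the answers over the individual blocks.
    bool contains(int key) const {
        for (const auto& lvl : levels_)
            if (lvl && lvl->contains(key)) return true;
        return false;
    }
};
\end{verbatim}

Note that \texttt{SortedArray} participates only through its constructor and its query method; this interface reduction is precisely what framework-driven design exploits to supply updates and, in later phases, concurrency and fault-tolerance.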
\section{Research Objectives}
The proposed project has four major objectives:
\begin{enumerate}
\item \textbf{Automatic Dynamic Extension.} The first phase of this project has seen the development of a \emph{dynamic extension framework}, capable of adding support for inserts and deletes to otherwise static data structures, so long as a few basic assumptions about the structure and its associated queries are satisfied. This framework is based on the core principles of the Bentley-Saxe method and is implemented using C++ templates for ease of use. As part of this extension of BSM, a large design space has been introduced, giving the framework a trade-off space between memory usage, insert performance, and query performance. This allows the performance characteristics of a framework-extended data structure to be tuned for particular use cases, and provides a large degree of flexibility to the technique.

\item \textbf{Automatic Concurrency Support.} Because the Bentley-Saxe method is based on the reconstruction of otherwise immutable blocks, a basic concurrency implementation is straightforward. While there are hard blocking points when a reconstruction requires the results of an as-yet-incomplete reconstruction, all other operations can proceed concurrently, so long as the destruction of blocks can be deferred until every operation actively using them has completed. This lends itself to a simple epoch-based system, in which a particular configuration of blocks constitutes an epoch, and the reconstruction of one or more blocks triggers a shift to a new epoch upon its completion. Each query sees exactly one epoch, and that epoch remains in existence until all queries using it have terminated (a minimal sketch of this mechanism follows this list). With this strategy, the problem of supporting concurrent operations is largely converted into one of resource management: retaining old epochs, adding more buffers, and running reconstructions all require storage; large reconstructions consume memory bandwidth and CPU resources, which must be shared with active queries; and at least some reconstructions will actively block others, leading to tail latency spikes. The objective of this phase of the project is the creation of a scheduling system, built into the framework, that schedules queries and merges so as to ensure that the system operates within specified tail latency and resource utilization constraints. In particular, it is important to effectively hide the large insertion tail latencies caused by reconstructions, and to limit the storage required to retain old versions of the structure. Alongside scheduling, the use of admission control will be considered to help maintain latency guarantees even under adversarial conditions.

\item \textbf{Automatic Multi-node Support.} It is increasingly the case that the requirements of data management systems exceed the capacity of a single node, necessitating horizontal scaling. Unfortunately, designing data structures that work effectively in a distributed, multi-node environment is non-trivial. However, the same design elements that make framework-driven concurrency straightforward should also lend themselves to adding multi-node support to a data structure. The framework uses immutable blocks of data, which are periodically reconstructed by combining them with other blocks; this system is superficially similar to the RDDs used by Apache Spark, for example.
What is not so straightforward, however, are the implementation decisions that underlie this framework. It is not obvious that the geometric block sizing technique used by BSM is well suited to this setting, and so a comprehensive evaluation of block sizing techniques will be required. Additionally, there are significant challenges to be overcome regarding block placement on nodes, fault-tolerance and recovery, how best to handle buffering, and the effect of block sizing and placement strategies on end-to-end query performance. All of these problems will be studied during this phase of the project.

\item \textbf{Automatic Performance Tuning.} Each phase of the project introduces tunable parameters that trade off insertion performance, query performance, and memory usage. These allow a user to fine-tune the performance characteristics of the framework to suit her use cases. However, this tunability may present an obstacle to adoption, as it is not necessarily trivial to arrive at an effective configuration given a set of performance requirements. Thus, the final phase of the project will consider systems to automatically tune the framework (a hypothetical sketch of such a configuration surface follows this list). As a further benefit, such a system could allow dynamic adjustment of the tunable parameters during execution, permitting automatic and transparent evolution in the face of changing workloads.
\end{enumerate}
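To illustrate the epoch mechanism described in the second objective, below is one possible, deliberately simplified, sketch of deferred block destruction; the names are hypothetical, and buffering, scheduling, and the reconstructions themselves are omitted. A query pins the version of the structure it begins on, and reference counting keeps that epoch, and the blocks it holds, alive until the last query using it completes.

\begin{verbatim}
#include <memory>
#include <mutex>
#include <vector>

// Placeholder for an immutable static structure, such as the
// SortedArray in the earlier sketch.
struct Block { /* ... */ };

// One immutable configuration of blocks: the contents of an epoch.
// Reconstructions never modify a Version; they build a new one.
struct Version {
    std::vector<std::shared_ptr<const Block>> blocks;
};

class EpochManager {
    std::shared_ptr<const Version> current_ = std::make_shared<Version>();
    std::mutex mtx_;

public:
    // A query pins the epoch it starts on. The shared_ptr's
    // reference count defers destruction of the epoch (and the
    // blocks it holds) until the last reader has released it.
    std::shared_ptr<const Version> enter() {
        std::lock_guard<std::mutex> lock(mtx_);
        return current_;
    }

    // A completed reconstruction installs its output as the new
    // epoch. Queries running against the old epoch are unaffected
    // and continue to see a consistent block configuration.
    void install(std::shared_ptr<const Version> next) {
        std::lock_guard<std::mutex> lock(mtx_);
        current_ = std::move(next);
    }
};
\end{verbatim}

Under this scheme, installing the result of a reconstruction is a constant-time pointer swap, and the resource-management problem described above appears concretely as the set of retained \texttt{Version} objects awaiting release.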
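Finally, as a purely hypothetical illustration of the configuration surface that the tuning objective targets, the framework's knobs might look something like the following; none of these names or values are drawn from an existing implementation.

\begin{verbatim}
// Hypothetical tuning knobs; illustrative only.
struct FrameworkConfig {
    // Larger buffers absorb more inserts before triggering a
    // reconstruction, at the cost of scanning more unindexed
    // records at query time.
    size_t buffer_capacity = 12000;

    // Growth rate between block sizes. Larger factors mean fewer,
    // larger blocks: cheaper queries, but less frequent and more
    // expensive reconstructions.
    size_t scale_factor = 8;
};
\end{verbatim}

An automatic tuner would search this space on the user's behalf, and ideally adjust it online as the workload shifts.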