diff --git a/chapters/chapter1-old.tex b/chapters/chapter1-old.tex
new file mode 100644
index 0000000..fca257d
--- /dev/null
+++ b/chapters/chapter1-old.tex
@@ -0,0 +1,256 @@
+\chapter{Introduction}
+
+It probably goes without saying that database systems are heavily
+dependent upon data structures, both for auxiliary use within the system
+itself, and for indexing the data in storage to facilitate faster access.
+As a result, the design of novel data structures constitutes a
+significant sub-field within the database community. However, there is a
+stark divide between theoretical and so-called ``practical'' work in
+this area, with many theoretically oriented data structures seeing
+little, if any, use in real systems. I would go so far as to assert that
+many of these published data structures have \emph{never} actually
+been used.
+
+This situation exists for a reason, of course. Fundamentally, the rules
+of engagement within the theory community differ from those within the
+systems community. Asymptotic analysis, which eschews constant factors,
+dominates theoretical analysis of data structures, whereas the systems
+community cares a great deal about these constants. We'll see within
+this document just how significant this divide is in terms of real
+performance numbers. But perhaps an even greater barrier to the adoption
+of theoretical data structures is the question of feature support.
+
+A data structure, technically speaking, only needs to define algorithms
+for constructing and querying it. I'll describe such minimal structures
+as \emph{static data structures} within this document. Many theoretical
+structures that seem potentially useful fall into this category. Examples
+include alias-augmented structures for independent sampling, vantage-point
+trees for multi-dimensional similarity search, ISAM trees for traditional
+one-dimensional indexing, and the vast majority of learned indexes.
+
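+To make this notion concrete, the following is a minimal sketch, in C++,
+of such a static structure: a sorted array answering range-count queries.
+It is purely illustrative and does not correspond to any of the structures
+listed above; it exposes only a construction algorithm and a query
+algorithm, and nothing else.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// A minimal "static data structure": built once from a set of
+// records and then queried; no other operations are supported.
+class StaticIndex {
+public:
+    // Construction: sort the input once; the layout is immutable afterwards.
+    explicit StaticIndex(std::vector<int64_t> keys) : m_keys(std::move(keys)) {
+        std::sort(m_keys.begin(), m_keys.end());
+    }
+
+    // Query: count the keys falling within [low, high].
+    size_t range_count(int64_t low, int64_t high) const {
+        auto lo = std::lower_bound(m_keys.begin(), m_keys.end(), low);
+        auto hi = std::upper_bound(m_keys.begin(), m_keys.end(), high);
+        return static_cast<size_t>(hi - lo);
+    }
+
+    size_t record_count() const { return m_keys.size(); }
+
+private:
+    std::vector<int64_t> m_keys;  // sorted, never modified after construction
+};
+\end{verbatim}
+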
+These structures allow their associated query types to be answered highly
+efficiently, but have either fallen out of use (ISAM trees) or have
+yet to see widespread adoption in database systems. This is because the
+minimal interface provided by a static data structure is usually not
+sufficient to address the real-world engineering challenges associated
+with database systems. Instead, data structures used by such systems must
+support a variety of additional features: updates to the underlying data,
+concurrent access, fault-tolerance, etc. This lack of feature support
+is a major barrier to the adoption of such structures.
+
+In the current data structure design paradigm, support for such features
+requires extensive redesign of the static data structure, often over a
+lengthy development cycle. Learned indexes provide a good case study for
+this. The first learned index, RMI, was proposed by Kraska \emph{et al.}
+in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure,
+and the idea behind it, was, it lacked support for updates and thus was
+of very limited practical utility. Work then proceeded to develop an
+updatable data structure based on the concepts of RMI, culminating in
+ALEX~\cite{alex}, which first appeared on arXiv roughly a year and a
+half later. The next several years saw the
+development of a wide range of learned indexes, promising support for
+updates and concurrency. However, a recent survey found that all of them
+were still largely inferior to more mature indexing techniques, at least
+on certain workloads.
+
+These adventures in learned index design represent much of the modern
+index design process in microcosm. It is not unreasonable to expect
+that, as the technology matures, learned indexes may one day become
+commonplace. But the amount of research and development effort required
+to get there is clearly vast.
+
+On the opposite end of the spectrum, theoretical data structure papers
+also attempt to extend their structures with update support using a
+variety of techniques. However, the differing rules of engagement often
+result in solutions to this problem that are horribly impractical in
+database systems. As an example, Hu, Qiao, and Tao have proposed a data
+structure for efficient range sampling, and included in their design a
+discussion of efficient support for updates~\cite{irs}. Without getting
+into details, they need to add multiple additional data structures
+alongside their sampling structure to facilitate this, including a hash
+table and multiple linked lists. Asymptotically, this approach doesn't
+affect space or time complexity, as there is a constant number of extra
+structures and the costs of maintaining and accessing them are on par
+with those of the main structure. But it's clear that the space
+and time costs of these extra data structures would be significant in
+a real system. A similar problem arises in a recent attempt to create a
+dynamic alias structure, which uses multiple auxiliary data structures,
+and further assumes that the key space size is a constant that can be
+neglected~\cite{that-paper}.
+
+Further, update support is only one of many features that a data
+structure must support for use in database systems. Given these challenges
+associated with just update support, one can imagine the amount of work
+required to get a data structure fully ``production ready''!
+
+However, all of these tribulations are, I'm going to argue, not
+fundamental to data structure design, but rather a consequence of the
+modern data structure design paradigm. Rather than this process of manual
+integration of features into the data structure itself, we propose a
+new paradigm: \emph{Framework-driven Data Structure Design}. Under this
+paradigm, the process of designing a data structure is reduced to the
+static case: an algorithm for querying the structure and an algorithm
+for building it from a set of elements. Once these are defined, a
+high-level framework can be used to automatically add support for other
+desirable features, such as updates, concurrency, and fault-tolerance,
+in a manner that is mostly transparent to the static structure itself.
+
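+For illustration, the contract that a structure designer would be asked to
+satisfy under this paradigm could be expressed as a C++ concept along the
+following lines. This is a hypothetical sketch, not the interface of the
+framework described later in this document; the additional reconstruction
+operation is included here because the block-merging behavior discussed
+later in this chapter requires it.
+
+\begin{verbatim}
+#include <concepts>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical set of requirements a static structure must satisfy so
+// that a framework can layer updates, concurrency, etc. on top of it.
+template <typename S, typename Record, typename Query, typename Result>
+concept StaticStructure = requires(const std::vector<Record> &records,
+                                   const std::vector<const S*> &blocks,
+                                   const S &s, const Query &q) {
+    // Build a new instance from a set of records...
+    { S::build(records) } -> std::convertible_to<S*>;
+    // ...or by merging several existing instances during reconstruction.
+    { S::build_from_blocks(blocks) } -> std::convertible_to<S*>;
+    // Answer a query against a single instance; the framework combines
+    // the per-block results.
+    { s.query(q) } -> std::convertible_to<Result>;
+    { s.record_count() } -> std::convertible_to<std::size_t>;
+};
+\end{verbatim}
+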
+This idea is not without precedent. For example, a similar approach
+is used to provide fault-tolerance to indexes within traditional,
+disk-based RDBMSs. The RDBMS provides a storage engine which has its own
+fault-tolerance systems. Any data structure built on top of this storage
+engine can benefit from its crash recovery, requiring only a small amount
+of integration effort. As a result, crash recovery and fault
+tolerance are not handled at the level of the data structure in such
+systems. The B+Tree index itself doesn't have the mechanism built into
+it; instead, it relies upon the framework provided by the RDBMS.
+
+An existing technique, commonly called the Bentley-Saxe method (BSM),
+takes a similar approach to adding update support to static structures:
+the data is partitioned across multiple immutable instances of the
+structure, which are periodically rebuilt to absorb new records, and
+queries are answered by combining the results from each instance.
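+
+The following is a rough, self-contained sketch of the classical binary
+form of this method in C++, using a plain sorted run in place of an
+arbitrary static structure. It is meant only to illustrate the idea, not
+the framework developed in this work: records accumulate in a small
+buffer, blocks of geometrically increasing size are merged downward in
+the manner of a binary counter, and queries are answered by querying
+every block and combining the partial results.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <iterator>
+#include <vector>
+
+// An immutable sorted run, standing in for an arbitrary static structure.
+struct Block {
+    std::vector<int64_t> keys;   // kept sorted; never modified in place
+    size_t count(int64_t lo, int64_t hi) const {
+        auto l = std::lower_bound(keys.begin(), keys.end(), lo);
+        auto h = std::upper_bound(keys.begin(), keys.end(), hi);
+        return static_cast<size_t>(h - l);
+    }
+};
+
+// Bentley-Saxe (logarithmic method) dynamization: level i is either
+// empty or holds roughly B * 2^i records, so there are O(log n) levels.
+class DynamizedIndex {
+public:
+    explicit DynamizedIndex(size_t buffer_cap = 1024) : m_buffer_cap(buffer_cap) {}
+
+    void insert(int64_t key) {
+        m_buffer.push_back(key);
+        if (m_buffer.size() >= m_buffer_cap) flush_buffer();
+    }
+
+    // Answer the query against the buffer and every block, then combine.
+    size_t range_count(int64_t lo, int64_t hi) const {
+        size_t total = 0;
+        for (int64_t k : m_buffer)                       // unsorted buffer
+            if (k >= lo && k <= hi) total++;
+        for (const auto &b : m_levels) total += b.count(lo, hi);
+        return total;
+    }
+
+private:
+    // Binary-counter cascade: merge the spilled buffer into successive
+    // occupied levels until an empty level is found to hold the result.
+    void flush_buffer() {
+        std::sort(m_buffer.begin(), m_buffer.end());
+        std::vector<int64_t> carry = std::move(m_buffer);
+        m_buffer.clear();
+
+        size_t i = 0;
+        while (i < m_levels.size() && !m_levels[i].keys.empty()) {
+            std::vector<int64_t> merged;
+            merged.reserve(carry.size() + m_levels[i].keys.size());
+            std::merge(carry.begin(), carry.end(),
+                       m_levels[i].keys.begin(), m_levels[i].keys.end(),
+                       std::back_inserter(merged));
+            carry = std::move(merged);
+            m_levels[i].keys.clear();
+            i++;
+        }
+        if (i == m_levels.size()) m_levels.emplace_back();
+        m_levels[i].keys = std::move(carry);
+    }
+
+    size_t m_buffer_cap;
+    std::vector<int64_t> m_buffer;   // small mutable insert buffer
+    std::vector<Block> m_levels;     // immutable, geometrically sized blocks
+};
+\end{verbatim}
+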
+
+\section{Research Objectives}
+The proposed project has four major objectives:
+\begin{enumerate}
+\item Automatic Dynamic Extension
+
+ The first phase of this project has seen the development of a
+ \emph{dynamic extension framework}, which is capable of adding
+ support for inserts and deletes of data to otherwise static data
+ structures, so long as a few basic assumptions about the structure
+ and associated queries are satisfied. This framework is based on
+ the core principles of the Bentley-Saxe method, and is implemented
+ using C++ templates to allow for ease of use.
+
+    As part of the extension of BSM, a large design space has been
+    introduced, giving the framework a trade-off space between memory
+    usage, insert performance, and query performance (an illustrative
+    sketch of such a configuration appears following this list). This
+    allows the performance characteristics of the framework-extended
+    data structure to be tuned for particular use cases, and provides a
+    large degree of flexibility to the technique.
+
+\item Automatic Concurrency Support
+
+ Because the Bentley-Saxe method is based on the reconstruction
+ of otherwise immutable blocks, a basic concurrency implementation
+    is straightforward. While there are hard blocking points when a
+    reconstruction requires the results of an as-yet incomplete
+    reconstruction, all other operations can easily be performed
+    concurrently, so long as the destruction of blocks can be deferred
+    until all operations actively using them are complete. This lends
+    itself to a simple epoch-based system, where a particular
+    configuration of blocks constitutes an epoch, and the reconstruction
+    of one or more blocks triggers a shift to a new epoch upon its
+    completion. Each query will see exactly one epoch, and that epoch
+    will remain in existence until all queries using it have terminated
+    (a simplified sketch of this mechanism is given after this list).
+
+ With this strategy, the problem of adding support for concurrent
+ operations is largely converted into one of resource management.
+ Retaining old epochs, adding more buffers, and running reconstruction
+ operations all require storage. Further, large reconstructions
+ consume memory bandwidth and CPU resources, which must be shared
+ with active queries. And, at least some reconstructions will actively
+ block others, which will lead to tail latency spikes.
+
+ The objective of this phase of the project is the creation of a
+ scheduling system, built into the framework, that will schedule
+ queries and merges so as to ensure that the system operates within
+ specific tail latency and resource utilization constraints. In
+ particular, it is important to effectively hide the large insertion
+ tail latencies caused by reconstructions, and to limit the storage
+ required to retain old versions of the structure. Alongside
+ scheduling, the use of admission control will be considered for helping
+ to maintain latency guarantees even in adversarial conditions.
+
+\item Automatic Multi-node Support
+
+ It is increasingly the case that the requirements for data management
+ systems exceed the capacity of a single node, requiring horizontal
+ scaling. Unfortunately, the design of data structures that work
+ effectively in a distributed, multi-node environment is non-trivial.
+ However, the same design elements that make it straightforward to
+ implement a framework-driven concurrency system should also lend
+ themselves to adding multi-node support to a data structure. The
+ framework uses immutable blocks of data, which are periodically
+ reconstructed by combining them with other blocks. This system is
+ superficially similar to the RDDs used by Apache Spark, for example.
+
+    What is less straightforward, however, are the implementation
+    decisions that underlie this framework. It is not obvious that the
+ geometric block sizing technique used by BSM is well suited to this
+ task, and so a comprehensive evaluation of block sizing techniques
+ will be required. Additionally, there are significant challenges
+ to be overcome regarding block placement on nodes, fault-tolerance
+ and recovery, how best to handle buffering, and the effect of block
+ sizing strategies and placement on end-to-end query performance. All
+ of these problems will be studied during this phase of the project.
+
+
+\item Automatic Performance Tuning
+
+    During all phases of the project, tunable parameters will be
+    introduced that allow for trade-offs between insertion performance,
+    query performance, and memory usage. These allow a user to fine-tune
+    the performance characteristics of the framework to suit her
+    use-cases. However, this tunability may introduce an obstacle to
+    adoption of the system, as it is not necessarily trivial
+ to arrive at an effective configuration of the system, given a set of
+ performance requirements. Thus, the final phase of the project will
+ consider systems to automatically tune the framework. As a further
+ benefit, such a system could allow dynamic adjustment to the tunable
+    parameters of the framework during execution, enabling automatic
+    and transparent evolution in the face of changing workloads.
+
+\end{enumerate}
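+
+The design space referenced in the first objective can be illustrated,
+purely hypothetically, by a configuration structure of the following
+sort; the parameter names here are invented for illustration and are not
+the framework's actual interface.
+
+\begin{verbatim}
+#include <cstddef>
+
+// Hypothetical tuning knobs of the kind such a design space could expose.
+struct ExtensionConfig {
+    // Larger buffers favor insert throughput at the cost of query latency.
+    size_t buffer_capacity = 8000;
+
+    // Growth factor between levels: fewer, larger reconstructions versus
+    // more frequent, smaller ones.
+    size_t scale_factor = 6;
+
+    // Allow multiple blocks per level (write-optimized) or force a single
+    // block per level (read-optimized).
+    bool tiering = false;
+
+    // Handle deletes with tombstone records or by tagging records in place.
+    bool tombstone_deletes = true;
+};
+\end{verbatim}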
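+
+The epoch mechanism described in the second objective can be sketched as
+follows. This is a simplified illustration, using reference counting via
+shared pointers to defer the destruction of old versions, rather than the
+framework's actual implementation.
+
+\begin{verbatim}
+#include <memory>
+#include <mutex>
+#include <vector>
+
+// A frozen configuration of immutable blocks: one epoch of the structure.
+struct Version {
+    std::vector<std::shared_ptr<const std::vector<long>>> blocks;
+};
+
+// Queries pin the current epoch; completed reconstructions install a new
+// one. An old epoch is reclaimed automatically once the last query
+// holding a reference to it finishes.
+class EpochManager {
+public:
+    EpochManager() : m_current(std::make_shared<Version>()) {}
+
+    // Called by a query: the returned pointer keeps every block of this
+    // epoch alive for the duration of the query.
+    std::shared_ptr<const Version> pin() const {
+        std::lock_guard<std::mutex> g(m_lock);
+        return m_current;
+    }
+
+    // Called when a reconstruction completes: all subsequent queries see
+    // the new configuration of blocks; in-flight queries keep the old one.
+    void install(std::shared_ptr<const Version> next) {
+        std::lock_guard<std::mutex> g(m_lock);
+        m_current = std::move(next);
+    }
+
+private:
+    mutable std::mutex m_lock;
+    std::shared_ptr<const Version> m_current;
+};
+\end{verbatim}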
+
+
+The remaining work is organized into the following research thrusts:
+\begin{enumerate}
+ \item Thrust 1. Automatic Concurrency and Scheduling
+
+    The design of the framework lends itself to a straightforward,
+    data-structure-independent concurrency implementation, but ensuring good
+ performance of this implementation will require intelligent scheduling.
+ In this thrust, we will study the problem of scheduling operations
+ within the framework to meet certain tail latency guarantees, within a
+ particular set of resource constraints.
+
+    \begin{itemize}
+        \item RQ1: How best to parameterize merge and query operations?
+        \item RQ2: Develop a real-time (or near-real-time) scheduling
+              system to make decisions about when to merge, while
+              ensuring certain tail latency requirements are met within
+              a set of resource constraints.
+    \end{itemize}
+
+
+ \item Thrust 2. Temporal and Spatial Data Partitioning
+
+    The framework is based upon a temporal partitioning of data; however,
+    there are opportunities to improve the performance of certain
+    operations by introducing a spatial partitioning scheme as well. In
+    this thrust, we will expand the framework to support arbitrary
+    partitioning schemes, and assess the efficacy of spatial partitioning
+    in a variety of contexts.
+
+    \begin{itemize}
+        \item RQ1: What effect does spatial partitioning within levels
+              have on the performance of inserts and queries?
+        \item RQ2: Does a trade-off exist between spatial and temporal
+              partitioning?
+        \item RQ3: To what degree do results about spatial partitioning
+              generalize across different types of index (particularly
+              multi-dimensional ones)?
+    \end{itemize}
+
+ \item Thrust 3. Dynamic Performance Tuning
+
+ The framework contains a large number of tunable parameters which allow
+ for trade-offs between memory usage, read performance, and write
+ performance. In this thrust, we will comprehensively evaluate this
+ design space, and develop a system for automatically adjusting these
+ parameters during system operation. This will allow the system to
+ dynamically change its own configuration when the workload changes.
+
+    \begin{itemize}
+        \item RQ1: Quantify and model the effects of framework tuning
+              parameters on various performance metrics.
+        \item RQ2: Evaluate the utility of a heterogeneous configuration,
+              with different parameter values on different levels.
+        \item RQ3: Develop a system for dynamically adjusting these values
+              based on current performance data.
+    \end{itemize}
+
+\end{enumerate}