diff --git a/chapters/chapter1-old.tex b/chapters/chapter1-old.tex
new file mode 100644
index 0000000..fca257d
--- /dev/null
+++ b/chapters/chapter1-old.tex
@@ -0,0 +1,256 @@
+\chapter{Introduction}
+
+It probably goes without saying that database systems are heavily
+dependent upon data structures, both for auxiliary use within the system
+itself, and for indexing the data in storage to facilitate faster access.
+As a result, the design of novel data structures constitutes a
+significant sub-field within the database community. However, there is a
+stark divide between theoretical and so-called ``practical'' work in
+this area, with many theoretically oriented data structures seeing
+little, if any, use in real systems. I would go so far as to assert that
+many of these published data structures have \emph{never} actually
+been used.
+
+This situation exists for a reason, of course. Fundamentally, the rules
+of engagement within the theory community differ from those within the
+systems community. Asymptotic analysis, which eschews constant factors,
+dominates theoretical analysis of data structures, whereas the systems
+community cares a great deal about these constants. We'll see within
+this document just how significant this divide is in terms of real
+performance numbers. But perhaps an even greater barrier to the adoption
+of theoretical data structures is the question of feature support.
+
+A data structure, technically speaking, only needs to define algorithms
+for constructing and querying it. I'll describe such minimal structures
+as \emph{static data structures} within this document. Many theoretical
+structures that seem potentially useful fall into this category. Examples
+include alias-augmented structures for independent sampling, vantage-point
+trees for multi-dimensional similarity search, ISAM trees for traditional
+one-dimensional indexing, and the vast majority of learned indexes.
+
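+To make this notion concrete, the following is a minimal sketch, in C++,
+of such a static structure: a sorted array answering range-count queries.
+It is purely illustrative and does not correspond to any of the structures
+listed above; it exposes only a construction algorithm and a query
+algorithm, and nothing else.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// A minimal "static data structure": built once from a set of
+// records and then queried; no other operations are supported.
+class StaticIndex {
+public:
+    // Construction: sort the input once; the layout is immutable afterwards.
+    explicit StaticIndex(std::vector<int64_t> keys) : m_keys(std::move(keys)) {
+        std::sort(m_keys.begin(), m_keys.end());
+    }
+
+    // Query: count the keys falling within [low, high].
+    size_t range_count(int64_t low, int64_t high) const {
+        auto lo = std::lower_bound(m_keys.begin(), m_keys.end(), low);
+        auto hi = std::upper_bound(m_keys.begin(), m_keys.end(), high);
+        return static_cast<size_t>(hi - lo);
+    }
+
+    size_t record_count() const { return m_keys.size(); }
+
+private:
+    std::vector<int64_t> m_keys;  // sorted, never modified after construction
+};
+\end{verbatim}
+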
+These structures allow their associated query types to be answered highly
+efficiently, but have either fallen out of use (ISAM trees) or have
+yet to see widespread adoption in database systems. This is because the
+minimal interface provided by a static data structure is usually not
+sufficient to address the real-world engineering challenges associated
+with database systems. Instead, data structures used by such systems must
+support a variety of additional features: updates to the underlying data,
+concurrent access, fault-tolerance, etc. This lack of feature support
+is a major barrier to the adoption of such structures.
+
+In the current data structure design paradigm, support for such features
+requires extensive redesign of the static data structure, often over a
+lengthy development cycle. Learned indexes provide a good case study for
+this. The first learned index, RMI, was proposed by Kraska \emph{et al.}
+in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure,
+and the idea behind it, was, it lacked support for updates and thus was
+of very limited practical utility. Work then proceeded to develop an
+updatable data structure based on the concepts of RMI, culminating in
+ALEX~\cite{alex}, which first appeared on arXiv roughly a year and a
+half later. The next several years saw the
+development of a wide range of learned indexes, promising support for
+updates and concurrency. However, a recent survey found that all of them
+were still largely inferior to more mature indexing techniques, at least
+on certain workloads.
+
+These adventures in learned index design represent much of the modern
+index design process in microcosm. It is not unreasonable to expect
+that, as the technology matures, learned indexes may one day become
+commonplace. But the amount of research and development effort required
+to get there is clearly vast.
+
+On the opposite end of the spectrum, theoretical data structure papers
+also attempt to extend their structures with update support using a
+variety of techniques. However, the differing rules of engagement often
+result in solutions to this problem that are horribly impractical in
+database systems. As an example, Hu, Qiao, and Tao have proposed a data
+structure for efficient range sampling, and included in their design a
+discussion of efficient support for updates~\cite{irs}. Without getting
+into details, they need to add multiple additional data structures
+alongside their sampling structure to facilitate this, including a hash
+table and multiple linked lists. Asymptotically, this approach doesn't
+affect space or time complexity, as there is a constant number of extra
+structures and the costs of maintaining and accessing them are on par
+with those of the main structure. But it's clear that the space
+and time costs of these extra data structures would be significant in
+a real system. A similar problem arises in a recent attempt to create a
+dynamic alias structure, which uses multiple auxiliary data structures,
+and further assumes that the key space size is a constant that can be
+neglected~\cite{that-paper}.
+
+Further, update support is only one of many features that a data
+structure must support for use in database systems. Given these challenges
+associated with just update support, one can imagine the amount of work
+required to get a data structure fully ``production ready''!
+
+However, all of these tribulations are, I'm going to argue, not
+fundamental to data structure design, but rather a consequence of the
+modern data structure design paradigm. Rather than this process of manual
+integration of features into the data structure itself, we propose a
+new paradigm: \emph{Framework-driven Data Structure Design}. Under this
+paradigm, the process of designing a data structure is reduced to the
+static case: an algorithm for querying the structure and an algorithm
+for building it from a set of elements. Once these are defined, a
+high-level framework can be used to automatically add support for other
+desirable features, such as updates, concurrency, and fault-tolerance,
+in a manner that is mostly transparent to the static structure itself.
+
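+For illustration, the contract that a structure designer would be asked to
+satisfy under this paradigm could be expressed as a C++ concept along the
+following lines. This is a hypothetical sketch, not the interface of the
+framework described later in this document; the additional reconstruction
+operation is included here because the block-merging behavior discussed
+later in this chapter requires it.
+
+\begin{verbatim}
+#include <concepts>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical set of requirements a static structure must satisfy so
+// that a framework can layer updates, concurrency, etc. on top of it.
+template <typename S, typename Record, typename Query, typename Result>
+concept StaticStructure = requires(const std::vector<Record> &records,
+                                   const std::vector<const S*> &blocks,
+                                   const S &s, const Query &q) {
+    // Build a new instance from a set of records...
+    { S::build(records) } -> std::convertible_to<S*>;
+    // ...or by merging several existing instances during reconstruction.
+    { S::build_from_blocks(blocks) } -> std::convertible_to<S*>;
+    // Answer a query against a single instance; the framework combines
+    // the per-block results.
+    { s.query(q) } -> std::convertible_to<Result>;
+    { s.record_count() } -> std::convertible_to<std::size_t>;
+};
+\end{verbatim}
+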
+This idea is not without precedent. For example, a similar approach
+is used to provide fault-tolerance to indexes within traditional,
+disk-based RDBMSs. The RDBMS provides a storage engine which has its own
+fault-tolerance systems. Any data structure built on top of this storage
+engine can benefit from its crash recovery, requiring only a small amount
+of integration effort. As a result, crash recovery and fault
+tolerance are not handled at the level of the data structure in such
+systems. The B+Tree index itself doesn't have the mechanism built into
+it; instead, it relies upon the framework provided by the RDBMS.
+
+An existing technique, commonly called the Bentley-Saxe method (BSM),
+takes a similar approach to adding update support to static structures:
+the data is partitioned across multiple immutable instances of the
+structure, which are periodically rebuilt to absorb new records, and
+queries are answered by combining the results from each instance.
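+
+The following is a rough, self-contained sketch of the classical binary
+form of this method in C++, using a plain sorted run in place of an
+arbitrary static structure. It is meant only to illustrate the idea, not
+the framework developed in this work: records accumulate in a small
+buffer, blocks of geometrically increasing size are merged downward in
+the manner of a binary counter, and queries are answered by querying
+every block and combining the partial results.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <cstdint>
+#include <iterator>
+#include <vector>
+
+// An immutable sorted run, standing in for an arbitrary static structure.
+struct Block {
+    std::vector<int64_t> keys;   // kept sorted; never modified in place
+    size_t count(int64_t lo, int64_t hi) const {
+        auto l = std::lower_bound(keys.begin(), keys.end(), lo);
+        auto h = std::upper_bound(keys.begin(), keys.end(), hi);
+        return static_cast<size_t>(h - l);
+    }
+};
+
+// Bentley-Saxe (logarithmic method) dynamization: level i is either
+// empty or holds roughly B * 2^i records, so there are O(log n) levels.
+class DynamizedIndex {
+public:
+    explicit DynamizedIndex(size_t buffer_cap = 1024) : m_buffer_cap(buffer_cap) {}
+
+    void insert(int64_t key) {
+        m_buffer.push_back(key);
+        if (m_buffer.size() >= m_buffer_cap) flush_buffer();
+    }
+
+    // Answer the query against the buffer and every block, then combine.
+    size_t range_count(int64_t lo, int64_t hi) const {
+        size_t total = 0;
+        for (int64_t k : m_buffer)                       // unsorted buffer
+            if (k >= lo && k <= hi) total++;
+        for (const auto &b : m_levels) total += b.count(lo, hi);
+        return total;
+    }
+
+private:
+    // Binary-counter cascade: merge the spilled buffer into successive
+    // occupied levels until an empty level is found to hold the result.
+    void flush_buffer() {
+        std::sort(m_buffer.begin(), m_buffer.end());
+        std::vector<int64_t> carry = std::move(m_buffer);
+        m_buffer.clear();
+
+        size_t i = 0;
+        while (i < m_levels.size() && !m_levels[i].keys.empty()) {
+            std::vector<int64_t> merged;
+            merged.reserve(carry.size() + m_levels[i].keys.size());
+            std::merge(carry.begin(), carry.end(),
+                       m_levels[i].keys.begin(), m_levels[i].keys.end(),
+                       std::back_inserter(merged));
+            carry = std::move(merged);
+            m_levels[i].keys.clear();
+            i++;
+        }
+        if (i == m_levels.size()) m_levels.emplace_back();
+        m_levels[i].keys = std::move(carry);
+    }
+
+    size_t m_buffer_cap;
+    std::vector<int64_t> m_buffer;   // small mutable insert buffer
+    std::vector<Block> m_levels;     // immutable, geometrically sized blocks
+};
+\end{verbatim}
+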
+
+\section{Research Objectives}
+The proposed project has four major objectives:
+\begin{enumerate}
+\item Automatic Dynamic Extension
+
+ The first phase of this project has seen the development of a
+ \emph{dynamic extension framework}, which is capable of adding
+ support for inserts and deletes of data to otherwise static data
+ structures, so long as a few basic assumptions about the structure
+ and associated queries are satisfied. This framework is based on
+ the core principles of the Bentley-Saxe method, and is implemented
+ using C++ templates to allow for ease of use.
+
+    As part of the extension of BSM, a large design space has been
+    introduced, giving the framework a trade-off space between memory
+    usage, insert performance, and query performance (an illustrative
+    sketch of such a configuration appears following this list). This
+    allows the performance characteristics of the framework-extended
+    data structure to be tuned for particular use cases, and provides a
+    large degree of flexibility to the technique.
+
+\item Automatic Concurrency Support
+
+ Because the Bentley-Saxe method is based on the reconstruction
+ of otherwise immutable blocks, a basic concurrency implementation
+    is straightforward. While there are hard blocking points when a
+    reconstruction requires the results of an as-yet incomplete
+    reconstruction, all other operations can easily be performed
+    concurrently, so long as the destruction of blocks can be deferred
+    until all operations actively using them are complete. This lends
+    itself to a simple epoch-based system, where a particular
+    configuration of blocks constitutes an epoch, and the reconstruction
+    of one or more blocks triggers a shift to a new epoch upon its
+    completion. Each query will see exactly one epoch, and that epoch
+    will remain in existence until all queries using it have terminated
+    (a simplified sketch of this mechanism is given after this list).
+
+ With this strategy, the problem of adding support for concurrent
+ operations is largely converted into one of resource management.
+ Retaining old epochs, adding more buffers, and running reconstruction
+ operations all require storage. Further, large reconstructions
+ consume memory bandwidth and CPU resources, which must be shared
+ with active queries. And, at least some reconstructions will actively
+ block others, which will lead to tail latency spikes.
+
+ The objective of this phase of the project is the creation of a
+ scheduling system, built into the framework, that will schedule
+ queries and merges so as to ensure that the system operates within
+ specific tail latency and resource utilization constraints. In
+ particular, it is important to effectively hide the large insertion
+ tail latencies caused by reconstructions, and to limit the storage
+ required to retain old versions of the structure. Alongside
+ scheduling, the use of admission control will be considered for helping
+ to maintain latency guarantees even in adversarial conditions.
+
+\item Automatic Multi-node Support
+
+ It is increasingly the case that the requirements for data management
+ systems exceed the capacity of a single node, requiring horizontal
+ scaling. Unfortunately, the design of data structures that work
+ effectively in a distributed, multi-node environment is non-trivial.
+ However, the same design elements that make it straightforward to
+ implement a framework-driven concurrency system should also lend
+ themselves to adding multi-node support to a data structure. The
+ framework uses immutable blocks of data, which are periodically
+ reconstructed by combining them with other blocks. This system is
+ superficially similar to the RDDs used by Apache Spark, for example.
+
+    What is less straightforward, however, are the implementation
+    decisions that underlie this framework. It is not obvious that the
+ geometric block sizing technique used by BSM is well suited to this
+ task, and so a comprehensive evaluation of block sizing techniques
+ will be required. Additionally, there are significant challenges
+ to be overcome regarding block placement on nodes, fault-tolerance
+ and recovery, how best to handle buffering, and the effect of block
+ sizing strategies and placement on end-to-end query performance. All
+ of these problems will be studied during this phase of the project.
+
+
+\item Automatic Performance Tuning
+
+    During all phases of the project, tunable parameters will be
+    introduced that allow for trade-offs between insertion performance,
+    query performance, and memory usage. These allow a user to fine-tune
+    the performance characteristics of the framework to suit her
+    use-cases. However, this tunability may introduce an obstacle to
+    adoption of the system, as it is not necessarily trivial
+ to arrive at an effective configuration of the system, given a set of
+ performance requirements. Thus, the final phase of the project will
+ consider systems to automatically tune the framework. As a further
+ benefit, such a system could allow dynamic adjustment to the tunable
+    parameters of the framework during execution, enabling automatic
+    and transparent evolution in the face of changing workloads.
+
+\end{enumerate}
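+
+The design space referenced in the first objective can be illustrated,
+purely hypothetically, by a configuration structure of the following
+sort; the parameter names here are invented for illustration and are not
+the framework's actual interface.
+
+\begin{verbatim}
+#include <cstddef>
+
+// Hypothetical tuning knobs of the kind such a design space could expose.
+struct ExtensionConfig {
+    // Larger buffers favor insert throughput at the cost of query latency.
+    size_t buffer_capacity = 8000;
+
+    // Growth factor between levels: fewer, larger reconstructions versus
+    // more frequent, smaller ones.
+    size_t scale_factor = 6;
+
+    // Allow multiple blocks per level (write-optimized) or force a single
+    // block per level (read-optimized).
+    bool tiering = false;
+
+    // Handle deletes with tombstone records or by tagging records in place.
+    bool tombstone_deletes = true;
+};
+\end{verbatim}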
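+
+The epoch mechanism described in the second objective can be sketched as
+follows. This is a simplified illustration, using reference counting via
+shared pointers to defer the destruction of old versions, rather than the
+framework's actual implementation.
+
+\begin{verbatim}
+#include <memory>
+#include <mutex>
+#include <vector>
+
+// A frozen configuration of immutable blocks: one epoch of the structure.
+struct Version {
+    std::vector<std::shared_ptr<const std::vector<long>>> blocks;
+};
+
+// Queries pin the current epoch; completed reconstructions install a new
+// one. An old epoch is reclaimed automatically once the last query
+// holding a reference to it finishes.
+class EpochManager {
+public:
+    EpochManager() : m_current(std::make_shared<Version>()) {}
+
+    // Called by a query: the returned pointer keeps every block of this
+    // epoch alive for the duration of the query.
+    std::shared_ptr<const Version> pin() const {
+        std::lock_guard<std::mutex> g(m_lock);
+        return m_current;
+    }
+
+    // Called when a reconstruction completes: all subsequent queries see
+    // the new configuration of blocks; in-flight queries keep the old one.
+    void install(std::shared_ptr<const Version> next) {
+        std::lock_guard<std::mutex> g(m_lock);
+        m_current = std::move(next);
+    }
+
+private:
+    mutable std::mutex m_lock;
+    std::shared_ptr<const Version> m_current;
+};
+\end{verbatim}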
+
+
+The remaining work is organized into the following research thrusts:
+\begin{enumerate}
+ \item Thrust 1. Automatic Concurrency and Scheduling
+
+    The design of the framework lends itself to a straightforward,
+    data-structure-independent concurrency implementation, but ensuring good
+ performance of this implementation will require intelligent scheduling.
+ In this thrust, we will study the problem of scheduling operations
+ within the framework to meet certain tail latency guarantees, within a
+ particular set of resource constraints.
+
+    \begin{itemize}
+        \item RQ1: How best to parameterize merge and query operations?
+        \item RQ2: Develop a real-time (or near-real-time) scheduling
+              system to make decisions about when to merge, while
+              ensuring certain tail latency requirements are met within
+              a set of resource constraints.
+    \end{itemize}
+
+
+ \item Thrust 2. Temporal and Spatial Data Partitioning
+
+    The framework is based upon a temporal partitioning of data; however,
+    there are opportunities to improve the performance of certain
+    operations by introducing a spatial partitioning scheme as well. In
+    this thrust, we will expand the framework to support arbitrary
+    partitioning schemes, and assess the efficacy of spatial partitioning
+    in a variety of contexts.
+
+    \begin{itemize}
+        \item RQ1: What effect does spatial partitioning within levels
+              have on the performance of inserts and queries?
+        \item RQ2: Does a trade-off exist between spatial and temporal
+              partitioning?
+        \item RQ3: To what degree do results about spatial partitioning
+              generalize across different types of index (particularly
+              multi-dimensional ones)?
+    \end{itemize}
+
+ \item Thrust 3. Dynamic Performance Tuning
+
+ The framework contains a large number of tunable parameters which allow
+ for trade-offs between memory usage, read performance, and write
+ performance. In this thrust, we will comprehensively evaluate this
+ design space, and develop a system for automatically adjusting these
+ parameters during system operation. This will allow the system to
+ dynamically change its own configuration when the workload changes.
+
+    \begin{itemize}
+        \item RQ1: Quantify and model the effects of framework tuning
+              parameters on various performance metrics.
+        \item RQ2: Evaluate the utility of a heterogeneous configuration,
+              with different parameter values on different levels.
+        \item RQ3: Develop a system for dynamically adjusting these values
+              based on current performance data.
+    \end{itemize}
+
+\end{enumerate}