\chapter{Introduction}

It probably goes without saying that database systems are heavily
dependent upon data structures, both for auxiliary use within the system
itself and for indexing the data in storage to facilitate faster access.
As a result, the design of novel data structures constitutes a
significant subfield within the database community. However, there is a
stark divide between theoretical work and so-called ``practical'' work in
this area, with many theoretically oriented data structures seeing
little, if any, use in real systems. I would go so far as to assert that
many of these published data structures have \emph{never} actually been
used.

This situation exists for a reason, of course. Fundamentally, the rules
of engagement within the theory community differ from those within the
systems community. Asymptotic analysis, which eschews constant factors,
dominates the theoretical analysis of data structures, whereas the systems
community cares a great deal about these constants. We will see within
this document just how significant this divide is in terms of
real performance numbers. But perhaps an even more significant barrier
to theoretical data structures is that of feature support.

A data structure, technically speaking, only needs to define algorithms
for constructing and querying it. I will describe such minimal structures
as \emph{static data structures} within this document. Many theoretical
structures that seem potentially useful fall into this category. Examples
include alias-augmented structures for independent sampling, vantage-point
trees for multi-dimensional similarity search, ISAM trees for traditional
one-dimensional indexing, the vast majority of learned indexes, etc.
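To make the notion of a static data structure concrete, the following is a minimal sketch of the interface: one bulk-construction routine and one query routine, with no mechanism for updates. The class and method names here are purely illustrative, not drawn from any particular system; a sorted array with binary search stands in for the static structure.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A "static" structure in the sense used above: the only way to get
// data in is bulk construction, and the only other operation is a query.
// Example instance: a sorted-array membership structure.
class SortedArray {
public:
    // Build: construct the structure from a set of elements, once.
    explicit SortedArray(std::vector<int> data) : data_(std::move(data)) {
        std::sort(data_.begin(), data_.end());
    }

    // Query: membership test via binary search over the sorted data.
    bool contains(int key) const {
        return std::binary_search(data_.begin(), data_.end(), key);
    }

private:
    std::vector<int> data_;  // immutable after construction
};
```

Anything beyond this pair of operations (inserts, deletes, concurrent access, recovery) falls outside the static interface and, under the current design paradigm, must be engineered into the structure by hand.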
These structures allow for highly efficient answering of their associated
query types, but have either fallen out of use (ISAM trees) or have
yet to see widespread adoption in database systems. This is because the
minimal interface provided by a static data structure is usually not
sufficient to address the real-world engineering challenges associated
with database systems. Instead, data structures used by such systems must
support a variety of additional features: updates to the underlying data,
concurrent access, fault tolerance, etc. This lack of feature support
is a major barrier to the adoption of such structures.

In the current data structure design paradigm, support for such features
requires extensive redesign of the static data structure, often over a
lengthy development cycle. Learned indexes provide a good case study.
The first learned index, RMI, was proposed by Kraska \emph{et al.}
in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure
and the idea behind it were, it lacks support for updates and thus was
of very limited practical utility. Work then proceeded to develop an
updatable data structure based on the concepts of RMI, culminating in
ALEX~\cite{alex}, which first appeared on arXiv a year and a half
later. The next several years saw the development of a wide range of
learned indexes promising support for updates and concurrency.
However, a recent survey found that all of them were still largely
inferior to more mature indexing techniques, at least on certain
workloads.

These adventures in learned index design represent much of the modern
index design process in microcosm. It is not unreasonable to expect
that, as the technology matures, learned indexes may one day become
commonplace. But the amount of development and research effort required
to get there is, clearly, vast.
On the opposite end of the spectrum, theoretical data structure works
also attempt to extend their structures with update support, using a
variety of techniques. However, the differing rules of engagement often
result in solutions to this problem that are horribly impractical in
database systems. As an example, Hu, Qiao, and Tao have proposed a data
structure for efficient range sampling, and included in their design a
discussion of efficient support for updates~\cite{irs}. Without getting
into details, they must add multiple additional data structures alongside
their sampling structure to facilitate this, including a hash table and
multiple linked lists. Asymptotically, this approach affects neither space
nor time complexity, as there is a constant number of extra structures
and the costs of maintaining and accessing them are on par with the costs
associated with the main structure. But it is clear that the space
and time costs of these extra data structures would be relevant in
a real system. A similar problem arises in a recent attempt to create a
dynamic alias structure, which uses multiple auxiliary data structures
and further assumes that the size of the key space is a constant that can
be neglected~\cite{that-paper}.

Further, update support is only one of many features that a data
structure must provide for use in database systems. Given the challenges
associated with update support alone, one can imagine the amount of work
required to make a data structure fully ``production ready''!

However, all of these tribulations are, I will argue, not
fundamental to data structure design, but rather a consequence of the
modern data structure design paradigm. Rather than this process of manual
integration of features into the data structure itself, we propose a
new paradigm: \emph{Framework-driven Data Structure Design}.
Under this paradigm, the process of designing a data structure is
reduced to the static case: an algorithm for querying the structure and
an algorithm for building it from a set of elements. Once these are
defined, a high-level framework can automatically add support for other
desirable features, such as updates, concurrency, and fault tolerance,
in a manner that is mostly transparent to the static structure itself.

This idea is not without precedent. For example, a similar approach
is used to provide fault tolerance to indexes within traditional,
disk-based RDBMSs. The RDBMS provides a storage engine that has its own
fault-tolerance mechanisms. Any data structure built on top of this
storage engine can benefit from its crash recovery, requiring only a
small amount of integration effort. As a result, crash recovery and fault
tolerance are not handled at the level of the data structure in such
systems: the B+Tree index does not have the mechanism built into
it, but relies upon the framework provided by the RDBMS.

Similarly, there is an existing technique, commonly called the
Bentley-Saxe method, which uses a comparable process to add update
support to static structures.

\section{Research Objectives}
The proposed project has four major objectives.
\begin{enumerate}
\item Automatic Dynamic Extension

  The first phase of this project has seen the development of a
  \emph{dynamic extension framework}, which is capable of adding
  support for inserts and deletes to otherwise static data
  structures, so long as a few basic assumptions about the structure
  and its associated queries are satisfied. This framework is based on
  the core principles of the Bentley-Saxe method, and is implemented
  using C++ templates for ease of use.
  As part of this extension of the Bentley-Saxe method, a large design
  space has been introduced, giving the framework a trade-off space
  between memory usage, insert performance, and query performance. This
  allows the performance characteristics of the framework-extended data
  structure to be tuned for particular use cases, and provides a large
  degree of flexibility to the technique.

\item Automatic Concurrency Support

  Because the Bentley-Saxe method is based on the reconstruction
  of otherwise immutable blocks, a basic concurrency implementation
  is straightforward. While there are hard blocking points when a
  reconstruction requires the results of an as-yet-incomplete
  reconstruction, all other operations can easily proceed
  concurrently, so long as the destruction of a block can be deferred
  until all operations actively using it have completed. This lends
  itself to a simple epoch-based system, in which a particular
  configuration of blocks constitutes an epoch, and the reconstruction
  of one or more blocks triggers a shift to a new epoch upon its
  completion. Each query sees exactly one epoch, and that epoch remains
  in existence until all queries using it have terminated.

  With this strategy, the problem of adding support for concurrent
  operations is largely converted into one of resource management.
  Retaining old epochs, adding more buffers, and running reconstruction
  operations all require storage. Further, large reconstructions
  consume memory bandwidth and CPU resources, which must be shared
  with active queries. And at least some reconstructions will actively
  block others, leading to tail latency spikes.

  The objective of this phase of the project is the creation of a
  scheduling system, built into the framework, that will schedule
  queries and merges so as to ensure that the system operates within
  specified tail latency and resource utilization constraints.
  In particular, it is important to effectively hide the large insertion
  tail latencies caused by reconstructions, and to limit the storage
  required to retain old versions of the structure. Alongside
  scheduling, the use of admission control will be considered to help
  maintain latency guarantees even under adversarial conditions.

\item Automatic Multi-node Support

  It is increasingly the case that the requirements of data management
  systems exceed the capacity of a single node, requiring horizontal
  scaling. Unfortunately, the design of data structures that work
  effectively in a distributed, multi-node environment is non-trivial.
  However, the same design elements that make it straightforward to
  implement framework-driven concurrency should also lend
  themselves to adding multi-node support to a data structure. The
  framework uses immutable blocks of data, which are periodically
  reconstructed by combining them with other blocks. This system is
  superficially similar to the RDDs used by Apache Spark, for example.

  What is not so straightforward, however, are the implementation
  decisions that underlie this framework. It is not obvious that the
  geometric block-sizing technique used by the Bentley-Saxe method is
  well suited to this task, so a comprehensive evaluation of
  block-sizing techniques will be required. Additionally, there are
  significant challenges to overcome regarding block placement on
  nodes, fault tolerance and recovery, how best to handle buffering,
  and the effect of block-sizing strategies and placement on end-to-end
  query performance. All of these problems will be studied during this
  phase of the project.

\item Automatic Performance Tuning

  During all phases of the project, various tunable parameters will
  be introduced that allow for trade-offs between insertion
  performance, query performance, and memory usage.
  These allow a user to fine-tune the performance characteristics of
  the framework to suit her use cases. However, this tunability may
  present an obstacle to adoption, as it is not necessarily trivial
  to arrive at an effective configuration of the system given a set of
  performance requirements. Thus, the final phase of the project will
  consider systems to automatically tune the framework. As a further
  benefit, such a system could allow dynamic adjustment of the
  framework's tunable parameters during execution, allowing automatic
  and transparent evolution in the face of changing workloads.

\end{enumerate}
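As a rough illustration of the Bentley-Saxe decomposition underlying the first objective, the sketch below dynamizes a trivial static structure (a sorted integer array) by maintaining geometrically sized immutable blocks and merging on insert, like incrementing a binary counter. This is a simplification for exposition, with hypothetical names, not the framework's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Bentley-Saxe sketch: blocks_[i] holds either 0 or 2^i keys. An insert
// creates a size-1 block and merges upward through occupied levels,
// rebuilding the static structure (here, a sort) at the destination.
class DynamizedSet {
public:
    void insert(int key) {
        std::vector<int> carry{key};  // a new "block" of size 1
        std::size_t level = 0;
        // Carry upward through occupied levels, as in binary addition.
        while (level < blocks_.size() && !blocks_[level].empty()) {
            carry.insert(carry.end(),
                         blocks_[level].begin(), blocks_[level].end());
            blocks_[level].clear();
            ++level;
        }
        std::sort(carry.begin(), carry.end());  // "rebuild" the static block
        if (level == blocks_.size()) blocks_.emplace_back();
        blocks_[level] = std::move(carry);
    }

    // A query is posed to every block, and the per-block results are
    // combined; for membership the combination is a logical OR.
    bool contains(int key) const {
        for (const auto& b : blocks_)
            if (std::binary_search(b.begin(), b.end(), key)) return true;
        return false;
    }

private:
    std::vector<std::vector<int>> blocks_;
};
```

The notable property is that the static structure is used unmodified: only its build and query routines are invoked, which is exactly what makes a generic, template-driven extension framework possible.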
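The epoch mechanism described under the concurrency objective can likewise be sketched in miniature, with reference counting standing in for a full epoch manager: a query pins the block configuration it starts under, a reconstruction installs a new configuration, and an old epoch is reclaimed only when its last pin is dropped. All names here are illustrative, and the single-threaded sketch elides the atomics a real implementation would need.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// An epoch is an immutable snapshot of the block configuration.
struct Epoch {
    std::vector<std::vector<int>> blocks;
};

class EpochManager {
public:
    EpochManager() : current_(std::make_shared<Epoch>()) {}

    // A query pins the current epoch for its whole lifetime; the epoch
    // (and its blocks) cannot be destroyed while any pin is held.
    std::shared_ptr<const Epoch> pin() const { return current_; }

    // A reconstruction builds a new configuration and installs it as a
    // fresh epoch. Old epochs stay alive until their last pin drops.
    void install(std::vector<std::vector<int>> new_blocks) {
        auto next = std::make_shared<Epoch>();
        next->blocks = std::move(new_blocks);
        current_ = std::move(next);  // in real code: an atomic swap
    }

private:
    std::shared_ptr<Epoch> current_;
};
```

Under this scheme reclamation is automatic, which is precisely why the engineering problem shifts from correctness to resource management: every retained epoch holds storage until its readers finish.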