| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
| commit | 5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch) | |
| tree | 276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/introduction.tex | |
| download | dissertation-5e4ad2777acc4c2420514e39fb98b7cf2e200996.tar.gz | |
Initial commit
Diffstat (limited to 'chapters/introduction.tex')
| -rw-r--r-- | chapters/introduction.tex | 95 |
1 files changed, 95 insertions, 0 deletions
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
new file mode 100644
index 0000000..a5d9740
--- /dev/null
+++ b/chapters/introduction.tex
@@ -0,0 +1,95 @@
+\chapter{Introduction}
+\label{chap:intro}
+
+One of the major challenges facing current data systems is the processing
+of complex and varied analytical queries over vast data sets. One commonly
+used technique for accelerating these queries is the application of data
+structures to create indexes, which are the basis for specialized database
+systems and data processing libraries. Unfortunately, the development
+of these indexes is difficult because of the requirements placed on
+them by data processing systems. Data is frequently subject to updates,
+yet a large number of potentially useful data structures are static.
+Moreover, many large-scale data processing systems are highly concurrent,
+which increases the barrier to entry even further. The process for
+developing data structures that satisfy these requirements is arduous.
+
+To demonstrate this difficulty, consider the recent example of the
+evolution of learned indexes. These are data structures designed to
+efficiently solve a simple problem: single-dimensional range queries
+over sorted data. They seek to reduce the size of the structure, as
+well as lookup times, by replacing a traditional data structure with a
+learned model capable of predicting the location of a record in storage
+that matches a key value to within bounded error. This concept was first
+proposed by Kraska et al. in 2017, when they published a paper on the
+first learned index, RMI~\cite{RMI}. This index succeeded in showing
+that a learned model can be both faster and smaller than a conventional
+range index, but the proposed solution did not support updates. The
+first (non-concurrently) updatable learned index, ALEX, took a year
+and a half to appear~\cite{ALEX}.
Over the course of the subsequent
+three years, several learned indexes were proposed with concurrency
+support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
+recent performance study~\cite{10.14778/3551793.3551848} showed that these
+were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
+a traditional index. This same study did, however, demonstrate that a new
+design, ALEX+, was able to outperform ART-OLC under certain circumstances,
+but even with this result learned indexes are not generally considered
+production-ready, because they suffer from significant performance
+regressions under certain workloads and are highly sensitive to the
+distribution of keys~\cite{10.14778/3551793.3551848}. Despite the
+demonstrable advantages of the technique and over half a decade of
+development, learned indexes still have not reached a generally usable
+state.\footnote{
+ In Chapter~\ref{chap:framework}, we apply our proposed technique to
+ existing static learned indexes to produce an effective dynamic index.
+}
+
+This work proposes a strategy for addressing this problem by providing a
+framework for automatically introducing support for concurrent updates
+(including both inserts and deletes) to many static data structures. With
+this framework, a wide range of static, or otherwise impractical, data
+structures will be made practically useful in data systems. Based
+on a classical, theoretical framework called the Bentley-Saxe
+method~\cite{saxe79}, the proposed system will provide a library
+that can automatically extend many data structures with support for
+concurrent updates, as well as a tunable design space to allow the
+user to make trade-offs between read performance, write performance,
+and storage usage. The framework will address a number of limitations
+present in the original technique, greatly increasing its applicability
+and practicality.
It will also provide a workload-adaptive, online tuning
+system that can automatically adjust the tuning parameters of the data
+structure in the face of changing workloads.
+
+This framework is based on splitting the data structure into
+several smaller pieces, which are periodically reconstructed to support
+updates. A systematic partitioning and reconstruction approach is used
+to provide specific guarantees on amortized insertion performance and
+worst-case query performance. The underlying Bentley-Saxe method is
+extended using a novel query abstraction to broaden its applicability,
+and the partitioning and reconstruction processes are adjusted to improve
+performance and introduce configurability.
+
+Specifically, the proposed work will address the following points:
+\begin{enumerate}
+ \item The proposal of a theoretical framework for analyzing queries
+ and data structures that extends existing theoretical
+ approaches and allows for more data structures to be dynamized.
+ \item The design of a system based upon this theoretical framework
+ for automatically dynamizing static data structures in a performant
+ and configurable manner.
+ \item The extension of this system with support for concurrent operations,
+ and the use of concurrency to provide more effective worst-case
+ performance guarantees.
+\end{enumerate}
+
+The rest of this document is structured as follows. First,
+Chapter~\ref{chap:background} introduces relevant background information,
+including the importance of data structures and indexes in database systems,
+the concept of a search problem, and techniques for designing updatable data
+structures. Next, in Chapter~\ref{chap:sampling}, the application of the
+Bentley-Saxe method to a number of sampling data structures is presented. The
+extension of these structures introduces a number of challenges which must be
+addressed, resulting in significant modification of the underlying technique.
+Then, Chapter~\ref{chap:framework} generalizes the
+modifications from the sampling framework into a more general framework.
+Chapter~\ref{chap:proposed} discusses the work that remains to be completed as
+part of this project, and Chapter~\ref{chap:conclusion} concludes the work.