author     Douglas Rumbaugh <dbr4@psu.edu>    2025-04-27 17:36:57 -0400
committer  Douglas Rumbaugh <dbr4@psu.edu>    2025-04-27 17:36:57 -0400
commit     5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch)
tree       276c075048e85426436db8babf0ca1f37e9fdba2  /chapters/introduction.tex
Initial commit
Diffstat (limited to 'chapters/introduction.tex')
-rw-r--r--  chapters/introduction.tex  95
1 files changed, 95 insertions, 0 deletions
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
new file mode 100644
index 0000000..a5d9740
--- /dev/null
+++ b/chapters/introduction.tex
@@ -0,0 +1,95 @@
+\chapter{Introduction}
+\label{chap:intro}
+
+One of the major challenges facing current data systems is the processing
+of complex and varied analytical queries over vast data sets. A commonly
+used technique for accelerating these queries is the application of data
+structures to create indexes, which form the basis of specialized database
+systems and data processing libraries. Unfortunately, the development
+of these indexes is difficult because of the requirements that data
+processing systems place on them. Data is frequently subject to updates,
+yet a large number of potentially useful data structures are static.
+Moreover, many large-scale data processing systems are highly concurrent,
+which raises the barrier to entry still higher. As a result, the process
+of developing data structures that satisfy these requirements is arduous.
+
+To demonstrate this difficulty, consider the recent example of the
+evolution of learned indexes. These are data structures designed to
+efficiently solve a simple problem: single-dimensional range queries
+over sorted data. They seek to reduce both the size of the structure
+and lookup times by replacing a traditional data structure with a
+learned model that predicts, to within a bounded error, the location
+in storage of the record matching a given key value. This concept was
+first proposed by Kraska et al. in 2017, when they published a paper on
+the first learned index, RMI~\cite{RMI}. This index succeeded in showing
+that a learned model can be both faster and smaller than a conventional
+range index, but the proposed solution did not support updates. The
+first (non-concurrently) updatable learned index, ALEX, took a year
+and a half to appear~\cite{ALEX}. Over the course of the subsequent
+three years, several learned indexes with concurrency support were
+proposed~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
+recent performance study~\cite{10.14778/3551793.3551848} showed that these
+were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
+a traditional index. The same study did, however, demonstrate that a new
+design, ALEX+, was able to outperform ART-OLC under certain circumstances.
+Even so, learned indexes are not generally considered production-ready:
+they suffer from significant performance regressions under certain
+workloads and are highly sensitive to the distribution of
+keys~\cite{10.14778/3551793.3551848}. Despite the
+demonstrable advantages of the technique and over half a decade of
+development, learned indexes still have not reached a generally usable
+state.\footnote{
+ In Chapter~\ref{chap:framework}, we apply our proposed technique to
+ existing static learned indexes to produce an effective dynamic index.
+}
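+
+To make the core idea concrete, the lookup procedure shared by most
+learned indexes can be sketched as follows; the notation here is
+illustrative rather than drawn from any particular proposal. Let
+$F_\theta$ be a model trained to approximate the empirical cumulative
+distribution function of the $n$ stored keys, with a maximum prediction
+error of $\epsilon$ positions. A lookup for a key $k$ first predicts a
+position,
+\[
+    \hat{p} = \left\lfloor n \cdot F_\theta(k) \right\rfloor,
+\]
+and then falls back to a conventional search (e.g., binary search)
+restricted to the interval $[\hat{p} - \epsilon, \hat{p} + \epsilon]$,
+costing $O(\log \epsilon)$ comparisons after the model evaluation. The
+specific model families and error-bounding strategies vary between
+proposals.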
+
+This work proposes a strategy for addressing this problem: a framework
+that automatically introduces support for concurrent updates (including
+both inserts and deletes) to many static data structures. With this
+framework, a wide range of static, or otherwise impractical, data
+structures will be made practically useful in data systems. Based on a
+classical theoretical technique called the Bentley-Saxe
+method~\cite{saxe79}, the proposed system will provide a library
+that can automatically extend many data structures with support for
+concurrent updates, along with a tunable design space that allows the
+user to make trade-offs among read performance, write performance,
+and storage usage. The framework will address a number of limitations
+of the original technique, greatly increasing its applicability and
+practicality. It will also provide a workload-adaptive, online tuning
+system that can automatically adjust the tuning parameters of the data
+structure in response to changing workloads.
+
+This framework is based on splitting the data structure into several
+smaller pieces, which are periodically reconstructed to support updates.
+A systematic partitioning and reconstruction approach is used to provide
+specific guarantees on amortized insertion performance and worst-case
+query performance. The underlying Bentley-Saxe method is extended using
+a novel query abstraction to broaden its applicability, and the
+partitioning and reconstruction processes are adjusted to improve
+performance and introduce configurability.
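+
+For intuition, the classical Bentley-Saxe decomposition maintains at
+most $\lceil \log_2 n \rceil + 1$ blocks, where block $i$ is either
+empty or holds exactly $2^i$ records; an insert merges the new record
+with all full blocks beneath the first empty one and rebuilds them as a
+single larger block. For a static structure with construction cost
+$B(n)$ and query cost $Q(n)$, this yields an amortized insertion cost
+of roughly $O\!\left(\tfrac{B(n)}{n}\log n\right)$ and, for decomposable
+search problems, a query cost of roughly $O(Q(n)\log n)$. The following
+minimal C++ sketch illustrates the mechanism, using a sorted array as
+the static structure and a range-count query (whose per-block results
+combine by addition); it is a toy illustration of the classical method,
+not the proposed framework.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstdint>
+#include <iostream>
+#include <optional>
+#include <vector>
+
+class BentleySaxeRangeCount {
+public:
+    // Insert one key: gather it together with every full block from the
+    // bottom up, then rebuild them as one sorted block a level higher.
+    void insert(int64_t key) {
+        std::vector<int64_t> buf{key};
+        size_t i = 0;
+        while (i < levels_.size() && levels_[i]) {
+            buf.insert(buf.end(), levels_[i]->begin(), levels_[i]->end());
+            levels_[i].reset();
+            ++i;
+        }
+        std::sort(buf.begin(), buf.end()); // "reconstruct" the static block
+        if (i == levels_.size()) levels_.emplace_back();
+        levels_[i] = std::move(buf);
+    }
+
+    // Count keys in [lo, hi] by querying every block and summing results.
+    size_t range_count(int64_t lo, int64_t hi) const {
+        size_t total = 0;
+        for (const auto &block : levels_) {
+            if (!block) continue;
+            auto l = std::lower_bound(block->begin(), block->end(), lo);
+            auto r = std::upper_bound(block->begin(), block->end(), hi);
+            total += static_cast<size_t>(r - l);
+        }
+        return total;
+    }
+
+private:
+    // Block i is either empty or a sorted array of exactly 2^i keys.
+    std::vector<std::optional<std::vector<int64_t>>> levels_;
+};
+
+int main() {
+    BentleySaxeRangeCount idx;
+    for (int64_t k = 0; k < 1000; ++k) idx.insert(k);
+    std::cout << idx.range_count(100, 199) << "\n"; // prints 100
+}
+\end{verbatim}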
+
+Specifically, the proposed work will address the following points:
+\begin{enumerate}
+    \item The proposal of a theoretical framework for analyzing queries
+    and data structures that extends existing theoretical
+    approaches and allows more data structures to be dynamized.
+ \item The design of a system based upon this theoretical framework
+ for automatically dynamizing static data structures in a performant
+ and configurable manner.
+ \item The extension of this system with support for concurrent operations,
+ and the use of concurrency to provide more effective worst-case
+ performance guarantees.
+\end{enumerate}
+
+The rest of this document is structured as follows. First,
+Chapter~\ref{chap:background} introduces relevant background information,
+including the importance of data structures and indexes in database systems,
+the concept of a search problem, and techniques for designing updatable data
+structures. Next, in Chapter~\ref{chap:sampling}, the application of the
+Bentley-Saxe method to a number of sampling data structures is presented. The
+extension of these structures introduces a number of challenges that must be
+addressed, resulting in significant modification of the underlying technique.
+Then, Chapter~\ref{chap:framework} discusses how these modifications are
+generalized beyond sampling into a framework applicable to a broader class of
+data structures.
+Chapter~\ref{chap:proposed} discusses the work that remains to be completed as
+part of this project, and Chapter~\ref{chap:conclusion} concludes the work.