author     Douglas Rumbaugh <dbr4@psu.edu>    2025-04-27 17:36:57 -0400
committer  Douglas Rumbaugh <dbr4@psu.edu>    2025-04-27 17:36:57 -0400
commit     5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch)
tree       276c075048e85426436db8babf0ca1f37e9fdba2  /chapters/introduction.tex
Initial commit
Diffstat (limited to 'chapters/introduction.tex')
-rw-r--r--  chapters/introduction.tex  95
1 files changed, 95 insertions, 0 deletions
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
new file mode 100644
index 0000000..a5d9740
--- /dev/null
+++ b/chapters/introduction.tex
@@ -0,0 +1,95 @@
+\chapter{Introduction}
+\label{chap:intro}
+
+One of the major challenges facing current data systems is the processing
+of complex and varied analytical queries over vast data sets. A commonly
+used technique for accelerating these queries is the application of data
+structures to create indexes, which form the basis of specialized database
+systems and data processing libraries. Unfortunately, the development
+of these indexes is difficult because of the requirements that data
+processing systems place on them. Data is frequently subject to updates,
+yet a large number of potentially useful data structures are static.
+Moreover, many large-scale data processing systems are highly concurrent,
+which raises the barrier to entry still higher. As a result, the process
+of developing data structures that satisfy these requirements is arduous.
+
+To demonstrate this difficulty, consider the recent example of the
+evolution of learned indexes. These are data structures designed to
+efficiently solve a simple problem: single-dimensional range queries
+over sorted data. They seek to reduce both the size of the structure
+and lookup times by replacing a traditional data structure with a
+learned model that predicts, to within a bounded error, the location
+in storage of the record matching a given key value. This concept was
+first proposed by Kraska et al. in 2017, when they published a paper on
+the first learned index, RMI~\cite{RMI}. This index succeeded in showing
+that a learned model can be both faster and smaller than a conventional
+range index, but the proposed solution did not support updates. The
+first (non-concurrently) updatable learned index, ALEX, took a year
+and a half to appear~\cite{ALEX}. Over the course of the subsequent
+three years, several learned indexes with concurrency support were
+proposed~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
+recent performance study~\cite{10.14778/3551793.3551848} showed that these
+were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
+a traditional index. The same study did, however, demonstrate that a new
+design, ALEX+, was able to outperform ART-OLC under certain circumstances.
+Even so, learned indexes are not generally considered production-ready:
+they suffer from significant performance regressions under certain
+workloads and are highly sensitive to the distribution of
+keys~\cite{10.14778/3551793.3551848}. Despite the
+demonstrable advantages of the technique and over half a decade of
+development, learned indexes still have not reached a generally usable
+state.\footnote{
+ In Chapter~\ref{chap:framework}, we apply our proposed technique to
+ existing static learned indexes to produce an effective dynamic index.
+}
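+
+To make the core idea concrete, the lookup procedure shared by most
+learned indexes can be sketched as follows; the notation here is
+illustrative rather than drawn from any particular proposal. Let
+$F_\theta$ be a model trained to approximate the empirical cumulative
+distribution function of the $n$ stored keys, with a maximum prediction
+error of $\epsilon$ positions. A lookup for a key $k$ first predicts a
+position,
+\[
+    \hat{p} = \left\lfloor n \cdot F_\theta(k) \right\rfloor,
+\]
+and then falls back to a conventional search (e.g., binary search)
+restricted to the interval $[\hat{p} - \epsilon, \hat{p} + \epsilon]$,
+costing $O(\log \epsilon)$ comparisons after the model evaluation. The
+specific model families and error-bounding strategies vary between
+proposals.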
+
+This work proposes a strategy for addressing this problem: a framework
+that automatically introduces support for concurrent updates (including
+both inserts and deletes) to many static data structures. With this
+framework, a wide range of static, or otherwise impractical, data
+structures will be made practically useful in data systems. Based on a
+classical theoretical technique called the Bentley-Saxe
+method~\cite{saxe79}, the proposed system will provide a library
+that can automatically extend many data structures with support for
+concurrent updates, along with a tunable design space that allows the
+user to make trade-offs among read performance, write performance,
+and storage usage. The framework will address a number of limitations
+of the original technique, greatly increasing its applicability and
+practicality. It will also provide a workload-adaptive, online tuning
+system that can automatically adjust the tuning parameters of the data
+structure in response to changing workloads.
+
+This framework is based on splitting the data structure into several
+smaller pieces, which are periodically reconstructed to support updates.
+A systematic partitioning and reconstruction approach is used to provide
+specific guarantees on amortized insertion performance and worst-case
+query performance. The underlying Bentley-Saxe method is extended using
+a novel query abstraction to broaden its applicability, and the
+partitioning and reconstruction processes are adjusted to improve
+performance and introduce configurability.
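+
+For intuition, the classical Bentley-Saxe decomposition maintains at
+most $\lceil \log_2 n \rceil + 1$ blocks, where block $i$ is either
+empty or holds exactly $2^i$ records; an insert merges the new record
+with all full blocks beneath the first empty one and rebuilds them as a
+single larger block. For a static structure with construction cost
+$B(n)$ and query cost $Q(n)$, this yields an amortized insertion cost
+of roughly $O\!\left(\tfrac{B(n)}{n}\log n\right)$ and, for decomposable
+search problems, a query cost of roughly $O(Q(n)\log n)$. The following
+minimal C++ sketch illustrates the mechanism, using a sorted array as
+the static structure and a range-count query (whose per-block results
+combine by addition); it is a toy illustration of the classical method,
+not the proposed framework.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstdint>
+#include <iostream>
+#include <optional>
+#include <vector>
+
+class BentleySaxeRangeCount {
+public:
+    // Insert one key: gather it together with every full block from the
+    // bottom up, then rebuild them as one sorted block a level higher.
+    void insert(int64_t key) {
+        std::vector<int64_t> buf{key};
+        size_t i = 0;
+        while (i < levels_.size() && levels_[i]) {
+            buf.insert(buf.end(), levels_[i]->begin(), levels_[i]->end());
+            levels_[i].reset();
+            ++i;
+        }
+        std::sort(buf.begin(), buf.end()); // "reconstruct" the static block
+        if (i == levels_.size()) levels_.emplace_back();
+        levels_[i] = std::move(buf);
+    }
+
+    // Count keys in [lo, hi] by querying every block and summing results.
+    size_t range_count(int64_t lo, int64_t hi) const {
+        size_t total = 0;
+        for (const auto &block : levels_) {
+            if (!block) continue;
+            auto l = std::lower_bound(block->begin(), block->end(), lo);
+            auto r = std::upper_bound(block->begin(), block->end(), hi);
+            total += static_cast<size_t>(r - l);
+        }
+        return total;
+    }
+
+private:
+    // Block i is either empty or a sorted array of exactly 2^i keys.
+    std::vector<std::optional<std::vector<int64_t>>> levels_;
+};
+
+int main() {
+    BentleySaxeRangeCount idx;
+    for (int64_t k = 0; k < 1000; ++k) idx.insert(k);
+    std::cout << idx.range_count(100, 199) << "\n"; // prints 100
+}
+\end{verbatim}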
+
+Specifically, the proposed work will address the following points:
+\begin{enumerate}
+    \item The proposal of a theoretical framework for analyzing queries
+    and data structures that extends existing theoretical
+    approaches and allows more data structures to be dynamized.
+ \item The design of a system based upon this theoretical framework
+ for automatically dynamizing static data structures in a performant
+ and configurable manner.
+ \item The extension of this system with support for concurrent operations,
+ and the use of concurrency to provide more effective worst-case
+ performance guarantees.
+\end{enumerate}
+
+The rest of this document is structured as follows. First,
+Chapter~\ref{chap:background} introduces relevant background information,
+including the importance of data structures and indexes in database systems,
+the concept of a search problem, and techniques for designing updatable data
+structures. Next, in Chapter~\ref{chap:sampling}, the application of the
+Bentley-Saxe method to a number of sampling data structures is presented. The
+extension of these structures introduces a number of challenges that must be
+addressed, resulting in significant modification of the underlying technique.
+Then, Chapter~\ref{chap:framework} discusses how these modifications are
+generalized beyond sampling into a framework applicable to a broader class of
+data structures.
+Chapter~\ref{chap:proposed} discusses the work that remains to be completed as
+part of this project, and Chapter~\ref{chap:conclusion} concludes the work.