From 692e6185988fde5e20b883ac3d9d8f0847d96958 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Fri, 27 Jun 2025 18:10:23 -0400
Subject: updates

---
 chapters/introduction.tex | 79 +++++++++++++++++++++++------------------------
 1 file changed, 38 insertions(+), 41 deletions(-)

diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index 8a45bd0..101c36f 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -1,8 +1,6 @@ \chapter{Introduction} \label{chap:intro} -\section{Motivation} - Modern relational database management systems (RDBMS) are founded upon a set-based representation of data~\cite{codd70}. This model is very flexible and can be used to represent data of a wide variety of @@ -16,11 +14,10 @@ structures called indices, which can be used to accelerate particular types of query. To take full advantage of these structures, databases feature sophisticated query planning and optimization systems that can identify opportunities to utilize these indices~\cite{cowbook}. This -approach works well for particular types of queries for which an index -has been designed and integrated into the database. Unfortunately, many -RDBMS only support a very limited set of indices for accelerating single -dimensional range queries and point-lookups~\cite{mysql-btree-hash, -cowbook}. +approach works well for particular types of queries for which an index has +been designed and integrated into the database. Many RDBMS only support +a very limited set of indices for accelerating single dimensional range +queries and point-lookups~\cite{mysql-btree-hash, cowbook}. This situation is unfortunate, because one of the major challenges currently facing data systems is the processing of complex analytical @@ -54,28 +51,27 @@ of extending an existing or novel data structure with support for all of these functions is a major barrier to their use.
 As a current example that demonstrates this problem, consider the recent -development of learned indices. These are a broad class of data structure -that use various techniques to approximate a function mapping a key onto -its location in storage. Theoretically, this model allows for better -space efficiency of the index, as well as improved lookup performance. -This concept was first proposed by Kraska et al. in 2017, when they -published a paper on the first learned index, RMI~\cite{RMI}. This index -succeeding in showing that a learned model can be both faster and smaller -than a conventional range index, but the proposed solution did not support -updates. The first (non-concurrently) updatable learned index, ALEX, took -a year and a half to appear~\cite{alex}. Over the course of the subsequent +development of learned indices. Learned indices are data structures +that use various techniques to approximate a function mapping +a key onto its location in storage. The concept was first proposed +by Kraska et al. in 2017, when they published a paper on the first +learned index, RMI~\cite{RMI}. This index succeeded in showing that +a learned model can be both faster and smaller than a conventional +range index, but the proposed solution did not support updates. The +first (non-concurrently) updatable learned index, ALEX, took a year +and a half to appear~\cite{alex}. Over the course of the subsequent three years, several learned indexes were proposed with concurrency -support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512} but a -recent performance study~\cite{10.14778/3551793.3551848} showed that these -were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352}, -a traditional index.
This same study did however demonstrate that a new -design, ALEX+, was able to outperform ART-OLC under certain circumstances, -but even with this result learned indexes are not generally considered -production ready, because they suffer from significant performance -regressions under certain workloads, and are highly sensitive to the -distribution of keys~\cite{10.14778/3551793.3551848,alex-aca}. Despite the -demonstrable advantages of the technique and over half a decade of -development, learned indexes still have not reached a generally usable +support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512} +but a recent performance study~\cite{10.14778/3551793.3551848} +showed that these were still generally inferior to traditional +indexing techniques~\cite{10.1145/2933349.2933352}. While this +study demonstrated that a new design, ALEX+, was able to outperform +traditional indices under certain circumstances, it also showed that +learned indices suffer from significant performance regressions +under certain workloads, and are highly sensitive to the key +distribution~\cite{10.14778/3551793.3551848,alex-aca}. Despite the +demonstrable advantages of the technique and nearly a decade of +development, learned indices still have not reached a generally usable state. It would not be an exaggeration to say that there are dozens of novel data @@ -91,7 +87,7 @@ to database practitioners, and the capabilities of database systems could be greatly enhanced. It is our goal with this work to make a significant step in this direction. -\section{Existing Attempts} +\section{Existing Work} At present, there are several lines of work targeted at reducing the development burden associated with creating specialized indices. 
We @@ -100,7 +96,7 @@ classify them into three broad categories, \begin{itemize} \item \textbf{Automatic Index Composition.} This line of work seeks to automatically compose an instance-optimized data structure for indexing -static data by examining the workload and combining a collection of basic +data by examining the workload and combining a collection of basic primitive structures to optimize performance. \item \textbf{Generalized Index Templates.} This line of work seeks @@ -125,14 +121,14 @@ will be extensively discussed in Chapter~\ref{chap:background}. Automatic index composition has been considered in a variety of papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering differing sets of data structure primitives and different -techniques for composing the structure. The general principle across all -incarnations of the technique is to consider a (usually static) set of -data, and a workload consisting of single-dimensional range queries and -point lookups. The system then analyzes the workload, either statically -or in real time, selects specific primitive structures optimized for -certain operations (e.g., hash table-like structures for point lookups, -sorted runs for range scans), and applies them to different regions -of the data, in an attempt to maximize the overall performance of the +techniques for composing the structure. The general principle across +all incarnations of the technique is to consider a (usually static) +set of data, and a workload consisting of single-dimensional range +queries and point lookups. The system then analyzes the workload, +either statically or in real time, selects specific primitive structures +optimized for certain operations (e.g., hash table-like structures for +point lookups, sorted runs for range scans), and applies them to different +regions of the data in order to maximize the overall performance of the workload. 
Although some work in this area suggests generalization to more complex data types, such as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused on creating instance-optimal indices for @@ -171,8 +167,8 @@ we will consider dynamization,\footnote{ all refer to the same process. } the automatic extension of an existing static data structure with support for inserts and deletes. The most general of these techniques -are based on amortized global reconstruction~\cite{overmars83}, -an approach that divides a single data structure up into smaller +are based on data structure \emph{decomposition}~\cite{overmars83}. This +is an approach that divides a single data structure up into smaller structures, called blocks, built over disjoint partitions of the data. Inserts and deletes can then be supported by selectively rebuilding these blocks. The most commonly used version of this @@ -214,7 +210,8 @@ Specifically, the proposed work will address the following points, \begin{enumerate} \item The proposal of a theoretical framework for analyzing queries and data structures that extends existing theoretical - approaches and allows for more data structures to be dynamized. + approaches and allows for more data structures to be + systematically dynamized. \item The design of a system based upon this theoretical framework for automatically dynamizing static data structures in a performant and configurable manner.