\chapter{Introduction}
\label{chap:intro}
Modern relational database management systems (RDBMS) are founded
upon a set-based representation of data~\cite{codd70}. This model is
very flexible and can be used to represent data of a wide variety of
types, from standard tabular information, to vectors, to graphs, and
more. However, this flexibility comes at a significant cost in query
performance: the most basic data access operation is a linear table scan.

To work around this limitation, RDBMSs support the creation of special data
structures called indices, which can be used to accelerate particular
types of query. To take full advantage of these structures, databases
feature sophisticated query planning and optimization systems that can
identify opportunities to utilize these indices~\cite{cowbook}. This
approach works well for particular types of queries for which an index has
been designed and integrated into the database. However, many RDBMSs
support only a very limited set of indices, primarily for accelerating
single-dimensional range queries and point lookups~\cite{mysql-btree-hash, cowbook}.

This situation is unfortunate, because one of the major challenges
currently facing data systems is the processing of complex analytical
queries of varying types over large sets of data. These queries and
data types are supported, nominally, by a relational database, but are
not well addressed by existing indexing techniques and as a result have
poor performance. This has led to the development of a variety of
specialized systems for particular types of query, such as spatial
systems~\cite{postgis-doc}, vector databases~\cite{pinecone-db},
and graph databases~\cite{neptune, neo4j}. At the heart of these
specialized systems are specialized indices, and the accompanying query
processing and optimization architectures necessary to utilize them
effectively. However, the development of a novel data processing system
for a specific type of query is not a trivial process. While specialized
data structures, which often already exist, are at the heart of such
systems, meaningfully using such a data structure in a database requires
adding a large number of additional features.

A recent work on extending Datalog with support for user-defined data
structures demonstrates both the benefits and challenges associated
with the use of specialized indices. It showed significant
improvements in query processing time and space requirements when
using custom indices, but required that the user-defined structures
have support for concurrent updates~\cite{byods-datalog}. In practice,
to be useful within the context of a database, a data structure must
support inserts and deletes (collectively referred to as updates),
as well as concurrency support that satisfies standardized isolation
semantics~\cite{cowbook}, support for crash recovery of the index in the
case of a system failure~\cite{aries}, and possibly more. The process
of extending an existing or novel data structure with support for all
of these functions is a major barrier to their use.

As a current example that demonstrates this problem, consider the recent
development of learned indices. Learned indices are data structures
that use various techniques to approximate a function mapping
a key onto its location in storage. The concept was first proposed
by Kraska et al. in 2017, when they published a paper on the first
learned index, RMI~\cite{RMI}. This index succeeded in showing that
a learned model can be both faster and smaller than a conventional
range index, but the proposed solution did not support updates.
first updatable (though not concurrent) learned index, ALEX, took a year
and a half to appear~\cite{alex}. Over the course of the subsequent
three years, several learned indices were proposed with concurrency
support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}
but a recent performance study~\cite{10.14778/3551793.3551848}
showed that these were still generally inferior to traditional
indexing techniques~\cite{10.1145/2933349.2933352}. While this
study demonstrated that a new design, ALEX+, was able to outperform
traditional indices under certain circumstances, it also showed that
learned indices suffer from significant performance regressions
under certain workloads, and are highly sensitive to the key
distribution~\cite{10.14778/3551793.3551848,alex-aca}. Despite the
demonstrable advantages of the technique and nearly a decade of
development, learned indices still have not reached a generally usable
state.
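To make this idea concrete, the following is a minimal, illustrative sketch of the core principle behind learned indices. It is not RMI or ALEX themselves, and the class and method names are our own invention: a single linear model approximates the key-to-position mapping over a sorted array, and a lookup corrects the model's prediction by searching a window bounded by its maximum observed error.

```python
# Illustrative sketch of a learned index (not RMI or ALEX; names are ours).
# A linear model approximates the function from key to array position;
# lookups predict a position, then search a small bounded error window.
class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.slope = (sum((k - mean_k) * (i - mean_p)
                          for i, k in enumerate(self.keys)) / var) if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst prediction error to bound the search window.
        self.err = max(abs(self.predict(k) - i)
                       for i, k in enumerate(self.keys))

    def predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        p = self.predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        # Final correction step: scan the error window around the prediction.
        for i in range(lo, hi):
            if self.keys[i] == key:
                return i
        return None
```

Real designs differ in important ways: RMI uses a hierarchy of models rather than a single one, and error bounds are typically tracked per model. The single global error bound here is purely for simplicity.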

It would not be an exaggeration to say that there are dozens of novel data
structures proposed each year at data structures and systems conferences,
many of which solve useful problems. However, the burden of producing
a useful database index from these structures is great, and many of
them either never see use, or at least require a significant amount
of time and effort before they can be deployed. If there were a way to
bypass much of this additional development time by \emph{automatically}
extending the feature set of an existing data structure to produce a
usable index, many of these structures would become readily accessible
to database practitioners, and the capabilities of database systems could
be greatly enhanced. It is our goal with this work to make a significant
step in this direction.
\section{Existing Work}
At present, there are several lines of work targeted at reducing
the development burden associated with creating specialized indices. We
classify them into three broad categories:
\begin{itemize}
\item \textbf{Automatic Index Composition.} This line of work seeks to
automatically compose an instance-optimized data structure for indexing
data by examining the workload and combining a collection of basic
primitive structures to optimize performance.
\item \textbf{Generalized Index Templates.} This line of work seeks
to introduce generalized data structures with built-in support for
updates, concurrency, crash recovery, etc., that have user configurable
behavior. The user can define various operations and data types according
to the template, and a corresponding customized index is automatically
constructed.
\item \textbf{Automatic Feature Extension.} This line of work seeks to
take data structures that lack specific features, and automatically add
these without requiring adjustment to the data structure itself. This is
most commonly used to add update support, in which case the process is
called \emph{dynamization}.
\end{itemize}
We briefly discuss each of these three lines, and their limitations,
in this section. A more detailed discussion of the first two of these
lines can be found in Chapter~\ref{chap:related-work}, and the third
will be extensively discussed in Chapter~\ref{chap:background}.

Automatic index composition has been considered in a variety of
papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each
considering differing sets of data structure primitives and different
techniques for composing the structure. The general principle across
all incarnations of the technique is to consider a (usually static)
set of data, and a workload consisting of single-dimensional range
queries and point lookups. The system then analyzes the workload,
either statically or in real time, selects specific primitive structures
optimized for certain operations (e.g., hash table-like structures for
point lookups, sorted runs for range scans), and applies them to different
regions of the data in order to maximize the overall performance of the
workload. Although some work in this area suggests generalization to
more complex data types, such as multi-dimensional data~\cite{fluid-ds},
this line is broadly focused on creating instance-optimal indices for
workloads that databases are already well equipped to handle. While this
task is quite important, it is not precisely the work that we are trying
to accomplish here. Moreover, because the techniques are limited to specified
sets of structural primitives, it is not clear that the approach can
be usefully extended to support \emph{arbitrary} query and data types
without reintroducing the very problem we are trying to address. We thus
consider this line to be largely orthogonal to ours.

The second approach, generalized index templates, \emph{does} attempt
to address the problem of expanding the indexing support of databases to
a broader set of queries. The two primary exemplars of this approach
are the generalized search tree (GiST)~\cite{gist, concurrent-gist}
and the generalized inverted index (GIN)~\cite{pg-gin}, both of which
have been integrated into PostgreSQL~\cite{pg-gist, pg-gin}. GiST enables
generalized predicate filtering over user-defined data types, and
GIN generalizes an inverted index for text search. While powerful,
these techniques are limited by the specific data structure that they
are based upon, in a similar way that automatic index composition
techniques are limited by their set of defined primitives. As a
result, generalized index templates cannot support queries (e.g.,
independent range sampling~\cite{hu14}) or data structures (e.g.,
succinct tries~\cite{zhang18}) that do not fit their underlying models.
Expanding these underlying models by introducing a new generalized index
faces the same challenges as any other index development program. Thus,
while generalized index templates are a significant contribution in this
area, they are not a general solution to the fundamental problem of the
difficulties of index development.

The final approach is automatic feature extension. More specifically,
we will consider dynamization,\footnote{
This is alternatively called a static-to-dynamic transformation,
or dynamic extension, depending upon the source. These terms
all refer to the same process.
} the automatic extension of an existing static data structure with
support for inserts and deletes. The most general of these techniques
are based on data structure \emph{decomposition}~\cite{overmars83}. This
is an approach that divides a single data structure up into smaller
structures, called blocks, built over disjoint partitions of the
data. Inserts and deletes can then be supported by selectively
rebuilding these blocks. The most commonly used version of this
approach is the Bentley-Saxe method~\cite{saxe79}, which has been
individually applied to several specific data structures in past
work~\cite{almodaresi23,pgm,naidan14,xie21,bkdtree}. Dynamization
of this sort is not a fully general solution though; it places a
number of restrictions on the data structures and queries that
it can support. These limitations will be discussed at length
in Chapter~\ref{chap:background}, but briefly they include: (1)
restrictions on query types that can be supported, as well as even
stricter constraints on when deletes are supported, (2) a lack of
useful performance configuration, and (3) sub-optimal performance
characteristics, particularly in terms of insertion tail latencies.
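To illustrate the decomposition principle, here is a minimal sketch of the Bentley-Saxe (logarithmic) method in Python. The class name and the choice of membership queries over a sorted array are ours; the sorted array merely stands in for an arbitrary static structure with a build and a query operation. Level $i$ holds either nothing or a block built over $2^i$ records, and an insert merges the new record with all full levels below the first empty one, rebuilding exactly one block.

```python
# A minimal sketch of Bentley-Saxe dynamization (class name ours), using a
# sorted array as a stand-in for an arbitrary static structure. Level i is
# either empty or a block built over 2^i records; an insert cascades like
# binary addition, rebuilding exactly one static block.
import bisect

class BentleySaxe:
    def __init__(self):
        self.levels = []  # level i: None, or a sorted list of 2^i keys

    def insert(self, key):
        carry = [key]
        i = 0
        # Merge the new record with all full levels below the first empty one.
        while i < len(self.levels) and self.levels[i] is not None:
            carry.extend(self.levels[i])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = sorted(carry)  # rebuild one static block from scratch

    def contains(self, key):
        # Decomposability: evaluate the query per block and OR the results.
        for block in self.levels:
            if block is None:
                continue
            j = bisect.bisect_left(block, key)
            if j < len(block) and block[j] == key:
                return True
        return False
```

Note that the query here relies on decomposability: membership over the whole set is the logical OR of membership in each block, which is precisely what allows the query to be evaluated block by block. Queries lacking such a combining operator are among those that classical dynamization cannot support.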

Of the three approaches, we believe the last to be the most promising
from the perspective of easing the development of novel indices
for specialized queries and data types. While dynamization does have
limitations, they are less onerous than those of the other two
approaches. This
is because dynamization is unburdened by specific selections of primitive
data layouts; rather, any existing (or novel) data structure can be used.
\section{Our Work}
The work described by this document is focused on addressing, at
least in part, the limitations of dynamization mentioned in the previous
section. We discuss general strategies for overcoming each limitation in
the context of the most popular dynamization technique: the Bentley-Saxe
method. We then present a generalized dynamization framework based upon
this discussion. This framework is capable of automatically adding
support for concurrent inserts, deletes, and queries to a wide range
of static data structures, including ones not supported by traditional
dynamization techniques. Included in this framework is a tunable design
space, allowing for trade-offs between query and update performance, and
mitigations to control the significant insertion tail latency problems
faced by most classical dynamization techniques.

Specifically, the proposed work will address the following points:
\begin{enumerate}
\item The proposal of a theoretical framework for analyzing queries
and data structures that extends existing theoretical
approaches and allows for more data structures to be
systematically dynamized.
\item The design of a system based upon this theoretical framework
for automatically dynamizing static data structures in a performant
and configurable manner.
\item The extension of this system with support for concurrent operations,
and the use of parallelism to provide more effective worst-case
performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First,
Chapter~\ref{chap:background} introduces relevant background information
about classical dynamization techniques and serves as a foundation
for the discussion to follow. In Chapter~\ref{chap:sampling},
we consider one specific example of a query type not supported
by traditional dynamization systems, and propose a framework that
addresses the underlying problems and enables dynamization for these
problems. Next, in Chapter~\ref{chap:framework}, we use the results
from the previous chapter to propose novel extensions to the search
problem taxonomy and generalized mechanisms for supporting dynamization
of these new types of problem, culminating in a general dynamization
framework. Chapter~\ref{chap:design-space} unifies our discussion
of configuration parameters of our dynamizations from the previous
two chapters, and formally considers the design space and trade-offs
within it. In Chapter~\ref{chap:tail-latency}, we consider the problem
of insertion tail latency, and extend our framework with support for
techniques to mitigate this problem. Chapter~\ref{chap:related-work}
contains a more detailed discussion of works related to our
own and the ways in which our approaches differ, and finally
Chapter~\ref{chap:conclusion} concludes the work.