\chapter{Introduction}
\label{chap:intro}
\section{Motivation}
Modern relational database management systems (RDBMS) are founded
upon a set-based representation of data~\cite{codd70}. This model is
very flexible and can be used to represent data of a wide variety of
types, from standard tabular information, to vectors, to graphs, and
more. However, this flexibility comes at a significant cost in query
performance: the most basic data access operation is a linear table scan.

To work around this limitation, RDBMS support the creation of special data
structures called indices, which can be used to accelerate particular
types of query. To take full advantage of these structures, databases
feature sophisticated query planning and optimization systems that can
identify opportunities to utilize these indices~\cite{cowbook}. This
approach works well for particular types of queries for which an index
has been designed and integrated into the database. Unfortunately, many
RDBMS only support a very limited set of indices for accelerating
single-dimensional range queries and point lookups~\cite{mysql-btree-hash,
cowbook}.

This situation is problematic, because one of the major challenges
currently facing data systems is the processing of complex analytical
queries of varying types over large sets of data. These queries and
data types are supported, nominally, by a relational database, but are
not well addressed by existing indexing techniques and as a result have
poor performance. This has led to the development of a variety of
specialized systems for particular types of query, such as spatial
systems~\cite{postgis-doc}, vector databases~\cite{pinecone-db},
and graph databases~\cite{neptune, neo4j}. At the heart of these
specialized systems are specialized indices, and the accompanying query
processing and optimization architectures necessary to utilize them
effectively. However, the development of a novel data processing system
for a specific type of query is not a trivial process. While specialized
data structures, which often already exist, are at the heart of such
systems, meaningfully using such a data structure in a database requires
adding a large number of additional features.

To be useful within the context of a database, a data structure must
support inserts and deletes (collectively referred to as updates),
provide concurrency control that satisfies standardized isolation
semantics~\cite{cowbook}, support crash recovery of the index in the
case of a system failure~\cite{aries}, and possibly more. As an example,
a recent work extending Datalog with support for user-defined data
structures showed significant improvements in query processing time and
space requirements, but required that the user-defined structures have
support for concurrent updates~\cite{byods-datalog}. The process of
adding these features to data structures that currently lack them is
not straightforward and can take an extensive amount of time and effort.

As a current example that demonstrates this problem, consider the recent
development of learned indices. These are a broad class of data structures
that use various techniques to approximate a function mapping a key onto
its location in storage. Theoretically, this model allows for better
space efficiency of the index, as well as improved lookup performance.
This concept was first proposed by Kraska et al. in 2017, when they
published a paper on the first learned index, RMI~\cite{RMI}. This index
succeeded in showing that a learned model can be both faster and smaller
than a conventional range index, but the proposed solution did not support
updates. The first (non-concurrently) updatable learned index, ALEX, took
a year and a half to appear~\cite{alex}. Over the course of the subsequent
three years, several learned indexes were proposed with concurrency
support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512} but a
recent performance study~\cite{10.14778/3551793.3551848} showed that these
were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
a traditional index. This same study did, however, demonstrate that a new
design, ALEX+, was able to outperform ART-OLC under certain circumstances,
but even with this result learned indexes are not generally considered
production ready, because they suffer from significant performance
regressions under certain workloads, and are highly sensitive to the
distribution of keys~\cite{10.14778/3551793.3551848,alex-aca}. Despite the
demonstrable advantages of the technique and over half a decade of
development, learned indexes still have not reached a generally usable
state.

It would not be an exaggeration to say that there are dozens of novel data
structures proposed each year at data structures and systems conferences,
many of which solve useful problems. However, the burden of producing
a useful database index from these structures is great, and many of
them either never see use, or at least require a significant amount
of time and effort before they can be deployed. If there were a way to
bypass much of this additional development time by \emph{automatically}
extending the feature set of an existing data structure to produce a
usable index, many of these structures would become readily accessible
to database practitioners, and the capabilities of database systems could
be greatly enhanced. It is our goal with this work to make a significant
step in this direction.
\section{Existing Attempts}
At present, there are several lines of work targeted at reducing
the development burden associated with creating specialized indices. We
classify them into three broad categories:
\begin{itemize}
\item \textbf{Automatic Index Composition.} This line of work seeks to
automatically compose an instance-optimized data structure for indexing
static data by examining the workload and combining a collection of basic
primitive structures to optimize performance.
\item \textbf{Generalized Index Templates.} This line of work seeks
to introduce generalized data structures with built-in support for
updates, concurrency, crash recovery, etc., that have user configurable
behavior. The user can define various operations and data types according
to the template, and a corresponding customized index is automatically
constructed.
\item \textbf{Automatic Feature Extension.} This line of work seeks to
take data structures that lack specific features, and automatically add
these without requiring adjustment to the data structure itself. This is
most commonly used to add update support, in which case the process is
called \emph{dynamization}.
\end{itemize}
We will briefly discuss each of these three lines, and their limitations,
in this section. A more detailed discussion of the first two can be
found in Chapter~\ref{chap:related-work}, and the third will be discussed
extensively in Chapter~\ref{chap:background}.

Automatic index composition has been considered in a variety of
papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering
differing sets of data structure primitives and different techniques for
composing the structure. The general principle across all incarnations
of the technique is to consider a (usually static) set of data, and a
workload consisting of single-dimensional range queries and point lookups.
The system then analyzes the workload, either statically or in real time,
selects specific primitive structures optimized for certain operations
(e.g., hash table-like structures for point lookups, sorted runs for range
scans), and applies them to different regions of the data, in an attempt
to maximize the overall performance of the workload. Although some work
in this area suggests generalization to more complex data types, such
as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused
on creating instance-optimal indices for workloads that databases are
already well equipped to handle. While this task is quite important, it
is not precisely the work that we are trying to accomplish here. Moreover,
because these techniques are limited to specified sets of structural
primitives, it is not clear that the approach can be usefully extended
to support \emph{arbitrary} query and data types. We thus consider this
line to be largely orthogonal to ours.

The second approach, generalized index templates, \emph{does} attempt
to address the problem of expanding indexing support of databases to
a broader set of queries. The two primary exemplars of this approach
are the generalized search tree (GiST)~\cite{gist, concurrent-gist}
and the generalized inverted index (GIN)~\cite{pg-gin}, both of which
have been integrated into PostgreSQL~\cite{pg-gist, pg-gin}. GiST enables
generalized predicate filtering over user-defined data types, and
GIN generalizes an inverted index for text search. While powerful,
these techniques are limited by the specific data structure that they
are based upon, in a similar way that automatic index composition
techniques are limited by their set of defined primitives. As a
result, generalized index templates cannot support queries (e.g.,
independent range sampling~\cite{hu14}) or data structures (e.g.,
succinct tries~\cite{zhang18}) that do not fit their underlying models.
Expanding these underlying models by introducing a new generalized index
faces the same challenges as any other index development program. Thus,
while generalized index templates are a significant contribution in this
area, they are not a general solution to the fundamental problem of the
difficulties of index development.

The final approach is automatic feature extension. More specifically,
we will consider dynamization,\footnote{
This is alternatively called a static-to-dynamic transformation,
or dynamic extension, depending upon the source. These terms
all refer to the same process.
} the automatic extension of an existing static data structure with
support for inserts and deletes. The most general of these techniques
are based on amortized global reconstruction~\cite{overmars83},
an approach that divides a single data structure up into smaller
structures, called blocks, built over disjoint partitions of the
data. Inserts and deletes can then be supported by selectively
rebuilding these blocks. The most commonly used version of this
approach is the Bentley-Saxe method~\cite{saxe79}, which has been
individually applied to several specific data structures in past
work~\cite{almodaresi23,pgm,naidan14,xie21,bkdtree}. Dynamization
of this sort is not a fully general solution though; it places
a number of restrictions on the data structures and queries that
it can support. These limitations will be discussed at length in
Chapter~\ref{chap:background}, but briefly they include: (1) restrictions
on query types that can be supported, as well as even stricter constraints
on when deletes are supported, (2) a lack of useful performance configuration,
and (3) sub-optimal performance characteristics, particularly in terms of
insertion tail latencies.

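
To make the mechanism concrete, the following sketch illustrates the
logarithmic block decomposition at the heart of the Bentley-Saxe method.
It is purely illustrative: the class and method names are our own, and
the ``static structure'' being dynamized is simply a sorted array
supporting membership queries. An insert merges the new key with the
maximal run of full blocks, in the same way that incrementing a binary
counter propagates carries:

```python
import bisect

class BentleySaxe:
    """Illustrative Bentley-Saxe dynamization of a static sorted array.

    Block i, when present, holds exactly 2**i keys. An insert that
    collides with existing full blocks rebuilds them into one larger
    block, mirroring carry propagation in binary addition.
    """

    def __init__(self):
        self.blocks = []  # blocks[i] is None, or a sorted list of 2**i keys

    def insert(self, key):
        carry = [key]
        i = 0
        # Merge with consecutive full blocks until an empty slot is found.
        while i < len(self.blocks) and self.blocks[i] is not None:
            carry = sorted(carry + self.blocks[i])  # "rebuild" the block
            self.blocks[i] = None
            i += 1
        if i == len(self.blocks):
            self.blocks.append(None)
        self.blocks[i] = carry

    def contains(self, key):
        # Membership is decomposable: query each block, OR the results.
        for block in self.blocks:
            if block is not None:
                j = bisect.bisect_left(block, key)
                if j < len(block) and block[j] == key:
                    return True
        return False
```

Because membership is a decomposable search problem (the answer over the
full data set is the disjunction of the per-block answers), querying the
$O(\log n)$ blocks costs only a logarithmic factor over the static
structure. It is precisely the queries that \emph{cannot} be combined in
this way that classical dynamization fails to support.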
Of the three approaches, we believe the last to be the most promising
from the perspective of easing the development of novel indices
for specialized queries and data types. While dynamization does have
limitations, they are less onerous than those of the other two approaches. This
is because dynamization is unburdened by specific selections of primitive
data layouts; rather, any existing (or novel) data structure can be used.
\section{Our Work}
The work described in this document focuses on addressing, at
least in part, the limitations of dynamization mentioned in the previous
section. We discuss general strategies for overcoming each limitation in
the context of the most popular dynamization technique: the Bentley-Saxe
method. We then present a generalized dynamization framework based upon
this discussion. This framework is capable of automatically adding
support for concurrent inserts, deletes, and queries to a wide range
of static data structures, including ones not supported by traditional
dynamization techniques. Included in this framework is a tunable design
space, allowing for trade-offs between query and update performance, and
mitigations to control the significant insertion tail latency problems
faced by most classical dynamization techniques.

Specifically, the proposed work will address the following points:
\begin{enumerate}
\item The proposal of a theoretical framework for analyzing queries
and data structures that extends existing theoretical
approaches and allows for more data structures to be dynamized.
\item The design of a system based upon this theoretical framework
for automatically dynamizing static data structures in a performant
and configurable manner.
\item The extension of this system with support for concurrent operations,
and the use of parallelism to provide more effective worst-case
performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First,
Chapter~\ref{chap:background} introduces relevant background information
about classical dynamization techniques and serves as a foundation
for the discussion to follow. In Chapter~\ref{chap:sampling},
we consider one specific example of a query type not supported
by traditional dynamization systems, and propose a framework that
addresses the underlying problems and enables dynamization for these
problems. Next, in Chapter~\ref{chap:framework}, we use the results
from the previous chapter to propose novel extensions to the search
problem taxonomy and generalized mechanisms for supporting dynamization
of these new types of problem, culminating in a general dynamization
framework. Chapter~\ref{chap:design-space} unifies our discussion
of configuration parameters of our dynamizations from the previous
two chapters, and formally considers the design space and trade-offs
within it. In Chapter~\ref{chap:tail-latency}, we consider the problem
of insertion tail latency, and extend our framework with support for
techniques to mitigate this problem. Chapter~\ref{chap:related-work}
contains a more detailed discussion of works related to our own and the
ways in which our approach differs, and finally Chapter~\ref{chap:conclusion}
concludes the work.