\chapter{Introduction}
\label{chap:intro}
Modern relational database management systems (RDBMS) are founded
upon a set-based representation of data~\cite{codd70}. This model is
very flexible and can be used to represent data of a wide variety of
types, from standard tabular information to vectors, graphs, and
more. However, this flexibility comes at a significant cost in query
performance: the most basic data access operation is a linear table scan.

To work around this limitation, RDBMS support the creation of special
data structures called indices, which can be used to accelerate
particular types of query, and feature sophisticated query planning
and optimization systems that identify opportunities to use these
indices~\cite{cowbook}. This approach works well for queries for which
an index has been designed and integrated into the database.
Unfortunately, many RDBMS support only a very limited set of indices,
aimed at accelerating single-dimensional range queries and
point lookups~\cite{mysql-btree-hash, cowbook}.

This situation is unfortunate, because one of the major challenges
currently facing data systems is the processing of complex analytical
queries of varying types over large data sets. These queries and data
types are nominally supported by a relational database, but are not
well addressed by existing indexing techniques and therefore perform
poorly. This has led to the development of a variety of specialized
systems for particular types of query, such as spatial
systems~\cite{postgis-doc}, vector databases~\cite{pinecone-db}, and
graph databases~\cite{neptune, neo4j}.

One commonly used technique for accelerating these queries is the
application of data structures to create indexes, which form the basis
of specialized database systems and data processing libraries.
Unfortunately, the development of these indexes is difficult because of
the requirements that data processing systems place on them. Data is
frequently subject to updates, yet many potentially useful data
structures are static. Further, many large-scale data processing systems
are highly concurrent, which raises the barrier to entry even further.
The process of developing data structures that satisfy these
requirements is arduous.

To demonstrate this difficulty, consider the recent example of the
evolution of learned indexes. These are data structures designed to
efficiently solve a simple problem: single-dimensional range queries
over sorted data. They seek to reduce both the size of the structure
and lookup times by replacing a traditional data structure with a
learned model capable of predicting, to within a bounded error, the
storage location of a record matching a key value. This concept was
first proposed by Kraska et al. in 2017, when they published a paper on
the first learned index, RMI~\cite{RMI}. This index succeeded in showing
that a learned model can be both faster and smaller than a conventional
range index, but the proposed solution did not support updates. The
first (non-concurrently) updatable learned index, ALEX, took a year
and a half to appear~\cite{alex}. Over the course of the subsequent
three years, several learned indexes were proposed with concurrency
support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
recent performance study~\cite{10.14778/3551793.3551848} showed that these
were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
a traditional index. The same study did demonstrate that a new design,
ALEX+, was able to outperform ART-OLC under certain circumstances. Even
with this result, however, learned indexes are not generally considered
production ready, because they suffer from significant performance
regressions under certain workloads and are highly sensitive to the
distribution of keys~\cite{10.14778/3551793.3551848}. Despite the
demonstrable advantages of the technique and over half a decade of
development, learned indexes still have not reached a generally usable
state.\footnote{
In Chapter~\ref{chap:framework}, we apply our proposed technique to
existing static learned indexes to produce an effective dynamic index.
}
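The core idea behind learned indexes can be illustrated with a short
sketch. This is not the design of RMI or ALEX; it is a minimal,
hypothetical example that fits a simple linear model over the sorted
keys, records the model's maximum prediction error, and confines each
lookup to the small window that error bounds. All names here are
illustrative.

```python
# Sketch of the learned-index idea: approximate the key -> position
# mapping of a sorted array with a linear model, record the maximum
# prediction error, and search only the window the model predicts.
import bisect
import math

class SimpleLearnedIndex:
    def __init__(self, keys):
        self.keys = keys  # assumed sorted
        n = len(keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The maximum error over the training keys bounds the window
        # that must be searched at lookup time.
        self.err = math.ceil(max(abs(self._predict(k) - i)
                                 for i, k in enumerate(keys)))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def lookup(self, key):
        """Return the position of `key`, or None if it is absent."""
        pos = round(self._predict(key))
        lo = max(0, pos - self.err - 1)
        hi = max(lo, min(len(self.keys), pos + self.err + 2))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

When the key distribution is close to linear, the error bound (and thus
the per-lookup search window) is small; a skewed distribution inflates
the bound, which is one source of the distribution sensitivity noted
above.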

This work proposes a strategy for addressing this problem: a framework
for automatically adding support for concurrent updates (including both
inserts and deletes) to many static data structures. With this
framework, a wide range of static, or otherwise impractical, data
structures can be made practically useful in data systems. Based on a
classical theoretical technique called the Bentley-Saxe
method~\cite{saxe79}, the proposed system will provide a library that
can automatically extend many data structures with support for
concurrent updates, as well as a tunable design space that allows the
user to make trade-offs between read performance, write performance,
and storage usage. The framework will address a number of limitations
of the original technique, greatly increasing its applicability and
practicality. It will also provide a workload-adaptive, online tuning
system that can automatically adjust the tuning parameters of the data
structure in the face of changing workloads.

This framework is based on splitting the data structure into several
smaller pieces, which are periodically reconstructed to support updates.
A systematic partitioning and reconstruction approach provides specific
guarantees on amortized insertion performance and worst-case query
performance. The underlying Bentley-Saxe method is extended with a novel
query abstraction to broaden its applicability, and the partitioning and
reconstruction processes are adjusted to improve performance and
introduce configurability.
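To make the underlying mechanism concrete, the classical logarithmic
variant of the Bentley-Saxe method can be sketched as follows. This is
a simplified illustration, not the proposed framework: level $i$ holds
either nothing or a static structure over exactly $2^i$ records, an
insert rebuilds a run of full levels into the next empty one, and a
decomposable query merges per-level results. `SortedRangeCount` is a
hypothetical stand-in for any static structure with a bulk-build
constructor.

```python
# Sketch of the classical Bentley-Saxe (logarithmic) method for a
# decomposable search problem, here range counting over sorted data.
import bisect

class SortedRangeCount:
    """Toy static structure: a sorted array answering range counts."""
    def __init__(self, records):
        self.data = sorted(records)

    def query(self, lo, hi):
        # Count of records in the closed interval [lo, hi].
        return (bisect.bisect_right(self.data, hi)
                - bisect.bisect_left(self.data, lo))

class BentleySaxe:
    def __init__(self):
        self.levels = []  # levels[i] is None or holds exactly 2**i records

    def insert(self, record):
        # Gather the new record plus every full level below the first
        # empty one, then rebuild them into a single structure there.
        batch = [record]
        i = 0
        while i < len(self.levels) and self.levels[i] is not None:
            batch.extend(self.levels[i].data)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = SortedRangeCount(batch)

    def query(self, lo, hi):
        # Range counting is decomposable: per-level results simply sum.
        return sum(l.query(lo, hi) for l in self.levels if l is not None)
```

Each record participates in $O(\log n)$ rebuilds over its lifetime,
which yields the amortized insertion bound, while a query pays at most
a logarithmic factor for visiting every non-empty level.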

Specifically, the proposed work will address the following points:
\begin{enumerate}
\item The proposal of a theoretical framework for analyzing queries
and data structures that extends existing theoretical approaches
and allows more data structures to be dynamized.
\item The design of a system based upon this theoretical framework
for automatically dynamizing static data structures in a performant
and configurable manner.
\item The extension of this system with support for concurrent operations,
and the use of concurrency to provide more effective worst-case
performance guarantees.
\end{enumerate}

The rest of this document is structured as follows. First,
Chapter~\ref{chap:background} introduces relevant background information,
including the importance of data structures and indexes in database
systems, the concept of a search problem, and techniques for designing
updatable data structures. Next, in Chapter~\ref{chap:sampling}, the
application of the Bentley-Saxe method to a number of sampling data
structures is presented. Extending these structures introduces a number
of challenges, the resolution of which requires significant modification
of the underlying technique. Then, Chapter~\ref{chap:framework} discusses
the generalization of the modifications from the sampling framework into
a fully general framework. Chapter~\ref{chap:proposed} discusses the work
that remains to be completed as part of this project, and
Chapter~\ref{chap:conclusion} concludes the work.