\chapter{Related Work}
\label{chap:related-work}

While we have already discussed the most directly relevant background
work in the area of dynamization at length in
Chapter~\ref{chap:background}, there are a number of other lines of
work that relate, either directly or superficially, to our ultimate
goal of automating some or all of the process of constructing a
database index. In this chapter, we will discuss the most notable of
these works.

\section{Existing Applications of Dynamization}

We will begin with the most directly relevant topic: other papers that
use dynamization to construct specific dynamic data structures. There
are a few works which introduce a new data structure and simply apply
dynamization to it for the purpose of adding update support. For
example, both the PGM learned index~\cite{pgm} and the KDS
tree~\cite{xie21} propose static structures, and then apply dynamization
techniques to support updates. In this section, however, we will focus
on works in which dynamization is a major focus of the paper, not
simply an incidental tool.

One of the older applications of the Bentley-Saxe method is in the
creation of a data structure called the Bkd-tree~\cite{bkd-tree}.
This structure is a search tree for multi-dimensional data, based
on the kd-tree~\cite{kd-tree} and designed for use in external
storage. While it was not the first external kd-tree, existing
implementations struggled to support updates, which were typically
inefficient and resulted in node structures that utilized space poorly
(i.e., many nodes were mostly empty, resulting in less efficient
searches). To resolve these problems, the authors took a statically
constructed external structure,\footnote{To clarify, the K-D-B-tree
supports updates, but poorly. For the Bkd-tree, the authors exclusively
used the K-D-B-tree's static bulk-loading, which is highly efficient.}
the K-D-B-tree, and combined it with the logarithmic method to create
a full-dynamic structure (the K-D-B-tree supports deletes natively, so
the problem is deletion decomposable). The major contribution of this
paper, per the authors, was not so much the structure itself as the
demonstration of the viability of the logarithmic method, as they
showed extensively that their dynamized static structure was able
to outperform native dynamic implementations on external storage.
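
To make the mechanism concrete, below is a minimal sketch of
logarithmic-method insertion over a generic bulk-loadable static
structure. This is our own illustration, not the Bkd-tree authors'
code: the \texttt{Static} type, its \texttt{extract\_records} method,
and its bulk-loading constructor are assumed interfaces standing in
for, e.g., the K-D-B-tree's bulk-load.

\begin{verbatim}
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of logarithmic-method insertion. Static is any statically
// built structure (e.g., a bulk-loaded K-D-B-tree); extract_records
// and the vector-taking constructor are assumed interfaces.
template <typename Static, typename Record>
class Dynamized {
    // blocks[i] is either empty or holds exactly 2^i records
    std::vector<std::unique_ptr<Static>> blocks;

public:
    void insert(const Record &r) {
        std::vector<Record> pool{r};
        std::size_t i = 0;
        // "Carry" through full blocks, as in binary addition:
        // gather their records until an empty slot is reached.
        while (i < blocks.size() && blocks[i]) {
            auto recs = blocks[i]->extract_records();
            pool.insert(pool.end(), recs.begin(), recs.end());
            blocks[i].reset();
            i++;
        }
        if (i == blocks.size()) blocks.emplace_back();
        blocks[i] = std::make_unique<Static>(pool); // bulk rebuild
    }
};
\end{verbatim}

Each insertion behaves like binary addition over the block sizes: a new
record ``carries'' through full blocks until it finds an empty slot, so
any given record participates in at most $O(\log n)$ rebuilds.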

A more recent example applies the logarithmic method to the Mantis
structure for large-scale DNA sequence search~\cite{mantis-dyn}.
Mantis~\cite{mantis} is one of the fastest and most space-efficient
structures for sequence search, but it is static. To create a
half-dynamic version of Mantis, the authors first design an algorithm
to efficiently merge multiple Mantis structures together (turning
their problem into an MDSP, though they don't use this terminology).
They then apply a modified version of the logarithmic method,
including LSM tree inspired features such as leveling, tiering, and a
scale factor. The resulting structure was shown to perform quite well
compared to other existing solutions.

Another notable work on dynamization techniques examines applying
the logarithmic method to produce full-dynamic versions of various
metric indexing structures~\cite{naidan14}. In this paper, the
logarithmic method is directly applied to two static metric indexing
structures, the VPTree and the SSS-tree, to create full-dynamic
versions (using weak deletes). These dynamized structures are compared
to two dynamic baselines, the DSA-tree and EGNAT, for both
multi-dimensional range scans and $k$-NN searches. The paper contains
extensive benchmarks which demonstrate that the dynamized versions
perform quite well, even beating the dynamic structures in query
performance under certain circumstances. It's worth noting that we
also tested a dynamized VPTree for $k$-NN in
Chapter~\ref{chap:framework}, and obtained results in line with theirs.

Finally, LSMGraph is a recently proposed system which applies
dynamization techniques\footnote{
	The authors make a point of saying that they are \emph{not}
	applying dynamization, but instead embedding their structure
	inside of an LSM tree, noting the challenges associated with
	applying dynamization directly to graphs. While the specific
	techniques they use are not directly taken from any of the
	dynamization techniques we discussed in
	Chapter~\ref{chap:background}, we nonetheless consider this
	work to be an example of dynamization, at least in principle,
	because they decompose a static structure into smaller blocks
	and handle inserts by rebuilding these blocks systematically.}
to the compressed sparse row (CSR) matrix representation of graphs to
produce a dynamic, external graph storage system~\cite{lsmgraph}. This
is a particularly interesting example, because graphs and graph
algorithms are \emph{not} remotely decomposable. Adjacent vertices may
be spread across many levels of the structure, and because traversals
must access adjacent vertices regardless of which block contains them,
graph algorithms cannot be decomposed. To resolve this problem, the
authors discard the general query model and build a tightly integrated
system which uses an index to map each vertex to the block containing
it, along with a vertex-adjacency aware reconstruction process that
helps ensure adjacent vertices are compacted into the same blocks
during reconstruction.
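
As a rough illustration of the vertex-index idea (our own
simplification with hypothetical names, not LSMGraph's actual design),
a traversal never has to search every block; it consults the index to
jump directly to the block holding each vertex's CSR-encoded
neighborhood:

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical simplification: each block holds a CSR fragment, and
// a global index maps every vertex to (block, local row). offsets
// carries one extra terminal entry per block, as usual for CSR.
struct Block {
    std::vector<std::size_t>   offsets;   // local CSR row offsets
    std::vector<std::uint64_t> neighbors; // concatenated adjacency
};

using VertexIndex =
    std::unordered_map<std::uint64_t,
                       std::pair<std::size_t, std::size_t>>;

std::vector<std::uint64_t> neighbors_of(std::uint64_t v,
                                        const std::vector<Block> &blocks,
                                        const VertexIndex &index) {
    auto [blk, row] = index.at(v); // jump straight to v's block
    const Block &b = blocks[blk];
    return std::vector<std::uint64_t>(
        b.neighbors.begin() + b.offsets[row],
        b.neighbors.begin() + b.offsets[row + 1]);
}
\end{verbatim}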

\section{LSM Tree}

The Log-structured Merge-tree (LSM tree)~\cite{oneil96} is a data
structure proposed by O'Neil \emph{et al.} in the mid-90s, designed
to optimize for write throughput in external indexing contexts. While
O'Neil \emph{et al.} never cite any of the dynamization work we have
considered, the structure they proposed is eerily similar to a
decomposed data structure, and subsequent developments have driven it
in a direction that looks remarkably similar to Bentley and Saxe's
logarithmic method. In fact, several of the examples of dynamization
in the previous section (as well as this work) either borrow concepts
from modern LSM trees, or go so far as to use the term ``LSM'' as a
synonym for what we call dynamization. However, the work on LSM trees
is distinct from dynamization, at least general dynamization, because
it leans heavily on very specific aspects of the search problems
(point lookup and single-dimensional range search) and the data
structure in ways that don't generalize well. In this section, we'll
discuss a few of the relevant works on LSM trees and attempt to
differentiate them from dynamization.

\subsection{The Structure}

The modern LSM tree is a single-dimensional range data structure that
is commonly used in key-value stores such as RocksDB. It consists of a
small, dynamic, in-memory structure called a memtable, and a sequence
of static, external structures on disk of geometrically increasing
size. These structures are organized into levels, each of which can
contain either one structure or several, with the former strategy
called leveling and the latter tiering. The individual structures are
often simple sorted arrays (with some attached metadata) called runs,
which can be further decomposed into smaller files called sorted
string tables (SSTs). Records are initially inserted into the
memtable. When the memtable fills, it is flushed and its records are
merged into the first (smallest) level of the structure, with
reconstructions proceeding down the levels according to a merge policy
to make room as necessary. LSM trees typically support point lookups
and range queries, answering them by searching all of the runs in the
structure and merging the results together. To accelerate point
lookups, Bloom filters~\cite{bloom70} are often built over the records
in each run, allowing runs that don't contain the key being searched
for to be skipped. Deletes are typically handled using tombstones.
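
A minimal sketch of this read/write path (our own toy illustration;
production systems such as RocksDB are far more involved, and the
merge policy is omitted entirely) might look as follows:

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy LSM tree: a sorted in-memory memtable plus levels of runs.
class ToyLSM {
    using Run = std::vector<std::pair<std::string, std::string>>;
    static constexpr std::size_t kMemtableCap = 1024;
    std::map<std::string, std::string> memtable;
    std::vector<std::vector<Run>> levels; // levels[i] = runs on i

    void flush() {
        Run run(memtable.begin(), memtable.end()); // already sorted
        memtable.clear();
        if (levels.empty()) levels.emplace_back();
        levels[0].push_back(std::move(run)); // merge policy omitted
    }

public:
    void put(std::string key, std::string value) {
        memtable[std::move(key)] = std::move(value);
        if (memtable.size() >= kMemtableCap) flush();
    }

    // Point lookup: newest data first -- the memtable, then runs
    // from the smallest level down. A Bloom filter per run could
    // skip some of these searches.
    std::optional<std::string> get(const std::string &key) const {
        if (auto it = memtable.find(key); it != memtable.end())
            return it->second;
        for (const auto &level : levels)
            for (auto run = level.rbegin(); run != level.rend(); ++run) {
                auto it = std::lower_bound(
                    run->begin(), run->end(), key,
                    [](const auto &p, const std::string &k) {
                        return p.first < k;
                    });
                if (it != run->end() && it->first == key)
                    return it->second;
            }
        return std::nullopt;
    }
};
\end{verbatim}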

\subsection{Design Space}

The bulk of the work on LSM trees that is of interest to us focuses on
the structure's associated design space and performance tuning. A very
large number of papers discuss different ways of decomposing the
structure, performing reconstructions, allocating resources to filters
and other auxiliary structures, and so on, in order to optimize
resource usage and enable performance tuning. We'll summarize a few of
these works here.

One major line of work on LSM trees involves optimizing the allocation
of Bloom filter memory to the sorted runs within the structure. Bloom
filters are commonly used in LSM trees to accelerate point lookups,
because these queries must examine each run, from top to bottom, until
a matching key is found. Bloom filters improve performance by allowing
some runs to be skipped in this search process, and because LSM trees
are external data structures, the savings can be quite large. There
are a number of works in this area~\cite{dayan18-1, zhu21}, but we
will highlight Monkey~\cite{dayan17} specifically. Monkey optimizes
the allocation of Bloom filter memory across the levels of an LSM
tree, based on the observation that the worst-case lookup cost (i.e.,
the cost of a point lookup for a key that doesn't exist within the LSM
tree) is directly proportional to the sum of the false positive rates
across all levels of the tree. Memory can thus be allocated to the
filters in a way that minimizes this sum. These works could be useful
in the context of dynamization for problems which admit similar
optimizations, such as a dynamized structure for point lookups using
Bloom filters, or possibly range scans using range filters such as
SuRF~\cite{zhang18} or Rosetta~\cite{siqiang20}, but they aren't
directly applicable to the general problem of dynamization.
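
Concretely, the observation behind Monkey can be phrased as a small
optimization problem. If level $i$ holds $n_i$ records and its filter
is allotted $m_i$ bits, a Bloom filter with an optimal number of hash
functions has a false positive rate of roughly
$f_i \approx e^{-(m_i/n_i)\ln^2 2}$, and the I/O cost of a zero-result
lookup is proportional to $\sum_i f_i$. In simplified form, Monkey's
allocation solves
\[
\min_{m_1,\ldots,m_L} \; \sum_{i=1}^{L} e^{-(m_i/n_i)\ln^2 2}
\qquad \textrm{subject to} \qquad \sum_{i=1}^{L} m_i = M,
\]
where $M$ is the total filter memory budget. Because the $n_i$ grow
geometrically with $i$, the optimum assigns more bits per record to
the smaller levels, in contrast to the uniform bits-per-record
allocation used by most LSM tree implementations.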

Other work on LSM trees considers different merge policies.
Dostoevsky~\cite{dayan18} introduces two new merge policies, and the
so-called Wacky Continuum~\cite{dayan19} introduces a general design
space that includes Dostoevsky's policies, the traditional LSM
policies, and a new policy called the LSM bush. As it encompasses all
of these, we'll exclusively summarize the Wacky Continuum here. Wacky
defines a merge policy based on four parameters: the capping ratio
($C$), growth exponential ($X$), merge greed ($K$), and largest-level
merge greed ($Z$). The merge greed parameters define the merge
threshold, which effectively allows merge policies that sit between
leveling and tiering. Each level contains an ``active'' run, into
which new records are merged. Once this run contains a specified
fraction of the level's total capacity (determined by the merge greed
parameters), a new run is added to the level and made active. Leveling
can be simulated in this model by setting the merge greed parameter to
1, so that 100\% of the level's capacity is allocated to a single run.
Tiering is simulated by setting the parameter such that each active
run can only sustain a single set of merged records, so that a new run
is created each time records are merged into the level. The Wacky
Continuum allows the merge greed of the last level to be configured
independently of the inner levels, enabling lazy leveling, a policy in
which the largest level of the LSM tree contains a single run while
the inner levels are tiered. The other design parameters are simpler:
the capping ratio allows the size ratio of the last level to be varied
independently of the inner levels, and the growth exponential allows
the size ratio between adjacent levels to grow as the levels get
larger. This work also introduces an optimization system for
determining good values for all of these parameters for a given
workload.
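
Based on the description above, the merge-greed decision can be
sketched roughly as follows. This is our own reading of the policy,
with hypothetical names, not code from the paper: a merge greed of $1$
reproduces leveling, while a merge greed of $1/T$ (one flush's worth,
for size ratio $T$) reproduces tiering, and intermediate values
interpolate between the two.

\begin{verbatim}
#include <cstddef>

// Our reading of the merge-greed decision; names are hypothetical.
// `greed` is the fraction of a level's capacity that the active run
// may reach before a new active run is started.
struct Level {
    std::size_t capacity;        // total record capacity of the level
    std::size_t active_run_size; // records currently in the active run
    std::size_t num_runs;        // total runs currently on the level
};

void merge_into_level(Level &lvl, std::size_t incoming, double greed) {
    std::size_t threshold =
        static_cast<std::size_t>(greed * lvl.capacity);
    if (lvl.active_run_size + incoming > threshold) {
        lvl.num_runs++;          // close the active run, open a new one
        lvl.active_run_size = 0;
    }
    lvl.active_run_size += incoming; // merge into the active run
}
\end{verbatim}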

It's worth taking a moment to address why we did not consider the
Wacky Continuum design space in our own attempts to introduce a design
space into dynamization. At first glance, these concepts seem useful
to us, given that we imported the basic leveling and tiering concepts
from LSM trees. However, we believe that this particular set of design
parameters is not broadly useful outside of the LSM context, for
reasons shown within the experimental section of the Wacky paper
itself. For workloads involving range reads, the standard
leveling/tiering designs show perfectly reasonable (and sometimes even
superior) performance trade-offs. In large part, the Wacky Continuum
is an extension of the authors' earlier work on Monkey, as it is most
effective at improving trade-offs for point-lookup performance, which
is strongly influenced by Bloom filters. The range scan results are
the ones most closely related to the general dynamization case we
consider in this work, where filters cannot be assumed and filter
memory allocation isn't an important consideration. And, in that case,
the new merge policies available within the Wacky Continuum didn't
provide enough of an advantage to be considered here.

Another aspect of the LSM tree design space is compaction granularity,
which was studied in Spooky~\cite{dayan22}. Because LSM trees are
built over sorted runs of data, it is possible to perform partial
merges between levels, moving only a small portion of the data from
one level to another. This can improve write performance by making
reconstructions smaller, and can also reduce the transient storage
requirements of the structure, but it comes at the cost of additional
write amplification, as files on the same level are re-written more
frequently. The storage benefit comes from the fact that LSM trees
require extra working storage to perform reconstructions, and making
the reconstructions smaller reduces this space requirement. Spooky is
a method for determining effective approaches to performing partial
reconstructions. It works by partitioning the largest level into
equally sized files, and then dynamically partitioning the other
levels of the structure based on the key ranges of the last-level
files. This approach is shown to provide a good balance between write
amplification and reconstruction storage requirements. Compaction
granularity is another example of an LSM tree specific design element
that doesn't generalize well. It could be used in the context of
dynamization, but only for single-dimensional sorted data, and so we
have not considered it as part of our general techniques.
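
The partitioning scheme can be illustrated with a short sketch (ours,
not Spooky's code): the largest level is cut into $k$ equally sized
files, and the resulting key boundaries are reused to partition the
upper levels, so that a partial merge only touches files whose key
ranges overlap.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

// Given the sorted keys of the largest level, choose k-1 boundary
// keys splitting it into k roughly equal files. The upper levels
// are then partitioned on these same boundaries, keeping partial
// merges aligned across levels.
std::vector<std::uint64_t> partition_boundaries(
        const std::vector<std::uint64_t> &largest_level_keys,
        std::size_t k) {
    std::vector<std::uint64_t> bounds;
    for (std::size_t i = 1; i < k; i++)
        bounds.push_back(
            largest_level_keys[i * largest_level_keys.size() / k]);
    return bounds;
}
\end{verbatim}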

\subsection{Tail Latency Control}
LSM trees are susceptible to insertion tail latency problems similar
to those faced by dynamization. While tail latency can be controlled
using partial reconstructions, as in Spooky~\cite{dayan22} (though
this isn't a focus of that work), there is also work specifically on
controlling tail latency. One notable result in this area is
SILK~\cite{balmau19,silk-plus}, which focuses on reducing tail latency
through intelligent scheduling of reconstruction operations.

After performing an experimental evaluation of various LSM tree based
key-value stores, the authors arrive at three main principles for
designing their tail latency control system,
\begin{enumerate}
	\item I/O bandwidth should be opportunistically allocated to 
	      reconstructions.
	\item Reconstructions at smaller levels in the tree should be 
	      prioritized.
	\item Reconstructions at larger levels in the tree should be 
	      preemptable by those on lower levels.
\end{enumerate}
The resulting system prioritizes buffer flushes, which are given
dedicated threads and priority access to I/O bandwidth. The next
highest priority operations are reconstructions involving levels $0$
and $1$, which must complete in order for flushes to proceed. These
reconstructions can preempt any other running reconstruction if no
thread is available when one is scheduled. All other reconstructions
run at lower priority, and may need to be discarded entirely if a
higher priority reconstruction invalidates them. The system also
includes sophisticated I/O bandwidth controls, as bandwidth is a
constrained resource in external contexts.
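
In outline, the priority discipline might be rendered as follows (a
deliberate simplification in our own terms; the real system also
manages dedicated flush threads and I/O bandwidth allocation):

\begin{verbatim}
// Simplified rendering of SILK's priority classes; names are ours.
enum class Job { Flush, L0L1Merge, HigherLevelMerge };

// Higher value = scheduled first. L0/L1 merges may preempt running
// higher-level merges, which can be discarded if invalidated.
int priority(Job j) {
    switch (j) {
        case Job::Flush:            return 2;
        case Job::L0L1Merge:        return 1;
        case Job::HigherLevelMerge: return 0;
    }
    return 0; // unreachable
}
\end{verbatim}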

Some of the core concepts underlying SILK inspired the tail latency
control system we proposed in Chapter~\ref{chap:tail-latency}, but our
system is quite distinct from it. SILK leverages consequences of the
LSM tree design space in ways that our system cannot. For example,
SILK uses a two-version buffer (as we do), but is able to allocate
enough I/O bandwidth to ensure that one buffer version can be flushed
before the other fills up. Given the constraints of our dynamization
system, with its unsorted buffer, this was not possible. Additionally,
SILK uses partial compactions to reduce the size of reconstructions.
These factors allow SILK to maintain the LSM tree structure without
resorting to insertion throttling, as we do in our system.

\section{Automated Index Composition}

At the beginning of this work, we described what we see as the three
major lines of work attempting to partially automate index design. The
first of these was what we called automated index composition: an
approach to index design which uses a set of data structure primitives
to compose an index over the data, optimized for a particular
workload. All of these works focus explicitly on single-dimensional
data with range scans and point lookups, and only Cosine supports
inserting new records.

Two closely related papers in this line are the so-called Data
Alchemist~\cite{ds-alchemy} and the Periodic Table of Data
Structures~\cite{periodic-table}. Both of these works consider
automatically designing indexes for single-dimensional range data,
capable of addressing point lookups and range scans. The Periodic
Table of Data Structures proposes a wide design space for data
structures based on creating individual ``nodes'' over the data, each
of which has different design decisions applied to it, termed ``first
principles''. A first principle is any design decision that is
irreducible to other decisions, such as the type of partitioning used
(range, radix, etc.), whether the data is stored in a row-based or
columnar format, and so on. In this model, an index can be designed by
iteratively specifying the first principles of each node. Given these
first principles and the set of nodes, access algorithms can be
automatically devised. The Data Alchemist extends this model of data
structures with learned cost models and machine learning based systems
for automatically composing an optimal data structure for a given
workload. Both of these papers discuss their core ideas, but don't
include testing of a working system. The same authors have two further
works in this area that go into more detail on cost
models~\cite{data-calc} and explore the design continuum in more
detail~\cite{ds-continuum}.

GENE~\cite{gene} advances the same basic line of research as the
previously mentioned works, but actually includes a functioning,
end-to-end index generation system. GENE decomposes data structures
into a few specific primitives based upon search algorithm and data
layout. Specifically, it supports scanning, binary search,
interpolation search, exponential search, hashing with closed
addressing, and linear regression modeling as search algorithms. The
supported data layout parameters include column vs.\ row orientation,
sorted vs.\ unsorted ordering, compression, and function mapping
(which obviates the need to make other layout decisions, and is
intended to be used with the linear regression search algorithm). GENE
then automatically designs an index from these options for a given
workload by applying a genetic algorithm.

Cosine~\cite{cosine} is another similar system, designed for
cloud-based deployments and accounting for cloud SLAs and (monetary)
budgeting concerns. It includes sophisticated cost modeling systems
that allow it to dynamically adapt the structure of the index,
shifting elements of it between an LSM-like structure, a B-tree-like
structure, and an LSH-like (log-structured hash) structure, as well as
adjusting various configuration parameters associated with each of
these primitive components. It is particularly notable for being the
only work discussed in this section that supports updates.

Finally, fluid data structures~\cite{fluid-ds} represent a more formal
work in this area, based upon immutable data primitives. The core idea
of this work is that the physical representation of the data can be
mutated while maintaining its logical ordering, under certain
circumstances. This allows regions of the data structure to shift
dynamically to optimize for particular types of search. To accomplish
this, the authors define a formal grammar, which they call a
compositional organizational grammar (COG), as a description of a data
structure, along with transformation rules that can be applied to a
structure described in this language while ensuring logical
equivalence. A runtime system can then automatically apply these
transformations to the structure. This is the only work in this
section that discusses generalizing its techniques beyond
single-dimensional data.

\section{Generalized Index Templates}

The other line of work in automatic index design we discussed was
generalized index templates. These are systems which expose a
generalization of a particular type of data structure, presenting
hooks for user-defined functions within various data structure
operations, while providing all of the necessary index features
(concurrency, crash recovery, etc.) automatically. Assuming that the
underlying data structure can be used to construct an index for a
given use case, this approach makes it straightforward for a database
user to produce a custom index for a particular search problem or data
type. Unfortunately, these templates are restricted by their
underlying data structure, and can only produce indexes that fit this
model. There are two major examples of generalized index templates:
the generalized search tree (GiST)~\cite{gist, concurrent-gist,
pg-gist} and the generalized inverted index (GIN)~\cite{pg-gin}.

The GiST~\cite{gist} is a general data structure built on a search
tree, which allows the user to specify certain behaviors to adapt it
to their needs, while automatically providing concurrency and crash
recovery~\cite{concurrent-gist}. It has been implemented in
Postgres~\cite{pg-gist}. GiST requires the user to implement six
functions,
\begin{itemize}
\item $\mathbftt{consistent}(E, q)$ \\
	Given an internal node entry, $E$, and a query predicate $q$,
	it determines whether the entry could satisfy the predicate and
	returns true if so, or false if the predicate can certainly not
	be satisfied.

\item $\mathbftt{union}(P)$ \\
	Given a set of contiguous internal node entries, $P$, this
	function returns a predicate that holds for all tuples stored
	in children of the entries.

\item $\mathbftt{compress}(E)$ \\
	Returns a compressed version of an entry, $E$. This is used to
	produce internal node separator keys.

\item $\mathbftt{decompress}(E)$ \\
	Returns a decompressed version of a compressed entry, $E$. This
	is used to recover record intervals consistent with a separator
	key. 

\item $\mathbftt{penalty}(E_1, E_2)$ \\
	Returns the ``penalty'' for inserting $E_2$ into the sub-tree
	rooted at $E_1$. This is used for tree balancing in the insertion
	routines.

\item $\mathbftt{pick\_split}(P)$ \\
	Given a set of contiguous internal node entries, $P$, pick a split
	point to break $P$ into two disjoint partitions. This is used during
	split operations to maintain tree balance.
	
\end{itemize}

Given these user-defined functions, the GiST automatically provides
two search routines: a general search returning all records matching a
predicate, and a specialized search for linearly ordered data. The
latter also requires that a $\mathbftt{compare}(E_1, E_2)$ function be
specified, which serves as a comparator between two records. The
structure also supports inserting and deleting records.
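
To ground this interface, the sketch below shows how the functions
might be instantiated for one-dimensional range search over integer
keys, where each entry summarizes its subtree as a closed key
interval. This is our own illustrative instantiation, not Postgres's
actual C API; $\mathbftt{compress}$ and $\mathbftt{decompress}$ are
omitted (taken to be the identity).

\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative instantiation for 1-D integer range search: an entry
// summarizes the keys beneath it as a closed interval [lo, hi].
struct Entry { std::int64_t lo, hi; };
struct Query { std::int64_t lo, hi; }; // "find all keys in [lo, hi]"

// consistent: may some key under e satisfy q? (no false negatives)
bool consistent(const Entry &e, const Query &q) {
    return e.lo <= q.hi && q.lo <= e.hi;
}

// union: smallest interval covering a set of entries ("union" is a
// C++ keyword, hence the trailing underscore)
Entry union_(const std::vector<Entry> &p) {
    Entry u = p.front();
    for (const Entry &e : p) {
        u.lo = std::min(u.lo, e.lo);
        u.hi = std::max(u.hi, e.hi);
    }
    return u;
}

// penalty: how much e1's interval must grow to absorb e2; insertion
// descends toward the child with the smallest penalty.
std::int64_t penalty(const Entry &e1, const Entry &e2) {
    Entry u = union_({e1, e2});
    return (u.hi - u.lo) - (e1.hi - e1.lo);
}

// pick_split: sort by lower bound and cut in half, keeping the two
// partitions' key ranges as disjoint as the data allows.
std::pair<std::vector<Entry>, std::vector<Entry>>
pick_split(std::vector<Entry> p) {
    std::sort(p.begin(), p.end(),
              [](const Entry &a, const Entry &b) { return a.lo < b.lo; });
    std::vector<Entry> left(p.begin(), p.begin() + p.size() / 2);
    std::vector<Entry> right(p.begin() + p.size() / 2, p.end());
    return {left, right};
}
\end{verbatim}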

The generalized inverted index, GIN, is a similar concept for a
different data structure~\cite{pg-gin}. Rather than generalizing a
search tree, GIN generalizes an inverted index. This structure indexes
composite values, each of which contains multiple keys, and allows all
values containing a specified key to be easily identified. The classic
example of a use for an inverted index is document search, where the
values are entire documents and the keys are individual words.

GIN requires the user to specify the following functions,\footnote{
	We have streamlined the representations of these functions
	somewhat to conform to the conventions of this work. In the
	original source documentation, these function definitions are
	given in C, with many more arguments for outputs, configuration,
	etc.
	}
\begin{itemize}
	\item $\mathbftt{extract\_value}(V)$ \\
		Extract all of the keys from a given value, $V$, and return them
		as an array. 
	\item $\mathbftt{extract\_query}(Q)$ \\
		Extract the keys to search for in a given query, $Q$.
	\item $\mathbftt{consistent}(V, Q)$ \\
		Checks an indexed value, $V$, for the keys contained in a query,
		$Q$. Returns an array of booleans, where the $i$th
		element of the array is true if the $i$th key from the
		query appears in the value.
	\item $\mathbftt{compare}(E_1, E_2)$ \\
		A comparator used for sorting keys.
\end{itemize}
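
For the document-search example above, these functions might be
instantiated as in the sketch below (our own illustration in the
streamlined notation, not the actual C signatures), with documents as
values and whitespace-delimited words as keys.

\begin{verbatim}
#include <sstream>
#include <string>
#include <vector>

// extract_value: the keys of a document are its whitespace-split
// words.
std::vector<std::string> extract_value(const std::string &document) {
    std::istringstream in(document);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// extract_query: a query is itself a list of words to look for.
std::vector<std::string> extract_query(const std::string &query) {
    return extract_value(query);
}

// consistent: element i is true iff the i-th query key appears in
// the value.
std::vector<bool> consistent(const std::string &value,
                             const std::string &query) {
    auto keys = extract_value(value);
    std::vector<bool> out;
    for (const auto &qk : extract_query(query)) {
        bool found = false;
        for (const auto &k : keys) found = found || (k == qk);
        out.push_back(found);
    }
    return out;
}

// compare: plain lexicographic order on keys.
int compare(const std::string &a, const std::string &b) {
    return a.compare(b);
}
\end{verbatim}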

Note that, unlike GiST, GIN doesn't generalize the construction of
the index itself. It uses a B-tree index over the keys with limited
user control over how it is built. The generalization of GIN is in the
extraction of keys from values, and the query process itself. Like GiST,
a GIN index automatically provides concurrency and crash recovery.