\chapter{Proposed Work}
\label{chap:proposed}

The previous two chapters described work that has already been
completed; however, several tasks remain to be done as part of this
project. Update support is only one of the important features that an
index requires of its data structure.  In this chapter, the remaining
research problems will be discussed briefly, to lay out a set of criteria
for project completion.

\section{Concurrency Support}

Database management systems are designed to hide the latency of
IO operations, and one of the techniques they use to do so is a high
degree of concurrency. As a result, any data structure used to build a database
index must also support concurrent updates and queries. The sampling
extension framework described in Chapter~\ref{chap:sampling} had basic
concurrency support, but work is ongoing to integrate a superior system
into the framework of Chapter~\ref{chap:framework}.

Because the framework is based on the Bentley-Saxe method, it has a number
of desirable properties for making concurrency management simpler. With
the exception of the buffer, the vast majority of the data resides in
static data structures. When using tombstones, these static structures
become fully immutable. This turns concurrency control into a resource
management problem, and suggests a simple multi-version concurrency
control scheme. Each version of the structure, defined as being the
state between two reconstructions, is tagged with an epoch number. A
query, then, will read only a single epoch, which will be preserved
in storage until all queries accessing it have terminated. Because the
mutable buffer is append-only, a consistent view of it can be obtained
by storing the tail of the log at the start of query execution.  Thus,
a fixed snapshot of the index can be represented as a two-tuple containing
the epoch number and buffer tail index.
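The snapshotting scheme described above can be made concrete with a short
sketch. The class and method names here are hypothetical illustrations,
not part of the existing framework; epoch pinning is shown with a simple
lock and reference counts, while the real implementation would likely use
atomics.

```python
import threading

class EpochManager:
    """Tracks structure versions (epochs) and pins them while queries run."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current_epoch = 0
        self.buffer = []                  # append-only mutable buffer
        self._refcounts = {0: 0}          # epoch -> number of active queries

    def begin_query(self):
        """Pin the current version as a (epoch number, buffer tail) pair."""
        with self._lock:
            epoch = self.current_epoch
            self._refcounts[epoch] += 1
            return (epoch, len(self.buffer))   # fixed snapshot two-tuple

    def end_query(self, snapshot):
        """Unpin; an old epoch can be reclaimed once its refcount hits zero."""
        epoch, _ = snapshot
        with self._lock:
            self._refcounts[epoch] -= 1
            if epoch != self.current_epoch and self._refcounts[epoch] == 0:
                del self._refcounts[epoch]     # reclaim the old version

    def advance_epoch(self):
        """Install the result of a reconstruction as a new epoch."""
        with self._lock:
            self.current_epoch += 1
            self._refcounts[self.current_epoch] = 0
```

A query would then read only shards belonging to its pinned epoch, plus
buffer records below its recorded tail index, yielding a consistent view
without blocking concurrent inserts.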

The major limitation of the Chapter~\ref{chap:sampling} system was
the handling of buffer expansion. While the mutable buffer itself is
an unsorted array, and thus supports concurrent inserts using a simple
fetch-and-add operation, the real hurdle to insert performance is managing
reconstruction. During a reconstruction, the buffer is full and cannot
support any new inserts. Because active queries may be using the buffer,
it cannot be immediately flushed, and so inserts are blocked. Because of
this, it is necessary to use multiple buffers to sustain insertions. When
a buffer is filled, a background thread is used to perform the
reconstruction, and a new buffer is added to continue inserting while that
reconstruction occurs. In Chapter~\ref{chap:sampling}, the solution used
was limited by its restriction to only two buffers (and as a result,
a maximum of two active epochs at any point in time). Any sustained
insertion workload would quickly fill up the pair of buffers, and then
be forced to block until one of the buffers could be emptied. This
emptying of the buffer was contingent on \emph{both} all queries using
the buffer finishing, \emph{and} on the reconstruction using that buffer
to finish. As a result, the length of the block on inserts could be long
(multiple seconds, or even minutes for particularly large reconstructions)
and indeterminate (a given index could be involved in a very long-running
query, and the buffer would be blocked until the query completed).

Thus, a more effective concurrency solution would need to support
dynamically adding mutable buffers as needed to maintain insertion
throughput. This would allow for insertion throughput to be maintained
so long as memory for more buffer space is available.\footnote{For the
in-memory indexes considered thus far, it isn't clear that running out of
memory for buffers is a recoverable error in all cases. The system would
require the same amount of memory for storing records (technically more,
considering index overhead) in a shard as it does in the buffer. In the
case of an external storage system, the calculus would be different,
of course.} It would also ensure that a long-running query could only block
insertion if there is insufficient memory to create a new buffer or to
run a reconstruction. However, as the number of buffered records grows,
there is the potential for query performance to suffer, which leads to
another important aspect of an effective concurrency control scheme.
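A minimal sketch of such dynamic buffer growth is shown below. All names
are hypothetical, background reconstruction is elided, and a
\texttt{max\_buffers} cap stands in for running out of memory; in the real
system, the cap would be imposed by available memory rather than a count.

```python
class BufferChain:
    """Grows the mutable buffer dynamically: when one unsorted array fills,
    a new one is appended so inserts can proceed while a background thread
    (omitted here) reconstructs and retires the sealed buffers."""

    def __init__(self, buffer_cap=4, max_buffers=None):
        self.buffer_cap = buffer_cap
        self.max_buffers = max_buffers    # None -> bounded only by memory
        self.buffers = [[]]               # sealed buffers plus an active tail

    def insert(self, record):
        tail = self.buffers[-1]
        if len(tail) == self.buffer_cap:  # tail is full: seal it and grow
            if self.max_buffers is not None \
                    and len(self.buffers) >= self.max_buffers:
                return False              # would block in the real system
            self.buffers.append([])       # reconstruction will eventually
            tail = self.buffers[-1]       # flush the sealed buffers
        tail.append(record)
        return True

    def buffered_records(self):
        return sum(len(b) for b in self.buffers)
```

With the two-buffer limit of Chapter~\ref{chap:sampling}, this sketch
degenerates to \texttt{max\_buffers=2} and blocks under sustained
insertion, which is exactly the behavior the proposed scheme avoids.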

\subsection{Tail Latency Control}

The concurrency control scheme discussed thus far maintains
insertion throughput by allowing an unbounded portion of the new data
to remain buffered in an unsorted fashion. Over time, this buffered
data will be moved into data structures in the background, as the
system performs merges (which are moved off of the critical path for
most operations). While this system allows for fast inserts, it has the
potential to damage query performance. This is because the more buffered
data there is, the more a query must fall back on its inefficient
scan-based buffer path, as opposed to using the data structure.

Unfortunately, reconstructions can be incredibly lengthy (recall that
the worst-case scenario involves rebuilding a static structure over
all of the records; this is, thankfully, quite rare). This implies that
it may be necessary in certain circumstances to throttle insertions to
maintain certain levels of query performance. Additionally, it may be
worth preemptively performing large reconstructions during periods of
low utilization, similar to SILK, a system designed to mitigate tail
latency spikes in LSM-tree based systems~\cite{balmau19}.
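One possible form of such throttling is sketched below. The policy,
thresholds, and function name are purely illustrative assumptions: the
insertion path is delayed in proportion to how far the buffered fraction
of the data exceeds a target, giving background reconstructions time to
catch up.

```python
def insert_delay_ms(buffered, total, target_frac=0.05, max_delay_ms=10.0):
    """Hypothetical throttle: once buffered records exceed target_frac of
    the total record count, insertions are increasingly delayed so that
    background reconstructions can catch up, bounding query degradation."""
    if total == 0 or buffered / total <= target_frac:
        return 0.0                         # within budget: no throttling
    excess = buffered / total - target_frac
    # scale linearly, reaching the maximum delay at twice the target
    return min(max_delay_ms, max_delay_ms * excess / target_frac)
```

An insert thread would sleep for the returned duration before appending
to the buffer; setting the target fraction trades insertion throughput
against worst-case query latency.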

Additionally, it is possible that large reconstructions may have a
negative effect on query performance, due to system resource utilization.
Reconstructions can use a large amount of memory bandwidth, which must
be shared by queries. The effects of parallel reconstruction on query
performance will need to be assessed, and strategies for mitigation of
this effect, be it a scheduling-based solution, or a resource-throttling
one, considered if necessary.


\section{Fine-Grained Online Performance Tuning}

The framework has a large number of configurable parameters, and
introducing concurrency control will add even more. The parameter sweeps
in Section~\ref{ssec:ds-exp} show that there are trade-offs between
read and write performance across this space. Unfortunately, the current
framework applies these configuration parameters globally, and does not
allow them to be changed after the index is constructed. It seems apparent
that better performance might be obtained by adjusting this approach.

First, there is nothing preventing these parameters from being configured
on a per-level basis. Different levels could use different layout policies
(for example, tiering on higher levels and leveling on lower ones),
different scale factors, and so on. More index-specific tuning, such as
controlling the memory budget for auxiliary structures, could also be
considered.

This fine-grained tuning will open up an even broader design space,
which has the benefit of improving the configurability of the system,
but the disadvantage of making configuration more difficult. Additionally,
it does nothing to address the problem of workload drift: a configuration
may be optimal now, but will it remain effective in the future as the
read/write mix of the workload changes? Both of these challenges can be
addressed using dynamic tuning.

The idea is that the framework could be augmented with some workload
and performance statistics tracking. Based on these numbers, during
reconstruction, the framework could decide to adjust the configuration
of one or more levels in an online fashion, to lean more towards read
or write performance, or to dial back memory budgets as the system's
memory usage increases. Additionally, buffer-related parameters could
be tweaked in real time as well. If insertion throughput is high, it
might be worth it to temporarily increase the buffer size, rather than
spawning multiple smaller buffers.
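A toy version of such a per-level decision rule is sketched below. The
statistics, thresholds, policy names, and scale factors are all
illustrative assumptions, not tuned values; the point is only the shape
of the feedback loop run at reconstruction time.

```python
def retune_level(stats, write_heavy=0.75, read_heavy=0.25):
    """Pick a layout configuration for one level from workload statistics
    gathered since the last reconstruction. `stats` holds 'inserts' and
    'queries' counters; the returned values are illustrative only."""
    total = stats["inserts"] + stats["queries"]
    if total == 0:
        return {"policy": "leveling", "scale_factor": 8}   # default
    write_frac = stats["inserts"] / total
    if write_frac >= write_heavy:
        # write-dominated: favor cheap, frequent reconstructions
        return {"policy": "tiering", "scale_factor": 4}
    if write_frac <= read_heavy:
        # read-dominated: favor fewer shards per level
        return {"policy": "leveling", "scale_factor": 8}
    return {"policy": "leveling", "scale_factor": 6}       # mixed workload
```

The framework would invoke such a rule for each level during a
reconstruction, applying the new configuration only to the shards being
rebuilt, so retuning adds no work beyond what the reconstruction already
performs.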

A system like this would allow for more consistent performance of the
system in the face of changing workloads, and also increase the ease
of use of the framework by removing the burden of configuration from
the user.


\section{Alternative Data Partitioning Schemes}

One problem with Bentley-Saxe or LSM-tree derived systems is temporary
memory usage spikes. When performing a reconstruction, the system needs
enough storage to store the shards involved in the reconstruction,
and also the newly constructed shard. This is made worse in the face
of multi-version concurrency, where multiple older versions of shards
may be retained in memory at once. It's well known that, in the worst
case, such a system may temporarily require double its current memory
usage~\cite{dayan22}.

One approach to addressing this problem in LSM-tree based systems is
to adjust the compaction granularity~\cite{dayan22}. In the terminology
associated with this framework, the idea is to further sub-divide each
shard into smaller chunks, partitioned based on keys. That way, when a
reconstruction is triggered, rather than reconstructing an entire shard,
these smaller partitions can be used instead. One of the partitions in
the source shard can be selected, and then merged with the partitions
in the next level down having overlapping key ranges. The amount of
memory required for reconstruction (and also reconstruction time costs)
can then be controlled by adjusting these partitions.
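The partition-granularity merge can be illustrated for the simple
one-dimensional, sorted-array case. The function names and the
pick-the-first-partition victim policy are illustrative assumptions;
partitions are modeled as sorted lists of keys with their key range given
by their first and last elements.

```python
def overlapping(part, parts):
    """Return the partitions in `parts` whose key ranges overlap `part`."""
    lo, hi = part[0], part[-1]
    return [p for p in parts if p[0] <= hi and lo <= p[-1]]

def merge_one_partition(src_parts, dst_parts):
    """Instead of merging whole shards, take one source partition and merge
    it only with the destination partitions it overlaps, bounding the
    memory footprint of a single reconstruction step."""
    victim = src_parts.pop(0)                 # illustrative victim policy
    targets = overlapping(victim, dst_parts)
    keep = [p for p in dst_parts if p not in targets]
    merged = sorted(victim + [k for p in targets for k in p])
    return src_parts, keep + [merged]
```

Each step touches only the victim and its overlapping targets, so peak
memory is bounded by the partition size rather than the shard size; it is
precisely this per-partition step that lacks an obvious analogue for
multi-dimensional data or for shard types without cheap key-range
partitioning.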

Unfortunately, while this system works incredibly well for LSM-tree
based systems which store one-dimensional data in sorted arrays, it
encounters some problems in the context of a general index. It isn't
clear how to effectively partition multi-dimensional data in the same
way. Additionally, in the general case, each partition would need to
contain its own instance of the index, as the framework supports data
structures that don't themselves support effective partitioning in the
way that a simple sorted array would. These challenges will need to be
overcome to devise effective, general schemes for data partitioning to
address the problems of reconstruction size and memory usage.