\chapter{Controlling Insertion Tail Latency}
\label{chap:tail-latency}
\section{Introduction}
\begin{figure}
\subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
\caption{Insertion Performance of Dynamized ISAM vs. B+Tree}
\label{fig:tl-btree-isam}
\end{figure}
Up to this point in our investigation, we have not directly addressed
one of the largest problems associated with dynamization: insertion
tail latency. While these techniques result in structures that have
reasonable, or even good, insertion throughput, the latency associated
with each individual insert is wildly variable. To illustrate this
problem, consider the insertion performance in
Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies
of a dynamized ISAM tree with that of its most direct dynamic analog:
a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput},
the dynamized structure has comparable average performance to the
native dynamic structure, the latency distributions are quite
different. Figure~\ref{fig:tl-btree-isam-lat} shows representations
of the distributions. While the dynamized structure has much better
"best-case" performance, the worst-case performance is exceedingly
poor. That the structure exhibits reasonable performance on average
is the result of these two ends of the distribution balancing each
other out.
This poor worst-case performance is a direct consequence of the strategies
used by the two structures to support updates. B+Trees use a form of
amortized local reconstruction, whereas the dynamized ISAM tree uses
amortized global reconstruction. Because the B+Tree only reconstructs the
portions of the structure ``local'' to the update, even in the worst case
only a portion of the data structure will need to be adjusted. However,
when using techniques based on global reconstruction, the worst-case insert
requires rebuilding either the entire structure (for tiering
or BSM), or at least a very large proportion of it (for leveling). Because
our dynamization technique uses buffering, and because the logarithmic
decomposition keeps most of the shards involved in any given reconstruction
small, the majority of inserts are cheaper than the B+Tree's. At
the extreme end of the latency distribution, however, the local
reconstruction strategy used by the B+Tree results in better worst-case
performance.
Unfortunately, the design space that we have been considering thus far
is limited in its ability to meaningfully alter the worst-case insertion
performance. While we have seen that the choice of layout policy can have
some effect, the actual benefit in terms of tail latency is quite small,
and the situation is made worse by the fact that leveling, which can
have better worst-case insertion performance, lags behind tiering in
terms of average insertion performance. The use of leveling can allow
for a small reduction in the worst case, but at the cost of making the
majority of inserts worse because of increased write amplification.
\begin{figure}
\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
\subfloat[Buffer Size Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-parm-bs}} \\
\caption{Design Space Effects on Latency Distribution}
\label{fig:tl-parm-sweep}
\end{figure}
Additionally, the other tuning knobs that are available to us are
of limited usefulness in tuning the worst-case behavior.
Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of
our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf})
and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There
is no clear trend in worst-case performance to be seen here. This
is to be expected; ultimately, the worst-case reconstructions in
both cases are largely the same regardless of scale factor or buffer
size: a reconstruction involving $\Theta(n)$ records. The selection
of configuration parameters can influence \emph{when} these
reconstructions occur, as well as slightly influence their size, but
ultimately the question of ``which configuration has the best tail-latency
performance'' is more a question of how many insertions the latency is
measured over than of any fundamental trade-off within the design space.
Thus, in this chapter, we will look beyond the design space we have
thus far considered to design a dynamization system that allows for
tail latency tuning in a meaningful capacity. To accomplish this,
we will consider a different way of looking at reconstructions within
dynamized structures.
\section{The Insertion-Query Trade-off}
As reconstructions are at the heart of the insertion tail latency problem,
it seems worth taking a moment to consider \emph{why} they must be done
at all. Fundamentally, decomposition-based dynamization techniques trade
between insertion and query performance by controlling the number of blocks
in the decomposition. Reconstructions serve to place a bound on the
number of blocks, to allow for query performance bounds to be enforced.
This trade-off between insertion and query performance by way of block
count is most directly visible in the equal block method described
in Section~\ref{ssec:ebm}. As a reminder, this technique provides the
following worst-case insertion and query bounds,
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
\end{align*}
where $f(n)$ is the number of blocks.
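As a concrete instantiation of these bounds, the balanced choice
$f(n) = \sqrt{n}$ yields
\begin{align*}
I(n) &\in \Theta\left(\sqrt{n}\right) \\
\mathscr{Q}(n) &\in \Theta\left(\sqrt{n} \cdot \mathscr{Q}\left(\sqrt{n}\right)\right)
\end{align*}
which makes explicit how the block count $f(n)$ serves as the dial between
insertion and query cost.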
Figure~\ref{fig:tl-ebm-trade-off} shows the trade-off between insertion
and query performance for a dynamized ISAM tree using the equal block
method, for various numbers of blocks. The trade-off is evident in the
figure, with a linear relationship between insertion throughput and query
latency, mediated by the number of blocks in the dynamized structure (the
block counts are annotated on each point in the plot). As the number of
blocks is increased, their size is reduced, leading to less expensive
inserts in terms of both amortized and worst-case cost. However, the
additional blocks make queries more expensive.
\begin{figure}
\centering
\includegraphics[width=.75\textwidth]{img/tail-latency/ebm-count-sweep.pdf}
\caption{The Insert-Query Tradeoff for the Equal Block Method with varying
number of blocks}
\label{fig:tl-ebm-trade-off}
\end{figure}
While using the equal block method does allow for direct tuning of
the worst-case insert cost, as well as exposing a very clean trade-off
space for average query and insert performance, the technique is not
well suited to our purposes because the amortized insertion performance
is not particularly good: the insertion throughput is many times worse
than is possible with our dynamization framework for an equivalent
query latency.\footnote{
In actuality, the insertion performance of the equal block method is
even \emph{worse} than the numbers presented here. For this particular
benchmark, we implemented the technique knowing the number of records in
advance, and so fixed the size of each block from the start. This avoided
the need to do any repartitioning as the structure grew, and reduced write
amplification.
}
This is because, in our Bentley-Saxe-based technique, the
variable size of the blocks allows for the majority of the reconstructions
to occur with smaller structures, while allowing the majority of the
records to exist in a single large block at the bottom of the structure.
This setup enables high insertion throughput while keeping the block
count small. But, as we've seen, the cost of this is large tail latencies.
However, we can use the extreme ends of the equal block method's design
space to consider upper limits on the insertion and query performance
that we might expect to get out of a dynamized structure.
Consider what would happen if we were to modify our dynamization framework
to avoid all reconstructions. We retain a buffer of size $N_B$, which
we flush to create a shard when full; however, we never touch the shards
once they are created. This is effectively the equal block method,
with every block fixed at a capacity of $N_B$. Such a technique would
result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and
produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting
in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query
cost for a decomposable search problem. Applying this technique to an
ISAM Tree, and compared against a B+Tree, yields the insertion and query
latency distributions shown in Figure~\ref{fig:tl-floodl0}.
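As a rough illustration, the sketch below shows one possible shape for this
``reconstructionless'' scheme in C++. It is a toy model, not the framework's
actual implementation: the \texttt{ReconstructionlessIndex} and \texttt{Shard}
types, and the use of a sorted vector as a stand-in for an ISAM run, are
illustrative assumptions.
\begin{verbatim}
// Minimal sketch of the "reconstructionless" scheme: records accumulate
// in a buffer of capacity N_B, each flush produces an immutable shard,
// and no shard is ever merged or rebuilt. Types and names here are
// illustrative stand-ins, not the framework's real interfaces.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Record = int64_t;

struct Shard {
    std::vector<Record> data;  // sorted on construction (e.g., an ISAM run)
};

class ReconstructionlessIndex {
public:
    explicit ReconstructionlessIndex(size_t buffer_cap) : cap_(buffer_cap) {}

    // Worst-case insert cost is the flush that builds one shard from a
    // single buffer's worth of records; all other inserts are O(1) appends.
    void insert(Record r) {
        buffer_.push_back(r);
        if (buffer_.size() >= cap_) flush();
    }

    // A point lookup must consult every shard plus the buffer, so query
    // cost grows linearly with the number of flushes: O(n / N_B) probes.
    bool contains(Record r) const {
        if (std::find(buffer_.begin(), buffer_.end(), r) != buffer_.end())
            return true;
        for (const auto &s : shards_)
            if (std::binary_search(s.data.begin(), s.data.end(), r))
                return true;
        return false;
    }

private:
    void flush() {
        Shard s;
        s.data = std::move(buffer_);
        std::sort(s.data.begin(), s.data.end());
        shards_.push_back(std::move(s));
        buffer_.clear();
    }

    size_t cap_;
    std::vector<Record> buffer_;
    std::vector<Shard> shards_;
};
\end{verbatim}
Even this toy version exhibits the behavior visible in
Figure~\ref{fig:tl-floodl0}: inserts are cheap except at flush boundaries,
while lookups degrade linearly with the number of accumulated shards.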
\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
\subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\
\caption{Latency Distributions for a ``Reconstructionless'' Dynamization}
\label{fig:tl-floodl0}
\end{figure}
Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain
insertion latency distributions using amortized global reconstruction
that are directly comparable to dynamic structures based on amortized
local reconstruction, at least in some cases. In particular, the
worst-case insertion tail latency in this model is a direct function
of the buffer size, as the worst-case insert occurs when the buffer
must be flushed to a shard. However, this performance comes at the
cost of queries, which are incredibly slow compared to B+Trees, as
shown in Figure~\ref{fig:tl-floodl0-query}.
While this approach is not useful on its own, it does demonstrate that
low insertion tail latency is not fundamentally out of reach for amortized
global reconstruction, and it motivates the relaxed reconstruction strategy
that we develop in the remainder of this chapter.
\section{Relaxed Reconstruction}
There is theoretical work in this area, which we discussed in
Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for
controlling the worst-case insertion cost is to break the largest
reconstructions up into small sequences of operations that can then be
attached to each insert, spreading the total workload out and ensuring
that each insert takes a consistent amount of time. Theoretically, the total
throughput should remain about the same when doing this, but rather
than having a bursty latency distribution with many fast inserts and
a small number of incredibly slow ones, the distribution should be far more
uniform.
Unfortunately, this technique has a number of limitations, which we
discussed in Section~\ref{ssec:bsm-tail-latency-problem}. The most relevant
for this discussion are:
\begin{enumerate}
\item In the Bentley-Saxe method, the worst-case reconstruction
involves every record in the structure. As such, it cannot be
performed ``in advance'' without significant extra work. This problem
forces the worst-case optimized dynamization systems to maintain
complicated arrangements of partially built structures.
\item The approach assumes that the workload of building a
block can be evenly divided in advance, and somehow attached
to inserts. Even for simple structures, this requires a large
amount of manual adjustment to the data structure reconstruction
routines, and doesn't admit simple, generalized interfaces.
\end{enumerate}
In this section, we consider how these restrictions can be overcome given
our dynamization framework, and propose a strategy that achieves the
same worst-case insertion time as the worst-case optimized techniques,
given a few assumptions about available resources.
At a very high level, our proposed approach is as follows. We will fully
detach reconstructions from buffer flushes. When the buffer fills, it will
immediately flush and a new shard will be placed in L0. Reconstructions
will be performed in the background to maintain the internal structure
according, roughly, to tiering. When a level contains $s$ shards, a
reconstruction will immediately be triggered to merge these shards and
push the result down to the next level. To ensure that the number of
shards in the structure remains bounded by $\Theta(\log n)$, we will
throttle the insertion rate so that it is balanced against the amount of
time needed to complete reconstructions.
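To make this concrete, the sketch below shows one possible arrangement of
the background reconstruction thread in C++. It is a simplified illustration
rather than the framework's actual implementation: the names
(\texttt{RelaxedTieringIndex}, \texttt{add\_flushed\_shard}) are
hypothetical, the ``merge'' simply concatenates and sorts records as a
stand-in for a real shard reconstruction, and the stall is supplied by the
caller rather than derived from $\frac{B(n)}{n}$ as in the analysis that
follows.
\begin{verbatim}
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Shard { std::vector<long> records; };

class RelaxedTieringIndex {
public:
    explicit RelaxedTieringIndex(size_t scale_factor)
        : s_(scale_factor), levels_(1),
          worker_([this] { reconstruct_loop(); }) {}

    ~RelaxedTieringIndex() {
        {
            std::lock_guard<std::mutex> g(mtx_);
            shutdown_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Called when the mutable buffer fills: the freshly built shard is
    // placed in L0 immediately, and the only cost paid on the insert path
    // is the (possibly zero) stall supplied by the caller.
    void add_flushed_shard(Shard shard, std::chrono::microseconds stall) {
        std::this_thread::sleep_for(stall);  // throttle inserts
        {
            std::lock_guard<std::mutex> g(mtx_);
            levels_[0].push_back(std::move(shard));
        }
        cv_.notify_all();  // a merge may now be due
    }

private:
    // Background thread: whenever a level holds s shards, take them, merge
    // them outside the lock (so inserts are never blocked by the
    // reconstruction itself), and push the result down one level.
    void reconstruct_loop() {
        std::unique_lock<std::mutex> lk(mtx_);
        while (!shutdown_) {
            size_t target = levels_.size();
            for (size_t i = 0; i < levels_.size(); ++i) {
                if (levels_[i].size() >= s_) { target = i; break; }
            }
            if (target == levels_.size()) { cv_.wait(lk); continue; }

            std::vector<Shard> input = std::move(levels_[target]);
            levels_[target].clear();
            if (target + 1 == levels_.size()) levels_.emplace_back();
            lk.unlock();

            Shard merged;  // stand-in for a real shard reconstruction
            for (auto &sh : input) {
                merged.records.insert(merged.records.end(),
                                      sh.records.begin(), sh.records.end());
            }
            std::sort(merged.records.begin(), merged.records.end());

            lk.lock();
            levels_[target + 1].push_back(std::move(merged));
        }
    }

    size_t s_;                                // scale factor (shards per level)
    std::vector<std::vector<Shard>> levels_;  // levels_[0] is L0
    std::mutex mtx_;
    std::condition_variable cv_;
    bool shutdown_ = false;
    std::thread worker_;                      // background reconstruction thread
};
\end{verbatim}
A complete implementation must also handle queries that run concurrently
with reconstructions and compute the stall itself; we return to these
concerns in the implementation section below.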
\begin{figure}
\caption{Several "states" of tiering, leading up to the worst-case
reconstruction.}
\label{fig:tl-tiering}
\end{figure}
First, we'll consider how to ``spread out'' the cost of the worst-case
reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in
the development of the internal structure of a dynamized index using
tiering. Importantly, note that the last-level reconstruction, which
dominates the cost of the worst-case reconstruction, \emph{can be
performed well in advance}. All of the records necessary to perform this
reconstruction are present in the last level $\Theta(n)$ inserts before
the reconstruction must be completed to make room. This is a significant
advantage of our technique over the standard Bentley-Saxe method, as it
allows us to spread the cost of this reconstruction over a number
of inserts without much of the complexity of~\cite{overmars81}. This
leads us to the following result:
\begin{theorem}
Given a buffered, dynamized structure utilizing the tiering layout policy,
and at least $2$ parallel threads of execution, it is possible to maintain
a worst-case insertion cost of
\begin{equation}
I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right)
\end{equation}
\end{theorem}
\begin{proof}
Consider the cost of the worst-case reconstruction, which in tiering
will be of cost $\Theta(B(n))$. This reconstruction requires all of the
blocks on the last level of the structure. At the point at which the
last level is full, there will be $\Theta(n)$ inserts before the last
level must be merged and a new level added.
To ensure that the reconstruction has finished by the time these
$\Theta(n)$ inserts have been performed, it suffices to guarantee
that inserts proceed slowly enough. Ignoring the cost of
buffer flushing, this means that inserts must cost,
\begin{equation*}
I(n) \in \Theta(1 + \delta)
\end{equation*}
where the $\Theta(1)$ is the cost of appending to the mutable buffer,
and $\delta$ is a stall inserted into the insertion process to ensure
that the necessary reconstructions are completed in time.
To identify the value of $\delta$, we note that each insert must take
at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level
reconstruction. However, this is not sufficient to guarantee the bound, as
other reconstructions will also occur within the structure. At the point
at which the last level reconstruction can be scheduled, there will be
exactly $1$ shard on each level. Thus, each level will potentially also
have an ongoing reconstruction that must be covered by inserting more
stall time, to ensure that no level in the structure exceeds $s$ shards.
There are $\log n$ levels in total, and so in the worst case we will need
to introduce extra stall time to account for a reconstruction on each
level,
\begin{equation*}
I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
\end{equation*}
All of these internal reconstructions are strictly smaller than the
last-level reconstruction, and so the stall required for each can be
bounded above by $\frac{B(n)}{n}$ time.
Given this, and assuming that the smallest (i.e., most pressing)
reconstruction is prioritized on the background thread, we find that
\begin{equation*}
I(n) \in \Theta\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
\end{proof}
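As a concrete illustration of this bound: for a shard that can be built in
linear time, $B(n) \in \Theta(n)$, the worst-case insertion cost becomes
$I(n) \in \Theta(\log n)$, while a build dominated by sorting,
$B(n) \in \Theta(n \log n)$, yields $I(n) \in \Theta(\log^2 n)$.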
This approach results in an equivalent worst-case insertion latency
bound to~\cite{overmars81}, but manages to resolve both of the issues
cited above. By leveraging two parallel threads, instead of trying to
manually multiplex a single thread, this approach requires \emph{no}
modification to the user's shard code to function. And, by leveraging
the fact that reconstructions under tiering are strictly local to a
single level, we can avoid adding any complicated additional
structures to manage partially built shards as new records are added.
\section{Implementation}
\subsection{Parallel Reconstruction Architecture}
\subsection{Concurrent Queries}
\subsection{Query Pre-emption}
\subsection{Insertion Stall Mechanism}
\section{Evaluation}
\section{Conclusion}