\chapter{Controlling Insertion Tail Latency}
\label{chap:tail-latency}
\section{Introduction}
\begin{figure}
\subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
\caption{Insertion Performance of Dynamized ISAM vs. B+Tree}
\label{fig:tl-btree-isam}
\end{figure}
Up to this point in our investigation, we have not directly addressed
one of the largest problems associated with dynamization: insertion
tail latency. While these techniques result in structures with
reasonable, or even good, insertion throughput, the latency of each
individual insert is wildly variable. To illustrate this problem,
consider the insertion performance in Figure~\ref{fig:tl-btree-isam},
which compares the insertion latencies of a dynamized ISAM tree with
those of its most direct dynamic analog: a B+Tree. As shown in
Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has
average performance comparable to the native dynamic structure, but
the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat},
are quite different. While the dynamized structure has much better
``best-case'' performance, its worst-case performance is exceedingly
poor. That the structure exhibits reasonable performance on average
is the result of these two ends of the distribution balancing each
other out.

The reason for this poor tail latency is reconstruction. To provide
tight bounds on the number of shards within the structure, our
techniques must block inserts once the buffer has filled, until
sufficient room has been cleared in the structure to accommodate the
new records. This results in the worst-case insertion behavior that
we described mathematically in the previous chapter.
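
To make the source of these stalls concrete, the following minimal
sketch shows a blocking insert path of the sort just described. It is
illustrative only: the class, its members, and the single-shard
\texttt{reconstruct} policy are assumptions made for exposition, not
the actual interface of our framework.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch: a dynamized set of integers backed by a
// mutable buffer and immutable sorted shards.
class DynamizedSet {
public:
    explicit DynamizedSet(std::size_t cap) : m_buffer_cap(cap) {}

    void insert(int rec) {
        if (m_buffer.size() >= m_buffer_cap) {
            // An insert arriving at a full buffer blocks here until
            // the reconstruction finishes; these stalls form the
            // tail of the latency distribution.
            reconstruct();
        }
        m_buffer.push_back(rec);
    }

private:
    // Flush the buffer into the shard structure. A real layout
    // policy (leveling or tiering) would also merge existing
    // shards; in the worst case this rewrites Theta(n) records.
    void reconstruct() {
        std::sort(m_buffer.begin(), m_buffer.end());
        m_shards.push_back(std::move(m_buffer));
        m_buffer.clear();
    }

    std::size_t m_buffer_cap;
    std::vector<int> m_buffer;              // mutable buffer
    std::vector<std::vector<int>> m_shards; // immutable shards
};
\end{verbatim}
An insert that lands on a full buffer pays the entire reconstruction
cost before returning, while every other insert costs only a buffer
append; this is precisely the bimodal behavior visible in
Figure~\ref{fig:tl-btree-isam-lat}.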

Unfortunately, the design space that we have considered thus far
provides very little ability to meaningfully alter worst-case
insertion performance. While the choice between leveling and tiering
does have some effect, the actual benefit in terms of tail latency is
quite small, and the situation is made worse by the fact that
leveling, which can have better worst-case insertion performance,
lags behind tiering in average insertion performance. The use of
leveling allows a small reduction in the worst case, but at the cost
of making the majority of inserts slower due to increased write
amplification.
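
Roughly speaking, this is because under leveling the records on a
level may be rewritten each time the level beneath it spills, up to
$s$ times before the level itself fills, whereas under tiering each
record is written into a level only once. The total write cost per
record therefore differs between the two policies by about a factor
of the scale factor,
\begin{equation*}
W_\text{leveling} = O\left(s \log_s n\right) \qquad \text{vs.} \qquad
W_\text{tiering} = O\left(\log_s n\right),
\end{equation*}
while the worst-case reconstruction remains $\Theta(n)$ in either
case.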
\begin{figure}
\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
\subfloat[Buffer Size Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-parm-bs}} \\
\caption{Design Space Effects on Latency Distribution}
\label{fig:tl-parm-sweep}
\end{figure}

Additionally, the other tuning knobs available to us are of limited
usefulness for controlling worst-case behavior.
Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of
our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf})
and buffer size (Figure~\ref{fig:tl-parm-bs}). No clear trend in
worst-case performance emerges. This is to be expected: in both cases
the worst-case reconstruction is largely the same regardless of scale
factor or buffer size, namely a reconstruction involving $\Theta(n)$
records. The selection of configuration parameters can influence
\emph{when} these reconstructions occur, as well as slightly influence
their size, but ultimately the question of ``which configuration has
the best tail-latency performance'' is more a question of how many
insertions the latency is measured over than of any fundamental
trade-off within the design space.
\begin{example}
Consider two dynamized structures, $\mathscr{I}_A$ and $\mathscr{I}_B$,
with slightly different configurations. Regardless of the layout
policy used (of those discussed in Chapter~\ref{chap:design-space}),
the worst-case insertion will occur when the structure is completely
full, i.e., after
\begin{equation*}
n_\text{worst} = N_B + \sum_{i=0}^{\log_s n} N_B \cdot s^{i+1}
\end{equation*}
insertions. Let this quantity be $n_a$ for $\mathscr{I}_A$ and $n_b$
for $\mathscr{I}_B$, and let $\mathscr{I}_A$ be configured with scale
factor $s_a$ and $\mathscr{I}_B$ with scale factor $s_b$, such that
$s_a < s_b$. Then $n_a \neq n_b$, and so a benchmark run over a fixed
number of insertions may capture the full worst-case reconstruction
of one structure but not the other, making either appear to have the
better tail latency depending on where the measurement window ends.
\end{example}
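
As a concrete illustration, the following short program evaluates
$n_\text{worst}$ for two such configurations. The parameter values
(a buffer of $N_B = 1000$ records, five levels, and scale factors
$s_a = 2$ and $s_b = 8$) are hypothetical, chosen purely to
demonstrate the effect.
\begin{verbatim}
#include <cstdio>

// Evaluate n_worst = N_B + sum_{i=0}^{L} N_B * s^(i+1) for buffer
// size N_B, scale factor s, and L+1 levels.
static long long n_worst(long long nb, long long s, int levels) {
    long long total = nb;   // records held in the mutable buffer
    long long cap = nb;
    for (int i = 0; i <= levels; i++) {
        cap *= s;           // level i holds N_B * s^(i+1) records
        total += cap;
    }
    return total;
}

int main() {
    std::printf("s_a = 2: %lld records\n", n_worst(1000, 2, 4));
    std::printf("s_b = 8: %lld records\n", n_worst(1000, 8, 4));
    // Prints 63000 and 37449000: the two structures reach their
    // worst-case reconstructions at very different workload sizes.
    return 0;
}
\end{verbatim}
Under these hypothetical parameters, $\mathscr{I}_A$ becomes
completely full after $63{,}000$ insertions while $\mathscr{I}_B$
does not until past $37$ million, so the two reach their worst cases
at points differing by nearly three orders of magnitude.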
The upshot of this discussion is that tail latencies are due to the
worst-case reconstructions associated with this method, and that the
proposed design space does not provide the necessary tools to avoid or
reduce these costs.
\section{The Insertion-Query Trade-off}