\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+Tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance shown in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has average performance comparable to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. While the dynamized structure has much better ``best-case'' performance, its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. This poor worst-case performance is a direct consequence of the different approaches to update support used by the dynamized structure and the B+Tree. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted. However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). The fact that our dynamization technique uses buffering, and that most of the shards involved in reconstructions are kept small by the logarithmic decomposition used to partition the data, ensures that the majority of inserts are low cost compared to the B+Tree; but at the extreme end of the latency distribution, the local reconstruction strategy used by the B+Tree results in better worst-case performance. Unfortunately, the design space that we have been considering thus far is limited in its ability to meaningfully alter the worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling can allow for a small reduction in the worst case, but at the cost of making the majority of inserts worse because of increased write amplification.
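Recall that, for a logarithmic decomposition with a constant scale factor, the amortized cost of an insert is on the order of $\frac{B(n)}{n}\log n$, where $B(n)$ denotes the cost of building the static structure over $n$ records, while the worst-case insert must rebuild the largest shards in the structure,
\begin{equation*}
\underbrace{\Theta\left(\frac{B(n)}{n}\log n\right)}_{\text{amortized insertion cost}}
\qquad \text{vs.} \qquad
\underbrace{\Theta\left(B(n)\right)}_{\text{worst-case insertion cost}}
\end{equation*}
No choice of layout policy, scale factor, or buffer size changes this worst case asymptotically; these parameters influence only how frequently, and in what pattern, that cost is paid.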
\begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} The other tuning knobs that are available to us are of limited usefulness in tuning the worst-case behavior. Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}), respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer size has almost no effect on the worst-case latency at all, or even on the distribution, particularly when tiering is used. This is to be expected: ultimately, the worst-case reconstruction is largely the same regardless of scale factor or buffer size, namely a reconstruction involving $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins, temporarily, to exhibit better tail latency behavior than the other. However, after enough records have been inserted to cause the next largest reconstructions to begin to occur, the ``better'' configuration begins to appear worse again in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases. These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built up over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex--which is \emph{bad} on this plot, as it means a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just this, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics while allowing for significantly better insertion tail latencies.
We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice. \section{The Insertion-Query Trade-off} As reconstructions are at the heart of the insertion tail latency problem, it seems worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and is done using reconstructions to either merge the blocks (in the case of the Bentley-Saxe method) or re-partition them (in the case of the equal block method). Performing less frequent reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. As a reminder, this technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. Unlike the design space we have proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. Note that in this test the final record count was known in advance, allowing all re-partitioning to be avoided. This represents a sort of ``best case scenario'' for the technique, and isn't reflective of real-world performance, but it does serve to demonstrate the relevant properties in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction, and so increasing the block count results in smaller blocks and better insertion performance. These worst-case results also translate directly into improved average throughput, at the cost of query latency, as shown in Figure~\ref{fig:tl-ebm-tradeoff}. Note that, in contrast to our Bentley-Saxe-inspired dynamization system, the equal block method provides clear and direct relationships between insertion and query performance, as well as direct control over tail latency, through its design space. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with varying values of $f(n)$.} \label{fig:tl-ebm} \end{figure} Unfortunately, the equal block method is not well suited for our purposes.
Despite having a much cleaner trade-off space, its performance is strictly worse than that of our dynamization system. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, our technique provides significantly better insertion throughput.\footnote{ In actuality, the insertion performance of the equal block method would be even \emph{worse} in practice than the numbers presented here. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any re-partitioning as the structure grew, and reduced write amplification. } This is because, in our technique, the variable size of the blocks allows the majority of the reconstructions to occur with smaller structures, while allowing the majority of the records to exist in a small number of large blocks at the bottom of the structure. This setup enables high insertion throughput while keeping the block count small. But, as we've seen, the cost of this is large tail latencies, as the large blocks must occasionally be involved in reconstructions. However, we can use the extreme ends of the equal block method's design space to consider upper limits on the insertion and query performance that we might expect to get out of a dynamized structure, and then take steps within our own framework to approach these limits while retaining the desirable characteristics of the logarithmic decomposition. At the extreme end, consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method, with every block fixed at a capacity of $N_B$. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search problem. Applying this technique to an ISAM tree and comparing it against a B+Tree yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a ``Reconstructionless'' Dynamization} \label{fig:tl-floodl0} \end{figure} Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases. In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to the B+Tree, as shown in Figure~\ref{fig:tl-floodl0-query}. Unfortunately, the query latency of this technique is far too large for it to be useful; it is necessary to perform reconstructions to merge these small shards together to ensure good query performance.
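To make this limiting case concrete, the following sketch shows one way such a ``reconstructionless'' dynamization could be structured. It is a minimal illustration under simplifying assumptions rather than our framework's actual implementation: \texttt{Shard} stands in for any static structure (such as an ISAM tree) that can be built from a batch of records and answer a query over only its own records, and all of the names are hypothetical.
\begin{verbatim}
/*
 * Minimal sketch of a "reconstructionless" dynamization: the buffer is
 * flushed into a new immutable shard when it fills, and existing shards
 * are never merged. Shard is a placeholder for any static structure with
 * a build-from-records constructor and a per-shard query routine.
 */
#include <cstddef>
#include <memory>
#include <vector>

template <typename Record, typename Shard>
class ReconstructionlessIndex {
public:
    explicit ReconstructionlessIndex(size_t buffer_cap)
        : m_buffer_cap(buffer_cap) {}

    // Worst-case insert cost is B(N_B): flushing the buffer into one shard.
    void insert(const Record &rec) {
        m_buffer.push_back(rec);
        if (m_buffer.size() >= m_buffer_cap) {
            m_shards.push_back(std::make_unique<Shard>(m_buffer));
            m_buffer.clear();
        }
    }

    // Every query must visit all Theta(n / N_B) shards plus the buffer,
    // which is why this scheme has untenable query performance.
    template <typename Predicate>
    std::vector<Record> query(Predicate pred) const {
        std::vector<Record> result;
        for (const auto &shard : m_shards) {
            shard->query(pred, result);        // local query against one shard
        }
        for (const auto &rec : m_buffer) {     // linear scan of the buffer
            if (pred(rec)) {
                result.push_back(rec);
            }
        }
        return result;
    }

private:
    size_t m_buffer_cap;                          // N_B
    std::vector<Record> m_buffer;                 // mutable buffer
    std::vector<std::unique_ptr<Shard>> m_shards; // never reconstructed
};
\end{verbatim}
The insertion path never touches existing shards, which is what bounds its worst case at $B(N_B)$; but it also never reduces the shard count, which is what destroys query performance.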
However, this does raise an interesting point. Fundamentally, the reconstructions that contribute to tail latency are \emph{not} required from an insertion perspective; they are a query optimization. Thus, we could remove the reconstructions from the insertion process and perform them elsewhere. This could, theoretically, allow us to have our cake and eat it too. The only insertion bottleneck would become the buffer flushing procedure--as is the case in our hypothetical ``reconstructionless'' approach. Unfortunately, it is not as simple as pulling the reconstructions off of the insertion path and running them in the background, as this alone cannot provide us with a meaningful bound on the number of blocks in the dynamized structure. But it is still possible to provide this bound, if we are willing to throttle the insertion rate to be slow enough for the background reconstructions to keep up. In the next section, we'll discuss a technique based on this idea. \section{Relaxed Reconstruction} There is theoretical work in this area, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for controlling the worst-case insertion cost is to break the largest reconstructions up into sequences of small operations that can then be attached to individual inserts, spreading the total workload out and ensuring that each insert takes a consistent amount of time. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be far more uniform. Unfortunately, this technique has a number of limitations that we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are \begin{enumerate} \item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem requires the worst-case optimized dynamization systems to maintain complicated collections of partially built structures. \item The approach assumes that the workload of building a block can be evenly divided in advance, and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure reconstruction routines, and doesn't admit simple, generalized interfaces. \end{enumerate} In this section, we consider how these restrictions can be overcome given our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized techniques, given a few assumptions about available resources. First, a comment on nomenclature. We define the term \emph{last level}, $i = \ell$, to mean the level in the dynamized structure with the largest index value (and thereby the most records) and the \emph{first level} to mean the level with index $i=0$. Any level with $0 < i < \ell$ is called an \emph{internal level}. Relative to some level at index $i$, the \emph{next level} is the level at index $i + 1$. A reconstruction on a level involves the compaction of all blocks on that level into one larger block, which is then appended to the next level. At a very high level, our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately be flushed and a new shard will be placed in the first level.
Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ shards, a reconstruction will immediately be triggered to merge these shards and push the result down to the next level. To ensure that the number of shards in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced with the amount of time needed to complete reconstructions. \begin{figure} \caption{Several ``states'' of tiering, leading up to the worst-case reconstruction.} \label{fig:tl-tiering} \end{figure} First, we'll consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be completed to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, and it allows us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}. This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which under tiering will be $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction finishes before these $\Theta(n)$ inserts have been performed, it is sufficient to make the insertion rate appropriately slow. Ignoring the cost of buffer flushing, this means that inserts must cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must take at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level reconstruction. However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last level reconstruction can be scheduled, there will be exactly $1$ shard on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ shards.
There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so we can bound each of the corresponding stalls above by $\frac{B(n)}{n}$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in a worst-case insertion latency bound equivalent to that of~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's shard code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we avoid the need to add any complicated additional structures to manage partially built shards as new records are added. \subsection{Reducing Stall with Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there are two available threads of parallel execution, which allows for the reconstructions to run in parallel with inserts. The amount of necessary insertion stall can be significantly reduced, however, if more than two threads are available. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that covering only the cost of the last level reconstruction is insufficient to maintain the bound on the shard count. From the moment that the last level has filled, and this reconstruction can begin, every level within the structure will sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction. Consider a parallel implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last level reconstruction, and blocks all other reconstructions until it has been completed. Such an approach would result in a stall of $\delta = \frac{B(n)}{n}$ and complete the last level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ shards would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ shards in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active on the parallel reconstruction thread. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because at worst there will be a reconstruction running on every level, and each reconstruction will involve no more than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last level reconstruction cost, and the sum term accounts for the cost of the $s-1$ reconstructions on each internal level.
Dropping constants and expanding the sum results in \begin{equation*} B(n) \cdot \log n \end{equation*} total reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism allows us to reduce this. At the upper limit, assume that there are $\log n$ threads available for parallel reconstructions. This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $\log n$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ threads of parallelism to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these threads in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. A reconstruction on level $i$ will require $\Theta(B(N_B \cdot s^{i+1}))$ time to complete. Thus, the necessary stall to fully cover a reconstruction on level $i$ is this cost, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the structure above this level). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} necessary stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For all reconstructions running in parallel, the necessary stall is the maximum stall over all of the parallel reconstructions, \begin{equation*} \delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} \end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will always be the largest. Since $N_B \cdot s^{\ell + 1} \in \Theta(n)$, we find that \begin{equation*} I(n) \in O \left(\frac{B(n)}{n}\right) \end{equation*} is the worst-case insertion cost, while ensuring that all reconstructions are completed in time to maintain the shard bound, given $\log n$ parallel threads. \end{proof} \section{Implementation} The previous section demonstrated that, theoretically, it is possible to meaningfully control the tail latency of our dynamization system by relaxing the reconstruction process and throttling the insertion rate, rather than blocking insertions, as a means of controlling the shard count within the structure. However, there are a number of practical problems to be solved before this idea can be used in a real system. In this section, we discuss these problems and our approaches to solving them, producing a dynamization framework based upon this technique.
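Before turning to the individual components, it is worth sketching the general shape of the throttling logic that the preceding analysis suggests. The code below is purely illustrative and does not reflect our framework's actual interfaces: the per-level stall is computed as an estimated build cost divided by the number of inserts that can occur before that reconstruction must finish, mirroring the derivation of $\delta_i$ above, and the cost estimates (here \texttt{est\_build\_cost\_us}) are assumed to be supplied by some external cost model.
\begin{verbatim}
/*
 * Illustrative sketch of insertion throttling (not the framework's actual
 * interface). Each insert appends to the mutable buffer and then stalls
 * long enough to "pay for" its share of the outstanding background
 * reconstructions: the stall for a reconstruction on level i is its
 * estimated build cost divided by the number of inserts that can occur
 * before that reconstruction must finish.
 */
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct OngoingReconstruction {
    size_t level;              // index of the level being reconstructed
    double est_build_cost_us;  // estimate of B(N_B * s^(i+1)), in microseconds
};

class InsertionThrottle {
public:
    InsertionThrottle(size_t buffer_cap, size_t scale_factor)
        : m_nb(buffer_cap), m_s(scale_factor) {}

    // Records that fit in the buffer and the levels above level i; this is
    // the number of inserts available before the level-i reconstruction
    // must have finished.
    double capacity_above(size_t level) const {
        double cap = static_cast<double>(m_nb);    // the mutable buffer
        double level_cap = static_cast<double>(m_nb * m_s);
        for (size_t j = 0; j < level; j++) {
            cap += level_cap;                      // level j holds ~N_B * s^(j+1)
            level_cap *= m_s;
        }
        return cap;
    }

    // With one reconstruction thread per level, only the slowest ongoing
    // reconstruction constrains the insertion rate, so we take the maximum
    // of the per-level stalls.
    std::chrono::microseconds stall_for(
        const std::vector<OngoingReconstruction> &active) const {
        double max_stall = 0.0;
        for (const auto &r : active) {
            double per_insert = r.est_build_cost_us / capacity_above(r.level);
            max_stall = std::max(max_stall, per_insert);
        }
        return std::chrono::microseconds(static_cast<long>(max_stall));
    }

    // Called on the insertion path, after the record is appended to the buffer.
    void apply_stall(const std::vector<OngoingReconstruction> &active) const {
        auto delay = stall_for(active);
        if (delay.count() > 0) {
            std::this_thread::sleep_for(delay);
        }
    }

private:
    size_t m_nb;  // buffer capacity N_B
    size_t m_s;   // scale factor s
};
\end{verbatim}
With only a single background reconstruction thread, as in Theorem~\ref{theo:worst-case-optimal}, the per-reconstruction stalls would instead be summed rather than maximized.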
\subsection{Parallel Reconstruction Architecture} \subsection{Concurrent Queries} \subsubsection{Query Pre-emption} Because our implementation supports only a finite number of versions of the mutable buffer at any point in time, and insertions will stall once this limit is reached, it is possible for queries to introduce additional insertion latency. Queries hold a reference to the version of the structure they are using, which includes holding on to a buffer head pointer. If a query is particularly long-running, or otherwise stalled, it is possible that the query will block insertions by holding onto this head pointer. \subsection{Insertion Stall Mechanism} \section{Evaluation} \section{Conclusion}