\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+Tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. The dynamized structure has much better ``best-case'' performance, but its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. This poor worst-case performance is a direct consequence of the different approaches to update support used by the dynamized structure and the B+Tree. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted.
However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). The fact that our dynamization technique uses buffering, and that most of the shards involved in reconstruction are kept small by the logarithmic decomposition used to partition the data, ensures that the majority of inserts are low cost compared to the B+Tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+Tree results in significantly better worst-case performance. Unfortunately, the design space that we have been considering thus far is limited in its ability to meaningfully alter the worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling can allow for a small reduction in the worst case, but at the cost of making the majority of inserts worse because of increased write amplification. \begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} The other tuning knobs that are available to us are of limited usefulness in controlling the worst-case behavior.
Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer has almost no effect on the worst-case latency, or even on the distribution more broadly, particularly when tiering is used. This is to be expected; ultimately the worst-case reconstruction size is largely the same regardless of scale factor or buffer size: $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins to, temporarily, exhibit better tail latency behavior than the other. However, after enough records have been inserted to cause the next largest reconstructions to begin to occur, the ``better'' configuration begins to appear worse again in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases.
These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex, which is \emph{bad} on this plot, as it means a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just that, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics, while also allowing for significantly better insertion tail latencies. We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice. \section{The Insertion-Query Trade-off} As reconstructions are at the heart of the insertion tail latency problem, it seems worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and is done using reconstructions to either merge (in the case of the Bentley-Saxe method) or re-partition (in the case of the equal block method) the blocks. Performing less frequent (or smaller) reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance.
This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. This technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. This worst-case result ignores re-partitioning costs, which may be necessary for certain selections of $f(n)$. We omit it here because we are about to examine a case of the equal block method where no re-partitioning is necessary. When re-partitioning is used, the worst-case cost rises to the now familiar $I(n) \in \Theta(B(n))$ result. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-latency-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with $f(n) = C$ for varying values of $C$.} \label{fig:tl-ebm} \end{figure} Unlike the design space we have proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance, \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. Note that in this test the final record count was known in advance, allowing all re-partitioning to be avoided. This represents a sort of ``best case scenario'' for the technique, and isn't reflective of real-world performance, but does serve to demonstrate the relevant properties in the clearest possible manner.
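To make the mechanics of this configuration concrete, the equal block method with a fixed block count can be sketched in a few lines. This is a toy, single-threaded illustration using sorted integer arrays as stand-ins for ISAM blocks; the class and method names are our own, not those of the benchmark implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy equal block method with f(n) = C: records are spread across C
// blocks, and an insert rebuilds only the smallest block, touching
// Theta(n / C) records in the worst case. A point lookup must probe
// all C blocks, which is the query-side cost of the trade-off.
class EqualBlockMethod {
  std::vector<std::vector<int>> blocks_;  // each block kept sorted

public:
  explicit EqualBlockMethod(std::size_t C) : blocks_(C) {}

  void insert(int key) {
    // Rebuild the smallest block with the new record included; the
    // sort stands in for the block "reconstruction."
    auto smallest = std::min_element(
        blocks_.begin(), blocks_.end(),
        [](const auto &a, const auto &b) { return a.size() < b.size(); });
    smallest->push_back(key);
    std::sort(smallest->begin(), smallest->end());
  }

  bool contains(int key) const {
    for (const auto &b : blocks_)
      if (std::binary_search(b.begin(), b.end(), key)) return true;
    return false;
  }

  std::size_t block_count() const { return blocks_.size(); }
};
```

Raising $C$ shrinks the per-insert rebuild (and thus the worst-case insert) while linearly increasing the number of blocks a query must visit, which is exactly the relationship visible in Figure~\ref{fig:tl-ebm}.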
Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction, and so increasing the block count results in smaller blocks, and better insertion performance. These worst-case results also translate directly into improved average throughput, at the cost of query latency, as shown in Figure~\ref{fig:tl-ebm-tradeoff}. These results show that, contrary to our Bentley-Saxe-inspired dynamization system, the equal block method provides clear and direct relationships between insertion and query performance, as well as direct control over tail latency, through its design space. Unfortunately, the equal block method is not well suited for our purposes. Despite having a much cleaner trade-off space, its performance is strictly worse than that of our dynamization system. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, our technique provides significantly better insertion throughput.\footnote{ In actuality, the insertion performance of the equal block method is even \emph{worse} than the numbers presented here. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any re-partitioning as the structure grew, and reduced write amplification. } This is because, in our technique, the variable size of the blocks allows the majority of the reconstructions to occur on smaller structures, while allowing the majority of the records to reside in a small number of large blocks at the bottom of the structure. This setup enables high insertion throughput while keeping the block count small.
But, as we've seen, the cost of this is large tail latencies, as the large blocks must occasionally be involved in reconstructions. However, we can use the extreme ends of the equal block method's design space to consider upper limits on the insertion and query performance that we might expect to get out of a dynamized structure, and then take steps within our own framework to approach these limits, while retaining the desirable characteristics of the logarithmic decomposition. At the extreme end, consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method, where every block is fixed at a capacity of $N_B$. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search problem. Applying this technique to an ISAM tree, and comparing it against a B+Tree, yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a ``Reconstructionless'' Dynamization} \label{fig:tl-floodl0} \end{figure} Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases.
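Concretely, this ``reconstructionless'' scheme amounts to nothing more than a buffer and an append-only list of shards. The sketch below is our own toy rendering, with sorted arrays standing in for ISAM shards; it shows why the worst-case insert is a single $B(N_B)$ flush, while a query must visit every shard.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A buffer of capacity N_B is flushed into an immutable sorted shard
// when full; shards are never merged afterward. Inserts cost at most
// one flush of N_B records, but a query must probe all of the
// Theta(n / N_B) shards that accumulate.
class Reconstructionless {
  std::size_t nb_;                        // buffer capacity N_B
  std::vector<int> buffer_;               // mutable buffer
  std::vector<std::vector<int>> shards_;  // immutable sorted shards

public:
  explicit Reconstructionless(std::size_t nb) : nb_(nb) {}

  void insert(int key) {
    buffer_.push_back(key);
    if (buffer_.size() == nb_) {          // flush: the only "slow" insert
      std::sort(buffer_.begin(), buffer_.end());
      shards_.push_back(std::move(buffer_));
      buffer_.clear();
    }
  }

  bool contains(int key) const {
    for (const auto &s : shards_)
      if (std::binary_search(s.begin(), s.end(), key)) return true;
    return std::find(buffer_.begin(), buffer_.end(), key) != buffer_.end();
  }

  std::size_t shard_count() const { return shards_.size(); }
};
```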
In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}. Unfortunately, the query latency of this technique is too large for it to be useful; it is necessary to perform reconstructions to merge these small shards together to ensure good query performance. However, this does raise an interesting point. Fundamentally, the reconstructions that contribute to tail latency are \emph{not} required from an insertion perspective; they are a query optimization. Thus, we could remove the reconstructions from the insertion process and perform them elsewhere. This could, theoretically, allow us to have our cake and eat it too. The only insertion bottleneck would become the buffer flushing procedure, as is the case in our hypothetical ``reconstructionless'' approach. Unfortunately, it is not as simple as pulling the reconstructions off of the insertion path and running them in the background, as this alone cannot provide us with a meaningful bound on the number of blocks in the dynamized structure. But it is still possible to provide this bound, if we're willing to throttle the insertion rate to be slow enough to keep up with the background reconstructions. In the next section, we'll discuss a technique based on this idea. \section{Relaxed Reconstruction} There does exist theoretical work on throttling the insertion rate of a Bentley-Saxe dynamization to control the worst-case insertion cost~\cite{overmars81}, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to break the largest reconstructions up into small sequences of operations that can then be attached to each insert, spreading the total workload out and ensuring that each insert takes a consistent amount of time.
Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be much more uniform. Unfortunately, this technique has a number of limitations, which we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are \begin{enumerate} \item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem requires the worst-case optimized dynamization systems to maintain complicated collections of partially built structures. \item The approach assumes that the workload of building a block can be evenly divided in advance, and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure reconstruction routines, and doesn't admit simple, generalized interfaces. \end{enumerate} In this section, we consider how these restrictions can be overcome given our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized theoretical techniques, given a few assumptions about available resources, by taking advantage of parallelism and proactive scheduling of reconstructions. We will be restricting ourselves in this chapter to the tiering layout policy, because it has some specific properties that are useful to our goals. In tiering, the input shards to a reconstruction are restricted to a single level, unlike in leveling, where the shards come from two levels, and BSM, where the shards come from potentially \emph{all} of the levels.
This allows us to maximize parallelism, which we will be using to improve the tail latency performance, greatly simplifies synchronization, and provides us with the largest window over which to amortize the costs of reconstruction. The techniques we describe in this chapter will work with leveling as well, albeit less effectively, and will not work \emph{at all} using BSM. First, a comment on nomenclature. We define the term \emph{last level}, $i = \ell$, to mean the level in the dynamized structure with the largest index value (and thereby the most records) and \emph{first level} to mean the level with index $i=0$. Any level with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction on level $i$ involves the combination of all blocks on that level into one larger block, which is then appended to level $i+1$. Relative to some level at index $i$, the \emph{next level} is the level at index $i + 1$, and the \emph{previous level} is at index $i-1$. Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush and a new block will be placed in the first level. Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number of blocks in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced with the amount of time needed to complete reconstructions.
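Setting aside the background scheduling and throttling for a moment, the maintenance rule itself can be illustrated with a synchronous toy: whenever a level reaches $s$ shards, they are merged into one shard that is appended to the next level. The types and names below are our own simplifications; in the actual framework these reconstructions run on background threads rather than inline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Shard = std::vector<int>;
using Level = std::vector<Shard>;

// Tiering maintenance rule, run synchronously: a flushed shard lands in
// level 0, and any level that reaches s shards has them merged into a
// single shard appended to the next level down. Merges cascade because
// the loop continues downward after each reconstruction.
struct Tiering {
  std::size_t s;              // scale factor: shard count that triggers a merge
  std::vector<Level> levels;

  void flush(Shard shard) {
    if (levels.empty()) levels.emplace_back();
    levels[0].push_back(std::move(shard));
    maintain();
  }

  void maintain() {
    for (std::size_t i = 0; i < levels.size(); ++i) {
      if (levels[i].size() < s) continue;
      Shard merged;                       // the "reconstruction"
      for (auto &sh : levels[i])
        merged.insert(merged.end(), sh.begin(), sh.end());
      std::sort(merged.begin(), merged.end());
      levels[i].clear();
      if (i + 1 == levels.size()) levels.emplace_back();
      levels[i + 1].push_back(std::move(merged));
    }
  }
};
```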
\begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/last-level-recon.pdf} \caption{\textbf{Worst-case Reconstruction.} Using the tiering layout policy, the worst-case reconstruction occurs when every level in the structure has been filled (middle portion of the figure) and a reconstruction must be performed on each level to merge it into a single shard and place it on the level below, leaving the structure with one shard per level after the records from the buffer have been added to L0 (right portion of the figure). The cost of this reconstruction is dominated by the cost of performing a reconstruction on the last level. The last level reconstruction, however, can be performed well in advance, as it only requires the blocks on the last level, which is full $\Theta(n)$ inserts before the worst-case reconstruction is triggered (left portion of the figure). This provides us with the opportunity to initiate this reconstruction early. \emph{Note: the block sizes in this diagram are not scaled to their record counts--each level has an increasing number of records per block.}} \label{fig:tl-tiering} \end{figure} First, we'll consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, which will allow us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}.
This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which in tiering will be of cost $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction finishes before these $\Theta(n)$ inserts have occurred, it is sufficient to throttle the rate of inserts appropriately. Ignoring the cost of buffer flushing, this means that inserts must cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must take at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level reconstruction. However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last level reconstruction can be scheduled, there will be exactly $1$ block on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ blocks.
There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so the stall for each can be bounded above by $O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in a worst-case insertion latency bound equivalent to that of~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's block code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we can avoid needing to add any complicated additional structures to manage partially built blocks as new records are added. \subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there are two available threads of parallel execution, which allows for the reconstructions to run in parallel with inserts. The amount of necessary insertion stall can be significantly reduced, however, if more than two threads are available. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that it is insufficient to cover only the cost of the last level reconstruction to maintain the bound on the block count.
From the moment that the last level has filled, and this reconstruction can begin, every level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. Consider a parallel implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last level reconstruction, and blocks all other reconstructions until it has been completed. Such an approach would result in $\delta = \frac{B(n)}{n}$ stall and complete the last level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ blocks in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active on the parallel reconstruction thread. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because there will at worst be a reconstruction running on every level, and each reconstruction will be no larger than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last level reconstruction cost, and the sum term bounds the cost of the $s-1$ reconstructions on each internal level above by $B(n)$ each. Dropping constants and expanding the sum results in, \begin{equation*} B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism will allow us to reduce this.
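For concreteness, instantiating the two-thread bound of Theorem~\ref{theo:worst-case-optimal} with two representative build costs (an illustration of our own) gives,

```latex
\begin{align*}
B(n) \in \Theta(n) &\implies I(n) \in O\left(\frac{n}{n} \cdot \log n\right) = O(\log n) \\
B(n) \in \Theta(n \log n) &\implies I(n) \in O\left(\frac{n \log n}{n} \cdot \log n\right) = O\left(\log^2 n\right)
\end{align*}
```

which has the same form as the familiar Bentley-Saxe amortized insertion cost of $O\left(\frac{B(n)}{n} \log n\right)$, only now holding in the worst case.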
At the upper limit, assume that there are $\log n$ threads available for parallel reconstructions. This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $\log n$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ threads of parallelism to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these threads in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. Each reconstruction will require $\Theta(B(N_B \cdot s^{i+1}))$ time to complete. Thus, the necessary stall to fully cover a reconstruction on level $i$ is this cost, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the index above this point). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} necessary stall for each level. 
Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For all reconstructions running in parallel, the necessary stall is the maximum stall of all the parallel reconstructions, \begin{equation*} \delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} \end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will always be the largest. $N_B \cdot s^{\ell + 1}$ is $\Theta(n)$, so we find that, \begin{equation*} I(n) \in O \left(\frac{B(n)}{n}\right) \end{equation*} is the worst-case insertion cost, while ensuring that all reconstructions are done in time to maintain the block bound given $\log n$ parallel threads. \end{proof} \section{Implementation} \label{sec:tl-impl} The previous section demonstrated that, theoretically, it is possible to meaningfully control the tail latency of our dynamization system by relaxing the reconstruction processes and throttling the insertion rate, rather than blocking, as a means of controlling the shard count within the structure. However, there are a number of practical problems to be solved before this idea can be used in a real system. In this section, we discuss these problems, and our approaches to solving them to produce a dynamization framework based upon the technique. Note that this system is based on the same high level architecture as we described in Section~\ref{ssec:dyn-concurrency}. To avoid redundancy, we will focus on how this system differs, without fully recapitulating the content of that earlier section. 
\subsection{Parallel Reconstruction Architecture} The existing concurrency implementation described in Section~\ref{ssec:dyn-concurrency} is insufficient for the purposes of constructing a framework supporting the parallel reconstruction scheme described in the previous section. In particular, it is limited to only two active versions of the structure at a time, with one ongoing reconstruction. Additionally, it does not consider buffer flushes as distinct events from reconstructions. In this section, we will discuss the modifications made to the concurrency support within our framework to support parallel reconstructions. Much like the simpler scheme in Section~\ref{ssec:dyn-concurrency}, our concurrency framework will be based on multi-versioning. Each \emph{version} consists of three pieces of information: a buffer head pointer, a buffer tail pointer, and a collection of levels and shards. However, the process of managing, creating, and installing versions is much more complex, to allow more than two versions to exist at the same time under certain circumstances. \subsubsection{Structure Versioning} The internal structure of the dynamization consists of a sequence of levels containing immutable shards, as well as a snapshot of the state of the mutable buffer. This section pertains specifically to the internal structure; the mutable buffer handles its own versioning separately and will be discussed in the next section. \begin{figure} \centering \subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}} \subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}} \caption{\textbf{Structure Version Transitions.} The dynamized structure can transition to a new version via two operations: flushing the buffer into the first level, or performing a maintenance reconstruction to merge shards on some level and append the result onto the next one.
In each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s light grey shards, with the dark grey shards being newly created and the white shards being deleted. The buffer flush operation in Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer and places it in \texttt{L0} to create \texttt{V2}. The maintenance reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex, creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s \texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}. } \label{fig:tl-flush-maint} \end{figure} The internal structure of the dynamized data structure (ignoring the buffer) can be thought of as a list of immutable levels, $\mathcal{V} = \{\mathscr{L}_0, \ldots, \mathscr{L}_h\}$, where each level contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots, \mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of as functions, which accept a version as input and produce a new version as output. Namely, \begin{align*} \mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\ \mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j) \end{align*} where the subscript represents the \texttt{version\_id} and is a strictly increasing number assigned to each version. The $\mathbftt{flush}$ operation builds a new shard using records from the buffer, $\mathcal{B}$, and creates a new version identical to $\mathcal{V}_i$, except with the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs a maintenance reconstruction by building a new shard using all of the shards in level $\mathscr{L}_x$ and creating a new version identical to $\mathcal{V}_i$, except that the new shard is appended to level $\mathscr{L}_j$ and the shards previously in $\mathscr{L}_x$ are removed. These two operations are shown in Figure~\ref{fig:tl-flush-maint}.
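A minimal sketch of these two operations as pure functions over versions might look as follows, with shards shared between versions via \texttt{std::shared\_ptr} so that a new version is a shallow copy of its predecessor. The types and names here are illustrative simplifications of our own, not the framework's actual interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Versions are immutable values: flush() and maint() take a version and
// return a new one, sharing all untouched shards by reference.
using Shard = std::vector<int>;
using ShardRef = std::shared_ptr<const Shard>;
using Level = std::vector<ShardRef>;

struct Version {
  std::vector<Level> levels;
};

// flush: build a shard from the buffer and append it to level 0.
Version flush(const Version &v, const std::vector<int> &buffer) {
  Version next = v;  // shallow copy: shard contents are not duplicated
  if (next.levels.empty()) next.levels.emplace_back();
  next.levels[0].push_back(std::make_shared<const Shard>(buffer));
  return next;
}

// maint: merge every shard on level x into one new shard on level j,
// removing the inputs from level x in the resulting version.
Version maint(const Version &v, std::size_t x, std::size_t j) {
  Version next = v;
  auto merged = std::make_shared<Shard>();
  for (const auto &sh : next.levels[x])
    merged->insert(merged->end(), sh->begin(), sh->end());
  next.levels[x].clear();
  if (next.levels.size() <= j) next.levels.resize(j + 1);
  next.levels[j].push_back(std::move(merged));
  return next;
}
```

Because the input version is passed by const reference and copied, older versions remain intact for any queries still running against them.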
At any point in time, the framework will have \emph{one} active version, $\mathcal{V}_a$, as well as a maximum unassigned version number, $v_m > a$. New version ids are obtained by performing an atomic fetch-and-add on $v_m$, and versions will become active in the exact order of their assigned version numbers. We use the term \emph{installing} a version, $\mathcal{V}_x$, to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$. \Paragraph{Version Number Assignment.} It is the intention of this framework to prioritize buffer flushes, meaning that the versions resulting from a buffer flush should become active as rapidly as possible. It is undesirable to have some version, $\mathcal{V}_f$, resulting from a buffer flush, attempt to install while there is a version $\mathcal{V}_r$ associated with an in-process maintenance reconstruction such that $a < r < f$. In this case, the flush must wait for the maintenance reconstruction to finalize before it can itself be installed. To avoid this problem, we assign version numbers differently based upon whether the new version is created by a flush or a maintenance reconstruction. \begin{itemize} \item \textbf{Flush.} When a buffer flush is scheduled, it is immediately assigned the next available version number at the time of scheduling. \item \textbf{Maintenance Reconstruction.} Maintenance reconstructions are \emph{not} assigned a version number immediately. Instead, they are assigned a version number \emph{after} all of the reconstruction work is performed, during their installation process. \end{itemize} \Paragraph{Version Installation.} Once a given flush or maintenance reconstruction has completed and has been assigned a version number, $i$, the version will attempt to install itself. The thread running the operation will wait until $a = i - 1$, and then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an atomic pointer assignment.
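The version number assignment and ordered installation just described can be sketched with a pair of atomic counters. This is a minimal illustration under assumed names, not the framework's implementation; in particular, a production version would yield or sleep rather than busy-wait, and would also swap the actual version pointer during installation.

```cpp
#include <atomic>
#include <cstddef>

// v_m: the next unassigned version id; a: the id of the active version.
std::atomic<std::size_t> next_version{1};
std::atomic<std::size_t> active_version{0};

// Claim a version id via atomic fetch-and-add. Flushes call this at
// scheduling time; maintenance reconstructions call it only at install.
std::size_t claim_version_id() {
    return next_version.fetch_add(1);
}

// Installation: wait until our immediate predecessor is active, then
// become the active version. Versions thus activate in id order.
void install(std::size_t my_id) {
    while (active_version.load(std::memory_order_acquire) != my_id - 1) {
        // busy-wait for illustration; a real implementation would yield
    }
    active_version.store(my_id, std::memory_order_release);
}
```

Deferring `claim_version_id()` for maintenance reconstructions is what keeps a slow reconstruction from ever holding an id smaller than a pending flush's id, which would otherwise block that flush's installation.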
All versions are reference counted using \texttt{std::shared\_ptr}, and so a version will be automatically deleted once all threads holding a reference to it have released that reference; no special memory management is necessary during version installation. \begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/dropped-shard.pdf} \caption{\textbf{Shard Reconciliation Problem.} Because maintenance reconstructions don't obtain their version number until after they have completed, it is possible for the internal structure of the dynamization to change between when the reconstruction is scheduled, and when it completes. In this example, a maintenance reconstruction is scheduled based on V1 of the structure. Before it can finish, V2 is created as a result of a buffer flush. As a result, the maintenance reconstruction's resulting structure is assigned V3. But, when it is installed, the shard produced by the flush in V2 is lost. It will be necessary to devise a means to prevent this from happening.} \label{fig:tl-dropped-shard} \end{figure} \Paragraph{Maintenance Version Reconciliation.} Waiting until the moment of installation to assign a version number to maintenance reconstructions avoids stalling buffer flushes; however, it introduces additional complexity in the installation process. This is because the active version at the time the reconstruction was scheduled, $\mathcal{V}_a$, may no longer be the active version at the time the reconstruction is installed, $\mathcal{V}_{a^\prime}$. This means that the version of the structure produced by the reconstruction, $\mathcal{V}_r$, will not reflect any updates to the structure that were performed by versions with ids on the interval $(a, a^\prime]$. Figure~\ref{fig:tl-dropped-shard} shows an example of the sort of problem that can arise. One possible approach is to simply merge the versions together, adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but not in $\mathcal{V}_r$ prior to installation.
Unfortunately, this approach is insufficient, as it can lead to three possible problems: \begin{enumerate} \item If shards used in the maintenance reconstruction to produce $\mathcal{V}_r$ were \emph{also} used as part of a different maintenance reconstruction resulting in a version $\mathcal{V}_o$ with $o < r$, then \textbf{records will be duplicated} by the merge. \item If another reconstruction produced a version $\mathcal{V}_o$ with $o < r$, and $\mathcal{V}_o$ added a new shard to the same level that $\mathcal{V}_r$ did, it is possible that the temporal ordering properties of the shards on the level may be violated. Recall that supporting tombstone-based deletes requires that shards be strictly ordered within each level by their age to ensure correct tombstone cancellation (Section~\ref{sssec:dyn-deletes}). \item The shards that were removed from $\mathcal{V}_r$ by the reconstruction will still be present in $\mathcal{V}_{a^\prime}$ and so may be reintroduced into the new version, again leading to duplication of records. It is non-trivial to identify these shards during the merge to skip over them, because the shards don't have a unique identifier other than their pointers, and using the pointers for this check can lead to the ABA problem under the reference-counting-based memory management scheme the framework is built on. \end{enumerate} The first two of these problems result from a simple synchronization problem and can be solved using locking. A maintenance reconstruction operates on some level $\mathscr{L}_i$, merging and then deleting shards from that level and placing the result in $\mathscr{L}_{i+1}$. In order for either of these problems to occur, multiple concurrent reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager can be introduced into the framework to allow reconstructions to lock entire levels.
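Such a level lock manager can be quite simple, since each reconstruction only needs to try-lock its source level and release it when finished. The following sketch uses one atomic flag per level; the class name and interface are illustrative assumptions, not the framework's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// One non-blocking lock per level. A reconstruction that fails to
// acquire its source level's lock simply is not scheduled.
class LevelLockManager {
    // unique_ptr wrappers because std::atomic is neither copyable
    // nor movable, which std::vector storage would otherwise require.
    std::vector<std::unique_ptr<std::atomic<bool>>> locks;

public:
    explicit LevelLockManager(std::size_t levels) {
        for (std::size_t i = 0; i < levels; i++)
            locks.push_back(std::make_unique<std::atomic<bool>>(false));
    }

    // Returns false if another reconstruction already holds the level.
    bool try_lock(std::size_t level) {
        bool expected = false;
        return locks[level]->compare_exchange_strong(expected, true);
    }

    void unlock(std::size_t level) { locks[level]->store(false); }
};
```

Note that only the \emph{source} level is locked; as discussed next, appending a shard to the target level from another reconstruction is harmless.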
A reconstruction can only be scheduled if it is able to acquire the lock on the level that it is using as the \emph{source} for its shards. Note that there is no synchronization problem with a concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard to $\mathscr{L}_i$. This will not violate any ordering properties or result in any duplication of records. Thus, each reconstruction only needs to lock a single level. The final problem is a bit trickier to address, but is fundamentally an implementation detail. Our approach for resolving it is to change the way that maintenance reconstructions produce a version in the first place. Rather than taking a copy of $\mathcal{V}_a$, manipulating it to perform the reconstruction, and then reconciling it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay \emph{all} structural updates to the version until installation time. When a reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken, instead of a copy. Then, any new shards are built based on the contents of $\mathcal{V}_a$, but no updates to the structure are made. Once all of the shard reconstructions are complete, the version installation process begins. The thread running the reconstruction waits for its turn to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To this copy, the newly created shards are added, and any necessary deletes are performed. Because the shards to be deleted are still referenced by, at minimum, the reconstruction thread's reference to $\mathcal{V}_a$, pointer equality can be used to identify them and the ABA problem avoided. Then, once all the updates are complete, the new version can be installed. This process does push a fair amount of work to the moment of install, between when a version id is claimed by the reconstruction thread and when that version id becomes active. During this time, any buffer flushes will be blocked.
However, relative to the work associated with actually performing the reconstructions, the overhead of these metadata operations is fairly minor, and so it doesn't have a significant effect on buffer flush performance. \subsubsection{Mutable Buffer} \begin{figure} \centering \subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}} \subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}} \subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}} \caption{\textbf{Versioning process for the mutable buffer.} A schematic view of the mutable buffer demonstrating the three pointers representing its state, and how they are adjusted as inserts occur. Dark grey slots represent the currently active version, light grey slots the old version, and white slots are available space.} \label{fig:tl-buffer} \end{figure} Next, we'll address concurrent access and versioning of the mutable buffer. In our system, the mutable buffer consists of a large ring buffer with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In order to support versioning, the buffer actually uses two head pointers, one called \texttt{head} and one called \texttt{old head}, along with a single \texttt{tail} pointer. Records are inserted into the buffer by atomically incrementing \texttt{tail} and then placing the record into the slot. For records that cannot be atomically assigned, a visibility bit can be used to ensure that concurrent readers don't access a partially written value. \texttt{tail} can be incremented until it matches \texttt{old head}, or until the current version of the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$ records. 
At this point, any further writes would either clobber records in the old version, or exceed the user-specified buffer capacity, and so any inserts must block until a flush has been completed. Flushes are triggered based on a user-configurable set point, $N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation is scheduled. The location of \texttt{tail} is recorded as part of the flush, but records can continue to be inserted until one of the blocking conditions in the previous paragraph is reached. When the flush has completed, a new shard is created containing the records between \texttt{head} and the value of \texttt{tail} at the time the flush began. The buffer version can then be advanced by setting \texttt{old head} to \texttt{head} and setting \texttt{head} to the recorded value of \texttt{tail}. All of the records associated with the old version are freed, and the records that were just flushed now become part of the old version. The reason for this scheme is to allow threads accessing an older version of the dynamized structure to still see a current view of all of the records. These threads will have a reference to a dynamized structure containing none of the records in the buffer, as well as the old head. Because the older version of the buffer always directly precedes the newer, all of the buffered records are visible to this older version. However, threads accessing the more current version of the buffer will \emph{not} see the records contained between \texttt{old head} and \texttt{head}, as these records will have been flushed into the structure and are visible to the thread there. If this thread could still see records in the older version of the buffer, then it would see these records twice, which is incorrect. One consequence of this technique is that a buffer flush cannot complete until all threads referencing \texttt{old head} have completed.
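The pointer manipulations described above can be sketched as follows. This is an illustrative, effectively single-producer simplification with assumed names and an integer record type; it omits the reference counting of head pointers and the visibility bits required by the real, concurrent buffer.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Two-headed ring buffer sketch: [old_head, head) is the old version,
// [head, tail) is the current version. Capacity is 2 * N_B so a full
// old version and a full current version can coexist.
class VersionedBuffer {
    std::vector<int> slots;
    std::size_t cap;
    std::atomic<std::size_t> tail{0};
    std::size_t head = 0;      // start of the current version
    std::size_t old_head = 0;  // start of the old (flushing) version

public:
    explicit VersionedBuffer(std::size_t n_b)
        : slots(2 * n_b), cap(2 * n_b) {}

    // An insert fails if the current version already holds N_B records,
    // or if the write would clobber records in the old version.
    bool insert(int rec) {
        std::size_t t = tail.load();
        if (t - head >= cap / 2) return false;  // current version full
        if (t - old_head >= cap) return false;  // would clobber old version
        std::size_t slot = tail.fetch_add(1);
        slots[slot % cap] = rec;
        return true;
    }

    // Advance the version after flushing the records in [head,
    // flush_tail): the just-flushed records become the old version.
    void advance_version(std::size_t flushed_tail) {
        old_head = head;
        head = flushed_tail;
    }

    std::size_t current_count() const { return tail.load() - head; }
    std::size_t flush_tail() const { return tail.load(); }
};
```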
To ensure that this is the case, the two head pointers are reference counted, and a flush will stall until all references to \texttt{old head} have been removed. In principle, this problem could be reduced by allowing for more than two heads, but it becomes difficult to atomically transition between versions in that case, and it would also increase the storage requirements for the buffer, which requires $N_B$ space per available version. \subsection{Concurrent Queries} Queries are answered based upon the active version of the structure at the moment the query begins to execute. When the query routine of the dynamization is called, a query is scheduled. Once a thread becomes available, the query will begin to execute. At the start of execution, the query thread takes a reference to $\mathcal{V}_a$ as well as the current \texttt{head} and \texttt{tail} of the buffer. Both $\mathcal{V}_a$ and \texttt{head} are reference counted, and will be retained for the duration of the query. Once the query has finished processing, it will return the result to the user via an \texttt{std::promise} and release its references to the active version and buffer. \subsubsection{Query Preemption} Because our implementation only supports two active head pointers in the mutable buffer, queries can lead to insertion stalls. If a long-running query is holding a reference to \texttt{old head}, then an active buffer flush of the old version will be blocked by this query. If this blocking goes on for sufficiently long, then the buffer may fill up and the system will begin to reject inserts. One possible solution to this problem is to process the \texttt{buffer\_query} first, and then discard the reference to \texttt{old head}, allowing the buffer flush to proceed. However, this would not work for iterative deletion decomposable search problems, which may require re-processing the buffer query arbitrarily many times.
As a result, we instead implement a simple preemption mechanism to cut short long-running queries. The framework keeps track of how long a buffer flush has been stalled by queries maintaining references to \texttt{old head}. Once this stalling passes a user-defined threshold, a preemption flag is set, instructing the queries in question to restart themselves. This is implemented fully within the framework code, requiring no changes to user-provided query code, as the framework query mechanism simply checks this flag in between calls to user code. If a query sees that this flag is set, it will release its references to the \texttt{old head} and structure version, and automatically put itself back in the scheduling queue to be retried against newer versions of the structure and buffer. Note that, if misconfigured, it is possible that this mechanism will entirely prevent certain long-running queries from being answered. If the threshold for preemption is set lower than the expected run-time of a valid query, it's possible that the query will loop forever if the system is experiencing sufficient insertion pressure. To help avoid this, another parameter is available to specify a maximum preemption count, after which a query will ignore a request for preemption. \subsection{Insertion Stall Mechanism} The results of Theorems~\ref{theo:worst-case-optimal} and \ref{theo:par-worst-case-optimal} are based upon enforcing a rate limit upon incoming inserts, to ensure that there is sufficient time for reconstructions to complete. In practice, calculating and precisely stalling for the correct amount of time is quite difficult because of the vagaries of working with a real system.
While ultimately it would be ideal to have a reasonable cost model that can estimate this stall time on the fly based on the cost of building a data structure, the number of records involved in reconstructions, the number of available threads, available memory bandwidth, etc., for the purposes of this prototype we have settled on a simple system that demonstrates the robustness of the technique. Recall the basic insert process within our system. Inserts bypass the scheduling system and communicate directly with the buffer, on the same client thread that called the function, to maximize insertion performance and eliminate as much concurrency control overhead as possible. The insert routine is synchronous, and returns a boolean indicating whether the insert has succeeded or not. The insert can fail if the buffer is full, in which case the user is expected to delay for a moment and retry. Once space has been cleared in the buffer, the insert will succeed. We can leverage this same rejection mechanism as a form of rate limiting, by probabilistically rejecting inserts. To this end, we introduce a configurable parameter indicating the probability that an insert will succeed. For each insert, we use Bernoulli sampling based upon this probability to determine whether the insert should be attempted or not. If the insert is rejected, then the attempt is aborted and the function returns a failure. The user can then delay and attempt again. This approach was selected because it has a few specific advantages. First, it is based on a single parameter that can be readily updated on demand using atomics. Our current prototype uses a single, fixed value for the probability, but ultimately it should be dynamically tuned to approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal} as closely as possible. Second, random sampling to apply stalls ensures fairness.
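The rejection mechanism itself amounts to a Bernoulli trial per insert. The following sketch illustrates the idea; the \texttt{RateLimiter} name and interface are assumptions for illustration, and in the actual system the success probability is a single value that could be stored in an atomic and updated on demand.

```cpp
#include <random>

// Each insert is admitted with probability delta and rejected
// otherwise; a rejected caller is expected to delay and retry.
class RateLimiter {
    std::mt19937 gen;
    std::bernoulli_distribution accept;

public:
    explicit RateLimiter(double delta, unsigned seed = 0)
        : gen(seed), accept(delta) {}

    // Returns false when the insert should be rejected outright.
    bool admit() { return accept(gen); }
};
```

Over many inserts, the admitted fraction converges to $\delta$, so the expected extra delay per insert is the retry delay scaled by $1 - \delta$, which is how a few coarse stalls approximate many impractically small ones.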
Ultimately, there's a limit on how small a stall can be meaningfully applied, particularly if we want to avoid busy waiting. This approach allows us to fairly approximate these small stalls by attaching larger, more practical, delays to randomly sampled inserts. Finally, the approach is simple and requires no significant changes to the user interface, while (as we will see in Section~\ref{sec:tl-eval}) directly exposing the design space associated with the parameter to the user. \section{Evaluation} \label{sec:tl-eval} In this section, we perform several experiments to evaluate the ability of the system proposed in Section~\ref{sec:tl-impl} to control tail latencies. \subsection{Stall Proportion Sweep} \begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 200M Records} \label{fig:tl-stall-200m} \end{figure} First, we will consider the insertion and query performance of our system at a variety of stall proportions. The purpose of this testing is to demonstrate that introducing stalls into the insertion process is able to reduce the insertion tail latency, while matching the general insertion and query performance of a strict tiering policy. Recall that, in the insertion stall case, no explicit shard capacity limits are enforced by the framework. Reconstructions are triggered with each buffer flush on all levels exceeding a specified shard count ($s = 4$ in these tests), and the buffer flushes immediately when full with no regard to the state of the structure. Thus, limiting the insertion rate is the only means the system uses to maintain its shard count at a manageable level.
These tests were run on a system with sufficient available resources to fully parallelize all reconstructions. First, Figure~\ref{fig:tl-stall-200m} shows the results of testing insertion of the 200 million record SOSD \texttt{OSM} dataset into a dynamized ISAM tree, using our insertion stalling technique as well as strict tiering. We inserted $30\%$ of the records, and then measured the individual latency of each insert after that point to produce Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard} was produced by recording the number of shards in the dynamized structure each time the buffer flushed. Note that a stall value of $\delta = 1$ indicates no stalling at all, and values less than one indicate a $1 - \delta$ probability of an insert being rejected. Thus, a lower stall value means more stalls are introduced. The tiering policy is strict tiering with a scale factor of $s=4$. It uses the concurrency control scheme described in Section~\ref{ssec:dyn-concurrency}. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all of the insertion rejection probabilities succeed in greatly reducing tail latency relative to tiering. Additionally, it shows a small amount of available tuning of the worst-case insertion latencies, with higher stall amounts reducing the tail latencies slightly at various points in the distribution. This latter effect results from the buffer flush latency hiding mechanism, which was retained from Chapter~\ref{chap:framework}. The buffer actually has space for two versions, and the second version can be filled while the first is flushing. This means that, for more aggressive stalling, some of the time spent blocking on the buffer flush is redistributed over the inserts into the second version of the buffer, rather than resulting in a stall. Of course, if the query latency is severely affected by the use of this mechanism, it may not be worth using.
Thus, in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of various shard counts within the dynamized structure for each stalling amount, as well as strict tiering. We have elected to examine the shard count, rather than the query latencies, for this purpose because this technique is intended to directly control the number of shards, and we wish to demonstrate that it does so. Of course, shard count control is necessary for the sake of query latencies, and we will consider query latency directly later. This figure shows that, even with no insertion throttle at all, the shard count within the structure remains well behaved and normally distributed, albeit with a slightly longer tail and a higher average value. Once stalls are introduced, though, it is possible both to reduce the tail and to shift the peak of the distribution through a variety of points. In particular, we see that a stall of $0.99$ is sufficient to move the peak very close to tiering, and lower stalls are able to further shift the peak of the distribution to even lower counts. \begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 4B Records} \label{fig:tl-stall-4b} \end{figure} To validate that these results were not simply an artifact of the relatively small size of the data set used, we repeated the exact same testing using a set of four billion uniform integers, and these results are shown in Figure~\ref{fig:tl-stall-4b}.
These results align with those for the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the same improvements in insertion tail latency for all stall amounts, and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard count. If anything, the gap between strict tiering and un-throttled insertion is narrower with the larger data set than the smaller one. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\ \caption{Insertion and Shard Count Distributions for VPTree} \label{fig:tl-stall-knn} \end{figure} Finally, we considered our dynamized VPTree in Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of about one million 300-dimensional vectors. This test shows some of the possible limitations of our fixed rejection rate. The ISAM tree tested above is constructible in roughly linear time, being an MDSP with $B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n, k)}{n}$ used to determine the optimal insertion stall rate is asymptotically a constant. For VPTree, however, the construction cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also generally much larger in absolute time requirements. We can see in Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution is very poorly behaved for smaller stall amounts, with the shard count following a roughly uniform distribution for a stall rate of $1$. This means that the background reconstructions are not capable of keeping up with buffer flushing, and so the number of shards grows significantly over time.
Introducing stalls does shift the distribution closer to normal, but a much larger stall rate is required to obtain a shard count distribution close to that of strict tiering than was the case with the ISAM tree test. It is still possible, though, even with our simple fixed-stall-rate implementation. Additionally, this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce the tail latency substantially compared to strict tiering, with the same latency distribution effects for larger stall rates as were seen in the ISAM examples. Thus, we've shown that introducing even a fixed stall while allowing the internal structure of the dynamization to develop naturally is able to match the shard count distribution of strict tiering, while having significantly lower insertion tail latencies. \subsection{Insertion Stall Trade-off Space} While we have shown that introducing insertion stalls accomplishes the goal of reducing tail latencies while matching the shard count of a strict tiering reconstruction strategy, we've not yet addressed what the actual performance of this structure is. By throttling inserts, we potentially reduce the insertion throughput. And, further, it isn't immediately obvious just how much query performance suffers as the shard count distribution shifts. In this test, we examine the average values of insertion throughput and query latency over a variety of stall rates. The results of this test for ISAM with the SOSD \texttt{OSM} dataset are shown in Figure~\ref{fig:tl-latency-curve}, which shows the insertion throughput plotted against the average query latency for our system at various stall rates, with tiering configured with an equivalent scale factor marked as a red point for reference. This plot shows two interesting features of the insertion stall mechanism. First, it is possible to introduce stalls that do not significantly affect the write throughput, but do improve query latency.
This is seen in the difference between the two points at the far right of the curve, where introducing a slight stall improves query performance at virtually no cost. This represents the region of the curve where the stalling introduces delay that doesn't exceed the cost of a buffer flush, and so the total amount of time the system spends stalling doesn't change much. The second, and perhaps more notable, point that this plot shows is that the stall rate provides a rich design trade-off between query and insert performance. In fact, this space is far more useful than the trade-off space represented by layout policy and scale factor selection using the strict reconstruction schemes that we examined in Chapter~\ref{chap:design-space}. At the upper end of the insertion-optimized region, we see more than double the insertion throughput of tiering (with significantly lower tail latencies as well) at the cost of a slightly more than 2x increase in query latency. Moving down the curve, we see that we are able to roughly match the performance of tiering within this space, and even shift to more query-optimized configurations. \begin{figure} \centering \includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \caption{Insertion Throughput vs. Query Latency for ISAM with 200M Records} \label{fig:tl-latency-curve} \end{figure} This is a very interesting result. Not only is our approach able to match a strict reconstruction policy in terms of average query and insertion performance with better tail latencies, but it is even able to provide a superior set of design trade-offs to the strict policies, at least in environments where sufficient parallel processing and memory are available to leverage parallel reconstructions. \section{Conclusion} In this chapter, we addressed the last of the three major problems of dynamization: tail latency.
We proposed a technique for limiting the rate of insertions to match the rate of reconstruction that is able to match the worst-case optimized approach of Overmars~\cite{overmars81} on a single thread, and to exceed it given multiple parallel threads. We then implemented the necessary mechanisms to support this technique within our framework, including a significantly improved architecture for scheduling and executing parallel and background reconstructions, and a system for rate limiting by rejecting inserts via Bernoulli sampling. We evaluated this system for fixed insertion rejection rates, and found significant improvements in tail latencies, approaching the practical lower bound we established using the equal block method, without requiring significant degradation of query performance. In fact, we found that this rate limiting mechanism provides a design space with more effective trade-offs than the one we examined in Chapter~\ref{chap:design-space}, with the system being able to exceed the query performance of an equivalently configured tiering system for certain rate limiting configurations. The method does have limitations: assigning a fixed rejection rate to inserts works well for linear-time constructible structures like the ISAM tree, but was significantly less effective for the VPTree, which requires $\Theta(n \log n)$ time to construct. For structures like this, it will be necessary to dynamically scale the amount of throttling based on the record count and the size of the reconstruction. Additionally, our current system isn't easily capable of reaching the ``ideal'' goal of being able to reliably trade query performance and insertion latency at a fixed throughput. Nonetheless, the mechanisms for supporting such features are present, and even this simple implementation represents a marked improvement in terms of both insertion tail latency and configurability.