author    Douglas Rumbaugh <dbr4@psu.edu>  2025-05-29 19:36:41 -0400
committer Douglas Rumbaugh <dbr4@psu.edu>  2025-05-29 19:36:41 -0400
commit    228be229a831ad082e8310a6d247f1153fb475b8 (patch)
tree      8ff8ab4ce2363cfa5f11c01ca47485217bf23741 /chapters/tail-latency.tex
parent    3474aa14fdaec66152ab999a1d3c4b0ec8315a3c (diff)
download  dissertation-228be229a831ad082e8310a6d247f1153fb475b8.tar.gz
updates
Diffstat (limited to 'chapters/tail-latency.tex')
-rw-r--r--  chapters/tail-latency.tex | 435
1 file changed, 323 insertions(+), 112 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 14637c6..2638e70 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -11,40 +11,37 @@
\end{figure}
Up to this point in our investigation, we have not directly addressed
-one of the largest problems associated with dynamization: insertion
-tail latency. While these techniques result in structures that have
-reasonable, or even good, insertion throughput, the latency associated
-with each individual insert is wildly variable. To illustrate this
-problem, consider the insertion performance in
-Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies
-of a dynamized ISAM tree with that of its most direct dynamic analog:
-a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput},
-the dynamized structure has comperable average performance to the
-native dynamic structure, the latency distributions are quite
-different. Figure~\ref{fig:tl-btree-isam-lat} shows representations
-of the distributions. While the dynamized structure has much better
+one of the largest problems associated with dynamization: insertion tail
+latency. While our dynamization techniques are capable of producing
+structures with good overall insertion throughput, the latency of
+individual inserts is highly variable. To illustrate this problem,
+consider the insertion performance in Figure~\ref{fig:tl-btree-isam},
+which compares the insertion latencies of a dynamized ISAM tree with
+that of its most direct dynamic analog: a B+Tree. While, as shown
+in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has
+comparable average performance to the native dynamic structure, the
+latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat},
+are quite different. While the dynamized structure has much better
"best-case" performance, the worst-case performance is exceedingly
-poor. That the structure exhibits reasonable performance on average
-is the result of these two ends of the distribution balancing each
-other out.
-
-
-
-This poor worst-case performance is a direct consequence of the strategies
-used by the two structures to support updates. B+Trees use a form of
-amortized local reconstruction, whereas the dynamized ISAM tree uses
-amortized global reconstruction. Because the B+Tree only reconstructs the
-portions of the structure ``local'' to the update, even in the worst case
-only a portion of the data structure will need to be adjusted. However,
-when using global reconstruction based techniques, the worst-case insert
-requires rebuilding either the entirety of the structure (for tiering
-or BSM), or at least a very large proportion of it (for leveling). The
-fact that our dynamization technique uses buffering, and most of the
-shards involved in reconstruction are kept small by the logarithmic
-decomposition technique used to partition it, ensures that the majority
-of inserts are low cost compared to the B+Tree, but at the extreme end
-of the latency distribution, the local reconstruction strategy used by
-the B+Tree results in better worst-case performance.
+poor. That the structure exhibits reasonable performance on average is
+the result of these two ends of the distribution balancing each other out.
+
+This poor worst-case performance is a direct consequence of the different
+approaches to update support used by the dynamized structure and B+Tree.
+B+Trees use a form of amortized local reconstruction, whereas the
+dynamized ISAM tree uses amortized global reconstruction. Because the
+B+Tree only reconstructs the portions of the structure ``local'' to the
+update, even in the worst case only a small part of the data structure
+will need to be adjusted. However, when using global reconstruction
+based techniques, the worst-case insert requires rebuilding either the
+entirety of the structure (for tiering or BSM), or at least a very large
+proportion of it (for leveling). The fact that our dynamization technique
+uses buffering, and most of the shards involved in reconstruction are
+kept small by the logarithmic decomposition technique used to partition
+it, ensures that the majority of inserts are low cost compared to the
+B+Tree, but at the extreme end of the latency distribution, the local
+reconstruction strategy used by the B+Tree results in better worst-case
+performance.
Unfortunately, the design space that we have been considering thus far
is limited in its ability to meaningfully alter the worst-case insertion
@@ -57,41 +54,77 @@ for a small reduction in the worst case, but at the cost of making the
majority of inserts worse because of increased write amplification.
\begin{figure}
-\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
-\subfloat[Buffer Size Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-parm-bs}} \\
+\subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}}
+\subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\
\caption{Design Space Effects on Latency Distribution}
\label{fig:tl-parm-sweep}
\end{figure}
-Additionally, the other tuning nobs that are available to us are
-of limited usefulness in tuning the worst case behavior.
-Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of
-our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf})
-and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There
-is no clear trend in worst-case performance to be seen here. This
-is to be expected, ultimately the worst-case reconstructions in
-both cases are largely the same regardless of scale factor or buffer
-size: a reconstruction involving $\Theta(n)$ records. The selection
-of configuration parameters can influence \emph{when} these
-reconstructions occur, as well as slightly influence their size, but
+The other tuning knobs that are available to us are of limited usefulness
+in tuning the worst-case behavior. Figure~\ref{fig:tl-parm-sweep}
+shows the latency distributions of our framework as we vary
+the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size
+(Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in
+worst-case performance to be seen here. Adjusting the scale factor does
+have an effect on the distribution, but not in a way that is particularly
+useful from a configuration standpoint, and adjusting the mutable buffer
+size has almost no effect on the worst-case latency, or even on the
+distribution as a whole, particularly when tiering is used. This is to
+be expected; ultimately, the worst-case reconstructions are largely the
+same regardless of scale factor or buffer size: a reconstruction
+involving $\Theta(n)$ records.
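+In terms of cost, such a worst-case reconstruction requires
+$\Theta(B(n))$ time, no matter how the scale factor or buffer size
+are chosen.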
+
+The selection of configuration parameters can influence \emph{when}
+these reconstructions occur, as well as slightly influence their size, but
ultimately the question of ``which configuration has the best tail-latency
performance'' is more a question of how many insertions the latency is
-measured over, than any fundamental trade-offs with the design space.
+measured over than of any fundamental trade-offs within the design
+space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}.
+Each of the ``shelves'' in the distribution corresponds to
+reconstructions on a particular level. As can be seen, the lines cross
+each other repeatedly at these shelves. These cross-overs are points at
+which one configuration begins to, temporarily, exhibit better tail
+latency behavior than the other. However, after enough records have been
+inserted for the next largest reconstructions to begin to occur, the
+``better'' configuration begins to appear worse again in terms of tail
+latency.\footnote{
+ This plot also shows a notable difference between leveling
+ and tiering. In the tiering configurations, the transitions
+ between the shelves are steep and abrupt, whereas in leveling,
+    the transitions are smoother, particularly as the scale factor
+ increases. These smoother curves show the write amplification
+ of leveling, where the largest shards are not created ``fully
+ formed'' as they are in tiering, but rather are built over a
+ series of merges. This slower growth results in the smoother
+ transitions. Note also that these curves are convex--which is
+    \emph{bad} on this plot, as it means a higher probability of
+    incurring a high-latency reconstruction.
+}
-Thus, in this chapter, we will look beyond the design space we have
-thus far considered to design a dynamization system that allows for
-tail latency tuning in a meaningful capacity. To accomplish this,
-we will consider a different way of looking at reconstructions within
-dynamized structures.
+It seems apparent that, to resolve the problem of insertion tail latency,
+we will need to look beyond the design space we have thus far considered.
+In this chapter, we do just this, and propose a new mechanism for
+controlling reconstructions that leverages parallelism to provide
+similar amortized insertion and query performance characteristics, but
+also allows for significantly better insertion tail latencies. We will
+demonstrate mathematically that our new technique is capable of matching
+the query performance of the tiering layout policy, describe a practical
+implementation of these ideas, and then evaluate that prototype system
+to demonstrate that the theoretical trade-offs are achievable in practice.
\section{The Insertion-Query Trade-off}
As reconstructions are at the heart of the insertion tail latency problem,
it seems worth taking a moment to consider \emph{why} they must be done
at all. Fundamentally, decomposition-based dynamization techniques trade
-between insertion and query performance by controlling the number of blocks
-in the decomposition. Reconstructions serve to place a bound on the
-number of blocks, to allow for query performance bounds to be enforced.
+between insertion and query performance by controlling the number of
+blocks in the decomposition. Placing a bound on this number is necessary
+to bound the worst-case query cost, and is done using reconstructions
+to either merge the blocks (in the case of the Bentley-Saxe method) or
+re-partition them (in the case of the equal block method). Performing less frequent
+reconstructions reduces the amount of work associated with inserts,
+at the cost of allowing more blocks to accumulate and thereby hurting
+query performance.
+
This trade-off between insertion and query performance by way of block
count is most directly visible in the equal block method described
in Section~\ref{ssec:ebm}. As a reminder, this technique provides the
@@ -102,61 +135,81 @@ I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\end{align*}
where $f(n)$ is the number of blocks.
-Figure~\ref{fig:tl-ebm-trade-off} shows the trade-off between insertion
-and query performance for a dynamized ISAM tree using the equal block
-method, for various numbers of blocks. The trade-off is evident in the
-figure, with a linear relationship between insertion throughput and query
-latency, mediated by the number of blocks in the dynamized structure (the
-block counts are annotated on each point in the plot). As the number of
-blocks is increased, their size is reduced, leading to less expensive
-inserts in terms of both amortized and worst-case cost. However, the
-additional blocks make queries more expensive.
-
+Unlike the design space we have proposed in
+Chapter~\ref{chap:design-space}, the equal block method allows for
+\emph{both} trading off between insert and query performance, \emph{and}
+controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results
+of testing an implementation of a dynamized ISAM tree using the equal block
+method, with
+\begin{equation*}
+f(n) = C
+\end{equation*}
+for varying constant values of $C$. Note that in this test the final
+record count was known in advance, allowing all re-partitioning to be
+avoided. This represents a sort of ``best case scenario'' for the
+technique, and isn't reflective of real-world performance, but does
+serve to demonstrate the relevant properties in the clearest possible
+manner.
+
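+Substituting $f(n) = C$ into the insertion cost above makes the
+trade-off explicit,
+\begin{equation*}
+I(n) \in \Theta\left(\frac{n}{C}\right)
+\end{equation*}
+while the worst-case insert must rebuild a single block of
+$\Theta\left(\frac{n}{C}\right)$ records, at a cost of
+$\Theta\left(B\left(\frac{n}{C}\right)\right)$. Both costs improve as
+$C$ grows, while each query must now examine all $C$ blocks.
+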
+Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block
+method provides a very direct relationship between the tail latency
+and the number of blocks. The worst-case insertion performance is
+dictated by the size of the largest reconstruction, and so increasing
+the block count results in smaller blocks, and better insertion
+performance. These worst-case results also translate directly into
+improved average throughput, at the cost of query latency, as shown in
+Figure~\ref{fig:tl-ebm-tradeoff}. Note that, contrary to our Bentley-Saxe
+inspired dynamization system, the equal block method provides clear and
+direct relationships between insertion and query performance, as well
+as direct control over tail latency, through its design space.
\begin{figure}
\centering
-\includegraphics[width=.75\textwidth]{img/tail-latency/ebm-count-sweep.pdf}
-\caption{The Insert-Query Tradeoff for the Equal Block Method with varying
-number of blocks}
-\label{fig:tl-ebm-trade-off}
+\subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-ebm-tradeoff}}
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\
+
+\caption{The equal block method with varying values of $f(n)$.}
+\label{fig:tl-ebm}
\end{figure}
-While using the equal block method does allow for direct tuning of
-the worst-case insert cost, as well as exposing a very clean trade-off
-space for average query and insert performance, the technique is not
-well suited to our purposes because the amortized insertion performance
-is not particularly good: the insertion throughput is many times worse
-than is possible with our dynamization framework for an equivalent
-query latency.\footnote{
+Unfortunately, the equal block method is not well suited for
+our purposes. Despite having a much cleaner trade-off space, its
+performance is strictly worse than that of our dynamization system. Comparing
+Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff}
+shows that, for a specified query latency, our technique provides
+significantly better insertion throughput.\footnote{
In actuality, the insertion performance of the equal block method is
even \emph{worse} than the numbers presented here. For this particular
benchmark, we implemented the technique knowing the number of records in
advance, and so fixed the size of each block from the start. This avoided
- the need to do any repartitioning as the structure grew, and reduced write
+ the need to do any re-partitioning as the structure grew, and reduced write
amplification.
}
-This is because, in our Bentley-Saxe-based technique, the
-variable size of the blocks allows for the majority of the reconstructions
-to occur with smaller structures, while allowing the majority of the
-records to exist in a single large block at the bottom of the structure.
-This setup enables high insertion throughput while keeping the block
-count small. But, as we've seen, the cost of this is large tail latencies.
-However, we can use the extreme ends of the equal block method's design
-space to consider upper limits on the insertion and query performance
-that we might expect to get out of a dynamized structure.
-
-
-Consider what would happen if we were to modify our dynamization framework
-to avoid all reconstructions. We retain a buffer of size $N_B$, which
-we flush to create a shard when full, however we never touch the shards
-once they are created. This is effectively the equal blocks method,
-where every block is fixed at $N_B$ capacity. Such a technique would
-result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and
-produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting
-in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query
-cost for a decomposable search problem. Applying this technique to an
-ISAM Tree, and compared against a B+Tree, yields the insertion and query
-latency distributions shown in Figure~\ref{fig:tl-floodl0}.
+This is because, in our technique, the variable size of the blocks allows
+for the majority of the reconstructions to occur with smaller structures,
+while allowing the majority of the records to exist in a small number
+of large blocks at the bottom of the structure. This setup enables
+high insertion throughput while keeping the block count small. But,
+as we've seen, the cost of this is large tail latencies, as the large
+blocks must occasionally be involved in reconstructions. However, we
+can use the extreme ends of the equal block method's design space to
+consider upper limits on the insertion and query performance that we
+might expect to get out of a dynamized structure, and then take steps
+within our own framework to approach these limits, while retaining the
+desirable characteristics of the logarithmic decomposition.
+
+At the extreme end, consider what would happen if we were to modify
+our dynamization framework to avoid all reconstructions. We retain a
+buffer of size $N_B$, which we flush to create a shard when full; however
+we never touch the shards once they are created. This is effectively
+the equal block method, where every block is fixed at $N_B$ capacity.
+Such a technique would result in a worst-case insertion cost of $I(n) \in
+\Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in
+total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$
+worst-case query cost for a decomposable search problem. Applying
+this technique to an ISAM Tree, and compared against a B+Tree,
+yields the insertion and query latency distributions shown in
+Figure~\ref{fig:tl-floodl0}.
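+
+To make these bounds concrete for the ISAM tree (treating $N_B$ as a
+constant): each of the $\Theta\left(\frac{n}{N_B}\right)$ shards answers
+a lookup in $\mathscr{Q}_s(N_B) \in \Theta(\log N_B)$ time, and so
+\begin{equation*}
+\mathscr{Q}(n) \in \Theta\left(\frac{n}{N_B} \cdot \log N_B\right) \subseteq O(n)
+\end{equation*}
+meaning that query cost grows linearly in the record count, while the
+worst-case insertion cost, $\Theta(B(N_B))$, is a constant with respect
+to $n$.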
\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
@@ -168,7 +221,7 @@ latency distributions shown in Figure~\ref{fig:tl-floodl0}.
Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain
insertion latency distributions using amortized global reconstruction
-that are directly comperable to dynamic structures based on amortized
+that are directly comparable to dynamic structures based on amortized
local reconstruction, at least in some cases. In particular, the
worst-case insertion tail latency in this model is a direct function
of the buffer size, as the worst-case insert occurs when the buffer
@@ -176,7 +229,23 @@ must be flushed to a shard. However, this performance comes at the
cost of queries, which are incredibly slow compared to B+Trees, as
shown in Figure~\ref{fig:tl-floodl0-query}.
-While this approach is not useful on its own, it does
+Unfortunately, the query latency of this technique is too large for it
+to be useful; it is necessary to perform reconstructions to merge these
+small shards together to ensure good query performance. However, this
+does raise an interesting point. Fundamentally, the reconstructions
+that contribute to tail latency are \emph{not} required from an
+insertion perspective; they are a query optimization. Thus, we could
+remove the reconstructions from the insertion process and perform
+them elsewhere. This could, theoretically, allow us to have our
+cake and eat it too. The only insertion bottleneck would become
+the buffer flushing procedure--as is the case in our hypothetical
+``reconstructionless'' approach. Unfortunately, it is not as simple as
+pulling the reconstructions off of the insertion path and running them in
+the background, as this alone cannot provide us with a meaningful bound
+on the number of blocks in the dynamized structure. But it is still
+possible to provide this bound if we're willing to throttle the
+insertion rate to be slow enough to keep up with the background
+reconstructions. In the next section, we'll discuss a technique based
+on this idea.
\section{Relaxed Reconstruction}
@@ -214,16 +283,25 @@ our dynamization framework, and propose a strategy that achieves the
same worst-case insertion time as the worst-case optimized techniques,
given a few assumptions about available resources.
+First, a comment on nomenclature. We define the term \emph{last level},
+$i = \ell$, to mean the level in the dynamized structure with the largest
+index value (and thereby the most records) and \emph{first level}
+to mean the level with index $i=0$. Any level with $0 < i < \ell$ is
+called an \emph{internal level}. A reconstruction on a level involves the
+compaction of all blocks on that level into one larger block, which is
+then appended to the level below. Relative to some level at index $i$,
+the \emph{next level} is the level at index $i + 1$.
+
At a very high level, our proposed approach is as follows. We will fully
-detach reconstructions from buffer flushes. When the buffer fills, it will
-immediately flush and a new shard will be placed in L0. Reconstructions
-will be performed in the background to maintain the internal structure
-according, roughly, to tiering. When a level contains $s$ shards, a
-reconstruction will immediately be trigger to merge these shards and
-push the result down to the next level. To ensure that the number of
-shards in the structure remains bounded by $\Theta(\log n)$, we will
-throttle the insertion rate so that it is balanced with amount of time
-needed to complete reconstructions.
+detach reconstructions from buffer flushes. When the buffer fills,
+it will immediately flush and a new shard will be placed in the first
+level. Reconstructions will be performed in the background to maintain the
+internal structure according to the tiering policy. When a level contains
+$s$ shards, a reconstruction will immediately be triggered to merge these
+shards and push the result down to the next level. To ensure that the
+number of shards in the structure remains bounded by $\Theta(\log n)$,
+we will throttle the insertion rate so that it is balanced with the
+amount of time needed to complete reconstructions.
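+
+A minimal sketch of this insertion path is shown below (hypothetical
+C++: the types and names here, such as \texttt{Shard} and
+\texttt{merge\_in\_background}, are illustrative rather than our
+framework's actual interfaces, and synchronization between the insert
+path and the background threads is elided),
+\begin{verbatim}
+#include <chrono>
+#include <thread>
+#include <vector>
+
+using Record = long;
+struct Shard { std::vector<Record> data; };   // immutable static structure
+
+struct RelaxedDynamization {
+    std::vector<Record> buffer;               // mutable buffer
+    std::vector<std::vector<Shard>> levels{std::vector<Shard>{}};
+    size_t N_B = 1024, s = 4;                 // buffer size, scale factor
+    std::chrono::nanoseconds delta{100};      // per-insert stall
+
+    void insert(Record r) {
+        std::this_thread::sleep_for(delta);   // throttle inserts to keep
+                                              // pace with reconstructions
+        buffer.push_back(r);
+        if (buffer.size() < N_B) return;
+        levels[0].push_back(Shard{buffer});   // flush immediately; no
+        buffer.clear();                       // reconstruction here
+        for (size_t i = 0; i < levels.size(); i++)
+            if (levels[i].size() >= s)        // full level: merge its
+                merge_in_background(i);       // shards into level i + 1
+    }
+
+    void merge_in_background(size_t i) {
+        if (i + 1 == levels.size()) levels.emplace_back();
+        std::thread([this, i] {               // runs off the insert path
+            Shard merged;
+            for (auto &sh : levels[i])        // compact the level's shards
+                merged.data.insert(merged.data.end(),
+                                   sh.data.begin(), sh.data.end());
+            levels[i + 1].push_back(std::move(merged));
+            levels[i].clear();
+        }).detach();
+    }
+};
+\end{verbatim}
+In the full scheme, $\delta$ would be derived from the costs of the
+outstanding reconstructions, as in the analysis below, rather than being
+a fixed constant.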
\begin{figure}
\caption{Several ``states'' of tiering, leading up to the worst-case
@@ -244,6 +322,7 @@ will allow us to spread the cost of this reconstruction over a number
of inserts without much of the complexity of~\cite{overmars81}. This
leads us to the following result,
\begin{theorem}
+\label{theo:worst-case-optimal}
Given a buffered, dynamized structure utilizing the tiering layout policy,
and at least $2$ parallel threads of execution, it is possible to maintain
a worst-case insertion cost of
@@ -290,7 +369,7 @@ above by $\frac{B(n)}{n}$ time.
Given this, and assuming that the smallest (i.e., most pressing)
reconstruction is prioritized on the background thread, we find that
\begin{equation*}
-I(n) \in \Theta\left(\frac{B(n)}{n} \cdot \log n\right)
+I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
\end{proof}
@@ -303,16 +382,148 @@ the fact that reconstructions under tiering are strictly local to a
single level, we can avoid needing to add any complicated additional
structures to manage partially building shards as new records are added.
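+
+As a concrete instance of this bound: for a shard structure that can
+be built in linear time, $B(n) \in \Theta(n)$, as with an ISAM tree
+constructed by merging sorted shards, this yields
+\begin{equation*}
+I(n) \in O\left(\frac{n}{n} \cdot \log n\right) = O(\log n)
+\end{equation*}
+worst-case insertion cost, matching the worst-case optimized techniques
+discussed earlier.
+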
+\subsection{Reducing Stall with Parallelism}
+
+The result in Theorem~\ref{theo:worst-case-optimal} assumes that there
+are two available threads of parallel execution, which allows for the
+reconstructions to run in parallel with inserts. The amount of necessary
+insertion stall can be significantly reduced, however, if more than two
+threads are available.
+
+The major limitation on Theorem~\ref{theo:worst-case-optimal}'s
+worst-case bound is that stalling only enough to cover the cost of the
+last level reconstruction is insufficient to maintain the bound on the
+shard count. From the moment
+that the last level has filled, and this reconstruction can begin, every
+level within the structure will sustain another $s - 1$ reconstructions
+before it is necessary to have completed the last level reconstruction.
+
+Consider a parallel implementation that, contrary to
+Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover
+the last level reconstruction, and blocks all other reconstructions
+until it has been completed. Such an approach would result in $\delta
+= \frac{B(n)}{n}$ stall and complete the last level reconstruction
+after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$
+shards would accumulate in the first level, ultimately resulting in
+$\Theta(n)$ shards in the structure, rather than the $\Theta(\log
+n)$ bound we are trying to maintain. This is the reason why
+Theorem~\ref{theo:worst-case-optimal} must account for stalls on every
+level, and assumes that the smallest (and therefore most pressing)
+reconstruction is always active on the parallel reconstruction
+thread. This introduces the extra $\log n$ factor into the worst-case
+insertion cost function, because there will at worst be a reconstruction
+running on every level, and each reconstruction will be no larger than
+$\Theta(n)$ records.
+
+In effect, the stall amount must be selected to cover the \emph{sum} of
+the costs of all reconstructions that occur. Another way of deriving this
+bound would be to consider this sum,
+\begin{equation*}
+B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n)
+\end{equation*}
+where the first term is the last level reconstruction cost, and the
+sum term upper-bounds by $B(n)$ the cost of each of the $s-1$
+reconstructions on every internal level. Dropping constants and
+expanding the sum results in,
+\begin{equation*}
+B(n) \cdot \log n
+\end{equation*}
+reconstruction cost to amortize over the $\Theta(n)$ inserts.
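+Doing so yields a per-insert stall of
+\begin{equation*}
+\delta \in O\left(\frac{B(n)}{n} \cdot \log n\right)
+\end{equation*}
+recovering the bound of Theorem~\ref{theo:worst-case-optimal}.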
+
+However, additional parallelism will allow us to reduce this. At the
+upper limit, assume that there are $\log n$ threads available for parallel
+reconstructions. This condition allows us to derive a smaller bound in
+certain cases,
+\begin{theorem}
+\label{theo:par-worst-case-optimal}
+Given a buffered, dynamized structure utilizing the tiering layout policy,
+and at least $\log n$ parallel threads of execution, it is possible to
+maintain a worst-case insertion cost of
+\begin{equation}
+I(n) \in O\left(\frac{B(n)}{n}\right)
+\end{equation}
+for a data structure with $B(n) \in \Omega(n)$.
+\end{theorem}
+\begin{proof}
+Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that
+the last level reconstruction will be of cost $\Theta(B(n))$ and must
+be amortized over $\Theta(n)$ inserts. However, unlike in that case,
+we now have $\log n$ threads of parallelism to work with. Thus, each
+time a reconstruction must be performed on an internal level, it can
+be executed on one of these threads in parallel with all other ongoing
+reconstructions. As there can be at most one reconstruction per level,
+$\log n$ threads are sufficient to run all possible reconstructions at
+any point in time in parallel.
+
+Each reconstruction will require $\Theta(B(N_B \cdot s^{i+1}))$ time to
+complete. Thus, the necessary stall to fully cover a reconstruction on
+level $i$ is this cost, divided by the number of inserts that can occur
+before the reconstruction must be done (i.e., the total capacity of
+the levels above it). This gives,
+\begin{equation*}
+\delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right)
+\end{equation*}
+necessary stall for each level. Noting that $s > 1$, $s \in \Theta(1)$,
+and that the denominator is the sum of a geometric progression, we have
+\begin{align*}
+\delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\
+ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\
+ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right)
+\end{align*}
+For all reconstructions running in parallel, the necessary stall is the
+maximum stall of all the parallel reconstructions,
+\begin{equation*}
+\delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\}
+\end{equation*}
+
+For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at
+least as rapidly as the denominator, meaning that $\delta_\ell$ will
+always be the largest. $N_B \cdot s^{\ell + 1}$ is $\Theta(n)$, so
+we find that,
+
+\begin{equation*}
+I(n) \in O \left(\frac{B(n)}{n}\right)
+\end{equation*}
+is the worst-case insertion cost, while ensuring that all reconstructions
+are done in time to maintain the shard bound given $\log n$ parallel threads.
+\end{proof}
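+
+Returning to the linear-time construction example, $B(n) \in \Theta(n)$,
+this gives
+\begin{equation*}
+I(n) \in O\left(\frac{n}{n}\right) = O(1)
+\end{equation*}
+worst-case insertion cost, compared to the $O(\log n)$ achievable with
+only a single background thread.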
+
+
+
+
+
\section{Implementation}
+The previous section demonstrated that, theoretically, it is possible
+to meaningfully control the tail latency of our dynamization system by
+relaxing the reconstruction processes and throttling the insertion rate,
+rather than blocking, as a means of controlling the shard count within
+the structure. However, there are a number of practical problems to be
+solved before this idea can be used in a real system. In this section,
+we discuss these problems and our approaches to solving them, in order
+to produce a dynamization framework based upon this technique.
+
+
\subsection{Parallel Reconstruction Architecture}
\subsection{Concurrent Queries}
-\subsection{Query Pre-emption}
+\subsubsection{Query Pre-emption}
+
+Because our implementation only supports a finite number of versions of
+the mutable buffer at any point in time, and insertions will stall after
+this finite number is reached, it is possible for queries to introduce
+additional insertion latency. Queries hold a reference to the version of
+the structure they are using, which includes holding on to a buffer head
+pointer. If a query is particularly long-running, or otherwise stalled,
+it is possible that the query will block insertions by holding onto this
+head pointer.
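+
+A minimal sketch of this pinning behavior (hypothetical C++: the types
+\texttt{Version}, \texttt{BufferHead}, and \texttt{QueryContext} are
+illustrative, not our framework's actual interfaces),
+\begin{verbatim}
+#include <memory>
+
+struct BufferHead { size_t head_index; };  // oldest buffer record that a
+                                           // reader may still access
+
+struct Version {
+    std::shared_ptr<BufferHead> head;      // while referenced, buffer
+                                           // records behind it cannot be
+                                           // reclaimed
+};
+
+// A query captures the current version on entry and releases it only
+// on completion; a long-running query therefore keeps its buffer head
+// pinned, and once every available version is pinned, inserts stall.
+struct QueryContext {
+    std::shared_ptr<Version> version;
+    explicit QueryContext(std::shared_ptr<Version> v)
+        : version(std::move(v)) {}
+};
+\end{verbatim}
+This motivates pre-empting sufficiently long-running queries, so that
+stale versions can be reclaimed and the insertion path unblocked.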
+
+
\subsection{Insertion Stall Mechanism}
\section{Evaluation}
+\subsection{}
+
+
\section{Conclusion}