\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. The dynamized structure has much better best-case performance, but its worst-case performance is exceedingly poor. This poor worst-case performance is a direct consequence of the different approaches used by the dynamized structure and the B+tree to support updates. B+trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted. However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). Because our dynamization technique uses buffering, and because most of the shards involved in reconstruction are kept small by the logarithmic decomposition used to partition the data, the majority of inserts are low cost compared to the B+tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+tree results in significantly better worst-case performance. Unfortunately, the design space that we have been considering is limited in its ability to meaningfully alter the worst-case insertion performance. Leveling requires only a fraction of, rather than all of, the records in the structure to participate in its worst-case reconstruction, and as a result shows slightly reduced worst-case insertion cost compared to the other layout policies. However, this effect only results in a single-digit factor reduction in measured worst-case latency, and has no effect on the insertion latency distribution outside of the absolute maximum value. Additionally, we've shown that leveling performs significantly worse in average insertion performance than tiering, and so its usefulness as a tool to reduce insertion tail latencies is questionable at best.
\begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} Our existing framework supports two other tuning parameters: the scale factor and the buffer size. However, neither of these parameters is of much use in adjusting the worst-case insertion behavior. We demonstrate this experimentally in Figure~\ref{fig:tl-parm-sweep}, which shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the buffer size has almost no effect on either the distribution itself or the worst-case insertion latency, particularly when tiering is used. This is to be expected; ultimately the worst-case reconstruction size is largely the same irrespective of scale factor or buffer size: $\Theta(n)$ records. The selection of configuration parameters does influence \emph{when} a worst-case reconstruction occurs, and can slightly affect its size. Ultimately, however, the answer to the question of which configuration has the best insertion tail latency performance is more a matter of how many records the insertion latencies are measured over than it is one of any fundamental design trade-offs within the space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins to exhibit better tail latency behavior than another. However, after enough records have been inserted, the next largest reconstructions will begin to occur. This will make the ``better'' configuration appear worse in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases. These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex--which is \emph{bad} on this plot, as it means a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just this, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics, while also allowing for significantly better insertion tail latencies.
We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice. \section{The Insertion-Query Trade-off} \label{sec:tl-insert-query-tradeoff} Reconstructions lie at the heart of the insertion tail latency problem, and so it seems worth taking a moment to consider \emph{why} they occur at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and this bound is enforced using reconstructions to either merge (in the case of the Bentley-Saxe method) or re-partition (in the case of the equal block method) the blocks. Performing less frequent (or smaller) reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. We will consider a variant of the equal block method here for which $f(n) \in \Theta(1)$, resulting in a dynamization that does not perform any re-partitioning. In this case, the technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}_S\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-latency-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with $f(n) = C$ for varying values of $C$.} \label{fig:tl-ebm} \end{figure} Unlike the design space we proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance, \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. As noted above, this special case of the equal block method allows re-partitioning costs to be avoided entirely, resulting in a very clean trade-off space. This result isn't necessarily demonstrative of the real-world performance of the equal block method, but it does serve to demonstrate the relevant properties of the method in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction. Increasing the block count reduces the size of each block, and so improves the insertion performance. Figure~\ref{fig:tl-ebm-tradeoff} shows that these improvements to tail latency translate directly into an improvement of the overall insertion throughput as well, at the cost of worse query latencies.
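To see the shape of this trade-off concretely, consider instantiating these bounds with illustrative costs (chosen for exposition, not drawn from the experiments above): a shard with linear build cost and logarithmic query cost, such as an ISAM tree constructed from sorted input. With $f(n) = C$ blocks of $\Theta\left(\frac{n}{C}\right)$ records each, the bounds become \begin{align*} I(n) &\in \Theta\left(\frac{n}{C}\right) & \mathscr{Q}(n) &\in \Theta\left(C \cdot \log \frac{n}{C}\right) \end{align*} Doubling $C$ halves the size of the largest reconstruction, and thus the worst-case insert, while roughly doubling the query cost. The block count is a single knob that moves both behaviors in opposite directions.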
Contrary to our Bentley-Saxe inspired dynamization system, this formulation of the equal block method provides direct control over insertion tail latency, as well as a much cleaner relationship between average insertion and query performance. While these results are promising, the equal block method is not well suited for our purposes. Despite its clean trade-off space and control over insertion tail latencies, this technique is strictly worse than our existing dynamization system in every other respect. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, a logarithmic decomposition provides significantly better insertion throughput. Our technique uses geometric growth of block sizes, which ensures that most reconstructions are smaller than those in the equal block method for an equivalent number of blocks. This comes at the cost of needing to occasionally perform large reconstructions to compact these smaller blocks, resulting in the poor tail latencies we are attempting to resolve. Thus, it seems as though poor tail latency is concomitant with good average performance. Despite this, let's consider an approach to reconstruction within our framework that optimizes for insertion tail latency exclusively using the equal block method, neglecting any considerations of maintaining a shard count bound. We can consider a variant of the equal block method having $f(n) = \frac{n}{N_B}$. This case, like the $f(n) = C$ approach considered above, avoids all re-partitioning, because records are flushed into the dynamization in sets of $N_B$ records, and so each new block is always exactly full on creation. In effect, this technique has no reconstructions at all. Each buffer flush simply creates a new block that is added to an ever-growing list. This produces a system with worst-case insertion and query costs of, \begin{align*} I(n) &\in \Theta(B(N_B)) \\ \mathscr{Q}(n) &\in O\left(\frac{n}{N_B}\cdot \mathscr{Q}_S(N_B)\right) \end{align*} where the worst-case insertion is simply the cost of a buffer flush, and the worst-case query cost follows from the fact that there will be $\Theta\left(\frac{n}{N_B}\right)$ shards in the dynamization, each of which will have exactly $N_B$ records.\footnote{ We are neglecting the cost of querying the buffer in this cost function for simplicity. } Applying this technique to an ISAM tree and comparing it against a B+tree yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction. However, this performance comes at the cost of queries, which are incredibly slow compared to those of the B+tree, as shown in Figure~\ref{fig:tl-floodl0-query}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a Reconstructionless Dynamization} \label{fig:tl-floodl0} \end{figure} On its own, this technique exhibits too large a degradation of query latency for it to be useful in any scenario involving a need for queries.
However, it does demonstrate that, in scenarios where insertion doesn't need to block on reconstructions, it is possible to obtain significantly improved insertion tail latency distributions. Unfortunately, it also shows that using reconstructions to enforce structural invariants controlling the number of blocks is critical for query performance. In the next section, we will consider approaches to reconstruction that allow us to maintain the structural invariants of our dynamization while avoiding direct blocking of inserts, in an attempt to reduce the worst-case insertion cost in line with the approach we have just discussed, while maintaining worst-case query bounds similar to those of our existing dynamization system. \section{Relaxed Reconstruction} Reconstructions are necessary to maintain the structural invariants of our dynamization, which are themselves required to maintain bounds on worst-case query performance. Inserts are what cause the structure to violate these invariants, and so it makes sense to attach reconstructions to the insertion process to allow strict maintenance of these invariants. However, it is possible to take a more relaxed approach to maintaining these invariants using concurrency, allowing the same shard bound to be enforced at a much lower worst-case insertion cost. There does exist theoretical work in this area, which we've already discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this technique is to relax the strict binary decomposition of the Bentley-Saxe method to allow multiple reconstructions to occur at once, and to add a buffer to each level to contain a partially built structure. Then, reconstructions are split up into small batches of operations, which are attached to each insert that is issued up to the moment when the reconstruction must be complete. By doing this, the work of the reconstructions is spread out across many inserts, in effect removing the need to block for large reconstructions. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be far more uniform~\cite{overmars81}. Unfortunately, this technique has a number of limitations, which we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. It effectively reduces to manually multiplexing a single thread to perform a highly controlled, concurrent, reconstruction process. This requires the ability to evenly divide up the work of building a data structure and somehow attach these operations to individual inserts. This makes it ill-suited for our general framework, because, even when the construction can be split apart into small independent chunks, implementing it requires a significant amount of manual adjustment to the data structure construction processes. In this section, we will propose an alternative approach for implementing a similar idea using multi-threading and prove that we can achieve, in principle, the same worst-case insertion and query costs in a far more general and easily implementable manner. We'll then show how we can further leverage parallelism on top of our approach to obtain \emph{better} worst-case bounds, assuming sufficient resources are available. \Paragraph{Layout Policies.} One important aspect of the selection of layout policy that has not been considered up to now, but will soon become very relevant, is the degree of reconstruction concurrency afforded by each policy.
Because different layout policies perform reconstructions differently, there are significant differences in the number of reconstructions that can be performed concurrently in each one. Note that in previous chapters, we used the term \emph{reconstruction} broadly to refer to all operations performed on the dynamization to maintain its structural invariants as the result of a single insert. Here, we instead use the term to refer to a single call to \texttt{build}. \begin{itemize} \item \textbf{Leveling.} \\ Our leveling layout policy performs a single \texttt{build} operation involving shards from at most two levels, as well as flushing the buffer. Thus, at best, there can be two concurrent operations: the \texttt{build} and the flush. If we were to proactively perform reconstructions, each \texttt{build} would require shards from two levels, and so the maximum number of concurrent reconstructions is half the number of levels, plus the flush. \item \textbf{Tiering.} \\ In our tiering policy, it may be necessary to perform one \texttt{build} operation per level. Each of these reconstructions involves only shards from that level. As a result, at most one reconstruction per level (as well as the flush) can proceed concurrently. \item \textbf{BSM.} \\ The Bentley-Saxe method is highly eager, and merges all relevant shards, plus the buffer, in a single call to \texttt{build}. As a result, no concurrency is possible. \end{itemize} We will be restricting ourselves in this chapter to the tiering layout policy. Tiering provides the most opportunities for concurrency and (assuming sufficient resources) parallelism. Because a given reconstruction only requires shards from a single level, using tiering also makes synchronization significantly easier, and it provides us with the largest window in which to preemptively schedule reconstructions. Most of our discussion in this chapter could also be applied to leveling, albeit with worse results. However, BSM \emph{cannot} be used at all. \subsection{Concurrent Reconstructions} Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush, and a new block will be placed in the first level. Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number of blocks in the structure remains bounded by $\Theta(\log_s n)$, we will throttle the insertion rate by adding a stall time, $\delta$, to each insert. $\delta$ will be determined such that it is sufficiently large to ensure that any scheduled reconstructions have enough time to complete before the shard count on any level exceeds $s$. This process is summarized in Algorithm~\ref{alg:tl-relaxed-recon}.
\begin{algorithm} \caption{Relaxed Reconstruction Algorithm with Insertion Stalling} \label{alg:tl-relaxed-recon} \KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount} \Comment{Stall insertion process by specified amount} sleep($\delta$) \; \BlankLine \Comment{Append to the buffer if possible} \If {$|\mathcal{B}| < N_B$} { $\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; \Return \; } \BlankLine \Comment{Schedule any necessary reconstructions on background threads} \For {$\mathscr{L} \in \mathscr{I}$} { \If {$|\mathscr{L}| = s$} { $\text{schedule\_reconstruction}(\mathscr{L})$ \; } } \BlankLine \Comment{Perform the flush} $\mathscr{L}_0 \gets \mathscr{L}_0 \cup \{\text{build}(\mathcal{B})\}$ \; $\mathcal{B} \gets \emptyset$ \; \BlankLine \Comment{Append to the now empty buffer} $\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; \Return \; \end{algorithm} \begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/last-level-recon.pdf} \caption{\textbf{Worst-case Reconstruction.} Using the tiering layout policy, the worst-case reconstruction occurs when every level in the structure has been filled (middle portion of the figure) and a reconstruction must be performed on each level to merge it into a single shard and place it on the level below, leaving the structure with one shard per level after the records from the buffer have been added to L0 (right portion of the figure). The cost of this reconstruction is dominated by the cost of performing a reconstruction on the last level. The last-level reconstruction, however, can be performed well in advance, as it only requires the blocks on the last level, which fills $\Theta(n)$ inserts before the worst-case reconstruction is triggered (left portion of the figure). This provides us with the opportunity to initiate this reconstruction early. \emph{Note: the block sizes in this diagram are not scaled to their record counts--each level has an increasing number of records per block.}} \label{fig:tl-tiering} \end{figure} To ensure the correctness of this algorithm, it is necessary to show that there exists a value for $\delta$ that ensures that the structural invariants can be maintained. Logically, this $\delta$ can be thought of as the amount of time needed to perform the active reconstruction operation, amortized over the inserts between when this reconstruction can be scheduled and when it needs to be complete. We'll consider how to establish this value next. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last-level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, and it will allow us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}.
This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a dynamized structure utilizing the reconstruction policy described in Algorithm~\ref{alg:tl-relaxed-recon}, a single execution unit, and multiple threads of execution that can be scheduled on that unit at will with preemption, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} while maintaining a bound of $s$ shards per level, and $\log_s n$ levels. \end{theorem} \begin{proof} Under Algorithm~\ref{alg:tl-relaxed-recon}, the worst-case reconstruction operation consists of the creation of a new block from all of the existing blocks in the last level. This reconstruction will be initiated when the last level is full, at which point there will be another $\Theta(n)$ inserts before the level above it also fills, and a new shard must be added to the last level. The reconstruction must be completed by this point to ensure that no more than $s$ shards exist on the last level. Assume that all inserts run on a single thread that can be scheduled alongside the reconstructions, and let each insert have a cost of \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where $1$ is the cost of appending to the buffer, and $\delta$ is a calculated stall time. During the stalling, the insert thread will be idle and reconstructions can be run on the execution unit. To ensure the last-level reconstruction is complete by the time that $\Theta(n)$ inserts have finished, it is necessary that $\delta \in \Theta\left(\frac{B(n)}{n}\right)$. However, this amount of stall is insufficient to maintain exactly $s$ shards on each level of the dynamization. At the point at which the last-level reconstruction is initiated, there will be exactly $1$ shard on every other level (see Figure~\ref{fig:tl-tiering}). However, between this initiation and the time at which the last-level reconstruction must be complete to maintain the shard count bound, each other level must also undergo $s - 1$ reconstructions to maintain its own bound. Because we have only a single execution unit, it is necessary to account for the time to complete these reconstructions as well. In the worst case, there will be one active reconstruction on each of the $\log_s n$ levels, and thus we must introduce stalls such that, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so the stall required for each can be bounded above by $O\left(\frac{B(n)}{n}\right)$. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the execution unit, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in worst-case insertion and query latency bounds equivalent to those of~\cite{overmars81}, but manages to resolve the issues cited above. By leveraging multiple threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification of the user's data structure code to function. The fine-grained control over the active thread necessary to realize this bound can be obtained using userspace interrupts~\cite{userspace-preempt}, allowing the approach to be implemented without significant modifications to reconstruction procedures, in contrast to the existing worst-case optimal technique.
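To make this bound concrete, consider two illustrative build costs (chosen for exposition, and not tied to any particular shard implementation from our evaluation). For a structure with $B(n) \in \Theta(n)$, such as one built by merging sorted runs, Theorem~\ref{theo:worst-case-optimal} yields a per-insert stall of $\delta \in \Theta(1)$ and a worst-case insertion cost of $I(n) \in O(\log n)$, matching the \emph{amortized} insertion cost of the blocking approach, but now as a worst-case guarantee. For $B(n) \in \Theta(n \log n)$, the same argument gives $I(n) \in O(\log^2 n)$. In both cases, the occasional $\Theta(B(n))$ pause is replaced by a small, bounded delay on every insert.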
\subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there is only a single available execution unit. This requires that the insertion stall amount be large enough to cover all of the reconstructions necessary at any moment in time. If we have access to parallel execution units, though, we can significantly reduce the amount of stall time required. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that covering only the cost of the last-level reconstruction is insufficient to maintain the bound on the block count. From the moment that the last level has filled, and this reconstruction can begin, every level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last-level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. To see why this is important, consider an implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last-level reconstruction. All other reconstructions are blocked until the last-level one has been completed. This approach would result in a stall of $\delta = \frac{B(n)}{n}$ and complete the last-level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta\left(\frac{n}{N_B}\right)$ blocks would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ blocks in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because there will at worst be a reconstruction running on every level, and each reconstruction will involve at most $O(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last-level reconstruction cost, and the sum term bounds the cost of the $s-1$ reconstructions on each internal level above by the last-level cost. Dropping constants and expanding the sum results in, \begin{equation*} B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism will allow us to reduce this. At the upper limit, assume that there are $\log n$ execution units available for parallel reconstructions. We'll adopt the bulk-synchronous parallel (BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In this model, computation is broken up into multiple parallel threads of computation, which are executed independently for a period of time. Intermittently, a synchronization barrier is introduced, at which point each of the parallel threads is blocked and synchronization of global state occurs. This period of independent execution between barriers is called a \emph{super-step}.
In this model, the parallel execution cost of a super-step is given by the cost of the longest-running computation, the cost of communication between the parallel threads, and the cost of the synchronization, \begin{equation*} \max\left\{w_i\right\} + hg + l \end{equation*} where $w_i$ is the cost of the $i$th computation, $hg$ describes the communication cost, and $l$ is the cost of barrier synchronization. The cost for the entire BSP computation is the sum over all the individual super-steps, \begin{equation*} W + Hg + Sl \end{equation*} where $W$ and $H$ are the total computation and communication costs across all super-steps, and $S$ is the number of super-steps in the computation. We'll model the worst-case reconstruction within the BSP model in the following way. Each individual reconstruction will be considered a parallel operation, hence $w_i = B(N_B \cdot s^{i+1})$, where $i$ is the level number. Because we are operating on a single machine, we can assume that the communication cost is constant, $Hg \in \Theta(1)$. During the synchronization barrier, any pending structural updates can be applied to the dynamization (i.e., blocks added to or removed from levels). Assuming that tiering is used, this can be done in $l \in \Theta(1)$ time (for details on how this is done, see Section~\ref{sssec:tl-versioning}). Given this model, it is possible to derive the following new worst-case bound, \begin{theorem} \label{theo:par-worst-case-optimal} Given a dynamized structure utilizing the reconstruction policy described in Algorithm~\ref{alg:tl-relaxed-recon}, and at least $\log n$ execution units in the BSP model, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last-level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ execution units to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these units in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. To fit this into the BSP model, we need to establish the necessary frequency of synchronization. At minimum, we will need to perform one synchronized update of the internal structure of the dynamization per buffer flush, which will occur $\Theta\left(\frac{n}{N_B}\right)$ times over the $\Theta(n)$ inserts. Thus, we will take $S = \frac{n}{N_B}$ as the number of super-steps, with one super-step per flush. At the end of each super-step, any pending structural updates from any active reconstruction can be applied. Within a BSP super-step, the total operational cost is equal to the sum of the communication cost (which is $\Theta(1)$ here), the synchronization cost (also $\Theta(1)$), and the cost of the longest-running computation. The computations themselves are reconstructions, each with a cost of $B(N_B \cdot s^{i+1})$. They will, generally, exceed the duration of a single super-step, and so their cost will be spread evenly over each super-step that elapses from their initiation to their conclusion.
As a result, in any given super-step, there will be two possible states that a given active reconstruction on level $i$ can be in, \begin{enumerate} \item The reconstruction completes during this super-step, in which case $w_i$ is whatever work remains for it, which is strictly less than its per-super-step work allocation. \item The reconstruction is incomplete at the end of this super-step, in which case $w_i$ is exactly its per-super-step work allocation. \end{enumerate} It is our intention to determine the time required to complete the full reconstruction, the largest computation of which will have cost $B(n)$. If we can show that the insertion rate necessary to ensure that this computation is completed over $\Theta(n)$ inserts is also enough to ensure that all the smaller reconstructions are completed in time as well, then we can use the last-level reconstruction cost as the total cost of the computational component of the BSP cost calculation. To accomplish this, consider the stall necessary to fully cover a reconstruction on level $i$. This is the cost of the reconstruction over level $i$, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the index above this point). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(s-1)\, B(N_B\cdot s^{i+1})}{N_B\cdot (s^{i+1} - s)} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that the last level's stall requirement, $\delta_\ell$, will always be the largest. Thus, the stall necessary to cover the last-level reconstruction will be at least as much as is necessary for the internal reconstructions. Given this, we will consider only the cost $B(n)$ of the last-level reconstruction in our BSP calculation. This cost will be spread evenly over the $S=\frac{n}{N_B}$ super-steps, so the cost assigned to each super-step is $\frac{B(n)}{S}$. This results in the following overall operational cost over all super-steps, \begin{align*} &S \cdot \frac{B(n)}{S} + S \cdot \Theta(1) + S \cdot \Theta(1) \\ &\in \Theta\left(B(n) + \frac{n}{N_B}\right) \end{align*} Given that $B(n) \in \Omega(n)$, we can absorb the communication and synchronization cost terms to get a total operational cost of $O(B(n))$, which we must evenly distribute over the $\Theta(n)$ inserts. Thus, the worst-case insertion cost is, \begin{equation*} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation*} While the BSP model assumes unlimited parallel execution, we can further restrict this by noting that, at any moment in time, there can be at most one reconstruction occurring on each level of the structure, and therefore $\log_s n$ threads are sufficient to obtain the above result. \end{proof} It's worth noting that the ability to parallelize the reconstructions is afforded to us by our technique, and is not possible in Overmars's classical formulation, which is inherently single-threaded in nature. \section{Implementation} \label{sec:tl-impl} The previous section demonstrated that it is possible to meaningfully control the worst-case insertion cost (and, therefore, the insertion tail latency) of our dynamization system, at least in theory.
This can be done by relaxing the reconstruction processes and throttling the insertion rate as a means of controlling the shard count within the structure, rather than blocking the insertion thread during reconstructions. However, there are a number of practical problems to be solved before this idea can be used in a real system. In this section, we discuss these problems and our approaches to solving them, in order to produce a dynamization framework based upon this technique. Note that this system is based on the same high-level architecture as we described in Section~\ref{ssec:dyn-concurrency}. To avoid redundancy, we will focus on how this system differs, without fully recapitulating the content of that earlier section. \subsection{Parallel Reconstruction Architecture} The existing concurrency implementation described in Section~\ref{ssec:dyn-concurrency} is insufficient for the purposes of constructing a framework supporting the parallel reconstruction scheme described in the previous section. In particular, it is limited to only two active versions of the structure at a time, with one ongoing reconstruction. Additionally, it does not consider buffer flushes as distinct events from reconstructions. In order to support the result of Theorem~\ref{theo:par-worst-case-optimal}, it will be necessary to have a concurrency control system that considers reconstructions on each level independently, allows for one reconstruction per level without any synchronization, and allows each reconstruction to apply its results to the active structure in $\Theta(1)$ time, in any order, without violating any structural invariants. To accomplish this, we will use a multi-versioning control scheme that is similar to the simple scheme in Section~\ref{ssec:dyn-concurrency}. Each \emph{version} will consist of three pieces of information: a buffer head pointer, a buffer tail pointer, and a collection of levels and shards. However, the process of managing, creating, and installing versions will be more complex, to allow more than two versions to exist at the same time under certain circumstances and to support the necessary features mentioned above. \subsubsection{Structure Versioning} \label{sssec:tl-versioning} The internal structure of the dynamization consists of a sequence of levels containing immutable shards, as well as a snapshot of the state of the mutable buffer. This section pertains specifically to the internal structure; the mutable buffer handles its own versioning separately and will be discussed in the next section. \begin{figure} \centering \subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}} \subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}} \caption{\textbf{Structure Version Transitions.} The dynamized structure can transition to a new version via two operations: flushing the buffer into the first level, or performing a maintenance reconstruction to merge shards on some level and append the result onto the next one. In each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s light grey shards, with the dark grey shards being newly created and the white shards being deleted. The buffer flush operation in Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer and places it in \texttt{L0} to create \texttt{V2}.
The maintenance reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex, creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s \texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}. } \label{fig:tl-flush-maint} \end{figure} The internal structure of the dynamized data structure (ignoring the buffer) can be thought of as a list of immutable levels, $\mathcal{V} = \{\mathscr{L}_0, \ldots \mathscr{L}_h\}$, where each level contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots \mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of as functions, which accept a version as input and produce a new version as output. Namely, \begin{align*} \mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\ \mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j) \end{align*} where the subscript represents the \texttt{version\_id} and is a strictly increasing number assigned to each version. The $\mathbftt{flush}$ operation builds a new shard using records from the buffer, $\mathcal{B}$, and creates a new version identical to $\mathcal{V}_i$, except with the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs a maintenance reconstruction by building a new shard using all of the shards in level $\mathscr{L}_x$ and creating a new version identical to $\mathcal{V}_i$, except that the new shard is appended to level $\mathscr{L}_j$ and the shards of $\mathscr{L}_x$ are removed. These two operations are shown in Figure~\ref{fig:tl-flush-maint}. At any point in time, the framework will have \emph{one} active version, $\mathcal{V}_a$, as well as a maximum unassigned version number, $v_m > a$. New version ids are obtained by performing an atomic fetch-and-add on $v_m$, and versions will become active in the exact order of their assigned version numbers. We use the term \emph{installing} a version, $\mathcal{V}_x$, to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$. \Paragraph{Version Number Assignment.} It is the intention of this framework to prioritize buffer flushes, meaning that the versions resulting from a buffer flush should become active as rapidly as possible. It is undesirable to have some version, $\mathcal{V}_f$, resulting from a buffer flush, attempting to install while there is a version $\mathcal{V}_r$ associated with an in-process maintenance reconstruction such that $a < r < f$. In this case, the flush must wait for the maintenance reconstruction to finalize before it can itself be installed. To avoid this problem, we assign version numbers differently based upon whether the new version is created by a flush or a maintenance reconstruction. \begin{itemize} \item \textbf{Flush.} When a buffer flush is scheduled, it is immediately assigned the next available version number at the time of scheduling. \item \textbf{Maintenance Reconstruction.} Maintenance reconstructions are \emph{not} assigned a version number immediately. Instead, they are assigned a version number \emph{after} all of the reconstruction work is performed, during their installation process. \end{itemize} \Paragraph{Version Installation.} Once a given flush or maintenance reconstruction has completed and has been assigned a version number, $i$, the version will attempt to install itself. The thread running the operation will wait until $a = i - 1$, and then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an atomic pointer assignment.
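A minimal C++ sketch of this installation protocol follows. The type and helper names here are illustrative rather than the framework's actual interface, and error handling, backoff, and the contents of the structure itself are elided. It assumes C++20's atomic \texttt{std::shared\_ptr} specialization for the active-version pointer. \begin{verbatim}
#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>

struct Version;  // levels, shards, and buffer pointers (elided)

std::atomic<std::uint64_t> next_version{1};    // v_m: next unassigned id
std::atomic<std::uint64_t> active_id{0};       // a: id of the active version
std::atomic<std::shared_ptr<Version>> active;  // V_a (C++20 atomic shared_ptr)

// Claim a version id: done eagerly by flushes, and only at install
// time by maintenance reconstructions.
std::uint64_t claim_version_id() {
    return next_version.fetch_add(1);  // atomic fetch-and-add on v_m
}

// Install version v with id i; versions become active in strict id order.
void install(std::shared_ptr<Version> v, std::uint64_t i) {
    while (active_id.load() != i - 1)  // wait for our turn (a = i - 1)
        std::this_thread::yield();
    active.store(std::move(v));        // atomic pointer assignment
    active_id.store(i);                // publish the new active id
}
\end{verbatim}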
All versions are reference counted using \texttt{std::shared\_ptr}, and so will be automatically deleted once all threads holding a reference to the version have released it, so no special memory management is necessary during version installation. \begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/dropped-shard.pdf} \caption{\textbf{Shard Reconciliation Problem.} Because maintenance reconstructions don't obtain their version number until after they have completed, it is possible for the internal structure of the dynamization to change between when the reconstruction is scheduled and when it completes. In this example, a maintenance reconstruction is scheduled based on V1 of the structure. Before it can finish, V2 is created as a result of a buffer flush. As a result, the maintenance reconstruction's resulting structure is assigned V3. But, when it is installed, the shard produced by the flush in V2 is lost. It will be necessary to devise a means to prevent this from happening.} \label{fig:tl-dropped-shard} \end{figure} \Paragraph{Maintenance Version Reconciliation.} Waiting until the moment of installation to assign a version number to maintenance reconstructions avoids stalling buffer flushes; however, it introduces additional complexity into the installation process. This is because the active version at the time the reconstruction was scheduled, $\mathcal{V}_a$, may not still be the active version at the time the reconstruction is installed, $\mathcal{V}_{a^\prime}$. This means that the version of the structure produced by the reconstruction, $\mathcal{V}_r$, will not reflect any updates to the structure that were performed by versions with ids on the interval $(a, a^\prime]$. Figure~\ref{fig:tl-dropped-shard} shows an example of the sort of problem that can arise. One possible approach is to simply merge the versions together, adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but not in $\mathcal{V}_r$ prior to installation. Unfortunately, this approach is insufficient, because it can lead to three possible problems, \begin{enumerate} \item If shards used in the maintenance reconstruction to produce $\mathcal{V}_r$ were \emph{also} used as part of a different maintenance reconstruction resulting in a version $\mathcal{V}_o$ with $o < r$, then \textbf{records will be duplicated} by the merge. \item If another reconstruction produced a version $\mathcal{V}_o$ with $o < r$, and $\mathcal{V}_o$ added a new shard to the same level that $\mathcal{V}_r$ did, it is possible that the temporal ordering properties of the shards on the level may be violated. Recall that supporting tombstone-based deletes requires that shards be strictly ordered within each level by their age to ensure correct tombstone cancellation (Section~\ref{sssec:dyn-deletes}). \item The shards that were removed in $\mathcal{V}_r$ by the reconstruction will still be present in $\mathcal{V}_{a^\prime}$, and so may be reintroduced into the new version, again leading to duplication of records. It is non-trivial to identify these shards during the merge and skip over them, because the shards don't have a unique identifier other than their pointers, and using the pointers for this check can lead to the ABA problem under the reference-counting-based memory management scheme the framework is built on. \end{enumerate} The first two of these problems result from a simple synchronization issue and can be solved using locking.
A maintenance reconstruction operates on some level $\mathscr{L}_i$, merging and then deleting shards from that level and placing the result in $\mathscr{L}_{i+1}$. In order for either of these problems to occur, multiple concurrent reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager can be introduced into the framework to allow reconstructions to lock entire levels. A reconstruction can only be scheduled if it is able to acquire the lock on the level that it is using as the \emph{source} for its shards. Note that there is no synchronization problem with a concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard to $\mathscr{L}_i$. This will not violate any ordering properties or result in any duplication of records. Thus, each reconstruction only needs to lock a single level. The final problem is a bit trickier to address, but is fundamentally an implementation detail. Our approach for resolving it is to change the way that maintenance reconstructions produce a version in the first place. Rather than taking a copy of $\mathcal{V}_a$, manipulating it to perform the reconstruction, and then reconciling it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay \emph{all} structural updates to the version until installation. When a reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken, instead of a copy. Then, any new shards are built based on the contents of $\mathcal{V}_a$, but no updates to the structure are made. Once all of the shard reconstructions are complete, the version installation process begins. The thread running the reconstruction waits for its turn to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To this copy, the newly created shards are added, and any necessary deletes are performed. Because the shards to be deleted are kept alive by, at minimum, the reference to $\mathcal{V}_a$ held by the reconstruction thread, pointer equality can be used to identify the shards, and the ABA problem is avoided. This process does push a fair amount of work to the moment of install, between when a version id is claimed by the reconstruction thread and when that version becomes active. During this time, any buffer flushes will be blocked. However, relative to the work associated with actually performing the reconstructions, the overhead of these metadata operations is fairly minor, and so it doesn't have a significant effect on buffer flush performance.
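A sketch of this delayed-application scheme is shown below, reusing the hypothetical \texttt{active}, \texttt{active\_id}, and \texttt{claim\_version\_id} from the earlier installation sketch. \texttt{build\_from} stands in for the user's shard-construction routine, and level locking is elided. \begin{verbatim}
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

struct Shard;  // immutable shard; construction elided
struct Version { std::vector<std::vector<std::shared_ptr<Shard>>> levels; };

std::shared_ptr<Shard> build_from(const std::vector<std::shared_ptr<Shard>>&);

// Maintenance reconstruction with delayed structural updates: build
// against the version referenced at scheduling time, but apply all
// structural changes to a copy of the *current* active version.
void run_maintenance(std::shared_ptr<Version> source, std::size_t src) {
    // 1. Build the merged shard; no structural changes are made yet.
    std::shared_ptr<Shard> merged = build_from(source->levels[src]);

    // 2. Claim a version id and wait for our turn, *then* copy the
    //    current active version (which may no longer be `source`).
    std::uint64_t id = claim_version_id();
    while (active_id.load() != id - 1)
        std::this_thread::yield();
    auto next = std::make_shared<Version>(*active.load());

    // 3. Apply the delayed updates: drop the input shards, identified by
    //    pointer equality (safe, since `source` keeps them alive), and
    //    append the merged shard to the next level down.
    auto& lvl = next->levels[src];
    for (const auto& s : source->levels[src])
        std::erase(lvl, s);  // shared_ptr operator== compares pointers
    next->levels[src + 1].push_back(std::move(merged));

    // 4. Publish: atomic pointer assignment, then the new active id.
    active.store(std::move(next));
    active_id.store(id);
}
\end{verbatim}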
\subsubsection{Mutable Buffer} \begin{figure} \centering \subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}} \subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}} \subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}} \caption{\textbf{Versioning process for the mutable buffer.} A schematic view of the mutable buffer demonstrating the three pointers representing its state, and how they are adjusted as inserts occur. Dark grey slots represent the currently active version, light grey slots the old version, and white slots are available space.} \label{fig:tl-buffer} \end{figure} Next, we'll address concurrent access and versioning of the mutable buffer. In our system, the mutable buffer consists of a large ring buffer with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In order to support versioning, the buffer actually uses two head pointers, one called \texttt{head} and one called \texttt{old head}, along with a single \texttt{tail} pointer. Records are inserted into the buffer by atomically incrementing \texttt{tail} and then placing the record into the slot. For records that cannot be atomically assigned, a visibility bit can be used to ensure that concurrent readers don't access a partially written value. \texttt{tail} can be incremented until it matches \texttt{old head}, or until the current version of the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$ records. At this point, any further writes would either clobber records in the old version or exceed the user-specified buffer capacity, and so any inserts must block until a flush has been completed. Flushes are triggered based on a user-configurable set point, $N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation is scheduled. The location of \texttt{tail} is recorded as part of the flush, but records can continue to be inserted until one of the blocking conditions described above is reached. When the flush has completed, a new shard is created containing the records between \texttt{head} and the value of \texttt{tail} at the time the flush began. The buffer version can then be advanced by setting \texttt{old head} to \texttt{head} and setting \texttt{head} to \texttt{tail}. All of the records associated with the old version are freed, and the records that were just flushed now become part of the old version. The reason for this scheme is to allow threads accessing an older version of the dynamized structure to still see a current view of all of the records. These threads will have a reference to a dynamized structure containing none of the records in the buffer, as well as a reference to \texttt{old head}. Because the older version of the buffer always directly precedes the newer, all of the buffered records are visible to this older version. However, threads accessing the more current version of the buffer will \emph{not} see the records contained between \texttt{old head} and \texttt{head}, as these records will have been flushed into the structure and are visible to the thread there. If this thread could still see records in the older version of the buffer, then it would see these records twice, which is incorrect. One consequence of this technique is that a buffer flush cannot complete until all threads referencing \texttt{old head} have finished with it. To ensure that this is the case, the two head pointers are reference counted, and a flush will stall until all references to \texttt{old head} have been removed. In principle, this problem could be reduced by allowing for more than two heads, but it becomes difficult to atomically transition between versions in that case, and it would also increase the storage requirements for the buffer, which requires $N_B$ space per available version.
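The insert path and version transition of this ring buffer can be sketched as follows. This is a deliberately simplified illustration with hypothetical names: the visibility bit, the reference counting of the head pointers, and the race between the capacity check and the slot claim are all ignored here. \begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

// Simplified sketch of the versioned ring buffer. Capacity is 2 * N_B:
// one N_B-record region per buffer version.
template <typename Record>
class MutableBuffer {
    std::vector<Record> slots;
    std::atomic<std::uint64_t> tail{0};      // next free slot
    std::atomic<std::uint64_t> head{0};      // start of current version
    std::atomic<std::uint64_t> old_head{0};  // start of old version

public:
    explicit MutableBuffer(std::size_t n_b) : slots(2 * n_b) {}

    // Returns false when the insert must block until a flush completes.
    bool append(const Record& rec) {
        std::uint64_t t = tail.load();
        if (t - old_head.load() >= slots.size() ||  // would clobber old version
            t - head.load() >= slots.size() / 2)    // current version at N_B
            return false;
        std::uint64_t slot = tail.fetch_add(1);     // atomically claim a slot
        slots[slot % slots.size()] = rec;
        return true;
    }

    // Called once a flush of [head, flush_tail) completes: the flushed
    // records become the old version, and the prior old version is freed.
    void advance_version(std::uint64_t flush_tail) {
        old_head.store(head.load());
        head.store(flush_tail);
    }
};
\end{verbatim}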
\subsection{Concurrent Queries} Queries are answered based upon the active version of the structure at the moment the query begins to execute. When the query routine of the dynamization is called, a query is scheduled. Once a thread becomes available, the query will begin to execute. At the start of execution, the query thread takes a reference to $\mathcal{V}_a$, as well as the current \texttt{head} and \texttt{tail} of the buffer. Both $\mathcal{V}_a$ and \texttt{head} are reference counted, and will be retained for the duration of the query. Once the query has finished processing, it will return the result to the user via a \texttt{std::promise} and release its references to the active version and buffer. \subsubsection{Query Preemption} Because our implementation only supports two active head pointers in the mutable buffer, queries can lead to insertion stalls. If a long-running query is holding a reference to \texttt{old head}, then an active buffer flush of the old version will be blocked by this query. If this blocking goes on for sufficiently long, then the buffer may fill up and the system will begin to reject inserts. One possible solution to this problem is to process the \texttt{buffer\_query} first, and then discard the reference to \texttt{old head}, allowing the buffer flush to proceed. However, this would not work for iterative deletion decomposable search problems, which may require re-processing the buffer query arbitrarily many times. As a result, we instead implement a simple preemption mechanism to defeat long-running queries. The framework keeps track of how long a buffer flush has been stalled by queries maintaining references to \texttt{old head}. Once this stalling passes a user-defined threshold, a preemption flag will be set, ordering the queries in question to restart themselves. This is implemented fully within the framework code, requiring no adjustment to the user's queries, as the framework query mechanism simply checks this flag in between calls to user code. If a query sees this flag is set, it will release its references to \texttt{old head} and the structure version, and automatically put itself back in the scheduling queue to be retried against newer versions of the structure and buffer. Note that, if misconfigured, it is possible that this mechanism will entirely prevent certain long-running queries from being answered. If the threshold for preemption is set lower than the expected run-time of a valid query, it's possible that the query will loop forever if the system is experiencing sufficient insertion pressure. To help avoid this, another parameter is available to specify a maximum preemption count, after which a query will ignore a request for preemption.
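The framework-side portion of this mechanism amounts to a retry loop that checks the flag between calls into user code. A rough sketch (with hypothetical helper names; the real query path carries considerably more state) follows. \begin{verbatim}
#include <atomic>

std::atomic<bool> preempt_flag{false};  // set when a flush stalls too long

struct VersionRefs {};                  // V_a + buffer head (placeholder)
VersionRefs acquire_refs();             // takes ref-counted references
void release_refs(VersionRefs&);        // releases them, unblocking flushes

// Sketch of the framework's query retry loop. The flag is checked only
// between calls into user code, so user query routines need no changes.
template <typename Query>
auto run_query(Query& q, int max_preemptions) {
    int preemptions = 0;
    for (;;) {
        VersionRefs refs = acquire_refs();
        bool preempted = false;
        while (!q.finished()) {
            q.step(refs);               // call into user query code
            if (preempt_flag.load() && preemptions < max_preemptions) {
                ++preemptions;          // beyond the cap, preemption
                preempted = true;       // requests are ignored
                break;
            }
        }
        release_refs(refs);             // allows a blocked flush to proceed
        if (!preempted)
            return q.result();          // delivered via std::promise
        q.reset();                      // retry against newer versions
    }
}
\end{verbatim}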
\subsection{Insertion Stall Mechanism}
The results of Theorem~\ref{theo:worst-case-optimal} and \ref{theo:par-worst-case-optimal} are based upon enforcing a rate limit on incoming inserts by manually increasing their cost, ensuring that there is sufficient time for reconstructions to complete. The calculation and application of this stall factor can be seen as equivalent to explicitly limiting the maximum allowed insertion throughput, and in this section we consider a mechanism for doing exactly that. In practice, calculating and precisely stalling for the correct amount of time is quite difficult because of the vagaries of working with a real system. While it would ultimately be ideal to have a cost model that estimates the stall time on the fly, based on the cost of building the data structure, the number of records involved in reconstructions, the number of available threads, the available memory bandwidth, and so on, for the purposes of this prototype we have settled for a simple system that demonstrates the robustness of the technique.

Recall the basic insert process within our system. Inserts bypass the scheduling system and communicate directly with the buffer, on the same client thread that called the function, to maximize insertion performance and eliminate as much concurrency-control overhead as possible. The insert routine is synchronous, and returns a boolean indicating whether the insert succeeded. The insert can fail if the buffer is full, in which case the user is expected to delay for a moment and retry; once space has been cleared in the buffer, the insert will succeed. We can leverage this same mechanism as a form of rate limiting, by rejecting new inserts when the throughput rises above a specified level. Unfortunately, the most straightforward approach (monitoring the throughput and simply blocking inserts whenever it rises above a specified threshold) is undesirable, because under it the probability of a given insert being rejected is not independent. The rejections tend to clump, which reintroduces the very tail latency problem we are attempting to resolve. Instead, it is best to spread the rejections evenly.

We approximate this rate-limiting behavior by using random sampling to determine which inserts to reject. Rather than specifying a maximum throughput, the system is configured with a probability of acceptance for each insert. This avoids the distributional problems mentioned above that arise from direct throughput monitoring, and has a few additional benefits. It is based on a single parameter that can be readily updated on demand using atomics; our current prototype uses a single, fixed value for this probability, but ultimately it should be dynamically tuned to approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal} as closely as possible. It also doesn't require significant modification of the existing client interfaces.

We have elected to use Bernoulli sampling for the task of selecting which inserts to reject; an alternative approach would have been to apply systematic sampling. Bernoulli sampling makes the probability of an insert being rejected independent of its position in the workload, but is non-deterministic, meaning that there is a small probability that many more rejections than expected will occur over a short span of time. Systematic sampling is not vulnerable to this problem, but introduces a dependence between the position of an insert in a sequence of operations and its probability of being rejected. We decided to prioritize independence in our implementation, as shown in the sketch below.
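A minimal sketch of this Bernoulli gate follows, assuming an atomically updatable acceptance probability and a thread-local generator; the names here are illustrative rather than those of our actual implementation.

\begin{verbatim}
#include <atomic>
#include <random>

/* Acceptance probability for inserts; 1.0 disables stalling entirely.
 * Stored in an atomic so it can be retuned on demand at runtime. */
std::atomic<double> g_accept_probability{1.0};

/* Each insert attempt is an independent Bernoulli trial, so rejections
 * are spread evenly rather than clumping the way threshold-based
 * throughput monitoring would cause them to. */
bool admit_insert() {
    thread_local std::mt19937_64 gen{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen) < g_accept_probability.load(std::memory_order_relaxed);
}
\end{verbatim}

A rejected insert is handled exactly like a failed insert against a full buffer: the client thread sleeps briefly and retries.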
\section{Evaluation}
\label{sec:tl-eval}
In this section, we perform several experiments to evaluate the ability of the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.

\subsection{Stall Rate Sweep}
As a first test, we evaluate the ability of our insertion stall mechanism to control insertion tail latencies, as well as to maintain decomposed structures similar to those produced by strict tiering. We consider the shard count directly in this test, rather than query latencies, because our intention is to show that this technique is capable of controlling the number of shards in the decomposition. The shard count also serves as an indirect measure of query latency, but we will consider that metric directly in a later section. Recall that, when using insertion stalling, our framework does \emph{not} block inserts to maintain a shard bound. The buffer is always flushed immediately, regardless of the number of shards in the structure. Thus, the rate of insertion is controlled by the cost of flushing the buffer (we still block when the buffer is full) and by the insertion stall rate. The structure is maintained fully in the background, with maintenance reconstructions being scheduled for all levels exceeding a specified shard count. The number of shards within the structure is therefore controlled indirectly, by limiting the insertion rate.

We ran these tests with 32 background threads on a system with 40 physical cores, to ensure sufficient resources to fully parallelize all reconstructions (we'll consider resource-constrained situations later). We tested an ISAM tree with the 200 million record SOSD \texttt{OSM} dataset~\cite{sosd-datasets}, as well as a VPTree with the one million record, 300-dimensional \texttt{SBW} dataset~\cite{sbw}. For each test, we inserted $30\%$ of the records to warm up the structure, and then measured the individual latency of each subsequent insert. We measured the count of shards in the structure each time the buffer flushed (including during the warmup period). Note that a stall rate of $\delta = 1$ indicates no stalling at all, and values less than one indicate a $1 - \delta$ probability of an insert being rejected, after which the inserting thread sleeps for about a microsecond; a lower stall rate therefore means that more stalls are introduced. The tiering policy is strict tiering with a scale factor of $s=6$, using the concurrency control scheme described in Section~\ref{ssec:dyn-concurrency}.

\begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 200M Records} \label{fig:tl-stall-200m} \end{figure}

We'll begin by considering the ISAM tree. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all stall rates succeed in greatly reducing tail latency relative to tiering. Additionally, it shows a small amount of available tuning of the worst-case insertion latencies, with higher stall amounts slightly reducing the tail latencies at various points in the distribution. This latter effect results from the buffer-flush latency-hiding mechanism retained from Chapter~\ref{chap:framework}: the buffer has space for two versions, and the second version can be filled while the first is flushing. This means that, for more aggressive stalling, some of the time spent blocking on the buffer flush is redistributed over the inserts into the second version of the buffer, rather than manifesting as a blocking stall.

Of course, if query latency is severely affected by the use of this mechanism, it may not be worth using. Thus, in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of shard counts within the decomposed structure for each stall rate, as well as for strict tiering. This figure shows that, even with no insertion stalling at all, the shard count within the structure remains well behaved, albeit with a slightly longer tail and a higher average value compared to tiering. Once stalls are introduced, it is possible both to reduce this tail and to shift the peak of the distribution across a range of values.
In particular, we see that a stall rate of $0.99$ is sufficient to move the peak very close to tiering, and lower stall rates are able to shift the peak of the distribution to even lower counts. This result implies that the stall mechanism may be able to produce a trade-off space for insertion and query performance, a question we examine in Section~\ref{ssec:tl-design-space}.

\begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 4B Records} \label{fig:tl-stall-4b} \end{figure}

To validate that these results were not an artifact of the relatively small size of the data set used, we repeated the exact same testing using a set of four billion uniform integers, shown in Figure~\ref{fig:tl-stall-4b}. These results are aligned with those for the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the same improvements in insertion tail latency for all stall amounts, and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard count.

\begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\ \caption{Insertion and Shard Count Distributions for VPTree} \label{fig:tl-stall-knn} \end{figure}

Finally, we considered our dynamized VPTree in Figure~\ref{fig:tl-stall-knn}. This test shows some of the possible limitations of our fixed stall rate mechanism. The ISAM tree tested above is constructable in roughly linear time, being an MDSP with $B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards and thus $\log k$ is roughly constant.\footnote{For strict tiering, $k=s$ in all cases. Because we don't enforce the level shard capacity directly, however, in the insertion stalling case $k \in \Omega(s)$. Based on the experimental results above, it is clear that $k$ typically remains quite close to $s$ in practice for the ISAM tree.} Thus, the ratio $\frac{B_M(n, k)}{n}$ used to determine the optimal insertion stall rate is asymptotically a constant. For VPTree, however, the construction cost is super-linear, with $B(n) \in \Theta(n \log n)$, and is also generally much larger in absolute terms. We can see in Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution is very poorly behaved for small amounts of stalling, with the shard count following a roughly uniform distribution for a stall rate of $\delta = 1$. This means that the background reconstructions are not capable of keeping up with buffer flushing, and so the number of shards grows significantly over time. Introducing stalls does shift the distribution closer to normal, but much more aggressive stalling (a much lower stall rate $\delta$) is required to obtain a shard count distribution close to that of strict tiering than was the case in the ISAM tree test; the asymptotic comparison below makes the reason for this difference explicit. It remains achievable, though, even with our simple fixed-stall-rate implementation.
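To spell out this asymptotic difference, consider the per-record reconstruction work implied by each structure's build cost, using the quantities defined above:
\[
\frac{B_M(n, k)}{n} \in \Theta\!\left(\frac{n \log k}{n}\right) = \Theta(\log k)
\qquad \text{versus} \qquad
\frac{B(n)}{n} \in \Theta\!\left(\frac{n \log n}{n}\right) = \Theta(\log n).
\]
For the ISAM tree, $k$ is tied to the scale factor and is effectively constant, so a single fixed stall rate can remain near the optimum as the structure grows. For the VPTree, the per-record work grows with $n$, so a fixed stall rate must be chosen conservatively enough to accommodate the largest reconstructions encountered, which is why matching the tiering shard bound requires so much more stalling.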
Additionally, this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce the tail latency substantially compared to strict tiering, with the same latency distribution effects under more aggressive stalling as were seen in the ISAM examples. These tests show that, for the ISAM tree at least, introducing a constant stall rate while allowing the decomposition to develop naturally, with background reconstructions only, is able to match the shard count distribution of tiering (which strictly enforces the shard count bound using blocking) while achieving significantly better insertion tail latencies. VPTree is able to achieve the same results, albeit requiring significantly more aggressive stalling to match the shard bound.

\subsection{Insertion Stall Trade-off Space}
\label{ssec:tl-design-space}

\begin{figure} \centering \subfloat[ISAM w/ Point Lookup]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \label{fig:tl-latency-curve-isam}} \subfloat[VPTree w/ $k$-NN]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-latency-curve.pdf} \label{fig:tl-latency-curve-knn}} \\ \caption{Insertion Throughput vs. Query Latency} \label{fig:tl-latency-curve} \end{figure}

We have shown that introducing insertion stalls accomplishes our stated goal of reducing insertion tail latencies while simultaneously maintaining a shard count in line with strict tiering. However, we have not yet addressed the actual performance of the structure in terms of average throughput or query latency. By throttling insertion, we potentially reduce throughput, and it isn't clear in practice how much query latency suffers as the shard count distribution changes. In this experiment, we address these concerns by directly measuring the insertion throughput and query latency over a variety of stall rates, and comparing the results to strict tiering.

The results of this test for ISAM with the SOSD \texttt{OSM} dataset are shown in Figure~\ref{fig:tl-latency-curve-isam}, which plots the insertion throughput against the average query latency for our system at various stall rates, with tiering configured at an equivalent scale factor marked as a red point for reference. The most interesting point demonstrated by this plot is that the stall rate provides a clean, continuous design trade-off between query and insertion performance. In fact, this space is far more useful than the trade-off space represented by layout policy and scale factor selection under strict reconstruction schemes, which we examined in Chapter~\ref{chap:design-space}. At the upper end of the insertion-optimized region, we see more than double the insertion throughput of tiering (with significantly lower tail latencies as well) at the cost of slightly more than a $2\times$ increase in query latency. Moving down the curve, we see that we are able to roughly match the performance of tiering within this space, and even shift to more query-optimized configurations. Moreover, this trade-off curve falls \emph{below} the equivalently configured tiering point on the chart, indicating that its performance is strictly superior. We also performed the same testing for $k$-NN queries using VPTree and the \texttt{SBW} dataset. The results are shown in Figure~\ref{fig:tl-latency-curve-knn}.
Because the run time of $k$-NN queries is significantly longer than that of the point lookups in the ISAM test, we additionally applied a rate limit to the query thread, issuing new queries every 100 milliseconds, and configured query preemption with a trigger point of approximately 40 milliseconds. We applied the same parameters to the tiering test, and counted any additional latency associated with query preemption towards the reported average query latency figures. This test shows that, as with ISAM, we have access to a similarly clear trade-off space by adjusting the insertion throughput; however, in this case the standard tiering policy did perform better in terms of both average insertion throughput and query latency. The fact that stalling was outperformed by strict tiering for VPTree isn't a surprising result, given the observations made in the previous test: VPTree requires significantly heavier insertion throttling to keep up with its longer reconstruction times, and the amount of throttling required per record is not asymptotically constant as the structure grows.

Overall, this is a very interesting result. Not only is our approach able to match a strict reconstruction policy in terms of average query and insertion performance while providing better tail latencies, but it is even able to provide a set of design trade-offs superior to the strict policies, at least in environments where sufficient parallel processing and memory are available to leverage parallel reconstructions.

\subsection{Legacy Design Space}
Our new system retains the concepts of buffer size and scale factor from the previous version, although these have very different performance implications given our different compaction strategy. In this test, we examine the effects of these parameters on the insertion-query trade-off curves noted above, as well as on insertion tail latency. The results are shown in Figure~\ref{fig:tl-design-space}, for a dynamized ISAM tree using the SOSD \texttt{OSM} dataset and point lookup queries.

\begin{figure} \centering \subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}} \subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\ \caption{Legacy Design Space Examination} \label{fig:tl-design-space} \end{figure}

First, we consider the insertion throughput vs. average query latency curves for our system using different values of the scale factor, in Figure~\ref{fig:tl-sf-curve}. Recall that our system of reconstruction in this chapter does not explicitly enforce any structural invariants, and so the scale factor's only role is in determining the point at which a given level will have a reconstruction scheduled for it. Lower scale factors will compact shards more aggressively, while higher scale factors will allow more shards to accumulate before a reconstruction is attempted. Interestingly, there are clear differences in the curves, particularly at higher insertion throughputs. For lower throughputs, a scale factor of $s=2$ appears strictly inferior, while the other tested scale factors result in roughly equivalent curves. However, as the insertion throughput is increased, the curves begin to separate more, with $s = 6$ emerging as the superior option across the majority of the space. Next, we consider the effect that buffer size has on insertion tail latency.
Based on our discussion of the equal block method in Section~\ref{sec:tl-insert-query-tradeoff}, and the fact that our technique only blocks inserts on buffer flushes, it stands to reason that the buffer size should directly influence the worst-case insertion time. This bears out in practice, as shown in Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased, the worst-case insertion time also increases, although the effect is relatively small.

\subsection{Thread Scaling}

\begin{figure} \centering \subfloat[Insertion Throughput vs. Query Latency]{\includegraphics[width=.5\textwidth]{img/tail-latency/recon-thread-scale.pdf} \label{fig:tl-latency-threads}} \subfloat[Maximum Insertion Throughput for a Given Query Latency]{\includegraphics[width=.5\textwidth]{img/tail-latency/constant-query.pdf} \label{fig:tl-query-scaling}} \\ \caption{Framework Thread Scaling} \label{fig:tl-threads} \end{figure}

In the previous tests, we ran our system configured with 32 available threads, which was more than enough to run all reconstructions and queries fully in parallel. However, it's important to determine how well the system works in more resource-constrained environments. The system shares internal threads between reconstructions and queries, and flushing occurs on a dedicated thread separate from these. During the benchmark, one client thread issued queries continuously and another issued inserts. The index accumulated a total of five levels, so the maximum amount of parallelism available during the testing was four parallel reconstructions, along with the dedicated flushing thread and any concurrent queries. In these tests, we used the SOSD \texttt{OSM} dataset (200M records) and point lookup queries without early abort against a dynamized ISAM tree. We considered the insertion throughput vs. query latency trade-off for various stall amounts at several internal thread counts. We inserted 30\% of the dataset first, and then measured the insertion throughput over the insertion of the rest of the data on one client thread, while the other client thread continuously issued queries against the structure.

The results of this test are shown in Figure~\ref{fig:tl-latency-threads}. The first thing to note is that the number of available internal threads has little effect on insertion throughput, as shown by the clustering of the points on the curve. This is to be expected, as insertion throughput is limited only by the stall amount and by buffer flushing. Because flushing occurs on a dedicated thread, it is unaffected by changes in the internal thread configuration of the system. In terms of query performance, there are two general effects that can be observed. The first is that the previously noted reduction in query performance as insertion throughput increases is observed in all cases, irrespective of thread count. However, interestingly, the thread count itself has little effect on the curve outside of the single-thread case. This can also be seen in Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of the same data, revealing the best measured insertion throughput associated with a given query latency bound. In both cases, two or more threads are capable of significantly higher insertion throughput at a given query latency, but at very low insertion throughputs this effect vanishes and all thread counts are roughly equivalent in performance.
A large part of the reason for this significant deviation in behavior between one thread and multiple threads is likely that queries and reconstructions share the same pool of background threads. Our testing involved issuing queries continuously on a single thread while performing inserts, and so two background threads ensure that a reconstruction and a query can run in parallel, whereas a single thread forces queries to wait behind long-running reconstructions. Once this bottleneck is overcome, a reduction in the amount of parallel reconstruction seems to have only a minor influence on overall performance. This is because, although in the worst case the system requires $\log_s n$ threads to fully parallelize reconstructions, this worst case is fairly rare: the vast majority of reconstructions require only a fraction of this total parallel capacity.

\section{Conclusion}
In this chapter, we addressed the last of the three major problems of dynamization: tail latency. We proposed a technique for limiting the rate of insertions to match the rate of reconstruction, one that matches the worst-case optimized approach of Overmars~\cite{overmars81} on a single thread and exceeds it given multiple parallel threads. We then implemented the necessary mechanisms to support this technique within our framework, including a significantly improved architecture for scheduling and executing parallel and background reconstructions, and a system for rate limiting that rejects inserts via Bernoulli sampling. We evaluated this system for fixed stall rates, and found significant improvements in tail latencies, approaching the practical lower bound we established using the equal block method, without significant degradation of query performance. In fact, we found that this rate limiting mechanism provides a design space with more effective trade-offs than the one we examined in Chapter~\ref{chap:design-space}, with the system able to exceed the query performance of an equivalently configured tiering system for certain rate limiting configurations.

The method has limitations: assigning a fixed rejection rate to inserts works well for structures constructable in linear time, such as the ISAM tree, but is significantly less effective for the VPTree, which requires $\Theta(n \log n)$ time to construct. For structures like this, it will be necessary to dynamically scale the amount of throttling based on the record count and the size of each reconstruction. Additionally, our current system isn't easily capable of reaching the ``ideal'' goal of reliably trading query performance against insertion latency at a fixed throughput. Nonetheless, the mechanisms for supporting such features are present, and even this simple implementation represents a marked improvement in terms of both insertion tail latency and configurability.