\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+Tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. The dynamized structure has much better ``best-case'' performance, but its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. This poor worst-case performance is a direct consequence of the different approaches to update support used by the dynamized structure and the B+Tree. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted.
However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). The fact that our dynamization technique uses buffering, and that most of the shards involved in reconstruction are kept small by the logarithmic decomposition used to partition the data, ensures that the majority of inserts are low cost compared to the B+Tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+Tree results in significantly better worst-case performance. Unfortunately, the design space that we have been considering thus far is limited in its ability to meaningfully alter the worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling can allow for a small reduction in the worst case, but at the cost of making the majority of inserts worse because of increased write amplification. \begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} The other tuning knobs that are available to us are of limited usefulness in controlling the worst-case behavior.
Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer has almost no effect on the worst-case latency, or even on the distribution more broadly, particularly when tiering is used. This is to be expected; ultimately the worst-case reconstruction size is largely the same regardless of scale factor or buffer size: $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins to, temporarily, exhibit better tail latency behavior than the other. However, after enough records have been inserted to cause the next largest reconstructions to begin to occur, the ``better'' configuration begins to appear worse again in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases.
These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex, which is \emph{bad} on this plot, as it means a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just that, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics, while also allowing for significantly better insertion tail latencies. We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice. \section{The Insertion-Query Trade-off} As reconstructions are at the heart of the insertion tail latency problem, it seems worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and is done using reconstructions to either merge (in the case of the Bentley-Saxe method) or re-partition (in the case of the equal block method) the blocks. Performing less frequent (or smaller) reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance.
This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. This technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. This worst-case result ignores re-partitioning costs, which may be necessary for certain selections of $f(n)$. We omit it here because we are about to examine a case of the equal block method where no re-partitioning is necessary. When re-partitioning is used, the worst-case cost rises to the now familiar $I(n) \in \Theta(B(n))$ result. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-latency-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with $f(n) = C$ for varying values of $C$.} \label{fig:tl-ebm} \end{figure} Unlike the design space we have proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance, \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. Note that in this test the final record count was known in advance, allowing all re-partitioning to be avoided. This represents a sort of ``best case scenario'' for the technique, and isn't reflective of real-world performance, but does serve to demonstrate the relevant properties in the clearest possible manner.
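To make the mechanics of this configuration concrete, the equal block method with a fixed block count can be sketched in a few lines. This is a toy, single-threaded illustration using sorted integer arrays as stand-ins for ISAM blocks; the class and method names are our own, not those of the benchmark implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy equal block method with f(n) = C: records are spread across C
// blocks, and an insert rebuilds only the smallest block, touching
// Theta(n / C) records in the worst case. A point lookup must probe
// all C blocks, which is the query-side cost of the trade-off.
class EqualBlockMethod {
  std::vector<std::vector<int>> blocks_;  // each block kept sorted

public:
  explicit EqualBlockMethod(std::size_t C) : blocks_(C) {}

  void insert(int key) {
    // Rebuild the smallest block with the new record included; the
    // sort stands in for the block "reconstruction."
    auto smallest = std::min_element(
        blocks_.begin(), blocks_.end(),
        [](const auto &a, const auto &b) { return a.size() < b.size(); });
    smallest->push_back(key);
    std::sort(smallest->begin(), smallest->end());
  }

  bool contains(int key) const {
    for (const auto &b : blocks_)
      if (std::binary_search(b.begin(), b.end(), key)) return true;
    return false;
  }

  std::size_t block_count() const { return blocks_.size(); }
};
```

Raising $C$ shrinks the per-insert rebuild (and thus the worst-case insert) while linearly increasing the number of blocks a query must visit, which is exactly the relationship visible in Figure~\ref{fig:tl-ebm}.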
Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction, and so increasing the block count results in smaller blocks, and better insertion performance. These worst-case results also translate directly into improved average throughput, at the cost of query latency, as shown in Figure~\ref{fig:tl-ebm-tradeoff}. These results show that, contrary to our Bentley-Saxe-inspired dynamization system, the equal block method provides clear and direct relationships between insertion and query performance, as well as direct control over tail latency, through its design space. Unfortunately, the equal block method is not well suited for our purposes. Despite having a much cleaner trade-off space, its performance is strictly worse than that of our dynamization system. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, our technique provides significantly better insertion throughput.\footnote{ In actuality, the insertion performance of the equal block method is even \emph{worse} than the numbers presented here. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any re-partitioning as the structure grew, and reduced write amplification. } This is because, in our technique, the variable size of the blocks allows the majority of the reconstructions to occur on smaller structures, while allowing the majority of the records to reside in a small number of large blocks at the bottom of the structure. This setup enables high insertion throughput while keeping the block count small.
But, as we've seen, the cost of this is large tail latencies, as the large blocks must occasionally be involved in reconstructions. However, we can use the extreme ends of the equal block method's design space to consider upper limits on the insertion and query performance that we might expect to get out of a dynamized structure, and then take steps within our own framework to approach these limits, while retaining the desirable characteristics of the logarithmic decomposition. At the extreme end, consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method, where every block is fixed at a capacity of $N_B$. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search problem. Applying this technique to an ISAM tree, and comparing it against a B+Tree, yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a ``Reconstructionless'' Dynamization} \label{fig:tl-floodl0} \end{figure} Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases.
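Concretely, this ``reconstructionless'' scheme amounts to nothing more than a buffer and an append-only list of shards. The sketch below is our own toy rendering, with sorted arrays standing in for ISAM shards; it shows why the worst-case insert is a single $B(N_B)$ flush, while a query must visit every shard.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A buffer of capacity N_B is flushed into an immutable sorted shard
// when full; shards are never merged afterward. Inserts cost at most
// one flush of N_B records, but a query must probe all of the
// Theta(n / N_B) shards that accumulate.
class Reconstructionless {
  std::size_t nb_;                        // buffer capacity N_B
  std::vector<int> buffer_;               // mutable buffer
  std::vector<std::vector<int>> shards_;  // immutable sorted shards

public:
  explicit Reconstructionless(std::size_t nb) : nb_(nb) {}

  void insert(int key) {
    buffer_.push_back(key);
    if (buffer_.size() == nb_) {          // flush: the only "slow" insert
      std::sort(buffer_.begin(), buffer_.end());
      shards_.push_back(std::move(buffer_));
      buffer_.clear();
    }
  }

  bool contains(int key) const {
    for (const auto &s : shards_)
      if (std::binary_search(s.begin(), s.end(), key)) return true;
    return std::find(buffer_.begin(), buffer_.end(), key) != buffer_.end();
  }

  std::size_t shard_count() const { return shards_.size(); }
};
```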
In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}. Unfortunately, the query latency of this technique is too large for it to be useful; it is necessary to perform reconstructions to merge these small shards together to ensure good query performance. However, this does raise an interesting point. Fundamentally, the reconstructions that contribute to tail latency are \emph{not} required from an insertion perspective; they are a query optimization. Thus, we could remove the reconstructions from the insertion process and perform them elsewhere. This could, theoretically, allow us to have our cake and eat it too. The only insertion bottleneck would become the buffer flushing procedure, as is the case in our hypothetical ``reconstructionless'' approach. Unfortunately, it is not as simple as pulling the reconstructions off of the insertion path and running them in the background, as this alone cannot provide us with a meaningful bound on the number of blocks in the dynamized structure. But it is still possible to provide this bound, if we're willing to throttle the insertion rate to be slow enough to keep up with the background reconstructions. In the next section, we'll discuss a technique based on this idea. \section{Relaxed Reconstruction} There does exist theoretical work on throttling the insertion rate of a Bentley-Saxe dynamization to control the worst-case insertion cost~\cite{overmars81}, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to break the largest reconstructions up into small sequences of operations that can then be attached to each insert, spreading the total workload out and ensuring that each insert takes a consistent amount of time.
Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be much more uniform. Unfortunately, this technique has a number of limitations, which we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are \begin{enumerate} \item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem requires the worst-case optimized dynamization systems to maintain complicated collections of partially built structures. \item The approach assumes that the workload of building a block can be evenly divided in advance, and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure reconstruction routines, and doesn't admit simple, generalized interfaces. \end{enumerate} In this section, we consider how these restrictions can be overcome given our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized theoretical techniques, given a few assumptions about available resources, by taking advantage of parallelism and proactive scheduling of reconstructions. We will be restricting ourselves in this chapter to the tiering layout policy, because it has some specific properties that are useful to our goals. In tiering, the input shards to a reconstruction are restricted to a single level, unlike in leveling, where the shards come from two levels, and BSM, where the shards come from potentially \emph{all} of the levels.
This allows us to maximize parallelism, which we will be using to improve the tail latency performance, greatly simplifies synchronization, and provides us with the largest window over which to amortize the costs of reconstruction. The techniques we describe in this chapter will work with leveling as well, albeit less effectively, and will not work \emph{at all} using BSM. First, a comment on nomenclature. We define the term \emph{last level}, $i = \ell$, to mean the level in the dynamized structure with the largest index value (and thereby the most records) and \emph{first level} to mean the level with index $i=0$. Any level with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction on level $i$ involves the combination of all blocks on that level into one larger block, which is then appended to level $i+1$. Relative to some level at index $i$, the \emph{next level} is the level at index $i + 1$, and the \emph{previous level} is at index $i-1$. Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush and a new block will be placed in the first level. Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number of blocks in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced with the amount of time needed to complete reconstructions.
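Setting aside the background scheduling and throttling for a moment, the maintenance rule itself can be illustrated with a synchronous toy: whenever a level reaches $s$ shards, they are merged into one shard that is appended to the next level. The types and names below are our own simplifications; in the actual framework these reconstructions run on background threads rather than inline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Shard = std::vector<int>;
using Level = std::vector<Shard>;

// Tiering maintenance rule, run synchronously: a flushed shard lands in
// level 0, and any level that reaches s shards has them merged into a
// single shard appended to the next level down. Merges cascade because
// the loop continues downward after each reconstruction.
struct Tiering {
  std::size_t s;              // scale factor: shard count that triggers a merge
  std::vector<Level> levels;

  void flush(Shard shard) {
    if (levels.empty()) levels.emplace_back();
    levels[0].push_back(std::move(shard));
    maintain();
  }

  void maintain() {
    for (std::size_t i = 0; i < levels.size(); ++i) {
      if (levels[i].size() < s) continue;
      Shard merged;                       // the "reconstruction"
      for (auto &sh : levels[i])
        merged.insert(merged.end(), sh.begin(), sh.end());
      std::sort(merged.begin(), merged.end());
      levels[i].clear();
      if (i + 1 == levels.size()) levels.emplace_back();
      levels[i + 1].push_back(std::move(merged));
    }
  }
};
```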
\begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/last-level-recon.pdf} \caption{\textbf{Worst-case Reconstruction.} Using the tiering layout policy, the worst-case reconstruction occurs when every level in the structure has been filled (middle portion of the figure) and a reconstruction must be performed on each level to merge it into a single shard and place it on the level below, leaving the structure with one shard per level after the records from the buffer have been added to L0 (right portion of the figure). The cost of this reconstruction is dominated by the cost of performing a reconstruction on the last level. The last level reconstruction, however, can be performed well in advance, as it only requires the blocks on the last level, which is full $\Theta(n)$ inserts before the worst-case reconstruction is triggered (left portion of the figure). This provides us with the opportunity to initiate this reconstruction early. \emph{Note: the block sizes in this diagram are not scaled to their record counts--each level has an increasing number of records per block.}} \label{fig:tl-tiering} \end{figure} First, we'll consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, which will allow us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}.
This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which in tiering will be of cost $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction finishes before these $\Theta(n)$ inserts have occurred, it is sufficient to throttle the rate of inserts appropriately. Ignoring the cost of buffer flushing, this means that inserts must cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must take at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level reconstruction. However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last level reconstruction can be scheduled, there will be exactly $1$ block on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ blocks.
There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so the stall for each can be bounded above by $O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in a worst-case insertion latency bound equivalent to that of~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's block code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we can avoid needing to add any complicated additional structures to manage partially built blocks as new records are added. \subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there are two available threads of parallel execution, which allows for the reconstructions to run in parallel with inserts. The amount of necessary insertion stall can be significantly reduced, however, if more than two threads are available. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that it is insufficient to cover only the cost of the last level reconstruction to maintain the bound on the block count.
From the moment that the last level has filled, and this reconstruction can begin, every level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. Consider a parallel implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last level reconstruction, and blocks all other reconstructions until it has been completed. Such an approach would result in $\delta = \frac{B(n)}{n}$ stall and complete the last level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ blocks in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active on the parallel reconstruction thread. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because there will at worst be a reconstruction running on every level, and each reconstruction will be no larger than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last level reconstruction cost, and the sum term bounds the cost of the $s-1$ reconstructions on each internal level above by $B(n)$ each. Dropping constants and expanding the sum results in, \begin{equation*} B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism will allow us to reduce this.
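For concreteness, instantiating the two-thread bound of Theorem~\ref{theo:worst-case-optimal} with two representative build costs (an illustration of our own) gives,

```latex
\begin{align*}
B(n) \in \Theta(n) &\implies I(n) \in O\left(\frac{n}{n} \cdot \log n\right) = O(\log n) \\
B(n) \in \Theta(n \log n) &\implies I(n) \in O\left(\frac{n \log n}{n} \cdot \log n\right) = O\left(\log^2 n\right)
\end{align*}
```

which has the same form as the familiar Bentley-Saxe amortized insertion cost of $O\left(\frac{B(n)}{n} \log n\right)$, only now holding in the worst case.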
At the upper limit, assume that there are $\log n$ threads available for parallel reconstructions. This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $\log n$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ threads of parallelism to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these threads in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. Each reconstruction will require $\Theta(B(N_B \cdot s^{i+1}))$ time to complete. Thus, the necessary stall to fully cover a reconstruction on level $i$ is this cost, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the index above this point). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} necessary stall for each level. 
Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For all reconstructions running in parallel, the necessary stall is the maximum stall of all the parallel reconstructions, \begin{equation*} \delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} \end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will always be the largest. $N_B \cdot s^{\ell + 1}$ is $\Theta(n)$, so we find that, \begin{equation*} I(n) \in O \left(\frac{B(n)}{n}\right) \end{equation*} is the worst-case insertion cost, while ensuring that all reconstructions are done in time to maintain the block bound given $\log n$ parallel threads. \end{proof} \section{Implementation} \label{sec:tl-impl} The previous section demonstrated that, theoretically, it is possible to meaningfully control the tail latency of our dynamization system by relaxing the reconstruction processes and throttling the insertion rate, rather than blocking, as a means of controlling the shard count within the structure. However, there are a number of practical problems to be solved before this idea can be used in a real system. In this section, we discuss these problems, and our approaches to solving them to produce a dynamization framework based upon the technique. Note that this system is based on the same high level architecture as we described in Section~\ref{ssec:dyn-concurrency}. To avoid redundancy, we will focus on how this system differs, without fully recapitulating the content of that earlier section. 
\subsection{Parallel Reconstruction Architecture} The existing concurrency implementation described in Section~\ref{ssec:dyn-concurrency} is insufficient for the purposes of constructing a framework supporting the parallel reconstruction scheme described in the previous section. In particular, it is limited to only two active versions of the structure at a time, with one ongoing reconstruction. Additionally, it does not consider buffer flushes as distinct events from reconstructions. In this section, we will discuss the modifications made to the concurrency support within our framework to support parallel reconstructions. Much like the simpler scheme in Section~\ref{ssec:dyn-concurrency}, our concurrency framework will be based on multi-versioning. Each \emph{version} consists of three pieces of information: a buffer head pointer, a buffer tail pointer, and a collection of levels and shards. However, the process of managing, creating, and installing versions is much more complex, to allow more than two versions to exist at the same time under certain circumstances. \subsubsection{Structure Versioning} The internal structure of the dynamization consists of a sequence of levels containing immutable shards, as well as a snapshot of the state of the mutable buffer. This section pertains specifically to the internal structure; the mutable buffer handles its own versioning separately and will be discussed in the next section. \begin{figure} \centering \subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}} \subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}} \caption{\textbf{Structure Version Transitions.} The dynamized structure can transition to a new version via two operations: flushing the buffer into the first level, or performing a maintenance reconstruction to merge shards on some level and append the result onto the next one.
In each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s light grey shards, with the dark grey shards being newly created and the white shards being deleted. The buffer flush operation in Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer and places it in \texttt{L0} to create \texttt{V2}. The maintenance reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex, creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s \texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}. } \label{fig:tl-flush-maint} \end{figure} The internal structure of the dynamized data structure (ignoring the buffer) can be thought of as a list of immutable levels, $\mathcal{V} = \{\mathscr{L}_0, \ldots, \mathscr{L}_h\}$, where each level contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots, \mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of as functions, which accept a version as input and produce a new version as output. Namely, \begin{align*} \mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\ \mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j) \end{align*} where the subscript represents the \texttt{version\_id} and is a strictly increasing number assigned to each version. The $\mathbftt{flush}$ operation builds a new shard using records from the buffer, $\mathcal{B}$, and creates a new version identical to $\mathcal{V}_i$, except with the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs a maintenance reconstruction by building a new shard using all of the shards in level $\mathscr{L}_x$ and creating a new version identical to $\mathcal{V}_i$, except that the new shard is appended to level $\mathscr{L}_j$ and the shards previously in $\mathscr{L}_x$ are removed. These two operations are shown in Figure~\ref{fig:tl-flush-maint}.
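A minimal sketch of these two operations as pure functions over versions might look as follows, with shards shared between versions via \texttt{std::shared\_ptr} so that a new version is a shallow copy of its predecessor. The types and names here are illustrative simplifications of our own, not the framework's actual interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Versions are immutable values: flush() and maint() take a version and
// return a new one, sharing all untouched shards by reference.
using Shard = std::vector<int>;
using ShardRef = std::shared_ptr<const Shard>;
using Level = std::vector<ShardRef>;

struct Version {
  std::vector<Level> levels;
};

// flush: build a shard from the buffer and append it to level 0.
Version flush(const Version &v, const std::vector<int> &buffer) {
  Version next = v;  // shallow copy: shard contents are not duplicated
  if (next.levels.empty()) next.levels.emplace_back();
  next.levels[0].push_back(std::make_shared<const Shard>(buffer));
  return next;
}

// maint: merge every shard on level x into one new shard on level j,
// removing the inputs from level x in the resulting version.
Version maint(const Version &v, std::size_t x, std::size_t j) {
  Version next = v;
  auto merged = std::make_shared<Shard>();
  for (const auto &sh : next.levels[x])
    merged->insert(merged->end(), sh->begin(), sh->end());
  next.levels[x].clear();
  if (next.levels.size() <= j) next.levels.resize(j + 1);
  next.levels[j].push_back(std::move(merged));
  return next;
}
```

Because the input version is passed by const reference and copied, older versions remain intact for any queries still running against them.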
At any point in time, the framework will have \emph{one} active version, $\mathcal{V}_a$, as well as a maximum unassigned version number, $v_m > a$. New version ids are obtained by performing an atomic fetch-and-add on $v_m$, and versions will become active in the exact order of their assigned version numbers. We use the term \emph{installing} a version, $\mathcal{V}_x$, to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$. \Paragraph{Version Number Assignment.} It is the intention of this framework to prioritize buffer flushes, meaning that the versions resulting from a buffer flush should become active as rapidly as possible. It is undesirable to have some version, $\mathcal{V}_f$, resulting from a buffer flush, attempt to install while there is a version $\mathcal{V}_r$ associated with an in-process maintenance reconstruction such that $a < r < f$. In this case, the flush must wait for the maintenance reconstruction to finalize before it can itself be installed. To avoid this problem, we assign version numbers differently based upon whether the new version is created by a flush or a maintenance reconstruction. \begin{itemize} \item \textbf{Flush.} When a buffer flush is scheduled, it is immediately assigned the next available version number at the time of scheduling. \item \textbf{Maintenance Reconstruction.} Maintenance reconstructions are \emph{not} assigned a version number immediately. Instead, they are assigned a version number \emph{after} all of the reconstruction work is performed, during their installation process. \end{itemize} \Paragraph{Version Installation.} Once a given flush or maintenance reconstruction has completed and has been assigned a version number, $i$, the version will attempt to install itself. The thread running the operation will wait until $a = i - 1$, and then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an atomic pointer assignment.
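The version number assignment and ordered installation just described can be sketched with a pair of atomic counters. This is a minimal illustration under assumed names, not the framework's implementation; in particular, a production version would yield or sleep rather than busy-wait, and would also swap the actual version pointer during installation.

```cpp
#include <atomic>
#include <cstddef>

// v_m: the next unassigned version id; a: the id of the active version.
std::atomic<std::size_t> next_version{1};
std::atomic<std::size_t> active_version{0};

// Claim a version id via atomic fetch-and-add. Flushes call this at
// scheduling time; maintenance reconstructions call it only at install.
std::size_t claim_version_id() {
    return next_version.fetch_add(1);
}

// Installation: wait until our immediate predecessor is active, then
// become the active version. Versions thus activate in id order.
void install(std::size_t my_id) {
    while (active_version.load(std::memory_order_acquire) != my_id - 1) {
        // busy-wait for illustration; a real implementation would yield
    }
    active_version.store(my_id, std::memory_order_release);
}
```

Deferring `claim_version_id()` for maintenance reconstructions is what keeps a slow reconstruction from ever holding an id smaller than a pending flush's id, which would otherwise block that flush's installation.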
All versions are reference counted using \texttt{std::shared\_ptr}, and so a version will be automatically deleted once all threads holding a reference to it have released that reference; no special memory management is necessary during version installation. \begin{figure} \centering \includegraphics[width=\textwidth]{diag/tail-latency/dropped-shard.pdf} \caption{\textbf{Shard Reconciliation Problem.} Because maintenance reconstructions don't obtain their version number until after they have completed, it is possible for the internal structure of the dynamization to change between when the reconstruction is scheduled, and when it completes. In this example, a maintenance reconstruction is scheduled based on V1 of the structure. Before it can finish, V2 is created as a result of a buffer flush. As a result, the maintenance reconstruction's resulting structure is assigned V3. But, when it is installed, the shard produced by the flush in V2 is lost. It will be necessary to devise a means to prevent this from happening.} \label{fig:tl-dropped-shard} \end{figure} \Paragraph{Maintenance Version Reconciliation.} Waiting until the moment of installation to assign a version number to maintenance reconstructions avoids stalling buffer flushes; however, it introduces additional complexity in the installation process. This is because the active version at the time the reconstruction was scheduled, $\mathcal{V}_a$, may no longer be the active version at the time the reconstruction is installed, $\mathcal{V}_{a^\prime}$. This means that the version of the structure produced by the reconstruction, $\mathcal{V}_r$, will not reflect any updates to the structure that were performed by versions with ids on the interval $(a, a^\prime]$. Figure~\ref{fig:tl-dropped-shard} shows an example of the sort of problem that can arise. One possible approach is to simply merge the versions together, adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but not in $\mathcal{V}_r$ prior to installation.
Unfortunately, this approach is insufficient, as it can lead to three possible problems: \begin{enumerate} \item If shards used in the maintenance reconstruction to produce $\mathcal{V}_r$ were \emph{also} used as part of a different maintenance reconstruction resulting in a version $\mathcal{V}_o$ with $o < r$, then \textbf{records will be duplicated} by the merge. \item If another reconstruction produced a version $\mathcal{V}_o$ with $o < r$, and $\mathcal{V}_o$ added a new shard to the same level that $\mathcal{V}_r$ did, it is possible that the temporal ordering properties of the shards on the level may be violated. Recall that supporting tombstone-based deletes requires that shards be strictly ordered within each level by their age to ensure correct tombstone cancellation (Section~\ref{sssec:dyn-deletes}). \item The shards that were removed from $\mathcal{V}_r$ by the reconstruction will still be present in $\mathcal{V}_{a^\prime}$ and so may be reintroduced into the new version, again leading to duplication of records. It is non-trivial to identify these shards during the merge to skip over them, because the shards don't have a unique identifier other than their pointers, and using the pointers for this check can lead to the ABA problem under the reference-counting-based memory management scheme the framework is built on. \end{enumerate} The first two of these problems result from a simple synchronization problem and can be solved using locking. A maintenance reconstruction operates on some level $\mathscr{L}_i$, merging and then deleting shards from that level and placing the result in $\mathscr{L}_{i+1}$. In order for either of these problems to occur, multiple concurrent reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager can be introduced into the framework to allow reconstructions to lock entire levels.
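Such a level lock manager can be quite simple, since each reconstruction only needs to try-lock its source level and release it when finished. The following sketch uses one atomic flag per level; the class name and interface are illustrative assumptions, not the framework's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// One non-blocking lock per level. A reconstruction that fails to
// acquire its source level's lock simply is not scheduled.
class LevelLockManager {
    // unique_ptr wrappers because std::atomic is neither copyable
    // nor movable, which std::vector storage would otherwise require.
    std::vector<std::unique_ptr<std::atomic<bool>>> locks;

public:
    explicit LevelLockManager(std::size_t levels) {
        for (std::size_t i = 0; i < levels; i++)
            locks.push_back(std::make_unique<std::atomic<bool>>(false));
    }

    // Returns false if another reconstruction already holds the level.
    bool try_lock(std::size_t level) {
        bool expected = false;
        return locks[level]->compare_exchange_strong(expected, true);
    }

    void unlock(std::size_t level) { locks[level]->store(false); }
};
```

Note that only the \emph{source} level is locked; as discussed next, appending a shard to the target level from another reconstruction is harmless.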
A reconstruction can only be scheduled if it is able to acquire the lock on the level that it is using as the \emph{source} for its shards. Note that there is no synchronization problem with a concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard to $\mathscr{L}_i$. This will not violate any ordering properties or result in any duplication of records. Thus, each reconstruction only needs to lock a single level. The final problem is a bit trickier to address, but is fundamentally an implementation detail. Our approach for resolving it is to change the way that maintenance reconstructions produce a version in the first place. Rather than taking a copy of $\mathcal{V}_a$, manipulating it to perform the reconstruction, and then reconciling it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay \emph{all} structural updates to the version until installation time. When a reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken, instead of a copy. Then, any new shards are built based on the contents of $\mathcal{V}_a$, but no updates to the structure are made. Once all of the shard reconstructions are complete, the version installation process begins. The thread running the reconstruction waits for its turn to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To this copy, the newly created shards are added, and any necessary deletes are performed. Because the shards to be deleted are still referenced by, at minimum, the reconstruction thread's reference to $\mathcal{V}_a$, pointer equality can be used to identify them and the ABA problem avoided. Then, once all the updates are complete, the new version can be installed. This process does push a fair amount of work to the moment of install, between when a version id is claimed by the reconstruction thread and when that version id becomes active. During this time, any buffer flushes will be blocked.
However, relative to the work associated with actually performing the reconstructions, the overhead of these metadata operations is fairly minor, and so it doesn't have a significant effect on buffer flush performance. \subsubsection{Mutable Buffer} \begin{figure} \centering \subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}} \subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}} \subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}} \caption{\textbf{Versioning process for the mutable buffer.} A schematic view of the mutable buffer demonstrating the three pointers representing its state, and how they are adjusted as inserts occur. Dark grey slots represent the currently active version, light grey slots the old version, and white slots are available space.} \label{fig:tl-buffer} \end{figure} Next, we'll address concurrent access and versioning of the mutable buffer. In our system, the mutable buffer consists of a large ring buffer with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In order to support versioning, the buffer actually uses two head pointers, one called \texttt{head} and one called \texttt{old head}, along with a single \texttt{tail} pointer. Records are inserted into the buffer by atomically incrementing \texttt{tail} and then placing the record into the slot. For records that cannot be atomically assigned, a visibility bit can be used to ensure that concurrent readers don't access a partially written value. \texttt{tail} can be incremented until it matches \texttt{old head}, or until the current version of the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$ records. 
At this point, any further writes would either clobber records in the old version, or exceed the user-specified buffer capacity, and so any inserts must block until a flush has been completed. Flushes are triggered based on a user-configurable set point, $N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation is scheduled. The location of \texttt{tail} is recorded as part of the flush, but records can continue to be inserted until one of the blocking conditions in the previous paragraph is reached. When the flush has completed, a new shard is created containing the records between \texttt{head} and the value of \texttt{tail} at the time the flush began. The buffer version can then be advanced by setting \texttt{old head} to \texttt{head} and setting \texttt{head} to the recorded value of \texttt{tail}. All of the records associated with the old version are freed, and the records that were just flushed now become part of the old version. The reason for this scheme is to allow threads accessing an older version of the dynamized structure to still see a current view of all of the records. These threads will have a reference to a dynamized structure containing none of the records in the buffer, as well as the old head. Because the older version of the buffer always directly precedes the newer, all of the buffered records are visible to this older version. However, threads accessing the more current version of the buffer will \emph{not} see the records contained between \texttt{old head} and \texttt{head}, as these records will have been flushed into the structure and are visible to the thread there. If this thread could still see records in the older version of the buffer, then it would see these records twice, which is incorrect. One consequence of this technique is that a buffer flush cannot complete until all threads referencing \texttt{old head} have completed.
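The pointer manipulations described above can be sketched as follows. This is an illustrative, effectively single-producer simplification with assumed names and an integer record type; it omits the reference counting of head pointers and the visibility bits required by the real, concurrent buffer.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Two-headed ring buffer sketch: [old_head, head) is the old version,
// [head, tail) is the current version. Capacity is 2 * N_B so a full
// old version and a full current version can coexist.
class VersionedBuffer {
    std::vector<int> slots;
    std::size_t cap;
    std::atomic<std::size_t> tail{0};
    std::size_t head = 0;      // start of the current version
    std::size_t old_head = 0;  // start of the old (flushing) version

public:
    explicit VersionedBuffer(std::size_t n_b)
        : slots(2 * n_b), cap(2 * n_b) {}

    // An insert fails if the current version already holds N_B records,
    // or if the write would clobber records in the old version.
    bool insert(int rec) {
        std::size_t t = tail.load();
        if (t - head >= cap / 2) return false;  // current version full
        if (t - old_head >= cap) return false;  // would clobber old version
        std::size_t slot = tail.fetch_add(1);
        slots[slot % cap] = rec;
        return true;
    }

    // Advance the version after flushing the records in [head,
    // flush_tail): the just-flushed records become the old version.
    void advance_version(std::size_t flushed_tail) {
        old_head = head;
        head = flushed_tail;
    }

    std::size_t current_count() const { return tail.load() - head; }
    std::size_t flush_tail() const { return tail.load(); }
};
```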
To ensure that this is the case, the two head pointers are reference counted, and a flush will stall until all references to \texttt{old head} have been removed. In principle, this problem could be reduced by allowing for more than two heads, but it becomes difficult to atomically transition between versions in that case, and it would also increase the storage requirements for the buffer, which requires $N_B$ space per available version. \subsection{Concurrent Queries} Queries are answered based upon the active version of the structure at the moment the query begins to execute. When the query routine of the dynamization is called, a query is scheduled. Once a thread becomes available, the query will begin to execute. At the start of execution, the query thread takes a reference to $\mathcal{V}_a$ as well as the current \texttt{head} and \texttt{tail} of the buffer. Both $\mathcal{V}_a$ and \texttt{head} are reference counted, and will be retained for the duration of the query. Once the query has finished processing, it will return the result to the user via an \texttt{std::promise} and release its references to the active version and buffer. \subsubsection{Query Preemption} Because our implementation only supports two active head pointers in the mutable buffer, queries can lead to insertion stalls. If a long-running query is holding a reference to \texttt{old head}, then an active buffer flush of the old version will be blocked by this query. If this blocking goes on for sufficiently long, then the buffer may fill up and the system will begin to reject inserts. One possible solution to this problem is to process the \texttt{buffer\_query} first, and then discard the reference to \texttt{old head}, allowing the buffer flush to proceed. However, this would not work for iterative deletion decomposable search problems, which may require re-processing the buffer query arbitrarily many times.
As a result, we instead implement a simple preemption mechanism to cut short long-running queries. The framework keeps track of how long a buffer flush has been stalled by queries maintaining references to \texttt{old head}. Once this stalling passes a user-defined threshold, a preemption flag is set, instructing the queries in question to restart themselves. This is implemented fully within the framework code, requiring no changes to user-provided query code, as the framework query mechanism simply checks this flag in between calls to user code. If a query sees that this flag is set, it will release its references to the \texttt{old head} and structure version, and automatically put itself back in the scheduling queue to be retried against newer versions of the structure and buffer. Note that, if misconfigured, it is possible that this mechanism will entirely prevent certain long-running queries from being answered. If the threshold for preemption is set lower than the expected run-time of a valid query, it's possible that the query will loop forever if the system is experiencing sufficient insertion pressure. To help avoid this, another parameter is available to specify a maximum preemption count, after which a query will ignore a request for preemption. \subsection{Insertion Stall Mechanism} The results of Theorems~\ref{theo:worst-case-optimal} and \ref{theo:par-worst-case-optimal} are based upon enforcing a rate limit upon incoming inserts, to ensure that there is sufficient time for reconstructions to complete. In practice, calculating and precisely stalling for the correct amount of time is quite difficult because of the vagaries of working with a real system.
While ultimately it would be ideal to have a reasonable cost model that can estimate this stall time on the fly based on the cost of building a data structure, the number of records involved in reconstructions, the number of available threads, available memory bandwidth, etc., for the purposes of this prototype we have settled on a simple system that demonstrates the robustness of the technique. Recall the basic insert process within our system. Inserts bypass the scheduling system and communicate directly with the buffer, on the same client thread that called the function, to maximize insertion performance and eliminate as much concurrency control overhead as possible. The insert routine is synchronous, and returns a boolean indicating whether the insert has succeeded or not. The insert can fail if the buffer is full, in which case the user is expected to delay for a moment and retry. Once space has been cleared in the buffer, the insert will succeed. We can leverage this same rejection mechanism as a form of rate limiting, by probabilistically rejecting inserts. To this end, we introduce a configurable parameter indicating the probability that an insert will succeed. For each insert, we use Bernoulli sampling based upon this probability to determine whether the insert should be attempted or not. If the insert is rejected, then the attempt is aborted and the function returns a failure. The user can then delay and attempt again. This approach was selected because it has a few specific advantages. First, it is based on a single parameter that can be readily updated on demand using atomics. Our current prototype uses a single, fixed value for the probability, but ultimately it should be dynamically tuned to approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal} as closely as possible. Second, random sampling to apply stalls ensures fairness.
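The rejection mechanism itself amounts to a Bernoulli trial per insert. The following sketch illustrates the idea; the \texttt{RateLimiter} name and interface are assumptions for illustration, and in the actual system the success probability is a single value that could be stored in an atomic and updated on demand.

```cpp
#include <random>

// Each insert is admitted with probability delta and rejected
// otherwise; a rejected caller is expected to delay and retry.
class RateLimiter {
    std::mt19937 gen;
    std::bernoulli_distribution accept;

public:
    explicit RateLimiter(double delta, unsigned seed = 0)
        : gen(seed), accept(delta) {}

    // Returns false when the insert should be rejected outright.
    bool admit() { return accept(gen); }
};
```

Over many inserts, the admitted fraction converges to $\delta$, so the expected extra delay per insert is the retry delay scaled by $1 - \delta$, which is how a few coarse stalls approximate many impractically small ones.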
Ultimately, there's a limit on how small a stall can be meaningfully applied, particularly if we want to avoid busy waiting. This approach allows us to fairly approximate these small stalls by attaching larger, more practical, delays to randomly sampled inserts. Finally, the approach is simple and requires no significant changes to the user interface, while (as we will see in Section~\ref{sec:tl-eval}) directly exposing the design space associated with the parameter to the user. \section{Evaluation} \label{sec:tl-eval} In this section, we perform several experiments to evaluate the ability of the system proposed in Section~\ref{sec:tl-impl} to control tail latencies. \subsection{Stall Proportion Sweep} \begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 200M Records} \label{fig:tl-stall-200m} \end{figure} First, we will consider the insertion and query performance of our system at a variety of stall proportions. The purpose of this testing is to demonstrate that introducing stalls into the insertion process is able to reduce the insertion tail latency, while matching the general insertion and query performance of a strict tiering policy. Recall that, in the insertion stall case, no explicit shard capacity limits are enforced by the framework. Reconstructions are triggered with each buffer flush on all levels exceeding a specified shard count ($s = 4$ in these tests), and the buffer flushes immediately when full with no regard to the state of the structure. Thus, limiting the insertion rate is the only means the system uses to maintain its shard count at a manageable level.
These tests were run on a system with sufficient available resources to fully parallelize all reconstructions. First, Figure~\ref{fig:tl-stall-200m} shows the results of testing insertion of the 200 million record SOSD \texttt{OSM} dataset into a dynamized ISAM tree, using our insertion stalling technique as well as strict tiering. We inserted $30\%$ of the records, and then measured the individual latency of each insert after that point to produce Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard} was produced by recording the number of shards in the dynamized structure each time the buffer flushed. Note that a stall value of $\delta = 1$ indicates no stalling at all, and values less than one indicate a $1 - \delta$ probability of an insert being rejected. Thus, a lower stall value means more stalls are introduced. The tiering policy is strict tiering with a scale factor of $s=4$. It uses the concurrency control scheme described in Section~\ref{ssec:dyn-concurrency}. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all of the insertion rejection probabilities succeed in greatly reducing tail latency relative to tiering. Additionally, it shows a small amount of available tuning of the worst-case insertion latencies, with higher stall amounts reducing the tail latencies slightly at various points in the distribution. This latter effect results from the buffer flush latency hiding mechanism, which was retained from Chapter~\ref{chap:framework}. The buffer actually has space for two versions, and the second version can be filled while the first is flushing. This means that, for more aggressive stalling, some of the time spent blocking on the buffer flush is redistributed over the inserts into the second version of the buffer, rather than resulting in a stall. Of course, if the query latency is severely affected by the use of this mechanism, it may not be worth using.
Thus, in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of various shard counts within the dynamized structure for each stalling amount, as well as strict tiering. We have elected to examine the shard count, rather than the query latencies, for this purpose because this technique is intended to directly control the number of shards, and we wish to demonstrate that it does so. Of course, shard count control is necessary for the sake of query latencies, and we will consider query latency directly later. This figure shows that, even with no insertion throttle at all, the shard count within the structure remains well behaved and normally distributed, albeit with a slightly longer tail and a higher average value. Once stalls are introduced, though, it is possible both to reduce the tail and to shift the peak of the distribution through a variety of points. In particular, we see that a stall of $0.99$ is sufficient to move the peak very close to tiering, and lower stalls are able to further shift the peak of the distribution to even lower counts. \begin{figure} \centering \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\ \caption{Insertion and Shard Count Distributions for ISAM with 4B Records} \label{fig:tl-stall-4b} \end{figure} To validate that these results were not simply an artifact of the relatively small size of the data set used, we repeated the exact same testing using a set of four billion uniform integers, and these results are shown in Figure~\ref{fig:tl-stall-4b}.
These results align with those for the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the same improvements in insertion tail latency for all stall amounts, and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard count. If anything, the gap between strict tiering and un-throttled insertion is narrower with the larger data set than the smaller one. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}} \subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\ \caption{Insertion and Shard Count Distributions for VPTree} \label{fig:tl-stall-knn} \end{figure} Finally, we considered our dynamized VPTree in Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of about one million 300-dimensional vectors. This test shows some of the possible limitations of our fixed rejection rate. The ISAM tree tested above is constructible in roughly linear time, being an MDSP with $B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n, k)}{n}$ used to determine the optimal insertion stall rate is asymptotically a constant. For VPTree, however, the construction cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also generally much larger in absolute time requirements. We can see in Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution is very poorly behaved for smaller stall amounts, with the shard count following a roughly uniform distribution for a stall rate of $1$. This means that the background reconstructions are not capable of keeping up with buffer flushing, and so the number of shards grows significantly over time.
Introducing stalls does shift the distribution closer to normal, but a much larger stall rate is required to obtain a shard count distribution close to that of strict tiering than was the case with the ISAM tree test. It is still possible, though, even with our simple fixed-stall-rate implementation. Additionally, this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce the tail latency substantially compared to strict tiering, with the same latency distribution effects for larger stall rates as were seen in the ISAM examples. Thus, we've shown that introducing even a fixed stall while allowing the internal structure of the dynamization to develop naturally is able to match the shard count distribution of strict tiering, while having significantly lower insertion tail latencies. \subsection{Insertion Stall Trade-off Space} While we have shown that introducing insertion stalls accomplishes the goal of reducing tail latencies while matching the shard count of a strict tiering reconstruction strategy, we've not yet addressed what the actual performance of this structure is. By throttling inserts, we potentially reduce the insertion throughput. And, further, it isn't immediately obvious just how much query performance suffers as the shard count distribution shifts. In this test, we examine the average values of insertion throughput and query latency over a variety of stall rates. The results of this test for ISAM with the SOSD \texttt{OSM} dataset are shown in Figure~\ref{fig:tl-latency-curve}, which shows the insertion throughput plotted against the average query latency for our system at various stall rates, with tiering configured with an equivalent scale factor marked as a red point for reference. This plot shows two interesting features of the insertion stall mechanism. First, it is possible to introduce stalls that do not significantly affect the write throughput, but do improve query latency.
This is seen in the difference between the two points at the far right of the curve, where introducing a slight stall improves query performance at virtually no cost. This represents the region of the curve where the stalling introduces delay that doesn't exceed the cost of a buffer flush, and so the total amount of time the system spends stalling doesn't change much. The second, and perhaps more notable, point that this plot shows is that the stall rate provides a rich design trade-off between query and insert performance. In fact, this space is far more useful than the trade-off space represented by layout policy and scale factor selection using the strict reconstruction schemes that we examined in Chapter~\ref{chap:design-space}. At the upper end of the insertion-optimized region, we see more than double the insertion throughput of tiering (with significantly lower tail latencies as well) at the cost of a slightly more than 2x increase in query latency. Moving down the curve, we see that we are able to roughly match the performance of tiering within this space, and even shift to more query-optimized configurations. \begin{figure} \centering \includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \caption{Insertion Throughput vs. Query Latency for ISAM with 200M Records} \label{fig:tl-latency-curve} \end{figure} This is a very interesting result. Not only is our approach able to match a strict reconstruction policy in terms of average query and insertion performance with better tail latencies, but it is even able to provide a superior set of design trade-offs to the strict policies, at least in environments where sufficient parallel processing and memory are available to leverage parallel reconstructions. \section{Conclusion} In this chapter, we addressed the last of the three major problems of dynamization: tail latency.
We proposed a technique for limiting the rate of insertions to match the rate of reconstruction that is able to match the worst-case optimized approach of Overmars~\cite{overmars81} on a single thread, and to exceed it given multiple parallel threads. We then implemented the necessary mechanisms to support this technique within our framework, including a significantly improved architecture for scheduling and executing parallel and background reconstructions, and a system for rate limiting by rejecting inserts via Bernoulli sampling. We evaluated this system for fixed insertion rejection rates, and found significant improvements in tail latencies, approaching the practical lower bound we established using the equal block method, without requiring significant degradation of query performance. In fact, we found that this rate limiting mechanism provides a design space with more effective trade-offs than the one we examined in Chapter~\ref{chap:design-space}, with the system being able to exceed the query performance of an equivalently configured tiering system for certain rate limiting configurations. The method does have limitations: assigning a fixed rejection rate to inserts works well for linear-time constructible structures like the ISAM tree, but was significantly less effective for the VPTree, which requires $\Theta(n \log n)$ time to construct. For structures like this, it will be necessary to dynamically scale the amount of throttling based on the record count and the size of the reconstruction. Additionally, our current system isn't easily capable of reaching the ``ideal'' goal of being able to reliably trade query performance and insertion latency at a fixed throughput. Nonetheless, the mechanisms for supporting such features are present, and even this simple implementation represents a marked improvement in terms of both insertion tail latency and configurability.