\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+Tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance shown in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has average performance comparable to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. While the dynamized structure has much better ``best-case'' performance, its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. This poor worst-case performance is a direct consequence of the different approaches to update support used by the dynamized structure and the B+Tree. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted. However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). The fact that our dynamization technique uses buffering, and that most of the shards involved in reconstructions are kept small by the logarithmic decomposition used to partition the data, ensures that the majority of inserts are low cost compared to the B+Tree; but at the extreme end of the latency distribution, the local reconstruction strategy used by the B+Tree results in better worst-case performance. Unfortunately, the design space that we have been considering thus far is limited in its ability to meaningfully alter the worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling can allow for a small reduction in the worst case, but at the cost of making the majority of inserts worse because of increased write amplification.
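Recall that, for a logarithmic decomposition with a constant scale factor, the amortized cost of an insert is on the order of $\frac{B(n)}{n}\log n$, where $B(n)$ denotes the cost of building the static structure over $n$ records, while the worst-case insert must rebuild the largest shards in the structure,
\begin{equation*}
\underbrace{\Theta\left(\frac{B(n)}{n}\log n\right)}_{\text{amortized insertion cost}}
\qquad \text{vs.} \qquad
\underbrace{\Theta\left(B(n)\right)}_{\text{worst-case insertion cost}}
\end{equation*}
No choice of layout policy, scale factor, or buffer size changes this worst case asymptotically; these parameters influence only how frequently, and in what pattern, that cost is paid.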
\begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} The other tuning knobs that are available to us are of limited usefulness in tuning the worst-case behavior. Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}), respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer size has almost no effect on the worst-case latency at all, or even on the distribution, particularly when tiering is used. This is to be expected: ultimately, the worst-case reconstruction is largely the same regardless of scale factor or buffer size, namely a reconstruction involving $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins, temporarily, to exhibit better tail latency behavior than the other. However, after enough records have been inserted to cause the next largest reconstructions to begin to occur, the ``better'' configuration begins to appear worse again in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases. These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built up over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex--which is \emph{bad} on this plot, as it means a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just this, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics while allowing for significantly better insertion tail latencies.
We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice. \section{The Insertion-Query Trade-off} As reconstructions are at the heart of the insertion tail latency problem, it seems worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and is done using reconstructions to either merge the blocks (in the case of the Bentley-Saxe method) or re-partition them (in the case of the equal block method). Performing less frequent reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. As a reminder, this technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. Unlike the design space we have proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. Note that in this test the final record count was known in advance, allowing all re-partitioning to be avoided. This represents a sort of ``best case scenario'' for the technique, and isn't reflective of real-world performance, but it does serve to demonstrate the relevant properties in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction, and so increasing the block count results in smaller blocks and better insertion performance. These worst-case results also translate directly into improved average throughput, at the cost of query latency, as shown in Figure~\ref{fig:tl-ebm-tradeoff}. Note that, in contrast to our Bentley-Saxe-inspired dynamization system, the equal block method provides clear and direct relationships between insertion and query performance, as well as direct control over tail latency, through its design space. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with varying values of $f(n)$.} \label{fig:tl-ebm} \end{figure} Unfortunately, the equal block method is not well suited for our purposes.
Despite having a much cleaner trade-off space, its performance is strictly worse than that of our dynamization system. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, our technique provides significantly better insertion throughput.\footnote{ In actuality, the insertion performance of the equal block method would be even \emph{worse} in practice than the numbers presented here. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any re-partitioning as the structure grew, and reduced write amplification. } This is because, in our technique, the variable size of the blocks allows the majority of the reconstructions to occur with smaller structures, while allowing the majority of the records to exist in a small number of large blocks at the bottom of the structure. This setup enables high insertion throughput while keeping the block count small. But, as we've seen, the cost of this is large tail latencies, as the large blocks must occasionally be involved in reconstructions. However, we can use the extreme ends of the equal block method's design space to consider upper limits on the insertion and query performance that we might expect to get out of a dynamized structure, and then take steps within our own framework to approach these limits while retaining the desirable characteristics of the logarithmic decomposition. At the extreme end, consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method, with every block fixed at a capacity of $N_B$. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search problem. Applying this technique to an ISAM tree and comparing it against a B+Tree yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a ``Reconstructionless'' Dynamization} \label{fig:tl-floodl0} \end{figure} Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases. In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to the B+Tree, as shown in Figure~\ref{fig:tl-floodl0-query}. Unfortunately, the query latency of this technique is far too large for it to be useful; it is necessary to perform reconstructions to merge these small shards together to ensure good query performance.
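To make this limiting case concrete, the following sketch shows one way such a ``reconstructionless'' dynamization could be structured. It is a minimal illustration under simplifying assumptions rather than our framework's actual implementation: \texttt{Shard} stands in for any static structure (such as an ISAM tree) that can be built from a batch of records and answer a query over only its own records, and all of the names are hypothetical.
\begin{verbatim}
/*
 * Minimal sketch of a "reconstructionless" dynamization: the buffer is
 * flushed into a new immutable shard when it fills, and existing shards
 * are never merged. Shard is a placeholder for any static structure with
 * a build-from-records constructor and a per-shard query routine.
 */
#include <cstddef>
#include <memory>
#include <vector>

template <typename Record, typename Shard>
class ReconstructionlessIndex {
public:
    explicit ReconstructionlessIndex(size_t buffer_cap)
        : m_buffer_cap(buffer_cap) {}

    // Worst-case insert cost is B(N_B): flushing the buffer into one shard.
    void insert(const Record &rec) {
        m_buffer.push_back(rec);
        if (m_buffer.size() >= m_buffer_cap) {
            m_shards.push_back(std::make_unique<Shard>(m_buffer));
            m_buffer.clear();
        }
    }

    // Every query must visit all Theta(n / N_B) shards plus the buffer,
    // which is why this scheme has untenable query performance.
    template <typename Predicate>
    std::vector<Record> query(Predicate pred) const {
        std::vector<Record> result;
        for (const auto &shard : m_shards) {
            shard->query(pred, result);        // local query against one shard
        }
        for (const auto &rec : m_buffer) {     // linear scan of the buffer
            if (pred(rec)) {
                result.push_back(rec);
            }
        }
        return result;
    }

private:
    size_t m_buffer_cap;                          // N_B
    std::vector<Record> m_buffer;                 // mutable buffer
    std::vector<std::unique_ptr<Shard>> m_shards; // never reconstructed
};
\end{verbatim}
The insertion path never touches existing shards, which is what bounds its worst case at $B(N_B)$; but it also never reduces the shard count, which is what destroys query performance.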
However, this does raise an interesting point. Fundamentally, the reconstructions that contribute to tail latency are \emph{not} required from an insertion perspective; they are a query optimization. Thus, we could remove the reconstructions from the insertion process and perform them elsewhere. This could, theoretically, allow us to have our cake and eat it too. The only insertion bottleneck would become the buffer flushing procedure--as is the case in our hypothetical ``reconstructionless'' approach. Unfortunately, it is not as simple as pulling the reconstructions off of the insertion path and running them in the background, as this alone cannot provide us with a meaningful bound on the number of blocks in the dynamized structure. But it is still possible to provide this bound, if we are willing to throttle the insertion rate to be slow enough for the background reconstructions to keep up. In the next section, we'll discuss a technique based on this idea. \section{Relaxed Reconstruction} There is theoretical work in this area, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for controlling the worst-case insertion cost is to break the largest reconstructions up into sequences of small operations that can then be attached to individual inserts, spreading the total workload out and ensuring that each insert takes a consistent amount of time. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be far more uniform. Unfortunately, this technique has a number of limitations that we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are \begin{enumerate} \item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem requires the worst-case optimized dynamization systems to maintain complicated collections of partially built structures. \item The approach assumes that the workload of building a block can be evenly divided in advance, and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure reconstruction routines, and doesn't admit simple, generalized interfaces. \end{enumerate} In this section, we consider how these restrictions can be overcome given our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized techniques, given a few assumptions about available resources. First, a comment on nomenclature. We define the term \emph{last level}, $i = \ell$, to mean the level in the dynamized structure with the largest index value (and thereby the most records) and the \emph{first level} to mean the level with index $i=0$. Any level with $0 < i < \ell$ is called an \emph{internal level}. Relative to some level at index $i$, the \emph{next level} is the level at index $i + 1$. A reconstruction on a level involves the compaction of all blocks on that level into one larger block, which is then appended to the next level. At a very high level, our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately be flushed and a new shard will be placed in the first level.
Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ shards, a reconstruction will immediately be triggered to merge these shards and push the result down to the next level. To ensure that the number of shards in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced with the amount of time needed to complete reconstructions. \begin{figure} \caption{Several ``states'' of tiering, leading up to the worst-case reconstruction.} \label{fig:tl-tiering} \end{figure} First, we'll consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be completed to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, and it allows us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}. This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which under tiering will be $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction finishes before these $\Theta(n)$ inserts have been performed, it is sufficient to make the insertion rate appropriately slow. Ignoring the cost of buffer flushing, this means that inserts must cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must take at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level reconstruction. However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last level reconstruction can be scheduled, there will be exactly $1$ shard on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ shards.
There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so we can bound each of the corresponding stalls above by $\frac{B(n)}{n}$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in a worst-case insertion latency bound equivalent to that of~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's shard code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we avoid the need to add any complicated additional structures to manage partially built shards as new records are added. \subsection{Reducing Stall with Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there are two available threads of parallel execution, which allows for the reconstructions to run in parallel with inserts. The amount of necessary insertion stall can be significantly reduced, however, if more than two threads are available. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that covering only the cost of the last level reconstruction is insufficient to maintain the bound on the shard count. From the moment that the last level has filled, and this reconstruction can begin, every level within the structure will sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction. Consider a parallel implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last level reconstruction, and blocks all other reconstructions until it has been completed. Such an approach would result in a stall of $\delta = \frac{B(n)}{n}$ and complete the last level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ shards would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ shards in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active on the parallel reconstruction thread. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because at worst there will be a reconstruction running on every level, and each reconstruction will involve no more than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last level reconstruction cost, and the sum term accounts for the cost of the $s-1$ reconstructions on each internal level.
Dropping constants and expanding the sum results in \begin{equation*} B(n) \cdot \log n \end{equation*} total reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism allows us to reduce this. At the upper limit, assume that there are $\log n$ threads available for parallel reconstructions. This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $\log n$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ threads of parallelism to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these threads in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. A reconstruction on level $i$ will require $\Theta(B(N_B \cdot s^{i+1}))$ time to complete. Thus, the necessary stall to fully cover a reconstruction on level $i$ is this cost, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the structure above this level). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} necessary stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For all reconstructions running in parallel, the necessary stall is the maximum stall over all of the parallel reconstructions, \begin{equation*} \delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} \end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will always be the largest. Since $N_B \cdot s^{\ell + 1} \in \Theta(n)$, we find that \begin{equation*} I(n) \in O \left(\frac{B(n)}{n}\right) \end{equation*} is the worst-case insertion cost, while ensuring that all reconstructions are completed in time to maintain the shard bound, given $\log n$ parallel threads. \end{proof} \section{Implementation} The previous section demonstrated that, theoretically, it is possible to meaningfully control the tail latency of our dynamization system by relaxing the reconstruction process and throttling the insertion rate, rather than blocking insertions, as a means of controlling the shard count within the structure. However, there are a number of practical problems to be solved before this idea can be used in a real system. In this section, we discuss these problems and our approaches to solving them, producing a dynamization framework based upon this technique.
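Before turning to the individual components, it is worth sketching the general shape of the throttling logic that the preceding analysis suggests. The code below is purely illustrative and does not reflect our framework's actual interfaces: the per-level stall is computed as an estimated build cost divided by the number of inserts that can occur before that reconstruction must finish, mirroring the derivation of $\delta_i$ above, and the cost estimates (here \texttt{est\_build\_cost\_us}) are assumed to be supplied by some external cost model.
\begin{verbatim}
/*
 * Illustrative sketch of insertion throttling (not the framework's actual
 * interface). Each insert appends to the mutable buffer and then stalls
 * long enough to "pay for" its share of the outstanding background
 * reconstructions: the stall for a reconstruction on level i is its
 * estimated build cost divided by the number of inserts that can occur
 * before that reconstruction must finish.
 */
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct OngoingReconstruction {
    size_t level;              // index of the level being reconstructed
    double est_build_cost_us;  // estimate of B(N_B * s^(i+1)), in microseconds
};

class InsertionThrottle {
public:
    InsertionThrottle(size_t buffer_cap, size_t scale_factor)
        : m_nb(buffer_cap), m_s(scale_factor) {}

    // Records that fit in the buffer and the levels above level i; this is
    // the number of inserts available before the level-i reconstruction
    // must have finished.
    double capacity_above(size_t level) const {
        double cap = static_cast<double>(m_nb);    // the mutable buffer
        double level_cap = static_cast<double>(m_nb * m_s);
        for (size_t j = 0; j < level; j++) {
            cap += level_cap;                      // level j holds ~N_B * s^(j+1)
            level_cap *= m_s;
        }
        return cap;
    }

    // With one reconstruction thread per level, only the slowest ongoing
    // reconstruction constrains the insertion rate, so we take the maximum
    // of the per-level stalls.
    std::chrono::microseconds stall_for(
        const std::vector<OngoingReconstruction> &active) const {
        double max_stall = 0.0;
        for (const auto &r : active) {
            double per_insert = r.est_build_cost_us / capacity_above(r.level);
            max_stall = std::max(max_stall, per_insert);
        }
        return std::chrono::microseconds(static_cast<long>(max_stall));
    }

    // Called on the insertion path, after the record is appended to the buffer.
    void apply_stall(const std::vector<OngoingReconstruction> &active) const {
        auto delay = stall_for(active);
        if (delay.count() > 0) {
            std::this_thread::sleep_for(delay);
        }
    }

private:
    size_t m_nb;  // buffer capacity N_B
    size_t m_s;   // scale factor s
};
\end{verbatim}
With only a single background reconstruction thread, as in Theorem~\ref{theo:worst-case-optimal}, the per-reconstruction stalls would instead be summed rather than maximized.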
\subsection{Parallel Reconstruction Architecture} \subsection{Concurrent Queries} \subsubsection{Query Pre-emption} Because our implementation supports only a finite number of versions of the mutable buffer at any point in time, and insertions will stall once this limit is reached, it is possible for queries to introduce additional insertion latency. Queries hold a reference to the version of the structure they are using, which includes holding on to a buffer head pointer. If a query is particularly long-running, or otherwise stalled, it is possible that the query will block insertions by holding onto this head pointer. \subsection{Insertion Stall Mechanism} \section{Evaluation} \section{Conclusion}