\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency}

\section{Introduction}

\begin{figure}
\subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
\caption{Insertion Performance of Dynamized ISAM vs. B+Tree}
\label{fig:tl-btree-isam}
\end{figure}

Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While these techniques result in structures with reasonable, or even good, insertion throughput, the latency of each individual insert is wildly variable. To illustrate this problem, consider the insertion performance shown in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. As shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has average performance comparable to the native dynamic structure, but the latency distributions are quite different. Figure~\ref{fig:tl-btree-isam-lat} shows the two distributions. While the dynamized structure has much better ``best-case'' performance, its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out.

This poor worst-case performance is a direct consequence of the strategies used by the two structures to support updates. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a portion of the data structure needs to be adjusted. When using techniques based on global reconstruction, however, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM) or at least a very large proportion of it (for leveling). Because our dynamization technique uses buffering, and because the logarithmic decomposition used to partition the structure keeps most of the shards involved in reconstruction small, the majority of inserts are low cost compared to the B+Tree. At the extreme end of the latency distribution, however, the local reconstruction strategy used by the B+Tree results in better worst-case performance.

Unfortunately, the design space that we have considered thus far is limited in its ability to meaningfully alter worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling allows for a small reduction in the worst case, but at the cost of making the majority of inserts slower because of increased write amplification.
\begin{figure}
\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
\subfloat[Buffer Size Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-parm-bs}} \\
\caption{Design Space Effects on Latency Distribution}
\label{fig:tl-parm-sweep}
\end{figure}

Additionally, the other tuning knobs available to us are of limited usefulness for tuning worst-case behavior. Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and the buffer size (Figure~\ref{fig:tl-parm-bs}). There is no clear trend in worst-case performance to be seen here. This is to be expected: ultimately, the worst-case reconstruction is largely the same regardless of scale factor or buffer size, namely a reconstruction involving $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space.

Thus, in this chapter, we will look beyond the design space we have considered so far and design a dynamization system that allows tail latency to be tuned in a meaningful way. To accomplish this, we will consider a different way of looking at reconstructions within dynamized structures.

\section{The Insertion-Query Trade-off}

As reconstructions are at the heart of the insertion tail latency problem, it is worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Reconstructions serve to place a bound on the number of blocks, so that bounds on query performance can be enforced. This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. As a reminder, this technique provides the following worst-case insertion and query bounds,
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}_s\left(\frac{n}{f(n)}\right)\right)
\end{align*}
where $f(n)$ is the number of blocks and $\mathscr{Q}_s$ is the query cost of the underlying static structure. Figure~\ref{fig:tl-ebm-trade-off} shows the trade-off between insertion and query performance for a dynamized ISAM tree using the equal block method, for various numbers of blocks. The trade-off is evident in the figure, with a linear relationship between insertion throughput and query latency, mediated by the number of blocks in the dynamized structure (the block counts are annotated on each point in the plot). As the number of blocks is increased, their size is reduced, leading to less expensive inserts in terms of both amortized and worst-case cost. However, the additional blocks make queries more expensive.
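To make the source of these bounds concrete, below is a minimal C++ sketch of the equal block method, using a sorted array as a stand-in for an arbitrary static structure with a linear build cost. The names (\texttt{EqualBlocks}, \texttt{insert}, \texttt{contains}) are illustrative only and do not correspond to interfaces in our framework.

\begin{verbatim}
#include <vector>
#include <algorithm>
#include <cstddef>

// Equal block method sketch: f(n) blocks of (roughly) equal size.
// Each block is a sorted array, standing in for any static
// structure that is rebuilt from scratch on modification.
class EqualBlocks {
public:
    explicit EqualBlocks(size_t block_count)
        : m_blocks(block_count) {}

    // Worst-case insert: rebuild one block of Theta(n / f(n))
    // records -- the source of the I(n) bound above.
    void insert(int key) {
        auto &block = m_blocks[m_next];
        m_next = (m_next + 1) % m_blocks.size();

        // "Rebuild" the block with the new record included. A real
        // static structure would be reconstructed from scratch.
        block.push_back(key);
        std::sort(block.begin(), block.end());
    }

    // Query: f(n) independent searches over blocks of n / f(n)
    // records each, with the results combined (here, membership).
    bool contains(int key) const {
        for (const auto &block : m_blocks) {
            if (std::binary_search(block.begin(), block.end(), key))
                return true;
        }
        return false;
    }

private:
    std::vector<std::vector<int>> m_blocks;
    size_t m_next = 0;  // round-robin keeps block sizes equal
};
\end{verbatim}

Each insert rebuilds exactly one block of $\Theta\left(\frac{n}{f(n)}\right)$ records, and each query must probe all $f(n)$ blocks; this is precisely the trade-off, mediated by block count, that is visible in Figure~\ref{fig:tl-ebm-trade-off}.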
\begin{figure}
\centering
\includegraphics[width=.75\textwidth]{img/tail-latency/ebm-count-sweep.pdf}
\caption{The Insert-Query Trade-off for the Equal Block Method with varying number of blocks}
\label{fig:tl-ebm-trade-off}
\end{figure}

While the equal block method does allow for direct tuning of the worst-case insert cost, as well as exposing a very clean trade-off space for average query and insert performance, the technique is not well suited to our purposes because its amortized insertion performance is not particularly good: the insertion throughput is many times worse than is possible with our dynamization framework at an equivalent query latency.\footnote{In actuality, the insertion performance of the equal block method is even \emph{worse} than the numbers presented here. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any repartitioning as the structure grew, and reduced write amplification.} This is because, in our Bentley-Saxe-based technique, the variable size of the blocks allows the majority of reconstructions to occur over small structures, while the majority of the records reside in a single large block at the bottom of the structure. This arrangement enables high insertion throughput while keeping the block count small. But, as we have seen, the cost of this is large tail latencies.

However, we can use the extreme ends of the equal block method's design space to establish upper limits on the insertion and query performance that we might expect to get out of a dynamized structure. Consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method with every block fixed at a capacity of $N_B$ records. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in a worst-case query cost of $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ for a decomposable search problem. Applying this technique to an ISAM tree and comparing it against a B+Tree yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}.

\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
\subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\
\caption{Latency Distributions for a ``Reconstructionless'' Dynamization}
\label{fig:tl-floodl0}
\end{figure}

Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases. In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to the B+Tree, as shown in Figure~\ref{fig:tl-floodl0-query}.
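The following is a minimal C++ sketch of this reconstructionless configuration, again using a sorted array as a stand-in for the static structure (e.g., an ISAM tree), with illustrative names (\texttt{Shard}, \texttt{Reconstructionless}) that are not drawn from our framework.

\begin{verbatim}
#include <vector>
#include <algorithm>
#include <cstddef>

// An immutable shard, built once from a full buffer in B(N_B) time
// and never touched again.
struct Shard {
    std::vector<int> keys;
    explicit Shard(std::vector<int> buf) : keys(std::move(buf)) {
        std::sort(keys.begin(), keys.end());
    }
};

class Reconstructionless {
public:
    explicit Reconstructionless(size_t buffer_cap)
        : m_cap(buffer_cap) {}

    // Worst-case insert cost is Theta(B(N_B)): an O(1) append,
    // plus an occasional flush that builds one fixed-size shard.
    void insert(int key) {
        m_buffer.push_back(key);
        if (m_buffer.size() >= m_cap) {
            m_shards.emplace_back(std::move(m_buffer));
            m_buffer.clear();
        }
    }

    // Queries must touch all Theta(n / N_B) shards -- the source
    // of the poor query latency in Figure fig:tl-floodl0-query.
    bool contains(int key) const {
        for (const auto &s : m_shards) {
            if (std::binary_search(s.keys.begin(),
                                   s.keys.end(), key))
                return true;
        }
        return std::find(m_buffer.begin(), m_buffer.end(), key)
                   != m_buffer.end();
    }

private:
    size_t m_cap;
    std::vector<int> m_buffer;
    std::vector<Shard> m_shards;
};
\end{verbatim}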
While this approach is not useful on its own, it does demonstrate that amortized global reconstruction is not fundamentally incompatible with good insertion tail latency. The challenge, which we take up in the remainder of this chapter, is to recover acceptable query performance while retaining as much of this insertion behavior as possible.

\section{Relaxed Reconstruction}

There is existing theoretical work in this area, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach to controlling the worst-case insertion cost is to break the largest reconstructions up into small sequences of operations that can then be attached to individual inserts, spreading the total workload out and ensuring that each insert takes a consistent amount of time. In principle, the total throughput should remain about the same, but rather than a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be far more uniform. Unfortunately, this technique has a number of limitations, which we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are
\begin{enumerate}
\item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem forces the worst-case-optimized dynamization systems to maintain complicated arrangements of partially built structures.
\item The approach assumes that the workload of building a block can be evenly divided in advance and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure's reconstruction routines, and does not admit simple, generalized interfaces.
\end{enumerate}

In this section, we consider how these restrictions can be overcome within our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized techniques, given a few assumptions about available resources. At a very high level, our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately be flushed and a new shard placed in L0. Reconstructions will be performed in the background to maintain the internal structure according, roughly, to the tiering layout policy. When a level contains $s$ shards, a reconstruction will immediately be triggered to merge these shards and push the result down to the next level. To ensure that the number of shards in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced against the amount of time needed to complete reconstructions.

\begin{figure}
\caption{Several ``states'' of tiering, leading up to the worst-case reconstruction.}
\label{fig:tl-tiering}
\end{figure}

First, we'll consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last-level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, and it will allow us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}.
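To make the throttling mechanism concrete, below is a minimal C++ sketch of the insertion path under this scheme, assuming that reconstructions run on a separate background thread (not shown). The names and the spin-wait are illustrative; the actual mechanism is discussed in the implementation section later in this chapter.

\begin{verbatim}
#include <atomic>
#include <chrono>

// Throttled insertion path sketch: each insert appends to the
// buffer and then stalls, spreading the cost of in-flight
// background reconstructions evenly across inserts.
class StalledInserter {
public:
    // Per the analysis below, the stall should be approximately
    // (B(n)/n) * log n: the last-level rebuild cost spread over
    // Theta(n) inserts, once per level of the structure.
    void set_stall(std::chrono::nanoseconds per_insert_stall) {
        m_stall_ns.store(per_insert_stall.count(),
                         std::memory_order_relaxed);
    }

    void insert(int key) {
        append_to_buffer(key);  // Theta(1); a full buffer is
                                // handed to the background thread

        auto ns = m_stall_ns.load(std::memory_order_relaxed);
        if (ns > 0) {
            auto deadline = std::chrono::steady_clock::now()
                          + std::chrono::nanoseconds(ns);
            // Spin rather than sleep: OS sleeps are far too
            // coarse for sub-microsecond stalls.
            while (std::chrono::steady_clock::now() < deadline) { }
        }
    }

private:
    void append_to_buffer(int /*key*/) {
        // append to the mutable buffer; flush when full
    }

    std::atomic<long long> m_stall_ns{0};
};
\end{verbatim}

The remaining question is how large the stall must be to guarantee that reconstructions always complete before their results are needed, which the following analysis makes precise.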
This leads us to the following result.

\begin{theorem}
Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of
\begin{equation}
I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right)
\end{equation}
\end{theorem}
\begin{proof}
Consider the cost of the worst-case reconstruction, which under tiering will be $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction has been completed by the time these $\Theta(n)$ inserts have been performed, it suffices to throttle the rate of inserts. Ignoring the cost of buffer flushing, this means that inserts must cost
\begin{equation*}
I(n) \in \Theta(1 + \delta)
\end{equation*}
where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must stall for at least $\frac{B(n)}{n}$ time to fully cover the cost of the last-level reconstruction.

However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last-level reconstruction can be scheduled, there will be exactly $1$ shard on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ shards. There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level,
\begin{equation*}
I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
\end{equation*}
All of these internal reconstructions are strictly smaller than the last-level reconstruction, and so we can bound the stall required by each of them above by $\frac{B(n)}{n}$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that
\begin{equation*}
I(n) \in \Theta\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
\end{proof}

This approach results in a worst-case insertion latency bound equivalent to that of~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's shard code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we avoid the need for any complicated additional structures to manage partially built shards as new records are added.

\section{Implementation}
\subsection{Parallel Reconstruction Architecture}
\subsection{Concurrent Queries}
\subsection{Query Pre-emption}
\subsection{Insertion Stall Mechanism}
\section{Evaluation}
\section{Conclusion}