\chapter{Controlling Insertion Tail Latency} \label{chap:tail-latency} \section{Introduction} \begin{figure} \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\ \caption{Insertion Performance of Dynamized ISAM vs. B+Tree} \label{fig:tl-btree-isam} \end{figure} Up to this point in our investigation, we have not directly addressed one of the largest problems associated with dynamization: insertion tail latency. While our dynamization techniques are capable of producing structures with good overall insertion throughput, the latency of individual inserts is highly variable. To illustrate this problem, consider the insertion performance in Figure~\ref{fig:tl-btree-isam}, which compares the insertion latencies of a dynamized ISAM tree with those of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are quite different. The dynamized structure has much better ``best-case'' performance, but its worst-case performance is exceedingly poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. This poor worst-case performance is a direct consequence of the different approaches to update support used by the dynamized structure and the B+Tree. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the update, even in the worst case only a small part of the data structure will need to be adjusted. However, when using techniques based on global reconstruction, the worst-case insert requires rebuilding either the entirety of the structure (for tiering or BSM), or at least a very large proportion of it (for leveling). Because our dynamization technique uses buffering, and because the logarithmic decomposition used to partition the structure keeps most of the shards involved in reconstruction small, the majority of inserts are low cost compared to the B+Tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+Tree results in significantly better worst-case performance. Unfortunately, the design space that we have been considering thus far is limited in its ability to meaningfully alter the worst-case insertion performance. While we have seen that the choice of layout policy can have some effect, the actual benefit in terms of tail latency is quite small, and the situation is made worse by the fact that leveling, which can have better worst-case insertion performance, lags behind tiering in terms of average insertion performance. The use of leveling can allow for a small reduction in the worst case, but at the cost of making the majority of inserts worse because of increased write amplification.
\begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} \subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\ \caption{Design Space Effects on Latency Distribution} \label{fig:tl-parm-sweep} \end{figure} The other tuning knobs that are available to us are of limited usefulness in tuning the worst-case behavior. Figure~\ref{fig:tl-parm-sweep} shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}), respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer size has almost no effect on the worst-case latency, or even on the distribution more broadly, particularly when tiering is used. This is to be expected; ultimately the worst-case reconstruction size is largely the same regardless of scale factor or buffer size: $\Theta(n)$ records. The selection of configuration parameters can influence \emph{when} these reconstructions occur, as well as slightly influence their size, but ultimately the question of ``which configuration has the best tail-latency performance'' is more a question of how many insertions the latency is measured over than of any fundamental trade-off within the design space. This is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in the distribution corresponds to reconstructions on a particular level. As can be seen, the lines cross each other repeatedly at these shelves. These cross-overs are points at which one configuration begins to temporarily exhibit better tail latency behavior than the other. However, after enough records have been inserted for the next-largest reconstructions to begin to occur, the ``better'' configuration begins to appear worse again in terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, the transitions are smoother, particularly as the scale factor increases. These smoother curves show the write amplification of leveling, where the largest shards are not created ``fully formed'' as they are in tiering, but rather are built over a series of merges. This slower growth results in the smoother transitions. Note also that these curves are convex, which is \emph{bad} on this plot, as it implies a higher probability of a high-latency reconstruction. } It seems apparent that, to resolve the problem of insertion tail latency, we will need to look beyond the design space we have thus far considered. In this chapter, we do just this, and propose a new mechanism for controlling reconstructions that leverages parallelism to provide similar amortized insertion and query performance characteristics, while also allowing for significantly better insertion tail latencies. We will demonstrate mathematically that our new technique is capable of matching the query performance of the tiering layout policy, describe a practical implementation of these ideas, and then evaluate that prototype system to demonstrate that the theoretical trade-offs are achievable in practice.
\section{The Insertion-Query Trade-off} As reconstructions are at the heart of the insertion tail latency problem, it seems worth taking a moment to consider \emph{why} they must be done at all. Fundamentally, decomposition-based dynamization techniques trade between insertion and query performance by controlling the number of blocks in the decomposition. Placing a bound on this number is necessary to bound the worst-case query cost, and is done using reconstructions to either merge (in the case of the Bentley-Saxe method) or re-partition (in the case of the equal block method) the blocks. Performing less frequent (or smaller) reconstructions reduces the amount of work associated with inserts, at the cost of allowing more blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block count is most directly visible in the equal block method described in Section~\ref{ssec:ebm}. This technique provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} where $f(n)$ is the number of blocks. This worst-case result ignores re-partitioning costs, which may be necessary for certain selections of $f(n)$. We omit these costs here because we are about to examine a case of the equal block method where no re-partitioning is necessary. When re-partitioning is used, the worst-case cost rises to the now familiar $I(n) \in \Theta(B(n))$ result. \begin{figure} \centering \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf} \label{fig:tl-ebm-tail-latency}} \\ \caption{The equal block method with $f(n) = C$ for varying values of C. \textbf{Plots not yet populated}} \label{fig:tl-ebm} \end{figure} Unlike the design space we have proposed in Chapter~\ref{chap:design-space}, the equal block method allows for \emph{both} trading off between insert and query performance, \emph{and} controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results of testing an implementation of a dynamized ISAM tree using the equal block method, with \begin{equation*} f(n) = C \end{equation*} for varying constant values of $C$. Note that in this test the final record count was known in advance, allowing all re-partitioning to be avoided. This represents a sort of ``best-case scenario'' for the technique, and is not reflective of real-world performance, but it does serve to demonstrate the relevant properties in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is dictated by the size of the largest reconstruction, and so increasing the block count results in smaller blocks, and better insertion performance. These worst-case results also translate directly into improved average throughput, at the cost of query latency, as shown in Figure~\ref{fig:tl-ebm-tradeoff}. These results show that, contrary to our Bentley-Saxe inspired dynamization system, the equal block method provides clear and direct relationships between insertion and query performance, as well as direct control over tail latency, through its design space.
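To make this relationship concrete, consider a hypothetical instantiation of these bounds matching the configuration used in this experiment: a constant block count, $f(n) = C$, and an ISAM tree block whose static query cost we take to be $\Theta(\log m)$ for a block of $m$ records. The bounds above then reduce to
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{C}\right) & \mathscr{Q}(n) &\in \Theta\left(C \cdot \log \frac{n}{C}\right)
\end{align*}
and so doubling the number of blocks halves the size of the largest possible reconstruction (and with it the worst-case insertion cost), while roughly doubling the query cost. This is precisely the direct relationship described above.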
Unfortunately, the equal block method is not well suited for our purposes. Despite having a much cleaner trade-off space, its performance is strictly worse than that of our dynamization system. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} shows that, for a specified query latency, our technique provides significantly better insertion throughput.\footnote{ In actuality, the insertion performance of the equal block method is even \emph{worse} than the numbers presented here suggest. For this particular benchmark, we implemented the technique knowing the number of records in advance, and so fixed the size of each block from the start. This avoided the need to do any re-partitioning as the structure grew, and reduced write amplification. } This is because, in our technique, the variable size of the blocks allows the majority of the reconstructions to occur with smaller structures, while allowing the majority of the records to exist in a small number of large blocks at the bottom of the structure. This setup enables high insertion throughput while keeping the block count small. But, as we have seen, the cost of this is large tail latencies, as the large blocks must occasionally be involved in reconstructions. However, we can use the extreme ends of the equal block method's design space to establish upper limits on the insertion and query performance that we might expect to get out of a dynamized structure, and then take steps within our own framework to approach these limits, while retaining the desirable characteristics of the logarithmic decomposition. At the extreme end, consider what would happen if we were to modify our dynamization framework to avoid all reconstructions. We retain a buffer of size $N_B$, which we flush to create a shard when full; however, we never touch the shards once they are created. This is effectively the equal block method, where every block is fixed at a capacity of $N_B$ records. Such a technique would result in a worst-case insertion cost of $I(n) \in \Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search problem. Applying this technique to an ISAM tree, and comparing it against a B+Tree, yields the insertion and query latency distributions shown in Figure~\ref{fig:tl-floodl0}. \begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} \subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\ \caption{Latency Distributions for a ``Reconstructionless'' Dynamization} \label{fig:tl-floodl0} \end{figure} Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain insertion latency distributions using amortized global reconstruction that are directly comparable to those of dynamic structures based on amortized local reconstruction, at least in some cases. In particular, the worst-case insertion tail latency in this model is a direct function of the buffer size, as the worst-case insert occurs when the buffer must be flushed to a shard. However, this performance comes at the cost of queries, which are incredibly slow compared to those of the B+Tree, as shown in Figure~\ref{fig:tl-floodl0-query}.
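To see why the query cost degrades so severely, consider a rough instantiation of the bound above, under the same assumption as before that a single ISAM shard of $m$ records can be queried in $\Theta(\log m)$ time. With $N_B$ held constant, the structure accumulates $\Theta\left(\frac{n}{N_B}\right)$ shards, each of which must be queried, giving
\begin{equation*}
\mathscr{Q}(n) \in \Theta\left(\frac{n}{N_B} \cdot \log N_B\right) = \Theta(n)
\end{equation*}
total query cost. The cost grows linearly with the number of records, as opposed to the logarithmic cost of the B+Tree, which is consistent with the large gap in Figure~\ref{fig:tl-floodl0-query}.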
Unfortunately, the query latency of this technique is too large for it to be useful; it is necessary to perform reconstructions to merge these small shards together to ensure good query performance. However, this does raise an interesting point. Fundamentally, the reconstructions that contribute to tail latency are \emph{not} required from an insertion perspective; they are a query optimization. Thus, we could remove the reconstructions from the insertion process and perform them elsewhere. This could, theoretically, allow us to have our cake and eat it too. The only insertion bottleneck would become the buffer flushing procedure, as is the case in our hypothetical ``reconstructionless'' approach. Unfortunately, it is not as simple as pulling the reconstructions off of the insertion path and running them in the background, as this alone cannot provide us with a meaningful bound on the number of blocks in the dynamized structure. But it is still possible to provide this bound, if we are willing to throttle the insertion rate so that it is slow enough to keep up with the background reconstructions. In the next section, we will discuss a technique based on this idea. \section{Relaxed Reconstruction} There does exist theoretical work on throttling the insertion rate of a Bentley-Saxe dynamization to control the worst-case insertion cost~\cite{overmars81}, which we discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to break the largest reconstructions up into small sequences of operations that can then be attached to each insert, spreading the total workload out and ensuring each insert takes a consistent amount of time. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts and a small number of incredibly slow ones, the distribution should be far more uniform. Unfortunately, this technique has a number of limitations, which we discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for this discussion, they are \begin{enumerate} \item In the Bentley-Saxe method, the worst-case reconstruction involves every record in the structure. As such, it cannot be performed ``in advance'' without significant extra work. This problem requires the worst-case optimized dynamization systems to include complicated arrangements of partially built structures. \item The approach assumes that the workload of building a block can be evenly divided in advance, and somehow attached to inserts. Even for simple structures, this requires a large amount of manual adjustment to the data structure's reconstruction routines, and does not admit simple, generalized interfaces. \end{enumerate} In this section, we consider how these restrictions can be overcome given our dynamization framework, and propose a strategy that achieves the same worst-case insertion time as the worst-case optimized techniques, given a few assumptions about available resources. First, a comment on nomenclature. We define the term \emph{last level}, $i = \ell$, to mean the level in the dynamized structure with the largest index value (and thereby the most records) and \emph{first level} to mean the level with index $i=0$. Any level with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction on level $i$ involves the combination of all blocks on that level into one larger block, which is then appended to level $i+1$.
Relative to some level at index $i$, the \emph{next level} is the level at index $i + 1$, and the \emph{previous level} is at index $i-1$. Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush and a new block will be placed in the first level. Reconstructions will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number of blocks in the structure remains bounded by $\Theta(\log n)$, we will throttle the insertion rate so that it is balanced with the amount of time needed to complete reconstructions. \begin{figure} \caption{Several ``states'' of tiering, leading up to the worst-case reconstruction.} \label{fig:tl-tiering} \end{figure} First, we will consider how to ``spread out'' the cost of the worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{can be performed well in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage of our technique over the normal Bentley-Saxe method, and it will allow us to spread the cost of this reconstruction over a number of inserts without much of the complexity of~\cite{overmars81}. This leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $2$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which in tiering will be of cost $\Theta(B(n))$. This reconstruction requires all of the blocks on the last level of the structure. At the point at which the last level is full, there will be $\Theta(n)$ inserts before the last level must be merged and a new level added. To ensure that the reconstruction has finished by the time these $\Theta(n)$ inserts have been performed, it is sufficient to throttle the rate of inserts. Ignoring the cost of buffer flushing, this means that inserts must cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} where the $\Theta(1)$ is the cost of appending to the mutable buffer, and $\delta$ is a stall inserted into the insertion process to ensure that the necessary reconstructions are completed in time. To identify the value of $\delta$, we note that each insert must take at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level reconstruction. However, this is not sufficient to guarantee the bound, as other reconstructions will also occur within the structure. At the point at which the last level reconstruction can be scheduled, there will be exactly $1$ block on each level. Thus, each level will potentially also have an ongoing reconstruction that must be covered by inserting more stall time, to ensure that no level in the structure exceeds $s$ blocks.
There are $\log n$ levels in total, and so in the worst case we will need to introduce extra stall time to account for a reconstruction on each level, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1}) \end{equation*} All of these internal reconstructions will be strictly smaller than the last-level reconstruction, and so each of their stalls can be bounded above by $O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest (i.e., most pressing) reconstruction is prioritized on the background thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} This approach results in an equivalent worst-case insertion latency bound to~\cite{overmars81}, but manages to resolve both of the issues cited above. By leveraging two parallel threads, instead of trying to manually multiplex a single thread, this approach requires \emph{no} modification to the user's block code to function. And, by leveraging the fact that reconstructions under tiering are strictly local to a single level, we can avoid needing to add any complicated additional structures to manage partially built blocks as new records are added. \subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there are two available threads of parallel execution, which allows for the reconstructions to run in parallel with inserts. The amount of necessary insertion stall can be significantly reduced, however, if more than two threads are available. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that covering only the cost of the last level reconstruction is insufficient to maintain the bound on the block count. From the moment that the last level has filled, and this reconstruction can begin, every level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. Consider a parallel implementation that, contrary to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover the last level reconstruction, and blocks all other reconstructions until it has been completed. Such an approach would result in $\delta = \frac{B(n)}{n}$ stall and complete the last level reconstruction after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately resulting in a bound of $\Theta(n)$ blocks in the structure, rather than the $\Theta(\log n)$ bound we are trying to maintain. This is the reason why Theorem~\ref{theo:worst-case-optimal} must account for stalls on every level, and assumes that the smallest (and therefore most pressing) reconstruction is always active on the parallel reconstruction thread. This introduces the extra $\log n$ factor into the worst-case insertion cost function, because in the worst case there will be a reconstruction pending on every level, and each reconstruction will involve at most $O(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this bound would be to consider this sum, \begin{equation*} B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n) \end{equation*} where the first term is the last level reconstruction cost, and the sum term accounts for the cost of the $s-1$ reconstructions on each internal level.
Dropping constants and expanding the sum results in, \begin{equation*} B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. However, additional parallelism will allow us to reduce this. At the upper limit, assume that there are $\log n$ threads available for parallel reconstructions. This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, and at least $\log n$ parallel threads of execution, it is possible to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n}\right) \end{equation} for a data structure with $B(n) \in \Omega(n)$. \end{theorem} \begin{proof} Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that the last level reconstruction will be of cost $\Theta(B(n))$ and must be amortized over $\Theta(n)$ inserts. However, unlike in that case, we now have $\log n$ threads of parallelism to work with. Thus, each time a reconstruction must be performed on an internal level, it can be executed on one of these threads in parallel with all other ongoing reconstructions. As there can be at most one reconstruction per level, $\log n$ threads are sufficient to run all possible reconstructions at any point in time in parallel. Each reconstruction will require $\Theta(B(N_B \cdot s^{i+1}))$ time to complete. Thus, the necessary stall to fully cover a reconstruction on level $i$ is this cost, divided by the number of inserts that can occur before the reconstruction must be done (i.e., the capacity of the levels above this point). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} necessary stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(s-1) B(N_B\cdot s^{i+1})}{N_B\cdot (s^{i+1} - s)} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} For all reconstructions running in parallel, the necessary stall is the maximum stall of all the parallel reconstructions, \begin{equation*} \delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} \end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will always be the largest. $N_B \cdot s^{\ell + 1}$ is $\Theta(n)$, so we find that, \begin{equation*} I(n) \in O \left(\frac{B(n)}{n}\right) \end{equation*} is the worst-case insertion cost, while ensuring that all reconstructions are done in time to maintain the block bound given $\log n$ parallel threads. \end{proof} \section{Implementation} The previous section demonstrated that, theoretically, it is possible to meaningfully control the tail latency of our dynamization system by relaxing the reconstruction process and throttling the insertion rate, rather than blocking inserts, as the means of controlling the shard count within the structure. However, there are a number of practical problems to be solved before this idea can be used in a real system.
In this section, we discuss these problems, and our approaches to solving them to produce a dynamization framework based upon the technique. \subsection{Parallel Reconstruction Architecture} The existing concurrency implementation described in Section~\ref{ssec:dyn-concurrency} is insufficient for constructing a framework that supports the parallel reconstruction scheme described in the previous section. In particular, it is limited to only two active versions of the structure at a time, with one ongoing reconstruction. Additionally, it does not consider buffer flushes as distinct events from reconstructions. In this section, we will discuss the modifications made to the concurrency support within our framework to support parallel reconstructions. Much like the simpler scheme in Section~\ref{ssec:dyn-concurrency}, our concurrency framework will be based on multi-versioning. Each \emph{version} consists of three pieces of information: a buffer head pointer, a buffer tail pointer, and a collection of levels and shards. However, the process of managing, creating, and installing versions is much more complex, to allow more than two versions to exist at the same time under certain circumstances. \subsubsection{Structure Versioning} The internal structure of the dynamization consists of a sequence of levels containing immutable shards, as well as a snapshot of the state of the mutable buffer. This section pertains specifically to the internal structure; the mutable buffer handles its own versioning separately, and will be discussed in the next section. \begin{figure} \centering \subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}} \subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}} \caption{\textbf{Structure Version Transitions.} The dynamized structure can transition to a new version via two operations, flushing the buffer into the first level or performing a maintenance reconstruction to merge shards on some level and append the result onto the next one. In each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s light grey shards, with the dark grey shards being newly created and the white shards being deleted. The buffer flush operation in Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer and places it in \texttt{L0} to create \texttt{V2}. The maintenance reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex, creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s \texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}. } \label{fig:tl-flush-maint} \end{figure} The internal structure of the dynamized data structure (ignoring the buffer) can be thought of as a list of immutable levels, $\mathcal{V} = \{\mathscr{L}_0, \ldots, \mathscr{L}_h\}$, where each level contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots, \mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of as functions, which accept a version as input and produce a new version as output. Namely, \begin{align*} \mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\ \mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j) \end{align*} where the subscript represents the \texttt{version\_id}, a strictly increasing number assigned to each version.
The $\mathbftt{flush}$ operation builds a new shard using records from the buffer, $\mathcal{B}$, and creates a new version identical to $\mathcal{V}_i$, except with the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs a maintenance reconstruction by building a new shard using all of the shards in level $\mathscr{L}_x$ and creating a new version identical to $\mathcal{V}_i$, except that the new shard is appended to level $\mathscr{L}_j$ and the shards of $\mathscr{L}_x$ are removed in the new version. These two operations are shown in Figure~\ref{fig:tl-flush-maint}. At any point in time, the framework will have \emph{one} active version, $\mathcal{V}_a$, as well as a maximum unassigned version number, $v_m > a$. New version ids are obtained by performing an atomic fetch-and-add on $v_m$, and versions will become active in the exact order of their assigned version numbers. We use the term \emph{installing} a version, $\mathcal{V}_x$, to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$. \Paragraph{Version Number Assignment.} It is the intention of this framework to prioritize buffer flushes, meaning that the versions resulting from a buffer flush should become active as rapidly as possible. It is undesirable to have some version, $\mathcal{V}_f$, resulting from a buffer flush, attempt to install while there is a version $\mathcal{V}_r$ associated with an in-process maintenance reconstruction such that $a < r < f$. In this case, the flush must wait for the maintenance reconstruction to finalize before it can itself be installed. To avoid this problem, we assign version numbers differently based upon whether the new version is created by a flush or a maintenance reconstruction. \begin{itemize} \item \textbf{Flush.} When a buffer flush is scheduled, it is immediately assigned the next available version number at the time of scheduling. \item \textbf{Maintenance Reconstruction.} Maintenance reconstructions are \emph{not} assigned a version number immediately. Instead, they are assigned a version number \emph{after} all of the reconstruction work is performed, during their installation process. \end{itemize} \Paragraph{Version Installation.} Once a given flush or maintenance reconstruction has completed and has been assigned a version number, $i$, the version will attempt to install itself. The thread running the operation will wait until $a = i - 1$, and then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an atomic pointer assignment. All versions are reference counted using \texttt{std::shared\_ptr}, and so will be automatically deleted once all references to them have been released, so no special memory management is necessary during version installation. \Paragraph{Maintenance Version Reconciliation.} Waiting until the moment of installation to assign a version number to maintenance reconstructions avoids stalling buffer flushes; however, it introduces additional complexity into the installation process. This is because the active version at the time the reconstruction was scheduled, $\mathcal{V}_a$, may no longer be the active version at the time the reconstruction is installed, $\mathcal{V}_{a^\prime}$. This means that the version of the structure produced by the reconstruction, $\mathcal{V}_r$, will not reflect any updates to the structure that were performed in versions with ids on the interval $(a, a^\prime]$. Figure~\ref{fig:tl-version-reconcilliation} shows an example of the sort of problem that can arise.
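To make the version numbering and installation protocol concrete, the sketch below shows one possible way to express it in C++. The names used here (\texttt{Version}, \texttt{VersionManager}, \texttt{claim\_version\_id}, \texttt{install}) are illustrative only and do not correspond to the framework's actual interfaces, and the structural contents of each version are elided.
\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>

struct Version {
    uint64_t id;
    // levels, shards, and buffer head/tail pointers would live here
};

class VersionManager {
public:
    // Claim the next version id. Flushes call this at scheduling time;
    // maintenance reconstructions call it only once their shards are built.
    uint64_t claim_version_id() { return m_next_id.fetch_add(1); }

    // Install a finished version: wait until every version with a smaller
    // id has become active, then publish this one atomically.
    void install(std::shared_ptr<Version> v) {
        while (std::atomic_load(&m_active)->id != v->id - 1) {
            std::this_thread::yield();
        }
        std::atomic_store(&m_active, std::move(v));
    }

    std::shared_ptr<Version> active() const {
        return std::atomic_load(&m_active);
    }

private:
    std::atomic<uint64_t> m_next_id {1};
    std::shared_ptr<Version> m_active {std::make_shared<Version>()};
};
\end{verbatim}
Because versions only become active in id order, a maintenance reconstruction that defers claiming its id until installation time can never force a later flush to wait behind it, which is the motivation for the assignment policy described above.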
One possible approach is to simply merge the versions together, adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but not in $\mathcal{V}_r$ prior to installation. Sadly, this approach is insufficient, because it can lead to three possible problems, \begin{enumerate} \item If shards used in the maintenance reconstruction to produce $\mathcal{V}_r$ were \emph{also} used as part of a different maintenance reconstruction resulting in a version $\mathcal{V}_o$ with $o < r$, then \textbf{records will be duplicated} by the merge. \item If another reconstruction produced a version $\mathcal{V}_o$ with $o < r$, and $\mathcal{V}_o$ added a new shard to the same level that $\mathcal{V}_r$ did, it is possible that the temporal ordering of the shards on the level may be violated. Recall that supporting tombstone-based deletes requires that shards be strictly ordered within each level by their age, to ensure correct tombstone cancellation (Section~\ref{sssec:dyn-deletes}). \item The shards that were deleted in producing $\mathcal{V}_r$ will still be present in $\mathcal{V}_{a^\prime}$, and so may be reintroduced into the new version, again leading to duplication of records. It is non-trivial to identify these shards during the merge in order to skip over them, because the shards have no unique identifier other than their pointers, and using the pointers for this check can lead to the ABA problem under the reference-counting-based memory management scheme on which the framework is built. \end{enumerate} The first two of these problems result from a simple synchronization issue and can be solved using locking. A maintenance reconstruction operates on some level $\mathscr{L}_i$, merging and then deleting shards from that level and placing the result in $\mathscr{L}_{i+1}$. In order for either of these problems to occur, multiple concurrent reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager can be introduced into the framework to allow reconstructions to lock entire levels. A reconstruction can only be scheduled if it is able to acquire the lock on the level that it is using as the \emph{source} for its shards. Note that there is no synchronization problem with a concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard to $\mathscr{L}_i$; this will not violate any ordering properties or result in any duplication of records. Thus, each reconstruction only needs to lock a single level. The final problem is a bit trickier to address, but is fundamentally an implementation detail. Our approach for resolving it is to change the way that maintenance reconstructions produce a version in the first place. Rather than taking a copy of $\mathcal{V}_a$, manipulating it to perform the reconstruction, and then reconciling it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay \emph{all} structural updates to the version until installation. When a reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken, instead of a copy. Then, any new shards are built based on the contents of $\mathcal{V}_a$, but no updates to the structure are made. Once all of the shard reconstructions are complete, the version installation process begins. The thread running the reconstruction waits for its turn to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To this copy, the newly created shards are added, and any necessary deletes are performed.
Because the shards to be deleted are still referenced by, at minimum, the reference to $\mathcal{V}_a$ held by the reconstruction thread, pointer equality can be used to identify them and the ABA problem avoided. Then, once all the updates are complete, the new version can be installed. This process does push a fair amount of work to the moment of install, between when a version id is claimed by the reconstruction thread and when that version id becomes active. During this time, any buffer flushes will be blocked. However, relative to the work associated with actually performing the reconstructions, the overhead of these metadata operations is fairly minor, and so it does not have a significant effect on buffer flush performance. \subsubsection{Mutable Buffer} \begin{figure} \centering \subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}} \subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}} \subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}} \caption{\textbf{Versioning process for the mutable buffer.} A schematic view of the mutable buffer demonstrating the three pointers representing its state, and how they are adjusted as inserts occur. Dark grey slots represent the currently active version, light grey slots the old version, and white slots are available space.} \label{fig:tl-buffer} \end{figure} Next, we will address concurrent access and versioning of the mutable buffer. In our system, the mutable buffer consists of a large ring buffer with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In order to support versioning, the buffer actually uses two head pointers, one called \texttt{head} and one called \texttt{old head}, along with a single \texttt{tail} pointer. Records are inserted into the buffer by atomically incrementing \texttt{tail} and then placing the record into the claimed slot. For records that cannot be atomically assigned, a visibility bit can be used to ensure that concurrent readers do not access a partially written value. \texttt{tail} can be incremented until it reaches \texttt{old head}, or until the current version of the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$ records. At this point, any further writes would either clobber records in the old version, or exceed the user-specified buffer capacity, and so any inserts must block until a flush has been completed. Flushes are triggered based on a user-configurable set point, $N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation is scheduled (more on the details of this process in Section~\ref{ssec:tl-flush}). The location of \texttt{tail} is recorded as part of the flush, but records can continue to be inserted until one of the blocking conditions in the previous paragraph is reached. When the flush has completed, a new shard is created containing the records between \texttt{head} and the value of \texttt{tail} at the time the flush began. The buffer version can then be advanced by setting \texttt{old head} to \texttt{head} and setting \texttt{head} to the value of \texttt{tail} recorded for the flush. All of the records associated with the old version are freed, and the records that were just flushed become part of the old version.
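A simplified sketch of this insert path is given below. It is illustrative only: the record type, the flush-scheduling hook, and the omission of per-head reference counts and visibility bits are assumptions made here for brevity, and do not reflect the framework's exact interfaces.
\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <vector>

template <typename Record>
class MutableBuffer {
public:
    MutableBuffer(size_t nb, size_t nf)
        : m_nb(nb), m_nf(nf), m_data(2 * nb) {}

    // Returns false if the insert must block: either the active version is
    // full, or advancing tail would clobber records in the old version.
    bool try_insert(const Record &rec) {
        size_t t = m_tail.load();
        for (;;) {
            if (t - m_head.load() >= m_nb) return false;
            if (t - m_old_head.load() >= m_data.size()) return false;
            if (m_tail.compare_exchange_weak(t, t + 1)) break;
        }
        m_data[t % m_data.size()] = rec;  // slot t is owned by this thread

        if (t + 1 - m_head.load() == m_nf) {
            m_flush_tail.store(t + 1);  // record tail for the flush
            // a flush of [head, t + 1) would be scheduled here
        }
        return true;
    }

    // Called once the flush has produced a shard: the flushed records
    // become the old version, and the previous old version is freed.
    void advance_version() {
        m_old_head.store(m_head.load());
        m_head.store(m_flush_tail.load());
    }

private:
    size_t m_nb;  // N_B: per-version capacity
    size_t m_nf;  // N_F: flush set point
    std::vector<Record> m_data;  // N_B slots per version, two versions
    std::atomic<size_t> m_tail{0}, m_head{0}, m_old_head{0}, m_flush_tail{0};
};
\end{verbatim}
Note that the real buffer must also prevent the version transition from running while any thread still holds a reference to \texttt{old head}, as described next.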
The reason for this scheme is to allow threads accessing an older version of the dynamized structure to still see a current view of all of the records. These threads hold a reference to a version of the structure containing none of the records in the buffer, along with the old head pointer. Because the older version of the buffer always directly precedes the newer, all of the buffered records are visible to this older version. However, threads accessing the more current version of the buffer will \emph{not} see the records contained between \texttt{old head} and \texttt{head}, as these records will have been flushed into the structure and are visible to the thread there. If such a thread could still see records in the older version of the buffer, then it would see these records twice, which is incorrect. One consequence of this technique is that a buffer flush cannot complete until all threads referencing \texttt{old head} have completed. To ensure that this is the case, the two head pointers are reference counted, and a flush will stall until all references to \texttt{old head} have been removed. In principle, this problem could be reduced by allowing for more than two heads, but it becomes difficult to atomically transition between versions in that case, and it would also increase the storage requirements for the buffer, which requires $N_B$ space per available version. \subsection{Concurrent Queries} \subsubsection{Query Pre-emption} Because our implementation only supports a finite number of versions of the mutable buffer at any point in time, and insertions will stall after this finite number is reached, it is possible for queries to introduce additional insertion latency. Queries hold a reference to the version of the structure they are using, which includes holding on to a buffer head pointer. If a query is particularly long running, or otherwise stalled, it is possible that the query will block insertions by holding onto this head pointer. \subsection{Insertion Stall Mechanism} \section{Evaluation} \subsection{} \section{Conclusion}