| | | |
|---|---|---|
| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 17:53:37 -0400 |
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 17:53:37 -0400 |
| commit | 01f6950612be18468376aeffb8eef0d7205d86d5 (patch) | |
| tree | 8f75d21427168af515cae7e66c12844d2a32dbf4 /chapters/tail-latency.tex | |
| parent | 33bc7e620276f4269ee5f1820e5477135e020b3f (diff) | |
| download | dissertation-01f6950612be18468376aeffb8eef0d7205d86d5.tar.gz | |
updates
Diffstat (limited to 'chapters/tail-latency.tex')
| -rw-r--r-- | chapters/tail-latency.tex | 596 |
1 file changed, 339 insertions, 257 deletions
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex index 5c0e0ba..1a468df 100644 --- a/chapters/tail-latency.tex +++ b/chapters/tail-latency.tex @@ -21,13 +21,11 @@ that of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat} are -quite different. The dynamized structure has much better "best-case" -performance, but the worst-case performance is exceedingly poor. That -the structure exhibits reasonable performance on average is the result -of these two ends of the distribution balancing each other out. +quite different. The dynamized structure has much better best-case +performance, but the worst-case performance is exceedingly poor. This poor worst-case performance is a direct consequence of the different -approaches to update support used by the dynamized structure and B+Tree. +approaches used by the dynamized structure and B+Tree to support updates. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the @@ -43,15 +41,18 @@ B+Tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+Tree results in significantly better worst-case performance. -Unfortunately, the design space that we have been considering thus far -is limited in its ability to meaningfully alter the worst-case insertion -performance. While we have seen that the choice of layout policy can have -some effect, the actual benefit in terms of tail latency is quite small, -and the situation is made worse by the fact that leveling, which can -have better worst-case insertion performance, lags behind tiering in -terms of average insertion performance. The use of leveling can allow -for a small reduction in the worst case, but at the cost of making the -majority of inserts worse because of increased write amplification. +Unfortunately, the design space that we have been considering is +limited in its ability to meaningfully alter the worst-case insertion +performance. Leveling requires only a fraction of, rather than all of, the +records in the structure to participate in its worst-case reconstruction, +and as a result shows slightly reduced worst-case insertion cost compared +to the other layout policies. However, this effect only results in a +single-digit factor reduction in measured worst-case latency, and has +no effect on the insertion latency distribution itself outside of the +absolute maximum value. Additionally, we've shown that leveling performed +significantly worse in average insertion performance compared to tiering, +and so its usefulness as a tool to reduce insertion tail latencies is +questionable at best. \begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} @@ -60,32 +61,35 @@ majority of inserts worse because of increased write amplification. \label{fig:tl-parm-sweep} \end{figure} -The other tuning knobs that are available to us are of limited usefulness -in tuning the worst case behavior. 
Figure~\ref{fig:tl-parm-sweep} -shows the latency distributions of our framework as we vary +Our existing framework support two other tuning parameters: the scale +factor and the buffer size. However, neither of these parameters +are of much use in adjusting the worst-case insertion behavior. We +demonstrate this experimentally in Figure~\ref{fig:tl-parm-sweep}, +which shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer -has almost no effect on the worst-case latency at all, or even on the -distribution; particularly when tiering is used. This is to be expected; -ultimately the worst-case reconstruction size is largely the same -regardless of scale factor or buffer size: $\Theta(n)$ records. - -The selection of configuration parameters can influence \emph{when} -these reconstructions occur, as well as slightly influence their size, but -ultimately the question of ``which configuration has the best tail-latency -performance'' is more a question of how many insertions the latency is -measured over, than any fundamental trade-offs with the design space. This -is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of -the ``shelves'' in the distribution correspond to reconstructions on -particular levels. As can be seen, the lines cross each other repeatedly -at these shelves. These cross-overs are points at which one configuration -begins to, temporarily, exhibit better tail latency behavior than the -other. However, after enough records have been inserted to cause the next -largest reconstructions to begin to occur, the "better" configuration -begins to appear worse again in terms of tail latency.\footnote{ +has almost no effect on either the distribution itself, or the worst-case +insertion latency; particularly when tiering is used. This is to be +expected; ultimately the worst-case reconstruction size is largely the +same irrespective of scale factor or buffer size: $\Theta(n)$ records. + +The selection of configuration parameters does influence \emph{when} +a worst-case reconstruction occurs, and can slightly affect its size. +Ultimately, however, the answer to the question of which configuration +has the best insertion tail latency performance is more a matter of how +many records the insertion latencies are measured over than it is one of +any fundamental design trade-offs within the space. This is exemplified +rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in +the distribution correspond to reconstructions on particular levels. As +can be seen, the lines cross each other repeatedly at these shelves. These +cross-overs are points at which one configuration begins to exhibit +better tail latency behavior than another. However, after enough records +have been inserted, the next largest reconstructions will begin to +occur. This will make the "better" configuration appear worse in +terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, @@ -112,32 +116,29 @@ to demonstrate that the theoretical trade-offs are achievable in practice. 
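The claim that the worst-case reconstruction size is essentially independent of these parameters can be checked with a short simulation. The sketch below is illustrative only: the record-count cost model and all constants are assumptions rather than measurements from the benchmarked framework. It replays buffer flushes under the tiering policy for several scale factor and buffer size combinations and reports the largest single reconstruction each configuration triggers; in every case the maximum involves a constant fraction of the total records inserted, which is the $\Theta(n)$ behavior described above.

```cpp
// tiering_sim.cpp -- illustrative simulation only; parameters and the
// record-count cost model are assumptions, not benchmark measurements.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// A "level" is just the list of shard sizes (record counts) it contains.
using Level = std::vector<uint64_t>;

// Flush a buffer of `nb` records, cascading tiering reconstructions as
// needed. Returns the largest reconstruction (in records) this flush caused.
static uint64_t flush(std::vector<Level> &levels, size_t s, uint64_t nb) {
    uint64_t largest = 0;

    // Find the shallowest non-full level; everything above it must cascade.
    size_t d = 0;
    while (d < levels.size() && levels[d].size() == s) d++;

    // Merge each full level into the one below it, deepest first.
    for (size_t i = d; i-- > 0;) {
        uint64_t merged =
            std::accumulate(levels[i].begin(), levels[i].end(), uint64_t{0});
        largest = std::max(largest, merged);
        if (i + 1 == levels.size()) levels.emplace_back();
        levels[i + 1].push_back(merged);
        levels[i].clear();
    }

    levels[0].push_back(nb);
    return largest;
}

int main() {
    const uint64_t target = 100'000'000;  // total records to insert
    const std::pair<size_t, uint64_t> configs[] = {
        {4, 8000}, {8, 8000}, {4, 32000}};  // (scale factor, buffer size)

    for (auto [s, nb] : configs) {
        std::vector<Level> levels(1);
        uint64_t inserted = 0, worst = 0;
        while (inserted < target) {
            worst = std::max(worst, flush(levels, s, nb));
            inserted += nb;
        }
        std::printf("s=%zu N_B=%llu  worst reconstruction: %llu records "
                    "(%.1f%% of n)\n",
                    s, (unsigned long long)nb, (unsigned long long)worst,
                    100.0 * worst / inserted);
    }
}
```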
\section{The Insertion-Query Trade-off} \label{sec:tl-insert-query-tradeoff} -As reconstructions are at the heart of the insertion tail latency problem, -it seems worth taking a moment to consider \emph{why} they must be done -at all. Fundamentally, decomposition-based dynamization techniques trade -between insertion and query performance by controlling the number of -blocks in the decomposition. Placing a bound on this number is necessary -to bound the worst-case query cost, and is done using reconstructions -to either merge (in the case of the Bentley-Saxe method) or re-partition -(in the case of the equal block method) the blocks. Performing less -frequent (or smaller) reconstructions reduces the amount of work -associated with inserts, at the cost of allowing more blocks to accumulate -and thereby hurting query performance. +Reconstructions lie at the heart of the insertion tail latency problem, +and so it seems worth taking a moment to consider \emph{why} they occur +at all. Fundamentally, decomposition-based dynamization techniques +trade between insertion and query performance by controlling the number +of blocks in the decomposition. Placing a bound on this number is +necessary to bound the worst-case query cost, and this bound is enforced +using reconstructions to either merge (in the case of the Bentley-Saxe +method) or re-partition (in the case of the equal block method) the +blocks. Performing less frequent (or smaller) reconstructions reduces +the amount of work associated with inserts, at the cost of allowing more +blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block -count is most directly visible in the equal block method described -in Section~\ref{ssec:ebm}. This technique provides the -following worst-case insertion and query bounds, +count is most directly visible in the equal block method described in +Section~\ref{ssec:ebm}. We will consider a variant of the equal block +method here for which $f(n) \in \Theta(1)$, resulting in a dynamization +that does not perform any re-partitioning. In this case, the technique +provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} -where $f(n)$ is the number of blocks. This worst-case result ignores -re-partitioning costs, which may be necessary for certain selections -of $f(n)$. We omit it here because we are about to examine a case -of the equal block method were no re-partitioning is necessary. When -re-partitioning is used, the worst case cost rises to the now familiar $I(n) -\in \Theta(B(n))$ result. +where $f(n)$ is the number of blocks. \begin{figure} \centering @@ -157,63 +158,72 @@ method, with \begin{equation*} f(n) = C \end{equation*} -for varying constant values of $C$. Note that in this test the final -record count was known in advance, allowing all re-partitioning to be -avoided. This represents a sort of ``best case scenario'' for the -technique, and isn't reflective of real-world performance, but does -serve to demonstrate the relevant properties in the clearest possible -manner. +for varying constant values of $C$. As noted above, this special case of +the equal block method allows for re-partitioning costs to be avoided +entirely, resulting in a very clean trade-off space. 
This result isn't +necessarily demonstrative of the real-world performance of the equal +block method, but it does serve to demonstrate the relevant properties +of the method in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is -dictated by the size of the largest reconstruction, and so increasing -the block count results in smaller blocks, and better insertion -performance. These worst-case results also translate directly into -improved average throughput, at the cost of query latency, as shown in -Figure~\ref{fig:tl-ebm-tradeoff}. These results show that, contrary to our -Bentley-Saxe inspired dynamization system, the equal block method provides -clear and direct relationships between insertion and query performance, -as well as direct control over tail latency, through its design space. - -Unfortunately, the equal block method is not well suited for -our purposes. Despite having a much cleaner trade-off space, its -performance is strictly worse than our dynamization system. Comparing -Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} -shows that, for a specified query latency, our technique provides -significantly better insertion throughput.\footnote{ - In actuality, the insertion performance of the equal block method is - even \emph{worse} than the numbers presented here. For this particular - benchmark, we implemented the technique knowing the number of records in - advance, and so fixed the size of each block from the start. This avoided - the need to do any re-partitioning as the structure grew, and reduced write - amplification. +dictated by the size of the largest reconstruction. Increasing the +block count reduces the size of each block, and so improves the +insertion performance. Figure~\ref{fig:tl-ebm-tradeoff} shows that +these improvements to tail latency performance translate directly into +an improvement of the overall insertion throughput as well, as the +cost of worse query latencies. Contrary to our Bentley-Saxe inspired +dynamization system, this formulation of the equal block method provides +direct control over insertion tail latency, as well as a much cleaner +relationship between average insertion and query performance. + +While these results are promising, the equal block method is not well +suited for our purposes. Despite having a clean trade-off space and +control over insertion tail latencies, this technique is strictly +worse than our existing dynamization system in every way but tail +latency control. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with +Figure~\ref{fig:design-tradeoff} shows that, for a specified query +latency, a logarithmic decomposition provides significantly better +insertion throughput. Our technique uses geometric growth of block sizes, +which ensures that most reconstructions are smaller than those in the +equal block method for an equivalent number of blocks. This comes at +the cost of needing to occasionally perform large reconstructions to +compact these smaller blocks, resulting in the poor tail latencies we +are attempting to resolve. Thus, it seems as though poor tail latency +is concomitant with good average performance. 
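To make the shape of this trade-off concrete, the following is a minimal sketch of the $f(n) = C$ variant of the equal block method, using sorted arrays as a stand-in for the static structure. The class name, the round-robin block selection, and the use of a full re-sort in place of a bulk-loaded \texttt{build} are all illustrative assumptions, not the implementation used in the experiments above. Each insert rebuilds a single block of roughly $n/C$ records and each query must probe all $C$ blocks, mirroring the $I(n)$ and $\mathscr{Q}(n)$ bounds given earlier: raising $C$ makes inserts cheaper and queries more expensive, with no single large reconstruction anywhere in the structure, but also without the geometric block-size growth that gives the logarithmic decomposition its superior average performance.

```cpp
// equal_blocks.cpp -- minimal sketch of the equal block method with f(n) = C.
// Sorted vectors stand in for the static structure; illustrative only.
#include <algorithm>
#include <cstdio>
#include <vector>

class EqualBlocks {
    std::vector<std::vector<int>> blocks_;  // C blocks, each kept sorted
    size_t next_ = 0;                       // block that absorbs the next insert

public:
    explicit EqualBlocks(size_t c) : blocks_(c) {}

    // Insert rebuilds exactly one block: Theta(n / C) records participate.
    // (A real static structure would be bulk-loaded rather than re-sorted.)
    void insert(int key) {
        auto &blk = blocks_[next_];
        next_ = (next_ + 1) % blocks_.size();
        blk.push_back(key);
        std::sort(blk.begin(), blk.end());
    }

    // A (decomposable) membership query is answered against every block.
    bool contains(int key) const {
        for (const auto &blk : blocks_)
            if (std::binary_search(blk.begin(), blk.end(), key)) return true;
        return false;
    }
};

int main() {
    EqualBlocks eb(8);  // C = 8 blocks
    for (int i = 0; i < 20000; i++) eb.insert(i * 7 % 20000);
    std::printf("contains(4242) = %d\n", eb.contains(4242));
}
```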
+ +Despite this, let's consider an approach to reconstruction within our +framework that optimizes for insertion tail latency exclusively using +the equal block method, neglecting any considerations of maintaining +a shard count bound. We can consider a variant of the equal block +method having $f(n) = \frac{n}{N_B}$. This case, like +the $f(n) = C$ approach considered above, avoids all re-partitioning, +because records are flushed into the dynamization in sets of $N_B$ size, +and so each new block is always exactly full on creation. In effect, this +technique has no reconstructions at all. Each buffer flush simply creates +a new block that is added to an ever-growing list. This produces a system +with worst-case insertion and query costs of, +\begin{align*} + I(n) &\in \Theta(B(N_B)) \\ + \mathscr{Q}(n) &\in O (n\cdot \mathscr{Q}_S(N_B)) +\end{align*} +where the worst-case insertion is simply the cost of a buffer flush, +and the worst-case query cost follows from the fact that there will +be $\Theta\left(\frac{n}{N_B}\right)$ shards in the dynamization, each +of which will have exactly $N_B$ records.\footnote{ + We are neglecting the cost of querying the buffer in this cost function + for simplicity. } -This is because, in our technique, the variable size of the blocks allows -for the majority of the reconstructions to occur with smaller structures, -while allowing the majority of the records to exist in a small number -of large blocks at the bottom of the structure. This setup enables -high insertion throughput while keeping the block count small. But, -as we've seen, the cost of this is large tail latencies, as the large -blocks must occasionally be involved in reconstructions. However, we -can use the extreme ends of the equal block method's design space to -consider upper limits on the insertion and query performance that we -might expect to get out of a dynamized structure, and then take steps -within our own framework to approach these limits, while retaining the -desirable characteristics of the logarithmic decomposition. - -At the extreme end, consider what would happen if we were to modify -our dynamization framework to avoid all reconstructions. We retain a -buffer of size $N_B$, which we flush to create a shard when full; however -we never touch the shards once they are created. This is effectively -the equal block method, where every block is fixed at $N_B$ capacity. -Such a technique would result in a worst-case insertion cost of $I(n) \in -\Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in -total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ -worst-case query cost for a decomposable search problem. Applying -this technique to an ISAM Tree, and compared against a B+Tree, -yields the insertion and query latency distributions shown in -Figure~\ref{fig:tl-floodl0}. + +Applying this technique to an ISAM Tree, and compared against a +B+Tree, yields the insertion and query latency distributions shown +in Figure~\ref{fig:tl-floodl0}. Figure~\ref{fig:tl-floodl0-insert} +shows that it is possible to obtain insertion latency distributions +using amortized global reconstruction that are directly comparable to +dynamic structures based on amortized local reconstruction. However, +this performance comes at the cost of queries, which are incredibly slow +compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}. 
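For a rough sense of the magnitude of this gap, consider illustrative values (assumed for the sake of the example, not measured) of $n = 10^8$ and $N_B = 10^4$, with $\mathscr{Q}_S(m) \in \Theta(\log m)$ comparisons for a single sorted shard,
\begin{align*}
\frac{n}{N_B} \cdot \log_2 N_B &\approx 10^4 \cdot 13.3 \approx 1.3\times 10^5 \text{ comparisons per query} \\
\log_2 n &\approx 27 \text{ comparisons for a single balanced structure}
\end{align*}
a difference of roughly four orders of magnitude in comparison count alone, before accounting for the poorer cache behavior of probing $10^4$ separate shards.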
\begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} @@ -223,93 +233,123 @@ Figure~\ref{fig:tl-floodl0}. \label{fig:tl-floodl0} \end{figure} -Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain -insertion latency distributions using amortized global reconstruction -that are directly comparable to dynamic structures based on amortized -local reconstruction, at least in some cases. In particular, the -worst-case insertion tail latency in this model is a direct function -of the buffer size, as the worst-case insert occurs when the buffer -must be flushed to a shard. However, this performance comes at the -cost of queries, which are incredibly slow compared to B+Trees, as -shown in Figure~\ref{fig:tl-floodl0-query}. - -Unfortunately, the query latency of this technique is too large for it -to be useful; it is necessary to perform reconstructions to merge these -small shards together to ensure good query performance. However, this -does raise an interesting point. Fundamentally, the reconstructions -that contribute to tail latency are \emph{not} required from an -insertion perspective; they are a query optimization. Thus, we could -remove the reconstructions from the insertion process and perform -them elsewhere. This could, theoretically, allow us to have our -cake and eat it too. The only insertion bottleneck would become -the buffer flushing procedure--as is the case in our hypothetical -``reconstructionless'' approach. Unfortunately, it is not as simple as -pulling the reconstructions off of the insertion path and running them in -the background, as this alone cannot provide us with a meaningful bound -on the number of blocks in the dynamized structure. But, it is possible -to still provide this bound, if we're willing to throttle the insertion -rate to be slow enough to keep up with the background reconstructions. In -the next section, we'll discuss a technique based on this idea. + +On its own, this technique exhibits too large of a degradation +of query latency for it to be useful in any scenario involving a +need for queries. However, it does demonstrate that, in scenarios +where insertion doesn't need to block on reconstructions, it is +possible to obtain significantly improved insertion tail latency +distributions. Unfortunately, it also shows that using reconstructions to +enforce structural invariants to control the number of block is critical +for query performance. In the next section, we will consider approaches +to reconstruction that allow us to maintain the structural invariants of +our dynamization, while avoiding direct blocking of inserts, in an attempt +to reduce the worst-case insertion cost in line with the approach we have +just discussed, while maintaining similar worst-case query bounds to our +existing dynamization system. \section{Relaxed Reconstruction} -There does exist theoretical work on throttling the insertion -rate of a Bentley-Saxe dynamization to control the worst-case -insertion cost~\cite{overmars81}, which we discussed in -Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to -break the largest reconstructions up into small sequences of operations, -that can then be attached to each insert, spreading the total workload out -and ensuring each insert takes a consistent amount of time. 
Theoretically, +Reconstructions are necessary to maintain the structural invariants of +our dynamization, which are themselves required to maintain bounds on +worst-case query performance. Inserts are what causes the structure to +violate these invariants, and so it makes sense to attach reconstructions +to the insertion process to allow a strict maintenance of these +invariants. However, it is possible to take a more relaxed approach +to maintaining these invariants using concurrency, allowing for the same +shard bound to be enforced at a much lower worst-case insertion cost. + +There does exist theoretical work in this area, which we've already +discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this +technique is to relax the strict binary decomposition of the Bentley-Saxe +method to allow multiple reconstructions to occur at once, and to add +a buffer to each level to contain a partially built structure. Then, +reconstructions are split up into small batches of operations, +which are attached to each insert that is issued up to the moment +when the reconstruction must be complete. By doing this, the work +of the reconstructions is spread out across many inserts, in effect +removing the need to block for large reconstructions. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts, -and a small number of incredibly slow ones, distribution should be -more normal. +and a small number of incredibly slow ones, distribution should be more +normal.~\cite{overmars81} Unfortunately, this technique has a number of limitations that we -discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for -this discussion, they are -\begin{enumerate} - - \item In the Bentley-Saxe method, the worst-case reconstruction - involves every record in the structure. As such, it cannot be - performed ``in advance'' without significant extra work. This problem - requires the worst-case optimized dynamization systems to include - complicated structures of partially built structures. - - \item The approach assumes that the workload of building a - block can be evenly divided in advance, and somehow attached - to inserts. Even for simple structures, this requires a large - amount of manual adjustment to the data structure reconstruction - routines, and doesn't admit simple, generalized interfaces. +discussed in Section~\ref{ssec:bsm-tail-latency-problem}. It effectively +reduces to manually multiplexing a single thread to perform a highly +controlled, concurrent, reconstruction process. This requires the ability +to evenly divide up the work of building a data structure and somehow +attach these operations to individual inserts. This makes it ill-suited +for our general framework, because, even when the construction can be +split apart into small independent chunks, implementing it requires a +significant amount of manual adjustment to the data structure construction +processes. + +In this section, we will propose an alternative approach for implementing +a similar idea using multi-threading and prove that we can achieve, +in principle, the same worst-case insertion and query costs in a far +more general and easily implementable manner. We'll then show how we +can further leverage parallelism on top of our approach to obtain +\emph{better} worst-case bounds, assuming sufficient resources are +available. 
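To see concretely why the manual multiplexing described above is so intrusive, consider what a resumable \texttt{build} would have to look like. The sketch below is purely illustrative: it re-expresses a simple two-way merge as an operation that can be advanced a fixed number of steps per insert, which is the essence of attaching reconstruction work to individual insertions. No general static structure exposes its construction in this resumable form out of the box, and rewriting each \texttt{build} routine this way is exactly the per-structure manual adjustment that makes the approach a poor fit for a generic framework.

```cpp
// incremental_merge.cpp -- toy illustration of "attach work to each insert":
// a two-way merge of sorted runs that can be advanced a few steps at a time.
#include <cstdio>
#include <vector>

struct IncrementalMerge {
    const std::vector<int> &a, &b;  // input runs
    std::vector<int> out;           // partially built output
    size_t i = 0, j = 0;

    IncrementalMerge(const std::vector<int> &x, const std::vector<int> &y)
        : a(x), b(y) { out.reserve(a.size() + b.size()); }

    // Perform up to `steps` units of merge work; returns true when finished.
    bool advance(size_t steps) {
        while (steps-- > 0 && (i < a.size() || j < b.size())) {
            if (j == b.size() || (i < a.size() && a[i] <= b[j]))
                out.push_back(a[i++]);
            else
                out.push_back(b[j++]);
        }
        return i == a.size() && j == b.size();
    }
};

int main() {
    std::vector<int> a{1, 3, 5, 7}, b{2, 4, 6, 8};
    IncrementalMerge m(a, b);

    // Each "insert" advances the pending reconstruction by a fixed quantum.
    int inserts = 1;
    while (!m.advance(2)) inserts++;
    std::printf("merge finished after %d inserts, %zu records\n",
                inserts, m.out.size());
}
```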
+ + +\Paragraph{Layout Policies.} One important aspect of the selection of +layout policy that has not been considered up to now, but will soon +become very relevant, is the degree of reconstruction concurrency +afforded by each policy. Because different layout policies perform +reconstructions differently, there are significant differences in the +number of reconstructions that can be performed concurrently in each one. +Note that in previous chapters, we used the term \emph{reconstruction} +broadly to refer to all operations performed on the dynamization to +maintain its structural invariants as the result of a single insert. Here, +we instead use the term to refer to a single call to \texttt{build}. +\begin{itemize} + \item \textbf{Leveling.} \\ + Our leveling layout policy performs a single \texttt{build} + operation involving shards from at most two levels, as well as + flushing the buffer. Thus, at best, there can be two concurrent + operations: the \texttt{build} and the flush. If we + were to proactively perform reconstructions, each \texttt{build} + would require shards from two levels, and so the maximum number + of concurrent reconstructions is half the number of levels, + plus the flush. -\end{enumerate} -In this section, we consider how these restrictions can be overcome given -our dynamization framework, and propose a strategy that achieves the -same worst-case insertion time as the worst-case optimized theoretical -techniques, given a few assumptions about available resources, by taking -advantage of parallelism and proactive scheduling of reconstructions. + \item \textbf{Tiering.} \\ + In our tiering policy, it may be necessary to perform one \texttt{build} + operation per level. Each of these reconstructions involves only shards + from that level. As a result, at most one reconstruction per level + (as well as the flush) can proceed concurrently. + + \item \textbf{BSM.} \\ + The Bentley-Saxe method is highly eager, and merges all relevant + shards, plus the buffer, in a single call to \texttt{build}. As a result, + no concurrency is possible. +\end{itemize} We will be restricting ourselves in this chapter to the tiering layout -policy, because it has some specific properties that are useful to our -goals. In tiering, the input shards to a reconstruction are restricted -to a single level, unlike in leveling where the shards come from two -levels, and BSM where the shards come from potentially \emph{all} -the levels. This allows us to maximize parallelism, which we will -be using to improve the tail latency performance, greatly simplifies -synchronization, and provides us with the largest window over which to -amortize the costs of reconstruction. The techniques we describe in -this chapter will work with leveling as well, albeit less effectively, -and will not work \emph{at all} using BSM. - -First, a comment on nomenclature. We define the term \emph{last level}, -$i = \ell$, to mean the level in the dynamized structure with the -largest index value (and thereby the most records) and \emph{first -level} to mean the level with index $i=0$. Any level with $0 < i < -\ell$ is called an \emph{internal level}. A reconstruction on level $i$ -involves the combination of all blocks on that level into one, larger, -block, that is then appended level $i+1$. Relative to some level at -index $i$, the \emph{next level} is the level at index $i + 1$, and the -\emph{previous level} is at index $i-1$. +policy. 
Tiering provides the most opportunities for concurrency +and (assuming sufficient resources) parallelism. Because a given +reconstruction only requires shards from a single level, using tiering +also makes synchronization significantly easier, and it provides us +with largest window to pre-emptively schedule reconstructions. Most +of our discussion in this chapter could also be applied to leveling, +albeit with worse results. However, BSM \emph{cannot} be used at all. + +\Paragraph{Nomenclature.} For the discussion that follows, it will +be convenient to define a few terms for discussing levels relative to +each other. While these are all fairly straightforward, to alleviate any +potential confusion, we'll define them all explicitly here. We define the +term \emph{last level}, $i = \ell$, to mean the level in the dynamized +structure with the largest index value (and thereby the most records) +and \emph{first level} to mean the level with index $i=0$. Any level +with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction +on level $i$ involves the combination of all blocks on that level into +one, larger, block, that is then appended level $i+1$. Relative to some +level at index $i$, the \emph{next level} is the level at index $i + +1$, and the \emph{previous level} is at index $i-1$. + +\subsection{Concurrent Reconstructions} Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush @@ -318,10 +358,45 @@ will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number -of blocks in the structure remains bounded by $\Theta(\log n)$, we will -throttle the insertion rate so that it is balanced with amount of time -needed to complete reconstructions. +of blocks in the structure remains bounded by $\Theta(\log_s n)$, we +will throttle the insertion rate by adding a stall time, $\delta$, to +each insert. $\delta$ will be determined such that it is sufficiently +large to ensure that any scheduled reconstructions have enough time to +complete before the shard count on any level exceeds $s$. This process +is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. 
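As a companion to the pseudocode in Algorithm~\ref{alg:tl-relaxed-recon} below, the following is a rough threaded sketch of the same insert path. Every name here (\texttt{Shard}, \texttt{Level}, the sort standing in for \texttt{build}, the constants) is a placeholder assumption rather than the framework's actual interface, and the synchronization is deliberately simplistic; the only point is to show where the stall, the background scheduling, and the flush sit relative to one another.

```cpp
// relaxed_insert.cpp -- threaded sketch of the stalled insert path.
// All names and constants are illustrative stand-ins, not a real API.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

using Shard = std::vector<int>;                // stand-in for a static structure
struct Level { std::vector<Shard> shards; };

struct Dynamized {
    size_t buffer_cap = 1000;                  // N_B
    size_t scale_factor = 4;                   // s
    std::chrono::microseconds delta{1};        // per-insert stall (fixed here;
                                               // a real system would derive it)
    std::vector<int> buffer;
    std::vector<Level> levels;
    std::mutex mtx;
    std::vector<std::future<void>> pending;    // declared last so in-flight
                                               // reconstructions finish before
                                               // the members they touch die

    Dynamized() { levels.emplace_back(); }

    // Merge all shards currently on level i and append the result to level i+1.
    void reconstruct(size_t i) {
        std::vector<Shard> victims;
        {
            std::lock_guard<std::mutex> g(mtx);
            victims.swap(levels[i].shards);    // claim the shards to be merged
        }
        Shard merged;
        for (auto &s : victims) merged.insert(merged.end(), s.begin(), s.end());
        std::sort(merged.begin(), merged.end());   // stand-in for build()

        std::lock_guard<std::mutex> g(mtx);
        if (i + 1 == levels.size()) levels.emplace_back();
        levels[i + 1].shards.push_back(std::move(merged));
    }

    void insert(int rec) {
        std::this_thread::sleep_for(delta);    // the stall term
        if (buffer.size() < buffer_cap) { buffer.push_back(rec); return; }

        // Schedule a background reconstruction for any level at capacity.
        {
            std::lock_guard<std::mutex> g(mtx);
            for (size_t i = 0; i < levels.size(); i++)
                if (levels[i].shards.size() >= scale_factor)
                    pending.push_back(std::async(std::launch::async,
                                                 &Dynamized::reconstruct,
                                                 this, i));
        }

        // Flush the buffer into a new level-0 shard, then insert the record.
        Shard s(buffer.begin(), buffer.end());
        std::sort(s.begin(), s.end());         // stand-in for build()
        {
            std::lock_guard<std::mutex> g(mtx);
            levels[0].shards.push_back(std::move(s));
        }
        buffer.clear();
        buffer.push_back(rec);
    }
};

int main() {
    Dynamized d;
    for (int i = 0; i < 20000; i++) d.insert(i);
    std::lock_guard<std::mutex> g(d.mtx);
    std::printf("levels: %zu\n", d.levels.size());
}
```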
+ +\begin{algorithm} +\caption{Relaxed Reconstruction Algorithm with Insertion Stalling} +\label{alg:tl-relaxed-recon} +\KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount} + +\Comment{Stall insertion process by specified amount} +sleep($\delta$) \; +\BlankLine +\Comment{Append to the buffer if possible} +\If {$|\mathcal{B}| < N_B$} { + $\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; + \Return \; +} +\BlankLine +\Comment{Schedule any necessary reconstructions background threads} +\For {$\mathscr{L} \in \mathscr{I}$} { + \If {$|\mathscr{L}| = s$} { + $\text{schedule\_reconstruction}(\mathscr{L})$ \; + } +} + +\BlankLine +\Comment{Perform the flush} +$\mathscr{L}_0 \gets \mathscr{L}_0 \cup \{\text{build}(\mathcal{B})\}$ \; +$\mathcal{B} \gets \emptyset$ \; + +\BlankLine +\Comment{Append to the now empty buffer} +$\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; +\Return \; +\end{algorithm} \begin{figure} \centering @@ -343,12 +418,19 @@ record counts--each level has an increasing number of records per block.}} \label{fig:tl-tiering} \end{figure} -First, we'll consider how to ``spread out'' the cost of the worst-case -reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in +To ensure the correctness of this algorithm, it is necessary to show +that there exists a value for $\delta$ that ensures that the structural +invariants can be maintained. Logically, this $\delta$ can be thought +of as the amount of time needed to perform the active reconstruction +operation, amortized over the inserts between when this reconstruction +can be scheduled, and when it needs to be complete. We'll consider how +to establish this value next. + +Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{is able to be -performed well in advance}. All of the records necessary to perform this +performed in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage to our technique over the normal Bentley-Saxe method, which @@ -358,71 +440,71 @@ leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, -and at least $2$ parallel threads of execution, it is possible to maintain -a worst-case insertion cost of +a single active thread of execution and multiple background threads, and +the ability to control which background thread is active, it is possible +to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which in tiering -will be of cost $\Theta(B(n))$. This reconstruction requires all of the -blocks on the last level of the structure. At the point at which the -last level is full, there will be $\Theta(n)$ inserts before the last -level must be merged and a new level added. - -To ensure that the reconstruction has been completed by the time the -$\Theta(n)$ inserts have been completed, it is sufficient to guarantee -that the rate of inserts is sufficiently slow. 
Ignoring the cost of -buffer flushing, this means that inserts must cost, +will be $\Theta(B(n))$. This reconstruction requires all of the blocks +on the last level of the structure. At the point at which the last level +is full, and this reconstruction can be initiated, there will be another +$\Theta(n)$ inserts before it must be completed in order to maintain +the structural invariants. + +Let each insert have cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} -where the $\Theta(1)$ is the cost of appending to the mutable buffer, -and $\delta$ is a stall inserted into the insertion process to ensure -that the necessary reconstructions are completed in time. - -To identify the value of $\delta$, we note that each insert must take -at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level -reconstruction. However, this is not sufficient to guarantee the bound, as -other reconstructions will also occur within the structure. At the point -at which the last level reconstruction can be scheduled, there will be -exactly $1$ block on each level. Thus, each level will potentially also -have an ongoing reconstruction that must be covered by inserting more -stall time, to ensure that no level in the structure exceeds $s$ blocks. -There are $\log n$ levels in total, and so in the worst case we will need -to introduce a extra stall time to account for a reconstruction on each -level, +where $1$ is the cost of appending to the buffer, and $\delta$ is a +calculated stall time, during which one of the background threads can be +executed. To ensure the last-level reconstruction is complete by the +time that $\Theta(n)$ inserts have finished, it is necessary that +$\delta \in \Theta\left(\frac{B(n)}{n}\right)$. + +However, this amount of stall is insufficient to maintain exactly $s$ +shards on each level of the dynamization. At the point at which the +last-level reconstruction can be scheduled, there will be exactly $1$ +shard on all other levels (see Figure~\ref{fig:tl-tiering}). But, between +when the last-level reconstruction can be scheduled, and it must be +completed, each other level must undergo $s - 1$ reconstructions. Because +we have only a single execution thread, it is necessary to account for +the time to complete these reconstructions as well. In the worst-case, +there will be one active reconstruction on each of the $\log_s n$ levels, +and thus we must introduce stalls such that, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots \delta_{\log n - 1}) \end{equation*} -All of these internal reconstructions will be strictly less than the -size of the last-level reconstruction, and so we can bound them all -above by $O(\frac{B(n)}{n})$ time. - -Given this, and assuming that the smallest (i.e., most pressing) -reconstruction is prioritized on the background thread, we find that +All of these internal reconstructions will be strictly less than the size +of the last-level reconstruction, and so we can bound them all above by +$O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest +(i.e., most pressing) reconstruction is prioritized on the active +thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} -This approach results in an equivalent worst-case insertion latency -bound to~\cite{overmars81}, but manages to resolve both of the issues -cited above. 
By leveraging two parallel threads, instead of trying to -manually multiplex a single thread, this approach requires \emph{no} -modification to the user's block code to function. And, by leveraging -the fact that reconstructions under tiering are strictly local to a -single level, we can avoid needing to add any complicated additional -structures to manage partially building blocks as new records are added. +This approach results in an equivalent worst-case insertion and query +latency bounds to~\cite{overmars81}, but manages to resolve the issues +cited above. By leveraging multiple threads, instead of trying to manually +multiplex a single thread, this approach requires \emph{no} modification +to the user's block code to function. And, by leveraging the fact that +reconstructions under tiering are strictly local to a single level, we +can avoid needing to add any additional structures to manage partially +building blocks as new records are added. \subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there -are two available threads of parallel execution, which allows for the -reconstructions to run in parallel with inserts. The amount of necessary -insertion stall can be significantly reduced, however, if more than two -threads are available. +is only a single available thread of parallel execution. This requires +that the insertion stall amount be large enough to cover all of the +reconstructions necessary at any moment in time. If we have access to +parallel execution units, though, we can significantly reduce the amount +of stall time required. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that it is insufficient to cover only the cost of the last level @@ -432,22 +514,21 @@ level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. -Consider a parallel implementation that, contrary to -Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover -the last level reconstruction, and blocks all other reconstructions -until it has been completed. Such an approach would result in $\delta -= \frac{B(n)}{n}$ stall and complete the last level reconstruction -after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ -blocks would accumulate in L0, ultimately resulting in a bound of -$\Theta(n)$ blocks in the structure, rather than the $\Theta(\log -n)$ bound we are trying to maintain. This is the reason why -Theorem~\ref{theo:worst-case-optimal} must account for stalls on every -level, and assumes that the smallest (and therefore most pressing) -reconstruction is always active on the parallel reconstruction -thread. This introduces the extra $\log n$ factor into the worst-case -insertion cost function, because there will at worst be a reconstruction -running on every level, and each reconstruction will be no larger than -$\Theta(n)$ records. +To see why this is important, consider an implementation that, contrary +to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover +the last-level reconstruction. All other reconstructions are blocked +until the last-level one has been completed. This approach would +result in $\delta = \frac{B(n)}{n}$ stall and complete the last +level reconstruction after $\Theta(n)$ inserts. 
During this time, +$\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately +resulting in a bound of $\Theta(n)$ blocks in the structure, rather than +the $\Theta(\log n)$ bound we are trying to maintain. This is the reason +why Theorem~\ref{theo:worst-case-optimal} must account for stalls on +every level, and assumes that the smallest (and therefore most pressing) +reconstruction is always active. This introduces the extra $\log n$ +factor into the worst-case insertion cost function, because there will at +worst be a reconstruction running on every level, and each reconstruction +will be no larger than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this @@ -463,9 +544,14 @@ B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. -However, additional parallelism will allow us to reduce this. At the -upper limit, assume that there are $\log n$ threads available for parallel -reconstructions. This condition allows us to derive a smaller bound in +However, additional parallelism will allow us to reduce this. At +the upper limit, assume that there are $\log n$ threads available +for parallel reconstructions. We'll consider the fork-join model of +parallelism~\cite{fork-join}, where the initiation of a reconstruction +on a thread constitutes a fork, and joining the reconstruction thread +involves applying its updates to the dynamized structure. + +This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} @@ -521,10 +607,6 @@ is the worst-case insertion cost, while ensuring that all reconstructions are done in time to maintain the block bound given $\log n$ parallel threads. \end{proof} - - - - \section{Implementation} \label{sec:tl-impl} |
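One practical question that the preceding analysis leaves open is how the stall $\delta$ might actually be chosen at runtime. A simple illustrative policy, sketched below with an assumed linear cost model and made-up constants, is to estimate the outstanding reconstruction work at each flush and divide it by the number of inserts that can be absorbed before the shard-count invariant would be violated. With a single worker thread the estimate must cover the \emph{sum} of the outstanding work, which is the source of the $\log n$ factor in Theorem~\ref{theo:worst-case-optimal}; with a worker per level it needs to cover only the largest single reconstruction, in the spirit of Theorem~\ref{theo:par-worst-case-optimal}.

```cpp
// stall_policy.cpp -- illustrative sketch of one way to pick the per-insert
// stall delta; the cost model and all constants are assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>

// Assumed build cost, in microseconds, for a reconstruction over `records`
// records (a simple linear model; B(n) is structure-dependent in general).
static double build_cost_us(double records) { return 0.05 * records; }

// Given the record counts of all pending reconstructions and the number of
// inserts available before a shard-count invariant is violated, return the
// per-insert stall in microseconds.
static double stall_delta_us(const std::vector<double> &recon_sizes,
                             double inserts_until_violation,
                             size_t worker_threads) {
    double total = 0.0, largest = 0.0;
    for (double r : recon_sizes) {
        total += build_cost_us(r);
        largest = std::max(largest, build_cost_us(r));
    }
    // One worker: the stall must amortize the sum of the outstanding work.
    // One worker per level: only the largest reconstruction matters.
    double work = (worker_threads >= recon_sizes.size()) ? largest : total;
    return work / inserts_until_violation;
}

int main() {
    // Example: pending reconstructions on four levels of a tiering structure,
    // with roughly 1e6 inserts available before the last level must be merged.
    std::vector<double> sizes{4 * 12e3, 4 * 48e3, 4 * 192e3, 4 * 768e3};
    std::printf("delta (1 worker)  = %.3f us/insert\n",
                stall_delta_us(sizes, 1e6, 1));
    std::printf("delta (4 workers) = %.3f us/insert\n",
                stall_delta_us(sizes, 1e6, 4));
}
```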