| | | |
|---|---|---|
| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 17:53:37 -0400 |
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 17:53:37 -0400 |
| commit | 01f6950612be18468376aeffb8eef0d7205d86d5 (patch) | |
| tree | 8f75d21427168af515cae7e66c12844d2a32dbf4 /chapters/tail-latency.tex | |
| parent | 33bc7e620276f4269ee5f1820e5477135e020b3f (diff) | |
| download | dissertation-01f6950612be18468376aeffb8eef0d7205d86d5.tar.gz | |
updates
Diffstat (limited to 'chapters/tail-latency.tex')
| -rw-r--r-- | chapters/tail-latency.tex | 596 |
1 file changed, 339 insertions, 257 deletions
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex index 5c0e0ba..1a468df 100644 --- a/chapters/tail-latency.tex +++ b/chapters/tail-latency.tex @@ -21,13 +21,11 @@ that of its most direct dynamic analog: a B+Tree. While, as shown in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has superior average performance to the native dynamic structure, the latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat} are -quite different. The dynamized structure has much better "best-case" -performance, but the worst-case performance is exceedingly poor. That -the structure exhibits reasonable performance on average is the result -of these two ends of the distribution balancing each other out. +quite different. The dynamized structure has much better best-case +performance, but the worst-case performance is exceedingly poor. This poor worst-case performance is a direct consequence of the different -approaches to update support used by the dynamized structure and B+Tree. +approaches used by the dynamized structure and B+Tree to support updates. B+Trees use a form of amortized local reconstruction, whereas the dynamized ISAM tree uses amortized global reconstruction. Because the B+Tree only reconstructs the portions of the structure ``local'' to the @@ -43,15 +41,18 @@ B+Tree. At the extreme end of the latency distribution, though, the local reconstruction strategy used by the B+Tree results in significantly better worst-case performance. -Unfortunately, the design space that we have been considering thus far -is limited in its ability to meaningfully alter the worst-case insertion -performance. While we have seen that the choice of layout policy can have -some effect, the actual benefit in terms of tail latency is quite small, -and the situation is made worse by the fact that leveling, which can -have better worst-case insertion performance, lags behind tiering in -terms of average insertion performance. The use of leveling can allow -for a small reduction in the worst case, but at the cost of making the -majority of inserts worse because of increased write amplification. +Unfortunately, the design space that we have been considering is +limited in its ability to meaningfully alter the worst-case insertion +performance. Leveling requires only a fraction of, rather than all of, the +records in the structure to participate in its worst-case reconstruction, +and as a result shows slightly reduced worst-case insertion cost compared +to the other layout policies. However, this effect only results in a +single-digit factor reduction in measured worst-case latency, and has +no effect on the insertion latency distribution itself outside of the +absolute maximum value. Additionally, we've shown that leveling performed +significantly worse in average insertion performance compared to tiering, +and so its usefulness as a tool to reduce insertion tail latencies is +questionable at best. \begin{figure} \subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} @@ -60,32 +61,35 @@ majority of inserts worse because of increased write amplification. \label{fig:tl-parm-sweep} \end{figure} -The other tuning knobs that are available to us are of limited usefulness -in tuning the worst case behavior. 
Figure~\ref{fig:tl-parm-sweep} -shows the latency distributions of our framework as we vary +Our existing framework support two other tuning parameters: the scale +factor and the buffer size. However, neither of these parameters +are of much use in adjusting the worst-case insertion behavior. We +demonstrate this experimentally in Figure~\ref{fig:tl-parm-sweep}, +which shows the latency distributions of our framework as we vary the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size (Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in worst-case performance to be seen here. Adjusting the scale factor does have an effect on the distribution, but not in a way that is particularly useful from a configuration standpoint, and adjusting the mutable buffer -has almost no effect on the worst-case latency at all, or even on the -distribution; particularly when tiering is used. This is to be expected; -ultimately the worst-case reconstruction size is largely the same -regardless of scale factor or buffer size: $\Theta(n)$ records. - -The selection of configuration parameters can influence \emph{when} -these reconstructions occur, as well as slightly influence their size, but -ultimately the question of ``which configuration has the best tail-latency -performance'' is more a question of how many insertions the latency is -measured over, than any fundamental trade-offs with the design space. This -is exemplified rather well in Figure~\ref{fig:tl-parm-sf}. Each of -the ``shelves'' in the distribution correspond to reconstructions on -particular levels. As can be seen, the lines cross each other repeatedly -at these shelves. These cross-overs are points at which one configuration -begins to, temporarily, exhibit better tail latency behavior than the -other. However, after enough records have been inserted to cause the next -largest reconstructions to begin to occur, the "better" configuration -begins to appear worse again in terms of tail latency.\footnote{ +has almost no effect on either the distribution itself, or the worst-case +insertion latency; particularly when tiering is used. This is to be +expected; ultimately the worst-case reconstruction size is largely the +same irrespective of scale factor or buffer size: $\Theta(n)$ records. + +The selection of configuration parameters does influence \emph{when} +a worst-case reconstruction occurs, and can slightly affect its size. +Ultimately, however, the answer to the question of which configuration +has the best insertion tail latency performance is more a matter of how +many records the insertion latencies are measured over than it is one of +any fundamental design trade-offs within the space. This is exemplified +rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in +the distribution correspond to reconstructions on particular levels. As +can be seen, the lines cross each other repeatedly at these shelves. These +cross-overs are points at which one configuration begins to exhibit +better tail latency behavior than another. However, after enough records +have been inserted, the next largest reconstructions will begin to +occur. This will make the "better" configuration appear worse in +terms of tail latency.\footnote{ This plot also shows a notable difference between leveling and tiering. In the tiering configurations, the transitions between the shelves are steep and abrupt, whereas in leveling, @@ -112,32 +116,29 @@ to demonstrate that the theoretical trade-offs are achievable in practice. 
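The claim that the worst-case reconstruction size is essentially independent of these parameters can be checked with a short simulation. The sketch below is illustrative only: the record-count cost model and all constants are assumptions rather than measurements from the benchmarked framework. It replays buffer flushes under the tiering policy for several scale factor and buffer size combinations and reports the largest single reconstruction each configuration triggers; in every case the maximum involves a constant fraction of the total records inserted, which is the $\Theta(n)$ behavior described above.

```cpp
// tiering_sim.cpp -- illustrative simulation only; parameters and the
// record-count cost model are assumptions, not benchmark measurements.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// A "level" is just the list of shard sizes (record counts) it contains.
using Level = std::vector<uint64_t>;

// Flush a buffer of `nb` records, cascading tiering reconstructions as
// needed. Returns the largest reconstruction (in records) this flush caused.
static uint64_t flush(std::vector<Level> &levels, size_t s, uint64_t nb) {
    uint64_t largest = 0;

    // Find the shallowest non-full level; everything above it must cascade.
    size_t d = 0;
    while (d < levels.size() && levels[d].size() == s) d++;

    // Merge each full level into the one below it, deepest first.
    for (size_t i = d; i-- > 0;) {
        uint64_t merged =
            std::accumulate(levels[i].begin(), levels[i].end(), uint64_t{0});
        largest = std::max(largest, merged);
        if (i + 1 == levels.size()) levels.emplace_back();
        levels[i + 1].push_back(merged);
        levels[i].clear();
    }

    levels[0].push_back(nb);
    return largest;
}

int main() {
    const uint64_t target = 100'000'000;  // total records to insert
    const std::pair<size_t, uint64_t> configs[] = {
        {4, 8000}, {8, 8000}, {4, 32000}};  // (scale factor, buffer size)

    for (auto [s, nb] : configs) {
        std::vector<Level> levels(1);
        uint64_t inserted = 0, worst = 0;
        while (inserted < target) {
            worst = std::max(worst, flush(levels, s, nb));
            inserted += nb;
        }
        std::printf("s=%zu N_B=%llu  worst reconstruction: %llu records "
                    "(%.1f%% of n)\n",
                    s, (unsigned long long)nb, (unsigned long long)worst,
                    100.0 * worst / inserted);
    }
}
```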
\section{The Insertion-Query Trade-off} \label{sec:tl-insert-query-tradeoff} -As reconstructions are at the heart of the insertion tail latency problem, -it seems worth taking a moment to consider \emph{why} they must be done -at all. Fundamentally, decomposition-based dynamization techniques trade -between insertion and query performance by controlling the number of -blocks in the decomposition. Placing a bound on this number is necessary -to bound the worst-case query cost, and is done using reconstructions -to either merge (in the case of the Bentley-Saxe method) or re-partition -(in the case of the equal block method) the blocks. Performing less -frequent (or smaller) reconstructions reduces the amount of work -associated with inserts, at the cost of allowing more blocks to accumulate -and thereby hurting query performance. +Reconstructions lie at the heart of the insertion tail latency problem, +and so it seems worth taking a moment to consider \emph{why} they occur +at all. Fundamentally, decomposition-based dynamization techniques +trade between insertion and query performance by controlling the number +of blocks in the decomposition. Placing a bound on this number is +necessary to bound the worst-case query cost, and this bound is enforced +using reconstructions to either merge (in the case of the Bentley-Saxe +method) or re-partition (in the case of the equal block method) the +blocks. Performing less frequent (or smaller) reconstructions reduces +the amount of work associated with inserts, at the cost of allowing more +blocks to accumulate and thereby hurting query performance. This trade-off between insertion and query performance by way of block -count is most directly visible in the equal block method described -in Section~\ref{ssec:ebm}. This technique provides the -following worst-case insertion and query bounds, +count is most directly visible in the equal block method described in +Section~\ref{ssec:ebm}. We will consider a variant of the equal block +method here for which $f(n) \in \Theta(1)$, resulting in a dynamization +that does not perform any re-partitioning. In this case, the technique +provides the following worst-case insertion and query bounds, \begin{align*} I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\ \mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \end{align*} -where $f(n)$ is the number of blocks. This worst-case result ignores -re-partitioning costs, which may be necessary for certain selections -of $f(n)$. We omit it here because we are about to examine a case -of the equal block method were no re-partitioning is necessary. When -re-partitioning is used, the worst case cost rises to the now familiar $I(n) -\in \Theta(B(n))$ result. +where $f(n)$ is the number of blocks. \begin{figure} \centering @@ -157,63 +158,72 @@ method, with \begin{equation*} f(n) = C \end{equation*} -for varying constant values of $C$. Note that in this test the final -record count was known in advance, allowing all re-partitioning to be -avoided. This represents a sort of ``best case scenario'' for the -technique, and isn't reflective of real-world performance, but does -serve to demonstrate the relevant properties in the clearest possible -manner. +for varying constant values of $C$. As noted above, this special case of +the equal block method allows for re-partitioning costs to be avoided +entirely, resulting in a very clean trade-off space. 
This result isn't +necessarily demonstrative of the real-world performance of the equal +block method, but it does serve to demonstrate the relevant properties +of the method in the clearest possible manner. Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block method provides a very direct relationship between the tail latency and the number of blocks. The worst-case insertion performance is -dictated by the size of the largest reconstruction, and so increasing -the block count results in smaller blocks, and better insertion -performance. These worst-case results also translate directly into -improved average throughput, at the cost of query latency, as shown in -Figure~\ref{fig:tl-ebm-tradeoff}. These results show that, contrary to our -Bentley-Saxe inspired dynamization system, the equal block method provides -clear and direct relationships between insertion and query performance, -as well as direct control over tail latency, through its design space. - -Unfortunately, the equal block method is not well suited for -our purposes. Despite having a much cleaner trade-off space, its -performance is strictly worse than our dynamization system. Comparing -Figure~\ref{fig:tl-ebm-tradeoff} with Figure~\ref{fig:design-tradeoff} -shows that, for a specified query latency, our technique provides -significantly better insertion throughput.\footnote{ - In actuality, the insertion performance of the equal block method is - even \emph{worse} than the numbers presented here. For this particular - benchmark, we implemented the technique knowing the number of records in - advance, and so fixed the size of each block from the start. This avoided - the need to do any re-partitioning as the structure grew, and reduced write - amplification. +dictated by the size of the largest reconstruction. Increasing the +block count reduces the size of each block, and so improves the +insertion performance. Figure~\ref{fig:tl-ebm-tradeoff} shows that +these improvements to tail latency performance translate directly into +an improvement of the overall insertion throughput as well, as the +cost of worse query latencies. Contrary to our Bentley-Saxe inspired +dynamization system, this formulation of the equal block method provides +direct control over insertion tail latency, as well as a much cleaner +relationship between average insertion and query performance. + +While these results are promising, the equal block method is not well +suited for our purposes. Despite having a clean trade-off space and +control over insertion tail latencies, this technique is strictly +worse than our existing dynamization system in every way but tail +latency control. Comparing Figure~\ref{fig:tl-ebm-tradeoff} with +Figure~\ref{fig:design-tradeoff} shows that, for a specified query +latency, a logarithmic decomposition provides significantly better +insertion throughput. Our technique uses geometric growth of block sizes, +which ensures that most reconstructions are smaller than those in the +equal block method for an equivalent number of blocks. This comes at +the cost of needing to occasionally perform large reconstructions to +compact these smaller blocks, resulting in the poor tail latencies we +are attempting to resolve. Thus, it seems as though poor tail latency +is concomitant with good average performance. 
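To make the shape of this trade-off concrete, the following is a minimal sketch of the $f(n) = C$ variant of the equal block method, using sorted arrays as a stand-in for the static structure. The class name, the round-robin block selection, and the use of a full re-sort in place of a bulk-loaded \texttt{build} are all illustrative assumptions, not the implementation used in the experiments above. Each insert rebuilds a single block of roughly $n/C$ records and each query must probe all $C$ blocks, mirroring the $I(n)$ and $\mathscr{Q}(n)$ bounds given earlier: raising $C$ makes inserts cheaper and queries more expensive, with no single large reconstruction anywhere in the structure, but also without the geometric block-size growth that gives the logarithmic decomposition its superior average performance.

```cpp
// equal_blocks.cpp -- minimal sketch of the equal block method with f(n) = C.
// Sorted vectors stand in for the static structure; illustrative only.
#include <algorithm>
#include <cstdio>
#include <vector>

class EqualBlocks {
    std::vector<std::vector<int>> blocks_;  // C blocks, each kept sorted
    size_t next_ = 0;                       // block that absorbs the next insert

public:
    explicit EqualBlocks(size_t c) : blocks_(c) {}

    // Insert rebuilds exactly one block: Theta(n / C) records participate.
    // (A real static structure would be bulk-loaded rather than re-sorted.)
    void insert(int key) {
        auto &blk = blocks_[next_];
        next_ = (next_ + 1) % blocks_.size();
        blk.push_back(key);
        std::sort(blk.begin(), blk.end());
    }

    // A (decomposable) membership query is answered against every block.
    bool contains(int key) const {
        for (const auto &blk : blocks_)
            if (std::binary_search(blk.begin(), blk.end(), key)) return true;
        return false;
    }
};

int main() {
    EqualBlocks eb(8);  // C = 8 blocks
    for (int i = 0; i < 20000; i++) eb.insert(i * 7 % 20000);
    std::printf("contains(4242) = %d\n", eb.contains(4242));
}
```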
+ +Despite this, let's consider an approach to reconstruction within our +framework that optimizes for insertion tail latency exclusively using +the equal block method, neglecting any considerations of maintaining +a shard count bound. We can consider a variant of the equal block +method having $f(n) = \frac{n}{N_B}$. This case, like +the $f(n) = C$ approach considered above, avoids all re-partitioning, +because records are flushed into the dynamization in sets of $N_B$ size, +and so each new block is always exactly full on creation. In effect, this +technique has no reconstructions at all. Each buffer flush simply creates +a new block that is added to an ever-growing list. This produces a system +with worst-case insertion and query costs of, +\begin{align*} + I(n) &\in \Theta(B(N_B)) \\ + \mathscr{Q}(n) &\in O (n\cdot \mathscr{Q}_S(N_B)) +\end{align*} +where the worst-case insertion is simply the cost of a buffer flush, +and the worst-case query cost follows from the fact that there will +be $\Theta\left(\frac{n}{N_B}\right)$ shards in the dynamization, each +of which will have exactly $N_B$ records.\footnote{ + We are neglecting the cost of querying the buffer in this cost function + for simplicity. } -This is because, in our technique, the variable size of the blocks allows -for the majority of the reconstructions to occur with smaller structures, -while allowing the majority of the records to exist in a small number -of large blocks at the bottom of the structure. This setup enables -high insertion throughput while keeping the block count small. But, -as we've seen, the cost of this is large tail latencies, as the large -blocks must occasionally be involved in reconstructions. However, we -can use the extreme ends of the equal block method's design space to -consider upper limits on the insertion and query performance that we -might expect to get out of a dynamized structure, and then take steps -within our own framework to approach these limits, while retaining the -desirable characteristics of the logarithmic decomposition. - -At the extreme end, consider what would happen if we were to modify -our dynamization framework to avoid all reconstructions. We retain a -buffer of size $N_B$, which we flush to create a shard when full; however -we never touch the shards once they are created. This is effectively -the equal block method, where every block is fixed at $N_B$ capacity. -Such a technique would result in a worst-case insertion cost of $I(n) \in -\Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards in -total, resulting in $\mathscr{Q}(n) \in O(n \cdot \mathscr{Q}_s(N_B))$ -worst-case query cost for a decomposable search problem. Applying -this technique to an ISAM Tree, and compared against a B+Tree, -yields the insertion and query latency distributions shown in -Figure~\ref{fig:tl-floodl0}. + +Applying this technique to an ISAM Tree, and compared against a +B+Tree, yields the insertion and query latency distributions shown +in Figure~\ref{fig:tl-floodl0}. Figure~\ref{fig:tl-floodl0-insert} +shows that it is possible to obtain insertion latency distributions +using amortized global reconstruction that are directly comparable to +dynamic structures based on amortized local reconstruction. However, +this performance comes at the cost of queries, which are incredibly slow +compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}. 
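For a rough sense of the magnitude of this gap, consider illustrative values (assumed for the sake of the example, not measured) of $n = 10^8$ and $N_B = 10^4$, with $\mathscr{Q}_S(m) \in \Theta(\log m)$ comparisons for a single sorted shard,
\begin{align*}
\frac{n}{N_B} \cdot \log_2 N_B &\approx 10^4 \cdot 13.3 \approx 1.3\times 10^5 \text{ comparisons per query} \\
\log_2 n &\approx 27 \text{ comparisons for a single balanced structure}
\end{align*}
a difference of roughly four orders of magnitude in comparison count alone, before accounting for the poorer cache behavior of probing $10^4$ separate shards.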
\begin{figure} \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} @@ -223,93 +233,123 @@ Figure~\ref{fig:tl-floodl0}. \label{fig:tl-floodl0} \end{figure} -Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain -insertion latency distributions using amortized global reconstruction -that are directly comparable to dynamic structures based on amortized -local reconstruction, at least in some cases. In particular, the -worst-case insertion tail latency in this model is a direct function -of the buffer size, as the worst-case insert occurs when the buffer -must be flushed to a shard. However, this performance comes at the -cost of queries, which are incredibly slow compared to B+Trees, as -shown in Figure~\ref{fig:tl-floodl0-query}. - -Unfortunately, the query latency of this technique is too large for it -to be useful; it is necessary to perform reconstructions to merge these -small shards together to ensure good query performance. However, this -does raise an interesting point. Fundamentally, the reconstructions -that contribute to tail latency are \emph{not} required from an -insertion perspective; they are a query optimization. Thus, we could -remove the reconstructions from the insertion process and perform -them elsewhere. This could, theoretically, allow us to have our -cake and eat it too. The only insertion bottleneck would become -the buffer flushing procedure--as is the case in our hypothetical -``reconstructionless'' approach. Unfortunately, it is not as simple as -pulling the reconstructions off of the insertion path and running them in -the background, as this alone cannot provide us with a meaningful bound -on the number of blocks in the dynamized structure. But, it is possible -to still provide this bound, if we're willing to throttle the insertion -rate to be slow enough to keep up with the background reconstructions. In -the next section, we'll discuss a technique based on this idea. + +On its own, this technique exhibits too large of a degradation +of query latency for it to be useful in any scenario involving a +need for queries. However, it does demonstrate that, in scenarios +where insertion doesn't need to block on reconstructions, it is +possible to obtain significantly improved insertion tail latency +distributions. Unfortunately, it also shows that using reconstructions to +enforce structural invariants to control the number of block is critical +for query performance. In the next section, we will consider approaches +to reconstruction that allow us to maintain the structural invariants of +our dynamization, while avoiding direct blocking of inserts, in an attempt +to reduce the worst-case insertion cost in line with the approach we have +just discussed, while maintaining similar worst-case query bounds to our +existing dynamization system. \section{Relaxed Reconstruction} -There does exist theoretical work on throttling the insertion -rate of a Bentley-Saxe dynamization to control the worst-case -insertion cost~\cite{overmars81}, which we discussed in -Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to -break the largest reconstructions up into small sequences of operations, -that can then be attached to each insert, spreading the total workload out -and ensuring each insert takes a consistent amount of time. 
Theoretically, +Reconstructions are necessary to maintain the structural invariants of +our dynamization, which are themselves required to maintain bounds on +worst-case query performance. Inserts are what causes the structure to +violate these invariants, and so it makes sense to attach reconstructions +to the insertion process to allow a strict maintenance of these +invariants. However, it is possible to take a more relaxed approach +to maintaining these invariants using concurrency, allowing for the same +shard bound to be enforced at a much lower worst-case insertion cost. + +There does exist theoretical work in this area, which we've already +discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this +technique is to relax the strict binary decomposition of the Bentley-Saxe +method to allow multiple reconstructions to occur at once, and to add +a buffer to each level to contain a partially built structure. Then, +reconstructions are split up into small batches of operations, +which are attached to each insert that is issued up to the moment +when the reconstruction must be complete. By doing this, the work +of the reconstructions is spread out across many inserts, in effect +removing the need to block for large reconstructions. Theoretically, the total throughput should remain about the same when doing this, but rather than having a bursty latency distribution with many fast inserts, -and a small number of incredibly slow ones, distribution should be -more normal. +and a small number of incredibly slow ones, distribution should be more +normal.~\cite{overmars81} Unfortunately, this technique has a number of limitations that we -discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for -this discussion, they are -\begin{enumerate} - - \item In the Bentley-Saxe method, the worst-case reconstruction - involves every record in the structure. As such, it cannot be - performed ``in advance'' without significant extra work. This problem - requires the worst-case optimized dynamization systems to include - complicated structures of partially built structures. - - \item The approach assumes that the workload of building a - block can be evenly divided in advance, and somehow attached - to inserts. Even for simple structures, this requires a large - amount of manual adjustment to the data structure reconstruction - routines, and doesn't admit simple, generalized interfaces. +discussed in Section~\ref{ssec:bsm-tail-latency-problem}. It effectively +reduces to manually multiplexing a single thread to perform a highly +controlled, concurrent, reconstruction process. This requires the ability +to evenly divide up the work of building a data structure and somehow +attach these operations to individual inserts. This makes it ill-suited +for our general framework, because, even when the construction can be +split apart into small independent chunks, implementing it requires a +significant amount of manual adjustment to the data structure construction +processes. + +In this section, we will propose an alternative approach for implementing +a similar idea using multi-threading and prove that we can achieve, +in principle, the same worst-case insertion and query costs in a far +more general and easily implementable manner. We'll then show how we +can further leverage parallelism on top of our approach to obtain +\emph{better} worst-case bounds, assuming sufficient resources are +available. 
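To see concretely why the manual multiplexing described above is so intrusive, consider what a resumable \texttt{build} would have to look like. The sketch below is purely illustrative: it re-expresses a simple two-way merge as an operation that can be advanced a fixed number of steps per insert, which is the essence of attaching reconstruction work to individual insertions. No general static structure exposes its construction in this resumable form out of the box, and rewriting each \texttt{build} routine this way is exactly the per-structure manual adjustment that makes the approach a poor fit for a generic framework.

```cpp
// incremental_merge.cpp -- toy illustration of "attach work to each insert":
// a two-way merge of sorted runs that can be advanced a few steps at a time.
#include <cstdio>
#include <vector>

struct IncrementalMerge {
    const std::vector<int> &a, &b;  // input runs
    std::vector<int> out;           // partially built output
    size_t i = 0, j = 0;

    IncrementalMerge(const std::vector<int> &x, const std::vector<int> &y)
        : a(x), b(y) { out.reserve(a.size() + b.size()); }

    // Perform up to `steps` units of merge work; returns true when finished.
    bool advance(size_t steps) {
        while (steps-- > 0 && (i < a.size() || j < b.size())) {
            if (j == b.size() || (i < a.size() && a[i] <= b[j]))
                out.push_back(a[i++]);
            else
                out.push_back(b[j++]);
        }
        return i == a.size() && j == b.size();
    }
};

int main() {
    std::vector<int> a{1, 3, 5, 7}, b{2, 4, 6, 8};
    IncrementalMerge m(a, b);

    // Each "insert" advances the pending reconstruction by a fixed quantum.
    int inserts = 1;
    while (!m.advance(2)) inserts++;
    std::printf("merge finished after %d inserts, %zu records\n",
                inserts, m.out.size());
}
```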
+ + +\Paragraph{Layout Policies.} One important aspect of the selection of +layout policy that has not been considered up to now, but will soon +become very relevant, is the degree of reconstruction concurrency +afforded by each policy. Because different layout policies perform +reconstructions differently, there are significant differences in the +number of reconstructions that can be performed concurrently in each one. +Note that in previous chapters, we used the term \emph{reconstruction} +broadly to refer to all operations performed on the dynamization to +maintain its structural invariants as the result of a single insert. Here, +we instead use the term to refer to a single call to \texttt{build}. +\begin{itemize} + \item \textbf{Leveling.} \\ + Our leveling layout policy performs a single \texttt{build} + operation involving shards from at most two levels, as well as + flushing the buffer. Thus, at best, there can be two concurrent + operations: the \texttt{build} and the flush. If we + were to proactively perform reconstructions, each \texttt{build} + would require shards from two levels, and so the maximum number + of concurrent reconstructions is half the number of levels, + plus the flush. -\end{enumerate} -In this section, we consider how these restrictions can be overcome given -our dynamization framework, and propose a strategy that achieves the -same worst-case insertion time as the worst-case optimized theoretical -techniques, given a few assumptions about available resources, by taking -advantage of parallelism and proactive scheduling of reconstructions. + \item \textbf{Tiering.} \\ + In our tiering policy, it may be necessary to perform one \texttt{build} + operation per level. Each of these reconstructions involves only shards + from that level. As a result, at most one reconstruction per level + (as well as the flush) can proceed concurrently. + + \item \textbf{BSM.} \\ + The Bentley-Saxe method is highly eager, and merges all relevant + shards, plus the buffer, in a single call to \texttt{build}. As a result, + no concurrency is possible. +\end{itemize} We will be restricting ourselves in this chapter to the tiering layout -policy, because it has some specific properties that are useful to our -goals. In tiering, the input shards to a reconstruction are restricted -to a single level, unlike in leveling where the shards come from two -levels, and BSM where the shards come from potentially \emph{all} -the levels. This allows us to maximize parallelism, which we will -be using to improve the tail latency performance, greatly simplifies -synchronization, and provides us with the largest window over which to -amortize the costs of reconstruction. The techniques we describe in -this chapter will work with leveling as well, albeit less effectively, -and will not work \emph{at all} using BSM. - -First, a comment on nomenclature. We define the term \emph{last level}, -$i = \ell$, to mean the level in the dynamized structure with the -largest index value (and thereby the most records) and \emph{first -level} to mean the level with index $i=0$. Any level with $0 < i < -\ell$ is called an \emph{internal level}. A reconstruction on level $i$ -involves the combination of all blocks on that level into one, larger, -block, that is then appended level $i+1$. Relative to some level at -index $i$, the \emph{next level} is the level at index $i + 1$, and the -\emph{previous level} is at index $i-1$. +policy. 
Tiering provides the most opportunities for concurrency +and (assuming sufficient resources) parallelism. Because a given +reconstruction only requires shards from a single level, using tiering +also makes synchronization significantly easier, and it provides us +with largest window to pre-emptively schedule reconstructions. Most +of our discussion in this chapter could also be applied to leveling, +albeit with worse results. However, BSM \emph{cannot} be used at all. + +\Paragraph{Nomenclature.} For the discussion that follows, it will +be convenient to define a few terms for discussing levels relative to +each other. While these are all fairly straightforward, to alleviate any +potential confusion, we'll define them all explicitly here. We define the +term \emph{last level}, $i = \ell$, to mean the level in the dynamized +structure with the largest index value (and thereby the most records) +and \emph{first level} to mean the level with index $i=0$. Any level +with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction +on level $i$ involves the combination of all blocks on that level into +one, larger, block, that is then appended level $i+1$. Relative to some +level at index $i$, the \emph{next level} is the level at index $i + +1$, and the \emph{previous level} is at index $i-1$. + +\subsection{Concurrent Reconstructions} Our proposed approach is as follows. We will fully detach reconstructions from buffer flushes. When the buffer fills, it will immediately flush @@ -318,10 +358,45 @@ will be performed in the background to maintain the internal structure according to the tiering policy. When a level contains $s$ blocks, a reconstruction will immediately be triggered to merge these blocks and push the result down to the next level. To ensure that the number -of blocks in the structure remains bounded by $\Theta(\log n)$, we will -throttle the insertion rate so that it is balanced with amount of time -needed to complete reconstructions. +of blocks in the structure remains bounded by $\Theta(\log_s n)$, we +will throttle the insertion rate by adding a stall time, $\delta$, to +each insert. $\delta$ will be determined such that it is sufficiently +large to ensure that any scheduled reconstructions have enough time to +complete before the shard count on any level exceeds $s$. This process +is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. 
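As a companion to the pseudocode in Algorithm~\ref{alg:tl-relaxed-recon} below, the following is a rough threaded sketch of the same insert path. Every name here (\texttt{Shard}, \texttt{Level}, the sort standing in for \texttt{build}, the constants) is a placeholder assumption rather than the framework's actual interface, and the synchronization is deliberately simplistic; the only point is to show where the stall, the background scheduling, and the flush sit relative to one another.

```cpp
// relaxed_insert.cpp -- threaded sketch of the stalled insert path.
// All names and constants are illustrative stand-ins, not a real API.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

using Shard = std::vector<int>;                // stand-in for a static structure
struct Level { std::vector<Shard> shards; };

struct Dynamized {
    size_t buffer_cap = 1000;                  // N_B
    size_t scale_factor = 4;                   // s
    std::chrono::microseconds delta{1};        // per-insert stall (fixed here;
                                               // a real system would derive it)
    std::vector<int> buffer;
    std::vector<Level> levels;
    std::mutex mtx;
    std::vector<std::future<void>> pending;    // declared last so in-flight
                                               // reconstructions finish before
                                               // the members they touch die

    Dynamized() { levels.emplace_back(); }

    // Merge all shards currently on level i and append the result to level i+1.
    void reconstruct(size_t i) {
        std::vector<Shard> victims;
        {
            std::lock_guard<std::mutex> g(mtx);
            victims.swap(levels[i].shards);    // claim the shards to be merged
        }
        Shard merged;
        for (auto &s : victims) merged.insert(merged.end(), s.begin(), s.end());
        std::sort(merged.begin(), merged.end());   // stand-in for build()

        std::lock_guard<std::mutex> g(mtx);
        if (i + 1 == levels.size()) levels.emplace_back();
        levels[i + 1].shards.push_back(std::move(merged));
    }

    void insert(int rec) {
        std::this_thread::sleep_for(delta);    // the stall term
        if (buffer.size() < buffer_cap) { buffer.push_back(rec); return; }

        // Schedule a background reconstruction for any level at capacity.
        {
            std::lock_guard<std::mutex> g(mtx);
            for (size_t i = 0; i < levels.size(); i++)
                if (levels[i].shards.size() >= scale_factor)
                    pending.push_back(std::async(std::launch::async,
                                                 &Dynamized::reconstruct,
                                                 this, i));
        }

        // Flush the buffer into a new level-0 shard, then insert the record.
        Shard s(buffer.begin(), buffer.end());
        std::sort(s.begin(), s.end());         // stand-in for build()
        {
            std::lock_guard<std::mutex> g(mtx);
            levels[0].shards.push_back(std::move(s));
        }
        buffer.clear();
        buffer.push_back(rec);
    }
};

int main() {
    Dynamized d;
    for (int i = 0; i < 20000; i++) d.insert(i);
    std::lock_guard<std::mutex> g(d.mtx);
    std::printf("levels: %zu\n", d.levels.size());
}
```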
+ +\begin{algorithm} +\caption{Relaxed Reconstruction Algorithm with Insertion Stalling} +\label{alg:tl-relaxed-recon} +\KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount} + +\Comment{Stall insertion process by specified amount} +sleep($\delta$) \; +\BlankLine +\Comment{Append to the buffer if possible} +\If {$|\mathcal{B}| < N_B$} { + $\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; + \Return \; +} +\BlankLine +\Comment{Schedule any necessary reconstructions background threads} +\For {$\mathscr{L} \in \mathscr{I}$} { + \If {$|\mathscr{L}| = s$} { + $\text{schedule\_reconstruction}(\mathscr{L})$ \; + } +} + +\BlankLine +\Comment{Perform the flush} +$\mathscr{L}_0 \gets \mathscr{L}_0 \cup \{\text{build}(\mathcal{B})\}$ \; +$\mathcal{B} \gets \emptyset$ \; + +\BlankLine +\Comment{Append to the now empty buffer} +$\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \; +\Return \; +\end{algorithm} \begin{figure} \centering @@ -343,12 +418,19 @@ record counts--each level has an increasing number of records per block.}} \label{fig:tl-tiering} \end{figure} -First, we'll consider how to ``spread out'' the cost of the worst-case -reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in +To ensure the correctness of this algorithm, it is necessary to show +that there exists a value for $\delta$ that ensures that the structural +invariants can be maintained. Logically, this $\delta$ can be thought +of as the amount of time needed to perform the active reconstruction +operation, amortized over the inserts between when this reconstruction +can be scheduled, and when it needs to be complete. We'll consider how +to establish this value next. + +Figure~\ref{fig:tl-tiering} shows various stages in the development of the internal structure of a dynamized index using tiering. Importantly, note that the last level reconstruction, which dominates the cost of the worst-case reconstruction, \emph{is able to be -performed well in advance}. All of the records necessary to perform this +performed in advance}. All of the records necessary to perform this reconstruction are present in the last level $\Theta(n)$ inserts before the reconstruction must be done to make room. This is a significant advantage to our technique over the normal Bentley-Saxe method, which @@ -358,71 +440,71 @@ leads us to the following result, \begin{theorem} \label{theo:worst-case-optimal} Given a buffered, dynamized structure utilizing the tiering layout policy, -and at least $2$ parallel threads of execution, it is possible to maintain -a worst-case insertion cost of +a single active thread of execution and multiple background threads, and +the ability to control which background thread is active, it is possible +to maintain a worst-case insertion cost of \begin{equation} I(n) \in O\left(\frac{B(n)}{n} \log n\right) \end{equation} \end{theorem} \begin{proof} Consider the cost of the worst-case reconstruction, which in tiering -will be of cost $\Theta(B(n))$. This reconstruction requires all of the -blocks on the last level of the structure. At the point at which the -last level is full, there will be $\Theta(n)$ inserts before the last -level must be merged and a new level added. - -To ensure that the reconstruction has been completed by the time the -$\Theta(n)$ inserts have been completed, it is sufficient to guarantee -that the rate of inserts is sufficiently slow. 
Ignoring the cost of -buffer flushing, this means that inserts must cost, +will be $\Theta(B(n))$. This reconstruction requires all of the blocks +on the last level of the structure. At the point at which the last level +is full, and this reconstruction can be initiated, there will be another +$\Theta(n)$ inserts before it must be completed in order to maintain +the structural invariants. + +Let each insert have cost, \begin{equation*} I(n) \in \Theta(1 + \delta) \end{equation*} -where the $\Theta(1)$ is the cost of appending to the mutable buffer, -and $\delta$ is a stall inserted into the insertion process to ensure -that the necessary reconstructions are completed in time. - -To identify the value of $\delta$, we note that each insert must take -at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level -reconstruction. However, this is not sufficient to guarantee the bound, as -other reconstructions will also occur within the structure. At the point -at which the last level reconstruction can be scheduled, there will be -exactly $1$ block on each level. Thus, each level will potentially also -have an ongoing reconstruction that must be covered by inserting more -stall time, to ensure that no level in the structure exceeds $s$ blocks. -There are $\log n$ levels in total, and so in the worst case we will need -to introduce a extra stall time to account for a reconstruction on each -level, +where $1$ is the cost of appending to the buffer, and $\delta$ is a +calculated stall time, during which one of the background threads can be +executed. To ensure the last-level reconstruction is complete by the +time that $\Theta(n)$ inserts have finished, it is necessary that +$\delta \in \Theta\left(\frac{B(n)}{n}\right)$. + +However, this amount of stall is insufficient to maintain exactly $s$ +shards on each level of the dynamization. At the point at which the +last-level reconstruction can be scheduled, there will be exactly $1$ +shard on all other levels (see Figure~\ref{fig:tl-tiering}). But, between +when the last-level reconstruction can be scheduled, and it must be +completed, each other level must undergo $s - 1$ reconstructions. Because +we have only a single execution thread, it is necessary to account for +the time to complete these reconstructions as well. In the worst-case, +there will be one active reconstruction on each of the $\log_s n$ levels, +and thus we must introduce stalls such that, \begin{equation*} I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots \delta_{\log n - 1}) \end{equation*} -All of these internal reconstructions will be strictly less than the -size of the last-level reconstruction, and so we can bound them all -above by $O(\frac{B(n)}{n})$ time. - -Given this, and assuming that the smallest (i.e., most pressing) -reconstruction is prioritized on the background thread, we find that +All of these internal reconstructions will be strictly less than the size +of the last-level reconstruction, and so we can bound them all above by +$O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest +(i.e., most pressing) reconstruction is prioritized on the active +thread, we find that \begin{equation*} I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right) \end{equation*} \end{proof} -This approach results in an equivalent worst-case insertion latency -bound to~\cite{overmars81}, but manages to resolve both of the issues -cited above. 
By leveraging two parallel threads, instead of trying to -manually multiplex a single thread, this approach requires \emph{no} -modification to the user's block code to function. And, by leveraging -the fact that reconstructions under tiering are strictly local to a -single level, we can avoid needing to add any complicated additional -structures to manage partially building blocks as new records are added. +This approach results in an equivalent worst-case insertion and query +latency bounds to~\cite{overmars81}, but manages to resolve the issues +cited above. By leveraging multiple threads, instead of trying to manually +multiplex a single thread, this approach requires \emph{no} modification +to the user's block code to function. And, by leveraging the fact that +reconstructions under tiering are strictly local to a single level, we +can avoid needing to add any additional structures to manage partially +building blocks as new records are added. \subsection{Reducing Stall with Additional Parallelism} The result in Theorem~\ref{theo:worst-case-optimal} assumes that there -are two available threads of parallel execution, which allows for the -reconstructions to run in parallel with inserts. The amount of necessary -insertion stall can be significantly reduced, however, if more than two -threads are available. +is only a single available thread of parallel execution. This requires +that the insertion stall amount be large enough to cover all of the +reconstructions necessary at any moment in time. If we have access to +parallel execution units, though, we can significantly reduce the amount +of stall time required. The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case bound is that it is insufficient to cover only the cost of the last level @@ -432,22 +514,21 @@ level within the structure must sustain another $s - 1$ reconstructions before it is necessary to have completed the last level reconstruction, in order to maintain the $\Theta(\log n)$ bound on the number of blocks. -Consider a parallel implementation that, contrary to -Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover -the last level reconstruction, and blocks all other reconstructions -until it has been completed. Such an approach would result in $\delta -= \frac{B(n)}{n}$ stall and complete the last level reconstruction -after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$ -blocks would accumulate in L0, ultimately resulting in a bound of -$\Theta(n)$ blocks in the structure, rather than the $\Theta(\log -n)$ bound we are trying to maintain. This is the reason why -Theorem~\ref{theo:worst-case-optimal} must account for stalls on every -level, and assumes that the smallest (and therefore most pressing) -reconstruction is always active on the parallel reconstruction -thread. This introduces the extra $\log n$ factor into the worst-case -insertion cost function, because there will at worst be a reconstruction -running on every level, and each reconstruction will be no larger than -$\Theta(n)$ records. +To see why this is important, consider an implementation that, contrary +to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover +the last-level reconstruction. All other reconstructions are blocked +until the last-level one has been completed. This approach would +result in $\delta = \frac{B(n)}{n}$ stall and complete the last +level reconstruction after $\Theta(n)$ inserts. 
During this time, +$\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately +resulting in a bound of $\Theta(n)$ blocks in the structure, rather than +the $\Theta(\log n)$ bound we are trying to maintain. This is the reason +why Theorem~\ref{theo:worst-case-optimal} must account for stalls on +every level, and assumes that the smallest (and therefore most pressing) +reconstruction is always active. This introduces the extra $\log n$ +factor into the worst-case insertion cost function, because there will at +worst be a reconstruction running on every level, and each reconstruction +will be no larger than $\Theta(n)$ records. In effect, the stall amount must be selected to cover the \emph{sum} of the costs of all reconstructions that occur. Another way of deriving this @@ -463,9 +544,14 @@ B(n) \cdot \log n \end{equation*} reconstruction cost to amortize over the $\Theta(n)$ inserts. -However, additional parallelism will allow us to reduce this. At the -upper limit, assume that there are $\log n$ threads available for parallel -reconstructions. This condition allows us to derive a smaller bound in +However, additional parallelism will allow us to reduce this. At +the upper limit, assume that there are $\log n$ threads available +for parallel reconstructions. We'll consider the fork-join model of +parallelism~\cite{fork-join}, where the initiation of a reconstruction +on a thread constitutes a fork, and joining the reconstruction thread +involves applying its updates to the dynamized structure. + +This condition allows us to derive a smaller bound in certain cases, \begin{theorem} \label{theo:par-worst-case-optimal} @@ -521,10 +607,6 @@ is the worst-case insertion cost, while ensuring that all reconstructions are done in time to maintain the block bound given $\log n$ parallel threads. \end{proof} - - - - \section{Implementation} \label{sec:tl-impl} |
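One practical question that the preceding analysis leaves open is how the stall $\delta$ might actually be chosen at runtime. A simple illustrative policy, sketched below with an assumed linear cost model and made-up constants, is to estimate the outstanding reconstruction work at each flush and divide it by the number of inserts that can be absorbed before the shard-count invariant would be violated. With a single worker thread the estimate must cover the \emph{sum} of the outstanding work, which is the source of the $\log n$ factor in Theorem~\ref{theo:worst-case-optimal}; with a worker per level it needs to cover only the largest single reconstruction, in the spirit of Theorem~\ref{theo:par-worst-case-optimal}.

```cpp
// stall_policy.cpp -- illustrative sketch of one way to pick the per-insert
// stall delta; the cost model and all constants are assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>

// Assumed build cost, in microseconds, for a reconstruction over `records`
// records (a simple linear model; B(n) is structure-dependent in general).
static double build_cost_us(double records) { return 0.05 * records; }

// Given the record counts of all pending reconstructions and the number of
// inserts available before a shard-count invariant is violated, return the
// per-insert stall in microseconds.
static double stall_delta_us(const std::vector<double> &recon_sizes,
                             double inserts_until_violation,
                             size_t worker_threads) {
    double total = 0.0, largest = 0.0;
    for (double r : recon_sizes) {
        total += build_cost_us(r);
        largest = std::max(largest, build_cost_us(r));
    }
    // One worker: the stall must amortize the sum of the outstanding work.
    // One worker per level: only the largest reconstruction matters.
    double work = (worker_threads >= recon_sizes.size()) ? largest : total;
    return work / inserts_until_violation;
}

int main() {
    // Example: pending reconstructions on four levels of a tiering structure,
    // with roughly 1e6 inserts available before the last level must be merged.
    std::vector<double> sizes{4 * 12e3, 4 * 48e3, 4 * 192e3, 4 * 768e3};
    std::printf("delta (1 worker)  = %.3f us/insert\n",
                stall_delta_us(sizes, 1e6, 1));
    std::printf("delta (4 workers) = %.3f us/insert\n",
                stall_delta_us(sizes, 1e6, 4));
}
```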