From f34a7e871b22bfd7973158e1fa2376a12bb6f5f5 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Mon, 23 Jun 2025 11:52:18 -0400
Subject: updates

---
 chapters/tail-latency.tex | 93 +++++++++++++++++++++++++----------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index ed6c7b8..5b3dfa5 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -332,7 +332,7 @@ policy.
 Tiering provides the most opportunities for concurrency and
 (assuming sufficient resources) parallelism. Because a given
 reconstruction only requires shards from a single level, using
 tiering also makes synchronization significantly easier, and it provides us
-with largest window to pre-emptively schedule reconstructions. Most
+with the largest window to preemptively schedule reconstructions. Most
 of our discussion in this chapter could also be applied to leveling,
 albeit with worse results. However, BSM \emph{cannot} be used at all.
@@ -439,40 +439,45 @@ of inserts without much of the complexity of~\cite{overmars81}.
 This leads us to the following result,
 \begin{theorem} \label{theo:worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-a single active thread of execution and multiple background threads, and
-the ability to control which background thread is active, it is possible
-to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, a single execution unit, and multiple
+threads of execution that can be scheduled on that unit at will with
+preemption, it is possible to maintain a worst-case insertion cost of
 \begin{equation}
 I(n) \in O\left(\frac{B(n)}{n} \log n\right)
 \end{equation}
+while maintaining a bound of $s$ shards per level, and $\log_s n$ levels.
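The level bound stated in the theorem can be checked with a short sketch. This is a hypothetical illustration, not code from the chapter: it assumes the common tiering convention that level $i$ holds at most $s$ shards of roughly $s^{i-1}$ records each (with a unit-sized buffer), so level $i$ has capacity about $s^i$ and $\Theta(\log_s n)$ levels suffice.

```python
# Hypothetical sketch (convention assumed, not taken from the chapter):
# level i holds at most s shards of ~s^(i-1) records, so its capacity is
# ~s^i. The number of levels needed for n records grows as Theta(log_s n).

def num_levels(n, s):
    """Smallest L such that levels of capacity s, s^2, ..., s^L hold n records."""
    levels, total = 0, 0
    while total < n:
        levels += 1
        total += s ** levels  # level `levels`: s shards of s^(levels-1) records
    return levels

# 2 + 4 + ... + 2^9 = 1022 >= 1000, so 9 levels suffice for n = 1000, s = 2.
assert num_levels(1000, 2) == 9
```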
 \end{theorem}
 \begin{proof}
-Consider the cost of the worst-case reconstruction, which in tiering
-will be $\Theta(B(n))$. This reconstruction requires all of the blocks
-on the last level of the structure. At the point at which the last level
-is full, and this reconstruction can be initiated, there will be another
-$\Theta(n)$ inserts before it must be completed in order to maintain
-the structural invariants.
-
-Let each insert have cost,
+Under Algorithm~\ref{alg:tl-relaxed-recon}, the worst-case reconstruction
+operation consists of the creation of a new block from all of the existing
+blocks in the last level. This reconstruction will be initiated when the
+last level is full, at which point there will be another $\Theta(n)$
+inserts before the level above it also fills, and a new shard must be
+added to the last level. The reconstruction must be completed by this point
+to ensure that no more than $s$ shards exist on the last level.
+
+Assume that all inserts run on a single thread that can be scheduled
+alongside the reconstructions, and let each insert have a cost of
 \begin{equation*}
 I(n) \in \Theta(1 + \delta)
 \end{equation*}
-where $1$ is the cost of appending to the buffer, and $\delta$ is a
-calculated stall time, during which one of the background threads can be
-executed. To ensure the last-level reconstruction is complete by the
-time that $\Theta(n)$ inserts have finished, it is necessary that
-$\delta \in \Theta\left(\frac{B(n)}{n}\right)$.
+where $1$ is the cost of appending to the buffer, and $\delta$
+is a calculated stall time. During this stall, the insert
+thread will be idle and reconstructions can be run on the execution unit.
+To ensure the last-level reconstruction is complete by the time that
+$\Theta(n)$ inserts have finished, it is necessary that $\delta \in
+\Theta\left(\frac{B(n)}{n}\right)$.
 
 However, this amount of stall is insufficient to maintain exactly $s$
 shards on each level of the dynamization.
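The amortization step in the proof can be illustrated with a small simulation. This is a hedged sketch under the stated assumptions, not the chapter's implementation: each insert pays a unit buffer append plus a stall of $\delta = B(n)/n$, during which reconstruction work proceeds, so a last-level job of cost $B(n)$ finishes within the $\Theta(n)$ inserts available. The function name `run_inserts` is hypothetical.

```python
# Hypothetical sketch: amortize a last-level reconstruction of cost B(n)
# over the Theta(n) inserts that arrive before the level above fills, by
# stalling each insert for delta = B(n)/n units of reconstruction work.

def run_inserts(n, recon_cost):
    """Simulate n inserts against a pending reconstruction of recon_cost units.
    Returns the reconstruction work left after all n inserts complete."""
    delta = recon_cost / n   # stall budget per insert: B(n)/n
    remaining = recon_cost   # unfinished reconstruction work
    buffer = []
    for i in range(n):
        buffer.append(i)     # unit-cost append to the buffer
        remaining -= delta   # reconstruction runs during the stall
    return remaining

# The reconstruction is fully paid for by the time the n inserts finish.
assert abs(run_inserts(1000, recon_cost=8000)) < 1e-9
```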
 At the point at which the
-last-level reconstruction can be scheduled, there will be exactly $1$
-shard on all other levels (see Figure~\ref{fig:tl-tiering}). But, between
-when the last-level reconstruction can be scheduled, and it must be
-completed, each other level must undergo $s - 1$ reconstructions. Because
-we have only a single execution thread, it is necessary to account for
-the time to complete these reconstructions as well. In the worst-case,
+last-level reconstruction is initiated, there will be exactly $1$ shard
+on all other levels (see Figure~\ref{fig:tl-tiering}). However, between
+this initiation and the time at which the last-level reconstruction must
+be complete to maintain the shard count bound, each other level must
+also undergo $s - 1$ reconstructions to maintain their own bounds.
+Because we have only a single execution unit, it is necessary to account
+for the time to complete these reconstructions as well. In the worst case,
 there will be one active reconstruction on each of the $\log_s n$ levels,
 and thus we must introduce stalls such that,
 \begin{equation*}
@@ -481,8 +486,8 @@ I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
 \end{equation*}
 All of these internal reconstructions will be strictly less than the
 size of the last-level reconstruction, and so we can bound them all above
 by $O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest
-(i.e., most pressing) reconstruction is prioritized on the active
-thread, we find that
+(i.e., most pressing) reconstruction is prioritized on the execution
+unit, we find that
 \begin{equation*}
 I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right)
 \end{equation*}
@@ -492,19 +497,20 @@ This approach results in worst-case insertion
 and query latency bounds equivalent to~\cite{overmars81}, but manages
 to resolve the issues cited above.
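The smallest-job-first prioritization used in the proof can be sketched with a priority queue. This is a hypothetical illustration, not the chapter's scheduler: per-level reconstructions are modeled as remaining-cost values, and each insert's stall budget is spent on the smallest (most pressing) outstanding job first, so small internal reconstructions complete before the large last-level one monopolizes the unit.

```python
# Hypothetical sketch: one execution unit, one pending reconstruction per
# level. Spend each stall budget on the smallest outstanding job first.
import heapq

def schedule_stall(pending, budget):
    """pending: per-level remaining reconstruction costs.
    Spends `budget` units of work, smallest job first; returns jobs left."""
    heap = list(pending)
    heapq.heapify(heap)
    while heap and budget > 0:
        smallest = heapq.heappop(heap)
        work = min(smallest, budget)
        budget -= work
        if smallest - work > 0:
            heapq.heappush(heap, smallest - work)  # partially done job
    return heap

# Both small internal jobs finish before the large last-level job is touched.
left = schedule_stall([100, 4, 8], budget=12)
assert left == [100]
```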
 By leveraging multiple threads, instead of trying to manually
 multiplex a single thread, this approach requires \emph{no} modification
-to the user's block code to function. And, by leveraging the fact that
-reconstructions under tiering are strictly local to a single level, we
-can avoid needing to add any additional structures to manage partially
-building blocks as new records are added.
+to the user's block code to function. The fine-grained control over
+the active thread necessary to achieve this bound can be obtained
+by using userspace interrupts~\cite{userspace-preempt}, allowing
+it to be implemented more easily, without significant modifications
+to reconstruction procedures when compared to the existing worst-case
+optimal technique.
 
 \subsection{Reducing Stall with Additional Parallelism}
 
-The result in Theorem~\ref{theo:worst-case-optimal} assumes that there
-is only a single available thread of parallel execution. This requires
-that the insertion stall amount be large enough to cover all of the
-reconstructions necessary at any moment in time. If we have access to
-parallel execution units, though, we can significantly reduce the amount
-of stall time required.
+The result in Theorem~\ref{theo:worst-case-optimal} assumes that there is
+only a single available execution unit. This requires that the insertion
+stall amount be large enough to cover all of the reconstructions necessary
+at any moment in time. If we have access to parallel execution units,
+though, we can significantly reduce the amount of stall time required.
 
 The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case
 bound is that it is insufficient to cover only the cost of the last level
@@ -545,7 +551,7 @@ B(n) \cdot \log n
 reconstruction cost to amortize over the $\Theta(n)$ inserts. However,
 additional parallelism will allow us to reduce this.
 At the
-upper limit, assume that there are $\log n$ threads available for
+upper limit, assume that there are $\log n$ execution units available for
 parallel reconstructions. We'll adopt the bulk-synchronous parallel
 (BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In
 this model, computation is broken up into multiple parallel threads
@@ -584,9 +590,10 @@ Given this model, it is possible to derive the following
 new worst-case bound,
 \begin{theorem} \label{theo:par-worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-and at least $\log n$ parallel threads of execution in the BSP model, it
-is possible to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, and at least $\log n$ execution
+units in the BSP model, it is possible to maintain a worst-case insertion
+cost of
 \begin{equation}
 I(n) \in O\left(\frac{B(n)}{n}\right)
 \end{equation}
@@ -596,9 +603,9 @@ for a data structure with $B(n) \in \Omega(n)$.
 Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting
 that the last level reconstruction will be of cost $\Theta(B(n))$ and
 must be amortized over $\Theta(n)$ inserts. However, unlike in that case,
-we now have $\log n$ threads of parallelism to work with. Thus, each
-time a reconstruction must be performed on an internal level, it can
-be executed on one of these threads in parallel with all other ongoing
+we now have $\log n$ execution units to work with. Thus, each time
+a reconstruction must be performed on an internal level, it can be
+executed on one of these units in parallel with all other ongoing
 reconstructions. As there can be at most one reconstruction per level,
 $\log n$ threads are sufficient to run all possible reconstructions at
 any point in time in parallel.
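The gain from parallel execution units can be illustrated with a small sketch. This is a hypothetical illustration of the argument, not the chapter's implementation: with one unit per level, concurrent per-level reconstructions fully overlap, so the per-insert stall need only cover the largest (last-level) job of cost $B(n)$, rather than the sum over all $\log n$ levels. The function name and cost values are assumptions for illustration.

```python
# Hypothetical sketch: per-insert stall needed to cover all pending
# per-level reconstructions, with and without enough parallel units.

def stall_per_insert(level_costs, n, units):
    """Total stall charged to each of n inserts, given per-level
    reconstruction costs and the number of parallel execution units."""
    if units >= len(level_costs):       # one unit per level: jobs overlap
        return max(level_costs) / n     # only the last-level job matters
    return sum(level_costs) / n         # single unit: jobs serialize

costs = [10, 100, 1000, 8000]           # costs grow geometrically per level
# With a unit per level, stall is O(B(n)/n); with one unit, O(B(n)/n * log n).
assert abs(stall_per_insert(costs, n=1000, units=4) - 8.0) < 1e-9
assert abs(stall_per_insert(costs, n=1000, units=1) - 9.11) < 1e-9
```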