-rw-r--r--  chapters/tail-latency.tex  93
-rw-r--r--  references/references.bib  19
2 files changed, 69 insertions(+), 43 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index ed6c7b8..5b3dfa5 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -332,7 +332,7 @@ policy. Tiering provides the most opportunities for concurrency
and (assuming sufficient resources) parallelism. Because a given
reconstruction only requires shards from a single level, using tiering
also makes synchronization significantly easier, and it provides us
-with largest window to pre-emptively schedule reconstructions. Most
+with the largest window to preemptively schedule reconstructions. Most
of our discussion in this chapter could also be applied to leveling,
albeit with worse results. However, BSM \emph{cannot} be used at all.
@@ -439,40 +439,45 @@ of inserts without much of the complexity of~\cite{overmars81}. This
leads us to the following result,
\begin{theorem}
\label{theo:worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-a single active thread of execution and multiple background threads, and
-the ability to control which background thread is active, it is possible
-to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, a single execution unit, and multiple
+threads of execution that can be scheduled on that unit at will with
+preemption, it is possible to maintain a worst-case insertion cost of
\begin{equation}
I(n) \in O\left(\frac{B(n)}{n} \log n\right)
\end{equation}
+while maintaining a bound of $s$ shards per level and $\log_s n$ levels.
\end{theorem}
\begin{proof}
-Consider the cost of the worst-case reconstruction, which in tiering
-will be $\Theta(B(n))$. This reconstruction requires all of the blocks
-on the last level of the structure. At the point at which the last level
-is full, and this reconstruction can be initiated, there will be another
-$\Theta(n)$ inserts before it must be completed in order to maintain
-the structural invariants.
-
-Let each insert have cost,
+Under Algorithm~\ref{alg:tl-relaxed-recon}, the worst-case reconstruction
+operation consists of the creation of a new block from all of the existing
+blocks in the last level. This reconstruction will be initiated when the
+last level is full, at which point there will be another $\Theta(n)$
+inserts before the level above it also fills, and a new shard must be
+added to the last level. The reconstruction must be completed by this point
+to ensure that no more than $s$ shards exist on the last level.
+
+Assume that all inserts run on a single thread that can be scheduled
+alongside the reconstructions, and let each insert have a cost of
\begin{equation*}
I(n) \in \Theta(1 + \delta)
\end{equation*}
-where $1$ is the cost of appending to the buffer, and $\delta$ is a
-calculated stall time, during which one of the background threads can be
-executed. To ensure the last-level reconstruction is complete by the
-time that $\Theta(n)$ inserts have finished, it is necessary that
-$\delta \in \Theta\left(\frac{B(n)}{n}\right)$.
+where $1$ is the cost of appending to the buffer, and $\delta$
+is a calculated stall time. During the stalling, the insert
+thread will be idle and reconstructions can be run on the execution unit.
+To ensure the last-level reconstruction is complete by the time that
+$\Theta(n)$ inserts have finished, it is necessary that $\delta \in
+\Theta\left(\frac{B(n)}{n}\right)$.
However, this amount of stall is insufficient to maintain exactly $s$
shards on each level of the dynamization. At the point at which the
-last-level reconstruction can be scheduled, there will be exactly $1$
-shard on all other levels (see Figure~\ref{fig:tl-tiering}). But, between
-when the last-level reconstruction can be scheduled, and it must be
-completed, each other level must undergo $s - 1$ reconstructions. Because
-we have only a single execution thread, it is necessary to account for
-the time to complete these reconstructions as well. In the worst-case,
+last-level reconstruction is initiated, there will be exactly $1$ shard
+on all other levels (see Figure~\ref{fig:tl-tiering}). However, between
+this initiation and the time at which the last level reconstruction must
+be complete to maintain the shard count bound, each other level must
+also undergo $s - 1$ reconstructions to maintain their own bounds.
+Because we have only a single execution unit, it is necessary to account
+for the time to complete these reconstructions as well. In the worst case,
there will be one active reconstruction on each of the $\log_s n$ levels,
and thus we must introduce stalls such that,
\begin{equation*}
@@ -481,8 +486,8 @@ I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
All of these internal reconstructions are strictly smaller than the
last-level reconstruction, and so each of their stall contributions can
be bounded above by $O(\frac{B(n)}{n})$. Given this, and assuming that the smallest
-(i.e., most pressing) reconstruction is prioritized on the active
-thread, we find that
+(i.e., most pressing) reconstruction is prioritized on the execution
+unit, we find that
\begin{equation*}
I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
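The stall-scheduling policy used in this proof can be sketched concretely. The following is a minimal, hypothetical Python simulation (the function `spend_stall` and its interface are our own, not part of the chapter) of how one insert's stall budget is spent on the single execution unit, always prioritizing the smallest pending reconstruction:

```python
import heapq

def spend_stall(pending, budget):
    """Distribute one insert's stall budget across in-flight
    reconstructions, always working on the smallest (most pressing)
    one first, as in the proof sketch above.

    `pending` is a list of (remaining_work, level) pairs; returns the
    reconstructions still outstanding after the budget is exhausted."""
    heap = list(pending)
    heapq.heapify(heap)  # min-heap: smallest remaining work on top
    while heap and budget > 0:
        work, level = heapq.heappop(heap)
        done = min(work, budget)
        budget -= done
        if work > done:  # not finished; requeue with reduced work
            heapq.heappush(heap, (work - done, level))
    return sorted(heap)
```

With a per-insert budget of $\Theta(B(n)/n)$ for each of the $\log n$ levels, i.e. $\Theta(\frac{B(n)}{n}\log n)$ in total, every in-flight reconstruction completes within the $\Theta(n)$ inserts available, matching the bound in the theorem.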
@@ -492,19 +497,20 @@ This approach results in worst-case insertion and query latency
bounds equivalent to~\cite{overmars81}, but manages to resolve the issues
cited above. By leveraging multiple threads, instead of trying to manually
multiplex a single thread, this approach requires \emph{no} modification
-to the user's block code to function. And, by leveraging the fact that
-reconstructions under tiering are strictly local to a single level, we
-can avoid needing to add any additional structures to manage partially
-building blocks as new records are added.
+to the user's block code to function. The fine-grained control over the
+active thread that this bound requires can be provided by userspace
+interrupts~\cite{userspace-preempt}, making the approach easier to
+implement than the existing worst-case optimal technique, and without
+significant modifications to the reconstruction procedures.
\subsection{Reducing Stall with Additional Parallelism}
-The result in Theorem~\ref{theo:worst-case-optimal} assumes that there
-is only a single available thread of parallel execution. This requires
-that the insertion stall amount be large enough to cover all of the
-reconstructions necessary at any moment in time. If we have access to
-parallel execution units, though, we can significantly reduce the amount
-of stall time required.
+The result in Theorem~\ref{theo:worst-case-optimal} assumes that there is
+only a single available execution unit. This requires that the insertion
+stall amount be large enough to cover all of the reconstructions necessary
+at any moment in time. If we have access to parallel execution units,
+though, we can significantly reduce the amount of stall time required.
The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case
bound is that it is insufficient to cover only the cost of the last level
@@ -545,7 +551,7 @@ B(n) \cdot \log n
reconstruction cost to amortize over the $\Theta(n)$ inserts.
However, additional parallelism will allow us to reduce this. At the
-upper limit, assume that there are $\log n$ threads available for
+upper limit, assume that there are $\log n$ execution units available for
parallel reconstructions. We'll adopt the bulk-synchronous parallel
(BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In
this model, computation is broken up into multiple parallel threads
@@ -584,9 +590,10 @@ Given this model, it is possible to derive the following new worst-case
bound,
\begin{theorem}
\label{theo:par-worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-and at least $\log n$ parallel threads of execution in the BSP model, it
-is possible to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, and at least $\log n$ execution
+units in the BSP model, it is possible to maintain a worst-case insertion
+cost of
\begin{equation}
I(n) \in O\left(\frac{B(n)}{n}\right)
\end{equation}
@@ -596,9 +603,9 @@ for a data structure with $B(n) \in \Omega(n)$.
Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that
the last level reconstruction will be of cost $\Theta(B(n))$ and must
be amortized over $\Theta(n)$ inserts. However, unlike in that case,
-we now have $\log n$ threads of parallelism to work with. Thus, each
-time a reconstruction must be performed on an internal level, it can
-be executed on one of these threads in parallel with all other ongoing
+we now have $\log n$ execution units to work with. Thus, each time
+a reconstruction must be performed on an internal level, it can be
+executed on one of these units in parallel with all other ongoing
reconstructions. As there can be at most one reconstruction per level,
$\log n$ execution units are sufficient to run all possible reconstructions
at any point in time in parallel.
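The contrast between the two theorems comes down to whether the per-level stalls must serialize on one execution unit or can overlap across units. A simplified sketch of that arithmetic, under our own naming (`recon_costs` holds the cost of the at-most-one in-flight reconstruction per level; neither the function nor its interface appears in the chapter):

```python
def per_insert_stall(recon_costs, n, units):
    """Stall each insert must absorb so that every in-flight
    reconstruction finishes within the Theta(n) inserts available.

    With a single execution unit the stalls for all levels add up,
    giving O((B(n)/n) * log n) per insert; with one unit per level
    (the BSP setting of the second theorem) the reconstructions
    overlap, and only the largest one -- the last-level
    reconstruction, of cost Theta(B(n)) -- determines the stall,
    giving O(B(n)/n)."""
    if units >= len(recon_costs):
        return max(recon_costs) / n  # fully parallel: largest dominates
    return sum(recon_costs) / n      # one unit: stall times serialize
```

For example, with level costs $16, 8, 4, 2$ and $n = 100$, four units need a stall proportional to $16/100$ per insert, while a single unit needs $30/100$, reflecting the extra $\log n$ factor in the serialized bound.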
diff --git a/references/references.bib b/references/references.bib
index 051bb32..ec15f6b 100644
--- a/references/references.bib
+++ b/references/references.bib
@@ -2063,3 +2063,22 @@ month = aug,
pages = {103–111},
numpages = {9}
}
+
+@article{userspace-preempt,
+author = {Huang, Kaisong and Zhou, Jiatang and Zhao, Zhuoyue and Xie, Dong and Wang, Tianzheng},
+title = {Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt?},
+year = {2025},
+issue_date = {June 2025},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+volume = {3},
+number = {3},
+url = {https://doi.org/10.1145/3725319},
+doi = {10.1145/3725319},
+abstract = {Traditional non-preemptive scheduling can lead to long latency under workloads that mix long-running and short transactions with varying priorities. This occurs because worker threads tend to monopolize CPU cores until they finish processing long-running transactions. Thus, short transactions must wait for the CPU, leading to long latency. As an alternative, cooperative scheduling allows for transaction yielding, but it is difficult to tune for diverse workloads. Although preemption could potentially alleviate this issue, it has seen limited adoption in DBMSs due to the high delivery latency of software interrupts and concerns on wasting useful work induced by read-write lock conflicts in traditional lock-based DBMSs. In this paper, we propose PreemptDB, a new database engine that leverages recent userspace interrupts available in modern CPUs to enable efficient preemptive scheduling. We present an efficient transaction context switching mechanism purely in userspace and scheduling policies that prioritize short, high-priority transactions without significantly affecting long-running queries. Our evaluation demonstrates that PreemptDB significantly reduces end-to-end latency for high-priority transactions compared to non-preemptive FIFO and cooperative scheduling methods.},
+journal = {Proc. ACM Manag. Data},
+month = jun,
+articleno = {182},
+numpages = {25},
+keywords = {database systems, low-latency transactions, preemptive scheduling, user interrupts}
+}