From f34a7e871b22bfd7973158e1fa2376a12bb6f5f5 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Mon, 23 Jun 2025 11:52:18 -0400
Subject: updates

---
 chapters/tail-latency.tex | 93 +++++++++++++++++++++++++----------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index ed6c7b8..5b3dfa5 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -332,7 +332,7 @@ policy.
 Tiering provides the most opportunities for concurrency and
 (assuming sufficient resources) parallelism. Because a given
 reconstruction only requires shards from a single level, using
 tiering also makes synchronization significantly easier, and it provides us
-with largest window to pre-emptively schedule reconstructions. Most
+with the largest window to preemptively schedule reconstructions. Most
 of our discussion in this chapter could also be applied to leveling,
 albeit with worse results. However, BSM \emph{cannot} be used at all.
@@ -439,40 +439,45 @@ of inserts without much of the complexity of~\cite{overmars81}.
 This leads us to the following result,
 \begin{theorem} \label{theo:worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-a single active thread of execution and multiple background threads, and
-the ability to control which background thread is active, it is possible
-to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, a single execution unit, and multiple
+threads of execution that can be scheduled on that unit at will with
+preemption, it is possible to maintain a worst-case insertion cost of
 \begin{equation}
 I(n) \in O\left(\frac{B(n)}{n} \log n\right)
 \end{equation}
+while maintaining a bound of $s$ shards per level, and $\log_s n$ levels.
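The level bound stated in the theorem can be checked with a short sketch. This is a hypothetical illustration, not code from the chapter: it assumes the common tiering convention that level $i$ holds at most $s$ shards of roughly $s^{i-1}$ records each (with a unit-sized buffer), so level $i$ has capacity about $s^i$ and $\Theta(\log_s n)$ levels suffice.

```python
# Hypothetical sketch (convention assumed, not taken from the chapter):
# level i holds at most s shards of ~s^(i-1) records, so its capacity is
# ~s^i. The number of levels needed for n records grows as Theta(log_s n).

def num_levels(n, s):
    """Smallest L such that levels of capacity s, s^2, ..., s^L hold n records."""
    levels, total = 0, 0
    while total < n:
        levels += 1
        total += s ** levels  # level `levels`: s shards of s^(levels-1) records
    return levels

# 2 + 4 + ... + 2^9 = 1022 >= 1000, so 9 levels suffice for n = 1000, s = 2.
assert num_levels(1000, 2) == 9
```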
 \end{theorem}
 \begin{proof}
-Consider the cost of the worst-case reconstruction, which in tiering
-will be $\Theta(B(n))$. This reconstruction requires all of the blocks
-on the last level of the structure. At the point at which the last level
-is full, and this reconstruction can be initiated, there will be another
-$\Theta(n)$ inserts before it must be completed in order to maintain
-the structural invariants.
-
-Let each insert have cost,
+Under Algorithm~\ref{alg:tl-relaxed-recon}, the worst-case reconstruction
+operation consists of the creation of a new block from all of the existing
+blocks in the last level. This reconstruction will be initiated when the
+last level is full, at which point there will be another $\Theta(n)$
+inserts before the level above it also fills, and a new shard must be
+added to the last level. The reconstruction must be completed by this point
+to ensure that no more than $s$ shards exist on the last level.
+
+Assume that all inserts run on a single thread that can be scheduled
+alongside the reconstructions, and let each insert have a cost of
 \begin{equation*}
 I(n) \in \Theta(1 + \delta)
 \end{equation*}
-where $1$ is the cost of appending to the buffer, and $\delta$ is a
-calculated stall time, during which one of the background threads can be
-executed. To ensure the last-level reconstruction is complete by the
-time that $\Theta(n)$ inserts have finished, it is necessary that
-$\delta \in \Theta\left(\frac{B(n)}{n}\right)$.
+where $1$ is the cost of appending to the buffer, and $\delta$
+is a calculated stall time. During this stall, the insert
+thread will be idle and reconstructions can be run on the execution unit.
+To ensure the last-level reconstruction is complete by the time that
+$\Theta(n)$ inserts have finished, it is necessary that $\delta \in
+\Theta\left(\frac{B(n)}{n}\right)$.
 
 However, this amount of stall is insufficient to maintain exactly $s$
 shards on each level of the dynamization.
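The amortization step in the proof can be illustrated with a small simulation. This is a hedged sketch under the stated assumptions, not the chapter's implementation: each insert pays a unit buffer append plus a stall of $\delta = B(n)/n$, during which reconstruction work proceeds, so a last-level job of cost $B(n)$ finishes within the $\Theta(n)$ inserts available. The function name `run_inserts` is hypothetical.

```python
# Hypothetical sketch: amortize a last-level reconstruction of cost B(n)
# over the Theta(n) inserts that arrive before the level above fills, by
# stalling each insert for delta = B(n)/n units of reconstruction work.

def run_inserts(n, recon_cost):
    """Simulate n inserts against a pending reconstruction of recon_cost units.
    Returns the reconstruction work left after all n inserts complete."""
    delta = recon_cost / n   # stall budget per insert: B(n)/n
    remaining = recon_cost   # unfinished reconstruction work
    buffer = []
    for i in range(n):
        buffer.append(i)     # unit-cost append to the buffer
        remaining -= delta   # reconstruction runs during the stall
    return remaining

# The reconstruction is fully paid for by the time the n inserts finish.
assert abs(run_inserts(1000, recon_cost=8000)) < 1e-9
```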
 At the point at which the
-last-level reconstruction can be scheduled, there will be exactly $1$
-shard on all other levels (see Figure~\ref{fig:tl-tiering}). But, between
-when the last-level reconstruction can be scheduled, and it must be
-completed, each other level must undergo $s - 1$ reconstructions. Because
-we have only a single execution thread, it is necessary to account for
-the time to complete these reconstructions as well. In the worst-case,
+last-level reconstruction is initiated, there will be exactly $1$ shard
+on all other levels (see Figure~\ref{fig:tl-tiering}). However, between
+this initiation and the time at which the last-level reconstruction must
+be complete to maintain the shard count bound, each other level must
+also undergo $s - 1$ reconstructions to maintain their own bounds.
+Because we have only a single execution unit, it is necessary to account
+for the time to complete these reconstructions as well. In the worst case,
 there will be one active reconstruction on each of the $\log_s n$ levels,
 and thus we must introduce stalls such that,
 \begin{equation*}
@@ -481,8 +486,8 @@ I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
 \end{equation*}
 All of these internal reconstructions will be strictly less than the
 size of the last-level reconstruction, and so we can bound them all above
 by $O(\frac{B(n)}{n})$ time. Given this, and assuming that the smallest
-(i.e., most pressing) reconstruction is prioritized on the active
-thread, we find that
+(i.e., most pressing) reconstruction is prioritized on the execution
+unit, we find that
 \begin{equation*}
 I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right)
 \end{equation*}
@@ -492,19 +497,20 @@ This approach results in worst-case insertion
 and query latency bounds equivalent to~\cite{overmars81}, but manages
 to resolve the issues cited above.
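The smallest-job-first prioritization used in the proof can be sketched with a priority queue. This is a hypothetical illustration, not the chapter's scheduler: per-level reconstructions are modeled as remaining-cost values, and each insert's stall budget is spent on the smallest (most pressing) outstanding job first, so small internal reconstructions complete before the large last-level one monopolizes the unit.

```python
# Hypothetical sketch: one execution unit, one pending reconstruction per
# level. Spend each stall budget on the smallest outstanding job first.
import heapq

def schedule_stall(pending, budget):
    """pending: per-level remaining reconstruction costs.
    Spends `budget` units of work, smallest job first; returns jobs left."""
    heap = list(pending)
    heapq.heapify(heap)
    while heap and budget > 0:
        smallest = heapq.heappop(heap)
        work = min(smallest, budget)
        budget -= work
        if smallest - work > 0:
            heapq.heappush(heap, smallest - work)  # partially done job
    return heap

# Both small internal jobs finish before the large last-level job is touched.
left = schedule_stall([100, 4, 8], budget=12)
assert left == [100]
```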
 By leveraging multiple threads, instead of trying to manually
 multiplex a single thread, this approach requires \emph{no} modification
-to the user's block code to function. And, by leveraging the fact that
-reconstructions under tiering are strictly local to a single level, we
-can avoid needing to add any additional structures to manage partially
-building blocks as new records are added.
+to the user's block code to function. The fine-grained control over
+the active thread necessary to achieve this bound can be obtained
+by using userspace interrupts~\cite{userspace-preempt}, allowing
+it to be implemented more easily, without significant modifications
+to reconstruction procedures when compared to the existing worst-case
+optimal technique.
 
 \subsection{Reducing Stall with Additional Parallelism}
 
-The result in Theorem~\ref{theo:worst-case-optimal} assumes that there
-is only a single available thread of parallel execution. This requires
-that the insertion stall amount be large enough to cover all of the
-reconstructions necessary at any moment in time. If we have access to
-parallel execution units, though, we can significantly reduce the amount
-of stall time required.
+The result in Theorem~\ref{theo:worst-case-optimal} assumes that there is
+only a single available execution unit. This requires that the insertion
+stall amount be large enough to cover all of the reconstructions necessary
+at any moment in time. If we have access to parallel execution units,
+though, we can significantly reduce the amount of stall time required.
 
 The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case
 bound is that it is insufficient to cover only the cost of the last level
@@ -545,7 +551,7 @@ B(n) \cdot \log n
 reconstruction cost to amortize over the $\Theta(n)$ inserts. However,
 additional parallelism will allow us to reduce this.
 At the
-upper limit, assume that there are $\log n$ threads available for
+upper limit, assume that there are $\log n$ execution units available for
 parallel reconstructions. We'll adopt the bulk-synchronous parallel
 (BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In
 this model, computation is broken up into multiple parallel threads
@@ -584,9 +590,10 @@ Given this model, it is possible to derive the following
 new worst-case bound,
 \begin{theorem} \label{theo:par-worst-case-optimal}
-Given a buffered, dynamized structure utilizing the tiering layout policy,
-and at least $\log n$ parallel threads of execution in the BSP model, it
-is possible to maintain a worst-case insertion cost of
+Given a dynamized structure utilizing the reconstruction policy described
+in Algorithm~\ref{alg:tl-relaxed-recon}, and at least $\log n$ execution
+units in the BSP model, it is possible to maintain a worst-case insertion
+cost of
 \begin{equation}
 I(n) \in O\left(\frac{B(n)}{n}\right)
 \end{equation}
@@ -596,9 +603,9 @@ for a data structure with $B(n) \in \Omega(n)$.
 Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting
 that the last level reconstruction will be of cost $\Theta(B(n))$ and
 must be amortized over $\Theta(n)$ inserts. However, unlike in that case,
-we now have $\log n$ threads of parallelism to work with. Thus, each
-time a reconstruction must be performed on an internal level, it can
-be executed on one of these threads in parallel with all other ongoing
+we now have $\log n$ execution units to work with. Thus, each time
+a reconstruction must be performed on an internal level, it can be
+executed on one of these units in parallel with all other ongoing
 reconstructions. As there can be at most one reconstruction per level,
 $\log n$ threads are sufficient to run all possible reconstructions at
 any point in time in parallel.
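The gain from parallel execution units can be illustrated with a small sketch. This is a hypothetical illustration of the argument, not the chapter's implementation: with one unit per level, concurrent per-level reconstructions fully overlap, so the per-insert stall need only cover the largest (last-level) job of cost $B(n)$, rather than the sum over all $\log n$ levels. The function name and cost values are assumptions for illustration.

```python
# Hypothetical sketch: per-insert stall needed to cover all pending
# per-level reconstructions, with and without enough parallel units.

def stall_per_insert(level_costs, n, units):
    """Total stall charged to each of n inserts, given per-level
    reconstruction costs and the number of parallel execution units."""
    if units >= len(level_costs):       # one unit per level: jobs overlap
        return max(level_costs) / n     # only the last-level job matters
    return sum(level_costs) / n         # single unit: jobs serialize

costs = [10, 100, 1000, 8000]           # costs grow geometrically per level
# With a unit per level, stall is O(B(n)/n); with one unit, O(B(n)/n * log n).
assert abs(stall_per_insert(costs, n=1000, units=4) - 8.0) < 1e-9
assert abs(stall_per_insert(costs, n=1000, units=1) - 9.11) < 1e-9
```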