From 784cbc12a8e09f679f95c6b0c67ede58656bd70e Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Mon, 9 Jun 2025 14:31:20 -0400
Subject: updates

---
 chapters/tail-latency.tex | 266 +++++++++++++++++++++++++++++++---------------
 1 file changed, 182 insertions(+), 84 deletions(-)

(limited to 'chapters')

diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 1a468df..ed6c7b8 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -544,20 +544,49 @@ B(n) \cdot \log n
 \end{equation*}
 reconstruction cost to amortize over the $\Theta(n)$ inserts.
 
-However, additional parallelism will allow us to reduce this. At
-the upper limit, assume that there are $\log n$ threads available
-for parallel reconstructions. We'll consider the fork-join model of
-parallelism~\cite{fork-join}, where the initiation of a reconstruction
-on a thread constitutes a fork, and joining the reconstruction thread
-involves applying its updates to the dynamized structure.
-
-This condition allows us to derive a smaller bound in
-certain cases,
+However, additional parallelism will allow us to reduce this. At the
+upper limit, assume that there are $\log n$ threads available for
+parallel reconstructions. We'll adopt the bulk-synchronous parallel
+(BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In
+this model, computation is broken up into multiple parallel threads,
+which are executed independently for a period of time. Intermittently,
+a synchronization barrier is introduced, at which point each of the
+parallel threads is blocked and synchronization of global state
+occurs. This period of independent execution between barriers is
+called a \emph{super-step}.
+
+In this model, the parallel execution cost of a super-step is given
+by the cost of the longest-running computation, the cost of communication
+between the parallel threads, and the cost of the synchronization,
+\begin{equation*}
+\max\left\{w_i\right\} + hg + l
+\end{equation*}
+where $w_i$ is the cost of the $i$th computation, $hg$ describes the
+communication cost, and $l$ is the cost of barrier synchronization. The
+cost for the entire BSP computation is the sum of these terms over all
+the individual super-steps,
+\begin{equation*}
+W + Hg + Sl
+\end{equation*}
+where $W$, $Hg$, and $Sl$ are the computation, communication, and
+synchronization costs summed over all super-steps, and $S$ is the
+number of super-steps in the computation.
+
+We'll model the worst-case reconstruction within the BSP model in the
+following way. Each individual reconstruction will be considered a parallel
+operation, hence $w_i = B(N_B \cdot s^{i+1})$, where $i$ is the level
+number. Because we are operating on a single machine, we can assume
+that the communication cost is constant, $Hg \in \Theta(1)$. During the
+synchronization barrier, any pending structural updates can be applied to
+the dynamization (i.e., blocks added or removed from levels). Assuming
+that tiering is used, this can be done in $l \in \Theta(1)$ time (for
+details on how this is done, see Section~\ref{sssec:tl-versioning}).
+
+Given this model, it is possible to derive the following new worst-case
+bound,
 \begin{theorem}
 \label{theo:par-worst-case-optimal}
 Given a buffered, dynamized structure utilizing the tiering layout policy,
-and at least $\log n$ parallel threads of execution, it is possible to
-maintain a worst-case insertion cost of
+and at least $\log n$ parallel threads of execution in the BSP model, it
+is possible to maintain a worst-case insertion cost of
 \begin{equation}
 I(n) \in O\left(\frac{B(n)}{n}\right)
 \end{equation}
@@ -574,54 +603,105 @@ reconstructions. As there can be at most one
 reconstruction per level, $\log n$ threads are sufficient to run all
 possible reconstructions at any point in time in parallel.
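As a quick numerical illustration of the BSP accounting above (a sketch only, not part of the proof), the following Python stand-in computes the total cost $W + Hg + Sl$ for the reconstruction model under the illustrative assumptions $B(n) = n$ (linear build cost) and unit communication and barrier costs; all names here are hypothetical:

```python
# Sketch of the BSP cost accounting for parallel reconstructions.
# Assumptions (illustrative only): B(n) = n, Hg and l constant (1 unit).

def bsp_total_cost(n, n_b, s, hg=1, l=1):
    """Total BSP cost W + Hg + S*l for the reconstruction model."""
    S = n // n_b                # number of super-steps: one per buffer flush
    w = []                      # w_i = B(N_B * s^(i+1)), with B(n) = n
    cap = n_b * s
    while cap <= n:
        w.append(cap)
        cap *= s
    W = max(w)                  # the last-level reconstruction, cost B(n),
                                # dominates the total computation cost
    return W + hg + S * l

n = 2**16
cost = bsp_total_cost(n, n_b=2**6, s=2)
amortized = cost / n            # I(n) in O(B(n)/n): constant when B(n) = n
```

For $B(n) \in \Theta(n)$ the amortized value stays bounded by a constant as $n$ grows, matching the $O(B(n)/n)$ insertion bound of the theorem.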
-Each reconstruction will require $\Theta(B(N_B \cdot s^{i+1}))$ time to
-complete. Thus, the necessary stall to fully cover a reconstruction on
-level $i$ is this cost, divided by the number of inserts that can occur
-before the reconstruction must be done (i.e., the capacity of the index
-above this point). This gives,
+To fit this into the BSP model, we need to establish the necessary
+frequency of synchronization. At best, we will need to perform one
+synchronized update of the internal structure of the dynamization per
+buffer flush, which will occur $\Theta\left(\frac{n}{N_B}\right)$ times
+over the $\Theta(n)$ inserts. Thus, we will take $S = \frac{n}{N_B}$
+as the number of super-steps, with one super-step per buffer flush. At
+the end of each super-step, any pending structural updates from any
+active reconstruction can be applied.
+
+Within a BSP super-step, the total operational cost is equal to the sum of
+the communication cost (which is $\Theta(1)$ here), the synchronization
+cost (also $\Theta(1)$), and the cost of the longest-running computation.
+The computations themselves are reconstructions, each with a cost of
+$B(N_B \cdot s^{i+1})$. They will, generally, exceed the duration of a
+single super-step, and so their cost will be spread evenly over each
+super-step that elapses from their initiation to their conclusion. As a
+result, in any given super-step, there are two possible states that an
+active reconstruction on level $i$ can be in,
+\begin{enumerate}
+    \item The reconstruction can complete during this super-step,
+    in which case $w_i < \frac{n}{N_B}$ is the amount of work done
+    in this super-step.
+    \item The reconstruction can be incomplete at the end of this
+    super-step, in which case $w_i = \frac{n}{N_B}$.
+\end{enumerate}
+
+It is our intention to determine the time required to complete the full
+reconstruction, the largest computation of which will have cost $B(n)$.
+If we can show that the insertion rate necessary to ensure that this +computation is completed over $\Theta(n)$ inserts is also enough to +ensure that all the smaller reconstructions are completed in time as +well, then we can use the last-level reconstruction cost as the total +cost of the computational component of the BSP cost calculation. + +To accomplish this, consider the necessary stall to fully cover a +reconstruction on level $i$. This is the cost of the reconstruction over +level $i$, divided by the number of inserts that can occur before the +reconstruction must be done (i.e., the capacity of the index above this +point). This gives, \begin{equation*} \delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right) \end{equation*} -necessary stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, -and that the denominator is the sum of a geometric progression, we have +stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that +the denominator is the sum of a geometric progression, we have \begin{align*} \delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\ &O\left( \frac{(1-s) B(N_B\cdot s^{i+1})}{N_B\cdot (s - s^{i+1})} \right) \\ &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right) \end{align*} -For all reconstructions running in parallel, the necessary stall is the -maximum stall of all the parallel reconstructions, -\begin{equation*} -\delta = \max_{i \in [0, \ell]} \left\{ \delta_i \right\} = \max_{i \in [0, \ell]} \left\{ \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right\} -\end{equation*} For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at least as rapidly as the denominator, meaning that $\delta_\ell$ will -always be the largest. $N_B \cdot s^{\ell + 1}$ is $\Theta(n)$, so -we find that, - +always be the largest. 
Thus, the stall necessary to cover the last-level
+reconstruction will be at least as much as is necessary for the internal
+reconstructions.
+
+Given this, we will consider only the cost $B(n)$ of the last-level
+reconstruction in our BSP calculation. This cost will be spread evenly
+over $S=\frac{n}{N_B}$ super-steps, so the cost in each super-step is
+$\frac{B(n)}{S}$. Recalling that $Hg \in \Theta(1)$ and $l \in
+\Theta(1)$, this results in the following overall operational cost over
+all super-steps,
+\begin{align*}
+&\frac{S\cdot B(n)}{S} + Hg + S\cdot l \\
+&B(n) + \frac{n}{N_B}
+\end{align*}
+Given that $B(n) \in \Omega(n)$, we can absorb the communication and
+synchronization cost terms to get a total operational cost of $O(B(n))$,
+which we must evenly distribute over the $\Theta(n)$ inserts. Thus,
+the worst-case insertion cost is,
 \begin{equation*}
-I(n) \in O \left(\frac{B(n)}{n}\right)
+    I(n) \in O\left(\frac{B(n)}{n}\right)
 \end{equation*}
-is the worst-case insertion cost, while ensuring that all reconstructions
-are done in time to maintain the block bound given $\log n$ parallel threads.
+
+While the BSP model places no limit on the amount of available
+parallelism, we'll further restrict this by noting that, at any moment
+in time, there can be at most one reconstruction occurring on each
+level of the structure, and therefore $\log_s n$ threads are sufficient
+to obtain the above result.
 \end{proof}
 
+It's worth noting that the ability to parallelize the reconstructions
+is afforded to us by our technique, and is not possible in Overmars's
+classical formulation, which is inherently single-threaded in nature.
+
 \section{Implementation}
 \label{sec:tl-impl}
 
-The previous section demonstrated that, theoretically, it is possible
-to meaningfully control the tail latency of our dynamization system by
-relaxing the reconstruction processes and throttling the insertion rate,
-rather than blocking, as a means of controlling the shard count within
-the structure.
However, there are a number of practical problems to be
-solved before this idea can be used in a real system. In this section,
-we discuss these problems, and our approaches to solving them to produce
-a dynamization framework based upon the technique. Note that this
-system is based on the same high level architecture as we described in
-Section~\ref{ssec:dyn-concurrency}. To avoid redundancy, we will focus
-on how this system differs, without fully recapitulating the content of
-that earlier section.
+The previous section demonstrated that it is possible to meaningfully
+control the worst-case insertion cost (and, therefore, the insertion tail
+latency) of our dynamization system, at least in theory. This can be done
+by relaxing the reconstruction processes and throttling the insertion
+rate as a means of controlling the shard count within the structure,
+rather than blocking the insertion thread during reconstructions. However,
+there are a number of practical problems to be solved before this idea can
+be used in a real system. In this section, we discuss these problems, and
+our approaches to solving them to produce a dynamization framework based
+upon the technique. Note that this system is based on the same high-level
+architecture as we described in Section~\ref{ssec:dyn-concurrency}. To
+avoid redundancy, we will focus on how this system differs, without
+fully recapitulating the content of that earlier section.
 
 \subsection{Parallel Reconstruction Architecture}
 
@@ -631,20 +711,25 @@
 constructing a framework supporting the parallel reconstruction scheme
 described in the previous section. In particular, it is limited to
 only two active versions of the structure at a time, with one ongoing
 reconstruction. Additionally, it does not consider buffer flushes as
-distinct events from reconstructions. In this section, we will discuss
-the modifications made to the concurrency support within our framework
-to support parallel reconstructions.
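To make the multi-versioned concurrency architecture discussed in this section concrete, a version record (a buffer head pointer, a buffer tail pointer, and a collection of levels of immutable shards) might be sketched as follows. This is a hypothetical Python stand-in for the C++ framework; all names and representations are illustrative:

```python
from dataclasses import dataclass

# Hypothetical sketch of a structure version: a buffer head pointer,
# a buffer tail pointer, and an immutable snapshot of levels of shards.
@dataclass(frozen=True)
class Version:
    buffer_head: int      # oldest buffered record covered by this version
    buffer_tail: int      # one past the newest buffered record
    levels: tuple = ()    # tuple of levels, each a tuple of shard handles

    def install(self, new_levels, new_head):
        # Applying a finished reconstruction produces a *new* version by
        # swapping in the updated level list: an O(1) pointer install,
        # while readers of the old version proceed undisturbed.
        return Version(new_head, self.buffer_tail, tuple(new_levels))

v0 = Version(buffer_head=0, buffer_tail=128, levels=(("shard_a",),))
v1 = v0.install([("shard_a",), ("shard_b",)], new_head=64)
```

Because versions are immutable, multiple versions can coexist: readers pin whichever version was current when they began, while reconstructions install their results as fresh versions.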
-
-Much like the simpler scheme in Section~\ref{ssec:dyn-concurrency},
-our concurrency framework will be based on multi-versioning. Each
-\emph{version} consists of three pieces of information: a buffer
+distinct events from reconstructions. In order to support the result
+of Theorem~\ref{theo:par-worst-case-optimal}, it will be necessary
+to have a concurrency control system that considers reconstructions
+on each level independently, allows for one reconstruction per level
+without any synchronization, and allows each reconstruction to apply
+its results to the active structure in $\Theta(1)$ time, in any order,
+without violating any structural invariants.
+
+To accomplish this, we will use a multi-versioning control scheme that
+is similar to the simple scheme in Section~\ref{ssec:dyn-concurrency}.
+Each \emph{version} will consist of three pieces of information: a buffer
 head pointer, buffer tail pointer, and a collection of levels and
 shards. However, the process of managing, creating, and installing
-versions is much more complex, to allow more than two versions to exist
-at the same time under certain circumstances.
+versions will be more complex, to allow more than two versions to exist
+at the same time under certain circumstances and support the necessary
+features mentioned above.
 
 \subsubsection{Structure Versioning}
-
+\label{sssec:tl-versioning}
 The internal structure of the dynamization consists of a sequence of
 levels containing immutable shards, as well as a snapshot of the state
 of the mutable buffer. This section pertains specifically to the internal
@@ -954,17 +1039,21 @@ count, after which a query will ignore a request for preemption.
 
 \subsection{Insertion Stall Mechanism}
 
 The results of Theorem~\ref{theo:worst-case-optimal} and
-\ref{theo:par-worst-case-optimal} are based upon enforcing a rate
-limit upon incoming inserts, to ensure that there is sufficient time
-for reconstructions to complete.
In practice, calculating and precisely
-stalling for the correct amount of time is quite difficult because of
-the vagaries of working with a real system. While ultimately it would be
-ideal to have a reasonable cost model that can estimate this stall time
-on the fly based on the cost of building a data structure, the number of
-records involved in reconstructions, the number of available threads,
-available memory bandwidth, etc., for the purposes of this prototype
-we have settled for a simple system that demonstrates the robustness of
-the technique.
+\ref{theo:par-worst-case-optimal} are based upon enforcing a rate limit
+upon incoming inserts by manually increasing their cost, to ensure that
+there is sufficient time for reconstructions to complete. The calculation
+and application of this stall factor can be seen as equivalent to
+explicitly limiting the maximum allowed insertion throughput. In this
+section, we consider a mechanism for doing this.
+
+In practice, calculating and precisely stalling for the correct amount
+of time is quite difficult because of the vagaries of working with a
+real system. While ultimately it would be ideal to have a reasonable
+cost model that can estimate this stall time on the fly based on the
+cost of building a data structure, the number of records involved in
+reconstructions, the number of available threads, available memory
+bandwidth, etc., for the purposes of this prototype we have settled for
+a simple system that demonstrates the robustness of the technique.
 
 Recall the basic insert process within our system. Inserts bypass the
 scheduling system and communicate directly with the buffer, on the same
@@ -975,28 +1064,37 @@ the insert has succeeded or not. The insert can fail if the buffer is
 full, in which case the user is expected to delay for a moment and
 retry. Once space has been cleared in the buffer, the insert will
 succeed.
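The insert path just recalled, combined with the probabilistic rejection discussed in this section, might be sketched as follows. This is a hypothetical Python stand-in for the C++ prototype (names are illustrative); the acceptance probability plays the role of the single, atomically updatable tuning parameter:

```python
import random

class RateLimitedBuffer:
    """Sketch: insert path with buffer-full rejection plus
    Bernoulli-sampled rejection for rate limiting (illustrative only)."""

    def __init__(self, capacity, accept_prob, rng=None):
        self.records = []
        self.capacity = capacity
        self.accept_prob = accept_prob   # single tunable parameter
        self.rng = rng or random.Random()

    def try_insert(self, rec):
        # Bernoulli sampling: reject with probability 1 - accept_prob,
        # independently of the insert's position in the workload.
        if self.rng.random() >= self.accept_prob:
            return False                 # caller should delay and retry
        if len(self.records) >= self.capacity:
            return False                 # buffer full: delay and retry
        self.records.append(rec)
        return True

buf = RateLimitedBuffer(capacity=4, accept_prob=1.0)
assert all(buf.try_insert(i) for i in range(4))
assert not buf.try_insert(99)            # buffer full, insert rejected
```

Note that both failure modes surface identically to the client (a failed insert to be retried), which is what lets the rate limiter reuse the existing buffer-full rejection interface unchanged.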
-We can leverage this same rejection mechanism as a form of rate limiting,
-by probabilistically rejecting inserts. To this end, we introduce a
-configurable parameter indicating the probability that an insert will
-succeed. For each insert, we we use Bernoulli sampling based upon this
-probably to determine whether the insert should be attempted or not. If
-the insert is rejected, then the attempt is aborted and the function
-returns a failure. The user can then delay and attempt again.
-
-This approach was selected because it has a few specific
-advantages. First, it is based on a single parameter that can be
-readily updated on demand using atomics. Our current prototype
-uses a single, fixed value for the probability, but ultimately it
-should be dynamically tuned to approximate the $\delta$ value from
-Theorem~\ref{theo:worst-case-optimal} as closely as possible. Second,
-random sampling to apply stalls ensures fairness. Ultimately, there's a
-limit on how small of a stall can be meaningfully applied, particularly
-if we want to avoid busy waiting. This approach allows us to fairly
-approximate these small stalls by using larger, more practical, amounts
-attached to randomly sampled inserts. Finally, the approach is simple
-and requires no significant changes to the user interface, while (as
-we will see in Section~\ref{sec:tl-eval}) directly exposing the design
-space associated with the parameter to the user.
+We can leverage this same mechanism as a form of rate limiting, by
+rejecting new inserts when the throughput rises above a specified level.
+Unfortunately, the most straightforward approach to doing this--monitoring
+the throughput and simply blocking inserts when it rises above a
+specified threshold--is undesirable because the probability of a given
+insert being rejected is not independent. The rejections will tend to
+clump, which reintroduces the tail latency problem we are attempting to
+resolve.
Instead, it would be best to spread the rejections more evenly.
+
+We approximate this rate-limiting behavior by using random sampling
+to determine which inserts to reject. Rather than specifying a maximum
+throughput, the system is configured with a probability of acceptance
+for an insert. This avoids the distributional problems mentioned above
+that arise from direct throughput monitoring, and has a few additional
+benefits. It is based on a single parameter that can be readily updated
+on demand using atomics. Our current prototype uses a single, fixed value
+for the probability, but ultimately it should be dynamically tuned to
+approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal}
+as closely as possible. It also doesn't require significant modification
+of the existing client interfaces.
+
+We have elected to use Bernoulli sampling for the task of selecting which
+inserts to reject. An alternative approach would have been to apply
+systematic sampling. Bernoulli sampling results in the probability of an
+insert being rejected being independent of its order in the workload,
+but is non-deterministic. This means that there is a small probability
+that many more rejections than expected may occur over a short
+span of time. Systematic sampling is not vulnerable to this problem,
+but introduces a dependence between the position of an insert in a
+sequence of operations and its probability of being rejected. We decided
+to prioritize independence in our implementation.
 
 \section{Evaluation}
 \label{sec:tl-eval}
-- 
cgit v1.2.3