From 881ebf51c3de54110aa9d7f3aefbc08e74c73189 Mon Sep 17 00:00:00 2001 From: Douglas Rumbaugh Date: Wed, 28 May 2025 17:12:35 -0400 Subject: updates --- chapters/dynamization.tex | 2 + chapters/tail-latency.tex | 292 ++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 261 insertions(+), 33 deletions(-) diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex index b5bf404..2301537 100644 --- a/chapters/dynamization.tex +++ b/chapters/dynamization.tex @@ -788,6 +788,7 @@ into the $s$ blocks.~\cite{overmars-art-of-dyn} \subsection{Worst-Case Optimal Techniques} +\label{ssec:bsm-worst-optimal} \section{Limitations of Classical Dynamization Techniques} @@ -1018,6 +1019,7 @@ be used. \subsection{Configurability} \subsection{Insertion Tail Latency} +\label{ssec:bsm-tail-latency-problem} \section{Conclusion} diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex index 361dde0..14637c6 100644 --- a/chapters/tail-latency.tex +++ b/chapters/tail-latency.tex @@ -28,23 +28,33 @@ poor. That the structure exhibits reasonable performance on average is the result of these two ends of the distribution balancing each other out. -The reason for this poor tail latency is reconstructions. In order -to provide tight bounds on the number of shards within the structure, -our techniques must block inserts once the buffer has filled, until -sufficient room is cleared in the structure to accomodate these new -records. This results in the worst-case insertion behavior that we -described mathematically in the previous chapter. - -Unfortunately, the design space that we have been considering thus -far is very limited in its ability to meaningfully alter the -worst-case insertion performance. While we have seen that the choice -between leveling and tiering can have some effect, the actual benefit -in terms of tail latency is quite small, and the situation is made -worse by the fact that leveling, which can have better worst-case -insertion performance, lags behind tiering in terms of average -insertion performance. The use of leveling can allow for a small -reduction in the worst case, but at the cost of making the majority -of inserts worse because of increased write amplification. + + +This poor worst-case performance is a direct consequence of the strategies +used by the two structures to support updates. B+Trees use a form of +amortized local reconstruction, whereas the dynamized ISAM tree uses +amortized global reconstruction. Because the B+Tree only reconstructs the +portions of the structure ``local'' to the update, even in the worst case +only a portion of the data structure will need to be adjusted. However, +when using global reconstruction based techniques, the worst-case insert +requires rebuilding either the entirety of the structure (for tiering +or BSM), or at least a very large proportion of it (for leveling). The +fact that our dynamization technique uses buffering, and most of the +shards involved in reconstruction are kept small by the logarithmic +decomposition technique used to partition it, ensures that the majority +of inserts are low cost compared to the B+Tree, but at the extreme end +of the latency distribution, the local reconstruction strategy used by +the B+Tree results in better worst-case performance. + +Unfortunately, the design space that we have been considering thus far +is limited in its ability to meaningfully alter the worst-case insertion +performance. 
While we have seen that the choice of layout policy can have some
effect, the actual benefit in terms of tail latency is quite small, and
the situation is made worse by the fact that leveling, which can have
better worst-case insertion performance, lags behind tiering in terms
of average insertion performance. The use of leveling can allow for a
small reduction in the worst case, but at the cost of making the
majority of inserts worse because of increased write amplification.

\begin{figure}
\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
@@ -68,25 +78,241 @@ ultimately the question of ``which configuration has the best
tail-latency performance'' is more a question of how many insertions
the latency is measured over, than any fundamental trade-offs with the
design space.

-\begin{example}
-Consider two dynamized structures, $\mathscr{I}_A$ and $\mathscr{I}_B$,
-with slightly different configurations. Regardless of the layout
-policy used (of those discussed in Chapter~\ref{chap:design-space}),
-the worst-case insertion will occur when the structure is completely
-full, i.e., after

Thus, in this chapter, we will look beyond the design space we have
thus far considered and design a dynamization system that allows for
meaningful tuning of insertion tail latency. To accomplish this, we
will consider a different way of looking at reconstructions within
dynamized structures.

\section{The Insertion-Query Trade-off}

As reconstructions are at the heart of the insertion tail latency
problem, it is worth taking a moment to consider \emph{why} they must
be done at all. Fundamentally, decomposition-based dynamization
techniques trade between insertion and query performance by
controlling the number of blocks in the decomposition. Reconstructions
serve to bound the number of blocks, which allows query performance
bounds to be enforced. This trade-off between insertion and query
performance by way of block count is most directly visible in the
equal block method described in Section~\ref{ssec:ebm}. As a reminder,
this technique provides the following worst-case insertion and query
bounds,
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
\end{align*}
where $f(n)$ is the number of blocks.

Figure~\ref{fig:tl-ebm-trade-off} shows the trade-off between
insertion and query performance for a dynamized ISAM tree using the
equal block method, for various numbers of blocks. The trade-off is
evident in the figure, with a linear relationship between insertion
throughput and query latency, mediated by the number of blocks in the
dynamized structure (the block counts are annotated on each point in
the plot). As the number of blocks is increased, their size is
reduced, leading to less expensive inserts in terms of both amortized
and worst-case cost. However, the additional blocks make queries more
expensive.
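As a purely illustrative instantiation of these bounds (not one of the
configurations benchmarked here), consider the choice $f(n) =
\sqrt{n}$. Substituting into the expressions above gives
\begin{align*}
I(n) &\in \Theta\left(\sqrt{n}\right) \\
\mathscr{Q}(n) &\in \Theta\left(\sqrt{n} \cdot \mathscr{Q}\left(\sqrt{n}\right)\right)
\end{align*}
and, for a block structure with logarithmic lookups such as the ISAM
tree, the query cost works out to $\Theta(\sqrt{n} \log n)$. Selecting
a faster-growing $f(n)$ shifts cost from inserts onto queries, and a
slower-growing one does the reverse.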
\begin{figure}
\centering
\includegraphics[width=.75\textwidth]{img/tail-latency/ebm-count-sweep.pdf}
\caption{The Insert-Query Trade-off for the Equal Block Method with a
varying number of blocks}
\label{fig:tl-ebm-trade-off}
\end{figure}

While the equal block method does allow for direct tuning of the
worst-case insert cost, and exposes a very clean trade-off space for
average query and insert performance, the technique is not well suited
to our purposes because its amortized insertion performance is not
particularly good: the insertion throughput is many times worse than
is possible with our dynamization framework for an equivalent query
latency.\footnote{
    In actuality, the insertion performance of the equal block method
    is even \emph{worse} than the numbers presented here. For this
    particular benchmark, we implemented the technique knowing the
    number of records in advance, and so fixed the size of each block
    from the start. This avoided the need to do any repartitioning as
    the structure grew, and reduced write amplification.
}
This is because, in our Bentley-Saxe-based technique, the variable
size of the blocks allows the majority of the reconstructions to occur
over smaller structures, while the majority of the records reside in a
single large block at the bottom of the structure. This setup enables
high insertion throughput while keeping the block count small. But, as
we have seen, the cost of this is large tail latencies. However, we
can use the extreme ends of the equal block method's design space to
establish upper limits on the insertion and query performance that we
might expect to get out of a dynamized structure.

Consider what would happen if we were to modify our dynamization
framework to avoid all reconstructions. We retain a buffer of size
$N_B$, which we flush to create a shard when full; however, we never
touch the shards once they are created. This is effectively the equal
block method, where every block is fixed at $N_B$ capacity. Such a
technique would result in a worst-case insertion cost of $I(n) \in
\Theta(B(N_B))$ and produce $\Theta\left(\frac{n}{N_B}\right)$ shards
in total, resulting in $\mathscr{Q}(n) \in O(n \cdot
\mathscr{Q}_s(N_B))$ worst-case query cost for a decomposable search
problem. Applying this technique to an ISAM tree and comparing it
against a B+Tree yields the insertion and query latency distributions
shown in Figure~\ref{fig:tl-floodl0}.

\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
\subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\

\caption{Latency Distributions for a ``Reconstructionless'' Dynamization}
\label{fig:tl-floodl0}
\end{figure}

Figure~\ref{fig:tl-floodl0-insert} shows that it is possible to obtain
insertion latency distributions using amortized global reconstruction
that are directly comparable to dynamic structures based on amortized
local reconstruction, at least in some cases. In particular, the
worst-case insertion tail latency in this model is a direct function
of the buffer size, as the worst-case insert occurs when the buffer
must be flushed to a shard. However, this performance comes at the
cost of queries, which are incredibly slow compared to B+Trees, as
shown in Figure~\ref{fig:tl-floodl0-query}.
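To make this ``reconstructionless'' setup concrete, a minimal sketch
of how it could be realized is shown below. The class and member names
are hypothetical illustrations, not the interfaces of our framework,
and an actual shard would be an ISAM tree rather than a sorted vector.

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <vector>

// Records accumulate in a mutable buffer of capacity N_B. When the
// buffer fills, it is sorted and sealed into an immutable shard that
// is never touched again, so no reconstructions ever occur and a
// query must probe every shard.
template <typename R>
class ReconstructionlessIndex {
public:
    explicit ReconstructionlessIndex(std::size_t buffer_cap)
        : m_cap(buffer_cap) {}

    // O(1) append, except when the buffer is sealed into a shard,
    // which costs Theta(B(N_B)) -- here, a sort of N_B records.
    void insert(const R &rec) {
        m_buffer.push_back(rec);
        if (m_buffer.size() >= m_cap) {
            std::sort(m_buffer.begin(), m_buffer.end());
            m_shards.push_back(std::move(m_buffer));
            m_buffer.clear();
        }
    }

    // Point lookup: Theta(n / N_B) shards, each probed via binary
    // search, plus a linear scan of the unsorted buffer.
    bool contains(const R &rec) const {
        for (const auto &shard : m_shards) {
            if (std::binary_search(shard.begin(), shard.end(), rec)) {
                return true;
            }
        }
        return std::find(m_buffer.begin(), m_buffer.end(), rec)
               != m_buffer.end();
    }

private:
    std::size_t m_cap;                     // N_B
    std::vector<R> m_buffer;               // mutable buffer
    std::vector<std::vector<R>> m_shards;  // immutable shards
};
\end{verbatim}

Every insert is a constant-time append except the one that seals a
full buffer, which pays the $\Theta(B(N_B))$ build cost; a query, by
contrast, must probe all $\Theta(n / N_B)$ shards, which is the source
of the query latencies seen in Figure~\ref{fig:tl-floodl0-query}.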
While this approach is not useful on its own, it does demonstrate that
poor insertion tail latency is not intrinsic to amortized global
reconstruction: with reconstructions removed, the insertion latency
distribution is comparable to that of the B+Tree, albeit at an
enormous cost in query performance.

\section{Relaxed Reconstruction}

There is theoretical work in this area, which we discussed in
Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for
controlling the worst-case insertion cost is to break the largest
reconstructions up into small sequences of operations that can then be
attached to each insert, spreading the total workload out and ensuring
that each insert takes a consistent amount of time. Theoretically, the
total throughput should remain about the same when doing this, but
rather than a bursty latency distribution with many fast inserts and a
small number of incredibly slow ones, the distribution should be far
more uniform.

Unfortunately, this technique has a number of limitations that we
discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for
this discussion, they are:
\begin{enumerate}

    \item In the Bentley-Saxe method, the worst-case reconstruction
    involves every record in the structure. As such, it cannot be
    performed ``in advance'' without significant extra work. This
    problem requires the worst-case optimized dynamization systems to
    maintain complicated arrangements of partially built structures.

    \item The approach assumes that the workload of building a block
    can be evenly divided in advance, and somehow attached to inserts.
    Even for simple structures, this requires a large amount of manual
    adjustment to the data structure reconstruction routines, and does
    not admit simple, generalized interfaces.

\end{enumerate}
In this section, we consider how these restrictions can be overcome
given our dynamization framework, and propose a strategy that achieves
the same worst-case insertion time as the worst-case optimized
techniques, given a few assumptions about available resources.

At a very high level, our proposed approach is as follows. We will
fully detach reconstructions from buffer flushes. When the buffer
fills, it will immediately flush and a new shard will be placed in L0.
Reconstructions will be performed in the background to maintain the
internal structure according, roughly, to tiering. When a level
contains $s$ shards, a reconstruction will immediately be triggered to
merge these shards and push the result down to the next level. To
ensure that the number of shards in the structure remains bounded by
$\Theta(\log n)$, we will throttle the insertion rate so that it is
balanced with the amount of time needed to complete reconstructions.

\begin{figure}
\caption{Several ``states'' of tiering, leading up to the worst-case
reconstruction.}
\label{fig:tl-tiering}
\end{figure}

First, we'll consider how to ``spread out'' the cost of the worst-case
reconstruction. Figure~\ref{fig:tl-tiering} shows various stages in
the development of the internal structure of a dynamized index using
tiering. Importantly, note that the last-level reconstruction, which
dominates the cost of the worst-case reconstruction, \emph{is able to
be performed well in advance}. All of the records necessary to perform
this reconstruction are present in the last level $\Theta(n)$ inserts
before the reconstruction must be done to make room. This is a
significant advantage of our technique over the normal Bentley-Saxe
method, and it allows us to spread the cost of this reconstruction
over a number of inserts without much of the complexity
of~\cite{overmars81}.
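Before formalizing this, it may help to sketch what the insertion-side
throttling could look like. The fragment below charges each insert a
stall proportional to the outstanding reconstruction work; the types,
names, and accounting are assumptions made for illustration only, not
the framework's actual implementation.

\begin{verbatim}
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

// Shared between the insert path and the background reconstruction
// thread: remaining_work is the number of records still to be
// processed by in-flight reconstructions, and remaining_inserts is
// the number of inserts that can still be admitted before some level
// would exceed its shard bound.
struct ReconstructionState {
    std::atomic<std::size_t> remaining_work{0};
    std::atomic<std::size_t> remaining_inserts{0};
};

// Stall for delta = (remaining work / remaining inserts) units of
// build time, spreading the pending work evenly over the admissible
// inserts. Called once per insert, after the O(1) buffer append.
inline void insertion_stall(const ReconstructionState &st,
                            std::chrono::nanoseconds per_record_cost) {
    std::size_t inserts = st.remaining_inserts.load();
    if (inserts == 0) {
        return;  // simplified: a real system would block here
    }
    auto delta = (per_record_cost * st.remaining_work.load()) / inserts;
    std::this_thread::sleep_for(delta);
}
\end{verbatim}

With the last-level reconstruction requiring $\Theta(B(n))$ work
spread over the $\Theta(n)$ inserts that can be admitted before it
must complete, this charge amounts to a $\Theta\left(\frac{B(n)}{n}\right)$
stall per insert.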
This line of reasoning leads us to the following result,
\begin{theorem}
Given a buffered, dynamized structure utilizing the tiering layout
policy, and at least $2$ parallel threads of execution, it is possible
to maintain a worst-case insertion cost of
\begin{equation}
I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right)
\end{equation}
\end{theorem}
\begin{proof}
Consider the cost of the worst-case reconstruction, which in tiering
will be of cost $\Theta(B(n))$. This reconstruction requires all of
the blocks on the last level of the structure. At the point at which
the last level is full, there will be $\Theta(n)$ inserts before the
last level must be merged and a new level added.

To ensure that the reconstruction has been completed by the time these
$\Theta(n)$ inserts have been performed, it is sufficient to guarantee
that the rate of inserts is slow enough. Ignoring the cost of buffer
flushing, this means that inserts must cost,
\begin{equation*}
-n_\text{worst} = N_B + \sum_{i=0}^{\log_s n} N_B \cdot s^{i+1}
I(n) \in \Theta(1 + \delta)
\end{equation*}
-Let this be $n_a$ for $\mathscr{I}_B$ and $n_b$ for $\mathscr{I}_B$,
-and let $\mathscr{I}_A$ be configured with scale factor $s_a$ and
-$\mathscr{I}_B$ with scale factor $s_b$, such that $s_a < s_b$.
-\end{example}
where the $\Theta(1)$ is the cost of appending to the mutable buffer,
and $\delta$ is a stall inserted into the insertion process to ensure
that the necessary reconstructions are completed in time.

-The upshot of this discussion is that tail latencies are due to the
-worst-case reconstructions associated with this method, and that the
-proposed design space does not provide the necessary tools to avoid or
-reduce these costs.
To identify the value of $\delta$, we note that each insert must take
at least $\frac{B(n)}{n}$ time to fully cover the cost of the
last-level reconstruction. However, this is not sufficient to
guarantee the bound, as other reconstructions will also occur within
the structure. At the point at which the last-level reconstruction can
be scheduled, there will be exactly $1$ shard on each level. Thus,
each level will potentially also have an ongoing reconstruction that
must be covered by inserting more stall time, to ensure that no level
in the structure exceeds $s$ shards. There are $\log n$ levels in
total, and so in the worst case we will need to introduce extra stall
time to account for a reconstruction on each level,
\begin{equation*}
I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
\end{equation*}
All of these internal reconstructions will be strictly smaller than
the last-level reconstruction, and so we can bound each of them above
by $\frac{B(n)}{n}$ time.

-\section{The Insertion-Query Trade-off}
Given this, and assuming that the smallest (i.e., most pressing)
reconstruction is prioritized on the background thread, we find that
\begin{equation*}
I(n) \in \Theta\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
\end{proof}

This approach results in a worst-case insertion latency bound
equivalent to that of~\cite{overmars81}, but manages to resolve both
of the issues cited above. By leveraging two parallel threads, instead
of trying to manually multiplex a single thread, this approach
requires \emph{no} modification to the user's shard code to function.
And, by leveraging the fact that reconstructions under tiering are
strictly local to a single level, we avoid the need for any
complicated additional structures to manage partially built shards as
new records are added.
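As a quick, purely illustrative reading of this bound (using assumed
build costs, not measured ones): for shards that can be built from
their inputs in linear time, $B(n) \in \Theta(n)$, the worst-case
per-insert cost is
\begin{equation*}
I(n) \in \Theta\left(\frac{n}{n} \log n\right) = \Theta(\log n),
\end{equation*}
while a build cost of $B(n) \in \Theta(n \log n)$ would give $I(n) \in
\Theta(\log^2 n)$. In either case the worst case is polylogarithmic,
rather than the $\Theta(B(n))$ stall incurred when the last-level
reconstruction is performed synchronously on the insertion path.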
\section{Implementation}

\subsection{Parallel Reconstruction Architecture}

\subsection{Concurrent Queries}

\subsection{Query Pre-emption}

\subsection{Insertion Stall Mechanism}

\section{Evaluation}

\section{Conclusion}
--
cgit v1.2.3