path: root/chapters/tail-latency.tex
author     Douglas Rumbaugh <dbr4@psu.edu>    2025-05-28 17:12:35 -0400
committer  Douglas Rumbaugh <dbr4@psu.edu>    2025-05-28 17:12:35 -0400
commit     881ebf51c3de54110aa9d7f3aefbc08e74c73189 (patch)
tree       c7022dd6973310eded2ffe1b9f403ea984176a21 /chapters/tail-latency.tex
parent     6c9d7385e5979112947b89060716a81961f17538 (diff)
updates
Diffstat (limited to 'chapters/tail-latency.tex')
-rw-r--r--  chapters/tail-latency.tex  292
1 file changed, 259 insertions(+), 33 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 361dde0..14637c6 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -28,23 +28,33 @@ poor. That the structure exhibits reasonable performance on average
is the result of these two ends of the distribution balancing each
other out.
-The reason for this poor tail latency is reconstructions. In order
-to provide tight bounds on the number of shards within the structure,
-our techniques must block inserts once the buffer has filled, until
-sufficient room is cleared in the structure to accomodate these new
-records. This results in the worst-case insertion behavior that we
-described mathematically in the previous chapter.
-
-Unfortunately, the design space that we have been considering thus
-far is very limited in its ability to meaningfully alter the
-worst-case insertion performance. While we have seen that the choice
-between leveling and tiering can have some effect, the actual benefit
-in terms of tail latency is quite small, and the situation is made
-worse by the fact that leveling, which can have better worst-case
-insertion performance, lags behind tiering in terms of average
-insertion performance. The use of leveling can allow for a small
-reduction in the worst case, but at the cost of making the majority
-of inserts worse because of increased write amplification.
+
+
+This poor worst-case performance is a direct consequence of the
+strategies the two structures use to support updates. B+Trees use a
+form of amortized local reconstruction, whereas the dynamized ISAM
+tree uses amortized global reconstruction. Because the B+Tree only
+reconstructs the portions of the structure ``local'' to the update,
+even in the worst case only a portion of the data structure needs to
+be adjusted. When using global reconstruction techniques, however, the
+worst-case insert requires rebuilding either the entirety of the
+structure (for tiering or BSM), or at least a very large proportion of
+it (for leveling). Because our dynamization technique uses buffering,
+and because the logarithmic decomposition used to partition the data
+keeps most of the shards involved in reconstructions small, the
+majority of inserts are inexpensive compared to the B+Tree. At the
+extreme end of the latency distribution, however, the B+Tree's local
+reconstruction strategy results in better worst-case performance.
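+To make the contrast concrete (using standard worst-case bounds rather
+than anything specific to our implementations): a B+Tree insert that
+splits nodes along its entire root-to-leaf path rewrites only
+$O(\log_b n)$ nodes, where $b$ is the fanout, while the worst-case
+insert into the dynamized ISAM tree triggers a reconstruction involving
+$\Theta(n)$ records,
+\begin{equation*}
+\underbrace{O(\log_b n)}_{\text{B+Tree (local)}}
+\quad \text{vs.} \quad
+\underbrace{\Theta(n)}_{\text{dynamized ISAM tree (global)}}
+\end{equation*}
+which is the gap that appears at the far ends of the two latency
+distributions.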
+
+Unfortunately, the design space that we have been considering thus far
+is limited in its ability to meaningfully alter the worst-case insertion
+performance. While we have seen that the choice of layout policy can have
+some effect, the actual benefit in terms of tail latency is quite small,
+and the situation is made worse by the fact that leveling, which can
+have better worst-case insertion performance, lags behind tiering in
+terms of average insertion performance. The use of leveling can allow
+for a small reduction in the worst case, but at the cost of making the
+majority of inserts worse because of increased write amplification.
\begin{figure}
\subfloat[Scale Factor Sweep]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-parm-sf}}
@@ -68,25 +78,241 @@ ultimately the question of ``which configuration has the best tail-latency
performance'' is more a question of how many insertions the latency is
measured over, than any fundamental trade-offs with the design space.
-\begin{example}
-Consider two dynamized structures, $\mathscr{I}_A$ and $\mathscr{I}_B$,
-with slightly different configurations. Regardless of the layout
-policy used (of those discussed in Chapter~\ref{chap:design-space}),
-the worst-case insertion will occur when the structure is completely
-full, i.e., after
+Thus, in this chapter, we will look beyond the design space considered
+so far and design a dynamization system that allows for meaningful
+tuning of tail latency. To accomplish this, we will consider a
+different way of looking at reconstructions within dynamized
+structures.
+
+\section{The Insertion-Query Trade-off}
+
+As reconstructions are at the heart of the insertion tail latency problem,
+it seems worth taking a moment to consider \emph{why} they must be done
+at all. Fundamentally, decomposition-based dynamization techniques trade
+between insertion and query performance by controlling the number of blocks
+in the decomposition. Reconstructions serve to place a bound on the
+number of blocks, to allow for query performance bounds to be enforced.
+This trade-off between insertion and query performance by way of block
+count is most directly visible in the equal block method described
+in Section~\ref{ssec:ebm}. As a reminder, this technique provides the
+following worst-case insertion and query bounds,
+\begin{align*}
+I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
+\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
+\end{align*}
+where $f(n)$ is the number of blocks.
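+For instance (purely as an illustration of these bounds, rather than a
+configuration we evaluate), setting $f(n) \in \Theta(\sqrt{n})$ yields
+\begin{align*}
+I(n) &\in \Theta\left(\sqrt{n}\right) \\
+\mathscr{Q}(n) &\in \Theta\left(\sqrt{n} \cdot \mathscr{Q}\left(\sqrt{n}\right)\right)
+\end{align*}
+and, more generally, increasing the block count $f(n)$ shifts cost away
+from inserts and onto queries, while decreasing it does the opposite.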
+
+Figure~\ref{fig:tl-ebm-trade-off} shows the trade-off between insertion
+and query performance for a dynamized ISAM tree using the equal block
+method, for various numbers of blocks. The trade-off is evident in the
+figure, with a linear relationship between insertion throughput and query
+latency, mediated by the number of blocks in the dynamized structure (the
+block counts are annotated on each point in the plot). As the number of
+blocks is increased, their size is reduced, leading to less expensive
+inserts in terms of both amortized and worst-case cost. However, the
+additional blocks make queries more expensive.
+
+
+\begin{figure}
+\centering
+\includegraphics[width=.75\textwidth]{img/tail-latency/ebm-count-sweep.pdf}
+\caption{The Insert-Query Tradeoff for the Equal Block Method with a
+varying number of blocks}
+\label{fig:tl-ebm-trade-off}
+\end{figure}
+
+While the equal block method does allow for direct tuning of the
+worst-case insert cost, and exposes a very clean trade-off space for
+average query and insert performance, the technique is not well suited
+to our purposes because its amortized insertion performance is not
+particularly good: its insertion throughput is many times worse than is
+possible with our dynamization framework at an equivalent query
+latency.\footnote{
+ In actuality, the insertion performance of the equal block method is
+ even \emph{worse} than the numbers presented here. For this particular
+ benchmark, we implemented the technique knowing the number of records in
+ advance, and so fixed the size of each block from the start. This avoided
+ the need to do any repartitioning as the structure grew, and reduced write
+ amplification.
+}
+This is because, in our Bentley-Saxe-based technique, the variable size
+of the blocks allows the majority of the reconstructions to occur over
+smaller structures, while the majority of the records reside in a
+single large block at the bottom of the structure. This arrangement
+enables high insertion throughput while keeping the block count small,
+but, as we have seen, the cost of this is large tail latencies.
+However, we can use the extreme ends of the equal block method's design
+space to establish upper limits on the insertion and query performance
+that we might expect from a dynamized structure.
+
+
+Consider what would happen if we were to modify our dynamization
+framework to avoid all reconstructions. We retain a buffer of size
+$N_B$, which we flush to create a shard when full; however, we never
+touch the shards once they are created. This is effectively the equal
+block method, where every block is fixed at a capacity of $N_B$
+records. Such a technique would result in a worst-case insertion cost
+of $I(n) \in \Theta(B(N_B))$ and produce
+$\Theta\left(\frac{n}{N_B}\right)$ shards in total, resulting in a
+worst-case query cost of $\mathscr{Q}(n) \in O(n \cdot
+\mathscr{Q}_s(N_B))$ for a decomposable search problem. Applying this
+technique to an ISAM tree and comparing it against a B+Tree yields the
+insertion and query latency distributions shown in
+Figure~\ref{fig:tl-floodl0}.
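+
+To make this scheme concrete, a minimal sketch of it is given below.
+This is an illustration only, and not the implementation used for the
+numbers in this section: the \texttt{Shard} type, its construction from
+a vector of records, and the class and accessor names are assumptions
+standing in for the framework's actual interfaces.
+\begin{verbatim}
+// Minimal sketch of the "reconstructionless" scheme described above.
+// Interfaces are illustrative; Shard is assumed to be constructible
+// from a vector of records.
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+template <typename Shard, typename Record>
+class FloodL0 {
+public:
+    explicit FloodL0(std::size_t buffer_cap) : m_cap(buffer_cap) {}
+
+    // Worst case is a single shard build over N_B records (a buffer
+    // flush); every other insert is a constant-time append.
+    void insert(const Record &rec) {
+        m_buffer.push_back(rec);
+        if (m_buffer.size() >= m_cap) {
+            m_shards.push_back(std::make_unique<Shard>(m_buffer));
+            m_buffer.clear();
+        }
+    }
+
+    // A query must be answered against the buffer and each of the
+    // Theta(n / N_B) shards, and the partial results combined, which
+    // is the source of the poor query performance.
+    const std::vector<std::unique_ptr<Shard>> &shards() const {
+        return m_shards;
+    }
+    const std::vector<Record> &buffer() const { return m_buffer; }
+
+private:
+    std::size_t m_cap;
+    std::vector<Record> m_buffer;
+    std::vector<std::unique_ptr<Shard>> m_shards;
+};
+\end{verbatim}
+Because the insertion path never merges or rebuilds shards, its worst
+case is a single buffer flush, at the cost of a shard count that grows
+linearly with $n$.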
+
+\begin{figure}
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
+\subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\
+
+\caption{Latency Distributions for a ``Reconstructionless'' Dynamization}
+\label{fig:tl-floodl0}
+\end{figure}
+
+Figure~\ref{fig:tl-floodl0-insert} shows that, at least in some cases,
+it is possible for amortized global reconstruction to produce insertion
+latency distributions directly comparable to those of dynamic
+structures based on amortized local reconstruction. In particular, the
+worst-case insertion tail latency in this model is a direct function of
+the buffer size, as the worst-case insert occurs when the buffer must
+be flushed into a shard. However, this performance comes at the cost of
+queries, which are dramatically slower than those of the B+Tree, as
+shown in Figure~\ref{fig:tl-floodl0-query}.
+
+While this approach is not useful on its own, it does demonstrate the
+insertion performance that is achievable with amortized global
+reconstruction once reconstruction costs are removed entirely, and it
+serves as a useful upper limit against which to measure the techniques
+proposed in the remainder of this chapter.
+
+\section{Relaxed Reconstruction}
+
+There is theoretical work in this area, which we discussed in
+Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for
+controlling the worst-case insertion cost is to break the largest
+reconstructions up into small sequences of operations that can then be
+attached to individual inserts, spreading the total workload out and
+ensuring that each insert takes a consistent amount of time. In theory,
+the total throughput should remain about the same, but rather than a
+bursty latency distribution with many fast inserts and a small number
+of extremely slow ones, the distribution should be far more uniform.
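+As a simple illustration of the arithmetic: if a reconstruction costing
+$B(n)$ must be finished within the next $\Theta(n)$ inserts, then
+attaching
+\begin{equation*}
+\Theta\left(\frac{B(n)}{n}\right)
+\end{equation*}
+units of reconstruction work to each insert is enough to complete it in
+time, which is the kind of per-insert overhead these techniques target.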
+
+Unfortunately, this technique has a number of limitations that we
+discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for
+this discussion, they are
+\begin{enumerate}
+
+	\item In the Bentley-Saxe method, the worst-case reconstruction
+	involves every record in the structure. As such, it cannot be
+	performed ``in advance'' without significant extra work. This
+	forces worst-case optimized dynamization systems to maintain
+	complicated arrangements of partially built structures.
+
+	\item The approach assumes that the work of building a block can
+	be evenly divided in advance and somehow attached to inserts. Even
+	for simple structures, this requires a large amount of manual
+	adjustment to the data structure's reconstruction routines, and it
+	does not admit simple, generalized interfaces.
+
+\end{enumerate}
+In this section, we consider how these restrictions can be overcome given
+our dynamization framework, and propose a strategy that achieves the
+same worst-case insertion time as the worst-case optimized techniques,
+given a few assumptions about available resources.
+
+At a very high level, our proposed approach is as follows. We will
+fully detach reconstructions from buffer flushes. When the buffer
+fills, it will immediately be flushed and a new shard will be placed in
+L0. Reconstructions will be performed in the background to maintain the
+internal structure according, roughly, to tiering: when a level
+contains $s$ shards, a reconstruction will immediately be triggered to
+merge these shards and push the result down to the next level. To
+ensure that the number of shards in the structure remains bounded by
+$\Theta(\log n)$, we will throttle the insertion rate so that it is
+balanced against the amount of time needed to complete reconstructions.
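+
+The sketch below illustrates this policy. It is deliberately
+simplified: synchronization between the insertion path and the
+background maintenance work is omitted, as is the insertion stall
+itself, and the \texttt{Shard} type is assumed to be buildable both
+from the buffer and by merging the shards of a level. The names are
+illustrative and do not reflect the framework's actual interfaces.
+\begin{verbatim}
+// Sketch of the relaxed reconstruction policy described above. All
+// names are illustrative; Shard is assumed to be constructible from a
+// vector of records (a buffer flush) and from a vector of shard
+// pointers (a merge). Synchronization and the insertion stall are
+// omitted for clarity.
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+template <typename Shard, typename Record>
+class RelaxedTiering {
+public:
+    RelaxedTiering(std::size_t buffer_cap, std::size_t scale_factor)
+        : m_cap(buffer_cap), m_sf(scale_factor), m_levels(1) {}
+
+    // Insertion path: append to the buffer and, when it fills, flush
+    // it directly into level 0. No reconstruction work happens here.
+    void insert(const Record &rec) {
+        m_buffer.push_back(rec);
+        if (m_buffer.size() >= m_cap) {
+            m_levels[0].push_back(std::make_unique<Shard>(m_buffer));
+            m_buffer.clear();
+        }
+    }
+
+    // Maintenance path, intended to run on a background thread: when a
+    // level accumulates s shards, merge them into a single shard on
+    // the level below, adding a new level if necessary.
+    void maintain() {
+        for (std::size_t i = 0; i < m_levels.size(); i++) {
+            if (m_levels[i].size() >= m_sf) {
+                if (i + 1 == m_levels.size()) {
+                    m_levels.emplace_back();
+                }
+                m_levels[i + 1].push_back(
+                    std::make_unique<Shard>(m_levels[i]));
+                m_levels[i].clear();
+            }
+        }
+    }
+
+private:
+    std::size_t m_cap;
+    std::size_t m_sf;
+    std::vector<Record> m_buffer;
+    std::vector<std::vector<std::unique_ptr<Shard>>> m_levels;
+};
+\end{verbatim}
+The important difference from the blocking behavior described at the
+start of this chapter is that the insertion path never waits for a
+reconstruction to finish; the insertion stall analyzed below is what
+keeps the two paths in balance.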
+
+\begin{figure}
+\caption{Several ``states'' of tiering, leading up to the worst-case
+reconstruction.}
+\label{fig:tl-tiering}
+\end{figure}
+
+First, we will consider how to ``spread out'' the cost of the
+worst-case reconstruction. Figure~\ref{fig:tl-tiering} shows various
+stages in the development of the internal structure of a dynamized
+index using tiering. Importantly, note that the last-level
+reconstruction, which dominates the cost of the worst-case
+reconstruction, \emph{can be performed well in advance}. All of the
+records necessary to perform this reconstruction are present in the
+last level $\Theta(n)$ inserts before the reconstruction must be
+completed to make room. This is a significant advantage of our
+technique over the normal Bentley-Saxe method, and it will allow us to
+spread the cost of this reconstruction over a number of inserts without
+much of the complexity of~\cite{overmars81}. This leads us to the
+following result.
+\begin{theorem}
+Given a buffered, dynamized structure utilizing the tiering layout policy,
+and at least $2$ parallel threads of execution, it is possible to maintain
+a worst-case insertion cost of
+\begin{equation}
+I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right)
+\end{equation}
+\end{theorem}
+\begin{proof}
+Consider the worst-case reconstruction, which under tiering has cost
+$\Theta(B(n))$. This reconstruction requires all of the blocks on the
+last level of the structure. At the point at which the last level
+becomes full, there will be $\Theta(n)$ inserts before the last level
+must be merged and a new level added.
+
+To ensure that this reconstruction finishes before those $\Theta(n)$
+inserts have been completed, it is enough to guarantee that inserts
+proceed sufficiently slowly. Ignoring the cost of buffer flushing, this
+means that inserts must cost,
\begin{equation*}
-n_\text{worst} = N_B + \sum_{i=0}^{\log_s n} N_B \cdot s^{i+1}
+I(n) \in \Theta(1 + \delta)
\end{equation*}
-Let this be $n_a$ for $\mathscr{I}_B$ and $n_b$ for $\mathscr{I}_B$,
-and let $\mathscr{I}_A$ be configured with scale factor $s_a$ and
-$\mathscr{I}_B$ with scale factor $s_b$, such that $s_a < s_b$.
-\end{example}
+where the $\Theta(1)$ is the cost of appending to the mutable buffer,
+and $\delta$ is a stall inserted into the insertion process to ensure
+that the necessary reconstructions are completed in time.
-The upshot of this discussion is that tail latencies are due to the
-worst-case reconstructions associated with this method, and that the
-proposed design space does not provide the necessary tools to avoid or
-reduce these costs.
+To identify the value of $\delta$, we note that each insert must take
+at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level
+reconstruction. However, this is not sufficient to guarantee the bound, as
+other reconstructions will also occur within the structure. At the point
+at which the last level reconstruction can be scheduled, there will be
+exactly $1$ shard on each level. Thus, each level will potentially also
+have an ongoing reconstruction that must be covered by inserting more
+stall time, to ensure that no level in the structure exceeds $s$ shards.
+There are $\log n$ levels in total, and so in the worst case we will
+need to introduce extra stall time to account for a reconstruction on
+each level,
+\begin{equation*}
+I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
+\end{equation*}
+All of these internal reconstructions will be strictly smaller than the
+last-level reconstruction, and so we can bound each of the associated
+stalls from above by $\frac{B(n)}{n}$ time.
-\section{The Insertion-Query Trade-off}
+Given this, and assuming that the smallest (i.e., most pressing)
+reconstruction is prioritized on the background thread, we find that
+\begin{equation*}
+I(n) \in \Theta\left(\frac{B(n)}{n} \cdot \log n\right)
+\end{equation*}
+\end{proof}
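+
+As a point of reference for this bound: if we assume that shards can be
+built by merging sorted shards in $B(n) \in \Theta(n)$ time, as is
+possible for structures such as the ISAM tree, then the bound becomes
+\begin{equation*}
+I(n) \in \Theta\left(\frac{n}{n} \cdot \log n \right) = \Theta(\log n)
+\end{equation*}
+which matches, up to constant factors, the worst-case cost of a B+Tree
+insert.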
+
+This approach results in a worst-case insertion latency bound
+equivalent to that of~\cite{overmars81}, but manages to resolve both of
+the issues cited above. By leveraging two parallel threads, instead of
+trying to manually multiplex a single thread, this approach requires
+\emph{no} modification to the user's shard code to function. And, by
+exploiting the fact that reconstructions under tiering are strictly
+local to a single level, we avoid the need for any complicated
+additional structures to manage partially built shards as new records
+arrive.
+
+\section{Implementation}
+
+\subsection{Parallel Reconstruction Architecture}
+
+\subsection{Concurrent Queries}
+
+\subsection{Query Pre-emption}
+
+\subsection{Insertion Stall Mechanism}
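+
+One possible realization of the insertion stall, sketched below, is to
+estimate the outstanding reconstruction work and spread that estimate
+evenly over the inserts remaining before the affected levels fill,
+sleeping for the resulting duration on each insert. The structure and
+names here are illustrative assumptions, not a description of the
+framework's implementation.
+\begin{verbatim}
+// Illustrative sketch of a per-insert stall computation. The estimates
+// themselves (outstanding work, remaining insert budget) are assumed
+// to be maintained elsewhere.
+#include <chrono>
+#include <thread>
+
+struct StallController {
+    // Estimated time of reconstruction work still outstanding.
+    std::chrono::nanoseconds outstanding_work{0};
+
+    // Inserts remaining before the affected levels fill and the
+    // outstanding reconstructions must have completed.
+    long inserts_remaining{1};
+
+    std::chrono::nanoseconds per_insert_stall() const {
+        if (inserts_remaining <= 0) {
+            return outstanding_work;
+        }
+        return outstanding_work / inserts_remaining;
+    }
+
+    // Called once per insert, after the buffer append, to throttle
+    // the insertion rate.
+    void apply_stall() const {
+        auto d = per_insert_stall();
+        if (d.count() > 0) {
+            std::this_thread::sleep_for(d);
+        }
+    }
+};
+\end{verbatim}
+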
+\section{Evaluation}
+\section{Conclusion}