path: root/chapters/tail-latency.tex
authorDouglas Rumbaugh <dbr4@psu.edu>2025-06-03 12:00:40 -0400
committerDouglas Rumbaugh <dbr4@psu.edu>2025-06-03 12:00:40 -0400
commit432a7fd7da164841a2ad755b839de1e65244944d (patch)
tree2f72c026b5d495302f75dcda33044ea86b43cedb /chapters/tail-latency.tex
parent067bf27c8527352d6c88f7c3e7bb38a0e5b26ab3 (diff)
updates
Diffstat (limited to 'chapters/tail-latency.tex')
-rw-r--r--  chapters/tail-latency.tex  |  161
1 file changed, 152 insertions(+), 9 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index e63f3c9..a88fe0c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -926,31 +926,167 @@ the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
\begin{figure}
\centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\
\caption{Insertion and Shard Count Distributions for ISAM with 200M Records}
\label{fig:tl-stall-200m}
\end{figure}
-First, we will consider the insertion and query performance of our system
-at a variety of stall proportions.
-
+First, we will consider the insertion and query performance of our
+system at a variety of stall proportions. The purpose of this testing
+is to demonstrate that introducing stalls into the insertion process
+reduces insertion tail latency, while matching the general insertion
+and query performance of a strict tiering policy. Recall that, in the
+insertion stall case, no explicit shard capacity limits are enforced
+by the framework. Reconstructions are triggered with each buffer flush
+on all levels exceeding a specified shard count ($s = 4$ in these
+tests), and the buffer flushes immediately when full, with no regard
+to the state of the structure. Thus, throttling the insertion rate is
+the only means the system has to keep its shard count at a manageable
+level. These tests were run on a system with sufficient available
+resources to fully parallelize all reconstructions.
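The framework's actual insert path is considerably more involved, but the fixed rejection-rate mechanism described above can be sketched as follows. This is a minimal illustration only; the function names and retry loop are hypothetical, and in the real system a rejected insert would block the calling thread rather than spin.

```python
import random

def try_insert(buffer, record, delta, rng):
    """Admit `record` with probability `delta`; otherwise signal a stall.

    Returns True if the record was admitted to the mutable buffer,
    False if the caller should stall (back off and retry).
    """
    if rng.random() > delta:
        return False          # stall: insert rejected on this attempt
    buffer.append(record)
    return True

def insert_with_stalls(buffer, record, delta, rng):
    """Retry until admitted; each rejection counts as one stall event."""
    stalls = 0
    while not try_insert(buffer, record, delta, rng):
        stalls += 1           # the real system would sleep/block here
    return stalls
```

With $\delta = 1$ the loop never stalls; lower values of $\delta$ slow the foreground insert rate so that background reconstructions can keep pace with buffer flushes.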
+
+First, Figure~\ref{fig:tl-stall-200m} shows the results of inserting
+the 200 million record SOSD \texttt{OSM} dataset into a dynamized
+ISAM tree, using both our insertion stalling technique and strict
+tiering. We inserted $30\%$ of the records, and then measured the
+individual latency of each insert after that point to produce
+Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard}
+was produced by recording the number of shards in the dynamized
+structure each time the buffer flushed. Note that a stall value of
+$\delta = 1$ indicates no stalling at all, and values less than one
+indicate that each insert is rejected with probability $1 - \delta$.
+Thus, a lower stall value means more stalls are introduced. The
+tiering policy is strict tiering with a scale factor of $s=4$, using
+the concurrency control scheme described in
+Section~\ref{ssec:dyn-concurrency}.
+
+
+Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all of the
+tested insertion rejection probabilities succeed in greatly reducing
+tail latency relative to tiering. Additionally, it shows a small
+amount of available tuning of the worst-case insertion latencies, with
+more aggressive stalling slightly reducing the tail latencies at
+various points in the distribution. This latter effect results from
+the buffer flush latency hiding mechanism, which was retained from
+Chapter~\ref{chap:framework}. The buffer actually has space for two
+versions, and the second version can be filled while the first is
+flushing. This means that, for more aggressive stalling, some of the
+time spent blocking on the buffer flush is redistributed over the
+inserts into the second version of the buffer, rather than resulting
+in a stall.
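The flush latency hiding mechanism can be sketched as a simple double buffer. This is an illustrative model only (class and method names are hypothetical, and the real implementation flushes asynchronously): inserts land in an active version while a full version flushes in the background, and an insert only blocks when both versions are full.

```python
class DoubleBuffer:
    """Two buffer versions: fill one while the other flushes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = []       # version currently accepting inserts
        self.flushing = None   # version being flushed in the background

    def insert(self, record):
        if len(self.active) >= self.capacity:
            if self.flushing is not None:
                return False   # both versions full: insert must block
            # Swap: start flushing the full version, keep inserting.
            self.flushing = self.active
            self.active = []
        self.active.append(record)
        return True

    def flush_complete(self):
        """Called when the background flush finishes."""
        self.flushing = None
```

When inserts are throttled, the active version fills more slowly, so the background flush usually completes before both versions are full; the flush time is thereby hidden behind the stalled inserts.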
+
+Of course, if the query latency is severely affected by the
+use of this mechanism, it may not be worth using. Thus, in
+Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of
+various shard counts within the dynamized structure for each stalling
+amount, as well as for strict tiering. We have elected to examine the
+shard count, rather than the query latencies, for this purpose because
+our intention with this technique is to directly control the number of
+shards, and we wish to show that this is possible. Of course, shard
+count control is necessary for the sake of query latencies, and we
+will consider query latency directly later.
+
+This figure shows that, even with no insertion throttle at all, the
+shard count within the structure remains well behaved and normally
+distributed, albeit with a slightly longer tail and a higher average
+value. Once stalls are introduced, though, it is possible to both
+reduce the tail and shift the peak of the distribution across a range
+of values. In particular, we see that a stall of $0.99$ is sufficient
+to move the peak very close to tiering, and lower stall values further
+shift the peak of the distribution to even lower counts.
\begin{figure}
\centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\
\caption{Insertion and Shard Count Distributions for ISAM with 4B Records}
\label{fig:tl-stall-4b}
\end{figure}
+To validate that these results were not merely an artifact of the
+relatively small size of the data set used, we repeated the exact same
+testing using a set of four billion uniform integers; these results
+are shown in Figure~\ref{fig:tl-stall-4b}. They align with those for
+the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing
+the same improvements in insertion tail latency for all stall amounts,
+and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the
+shard count. If anything, the gap between strict tiering and
+unthrottled insertion is narrower with the larger data set than with
+the smaller one.
+
\begin{figure}
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\
\caption{Insertion and Shard Count Distributions for VPTree }
\label{fig:tl-stall-knn}
\end{figure}
+Finally, we considered our dynamized VPTree in
+Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of
+about one million 300-dimensional vectors. This test exposes some of
+the possible limitations of our fixed rejection rate. The ISAM tree
+tested above is constructable in roughly linear time, being an MDSP
+with $B_M(n, k) \in \Theta(n \log k)$, and so the ratio
+$\frac{B_M(n, k)}{n}$ used to determine the optimal insertion stall
+rate is asymptotically a constant. For the VPTree, however, the
+construction cost is super-linear, with $B(n) \in \Theta(n \log n)$,
+and also generally much larger in absolute terms. We can see in
+Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution
+is very poorly behaved for smaller stall amounts, with the shard count
+following a roughly uniform distribution for a stall rate of $1$. This
+means that the background reconstructions are not capable of keeping
+up with buffer flushing, and so the number of shards grows
+significantly over time. Introducing stalls does shift the
+distribution closer to normal, but a much larger stall rate is
+required to obtain a shard count distribution close to that of strict
+tiering than was the case in the ISAM tree test. It remains possible,
+though, even with our simple fixed-stall-rate implementation.
+Additionally, Figure~\ref{fig:tl-stall-knn-dist} shows that this
+approach substantially reduces tail latency compared to strict
+tiering, with the same latency distribution effects for larger stall
+rates as were seen in the ISAM examples.
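The contrast between the two structures can be made concrete with a small numeric sketch of the per-record construction cost $\frac{B(n)}{n}$. The asymptotic forms are taken from the text above; the constant factors here are illustrative only, not measured values from either implementation.

```python
import math

def per_record_cost(n, build_cost):
    """Amortized construction cost per record: B(n) / n."""
    return build_cost(n) / n

# ISAM tree: B_M(n, k) ~ n log k (k-way merge), so B_M(n, k)/n is
# constant in n for a fixed scale factor k.
isam = lambda n, k=4: n * math.log2(k)

# VPTree: B(n) ~ n log n, so B(n)/n grows without bound as n grows.
vptree = lambda n: n * math.log2(n)

for n in (10**6, 10**8):
    print(f"n={n:>9}: ISAM {per_record_cost(n, isam):6.2f}   "
          f"VPTree {per_record_cost(n, vptree):6.2f}")
```

Because the VPTree's per-record cost keeps growing with the structure's size, a single fixed rejection rate chosen at one size will under-throttle at larger sizes, which is consistent with the drifting shard counts observed in Figure~\ref{fig:tl-stall-knn-shard}.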
+
+Thus, we have shown that introducing even a fixed stall, while
+allowing the internal structure of the dynamization to develop
+naturally, can match the shard count distribution of strict tiering
+while achieving significantly lower insertion tail latencies.
+
+\subsection{Insertion Stall Trade-off Space}
+
+While we have shown that introducing insertion stalls accomplishes the
+goal of reducing tail latencies while matching the shard count of a
+strict tiering reconstruction strategy, we have not yet addressed the
+overall performance of the resulting structure. By throttling inserts,
+we potentially reduce insertion throughput. Further, it is not
+immediately obvious how much query performance suffers as the shard
+count distribution shifts. In this test, we examine the average
+insertion throughput and query latency over a variety of stall rates.
+
+The results of this test for ISAM with the SOSD \texttt{OSM} dataset
+are shown in Figure~\ref{fig:tl-latency-curve}, which plots the
+insertion throughput against the average query latency for our system
+at various stall rates, with tiering configured with an equivalent
+scale factor marked as a red point for reference. This plot shows two
+interesting features of the insertion stall mechanism. First, it is
+possible to introduce stalls that do not significantly affect write
+throughput, but do improve query latency. This is seen in the
+difference between the two points at the far right of the curve, where
+introducing a slight stall improves query performance at virtually no
+cost. This represents the region of the curve where the stalling
+introduces delay that does not exceed the cost of a buffer flush, and
+so the total time the system spends stalling does not change much.
+
+The second, and perhaps more notable, point that this plot shows is
+that the stall rate provides a smooth, tunable trade-off between query
+and insert performance. In fact, this space is far more useful than
+the trade-off space represented by layout policy and scale factor
+selection under the strict reconstruction schemes that we examined in
+Chapter~\ref{chap:design-space}. At the upper end of the
+insertion-optimized region, we see more than double the insertion
+throughput of tiering (with significantly lower tail latencies as
+well) at the cost of a slightly more than $2\times$ increase in query
+latency. Moving down the curve, we are able to roughly match the
+performance of tiering, and even shift to more query-optimized
+configurations.
+
+
\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf}
@@ -958,6 +1094,13 @@ at a variety of stall proportions.
\label{fig:tl-latency-curve}
\end{figure}
+This is a very interesting result. Not only is our approach able
+to match a strict reconstruction policy in terms of average query and
+insertion performance with better tail latencies, but it is even able
+to provide a superior set of design trade-offs to the strict policies,
+at least in environments where sufficient parallel processing and
+memory are available to leverage parallel reconstructions.
+
\section{Conclusion}