author     Douglas B. Rumbaugh <doug@douglasrumbaugh.com>  2025-07-07 11:36:15 -0400
committer  Douglas B. Rumbaugh <doug@douglasrumbaugh.com>  2025-07-07 11:36:15 -0400
commit     05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55 (patch)
tree       be19b76016630bc7c7cdfb482e71b158c93fbd38 /chapters/tail-latency.tex
parent     0dc1a8ea20820168149cedaa14e223d4d31dc4b6 (diff)
update
Diffstat (limited to 'chapters/tail-latency.tex')
 -rw-r--r--  chapters/tail-latency.tex  155
 1 file changed, 82 insertions(+), 73 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index a0db592..dbe867c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -350,10 +350,15 @@ will throttle the insertion rate by adding a stall time, $\delta$, to
each insert. $\delta$ will be determined such that it is sufficiently
large to ensure that any scheduled reconstructions have enough time to
complete before the shard count on any level exceeds $s$. This process
-is summarized in Algorithm~\ref{alg:tl-relaxed-recon}.
+is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. Note that
+this algorithm only works with a single insertion client thread:
+multiple insertion threads would stall concurrently with each other,
+and the overlapping delays would remove some of the necessary stall
+time. If multiple insertion threads are necessary, the inserts can
+first be serialized through a queue to ensure that the appropriate
+amount of stalling occurs.
\begin{algorithm}
-\caption{Relaxed Reconstruction Algorithm with Insertion Stalling}
+\caption{Insertion Algorithm with Stalling}
\label{alg:tl-relaxed-recon}
\KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount}
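
As a concrete illustration of this insert path, the following is a minimal
C++ sketch of the stalled insert under a hypothetical dynamized-structure
interface; the Structure type, its insert() method, and the choice of
delta are illustrative placeholders, not the framework's actual API.

#include <chrono>
#include <thread>

// Minimal sketch of the stalled insert path: every insert from the
// single insertion client thread is delayed by a stall time delta,
// chosen large enough that scheduled reconstructions can complete
// before any level's shard count exceeds the bound s.
template <typename Structure, typename Record>
void stalled_insert(Structure &idx, const Record &r,
                    std::chrono::microseconds delta) {
    // Stall first; with a single inserting thread the delays accumulate
    // as intended. Multiple client threads would sleep concurrently and
    // overlap their delays, so their inserts would need to be funneled
    // through a queue onto one inserting thread (as noted above).
    std::this_thread::sleep_for(delta);

    // Insert into the mutable buffer; a full buffer flushes and any
    // needed maintenance reconstructions are scheduled in the background.
    idx.insert(r);
}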
@@ -1077,18 +1082,21 @@ on demand using atomics. Our current prototype uses a single, fixed value
for the probability, but ultimately it should be dynamically tuned to
approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal}
as closely as possible. It also doesn't require significant modification
-of the existing client interfaces.
-
-We have elected to use Bernoulli sampling for the task of selecting which
-inserts to reject. An alternative approach would have been to apply
-systematic sampling. Bernoulli sampling results in the probability of an
-insert being rejected being independent of its order in the workload,
-but is non-deterministic. This means that there is a small probability
-that many more rejections than are expected may occur over a short
-span of time. Systematic sampling is not vulnerable to this problem,
-but introduces a dependence between the position of an insert in a
-sequence of operations and its probability of being rejected. We decided
-to prioritize independence in our implementation.
+of the existing client interfaces, and can easily support multiple threads
+of insertion without needing an explicit serialization process to ensure
+the appropriate amount of stalling occurs.
+
+We have elected to use Bernoulli sampling for the task of selecting
+which inserts to reject. An alternative approach would have been to
+apply systematic sampling, rejecting inserts according to a fixed
+pattern (e.g., every 1000th insert is rejected). With Bernoulli
+sampling, the probability that an insert is rejected is independent
+of its position in the workload, but the process is non-deterministic.
+This means there is a small probability that many more rejections
+than expected will occur over a short span of time. Systematic
+sampling is not vulnerable to this problem, but it introduces a
+dependence between the position of an insert in a sequence of
+operations and its probability of being rejected. We decided to
+prioritize independence in our implementation.
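
For concreteness, a minimal C++ sketch of this Bernoulli rejection scheme
is shown below, assuming a fixed rejection rate parameter delta and a
hypothetical structure interface (insert_with_rejection and idx.insert
are illustrative names, not the framework's actual API).

#include <chrono>
#include <random>
#include <thread>

// Sketch of Bernoulli insert rejection: each insert is independently
// rejected with probability 1 - delta, so delta = 1 rejects nothing.
// On rejection the inserting thread sleeps for roughly a microsecond
// and reports failure so that the caller can retry.
template <typename Structure, typename Record>
bool insert_with_rejection(Structure &idx, const Record &r, double delta) {
    // thread_local generators keep the draws independent across client
    // threads, with no shared state and no explicit serialization.
    thread_local std::mt19937_64 rng{std::random_device{}()};
    std::bernoulli_distribution reject(1.0 - delta);

    if (reject(rng)) {
        std::this_thread::sleep_for(std::chrono::microseconds(1));
        return false; // rejected: retrying throttles the insert rate
    }
    return idx.insert(r); // accepted: proceed with the normal insert
}

Because each draw is independent of an insert's position in the workload
and of every other thread's draws, this sketch extends to multiple
insertion threads without the queue-based serialization that the
deterministic stall requires.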
\section{Evaluation}
\label{sec:tl-eval}
@@ -1096,7 +1104,7 @@ to prioritize independence in our implementation.
In this section, we perform several experiments to evaluate the ability of
the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
-\subsection{Stall Rate Sweep}
+\subsection{Rejection Rate Sweep}
As a first test, we will evaluate the ability of our insertion stall
@@ -1113,7 +1121,7 @@ block inserts to maintain a shard bound. The buffer is always flushed
immediately, regardless of the number of shards in the structure. Thus,
the rate of insertion is controlled by the cost of flushing the
buffer (we still block when the buffer is full) and the insertion
-stall rate. The structure is maintained fully in the background, with
+rejection rate. The structure is maintained fully in the background, with
maintenance reconstructions being scheduled for all levels exceeding a
specified shard count. Thus, the number of shards within the structure
is controlled indirectly by limiting the insertion rate. We ran these
@@ -1127,10 +1135,10 @@ dataset~\cite{sosd-datasets}, as well as VPTree with the one million,
we inserted $30\%$ of the records to warm up the structure, and then
measured the individual latency of each insert after that. We measured
the count of shards in the structure each time the buffer flushed
-(including during the warmup period). Note that a stall rate of $\delta
-= 1$ indicates no stalling at all, and values less than one indicate
+(including during the warmup period). Note that a rejection rate of
+$\delta = 1$ indicates no stalling at all, and values $\delta < 1$ indicate a
$1 - \delta$ probability of an insert being rejected, after which the
-insert thread sleeps for about a microsecond. A lower stall rate means
+insert thread sleeps for about a microsecond. A lower rejection rate means
more stalls are introduced. The tiering policy is strict tiering with
a scale factor of $s=6$ using the concurrency control scheme described
in Section~\ref{ssec:dyn-concurrency}.
@@ -1145,7 +1153,7 @@ in Section~\ref{ssec:dyn-concurrency}.
We'll begin by considering the ISAM
tree. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all
-stall rates succeed in greatly reducing tail latency relative to
+rejection rates succeed in greatly reducing tail latency relative to
tiering. Additionally, it shows a small amount of available tuning of
the worst-case insertion latencies, with higher stall amounts reducing
the tail latencies slightly at various points in the distribution. This
@@ -1160,16 +1168,16 @@ resulting in a stall.
Of course, if the query latency is severely affected by the
use of this mechanism, it may not be worth using. Thus, in
Figure~\ref{fig:tl-stall-200m-shard} we show the probability density
-of various shard counts within the decomposed structure for each stall
+of various shard counts within the decomposed structure for each rejection
rate, as well as strict tiering. This figure shows that, even for no
insertion stalling, the shard count within the structure remains well
behaved, albeit with a slightly longer tail and a higher average value
compared to tiering. Once stalls are introduced, it is possible to
both reduce the tail, and shift the peak of the distribution through
-a variety of points. In particular, we see that a stall of $.99$ is
-sufficient to move the peak to very close to tiering, and lower stall
+a variety of points. In particular, we see that a rejection rate of $.99$ is
+sufficient to move the peak very close to that of tiering, and lower rejection
rates are able to further shift the peak of the distribution to even
-lower counts. This result implies that this stall mechanism may be able
+lower counts. This result implies that this rejection mechanism may be able
to produce a trade-off space for insertion and query performance, which
is a question we will examine in Section~\ref{ssec:tl-design-space}.
@@ -1186,7 +1194,7 @@ small size of the data set used, we repeated the exact same
testing using a set of four billion uniform integers, shown in
Figure~\ref{fig:tl-stall-4b}. These results are aligned with the
smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the
-same improvements in insertion tail latency for all stall amounts, and
+same improvements in insertion tail latency for all rejection rates, and
Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
count.
@@ -1199,7 +1207,7 @@ count.
Finally, we considered our dynamized VPTree in
Figure~\ref{fig:tl-stall-knn}. This test shows some of the possible
-limitations of our fixed stall rate mechanism. The ISAM tree tested
+limitations of our fixed rejection rate mechanism. The ISAM tree tested
above is constructable in roughly linear time, being an MDSP with
$B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards, and
thus roughly constant.\footnote{
@@ -1209,35 +1217,35 @@ thus roughly constant.\footnote{
about, however, it is clear that $k$ is typically quite close to
$s$ in practice for ISAM tree.
} Thus, the ratio $\frac{B_M(n,
-k)}{n}$ used to determine the optimal insertion stall rate is
+k)}{n}$ used to determine the optimal insertion rejection rate is
asymptotically a constant. For VPTree, however, the construction
cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also
generally much larger in absolute time requirements. We can see in
Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution
-is very poorly behaved for smaller stall amounts, with the shard count
-following a roughly uniform distribution for a stall rate of $1$. This
+is very poorly behaved for smaller rejection rates, with the shard count
+following a roughly uniform distribution for a rejection rate of $1$. This
means that the background reconstructions are not capable of keeping up
with buffer flushing, and so the number of shards grows significantly
over time. Introducing stalls does shift the distribution closer to
-normal, but it requires a much larger stall rate in order to obtain
+normal, but it requires a much lower rejection rate in order to obtain
a shard count distribution that is close to the strict tiering than
was the case with the ISAM tree test. It is still possible, though,
-even with our simple fixed-stall rate implementation. Additionally,
+even with our simple fixed-rejection-rate implementation. Additionally,
this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce
the tail latency substantially compared to strict tiering, with the same
-latency distribution effects for larger stall rates as was seen in the
+latency distribution effects for lower rejection rates as was seen in the
ISAM examples.
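
To spell this cost argument out, using the bounds quoted above (a
back-of-the-envelope sketch, ignoring constant factors): for the ISAM tree,
\[ \frac{B_M(n,k)}{n} \in \Theta\left(\frac{n \log k}{n}\right) = \Theta(\log k) \approx \Theta(\log s), \]
which is independent of $n$, so a single fixed rejection rate can keep the
per-insert reconstruction debt bounded at any data size. For the VPTree,
\[ \frac{B(n)}{n} \in \Theta\left(\frac{n \log n}{n}\right) = \Theta(\log n), \]
which grows with $n$, so any fixed rejection rate will eventually be
outpaced as the structure grows, consistent with the behavior observed here.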
These tests show that, for ISAM tree at least, introducing a constant
-stall rate while allowing the decomposition to develop naturally
+rejection rate while allowing the decomposition to develop naturally
with background reconstructions only is able to match the shard count
distribution of tiering, which strictly enforces the shard count bound
using blocking, while achieving significantly better insertion tail
latencies. VPTree is able to achieve the same results too, albeit
-requiring significantly higher stall rates to match the shard bound.
+requiring a significantly lower rejection rate to match the shard bound.
-\subsection{Insertion Stall Trade-off Space}
+\subsection{Insertion Rejection Trade-off Space}
\label{ssec:tl-design-space}
\begin{figure}
@@ -1256,15 +1264,15 @@ or query latency. By throttling insertion, we potentially reduce the
throughput. Further, it isn't clear in practice how much query latency
suffers as the shard count distribution changes. In this experiment, we
address these concerns by directly measuring the insertion throughput
-and query latency over a variety of stall rates and compare the results
+and query latency over a variety of rejection rates and compare the results
to strict tiering.
The results of this test for ISAM with the SOSD \texttt{OSM} dataset are
shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the insertion
throughput plotted against the average query latency for our system at
-various stall rates, and with tiering configured with an equivalent scale
+various rejection rates, and with tiering configured with an equivalent scale
factor marked as a red point for reference. The most interesting point
-demonstrated by this plot is that introducing the stall rate provides a
+demonstrated by this plot is that introducing the rejection rate provides a
beautiful design trade-off between query and insert performance. In fact,
this space is far more useful than the trade-off space represented by
layout policy and scale factor selection using strict reconstruction
@@ -1304,7 +1312,7 @@ to provide a superior set of design trade-offs than the strict policies,
at least in environments where sufficient parallel processing and memory
are available to leverage parallel reconstructions.
-\subsection{Legacy Design Space}
+\subsection{Design Space}
Our new system retains the concept of buffer size and scale factor from
the previous version, although these have very different performance
@@ -1318,7 +1326,7 @@ using the SOSD \texttt{OSM} dataset and point lookup queries.
\centering
\subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}}
\subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\
-\caption{Legacy Design Space Examination}
+\caption{Design Space Examination}
\label{fig:tl-design-space}
\end{figure}
@@ -1347,7 +1355,7 @@ Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased,
the worst-case insertion time also increases, although the effect is
relatively small.
-\subsection{Thread Scaling}
+\subsection{Internal Thread Scaling}
\begin{figure}
\centering
@@ -1360,21 +1368,21 @@ relatively small.
\end{figure}
In the previous tests, we ran our system configured with 32 available
-threads, which was more than enough to run all reconstructions and
-queries fully in parallel. However, it's important to determine how well
-the system works in more resource constrained environments. The system
-shares internal threads between reconstructions and queries, and
-flushing occurs on a dedicated thread separate from these. During the
-benchmark, one client thread issued queries continuously and another
-issued inserts. The index accumulated a total of five levels, so
-the maximum amount of parallelism available during the testing was 4
+internal threads, which was more than enough to run all reconstructions
+and queries fully in parallel. However, it's important to determine
+how well the system works in more resource-constrained environments.
+The system shares internal threads between reconstructions and queries,
+and flushing occurs on a dedicated thread separate from these. During
+the benchmark, one client thread issued queries continuously and
+another issued inserts. The index accumulated a total of five levels,
+so the maximum amount of parallelism available during the testing was 4
parallel reconstructions, along with the dedicated flushing thread and
any concurrent queries. In these tests, we used the SOSD \texttt{OSM}
dataset (200M records) and point-lookup queries without early abort
against a dynamized ISAM tree.
We considered the insertion throughput vs. query latency trade-off for
-various stall amounts with several internal thread counts. We inserted
+various rejection rates with several internal thread counts. We inserted
30\% of the dataset first, and then measured the insertion throughput over
the insertion of the rest of the data on a client thread, while another
client thread continuously issued queries against the structure. The
@@ -1386,33 +1394,34 @@ throughput is limited only by the stall amount, and by buffer flushing.
As flushing occurs on a dedicated thread, it is unaffected by changes
in the internal thread configuration of the system.
-In terms of query performance, there are two general effects that can be
-observed. The first effect is that the previously noted reduction in
-query performance as the insertion throughput is increased is observed
-in all cases, irrespective of thread count. However, interestingly,
-the thread count itself has little effect on the curve outside of the
-case of only having a single thread. This can also be seen in
-Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of
-the same data revealing the best measured insertion throughput associated
-with a given query latency bound. In both cases, two or more threads are
-capable of significantly higher insertion throughput at a given query
-latency. But, at very low insertion throughputs, this effect vanishes
-and all thread counts are roughly equivalent in performance.
+In terms of query performance, there are two general effects that can
+be observed. The first effect is that the previously noted reduction in
+query performance as the insertion throughput is increased is observed in
+all cases, irrespective of internal thread count. However, interestingly,
+the internal thread count itself has little effect on the curve outside
+of the case of only having a single internal thread. This can also be
+seen in Figure~\ref{fig:tl-query-scaling}, which shows an alternative
+view of the same data revealing the best measured insertion throughput
+associated with a given query latency bound. In both cases, two or more
+internal threads are capable of significantly higher insertion throughput
+at a given query latency. But, at very low insertion throughputs, this
+effect vanishes and all internal thread counts are roughly equivalent
+in performance.
A large part of the reason for this significant deviation in behavior
-between one thread and multiple is likely that queries and reconstructions
-share the same pool of background threads. Our testing involved issuing
-queries continuously on a single thread, while performing inserts, and so
-two threads background threads ensures that a reconstruction and query can
-be run in parallel, whereas a single thread will force queries to wait
-behind long running reconstructions. Once this bottleneck is overcome,
-a reduction in the amount of parallel reconstruction seems to have only a
-minor influence on overall performance. This is because, although in the
-worst case the system requires $\log_s n$ threads to fully parallelize
+between one internal thread and multiple is likely that queries and
+reconstructions share the same pool of internal threads. Our testing
+involved issuing queries continuously from a single client thread
+while performing inserts, and so two internal threads ensure that
+a reconstruction and a query can be run in parallel, whereas a
+single internal thread will force queries to wait behind long-running
+reconstructions. Once this bottleneck is overcome, a reduction in the
+amount of parallel reconstruction seems to have only a minor influence
+on overall performance. This is because, although in the worst case
+the system requires $\log_s n$ internal threads to fully parallelize
reconstructions, this worst case is fairly rare. The vast majority of
reconstructions only require a fraction of this total parallel capacity.
-
\section{Conclusion}
In this section, we addressed the last of the three major problems of
@@ -1425,7 +1434,7 @@ within our framework, including a significantly improved architecture
for scheduling and executing parallel and background reconstructions,
and a system for rate limiting by rejecting inserts via Bernoulli sampling.
-We evaluated this system for fixed stall rates, and found significant
+We evaluated this system for fixed rejection rates, and found significant
improvements in tail latencies, approaching the practical lower bound we
established using the equal block method, without requiring significant
degradation of query performance. In fact, we found that this rate