| author | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
|---|---|---|
| committer | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
| commit | 05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55 | |
| tree | be19b76016630bc7c7cdfb482e71b158c93fbd38 /chapters/tail-latency.tex | |
| parent | 0dc1a8ea20820168149cedaa14e223d4d31dc4b6 | |
| download | dissertation-05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55.tar.gz | |
update
Diffstat (limited to 'chapters/tail-latency.tex')
| -rw-r--r-- | chapters/tail-latency.tex | 155 |
1 file changed, 82 insertions(+), 73 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index a0db592..dbe867c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -350,10 +350,15 @@ will throttle the insertion rate by adding a stall time, $\delta$,
 to each insert. $\delta$ will be determined such that it is sufficiently
 large to ensure that any scheduled reconstructions have enough time to
 complete before the shard count on any level exceeds $s$. This process
-is summarized in Algorithm~\ref{alg:tl-relaxed-recon}.
+is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. Note that
+this algorithm will only work with a single insertion client thread,
+as multiple insertion threads will stall concurrently with each other,
+removing some of the necessary delay. If multiple insertion threads are
+necessary, the inserts can be serialized first in a queue to ensure that
+the appropriate amount of stalling occurs.
 
 \begin{algorithm}
-\caption{Relaxed Reconstruction Algorithm with Insertion Stalling}
+\caption{Insertion Algorithm with Stalling}
 \label{alg:tl-relaxed-recon}
 \KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B},
 \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount}
@@ -1077,18 +1082,21 @@ on demand using atomics. Our current prototype uses a single, fixed
 value for the probability, but ultimately it should be dynamically
 tuned to approximate the $\delta$ value from
 Theorem~\ref{theo:worst-case-optimal} as closely as possible. It also
 doesn't require significant modification
-of the existing client interfaces.
-
-We have elected to use Bernoulli sampling for the task of selecting which
-inserts to reject. An alternative approach would have been to apply
-systematic sampling. Bernoulli sampling results in the probability of an
-insert being rejected being independent of its order in the workload,
-but is non-deterministic. This means that there is a small probability
-that many more rejections than are expected may occur over a short
-span of time. Systematic sampling is not vulnerable to this problem,
-but introduces a dependence between the position of an insert in a
-sequence of operations and its probability of being rejected. We decided
-to prioritize independence in our implementation.
+of the existing client interfaces, and can easily support multiple threads
+of insertion without needing an explicit serialization process to ensure
+the appropriate amount of stalling occurs.
+
+We have elected to use Bernoulli sampling for the task of selecting
+which inserts to reject. An alternative approach would have been to
+apply systematic sampling, and reject inserts based on a pattern (e.g.,
+every 1000th insert gets rejected). Bernoulli sampling results in the
+probability of an insert being rejected being independent of its order
+in the workload, but is non-deterministic. This means that there is a
+small probability that many more rejections than are expected may occur
+over a short span of time. Systematic sampling is not vulnerable to this
+problem, but introduces a dependence between the position of an insert
+in a sequence of operations and its probability of being rejected. We
+decided to prioritize independence in our implementation.
 
 \section{Evaluation}
 \label{sec:tl-eval}
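As a concrete illustration of the mechanism added in the two hunks above, here is a minimal C++ sketch of a Bernoulli rejection gate. It is not the dissertation's code: the names (`RejectionGate`, `admit`, `insert_with_stall`, `Index::insert`) are our assumptions, and only the probability-$(1-\delta)$ rejection, the roughly one-microsecond sleep, and the retry behavior are taken from the chapter text.

```cpp
// Illustrative sketch only, not the dissertation's implementation.
// From the text: an insert is rejected with probability (1 - delta),
// and a rejected insert's thread sleeps ~1us; we assume it then retries.
#include <chrono>
#include <random>
#include <thread>

class RejectionGate {
public:
    // delta == 1.0 disables stalling entirely; smaller values reject
    // more inserts and therefore introduce more stall time.
    explicit RejectionGate(double delta) : admit_prob_(delta) {}

    // Returns true when the insert may proceed. Each thread keeps its
    // own generator, so multiple insertion threads need no explicit
    // serialization queue (unlike the fixed-delay algorithm above).
    bool admit() {
        thread_local std::mt19937 rng{std::random_device{}()};
        return std::bernoulli_distribution{admit_prob_}(rng);
    }

private:
    double admit_prob_;
};

// Client-side wrapper: retry until admitted, sleeping ~1us per
// rejection. Index::insert() stands in for the dynamized structure's
// actual insert routine.
template <typename Index, typename Record>
void insert_with_stall(Index &idx, const Record &r, RejectionGate &gate) {
    while (!gate.admit()) {
        std::this_thread::sleep_for(std::chrono::microseconds(1));
    }
    idx.insert(r);
}
```

Because each attempt is an independent Bernoulli trial, the rejection probability is independent of an insert's position in the workload, which is exactly the property the text prioritizes over systematic sampling.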
@@ -1096,7 +1104,7 @@ to prioritize independence in our implementation.
 In this section, we perform several experiments to evaluate the ability
 of the system proposed in Section~\ref{sec:tl-impl} to control tail
 latencies.
 
-\subsection{Stall Rate Sweep}
+\subsection{Rejection Rate Sweep}
 
 As a first test, we will evaluate the ability of our insertion stall
@@ -1113,7 +1121,7 @@ block inserts to maintain a shard bound. The buffer is always
 flushed immediately, regardless of the number of shards in the structure.
 Thus, the rate of insertion is controlled by the cost of flushing
 the buffer (we still block when the buffer is full) and the insertion
-stall rate. The structure is maintained fully in the background, with
+rejection rate. The structure is maintained fully in the background, with
 maintenance reconstructions being scheduled for all levels exceeding a
 specified shard count. Thus, the number of shards within the structure
 is controlled indirectly by limiting the insertion rate. We ran these
@@ -1127,10 +1135,10 @@ dataset~\cite{sosd-datasets}, as well as VPTree with the one million,
 we inserted $30\%$ of the records to warm up the structure, and then
 measured the individual latency of each insert after that. We measured
 the count of shards in the structure each time the buffer flushed
-(including during the warmup period). Note that a stall rate of $\delta
-= 1$ indicates no stalling at all, and values less than one indicate
+(including during the warmup period). Note that a rejection rate of
+$1$ indicates no stalling at all, and values less than one indicate
 $1 - \delta$ probability of an insert being rejected, after which the
-insert thread sleeps for about a microsecond. A lower stall rate means
+insert thread sleeps for about a microsecond. A lower rejection rate means
 more stalls are introduced. The tiering policy is strict tiering with
 a scale factor of $s=6$ using the concurrency control scheme described
 in Section~\ref{ssec:dyn-concurrency}.
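One consequence of the semantics in the hunk above is worth making explicit (our derivation, not stated in the chapter): if each rejected insert sleeps for $t_s \approx 1\,\mu\text{s}$ and then retries, the number of attempts per successful insert is geometrically distributed, giving an expected added stall of

```latex
% Expected stall per insert under Bernoulli rejection, assuming each
% rejected insert retries after a sleep of t_s (as described above).
\[
\mathbb{E}[\text{attempts}] = \frac{1}{\delta}
\qquad\Longrightarrow\qquad
\mathbb{E}[\text{stall per insert}]
  = \left(\frac{1}{\delta} - 1\right) t_s
  = \frac{1-\delta}{\delta}\, t_s
\]
```

which makes the direction of the knob explicit: the further the rejection rate drops below one, the more stall each insert accumulates on average.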
@@ -1145,7 +1153,7 @@ in Section~\ref{ssec:dyn-concurrency}.
 
 We'll begin by considering the ISAM
 tree. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all
-stall rates succeed in greatly reducing tail latency relative to
+rejection rates succeed in greatly reducing tail latency relative to
 tiering. Additionally, it shows a small amount of available tuning of
 the worst-case insertion latencies, with higher stall amounts reducing
 the tail latencies slightly at various points in the distribution. This
@@ -1160,16 +1168,16 @@ resulting in a stall.
 Of course, if the query latency is severely
 affected by the use of this mechanism, it may not be worth using. Thus,
 in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density
-of various shard counts within the decomposed structure for each stall
+of various shard counts within the decomposed structure for each rejection
 rate, as well as strict tiering. This figure shows that, even for no
 insertion stalling, the shard count within the structure remains
 well behaved, albeit with a slightly longer tail and a higher average
 value compared to tiering. Once stalls are introduced, it is possible
 to both reduce the tail, and shift the peak of the distribution through
-a variety of points. In particular, we see that a stall of $.99$ is
-sufficient to move the peak to very close to tiering, and lower stall
+a variety of points. In particular, we see that a rejection rate of $.99$ is
+sufficient to move the peak to very close to tiering, and lower rejection
 rates are able to further shift the peak of the distribution to even
-lower counts. This result implies that this stall mechanism may be able
+lower counts. This result implies that this rejection mechanism may be able
 to produce a trade-off space for insertion and query performance, which
 is a question we will examine in Section~\ref{ssec:tl-design-space}.
@@ -1186,7 +1194,7 @@ small size of the data set used, we repeated the exact same testing
 using a set of four billion uniform integers, shown in
 Figure~\ref{fig:tl-stall-4b}. These results are aligned with the
 smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the
-same improvements in insertion tail latency for all stall amounts, and
+same improvements in insertion tail latency for all rejection rates, and
 Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
 count.
@@ -1199,7 +1207,7 @@ count.
 
 Finally, we considered our dynamized VPTree in
 Figure~\ref{fig:tl-stall-knn}. This test shows some of the possible
-limitations of our fixed stall rate mechanism. The ISAM tree tested
+limitations of our fixed rejection rate mechanism. The ISAM tree tested
 above is constructable in roughly linear time, being an MDSP with
 $B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards, and
 thus roughly constant.\footnote{
@@ -1209,35 +1217,35 @@ thus roughly constant.\footnote{
 about, however, it is clear that $k$ is typically quite close to $s$
 in practice for ISAM tree.
 } Thus, the ratio $\frac{B_M(n,
-k)}{n}$ used to determine the optimal insertion stall rate is
+k)}{n}$ used to determine the optimal insertion rejection rate is
 asymptotically a constant. For VPTree, however, the construction cost
 is super-linear, with $B(n) \in \Theta(n \log n)$, and also generally
 much larger in absolute time requirements. We can see in
 Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution
-is very poorly behaved for smaller stall amounts, with the shard count
-following a roughly uniform distribution for a stall rate of $1$. This
+is very poorly behaved for smaller rejection rates, with the shard count
+following a roughly uniform distribution for a rejection rate of $1$. This
 means that the background reconstructions are not capable of keeping up
 with buffer flushing, and so the number of shards grows significantly
 over time. Introducing stalls does shift the distribution closer to
-normal, but it requires a much larger stall rate in order to obtain
+normal, but it requires a much larger rejection rate in order to obtain
 a shard count distribution that is close to the strict tiering than
 was the case with the ISAM tree test. It is still possible, though,
-even with our simple fixed-stall rate implementation. Additionally,
+even with our simple fixed-rejection rate implementation. Additionally,
 this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce
 the tail latency substantially compared to strict tiering, with the same
-latency distribution effects for larger stall rates as was seen in the
+latency distribution effects for larger rejection rates as was seen in the
 ISAM examples.
 
 These tests show that, for ISAM tree at least, introducing a constant
-stall rate while allowing the decomposition to develop naturally
+rejection rate while allowing the decomposition to develop naturally
 with background reconstructions only is able to match the shard count
 distribution of tiering, which strictly enforces the shard count bound
 using blocking, while achieving significantly better insertion tail
 latencies. VPTree is able to achieve the same results too, albeit
-requiring significantly higher stall rates to match the shard bound.
+requiring significantly higher rejection rates to match the shard bound.
 
-\subsection{Insertion Stall Trade-off Space}
+\subsection{Insertion Rejection Trade-off Space}
 \label{ssec:tl-design-space}
 
 \begin{figure}
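The asymptotic contrast drawn in the hunk above can be summarized in one line. This is a sketch using the stated costs and the footnote's observation that $k$ stays close to the constant scale factor $s$ for the ISAM tree:

```latex
% Why a fixed rejection rate suits ISAM tree but not VPTree: the
% per-record construction cost is constant for one and growing for
% the other (sketch based on the costs quoted above).
\[
\text{ISAM tree:}\quad
  \frac{B_M(n, k)}{n} \in \Theta(\log k) \approx \Theta(\log s) = O(1)
\qquad
\text{VPTree:}\quad
  \frac{B(n)}{n} \in \Theta(\log n)
\]
```

Since the VPTree's per-record construction cost grows with $n$, a rejection rate that suffices early in a run under-stalls as the structure grows, consistent with the much more aggressive rejection it requires above to match the shard bound.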
@@ -1256,15 +1264,15 @@ or query latency. By throttling insertion, we potentially reduce the
 throughput. Further, it isn't clear in practice how much query latency
 suffers as the shard count distribution changes. In this experiment, we
 address these concerns by directly measuring the insertion throughput
-and query latency over a variety of stall rates and compare the results
+and query latency over a variety of rejection rates and compare the results
 to strict tiering.
 
 The results of this test for ISAM with the SOSD \texttt{OSM} dataset are
 shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the
 insertion throughput plotted against the average query latency for our system at
-various stall rates, and with tiering configured with an equivalent scale
+various rejection rates, and with tiering configured with an equivalent scale
 factor marked as a red point for reference. The most interesting point
-demonstrated by this plot is that introducing the stall rate provides a
+demonstrated by this plot is that introducing the rejection rate provides a
 beautiful design trade-off between query and insert performance. In fact,
 this space is far more useful than the trade-off space represented by
 layout policy and scale factor selection using strict reconstruction
@@ -1304,7 +1312,7 @@ to provide a superior set of design trade-offs than the strict policies,
 at least in environments where sufficient parallel processing and memory
 are available to leverage parallel reconstructions.
 
-\subsection{Legacy Design Space}
+\subsection{Design Space}
 
 Our new system retains the concept of buffer size and scale factor from
 the previous version, although these have very different performance
@@ -1318,7 +1326,7 @@ using the SOSD \texttt{OSM} dataset and point lookup queries.
 \centering
 \subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}}
 \subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\
-\caption{Legacy Design Space Examination}
+\caption{Design Space Examination}
 \label{fig:tl-design-space}
 \end{figure}
@@ -1347,7 +1355,7 @@ Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased,
 the worst-case insertion time also increases, although the effect is
 relatively small.
 
-\subsection{Thread Scaling}
+\subsection{Internal Thread Scaling}
 
 \begin{figure}
 \centering
@@ -1360,21 +1368,21 @@ relatively small.
 \end{figure}
 
 In the previous tests, we ran our system configured with 32 available
-threads, which was more than enough to run all reconstructions and
-queries fully in parallel. However, it's important to determine how well
-the system works in more resource constrained environments. The system
-shares internal threads between reconstructions and queries, and
-flushing occurs on a dedicated thread separate from these. During the
-benchmark, one client thread issued queries continuously and another
-issued inserts. The index accumulated a total of five levels, so
-the maximum amount of parallelism available during the testing was 4
+internal threads, which was more than enough to run all reconstructions
+and queries fully in parallel. However, it's important to determine
+how well the system works in more resource-constrained environments.
+The system shares internal threads between reconstructions and queries,
+and flushing occurs on a dedicated thread separate from these. During
+the benchmark, one client thread issued queries continuously and
+another issued inserts. The index accumulated a total of five levels,
+so the maximum amount of parallelism available during the testing was 4
 parallel reconstructions, along with the dedicated flushing thread and
 any concurrent queries. In these tests, we used the SOSD \texttt{OSM}
 dataset (200M records) and point-lookup queries without early abort
 against a dynamized ISAM tree.
 
 We considered the insertion throughput vs. query latency trade-off for
-various stall amounts with several internal thread counts. We inserted
+various rejection rates with several internal thread counts. We inserted
 30\% of the dataset first, and then measured the insertion throughput
 over the insertion of the rest of the data on a client thread, while
 another client thread continuously issued queries against the structure. The
@@ -1386,33 +1394,34 @@ throughput is limited only by the stall amount, and by buffer flushing.
 As flushing occurs on a dedicated thread, it is unaffected by changes
 in the internal thread configuration of the system.
 
-In terms of query performance, there are two general effects that can be
-observed. The first effect is that the previously noted reduction in
-query performance as the insertion throughput is increased is observed
-in all cases, irrespective of thread count. However, interestingly,
-the thread count itself has little effect on the curve outside of the
-case of only having a single thread. This can also be seen in
-Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of
-the same data revealing the best measured insertion throughput associated
-with a given query latency bound. In both cases, two or more threads are
-capable of significantly higher insertion throughput at a given query
-latency. But, at very low insertion throughputs, this effect vanishes
-and all thread counts are roughly equivalent in performance.
+In terms of query performance, there are two general effects that can
+be observed. The first effect is that the previously noted reduction in
+query performance as the insertion throughput is increased is observed in
+all cases, irrespective of internal thread count. However, interestingly,
+the internal thread count itself has little effect on the curve outside
+of the case of only having a single internal thread. This can also be
+seen in Figure~\ref{fig:tl-query-scaling}, which shows an alternative
+view of the same data revealing the best measured insertion throughput
+associated with a given query latency bound. In both cases, two or more
+internal threads are capable of significantly higher insertion throughput
+at a given query latency. But, at very low insertion throughputs, this
+effect vanishes and all internal thread counts are roughly equivalent
+in performance.
 
 A large part of the reason for this significant deviation in behavior
-between one thread and multiple is likely that queries and reconstructions
-share the same pool of background threads. Our testing involved issuing
-queries continuously on a single thread, while performing inserts, and so
-two threads background threads ensures that a reconstruction and query can
-be run in parallel, whereas a single thread will force queries to wait
-behind long running reconstructions. Once this bottleneck is overcome,
-a reduction in the amount of parallel reconstruction seems to have only a
-minor influence on overall performance. This is because, although in the
-worst case the system requires $\log_s n$ threads to fully parallelize
+between one internal thread and multiple is likely that queries and
+reconstructions share the same pool of internal threads. Our testing
+involved issuing queries continuously on a single internal thread,
+while performing inserts, and so two internal threads ensure
+that a reconstruction and query can be run in parallel, whereas a
+single internal thread will force queries to wait behind long-running
+reconstructions. Once this bottleneck is overcome, a reduction in the
+amount of parallel reconstruction seems to have only a minor influence
+on overall performance. This is because, although in the worst case
+the system requires $\log_s n$ internal threads to fully parallelize
 reconstructions, this worst case is fairly rare. The vast majority of
 reconstructions only require a fraction of this total parallel capacity.
 
-
 \section{Conclusion}
 
 In this section, we addressed the final of the three major problems of
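To unpack the $\log_s n$ figure in the hunk above (our reading, not spelled out in the text): the benchmark's five levels admitting at most four parallel reconstructions suggests at most one in-flight reconstruction per level below the first, so with $\ell$ levels,

```latex
% Assumed model: at most one in-flight reconstruction per level,
% matching "five levels -> 4 parallel reconstructions" above.
\[
\ell \in \Theta(\log_s n)
\qquad\Longrightarrow\qquad
\text{max concurrent reconstructions} = \ell - 1 \in \Theta(\log_s n)
\]
```

which is why the worst case needs a logarithmic number of internal threads, but typical operation needs far fewer.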
@@ -1425,7 +1434,7 @@ within our framework, including a significantly improved architecture
 for scheduling and executing parallel and background reconstructions,
 and a system for rate limiting by rejecting inserts via Bernoulli
 sampling.
 
-We evaluated this system for fixed stall rates, and found significant
+We evaluated this system for fixed rejection rates, and found significant
 improvements in tail latencies, approaching the practical lower bound we
 established using the equal block method, without requiring significant
 degradation of query performance. In fact, we found that this rate