| author | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
|---|---|---|
| committer | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
| commit | 05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55 | |
| tree | be19b76016630bc7c7cdfb482e71b158c93fbd38 /chapters/tail-latency.tex | |
| parent | 0dc1a8ea20820168149cedaa14e223d4d31dc4b6 | |
| download | dissertation-05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55.tar.gz | |
update
Diffstat (limited to 'chapters/tail-latency.tex')
| -rw-r--r-- | chapters/tail-latency.tex | 155 |
1 file changed, 82 insertions(+), 73 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index a0db592..dbe867c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -350,10 +350,15 @@ will throttle the insertion rate by adding a stall time, $\delta$,
 to each insert. $\delta$ will be determined such that it is sufficiently
 large to ensure that any scheduled reconstructions have enough time to
 complete before the shard count on any level exceeds $s$. This process
-is summarized in Algorithm~\ref{alg:tl-relaxed-recon}.
+is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. Note that
+this algorithm will only work with a single insertion client thread,
+as multiple insertion threads will stall concurrently with each other,
+removing some of the necessary delay. If multiple insertion threads are
+necessary, the inserts can be serialized first in a queue to ensure that
+the appropriate amount of stalling occurs.
 
 \begin{algorithm}
-\caption{Relaxed Reconstruction Algorithm with Insertion Stalling}
+\caption{Insertion Algorithm with Stalling}
 \label{alg:tl-relaxed-recon}
 \KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B},
 \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount}
@@ -1077,18 +1082,21 @@ on demand using atomics. Our current prototype uses a single, fixed
 value for the probability, but ultimately it should be dynamically
 tuned to approximate the $\delta$ value from
 Theorem~\ref{theo:worst-case-optimal} as closely as possible. It also
 doesn't require significant modification
-of the existing client interfaces.
-
-We have elected to use Bernoulli sampling for the task of selecting which
-inserts to reject. An alternative approach would have been to apply
-systematic sampling. Bernoulli sampling results in the probability of an
-insert being rejected being independent of its order in the workload,
-but is non-deterministic. This means that there is a small probability
-that many more rejections than are expected may occur over a short
-span of time. Systematic sampling is not vulnerable to this problem,
-but introduces a dependence between the position of an insert in a
-sequence of operations and its probability of being rejected. We decided
-to prioritize independence in our implementation.
+of the existing client interfaces, and can easily support multiple threads
+of insertion without needing an explicit serialization process to ensure
+the appropriate amount of stalling occurs.
+
+We have elected to use Bernoulli sampling for the task of selecting
+which inserts to reject. An alternative approach would have been to
+apply systematic sampling, and reject inserts based on a pattern (e.g.,
+every 1000th insert gets rejected). Bernoulli sampling results in the
+probability of an insert being rejected being independent of its order
+in the workload, but is non-deterministic. This means that there is a
+small probability that many more rejections than are expected may occur
+over a short span of time. Systematic sampling is not vulnerable to this
+problem, but introduces a dependence between the position of an insert
+in a sequence of operations and its probability of being rejected. We
+decided to prioritize independence in our implementation.
 
 \section{Evaluation}
 \label{sec:tl-eval}
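As a concrete illustration of the mechanism added in the two hunks above, here is a minimal C++ sketch of a Bernoulli rejection gate. It is not the dissertation's code: the names (`RejectionGate`, `admit`, `insert_with_stall`, `Index::insert`) are our assumptions, and only the probability-$(1-\delta)$ rejection, the roughly one-microsecond sleep, and the retry behavior are taken from the chapter text.

```cpp
// Illustrative sketch only, not the dissertation's implementation.
// From the text: an insert is rejected with probability (1 - delta),
// and a rejected insert's thread sleeps ~1us; we assume it then retries.
#include <chrono>
#include <random>
#include <thread>

class RejectionGate {
public:
    // delta == 1.0 disables stalling entirely; smaller values reject
    // more inserts and therefore introduce more stall time.
    explicit RejectionGate(double delta) : admit_prob_(delta) {}

    // Returns true when the insert may proceed. Each thread keeps its
    // own generator, so multiple insertion threads need no explicit
    // serialization queue (unlike the fixed-delay algorithm above).
    bool admit() {
        thread_local std::mt19937 rng{std::random_device{}()};
        return std::bernoulli_distribution{admit_prob_}(rng);
    }

private:
    double admit_prob_;
};

// Client-side wrapper: retry until admitted, sleeping ~1us per
// rejection. Index::insert() stands in for the dynamized structure's
// actual insert routine.
template <typename Index, typename Record>
void insert_with_stall(Index &idx, const Record &r, RejectionGate &gate) {
    while (!gate.admit()) {
        std::this_thread::sleep_for(std::chrono::microseconds(1));
    }
    idx.insert(r);
}
```

Because each attempt is an independent Bernoulli trial, the rejection probability is independent of an insert's position in the workload, which is exactly the property the text prioritizes over systematic sampling.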
@@ -1096,7 +1104,7 @@ to prioritize independence in our implementation.
 In this section, we perform several experiments to evaluate the ability
 of the system proposed in Section~\ref{sec:tl-impl} to control tail
 latencies.
 
-\subsection{Stall Rate Sweep}
+\subsection{Rejection Rate Sweep}
 
 As a first test, we will evaluate the ability of our insertion stall
@@ -1113,7 +1121,7 @@ block inserts to maintain a shard bound. The buffer is always
 flushed immediately, regardless of the number of shards in the structure.
 Thus, the rate of insertion is controlled by the cost of flushing
 the buffer (we still block when the buffer is full) and the insertion
-stall rate. The structure is maintained fully in the background, with
+rejection rate. The structure is maintained fully in the background, with
 maintenance reconstructions being scheduled for all levels exceeding a
 specified shard count. Thus, the number of shards within the structure
 is controlled indirectly by limiting the insertion rate. We ran these
@@ -1127,10 +1135,10 @@ dataset~\cite{sosd-datasets}, as well as VPTree with the one million,
 we inserted $30\%$ of the records to warm up the structure, and then
 measured the individual latency of each insert after that. We measured
 the count of shards in the structure each time the buffer flushed
-(including during the warmup period). Note that a stall rate of $\delta
-= 1$ indicates no stalling at all, and values less than one indicate
+(including during the warmup period). Note that a rejection rate of
+$1$ indicates no stalling at all, and values less than one indicate
 $1 - \delta$ probability of an insert being rejected, after which the
-insert thread sleeps for about a microsecond. A lower stall rate means
+insert thread sleeps for about a microsecond. A lower rejection rate means
 more stalls are introduced. The tiering policy is strict tiering with
 a scale factor of $s=6$ using the concurrency control scheme described
 in Section~\ref{ssec:dyn-concurrency}.
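One consequence of the semantics in the hunk above is worth making explicit (our derivation, not stated in the chapter): if each rejected insert sleeps for $t_s \approx 1\,\mu\text{s}$ and then retries, the number of attempts per successful insert is geometrically distributed, giving an expected added stall of

```latex
% Expected stall per insert under Bernoulli rejection, assuming each
% rejected insert retries after a sleep of t_s (as described above).
\[
\mathbb{E}[\text{attempts}] = \frac{1}{\delta}
\qquad\Longrightarrow\qquad
\mathbb{E}[\text{stall per insert}]
  = \left(\frac{1}{\delta} - 1\right) t_s
  = \frac{1-\delta}{\delta}\, t_s
\]
```

which makes the direction of the knob explicit: the further the rejection rate drops below one, the more stall each insert accumulates on average.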
@@ -1145,7 +1153,7 @@ in Section~\ref{ssec:dyn-concurrency}.
 
 We'll begin by considering the ISAM
 tree. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all
-stall rates succeed in greatly reducing tail latency relative to
+rejection rates succeed in greatly reducing tail latency relative to
 tiering. Additionally, it shows a small amount of available tuning of
 the worst-case insertion latencies, with higher stall amounts reducing
 the tail latencies slightly at various points in the distribution. This
@@ -1160,16 +1168,16 @@ resulting in a stall.
 Of course, if the query latency is severely
 affected by the use of this mechanism, it may not be worth using. Thus,
 in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density
-of various shard counts within the decomposed structure for each stall
+of various shard counts within the decomposed structure for each rejection
 rate, as well as strict tiering. This figure shows that, even for no
 insertion stalling, the shard count within the structure remains
 well behaved, albeit with a slightly longer tail and a higher average
 value compared to tiering. Once stalls are introduced, it is possible
 to both reduce the tail, and shift the peak of the distribution through
-a variety of points. In particular, we see that a stall of $.99$ is
-sufficient to move the peak to very close to tiering, and lower stall
+a variety of points. In particular, we see that a rejection rate of $.99$ is
+sufficient to move the peak to very close to tiering, and lower rejection
 rates are able to further shift the peak of the distribution to even
-lower counts. This result implies that this stall mechanism may be able
+lower counts. This result implies that this rejection mechanism may be able
 to produce a trade-off space for insertion and query performance, which
 is a question we will examine in Section~\ref{ssec:tl-design-space}.
@@ -1186,7 +1194,7 @@ small size of the data set used, we repeated the exact same testing
 using a set of four billion uniform integers, shown in
 Figure~\ref{fig:tl-stall-4b}. These results are aligned with the
 smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the
-same improvements in insertion tail latency for all stall amounts, and
+same improvements in insertion tail latency for all rejection rates, and
 Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
 count.
@@ -1199,7 +1207,7 @@ count.
 
 Finally, we considered our dynamized VPTree in
 Figure~\ref{fig:tl-stall-knn}. This test shows some of the possible
-limitations of our fixed stall rate mechanism. The ISAM tree tested
+limitations of our fixed rejection rate mechanism. The ISAM tree tested
 above is constructable in roughly linear time, being an MDSP with
 $B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards, and
 thus roughly constant.\footnote{
@@ -1209,35 +1217,35 @@ thus roughly constant.\footnote{
 about, however, it is clear that $k$ is typically quite close to $s$
 in practice for ISAM tree.
 } Thus, the ratio $\frac{B_M(n,
-k)}{n}$ used to determine the optimal insertion stall rate is
+k)}{n}$ used to determine the optimal insertion rejection rate is
 asymptotically a constant. For VPTree, however, the construction cost
 is super-linear, with $B(n) \in \Theta(n \log n)$, and also generally
 much larger in absolute time requirements. We can see in
 Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution
-is very poorly behaved for smaller stall amounts, with the shard count
-following a roughly uniform distribution for a stall rate of $1$. This
+is very poorly behaved for smaller rejection rates, with the shard count
+following a roughly uniform distribution for a rejection rate of $1$. This
 means that the background reconstructions are not capable of keeping up
 with buffer flushing, and so the number of shards grows significantly
 over time. Introducing stalls does shift the distribution closer to
-normal, but it requires a much larger stall rate in order to obtain
+normal, but it requires a much larger rejection rate in order to obtain
 a shard count distribution that is close to the strict tiering than
 was the case with the ISAM tree test. It is still possible, though,
-even with our simple fixed-stall rate implementation. Additionally,
+even with our simple fixed-rejection rate implementation. Additionally,
 this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce
 the tail latency substantially compared to strict tiering, with the same
-latency distribution effects for larger stall rates as was seen in the
+latency distribution effects for larger rejection rates as was seen in the
 ISAM examples.
 
 These tests show that, for ISAM tree at least, introducing a constant
-stall rate while allowing the decomposition to develop naturally
+rejection rate while allowing the decomposition to develop naturally
 with background reconstructions only is able to match the shard count
 distribution of tiering, which strictly enforces the shard count bound
 using blocking, while achieving significantly better insertion tail
 latencies. VPTree is able to achieve the same results too, albeit
-requiring significantly higher stall rates to match the shard bound.
+requiring significantly higher rejection rates to match the shard bound.
 
-\subsection{Insertion Stall Trade-off Space}
+\subsection{Insertion Rejection Trade-off Space}
 \label{ssec:tl-design-space}
 
 \begin{figure}
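The asymptotic contrast drawn in the hunk above can be summarized in one line. This is a sketch using the stated costs and the footnote's observation that $k$ stays close to the constant scale factor $s$ for the ISAM tree:

```latex
% Why a fixed rejection rate suits ISAM tree but not VPTree: the
% per-record construction cost is constant for one and growing for
% the other (sketch based on the costs quoted above).
\[
\text{ISAM tree:}\quad
  \frac{B_M(n, k)}{n} \in \Theta(\log k) \approx \Theta(\log s) = O(1)
\qquad
\text{VPTree:}\quad
  \frac{B(n)}{n} \in \Theta(\log n)
\]
```

Since the VPTree's per-record construction cost grows with $n$, a rejection rate that suffices early in a run under-stalls as the structure grows, consistent with the much more aggressive rejection it requires above to match the shard bound.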
@@ -1256,15 +1264,15 @@ or query latency. By throttling insertion, we potentially reduce the
 throughput. Further, it isn't clear in practice how much query latency
 suffers as the shard count distribution changes. In this experiment, we
 address these concerns by directly measuring the insertion throughput
-and query latency over a variety of stall rates and compare the results
+and query latency over a variety of rejection rates and compare the results
 to strict tiering.
 
 The results of this test for ISAM with the SOSD \texttt{OSM} dataset are
 shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the
 insertion throughput plotted against the average query latency for our system at
-various stall rates, and with tiering configured with an equivalent scale
+various rejection rates, and with tiering configured with an equivalent scale
 factor marked as a red point for reference. The most interesting point
-demonstrated by this plot is that introducing the stall rate provides a
+demonstrated by this plot is that introducing the rejection rate provides a
 beautiful design trade-off between query and insert performance. In fact,
 this space is far more useful than the trade-off space represented by
 layout policy and scale factor selection using strict reconstruction
@@ -1304,7 +1312,7 @@ to provide a superior set of design trade-offs than the strict policies,
 at least in environments where sufficient parallel processing and memory
 are available to leverage parallel reconstructions.
 
-\subsection{Legacy Design Space}
+\subsection{Design Space}
 
 Our new system retains the concept of buffer size and scale factor from
 the previous version, although these have very different performance
@@ -1318,7 +1326,7 @@ using the SOSD \texttt{OSM} dataset and point lookup queries.
 \centering
 \subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}}
 \subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\
-\caption{Legacy Design Space Examination}
+\caption{Design Space Examination}
 \label{fig:tl-design-space}
 \end{figure}
@@ -1347,7 +1355,7 @@ Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased,
 the worst-case insertion time also increases, although the effect is
 relatively small.
 
-\subsection{Thread Scaling}
+\subsection{Internal Thread Scaling}
 
 \begin{figure}
 \centering
@@ -1360,21 +1368,21 @@ relatively small.
 \end{figure}
 
 In the previous tests, we ran our system configured with 32 available
-threads, which was more than enough to run all reconstructions and
-queries fully in parallel. However, it's important to determine how well
-the system works in more resource constrained environments. The system
-shares internal threads between reconstructions and queries, and
-flushing occurs on a dedicated thread separate from these. During the
-benchmark, one client thread issued queries continuously and another
-issued inserts. The index accumulated a total of five levels, so
-the maximum amount of parallelism available during the testing was 4
+internal threads, which was more than enough to run all reconstructions
+and queries fully in parallel. However, it's important to determine
+how well the system works in more resource-constrained environments.
+The system shares internal threads between reconstructions and queries,
+and flushing occurs on a dedicated thread separate from these. During
+the benchmark, one client thread issued queries continuously and
+another issued inserts. The index accumulated a total of five levels,
+so the maximum amount of parallelism available during the testing was 4
 parallel reconstructions, along with the dedicated flushing thread and
 any concurrent queries. In these tests, we used the SOSD \texttt{OSM}
 dataset (200M records) and point-lookup queries without early abort
 against a dynamized ISAM tree.
 
 We considered the insertion throughput vs. query latency trade-off for
-various stall amounts with several internal thread counts. We inserted
+various rejection rates with several internal thread counts. We inserted
 30\% of the dataset first, and then measured the insertion throughput
 over the insertion of the rest of the data on a client thread, while
 another client thread continuously issued queries against the structure. The
@@ -1386,33 +1394,34 @@ throughput is limited only by the stall amount, and by buffer flushing.
 As flushing occurs on a dedicated thread, it is unaffected by changes
 in the internal thread configuration of the system.
 
-In terms of query performance, there are two general effects that can be
-observed. The first effect is that the previously noted reduction in
-query performance as the insertion throughput is increased is observed
-in all cases, irrespective of thread count. However, interestingly,
-the thread count itself has little effect on the curve outside of the
-case of only having a single thread. This can also be seen in
-Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of
-the same data revealing the best measured insertion throughput associated
-with a given query latency bound. In both cases, two or more threads are
-capable of significantly higher insertion throughput at a given query
-latency. But, at very low insertion throughputs, this effect vanishes
-and all thread counts are roughly equivalent in performance.
+In terms of query performance, there are two general effects that can
+be observed. The first effect is that the previously noted reduction in
+query performance as the insertion throughput is increased is observed in
+all cases, irrespective of internal thread count. However, interestingly,
+the internal thread count itself has little effect on the curve outside
+of the case of only having a single internal thread. This can also be
+seen in Figure~\ref{fig:tl-query-scaling}, which shows an alternative
+view of the same data revealing the best measured insertion throughput
+associated with a given query latency bound. In both cases, two or more
+internal threads are capable of significantly higher insertion throughput
+at a given query latency. But, at very low insertion throughputs, this
+effect vanishes and all internal thread counts are roughly equivalent
+in performance.
 
 A large part of the reason for this significant deviation in behavior
-between one thread and multiple is likely that queries and reconstructions
-share the same pool of background threads. Our testing involved issuing
-queries continuously on a single thread, while performing inserts, and so
-two threads background threads ensures that a reconstruction and query can
-be run in parallel, whereas a single thread will force queries to wait
-behind long running reconstructions. Once this bottleneck is overcome,
-a reduction in the amount of parallel reconstruction seems to have only a
-minor influence on overall performance. This is because, although in the
-worst case the system requires $\log_s n$ threads to fully parallelize
+between one internal thread and multiple is likely that queries and
+reconstructions share the same pool of internal threads. Our testing
+involved issuing queries continuously on a single internal thread,
+while performing inserts, and so two internal threads ensure
+that a reconstruction and query can be run in parallel, whereas a
+single internal thread will force queries to wait behind long-running
+reconstructions. Once this bottleneck is overcome, a reduction in the
+amount of parallel reconstruction seems to have only a minor influence
+on overall performance. This is because, although in the worst case
+the system requires $\log_s n$ internal threads to fully parallelize
 reconstructions, this worst case is fairly rare. The vast majority of
 reconstructions only require a fraction of this total parallel capacity.
 
-
 \section{Conclusion}
 
 In this section, we addressed the final of the three major problems of
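To unpack the $\log_s n$ figure in the hunk above (our reading, not spelled out in the text): the benchmark's five levels admitting at most four parallel reconstructions suggests at most one in-flight reconstruction per level below the first, so with $\ell$ levels,

```latex
% Assumed model: at most one in-flight reconstruction per level,
% matching "five levels -> 4 parallel reconstructions" above.
\[
\ell \in \Theta(\log_s n)
\qquad\Longrightarrow\qquad
\text{max concurrent reconstructions} = \ell - 1 \in \Theta(\log_s n)
\]
```

which is why the worst case needs a logarithmic number of internal threads, but typical operation needs far fewer.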
@@ -1425,7 +1434,7 @@ within our framework, including a significantly improved architecture
 for scheduling and executing parallel and background reconstructions,
 and a system for rate limiting by rejecting inserts via Bernoulli
 sampling.
 
-We evaluated this system for fixed stall rates, and found significant
+We evaluated this system for fixed rejection rates, and found significant
 improvements in tail latencies, approaching the practical lower bound we
 established using the equal block method, without requiring significant
 degradation of query performance. In fact, we found that this rate