 chapters/design-space.tex |  26
 chapters/tail-latency.tex | 161
 2 files changed, 171 insertions(+), 16 deletions(-)
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 32d9b9c..952be42 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -730,7 +730,7 @@ and VPTree.
 \begin{figure}
 \centering
 \subfloat[ISAM Tree Range Count]{\includegraphics[width=.5\textwidth]{img/design-space/isam-parm-sweep.pdf} \label{fig:design-isam-tradeoff}}
-\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/selectivity-sweep.pdf} \label{fig:design-knn-tradeoff}} \\
+\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/knn-parm-sweep.pdf} \label{fig:design-knn-tradeoff}} \\
 \caption{Insertion Throughput vs. Query Latency}
 \label{fig:design-tradeoff}
 \end{figure}
@@ -757,12 +757,24 @@ in scale factor have very little effect. However, leveling's
 insertion performance degrades linearly with scale factor, and this is
 well demonstrated in the plot.
 
-The Bentley-Saxe method appears to follow a very similar trend to that
-of leveling, albeit with even more dramatic performance degradation as
-the scale factor is increased. Generally it seems to be a strictly worse
-alternative to leveling in all but its best-case query cost, and we will
-omit it from our tests moving forward as a result.
-
+The story is a bit clearer in Figure~\ref{fig:design-knn-tradeoff}. The
+VPTree has a much greater construction time, both asymptotically and
+in absolute terms, and its average query latency is also significantly
+greater. As a result, configuration changes produce much more pronounced
+differences in performance, and present us with a far clearer trade-off
+space. The same general trends hold as for the ISAM tree, just amplified.
+Leveling has better query performance than tiering, and sees improving
+query performance and degrading insertion performance as the scale factor
+increases. Tiering has better insertion performance and worse query
+performance than leveling, and sees improving insertion and worsening
+query performance as the scale factor is increased. The Bentley-Saxe
+method shows similar trends to leveling.
+
+In general, the Bentley-Saxe method follows a very similar trend to
+that of leveling, albeit with even more dramatic performance degradation
+as the scale factor is increased. It appears to be a strictly worse
+alternative to leveling in all but its best-case query cost, and we will
+omit it from our tests moving forward as a result.
 
 \subsection{Query Size Effects}
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index e63f3c9..a88fe0c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -926,31 +926,167 @@ the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
 \begin{figure}
 \centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\
 \caption{Insertion and Shard Count Distributions for ISAM with 200M Records}
 \label{fig:tl-stall-200m}
 \end{figure}
 
-First, we will consider the insertion and query performance of our system
-at a variety of stall proportions.
-
+First, we will consider the insertion and query performance of our
+system at a variety of stall proportions. The purpose of this testing
+is to demonstrate that introducing stalls into the insertion process
+reduces the insertion tail latency, while matching the general insertion
+and query performance of a strict tiering policy. Recall that, in the
+insertion stall case, no explicit shard capacity limits are enforced
+by the framework. Reconstructions are triggered with each buffer flush
+on all levels exceeding a specified shard count ($s = 4$ in these
+tests), and the buffer flushes immediately when full with no regard to
+the state of the structure. Thus, limiting the insertion rate is the
+only means the system has of maintaining its shard count at a
+manageable level. These tests were run on a system with sufficient
+available resources to fully parallelize all reconstructions.
+
+Figure~\ref{fig:tl-stall-200m} shows the results of inserting the 200
+million record SOSD \texttt{OSM} dataset into a dynamized ISAM tree,
+using our insertion stalling technique as well as strict tiering. We
+inserted $30\%$ of the records, and then measured the individual latency
+of each insert after that point to produce
+Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard}
+was produced by recording the number of shards in the dynamized structure
+each time the buffer flushed. Note that a stall value of $\delta = 1$
+indicates no stalling at all, and values less than one indicate that
+each insert is rejected with probability $1 - \delta$. Thus, a lower
+stall value means more stalls are introduced. The tiering policy is
+strict tiering with a scale factor of $s=4$, using the concurrency
+control scheme described in Section~\ref{ssec:dyn-concurrency}.
+
+Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all of the
+insertion rejection probabilities succeed in greatly reducing tail
+latency relative to tiering. Additionally, it shows that the worst-case
+insertion latencies can be tuned to a small degree, with more aggressive
+stalling reducing the tail latencies slightly at various points in the
+distribution. This latter effect results from the buffer flush latency
+hiding mechanism, which was retained from Chapter~\ref{chap:framework}.
+The buffer actually has space for two versions, and the second version
+can be filled while the first is flushing. This means that, for more
+aggressive stalling, some of the time spent blocking on the buffer flush
+is redistributed over the inserts into the second version of the buffer,
+rather than resulting in a stall.
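+
+The sketch below illustrates roughly how such a throttle might be
+implemented. It is an illustration only, not the framework's actual
+interface: the \texttt{Buffer} type, its capacity, and the
+\texttt{flush\_async} hook are stand-ins for the real machinery.
+
+\begin{verbatim}
+#include <cstddef>
+#include <random>
+#include <thread>
+#include <utility>
+#include <vector>
+
+// Minimal sketch of a fixed-rate insertion throttle with a
+// double-buffered flush. Illustrative only: Buffer and flush_async()
+// stand in for the framework's real machinery.
+struct Buffer {
+    std::vector<long> records;
+    std::size_t capacity = 12000;
+    bool full() const { return records.size() >= capacity; }
+};
+
+class ThrottledWriter {
+public:
+    explicit ThrottledWriter(double delta) : delta_(delta), rng_(42) {}
+
+    void insert(long rec) {
+        std::uniform_real_distribution<double> coin(0.0, 1.0);
+        while (coin(rng_) > delta_)       // reject w.p. 1 - delta
+            std::this_thread::yield();    // brief stall, then retry
+        if (active_.full()) {
+            flush_async(active_);         // background reconstruction
+            std::swap(active_, standby_); // fill the second version
+        }                                 // while the first flushes
+        active_.records.push_back(rec);
+    }
+
+private:
+    static void flush_async(Buffer &b) {
+        // Placeholder: the framework schedules a reconstruction here.
+        b.records.clear();
+    }
+
+    double delta_;
+    std::mt19937 rng_;
+    Buffer active_, standby_;
+};
+
+int main() {
+    ThrottledWriter writer(0.99);  // a stall value of 0.99
+    for (long i = 0; i < 1000000; i++) writer.insert(i);
+}
+\end{verbatim}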
+
+Of course, if the query latency is severely affected by the use of
+this mechanism, it may not be worth using. Thus, in
+Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of
+various shard counts within the dynamized structure for each stalling
+amount, as well as strict tiering. We have elected to examine the shard
+count, rather than the query latencies, because the intention of this
+technique is to directly control the number of shards, and we want to
+show that this is possible. Of course, controlling the shard count
+matters only for the sake of query latencies, and we will consider
+query latency directly later.
+
+This figure shows that, even with no insertion throttle at all, the shard
+count within the structure remains well behaved and normally distributed,
+albeit with a slightly longer tail and a higher average value. Once
+stalls are introduced, though, it is possible both to reduce the tail
+and to shift the peak of the distribution through a variety of points. In
+particular, we see that a stall value of $0.99$ is sufficient to move the
+peak very close to that of tiering, and lower stall values further shift
+the peak of the distribution to even lower counts.
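+
+For intuition about where the tiering peak sits, a back-of-the-envelope
+bound (assuming the textbook tiering layout, in which level $i$ holds
+at most $s$ shards of roughly $N_B \cdot s^i$ records each, where $N_B$
+is the buffer capacity) is
+\begin{equation*}
+\text{shards}(n) \;\leq\; s \cdot \left\lceil \log_s \frac{n}{N_B}
+\right\rceil,
+\end{equation*}
+which grows only logarithmically in $n$. This is why the strict tiering
+distribution is so tightly concentrated, and it is the reference point
+that the stalling mechanism attempts to approach.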
 
 \begin{figure}
 \centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\
 \caption{Insertion and Shard Count Distributions for ISAM with 4B Records}
 \label{fig:tl-stall-4b}
 \end{figure}
 
+To validate that these results were not simply an artifact of the
+relatively small data set used, we repeated the same testing using a
+set of four billion uniform integers; these results are shown in
+Figure~\ref{fig:tl-stall-4b}. They align with the smaller data set,
+with Figure~\ref{fig:tl-stall-4b-dist} showing the same improvements
+in insertion tail latency for all stall amounts, and
+Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
+count. If anything, the gap between strict tiering and unthrottled
+insertion is narrower with the larger data set than the smaller one.
+
 \begin{figure}
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\
 \caption{Insertion and Shard Count Distributions for VPTree}
 \label{fig:tl-stall-knn}
 \end{figure}
 
+Finally, we considered our dynamized VPTree in
+Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of about
+one million 300-dimensional vectors. This test shows some of the
+possible limitations of our fixed rejection rate. The ISAM tree tested
+above is constructable in roughly linear time, being an MDSP with
+$B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n, k)}{n}$
+used to determine the optimal insertion stall rate is asymptotically a
+constant. For the VPTree, however, the construction cost is
+super-linear, with $B(n) \in \Theta(n \log n)$, and also generally much
+larger in absolute terms. We can see in
+Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution is
+poorly behaved for smaller stall amounts, with the shard count following
+a roughly uniform distribution for a stall value of $1$. This means that
+the background reconstructions are not capable of keeping up with buffer
+flushing, and so the number of shards grows significantly over time.
+Introducing stalls does shift the distribution closer to normal, but
+much more aggressive stalling is required to bring the shard count
+distribution close to that of strict tiering than was the case in the
+ISAM tree test. It remains possible, though, even with our simple
+fixed-stall-rate implementation. Additionally,
+Figure~\ref{fig:tl-stall-knn-dist} shows that this approach reduces the
+tail latency substantially compared to strict tiering, with the same
+latency distribution effects for more aggressive stalling as were seen
+in the ISAM examples.
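+
+To make the contrast concrete, the per-insert reconstruction work
+implied by this ratio behaves very differently for the two structures:
+\begin{equation*}
+\frac{B_M(n, k)}{n} \in \Theta(\log k)
+\qquad \text{versus} \qquad
+\frac{B(n)}{n} \in \Theta(\log n).
+\end{equation*}
+For the ISAM tree this quantity is independent of $n$, so a single
+fixed stall value remains suitable as the structure grows. For the
+VPTree it grows with the data, so any fixed stall value will eventually
+be too small to keep reconstructions ahead of buffer flushes.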
+
+Thus, we've shown that introducing even a fixed stall, while allowing
+the internal structure of the dynamization to develop naturally, is
+able to match the shard count distribution of strict tiering while
+achieving significantly lower insertion tail latencies.
+
+\subsection{Insertion Stall Trade-off Space}
+
+While we have shown that introducing insertion stalls accomplishes the
+goal of reducing tail latencies while matching the shard count of a
+strict tiering reconstruction strategy, we have not yet addressed the
+actual performance of this structure. By throttling inserts, we
+potentially reduce the insertion throughput. Further, it isn't
+immediately obvious just how much query performance suffers as the
+shard count distribution shifts. In this test, we examine the average
+insertion throughput and query latency over a variety of stall rates.
+
+The results of this test for ISAM with the SOSD \texttt{OSM} dataset
+are shown in Figure~\ref{fig:tl-latency-curve}, which plots the
+insertion throughput against the average query latency for our system
+at various stall rates, with tiering at an equivalent scale factor
+marked as a red point for reference. This plot shows two interesting
+features of the insertion stall mechanism. First, it is possible to
+introduce stalls that do not significantly affect the write throughput,
+but do improve query latency. This is seen in the difference between
+the two points at the far right of the curve, where introducing a
+slight stall improves query performance at virtually no cost. This
+represents the region of the curve where the introduced delay does not
+exceed the cost of a buffer flush, and so the total amount of time the
+system spends stalling changes very little.
+
+The second, and perhaps more notable, feature of this plot is that the
+stall rate provides a smooth design trade-off between query and insert
+performance. In fact, this space is far more useful than the trade-off
+space spanned by layout policy and scale factor selection under the
+strict reconstruction schemes we examined in
+Chapter~\ref{chap:design-space}. At the upper end of the
+insertion-optimized region, we see more than double the insertion
+throughput of tiering (with significantly lower tail latencies as well)
+at the cost of a slightly more than $2\times$ increase in query
+latency. Moving down the curve, we are able to roughly match the
+performance of tiering, and even shift to more query-optimized
+configurations.
+
 \begin{figure}
 \centering
 \includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf}
@@ -958,6 +958,13 @@ at a variety of stall proportions.
 \label{fig:tl-latency-curve}
 \end{figure}
 
+This shows a very interesting result. Not only is our approach able to
+match a strict reconstruction policy in terms of average query and
+insertion performance with better tail latencies, it is even able to
+provide a superior set of design trade-offs to the strict policies, at
+least in environments where sufficient parallel processing and memory
+are available to leverage parallel reconstructions.
+
 \section{Conclusion}