 chapters/design-space.tex |  26
 chapters/tail-latency.tex | 161
 2 files changed, 171 insertions(+), 16 deletions(-)
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 32d9b9c..952be42 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -730,7 +730,7 @@ and VPTree.
 \begin{figure}
 \centering
 \subfloat[ISAM Tree Range Count]{\includegraphics[width=.5\textwidth]{img/design-space/isam-parm-sweep.pdf} \label{fig:design-isam-tradeoff}}
-\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/selectivity-sweep.pdf} \label{fig:design-knn-tradeoff}} \\
+\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/knn-parm-sweep.pdf} \label{fig:design-knn-tradeoff}} \\
 \caption{Insertion Throughput vs. Query Latency}
 \label{fig:design-tradeoff}
 \end{figure}
@@ -757,12 +757,24 @@ in scale factor have very little effect. However, leveling's
 insertion performance degrades linearly with scale factor, and this is
 well demonstrated in the plot.
 
-The Bentley-Saxe method appears to follow a very similar trend to that
-of leveling, albeit with even more dramatic performance degradation as
-the scale factor is increased. Generally it seems to be a strictly worse
-alternative to leveling in all but its best-case query cost, and we will
-omit it from our tests moving forward as a result.
-
+The story is a bit clearer in Figure~\ref{fig:design-knn-tradeoff}. The
+VPTree has a much greater construction time, both asymptotically and
+in absolute terms, and its average query latency is also significantly
+greater. As a result, configuration changes produce much more pronounced
+differences in performance, and present us with a far clearer trade-off
+space. The same general trends hold as for the ISAM tree, just amplified.
+Leveling has better query performance than tiering, and sees improving
+query performance and degrading insertion performance as the scale factor
+increases. Tiering has better insertion performance and worse query
+performance than leveling, and sees improving insertion and worsening
+query performance as the scale factor is increased. The Bentley-Saxe
+method shows similar trends to leveling.
+
+In general, the Bentley-Saxe method follows a very similar trend to
+that of leveling, albeit with even more dramatic performance degradation
+as the scale factor is increased. It appears to be a strictly worse
+alternative to leveling in all but its best-case query cost, and we will
+omit it from our tests moving forward as a result.
 
 \subsection{Query Size Effects}
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index e63f3c9..a88fe0c 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -926,31 +926,167 @@ the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
 \begin{figure}
 \centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\
 \caption{Insertion and Shard Count Distributions for ISAM with 200M Records}
 \label{fig:tl-stall-200m}
 \end{figure}
 
-First, we will consider the insertion and query performance of our system
-at a variety of stall proportions.
-
+First, we will consider the insertion and query performance of our
+system at a variety of stall proportions. The purpose of this testing
+is to demonstrate that introducing stalls into the insertion process
+reduces the insertion tail latency, while matching the general insertion
+and query performance of a strict tiering policy. Recall that, in the
+insertion stall case, no explicit shard capacity limits are enforced
+by the framework. Reconstructions are triggered with each buffer flush
+on all levels exceeding a specified shard count ($s = 4$ in these
+tests), and the buffer flushes immediately when full with no regard to
+the state of the structure. Thus, limiting the insertion rate is the
+only means the system has of maintaining its shard count at a
+manageable level. These tests were run on a system with sufficient
+available resources to fully parallelize all reconstructions.
+
+Figure~\ref{fig:tl-stall-200m} shows the results of inserting the 200
+million record SOSD \texttt{OSM} dataset into a dynamized ISAM tree,
+using our insertion stalling technique as well as strict tiering. We
+inserted $30\%$ of the records, and then measured the individual latency
+of each insert after that point to produce
+Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard}
+was produced by recording the number of shards in the dynamized structure
+each time the buffer flushed. Note that a stall value of $\delta = 1$
+indicates no stalling at all, and values less than one indicate that
+each insert is rejected with probability $1 - \delta$. Thus, a lower
+stall value means more stalls are introduced. The tiering policy is
+strict tiering with a scale factor of $s=4$, using the concurrency
+control scheme described in Section~\ref{ssec:dyn-concurrency}.
+
+Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all of the
+insertion rejection probabilities succeed in greatly reducing tail
+latency relative to tiering. Additionally, it shows that the worst-case
+insertion latencies can be tuned to a small degree, with more aggressive
+stalling reducing the tail latencies slightly at various points in the
+distribution. This latter effect results from the buffer flush latency
+hiding mechanism, which was retained from Chapter~\ref{chap:framework}.
+The buffer actually has space for two versions, and the second version
+can be filled while the first is flushing. This means that, for more
+aggressive stalling, some of the time spent blocking on the buffer flush
+is redistributed over the inserts into the second version of the buffer,
+rather than resulting in a stall.
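+
+The sketch below illustrates roughly how such a throttle might be
+implemented. It is an illustration only, not the framework's actual
+interface: the \texttt{Buffer} type, its capacity, and the
+\texttt{flush\_async} hook are stand-ins for the real machinery.
+
+\begin{verbatim}
+#include <cstddef>
+#include <random>
+#include <thread>
+#include <utility>
+#include <vector>
+
+// Minimal sketch of a fixed-rate insertion throttle with a
+// double-buffered flush. Illustrative only: Buffer and flush_async()
+// stand in for the framework's real machinery.
+struct Buffer {
+    std::vector<long> records;
+    std::size_t capacity = 12000;
+    bool full() const { return records.size() >= capacity; }
+};
+
+class ThrottledWriter {
+public:
+    explicit ThrottledWriter(double delta) : delta_(delta), rng_(42) {}
+
+    void insert(long rec) {
+        std::uniform_real_distribution<double> coin(0.0, 1.0);
+        while (coin(rng_) > delta_)       // reject w.p. 1 - delta
+            std::this_thread::yield();    // brief stall, then retry
+        if (active_.full()) {
+            flush_async(active_);         // background reconstruction
+            std::swap(active_, standby_); // fill the second version
+        }                                 // while the first flushes
+        active_.records.push_back(rec);
+    }
+
+private:
+    static void flush_async(Buffer &b) {
+        // Placeholder: the framework schedules a reconstruction here.
+        b.records.clear();
+    }
+
+    double delta_;
+    std::mt19937 rng_;
+    Buffer active_, standby_;
+};
+
+int main() {
+    ThrottledWriter writer(0.99);  // a stall value of 0.99
+    for (long i = 0; i < 1000000; i++) writer.insert(i);
+}
+\end{verbatim}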
+
+Of course, if the query latency is severely affected by the use of
+this mechanism, it may not be worth using. Thus, in
+Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of
+various shard counts within the dynamized structure for each stalling
+amount, as well as strict tiering. We have elected to examine the shard
+count, rather than the query latencies, because the intention of this
+technique is to directly control the number of shards, and we want to
+show that this is possible. Of course, controlling the shard count
+matters only for the sake of query latencies, and we will consider
+query latency directly later.
+
+This figure shows that, even with no insertion throttle at all, the shard
+count within the structure remains well behaved and normally distributed,
+albeit with a slightly longer tail and a higher average value. Once
+stalls are introduced, though, it is possible both to reduce the tail
+and to shift the peak of the distribution through a variety of points. In
+particular, we see that a stall value of $0.99$ is sufficient to move the
+peak very close to that of tiering, and lower stall values further shift
+the peak of the distribution to even lower counts.
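+
+For intuition about where the tiering peak sits, a back-of-the-envelope
+bound (assuming the textbook tiering layout, in which level $i$ holds
+at most $s$ shards of roughly $N_B \cdot s^i$ records each, where $N_B$
+is the buffer capacity) is
+\begin{equation*}
+\text{shards}(n) \;\leq\; s \cdot \left\lceil \log_s \frac{n}{N_B}
+\right\rceil,
+\end{equation*}
+which grows only logarithmically in $n$. This is why the strict tiering
+distribution is so tightly concentrated, and it is the reference point
+that the stalling mechanism attempts to approach.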
 
 \begin{figure}
 \centering
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\
 \caption{Insertion and Shard Count Distributions for ISAM with 4B Records}
 \label{fig:tl-stall-4b}
 \end{figure}
 
+To validate that these results were not simply an artifact of the
+relatively small data set used, we repeated the same testing using a
+set of four billion uniform integers; these results are shown in
+Figure~\ref{fig:tl-stall-4b}. They align with the smaller data set,
+with Figure~\ref{fig:tl-stall-4b-dist} showing the same improvements
+in insertion tail latency for all stall amounts, and
+Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
+count. If anything, the gap between strict tiering and unthrottled
+insertion is narrower with the larger data set than the smaller one.
+
 \begin{figure}
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-btree-isam-tput}}
-\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\
 \caption{Insertion and Shard Count Distributions for VPTree}
 \label{fig:tl-stall-knn}
 \end{figure}
 
+Finally, we considered our dynamized VPTree in
+Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of about
+one million 300-dimensional vectors. This test shows some of the
+possible limitations of our fixed rejection rate. The ISAM tree tested
+above is constructable in roughly linear time, being an MDSP with
+$B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n, k)}{n}$
+used to determine the optimal insertion stall rate is asymptotically a
+constant. For the VPTree, however, the construction cost is
+super-linear, with $B(n) \in \Theta(n \log n)$, and also generally much
+larger in absolute terms. We can see in
+Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution is
+poorly behaved for smaller stall amounts, with the shard count following
+a roughly uniform distribution for a stall value of $1$. This means that
+the background reconstructions are not capable of keeping up with buffer
+flushing, and so the number of shards grows significantly over time.
+Introducing stalls does shift the distribution closer to normal, but
+much more aggressive stalling is required to bring the shard count
+distribution close to that of strict tiering than was the case in the
+ISAM tree test. It remains possible, though, even with our simple
+fixed-stall-rate implementation. Additionally,
+Figure~\ref{fig:tl-stall-knn-dist} shows that this approach reduces the
+tail latency substantially compared to strict tiering, with the same
+latency distribution effects for more aggressive stalling as were seen
+in the ISAM examples.
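+
+To make the contrast concrete, the per-insert reconstruction work
+implied by this ratio behaves very differently for the two structures:
+\begin{equation*}
+\frac{B_M(n, k)}{n} \in \Theta(\log k)
+\qquad \text{versus} \qquad
+\frac{B(n)}{n} \in \Theta(\log n).
+\end{equation*}
+For the ISAM tree this quantity is independent of $n$, so a single
+fixed stall value remains suitable as the structure grows. For the
+VPTree it grows with the data, so any fixed stall value will eventually
+be too small to keep reconstructions ahead of buffer flushes.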
+
+Thus, we've shown that introducing even a fixed stall, while allowing
+the internal structure of the dynamization to develop naturally, is
+able to match the shard count distribution of strict tiering while
+achieving significantly lower insertion tail latencies.
+
+\subsection{Insertion Stall Trade-off Space}
+
+While we have shown that introducing insertion stalls accomplishes the
+goal of reducing tail latencies while matching the shard count of a
+strict tiering reconstruction strategy, we have not yet addressed the
+actual performance of this structure. By throttling inserts, we
+potentially reduce the insertion throughput. Further, it isn't
+immediately obvious just how much query performance suffers as the
+shard count distribution shifts. In this test, we examine the average
+insertion throughput and query latency over a variety of stall rates.
+
+The results of this test for ISAM with the SOSD \texttt{OSM} dataset
+are shown in Figure~\ref{fig:tl-latency-curve}, which plots the
+insertion throughput against the average query latency for our system
+at various stall rates, with tiering at an equivalent scale factor
+marked as a red point for reference. This plot shows two interesting
+features of the insertion stall mechanism. First, it is possible to
+introduce stalls that do not significantly affect the write throughput,
+but do improve query latency. This is seen in the difference between
+the two points at the far right of the curve, where introducing a
+slight stall improves query performance at virtually no cost. This
+represents the region of the curve where the introduced delay does not
+exceed the cost of a buffer flush, and so the total amount of time the
+system spends stalling changes very little.
+
+The second, and perhaps more notable, feature of this plot is that the
+stall rate provides a smooth design trade-off between query and insert
+performance. In fact, this space is far more useful than the trade-off
+space spanned by layout policy and scale factor selection under the
+strict reconstruction schemes we examined in
+Chapter~\ref{chap:design-space}. At the upper end of the
+insertion-optimized region, we see more than double the insertion
+throughput of tiering (with significantly lower tail latencies as well)
+at the cost of a slightly more than $2\times$ increase in query
+latency. Moving down the curve, we are able to roughly match the
+performance of tiering, and even shift to more query-optimized
+configurations.
+
 \begin{figure}
 \centering
 \includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf}
@@ -958,6 +958,13 @@ at a variety of stall proportions.
 \label{fig:tl-latency-curve}
 \end{figure}
 
+This shows a very interesting result. Not only is our approach able to
+match a strict reconstruction policy in terms of average query and
+insertion performance with better tail latencies, it is even able to
+provide a superior set of design trade-offs to the strict policies, at
+least in environments where sufficient parallel processing and memory
+are available to leverage parallel reconstructions.
+
 \section{Conclusion}