From e0039390b09e802a66c2be4842dd95fee695b433 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Mon, 2 Jun 2025 20:31:18 -0400
Subject: Updates
---
 chapters/beyond-dsp.tex   | 136 ++++++++++++
 chapters/design-space.tex | 518 ++++++++++++++++++++++++++++++++--------------
 chapters/tail-latency.tex |  44 +++-
 3 files changed, 535 insertions(+), 163 deletions(-)
(limited to 'chapters')

diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 222dd14..7a0df37 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -1289,6 +1289,142 @@ get better asymptotic performance.

\subsection{Concurrency Control}
\label{ssec:dyn-concurrency}

+The decomposition-based dynamization scheme we are considering in this
+work lends itself to a very straightforward concurrency control scheme,
+because it is founded upon static data structures. As a result, concurrent
+writes only need to be managed within the mutable buffer. Beyond this,
+reconstructions within the levels of the structure only reorganize
+the existing data. So, a simple multi-versioning scheme to ensure that
+queries only ever see one version of the structure and buffer is entirely
+sufficient to allow concurrent reads and writes. In this section, we'll
+discuss our basic concurrency implementation.
+
+\subsubsection{High Level Architecture}
+
+\begin{figure}
+\includegraphics[width=\textwidth]{diag/concurrency.pdf}
+\caption{\textbf{Framework Concurrency Architecture.} A high-level
+view of the concurrency architecture of the framework, detailing which
+operations are run on which threads, and how they interact with the
+major components.}
+\label{fig:dyn-concurrency}
+\end{figure}
+
+The general architecture of our framework is shown in
+Figure~\ref{fig:dyn-concurrency}. The framework contains a scheduler and a
+thread pool, which it uses to schedule and execute concurrent operations.
+The scheduler itself is a user-configurable component, but we
+provide two default options: a serial scheduler that runs everything on
+the client thread to emulate single-threaded behavior, and a standard
+FIFO queue-based scheduler. The thread pool contains a user-configurable
+number of threads, which are shared between reconstructions and queries.
+Additionally, the scheduler itself has a single thread that is used to
+assign jobs to the thread pool.
+
+\subsubsection{The Mutable Buffer}
+
+Our mutable buffer is an unsorted array to which new records are appended.
+This makes concurrent writes very straightforward to support using a simple
+fetch-and-add instruction on the tail pointer of the buffer. When a write
+is issued, a fetch-and-add is executed against the tail pointer. This
+effectively reserves a slot at the end of the array for the new record
+to be written into, as each thread will receive a unique index from this
+operation. Then, the record can be directly assigned to that index. If
+the buffer is full, then a reconstruction is scheduled (if one isn't
+already running) and a failure is immediately returned to the user.
+
+In order to take advantage of concurrent reconstructions, the buffer
+allocation is controlled by two numbers: a high and a low watermark. The
+buffer is physically allocated based upon the high watermark, but
+reconstructions are triggered based on the low watermark. This system
+allows the user to trigger reconstructions \emph{before} the buffer
+fills. These reconstructions run in the background, allowing the system
+to hide some of the reconstruction latency.\footnote{
+    This system helps a little, but not very much. The amount of time it
+    takes to fill the buffer is vastly smaller than the time that some of
+    the more expensive reconstructions take, and so the overall effect
+    on performance is minute.
In Chapter~\ref{chap:tail-latency}, we
+    discuss a more sophisticated background reconstruction scheme that is
+    much more effective at latency hiding.
+} Once the buffer has filled
+to the high watermark, insertions will block.
+
+Internal threads do not access the buffer directly, but rather retrieve
+a read-only \emph{buffer view}, consisting of a head and a tail pointer
+indicating the set of records within the buffer that the thread can
+``see''. This view defines the set of records removed from the buffer
+by a flush, or visible to a query. These views are reference counted,
+to ensure that records are not removed from the buffer until all threads
+that can see them have finished. Because of this, queries holding a
+buffer view can stall flushes. To reduce the effect of this problem,
+the buffer is actually allocated with twice the physical space required
+to support the specified high watermark, allowing multiple versions of
+the buffer to coexist at once.
+
+\subsubsection{Structure Version Management}
+
+\begin{figure}
+\includegraphics[width=\textwidth]{diag/version-transition.pdf}
+\caption{\textbf{Structure Version Transition.} The transition of the
+active structure version proceeds in three phases. In the first
+phase (A), the system waits for all references to the old version (V1)
+to be released. Once a version is marked as old, it will no longer
+accumulate references, so this is guaranteed to happen eventually. Once
+all references have been released, the system moves to the second phase (B),
+in which the active version (V2) is copied over the old version. During this
+phase, both the old version and active version reference the same version
+object (V2). Finally, in the third phase, the in-flight version (V3) is
+made active.
At this point, the reconstruction is complete, and another
+reconstruction can immediately be scheduled, potentially resulting in a new
+in-flight version (V4).}
+\label{fig:dyn-conc-version}
+\end{figure}
+
+The dynamized index itself is represented as a structure containing
+levels which contain pointers to shards. Because the shards are static,
+the entire structure can be cheaply shallow-copied to create a version.
+The framework supports up to three versions at once: the ``old'' version,
+the active version, and an ``in-flight'' version. Version transitions
+occur at the conclusion of a reconstruction operation.
+
+Only one reconstruction can be active in the system at a time. When a
+reconstruction begins, the active version of the structure is copied
+to create the in-flight version, and a buffer view is claimed by the
+reconstruction thread. The reconstruction is then performed on this copy,
+building and removing shards according to the layout policy, and a new
+shard is built from the buffer view.
+
+Once the reconstruction is complete, the framework attempts to convert
+the in-flight version into the active one. This process first waits for
+all threads currently accessing the old version to complete. Then, the
+old version is replaced by the active version. During this replacement
+process, the active version remains active, and so any threads entering
+the system during the transition will be assigned to it. Once the active
+version has been made old, the in-flight version is made active. The
+versions are reference counted, so any threads using the active version
+will not be affected by its deactivation. This process is summarized
+in Figure~\ref{fig:dyn-conc-version}.
+
+
+\subsubsection{Concurrent Queries}
+
+Queries are answered asynchronously. When the user calls the query
+routine, the query is given to the scheduler and an \texttt{std::future}
+is returned to the user to access the result.
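The shape of this asynchronous query path can be sketched as follows. The serial stand-in scheduler here runs jobs inline on the caller's thread, mirroring the framework's serial scheduler option; all names and signatures are illustrative, not the framework's real interface.

```cpp
#include <functional>
#include <future>
#include <memory>

// Hypothetical serial scheduler: runs each job inline on the client
// thread, like the framework's serial scheduler option.
struct SerialScheduler {
    void schedule(std::function<void()> job) { job(); }
};

// Wrap a query so that its result is delivered through an std::future,
// then hand the wrapped job off to the scheduler. Illustrative only.
template <typename R>
std::future<R> submit_query(SerialScheduler &sched, std::function<R()> query) {
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(query));
    std::future<R> result = task->get_future();
    sched.schedule([task] { (*task)(); });
    return result;
}
```

With a pooled scheduler in place of the serial one, the same interface lets the caller continue working and block on `result.get()` only when the answer is actually needed.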
Once the scheduler assigns
+the query to a thread, it will retrieve a buffer view and a reference
+to the currently active version of the structure. The query will then
+be processed against these structures.
+
+Because the query keeps a reference to the version and buffer view it is
+using, it will be unaffected by the completion of a reconstruction. When
+the active version becomes old, the reference remains valid, and the
+query can continue to run. However, this reference will prevent the
+old version from being retired, and so queries can block the completion
+of reconstructions. Typically, the expected runtime of a query is much
+less than that of a reconstruction, and so this is not considered to
+be a serious problem, but it is addressed in our more sophisticated
+concurrency scheme discussed in Chapter~\ref{chap:tail-latency}.
+
\section{Evaluation}

Having described the framework in detail, we'll now turn to demonstrating
diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index 98c5bb2..32fe546 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -259,9 +259,56 @@ the structure.
Thus, the best case query cost in BSM is, \subsection{Leveling} + +\begin{algorithm} +\caption{The Leveling Policy} +\label{alg:design-leveling} + +\KwIn{$r$: set of records to be inserted, $\mathscr{I}$: a dynamized structure, $n$: number of records in $\mathscr{I}$} + +\BlankLine +\Comment{Find the first non-full level} +$target \gets -1$ \; +\For{$i=0\ldots \log_s n$} { + \If {$|\mathscr{I}_i| < N_B \cdot s^{i+1}$} { + $target \gets i$ \; + break \; + } +} + +\BlankLine +\Comment{If the target is $0$, then just merge the buffer into it} +\If{$target = 0$} { + $\mathscr{I}_0 \gets \text{build}(\text{unbuild}(\mathscr{I}_0) \cup r)$ \; + \Return +} + +\BlankLine +\Comment{If the structure is full, we need to grow it} +\If {$target = -1$} { + $target \gets 1 + (\log_s n)$ \; +} + +\BlankLine +\Comment{Perform the reconstruction} +$\mathscr{I}_{target} \gets \text{build}(\text{unbuild}(\mathscr{I}_{target}) \cup \text{unbuild}(\mathscr{I}_{target - 1}))$ \; + +\BlankLine +\Comment{Shift the remaining levels down to free up $\mathscr{I}_0$} +\For{$i=target-1 \ldots 1$} { + $\mathscr{I}_i \gets \mathscr{I}_{i-1}$ \; +} + +\BlankLine +\Comment{Flush the buffer in $\mathscr{I}_0$} +$\mathscr{I}_0 \gets \text{build}(r)$ \; + +\Return \; +\end{algorithm} + Our leveling layout policy is described in -Algorithm~\ref{alg:design-level}. Each level contains a single structure -with a capacity of $N_B\cdot s^i$ records. When a reconstruction occurs, +Algorithm~\ref{alg:design-leveling}. Each level contains a single structure +with a capacity of $N_B\cdot s^{i+1}$ records. 
When a reconstruction occurs,
the first level $i$ with enough space to hold the records
from level $i-1$ is selected as the target, and then a
new structure is built at level $i$ containing the records in it and level
@@ -323,7 +370,7 @@ I_A(n) \in \Theta\left(\frac{B(n)}{n}\cdot \frac{1}{2} (s+1) \log_s n\right)
\begin{theorem}
The worst-case insertion cost for leveling with a scale factor of $s$ is
\begin{equation*}
-\Theta\left(\frac{s-1}{s} \cdot B(n)\right)
+\Theta\left(B\left(\frac{s-1}{s} \cdot n\right)\right)
\end{equation*}
\end{theorem}
\begin{proof}
@@ -331,12 +378,24 @@
Unlike in BSM, where the worst case reconstruction involves all of the
records within the structure, in leveling it only includes the records
in the last two levels. In particular, the worst case behavior occurs
when the last level is one reconstruction away from its capacity, and the
-level above it is full. In this case, the reconstruction will involve,
+level above it is full. In this case, the reconstruction will involve the
+full capacity of the last level, or $N_B \cdot s^{\log_s n +1}$ records.
+
+We can relate this to $n$ by finding the ratio of elements contained in
+the last level of the structure to the entire structure. This is given
+by,
\begin{equation*}
-\left(s^{\log_s n} - s^{\log_s n - 1}\right) + s^{\log_s n - 1}
+\frac{N_B \cdot s^{\log_s n + 1}}{\sum_{i=0}^{\log_s n} N_B \cdot s^{i + 1}} = \frac{(s - 1)n}{sn - 1}
+\end{equation*}
+This fraction can be simplified by noting that the $1$ subtracted in
+the denominator is negligible and dropping it, allowing the $n$ to be
+canceled and giving a ratio of $\frac{s-1}{s}$.
Thus, the worst-case reconstruction
+will involve $\frac{s - 1}{s} \cdot n$ records, with all the other levels
+simply shifting down at no cost, resulting in a worst-case insertion cost
+of,
+\begin{equation*}
+I(n) \in \Theta\left(B\left(\frac{s-1}{s} \cdot n\right)\right)
\end{equation*}
-records, where the first parenthesized term represents the records in
-the last level, and the second the records in the level above it.
\end{proof}

\begin{theorem}
@@ -376,6 +435,50 @@ best-case cost of,

\subsection{Tiering}

+
+\begin{algorithm}
+\caption{The Tiering Policy}
+\label{alg:design-tiering}
+
+\KwIn{$r$: set of records to be inserted, $\mathscr{L}_0 \ldots \mathscr{L}_{\log_s n}$: the levels of $\mathscr{I}$, $n$: the number of records in $\mathscr{I}$}
+\BlankLine
+\Comment{Find the first non-full level}
+$target \gets -1$ \;
+\For{$i=0\ldots \log_s n$} {
+    \If {$|\mathscr{L}_i| < s$} {
+        $target \gets i$ \;
+        break \;
+    }
+}
+
+\BlankLine
+\Comment{If the structure is full, we need to grow it}
+\If {$target = -1$} {
+    $target \gets 1 + (\log_s n)$ \;
+}
+
+\BlankLine
+\Comment{Walk the structure backwards, applying reconstructions}
+\For {$i \gets target \ldots 1$} {
+    $\mathscr{L}_i \gets \mathscr{L}_i \cup \text{build}(\text{unbuild}(\mathscr{L}_{i-1, 0}) \ldots \text{unbuild}(\mathscr{L}_{i-1, s-1}))$ \;
+}
+\BlankLine
+\Comment{Add the buffered records to $\mathscr{L}_0$}
+$\mathscr{L}_0 \gets \mathscr{L}_0 \cup \text{build}(r)$ \;
+
+\Return \;
+\end{algorithm}
+
+Our tiering layout policy is described in Algorithm~\ref{alg:design-tiering}. In
+this policy, each level contains $s$ shards, each with a capacity of
+$N_B\cdot s^i$ records. When a reconstruction occurs, the first level
+with fewer than $s$ shards is selected as the target, $t$. Then, for
+every level with $i < t$, all of the shards in $i$ are merged into a
+single shard using a reconstruction and placed in level $i+1$.
These
+reconstructions are performed backwards, starting at $t-1$ and moving
+back up towards $0$. Then, the shard created by the buffer flush is
+placed in level $0$.
+
\begin{theorem}
The amortized insertion cost of tiering with a scale factor of $s$ is,
\begin{equation*}
@@ -454,6 +557,40 @@ in the analysis, we can see that possible trade-offs begin to manifest
within the space. We've seen some of these in action directly in the
experimental sections of previous chapters.

+Most notably, we can directly see in these cost functions the reason why
+tiering and leveling experience opposite effects as the scale factor
+changes. In both policies, increasing the scale factor increases the
+base of the logarithm governing the height, and so in the absence of
+the additional constants in the analysis, it would superficially appear
+as though both policies should see the same effects. But, with the other
+constants retained, we can see that this is in fact not the case. For
+tiering, increasing the scale factor does reduce the number of levels;
+however, it also increases the number of shards. Because the level
+reduction is in the base of the logarithm, but the shard count increase
+is directly linear, the shard count effect dominates and we see the query
+performance degrade as the scale factor increases. Leveling, however,
+does not include this linear term and sees only a reduction in height.
+
+When considering insertion, we see a similar situation in reverse. For
+leveling and tiering, increasing the scale factor reduces the size of
+the log term, and there are no other terms at play in tiering, so we
+see an improvement in insertion performance. However, leveling also
+has a linear dependency on the scale factor, as increasing the scale
+factor also increases the write amplification. This is why leveling sees
+its insertion performance degrade with scale factor.
The generalized +Bentley-Saxe method follows the same general trends as leveling for +worst-case query cost and for amortized insertion cost. + +Of note as well is the fact that leveling has slightly better worst-case +insertion performance. This is because leveling only ever reconstructs +one level at a time, with the other levels simply shifting around in +constant time. Bentley-Saxe and tiering have strictly worse worst-case +insertion cost as their worst-case reconstructions involve all of the +levels. In the Bentley-Saxe method, this worst-case cost is manifest +in a single, large reconstruction. In tiering, it involves $\log_s n$ +reconstructions, one per level. + + \begin{table*} \centering \small @@ -463,7 +600,7 @@ the experimental sections of previous chapters. & \textbf{Gen. BSM} & \textbf{Leveling} & \textbf{Tiering} \\ \hline $\mathscr{Q}(n)$ &$O\left(\log_s n \cdot \mathscr{Q}_S(n)\right)$ & $O\left(\log_s n \cdot \mathscr{Q}_S(n)\right)$ & $O\left(s \log_s n \cdot \mathscr{Q}_S(n)\right)$\\ \hline $\mathscr{Q}_B(n)$ & $\Theta(\mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ & $O(\log_s n \cdot \mathscr{Q}_S(n))$ \\ \hline -$I(n)$ & $\Theta(B(n))$ & $\Theta(\frac{s - 1}{s}\cdot B(n))$ & $\Theta(B(n))$\\ \hline +$I(n)$ & $\Theta(B(n))$ & $\Theta\left(B\left(\frac{s-1}{s} \cdot n\right)\right)$ & $\Theta(B(n))$\\ \hline $I_A(n)$ & $\Theta\left(\frac{B(n)}{n} \frac{1}{2}(s-1)\cdot((s-1)\log_s n +s)\right)$ & $\Theta\left(\frac{B(n)}{n} \frac{1}{2}(s-1)\log_s n\right)$& $\Theta\left(\frac{B(n)}{n} \log_s n\right)$ \\ \hline \end{tabular} @@ -471,32 +608,6 @@ $I_A(n)$ & $\Theta\left(\frac{B(n)}{n} \frac{1}{2}(s-1)\cdot((s-1)\log_s n +s)\r \label{tab:policy-comp} \end{table*} -% \begin{table*}[!t] -% \centering -% \begin{tabular}{|l l l l l|} -% \hline -% \textbf{Policy} & \textbf{Worst-case Query Cost} & \textbf{Worst-case Insert Cost} & \textbf{Best-cast Insert Cost} & \textbf{Amortized Insert Cost} \\ \hline -% Gen. 
Bentley-Saxe &$\Theta\left(\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2}(s-1) \cdot ( (s-1)\log_s n + s)\right)$ \\ -% Leveling &$\Theta\left(\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(\frac{n}{s})\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \frac{1}{2} \log_s(n)(s + 1)\right)$ \\ -% Tiering &$\Theta\left(s\log_s(n) \cdot Q(n)\right)$ &$\Theta\left(B(n)\right)$ &$\Theta\left(1\right)$ &$\Theta\left(\frac{B(n)}{n} \cdot \log_s(n)\right)$ \\\hline -% \end{tabular} -% \caption{Comparison of cost functions for various reconstruction policies for DSPs} -% \label{tab:policy-comp-old} -% \end{table*} - -% \begin{table*}[!t] -% \centering -% \begin{tabular}{|l l l l|} -% %stuff &\textbf{Gen. BSM} & \textbf{Leveling} & \textbf{Tiering} \\ -% % \textbf{Worst-case Query} &$O\left(\log_s(n)\cdot \mathscr{Q}_S(n)\right)$ & $O\left(\log_s(n) \mathscr{Q}_S(n)\right)$ & $O\left(s \log_s(n) \mathscr{Q}_S(n)\right)$\\ \hline -% % \textbf{Best-case Query} & & & \\ \hline -% % \textbf{Worst-case Insert} & & & \\ \hline -% % \textbf{Amortized Insert} & & & \\ \hline - -% \caption{Comparison of cost functions for various reconstruction policies for DSPs} -% \label{tab:policy-comp} -% \end{table*} - \section{Experimental Evaluation} In the previous sections, we mathematically proved various claims about @@ -519,18 +630,19 @@ tree structure is merge-decomposable using a sorted-array merge, with a build cost of $B_M(n) \in \Theta(n \log k)$, where $k$ is the number of structures being merged. The VPTree, by contrast, is \emph{not} merge decomposable, and is built in $B(n) \in \Theta(n \log n)$ time. We -use the $200,000,000$ record SOSD \texttt{OSM} dataset~\cite{sosd} for +use the $200,000,000$ record SOSD \texttt{OSM} dataset~\cite{sosd-datasets} for ISAM testing, and the $1,000,000$ record, $300$-dimensional Spanish Billion Words (\texttt{SBW}) dataset~\cite{sbw} for VPTree testing. 
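To illustrate what merge-decomposability buys here, the following sketches the kind of heap-based sorted-run merge that yields the $B_M(n) \in \Theta(n \log k)$ build cost: each of the $n$ output records costs one $O(\log k)$ heap operation. This is an illustration of the merge step only, not the actual shard-construction code used in our implementation.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// k-way merge of k sorted runs into one sorted output, using a min-heap
// keyed on the head of each run. n total records are merged with
// Theta(n log k) comparisons.
std::vector<int> kway_merge(const std::vector<std::vector<int>> &runs) {
    using Entry = std::tuple<int, std::size_t, std::size_t>;  // (value, run, offset)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Seed the heap with the first element of each non-empty run.
    for (std::size_t r = 0; r < runs.size(); r++) {
        if (!runs[r].empty()) {
            heap.emplace(runs[r][0], r, 0);
        }
    }

    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, r, i] = heap.top();  // smallest remaining head
        heap.pop();
        out.push_back(v);
        if (i + 1 < runs[r].size()) {
            heap.emplace(runs[r][i + 1], r, i + 1);  // O(log k) reinsert
        }
    }
    return out;
}
```

A structure like the VPTree admits no such merge of its shards, which is why its reconstructions must fall back to the full $\Theta(n \log n)$ rebuild.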
-For our first experiment, we will examine the latency distribution for
-inserts into our structures. We tested the three layout policies, using a
-common scale factor of $s=2$. This scale factor was selected to minimize
-its influence on the results (we've seen before in Sections~\ref{}
-and \ref{} that scale factor affects leveling and tiering in opposite
-ways) and isolate the influence of the layout policy alone to as great
-a degree as possible. We used a buffer size of $N_b=12000$ for the ISAM
-tree structure, and $N_B=1000$ for the VPTree.
+For our first experiment, we will examine the latency distribution
+for inserts into our structures. We tested the three layout policies,
+using a common scale factor of $s=2$. This scale factor was selected
+to minimize its influence on the results (we've seen before in
+Sections~\ref{ssec:ds-exp} and \ref{ssec:dyn-ds-exp} that scale factor
+affects leveling and tiering in opposite ways) and isolate the influence
+of the layout policy alone to as great a degree as possible. We used a
+buffer size of $N_B=12000$ for the ISAM tree structure, and $N_B=1000$
+for the VPTree.

We generated this distribution by inserting $30\%$ of the records from
the set to ``warm up'' the dynamized structure, and then measuring the
@@ -542,6 +654,7 @@ to examine the latency distribution, not the values themselves, and so
this is not a significant limitation for our analysis.

\begin{figure}
+\centering
\subfloat[ISAM Tree Insertion Latencies]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:design-isam-ins-dist}}
\subfloat[VPTree Insertion Latencies]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:design-vptree-ins-dist}} \\
\caption{Insertion Latency Distributions for Layout Policies}
@@ -549,23 +662,22 @@
\end{figure}

The resulting distributions are shown in
-Figure~\ref{design-policy-ins-latency}.
These distributions are
+Figure~\ref{fig:design-policy-ins-latency}. These distributions are
represented using a ``reversed'' CDF with log scaling on both
axes. This representation has proven very useful for interpreting the
latency distributions that we see in evaluating dynamization, but is
slightly unusual, and so we've included a guide to interpreting these
charts
-in Appendix\ref{append:rcdf}.
-
-The first notable point is that, for both the ISAM tree
-in Figure~\ref{fig:design-isam-ins-dist} and VPTree in
-Figure~\ref{fig:design-vptree-ins-dist}, the Leveling
-policy results in a measurable lower worst-case insertion
-latency. This result is in line with our theoretical analysis in
-Section~\ref{ssec:design-leveling-proofs}. However, there is a major
-deviation from theoretical in the worst-case performance of Tiering
-and BSM. Both of these should have similar worst-case latencies, as
-the worst-case reconstruction in both cases involves every record in
-the structure. Yet, we see tiering consistently performing better,
+in Appendix~\ref{append:rcdf}.
+
+The first notable point is that, for both the ISAM
+tree in Figure~\ref{fig:design-isam-ins-dist} and VPTree in
+Figure~\ref{fig:design-vptree-ins-dist}, the Leveling policy results in a
+measurably lower worst-case insertion latency. This result is in line with
+our theoretical analysis in Section~\ref{sec:design-asymp}. However, there
+is a major deviation from the theoretical prediction in the worst-case
+performance of Tiering and BSM. Both of these should have similar
+worst-case latencies, as the worst-case reconstruction in both cases
+involves every record in the structure. Yet, we see tiering consistently
+performing better,
particularly for the ISAM tree.

The reason for this has to do with the way that the records are
@@ -585,17 +697,27 @@ of a role. Having the records more partitioned still hurts performance,
due to cache effects most likely, but less so than in the MDSP case.
\begin{figure}
-
+\centering
+\subfloat[ISAM Tree]{\includegraphics[width=.5\textwidth]{img/design-space/isam-tput.pdf} \label{fig:design-isam-tput}}
+\subfloat[VPTree]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:design-vptree-tput}} \\
\caption{Insertion Throughput for Layout Policies}
\label{fig:design-ins-tput}
\end{figure}

-Next, in Figure~\ref{fig:design-ins-tput}, we show the overall
-insertion throughput for the three policies. This result should
-correlate with the amortized insertion costs for each policy derived in
-Section~\ref{sec:design-asym}. As expected, tiering has the highest
-throughput.
-
+Next, in Figure~\ref{fig:design-ins-tput}, we show the overall insertion
+throughput for the three policies for both ISAM Tree and VPTree. This
+result should correlate with the amortized insertion costs for each
+policy derived in Section~\ref{sec:design-asymp}. At a scale factor of
+$s=2$, all three policies have similar insertion performance. This makes
+sense, as both leveling and Bentley-Saxe experience write amplification
+proportional to the scale factor, and at $s=2$ this isn't significantly
+larger than tiering's write amplification, particularly compared
+to the other factors influencing insertion performance, such as
+reconstruction time. However, for larger scale factors, tiering shows
+\emph{significantly} higher insertion throughput, and Leveling and
+Bentley-Saxe show greatly degraded performance due to the large amount
+of additional write amplification. These results are perfectly in line
+with the mathematical analysis of the previous section.

\subsection{General Insert vs. Query Trends}

@@ -606,11 +728,42 @@ respective insertion throughputs and query latencies for both ISAM Tree
and VPTree.
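The scale-factor dependence of insertion cost can be made concrete from the amortized expressions in Table~\ref{tab:policy-comp}. Dividing leveling's amortized insertion cost by tiering's, the common $\frac{B(n)}{n}$ and $\log_s n$ factors cancel (a comparison of leading terms only, ignoring the constants hidden by the asymptotic notation):
\begin{equation*}
\frac{\frac{B(n)}{n} \cdot \frac{1}{2}(s-1)\log_s n}{\frac{B(n)}{n} \cdot \log_s n} = \frac{s-1}{2}
\end{equation*}
At $s=2$ the two leading terms are within a small constant of one another, while at $s=8$ leveling performs roughly $3.5\times$ the amortized work per insert, consistent with the throughput gap observed above.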
\begin{figure}
-
+\centering
+\subfloat[ISAM Tree Range Count]{\includegraphics[width=.5\textwidth]{img/design-space/isam-parm-sweep.pdf} \label{fig:design-isam-tradeoff}}
+\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/selectivity-sweep.pdf} \label{fig:design-knn-tradeoff}} \\
\caption{Insertion Throughput vs. Query Latency}
\label{fig:design-tradeoff}
\end{figure}

+Figure~\ref{fig:design-isam-tradeoff} shows the trade-off curve between
+insertion throughput and query latency for range count queries executed
+against a dynamized ISAM tree. This test was run with a dataset
+of 500 million uniform integer keys, and a selectivity of $\sigma =
+0.0000001$; the scale factor associated with each point is annotated on
+the plot. These results show that there is a very direct relationship
+between scale factor, layout policy, and insertion throughput. Leveling
+almost universally has lower insertion throughput but also lower
+query latency than tiering does, though at scale factor $s=2$ they are
+fairly similar. Tiering gains insertion throughput at the cost of query
+performance as the scale factor increases, although the rate at which
+the insertion performance improves decreases for larger scale factors,
+and the rate at which query performance declines increases dramatically.
+
+One interesting note is that leveling sees very little improvement in
+query latency as the scale factor is increased. This is due to the fact
+that, asymptotically, the scale factor only affects leveling's query
+performance by increasing the base of a logarithm. Thus, small increases
+in scale factor have very little effect. However, leveling's insertion
+performance degrades linearly with scale factor, and this is well
+demonstrated in the plot.
+
+The Bentley-Saxe method appears to follow a very similar trend to that
+of leveling, albeit with even more dramatic performance degradation as
+the scale factor is increased.
Generally it seems to be a strictly worse +alternative to leveling in all but its best-case query cost, and we will +omit it from our tests moving forward as a result. + + \subsection{Query Size Effects} One potentially interesting aspect of decomposition-based dynamization @@ -632,107 +785,152 @@ framework to see at what points the query latencies begin to converge. We also tested $k$-NN queries with varying values of $k$. \begin{figure} -\caption{Query "Size" Effect Analysis} +\centering +\subfloat[ISAM Tree Range Count]{\includegraphics[width=.5\textwidth]{img/design-space/selectivity-sweep.pdf} \label{fig:design-isam-sel}} +\subfloat[VPTree $k$-NN]{\includegraphics[width=.5\textwidth]{img/design-space/selectivity-sweep.pdf} \label{fig:design-knn-sel}} \\ +\caption{Query Result Size Effect Analysis} \label{fig:design-query-sze} \end{figure} -\section{Asymptotically Relevant Trade-offs} - -Thus far, we have considered a configuration system that trades in -constant factors only. In general asymptotic analysis, all possible -configurations of our framework in this scheme collapse to the same basic -cost functions when the constants are removed. While we have demonstrated -that, in practice, the effects of this configuration are measurable, there -do exist techniques in the classical literature that provide asymptotically -relevant trade-offs, such as the equal block method~\cite{maurer80} and -the mixed method~\cite[pp. 117-118]{overmars83}. These techniques have -cost functions that are derived from arbitrary, positive, monotonically -increasing functions of $n$ that govern various ways in which the data -structure is partitioned, and changing the selection of function allows -for "tuning" the performance. However, to the best of our knowledge, -these techniques have never been implemented, and no useful guidance in -the literature exists for selecting these functions. - -However, it is useful to consider the general approach of these -techniques. 
They accomplish asymptotically relevant trade-offs by tying -the decomposition of the data structure directly to a function of $n$, -the number of records, in a user-configurable way. We can import a similar -concept into our already existing configuration framework for dynamization -to enable similar trade-offs, by replacing the constant scale factor, -$s$, with some function $s(n)$. However, we must take extreme care when -doing this to select a function that doesn't catastrophically impair -query performance. - -Recall that, generally speaking, our dynamization technique requires -multiplying the cost function for the data structure being dynamized by -the number of shards that the data structure has been decomposed into. For -search problems that are solvable in sub-polynomial time, this results in -a worst-case query cost of, -\begin{equation} -\mathscr{Q}(n) \in O(S(n) \cdot \mathscr{Q}_S(n)) -\end{equation} -where $S(n)$ is the number of shards and, for our framework, is $S(n) \in -O(s \log_s n)$. The user can adjust $s$, but this tuning does not have -asymptotically relevant consequences. Unfortunately, there is not much -room, practically, for adjustment. If, for example, we were to allow the -user to specify $S(n) \in \Theta(n)$, rather than $\Theta(\log n)$, then -query performance would be greatly impaired. We need a function that is -sub-linear to ensure useful performance. - -To accomplish this, we proposed adding a second scaling factor, $k$, such -that the number of records on level $i$ is given by, -\begin{equation} -\label{eqn:design-k-expr} -N_B \cdot \left(s \log_2^k(n)\right)^{i} -\end{equation} -with $k=0$ being equivalent to the configuration space we have discussed -thus far. The addition of $k$ allows for the dependency of the number of -shards on $n$ to be slightly biased upwards or downwards, in a way that -\emph{does} show up in the asymptotic analysis for inserts and queries, -but also ensures sub-polynomial additional query cost. 
-
-In particular, we prove the following asymptotic properties of this
-configuration.
-\begin{theorem}
-The worst-case query latency of a dynamization scheme where the
-capacity of each level is provided by Equation~\ref{eqn:design-k-expr} is
-\begin{equation}
-\mathscr{Q}(n) \in O\left(\left(\frac{\log n}{\log (k \log n))}\right) \cdot \mathscr{Q}_S(n)\right)
-\end{equation}
-\end{theorem}
-\begin{proof}
-The number of levels within the structure is given by $\log_s (n)$,
-where $s$ is the scale factor. The addition of $k$ to the parametrization
-replaces this scale factor with $s \log^k n$, and so we have
-\begin{equation*}
-\log_{s \log^k n}n = \frac{\log n}{\log\left(s \log^k n\right)} = \frac{\log n}{\log s + \log\left(k \log n\right)} \in O\left(\frac{\log n}{\log (k \log n)}\right)
-\end{equation*}
-by the application of various logarithm rules and change-of-base formula.
-
-The cost of a query against a decomposed structure is $O(S(n) \cdot \mathscr{Q}_S(n))$, and
-there are $\Theta(1)$ shards per level. Thus, the worst case query cost is
-\begin{equation*}
-\mathscr{Q}(n) \in O\left(\left(\frac{\log n}{\log (k \log n))}\right) \cdot \mathscr{Q}_S(n)\right)
-\end{equation*}
-\end{proof}
-
-\begin{theorem}
-The amortized insertion cost of a dynamization scheme where the capacity of
-each level is provided by Equation~\ref{eqn:design-k-expr} is,
-\begin{equation*}
-I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot \frac{\log n}{\log ( k \log n)}\right)
-\end{equation*}
-\end{theorem}
-\begin{proof}
-\end{proof}
-
-\subsection{Evaluation}
-
-In this section, we'll access the effect that modifying $k$ in our
-new parameter space has on the insertion and query performance of our
-dynamization framework.
+Interestingly, for the range of selectivities tested for range counts, the
+overall query latency failed to converge, and there remains a consistent,
+albeit slight, stratification amongst the tested policies, as shown in
+Figure~\ref{fig:design-isam-sel}.
As the selectivity continues to rise
+above those shown in the chart, the relative ordering of the policies
+remains the same, but the relative differences between them begin to
+shrink. This result makes sense given the asymptotics--there is still
+\emph{some} overhead associated with the decomposition, but as the cost
+of the query approaches linear, it makes up an increasingly irrelevant
+portion of the runtime.
+
+The $k$-NN results in Figure~\ref{fig:design-knn-sel} show a slightly
+different story. This is also not surprising, because $k$-NN is a
+$C(n)$-decomposable problem, and the cost of result combination grows
+with $k$. Thus, larger $k$ values will \emph{increase} the effect that
+the decomposition has on the query runtime, unlike the case of the
+range count queries, where the total cost of the combination is constant.
+
+% \section{Asymptotically Relevant Trade-offs}
+
+% Thus far, we have considered a configuration system that trades in
+% constant factors only. In general asymptotic analysis, all possible
+% configurations of our framework in this scheme collapse to the same basic
+% cost functions when the constants are removed. While we have demonstrated
+% that, in practice, the effects of this configuration are measurable, there
+% do exist techniques in the classical literature that provide asymptotically
+% relevant trade-offs, such as the equal block method~\cite{maurer80} and
+% the mixed method~\cite[pp. 117-118]{overmars83}. These techniques have
+% cost functions that are derived from arbitrary, positive, monotonically
+% increasing functions of $n$ that govern various ways in which the data
+% structure is partitioned, and changing the selection of function allows
+% for ``tuning'' the performance. However, to the best of our knowledge,
+% these techniques have never been implemented, and no useful guidance in
+% the literature exists for selecting these functions.
+ +% However, it is useful to consider the general approach of these +% techniques. They accomplish asymptotically relevant trade-offs by tying +% the decomposition of the data structure directly to a function of $n$, +% the number of records, in a user-configurable way. We can import a similar +% concept into our already existing configuration framework for dynamization +% to enable similar trade-offs, by replacing the constant scale factor, +% $s$, with some function $s(n)$. However, we must take extreme care when +% doing this to select a function that doesn't catastrophically impair +% query performance. + +% Recall that, generally speaking, our dynamization technique requires +% multiplying the cost function for the data structure being dynamized by +% the number of shards that the data structure has been decomposed into. For +% search problems that are solvable in sub-polynomial time, this results in +% a worst-case query cost of, +% \begin{equation} +% \mathscr{Q}(n) \in O(S(n) \cdot \mathscr{Q}_S(n)) +% \end{equation} +% where $S(n)$ is the number of shards and, for our framework, is $S(n) \in +% O(s \log_s n)$. The user can adjust $s$, but this tuning does not have +% asymptotically relevant consequences. Unfortunately, there is not much +% room, practically, for adjustment. If, for example, we were to allow the +% user to specify $S(n) \in \Theta(n)$, rather than $\Theta(\log n)$, then +% query performance would be greatly impaired. We need a function that is +% sub-linear to ensure useful performance. + +% To accomplish this, we proposed adding a second scaling factor, $k$, such +% that the number of records on level $i$ is given by, +% \begin{equation} +% \label{eqn:design-k-expr} +% N_B \cdot \left(s \log_2^k(n)\right)^{i} +% \end{equation} +% with $k=0$ being equivalent to the configuration space we have discussed +% thus far. 
The addition of $k$ allows for the dependency of the number of
+% shards on $n$ to be slightly biased upwards or downwards, in a way that
+% \emph{does} show up in the asymptotic analysis for inserts and queries,
+% but also ensures sub-polynomial additional query cost.
+
+% In particular, we prove the following asymptotic properties of this
+% configuration.
+% \begin{theorem}
+% The worst-case query latency of a dynamization scheme where the
+% capacity of each level is provided by Equation~\ref{eqn:design-k-expr} is
+% \begin{equation}
+% \mathscr{Q}(n) \in O\left(\frac{\log n}{\log (k \log n)} \cdot \mathscr{Q}_S(n)\right)
+% \end{equation}
+% \end{theorem}
+% \begin{proof}
+% The number of levels within the structure is given by $\log_s (n)$,
+% where $s$ is the scale factor. The addition of $k$ to the parametrization
+% replaces this scale factor with $s \log^k n$, and so we have
+% \begin{equation*}
+% \log_{s \log^k n}n = \frac{\log n}{\log\left(s \log^k n\right)} = \frac{\log n}{\log s + k \log \log n} \in O\left(\frac{\log n}{\log (k \log n)}\right)
+% \end{equation*}
+% by the application of various logarithm rules and the change-of-base formula.
+
+% The cost of a query against a decomposed structure is $O(S(n) \cdot \mathscr{Q}_S(n))$, and
+% there are $\Theta(1)$ shards per level.
Thus, the worst-case query cost is
+% \begin{equation*}
+% \mathscr{Q}(n) \in O\left(\frac{\log n}{\log (k \log n)} \cdot \mathscr{Q}_S(n)\right)
+% \end{equation*}
+% \end{proof}
+
+% \begin{theorem}
+% The amortized insertion cost of a dynamization scheme where the capacity of
+% each level is provided by Equation~\ref{eqn:design-k-expr} is,
+% \begin{equation*}
+% I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot \frac{\log n}{\log ( k \log n)}\right)
+% \end{equation*}
+% \end{theorem}
+% \begin{proof}
+% Each record is rewritten at most once per level over the course of the
+% reconstructions in which it participates, at an amortized cost of
+% $\Theta\left(\frac{B(n)}{n}\right)$ per level. By the argument of the
+% previous theorem, the structure contains $\Theta\left(\frac{\log n}{\log (k \log n)}\right)$
+% levels, and multiplying these two quantities yields the stated bound.
+% \end{proof}
+
+% \subsection{Evaluation}
+
+% In this section, we'll assess the effect that modifying $k$ in our
+% new parameter space has on the insertion and query performance of our
+% dynamization framework.
 
 \section{Conclusion}
 
+In this chapter, we considered the proposed design space for our
+dynamization framework both mathematically and experimentally, and derived
+some general principles for configuration within the space. We generalized
+the Bentley-Saxe method to support scale factors and buffering, but
+found that the result was strictly worse than leveling in all but its
+best-case query performance. We also showed that there does exist a
+trade-off, mediated by scale factor, between insertion performance and
+query performance for the tiering layout policy. Unfortunately, the
+leveling layout policy does not have a particularly useful trade-off
+in this area because the cost in insertion performance grows far faster
+than any query performance benefit, due to the way the two effects scale
+in the cost functions for the method.
+
+Broadly speaking, we can draw a few general conclusions. First, the
+leveling layout policy is better than tiering for query latency in
+all configurations, but worse in insertion performance. Leveling also
+has the best insertion tail latency performance by a small margin,
+owing to the way it performs reconstructions.
Tiering, however,
+has significantly better insertion performance and can be configured
+with query performance that is similar to leveling. These results are
+aligned with the smaller-scale parameter testing done in the previous
+chapters, which landed on tiering as a good general solution for most
+cases. Tiering also has the advantage of meaningful tuning through scale
+factor adjustment.

diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index c3cf5b7..0cdeeab 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -142,9 +142,9 @@ re-partitioning is used, the worst case cost rises to the now familiar $I(n)
 \begin{figure}
 \centering
 \subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}}
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf} \label{fig:tl-ebm-tail-latency}} \\
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-latency-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\
 
-\caption{The equal block method with $f(n) = C$ for varying values of C. \textbf{Plots not yet populated}}
+\caption{The equal block method with $f(n) = C$ for varying values of $C$.}
 \label{fig:tl-ebm}
 \end{figure}
 
@@ -526,6 +526,7 @@ are done in time to maintain the block bound given $\log n$ parallel threads.
 
 \section{Implementation}
+\label{sec:tl-impl}
 
 The previous section demonstrated that, theoretically, it is possible
 to meaningfully control the tail latency of our dynamization system by
@@ -918,7 +919,44 @@ space associated with the parameter to the user.
 \section{Evaluation}
 \label{sec:tl-eval}
 
-\subsection{}
+In this section, we perform several experiments to evaluate the ability of
+the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
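As a note on methodology, the latency-distribution figures that follow are produced by collecting a per-insert latency log and reducing it to quantiles. A minimal sketch of that post-processing is below; the helper name and the synthetic workload are illustrative assumptions, not the framework's actual benchmarking code:

```python
import random

def latency_percentiles(samples, qs=(0.5, 0.99, 0.999)):
    """Return selected quantiles of a latency sample (nearest-rank style).

    Hypothetical helper: sketches how a raw per-insert latency log can be
    reduced to the tail-latency summaries plotted in the figures.
    """
    xs = sorted(samples)
    return {q: xs[min(int(q * len(xs)), len(xs) - 1)] for q in qs}

# Synthetic workload: mostly cheap inserts, plus a few long stalls standing
# in for buffer flushes that block behind an expensive reconstruction.
random.seed(42)
lat = [random.expovariate(1.0) for _ in range(10_000)] + [500.0] * 10
print(latency_percentiles(lat))
```

The interesting signal in such a summary is the gap between the median and the extreme quantiles, which is exactly what the stall-proportion mechanism is meant to control.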
+
+\subsection{Stall Proportion Sweep}
+
+\begin{figure}
+\centering
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-insert}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\
+\caption{Insertion and Shard Count Distributions for ISAM with 200M Records}
+\label{fig:tl-stall-200m}
+\end{figure}
+
+First, we will consider the insertion and query performance of our system
+at a variety of stall proportions.
+
+\begin{figure}
+\centering
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-insert}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\
+\caption{Insertion and Shard Count Distributions for ISAM with 4B Records}
+\label{fig:tl-stall-4b}
+\end{figure}
+
+\begin{figure}
+\centering
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-insert}}
+\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\
+\caption{Insertion and Shard Count Distributions for VPTree}
+\label{fig:tl-stall-knn}
+\end{figure}
+
+\begin{figure}
+\centering
+\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf}
+\caption{Insertion Throughput vs. Query Latency for ISAM with 200M Records}
+\label{fig:tl-latency-curve}
+\end{figure}
 
 \section{Conclusion}
-- 
cgit v1.2.3