 chapters/beyond-dsp.tex            |  22
 chapters/dynamization.tex          |   2
 chapters/sigmod23/background.tex   |   4
 chapters/sigmod23/examples.tex     |  10
 chapters/sigmod23/exp-baseline.tex |  12
 chapters/sigmod23/experiment.tex   |  18
 chapters/sigmod23/extensions.tex   |   4
 chapters/sigmod23/framework.tex    |  16
 chapters/sigmod23/introduction.tex |   2
 chapters/tail-latency.tex          | 329
 10 files changed, 216 insertions(+), 203 deletions(-)
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 74afdd2..5655b8c 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -1664,10 +1664,10 @@ compaction is triggered.
 We configured our dynamized structure to use $s=8$, $N_B=12000$,
 $\delta = .05$, $f = 16$, and the tiering layout policy. We compared our method
-(\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+Tree with
+(\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+tree with
 aggregate weight counts (\textbf{AGG B+Tree}), as well as our bespoke
 sampling solution from the previous chapter (\textbf{Bespoke}) and a
-single static instance of the ISAM Tree (\textbf{ISAM}). Because IRS
+single static instance of the ISAM tree (\textbf{ISAM}). Because IRS
 is neither INV nor DDSP, the standard Bentley-Saxe Method has no way
 to support deletes for it, and was not tested. All of our tested sampling
 queries had a controlled selectivity of $\sigma = 0.01\%$ and $k=1000$.
@@ -1692,7 +1692,7 @@ the dynamic baseline.
 Finally, Figure~\ref{fig:irs-space} shows the space usage of the
 data structures, less the storage required for the raw data. The two
 dynamized solutions require \emph{significantly} less storage than the
-dynamic B+Tree, which must leave empty spaces in its nodes for inserts.
+dynamic B+tree, which must leave empty spaces in its nodes for inserts.
 This is a significant advantage of static data structures--they can pack
 data much more tightly and require less storage. Dynamization, at least
 in this case, doesn't add a significant amount of overhead over a single
@@ -1701,7 +1701,7 @@ instance of the static structure.
 \subsection{$k$-NN Search}
 \label{ssec:dyn-knn-exp}
 Next, we'll consider answering high dimensional exact $k$-NN queries
-using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a
+using a static vantage point tree (VPTree)~\cite{vptree}. This is a
 binary search tree with internal nodes that partition records based
 on their distance to a selected point, called the vantage point. All
 of the points within a fixed distance of the vantage point are covered
@@ -1746,10 +1746,10 @@ standard DDSP, we compare with the Bentley-Saxe Method (\textbf{BSM})\footnote{
     be deleted in $\Theta(1)$ time, rather than requiring an inefficient
     point-lookup directly on the VPTree.
 } and a dynamic data structure for the same search problem called an
-M-Tree~\cite{mtree,mtree-impl} (\textbf{MTree}), which is an example of a so-called
+M-tree~\cite{mtree,mtree-impl} (\textbf{MTree}), which is an example of a so-called
 "ball tree" structure that partitions high dimensional space using nodes
 representing spheres, which are merged and split to maintain balance in
-a manner not unlike a B+Tree. We also consider a static instance of a
+a manner not unlike a B+tree. We also consider a static instance of a
 VPTree built over the same set of records (\textbf{VPTree}). We used
 L2 distance as our metric, which is defined for vectors of $d$
 dimensions as
@@ -1784,7 +1784,7 @@ which are biased towards better insertion performance.
 Both dynamized
 structures also outperform the dynamic baseline. Finally, as is becoming
 a trend, Figure~\ref{fig:knn-space} shows that the storage requirements
 of the static data structures, dynamized or not, are significantly less
-than M-Tree. M-Tree, like a B+Tree, requires leaving empty slots in its
+than M-tree. M-tree, like a B+tree, requires leaving empty slots in its
 nodes to support insertion, and this results in a large amount of
 wasted space.
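The hunk above ends exactly where the text introduces its distance metric, so the equation itself falls outside the captured context. For reference, the standard L2 (Euclidean) distance over $d$-dimensional vectors, which is presumably what the elided equation states, is:

```latex
% Standard Euclidean (L2) distance; this is the usual textbook
% definition, given here only because the hunk boundary cuts off the
% thesis's own equation.
\begin{equation*}
    d(x, y) = \sqrt{\sum_{i=1}^{d} \left(x_i - y_i\right)^2}
\end{equation*}
```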
@@ -1810,7 +1810,7 @@ We apply our framework to create dynamized versions of two static
 learned indices: Triespline~\cite{plex} (\textbf{DE-TS}) and PGM~\cite{pgm}
 (\textbf{DE-PGM}), and compare with a standard Bentley-Saxe dynamization
 of Triespline (\textbf{BSM-TS}). Our dynamic baselines are ALEX~\cite{alex},
-which is dynamic learned index based on a B+Tree like structure, and
+which is a dynamic learned index based on a B+tree-like structure, and
 PGM (\textbf{PGM}), which provides support for a dynamic version based
 on Bentley-Saxe dynamization (which is why we have not included a BSM
 version of PGM in our testing).
@@ -1885,7 +1885,7 @@ support does in its own update-optimized configuration.\footnote{
 these data structures. All of the dynamic options require significantly
 more space than the static Triespline, but ALEX requires the most by a
 very large margin. This is in keeping with the previous experiments, which
-all included similarly B+Tree-like structures that required significant
+all included similar B+tree-like structures that required significant
 additional storage space compared to static structures as part of their
 update support.
@@ -1966,7 +1966,7 @@ this test.
 In this benchmark, we used a single thread to insert records
 into the structure at a constant rate, while we deployed a variable
 number of additional threads that continuously issued sampling queries
-against the structure. We used an AGG B+Tree as our baseline. Note
+against the structure. We used an AGG B+tree as our baseline. Note
 that, to accurately maintain the aggregate weight counts as records
 are inserted, it is necessary that each operation obtain a lock on
 the root node of the tree~\cite{zhao22}. This makes this situation
@@ -1974,7 +1974,7 @@ a good use-case for the automatic concurrency support provided by our
 framework.
 Figure~\ref{fig:irs-concurrency} shows the results of this benchmark
 for various numbers of concurrent query threads. As can be seen, our
 framework supports a stable update throughput up to 32 query threads,
-whereas the AGG B+Tree suffers from contention for the mutex and sees
+whereas the AGG B+tree suffers from contention for the mutex and sees
 its performance degrade as the number of threads increases.
 
 \begin{figure}
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
index 1e0d3e2..085ce65 100644
--- a/chapters/dynamization.tex
+++ b/chapters/dynamization.tex
@@ -1499,7 +1499,7 @@ but it has a number of serious problems,
     is not always possible to do this efficiently, particularly
     for structures whose construction involves multiple stages
     (e.g., a sorting phase followed by a recursive node construction phase,
-    like in a B+Tree) with non-trivially predictable operation counts.
+    like in a B+tree) with non-trivially predictable operation counts.
 
     \item Even if the reconstruction process can be efficiently
     sub-divided, implementing the technique requires \emph{significant}
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 42a52de..984e36c 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -226,7 +226,7 @@ structure, a technique called \emph{alias augmentation}~\cite{tao22}.
 For example, alias augmentation can be used to construct an SSI capable
 of answering WIRS queries in $\Theta(\log n + k)$~\cite{afshani17,tao22}.
 This structure breaks the data into multiple disjoint partitions of size
-$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree
+$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+tree
 is then built, using the augmented partitions as its leaf nodes. Each
 internal node is also augmented with an alias structure over the aggregate
 weights associated with the children of each pointer. Constructing this
@@ -266,7 +266,7 @@ following desiderata,
 \begin{enumerate}
     \item Support data updates (including deletes) with similar average
-    performance to a standard B+Tree.
+    performance to a standard B+tree.
     \item Support IQS queries that do not pay a per-sample cost
     proportional to some function of the data size. In other words,
     $k$ should \emph{not} be multiplied by any function of $n$
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index 4e7f9ac..32807e1 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -74,7 +74,7 @@ makes progress towards removing it.
 \subsection{Independent Range Sampling (ISAM Tree)}
 \label{ssec:irs-struct}
 We will next consider independent range sampling. For this decomposable
-sampling problem, we use the ISAM Tree for the SSI. Because our shards are
+sampling problem, we use the ISAM tree for the SSI. Because our shards are
 static, we can build highly compact and efficient ISAM trees by storing
 the records directly in a sorted array. So long as the leaf node size is a
 multiple of the record size, this array can be treated as a sequence of
@@ -106,7 +106,7 @@ operations are,
     \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
 \end{align*}
 where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
-for tombstones and $f$ is the fanout of the ISAM Tree.
+for tombstones, and $f$ is the fanout of the ISAM tree.
 
 \subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
 
@@ -114,13 +114,13 @@
 \label{ssec:wirs-struct}
 As a final example of applying this framework, we consider WIRS. This
 is a decomposable sampling problem that can be answered using the
-alias-augmented B+Tree structure~\cite{tao22, afshani17,hu14}. This
+alias-augmented B+tree structure~\cite{tao22, afshani17, hu14}. This
 data structure is built over sorted data, but can be bulk-loaded from
 this data in linear time, resulting in costs of $B(n) \in \Theta(n \log
 n)$ and $B_M(n) \in \Theta(n)$, though the constant factors associated
 with these functions are quite high, as each bulk-loading requires multiple
-linear-time operations for building both the B+Tree and the alias
-structures, among other things. As it is built on a B+Tree, the structure
+linear-time operations for building both the B+tree and the alias
+structures, among other things. As it is built on a B+tree, the structure
 supports $L(n) \in \Theta(\log n)$ point lookups. Answering sampling
 queries requires $P(n) \in \Theta(\log n)$ pre-processing time to
 establish the query interval, during which the weight of the interval
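The hunks above repeatedly price out the alias structure~\cite{walker74}, the SSI underlying the WSS (and, once augmented, the WIRS) examples. To make the constant-time sampling claim concrete, here is a minimal C++ sketch of drawing from a prebuilt alias table; the usual $O(n)$ Vose construction is assumed, and the names and layout are illustrative rather than the thesis's implementation.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch of Walker's alias method: after an O(n) build, each weighted
// sample costs O(1). prob[i] is the probability of keeping bucket i;
// alias[i] is the fallback record when the biased coin flip fails.
struct AliasTable {
    std::vector<double> prob;
    std::vector<std::size_t> alias;

    template <typename RNG>
    std::size_t sample(RNG &rng) const {
        std::uniform_int_distribution<std::size_t> bucket(0, prob.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::size_t i = bucket(rng);                // pick a bucket uniformly
        return coin(rng) < prob[i] ? i : alias[i];  // keep it, or take its alias
    }
};
```

Two uniform draws per sample, independent of $n$, which is exactly why the structure is attractive as an SSI despite its comparatively expensive construction.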
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
index d0e1ce0..4ae744b 100644
--- a/chapters/sigmod23/exp-baseline.tex
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -1,7 +1,7 @@
 \subsection{Comparison to Baselines}
 
 Next, we compared the performance of our dynamized sampling indices with
-Olken's method on an aggregate B+Tree. We also examine the query performance
+Olken's method on an aggregate B+tree. We also examine the query performance
 of a single instance of the SSI in question to establish how much query
 performance is lost in the dynamization. Unless otherwise specified, IRS
 and WIRS queries are run with a selectivity of $0.1\%$. Additionally,
@@ -51,15 +51,15 @@ resulting in better performance.
 In Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} we examine
 the performance of \texttt{DE-WIRS} compared to \texttt{AGG B+tree} and an
-alias-augmented B+Tree. We see the same basic set of patterns in this
+alias-augmented B+tree. We see the same basic set of patterns in this
 case as we did with WSS. \texttt{AGG B+Tree} defeats our dynamized
 index on the \texttt{twitter} dataset, but loses on the others, in
 terms of insertion performance. We can see that the alias-augmented
-B+Tree is much more expensive to build than an alias structure, and
+B+tree is much more expensive to build than an alias structure, and
 so its insertion performance advantage is eroded somewhat compared to
 the dynamic structure. For queries we see that the \texttt{AGG B+Tree}
 performs similarly for WIRS sampling as it did for WSS sampling, but the
-alias-augmented B+Tree structure is quite a bit slower at WIRS than the
+alias-augmented B+tree structure is quite a bit slower at WIRS than the
 alias structure was at WSS. This results in \texttt{DE-WIRS} defeating
 the dynamic baseline by less of a margin in this test, but it still is
 superior in terms of sampling performance, and is still quite close in
@@ -81,7 +81,7 @@ being introduced by the dynamization.
 We next considered IRS queries. Figures~\ref{fig:irs-insert1} and
 \ref{fig:irs-sample1} show the results of our testing of single-threaded
-\texttt{DE-IRS} running in-memory against the in-memory ISAM Tree and
+\texttt{DE-IRS} running in-memory against the in-memory ISAM tree and
 \texttt{AGG B+tree}. The ISAM tree structure can be efficiently bulk-loaded,
 which results in a much faster construction time than the alias structure
 or alias-augmented B+tree. This gives it a significant update performance
@@ -112,7 +112,7 @@ to answer queries.
 However, as the sample set size increases, this cost
 increasingly begins to pay off, with \texttt{DE-IRS} quickly defeating
 the dynamic structure in average per-sample latency. One other interesting
 note is the performance of the static ISAM tree, which begins on-par with
-the B+Tree, but also sees an improvement as the sample set size increases.
+the B+tree, but also sees an improvement as the sample set size increases.
 This is because of cache effects. During the initial tree traversal, both
 the B+tree and ISAM tree have a similar number of cache misses. However,
 the ISAM tree needs to perform its traversal only once, and then samples
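For contrast with the dynamized SSIs, the \texttt{AGG B+Tree} baseline in these hunks samples by Olken's method: a root-to-leaf descent in which each child is chosen with probability proportional to its aggregate weight, costing one traversal per sample. A simplified sketch follows; the node layout is invented for illustration and is not the TLX-based implementation used in the experiments.

```cpp
#include <random>

// Olken-style sampling over a tree whose internal nodes carry aggregate
// subtree weights. Each sample walks root to leaf, selecting children
// in proportion to their weight, so the per-sample cost is O(log n).
struct Node {
    double weight;       // total weight of this subtree
    Node *children[8];   // unused slots are nullptr
    bool is_leaf;
    int record_id;       // valid only at leaves
};

template <typename RNG>
int olken_sample(const Node *n, RNG &rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    while (!n->is_leaf) {
        double r = dist(rng) * n->weight;
        const Node *child = n->children[0];
        for (int i = 0; i < 8 && n->children[i]; i++) {
            child = n->children[i];
            if (r < child->weight) break;  // land within this child's range
            r -= child->weight;
        }
        n = child;
    }
    return n->record_id;
}
```

This per-sample traversal is the $\log n$ factor on $k$ that the desiderata in background.tex explicitly rule out for the dynamized designs.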
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 1eb704c..14f59a7 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -60,7 +60,7 @@ following dynamized structures,
 \item \textbf{DE-WSS.} An implementation of the dynamized alias
 structure~\cite{walker74} for weighted set sampling discussed in
 Section~\ref{ssec:wss-struct}. We compare this against a WSS
-implementation of Olken's method on a B+Tree with aggregate weight tags
+implementation of Olken's method on a B+tree with aggregate weight tags
 (\textbf{AGG-BTree})~\cite{olken95}, based on the B+tree implementation
 in the TLX library~\cite{tlx}.
 
@@ -71,24 +71,24 @@ Section~\ref{ssec:ext-concurrency} and an external version from
 Section~\ref{ssec:ext-external}. We compare the external and concurrent
 versions against the AB-tree~\cite{zhao22}, and the single-threaded, in
 memory version was compared with an IRS implementation of Olken's
-method on an AGG-BTree.
+method on an \texttt{AGG-BTree}.
 
 \item \textbf{DE-WIRS.} An implementation of the dynamized alias-augmented
-B+Tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
+B+tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
 weighted independent range sampling. We compare this against a WIRS
-implementation of Olken's method on an AGG-BTree.
+implementation of Olken's method on an \texttt{AGG-BTree}.
 \end{itemize}
 
 All of the tested structures, with the exception of the external memory
-DE-IRS implementation and AB-Tree, were wholly contained within system
-memory. AB-Tree is a native external structure, so for the in-memory
+DE-IRS implementation and AB-tree, were wholly contained within system
+memory. AB-tree is a native external structure, so for the in-memory
 concurrency evaluation we configured it with enough cache to maintain the
 entire structure in memory to simulate an in-memory implementation.\footnote{
     Because of the nature of sampling queries, traditional
-    efficient locking techniques for B+Trees are not able to be
-    used~\cite{zhao22}. The alternatives were to run AB-Tree in this
-    manner, or to globally lock the B+Tree for every operation. We
+    efficient locking techniques for B+trees are not able to be
+    used~\cite{zhao22}. The alternatives were to run AB-tree in this
+    manner, or to globally lock the B+tree for every operation. We
     elected to use the former approach for this chapter. We used the
     latter approach in the next chapter.
 }
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 053c8e2..f77574d 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -19,7 +19,7 @@ to reside in memory, and the rest on disk.
 This allows for the smallest
 few shards, which sustain the most reconstructions, to reside in memory
 for performance, while storing most of the data on disk, in an attempt
 to get the best of both worlds, so to speak.\footnote{
-    In traditional LSM Trees, which are an external data structure,
+    In traditional LSM trees, which are an external data structure,
     only the memtable resides in memory. We have decided to break with
     this model because, for query performance reasons, the mutable
     buffer must remain small. By placing a few levels in memory, the
@@ -58,7 +58,7 @@ structure used in XDB~\cite{li19}.
 Because our dynamization technique
 is built on top of static data structures, a limited form of concurrency
 support is straightforward to implement. To that end, we created a
 proof-of-concept dynamization of an
-ISAM Tree for IRS based on a simplified version of a general concurrency
+ISAM tree for IRS based on a simplified version of a general concurrency
 control scheme for log-structured data stores~\cite{golan-gueta15}.
 First, we restrict ourselves to tombstone deletes. This ensures that
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index b3a8215..1eb2589 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -512,19 +512,19 @@ Though it has thus far gone unmentioned, some readers may have noted
 the astonishing similarity between decomposition-based dynamization
 techniques, and a data structure called the Log-structured Merge-tree.
 First proposed by O'Neil in the mid '90s~\cite{oneil96},
-the LSM Tree was designed to optimize write throughput for external data
+the LSM tree was designed to optimize write throughput for external data
 structures. It accomplished this task by buffering inserted records in a
-small in-memory AVL Tree, and then flushing this buffer to disk when
+small in-memory AVL tree, and then flushing this buffer to disk when
 it filled up. The flush process itself would fully rebuild the on-disk
-structure (a B+Tree), including all of the currently existing records
+structure (a B+tree), including all of the currently existing records
 on external storage. O'Neil also proposed a version which used several,
 layered, external structures, to reduce the cost of reconstruction.
 
-In more recent times, the LSM Tree has seen significant development and
+In more recent times, the LSM tree has seen significant development and
 been used as the basis for key-value stores like RocksDB~\cite{dong21}
 and LevelDB~\cite{leveldb}. This work has produced an incredibly
 large and well explored parametrization of the reconstruction
-procedures of LSM Trees, a good summary of which can be found in
+procedures of LSM trees, a good summary of which can be found in
 this recent tutorial paper~\cite{sarkar23}. Examples of this design
 space exploration include: different ways to organize each "level"
 of the tree~\cite{dayan19, dostoevsky, autumn}, different growth
@@ -534,7 +534,7 @@ auxiliary structures attached to the main ones for accelerating certain
 types of query~\cite{dayan18-1, zhu21, monkey}. This work is discussed
 in greater depth in Chapter~\ref{chap:related-work}.
 
-Many of the elements within the LSM Tree design space are based upon the
+Many of the elements within the LSM tree design space are based upon the
 specifics of the data structure itself, and are not applicable to our
 use case. However, some of the higher-level concepts can be imported and
 applied in the context of dynamization. Specifically, we have decided to
@@ -590,7 +590,7 @@ that can be used to help improve the performance of these searches,
 without requiring as much storage as adding auxiliary hash tables to
 every block, is to include bloom filters~\cite{bloom70}. A bloom filter
 is an approximate data structure that answers tests of set membership
-with bounded, single-sided error. These are commonly used in LSM Trees
+with bounded, single-sided error. These are commonly used in LSM trees
 to accelerate point lookups by allowing levels that don't contain the
 record being searched for to be skipped. In our case, we only care about
 tombstone records, so rather than building these filters over all records,
@@ -599,7 +599,7 @@ the sampling performance of the structure when tombstone deletes are used.
 \Paragraph{Layout Policy.} The Bentley-Saxe method considers
 blocks individually, without any other organization beyond increasing
-size. In contrast, LSM Trees have multiple layers of structural
+size. In contrast, LSM trees have multiple layers of structural
 organization. Record capacity restrictions are enforced on structures
 called \emph{levels}, which are partitioned into individual data
 structures, and then further organized into non-overlapping key ranges.
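The framework.tex hunk above describes attaching Bloom filters built only over tombstone records, so that lookups can skip blocks that provably contain no matching tombstone. A minimal sketch of that idea follows; the double-hashing scheme and the sizing are conventional choices made for illustration, not the framework's actual parameters.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Bloom filter kept per block, built only over tombstones: a negative
// answer is exact and lets a point lookup skip the block entirely; a
// positive answer may be a false positive and forces a real lookup.
class TombstoneFilter {
    std::vector<bool> bits;
    int k;  // number of hash probes

    std::size_t probe(std::uint64_t key, int i) const {
        // Double hashing (h1 + i*h2), a standard Bloom filter scheme.
        std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
        std::uint64_t h2 = std::hash<std::uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL);
        return (h1 + static_cast<std::uint64_t>(i) * h2) % bits.size();
    }

public:
    TombstoneFilter(std::size_t m, int hashes) : bits(m, false), k(hashes) {}

    void insert_tombstone(std::uint64_t key) {
        for (int i = 0; i < k; i++) bits[probe(key, i)] = true;
    }

    bool maybe_contains(std::uint64_t key) const {
        for (int i = 0; i < k; i++)
            if (!bits[probe(key, i)]) return false;  // definitely absent
        return true;
    }
};
```

Because only tombstones populate the filter, the memory cost scales with the delete rate rather than the total record count, which is what makes this cheaper than filtering all records.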
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 8f0635d..7ff82cd 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -22,7 +22,7 @@ them.
 Existing implementations tend to sacrifice either performance,
 by requiring the entire result set to be materialized prior to applying
 Bernoulli sampling, or statistical independence. There exist techniques
 for obtaining both sampling performance and independence by leveraging
-existing B+Tree indices with slight modification~\cite{olken-thesis},
+existing B+tree indices with slight modification~\cite{olken-thesis},
 but even this technique has worse sampling performance than could be
 achieved using specialized static sampling indices.
 
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 1d707b4..ee578a1 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -6,7 +6,7 @@
 \begin{figure}
 \subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}}
 \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
-\caption{Insertion Performance of Dynamized ISAM vs. B+Tree}
+\caption{Insertion Performance of Dynamized ISAM vs. B+tree}
 \label{fig:tl-btree-isam}
 \end{figure}
 
@@ -17,7 +17,7 @@ structures with good overall insertion throughput, the latency of
 individual inserts is highly variable. To illustrate this problem,
 consider the insertion performance in Figure~\ref{fig:tl-btree-isam},
 which compares the insertion latencies of a dynamized ISAM tree with
-that of its most direct dynamic analog: a B+Tree. While, as shown
+that of its most direct dynamic analog: a B+tree. While, as shown
 in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has
 superior average performance to the native dynamic structure, the
 latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat} are
@@ -25,10 +25,10 @@ quite different.
 The dynamized structure has much better best-case
 performance, but the worst-case performance is exceedingly poor.
 
 This poor worst-case performance is a direct consequence of the different
-approaches used by the dynamized structure and B+Tree to support updates.
-B+Trees use a form of amortized local reconstruction, whereas the
+approaches used by the dynamized structure and B+tree to support updates.
+B+trees use a form of amortized local reconstruction, whereas the
 dynamized ISAM tree uses amortized global reconstruction. Because the
-B+Tree only reconstructs the portions of the structure ``local'' to the
+B+tree only reconstructs the portions of the structure ``local'' to the
 update, even in the worst case only a small part of the data structure
 will need to be adjusted. However, when using global reconstruction
 based techniques, the worst-case insert requires rebuilding either the
@@ -37,8 +37,8 @@ proportion of it (for leveling).
 The fact that our dynamization technique
 uses buffering, and most of the shards involved in reconstruction are
 kept small by the logarithmic decomposition technique used to partition
 it, ensures that the majority of inserts are low cost compared to the
-B+Tree. At the extreme end of the latency distribution, though, the
-local reconstruction strategy used by the B+Tree results in significantly
+B+tree. At the extreme end of the latency distribution, though, the
+local reconstruction strategy used by the B+tree results in significantly
 better worst-case performance.
 
 Unfortunately, the design space that we have been considering is
@@ -216,14 +216,14 @@ of which will have exactly $N_B$ records.\footnote{
     for simplicity.
 }
-Applying this technique to an ISAM Tree, and compared against a
-B+Tree, yields the insertion and query latency distributions shown
+Applying this technique to an ISAM tree, and comparing it against a
+B+tree, yields the insertion and query latency distributions shown
 in Figure~\ref{fig:tl-floodl0}. Figure~\ref{fig:tl-floodl0-insert}
 shows that it is possible to obtain insertion latency distributions
 using amortized global reconstruction that are directly comparable to
 dynamic structures based on amortized local reconstruction. However,
 this performance comes at the cost of queries, which are incredibly slow
-compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}.
+compared to B+trees, as shown in Figure~\ref{fig:tl-floodl0-query}.
 
 \begin{figure}
 \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}}
@@ -1109,7 +1109,44 @@ to prioritize independence in our implementation.
 In this section, we perform several experiments to evaluate the ability
 of the system proposed in Section~\ref{sec:tl-impl} to control tail
 latencies.
 
-\subsection{Stall Proportion Sweep}
+\subsection{Stall Rate Sweep}
+
+As a first test, we will evaluate the ability of our insertion stall
+mechanism to control insertion tail latencies, as well as maintain
+similar decomposed structures to strict tiering. We consider the shard
+count directly in this test, rather than query latencies, because our
+intention is to show that this technique is capable of controlling the
+number of shards in the decomposition. The shard count also serves as
+an indirect measure of query latency, but we will consider this metric
+directly in a later section.
+
+Recall that, when using insertion stalling, our framework does \emph{not}
+block inserts to maintain a shard bound. The buffer is always flushed
+immediately, regardless of the number of shards in the structure. Thus,
+the rate of insertion is controlled by the cost of flushing the
+buffer (we still block when the buffer is full) and the insertion
+stall rate. The structure is maintained fully in the background, with
+maintenance reconstructions being scheduled for all levels exceeding a
+specified shard count. Thus, the number of shards within the structure
+is controlled indirectly by limiting the insertion rate. We ran these
+tests with 32 background threads on a system with 40 physical cores to
+ensure sufficient resources to fully parallelize all reconstructions
+(we'll consider resource constrained situations later).
+
+We tested the ISAM tree with the 200 million record SOSD \texttt{OSM}
+dataset~\cite{sosd}, as well as VPTree with the one million,
+300-dimensional, \texttt{SBW} dataset~\cite{sbw}. For each test, we
+inserted $30\%$ of the records to warm up the structure, and then
+measured the individual latency of each insert after that. We measured
+the count of shards in the structure each time the buffer flushed
+(including during the warmup period). Note that a stall rate of $\delta
+= 1$ indicates no stalling at all, and values less than one indicate a
+$1 - \delta$ probability of an insert being rejected, after which the
+insert thread sleeps for about a microsecond. A lower stall rate means
+more stalls are introduced. The tiering policy is strict tiering with
+a scale factor of $s=6$ using the concurrency control scheme described
+in Section~\ref{ssec:dyn-concurrency}.
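The paragraphs added above specify the stall mechanism precisely enough to sketch it: with stall rate $\delta$, each insert attempt is admitted with probability $\delta$, and a rejected attempt sleeps for about a microsecond before retrying; only a full buffer actually blocks. A minimal sketch under those assumptions, with the buffer type and its API as stand-ins rather than the framework's real interface:

```cpp
#include <chrono>
#include <random>
#include <thread>

// Bernoulli insertion stalling: admit each attempt with probability
// delta (delta = 1 disables stalling entirely); on rejection, sleep for
// roughly a microsecond and retry. No shard bound is enforced here;
// background threads handle all maintenance reconstructions.
template <typename Buffer, typename Record>
void stalled_insert(Buffer &buf, const Record &rec, double delta,
                    std::mt19937 &rng) {
    std::bernoulli_distribution admit(delta);
    while (!admit(rng)) {
        std::this_thread::sleep_for(std::chrono::microseconds(1));
    }
    buf.append(rec);  // blocks only while a buffer flush is in progress
}
```

The expected number of rejections per insert is $(1-\delta)/\delta$, so the added delay per record is a small constant for any fixed $\delta$, which is the property the later VPTree discussion calls into question for super-linear construction costs.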
 
 \begin{figure}
 \centering
@@ -1119,38 +1156,10 @@ the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.
 \label{fig:tl-stall-200m}
 \end{figure}
 
-First, we will consider the insertion and query performance of our
-system at a variety of stall proportions. The purpose of this testing
-is to demonstrate that inserting stalls into the insertion process is
-able to reduce the insertion tail latency, while being able to match the
-general insertion and query performance of a strict tiering policy. Recall
-that, in the insertion stall case, no explicit shard capacity limits are
-enforced by the framework. Reconstructions are triggered with each buffer
-flush on all levels exceeding a specified shard count ($s = 6$ in these
-tests) and the buffer flushes immediately when full with no regard to the
-state of the structure. Thus, limiting the insertion latency is the only
-means the system uses to maintain its shard count at a manageable level.
-These tests were run on a system with sufficient available resources to
-fully parallelize all reconstructions.
-
-First, Figure~\ref{fig:tl-stall-200m} shows the results of testing
-insertion of the 200 million record SOSD \texttt{OSM} dataset in a
-dynamized ISAM tree, using both our insertion stalling technique and
-strict tiering. We inserted $30\%$ of the records, and then measured
-the individual latency of each insert after that point to produce
-Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard}
-was produced by recording the number of shards in the dynamized structure
-each time the buffer flushed. Note that a stall value of one indicates
-no stalling at all, and values less than one indicate $1 - \delta$
-probability of an insert being rejected. Thus, a lower stall value means
-more stalls are introduced. The tiering policy is strict tiering with a
-scale factor of $s=6$. It uses the concurrency control scheme described
-in Section~\ref{ssec:dyn-concurrency}.
-
-Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all insertion
-rejection probabilities succeed in greatly reducing tail latency relative
-to tiering. Additionally, it shows a small amount of available tuning of
+We'll begin by considering the ISAM tree.
+Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all
+stall rates succeed in greatly reducing tail latency relative to
+tiering. Additionally, it shows a small amount of available tuning of
 the worst-case insertion latencies, with higher stall amounts reducing
 the tail latencies slightly at various points in the distribution. This
 latter effect results from the buffer flush latency hiding mechanism,
@@ -1163,23 +1172,19 @@ resulting in a stall.
 Of course, if the query latency is severely
 affected by the use of this mechanism, it may not be worth using. Thus, in
-Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of
-various shard counts within the dynamized structure for each stalling
-amount, as well as strict tiering. We have elected to examine the shard
-count, rather than the query latencies, for this purpose because our
-intention with this technique is to directly control the number of
-shards, and our intention is to show that this is possible. Of course,
-the shard count control is necessary for the sake of query latencies,
-and we will consider query latency directly later.
-
-This figure shows that, even for no insertion throttle at all, the shard
-count within the structure remains well behaved and normally distributed,
-albeit with a slightly longer tail and a higher average value. Once
-stalls are introduced, though, it is possible to both reduce the tail,
-and shift the peak of the distribution through a variety of points. In
-particular, we see that a stall of $.99$ is sufficient to move the peak
-to very close to tiering, and lower stalls are able to further shift the
-peak of the distribution to even lower counts.
+Figure~\ref{fig:tl-stall-200m-shard} we show the probability density
+of various shard counts within the decomposed structure for each stall
+rate, as well as strict tiering. This figure shows that, even for no
+insertion stalling, the shard count within the structure remains well
+behaved, albeit with a slightly longer tail and a higher average value
+compared to tiering. Once stalls are introduced, it is possible to
+both reduce the tail, and shift the peak of the distribution through
+a variety of points. In particular, we see that a stall of $.99$ is
+sufficient to move the peak to very close to tiering, and lower stall
+rates are able to further shift the peak of the distribution to even
+lower counts. This result implies that this stall mechanism may be able
+to produce a trade-off space for insertion and query performance, which
+is a question we will examine in Section~\ref{ssec:tl-design-space}.
 
 \begin{figure}
 \centering
@@ -1189,15 +1194,14 @@ peak of the distribution to even lower counts.
 \label{fig:tl-stall-4b}
 \end{figure}
 
-To validate that these results were not simply a result of the relatively
-small size of the data set used, we repeated the exact same testing
-using a set of four billion uniform integers, and these results are
-shown in Figure~\ref{fig:tl-stall-4b}. These results are aligned with
-the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing
-the same improvements in insertion tail latency for all stall amounts,
-and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the
-shard count. If anything, the gap between strict tiering and un-throttled
-insertion is narrower with the larger data set than the smaller one.
+To validate that these results were not due to the relatively
+small size of the data set used, we repeated the exact same
+testing using a set of four billion uniform integers, shown in
+Figure~\ref{fig:tl-stall-4b}. These results are aligned with the
+smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the
+same improvements in insertion tail latency for all stall amounts, and
+Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard
+count.
 
 \begin{figure}
 \subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}}
@@ -1207,11 +1211,17 @@ insertion is narrower with the larger data set than the smaller one.
 \end{figure}
 
 Finally, we considered our dynamized VPTree in
-Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of
-about one million 300-dimensional vectors. This test shows some of
-the possible limitations of our fixed rejection rate. The ISAM Tree
-tested above is constructable in roughly linear time, being an MDSP
-with $B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n,
+Figure~\ref{fig:tl-stall-knn}. This test shows some of the possible
+limitations of our fixed stall rate mechanism.
+The ISAM tree tested
+above is constructable in roughly linear time, being an MDSP with
+$B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards, and
+thus roughly constant.\footnote{
+    For strict tiering, $k=s$ in all cases. Because we don't enforce
+    the level shard capacity directly, however, in the insertion
+    stalling case $k \in \Omega(s)$. Based on the experimental results
+    above, however, it is clear that $k$ is typically quite close to
+    $s$ in practice for the ISAM tree.
+} Thus, the ratio $\frac{B_M(n,
 k)}{n}$ used to determine the optimal insertion stall rate is
 asymptotically a constant. For VPTree, however, the construction
 cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also
@@ -1231,70 +1241,74 @@ the tail latency substantially compared to strict tiering, with the
 same latency distribution effects for larger stall rates as was seen
 in the ISAM examples.
 
-Thus, we've shown that introducing even a fixed stall while allowing
-the internal structure of the dynamization to develop naturally is able
-to match the shard count distribution of strict tiering, while having
-significantly lower insertion tail latencies.
+These tests show that, for the ISAM tree at least, introducing a constant
+stall rate while allowing the decomposition to develop naturally
+with background reconstructions only is able to match the shard count
+distribution of tiering, which strictly enforces the shard count bound
+using blocking, while achieving significantly better insertion tail
+latencies. VPTree is able to achieve the same results too, albeit
+requiring significantly higher stall rates to match the shard bound.
+
 \subsection{Insertion Stall Trade-off Space}
+\label{ssec:tl-design-space}
+
+\begin{figure}
+\centering
+\subfloat[ISAM w/ Point Lookup]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \label{fig:tl-latency-curve-isam}}
+\subfloat[VPTree w/ $k$-NN]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-latency-curve.pdf} \label{fig:tl-latency-curve-knn}} \\
+\caption{Insertion Throughput vs. Query Latency}
+\label{fig:tl-latency-curve}
+\end{figure}
 
-While we have shown that introducing insertion stalls accomplishes the
-goal of reducing tail latencies while being able to match the shard count
-of a strict tiering reconstruction strategy, we've not yet addressed
-what the actual performance of this structure is. By throttling inserts,
-we potentially reduce the insertion throughput. And, further, it isn't
-immediately obvious just how much query performance suffers as the shard
-count distribution shifts. In this test, we examine the average values
-of insertion throughput and query latency over a variety of stall rates.
+We have shown that introducing insertion stalls accomplishes our stated
+goal of reducing insertion tail latencies while simultaneously maintaining
+a shard count in line with strict tiering. However, we have not addressed
+the actual performance of the structure in terms of average throughput
+or query latency. By throttling insertion, we potentially reduce the
+throughput. Further, it isn't clear in practice how much query latency
+suffers as the shard count distribution changes. In this experiment, we
+address these concerns by directly measuring the insertion throughput
+and query latency over a variety of stall rates and compare the results
+to strict tiering.
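The footnote's argument for why a fixed stall rate suits the ISAM tree but struggles with the VPTree can be made concrete from the construction costs quoted above:

```latex
% Per-record reconstruction work implied by the stated costs, with k
% the shard count (roughly constant under tiering, per the footnote):
\begin{align*}
    \text{ISAM tree:} \quad & \frac{B_M(n, k)}{n} \in \Theta(\log k)
        \approx \text{constant} \\
    \text{VPTree:}    \quad & \frac{B(n)}{n} \in \Theta(\log n),
        \quad \text{which grows with } n
\end{align*}
```

A fixed rejection probability buys a constant expected stall per record, which covers the ISAM tree's constant per-record reconstruction work but falls progressively behind the VPTree's growing $\Theta(\log n)$ term, consistent with the much higher stall rates the VPTree required in these tests.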
 
 The results of this test for ISAM with the SOSD \texttt{OSM} dataset are
 shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the insertion
 throughput plotted against the average query latency for our system at
-various stall rates, and with tiering configured with an equivalent
-scale factor marked as red point for reference. This plot shows two
-interesting features of the insertion stall mechanism. First, it is
-possible to introduce stalls that do not significantly affect the write
-throughput, but do improve query latency. This is seen by the difference
-between the two points at the far right of the curve, where introducing
-a slight stall improves query performance at virtually no cost. This
-represents the region of the curve where the stalling introduces delay
-that doesn't exceed the cost of a buffer flush, and so the amount of
-time spent stalling by the system doesn't change much.
-
-The second, and perhaps more notable, point that this plot shows is
-that introducing the stall rate provides a beautiful design trade-off
-between query and insert performance. In fact, this space is far more
-useful than the trade-off space represented by layout policy and scale
-factor selection using strict reconstruction schemes that we examined
-in Chapter~\ref{chap:design-space}. At the upper end of the insertion
-optimized region, we see more than double the insertion throughput of
-tiering (with significantly lower tail latencies at well) at the cost
-of a slightly larger than 2x increase in query latency. Moving down the
-curve, we see that we are able to roughly match the performance of tiering
-within this space, and even shift to more query optimized configurations.
+various stall rates, and with tiering configured with an equivalent scale
+factor marked as a red point for reference. The most interesting point
+demonstrated by this plot is that introducing the stall rate provides a
+beautiful design trade-off between query and insert performance. In fact,
+this space is far more useful than the trade-off space represented by
+layout policy and scale factor selection using strict reconstruction
+schemes that we examined in Chapter~\ref{chap:design-space}. At the
+upper end of the insertion optimized region, we see more than double
+the insertion throughput of tiering (with significantly lower tail
+latencies as well) at the cost of a slightly larger than 2x increase
+in query latency. Moving down the curve, we see that we are able to
+roughly match the performance of tiering within this space, and even
+shift to more query optimized configurations. Also, this trade-off curve
+falls \emph{below} the equivalently configured tiering on the chart,
+indicating that its performance is strictly superior.
 
 We also performed the same testing for $k$-NN queries using
 VPTree and the \texttt{SBW} dataset. The results are shown in
 Figure~\ref{fig:tl-latency-curve-knn}. Because the run time of $k$-NN
-queries is significantly longer than the point lookups in the ISAM test,
-we additionally applied a rate limit to the query thread, issuing new
-queries every 100 milliseconds, and configured query preemption with a
-trigger point of approximately 40 milliseconds. We applied the same
-parameters for the tiering test, and counted any additional latency
-associated with query preemption towards the average query latency figures
-reported.
-This test shows that, like with ISAM, we have access to a
-similarly clear trade-off space by adjusting the insertion throughput,
-however in this case the standard tiering policy did perform better in
-terms of average insertion throughput and query latency.
-
-\begin{figure}
-\centering
-\subfloat[ISAM w/ Point Lookup]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \label{fig:tl-latency-curve-isam}}
-\subfloat[VPTree w/ $k$-NN]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-latency-curve.pdf} \label{fig:tl-latency-curve-knn}} \\
-\caption{Insertion Throughput vs. Query Latency}
-\label{fig:tl-latency-curve}
-\end{figure}
+queries is significantly longer than the point lookups in the ISAM
+test, we additionally applied a rate limit to the query thread, issuing
+new queries every 100 milliseconds, and configured query preemption
+with a trigger point of approximately 40 milliseconds. We applied
+the same parameters for the tiering test, and counted any additional
+latency associated with query preemption towards the average query
+latency figures reported. This test shows that, like with ISAM, we have
+access to a similarly clear trade-off space by adjusting the insertion
+throughput; however, in this case the standard tiering policy did perform
+better in terms of average insertion throughput and query latency. The
+fact that stalling was outperformed by strict tiering for VPTree
+isn't a surprising result, given the observations made in the previous
+test. VPTree requires significantly higher insertion throttling to keep
+up with the longer reconstruction times, and the amount of throttling
+per record is asymptotically not constant as the structure grows.
 
 This shows a very interesting result. Not only is our approach able
 to match a strict reconstruction policy in terms of average query and
@@ -1310,7 +1324,7 @@ the previous version, although these have very different performance
 implications given our different compaction strategy. In this test, we
 examine the effects of these parameters on the insertion-query trade-off
 curves noted above, as well as on insertion tail latency. The results
-are shown in Figure~\ref{fig:tl-design-space}, for a dynamized ISAM Tree
+are shown in Figure~\ref{fig:tl-design-space}, for a dynamized ISAM tree
 using the SOSD \texttt{OSM} dataset and point lookup queries.
 
 \begin{figure}
@@ -1362,7 +1376,7 @@ In the previous tests, we ran our system configured with 32 available
 threads, which was more than enough to run all reconstructions and
 queries fully in parallel. However, it's important to determine how well
 the system works in more resource constrained environments. The system
-shares internal threads between reconstructions and queries, and that
+shares internal threads between reconstructions and queries, and
 flushing occurs on a dedicated thread separate from these. During the
 benchmark, one client thread issued queries continuously and another
 issued inserts. The index accumulated a total of five levels, so
@@ -1381,12 +1395,12 @@ results of this test are shown in Figure~\ref{fig:tl-latency-threads}.
 The first note is that the change in the number of available internal
 threads has little effect on the insertion throughput, as shown by the
 clustering of the points on the curve. This is to be expected, as insert
-throughput is limited only by the stall amount, and by the buffer flushing
-operation. As flushing occurs on a dedicated thread, it is unaffected
-by changes in the internal thread configuration of the system.
+throughput is limited only by the stall amount, and by buffer flushing.
+As flushing occurs on a dedicated thread, it is unaffected by changes
+in the internal thread configuration of the system.
 
 In terms of query performance, there are two general effects that can be
-observed. The first effect is that the previously noted effect of reduced
+observed. The first effect is that the previously noted reduction in
 query performance as the insertion throughput is increased is observed
 in all cases, irrespective of thread count. However, interestingly,
 the thread count itself has little effect on the curve outside of the
@@ -1398,17 +1412,16 @@ capable of significantly higher insertion throughput at a given query
 latency. But, at very low insertion throughputs, this effect vanishes
 and all thread counts are roughly equivalent in performance.
 
-A large part of the reason for this significant deviation in
-behavior between one thread and multiple is likely that queries and
-reconstructions share the same pool of background threads in this
-framework. Our testing involved issuing queries continuously on a
-single thread, while performing inserts, and so two threads background
-threads ensures that a reconstruction and query can be run in parallel,
-whereas a single thread will force queries to wait behind long running
-reconstructions. Once this bottleneck is overcome, a reduction in the
-amount of parallel reconstruction seems to have only a minor influence
-on overall performance. This is likely because, although in the worst
-case the system requires $\log_s n$ threads to fully parallelize
+A large part of the reason for this significant deviation in behavior
+between one thread and multiple is likely that queries and reconstructions
+share the same pool of background threads. Our testing involved issuing
+queries continuously on a single thread, while performing inserts, and so
+two background threads ensure that a reconstruction and query can
+be run in parallel, whereas a single thread will force queries to wait
+behind long running reconstructions. Once this bottleneck is overcome,
+a reduction in the amount of parallel reconstruction seems to have only a
+minor influence on overall performance. This is because, although in the
+worst case the system requires $\log_s n$ threads to fully parallelize
 reconstructions, this worst case is fairly rare. The vast majority of
 reconstructions only require a fraction of this total parallel capacity.
@@ -1425,20 +1438,20 @@ within our framework, including a significantly improved architecture
 for scheduling and executing parallel and background reconstructions,
 and a system for rate limiting by rejecting inserts via Bernoulli
 sampling.
 
-We evaluated this system for fixed insertion rejection rates, and found
-significant improvements in tail latencies, approaching the practical lower
-bound we established using the equal block method, without requiring
-significant degradation of query performance. In fact, we found that
-this rate limiting mechanism provides a design space with more effective
-trade-offs than the one we examined in Chapter~\ref{chap:design-space},
-with the system being able to exceed the query performance of an
-equivalently configured tiering system for certain rate limiting
-configurations. The method has limitations, assigning a fixed rejection
-rate of inserts works well for linear time constructable structures like
-the ISAM Tree, but was significantly less effective for the VPTree, which
-requires $\Theta(n \log n)$ time to construct. For structures like this,
-it will be necessary to dynamically scale the amount of throttling based
-on the record count and size of reconstruction. Additionally, our current
+We evaluated this system for fixed stall rates, and found significant
+improvements in tail latencies, approaching the practical lower bound we
+established using the equal block method, without requiring significant
+degradation of query performance. In fact, we found that this rate
+limiting mechanism provides a design space with more effective trade-offs
+than the one we examined in Chapter~\ref{chap:design-space}, with the
+system being able to exceed the query performance of an equivalently
+configured tiering system for certain rate limiting configurations. The
+method has limitations: assigning a fixed rejection rate to inserts
+works well for linear time constructable structures like the ISAM tree,
+but was significantly less effective for the VPTree, which requires
+$\Theta(n \log n)$ time to construct. For structures like this, it will
+be necessary to dynamically scale the amount of throttling based on
+the record count and size of reconstruction. Additionally, our current
 system isn't easily capable of reaching the ``ideal'' goal of being
 able to reliably trade query performance and insertion latency at a
 fixed throughput. Nonetheless, the mechanisms for supporting such features