| author | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
|---|---|---|
| committer | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-07-07 11:36:15 -0400 |
| commit | 05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55 (patch) | |
| tree | be19b76016630bc7c7cdfb482e71b158c93fbd38 /chapters | |
| parent | 0dc1a8ea20820168149cedaa14e223d4d31dc4b6 (diff) | |
| download | dissertation-05aab7bd45e691a0b0f527d0ab4dd7cae0b3ec55.tar.gz | |
update
Diffstat (limited to 'chapters')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | chapters/beyond-dsp.tex | 54 |
| -rw-r--r-- | chapters/conclusion.tex | 17 |
| -rw-r--r-- | chapters/design-space.tex | 34 |
| -rw-r--r-- | chapters/dynamization.tex | 4 |
| -rw-r--r-- | chapters/sigmod23/exp-extensions.tex | 13 |
| -rw-r--r-- | chapters/tail-latency.tex | 155 |
6 files changed, 158 insertions, 119 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex index 7632261..fcd29b5 100644 --- a/chapters/beyond-dsp.tex +++ b/chapters/beyond-dsp.tex @@ -1955,28 +1955,6 @@ compress redundant data. \subsection{Concurrency} -We also tested the preliminary concurrency support described in -Section~\ref{ssec:dyn-concurrency}, using IRS as our test case, with our -dynamization configured with $N_B = 1200$, $s=8$, and the tiering layout -policy. Note that IRS only supports tagging, as it isn't invertible even -under the IDSP model, and our current concurrency implementation only -supports deletes with tombstones, so we eschewed deletes entirely for -this test. - -In this benchmark, we used a single thread to insert records -into the structure at a constant rate, while we deployed a variable -number of additional threads that continuously issued sampling queries -against the structure. We used an AGG B+tree as our baseline. Note -that, to accurately maintain the aggregate weight counts as records -are inserted, it is necessary that each operation obtain a lock on -the root node of the tree~\cite{zhao22}. This makes this situation -a good use-case for the automatic concurrency support provided by our -framework. Figure~\ref{fig:irs-concurrency} shows the results of this -benchmark for various numbers of concurrency query threads. As can be seen, -our framework supports a stable update throughput up to 32 query threads, -whereas the AGG B+tree suffers from contention for the mutex and sees -its performance degrade as the number of threads increases. - \begin{figure} \centering %\vspace{-2mm} @@ -1987,6 +1965,38 @@ its performance degrade as the number of threads increases. %\vspace{-2mm} \end{figure} +We also tested the preliminary concurrency support described in +Section~\ref{ssec:dyn-concurrency}, using IRS as our test case, with our +dynamization configured with $N_B = 1200$, $s=8$, and the tiering layout +policy. Note that IRS only supports tagging, as it isn't invertible even +under the IDSP model, and our current concurrency implementation only +supports deletes with tombstones, so we eschewed deletes entirely for +this test. + +In this benchmark, we used a single thread to insert records into the +structure at a constant rate, while we deployed a variable number of +additional threads that continuously issued sampling queries against +the structure. We used an AGG B+tree as our baseline. Note that, +to accurately maintain the aggregate weight counts as records are +inserted, it is necessary that each operation obtain a lock on the +root node of the tree~\cite{zhao22}. This makes this situation a +good use-case for the automatic concurrency support provided by our +framework. Figure~\ref{fig:irs-concurrency} shows the results of this +benchmark for various numbers of concurrency query threads. As can be +seen, our framework supports a stable update throughput up to 32 query +threads, whereas the AGG B+tree suffers from contention for the mutex +and sees its performance degrade as the number of threads increases. The +framework is able to achieve this because queries are processed mostly +independently from reconstructions due to the multi-versioning of the +structure. Thus, a query can simply maintain a static view on a set +of data within the dynamized structure for as long as it likes, while +inserts can freely proceed. 
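To make the multi-versioning mechanism concrete, the sketch below shows one way a query can pin a static view of the decomposed structure while inserts and reconstructions continue. The names (`Version`, `Shard`, `pin_version`, `install_version`) are illustrative assumptions, not the framework's actual interface; the essential point is that queries atomically load whichever version is active when they start and hold it for their lifetime, while reconstructions publish new versions without waiting for readers.

```cpp
// Hypothetical sketch (not the framework's actual API) of version pinning:
// each version is an immutable list of shards, and the active version is
// swapped atomically when a buffer flush or reconstruction completes. A
// query's shared_ptr keeps its pinned shards alive until the query ends.
#include <atomic>
#include <memory>
#include <vector>

struct Shard;  // immutable, query-only structure built by a reconstruction

struct Version {
    std::vector<std::shared_ptr<Shard>> shards;  // never mutated after creation
};

class DynamizedStructure {
public:
    // Called by query threads: pin the current version for the query's lifetime.
    std::shared_ptr<const Version> pin_version() const {
        return std::atomic_load(&active_);
    }

    // Called by the flush/reconstruction machinery: publish a new version.
    // Old versions are reclaimed once no query still holds them.
    void install_version(std::shared_ptr<const Version> next) {
        std::atomic_store(&active_, std::move(next));
    }

private:
    std::shared_ptr<const Version> active_ = std::make_shared<Version>();
};
```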
Because our implementation only maintains a +limited number of structure versions, it is possible for long-running +queries to slow down inserts, which will eventually be blocked until +the query releases the version it is using, but this is a function of +the query type itself, not the number of queries running or the number +of client threads issuing queries. + \section{Conclusion} In this chapter, we sought to develop a set of tools for generalizing diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index 13457b5..eb4cf72 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -61,6 +61,21 @@ update support. In particular, our framework must also support the following additional features, \begin{enumerate} + \item \textbf{Automatic Tuning of Insertion Rejection Rate.} \\ + The tail latency control system discussed + in Chapter~\ref{chap:tail-latency} is based upon setting a + rejection rate parameter for inserts, which must be tuned for + the data structure being dynamized. The current version treats + this as a user-specified constant parameter, but it would be + ideal for this parameter to be automatically determined based + on the performance of the framework. In particular, we noted in + Chapter~\ref{chap:tail-latency} that having it fixed to a single + value is sub-optimal for some data structures, and there also + exist opportunities to dynamically adjust it based on the actual + rate of inserts into the system to achieve better throughput. The + design of a system for doing this automatic rejection rate tuning is + an important next step for the framework. + \item \textbf{Support for external storage.} \\ While we did have an implementation of sampling framework discussed in Chapter~\ref{chap:sampling} that used an external @@ -69,6 +84,7 @@ following additional features, to extend it with support for external structures, as well as evaluate whether our proposed techniques still function effectively in this context. + \item \textbf{Crash recovery.} \\ It is critical for a database index to support crash recovery, so that it can be recovered to a state consistent with the rest of @@ -77,6 +93,7 @@ following additional features, inefficient crash recovery is straightforward: All operations can be logged and replayed in the event of a crash. But this is highly inefficient, and so a better scheme must be devised. + \item \textbf{Distributed systems support.} \\ The append-only and decomposed nature of dynamized indices make them seem a natural fit in a distributed systems context. This was diff --git a/chapters/design-space.tex b/chapters/design-space.tex index 22773e5..2aecede 100644 --- a/chapters/design-space.tex +++ b/chapters/design-space.tex @@ -54,7 +54,7 @@ Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}. Our experiments in Chapter~\ref{chap:framework} show that, for other types of problem, the technique does not fare quite so well in its unmodified form. -\section{Asymptotic Analysis} +\section{Theoretical Performance Analysis} \label{sec:design-asymp} Before beginning with derivations for the cost functions of dynamized @@ -75,8 +75,7 @@ As a first step, we will derive a modified version of the Bentley-Saxe method that has been adjusted to support arbitrary scale factors. There's nothing fundamental to the technique that prevents such modifications, and its likely that they have not been analyzed like this before simply -out of a lack of interest in constant factors in theoretical asymptotic -analysis. 
+out of a lack of interest in constant factors in asymptotic analysis. When generalizing the Bentley-Saxe method for arbitrary scale factors, we decided to maintain the core concept of binary decomposition. One @@ -555,7 +554,7 @@ best-case query cost. \end{proof} \section{General Observations} -The asymptotic results from the previous section are summarized in +The theoretical results from the previous section are summarized in Table~\ref{tab:policy-comp}. When the scale factor is accounted for in the analysis, we can see that possible trade-offs begin to manifest within the space. We've seen some of these in action directly in @@ -625,7 +624,7 @@ the real-world performance implications of the configuration parameter space of our framework. -\subsection{Asymptotic Insertion Performance} +\subsection{Insertion Performance} We'll begin by validating our results for the insertion performance characteristics of the three layout policies. For this test, we @@ -638,6 +637,7 @@ the $200$ million record SOSD \texttt{OSM} dataset~\cite{sosd-datasets} for ISAM testing, and the one million record, $300$-dimensional Spanish Billion Words (\texttt{SBW}) dataset~\cite{sbw} for VPTree testing. +\Paragraph{Worst-case Insertion Performance.} For our first experiment, we will examine the latency distribution for inserts into our structures. We tested the three layout policies, using a common scale factor of $s=2$. This scale factor was selected @@ -705,6 +705,7 @@ due to cache effects most likely, but less so than in the MDSP case. \label{fig:design-ins-tput} \end{figure} +\Paragraph{Insertion Throughput.} Next, in Figure~\ref{fig:design-ins-tput}, we show the overall insertion throughput for the three policies for both ISAM tree and VPTree. This result should correlate with the amortized insertion costs for each @@ -778,8 +779,6 @@ performance across the board. Generally it seems to be a strictly worse alternative to leveling in all but its best-case query cost, and we will omit it from our tests moving forward. -\subsection{Buffer Size} - \begin{figure} \centering \subfloat[ISAM Tree Range Count]{\includegraphics[width=.5\textwidth]{img/design-space/isam-bs-sweep.pdf} \label{fig:buffer-isam-tradeoff}} @@ -788,13 +787,12 @@ omit it from our tests moving forward. \label{fig:buffer-size} \end{figure} -In the previous section, we considered the effect of various scale -factors on the trade-off between insertion and query performance. Our -framework also supports varying buffer sizes, and so we will examine this -next. Figure~\ref{fig:buffer-size} shows the same insertion throughput -vs. query latency curves for fixed layout policy and scale factor -configurations at varying buffer sizes, under the same experimental -conditions as the previous test. +We will next turn our attention to the effect that buffer size +has on the trade-off between insertion and query performance. +Figure~\ref{fig:buffer-size} shows the same insertion throughput vs. query +latency curves for fixed layout policy and scale factor configurations +at varying buffer sizes, under the same experimental conditions as the +previous test. Unlike with the scale factor, there is a significant difference in the behavior of the two tested structures under buffer size variation. For @@ -858,10 +856,10 @@ albeit slight, stratification amongst the tested policies, as shown in Figure~\ref{fig:design-isam-sel}. 
As the selectivity continues to rise above those shown in the chart, the relative ordering of the policies remains the same, but the relative differences between them begin to -shrink. This result makes sense given the asymptotics--there is still -\emph{some} overhead associated with the decomposition, but as the cost -of the query approaches linear, it makes up an increasingly irrelevant -portion of the run time. +shrink. This result makes sense given the theoretical analysis--there +is still \emph{some} overhead associated with the decomposition, but +as the cost of the query approaches linear, it makes up an increasingly +irrelevant portion of the run time. The $k$-NN results in Figure~\ref{fig:design-knn-sel} show a slightly different story. This is also not surprising, because $k$-NN is a diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex index 5e4cdec..4f9e50b 100644 --- a/chapters/dynamization.tex +++ b/chapters/dynamization.tex @@ -1270,6 +1270,10 @@ their applicability, \item The result merge operation must be commutative and associative, and is called repeatedly to merge pairs of results. + + \item There are serious restrictions on problems which can support + deletes, requiring additional assumptions about the search + problem or data structure. \end{itemize} These requirements restrict the types of queries that can be supported by diff --git a/chapters/sigmod23/exp-extensions.tex b/chapters/sigmod23/exp-extensions.tex index 3d3f5b7..df1f4b6 100644 --- a/chapters/sigmod23/exp-extensions.tex +++ b/chapters/sigmod23/exp-extensions.tex @@ -38,12 +38,13 @@ dynamic baseline in both sampling and update performance. Finally, we tested the multi-threaded insertion performance of our in-memory, concurrent implementation of \texttt{DE-IRS} compared to -\texttt{AB-tree} configured to run entirely in memory. We used the -synthetic uniform dataset (1B records) for this testing, and introduced a -slight delay between inserts to avoid bottlenecking on the fetch-and-add -within the mutable buffer. Figure~\ref{fig:con-latency} shows the latency -vs. throughput curves for the two structures. Note that \texttt{AB-tree}'s -results are cut off by the y-axis, as it performs significantly worse than +\texttt{AB-tree} configured with a large enough cache to store the +data set entirely in memory. We used the synthetic uniform dataset +(1B records) for this testing, and introduced a slight delay between +inserts to avoid bottlenecking on the fetch-and-add within the mutable +buffer. Figure~\ref{fig:con-latency} shows the latency vs. throughput +curves for the two structures. Note that \texttt{AB-tree}'s results +are cut off by the y-axis, as it performs significantly worse than \texttt{DE-IRS}. Figure~\ref{fig:con-tput} shows the insertion throughput as additional insertion threads are added. Both plots show linear scaling up to 3 or 4 threads, before the throughput levels off. Further, even diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex index a0db592..dbe867c 100644 --- a/chapters/tail-latency.tex +++ b/chapters/tail-latency.tex @@ -350,10 +350,15 @@ will throttle the insertion rate by adding a stall time, $\delta$, to each insert. $\delta$ will be determined such that it is sufficiently large to ensure that any scheduled reconstructions have enough time to complete before the shard count on any level exceeds $s$. This process -is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. +is summarized in Algorithm~\ref{alg:tl-relaxed-recon}. 
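As a rough illustration of this insertion path, the per-insert stall can be applied as in the sketch below. The names (`buffer_append`) and the fixed `delta` parameter are assumptions for illustration, not the framework's actual code; the point is simply that each insert is delayed on the client thread by the stall amount.

```cpp
// Hypothetical sketch of the stalled insert path: every insert is delayed
// by a stall time delta chosen so that scheduled background reconstructions
// can finish before any level exceeds s shards.
#include <chrono>
#include <thread>

template <typename Record, typename Structure>
void stalled_insert(Structure &dyn, const Record &rec,
                    std::chrono::microseconds delta) {
    // Append to the mutable buffer; a full buffer still blocks until the
    // background flush thread makes room.
    dyn.buffer_append(rec);

    // Throttle the client so in-progress reconstructions have enough time
    // to complete before the shard bound is violated.
    std::this_thread::sleep_for(delta);
}
```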
Note that +this algorithm will only work with a single insertion client thread, +as multiple insertion threads will stall concurrently with each other, +removing some of the necessary delay. If multiple insertion threads are +necessary, the inserts can be serialized first in a queue to ensure that +the appropriate amount of stalling occurs. \begin{algorithm} -\caption{Relaxed Reconstruction Algorithm with Insertion Stalling} +\caption{Insertion Algorithm with Stalling} \label{alg:tl-relaxed-recon} \KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount} @@ -1077,18 +1082,21 @@ on demand using atomics. Our current prototype uses a single, fixed value for the probability, but ultimately it should be dynamically tuned to approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal} as closely as possible. It also doesn't require significant modification -of the existing client interfaces. - -We have elected to use Bernoulli sampling for the task of selecting which -inserts to reject. An alternative approach would have been to apply -systematic sampling. Bernoulli sampling results in the probability of an -insert being rejected being independent of its order in the workload, -but is non-deterministic. This means that there is a small probability -that many more rejections than are expected may occur over a short -span of time. Systematic sampling is not vulnerable to this problem, -but introduces a dependence between the position of an insert in a -sequence of operations and its probability of being rejected. We decided -to prioritize independence in our implementation. +of the existing client interfaces, and can easily support multiple threads +of insertion without needing an explicit serialization process to ensure +the appropriate amount of stalling occurs. + +We have elected to use Bernoulli sampling for the task of selecting +which inserts to reject. An alternative approach would have been to +apply systematic sampling, and reject inserts based on a pattern (e.g., +every 1000th insert gets rejected). Bernoulli sampling results in the +probability of an insert being rejected being independent of its order +in the workload, but is non-deterministic. This means that there is a +small probability that many more rejections than are expected may occur +over a short span of time. Systematic sampling is not vulnerable to this +problem, but introduces a dependence between the position of an insert +in a sequence of operations and its probability of being rejected. We +decided to prioritize independence in our implementation. \section{Evaluation} \label{sec:tl-eval} @@ -1096,7 +1104,7 @@ to prioritize independence in our implementation. In this section, we perform several experiments to evaluate the ability of the system proposed in Section~\ref{sec:tl-impl} to control tail latencies. -\subsection{Stall Rate Sweep} +\subsection{Rejection Rate Sweep} As a first test, we will evaluate the ability of our insertion stall @@ -1113,7 +1121,7 @@ block inserts to maintain a shard bound. The buffer is always flushed immediately, regardless of the number of shards in the structure. Thus, the rate of insertion is controlled by the cost of flushing the buffer (we still block when the buffer is full) and the insertion -stall rate. The structure is maintained fully in the background, with +rejection rate. 
The structure is maintained fully in the background, with maintenance reconstructions being scheduled for all levels exceeding a specified shard count. Thus, the number of shards within the structure is controlled indirectly by limiting the insertion rate. We ran these @@ -1127,10 +1135,10 @@ dataset~\cite{sosd-datasets}, as well as VPTree with the one million, we inserted $30\%$ of the records to warm up the structure, and then measured the individual latency of each insert after that. We measured the count of shards in the structure each time the buffer flushed -(including during the warmup period). Note that a stall rate of $\delta -= 1$ indicates no stalling at all, and values less than one indicate +(including during the warmup period). Note that a rejection rate of +$1$ indicates no stalling at all, and values less than one indicate $1 - \delta$ probability of an insert being rejected, after which the -insert thread sleeps for about a microsecond. A lower stall rate means +insert thread sleeps for about a microsecond. A lower rejection rate means more stalls are introduced. The tiering policy is strict tiering with a scale factor of $s=6$ using the concurrency control scheme described in Section~\ref{ssec:dyn-concurrency}. @@ -1145,7 +1153,7 @@ in Section~\ref{ssec:dyn-concurrency}. We'll begin by considering the ISAM tree. Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all -stall rates succeed in greatly reducing tail latency relative to +rejection rates succeed in greatly reducing tail latency relative to tiering. Additionally, it shows a small amount of available tuning of the worst-case insertion latencies, with higher stall amounts reducing the tail latencies slightly at various points in the distribution. This @@ -1160,16 +1168,16 @@ resulting in a stall. Of course, if the query latency is severely affected by the use of this mechanism, it may not be worth using. Thus, in Figure~\ref{fig:tl-stall-200m-shard} we show the probability density -of various shard counts within the decomposed structure for each stall +of various shard counts within the decomposed structure for each rejection rate, as well as strict tiering. This figure shows that, even for no insertion stalling, the shard count within the structure remains well behaved, albeit with a slightly longer tail and a higher average value compared to tiering. Once stalls are introduced, it is possible to both reduce the tail, and shift the peak of the distribution through -a variety of points. In particular, we see that a stall of $.99$ is -sufficient to move the peak to very close to tiering, and lower stall +a variety of points. In particular, we see that a rejection rate of $.99$ is +sufficient to move the peak to very close to tiering, and lower rejection rates are able to further shift the peak of the distribution to even -lower counts. This result implies that this stall mechanism may be able +lower counts. This result implies that this rejection mechanism may be able to produce a trade-off space for insertion and query performance, which is a question we will examine in Section~\ref{ssec:tl-design-space}. @@ -1186,7 +1194,7 @@ small size of the data set used, we repeated the exact same testing using a set of four billion uniform integers, shown in Figure~\ref{fig:tl-stall-4b}. 
These results are aligned with the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing the -same improvements in insertion tail latency for all stall amounts, and +same improvements in insertion tail latency for all rejection rates, and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the shard count. @@ -1199,7 +1207,7 @@ count. Finally, we considered our dynamized VPTree in Figure~\ref{fig:tl-stall-knn}, This test shows some of the possible -limitations of our fixed stall rate mechanism. The ISAM tree tested +limitations of our fixed rejection rate mechanism. The ISAM tree tested above is constructable in roughly linear time, being an MDSP with $B_M(n, k) \in \Theta(n \log k)$, where $k$ is the number of shards, and thus roughly constant.\footnote{ @@ -1209,35 +1217,35 @@ thus roughly constant.\footnote{ about, however, it is clear that $k$ is typically quite close to $s$ in practice for ISAM tree. } Thus, the ratio $\frac{B_M(n, -k)}{n}$ used to determine the optimal insertion stall rate is +k)}{n}$ used to determine the optimal insertion rejection rate is asymptotically a constant. For VPTree, however, the construction cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also generally much larger in absolute time requirements. We can see in Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution -is very poorly behaved for smaller stall amounts, with the shard count -following a roughly uniform distribution for a stall rate of $1$. This +is very poorly behaved for smaller rejection rates, with the shard count +following a roughly uniform distribution for a rejection rate of $1$. This means that the background reconstructions are not capable of keeping up with buffer flushing, and so the number of shards grows significantly over time. Introducing stalls does shift the distribution closer to -normal, but it requires a much larger stall rate in order to obtain +normal, but it requires a much larger rejection rate in order to obtain a shard count distribution that is close to the strict tiering than was the case with the ISAM tree test. It is still possible, though, -even with our simple fixed-stall rate implementation. Additionally, +even with our simple fixed-rejection rate implementation. Additionally, this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce the tail latency substantially compared to strict tiering, with the same -latency distribution effects for larger stall rates as was seen in the +latency distribution effects for larger rejection rates as was seen in the ISAM examples. These tests show that, for ISAM tree at least, introducing a constant -stall rate while allowing the decomposition to develop naturally +rejection rate while allowing the decomposition to develop naturally with background reconstructions only is able to match the shard count distribution of tiering, which strictly enforces the shard count bound using blocking, while achieving significantly better insertion tail latencies. VPTree is able to achieve the same results too, albeit -requiring significantly higher stall rates to match the shard bound. +requiring significantly higher rejection rates to match the shard bound. -\subsection{Insertion Stall Trade-off Space} +\subsection{Insertion Rejection Trade-off Space} \label{ssec:tl-design-space} \begin{figure} @@ -1256,15 +1264,15 @@ or query latency. By throttling insertion, we potentially reduce the throughput. 
Further, it isn't clear in practice how much query latency suffers as the shard count distribution changes. In this experiment, we address these concerns by directly measuring the insertion throughput -and query latency over a variety of stall rates and compare the results +and query latency over a variety of rejection rates and compare the results to strict tiering. The results of this test for ISAM with the SOSD \texttt{OSM} dataset are shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the insertion throughput plotted against the average query latency for our system at -various stall rates, and with tiering configured with an equivalent scale +various rejection rates, and with tiering configured with an equivalent scale factor marked as red point for reference. The most interesting point -demonstrated by this plot is that introducing the stall rate provides a +demonstrated by this plot is that introducing the rejection rate provides a beautiful design trade-off between query and insert performance. In fact, this space is far more useful than the trade-off space represented by layout policy and scale factor selection using strict reconstruction @@ -1304,7 +1312,7 @@ to provide a superior set of design trade-offs than the strict policies, at least in environments where sufficient parallel processing and memory are available to leverage parallel reconstructions. -\subsection{Legacy Design Space} +\subsection{Design Space} Our new system retains the concept of buffer size and scale factor from the previous version, although these have very different performance @@ -1318,7 +1326,7 @@ using the SOSD \texttt{OSM} dataset and point lookup queries. \centering \subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}} \subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\ -\caption{Legacy Design Space Examination} +\caption{Design Space Examination} \label{fig:tl-design-space} \end{figure} @@ -1347,7 +1355,7 @@ Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased, the worst-case insertion time also increases, although the effect is relatively small. -\subsection{Thread Scaling} +\subsection{Internal Thread Scaling} \begin{figure} \centering @@ -1360,21 +1368,21 @@ relatively small. \end{figure} In the previous tests, we ran our system configured with 32 available -threads, which was more than enough to run all reconstructions and -queries fully in parallel. However, it's important to determine how well -the system works in more resource constrained environments. The system -shares internal threads between reconstructions and queries, and -flushing occurs on a dedicated thread separate from these. During the -benchmark, one client thread issued queries continuously and another -issued inserts. The index accumulated a total of five levels, so -the maximum amount of parallelism available during the testing was 4 +internal threads, which was more than enough to run all reconstructions +and queries fully in parallel. However, it's important to determine +how well the system works in more resource constrained environments. +The system shares internal threads between reconstructions and queries, +and flushing occurs on a dedicated thread separate from these. During +the benchmark, one client thread issued queries continuously and +another issued inserts. 
The index accumulated a total of five levels, +so the maximum amount of parallelism available during the testing was 4 parallel reconstructions, along with the dedicated flushing thread and any concurrent queries. In these tests, we used the SOSD \texttt{OSM} dataset (200M records) and point-lookup queries without early abort against a dynamized ISAM tree. We considered the insertion throughput vs. query latency trade-off for -various stall amounts with several internal thread counts. We inserted +various rejection rates with several internal thread counts. We inserted 30\% of the dataset first, and then measured the insertion throughput over the insertion of the rest of the data on a client thread, while another client thread continuously issued queries against the structure. The @@ -1386,33 +1394,34 @@ throughput is limited only by the stall amount, and by buffer flushing. As flushing occurs on a dedicated thread, it is unaffected by changes in the internal thread configuration of the system. -In terms of query performance, there are two general effects that can be -observed. The first effect is that the previously noted reduction in -query performance as the insertion throughput is increased is observed -in all cases, irrespective of thread count. However, interestingly, -the thread count itself has little effect on the curve outside of the -case of only having a single thread. This can also be seen in -Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of -the same data revealing the best measured insertion throughput associated -with a given query latency bound. In both cases, two or more threads are -capable of significantly higher insertion throughput at a given query -latency. But, at very low insertion throughputs, this effect vanishes -and all thread counts are roughly equivalent in performance. +In terms of query performance, there are two general effects that can +be observed. The first effect is that the previously noted reduction in +query performance as the insertion throughput is increased is observed in +all cases, irrespective of internal thread count. However, interestingly, +the internal thread count itself has little effect on the curve outside +of the case of only having a single internal thread. This can also be +seen in Figure~\ref{fig:tl-query-scaling}, which shows an alternative +view of the same data revealing the best measured insertion throughput +associated with a given query latency bound. In both cases, two or more +internal threads are capable of significantly higher insertion throughput +at a given query latency. But, at very low insertion throughputs, this +effect vanishes and all internal thread counts are roughly equivalent +in performance. A large part of the reason for this significant deviation in behavior -between one thread and multiple is likely that queries and reconstructions -share the same pool of background threads. Our testing involved issuing -queries continuously on a single thread, while performing inserts, and so -two threads background threads ensures that a reconstruction and query can -be run in parallel, whereas a single thread will force queries to wait -behind long running reconstructions. Once this bottleneck is overcome, -a reduction in the amount of parallel reconstruction seems to have only a -minor influence on overall performance. 
This is because, although in the -worst case the system requires $\log_s n$ threads to fully parallelize +between one internal thread and multiple is likely that queries and +reconstructions share the same pool of internal threads. Our testing +involved issuing queries continuously on a single internal thread, +while performing inserts, and so two internal threads ensures +that a reconstruction and query can be run in parallel, whereas a +single internal thread will force queries to wait behind long running +reconstructions. Once this bottleneck is overcome, a reduction in the +amount of parallel reconstruction seems to have only a minor influence +on overall performance. This is because, although in the worst case +the system requires $\log_s n$ internal threads to fully parallelize reconstructions, this worst case is fairly rare. The vast majority of reconstructions only require a fraction of this total parallel capacity. - \section{Conclusion} In this section, we addressed the final of the three major problems of @@ -1425,7 +1434,7 @@ within our framework, including a significantly improved architecture for scheduling and executing parallel and background reconstructions, and a system for rate limiting by rejecting inserts via Bernoulli sampling. -We evaluated this system for fixed stall rates, and found significant +We evaluated this system for fixed rejection rates, and found significant improvements in tail latencies, approaching the practical lower bound we established using the equal block method, without requiring significant degradation of query performance. In fact, we found that this rate |