From 228be229a831ad082e8310a6d247f1153fb475b8 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Thu, 29 May 2025 19:36:41 -0400
Subject: updates

---
 chapters/design-space.tex | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

(limited to 'chapters/design-space.tex')

diff --git a/chapters/design-space.tex b/chapters/design-space.tex
index f639999..98c5bb2 100644
--- a/chapters/design-space.tex
+++ b/chapters/design-space.tex
@@ -4,11 +4,11 @@ \section{Introduction}

In the previous two chapters, we introduced an LSM tree inspired design
-space into the Bentley-Saxe method to allow for more flexilibity in
+space into the Bentley-Saxe method to allow for more flexibility in
tuning the performance. However, aside from some general comments about
how these parameters operate in relation to insertion and query
performance, and some limited experimental evaluation, we haven't
-performed a systematic analsyis of this space, its capabilities, and its
+performed a systematic analysis of this space, its capabilities, and its
limitations. We will rectify this situation in this chapter, performing
both a detailed mathematical analysis of the design parameter space,
as well as experiments to demonstrate these trade-offs exist in practice.
@@ -16,7 +16,7 @@ as well as experiments to demonstrate these trade-offs exist in practice.

\subsection{Why bother?}

Before diving into the design space we have introduced in detail, it's
-worth taking some time to motivate this entire endevour. There is a large
+worth taking some time to motivate this entire endeavor. There is a large
body of theoretical work in the area of data structure dynamization,
and, to the best of our knowledge, none of these papers have introduced
a design space of the sort that we have introduced here. Despite this,
@@ -55,7 +55,7 @@ Bentley-Saxe in Section~\ref{ssec:dyn-knn-exp}.
The other experiments in Chapter~\ref{chap:framework} show that, for
other types of problem, the technique does not fare quite so well.

-\section{Asymptotic Analsyis}
+\section{Asymptotic Analysis}
\label{sec:design-asymp}

Before beginning with derivations for
@@ -172,7 +172,7 @@ writes from the previous level into level $i$, as well as rewriting all
of the records currently on level $i$.

The net result of this is that the number of writes on level $i$ is given
-by the following recurrance relation (combined with the $W(0)$ base case),
+by the following recurrence relation (combined with the $W(0)$ base case),
\begin{equation*}
W(i) = sW(i-1) + \frac{1}{2}\left(s-1\right)^2 \cdot s^i
@@ -297,13 +297,13 @@ one fewer time, and so on. Thus, the total number of writes is,
\begin{equation*}
B\sum_{i=0}^{s-1} (s - i) = B\left(s^2 - \sum_{i=0}^{s-1} i\right) = B\left(s^2 - \frac{(s-1)s}{2}\right)
\end{equation*}
-which can be simplied to get,
+which can be simplified to get,
\begin{equation*}
\frac{1}{2}s(s+1)\cdot B
\end{equation*}
-writes occuring on each level.\footnote{
-    This write count is not cummulative over the entire structure. It only
-    accounts for the number of writes occuring on this specific level.
+writes occurring on each level.\footnote{
+    This write count is not cumulative over the entire structure. It only
+    accounts for the number of writes occurring on this specific level.
}

To obtain the total number of times records are rewritten, we need to
@@ -517,7 +517,7 @@ characteristics of the three layout policies. For this test, we consider
two data structures: the ISAM tree and the VP tree. The ISAM tree
structure is merge-decomposable using a sorted-array merge, with a build
cost of $B_M(n) \in \Theta(n \log k)$, where $k$ is the number
-of structures being merged. The VPTree, by constrast, is \emph{not}
+of structures being merged. The VPTree, by contrast, is \emph{not}
merge decomposable, and is built in $B(n) \in \Theta(n \log n)$ time.
We use the $200,000,000$ record SOSD \texttt{OSM} dataset~\cite{sosd}
for ISAM testing, and the $1,000,000$ record, $300$-dimensional Spanish
@@ -551,9 +551,9 @@ this is not a significant limitation for our analysis.

The resulting distributions are shown in
Figure~\ref{design-policy-ins-latency}. These distributions are
represented using a ``reversed'' CDF with log scaling on both axes. This
-representation has proven very useful for interpretting the latency
+representation has proven very useful for interpreting the latency
distributions that we see in evaluating dynamization, but is slightly
-unusual, and so we've included a guide to interpretting these charts
+unusual, and so we've included a guide to interpreting these charts
in Appendix~\ref{append:rcdf}.

The first notable point is that, for both the ISAM tree
@@ -592,7 +592,7 @@ due to cache effects most likely, but less so than in the MDSP case.

Next, in Figure~\ref{fig:design-ins-tput}, we show the overall
insertion throughput for the three policies. This result should
-correlate with the amorized insertion costs for each policy derived in
+correlate with the amortized insertion costs for each policy derived in
Section~\ref{sec:design-asymp}. As expected, tiering has the highest
throughput.

@@ -618,7 +618,7 @@ techniques is that, asymptotically, the additional cost added by
decomposing the data structure vanishes for sufficiently expensive
queries. Bentley and Saxe proved that for query costs of the form
$\mathscr{Q}_B(n) \in \Omega(n^\epsilon)$ for $\epsilon > 0$, the
-overal query cost is unaffected (asymptotically) by the decomposition.
+overall query cost is unaffected (asymptotically) by the decomposition.
This would seem to suggest that, as the cost of the query over a single
shard increases, the effectiveness of our design space for tuning query
performance should reduce. This is because our tuning space consists
@@ -643,7 +643,7 @@ constant factors only.
In general asymptotic analysis, all possible configurations of our
framework in this scheme collapse to the same basic cost functions
when the constants are removed. While we have demonstrated that, in
practice, the effects of this configuration are measurable, there
-do exist techniques in the classical literature that provide asympotically
+do exist techniques in the classical literature that provide asymptotically
relevant trade-offs, such as the equal block method~\cite{maurer80} and
the mixed method~\cite[pp.~117--118]{overmars83}. These techniques have
cost functions that are derived from arbitrary, positive, monotonically
@@ -702,7 +702,7 @@ capacity of each level is provided by Equation~\ref{eqn:design-k-expr} is
\end{theorem}
\begin{proof}
The number of levels within the structure is given by $\log_s (n)$,
-where $s$ is the scale factor. The addition of $k$ to the parameterization
+where $s$ is the scale factor. The addition of $k$ to the parametrization
replaces this scale factor with $s \log^k n$, and so we have
\begin{equation*}
\log_{s \log^k n}n = \frac{\log n}{\log\left(s \log^k n\right)} = \frac{\log n}{\log s + k \log \log n} \in O\left(\frac{\log n}{\log (k \log n)}\right)
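As a brief editorial sanity check on the per-level write count discussed in the @@ -297 hunk above (this derivation is not part of the original commit), reindexing the sum with the substitution $j = s - i$ collapses it in one step to the simplified form $\frac{1}{2}s(s+1)\cdot B$ that the hunk quotes:

```latex
% Reindex with j = s - i: as i runs over 0..s-1, j runs over s..1,
% so the sum of (s - i) is just the sum of the first s integers.
% Spot check for s = 4:  4 + 3 + 2 + 1 = 10 = (1/2) * 4 * 5.
\begin{equation*}
B\sum_{i=0}^{s-1} (s - i)
    = B\sum_{j=1}^{s} j
    = B \cdot \frac{s(s+1)}{2}
    = \frac{1}{2}s(s+1)\cdot B
\end{equation*}
```

This confirms the simplified per-level count stated in the chapter without needing the intermediate expansion.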