From 7d8ba7ed3d24b03fed3ac789614c27b3808ded74 Mon Sep 17 00:00:00 2001 From: Douglas Rumbaugh Date: Thu, 15 May 2025 18:24:35 -0400 Subject: updates --- chapters/beyond-dsp.tex | 1532 +++++++++++++++++++++-------------------------- 1 file changed, 699 insertions(+), 833 deletions(-) (limited to 'chapters/beyond-dsp.tex') diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex index 50d6369..66b9d97 100644 --- a/chapters/beyond-dsp.tex +++ b/chapters/beyond-dsp.tex @@ -11,863 +11,729 @@ \label{chap:framework} -The previous chapter demonstrated -the possible utility of -designing indexes based upon the dynamic extension of static data -structures. However, the presented strategy falls short of a general -framework, as it is specific to sampling problems. In this chapter, -the techniques of that work will be discussed in more general terms, -to arrive at a more broadly applicable solution. A general -framework is proposed, which places only two requirements on supported data -structures, - -\begin{itemize} - \item Extended Decomposability - \item Record Identity -\end{itemize} +\section{Introduction} + +In the previous chapter, we discussed how several of the limitations of +dynamization could be overcome by proposing a systematic dynamization +approach for sampling data structures. In doing so, we introduced +a multi-stage query mechanism to overcome the non-decomposability of +these queries, provided two mechanisms for supporting deletes along with +specialized processing to integrate these with the query mechanism, and +introduced some performance tuning capability inspired by the design space +of modern LSM Trees. While promising, these results are highly specialized +and remain useful only within the context of sampling queries. In this +chapter, we develop new generalized query abstractions based on these +specific results, and discuss a fully implemented framework based upon +these abstractions. + +More specifically, in this chapter we propose \emph{extended +decomposability} and \emph{iterative deletion decomposability} as two +new, broader classes of search problem which are strict supersets of +decomposability and deletion decomposability respectively, providing a +more powerful interface to allow the efficient implementation of a larger +set of search problems over a dynamized structure. We then implement +a C++ library based upon these abstractions which is capable of adding +support for inserts, deletes, and concurrency to static data structures +automatically, and use it to provide dynamizations for independent range +sampling, range queries with learned indices, string search with succinct +tries, and high dimensional vector search with metric indices. In each +case we compare our dynamized implementation with existing dynamic +structures, and standard Bentley-Saxe dynamizations, where possible. + +\section{Beyond Decomposability} + +We begin our discussion of this generalized framework by proposing +new classes of search problems based upon our results from examining +sampling problems in the previous chapter. Our new classes will enable +the support of new types of search problem, enable more efficient support +for certain already supported problems, and allow for broader support +of deletes. Based on this, we will develop a taxonomy of search problems +that can be supported by our dynamization technique. 
+ + +\subsection{Extended Decomposability} + +As discussed in Chapter~\cite{chap:background}, the standard query model +used by dynamization techniques requires that a given query be broadcast, +unaltered, to each block within the dynamized structure, and then that +the results from these identical local queries be efficiently mergable +to obtain the final answer to the query. This model limits dynamization +to decomposable search problems (Definition~\ref{def:dsp}). + +In the previous chapter, we considered various sampling problems as +examples of non-decomposable search problems, and devised a technique for +correctly answering queries of that type over a dynamized structure. In +this section, we'll retread our steps with an eye towards a general +solution, that could be applicable in other contexts. For convenience, +we'll focus exlusively on independent range sampling. As a reminder, this +search problem is defined as, + +\begin{definitionIRS}[Independent Range Sampling~\cite{tao22}] + Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query + interval $q = [x, y]$ and an integer $k$, an independent range sampling + query returns $k$ independent samples from $D \cap q$ with each + point having equal probability of being sampled. +\end{definitionIRS} + +We formalize this as a search problem $F_\text{IRS}:(\mathcal{D}, +\mathcal{Q}) \to \mathcal{R}$ where the record domain is $\mathcal{D} += \mathbb{R}$, the query parameters domain consists of order triples +containing the lower and upper boudns of the query interval, and the +number of samples to draw, $\mathcal{Q} = \mathbb{R} \times \mathbb{R} +\times \mathbb{Z}^+$, and the result domain containts subsets of the +real numbers, $\mathcal{R} = \mathcal{PS}(\mathbb{R})$. + +$F_\text{IRS}$ can be solved using a variety of data structures, such as +the static ISAM solution discussed in Section~\ref{ssec:irs-struct}. For +our example here, we will use a simple sorted array. Let $\mathcal{I}$ +be the sorted array data structure, with a specific instance $\mathscr{I} +\in \mathcal{I}$ built over a set $D \subset \mathbb{R}$ having $|D| = +n$ records. The problem $F_\text{IRS}(\mathscr{I}, (l, u, k))$ can be +solved by binary searching $\mathscr{I}$ twice to obtain the index of +the first element greater than or equal to $l$ ($i_l$) and the last +element less than or equal to $u$ ($i_u$). With these two indices, +$k$ random numbers can generated on the interval $[i_l, i_u]$ and the +records at these indices returned. This sampling procedure is described +in Algorithm~\ref{alg:array-irs} and runs in $\mathscr{Q}_\text{irs} +\in \Theta(\log n + k)$ time. + +\SetKwFunction{IRS}{IRS} +\begin{algorithm} +\caption{Solution to IRS on a sorted array} +\label{alg:array-irs} +\KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample} +\KwOut{$S$: a sample set of size $k$} +\Def{\IRS{$(\mathscr{I}, (l, u, k))$}}{ + \Comment{Find the lower and upper bounds of the interval} + $i_l \gets \text{binary\_search\_lb}(\mathscr{I}, l)$ \; + $i_u \gets \text{binary\_search\_ub}(\mathscr{I}, u)$ \; + \BlankLine + \Comment{Initialize empty sample set} + $S \gets \{\}$ \; + \BlankLine + \For {$i=1\ldots k$} { + \Comment{Select a random record within the inteval} + $i_r \gets \text{randint}(i_l, i_u)$ \; + + \Comment{Add it to the sample set} + $S \gets S \cup \{\text{get}(\mathscr{I}, i_r)\}$ \; + } + \BlankLine + \Comment{Return the sample set} + \Return $S$ \; +} +\end{algorithm} -In this chapter, first these two properties are defined. 
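For concreteness, the sampling procedure of Algorithm~\ref{alg:array-irs} can
be sketched directly in C++. This is only an illustrative sketch, assuming that
the records are doubles held in a sorted \texttt{std::vector} and that a
caller-supplied Mersenne Twister is used for random number generation; it is
not the implementation used later in this chapter.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// IRS over a single sorted array: two binary searches to find the
// index range covering [l, u], followed by k uniform index draws.
std::vector<double> irs(const std::vector<double> &data, double l, double u,
                        size_t k, std::mt19937 &rng) {
    size_t i_l = std::lower_bound(data.begin(), data.end(), l) - data.begin();
    size_t i_u = std::upper_bound(data.begin(), data.end(), u) - data.begin();

    std::vector<double> sample;
    if (i_l >= i_u) return sample;  // no records fall within [l, u]

    std::uniform_int_distribution<size_t> dist(i_l, i_u - 1);
    for (size_t i = 0; i < k; i++) {
        sample.push_back(data[dist(rng)]);
    }
    return sample;
}
\end{verbatim}
The two binary searches account for the $\Theta(\log n)$ term of the cost,
and the sampling loop for the $\Theta(k)$ term.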
Then, -a general dynamic extension framework is described which can -be applied to any data structure supporting these properties. Finally, -an experimental evaluation is presented that demonstrates the viability -of this framework. - -\section{Extended Decomposability} - -Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be efficiently -addressed using Bentley-Saxe, so long as the query interface is -modified to accommodate their needs. For Independent sampling -problems, this involved a two-pass approach, where some pre-processing -work was performed against each shard and used to construct a shard -alias structure. This structure was then used to determine how many -samples to draw from each shard. - -To generalize this approach, a new class of decomposability is proposed, -called \emph{extended decomposability}. At present, its -definition is tied tightly to the query interface, rather -than a formal mathematical definition. In extended decomposability, -rather than treating a search problem as a monolith, the algorithm -is decomposed into multiple components. -This allows -for communication between shards as part of the query process. -Additionally, rather than using a binary merge operator, extended -decomposability uses a variadic function that merges all of the -result sets in one pass, reducing the cost due to merging by a -logarithmic factor without introducing any new restrictions. - -The basic interface that must be supported by a extended-decomposable -search problem (eDSP) is, -\begin{itemize} +It becomes more difficult to answer $F_\text{IRS}$ over a data structure +that has been decomposed into blocks, because the number of samples +taken from each block must be appropriately weighted to correspond to the +number of records within each block falling into the query range. In the +classical model, there isn't a way to do this, and so the only solution +is to answer $F_\text{IRS}$ against each block, asking for the full $k$ +samples each time, and then downsampling the results corresponding to +the relative weight of each block, to obtain a final sample set. + +Using this idea, we can formulate $F_\text{IRS}$ as a $C(n)$-decomposable +problem by changing the result set type to $\mathcal{R} = +\mathcal{PS}(\mathbb{R}) \times \mathbb{R}$ where the first element +in the tuple is the sample set and the second argument is the number +of elements falling between $l$ and $u$ in the block being sampled +from. With this information, it is possible to implement $\mergeop$ +using Bernoulli sampling over the two sample sets to be merged. This +requires $\Theta(k)$ time, and thus $F_\text{IRS}$ can be said to be +a $k$-decomposable search problem, which runs in $\Theta(\log^2 n + k +\log n)$ time. This procedure is shown in Algorithm~\ref{alg:decomp-irs}. 
\SetKwFunction{IRSDecomp}{IRSDecomp}
\SetKwFunction{IRSCombine}{IRSCombine}
\begin{algorithm}[!h]
    \caption{$k$-Decomposable Independent Range Sampling}
    \label{alg:decomp-irs}
    \KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample}
    \KwOut{$(S, c)$: a sample set of size $k$ and a count of the number
    of records on the interval $[l,u]$}
    \Def{\IRSDecomp{$\mathscr{I}_i, (l, u, k)$}}{
        \Comment{Find the lower and upper bounds of the interval}
        $i_l \gets \text{binary\_search\_lb}(\mathscr{I}_i, l)$ \;
        $i_u \gets \text{binary\_search\_ub}(\mathscr{I}_i, u)$ \;
        \BlankLine
        \Comment{Initialize empty sample set}
        $S \gets \{\}$ \;
        \BlankLine
        \For {$i=1\ldots k$} {
            \Comment{Select a random record within the interval}
            $i_r \gets \text{randint}(i_l, i_u)$ \;

            \Comment{Add it to the sample set}
            $S \gets S \cup \{\text{get}(\mathscr{I}_i, i_r)\}$ \;
        }
        \BlankLine
        \Comment{Return the sample set and record count}
        \Return ($S$, $i_u - i_l$) \;
    }
    \BlankLine

    \Def{\IRSCombine{$(S_1, c_1)$, $(S_2, c_2)$}}{
        \Comment{The output set should be the same size as the input ones}
        $k \gets |S_1|$ \;
        \BlankLine
        \Comment{Calculate the weighting that should be applied to each set when sampling}
        $w_1 \gets \frac{c_1}{c_1 + c_2}$ \;
        $w_2 \gets \frac{c_2}{c_1 + c_2}$ \;
        \BlankLine
        \Comment{Initialize output set and count}
        $S \gets \{\}$\;
        $c \gets c_1 + c_2$ \;
        \BlankLine
        \Comment{Down-sample the input result sets}
        $S \gets S \cup \text{bernoulli}(S_1, w_1, k\times w_1)$ \;
        $S \gets S \cup \text{bernoulli}(S_2, w_2, k\times w_2)$ \;
        \BlankLine
        \Return $(S, c)$ \;
    }
\end{algorithm}

While this approach does allow sampling over a dynamized structure, it is
asymptotically inferior to Olken's method, which allows for sampling in
only $\Theta(k \log n)$ time~\cite{olken89}. However, we've already seen
in the previous chapter how it is possible to modify the query procedure
into a multi-stage process to enable more efficient solutions to the IRS
problem. The core idea underlying our solution in that chapter was to
introduce individualized local queries for each block, which were created
after a pre-processing step that allowed information about each block to
be determined first. In that particular example, we established the weight
each block should have during sampling, and then created custom sampling
queries with variable $k$ values, following the weight distributions.
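To make the down-sampling merge of Algorithm~\ref{alg:decomp-irs} concrete,
a minimal C++ sketch of the binary $\mergeop$ is given below. It assumes
real-valued records, represents a partial result as a pair of a sample vector
and the in-range record count for its block, and takes a caller-supplied
random number generator; all names are illustrative.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// (samples, number of records falling in [l, u] within the block)
using PartialResult = std::pair<std::vector<double>, size_t>;

PartialResult irs_combine(const PartialResult &a, const PartialResult &b,
                          std::mt19937 &rng) {
    size_t k = a.first.size();
    size_t total = a.second + b.second;

    PartialResult out;
    out.second = total;
    if (total == 0 || k == 0) return out;

    // For each output slot, draw from input a with probability
    // c_a / (c_a + c_b), otherwise from input b. Because each input
    // already holds k independent uniform samples from its block,
    // consuming them in order preserves uniformity.
    std::bernoulli_distribution from_a((double) a.second / (double) total);
    size_t next_a = 0, next_b = 0;
    for (size_t i = 0; i < k; i++) {
        if (from_a(rng) && next_a < a.first.size()) {
            out.first.push_back(a.first[next_a++]);
        } else if (next_b < b.first.size()) {
            out.first.push_back(b.first[next_b++]);
        }
    }
    return out;
}
\end{verbatim}
Applying this merge pairwise across the $\Theta(\log n)$ blocks of a
Bentley-Saxe dynamization is what introduces the additional $k \log n$
term that the interface described next is designed to avoid.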
We have determined a general interface that allows this procedure to be
expressed, and we define the term \emph{extended decomposability} to refer
to search problems that can be answered in this way.

More formally, consider a search problem $F(D, q)$ capable of being
answered using a data structure instance $\mathscr{I} \in \mathcal{I}$
built over a set of records $D \in \mathcal{D}$ that has been decomposed
into $m$ blocks, $\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_m$,
each corresponding to a partition of $D$, $D_1, D_2, \ldots, D_m$. $F$
is an extended-decomposable search problem (eDSP) if it can be expressed
using the following interface,
\begin{itemize}
\item $\mathbftt{local\_preproc}(\mathscr{I}_i, q) \to \mathscr{M}_i$ \\
    Pre-process each partition, $D_i$, using its associated data
    structure, $\mathscr{I}_i$, and generate a meta-information object
    $\mathscr{M}_i$ for use in local query generation.

\item $\mathbftt{distribute\_query}(\mathscr{M}_1, \ldots, \mathscr{M}_m,
    q) \to q_1, \ldots, q_m$\\
    Process the set of meta-information about each block and produce
    individual local queries, $q_1, \ldots, q_m$, for each block.

\item $\mathbftt{local\_query}(\mathscr{I}_i, q_i) \to r_i$ \\
    Evaluate the local query with parameters $q_i$ over the data
    in $D_i$ using the data structure $\mathscr{I}_i$ and produce
    a partial query result, $r_i$.

\item $\mathbftt{combine}(r_1, \ldots, r_m) \to R$ \\
    Combine the list of local query results, $r_1, \ldots, r_m$, into
    a final query result, $R$.
\end{itemize}

Let $P(n)$ be the cost of $\mathbftt{local\_preproc}$, $D(n)$ be
the cost of $\mathbftt{distribute\_query}$, $\mathscr{Q}_\ell(n)$
be the cost of $\mathbftt{local\_query}$, and $C_e(n)$ be the cost of
$\mathbftt{combine}$. Solving a search problem with this interface
requires calling $\mathbftt{local\_preproc}$ and $\mathbftt{local\_query}$
once per block, and $\mathbftt{distribute\_query}$ and
$\mathbftt{combine}$ once. For a Bentley-Saxe dynamization, with
$O(\log_2 n)$ blocks, the worst-case cost of answering an eDSP is,
\begin{equation}
\label{eqn:edsp-cost}
O \left( \log_2 n \cdot P(n) + D(n) + \log_2 n \cdot \mathscr{Q}_\ell(n) + C_e(n) \right)
\end{equation}

As an example, we will express IRS using the above interface and
analyze its complexity to show that the resulting solution has the
same $\Theta(\log^2 n + k)$ cost as the specialized solution from
Chapter~\ref{chap:sampling}. We use $\mathbftt{local\_preproc}$
to determine the number of records in each block falling on the
interval $[l, u]$ and return this, as well as $i_l$ and $i_u$, as the
meta-information. Then, $\mathbftt{distribute\_query}$ performs
weighted set sampling, using a temporary alias structure over the
weights of all of the blocks, to calculate the appropriate value
of $k_i$ for each local query, which will consist of $(k_i, i_{l,i},
i_{u,i})$. With the appropriate value of $k_i$, as well as the indices of
the upper and lower bounds, pre-calculated, $\mathbftt{local\_query}$
can simply generate $k_i$ random integers and return the corresponding
records.
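Before completing the IRS example, it is worth showing one way that the four
operations above could be expressed in code. The following C++20 concept is a
sketch only: the nested type names (\texttt{Parameters}, \texttt{Meta},
\texttt{LocalQuery}, \texttt{LocalResult}, \texttt{Result}) and the exact
signatures are assumptions made for illustration, not the interface of the
framework discussed later in this chapter.
\begin{verbatim}
#include <concepts>
#include <vector>

// A query type Q is extended decomposable with respect to a block
// (shard) type S if it provides the four eDSP operations.
template <typename Q, typename S>
concept ExtendedDecomposable = requires(
        const S &shard,
        const typename Q::Parameters &params,
        const std::vector<typename Q::Meta> &meta,
        const typename Q::LocalQuery &local_q,
        const std::vector<typename Q::LocalResult> &partials) {
    // local_preproc: per-block pre-processing producing meta-information
    { Q::local_preproc(shard, params) } -> std::same_as<typename Q::Meta>;

    // distribute_query: build one local query per block from the meta-information
    { Q::distribute_query(meta, params) }
        -> std::same_as<std::vector<typename Q::LocalQuery>>;

    // local_query: evaluate a local query against a single block
    { Q::local_query(shard, local_q) } -> std::same_as<typename Q::LocalResult>;

    // combine: merge the local results into the final answer
    { Q::combine(partials) } -> std::same_as<typename Q::Result>;
};
\end{verbatim}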
$\mathbftt{combine}$ simply takes the union of the local result sets
and returns it as the final result. Algorithm~\ref{alg:edsp-irs} shows
each of these operations in pseudocode.


\SetKwFunction{preproc}{local\_preproc}
\SetKwFunction{distribute}{distribute\_query}
\SetKwFunction{query}{local\_query}
\SetKwFunction{combine}{combine}
\begin{algorithm}[t]
    \caption{IRS with Extended Decomposability}
    \label{alg:edsp-irs}
    \KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample}
    \KwOut{$R$: a sample set of size $k$}

    \Def{\preproc{$\mathscr{I}_i$, $q=(l,u,k)$}}{
        \Comment{Find the indices for the upper and lower bounds of the query range}
        $i_l \gets \text{binary\_search\_lb}(\mathscr{I}_i, l)$ \;
        $i_u \gets \text{binary\_search\_ub}(\mathscr{I}_i, u)$ \;
        \BlankLine
        \Return $(i_l, i_u)$ \;
    }

    \BlankLine
    \Def{\distribute{$\mathscr{M}_1$, $\ldots$, $\mathscr{M}_m$, $q=(l,u,k)$}}{
        \Comment{Determine the number of records to sample from each block}
        $k_1, \ldots k_m \gets \mathtt{wss}(k, \mathscr{M}_1, \ldots \mathscr{M}_m)$ \;
        \BlankLine
        \Comment{Build local query objects}
        \For {$i=1\ldots m$} {
            $q_i \gets (\mathscr{M}_i.i_l, \mathscr{M}_i.i_u, k_i)$ \;
        }

        \BlankLine
        \Return $q_1 \ldots q_m$ \;
    }

    \BlankLine
    \Def{\query{$\mathscr{I}_i$, $q_i = (i_{l,i},i_{u,i},k_i)$}}{
        \Comment{Initialize empty sample set}
        $S \gets \{\}$ \;
        \BlankLine
        \For {$i=1\ldots k_i$} {
            \Comment{Select a random record within the interval}
            $i_r \gets \text{randint}(i_{l,i}, i_{u,i})$ \;

            \Comment{Add it to the sample set}
            $S \gets S \cup \{\text{get}(\mathscr{I}_i, i_r)\}$ \;
        }

        \Return $S$ \;
    }

    \BlankLine
    \Def{\combine{$r_1, \ldots, r_m$, $q=(l, u, k)$}}{
        \Comment{Union results together}
        \Return $\bigcup_{i=1}^{m} r_i$
    }
\end{algorithm}

These operations result in $P(n) \in \Theta(\log n)$, $D(n) \in
\Theta(\log n)$, $\mathscr{Q}_\ell(n, k) \in \Theta(k)$, and $C_e(n) \in
\Theta(1)$. At first glance, it would appear that we have arrived at a
solution with a query cost of $O\left(\log_2^2 n + k\log_2 n\right)$,
and thus fallen short of our goal. However, Equation~\ref{eqn:edsp-cost}
is only an upper bound on the cost. In the case of IRS, we can leverage an
important problem-specific detail to obtain a better result: the total
cost of the local queries is actually \emph{independent} of the number
of blocks.

For IRS, the cost of $\mathbftt{local\_query}$ is linear in the number
of samples requested. Our initial asymptotic cost assumes that, in the
worst case, each of the $\log_2 n$ blocks is sampled $k$ times. But
this is not true of our algorithm. Rather, only $k$ samples are taken
\emph{in total}, distributed across all of the blocks. Thus, regardless
of how many blocks there are, there will only be $k$ samples drawn,
requiring only $k$ random number generations in total. As a result, the
total cost of the local query term in the cost function is actually $\Theta(k)$.
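Before turning to the resulting bound, the four operations just described can
be sketched in C++ for the sorted-array version of IRS. The sketch below is
illustrative only: the type names follow the concept sketched earlier, an
explicit random number generator argument is added for clarity, and
\texttt{std::discrete\_distribution} stands in for the alias structure used
for weighted set sampling.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct IRSQuery {
    struct Parameters { double l, u; size_t k; };
    struct Meta       { size_t i_l, i_u; };      // per-block index bounds
    struct LocalQuery { size_t i_l, i_u, k; };
    using LocalResult = std::vector<double>;
    using Result      = std::vector<double>;

    // Two binary searches per block: Theta(log n) each.
    static Meta local_preproc(const std::vector<double> &block,
                              const Parameters &p) {
        size_t i_l = std::lower_bound(block.begin(), block.end(), p.l) - block.begin();
        size_t i_u = std::upper_bound(block.begin(), block.end(), p.u) - block.begin();
        return {i_l, i_u};
    }

    // Weight each block by its in-range record count and split the k
    // samples across blocks by weighted sampling of block indices.
    static std::vector<LocalQuery> distribute_query(const std::vector<Meta> &meta,
                                                    const Parameters &p,
                                                    std::mt19937 &rng) {
        std::vector<LocalQuery> queries;
        std::vector<double> weights;
        double total = 0;
        for (const auto &m : meta) {
            queries.push_back({m.i_l, m.i_u, 0});
            weights.push_back((double) (m.i_u - m.i_l));
            total += weights.back();
        }
        if (total == 0) return queries;   // no records fall within [l, u]

        std::discrete_distribution<size_t> block_dist(weights.begin(), weights.end());
        for (size_t i = 0; i < p.k; i++) {
            queries[block_dist(rng)].k++;
        }
        return queries;
    }

    // Only k_i samples are drawn from block i; the k_i sum to k.
    static LocalResult local_query(const std::vector<double> &block,
                                   const LocalQuery &q, std::mt19937 &rng) {
        LocalResult samples;
        if (q.i_u <= q.i_l || q.k == 0) return samples;
        std::uniform_int_distribution<size_t> dist(q.i_l, q.i_u - 1);
        for (size_t i = 0; i < q.k; i++) {
            samples.push_back(block[dist(rng)]);
        }
        return samples;
    }

    static Result combine(const std::vector<LocalResult> &partials) {
        Result out;
        for (const auto &r : partials) {
            out.insert(out.end(), r.begin(), r.end());
        }
        return out;
    }
};
\end{verbatim}
Note that, summed over all blocks, the sampling loops in
\texttt{local\_query} execute only $k$ times in total, which is precisely
the problem-specific detail exploited in the analysis above.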
Applying this result gives us a tighter bound of,
\begin{equation*}
\mathscr{Q}_\text{IRS} \in \Theta\left(\log_2^2 n + k\right)
\end{equation*}
which matches the result of Chapter~\ref{chap:sampling} for IRS
in the absence of deletes. The other sampling problems considered in
Chapter~\ref{chap:sampling} can be similarly implemented using this
interface, with the same performance as their specialized implementations.


\subsection{Iterative Deletion Decomposability}

We next turn our attention to support for deletes. Efficient delete
support in Bentley-Saxe dynamization is provably impossible~\cite{saxe79},
but, as discussed in Section~\ref{ssec:dyn-deletes}, it is possible
to support deletes in restricted situations, where either the search
problem is invertible (Definition~\ref{}) or the data structure and
search problem combined are deletion decomposable (Definition~\ref{}).
In Chapter~\ref{chap:sampling}, we considered a set of search problems
which did \emph{not} satisfy any of these properties, and instead built a
customized solution for deletes that required tight integration with the
query process in order to function. While such a solution was acceptable
for the goals of that chapter, it is not sufficient for our goal in this
chapter of producing a generalized system.

Additionally, of the two types of problem that can support deletes, the
invertible case is preferable, because the amount of work necessary
to support deletes for invertible search problems is very small. The data
structure requires no modification (such as to implement weak deletes),
and the query requires no modification (to ignore the weak deletes) aside
from the addition of the $\Delta$ operator. This is appealing from a
framework design standpoint. Thus, it is also worthwhile to consider
approaches for expanding the range of search problems that can be answered
using the ghost structure mechanism supported by invertible problems.

A significant limitation of invertible problems is that the result set
size cannot be controlled. We do not know how many records in our
local results have been deleted until we reach the combine operation and
they begin to cancel out, at which point we lack a mechanism to go back
and retrieve more. This presents difficulties for addressing important
search problems such as top-$k$, $k$-NN, and sampling. In principle, these
queries could be supported by repeating the query with larger and larger
$k$ values until the desired number of records is returned, but in the
eDSP model this requires throwing away a lot of useful work, as the state
of the query must be rebuilt each time.

We can resolve this problem by moving the decision to repeat the query
into the query interface itself, allowing retries \emph{before} the
result set is returned to the user and the local meta-information objects
are discarded. This allows us to preserve the pre-processing work, and
repeat the local query process as many times as is necessary to achieve
our desired number of records. From this observation, we propose another
new class of search problem: \emph{iterative deletion decomposable} (IDSP).
The IDSP definition expands eDSP with a fifth operation,

\begin{itemize}
    \item $\mathbftt{repeat}(\mathcal{Q}, \mathcal{R}, \mathcal{Q}_1, \ldots,
        \mathcal{Q}_m) \to (\mathbb{B}, \mathcal{Q}_1, \ldots,
        \mathcal{Q}_m)$ \\
        Evaluate the combined query result in light of the query.
If + a repetition is necessary to satisfy constraints in the query + (e.g., result set size), optionally update the local queries as + needed and return true. Otherwise, return false. +\end{itemize} -Before describing how to use this new interface and definition to -support more efficient queries than standard decomposability, more -more general expression for the cost of querying such a structure should -be derived. -Recall that Bentley-Saxe, when applied to a $C(n)$-decomposable -problem, has the following query cost, +If this routine returns true, then the query process is repeated +from $\mathbftt{distribute\_query}$, and if it returns false then the +result is returned to the user. If the number of repetitions of the +query is bounded by $R(n)$, then the following provides an upper bound +on the worst-case query complexity of an IDSP, + +\begin{equation*} + O\left(\log_2 n \cdot P(n) + R(n) \left(D(n) + \log_2 n \cdot Q_s(n) + + C_e(n)\right)\right) +\end{equation*} + +It is important that a bound on the number of repetitions exists, +as without this the worst-case query complexity is unbounded. The +details of providing and enforcing this bound are very search problem +specific. For problems like $k$-NN or top-$k$, the number of repetitions +is a function of the number of deleted records within the structure, +and so $R(n)$ can be bounded by placing a limit on the number of deleted +records. This can be done, for example, using the full-reconstruction +techniques in the literature~\cite{saxe79, merge-dsp, overmars83} +or through proactively performing reconstructions, such as with the +mechanism discussed in Section~\ref{sssec:sampling-rejection-bound}, +depending on the particulars of how deletes are implemented. + +As an example of how IDSP can facilitate delete support for search +problems, let's consider $k$-NN. This problem can be $C(n)$-deletion +decomposable, depending upon the data structure used to answer it, but +it is not invertible because it suffers from the problem of potentially +returning fewer than $k$ records in the final result set after the results +of the query against the primary and ghost structures have been combined. +Worse, even if the query does return $k$ records as requested, it is +possible that the result set could be incorrect, depending upon which +records were deleted, what block those records are in, and the order in +which the merge and inverse merge are applied. + +\begin{example} +Consider the $k$-NN search problem, $F$, over some metric index +$\mathcal{I}$. $\mathcal{I}$ has been dynamized, with a ghost +structure for deletes, and consists of two blocks, $\mathscr{I}_1$ and +$\mathscr{I}_2$ in the primary structure, and one block, $\mathscr{I}_G$ +in the ghost structure. The structures contain the following records, +\begin{align*} +\mathscr{I}_1 &= \{ x_1, x_2, x_3, x_4, x_5\} \\ +\mathscr{I}_2 &= \{ x_6, x_7, x_8 \} \\ +\mathscr{I}_G &= \{x_1, x_2, x_3 \} +\end{align*} +where the subscript indicates the proximity to some point, $p$. Thus, +the correct answer to the query $F(\mathscr{I}, (3, p))$ would be the +set of points $\{x_4, x_5, x_6\}$. + +Querying each of the three blocks independently, however, will produce +an incorrect answer. 
The partial results will be, +\begin{align*} +r_1 = \{x_1, x_2, x_3\} \\ +r_2 = \{x_6, x_7, x_8\} \\ +r_g = \{x_1, x_2, x_3\} +\end{align*} +and, assuming that $\mergeop$ returns the $k$ elements closest to $p$ +from the inputs, and $\Delta$ removes matching elements, performing +$r_1~\mergeop~r_2~\Delta~r_g$ will give an answer of $\{\}$, which +has insufficient records, and performing $r_1~\Delta~r_g~\mergeop~r_2$ +will provide a result of $\{x_6, x_7, x_8\}$, which is wrong. +\end{example} + +From this example, we can draw two conclusions about performing $k$-NN +using a ghost structure for deletes. First, we must ensure that all of +the local queries against the primary structure are merged, prior to +removing any deleted records, to ensure correctness. Second, once the +ghost structure records have been removed, we may need to go back to +the dynamized structure for more records to ensure that we have enough. +Both of these requirements can be accomodated by the IDSP model, and the +resulting query algorithm is shown in Algorithm~\ref{alg:idsp-knn}. This +algorithm assumes that the data structure in question can save the +current traversal state in the meta-information object, and resume a +$k$-NN query on the structure from that state at no cost. + +\SetKwFunction{repeat}{repeat} + +\begin{algorithm}[th] + \caption{$k$-NN with Iterative Decomposability} + \label{alg:idsp-knn} + \KwIn{$k$: result size, $p$: query point} + \Def{\preproc{$q=(k, p)$, $\mathscr{I}_i$}}{ + \Return $\mathscr{I}_i.\text{initialize\_state}(k, p)$ \; + } -\begin{equation} - \label{eq3:Bentley-Saxe} - O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right) -\end{equation} -where $Q_s(n)$ is the cost of the query against one partition, and -$C(n)$ is the cost of the merge operator. - -Let $Q_s(n)$ represent the cost of \mathbftt{local\_query} and -$C(n)$ the cost of \mathbftt{merge} in the extended decomposability -case. Additionally, let $P(n)$ be the cost of $\mathbftt{local\_preproc}$ -and $\mathcal{D}(n)$ be the cost of \mathbftt{distribute\_query}. -Additionally, recall that $|D| = \log n$ for the Bentley-Saxe method. -In this case, the cost of a query is -\begin{equation} - O \left( \log n \cdot P(n) + \mathcal{D}(n) + - \log n \cdot Q_s(n) + C(n) \right) -\end{equation} + \BlankLine + \Def{\distribute{$\mathscr{M}_1$, ..., $\mathscr{M}_m$, $q=(k,p)$}}{ + \For {$i\gets1 \ldots m$} { + $q_i \gets (k, p, \mathscr{M}_i)$ \; + } -Superficially, this looks to be strictly worse than the Bentley-Saxe -case in Equation~\ref{eq3:Bentley-Saxe}. However, the important -thing to understand is that for $C(n)$-decomposable queries, $P(n) -\in O(1)$ and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded. -Thus, for normal decomposable queries, the cost actually reduces -to, -\begin{equation} - O \left( \log n \cdot Q_s(n) + C(n) \right) -\end{equation} -which is actually \emph{better} than Bentley-Saxe. Meanwhile, the -ability perform state-sharing between queries can facilitate better -solutions than would otherwise be possible. - -In light of this new approach, consider the two examples of -non-decomposable search problems from Section~\ref{ssec:decomp-limits}. 
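The control flow implied by the IDSP interface can also be sketched in C++
at the level of the dynamized structure itself. The driver below is an
illustration only: it assumes a query type \texttt{Q} providing the five
IDSP operations (with \texttt{repeat} updating the local queries in place
and returning whether another pass is required), and it omits details such
as the mutable buffer and delete filtering.
\begin{verbatim}
#include <cstddef>
#include <vector>

template <typename Q, typename Shard>
typename Q::Result idsp_query(const std::vector<Shard> &blocks,
                              const typename Q::Parameters &params) {
    // Pre-processing runs once and is preserved across repetitions,
    // which is the point of moving the retry decision into the interface.
    std::vector<typename Q::Meta> meta;
    for (const auto &b : blocks) {
        meta.push_back(Q::local_preproc(b, params));
    }

    auto local_queries = Q::distribute_query(meta, params);
    typename Q::Result result;

    bool again = true;
    while (again) {
        // Run the (possibly updated) local queries and merge their results.
        std::vector<typename Q::LocalResult> partials;
        for (size_t i = 0; i < blocks.size(); i++) {
            partials.push_back(Q::local_query(blocks[i], local_queries[i]));
        }
        result = Q::combine(partials);

        // repeat() may adjust the local queries and request another pass,
        // e.g., when deleted records leave the result short of k entries.
        again = Q::repeat(params, result, local_queries);
    }
    return result;
}
\end{verbatim}
The number of iterations of this loop corresponds to the $R(n)$ term in the
cost bound above, which is why a bound on the number of deleted records is
needed to bound the worst-case query cost.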
- -\subsection{k-Nearest Neighbor} -\label{ssec:knn} -The KNN problem is $C(n)$-decomposable, and Section~\ref{sssec-decomp-limits-knn} -arrived at a Bentley-Saxe based solution to this problem based on -VPTree, with a query cost of -\begin{equation} - O \left( k \log^2 n + k \log n \log k \right) -\end{equation} -by running KNN on each partition, and then merging the result sets -with a heap. - -Applying the interface of extended-decomposability to this problem -allows for some optimizations. Pre-processing is not necessary here, -but the variadic merge function can be leveraged to get an asymptotically -better solution. Simply dropping the existing algorithm into this -interface will result in a merge algorithm with cost, -\begin{equation} - C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right) -\end{equation} -which results in a total query cost that is slightly \emph{worse} -than the original, + \Return $q_1 \ldots q_m$ \; + } -\begin{equation} - O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right) -\end{equation} + \BlankLine + \Def{\query{$\mathscr{I}_i$, $q_i=(k,p,\mathscr{M}_i)$}}{ + $(r_i, \mathscr{M}_i) \gets \mathscr{I}_i.\text{knn\_from}(k, p, \mathscr{M}_i)$ \; + \BlankLine + \Comment{The local result includes the records stored in a priority queue and query state} + \Return $(r_i, \mathscr{M}_i)$ \; + } -The problem is that the number of records considered in a given -merge has grown from $O(k)$ in the binary merge case to $O(\log n -\cdot k)$ in the variadic merge. However, because the merge function -now has access to all of the data at once, the algorithm can be modified -slightly for better efficiency by only pushing $\log n$ elements -into the heap at a time. This trick only works if -the $R_i$s are in sorted order relative to $f(x, q)$, -however this condition is satisfied by the result sets returned by -KNN against a VPTree. Thus, for each $R_i$, the first element in sorted -order can be inserted into the heap, -element in sorted order into the heap, tagged with a reference to -which $R_i$ it was taken from. Then, when the heap is popped, the -next element from the associated $R_i$ can be inserted. -This allows the heap's size to be maintained at no larger -than $O(\log n)$, and limits the algorithm to no more than -$k$ pop operations and $\log n + k - 1$ pushes. - -This algorithm reduces the cost of KNN on this structure to, -\begin{equation} - O(k \log^2 n + \log n) -\end{equation} -which is strictly better than the original. 
+ \BlankLine + \Def{\combine{$r_1, \ldots, r_m, \ldots, r_n$, $q=(k,p)$}}{ + $R \gets \{\}$ \; + $pq \gets \text{PriorityQueue}()$ ; + $gpq \gets \text{PriorityQueue}()$ \; + \BlankLine + \Comment{Results $1$ through $m$ are from the primary structure, + and $m+1$ through $n$ are from the ghost structure.} + \For {$i\gets 1 \ldots m$} { + $pq.\text{enqueue}(i, r_i.\text{front}())$ \; + } + + \For {$i \gets m+1 \ldots n$} { + $gpq.\text{enqueue}(i, r_i.\text{front}())$ + } + + \BlankLine + \Comment{Process the primary local results} + \While{$|R| < k \land \neg pq.\text{empty}()$} { + $(i, d) \gets pq.\text{dequeue}()$ \; + + \BlankLine + $R \gets R \cup r_i.\text{dequeue}()$ \; + \If {$\neg r_i.\text{empty}()$} { + $pq.\text{enqueue}(i, r_i.\text{front}())$ \; + } + } + + \BlankLine + \Comment{Process the ghost local results} + \While{$\neg gpq.\text{empty}()$} { + $(i, d) \gets gpq.\text{dequeue}()$ \; + + \BlankLine + \If {$r_i.\text{front}() \in R$} { + $R \gets R / \{r_i.\text{front}()\}$ \; + + \If {$\neg r_i.\text{empty}()$} { + $gpq.\text{enqueue}(i, r_i.\text{front}())$ \; + } + } + } -\subsection{Independent Range Sampling} + \BlankLine + \Return $R$ \; + } + \BlankLine + \Def{\repeat{$q=(k,p), R, q_1,\ldots q_m$}} { + $missing \gets k - R.\text{size}()$ \; + \If {$missing > 0$} { + \For {$i \gets 1\ldots m$} { + $q_i \gets (missing, p, q_i.\mathscr{M}_i)$ \; + } + + \Return $(True, q_1 \ldots q_m)$ \; + } + + \Return $(False, q_1 \ldots q_m)$ \; + } +\end{algorithm} -The eDSP abstraction also provides sufficient features to implement -IRS, using the same basic approach as was used in the previous -chapter. Unlike KNN, IRS will take advantage of the extended query -interface. Recall from the Chapter~\ref{chap:sampling} that the approach used -for answering sampling queries (ignoring the buffer, for now) was, - -\begin{enumerate} - \item Query each shard to establish the weight that should be assigned to the - shard in sample size assignments. - \item Build an alias structure over those weights. - \item For each sample, reference the alias structure to determine which shard - to sample from, and then draw the sample. -\end{enumerate} - -This approach can be mapped easily onto the eDSP interface as follows, -\begin{itemize} - \item[\texttt{local\_preproc}] Determine and return the total weight of candidate records for - sampling in the shard. - \item[\texttt{distribute\_query}] Using the shard weights, construct an alias structure associating - each shard with its total weight. Then, query this alias structure $k$ times. For shard $i$, the - local query $\mathscr{Q}_i$ will have its sample size assigned based on how many times $i$ is returned - during the alias querying. - \item[\texttt{local\_query}] Process the local query using the underlying data structure's normal sampling - procedure. - \item[\texttt{merge}] Union all of the partial results together. -\end{itemize} -This division of the query maps closely onto the cost function, -\begin{equation} - O\left(P(n) + kS(n)\right) -\end{equation} -used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$ pre-processing -cost is associated with the cost of \texttt{local\_preproc} and the -$kS(n)$ sampling cost is associated with $\texttt{local\_query}$. -The \texttt{distribute\_query} operation will require $O(\log n)$ -time to construct the shard alias structure, and $O(k)$ time to -query it. 
Accounting then for the fact that \texttt{local\_preproc} -will be called once per shard ($\log n$ times), and a total of $k$ -records will be sampled as the cost of $S(n)$ each, this results -in a total query cost of, -\begin{equation} - O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right) -\end{equation} -which matches the cost in Equation~\ref{eq:sample-cost}. - -\section{Record Identity} - -Another important consideration for the framework is support for -deletes, which are important in the contexts of database systems. -The sampling extension framework supported two techniques -for the deletion of records: tombstone-based deletes and tagging-based -deletes. In both cases, the solution required that the shard support -point lookups, either for checking tombstones or for finding the -record to mark it as deleted. Implicit in this is an important -property of the underlying data structure which was taken for granted -in that work, but which will be made explicit here: record identity. - -Delete support requires that each record within the index be uniquely -identifiable, and linkable directly to a location in storage. This -property is called \emph{record identity}. - In the context of database -indexes, it isn't a particularly contentious requirement. Indexes -already are designed to provide a mapping directly to a record in -storage, which (at least in the context of RDBMS) must have a unique -identifier attached. However, in more general contexts, this -requirement will place some restrictions on the applicability of -the framework. - -For example, approximate data structures or summaries, such as Bloom -filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch} -are data structures which don't necessarily store the underlying -record. In principle, some summaries \emph{could} be supported by -normal Bentley-Saxe as there exist mergeable -summaries~\cite{mergeable-summaries}. But because these data structures -violate the record identity property, they would not support deletes -(either in the framework, or Bentley-Saxe). The framework considers -deletes to be a first-class citizen, and this is formalized by -requiring record identity as a property that supported data structures -must have. - -\section{The General Framework} - -Based on these properties, and the work described in -Chapter~\ref{chap:sampling}, dynamic extension framework has been devised with -broad support for data structures. It is implemented in C++20, using templates -and concepts to define the necessary interfaces. A user of this framework needs -to provide a definition for their data structure with a prescribed interface -(called a \texttt{shard}), and a definition for their query following an -interface based on the above definition of an eDSP. These two classes can then -be used as template parameters to automatically create a dynamic index, which -exposes methods for inserting and deleting records, as well as executing -queries. - -\subsection{Framework Design} - -\Paragraph{Structure.} The overall design of the general framework -itself is not substantially different from the sampling framework -discussed in the Chapter~\ref{chap:sampling}. It consists of a mutable buffer -and a set of levels containing data structures with geometrically -increasing capacities. The \emph{mutable buffer} is a small unsorted -record array of fixed capacity that buffers incoming inserts. As -the mutable buffer is kept sufficiently small (e.g. 
fits in L2 CPU -cache), the cost of querying it without any auxiliary structures -can be minimized, while still allowing better insertion performance -than Bentley-Saxe, which requires rebuilding an index structure for -each insertion. The use of an unsorted buffer is necessary to -ensure that the framework doesn't require an existing dynamic version -of the index structure being extended, which would defeat the purpose -of the entire exercise. - -The majority of the data within the structure is stored in a sequence -of \emph{levels} with geometrically increasing record capacity, -such that the capacity of level $i$ is $s^{i+1}$, where $s$ is a -configurable parameter called the \emph{scale factor}. Unlike -Bentley-Saxe, these levels are permitted to be partially full, which -allows significantly more flexibility in terms of how reconstruction -is performed. This also opens up the possibility of allowing each -level to allocate its record capacity across multiple data structures -(named \emph{shards}) rather than just one. This decision is called -the \emph{layout policy}, with the use of a single structure being -called \emph{leveling}, and multiple structures being called -\emph{tiering}. - -\begin{figure} -\centering -\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}} -\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}} - \caption{\textbf{An overview of the general structure of the - dynamic extension framework} using leveling (Figure~\ref{fig:leveling}) and -tiering (Figure~\ref{fig:tiering}) layout policies. The pictured extension has -a scale factor of 3, with $L_0$ being at capacity, and $L_1$ being at -one third capacity. Each shard is shown as a dotted box, wrapping its associated -dataset ($D_i$), data structure ($I_i$), and auxiliary structures $(A_i)$. } -\label{fig:framework} -\end{figure} - -\Paragraph{Shards.} The basic building block of the dynamic extension -is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i, -\mathcal{I}_i, A_i)$, which consists of a partition of the data -$\mathcal{D}_i$, an instance of the static index structure being -extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$. -To ensure the viability of level reconstruction, the extended data -structure should at least support a construction method -$\mathtt{build}(\mathcal{D})$ that can build a new static index -from a set of records $\mathcal{D}$ from scratch. This set of records -may come from the mutable buffer, or from a union of underlying -data of multiple other shards. It is also beneficial for $\mathcal{I}_i$ -to support efficient point-lookups, which can search for a record's -storage location by its identifier (given by the record identify -requirements of the framework). The shard can also be customized -to provide any necessary features for supporting the index being -extended. For example, auxiliary data structures like Bloom filters -or hash tables can be added to improve point-lookup performance, -or additional, specialized query functions can be provided for use -by the query functions. - -From an implementation standpoint, the shard object provides a shim -between the data structure and the framework itself. At minimum, -it must support the following interface, -\begin{itemize} - \item $\mathbftt{construct}(B) \to S$ \\ - Construct a new shard from the contents of the mutable buffer, $B$. 
- \item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ - Construct a new shard from the records contained within a list of already - existing shards. - \item $\mathbftt{point\_lookup}(r) \to *r$ \\ - Search for a record, $r$, by identity and return a reference to its - location in storage. -\end{itemize} +\subsection{Search Problem Taxonomy} -\Paragraph{Insertion \& deletion.} The framework supports inserting -new records and deleting records already in the index. These two -operations also allow for updates to existing records, by first -deleting the old version and then inserting a new one. These -operations are added by the framework automatically, and require -only a small shim or minor adjustments to the code of the data -structure being extended within the implementation of the shard -object. - -Insertions are performed by first wrapping the record to be inserted -with a framework header, and then appending it to the end of the -mutable buffer. If the mutable buffer is full, it is flushed to -create a new shard, which is combined into the first level of the -structure. The level reconstruction process is layout policy -dependent. In the case of leveling, the underlying data of the -source shard and the target shard are combined, resulting a new -shard replacing the target shard in the target level. When using -tiering, the newly created shard is simply placed into the target -level. If the target level is full, the framework first triggers a merge on the -target level, which will create another shard at one higher level, -and then inserts the former shard at the now empty target level. -Note that each time a new shard is created, the framework must invoke -$\mathtt{build}$ to construct a new index from scratch for this -shard. - -The framework supports deletes using two approaches: either by -inserting a special tombstone record or by performing a lookup for -the record to be deleted and setting a bit in the header. This -decision is called the \emph{delete policy}, with the former being -called \emph{tombstone delete} and the latter \emph{tagged delete}. -The framework will automatically filter deleted records from query -results before returning them to the user, either by checking for -the delete tag, or by performing a lookup of each record for an -associated tombstone. The number of deleted records within the -framework can be bounded by canceling tombstones and associated -records when they meet during reconstruction, or by dropping all -tagged records when a shard is reconstructed. The framework also -supports aggressive reconstruction (called \emph{compaction}) to -precisely bound the number of deleted records within the index, -which can be helpful to improve the performance of certain types -of query. This is useful for certain search problems, as was seen with -sampling queries in Chapter~\ref{chap:sampling}, but is not -generally necessary to bound query cost in most cases. - -\Paragraph{Design space.} The framework described in this section -has a large design space. In fact, much of the design space has -similar knobs to the well-known LSM Tree~\cite{dayan17}, albeit in -a different environment: the framework targets in-memory static -index structures for general extended decomposable queries without -efficient index merging support, whereas the LSM-tree targets -external range indexes that can be efficiently merged. - -The framework's design trades off among auxiliary memory usage, read performance, -and write performance. 
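As an illustration of how this design space might be surfaced to a user,
consider the following hypothetical instantiation. The template, enumeration,
and member names here are invented for the sketch and are not the framework's
actual API; the configuration values mirror those used for the range-count
evaluation later in the chapter (a 12,000 record buffer, a scale factor of 8,
tiering, and tombstone deletes).
\begin{verbatim}
#include <cstddef>

enum class LayoutPolicy { LEVELING, TIERING };
enum class DeletePolicy { TOMBSTONE, TAGGING };

// The shard and query classes provided by the user select the data
// structure and search problem; the remaining parameters select a
// point in the design space described here.
template <typename Shard, typename Query, LayoutPolicy L, DeletePolicy D>
class DynamicExtension {
public:
    DynamicExtension(size_t buffer_size, size_t scale_factor)
        : m_buffer_size(buffer_size), m_scale_factor(scale_factor) {}

    // insert(), erase(), and query() would be provided by the framework.

private:
    size_t m_buffer_size;
    size_t m_scale_factor;
};

// A write-oriented configuration: tiering with tombstone deletes.
// DynamicExtension<MyShard, MyRangeCountQuery,
//                  LayoutPolicy::TIERING,
//                  DeletePolicy::TOMBSTONE> index(12000, 8);
\end{verbatim}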
The two most significant decisions are the -choice of layout and delete policy. A tiering layout policy reduces -write amplification compared to leveling, requiring each record to -only be written once per level, but increases the number of shards -within the structure, which can hurt query performance. As for -delete policy, the use of tombstones turns deletes into insertions, -which are typically faster. However, depending upon the nature of -the query being executed, the delocalization of the presence -information for a record may result in one extra point lookup for -each record in the result set of a query, vastly reducing read -performance. In these cases, tagging may make more sense. This -results in each delete turning into a slower point-lookup, but -always allows for constant-time visibility checks of records. The -other two major parameters, scale factor and buffer size, can be -used to tune the performance once the policies have been selected. -Generally speaking, larger scale factors result in fewer shards, -but can increase write amplification under leveling. Large buffer -sizes can adversely affect query performance when an unsorted buffer -is used, while allowing higher update throughput. Because the overall -design of the framework remains largely unchanged, the design space -exploration of Section~\ref{ssec:ds-exp} remains relevant here. - -\subsection{The Shard Interface} - -The shard object serves as a ``shim'' between a data structure and -the extension framework, providing a set of mandatory functions -which are used by the framework code to facilitate reconstruction -and deleting records. The data structure being extended can be -provided by a different library and included as an attribute via -composition/aggregation, or can be directly implemented within the -shard class. Additionally, shards can contain any necessary auxiliary -structures, such as bloom filters or hash tables, as necessary to -support the required interface. - -The require interface for a shard object is as follows, -\begin{verbatim} - new(MutableBuffer) -> Shard - new(Shard[]) -> Shard - point_lookup(Record, Boolean) -> Record - get_data() -> Record - get_record_count() -> Int - get_tombstone_count() -> Int - get_memory_usage() -> Int - get_aux_memory_usage() -> Int -\end{verbatim} - -The first two functions are constructors, necessary to build a new Shard -from either an array of other shards (for a reconstruction), or from -a mutable buffer (for a buffer flush).\footnote{ - This is the interface as it currently stands in the existing implementation, but - is subject to change. In particular, we are considering changing the shard reconstruction - procedure to allow for only one necessary constructor, with a more general interface. As - we look to concurrency, being able to construct shards from arbitrary combinations of shards - and buffers will become convenient, for example. - } -The \texttt{point\_lookup} operation is necessary for delete support, and is -used either to locate a record for delete when tagging is used, or to search -for a tombstone associated with a record when tombstones are used. The boolean -is intended to be used to communicate to the shard whether the lookup is -intended to locate a tombstone or a record, and is meant to be used to allow -the shard to control whether a point lookup checks a filter before searching, -but could also be used for other purposes. 
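To ground this interface, a minimal shard shim over a sorted array of records
might look roughly as follows. The record layout, class name, and signatures
are illustrative stand-ins rather than the framework's actual definitions,
and error handling and the memory-usage accessors are omitted for brevity.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
    uint64_t key;     // record identity is assumed to be the key here
    uint64_t value;
    bool tombstone;   // set when the record encodes a tombstone delete
    bool deleted;     // set by tagged deletes
};

class SortedArrayShard {
public:
    // Construct from a flushed mutable buffer (unsorted records).
    explicit SortedArrayShard(std::vector<Record> buffer)
        : m_data(std::move(buffer)) {
        std::sort(m_data.begin(), m_data.end(), by_key);
    }

    // Construct by merging the records of existing shards (reconstruction).
    explicit SortedArrayShard(const std::vector<const SortedArrayShard *> &shards) {
        for (const auto *s : shards) {
            m_data.insert(m_data.end(), s->m_data.begin(), s->m_data.end());
        }
        std::sort(m_data.begin(), m_data.end(), by_key);
    }

    // Locate a record (or its tombstone) by identity; nullptr if absent.
    const Record *point_lookup(const Record &rec, bool find_tombstone) const {
        auto it = std::lower_bound(m_data.begin(), m_data.end(), rec, by_key);
        for (; it != m_data.end() && it->key == rec.key; ++it) {
            if (it->tombstone == find_tombstone) return &*it;
        }
        return nullptr;
    }

    const Record *get_data() const { return m_data.data(); }
    size_t get_record_count() const { return m_data.size(); }
    size_t get_tombstone_count() const {
        return std::count_if(m_data.begin(), m_data.end(),
                             [](const Record &r) { return r.tombstone; });
    }

private:
    static bool by_key(const Record &a, const Record &b) { return a.key < b.key; }

    std::vector<Record> m_data;
};
\end{verbatim}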
The \texttt{get\_data} -function exposes a pointer to the beginning of the array of records contained -within the shard--it imposes no restriction on the order of these records, but -does require that all records can be accessed sequentially from this pointer, -and that the order of records does not change. The rest of the functions are -accessors for various shard metadata. The record and tombstone count numbers -are used by the framework for reconstruction purposes.\footnote{The record -count includes tombstones as well, so the true record count on a level is -$\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at present, -only exposed directly to the user and have no effect on the framework's -behavior. In the future, these may be used for concurrency control and task -scheduling purposes. - -Beyond these, a shard can expose any additional functions that are necessary -for its associated query classes. For example, a shard intended to be used for -range queries might expose upper and lower bound functions, or a shard used for -nearest neighbor search might expose a nearest-neighbor function. - -\subsection{The Query Interface} -\label{ssec:fw-query-int} - -The required interface for a query in the framework is a bit more -complicated than the interface defined for an eDSP, because the -framework needs to query the mutable buffer as well as the shards. -As a result, there is some slight duplication of functions, with -specialized query and pre-processing routines for both shards and -buffers. Specifically, a query must define the following functions, -\begin{verbatim} - get_query_state(QueryParameters, Shard) -> ShardState; - get_buffer_query_state(QueryParameters, Buffer) -> BufferState; - - process_query_states(QueryParameters, ShardStateList, BufferStateList) -> LocalQueryList; - - query(LocalQuery, Shard) -> ResultList - buffer_query(LocalQuery, Buffer) -> ResultList - - merge(ResultList) -> FinalResult - - delete_query_state(ShardState) - delete_buffer_query_state(BufferState) - - bool EARLY_ABORT; - bool SKIP_DELETE_FILTER; -\end{verbatim} - -The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state} functions -map to the \texttt{local\_preproc} operation of the eDSP definition for shards -and buffers respectively. \texttt{process\_query\_states} serves the function -of \texttt{distribute\_query}. Note that this function takes a list of buffer -states; although the proposed framework above contains only a single buffer, -future support for concurrency will require multiple buffers, and so the -interface is set up with support for this. The \texttt{query} and -\texttt{buffer\_query} functions execute the local query against the shard or -buffer and return the intermediate results, which are merged using -\texttt{merge} into a final result set. The \texttt{EARLY\_ABORT} parameter can -be set to \texttt{true} to force the framework to immediately return as soon as -the first result is found, rather than querying the entire structure, and the -\texttt{SKIP\_DELETE\_FILTER} disables the framework's automatic delete -filtering, allowing deletes to be manually handled within the \texttt{merge} -function by the developer. These flags exist to allow for optimizations for -certain types of query. 
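As an example of this interface in use, a range-count query class for a shard
exposing index-based \texttt{lower\_bound} and \texttt{upper\_bound} lookups
might be sketched as follows. The signatures are simplified relative to the
listing above (the buffer variants and state clean-up functions are omitted),
and all names are illustrative.
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename Shard>
struct RangeCountQuery {
    struct Parameters { uint64_t low, high; };
    struct State      { size_t start, stop; };   // per-shard index bounds
    struct Partial    { size_t records = 0, tombstones = 0; };

    static constexpr bool EARLY_ABORT = false;
    static constexpr bool SKIP_DELETE_FILTER = true;  // deletes handled in merge

    static State get_query_state(const Parameters &p, const Shard &s) {
        return { s.lower_bound(p.low), s.upper_bound(p.high) };
    }

    // A count needs no cross-shard coordination, so this is a no-op.
    static void process_query_states(const Parameters &,
                                     std::vector<State> &) {}

    static Partial query(const State &st, const Shard &s) {
        Partial r;
        const auto *recs = s.get_data();   // sequential record access
        for (size_t i = st.start; i < st.stop; i++) {
            if (recs[i].tombstone) r.tombstones++;
            else                   r.records++;
        }
        return r;
    }

    static size_t merge(const std::vector<Partial> &partials) {
        size_t records = 0, tombstones = 0;
        for (const auto &p : partials) {
            records    += p.records;
            tombstones += p.tombstones;
        }
        // Tombstones cancel the records they delete.
        return records >= tombstones ? records - tombstones : 0;
    }
};
\end{verbatim}
This mirrors the range-count query used in the evaluation below, where the
local query walks the sorted record array between the two bounds and the
merge step returns the difference between the record and tombstone totals.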
For example, point-lookups can take advantage of -\texttt{EARLY\_ABORT} to stop as soon as a match is found, and -\texttt{SKIP\_DELETE\_FILTER} can be used for more efficient tombstone delete -handling in range queries, where tombstones for results will always be in the -\texttt{ResultList}s going into \texttt{merge}. - -The framework itself answers queries by simply calling these routines in -a prescribed order, -\begin{verbatim} -query(QueryArguments qa) BEGIN - FOR i < BufferCount DO - BufferStates[i] = get_buffer_query_state(qa, Buffers[i]) - DONE - - FOR i < ShardCount DO - ShardStates[i] = get_query_state(qa, Shards[i]) - DONE - - process_query_states(qa, ShardStates, BufferStates) - - FOR i < BufferCount DO - temp = buffer_query(BufferStates[i], Buffers[i]) - IF NOT SKIP_DELETE_FILTER THEN - temp = filter_deletes(temp) - END - Results[i] = temp; - - IF EARLY_ABORT AND Results[i].size() > 0 THEN - delete_states(ShardStates, BufferStates) - return merge(Results) - END - DONE - - FOR i < ShardCount DO - temp = query(ShardStates[i], Shards[i]) - IF NOT SKIP_DELETE_FILTER THEN - temp = filter_deletes(temp) - END - Results[i + BufferCount] = temp - IF EARLY_ABORT AD Results[i + BufferCount].size() > 0 THEN - delete_states(ShardStates, BufferStates) - return merge(Results) - END - DONE - - delete_states(ShardStates, BufferStates) - return merge(Results) -END -\end{verbatim} - -\subsubsection{Standardized Queries} - -Provided with the framework are several "standardized" query classes, including -point lookup, range query, and IRS. These queries can be freely applied to any -shard class that implements the necessary optional interfaces. For example, the -provided IRS and range query both require the shard to implement a -\texttt{lower\_bound} and \texttt{upper\_bound} function that returns an index. -They then use this index to access the record array exposed via -\texttt{get\_data}. This is convenient, because it helps to separate the search -problem from the data structure, and moves towards presenting these two objects -as orthogonal. - -In the next section the framework is evaluated by producing a number of indexes -for three different search problems. Specifically, the framework is applied to -a pair of learned indexes, as well as an ISAM-tree. All three of these shards -provide the bound interface described above, meaning that the same range query -class can be used for all of them. It also means that the learned indexes -automatically have support for IRS. And, of course, they also all can be used -with the provided point-lookup query, which simply uses the required -\texttt{point\_lookup} function of the shard. - -At present, the framework only supports associating a single query class with -an index. However, this is simply a limitation of implementation. In the future, -approaches will be considered for associating arbitrary query classes to allow -truly multi-purpose indexes to be constructed. This is not to say that every -data structure will necessarily be efficient at answering every type of query -that could be answered using their interface--but in a database system, being -able to repurpose an existing index to accelerate a wide range of query types -would certainly seem worth considering. - -\section{Framework Evaluation} - -The framework was evaluated using three different types of search problem: -range-count, high-dimensional k-nearest neighbor, and independent range -sampling. 
In all three cases, an extended static data structure was compared -with dynamic alternatives for the same search problem to demonstrate the -framework's competitiveness. - -\subsection{Methodology} - -All tests were performed using Ubuntu 22.04 -LTS on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of -installed memory and 40 physical cores. Benchmark code was compiled -using \texttt{gcc} version 11.3.0 at the \texttt{-O3} optimization level. - - -\subsection{Range Queries} - -A first test evaluates the performance of the framework in the context of -range queries against learned indexes. In Chapter~\ref{chap:intro}, the -lengthy development cycle of this sort of data structure was discussed, -and so learned indexes were selected as an evaluation candidate to demonstrate -how this framework could allow such lengthy development lifecycles to be largely -bypassed. - -Specifically, the framework is used to produce dynamic learned indexes based on -TrieSpline~\cite{plex} (DE-TS) and the static version of PGM~\cite{pgm} (DE-PGM). These -are both single-pass construction static learned indexes, and thus well suited for use -within this framework compared to more complex structures like RMI~\cite{RMI}, which have -more expensive construction algorithms. The two framework-extended data structures are -compared with dynamic learned indexes, namely ALEX~\cite{ALEX} and the dynamic version of -PGM~\cite{pgm}. PGM provides an interesting comparison, as its native -dynamic version was implemented using a slightly modified version Bentley-Saxe method. - -When performing range queries over large data sets, the -copying of query results can introduce significant overhead. Because the four -tested structures have different data copy behaviors, a range count query was -used for testing, rather than a pure range query. This search problem exposes -the searching performance of the data structures, while controlling for different -data copy behaviors, and so should provide more directly comparable results. - -Range count -queries were executed with a selectivity of $0.01\%$ against three datasets -from the SOSD benchmark~\cite{sosd-datasets}: \texttt{book}, \texttt{fb}, and -\texttt{osm}, which all have 200 million 64-bit keys following a variety of -distributions, which were paired with uniquely generated 64-bit values. There -is a fourth dataset in SOSD, \texttt{wiki}, which was excluded from testing -because it contained duplicate keys, which are not supported by dynamic -PGM.\footnote{The dynamic version of PGM supports deletes using tombstones, -but doesn't wrap records with a header to accomplish this. Instead it reserves -one possible value to represent a tombstone. Records are deleted by inserting a -record having the same key, but this different value. This means that duplicate -keys, even if they have different values, are unsupported as two records with -the same key will be treated as a delete by the index.~\cite{pgm} } - -The shard implementations for DE-PGM and DE-TS required about 300 lines of -C++ code each, and no modification to the data structures themselves. For both -data structures, the framework was configured with a buffer of 12,000 records, a scale -factor of 8, the tombstone delete policy, and tiering. Each shard stored $D_i$ -as a sorted array of records, used an instance of the learned index for -$\mathcal{I}_i$, and has no auxiliary structures. 
-
-\begin{figure*}[t]
-    \centering
-    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
-    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
-    \subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
-    \caption{Range Count Evaluation}
-    \label{fig:results1}
-\end{figure*}
-
-Figure~\ref{fig:rq-insert} shows the update throughput of all competitors. ALEX
-performs the worst in all cases, and PGM performs the best, with the extended
-indexes falling in the middle. It is not unexpected that PGM performs better
-than the framework, because the Bentley-Saxe extension in PGM is custom-built,
-and thus has a tighter integration than a general framework would allow.
-However, even with this advantage, DE-PGM still reaches up to 85\% of PGM's
-insertion throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM
-pays a large cost in query latency for its advantage in insertion, with the
-framework-extended indexes significantly outperforming it. Further, DE-TS even
-outperforms ALEX for query latency in some cases. Finally,
-Figure~\ref{fig:idx-space} shows the storage cost of the indexes, without
-counting the space necessary to store the records themselves. The storage cost
-of a learned index is fairly variable, as it is largely a function of the
-distribution of the data, but in all cases, the extended learned
-indexes, which build compact data arrays without gaps, occupy three orders of
-magnitude less storage space than ALEX, which requires leaving gaps
-in the data arrays.
-
-\subsection{High-Dimensional k-Nearest Neighbor}
-The next test evaluates the framework for the extension of high-dimensional
-metric indexes for the k-nearest neighbor search problem. An M-tree~\cite{mtree}
-was used as the dynamic baseline,\footnote{
-    Specifically, the M-tree implementation tested can be found at \url{https://github.com/dbrumbaugh/M-Tree}
-    and is a fork of a structure written originally by Eduardo D'Avila, modified to compile under C++20. The
-    tree uses a random selection algorithm for ball splitting.
-} and a VPTree~\cite{vptree} as the static structure. The framework was used to
-extend VPTree to produce the dynamic version, DE-VPTree.
-An M-tree is a tree that partitions records based on
-high-dimensional spheres and supports updates by splitting and merging these
-partitions.
-A VPTree is a binary tree that is produced by recursively selecting
-a point, called the vantage point, and partitioning records based on their
-distance from that point. This results in a difficult-to-modify structure that
-can be constructed in $O(n \log n)$ time and can answer KNN queries in $O(k
-\log n)$ time.
-
-DE-VPTree used a buffer of 12,000 records, a scale factor of 6, tiering, and
-delete tagging. The query was implemented without a pre-processing step, using
-the standard VPTree algorithm for KNN queries against each shard. All $k$
-records were determined for each shard, and then the merge operation used a
-heap to merge the result sets together and return the $k$ nearest neighbors
-from the $k\log(n)$ intermediate results. This is a type of query that pays a
-non-constant merge cost of $O(k \log k)$, even with the framework's expanded
-query interface. In effect, the kNN query must be answered twice: once for each
-shard to get the intermediate result sets, and then a second time within the
-merge operation to select the kNN from the result sets.
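+
+The heap-based merge step can be sketched as follows. This is an illustrative
+reconstruction rather than the DE-VPTree code itself, and it assumes that each
+shard's local query has already produced its $k$ nearest candidates with their
+distances to the query point precomputed.
+
+\begin{lstlisting}[language=C++]
+#include <cstddef>
+#include <queue>
+#include <vector>
+
+struct Neighbor {
+    std::size_t id;    // identifier of the candidate record
+    double      dist;  // distance to the query point, from the local query
+};
+
+// Merge per-shard k-NN result sets, keeping the k globally nearest candidates.
+// A max-heap (ordered by distance) of size at most k holds the current best k.
+std::vector<Neighbor>
+merge_knn(const std::vector<std::vector<Neighbor>> &local_results, std::size_t k) {
+    auto farther = [](const Neighbor &a, const Neighbor &b) { return a.dist < b.dist; };
+    std::priority_queue<Neighbor, std::vector<Neighbor>, decltype(farther)> heap(farther);
+
+    for (const auto &result : local_results) {
+        for (const auto &n : result) {
+            if (heap.size() < k) {
+                heap.push(n);
+            } else if (n.dist < heap.top().dist) {
+                heap.pop();   // evict the current farthest of the best-k
+                heap.push(n);
+            }
+        }
+    }
+
+    std::vector<Neighbor> out;
+    out.reserve(heap.size());
+    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
+    return out;               // farthest-first order; sort if a sorted result is needed
+}
+\end{lstlisting}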
-
-\begin{figure}
-    \centering
-    \includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
-    \caption{KNN Index Evaluation}
-    \label{fig:knn}
-\end{figure}
-Euclidean distance was used as the metric for both structures, and $k=1000$ was
-used for all queries. The reference point for each query was selected randomly
-from points within the dataset. Tests were run using the Spanish Billion Words
-dataset~\cite{sbw}, which consists of 300-dimensional vectors. The results are
-shown in Figure~\ref{fig:knn}. In this case, the static nature of the VPTree
-allows it to dominate the M-tree in query latency, and the simpler reconstruction
-procedure shows a significant insertion performance improvement as well.
-
-\subsection{Independent Range Sampling}
-Finally, the
-framework was tested using one-dimensional IRS queries. As before,
-a static ISAM-tree was used as the data structure to be extended,
-however the sampling query was implemented using the query interface from
-Section~\ref{ssec:fw-query-int}. The pre-processing step identifies, for each
-shard, the first and last record falling into the range to be sampled from,
-and determines the total weight of that range. Then, in the local query
-generation step, these weights are used to construct an alias structure, which
-is used to assign sample sizes to each shard based on weight, to avoid
-introducing skew into the results. After this, the query routine generates
-random numbers between the established bounds to sample records, and the merge
-operation concatenates the individual result sets. This procedure requires
-only a pair of tree traversals per shard, regardless of how many samples are
-taken. A sketch of the weight-based shard assignment step is given at the end
-of this subsection.
-
-\begin{figure}
-    \centering
-    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
-    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
-    \caption{IRS Index Evaluation}
-    \label{fig:results2}
-\end{figure}
-
-The extended ISAM structure (DE-IRS) was compared to a B$^+$-Tree
-with aggregate weight tags on internal nodes (AGG B+Tree) for sampling
-and insertion performance, and to a single instance of the static ISAM-tree (ISAM),
-which does not support updates. DE-IRS was configured with a buffer size
-of 12,000 records, a scale factor of 6, tiering, and delete tagging. The IRS
-queries had a selectivity of $0.1\%$ with a sample size of $k=1000$. Testing
-was performed using the same datasets as were used for range queries.
-
-Figure~\ref{fig:irs-query}
-shows the significant latency advantage that the dynamically extended ISAM tree
-enjoys compared to a B$^+$-Tree. DE-IRS is up to 23 times faster than the B$^+$-Tree at
-answering sampling queries, and only about 3 times slower than the fully static
-solution. In this case, the extra query cost caused by needing to query
-multiple structures is more than balanced by the query efficiency of each of
-those structures, relative to tree sampling. Interestingly, the framework also
-results in better update performance compared to the B$^+$-Tree, as shown in
-Figure~\ref{fig:irs-insert}. This is likely because the ISAM shards can be
-efficiently constructed using a combination of sorted-merge operations and
-bulk-loading, and avoid the expensive structural modification operations that
-are necessary for maintaining a B$^+$-Tree.
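+
+As noted above, the sketch below illustrates the weight-based shard assignment
+step. It is illustrative only: the framework builds an alias structure over the
+shard weights, while here \texttt{std::discrete\_distribution}, which provides
+the same weighted-selection behavior, stands in for it.
+
+\begin{lstlisting}[language=C++]
+#include <cstddef>
+#include <random>
+#include <vector>
+
+// Assign a total of k samples to shards in proportion to the number of records
+// each shard holds within the query range (its "weight"), so that the combined
+// sample remains unbiased across shards of different sizes.
+std::vector<std::size_t> assign_samples(const std::vector<double> &weights,
+                                        std::size_t k, std::mt19937 &rng) {
+    std::vector<std::size_t> counts(weights.size(), 0);
+    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
+    for (std::size_t i = 0; i < k; i++) {
+        counts[pick(rng)]++;  // each sample independently selects a shard by weight
+    }
+    return counts;            // counts[j] samples are then drawn from shard j
+}
+\end{lstlisting}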
-
-\subsection{Discussion}
-
-The results demonstrate not only that the framework's update support is
-competitive with custom-built dynamic data structures, but that the framework
-is even able to, in many cases, retain some of the query performance advantage
-of its extended static data structure. This is particularly evident in the k-nearest
-neighbor and independent range sampling tests, where the static version of the
-structure was directly tested as well. These tests demonstrate one of the advantages
-of static data structures: they are able to maintain much tighter inter-record relationships
-than dynamic ones, because update support typically requires relaxing these relationships
-to make the structure easier to update. While the framework introduces the overhead of querying
-multiple structures and merging their results together, it is clear from the results that this
-overhead is generally less than the overhead incurred by the update support techniques used
-in the dynamic structures. The only case where the framework failed to win outright in query
-performance was against ALEX, where the resulting query latencies were comparable.
-
-It is also evident that the update support provided by the framework is on par with, if not
-superior to, that provided by the dynamic baselines, at least in terms of throughput. The
-framework will certainly suffer from larger tail latency spikes than the baselines, due to
-the larger scale of its reconstructions, though these were not measured in this round of
-testing. However, the amortization of these costs over a large number of inserts allows the
-framework to maintain a respectable level of throughput. In fact, the only case where the
-framework loses in insertion throughput is against the dynamic PGM. However, an examination
-of the query latencies reveals that this is likely because the standard configuration of the
-Bentley-Saxe variant used by PGM is highly tuned for insertion performance: the query
-latencies of this structure are far worse than those of any other learned index tested, so
-even this result shouldn't be taken as a ``clear'' defeat of the framework's implementation.
-
-Overall, it is clear from this evaluation that the dynamic extension framework is a
-promising alternative to manual index redesign for accommodating updates. The
-framework-extended static data structures provided superior insertion throughput in
-almost all cases, and query latencies that matched or exceeded those of the dynamic
-baselines. Additionally, though it is hard to quantify, the code complexity of the
-framework-extended data structures was much lower, with the shard implementations
-requiring only a small amount of relatively straightforward code to interface with
-pre-existing static data structures, or with the necessary data structure
-implementations themselves being simpler.
+Having defined two new classes of search problem, it seems sensible
+at this point to collect our definitions together with pre-existing
+ones from the classical literature, and present a cohesive taxonomy
+of the search problems for which our techniques can be used to
+support dynamization. This taxonomy is shown in the Venn diagrams of
+Figure~\ref{fig:taxonomy}. Note that, for convenience, the search problem
+classifications relevant for supporting deletes have been separated out
+into a separate diagram. In principle, this deletion taxonomy can be
+thought of as being nested inside of each of the general search problem
+classifications, as the two sets of classification are orthogonal: the
+position of a search problem in the general taxonomy implies nothing
+about where in the deletion taxonomy that same problem might fall.
+
+\begin{figure}[t]
+    \subfloat[General Taxonomy]{\includegraphics[width=.49\linewidth]{diag/taxonomy}
+    \label{fig:taxonomy-main}}
+    \subfloat[Deletion Taxonomy]{\includegraphics[width=.49\linewidth]{diag/deletes} \label{fig:taxonomy-deletes}}
+    \caption{An overview of the taxonomy of search problems, as relevant to
+    our discussion of data structure dynamization. Our proposed extensions
+    are marked with an asterisk (*) and colored yellow.
+    }
+    \label{fig:taxonomy}
+\end{figure}
+
+Figure~\ref{fig:taxonomy-main} illustrates the classifications of search
+problem that are not deletion-related, including standard decomposability
+(DSP), extended decomposability (eDSP), $C(n)$-decomposability
+($C(n)$-DSP), and merge decomposability (MDSP). We consider ISAM,
+TrieSpline~\cite{plex}, and the succinct trie~\cite{zhang18} to be examples
+of MDSPs because these structures can be constructed more efficiently
+from sorted data: when building from existing blocks, the data in each
+block is already sorted, and the blocks can be merged into a single
+sorted run efficiently. VP-trees~\cite{vptree} and alias
+structures~\cite{walker74}, in contrast, don't have a convenient
+way of merging, and so must be reconstructed in full each time. We
+have classified sampling queries in this taxonomy as eDSPs because
+the eDSP implementation is more efficient than the $C(n)$-decomposable
+variant we have also discussed. $k$-NN queries, for reasons discussed in
+Chapter~\ref{chap:background}, are classified as $C(n)$-decomposable.
+
+The classification of range scans is a bit trickier. It is not uncommon
+in the theoretical literature for range scans to be considered DSPs, with
+$\mergeop$ taken to be the set union operator. From an implementation
+standpoint, it is sometimes possible to perform a union in $\Theta(1)$
+time. For example, in Chapter~\ref{chap:sampling} we accomplished this by
+placing sampled records directly into a shared buffer, and not having an
+explicit combine step at all. However, in the general case where we do
+need an explicit combine step, the union operation requires linear
+time in the size of the result sets to copy the records from the local
+results into the final result. The sizes of these results are a function
+of the selectivity of the range scan, but could in principle be large
+relative to the data size, and so we've decided to err on the side of
+caution and classify range scans as $C(n)$-decomposable here. If the
+results of the range scan are expected to be returned in sorted order,
+then the problem is \emph{certainly} $C(n)$-decomposable.
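+
+As a concrete illustration of this cost, the combine step for an unsorted range
+scan can be written as a simple concatenation of the local result sets. The code
+below is a generic sketch, not taken from the framework.
+
+\begin{lstlisting}[language=C++]
+#include <cstddef>
+#include <vector>
+
+// Union of local range-scan results. Because the blocks are disjoint, the union
+// is a concatenation, but it still costs time linear in the total number of
+// result records -- which is why we treat the combine step as C(n) rather than
+// constant time.
+template <typename Record>
+std::vector<Record> combine(const std::vector<std::vector<Record>> &local_results) {
+    std::size_t total = 0;
+    for (const auto &r : local_results) total += r.size();
+
+    std::vector<Record> out;
+    out.reserve(total);
+    for (const auto &r : local_results) {
+        out.insert(out.end(), r.begin(), r.end());
+    }
+    return out;
+}
+\end{lstlisting}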
+
+Range counts, on the other hand, are truly DSPs.\footnote{
+    Because of the explicit combine interface we use for eDSPs, the
+    optimization of writing samples directly into the buffer that we used
+    in the previous chapter to get a $\Theta(1)$ set union cannot be used
+    for the eDSP implementation of IRS in this chapter. However, our eDSP
+    sampling in Algorithm~\ref{alg:edsp-irs} samples \emph{exactly} $k$
+    records, and so the combination step still only requires $\Theta(k)$
+    work, and the complexity remains the same.
+} Point lookups are an example of a DSP as well, assuming that the lookup
+key is unique, or at least minimally duplicated. In the case where
+the number of results for the lookup becomes a substantial proportion
+of the total data size, this search problem could be considered
+$C(n)$-decomposable for the same reason as range scans.
+
+Figure~\ref{fig:taxonomy-deletes} shows the various classes of search
+problem relevant to delete support. We have made the decision to
+classify invertible problems (INV) as a subset of deletion decomposable
+problems (DDSP), because one could always embed the ghost structure
+directly into the block implementation, use the DDSP delete operation
+to insert into that block, and handle the $\Delta$ operator as part of
+$\mathbftt{local\_query}$. We consider range count to be invertible,
+with $\Delta$ taken to be subtraction. Range scans are technically also
+invertible, but filtering out the deleted records during result set
+merging is relatively expensive, as it requires either performing a
+sorted merge of all of the records (rather than a simple union) to
+cancel out records with their ghosts, or doing a linear search for each
+ghost record to remove its corresponding data from the result set. As a
+result, we have classified them as DDSPs instead, as weak deletes are
+easily supported during range scans with no extra cost: any records
+marked as deleted can simply be skipped over when copying into the
+local or final result sets. Similarly, $k$-NN queries admit a DDSP
+solution for certain data structures, but we've elected to classify
+them as IDSPs using Algorithm~\ref{alg:idsp-knn}, because this approach
+requires no modification of the data structure, and not all metric
+indexing structures support the efficient point lookups that weak
+deletes would require. We've also classified IRS as an IDSP, which is
+the only place in the taxonomy that it can fit. Note that IRS (and
+other sampling problems) are unique in this model in that they require
+the IDSP classification, but must actually support deletes using weak
+deletes; there is no way to support ghost-structure-based deletes in our
+general framework for sampling queries.\footnote{
+    This is in contrast to the specialized framework for sampling in
+    Chapter~\ref{chap:sampling}, where we heavily modified the query
+    process to make tombstone-based deletes (tombstones being analogous
+    to a ghost structure) possible.
+}
+
+\section{Dynamization Framework}
+
+With these new classes of search problem in hand, we can now present
+our generalized framework based upon them. This framework takes the
+form of a header-only C++20 library which can automatically extend data
+structures with support for concurrent inserts and deletes, depending
+upon the classification of the problem in the taxonomy of
+Figure~\ref{fig:taxonomy}. The user provides the data structure and
+query implementations as template parameters, and the framework then
+provides an interface that allows for queries, inserts, and deletes
+against the new dynamic structure. The sketch below illustrates this
+overall design pattern.
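+
+The following is a minimal sketch of the design pattern only, not the
+framework's actual implementation: buffer querying, deletes, concurrency, and
+the real reconstruction policies are all omitted, and every name here is an
+illustrative assumption.
+
+\begin{lstlisting}[language=C++]
+#include <cstddef>
+#include <vector>
+
+// A dynamized index parameterized on a record type R, a static shard type S
+// constructible from a vector of records, and a query type Q that provides
+// local_query() and combine() in the spirit of the eDSP interface.
+template <typename R, typename S, typename Q>
+class Dynamized {
+public:
+    void insert(const R &rec) {
+        buffer_.push_back(rec);
+        if (buffer_.size() >= buffer_capacity_) {
+            shards_.push_back(S(buffer_));  // rebuild a static shard from the buffer
+            buffer_.clear();
+        }
+    }
+
+    typename Q::Result query(const typename Q::Parameters &parms) const {
+        std::vector<typename Q::LocalResult> locals;
+        locals.reserve(shards_.size());
+        for (const auto &shard : shards_) {
+            locals.push_back(Q::local_query(shard, parms));
+        }
+        return Q::combine(locals);          // merge local results into the final answer
+    }
+
+private:
+    std::vector<R> buffer_;                 // mutable buffer of recent inserts
+    std::vector<S> shards_;                 // immutable static shards
+    std::size_t    buffer_capacity_ = 12000;
+};
+\end{lstlisting}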
+
+\subsection{Interfaces}
+
+In order to enforce interface requirements, our implementation takes
+advantage of C++20 concepts. There are three major sets of interfaces
+that the user of the framework must implement: records, shards, and
+queries. We'll discuss each of these in this section.
+
+\subsubsection{Record Interface}
+
+The record interface is the simplest of the three. Records are C++
+structs, and they must implement an equality comparison operator. Beyond
+this, the framework places no additional constraints and makes
+no assumptions about record contents, their ordering properties,
+etc. Though the records must be fixed-length (as they are structs),
+variable-length data can be supported using off-record storage and
+pointers if necessary. Each record is automatically wrapped by the
+framework with a header that is used to facilitate deletion support.
+The record concept is shown in Listing~\ref{lst:record}.
+
+\begin{lstfloat}
+\begin{lstlisting}[language=C++]
+template <typename R>
+concept RecordInterface = requires(R r, R s) {
+    { r == s } -> std::convertible_to<bool>;
+};
+\end{lstlisting}
+\caption{The required interface for record types in our dynamization framework.}
+\label{lst:record}
+\end{lstfloat}
+
+\subsubsection{Shard Interface}
+\subsubsection{Query Interface}
+
+\subsection{Configurability}
+
+\subsection{Concurrency}
+
+\section{Evaluation}
+\subsection{Experimental Setup}
+\subsection{Design Space Evaluation}
+\subsection{Independent Range Sampling}
+\subsection{k-NN Search}
+\subsection{Range Scan}
+\subsection{String Search}
+\subsection{Concurrency}
 
-\section{Conclusion}
-
-In this chapter, a generalized version of the framework originally proposed in
-Chapter~\ref{chap:sampling} was presented. This framework is based on two
-key properties: extended decomposability and record identity. It is capable
-of extending any data structure and search problem supporting these two properties
-with support for inserts and deletes. An evaluation of this framework was performed
-by extending several static data structures, and comparing the resulting structures'
-performance against dynamic baselines capable of answering the same type of search
-problem. The extended structures generally performed as well as, if not better than,
-their dynamic baselines in query performance, insert performance, or both. This demonstrates
-the capability of this framework to produce viable indexes in a variety of contexts. However,
-the framework is not yet complete. In the next chapter, the work required to bring this
-framework to completion will be described.
+
+\section{Conclusion}