\chapter{Generalizing the Framework} \begin{center} \emph{The following chapter is an adaptation of work completed in collaboration with Dr. Dong Xie and Dr. Zhuoyue Zhao and published in PVLDB Volume 17, Issue 11 (July 2024) under the title ``Towards Systematic Index Dynamization''. } \hrule \end{center} \label{chap:framework} \section{Introduction} In the previous chapter, we discussed how several of the limitations of dynamization could be overcome by proposing a systematic dynamization approach for sampling data structures. In doing so, we introduced a multi-stage query mechanism to overcome the non-decomposability of these queries, provided two mechanisms for supporting deletes along with specialized processing to integrate these with the query mechanism, and introduced some performance tuning capability inspired by the design space of modern LSM Trees. While promising, these results are highly specialized and remain useful only within the context of sampling queries. In this chapter, we develop new generalized query abstractions based on these specific results, and discuss a fully implemented framework based upon these abstractions. More specifically, in this chapter we propose \emph{extended decomposability} and \emph{iterative deletion decomposability} as two new, broader classes of search problem which are strict supersets of decomposability and deletion decomposability respectively, providing a more powerful interface to allow the efficient implementation of a larger set of search problems over a dynamized structure. We then implement a C++ library based upon these abstractions which is capable of adding support for inserts, deletes, and concurrency to static data structures automatically, and use it to provide dynamizations for independent range sampling, range queries with learned indices, string search with succinct tries, and high-dimensional vector search with metric indices. In each case, we compare our dynamized implementation with existing dynamic structures, and with standard Bentley-Saxe dynamizations where possible. \section{Beyond Decomposability} We begin our discussion of this generalized framework by proposing new classes of search problems based upon our results from examining sampling problems in the previous chapter. Our new classes will enable the support of new types of search problem, enable more efficient support for certain already supported problems, and allow for broader support of deletes. Based on this, we will develop a taxonomy of search problems that can be supported by our dynamization technique. \subsection{Extended Decomposability} \label{ssec:edsp} As discussed in Chapter~\ref{chap:background}, the standard query model used by dynamization techniques requires that a given query be broadcast, unaltered, to each block within the dynamized structure, and then that the results from these identical local queries be efficiently mergeable to obtain the final answer to the query. This model limits dynamization to decomposable search problems (Definition~\ref{def:dsp}). In the previous chapter, we considered various sampling problems as examples of non-decomposable search problems, and devised a technique for correctly answering queries of that type over a dynamized structure. In this section, we'll retread our steps with an eye towards a general solution that could be applied in other contexts. For convenience, we'll focus exclusively on independent range sampling.
As a reminder, this search problem is defined as, \begin{definitionIRS}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. \end{definitionIRS} We formalize this as a search problem $F_\text{IRS}:(\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ where the record domain is $\mathcal{D} = \mathbb{R}$, the query parameters domain consists of ordered triples containing the lower and upper bounds of the query interval, and the number of samples to draw, $\mathcal{Q} = \mathbb{R} \times \mathbb{R} \times \mathbb{Z}^+$, and the result domain contains subsets of the real numbers, $\mathcal{R} = \mathcal{PS}(\mathbb{R})$. $F_\text{IRS}$ can be solved using a variety of data structures, such as the static ISAM solution discussed in Section~\ref{ssec:irs-struct}. For our example here, we will use a simple sorted array. Let $\mathcal{I}$ be the sorted array data structure, with a specific instance $\mathscr{I} \in \mathcal{I}$ built over a set $D \subset \mathbb{R}$ having $|D| = n$ records. The problem $F_\text{IRS}(\mathscr{I}, (l, u, k))$ can be solved by binary searching $\mathscr{I}$ twice to obtain the index of the first element greater than or equal to $l$ ($i_l$) and the last element less than or equal to $u$ ($i_u$). With these two indices, $k$ random numbers can be generated on the interval $[i_l, i_u]$ and the records at these indices returned. This sampling procedure is described in Algorithm~\ref{alg:array-irs} and runs in $\mathscr{Q}_\text{irs} \in \Theta(\log n + k)$ time. \SetKwFunction{IRS}{IRS} \begin{algorithm} \caption{Solution to IRS on a sorted array} \label{alg:array-irs} \KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample} \KwOut{$S$: a sample set of size $k$} \Def{\IRS{$(\mathscr{I}, (l, u, k))$}}{ \Comment{Find the lower and upper bounds of the interval} $i_l \gets \text{binary\_search\_lb}(\mathscr{I}, l)$ \; $i_u \gets \text{binary\_search\_ub}(\mathscr{I}, u)$ \; \BlankLine \Comment{Initialize empty sample set} $S \gets \{\}$ \; \BlankLine \For {$i=1\ldots k$} { \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_l, i_u)$ \; \Comment{Add it to the sample set} $S \gets S \cup \{\text{get}(\mathscr{I}, i_r)\}$ \; } \BlankLine \Comment{Return the sample set} \Return $S$ \; } \end{algorithm} It becomes more difficult to answer $F_\text{IRS}$ over a data structure that has been decomposed into blocks, because the number of samples taken from each block must be appropriately weighted to correspond to the number of records within each block falling into the query range. In the classical model, there isn't a way to do this, and so the only solution is to answer $F_\text{IRS}$ against each block, asking for the full $k$ samples each time, and then down-sampling the results according to the relative weight of each block to obtain a final sample set. Using this idea, we can formulate $F_\text{IRS}$ as a $C(n)$-decomposable problem by changing the result set type to $\mathcal{R} = \mathcal{PS}(\mathbb{R}) \times \mathbb{R}$ where the first element in the tuple is the sample set and the second is the number of elements falling between $l$ and $u$ in the block being sampled from.
With this information, it is possible to implement $\mergeop$ using Bernoulli sampling over the two sample sets to be merged. This requires $\Theta(k)$ time, and thus $F_\text{IRS}$ can be said to be a $k$-decomposable search problem, allowing it to be answered over a Bentley-Saxe dynamization in $\Theta(\log^2 n + k \log n)$ time. This procedure is shown in Algorithm~\ref{alg:decomp-irs}. \SetKwFunction{IRSDecomp}{IRSDecomp} \SetKwFunction{IRSCombine}{IRSCombine} \begin{algorithm}[!h] \caption{$k$-Decomposable Independent Range Sampling} \label{alg:decomp-irs} \KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample} \KwOut{$(S, c)$: a sample set of size $k$ and a count of the number of records on the interval $[l,u]$} \Def{\IRSDecomp{$\mathscr{I}_i, (l, u, k)$}}{ \Comment{Find the lower and upper bounds of the interval} $i_l \gets \text{binary\_search\_lb}(\mathscr{I}_i, l)$ \; $i_u \gets \text{binary\_search\_ub}(\mathscr{I}_i, u)$ \; \BlankLine \Comment{Initialize empty sample set} $S \gets \{\}$ \; \BlankLine \For {$i=1\ldots k$} { \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_l, i_u)$ \; \Comment{Add it to the sample set} $S \gets S \cup \{\text{get}(\mathscr{I}_i, i_r)\}$ \; } \BlankLine \Comment{Return the sample set and record count} \Return ($S$, $i_u - i_l$) \; } \BlankLine \Def{\IRSCombine{$(S_1, c_1)$, $(S_2, c_2)$}}{ \Comment{The output set should be the same size as the input ones} $k \gets |S_1|$ \; \BlankLine \Comment{Calculate the weighting that should be applied to each set when sampling} $w_1 \gets \frac{c_1}{c_1 + c_2}$ \; $w_2 \gets \frac{c_2}{c_1 + c_2}$ \; \BlankLine \Comment{Initialize output set and count} $S \gets \{\}$\; $c \gets c_1 + c_2$ \; \BlankLine \Comment{Down-sample the input result sets} $S \gets S \cup \text{bernoulli}(S_1, w_1, k\times w_1)$ \; $S \gets S \cup \text{bernoulli}(S_2, w_2, k\times w_2)$ \; \BlankLine \Return $(S, c)$ } \end{algorithm} While this approach does allow sampling over a dynamized structure, it is asymptotically inferior to Olken's method, which allows for sampling in only $\Theta(k \log n)$ time~\cite{olken89}. However, we've already seen in the previous chapter how it is possible to modify the query procedure into a multi-stage process to enable more efficient solutions to the IRS problem. The core idea underlying our solution in that chapter was to introduce individualized local queries for each block, which were created after a pre-processing step to allow information about each block to be determined first. In that particular example, we established the weight each block should have during sampling, and then created custom sampling queries with variable $k$ values, following the weight distribution. We have determined a general interface that allows for this procedure to be expressed, and we define the term \emph{extended decomposability} to refer to search problems that can be answered in this way. More formally, consider a search problem $F(D, q)$ capable of being answered using a data structure instance $\mathscr{I} \in \mathcal{I}$ built over a set of records $D \in \mathcal{D}$ that has been decomposed into $m$ blocks, $\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_m$, each corresponding to a partition of $D$: $D_1, D_2, \ldots, D_m$.
$F$ is an extended-decomposable search problem (eDSP) if it can be expressed using the following interface, \begin{itemize} \item $\mathbftt{local\_preproc}(\mathscr{I}_i, q) \to \mathscr{M}_i$ \\ Pre-process each partition, $D_i$, using its associated data structure, $\mathscr{I}_i$, and generate a meta-information object $\mathscr{M}_i$ for use in local query generation. \item $\mathbftt{distribute\_query}(\mathscr{M}_1, \ldots, \mathscr{M}_m, q) \to q_1, \ldots, q_m$\\ Process the set of meta-information about each block and produce individual local queries, $q_1, \ldots, q_m$, for each block. \item $\mathbftt{local\_query}(\mathscr{I}_i, q_i) \to r_i$ \\ Evaluate the local query with parameters $q_i$ over the data in $D_i$ using the data structure $\mathscr{I}_i$ and produce a partial query result, $r_i$. \item $\mathbftt{combine}(r_1, \ldots, r_m) \to R$ \\ Combine the list of local query results, $r_1, \ldots, r_m$, into a final query result, $R$. \end{itemize} Let $P(n)$ be the cost of $\mathbftt{local\_preproc}$, $D(n)$ be the cost of $\mathbftt{distribute\_query}$, $\mathscr{Q}_\ell(n)$ be the cost of $\mathbftt{local\_query}$, and $C_e(n)$ be the cost of $\mathbftt{combine}$. Solving a search problem with this interface requires calling $\mathbftt{local\_preproc}$ and $\mathbftt{local\_query}$ once per block, and $\mathbftt{distribute\_query}$ and $\mathbftt{combine}$ once. For a Bentley-Saxe dynamization then, with $O(\log_2 n)$ blocks, the worst-case cost of answering an eDSP is, \begin{equation} \label{eqn:edsp-cost} O \left( \log_2 n \cdot P(n) + D(n) + \log_2 n \cdot \mathscr{Q}_\ell(n) + C_e(n) \right) \end{equation} As an example, we'll express IRS using the above interface and analyze its complexity to show that the resulting solution has the same $\Theta(\log^2 n + k)$ cost as the specialized solution from Chapter~\ref{chap:sampling}. We use $\mathbftt{local\_preproc}$ to determine the number of records in each block falling on the interval $[l, u]$ and return this, as well as $i_l$ and $i_u$, as the meta-information. Then, $\mathbftt{distribute\_query}$ will perform weighted set sampling using a temporary alias structure over the weights of all of the blocks to calculate the appropriate value of $k$ for each local query, which will consist of $(i_{l,i}, i_{u,i}, k_i)$. With the appropriate value of $k$, as well as the indices of the upper and lower bounds, pre-calculated, $\mathbftt{local\_query}$ can simply generate $k_i$ random integers and return the corresponding records. $\mathbftt{combine}$ then merges all of the local results and returns the final result set. Algorithm~\ref{alg:edsp-irs} shows each of these operations in pseudo-code.
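Viewed from the dynamized structure's side, these four operations compose in a straightforward way, which is where the cost in Equation~\ref{eqn:edsp-cost} comes from. The listing below is a minimal C++ sketch of such a driver over $m$ blocks; the \texttt{QUERY} member types and function names are illustrative assumptions for this sketch, not the framework's actual interface (which is described in Section~\ref{sec:dyn-framework}).

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <vector>

// Minimal sketch of the eDSP query flow. The QUERY type is assumed to
// expose the four operations above as static members; all member type
// names here are hypothetical and exist only for illustration.
template <typename QUERY, typename SHARD>
typename QUERY::Result edsp_query(const std::vector<SHARD*> &blocks,
                                  const typename QUERY::Parameters &q) {
    // 1. local_preproc: gather meta-information from each block
    std::vector<typename QUERY::Meta> meta;
    meta.reserve(blocks.size());
    for (auto *block : blocks) {
        meta.push_back(QUERY::local_preproc(block, q));
    }

    // 2. distribute_query: build one local query per block
    std::vector<typename QUERY::LocalQuery> local_queries =
        QUERY::distribute_query(meta, q);

    // 3. local_query: evaluate each local query against its block
    std::vector<typename QUERY::LocalResult> local_results;
    local_results.reserve(blocks.size());
    for (size_t i = 0; i < blocks.size(); i++) {
        local_results.push_back(QUERY::local_query(blocks[i], local_queries[i]));
    }

    // 4. combine: merge the local results into the final answer
    return QUERY::combine(local_results);
}
\end{lstlisting}

Calling $\mathbftt{local\_preproc}$ and $\mathbftt{local\_query}$ once per block, and $\mathbftt{distribute\_query}$ and $\mathbftt{combine}$ once, is exactly the accounting behind Equation~\ref{eqn:edsp-cost}.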
\SetKwFunction{preproc}{local\_preproc} \SetKwFunction{distribute}{distribute\_query} \SetKwFunction{query}{local\_query} \SetKwFunction{combine}{combine} \begin{algorithm}[t] \caption{IRS with Extended Decomposability} \label{alg:edsp-irs} \KwIn{$k$: sample size, $[l,u]$: lower and upper bound of records to sample} \KwOut{$R$: a sample set of size $k$} \Def{\preproc{$\mathscr{I}_i$, $q=(l,u,k)$}}{ \Comment{Find the indices for the upper and lower bounds of the query range} $i_l \gets \text{binary\_search\_lb}(\mathscr{I}_i, l)$ \; $i_u \gets \text{binary\_search\_ub}(\mathscr{I}_i, u)$ \; \BlankLine \Return $(i_l, i_u)$ \; } \BlankLine \Def{\distribute{$\mathscr{M}_1$, $\ldots$, $\mathscr{M}_m$, $q=(l,u,k)$}}{ \Comment{Determine number of records to sample from each block} $k_1, \ldots k_m \gets \mathtt{wss}(k, \mathscr{M}_1, \ldots \mathscr{M}_m)$ \; \BlankLine \Comment{Build local query objects} \For {$i=1..m$} { $q_i \gets (\mathscr{M}_i.i_l, \mathscr{M}_i.i_u, k_i)$ \; } \BlankLine \Return $q_1 \ldots q_m$ \; } \BlankLine \Def{\query{$\mathscr{I}_i$, $q_i = (i_{l,i},i_{u,i},k_i)$}}{ \Comment{Initialize empty sample set} $S \gets \{\}$ \; \For {$i=1\ldots k_i$} { \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_{l,i}, i_{u,i})$ \; \Comment{Add it to the sample set} $S \gets S \cup \{\text{get}(\mathscr{I}_i, i_r)\}$ \; } \Return $S$ \; } \BlankLine \Def{\combine{$r_1, \ldots, r_m$, $q=(l, u, k)$}}{ \Comment{Union results together} \Return $\bigcup_{i=1}^{m} r_i$ } \end{algorithm} These operations result in $P(n) \in \Theta(\log n)$, $D(n) \in \Theta(\log n)$, $\mathscr{Q}_\ell(n,k) \in \Theta(k)$, and $C_e(n) \in \Theta(1)$. At first glance, it would appear that we have arrived at a solution with a query cost of $O\left(\log_2^2 n + k\log_2 n\right)$, and have thus fallen short of our goal. However, Equation~\ref{eqn:edsp-cost} is only an upper bound on the cost. In the case of IRS, we can leverage an important problem-specific detail to obtain a better result: the total cost of the local queries is actually \emph{independent} of the number of blocks. For IRS, the cost of $\mathbftt{local\_query}$ is linear in the number of samples requested. Our initial asymptotic cost assumes that, in the worst case, each of the $\log_2 n$ blocks is sampled $k$ times. But this is not true of our algorithm. Rather, only $k$ samples are taken \emph{in total}, distributed across all of the blocks. Thus, regardless of how many blocks there are, there will only be $k$ samples drawn, requiring $k$ random number generations, etc. As a result, the total cost of the local query term in the cost function is actually $\Theta(k)$. Applying this result gives us a tighter bound of, \begin{equation*} \mathscr{Q}_\text{IRS} \in \Theta\left(\log_2^2 n + k\right) \end{equation*} which matches the result of Chapter~\ref{chap:sampling} for IRS in the absence of deletes. The other sampling problems considered in Chapter~\ref{chap:sampling} can be similarly implemented using this interface, with the same performance as their specialized implementations. \subsection{Iterative Deletion Decomposability} \label{ssec:dyn-idsp} We next turn our attention to support for deletes.
Efficient delete support in Bentley-Saxe dynamization is provably impossible~\cite{saxe79}, but, as discussed in Section~\ref{ssec:dyn-deletes}, it is possible to support them in restricted situations, where either the search problem is invertible (Definition~\ref{def:invert}) or the data structure and search problem combined are deletion decomposable (Definition~\ref{def:background-ddsp}). In Chapter~\ref{chap:sampling}, we considered a set of search problems which did \emph{not} satisfy any of these properties, and instead built a customized solution for deletes that required tight integration with the query process in order to function. While such a solution was acceptable for the goals of that chapter, it is not sufficient for our goal in this chapter of producing a generalized system. Additionally, of the two types of problem that can support deletes, the invertible case is preferable. This is because the amount of work necessary to support deletes for invertible search problems is very small: the data structure requires no modification (such as to implement weak deletes), and the query requires no modification (to ignore the weak deletes) aside from the addition of the $\Delta$ operator. This is appealing from a framework design standpoint. Thus, it is also worth considering approaches for expanding the range of search problems that can be answered using the ghost structure mechanism supported by invertible problems. A significant limitation of invertible problems is that the result set size cannot be controlled. We do not know how many records in our local results have been deleted until we reach the combine operation and they begin to cancel out, at which point we lack a mechanism to go back and retrieve more records. This presents difficulties for addressing important search problems such as top-$k$, $k$-NN, and sampling. In principle, these queries could be supported by repeating the query with larger and larger $k$ values until the desired number of records is returned, but in the eDSP model this requires throwing away a lot of useful work, as the state of the query must be rebuilt each time. We can resolve this problem by moving the decision to repeat the query into the query interface itself, allowing retries \emph{before} the result set is returned to the user and the local meta-information objects discarded. This allows us to preserve the pre-processing work, and repeat the local query process as many times as is necessary to achieve our desired number of records. From this observation, we propose another new class of search problem: \emph{iterative deletion decomposable} (IDSP). The IDSP definition expands eDSP with a fifth operation, \begin{itemize} \item $\mathbftt{repeat}(\mathcal{Q}, \mathcal{R}, \mathcal{Q}_1, \ldots, \mathcal{Q}_m) \to (\mathbb{B}, \mathcal{Q}_1, \ldots, \mathcal{Q}_m)$ \\ Evaluate the combined query result in light of the query. If a repetition is necessary to satisfy constraints in the query (e.g., result set size), optionally update the local queries as needed and return true. Otherwise, return false. \end{itemize} If this routine returns true, it must also modify the local queries as necessary to account for the work that remains to be completed (e.g., update the number of records to retrieve). Then, the query process resumes from the execution of the local queries. If it returns false, then the result is simply returned to the user.
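For a result-size-constrained search problem such as top-$k$, the $\mathbftt{repeat}$ operation amounts to checking whether the combined result is still short of $k$ records and, if so, adjusting the local queries before signaling a repetition. A minimal C++ sketch of this idea follows; the \texttt{Parameters}, \texttt{LocalQuery}, and \texttt{Result} types are hypothetical stand-ins used only for illustration.

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <vector>

// Hypothetical types, for illustration only.
struct Parameters { size_t k; };                 // requested result size
struct LocalQuery { size_t k; /* ... */ };       // per-block query state
struct Result     { std::vector<long> records; };

// Sketch of an IDSP repeat operation: if deletes have left the combined
// result short of k records, ask the blocks for the shortfall and signal
// that the local query step should be run again.
bool repeat(const Parameters &q, const Result &result,
            std::vector<LocalQuery> &local_queries) {
    if (result.records.size() >= q.k) {
        return false;  // constraint satisfied; return the result to the user
    }

    size_t missing = q.k - result.records.size();
    for (auto &lq : local_queries) {
        lq.k = missing;  // only the missing records need to be re-queried
    }

    return true;  // resume from the local query step
}
\end{lstlisting}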
If the number of repetitions of the query is bounded by $R(n)$, then the following provides an upper bound on the worst-case query complexity of an IDSP, \begin{equation*} O\left(\log_2 n \cdot P(n) + D(n) + R(n) \left(\log_2 n \cdot Q_s(n) + C_e(n)\right)\right) \end{equation*} It is important that a bound on the number of repetitions exists, as without this the worst-case query complexity is unbounded. The details of providing and enforcing this bound are highly specific to the search problem. For problems like $k$-NN or top-$k$, the number of repetitions is a function of the number of deleted records within the structure, and so $R(n)$ can be bounded by placing a limit on the number of deleted records. This can be done, for example, using the full-reconstruction techniques in the literature~\cite{saxe79, merge-dsp, overmars83} or by proactively performing reconstructions, such as with the mechanism discussed in Section~\ref{sssec:sampling-rejection-bound}, depending on the particulars of how deletes are implemented. As an example of how the IDSP model can facilitate delete support for search problems, let's consider $k$-NN. This problem can be $C(n)$-deletion decomposable, depending upon the data structure used to answer it, but it is not invertible because it suffers from the problem of potentially returning fewer than $k$ records in the final result set after the results of the query against the primary and ghost structures have been combined. Worse, even if the query does return $k$ records as requested, it is possible that the result set could be incorrect, depending upon which records were deleted, what block those records are in, and the order in which the merge and inverse merge are applied. \begin{example} Consider the $k$-NN search problem, $F$, over some metric index $\mathcal{I}$. $\mathcal{I}$ has been dynamized, with a ghost structure for deletes, and consists of two blocks, $\mathscr{I}_1$ and $\mathscr{I}_2$ in the primary structure, and one block, $\mathscr{I}_G$ in the ghost structure. The structures contain the following records, \begin{align*} \mathscr{I}_1 &= \{ x_1, x_2, x_3, x_4, x_5\} \\ \mathscr{I}_2 &= \{ x_6, x_7, x_8 \} \\ \mathscr{I}_G &= \{x_1, x_2, x_3 \} \end{align*} where the subscript indicates the proximity to some point, $p$. Thus, the correct answer to the query $F(\mathscr{I}, (3, p))$ would be the set of points $\{x_4, x_5, x_6\}$. Querying each of the three blocks independently, however, will produce an incorrect answer. The partial results will be, \begin{align*} r_1 &= \{x_1, x_2, x_3\} \\ r_2 &= \{x_6, x_7, x_8\} \\ r_g &= \{x_1, x_2, x_3\} \end{align*} and, assuming that $\mergeop$ returns the $k$ elements closest to $p$ from the inputs, and $\Delta$ removes matching elements, performing $r_1~\mergeop~r_2~\Delta~r_g$ will give an answer of $\{\}$, which has insufficient records, and performing $r_1~\Delta~r_g~\mergeop~r_2$ will provide a result of $\{x_6, x_7, x_8\}$, which is wrong. \end{example} From this example, we can draw two conclusions about performing $k$-NN using a ghost structure for deletes. First, all of the local results from the primary structure must be merged prior to removing any deleted records, to ensure correctness. Second, once the ghost structure records have been removed, we may need to go back to the dynamized structure for more records to ensure that we have enough. Both of these requirements can be accommodated by the IDSP model, and the resulting query algorithm is shown in Algorithm~\ref{alg:idsp-knn}.
This algorithm assumes that the data structure in question can save the current traversal state in the meta-information object, and resume a $k$-NN query on the structure from that state at no cost. \SetKwFunction{repeat}{repeat} \afterpage{\clearpage} \begin{algorithm}[p] \caption{$k$-NN with Iterative Decomposability} \label{alg:idsp-knn} \KwIn{$k$: result size, $p$: query point} \Def{\preproc{$q=(k, p)$, $\mathscr{I}_i$}}{ \Return $\mathscr{I}_i.\text{initialize\_state}(k, p)$ \; } \BlankLine \Def{\distribute{$\mathscr{M}_1$, ..., $\mathscr{M}_m$, $q=(k,p)$}}{ \For {$i\gets1 \ldots m$} { $q_i \gets (k, p, \mathscr{M}_i)$ \; } \Return $q_1 \ldots q_m$ \; } \BlankLine \Def{\query{$\mathscr{I}_i$, $q_i=(k,p,\mathscr{M}_i)$}}{ $(r_i, \mathscr{M}_i) \gets \mathscr{I}_i.\text{knn\_from}(k, p, \mathscr{M}_i)$ \; \Comment{The local result stores records in a priority queue} \Return $(r_i, \mathscr{M}_i)$ \; } \BlankLine \Def{\combine{$r_1, \ldots, r_m, \ldots, r_n$, $q=(k,p)$}}{ $R \gets \{\}$ ; $pq \gets \text{PriorityQueue}()$ ; $gpq \gets \text{PriorityQueue}()$ \; \Comment{Results $1$ through $m$ are from the primary structure, and $m+1$ through $n$ are from the ghost structure.} \For {$i\gets 1 \ldots m$} { $pq.\text{enqueue}(i, r_i.\text{front}())$ \; } \For {$i \gets m+1 \ldots n$} { $gpq.\text{enqueue}(i, r_i.\text{front}())$ } \BlankLine \Comment{Process the primary local results} \While{$|R| < k \land \neg pq.\text{empty}()$} { $(i, d) \gets pq.\text{dequeue}()$ \; $R \gets R \cup \{r_i.\text{dequeue}()\}$ \; \If {$\neg r_i.\text{empty}()$} { $pq.\text{enqueue}(i, r_i.\text{front}())$ \; } } \BlankLine \Comment{Process the ghost local results} \While{$\neg gpq.\text{empty}()$} { $(i, d) \gets gpq.\text{dequeue}()$ \; \If {$r_i.\text{front}() \in R$} { $R \gets R \setminus \{r_i.\text{front}()\}$ \; $r_i.\text{dequeue}()$ \; \If {$\neg r_i.\text{empty}()$} { $gpq.\text{enqueue}(i, r_i.\text{front}())$ \; } } } \Return $R$ \; } \BlankLine \Def{\repeat{$q=(k,p), R, q_1,\ldots q_m$}} { $missing \gets k - R.\text{size}()$ \; \If {$missing > 0$} { \For {$i \gets 1\ldots m$} { $q_i \gets (missing, p, q_i.\mathscr{M}_i)$ \; } \Return $(True, q_1 \ldots q_m)$ \; } \Return $(False, q_1 \ldots q_m)$ \; } \end{algorithm} \subsection{Search Problem Taxonomy} Having defined two new classes of search problem, it seems sensible at this point to collect our definitions together with pre-existing ones from the classical literature, and present a cohesive taxonomy of the search problems for which our techniques can be used to support dynamization. This taxonomy is shown in the Venn diagrams of Figure~\ref{fig:taxonomy}. Note that, for convenience, the search problem classifications relevant for supporting deletes have been separated out into their own diagram. In principle, this deletion taxonomy can be thought of as being nested inside of each of the general search problem classifications, as the two sets of classifications are orthogonal: the classification of a search problem in the general taxonomy implies nothing about where that same problem falls in the deletion taxonomy. \begin{figure}[t] \subfloat[General Taxonomy]{\includegraphics[width=.49\linewidth]{diag/taxonomy} \label{fig:taxonomy-main}} \subfloat[Deletion Taxonomy]{\includegraphics[width=.49\linewidth]{diag/deletes} \label{fig:taxonomy-deletes}} \caption{An overview of the Taxonomy of Search Problems, as relevant to our discussion of data structure dynamization.
Our proposed extensions are marked with an asterisk (*) and colored yellow. } \label{fig:taxonomy} \end{figure} Figure~\ref{fig:taxonomy-main} illustrates the classifications of search problem that are not deletion-related, including standard decomposability (DSP), extended decomposability (eDSP), $C(n)$-decomposability ($C(n)$-DSP), and merge decomposability (MDSP). We consider ISAM, TrieSpline~\cite{plex}, and succinct trie~\cite{zhang18} to be examples of MDSPs because these data structures can be constructed more efficiently from sorted data, and so when building from existing blocks, the data is already sorted in each block and can be merged efficiently while maintaining sorted order. VP-trees~\cite{vptree} and alias structures~\cite{walker74}, in contrast, don't have a convenient way of merging, and so must be reconstructed in full each time. We have classified sampling queries in this taxonomy as eDSPs because this implementation is more efficient than the $C(n)$-decomposable variant we have also discussed. $k$-NN queries, for reasons discussed in Chapter~\ref{chap:background}, are classified as $C(n)$-decomposable. The classification of range scans is a bit trickier. It is not uncommon in the theoretical literature for range scans to be considered DSPs, with $\mergeop$ taken to be the set union operator. From an implementation standpoint, it is sometimes possible to perform a union in $\Theta(1)$ time. For example, in Chapter~\ref{chap:sampling} we accomplished this by placing sampled records directly into a shared buffer, and not having an explicit combine step at all. However, in the general case where we do need an explicit combine step, the union operation does require time linear in the size of the result sets to copy the records from the local results into the final result. The sizes of these results are functions of the selectivity of the range scan, but theoretically could be large relative to the data size, and so we've decided to err on the side of caution and classify range scans as $C(n)$-decomposable here. If the results of the range scan are expected to be returned in sorted order, then the problem is \emph{certainly} $C(n)$-decomposable. Range counts, on the other hand, are truly DSPs.\footnote{ Because of the explicit combine interface we use for eDSPs, the optimization of writing samples directly into the buffer that we used in the previous chapter to get a $\Theta(1)$ set union cannot be used for the eDSP implementation of IRS in this chapter. However, our eDSP sampling in Algorithm~\ref{alg:edsp-irs} samples \emph{exactly} $k$ records, and so the combination step still only requires $\Theta(k)$ work, and the complexity remains the same. } Point lookups are an example of a DSP as well, assuming that the lookup key is unique, or at least minimally duplicated. In the case where the number of results for the lookup becomes a substantial proportion of the total data size, this search problem could be considered $C(n)$-decomposable for the same reason as range scans. Figure~\ref{fig:taxonomy-deletes} shows the various classes of search problem relevant to delete support. We have made the decision to classify invertible problems (INV) as a subset of deletion decomposable problems (DDSP), because one could always embed the ghost structure directly into the block implementation, use the DDSP delete operation to insert into that block, and handle the $\Delta$ operator as part of $\mathbftt{local\_query}$.
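To illustrate this embedding, the following minimal C++ sketch shows a block for a range count query that carries its own ghost structure, applies deletes to it through a DDSP-style delete operation, and folds the $\Delta$ subtraction into its local query. The type and member names are illustrative, not part of our framework.

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration of embedding a ghost structure inside a block so that an
// invertible problem (range count) can be treated as a DDSP. Both the
// primary data and the ghost records are kept as sorted arrays.
struct RangeCountBlock {
    std::vector<uint64_t> data;    // records in this block (sorted)
    std::vector<uint64_t> ghosts;  // deleted records (sorted)

    // DDSP-style delete: record the deletion inside the block itself.
    void delete_record(uint64_t key) {
        ghosts.insert(std::lower_bound(ghosts.begin(), ghosts.end(), key), key);
    }

    // Local query: count the in-range records and subtract the in-range
    // ghosts (the Delta operator), so results merge by simple addition.
    size_t range_count(uint64_t lo, uint64_t hi) const {
        auto count = [&](const std::vector<uint64_t> &v) {
            return static_cast<size_t>(
                std::upper_bound(v.begin(), v.end(), hi) -
                std::lower_bound(v.begin(), v.end(), lo));
        };
        return count(data) - count(ghosts);
    }
};
\end{lstlisting}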
We consider range count to be invertible, with $\Delta$ taken to be subtraction. Range scans are also invertible, technically, but filtering out the deleted records during result set merging is relatively expensive, as it requires either performing a sorted merge of all of the records (rather than a simple union) to cancel out records with their ghosts, or doing a linear search for each ghost record to remove its corresponding data from the result set. As a result, we have classified them as DDSPs instead, as weak deletes are easily supported during range scans with no extra cost. Any records marked as deleted can simply be skipped over when copying into the local or final result sets. Similarly, $k$-NN queries admit a DDSP solution for certain data structures, but we've elected to classify them as IDSPs using Algorithm~\ref{alg:idsp-knn}, as this requires no modifications to the data structure to support weak deletes, and not all metric indexing structures support the efficient point lookups that would be necessary to implement them. We've also classified IRS as an IDSP, which is the only place in the taxonomy that it can fit. Note that IRS (and other sampling problems) are unique in this model in that they require the IDSP classification, but must actually support deletes using weak deletes. There's no way to support ghost-structure-based deletes in our general framework for sampling queries.\footnote{ This is in contrast to the specialized framework for sampling in Chapter~\ref{chap:sampling}, where we heavily modified the query process to make tombstone-based deletes (the tombstones there being analogous to a ghost structure) possible. } \section{Dynamization Framework} \label{sec:dyn-framework} With the previously discussed new classes of search problems devised, we can now present our generalized framework based upon those models. This framework takes the form of a header-only C++20 library which can automatically extend data structures with support for concurrent inserts and deletes, depending upon the classification of the problem in the taxonomy of Figure~\ref{fig:taxonomy}. The user provides the data structure and query implementations as template parameters, and the framework then provides an interface that allows for queries, inserts, and deletes against the new dynamic structure. Specifically, in addition to accessors for various structural information, the framework provides the following main operations, \begin{itemize} \item \texttt{int insert(RecordType);} \\ This function will insert a record into the dynamized structure, and will return $1$ if the record was successfully inserted, and $0$ if it was not. Insertion failure is part of the concurrency control mechanism, and failed inserts should be retried after a short delay. More details of this are in Section~\ref{ssec:dyn-concurrency}. \item \texttt{int erase(RecordType);} \\ This function will delete a record from the dynamized structure, returning $1$ on success and $0$ on failure. The meaning of a failure to delete is dependent upon the delete mechanism in use, and will be discussed in Section~\ref{sssec:dyn-deletes}. \item \texttt{std::future<QueryResult> query(QueryParameters);} \\ This function will execute a query with the specified parameters against the structure and return the result. This interface is asynchronous, and returns a future immediately, which can be used to access the query result once the query has finished executing.
\end{itemize} It can be configured with a template argument to run in single-threaded mode, or multi-threaded mode. In multi-threaded mode, the above routines can be called concurrently without any necessary synchronization in user code, and without requiring any special modification to the data structure and queries, beyond those changes necessary to use them in single-threaded mode. \subsection{Basic Principles} Before discussing the interfaces that the user must implement to use their code with our framework, it seems wise to discuss the high level functioning and structure of the framework, the details of which inform certain decisions about the necessary features that the user must implement to interface with it. The high level structure and organization of the framework is similar to that of Section~\ref{ssec:sampling-framework}. The framework requires the user to specify types to represent the record, query, and data structure (which we call a shard). The details of the interface requirements for these types are discussed in Section~\ref{ssec:dyn-interface}, and are enforced using C++20's concepts mechanism. \begin{figure} \centering %\vspace{-3mm} \subfloat[\small Leveling]{\includegraphics[width=.5\textwidth]{diag/leveling} \label{fig:dyn-leveling}} %\vspace{-3mm} \subfloat[\small Tiering]{\includegraphics[width=.5\textwidth]{diag/tiering} \label{fig:dyn-tiering}} %\vspace{-3mm} \caption{\small An overview of the general structure of the dynamization framework using (a) leveling and (b) tiering layout policies, with a scale factor 3. Each shard is shown as a dotted box, wrapping its associated dataset ($D_i$) and index ($I_i$). } \label{fig:dyn-framework} %\vspace{-3mm} \end{figure} Internally, the framework consists of a sequence of \emph{levels} with increasing record capacity, each containing one or more \emph{shards}. The layout of these levels is defined by a template argument, the \emph{layout policy}, and an integer called the \emph{scale factor}. The latter governs how quickly the record capacities of each level grow, and the former controls how those records are broken into shards on the level and the way in which records move from level to level during reconstructions. The details of layout policies, reconstruction, etc., will be discussed in a later section. Logically ``above'' these levels is a small unsorted array, called the \emph{mutable buffer}. The mutable buffer is of user-configurable size, and all inserts into the structure are first placed into it. When the buffer fills, it will be flushed into the structure, requiring reconstructions to occur in a manner consistent with the layout policy in order to make room. A simple graphical representation of the framework and two of its layout policies is shown in Figure~\ref{fig:dyn-framework}. The framework provides two mechanisms for supporting deletes: tagging and tombstones. These are identical to the mechanisms discussed in Section~\ref{ssec:sampling-deletes}, with tombstone deletes operating by inserting a record identical to the one to be deleted into the structure, with an indicator bit set in the header, and tagged deletes performing a lookup of the record to be deleted in the structure and setting a bit in its header directly. Tombstone deletes are used to support invertible search problems, and tagged deletes are used for deletion decomposable search problems. 
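To make the workflow concrete, the following sketch shows how user code might interact with a dynamized structure through the external interface described above. The \texttt{DYNAMIZED}, \texttt{REC}, and \texttt{PARAMS} names are placeholders for whatever types the framework is instantiated with; this is an illustration, not a verbatim excerpt of the library.

\begin{lstlisting}[language=C++]
#include <chrono>
#include <thread>

// Illustrative usage of the external interface. DYNAMIZED stands in for an
// instantiation of the framework, and REC / PARAMS for the user's record
// and query parameter types.
template <typename DYNAMIZED, typename REC, typename PARAMS>
void example_usage(DYNAMIZED &index, REC rec, PARAMS params) {
    // insert() returns 0 on failure as part of the concurrency control
    // mechanism; failed inserts are simply retried after a short delay.
    while (!index.insert(rec)) {
        std::this_thread::sleep_for(std::chrono::microseconds(10));
    }

    // erase() returns 0 if the delete could not be performed; what that
    // means depends on whether tombstones or tagging are in use.
    index.erase(rec);

    // query() is asynchronous: it returns a future immediately, which can
    // be waited on to obtain the final query result.
    auto result_future = index.query(params);
    auto result = result_future.get();
    (void) result;
}
\end{lstlisting}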
While the delete procedure itself is handled automatically by the framework based upon the specified mechanism, it is the user's responsibility to appropriately handle deleted records in their query and shard implementations. \subsection{Interfaces} \label{ssec:dyn-interface} In order to enforce interface requirements, our implementation takes advantage of C++20 concepts. There are three major sets of interfaces that the user of the framework must implement: records, shards, and queries. We'll discuss each of these in this section. \subsubsection{Record Interface} The record interface is the simplest of the three. The type used as a record only requires an implementation of an equality comparison operator, and is assumed to be of fixed length. Beyond this, the framework places no additional constraints and makes no assumptions about record contents, their ordering properties, etc. Though the records must be fixed length, variable length data can be supported using off-record storage and pointers if necessary. Each record is automatically wrapped by the framework with a header that is used to facilitate deletion support. The record concept is shown in Listing~\ref{lst:record}, along with the wrapped header type that is used to interact with records within the framework. \begin{lstfloat} \begin{lstlisting}[language=C++]
template <typename R>
concept RecordInterface = requires(R r, R s) {
    { r == s } -> std::convertible_to<bool>;
};

template <RecordInterface R>
struct Wrapped {
    uint32_t header;
    R rec;

    inline void set_delete();
    inline bool is_deleted() const;
    inline void set_tombstone(bool val);
    inline bool is_tombstone() const;
    inline bool operator==(const Wrapped &other) const;
};
\end{lstlisting} \caption{The required interface for record types in our dynamization framework.} \label{lst:record} \end{lstfloat} \subsubsection{Shard Interface} Our framework's underlying representation of the data structure is called a \emph{shard}. The user's shard type should contain either a full implementation of the data structure to be dynamized, or a shim around an existing implementation that provides the necessary functions for our framework to interact with it. Shards must provide two constructors: one from an unsorted set of records, and another from a set of other shards of the same type. The second of these constructors is to allow for efficient merging to be leveraged for merge decomposable search problems. Shards can also expose a point lookup operation for use in supporting deletes for DDSPs. This function is only used for DDSP deletes, and so can be omitted when this functionality isn't necessary. If a data structure doesn't natively support an efficient point lookup, then one can be added by including a hash table or other auxiliary structure in the shard if desired. This function accepts a record type as input, and should return a pointer to the record that exactly matches the input in storage, if one exists, or \texttt{nullptr} if it doesn't. It should also accept an optional boolean argument that the framework will pass \texttt{true} into if the lookup operation is being used to search for a tombstone record. This flag allows the shard to use various tombstone-related optimizations, such as maintaining a Bloom filter over tombstones, or storing them separately from the main records, etc. Shards should also expose some accessors for basic meta-data about their contents.
In particular, the framework relies upon a function that returns the number of records within the shard for planning reconstructions, and the number of deleted records or tombstones within the shard for use in proactive compaction to bound the number of deleted records. The interface also requires functions for accessing memory usage information, both the memory used for the main data structure being dynamized, and also any auxiliary memory (e.g., memory used for an auxiliary hash table). These memory functions are used only for informational purposes. The concept for shard types is shown in Listing~\ref{lst:shard}. Note that all records within shards are wrapped by the framework header. It is up to the shard to handle the removal of deleted records based on this information during reconstruction. \begin{lstfloat} \begin{lstlisting}[language=C++]
template <typename SHARD>
concept ShardInterface = RecordInterface<typename SHARD::RECORD> &&
    requires(SHARD shard, const std::vector<SHARD*> &shard_vector, bool b,
             BufferView<typename SHARD::RECORD> bv,
             typename SHARD::RECORD rec) {
    {SHARD(shard_vector)};
    {SHARD(std::move(bv))};
    { shard.point_lookup(rec, b) }
        -> std::same_as<Wrapped<typename SHARD::RECORD> *>;
    { shard.get_record_count() } -> std::convertible_to<size_t>;
    { shard.get_tombstone_count() } -> std::convertible_to<size_t>;
    { shard.get_memory_usage() } -> std::convertible_to<size_t>;
    { shard.get_aux_memory_usage() } -> std::convertible_to<size_t>;
};
\end{lstlisting} \caption{The required interface for shard types in our dynamization framework.} \label{lst:shard} \end{lstfloat} \subsubsection{Query Interface} The most complex interface required by the framework is for queries. The concept for query types is given in Listing~\ref{lst:query}. In effect, it requires implementing the full IDSP interface from the previous section, as well as versions of $\mathbftt{local\_preproc}$ and $\mathbftt{local\_query}$ for pre-processing and querying an unsorted set of records, which is necessary to allow the mutable buffer to be used as part of the query process.\footnote{ In the worst case, these routines could construct a temporary shard over the mutable buffer and use it to answer queries. } The $\mathbftt{repeat}$ function must be implemented even for normal eDSP problems, but should simply return \texttt{false} with no other action in those cases. The interface also allows the user to specify whether the query process should abort after the first result is obtained, which is a useful optimization for point lookups. This interface allows the local and overall query results to be specified independently, as different types. This can be used for a variety of purposes. For example, an invertible range count can have a local result that includes both the number of records and the number of tombstones, while the query result itself remains a single number. Additionally, the framework makes no decision about what, if any, collection type should be used for these results. A range scan, for example, could specify the result types as a vector of records, a map of records, etc., depending on the use case. There is one significant difference between the IDSP interface and the query concept implementation. For efficiency purposes, \texttt{combine} does not return the query result object. Instead, the framework itself initializes the object, and then passes it by reference into \texttt{combine}. This is necessary because \texttt{combine} can be called multiple times, depending on whether the query must be repeated.
Adding it as an argument to \texttt{combine}, rather than returning it, allows for the local query results to be discarded completely, and new results generated and added to the existing result set, in the case of a repetition. Without this modification, the user would either need to define an additional combination operation for final result types, or duplicate effort in the combine step on each repetition. \begin{lstfloat} \begin{lstlisting}[language=C++]
template <typename QUERY, typename SHARD,
          typename PARAMETERS, typename LOCAL, typename LOCAL_BUFFER,
          typename LOCAL_RESULT, typename RESULT>
concept QueryInterface = requires(PARAMETERS *parameters, LOCAL *local,
                                  LOCAL_BUFFER *buffer_query, SHARD *shard,
                                  std::vector<LOCAL*> &local_queries,
                                  std::vector<LOCAL_RESULT> &local_results,
                                  RESULT &result,
                                  BufferView<typename SHARD::RECORD> *bv) {
    { QUERY::local_preproc(shard, parameters) }
        -> std::convertible_to<LOCAL*>;
    { QUERY::local_preproc_buffer(bv, parameters) }
        -> std::convertible_to<LOCAL_BUFFER*>;
    { QUERY::distribute_query(parameters, local_queries, buffer_query) };
    { QUERY::local_query(shard, local) }
        -> std::convertible_to<LOCAL_RESULT>;
    { QUERY::local_query_buffer(buffer_query) }
        -> std::convertible_to<LOCAL_RESULT>;
    { QUERY::combine(local_results, parameters, result) };
    { QUERY::repeat(parameters, result, local_queries, buffer_query) }
        -> std::same_as<bool>;
    { QUERY::EARLY_ABORT } -> std::convertible_to<bool>;
};
\end{lstlisting} \caption{The required interface for query types in our dynamization framework.} \label{lst:query} \end{lstfloat} \subsection{Internal Mechanisms} Given user-provided query, shard, and record types, the framework will automatically provide support for inserts, as well as deletes for supported search problems, and concurrency if desired. This section will discuss the internal mechanisms that the framework uses to support these operations in a single-threaded context. Concurrency will be discussed in Section~\ref{ssec:dyn-concurrency}. \subsubsection{Inserts and Layout Policy} New records are inserted into the structure by appending them to the end of the mutable buffer. When the mutable buffer is filled, it must be flushed to make room for further inserts. This flush involves building a shard from the records in the buffer using the unsorted constructor, and then performing a series of reconstructions to integrate this new shard into the structure. Once these reconstructions are complete, the buffer can be marked as empty and the insertion performed. There are three layout policies supported by our framework, \begin{itemize} \item \textbf{Bentley-Saxe Method (BSM).} \\ Our framework supports the Bentley-Saxe method directly, which we used as a baseline for comparison in some benchmarking tests. This configuration requires that $N_b = 1$ and $s = 2$ to match the standard BSM exactly (a version of this approach that relaxes these restrictions is considered in the next chapter). Reconstructions are performed by finding the first empty level, $i$, (or adding one to the bottom if needed) and then constructing a new shard at that level including all of the records from all of the shards at levels $j \leq i$, as well as the newly created buffer shard. Then all levels $j < i$ are set to empty. Our implementation of BSM does not include any of the re-partitioning routines for bounding deviations in record counts from the exact binary decomposition in the face of deleted records. \item \textbf{Leveling.}\\ Our leveling policy is identical to the one discussed in Chapter~\ref{chap:sampling}. The capacity of level $i$ is $N_b \cdot s^{i+1}$ records. The first level ($i$) with available capacity to hold all the records from the level above it ($i-1$, or the buffer if $i = 0$) is found.
Then, for each level $j < i$, in decreasing order of $j$ (starting with $j = i-1$), the records in level $j$ are merged with the records in level $j+1$ and the resulting shard placed in level $j+1$. This procedure guarantees that level $0$ will have capacity for the shard from the buffer, which is then merged into it (if it is not empty) or replaces it (if the level is empty). \item \textbf{Tiering.}\\ Our tiering policy, again, is identical to the one discussed in Chapter~\ref{chap:sampling}. The capacity of each level is $s$ shards, each having $N_b \cdot s^i$ records at most. The first level ($i$) having fewer than $s$ shards is identified. Then, for each level $j < i$, in decreasing order of $j$, all of the shards in level $j$ are merged into a single shard, which is placed in level $j+1$. This leaves level $0$ empty, and the shard built from the buffer is placed there. \end{itemize} \subsubsection{Deletes} \label{sssec:dyn-deletes} As discussed above, the framework supports deletes using either tombstones or tagging. Tagged deletes require no special handling during reconstruction beyond optionally dropping the tagged records, but tombstone deletes require that a tombstone be \emph{canceled} against the record it deletes when both participate in the same reconstruction, and only against that record. Consider a record $r_i$ and a matching tombstone $r_j$ that coexist in a shard being constructed, where the subscripts indicate the order in which they were inserted. The tombstone should not cancel the record if $i > j$. But, if $i < j$, then a cancellation should occur. The case where the record is newer than the tombstone it coexists with covers the situation where a record is deleted, and then inserted again after the delete. In this case, there does exist a record $r_k$ with $k < j$ that the tombstone should cancel with, but that record may exist in a different shard. So the tombstone will \emph{eventually} cancel, but it would be incorrect to cancel it with the matching record $r_i$ that it coexists with in the shard being considered. This means that correct tombstone cancellation requires that the order in which records were inserted be known and accounted for during shard construction. To enable this, our framework implements two important features, \begin{enumerate} \item All records in the buffer contain a timestamp in their header, indicating insertion order. This can be cleared or discarded once the buffer shard has been constructed. \item All shards passed into the shard constructor are provided in chronological order. The first shard in the vector will be the oldest, and so on, with the final shard being the newest. \end{enumerate} The user can make use of these properties however they like during shard construction. The specific approach that we use in our shard implementations is to ensure that records are sorted by value, such that equal records are adjacent, and then by age, such that the newest record appears first, and the oldest last. By enforcing this order, a tombstone at index $i$ will cancel with a record if and only if that record is at index $i+1$. For structures that are constructed by a sorted-merge of data, this allows tombstone cancellation at no extra cost during the merge operation. Otherwise, it requires an extra linear pass after sorting to remove canceled records.\footnote{ For this reason, we use tagging-based deletes for structures which don't require sorting by value during construction. } \Paragraph{Erase Return Codes.} As noted in Section~\ref{sec:dyn-framework}, the external \texttt{erase} function can return $0$ on failure. The specific meaning of this failure, however, is a function of the delete policy being used. For tombstone deletes, a failure to delete means a failure to insert, and the request should be retried after a brief delay. Note that, for performance reasons, the framework makes no effort to ensure that the record being erased using tombstones is \emph{actually} present in the structure, so it is possible to insert a tombstone that can never be canceled. This won't affect correctness in any way, so long as queries are correctly implemented, but it will increase the size of the structure slightly. For tagging deletes, a failure to delete means that the record to be removed could not be located to tag it. Such failures should \emph{not} be retried immediately, as the situation will not automatically resolve itself before new records are inserted.
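Returning to the tombstone cancellation procedure described above: given the sort order we use (equal records adjacent, newest first), cancellation can be folded into a single linear pass over a sorted run of wrapped records. The following C++ sketch illustrates the idea, assuming a wrapped record type along the lines of Listing~\ref{lst:record}; it is illustrative rather than the framework's exact code.

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <vector>

// Sketch of tombstone cancellation as a linear pass over a sorted run in
// which equal records are adjacent and newer entries precede older ones.
// WREC is assumed to behave like the wrapped record type described above:
// it exposes is_tombstone() and a payload (rec) that compares equal to the
// record it deletes.
template <typename WREC>
std::vector<WREC> cancel_tombstones(const std::vector<WREC> &sorted) {
    std::vector<WREC> out;
    out.reserve(sorted.size());

    for (size_t i = 0; i < sorted.size(); i++) {
        // A tombstone at index i cancels a matching record at index i + 1.
        if (sorted[i].is_tombstone() && i + 1 < sorted.size() &&
            !sorted[i + 1].is_tombstone() &&
            sorted[i].rec == sorted[i + 1].rec) {
            i++;  // drop both the tombstone and the record it deletes
            continue;
        }
        out.push_back(sorted[i]);
    }

    return out;
}
\end{lstlisting}

Note that a tombstone whose matching record lives in a different shard survives the pass unchanged, so it can cancel in a later reconstruction, as described above.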
\Paragraph{Tombstone Asymptotic Complexity.} Tombstone deletes reduce to inserts, and so they have the same asymptotic properties as inserts. Namely, \begin{align*} \mathscr{D}(n) &\in \Theta(B(n)) \\ \mathscr{D}_a(n) &\in \Theta\left( \frac{B(n)}{n} \cdot \log_s n\right) \end{align*} \Paragraph{Tagging Asymptotic Complexity.} Tagging deletes must perform a linear scan of the buffer, and a point lookup of every shard. If $L(n)$ is the worst-case cost of the shard's implementation of \texttt{point\_lookup}, then the worst-case cost of a delete under tagging is, \begin{equation*} \mathscr{D}(n) \in \Theta \left( N_b + L(n) \cdot \log_s n\right) \end{equation*} The \texttt{point\_lookup} interface requires an optional boolean argument that is set to true when the function is called as part of a delete process by the framework. This is to enable the use of Bloom filters, or other similar structures, to accelerate these operations if desired. \subsubsection{Queries} The framework processes queries using a direct implementation of the approach discussed in Section~\ref{ssec:dyn-idsp}, with modifications to account for the buffer. The buffer itself is treated in the procedure like any other shard, except with its own specialized query and preprocessing functions. The algorithm itself is shown in Algorithm~\ref{alg:dyn-query}. In order to appropriately account for deletes during result set combination, the query interfaces make similar ordering guarantees to the shard construction interface. Records from the buffer will have their insertion timestamp available, and shards, local queries, and local results are always passed in descending order of age. This is to allow tombstones to be accounted for during the query process using the same mechanisms described in Section~\ref{sssec:dyn-deletes}. \begin{algorithm}[t] \caption{Query with Dynamization Framework} \label{alg:dyn-query} \KwIn{$q$: query parameters, $b$: mutable buffer, $S$: static shards at all levels} \KwOut{$R$: query results} $\mathscr{S}_b \gets \texttt{local\_preproc}_{buffer}(b, q);\ \ \mathscr{S} \gets \{\}$ \; \For{$s \in S$}{$\mathscr{S} \gets \mathscr{S}\ \cup (s, \texttt{local\_preproc}(s, q))$\;} $(q_b, q_1, \ldots q_m) \gets \texttt{distribute\_query}(\mathscr{S}_b, \mathscr{S}, q)$ \; $\mathcal{R} \gets \{\}; \ \ \texttt{rpt} \gets \bot$ \; \Do{\texttt{rpt}}{ $locR \gets \{\}$ \; $locR \gets locR \cup \texttt{local\_query}_{buffer}(b, q_b)$ \; \For{$s \in S$}{$locR \gets locR \cup \texttt{local\_query}(s, q_s)$} $\mathcal{R} \gets \mathcal{R} \cup \texttt{combine}(locR, q)$\; $(\texttt{rpt}, q_b, q_1, \ldots, q_m) \gets \texttt{repeat}(q, \mathcal{R}, q_b, q_1,\ldots, q_m)$\; } \Return{$\mathcal{R}$} \end{algorithm} \Paragraph{Asymptotic Complexity.} The worst-case query cost of the framework follows the same basic cost function as discussed for IDSPs in Section~\ref{ssec:dyn-idsp}, with slight modifications to account for the different costs of buffer querying and preprocessing. The cost is, \begin{equation*} \mathscr{Q}(n) \in O \left(P_B(N_B) + \log_s n \cdot P(n) + D(n) + R(n)\left( Q_B(N_B) + \log_s n \cdot Q_s(n) + C_e(n)\right)\right) \end{equation*} where $P_B(N_B)$ is the cost of pre-processing the buffer, and $Q_B(N_B)$ is the cost of querying it.
As $N_B$ is a small constant relative to $n$, in some cases these terms can be omitted, but they are left here for generality. Also note that this is an upper bound, but isn't necessarily tight. As we saw with IRS in Section~\ref{ssec:edsp}, it is sometimes possible to leverage problem-specific details within this interface to get better asymptotic performance. \subsection{Concurrency Control} \label{ssec:dyn-concurrency} \section{Evaluation} Having described the framework in detail, we'll now turn to demonstrating its performance for a variety of search problems and associated data structures. We've predominantly selected problems for which an existing dynamic data structure also exists, to demonstrate that the performance of our dynamization techniques can match or exceed hand-built dynamic solutions to these problems. Specifically, we will consider IRS using ISAM tree, range scans using learned indices, high-dimensional $k$-NN using VPTree, and exact string matching using succinct tries. \subsection{Experimental Setup} All of our testing was performed using Ubuntu 20.04 LTS on a dual socket Intel Xeon Gold 6242 server with 384 GiB of physical memory and 40 physical cores. We ran our benchmarks pinned to a specific core, or to a specific NUMA node for multi-threaded testing. Our code was compiled using GCC version 11.3.0 with the \texttt{-O3} flag, and targeted to C++20.\footnote{ Aside from the ALEX benchmark; ALEX does not build in this configuration, and we used C++14 instead for that particular test. } Our testing methodology involved warming up the data structure by inserting 10\% of the dataset, and then measuring the throughput over the insertion of the rest of the records. During this second phase, a workload mixture of 95\% inserts and 5\% deletes was used for structures that supported deletes. Once the insertion phase was complete, we measured query performance by repeatedly querying the structure with a selection of pre-constructed queries and reporting the average latency. Reported query performance numbers are latencies, and insertion/update numbers are throughputs. For data structure size charts, we report the total size of the data structure and all auxiliary structures, minus the size of the raw data. All tests were run on a single thread without any background operations, unless otherwise specified. We used several datasets for testing the different structures. Specifically, \begin{itemize} \item For range and sampling problems, we used the \texttt{book}, \texttt{fb}, and \texttt{osm} datasets from SOSD~\cite{sosd-datasets}. Each has 200 million 64-bit keys (to which we added 64-bit values) following a variety of distributions. We omitted the \texttt{wiki} dataset because it contains duplicate keys, which were not supported by one of our dynamic baselines. \item For vector problems, we used the Spanish Billion Words (SBW) dataset~\cite{sbw}, containing about 1 million 300-dimensional vectors of doubles, and a sample of 10 million 128-dimensional vectors of unsigned longs from the BigANN dataset~\cite{bigann}. \item For string search, we used the genome of the brown bear (ursarc) broken into 30 million unique 70--80 character chunks~\cite{ursa}, and a list of about 400,000 English words (english)~\cite{english-words}. \end{itemize} \subsection{Design Space Evaluation} \label{ssec:dyn-ds-exp} \begin{figure} %\vspace{0pt} \centering \subfloat[Insertion Throughput \\ vs.
\begin{figure}
\centering
\subfloat[Insertion Throughput \\ vs. Buffer Size]{\includegraphics[width=.4\textwidth]{img/fig-ps-mt-insert} \label{fig:ins-buffer-size}}
\subfloat[Insertion Throughput \\ vs. Scale Factor]{\includegraphics[width=.4\textwidth]{img/fig-ps-sf-insert} \label{fig:ins-scale-factor}} \\
\subfloat[Query Latency vs. Buffer Size]{\includegraphics[width=.4\textwidth]{img/fig-ps-mt-query} \label{fig:q-buffer-size}}
\subfloat[Query Latency vs. Scale Factor]{\includegraphics[width=.4\textwidth]{img/fig-ps-sf-query} \label{fig:q-scale-factor}}
\caption{Design Space Evaluation (Triespline)}
\end{figure}
For our first set of experiments, we evaluated a dynamized version of the Triespline learned index~\cite{plex} for answering range count queries.\footnote{We tested range scans throughout this chapter by measuring the performance of a range count. We decided to go this route to ensure that the results across our baselines were comparable: different range structures provided different interfaces for accessing the result sets, some of which required making an extra copy and others of which didn't. Using a range count allowed us to measure only index traversal time, without needing to control for this difference in interface.} We tested different configurations of our framework to examine the effects that our configuration parameters have on query and insertion performance. We ran these tests using the SOSD \texttt{osm} dataset.
First, we'll consider the effect of buffer size on performance in Figures~\ref{fig:ins-buffer-size} and \ref{fig:q-buffer-size}. For all of these tests, we used a fixed scale factor of $8$ and the tombstone delete policy. Each plot shows the performance of our three supported layout policies (note that BSM uses a fixed $N_B=1$ and $s=2$ for all tests, to accurately reflect the performance of the classical Bentley-Saxe method). We first note that the insertion throughput appears to increase roughly linearly with the buffer size, regardless of layout policy (Figure~\ref{fig:ins-buffer-size}), whereas the query latency remains relatively flat up to $N_B=12000$, at which point it begins to increase for both leveling and tiering. It's worth noting that this is the point at which the buffer takes up roughly half of the L1 cache on our test machine.
It's interesting to compare these results with those in Figures~\ref{fig:insert_mt} and \ref{fig:sample_mt} in the previous chapter. Both of them show roughly similar insertion performance (though this is masked slightly by the log scaling of the y-axis and the larger range of x-values in Figure~\ref{fig:insert_mt}), but there's a clear difference in query performance. For the sampling structure in Figure~\ref{fig:sample_mt}, the query latency was largely independent of buffer size. In our sampling framework, we used rejection sampling on the buffer, and so it introduced only constant overhead. For range scans, though, we need to do a full linear scan of the buffer. Increasing the buffer slightly reduces the number of shards to be queried, and this effect appears to be enough to counterbalance the increasing scan cost up to a point, but there's clearly a cut-off at which larger buffers cease to make sense. We'll examine this situation in more detail in the next chapter.
Next, we consider the effect that scale factor has on performance. Figure~\ref{fig:ins-scale-factor} shows the change in insertion performance as the scale factor is increased. The pattern here is the same as we saw in the previous chapter, in Figure~\ref{fig:insert_sf}.
When leveling is used, enlarging the scale factor hurts insertion performance; when tiering is used, it improves performance. This is because a larger scale factor under tiering results in more, smaller structures, and thus reduced reconstruction time, whereas under leveling it increases write amplification. Figure~\ref{fig:q-scale-factor} shows that, as with Figure~\ref{fig:sample_sf} in the previous chapter, query latency is not strongly affected by the scale factor, though larger scale factors do tend to have a negative effect under tiering (due to there being more structures).
As a final note, these results demonstrate that, compared to the normal Bentley-Saxe method, our proposed design space is a strict improvement. There are points within the space that are equivalent to, or even strictly superior to, BSM in terms of both query and insertion performance. Beyond this, there are also clearly available trade-offs between insertion and query performance, particularly when it comes to selecting the layout policy.
\begin{figure*}
\centering
\subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
\subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
\subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-space} \label{fig:irs-space}}
\caption{IRS Index Evaluation}
\label{fig:irs}
\end{figure*}
\subsection{Independent Range Sampling}
Next, we'll consider the independent range sampling problem using the ISAM tree. The functioning of this structure for answering IRS queries is discussed in more detail in Section~\ref{ssec:irs-struct}, and we use the query algorithm described in Algorithm~\ref{alg:decomp-irs}. We use the tagging mechanism to support deletes, and enable proactive compaction to ensure that rejection rates are bounded.
For our query class, we obtain the upper and lower bounds of the query range, and the weight of that range, using tree traversals in \texttt{local\_preproc}. We use rejection sampling on the buffer, and so the buffer preprocessing simply uses the number of records in the buffer as its weight. In \texttt{distribute\_query}, we build an alias structure over all of the weights and query it $k$ times to obtain the individual $k$ values for the local queries. To avoid extra work on repeat, we stash this alias structure in the buffer's local query object so it is available for re-use. \texttt{local\_query} simply generates the appropriate number of random indices within the query interval. For each of these, the record is checked to see whether it has been tagged as deleted, and added to the result set if it hasn't. No retries occur in the case of deleted records. \texttt{combine} simply merges all the result sets together, and \texttt{repeat} checks whether the total result set size matches the requested $k$. If it does not, then \texttt{repeat} updates $k$ to be the number of missing records and calls \texttt{distribute\_query} again, before returning true so that another round of local queries is issued.
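As a rough illustration of how \texttt{distribute\_query} divides the sample budget, the sketch below assigns per-source sample counts in proportion to the preprocessed weights. It substitutes \texttt{std::discrete\_distribution} for the alias structure described above, and the function name and weight layout are assumptions made for the example rather than the framework's actual implementation.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// weights[0] is the buffer's weight (its record count); weights[i > 0]
// is the in-range record count reported by shard i's preprocessing.
std::vector<std::size_t>
split_sample_budget(const std::vector<std::size_t> &weights,
                    std::size_t k, std::mt19937 &rng) {
    std::vector<std::size_t> per_source(weights.size(), 0);
    std::discrete_distribution<std::size_t> pick(weights.begin(),
                                                 weights.end());

    // Each draw assigns one of the k samples to a source, chosen with
    // probability proportional to its weight.
    for (std::size_t i = 0; i < k; i++) {
        per_source[pick(rng)]++;
    }
    return per_source;
}
\end{verbatim}
Each local query then draws its assigned number of samples uniformly from its portion of the query range, skipping records tagged as deleted, and \texttt{repeat} tops up any shortfall on a subsequent pass.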
The query algorithm and data structure just described result in a dynamized index with the following performance characteristics,
\begin{align*}
\text{Insert:} \quad &\Theta\left(\log_s n\right) \\
\text{Query:} \quad &\Theta\left(\log_s n \log_f n + \frac{k}{1 - \delta}\right) \\
\text{Delete:} \quad &\Theta\left(\log_s n \log_f n\right)
\end{align*}
where $f$ is the fanout of the ISAM tree and $\delta$ is the maximum proportion of deleted records that can exist on a level before a proactive compaction is triggered.
We configured our dynamized structure to use $s=8$, $N_B=12000$, $\delta = 0.05$, $f = 16$, and the tiering layout policy. We compared our method (\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+Tree with aggregate weight counts (\textbf{AGG B+Tree}), as well as to our bespoke sampling solution from the previous chapter (\textbf{Bespoke}) and a single static instance of the ISAM tree (\textbf{ISAM}). Because IRS is neither INV nor DDSP, the standard Bentley-Saxe method has no way to support deletes for it, and so it was not tested. All of our tested sampling queries had a controlled selectivity of $\sigma = 0.01\%$ and $k=1000$.
The results of our performance benchmarking are in Figure~\ref{fig:irs}. Figure~\ref{fig:irs-insert} shows that our general framework has insertion performance comparable to the specialized one, though it loses slightly. This is to be expected, as \textbf{Bespoke} was hand-written specifically for this type of query and data structure, and has hard-coded data types, among other things. Despite losing to \textbf{Bespoke} slightly, \textbf{DE-IRS} still manages to defeat the dynamic baseline in all cases. Figure~\ref{fig:irs-query} shows the average query latencies of the three dynamic solutions, as well as a lower bound provided by querying a single instance of ISAM statically built over all of the records. This shows that our generalized solution actually manages to defeat \textbf{Bespoke} in query latency, coming in a bit closer to the static structure. Both \textbf{DE-IRS} and \textbf{Bespoke} manage to defeat the dynamic baseline. Finally, Figure~\ref{fig:irs-space} shows the space usage of the data structures, less the storage required for the raw data. The two dynamized solutions require \emph{significantly} less storage than the dynamic B+Tree, which must leave empty space in its nodes for inserts. This is a significant advantage of static data structures: they can pack data much more tightly and require less storage. Dynamization, at least in this case, doesn't add a significant amount of overhead over a single instance of the static structure.
\subsection{$k$-NN Search}
\label{ssec:dyn-knn-exp}
Next, we'll consider answering high-dimensional exact $k$-NN queries using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a binary search tree with internal nodes that partition records based on their distance to a selected point, called the vantage point. All of the points within a fixed distance of the vantage point are covered by one sub-tree, and the points outside of this distance are covered by the other. This results in a hard-to-update data structure that can be constructed in $\Theta(n \log n)$ time using repeated application of the \texttt{quickselect} algorithm~\cite{quickselect} to partition the points for each node. The structure can answer $k$-NN queries in $\Theta(k \log n)$ time.
Our dynamized query procedure is implemented based on Algorithm~\ref{alg:idsp-knn}, though using delete tagging instead of tombstones. VPTree doesn't support efficient point lookups, so to work around this we add a hash map to each shard, mapping each record to its location in storage, so that deletes can be performed efficiently. This also allows us to avoid canceling deleted records in the \texttt{combine} operation, as they can be skipped over directly during \texttt{local\_query}. Because $k$-NN doesn't have any of the distributional requirements of IRS, these local queries can return $k$ records even in the presence of deletes, by simply returning the next-closest record instead, so long as there are at least $k$ undeleted records in the shard. Thus, \texttt{repeat} isn't necessary. This algorithm and data structure result in a dynamization with the following performance characteristics,
\begin{align*}
\text{Insert:} \quad &\Theta\left(\log_s n\right) \\
\text{Query:} \quad &\Theta\left(N_B + \log n \log_s n\right) \\
\text{Delete:} \quad &\Theta\left(\log_s n \right)
\end{align*}
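The following sketch illustrates the role these per-shard hash maps play in delete tagging. The record layout, the use of a 64-bit record identifier as the map key, and the helper names are illustrative assumptions made for the example, not the framework's actual definitions.
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch of delete tagging with a per-shard hash map.
struct Record {
    std::vector<double> point;   // the indexed vector
    std::uint32_t header = 0;    // bit 0 used here as the "deleted" tag
};

struct Shard {
    std::vector<Record> storage;                         // build order
    std::unordered_map<std::uint64_t, std::size_t> idx;  // id -> offset

    // Returns true if the record was found and tagged in this shard.
    bool tag_delete(std::uint64_t record_id) {
        auto it = idx.find(record_id);
        if (it == idx.end()) return false;
        storage[it->second].header |= 1u;  // O(1) once the shard is found
        return true;
    }
};

// The framework checks the buffer and then each shard from newest to
// oldest; with O(log n) shards, locating the right shard costs O(log n)
// hash lookups, matching the delete cost given above.
bool tagged_delete(std::vector<Shard> &shards, std::uint64_t record_id) {
    for (auto &s : shards) {
        if (s.tag_delete(record_id)) return true;
    }
    return false;
}
\end{verbatim}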
For testing, we considered a dynamized VPTree using $N_B = 1400$, $s = 8$, the tiering layout policy, and tagged deletes. Because $k$-NN is a standard DDSP, we compare against a Bentley-Saxe dynamization of the VPTree (\textbf{BSM-VPTree})\footnote{There is one deviation from pure BSM in our implementation: we use the same delete tagging scheme as the rest of our framework, meaning that the hash tables for record lookup are embedded alongside each block, rather than there being a single global table. This means that the lookup of the shard containing the record to be deleted runs in $\Theta(\log_2 n)$ time, rather than $\Theta(1)$ time. However, once the block has been identified, our approach allows the record to be deleted in $\Theta(1)$ time, rather than requiring an inefficient point lookup directly on the VPTree.} and a dynamic data structure for the same search problem called the M-Tree~\cite{mtree,mtree-impl} (\textbf{MTree}), which is an example of a so-called ``ball tree'' structure that partitions high-dimensional space using nodes representing spheres, which are merged and split to maintain balance in a manner not unlike a B+Tree. We also consider a static instance of a VPTree built over the same set of records (\textbf{VPTree}). We used the L2 distance as our metric, which is defined for vectors of $d$ dimensions as
\begin{equation*}
\text{dist}(r, s) = \sqrt{\sum_{i=0}^{d-1} \left(r_i - s_i\right)^2}
\end{equation*}
and ran the queries with $k=1000$ relative to a randomly selected point in the dataset.
\begin{figure*}
\subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-insert} \label{fig:knn-insert}}
\subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-query} \label{fig:knn-query}}
\subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-space} \label{fig:knn-space}}
\caption{k-NN Index Evaluation}
\label{fig:knn-eval}
\end{figure*}
The results of this benchmarking are reported in Figure~\ref{fig:knn-eval}. Figure~\ref{fig:knn-query} shows that the VPTree, whether dynamized or not, \emph{vastly} out-performs the dynamic baseline in query performance; note that the y-axis of this figure is log-scaled. Interestingly, query performance is not severely degraded relative to the static baseline regardless of the dynamization scheme used, with \textbf{BSM-VPTree} performing slightly \emph{better} than our framework.
The reason for this gap is shown in Figure~\ref{fig:knn-insert}: our framework outperforms the Bentley-Saxe method in insertion performance. These results are attributable to our selection of framework configuration parameters, which are biased towards better insertion performance. Both dynamized structures also outperform the dynamic baseline. Finally, as is becoming a trend, Figure~\ref{fig:knn-space} shows that the storage requirements of the static data structures, dynamized or not, are significantly lower than those of M-Tree. M-Tree, like a B+Tree, requires leaving empty slots in its nodes to support insertion, and this results in a large amount of wasted space.
As a final note, metric indexing is an area where dynamized static structures have already been shown to work well, and our results here are in line with those of Naidan and Hetland, who applied BSM directly to metric data structures, including the VPTree, and showed similar performance advantages~\cite{naidan14}.
\subsection{Range Scan}
Next, we will consider applying our dynamization framework to learned indices for single-dimensional range scans. A learned index is a sorted data structure which attempts to index data by directly modeling a function mapping a key to its offset within a storage array. The result of a lookup against the index is an estimated location, along with a strict error bound, within which the record is guaranteed to be located. We apply our framework to create dynamized versions of two static learned indices, Triespline~\cite{plex} (\textbf{DE-TS}) and PGM~\cite{pgm} (\textbf{DE-PGM}), and compare with a standard Bentley-Saxe dynamization of Triespline (\textbf{BSM-TS}). Our dynamic baselines are ALEX~\cite{alex}, which is a dynamic learned index based on a B+Tree-like structure, and PGM (\textbf{PGM}), which itself provides support for a dynamic version based on Bentley-Saxe dynamization (which is why we have not included a BSM version of PGM in our testing). For our dynamized versions of Triespline and PGM, we configure the framework with $N_B = 12000$, $s=8$, and the tiering layout policy.
We consider range count queries, which traverse the range and return the number of records within it rather than returning the records themselves, to overcome differences in the query interfaces of our baselines, some of which make extra copies of the records. We consider traversing the range and counting to be a fairer comparison. Range counts are true invertible search problems, and so we use tombstone deletes. The query process itself performs no preprocessing. Local queries use the index to identify the first record in the query range and then traverse the range, counting the number of records and tombstones encountered. These counts are then combined by adding up the total record count from all shards, subtracting the total tombstone count, and returning the final count, as sketched below. No repeats are necessary. The buffer query simply scans the unsorted array and performs the same counting.
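A minimal sketch of this combination step follows; the \texttt{LocalCount} type and function name are illustrative placeholders rather than the framework's actual types.
\begin{verbatim}
#include <cstdint>
#include <vector>

// Illustrative local result for a range count: each local query reports
// how many live records and how many tombstones it saw in the range.
struct LocalCount {
    std::int64_t records = 0;
    std::int64_t tombstones = 0;
};

// Combine the per-shard (and buffer) counts into the final range count.
// Every tombstone cancels exactly one matching record elsewhere in the
// structure, which is what makes range counts invertible.
std::int64_t combine_range_counts(const std::vector<LocalCount> &locals) {
    std::int64_t total = 0;
    for (const auto &lc : locals) {
        total += lc.records - lc.tombstones;
    }
    return total;
}
\end{verbatim}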
We examine range count queries with a fixed selectivity of $\sigma = 0.1\%$.
\begin{figure*}
\centering
\subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
\subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-rq-query} \label{fig:rq-query}}
\subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-rq-space} \label{fig:rq-space}}
\caption{Learned Index Evaluation}
\label{fig:eval-learned-index}
\end{figure*}
The results of our evaluation are shown in Figure~\ref{fig:eval-learned-index}. Figure~\ref{fig:rq-insert} shows the insertion performance. DE-TS is the best in all cases, and the pure BSM version of Triespline is the worst by a substantial margin. Of particular interest in this chart is the inconsistent performance of ALEX, which does quite well on the \texttt{books} dataset and poorly on the others. It is worth noting that getting ALEX to run \emph{at all} in some cases required a great deal of trial and error and tuning, as its performance is highly distribution-dependent. Our dynamized version of PGM consistently out-performed the built-in dynamic support of the same structure. One shouldn't read \emph{too} much into this result, as PGM itself supports some performance tuning and can be adjusted to balance insertion against query performance. We ran it with the authors' suggested default values, but in principle it could be tuned to match our framework's performance here. The important take-away from this test is that our generalized framework can easily trade blows with a custom, integrated solution.
The query performance results in Figure~\ref{fig:rq-query} are a bit less interesting. All solutions perform similarly, with ALEX again showing itself to be fairly distribution-dependent, performing the best of all of the structures on the \texttt{books} dataset by a reasonable margin but falling in line with the others on the remaining datasets. The standout result here is the dynamic PGM, which performs horrendously compared to all of the other structures. The same caveat from the previous paragraph applies here: PGM can be configured for better performance. But it's notable that our framework-dynamized PGM is able to beat PGM slightly in insertion performance without seeing the massive degradation in query performance that PGM's native update support suffers in its update-optimized configuration.\footnote{It's also worth noting that PGM implements tombstone deletes by inserting a record whose key matches the record to be deleted and whose value is a reserved ``tombstone'' value, rather than using a header. This means that it cannot support duplicate keys when deletes are used, unlike our approach. It also means that its records are smaller, which should improve query performance, but we're able to beat it even including the header. PGM is the reason we excluded the \texttt{wiki} dataset from SOSD, as it has duplicate key values.}
Finally, Figure~\ref{fig:rq-space} shows the storage requirements for these data structures. All of the dynamic options require significantly more space than the static Triespline, but ALEX requires the most by a very large margin. This is in keeping with the previous experiments, which all included similar B+Tree-like structures that required significant additional storage space, compared to static structures, as part of their update support.
\subsection{String Search}
As a final example of a search problem, we consider exact string matching using the fast succinct trie~\cite{zhang18}. While dynamic tries aren't terribly unusual~\cite{m-bonsai,dynamic-trie}, succinct data structures, which attempt to approach an information-theoretic lower bound on the size of their binary representation of the data, are usually static, because implementing updates while maintaining these compact representations is difficult~\cite{dynamic-trie}. There are specialized approaches for dynamizing such structures~\cite{dynamize-succinct}, but in this section we consider the effectiveness of our generalized framework for them.
\begin{figure*}
\centering
\subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-insert} \label{fig:fst-insert}}
\subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-query} \label{fig:fst-query}}
\subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-space}}
\caption{FST Evaluation}
\label{fig:fst-eval}
\end{figure*}
Our shard type is a direct wrapper around an implementation of the fast succinct trie~\cite{fst-impl}. We store the strings in off-record storage, and the record type itself contains a pointer to the string in storage. Queries use no pre-processing, and the local queries directly search for a matching string. We use the framework's early-abort feature to stop as soon as the first result is found, and \texttt{combine} simply checks whether this record is a tombstone (a sketch of this check is given at the end of this subsection). If it is a tombstone, then the lookup is considered to have not found the search string; otherwise, the record is returned. This results in a dynamized structure with the following asymptotic costs,
\begin{align*}
\text{Insert:} \quad &\Theta\left(\log_s n\right) \\
\text{Query:} \quad &\Theta\left(N_B + \log n \log_s n\right) \\
\text{Delete:} \quad &\Theta\left(\log_s n \right)
\end{align*}
We compare our dynamized succinct trie (\textbf{DE-FST}), configured with $N_B = 1200$, $s = 8$, the tiering layout policy, and tombstone deletes, with a standard Bentley-Saxe dynamization (\textbf{BSM-FST}), as well as a single static instance of the structure (\textbf{FST}). The results are shown in Figure~\ref{fig:fst-eval}. As with range scans, the Bentley-Saxe method shows very poor insertion performance relative to our framework in Figure~\ref{fig:fst-insert}. Note that the significant difference in update throughput observed between the two datasets is largely attributable to their relative sizes: the \texttt{ursarc} set is far larger than \texttt{english}. Figure~\ref{fig:fst-query} shows that our write-optimized framework configuration is slightly out-performed in query latency by the standard Bentley-Saxe dynamization, and that both dynamized structures are quite a bit slower than the static structure for queries. Finally, the storage costs for the data structures are shown in Figure~\ref{fig:fst-space}. For the \texttt{english} dataset, the extra storage cost from decomposing the structure is quite significant, but for the \texttt{ursarc} set the sizes are quite comparable. It is not unexpected that dynamization would add storage cost for succinct (or any compressed) data structures, because splitting the records across multiple data structures reduces the ability of each structure to compress redundant data.
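As referenced above, the combine step for this search problem amounts to a single tombstone check. The sketch below illustrates it; the \texttt{Match} type and the function name are illustrative placeholders, and strings are held directly rather than by pointer for simplicity.
\begin{verbatim}
#include <optional>
#include <string>
#include <vector>

// Illustrative local result: at most one match per block, newest first.
struct Match {
    std::string key;
    bool is_tombstone = false;
};

// With early abort enabled there will be at most one entry in practice,
// but the newest-to-oldest ordering makes the general case simple: the
// first match decides the outcome, and a tombstone means the string was
// deleted after its most recent insertion.
std::optional<std::string> combine_string_lookup(
        const std::vector<Match> &locals) {
    for (const auto &m : locals) {
        if (m.is_tombstone) {
            return std::nullopt;  // deleted: report "not found"
        }
        return m.key;             // newest live match wins
    }
    return std::nullopt;          // no block contained the string
}
\end{verbatim}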
\subsection{Concurrency}
We also tested the preliminary concurrency support described in Section~\ref{ssec:dyn-concurrency}, using IRS as our test case, with our dynamization configured with $N_B = 1200$, $s=8$, and the tiering layout policy. Note that IRS only supports tagging, as it isn't invertible even under the IDSP model, while our current concurrency implementation only supports deletes with tombstones, so we eschewed deletes entirely for this test. In this benchmark, we used a single thread to insert records into the structure at a constant rate, while we deployed a variable number of additional threads that continuously issued sampling queries against the structure. We used an AGG B+Tree as our baseline. Note that, to accurately maintain the aggregate weight counts as records are inserted, it is necessary for each operation to obtain a lock on the root node of the tree~\cite{zhao22}. This makes this situation a good use case for the automatic concurrency support provided by our framework.
Figure~\ref{fig:irs-concurrency} shows the results of this benchmark for various numbers of concurrent query threads. As can be seen, our framework sustains a stable update throughput with up to 32 query threads, whereas the AGG B+Tree suffers from contention on the mutex and sees its performance degrade as the number of threads increases.
\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{img/fig-bs-irs-concurrency}
\caption{IRS Thread Scaling}
\label{fig:irs-concurrency}
\end{figure}
\section{Conclusion}
In this chapter, we sought to develop a set of tools for generalizing some of the results from our study of sampling data structures in Chapter~\ref{chap:sampling} to a broader set of data structures. This resulted in the development of two new classes of search problem: extended decomposable search problems and iterative deletion decomposable search problems. The former class allows a pre-processing step to be used to generate individualized local queries for each block in a decomposed structure, and the latter allows the query process to be repeated as necessary, with possible modifications to the local queries each time, to build up the result set iteratively. We then implemented a C++ framework for automatically dynamizing static data structures for search problems falling into either of these classes, which includes an LSM-tree-inspired design space and support for concurrency. We used this framework to produce dynamized structures for a wide variety of search problems, and compared the results to existing dynamic baselines, as well as to the original Bentley-Saxe method where applicable. The results show that our framework is capable of creating dynamic structures that are competitive with, or superior to, custom-built dynamic structures, and that it has clear performance advantages over the classical Bentley-Saxe method.