From bf99837f39a61f6cce88e24431e08347db66270e Mon Sep 17 00:00:00 2001 From: Douglas Rumbaugh Date: Sun, 18 May 2025 18:57:36 -0400 Subject: updates --- chapters/beyond-dsp.tex | 534 +++++++++++++++++++++++++++++++++++++---- img/fig-bs-fst-insert.pdf | Bin 17043 -> 17043 bytes img/fig-bs-fst-query.pdf | Bin 22003 -> 22003 bytes img/fig-bs-fst-space.pdf | Bin 18929 -> 18929 bytes img/fig-bs-irs-concurrency.pdf | Bin 16484 -> 16484 bytes img/fig-bs-irs-insert.pdf | Bin 19021 -> 22490 bytes img/fig-bs-irs-query.pdf | Bin 24732 -> 28092 bytes img/fig-bs-irs-space.pdf | Bin 21953 -> 24526 bytes img/fig-bs-knn-insert.pdf | Bin 19914 -> 19914 bytes img/fig-bs-knn-query.pdf | Bin 20360 -> 20360 bytes img/fig-bs-knn-space.pdf | Bin 21435 -> 21435 bytes img/fig-bs-knn.pdf | Bin 25275 -> 25275 bytes img/fig-bs-rq-insert.pdf | Bin 25557 -> 25557 bytes img/fig-bs-rq-query.pdf | Bin 34240 -> 34240 bytes img/fig-bs-rq-space.pdf | Bin 53432 -> 53432 bytes img/fig-ps-mt-insert.pdf | Bin 16578 -> 16578 bytes img/fig-ps-mt-query.pdf | Bin 16691 -> 16691 bytes img/fig-ps-mt-space.pdf | Bin 16614 -> 16249 bytes img/fig-ps-sf-insert.pdf | Bin 13306 -> 13306 bytes img/fig-ps-sf-query.pdf | Bin 16064 -> 16064 bytes img/fig-ps-sf-space.pdf | Bin 12953 -> 12817 bytes references/references.bib | 32 +++ 22 files changed, 521 insertions(+), 45 deletions(-) diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex index e76a638..fd4537c 100644 --- a/chapters/beyond-dsp.tex +++ b/chapters/beyond-dsp.tex @@ -52,11 +52,11 @@ that can be supported by our dynamization technique. 
\subsection{Extended Decomposability} - +\label{ssec:edsp} As discussed in Chapter~\ref{chap:background}, the standard query model used by dynamization techniques requires that a given query be broadcast, unaltered, to each block within the dynamized structure, and then that -the results from these identical local queries be efficiently mergable +the results from these identical local queries be efficiently mergeable to obtain the final answer to the query. This model limits dynamization to decomposable search problems (Definition~\ref{def:dsp}). @@ -65,7 +65,7 @@ examples of non-decomposable search problems, and devised a technique for correctly answering queries of that type over a dynamized structure. In this section, we'll retread our steps with an eye towards a general solution that could be applicable in other contexts. For convenience, -we'll focus exlusively on independent range sampling. As a reminder, this +we'll focus exclusively on independent range sampling. As a reminder, this search problem is defined as, \begin{definitionIRS}[Independent Range Sampling~\cite{tao22}] We formalize this as a search problem $F_\text{IRS}:(\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ where the record domain is $\mathcal{D} = \mathbb{R}$, the query parameters domain consists of ordered triples -containing the lower and upper boudns of the query interval, and the +containing the lower and upper bounds of the query interval, and the number of samples to draw, $\mathcal{Q} = \mathbb{R} \times \mathbb{R} -\times \mathbb{Z}^+$, and the result domain containts subsets of the +\times \mathbb{Z}^+$, and the result domain contains subsets of the real numbers, $\mathcal{R} = \mathcal{PS}(\mathbb{R})$.
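To ground this formalization, a minimal sketch of $F_\text{IRS}$ over a single sorted array follows. This is an illustrative helper only; the function name, the use of \texttt{std::mt19937}, and the array representation are our own assumptions, not part of the framework or any particular library:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Illustrative sketch of F_IRS over a sorted array: binary-search the
// bounds of the query interval, then draw k records uniformly at random
// from within it. All names here are hypothetical.
std::vector<double> irs_sorted_array(const std::vector<double> &data,
                                     double lower, double upper, int k,
                                     std::mt19937 &rng) {
    auto il = std::lower_bound(data.begin(), data.end(), lower);
    auto iu = std::upper_bound(data.begin(), data.end(), upper);

    std::vector<double> sample;
    if (il == iu) return sample;  // no records fall within the interval

    std::uniform_int_distribution<long> idx(0, static_cast<long>(iu - il) - 1);
    for (int i = 0; i < k; i++)
        sample.push_back(*(il + idx(rng)));  // sample with replacement

    return sample;
}
```

Each of the $k$ draws is independent and uniform over the records falling within the interval, which is exactly the property the decomposed variants below must preserve.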
$F_\text{IRS}$ can be solved using a variety of data structures, such as @@ -112,7 +112,7 @@ in Algorithm~\ref{alg:array-irs} and runs in $\mathscr{Q}_\text{irs} $S \gets \{\}$ \; \BlankLine \For {$i=1\ldots k$} { - \Comment{Select a random record within the inteval} + \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_l, i_u)$ \; \Comment{Add it to the sample set} @@ -130,7 +130,7 @@ taken from each block must be appropriately weighted to correspond to the number of records within each block falling into the query range. In the classical model, there isn't a way to do this, and so the only solution is to answer $F_\text{IRS}$ against each block, asking for the full $k$ -samples each time, and then downsampling the results corresponding to +samples each time, and then down-sampling the results corresponding to the relative weight of each block, to obtain a final sample set. Using this idea, we can formulate $F_\text{IRS}$ as a $C(n)$-decomposable @@ -161,7 +161,7 @@ a $k$-decomposable search problem, which runs in $\Theta(\log^2 n + k $S \gets \{\}$ \; \BlankLine \For {$i=1\ldots k$} { - \Comment{Select a random record within the inteval} + \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_l, i_u)$ \; \Comment{Add it to the sample set} @@ -265,7 +265,7 @@ the upper and lower bounds, pre-calculated, $\mathbftt{local\_query}$ can simply generate $k_i$ random integers and return the corresponding records. $\mathbftt{combine}$ simply combines all of the local results and returns the final result set. Algorithm~\ref{alg:edsp-irs} shows -each of these operations in psuedo-code. +each of these operations in pseudo-code. \SetKwFunction{preproc}{local\_preproc} @@ -304,7 +304,7 @@ each of these operations in psuedo-code. 
\Def{\query{$\mathscr{I}_i$, $q_i = (i_{l,i},i_{u,i},k_i)$}}{ \For {$i=1\ldots k_i$} { - \Comment{Select a random record within the inteval} + \Comment{Select a random record within the interval} $i_r \gets \text{randint}(i_{l,i}, i_{u,i})$ \; \Comment{Add it to the sample set} @@ -350,7 +350,7 @@ interface, with the same performance as their specialized implementations. \subsection{Iterative Deletion Decomposability} - +\label{ssec:dyn-idsp} We next turn our attention to support for deletes. Efficient delete support in Bentley-Saxe dynamization is provably impossible~\cite{saxe79}, but, as discussed in Section~\ref{ssec:dyn-deletes}, it is possible @@ -391,7 +391,7 @@ into the query interface itself, allowing retries \emph{before} the result set is returned to the user and the local meta-information objects discarded. This allows us to preserve this pre-processing work, and repeat the local query process as many times as is necessary to achieve our -desired number of records. From this obervation, we propose another new +desired number of records. From this observation, we propose another new class of search problem: \emph{iterative deletion decomposable} (IDSP). The IDSP definition expands eDSP with a fifth operation, @@ -476,7 +476,7 @@ the local queries against the primary structure are merged, prior to removing any deleted records, to ensure correctness. Second, once the ghost structure records have been removed, we may need to go back to the dynamized structure for more records to ensure that we have enough. -Both of these requirements can be accomodated by the IDSP model, and the +Both of these requirements can be accommodated by the IDSP model, and the resulting query algorithm is shown in Algorithm~\ref{alg:idsp-knn}.
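The overall shape of this query procedure can be illustrated with a small self-contained sketch. Here shards are modeled as plain one-dimensional point sets, deleted records as a ghost set, and the saved traversal state is elided (the local query simply re-ranks on each pass); all names are our own for illustration, not the framework's interface:

```cpp
#include <algorithm>
#include <cmath>
#include <set>
#include <vector>

using Shard = std::vector<double>;  // toy shard: a set of 1-d points

// k points of one shard closest to q (stands in for a local k-NN query).
std::vector<double> local_knn(const Shard &s, double q, size_t k) {
    std::vector<double> pts(s);
    std::sort(pts.begin(), pts.end(), [q](double a, double b) {
        return std::fabs(a - q) < std::fabs(b - q);
    });
    if (pts.size() > k) pts.resize(k);
    return pts;
}

// IDSP-style loop: merge local results, cancel ghost (deleted) records,
// and retry with a larger request size until k live records are found.
std::vector<double> knn_query(const std::vector<Shard> &shards,
                              const std::set<double> &ghosts,
                              double q, size_t k) {
    size_t total = 0;
    for (const auto &s : shards) total += s.size();

    size_t ask = k;
    std::vector<double> live;
    while (true) {
        live.clear();
        for (const auto &s : shards)
            for (double p : local_knn(s, q, ask))
                if (!ghosts.count(p))  // cancel deleted records
                    live.push_back(p);

        std::sort(live.begin(), live.end(), [q](double a, double b) {
            return std::fabs(a - q) < std::fabs(b - q);
        });

        if (live.size() >= k || ask >= total) break;
        ask *= 2;  // the "repeat" step: not enough live records yet
    }

    if (live.size() > k) live.resize(k);
    return live;
}
```

Note that the merge happens before ghost cancellation, and the loop returns to the shards for more records when cancellations leave fewer than $k$ live results, matching the two requirements above.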
This algorithm assumes that the data structure in question can save the current traversal state in the meta-information object, and resume a @@ -576,8 +576,8 @@ ones from the classical literature, and present a cohesive taxonomy of the search problems for which our techniques can be used to support dynamization. This taxonomy is shown in the Venn diagrams of Figure~\ref{fig:taxonomy}. Note that, for convenience, the search problem -classications relevant for supporting deletes have been seperated out -into a seperate diagram. In principle, this deletion taxonomy can be +classifications relevant for supporting deletes have been separated out +into a separate diagram. In principle, this deletion taxonomy can be thought of as being nested inside of each of the general search problem classifications, as the two sets of classification are orthogonal. That a search problem falls into a particular classification in the general @@ -675,7 +675,7 @@ based deletes in our general framework for sampling queries.\footnote{ } \section{Dynamization Framework} - +\label{sec:dyn-framework} With the previously discussed new classes of search problems devised, we can now present our generalized framework based upon those models. This framework takes the form of a header-only C++20 library which can @@ -769,7 +769,7 @@ and two of its layout policies is shown in Figure~\ref{fig:dyn-framework}. The framework provides two mechanisms for supporting deletes: tagging and tombstones. These are identical to the mechanisms discussed in Section~\ref{ssec:sampling-deletes}, with tombstone deletes operating by -inserting a record identicle to the one to be deleted into the structure, +inserting a record identical to the one to be deleted into the structure, with an indicator bit set in the header, and tagged deletes performing a lookup of the record to be deleted in the structure and setting a bit in its header directly. 
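The two delete mechanisms can be pictured with a per-record header along the following lines. This is a hypothetical layout for illustration only; the framework's actual header format and field names are not specified here:

```cpp
#include <cstdint>

// Hypothetical record layout: an indicator bit in a per-record header.
// A tombstone delete inserts a copy of the record with the tombstone bit
// set; a tagging delete sets the tag bit on the record in place.
struct Record {
    uint64_t key;
    uint64_t value;
    uint32_t header;  // bit 0: tombstone, bit 1: tagged as deleted

    bool is_tombstone() const { return header & 0x1; }
    bool is_deleted()   const { return header & 0x2; }
    void set_tombstone()      { header |= 0x1; }
    void tag_delete()         { header |= 0x2; }

    // A tombstone cancels a matching non-tombstone record.
    bool cancels(const Record &r) const {
        return is_tombstone() && key == r.key && value == r.value
               && !r.is_tombstone();
    }
};
```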
Tombstone deletes are used to support @@ -848,7 +848,7 @@ or \texttt{nullptr} if it doesn't. It should also accept an optional boolean argument that the framework will pass \texttt{true} into if it is doing a lookup for a tombstone. This flag is to allow the shard to use various tombstone-related optimizations, such as using a Bloom filter -for them, or storing them seperately from the main records, etc. +for them, or storing them separately from the main records, etc. Shards should also expose some accessors for basic meta-data about their contents. In particular, the framework is reliant upon a function @@ -857,7 +857,7 @@ reconstructions, and the number of deleted records or tombstones within the shard for use in proactive compaction to bound the number of deleted records. The interface also requires functions for accessing memory usage information, both the memory use for the main data structure -being dynamized, and also any auxilliary memory (e.g., memory used +being dynamized, and also any auxiliary memory (e.g., memory used for an auxiliary hash table). These memory functions are used only for informational purposes. @@ -983,55 +983,267 @@ framework.} \subsection{Internal Mechanisms} +Given a user-provided query, shard, and record type, the framework +will automatically provide support for inserts, as well as deletes for +supported search problems, and concurrency if desired. This section will +discuss the internal mechanisms that the framework uses to support these +operations in a single-threaded context. Concurrency will be discussed in +Section~\ref{ssec:dyn-concurrency}. + \subsubsection{Inserts and Layout Policy} +New records are inserted into the structure by appending them to the +end of the mutable buffer. When the mutable buffer is filled, it must +be flushed to make room for further inserts.
This flush involves building +a shard from the records in the buffer using the unsorted constructor, +and then performing a series of reconstructions to integrate this new +shard into the structure. Once these reconstructions are complete, the +buffer can be marked as empty and the insertion performed. + +There are three layout policies supported by our framework, +\begin{itemize} +\item \textbf{Bentley-Saxe Method (BSM).} \\ +Our framework supports the Bentley-Saxe method directly, which we used as +a baseline for comparison in some benchmarking tests. This configuration +requires that $N_b = 1$ and $s = 2$ to match the standard BSM exactly (a +version of this approach that relaxes these restrictions is considered +in the next chapter). Reconstructions are performed by finding the +first empty level, $i$ (or adding one to the bottom if needed) and then +constructing a new shard at that level including all of the records from +all of the shards at levels $j \leq i$, as well as the newly created buffer +shard. Then all levels $j < i$ are set to empty. Our implementation of +BSM does not include any of the re-partitioning routines for bounding +deviations in record counts from the exact binary decomposition in the +face of deleted records. + +\item \textbf{Leveling.}\\ +Our leveling policy is identical to the one discussed in +Chapter~\ref{chap:sampling}. The capacity of level $i$ is $N_b \cdot s^{i+1}$ records. The first level ($i$) with available capacity to hold +all the records from the level above it ($i-1$ or the buffer, if $i = 0$) is found. Then, for all levels $j < i$, the records in $j$ are +merged with the records in $j+1$ and the resulting shard placed in level +$j+1$. This procedure guarantees that level $0$ will have capacity for +the shard from the buffer, which is then merged into it (if it is not +empty) or becomes it (if the level is empty).
+ + +\item \textbf{Tiering.}\\ +Our tiering policy, again, is identical to the one discussed in +Chapter~\ref{chap:sampling}. The capacity of each level is $s$ shards, +each having $N_b \cdot s^i$ records at most. The first level ($i$) having +fewer than $s$ shards is identified. Then, for each level $0 j$. But, if $i < j$, then a cancellation should occur. + +The case where the record and tombstone coexist covers the situation where +a record is deleted, and then inserted again after the delete. In this +case, there does exist a record $r_k$ with $k < j$ that the tombstone +should cancel with, but that record may exist in a different shard. So +the tombstone will \emph{eventually} cancel, but it would be technically +incorrect to cancel it with the matching record $r_i$ that it coexists +with in the shard being considered. + +This means that correct tombstone cancellation requires that the order +that records have been inserted be known and accounted for during +shard construction. To enable this, our framework implements two important +features, + +\begin{enumerate} + \item All records in the buffer contain a timestamp in their header, + indicating insertion order. This can be cleared or discarded once + the buffer shard has been constructed. + \item All shards passed into the shard constructor are provided in + chronological order. The first shard in the vector will be + the oldest, and so on, with the final shard being the newest. +\end{enumerate} + +The user can make use of these properties however they like during +shard construction. The specific approach that we use in our shard +implementations is to ensure that records are sorted by value, such that +equal records are adjacent, and then by age, such that the newest record +appears first, and the oldest last. By enforcing this order, a tombstone +at index $i$ will cancel with a record if and only if that record is +at index $i+1$.
For structures that are constructed by a sorted-merge +of data, this allows tombstone cancellation at no extra cost during +the merge operation. Otherwise, it requires an extra linear pass after +sorting to remove cancelled records.\footnote{ + For this reason, we use tagging based deletes for structures which + don't require sorting by value during construction. +} -\Paragraph{Asymptotic Complexity.} +\Paragraph{Erase Return Codes.} As noted in +Section~\ref{sec:dyn-framework}, the external \texttt{erase} function can +return a $0$ on failure. The specific meaning of this failure, however, +is a function of the delete policy being used. + +For tombstone deletes, a failure to delete means a failure to insert, +and the request should be retried after a brief delay. Note that, for +performance reasons, the framework makes no effort to ensure that the +record being erased using tombstones is \emph{actually} there, so it +is possible to insert a tombstone that can never be cancelled. This +won't affect correctness in any way, so long as queries are correctly +implemented, but it will increase the size of the structure slightly. + +For tagging deletes, a failure to delete means that the record to be +removed could not be located to tag it. Such failures should \emph{not} +be retried immediately, as the situation will not automatically resolve +itself before new records are inserted. + +\Paragraph{Tombstone Asymptotic Complexity.} Tombstone deletes reduce to +inserts, and so they have the same asymptotic properties as inserts. Namely, +\begin{align*} +\mathscr{D}(n) &\in \Theta(B(n)) \\ +\mathscr{D}_a(n) &\in \Theta\left( \frac{B(n)}{n} \cdot \log_s n\right) +\end{align*} -\Paragraph{Asymptotic Complexity.} +\Paragraph{Tagging Asymptotic Complexity.} Tagging deletes must perform +a linear scan of the buffer, and a point-lookup of every shard. 
If $L(n)$ +is the worst-case cost of the shard's implementation of \texttt{point\_lookup}, +then the worst-case cost of a delete under tagging is, +\begin{equation*} +\mathscr{D}(n) \in \Theta \left( N_b + L(n) \cdot \log_s n\right) +\end{equation*} +The \texttt{point\_lookup} interface requires an optional boolean argument +that is set to true when the function is called as part of a delete +process by the framework. This is to enable the use of Bloom filters, +or other similar structures, to accelerate these operations if desired. \subsubsection{Queries} +The framework processes queries using a direct implementation of the +approach discussed in Section~\ref{ssec:dyn-idsp}, with modifications to +account for the buffer. The buffer itself is treated in the procedure like +any other shard, except with its own specialized query and preprocessing +function. The algorithm itself is shown in Algorithm~\ref{alg:dyn-query}. + +In order to appropriately account for deletes during result set +combination, the query interfaces make similar ordering guarantees to +the shard construction interface. Records from the buffer will have +their insertion timestamp available, and shards, local queries, and +local results are always passed in descending order of age. This is to +allow tombstones to be accounted for during the query process using the +same mechanisms described in Section~\ref{sssec:dyn-deletes}.
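To illustrate the per-shard flow, consider a decomposable range-count query (the query type used in our evaluation) written against a toy sorted-array shard. This is a sketch only; the function names echo the interface described above, but \texttt{distribute\_query} and \texttt{repeat} are trivial for this problem and are omitted:

```cpp
#include <algorithm>
#include <vector>

using Shard = std::vector<long>;  // toy shard: a sorted run of keys

struct RangeCountState { long lo, hi; };  // local "meta-information"

// Per-shard preprocessing: for range counts there is nothing to compute,
// so the state just carries the query bounds.
RangeCountState local_preproc(const Shard &, long lo, long hi) {
    return {lo, hi};
}

// Local query: count keys in [lo, hi] with two binary searches.
long local_query(const Shard &s, const RangeCountState &st) {
    auto l = std::lower_bound(s.begin(), s.end(), st.lo);
    auto u = std::upper_bound(s.begin(), s.end(), st.hi);
    return u - l;
}

// Combine: range counts merge by simple addition.
long combine(const std::vector<long> &locals) {
    long total = 0;
    for (long c : locals) total += c;
    return total;
}

// Driver: preprocess every shard, run the local queries, then merge.
long range_count(const std::vector<Shard> &shards, long lo, long hi) {
    std::vector<long> locals;
    for (const auto &s : shards)
        locals.push_back(local_query(s, local_preproc(s, lo, hi)));
    return combine(locals);
}
```

Because range counting is decomposable with constant-cost merging, a single pass over the shards suffices; the iterative machinery only matters for problems like IRS and deletes.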
\begin{algorithm}[t] \caption{Query with Dynamization Framework} - \label{algo:query-framework} - \KwIn{$q$: query parameters, $b$: mutable buffer, $S$: static index shards at all levels} + \label{alg:dyn-query} + \KwIn{$q$: query parameters, $b$: mutable buffer, $S$: static shards at all levels} \KwOut{$R$: query results} $\mathscr{S}_b \gets \texttt{local\_preproc}_{buffer}(b, q);\ \ \mathscr{S} \gets \{\}$ \; @@ -1053,12 +1265,88 @@ \end{algorithm} -\Paragraph{Asymptotic Complexity.} +\Paragraph{Asymptotic Complexity.} The worst-case query cost of the +framework follows the same basic cost function as discussed for IDSPs +in Section~\ref{ssec:dyn-idsp}, with slight modifications to account for +the different cost function of buffer querying and preprocessing. The +cost is, +\begin{equation*} +\mathscr{Q}(n) \in O \left(P_B(N_B) + \log_s n \cdot P(n) + D(n) + R(n)\left( + Q_B(n) + \log_s n \cdot Q_s(n) + C_e(n)\right)\right) +\end{equation*} +where $P_B(n)$ is the cost of pre-processing the buffer, and $Q_B(n)$ is +the cost of querying it. As $N_B$ is a small constant relative to $n$, +in some cases these terms can be omitted, but they are left here for +generality. Also note that this is an upper bound, but isn't necessarily +tight. As we saw with IRS in Section~\ref{ssec:edsp}, it is sometimes +possible to leverage problem-specific details within this interface to +get better asymptotic performance. \subsection{Concurrency Control} \section{Evaluation} + +Having described the framework in detail, we'll now turn to demonstrating +its performance for a variety of search problems and associated data +structures. We've predominantly selected problems for which a +dynamic data structure already exists, to demonstrate that the performance +of our dynamization techniques can match or exceed hand-built dynamic +solutions to these problems.
Specifically, we will consider IRS using +ISAM tree, range scans using learned indices, high-dimensional $k$-NN +using VPTree, and exact string matching using succinct tries. + + \subsection{Experimental Setup} + +All of our testing was performed using Ubuntu 20.04 LTS on a dual-socket +Intel Xeon Gold 6242 server with 384 GiB of physical memory and +40 physical cores. We ran our benchmarks pinned to a specific core, +or specific NUMA node for multi-threaded testing. Our code was compiled +using GCC version 11.3.0 with the \texttt{-O3} flag, and targeted to +C++20.\footnote{ + The exception was the ALEX benchmark: ALEX does not build in this + configuration, and we used C++13 instead for that particular test. +} + +Our testing methodology involved warming up the data structure by +inserting 10\% of the dataset, and then measuring the throughput over +the insertion of the rest of the records. During this second phase, a +workload mixture of 95\% inserts and 5\% deletes was used for structures +that supported deletes. Once the insertion phase was complete, we measured +the query latency by repeatedly querying the structure with a selection +of pre-constructed queries and measuring the average latency. Reported +query performance numbers are latencies, and insertion/update numbers are +throughputs. For data structure size charts, we report the total size of +the data structure and all auxiliary structures, minus the size of the +raw data. All tests were run on a single thread without any background +operations, unless otherwise specified. + +We used several datasets for testing the different +structures. Specifically, + +\begin{itemize} + + \item For range and sampling problems, we used the \texttt{book}, + \texttt{fb}, and \texttt{osm} datasets from + SOSD~\cite{sosd-datasets}. Each has 200 million 64-bit keys + (to which we added 64-bit values) following a variety of + distributions.
We omitted the \texttt{wiki} dataset because it + contains duplicate keys, which were not supported by one of our + dynamic baselines. + + \item For vector problems, we used the Spanish Billion Words (SBW) + dataset~\cite{sbw}, containing about 1 million 300-dimensional + vectors of doubles, and a sample of 10 million 128-dimensional + vectors of unsigned longs from the BigANN dataset~\cite{bigann}. + + \item For string search, we used the genome of the brown bear + (ursarc) broken into 30 million unique 70--80 character + chunks~\cite{ursa}, and a list of about 400,000 English words + (english)~\cite{english-words}. + +\end{itemize} + + \subsection{Design Space Evaluation} \begin{figure} @@ -1074,8 +1362,70 @@ framework.} %\vspace{-2mm} \end{figure} +For our first set of experiments, we evaluated a dynamized version of the +Triespline learned index~\cite{plex} for answering range count queries.\footnote{ + We tested range scans throughout this chapter by measuring the + performance of a range count. We decided to go this route to ensure + that the results across our baselines were comparable. Different range + structures provided different interfaces for accessing the result + sets, some of which required making an extra copy and others which + didn't. Using a range count instead allowed us to measure only index + traversal time, without needing to worry about controlling for this + difference in interface. +} We evaluated different configurations of our framework to examine the +effects that our configuration parameters had on query and insertion +performance. We ran these tests using the SOSD \texttt{OSM} dataset. + +First, we'll consider the effect of buffer size on performance in +Figures~\ref{fig:ins-buffer-size} and \ref{fig:q-buffer-size}. For all +of these tests, we used a fixed scale factor of $8$ and the tombstone +delete policy.
Each plot shows the performance of our three supported +layout policies (note that BSM uses a fixed $N_B=1$ and $s=2$ for all +tests, to accurately reflect the performance of the classical Bentley-Saxe +method). We first note that the insertion throughput appears to increase +roughly linearly with the buffer size, regardless of layout policy +(Figure~\ref{fig:ins-buffer-size}), whereas the query latency remains +relatively flat up to $N_B=12000$, at which point it begins to increase +for both policies. It's worth noting that this is the point at which +the buffer takes up roughly half of the L1 cache on our test machine. + +It's interesting to compare these results with those in +Figures~\ref{fig:insert_mt} and \ref{fig:sample_mt} in the previous +chapter. Both of them show roughly similar insertion performance +(though this is masked slightly by the log scaling of the y-axis and +larger range of x-values in Figure~\ref{fig:insert_mt}), but there's a +clear difference in query performance. For the sampling structure in +Figure~\ref{fig:sample_mt}, the query latency was largely independent of +buffer size. In our sampling framework, we use rejection sampling on the +buffer, and so it introduced constant overhead. For range scans, though, +we need to do a full linear scan of the buffer. Increasing the buffer +reduces the number of shards to be queried slightly, and this effect +appears to be enough to counterbalance the increasing scan cost to a +point, but there's clearly a cut-off at which larger buffers cease to make +sense. We'll examine this situation in more detail in the next chapter. + +Next, we consider the effect that scale factor has on +performance. Figure~\ref{fig:ins-scale-factor} shows the change +in insertion performance as the scale factor is increased. The +pattern here is the same as we saw in the previous chapter, in +Figure~\ref{fig:insert_sf}. When leveling is used, enlarging the +scale factor hurts insertion performance.
When tiering is used, it +improves performance. This is because a larger scale factor in tiering +results in more, smaller structures, and thus reduced reconstruction +time. But for leveling it increases the write amplification, hurting +performance. Figure~\ref{fig:q-scale-factor} shows that, like with +Figure~\ref{fig:query_sf} in the previous chapter, query latency is not +strongly affected by the scale factor, but larger scale factors do tend +to have a negative effect under tiering (due to having more structures). + +As a final note, these results demonstrate that, compared to the +normal Bentley-Saxe method, our proposed design space is a strict +improvement. There are points within the space that are equivalent to, +or even strictly superior to, BSM in terms of both query and insertion +performance, as well as clearly available trade-offs between insertion and +query performance, particularly when it comes to selecting layout policy. + -\subsection{Independent Range Sampling} \begin{figure*} %\vspace{0pt} @@ -1089,15 +1439,115 @@ framework.} %\vspace{-6mm} \end{figure*} +\subsection{Independent Range Sampling} + +Next, we'll consider the independent range sampling problem using the ISAM +tree. The functioning of this structure for answering IRS queries is +discussed in more detail in Section~\ref{ssec:irs-struct}, and we use the +query algorithm described in Algorithm~\ref{alg:decomp-irs}. We use the +tagging mechanism to support deletes, and enable proactive compaction +to ensure that rejection rates are bounded. For our query class, we +obtain the upper and lower bounds of the query range, and the weight +of that range, using tree traversals in \texttt{local\_preproc}. We +use rejection sampling on the buffer, and so the buffer preprocessing +simply uses the number of records in the buffer for its weight.
In +\texttt{distribute\_query}, we build an alias structure over all of +the weights and query it $k$ times to obtain the individual $k$ values +for the local queries. To avoid extra work on repeat, we stash this +alias structure in the buffer's local query object so it is available +for re-use. \texttt{local\_query} simply generates the appropriate +number of random numbers on the query interval. For each of these, +the record is checked to see if it has been tagged as deleted or not, +and added to the result set if it hasn't. No retries occur in the case +of deleted records. \texttt{combine} simply merges all the result sets +together, and \texttt{repeat} checks if the total result set size is +the same as requested. If it is not, then \texttt{repeat} updates $k$ +to be the number of missing records, and calls \texttt{distribute\_query} +again, before returning false. + +This query algorithm and data structure result in a dynamized index with +the following performance characteristics, \begin{align*} \text{Insert:} \quad &\Theta\left(\log_s n\right) \\ \text{Query:} \quad &\Theta\left(\log_s n \log_f n + \frac{k}{1 - \delta}\right) \\ \text{Delete:} \quad &\Theta\left(\log_s n \log_f n\right) \end{align*} +where $f$ is the fanout of the ISAM tree and $\delta$ is the maximum +proportion of deleted records that can exist on a level before a proactive +compaction is triggered. + +We configured our dynamized structure to use $s=8$, $N_B=12000$, $\delta = .05$, $f = 16$, and the tiering layout policy. We compared our method +(\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+Tree with +aggregate weight counts (\textbf{AGG B+Tree}), as well as our bespoke +sampling solution from the previous chapter (\textbf{Bespoke}) and a +single static instance of the ISAM Tree (\textbf{ISAM}). Because IRS +is neither INV nor DDSP, the standard Bentley-Saxe Method has no way to +support deletes for it, and was not tested.
All of our tested sampling +queries had a controlled selectivity of $\sigma = 0.01\%$ and $k=1000$. + +The results of our performance benchmarking are in Figure~\ref{fig:irs}. +Figure~\ref{fig:irs-insert} shows that our general framework has +comparable insertion performance to the specialized one, though it loses +slightly. This is to be expected, as \textbf{Bespoke} was hand-written for +specifically this type of query and data structure, and has hard-coded +data types, among other things. Despite losing to \textbf{Bespoke} +slightly, \textbf{DE-IRS} does still manage to defeat the dynamic baseline +in all cases. + +Figure~\ref{fig:irs-query} shows the average query latencies of the +three dynamic solutions, as well as a lower bound provided by querying a +single instance of ISAM statically built over all of the records. This +shows that our generalized solution actually manages to defeat +\textbf{Bespoke} in query latency, coming in a bit closer to the static +structure. Both \textbf{DE-IRS} and \textbf{Bespoke} manage to defeat +the dynamic baseline. + +Finally, Figure~\ref{fig:irs-space} shows the space usage of the +data structures, less the storage required for the raw data. The two +dynamized solutions require \emph{significantly} less storage than the +dynamic B+Tree, which must leave empty spaces in its nodes for inserts. +This is a significant advantage of static data structures--they can pack +data much more tightly and require less storage. Dynamization, at least +in this case, doesn't add a significant amount of overhead over a single +instance of the static structure. + +\subsection{$k$-NN Search} + +Next, we'll consider answering high-dimensional exact $k$-NN queries +using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a +binary search tree with internal nodes that partition records based +on their distance to a selected point, called the vantage point.
All +of the points within a fixed distance of the vantage point are covered +by one subtree, and the points outside of this distance are covered by +the other. This results in a hard-to-update data structure that can +be constructed in $\Theta(n \log n)$ time using repeated application of +the \texttt{quickselect} algorithm~\cite{quickselect} to partition the +points for each node. This structure can answer $k$-NN queries in +$\Theta(k \log n)$ time. + +Our dynamized query procedure is implemented based on +Algorithm~\ref{alg:idsp-knn}, though using delete tagging instead of +tombstones. VPTree doesn't support efficient point lookups, and so to +work around this we add a hash map to each shard, mapping each record to +its location in storage, so that deletes can be performed efficiently. +This allows us to avoid cancelling deleted records in +the \texttt{combine} operation, as they can be skipped over during +\texttt{local\_query} directly. Because $k$-NN doesn't have any of the +distributional requirements of IRS, these local queries can return $k$ +records even in the case of deletes, by simply returning the next-closest +record instead, so long as there are at least $k$ undeleted records in +the shard. Thus, \texttt{repeat} isn't necessary.
This algorithm and +data structure result in a dynamization with the following performance +characteristics, +\begin{align*} + \text{Insert:} \quad &\Theta\left(\log_s n\right) \\ + \text{Query:} \quad &\Theta\left(N_B + \log n \log_s n\right ) \\ + \text{Delete:} \quad &\Theta\left(\log_s n \right) +\end{align*} -\subsection{k-NN Search} \begin{figure*} \subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-insert} \label{fig:knn-insert}} @@ -1110,12 +1560,6 @@ framework.} \end{figure*} -\begin{align*} - \text{Insert:} \quad &\Theta\left(\log_s n\right) \\ - \text{Query:} \quad &\Theta\left(N_B + \log n \log_s n\right ) \\ - \text{Delete:} \quad &\Theta\left(\log_s n \right) -\end{align*} - \subsection{Range Scan} \begin{figure*} diff --git a/img/fig-bs-fst-insert.pdf b/img/fig-bs-fst-insert.pdf index 409284f..8eb773e 100644 Binary files a/img/fig-bs-fst-insert.pdf and b/img/fig-bs-fst-insert.pdf differ diff --git a/img/fig-bs-fst-query.pdf b/img/fig-bs-fst-query.pdf index b1efbd4..4a4adf2 100644 Binary files a/img/fig-bs-fst-query.pdf and b/img/fig-bs-fst-query.pdf differ diff --git a/img/fig-bs-fst-space.pdf b/img/fig-bs-fst-space.pdf index e0fdef5..d482dfa 100644 Binary files a/img/fig-bs-fst-space.pdf and b/img/fig-bs-fst-space.pdf differ diff --git a/img/fig-bs-irs-concurrency.pdf b/img/fig-bs-irs-concurrency.pdf index c66a641..36773f8 100644 Binary files a/img/fig-bs-irs-concurrency.pdf and b/img/fig-bs-irs-concurrency.pdf differ diff --git a/img/fig-bs-irs-insert.pdf b/img/fig-bs-irs-insert.pdf index 3e879f8..c44f89b 100644 Binary files a/img/fig-bs-irs-insert.pdf and b/img/fig-bs-irs-insert.pdf differ diff --git a/img/fig-bs-irs-query.pdf b/img/fig-bs-irs-query.pdf index 09c9604..10f515c 100644 Binary files a/img/fig-bs-irs-query.pdf and b/img/fig-bs-irs-query.pdf differ diff --git a/img/fig-bs-irs-space.pdf b/img/fig-bs-irs-space.pdf index db0bbaa..238b5eb 100644 Binary files a/img/fig-bs-irs-space.pdf and 
b/img/fig-bs-irs-space.pdf differ diff --git a/img/fig-bs-knn-insert.pdf b/img/fig-bs-knn-insert.pdf index 6a74560..3f0bb6d 100644 Binary files a/img/fig-bs-knn-insert.pdf and b/img/fig-bs-knn-insert.pdf differ diff --git a/img/fig-bs-knn-query.pdf b/img/fig-bs-knn-query.pdf index ea7abc6..19badcb 100644 Binary files a/img/fig-bs-knn-query.pdf and b/img/fig-bs-knn-query.pdf differ diff --git a/img/fig-bs-knn-space.pdf b/img/fig-bs-knn-space.pdf index 09a8700..7bd1c6b 100644 Binary files a/img/fig-bs-knn-space.pdf and b/img/fig-bs-knn-space.pdf differ diff --git a/img/fig-bs-knn.pdf b/img/fig-bs-knn.pdf index f58af87..e0508ec 100644 Binary files a/img/fig-bs-knn.pdf and b/img/fig-bs-knn.pdf differ diff --git a/img/fig-bs-rq-insert.pdf b/img/fig-bs-rq-insert.pdf index 87105e4..7dec789 100644 Binary files a/img/fig-bs-rq-insert.pdf and b/img/fig-bs-rq-insert.pdf differ diff --git a/img/fig-bs-rq-query.pdf b/img/fig-bs-rq-query.pdf index d8de0f3..6841276 100644 Binary files a/img/fig-bs-rq-query.pdf and b/img/fig-bs-rq-query.pdf differ diff --git a/img/fig-bs-rq-space.pdf b/img/fig-bs-rq-space.pdf index 744c0fb..e4e6324 100644 Binary files a/img/fig-bs-rq-space.pdf and b/img/fig-bs-rq-space.pdf differ diff --git a/img/fig-ps-mt-insert.pdf b/img/fig-ps-mt-insert.pdf index 4ad0e7c..d332dd6 100644 Binary files a/img/fig-ps-mt-insert.pdf and b/img/fig-ps-mt-insert.pdf differ diff --git a/img/fig-ps-mt-query.pdf b/img/fig-ps-mt-query.pdf index 433ae4c..db0d9e7 100644 Binary files a/img/fig-ps-mt-query.pdf and b/img/fig-ps-mt-query.pdf differ diff --git a/img/fig-ps-mt-space.pdf b/img/fig-ps-mt-space.pdf index af74602..e4978ac 100644 Binary files a/img/fig-ps-mt-space.pdf and b/img/fig-ps-mt-space.pdf differ diff --git a/img/fig-ps-sf-insert.pdf b/img/fig-ps-sf-insert.pdf index 3020cd6..a857734 100644 Binary files a/img/fig-ps-sf-insert.pdf and b/img/fig-ps-sf-insert.pdf differ diff --git a/img/fig-ps-sf-query.pdf b/img/fig-ps-sf-query.pdf index 11f57eb..56375e3 100644 
Binary files a/img/fig-ps-sf-query.pdf and b/img/fig-ps-sf-query.pdf differ diff --git a/img/fig-ps-sf-space.pdf b/img/fig-ps-sf-space.pdf index bc85eb5..8655e69 100644 Binary files a/img/fig-ps-sf-space.pdf and b/img/fig-ps-sf-space.pdf differ diff --git a/references/references.bib b/references/references.bib index 38244e6..7d2b8a0 100644 --- a/references/references.bib +++ b/references/references.bib @@ -1735,3 +1735,35 @@ keywords = {analytic model, analysis of algorithms, overflow chaining, performan publisher={Elsevier} } +@online{ursa, + title = {Brown Bear Genome, v1}, + url = {https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_023065955.1/}, + year = {2024} +} + + @online{bigann, + title = {BigANN Dataset}, + url = {https://big-ann-benchmarks.com/neurips21.html}, + year = {2024} +} + +@online{english-words, + title = {English Words Dataset}, + url = {https://github.com/dwyl/english-words?tab=readme-ov-file}, + year = {2024} +} + +@article{quickselect, + author = {C. A. R. Hoare}, + title = {Algorithm 65: find}, + journal = {Commun. {ACM}}, + volume = {4}, + number = {7}, + pages = {321--322}, + year = {1961}, + url = {https://doi.org/10.1145/366622.366647}, + doi = {10.1145/366622.366647}, + timestamp = {Fri, 24 Mar 2023 16:31:07 +0100}, + biburl = {https://dblp.org/rec/journals/cacm/Hoare61a.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} -- cgit v1.2.3