From cd3447f1cad16972e8a659ec6e84764c5b8b2745 Mon Sep 17 00:00:00 2001 From: "Douglas B. Rumbaugh" Date: Sun, 1 Jun 2025 13:15:52 -0400 Subject: Julia updates --- chapters/beyond-dsp.tex | 126 ++++++++++++++++++++++++------------------------ 1 file changed, 63 insertions(+), 63 deletions(-) (limited to 'chapters/beyond-dsp.tex') diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex index 87f44ba..73f8174 100644 --- a/chapters/beyond-dsp.tex +++ b/chapters/beyond-dsp.tex @@ -202,7 +202,7 @@ problem. The core idea underlying our solution in that chapter was to introduce individualized local queries for each block, which were created after a pre-processing step to allow information about each block to be determined first. In that particular example, we established the weight -each block should have during sampling, and then creating custom sampling +each block should have during sampling, and then created custom sampling queries with variable $k$ values, following the weight distributions. We have determined a general interface that allows for this procedure to be expressed, and we define the term \emph{extended decomposability} to refer @@ -379,12 +379,12 @@ A significant limitation of invertible problems is that the result set size is not able to be controlled. We do not know how many records in our local results have been deleted until we reach the combine operation and they begin to cancel out, at which point we lack a mechanism to go back -and retrieve more. This presents difficulties for addressing important -search problems such as top-$k$, $k$-NN, and sampling. In principle, these -queries could be supported by repeating the query with larger-and-larger -$k$ values until the desired number of records is returned, but in the -eDSP model this requires throwing away a lot of useful work, as the state -of the query must be rebuilt each time. +and retrieve more records. 
This presents difficulties for addressing +important search problems such as top-$k$, $k$-NN, and sampling. In +principle, these queries could be supported by repeating the query with +larger-and-larger $k$ values until the desired number of records is +returned, but in the eDSP model this requires throwing away a lot of +useful work, as the state of the query must be rebuilt each time. We can resolve this problem by moving the decision to repeat the query into the query interface itself, allowing retries \emph{before} the @@ -700,7 +700,7 @@ the following main operations, This function will delete a record from the dynamized structure, returning $1$ on success and $0$ on failure. The meaning of a failure to delete is dependent upon the delete mechanism in use, - and will be discussed in Section~\ref{ssec:dyn-deletes}. + and will be discussed in Section~\ref{sssec:dyn-deletes}. \item \texttt{std::future query(QueryParameters); } \\ This function will execute a query with the specified parameters @@ -838,17 +838,18 @@ shards of the same type. The second of these constructors is to allow for efficient merging to be leveraged for merge decomposable search problems. Shards can also expose a point lookup operation for use in supporting -deletes for DDSPs. This function is only used for DDSP deletes, and so can -be left off when this functionality isn't necessary. If a data structure -doesn't natively support an efficient point-lookup, then it can be added -by including a hash table or other data structure in the shard if desired. -This function accepts a record type as input, and should return a pointer -to the record that exactly matches the input in storage, if one exists, -or \texttt{nullptr} if it doesn't. It should also accept an optional -boolean argument that the framework will pass \texttt{true} into if it -is don't a lookup for a tombstone. 
This flag is to allow the shard to -use various tombstone-related optimization, such as using a Bloom filter -for them, or storing them separately from the main records, etc. +deletes for DDSPs. This function is only used for DDSP deletes, and +so can be left off when this functionality isn't necessary. If a data +structure doesn't natively support an efficient point-lookup, then it +can be added by including a hash table or other data structure in the +shard if desired. This function accepts a record type as input, and +should return a pointer to the record that exactly matches the input in +storage, if one exists, or \texttt{nullptr} if it doesn't. It should +also accept an optional boolean argument that the framework will pass +\texttt{true} into if the lookup operation is being used to search for +a tombstone record. This flag is to allow the shard to use various +tombstone-related optimizations, such as using a Bloom filter for them, +or storing them separately from the main records, etc. Shards should also expose some accessors for basic meta-data about their contents. In particular, the framework is reliant upon a function @@ -888,19 +889,19 @@ concept ShardInterface = RecordInterface }; \end{lstlisting} -\label{listing:shard} \caption{The required interface for shard types in our dynamization framework.} +\label{lst:shard} \end{lstfloat} \subsubsection{Query Interface} The most complex interface required by the framework is for queries. The -concept for query types is given in Listing~\ref{listing:query}. In +concept for query types is given in Listing~\ref{lst:query}. 
In effect, it requires implementing the full IDSP interface from the previous section, as well as versions of $\mathbftt{local\_preproc}$ -and $\mathbftt{local\query}$ for pre-processing and querying an unsorted +and $\mathbftt{local\_query}$ for pre-processing and querying an unsorted set of records, which is necessary to allow the mutable buffer to be used as part of the query process.\footnote{ In the worst case, these routines could construct temporary shard @@ -918,7 +919,7 @@ a local result that includes both the number of records and the number of tombstones, while the query result itself remains a single number. Additionally, the framework makes no decision about what, if any, collection type should be used for these results. A range scan, for -example, could specified the result types as a vector of records, map +example, could specify the result types as a vector of records, map of records, etc., depending on the use case. There is one significant difference between the IDSP interface and the @@ -935,7 +936,6 @@ to define an additional combination operation for final result types, or duplicate effort in the combine step on each repetition. \begin{lstfloat} - \begin{lstlisting}[language=C++] template j$. But, if $i < j$, then a cancellation should occur. The case where the record and tombstone coexist covers the situation where a record is deleted, and then inserted again after the delete. In this case, there does exist a record $r_k$ with $k < j$ that the tombstone should cancel with, but that record may exist in a different shard. So -the tombstone will \emph{eventually} cancel, but it would be technically -incorrect to cancel it with the matching record $r_i$ that it coexists -with in the shard being considered. +the tombstone will \emph{eventually} cancel, but it would be incorrect +to cancel it with the matching record $r_i$ that it coexists with in +the shard being considered. 
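The in-order cancellation rule described above can be sketched in a few lines. This is an illustrative sketch, not the framework's actual record types: it assumes records sorted by key, with each tombstone ordered immediately before the record it deletes, so that a single linear pass suffices.

```cpp
#include <cstddef>
#include <vector>

// Illustrative record layout (hypothetical, not the framework's header
// format): a key plus a tombstone flag.
struct Record {
    int  key;
    bool tombstone;
};

// One linear pass implementing the cancellation rule: a tombstone at
// index i cancels with a matching record at index i + 1, and both are
// dropped. A tombstone whose matching record lives in a different shard
// survives the pass and will cancel during some later reconstruction.
std::vector<Record> cancel_tombstones(const std::vector<Record> &sorted) {
    std::vector<Record> out;
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        if (sorted[i].tombstone && i + 1 < sorted.size() &&
            !sorted[i + 1].tombstone && sorted[i + 1].key == sorted[i].key) {
            ++i; // skip the record as well; the pair annihilates
            continue;
        }
        out.push_back(sorted[i]);
    }
    return out;
}
```

Note that the pass deliberately leaves an unmatched tombstone in place rather than dropping it, which is what allows the eventual cancellation described above to happen in a later merge.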
This means that correct tombstone cancellation requires that the order that records have been inserted be known and accounted for during @@ -1186,7 +1186,7 @@ at index $i$ will cancel with a record if and only if that record is in index $i+1$. For structures that are constructed by a sorted-merge of data, this allows tombstone cancellation at no extra cost during the merge operation. Otherwise, it requires an extra linear pass after -sorting to remove cancelled records.\footnote{ +sorting to remove canceled records.\footnote{ For this reason, we use tagging based deletes for structures which don't require sorting by value during construction. } @@ -1200,7 +1200,7 @@ For tombstone deletes, a failure to delete means a failure to insert, and the request should be retried after a brief delay. Note that, for performance reasons, the framework makes no effort to ensure that the record being erased using tombstones is \emph{actually} there, so it -is possible to insert a tombstone that can never be cancelled. This +is possible to insert a tombstone that can never be canceled. This won't affect correctness in any way, so long as queries are correctly implemented, but it will increase the size of the structure slightly. @@ -1271,7 +1271,7 @@ same mechanisms described in Section~\ref{sssec:dyn-deletes}. \Paragraph{Asymptotic Complexity.} The worst-case query cost of the framework follows the same basic cost function as discussed for IDSPs -in Section~\ref{asec:dyn-idsp}, with slight modifications to account for +in Section~\ref{ssec:dyn-idsp}, with slight modifications to account for the different cost function of buffer querying and preprocessing. The cost is, \begin{equation*} @@ -1280,7 +1280,7 @@ cost is, \end{equation*} where $P_B(n)$ is the cost of pre-processing the buffer, and $Q_B(n)$ is the cost of querying it. 
As $N_B$ is a small constant relative to $n$, -in some cases these terms can be ommitted, but they are left here for +in some cases these terms can be omitted, but they are left here for generality. Also note that this is an upper bound, but isn't necessarily tight. As we saw with IRS in Section~\ref{ssec:edsp}, it is sometimes possible to leverage problem-specific details within this interface to @@ -1307,7 +1307,7 @@ All of our testing was performed using Ubuntu 20.04 LTS on a dual socket Intel Xeon Gold 6242 server with 384 GiB of physical memory and 40 physical cores. We ran our benchmarks pinned to a specific core, or specific NUMA node for multi-threaded testing. Our code was compiled -using GCC version 11.3.0 with the \texttt{-O3} flag, and targetted to +using GCC version 11.3.0 with the \texttt{-O3} flag, and targeted to C++20.\footnote{ Aside from the ALEX benchmark. ALEX does not build in this configuration, and we used C++13 instead for that particular test. } @@ -1335,7 +1335,7 @@ structures. Specifically, \texttt{fb}, and \texttt{osm} datasets from SOSD~\cite{sosd-datasets}. Each has 200 million 64-bit keys (to which we added 64-bit values) following a variety of - distributions. We ommitted the \texttt{wiki} dataset because it + distributions. We omitted the \texttt{wiki} dataset because it contains duplicate keys, which were not supported by one of our dynamic baselines. @@ -1371,7 +1371,7 @@ For our first set of experiments, we evaluated a dynamized version of the Triespline learned index~\cite{plex} for answering range count queries.\footnote{ We tested range scans throughout this chapter by measuring the performance of a range count. We decided to go this route to ensure - that the results across our baselines were comprable. Different range + that the results across our baselines were comparable. Different range structures provided different interfaces for accessing the result sets, some of which required making an extra copy and others which didn't. 
Using a range count instead allowed us to measure only index performance. } We ran these tests using the SOSD \texttt{OSM} dataset. First, we'll consider the effect of buffer size on performance in Figures~\ref{fig:ins-buffer-size} and \ref{fig:q-buffer-size}. For all -of these tests, we used a fixe scale factor of $8$ and the tombstone +of these tests, we used a fixed scale factor of $8$ and the tombstone delete policy. Each plot shows the performance of our three supported layout policies (note that BSM uses a fixed $N_B=1$ and $s=2$ for all tests, to accurately reflect the performance of the classical Bentley-Saxe @@ -1419,17 +1419,17 @@ improves performance. This is because a larger scale factor in tiering results in more, smaller structures, and thus reduced reconstruction time. But for leveling it increases the write amplification, hurting performance. Figure~\ref{fig:q-scale-factor} shows that, like with -Figure~\ref{fig:query_sf} in the previous chapter, query latency is not -strong affected by the scale factor, but larger scale factors due tend +Figure~\ref{fig:sample_sf} in the previous chapter, query latency is not +strongly affected by the scale factor, but larger scale factors do tend to have a negative effect under tiering (due to having more structures). As a final note, these results demonstrate that, compared to the normal Bentley-Saxe method, our proposed design space is a strict -improvement. There are points within the space that are equivilant to, +improvement. There are points within the space that are equivalent to, or even strictly superior to, BSM in terms of both query and insertion -performance, as well as clearly available trade-offs between insertion and -query performance, particular when it comes to selecting layout policy. - +performance. Beyond this, there are also clearly available trade-offs +between insertion and query performance, particularly when it comes to +selecting layout policy. 
\begin{figure*} @@ -1446,7 +1446,7 @@ query performance, particular when it comes to selecting layout policy. \subsection{Independent Range Sampling} -Next, we'll consider the indepedent range sampling problem using ISAM +Next, we'll consider the independent range sampling problem using an ISAM tree. The functioning of this structure for answering IRS queries is discussed in more detail in Section~\ref{ssec:irs-struct}, and we use the query algorithm described in Algorithm~\ref{alg:decomp-irs}. We use the @@ -1456,7 +1456,7 @@ obtain the upper and lower bounds of the query range, and the weight of that range, using tree traversals in \texttt{local\_preproc}. We use rejection sampling on the buffer, and so the buffer preprocessing simply uses the number of records in the buffer for its weight. In -\texttt{distribute\_query}, we build and alias structure over all of +\texttt{distribute\_query}, we build an alias structure over all of the weights and query it $k$ times to obtain the individual $k$ values for the local queries. To avoid extra work on repeat, we stash this alias structure in the buffer's local query object so it is available @@ -1485,8 +1485,8 @@ compaction is triggered. We configured our dynamized structure to use $s=8$, $N_B=12000$, $\delta = .05$, $f = 16$, and the tiering layout policy. We compared our method (\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+Tree with -aggregate weight counts (\textbf{AGG B+Tree}), as well as our besoke -sampling solution from the previous chapter (\textbf{Besoke}) and a +aggregate weight counts (\textbf{AGG B+Tree}), as well as our bespoke +sampling solution from the previous chapter (\textbf{Bespoke}) and a single static instance of the ISAM Tree (\textbf{ISAM}). Because IRS is neither INV nor DDSP, the standard Bentley-Saxe Method has no way to support deletes for it, and was not tested. All of our tested sampling queries had a controlled selectivity of $\sigma = 0.01\%$ and $k=1000$. 
The results of our performance benchmarking are in Figure~\ref{fig:irs}. Figure~\ref{fig:irs-insert} shows that our general framework has -comperable insertion performance to the specialized one, though loses +comparable insertion performance to the specialized one, though it loses slightly. This is to be expected, as \textbf{Bespoke} was hand-written for specifically this type of query and data structure, and has hard-coded data types, among other things. Despite losing to \textbf{Bespoke} @@ -1525,7 +1525,7 @@ using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a binary search tree with internal nodes that partition records based on their distance to a selected point, called the vantage point. All of the points within a fixed distance of the vantage point are covered -by one subtree, and the points outside of this distance are covered by +by one sub-tree, and the points outside of this distance are covered by the other. This results in a hard-to-update data structure that can be constructed in $\Theta(n \log n)$ time using repeated application of the \texttt{quickselect} algorithm~\cite{quickselect} to partition the @@ -1537,7 +1537,7 @@ Algorithm~\ref{alg:idsp-knn}, though using delete tagging instead of tombstones. VPTree doesn't support efficient point lookups, and so to work around this we add a hash map to each shard, mapping each record to its location in storage, to ensure that deletes can be done efficiently -in this way. This allows us to avoid cancelling deleted records in +in this way. This allows us to avoid canceling deleted records in the \texttt{combine} operation, as they can be skipped over during \texttt{local\_query} directly. Because $k$-NN doesn't have any of the distributional requirements of IRS, these local queries can return $k$ @@ -1599,7 +1599,7 @@ scheme used, with \textbf{BSM-VPTree} performing slightly \emph{better} than our framework for query performance. 
The reason for this is shown in Figure~\ref{fig:knn-insert}, where our framework outperforms the Bentley-Saxe method in insertion performance. These results are -atributible to our selection of framework configuration parameters, +attributable to our selection of framework configuration parameters, which are biased towards better insertion performance. Both dynamized structures also outperform the dynamic baseline. Finally, as is becoming a trend, Figure~\ref{fig:knn-space} shows that the storage requirements @@ -1667,7 +1667,7 @@ The results of our evaluation are shown in Figure~\ref{fig:eval-learned-index}. Figure~\ref{fig:rq-insert} shows the insertion performance. DE-TS is the best in all cases, and the pure BSM version of Triespline is the worst by a substantial margin. Of -particular interest in this chart is the inconsisent performance of +particular interest in this chart is the inconsistent performance of ALEX, which does quite well on the \texttt{books} dataset, and poorly on the others. It is worth noting that getting ALEX to run \emph{at all} in some cases required a lot of trial and error and tuning, as its @@ -1691,8 +1691,8 @@ performs horrendously compared to all of the other structures. The same caveat from the previous paragraph applies here--PGM can be configured for better performance. But it's notable that our framework-dynamized PGM is able to beat PGM slightly in insertion performance without seeing the -same massive degredation in query performance that PGM's native update -suport does in its own update-optmized configuration.\footnote{ +same massive degradation in query performance that PGM's native update +support does in its own update-optimized configuration.\footnote{ It's also worth noting that PGM implements tombstone deletes by inserting a record with a matching key to the record to be deleted, and a particular ``tombstone'' value, rather than using a header. This @@ -1712,7 +1712,7 @@ update support. 
\subsection{String Search} As a final example of a search problem, we consider exact string matching -using the fast succinct trie~\cite{zhang18}. While updatable +using the fast succinct trie~\cite{zhang18}. While dynamic tries aren't terribly unusual~\cite{m-bonsai,dynamic-trie}, succinct data structures, which attempt to approach an information-theoretic lower-bound on their binary representation of the data, are usually static because @@ -1725,7 +1725,7 @@ we consider the effectiveness of our generalized framework for them. \centering \subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-insert} \label{fig:fst-insert}} \subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-query} \label{fig:fst-query}} - \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-size}} + \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-space}} %\vspace{-3mm} \caption{FST Evaluation} \label{fig:fst-eval} @@ -1739,9 +1739,9 @@ storage. Queries use no pre-processing and the local queries directly search for a matching string. We use the framework's early abort feature to stop as soon as the first result is found, and combine simply checks whether this record is a tombstone or not. If it's a tombstone, then -the lookup is considered to have no found the search string. Otherwise, +the lookup is considered to have not found the search string. Otherwise, the record is returned. This results in a dynamized structure with the -following asympotic costs, +following asymptotic costs, \begin{align*} @@ -1759,7 +1759,7 @@ The results are shown in Figure~\ref{fig:fst-eval}. As with range scans, the Bentley-Saxe method shows horrible insertion performance relative to our framework in Figure~\ref{fig:fst-insert}. 
Note that the significant observed difference in update throughput for the two data sets is -largely attributable to the relative sizes. The \texttt{usra} set is +largely attributable to the relative sizes. The \texttt{US} set is far larger than \texttt{english}. Figure~\ref{fig:fst-query} shows that our write-optimized framework configuration is slightly out-performed in query latency by the standard Bentley-Saxe dynamization, and that both @@ -1767,7 +1767,7 @@ dynamized structures are quite a bit slower than the static structure for queries. Finally, the storage costs for the data structures are shown in Figure~\ref{fig:fst-space}. For the \texttt{english} data set, the extra storage cost from decomposing the structure is quite significant, -but the for \texttt{ursarc} set the sizes are quite comperable. It is +but for the \texttt{ursarc} set the sizes are quite comparable. It is not unexpected that dynamization would add storage cost for succinct (or any compressed) data structures, because the splitting of the records across multiple data structures reduces the ability of the structure to @@ -1792,10 +1792,10 @@ are inserted, it is necessary that each operation obtain a lock on the root node of the tree~\cite{zhao22}. This makes this situation a good use-case for the automatic concurrency support provided by our framework. Figure~\ref{fig:irs-concurrency} shows the results of this -benchmark for various numbers of concurreny query threads. As can be seen, +benchmark for various numbers of concurrent query threads. As can be seen, our framework supports a stable update throughput up to 32 query threads, whereas the AGG B+Tree suffers from contention for the mutex and sees -is performance degrade as the number of threads increases. +its performance degrade as the number of threads increases. \begin{figure} \centering -- cgit v1.2.3