From 40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Tue, 13 May 2025 17:29:40 -0400
Subject: Updates

---
 chapters/sigmod23/framework.tex | 48 ++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 0f3fac8..2f2515b 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -1,7 +1,7 @@
 \section{Dynamization of SSIs}
 \label{sec:framework}
 
-Our goal, then, is to design a solution to indepedent sampling that is
+Our goal, then, is to design a solution to independent sampling that is
 able to achieve \emph{both} efficient updates and efficient sampling,
 while also maintaining statistical independence both within and between
 IQS queries, and to do so in a generalized fashion without needing to
@@ -98,7 +98,7 @@ define the decomposability conditions for a query sampling problem,
 \end{enumerate}
 \end{definition}
 
-These two conditions warrant further explaination. The first condition
+These two conditions warrant further explanation. The first condition
 is simply a redefinition of the standard decomposability criteria to
 consider matching the distribution, rather than the exact records in $R$,
 as the correctness condition for the merge process. The second condition
@@ -114,7 +114,7 @@ problems. First, we note that many SSIs have a sampling procedure that
 naturally involves two phases. First, some preliminary work is done
 to determine metadata concerning the set of records to sample from,
 and then $k$ samples are drawn from the structure, taking advantage of
-this metadata. If we represent the time cost of the prelimary work
+this metadata. If we represent the time cost of the preliminary work
 with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
 structures' query cost functions are of the form,
 
@@ -213,7 +213,7 @@ $k$ records have been sampled.
 \end{example}
 
 Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
-a constant number of repetitions, the cost of answering a decomposible
+a constant number of repetitions, the cost of answering a decomposable
 sampling query having a pre-processing cost of $P(n)$, a weight-determination
 cost of $W(n)$ and a per-sample cost of $S(n)$ will be,
 \begin{equation}
@@ -241,7 +241,7 @@ satisfied by either the search problem or data structure. Unfortunately,
 neither approach can work as a ``drop-in'' solution in the context of
 sampling problems, because of the way that deleted records interact
 with the sampling process itself. Sampling problems, as formalized here,
-are neither invertable, nor deletion decomposable. In this section,
+are neither invertible nor deletion decomposable. In this section,
 we'll discuss our mechanisms for supporting deletes, as well as how
 these can be handled during sampling while maintaining correctness.
 
@@ -397,7 +397,7 @@ Section~\ref{sec:sampling-implementation}.
 
 \subsubsection{Bounding Rejection Probability}
 
-When a sampled record has been rejected, it must be resampled. This
+When a sampled record has been rejected, it must be re-sampled. This
 introduces performance overhead resulting from extra memory accesses and
 random number generations, and hurts our ability to provide performance
 bounds on our sampling operations. In the worst case, a structure
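To make the decomposed sampling procedure and its rejection loop concrete, here is a minimal C++ sketch written against a hypothetical Shard interface; the preprocess, total_weight, sample_one, and is_deleted members are illustrative stand-ins for the per-block operations costed above as $P(n)$, $W(n)$, and $S(n)$, not the chapter's actual API.

// Hedged sketch: drawing k samples from a decomposed SSI, re-sampling
// whenever a deleted record is drawn. Block selection is weighted by the
// per-block weights computed during the preliminary pass.
#include <cstddef>
#include <random>
#include <vector>

struct Record { /* application-defined payload */ };

struct Shard {
    virtual void   preprocess()                        = 0;  // P(n) work
    virtual double total_weight() const                = 0;  // W(n) work
    virtual Record sample_one(std::mt19937 &rng) const = 0;  // S(n) per draw
    virtual bool   is_deleted(const Record &r) const   = 0;  // delete check
    virtual ~Shard() = default;
};

std::vector<Record> sample_query(std::vector<Shard*> &shards,
                                 std::size_t k, std::mt19937 &rng) {
    // Preliminary pass: per-block preprocessing and weight determination.
    std::vector<double> weights;
    for (Shard *s : shards) {
        s->preprocess();
        weights.push_back(s->total_weight());
    }
    std::discrete_distribution<std::size_t> pick_block(weights.begin(),
                                                       weights.end());
    // Sampling pass: rejected (deleted) records are simply drawn again,
    // which is the source of the rejection overhead discussed above.
    std::vector<Record> result;
    while (result.size() < k) {
        Shard *s = shards[pick_block(rng)];
        Record r = s->sample_one(rng);
        if (!s->is_deleted(r))
            result.push_back(r);
    }
    return result;
}

Under tagging, is_deleted amounts to checking a bit on the sampled record; under tombstones it would instead require a point lookup across the structure, which is what motivates the Bloom filters discussed later in the section.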
@@ -413,7 +413,7 @@ we have the opportunity to remove deleted records.
 This will cause the record counts associated with each block of the structure
 to gradually drift out of alignment with the "perfect" powers of two
 associated with the Bentley-Saxe method, however. In the theoretical literature on this
-topic, the solution to this problem is to periodically repartition all of
+topic, the solution to this problem is to periodically re-partition all of
 the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
 approach could also be easily applied here, if desired, though we do not
 do so in our implementations, for reasons that will be discussed in
@@ -449,7 +449,7 @@ is not sufficient.
 Fortunately, this passive system can be used as the basis for a system
 that does provide a bound. This is because it guarantees, whether
 tagging or tombstones are used, that any given deleted
-record will \emph{eventually} be cancelled out after a finite number
+record will \emph{eventually} be canceled out after a finite number
 of reconstructions. If the number of deleted records gets too high,
 some or all of these deleted records can be cleared out by proactively
 performing reconstructions. We call these proactive reconstructions
@@ -524,7 +524,7 @@ that is both tunable, and generally more performant, at the cost of some
 additional theoretical complexity. There has been some theoretical
 work in this area, based upon nesting instances of the equal block
 method within the Bentley-Saxe method~\cite{overmars81}, but these methods are
-unwieldy and are targetted at tuning the worst-case at the expense of the
+unwieldy and are targeted at tuning the worst-case at the expense of the
 common case. We will take a different approach to adding configurability
 to our dynamization system.
 
@@ -532,7 +532,7 @@ Though it has thus far gone unmentioned, some readers may have noted
 the astonishing similarity between decomposition-based dynamization
 techniques, and a data structure called the Log-structured
 Merge-tree. First proposed by O'Neil in the mid '90s\cite{oneil96},
-the LSM Tree was designed to optmize write throughout for external data
+the LSM Tree was designed to optimize write throughput for external data
 structures. It accomplished this task by buffering inserted records in
 a small in-memory AVL Tree, and then flushing this buffer to disk when
 it filled up. The flush process itself would fully rebuild the on-disk
@@ -543,13 +543,13 @@ layered, external structures, to reduce the cost of reconstruction.
 In more recent times, the LSM Tree has seen significant development
 and been used as the basis for key-value stores like RocksDB~\cite{dong21}
 and LevelDB~\cite{leveldb}. This work has produced an incredibly large
-and well explored parameterization of the reconstruction procedures of
+and well explored parametrization of the reconstruction procedures of
 LSM Trees, a good summary of which can be found in this recent tutorial
 paper~\cite{sarkar23}. Examples of this design space exploration include:
 different ways to organize each "level" of the tree~\cite{dayan19,
-dostoevsky, autumn}, different growth rates, buffering, sub-partioning
+dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
 of structures to allow finer-grained reconstruction~\cite{dayan22}, and
-approaches for allocating resources to auxilliary structures attached to
+approaches for allocating resources to auxiliary structures attached to
 the main ones for accelerating certain types of query~\cite{dayan18-1,
 zhu21, monkey}.
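To ground the buffered-insert-and-flush pattern borrowed from the LSM Tree literature, the following is a hedged C++ sketch of the insert path under a mutable buffer, a scale factor $s$, and the leveling/tiering distinction. All names (Dynamized, Shard, flush_buffer) are illustrative assumptions rather than the framework's interface, and cascading reconstructions into lower levels are omitted.

// Hedged sketch: LSM-style buffering applied to dynamization. New records
// accumulate in a small mutable buffer of capacity N_B; a full buffer is
// flushed by rebuilding a shard, with levels growing by a scale factor s.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Record { long key; double weight; };

struct Shard {
    std::vector<Record> data;   // stand-in for a rebuilt static SSI
    explicit Shard(std::vector<Record> recs) : data(std::move(recs)) {}
};

class Dynamized {
    std::size_t buffer_cap_;    // N_B
    std::size_t scale_factor_;  // s
    bool tiering_;              // tiering vs. leveling reconstruction policy
    std::vector<Record> buffer_;
    std::vector<std::vector<std::unique_ptr<Shard>>> levels_;

    // Capacity of level i is N_B * s^(i+1); cascading a full level into the
    // one below it is omitted here for brevity.
    std::size_t level_capacity(std::size_t i) const {
        std::size_t cap = buffer_cap_;
        for (std::size_t j = 0; j <= i; ++j) cap *= scale_factor_;
        return cap;
    }

    void flush_buffer() {
        std::vector<Record> recs = std::move(buffer_);
        buffer_.clear();
        if (levels_.empty()) levels_.emplace_back();
        if (tiering_) {
            // Tiering: each flush adds a new shard; a level holds up to s
            // shards before they are merged together and pushed down.
            levels_[0].push_back(std::make_unique<Shard>(std::move(recs)));
        } else {
            // Leveling: a level holds one shard, rebuilt to include both the
            // existing records and the incoming ones.
            if (!levels_[0].empty()) {
                auto &old = levels_[0].front()->data;
                recs.insert(recs.end(), old.begin(), old.end());
                levels_[0].clear();
            }
            levels_[0].push_back(std::make_unique<Shard>(std::move(recs)));
        }
    }

public:
    Dynamized(std::size_t n_b, std::size_t s, bool tiering)
        : buffer_cap_(n_b), scale_factor_(s), tiering_(tiering) {}

    void insert(const Record &r) {
        buffer_.push_back(r);
        if (buffer_.size() >= buffer_cap_) flush_buffer();
    }
};

With leveling each level holds a single shard that is rebuilt on every flush into it, while tiering defers reconstruction until a level has accumulated $s$ shards, trading some query cost for cheaper inserts.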
@@ -561,7 +561,7 @@ following four elements for use in our dynamization technique,
 \begin{itemize}
  \item A small dynamic buffer into which new records are inserted
  \item A variable growth rate, called the \emph{scale factor}
- \item The ability to attach auxilliary structures to each block
+ \item The ability to attach auxiliary structures to each block
  \item Two different strategies for reconstructing data structures
 \end{itemize}
 This design space and its associated trade-offs will be discussed in
@@ -585,28 +585,28 @@ $N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
 the \emph{mutable buffer}.
 
 \Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
-twice as large as the block the preceeds it There is, however, no reason
+twice as large as the block that precedes it. There is, however, no reason
 why this growth rate couldn't be adjusted. In our system, we make the
 growth rate a user-specified constant called the \emph{scale factor},
 $s$, such that the $i$th level contains $N_B \cdot s^i$ records.
 
-\Paragraph{Auxilliary Structures.} In Section~\ref{ssec:sampling-deletes},
+\Paragraph{Auxiliary Structures.} In Section~\ref{ssec:sampling-deletes},
 we encountered two problems relating to supporting deletes that can be
-resolved through the use of auxilliary structures. First, regardless
+resolved through the use of auxiliary structures. First, regardless
 of whether tagging or tombstones are used, the data structure requires
 support for an efficient point-lookup operation. Many SSIs are tree-based
 and thus support this, but not all data structures do. In such cases,
-the point-lookup operation could be provided by attaching an auxilliary
+the point-lookup operation could be provided by attaching an auxiliary
 hash table to the data structure that maps records to their location in
 the SSI. We use the term \emph{shard} to refer to the combination of a
-block with these optional auxilliary structures.
+block with these optional auxiliary structures.
 
 In addition, the tombstone deletion mechanism requires performing a point
 lookup for every record sampled, to validate that it has not been deleted.
 This introduces a large amount of overhead into the sampling process, as
 this requires searching each block in the structure. One approach that
 can be used to help improve the performance of these searches,
-without requiring as much storage as adding auxilliary hash tables to
+without requiring as much storage as adding auxiliary hash tables to
 every block, is to include bloom filters~\cite{bloom70}. A bloom filter
 is an approximate data structure that answers tests of set membership
 with bounded, single-sided error. These are commonly used in LSM Trees
@@ -687,7 +687,7 @@ s^{i+1}$ records. If tiering is used, each level will contain up to $s$
 SSIs, each with up to $N_B \cdot s^i$ records. The scale factor, $s$,
 controls the rate at which the capacity of each level grows. The
 framework supports deletes using either the tombstone or tagging policy,
-which can be selected by the user acccording to her preference. To support
+which can be selected by the user according to her preference. To support
 these delete mechanisms, each record contains an attached header with
 bits to indicate its tombstone or delete status.
 
@@ -735,7 +735,7 @@ compaction is complete, the delete proportions are checked again, and
 this process is repeated until all levels satisfy the bound.
 
 Following this procedure, inserts have a worst case cost of $I \in
-\Theta(B_M(n))$, equivalent to Bently-Saxe. The amortized cost can be
The amortized cost can be +\Theta(B_M(n))$, equivalent to Bentley-Saxe. The amortized cost can be determined by finding the total cost of reconstructions involving each record and amortizing it over each insert. The cost of the insert is composed of three parts, @@ -773,7 +773,7 @@ cost of a delete is the same as cost of doing a point lookup, as the "delete" itself is simply setting a bit in the header of the record, once it has been located. There will be $\Theta(\log_s n)$ total shards in the structure, each with a look-up cost of $L(n)$ using either the -SSI's native point-lookup, or an auxilliary hash table, and the lookup +SSI's native point-lookup, or an auxiliary hash table, and the lookup must also scan the buffer in $\Theta(N_B)$ time. Thus, the worst-case cost of a tagged delete is, \begin{equation*} @@ -792,7 +792,7 @@ to sample from the unsorted buffer as well. There are two approaches for sampling from the buffer. The most general approach would be to temporarily build an SSI over the records within the buffer, and then treat this is a normal shard for the remainder of the sampling procedure. -In this case, the sampling algorithm remains indentical to the algorithm +In this case, the sampling algorithm remains identical to the algorithm discussed in Section~\ref{ssec:decomposed-structure-sampling}, following the construction of the temporary shard. This results in a worst-case sampling cost of, -- cgit v1.2.3