Updates

author: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-13 17:29:40 -0400
committer: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-13 17:29:40 -0400
commit: 40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb (patch)
tree: c00441b058255de08a32d227ce7af46bf11d8eb8 /chapters/sigmod23
parent: 5ffc53e69e956054fdefd1fe193e00eee705dcab (diff)
download: dissertation-40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb.tar.gz
8 files changed, 47 insertions, 47 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index b4ccbf1..af3b80a 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -37,12 +37,12 @@ have \emph{statistical independence} and for the distribution of records
 in the sample set to match the distribution of source data set. This
 requires that the sampling of a record does not affect the probability of
 any other record being sampled in the future. Such sample sets are said
-to be drawn i.i.d (idendepently and identically distributed). Throughout
+to be drawn i.i.d (independently and identically distributed). Throughout
 this chapter, the term "independent" will be used to describe both
 statistical independence, and identical distribution.
 
 Independence of sample sets is important because many useful statistical
-results are derived from assumping that the condition holds. For example,
+results are derived from assuming that the condition holds. For example,
 it is a requirement for the application of statistical tools such as
 the Central Limit Theorem~\cite{bulmer79}, which is the basis for many
 concentration bounds.  A failure to maintain independence in sampling
@@ -54,7 +54,7 @@ sampling} (IQS)~\cite{hu14}. In IQS, a sample set is constructed from a
 specified number of records in the result set of a database query. In
 this context, it isn't enough to ensure that individual records are
 sampled independently; the sample sets from repeated queries must also be
-indepedent. This precludes, for example, caching and returning the same
+independent. This precludes, for example, caching and returning the same
 sample set to multiple repetitions of the same query. This inter-query
 independence provides a variety of useful properties, such as fairness
 and representativeness of query results~\cite{tao22}.
@@ -194,7 +194,7 @@ call static sampling indices (SSIs) in this chapter,\footnote{
   is based, which was published prior to our realization that a strong
   distinction between an index and a data structure would be useful. I
   am retaining the term SSI in this chapter for consistency with the
-  original paper, but understand that in the termonology established in
+  original paper, but understand that in the terminology established in
   Chapter~\ref{chap:background}, SSIs are data structures, not indices.
 },
 that are capable of answering sampling queries more efficiently than
@@ -216,7 +216,7 @@ per sample.  Thus, a WSS query can be answered in $\Theta(k)$ time,
 assuming the structure has already been built. Unfortunately, the alias
 structure cannot be efficiently updated, as inserting new records would
 change the relative weights of \emph{all} the records, and require fully
-repartitioning the structure.
+re-partitioning the structure.
 
 While the alias method only applies to WSS, other sampling problems can
 be solved by using the alias method within the context of a larger data
@@ -245,7 +245,7 @@ the alias structure with support for weight updates over a fixed set of
 elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not
 allow the insertion or removal of new records, however, only in-place
 weight updates. While in principle they could be constructed over the
-entire domain of possible records, with the weights of non-existant
+entire domain of possible records, with the weights of non-existent
 records set to $0$, this is hardly practical. Thus, these structures are
 not suited for the database sampling applications that are of interest to
 us in this chapter.
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
index da62766..5585c36 100644
--- a/chapters/sigmod23/exp-baseline.tex
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -5,7 +5,7 @@ Olken's method on an aggregate B+Tree. We also examine the query performance
 of a single instance of the SSI in question to establish how much query
 performance is lost in the dynamization. Unless otherwise specified,
 IRS and WIRS queries are run with a selectivity of $0.1\%$. Additionally,
-the \texttt{OSM} dataset was downsampled to 500 million records, except
+the \texttt{OSM} dataset was down-sampled to 500 million records, except
 for scalability tests. The synthetic uniform and zipfian datasets were
 generated with 1 billion records. As with the previous section, all
 benchmarks began by warming up the structure with $10\%$ of the total
@@ -50,13 +50,13 @@ resulting in better performance.
 \end{figure*}
 
 In Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} we examine
-the performed of \texttt{DE-WIRS} compared to \text{AGG B+TreE} and an
+the performed of \texttt{DE-WIRS} compared to \text{AGG B+tree} and an
 alias-augmented B+Tree. We see the same basic set of patterns in this
 case as we did with WSS. \texttt{AGG B+Tree} defeats our dynamized
 index on the \texttt{twitter} dataset, but loses on the others, in
 terms of insertion performance. We can see that the alias-augmented
 B+Tree is much more expensive to build than an alias structure, and
-so its insertion performance advantage is erroded somewhat compared to
+so its insertion performance advantage is eroded somewhat compared to
 the dynamic structure.  For queries we see that the \texttt{AGG B+Tree}
 performs similarly for WIRS sampling as it did for WSS sampling, but the
 alias-augmented B+Tree structure is quite a bit slower at WIRS than the
@@ -82,7 +82,7 @@ being introduced by the dynamization.
 We next considered IRS queries. Figures~\ref{fig:irs-insert1} and
 \ref{fig:irs-sample1} show the results of our testing of single-threaded
 \texttt{DE-IRS} running in-memory against the in-memory ISAM Tree and
-\texttt{AGG B+treE}. The ISAM tree structure can be efficiently bulk-loaded,
+\texttt{AGG B+tree}. The ISAM tree structure can be efficiently bulk-loaded,
 which results in a much faster construction time than the alias structure
 or alias-augmented B+tree. This gives it a significant update performance 
 advantage, and we see in Figure~\ref{fig:irs-insert1} that \texttt{DE-IRS}
@@ -96,7 +96,7 @@ the performance differences.
 We also consider the scalability of inserts, queries, and deletes, of
 \texttt{DE-IRS} compared to \texttt{AGG B+tree} across a wide range of
 data sizes. Figure~\ref{fig:irs-insert-s} shows that \texttt{DE-IRS}'s
-insertion performance scales similarly with datasize as the baseline, and
+insertion performance scales similarly with data size as the baseline, and
 Figure~\ref{fig:irs-sample-s} tells a similar story for query performance.
 Figure~\ref{fig:irs-delete-s} compares the delete performance of the
 two structures, where \texttt{DE-IRS} is configured to use tagging. As
@@ -110,7 +110,7 @@ the B+tree is superior to \texttt{DE-IRS} because of the cost of the
 preliminary processing that our dynamized structure must do to begin
 to answer queries. However, as the sample set size increases, this cost
 increasingly begins to pay off, with \texttt{DE-IRS} quickly defeating
-the dynamic structure in averge per-sample latency. One other interesting
+the dynamic structure in average per-sample latency. One other interesting
 note is the performance of the static ISAM tree, which begins on-par with
 the B+Tree, but also sees an improvement as the sample set size increases.
 This is because of cache effects. During the initial tree traversal, both
diff --git a/chapters/sigmod23/exp-extensions.tex b/chapters/sigmod23/exp-extensions.tex
index 62f15f4..3d3f5b7 100644
--- a/chapters/sigmod23/exp-extensions.tex
+++ b/chapters/sigmod23/exp-extensions.tex
@@ -49,9 +49,9 @@ as additional insertion threads are added. Both plots show linear scaling
 up to 3 or 4 threads, before the throughput levels off. Further, even
 with as many as 32 threads, the system is able to maintain a stable
 insertion throughput. Note that this implementation of concurrency
-is incredibly rudamentary, and doesn't take advantage of concurrent
+is incredibly rudimentary, and doesn't take advantage of concurrent
 merging opportunities, among other things. An implementation with
 support for this will be discussed in Chapter~\ref{chap:tail-latency},
-and shown to perform significantly better. Even with this rudamentary
+and shown to perform significantly better. Even with this rudimentary
 implementation of concurrency, however, \texttt{DE-IRS} is able to
 outperform \texttt{AB-tree} under all conditions tested.
diff --git a/chapters/sigmod23/exp-parameter-space.tex b/chapters/sigmod23/exp-parameter-space.tex
index d53c592..9583312 100644
--- a/chapters/sigmod23/exp-parameter-space.tex
+++ b/chapters/sigmod23/exp-parameter-space.tex
@@ -62,7 +62,7 @@ operations) reducing their effect on the overall throughput.
 
 The influence of scale factor on update performance is shown in
 Figure~\ref{fig:insert_sf}. The effect is different depending on the
-layout policy, with larger scale factors benefitting update performance
+layout policy, with larger scale factors benefiting update performance
 under tiering, and hurting it under leveling. The effect of the mutable
 buffer size on insertion, shown in Figure~\ref{fig:insert_mt}, is a little
 less clear, but does show a slight upward trend, with larger buffers
@@ -86,7 +86,7 @@ effect on query performance. Thus, in this context, is would appear
 that the scale factor is primarily useful as an insertion performance
 tuning tool. The mutable buffer size, in Figure~\ref{fig:sample_mt},
 also generally has no clear effect. This is expected, because the buffer
-contains onyl a small number of records relative to the entire dataset,
+contains only a small number of records relative to the entire dataset,
 and so has a fairly low probability of being selected for drawing
 a sample from. Even when it is selected, rejection sampling is very
 inexpensive. The one exception to this trend is when using tombstones,
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 4dbb4c2..727284a 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -28,7 +28,7 @@ added to records when testing dynamic baselines. Additionally, weighted
 testing attached a 64-bit integer weight to each record. This weight was
 not included in the record for non-weighted testing. The weights and
 keys were both used directly from the datasets, and values were added
-seperately and unique to each record.
+separately and unique to each record.
 
 We used the following datasets for testing,
 \begin{itemize}
@@ -75,13 +75,13 @@ method on an AGG-BTree.
 
 \item \textbf{DE-WIRS.} An implementation of the dynamized alias-augmented
 B+Tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
-weighted indepedent range sampling. We compare this against a WIRS
+weighted independent range sampling. We compare this against a WIRS
 implementation of Olken's method on an AGG-BTree.
 
 \end{itemize}
 
 All of the tested structures, with the exception of the external memory
-DE-IRS implementation and AB-Tree, were wholely contained within system
+DE-IRS implementation and AB-Tree, were wholly contained within system
 memory. AB-Tree is a native external structure, so for the in-memory
 concurrency evaluation we configured it with enough cache to maintain
 the entire structure in memory to simulate an in-memory implementation.\footnote{
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 06d55a5..3a3cba3 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -9,7 +9,7 @@ concurrency and external data structures.
 \subsection{External Data Structures}
 \label{ssec:ext-external}
 
-Our dynamization techniques can easily accomodate external data structures
+Our dynamization techniques can easily accommodate external data structures
 as well as in-memory ones. To demonstrate this, we have implemented
 a dynamized version of an external ISAM tree for use in answering IRS
 queries. The mutable buffer remains an unsorted array in memory, however
@@ -46,7 +46,7 @@ file or a Spark RDD, and a centralized control node can manage the
 mutable buffer. Flushing this buffer would create a new file/RDD, and
 reconstructions could likewise be performed by creating new immutable
 structures through the merging of existing ones, using the same basic
-scheme as has already been discussed in this chapter.  Using thes tools,
+scheme as has already been discussed in this chapter.  Using these tools,
 SSIs over datasets that exceed the capacity of a single node could be
 supported. Such distributed SSIs do exist, such as the RDD-based sampling
 structure using in XDB~\cite{li19}.
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 0f3fac8..2f2515b 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -1,7 +1,7 @@
 \section{Dynamization of SSIs} 
 \label{sec:framework}
 
-Our goal, then, is to design a solution to indepedent sampling that is
+Our goal, then, is to design a solution to independent sampling that is
 able to achieve \emph{both} efficient updates and efficient sampling,
 while also maintaining statistical independence both within and between
 IQS queries, and to do so in a generalized fashion without needing to
@@ -98,7 +98,7 @@ define the decomposability conditions for a query sampling problem,
 	\end{enumerate}
 \end{definition}
 
-These two conditions warrant further explaination. The first condition
+These two conditions warrant further explanation. The first condition
 is simply a redefinition of the standard decomposability criteria to
 consider matching the distribution, rather than the exact records in $R$,
 as the correctness condition for the merge process. The second condition
@@ -114,7 +114,7 @@ problems. First, we note that many SSIs have a sampling procedure that
 naturally involves two phases. First, some preliminary work is done
 to determine metadata concerning the set of records to sample from,
 and then $k$ samples are drawn from the structure, taking advantage of
-this metadata.  If we represent the time cost of the prelimary work
+this metadata.  If we represent the time cost of the preliminary work
 with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
 structures query cost functions are of the form,
 
@@ -213,7 +213,7 @@ $k$ records have been sampled.
 \end{example}
 
 Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
-a constant number of repetitions, the cost of answering a decomposible
+a constant number of repetitions, the cost of answering a decomposable
 sampling query having a pre-processing cost of $P(n)$, a weight-determination
 cost of $W(n)$ and a per-sample cost of $S(n)$ will be,
 \begin{equation}
@@ -241,7 +241,7 @@ satisfied by either the search problem or data structure. Unfortunately,
 neither approach can work as a ``drop-in'' solution in the context of
 sampling problems, because of the way that deleted records interact with
 the sampling process itself. Sampling problems, as formalized here,
-are neither invertable, nor deletion decomposable. In this section,
+are neither invertible, nor deletion decomposable. In this section,
 we'll discuss our mechanisms for supporting deletes, as well as how
 these can be handled during sampling while maintaining correctness.
 
@@ -397,7 +397,7 @@ Section~\ref{sec:sampling-implementation}.
 
 \subsubsection{Bounding Rejection Probability}
 
-When a sampled record has been rejected, it must be resampled. This
+When a sampled record has been rejected, it must be re-sampled. This
 introduces performance overhead resulting from extra memory access and
 random number generations, and hurts our ability to provide performance
 bounds on our sampling operations. In the worst case, a structure
@@ -413,7 +413,7 @@ we have the opportunity to remove deleted records. This will cause the
 record counts associated with each block of the structure to gradually
 drift out of alignment with the "perfect" powers of two associated with
 the Bentley-Saxe method, however. In the theoretical literature on this
-topic, the solution to this problem is to periodically repartition all of
+topic, the solution to this problem is to periodically re-partition all of
 the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
 approach could also be easily applied here, if desired, though we
 do not in our implementations, for reasons that will be discussed in
@@ -449,7 +449,7 @@ is not sufficient.
 Fortunately, this passive system can be used as the basis for a
 system that does provide a bound. This is because it guarantees,
 whether tagging or tombstones are used, that any given deleted
-record will \emph{eventually} be cancelled out after a finite number
+record will \emph{eventually} be canceled out after a finite number
 of reconstructions. If the number of deleted records gets too high,
 some or all of these deleted records can be cleared out by proactively
 performing reconstructions. We call these proactive reconstructions
@@ -524,7 +524,7 @@ that is both tunable, and generally more performant, at the cost of some
 additional theoretical complexity. There has been some theoretical work
 in this area, based upon nesting instances of the equal block method
 within the Bentley-Saxe method~\cite{overmars81}, but these methods are
-unwieldy and are targetted at tuning the worst-case at the expense of the
+unwieldy and are targeted at tuning the worst-case at the expense of the
 common case. We will take a different approach to adding configurability
 to our dynamization system.
 
@@ -532,7 +532,7 @@ Though it has thus far gone unmentioned, some readers may have
 noted the astonishing similarity between decomposition-based
 dynamization techniques, and a data structure called the Log-structured
 Merge-tree. First proposed by O'Neil in the mid '90s\cite{oneil96},
-the LSM Tree was designed to optmize write throughout for external data
+the LSM Tree was designed to optimize write throughout for external data
 structures. It accomplished this task by buffer inserted records in a
 small in-memory AVL Tree, and then flushing this buffer to disk when
 it filled up. The flush process itself would fully rebuild the on-disk
@@ -543,13 +543,13 @@ layered, external structures, to reduce the cost of reconstruction.
 In more recent times, the LSM Tree has seen significant development and
 been used as the basis for key-value stores like RocksDB~\cite{dong21}
 and LevelDB~\cite{leveldb}. This work has produced an incredibly large
-and well explored parameterization of the reconstruction procedures of
+and well explored parametrization of the reconstruction procedures of
 LSM Trees, a good summary of which can be bound in this recent tutorial
 paper~\cite{sarkar23}. Examples of this design space exploration include:
 different ways to organize each "level" of the tree~\cite{dayan19,
-dostoevsky, autumn}, different growth rates, buffering, sub-partioning
+dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
 of structures to allow finer-grained reconstruction~\cite{dayan22}, and
-approaches for allocating resources to auxilliary structures attached to
+approaches for allocating resources to auxiliary structures attached to
 the main ones for accelerating certain types of query~\cite{dayan18-1,
 zhu21, monkey}.
 
@@ -561,7 +561,7 @@ following four elements for use in our dynamization technique,
 \begin{itemize}
 	\item A small dynamic buffer into which new records are inserted
 	\item A variable growth rate, called as \emph{scale factor}
-	\item The ability to attach auxilliary structures to each block
+	\item The ability to attach auxiliary structures to each block
 	\item Two different strategies for reconstructing data structures
 \end{itemize}
 This design space and its associated trade-offs will be discussed in
@@ -585,28 +585,28 @@ $N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
 the \emph{mutable buffer}.
 
 \Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
-twice as large as the block the preceeds it There is, however, no reason
+twice as large as the block the precedes it There is, however, no reason
 why this growth rate couldn't be adjusted. In our system, we make the
 growth rate a user-specified constant called the \emph{scale factor},
 $s$, such that the $i$th level contains $N_B \cdot s^i$ records.
 
-\Paragraph{Auxilliary Structures.} In Section~\ref{ssec:sampling-deletes},
+\Paragraph{Auxiliary Structures.} In Section~\ref{ssec:sampling-deletes},
 we encountered two problems relating to supporting deletes that can be
-resolved through the use of auxilliary structures. First, regardless
+resolved through the use of auxiliary structures. First, regardless
 of whether tagging or tombstones are used, the data structure requires
 support for an efficient point-lookup operation. Many SSIs are tree-based
 and thus support this, but not all data structures do. In such cases,
-the point-lookup operation could be provided by attaching an auxilliary
+the point-lookup operation could be provided by attaching an auxiliary
 hash table to the data structure that maps records to their location in
 the SSI.  We use term \emph{shard} to refer to the combination of a
-block with these optional auxilliary structures.
+block with these optional auxiliary structures.
 
 In addition, the tombstone deletion mechanism requires performing a point
 lookup for every record sampled, to validate that it has not been deleted.
 This introduces a large amount of overhead into the sampling process,
 as this requires searching each block in the structure. One approach
 that can be used to help improve the performance of these searches,
-without requiring as much storage as adding auxilliary hash tables to
+without requiring as much storage as adding auxiliary hash tables to
 every block, is to include bloom filters~\cite{bloom70}. A bloom filter
 is an approximate data structure that answers tests of set membership
 with bounded, single-sided error. These are commonly used in LSM Trees
@@ -687,7 +687,7 @@ s^{i+1}$ records. If tiering is used, each level will contain up to
 $s$ SSIs, each with up to $N_B \cdot s^i$ records. The scale factor,
 $s$, controls the rate at which the capacity of each level grows. The
 framework supports deletes using either the tombstone or tagging policy,
-which can be selected by the user acccording to her preference. To support
+which can be selected by the user according to her preference. To support
 these delete mechanisms, each record contains an attached header with
 bits to indicate its tombstone or delete status.
 
@@ -735,7 +735,7 @@ compaction is complete, the delete proportions are checked again, and
 this process is repeated until all levels satisfy the bound.
 
 Following this procedure, inserts have a worst case cost of $I \in
-\Theta(B_M(n))$, equivalent to Bently-Saxe. The amortized cost can be
+\Theta(B_M(n))$, equivalent to Bentley-Saxe. The amortized cost can be
 determined by finding the total cost of reconstructions involving each
 record and amortizing it over each insert. The cost of the insert is
 composed of three parts,
@@ -773,7 +773,7 @@ cost of a delete is the same as cost of doing a point lookup, as the
 "delete" itself is simply setting a bit in the header of the record,
 once it has been located. There will be $\Theta(\log_s n)$ total shards
 in the structure, each with a look-up cost of $L(n)$ using either the
-SSI's native point-lookup, or an auxilliary hash table, and the lookup
+SSI's native point-lookup, or an auxiliary hash table, and the lookup
 must also scan the buffer in $\Theta(N_B)$ time. Thus, the worst-case
 cost of a tagged delete is,
 \begin{equation*}
@@ -792,7 +792,7 @@ to sample from the unsorted buffer as well. There are two approaches
 for sampling from the buffer. The most general approach would be to
 temporarily build an SSI over the records within the buffer, and then
 treat this is a normal shard for the remainder of the sampling procedure.
-In this case, the sampling algorithm remains indentical to the algorithm
+In this case, the sampling algorithm remains identical to the algorithm
 discussed in Section~\ref{ssec:decomposed-structure-sampling}, following
 the construction of the temporary shard. This results in a worst-case
 sampling cost of,
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index befdbba..1a33c2e 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -3,7 +3,7 @@
 Having discussed the relevant background materials, we will now turn to a
 discussion of our first attempt to address the limitations of dynamization
 in the context of one particular class of non-decomposable search problem:
-indepedent random sampling. We've already discussed one representative
+independent random sampling. We've already discussed one representative
 problem of this class, independent range sampling, and shown how it is
 not traditionally decomposable. This specific problem is one of several
 very similar types of problem, however, and in this chapter we will also
@@ -21,7 +21,7 @@ problems is limited by the techniques used within databases to implement
 them. Existing implementations tend to sacrifice either performance,
 by requiring the entire result set of be materialized prior to applying
 Bernoulli sampling, or statistical independence. There exists techniques
-for obtaining both sampling performance and indepedence by leveraging
+for obtaining both sampling performance and independence by leveraging
 existing B+Tree indices with slight modification~\cite{olken-thesis},
 but even this technique has worse sampling performance than could be
 achieved using specialized static sampling indices.