author    Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 18:21:32 -0400
committer Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 18:21:32 -0400
commit    0dc1a8ea20820168149cedaa14e223d4d31dc4b6 (patch)
tree      2bc726803cf6de6d669958b1f5a79cde59722e00 /chapters/sigmod23
parent    0fff4753fac809a6ba17f428df3a041cebe692e0 (diff)
updates
Diffstat (limited to 'chapters/sigmod23')
-rw-r--r--	chapters/sigmod23/background.tex	8
-rw-r--r--	chapters/sigmod23/framework.tex	18
2 files changed, 13 insertions, 13 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 984e36c..8d3a88f 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -21,8 +21,8 @@ set; the specific usage should be clear from context.
In each of the problems considered, sampling can be performed either
with-replacement or without-replacement. Sampling with-replacement
means that a record that has been included in the sample set for a given
-sampling query is "replaced" into the dataset and allowed to be sampled
-again. Sampling without-replacement does not "replace" the record,
+sampling query is ``replaced'' into the dataset and allowed to be sampled
+again. Sampling without-replacement does not ``replace'' the record,
and so each individual record can only be included within a sample
set once for a given query. The data structures that will be discussed
support sampling with-replacement, and sampling without-replacement can
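The distinction drawn in this hunk can be illustrated with a short sketch (not from the dissertation; names here are illustrative). Python's standard library exposes both modes directly, and a without-replacement sample can be built on top of a with-replacement primitive by rejecting records already in the sample set, as the surrounding text suggests:

```python
import random

data = ["a", "b", "c", "d", "e"]

# With-replacement: the same record may appear multiple times.
with_repl = random.choices(data, k=3)

# Without-replacement: each record appears at most once.
without_repl = random.sample(data, k=3)

# Emulating without-replacement sampling on top of a
# with-replacement primitive by rejecting duplicates.
def sample_without_replacement(records, k):
    seen = set()
    while len(seen) < k:
        seen.add(random.choice(records))  # with-replacement draw
    return list(seen)

print(sample_without_replacement(data, 3))
```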
@@ -38,7 +38,7 @@ in the sample set to match the distribution of source data set. This
requires that the sampling of a record does not affect the probability of
any other record being sampled in the future. Such sample sets are said
to be drawn i.i.d. (independently and identically distributed). Throughout
-this chapter, the term "independent" will be used to describe both
+this chapter, the term ``independent'' will be used to describe both
statistical independence, and identical distribution.
Independence of sample sets is important because many useful statistical
@@ -192,7 +192,7 @@ requiring greater than $k$ traversals to obtain a sample set of size $k$.
\Paragraph{Static Solutions.}
There are also a large number of static data structures, which we'll
call static sampling indices (SSIs) in this chapter,\footnote{
- We used the term "SSI" in the original paper on which this chapter
+ We used the term ``SSI'' in the original paper on which this chapter
is based, which was published prior to our realization that a strong
distinction between an index and a data structure would be useful. I
am retaining the term SSI in this chapter for consistency with the
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 218c290..b413802 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -252,7 +252,7 @@ of the major limitations of the ghost structure approach for handling
deletes is that there is not a principled method for removing deleted
records from the decomposed structure. The standard approach is to set an
arbitrary number of delete records, and rebuild the entire structure when
-this threshold is crossed~\cite{saxe79}. Mixing the "ghost" records into
+this threshold is crossed~\cite{saxe79}. Mixing the ``ghost'' records into
the same structures as the original records allows for deleted records
to naturally be cleaned up over time as they meet their tombstones during
reconstructions using a technique called tombstone cancellation. This
@@ -280,7 +280,7 @@ mechanism.
The cost of using a tombstone delete in a Bentley-Saxe dynamization is
the same as a simple insert,
\begin{equation*}
-\mathscr{D}(n)_A \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
+D_A(n) \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
\end{equation*}
with the worst-case cost being $\Theta(B(n))$. Note that there is also
a minor performance effect resulting from deleted records appearing
@@ -309,7 +309,7 @@ on a Bentley-Saxe decomposition of that SSI will require, at worst,
executing a point-lookup on each block, with a total cost of
\begin{equation*}
-\mathscr{D}(n) \in \Theta\left( L(n) \log_2 (n)\right)
+D(n) \in \Theta\left( L(n) \log_2 (n)\right)
\end{equation*}
If the SSI being considered does \emph{not} support an efficient
@@ -391,7 +391,7 @@ a natural way of controlling the number of deleted records within the
structure, and thereby bounding the rejection rate. During reconstruction,
we have the opportunity to remove deleted records. This will cause the
record counts associated with each block of the structure to gradually
-drift out of alignment with the "perfect" powers of two associated with
+drift out of alignment with the ``perfect'' powers of two associated with
the Bentley-Saxe method, however. In the theoretical literature on this
topic, the solution to this problem is to periodically re-partition all of
the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
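The periodic re-partitioning described in this hunk can be sketched as follows (an illustrative sketch, not the framework's code, assuming records can simply be recombined and re-split): all records are gathered and redistributed into blocks whose sizes follow the binary decomposition of $n$, restoring the power-of-two layout.

```python
def repartition(blocks):
    """Recombine all records and re-split them into blocks whose
    sizes follow the binary decomposition of n, restoring the
    'perfect' power-of-two layout of the Bentley-Saxe method."""
    records = sorted(r for b in blocks for r in b)
    n = len(records)
    new_blocks, offset, i = [], 0, 0
    while n:
        if n & 1:                 # bit i set: emit a block of 2^i records
            size = 1 << i
            new_blocks.append(records[offset:offset + size])
            offset += size
        n >>= 1
        i += 1
    return new_blocks

# Block sizes that drifted after deletes: 3 + 3 = 6 records total.
drifted = [[5, 9, 1], [7, 2, 8]]
print([len(b) for b in repartition(drifted)])  # 6 = 0b110 -> blocks of 2 and 4
```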
@@ -450,7 +450,7 @@ deleted records involved in the reconstruction will be dropped. Tombstones
may require multiple cascading rounds of compaction to occur, because a
tombstone record will only cancel when it encounters the record that it
deletes. However, because tombstones always follow the record they
-delete in insertion order, and will therefore always be "above" that
+delete in insertion order, and will therefore always be ``above'' that
record in the structure, each reconstruction will move every tombstone
involved closer to the record it deletes, ensuring that eventually the
bound will be satisfied.
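Tombstone cancellation during a reconstruction can be sketched as below (a minimal sketch under stated assumptions, not the framework's API: records are identified by key, and a tombstone cancels exactly one matching record that precedes it in insertion order). Because tombstones always sit in newer blocks than the records they delete, scanning newest-first lets each tombstone cancel on contact:

```python
from collections import Counter

def reconstruct(blocks):
    """Merge blocks (newest first) and cancel tombstones against
    the records they delete. Each record is a (key, is_tombstone)
    tuple; a tombstone cancels one matching record in an older
    block, since tombstones always follow the record they delete."""
    pending = Counter()   # tombstones still seeking their record
    surviving = []
    for block in blocks:              # newest block first
        for key, is_tombstone in block:
            if is_tombstone:
                pending[key] += 1
            elif pending[key] > 0:
                pending[key] -= 1     # record meets its tombstone: cancel
            else:
                surviving.append((key, False))
    # Tombstones whose record lives outside this reconstruction
    # must be retained for a later round of compaction.
    leftover = [(k, True) for k, n in pending.items() for _ in range(n)]
    return surviving + leftover

blocks = [
    [(3, True), (4, False)],          # newest: tombstone for key 3
    [(1, False), (2, False), (3, False)],
]
print(reconstruct(blocks))
```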
@@ -526,7 +526,7 @@ and LevelDB~\cite{leveldb}. This work has produced an incredibly
large and well-explored parametrization of the reconstruction
procedures of LSM trees, a good summary of which can be found in
this recent tutorial paper~\cite{sarkar23}. Examples of this design
-space exploration include: different ways to organize each "level"
+space exploration include: different ways to organize each ``level''
of the tree~\cite{dayan19, dostoevsky, autumn}, different growth
rates, buffering, sub-partitioning of structures to allow finer-grained
reconstruction~\cite{dayan22}, and approaches for allocating resources to
@@ -739,19 +739,19 @@ Assuming that $N_B \ll n$, the first two terms of this expression are
constant. Dropping them and amortizing the result over $n$ records gives
us the amortized insertion cost,
\begin{equation*}
-I_a(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
+I_A(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
\end{equation*}
If the SSI being considered does not support a more efficient
construction procedure from other instances of the same SSI, and
the general Bentley-Saxe \texttt{unbuild} and \texttt{build}
-operations must be used, the the cost becomes $I_a(n) \in
+operations must be used, the cost becomes $I_A(n) \in
\Theta\left(\frac{B(n)}{n}\log_s(n)\right)$ instead.
\Paragraph{Delete.} The framework supports both tombstone and tagged
deletes, each with different performance. Using tombstones, the cost
of a delete is identical to that of an insert. When using tagging, the
cost of a delete is the same as the cost of a point lookup, because the
-"delete" itself only sets a bit in the header of the record,
+``delete'' itself only sets a bit in the header of the record,
once it has been located. There will be $\Theta(\log_s n)$ total shards
in the structure, each with a look-up cost of $L(n)$ using either the
SSI's native point-lookup, or an auxiliary hash table, and the lookup
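The tagged-delete path described in this final hunk can be sketched as follows (an illustrative sketch, assuming each shard offers a point lookup, modeled here as a plain dict; the class and function names are hypothetical). A delete probes the $\Theta(\log_s n)$ shards newest-first and sets a delete bit in the matching record's header, for a total cost of $\Theta(L(n) \log_s n)$:

```python
class Record:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.deleted = False  # the "delete" tag bit in the record header

def tagged_delete(shards, key):
    """Point-lookup the key in each shard (newest first) and tag
    the record as deleted. With Theta(log_s n) shards and a
    per-shard lookup cost of L(n), the total cost is
    Theta(L(n) log_s n); the record itself is not removed."""
    for shard in shards:           # newest shard first
        rec = shard.get(key)       # the L(n) point lookup (O(1) for a dict)
        if rec is not None and not rec.deleted:
            rec.deleted = True
            return True
    return False                   # key not present

shards = [
    {2: Record(2, "b")},                     # newest shard
    {1: Record(1, "a"), 3: Record(3, "c")},  # older shard
]
tagged_delete(shards, 3)
print(shards[1][3].deleted)   # the record is tagged, not removed
```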