| field | value | date |
|---|---|---|
| author | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-06-01 13:15:52 -0400 |
| committer | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-06-01 13:15:52 -0400 |
| commit | cd3447f1cad16972e8a659ec6e84764c5b8b2745 (patch) | |
| tree | 5a50b6e8a99646e326b2c41714f50e4f7dee64d0 /chapters/sigmod23 | |
| parent | 6354e60f106a89f5bf807082561ed5efd9be0f4f (diff) | |
| download | dissertation-cd3447f1cad16972e8a659ec6e84764c5b8b2745.tar.gz | |
Julia updates
Diffstat (limited to 'chapters/sigmod23')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | chapters/sigmod23/background.tex | 18 |
| -rw-r--r-- | chapters/sigmod23/examples.tex | 2 |
| -rw-r--r-- | chapters/sigmod23/exp-baseline.tex | 2 |
| -rw-r--r-- | chapters/sigmod23/exp-parameter-space.tex | 12 |
| -rw-r--r-- | chapters/sigmod23/experiment.tex | 2 |
| -rw-r--r-- | chapters/sigmod23/extensions.tex | 4 |
| -rw-r--r-- | chapters/sigmod23/framework.tex | 134 |
7 files changed, 90 insertions, 84 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index af3b80a..d600c27 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -19,16 +19,16 @@
 is used to indicate the selection of either a single sample or a sample
 set; the specific usage should be clear from context. In each of the
 problems considered, sampling can be performed either
-with replacement or without replacement. Sampling with replacement
+with-replacement or without-replacement. Sampling with-replacement
 means that a record that has been included in the sample set for a given
 sampling query is "replaced" into the dataset and allowed to be sampled
-again. Sampling without replacement does not "replace" the record,
+again. Sampling without-replacement does not "replace" the record,
 and so each individual record can only be included within the a sample
 set once for a given query. The data structures that will be discussed
-support sampling with replacement, and sampling without replacement can
-be implemented using a constant number of with replacement sampling
+support sampling with-replacement, and sampling without-replacement can
+be implemented using a constant number of with-replacement sampling
 operations, followed by a deduplication step~\cite{hu15}, so this chapter
-will focus exclusive on the with replacement case.
+will focus exclusively on the with-replacement case.
 
 \subsection{Independent Sampling Problem}
 
@@ -115,8 +115,10 @@
 of problems that will be directly addressed within this chapter.
 
 Relational database systems often have native support for IQS using
 SQL's \texttt{TABLESAMPLE} operator~\cite{postgress-doc}. However, the
-algorithms used to implement this operator have significant limitations:
-users much choose between statistical independence or performance.
+algorithms used to implement this operator have significant limitations
+and do not allow users to maintain statistical independence of the results
+without also running the query to be sampled from in full. Thus, users must
+choose between independence and performance.
 
 To maintain statistical independence, Bernoulli sampling is used. This
 technique requires iterating over every record in the result set of the
@@ -240,7 +242,7 @@
 Tao~\cite{tao22}.
 
 There also exist specialized data structures with support for both
 efficient sampling and updates~\cite{hu14}, but these structures have
 poor constant factors and are very complex, rendering them of little
-practical utility. Additionally, efforts have been made to extended
+practical utility. Additionally, efforts have been made to extend
 the alias structure with support for weight updates over a fixed set
 of elements~\cite{hagerup93,matias03,allendorf23}. These approaches do
 not allow the insertion or removal of new records, however, only in-place
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index 38df04d..4e7f9ac 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -25,7 +25,7 @@
 number of shards involved in a reconstruction using either layout policy
 is $\Theta(1)$ using our framework, this means that we can perform
 reconstructions in $B_M(n) \in \Theta(n)$ time, including tombstone
 cancellation. The total weight of the structure can also be calculated
-at no time when it is constructed, allows $W(n) \in \Theta(1)$ time
+at no additional time cost when it is constructed, allowing $W(n) \in \Theta(1)$ time
 as well.
 
 Point lookups over the sorted data can be done using a binary search
 in $L(n) \in \Theta(\log_2 n)$ time, and sampling queries require no
 pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer can be
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
index 5585c36..d0e1ce0 100644
--- a/chapters/sigmod23/exp-baseline.tex
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -73,7 +73,7 @@
 being introduced by the dynamization.
 
 \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-insert} \label{fig:irs-insert1}}
 \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-sample} \label{fig:irs-sample1}} \\
- \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete}}
+ \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete-s}}
 \subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-samplesize} \label{fig:irs-samplesize}}
 
 \caption{Framework Comparison to Baselines for IRS}
diff --git a/chapters/sigmod23/exp-parameter-space.tex b/chapters/sigmod23/exp-parameter-space.tex
index 9583312..1e51d8c 100644
--- a/chapters/sigmod23/exp-parameter-space.tex
+++ b/chapters/sigmod23/exp-parameter-space.tex
@@ -2,11 +2,11 @@
 \label{ssec:ds-exp}
 
 Our proposed framework has a large design space, which we briefly
-described in Section~\ref{ssec:design-space}. The contents of this
-space will be described in much more detail in Chapter~\ref{chap:design-space},
-but as part of this work we did perform an experimental examination of our
-framework to compare insertion throughput and query latency over various
-points within the space.
+described in Section~\ref{ssec:sampling-design-space}. The
+contents of this space will be described in much more detail in
+Chapter~\ref{chap:design-space}, but as part of this work we did perform
+an experimental examination of our framework to compare insertion
+throughput and query latency over various points within the space.
 
 We examined this design space by considering \texttt{DE-WSS} specifically,
 using a random sample of $500,000,000$ records from the \texttt{OSM}
@@ -48,7 +48,7 @@
 performance, with tiering outperforming leveling for both delete
 policies. The next largest effect was the delete policy selection,
 with tombstone deletes outperforming tagged deletes in insertion
 performance. This result aligns with the asymptotic analysis of the two
-approaches in Section~\ref{sampling-deletes}. It is interesting to note
+approaches in Section~\ref{ssec:sampling-deletes}. It is interesting to note
 however that the effect of layout policy was more significant in these
 particular tests,\footnote{
 Although the largest performance gap in absolute terms was between
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 727284a..1eb704c 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -53,7 +53,7 @@
 uninteresting key distributions.
 
 \Paragraph{Structures Compared.} As a basis of comparison, we tested
 both our dynamized SSI implementations, and existing dynamic baselines,
-for each sampling problem considered. Specifically, we consider a the
+for each sampling problem considered. Specifically, we consider the
 following dynamized structures,
 \begin{itemize}
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 3a3cba3..3304b76 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -56,7 +56,7 @@
 structure using in XDB~\cite{li19}.
 
 Because our dynamization technique is built on top of static data
 structures, a limited form of concurrency support is straightforward to
-implement. To that end, created a proof-of-concept dynamization of an
+implement. To that end, we created a proof-of-concept dynamization of an
 ISAM Tree for IRS based on a simplified version of a general concurrency
 controlled scheme for log-structured data stores~\cite{golan-gueta15}.
@@ -79,7 +79,7 @@
 accessing them have finished.
 
 The buffer itself is an unsorted array, so a query can capture a
 consistent and static version by storing the tail pointer at the
 time the query begins. New inserts can be performed concurrently by doing
-a fetch-and-and on the tail. By using multiple buffers, inserts and
+a fetch-and-add on the tail. By using multiple buffers, inserts and
 reconstructions can proceed, to some extent, in parallel, which helps
 to hide some of the insertion tail latency due to blocking on
 reconstructions during a buffer flush.
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 256d127..804194b 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -50,6 +50,7 @@
 on the query being sampled from.
 
 Based on these observations, we can define the decomposability
 conditions for a query sampling problem,
 \begin{definition}[Decomposable Sampling Problem]
+ \label{def:decomp-sampling}
 A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
 \mathbb{Z}^+ \to \mathcal{R}$) is decomposable if and only if the
 following conditions are met for all $q \in \mathcal{Q},
@@ -78,12 +79,14 @@
 These two conditions warrant further explanation. The first condition
 is simply a redefinition of the standard decomposability criteria to
 consider matching the distribution, rather than the exact records in
 $R$, as the correctness condition for the merge process. The second condition
-handles a necessary property of the underlying search problem being
-sampled from. Note that this condition is \emph{stricter} than normal
-decomposability for $F$, and essentially requires that the query being
-sampled from return a set of records, rather than an aggregate value or
-some other result that cannot be meaningfully sampled from. This condition
-is satisfied by predicate-filtering style database queries, among others.
+addresses the search problem from which results are to be sampled. Not all
+search problems admit sampling of this sort--for example, an aggregation
+query that returns a single result. This condition essentially requires
+that the search problem being sampled from return a set of records, rather
+than an aggregate value or some other result that cannot be meaningfully
+sampled from. This condition is satisfied by predicate-filtering style
+database queries, among others. However, it should be noted that this
+condition is \emph{stricter} than normal decomposability.
 
 With these definitions in mind, let's turn to solving these query sampling
 problems. First, we note that many SSIs have a sampling procedure that
@@ -120,7 +123,7 @@
 down-sampling combination operator. Secondly, this formulation
 fails to avoid a per-sample dependence on $n$, even in the case
 where $S(n) \in \Theta(1)$. This gets even worse when considering
 rejections that may occur as a result of deleted records. Recall from
-Section~\ref{ssec:background-deletes} that deletion can be supported
+Section~\ref{ssec:dyn-deletes} that deletion can be supported
 using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
 Using either approach, it isn't possible to avoid deleted records in
 advance when sampling, and so these will need to be rejected and retried.
@@ -208,9 +211,8 @@
 or are naturally determined as part of the pre-processing, and thus the
 $W(n)$ term can be merged into $P(n)$.
 
 \subsection{Supporting Deletes}
-\ref{ssec:sampling-deletes}
-
-As discussed in Section~\ref{ssec:background-deletes}, the Bentley-Saxe
+\label{ssec:sampling-deletes}
+As discussed in Section~\ref{ssec:dyn-deletes}, the Bentley-Saxe
 method can support deleting records through the use of either weak
 deletes, or a secondary ghost structure, assuming certain properties are
 satisfied by either the search problem or data structure. Unfortunately,
@@ -222,13 +224,14 @@
 we'll discuss our mechanisms for supporting deletes, as well as how these
 can be handled during sampling while maintaining correctness.
 
 Because both deletion policies have their advantages under certain
-contexts, we decided to support both. Specifically, we propose two
-mechanisms for deletes, which are
+contexts, we decided to support both. We require that each record contain
+a small header, which is used to store visibility metadata. Given this,
+we propose two mechanisms for deletes,
 \begin{enumerate}
 \item \textbf{Tagged Deletes.} Each record in the structure includes a
-header with a visibility bit set. On delete, the structure is searched
-for the record, and the bit is set in indicate that it has been deleted.
+visibility bit in its header. On delete, the structure is searched
+for the record, and the bit is set to indicate that it has been deleted.
 This mechanism is used to support \emph{weak deletes}.
 \item \textbf{Tombstone Deletes.} On delete, a new record is inserted into
 the structure with a tombstone bit set in the header. This mechanism is
@@ -252,8 +255,9 @@
 arbitrary number of delete records, and rebuild the entire structure when
 this threshold is crossed~\cite{saxe79}. Mixing the "ghost" records into
 the same structures as the original records allows for deleted records
 to naturally be cleaned up over time as they meet their tombstones during
-reconstructions. This is an important consequence that will be discussed
-in more detail in Section~\ref{ssec-sampling-delete-bounding}.
+reconstructions using a technique called tombstone cancellation. This
+technique, and its important consequences related to sampling, will be
+discussed in Section~\ref{sssec:sampling-rejection-bound}.
 
 There are two relevant aspects of performance that the two mechanisms
 trade-off between: the cost of performing the delete, and the cost of
@@ -368,7 +372,7 @@
 This performance cost seems catastrophically bad, considering it
 must be paid per sample, but there are ways to mitigate it. We
 will discuss these mitigations in more detail later, during our
 discussion of the implementation of these results in
-Section~\ref{sec:sampling-implementation}.
+Section~\ref{ssec:sampling-framework}.
 
 \subsubsection{Bounding Rejection Probability}
 
@@ -392,8 +396,7 @@
 the Bentley-Saxe method, however. In the theoretical literature on this
 topic, the solution to this problem is to periodically re-partition all
 of the records to re-align the block sizes~\cite{merge-dsp, saxe79}.
 This approach could also be easily applied here, if desired, though we
-do not in our implementations, for reasons that will be discussed in
-Section~\ref{sec:sampling-implementation}.
+do not in our implementations.
 
 The process of removing these deleted records during reconstructions
 is different for the two mechanisms. Tagged deletes are straightforward,
@@ -411,16 +414,16 @@
 care with ordering semantics, tombstones and their associated records
 can be sorted into adjacent spots, allowing them to be efficiently
 dropped during reconstruction without any extra overhead.
 
-While the dropping of deleted records during reconstruction helps, it is
-not sufficient on its own to ensure a particular bound on the number of
-deleted records within the structure. Pathological scenarios resulting in
-unbounded rejection rates, even in the presence of this mitigation, are
-possible. For example, tagging alone will never trigger reconstructions,
-and so it would be possible to delete every single record within the
-structure without triggering a reconstruction, or records could be deleted
-in the reverse order that they were inserted using tombstones. In either
-case, a passive system of dropping records naturally during reconstruction
-is not sufficient.
+While the dropping of deleted records during reconstruction helps,
+it is not sufficient on its own to ensure a particular bound on the
+number of deleted records within the structure. Pathological scenarios
+resulting in unbounded rejection rates, even in the presence of this
+mitigation, are possible. For example, tagging alone will never trigger
+reconstructions, and so it would be possible to delete every single
+record within the structure without triggering a reconstruction. Or,
+when using tombstones, records could be deleted in the reverse order
+that they were inserted. In either case, a passive system of dropping
+records naturally during reconstruction is not sufficient.
 
 Fortunately, this passive system can be used as the basis for a
 system that does provide a bound. This is because it guarantees,
@@ -490,6 +493,7 @@
 be taken to obtain a sample set of size $k$.
 
 \subsection{Performance Tuning and Configuration}
+\label{ssec:sampling-design-space}
 
 The final of the desiderata referenced earlier in this chapter for our
 dynamized sampling indices is having tunable performance. The base
@@ -508,7 +512,7 @@
 Though it has thus far gone unmentioned, some readers may have noted
 the astonishing similarity between decomposition-based dynamization
 techniques, and a data structure called the Log-structured
 Merge-tree. First proposed by O'Neil in the mid '90s\cite{oneil96},
-the LSM Tree was designed to optimize write throughout for external data
+the LSM Tree was designed to optimize write throughput for external data
 structures. It accomplished this task by buffer inserted records in a
 small in-memory AVL Tree, and then flushing this buffer to disk when it
 filled up. The flush process itself would fully rebuild the on-disk
@@ -518,22 +522,23 @@
 layered, external structures, to reduce the cost of reconstruction.
 
 In more recent times, the LSM Tree has seen significant development
 and been used as the basis for key-value stores like RocksDB~\cite{dong21}
-and LevelDB~\cite{leveldb}. This work has produced an incredibly large
-and well explored parametrization of the reconstruction procedures of
-LSM Trees, a good summary of which can be bound in this recent tutorial
-paper~\cite{sarkar23}. Examples of this design space exploration include:
-different ways to organize each "level" of the tree~\cite{dayan19,
-dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
-of structures to allow finer-grained reconstruction~\cite{dayan22}, and
-approaches for allocating resources to auxiliary structures attached to
-the main ones for accelerating certain types of query~\cite{dayan18-1,
-zhu21, monkey}.
+and LevelDB~\cite{leveldb}. This work has produced an incredibly
+large and well explored parametrization of the reconstruction
+procedures of LSM Trees, a good summary of which can be found in
+this recent tutorial paper~\cite{sarkar23}. Examples of this design
+space exploration include: different ways to organize each "level"
+of the tree~\cite{dayan19, dostoevsky, autumn}, different growth
+rates, buffering, sub-partitioning of structures to allow finer-grained
+reconstruction~\cite{dayan22}, and approaches for allocating resources to
+auxiliary structures attached to the main ones for accelerating certain
+types of query~\cite{dayan18-1, zhu21, monkey}. This work is discussed
+in greater depth in Chapter~\ref{chap:related-work}.
 
 Many of the elements within the LSM Tree design space are based upon the
-specifics of the data structure itself, and are not generally applicable.
-However, some of the higher-level concepts can be imported and applied in
-the context of dynamization. Specifically, we have decided to import the
-following four elements for use in our dynamization technique,
+specifics of the data structure itself, and are not applicable to our
+use case. However, some of the higher-level concepts can be imported and
+applied in the context of dynamization. Specifically, we have decided to
+import the following four elements for use in our dynamization technique,
 \begin{itemize}
 \item A small dynamic buffer into which new records are inserted
 \item A variable growth rate, called as \emph{scale factor}
@@ -554,11 +559,11 @@
 we are dynamizing may not exist. This introduces some query cost, as
 queries must be answered from these unsorted records as well, but in
 the case of sampling this isn't a serious problem. The implications of
 this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The
-size of this buffer, $N_B$ is a user-specified constant, and all block
-capacities are multiplied by it. In the Bentley-Saxe method, the $i$th
-block contains $2^i$ records. In our scheme, with buffering, this becomes
-$N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
-the \emph{mutable buffer}.
+size of this buffer, $N_B$, is a user-specified constant. Block capacities
+are defined in terms of multiples of $N_B$, such that each buffer flush
+corresponds to an insert in the traditional Bentley-Saxe method. Thus,
+rather than the $i$th block containing $2^i$ records, it contains $N_B
+\cdot 2^i$ records. We call this unsorted array the \emph{mutable buffer}.
 
 \Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
 twice as large as the block the precedes it There is, however, no reason
@@ -593,19 +598,19 @@
 we can build them over tombstones. This approach can greatly improve the
 sampling performance of the structure when tombstone deletes are used.
 
 \Paragraph{Layout Policy.} The Bentley-Saxe method considers blocks
-individually, without any other organization beyond increasing size. In
-contrast, LSM Trees have multiple layers of structural organization. The
-top level structure is a level, upon which record capacity restrictions
-are applied. These levels are then partitioned into individual structures,
-which can be further organized by key range. Because our intention is to
-support general data structures, which may or may not be easily partition
-by a key, we will not consider the finest grain of partitioning. However,
-we can borrow the concept of levels, and lay out shards in these levels
-according to different strategies.
+individually, without any other organization beyond increasing
+size. In contrast, LSM Trees have multiple layers of structural
+organization. Record capacity restrictions are enforced on structures
+called \emph{levels}, which are partitioned into individual data
+structures, and then further organized into non-overlapping key ranges.
+Because our intention is to support general data structures, which may
+or may not be easily partitioned by a key, we will not consider the finest
+grain of partitioning. However, we can borrow the concept of levels,
+and lay out shards in these levels according to different strategies.
 
 Specifically, we consider two layout policies. First, we can allow a
 single shard per level, a policy called \emph{Leveling}. This approach
-is traditionally read optimized, as it generally results in fewer shards
+is traditionally read-optimized, as it generally results in fewer shards
 within the overall structure for a given scale factor. Under leveling,
 the $i$th level has a capacity of $N_B \cdot s^{i+1}$ records. We can
 also allow multiple shards per level, resulting in a write-optimized
@@ -628,12 +633,10 @@
 The requirements that the framework places upon SSIs are rather
 modest. The sampling problem being considered must be a decomposable
 sampling problem (Definition \ref{def:decomp-sampling}) and the SSI must
 support the \texttt{build} and \texttt{unbuild} operations. Optionally,
-if the SSI supports point lookups or if the SSI can be constructed
-from multiple instances of the SSI more efficiently than its normal
-static construction, these two operations can be leveraged by the
-framework. However, these are not requirements, as the framework provides
-facilities to work around their absence.
-
+if the SSI supports point lookups or if the SSI is merge decomposable,
+then these two operations can be leveraged by the framework. However,
+these are not requirements, as the framework provides facilities to work
+around their absence.
 
 \captionsetup[subfloat]{justification=centering}
 \begin{figure*}
@@ -669,6 +672,7 @@
 these delete mechanisms, each record contains an attached header with
 bits to indicate its tombstone or delete status.
 
 \subsection{Supported Operations and Cost Functions}
+\label{ssec:sampling-cost-funcs}
 
 \Paragraph{Insert.} Inserting a record into the dynamization involves
 appending it to the mutable buffer, which requires $\Theta(1)$ time. When
 the buffer reaches its capacity, it must be flushed into the structure
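The framework.tex changes above describe the buffered, leveled layout: a mutable buffer of $N_B$ records, a scale factor $s$, and a capacity of $N_B \cdot s^{i+1}$ records for level $i$ under leveling. Below is a minimal C++ sketch of that capacity arithmetic. The names (`LayoutPolicy`, `level_shape`) are hypothetical and not the framework's API, and the tiering shard size is an assumption (the hunk is cut off before the tiering capacity is stated).

```cpp
#include <cstdint>

// Hypothetical names for illustration; not the framework's actual API.
enum class LayoutPolicy { Leveling, Tiering };

struct LevelShape {
    std::uint64_t shard_count;     // shards allowed on the level
    std::uint64_t shard_capacity;  // records per shard
};

// Shape of level i for buffer size nb and scale factor s.
// Leveling: one shard of nb * s^(i+1) records, as stated in the text.
// Tiering: assumed here to be s shards of nb * s^i records each, so the
// per-level total matches; this detail is not given in the diff.
inline LevelShape level_shape(unsigned i, std::uint64_t nb, std::uint64_t s,
                              LayoutPolicy policy) {
    std::uint64_t base = nb;
    for (unsigned j = 0; j < i; ++j) {
        base *= s;                      // nb * s^i
    }
    if (policy == LayoutPolicy::Leveling) {
        return {1, base * s};           // nb * s^(i+1)
    }
    return {s, base};                   // s shards of nb * s^i
}
```

For example, with $N_B = 1000$ and $s = 4$, level 0 holds one shard of 4,000 records under leveling, and (under the stated assumption) four shards of 1,000 records under tiering.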
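The delete-mechanism changes (tagged vs. tombstone deletes, with a per-record header holding visibility metadata) can be illustrated with a short sketch. This is not the framework's actual interface: the `point_lookup` and `insert` methods, the field names, and the record layout are assumptions, and tombstone cancellation during reconstruction is omitted.

```cpp
#include <cstdint>

// Hypothetical record layout; field and method names are illustrative.
struct RecordHeader {
    std::uint8_t tombstone : 1;  // set on the record inserted by a tombstone delete
    std::uint8_t deleted   : 1;  // set in place by a tagged (weak) delete
};

template <typename K, typename V>
struct Record {
    RecordHeader header{};
    K key{};
    V value{};
};

// Tagged delete: search the structure for the record and flip its
// visibility bit. Assumes the dynamized structure exposes a point lookup
// that returns a pointer to the stored record, or nullptr if absent.
template <typename Structure, typename K>
bool tagged_delete(Structure &s, const K &key) {
    if (auto *rec = s.point_lookup(key)) {
        rec->header.deleted = 1;
        return true;
    }
    return false;
}

// Tombstone delete: insert a new record with the tombstone bit set; it is
// cancelled against the original when both meet during a reconstruction.
template <typename Structure, typename K, typename V>
void tombstone_delete(Structure &s, const K &key, const V &value) {
    Record<K, V> ts;
    ts.header.tombstone = 1;
    ts.key = key;
    ts.value = value;
    s.insert(ts);
}
```

Either way, a sampling query must check these header bits and reject any sampled record that is deleted or shadowed by a tombstone, which is why the text bounds the fraction of such records in the structure.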
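The extensions.tex hunk corrects "fetch-and-and" to "fetch-and-add" for concurrent buffer inserts: writers claim slots with an atomic fetch-and-add on the tail, and a query snapshots the tail when it begins to obtain a static view. The sketch below uses a hypothetical buffer type and assumes queries only begin once in-flight appends are visible; the multi-buffer and reference-counting machinery described in the text is omitted.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical mutable-buffer sketch; not the framework's actual class.
template <typename Rec>
class MutableBuffer {
public:
    explicit MutableBuffer(std::size_t capacity) : slots_(capacity), tail_(0) {}

    // Claim a slot with fetch-and-add; returns false when the buffer is
    // full and must be flushed into the shard structure first.
    bool append(const Rec &r) {
        std::size_t idx = tail_.fetch_add(1, std::memory_order_acq_rel);
        if (idx >= slots_.size()) {
            return false;  // caller triggers a flush or switches buffers
        }
        slots_[idx] = r;
        return true;
    }

    // A query records the tail once at its start and reads only
    // slots_[0, snapshot), giving it a consistent, static prefix.
    std::size_t snapshot_tail() const {
        return std::min(tail_.load(std::memory_order_acquire), slots_.size());
    }

    const Rec &at(std::size_t i) const { return slots_[i]; }

private:
    std::vector<Rec> slots_;
    std::atomic<std::size_t> tail_;
};
```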