author    Douglas B. Rumbaugh <doug@douglasrumbaugh.com>  2025-06-01 13:15:52 -0400
committer Douglas B. Rumbaugh <doug@douglasrumbaugh.com>  2025-06-01 13:15:52 -0400
commit    cd3447f1cad16972e8a659ec6e84764c5b8b2745 (patch)
tree      5a50b6e8a99646e326b2c41714f50e4f7dee64d0
parent    6354e60f106a89f5bf807082561ed5efd9be0f4f (diff)
download  dissertation-cd3447f1cad16972e8a659ec6e84764c5b8b2745.tar.gz
Julia updates
-rw-r--r--   chapters/beyond-dsp.tex                    126
-rw-r--r--   chapters/dynamization.tex                  124
-rw-r--r--   chapters/related-works.tex                   1
-rw-r--r--   chapters/sigmod23/background.tex            18
-rw-r--r--   chapters/sigmod23/examples.tex               2
-rw-r--r--   chapters/sigmod23/exp-baseline.tex           2
-rw-r--r--   chapters/sigmod23/exp-parameter-space.tex   12
-rw-r--r--   chapters/sigmod23/experiment.tex             2
-rw-r--r--   chapters/sigmod23/extensions.tex             4
-rw-r--r--   chapters/sigmod23/framework.tex            134
10 files changed, 220 insertions, 205 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 87f44ba..73f8174 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -202,7 +202,7 @@ problem. The core idea underlying our solution in that chapter was to
introduce individualized local queries for each block, which were created
after a pre-processing step to allow information about each block to be
determined first. In that particular example, we established the weight
-each block should have during sampling, and then creating custom sampling
+each block should have during sampling, and then created custom sampling
queries with variable $k$ values, following the weight distributions. We
have determined a general interface that allows for this procedure to be
expressed, and we define the term \emph{extended decomposability} to refer
@@ -379,12 +379,12 @@ A significant limitation of invertible problems is that the result set
size is not able to be controlled. We do not know how many records in our
local results have been deleted until we reach the combine operation and
they begin to cancel out, at which point we lack a mechanism to go back
-and retrieve more. This presents difficulties for addressing important
-search problems such as top-$k$, $k$-NN, and sampling. In principle, these
-queries could be supported by repeating the query with larger-and-larger
-$k$ values until the desired number of records is returned, but in the
-eDSP model this requires throwing away a lot of useful work, as the state
-of the query must be rebuilt each time.
+and retrieve more records. This presents difficulties for addressing
+important search problems such as top-$k$, $k$-NN, and sampling. In
+principle, these queries could be supported by repeating the query with
+larger-and-larger $k$ values until the desired number of records is
+returned, but in the eDSP model this requires throwing away a lot of
+useful work, as the state of the query must be rebuilt each time.
We can resolve this problem by moving the decision to repeat the query
into the query interface itself, allowing retries \emph{before} the
@@ -700,7 +700,7 @@ the following main operations,
This function will delete a record from the dynamized structure,
returning $1$ on success and $0$ on failure. The meaning of a
failure to delete is dependent upon the delete mechanism in use,
- and will be discussed in Section~\ref{ssec:dyn-deletes}.
+ and will be discussed in Section~\ref{sssec:dyn-deletes}.
\item \texttt{std::future<QueryResult> query(QueryParameters); } \\
This function will execute a query with the specified parameters
@@ -838,17 +838,18 @@ shards of the same type. The second of these constructors is to allow for
efficient merging to be leveraged for merge decomposable search problems.
Shards can also expose a point lookup operation for use in supporting
-deletes for DDSPs. This function is only used for DDSP deletes, and so can
-be left off when this functionality isn't necessary. If a data structure
-doesn't natively support an efficient point-lookup, then it can be added
-by including a hash table or other data structure in the shard if desired.
-This function accepts a record type as input, and should return a pointer
-to the record that exactly matches the input in storage, if one exists,
-or \texttt{nullptr} if it doesn't. It should also accept an optional
-boolean argument that the framework will pass \texttt{true} into if it
-is don't a lookup for a tombstone. This flag is to allow the shard to
-use various tombstone-related optimization, such as using a Bloom filter
-for them, or storing them separately from the main records, etc.
+deletes for DDSPs. This function is only used for DDSP deletes, and
+so can be left off when this functionality isn't necessary. If a data
+structure doesn't natively support an efficient point-lookup, then it
+can be added by including a hash table or other data structure in the
+shard if desired. This function accepts a record type as input, and
+should return a pointer to the record that exactly matches the input in
+storage, if one exists, or \texttt{nullptr} if it doesn't. It should
+also accept an optional boolean argument that the framework will pass
+\texttt{true} into if the lookup operation is being used to search for
+a tombstone record. This flag is to allow the shard to use various
+tombstone-related optimizations, such as using a Bloom filter for them,
+or storing them separately from the main records, etc.
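As a minimal sketch of such a point lookup, assuming a simple sorted-array
shard (the record layout, class, and member names here are illustrative,
not the framework's actual declarations):

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record and shard types, for illustration only.
struct Record {
    uint64_t key;
    uint64_t value;

    bool operator==(const Record &o) const {
        return key == o.key && value == o.value;
    }
    bool operator<(const Record &o) const { return key < o.key; }
};

class SortedShard {
public:
    // Return a pointer to the stored record exactly matching `rec`, or
    // nullptr if no such record exists. The framework passes
    // filter_ts = true when it is searching for a tombstone, which a
    // shard could use to consult a tombstone-only Bloom filter, etc.
    const Record *point_lookup(const Record &rec, bool filter_ts = false) const {
        (void) filter_ts;  // no tombstone-specific fast path in this sketch
        auto it = std::lower_bound(m_data.begin(), m_data.end(), rec);
        for (; it != m_data.end() && it->key == rec.key; ++it) {
            if (*it == rec) return &*it;
        }
        return nullptr;
    }

private:
    std::vector<Record> m_data;  // records stored in key order
};
\end{lstlisting}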
Shards should also expose some accessors for basic meta-data about
its contents. In particular, the framework is reliant upon a function
@@ -888,19 +889,19 @@ concept ShardInterface = RecordInterface<typename SHARD::RECORD>
};
\end{lstlisting}
-\label{listing:shard}
\caption{The required interface for shard types in our dynamization
framework.}
+\label{lst:shard}
\end{lstfloat}
\subsubsection{Query Interface}
The most complex interface required by the framework is for queries. The
-concept for query types is given in Listing~\ref{listing:query}. In
+concept for query types is given in Listing~\ref{lst:query}. In
effect, it requires implementing the full IDSP interface from the
previous section, as well as versions of $\mathbftt{local\_preproc}$
-and $\mathbftt{local\query}$ for pre-processing and querying an unsorted
+and $\mathbftt{local\_query}$ for pre-processing and querying an unsorted
set of records, which is necessary to allow the mutable buffer to be
used as part of the query process.\footnote{
In the worst case, these routines could construct temporary shard
@@ -918,7 +919,7 @@ a local result that includes both the number of records and the number
of tombstones, while the query result itself remains a single number.
Additionally, the framework makes no decision about what, if any,
collection type should be used for these results. A range scan, for
-example, could specified the result types as a vector of records, map
+example, could specify the result types as a vector of records, map
of records, etc., depending on the use case.
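For instance, a range count might pair a two-field local result with a
scalar final result; the type names below are illustrative assumptions
rather than types defined by the framework.

\begin{lstlisting}[language=C++]
#include <cstddef>

// Local result for a range count over one shard: the tombstone count is
// carried along so deleted records can be cancelled during combine.
struct RangeCountLocalResult {
    size_t record_count;
    size_t tombstone_count;
};

// The final, combined result remains a single number.
using RangeCountResult = size_t;

// A range *scan*, by contrast, could use any collection type, e.g.
//   using RangeScanResult = std::vector<Record>;
\end{lstlisting}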
There is one significant difference between the IDSP interface and the
@@ -935,7 +936,6 @@ to define an additional combination operation for final result types,
or duplicate effort in the combine step on each repetition.
\begin{lstfloat}
-
\begin{lstlisting}[language=C++]
template <typename QUERY, typename SHARD,
@@ -979,9 +979,9 @@ requires(PARAMETERS *parameters, LOCAL *local,
};
\end{lstlisting}
-\label{listing:query}
\caption{The required interface for query types in our dynamization
framework.}
+\label{lst:query}
\end{lstfloat}
@@ -1029,7 +1029,7 @@ all the records from the level above it ($i-1$ or the buffer, if $i
merged with the records in $j+1$ and the resulting shard placed in level
$j+1$. This procedure guarantees that level $0$ will have capacity for
the shard from the buffer, which is then merged into it (if it is not
-empty) or because it (if the level is empty).
+empty) or replaces it (if the level is empty).
\item \textbf{Tiering.}\\
@@ -1152,16 +1152,16 @@ reconstruction. Consider a record $r_i$ and its corresponding tombstone
$t_j$, where the subscript is the insertion time, with $i < j$ meaning
that $r_i$ was inserted \emph{before} $t_j$. Then, if we are to apply
tombstone cancellations, we must obey the following invariant within
-each shard: A record $r_i$ and tombstone $r_j$ can exist in the same
+each shard: A record $r_i$ and tombstone $t_j$ can exist in the same
shard if $i > j$. But, if $i < j$, then a cancellation should occur.
The case where the record and tombstone coexist covers the situation where
a record is deleted, and then inserted again after the delete. In this
case, there does exist a record $r_k$ with $k < j$ that the tombstone
should cancel with, but that record may exist in a different shard. So
-the tombstone will \emph{eventually} cancel, but it would be technically
-incorrect to cancel it with the matching record $r_i$ that it coexists
-with in the shard being considered.
+the tombstone will \emph{eventually} cancel, but it would be incorrect
+to cancel it with the matching record $r_i$ that it coexists with in
+the shard being considered.
This means that correct tombstone cancellation requires that the order
that records have been inserted be known and accounted for during
@@ -1186,7 +1186,7 @@ at index $i$ will cancel with a record if and only if that record is
in index $i+1$. For structures that are constructed by a sorted-merge
of data, this allows tombstone cancellation at no extra cost during
the merge operation. Otherwise, it requires an extra linear pass after
-sorting to remove cancelled records.\footnote{
+sorting to remove canceled records.\footnote{
For this reason, we use tagging based deletes for structures which
don't require sorting by value during construction.
}
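A minimal sketch of such a cancellation pass, assuming a record layout
with a tombstone flag and ordering semantics that place each tombstone
immediately before the older record it deletes (both the names and the
layout are illustrative):

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
    uint64_t key;
    uint64_t value;
    bool tombstone;

    bool matches(const Record &o) const {
        return key == o.key && value == o.value;
    }
};

// Single linear pass over a sorted run: a tombstone and the record it
// deletes sit in adjacent slots, so both can be dropped together.
std::vector<Record> cancel_tombstones(const std::vector<Record> &sorted) {
    std::vector<Record> out;
    out.reserve(sorted.size());
    for (size_t i = 0; i < sorted.size(); i++) {
        if (sorted[i].tombstone && i + 1 < sorted.size()
                && sorted[i].matches(sorted[i + 1])) {
            i++;        // skip the cancelled record
            continue;   // and drop the tombstone itself
        }
        out.push_back(sorted[i]);
    }
    return out;
}
\end{lstlisting}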
@@ -1200,7 +1200,7 @@ For tombstone deletes, a failure to delete means a failure to insert,
and the request should be retried after a brief delay. Note that, for
performance reasons, the framework makes no effort to ensure that the
record being erased using tombstones is \emph{actually} there, so it
-is possible to insert a tombstone that can never be cancelled. This
+is possible to insert a tombstone that can never be canceled. This
won't affect correctness in any way, so long as queries are correctly
implemented, but it will increase the size of the structure slightly.
@@ -1271,7 +1271,7 @@ same mechanisms described in Section~\ref{sssec:dyn-deletes}.
\Paragraph{Asymptotic Complexity.} The worst-case query cost of the
framework follows the same basic cost function as discussed for IDSPs
-in Section~\ref{asec:dyn-idsp}, with slight modifications to account for
+in Section~\ref{ssec:dyn-idsp}, with slight modifications to account for
the different cost function of buffer querying and preprocessing. The
cost is,
\begin{equation*}
@@ -1280,7 +1280,7 @@ cost is,
\end{equation*}
where $P_B(n)$ is the cost of pre-processing the buffer, and $Q_B(n)$ is
the cost of querying it. As $N_B$ is a small constant relative to $n$,
-in some cases these terms can be ommitted, but they are left here for
+in some cases these terms can be omitted, but they are left here for
generality. Also note that this is an upper bound, but isn't necessarily
tight. As we saw with IRS in Section~\ref{ssec:edsp}, it is sometimes
possible to leverage problem-specific details within this interface to
@@ -1307,7 +1307,7 @@ All of our testing was performed using Ubuntu 20.04 LTS on a dual
socket Intel Xeon Gold 6242 server with 384 GiB of physical memory and
40 physical cores. We ran our benchmarks pinned to a specific core,
or specific NUMA node for multi-threaded testing. Our code was compiled
-using GCC version 11.3.0 with the \texttt{-O3} flag, and targetted to
+using GCC version 11.3.0 with the \texttt{-O3} flag, and targeted to
C++20.\footnote{
Aside from the ALEX benchmark. ALEX does not build in this
configuration, and we used C++13 instead for that particular test.
@@ -1335,7 +1335,7 @@ structures. Specifically,
\texttt{fb}, and \texttt{osm} datasets from
SOSD~\cite{sosd-datasets}. Each has 200 million 64-bit keys
(to which we added 64-bit values) following a variety of
- distributions. We ommitted the \texttt{wiki} dataset because it
+ distributions. We omitted the \texttt{wiki} dataset because it
contains duplicate keys, which were not supported by one of our
dynamic baselines.
@@ -1371,7 +1371,7 @@ For our first set of experiments, we evaluated a dynamized version of the
Triespline learned index~\cite{plex} for answering range count queries.\footnote{
We tested range scans throughout this chapter by measure the
performance of a range count. We decided to go this route to ensure
- that the results across our baselines were comprable. Different range
+ that the results across our baselines were comparable. Different range
structures provided different interfaces for accessing the result
sets, some of which required making an extra copy and others which
didn't. Using a range count instead allowed us to measure only index
@@ -1383,7 +1383,7 @@ performance. We ran these tests using the SOSD \texttt{OSM} dataset.
First, we'll consider the effect of buffer size on performance in
Figures~\ref{fig:ins-buffer-size} and \ref{fig:q-buffer-size}. For all
-of these tests, we used a fixe scale factor of $8$ and the tombstone
+of these tests, we used a fixed scale factor of $8$ and the tombstone
delete policy. Each plot shows the performance of our three supported
layout policies (note that BSM uses a fixed $N_B=1$ and $s=2$ for all
tests, to accurately reflect the performance of the classical Bentley-Saxe
@@ -1419,17 +1419,17 @@ improves performance. This is because a larger scale factor in tiering
results in more, smaller structures, and thus reduced reconstruction
time. But for leveling it increases the write amplification, hurting
performance. Figure~\ref{fig:q-scale-factor} shows that, like with
-Figure~\ref{fig:query_sf} in the previous chapter, query latency is not
-strong affected by the scale factor, but larger scale factors due tend
+Figure~\ref{fig:sample_sf} in the previous chapter, query latency is not
+strongly affected by the scale factor, but larger scale factors do tend
to have a negative effect under tiering (due to having more structures).
As a final note, these results demonstrate that, compared to the
normal Bentley-Saxe method, our proposed design space is a strict
-improvement. There are points within the space that are equivilant to,
+improvement. There are points within the space that are equivalent to,
or even strictly superior to, BSM in terms of both query and insertion
-performance, as well as clearly available trade-offs between insertion and
-query performance, particular when it comes to selecting layout policy.
-
+performance. Beyond this, there are also clearly available trade-offs
+between insertion and query performance, particularly when it comes to
+selecting layout policy.
\begin{figure*}
@@ -1446,7 +1446,7 @@ query performance, particular when it comes to selecting layout policy.
\subsection{Independent Range Sampling}
-Next, we'll consider the indepedent range sampling problem using ISAM
+Next, we'll consider the independent range sampling problem using ISAM
tree. The functioning of this structure for answering IRS queries is
discussed in more detail in Section~\ref{ssec:irs-struct}, and we use the
query algorithm described in Algorithm~\ref{alg:decomp-irs}. We use the
@@ -1456,7 +1456,7 @@ obtain the upper and lower bounds of the query range, and the weight
of that range, using tree traversals in \texttt{local\_preproc}. We
use rejection sampling on the buffer, and so the buffer preprocessing
simply uses the number of records in the buffer for its weight. In
-\texttt{distribute\_query}, we build and alias structure over all of
+\texttt{distribute\_query}, we build an alias structure over all of
the weights and query it $k$ times to obtain the individual $k$ values
for the local queries. To avoid extra work on repeat, we stash this
alias structure in the buffer's local query object so it is available
@@ -1485,8 +1485,8 @@ compaction is triggered.
We configured our dynamized structure to use $s=8$, $N_B=12000$, $\delta
= .05$, $f = 16$, and the tiering layout policy. We compared our method
(\textbf{DE-IRS}) to Olken's method~\cite{olken89} on a B+Tree with
-aggregate weight counts (\textbf{AGG B+Tree}), as well as our besoke
-sampling solution from the previous chapter (\textbf{Besoke}) and a
+aggregate weight counts (\textbf{AGG B+Tree}), as well as our bespoke
+sampling solution from the previous chapter (\textbf{Bespoke}) and a
single static instance of the ISAM Tree (\textbf{ISAM}). Because IRS
is neither INV nor DDSP, the standard Bentley-Saxe Method has no way to
support deletes for it, and was not tested. All of our tested sampling
@@ -1494,7 +1494,7 @@ queries had a controlled selectivity of $\sigma = 0.01\%$ and $k=1000$.
The results of our performance benchmarking are in Figure~\ref{fig:irs}.
Figure~\ref{fig:irs-insert} shows that our general framework has
-comperable insertion performance to the specialized one, though loses
+comparable insertion performance to the specialized one, though loses
slightly. This is to be expected, as \textbf{Bespoke} was hand-written for
specifically this type of query and data structure, and has hard-coded
data types, among other things. Despite losing to \textbf{Bespoke}
@@ -1525,7 +1525,7 @@ using a static Vantage Point Tree (VPTree)~\cite{vptree}. This is a
binary search tree with internal nodes that partition records based
on their distance to a selected point, called the vantage point. All
of the points within a fixed distance of the vantage point are covered
-by one subtree, and the points outside of this distance are covered by
+by one sub-tree, and the points outside of this distance are covered by
the other. This results in a hard-to-update data structure that can
be constructed in $\Theta(n \log n)$ time using repeated application of
the \texttt{quickselect} algorithm~\cite{quickselect} to partition the
Algorithm~\ref{alg:idsp-knn}, though using delete tagging instead of
tombstones. VPTree doesn't support efficient point lookups, and so to
work around this we add a hash map to each shard, mapping each record to
its location in storage, to ensure that deletes can be done efficiently
-in this way. This allows us to avoid cancelling deleted records in
+in this way. This allows us to avoid canceling deleted records in
the \texttt{combine} operation, as they can be skipped over during
\texttt{local\_query} directly. Because $k$-NN doesn't have any of the
distributional requirements of IRS, these local queries can return $k$
@@ -1599,7 +1599,7 @@ scheme used, with \textbf{BSM-VPTree} performing slightly \emph{better}
than our framework for query performance. The reason for this is
shown in Figure~\ref{fig:knn-insert}, where our framework outperforms
the Bentley-Saxe method in insertion performance. These results are
-atributible to our selection of framework configuration parameters,
+attributable to our selection of framework configuration parameters,
which are biased towards better insertion performance. Both dynamized
structures also outperform the dynamic baseline. Finally, as is becoming
a trend, Figure~\ref{fig:knn-space} shows that the storage requirements
@@ -1667,7 +1667,7 @@ The results of our evaluation are shown in
Figure~\ref{fig:eval-learned-index}. Figure~\ref{fig:rq-insert} shows
the insertion performance. DE-TS is the best in all cases, and the pure
BSM version of Triespline is the worst by a substantial margin. Of
-particular interest in this chart is the inconsisent performance of
+particular interest in this chart is the inconsistent performance of
ALEX, which does quite well on the \texttt{books} dataset, and poorly
on the others. It is worth noting that getting ALEX to run \emph{at
all} in some cases required a lot of trial and error and tuning, as its
@@ -1691,8 +1691,8 @@ performs horrendously compared to all of the other structures. The same
caveat from the previous paragraph applies here--PGM can be configured
for better performance. But it's notable that our framework-dynamized PGM
is able to beat PGM slightly in insertion performance without seeing the
-same massive degredation in query performance that PGM's native update
-suport does in its own update-optmized configuration.\footnote{
+same massive degradation in query performance that PGM's native update
+support does in its own update-optimized configuration.\footnote{
It's also worth noting that PGM implements tombstone deletes by
inserting a record with a matching key to the record to be deleted,
and a particular "tombstone" value, rather than using a header. This
@@ -1712,7 +1712,7 @@ update support.
\subsection{String Search}
As a final example of a search problem, we consider exact string matching
-using the fast succinct trie~\cite{zhang18}. While updatable
+using the fast succinct trie~\cite{zhang18}. While dynamic
tries aren't terribly unusual~\cite{m-bonsai,dynamic-trie}, succinct data
structures, which attempt to approach an information-theoretic lower-bound
on their binary representation of the data, are usually static because
@@ -1725,7 +1725,7 @@ we consider the effectiveness of our generalized framework for them.
\centering
\subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-insert} \label{fig:fst-insert}}
\subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-query} \label{fig:fst-query}}
- \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-size}}
+ \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-space}}
%\vspace{-3mm}
\caption{FST Evaluation}
\label{fig:fst-eval}
@@ -1739,9 +1739,9 @@ storage. Queries use no pre-processing and the local queries directly
search for a matching string. We use the framework's early abort feature
to stop as soon as the first result is found, and combine simply checks
whether this record is a tombstone or not. If it's a tombstone, then
-the lookup is considered to have no found the search string. Otherwise,
+the lookup is considered to have not found the search string. Otherwise,
the record is returned. This results in a dynamized structure with the
-following asympotic costs,
+following asymptotic costs,
\begin{align*}
@@ -1759,7 +1759,7 @@ The results are show in Figure~\ref{fig:fst-eval}. As with range scans,
the Bentley-Saxe method shows horrible insertion performance relative to
our framework in Figure~\ref{fig:fst-insert}. Note that the significant
observed difference in update throughput for the two data sets is
-largely attributable to the relative sizes. The \texttt{usra} set is
+largely attributable to the relative sizes. The \texttt{US} set is
far larger than \texttt{english}. Figure~\ref{fig:fst-query} shows that
our write-optimized framework configuration is slightly out-performed in
query latency by the standard Bentley-Saxe dynamization, and that both
@@ -1767,7 +1767,7 @@ dynamized structures are quite a bit slower than the static structure for
queries. Finally, the storage costs for the data structures are shown
in Figure~\ref{fig:fst-space}. For the \texttt{english} data set, the
extra storage cost from decomposing the structure is quite significant,
-but the for \texttt{ursarc} set the sizes are quite comperable. It is
+but for the \texttt{ursarc} set the sizes are quite comparable. It is
not unexpected that dynamization would add storage cost for succinct
(or any compressed) data structures, because the splitting of the records
across multiple data structures reduces the ability of the structure to
@@ -1792,10 +1792,10 @@ are inserted, it is necessary that each operation obtain a lock on
the root node of the tree~\cite{zhao22}. This makes this situation
a good use-case for the automatic concurrency support provided by our
framework. Figure~\ref{fig:irs-concurrency} shows the results of this
-benchmark for various numbers of concurreny query threads. As can be seen,
+benchmark for various numbers of concurrent query threads. As can be seen,
our framework supports a stable update throughput up to 32 query threads,
whereas the AGG B+Tree suffers from contention for the mutex and sees
-is performance degrade as the number of threads increases.
+its performance degrade as the number of threads increases.
\begin{figure}
\centering
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
index 053fb46..a2277c3 100644
--- a/chapters/dynamization.tex
+++ b/chapters/dynamization.tex
@@ -67,15 +67,17 @@ terms for these two concepts.
\subsection{Decomposable Search Problems}
-Dynamization techniques require the partitioning of one data structure
-into several, smaller ones. As a result, these techniques can only
-be applied in situations where the search problem to be answered can
-be answered from this set of smaller data structures, with the same
-answer as would have been obtained had all of the data been used to
-construct a single, large structure. This requirement is formalized in
-the definition of a class of problems called \emph{decomposable search
-problems (DSP)}. This class was first defined by Bentley and Saxe in
-their work on dynamization, and we will adopt their definition,
+The dynamization techniques we will be considering require decomposing
+one data structure into several, smaller ones, called blocks, each built
+over a disjoint partition of the data. As a result, these techniques
+can only be applied in situations where the search problem can be
+answered from this set of decomposed blocks. The answer to the search
+problem from the decomposition should be the same as would have been
+obtained had all of the data been stored in a single data structure. This
+requirement is formalized in the definition of a class of problems called
+\emph{decomposable search problems (DSP)}. This class was first defined
+by Bentley and Saxe in their work on dynamization, and we will adopt
+their definition,
\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
@@ -180,15 +182,17 @@ database indices. We refer to a data structure with update support as
contain header information (like visibility) that is updated in place.
}
-This section discusses \emph{dynamization}, the construction of a
-dynamic data structure based on an existing static one. When certain
-conditions are satisfied by the data structure and its associated
-search problem, this process can be done automatically, and with
-provable asymptotic bounds on amortized insertion performance, as well
-as worst case query performance. This is in contrast to the manual
-design of dynamic data structures, which involve techniques based on
-partially rebuilding small portions of a single data structure (called
-\emph{local reconstruction})~\cite{overmars83}. This is a very high cost
+This section discusses \emph{dynamization}, the construction of a dynamic
+data structure based on an existing static one. When certain conditions
+are satisfied by the data structure and its associated search problem,
+this process can be done automatically, and with provable asymptotic
+bounds on amortized insertion performance, as well as worst case
+query performance. This automatic approach is in contrast with the
+manual design of a dynamic data structure, which involves altering
+the data structure itself to natively support updates. This process
+usually involves implementing techniques that partially rebuild small
+portions of the structure to accommodate new records, which is called
+\emph{local reconstruction}~\cite{overmars83}. This is a very high cost
intervention that requires significant effort on the part of the data
structure designer, whereas conventional dynamization can be performed
with little-to-no modification of the underlying data structure at all.
@@ -345,7 +349,7 @@ then an insert is done by,
Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
violated by deletes. We're omitting deletes from the discussion at
- this point, but will circle back to them in Section~\ref{sec:deletes}.
+ this point, but will circle back to them in Section~\ref{ssec:dyn-deletes}.
} In this case, the constraints are enforced by "re-configuring" the
structure. $s$ is updated to be exactly $f(n)$, all of the existing
blocks are unbuilt, and then the records are redistributed evenly into
@@ -584,14 +588,13 @@ F(A / B, q) = F(A, q)~\Delta~F(B, q)
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
-Given a search problem with this property, it is possible to perform
-deletes by creating a secondary ``ghost'' structure. When a record
-is to be deleted, it is inserted into this structure. Then, when the
-dynamization is queried, this ghost structure is queried as well as the
-main one. The results from the ghost structure can be removed from the
-result set using the inverse merge operator. This simulates the result
-that would have been obtained had the records been physically removed
-from the main structure.
+Given a search problem with this property, it is possible to emulate
+removing a record from the structure by instead inserting into a
+secondary ``ghost'' structure. When the dynamization is queried,
+this ghost structure is queried as well as the main one. The results
+from the ghost structure can be removed from the result set using the
+inverse merge operator. This simulates the result that would have been
+obtained had the records been physically removed from the main structure.
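As a minimal sketch, using range count (discussed below) and a
hypothetical static structure exposing a \texttt{range\_count} method,
the inverse merge operator is simply subtraction:

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>

// Answer a range count over a dynamization with a ghost structure: the
// ghost holds "deleted" records, and its result is subtracted from the
// main result to simulate physical removal.
template <typename Structure>
size_t range_count_with_ghost(const Structure &primary,
                              const Structure &ghost,
                              uint64_t lo, uint64_t hi) {
    size_t main_result  = primary.range_count(lo, hi);  // all inserted records
    size_t ghost_result = ghost.range_count(lo, hi);    // records since deleted
    return main_result - ghost_result;                  // inverse merge
}
\end{lstlisting}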
Two examples of invertible search problems are set membership
and range count. Range count was formally defined in
@@ -670,11 +673,13 @@ to some serious problems, for example if every record in a structure
of $n$ records is deleted, the net result will be an "empty" dynamized
data structure containing $2n$ physical records within it. To circumvent
this problem, Bentley and Saxe proposed a mechanism of setting a maximum
-threshold for the size of the ghost structure relative to the main one,
-and performing a complete re-partitioning of the data once this threshold
-is reached, removing all deleted records from the main structure,
-emptying the ghost structure, and rebuilding blocks with the records
-that remain according to the invariants of the technique.
+threshold for the size of the ghost structure relative to the main one.
+Once this threshold is reached, a complete re-partitioning of the data
+can be performed. During this re-partitioning, all deleted records can
+be removed from the main structure, and the ghost structure emptied
+completely. Then all of the blocks can be rebuilt from the remaining
+records, partitioning them according to the strict binary decomposition
+of the Bentley-Saxe method.
\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
@@ -694,16 +699,16 @@ underlying data structure supports a delete operation. More formally,
for $\mathscr{I}$.
\end{definition}
-Superficially, this doesn't appear very useful. If the underlying data
-structure already supports deletes, there isn't much reason to use a
-dynamization technique to add deletes to it. However, one point worth
-mentioning is that it is possible, in many cases, to easily \emph{add}
-delete support to a static structure. If it is possible to locate a
-record and somehow mark it as deleted, without removing it from the
-structure, and then efficiently ignore these records while querying,
-then the given structure and its search problem can be said to be
-deletion decomposable. This technique for deleting records is called
-\emph{weak deletes}.
+Superficially, this doesn't appear very useful, because if the underlying
+data structure already supports deletes, there isn't much reason to
+use a dynamization technique to add deletes to it. However, even in
+structures that don't natively support deleting, it is possible in many
+cases to \emph{add} delete support without significant alterations.
+If it is possible to locate a record and somehow mark it as deleted,
+without removing it from the structure, and then efficiently ignore these
+records while querying, then the given structure and its search problem
+can be said to be deletion decomposable. This technique for deleting
+records is called \emph{weak deletes}.
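A minimal sketch of adding weak deletes to a hypothetical static sorted
array (the structure and member names are illustrative assumptions): a
delete locates the record and sets a flag, and queries skip flagged
records.

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
    uint64_t key;
    uint64_t value;
    bool deleted = false;
};

class StaticSortedArray {
public:
    // Weak delete: mark the record without restructuring anything.
    bool weak_delete(uint64_t key, uint64_t value) {
        auto it = std::lower_bound(m_data.begin(), m_data.end(), key,
            [](const Record &r, uint64_t k) { return r.key < k; });
        for (; it != m_data.end() && it->key == key; ++it) {
            if (it->value == value && !it->deleted) {
                it->deleted = true;
                return true;
            }
        }
        return false;  // record not present (or already deleted)
    }

    // Queries simply ignore weakly deleted records.
    size_t range_count(uint64_t lo, uint64_t hi) const {
        size_t count = 0;
        for (const auto &r : m_data) {
            if (!r.deleted && r.key >= lo && r.key <= hi) count++;
        }
        return count;
    }

private:
    std::vector<Record> m_data;  // sorted by key
};
\end{lstlisting}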
\begin{definition}[Weak Deletes~\cite{overmars81}]
\label{def:weak-delete}
@@ -815,10 +820,10 @@ and thereby the query performance. The particular invariant maintenance
rules depend upon the decomposition scheme used.
\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
-a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
+a deletion decomposable search problem, the $i$th block where $i \geq 2$,\footnote{
Block $i=0$ will only ever have one record, so no special maintenance must be
done for it. A delete will simply empty it completely.
-},
+}
in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
delete occurs in block $i$, no special action is taken until the number
of records in that block falls below $2^{i-2}$. Once this threshold is
@@ -1076,12 +1081,13 @@ matching of records in result sets. To work around this, a slight abuse
of definition is in order: assume that the equality conditions within
the DSP definition can be interpreted to mean ``the contents in the two
sets are drawn from the same distribution''. This enables the category
-of DSP to apply to this type of problem.
+of DSP to apply to this type of problem, while maintaining the spirit of
+the definition.
Even with this abuse, however, IRS cannot generally be considered
decomposable; it is at best $C(n)$-decomposable. The reason for this is
that matching the distribution requires drawing the appropriate number
-of samples from each each partition of the data. Even in the special
+of samples from each partition of the data. Even in the special
case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
from each partition that must appear in the result set cannot be known
in advance due to differences in the selectivity of the predicate across
@@ -1102,7 +1108,7 @@ the partitions.
probability of a $4$. The second and third result sets can only
be ${3, 3, 3, 3}$ and ${4, 4, 4, 4}$ respectively. Merging these
together, we'd find that the probability distribution of the sample
- would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were were to perform
+ would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
the same sampling operation over the full dataset (not partitioned),
the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
@@ -1111,21 +1117,23 @@ the partitions.
The problem is that the number of samples drawn from each partition
needs to be weighted based on the number of elements satisfying the
query predicate in that partition. In the above example, by drawing $4$
-samples from $D_1$, more weight is given to $3$ than exists within
-the base dataset. This can be worked around by sampling a full $k$
-records from each partition, returning both the sample and the number
-of records satisfying the predicate as that partition's query result,
-and then performing another pass of IRS as the merge operator, but this
-is the same approach as was used for k-NN above. This leaves IRS firmly
+samples from $D_1$, more weight is given to $3$ than exists within the
+base dataset. This can be worked around by sampling a full $k$ records
+from each partition, returning both the sample and the number of records
+satisfying the predicate as that partition's query result. This allows the
+relative weight of each partition to be accounted for during the merge by
+performing weighted sampling over the partial results. This approach requires
+$\Theta(k)$ time for the merge operation, however, leaving IRS firmly
in the $C(n)$-decomposable camp. If it were possible to pre-calculate
the number of samples to draw from each partition, then a constant-time
merge operation could be used.
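A minimal sketch of such a pre-calculation, assuming the per-partition
weights (the number of records satisfying the predicate in each
partition) are known up front; \texttt{std::discrete\_distribution}
stands in here for the alias-structure-based weighting used elsewhere
in this work:

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <random>
#include <vector>

// Decide how many of the k samples each partition should contribute by
// drawing k times from the weight distribution. With the counts fixed
// in advance, the merge step reduces to concatenating the per-partition
// samples, removing the need to re-sample during the merge.
std::vector<size_t> assign_sample_counts(const std::vector<size_t> &weights,
                                         size_t k, std::mt19937 &rng) {
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    std::vector<size_t> counts(weights.size(), 0);
    for (size_t i = 0; i < k; i++) {
        counts[dist(rng)]++;  // assign one sample "slot" per draw
    }
    return counts;
}
\end{lstlisting}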
-We examine this problem in detail in Chapters~\ref{chap:sampling} and
-\ref{chap:framework} and propose techniques for efficiently expanding
-support of dynamization systems to non-decomposable search problems, as
-well as addressing some additional difficulties introduced by supporting
-deletes, which can complicate query processing.
+We examine this problem in detail in Chapters~\ref{chap:sampling}
+and \ref{chap:framework}, where we propose techniques for efficiently
+expanding support of dynamization systems to non-decomposable search
+problems, as well as addressing some additional difficulties introduced
+by supporting deletes, which can complicate query processing.
\subsection{Configurability}
diff --git a/chapters/related-works.tex b/chapters/related-works.tex
index 2ed466a..7a42003 100644
--- a/chapters/related-works.tex
+++ b/chapters/related-works.tex
@@ -1,4 +1,5 @@
\chapter{Related Work}
+\label{chap:related-work}
\section{Implementations of Bentley-Saxe}
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index af3b80a..d600c27 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -19,16 +19,16 @@ is used to indicate the selection of either a single sample or a sample
set; the specific usage should be clear from context.
In each of the problems considered, sampling can be performed either
-with replacement or without replacement. Sampling with replacement
+with-replacement or without-replacement. Sampling with-replacement
means that a record that has been included in the sample set for a given
sampling query is "replaced" into the dataset and allowed to be sampled
-again. Sampling without replacement does not "replace" the record,
+again. Sampling without-replacement does not "replace" the record,
and so each individual record can only be included within a sample
set once for a given query. The data structures that will be discussed
-support sampling with replacement, and sampling without replacement can
-be implemented using a constant number of with replacement sampling
+support sampling with-replacement, and sampling without-replacement can
+be implemented using a constant number of with-replacement sampling
operations, followed by a deduplication step~\cite{hu15}, so this chapter
-will focus exclusive on the with replacement case.
+will focus exclusively on the with-replacement case.
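A simplified sketch of this reduction, assuming a with-replacement
sampling primitive over records identified by 64-bit keys and at least
$k$ distinct records satisfying the query; the cited approach bounds the
number of sampling rounds, which this sketch does not attempt to
reproduce:

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Build a without-replacement sample of size k from a with-replacement
// primitive by repeatedly drawing and discarding duplicates.
std::vector<uint64_t>
sample_without_replacement(const std::function<uint64_t()> &sample_one,
                           size_t k) {
    std::unordered_set<uint64_t> seen;
    while (seen.size() < k) {
        seen.insert(sample_one());  // duplicates are silently ignored
    }
    return std::vector<uint64_t>(seen.begin(), seen.end());
}
\end{lstlisting}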
\subsection{Independent Sampling Problem}
@@ -115,8 +115,10 @@ of problems that will be directly addressed within this chapter.
Relational database systems often have native support for IQS using
SQL's \texttt{TABLESAMPLE} operator~\cite{postgress-doc}. However, the
-algorithms used to implement this operator have significant limitations:
-users much choose between statistical independence or performance.
+algorithms used to implement this operator have significant limitations
+and do not allow users to maintain statistical independence of the results
+without also running the query to be sampled from in full. Thus, users must
+choose between independence and performance.
To maintain statistical independence, Bernoulli sampling is used. This
technique requires iterating over every record in the result set of the
@@ -240,7 +242,7 @@ Tao~\cite{tao22}.
There also exist specialized data structures with support for both
efficient sampling and updates~\cite{hu14}, but these structures have
poor constant factors and are very complex, rendering them of little
-practical utility. Additionally, efforts have been made to extended
+practical utility. Additionally, efforts have been made to extend
the alias structure with support for weight updates over a fixed set of
elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not
allow the insertion or removal of new records, however, only in-place
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index 38df04d..4e7f9ac 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -25,7 +25,7 @@ number of shards involved in a reconstruction using either layout policy
is $\Theta(1)$ using our framework, this means that we can perform
reconstructions in $B_M(n) \in \Theta(n)$ time, including tombstone
cancellation. The total weight of the structure can also be calculated
-at no time when it is constructed, allows $W(n) \in \Theta(1)$ time
+at no time cost when it is constructed, allowing $W(n) \in \Theta(1)$ time
as well. Point lookups over the sorted data can be done using a binary
search in $L(n) \in \Theta(\log_2 n)$ time, and sampling queries require
no pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer can be
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
index 5585c36..d0e1ce0 100644
--- a/chapters/sigmod23/exp-baseline.tex
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -73,7 +73,7 @@ being introduced by the dynamization.
\subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-insert} \label{fig:irs-insert1}}
\subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-sample} \label{fig:irs-sample1}} \\
- \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete}}
+ \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete-s}}
\subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-samplesize} \label{fig:irs-samplesize}}
\caption{Framework Comparison to Baselines for IRS}
diff --git a/chapters/sigmod23/exp-parameter-space.tex b/chapters/sigmod23/exp-parameter-space.tex
index 9583312..1e51d8c 100644
--- a/chapters/sigmod23/exp-parameter-space.tex
+++ b/chapters/sigmod23/exp-parameter-space.tex
@@ -2,11 +2,11 @@
\label{ssec:ds-exp}
Our proposed framework has a large design space, which we briefly
-described in Section~\ref{ssec:design-space}. The contents of this
-space will be described in much more detail in Chapter~\ref{chap:design-space},
-but as part of this work we did perform an experimental examination of our
-framework to compare insertion throughput and query latency over various
-points within the space.
+described in Section~\ref{ssec:sampling-design-space}. The
+contents of this space will be described in much more detail in
+Chapter~\ref{chap:design-space}, but as part of this work we did perform
+an experimental examination of our framework to compare insertion
+throughput and query latency over various points within the space.
We examined this design space by considering \texttt{DE-WSS} specifically,
using a random sample of $500,000,000$ records from the \texttt{OSM}
@@ -48,7 +48,7 @@ performance, with tiering outperforming leveling for both delete
policies. The next largest effect was the delete policy selection,
with tombstone deletes outperforming tagged deletes in insertion
performance. This result aligns with the asymptotic analysis of the two
-approaches in Section~\ref{sampling-deletes}. It is interesting to note
+approaches in Section~\ref{ssec:sampling-deletes}. It is interesting to note
however that the effect of layout policy was more significant in these
particular tests,\footnote{
Although the largest performance gap in absolute terms was between
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 727284a..1eb704c 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -53,7 +53,7 @@ uninteresting key distributions.
\Paragraph{Structures Compared.} As a basis of comparison, we tested
both our dynamized SSI implementations, and existing dynamic baselines,
-for each sampling problem considered. Specifically, we consider a the
+for each sampling problem considered. Specifically, we consider the
following dynamized structures,
\begin{itemize}
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 3a3cba3..3304b76 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -56,7 +56,7 @@ structure using in XDB~\cite{li19}.
Because our dynamization technique is built on top of static data
structures, a limited form of concurrency support is straightforward to
-implement. To that end, created a proof-of-concept dynamization of an
+implement. To that end, we created a proof-of-concept dynamization of an
ISAM Tree for IRS based on a simplified version of a general concurrency
controlled scheme for log-structured data stores~\cite{golan-gueta15}.
@@ -79,7 +79,7 @@ accessing them have finished.
The buffer itself is an unsorted array, so a query can capture a
consistent and static version by storing the tail pointer at the time
the query begins. New inserts can be performed concurrently by doing
-a fetch-and-and on the tail. By using multiple buffers, inserts and
+a fetch-and-add on the tail. By using multiple buffers, inserts and
reconstructions can proceed, to some extent, in parallel, which helps to
hide some of the insertion tail latency due to blocking on reconstructions
during a buffer flush.
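A minimal sketch of this buffer design (the names and layout are
illustrative, and the synchronization required to make a claimed slot
visible to concurrent readers is elided):

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record { uint64_t key; uint64_t value; };

class MutableBuffer {
public:
    explicit MutableBuffer(size_t capacity) : m_data(capacity), m_tail(0) {}

    // Concurrent insert: claim a slot with fetch-and-add on the tail.
    bool insert(const Record &rec) {
        size_t slot = m_tail.fetch_add(1);
        if (slot >= m_data.size()) return false;  // full; caller must flush
        m_data[slot] = rec;
        // NB: a complete implementation must also publish the slot to
        // readers (e.g., via a second "visible" counter); omitted here.
        return true;
    }

    // A query captures a consistent snapshot by reading the tail once
    // and scanning only the records in [0, size()).
    size_t size() const { return std::min(m_tail.load(), m_data.size()); }
    const Record &at(size_t i) const { return m_data[i]; }

private:
    std::vector<Record> m_data;
    std::atomic<size_t> m_tail;
};
\end{lstlisting}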
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 256d127..804194b 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -50,6 +50,7 @@ on the query being sampled from. Based on these observations, we can
define the decomposability conditions for a query sampling problem,
\begin{definition}[Decomposable Sampling Problem]
+ \label{def:decomp-sampling}
A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
\mathbb{Z}^+ \to \mathcal{R}$) is decomposable if and only if
the following conditions are met for all $q \in \mathcal{Q},
@@ -78,12 +79,14 @@ These two conditions warrant further explanation. The first condition
is simply a redefinition of the standard decomposability criteria to
consider matching the distribution, rather than the exact records in $R$,
as the correctness condition for the merge process. The second condition
-handles a necessary property of the underlying search problem being
-sampled from. Note that this condition is \emph{stricter} than normal
-decomposability for $F$, and essentially requires that the query being
-sampled from return a set of records, rather than an aggregate value or
-some other result that cannot be meaningfully sampled from. This condition
-is satisfied by predicate-filtering style database queries, among others.
+addresses the search problem from which results are to be sampled. Not all
+search problems admit sampling of this sort--for example, an aggregation
+query that returns a single result. This condition essentially requires
+that the search problem being sampled from return a set of records, rather
+than an aggregate value or some other result that cannot be meaningfully
+sampled from. This condition is satisfied by predicate-filtering style
+database queries, among others. However, it should be noted that this
+condition is \emph{stricter} than normal decomposability.
With these definitions in mind, let's turn to solving these query sampling
problems. First, we note that many SSIs have a sampling procedure that
@@ -120,7 +123,7 @@ down-sampling combination operator. Secondly, this formulation
fails to avoid a per-sample dependence on $n$, even in the case
where $S(n) \in \Theta(1)$. This gets even worse when considering
rejections that may occur as a result of deleted records. Recall from
-Section~\ref{ssec:background-deletes} that deletion can be supported
+Section~\ref{ssec:dyn-deletes} that deletion can be supported
using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
Using either approach, it isn't possible to avoid deleted records in
advance when sampling, and so these will need to be rejected and retried.
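A minimal sketch of the resulting rejection loop, assuming a
hypothetical per-sample primitive and a flag indicating that a sampled
record has been deleted:

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <functional>
#include <vector>

struct Record {
    bool deleted;  // weakly deleted, or matched by a tombstone
    // ... payload omitted
};

// Draw k samples, rejecting and retrying any draw that lands on a
// deleted record. Every rejection costs an extra sampling operation.
std::vector<Record> sample_k(const std::function<Record()> &sample_one,
                             size_t k) {
    std::vector<Record> result;
    result.reserve(k);
    while (result.size() < k) {
        Record r = sample_one();
        if (r.deleted) continue;  // rejection: retry this draw
        result.push_back(r);
    }
    return result;
}
\end{lstlisting}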
@@ -208,9 +211,8 @@ or are naturally determined as part of the pre-processing, and thus the
$W(n)$ term can be merged into $P(n)$.
\subsection{Supporting Deletes}
-\ref{ssec:sampling-deletes}
-
-As discussed in Section~\ref{ssec:background-deletes}, the Bentley-Saxe
+\label{ssec:sampling-deletes}
+As discussed in Section~\ref{ssec:dyn-deletes}, the Bentley-Saxe
method can support deleting records through the use of either weak
deletes, or a secondary ghost structure, assuming certain properties are
satisfied by either the search problem or data structure. Unfortunately,
@@ -222,13 +224,14 @@ we'll discuss our mechanisms for supporting deletes, as well as how
these can be handled during sampling while maintaining correctness.
Because both deletion policies have their advantages under certain
-contexts, we decided to support both. Specifically, we propose two
-mechanisms for deletes, which are
+contexts, we decided to support both. We require that each record contain
+a small header, which is used to store visibility metadata. Given this,
+we propose two mechanisms for deletes,
\begin{enumerate}
\item \textbf{Tagged Deletes.} Each record in the structure includes a
-header with a visibility bit set. On delete, the structure is searched
-for the record, and the bit is set in indicate that it has been deleted.
+visibility bit in its header. On delete, the structure is searched
+for the record, and the bit is set to indicate that it has been deleted.
This mechanism is used to support \emph{weak deletes}.
\item \textbf{Tombstone Deletes.} On delete, a new record is inserted into
the structure with a tombstone bit set in the header. This mechanism is
@@ -252,8 +255,9 @@ arbitrary number of delete records, and rebuild the entire structure when
this threshold is crossed~\cite{saxe79}. Mixing the "ghost" records into
the same structures as the original records allows for deleted records
to naturally be cleaned up over time as they meet their tombstones during
-reconstructions. This is an important consequence that will be discussed
-in more detail in Section~\ref{ssec-sampling-delete-bounding}.
+reconstructions using a technique called tombstone cancellation. This
+technique, and its important consequences related to sampling, will be
+discussed in Section~\ref{sssec:sampling-rejection-bound}.
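As an illustrative sketch of the per-record header that supports both
mechanisms (the exact bit layout is an assumption, not the framework's
definition):

\begin{lstlisting}[language=C++]
#include <cstdint>

// One bit marks a tagged (weak) delete, one bit marks a tombstone.
struct RecordHeader {
    static constexpr uint8_t DELETE_BIT    = 0x1;
    static constexpr uint8_t TOMBSTONE_BIT = 0x2;

    uint8_t flags = 0;

    bool is_deleted() const   { return flags & DELETE_BIT; }
    bool is_tombstone() const { return flags & TOMBSTONE_BIT; }
    void set_delete()         { flags |= DELETE_BIT; }
    void set_tombstone()      { flags |= TOMBSTONE_BIT; }
};

template <typename Key, typename Value>
struct WrappedRecord {
    RecordHeader header;
    Key key;
    Value value;
};
\end{lstlisting}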
There are two relevant aspects of performance that the two mechanisms
trade-off between: the cost of performing the delete, and the cost of
@@ -368,7 +372,7 @@ This performance cost seems catastrophically bad, considering
it must be paid per sample, but there are ways to mitigate
it. We will discuss these mitigations in more detail later,
during our discussion of the implementation of these results in
-Section~\ref{sec:sampling-implementation}.
+Section~\ref{ssec:sampling-framework}.
\subsubsection{Bounding Rejection Probability}
@@ -392,8 +396,7 @@ the Bentley-Saxe method, however. In the theoretical literature on this
topic, the solution to this problem is to periodically re-partition all of
the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
approach could also be easily applied here, if desired, though we
-do not in our implementations, for reasons that will be discussed in
-Section~\ref{sec:sampling-implementation}.
+do not in our implementations.
The process of removing these deleted records during reconstructions is
different for the two mechanisms. Tagged deletes are straightforward,
@@ -411,16 +414,16 @@ care with ordering semantics, tombstones and their associated records can
be sorted into adjacent spots, allowing them to be efficiently dropped
during reconstruction without any extra overhead.
-While the dropping of deleted records during reconstruction helps, it is
-not sufficient on its own to ensure a particular bound on the number of
-deleted records within the structure. Pathological scenarios resulting in
-unbounded rejection rates, even in the presence of this mitigation, are
-possible. For example, tagging alone will never trigger reconstructions,
-and so it would be possible to delete every single record within the
-structure without triggering a reconstruction, or records could be deleted
-in the reverse order that they were inserted using tombstones. In either
-case, a passive system of dropping records naturally during reconstruction
-is not sufficient.
+While the dropping of deleted records during reconstruction helps,
+it is not sufficient on its own to ensure a particular bound on the
+number of deleted records within the structure. Pathological scenarios
+resulting in unbounded rejection rates, even in the presence of this
+mitigation, are possible. For example, tagging alone will never trigger
+reconstructions, and so it would be possible to delete every single
+record within the structure without triggering a reconstruction. Or,
+when using tombstones, records could be deleted in the reverse order
+in which they were inserted. In either case, a passive system of dropping
+records naturally during reconstruction is not sufficient.
Fortunately, this passive system can be used as the basis for a
system that does provide a bound. This is because it guarantees,
@@ -490,6 +493,7 @@ be taken to obtain a sample set of size $k$.
\subsection{Performance Tuning and Configuration}
+\label{ssec:sampling-design-space}
The final of the desiderata referenced earlier in this chapter for our
dynamized sampling indices is having tunable performance. The base
@@ -508,7 +512,7 @@ Though it has thus far gone unmentioned, some readers may have
noted the astonishing similarity between decomposition-based
dynamization techniques, and a data structure called the Log-structured
Merge-tree. First proposed by O'Neil in the mid '90s\cite{oneil96},
-the LSM Tree was designed to optimize write throughout for external data
+the LSM Tree was designed to optimize write throughput for external data
structures. It accomplished this task by buffering inserted records in a
small in-memory AVL Tree, and then flushing this buffer to disk when
it filled up. The flush process itself would fully rebuild the on-disk
@@ -518,22 +522,23 @@ layered, external structures, to reduce the cost of reconstruction.
In more recent times, the LSM Tree has seen significant development and
been used as the basis for key-value stores like RocksDB~\cite{dong21}
-and LevelDB~\cite{leveldb}. This work has produced an incredibly large
-and well explored parametrization of the reconstruction procedures of
-LSM Trees, a good summary of which can be bound in this recent tutorial
-paper~\cite{sarkar23}. Examples of this design space exploration include:
-different ways to organize each "level" of the tree~\cite{dayan19,
-dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
-of structures to allow finer-grained reconstruction~\cite{dayan22}, and
-approaches for allocating resources to auxiliary structures attached to
-the main ones for accelerating certain types of query~\cite{dayan18-1,
-zhu21, monkey}.
+and LevelDB~\cite{leveldb}. This work has produced an incredibly
+large and well explored parametrization of the reconstruction
+procedures of LSM Trees, a good summary of which can be bounded in
+this recent tutorial paper~\cite{sarkar23}. Examples of this design
+space exploration include: different ways to organize each "level"
+of the tree~\cite{dayan19, dostoevsky, autumn}, different growth
+rates, buffering, sub-partitioning of structures to allow finer-grained
+reconstruction~\cite{dayan22}, and approaches for allocating resources to
+auxiliary structures attached to the main ones for accelerating certain
+types of query~\cite{dayan18-1, zhu21, monkey}. This work is discussed
+in greater depth in Chapter~\ref{chap:related-work}.
Many of the elements within the LSM Tree design space are based upon the
-specifics of the data structure itself, and are not generally applicable.
-However, some of the higher-level concepts can be imported and applied in
-the context of dynamization. Specifically, we have decided to import the
-following four elements for use in our dynamization technique,
+specifics of the data structure itself, and are not applicable to our
+use case. However, some of the higher-level concepts can be imported and
+applied in the context of dynamization. Specifically, we have decided to
+import the following four elements for use in our dynamization technique,
\begin{itemize}
\item A small dynamic buffer into which new records are inserted
\item A variable growth rate, called the \emph{scale factor}
@@ -554,11 +559,11 @@ we are dynamizing may not exist. This introduces some query cost, as
queries must be answered from these unsorted records as well, but in
the case of sampling this isn't a serious problem. The implications of
this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The
-size of this buffer, $N_B$ is a user-specified constant, and all block
-capacities are multiplied by it. In the Bentley-Saxe method, the $i$th
-block contains $2^i$ records. In our scheme, with buffering, this becomes
-$N_B \cdot 2^i$ records in the $i$th block. We call this unsorted array
-the \emph{mutable buffer}.
+size of this buffer, $N_B$, is a user-specified constant. Block capacities
+are defined in terms of multiples of $N_B$, such that each buffer flush
+corresponds to an insert in the traditional Bentley-Saxe method. Thus,
+rather than the $i$th block containing $2^i$ records, it contains $N_B
+\cdot 2^i$ records. We call this unsorted array the \emph{mutable buffer}.
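To make these capacities concrete (the value of $N_B$ here is arbitrary):
with $N_B = 1000$, block $i$ has a capacity of $1000 \cdot 2^i$ records,
so block $3$ can hold up to $8000$ records, and a structure with blocks
$0$ through $\ell$ holds at most $\sum_{i=0}^{\ell} N_B \cdot 2^i =
N_B\left(2^{\ell+1} - 1\right)$ records in total.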
\Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
twice as large as the block that precedes it. There is, however, no reason
@@ -593,19 +598,19 @@ we can build them over tombstones. This approach can greatly improve
the sampling performance of the structure when tombstone deletes are used.
\Paragraph{Layout Policy.} The Bentley-Saxe method considers blocks
-individually, without any other organization beyond increasing size. In
-contrast, LSM Trees have multiple layers of structural organization. The
-top level structure is a level, upon which record capacity restrictions
-are applied. These levels are then partitioned into individual structures,
-which can be further organized by key range. Because our intention is to
-support general data structures, which may or may not be easily partition
-by a key, we will not consider the finest grain of partitioning. However,
-we can borrow the concept of levels, and lay out shards in these levels
-according to different strategies.
+individually, without any other organization beyond increasing
+size. In contrast, LSM Trees have multiple layers of structural
+organization. Record capacity restrictions are enforced on structures
+called \emph{levels}, which are partitioned into individual data
+structures, and then further organized into non-overlapping key ranges.
+Because our intention is to support general data structures, which may
+or may not be easily partitioned by a key, we will not consider the finest
+grain of partitioning. However, we can borrow the concept of levels,
+and lay out shards in these levels according to different strategies.
Specifically, we consider two layout policies. First, we can allow a
single shard per level, a policy called \emph{Leveling}. This approach
-is traditionally read optimized, as it generally results in fewer shards
+is traditionally read-optimized, as it generally results in fewer shards
within the overall structure for a given scale factor. Under leveling,
the $i$th level has a capacity of $N_B \cdot s^{i+1}$ records. We can
also allow multiple shards per level, resulting in a write-optimized
@@ -628,12 +633,10 @@ The requirements that the framework places upon SSIs are rather
modest. The sampling problem being considered must be a decomposable
sampling problem (Definition \ref{def:decomp-sampling}) and the SSI must
support the \texttt{build} and \texttt{unbuild} operations. Optionally,
-if the SSI supports point lookups or if the SSI can be constructed
-from multiple instances of the SSI more efficiently than its normal
-static construction, these two operations can be leveraged by the
-framework. However, these are not requirements, as the framework provides
-facilities to work around their absence.
-
+if the SSI supports point lookups or if the SSI is merge decomposable,
+then the corresponding operations can be leveraged by the framework. However,
+these are not requirements, as the framework provides facilities to work
+around their absence.
\captionsetup[subfloat]{justification=centering}
\begin{figure*}
@@ -669,6 +672,7 @@ these delete mechanisms, each record contains an attached header with
bits to indicate its tombstone or delete status.
\subsection{Supported Operations and Cost Functions}
+\label{ssec:sampling-cost-funcs}
\Paragraph{Insert.} Inserting a record into the dynamization involves
appending it to the mutable buffer, which requires $\Theta(1)$ time. When
the buffer reaches its capacity, it must be flushed into the structure