Diffstat (limited to 'chapters/sigmod23/examples.tex')
-rw-r--r--  chapters/sigmod23/examples.tex | 236
 1 file changed, 117 insertions(+), 119 deletions(-)
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index cdbc398..38df04d 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -1,143 +1,141 @@
-\section{Framework Instantiations}
+\section{Applications of the Framework}
\label{sec:instance}
-In this section, the framework is applied to three sampling problems and their
-associated SSIs. All three sampling problems draw random samples from records
-satisfying a simple predicate, and so result sets for all three can be
-constructed by directly merging the result sets of the queries executed against
-individual shards, the primary requirement for the application of the
-framework. The SSIs used for each problem are discussed, including their
-support of the remaining two optional requirements for framework application.
+Using the framework from the previous section, we can create dynamizations
+of SSIs for various sampling problems. In this section, we consider
+three different decomposable sampling problems and their associated SSIs,
+discussing the implementation details necessary to ensure that they work
+efficiently.
-\subsection{Dynamically Extended WSS Structure}
+\subsection{Weighted Set Sampling (Alias Structure)}
\label{ssec:wss-struct}
-As a first example of applying this framework for dynamic extension,
-the alias structure for answering WSS queries is considered. This is a
-static structure that can be constructed in $O(n)$ time and supports WSS
-queries in $O(1)$ time. The alias structure will be used as the SSI, with
-the shards containing an alias structure paired with a sorted array of
-records. { The use of sorted arrays for storing the records
-allows for more efficient point-lookups, without requiring any additional
-space. The total weight associated with a query for
-a given alias structure is the total weight of all of its records,
-and can be tracked at the shard level and retrieved in constant time. }
+As a first example, we will consider the alias structure~\cite{walker74}
+for weighted set sampling. This is a static data structure that is
+constructible in $B(n) \in \Theta(n)$ time and is capable of answering
+sampling queries in $\Theta(1)$ time per sample. This structure does
+\emph{not} directly support point-lookups, nor is it naturally sorted
+to allow for convenient tombstone cancellation. However, the structure
+itself does not place any requirements on the ordering of the underlying
+data, and so both of these limitations can be addressed by building it
+over a sorted array.
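+
+To make this concrete, below is a minimal sketch of one standard way to
+build and sample the alias structure (Vose's variant of Walker's method).
+The record type and member names are illustrative assumptions rather than
+the framework's actual interface; in a shard, the input records would
+already be sorted by key.
+\begin{verbatim}
+#include <cstddef>
+#include <random>
+#include <vector>
+
+// Illustrative record type: a key and a sampling weight.
+struct Record {
+    long   key;
+    double weight;
+};
+
+// Static alias structure: O(n) construction, O(1) weighted sampling.
+class AliasStructure {
+public:
+    explicit AliasStructure(std::vector<Record> records)
+        : m_records(std::move(records)),
+          m_prob(m_records.size()), m_alias(m_records.size()) {
+        const size_t n = m_records.size();
+        double total = 0;
+        for (const auto &r : m_records) total += r.weight;
+
+        // Scale weights so the average bucket holds probability 1, then
+        // pair each under-full bucket with an over-full one.
+        std::vector<double> scaled(n);
+        std::vector<size_t> small, large;
+        for (size_t i = 0; i < n; i++) {
+            scaled[i] = m_records[i].weight * n / total;
+            (scaled[i] < 1.0 ? small : large).push_back(i);
+        }
+        while (!small.empty() && !large.empty()) {
+            size_t s = small.back(); small.pop_back();
+            size_t l = large.back(); large.pop_back();
+            m_prob[s]  = scaled[s];
+            m_alias[s] = l;
+            scaled[l] -= (1.0 - scaled[s]);
+            (scaled[l] < 1.0 ? small : large).push_back(l);
+        }
+        for (size_t s : small) m_prob[s] = 1.0;
+        for (size_t l : large) m_prob[l] = 1.0;
+    }
+
+    // Draw one record with probability proportional to its weight.
+    const Record &sample(std::mt19937_64 &rng) const {
+        std::uniform_int_distribution<size_t> bucket(0, m_records.size() - 1);
+        std::uniform_real_distribution<double> coin(0.0, 1.0);
+        size_t i = bucket(rng);
+        return (coin(rng) <= m_prob[i]) ? m_records[i]
+                                        : m_records[m_alias[i]];
+    }
+
+private:
+    std::vector<Record> m_records;  // sorted by key in the shard use-case
+    std::vector<double> m_prob;     // probability of keeping bucket i
+    std::vector<size_t> m_alias;    // bucket sampled on a miss
+};
+\end{verbatim}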
-Using the formulae from Section~\ref{sec:framework}, the worst-case
-costs of insertion, sampling, and deletion are easily derived. The
-initial construction cost from the buffer is $C_c(N_b) \in O(N_b
-\log N_b)$, requiring the sorting of the buffer followed by alias
-construction. After this point, the shards can be reconstructed in
-linear time while maintaining sorted order. Thus, the reconstruction
-cost is $C_r(n) \in O(n)$. As each shard contains a sorted array,
-the point-lookup cost is $L(n) \in O(\log n)$. The total weight can
-be tracked with the shard, requiring $W(n) \in O(1)$ time to access,
-and there is no necessary preprocessing, so $P(n) \in O(1)$. Samples
-can be drawn in $S(n) \in O(1)$ time. Plugging these results into the
-formulae for insertion, sampling, and deletion costs gives,
+This pre-sorting means that building a shard from the buffer requires
+$B(n) \in \Theta(n \log n)$ time; however, after this point a sorted
+merge can be used to perform reconstructions from the shards themselves.
+As the maximum number of shards involved in a reconstruction under
+either layout policy is $\Theta(1)$ in our framework, reconstructions
+can be performed in $B_M(n) \in \Theta(n)$ time, including tombstone
+cancellation. The total weight of the structure can also be calculated
+at no additional cost during construction, giving $W(n) \in \Theta(1)$
+as well. Point lookups over the sorted data can be done using a binary
+search in $L(n) \in \Theta(\log_2 n)$ time, and sampling queries require
+no pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer can be
+sampled using rejection sampling.
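+
+For instance, weighted rejection sampling against the unsorted buffer
+could be performed as sketched below, assuming (purely for illustration)
+that the buffer tracks the maximum weight among its records.
+\begin{verbatim}
+#include <cstddef>
+#include <optional>
+#include <random>
+#include <vector>
+
+struct Record {
+    long   key;
+    double weight;
+};
+
+// One rejection-sampling attempt against an unsorted, weighted buffer.
+// A uniformly chosen record is accepted with probability w / w_max, so
+// accepted records are distributed proportionally to their weights. On
+// rejection, the caller simply retries.
+std::optional<Record> buffer_sample(const std::vector<Record> &buffer,
+                                    double max_weight,
+                                    std::mt19937_64 &rng) {
+    std::uniform_int_distribution<size_t> idx(0, buffer.size() - 1);
+    std::uniform_real_distribution<double> coin(0.0, max_weight);
+
+    const Record &candidate = buffer[idx(rng)];
+    if (coin(rng) <= candidate.weight) {
+        return candidate;
+    }
+    return std::nullopt;
+}
+\end{verbatim}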
+
+This results in the following cost functions for the various operations
+supported by the dynamization,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &\Theta\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log n\right)
\end{align*}
-where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
+where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log n)$ for
tombstones.
-\Paragraph{Bounding Rejection Rate.} In the weighted sampling case,
-the framework's generic record-based compaction trigger mechanism
-is insufficient to bound the rejection rate. This is because the
-probability of a given record being sampling is dependent upon its
-weight, as well as the number of records in the index. If a highly
-weighted record is deleted, it will be preferentially sampled, resulting
-in a larger number of rejections than would be expected based on record
-counts alone. This problem can be rectified using the framework's user-specified
-compaction trigger mechanism.
-In addition to
-tracking record counts, each level also tracks its rejection rate,
+\Paragraph{Sampling Rejection Rate Bound.} Bounding the number of deleted
+records is not sufficient to bound the rejection rate of weighted sampling
+queries on its own, because it does not account for the weights of the
+records being deleted. Recall that in our discussion of this bound we
+assumed all records had equal weights. Without this assumption, it
+is possible to construct adversarial cases where a very highly weighted
+record is deleted, resulting in it being preferentially sampled and
+rejected repeatedly.
+
+To ensure that our solution is robust even in the face of such adversarial
+workloads, for the weighted sampling case we introduce another compaction
+trigger based on the measured rejection rate of each level. We
+define the rejection rate of level $i$ as,
\begin{equation*}
-\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
+ \rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
-A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i
-> \rho$ on a level, a compaction is triggered. In the case
-the tombstone delete policy, it is not the level containing the sampled
-record, but rather the level containing its tombstone, that is considered
-the source of the rejection. This is necessary to ensure that the tombstone
-is moved closer to canceling its associated record by the compaction.
+and allow the user to specify a maximum rejection rate, $\rho$. If $\rho_i
+> \rho$ on a given level, then a proactive compaction is triggered. In
+the case of tagged deletes, the rejection rate of a level is based on
+the rejections resulting from sampling attempts on that level. This
+will \emph{not} work when using tombstones, however, as compacting the
+level containing the record that was rejected will not make progress
+towards eliminating that record from the structure in this case. Instead,
+when using tombstones, the rejection rate is tracked based on the level
+containing the tombstone that caused the rejection. This ensures that the
+tombstone is moved towards its associated record, and that the compaction
+makes progress towards removing it.
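+
+A minimal sketch of the bookkeeping behind this trigger is given below;
+the type and function names are illustrative assumptions rather than the
+framework's actual interface.
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+// Per-level counters for the rejection-rate trigger. Under tagging, a
+// rejection is charged to the level that produced the rejected sample;
+// under tombstones, to the level holding the responsible tombstone.
+struct LevelSamplingStats {
+    size_t attempts   = 0;
+    size_t rejections = 0;
+
+    double rejection_rate() const {
+        return attempts ? static_cast<double>(rejections) / attempts : 0.0;
+    }
+};
+
+// Report the levels whose measured rejection rate exceeds the
+// user-specified bound rho; these are proactively compacted.
+std::vector<size_t> levels_to_compact(
+        const std::vector<LevelSamplingStats> &levels, double rho) {
+    std::vector<size_t> result;
+    for (size_t i = 0; i < levels.size(); i++) {
+        if (levels[i].rejection_rate() > rho) {
+            result.push_back(i);
+        }
+    }
+    return result;
+}
+\end{verbatim}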
-\subsection{Dynamically Extended IRS Structure}
-\label{ssec:irs-struct}
-Another sampling problem to which the framework can be applied is
-independent range sampling (IRS). The SSI in this example is the in-memory
-ISAM tree. The ISAM tree supports efficient point-lookups
- directly, and the total weight of an IRS query can be
-easily obtained by counting the number of records within the query range,
-which is determined as part of the preprocessing of the query.
-The static nature of shards in the framework allows for an ISAM tree
-to be constructed with adjacent nodes positioned contiguously in memory.
-By selecting a leaf node size that is a multiple of the record size, and
-avoiding placing any headers within leaf nodes, the set of leaf nodes can
-be treated as a sorted array of records with direct indexing, and the
-internal nodes allow for faster searching of this array.
-Because of this layout, per-sample tree-traversals are avoided. The
-start and end of the range from which to sample can be determined using
-a pair of traversals, and then records can be sampled from this range
-using random number generation and array indexing.
-
-Assuming a sorted set of input records, the ISAM tree can be bulk-loaded
-in linear time. The insertion analysis proceeds like the WSS example
-previously discussed. The initial construction cost is $C_c(N_b) \in
-O(N_b \log N_b)$ and reconstruction cost is $C_r(n) \in O(n)$. The ISAM
-tree supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$
-is the fanout of the tree.
+\subsection{Independent Range Sampling (ISAM Tree)}
+\label{ssec:irs-struct}
+We will next consider independent range sampling. For this decomposable
+sampling problem, we use the ISAM Tree as the SSI. Because our shards are
+static, we can build highly compact and efficient ISAM trees by storing
+the records directly in a sorted array. So long as the leaf node size is
+a multiple of the record size, this array can be treated as a sequence of
+leaf nodes in the tree, and internal nodes can be built above this using
+array indices as pointers. These internal nodes can also be constructed
+contiguously in an array, maximizing cache efficiency.
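+
+The layout described above can be sketched as follows; the fanout,
+record type, and member names are illustrative assumptions rather than
+the actual implementation.
+\begin{verbatim}
+#include <cstddef>
+#include <utility>
+#include <vector>
+
+struct Record {
+    long key;
+};
+
+// A static ISAM-like layout over an already-sorted record array. Leaf
+// "nodes" are fixed-size blocks of that array, and each internal level
+// is itself a contiguous array whose entries pair a separator key with
+// the array index of the child node, so indices stand in for pointers.
+struct StaticISAM {
+    static constexpr size_t FANOUT = 64;
+
+    struct InternalEntry {
+        long   min_key;  // smallest key in the child node
+        size_t child;    // index of the child in the level below
+    };
+
+    std::vector<Record>                     leaves;  // sorted records
+    std::vector<std::vector<InternalEntry>> levels;  // built bottom-up
+
+    explicit StaticISAM(std::vector<Record> sorted_records)
+        : leaves(std::move(sorted_records)) {
+        // First internal level: one entry per leaf block.
+        std::vector<InternalEntry> level;
+        for (size_t i = 0; i < leaves.size(); i += FANOUT) {
+            level.push_back({leaves[i].key, i});
+        }
+        // Keep building levels until the topmost fits in one node.
+        while (level.size() > FANOUT) {
+            std::vector<InternalEntry> next;
+            for (size_t i = 0; i < level.size(); i += FANOUT) {
+                next.push_back({level[i].min_key, i});
+            }
+            levels.push_back(std::move(level));
+            level = std::move(next);
+        }
+        levels.push_back(std::move(level));
+    }
+};
+\end{verbatim}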
-The process for performing range sampling against the ISAM tree involves
-two stages. First, the tree is traversed twice: once to establish the index of
-the first record greater than or equal to the lower bound of the query,
-and again to find the index of the last record less than or equal to the
-upper bound of the query. This process has the effect of providing the
-number of records within the query range, and can be used to determine
-the weight of the shard in the shard alias structure. Its cost is $P(n)
-\in O(\log_f n)$. Once the bounds are established, samples can be drawn
-by randomly generating uniform integers between the upper and lower bound,
-in $S(n) \in O(1)$ time each.
+Building this structure from the buffer requires sorting the records
+first and then performing a linear-time bulk load, and hence $B(n)
+\in \Theta(n \log n)$. However, sorted-array merges can be used for
+further reconstructions, meaning that $B_M(n) \in \Theta(n)$. The data
+structure itself supports point lookups in $L(n)\in \Theta(\log n)$ time.
+IRS queries can be answered by first using two tree traversals to identify
+the minimum and maximum array indices associated with the query range
+in $\Theta(\log n)$ time, and then generating array indices within this
+range uniformly at random for each sample. The initial traversals can be
+considered preprocessing time, so $P(n) \in \Theta(\log n)$. The weight
+of the shard is simply the difference between the upper and lower indices
+of the range (i.e., the number of records in the range), and so $W(n)
+\in \Theta(1)$ time, and the per-sample cost is a single random number
+generation, so $S(n) \in \Theta(1)$. The mutable buffer can be sampled
+using rejection sampling.
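+
+The query procedure can be sketched as below. For brevity, the two
+bound-finding traversals are shown as binary searches directly over the
+sorted leaf array; in the actual shard, the ISAM tree's internal nodes
+serve only to accelerate this step. The record and function names are
+again illustrative assumptions.
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <random>
+#include <utility>
+#include <vector>
+
+struct Record {
+    long key;
+};
+
+// Preprocessing: locate the index range [lo, hi) of records with keys
+// in [lower, upper]. The shard's weight for this query is just hi - lo.
+std::pair<size_t, size_t> irs_preprocess(const std::vector<Record> &leaves,
+                                         long lower, long upper) {
+    auto lo = std::lower_bound(leaves.begin(), leaves.end(), lower,
+                  [](const Record &r, long k) { return r.key < k; });
+    auto hi = std::upper_bound(leaves.begin(), leaves.end(), upper,
+                  [](long k, const Record &r) { return k < r.key; });
+    return {static_cast<size_t>(lo - leaves.begin()),
+            static_cast<size_t>(hi - leaves.begin())};
+}
+
+// Per-sample work: one random array index within a non-empty range.
+const Record &irs_sample(const std::vector<Record> &leaves,
+                         std::pair<size_t, size_t> range,
+                         std::mt19937_64 &rng) {
+    std::uniform_int_distribution<size_t> idx(range.first, range.second - 1);
+    return leaves[idx(rng)];
+}
+\end{verbatim}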
-This results in the extended version of the ISAM tree having the following
-insert, sampling, and delete costs,
+Accounting for all of these costs, the time complexities of the various
+operations are,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
-where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
-tombstones.
+where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
+for tombstones, and $f$ is the fanout of the ISAM Tree.
-\subsection{Dynamically Extended WIRS Structure}
+\subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
\label{ssec:wirs-struct}
-As a final example of applying this framework, the WIRS problem will be
-considered. Specifically, the alias-augmented B+tree approach, described
-by Tao \cite{tao22}, generalizing work by Afshani and Wei \cite{afshani17},
-and Hu et al. \cite{hu14}, will be extended.
-This structure allows for efficient point-lookups, as
-it is based on the B+tree, and the total weight of a given WIRS query can
-be calculated given the query range using aggregate weight tags within
-the tree.
+As a final example of applying this framework, we consider WIRS. This
+is a decomposable sampling problem that can be answered using the
+alias-augmented B+Tree structure~\cite{tao22, afshani17,hu14}. This
+data structure is built over sorted data, but can be bulk-loaded from
+this data in linear time, resulting in costs of $B(n) \in \Theta(n \log n)$
+and $B_M(n) \in \Theta(n)$, though the constant factors associated with
+these functions are quite high, as each bulk load requires multiple
+linear-time operations for building both the B+Tree and the alias
+structures, among other things. As it is built on a B+Tree, the structure
+supports $L(n) \in \Theta(\log n)$ point lookups. Answering sampling
+queries requires $P(n) \in \Theta(\log n)$ pre-processing time to
+establish the query interval, during which the weight of the interval
+can be calculated in $W(n) \in \Theta(1)$ time using the aggregate weight
+tags in the tree's internal nodes. After this, samples can be drawn in
+$S(n) \in \Theta(1)$ time.
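+
+While the full alias-augmented structure is too involved to sketch here,
+the role of the aggregate weight information can be illustrated with a
+prefix-sum array over the sorted records, which likewise allows the total
+weight of a query interval to be computed in constant time once its
+endpoints are known. This is a simplification purely for illustration;
+the actual structure stores its aggregates in internal nodes and also
+supports constant-time per-sample draws.
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+struct Record {
+    long   key;
+    double weight;
+};
+
+// Simplified stand-in for aggregate weight tags: prefix[i] holds the
+// total weight of the first i records, so the weight of any interval
+// of indices is a single subtraction.
+std::vector<double> build_weight_prefix(const std::vector<Record> &sorted) {
+    std::vector<double> prefix(sorted.size() + 1, 0.0);
+    for (size_t i = 0; i < sorted.size(); i++) {
+        prefix[i + 1] = prefix[i] + sorted[i].weight;
+    }
+    return prefix;
+}
+
+// Total weight of the records with indices in [lo, hi), as needed for W(n).
+double interval_weight(const std::vector<double> &prefix,
+                       size_t lo, size_t hi) {
+    return prefix[hi] - prefix[lo];
+}
+\end{verbatim}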
-The alias-augmented B+tree is a static structure of linear space, capable
-of being built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, being
-bulk-loaded from sorted lists of records in $C_r(n) \in O(n)$ time,
-and answering WIRS queries in $O(\log_f n + k)$ time, where the query
-cost consists of preliminary work to identify the sampling range
-and calculate the total weight, with $P(n) \in O(\log_f n)$ cost, and
-constant-time drawing of samples from that range with $S(n) \in O(1)$.
-This results in the following costs,
+Plugging these costs into the framework results in,
\begin{align*}
- \text{Insertion:} \quad &O\left(\log_s n\right) \\
- \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
- \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+ \text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
+ \text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
+ \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
-tombstones. Because this is a weighted sampling structure, the custom
-compaction trigger discussed in in Section~\ref{ssec:wss-struct} is applied
-to maintain bounded rejection rates during sampling.
-
+tombstones, and $f$ is the fanout of the tree. This is another weighted
+sampling problem, and so we also apply the same rejection-rate-based
+compaction trigger as discussed in Section~\ref{ssec:wss-struct} for the
+dynamized alias structure.