\section{Framework Instantiations}
\label{sec:instance}
In this section, the framework is applied to three sampling problems and
their associated SSIs. All three sampling problems draw random samples from
records satisfying a simple predicate, and so result sets for all three can
be constructed by directly merging the result sets of the queries executed
against individual shards; this is the primary requirement for applying the
framework. The SSIs used for each problem are discussed, including their
support of the remaining two optional requirements for framework
application.

\subsection{Dynamically Extended WSS Structure}
\label{ssec:wss-struct}
As a first example of applying this framework for dynamic extension, the
alias structure for answering WSS queries is considered. This is a static
structure that can be constructed in $O(n)$ time and supports WSS queries
in $O(1)$ time. The alias structure will be used as the SSI, with each
shard containing an alias structure paired with a sorted array of records.
The use of sorted arrays for storing the records allows for more efficient
point-lookups, without requiring any additional space. The total weight
associated with a query against a given alias structure is the total weight
of all of its records, and can be tracked at the shard level and retrieved
in constant time.

Using the formulae from Section~\ref{sec:framework}, the worst-case costs
of insertion, sampling, and deletion are easily derived. The initial
construction cost from the buffer is $C_c(N_b) \in O(N_b \log N_b)$,
requiring the sorting of the buffer followed by alias construction. After
this point, the shards can be reconstructed in linear time while
maintaining sorted order. Thus, the reconstruction cost is $C_r(n) \in
O(n)$. As each shard contains a sorted array, the point-lookup cost is
$L(n) \in O(\log n)$. The total weight can be tracked with the shard,
requiring $W(n) \in O(1)$ time to access, and no preprocessing is
necessary, so $P(n) \in O(1)$. Samples can be drawn in $S(n) \in O(1)$
time. Plugging these results into the formulae for insertion, sampling, and
deletion costs gives,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
tombstones.

\Paragraph{Bounding Rejection Rate.} In the weighted sampling case, the
framework's generic record-based compaction trigger mechanism is
insufficient to bound the rejection rate. This is because the probability
of a given record being sampled depends upon its weight, as well as the
number of records in the index. If a heavily weighted record is deleted, it
will be preferentially sampled, resulting in a larger number of rejections
than would be expected based on record counts alone. This problem can be
rectified using the framework's user-specified compaction trigger
mechanism. In addition to tracking record counts, each level also tracks
its rejection rate,
\begin{equation*}
\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i >
\rho$ on a level, a compaction is triggered. Under the tombstone delete
policy, it is not the level containing the sampled record, but rather the
level containing its tombstone, that is considered the source of the
rejection. This is necessary to ensure that the compaction moves the
tombstone closer to canceling its associated record.
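To make the shard design concrete, the sketch below (in C++) pairs a sorted
record array with an alias structure built using Vose's alias method,
supporting $O(n)$ construction and $O(1)$ weighted sampling. This is a
minimal illustration of the scheme described above, not the framework's
actual implementation; all type and member names (\texttt{Record},
\texttt{AliasStructure}, and so on) are hypothetical.
\begin{verbatim}
// Sketch of a WSS shard: a sorted record array paired with an alias
// structure built using Vose's method. All names are illustrative.
#include <cstdint>
#include <queue>
#include <random>
#include <vector>

struct Record {
    uint64_t key;
    double   weight;
};

class AliasStructure {
public:
    // O(n) construction from the shard's records (assumed non-empty
    // and sorted by key).
    explicit AliasStructure(std::vector<Record> recs)
        : records(std::move(recs)),
          prob(records.size()), alias(records.size()), total_weight(0) {
        size_t n = records.size();
        for (const auto &r : records) total_weight += r.weight;

        // Scale weights so the average bucket holds probability 1, then
        // split buckets into under-full and over-full work queues.
        std::vector<double> scaled(n);
        std::queue<size_t> small, large;
        for (size_t i = 0; i < n; i++) {
            scaled[i] = records[i].weight * n / total_weight;
            (scaled[i] < 1.0 ? small : large).push(i);
        }
        // Pair each under-full bucket with an over-full donor.
        while (!small.empty() && !large.empty()) {
            size_t s = small.front(); small.pop();
            size_t l = large.front(); large.pop();
            prob[s]  = scaled[s];
            alias[s] = l;
            scaled[l] -= (1.0 - scaled[s]);
            (scaled[l] < 1.0 ? small : large).push(l);
        }
        while (!large.empty()) { prob[large.front()] = 1.0; large.pop(); }
        while (!small.empty()) { prob[small.front()] = 1.0; small.pop(); }
    }

    // O(1) weighted sample: one uniform bucket, one biased coin flip.
    const Record &sample(std::mt19937_64 &rng) const {
        std::uniform_int_distribution<size_t> bucket(0, records.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        size_t i = bucket(rng);
        return (coin(rng) < prob[i]) ? records[i] : records[alias[i]];
    }

    // Total weight, tracked at the shard level and retrieved in O(1).
    double weight() const { return total_weight; }

private:
    std::vector<Record> records;  // sorted: O(log n) point-lookups
    std::vector<double> prob;
    std::vector<size_t> alias;
    double total_weight;
};
\end{verbatim}
The sorted array is retained alongside the alias tables because the alias
structure alone cannot answer point-lookups; binary search over the array
provides them in $O(\log n)$ time, matching the $L(n)$ term above.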
\subsection{Dynamically Extended IRS Structure}
\label{ssec:irs-struct}
Another sampling problem to which the framework can be applied is
independent range sampling (IRS). The SSI in this example is the in-memory
ISAM tree. The ISAM tree supports efficient point-lookups directly, and the
total weight of an IRS query can be easily obtained by counting the number
of records within the query range, which is determined as part of the
preprocessing of the query.

The static nature of shards in the framework allows an ISAM tree to be
constructed with adjacent nodes positioned contiguously in memory. By
selecting a leaf node size that is a multiple of the record size, and
avoiding placing any headers within leaf nodes, the set of leaf nodes can
be treated as a sorted array of records with direct indexing, while the
internal nodes allow for faster searching of this array. Because of this
layout, per-sample tree traversals are avoided. The start and end of the
range from which to sample can be determined using a pair of traversals,
and then records can be sampled from this range using random number
generation and array indexing.

Assuming a sorted set of input records, the ISAM tree can be bulk-loaded in
linear time. The insertion analysis proceeds as in the WSS example
discussed previously. The initial construction cost is $C_c(N_b) \in O(N_b
\log N_b)$ and the reconstruction cost is $C_r(n) \in O(n)$. The ISAM tree
supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$ is the
fanout of the tree.

The process for performing range sampling against the ISAM tree involves
two stages. First, the tree is traversed twice: once to establish the index
of the first record greater than or equal to the lower bound of the query,
and again to find the index of the last record less than or equal to the
upper bound of the query. These traversals also yield the number of records
within the query range, which can be used to determine the weight of the
shard in the shard alias structure. Their cost is $P(n) \in O(\log_f n)$.
Once the bounds are established, samples can be drawn by generating uniform
random integers between the lower and upper bounds, in $S(n) \in O(1)$ time
each.

This results in the extended version of the ISAM tree having the following
insert, sampling, and delete costs,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones.
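The sketch below illustrates this two-stage process over a shard's
contiguous leaf array. The internal ISAM levels are elided for brevity:
\texttt{std::lower\_bound} and \texttt{std::upper\_bound} stand in for the
$O(\log_f n)$ tree traversals that locate the range, and the names here are
illustrative rather than taken from the framework.
\begin{verbatim}
// Sketch of IRS over a shard whose leaf nodes form one sorted array.
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

struct IrsShard {
    std::vector<uint64_t> keys;  // contiguous leaf nodes, sorted

    // Preprocessing, P(n): two traversals fix the half-open range
    // [lo, hi); its size (hi - lo) is the shard's weight in the
    // shard alias structure.
    std::pair<size_t, size_t>
    query_range(uint64_t lower, uint64_t upper) const {
        auto lo = std::lower_bound(keys.begin(), keys.end(), lower);
        auto hi = std::upper_bound(keys.begin(), keys.end(), upper);
        return {static_cast<size_t>(lo - keys.begin()),
                static_cast<size_t>(hi - keys.begin())};
    }

    // S(n) in O(1) per sample: a uniform index into the precomputed
    // range, with no per-sample tree traversal. Assumes lo < hi.
    uint64_t sample(size_t lo, size_t hi, std::mt19937_64 &rng) const {
        std::uniform_int_distribution<size_t> dist(lo, hi - 1);
        return keys[dist(rng)];
    }
};
\end{verbatim}
A query calls \texttt{query\_range} once, registers $hi - lo$ as the
shard's weight, and then draws each of its $k$ samples with an $O(1)$ call
to \texttt{sample}.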
\subsection{Dynamically Extended WIRS Structure}
\label{ssec:wirs-struct}
As a final example of applying this framework, the WIRS problem will be
considered. Specifically, the alias-augmented B+tree approach described by
Tao~\cite{tao22}, which generalizes work by Afshani and
Wei~\cite{afshani17} and Hu et al.~\cite{hu14}, will be extended. This
structure allows for efficient point-lookups, as it is based on the B+tree,
and the total weight of a given WIRS query can be calculated from the query
range using aggregate weight tags within the tree.

The alias-augmented B+tree is a linear-space static structure. It can be
built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, bulk-loaded from
sorted lists of records in $C_r(n) \in O(n)$ time, and can answer WIRS
queries in $O(\log_f n + k)$ time. The query cost consists of preliminary
work to identify the sampling range and calculate the total weight, with
$P(n) \in O(\log_f n)$ cost, followed by constant-time drawing of samples
from that range, with $S(n) \in O(1)$. This results in the following costs,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones. Because this is a weighted sampling structure, the custom
compaction trigger discussed in Section~\ref{ssec:wss-struct} is applied to
maintain bounded rejection rates during sampling.
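As a sketch of how this trigger might be wired in, the fragment below
tracks per-level sampling attempts and rejections and signals a compaction
once $\rho_i$ exceeds the cap $\rho$. The counter placement and interface
are assumptions made for illustration, not the framework's actual API.
\begin{verbatim}
// Sketch of the per-level rejection-rate compaction trigger.
#include <cstddef>
#include <vector>

struct LevelStats {
    size_t attempts   = 0;  // sampling attempts charged to this level
    size_t rejections = 0;  // rejections charged to this level
};

class RejectionTrigger {
public:
    explicit RejectionTrigger(double cap) : rho_cap(cap) {}

    // Record one attempt. Under the tombstone policy, a rejection is
    // charged to the level holding the tombstone, not the level that
    // produced the sampled record.
    void record(std::vector<LevelStats> &levels, size_t level,
                bool rejected) const {
        levels[level].attempts++;
        if (rejected) levels[level].rejections++;
    }

    // True when rho_i = rejections / attempts exceeds the cap,
    // signaling that level i should be compacted.
    bool should_compact(const LevelStats &s) const {
        return s.attempts > 0 &&
               static_cast<double>(s.rejections) / s.attempts > rho_cap;
    }

private:
    double rho_cap;  // configurable rejection-rate cap
};
\end{verbatim}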