\section{Framework Instantiations}
\label{sec:instance}
In this section, the framework is applied to three sampling problems and
their associated SSIs. All three sampling problems draw random samples from
records satisfying a simple predicate, and so result sets for all three can
be constructed by directly merging the result sets of the queries executed
against individual shards; this is the primary requirement for applying the
framework. The SSIs used for each problem are discussed, including their
support of the remaining two optional requirements for framework
application.

\subsection{Dynamically Extended WSS Structure}
\label{ssec:wss-struct}
As a first example of applying this framework for dynamic extension, the
alias structure for answering WSS queries is considered. This is a static
structure that can be constructed in $O(n)$ time and supports WSS queries
in $O(1)$ time. The alias structure will be used as the SSI, with each
shard containing an alias structure paired with a sorted array of records.
The use of sorted arrays for storing the records allows for more efficient
point-lookups, without requiring any additional space. The total weight
associated with a query against a given alias structure is the total weight
of all of its records, and can be tracked at the shard level and retrieved
in constant time.

Using the formulae from Section~\ref{sec:framework}, the worst-case costs
of insertion, sampling, and deletion are easily derived. The initial
construction cost from the buffer is $C_c(N_b) \in O(N_b \log N_b)$,
requiring the sorting of the buffer followed by alias construction. After
this point, the shards can be reconstructed in linear time while
maintaining sorted order. Thus, the reconstruction cost is $C_r(n) \in
O(n)$. As each shard contains a sorted array, the point-lookup cost is
$L(n) \in O(\log n)$. The total weight can be tracked with the shard,
requiring $W(n) \in O(1)$ time to access, and no preprocessing is
necessary, so $P(n) \in O(1)$. Samples can be drawn in $S(n) \in O(1)$
time. Plugging these results into the formulae for insertion, sampling, and
deletion costs gives,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
tombstones.

\Paragraph{Bounding Rejection Rate.} In the weighted sampling case, the
framework's generic record-based compaction trigger mechanism is
insufficient to bound the rejection rate. This is because the probability
of a given record being sampled depends upon its weight, as well as the
number of records in the index. If a heavily weighted record is deleted, it
will be preferentially sampled, resulting in a larger number of rejections
than would be expected based on record counts alone. This problem can be
rectified using the framework's user-specified compaction trigger
mechanism. In addition to tracking record counts, each level also tracks
its rejection rate,
\begin{equation*}
\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i >
\rho$ on a level, a compaction is triggered. Under the tombstone delete
policy, it is not the level containing the sampled record, but rather the
level containing its tombstone, that is considered the source of the
rejection. This is necessary to ensure that the compaction moves the
tombstone closer to canceling its associated record.
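To make the shard design concrete, the sketch below (in C++) pairs a sorted
record array with an alias structure built using Vose's alias method,
supporting $O(n)$ construction and $O(1)$ weighted sampling. This is a
minimal illustration of the scheme described above, not the framework's
actual implementation; all type and member names (\texttt{Record},
\texttt{AliasStructure}, and so on) are hypothetical.
\begin{verbatim}
// Sketch of a WSS shard: a sorted record array paired with an alias
// structure built using Vose's method. All names are illustrative.
#include <cstdint>
#include <queue>
#include <random>
#include <vector>

struct Record {
    uint64_t key;
    double   weight;
};

class AliasStructure {
public:
    // O(n) construction from the shard's records (assumed non-empty
    // and sorted by key).
    explicit AliasStructure(std::vector<Record> recs)
        : records(std::move(recs)),
          prob(records.size()), alias(records.size()), total_weight(0) {
        size_t n = records.size();
        for (const auto &r : records) total_weight += r.weight;

        // Scale weights so the average bucket holds probability 1, then
        // split buckets into under-full and over-full work queues.
        std::vector<double> scaled(n);
        std::queue<size_t> small, large;
        for (size_t i = 0; i < n; i++) {
            scaled[i] = records[i].weight * n / total_weight;
            (scaled[i] < 1.0 ? small : large).push(i);
        }
        // Pair each under-full bucket with an over-full donor.
        while (!small.empty() && !large.empty()) {
            size_t s = small.front(); small.pop();
            size_t l = large.front(); large.pop();
            prob[s]  = scaled[s];
            alias[s] = l;
            scaled[l] -= (1.0 - scaled[s]);
            (scaled[l] < 1.0 ? small : large).push(l);
        }
        while (!large.empty()) { prob[large.front()] = 1.0; large.pop(); }
        while (!small.empty()) { prob[small.front()] = 1.0; small.pop(); }
    }

    // O(1) weighted sample: one uniform bucket, one biased coin flip.
    const Record &sample(std::mt19937_64 &rng) const {
        std::uniform_int_distribution<size_t> bucket(0, records.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        size_t i = bucket(rng);
        return (coin(rng) < prob[i]) ? records[i] : records[alias[i]];
    }

    // Total weight, tracked at the shard level and retrieved in O(1).
    double weight() const { return total_weight; }

private:
    std::vector<Record> records;  // sorted: O(log n) point-lookups
    std::vector<double> prob;
    std::vector<size_t> alias;
    double total_weight;
};
\end{verbatim}
The sorted array is retained alongside the alias tables because the alias
structure alone cannot answer point-lookups; binary search over the array
provides them in $O(\log n)$ time, matching the $L(n)$ term above.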
\subsection{Dynamically Extended IRS Structure}
\label{ssec:irs-struct}
Another sampling problem to which the framework can be applied is
independent range sampling (IRS). The SSI in this example is the in-memory
ISAM tree. The ISAM tree supports efficient point-lookups directly, and the
total weight of an IRS query can be easily obtained by counting the number
of records within the query range, which is determined as part of the
preprocessing of the query.

The static nature of shards in the framework allows an ISAM tree to be
constructed with adjacent nodes positioned contiguously in memory. By
selecting a leaf node size that is a multiple of the record size, and
avoiding placing any headers within leaf nodes, the set of leaf nodes can
be treated as a sorted array of records with direct indexing, while the
internal nodes allow for faster searching of this array. Because of this
layout, per-sample tree traversals are avoided. The start and end of the
range from which to sample can be determined using a pair of traversals,
and then records can be sampled from this range using random number
generation and array indexing.

Assuming a sorted set of input records, the ISAM tree can be bulk-loaded in
linear time. The insertion analysis proceeds as in the WSS example
discussed previously. The initial construction cost is $C_c(N_b) \in O(N_b
\log N_b)$ and the reconstruction cost is $C_r(n) \in O(n)$. The ISAM tree
supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$ is the
fanout of the tree.

The process for performing range sampling against the ISAM tree involves
two stages. First, the tree is traversed twice: once to establish the index
of the first record greater than or equal to the lower bound of the query,
and again to find the index of the last record less than or equal to the
upper bound of the query. These traversals also yield the number of records
within the query range, which can be used to determine the weight of the
shard in the shard alias structure. Their cost is $P(n) \in O(\log_f n)$.
Once the bounds are established, samples can be drawn by generating uniform
random integers between the lower and upper bounds, in $S(n) \in O(1)$ time
each.

This results in the extended version of the ISAM tree having the following
insert, sampling, and delete costs,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones.
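The sketch below illustrates this two-stage process over a shard's
contiguous leaf array. The internal ISAM levels are elided for brevity:
\texttt{std::lower\_bound} and \texttt{std::upper\_bound} stand in for the
$O(\log_f n)$ tree traversals that locate the range, and the names here are
illustrative rather than taken from the framework.
\begin{verbatim}
// Sketch of IRS over a shard whose leaf nodes form one sorted array.
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

struct IrsShard {
    std::vector<uint64_t> keys;  // contiguous leaf nodes, sorted

    // Preprocessing, P(n): two traversals fix the half-open range
    // [lo, hi); its size (hi - lo) is the shard's weight in the
    // shard alias structure.
    std::pair<size_t, size_t>
    query_range(uint64_t lower, uint64_t upper) const {
        auto lo = std::lower_bound(keys.begin(), keys.end(), lower);
        auto hi = std::upper_bound(keys.begin(), keys.end(), upper);
        return {static_cast<size_t>(lo - keys.begin()),
                static_cast<size_t>(hi - keys.begin())};
    }

    // S(n) in O(1) per sample: a uniform index into the precomputed
    // range, with no per-sample tree traversal. Assumes lo < hi.
    uint64_t sample(size_t lo, size_t hi, std::mt19937_64 &rng) const {
        std::uniform_int_distribution<size_t> dist(lo, hi - 1);
        return keys[dist(rng)];
    }
};
\end{verbatim}
A query calls \texttt{query\_range} once, registers $hi - lo$ as the
shard's weight, and then draws each of its $k$ samples with an $O(1)$ call
to \texttt{sample}.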
\subsection{Dynamically Extended WIRS Structure}
\label{ssec:wirs-struct}
As a final example of applying this framework, the WIRS problem will be
considered. Specifically, the alias-augmented B+tree approach described by
Tao~\cite{tao22}, which generalizes work by Afshani and
Wei~\cite{afshani17} and Hu et al.~\cite{hu14}, will be extended. This
structure allows for efficient point-lookups, as it is based on the B+tree,
and the total weight of a given WIRS query can be calculated from the query
range using aggregate weight tags within the tree.

The alias-augmented B+tree is a linear-space static structure. It can be
built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, bulk-loaded from
sorted lists of records in $C_r(n) \in O(n)$ time, and can answer WIRS
queries in $O(\log_f n + k)$ time. The query cost consists of preliminary
work to identify the sampling range and calculate the total weight, with
$P(n) \in O(\log_f n)$ cost, followed by constant-time drawing of samples
from that range, with $S(n) \in O(1)$. This results in the following costs,
\begin{align*}
    \text{Insertion:} \quad &O\left(\log_s n\right) \\
    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones. Because this is a weighted sampling structure, the custom
compaction trigger discussed in Section~\ref{ssec:wss-struct} is applied to
maintain bounded rejection rates during sampling.
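As a sketch of how this trigger might be wired in, the fragment below
tracks per-level sampling attempts and rejections and signals a compaction
once $\rho_i$ exceeds the cap $\rho$. The counter placement and interface
are assumptions made for illustration, not the framework's actual API.
\begin{verbatim}
// Sketch of the per-level rejection-rate compaction trigger.
#include <cstddef>
#include <vector>

struct LevelStats {
    size_t attempts   = 0;  // sampling attempts charged to this level
    size_t rejections = 0;  // rejections charged to this level
};

class RejectionTrigger {
public:
    explicit RejectionTrigger(double cap) : rho_cap(cap) {}

    // Record one attempt. Under the tombstone policy, a rejection is
    // charged to the level holding the tombstone, not the level that
    // produced the sampled record.
    void record(std::vector<LevelStats> &levels, size_t level,
                bool rejected) const {
        levels[level].attempts++;
        if (rejected) levels[level].rejections++;
    }

    // True when rho_i = rejections / attempts exceeds the cap,
    // signaling that level i should be compacted.
    bool should_compact(const LevelStats &s) const {
        return s.attempts > 0 &&
               static_cast<double>(s.rejections) / s.attempts > rho_cap;
    }

private:
    double rho_cap;  // configurable rejection-rate cap
};
\end{verbatim}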