Diffstat (limited to 'chapters/sigmod23/examples.tex')
| -rw-r--r-- | chapters/sigmod23/examples.tex | 236 |
1 files changed, 117 insertions, 119 deletions
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex index cdbc398..38df04d 100644

\section{Applications of the Framework}
\label{sec:instance}

Using the framework from the previous section, we can create dynamizations
of SSIs for various sampling problems. In this section, we consider three
decomposable sampling problems and their associated SSIs, discussing the
implementation details necessary to ensure that they work efficiently.

\subsection{Weighted Set Sampling (Alias Structure)}
\label{ssec:wss-struct}

As a first example, we consider the alias structure~\cite{walker74} for
weighted set sampling. This is a static data structure that can be
constructed in $B(n) \in \Theta(n)$ time and answers sampling queries in
$\Theta(1)$ time per sample. The structure does \emph{not} directly support
point-lookups, nor is it naturally sorted in a way that allows for
convenient tombstone cancellation. However, it places no requirements on
the ordering of the underlying data, and so both of these limitations can
be addressed by building it over a sorted array.
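
To make the shard layout concrete, the following is a minimal sketch of
such a shard in Python (the names are illustrative, not those of our
implementation): a sorted array of (key, weight) records with a
Walker/Vose alias table built over it, giving linear-time construction,
constant-time sampling and total-weight access, and logarithmic
point-lookups.
\begin{verbatim}
import bisect, random

class AliasShard:
    """Static WSS shard: sorted records plus a Walker/Vose alias table."""
    def __init__(self, records):
        # records: iterable of (key, weight) pairs; sorted once on build.
        self.records = sorted(records)
        weights = [w for _, w in self.records]
        self.total_weight = sum(weights)       # W(n): tracked, O(1) access
        n = len(weights)
        # Vose's O(n) alias-table construction.
        scaled = [w * n / self.total_weight for w in weights]
        self.prob, self.alias = [0.0] * n, [0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s], self.alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:
            self.prob[i] = 1.0

    def sample(self):
        """S(n): pick a column uniformly, then flip its biased coin."""
        i = random.randrange(len(self.records))
        if random.random() < self.prob[i]:
            return self.records[i]
        return self.records[self.alias[i]]

    def lookup(self, key):
        """L(n): point lookup by binary search over the sorted keys."""
        i = bisect.bisect_left(self.records, (key,))
        if i < len(self.records) and self.records[i][0] == key:
            return self.records[i]
        return None
\end{verbatim}
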
This pre-sorting means that building a shard from the buffer requires
$B(n) \in \Theta(n \log n)$ time; after this, however, a sorted merge can
be used to perform reconstructions from the shards themselves. As the
maximum number of shards involved in a reconstruction under either layout
policy in our framework is $\Theta(1)$, reconstructions can be performed
in $B_M(n) \in \Theta(n)$ time, including tombstone cancellation. The
total weight of the structure can also be calculated at no additional
cost when it is constructed, allowing it to be accessed in $W(n) \in
\Theta(1)$ time as well. Point lookups over the sorted data can be
performed using binary search in $L(n) \in \Theta(\log n)$ time, and
sampling queries require no pre-processing, so $P(n) \in \Theta(1)$. The
mutable buffer can be sampled using rejection sampling.

This results in the following cost functions for the operations supported
by the dynamization,
\begin{align*}
    \text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
    \text{Worst-case Sampling:} \quad &\Theta\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log n\right)
\end{align*}
where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log n)$
for tombstones.

\Paragraph{Sampling Rejection Rate Bound.} Bounding the number of deleted
records is not, on its own, sufficient to bound the rejection rate of
weighted sampling queries, because it does not account for the weights of
the deleted records. Recall that, in our discussion of this bound, we
assumed all records had equal weights. Without this assumption, it is
possible to construct adversarial cases in which a very highly weighted
record is deleted, resulting in it being preferentially sampled and
rejected repeatedly.

To ensure that our solution is robust even in the face of such adversarial
workloads, for the weighted sampling case we introduce another compaction
trigger based on the measured rejection rate of each level. We define the
rejection rate of level $i$ as,
\begin{equation*}
    \rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
and allow the user to specify a maximum rejection rate, $\rho$. If
$\rho_i > \rho$ on a given level, then a proactive compaction is
triggered. In the case of tagged deletes, the rejection rate of a level
is based on the rejections resulting from sampling attempts on that
level. This will \emph{not} work when using tombstones, however, as
compacting the level containing the rejected record will not make
progress towards eliminating it from the structure. Instead, when using
tombstones, the rejection rate is tracked on the level containing the
tombstone that caused the rejection. This ensures that the tombstone is
moved towards its associated record, and that the compaction makes
progress towards cancelling it.
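
The sketch below illustrates one possible form of this bookkeeping in
Python (the names are hypothetical, and details such as how attempts are
attributed are simplified): each level tracks attempt and rejection
counters, rejections are charged to the sampled level under tagging and
to the tombstone's level under tombstone deletes, and a level is flagged
for proactive compaction once its measured rejection rate exceeds $\rho$.
\begin{verbatim}
class LevelStats:
    """Per-level counters for the rejection-rate compaction trigger."""
    def __init__(self):
        self.attempts = 0
        self.rejections = 0

    def rejection_rate(self):
        return self.rejections / self.attempts if self.attempts else 0.0

def note_attempt(stats, sampled_idx):
    """Record one sampling attempt against the level that was sampled."""
    stats[sampled_idx].attempts += 1

def note_rejection(stats, sampled_idx, tombstone_idx, use_tombstones, rho):
    """Charge a rejection and report whether the charged level now
    exceeds the rejection rate cap rho (True => compact that level).

    Under tagging the rejection is charged to the level that produced
    the rejected record; under tombstones it is charged to the level
    holding the matching tombstone, so that the triggered compaction
    moves the tombstone toward the record it cancels."""
    idx = tombstone_idx if use_tombstones else sampled_idx
    stats[idx].rejections += 1
    return stats[idx].rejection_rate() > rho
\end{verbatim}
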
\subsection{Independent Range Sampling (ISAM Tree)}
\label{ssec:irs-struct}

We next consider independent range sampling. For this decomposable
sampling problem, we use the ISAM tree as the SSI. Because our shards are
static, we can build highly compact and efficient ISAM trees by storing
the records directly in a sorted array. So long as the leaf node size is
a multiple of the record size, this array can be treated as a sequence of
leaf nodes, and internal nodes can be built above it using array indices
as pointers. These internal nodes can also be constructed contiguously in
an array, maximizing cache efficiency.

Building this structure from the buffer requires sorting the records
first and then performing a linear-time bulk load, and hence $B(n) \in
\Theta(n \log n)$. However, sorted-array merges can be used for
subsequent reconstructions, meaning that $B_M(n) \in \Theta(n)$. The data
structure itself supports point lookups in $L(n) \in \Theta(\log n)$
time. IRS queries can be answered by first using two tree traversals to
identify the minimum and maximum array indices associated with the query
range in $\Theta(\log n)$ time, and then generating array indices within
this range uniformly at random for each sample. The initial traversals
can be considered preprocessing, so $P(n) \in \Theta(\log n)$. The weight
of the shard is simply the difference between the upper and lower indices
of the range (i.e., the number of records in the range), and so can be
obtained in $W(n) \in \Theta(1)$ time, and the per-sample cost is a
single random number generation, so $S(n) \in \Theta(1)$. The mutable
buffer can be sampled using rejection sampling.
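
A minimal Python sketch of this layout and query procedure is given below
(the names and the fixed fanout are illustrative only): the sorted key
array acts as the leaf level, the stacked separator arrays play the role
of contiguous internal nodes, and an IRS query performs one descent per
bound before drawing samples by uniform index generation.
\begin{verbatim}
import bisect, random

class ISAMShard:
    """Static IRS shard: a sorted key array forms the leaf level; each
    internal level stores the first key of each group of FANOUT children."""
    FANOUT = 16

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        self.levels = []                     # internal levels, root last
        level = self.keys
        while len(level) > self.FANOUT:
            level = [level[i] for i in range(0, len(level), self.FANOUT)]
            self.levels.append(level)

    def _bound(self, key, leaf_bisect):
        """Descend the internal levels to one leaf group, then finish
        with a bounded binary search over the leaf array."""
        lo = 0
        hi = len(self.levels[-1]) if self.levels else len(self.keys)
        for level in reversed(self.levels):  # root first
            hi = min(hi, len(level))
            j = max(bisect.bisect_right(level, key, lo, hi) - 1, lo)
            lo, hi = j * self.FANOUT, (j + 1) * self.FANOUT
        return leaf_bisect(self.keys, key, lo, min(hi, len(self.keys)))

    def query_range(self, low, high):
        """P(n): [start, stop) indices of keys in [low, high]; the
        difference stop - start is this shard's weight W(n)."""
        return (self._bound(low, bisect.bisect_left),
                self._bound(high, bisect.bisect_right))

    def sample(self, start, stop):
        """S(n): one record uniformly from a (non-empty) query range."""
        return self.keys[random.randrange(start, stop)]
\end{verbatim}
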
Accounting for all of these costs, the time complexities of the various
operations are,
\begin{align*}
    \text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
    \text{Worst-case Sampling:} \quad &\Theta\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
    \text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
for tombstones, and $f$ is the fanout of the ISAM tree.

\subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
\label{ssec:wirs-struct}

As a final example of applying this framework, we consider WIRS. This is
a decomposable sampling problem that can be answered using the
alias-augmented B+Tree structure~\cite{tao22, afshani17, hu14}.
This data structure is built over sorted data, but can be bulk-loaded
from such data in linear time, resulting in costs of $B(n) \in \Theta(n
\log n)$ and $B_M(n) \in \Theta(n)$, though the constant factors
associated with these functions are quite high, as each bulk load
requires multiple linear-time passes to build the B+Tree and the alias
structures, among other work. As it is built on a B+Tree, the structure
supports point lookups in $L(n) \in \Theta(\log n)$ time. Answering
sampling queries requires $P(n) \in \Theta(\log n)$ pre-processing time
to establish the query interval, during which the weight of the interval
can be calculated in $W(n) \in \Theta(1)$ time using the aggregate weight
tags in the tree's internal nodes. After this, samples can be drawn in
$S(n) \in \Theta(1)$ time.

This results in the following costs,
\begin{align*}
    \text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
    \text{Worst-case Sampling:} \quad &\Theta\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
    \text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
for tombstones, and $f$ is the fanout of the tree. This is another
weighted sampling problem, and so we also apply the rejection-rate-based
compaction trigger discussed in Section~\ref{ssec:wss-struct} for the
dynamized alias structure.
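
For intuition, and under the assumption that the framework's query
procedure from Section~\ref{sec:framework} preprocesses each of the
$\Theta(\log_s n)$ shards and then draws the $k$ samples individually,
retrying on rejection, the sampling bound above decomposes as
\begin{equation*}
    \underbrace{\Theta(\log_s n) \cdot P(n)}_{\text{per-shard preprocessing}}
    + \underbrace{\frac{k}{1-\delta}\left(S(n) + R(n)\right)}_{k\text{ samples, with retries}}
    \in \Theta\left(\log_s n \log_f n + \frac{k}{1-\delta} \cdot R(n)\right)
\end{equation*}
with $P(n) \in \Theta(\log_f n)$ and $S(n) \in \Theta(1)$ for the
alias-augmented B+Tree; the same decomposition recovers the WSS and IRS
sampling bounds given in the preceding subsections.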