\section{Framework Instantiations}
\label{sec:instance}
In this section, the framework is applied to three sampling problems and their
associated SSIs. All three problems draw random samples from records
satisfying a simple predicate, so a result set can be constructed by
directly merging the result sets of the queries executed against the
individual shards; this is the primary requirement for applying the
framework. The SSI used for each problem is discussed, including its
support for the remaining two optional requirements for framework application.
\subsection{Dynamically Extended WSS Structure}
\label{ssec:wss-struct}
As a first example of applying this framework for dynamic extension,
the alias structure for answering WSS queries is considered. This is a
static structure that can be constructed in $O(n)$ time and supports WSS
queries in $O(1)$ time. The alias structure will be used as the SSI, with
the shards containing an alias structure paired with a sorted array of
records. The use of sorted arrays for storing the records
allows for more efficient point-lookups without requiring any additional
space. The total weight associated with a query against
a given alias structure is the total weight of all of its records,
which can be tracked at the shard level and retrieved in constant time.
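To make the shard-level machinery concrete, a minimal C++ sketch of such an
alias structure, built using Vose's method, is shown below. The type and
member names are illustrative rather than drawn from any particular
implementation; the structure is built once per shard and never modified,
matching the framework's static-shard model.
\begin{verbatim}
#include <cstdint>
#include <random>
#include <vector>

// Minimal alias structure sketch (Vose's method): O(n) construction
// and O(1) sampling. Names are illustrative, not an implementation.
struct AliasStructure {
    std::vector<double> prob;   // acceptance probability per slot
    std::vector<size_t> alias;  // fallback index per slot
    double total_weight = 0.0;  // tracked so that W(n) is O(1)

    explicit AliasStructure(const std::vector<double> &weights) {
        size_t n = weights.size();
        prob.resize(n);
        alias.resize(n);
        for (double w : weights) total_weight += w;

        // Scale weights so that the average slot weight is exactly 1.
        std::vector<double> scaled(n);
        std::vector<size_t> small, large;
        for (size_t i = 0; i < n; i++) {
            scaled[i] = weights[i] * n / total_weight;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        // Pair each underfull slot with an overfull one.
        while (!small.empty() && !large.empty()) {
            size_t s = small.back(); small.pop_back();
            size_t l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] -= (1.0 - scaled[s]);
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Any remaining slots are exactly full.
        for (size_t s : small) prob[s] = 1.0;
        for (size_t l : large) prob[l] = 1.0;
    }

    // Draw one record index in O(1): pick a slot uniformly, then
    // either accept it or redirect to its alias.
    size_t sample(std::mt19937 &rng) const {
        std::uniform_int_distribution<size_t> slot(0, prob.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        size_t i = slot(rng);
        return coin(rng) < prob[i] ? i : alias[i];
    }
};
\end{verbatim}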
Using the formulae from Section~\ref{sec:framework}, the worst-case
costs of insertion, sampling, and deletion are easily derived. The
initial construction cost from the buffer is $C_c(N_b) \in O(N_b
\log N_b)$, requiring the sorting of the buffer followed by alias
construction. After this point, the shards can be reconstructed in
linear time while maintaining sorted order. Thus, the reconstruction
cost is $C_r(n) \in O(n)$. As each shard contains a sorted array,
the point-lookup cost is $L(n) \in O(\log n)$. The total weight can
be tracked with the shard, requiring $W(n) \in O(1)$ time to access,
and there is no necessary preprocessing, so $P(n) \in O(1)$. Samples
can be drawn in $S(n) \in O(1)$ time. Plugging these results into the
formulae for insertion, sampling, and deletion costs gives,
\begin{align*}
\text{Insertion:} \quad &O\left(\log_s n\right) \\
\text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
\text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
tombstones.
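To see where the sampling cost arises, the following sketch traces the query
path across the shards under the tagging delete policy, reusing the
\texttt{AliasStructure} sketch above. The \texttt{Shard} interface here is
hypothetical, and the mutable buffer is omitted for brevity.
\begin{verbatim}
#include <utility>

// Hypothetical shard interface for the WSS query sketch.
struct Shard {
    AliasStructure alias;      // per-shard alias over record weights
    std::vector<bool> tagged;  // deletion tags (tagging policy)
};

// Draw k samples across all shards. Building the shard alias from the
// O(1)-accessible shard weights is the O(log_s n) setup term; each
// rejected (tagged) record forces a retry, giving the k/(1 - delta)
// factor, assuming the deleted proportion delta is bounded below 1.
std::vector<std::pair<size_t, size_t>>
wss_query(const std::vector<Shard> &shards, size_t k, std::mt19937 &rng) {
    std::vector<double> weights;
    for (const auto &s : shards) weights.push_back(s.alias.total_weight);
    AliasStructure shard_alias(weights);

    std::vector<std::pair<size_t, size_t>> result;  // (shard, record)
    while (result.size() < k) {
        size_t si = shard_alias.sample(rng);       // pick shard by weight
        size_t ri = shards[si].alias.sample(rng);  // S(n) = O(1) draw
        if (!shards[si].tagged[ri])                // R(n) = O(1) check
            result.push_back({si, ri});
    }
    return result;
}
\end{verbatim}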
\Paragraph{Bounding Rejection Rate.} In the weighted sampling case,
the framework's generic record-based compaction trigger mechanism
is insufficient to bound the rejection rate. This is because the
probability of a given record being sampled depends upon its
weight, as well as the number of records in the index. If a highly
weighted record is deleted, it will be preferentially sampled, resulting
in a larger number of rejections than would be expected based on record
counts alone. This problem can be rectified using the framework's user-specified
compaction trigger mechanism.
In addition to
tracking record counts, each level also tracks its rejection rate,
\begin{equation*}
\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}
\end{equation*}
A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i
> \rho$ on a level, a compaction is triggered. In the case of
the tombstone delete policy, it is not the level containing the sampled
record, but rather the level containing its tombstone, that is considered
the source of the rejection. This is necessary to ensure that the tombstone
is moved closer to canceling its associated record by the compaction.
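A sketch of this trigger under assumed names is given below;
\texttt{compact\_level} stands in for the framework's existing compaction
routine and is hypothetical.
\begin{verbatim}
#include <cstdint>
#include <vector>

void compact_level(size_t i);  // hypothetical: merges level i downward

// Per-level rejection tracking for the user-specified trigger.
struct LevelStats {
    uint64_t rejections = 0;
    uint64_t attempts = 0;

    // rho_i = rejections / sampling attempts
    double rejection_rate() const {
        return attempts == 0 ? 0.0
                             : double(rejections) / double(attempts);
    }
};

// Called after each sampling query: compact any level whose observed
// rejection rate exceeds the configured cap rho. Under tombstones, each
// rejection is charged to the level containing the tombstone, not the
// level containing the sampled record.
void check_compaction_trigger(std::vector<LevelStats> &levels, double rho) {
    for (size_t i = 0; i < levels.size(); i++) {
        if (levels[i].rejection_rate() > rho) {
            compact_level(i);
            levels[i] = LevelStats{};  // reset counters after compaction
        }
    }
}
\end{verbatim}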
\subsection{Dynamically Extended IRS Structure}
\label{ssec:irs-struct}
Another sampling problem to which the framework can be applied is
independent range sampling (IRS). The SSI in this example is the in-memory
ISAM tree. The ISAM tree supports efficient point-lookups
directly, and the total weight of an IRS query can be
easily obtained by counting the number of records within the query range,
which is determined as part of the preprocessing of the query.
The static nature of shards in the framework allows for an ISAM tree
to be constructed with adjacent nodes positioned contiguously in memory.
By selecting a leaf node size that is a multiple of the record size, and
avoiding placing any headers within leaf nodes, the set of leaf nodes can
be treated as a sorted array of records with direct indexing, and the
internal nodes allow for faster searching of this array.
Because of this layout, per-sample tree-traversals are avoided. The
start and end of the range from which to sample can be determined using
a pair of traversals, and then records can be sampled from this range
using random number generation and array indexing.
Assuming a sorted set of input records, the ISAM tree can be bulk-loaded
in linear time. The insertion analysis proceeds as in the WSS example
discussed previously. The initial construction cost is $C_c(N_b) \in
O(N_b \log N_b)$ and reconstruction cost is $C_r(n) \in O(n)$. The ISAM
tree supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$
is the fanout of the tree.
The process for performing range sampling against the ISAM tree involves
two stages. First, the tree is traversed twice: once to establish the index of
the first record greater than or equal to the lower bound of the query,
and again to find the index of the last record less than or equal to the
upper bound of the query. This process also yields the number of records
within the query range, which serves as the shard's weight in the shard
alias structure. Its cost is $P(n)
\in O(\log_f n)$. Once the bounds are established, samples can be drawn
by generating uniform random integers between the lower and upper bounds,
in $S(n) \in O(1)$ time each.
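The following sketch shows this two-stage procedure against the flattened
leaf layout. For brevity, a binary search over the leaf array stands in for
the internal-node traversal (which achieves the $O(\log_f n)$ bound); the
names are illustrative.
\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// IRS against the contiguous, header-free leaf region of the ISAM
// tree, which behaves as a sorted array. Names are illustrative.
struct IsamShard {
    std::vector<uint64_t> keys;  // the flattened leaf nodes, sorted

    // Preprocessing, P(n): locate the query range [first, last). The
    // internal nodes would make this O(log_f n); binary search shown.
    std::pair<size_t, size_t> bounds(uint64_t lo, uint64_t hi) const {
        size_t first =
            std::lower_bound(keys.begin(), keys.end(), lo) - keys.begin();
        size_t last =
            std::upper_bound(keys.begin(), keys.end(), hi) - keys.begin();
        return {first, last};  // last - first is the query weight
    }

    // S(n) = O(1) per sample: one uniform index draw plus one array
    // access, with no per-sample tree traversal. Assumes first < last.
    uint64_t sample(size_t first, size_t last, std::mt19937 &rng) const {
        std::uniform_int_distribution<size_t> idx(first, last - 1);
        return keys[idx(rng)];
    }
};
\end{verbatim}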
Plugging these costs into the framework formulae gives the extended ISAM
tree the following insertion, sampling, and delete costs,
\begin{align*}
\text{Insertion:} \quad &O\left(\log_s n\right) \\
\text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
\text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones.
\subsection{Dynamically Extended WIRS Structure}
\label{ssec:wirs-struct}
As a final example of applying this framework, the WIRS problem will be
considered. Specifically, the alias-augmented B+tree described by Tao
\cite{tao22}, which generalizes work by Afshani and Wei \cite{afshani17}
and Hu et al. \cite{hu14}, will be extended.
This structure allows for efficient point-lookups, as
it is based on the B+tree, and the total weight of a given WIRS query can
be calculated given the query range using aggregate weight tags within
the tree.
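The full alias-augmented B+tree is beyond the scope of a short sketch, but
the role of the aggregate weight tags can be illustrated with a simplified
stand-in: prefix sums over the sorted records answer the same range-weight
question that the tree answers by combining per-subtree weight tags during
its traversal. The names below are illustrative.
\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Simplified stand-in for the aggregate weight tags: with a shard's
// records in sorted order, prefix sums over the weights answer the
// range-weight portion of a WIRS query. The alias-augmented B+tree
// maintains equivalent aggregates as tags on its internal nodes.
struct WirsWeights {
    std::vector<uint64_t> keys;  // sorted record keys
    std::vector<double> prefix;  // prefix[i] = weight of keys[0..i)

    WirsWeights(const std::vector<uint64_t> &k,
                const std::vector<double> &w) : keys(k) {
        prefix.resize(w.size() + 1, 0.0);
        std::partial_sum(w.begin(), w.end(), prefix.begin() + 1);
    }

    // Total weight of records with keys in [lo, hi]; O(log n) here,
    // O(log_f n) when read off the tree's weight tags instead.
    double range_weight(uint64_t lo, uint64_t hi) const {
        size_t first =
            std::lower_bound(keys.begin(), keys.end(), lo) - keys.begin();
        size_t last =
            std::upper_bound(keys.begin(), keys.end(), hi) - keys.begin();
        return prefix[last] - prefix[first];
    }
};
\end{verbatim}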
The alias-augmented B+tree is a static, linear-space structure. It can be
built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, bulk-loaded from
sorted lists of records in $C_r(n) \in O(n)$ time, and answers WIRS queries
in $O(\log_f n + k)$ time. The query cost consists of preliminary work to
identify the sampling range and calculate the total weight, with $P(n) \in
O(\log_f n)$, followed by constant-time draws of samples from that range,
with $S(n) \in O(1)$.
This results in the following costs,
\begin{align*}
\text{Insertion:} \quad &O\left(\log_s n\right) \\
\text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
\text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
tombstones. Because this is a weighted sampling structure, the custom
compaction trigger discussed in Section~\ref{ssec:wss-struct} is applied
to maintain bounded rejection rates during sampling.