\section{Applications of the Framework}
\label{sec:instance}
Using the framework from the previous section, we can create dynamizations
of SSIs for various sampling problems. In this section, we consider
three different decomposable sampling problems and their associated SSIs,
discussing the implementation details necessary for each to work
efficiently.
\subsection{Weighted Set Sampling (Alias Structure)}
\label{ssec:wss-struct}
As a first example, we consider the alias structure~\cite{walker74}
for weighted set sampling. This is a static data structure that is
constructible in $B(n) \in \Theta(n)$ time and can answer sampling
queries in $\Theta(1)$ time per sample. The structure does \emph{not}
directly support point lookups, nor is it naturally sorted in a way
that allows convenient tombstone cancellation. However, the structure
itself places no requirements on the ordering of the underlying data,
and so both of these limitations can be addressed by building it
over a sorted array.
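
To make the structure concrete, the following sketch shows Vose's $O(n)$
variant of the alias construction along with its $\Theta(1)$ per-sample
draw. The types and names here are illustrative, not taken from our
implementation.

\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Sketch of Walker's alias structure (Vose's O(n) construction).
struct AliasTable {
    std::vector<double> prob;   // acceptance probability per slot
    std::vector<size_t> alias;  // fallback index per slot

    explicit AliasTable(const std::vector<double> &weights) {
        size_t n = weights.size();
        prob.resize(n);
        alias.resize(n);

        double total = 0;
        for (double w : weights) total += w;

        // Scale weights so that the average slot weight is 1.
        std::vector<double> scaled(n);
        std::vector<size_t> small, large;
        for (size_t i = 0; i < n; i++) {
            scaled[i] = weights[i] * n / total;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }

        // Pair each under-full slot with an over-full one.
        while (!small.empty() && !large.empty()) {
            size_t s = small.back(); small.pop_back();
            size_t l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] -= (1.0 - scaled[s]);
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Any remaining slots are (numerically) exactly full.
        for (size_t s : small) prob[s] = 1.0;
        for (size_t l : large) prob[l] = 1.0;
    }

    // Draw one index in O(1): pick a slot uniformly, then either
    // accept it or take its alias.
    template <typename RNG>
    size_t sample(RNG &rng) const {
        std::uniform_int_distribution<size_t> slot(0, prob.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        size_t i = slot(rng);
        return coin(rng) < prob[i] ? i : alias[i];
    }
};
\end{verbatim}

Building the table over a sorted array changes nothing in this
construction: the weights are simply read off the sorted records, and a
sampled index maps directly back into that array.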
This pre-sorting means that building from the buffer requires $B(n) \in
\Theta(n \log n)$ time; after that, however, a sorted merge can be used
to perform reconstructions from the shards themselves. Because the maximum
number of shards involved in a reconstruction under either layout policy
is $\Theta(1)$ in our framework, reconstructions can be performed
in $B_M(n) \in \Theta(n)$ time, including tombstone cancellation. The
total weight of the structure can also be calculated at no additional
cost during construction, giving $W(n) \in \Theta(1)$ as well. Point
lookups over the sorted data can be performed with a binary search in
$L(n) \in \Theta(\log n)$ time, and sampling queries require no
pre-processing, so $P(n) \in \Theta(1)$. The mutable buffer can be
sampled using rejection sampling.
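
The buffer sampling procedure can be sketched as follows, assuming the
buffer tracks the maximum weight among its records; the \texttt{Record}
type and the \texttt{max\_weight} bookkeeping are hypothetical details.

\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical record type; the actual layout is framework-specific.
struct Record {
    long key;
    double weight;
};

// Weighted rejection sampling over the unsorted mutable buffer: draw a
// record uniformly at random, then accept it with probability w / w_max,
// so each record is returned with probability proportional to its weight.
template <typename RNG>
const Record &buffer_sample(const std::vector<Record> &buf,
                            double max_weight, RNG &rng) {
    std::uniform_int_distribution<size_t> idx(0, buf.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    while (true) {
        const Record &r = buf[idx(rng)];
        if (coin(rng) < r.weight / max_weight)
            return r;   // accept
        // reject and retry
    }
}
\end{verbatim}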
This results in the following cost functions for the various operations
supported by the dynamization,
\begin{align*}
\text{Amortized Insertion/Tombstone Delete:} \quad &\Theta\left(\log_s n\right) \\
\text{Worst-case Sampling:} \quad &\Theta\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
\text{Worst-case Tagged Delete:} \quad &\Theta\left(\log_s n \log n\right)
\end{align*}
where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log n)$ for
tombstones.

\Paragraph{Sampling Rejection Rate Bound.} Bounding the number of deleted
records is not, on its own, sufficient to bound the rejection rate of
weighted sampling queries, because it does not account for the weights of
the deleted records. Recall that our discussion of this bound assumed all
records had equal weights. Without this assumption, it is possible to
construct adversarial cases in which a very heavily weighted record is
deleted, causing it to be preferentially sampled and repeatedly rejected.

To ensure that our solution is robust even in the face of such adversarial
workloads, we introduce, for the weighted sampling case, another compaction
trigger based on the measured rejection rate of each level. We define the
rejection rate of level $i$ as
\begin{equation*}
\rho_i = \frac{\text{rejections}_i}{\text{sampling attempts}_i}
\end{equation*}
and allow the user to specify a maximum rejection rate, $\rho$. If $\rho_i
> \rho$ on a given level, a proactive compaction of that level is
triggered. In the case of tagged deletes, the rejection rate of a level
is based on the rejections resulting from sampling attempts against that
level. This will \emph{not} work with tombstones, however, because
compacting the level containing the rejected record does not make progress
towards eliminating that record from the structure. Instead, when using
tombstones, rejections are attributed to the level containing the
tombstone that caused the rejection. This ensures that each compaction
moves the tombstone towards its associated record, and therefore
makes progress towards removing it.
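
The bookkeeping required for this trigger is light. A minimal sketch
follows; the counter names and the minimum-attempt threshold (included so
that early statistical noise does not force spurious compactions) are our
own illustrative choices.

\begin{verbatim}
#include <cstddef>

// Per-level rejection statistics for the compaction trigger. Under
// tagging, a rejection is charged to the level that was sampled; under
// tombstones, to the level holding the responsible tombstone.
struct LevelStats {
    size_t attempts = 0;
    size_t rejections = 0;

    double rejection_rate() const {
        return attempts ? (double)rejections / attempts : 0.0;
    }
};

// Checked after each rejection is recorded; rho is the user-specified
// maximum rejection rate.
bool needs_compaction(const LevelStats &stats, double rho,
                      size_t min_attempts = 128) {
    return stats.attempts >= min_attempts &&
           stats.rejection_rate() > rho;
}
\end{verbatim}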
\subsection{Independent Range Sampling (ISAM Tree)}
\label{ssec:irs-struct}
We next consider independent range sampling (IRS). For this decomposable
sampling problem, we use the ISAM tree as the SSI. Because our shards are
static, we can build highly compact and efficient ISAM trees by storing
the records directly in a sorted array. So long as the leaf node size is
a multiple of the record size, this array can be treated as a sequence of
leaf nodes in the tree, and internal nodes can be built above it using
array indices as pointers. These internal nodes can themselves be stored
contiguously in an array, maximizing cache efficiency.
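
A sketch of this bulk-loading scheme is shown below. For brevity it
indexes keys rather than full records, and \texttt{LEAF\_SIZE} and
\texttt{FANOUT} are illustrative constants rather than tuned values.

\begin{verbatim}
#include <cstddef>
#include <vector>

// Bulk-load the internal levels of an implicit ISAM tree over a sorted
// key array. Leaves are fixed-size runs of the array; each internal
// level stores the smallest key of each child node, so array offsets
// serve as pointers in place of actual memory addresses.
constexpr size_t LEAF_SIZE = 64;
constexpr size_t FANOUT = 8;

std::vector<std::vector<long>>
build_internal_levels(const std::vector<long> &sorted_keys) {
    std::vector<std::vector<long>> levels;

    // Bottom internal level: the minimum key of each leaf node.
    std::vector<long> cur;
    for (size_t i = 0; i < sorted_keys.size(); i += LEAF_SIZE)
        cur.push_back(sorted_keys[i]);

    // Build successive levels until a single root node remains.
    while (cur.size() > FANOUT) {
        levels.push_back(cur);
        std::vector<long> next;
        for (size_t i = 0; i < cur.size(); i += FANOUT)
            next.push_back(cur[i]);
        cur = next;
    }
    levels.push_back(cur);  // root level
    return levels;
}
\end{verbatim}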
Building this structure from the buffer requires first sorting the
records and then performing a linear-time bulk load, hence $B(n)
\in \Theta(n \log n)$. However, sorted-array merges can be used for
subsequent reconstructions, meaning that $B_M(n) \in \Theta(n)$. The data
structure itself supports point lookups in $L(n)\in \Theta(\log n)$ time.

IRS queries can be answered by first using two tree traversals to identify
the minimum and maximum array indices associated with the query range
in $\Theta(\log n)$ time, and then generating array indices within this
range uniformly at random for each sample. The initial traversals
constitute preprocessing, so $P(n) \in \Theta(\log n)$. The weight
of the shard is simply the difference between the upper and lower indices
of the range (i.e., the number of records in the range), and so $W(n)
\in \Theta(1)$; the per-sample cost is a single random number
generation, so $S(n) \in \Theta(1)$. The mutable buffer can be sampled
using rejection sampling.
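
Over the sorted array backing the tree, this query logic reduces to the
sketch below, in which binary searches stand in for the two tree
traversals; the names are illustrative.

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Answer an IRS query for k samples from [lo, hi] over a sorted array.
template <typename RNG>
std::vector<long> irs_sample(const std::vector<long> &sorted_keys,
                             long lo, long hi, size_t k, RNG &rng) {
    // Preprocessing (P(n)): locate the index range [start, stop)
    // covering the query interval.
    std::ptrdiff_t start =
        std::lower_bound(sorted_keys.begin(), sorted_keys.end(), lo)
        - sorted_keys.begin();
    std::ptrdiff_t stop =
        std::upper_bound(sorted_keys.begin(), sorted_keys.end(), hi)
        - sorted_keys.begin();

    std::vector<long> out;
    if (start >= stop) return out;  // empty interval: weight is zero

    // stop - start is the shard's weight W(n) for this query; each of
    // the k samples is a single O(1) uniform index draw.
    std::uniform_int_distribution<std::ptrdiff_t> idx(start, stop - 1);
    for (size_t i = 0; i < k; i++)
        out.push_back(sorted_keys[idx(rng)]);
    return out;
}
\end{verbatim}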
Accounting for all of these costs, the time complexities of the various
operations are,
\begin{align*}
\text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
\text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
\text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in \Theta(1)$ for tagging, $R(n) \in \Theta(\log_s n \log_f n)$
for tombstones, and $f$ is the fanout of the ISAM tree.
\subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
\label{ssec:wirs-struct}
As a final example of applying this framework, we consider weighted
independent range sampling (WIRS). This is a decomposable sampling
problem that can be answered using the alias-augmented B+Tree
structure~\cite{tao22, afshani17, hu14}. This data structure is built
over sorted data, but can be bulk-loaded from such data in linear time,
resulting in costs of $B(n) \in \Theta(n \log n)$ and $B_M(n) \in
\Theta(n)$. The constant factors associated with these functions are
quite high, however, as each bulk load requires multiple linear-time
passes to build both the B+Tree and the alias structures. As it is built
on a B+Tree, the structure supports point lookups in $L(n) \in
\Theta(\log n)$ time. Answering sampling queries requires $P(n) \in
\Theta(\log n)$ pre-processing time to establish the query interval,
during which the weight of the interval can be calculated in $W(n) \in
\Theta(1)$ time using the aggregate weight tags in the tree's internal
nodes. After this, samples can be drawn in $S(n) \in \Theta(1)$ time.
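
The role of these aggregate weight tags can be illustrated with a
simplification: because each shard is static, the interval weight that
the tree derives from its tags can equally be read from a prefix-sum
array built once at construction time. The sketch below substitutes that
prefix-sum array for the internal tags; it is a stand-in for, not an
implementation of, the alias-augmented B+Tree.

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Interval-weight computation over a static shard via prefix sums.
struct WeightIndex {
    std::vector<long> keys;      // sorted keys
    std::vector<double> prefix;  // prefix[i] = total weight of keys[0..i)

    WeightIndex(std::vector<long> k, const std::vector<double> &weights)
        : keys(std::move(k)), prefix(weights.size() + 1, 0.0) {
        for (size_t i = 0; i < weights.size(); i++)
            prefix[i + 1] = prefix[i] + weights[i];
    }

    // Total weight of records with keys in [lo, hi]: two binary
    // searches (the P(n) term), then an O(1) subtraction (W(n)).
    double range_weight(long lo, long hi) const {
        std::ptrdiff_t start =
            std::lower_bound(keys.begin(), keys.end(), lo) - keys.begin();
        std::ptrdiff_t stop =
            std::upper_bound(keys.begin(), keys.end(), hi) - keys.begin();
        return prefix[stop] - prefix[start];
    }
};
\end{verbatim}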
This all results in the following costs,
\begin{align*}
\text{Amortized Insertion/Tombstone Delete:} \quad &O\left(\log_s n\right) \\
\text{Worst-case Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
\text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
\end{align*}
where $R(n) \in O(1)$ for tagging, $R(n) \in O(\log_s n \log_f n)$ for
tombstones, and $f$ is the fanout of the tree. As this is another
weighted sampling problem, we also apply the rejection-rate-based
compaction trigger discussed in Section~\ref{ssec:wss-struct} for the
dynamized alias structure.