\section{Dynamization of SSIs}
\label{sec:framework}
Our goal, then, is to design a solution to independent sampling that is
able to achieve \emph{both} efficient updates and efficient sampling,
while also maintaining statistical independence both within and between
IQS queries, and to do so in a generalized fashion without needing to
design new dynamic data structures for each problem. Given the range
of SSIs already available, it seems reasonable to attempt to apply
dynamization techniques to accomplish this goal. Using the Bentley-Saxe
method would allow us to support inserts and deletes without
requiring any modification of the SSIs. Unfortunately, as discussed
in Section~\ref{ssec:background-irs}, there are problems with directly
applying BSM to sampling problems. All of the considerations discussed
there in the context of IRS apply equally to the other sampling problems
considered in this chapter. In this section, we will discuss approaches
for resolving these problems.
\begin{table}[t]
\centering
\begin{tabular}{|l l|}
\hline
\textbf{Variable} & \textbf{Description} \\ \hline
$N_B$ & Capacity of the mutable buffer \\ \hline
$s$ & Scale factor \\ \hline
$B(n)$ & SSI construction cost from unsorted records \\ \hline
$B_M(n)$ & SSI reconstruction cost from existing SSI instances\\ \hline
$L(n)$ & SSI point-lookup cost \\ \hline
$P(n)$ & SSI sampling pre-processing cost \\ \hline
$S(n)$ & SSI per-sample sampling cost \\ \hline
$W(n)$ & SSI weight determination cost \\ \hline
$R(n)$ & Rejection check cost \\ \hline
$\delta$ & Maximum delete proportion \\ \hline
\end{tabular}
\caption{\textbf{Nomenclature.} A reference of variables and functions
used in this chapter.}
\label{tab:nomen}
\end{table}
\subsection{Sampling over Decomposed Structures}
\label{ssec:decomposed-structure-sampling}
The core problem facing any attempt to dynamize SSIs is that independently
sampling from a decomposed structure is difficult. As discussed in
Section~\ref{ssec:background-irs}, accomplishing this task within the
DSP model used by the Bentley-Saxe method requires drawing a full $k$
samples from each of the blocks, and then repeatedly down-sampling each
of the intermediate sample sets. However, it is possible to devise a
more efficient query process if we abandon the DSP model and consider
a slightly more complicated procedure.
First, we'll define the IQS problem in terms of the notation and concepts
used in Chapter~\ref{chap:background} for search problems,
\begin{definition}[Independent Query Sampling Problem]
Given a search problem, $F$, a query sampling problem is a function
of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+)
\to \mathcal{R}$ where $\mathcal{D}$ is the domain of records
and $\mathcal{Q}$ is the domain of query parameters of $F$. The
solution to a sampling problem, $R \in \mathcal{R}$ will be a subset
of records from the solution to $F$ drawn independently such that,
$|R| = k$ for some $k \in \mathbb{Z}^+$.
\end{definition}
To consider the decomposability of such problems, we need to resolve a
minor definitional issue. As noted before, the DSP model is based on
deterministic queries. The definition doesn't apply for sampling queries,
because it assumes that the result sets of identical queries should
also be identical. For general IQS, we also need to enforce conditions
on the query being sampled from. Based on these observations, we can
define the decomposability conditions for a query sampling problem,
\begin{definition}[Decomposable Sampling Problem]
\label{def:decomp-sampling}
A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
\mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if
the following conditions are met for all $q \in \mathcal{Q},
k \in \mathbb{Z}^+$,
\begin{enumerate}
\item There exists a $\Theta(C(n,k))$ time computable, associative, and
commutative binary operator $\mergeop$ such that,
\begin{equation*}
X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F,
B, q, k)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B
= \emptyset$.
\item For any dataset $D \subseteq \mathcal{D}$ that has been
decomposed into $m$ partitions such that $D =
\bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad
\forall i,j \leq m, i \neq j$,
\begin{equation*}
F(D, q) = \bigcup_{i=1}^m F(D_i, q)
\end{equation*}
\end{enumerate}
\end{definition}
These two conditions warrant further explanation. The first condition
is simply a redefinition of the standard decomposability criteria to
consider matching the distribution, rather than the exact records in $R$,
as the correctness condition for the merge process. The second condition
handles a necessary property of the underlying search problem being
sampled from. Note that this condition is \emph{stricter} than normal
decomposability for $F$, and essentially requires that the query being
sampled from return a set of records, rather than an aggregate value or
some other result that cannot be meaningfully sampled from. This condition
is satisfied by predicate-filtering style database queries, among others.
With these definitions in mind, let's turn to solving these query sampling
problems. First, we note that many SSIs have a sampling procedure that
naturally involves two phases. First, some preliminary work is done
to determine metadata concerning the set of records to sample from,
and then $k$ samples are drawn from the structure, taking advantage of
this metadata. If we represent the time cost of the preliminary work
with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
structures' query cost functions are of the form,
\begin{equation*}
\mathscr{Q}(n, k) = P(n) + k S(n)
\end{equation*}
Consider an arbitrary decomposable sampling problem with a cost function
of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample
of $k$ records from $d \subseteq \mathcal{D}$ using an instance of
an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results
in $d$ being split across $m$ disjoint instances of $\mathcal{I}$
such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and
$\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j)
= \emptyset \quad \forall i, j \leq m, i \neq j$. If we consider a
Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation
would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such
a structure would be,
\begin{equation*}
\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
\end{equation*}
This cost function is sub-optimal for two reasons. First, we
pay extra cost to merge the result sets together because of the
down-sampling combination operator. Secondly, this formulation
fails to avoid a per-sample dependence on $n$, even in the case
where $S(n) \in \Theta(1)$. This gets even worse when considering
rejections that may occur as a result of deleted records. Recall from
Section~\ref{ssec:background-deletes} that deletion can be supported
using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
Using either approach, it isn't possible to avoid deleted records in
advance when sampling, and so these will need to be rejected and retried.
In the DSP model, this retry will need to reprocess every block a second
time; retrying in place would introduce bias into the result set. We
will discuss this further in Section~\ref{ssec:sampling-deletes}.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{img/sigmod23/sampling}
\caption{\textbf{Overview of the multiple-block query sampling process} for
Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
the shards are determined, then (2) these weights are used to construct an
alias structure. Next, (3) the alias structure is queried $k$ times to
determine per shard sample sizes, and then (4) sampling is performed.
Finally, (5) any rejected samples are retried starting from the alias
structure, and the process is repeated until the desired number of samples
has been retrieved.}
\label{fig:sample}
\end{figure}
The key insight that allowed us to solve this problem is that there is
a mismatch between the structure of the sampling query process and the
structure assumed by DSPs. Using an SSI to answer a sampling query
results in a naturally two-phase process, but DSPs are assumed to be
single-phase. We can instead construct a more effective, multi-stage
procedure for answering such queries, summarized in Figure~\ref{fig:sample}.
\begin{enumerate}
\item Perform the query pre-processing work, and determine each
block's respective weight under a given query to be sampled
from (e.g., the number of records falling into the query range
for IRS).
\item Build a temporary alias structure over these weights.
\item Query the alias structure $k$ times to determine how many
samples to draw from each block.
\item Draw the appropriate number of samples from each block and
merge them together to form the final query result, using any
necessary pre-processing results in the process.
\end{enumerate}
It is possible that some of the records sampled in Step 4 must be
rejected, either because of deletes or some other property of the sampling
procedure being used. If $r$ records are rejected, the above procedure
can be repeated from Step 3, taking $r$ as the number of times to
query the alias structure, without needing to redo any of the pre-processing
steps. This can be repeated as many times as necessary until the required
$k$ records have been sampled.
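To make this procedure concrete, the following C++ sketch illustrates the
multi-stage sampling process described above. The \texttt{Record},
\texttt{Query}, and \texttt{Shard} types are illustrative stand-ins for a
WSS-style SSI block, and \texttt{std::discrete\_distribution} plays the role
of the alias structures; none of these names are taken from an actual
implementation.
\begin{verbatim}
#include <optional>
#include <random>
#include <vector>

// Toy record and query types for a WSS-style problem; real SSIs would
// provide their own.
struct Record { long key; double weight; };
struct Query {};   // WSS takes no parameters; IRS would carry a key range

// Toy stand-in for an SSI block: exact per-shard weighted sampling via
// std::discrete_distribution (playing the role of a per-shard alias
// structure).
struct Shard {
    std::vector<Record> data;
    mutable std::discrete_distribution<size_t> picker;
    double total = 0;

    void preprocess(const Query &) {                       // P(n) work
        std::vector<double> w;
        total = 0;
        for (const auto &r : data) { w.push_back(r.weight); total += r.weight; }
        picker = std::discrete_distribution<size_t>(w.begin(), w.end());
    }
    double weight(const Query &) const { return total; }   // W(n) work
    std::optional<Record> sample(std::mt19937 &rng) const { // S(n) work
        // a rejection (e.g., a deleted record) would return std::nullopt here
        return data[picker(rng)];
    }
};

std::vector<Record> sample_query(std::vector<Shard> &shards, const Query &q,
                                 size_t k, std::mt19937 &rng) {
    std::vector<double> weights;
    for (auto &s : shards) {              // (1) pre-processing and shard weights
        s.preprocess(q);
        weights.push_back(s.weight(q));
    }
    // (2) stand-in for the temporary alias structure over shard weights
    std::discrete_distribution<size_t> alias(weights.begin(), weights.end());

    std::vector<Record> result;
    size_t needed = k;
    while (needed > 0) {
        std::vector<size_t> counts(shards.size(), 0);
        for (size_t i = 0; i < needed; i++) counts[alias(rng)]++;  // (3)

        for (size_t i = 0; i < shards.size(); i++)                 // (4)
            for (size_t j = 0; j < counts[i]; j++)
                if (auto rec = shards[i].sample(rng)) result.push_back(*rec);

        needed = k - result.size();       // (5) retry any rejections
    }
    return result;
}
\end{verbatim}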
\begin{example}
\label{ex:sample}
Consider executing a WSS query, with $k=1000$, across three blocks
containing integer keys with unit weight. $\mathscr{I}_1$ contains only the
key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$
contains all integers on $[101, 200]$. These structures are shown
in Figure~\ref{fig:sample}. Sampling is performed by first
determining the normalized weights for each block: $w_1 = 0.005$,
$w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
block alias structure. The block alias structure is then queried
$k$ times, resulting in a distribution of $k_i$s that is
commensurate with the relative weights of each block. Finally,
each block is queried in turn to draw the appropriate number
of samples.
\end{example}
Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
a constant number of repetitions, the cost of answering a decomposable
sampling query having a pre-processing cost of $P(n)$, a weight-determination
cost of $W(n)$ and a per-sample cost of $S(n)$ will be,
\begin{equation}
\label{eq:dsp-sample-cost}
\boxed{
\mathscr{Q}(n, k) \in \Theta \left( (P(n) + W(n)) \log_2 n + k S(n) \right)
}
\end{equation}
where the cost of building the alias structure is $\Theta(\log_2 n)$
and thus absorbed into the pre-processing cost. For the SSIs discussed
in this chapter, which have $S(n) \in \Theta(1)$, this model provides us
with the desired decoupling of the data size ($n$) from the per-sample
cost. Additionally, for all of the SSIs considered in this chapter,
the weights can be determined in either $W(n) \in \Theta(1)$ time,
or are naturally determined as part of the pre-processing, and thus the
$W(n)$ term can be merged into $P(n)$.
\subsection{Supporting Deletes}
As discussed in Section~\ref{ssec:background-deletes}, the Bentley-Saxe
method can support deleting records through the use of either weak
deletes, or a secondary ghost structure, assuming certain properties are
satisfied by either the search problem or data structure. Unfortunately,
neither approach can work as a ``drop-in'' solution in the context of
sampling problems, because of the way that deleted records interact with
the sampling process itself. Sampling problems, as formalized here,
are neither invertible, nor deletion decomposable. In this section,
we'll discuss our mechanisms for supporting deletes, as well as how
these can be handled during sampling while maintaining correctness.
Because both deletion policies have their advantages in certain
contexts, we decided to support both. Specifically, we propose two
mechanisms for deletes, which are
\begin{enumerate}
\item \textbf{Tagged Deletes.} Each record in the structure includes a
header containing a visibility bit. On delete, the structure is searched
for the record, and the bit is set to indicate that it has been deleted.
This mechanism is used to support \emph{weak deletes}.
\item \textbf{Tombstone Deletes.} On delete, a new record is inserted into
the structure with a tombstone bit set in the header. This mechanism is
used to support \emph{ghost structure} based deletes.
\end{enumerate}
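Both mechanisms can be supported by a small per-record header; the following
sketch is one possible layout (the field and flag names are our own, chosen
purely for illustration).
\begin{verbatim}
#include <cstdint>

// Minimal per-record header supporting both delete mechanisms.
struct RecordHeader {
    uint8_t flags = 0;
    static constexpr uint8_t DELETED   = 0x1;  // visibility bit (tagged deletes)
    static constexpr uint8_t TOMBSTONE = 0x2;  // marks a tombstone record

    bool is_deleted()   const { return flags & DELETED; }
    bool is_tombstone() const { return flags & TOMBSTONE; }
    void set_deleted()        { flags |= DELETED; }
};
\end{verbatim}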
Broadly speaking, for sampling problems, tombstone deletes cause a number
of problems because \emph{sampling problems are not invertible}. However,
this limitation can be worked around during the query process if desired.
Tagging is much more natural for these search problems. However, the
flexibility of selecting either option is desirable because of their
different performance characteristics.
While tagging is a fairly direct method of implementing weak deletes,
tombstones are sufficiently different from the traditional ghost structure
system that it is worth motivating the decision to use them here. One
of the major limitations of the ghost structure approach for handling
deletes is that there is not a principled method for removing deleted
records from the decomposed structure. The standard approach is to set an
arbitrary threshold on the number of deleted records, and rebuild the entire
structure when this threshold is crossed~\cite{saxe79}. Mixing the ``ghost''
records into the same structures as the original records allows deleted records
to be naturally cleaned up over time as they meet their tombstones during
reconstructions. This is an important consequence that will be discussed
in more detail in Section~\ref{ssec:sampling-delete-bounding}.
There are two relevant aspects of performance that the two mechanisms
trade off between: the cost of performing the delete, and the cost of
checking if a sampled record has been deleted. In addition to these,
the use of tombstones also makes supporting concurrency and external
data structures far easier. This is because tombstone deletes are simple
inserts, and thus they leave the individual structures immutable. Tagging
requires doing in-place updates of the record header in the structures,
resulting in possible race conditions and random IO operations on
disk. This makes tombstone deletes particularly attractive in these
contexts.
\subsubsection{Deletion Cost}
\label{ssec:sampling-deletes}
We will first consider the cost of performing a delete using either
mechanism.
\Paragraph{Tombstone Deletes.}
The amortized cost of a tombstone delete in a Bentley-Saxe dynamization
is the same as that of a simple insert,
\begin{equation*}
\mathscr{D}_A(n) \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
\end{equation*}
with the worst-case cost being $\Theta(B(n))$. Note that there is also
a minor performance effect resulting from deleted records appearing
twice within the structure, once for the original record and once for
the tombstone, inflating the overall size of the structure.
\Paragraph{Tagged Deletes.} In contrast to tombstone deletes, tagged
deletes are not simple inserts, and so have their own cost function. The
process of deleting a record under tagging consists of first searching
the entire structure for the record to be deleted, and then setting a
bit in its header. As a result, the performance of this operation is
a function of how expensive it is to locate an individual record within
the decomposed data structure.
In the theoretical literature, this lookup operation is provided
by a global hash table built over every record in the structure,
mapping each record to the block that contains it. Then, the data
structure's weak delete operation can be applied to the relevant
block~\cite{merge-dsp}. While this is certainly an option for us, we
note that the SSIs we are currently considering all already support a
reasonably efficient $\Theta(\log n)$ point-lookup operation, and so we
have elected to design tagged deletes to leverage this operation when
available, rather than maintaining a global hash table.
If a given SSI has a point-lookup cost of $L(n)$, then a tagged delete
on a Bentley-Saxe decomposition of that SSI will require, at worst,
executing a point-lookup on each block, with a total cost of
\begin{equation*}
\mathscr{D}(n) \in \Theta\left( L(n) \log_2 (n)\right)
\end{equation*}
If the SSI being considered does \emph{not} support an efficient
point-lookup operation, then a hash table can be used instead. We consider
individual hash tables associated with each block, rather than a single
global one, for simplicity of implementation and analysis. So, in these
cases, the same procedure as above can be used, with $L(n) \in \Theta(1)$.
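As a sketch, a tagged delete over the decomposed structure might look as
follows, where \texttt{Block} is an assumed interface whose
\texttt{point\_lookup} method is backed either by the SSI's native lookup or
by the per-block hash table described above, and \texttt{Record} and
\texttt{RecordHeader} follow the earlier header sketch.
\begin{verbatim}
#include <vector>

// Assumed interfaces: point_lookup() returns a pointer to the header of
// the matching record if it is present in this block, and nullptr otherwise.
struct Record { long key; };
struct RecordHeader { void set_deleted(); };
struct Block { RecordHeader *point_lookup(const Record &rec); };  // L(n) cost

// Tagged delete: probe each block until the record is found, then set its
// visibility bit in place. Worst case: one lookup per block.
bool tagged_delete(std::vector<Block> &blocks, const Record &rec) {
    for (auto &b : blocks) {
        if (RecordHeader *hdr = b.point_lookup(rec)) {
            hdr->set_deleted();
            return true;
        }
    }
    return false;   // record not present in the structure
}
\end{verbatim}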
\begin{figure}
\centering
\subfloat[Tombstone Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}
\subfloat[Tagging Rejection Check]{\includegraphics[width=.5\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
\caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
a record is sampled (1).
When using the tombstone delete policy
(Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying
the Bloom filter of the mutable buffer. The filter indicates the record is
not present, so (3) the filter on $L_0$ is queried next. This filter
returns a false positive, so (4) a point-lookup is executed against $L_0$.
The lookup fails to find a tombstone, so the search continues and (5) the
filter on $L_1$ is checked, which reports that the tombstone is present.
This time, it is not a false positive, and so (6) a lookup against $L_1$
(7) locates the tombstone. The record is thus rejected. When using the
tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and
(2) checked directly for the delete tag. It is set, so the record is
immediately rejected.}
\label{fig:delete}
\end{figure}
\subsubsection{Rejection Check Costs}
Because sampling queries are neither invertible nor deletion decomposable,
the query process must be modified to support deletes using either of the
above mechanisms. This modification entails requiring that each sampled
record be manually checked to confirm that it hasn't been deleted, prior
to adding it to the sample set. We call the cost of this operation the
\emph{rejection check cost}, $R(n)$. The process differs between the
two deletion mechanisms, and the two procedures are summarized in
Figure~\ref{fig:delete}.
For tagged deletes, this is a simple process. The information about the
deletion status of a given record is stored directly alongside the record,
within its header. So, once a record has been sampled, this check can be
immediately performed with $R(n) \in \Theta(1)$ time.
Tombstone deletes, however, introduce a significant difficulty in
performing the rejection check. The information about whether a record
has been deleted is not local to the record itself, and therefore a
point-lookup is required to search for the tombstone associated with
each sample. Thus, the rejection check cost when using tombstones to
implement deletes over a Bentley-Saxe decomposition of an SSI is,
\begin{equation}
R(n) \in \Theta( L(n) \log_2 n)
\end{equation}
This performance cost seems catastrophically bad, considering
it must be paid per sample, but there are ways to mitigate
it. We will discuss these mitigations in more detail later,
during our discussion of the implementation of these results in
Section~\ref{sec:sampling-implementation}.
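For concreteness, the two rejection checks might be sketched as follows,
mirroring the procedure illustrated in Figure~\ref{fig:delete}. The
\texttt{Buffer} and \texttt{Level} interfaces, including the per-level
tombstone Bloom filter, are assumed for illustration.
\begin{verbatim}
#include <vector>

struct Record { long key; bool delete_tag; };   // illustrative record layout

// Assumed interfaces for the mutable buffer and a level of the structure.
struct Buffer {
    bool contains_tombstone(const Record &rec) const;  // O(N_B) scan
};
struct Level {
    bool filter_may_contain(const Record &rec) const;  // tombstone Bloom filter
    bool find_tombstone(const Record &rec) const;      // L(n) point lookup
};

// Tagging: the deletion status is stored with the record, so R(n) = O(1).
bool rejected_tagging(const Record &rec) { return rec.delete_tag; }

// Tombstones: search the buffer and then each level for a matching
// tombstone, skipping levels whose Bloom filter rules it out.
bool rejected_tombstone(const Buffer &buf, const std::vector<Level> &levels,
                        const Record &rec) {
    if (buf.contains_tombstone(rec)) return true;
    for (const auto &lvl : levels) {
        if (!lvl.filter_may_contain(rec)) continue;  // filter skip
        if (lvl.find_tombstone(rec)) return true;    // tombstone found: reject
    }
    return false;
}
\end{verbatim}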
\subsubsection{Bounding Rejection Probability}
\label{ssec:sampling-delete-bounding}
When a sampled record has been rejected, it must be resampled. This
introduces performance overhead resulting from extra memory accesses and
random number generation, and hurts our ability to provide performance
bounds on our sampling operations. In the worst case, a structure
may consist mostly or entirely of deleted records, resulting in
a potentially unbounded number of rejections during sampling. Thus,
in order to maintain sampling performance bounds, the probability of a
rejection during sampling must be bounded.
The reconstructions associated with Bentley-Saxe dynamization give us
a natural way of controlling the number of deleted records within the
structure, and thereby bounding the rejection rate. During reconstruction,
we have the opportunity to remove deleted records. This will cause the
record counts associated with each block of the structure to gradually
drift out of alignment with the ``perfect'' powers of two associated with
the Bentley-Saxe method, however. In the theoretical literature on this
topic, the solution to this problem is to periodically repartition all of
the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
approach could also be easily applied here, if desired, though we
do not in our implementations, for reasons that will be discussed in
Section~\ref{sec:sampling-implementation}.
The process of removing these deleted records during reconstructions is
different for the two mechanisms. Tagged deletes are straightforward,
because all tagged records can simply be dropped when they are involved
in a reconstruction. Tombstones, however, require a slightly more complex
approach. Rather than being dropped immediately, a deleted record can
only be removed when the tombstone and its associated record are involved
in the \emph{same} reconstruction, at which point both can be dropped.
We call this process \emph{tombstone cancellation}. In the general case,
it can be implemented using a preliminary linear pass over the records
involved in a reconstruction to identify the records to be dropped. In
many cases, however, reconstruction involves sorting the records anyway;
by taking care with ordering semantics, tombstones and their associated
records can be sorted into adjacent positions, allowing them to be
efficiently dropped during reconstruction without any extra overhead.
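As a sketch, cancellation during a sort-based reconstruction reduces to a
single pass over the merged, sorted input, assuming an ordering in which each
tombstone sorts immediately after the record it deletes and keys uniquely
identify records.
\begin{verbatim}
#include <vector>

struct MergeRecord { long key; bool tombstone; };   // illustrative layout

// When a tombstone immediately follows the record it deletes, drop both;
// otherwise copy the record (or unmatched tombstone) through.
std::vector<MergeRecord>
cancel_tombstones(const std::vector<MergeRecord> &sorted) {
    std::vector<MergeRecord> out;
    for (const auto &r : sorted) {
        if (r.tombstone && !out.empty() &&
            out.back().key == r.key && !out.back().tombstone) {
            out.pop_back();                 // cancellation
        } else {
            out.push_back(r);
        }
    }
    return out;
}
\end{verbatim}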
While the dropping of deleted records during reconstruction helps, it is
not sufficient on its own to ensure a particular bound on the number of
deleted records within the structure. Pathological scenarios resulting
in unbounded rejection rates are possible even with this mitigation in
place. For example, tagging alone never triggers reconstructions, and so
every record in the structure could be deleted without a single
reconstruction occurring; with tombstones, records could be deleted in
the reverse of the order in which they were inserted. In either case,
a passive system of dropping records naturally during reconstruction
is not sufficient.
Fortunately, this passive system can be used as the basis for a
system that does provide a bound. This is because it guarantees,
whether tagging or tombstones are used, that any given deleted
record will \emph{eventually} be cancelled out after a finite number
of reconstructions. If the number of deleted records gets too high,
some or all of these deleted records can be cleared out by proactively
performing reconstructions. We call these proactive reconstructions
\emph{compactions}.
The basic strategy, then, is to define a maximum allowable proportion
of deleted records, $\delta \in [0, 1]$. Each block in the decomposition
tracks the number of tombstones or tagged records within it. This count
can be easily maintained by incrementing a counter when a record in the
block is tagged, and by counting tombstones during reconstructions. These
counts on each block are then monitored, and if the proportion of deletes
in a block ever exceeds $\delta$, a proactive reconstruction including
this block and one or more blocks below it in the structure can be
triggered. The delete proportion of the newly compacted block can then be
checked again, and this process repeated until all blocks respect the bound.
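A sketch of this monitoring loop is given below. The \texttt{Block}
interface and the \texttt{compact\_with\_blocks\_below} routine (which
performs the proactive reconstruction) are assumed for illustration.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Assumed per-block bookkeeping.
struct Block {
    size_t record_count() const;
    size_t delete_count() const;   // tombstones or tagged records
};

// Assumed routine: reconstruct block i together with one or more blocks
// below it, removing any cancellable deleted records in the process.
void compact_with_blocks_below(std::vector<Block> &blocks, size_t i);

// Re-check after every compaction until every block respects the bound.
void enforce_delete_bound(std::vector<Block> &blocks, double delta) {
    bool violated = true;
    while (violated) {
        violated = false;
        for (size_t i = 0; i < blocks.size(); i++) {
            if (blocks[i].record_count() == 0) continue;
            double prop = double(blocks[i].delete_count())
                        / double(blocks[i].record_count());
            if (prop > delta) {
                compact_with_blocks_below(blocks, i);
                violated = true;
                break;                       // re-scan from the top
            }
        }
    }
}
\end{verbatim}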
For tagging, a single round of compaction will always suffice, because all
deleted records involved in the reconstruction will be dropped. Tombstones
may require multiple cascading rounds of compaction to occur, because a
tombstone record will only cancel when it encounters the record that it
deletes. However, because tombstones always follow the record they
delete in insertion order, and will therefore always be ``above'' that
record in the structure, each reconstruction will move every tombstone
involved closer to the record it deletes, ensuring that eventually the
bound will be satisfied.
Asymptotically, this compaction process will not affect the amortized
insertion cost of the structure. This is because the cost is based on
the number of reconstructions that a given record is involved in over
the lifetime of the structure. Preemptive compaction does not increase
the number of reconstructions, only \emph{when} they occur.
\subsubsection{Sampling Procedure with Deletes}
\label{ssec:sampling-with-deletes}
Because sampling is neither deletion decomposable nor invertible,
the presence of deletes will have an effect on the query costs. As
already mentioned, the basic cost introduced by deletes is a rejection
check for each sampled record. When a record is sampled,
it must be checked to determine whether it has been deleted or not. If
it has, then it must be rejected. Note that when this rejection occurs,
it cannot be retried immediately on the same block, but rather a new
block must be selected to sample from. This is because deleted records
aren't accounted for in the weight calculations, and so could introduce
bias. As a straightforward example of this problem, consider a block
that contains only deleted records. Any sample drawn from this block will
be rejected, and so retrying samples against this block will result in
an infinite loop.
Assuming the compaction strategy mentioned in the previous section is
applied, ensuring a bound of at most $\delta$ proportion of deleted
records in the structure, and assuming all records have an equal
probability of being sampled, the cost of answering sampling queries
accounting for rejections is,
\begin{equation*}
%\label{eq:sampling-cost-del}
\mathscr{Q}(n, k) = \Theta\left([W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation*}
where $\frac{k}{1 - \delta}$ is the expected number of samples that must
be taken to obtain a sample set of size $k$.
\subsection{Performance Tuning and Configuration}
The last of the desiderata referenced earlier in this chapter for our
dynamized sampling indices is tunable performance. The base
Bentley-Saxe method has a highly rigid reconstruction policy that,
while theoretically convenient, does not lend itself to performance
tuning. However, it can be readily modified to form a more relaxed policy
that is both tunable, and generally more performant, at the cost of some
additional theoretical complexity. There has been some theoretical work
in this area, based upon nesting instances of the equal block method
within the Bentley-Saxe method~\cite{overmars81}, but these methods are
unwieldy and are targeted at tuning the worst case at the expense of the
common case. We will take a different approach to adding configurability
to our dynamization system.
Though it has thus far gone unmentioned, readers familiar with the
Log-structured Merge-tree (LSM Tree) may have noted the striking similarity
between it and decomposition-based dynamization techniques.
First proposed by O'Neil in the mid '90s~\cite{oneil96},
the LSM Tree was designed to optimize write throughput for external data
structures. It accomplished this task by buffering inserted records in a
small in-memory AVL Tree, and then flushing this buffer to disk when
it filled up. The flush process itself would fully rebuild the on-disk
structure (a B+Tree), including all of the records already existing
on external storage. O'Neil also proposed a version which used several
layered external structures to reduce the cost of reconstruction.
In more recent times, the LSM Tree has seen significant development and
been used as the basis for key-value stores like RocksDB~\cite{dong21}
and LevelDB~\cite{leveldb}. This work has produced an incredibly large
and well-explored parameterization of the reconstruction procedures of
LSM Trees, a good summary of which can be found in a recent tutorial
paper~\cite{sarkar23}. Examples of this design space exploration include:
different ways to organize each ``level'' of the tree~\cite{dayan19,
dostoevsky, autumn}, different growth rates, buffering, sub-partitioning
of structures to allow finer-grained reconstruction~\cite{dayan22}, and
approaches for allocating resources to auxiliary structures attached to
the main ones for accelerating certain types of queries~\cite{dayan18-1,
zhu21, monkey}.
Many of the elements within the LSM Tree design space are based upon the
specifics of the data structure itself, and are not generally applicable.
However, some of the higher-level concepts can be imported and applied in
the context of dynamization. Specifically, we have decided to import the
following four elements for use in our dynamization technique,
\begin{itemize}
\item A small dynamic buffer into which new records are inserted
\item A variable growth rate, called the \emph{scale factor}
\item The ability to attach auxiliary structures to each block
\item Two different strategies for reconstructing data structures
\end{itemize}
This design space and its associated trade-offs will be discussed in
more detail in Chapter~\ref{chap:design-space}, but we'll describe it
briefly here.
\Paragraph{Buffering.} In the standard Bentley-Saxe method, each
insert triggers a reconstruction. Many of these are quite small, but
it still makes most insertions somewhat expensive. By adding a small
buffer, a large number of inserts can be performed without requiring
any reconstructions at all. For generality, we elected to use an
unsorted array as our buffer, as dynamic versions of the structures
we are dynamizing may not exist. This introduces some query cost, as
queries must be answered from these unsorted records as well, but in
the case of sampling this isn't a serious problem. The implications of
this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The
size of this buffer, $N_B$, is a user-specified constant, and all block
capacities are multiplied by it. In the Bentley-Saxe method, the $i$th
block contains $2^i$ records; in our scheme, with buffering, the $i$th
block contains $N_B \cdot 2^i$ records. We call this unsorted array
the \emph{mutable buffer}.
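A sketch of the resulting insert path is shown below, where
\texttt{flush\_buffer} (not shown) builds a shard from the buffered records
at $B(N_B)$ cost and merges it into the main structure according to the
layout policy; the record type is illustrative.
\begin{verbatim}
#include <vector>

struct BufferedRecord { long key; double weight; };  // illustrative layout

// Assumed routine: build a shard from the buffer's contents and merge it
// into the main structure per the layout policy.
void flush_buffer(std::vector<BufferedRecord> &buffer);

// Common-case O(1) insert: append to the unsorted mutable buffer, flushing
// it first if it has reached its capacity N_B.
void insert(std::vector<BufferedRecord> &buffer, size_t n_b,
            const BufferedRecord &rec) {
    if (buffer.size() >= n_b) {
        flush_buffer(buffer);
        buffer.clear();
    }
    buffer.push_back(rec);
}
\end{verbatim}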
\Paragraph{Scale Factor.} In the Bentley-Saxe method, each block is
twice as large as the block that precedes it. There is, however, no reason
why this growth rate could not be adjusted. In our system, we make the
growth rate a user-specified constant called the \emph{scale factor},
$s$, such that the $i$th level contains $N_B \cdot s^i$ records.
\Paragraph{Auxiliary Structures.} In Section~\ref{ssec:sampling-deletes},
we encountered two problems relating to supporting deletes that can be
resolved through the use of auxiliary structures. First, regardless
of whether tagging or tombstones are used, the data structure requires
support for an efficient point-lookup operation. Many SSIs are tree-based
and thus support this, but not all data structures do. In such cases,
the point-lookup operation can be provided by attaching an auxiliary
hash table to the data structure that maps records to their locations in
the SSI. We use the term \emph{shard} to refer to the combination of a
block with these optional auxiliary structures.
In addition, the tombstone deletion mechanism requires performing a point
lookup for every record sampled, to validate that it has not been deleted.
This introduces a large amount of overhead into the sampling process,
as this requires searching each block in the structure. One approach
that can be used to improve the performance of these searches,
without requiring as much storage as adding auxiliary hash tables to
every block, is to include Bloom filters~\cite{bloom70}. A Bloom filter
is an approximate data structure that answers tests of set membership
with bounded, one-sided error. These are commonly used in LSM Trees
to accelerate point lookups by allowing levels that don't contain the
record being searched for to be skipped. In our case, we only care about
tombstone records, so rather than building these filters over all records,
we can build them over tombstones. This approach can greatly improve
the sampling performance of the structure when tombstone deletes are used.
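As a sketch of this idea, a tombstone-only filter can be built as a side
effect of reconstruction; the \texttt{BloomFilter} interface here is an
assumed one, parameterized by the expected number of insertions and a target
false-positive rate.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Assumed Bloom filter interface.
struct BloomFilter {
    BloomFilter(size_t expected_keys, double false_positive_rate);
    void insert(long key);
    bool may_contain(long key) const;
};

struct StoredRecord { long key; bool tombstone; };   // illustrative layout

// Build a filter over only the tombstones in a newly reconstructed shard,
// so that rejection checks can skip shards that cannot possibly contain a
// matching tombstone.
BloomFilter build_tombstone_filter(const std::vector<StoredRecord> &records,
                                   size_t tombstone_count, double fpr) {
    BloomFilter filter(tombstone_count, fpr);
    for (const auto &r : records)
        if (r.tombstone)
            filter.insert(r.key);
    return filter;
}
\end{verbatim}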
\Paragraph{Layout Policy.} The Bentley-Saxe method considers blocks
individually, without any other organization beyond increasing size. In
contrast, LSM Trees have multiple layers of structural organization. The
top-level structural unit is the level, upon which record capacity restrictions
are applied. These levels are then partitioned into individual structures,
which can be further organized by key range. Because our intention is to
support general data structures, which may or may not be easily partitioned
by a key, we will not consider the finest grain of partitioning. However,
we can borrow the concept of levels, and lay out shards in these levels
according to different strategies.
Specifically, we consider two layout policies. First, we can allow a
single shard per level, a policy called \emph{Leveling}. This approach
is traditionally read-optimized, as it generally results in fewer shards
within the overall structure for a given scale factor. Under leveling,
the $i$th level has a capacity of $N_B \cdot s^{i+1}$ records. We can
also allow multiple shards per level, resulting in a write-optimized
policy called \emph{Tiering}. In tiering, each level can hold up to $s$
shards, each with up to $N_B \cdot s^i$ records. Note that this doesn't
alter the overall record capacity of each level relative to leveling,
only the way the records are divided up into shards.
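The two policies therefore give each level the same total record capacity
and differ only in how many shards that capacity is divided across, as the
following small sketch of the capacity computations (in terms of the $N_B$
and $s$ parameters above) makes explicit.
\begin{verbatim}
#include <cstddef>

// Record capacity of level i: N_B * s^(i+1). The total is identical under
// leveling (one shard) and tiering (up to s shards of N_B * s^i records).
size_t level_record_capacity(size_t n_b, size_t s, size_t i) {
    size_t cap = n_b;
    for (size_t j = 0; j <= i; j++) cap *= s;
    return cap;
}

// Maximum number of shards a level may hold under each policy.
size_t level_shard_capacity(size_t s, bool leveling) {
    return leveling ? 1 : s;
}
\end{verbatim}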
\section{Practical Dynamization Framework}
Based upon the results discussed in the previous section, we are now ready
to discuss the dynamization framework that we have produced for adding
update support to SSIs. This framework allows us to achieve all three
of our desiderata, at least for certain configurations, and provides a
wide range of performance tuning options to the user.
\subsection{Requirements}
The requirements that the framework places upon SSIs are rather
modest. The sampling problem being considered must be a decomposable
sampling problem (Definition \ref{def:decomp-sampling}) and the SSI must
support the \texttt{build} and \texttt{unbuild} operations. Optionally,
if the SSI supports point lookups or if the SSI can be constructed
from multiple instances of the SSI more efficiently than its normal
static construction, these two operations can be leveraged by the
framework. However, these are not requirements, as the framework provides
facilities to work around their absence.
\captionsetup[subfloat]{justification=centering}
\begin{figure*}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
\caption{\textbf{A graphical overview of our dynamization framework.} A
mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
of SSIs and auxiliary structures [A]) using the leveling
(Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
policies. Records are represented as black/colored squares, and grey
squares represent unused capacity. An insertion requiring a multi-level
reconstruction is illustrated.} \label{fig:sampling-framework}
\end{figure*}
\subsection{Framework Construction}
The framework itself is shown in Figure~\ref{fig:sampling-framework},
along with some of its configuration parameters and its insert procedure
(which will be discussed in the next section). It consists of an unsorted
array with a capacity of $N_B$ records, sitting atop a sequence of \emph{levels},
each containing SSIs according to the layout policy. If leveling
is used, each level will contain a single SSI with up to $N_B \cdot
s^{i+1}$ records. If tiering is used, each level will contain up to
$s$ SSIs, each with up to $N_B \cdot s^i$ records. The scale factor,
$s$, controls the rate at which the capacity of each level grows. The
framework supports deletes using either the tombstone or tagging policy,
which can be selected by the user according to her preference. To support
these delete mechanisms, each record contains an attached header with
bits to indicate its tombstone or delete status.
\subsection{Supported Operations and Cost Functions}
\Paragraph{Insert.} Inserting a record into the dynamization involves
appending it to the mutable buffer, which requires $\Theta(1)$ time. When
the buffer reaches its capacity, it must be flushed into the structure
itself before any further records can be inserted. First, a shard will be
constructed from the records in the buffer using the SSI's \texttt{build}
operation, with $B(N_B)$ cost. This shard will then be merged into the
levels below it, which may require further reconstructions to occur to
make room. The manner in which these reconstructions proceed depends upon
the selected layout policy,
\begin{itemize}
\item[\textbf{Leveling}] When a buffer flush occurs in the leveling
policy, the system scans the existing levels to find the first level
which has sufficient empty space to store the contents of the level above
it. More formally, if the number of records in level $i$ is $N_i$, then
$i$ is determined such that $N_i + N_B\cdot s^{i} \leq N_B \cdot s^{i+1}$.
If no level exists that satisfies the record count constraint, then an
empty level is added and $i$ is set to the index of this new level. Then,
a reconstruction is executed containing all of the records in levels $i$
and $i - 1$ (where level $-1$ denotes the temporary shard built from the
buffer). Following this reconstruction, all levels $j < i$ are shifted
by one level.
\item[\textbf{Tiering}] When using tiering, the system will locate
the first level, $i$, containing fewer than $s$ shards. If no such
level exists, then a new empty level is added and $i$ is set to the
index of that level. Then, for each level $j < i$, a reconstruction
is performed involving all $s$ shards on that level. The resulting new
shard will then be placed into the level at $j + 1$ and $j$ will be
emptied. Following this, the newly created shard from the buffer will
be appended to level $0$.
\end{itemize}
In either case, the reconstructions all use instances of the shard as
input, and so if the SSI supports more efficient construction in this case
(with $B_M(n)$ cost), then this routine can be used here. Once all of
the necessary reconstructions have been performed, each level is checked
to verify that the proportion of tombstones or deleted records is less
than $\delta$. If this condition fails, then a proactive compaction is
triggered. This compaction involves doing the reconstructions necessary
to move the shard violating the delete bound down one level. Once the
compaction is complete, the delete proportions are checked again, and
this process is repeated until all levels satisfy the bound.
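As an illustration, the tiering variant of this flush logic might be
sketched as follows; leveling proceeds analogously, merging level $i-1$ into
level $i$ and shifting the shallower levels down. The \texttt{Level} and
\texttt{Shard} types and the \texttt{reconstruct} routine (which builds one
shard from several existing shards at $B_M(n)$ cost) are assumed for
illustration, and the delete-bound check described above would follow the
flush.
\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

struct Shard {};                        // assumed SSI-plus-auxiliaries bundle

struct Level {                          // assumed level interface
    std::vector<Shard> shards;
    size_t shard_count() const { return shards.size(); }
    void add_shard(Shard s)    { shards.push_back(std::move(s)); }
    void clear()               { shards.clear(); }
};

// Assumed routine: build one shard from several existing shards (B_M cost).
Shard reconstruct(const std::vector<Shard> &inputs);

// Tiering flush: find the first level with spare shard capacity (adding a
// new level if necessary), merge each full level j < i into level j + 1,
// working from the deepest such level upward, and finally install the
// shard built from the buffer at level 0.
void flush_tiering(std::vector<Level> &levels, Shard buffer_shard, size_t s) {
    size_t i = 0;
    while (i < levels.size() && levels[i].shard_count() >= s) i++;
    if (i == levels.size()) levels.emplace_back();

    for (size_t j = i; j > 0; j--) {
        levels[j].add_shard(reconstruct(levels[j - 1].shards));
        levels[j - 1].clear();
    }
    levels[0].add_shard(std::move(buffer_shard));
}
\end{verbatim}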
Following this procedure, inserts have a worst-case cost of $I(n) \in
\Theta(B_M(n))$, equivalent to Bentley-Saxe. The amortized cost can be
determined by finding the total cost of reconstructions involving each
record and amortizing it over each insert. The cost of the insert is
composed of three parts,
\begin{enumerate}
\item The cost of appending to the buffer
\item The cost of flushing the buffer to a shard
\item The total cost of the reconstructions the record is involved
in over the lifetime of the structure
\end{enumerate}
The first cost is constant and the second is $B(N_B)$. Regardless of
layout policy, there will be $\Theta(\log_s(n))$ total levels, and
the record will, at worst, be written a constant number of times to
each level, resulting in a maximum of $\Theta(\log_s(n)B_M(n))$ cost
associated with these reconstructions. Thus, the total cost associated
with each record in the structure is,
\begin{equation*}
\Theta(1) + \Theta(B(N_B)) + \Theta(\log_s(n)B_M(n))
\end{equation*}
Assuming that $N_B \ll n$, the first two terms of this expression are
constant. Dropping them and amortizing the result over $n$ records gives
us the amortized insertion cost,
\begin{equation*}
I_a(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
\end{equation*}
If the SSI being considered does not support a more efficient
construction procedure from other instances of the same SSI, and
the general Bentley-Saxe \texttt{unbuild} and \texttt{build}
operations must be used, then the cost becomes $I_a(n) \in
\Theta\left(\frac{B(n)}{n}\log_s(n)\right)$ instead.
\Paragraph{Delete.} The framework supports both tombstone and tagged
deletes, each with different performance. Using tombstones, the cost
of a delete is identical to that of an insert. When using tagging, the
cost of a delete is the same as the cost of a point lookup, as the
``delete'' itself simply sets a bit in the header of the record,
once it has been located. There will be $\Theta(\log_s n)$ total shards
in the structure, each with a look-up cost of $L(n)$ using either the
SSI's native point-lookup or an auxiliary hash table, and the lookup
must also scan the buffer in $\Theta(N_B)$ time. Thus, the worst-case
cost of a tagged delete is,
\begin{equation*}
D(n) = \Theta(N_B + L(n)\log_s(n))
\end{equation*}
\Paragraph{Update.} Given the above definitions of insert and delete,
in-place updates of records can be supported by first deleting the record
to be updated, and then inserting the updated value as a new record. Thus,
the update cost is $\Theta(I(n) + D(n))$.
\Paragraph{Sampling.} Answering sampling queries from this structure is
largely the same as was discussed for a standard Bentley-Saxe dynamization
in Section~\ref{ssec:sampling-with-deletes} with the addition of a need
to sample from the unsorted buffer as well. There are two approaches
for sampling from the buffer. The most general approach would be to
temporarily build an SSI over the records within the buffer, and then
treat it as a normal shard for the remainder of the sampling procedure.
In this case, the sampling algorithm remains identical to the algorithm
discussed in Section~\ref{ssec:decomposed-structure-sampling}, following
the construction of the temporary shard. This results in a worst-case
sampling cost of,
\begin{equation*}
\mathscr{Q}(n, k) = \Theta\left(B(N_B) + [W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation*}
In practice, however, it is often possible to perform rejection sampling
against the buffer, without needing to do any additional work to prepare
it. In this case, the full weight of the buffer can be used to determine
how many samples to draw from it, and then these samples can be obtained
using standard rejection sampling to both control the weight, and enforce
any necessary predicates. Because $N_B \ll n$, the probability of sampling
from the buffer is low and the cost of doing so is constant, so this
procedure introduces no more than constant overhead into the sampling
process. The overall query cost when rejection sampling is possible is
therefore,
\begin{equation*}
\mathscr{Q}(n, k) = \Theta\left([W(n) + P(n)]\log_2 n + \frac{kS(n)}{1 - \delta} \cdot R(n)\right)
\end{equation*}
In both cases, $R(n) \in \Theta(1)$ for tagging deletes, and $R(n) \in
\Theta(N_B + L(n) \log_s n)$ for tombstones (including the cost of searching
the buffer for the tombstone).
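As a concrete illustration for IRS, where a shard's weight is its in-range
record count, the buffer can enter the shard-selection alias structure with
its full record count as its weight, and rejection sampling then enforces
the range predicate and the delete status; rejected samples are simply
retried from the alias structure like any other rejection. The record
layout below is illustrative.
\begin{verbatim}
#include <optional>
#include <random>
#include <vector>

struct BufferRecord { long key; bool deleted; };   // illustrative layout

// Draw one candidate from the unsorted buffer for an IRS query over
// [lo, hi]: pick a slot uniformly at random and reject it if it falls
// outside the range or has been deleted.
std::optional<BufferRecord> sample_buffer(const std::vector<BufferRecord> &buf,
                                          long lo, long hi, std::mt19937 &rng) {
    if (buf.empty()) return std::nullopt;
    std::uniform_int_distribution<size_t> pick(0, buf.size() - 1);
    const BufferRecord &r = buf[pick(rng)];
    if (r.key < lo || r.key > hi) return std::nullopt;  // predicate rejection
    if (r.deleted) return std::nullopt;                  // delete rejection
    return r;
}
\end{verbatim}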
|