\section{Dynamization of SSIs} 
\label{sec:framework}

Our goal, then, is to design a solution to independent sampling that
achieves \emph{both} efficient updates and efficient sampling, while
maintaining statistical independence both within and between IQS
queries, and to do so in a generalized fashion that does not require
designing a new dynamic data structure for each problem. Given the range
of SSIs already available, it seems reasonable to attempt to apply
dynamization techniques to accomplish this goal. The Bentley-Saxe
method would allow us to support inserts and deletes without
requiring any modification of the SSIs. Unfortunately, as discussed
in Section~\ref{ssec:background-irs}, there are problems with directly
applying BSM to sampling problems. All of the considerations discussed
there in the context of IRS apply equally to the other sampling problems
considered in this chapter. In this section, we will discuss approaches
for resolving these problems.

\subsection{Sampling over Decomposed Structures}

The core problem facing any attempt to dynamize SSIs is that independently
sampling from a decomposed structure is difficult. As discussed in
Section~\ref{ssec:background-irs}, accomplishing this task within the
DSP model used by the Bentley-Saxe method requires drawing a full $k$
samples from each of the blocks, and then repeatedly down-sampling each
of the intermediate sample sets. However, it is possible to devise a
more efficient query process if we abandon the DSP model and consider
a slightly more complicated procedure.

First, we'll define the IQS problem in terms of the notation and concepts
used in Chapter~\ref{chap:background} for search problems,

\begin{definition}[Independent Query Sampling Problem]
	Given a search problem, $F$, a query sampling problem is a function
	of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+)
	\to \mathcal{R}$, where $\mathcal{D}$ is the domain of records
	and $\mathcal{Q}$ is the domain of query parameters of $F$. The
	solution to a sampling problem, $R \in \mathcal{R}$, is a subset
	of records from the solution to $F$, drawn independently, such that
	$|R| = k$ for some $k \in \mathbb{Z}^+$.
\end{definition}

To consider the decomposability of such problems, we need to resolve a
minor definitional issue. As noted before, the DSP model assumes
deterministic queries: it requires that identical queries produce
identical result sets, which cannot hold for sampling queries. For
general IQS, we also need to enforce conditions on the query being
sampled from. Based on these observations, we can define the
decomposability conditions for a query sampling problem,

\begin{definition}[Decomposable Sampling Problem]
	A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
	\mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if
	the following conditions are met for all $q \in \mathcal{Q},
	k \in \mathbb{Z}^+$,
	\begin{enumerate}
	\item There exists a $\Theta(C(n,k))$ time computable, associative, and
	      commutative binary operator $\mergeop$ such that,
	      \begin{equation*}
		X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F,
		B, q, k)
	      \end{equation*}
	  for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B
	  = \emptyset$.

	\item For any dataset $D \subseteq \mathcal{D}$ that has been
	      decomposed into $m$ partitions such that $D =
	      \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad
	      \forall i,j \leq m, i \neq j$,
		  \begin{equation*}
			F(D, q) = \bigcup_{i=1}^m F(D_i, q)
		  \end{equation*}
	\end{enumerate}
\end{definition}

These two conditions warrant further explanation. The first condition
is simply a restatement of the standard decomposability criteria, with
matching the distribution of $R$, rather than its exact records, serving
as the correctness condition for the merge process. The second condition
captures a necessary property of the underlying search problem being
sampled from. Note that this condition is \emph{stricter} than normal
decomposability for $F$, and essentially requires that the query being
sampled from return a set of records, rather than an aggregate value or
some other result that cannot be meaningfully sampled from. This condition
is satisfied by predicate-filtering style database queries, among others.

With these definitions in mind, let's turn to solving these query sampling
problems. First, we note that many SSIs have a sampling procedure that
naturally involves two phases: some preliminary work is done to determine
metadata concerning the set of records to sample from, and then $k$
samples are drawn from the structure, taking advantage of this metadata.
If we represent the time cost of the preliminary work
with $P(n)$ and the cost of drawing a sample with $S(n)$, then these
structures' query cost functions are of the form,

\begin{equation*}
\mathscr{Q}(n, k) = P(n) + k S(n)
\end{equation*}


Consider an arbitrary decomposable sampling problem with a cost function
of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample
of $k$ records from $d \subseteq \mathcal{D}$ using an instance of
an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results
in $d$ being split across $m$ disjoint instances of $\mathcal{I}$
such that $d = \bigcup_{i=1}^m \text{unbuild}(\mathscr{I}_i)$ and
$\text{unbuild}(\mathscr{I}_i) \cap \text{unbuild}(\mathscr{I}_j)
= \emptyset \quad \forall i, j \leq m, i \neq j$.  If we consider a
Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation
would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such
a structure would be,
\begin{equation*}
\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
\end{equation*}

This cost function is sub-optimal for two reasons. First, we
pay extra cost to merge the result sets together because of the
down-sampling combination operator. Second, this formulation
fails to avoid a per-sample dependence on $n$, even in the case
where $S(n) \in \Theta(1)$.  This gets even worse when considering
rejections that may occur as a result of deleted records. Recall from
Section~\ref{ssec:background-deletes} that deletion can be supported
using weak deletes or a shadow structure in a Bentley-Saxe dynamization.
Using either approach, it isn't possible to avoid deleted records in
advance when sampling, and so these will need to be rejected and retried.
In the DSP model, each retry must reprocess every block a second time;
retrying in place would introduce bias into the result set. We will
discuss this further in Section~\ref{ssec:sampling-deletes}.

\begin{figure}
    \centering
	\includegraphics[width=\textwidth]{img/sigmod23/sampling}
    \caption{\textbf{Overview of the multiple-block query sampling process} for
    Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
    the shards are determined, then (2) these weights are used to construct an
    alias structure. Next, (3) the alias structure is queried $k$ times to
    determine per-shard sample sizes, and then (4) sampling is performed.
    Finally, (5) any rejected samples are retried starting from the alias
    structure, and the process is repeated until the desired number of samples
    has been retrieved.}
	\label{fig:sample}
	
\end{figure}

The key insight that allowed us to solve this particular problem is that
there is a mismatch between the structure of the sampling query process
and the structure assumed by DSPs. Using an SSI to answer a sampling
query results in a naturally two-phase process, but DSPs are assumed to
be single-phase. We can construct a more effective process for answering
such queries based on a multi-stage procedure, summarized in Figure~\ref{fig:sample}.
\begin{enumerate}
	\item Perform the query pre-processing work, and determine each
	      block's respective weight under a given query to be sampled
		  from (e.g., the number of records falling into the query range
		  for IRS).

	\item Build a temporary alias structure over these weights.

	\item Query the alias structure $k$ times to determine how many
	      samples to draw from each block.
		
	\item Draw the appropriate number of samples from each block and
	      merge them together to form the final query result, using any
		  necessary pre-processing results in the process.
\end{enumerate}
It is possible that some of the records sampled in Step 4 must be
rejected, either because of deletes or some other property of the sampling
procedure being used. If $r$ records are rejected, the above procedure
can be repeated from Step 3, taking $r$ as the number of times to
query the alias structure, without needing to redo any of the preprocessing
steps. This can be repeated as many times as necessary until the required
$k$ records have been sampled.
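
To make the multi-stage procedure concrete, the following sketch shows one
possible implementation. The per-shard interface (\texttt{preprocess},
\texttt{weight}, \texttt{sample}) is assumed purely for illustration, and the
temporary alias structure of Step 2 is stood in for by a simple weighted
choice.

\begin{verbatim}
import random

def sample_query(shards, q, k):
    # Step 1: per-shard pre-processing and weight determination.
    metas = [s.preprocess(q) for s in shards]
    weights = [s.weight(q, m) for s, m in zip(shards, metas)]
    total = sum(weights)
    if total == 0:
        return []

    # Step 2: normalized weights (an alias structure would make each
    # draw O(1); random.choices stands in for it here).
    probs = [w / total for w in weights]

    results, needed = [], k
    while needed > 0:
        # Step 3: assign the remaining sample budget across shards.
        picks = random.choices(range(len(shards)), weights=probs, k=needed)
        # Step 4: draw samples; None models a rejected sample.
        needed = 0
        for i in picks:
            rec = shards[i].sample(q, metas[i])
            if rec is None:
                needed += 1   # rejection: retried from Step 3
            else:
                results.append(rec)
    return results
\end{verbatim}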

\begin{example}
    \label{ex:sample}
    Consider executing a WSS query, with $k=1000$, across three blocks
    containing integer keys with unit weight. $\mathscr{I}_1$ contains only the
    key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$
    contains all integers on $[101, 200]$. These structures are shown
    in Figure~\ref{fig:sample}. Sampling is performed by first
    determining the normalized weights for each block: $w_1 = 0.005$,
    $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
    block alias structure. The block alias structure is then queried
    $k$ times, resulting in a distribution of $k_i$s that is
    commensurate with the relative weights of each block. Finally,
    each block is queried in turn to draw the appropriate number
    of samples.
\end{example}

Assuming a Bentley-Saxe decomposition with $\log n$ blocks and assuming
a constant number of repetitions, the cost of answering a decomposable
sampling query having a pre-processing cost of $P(n)$, a weight-determination
cost of $W(n)$, and a per-sample cost of $S(n)$ will be,
\begin{equation}
\label{eq:dsp-sample-cost}
\boxed{
\mathscr{Q}(n, k) \in \Theta \left( (P(n) + W(n)) \log_2 n + k S(n) \right)
}
\end{equation}
where the cost of building the alias structure is $\Theta(\log_2 n)$
and thus absorbed into the pre-processing cost. For the SSIs discussed
in this chapter, which have $S(n) \in \Theta(1)$, this model provides us
with the desired decoupling of the data size ($n$) from the per-sample
cost. Additionally, for all of the SSIs considered in this chapter,
the weights either can be determined in $W(n) \in \Theta(1)$ time or
are naturally produced as part of the pre-processing, and thus the
$W(n)$ term can be merged into $P(n)$.

\subsection{Supporting Deletes}

As discussed in Section~\ref{ssec:background-deletes}, the Bentley-Saxe
method can support deleting records through the use of either weak
deletes or a secondary ghost structure, assuming certain properties are
satisfied by the search problem or data structure. Unfortunately,
neither approach works as a ``drop-in'' solution in the context of
sampling problems, because of the way that deleted records interact with
the sampling process itself. Sampling problems, as formalized here,
are neither invertible nor deletion decomposable. In this section,
we'll discuss our mechanisms for supporting deletes, as well as how
these can be handled during sampling while maintaining correctness.

Because both deletion policies have advantages in certain contexts,
we decided to support both. Specifically, we propose the following two
mechanisms for deletes, both illustrated by the sketch that follows this
list:

\begin{enumerate}
\item \textbf{Tagged Deletes.} Each record in the structure includes a
header with a visibility bit. On delete, the structure is searched
for the record, and the bit is set to indicate that it has been deleted.
This mechanism is used to support \emph{weak deletes}.
\item \textbf{Tombstone Deletes.} On delete, a new record is inserted into
the structure with a tombstone bit set in the header. This mechanism is
used to support \emph{ghost structure} based deletes.
\end{enumerate}
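
As a minimal illustration of the record format these mechanisms assume, the
sketch below models the header as two bits. The field names and layout are
illustrative only, not the framework's actual record representation.

\begin{verbatim}
from dataclasses import dataclass

DELETE_BIT    = 0x1   # set in place by a tagged delete (weak delete)
TOMBSTONE_BIT = 0x2   # set on the new record inserted by a tombstone delete

@dataclass
class Record:
    key: int
    weight: float
    header: int = 0

    def is_deleted(self) -> bool:
        return bool(self.header & DELETE_BIT)

    def is_tombstone(self) -> bool:
        return bool(self.header & TOMBSTONE_BIT)

def tombstone_for(rec: Record) -> Record:
    # A tombstone delete is an ordinary insert of a matching record with
    # the tombstone bit set; the original record is left untouched.
    return Record(rec.key, rec.weight, TOMBSTONE_BIT)
\end{verbatim}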

Broadly speaking, tombstone deletes are awkward for sampling because
\emph{sampling problems are not invertible}; however, this limitation can
be worked around during the query process if desired. Tagging is much
more natural for these search problems. Nonetheless, the flexibility of
selecting either option is desirable because of their different
performance characteristics.

While tagging is a fairly direct method of implementing weak deletes,
tombstones are sufficiently different from the traditional ghost structure
approach that it is worth motivating the decision to use them here. One
of the major limitations of the ghost structure approach for handling
deletes is that there is no principled method for removing deleted
records from the decomposed structure. The standard approach is to set an
arbitrary threshold on the number of deleted records, and rebuild the
entire structure when this threshold is crossed~\cite{saxe79}. Mixing the
``ghost'' records into the same structures as the original records allows
deleted records to be cleaned up naturally over time, as they meet their
tombstones during reconstructions. This is an important consequence that
will be discussed in more detail in
Section~\ref{ssec:sampling-delete-bounding}.

The two mechanisms trade off between two relevant aspects of performance:
the cost of performing the delete, and the cost of checking whether a
sampled record has been deleted. In addition, the use of tombstones
makes supporting concurrency and external data structures far easier.
This is because tombstone deletes are simple inserts, and thus they leave
the individual structures immutable. Tagging requires in-place updates of
the record header in the structures, resulting in possible race conditions
and random IO operations on disk. This makes tombstone deletes
particularly attractive in these contexts.


\subsubsection{Deletion Cost}
We will first consider the cost of performing a delete using either
mechanism.

\Paragraph{Tombstone Deletes.}
The amortized cost of a tombstone delete in a Bentley-Saxe dynamization
is the same as that of a simple insert,
\begin{equation*}
\mathscr{D}_A(n) \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
\end{equation*}
with the worst-case cost being $\Theta(B(n))$. Note that there is also
a minor performance effect resulting from deleted records appearing
twice within the structure, once for the original record and once for
the tombstone, inflating the overall size of the structure.

\Paragraph{Tagged Deletes.} In contrast to tombstone deletes, tagged
deletes are not simple inserts, and so have their own cost function. The
process of deleting a record under tagging consists of first searching
the entire structure for the record to be deleted, and then setting a
bit in its header. As a result, the performance of this operation is
a function of how expensive it is to locate an individual record within
the decomposed data structure.

In the theoretical literature, this lookup operation is provided
by a global hash table built over every record in the structure,
mapping each record to the block that contains it. Then, the data
structure's weak delete operation can be applied to the relevant
block~\cite{merge-dsp}. While this is certainly an option for us, we
note that the SSIs we are currently considering all support a reasonably
efficient $\Theta(\log n)$ lookup operation already, and so we have
elected to design tagged deletes to leverage this operation when
available, rather than maintaining a global hash table.
If a given SSI has a point-lookup cost of $L(n)$, then a tagged delete
on a Bentley-Saxe decomposition of that SSI will require, at worst,
executing a point-lookup on each block, with a total cost of

\begin{equation*}
\mathscr{D}(n) \in \Theta\left( L(n) \log_2 (n)\right)
\end{equation*}

If the SSI being considered does \emph{not} support an efficient
point-lookup operation, then a hash table can be used instead. We consider
individual hash tables associated with each block, rather than a single
global one, for simplicity of implementation and analysis. So, in these
cases, the same procedure as above can be used, with $L(n) \in \Theta(1)$.
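
Reusing the illustrative \texttt{Record} sketch from earlier, a tagged delete
over the decomposed structure might look as follows, where
\texttt{point\_lookup} is an assumed per-block operation with cost $L(n)$.

\begin{verbatim}
def tagged_delete(blocks, key):
    # In the worst case every block must be probed, giving the
    # O(L(n) log n) bound above.
    for block in blocks:
        rec = block.point_lookup(key)          # L(n) per block
        if rec is not None and not rec.is_deleted():
            rec.header |= DELETE_BIT           # set the visibility bit
            return True
    return False   # record absent, or already deleted
\end{verbatim}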


\begin{figure}
	\centering
	\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
	\subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
	
    \caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
    a record is sampled (1).
    When using the tombstone delete policy
    (Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying
    the Bloom filter of the mutable buffer. The filter indicates the record is
    not present, so (3) the filter on $L_0$ is queried next. This filter
    returns a false positive, so (4) a point-lookup is executed against $L_0$.
    The lookup fails to find a tombstone, so the search continues and (5) the
    filter on $L_1$ is checked, which reports that the tombstone is present.
    This time, it is not a false positive, and so (6) a lookup against $L_1$
    (7) locates the tombstone. The record is thus rejected. When using the
    tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and
    (2) checked directly for the delete tag. It is set, so the record is
    immediately rejected.} 

    \label{fig:delete}
	
\end{figure}

\subsubsection{Rejection Check Costs}

Because sampling queries are neither invertible nor deletion decomposable,
the query process must be modified to support deletes using either of the
above mechanisms. This modification entails requiring that each sampled
record be manually checked to confirm that it hasn't been deleted, prior
to adding it to the sample set. We call the cost of this operation the
\emph{rejection check cost}, $R(n)$. The process differs between the
two deletion mechanisms, and the two procedures are summarized in
Figure~\ref{fig:delete}.

For tagged deletes, this is a simple process. The information about the
deletion status of a given record is stored directly alongside the record,
within its header. So, once a record has been sampled, this check can be
immediately performed with $R(n) \in \Theta(1)$ time.

Tombstone deletes, however, introduce a significant difficulty in
performing the rejection check. The information about whether a record
has been deleted is not local to the record itself, and therefore a
point-lookup is required to search for the tombstone associated with
each sample. Thus, the rejection check cost when using tombstones to
implement deletes over a Bentley-Saxe decomposition of an SSI is,
\begin{equation}
R(n) \in \Theta( L(n) \log_2 n)
\end{equation}
This performance cost seems catastrophic, considering that it must be
paid per sample, but there are ways to mitigate it; one is sketched
below, and we return to these mitigations in more detail when discussing
the implementation of these results in
Section~\ref{sec:sampling-implementation}.
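
One such mitigation, already visible in Figure~\ref{fig:delete-tombstone},
is to attach a Bloom filter over tombstone keys to each block, so that most
blocks can be ruled out without a point-lookup. A sketch of the resulting
rejection check is given below; the per-block \texttt{tombstone\_filter} and
\texttt{tombstone\_lookup} members are assumed names for this illustration.

\begin{verbatim}
def rejected_by_tombstone(sampled, blocks):
    # Check the mutable buffer and each shard for a tombstone matching
    # the sampled record; a filter miss is definitive, a hit must be
    # confirmed by a point-lookup (cost L(n)).
    for block in blocks:
        if not block.tombstone_filter.might_contain(sampled.key):
            continue                       # definitely no tombstone here
        if block.tombstone_lookup(sampled.key) is not None:
            return True                    # tombstone found: reject
    return False                           # keep the sample
\end{verbatim}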


\subsubsection{Bounding Rejection Probability}

When a sampled record has been rejected, it must be resampled. This
introduces performance overhead from extra memory accesses and random
number generation, and hurts our ability to provide performance bounds
on sampling operations. In the worst case, a structure may consist
mostly or entirely of deleted records, resulting in a potentially
unbounded number of rejections during sampling. Thus, in order to
maintain sampling performance bounds, the probability of a rejection
during sampling must be bounded.

The reconstructions associated with Bentley-Saxe dynamization give us
a natural way of controlling the number of deleted records within the
structure, and thereby bounding the rejection rate. During reconstruction,
we have the opportunity to remove deleted records. This will, however,
cause the record counts associated with each block of the structure to
gradually drift out of alignment with the ``perfect'' powers of two
associated with the Bentley-Saxe method. In the theoretical literature on
this topic, the solution to this problem is to periodically repartition
all of the records to re-align the block sizes~\cite{merge-dsp, saxe79}.
This approach could also be easily applied here, if desired, though we
do not do so in our implementations, for reasons that will be discussed in
Section~\ref{sec:sampling-implementation}.

The process of removing these deleted records during reconstructions
differs between the two mechanisms. Tagged deletes are straightforward:
all tagged records can simply be dropped when they are involved in a
reconstruction. Tombstones, however, require a slightly more complex
approach. Rather than being dropped immediately, a deleted record and its
tombstone can only be dropped when both are involved in the \emph{same}
reconstruction. We call this process \emph{tombstone cancellation}. In
the general case, it can be implemented using a preliminary linear pass
over the records involved in a reconstruction to identify the records to
be dropped. In many cases, however, reconstruction involves sorting the
records anyway, and by taking care with ordering semantics, tombstones
and their associated records can be sorted into adjacent positions,
allowing them to be dropped during reconstruction without any extra
overhead.
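
The sorted variant of tombstone cancellation can be sketched as a single
pass over the merged input, assuming an ordering convention in which a
tombstone sorts immediately after the record it deletes (and reusing the
illustrative \texttt{Record} from before).

\begin{verbatim}
def cancel_tombstones(sorted_records):
    # Drop record/tombstone pairs that meet during a reconstruction;
    # unmatched tombstones are retained for a later reconstruction.
    out, i = [], 0
    while i < len(sorted_records):
        rec = sorted_records[i]
        nxt = sorted_records[i + 1] if i + 1 < len(sorted_records) else None
        if (nxt is not None and not rec.is_tombstone()
                and nxt.is_tombstone() and nxt.key == rec.key):
            i += 2            # record meets its tombstone: drop both
        else:
            out.append(rec)
            i += 1
    return out
\end{verbatim}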

While dropping deleted records during reconstruction helps, it is
not sufficient on its own to ensure a particular bound on the number of
deleted records within the structure. Pathological scenarios resulting in
unbounded rejection rates are possible even with this mitigation. For
example, tagging alone will never trigger reconstructions, and so it
would be possible to delete every single record within the structure
without a reconstruction ever occurring; with tombstones, records could
be deleted in the reverse order that they were inserted. In either case,
a passive system of dropping records naturally during reconstruction is
not sufficient.

Fortunately, this passive system can be used as the basis for a
system that does provide a bound. This is because it guarantees,
whether tagging or tombstones are used, that any given deleted
record will \emph{eventually} be cancelled out after a finite number
of reconstructions. If the number of deleted records gets too high,
some or all of these deleted records can be cleared out by proactively
performing reconstructions. We call these proactive reconstructions
\emph{compactions}.

The basic strategy, then, is to define a maximum allowable proportion
of deleted records, $\delta \in [0, 1]$. Each block in the decomposition
tracks the number of tombstones or tagged records within it. This count
can be easily maintained by incrementing a counter when a record in the
block is tagged, and by counting tombstones during reconstructions. These
counts are then monitored, and if the proportion of deletes in a block
ever exceeds $\delta$, a proactive reconstruction including this block
and one or more blocks below it in the structure is triggered. The
delete proportion of the newly compacted block is then checked again,
and this process repeated until all blocks respect the bound.

For tagging, a single round of compaction will always suffice, because all
deleted records involved in the reconstruction will be dropped. Tombstones
may require multiple cascading rounds of compaction, because a tombstone
record is only cancelled when it encounters the record that it deletes.
However, because tombstones always follow the record they delete in
insertion order, and will therefore always be ``above'' that record in
the structure, each reconstruction moves every tombstone involved closer
to the record it deletes, ensuring that eventually the bound will be
satisfied.
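
The compaction trigger itself reduces to checking each level's delete
proportion against $\delta$ and reconstructing when the bound is violated.
The sketch below assumes hypothetical \texttt{record\_count},
\texttt{delete\_count}, and \texttt{compact} operations on the structure.

\begin{verbatim}
def enforce_delete_bound(structure, delta):
    # Proactively compact any level whose delete proportion exceeds delta.
    i = 0
    while i < len(structure.levels):
        lvl = structure.levels[i]
        n = lvl.record_count()
        if n > 0 and lvl.delete_count() / n > delta:
            # Reconstruct this level together with one or more levels
            # below it; compaction may cascade, so recheck this level.
            structure.compact(i)
        else:
            i += 1
\end{verbatim}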

Asymptotically, this compaction process does not affect the amortized
insertion cost of the structure. This is because the cost is based on
the number of reconstructions that a given record is involved in over
the lifetime of the structure. Proactive compaction does not increase
the number of reconstructions, only \emph{when} they occur. 

\subsubsection{Sampling Procedure with Deletes}

Because sampling is neither deletion decomposable nor invertible,
the presence of deletes will have an effect on the query costs. As
already mentioned, the basic cost associated with deletes is a rejection
check associated with each sampled record. When a record is sampled,
it must be checked to determine whether it has been deleted or not. If
it has, then it must be rejected. Note that when this rejection occurs,
it cannot be retried immediately on the same block, but rather a new
block must be selected to sample from. This is because deleted records
aren't accounted for in the weight calculations, and so could introduce
bias. As a straightforward example of this problem, consider a block
that contains only deleted records. Any sample drawn from this block will
be rejected, and so retrying samples against this block will result in
an infinite loop. 

Assuming the compaction strategy mentioned in the previous section is
applied, ensuring a bound of at most $\delta$ proportion of deleted
records in the structure, and assuming all records have an equal
probability of being sampled, the cost of answering sampling queries
accounting for rejections is,

\begin{equation*}
%\label{eq:sampling-cost-del}
    \mathscr{Q}(n, k) = \Theta\left([W(n) + P(n)]\log_2 n  + \frac{k}{1 - \delta} \left(S(n) + R(n)\right)\right)
\end{equation*}
where $\frac{k}{1 - \delta}$ is the expected number of samples that must
be drawn to obtain a sample set of size $k$, each of which incurs both the
per-sample cost $S(n)$ and the rejection check cost $R(n)$.


\subsection{Insertion}
\label{ssec:insert}
The framework supports inserting new records by first appending them to the end
of the mutable buffer. When it is full, the buffer is flushed into a sequence
of levels containing shards of increasing capacity, using a procedure
determined by the layout policy as discussed in Section~\ref{sec:framework}.
This method allows for the cost of repeated shard reconstruction to be
effectively amortized.

Let the cost of constructing the SSI from an arbitrary set of $n$ records be
$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
of three parts: appending to the mutable buffer, constructing a new
shard from the buffered records during a flush, and the total cost of
reconstructing shards containing the record over the lifetime of the index. The
cost of appending to the mutable buffer is constant, and the cost of constructing a
shard from the buffer can be amortized across the records participating in the
buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
each record. To derive an expression for the cost of repeated reconstruction,
first note that each record will participate in at most $s$ reconstructions on
a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
$\log_s n$ levels. Thus, over the lifetime of the index a given record
will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
reconstruction.

Combining these results, the total amortized insertion cost is
\begin{equation}
O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
\end{equation}
This can be simplified by noting that $s$ is constant, and that $N_b$ is
also a constant with $N_b \ll n$. Neglecting these terms, the amortized
insertion cost of the framework is,
\begin{equation}
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}
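
To give a rough sense of scale (an illustrative calculation only, not tied
to any particular SSI in this chapter): if shards can be reconstructed by a
merge-like procedure with $C_r(n) \in \Theta(n)$, the amortized insertion
cost reduces to
\begin{equation*}
O\left(\frac{n}{n}\log_s n\right) = O\left(\log_s n\right),
\end{equation*}
whereas an SSI that must be rebuilt from scratch in $C_r(n) \in \Theta(n
\log n)$ time pays $O\left(\log n \log_s n\right)$ per insert.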

\captionsetup[subfloat]{justification=centering}

\begin{figure*}
	\centering
	\subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
	\subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
	
    \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
    mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
    of SSIs and auxiliary structures [A]) using the leveling
    (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
    policies. Records are represented as black/colored squares, and grey
    squares represent unused capacity. An insertion requiring a multi-level
    reconstruction is illustrated.} \label{fig:framework}
	
\end{figure*}

\section{Framework Implementation}

Our framework has been designed to work efficiently with any SSI, so long
as it has the following properties.

\begin{enumerate}
	\item The underlying full query $Q$ supported by the SSI from whose results
	samples are drawn satisfies the following property:
	for any dataset $D = \cup_{i = 1}^{n}D_i$
	where $D_i \cap D_j = \emptyset$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
	\item \emph{(Optional)} The SSI supports efficient point-lookups.
	\item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
	returned by the underlying full query.
\end{enumerate}

The first property applies to the query being sampled from, and is essential
for the correctness of sample sets reported by extended sampling
indexes.\footnote{ This condition is stricter than the definition of a
decomposable search problem in the Bentley-Saxe method, which allows for
\emph{any} constant-time merge operation, not just union.
However, this condition is satisfied by many common types of database
query, such as predicate-based filtering queries.} The latter two properties
are optional, but reduce deletion and sampling costs respectively. Should the
SSI fail to support point-lookups, an auxiliary hash table can be attached to 
the data structures.
Should it fail to support query result weight reporting, rejection
sampling can be used in place of the more efficient scheme discussed in
Section~\ref{ssec:sample}. The analysis of this framework will generally
assume that all three conditions are satisfied.
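
Stated as an interface, these requirements amount to something like the
following sketch; the method names are illustrative, not a prescribed API.

\begin{verbatim}
from typing import Any, Optional, Protocol, Sequence

class SSI(Protocol):
    # Required: build the static structure from a set of records.
    @classmethod
    def build(cls, records: Sequence[Any]) -> "SSI": ...

    # Required: two-phase sampling (pre-processing, then per-sample draws).
    def preprocess(self, q: Any) -> Any: ...
    def sample(self, q: Any, meta: Any) -> Any: ...

    # Optional (property 2): efficient point-lookups, used for tagged
    # deletes and tombstone rejection checks; otherwise an auxiliary
    # hash table is attached to the shard.
    def point_lookup(self, key: Any) -> Optional[Any]: ...

    # Optional (property 3): total weight of records matched by the
    # underlying query; otherwise rejection sampling is used.
    def weight(self, q: Any, meta: Any) -> float: ...
\end{verbatim}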

Given an SSI with these properties, a dynamic extension can be produced as
shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
shards containing an instance of the SSI being extended, and optional auxiliary
data structures. The auxiliary structures allow acceleration of certain
operations that are required by the framework, but which the SSI being extended
does not itself support efficiently. Examples of possible auxiliary structures
include hash tables, Bloom filters~\cite{bloom70}, and range
filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
increasing record capacity, with either one shard, or up to a fixed maximum
number of shards, per level. The decision to place one or many shards per level
is called the \emph{layout policy}. The policy names are borrowed from the
literature on the LSM tree, with the former called \emph{leveling} and the
latter called \emph{tiering}.

To avoid a reconstruction on every insert, an unsorted array of fixed capacity
($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
unsorted, it is kept small to maintain reasonably efficient sampling
and point-lookup performance. All updates are performed by appending new
records to the tail of this buffer. 
If a record currently within the index is
to be updated to a new value, it must first be deleted, and then a record with
the new value inserted. This ensures that old versions of records are properly
filtered from query results.

When the buffer is full, it is flushed to make room for new records. The
flushing procedure is based on the layout policy in use. When using leveling
(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
$L_0$ and those in the buffer. This is used to create a new shard, which
replaces the one previously in $L_0$. When using tiering
(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
buffer, and placed into $L_0$ without altering the existing shards. Each level
has a record capacity of $N_b \cdot s^{i+1}$, controlled by a configurable
parameter, $s$, called the scale factor. Records are organized in one large
shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under
tiering. When a level reaches its capacity, it must be emptied to make room for
the records flushed into it. This is accomplished by moving its records down to
the next level of the index. Under leveling, this requires constructing a new
shard containing all records from both the source and target levels, and
placing this shard into the target, leaving the source empty. Under tiering,
the shards in the source level are combined into a single new shard that is
placed into the target level. Should the target be full, it is first emptied by
applying the same procedure. New empty levels
are dynamically added as necessary to accommodate these reconstructions.
Note that shard reconstructions are not necessarily performed using
merging, though merging can be used as an optimization of the reconstruction
procedure where such an algorithm exists. In general, reconstruction requires
only pooling the records of the shards being combined and then applying the SSI's
standard construction algorithm to this set of records.
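
A sketch of this flush-and-reconstruction path under both layout policies
is given below. Here \texttt{build\_shard} stands in for the SSI's standard
construction algorithm, shards are assumed to expose their records via
\texttt{records}, and the field names (\texttt{buffer}, \texttt{levels},
\texttt{policy}, \texttt{Nb}, \texttt{s}) belong to this sketch rather than
the framework's actual interface.

\begin{verbatim}
def flush_buffer(fw):
    make_room(fw, 0)
    if fw.policy == "leveling":
        # Rebuild L0 from its existing records plus the buffered records.
        existing = list(fw.levels[0][0].records) if fw.levels[0] else []
        fw.levels[0] = [fw.build_shard(existing + list(fw.buffer))]
    else:  # tiering: add a new shard without touching existing ones
        fw.levels[0].append(fw.build_shard(list(fw.buffer)))
    fw.buffer.clear()

def make_room(fw, level):
    # Empty `level` into the level below it if it has reached capacity,
    # growing the structure with new empty levels as needed.
    if level >= len(fw.levels):
        fw.levels.append([])
        return
    cap = fw.Nb * fw.s ** (level + 1)
    if sum(len(sh.records) for sh in fw.levels[level]) < cap:
        return
    make_room(fw, level + 1)             # first ensure the target has room
    moved = [r for sh in fw.levels[level] for r in sh.records]
    if fw.policy == "leveling":
        tgt = fw.levels[level + 1]
        below = list(tgt[0].records) if tgt else []
        fw.levels[level + 1] = [fw.build_shard(below + moved)]
    else:
        fw.levels[level + 1].append(fw.build_shard(moved))
    fw.levels[level] = []
\end{verbatim}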

\begin{table}[t]
\caption{Frequently Used Notation}
\centering

\begin{tabular}{|p{2.5cm} p{5cm}|}
	\hline
	\textbf{Variable} & \textbf{Description} \\ \hline
	$N_b$ & Capacity of the mutable buffer \\ \hline
	$s$   & Scale factor \\ \hline
	$C_c(n)$ & SSI initial construction cost \\ \hline
	$C_r(n)$ & SSI reconstruction cost \\ \hline
	$L(n)$ & SSI point-lookup cost \\ \hline
	$P(n)$ & SSI sampling pre-processing cost \\ \hline
	$S(n)$ & SSI per-sample sampling cost \\ \hline
	$W(n)$ & Shard weight determination cost \\ \hline
	$R(n)$   & Shard rejection check cost  \\ \hline
	$\delta$ & Maximum delete proportion \\ \hline
	%$\rho$ & Maximum rejection rate \\ \hline
\end{tabular}
\label{tab:nomen}

\end{table}

Table~\ref{tab:nomen} lists frequently used notation for the various parameters
of the framework, which will be used in the coming analysis of the costs and
trade-offs associated with operations within the framework's design space. The
remainder of this section discusses the performance characteristics of
insertion into this structure (Section~\ref{ssec:insert}), how the structure
can be used to correctly answer sampling queries (Section~\ref{ssec:sample}),
and efficient approaches for supporting deletes (Section~\ref{ssec:delete}).
Finally, it closes with a detailed discussion of the trade-offs within the
framework's design space (Section~\ref{ssec:design-space}).




\subsection{Trade-offs on Framework Design Space}
\label{ssec:design-space}
The framework has several tunable parameters, allowing it to be tailored for
specific applications. This design space contains trade-offs among three major
performance characteristics: update cost, sampling cost, and auxiliary memory
usage. The two most significant decisions when implementing this framework are
the selection of the layout and delete policies. The asymptotic analysis of the
previous sections obscures some of the differences between these policies, but
they do have significant practical performance implications.

\Paragraph{Layout Policy.} The choice of layout policy represents a clear
trade-off between update and sampling performance. Leveling
results in fewer shards of larger size, whereas tiering results in a larger
number of smaller shards. As a result, leveling reduces the costs associated
with point-lookups and sampling query preprocessing by a constant factor,
compared to tiering. However, it results in more write amplification: a given
record may be involved in up to $s$ reconstructions on a single level, as
opposed to the single reconstruction per level under tiering. 

\Paragraph{Delete Policy.} The choice of delete policy represents a
trade-off between delete performance and sampling performance. Tagging
requires a point-lookup when performing a delete, which is more expensive
than the insert required by tombstones. However, it also allows
constant-time rejection checks, unlike tombstones, which require a
point-lookup for each sampled record. In situations where deletes are
common and write throughput is critical, tombstones may be more useful.
Tombstones are also ideal in situations where immutability is required,
or random writes must be avoided. Generally speaking, however, tagging is
superior when using SSIs that support it, because sampling rejection
checks will usually be more common than deletes.

\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
capacity and scale factor both influence the number of levels within the index,
and by extension the number of distinct shards. Sampling and point-lookups have
better performance with fewer shards. Smaller shards are also faster to
reconstruct, although the same adjustments that reduce shard size also result
in a larger number of reconstructions, so the trade-off here is less clear.

The scale factor has an interesting interaction with the layout policy: when
using leveling, the scale factor directly controls the amount of write
amplification per level. Larger scale factors mean more time is spent
reconstructing shards on a level, reducing update performance. Tiering does not
have this problem and should see its update performance benefit directly from a
larger scale factor, as this reduces the number of reconstructions.

The buffer capacity also influences the number of levels, but is more
significant in its effect on point-lookup performance: a lookup must perform a
linear scan of the buffer. Likewise, the unstructured nature of the buffer
also contributes negatively to sampling performance, irrespective of which
buffer sampling technique is used. As a result, although a large buffer will
reduce the number of shards, it will also hurt sampling and delete (under
tagging) performance. It is important to minimize the cost of these buffer
scans, and so it is preferable to keep the buffer small, ideally small enough
to fit within the CPU's L2 cache. The number of shards within the index is,
then, better controlled by changing the scale factor, rather than the buffer
capacity. Using a smaller buffer will result in more compactions and shard
reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
demonstrates that this is not a serious performance problem when the scale
factor is chosen appropriately. When the shards are in memory, frequent small
reconstructions do not carry a significant performance penalty compared to
less frequent, larger ones.

\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
auxiliary data structures allows memory to be traded for insertion or
sampling performance. The use of Bloom filters for accelerating
tombstone rejection checks has already been discussed, but many other options
exist. Bloom filters could also be used to accelerate point-lookups for delete
tagging, though such filters would require much more memory than tombstone-only
ones to be effective. An auxiliary hash table could be used to accelerate
point-lookups, or range filters like SuRF~\cite{zhang18} or
Rosetta~\cite{siqiang20} could be added to accelerate pre-processing for
range queries such as IRS or WIRS.