\section{Dynamic Sampling Index Framework} 
\label{sec:framework}

This work is an attempt to design a solution to independent sampling
that achieves \emph{both} efficient updates and near-constant cost per
sample. As the goal is to tackle the problem in a generalized fashion,
rather than designing problem-specific data structures to serve as the
basis of an index, a framework is created that allows existing static
data structures to be used as the basis for a sampling index, by
automatically adding support for data updates using a modified version
of the Bentley-Saxe method.

Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
directly applied to sampling problems. The concept of decomposability is not
cleanly applicable to sampling, because the distribution of records in the
result set, rather than the records themselves, must be matched following the
result merge. Efficiently controlling the distribution requires each sub-query
to access information external to the structure against which it is being
processed, a contingency unaccounted for by Bentley-Saxe. Further, the process
of reconstruction used in Bentley-Saxe provides poor worst-case complexity
bounds~\cite{saxe79}, and attempts to modify the procedure to provide better
worst-case performance are complex and have worse performance in the common
case~\cite{overmars81}. Despite these limitations, this chapter will argue that
the core principles of the Bentley-Saxe method can be profitably applied to
sampling indexes, once a system for controlling result set distributions and a
more effective reconstruction scheme have been devised. The solution to
the former will be discussed in Section~\ref{ssec:sample}. For the latter,
inspiration is drawn from the literature on the LSM tree.

The LSM tree~\cite{oneil96} is a data structure proposed to optimize
write throughput in disk-based storage engines. It consists of a memory
table of bounded size, used to buffer recent changes, and a hierarchy
of external levels containing indexes of exponentially increasing
size. When the memory table has reached capacity, it is emptied into the
external levels. Random writes are avoided by treating the data within
the external levels as immutable; all writes go through the memory
table. This introduces write amplification but maximizes sequential
writes, which is important for maintaining high throughput in disk-based
systems. The LSM tree is associated with a broad and well-studied design
space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing
trade-offs between three key performance metrics: read performance, write
performance, and auxiliary memory usage. The challenges faced in
reconstructing predominantly in-memory indexes are quite different from
those which the LSM tree is intended to address, having little to do with
disk-based systems and sequential IO operations. However, the LSM tree
possesses a rich design space for managing the periodic reconstruction of
data structures in a manner that is both more practical and more flexible
than that of Bentley-Saxe. By borrowing from this design space, this
preexisting body of work can be leveraged, and many of Bentley-Saxe's
limitations addressed.

\captionsetup[subfloat]{justification=centering}

\begin{figure*}
	\centering
	\subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
	\subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
	
    \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
    mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
    of SSIs and auxiliary structures [A]) using the leveling
    (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
    policies. Records are represented as black/colored squares, and grey
    squares represent unused capacity. An insertion requiring a multi-level
    reconstruction is illustrated.} \label{fig:framework}
	
\end{figure*}


\subsection{Framework Overview}
The goal of this chapter is to build a general framework that extends most SSIs
with efficient support for updates by splitting the index into small data structures
to reduce reconstruction costs, and then distributing the sampling process over these
smaller structures.
The framework is designed to work efficiently with any SSI, so
long as it has the following properties,
\begin{enumerate}
	\item The underlying full query $Q$ supported by the SSI from whose results
	samples are drawn satisfies the following property:
	for any dataset $D = \cup_{i = 1}^{n}D_i$
	where $D_i \cap D_j = \emptyset$ for all $i \neq j$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
	\item \emph{(Optional)} The SSI supports efficient point-lookups.
	\item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
	returned by the underlying full query.
\end{enumerate}

The first property applies to the query being sampled from, and is essential
for the correctness of sample sets reported by extended sampling
indexes.\footnote{ This condition is stricter than the definition of a
decomposable search problem in the Bentley-Saxe method, which allows for
\emph{any} constant-time merge operation, not just union.
However, this condition is satisfied by many common types of database
query, such as predicate-based filtering queries.} The latter two properties
are optional, but reduce deletion and sampling costs respectively. Should the
SSI fail to support point-lookups, an auxiliary hash table can be attached to 
the data structures.
Should it fail to support query result weight reporting, rejection
sampling can be used in place of the more efficient scheme discussed in
Section~\ref{ssec:sample}. The analysis of this framework will generally
assume that all three conditions are satisfied.
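
To make the first property concrete, the short Python sketch below checks that
a predicate-based filter query distributes over a disjoint partitioning of its
data, which is exactly what allows per-shard results to be combined by union.
The query, dataset, and partitioning used here are purely illustrative.
\begin{verbatim}
# A minimal check of the union-decomposability property: for disjoint
# partitions D_1, ..., D_m of D, Q(D) = Q(D_1) | ... | Q(D_m).
def Q(records, lo=25, hi=75):
    """A predicate-based filter query, e.g. the range filter behind IRS."""
    return {r for r in records if lo <= r <= hi}

D = set(range(100))
partitions = [set(range(0, 40)), set(range(40, 70)), set(range(70, 100))]

assert Q(D) == set().union(*(Q(p) for p in partitions))
\end{verbatim}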

Given an SSI with these properties, a dynamic extension can be produced as
shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
shards containing an instance of the SSI being extended, and optional auxiliary
data structures. The auxiliary structures allow acceleration of certain
operations that are required by the framework, but which the SSI being extended
does not itself support efficiently. Examples of possible auxiliary structures
include hash tables, Bloom filters~\cite{bloom70}, and range
filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
increasing record capacity, with either one shard, or up to a fixed maximum
number of shards, per level. The decision to place one or many shards per level
is called the \emph{layout policy}. The policy names are borrowed from the
literature on the LSM tree, with the former called \emph{leveling} and the
latter called \emph{tiering}.

To avoid a reconstruction on every insert, an unsorted array of fixed capacity
($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
unsorted, it is kept small to maintain reasonably efficient sampling
and point-lookup performance. All updates are performed by appending new
records to the tail of this buffer. 
If a record currently within the index is
to be updated to a new value, it must first be deleted, and then a record with
the new value inserted. This ensures that old versions of records are properly
filtered from query results.

When the buffer is full, it is flushed to make room for new records. The
flushing procedure is based on the layout policy in use. When using leveling
(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
$L_0$ and those in the buffer. This is used to create a new shard, which
replaces the one previously in $L_0$. When using tiering
(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
buffer, and placed into $L_0$ without altering the existing shards. Each level
has a record capacity of $N_b \cdot s^{i+1}$, controlled by a configurable
parameter, $s$, called the scale factor. Records are organized in one large
shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under
tiering. When a level reaches its capacity, it must be emptied to make room for
the records flushed into it. This is accomplished by moving its records down to
the next level of the index. Under leveling, this requires constructing a new
shard containing all records from both the source and target levels, and
placing this shard into the target, leaving the source empty. Under tiering,
the shards in the source level are combined into a single new shard that is
placed into the target level. Should the target be full, it is first emptied by
applying the same procedure. New empty levels
are dynamically added as necessary to accommodate these reconstructions.
Note that shard reconstructions are not necessarily performed using
merging, though merging can be used as an optimization of the reconstruction
procedure where such an algorithm exists. In general, reconstruction requires
only pooling the records of the shards being combined and then applying the SSI's
standard construction algorithm to this set of records.
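
The flush and level-reconstruction procedure described above is summarized in
the following Python sketch. It is a simplification under stated assumptions:
\texttt{build\_ssi} stands in for the SSI's standard construction algorithm,
and the shard, level, and parameter names are hypothetical rather than drawn
from a concrete implementation.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Shard:
    records: list   # stands in for an SSI instance plus auxiliary structures

def build_ssi(records):
    # Stand-in for the SSI's standard construction algorithm applied to a
    # pooled set of records (here: simply sorting them).
    return Shard(sorted(records))

def flush_buffer(levels, buffer, policy, s, nb):
    """Flush the mutable buffer into L0, reconstructing levels as needed.
    levels is a list of lists of Shards; policy is 'leveling' or 'tiering'."""
    make_room(levels, 0, len(buffer), policy, s, nb)
    if policy == "leveling":
        existing = levels[0][0].records if levels[0] else []
        levels[0] = [build_ssi(existing + list(buffer))]
    else:  # tiering: add a new shard without touching the existing ones
        levels[0].append(build_ssi(list(buffer)))
    buffer.clear()

def make_room(levels, i, incoming, policy, s, nb):
    """Empty level i into level i+1 if the incoming records would exceed its
    record capacity of nb * s^(i+1), growing the index when necessary."""
    if i == len(levels):
        levels.append([])
        return
    held = sum(len(sh.records) for sh in levels[i])
    if held + incoming <= nb * s ** (i + 1):
        return
    make_room(levels, i + 1, held, policy, s, nb)
    pooled = [r for sh in levels[i] for r in sh.records]
    if policy == "leveling":
        target = levels[i + 1][0].records if levels[i + 1] else []
        levels[i + 1] = [build_ssi(target + pooled)]
    else:
        levels[i + 1].append(build_ssi(pooled))
    levels[i] = []
\end{verbatim}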

\begin{table}[t]
\caption{Frequently Used Notation}
\centering

\begin{tabular}{|p{2.5cm} p{5cm}|}
	\hline
	\textbf{Variable} & \textbf{Description} \\ \hline
	$N_b$ & Capacity of the mutable buffer \\ \hline
	$s$   & Scale factor \\ \hline
	$C_c(n)$ & SSI initial construction cost \\ \hline
	$C_r(n)$ & SSI reconstruction cost \\ \hline
	$L(n)$ & SSI point-lookup cost \\ \hline
	$P(n)$ & SSI sampling pre-processing cost \\ \hline
	$S(n)$ & SSI per-sample sampling cost \\ \hline
	$W(n)$ & Shard weight determination cost \\ \hline
	$R(n)$   & Shard rejection check cost  \\ \hline
	$\delta$ & Maximum delete proportion \\ \hline
	%$\rho$ & Maximum rejection rate \\ \hline
\end{tabular}
\label{tab:nomen}

\end{table}

Table~\ref{tab:nomen} lists frequently used notation for the various parameters
of the framework, which will be used in the coming analysis of the costs and
trade-offs associated with operations within the framework's design space. The
remainder of this section will discuss the performance characteristics of
insertion into this structure (Section~\ref{ssec:insert}), how it can be used
to correctly answer sampling queries (Section~\ref{ssec:sample}), and efficient
approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will
close with a detailed discussion of the trade-offs within the framework's
design space (Section~\ref{ssec:design-space}).


\subsection{Insertion}
\label{ssec:insert}
The framework supports inserting new records by first appending them to the end
of the mutable buffer. When it is full, the buffer is flushed into a sequence
of levels containing shards of increasing capacity, using a procedure
determined by the layout policy as discussed in Section~\ref{sec:framework}.
This method allows for the cost of repeated shard reconstruction to be
effectively amortized.

Let the cost of constructing the SSI from an arbitrary set of $n$ records be
$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
of three parts: appending to the mutable buffer, constructing a new
shard from the buffered records during a flush, and the total cost of
reconstructing shards containing the record over the lifetime of the index. The
cost of appending to the mutable buffer is constant, and the cost of constructing a
shard from the buffer can be amortized across the records participating in the
buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
each record. To derive an expression for the cost of repeated reconstruction,
first note that each record will participate in at most $s$ reconstructions on
a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
$\log_s n$ levels. Thus, over the lifetime of the index a given record
will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
reconstruction.

Combining these results, the total amortized insertion cost is
\begin{equation}
O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
\end{equation}
This can be simplified by noting that $s$ is constant, and that $N_b \ll n$ is also
a constant. Neglecting these terms, the amortized insertion cost of the
framework is,
\begin{equation}
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}
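For example, if the SSI can be rebuilt from a pooled set of records in linear
time, so that $C_r(n) \in O(n)$, this expression reduces to an amortized cost
of $O\left(\log_s n\right)$ per insert, while an SSI with $C_r(n) \in O(n \log
n)$ pays $O\left(\log n \log_s n\right)$.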


\subsection{Sampling} 
\label{ssec:sample}

\begin{figure}
    \centering
	\includegraphics[width=\textwidth]{img/sigmod23/sampling}
    \caption{\textbf{Overview of the multiple-shard sampling query process} for
    Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
    the shards are determined, then (2) these weights are used to construct an
    alias structure. Next, (3) the alias structure is queried $k$ times to
    determine per shard sample sizes, and then (4) sampling is performed.
    Finally, (5) any rejected samples are retried starting from the alias
    structure, and the process is repeated until the desired number of samples
    has been retrieved.}
	\label{fig:sample}
	
\end{figure}

For many SSIs, sampling queries are completed in two stages. Some preliminary
processing is done to identify the range of records from which to sample, and then
samples are drawn from that range. For example, IRS over a sorted list of
records can be performed by first identifying the upper and lower bounds of the
query range in the list, and then sampling records by randomly generating
indexes within those bounds. The general cost of a sampling query can be
modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is
the number of samples drawn, and $S(n)$ is the cost of sampling a single
record.

When sampling from multiple shards, the situation grows more complex. For each
sample, the shard to select the record from must first be decided. Consider an
arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against
dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D
= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset$ for all $i \neq j$. The
framework must ensure that $X(D, k)$ and $\bigcup_{i=1}^m X(D_i, k_i)$ follow
the same distribution, by selecting appropriate values for the $k_i$s. If care
is not taken to balance the number of samples drawn from a shard with the total
weight of the shard under $X$, then bias can be introduced into the sample
set's distribution. The selection of $k_i$s can be viewed as an instance of WSS,
and solved using the alias method.

When sampling using the framework, first the weight of each shard under the
sampling query is determined and a \emph{shard alias structure} built over
these weights. Then, for each sample, the shard alias is used to
determine the shard from which to draw the sample. Let $W(n)$ be the cost of
determining this total weight for a single shard under the query. The initial setup
cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s
n\right)$, as the preliminary work for sampling from each shard must be
performed, as well as weights determined and alias structure constructed. In
many cases, however, the preliminary work will also determine the total weight,
and so the relevant operation need only be applied once to accomplish both
tasks.

To ensure that all records appear in the sample set with the appropriate
probability, the mutable buffer itself must also be a valid target for
sampling. There are two generally applicable techniques for accomplishing
this, both of which are supported by the framework. The query being sampled
from can be directly executed against the buffer and the result set used to
build a temporary SSI, which can be sampled from. Alternatively, rejection
sampling can be used to sample directly from the buffer, without executing the
query. In this case, the total weight of the buffer is used for its entry in
the shard alias structure. This can result in the buffer being
over-represented in the shard selection process, and so any rejections during
buffer sampling must be retried starting from shard selection. These same
considerations apply to rejection sampling used against shards, as well.
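
The following Python sketch summarizes the query procedure, assuming rejection
sampling is used for the mutable buffer. For brevity, it performs shard
selection with a generic weighted draw; in practice an alias structure would
be built over the shard weights so that each selection costs constant time.
All interfaces shown (\texttt{prepare}, \texttt{sample}, \texttt{matches}, and
so on) are hypothetical stand-ins for whatever the extended SSI provides.
\begin{verbatim}
import random

def sampling_query(shards, buffer, buffer_weight, query, k):
    """Draw k independent samples from the union of all shards and the buffer."""
    # (1) Per-shard pre-processing (cost P(n)) and weight determination (W(n)).
    states = [sh.prepare(query) for sh in shards]
    weights = [st.total_weight for st in states]
    # (2) The buffer participates with its total weight (rejection sampling),
    #     so it may be over-represented; rejected draws restart from selection.
    targets = states + ["buffer"]
    weights = weights + [buffer_weight]

    samples = []
    while len(samples) < k:
        # (3) Select a target in proportion to its weight (an alias structure
        #     in practice), then (4) draw one record from it (cost S(n)).
        target = random.choices(targets, weights=weights, k=1)[0]
        if target == "buffer":
            record = random.choice(buffer)
            if not query.matches(record):
                continue            # (5) rejection: retry from shard selection
        else:
            record = target.sample()
        samples.append(record)
    return samples
\end{verbatim}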


\begin{example}
    \label{ex:sample}
    Consider executing a WSS query, with $k=1000$, across three shards
    containing integer keys with unit weight. $S_1$ contains only the
    key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$
    contains all integers on $[101, 200]$. These structures are shown
    in Figure~\ref{fig:sample}. Sampling is performed by first
    determining the normalized weights for each shard: $w_1 = 0.005$,
    $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
    shard alias structure. The shard alias structure is then queried
    $k$ times, resulting in a distribution of $k_i$s that is
    commensurate with the relative weights of each shard. Finally,
    each shard is queried in turn to draw the appropriate number
    of samples.
\end{example}


Assuming that rejection sampling is used on the mutable buffer, the worst-case
time complexity for drawing $k$ samples from an index containing $n$ elements
with a sampling cost of $S(n)$ is,
\begin{equation}
    \label{eq:sample-cost}
    O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right)
\end{equation}

%If instead a temporary SSI is constructed, the cost of sampling
%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$. 

\begin{figure}
	\centering
	\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
	\subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
	
    \caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
    a record is sampled (1).
    When using the tombstone delete policy
    (Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying
    the Bloom filter of the mutable buffer. The filter indicates the record is
    not present, so (3) the filter on $L_0$ is queried next. This filter
    returns a false positive, so (4) a point-lookup is executed against $L_0$.
    The lookup fails to find a tombstone, so the search continues and (5) the
    filter on $L_1$ is checked, which reports that the tombstone is present.
    This time, it is not a false positive, and so (6) a lookup against $L_1$
    (7) locates the tombstone. The record is thus rejected. When using the
    tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and
    (2) checked directly for the delete tag. It is set, so the record is
    immediately rejected.} 

    \label{fig:delete}
	
\end{figure}


\subsection{Deletion} 
\label{ssec:delete}

Because the shards are static, records cannot be arbitrarily removed from them.
This requires that deletes be supported in some other way, with the ultimate
goal being the prevention of deleted records' appearance in sampling query
result sets. This can be realized in two ways: locating the record and marking
it, or inserting a new record which indicates that an existing record should be
treated as deleted. The framework supports both of these techniques, the
selection of which is called the \emph{delete policy}. The former policy is
called \emph{tagging} and the latter \emph{tombstone}.

Tagging a record is straightforward. Point-lookups are performed against each
shard in the index, as well as the buffer, for the record to be deleted. When
it is found, a bit in a header attached to the record is set. When sampling,
any records selected with this bit set are automatically rejected. Tombstones
represent a lazy strategy for deleting records. When a record is deleted using
tombstones, a new record with identical key and value, but with a ``tombstone''
bit set, is inserted into the index. A record's presence can be checked by
performing a point-lookup. If a tombstone with the same key and value exists
above the record in the index, then it should be rejected when sampled.
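
The two delete policies can be sketched as follows, over a deliberately
simplified record layout (a small header bitfield per record); the
\texttt{Container} class and every name here are illustrative stand-ins rather
than the framework's actual interfaces.
\begin{verbatim}
TOMBSTONE = 0x1   # header bit marking a tombstone record
DELETED   = 0x2   # header bit set by the tagging policy

class Container:
    """Stand-in for the mutable buffer or a shard."""
    def __init__(self, records=None):
        self.records = records or []   # dicts with key, value, header fields

    def point_lookup(self, key, value, tombstone=False):
        for r in self.records:
            if (r["key"], r["value"]) == (key, value) and \
                    bool(r["header"] & TOMBSTONE) == tombstone:
                return r
        return None

def delete_by_tagging(containers, key, value):
    """Point-lookup the live record in the buffer or any shard and tag it."""
    for c in containers:
        rec = c.point_lookup(key, value)
        if rec is not None:
            rec["header"] |= DELETED
            return True
    return False

def delete_by_tombstone(buffer, key, value):
    """Lazily record the delete by inserting a tombstone into the buffer."""
    buffer.records.append({"key": key, "value": value, "header": TOMBSTONE})

def is_deleted(containers_above, rec, policy):
    """Rejection check applied to each sampled record."""
    if policy == "tagging":
        return bool(rec["header"] & DELETED)    # O(1): the tag is local
    # Tombstone policy: search the buffer and every shard above the record.
    return any(c.point_lookup(rec["key"], rec["value"], tombstone=True)
               is not None for c in containers_above)
\end{verbatim}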

Two important aspects of performance are pertinent when discussing deletes: the
cost of the delete operation, and the cost of verifying the presence of a
sampled record. The choice of delete policy represents a trade-off between
these two costs. Beyond this simple trade-off, the delete policy also has other
implications that can affect its applicability to certain types of SSI. Most
notably, tombstones do not require any in-place updating of records, whereas
tagging does. This means that using tombstones is the only way to ensure total
immutability of the data within shards, which avoids random writes and eases
concurrency control. The tombstone delete policy, then, is particularly
appealing in external and concurrent contexts.

\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
of the record to be deleted, and so is more expensive. Assuming a point-lookup
operation with cost $L(n)$, a tagged delete must search each level in the
index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
time. 

\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
itself, the delete policy affects the cost of determining if a given record has
been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
using tagging, the information necessary to make the rejection decision is
local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
it is not; a point-lookup must be performed to search for a given record's
corresponding tombstone. This look-up must examine the buffer, and each shard
within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
L(n) \log_s n\right)$. The rejection check process for the two delete policies is
summarized in Figure~\ref{fig:delete}.

Two factors contribute to the tombstone rejection check cost: the size of the
buffer, and the cost of performing a point-lookup against the shards. The
latter cost can be controlled using the framework's ability to associate
auxiliary structures with shards. For SSIs which do not support efficient
point-lookups, a hash table can be added to map key-value pairs to their
location within the SSI. This allows for constant-time rejection checks, even
in situations where the index would not otherwise support them. However, the
storage cost of this intervention is high, and in situations where the SSI does
support efficient point-lookups, it is not necessary. Further performance
improvements can be achieved by noting that the probability of a given record
having an associated tombstone in any particular shard is relatively small.
This means that many point-lookups will be executed against shards that do not
contain the tombstone being searched for. In this case, these unnecessary
lookups can be partially avoided using Bloom filters~\cite{bloom70} for
tombstones. By inserting tombstones into these filters during reconstruction,
point-lookups against some shards which do not contain the tombstone being
searched for can be bypassed. Filters can be attached to the buffer as well,
which may be even more significant due to the linear cost of scanning it. As
the goal is a reduction of rejection check costs, these filters need only be
populated with tombstones. In a later section, techniques for bounding the
number of tombstones on a given level are discussed, which will allow for the
memory usage of these filters to be tightly controlled while still ensuring
precise bounds on filter error.
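
The interaction between tombstone Bloom filters and the rejection check is
sketched below, continuing the illustrative container interface from the
previous sketch and assuming each container carries an optional
\texttt{filter} attribute. The filter shown is a trivial stand-in (a set of
hashes) rather than a real Bloom filter, which would use a fixed-size bit
array and several hash functions.
\begin{verbatim}
class TombstoneFilter:
    """Trivial stand-in for a Bloom filter populated only with tombstones."""
    def __init__(self):
        self._hashes = set()

    def add(self, key, value):
        self._hashes.add(hash((key, value)))

    def may_contain(self, key, value):
        # A real Bloom filter answers "possibly present" or "definitely not".
        return hash((key, value)) in self._hashes

def tombstone_rejection_check(containers_above, rec):
    """Return True if a tombstone for rec exists above it in the index."""
    for c in containers_above:   # buffer first, then shards from newest down
        # Skip the point-lookup when the filter rules the tombstone out.
        if c.filter is not None and not c.filter.may_contain(rec["key"], rec["value"]):
            continue
        if c.point_lookup(rec["key"], rec["value"], tombstone=True) is not None:
            return True
    return False
\end{verbatim}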

\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
alters the analysis of sampling costs. A record that has been deleted cannot
be present in the sample set, and therefore the presence of each sampled record
must be verified. If a record has been deleted, it must be rejected. When
retrying samples rejected due to deletion, the process must restart from shard
selection, as deleted records may be counted in the weight totals used to
construct that structure. This increases the cost of sampling to,
\begin{equation}
\label{eq:sampling-cost}
    O\left([W(n) + P(n)]\log_s n  + \frac{k\left[S(n) + R(n)\right]}{1 - \mathbf{Pr}[\text{rejection}]}\right)
\end{equation}
where $R(n)$ is the cost of checking if a sampled record has been deleted, and
$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
attempts required to obtain $k$ samples, given a fixed rejection probability.
The rejection probability itself is a function of the workload, and is
unbounded. 

\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
constitute wasted memory accesses and random number generations, and so steps
should be taken to minimize their frequency. The probability of a rejection is
directly related to the number of deleted records, which is itself a function
of workload and dataset. This means that, without building counter-measures
into the framework, tight bounds on sampling performance cannot be provided in
the presence of deleted records. It is therefore critical that the framework
support some method for bounding the number of deleted records within the
index.

While the static nature of shards prevents the direct removal of records at the
moment they are deleted, it does not prevent the removal of records during
reconstruction. When using tagging, all tagged records encountered during
reconstruction can be removed. When using tombstones, however, the removal
process is non-trivial. In principle, a rejection check could be performed for
each record encountered during reconstruction, but this would increase
reconstruction costs and introduce a new problem of tracking tombstones
associated with records that have been removed. Instead, a lazier approach can
be used: delaying removal until a tombstone and its associated record
participate in the same shard reconstruction. This delay allows both the record
and its tombstone to be removed at the same time, an approach called
\emph{tombstone cancellation}. In general, this can be implemented using an
extra linear scan of the input shards before reconstruction to identify
tombstones and associated records for cancellation, but potential optimizations
exist for many SSIs, allowing it to be performed during the reconstruction
itself at no extra cost.
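
A general-purpose form of tombstone cancellation, implemented as the extra
linear scan described above, might look like the sketch below. It reuses the
illustrative record layout from the earlier delete sketch and assumes
key-value pairs are unique among live records.
\begin{verbatim}
def cancel_deletes(pooled_records):
    """Pre-process the pooled input of a reconstruction: drop tagged records,
    and drop each tombstone together with its matching record when both
    participate in the same reconstruction."""
    tombstone_keys = {(r["key"], r["value"]) for r in pooled_records
                      if r["header"] & TOMBSTONE}
    live_keys = {(r["key"], r["value"]) for r in pooled_records
                 if not (r["header"] & TOMBSTONE)}
    cancelled = tombstone_keys & live_keys   # pairs that can cancel out

    survivors = []
    for r in pooled_records:
        if r["header"] & DELETED:                  # tagging: drop outright
            continue
        if (r["key"], r["value"]) in cancelled:    # record and its tombstone
            continue
        survivors.append(r)
    return survivors
\end{verbatim}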

The removal of deleted records passively during reconstruction is not enough to
bound the number of deleted records within the index. It is not difficult to
envision pathological scenarios where deletes result in unbounded rejection
rates, even with this mitigation in place. However, the dropping of deleted
records does provide a useful property: any specific deleted record will
eventually be removed from the index after a finite number of reconstructions.
Using this fact, a bound on the number of deleted records can be enforced. A
new parameter, $\delta$, is defined, representing the maximum proportion of
deleted records within the index. Each level, and the buffer, tracks the number
of deleted records it contains by counting its tagged records or tombstones.
Following each buffer flush, the proportion of deleted records is checked
against $\delta$. If any level is found to exceed it, then a proactive
reconstruction is triggered, pushing its shards down into the next level. The
process is repeated until all levels respect the bound, allowing the number of
deleted records to be precisely controlled, which, by extension, bounds the
rejection rate. This process is called \emph{compaction}.
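
The compaction trigger can be expressed as a simple check performed after
every buffer flush, as in the sketch below. It reuses the level layout from
the flush sketch; the per-shard \texttt{delete\_count} field and the
\texttt{push\_down} helper (which performs the normal level reconstruction
into the next level) are assumptions for illustration rather than parts of a
concrete implementation.
\begin{verbatim}
def enforce_delete_bound(levels, delta, push_down):
    """After a buffer flush, proactively compact any level whose proportion of
    tagged records or tombstones exceeds delta. push_down(levels, i) merges
    level i into level i + 1 using the usual reconstruction procedure."""
    i = 0
    while i < len(levels):
        total = sum(len(sh.records) for sh in levels[i])
        deleted = sum(sh.delete_count for sh in levels[i])
        if total > 0 and deleted / total > delta:
            push_down(levels, i)   # later iterations recheck the levels below
        i += 1
\end{verbatim}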

Assuming every record is equally likely to be sampled, this new bound can be
applied to the analysis of sampling costs. The probability of a record being
rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
Equation~\ref{eq:sampling-cost} yields,
\begin{equation}
%\label{eq:sampling-cost-del}
    O\left([W(n) + P(n)]\log_s n  + \frac{k\left[S(n) + R(n)\right]}{1 - \delta}\right)
\end{equation}

Asymptotically, this proactive compaction does not alter the analysis of
insertion costs. Each record is still written at most $s$ times on each level,
there are at most $\log_s n$ levels, and the buffer insertion and SSI
construction costs are unchanged. As a result, the amortized insertion cost
remains the same.

This compaction strategy is based upon tombstone and record counts, and the
bounds assume that every record is equally likely to be sampled. For certain
sampling problems (such as WSS), there are other conditions that must be
considered to provide a bound on the rejection rate. To account for these
situations in a general fashion, the framework supports problem-specific
compaction triggers that can be tailored to the SSI being used. These allow
compactions to be triggered based on other properties, such as rejection rate
of a level, weight of deleted records, and the like.


\subsection{Trade-offs in the Framework Design Space}
\label{ssec:design-space}
The framework has several tunable parameters, allowing it to be tailored for
specific applications. This design space contains trade-offs among three major
performance characteristics: update cost, sampling cost, and auxiliary memory
usage. The two most significant decisions when implementing this framework are
the selection of the layout and delete policies. The asymptotic analysis of the
previous sections obscures some of the differences between these policies, but
they do have significant practical performance implications.

\Paragraph{Layout Policy.} The choice of layout policy represents a clear
trade-off between update and sampling performance. Leveling
results in fewer shards of larger size, whereas tiering results in a larger
number of smaller shards. As a result, leveling reduces the costs associated
with point-lookups and sampling query preprocessing by a constant factor,
compared to tiering. However, it results in more write amplification: a given
record may be involved in up to $s$ reconstructions on a single level, as
opposed to the single reconstruction per level under tiering. 

\Paragraph{Delete Policy.} There is a trade-off between delete performance and
sampling performance that exists in the choice of delete policy. Tagging
requires a point-lookup when performing a delete, which is more expensive than
the insert required by tombstones. However, it also allows constant-time
rejection checks, unlike tombstones which require a point-lookup of each
sampled record. In situations where deletes are common and write-throughput is
critical, tombstones may be more useful. Tombstones are also ideal in
situations where immutability is required, or random writes must be avoided.
Generally speaking, however, tagging is superior when using SSIs that support
it, because sampling rejection checks will usually be more common than deletes.

\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
capacity and scale factor both influence the number of levels within the index,
and by extension the number of distinct shards. Sampling and point-lookups have
better performance with fewer shards. Smaller shards are also faster to
reconstruct, although the same adjustments that reduce shard size also result
in a larger number of reconstructions, so the trade-off here is less clear.

The scale factor has an interesting interaction with the layout policy: when
using leveling, the scale factor directly controls the amount of write
amplification per level. Larger scale factors mean more time is spent
reconstructing shards on a level, reducing update performance. Tiering does not
have this problem and should see its update performance benefit directly from a
larger scale factor, as this reduces the number of reconstructions.

The buffer capacity also influences the number of levels, but is more
significant in its effects on point-lookup performance: a lookup must perform a
linear scan of the buffer. Likewise, the unstructured nature of the buffer
will also contribute negatively to sampling performance, irrespective of which
buffer sampling technique is used. As a result, although a large buffer will
reduce the number of shards, it will also hurt sampling and delete (under
tagging) performance. It is important to minimize the cost of these buffer
scans, and so it is preferable to keep the buffer small, ideally small enough
to fit within the CPU's L2 cache. The number of shards within the index is,
then, better controlled by changing the scale factor, rather than the buffer
capacity. Using a smaller buffer will result in more compactions and shard
reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
demonstrates that this is not a serious performance problem when a scale factor
is chosen appropriately. When the shards are in memory, frequent small
reconstructions do not have a significant performance penalty compared to less
frequent, larger ones.

\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
auxiliary data structures allows for memory to be traded in exchange for
insertion or sampling performance. The use of Bloom filters for accelerating
tombstone rejection checks has already been discussed, but many other options
exist. Bloom filters could also be used to accelerate point-lookups for delete
tagging, though such filters would require much more memory than tombstone-only
ones to be effective. An auxiliary hash table could be used to accelerate
point-lookups, or range filters such as SuRF \cite{zhang18} or Rosetta
\cite{siqiang20} could be added to accelerate the pre-processing of range
queries, such as those used in IRS or WIRS.