\section{Dynamization of SSIs}
\label{sec:framework}
Our goal, then, is to design a solution to independent sampling that is
able to achieve \emph{both} efficient updates and efficient sampling,
while also maintaining statistical independence both within and between
IQS queries, and to do so in a generalized fashion without needing to
design new dynamic data structures for each problem. Given the range
of SSIs already available, it seems reasonable to attempt to apply
dynamization techniques to accomplish this goal. Using the Bentley-Saxe
method would allow us to support inserts and deletes without
requiring any modification of the SSIs. Unfortunately, as discussed
in Section~\ref{ssec:background-irs}, there are problems with directly
applying BSM to sampling problems. All of the considerations discussed
there in the context of IRS apply equally to the other sampling problems
considered in this chapter. In this section, we will discuss approaches
for resolving these problems.
\subsection{Sampling over Partitioned Datasets}
\label{ssec:sample}
The core problem facing any attempt to dynamize SSIs is that independently
sampling from a partitioned dataset is difficult. As discussed in
Section~\ref{ssec:background-irs}, accomplishing this task within the
DSP model used by the Bentley-Saxe method requires drawing a full $k$
samples from each of the blocks, and then repeatedly down-sampling each
of the intermediate sample sets. However, it is possible to devise a
more efficient query process if we abandon the DSP model and consider
a slightly more complicated procedure.
First, we need to resolve a minor definitional problem. As noted before,
the DSP model is based on deterministic queries. The definition doesn't
apply for sampling queries, because it assumes that the result sets of
identical queries should also be identical. For general IQS, we also need
to enforce conditions on the query being sampled from.
\begin{definition}[Query Sampling Problem]
Given a search problem, $F$, a sampling problem is a function
of the form $X: (F, \mathcal{D}, \mathcal{Q}, \mathbb{Z}^+)
\to \mathcal{R}$ where $\mathcal{D}$ is the domain of records
and $\mathcal{Q}$ is the domain of query parameters of $F$. The
solution to a sampling problem, $R \in \mathcal{R}$, is a subset of
records from the solution to $F$, drawn independently, such that
$|R| = k$ for some $k \in \mathbb{Z}^+$.
\end{definition}
With this in mind, we can now define the decomposability conditions for
a query sampling problem,
\begin{definition}[Decomposable Sampling Problem]
A query sampling problem, $X: (F, \mathcal{D}, \mathcal{Q},
\mathbb{Z}^+) \to \mathcal{R}$, is decomposable if and only if
the following conditions are met for all $q \in \mathcal{Q},
k \in \mathbb{Z}^+$,
\begin{enumerate}
\item There exists a $\Theta(C(n,k))$ time computable, associative, and
commutative binary operator $\mergeop$ such that,
\begin{equation*}
X(F, A \cup B, q, k) \sim X(F, A, q, k)~ \mergeop ~X(F,
B, q, k)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B
= \emptyset$.
\item For any dataset $D \subseteq \mathcal{D}$ that has been
decomposed into $m$ partitions such that $D =
\bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset \quad
\forall i,j \leq m, i \neq j$,
\begin{equation*}
F(D, q) = \bigcup_{i=1}^m F(D_i, q)
\end{equation*}
\end{enumerate}
\end{definition}
These two conditions warrant further explanation. The first condition
is simply a redefinition of the standard decomposability criteria to
consider matching the distribution, rather than the exact records in $R$,
as the correctness condition for the merge process. The second condition
handles a necessary property of the underlying search problem being
sampled from. Note that this condition is \emph{stricter} than normal
decomposability for $F$, and essentially requires that the query being
sampled from return a set of records, rather than an aggregate value or
some other result that cannot be meaningfully sampled from. This condition
is satisfied by predicate-filtering style database queries, among others.
With these definitions in mind, we can turn to solving these query sampling
problems. We note that many SSIs have a sampling procedure that naturally
involves two phases: first, some preliminary work is done to determine
metadata concerning the set of records to sample from; then, $k$ samples
are drawn from the structure, taking advantage of this metadata. If we
represent the time cost of the preliminary work with $P(n)$ and the cost of
drawing a sample with $S(n)$, then these structures' query cost functions
are of the form,
\begin{equation*}
\mathscr{Q}(n, k) = P(n) + k S(n)
\end{equation*}
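For example, IRS over a single sorted array fits this form: two binary
searches locate the boundaries of the query range, giving $P(n) \in
\Theta(\log n)$, after which each sample is drawn by selecting a uniformly
random index between those boundaries, giving $S(n) \in \Theta(1)$ and a
total cost of $\Theta(\log n + k)$.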
Consider an arbitrary decomposable sampling query with a cost function
of the above form, $X(\mathscr{I}, F, q, k)$, which draws a sample
of $k$ records from $d \subseteq \mathcal{D}$ using an instance of
an SSI $\mathscr{I} \in \mathcal{I}$. Applying dynamization results
in $d$ being split across $m$ disjoint instances $\mathscr{I}_1, \ldots,
\mathscr{I}_m \in \mathcal{I}$ such that $d = \bigcup_{i=1}^m
\text{unbuild}(\mathscr{I}_i)$ and $\text{unbuild}(\mathscr{I}_i) \cap
\text{unbuild}(\mathscr{I}_j) = \emptyset$ for all $i, j \leq m$ with
$i \neq j$. If we consider a
Bentley-Saxe dynamization of such a structure, the $\mergeop$ operation
would be a $\Theta(k)$ down-sampling. Thus, the total query cost of such
a structure would be,
\begin{equation*}
\Theta\left(\log_2 n \left( P(n) + k S(n) + k\right)\right)
\end{equation*}
This cost function is sub-optimal for two reasons. First, we pay extra
cost to merge the result sets together because of the down-sampling
combination operator. Second, this formulation fails to avoid a per-sample
dependence on $n$, even in the case where $S(n) \in \Theta(1)$. The
situation worsens when considering rejections that may occur as a result
of deleted records. Recall from Section~\ref{ssec:background-deletes} that
deletion can be supported using weak deletes or a shadow structure in a
Bentley-Saxe dynamization. Using either approach, it isn't possible to
avoid deleted records in advance when sampling, and so these records must
be rejected and retried. In the DSP model, each retry must reprocess every
block a second time, because retrying in place within a single block would
bias the result set. We will discuss this further in
Section~\ref{ssec:sampling-deletes}.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{img/sigmod23/sampling}
\caption{\textbf{Overview of the multiple-block query sampling process} for
Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
the shards are determined, then (2) these weights are used to construct an
alias structure. Next, (3) the alias structure is queried $k$ times to
determine per-shard sample sizes, and then (4) sampling is performed.
Finally, (5) any rejected samples are retried starting from the alias
structure, and the process is repeated until the desired number of samples
has been retrieved.}
\label{fig:sample}
\end{figure}
The key insight that allowed us to solve this problem is that there is a
mismatch between the structure of the sampling query process and the
structure assumed by DSPs. Using an SSI to answer a sampling query is
naturally a two-phase process, but DSPs are assumed to be single-phase. We
can construct a more effective query procedure by embracing this
multi-stage structure, summarized in Figure~\ref{fig:sample}.
\begin{enumerate}
\item Determine each block's respective weight under a given
query to be sampled from (e.g., the number of records falling
into the query range for IRS).
\item Build a temporary alias structure over these weights.
\item Query the alias structure $k$ times to determine how many
samples to draw from each block.
\item Draw the appropriate number of samples from each block and
merge them together to form the final query result.
\end{enumerate}
It is possible that some of the records sampled in Step 4 must be
rejected, either because of deletes or some other property of the sampling
procedure being used. If $r$ records are rejected, the above procedure
can be repeated from Step 3, querying the alias structure $r$ times to
assign the replacement samples to blocks, without needing to redo any of
the preprocessing steps. This can be repeated as many times as necessary
until the required $k$ records have been sampled.
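
The following sketch illustrates this procedure. It is a minimal
illustration rather than the framework's actual implementation: the shard
interface (\texttt{weight}, \texttt{sample}) and the \texttt{is\_deleted}
rejection check are assumed for the example, and Python's
\texttt{random.choices} stands in for the temporary alias structure built
in Step 2.
\begin{verbatim}
import random
from collections import Counter

def sample_query(shards, query, k, is_deleted=lambda rec: False):
    """Draw k independent samples from a query over several static shards."""
    # (1) Preliminary work: determine each shard's weight under the query
    #     (e.g., the number of records in the query range for IRS).
    weights = [shard.weight(query) for shard in shards]
    if sum(weights) == 0:
        return []

    results, needed = [], k
    while needed > 0:
        # (2)+(3) Assign the outstanding samples to shards in proportion to
        # their weights; random.choices stands in for the temporary alias
        # structure built over the normalized shard weights.
        counts = Counter(random.choices(range(len(shards)),
                                        weights=weights, k=needed))
        # (4) Draw the assigned number of samples from each shard.
        needed = 0
        for i, shard in enumerate(shards):
            for rec in shard.sample(query, counts[i]):
                if is_deleted(rec):
                    needed += 1  # (5) rejected samples retry from Step 3
                else:
                    results.append(rec)
    return results
\end{verbatim}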
\begin{example}
\label{ex:sample}
Consider executing a WSS query, with $k=1000$, across three blocks
containing integer keys with unit weight. $\mathscr{I}_1$ contains only the
key $-2$, $\mathscr{I}_2$ contains all integers on $[1,100]$, and $\mathscr{I}_3$
contains all integers on $[101, 200]$. These structures are shown
in Figure~\ref{fig:sample}. Sampling is performed by first
determining the normalized weights for each block: $w_1 = 0.005$,
$w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a
block alias structure. The block alias structure is then queried
$k$ times, resulting in a distribution of $k_i$s that is
commensurate with the relative weights of each block. Finally,
each block is queried in turn to draw the appropriate number
of samples.
\end{example}
Assuming a Bentley-Saxe decomposition with $\log_2 n$ blocks and a
constant number of retry rounds, the cost of answering a decomposable
sampling query having a pre-processing cost of $P(n)$ and a per-sample
cost of $S(n)$ will be,
\begin{equation}
\label{eq:dsp-sample-cost}
\boxed{
\mathscr{Q}(n, k) \in \Theta \left( P(n) \log_2 n + k S(n) \right)
}
\end{equation}
where the cost of building the alias structure is $\Theta(\log_2 n)$
and thus absorbed into the pre-processing cost. For the SSIs discussed
in this chapter, which have $S(n) \in \Theta(1)$, this model provides us
with the desired decoupling of the data size ($n$) from the per-sample
cost.
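
For instance, for IRS over a sorted array ($P(n) \in \Theta(\log n)$,
$S(n) \in \Theta(1)$), Equation~\ref{eq:dsp-sample-cost} evaluates to
$\Theta(\log^2 n + k)$, compared to the $\Theta(\log^2 n + k \log n)$ cost
of the naive DSP-based approach described earlier in this section.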
\subsection{Supporting Deletes}
\label{ssec:delete}
Because the shards are static, records cannot be arbitrarily removed from them.
This requires that deletes be supported in some other way, with the ultimate
goal being the prevention of deleted records' appearance in sampling query
result sets. This can be realized in two ways: locating the record and marking
it, or inserting a new record which indicates that an existing record should be
treated as deleted. The framework supports both of these techniques, the
selection of which is called the \emph{delete policy}. The former policy is
called \emph{tagging} and the latter \emph{tombstone}.
Tagging a record is straightforward. Point-lookups are performed against each
shard in the index, as well as the buffer, for the record to be deleted. When
it is found, a bit in a header attached to the record is set. When sampling,
any records selected with this bit set are automatically rejected. Tombstones
represent a lazy strategy for deleting records. When a record is deleted using
tombstones, a new record with identical key and value, but with a ``tombstone''
bit set, is inserted into the index. Whether a sampled record has been
deleted can then be checked with a point-lookup: if a tombstone with the
same key and value exists above the record in the index, the record is
rejected when sampled.
Two important aspects of performance are pertinent when discussing deletes: the
cost of the delete operation, and the cost of verifying the presence of a
sampled record. The choice of delete policy represents a trade-off between
these two costs. Beyond this simple trade-off, the delete policy also has other
implications that can affect its applicability to certain types of SSI. Most
notably, tombstones do not require any in-place updating of records, whereas
tagging does. This means that using tombstones is the only way to ensure total
immutability of the data within shards, which avoids random writes and eases
concurrency control. The tombstone delete policy, then, is particularly
appealing in external and concurrent contexts.
\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is
the same as an ordinary insert. Tagging, by contrast, requires a point-lookup
of the record to be deleted, and so is more expensive. Assuming a point-lookup
operation with cost $L(n)$, a tagged delete must search each level in the
index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$
time.
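
The two delete paths might look as follows. This is a simplified sketch
rather than the framework's API: the \texttt{point\_lookup} helpers on the
buffer and shards, and the record header bits, are assumed for
illustration.
\begin{verbatim}
DELETED   = 1 << 0   # header bit used by the tagging policy
TOMBSTONE = 1 << 1   # header bit marking a tombstone record

def delete_tombstone(index, key, value):
    # Tombstone policy: a delete is simply an insert of a matching record
    # with the tombstone bit set, so it costs the same as an insert.
    index.insert(key, value, flags=TOMBSTONE)

def delete_tagged(index, key, value):
    # Tagging policy: locate the record with point-lookups against the
    # buffer and each shard, then set its delete bit in place.
    for container in [index.buffer] + list(index.shards()):
        rec = container.point_lookup(key, value)
        if rec is not None:
            rec.flags |= DELETED
            return True
    return False   # record not present in the index
\end{verbatim}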
\Paragraph{Rejection Check Costs.} In addition to the cost of the delete
itself, the delete policy affects the cost of determining if a given record has
been deleted. This is called the \emph{rejection check cost}, $R(n)$. When
using tagging, the information necessary to make the rejection decision is
local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones
it is not; a point-lookup must be performed to search for a given record's
corresponding tombstone. This look-up must examine the buffer, and each shard
within the index. This results in a rejection check cost of $R(n) \in O\left(N_b +
L(n) \log_s n\right)$. The rejection check process for the two delete policies is
summarized in Figure~\ref{fig:delete}.
Two factors contribute to the tombstone rejection check cost: the size of the
buffer, and the cost of performing a point-lookup against the shards. The
latter cost can be controlled using the framework's ability to associate
auxiliary structures with shards. For SSIs which do not support efficient
point-lookups, a hash table can be added to map key-value pairs to their
location within the SSI. This allows for constant-time rejection checks, even
in situations where the index would not otherwise support them. However, the
storage cost of this intervention is high, and in situations where the SSI does
support efficient point-lookups, it is not necessary. Further performance
improvements can be achieved by noting that the probability of a given record
having an associated tombstone in any particular shard is relatively small.
This means that many point-lookups will be executed against shards that do not
contain the tombstone being searched for. In this case, these unnecessary
lookups can be partially avoided using Bloom filters~\cite{bloom70} for
tombstones. By inserting tombstones into these filters during reconstruction,
point-lookups against some shards which do not contain the tombstone being
searched for can be bypassed. Filters can be attached to the buffer as well,
which may be even more significant due to the linear cost of scanning it. As
the goal is a reduction of rejection check costs, these filters need only be
populated with tombstones. Later in this section, techniques for bounding the
number of tombstones on a given level are discussed, which will allow for the
memory usage of these filters to be tightly controlled while still ensuring
precise bounds on filter error.
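
A sketch of the two rejection checks is shown below; the
\texttt{shards\_newer\_than}, \texttt{tombstone\_bloom}, and
\texttt{point\_lookup} members are assumed interfaces used for
illustration, not the framework's actual API.
\begin{verbatim}
DELETED = 1 << 0   # tag bit, as in the deletion sketch above

def is_deleted_tagged(rec):
    # Tagging: deletion status is stored with the record itself, so the
    # rejection check is a constant-time bit test.
    return bool(rec.flags & DELETED)

def is_deleted_tombstone(index, rec):
    # Tombstones: reject if a matching tombstone exists in the buffer or
    # in any shard newer than the one the record was sampled from.
    if index.buffer.contains_tombstone(rec.key, rec.value):
        return True
    for shard in index.shards_newer_than(rec.source_shard):
        # Tombstone-only Bloom filter: skip shards that certainly do not
        # contain the tombstone being searched for.
        if not shard.tombstone_bloom.may_contain((rec.key, rec.value)):
            continue
        if shard.point_lookup(rec.key, rec.value, tombstone=True):
            return True
    return False
\end{verbatim}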
\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
alters the analysis of sampling costs. A record that has been deleted cannot
be present in the sample set, and therefore the presence of each sampled record
must be verified. If a record has been deleted, it must be rejected. When
retrying samples rejected due to deletes, the process must restart from shard
selection, as deleted records may still be counted in the weight totals used to
construct the shard alias structure. This increases the cost of sampling to,
\begin{equation}
\label{eq:sampling-cost}
O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \mathbf{Pr}[\text{rejection}]} \cdot \left[S(n) + R(n)\right]\right)
\end{equation}
where $W(n)$ is the cost of determining a shard's total weight under the query,
$R(n)$ is the cost of checking whether a sampled record has been deleted, and
$\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected number of sampling
attempts required to obtain $k$ samples, given a fixed rejection probability.
The rejection probability itself is a function of the workload, and is not
bounded in general.
\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
constitute wasted memory accesses and random number generations, and so steps
should be taken to minimize their frequency. The probability of a rejection is
directly related to the number of deleted records, which is itself a function
of workload and dataset. This means that, without building counter-measures
into the framework, tight bounds on sampling performance cannot be provided in
the presence of deleted records. It is therefore critical that the framework
support some method for bounding the number of deleted records within the
index.
While the static nature of shards prevents the direct removal of records at the
moment they are deleted, it doesn't prevent the removal of records during
reconstruction. When using tagging, all tagged records encountered during
reconstruction can be removed. When using tombstones, however, the removal
process is non-trivial. In principle, a rejection check could be performed for
each record encountered during reconstruction, but this would increase
reconstruction costs and introduce a new problem of tracking tombstones
associated with records that have been removed. Instead, a lazier approach can
be used: delaying removal until a tombstone and its associated record
participate in the same shard reconstruction. This delay allows both the record
and its tombstone to be removed at the same time, an approach called
\emph{tombstone cancellation}. In general, this can be implemented using an
extra linear scan of the input shards before reconstruction to identify
tombstones and associated records for cancellation, but potential optimizations
exist for many SSIs, allowing it to be performed during the reconstruction
itself at no extra cost.
The removal of deleted records passively during reconstruction is not enough to
bound the number of deleted records within the index. It is not difficult to
envision pathological scenarios where deletes result in unbounded rejection
rates, even with this mitigation in place. However, the dropping of deleted
records does provide a useful property: any specific deleted record will
eventually be removed from the index after a finite number of reconstructions.
Using this fact, a bound on the number of deleted records can be enforced. A
new parameter, $\delta$, is defined, representing the maximum proportion of
deleted records within the index. Each level, and the buffer, tracks the number
of deleted records it contains by counting its tagged records or tombstones.
Following each buffer flush, the proportion of deleted records is checked
against $\delta$. If any level is found to exceed it, then a proactive
reconstruction is triggered, pushing its shards down into the next level. The
process is repeated until all levels respect the bound, allowing the number of
deleted records to be precisely controlled, which, by extension, bounds the
rejection rate. This process is called \emph{compaction}.
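
The following sketch shows how such a bound might be enforced after each
buffer flush. The per-level \texttt{record\_count} and
\texttt{delete\_count} accounting and the \texttt{merge\_into\_next\_level}
operation are assumed for illustration.
\begin{verbatim}
def enforce_delete_bound(index, delta):
    # Invoked after every buffer flush: if the proportion of deleted
    # records (tags or tombstones) on any level exceeds delta, proactively
    # push that level's shards down into the next level (compaction).
    i = 0
    while i < len(index.levels):
        level = index.levels[i]
        if level.record_count() > 0 and \
           level.delete_count() / level.record_count() > delta:
            # Reconstruction cancels tombstones and drops tagged records;
            # a new empty level is appended below if none exists.
            index.merge_into_next_level(i)
            continue   # re-check this (now empty) level, then move on
        i += 1
\end{verbatim}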
Assuming every record is equally likely to be sampled, this new bound can be
applied to the analysis of sampling costs. The probability of a record being
rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
Equation~\ref{eq:sampling-cost} yields,
\begin{equation}
%\label{eq:sampling-cost-del}
O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \delta} \cdot \left[S(n) + R(n)\right]\right)
\end{equation}
Asymptotically, this proactive compaction does not alter the analysis of
insertion costs. Each record is still written at most $s$ times on each level,
there are at most $\log_s n$ levels, and the buffer insertion and SSI
construction costs are unchanged, so the amortized insertion cost remains
the same.
This compaction strategy is based upon tombstone and record counts, and the
bounds assume that every record is equally likely to be sampled. For certain
sampling problems (such as WSS), there are other conditions that must be
considered to provide a bound on the rejection rate. To account for these
situations in a general fashion, the framework supports problem-specific
compaction triggers that can be tailored to the SSI being used. These allow
compactions to be triggered based on other properties, such as rejection rate
of a level, weight of deleted records, and the like.
\captionsetup[subfloat]{justification=centering}
\begin{figure*}
\centering
\subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
\subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
\caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
of SSIs and auxiliary structures [A]) using the leveling
(Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
policies. Records are represented as black/colored squares, and grey
squares represent unused capacity. An insertion requiring a multi-level
reconstruction is illustrated.} \label{fig:framework}
\end{figure*}
\subsection{Framework Overview}
Our framework has been designed to work efficiently with any SSI, so long
as it has the following properties.
\begin{enumerate}
\item The underlying full query $Q$ supported by the SSI, from whose
results samples are drawn, satisfies the following property:
for any dataset $D = \cup_{i = 1}^{n}D_i$
where $D_i \cap D_j = \emptyset$, $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
\item \emph{(Optional)} The SSI supports efficient point-lookups.
\item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
returned by the underlying full query.
\end{enumerate}
The first property applies to the query being sampled from, and is essential
for the correctness of sample sets reported by extended sampling
indexes.\footnote{ This condition is stricter than the definition of a
decomposable search problem in the Bentley-Saxe method, which allows for
\emph{any} constant-time merge operation, not just union.
However, this condition is satisfied by many common types of database
query, such as predicate-based filtering queries.} The latter two properties
are optional, but reduce deletion and sampling costs respectively. Should the
SSI fail to support point-lookups, an auxiliary hash table can be attached to
the data structures.
Should it fail to support query result weight reporting, rejection
sampling can be used in place of the more efficient scheme discussed in
Section~\ref{ssec:sample}. The analysis of this framework will generally
assume that all three conditions are satisfied.
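
The shard-level interface the framework relies upon can be summarized as
follows. This is a hypothetical Python rendering used only to make the
required operations concrete; the names and signatures are not the
framework's actual API.
\begin{verbatim}
from abc import ABC, abstractmethod

class ShardInterface(ABC):
    """Operations a wrapped SSI exposes to the dynamization framework."""

    @abstractmethod
    def build(self, records):
        """Construct the SSI over a set of records (cost C_c / C_r)."""

    @abstractmethod
    def weight(self, query):
        """Total weight of records satisfying the query (cost W(n))."""

    @abstractmethod
    def sample(self, query, k):
        """Draw k samples from the query's result set (cost P(n) + k S(n))."""

    def point_lookup(self, key, value):
        """Optional (cost L(n)); an auxiliary hash table can stand in."""
        raise NotImplementedError
\end{verbatim}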
Given an SSI with these properties, a dynamic extension can be produced as
shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
shards containing an instance of the SSI being extended, and optional auxiliary
data structures. The auxiliary structures allow acceleration of certain
operations that are required by the framework, but which the SSI being extended
does not itself support efficiently. Examples of possible auxiliary structures
include hash tables, Bloom filters~\cite{bloom70}, and range
filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
increasing record capacity, with either one shard, or up to a fixed maximum
number of shards, per level. The decision to place one or many shards per level
is called the \emph{layout policy}. The policy names are borrowed from the
literature on the LSM tree, with the former called \emph{leveling} and the
latter called \emph{tiering}.
To avoid a reconstruction on every insert, an unsorted array of fixed capacity
($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
unsorted, it is kept small to maintain reasonably efficient sampling
and point-lookup performance. All updates are performed by appending new
records to the tail of this buffer.
If a record currently within the index is
to be updated to a new value, it must first be deleted, and then a record with
the new value inserted. This ensures that old versions of records are properly
filtered from query results.
When the buffer is full, it is flushed to make room for new records. The
flushing procedure is based on the layout policy in use. When using leveling
(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
$L_0$ and those in the buffer. This is used to create a new shard, which
replaces the one previously in $L_0$. When using tiering
(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
buffer, and placed into $L_0$ without altering the existing shards. Each level
$L_i$ has a record capacity of $N_b \cdot s^{i+1}$, where $s$ is a configurable
parameter called the scale factor. Records are organized in one large
shard under leveling, or in $s$ shards of $N_b \cdot s^i$ capacity each under
tiering. When a level reaches its capacity, it must be emptied to make room for
the records flushed into it. This is accomplished by moving its records down to
the next level of the index. Under leveling, this requires constructing a new
shard containing all records from both the source and target levels, and
placing this shard into the target, leaving the source empty. Under tiering,
the shards in the source level are combined into a single new shard that is
placed into the target level. Should the target be full, it is first emptied by
applying the same procedure. New empty levels
are dynamically added as necessary to accommodate these reconstructions.
Note that shard reconstructions are not necessarily performed using
merging, though merging can be used as an optimization of the reconstruction
procedure where such an algorithm exists. In general, reconstruction requires
only pooling the records of the shards being combined and then applying the SSI's
standard construction algorithm to this set of records.
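
A sketch of the flush and cascade logic under both layout policies is
given below. The \texttt{build\_shard} callable, the level bookkeeping
(\texttt{level\_full}, \texttt{ensure\_level}), and the other helpers are
assumed for illustration only.
\begin{verbatim}
def flush_buffer(index, build_shard):
    # build_shard(records) applies the SSI's standard construction
    # algorithm (or a merge-based optimization, where one exists).
    if index.policy == "leveling":
        flush_leveling(index, build_shard)
    else:
        flush_tiering(index, build_shard)
    index.buffer.clear()

def flush_leveling(index, build_shard):
    # Leveling: one shard per level, rebuilt from the incoming records
    # plus whatever the target level already holds.
    if index.level_full(0):
        cascade_leveling(index, 0, build_shard)
    recs = index.buffer.records() + index.levels[0].records()
    index.levels[0].replace_shard(build_shard(recs))

def cascade_leveling(index, i, build_shard):
    index.ensure_level(i + 1)        # dynamically add a new level if needed
    if index.level_full(i + 1):
        cascade_leveling(index, i + 1, build_shard)
    recs = index.levels[i].records() + index.levels[i + 1].records()
    index.levels[i + 1].replace_shard(build_shard(recs))
    index.levels[i].clear()

def flush_tiering(index, build_shard):
    # Tiering: up to s shards per level; a new shard is built from the
    # buffer alone and appended to L0.
    if index.level_full(0):
        cascade_tiering(index, 0, build_shard)
    index.levels[0].add_shard(build_shard(index.buffer.records()))

def cascade_tiering(index, i, build_shard):
    index.ensure_level(i + 1)        # dynamically add a new level if needed
    if index.level_full(i + 1):
        cascade_tiering(index, i + 1, build_shard)
    index.levels[i + 1].add_shard(build_shard(index.levels[i].records()))
    index.levels[i].clear()
\end{verbatim}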
\begin{table}[t]
\caption{Frequently Used Notation}
\centering
\begin{tabular}{|p{2.5cm} p{5cm}|}
\hline
\textbf{Variable} & \textbf{Description} \\ \hline
$N_b$ & Capacity of the mutable buffer \\ \hline
$s$ & Scale factor \\ \hline
$C_c(n)$ & SSI initial construction cost \\ \hline
$C_r(n)$ & SSI reconstruction cost \\ \hline
$L(n)$ & SSI point-lookup cost \\ \hline
$P(n)$ & SSI sampling pre-processing cost \\ \hline
$S(n)$ & SSI per-sample sampling cost \\ \hline
$W(n)$ & Shard weight determination cost \\ \hline
$R(n)$ & Shard rejection check cost \\ \hline
$\delta$ & Maximum delete proportion \\ \hline
%$\rho$ & Maximum rejection rate \\ \hline
\end{tabular}
\label{tab:nomen}
\end{table}
Table~\ref{tab:nomen} lists frequently used notation for the various parameters
of the framework, which will be used in the coming analysis of the costs and
trade-offs associated with operations within the framework's design space. The
remainder of this section will discuss the performance characteristics of
insertion into this structure (Section~\ref{ssec:insert}), how it can be used
to correctly answer sampling queries (Section~\ref{ssec:sample}), and efficient
approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will
close with a detailed discussion of the trade-offs within the framework's
design space (Section~\ref{ssec:design-space}).
\subsection{Insertion}
\label{ssec:insert}
The framework supports inserting new records by first appending them to the end
of the mutable buffer. When it is full, the buffer is flushed into a sequence
of levels containing shards of increasing capacity, using a procedure
determined by the layout policy as discussed in Section~\ref{sec:framework}.
This method allows for the cost of repeated shard reconstruction to be
effectively amortized.
Let the cost of constructing the SSI from an arbitrary set of $n$ records be
$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
of three parts: appending to the mutable buffer, constructing a new
shard from the buffered records during a flush, and the total cost of
reconstructing shards containing the record over the lifetime of the index. The
cost of appending to the mutable buffer is constant, and the cost of constructing a
shard from the buffer can be amortized across the records participating in the
buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
each record. To derive an expression for the cost of repeated reconstruction,
first note that each record will participate in at most $s$ reconstructions on
a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
$\log_s n$ levels. Thus, over the lifetime of the index a given record
will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
reconstruction.
Combining these results, the total amortized insertion cost is
\begin{equation}
O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
\end{equation}
This can be simplified by noting that $s$ and $N_b$ are constants, with
$N_b \ll n$. Neglecting these terms, the amortized insertion cost of the
framework is,
\begin{equation}
O\left(\frac{C_r(n)}{n}\log_s n\right)
\end{equation}
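
For example, under the (hypothetical) assumption that shard reconstruction
is dominated by a comparison-based sort, so that $C_r(n) \in \Theta(n \log
n)$, this amortized cost works out to $O\left(\log n \log_s n\right)$ per
insert.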
\begin{figure}
\centering
\subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\
\subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}}
\caption{\textbf{Overview of the rejection check procedure for deleted records.} First,
a record is sampled (1).
When using the tombstone delete policy
(Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying
the Bloom filter of the mutable buffer. The filter indicates the record is
not present, so (3) the filter on $L_0$ is queried next. This filter
returns a false positive, so (4) a point-lookup is executed against $L_0$.
The lookup fails to find a tombstone, so the search continues and (5) the
filter on $L_1$ is checked, which reports that the tombstone is present.
This time, it is not a false positive, and so (6) a lookup against $L_1$
(7) locates the tombstone. The record is thus rejected. When using the
tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and
(2) checked directly for the delete tag. It is set, so the record is
immediately rejected.}
\label{fig:delete}
\end{figure}
\subsection{Trade-offs in the Framework Design Space}
\label{ssec:design-space}
The framework has several tunable parameters, allowing it to be tailored for
specific applications. This design space contains trade-offs among three major
performance characteristics: update cost, sampling cost, and auxiliary memory
usage. The two most significant decisions when implementing this framework are
the selection of the layout and delete policies. The asymptotic analysis of the
previous sections obscures some of the differences between these policies, but
they do have significant practical performance implications.
\Paragraph{Layout Policy.} The choice of layout policy represents a clear
trade-off between update and sampling performance. Leveling
results in fewer shards of larger size, whereas tiering results in a larger
number of smaller shards. As a result, leveling reduces the costs associated
with point-lookups and sampling query preprocessing by a constant factor,
compared to tiering. However, it results in more write amplification: a given
record may be involved in up to $s$ reconstructions on a single level, as
opposed to the single reconstruction per level under tiering.
\Paragraph{Delete Policy.} There is a trade-off between delete performance and
sampling performance that exists in the choice of delete policy. Tagging
requires a point-lookup when performing a delete, which is more expensive than
the insert required by tombstones. However, it also allows constant-time
rejection checks, unlike tombstones which require a point-lookup of each
sampled record. In situations where deletes are common and write-throughput is
critical, tombstones may be more useful. Tombstones are also ideal in
situations where immutability is required, or random writes must be avoided.
Generally speaking, however, tagging is superior when using SSIs that support
it, because sampling rejection checks will usually be more common than deletes.
\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
capacity and scale factor both influence the number of levels within the index,
and by extension the number of distinct shards. Sampling and point-lookups have
better performance with fewer shards. Smaller shards are also faster to
reconstruct, although the same adjustments that reduce shard size also result
in a larger number of reconstructions, so the trade-off here is less clear.
The scale factor has an interesting interaction with the layout policy: when
using leveling, the scale factor directly controls the amount of write
amplification per level. Larger scale factors mean more time is spent
reconstructing shards on a level, reducing update performance. Tiering does not
have this problem and should see its update performance benefit directly from a
larger scale factor, as this reduces the number of reconstructions.
The buffer capacity also influences the number of levels, but is more
significant in its effects on point-lookup performance: a lookup must perform a
linear scan of the buffer. Likewise, the unstructured nature of the buffer also
will contribute negatively towards sampling performance, irrespective of which
buffer sampling technique is used. As a result, although a large buffer will
reduce the number of shards, it will also hurt sampling and delete (under
tagging) performance. It is important to minimize the cost of these buffer
scans, and so it is preferable to keep the buffer small, ideally small enough
to fit within the CPU's L2 cache. The number of shards within the index is,
then, better controlled by changing the scale factor, rather than the buffer
capacity. Using a smaller buffer will result in more compactions and shard
reconstructions; however, the empirical evaluation in Section~\ref{ssec:ds-exp}
demonstrates that this is not a serious performance problem when a scale factor
is chosen appropriately. When the shards are in memory, frequent small
reconstructions do not have a significant performance penalty compared to less
frequent, larger ones.
\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
auxiliary data structures allows for memory to be traded in exchange for
insertion or sampling performance. The use of Bloom filters for accelerating
tombstone rejection checks has already been discussed, but many other options
exist. Bloom filters could also be used to accelerate point-lookups for delete
tagging, though such filters would require much more memory than tombstone-only
ones to be effective. An auxiliary hash table could be used to accelerate
point-lookups, and range filters such as SuRF~\cite{zhang18} or
Rosetta~\cite{siqiang20} could be added to accelerate pre-processing for
range-restricted queries such as IRS or WIRS.
|