\chapter{Classical Dynamization Techniques}
\label{chap:background}
This chapter introduces important background information and
existing work in the area of data structure dynamization. We will
first discuss the concept of a search problem, which is central to
dynamization techniques. While one might imagine that the restrictions
on dynamization would be functions of the data structure to be dynamized,
in practice the requirements placed on the data structure are quite mild;
it is the properties required of the search problem that the structure
is used to answer that present the central difficulty in applying
dynamization techniques to a given area. After this, database indices
will be discussed briefly, as they are the primary use of data structures
within the database context that is of interest to our work. Following
this, we discuss existing theoretical results in the area of data
structure dynamization, which serve as the building blocks for our
techniques in subsequent chapters. The chapter concludes with a
discussion of some of the limitations of these existing techniques.
\section{Queries and Search Problems}
\label{sec:dsp}
Data access lies at the core of most database systems. We want to ask
questions of the data, and ideally get the answer efficiently. We
will refer to the different types of question that can be asked as
\emph{search problems}. We will use this term in a similar way to how
the word \emph{query}\footnote{
The term query is often abused and used to
refer to several related, but slightly different, things. In the
vernacular, a query can refer to either a) a general type of search
problem (as in ``range query''), b) a specific instance of a search
problem, or c) a program written in a query language.
}
is often used within the database systems literature: to refer to a
general class of questions. For example, we could consider range scans,
point-lookups, nearest neighbor searches, predicate filtering, random
sampling, etc., to each be a general search problem. Formally, for the
purposes of this work, a search problem is defined as follows,
\begin{definition}[Search Problem]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
$F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
$\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
answer domain.\footnote{
It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
example, a \texttt{COUNT} aggregation might map a set of strings onto
an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
not be a universal constraint.
}
\end{definition}
We will use the term \emph{query} to mean a specific instance of a search
problem,
\begin{definition}[Query]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
instance of the search problem, $F(\mathcal{D}, q)$.
\end{definition}
As an example of using these definitions, a \emph{membership test}
or \emph{range scan} would be considered search problems, and a range
scan over the interval $[10, 99]$ would be a query. We draw this
distinction because, as will become clear when we discuss our work in
later chapters, it is useful to have separate, unambiguous terms for
these two concepts.
\subsection{Decomposable Search Problems}
Dynamization techniques require partitioning one data structure
into several smaller ones. As a result, these techniques can only
be applied in situations where the search problem can be answered
from this set of smaller data structures, yielding the same
answer as would have been obtained had all of the data been used to
construct a single, large structure. This requirement is formalized in
the definition of a class of problems called \emph{decomposable search
problems (DSP)}. This class was first defined by Bentley and Saxe in
their work on dynamization, and we will adopt their definition,
\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
only if there exists a constant-time computable, associative, and
commutative binary operator $\mergeop$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
The requirement for $\mergeop$ to be constant-time was used by Bentley and
Saxe to prove specific performance bounds for answering queries from a
decomposed data structure. However, it is not strictly \emph{necessary},
and later work by Overmars lifted this constraint and considered a more
general class of search problems called \emph{$C(n)$-decomposable search
problems},
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
if and only if there exists an $O(C(n))$-time computable, associative,
and commutative binary operator $\mergeop$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
To demonstrate that a search problem is decomposable, it is necessary to
show the existence of a merge operator, $\mergeop$, with the required
properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
q)$. With these two results, induction demonstrates that the problem is
decomposable even in cases with more than two partial results.
As an example, consider range counts,
\begin{definition}[Range Count]
\label{def:range-count}
Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
$ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
the cardinality, $|d \cap q|$.
\end{definition}
\begin{theorem}
\label{ther:decomp-range-count}
Range Count is a decomposable search problem.
\end{theorem}
\begin{proof}
Let $\mergeop$ be addition ($+$). Applying this to
Definition~\ref{def:dsp} gives
\begin{align*}
|(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
\end{align*}
which is true by the distributive property of union and
intersection. Addition is an associative and commutative
operator that can be calculated in $\Theta(1)$ time. Therefore, range counts
are DSPs.
\end{proof}
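To make the decomposition concrete, the following is a minimal
illustrative sketch (in Python; the function names are hypothetical and
not drawn from any particular implementation) of answering a range count
over a set of disjoint partitions and merging the partial results with
$\mergeop = +$.
\begin{verbatim}
# Illustrative sketch: a range count answered over disjoint partitions,
# with addition as the constant-time merge operator.
def range_count(block, lo, hi):
    # F(block, q): answer the search problem over a single partition.
    return sum(1 for x in block if lo <= x <= hi)

def decomposed_range_count(blocks, lo, hi):
    result = 0
    for block in blocks:
        result += range_count(block, lo, hi)  # result <- result (+) F(block, q)
    return result

# The decomposed answer matches the answer over the union of the partitions.
blocks = [[1.0, 5.0, 9.0], [2.5, 7.5], [4.0, 11.0]]
assert decomposed_range_count(blocks, 3.0, 10.0) == \
       range_count([x for b in blocks for x in b], 3.0, 10.0)
\end{verbatim}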
Because the codomain of a DSP is not restricted, more complex output
structures can be used to allow for problems that are not directly
decomposable to be converted to DSPs, possibly with some minor
post-processing. For example, calculating the arithmetic mean of a set
of numbers can be formulated as a DSP,
\begin{theorem}
The calculation of the arithmetic mean of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
where $\mathcal{D}\subset\mathbb{R}$ is a multi-set. The output tuple
contains the sum of the values within the input set and the
cardinality of the input set. For two disjoint partitions of the data,
$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$, and let
$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$. This operator is
associative, commutative, and computable in constant time.
Applying Definition~\ref{def:dsp} gives
\begin{align*}
A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) = (s_1 + s_2, c_1 + c_2) = (s, c)
\end{align*}
where $s$ and $c$ are the sum and cardinality of $D_1 \cup D_2$.
From this result, the average can be determined in constant time by
taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
of numbers is a DSP.
\end{proof}
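As an illustrative sketch of this construction (the function names below
are hypothetical), each partition reports a $(s, c)$ pair, the pairs are
merged component-wise, and the mean is recovered with constant-time
post-processing:
\begin{verbatim}
# Illustrative sketch: the arithmetic mean as a DSP over (sum, count) pairs.
def local_result(partition):
    return (sum(partition), len(partition))

def merge(r1, r2):                       # the binary merge operator
    return (r1[0] + r2[0], r1[1] + r2[1])

def mean_over_partitions(partitions):
    s, c = 0.0, 0
    for p in partitions:
        s, c = merge((s, c), local_result(p))
    return s / c                         # constant-time post-processing

assert mean_over_partitions([[1, 2], [3], [4, 5, 6]]) == 3.5
\end{verbatim}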
\section{Dynamization for Decomposable Search Problems}
Because data in a database is regularly updated, data structures
intended to be used as an index must support updates (inserts, in-place
modification, and deletes). Not all potentially useful data structures
support updates, and so a general strategy for adding update support
would increase the number of data structures that could be used as
database indices. We refer to a data structure with update support as
\emph{dynamic}, and one without update support as \emph{static}.\footnote{
The term static is distinct from immutable. Static refers to the
layout of records within the data structure, whereas immutable
refers to the data stored within those records. This distinction
will become relevant when we discuss different techniques for adding
delete support to data structures. The data structures used are
always static, but not necessarily immutable, because the records may
contain header information (like visibility) that is updated in place.
}
This section discusses \emph{dynamization}, the construction of a
dynamic data structure based on an existing static one. When certain
conditions are satisfied by the data structure and its associated
search problem, this process can be done automatically, and with
provable asymptotic bounds on amortized insertion performance, as well
as worst case query performance. This is in contrast to the manual
design of dynamic data structures, which involves techniques based on
partially rebuilding small portions of a single data structure (called
\emph{local reconstruction})~\cite{overmars83}. This is a high-cost
intervention that requires significant effort on the part of the data
structure designer, whereas conventional dynamization can be performed
with little-to-no modification of the underlying data structure at all.
It is worth noting that there are a variety of techniques
discussed in the literature for dynamizing structures with specific
properties, or under very specific sets of circumstances. Examples
include frameworks for adding update support to succinct data
structures~\cite{dynamize-succinct} or taking advantage of batching
of insert and query operations~\cite{batched-decomposable}. This
section discusses techniques that are more general, and don't require
workload-specific assumptions.
We will first discuss the necessary data structure requirements, and
then examine several classical dynamization techniques. The section
will conclude with a discussion of delete support within the context
of these techniques. For more detail than is included in this chapter,
Overmars wrote a book providing a comprehensive survey of techniques for
creating dynamic data structures, including not only the dynamization
techniques discussed here, but also local reconstruction based
techniques and more~\cite{overmars83}.\footnote{
Sadly, this book isn't readily available in
digital format as of the time of writing.
}
\subsection{Global Reconstruction}
The most fundamental dynamization technique is that of \emph{global
reconstruction}. While not particularly useful on its own, global
reconstruction serves as the basis for the techniques to follow, and so
we will begin our discussion of dynamization with it.
Consider a class of data structure, $\mathcal{I}$, capable of answering a
search problem, $F$. Insertion via global reconstruction is
possible if $\mathcal{I}$ supports the following two operations,
\begin{align*}
\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
\end{align*}
where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
of the data structure over a set of records $d \subseteq \mathcal{D}$
in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
$\Theta(1)$ time,\footnote{
There isn't any practical reason why $\mathtt{unbuild}$ must run
in constant time, but this is the assumption made in \cite{saxe79}
and in subsequent work based on it, and so we will follow the same
definition here.
} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
Given this structure, an insert of record $r \in \mathcal{D}$ into a
data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
\begin{align*}
\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
\end{align*}
This operation is clearly sub-optimal, as the
insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
most data structures. However, this global reconstruction strategy can
be used as a primitive for more sophisticated techniques that provide
reasonable performance.
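The following is a minimal sketch of this scheme, assuming only the
$\mathtt{build}$ and $\mathtt{unbuild}$ interface described above; the
wrapper class and its names are illustrative rather than part of any
existing implementation.
\begin{verbatim}
# Illustrative sketch: insertion via global reconstruction. Every insert
# unbuilds the current structure and rebuilds it with the new record,
# at Theta(B(n)) cost per insert.
class GlobalReconstruction:
    def __init__(self, build, unbuild, records=()):
        self._build, self._unbuild = build, unbuild
        self._struct = build(list(records))

    def insert(self, r):
        records = self._unbuild(self._struct)
        self._struct = self._build(records + [r])

# Usage with a trivial "static" structure: a sorted array.
d = GlobalReconstruction(build=sorted, unbuild=list, records=[5, 1, 9])
d.insert(3)
assert d._struct == [1, 3, 5, 9]
\end{verbatim}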
\subsection{Amortized Global Reconstruction}
\label{ssec:agr}
The problem with global reconstruction is that each insert must rebuild
the entire data structure, involving all of its records. This results
in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
for improving this scheme can present themselves when considering the
\emph{amortized} insertion cost.
Consider the cost accrued by the dynamized structure under global
reconstruction over the lifetime of the structure. Each insert will result
in all of the existing records being rewritten, so at worst each record
will be involved in $\Theta(n)$ reconstructions, each reconstruction
having $\Theta(B(n))$ cost. We can amortize this cost over the $n$ records
inserted to get an amortized insertion cost for global reconstruction of,
\begin{equation*}
I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
\end{equation*}
This doesn't improve things as is; however, it does present two
opportunities for improvement. If we could either reduce the size of
the reconstructions, or the number of times a record is reconstructed,
then we could reduce the amortized insertion cost.
The key insight, first discussed by Bentley and Saxe, is that
both of these goals can be accomplished by \emph{decomposing} the
data structure into multiple, smaller structures, each built from
a disjoint partition of the data. As long as the search problem
being considered is decomposable, queries can be answered from
this structure with bounded worst-case overhead, and the amortized
insertion cost can be improved~\cite{saxe79}. Significant theoretical
work exists in evaluating different strategies for decomposing the
data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
specific efficiencies of the data structures being considered to improve
these reconstructions~\cite{merge-dsp}.
There are two general decomposition techniques that emerged from
this work. The earliest of these is the logarithmic method, often
called the Bentley-Saxe method in modern literature, and is the most
commonly discussed technique today. The Bentley-Saxe method has been
directly applied in a few instances in the literature, such as to
metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
and has also been used in a modified form for genetic sequence search
structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
examples.
A later technique, the equal block method, was also developed. It is
generally not as effective as the Bentley-Saxe method, and as a result we
have not identified any specific applications of this technique outside
of the theoretical literature. However, we will discuss it as well in
the interest of completeness, and because it does lend itself well to
demonstrating certain properties of decomposition-based dynamization
techniques.
\subsection{Equal Block Method}
\label{ssec:ebm}
Though chronologically later, the equal block method is theoretically a
bit simpler, and so we will begin our discussion of decomposition-based
techniques for dynamizing decomposable search problems with it. There
have been several proposed variations of this concept~\cite{maurer79,
maurer80}, but we will focus on the most developed form as described by
Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
concept of the equal block method is to decompose the data structure
into several smaller data structures, called blocks, over partitions
of the data. This decomposition is performed such that each block is of
roughly equal size.
Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
some decomposable search problem, $F$, and is built over a set of records
$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
makes little sense when the number of records changes, and so it is taken
to be governed by a smooth, monotonically increasing function $f(n)$ such
that, at any point, the following two constraints are obeyed.
\begin{align}
f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
\forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
\end{align}
where $|\mathscr{I}_j|$ is the number of records in the block,
$|\text{unbuild}(\mathscr{I}_j)|$.
A new record is inserted by finding the smallest block and rebuilding it
using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
then an insert is done by,
\begin{equation*}
\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
\end{equation*}
Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
violated by deletes. We're omitting deletes from the discussion at
this point, but will circle back to them in Section~\ref{sec:deletes}.
} In this case, the constraints are enforced by ``re-configuring'' the
structure. $s$ is updated to be exactly $f(n)$, all of the existing
blocks are unbuilt, and then the records are redistributed evenly into
$s$ blocks.
A query with parameters $q$ is answered by this structure by individually
querying the blocks, and merging the local results together with $\mergeop$,
\begin{equation*}
F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
\end{equation*}
where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
answering the query over $d$ using the data structure $\mathscr{I}$.
This technique provides better amortized performance bounds than global
reconstruction, at the possible cost of worse query performance for
sub-linear queries. We'll omit the details of the proof of performance
for brevity and streamline some of the original notation (full details
can be found in~\cite{overmars83}), but this technique ultimately
results in a data structure with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
\end{align*}
where $B(n)$ is the cost of statically building $\mathcal{I}$, and
$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
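As an illustrative example of the method, the sketch below (which assumes
$f(n) = \lceil\sqrt{n}\rceil$ and an arbitrary $\mathtt{build}$ function;
the class and its names are hypothetical) inserts into the smallest block
and re-configures whenever Constraint~\ref{ebm-c1} is violated.
\begin{verbatim}
# Illustrative sketch of the equal block method with f(n) = sqrt(n).
import math

def f(n):
    return max(1, math.isqrt(n))

class EqualBlocks:
    def __init__(self, build):
        self._build = build
        self._blocks = [[]]                  # partitions of the records
        self._structs = [build([])]

    def _reconfigure(self, n):
        records = [r for b in self._blocks for r in b]
        s = f(n)                             # set s to exactly f(n)
        self._blocks = [records[i::s] for i in range(s)]
        self._structs = [self._build(b) for b in self._blocks]

    def insert(self, r):
        # Rebuild the smallest block with the new record included.
        k = min(range(len(self._blocks)), key=lambda j: len(self._blocks[j]))
        self._blocks[k].append(r)
        self._structs[k] = self._build(self._blocks[k])
        n = sum(len(b) for b in self._blocks)
        if not (f(n // 2) <= len(self._blocks) <= f(2 * n)):
            self._reconfigure(n)             # restore the size constraints
\end{verbatim}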
\subsection{The Bentley-Saxe Method}
\label{ssec:bsm}
The original, and most frequently used, dynamization technique is the
Bentley-Saxe Method (BSM), also called the logarithmic method in older
literature. Rather than breaking the data structure into equally sized
blocks, BSM decomposes the structure into logarithmically many blocks
of exponentially increasing size. More specifically, the data structure
is decomposed into $h = \lceil \log_2 (n+1) \rceil$ blocks, $\mathscr{I}_0,
\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
will be either empty, or contain exactly $2^i$ records within it.
The procedure for inserting a record, $r \in \mathcal{D}$, into
a BSM dynamization is as follows. If the block $\mathscr{I}_0$
is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
empty, then there will exist a maximal sequence of non-empty blocks
$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
end of the structure as needed.
%FIXME: switch the x's to r's for consistency
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
\caption{An illustration of inserts into the Bentley-Saxe Method}
\label{fig:bsm-example}
\end{figure}
Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
dynamization is built over a set of records $x_1, x_2, \ldots,
x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
first empty block is $\mathscr{I}_2$, and so the insert is performed by
doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
This technique is called a \emph{binary decomposition} of the data
structure. Considering a BSM dynamization of a structure containing $n$
records, labeling each block with a $0$ if it is empty and a $1$ if it
is full will result in the binary representation of $n$. For example,
the final state of the structure in Figure~\ref{fig:bsm-example} contains
$12$ records, and the labeling procedure will result in $0\text{b}1100$,
which is $12$ in binary. Inserts affect this representation of the
structure in the same way that incrementing the binary number by $1$ does.
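A minimal sketch of this insertion procedure is given below, again
assuming only a $\mathtt{build}$ function for the underlying static
structure; the class, its fields, and the generic query helper are
illustrative. The list of blocks behaves exactly like a binary counter,
with \texttt{None} standing in for an empty block.
\begin{verbatim}
# Illustrative sketch of Bentley-Saxe insertion (binary decomposition).
class BentleySaxe:
    def __init__(self, build):
        self._build = build
        self._records = []   # _records[i]: None, or the 2**i records of block i
        self._structs = []   # _structs[i]: None, or build(_records[i])

    def insert(self, r):
        carry = [r]          # {r} plus the unbuilt records of lower blocks
        i = 0
        while i < len(self._records) and self._records[i] is not None:
            carry += self._records[i]              # unbuild block i
            self._records[i] = self._structs[i] = None
            i += 1
        if i == len(self._records):                # add a new, empty level
            self._records.append(None)
            self._structs.append(None)
        self._records[i] = carry                   # |carry| == 2**i
        self._structs[i] = self._build(carry)

    def query(self, local_query, merge, identity):
        # Answer a decomposable search problem by merging per-block results.
        result = identity
        for s in self._structs:
            if s is not None:
                result = merge(result, local_query(s))
        return result
\end{verbatim}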
By applying BSM to a data structure, a dynamized structure can be created
with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
\text{Worst Case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right)
\end{align*}
This is a particularly attractive result because, for example, a data
structure having $B(n) \in \Theta(n)$ will have an amortized insertion
cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for
this is an extra logarithmic factor attached to the query complexity. It
is also worth noting that the worst-case insertion cost remains the same
as for global reconstruction, but this case arises only rarely: in terms
of the binary decomposition representation, the worst-case behavior is
triggered each time the existing number overflows and a new digit must
be added.
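To sketch where the amortized bound comes from (under the standard
assumption that $B(n)/n$ is non-decreasing): a record only ever moves
``upward'' into larger blocks, and so participates in at most $\lceil
\log_2 n \rceil$ reconstructions over the lifetime of the structure. A
reconstruction of a block containing $m$ records costs $B(m)$, or
$B(m)/m \leq B(n)/n$ per record involved, and so the total work
attributable to any single record is at most
\begin{equation*}
\frac{B(n)}{n} \cdot \lceil \log_2 n \rceil
\end{equation*}
which is the amortized insertion cost stated above.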
As a final note about the query performance of this structure, because
the overhead due to querying the blocks is logarithmic, under certain
circumstances this cost can be absorbed, resulting in no effect on the
asymptotic worst-case query performance. As an example, consider a linear
scan of the data running in $\Theta(n)$ time. In this case, every record
must be considered, and so there isn't any performance penalty\footnote{
From an asymptotic perspective. There will still be measurable performance
effects from caching, etc., even in this case.
} to breaking the records out into multiple chunks and scanning them
individually. More formally, for any query running in $\mathscr{Q}(n) \in
\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
cost of answering a decomposable search problem from a BSM dynamization
is $\Theta\left(\mathscr{Q}(n)\right)$~\cite{saxe79}.
\subsection{The Mixed Method}
\subsection{Merge Decomposable Search Problems}
\subsection{Delete Support}
\label{ssec:dyn-deletes}
Classical dynamization techniques have also been developed with
support for deleting records. In general, the same technique of global
reconstruction that was used for inserting records can also be used to
delete them. Given a record $r \in \mathcal{D}$ and a data structure
$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
deleted from the structure in $\Theta(B(n))$ time as follows,
\begin{equation*}
\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
\end{equation*}
However, supporting deletes within the dynamization schemes discussed
above is more complicated. The core problem is that inserts affect the
dynamized structure in a deterministic way, which allows the partitioning
scheme to be leveraged to reason about performance. Deletes do not work
like this.
\begin{figure}
\caption{A Bentley-Saxe dynamization for the integers on the
interval $[1, 100]$.}
\label{fig:bsm-delete-example}
\end{figure}
For example, consider a Bentley-Saxe dynamization that contains all
integers on the interval $[1, 100]$, inserted in that order, shown in
Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
records from this structure, one at a time, using global reconstruction.
This presents several problems,
\begin{itemize}
\item For each record, we need to identify which block it is in before
we can delete it.
\item The cost of performing a delete is a function of which block the
record is in, which is a question of distribution and not easily
controlled.
\item As records are deleted, the structure will potentially violate
the invariants of the decomposition scheme used, which will
require additional work to fix.
\end{itemize}
To resolve these difficulties, two very different approaches have been
proposed for supporting deletes, each of which relies on certain properties
of the search problem and data structure. These are the use of a ghost
structure and weak deletes.
\subsubsection{Ghost Structure for Invertible Search Problems}
The first proposed mechanism for supporting deletes was discussed
alongside the Bentley-Saxe method in Bentley and Saxe's original
paper. This technique applies to a class of search problems called
\emph{invertible} (also called \emph{decomposable counting problems}
in later literature~\cite{overmars83}). Invertible search problems
are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
that is able to remove records from the result set. More formally,
\begin{definition}[Invertible Search Problem~\cite{saxe79}]
\label{def:invert}
A decomposable search problem, $F$ is invertible if and only if there
exists a constant time computable operator, $\Delta$, such that
\begin{equation*}
F(A \setminus B, q) = F(A, q)~\Delta~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $B \subseteq A$.
\end{definition}
Given a search problem with this property, it is possible to perform
deletes by creating a secondary ``ghost'' structure. When a record
is to be deleted, it is inserted into this structure. Then, when the
dynamization is queried, this ghost structure is queried as well as the
main one. The results from the ghost structure can be removed from the
result set using the inverse merge operator. This simulates the result
that would have been obtained had the records been physically removed
from the main structure.
Two examples of invertible search problems are set membership
and range count. Range count was formally defined in
Definition~\ref{def:range-count}.
\begin{theorem}
Range count is an invertible search problem.
\end{theorem}
\begin{proof}
To prove that range count is an invertible search problem, it must be
decomposable and have a $\Delta$ operator. That it is a DSP has already
been proven in Theorem~\ref{ther:decomp-range-count}.
Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
gives,
\begin{equation*}
|(A \setminus B) \cap q | = |(A \cap q) \setminus (B \cap q)| = |(A \cap q)| - |(B \cap q)|
\end{equation*}
which is true by the distributive property of set difference over
intersection, and because $B \cap q \subseteq A \cap q$ whenever $B
\subseteq A$. Subtraction is computable in constant time, therefore
range count is an invertible search problem using subtraction as $\Delta$.
\end{proof}
The set membership search problem is defined as follows,
\begin{definition}[Set Membership]
\label{def:set-membership}
Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
and a single element $r \in \mathcal{D}$. A test of set membership is a
search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
\not\in d$ and $1$ if $r \in d$.
\end{definition}
\begin{theorem}
Set membership is an invertible search problem.
\end{theorem}
\begin{proof}
To prove that set membership is invertible, it is necessary to establish
that it is a decomposable search problem, and that a $\Delta$ operator
exists. We'll begin with the former.
\begin{lemma}
\label{lem:set-memb-dsp}
Set membership is a decomposable search problem.
\end{lemma}
\begin{proof}
Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
\begin{align*}
F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
r \in (A \cup B) &= (r \in A) \lor (r \in B)
\end{align*}
which is true, following directly from the definition of union. The
logical disjunction is an associative, commutative operator that can
be calculated in $\Theta(1)$ time. Therefore, set membership is a
decomposable search problem.
\end{proof}
For the inverse merge operator, $\Delta$, it is necessary that $F(A,
r) ~\Delta~F(B, r)$ be true if and only if $r \in A$ and $r \not\in
B$. Thus, it can be directly implemented as $F(A, r)~\Delta~F(B, r) =
F(A, r) \land \neg F(B, r)$, which is computable in constant time when
the operands are already known.
Thus, we have shown that set membership is a decomposable search problem,
and that a constant time $\Delta$ operator exists. Therefore, it is an
invertible search problem.
\end{proof}
For search problems such as these, this technique allows for deletes to be
supported with the same cost as an insert. Unfortunately, it suffers from
write amplification because each deleted record is recorded twice: once in
the main structure, and once in the ghost structure. This means that $n$
is, in effect, the total number of records and deletes. This can lead
to some serious problems. For example, if every record in a structure
of $n$ records is deleted, the net result will be an ``empty'' dynamized
data structure containing $2n$ physical records within it. To circumvent
this problem, Bentley and Saxe proposed a mechanism of setting a maximum
threshold for the size of the ghost structure relative to the main one,
and performing a complete re-partitioning of the data once this threshold
is reached, removing all deleted records from the main structure,
emptying the ghost structure, and rebuilding blocks with the records
that remain according to the invariants of the technique.
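The following sketch illustrates this mechanism for the range count
problem, reusing the illustrative \texttt{BentleySaxe} class from
Section~\ref{ssec:bsm} for both the main and ghost structures and
assuming that $\mathtt{build}$ produces an iterable structure (e.g., a
sorted array); the names are hypothetical and the re-partitioning
threshold is omitted.
\begin{verbatim}
# Illustrative sketch: deletes via a ghost structure for range count,
# using subtraction as the inverse merge operator.
class GhostDeletes:
    def __init__(self, build):
        self._main = BentleySaxe(build)    # all inserted records
        self._ghost = BentleySaxe(build)   # all deleted records

    def insert(self, r):
        self._main.insert(r)

    def delete(self, r):
        self._ghost.insert(r)              # deletes are inserts into the ghost

    def range_count(self, lo, hi):
        local = lambda block: sum(1 for x in block if lo <= x <= hi)
        add = lambda a, b: a + b
        return (self._main.query(local, add, 0)
                - self._ghost.query(local, add, 0))  # F(main) delta F(ghost)
\end{verbatim}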
\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
Another approach for supporting deletes was proposed later, by Overmars
and van Leeuwen, for a class of search problem called \emph{deletion
decomposable}. These are decomposable search problems for which the
underlying data structure supports a delete operation. More formally,
\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
\label{def:background-ddsp}
A decomposable search problem, $F$, and its data structure,
$\mathcal{I}$, is deletion decomposable if and only if, for some
instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
there exists a deletion routine $\mathtt{delete}(\mathscr{I},
r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
increasing the query time, deletion time, or storage requirement,
for $\mathscr{I}$.
\end{definition}
Superficially, this doesn't appear very useful. If the underlying data
structure already supports deletes, there isn't much reason to use a
dynamization technique to add deletes to it. However, one point worth
mentioning is that it is possible, in many cases, to easily \emph{add}
delete support to a static structure. If it is possible to locate a
record and somehow mark it as deleted, without removing it from the
structure, and then efficiently ignore these records while querying,
then the given structure and its search problem can be said to be
deletion decomposable. This technique for deleting records is called
\emph{weak deletes}.
\begin{definition}[Weak Deletes~\cite{overmars81}]
\label{def:weak-delete}
A data structure is said to support weak deletes if it provides a
routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
where $\mathscr{Q}(n)$ is the cost of answering the query against a
structure upon which no weak deletes were performed.\footnote{
This paper also provides a similar definition for weak updates,
but these aren't of interest to us in this work, and so the above
definition was adapted from the original with the weak update
constraints removed.
} The results of the query of a block containing weakly deleted records
should be the same as the results would be against a block with those
records removed.
\end{definition}
As an example of a deletion decomposable search problem, consider the set
membership problem considered above (Definition~\ref{def:set-membership})
where $\mathcal{I}$, the data structure used to answer queries of the
search problem, is a hash map.\footnote{
While most hash maps are already dynamic, and so wouldn't need
dynamization to be applied, there do exist static ones too. For example,
the hash map being considered could be implemented using perfect
hashing~\cite{perfect-hashing}, which has many static implementations.
}
\begin{theorem}
The set membership problem, answered using a static hash map, is
deletion decomposable.
\end{theorem}
\begin{proof}
We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
is a decomposable search problem. For it to be deletion decomposable,
we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
records without hurting its query performance, delete performance, or
storage requirements. Assume that an instance $\mathscr{I} \in
\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
Such a structure can support weak deletes. Each record within the
structure has a single bit attached to it, indicating whether it has
been deleted or not. These bits will require $\Theta(n)$ storage and
be initialized to 0 when the structure is constructed. A delete can
be performed by querying the structure for the record to be deleted in
$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
operation has $D(n) \in \Theta(1)$ cost.
\begin{lemma}
\label{lem:weak-deletes}
The delete procedure as described above satisfies the requirements of
Definition~\ref{def:weak-delete} for weak deletes.
\end{lemma}
\begin{proof}
Per Definition~\ref{def:weak-delete}, there must exist some constant
dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
bounded by $k_\alpha \cdot \mathscr{Q}(n)$.
In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
query cost must be bounded by $\Theta(k_\alpha)$. When a query is
executed against $\mathscr{I}$, there are three possible cases,
\begin{enumerate}
\item The record being searched for does not exist in $\mathscr{I}$. In
this case, the query result is 0.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of 0. In this case, the query result is 1.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of 1 (i.e., it has been deleted). In this case, the
query result is 0.
\end{enumerate}
In all three cases, the addition of deletes requires only $\Theta(1)$
extra work at most. Therefore, set membership over a static hash map
using our proposed deletion mechanism satisfies the requirements for
weak deletes, with $k_\alpha = 1$.
\end{proof}
Finally, we note that the cost of one of these weak deletes is $D(n)
\in \Theta(\mathscr{Q}(n)) = \Theta(1)$, and by Lemma~\ref{lem:weak-deletes}
neither the query cost nor the delete cost is asymptotically harmed by
the presence of deleted records.
Thus, we've shown that set membership using a static hash map is a
decomposable search problem, the storage cost remains $\Omega(n)$, and the
query and delete costs are unaffected by the presence of deletes using the
proposed mechanism. All of the requirements of deletion decomposability
are satisfied, therefore set membership using a static hash map is a
deletion decomposable search problem.
\end{proof}
For such problems, deletes can be supported by first identifying the
block in the dynamization containing the record to be deleted, and
then calling $\mathtt{delete}$ on it. In order to allow this block to
be easily located, it is possible to maintain a hash table over all
of the records, alongside the dynamization, which maps each record
onto the block containing it. This table must be kept up to date as
reconstructions occur, but this can be done at no extra asymptotic cost
for any data structure having $B(n) \in \Omega(n)$, as updating the table requires only
linear time. This allows for deletes to be performed in $\mathscr{D}(n)
\in \Theta(D(n))$ time.
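The sketch below illustrates this arrangement for the set membership
example: each block is a static table augmented with delete bits, and a
separate hash table maps each live record to the block that contains
it. The classes and names are illustrative only, and the handling of
reconstructions is elided.
\begin{verbatim}
# Illustrative sketch: weak deletes for set membership, plus a locator
# table mapping each record to its containing block.
class WeakDeleteBlock:
    def __init__(self, records):
        # Stand-in for a static (e.g., perfect-hashed) table; built once.
        self._deleted = {r: False for r in records}

    def contains(self, r):
        return r in self._deleted and not self._deleted[r]

    def weak_delete(self, r):
        if r in self._deleted:
            self._deleted[r] = True        # mark, but do not restructure

class DynamizedMembership:
    def __init__(self):
        self._blocks = []
        self._locator = {}                 # record -> index of its block

    def add_block(self, records):
        # Called whenever a reconstruction produces a new block; updating
        # the locator costs only linear time per build.
        self._blocks.append(WeakDeleteBlock(records))
        for r in records:
            self._locator[r] = len(self._blocks) - 1

    def delete(self, r):
        if r in self._locator:
            self._blocks[self._locator[r]].weak_delete(r)
            del self._locator[r]

    def contains(self, r):
        # Decomposable query: merge per-block results with logical OR.
        return any(b.contains(r) for b in self._blocks)
\end{verbatim}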
The presence of deleted records within the structure does introduce a
new problem, however. Over time, the number of records in each block will
drift away from the requirements imposed by the dynamization technique. It
will eventually become necessary to re-partition the records to restore
these invariants, which are necessary for bounding the number of blocks,
and thereby the query performance. The particular invariant maintenance
rules depend upon the decomposition scheme used.
\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
Block $i=0$ will only ever have one record, so no special maintenance must be
done for it. A delete will simply empty it completely.
},
in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
delete occurs in block $i$, no special action is taken until the number
of records in that block falls below $2^{i-2}$. Once this threshold is
reached, a reconstruction can be performed to restore the appropriate
record counts in each block~\cite{merge-dsp}.
\Paragraph{Equal Block Method.} For the equal block method, there are
two cases in which a delete may cause a block to fail to obey the method's
size invariants,
\begin{enumerate}
\item If enough records are deleted, it is possible for the number
of blocks to exceed $f(2n)$, violating Constraint~\ref{ebm-c1}.
\item The deletion of records may cause the maximum allowed size of each
block to shrink, causing some blocks to exceed the maximum capacity
of $\nicefrac{2n}{s}$. This is a violation of Constraint~\ref{ebm-c2}.
\end{enumerate}
In both cases, it should be noted that $n$ is decreased as records are
deleted. Should either of these cases emerge as a result of a delete,
the entire structure must be reconfigured to ensure that its invariants
are maintained. This reconfiguration follows the same procedure as when
an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
existing blocks are unbuilt, and then the records are evenly redistributed
into the $s$ blocks~\cite{overmars-art-of-dyn}.
\subsection{Worst-Case Optimal Techniques}
\label{ssec:bsm-worst-optimal}
\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}
While fairly general, these dynamization techniques have a number of
limitations that prevent them from being directly usable as a general
solution to the problem of creating database indices. Because of the
requirement that the query being answered be decomposable, many search
problems cannot be addressed--or at least not efficiently addressed--by
decomposition-based dynamization. The techniques also do nothing to reduce
the worst-case insertion cost, resulting in extremely poor tail latency
performance relative to hand-built dynamic structures. Finally, these
approaches do not do a good job of exposing the underlying configuration
space to the user, meaning that the user can exert only limited control over the
performance of the dynamized data structure. This section will discuss
these limitations, and the rest of the document will be dedicated to
proposing solutions to them.
\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}
Unfortunately, the DSP abstraction used as the basis of classical
dynamization techniques has a few significant limitations that restrict
their applicability,
\begin{itemize}
\item The query must be broadcast identically to each block and cannot
be adjusted based on the state of the other blocks.
\item The query process is done in one pass--it cannot be repeated.
\item The result merge operation must be $O(1)$ to maintain good query
performance.
\item The result merge operation must be commutative and associative,
and is called repeatedly to merge pairs of results.
\end{itemize}
These requirements restrict the types of queries that can be supported by
the method efficiently. For example, k-nearest neighbor and independent
range sampling are not decomposable.
\subsubsection{k-Nearest Neighbor}
\label{sssec-decomp-limits-knn}
The k-nearest neighbor (k-NN) problem is a generalization of the nearest
neighbor problem, which seeks to return the closest point within the
dataset to a given query point. More formally, this can be defined as,
\begin{definition}[Nearest Neighbor]
Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
between two points within $D$. The nearest neighbor problem, $NN(D,
q)$, returns a point $d^* \in D$ such that $f(d^*, q) = \min_{d \in D}
f(d, q)$ for a given query point, $q \in \mathbb{R}^d$.
\end{definition}
In practice, it is common to require $f(x, y)$ be a metric,\footnote
{
Contrary to its vernacular usage as a synonym for ``distance'', a
metric is more formally defined as a valid distance function over
a metric space. Metric spaces require their distance functions to
have the following properties,
\begin{itemize}
\item The distance between a point and itself is always 0.
\item All distances between non-equal points must be positive.
\item For all points, $x, y \in D$, it is true that
$f(x, y) = f(y, x)$.
\item For any three points $x, y, z \in D$ it is true that
$f(x, z) \leq f(x, y) + f(y, z)$.
\end{itemize}
These distances also must have the interpretation that $f(x, y) <
f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
is the opposite of the definition of similarity, and so some minor
manipulations are usually required to make similarity measures work
in metric-based indexes~\cite{intro-analysis}.
}
and this will be done in the examples of indices for addressing
this problem in this work, but it is not a fundamental aspect of the problem
formulation. The nearest neighbor problem itself is decomposable,
with a simple merge operator that returns whichever of its two inputs
has the smaller value of $f(x, q)$~\cite{saxe79}.
The k-nearest neighbor problem generalizes nearest-neighbor to return
the $k$ nearest elements,
\begin{definition}[k-Nearest Neighbor]
Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
between two points within $D$. The k-nearest neighbor problem,
$KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
\end{definition}
This can be thought of as solving the nearest-neighbor problem $k$ times,
each time removing the returned result from $D$ prior to solving the
problem again. Unlike the single nearest-neighbor case (which can be
thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
\begin{theorem}
k-NN is not a decomposable search problem.
\end{theorem}
\begin{proof}
To prove this, consider the query $KNN(D, q, k)$ against some partitioned
dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
then there must exist some constant-time, commutative, and associative
binary operator $\mergeop$, such that $R = \bigmergeop_{0 \leq i \leq \ell}
R_i$, where $R_i$ is the result of evaluating the query $KNN(D_i, q,
k)$. Consider the evaluation of the merge operator against two arbitrary
result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
|R_j| = k$, and that the contents of $R$ must be the $k$ records from
$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
time. Therefore, k-NN is not a decomposable search problem.
\end{proof}
With that said, it is clear that there isn't any fundamental restriction
preventing the merging of the result sets; it is only the case that an
arbitrary performance requirement wouldn't be satisfied. It is possible
to merge the result sets in non-constant time, and so it is the case
that k-NN is $C(n)$-decomposable. Unfortunately, this classification
brings with it a reduction in query performance as a result of the way
result merges are performed.
As a concrete example of these costs, consider using the Bentley-Saxe
method to extend the VPTree~\cite{vptree}. The VPTree is a static,
metric index capable of answering k-NN queries in $O(k \log n)$
time. One possible merge algorithm for k-NN would be to push all
of the elements in the two arguments onto a min-heap, and then pop off
the first $k$. In this case, the cost of the merge operation would be
$C(k) \in O(k \log k)$. Were $k$ assumed to be constant, the operation
could be considered constant-time. But given that $k$ is bounded
above only by $n$, this isn't a safe assumption to make in
general. Evaluating the total query cost for the dynamized structure
yields,
\begin{equation}
KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
\end{equation}
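A sketch of the heap-based merge described above is shown below (the
helper name and the use of \texttt{heapq.nsmallest} are illustrative);
its cost is $O(k \log k)$ rather than the $O(1)$ required for strict
decomposability.
\begin{verbatim}
# Illustrative sketch: merging two partial k-NN result sets with a heap.
import heapq

def knn_merge(r1, r2, q, k, dist):
    # heapq.nsmallest runs in O(m log k) time for m = |r1| + |r2| <= 2k
    # candidates, i.e. O(k log k) -- not the constant-time merge that
    # strict decomposability requires.
    return heapq.nsmallest(k, r1 + r2, key=lambda x: dist(x, q))

# Example with points on the real line and absolute distance.
r1, r2 = [1.0, 4.0, 9.0], [2.0, 6.0, 7.0]
print(knn_merge(r1, r2, q=5.0, k=3, dist=lambda x, q: abs(x - q)))
# -> [4.0, 6.0, 7.0]
\end{verbatim}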
The reason for this large increase in cost is the repeated application
of the merge operator. The Bentley-Saxe method requires applying the
merge operator in a binary fashion to each partial result, multiplying
its cost by a factor of $\log n$. Thus, the constant-time requirement
of standard decomposability is necessary to keep the cost of the merge
operator from appearing within the complexity bound of the entire
operation in the general case.\footnote {
There is a special case, noted by Overmars, where the total cost is
$O(\mathscr{Q}(n) + C(n))$, without the logarithmic term, when $(\mathscr{Q}(n) + C(n))
\in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
case where the cost of the query and merge operation are sufficiently
large to consume the logarithmic factor, and so it doesn't represent
a special case with better performance.
}
If we could revise the result merging operation to remove this duplicated
cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
queries.
\subsubsection{Independent Range Sampling}
\label{ssec:background-irs}
Another problem that is not decomposable is independent sampling. There
are a variety of problems falling under this umbrella, including weighted
set sampling, simple random sampling, and weighted independent range
sampling, but we will focus on independent range sampling here.
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
interval $q = [x, y]$ and an integer $k$, an independent range
sampling query returns $k$ independent samples from $D \cap q$
with each point having equal probability of being sampled.
\end{definition}
This problem immediately encounters a category error when considering
whether it is decomposable: the result set is randomized, whereas
the conditions for decomposability are defined in terms of an exact
matching of records in result sets. To work around this, a slight abuse
of definition is in order: assume that the equality conditions within
the DSP definition can be interpreted to mean ``the contents in the two
sets are drawn from the same distribution''. This enables the category
of DSP to apply to this type of problem.
Even with this abuse, however, IRS cannot generally be considered
decomposable; it is at best $C(n)$-decomposable. The reason for this is
that matching the distribution requires drawing the appropriate number
of samples from each partition of the data. Even in the special
case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
from each partition that must appear in the result set cannot be known
in advance due to differences in the selectivity of the predicate across
the partitions.
\begin{example}[IRS Sampling Difficulties]
Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
\{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
partitions have the same size, it seems sensible to evenly distribute
the samples across them ($4$ samples from each partition). Applying
the query predicate to the partitions results in the following:
$d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4, 4\}$.
In expectation, then, the first result set will contain $R_0 = \{3,
3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
probability of a $4$. The second and third result sets can only
be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
together, we'd find that the probability distribution of the sample
would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
the same sampling operation over the full dataset (not partitioned),
the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
\end{example}
The problem is that the number of samples drawn from each partition needs to be
weighted based on the number of elements satisfying the query predicate in that
partition. In the above example, by drawing $4$ samples from $D_1$, more weight
is given to $3$ than exists within the base dataset. This can be worked around
by sampling a full $k$ records from each partition, returning both the sample
and the number of records satisfying the predicate as that partition's query
result, and then performing another pass of IRS as the merge operator, but this
is the same approach as was used for k-NN above. This leaves IRS firmly in the
$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
samples to draw from each partition, then a constant-time merge operation could
be used.
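The sketch below illustrates this $C(n)$-decomposable workaround (the
function names are hypothetical): each partition returns $k$ samples
along with the number of its records that satisfy the predicate, and the
merge step re-samples using those counts as weights, recovering the
correct overall distribution.
\begin{verbatim}
# Illustrative sketch: IRS as a C(n)-decomposable problem. Each partition
# returns (sample, matching-record count); the merge re-samples with the
# counts as weights.
import random

def local_irs(partition, lo, hi, k):
    matching = [x for x in partition if lo <= x <= hi]
    sample = random.choices(matching, k=k) if matching else []
    return sample, len(matching)

def merge_irs(results, k):
    samples = [s for s, c in results if c > 0]
    weights = [c for s, c in results if c > 0]
    out = []
    for _ in range(k):
        i = random.choices(range(len(samples)), weights=weights)[0]
        out.append(random.choice(samples[i]))
    return out

partitions = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
results = [local_irs(p, 3, 4, k=12) for p in partitions]
print(merge_irs(results, k=12))   # ~25% 3s and ~75% 4s in expectation
\end{verbatim}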
\subsection{Configurability}
\subsection{Insertion Tail Latency}
\label{ssec:bsm-tail-latency-problem}
\section{Conclusion}
This chapter discussed the necessary background information pertaining to
queries and search problems, indices, and dynamization techniques. It
described the potential for using custom indices to accelerate particular
kinds of queries, as well as the challenges associated with constructing these
indices. The remainder of this document will seek to address these challenges
through modification and extension of the Bentley-Saxe method, describing work
that has already been completed, as well as the additional work that must be
done to realize this vision.