\chapter{Classical Dynamization Techniques}
\label{chap:background}
This chapter will introduce important background information and existing
work in the area of data structure dynamization. We will first discuss the
core concepts of search problems and data structures, which are central
to dynamization techniques. While one might imagine that restrictions
on dynamization would be functions of the data structure to be dynamized,
in practice the requirements placed on the data structure are quite mild.
Instead, the central difficulties to applying dynamization lie in the
necessary properties of the search problem that the data structure is
intended to solve. Following this, existing theoretical results in the
area of data structure dynamization will be discussed, which will serve
as the building blocks for our techniques in subsequent chapters. The
chapter will conclude with a discussion of some of the limitations of
these existing techniques.
\section{Background}
Before discussing dynamization itself, there are a few important
definitions to dispose of. In this section, we'll discuss some relevant
background information on search problems and data structures, which will
form the foundation of our discussion of dynamization itself.
\subsection{Queries and Search Problems}
\label{sec:dsp}
Data access lies at the core of most database systems. We want to ask
questions of the data, and ideally get the answer efficiently. We
will refer to the different types of question that can be asked as
\emph{search problems}. We will be using this term in a similar way as
the word \emph{query} \footnote{
The term query is often abused and used to
refer to several related, but slightly different things. In the
vernacular, a query can refer to either a) a general type of search
problem (as in "range query"), b) a specific instance of a search
problem, or c) a program written in a query language.
}
is often used within the database systems literature: to refer to a
general class of questions. For example, we could consider range scans,
point-lookups, nearest neighbor searches, predicate filtering, random
sampling, etc., to each be a general search problem. Formally, for the
purposes of this work, a search problem is defined as follows,
\begin{definition}[Search Problem]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
$F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
$\mathcal{Q}$ represents the domain of search parameters, and $\mathcal{R}$ represents the
domain of possible answers.\footnote{
It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
example, a \texttt{COUNT} aggregation might map a set of strings onto
an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
not be a universal constraint.
}
\end{definition}
We will use the term \emph{query} to mean a specific instance of a search
problem, with a fixed set of search parameters,
\begin{definition}[Query]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
instance of the search problem, $F(\mathcal{D}, q)$.
\end{definition}
As an example of these definitions, a \emph{membership test} or a
\emph{range scan} would be considered a search problem, and a range
scan over the interval $[10, 99]$ would be a query. We've drawn this
distinction because, as we'll see in later chapters, it is useful
to have separate, unambiguous
terms for these two concepts.
\subsection{Data Structures}
Answering a search problem over an unordered set of data is not terribly
efficient, and so usually the data is organized into a particular
layout to facilitate more efficient answers. Such layouts are called
\emph{data structures}. Examples of data structures include B-trees,
hash tables, sorted arrays, etc. A data structure can be thought of as
a \emph{solution} to a search problem~\cite{saxe79}.
The symbol $\mathcal{I}$ indicates a data structure, and the symbol
$\mathscr{I} \in \mathcal{I}$ represents an instance of a data structure
built over a particular set of data, $d \subseteq \mathcal{D}$. We will use
two abuses of notation pertaining to data structures throughout this
work. First, $F(\mathscr{I}, q)$ will be used to indicate a query with
search parameters $q$ over the data set $d$, answered using the data
structure $\mathscr{I}$. Second, we will use $|\mathscr{I}|$ to indicate
the number of records within $\mathscr{I}$ (which is equivalent to $|d|$,
where $d$ is the set of records that $\mathscr{I}$ has been built over).
We broadly classify data structures into three types, based upon
the operations supported by the structure: static, half-dynamic, and
full-dynamic. Static data structures do not support updates, half-dynamic
structures support inserts, and full-dynamic structures support both inserts and deletes.
Note that we will use the unqualified term \emph{dynamic} to refer to
both half-dynamic and full-dynamic structures when the distinction isn't
relevant. Additionally, the term \emph{native dynamic} will be used to
indicate a data structure that has been custom-built with support for
inserts and/or deletes without the need for dynamization. These categories
are not all-inclusive, as there are a number of data structures which
do not fit the classification, but such structures are outside of the
scope of this work.
\begin{definition}[Static Data Structure~\cite{dsp}]
\label{def:static-ds}
A static data structure does not support updates of any kind, but can
be constructed from a data set and answer queries. Additionally, we
require that the static data structure provide the ability to reproduce
the set of records that was used to construct it. Specifically, static
data structures must support the following three operations,
\begin{itemize}
\item $\mathbftt{query}: \left(\mathcal{I}, \mathcal{Q}\right) \to \mathcal{R}$ \\
$\mathbftt{query}(\mathscr{I}, q)$ answers the query
$F(\mathscr{I}, q)$ and returns the result. This operation runs
in $\mathscr{Q}_S(n)$ time in the worst-case and \emph{cannot alter
the state of $\mathscr{I}$}.
\item $\mathbftt{build}:\left(\mathcal{PS}(\mathcal{D})\right) \to \mathcal{I}$ \\
$\mathbftt{build}(d)$ constructs a new instance of $\mathcal{I}$
using the records in set $d$. This operation runs in $B(n)$ time in
the worst case.
\item $\mathbftt{unbuild}: \left(\mathcal{I}\right) \to \mathcal{PS}(\mathcal{D})$ \\
$\mathbftt{unbuild}(\mathscr{I})$ recovers the set of records, $d$
used to construct $\mathscr{I}$. The literature on dynamization
generally assumes that this operation runs in $\Theta(1)$
time~\cite{saxe79}, and we will adopt the same assumption in our
analysis.
\end{itemize}
\end{definition}
Note that the term static is distinct from immutable. Static refers
to the layout of records within the data structure, whereas immutable
refers to the data stored within those records. This distinction will
become relevant when we discuss different techniques for adding delete
support to data structures. The data structures used are always static,
but not necessarily immutable, because the records may contain header
information (like visibility) that is updated in place.
\begin{definition}[Half-dynamic Data Structure~\cite{overmars-art-of-dyn}]
\label{def:half-dynamic-ds}
A half-dynamic data structure requires the three operations of a static
data structure, as well as the ability to efficiently insert new data into
a structure built over an existing data set, $d$.
\begin{itemize}
\item $\mathbftt{insert}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\
$\mathbftt{insert}(\mathscr{I}, r)$ returns a data structure,
$\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime,
q) = F(d \cup r, q)$, for some $r \in \mathcal{D}$. This operation
runs in $I(n)$ time in the worst-case.
\end{itemize}
\end{definition}
The important aspect of insertion in this model is that the effect of
the new record on the query result is observed, not necessarily that
the result is a structure exactly identical to the one that would be
obtained by building a new structure over $d \cup r$. Also, though the
formalism used implies a functional operation where the original data
structure is unmodified, this is not actually a requirement. $\mathscr{I}$
could be slightly modified in place, and returned as $\mathscr{I}^\prime$,
as is conventionally done with native dynamic data structures.
\begin{definition}[Full-dynamic Data Structure~\cite{overmars-art-of-dyn}]
\label{def:full-dynamic-ds}
A full-dynamic data structure is a half-dynamic structure that also
has support for deleting records from the dataset.
\begin{itemize}
\item $\mathbftt{delete}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\
$\mathbftt{delete}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$,
such that $\mathbftt{query}(\mathscr{I}^\prime,
q) = F(d - r, q)$, for some $r \in \mathcal{D}$. This operation
runs in $D(n)$ time in the worst-case.
\end{itemize}
\end{definition}
As with insertion, the important aspect of deletion is that the effect
of $r$ on the results of queries answered using $\mathscr{I}^\prime$
has been removed, not necessarily that the record is physically
removed from the structure. A full-dynamic data structure also
supports in-place modification of an existing record. In order
to update a record $r$ to $r^\prime$, it is sufficient to perform
$\mathbftt{insert}\left(\mathbftt{delete}\left(\mathscr{I}, r\right),
r^\prime\right)$.
There are data structures that do not fit into this classification
scheme. For example, approximate structures like Bloom
filters~\cite{bloom70} or various sketches and summaries, do not retain
full information about the records that have been used to construct them.
Such structures cannot support \texttt{unbuild} as a result. Some other
data structures cannot be statically queried--the act of querying them
mutates their state. This is the case for structures like heaps, stacks,
and queues, for example.
\section{Decomposition-based Dynamization}
\emph{Dynamization} is the process of transforming a static data structure
into a dynamic one. When certain conditions are satisfied by the data
structure and its associated search problem, this process can be done
automatically, and with provable asymptotic bounds on amortized insertion
performance, as well as worst-case query performance. This automatic
approach is in contrast with the design of a native dynamic data
structure, which involves altering the data structure itself to natively
support updates. This process usually involves implementing techniques
that partially rebuild small portions of the structure to accommodate new
records, which is called \emph{local reconstruction}~\cite{overmars83}.
This is a very manual intervention that requires significant effort on the
part of the data structure designer, whereas conventional dynamization
can be performed with little-to-no modification of the underlying data
structure at all.
It is worth noting that there are a variety of techniques
discussed in the literature for dynamizing structures with specific
properties, or under very specific sets of circumstances. Examples
include frameworks for adding update support to succinct data
structures~\cite{dynamize-succinct} or taking advantage of batching
of insert and query operations~\cite{batched-decomposable}. This
section discusses techniques that are more general, and don't require
workload-specific assumptions. For more detail than is included in
this section, Overmars wrote a book providing a comprehensive survey of
techniques for creating dynamic data structures, including not only the
dynamization techniques discussed here, but also local reconstruction
based techniques and more~\cite{overmars83}.\footnote{
Sadly, this book isn't readily available in
digital format as of the time of writing.
}
\subsection{Global Reconstruction}
The most fundamental dynamization technique is that of \emph{global
reconstruction}. While not particularly useful on its own, global
reconstruction serves as the basis for the techniques to follow, and so
we will begin our discussion of dynamization with it.
Consider some search problem, $F$, for which we have a static solution,
$\mathcal{I}$. Given the operations supported by static structures, it
is possible to insert a new record, $r \in \mathcal{D}$, into an instance
$\mathscr{I} \in \mathcal{I}$ as follows,
\begin{equation*}
\mathbftt{insert}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) \cup \{r\})
\end{equation*}
Likewise, a record can be deleted using,
\begin{equation*}
\mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\})
\end{equation*}
It goes without saying that this operation is sub-optimal, as the
insertion and deletion costs are both $\Theta(B(n))$, and $B(n)
\in \Omega(n)$ at best for most data structures. However, this global
reconstruction strategy can be used as a primitive for more sophisticated
techniques that can provide reasonable performance.
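To make this primitive concrete, the following sketch (in Python) implements
insertion and deletion via global reconstruction. The \texttt{SortedArray}
class is a hypothetical stand-in for any static structure providing the
$\mathbftt{build}$, $\mathbftt{unbuild}$, and $\mathbftt{query}$ operations of
Definition~\ref{def:static-ds}; it is an illustration of the idea, not an
implementation drawn from the literature.
\begin{verbatim}
# A minimal sketch of dynamization via global reconstruction.
from bisect import bisect_left, bisect_right

class SortedArray:
    """A static structure: built once, never modified in place."""
    def __init__(self, records):
        self.records = sorted(records)   # build: O(n log n)

    def unbuild(self):
        return list(self.records)        # recover the record set

    def query(self, lo, hi):
        """Range count over the interval [lo, hi]."""
        return bisect_right(self.records, hi) - bisect_left(self.records, lo)

def insert(struct, r):
    # Rebuild the entire structure with the new record added.
    return SortedArray(struct.unbuild() + [r])

def delete(struct, r):
    # Rebuild the entire structure with the record removed
    # (assumes r is present).
    records = struct.unbuild()
    records.remove(r)
    return SortedArray(records)
\end{verbatim}
Every update in this sketch pays the full $B(n)$ construction cost; the
decomposition schemes discussed below exist precisely to avoid this.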
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/decomp.pdf}
\caption{\textbf{Data Structure Decomposition.} A single large data
structure containing $n$ records can be decomposed into multiple instances
of the same data structure, each built over a disjoint partition of
the data. In this case, the decomposition is performed such that each
block contains $\sqrt{n}$ records. As a result, rather than an insert
requiring $B(n)$ time, it will require $B(\sqrt{n})$ time.}
\label{fig:bg-decomp}
\end{figure}
The problem with global reconstruction is that each insert or delete
must rebuild the entire data structure, involving all of its records. The
key insight, first discussed by Bentley and Saxe~\cite{saxe79}, is that
the cost associated with global reconstruction can be reduced by
\emph{decomposing} the data structure into multiple,
smaller structures, each built from a disjoint partition of the data.
These smaller structures are called \emph{blocks}. It is possible to
devise decomposition schemes that result in asymptotic improvements
of insertion performance when compared to global reconstruction alone.
\begin{example}[Data Structure Decomposition]
Consider a data structure that can be constructed in $B(n) \in \Theta
(n \log n)$ time with $|\mathscr{I}| = n$. Inserting a new record into
this structure using global reconstruction will require $I(n) \in \Theta
(n \log n)$ time. However, if the data structure is decomposed into
blocks, such that each block contains $\Theta(\sqrt{n})$ records, as shown
in Figure~\ref{fig:bg-decomp}, then only a single block must be reconstructed
to accommodate the insert, requiring $I(n) \in \Theta(\sqrt{n} \log \sqrt{n})$ time.
\end{example}
Much of the existing work on dynamization has considered different
approaches to decomposing data structures, and the effects that these
approaches have on insertion and query performance. However, before we can
discuss these approaches, we must first address the problem of answering
search problems over these decomposed structures.
\subsection{Decomposable Search Problems}
Not all search problems can be correctly answered over a decomposed data
structure, and this problem introduces one of the major limitations
of traditional dynamization techniques: The answer to the search
problem from the decomposition should be the same as would have been
obtained had all of the data been stored in a single data structure. This
requirement is formalized in the definition of a class of problems called
\emph{decomposable search problems} (DSP). This class was first defined
by Jon Bentley,
\begin{definition}[Decomposable Search Problem~\cite{dsp}]
\label{def:dsp}
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
only if there exists a constant-time computable, associative, and
commutative binary operator $\mergeop$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
The requirement for $\mergeop$ to be constant-time was used by Bentley
to prove specific performance bounds for answering queries from a
decomposed data structure. However, it is not strictly \emph{necessary},
and later work by Overmars lifted this constraint and considered a
more general class of search problems called \emph{$C(n)$-decomposable
search problems},
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
if and only if there exists an $O(C(n))$-time computable, associative,
and commutative binary operator $\mergeop$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
\Paragraph{Examples.} To demonstrate that a search problem is
decomposable, it is necessary to show the existence of the merge operator,
$\mergeop$, with the necessary properties, and to show that $F(A \cup
B, q) = F(A, q)~ \mergeop ~F(B, q)$. With these two results, induction
demonstrates that the problem is decomposable even in cases with more
than two partial results.
As an example, consider the range counting problem, which seeks to
identify the number of elements in a set of 1-dimensional points that
fall onto a specified interval,
\begin{definition}[Range Count]
\label{def:range-count}
Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
$ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
the cardinality, $|d \cap q|$.
\end{definition}
\begin{theorem}
\label{ther:decomp-range-count}
Range Count is a decomposable search problem.
\end{theorem}
\begin{proof}
Let $\mergeop$ be addition ($+$). Applying this to
Definition~\ref{def:dsp} gives
\begin{align*}
|(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
\end{align*}
which holds because intersection distributes over union and because, since
$A$ and $B$ are disjoint, the sets $A \cap q$ and $B \cap q$ are also
disjoint and their cardinalities sum. Addition is an associative and
commutative operator that can be calculated in $\Theta(1)$ time. Therefore,
range count is a decomposable search problem.
\end{proof}
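As a concrete illustration of Theorem~\ref{ther:decomp-range-count}, take $A =
\{1, 5, 9\}$, $B = \{3, 12\}$, and $q = [2, 10]$: then $F(A, q) = 2$, $F(B, q)
= 1$, and $F(A \cup B, q) = |\{3, 5, 9\}| = 3 = 2 + 1$.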
Because the codomain of a DSP is not restricted, more complex output
structures can be used to allow for problems that are not directly
decomposable to be converted to DSPs, possibly with some minor
post-processing. For example, calculating the arithmetic mean of a set
of numbers can be formulated as a DSP,
\begin{theorem}
The calculation of the arithmetic mean of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple
contains the sum of the values within the input set, and the
cardinality of the input set. For two disjoint partitions of the data,
$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
Applying Definition~\ref{def:dsp} gives
\begin{align*}
A(D_1 \cup D_2) &= A(D_1)~\mergeop~A(D_2) \\
(s, c) &= (s_1 + s_2, c_1 + c_2)
\end{align*}
where $s$ and $c$ are the sum and cardinality of $D_1 \cup D_2$. Because both
the sum and the cardinality are additive over disjoint multi-sets, $s = s_1 +
s_2$ and $c = c_1 + c_2$, so the equality holds. The operator $\mergeop$ is
associative, commutative, and computable in $\Theta(1)$ time.
From this result, the average can be determined in constant time by
taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
of numbers is a DSP.
\end{proof}
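As an informal companion to this proof, the short Python sketch below
expresses the same $(\text{sum}, \text{count})$ formulation directly; the
function names and example data are ours and purely illustrative.
\begin{verbatim}
# Sketch: arithmetic mean expressed as a decomposable search problem.
# Each block reports a (sum, count) pair; pairs are merged with +, and
# the mean is recovered in a constant-time post-processing step.
def local_result(block):
    return (sum(block), len(block))

def merge(r1, r2):
    return (r1[0] + r2[0], r1[1] + r2[1])

def mean_over_blocks(blocks):
    s, c = 0, 0
    for block in blocks:
        s, c = merge((s, c), local_result(block))
    return s / c

# The mean over two disjoint partitions of {1, ..., 5}.
assert mean_over_blocks([[1, 2, 3], [4, 5]]) == 3.0
\end{verbatim}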
\Paragraph{Answering Queries for DSPs.} Queries for a decomposable
search problem can be answered over a decomposed structure by
individually querying each block, and then merging the results together
using $\mergeop$. In many cases, this process will introduce some
overhead in the query cost. Given a decomposed data structure $\mathscr{I}
= \{\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_m\}$,
a query for a $C(n)$-decomposable search problem can be answered using,
\begin{equation*}
\mathbftt{query}\left(\mathscr{I}, q\right) \triangleq \bigmergeop_{i=1}^{m} F(\mathscr{I}_i, q)
\end{equation*}
which requires,
\begin{equation*}
\mathscr{Q}(n) \in O \left( m \cdot \left(\mathscr{Q}_S(n_\text{max}) + C(n)\right)\right)
\end{equation*}
time, where $m$ is the number of blocks and $n_\text{max}$ is the size of the
largest block. Note that $C(n)$ is multiplied by $m$ in this
expression--this is a large part of the reason why $C(n)$-decomposability
is not particularly desirable compared to standard decomposability,
where $C(n) \in \Theta(1)$ and thus falls out of the cost function.
This is only an upper bound; it is occasionally possible to do
better. Under certain circumstances, the costs of querying multiple
blocks can be absorbed, resulting in no worst-case overhead, at least
asymptotically. As an example, consider a linear scan of the data running
in $\Theta(n)$ time. In this case, every record must be considered,
and so there isn't any performance penalty\footnote{
From an asymptotic perspective. There will still be measurable
performance effects from caching, etc., even in this case.
} to breaking the records out into multiple chunks and scanning them
individually. More formally, for any query running in $\mathscr{Q}_S(n) \in
\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
cost of answering a decomposable search problem from a decomposed
structure is $\Theta\left(\mathscr{Q}_S(n)\right)$.~\cite{saxe79}
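This query procedure is mechanical enough to state in code. The Python sketch
below assumes that each block can be queried independently and that the search
problem supplies both a merge operator and an identity element for the fold;
all of the names here are illustrative.
\begin{verbatim}
from functools import reduce

def query_decomposed(blocks, params, local_query, merge_op, identity):
    """Answer a decomposable search problem over a decomposed structure
    by querying every block and merging the partial results."""
    partials = (local_query(b, params) for b in blocks if b is not None)
    return reduce(merge_op, partials, identity)

# Example: range counting, reusing addition as the merge operator.
blocks = [[1, 5, 9], [3, 12]]
count = query_decomposed(
    blocks, (2, 10),
    local_query=lambda b, q: sum(q[0] <= x <= q[1] for x in b),
    merge_op=lambda a, b: a + b,
    identity=0)
assert count == 3
\end{verbatim}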
\section{Decomposition-based Dynamization for Half-dynamic Structures}
The previous discussion reveals the basic tension that exists
within decomposition-based techniques: larger block sizes result
in worse insertion performance and better query performance. Query
performance is improved by reducing the number of blocks, but this is
concomitant with making the blocks larger, harming insertion performance.
The literature on decomposition-based dynamization techniques discusses
different approaches for performing the decomposition to balance these two
competing interests, as well as various additional properties of search
problems and structures that can be leveraged for better performance. In
this section, we will discuss these topics in the context of creating
half-dynamic data structures, and the next section will discuss similar
considerations for full-dynamic structures.
Of the decomposition techniques, we will focus on the three most important
from a practical standpoint.\footnote{
There are, in effect, two main methods for decomposition:
decomposing based on some counting scheme (logarithmic and
$k$-binomial)~\cite{saxe79} or decomposing into equally sized blocks
(equal block method)~\cite{overmars-art-of-dyn}. Other, more complex,
methods do exist, but they are largely compositions of these two
simpler ones. These composed decompositions (heh) are of largely
theoretical interest, as they are sufficiently complex to be of
questionable practical utility.~\cite{overmars83}
} The earliest of these is the logarithmic method, often called the
Bentley-Saxe method in modern literature, and is the most commonly
discussed technique today. The logarithmic method has been directly
applied in a few instances in the literature, such as to metric indexing
structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
and has also been used in a modified form for genetic sequence search
structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite
a few examples. Bentley and Saxe also proposed a second approach, the
$k$-binomial method, that slightly alters the exact decomposition approach
used by the logarithmic method to allow for flexibility in whether the
performance of inserts or queries should be favored. A later technique,
the equal block method, was developed later, and likewise seeks to introduce
a mechanism for performance tuning. Of the three, the logarithmic method
is the most generally effective, and we have not identified any specific
applications of either $k$-binomial decomposition or the equal block method
outside of the theoretical literature.
\subsection{The Logarithmic Method}
\label{ssec:bsm}
The original, and most frequently used, decomposition technique is the
logarithmic method, also called Bentley-Saxe method (BSM) in more recent
literature. This technique decomposes the structure into logarithmically
many blocks of exponentially increasing size. More specifically, the
data structure is decomposed into $h = \lceil \log_2 n \rceil$ blocks,
$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_h$. A given block
$\mathscr{I}_i$ will be either empty, or contain exactly $2^{i-1}$ records
within it.
The procedure for inserting a record, $r \in \mathcal{D}$, into
a logarithmic decomposition is as follows. If the block $\mathscr{I}_1$
is empty, then $\mathscr{I}_1 = \mathbftt{build}(\{r\})$. If it is not
empty, then there will exist a maximal sequence of non-empty blocks
$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_i$ for some $i \geq
1$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
$\mathscr{I}_{i+1}$ is set to $\mathbftt{build}(\{r\} \cup \bigcup_{l=1}^i
\mathbftt{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_1$ through
$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
end of the structure as needed.
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
\caption{\textbf{Insertion in the Logarithmic Method.} A logarithmic
decomposition of some data structure initially containing records $r_1,
r_2, \ldots, r_{10}$ is shown, along with the insertion procedure. First,
the new record $r_{11}$ is inserted. The first empty block is at $i=1$, and
so the new record is simply placed there. Next, $r_{12}$ is inserted. For
this insert, the first empty block is at $i=3$, requiring the blocks $1$
and $2$ to be merged, along with the $r_{12}$, to create the new block.
}
\label{fig:bsm-example}
\end{figure}
Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
dynamization is built over a set of records $r_1, r_2, \ldots,
r_{10}$ initially, with eight records in $\mathscr{I}_4$ and two in
$\mathscr{I}_2$. The first new record, $r_{11}$, is inserted directly
into $\mathscr{I}_1$. For the next insert, $r_{12}$, the
first empty block is $\mathscr{I}_3$, and so the insert is performed by
doing $\mathscr{I}_3 = \mathbftt{build}\left(\{r_{12}\} \cup
\mathbftt{unbuild}(\mathscr{I}_1) \cup \mathbftt{unbuild}(\mathscr{I}_2)\right)$
and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$.
This technique is called a \emph{binary decomposition} of the data
structure. Considering a logarithmic decomposition of a structure
containing $n$ records, labeling each block with a $0$ if it is empty and
a $1$ if it is full will result in the binary representation of $n$. For
example, the final state of the structure in Figure~\ref{fig:bsm-example}
contains $12$ records, and the labeling procedure will result
in $0\text{b}1100$, which is $12$ in binary. Inserts affect this
representation of the structure in the same way that incrementing the
binary number by $1$ does.
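The Python sketch below implements this insertion procedure; it uses a plain
sorted list as a stand-in for the static structure (via \texttt{build} and
\texttt{unbuild} helpers) and is an illustration of the scheme rather than a
reference implementation. The blocks are 0-indexed in the code, so index $i$
corresponds to $\mathscr{I}_{i+1}$ in the text.
\begin{verbatim}
def build(records):
    return sorted(records)        # stand-in for a B(n) static construction

def unbuild(block):
    return list(block)            # recover the records of a block

def bsm_insert(blocks, r):
    """Insert r into a logarithmic (Bentley-Saxe) decomposition, where
    blocks[i] is either None or a block of exactly 2^i records."""
    spill = [r]
    for i, block in enumerate(blocks):
        if block is None:
            blocks[i] = build(spill)    # first empty slot: rebuild here
            return blocks
        spill.extend(unbuild(block))    # carry this block's records upward
        blocks[i] = None
    blocks.append(build(spill))         # every slot full: new largest block
    return blocks

blocks = []
for r in range(1, 13):
    blocks = bsm_insert(blocks, r)
# With 12 = 0b1100 records, only the two largest blocks are occupied.
assert [b is not None for b in blocks] == [False, False, True, True]
\end{verbatim}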
By applying this method to a data structure, a dynamized structure can
be created with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
\text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\
\text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(\log_2 n\cdot \mathscr{Q}_S\left(n\right)\right) \\
\end{align*}
This is a particularly attractive result because, for example, a data
structure having $B(n) \in \Theta(n)$ will have an amortized insertion
cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this
is an extra logarithmic multiple attached to the query complexity. It is
also worth noting that the worst-case insertion cost remains the same
as global reconstruction, but this case arises only very rarely. In
terms of the binary decomposition representation, the worst-case
behavior is triggered each time the existing number overflows and a
new digit must be added.
\subsection{The $k$-Binomial Method}
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/kbin.pdf}
\caption{\textbf{Insertion in the $k$-Binomial Method.} A $k$-binomial
decomposition of some data structure initially containing the records
$r_1 \ldots r_{18}$ with $k=3$, along with the values of the $D$ array.
When the record $r_{19}$ is inserted, $D[1]$ is incremented and, as it
is not equal to $D[2]$, no carrying is necessary. Thus, the new record is
simply placed in the structure at $i=1$. However, inserting the next record
will result in $D[1] = D[2]$ after the increment. Once this increment is
shifted, it will also be the case that $D[2] = D[3]$. As a result, the
entire structure is compacted into a single block.
}
\label{fig:dyn-kbin}
\end{figure}
One of the significant limitations of the logarithmic method is that it
is incredibly rigid. In our earlier discussion of decomposition we noted
that there exists a clear trade-off between insert and query performance
for half-dynamic structures, mediated by the number of blocks into which
the structure is decomposed. However, the logarithmic method does not
allow any navigation of this trade-off. In their original paper on the
topic, Bentley and Saxe proposed a different decomposition scheme that
does expose this trade-off, which they called the $k$-binomial
transform.~\cite{saxe79}
In this transform, rather than decomposing the data structure based on
powers of two, the structure is decomposed based on a sum of $k$ binomial
coefficients. This decomposition results in exactly $k$ blocks in the
structure. For example, with $k=3$, the number 17 can be represented as,
\begin{align*}
17 &= {5 \choose 3} + {4 \choose 2} + {1 \choose 1} \\
&= 10 + 6 + 1
\end{align*}
and thus the decomposed structure will contain three blocks, one with
$10$ records, one with $6$, and another with $1$.
More generally, a structure of $n$ elements is decomposed based on the
following sum of binomial coefficients,
\begin{equation*}
n = \sum_{i=1}^{k} {D[i] \choose i}
\end{equation*}
where $D$ is an array of $k+1$ integers, such that $D[i] > D[i-1]$
for all $i \leq k$, and $D[k+1] = \infty$. When a record is inserted,
$D[1]$ is incremented by one, and then this increment is ``shifted'' up
the array until no two adjacent entries are equal: for each index $i$,
if $D[i]$ is equal to $D[i+1]$, then $D[i+1]$ is incremented, $D[i]$ is
decremented, and $i$ itself is incremented. This is guaranteed to terminate
because the last element of the array, $D[k+1]$, is taken to be infinite.
\begin{algorithm}
\caption{Insertion into a $k$-Binomial Decomposition}
\label{alg:dyn-binomial-insert}
\KwIn{$r$: the record to be inserted, $\mathscr{I} = \{ \mathscr{I}_1 \ldots \mathscr{I}_k\}$: a decomposed structure, $D$: the array of binomial coefficients}
$D[1] \gets D[1] + 1$ \;
$S \gets \{r\} \cup \mathbftt{unbuild}(\mathscr{I}_1)$ \;
$\mathscr{I}_1 \gets \emptyset$ \;
\BlankLine
$i \gets 1$ \;
\While{$D[i] = D[i+1]$} {
	$D[i+1] \gets D[i+1] + 1$ \;
	$S \gets S \cup \mathbftt{unbuild}(\mathscr{I}_{i+1})$ \;
	\BlankLine
	$D[i] \gets D[i] - 1$\;
	$\mathscr{I}_{i+1} \gets \emptyset$\;
	\BlankLine
	$i \gets i + 1$\;
}
\BlankLine
$\mathscr{I}_i \gets \mathbftt{build}(S)$ \;
\Return $\mathscr{I}$
\end{algorithm}
In order to maintain the structural decomposition based on these
coefficients, the method maintains a list of $k$ structures, $\mathscr{I} =
\{\mathscr{I}_1 \ldots \mathscr{I}_k\}$ (which all start empty). During
an insert, a set of records for the new block is initialized with the
inserted record and the contents of $\mathscr{I}_1$. Then, each time the
increment is carried into $D[i+1]$, the structure $\mathscr{I}_{i+1}$
is unbuilt and its records added to the set. When the increment
algorithm terminates, a new block is built from this set and placed at
$\mathscr{I}_i$. This process is a bit complicated, so we've summarized
it in Algorithm~\ref{alg:dyn-binomial-insert}. Figure~\ref{fig:dyn-kbin}
shows an example of inserting records into a $k$-binomial decomposition.
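For readers who prefer executable code, the sketch below is a direct Python
transcription of Algorithm~\ref{alg:dyn-binomial-insert}, using the same
sorted-list stand-ins as the previous sketch; the initial contents of $D$ for
an empty structure are an illustrative choice on our part.
\begin{verbatim}
import math

build, unbuild = sorted, list    # the same stand-ins as the earlier sketch

def kbin_insert(blocks, D, r):
    """The carrying procedure of the k-binomial insert: blocks is a list
    of k blocks (None when empty) and D is the coefficient array,
    0-indexed here, with an infinite sentinel so that len(D) == k + 1."""
    D[0] += 1
    S = [r] + (unbuild(blocks[0]) if blocks[0] is not None else [])
    blocks[0] = None
    i = 0
    while D[i] == D[i + 1]:
        D[i + 1] += 1
        if blocks[i + 1] is not None:
            S.extend(unbuild(blocks[i + 1]))   # merge the next block upward
        D[i] -= 1
        blocks[i + 1] = None
        i += 1
    blocks[i] = build(S)
    return blocks, D

# k = 3 blocks; D starts at a strictly increasing representation of n = 0.
k = 3
blocks, D = [None] * k, [0, 1, 2, math.inf]
for r in range(1, 21):
    blocks, D = kbin_insert(blocks, D, r)
assert sum(len(b) for b in blocks if b is not None) == 20
assert sum(b is not None for b in blocks) <= k
\end{verbatim}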
Applying this technique results in the following costs for operations,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n} \cdot \left(k! \cdot n\right)^{\frac{1}{k}}\right) \\
\text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\
\text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(k\cdot \mathscr{Q}_S(n)\right) \\
\end{align*}
Because the number of blocks is restricted to a constant, $k$, this method
is highly biased towards query performance, at the cost of insertion.
Bentley and Saxe also proposed a decomposition based on the dual of this
$k$-binomial approach. We won't go into the details here, but this dual
$k$-binomial method effectively reverses the insert and query trade-offs
to produce an insert optimized structure, with costs,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left( \frac{B(n)}{n} \cdot k\right) \\
\text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\
\text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(\left(k! \cdot n\right)^{\frac{1}{k}}\cdot \mathscr{Q}_S(n)\right) \\
\end{align*}
\subsection{Equal Block Method}
\label{ssec:ebm}
\begin{figure}
\centering
\includegraphics[width=.8\textwidth]{diag/ebm.pdf}
\caption{\textbf{Insertion in the Equal Block Method.} An equal block
decomposition of some data structure initially containing the records
$r_1\ldots r_{14}$, with $f(n) = 3$. When the record $r_{15}$ is inserted,
the smallest block ($i=1$) is located, and the record is placed there
by rebuilding it. When the next record, $r_{16}$ is inserted, the value
of $f(n)$ increases to $4$. As a result, the entire structure must be
re-partitioned to evenly distribute the records over $4$ blocks during
this insert.
}
\label{fig:dyn-ebm}
\end{figure}
The $k$-binomial method aims to provide the ability to adjust the
performance of a dynamized structure, selecting for either insert or
query performance. However, it is a bit of a blunt instrument, allowing
for a broadly insert-optimized system with horrible query performance, or
a broadly query-optimized system with horrible insert performance. Once
one of these two configurations has been selected, the degree of trade-off
can only be slightly adjusted by tweaking $k$. The desire to introduce a decomposition that
allowed for more fine-grained control over this trade-off resulted in a
third decomposition technique called the \emph{equal block method}. There
have been several proposed variations of this concept~\cite{maurer79,
maurer80}, but we will focus on the most developed form as described
by Overmars and von Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The
core concept of the equal block method is to decompose the data structure
into a specified number of blocks, such that each block is of roughly
equal size.
Consider an instance of a data structure $\mathscr{I} \in \mathcal{I}$
that solves some decomposable search problem, $F$, and is built over
a set of records $d \subseteq \mathcal{D}$. Rather than decomposing the data
structure based on some count-based scheme, as the two previous techniques
did, we can simply break the structure up into $s$ evenly sized blocks,
$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a constant
value results in large degradation of insert performance as the number
of records grows,\footnote{
The $k$-binomial decomposition got away with a fixed, constant number
of blocks because it scaled the sizes of the blocks to ensure that
there was a size distribution, and most of the reconstruction effort
occurred in the smaller blocks. When all the blocks are of equal size,
however, the cost of these reconstructions is much larger, and so it
becomes necessary to gradually grow the block count to ensure that
insertion cost doesn't grow too large.
} and so we instead take it to be governed by a smooth, monotonically
increasing function $f(n)$ such that, at any point, the following two
constraints are obeyed,
\begin{align}
f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
\forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
\end{align}
A new record is inserted by finding the smallest block and rebuilding it
using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
then an insert is done by,
\begin{equation*}
\mathscr{I}_k^\prime = \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}_k) \cup \{r\})
\end{equation*}
Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
violated by deletes. We're omitting deletes from the discussion at
this point, but will circle back to them in Section~\ref{ssec:dyn-deletes}.
} In this case, the constraints are enforced by re-configuring the
structure. $s$ is updated to be exactly $f(n)$, all of the existing
blocks are unbuilt, and then the records are redistributed evenly into
$s$ blocks. An example of insertion in the equal block method is shown
in Figure~\ref{fig:dyn-ebm}.
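The Python sketch below illustrates this procedure. The growth function
$f(n) = \max(1, \lfloor\sqrt{n}\rfloor)$ is an arbitrary illustrative choice,
and the sorted-list blocks are again stand-ins for a real static structure.
\begin{verbatim}
import math

build, unbuild = sorted, list       # the same stand-ins as earlier sketches

def f(n):
    return max(1, math.isqrt(n))    # illustrative choice of f(n)

def ebm_insert(blocks, r):
    """Equal block method: rebuild the smallest block with the new record,
    then repartition into exactly f(n) blocks if the block-count
    constraint f(n/2) <= s <= f(2n) has been violated."""
    if not blocks:
        return [build([r])]
    smallest = min(range(len(blocks)), key=lambda j: len(blocks[j]))
    blocks[smallest] = build(unbuild(blocks[smallest]) + [r])

    n, s = sum(len(b) for b in blocks), len(blocks)
    if not (f(n // 2) <= s <= f(2 * n)):
        records = [x for b in blocks for x in unbuild(b)]
        s = f(n)
        # Redistribute the records evenly across exactly f(n) blocks.
        blocks = [build(records[j::s]) for j in range(s)]
    return blocks

blocks = []
for r in range(1, 17):
    blocks = ebm_insert(blocks, r)
assert sum(len(b) for b in blocks) == 16
\end{verbatim}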
This technique provides better amortized performance bounds than global
reconstruction, at the possible cost of worse query performance for
sub-linear queries. We'll omit the details of the proof of performance
for brevity and streamline some of the original notation (full details
can be found in~\cite{overmars83}), but this technique ultimately
results in a data structure with the following performance characteristics,
\begin{align*}
\text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
\text{Worst-case Insertion Cost:}&\quad I(n) \in \Theta\left(f(n)\cdot B\left(\frac{n}{f(n)}\right)\right) \\
\text{Worst-case Query Cost:}& \quad \mathscr{Q}(n) \in O\left(f(n) \cdot \mathscr{Q}_S\left(\frac{n}{f(n)}\right)\right) \\
\end{align*}
The equal block method is generally \emph{worse} in terms of insertion
performance than the logarithmic and $k$-binomial decompositions, because
the sizes of reconstructions are typically much larger for an equivalent
block count, due to all the blocks having approximately the same size.
\subsection{Optimizations}
In addition to exploring various different approaches to decomposing the
data structure to be dynamized, the literature also explores a number of
techniques for optimizing performance under certain circumstances. In
this section, we will discuss the two most important of these for our
purposes: the exploitation of more efficient data structure merging, and
an approach for reducing the worst-case insertion cost of a decomposition
based loosely on the logarithmic method.
\subsubsection{Merge Decomposable Search Problems}
When considering a decomposed structure, reconstructions are performed
not using a random assortment of records, but mostly using records
extracted from already existing data structures. In the case of static
data structures, as defined in Definition~\ref{def:static-ds}, the
best we can do is to unbuild the data structures and then rebuild from
scratch; however, there are many data structures that can be efficiently merged.
Consider a data structure that supports construction via merging,
$\mathbftt{merge}(\mathscr{I}_1, \ldots \mathscr{I}_k)$ in $B_M(n, k)$
time, where $n = \sum_{i=1}^k |\mathscr{I}_i|$. A search problem for
which such a data structure exists is called a \emph{merge decomposable
search problem} (MDSP)~\cite{merge-dsp}.
Note that in~\cite{merge-dsp}, Overmars considers a \emph{very} specific
definition where the data structure is built in two stages. An initial
sorting phase, requiring $O(n \log n)$ time, and then a construction
phase requiring $O(n)$ time. Overmars's proposed mechanism for leveraging
this property is to include with each block a linked list storing the
records in sorted order (presumably to account for structures where the
records must be sorted, but aren't necessarily kept that way). During
reconstructions, these sorted lists can first be merged, and then the
data structure built from the resulting merged list. Using this approach,
even accounting for the merging of the lists, he is able to prove that
the amortized insertion cost is lower than it would be if the full
$O(n \log n)$ construction cost were paid for each reconstruction.~\cite{merge-dsp}
While Overmars's definition for MDSP does capture a large number of
mergeable data structures (including all of the mergeable structures
considered in this work), we modify his definition to consider a broader
class of problems. We will be using the term to refer to any search
problem with a data structure that can be merged more efficiently than
built from an unsorted set of records. More formally,
\begin{definition}[Merge Decomposable Search Problem~\cite{merge-dsp}]
\label{def:mdsp}
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$
is \emph{merge decomposable} if and only if there exists a solution to the
search problem (i.e., a data structure) that is static, and also
supports the operation,
\begin{itemize}
\item $\mathbftt{merge}: \mathcal{I}^k \to \mathcal{I}$ \\
$\mathbftt{merge}(\mathscr{I}_1, \ldots \mathscr{I}_k)$ returns a
static data structure, $\mathscr{I}^\prime$, constructed
from the input data structures, with cost $B_M(n, k) \leq B(n)$,
such that for any set of search parameters $q$,
\begin{equation*}
\mathbftt{query}(\mathscr{I}^\prime, q) = \mathbftt{query}(\mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}_1)\cup\ldots\cup\mathbftt{unbuild}(\mathscr{I}_k)), q)
\end{equation*}
\end{itemize}
\end{definition}
The value of $k$ can be upper-bounded by the decomposition technique
used. For example, in the logarithmic method there will be $\log n$
structures to merge in the worst case, and so to gain benefit from the
merge routine, the merging of $\log n$ structures must be less expensive
than building the new structure using the standard $\mathtt{unbuild}$
and $\mathtt{build}$ mechanism. Note that the availability of an efficient merge
operation isn't helpful in the equal block method, which doesn't
perform data structure merges.\footnote{
In the equal block method, all reconstructions are due to either
inserting a record or re-partitioning the records. In the former
case, the reconstruction pulls records from only a single structure
and merging is not possible. In the latter, records may come from
multiple structures, but the structures are not merged and only some
of the records from each are used. In either case, merging is not
useful as an optimization.
}
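Sorted arrays are a simple example of this: $k$ already-sorted blocks can be
merged in $O(n \log k)$ time rather than paying the full $O(n \log n)$ cost of
sorting from scratch. The short Python sketch below, using the standard
library's \texttt{heapq.merge}, is purely illustrative.
\begin{verbatim}
import heapq

def build(records):
    return sorted(records)           # B(n) in O(n log n)

def merge(*blocks):
    # Merge already-sorted blocks in O(n log k) time, rather than
    # unbuilding them and paying the full O(n log n) build cost.
    return list(heapq.merge(*blocks))

a, b = build([5, 1, 9]), build([2, 8])
assert merge(a, b) == build([1, 2, 5, 8, 9])
\end{verbatim}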
\subsubsection{Improved Worst-Case Insertion Performance}
\label{ssec:bsm-worst-optimal}
Dynamization based upon decomposition and global reconstruction has a
significant gap between its \emph{amortized} insertion performance, and
its \emph{worst-case} insertion performance. When using the Bentley-Saxe
method, the logarithmic decomposition ensures that the majority of inserts
involve rebuilding only small data structures, and thus are relatively
fast. However, the worst-case insertion cost is still $\Theta(B(n))$,
no better than global reconstruction, because the worst-case insert
requires a reconstruction using all of the records in the structure.
Overmars and van Leeuwen~\cite{overmars81, overmars83} proposed an
alteration to the logarithmic method that is capable of bringing the
worst-case insertion cost in line with amortized, $I(n) \in \Theta
\left(\frac{B(n)}{n} \log n\right)$. To accomplish this, they introduce
a structure that is capable of spreading the work of reconstructions
out across multiple inserts. Their structure consists of $\log_2 n$
levels, like the logarithmic method, but each level contains four data
structures, rather than one: $Oldest_i$, $Older_i$, $Old_i$, and
$New_i$.\footnote{
We are here adopting nomenclature used by Erickson in his lecture
notes on the topic~\cite{erickson-bsm-notes}, which is a bit clearer
than the more mathematical notation in the original source material.
} The $Old$, $Older$, $Oldest$ structures represent completely built
versions of the data structure on each level, and will be either full
($2^i$ records) or empty. If $Oldest$ is empty, then so is $Older$,
and if $Older$ is empty, then so is $Old$. The fourth structure,
$New$, represents a partially built structure on the level. A record
in the structure will be present in exactly one old structure, and may
also appear in a new structure.
When inserting into this structure, the algorithm first examines every
level, $i$. If both $Older_{i-1}$ and $Oldest_{i-1}$ are full, then the
algorithm executes $\frac{B(2^i)}{2^i}$ steps of the procedure used
to construct $New_i$ from $\text{unbuild}(Older_{i-1}) \cup
\text{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed
to completely build some block, $New_i$, the source blocks for the
reconstruction, $Oldest_{i-1}$ and $Older_{i-1}$ are deleted, $Old_{i-1}$
becomes $Oldest_{i-1}$, and $New_i$ is assigned to the oldest empty block
on level $i$.
This approach means that, in the worst case, partial reconstructions will
be executed on every level in the structure, resulting in
\begin{equation*}
I(n) \in \Theta\left(\sum_{i=1}^{\log_2 n} \frac{B(2^i)}{2^i}\right) \in \Theta\left(\log_2 n \frac{B(n)}{n}\right)
\end{equation*}
time. Additionally, if $B(n) \in \Omega(n^{1 + \epsilon})$ for $\epsilon
> 0$, then the bottom level dominates the reconstruction cost, and the
worst-case bound drops to $I(n) \in \Theta\left(\frac{B(n)}{n}\right)$.
\section{Decomposition-based Dynamization for Full-dynamic Structures}
\label{ssec:dyn-deletes}
Full-dynamic structures are those with support for deleting records,
as well as inserting. As it turns out, supporting deletes efficiently
is significantly more challenging than inserts, but there are some
results in the theoretical literature for efficient delete support in
restricted cases.
While, as discussed earlier, it is in principle possible to support
deletes using global reconstruction, with the operation defined as
\begin{equation*}
\mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\})
\end{equation*}
the extension of this procedure to a decomposed data structure is less
than trivial. Unlike inserts, where the record can (in principle) be
placed into whatever block we like, deletes must be applied specifically
to the block containing the record. As a result, there must be a means to
locate the block containing a specified record before it can be deleted.
In addition to this, all three of the decomposition schemes discussed so
far take advantage of the fact that inserts can be applied to blocks in
a systematic manner to provide performance guarantees. Deletes, however,
lack this control, making bounding their performance far more difficult.
For example, consider a logarithmic decomposition that contains all
integers on the interval $[1, 100]$, inserted in that order. We would
like to delete all of the records from this structure, one at a time,
using global reconstruction, in the reverse of the order in which they were inserted.
Even if we assume that we can easily locate the block containing each
record to delete, we are still faced with two major problems,
\begin{itemize}
\item The cost of performing a delete is a function of which block the
record is in, which is a question of distribution and not
easily controlled. In this example, we will always trigger
the worst-case behavior, repeatedly rebuilding the largest
blocks in the structure one at a time, as the number of
records diminishes.
\item As records are deleted, the structure will potentially violate
the invariants of the decomposition scheme used, which will
require additional work to fix.
\end{itemize}
To resolve these difficulties, two very different approaches have been
proposed for creating full-dynamic structures. One approach requires the
search problem itself to have certain properties, and the other requires
certain operations to be supported by the data structure. We'll discuss
these next.
\subsection{Ghost Structure for Invertible Search Problems}
The first proposed mechanism for supporting deletes was discussed
alongside the logarithmic method in Bentley and Saxe's original
paper. This technique applies to a class of search problems called
\emph{invertible} (also called \emph{decomposable counting problems}
in later literature~\cite{overmars83}). Invertible search problems
are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
that is able to remove records from the result set. More formally,
\begin{definition}[Invertible Search Problem~\cite{saxe79}]
\label{def:invert}
A decomposable search problem, $F$ is invertible if and only if there
exists a constant time computable operator, $\Delta$, such that
\begin{equation*}
F(A - B, q) = F(A, q)~\Delta~F(B, q)
\end{equation*}
for all $A, B \in \mathcal{PS}(\mathcal{D})$ with $B \subseteq A$.
\end{definition}
Given a search problem with this property, it is possible to emulate
removing a record from the structure by instead inserting into a
secondary ``ghost'' structure. When the decomposed structure is queried, this
ghost structure is queried as well as the main one. The results from
the ghost structure can be removed from the result set using the inverse
merge operator. This simulates the result that would have been obtained
had the records been physically removed from the main structure.
Two examples of invertible search problems are range count and set
membership. Range count was formally defined in
Definition~\ref{def:range-count}.
\begin{theorem}
Range count is an invertible search problem.
\end{theorem}
\begin{proof}
To prove that range count is an invertible search problem, it must be
decomposable and have a $\Delta$ operator. That it is a DSP has already
been proven in Theorem~\ref{ther:decomp-range-count}.
Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
gives,
\begin{equation*}
|(A - B) \cap q | = |(A \cap q) - (B \cap q)| = |A \cap q| - |B \cap q|
\end{equation*}
where the first equality follows from the distributivity of intersection
over set difference, and the second from the fact that $B \subseteq A$
implies $B \cap q \subseteq A \cap q$. Subtraction is computable in
constant time, therefore range count is an invertible search problem
using subtraction as $\Delta$.
\end{proof}
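To make the role of $\Delta$ concrete, the following short Python snippet
(our own illustration; the sets, interval, and function name are arbitrary)
evaluates range count over a set $A$ and a deleted subset $B \subseteq A$,
and checks that subtraction recovers the count over $A - B$.
\begin{verbatim}
# Illustrative check that subtraction acts as the inverse merge
# operator (Delta) for range count, assuming B is a subset of A.
def range_count(data, lo, hi):
    # F(data, q): number of records falling within the interval [lo, hi]
    return sum(1 for x in data if lo <= x <= hi)

A = {1, 4, 7, 9, 12, 15}   # all records ever inserted
B = {4, 12}                # "deleted" records; note B is a subset of A
q = (3, 13)                # query interval [3, 13]

direct = range_count(A - B, *q)                    # F(A - B, q)
merged = range_count(A, *q) - range_count(B, *q)   # F(A, q) Delta F(B, q)
assert direct == merged == 2
\end{verbatim}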
The set membership search problem is defined as follows,
\begin{definition}[Set Membership]
\label{def:set-membership}
Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
and a single element $r \in \mathcal{D}$. A test of set membership is a
search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
\not\in d$ and $1$ if $r \in d$.
\end{definition}
\begin{theorem}
Set membership is an invertible search problem.
\end{theorem}
\begin{proof}
To prove that set membership is invertible, it is necessary to establish
that it is a decomposable search problem, and that a $\Delta$ operator
exists. We'll begin with the former.
\begin{lemma}
\label{lem:set-memb-dsp}
Set membership is a decomposable search problem.
\end{lemma}
\begin{proof}
Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
\begin{align*}
F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
r \in (A \cup B) &= (r \in A) \lor (r \in B)
\end{align*}
which is true, following directly from the definition of union. The
logical disjunction is an associative, commutative operator that can
be calculated in $\Theta(1)$ time. Therefore, set membership is a
decomposable search problem.
\end{proof}
For the inverse merge operator, $\Delta$, it is necessary that $F(A,
r) ~\Delta~F(B, r)$ be true \emph{if and only if} $r \in A$ and $r \not\in
B$. Thus, it can be directly implemented as $F(A, r)~\Delta~F(B, r) =
F(A, r) \land \neg F(B, r)$, which can be evaluated in constant time
once the operands are known.
Thus, we have shown that set membership is a decomposable search problem,
and that a constant time $\Delta$ operator exists. Therefore, it is an
invertible search problem.
\end{proof}
For search problems such as these, this technique allows deletes to be
supported at the same cost as an insert. Unfortunately, it suffers from
write amplification, because each deleted record is recorded twice--once
in the main structure, and once in the ghost structure. This means that
$n$ is, in effect, the total number of insertions and deletions. This
can lead to serious problems. For example, if every record in a structure
of $n$ records is deleted, the net result will be an ``empty'' dynamized
data structure containing $2n$ physical records. To circumvent this
problem, Bentley and Saxe proposed setting a maximum threshold for the
size of the ghost structure relative to the main one. Once this threshold
is reached, a complete re-partitioning of the data is performed. During
this re-partitioning, all deleted records are removed from the main
structure and the ghost structure is emptied completely. Then all of the
blocks are rebuilt from the remaining records, partitioned according to
the strict binary decomposition of the logarithmic method.
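To illustrate how these pieces fit together, the following Python sketch
(our own simplification, not Bentley and Saxe's implementation) pairs a
main logarithmic decomposition with a ghost decomposition for range count,
answers deletes by inserting into the ghost structure, combines query
results with $\Delta = -$, and triggers a full re-partition once the ghost
structure exceeds a fixed fraction of the main one. The class names, the
use of sorted arrays as blocks, and the particular threshold are
illustrative assumptions.
\begin{verbatim}
import bisect
from collections import Counter

# A minimal insert-only logarithmic decomposition over sorted arrays,
# answering range count queries.  Block i holds 2^i records or is empty.
class LogDecomposition:
    def __init__(self):
        self.blocks = []

    def __len__(self):
        return sum(len(b) for b in self.blocks)

    def insert(self, rec):
        carry = [rec]
        for i, block in enumerate(self.blocks):
            if not block:
                self.blocks[i] = sorted(carry)   # rebuild this block
                return
            carry += block                       # unbuild and carry down
            self.blocks[i] = []
        self.blocks.append(sorted(carry))

    def range_count(self, lo, hi):
        # Decomposable: merge the per-block results with addition.
        return sum(bisect.bisect_right(b, hi) - bisect.bisect_left(b, lo)
                   for b in self.blocks)

# Full-dynamic wrapper: a delete is an insert into a "ghost" structure,
# and query results are combined with the inverse operator (subtraction).
class GhostDynamized:
    def __init__(self, max_ghost_fraction=0.5):
        self.main = LogDecomposition()
        self.ghost = LogDecomposition()
        self.max_ghost_fraction = max_ghost_fraction

    def insert(self, rec):
        self.main.insert(rec)

    def delete(self, rec):
        self.ghost.insert(rec)
        if len(self.ghost) > self.max_ghost_fraction * len(self.main):
            self._repartition()          # ghost structure has grown too large

    def range_count(self, lo, hi):
        # F(A - B, q) = F(A, q) - F(B, q)
        return self.main.range_count(lo, hi) - self.ghost.range_count(lo, hi)

    def _repartition(self):
        # Physically remove deleted records and rebuild the decomposition.
        live = Counter(x for b in self.main.blocks for x in b)
        live.subtract(x for b in self.ghost.blocks for x in b)
        self.main, self.ghost = LogDecomposition(), LogDecomposition()
        for rec, count in live.items():
            for _ in range(count):
                self.main.insert(rec)
\end{verbatim}
In this sketch, reinserting the live records one at a time reproduces the
strict binary decomposition of the logarithmic method. Note also that
deleting a record that was never inserted would silently corrupt the
counts, so a practical implementation would verify membership before
inserting into the ghost structure.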
\subsection{Weak Deletes for Deletion Decomposable Search Problems}
Another approach for supporting deletes was proposed later, by Overmars
and van Leeuwen, for a class of search problems called \emph{deletion
decomposable}. These are decomposable search problems for which the
underlying data structure supports a delete operation. More formally,
\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
\label{def:background-ddsp}
A decomposable search problem, $F$, and its data structure,
$\mathcal{I}$, is deletion decomposable if and only if, for some
instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
there exists a deletion routine $\mathtt{delete}(\mathscr{I},
r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
increasing the query time, deletion time, or storage requirement,
for $\mathscr{I}$.
\end{definition}
Superficially, this doesn't appear very useful, because if the underlying
data structure already supports deletes, there isn't much reason to
use a dynamization technique to add deletes to it. However, even in
structures that don't natively support deleting, it is possible in many
cases to \emph{add} delete support without significant alterations.
If it is possible to locate a record and somehow mark it as deleted,
without removing it from the structure, and then efficiently ignore these
records while querying, then the given structure and its search problem
can be said to be deletion decomposable. This technique for deleting
records is called \emph{weak deletes}.
\begin{definition}[Weak Deletes~\cite{overmars81}]
\label{def:weak-delete}
A data structure is said to support weak deletes if it provides a
routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
where $\mathscr{Q}(n)$ is the cost of answering the query against a
structure upon which no weak deletes were performed.\footnote{
This paper also provides a similar definition for weak updates,
but these aren't of interest to us in this work, and so the above
definition was adapted from the original with the weak update
constraints removed.
} The results of the query of a block containing weakly deleted records
should be the same as the results would be against a block with those
records removed.
\end{definition}
As an example of a deletion decomposable search problem, consider the set
membership problem considered above (Definition~\ref{def:set-membership})
where $\mathcal{I}$, the data structure used to answer queries of the
search problem, is a hash map.\footnote{
While most hash maps are already dynamic, and so wouldn't need
dynamization to be applied, there do exist static ones too. For example,
the hash map being considered could be implemented using perfect
hashing~\cite{perfect-hashing}, which has many static implementations.
}
\begin{theorem}
The set membership problem, answered using a static hash map, is
deletion decomposable.
\end{theorem}
\begin{proof}
We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
is a decomposable search problem. For it to be deletion decomposable,
we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
records without hurting its query performance, delete performance, or
storage requirements. Assume that an instance $\mathscr{I} \in
\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
Such a structure can support weak deletes. Each record within the
structure has a single bit attached to it, indicating whether it has
been deleted or not. These bits will require $\Theta(n)$ storage and
be initialized to $0$ when the structure is constructed. A delete can
be performed by querying the structure for the record to be deleted in
$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
operation has $D(n) \in \Theta(1)$ cost.
\begin{lemma}
\label{lem:weak-deletes}
The delete procedure as described above satisfies the requirements of
Definition~\ref{def:weak-delete} for weak deletes.
\end{lemma}
\begin{proof}
Per Definition~\ref{def:weak-delete}, there must exist some constant
dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
bounded by $k_\alpha \cdot \mathscr{Q}(n)$.
In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
query cost must be bounded by $\Theta(k_\alpha)$. When a query is
executed against $\mathscr{I}$, there are three possible cases,
\begin{enumerate}
\item The record being searched for does not exist in $\mathscr{I}$. In
this case, the query result is $0$.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of $0$. In this case, the query result is $1$.
\item The record being searched for does exist in $\mathscr{I}$ and has
a delete bit value of $1$ (i.e., it has been deleted). In this case, the
query result is $0$.
\end{enumerate}
In all three cases, the addition of deletes requires at most $\Theta(1)$
extra work. Therefore, set membership over a static hash map
using our proposed deletion mechanism satisfies the requirements for
weak deletes, with $k_\alpha = 1$.
\end{proof}
Finally, we note that the cost of one of these weak deletes is $D(n)
= \mathscr{Q}(n)$, and by Lemma~\ref{lem:weak-deletes} this cost is
not asymptotically harmed by the presence of deleted records.
Thus, we've shown that set membership using a static hash map is a
decomposable search problem, that the storage cost remains $\Omega(n)$,
and that the query and delete costs are unaffected by the presence of
deletes under the proposed mechanism. All of the requirements of deletion
decomposability are satisfied; therefore, set membership using a static
hash map is a deletion decomposable search problem.
\end{proof}
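A minimal Python sketch of this construction is given below. The class
name and the use of a plain Python dictionary to stand in for a static
table (such as one built with perfect hashing) are our own illustrative
choices; the point is only the one-bit-per-record delete marking
described in the proof.
\begin{verbatim}
# A "static" membership structure augmented with a one-bit tombstone
# per record to support weak deletes.  The dict stands in for a static
# hash table: it is built once and never grows or shrinks.
class StaticMembershipMap:
    def __init__(self, records):
        # delete bit per record: False = live, True = weakly deleted
        self._table = {rec: False for rec in records}

    def query(self, rec):
        # Set membership, ignoring weakly deleted records: Theta(1).
        deleted = self._table.get(rec)
        return deleted is not None and not deleted

    def delete(self, rec):
        # Weak delete, D(n) = Q(n): locate the record, set its bit.
        if rec in self._table:
            self._table[rec] = True

block = StaticMembershipMap([3, 1, 4, 1, 5, 9])
assert block.query(4)
block.delete(4)
assert not block.query(4)   # case 3: present but marked deleted
assert not block.query(7)   # case 1: never inserted
\end{verbatim}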
For such problems, deletes can be supported by first identifying the
block in the decomposition containing the record to be deleted, and
then calling $\mathtt{delete}$ on it. In order to allow this block to
be easily located, it is possible to maintain a hash table over all
of the records, alongside the decomposition, which maps each record
onto the block containing it. This table must be kept up to date as
reconstructions occur, but this can be done at no extra asymptotic cost
for any data structure having $B(n) \in \Omega(n)$, as the updates require
only linear time. This allows deletes to be performed in $\mathscr{D}(n)
\in \Theta(D(n))$ time.
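A sketch of this bookkeeping is shown below (our own illustration, with
hypothetical names); each block is assumed to expose the weak
\texttt{delete} routine described above, and the table is refreshed
whenever a reconstruction moves records between blocks.
\begin{verbatim}
# Routing deletes to the correct block with a record -> block-id map.
class DeleteRouter:
    def __init__(self):
        self.blocks = {}     # block id -> block object (supports .delete())
        self.location = {}   # record   -> id of the block containing it

    def register_block(self, block_id, block, records):
        # Called after a (re)construction builds `block` from `records`.
        # Updating the map is linear work, which is dominated by any
        # build cost B(n) in Omega(n).
        self.blocks[block_id] = block
        for rec in records:
            self.location[rec] = block_id

    def delete(self, rec):
        # One hash lookup plus the block's own delete routine.
        block_id = self.location.pop(rec, None)
        if block_id is not None:
            self.blocks[block_id].delete(rec)
\end{verbatim}
Used together with the \texttt{StaticMembershipMap} sketch above as the
block type, this yields constant-time deletes for the set membership
example.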
The presence of deleted records within the structure does introduce a
new problem, however. Over time, the number of records in each block will
drift away from the requirements imposed by the decomposition technique. It
will eventually become necessary to re-partition the records to restore
these invariants, which are necessary for bounding the number of blocks,
and thereby the query performance. The particular invariant maintenance
rules depend upon the decomposition scheme used. To our knowledge,
there is no discussion of applying the $k$-binomial method to deletion
decomposable search problems, and so that method is not covered here.
\Paragraph{Logarithmic Method.} When creating a logarithmic decomposition for
a deletion decomposable search problem, the $i$th block, where $i \geq 2$,\footnote{
Block $i=1$ will only ever hold one record, so no special maintenance is
required for it; a delete simply empties it completely.
}
will contain $2^{i-1} + 1$ records in the absence of deletes. When a
delete occurs in block $i$, no special action is taken until the number
of records in that block falls below $2^{i-2}$. Once this threshold is
reached, a reconstruction can be performed to restore the appropriate
record counts in each block.~\cite{merge-dsp}
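The maintenance rule above reduces to a simple per-block trigger, sketched
below in Python (our own illustration). How the subsequent reconstruction
redistributes records is left abstract, as it depends on the details of
the scheme in~\cite{merge-dsp}.
\begin{verbatim}
# Trigger check for the logarithmic method with deletes: block i
# (i >= 2) nominally holds 2^(i-1) + 1 records, and a reconstruction
# is scheduled once its live record count falls below 2^(i-2).
def needs_reconstruction(block_index, live_records):
    if block_index < 2:
        return False            # block 1 requires no special maintenance
    return live_records < 2 ** (block_index - 2)

def after_delete(live_counts, block_index, rebuild):
    # live_counts: dict mapping block index -> number of live records
    live_counts[block_index] -= 1
    if needs_reconstruction(block_index, live_counts[block_index]):
        rebuild(block_index)    # restore the per-block record counts
\end{verbatim}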
\Paragraph{Equal Block Method.} For the equal block method, there are
two cases in which a delete may cause a block to fail to obey the method's
size invariants,
\begin{enumerate}
\item If enough records are deleted, it is possible for the number
of blocks to exceed $f(2n)$, violating Invariant~\ref{ebm-c1}.
\item The deletion of records may cause the maximum size of each
block to shrink, causing some blocks to exceed the maximum capacity
of $\nicefrac{2n}{s}$. This is a violation of Invariant~\ref{ebm-c2}.
\end{enumerate}
In both cases, it should be noted that $n$ is decreased as records are
deleted. Should either of these cases emerge as a result of a delete,
the entire structure must be reconfigured to ensure that its invariants
are maintained. This reconfiguration follows the same procedure as when
an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
existing blocks are unbuilt, and then the records are evenly redistributed
into the $s$ blocks.~\cite{overmars-art-of-dyn}
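The corresponding checks and reconfiguration for the equal block method
can be sketched as follows (our own illustration); \texttt{f} is the
user-supplied block-count function, \texttt{s} is the current partitioning
parameter, and \texttt{build\_block} is a placeholder for the static
structure's construction routine.
\begin{verbatim}
import math

# Invariant checks for the equal block method under deletes.
def violates_invariants(blocks, s, f):
    n = sum(len(b) for b in blocks)
    too_many_blocks = len(blocks) > f(2 * n)               # first violation
    oversized = any(len(b) > (2 * n) / s for b in blocks)  # second violation
    return too_many_blocks or oversized

# Full reconfiguration, mirroring the insert-triggered procedure:
# set s = f(n), unbuild every block, and redistribute records evenly.
def reconfigure(blocks, f, build_block):
    records = [r for b in blocks for r in b]
    n = len(records)
    s = max(1, f(n))
    size = math.ceil(n / s)
    new_blocks = [build_block(records[i * size:(i + 1) * size])
                  for i in range(s)]
    return new_blocks, s
\end{verbatim}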
\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}
While fairly general, these dynamization techniques have a number of
limitations that prevent them from being directly usable as a general
solution to the problem of creating database indices. Because of the
requirement that the query being answered be decomposable, many search
problems cannot be addressed--or at least not efficiently addressed--by
decomposition-based dynamization. Additionally, though we have discussed
two decomposition approaches that expose some form of performance tuning
to the user, these techniques target asymptotic results, which leads
to poor performance in practice. Finally, most decomposition schemes
have poor worst-case insertion performance, resulting in extremely poor
tail latency relative to native dynamic structures. While there do exist
decomposition schemes that exhibit better worst-case performance,
they are impractical. This section will discuss these limitations in
more detail, and the rest of the document will be dedicated to proposing
solutions to them.
\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}
Unfortunately, the DSP abstraction used as the basis of classical
dynamization techniques has a few significant limitations that restrict
their applicability,
\begin{itemize}
\item The query must be broadcast identically to each block and cannot
be adjusted based on the state of the other blocks.
\item The query process is done in one pass--it cannot be repeated.
\item The result merge operation must be $O(1)$ to maintain good query
performance.
\item The result merge operation must be commutative and associative,
and is called repeatedly to merge pairs of results.
\end{itemize}
These requirements restrict the types of queries that can be supported by
the method efficiently. For example, k-nearest neighbor and independent
range sampling are not decomposable.
\subsubsection{k-Nearest Neighbor}
\label{sssec-decomp-limits-knn}
The k-nearest neighbor ($k$-NN) problem is a generalization of the nearest
neighbor problem, which seeks to return the closest point within the
dataset to a given query point. More formally, this can be defined as,
\begin{definition}[Nearest Neighbor]
Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
between two points within $D$. The nearest neighbor problem, $NN(D,
q)$, returns some $d \in D$ such that $f(d, q) = \min_{d' \in D} \{f(d',
q)\}$ for a given query point, $q \in \mathbb{R}^d$.
\end{definition}
In practice, it is common to require $f(x, y)$ be a metric,\footnote
{
Contrary to its vernacular usage as a synonym for ``distance'', a
metric is more formally defined as a valid distance function over
a metric space. Metric spaces require their distance functions to
have the following properties,
\begin{itemize}
\item The distance between a point and itself is always $0$.
\item All distances between non-equal points must be positive.
\item For all points, $x, y \in D$, it is true that
$f(x, y) = f(y, x)$.
\item For any three points $x, y, z \in D$ it is true that
$f(x, z) \leq f(x, y) + f(y, z)$.
\end{itemize}
These distances also must have the interpretation that $f(x, y) <
f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
is the opposite of the definition of similarity, and so some minor
manipulations are usually required to make similarity measures work
in metric-based indexes. \cite{intro-analysis}
}
and the example indexes used to address this problem in this work will
do so, but it is not a fundamental aspect of the problem
formulation. The nearest neighbor problem itself is decomposable,
with a simple merge operator that, given two partial results, returns
the one with the smaller value of $f(x, q)$~\cite{saxe79}.
The k-nearest neighbor problem generalizes nearest-neighbor to return
the $k$ nearest elements,
\begin{definition}[k-Nearest Neighbor]
Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
between two points within $D$. The k-nearest neighbor problem,
$KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
\end{definition}
This can be thought of as solving the nearest-neighbor problem $k$ times,
each time removing the returned result from $D$ prior to solving the
problem again. Unlike the single nearest-neighbor case (which can be
thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
\begin{theorem}
k-NN is not a decomposable search problem.
\end{theorem}
\begin{proof}
To prove this, consider the query $KNN(D, q, k)$ against some partitioned
dataset $D = D_1 \cup D_2 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
then there must exist some constant-time, commutative, and associative
binary operator $\mergeop$, such that $R = \mergeop_{1 \leq i \leq \ell}
R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
k)$. Consider the evaluation of the merge operator against two arbitrary
result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
|R_j| = k$, and that the contents of $R$ must be the $k$ records from
$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
time. Therefore, k-NN is not a decomposable search problem.
\end{proof}
With that said, it is clear that there isn't any fundamental restriction
preventing the merging of the result sets; it is only that the
(somewhat arbitrary) constant-time performance requirement cannot be
satisfied. It is possible to merge the result sets in non-constant time,
and so k-NN is $C(n)$-decomposable. Unfortunately, this classification
brings with it a reduction in query performance as a result of the way
result merges are performed.
As a concrete example of these costs, consider using the logarithmic
method to extend the VPTree~\cite{vptree}. The VPTree is a static,
metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k
\log n)$. One possible merge algorithm for k-NN would be to push all
of the elements in the two arguments onto a min-heap, and then pop off
the first $k$. In this case, the cost of the merge operation would be
$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
could be considered to be constant-time. But given that $k$ is only
bounded in size above by $n$, this isn't a safe assumption to make in
general. Evaluating the total query cost for the extended structure,
this would yield,
\begin{equation}
KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
\end{equation}
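The heap-based merge described above might look like the following Python
sketch (our own illustration, using Euclidean distance); merging two
partial results of size $k$ this way costs $O(k \log k)$, which is where
the $\log k$ term in the bound originates.
\begin{verbatim}
import heapq

def knn_merge(r1, r2, q, k, dist):
    # Keep the k points from the two partial results nearest to q.
    # Heapifying the ~2k candidates is O(k); popping k of them costs
    # O(k log k), so this merge is not constant time unless k itself
    # is treated as a constant.
    heap = [(dist(p, q), p) for p in list(r1) + list(r2)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

# Example with Euclidean distance in the plane (illustrative only).
euclid = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
r = knn_merge([(0, 0), (5, 5)], [(1, 1), (9, 9)], q=(0, 0), k=2, dist=euclid)
assert r == [(0, 0), (1, 1)]
\end{verbatim}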
The reason for this large increase in cost is the repeated application
of the merge operator. The logarithmic method requires applying the
merge operator in a binary fashion to each partial result, multiplying
its cost by a factor of $\log n$. Thus, the constant-time requirement
of standard decomposability is necessary to keep the cost of the merge
operator from appearing within the complexity bound of the entire
operation in the general case.\footnote {
There is a special case, noted by Overmars, where the total cost is
$O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
\in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
case where the cost of the query and merge operation are sufficiently
large to consume the logarithmic factor, and so it doesn't represent
a special case with better performance.
}
If we could revise the result merging operation to remove this duplicated
cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
queries.
\subsubsection{Independent Range Sampling}
\label{ssec:background-irs}
Another problem that is not decomposable is independent sampling. There
are a variety of problems falling under this umbrella, including weighted
set sampling, simple random sampling, and weighted independent range
sampling, but we will focus on independent range sampling here.
\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
interval $q = [x, y]$ and an integer $k$, an independent range
sampling query returns $k$ independent samples from $D \cap q$
with each point having equal probability of being sampled.
\end{definition}
This problem immediately encounters a category error when considering
whether it is decomposable: the result set is randomized, whereas
the conditions for decomposability are defined in terms of an exact
matching of records in result sets. To work around this, a slight abuse
of definition is in order: assume that the equality conditions within
the DSP definition can be interpreted to mean ``the contents in the two
sets are drawn from the same distribution''. This enables the category
of DSP to apply to this type of problem, while maintaining the spirit of
the definition.
Even with this abuse, however, IRS cannot generally be considered
decomposable; it is at best $C(n)$-decomposable. The reason for this is
that matching the distribution requires drawing the appropriate number
of samples from each partition of the data. Even in the special
case that $|D_1| = |D_2| = \ldots = |D_\ell|$, the number of samples
from each partition that must appear in the result set cannot be known
in advance due to differences in the selectivity of the predicate across
the partitions.
\begin{example}[IRS Sampling Difficulties]
Consider three partitions of data, $D_1 = \{1, 2, 3, 4, 5\}, D_2 =
\{1, 1, 1, 1, 3\}, D_3 = \{4, 4, 4, 4, 4\}$ using bag semantics and
an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
partitions have the same size, it seems sensible to evenly distribute
the samples across them ($4$ samples from each partition). Applying
the query predicate to the partitions results in the following,
$d_1 = \{3, 4\}, d_2 = \{3\}, d_3 = \{4, 4, 4, 4, 4\}$.
In expectation, then, the first result set will contain $R_1 = \{3,
3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
probability of a $4$. The second and third result sets can only
be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
together, we'd find that the probability distribution of the sample
would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
the same sampling operation over the full dataset (not partitioned),
the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
\end{example}
The problem is that the number of samples drawn from each partition
needs to be weighted based on the number of elements satisfying the
query predicate in that partition. In the above example, by drawing $4$
samples from $D_1$, more weight is given to $3$ than exists within the
base dataset. This can be worked around by sampling a full $k$ records
from each partition, returning both the sample and the number of records
satisfying the predicate as that partition's query result. This allows for
the relative weights of each block to be controlled for during the merge,
by doing weighted sampling of each partial result. This approach requires
$\Theta(k)$ time for the merge operation, however, leaving IRS firmly
in the $C(n)$-decomposable camp. If it were possible to pre-calculate
the number of samples to draw from each partition, then a constant-time
merge operation could be used.
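This workaround can be sketched as follows (our own illustration): each
partition returns $k$ pre-drawn samples along with the number of its
records that satisfy the predicate, and the merge then draws the final
$k$ samples from the partial results with probability proportional to
those counts.
\begin{verbatim}
import random

def local_irs(partition, lo, hi, k):
    # Per-partition result: k samples from the matching records, plus
    # the number of matching records (the partition's weight).
    matching = [x for x in partition if lo <= x <= hi]
    if not matching:
        return [], 0
    return [random.choice(matching) for _ in range(k)], len(matching)

def merge_irs(partials, k):
    # Theta(k) merge: weighted sampling over the partial results.
    samples = [list(s) for s, _ in partials]
    weights = [w for _, w in partials]
    result = []
    for _ in range(k):
        i = random.choices(range(len(partials)), weights=weights)[0]
        result.append(samples[i].pop())   # consume one pre-drawn sample
    return result

# Partitions from the example above; query interval [3, 4] with k = 12.
D1, D2, D3 = [1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]
partials = [local_irs(D, 3, 4, 12) for D in (D1, D2, D3)]
sample = merge_irs(partials, 12)   # 4s now appear roughly 3x as often as 3s
\end{verbatim}
Note that each partition must still pre-draw a full $k$ samples, most of
which are discarded during the merge; a constant-time merge would require
knowing in advance how many samples to draw from each partition, which is
exactly the difficulty noted above.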
We examine support for non-decomposable search problems in
Chapters~\ref{chap:sampling} and \ref{chap:framework}, where we propose
techniques for efficiently extending dynamization systems to such
problems, as well as for addressing some additional difficulties
introduced by supporting deletes, which can complicate query processing.
\subsection{Configurability}
Decomposition-based dynamization is built upon a fundamental trade-off
between insertion and query performance that is governed by the
number of blocks a structure is decomposed into. Both the equal
block and $k$-binomial methods expose parameters to tune the number
of blocks in the structure, but these techniques suffer from poor
insertion performance in general. The equal block method in particular
suffers greatly from its larger block sizes~\cite{overmars83}, and the
$k$-binomial approach suffers from the same problem. In fact, we'll show in
Chapter~\ref{chap:tail-latency} that, under experimental conditions, the
equal block method is strictly worse than the logarithmic method at any
given query latency in the trade-off space. There is a technique that
attempts to address this limitation by nesting the logarithmic method
inside of the equal block method, called the \emph{mixed method}, that
has appeared in the theoretical literature~\cite{overmars83}. But this
technique is clunky, and doesn't provide the user with a meaningful design
space for configuring the system beyond specifying arbitrary functions.
The reason for this lack of simple configurability in existing
dynamization literature seems to stem from the theoretical nature of
the work. Many ``obvious'' options for tweaking the method, such as
changing the rate at which levels grow, adding buffering, etc., result in
constant-factor trade-offs, and thus are not relevant to the asymptotic
bounds that these works are concerned with. It's worth noting that some
works based on \emph{applying} the logarithmic method introduce some
form of configurability~\cite{pgm,almodaresi23}, usually inspired by
the design space of LSM trees~\cite{oneil96}, but the full consequences
of this parametrization in the context of dynamization have, to the
best of our knowledge, not been explored. We will discuss this topic
in Chapter~\ref{chap:design-space}.
\subsection{Insertion Tail Latency}
\label{ssec:bsm-tail-latency-problem}
One of the largest problems associated with classical dynamization
techniques is the poor worst-case insertion performance. This
results in massive insertion tail latencies. Unfortunately, solving
this problem within the logarithmic method itself is not a trivial
undertaking. Maintaining the strict binary decomposition of the structure
ensures that any given reconstruction cannot be performed in advance,
as it requires access to all the records in the structure in the worst
case. This limits the ability to use parallelism to hide the latencies.
The worst-case optimized approach proposed by Overmars and van Leeuwen
abandons the binary decomposition of the logarithmic method, and is thus
able to limit this worst-case insertion bound,
but it has a number of serious problems,
\begin{enumerate}
\item It assumes that the reconstruction process for a data structure
can be divided \textit{a priori} into a small number of independent
operations that can be executed in batches during each insert. It
is not always possible to do this efficiently, particularly for
structures whose construction involves multiple stages (e.g.,
a sorting phase followed by a recursive node construction phase,
as in a B+Tree) whose operation counts are not trivially predictable.
\item Even if the reconstruction process can be efficiently
sub-divided, implementing the technique requires \emph{significant}
and highly specialized modification of the construction procedures
for a data structure, and tight integration of these procedures into
the insertion process as a whole. This makes it poorly suited for use
in a generalized framework of the sort we are attempting to create.
\end{enumerate}
We tackle the problem of insertion tail latency in
Chapter~\ref{chap:tail-latency} and propose a new system which
resolves these difficulties and allows for significant improvements in
insertion tail latency without seriously degrading the other performance
characteristics of the dynamized structure.
\section{Conclusion}
In this chapter, we introduced the concept of a search problem, and
showed how amortized global reconstruction can be used to dynamize data
structures associated with search problems having certain properties. We
examined several theoretical approaches for dynamization, including the
equal block method, the logarithmic method, and a worst-case insertion
optimized approach. Additionally, we considered several more classes of
search problem, and saw how additional properties could be used to enable
more efficient reconstruction, and support for efficiently deleting
records from the structure. Ultimately, however, these techniques
have several deficiencies that must be overcome before a practical,
general system can be built upon them. Namely, they lack support for
several important types of search problem (particularly if deletes are
required); they are not easily configurable by the user; and they suffer
from poor insertion tail latency. The rest of this work will be dedicated
to approaches to resolve these deficiencies.