\chapter{Generalizing the Framework}

\begin{center}
    \emph{The following chapter is an adaptation of work completed in collaboration with Dr. Dong Xie and Dr. Zhuoyue Zhao
    and published
    in PVLDB Volume 17, Issue 11 (July 2024) under the title "Towards Systematic Index Dynamization".
    }
    \hrule
\end{center}


\label{chap:framework}

The previous chapter demonstrated the potential utility of
designing indexes based upon the dynamic extension of static data
structures. However, the presented strategy falls short of a general
framework, as it is specific to sampling problems. In this chapter,
the techniques of that work will be discussed in more general terms,
to arrive at a more broadly applicable solution. A general
framework is proposed, which places only two requirements on supported data
structures, 

\begin{itemize}
    \item Extended Decomposability
    \item Record Identity
\end{itemize}

In this chapter, these two properties are first defined. Then,
a general dynamic extension framework is described that can
be applied to any data structure supporting them. Finally,
an experimental evaluation is presented that demonstrates the viability
of this framework.

\section{Extended Decomposability}

Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be efficiently
addressed using Bentley-Saxe, so long as the query interface is
modified to accommodate their needs. For independent sampling
problems, this involved a two-pass approach, where some pre-processing
work was performed against each shard and used to construct a shard
alias structure. This structure was then used to determine how many
samples to draw from each shard.

To generalize this approach, a new class of decomposability is proposed,
called \emph{extended decomposability}. At present, its
definition is tied to a query interface, rather than being stated
as a formal mathematical property. In extended decomposability,
rather than treating the query algorithm as a monolith, it
is decomposed into multiple components.
This allows
for communication between shards as part of the query process.
Additionally, rather than using a binary merge operator, extended
decomposability uses a variadic function that merges all of the
result sets in one pass, reducing the cost due to merging by a
logarithmic factor without introducing any new restrictions.

The basic interface that must be supported by an extended-decomposable
search problem (eDSP) is,
\begin{itemize}

    \item $\mathbftt{local\_preproc}(\mathcal{I}_i, \mathcal{Q}) \to
    \mathscr{S}_i$ \\
        Pre-processes each partition $\mathcal{D}_i$ using index
        $\mathcal{I}_i$ to produce preliminary information about the 
        query result on this partition, encoded as an object 
        $\mathscr{S}_i$.

    \item $\mathbftt{distribute\_query}(\mathscr{S}_1, \ldots,
    \mathscr{S}_m, \mathcal{Q}) \to \mathcal{Q}_1, \ldots, 
    \mathcal{Q}_m$\\
            Processes the list of preliminary information objects
            $\mathscr{S}_i$ and emits a list of local queries
            $\mathcal{Q}_i$ to run independently on each partition.

    \item $\mathbftt{local\_query}(\mathcal{I}_i, \mathcal{Q}_i)
    \to \mathcal{R}_i$ \\
            Executes the local query $\mathcal{Q}_i$ over partition
            $\mathcal{D}_i$ using index $\mathcal{I}_i$ and returns a
            partial result $\mathcal{R}_i$.

    \item $\mathbftt{merge}(\mathcal{R}_1, \ldots, \mathcal{R}_m) \to
    \mathcal{R}$ \\ 
           Merges the partial results to produce the final answer.

\end{itemize}

The pseudocode for the query algorithm using this interface is,
\begin{algorithm}
    \DontPrintSemicolon
    \SetKwProg{Proc}{procedure}{ BEGIN}{END}
    \SetKwProg{For}{for}{ DO}{DONE}

    \Proc{$\mathbftt{QUERY}(D[], \mathscr{Q})$} {
        \For{$i \in [0, |D|)$} {
            $S[i] := \mathbftt{local\_preproc}(D[i], \mathscr{Q})$
        } \;

        $ Q := \mathbftt{distribute\_query}(S, \mathscr{Q}) $ \; \;

        \For{$i \in [0, |D|)$} {
            $R[i] := \mathbftt{local\_query}(D[i], Q[i])$
        } \;

        $OUT := \mathbftt{merge}(R)$ \;

        \Return {$OUT$} \;
    }
\end{algorithm}

In this system, each query can report a partial result with
$\mathbftt{local\_preproc}$, which can be used by
$\mathbftt{distribute\_query}$ to adjust the per-partition query
parameters, allowing for direct communication of state between
partitions. Queries which do not need this functionality can simply
return empty $\mathscr{S}_i$ objects from $\mathbftt{local\_preproc}$.
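
To make this interface concrete, the following is a minimal C++ sketch
of the query procedure above. The names are illustrative stand-ins (the
framework's actual interface is described later in this chapter), and a
query type \texttt{Q} is assumed to expose the four operations as
static members.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Generic eDSP query driver: pre-process every partition, build the
// local queries from the collected states, run them independently,
// and merge all partial results in a single variadic pass.
template <typename Q, typename Index>
typename Q::Result edsp_query(const std::vector<Index*> &partitions,
                              const typename Q::Parameters &params) {
    std::vector<typename Q::State> states;
    for (auto *idx : partitions)                     // local_preproc
        states.push_back(Q::local_preproc(idx, params));

    auto local_queries = Q::distribute_query(states, params);

    std::vector<typename Q::PartialResult> partial;
    for (size_t i = 0; i < partitions.size(); i++)   // local_query
        partial.push_back(Q::local_query(partitions[i], local_queries[i]));

    return Q::merge(partial);                        // single merge pass
}
\end{verbatim}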

\subsection{Query Complexity}

Before describing how to use this new interface and definition to
support more efficient queries than standard decomposability, a
more general expression for the cost of querying such a structure must
be derived.
Recall that Bentley-Saxe, when applied to a $C(n)$-decomposable
problem, has the following query cost,

\begin{equation}
    \label{eq3:Bentley-Saxe}
    O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right)
\end{equation}
where $Q_s(n)$ is the cost of the query against one partition, and
$C(n)$ is the cost of the merge operator.

Let $Q_s(n)$ represent the cost of $\mathbftt{local\_query}$ and
$C(n)$ the cost of $\mathbftt{merge}$ in the extended decomposability
case. Additionally, let $P(n)$ be the cost of $\mathbftt{local\_preproc}$
and $\mathcal{D}(n)$ be the cost of $\mathbftt{distribute\_query}$.
Finally, recall that $|D| = \log n$ for the Bentley-Saxe method.
In this case, the cost of a query is
\begin{equation}
    O \left( \log n \cdot P(n) + \mathcal{D}(n) + 
             \log n \cdot Q_s(n) + C(n) \right)
\end{equation}

Superficially, this looks to be strictly worse than the Bentley-Saxe
case in Equation~\ref{eq3:Bentley-Saxe}. However, the important
thing to understand is that for $C(n)$-decomposable queries, $P(n)
\in O(1)$ and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded.
Thus, for normal decomposable queries, the cost actually reduces
to,
\begin{equation}
    O \left( \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
which is actually \emph{better} than Bentley-Saxe. Meanwhile, the
ability to perform state-sharing between queries can facilitate better
solutions than would otherwise be possible.
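
As a concrete instance (an illustrative example, not drawn from the
original analysis): a range-count query over shards storing sorted
arrays requires no pre-processing or query distribution, so $P(n),
\mathcal{D}(n) \in O(1)$; each local query is a binary search with
$Q_s(n) \in O(\log n)$; and the merge sums $O(\log n)$ partial counts
with $C(n) \in O(\log n)$. The total cost under this interface is
therefore $O(\log^2 n)$.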

In light of this new approach, consider the two examples of
non-decomposable search problems from Section~\ref{ssec:decomp-limits}.

\subsection{k-Nearest Neighbor}
\label{ssec:knn}
The KNN problem is $C(n)$-decomposable, and Section~\ref{sssec-decomp-limits-knn}
arrived at a Bentley-Saxe solution to this problem based on a
VPTree, with a query cost of
\begin{equation}
    O \left( k \log^2 n + k \log n \log k \right)
\end{equation}
by running KNN on each partition, and then merging the result sets
with a heap.

Applying the interface of extended-decomposability to this problem
allows for some optimizations. Pre-processing is not necessary here,
but the variadic merge function can be leveraged to get an asymptotically
better solution. Simply dropping the existing algorithm into this
interface will result in a merge algorithm with cost,
\begin{equation}
    C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right)
\end{equation}
which results in a total query cost that is slightly \emph{worse}
than the original,

\begin{equation}
    O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right)
\end{equation}

The problem is that the number of records considered in a given
merge has grown from $O(k)$ in the binary merge case to $O(\log n
\cdot k)$ in the variadic merge. However, because the merge function
now has access to all of the data at once, the algorithm can be modified
slightly for better efficiency by only pushing $\log n$ elements
into the heap at a time. This trick only works if 
the $R_i$s are in sorted order relative to $f(x, q)$,
however this condition is satisfied by the result sets returned by
KNN against a VPTree. Thus, for each $R_i$, the first element in sorted
order can be inserted into the heap, tagged with a reference to
the $R_i$ it was taken from. Then, when the heap is popped, the
next element from the associated $R_i$ can be inserted.
This allows the heap's size to be maintained at no larger 
than $O(\log n)$, and limits the algorithm to no more than
$k$ pop operations and $\log n + k - 1$ pushes.

This algorithm reduces the cost of KNN on this structure to,
\begin{equation}
    O\left(k \log^2 n + (k + \log n) \log\log n\right)
\end{equation}
which is strictly better than the original.
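
The following is a minimal C++ sketch of this bounded-heap merge, under
the assumption that each partial result list arrives sorted by distance
to the query point and that the distances are supplied alongside the
records; all names are illustrative.
\begin{verbatim}
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// One heap entry per partial result list; ordered by distance f(x, q).
struct HeapEntry {
    double dist;      // f(x, q) for this record
    size_t list;      // which R_i the record came from
    size_t offset;    // position within R_i
    bool operator>(const HeapEntry &o) const { return dist > o.dist; }
};

template <typename Record>
std::vector<Record> knn_merge(const std::vector<std::vector<Record>> &results,
                              const std::vector<std::vector<double>> &dists,
                              size_t k) {
    std::priority_queue<HeapEntry, std::vector<HeapEntry>,
                        std::greater<>> heap;

    // Seed with the nearest element of each list: the heap never holds
    // more than one entry per partial result list (at most log n total).
    for (size_t i = 0; i < results.size(); i++)
        if (!results[i].empty())
            heap.push({dists[i][0], i, 0});

    std::vector<Record> out;
    while (out.size() < k && !heap.empty()) {
        HeapEntry e = heap.top();
        heap.pop();
        out.push_back(results[e.list][e.offset]);
        if (e.offset + 1 < results[e.list].size())  // next from same list
            heap.push({dists[e.list][e.offset + 1], e.list, e.offset + 1});
    }
    return out;
}
\end{verbatim}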

\subsection{Independent Range Sampling}

The eDSP abstraction also provides sufficient features to implement
IRS, using the same basic approach as was used in the previous
chapter. Unlike KNN, IRS will take advantage of the extended query
interface. Recall from Chapter~\ref{chap:sampling} that the approach used
for answering sampling queries (ignoring the buffer, for now) was,

\begin{enumerate}
    \item Query each shard to establish the weight that should be assigned to the
        shard in sample size assignments.
    \item Build an alias structure over those weights.
    \item For each sample, reference the alias structure to determine which shard
        to sample from, and then draw the sample.
\end{enumerate}

This approach can be mapped easily onto the eDSP interface as follows,
\begin{itemize}
    \item[\texttt{local\_preproc}] Determine and return the total weight of candidate records for
        sampling in the shard.
    \item[\texttt{distribute\_query}] Using the shard weights, construct an alias structure associating
        each shard with its total weight. Then, query this alias structure $k$ times. For shard $i$, the
        local query $\mathscr{Q}_i$ will have its sample size assigned based on how many times $i$ is returned
        during the alias querying (a sketch of this step follows the list).
    \item[\texttt{local\_query}] Process the local query using the underlying data structure's normal sampling
        procedure.
    \item[\texttt{merge}] Union all of the partial results together.
\end{itemize}
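
A minimal sketch of this \texttt{distribute\_query} step follows. For
brevity, it substitutes \texttt{std::discrete\_distribution} for a
hand-built alias structure; both select shard $i$ with probability
proportional to its weight.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Assign a local sample size to each shard by drawing k times from a
// distribution weighted by the shards' candidate-record weights.
std::vector<size_t> assign_sample_sizes(const std::vector<double> &weights,
                                        size_t k, std::mt19937 &rng) {
    std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
    std::vector<size_t> sizes(weights.size(), 0);

    // Shard i's local sample size is the number of times it is drawn,
    // keeping the overall sample unbiased across shards.
    for (size_t j = 0; j < k; j++)
        sizes[pick(rng)]++;
    return sizes;
}
\end{verbatim}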

This division of the query maps closely onto the cost function,
\begin{equation}
    O\left(W(n) + P(n) + kS(n)\right)
\end{equation}
used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$ pre-processing
cost is associated with the cost of \texttt{local\_preproc} and the
$kS(n)$ sampling cost is associated with $\texttt{local\_query}$.
The \texttt{distribute\_query} operation will require $O(\log n)$
time to construct the shard alias structure, and $O(k)$ time to
query it. Accounting then for the fact that \texttt{local\_preproc}
will be called once per shard ($\log n$ times), and that a total of $k$
records will be sampled at a cost of $S(n)$ each, the total query cost
is,
\begin{equation}
    O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right)
\end{equation}
which matches the cost in Equation~\ref{eq:sample-cost}.

\section{Record Identity}

Another important consideration for the framework is support for
deletes, which are important in the context of database systems.
The sampling extension framework supported two techniques
for the deletion of records: tombstone-based deletes and tagging-based
deletes. In both cases, the solution required that the shard support
point lookups, either for checking tombstones or for finding the
record to mark it as deleted. Implicit in this is an important
property of the underlying data structure which was taken for granted
in that work, but which will be made explicit here: record identity.

Delete support requires that each record within the index be uniquely
identifiable, and linkable directly to a location in storage. This 
property is called \emph{record identity}.
In the context of database
indexes, this is not a particularly contentious requirement. Indexes
are already designed to provide a mapping directly to a record in
storage, which (at least in the context of RDBMS) must have a unique
identifier attached. However, in more general contexts, this
requirement will place some restrictions on the applicability of
the framework.

For example, approximate data structures or summaries, such as Bloom
filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch},
do not necessarily store the underlying records.
In principle, some summaries \emph{could} be supported by
normal Bentley-Saxe as there exist mergeable
summaries~\cite{mergeable-summaries}. But because these data structures
violate the record identity property, they would not support deletes
(either in the framework, or Bentley-Saxe). The framework considers
deletes to be a first-class citizen, and this is formalized by
requiring record identity as a property that supported data structures
must have.

\section{The General Framework}

Based on these properties, and the work described in
Chapter~\ref{chap:sampling}, a dynamic extension framework has been devised with
broad support for data structures. It is implemented in C++20, using templates
and concepts to define the necessary interfaces. A user of this framework needs
to provide a definition for their data structure with a prescribed interface
(called a \texttt{shard}), and a definition for their query following an
interface based on the above definition of an eDSP. These two classes can then
be used as template parameters to automatically create a dynamic index, which
exposes methods for inserting and deleting records, as well as executing
queries.
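
As a purely hypothetical usage sketch (every identifier below is an
illustrative stand-in, not the framework's actual name), instantiating
an extension might look as follows,
\begin{verbatim}
#include "framework/dynamic_extension.h"   // assumed header name

// The user-provided shard and query classes become template parameters.
using Index = DynamicExtension<Record, ISAMShard, RangeQuery>;

void example(Index &idx, Record r, RangeQueryParameters p) {
    idx.insert(r);               // buffered insert
    idx.erase(r);                // tombstone or tag, per delete policy
    auto result = idx.query(p);  // eDSP-style query over buffer + shards
}
\end{verbatim}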

\subsection{Framework Design}

\Paragraph{Structure.} The overall design of the general framework
itself is not substantially different from the sampling framework
discussed in Chapter~\ref{chap:sampling}. It consists of a mutable buffer
and a set of levels containing data structures with geometrically
increasing capacities.  The \emph{mutable buffer} is a small unsorted
record array of fixed capacity that buffers incoming inserts. As
the mutable buffer is kept sufficiently small (e.g., small enough to fit
in the L2 CPU cache), the cost of querying it without any auxiliary structures
can be minimized, while still allowing better insertion performance
than Bentley-Saxe, which requires rebuilding an index structure for
each insertion.  The use of an unsorted buffer is necessary to
ensure that the framework doesn't require an existing dynamic version
of the index structure being extended, which would defeat the purpose
of the entire exercise.

The majority of the data within the structure is stored in a sequence
of \emph{levels} with geometrically increasing record capacity,
such that the capacity of level $i$ is $s^{i+1}$, where $s$ is a
configurable parameter called the \emph{scale factor}.  Unlike
Bentley-Saxe, these levels are permitted to be partially full, which
allows significantly more flexibility in terms of how reconstruction
is performed. This also opens up the possibility of allowing each
level to allocate its record capacity across multiple data structures
(named \emph{shards}) rather than just one. This decision is called
the  \emph{layout policy}, with the use of a single structure being
called \emph{leveling}, and multiple structures being called
\emph{tiering}.

\begin{figure}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}}
    \caption{\textbf{An overview of the general structure of the
    dynamic extension framework} using leveling (Figure~\ref{fig:leveling}) and
tiering (Figure~\ref{fig:tiering}) layout policies. The pictured extension has
a scale factor of 3, with $L_0$ being at capacity, and $L_1$ being at
one third capacity. Each shard is shown as a dotted box, wrapping its associated
dataset ($D_i$), data structure ($I_i$), and auxiliary structures $(A_i)$. }
\label{fig:framework}
\end{figure}

\Paragraph{Shards.} The basic building block of the dynamic extension
is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i,
\mathcal{I}_i, A_i)$, which consists of a partition of the data
$\mathcal{D}_i$, an instance of the static index structure being
extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$.
To ensure the viability of level reconstruction, the extended data
structure should at least support a construction method
$\mathtt{build}(\mathcal{D})$ that can build a new static index
from a set of records $\mathcal{D}$ from scratch. This set of records
may come from the mutable buffer, or from a union of underlying
data of multiple other shards. It is also beneficial for $\mathcal{I}_i$
to support efficient point-lookups, which can search for a record's
storage location by its identifier (given by the record identity
requirement of the framework). The shard can also be customized
to provide any necessary features for supporting the index being
extended.  For example, auxiliary data structures like Bloom filters
or hash tables can be added to improve point-lookup performance,
or additional, specialized functions can be provided for use
by the query implementations.

From an implementation standpoint, the shard object provides a shim
between the data structure and the framework itself. At minimum,
it must support the following interface,
\begin{itemize}
    \item $\mathbftt{construct}(B) \to S$ \\
    Construct a new shard from the contents of the mutable buffer, $B$.

    \item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ \\
    Construct a new shard from the records contained within a list of already
    existing shards.

    \item $\mathbftt{point\_lookup}(r) \to *r$ \\
    Search for a record, $r$, by identity and return a reference to its
    location in storage.
\end{itemize}

\Paragraph{Insertion \& deletion.} The framework supports inserting
new records and deleting records already in the index. These two
operations also allow for updates to existing records, by first
deleting the old version and then inserting a new one. These
operations are added by the framework automatically, and require
only a small shim or minor adjustments to the code of the data
structure being extended within the implementation of the shard
object.

Insertions are performed by first wrapping the record to be inserted
with a framework header, and then appending it to the end of the
mutable buffer. If the mutable buffer is full, it is flushed to
create a new shard, which is combined into the first level of the
structure. The level reconstruction process is layout policy
dependent. In the case of leveling, the underlying data of the
source shard and the target shard are combined, resulting in a new
shard that replaces the target shard in the target level. When using
tiering, the newly created shard is simply placed into the target
level. If the target level is full, the framework first triggers a merge
of the shards on the target level, which creates a new shard on the
following level, and then inserts the incoming shard into the now-empty
target level.
Note that each time a new shard is created, the framework must invoke
$\mathtt{build}$ to construct a new index from scratch for this
shard.
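
The tiering reconstruction logic can be sketched roughly as follows.
The \texttt{Level} and \texttt{Shard} types and their methods here are
illustrative assumptions; the framework's actual reconstruction code is
more involved (leveling, tombstone cancellation, compaction, and so on).
\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

// Place a shard at level idx; if that level already holds scale_factor
// shards, first build() one merged shard from its contents and push it
// recursively to the following level, then reuse the now-empty level.
template <typename Shard, typename Level>
void place_shard(std::vector<Level> &levels, Shard s,
                 size_t idx, size_t scale_factor) {
    if (idx == levels.size())
        levels.emplace_back();           // grow the structure as needed
    if (levels[idx].shard_count() == scale_factor) {
        Shard merged(levels[idx].shards());  // invokes the shard's build()
        levels[idx].clear();
        place_shard(levels, std::move(merged), idx + 1, scale_factor);
    }
    levels[idx].add(std::move(s));
}

// A buffer flush builds a shard from the buffer and places it at level 0:
//     place_shard(levels, Shard(buffer), 0, scale_factor);
\end{verbatim}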

The framework supports deletes using two approaches: either by
inserting a special tombstone record or by performing a lookup for
the record to be deleted and setting a bit in the header. This
decision is called the \emph{delete policy}, with the former being
called \emph{tombstone delete} and the latter \emph{tagged delete}.
The framework will automatically filter deleted records from query
results before returning them to the user, either by checking for
the delete tag, or by performing a lookup of each record for an
associated tombstone. The number of deleted records within the
framework can be bounded by canceling tombstones and associated
records when they meet during reconstruction, or by dropping all
tagged records when a shard is reconstructed. The framework also
supports aggressive reconstruction (called \emph{compaction}) to
precisely bound the number of deleted records within the index.
This is useful for certain search problems, as was seen with
sampling queries in Chapter~\ref{chap:sampling}, but is not
generally necessary to bound query cost.

\Paragraph{Design space.} The framework described in this section
has a large design space. In fact, much of the design space has
similar knobs to the well-known LSM Tree~\cite{dayan17}, albeit in
a different environment: the framework targets in-memory static
index structures for general extended decomposable queries without
efficient index merging support, whereas the LSM-tree targets
external range indexes that can be efficiently merged.  

The framework's design trades off among auxiliary memory usage, read performance,
and write performance. The two most significant decisions are the
choice of layout and delete policy. A tiering layout policy reduces
write amplification compared to leveling, requiring each record to
only be written once per level, but increases the number of shards
within the structure, which can hurt query performance. As for
delete policy, the use of tombstones turns deletes into insertions,
which are typically faster. However, depending upon the nature of
the query being executed, the delocalization of the presence
information for a record may result in one extra point lookup for
each record in the result set of a query, vastly reducing read
performance. In these cases, tagging may make more sense. This
results in each delete turning into a slower point-lookup, but
always allows for constant-time visibility checks of records. The
other two major parameters, scale factor and buffer size, can be
used to tune the performance once the policies have been selected.
Generally speaking, larger scale factors result in fewer shards,
but can increase write amplification under leveling.  Large buffer
sizes can adversely affect query performance when an unsorted buffer
is used, while allowing higher update throughput. Because the overall
design of the framework remains largely unchanged, the design space
exploration of Section~\ref{ssec:ds-exp} remains relevant here.

\subsection{The Shard Interface}

The shard object serves as a ``shim'' between a data structure and
the extension framework, providing a set of mandatory functions
which are used by the framework code to facilitate reconstruction
and deleting records. The data structure being extended can be
provided by a different library and included as an attribute via 
composition/aggregation, or can be directly implemented within the 
shard class. Additionally, shards can contain any auxiliary structures,
such as Bloom filters or hash tables, necessary to support the required
interface.

The required interface for a shard object is as follows,
\begin{verbatim}
    new(MutableBuffer) -> Shard
    new(Shard[]) -> Shard
    point_lookup(Record, Boolean) -> Record*
    get_data() -> Record*
    get_record_count() -> Int
    get_tombstone_count() -> Int
    get_memory_usage() -> Int
    get_aux_memory_usage() -> Int
\end{verbatim}

The first two functions are constructors, necessary to build a new Shard
from either an array of other shards (for a reconstruction), or from
a mutable buffer (for a buffer flush).\footnote{
    This is the interface as it currently stands in the existing implementation, but
    is subject to change. In particular, we are considering changing the shard reconstruction
    procedure to allow for only one necessary constructor, with a more general interface. As
    we look to concurrency, being able to construct shards from arbitrary combinations of shards
    and buffers will become convenient, for example.
 } 
The \texttt{point\_lookup} operation is necessary for delete support, and is
used either to locate a record for delete when tagging is used, or to search
for a tombstone associated with a record when tombstones are used. The boolean
argument communicates to the shard whether the lookup is intended to locate a
tombstone or a record, allowing the shard to decide, for example, whether a
point lookup should check a filter before searching, but it could also be used
for other purposes. The \texttt{get\_data} function exposes a pointer to the
beginning of the array of records contained within the shard. It imposes no
restriction on the order of these records, but does require that all records
be accessible sequentially from this pointer, and that the order of records
does not change. The rest of the functions are
accessors for various shard metadata. The record and tombstone count numbers
are used by the framework for reconstruction purposes.\footnote{The record
count includes tombstones as well, so the true record count on a level is
$\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at present,
only exposed directly to the user and have no effect on the framework's
behavior. In the future, these may be used for concurrency control and task
scheduling purposes.

Beyond these, a shard can expose any additional functions that are necessary
for its associated query classes. For example, a shard intended to be used for
range queries might expose upper and lower bound functions, or a shard used for
nearest neighbor search might expose a nearest-neighbor function.
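
One possible way to express this interface in code is as a C++20
concept. The framework does use templates and concepts, but the
following definition is an illustrative reconstruction rather than its
actual source,
\begin{verbatim}
#include <concepts>
#include <cstddef>
#include <vector>

// A sketch of the shard interface as a concept; types and return
// conventions (e.g., Record* for lookups) follow the description above.
template <typename S, typename Record, typename Buffer>
concept ShardInterface = requires(S shard, Buffer &buffer,
                                  const std::vector<S*> &shards,
                                  const Record &rec, bool is_tombstone) {
    S(buffer);                                 // flush constructor
    S(shards);                                 // reconstruction constructor
    { shard.point_lookup(rec, is_tombstone) } -> std::same_as<Record*>;
    { shard.get_data() }                      -> std::same_as<Record*>;
    { shard.get_record_count() }              -> std::convertible_to<size_t>;
    { shard.get_tombstone_count() }           -> std::convertible_to<size_t>;
    { shard.get_memory_usage() }              -> std::convertible_to<size_t>;
    { shard.get_aux_memory_usage() }          -> std::convertible_to<size_t>;
};
\end{verbatim}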

\subsection{The Query Interface}
\label{ssec:fw-query-int}

The required interface for a query in the framework is a bit more
complicated than the interface defined for an eDSP, because the
framework needs to query the mutable buffer as well as the shards.
As a result, there is some slight duplication of functions, with
specialized query and pre-processing routines for both shards and
buffers. Specifically, a query must define the following functions,
\begin{verbatim}
    get_query_state(QueryParameters, Shard) -> ShardState
    get_buffer_query_state(QueryParameters, Buffer) -> BufferState

    process_query_states(QueryParameters, ShardStateList, BufferStateList) -> LocalQueryList

    query(LocalQuery, Shard) -> ResultList
    buffer_query(LocalQuery, Buffer) -> ResultList

    merge(ResultList) -> FinalResult

    delete_query_state(ShardState)
    delete_buffer_query_state(BufferState)

    bool EARLY_ABORT;
    bool SKIP_DELETE_FILTER;
\end{verbatim}

The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state} functions
map to the \texttt{local\_preproc} operation of the eDSP definition for shards
and buffers respectively. \texttt{process\_query\_states} serves the function
of \texttt{distribute\_query}. Note that this function takes a list of buffer
states; although the proposed framework above contains only a single buffer,
future support for concurrency will require multiple buffers, and so the
interface is set up with support for this. The \texttt{query} and
\texttt{buffer\_query} functions execute the local query against the shard or
buffer and return the intermediate results, which are merged using
\texttt{merge} into a final result set. The \texttt{EARLY\_ABORT} parameter can
be set to \texttt{true} to force the framework to immediately return as soon as
the first result is found, rather than querying the entire structure, and the
\texttt{SKIP\_DELETE\_FILTER} flag disables the framework's automatic delete
filtering, allowing deletes to be manually handled within the \texttt{merge}
function by the developer. These flags exist to allow for optimizations for
certain types of query. For example, point-lookups can take advantage of
\texttt{EARLY\_ABORT} to stop as soon as a match is found, and
\texttt{SKIP\_DELETE\_FILTER} can be used for more efficient tombstone delete
handling in range queries, where tombstones for results will always be in the
\texttt{ResultList}s going into \texttt{merge}.

The framework itself answers queries by simply calling these routines in 
a prescribed order,
\begin{verbatim}
query(QueryArguments qa) BEGIN
    FOR i < BufferCount DO
        BufferStates[i] = get_buffer_query_state(qa, Buffers[i])
    DONE

    FOR i < ShardCount DO
        ShardStates[i] = get_query_state(qa, Shards[i])
    DONE

    LocalQueries = process_query_states(qa, ShardStates, BufferStates)

    FOR i < BufferCount DO 
        temp = buffer_query(LocalQueries[i], Buffers[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i] = temp

        IF EARLY_ABORT AND Results[i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    FOR i < ShardCount DO
        temp = query(LocalQueries[BufferCount + i], Shards[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i + BufferCount] = temp
        IF EARLY_ABORT AND Results[i + BufferCount].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    delete_states(ShardStates, BufferStates)
    return merge(Results)
END
\end{verbatim}

\subsubsection{Standardized Queries}

Provided with the framework are several ``standardized'' query classes, including
point lookup, range query, and IRS. These queries can be freely applied to any
shard class that implements the necessary optional interfaces. For example, the
provided IRS and range query classes both require the shard to implement
\texttt{lower\_bound} and \texttt{upper\_bound} functions that return indexes
into the record array exposed via
\texttt{get\_data}. This is convenient, because it helps to separate the search
problem from the data structure, and moves towards presenting these two objects
as orthogonal.

In the next section, the framework is evaluated by producing a number of
indexes for three different search problems. Specifically, the framework is
applied to a pair of learned indexes, as well as an ISAM-tree. All three of
these shards provide the bound functions described above, meaning that the
same range query
class can be used for all of them. It also means that the learned indexes
automatically have support for IRS. And, of course, they also all can be used
with the provided point-lookup query, which simply uses the required
\texttt{point\_lookup} function of the shard.

At present, the framework only supports associating a single query class with
an index. However, this is simply a limitation of implementation. In the future,
approaches will be considered for associating arbitrary query classes to allow
truly multi-purpose indexes to be constructed. This is not to say that every
data structure will necessarily be efficient at answering every type of query
that could be answered using its interface. However, in a database system,
being able to repurpose an existing index to accelerate a wide range of query
types would certainly seem worth considering.

\section{Framework Evaluation}

The framework was evaluated using three different types of search problem:
range-count, high-dimensional k-nearest neighbor, and independent range
sampling. In all three cases, an extended static data structure was compared
with dynamic alternatives for the same search problem to demonstrate the
framework's competitiveness.

\subsection{Methodology} 

All tests were performed using Ubuntu 22.04
LTS on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of
installed memory and 40 physical cores. Benchmark code was compiled
using \texttt{gcc} version 11.3.0 at the \texttt{-O3} optimization level.


\subsection{Range Queries}

A first test evaluates the performance of the framework in the context of
range queries against learned indexes. In Chapter~\ref{chap:intro}, the
lengthy development cycle of this sort of data structure was discussed,
and so learned indexes were selected as an evaluation candidate to demonstrate
how this framework could allow such development lifecycles to be largely
bypassed.

Specifically, the framework is used to produce dynamic learned indexes based on
TrieSpline~\cite{plex} (DE-TS) and the static version of PGM~\cite{pgm} (DE-PGM). These
are both single-pass construction static learned indexes, and thus well suited for use
within this framework compared to more complex structures like RMI~\cite{RMI}, which have
more expensive construction algorithms. The two framework-extended data structures are
compared with dynamic learned indexes, namely ALEX~\cite{ALEX} and the dynamic version of
PGM~\cite{pgm}. PGM provides an interesting comparison, as its native
dynamic version was implemented using a slightly modified version of the Bentley-Saxe method.

When performing range queries over large data sets, the
copying of query results can introduce significant overhead. Because the four
tested structures have different data copy behaviors, a range count query was
used for testing, rather than a pure range query. This search problem exposes
the searching performance of the data structures, while controlling for different
data copy behaviors, and so should provide more directly comparable results.

Range count
queries were executed with a selectivity of $0.01\%$ against three datasets
from the SOSD benchmark~\cite{sosd-datasets}: \texttt{book}, \texttt{fb}, and
\texttt{osm}, each of which has 200 million 64-bit keys following a variety of
distributions, paired with uniquely generated 64-bit values. There
is a fourth dataset in SOSD, \texttt{wiki}, which was excluded from testing
because it contained duplicate keys, which are not supported by dynamic
PGM.\footnote{The dynamic version of PGM supports deletes using tombstones,
but doesn't wrap records with a header to accomplish this. Instead it reserves
one possible value to represent a tombstone. Records are deleted by inserting a
record with the same key and this reserved value. This means that duplicate
keys, even if they have different values, are unsupported as two records with
the same key will be treated as a delete by the index.~\cite{pgm} }

The shard implementations for DE-PGM and DE-TS required about 300 lines of
C++ code each, and no modification to the data structures themselves. For both
data structures, the framework was configured with a buffer of 12,000 records, a scale
factor of 8, the tombstone delete policy, and tiering. Each shard stored $D_i$
as a sorted array of records, used an instance of the learned index for
$\mathcal{I}_i$, and had no auxiliary structures. The local query routine used
the learned index to locate the first key in the query range and then iterated
over the sorted array until the end of the range was reached, counting the
records and tombstones encountered. The mutable buffer query performed
the same counting using a full scan. No local preprocessing was needed, and the merge
operation simply summed the record and tombstone counts, and returned their
difference.
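
A sketch of this local query and merge logic might look as follows; the
record layout and shard methods are assumptions for illustration, with
\texttt{lower\_bound} returning the position of the first key no less
than the query's lower bound, as located by the learned index.
\begin{verbatim}
#include <cstddef>

struct PartialCount { size_t records = 0; size_t tombstones = 0; };

// Local query: scan the sorted run from the learned index's predicted
// start position to the end of the query range, counting as we go.
template <typename Shard, typename Key>
PartialCount local_range_count(const Shard &shard, Key lower, Key upper) {
    PartialCount out;
    const auto *rec = shard.get_data() + shard.lower_bound(lower);
    const auto *end = shard.get_data() + shard.get_record_count();

    for (; rec != end && rec->key <= upper; rec++) {
        if (rec->is_tombstone()) out.tombstones++;
        else                     out.records++;
    }
    return out;
}

// Merge: sum the partial counts and return their difference.
template <typename Iter>
size_t merge_counts(Iter begin, Iter end) {
    size_t recs = 0, ts = 0;
    for (auto it = begin; it != end; ++it) {
        recs += it->records;
        ts   += it->tombstones;
    }
    return recs - ts;
}
\end{verbatim}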

\begin{figure*}[t]
    \centering
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
    \subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
    \caption{Range Count Evaluation}
    \label{fig:results1}
\end{figure*}

Figure~\ref{fig:rq-insert} shows the update throughput of all competitors. ALEX
performs the worst in all cases, and PGM performs the best, with the extended
indexes falling in the middle. It is not unexpected that PGM performs better
than the framework, because the Bentley-Saxe extension in PGM is custom-built,
and thus has a tighter integration than a general framework would allow.
However, even with this advantage, DE-PGM still reaches up to 85\% of PGM's
insertion throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM
pays a large cost in query latency for its advantage in insertion, with the
framework extended indexes significantly outperforming it. Further, DE-TS even
outperforms ALEX for query latency in some cases. Finally,
Figure~\ref{fig:idx-space} shows the storage cost of the indexes, without
counting the space necessary to store the records themselves. The storage cost
of a learned index is fairly variable, as it is largely a function of the
distribution of the data, but in all cases, the extended learned
indexes, which build compact data arrays without gaps, occupy three orders of
magnitude less storage than ALEX, which requires leaving gaps in the data
arrays.

\subsection{High-Dimensional k-Nearest Neighbor} 
The next test evaluates the framework for the extension of high-dimensional 
metric indexes for the k-nearest neighbor search problem. An M-tree~\cite{mtree}
was used as the dynamic baseline,\footnote{
    Specifically, the M-tree implementation tested can be found at \url{https://github.com/dbrumbaugh/M-Tree}
    and is a fork of a structure written originally by Eduardo D'Avila, modified to compile under C++20. The
    tree uses a random selection algorithm for ball splitting.
} and a VPTree~\cite{vptree} as the static structure. The framework was used to
extend VPTree to produce the dynamic version, DE-VPTree.
An M-Tree is a tree that partitions records based on
high-dimensional spheres and supports updates by splitting and merging these
partitions. 
A VPTree is a binary tree that is produced by recursively selecting
a point, called the vantage point, and partitioning records based on their
distance from that point. This results in a difficult-to-modify structure that
can be constructed in $O(n \log n)$ time and can answer KNN queries in $O(k
\log n)$ time.

DE-VPTree used a buffer of 12,000 records, a scale factor of 6, tiering, and
delete tagging. The query was implemented without a pre-processing step, using
the standard VPTree algorithm for  KNN queries against each shard.  All $k$
records were determined for each shard, and then the merge operation used a
heap to merge the result sets together and return the $k$ nearest neighbors
from the $k\log(n)$ intermediate results. This is a type of query that, even
with the framework's expanded query interface, pays a non-constant merge cost
of $O(k \log k)$. In effect, the KNN query must be answered twice: once for
each shard to get the intermediate result sets, and then a second time within
the merge operation to select the $k$ nearest neighbors from the result sets.

\begin{figure}
    \centering
    \includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
    \caption{KNN Index Evaluation}
    \label{fig:knn}
\end{figure}
Euclidean distance was used as the metric for both structures, and $k=1000$ was
used for all queries. The reference point for each query was selected randomly
from points within the dataset. Tests were run using the Spanish Billion Words
dataset~\cite{sbw}, which consists of 300-dimensional vectors. The results are shown in
Figure~\ref{fig:knn}. In this case, the static nature of the VPTree allows it
to dominate the M-Tree in query latency, and the simpler reconstruction
procedure shows a significant insertion performance improvement as well.

\subsection{Independent Range Sampling} 
Finally, the
framework was tested using one-dimensional IRS queries. As before,
a static ISAM-tree was used as the data structure to be extended,
however the sampling query was implemented using the query interface from
Section~\ref{ssec:fw-query-int}. For each shard, the pre-processing step
identifies the first and last records falling into the range to be sampled
from, and determines the total weight of that range. Then, in the local query
generation step, these weights are used to construct an alias structure, which
is used to assign sample sizes to each shard based on weight, to avoid
introducing skew into the results. After this, the query routine generates
random indexes between the established bounds to sample records, and the merge
operation appends the individual result sets together. This procedure
requires only a pair of tree traversals per shard, regardless of how many
samples are taken.
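
The per-shard sampling step can be sketched as follows, assuming the
pre-processing step has already located the index bounds
$[\mathtt{start}, \mathtt{stop})$ of the query range within the shard's
sorted data array; names are illustrative,
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Draw k_i records uniformly from positions [start, stop) of the
// shard's sorted data array; each sample is a single O(1) draw.
template <typename Record>
std::vector<Record> local_sample(const Record *data, size_t start,
                                 size_t stop, size_t k_i,
                                 std::mt19937 &rng) {
    std::uniform_int_distribution<size_t> pick(start, stop - 1);
    std::vector<Record> out;
    out.reserve(k_i);
    for (size_t j = 0; j < k_i; j++)
        out.push_back(data[pick(rng)]);
    return out;
}
\end{verbatim}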

\begin{figure}
    \centering
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
    \caption{IRS Index Evaluation}
    \label{fig:results2}
\end{figure}

The extended ISAM structure (DE-IRS) was compared to a B$^+$-Tree
with aggregate weight tags on internal nodes (AGG B+Tree) for sampling
and insertion performance, and to a single instance of the static ISAM-tree (ISAM), 
which does not support updates. DE-IRS was configured with a buffer size
of 12,000 records, a scale factor of 6, tiering, and delete tagging. The IRS
queries had a selectivity of $0.1\%$ with sample size of $k=1000$. Testing
was performed using the same datasets as were used for range queries.

Figure~\ref{fig:irs-query}
shows the significant latency advantage that the dynamically extended ISAM tree
enjoys compared to a B$^+$-Tree. DE-IRS is up to 23 times faster than the B$^+$-Tree at
answering sampling queries, and only about 3 times slower than the fully static
solution.  In this case, the extra query cost caused by needing to query
multiple structures is more than balanced by the query efficiency of each of
those structures, relative to tree sampling.  Interestingly, the framework also
results in better update performance compared to the B$^+$-Tree, as shown in
Figure~\ref{fig:irs-insert}. This is likely because the ISAM shards can be
efficiently constructed using a combination of sorted-merge operations and
bulk-loading, and avoid expensive structural modification operations that are
necessary for maintaining a B$^+$-Tree.

\subsection{Discussion} 


The results demonstrate not only that the framework's update support is
competitive with custom-built dynamic data structures, but that the framework
is even able to, in many cases, retain some of the query performance advantage 
of its extended static data structure. This is particularly evident in the k-nearest
neighbor and independent range sampling tests, where the static version of the
structure was directly tested as well. These tests demonstrate one of the advantages
of static data structures: they are able to maintain much tighter inter-record relationships
than dynamic ones, because update support typically requires relaxing these relationships
to make the structure easier to modify. While the framework introduces the overhead of querying
multiple structures and merging them together, it is clear from the results that this overhead
is generally less than the overhead incurred by the update support techniques used
in the dynamic structures. The only case where the framework was bested in
query performance was against ALEX, and even there the resulting query
latencies were comparable.

It is also evident that the update support provided by the framework is on par
with, if not superior to, that provided by the dynamic baselines, at least in
terms of throughput. The framework will certainly suffer from larger tail
latency spikes, which weren't measured in this round of testing, due to the
larger scale of its reconstructions, but the amortization of these costs over
a large number of inserts allows for the maintenance of a respectable level of
throughput. In fact, the only case where the framework loses in insertion
throughput is against the dynamic PGM. However, an examination of the query
latency reveals that this is likely because the standard configuration of the
Bentley-Saxe variant used by PGM is highly tuned for insertion performance, as
the query latencies against this data structure are far worse than those of
any other learned index tested, so even this result shouldn't be taken as a
``clear'' defeat of the framework's implementation.

Overall, it is clear from this evaluation that the dynamic extension framework
is a promising alternative to manual index redesign for accommodating updates.
In almost all cases, the framework-extended static data structures provided
superior insertion throughput, and query latencies that either matched or
exceeded those of the dynamic baselines. Additionally, though it is hard to
quantify, the code complexity of the framework-extended data structures was
much lower, with the shard implementations requiring only a small amount of
relatively straightforward code to interface with pre-existing static data
structures, or with the necessary data structure implementations themselves
being simpler.

\section{Conclusion}

In this chapter, a generalized version of the framework originally proposed in
Chapter~\ref{chap:sampling} was presented. This framework is based on two key
properties: extended decomposability and record identity. It is capable of
extending any data structure and search problem supporting these two properties
with support for inserts and deletes. An evaluation of this framework was
performed by extending several static data structures and comparing the
resulting structures' performance against dynamic baselines capable of
answering the same type of search problem. The extended structures generally
performed as well as, if not better than, their dynamic baselines in query
performance, insert performance, or both. This demonstrates the capability of
this framework to produce viable indexes in a variety of contexts. However,
the framework is not yet complete. In the next chapter, the work required to
bring it to completion will be described.