\chapter{Controlling Insertion Tail Latency}
\label{chap:tail-latency}

\section{Introduction}

\begin{figure}
\subfloat[Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-tput.pdf} \label{fig:tl-btree-isam-tput}} 
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/btree-dist.pdf} \label{fig:tl-btree-isam-lat}} \\
\caption{Insertion Performance of Dynamized ISAM vs. B+Tree}
\label{fig:tl-btree-isam}
\end{figure}

Up to this point in our investigation, we have not directly addressed
one of the largest problems associated with dynamization: insertion tail
latency. While our dynamization techniques are capable of producing
structures with good overall insertion throughput, the latency of
individual inserts is highly variable.  To illustrate this problem,
consider the insertion performance in Figure~\ref{fig:tl-btree-isam},
which compares the insertion latencies of a dynamized ISAM tree with
those of its most direct dynamic analog: a B+Tree. While, as shown
in Figure~\ref{fig:tl-btree-isam-tput}, the dynamized structure has
superior average performance to the native dynamic structure, the
latency distributions, shown in Figure~\ref{fig:tl-btree-isam-lat}, are
quite different. The dynamized structure has much better best-case
performance, but the worst-case performance is exceedingly poor.

This poor worst-case performance is a direct consequence of the different
approaches used by the dynamized structure and B+Tree to support updates.
B+Trees use a form of amortized local reconstruction, whereas the
dynamized ISAM tree uses amortized global reconstruction. Because the
B+Tree only reconstructs the portions of the structure ``local'' to the
update, even in the worst case only a small part of the data structure
will need to be adjusted. However, when using global reconstruction
based techniques, the worst-case insert requires rebuilding either the
entirety of the structure (for tiering or BSM), or at least a very large
proportion of it (for leveling). Because our dynamization technique
uses buffering, and because the logarithmic decomposition used to
partition the data keeps most of the shards involved in any given
reconstruction small, the majority of inserts are cheap compared to the
B+Tree. At the extreme end of the latency distribution, though, the
local reconstruction strategy used by the B+Tree results in significantly
better worst-case performance.

Unfortunately, the design space that we have been considering is
limited in its ability to meaningfully alter the worst-case insertion
performance. Leveling requires only a fraction of, rather than all of, the
records in the structure to participate in its worst-case reconstruction,
and as a result shows slightly reduced worst-case insertion cost compared
to the other layout policies. However, this effect only results in a
single-digit factor reduction in measured worst-case latency, and has
no effect on the insertion latency distribution itself outside of the
absolute maximum value. Additionally, we've shown that leveling performed
significantly worse in average insertion performance compared to tiering,
and so its usefulness as a tool to reduce insertion tail latencies is
questionable at best.

\begin{figure}
\subfloat[Varied Scale Factors and Fixed Buffer Size]{\includegraphics[width=.5\textwidth]{img/tail-latency/scale_factor_sweep_standard.pdf} \label{fig:tl-parm-sf}} 
\subfloat[Varied Buffer Sizes and Fixed Scale Factor]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer_sweep_standard.pdf} \label{fig:tl-parm-bs}} \\
\caption{Design Space Effects on Latency Distribution}
\label{fig:tl-parm-sweep}
\end{figure}

Our existing framework supports two other tuning parameters: the scale
factor and the buffer size. However, neither of these parameters
is of much use in adjusting the worst-case insertion behavior. We
demonstrate this experimentally in Figure~\ref{fig:tl-parm-sweep},
which shows the latency distributions of our framework as we vary
the scale factor (Figure~\ref{fig:tl-parm-sf}) and buffer size
(Figure~\ref{fig:tl-parm-bs}) respectively. There is no clear trend in
worst-case performance to be seen here. Adjusting the scale factor does
have an effect on the distribution, but not in a way that is particularly
useful from a configuration standpoint, and adjusting the mutable buffer
size has almost no effect on either the distribution itself or the worst-case
insertion latency, particularly when tiering is used.  This is to be
expected; ultimately the worst-case reconstruction size is largely the
same irrespective of scale factor or buffer size:  $\Theta(n)$ records.

The selection of configuration parameters does influence \emph{when}
a worst-case reconstruction occurs, and can slightly affect its size.
Ultimately, however, the answer to the question of which configuration
has the best insertion tail latency performance is more a matter of how
many records the insertion latencies are measured over than it is one of
any fundamental design trade-offs within the space.  This is exemplified
rather well in Figure~\ref{fig:tl-parm-sf}. Each of the ``shelves'' in
the distribution corresponds to reconstructions on a particular level. As
can be seen, the lines cross each other repeatedly at these shelves. These
cross-overs are points at which one configuration begins to exhibit
better tail latency behavior than another. However, after enough records
have been inserted, the next largest reconstructions will begin to
occur. This will make the ``better'' configuration appear worse in
terms of tail latency.\footnote{
	This plot also shows a notable difference between leveling
	and tiering.  In the tiering configurations, the transitions
	between the shelves are steep and abrupt, whereas in leveling,
	the transitions are smoother, particularly as the scale factor
	increases. These smoother curves show the write amplification
	of leveling, where the largest shards are not created ``fully
	formed'' as they are in tiering, but rather are built over a
	series of merges.  This slower growth results in the smoother
	transitions. Note also that these curves are convex, which is
	\emph{bad} on this plot, as it implies a higher probability of
	a high-latency reconstruction.
}

It seems apparent that, to resolve the problem of insertion tail latency,
we will need to look beyond the design space we have thus far considered.
In this chapter, we do just this, and propose a new mechanism for
controlling reconstructions that leverages parallelism to provide
similar amortized insertion and query performance characteristics, but
also allows for significantly better insertion tail latencies. We will
demonstrate mathematically that our new technique is capable of matching
the query performance of the tiering layout policy, describe a practical
implementation of these ideas, and then evaluate that prototype system
to demonstrate that the theoretical trade-offs are achievable in practice.

\section{The Insertion-Query Trade-off}
\label{sec:tl-insert-query-tradeoff}
Reconstructions lie at the heart of the insertion tail latency problem,
and so it seems worth taking a moment to consider \emph{why} they occur
at all.  Fundamentally, decomposition-based dynamization techniques
trade between insertion and query performance by controlling the number
of blocks in the decomposition. Placing a bound on this number is
necessary to bound the worst-case query cost, and this bound is enforced
using reconstructions to either merge (in the case of the Bentley-Saxe
method) or re-partition (in the case of the equal block method) the
blocks. Performing less frequent (or smaller) reconstructions reduces
the amount of work associated with inserts, at the cost of allowing more
blocks to accumulate, thereby hurting query performance.

This trade-off between insertion and query performance by way of block
count is most directly visible in the equal block method described in
Section~\ref{ssec:ebm}. We will consider a variant of the equal block
method here for which $f(n) \in \Theta(1)$, resulting in a dynamization
that does not perform any re-partitioning. In this case, the technique
provides the following worst-case insertion and query bounds,
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
\end{align*}
where $f(n)$ is the number of blocks.

\begin{figure}
\centering
\subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}} 
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-latency-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\

\caption{The equal block method with $f(n) = C$ for varying values of $C$.}
\label{fig:tl-ebm}
\end{figure}

Unlike the design space we have proposed in
Chapter~\ref{chap:design-space}, the equal block method allows for
\emph{both} trading off between insert and query performance, \emph{and}
controlling the tail latency. Figure~\ref{fig:tl-ebm} shows the results
of testing an implementation of a dynamized ISAM tree using the equal block
method, with
\begin{equation*}
f(n) = C
\end{equation*}
for varying constant values of $C$. As noted above, this special case of
the equal block method allows for re-partitioning costs to be avoided
entirely, resulting in a very clean trade-off space. This result isn't
necessarily representative of the real-world performance of the equal
block method, but it does demonstrate the relevant properties
of the method in the clearest possible manner.

Figure~\ref{fig:tl-ebm-tail-latency} shows that the equal block
method provides a very direct relationship between the tail latency
and the number of blocks. The worst-case insertion performance is
dictated by the size of the largest reconstruction. Increasing the
block count reduces the size of each block, and so improves the
insertion performance. Figure~\ref{fig:tl-ebm-tradeoff} shows that
these improvements to tail latency performance translate directly into
an improvement in the overall insertion throughput as well, at the
cost of worse query latencies.  In contrast to our Bentley-Saxe inspired
dynamization system, this formulation of the equal block method provides
direct control over insertion tail latency, as well as a much cleaner
relationship between average insertion and query performance.

While these results are promising, the equal block method is not well
suited for our purposes. Despite having a clean trade-off space and
control over insertion tail latencies, this technique is strictly
worse than our existing dynamization system in every way but tail
latency control.  Comparing Figure~\ref{fig:tl-ebm-tradeoff} with
Figure~\ref{fig:design-tradeoff} shows that, for a specified query
latency, a logarithmic decomposition provides significantly better
insertion throughput.  Our technique uses geometric growth of block sizes,
which ensures that most reconstructions are smaller than those in the
equal block method for an equivalent number of blocks. This comes at
the cost of needing to occasionally perform large reconstructions to
compact these smaller blocks, resulting in the poor tail latencies we
are attempting to resolve.  Thus, it seems as though poor tail latency
is concomitant with good average performance.

Despite this, let's consider an approach to reconstruction within our
framework that optimizes for insertion tail latency exclusively using
the equal block method, neglecting any considerations of maintaining
a shard count bound. We can consider a variant of the equal block
method having $f(n) = \frac{n}{N_B}$. This case, like
the $f(n) = C$ approach considered above, avoids all re-partitioning,
because records are flushed into the dynamization in sets of size $N_B$,
and so each new block is always exactly full on creation. In effect, this
technique has no reconstructions at all. Each buffer flush simply creates
a new block that is added to an ever-growing list. This produces a system
with worst-case insertion and query costs of,
\begin{align*}
	I(n) &\in \Theta(B(N_B)) \\
	\mathscr{Q}(n) &\in O (n\cdot \mathscr{Q}_S(N_B))
\end{align*}
where the worst-case insertion is simply the cost of a buffer flush,
and the worst-case query cost follows from the fact that there will
be $\Theta\left(\frac{n}{N_B}\right)$ shards in the dynamization, each
of which will have exactly $N_B$ records.\footnote{
	We are neglecting the cost of querying the buffer in this cost function
	for simplicity.
}
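
To make this concrete, the following is a minimal C++ sketch of the
reconstructionless variant, assuming a hypothetical \texttt{Shard}
interface that exposes a \texttt{build} routine over a batch of records
and a per-shard \texttt{lookup}; it illustrates the scheme rather than
reproducing code from our framework. Each buffer flush builds a new
shard and appends it to an ever-growing list, and a query must probe
every shard.

\begin{verbatim}
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical interface: Shard::build(records) -> Shard,
// shard.lookup(key) -> std::optional<Record>.
template <typename Shard, typename Record>
class ReconstructionlessDynamization {
public:
    explicit ReconstructionlessDynamization(size_t n_b) : n_b_(n_b) {}

    void insert(const Record &r) {
        buffer_.push_back(r);
        if (buffer_.size() == n_b_) {
            // "Flush": build a shard from the full buffer and append
            // it to the shard list. No merging is ever performed.
            shards_.push_back(Shard::build(buffer_));
            buffer_.clear();
        }
    }

    // Queries must visit all Theta(n / N_B) shards, which is the
    // source of the poor query latencies observed below.
    std::optional<Record> lookup(const Record &key) const {
        for (const auto &s : shards_)
            if (auto hit = s.lookup(key)) return hit;
        return std::nullopt;
    }

private:
    size_t n_b_;                  // buffer capacity, N_B
    std::vector<Record> buffer_;  // mutable buffer
    std::vector<Shard> shards_;   // ever-growing list of blocks
};
\end{verbatim}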

Applying this technique to an ISAM tree and comparing it against a
B+Tree yields the insertion and query latency distributions shown
in Figure~\ref{fig:tl-floodl0}.  Figure~\ref{fig:tl-floodl0-insert}
shows that it is possible to obtain insertion latency distributions
using amortized global reconstruction that are directly comparable to
dynamic structures based on amortized local reconstruction.  However,
this performance comes at the cost of queries, which are incredibly slow
compared to B+Trees, as shown in Figure~\ref{fig:tl-floodl0-query}.

\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-insert.pdf} \label{fig:tl-floodl0-insert}} 
\subfloat[Query Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/floodl0-query.pdf} \label{fig:tl-floodl0-query}} \\

\caption{Latency Distributions for a Reconstructionless Dynamization}
\label{fig:tl-floodl0}
\end{figure}


On its own, this technique degrades query latency far too much to be
useful in any scenario involving a meaningful number of queries.
However, it does demonstrate that, in scenarios
where insertion doesn't need to block on reconstructions, it is
possible to obtain significantly improved insertion tail latency
distributions. Unfortunately, it also shows that using reconstructions to
enforce the structural invariants that control the number of blocks is
critical for query performance. In the next section, we will consider
approaches to reconstruction that maintain the structural invariants of
our dynamization without directly blocking inserts, with the goal of
reducing the worst-case insertion cost in line with the approach we have
just discussed while maintaining worst-case query bounds similar to those
of our existing dynamization system.

\section{Relaxed Reconstruction}

Reconstructions are necessary to maintain the structural invariants of
our dynamization, which are themselves required to maintain bounds on
worst-case query performance. Inserts are what cause the structure to
violate these invariants, and so it makes sense to attach reconstructions
to the insertion process to allow a strict maintenance of these
invariants. However, it is possible to take a more relaxed approach
to maintaining these invariants using concurrency, allowing for the same
shard bound to be enforced at a much lower worst-case insertion cost.

There does exist theoretical work in this area, which we've already
discussed in Section~\ref{ssec:bsm-worst-optimal}. The gist of this
technique is to relax the strict binary decomposition of the Bentley-Saxe
method to allow multiple reconstructions to occur at once, and to add
a buffer to each level to contain a partially built structure. Then,
reconstructions are split up into small batches of operations,
which are attached to each insert that is issued up to the moment
when the reconstruction must be complete. By doing this, the work
of the reconstructions is spread out across many inserts, in effect
removing the need to block for large reconstructions.  Theoretically,
the total throughput should remain about the same when doing this, but
rather than a bursty latency distribution with many fast inserts
and a small number of incredibly slow ones, the distribution should be
far more uniform.~\cite{overmars81}

Unfortunately, this technique has a number of limitations that we
discussed in Section~\ref{ssec:bsm-tail-latency-problem}. It effectively
reduces to manually multiplexing a single thread to perform a highly
controlled, concurrent, reconstruction process. This requires the ability
to evenly divide up the work of building a data structure and somehow
attach these operations to individual inserts. This makes it ill-suited
for our general framework, because, even when the construction can be
split apart into small independent chunks, implementing it requires a
significant amount of manual adjustment to the data structure construction
processes.

In this section, we will propose an alternative approach for implementing
a similar idea using multi-threading and prove that we can achieve,
in principle, the same worst-case insertion and query costs in a far
more general and easily implementable manner. We'll then show how we
can further leverage parallelism on top of our approach to obtain
\emph{better} worst-case bounds, assuming sufficient resources are
available.


\Paragraph{Layout Policies.} One important aspect of the selection of
layout policy that has not been considered up to now, but will soon
become very relevant, is the degree of reconstruction concurrency
afforded by each policy. Because different layout policies perform
reconstructions differently, there are significant differences in the
number of reconstructions that can be performed concurrently in each one.
Note that in previous chapters, we used the term \emph{reconstruction}
broadly to refer to all operations performed on the dynamization to
maintain its structural invariants as the result of a single insert. Here,
we instead use the term to refer to a single call to \texttt{build}.
\begin{itemize}
	\item \textbf{Leveling.} \\
	Our leveling layout policy performs a single \texttt{build}
	operation involving shards from at most two levels, as well as
	flushing the buffer. Thus, at best, there can be two concurrent
	operations: the \texttt{build} and the flush. If we
	were to proactively perform reconstructions, each \texttt{build}
	would require shards from two levels, and so the maximum number
	of concurrent reconstructions is half the number of levels,
	plus the flush.
	
	\item \textbf{Tiering.} \\
	In our tiering policy, it may be necessary to perform one \texttt{build}
	operation per level. Each of these reconstructions involves only shards
	from that level. As a result, at most one reconstruction per level
	(as well as the flush) can proceed concurrently.

	\item \textbf{BSM.} \\
	The Bentley-Saxe method is highly eager, and merges all relevant
	shards, plus the buffer, in a single call to \texttt{build}. As a result,
	no concurrency is possible.
\end{itemize}

We will be restricting ourselves in this chapter to the tiering layout
policy. Tiering provides the most opportunities for concurrency
and (assuming sufficient resources) parallelism. Because a given
reconstruction only requires shards from a single level, using tiering
also makes synchronization significantly easier, and it provides us
with largest window to preemptively schedule reconstructions. Most
of our discussion in this chapter could also be applied to leveling,
albeit with worse results. However, BSM \emph{cannot} be used at all.

\Paragraph{Nomenclature.} For the discussion that follows, it will
be convenient to define a few terms for discussing levels relative to
each other. While these are all fairly straightforward, to alleviate any
potential confusion, we'll define them all explicitly here.  We define the
term \emph{last level}, $i = \ell$, to mean the level in the dynamized
structure with the largest index value (and thereby the most records)
and \emph{first level} to mean the level with index $i=0$. Any level
with $0 < i < \ell$ is called an \emph{internal level}. A reconstruction
on level $i$ involves the combination of all blocks on that level into
a single larger block, which is then appended to level $i+1$. Relative to some
level at index $i$, the \emph{next level} is the level at index $i +
1$, and the \emph{previous level} is at index $i-1$.

\subsection{Concurrent Reconstructions}

Our proposed approach is as follows. We will fully detach reconstructions
from buffer flushes. When the buffer fills, it will immediately flush
and a new block will be placed in the first level. Reconstructions
will be performed in the background to maintain the internal structure
according to the tiering policy. When a level contains $s$ blocks,
a reconstruction will immediately be triggered to merge these blocks
and push the result down to the next level. To ensure that the number
of blocks in the structure remains bounded by $\Theta(\log_s n)$, we
will throttle the insertion rate by adding a stall time, $\delta$, to
each insert. $\delta$ will be determined such that it is sufficiently
large to ensure that any scheduled reconstructions have enough time to
complete before the shard count on any level exceeds $s$. This process
is summarized in Algorithm~\ref{alg:tl-relaxed-recon}.

\begin{algorithm}
\caption{Relaxed Reconstruction Algorithm with Insertion Stalling}
\label{alg:tl-relaxed-recon}
\KwIn{$r$: a record to be inserted, $\mathscr{I} = (\mathcal{B}, \mathscr{L}_0 \ldots \mathscr{L}_\ell)$: a dynamized structure, $\delta$: insertion stall amount}

\Comment{Stall insertion process by specified amount}
sleep($\delta$) \;
\BlankLine
\Comment{Append to the buffer if possible}
\If {$|\mathcal{B}| < N_B$} {
	$\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \;
	\Return \;
}

\BlankLine
\Comment{Schedule any necessary reconstructions on background threads}
\For {$\mathscr{L} \in \mathscr{I}$} {
	\If {$|\mathscr{L}| = s$} {
		$\text{schedule\_reconstruction}(\mathscr{L})$ \;
	}
}

\BlankLine
\Comment{Perform the flush}
$\mathscr{L}_0 \gets \mathscr{L}_0 \cup \{\text{build}(\mathcal{B})\}$ \;
$\mathcal{B} \gets \emptyset$ \;

\BlankLine
\Comment{Append to the now empty buffer}
$\mathcal{B} \gets \mathcal{B} \cup \{r\}$ \;
\Return \;
\end{algorithm}
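
As a concrete illustration, a minimal C++ sketch of the insert path
from Algorithm~\ref{alg:tl-relaxed-recon} might look as follows. The
\texttt{Structure} interface (buffer access, level iteration,
\texttt{schedule\_reconstruction}, and the flush routine) is
hypothetical, and the stall amount $\delta$ is assumed to be computed
externally, as described in the analysis below.

\begin{verbatim}
#include <chrono>
#include <thread>

// Hypothetical interface for the dynamized structure; delta is the
// externally computed insertion stall amount.
template <typename Structure, typename Record>
void stalled_insert(Structure &dyn, const Record &r,
                    std::chrono::microseconds delta) {
    // Stall the inserting thread, ceding the execution unit to any
    // in-progress background reconstructions.
    std::this_thread::sleep_for(delta);

    // Fast path: append to the buffer if it has room.
    if (!dyn.buffer_full()) {
        dyn.buffer_append(r);
        return;
    }

    // Schedule a background reconstruction for any level that has
    // reached the shard limit imposed by the scale factor.
    for (auto &level : dyn.levels())
        if (level.shard_count() == dyn.scale_factor())
            dyn.schedule_reconstruction(level);

    // Flush the buffer into L0, then append to the now-empty buffer.
    dyn.flush_buffer();
    dyn.buffer_append(r);
}
\end{verbatim}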

\begin{figure}
\centering
\includegraphics[width=\textwidth]{diag/tail-latency/last-level-recon.pdf}
\caption{\textbf{Worst-case Reconstruction.} Using the tiering layout
policy, the worst-case reconstruction occurs when every level in
the structure has been filled (middle portion of the figure) and a
reconstruction must be performed on each level to merge it into a single
shard, and place it on the level below, leaving the structure with one
shard per level after the records from the buffer have been added to
L0 (right portion of the figure). The cost of this reconstruction is
dominated by the cost of performing a reconstruction on the last level. The
last level reconstruction, however, is able to be performed well in advance,
as it only requires the blocks on the last level, which fills $\Theta(n)$
inserts before the worst-case reconstruction is triggered (left portion of the
figure). This provides us with the opportunity to initiate this reconstruction
early. \emph{Note: the block sizes in this diagram are not scaled to their
record counts--each level has an increasing number of records per block.}}
\label{fig:tl-tiering}
\end{figure}

To ensure the correctness of this algorithm, it is necessary to show
that there exists a value for $\delta$ that ensures that the structural
invariants can be maintained. Logically, this $\delta$ can be thought
of as the amount of time needed to perform the active reconstruction
operation, amortized over the inserts between when this reconstruction
can be scheduled, and when it needs to be complete. We'll consider how
to establish this value next.

Figure~\ref{fig:tl-tiering} shows various stages in
the development of the internal structure of a dynamized index using
tiering. Importantly, note that the last level reconstruction, which
dominates the cost of the worst-case reconstruction, \emph{is able to be
performed in advance}. All of the records necessary to perform this
reconstruction are present in the last level $\Theta(n)$ inserts before
the reconstruction must be done to make room. This is a significant
advantage to our technique over the normal Bentley-Saxe method, which
will allow us to spread the cost of this reconstruction over a number
of inserts without much of the complexity of~\cite{overmars81}. This
leads us to the following result,
\begin{theorem}
\label{theo:worst-case-optimal}
Given a dynamized structure utilizing the reconstruction policy described
in Algorithm~\ref{alg:tl-relaxed-recon}, a single execution unit, and multiple
threads of execution that can be scheduled on that unit at will with
preemption, it is possible to maintain a worst-case insertion cost of
\begin{equation}
I(n) \in O\left(\frac{B(n)}{n} \log n\right)
\end{equation}
while maintaining a bound of $s$ shards per level, and $\log_s n$ levels.
\end{theorem}
\begin{proof}
Under Algorithm~\ref{alg:tl-relaxed-recon}, the worst-case reconstruction
operation consists of the creation of a new block from all of the existing
blocks in the last level. This reconstruction will be initiated when the
last level is full, at which point there will be another $\Theta(n)$
inserts before the level above it also fills, and a new shard must be
added to the last level. The reconstruction must be completed by this point
to ensure that no more than $s$ shards exist on the last level.

Assume that all inserts run on a single thread that can be scheduled
alongside the reconstructions, and let each insert have a cost of
\begin{equation*}
I(n) \in \Theta(1 + \delta)
\end{equation*}
where $1$ is the cost of appending to the buffer, and $\delta$
is a calculated stall time. During the stalling, the insert
thread will be idle and reconstructions can be run on the execution unit.
To ensure the last-level reconstruction is complete by the time that
$\Theta(n)$ inserts have finished, it is necessary that $\delta \in
\Theta\left(\frac{B(n)}{n}\right)$.

However, this amount of stall is insufficient to maintain exactly $s$
shards on each level of the dynamization. At the point at which the
last-level reconstruction is initiated, there will be exactly $1$ shard
on all other levels (see Figure~\ref{fig:tl-tiering}). However, between
this initiation and the time at which the last level reconstruction must
be complete to maintain the shard count bound, each other level must
also undergo $s - 1$ reconstructions to maintain their own bounds.
Because we have only a single execution unit, it is necessary to account
for the time to complete these reconstructions as well. In the worst-case,
there will be one active reconstruction on each of the $\log_s n$ levels,
and thus we must introduce stalls such that,
\begin{equation*}
I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots + \delta_{\log n - 1})
\end{equation*}
All of these internal reconstructions will be strictly less than the size
of the last-level reconstruction, and so we can bound them all above by
$O(\frac{B(n)}{n})$ time.  Given this, and assuming that the smallest
(i.e., most pressing) reconstruction is prioritized on the execution
unit, we find that
\begin{equation*}
I(n) \in O\left(\frac{B(n)}{n} \cdot \log n\right)
\end{equation*}
\end{proof}

This approach results in worst-case insertion and query latency
bounds equivalent to those of~\cite{overmars81}, but manages to resolve the
issues cited above. By leveraging multiple threads, instead of trying to manually
multiplex a single thread, this approach requires \emph{no} modification
to the user's block code to function. The fine-grained control
over the active thread necessary to achieve this bound can be obtained
using userspace interrupts~\cite{userspace-preempt}, making the approach
far easier to implement than the existing worst-case optimal technique,
which requires significant modifications to the reconstruction
procedures themselves.

\subsection{Reducing Stall with Additional Parallelism}

The result in Theorem~\ref{theo:worst-case-optimal} assumes that there is
only a single available execution unit. This requires that the insertion
stall amount be large enough to cover all of the reconstructions necessary
at any moment in time. If we have access to parallel execution units,
though, we can significantly reduce the amount of stall time required.

The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case
bound is that covering only the cost of the last level reconstruction
is insufficient to maintain the bound on the block count. From the moment
that the last level has filled, and this reconstruction can begin, every
level within the structure must sustain another $s - 1$ reconstructions
before it is necessary to have completed the last level reconstruction,
in order to maintain the $\Theta(\log n)$ bound on the number of blocks.

To see why this is important, consider an implementation that, contrary
to Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover
the last-level reconstruction. All other reconstructions are blocked
until the last-level one has been completed.  This approach would
result in $\delta = \frac{B(n)}{n}$ stall and complete the last
level reconstruction after $\Theta(n)$ inserts. During this time,
$\Theta(\frac{n}{N_B})$ blocks would accumulate in L0, ultimately
resulting in a bound of $\Theta(n)$ blocks in the structure, rather than
the $\Theta(\log n)$ bound we are trying to maintain. This is the reason
why Theorem~\ref{theo:worst-case-optimal} must account for stalls on
every level, and assumes that the smallest (and therefore most pressing)
reconstruction is always active.  This introduces the extra $\log n$
factor into the worst-case insertion cost function, because there will at
worst be a reconstruction running on every level, and each reconstruction
will be no larger than $\Theta(n)$ records.

In effect, the stall amount must be selected to cover the \emph{sum} of
the costs of all reconstructions that occur. Another way of deriving this
bound would be to consider this sum,
\begin{equation*}
B(n) + (s - 1) \cdot \sum_{i=0}^{\log n - 1} B(n)
\end{equation*}
where the first term is the last level reconstruction cost, and the sum
term accounts for the $s-1$ reconstructions on each internal level, each
bounded above by $B(n)$. Dropping constants and expanding the sum leaves,
\begin{equation*}
B(n) \cdot \log n 
\end{equation*}
reconstruction cost to amortize over the $\Theta(n)$ inserts.

However, additional parallelism will allow us to reduce this. At the
upper limit, assume that there are $\log n$ execution units available for
parallel reconstructions. We'll adopt the bulk-synchronous parallel
(BSP) model~\cite{bsp} for our analysis of the parallel algorithm. In
this model, computation is broken up into multiple parallel threads
of computation, which are executed independently for a period of
time. Intermittently, a synchronization barrier is introduced, at which
point each of the parallel threads is blocked and synchronization of
global state occurs. This period of independent execution between
barriers is called a \emph{super-step}.

In this model, the parallel execution cost of a super-step is given
by the cost of the longest-running computation, the cost of communication
between the parallel threads, and the cost of the synchronization,
\begin{equation*}
\max\left\{w_i\right\} + hg + l
\end{equation*}
where $hg$ describes the communication cost, $w_i$ is the cost of the
$i$th computation, and $l$ is the cost of barrier synchronization. The
cost for the entire BSP computation is the sum of all the individual
super-steps,
\begin{equation*}
W + Hg + Sl
\end{equation*}
where $W$ is the total computation cost, $H$ the total communication
volume, and $S$ the number of super-steps in the computation.

We'll model the worst-case reconstruction within the BSP model in the following
way. Each individual reconstruction will be considered a parallel
operation, hence $w_i = B(N_B \cdot s^{i+1})$, where $i$ is the level
number. Because we are operating on a single machine, we can assume
that the communication cost is constant, $Hg \in \Theta(1)$. During the
synchronization barrier, any pending structural updates can be applied to
the dynamization (i.e., blocks added or removed from levels). Assuming
that tiering is used, this can be done in $l \in \Theta(1)$ time (for
details on how this is done, see Section~\ref{sssec:tl-versioning}).

Given this model, it is possible to derive the following new worst-case
bound,
\begin{theorem}
\label{theo:par-worst-case-optimal}
Given a dynamized structure utilizing the reconstruction policy described
in Algorithm~\ref{alg:tl-relaxed-recon}, and at least $\log n$ execution
units in the BSP model, it is possible to maintain a worst-case insertion
cost of
\begin{equation}
I(n) \in O\left(\frac{B(n)}{n}\right)
\end{equation}
for a data structure with $B(n) \in \Omega(n)$.
\end{theorem}
\begin{proof}
Just as in Theorem~\ref{theo:worst-case-optimal}, we begin by noting that
the last level reconstruction will be of cost $\Theta(B(n))$ and must
be amortized over $\Theta(n)$ inserts. However, unlike in that case,
we now have $\log n$ execution units to work with. Thus, each time
a reconstruction must be performed on an internal level, it can be
executed on one of these units in parallel with all other ongoing
reconstructions. As there can be at most one reconstruction per level,
$\log n$ threads are sufficient to run all possible reconstructions at
any point in time in parallel.

To fit this into the BSP model, we need to establish the necessary
frequency of synchronization. At a minimum, we will need to perform one
synchronized update of the internal structure of the dynamization per
buffer flush, which will occur $\Theta\left(\frac{n}{N_B}\right)$ times
over the $\Theta(n)$ inserts. Thus, we will take $S = \frac{n}{N_B}$ as
the number of super-steps. At the end of each super-step, any pending
structural updates from any active reconstruction can be applied.

Within a BSP super-step, the total operational cost is equal to the sum of
the communication cost (which is $\Theta(1)$ here), the synchronization
cost (also $\Theta(1)$), and the cost of the longest running computation.
The computations themselves are reconstructions, each with a cost of
$B(N_B \cdot s^{i+1})$. They will, generally, exceed the duration of a
single super-step, and so their cost will be spread evenly over each
super-step that elapses from their initiation to their conclusion. As a
result, in any given super-step, there will be two possible states that
a given active reconstruction on level $i$ can be in,
\begin{enumerate}
	\item The reconstruction can complete during this super-step,
	in which case its cost is $w_i < \frac{n}{N_B}$, reflecting
	only the work it performs during this super-step.
	\item The reconstruction can be incomplete at the end of this
	super-step, in which case $w_i = \frac{n}{N_B}$.
\end{enumerate}

It is our intention to determine the time required to complete the full
reconstruction, the largest computation of which will have cost $B(n)$.
If we can show that the insertion rate necessary to ensure that this
computation is completed over $\Theta(n)$ inserts is also enough to
ensure that all the smaller reconstructions are completed in time as
well, then we can use the last-level reconstruction cost as the total
cost of the computational component of the BSP cost calculation.

To accomplish this, consider the necessary stall to fully cover a
reconstruction on level $i$. This is the cost of the reconstruction over
level $i$, divided by the number of inserts that can occur before the
reconstruction must be done (i.e., the capacity of the index above this
point). This gives,
\begin{equation*}
\delta_i \in O\left( \frac{B(N_B \cdot s^{i+1})}{\sum_{j=0}^{i-1} N_B\cdot s^{j+1}} \right)
\end{equation*}
stall for each level. Noting that $s > 1$, $s \in \Theta(1)$, and that
the denominator is the sum of a geometric progression, we have
\begin{align*}
\delta_i \in &O\left( \frac{B(N_B\cdot s^{i+1})}{s\cdot N_B \sum_{j=0}^{i-1} s^{j}} \right) \\
             &O\left( \frac{(s-1)\, B(N_B\cdot s^{i+1})}{N_B\cdot (s^{i+1} - s)} \right) \\
			 &O\left( \frac{B(N_B\cdot s^{i+1})}{N_B \cdot s^{i+1}}\right)
\end{align*}

For $B(n) \in \Omega(n)$, the numerator of the fraction will grow at
least as rapidly as the denominator, meaning that $\delta_\ell$ will
always be the largest. Thus, the stall necessary to cover the last-level
reconstruction will be at least as much as is necessary for the internal
reconstructions.

Given this, we will consider only the cost $B(n)$ of the last level
reconstruction in our BSP calculation. This cost will be spread evenly
over $S=\frac{n}{N_B}$ super-steps, so the cost incurred in each super-step is
$\frac{B(n)}{S}$. This results in the following overall operation cost over
all super-steps,
\begin{align*}
&\frac{S\cdot B(n)}{S} + S\cdot\Theta(1) + S\cdot\Theta(1) \\
&= B(n) + \Theta\left(\frac{n}{N_B}\right)
\end{align*}
Given that $B(n) \in \Omega(n)$, we can absorb the synchronization cost
term to get a total operational cost of $O(B(n))$, which we must evenly
distribute over the $\Theta(n)$ inserts. Thus, the worst-case insertion
cost is,
\begin{equation*}
	I(n) \in O\left(\frac{B(n)}{n}\right)
\end{equation*}

While the BSP model assumes infinite parallel execution, we'll further
restrict this by noting that, at any moment in time, there can be at
most one reconstruction occurring on each level of the structure, and
therefore $\log_s n$ threads are sufficient to obtain the above
result.
\end{proof}

It's worth noting that the ability to parallelize the reconstructions
is afforded to us by our technique, and is not possible in Overmars's
classical formulation, which is inherently single-threaded.

\section{Implementation}
\label{sec:tl-impl}

The previous section demonstrated that it is possible to meaningfully
control the worst-case insertion cost (and, therefore, the insertion tail
latency) of our dynamization system, at least in theory. This can be done
by relaxing the reconstruction processes and throttling the insertion
rate as a means of controlling the shard count within the structure,
rather than blocking the insertion thread during reconstructions. However,
there are a number of practical problems to be solved before this idea can
be used in a real system. In this section, we discuss these problems, and
our approaches to solving them to produce a dynamization framework based
upon the technique. Note that this system is based on the same high-level
architecture as we described in Section~\ref{ssec:dyn-concurrency}. To
avoid redundancy, we will focus on how this system differs, without
fully recapitulating the content of that earlier section.

\subsection{Parallel Reconstruction Architecture}

The existing concurrency implementation described in
Section~\ref{ssec:dyn-concurrency} is insufficient for the purposes of
constructing a framework supporting the parallel reconstruction scheme
described in the previous section. In particular, it is limited to
only two active versions of the structure at a time, with one ongoing
reconstruction. Additionally, it does not consider buffer flushes as
distinct events from reconstructions. In order to support the result
of Theorem~\ref{theo:par-worst-case-optimal}, it will be necessary to
have a concurrency control system that considers reconstructions on
each level independently, allowing one reconstruction per level
without any synchronization. It must also allow each reconstruction to
apply its results to the active structure in $\Theta(1)$ time, in
any order, without violating any structural invariants.

To accomplish this, we will use a multi-versioning control scheme that
is similar to the simple scheme in Section~\ref{ssec:dyn-concurrency}.
Each \emph{version} will consist of three pieces of information: a buffer
head pointer, buffer tail pointer, and a collection of levels and
shards. However, the process of managing, creating, and installing
versions will be more complex, to allow more than two versions to exist
at the same time under certain circumstances and support the necessary
features mentioned above.

\subsubsection{Structure Versioning}
\label{sssec:tl-versioning}
The internal structure of the dynamization consists of a sequence of
levels containing immutable shards, as well as a snapshot of the state
of the mutable buffer. This section pertains specifically to the internal
structure; the mutable buffer handles its own versioning separate from
this and will be discussed in the next section.

\begin{figure}
\centering
\subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}} 
\subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}} 
\caption{\textbf{Structure Version Transitions.} The dynamized structure
can transition to a new version via two operations, flushing the buffer
into the first level or performing a maintenance reconstruction to
merge shards on some level and append the result onto the next one. In
each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s
light grey shards, with the dark grey shards being newly created
and the white shards being deleted. The buffer flush operation in
Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer
and places it in \texttt{L0} to create \texttt{V2}. The maintenance
reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex,
creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s
\texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}.
}
\label{fig:tl-flush-maint}
\end{figure}

The internal structure of the dynamized data structure (ignoring the
buffer) can be thought of as a list of immutable levels, $\mathcal{V}
= \{\mathscr{L}_0, \ldots \mathscr{L}_h\}$, where each level
contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots
\mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of
as functions, which accept a version as input and produce a new version
as output. Namely,
\begin{align*}
	\mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\ 
	\mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j)
\end{align*}
where the subscript represents the \texttt{version\_id} and is a strictly
increasing number assigned to each version. The $\mathbftt{flush}$
operation builds a new shard using records from the buffer, $\mathcal{B}$,
and creates a new version identical to $\mathcal{V}_i$, except with
the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs
a maintenance reconstruction by building a new shard using all of the
shards in level $\mathscr{L}_x$ and creating a new version identical
to $\mathcal{V}_i$ except that the new shard is appended to level
$\mathscr{L}_j$ and the shards in $\mathscr{L}_x$ are removed from
$\mathscr{L}_x$ in the new version. These two operations are shown in
Figure~\ref{fig:tl-flush-maint}.

At any point in time, the framework will have \emph{one} active version,
$\mathcal{V}_a$, as well as a maximum unassigned version number, $v_m
> a$. New version ids are obtained by performing an atomic fetch-and-add
on $v_m$, and versions will become active in the exact order of their
assigned version numbers. We use the term \emph{installing} a version,
$\mathcal{V}_x$ to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$.

\Paragraph{Version Number Assignment.} It is the intention of this
framework to prioritize buffer flushes, meaning that the versions
resulting from a buffer flush should become active as rapidly as
possible. It is undesirable to have some version, $\mathcal{V}_f$,
resulting from a buffer flush, attempting to install while there is
a version $\mathcal{V}_r$ associated with an in-process maintenance
reconstruction such that $a < r < f$. In this case, the flush must wait
for the maintenance reconstruction to finalize before it can itself be
installed. To avoid this problem, we assign version numbers differently
based upon whether the new version is created by a flush or a maintenance
reconstruction.

\begin{itemize}
	\item \textbf{Flush.} When a buffer flush is scheduled, it is
	immediately assigned the next available version number at the
	time of scheduling.

	\item \textbf{Maintenance Reconstruction.} Maintenance reconstructions
	are \emph{not} assigned a version number immediately. Instead, they
	are assigned a version number \emph{after} all of the reconstruction
	work is performed, during their installation process.
\end{itemize}

\Paragraph{Version Installation.} Once a given flush or maintenance
reconstruction has completed and has been assigned a version
number, $i$, the version will attempt to install itself. The
thread running the operation will wait until $a = i - 1$, and
then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an
atomic pointer assignment. All versions are reference counted using
\texttt{std::shared\_pointer}, and so will be automatically deleted
once all threads containing a reference to the version have terminated,
so no special memory management is necessary during version installation.
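
A minimal C++ sketch of this version numbering and installation logic
is shown below, using hypothetical names (\texttt{Version},
\texttt{claim\_version\_id}, and so on). Flushes call
\texttt{claim\_version\_id} when they are scheduled, while maintenance
reconstructions call it only once their \texttt{build} work has
completed.

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>

struct Version;  // levels, shards, and buffer pointers (hypothetical)

class VersionManager {
public:
    // v_m: an atomic fetch-and-add hands out strictly increasing ids.
    uint64_t claim_version_id() { return next_id_.fetch_add(1); }

    // Install V_id once V_{id-1} is active, then publish atomically.
    void install(uint64_t id, std::shared_ptr<Version> v) {
        while (active_id_.load() != id - 1)
            std::this_thread::yield();
        std::atomic_store(&active_, std::move(v));
        active_id_.store(id);
        // Superseded versions are reclaimed automatically once the
        // last std::shared_ptr reference to them is dropped.
    }

    std::shared_ptr<Version> active() const {
        return std::atomic_load(&active_);
    }

private:
    std::atomic<uint64_t> next_id_{1};
    std::atomic<uint64_t> active_id_{0};
    std::shared_ptr<Version> active_;
};
\end{verbatim}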

\begin{figure}
\centering
\includegraphics[width=\textwidth]{diag/tail-latency/dropped-shard.pdf}
\caption{\textbf{Shard Reconciliation Problem.} Because maintenance
reconstructions don't obtain their version number until after they have
completed, it is possible for the internal structure of the dynamization
to change between when the reconstruction is scheduled, and when it completes. In
this example, a maintenance reconstruction is scheduled based on V1 of the
structure. Before it can finish, V2 is created as a result of a buffer flush. As a
result, the maintenance reconstruction's resulting structure is assigned V3. But, when
it is installed, the shard produced by the flush in V2 is lost. It will be necessary
to devise a means to prevent this from happening.}
\label{fig:tl-dropped-shard}
\end{figure}

\Paragraph{Maintenance Version Reconciliation.} Waiting until the
moment of installation to assign a version number to maintenance
reconstructions avoids stalling buffer flushes; however, it introduces
additional complexity in the installation process. This is because the
active version at the time the reconstruction was scheduled, $\mathcal{V}_a$,
may not still be the active version at the time the reconstruction is
installed, $\mathcal{V}_{a^\prime}$. This means that the version of
the structure produced by the reconstruction, $\mathcal{V}_r$, will not
reflect any updates to the structure that were performed in version ids on
the interval $(a, a^\prime]$. Figure~\ref{fig:tl-dropped-shard}
shows an example of the sort of problem that can arise.

One possible approach is to simply merge the versions together,
adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but
not in $\mathcal{V}_r$ prior to installation. Sadly, this approach is
insufficient because it can lead to three possible problems,

\begin{enumerate}
	\item If shards used in the maintenance reconstruction to
	produce $\mathcal{V}_r$ were \emph{also} used as part of a
	different maintenance reconstruction resulting in a version
	$\mathcal{V}_o$ with $o < r$, then \textbf{records will be
	duplicated} by the merge.

	\item If another reconstruction produced a version $\mathcal{V}_o$
	with $o < r$, and $\mathcal{V}_o$ added a new shard to the same
	level that $\mathcal{V}_r$ did, it is possible that the
	temporal ordering properties of the shards on the level
	may be violated. Recall that supporting tombstone-based
	deletes requires that shards be strictly ordered within each
	level by their age to ensure that tombstone cancellation
	(Section~\ref{sssec:dyn-deletes}).

	\item The shards that were deleted from $\mathcal{V}_r$ after the
	reconstruction will still be present in $\mathcal{V}_{a^\prime}$
	and so may be reintroduced into the new version, again leading
	to duplication of records. It is non-trivial to identify these
	shards during the merge to skip over them, because the shards
	don't have a unique identifier other than their pointers, and
	using the pointers for this check can lead to the ABA problem
	under the reference-counting based memory management scheme the
	framework is built on.

\end{enumerate}

The first two of these are simple synchronization problems
and can be solved using locking. A maintenance reconstruction
operates on some level $\mathscr{L}_i$, merging and then deleting shards
from that level and placing the result in $\mathscr{L}_{i+1}$. In
order for either of these problems to occur, multiple concurrent
reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager
can be introduced into the framework to allow reconstructions to lock
entire levels. A reconstruction can only be scheduled if it is able to
acquire the lock on the level that it is using as the \emph{source}
for its shards. Note that there is no synchronization problem with a
concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard
to $\mathscr{L}_i$. This will not violate any ordering properties or
result in any duplication of records. Thus, each reconstruction only
needs to lock a single level.
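
A sketch of such a lock manager, assuming a fixed upper bound on the
number of levels, is given below. \texttt{try\_lock} is non-blocking:
a reconstruction that fails to acquire its source level's lock is
simply not scheduled until the conflicting reconstruction installs and
releases the lock.

\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <vector>

// One lock per level; a reconstruction locks only its *source* level.
// Appends from the level above require no lock, as they cannot
// violate ordering or duplicate records.
class LevelLockManager {
public:
    // Note: value-initialized std::atomic_flag is guaranteed to be
    // in the clear state as of C++20.
    explicit LevelLockManager(size_t max_levels) : locks_(max_levels) {}

    bool try_lock(size_t level) {
        // test_and_set returns the previous value: false means the
        // lock was free and has now been acquired.
        return !locks_[level].test_and_set(std::memory_order_acquire);
    }

    void unlock(size_t level) {
        locks_[level].clear(std::memory_order_release);
    }

private:
    std::vector<std::atomic_flag> locks_;
};
\end{verbatim}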

The final problem is a bit trickier to address, but is fundamentally
an implementation detail. Our approach for resolving it is to
change the way that maintenance reconstructions produce a version
in the first place. Rather than taking a copy of $\mathcal{V}_a$,
manipulating it to perform the reconstruction, and then reconciling
it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay
\emph{all} structural updates to the version to installation. When a
reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken,
instead of a copy.  Then, any new shards are built based on the contents
of $\mathcal{V}_a$, but no updates to the structure are made. Once all
of the shard reconstructions are complete, the version installation
process begins. The thread running the reconstruction waits for its turn
to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To
this copy, the newly created shards are added, and any necessary deletes
are performed. Because the shards to be deleted are currently referenced
in, at minimum, the reference to $\mathcal{V}_a$ maintained by the
reconstruction thread, pointer equality can be used to identify the
shards and the ABA problem avoided. Then, once all the updates are complete,
the new version can be installed.
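
The installation step for a maintenance reconstruction can therefore
be sketched as follows, again using hypothetical \texttt{Version},
\texttt{Level}, and \texttt{Shard} interfaces. The shards to remove
are identified by pointer equality against the version referenced at
scheduling time, which remains pinned by the reconstruction thread.

\begin{verbatim}
#include <memory>

// source:     the version referenced when the reconstruction was
//             scheduled (V_a); its shards remain pinned.
// now_active: the active version at install time (V_a').
// new_shard:  built from the shards of source->level(i).
std::shared_ptr<Version> make_install_version(
        const std::shared_ptr<Version> &source,
        const std::shared_ptr<Version> &now_active,
        std::shared_ptr<Shard> new_shard, size_t i) {
    // Copy the *current* active version, not the scheduling-time one,
    // so that intervening flushes and reconstructions are preserved.
    auto next = std::make_shared<Version>(*now_active);

    // Remove exactly the merged shards: pointer equality is safe
    // here because `source` still holds references to them, so the
    // ABA problem cannot arise.
    for (const auto &old : source->level(i).shards())
        next->level(i).remove_shard(old.get());

    // Append the merged result to the next level down.
    next->level(i + 1).append(std::move(new_shard));
    return next;  // ready for version-id assignment and installation
}
\end{verbatim}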

This process does push a fair amount of work to the moment of install,
between when a version id is claimed by the reconstruction thread and
when that version becomes active. During this time, any buffer flushes
will be blocked. However, relative to the work associated with actually
performing the reconstructions, the overhead of these metadata operations
is fairly minor, and so it doesn't have a significant effect on buffer
flush performance.


\subsubsection{Mutable Buffer}

\begin{figure}
\centering
\subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}} 
\subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}} 
\subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}} 
\caption{\textbf{Versioning process for the mutable buffer.} A schematic
view of the mutable buffer demonstrating the three pointers representing
its state, and how they are adjusted as inserts occur. Dark grey slots
represent the currently active version, light grey slots the old version,
and white slots are available space.}
\label{fig:tl-buffer}
\end{figure}

Next, we'll address concurrent access and versioning of the mutable
buffer. In our system, the mutable buffer consists of a large ring buffer
with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In
order to support versioning, the buffer actually uses two head pointers,
one called \texttt{head} and one called \texttt{old head}, along
with a single \texttt{tail} pointer. Records are inserted into the
buffer by atomically incrementing \texttt{tail} and then placing the
record into the slot. For records that cannot be atomically assigned,
a visibility bit can be used to ensure that concurrent readers don't
access a partially written value.  \texttt{tail} can be incremented
until it matches \texttt{old head}, or until the current version of
the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$
records. At this point, any further writes would either clobber records
in the old version, or exceed the user-specified buffer capacity, and so
any inserts must block until a flush has been completed.
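
The insert path can be sketched as follows. The names and layout are
assumptions for illustration; in particular, we express the atomic
increment of \texttt{tail} as a compare-and-swap loop so that a failed
insert never claims a slot, and we size the physical array at $2N_B$ to
hold both versions.
\begin{verbatim}
#include <atomic>
#include <cstddef>

template <typename Record, size_t N_B>
class MutableBuffer {
public:
    /* Claim a slot by atomically advancing tail, unless doing so
     * would clobber the old version or exceed the capacity N_B of
     * the current version. Indices grow monotonically and are mapped
     * onto the physical array modulo its size. */
    bool try_insert(const Record &rec) {
        size_t slot = m_tail.load();
        do {
            if (slot - m_old_head.load() >= 2 * N_B /* old version */
                || slot - m_head.load() >= N_B)     /* version full */
                return false;     /* caller must wait for a flush */
        } while (!m_tail.compare_exchange_weak(slot, slot + 1));

        m_data[slot % (2 * N_B)] = rec;
        m_visible[slot % (2 * N_B)].store(true);  /* publish record */
        return true;
    }

private:
    Record m_data[2 * N_B];
    std::atomic<bool> m_visible[2 * N_B] = {};
    std::atomic<size_t> m_old_head{0};
    std::atomic<size_t> m_head{0};
    std::atomic<size_t> m_tail{0};
};
\end{verbatim}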

Flushes are triggered based on a user-configurable set point, $N_F \leq
N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation
is scheduled. The location of \texttt{tail} is recorded as part of
the flush, but records can continue to be inserted until one of the
blocking conditions in the previous paragraph is reached. When the flush
has completed, a new shard is created containing the records between
\texttt{head} and the value of \texttt{tail} at the time the flush began.
The buffer version can then be advanced by setting \texttt{old head}
to \texttt{head} and setting \texttt{head} to \texttt{tail}. All of the
records associated with the old version are freed, and the records that
were just flushed now become part of the old version.
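
Continuing the buffer sketch above, the version transition itself
amounts to two pointer updates, eliding the coordination with readers
still holding references to \texttt{old head} (discussed below).
\begin{verbatim}
/* Member of the MutableBuffer sketch above. Called once a flush of
 * the records in [head, flush_tail) has completed; flush_tail is
 * the value of tail recorded when the flush began. */
void advance_version(size_t flush_tail) {
    m_old_head.store(m_head.load()); /* old version is reclaimable */
    m_head.store(flush_tail); /* flushed records form old version */
}
\end{verbatim}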

The reason for this scheme is to allow threads accessing an older
version of the dynamized structure to still see a current view of all
of the records. These threads will have a reference to a dynamized
structure containing none of the records in the buffer, as well as
the old head. Because the older version of the buffer always directly
precedes the newer, all of the buffered records are visible to this
older version.  However, threads accessing the more current version of
the buffer will \emph{not} see the records contained between \texttt{old
head} and \texttt{head}, as these records will have been flushed into
the structure and are visible to the thread there. If this thread could
still see records in the older version of the buffer, then it would see
these records twice, which is incorrect.

One consequence of this technique is that a buffer flush cannot complete
until all threads referencing \texttt{old head} have completed. To ensure
that this is the case, the two head pointers are reference counted, and
a flush will stall until all references to \texttt{old head} have been
removed. In principle, this problem could be reduced by allowing for more
than two heads, but it becomes difficult to atomically transition between
versions in that case, and it would also increase the storage requirements
for the buffer, which requires $N_B$ space per available version.

\subsection{Concurrent Queries}

Queries are answered based upon the active version of the structure
at the moment the query begins to execute. When the query routine of
the dynamization is called, a query is scheduled. Once a thread becomes
available, the query will begin to execute. At the start of execution, the
query thread takes a reference to $\mathcal{V}_a$ as well as the current
\texttt{head} and \texttt{tail} of the buffer. Both $\mathcal{V}_a$
and \texttt{head} are reference counted, and will be retained for the
duration of the query. Once the query has finished processing, it will
return the result to the user via an \texttt{std::promise} and release
its references to the active version and buffer.
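
The lifetime of these references can be sketched as follows; the types
here are hypothetical stand-ins, and the key point is that the snapshot
is taken when the query \emph{begins executing}, not when it is
scheduled, and is released only when the result has been delivered.
\begin{verbatim}
#include <cstddef>
#include <future>
#include <memory>

/* Hypothetical stand-ins for the framework's internals. */
struct Version { /* levels of shards, ... */ };
struct BufferView { size_t head; size_t tail; };

struct Snapshot {
    std::shared_ptr<Version>    version; /* pins V_a via refcount */
    std::shared_ptr<BufferView> buffer;  /* pins the buffer head  */
};

/* Sketch of query execution: answer against the pinned snapshot
 * and deliver the result through a promise. */
template <typename Result, typename QueryFn>
void run_query(Snapshot snap, QueryFn query,
               std::promise<Result> result) {
    result.set_value(query(*snap.version, *snap.buffer));
}   /* snap destroyed here, dropping both references */
\end{verbatim}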

\subsubsection{Query Preemption}

Because our implementation only supports two active head pointers in the
mutable buffer, queries can lead to insertion stalls. If a long running
query is holding a reference to \texttt{old head}, then an active buffer
flush of the old version will be blocked by this query. If this blocking
goes on for sufficiently long, then the buffer may fill up and the system
will begin to reject inserts.

One possible solution to this problem is to process the
\texttt{buffer\_query} first, and then discard the reference to
\texttt{old head}, allowing the buffer flush to proceed. However, this
would not work for iterative deletion decomposable search problems,
which may require re-processing the buffer query arbitrarily many times.
As a result, we instead implement a simple mechanism to preempt
long-running queries. The framework keeps track of how long
a buffer flush has been stalled by queries maintaining references to
\texttt{old head}.  Once this stalling passes a user-defined threshold,
a preemption flag will be set, instructing the queries in question to
restart themselves. This is implemented fully within the framework code,
requiring no adjustment to user queries, as the framework query
mechanism simply checks this flag in between calls to user code. If
a query sees this flag is set, it will release its references to the
\texttt{old head} and structure version, and automatically put itself
back in the scheduling queue to be retried against newer versions of
the structure and buffer.
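
The handshake looks roughly like the following; the \texttt{Query}
interface shown here (\texttt{finished()}, \texttt{run\_next\_user\_step()},
and so on) is illustrative rather than the framework's actual API. It
also incorporates the maximum preemption count discussed next.
\begin{verbatim}
#include <atomic>

enum class QueryStatus { COMPLETE, PREEMPTED };

/* Sketch of the preemption check. The flag is inspected only
 * between calls into user-supplied query code, so user queries
 * require no modification. */
template <typename Query>
QueryStatus execute(Query &q, const std::atomic<bool> &preempt_flag,
                    int max_preemptions) {
    while (!q.finished()) {
        q.run_next_user_step();   /* one call into user code */
        if (preempt_flag.load()
            && q.preemption_count() < max_preemptions) {
            q.release_references();   /* drop old head + version */
            q.increment_preemption_count();
            return QueryStatus::PREEMPTED; /* caller re-enqueues */
        }
    }
    return QueryStatus::COMPLETE;
}
\end{verbatim}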

Note that, if misconfigured, it is possible that this mechanism will
entirely prevent certain long-running queries from being answered.
If the threshold for preemption is set lower than the expected
run-time of a valid query, it's possible that the query will loop forever
if the system is experiencing sufficient insertion pressure. To help
avoid this, another parameter is available to specify a maximum preemption
count, after which a query will ignore a request for preemption.

\subsection{Insertion Stall Mechanism}

The results of Theorem~\ref{theo:worst-case-optimal} and
\ref{theo:par-worst-case-optimal} are based upon enforcing a rate limit
upon incoming inserts by manually increasing their cost, to ensure that
there is sufficient time for reconstructions to complete. The calculation
and application of this stall factor can be seen as equivalent to
explicitly limiting the maximum allowed insertion throughput. In this
section, we consider a mechanism for doing this.

In practice, calculating and precisely stalling for the correct amount
of time is quite difficult because of the vagaries of working with a
real system. While ultimately it would be ideal to have a reasonable
cost model that can estimate this stall time on the fly based on the
cost of building a data structure, the number of records involved in
reconstructions, the number of available threads, available memory
bandwidth, etc., for the purposes of this prototype we have settled for
a simple system that demonstrates the robustness of the technique.

Recall the basic insert process within our system. Inserts bypass the
scheduling system and communicate directly with the buffer, on the same
client thread that called the function, to maximize insertion performance
and eliminate as much concurrency control overhead as possible. The
insert routine is synchronous, and returns a boolean indicating whether
the insert has succeeded or not. The insert can fail if the buffer is full,
in which case the user is expected to delay for a moment and retry. Once
space has been cleared in the buffer, the insert will succeed.
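
This client-side contract is simple enough to express in a few lines.
The sketch below is purely illustrative; \texttt{insert()} matches the
synchronous boolean interface described above, but the backoff interval
is an arbitrary stand-in rather than a recommended value.
\begin{verbatim}
#include <chrono>
#include <thread>

/* Sketch of the expected client retry loop: back off briefly and
 * retry whenever the insert is rejected. */
template <typename DynamizedStructure, typename Record>
void insert_with_retry(DynamizedStructure &ds, const Record &rec) {
    while (!ds.insert(rec)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
\end{verbatim}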

We can leverage this same mechanism as a form of rate limiting, by
rejecting new inserts when the throughput rises above a specified level.
Unfortunately, the most straightforward approach to doing this--monitoring
the throughput and simply blocking inserts when it rises above a
specified threshold--is undesirable because individual insert
rejections are not independent. The rejections will tend to
clump, which reintroduces the tail latency problem we are attempting to
resolve. Instead, it would be best to spread the rejections more evenly.

We approximate this rate limiting behavior by using random sampling
to determine which inserts to reject. Rather than specifying a maximum
throughput, the system is configured with a probability of acceptance
for an insert. This avoids the distributional problems mentioned above
that arise from direct throughput monitoring, and has a few additional
benefits.  It is based on a single parameter that can be readily updated
on demand using atomics. Our current prototype uses a single, fixed value
for the probability, but ultimately it should be dynamically tuned to
approximate the $\delta$ value from Theorem~\ref{theo:worst-case-optimal}
as closely as possible. It also doesn't require significant modification
of the existing client interfaces.

We have elected to use Bernoulli sampling for the task of selecting which
inserts to reject. An alternative approach would have been to apply
systematic sampling. Bernoulli sampling results in the probability of an
insert being rejected being independent of its order in the workload,
but is non-deterministic. This means that there is a small probability
that many more rejections than are expected may occur over a short
span of time. Systematic sampling is not vulnerable to this problem,
but introduces a dependence between the position of an insert in a
sequence of operations and its probability of being rejected. We decided
to prioritize independence in our implementation.
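
The rejection gate itself is small; a minimal sketch, with assumed
names, is shown below. The acceptance probability is stored in an
atomic so that it could later be tuned on the fly, as discussed above.
\begin{verbatim}
#include <atomic>
#include <random>

/* Sketch of the Bernoulli rejection gate: each insert is admitted
 * independently with probability delta. */
class InsertThrottle {
public:
    explicit InsertThrottle(double delta) : m_delta(delta) {}

    /* Returns true if this insert should be admitted. */
    bool admit() {
        thread_local std::mt19937_64 gen{std::random_device{}()};
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        return dist(gen) < m_delta.load();
    }

    void set_delta(double delta) { m_delta.store(delta); }

private:
    std::atomic<double> m_delta;  /* probability of acceptance */
};
\end{verbatim}
From the client's perspective, an insert rejected by the gate looks
identical to an insert rejected because the buffer is full, so the
retry loop shown earlier handles both cases uniformly.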

\section{Evaluation}
\label{sec:tl-eval}

In this section, we perform several experiments to evaluate the ability of
the system proposed in Section~\ref{sec:tl-impl} to control tail latencies.

\subsection{Stall Proportion Sweep}

\begin{figure}
\centering
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-insert-dist.pdf} \label{fig:tl-stall-200m-dist}} 
\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/200m-stall-shard-dist.pdf} \label{fig:tl-stall-200m-shard}} \\
\caption{Insertion and Shard Count Distributions for ISAM with 200M Records}
\label{fig:tl-stall-200m}
\end{figure}

First, we will consider the insertion and query performance of our
system at a variety of stall proportions. The purpose of this testing
is to demonstrate that introducing stalls into the insertion process
reduces the insertion tail latency while matching the
general insertion and query performance of a strict tiering policy. Recall
that, in the insertion stall case, no explicit shard capacity limits are
enforced by the framework. Reconstructions are triggered with each buffer
flush on all levels exceeding a specified shard count ($s = 6$ in these
tests) and the buffer flushes immediately when full with no regard to the
state of the structure. Thus, limiting the insertion latency is the only
means the system uses to maintain its shard count at a manageable level.
These tests were run on a system with sufficient available resources to
fully parallelize all reconstructions.

First, Figure~\ref{fig:tl-stall-200m} shows the results of testing
insertion of the 200 million record SOSD \texttt{OSM} dataset in a
dynamized ISAM tree, using both our insertion stalling technique and
strict tiering. We inserted $30\%$ of the records, and then measured
the individual latency of each insert after that point to produce
Figure~\ref{fig:tl-stall-200m-dist}. Figure~\ref{fig:tl-stall-200m-shard}
was produced by recording the number of shards in the dynamized structure
each time the buffer flushed.  Note that a stall value of one indicates
no stalling at all, and values less than one indicate $1 - \delta$
probability of an insert being rejected. Thus, a lower stall value means
more stalls are introduced. The tiering policy is strict tiering with a
scale factor of $s=6$. It uses the concurrency control scheme described
in Section~\ref{ssec:dyn-concurrency}.


Figure~\ref{fig:tl-stall-200m-dist} clearly shows that all insertion
rejection probabilities succeed in greatly reducing tail latency relative
to tiering. Additionally, it shows a small amount of available tuning of
the worst-case insertion latencies, with higher stall amounts reducing
the tail latencies slightly at various points in the distribution. This
latter effect results from the buffer flush latency hiding mechanism,
which was retained from Chapter~\ref{chap:framework}. The buffer actually
has space for two versions, and the second version can be filled while
the first is flushing. This means that, for more aggressive stalling,
some of the time spent blocking on the buffer flush is redistributed
over the inserts into the second version of the buffer, rather than
resulting in a stall.

Of course, if the query latency is severely affected by the
use of this mechanism, it may not be worth using. Thus, in
Figure~\ref{fig:tl-stall-200m-shard} we show the probability density of
various shard counts within the dynamized structure for each stalling
amount, as well as strict tiering. We have elected to examine the shard
count, rather than the query latencies, for this purpose because this
technique is intended to directly control the number of
shards, and we wish to show that this is possible. Of course,
the shard count control is necessary for the sake of query latencies,
and we will consider query latency directly later.

This figure shows that, even for no insertion throttle at all, the shard
count within the structure remains well behaved and normally distributed,
albeit with a slightly longer tail and a higher average value. Once
stalls are introduced, though, it is possible both to reduce the tail
and to shift the peak of the distribution through a variety of points. In
particular, we see that a stall of $.99$ is sufficient to move the peak
to very close to tiering, and lower stalls are able to further shift the
peak of the distribution to even lower counts.

\begin{figure}
\centering
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-insert-dist.pdf} \label{fig:tl-stall-4b-dist}} 
\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/4b-stall-shard-dist.pdf} \label{fig:tl-stall-4b-shard}} \\
\caption{Insertion and Shard Count Distributions for ISAM with 4B Records}
\label{fig:tl-stall-4b}
\end{figure}

To validate that these results were not simply an artifact of the relatively
small size of the data set used, we repeated the exact same testing
using a set of four billion uniform integers, and these results are
shown in Figure~\ref{fig:tl-stall-4b}. These results are aligned with
the smaller data set, with Figure~\ref{fig:tl-stall-4b-dist} showing
the same improvements in insertion tail latency for all stall amounts,
and Figure~\ref{fig:tl-stall-4b-shard} showing similar trends in the
shard count. If anything, the gap between strict tiering and un-throttled
insertion is narrower with the larger data set than the smaller one.

\begin{figure}
\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-insert-dist.pdf} \label{fig:tl-stall-knn-dist}} 
\subfloat[Shard Count Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-shard-dist.pdf} \label{fig:tl-stall-knn-shard}} \\
\caption{Insertion and Shard Count Distributions for VPTree }
\label{fig:tl-stall-knn}
\end{figure}

Finally, we considered our dynamized VPTree in
Figure~\ref{fig:tl-stall-knn}, using the \texttt{SBW} dataset of
about one million 300-dimensional vectors. This test shows some of
the possible limitations of our fixed rejection rate. The ISAM Tree
tested above is constructable in roughly linear time, being an MDSP
with $B_M(n, k) \in \Theta(n \log k)$. Thus, the ratio $\frac{B_M(n,
k)}{n}$ used to determine the optimal insertion stall rate is
asymptotically a constant.  For VPTree, however, the construction
cost is super-linear, with $B(n) \in \Theta(n \log n)$, and also
generally much larger in absolute time requirements. We can see in
Figure~\ref{fig:tl-stall-knn-shard} that the shard count distribution
is very poorly behaved for smaller stall amounts, with the shard count
following a roughly uniform distribution for a stall rate of $1$. This
means that the background reconstructions are not capable of keeping up
with buffer flushing, and so the number of shards grows significantly
over time. Introducing stalls does shift the distribution closer to
normal, but it requires a much larger stall rate to obtain
a shard count distribution close to that of strict tiering than
was the case with the ISAM tree test. It is still possible, though,
even with our simple fixed-stall rate implementation. Additionally,
this approach is shown in Figure~\ref{fig:tl-stall-knn-dist} to reduce
the tail latency substantially compared to strict tiering, with the same
latency distribution effects for larger stall rates as was seen in the
ISAM examples.

Thus, we've shown that introducing even a fixed stall while allowing
the internal structure of the dynamization to develop naturally is able
to match the shard count distribution of strict tiering, while having
significantly lower insertion tail latencies. 

\subsection{Insertion Stall Trade-off Space}

While we have shown that introducing insertion stalls accomplishes the
goal of reducing tail latencies while being able to match the shard count
of a strict tiering reconstruction strategy, we've not yet addressed
what the actual performance of this structure is. By throttling inserts,
we potentially reduce the insertion throughput. And, further, it isn't
immediately obvious just how much query performance suffers as the shard
count distribution shifts. In this test, we examine the average values
of insertion throughput and query latency over a variety of stall rates.

The results of this test for ISAM with the SOSD \texttt{OSM} dataset are
shown in Figure~\ref{fig:tl-latency-curve-isam}, which shows the insertion
throughput plotted against the average query latency for our system at
various stall rates, and with tiering configured with an equivalent
scale factor marked as a red point for reference. This plot shows two
interesting features of the insertion stall mechanism. First, it is
possible to introduce stalls that do not significantly affect the write
throughput, but do improve query latency. This is seen by the difference
between the two points at the far right of the curve, where introducing
a slight stall improves query performance at virtually no cost. This
represents the region of the curve where the stalling introduces delay
that doesn't exceed the cost of a buffer flush, and so the amount of
time spent stalling by the system doesn't change much.

The second, and perhaps more notable, point that this plot shows is
that introducing the stall rate provides a smooth design trade-off
between query and insert performance. In fact, this space is far more
useful than the trade-off space represented by layout policy and scale
factor selection using strict reconstruction schemes that we examined
in Chapter~\ref{chap:design-space}. At the upper end of the insertion
optimized region, we see more than double the insertion throughput of
tiering (with significantly lower tail latencies as well) at the cost
of a slightly larger than 2x increase in query latency. Moving down the
curve, we see that we are able to roughly match the performance of tiering
within this space, and even shift to more query optimized configurations.

We also performed the same testing for $k$-NN queries using
VPTree and the \texttt{SBW} dataset.  The results are shown in
Figure~\ref{fig:tl-latency-curve-knn}. Because the run time of $k$-NN
queries is significantly longer than the point lookups in the ISAM test,
we additionally applied a rate limit to the query thread, issuing new
queries every 100 milliseconds, and configured query preemption with a
trigger point of approximately 40 milliseconds.  We applied the same
parameters for the tiering test, and counted any additional latency
associated with query preemption towards the average query latency figures
reported. This test shows that, as with ISAM, we have access to a
similarly clear trade-off space by adjusting the insertion throughput;
however, in this case the standard tiering policy did perform better in
terms of both average insertion throughput and query latency.


\begin{figure}
\centering
\subfloat[ISAM w/ Point Lookup]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-latency-curve.pdf} \label{fig:tl-latency-curve-isam}} 
\subfloat[VPTree w/ $k$-NN]{\includegraphics[width=.5\textwidth]{img/tail-latency/knn-stall-latency-curve.pdf} \label{fig:tl-latency-curve-knn}} \\
\caption{Insertion Throughput vs. Query Latency}
\label{fig:tl-latency-curve}
\end{figure}

This shows a very interesting result. Not only is our approach able
to match a strict reconstruction policy in terms of average query and
insertion performance with better tail latencies, but it is even able
to provide a superior set of design trade-offs than the strict policies,
at least in environments where sufficient parallel processing and memory
are available to leverage parallel reconstructions.

\subsection{Legacy Design Space}

Our new system retains the concept of buffer size and scale factor from
the previous version, although these have very different performance
implications given our different compaction strategy. In this test, we
examine the effects of these parameters on the insertion-query trade-off
curves noted above, as well as on insertion tail latency. The results
are shown in Figure~\ref{fig:tl-design-space}, for a dynamized ISAM Tree
using the SOSD \texttt{OSM} dataset and point lookup queries.

\begin{figure}
\centering
\subfloat[Insertion Throughput vs. Query Latency for Varying Scale Factors]{\includegraphics[width=.5\textwidth]{img/tail-latency/stall-sf-sweep.pdf} \label{fig:tl-sf-curve}} 
\subfloat[Insertion Tail Latency for Varying Buffer Sizes]{\includegraphics[width=.5\textwidth]{img/tail-latency/buffer-tail-latency.pdf} \label{fig:tl-buffer-tail}} \\
\caption{Legacy Design Space Examination}
\label{fig:tl-design-space}
\end{figure}

First, we consider the insertion throughput vs. average query latency
curves for our system using different values of scale factor in
Figure~\ref{fig:tl-sf-curve}. Recall that our system of reconstruction in
this chapter does not explicitly enforce any structural invariants, and so
the scale factor's only role is in determining at what point a given level
will have a reconstruction scheduled for it. Lower scale factors will
more aggressively compact shards, while higher scale factors will allow
more shards to accumulate before attempting to perform a reconstruction.
Interestingly, there are clear differences in the curves, particularly at
higher insertion throughputs. For lower throughputs, a scale factor of
$s=2$ appears strictly inferior, while the other tested scale factors result
in roughly equivalent curves. However, as the insertion throughput is
increased, the curves begin to separate more, with $s = 6$ emerging as
the superior option for the majority of the space.

Next, we consider the effect that buffer size has on insertion
tail latency. Based on our discussion of the equal block method
in Section~\ref{sec:tl-insert-query-tradeoff}, and the fact that
our technique only blocks inserts on buffer flushes, it stands
to reason that the buffer size should directly influence the
worst-case insertion time.  That bears out in practice, as shown in
Figure~\ref{fig:tl-buffer-tail}. As the buffer size is increased,
the worst-case insertion time also increases, although the effect is
relatively small.

\subsection{Thread Scaling}

\begin{figure}
\centering
\subfloat[Insertion Throughput vs. Query Latency]{\includegraphics[width=.5\textwidth]{img/tail-latency/recon-thread-scale.pdf} \label{fig:tl-latency-threads}} 
\subfloat[Maximum Insertion Throughput for a Given Query Latency]{\includegraphics[width=.5\textwidth]{img/tail-latency/constant-query.pdf} \label{fig:tl-query-scaling}} \\

\caption{Framework Thread Scaling}
\label{fig:tl-threads}

\end{figure}

In the previous tests, we ran our system configured with 32 available
threads, which was more than enough to run all reconstructions and
queries fully in parallel. However, it's important to determine how well
the system works in more resource-constrained environments.  The system
shares internal threads between reconstructions and queries, and
flushing occurs on a dedicated thread separate from these. During the
benchmark, one client thread issued queries continuously and another
issued inserts. The index accumulated a total of five levels, so
the maximum amount of parallelism available during the testing was four
parallel reconstructions, along with the dedicated flushing thread and
any concurrent queries. In these tests, we used the SOSD \texttt{OSM}
dataset (200M records) and point-lookup queries without early abort
against a dynamized ISAM tree.

We considered the insertion throughput vs. query latency trade-off for
various stall amounts with several internal thread counts. We inserted
30\% of the dataset first, and then measured the insertion throughput over
the insertion of the rest of the data on a client thread, while another
client thread continuously issued queries against the structure.  The
results of this test are shown in Figure~\ref{fig:tl-latency-threads}. The
first note is that the change in the number of available internal
threads has little effect on the insertion throughput, as shown by the
clustering of the points on the curve. This is to be expected, as insertion
throughput is limited only by the stall amount and by the buffer flushing
operation. As flushing occurs on a dedicated thread, it is unaffected
by changes in the internal thread configuration of the system.

In terms of query performance, there are two general effects that can be
observed. The first is that the previously noted reduction in
query performance as the insertion throughput increases appears
in all cases, irrespective of thread count. However, interestingly,
the thread count itself has little effect on the curve outside of the
case of only having a single thread. This can also be seen in
Figure~\ref{fig:tl-query-scaling}, which shows an alternative view of
the same data revealing the best measured insertion throughput associated
with a given query latency bound. In both cases, two or more threads are
capable of significantly higher insertion throughput at a given query
latency. But, at very low insertion throughputs, this effect vanishes
and all thread counts are roughly equivalent in performance.

A large part of the reason for this significant deviation in
behavior between one thread and multiple is likely that queries and
reconstructions share the same pool of background threads in this
framework. Our testing involved issuing queries continuously on a
single thread while performing inserts, and so two background
threads ensure that a reconstruction and a query can run in parallel,
whereas a single thread will force queries to wait behind long-running
reconstructions. Once this bottleneck is overcome, a reduction in the
amount of parallel reconstruction seems to have only a minor influence
on overall performance. This is likely because, although in the worst
case the system requires $\log_s n$ threads to fully parallelize
reconstructions, this worst case is fairly rare. The vast majority of
reconstructions only require a fraction of this total parallel capacity.


\section{Conclusion}

In this chapter, we addressed the last of the three major problems of
dynamization: tail latency. We proposed a technique for limiting the
rate of insertions to match the rate of reconstruction, which equals
the worst-case optimized approach of Overmars~\cite{overmars81} on
a single thread and exceeds it given multiple parallel threads.
We then implemented the necessary mechanisms to support this technique
within our framework, including a significantly improved architecture
for scheduling and executing parallel and background reconstructions,
and a system for rate limiting by rejecting inserts via Bernoulli sampling.

We evaluated this system for fixed insertion rejection rates, and found
significant improvements in tail latencies, approaching the practical lower
bound we established using the equal block method, without requiring
significant degradation of query performance. In fact, we found that
this rate limiting mechanism provides a design space with more effective
trade-offs than the one we examined in Chapter~\ref{chap:design-space},
with the system being able to exceed the query performance of an
equivalently configured tiering system for certain rate limiting
configurations. The method has limitations: assigning a fixed rejection
rate to inserts works well for structures constructable in linear time, like
the ISAM Tree, but was significantly less effective for the VPTree, which
requires $\Theta(n \log n)$ time to construct. For structures like this,
it will be necessary to dynamically scale the amount of throttling based
on the record count and size of reconstruction. Additionally, our current
system isn't easily capable of reaching the ``ideal'' goal of being able
to reliably trade query performance and insertion latency at a fixed
throughput. Nonetheless, the mechanisms for supporting such features
are present, and even this simple implementation represents a marked
improvement in terms of both insertion tail latency and configurability.