\captionsetup[subfloat]{justification=centering}
\section{Extensions}
\label{sec:discussion}
While this chapter has thus far discussed single-threaded, in-memory
data structures, our technique can be extended to support other use
cases. In this section, we discuss extensions to support external,
distributed, and concurrent data structures.
\subsection{External Data Structures}
\label{ssec:ext-external}
Our dynamization techniques can easily accommodate external data
structures as well as in-memory ones. To demonstrate this, we have
implemented a dynamized version of an external ISAM tree for use in
answering IRS queries. The mutable buffer remains an unsorted array in
memory; however, the shards themselves can be \emph{either} in-memory
ISAM trees or external ones. Our system allows a user-configurable
number of shards to reside in memory, with the rest on disk. This
allows the smallest few shards, which sustain the most reconstructions,
to reside in memory for performance, while the majority of the data is
stored on disk.\footnote{
In a traditional LSM tree, which is an external data structure,
only the memtable resides in memory. We have decided to break with
this model because, for query performance reasons, the mutable
buffer must remain small. Placing a few levels in memory mitigates
the performance effects of frequent buffer flushes, though this is
not strictly necessary.
}
The on-disk shards are built from standard ISAM trees with $8$ KiB
page-aligned internal and leaf nodes. To avoid random writes, we only
support tombstone-based deletes. Theoretically, it should be possible
to implement a hybrid approach, in which a delete first searches the
in-memory shards for the record, tagging it if found and inserting a
tombstone only when it is not located. However, because of the
geometric growth rate of the shards, the majority of the data resides
on disk at any given time, so this approach would provide only a
marginal improvement.
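
To make the trade-off concrete, the following is a minimal sketch of
that hypothetical hybrid delete path. Every type and name here is an
illustrative placeholder; our actual implementation supports only
tombstone deletes.
\begin{verbatim}
#include <vector>

struct Record { /* key, value, header flags */ };

// Hypothetical shard interface; tag_delete marks a record as
// deleted in place and reports whether it was found.
struct MemShard {
    virtual bool tag_delete(const Record &rec) = 0;
    virtual ~MemShard() = default;
};

// Sketch of the hybrid delete: tag in-memory copies, and fall back
// to a tombstone for records that live on disk or in the buffer.
bool hybrid_delete(std::vector<MemShard *> &memory_shards,
                   std::vector<Record> &buffer, const Record &rec) {
    for (auto *shard : memory_shards) {
        if (shard->tag_delete(rec)) {
            return true;   // cheap in-memory tag; no disk write
        }
    }
    Record tombstone = rec;
    /* ... set the tombstone flag in the record header ... */
    buffer.push_back(tombstone);
    return true;
}
\end{verbatim}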
\subsection{Distributed Data Structures}
\label{ssec:ext-distributed}
Many distributed data processing systems are built on immutable
abstractions, such as Apache Spark's resilient distributed datasets
(RDDs)~\cite{rdd} or the Hadoop file system's (HDFS) append-only
files~\cite{hadoop}. Each shard can be encapsulated within an HDFS
file or a Spark RDD, and a centralized control node can manage the
mutable buffer. Flushing this buffer would create a new file/RDD, and
reconstructions could likewise be performed by creating new immutable
structures through the merging of existing ones, using the same basic
scheme as has already been discussed in this chapter. Using these
tools, SSIs over datasets that exceed the capacity of a single node
could be supported. Such distributed SSIs do exist, such as the
RDD-based sampling structure used in XDB~\cite{li19}.
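
Although we have not built such a system, the control-node logic
sketched below illustrates the proposed scheme. The
\texttt{ImmutableStore} interface is a hypothetical stand-in for an
HDFS- or RDD-backed storage layer, not an actual API of either system.
\begin{verbatim}
#include <cstddef>
#include <string>
#include <vector>

struct Record { /* application-defined */ };

// Hypothetical abstraction over an immutable storage backend,
// such as HDFS files or Spark RDDs.
struct ImmutableStore {
    // Persist a batch of records as a new immutable shard.
    virtual std::string create_shard(const std::vector<Record> &recs) = 0;
    // Merge existing shards into a new one (a reconstruction).
    virtual std::string merge_shards(const std::vector<std::string> &ids) = 0;
    virtual ~ImmutableStore() = default;
};

// Flush the mutable buffer into a new shard on level 0 and, when
// that level is full, merge its shards into a single new structure.
void flush_buffer(ImmutableStore &store, std::vector<Record> &buffer,
                  std::vector<std::string> &level0,
                  std::size_t scale_factor) {
    level0.push_back(store.create_shard(buffer));
    buffer.clear();
    if (level0.size() >= scale_factor) {
        std::string merged = store.merge_shards(level0);
        level0.clear();
        // ... promote `merged` to the next level, recursing as needed
    }
}
\end{verbatim}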
\subsection{Concurrency}
\label{ssec:ext-concurrency}
Because our dynamization technique is built on top of static data
structures, a limited form of concurrency support is straightforward
to implement. To that end, we created a proof-of-concept dynamization
of an ISAM tree for IRS, based on a simplified version of a general
concurrency control scheme for log-structured data
stores~\cite{golan-gueta15}.

First, we restrict ourselves to tombstone deletes. This ensures that
all the static data structures within our dynamization are fully
immutable. When using tagging, the delete flags on records in these
structures can be updated in place, leading to possible
synchronization issues. While this is not a fundamentally unsolvable
problem, and could be addressed through the use of a timestamp in the
header of each record, we decided to keep things simple and implement
our concurrency scheme on the assumption of full shard immutability.
Given this immutability, we can construct a simple versioning system over
the entire structure. Reconstructions can be performed in the background
and then ``activated'' atomically by using a simple compare-and-swap of
a pointer to the entire structure. Reference counting can then be used
to automatically free old versions of the structure when all queries
accessing them have finished.
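
This scheme maps directly onto C++20's atomic \texttt{shared\_ptr}
specialization, as the sketch below shows. The \texttt{Version} type
is a hypothetical placeholder for the structure's shard list; this is
an illustration of the approach, not our exact implementation.
\begin{verbatim}
#include <atomic>
#include <memory>

struct Version { /* immutable list of shards, etc. */ };

// The single atomic pointer through which all queries enter.
std::atomic<std::shared_ptr<const Version>> active_version;

// Queries pin the current version; the reference count keeps it
// alive until every query holding it has finished, at which point
// it is freed automatically.
std::shared_ptr<const Version> begin_query() {
    return active_version.load();
}

// A background reconstruction builds `next` from `expected` and
// attempts to activate it atomically. If the compare-and-swap
// fails, another activation won the race, and the reconstruction
// must be redone against the new active version.
bool activate(std::shared_ptr<const Version> expected,
              std::shared_ptr<const Version> next) {
    return active_version.compare_exchange_strong(expected, next);
}
\end{verbatim}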
The buffer itself is an unsorted array, so a query can capture a
consistent, static view of it by recording the tail pointer at the
moment the query begins. New inserts can proceed concurrently using a
fetch-and-add on the tail. By using multiple buffers, inserts and
reconstructions can proceed, to some extent, in parallel, which helps
to hide some of the insertion tail latency caused by blocking on
reconstructions during a buffer flush.
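
A minimal sketch of this buffer protocol is shown below, assuming a
fixed-capacity array. It omits the write-visibility handshake that a
production implementation would need between an insert claiming a
slot and a query reading it.
\begin{verbatim}
#include <algorithm>
#include <atomic>
#include <cstddef>

template <typename Record, std::size_t CAPACITY>
class MutableBuffer {
    Record data[CAPACITY];
    std::atomic<std::size_t> tail{0};

public:
    // Concurrent insert: fetch-and-add claims a unique slot.
    bool append(const Record &rec) {
        std::size_t idx = tail.fetch_add(1);
        if (idx >= CAPACITY) {
            return false;  // buffer full; caller triggers a flush
        }
        data[idx] = rec;
        return true;
    }

    // A query captures a consistent view by reading the tail once;
    // records appended afterwards fall outside its view.
    std::size_t snapshot_tail() const {
        return std::min(tail.load(), CAPACITY);
    }
};
\end{verbatim}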