\captionsetup[subfloat]{justification=centering}

\section{Extensions}
\label{sec:discussion}

While this chapter has thus far discussed single-threaded, in-memory data structures, our technique can be easily extended to support other use cases. In this section, we discuss extensions supporting external data structures, distributed data structures, and concurrency.

\subsection{External Data Structures}
\label{ssec:ext-external}

Our dynamization techniques can accommodate external data structures as easily as in-memory ones. To demonstrate this, we have implemented a dynamized version of an external ISAM tree for answering IRS queries. The mutable buffer remains an unsorted array in memory; the shards themselves, however, can be \emph{either} in-memory ISAM trees or external ones. Our system allows a user-configurable number of shards to reside in memory, with the rest on disk. This allows the smallest few shards, which sustain the most reconstructions, to reside in memory for performance, while the majority of the data is stored on disk, capturing some of the benefits of both approaches.\footnote{
In traditional LSM trees, which are an external data structure, only the memtable resides in memory. We have broken with this model because, for query performance reasons, the mutable buffer must remain small. Placing a few levels in memory mitigates the performance effects of frequent buffer flushes. This is not strictly necessary, however.
} The on-disk shards are built from standard ISAM trees using $8$~KiB page-aligned internal and leaf nodes. To avoid random writes, we support only tombstone-based deletes. In principle, a hybrid approach is possible, in which a delete first searches the in-memory shards for the record and tags it if found, inserting a tombstone only when it is not located.
However, because of the geometric growth rate of the shards, the majority of the data will reside on disk at any given time, so this would provide only a marginal improvement.

\subsection{Distributed Data Structures}

Many distributed data processing systems are built on immutable abstractions, such as Apache Spark's resilient distributed dataset (RDD)~\cite{rdd} or the Hadoop file system's (HDFS) append-only files~\cite{hadoop}. Each shard can be encapsulated within an HDFS file or a Spark RDD, and a centralized control node can manage the mutable buffer. Flushing this buffer would create a new file or RDD, and reconstructions could likewise be performed by merging existing immutable structures into new ones, using the same basic scheme already discussed in this chapter. With these tools, SSIs over datasets that exceed the capacity of a single node could be supported. Such distributed SSIs do exist, such as the RDD-based sampling structure used in XDB~\cite{li19}.

\subsection{Concurrency}
\label{ssec:ext-concurrency}

Because our dynamization technique is built on top of static data structures, a limited form of concurrency support is straightforward to implement. To that end, we created a proof-of-concept dynamization of an ISAM tree for IRS, based on a simplified version of a general concurrency control scheme for log-structured data stores~\cite{golan-gueta15}.

First, we restrict ourselves to tombstone deletes. This ensures that all the static data structures within our dynamization are also immutable. When using tagging, the delete flags on records in these structures could be updated in place, leading to possible synchronization issues. While this is not a fundamentally unsolvable problem, and could be addressed simply through the use of a timestamp in the header of each record, we decided to keep things simple and implement our concurrency scheme under the assumption of full shard immutability.
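To make this concrete, the following sketch shows how tombstone-based deletes preserve shard immutability. The record layout and function names here are illustrative assumptions, not taken from our implementation: a delete is simply an append of a tombstone record to the mutable buffer, and cancellation happens lazily at query time (or during reconstruction), so no shard is ever modified after it is built.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record layout; names are illustrative only. A tombstone
// is an ordinary record with a flag set, so shards containing one never
// need to be modified after construction.
struct Record {
    int64_t key;
    int64_t value;
    bool    tombstone;
};

// A delete is just an insert: append a tombstone to the mutable buffer.
// The matching record is cancelled lazily when queries (or merges
// during reconstruction) encounter both copies.
void erase(std::vector<Record> &buffer, int64_t key, int64_t value) {
    buffer.push_back(Record{key, value, true});
}

// Query-time filtering: a record is considered deleted if a matching
// tombstone exists in logically newer data.
bool is_deleted(const std::vector<Record> &newer, const Record &r) {
    return std::any_of(newer.begin(), newer.end(), [&](const Record &t) {
        return t.tombstone && t.key == r.key && t.value == r.value;
    });
}
```

Because both paths only ever append to the buffer or read shards, this delete mechanism composes cleanly with the versioning scheme described next.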
Given this immutability, we can construct a simple versioning system over the entire structure. Reconstructions can be performed in the background and then ``activated'' atomically using a compare-and-swap of a pointer to the entire structure. Reference counting can then be used to automatically free old versions of the structure once all queries accessing them have finished. The buffer itself is an unsorted array, so a query can capture a consistent, static view of it by recording the tail pointer at the time the query begins. New inserts can proceed concurrently by performing a fetch-and-add on the tail. By using multiple buffers, inserts and reconstructions can proceed, to some extent, in parallel, which helps to hide some of the insertion tail latency due to blocking on reconstructions during a buffer flush.
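The versioning scheme just described can be sketched as follows. The class and member names are illustrative assumptions rather than our actual implementation; the sketch uses \texttt{std::shared\_ptr} for reference counting, the atomic \texttt{shared\_ptr} free functions for the compare-and-swap activation, and an atomic fetch-and-add on the buffer tail for concurrent inserts.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical types; names are illustrative only.
struct Record { int64_t key; int64_t value; };
struct Shard  { /* immutable static structure, e.g., an ISAM tree */ };

// One immutable version of the full structure: a snapshot of the shard
// list. shared_ptr's reference count frees a version automatically once
// the last query holding it finishes.
struct Version {
    std::vector<std::shared_ptr<Shard>> shards;
};

class Dynamized {
    std::shared_ptr<Version> active_;  // currently active version
    std::vector<Record> buffer_;       // pre-allocated mutable buffer
    std::atomic<size_t> tail_{0};      // next free buffer slot

public:
    explicit Dynamized(size_t buffer_cap)
        : active_(std::make_shared<Version>()), buffer_(buffer_cap) {}

    // Concurrent insert: claim a slot with fetch-and-add on the tail.
    bool insert(const Record &r) {
        size_t slot = tail_.fetch_add(1);
        if (slot >= buffer_.size()) return false;  // full; flush required
        buffer_[slot] = r;
        return true;
    }

    // A query captures a consistent snapshot: the active version plus
    // the buffer tail at the moment the query begins.
    std::pair<std::shared_ptr<Version>, size_t> snapshot() {
        return { std::atomic_load(&active_), tail_.load() };
    }

    // A background reconstruction builds a new version, then activates
    // it with a compare-and-swap; superseded versions are freed once
    // their reference counts drop to zero.
    bool activate(std::shared_ptr<Version> expected,
                  std::shared_ptr<Version> next) {
        return std::atomic_compare_exchange_strong(&active_, &expected,
                                                   std::move(next));
    }
};
```

A reconstruction thread would call \texttt{snapshot()}, merge the captured shards into a new \texttt{Version} off to the side, and then call \texttt{activate()}; queries that began against the old version continue unaffected until they release their reference.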