From 5e4ad2777acc4c2420514e39fb98b7cf2e200996 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Sun, 27 Apr 2025 17:36:57 -0400
Subject: Initial commit
---
 chapters/sigmod23/extensions.tex | 57 ++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)
 create mode 100644 chapters/sigmod23/extensions.tex

\captionsetup[subfloat]{justification=centering}
\section{Extensions}
\label{sec:discussion}

In this section, several extensions of the framework are considered.
Specifically, the applicability of the framework to external and
distributed data structures is discussed, as well as its use to add
automatic support for concurrent updates and sampling to extended SSIs.

\Paragraph{Larger-than-Memory Data.} This framework can be applied to
external static sampling structures with minimal modification. As a
proof of concept, the IRS structure was extended with support for shards
containing external ISAM trees. This structure stores a configurable
number of shards in memory and the rest on disk, making it well suited
to memory-constrained environments. The on-disk shards contain standard
ISAM trees with $8\text{KiB}$ page-aligned nodes. The external version
of the index supports only tombstone-based deletes, as tagging would
require random writes. In principle, a hybrid approach to deletes is
possible: a delete first searches the in-memory data for the record to
be deleted, tagging it if found; if the record is not found, a tombstone
is inserted instead. As the data size grows and the preponderance of the
data comes to reside on disk, however, this approach would in practice
largely revert to the standard tombstone policy. External settings make
the framework even more attractive in terms of performance, owing to the
different cost model. The performance of external data structures is
typically measured in the number of I/O operations, so much of the
overhead introduced by the framework, for tasks like querying the
mutable buffer, building auxiliary structures, and the extra random
number generation required by the shard alias structure, becomes far
less significant.
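To make the hybrid delete policy concrete, the following is a minimal
sketch of how such a delete might be structured. The
\texttt{MutableBuffer}, \texttt{Shard}, and \texttt{Record} interfaces
used here are hypothetical stand-ins, not the framework's actual API.

\begin{verbatim}
#include <vector>

// A sketch of the hybrid delete policy: tag records held in memory,
// and fall back to a tombstone for records that may reside in an
// external shard. All interfaces here are hypothetical.
bool hybrid_delete(MutableBuffer &buffer,
                   std::vector<Shard*> &memory_shards,
                   const Record &rec) {
    // The mutable buffer is a simple unsorted array; scan it directly.
    if (Record *r = buffer.find(rec)) {
        r->set_delete_tag();  // in-place bit flip, no random disk write
        return true;
    }

    // Tagging is also cheap within the in-memory shards.
    for (Shard *shard : memory_shards) {
        if (Record *r = shard->find(rec)) {
            r->set_delete_tag();
            return true;
        }
    }

    // The record, if it exists, is only on disk: tagging it would
    // require a random write against an external ISAM tree, so insert
    // a tombstone into the buffer instead.
    return buffer.append(Record::tombstone(rec));
}
\end{verbatim}

As noted above, once the preponderance of the data is on disk, the two
in-memory checks rarely succeed, and this policy degenerates into the
pure tombstone approach.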
Because the framework maintains the immutability of shards, it is also
well suited for use on top of distributed file systems or with other
distributed data abstractions, such as RDDs in Apache Spark~\cite{rdd}.
Each shard can be encapsulated within an immutable file in HDFS or an
RDD in Spark. A centralized control node or driver program can manage
the mutable buffer, flushing it into a new file or RDD when it is full
and merging it with existing files or RDDs using the same reconstruction
scheme already discussed for the framework. This setup allows the
framework to support datasets that exceed the capacity of a single node.
As an example, XDB~\cite{li19} features an RDD-based distributed
sampling structure that could be supported by this framework.

\Paragraph{Concurrency.} The immutability of the majority of the
structures within the index makes for a straightforward concurrency
implementation. Concurrency control on the buffer is made trivial by the
fact that it is a simple, unsorted array. The rest of the structure is
never updated (aside from possible delete tagging), so concurrency
becomes a simple matter of delaying the freeing of memory used by
internal structures until all threads accessing them have exited, rather
than freeing it immediately upon merge completion. A very basic
concurrent implementation can be achieved by using the tombstone delete
policy together with a reference-counting scheme to control the deletion
of shards following reconstructions. Multiple insert buffers can be used
to improve insertion throughput, as they allow inserts to proceed in
parallel with merges, letting concurrency scale until it is bottlenecked
by memory bandwidth and available storage. This proof-of-concept
implementation is based on a simplified version of an approach proposed
by Golan-Gueta et al. for concurrent log-structured data
stores~\cite{golan-gueta15}.
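As an illustration of this reference-counting scheme, the sketch below
uses C++ \texttt{std::shared\_ptr} to tie the lifetime of each shard to
the set of versions (and hence readers) still referencing it. The
\texttt{Version} and \texttt{Index} types are hypothetical stand-ins
for the framework's internal structures, not its actual API.

\begin{verbatim}
#include <memory>
#include <utility>
#include <vector>

struct Shard;  // immutable once constructed

// An immutable snapshot of the shard list. Readers pin an entire
// version; a reconstruction installs a new one. A shard's memory is
// freed only when the last version referencing it is released, i.e.,
// after every thread reading it has exited.
struct Version {
    std::vector<std::shared_ptr<Shard>> shards;
};

class Index {
    std::shared_ptr<Version> current_;

public:
    // Readers atomically grab the current version; the reference
    // count keeps every shard in it alive for the whole query.
    std::shared_ptr<Version> pin() const {
        return std::atomic_load(&current_);
    }

    // A reconstruction builds the merged shards into a fresh Version
    // off to the side, then swaps it in. Old shards are reclaimed
    // automatically once the last pinned copy is dropped.
    void install(std::shared_ptr<Version> next) {
        std::atomic_store(&current_, std::move(next));
    }
};
\end{verbatim}

Under this scheme, sampling threads call \texttt{pin()} and work
against a consistent, immutable snapshot while reconstructions proceed
against new versions; combined with the tombstone delete policy, no
shared state outside the buffer is ever modified in place.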