From 901a04fd8ec9a07b7bd195517a6d9e89da3ecab6 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh <dbr4@psu.edu>
Date: Fri, 9 May 2025 14:08:31 -0400
Subject: updates

---
 chapters/sigmod23/extensions.tex | 60 +++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 25 deletions(-)

(limited to 'chapters')

diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index d8a4247..2752b0f 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -1,31 +1,41 @@
 \captionsetup[subfloat]{justification=centering}
 \section{Extensions to the Framework}
 \label{sec:discussion}
-In this section, various extensions of the framework are considered.
-Specifically, the applicability of the framework to external or distributed
-data structures is discussed, as well as the use of the framework to add
-automatic support for concurrent updates and sampling to extended SSIs.
-
-\Paragraph{Larger-than-Memory Data.} This framework can be applied to external
-static sampling structures with minimal modification. As a proof-of-concept,
-the IRS structure was extended with support for shards containing external ISAM
-trees. This structure supports storing a configurable number of shards in
-memory, and the rest on disk, making it well suited for operating in
-memory-constrained environments. The on-disk shards contain standard ISAM
-trees, with $8\text{KiB}$ page-aligned nodes. The external version of the
-index only supports tombstone-based deletes, as tagging would require random
-writes. In principle a hybrid approach to deletes is possible, where a delete
-first searches the in-memory data for the record to be deleted, tagging it if
-found. If the record is not found, then a tombstone could be inserted. As the
-data size grows, though, and the preponderance of data is found on disk, this
-approach would largely revert to the standard tombstone approach in practice.
-External settings make the framework even more attractive, in terms of
-performance characteristics, due to the different cost model. In external data
-structures, performance is typically measured in terms of the number of IO
-operations, meaning that much of the overhead introduced by the framework for
-tasks like querying the mutable buffer, building auxiliary structures, extra
-random number generations due to the shard alias structure, and the like,
-become far less significant.
+While this chapter has thus far discussed single-threaded, in-memory data
+structures, the framework as proposed can be easily extended to support
+other use-cases. In this section, we discuss extending this framework
+to support concurrency and external data structures.
+
+
+\Paragraph{Larger-than-Memory Data.} Our dynamization techniques,
+as discussed thus far, can easily accommodate external data structures
+as well as in-memory ones. To demonstrate this, we have implemented
+a dynamized version of an external ISAM tree for use in answering IRS
+queries. The mutable buffer remains an unsorted array in memory; however,
+the shards themselves can be \emph{either} an in-memory ISAM tree
+or an external one. Our system keeps a user-configurable number of
+shards in memory and the rest on disk, for performance tuning purposes.
+
+The on-disk shards are built from standard ISAM trees using $8$ KiB
+page-aligned internal and leaf nodes. To avoid random writes, we only
+support tombstone-based deletes. Theoretically, it should be possible to
+implement a hybrid approach, where deletes first search the in-memory
+shards for the record and tag it if found, inserting a tombstone only
+when it is not located. However, because of the geometric growth rate
+of the shards, at any given time the majority of the data will be on
+disk anyway, so this would only provide a marginal improvement.
+
+For simplicity, our implementation does not include a buffer manager. The
+external interface requires passing in page-aligned buffers.
+
+
+
+
+\Paragraph{Applications to distributed data structures.}
+Many distributed file-systems are built on immutable abstractions, such as
+Apache Spark's resilient distributed dataset (RDD)~\cite{rdd} or Hadoop's
+immutable 
+
 
 Because the framework maintains immutability of shards, it is also well suited for
 use on top of distributed file-systems  or with other distributed data
-- 
cgit v1.2.3