From 901a04fd8ec9a07b7bd195517a6d9e89da3ecab6 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Fri, 9 May 2025 14:08:31 -0400
Subject: updates

---
 chapters/sigmod23/extensions.tex | 60 +++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index d8a4247..2752b0f 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -1,31 +1,41 @@
 \captionsetup[subfloat]{justification=centering}
 \section{Extensions to the Framework}
 \label{sec:discussion}
-In this section, various extensions of the framework are considered.
-Specifically, the applicability of the framework to external or distributed
-data structures is discussed, as well as the use of the framework to add
-automatic support for concurrent updates and sampling to extended SSIs.
-
-\Paragraph{Larger-than-Memory Data.} This framework can be applied to external
-static sampling structures with minimal modification. As a proof-of-concept,
-the IRS structure was extended with support for shards containing external ISAM
-trees. This structure supports storing a configurable number of shards in
-memory, and the rest on disk, making it well suited for operating in
-memory-constrained environments. The on-disk shards contain standard ISAM
-trees, with $8\text{KiB}$ page-aligned nodes. The external version of the
-index only supports tombstone-based deletes, as tagging would require random
-writes. In principle a hybrid approach to deletes is possible, where a delete
-first searches the in-memory data for the record to be deleted, tagging it if
-found. If the record is not found, then a tombstone could be inserted. As the
-data size grows, though, and the preponderance of data is found on disk, this
-approach would largely revert to the standard tombstone approach in practice.
-External settings make the framework even more attractive, in terms of
-performance characteristics, due to the different cost model. In external data
-structures, performance is typically measured in terms of the number of IO
-operations, meaning that much of the overhead introduced by the framework for
-tasks like querying the mutable buffer, building auxiliary structures, extra
-random number generations due to the shard alias structure, and the like,
-become far less significant.
+While this chapter has thus far discussed single-threaded, in-memory data
+structures, the framework as proposed can be easily extended to support
+other use-cases. In this section, we discuss extending this framework
+to support concurrency and external data structures.
+
+
+\Paragraph{Larger-than-Memory Data.} Our dynamization techniques,
+as discussed thus far, can easily accommodate external data structures
+as well as in-memory ones. To demonstrate this, we have implemented
+a dynamized version of an external ISAM tree for use in answering IRS
+queries. The mutable buffer remains an unsorted array in memory; however,
+the shards themselves can be \emph{either} an in-memory ISAM tree
+or an external one. Our system keeps a user-configurable number of
+shards in memory and the rest on disk, for performance tuning purposes.
+
+The on-disk shards are built from standard ISAM trees using $8$ KiB
+page-aligned internal and leaf nodes. To avoid random writes, we only
+support tombstone-based deletes. Theoretically, it should be possible to
+implement a hybrid approach, where deletes first search the in-memory
+shards for the record and tag it if found, inserting a tombstone only
+when it is not located. However, because of the geometric growth rate
+of the shards, at any given time the majority of the data will be on
+disk anyway, so this would only provide a marginal improvement.
+
+Our implementation does not include a buffer manager, for simplicity. The
+external interface requires passing in page-aligned buffers.
+
+
+
+
+\Paragraph{Applications to distributed data structures.}
+Many distributed file-systems are built on immutable abstractions, such as
+Apache Spark's resilient distributed dataset (RDD)~\cite{rdd} or Hadoop's
+immutable
+
 
 Because the framework maintains immutability of shards, it is also well suited
 for use on top of distributed file-systems or with other distributed data
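
The hybrid delete approach described in the added text can be sketched as follows. This is a minimal illustration and not code from the commit: the `Shard` class, its `in_memory` flag, and the `hybrid_delete` function are hypothetical names, and shards are modeled as plain Python lists rather than ISAM trees.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical stand-in for an immutable shard (not the framework's API)."""
    records: list
    in_memory: bool
    # Delete tags; a real structure would likely keep one bit per record.
    deleted: set = field(default_factory=set)


def hybrid_delete(buffer, shards, key):
    """Tag the record if an in-memory shard holds it, else insert a tombstone.

    Returns "tagged" or "tombstone" to show which path was taken.
    """
    for shard in shards:
        if not shard.in_memory:
            # On-disk shards are skipped: tagging there needs a random write.
            continue
        if key in shard.records and key not in shard.deleted:
            shard.deleted.add(key)
            return "tagged"
    # Not found in memory: fall back to a tombstone in the mutable buffer.
    buffer.append(("tombstone", key))
    return "tombstone"
```

Because shard sizes grow geometrically, most records would sit in the on-disk shards, so in this sketch nearly every delete would take the tombstone path, which matches the patch's remark that the hybrid approach gives only a marginal improvement.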