| author | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-06-01 13:15:52 -0400 |
|---|---|---|
| committer | Douglas B. Rumbaugh <doug@douglasrumbaugh.com> | 2025-06-01 13:15:52 -0400 |
| commit | cd3447f1cad16972e8a659ec6e84764c5b8b2745 (patch) | |
| tree | 5a50b6e8a99646e326b2c41714f50e4f7dee64d0 /chapters/dynamization.tex | |
| parent | 6354e60f106a89f5bf807082561ed5efd9be0f4f (diff) | |
| download | dissertation-cd3447f1cad16972e8a659ec6e84764c5b8b2745.tar.gz | |
Julia updates
Diffstat (limited to 'chapters/dynamization.tex')
| -rw-r--r-- | chapters/dynamization.tex | 124 |
1 files changed, 66 insertions, 58 deletions
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
index 053fb46..a2277c3 100644
--- a/chapters/dynamization.tex
+++ b/chapters/dynamization.tex
@@ -67,15 +67,17 @@ terms for these two concepts.

 \subsection{Decomposable Search Problems}

-Dynamization techniques require the partitioning of one data structure
-into several, smaller ones. As a result, these techniques can only
-be applied in situations where the search problem to be answered can
-be answered from this set of smaller data structures, with the same
-answer as would have been obtained had all of the data been used to
-construct a single, large structure. This requirement is formalized in
-the definition of a class of problems called \emph{decomposable search
-problems (DSP)}. This class was first defined by Bentley and Saxe in
-their work on dynamization, and we will adopt their definition,
+The dynamization techniques we will be considering require decomposing
+one data structure into several, smaller ones, called blocks, each built
+over a disjoint partition of the data. As a result, these techniques
+can only be applied in situations where the search problem can be
+answered from this set of decomposed blocks. The answer to the search
+problem from the decomposition should be the same as would have been
+obtained had all of the data been stored in a single data structure. This
+requirement is formalized in the definition of a class of problems called
+\emph{decomposable search problems (DSP)}. This class was first defined
+by Bentley and Saxe in their work on dynamization, and we will adopt
+their definition,

 \begin{definition}[Decomposable Search Problem~\cite{saxe79}]
 \label{def:dsp}
@@ -180,15 +182,17 @@ database indices. We refer to a data structure with update support as
 contain header information (like visibility) that is updated in place.
 }

-This section discusses \emph{dynamization}, the construction of a
-dynamic data structure based on an existing static one. When certain
-conditions are satisfied by the data structure and its associated
-search problem, this process can be done automatically, and with
-provable asymptotic bounds on amortized insertion performance, as well
-as worst case query performance. This is in contrast to the manual
-design of dynamic data structures, which involve techniques based on
-partially rebuilding small portions of a single data structure (called
-\emph{local reconstruction})~\cite{overmars83}. This is a very high cost
+This section discusses \emph{dynamization}, the construction of a dynamic
+data structure based on an existing static one. When certain conditions
+are satisfied by the data structure and its associated search problem,
+this process can be done automatically, and with provable asymptotic
+bounds on amortized insertion performance, as well as worst case
+query performance. This automatic approach is in contrast with the
+manual design of a dynamic data structure, which involves altering
+the data structure itself to natively support updates. This process
+usually involves implementing techniques that partially rebuild small
+portions of the structure to accommodate new records, which is called
+\emph{local reconstruction}~\cite{overmars83}. This is a very high cost
 intervention that requires significant effort on the part of the data
 structure designer, whereas conventional dynamization can be performed
 with little-to-no modification of the underlying data structure at all.
@@ -345,7 +349,7 @@ then an insert is done by,
 Following an insert, it is possible that Constraint~\ref{ebm-c1} is
 violated.\footnote{
     Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
     violated by deletes. We're omitting deletes from the discussion at
-    this point, but will circle back to them in Section~\ref{sec:deletes}.
+    this point, but will circle back to them in Section~\ref{ssec:dyn-deletes}.
 }
 In this case, the constraints are enforced by "re-configuring" the
 structure. $s$ is updated to be exactly $f(n)$, all of the existing
 blocks are unbuilt, and then the records are redistributed evenly into
@@ -584,14 +588,13 @@ F(A / B, q) = F(A, q)~\Delta~F(B, q)
 for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
 \end{definition}

-Given a search problem with this property, it is possible to perform
-deletes by creating a secondary ``ghost'' structure. When a record
-is to be deleted, it is inserted into this structure. Then, when the
-dynamization is queried, this ghost structure is queried as well as the
-main one. The results from the ghost structure can be removed from the
-result set using the inverse merge operator. This simulates the result
-that would have been obtained had the records been physically removed
-from the main structure.
+Given a search problem with this property, it is possible to emulate
+removing a record from the structure by instead inserting into a
+secondary ``ghost'' structure. When the dynamization is queried,
+this ghost structure is queried as well as the main one. The results
+from the ghost structure can be removed from the result set using the
+inverse merge operator. This simulates the result that would have been
+obtained had the records been physically removed from the main structure.

 Two examples of invertible search problems are set membership
 and range count. Range count was formally defined in
@@ -670,11 +673,13 @@ to some serious problems, for example if every record in a structure
 of $n$ records is deleted, the net result will be an "empty" dynamized
 data structure containing $2n$ physical records within it. To circumvent
 this problem, Bentley and Saxe proposed a mechanism of setting a maximum
-threshold for the size of the ghost structure relative to the main one,
-and performing a complete re-partitioning of the data once this threshold
-is reached, removing all deleted records from the main structure,
-emptying the ghost structure, and rebuilding blocks with the records
-that remain according to the invariants of the technique.
+threshold for the size of the ghost structure relative to the main one.
+Once this threshold is reached, a complete re-partitioning of the data
+can be performed. During this re-partitioning, all deleted records can
+be removed from the main structure, and the ghost structure emptied
+completely. Then all of the blocks can be rebuilt from the remaining
+records, partitioning them according to the strict binary decomposition
+of the Bentley-Saxe method.

 \subsubsection{Weak Deletes for Deletion Decomposable Search Problems}

@@ -694,16 +699,16 @@ underlying data structure supports a delete operation. More formally,
 for $\mathscr{I}$.
 \end{definition}

-Superficially, this doesn't appear very useful. If the underlying data
-structure already supports deletes, there isn't much reason to use a
-dynamization technique to add deletes to it. However, one point worth
-mentioning is that it is possible, in many cases, to easily \emph{add}
-delete support to a static structure. If it is possible to locate a
-record and somehow mark it as deleted, without removing it from the
-structure, and then efficiently ignore these records while querying,
-then the given structure and its search problem can be said to be
-deletion decomposable. This technique for deleting records is called
-\emph{weak deletes}.
+Superficially, this doesn't appear very useful, because if the underlying
+data structure already supports deletes, there isn't much reason to
+use a dynamization technique to add deletes to it. However, even in
+structures that don't natively support deleting, it is possible in many
+cases to \emph{add} delete support without significant alterations.
+If it is possible to locate a record and somehow mark it as deleted,
+without removing it from the structure, and then efficiently ignore these
+records while querying, then the given structure and its search problem
+can be said to be deletion decomposable. This technique for deleting
+records is called \emph{weak deletes}.

 \begin{definition}[Weak Deletes~\cite{overmars81}]
 \label{def:weak-delete}
@@ -815,10 +820,10 @@ and thereby the query performance. The particular invariant maintenance
 rules depend upon the decomposition scheme used.

 \Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
-a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
-    Block $i=0$ will only ever have one record, so no special maintenance
-    must be done for it. A delete will simply empty it completely.
-},
+a deletion decomposable search problem, the $i$th block where $i \geq 2$,\footnote{
+    Block $i=0$ will only ever have one record, so no special maintenance
+    must be done for it. A delete will simply empty it completely.
+}
 in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
 delete occurs in block $i$, no special action is taken until the number
 of records in that block falls below $2^{i-2}$. Once this threshold is
@@ -1076,12 +1081,13 @@ matching of records in result sets. To work around this, a slight abuse
 of definition is in order: assume that the equality conditions within
 the DSP definition can be interpreted to mean ``the contents in the two
 sets are drawn from the same distribution''. This enables the category
-of DSP to apply to this type of problem.
+of DSP to apply to this type of problem, while maintaining the spirit of
+the definition.

 Even with this abuse, however, IRS cannot generally be considered
 decomposable; it is at best $C(n)$-decomposable. The reason for this is
 that matching the distribution requires drawing the appropriate number
-of samples from each each partition of the data. Even in the special
+of samples from each partition of the data. Even in the special
 case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
 from each partition that must appear in the result set cannot be known
 in advance due to differences in the selectivity of the predicate across
@@ -1102,7 +1108,7 @@ the partitions.
     probability of a $4$. The second and third result sets can only be
     ${3, 3, 3, 3}$ and ${4, 4, 4, 4}$ respectively. Merging these
     together, we'd find that the probability distribution of the sample
-    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were were to perform
+    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
     the same sampling operation over the full dataset (not partitioned),
     the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.

@@ -1111,21 +1117,23 @@ the partitions.
 The problem is that the number of samples drawn from each partition
 needs to be weighted based on the number of elements satisfying the
 query predicate in that partition. In the above example, by drawing $4$
-samples from $D_1$, more weight is given to $3$ than exists within
-the base dataset. This can be worked around by sampling a full $k$
-records from each partition, returning both the sample and the number
-of records satisfying the predicate as that partition's query result,
-and then performing another pass of IRS as the merge operator, but this
-is the same approach as was used for k-NN above. This leaves IRS firmly
+samples from $D_1$, more weight is given to $3$ than exists within the
+base dataset. This can be worked around by sampling a full $k$ records
+from each partition, returning both the sample and the number of records
+satisfying the predicate as that partition's query result. This allows for
+the relative weights of each block to be controlled for during the merge,
+by doing weighted sampling of each partial result. This approach requires
+$\Theta(k)$ time for the merge operation, however, leaving IRS firmly
 in the $C(n)$-decomposable camp. If it were possible to pre-calculate
 the number of samples to draw from each partition, then a constant-time
 merge operation could be used.

-We examine this problem in detail in Chapters~\ref{chap:sampling} and
-\ref{chap:framework} and propose techniques for efficiently expanding
-support of dynamization systems to non-decomposable search problems, as
-well as addressing some additional difficulties introduced by supporting
-deletes, which can complicate query processing.
+We examine expanding support for non-decomposable search problems
+in Chapters~\ref{chap:sampling} and \ref{chap:framework}, where we
+propose techniques for supporting such search problems efficiently
+within a dynamization system, as well as addressing some additional
+difficulties introduced by supporting deletes, which can complicate
+query processing.

 \subsection{Configurability}
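The revised text above leans on the core mechanic of dynamization: a decomposable search problem is answered by querying every block of the decomposition independently and combining the partial results with a merge operator. The sketch below is purely illustrative and is not taken from the dissertation or its framework; it uses range count, a decomposable problem, answered over a set of static sorted blocks with addition as the merge operator.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One static "block": a sorted array supporting a range-count query.
struct SortedBlock {
    std::vector<int> keys;  // sorted once, at construction time

    explicit SortedBlock(std::vector<int> data) : keys(std::move(data)) {
        std::sort(keys.begin(), keys.end());
    }

    std::size_t range_count(int lo, int hi) const {
        auto first = std::lower_bound(keys.begin(), keys.end(), lo);
        auto last  = std::upper_bound(keys.begin(), keys.end(), hi);
        return static_cast<std::size_t>(last - first);
    }
};

// Range count is decomposable: query every block independently and merge
// the partial results with addition. The answer matches what a single
// structure built over all of the records would have returned.
std::size_t decomposed_range_count(const std::vector<SortedBlock> &blocks,
                                   int lo, int hi) {
    std::size_t total = 0;
    for (const auto &b : blocks) {
        total += b.range_count(lo, hi);  // merge operator: +
    }
    return total;
}
```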
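The hunks touching the re-configuration footnote and the "strict binary decomposition" describe layouts in which blocks grow exponentially in size, so that an insert cascades through the levels like binary addition. The following toy sketch of that insertion cascade is an assumption-laden illustration (it uses a sorted vector as a stand-in for the static structure and does not reproduce the dissertation's exact block-size convention); it continues the headers from the sketch above.

```cpp
// Bentley-Saxe style insertion sketch: level i is either empty or holds a
// block of exactly 2^i records. An insert carries records down the levels
// like binary addition, rebuilding one block at the first empty level.
struct BinaryDecomposition {
    std::vector<std::vector<int>> levels;  // each entry: empty, or 2^i keys

    void insert(int key) {
        std::vector<int> carry{key};
        std::size_t i = 0;
        while (i < levels.size() && !levels[i].empty()) {
            // Unbuild the full level and merge its records into the carry.
            carry.insert(carry.end(), levels[i].begin(), levels[i].end());
            levels[i].clear();
            ++i;
        }
        if (i == levels.size()) {
            levels.emplace_back();
        }
        std::sort(carry.begin(), carry.end());  // "rebuild" the new block
        levels[i] = std::move(carry);
    }
};
```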
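The ghost-structure delete mechanism described above can be sketched in the same style for an invertible problem such as range count: a delete is buffered as an insertion into a ghost structure, and at query time the ghost result is removed from the main result with the inverse merge operator (subtraction here). This reuses the SortedBlock sketch above and is likewise only an illustration, not the dissertation's implementation.

```cpp
// Ghost-structure deletes for an invertible problem. Deleted keys are
// inserted into a ghost set; at query time the ghost count is subtracted
// from the main count (inverse merge operator: -).
struct GhostDeleteRangeCount {
    std::vector<SortedBlock> main_blocks;
    std::vector<int> ghost;  // keys that have been "deleted"

    // Assumes every key passed to erase() is present in main_blocks;
    // otherwise the subtraction below would no longer be meaningful.
    void erase(int key) { ghost.push_back(key); }

    std::size_t range_count(int lo, int hi) const {
        std::size_t main_total = decomposed_range_count(main_blocks, lo, hi);
        std::size_t ghost_total = 0;
        for (int k : ghost) {
            if (k >= lo && k <= hi) {
                ++ghost_total;
            }
        }
        return main_total - ghost_total;
    }
};
```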
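For deletion decomposable search problems, the diff describes "weak deletes": locate the record, mark it as deleted in place, and have queries skip marked records. A minimal, hypothetical sketch of that idea follows; the record layout and names are illustrative and not drawn from the dissertation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Weak deletes: each record carries a tombstone flag. A delete marks the
// record in place; queries ignore marked records instead of restructuring.
struct WeakDeleteBlock {
    struct Record {
        int key;
        bool deleted = false;
    };
    std::vector<Record> records;  // assumed sorted by key at construction

    // Mark the first live record with a matching key as deleted.
    bool weak_delete(int key) {
        auto it = std::lower_bound(
            records.begin(), records.end(), key,
            [](const Record &r, int k) { return r.key < k; });
        for (; it != records.end() && it->key == key; ++it) {
            if (!it->deleted) {
                it->deleted = true;
                return true;
            }
        }
        return false;
    }

    // Queries filter out records that have been weakly deleted.
    std::size_t range_count(int lo, int hi) const {
        std::size_t count = 0;
        for (const auto &r : records) {
            if (!r.deleted && r.key >= lo && r.key <= hi) {
                ++count;
            }
        }
        return count;
    }
};
```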
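Finally, the last hunk argues that merging independent range sampling (IRS) results requires weighting each block by how many of its records satisfy the predicate, which is what makes the merge cost Θ(k). Below is a sketch of that weighted merge step, under the assumption that each block returns up to k local samples plus its matching count; the types and helper names are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Partial IRS result from one block: samples drawn from the block, plus
// the number of records in the block that satisfy the query predicate.
struct PartialSample {
    std::vector<int> samples;
    std::size_t matching_count;
};

// Merge per-block samples into one result of size k. Each output sample
// picks a block with probability proportional to its matching count, then
// draws uniformly from that block's local sample. This is the Theta(k)
// merge step discussed above. Assumes a block with matching_count > 0
// returned at least one local sample.
std::vector<int> merge_irs(const std::vector<PartialSample> &parts,
                           std::size_t k, std::mt19937 &rng) {
    std::vector<double> weights;
    double total = 0.0;
    for (const auto &p : parts) {
        weights.push_back(static_cast<double>(p.matching_count));
        total += static_cast<double>(p.matching_count);
    }

    std::vector<int> result;
    if (total == 0.0) {
        return result;  // no records satisfy the predicate anywhere
    }

    // Blocks with matching_count == 0 get zero weight and are never chosen.
    std::discrete_distribution<std::size_t> pick_block(weights.begin(),
                                                       weights.end());
    result.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        const PartialSample &p = parts[pick_block(rng)];
        std::uniform_int_distribution<std::size_t> pick(0, p.samples.size() - 1);
        result.push_back(p.samples[pick(rng)]);
    }
    return result;
}
```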