updates

author: Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 17:11:15 -0400
committer: Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 17:11:15 -0400
commit: da08d6025eb81b95200284632d8c0a8daf0612f2 (patch)
tree: 6c93312423ac144457f92da0703d763275e2edc5
parent: 293d4af6d349d07ecd72c96121033e2ab155d359 (diff)
download: dissertation-da08d6025eb81b95200284632d8c0a8daf0612f2.tar.gz
1 files changed, 97 insertions, 0 deletions
diff --git a/chapters/related-works.tex b/chapters/related-works.tex
index 8e24750..c9d9357 100644
--- a/chapters/related-works.tex
+++ b/chapters/related-works.tex
@@ -347,3 +347,100 @@ works in this section that discusses generalizing its techniques to
 non-single dimensional data.
 
 \section{Generalized Index Templates}
+
+The other line of work in automatic index design we discussed
+was generalized index templates. These are systems which expose a
+generalization of a particular type of data structure, presenting hooks
+for user-defined functions into various data structure operations,
+while providing all of the necessary index features (concurrency,
+crash recovery, etc.) automatically. Assuming that the underlying data
+structure can be used to construct an index for a given use case, this
+approach makes it straightforward for a database user to produce a custom
+index for a particular search problem or data type. Unfortunately,
+these templates are restricted by their underlying data structure,
+and can only produce indices that fit this model. There are two major
+examples of generalized index templates: the generalized search tree
+(GiST)~\cite{gist, concurrent-gist, pg-gist} and the generalized inverted
+index (GIN)~\cite{pg-gin}.
+
+The GiST~\cite{gist} is a general data structure built on a search
+tree, which allows the user to specify certain behaviors to
+adapt it to their needs, while automatically providing concurrency
+and crash recovery~\cite{concurrent-gist}. It has been implemented in
+Postgres~\cite{pg-gist}. GiST requires the user to implement six
+functions:
+\begin{itemize}
+\item $\mathbftt{consistent}(E, q)$ \\
+	Given an internal node entry, $E$, and a query predicate, $q$,
+	it determines whether the entry could satisfy the predicate and
+	returns true if so, or false if the predicate certainly cannot
+	be satisfied.
+
+\item $\mathbftt{union}(P)$ \\
+	Given a set of contiguous internal node entries, $P$, this
+	function returns a predicate that holds for all tuples stored
+	in children of the entries.
+
+\item $\mathbftt{compress}(E)$ \\
+	Returns a compressed version of an entry, $E$. This is used to
+	produce internal node separator keys.
+
+\item $\mathbftt{decompress}(E)$ \\
+	Returns a decompressed version of a compressed entry, $E$. This
+	is used to recover record intervals consistent with a separator
+	key. 
+
+\item $\mathbftt{penalty}(E_1, E_2)$ \\
+	Returns the ``penalty'' for inserting $E_2$ into the sub-tree
+	rooted at $E_1$. This is used for tree balancing in the insertion
+	routines.
+
+\item $\mathbftt{pick\_split}(P)$ \\
+	Given a set of contiguous internal node entries, $P$, pick a split
+	point to break $P$ into two disjoint partitions. This is used during
+	split operations to maintain tree balance.
+	
+\end{itemize}
+
+Given these user-defined functions, the GiST will automatically provide
+two search routines: a general search returning all records matching a
+predicate, and a specific search for linearly ordered data. This latter
+search also requires that a $\mathbftt{compare}(E_1, E_2)$ function be
+specified, which serves as a comparator between two records. The structure
+also supports inserting and deleting records.
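+
+As a concrete illustration, consider a minimal sketch of how these
+functions might be instantiated to obtain a B+-tree-like index over
+integer keys. The half-open interval representation and the simplified
+penalty used here are assumptions made for the sake of the example,
+and $\mathbftt{compress}$ and $\mathbftt{decompress}$ are taken to be
+the identity. Let each entry, $E$, carry an interval $[a_E, b_E)$
+bounding the keys beneath it, and let a query predicate, $q$, select
+the keys in $[x, y)$. Then,
+\begin{align*}
+\mathbftt{consistent}(E, q) &= (a_E < y) \land (x < b_E) \\
+\mathbftt{union}(P) &= \left[\min_{E \in P} a_E,\; \max_{E \in P} b_E\right) \\
+\mathbftt{penalty}(E_1, E_2) &= \max(a_{E_1} - a_{E_2}, 0) + \max(b_{E_2} - b_{E_1}, 0)
+\end{align*}
+and $\mathbftt{pick\_split}(P)$ could sort $P$ by $a_E$ and divide it at
+the median. For example, with $E = [10, 20)$ and $q = [18, 25)$,
+$\mathbftt{consistent}$ returns true because $10 < 25$ and $18 < 20$,
+while inserting $E_2 = [18, 25)$ beneath $E_1 = [10, 20)$ incurs a
+penalty of $\max(10 - 18, 0) + \max(25 - 20, 0) = 5$, the amount by
+which the interval of $E_1$ must grow.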
+
+The generalized inverted index, GIN, is a similar concept for a different
+data structure~\cite{pg-gin}. Rather than generalizing a search tree,
+GIN generalizes an inverted index. This structure indexes a set of
+composite values, each containing multiple keys, by mapping each key to
+the set of values in which it appears. The index allows all values containing a specified key
+to be easily identified. The classic example of a use for an inverted
+index is document search, where the values are entire documents, and
+the keys are specific words.
+
+GIN requires the user to specify the following functions:\footnote{
+	I've streamlined the representations of these functions somewhat
+	to conform to the conventions of this work. In the original
+	source documentation, these function definitions are given in C,
+	with many more arguments for outputs, configuration, etc.
+	}
+\begin{itemize}
+	\item $\mathbftt{extract\_value}(V)$ \\
+		Extract all of the keys from a given value, $V$, and return them
+		as an array. 
+	\item $\mathbftt{extract\_query}(Q)$ \\
+		Extract the keys to search for in a given query, $Q$.
+	\item $\mathbftt{consistent}(V, Q)$ \\
+		Checks an indexed value, $V$, for the keys contained in a query,
+		$Q$. Returns an array of booleans, where the $i$th
+		element in the array is true if the $i$th key from the
+		query appears in the value.
+	\item $\mathbftt{compare}(E_1, E_2)$ \\
+		A comparator used for sorting keys.
+\end{itemize}
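+
+To illustrate how these functions interact, consider a small
+hypothetical document search example in which the keys are lowercase
+words. The tokenization and the sample documents are assumptions made
+for the example and are not part of GIN itself. Suppose two values are
+indexed, $V_1 = \text{``the quick fox''}$ and
+$V_2 = \text{``the lazy dog''}$. Then,
+\begin{align*}
+\mathbftt{extract\_value}(V_1) &= [\text{fox}, \text{quick}, \text{the}] \\
+\mathbftt{extract\_value}(V_2) &= [\text{dog}, \text{lazy}, \text{the}]
+\end{align*}
+and the index associates ``the'' with $\{V_1, V_2\}$ and each of the
+remaining keys with the single value containing it. For a query, $Q$,
+searching for the words ``quick'' and ``dog'',
+\begin{align*}
+\mathbftt{extract\_query}(Q) &= [\text{quick}, \text{dog}] \\
+\mathbftt{consistent}(V_1, Q) &= [\text{true}, \text{false}] \\
+\mathbftt{consistent}(V_2, Q) &= [\text{false}, \text{true}]
+\end{align*}
+so neither value contains every key in the query, but each contains
+one of them.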
+
+Note that, unlike GiST, GIN doesn't generalize the construction of
+the index itself. It uses a B-tree index over the keys with limited
+user control over how it is built. The generalization of GIN is in the
+extraction of keys from values, and the query process itself. Like GiST,
+a GIN index automatically provides concurrency and crash recovery.