author Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 17:11:15 -0400
committer Douglas B. Rumbaugh <doug@douglasrumbaugh.com> 2025-07-06 17:11:15 -0400
commit da08d6025eb81b95200284632d8c0a8daf0612f2 (patch)
tree 6c93312423ac144457f92da0703d763275e2edc5
parent 293d4af6d349d07ecd72c96121033e2ab155d359 (diff)
updates
-rw-r--r-- chapters/related-works.tex 97
1 files changed, 97 insertions, 0 deletions
diff --git a/chapters/related-works.tex b/chapters/related-works.tex
index 8e24750..c9d9357 100644
--- a/chapters/related-works.tex
+++ b/chapters/related-works.tex
@@ -347,3 +347,100 @@ works in this section that discusses generalizing its techniques to
non-single dimensional data.
\section{Generalized Index Templates}
+
+The other line of work in automatic index design we discussed
+was generalized index templates. These are systems which expose a
+generalization of a particular type of data structure, presenting hooks
+for user-defined functions into various data structure operations,
+while providing all of the necessary index features (concurrency,
+crash recovery, etc.) automatically. Assuming that the underlying data
+structure can be used to construct an index for a given use case, this
+approach makes it straightforward for a database user to produce a custom
+index for a particular search problem or data type. Unfortunately,
+these templates are restricted by their underlying data structure,
+and can only produce indices that fit this model. There are two major
+examples of generalized index templates: the generalized search tree
+(GiST)~\cite{gist, concurrent-gist, pg-gist} and the generalized inverted
+index (GIN)~\cite{pg-gin}.
+
+The GiST~\cite{gist} is a general data structure built on a search
+tree, which allows the user to specify certain behaviors to
+adapt it to their needs, while automatically providing concurrency
+and crash recovery~\cite{concurrent-gist}. It has been implemented in
+Postgres~\cite{pg-gist}. GiST requires the user to implement six
+functions:
+\begin{itemize}
+\item $\mathbftt{consistent}(E, q)$ \\
+ Given an internal node entry, $E$, and a query predicate, $q$,
+ determines whether the entry could satisfy the predicate and
+ returns true if so, or false if the predicate certainly cannot
+ be satisfied.
+
+\item $\mathbftt{union}(P)$ \\
+ Given a set of contiguous internal node entries, $P$, this
+ function returns a predicate that holds for all tuples stored
+ in children of the entries.
+
+\item $\mathbftt{compress}(E)$ \\
+ Returns a compressed version of an entry, $E$. This is used to
+ produce internal node separator keys.
+
+\item $\mathbftt{decompress}(E)$ \\
+ Returns a decompressed version of a compressed entry, $E$. This
+ is used to recover record intervals consistent with a separator
+ key.
+
+\item $\mathbftt{penalty}(E_1, E_2)$ \\
+ Returns the ``penalty'' for inserting $E_2$ into the sub-tree
+ rooted at $E_1$. This is used for tree balancing in the insertion
+ routines.
+
+\item $\mathbftt{pick\_split}(P)$ \\
+ Given a set of contiguous internal node entries, $P$, pick a split
+ point to break $P$ into two disjoint partitions. This is used during
+ split operations to maintain tree balance.
+
+\end{itemize}
+
+Given these user-defined functions, the GiST will automatically provide
+two search routines: a general search returning all records matching a
+predicate, and a specific search for linearly ordered data. This latter
+search also requires that a $\mathbftt{compare}(E_1, E_2)$ function be
+specified, which serves as a comparator between two records. The structure also
+supports inserting and deleting records.
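
As a rough illustration of how the six user-defined functions fit together, the following is a minimal Python sketch for one-dimensional interval keys. It is not the Postgres GiST API. The `Interval` type, the identity `compress`/`decompress`, the midpoint `pick_split`, and the `choose_subtree` helper are all assumptions made for this example, and a real GiST would call these hooks from its own insert and search routines.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """Hypothetical key type: a closed interval [lo, hi]."""
    lo: float
    hi: float


def consistent(entry: Interval, query: Interval) -> bool:
    # True if the subtree bounded by `entry` could contain a match,
    # i.e. the bounding interval overlaps the query interval.
    return entry.lo <= query.hi and query.lo <= entry.hi


def union(entries: List[Interval]) -> Interval:
    # Smallest interval that covers every entry in the list.
    return Interval(min(e.lo for e in entries), max(e.hi for e in entries))


def compress(entry: Interval) -> Interval:
    # Identity here (an assumption); intervals are already compact.
    return entry


def decompress(entry: Interval) -> Interval:
    # Inverse of compress, so also the identity in this sketch.
    return entry


def penalty(e1: Interval, e2: Interval) -> float:
    # How much e1's length must grow in order to also cover e2.
    grown = union([e1, e2])
    return (grown.hi - grown.lo) - (e1.hi - e1.lo)


def pick_split(entries: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    # Naive split (an assumption): order by lower bound, cut in the middle.
    ordered = sorted(entries, key=lambda e: (e.lo, e.hi))
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def choose_subtree(children: List[Interval], new: Interval) -> int:
    # Hypothetical helper: an insert descends into the child whose
    # bounding key has the smallest penalty for the new entry.
    return min(range(len(children)), key=lambda i: penalty(children[i], new))
```

In this sketch, a search would descend only into children for which `consistent` returns true, and an insert would use `choose_subtree` at each level, calling `pick_split` when a node overflows and `union` to recompute the parent's bounding key.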
+
+The generalized inverted index, GIN, is a similar concept for a different
+data structure~\cite{pg-gin}. Rather than generalizing a search tree,
+GIN generalizes an inverted index. This structure represents a set of
+key-value pairs, where the value is a set of composite entries containing
+multiple keys. The index allows all values containing a specified key
+to be easily identified. The classic example of a use for an inverted
+index is document search, where the values are entire documents, and
+the keys are specific words.
+
+GIN requires the user to specify the following functions:\footnote{
+ I've streamlined the representations of these functions somewhat
+ to conform to the conventions of this work. In the original
+ source documentation, these function definitions are given in C,
+ with many more arguments for outputs, configuration, etc.
+ }
+\begin{itemize}
+ \item $\mathbftt{extract\_value}(V)$ \\
+ Extract all of the keys from a given value, $V$, and return them
+ as an array.
+ \item $\mathbftt{extract\_query}(Q)$ \\
+ Extract the keys to search for in a given query, $Q$.
+ \item $\mathbftt{consistent}(V, Q)$ \\
+ Checks an indexed value, $V$, for the keys contained in a query,
+ $Q$. Returns an array of booleans, where the $i$th
+ element in the array is true if the $i$th key from the
+ query appears in the value.
+ \item $\mathbftt{compare}(E_1, E_2)$ \\
+ A comparator used for sorting keys.
+\end{itemize}
+
+Note that, unlike GiST, GIN doesn't generalize the construction of
+the index itself. It uses a B-tree index over the keys with limited
+user control over how it is built. The generalization of GIN is in the
+extraction of keys from values, and the query process itself. Like GiST,
+a GIN index automatically provides concurrency and crash recovery.
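
To make the document-search example concrete, here is a minimal Python sketch of the four GIN-style functions in their streamlined form, together with a toy inverted index. It is only an illustration under stated assumptions: whitespace tokenization, the plain dictionary standing in for the B-tree over keys, and the `build_index` and `search_all` helpers are hypothetical and do not correspond to the Postgres C interface.

```python
from collections import defaultdict
from functools import cmp_to_key
from typing import Dict, List, Set


def compare(k1: str, k2: str) -> int:
    # Comparator used to sort keys: negative, zero, or positive.
    return (k1 > k2) - (k1 < k2)


def extract_value(value: str) -> List[str]:
    # Keys of a document: its distinct lowercase words (an assumption).
    return sorted(set(value.lower().split()), key=cmp_to_key(compare))


def extract_query(query: str) -> List[str]:
    # Keys to look up for a query, extracted the same way as for values.
    return sorted(set(query.lower().split()), key=cmp_to_key(compare))


def consistent(value: str, query: str) -> List[bool]:
    # The i-th element is True if the i-th query key appears in the value.
    keys = set(extract_value(value))
    return [k in keys for k in extract_query(query)]


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    # Hypothetical helper: map each key to the ids of documents containing it.
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for key in extract_value(text):
            index[key].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], query: str) -> Set[int]:
    # Hypothetical helper: ids of documents containing every query key.
    keys = extract_query(query)
    if not keys:
        return set()
    postings = [index.get(k, set()) for k in keys]
    return set.intersection(*postings)
```

For example, after `index = build_index({1: "the quick brown fox", 2: "the lazy dog"})`, calling `search_all(index, "the dog")` intersects the posting sets for the two query keys. Only the key extraction and the matching logic are user-supplied here, mirroring the observation that GIN generalizes key extraction and querying rather than index construction.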