Diffstat (limited to 'chapters')
| -rw-r--r-- | chapters/related-works.tex | 97 |
1 files changed, 97 insertions, 0 deletions
diff --git a/chapters/related-works.tex b/chapters/related-works.tex index 8e24750..c9d9357 100644 --- a/chapters/related-works.tex +++ b/chapters/related-works.tex @@ -347,3 +347,100 @@ works in this section that discusses generalizing its techniques to non-single dimensional data. \section{Generalized Index Templates} + +The other line of work in automatic index design we discussed +was generalized index templates. These are systems that expose a +generalization of a particular type of data structure, presenting hooks +for user-defined functions into various data structure operations, +while providing all of the necessary index features (concurrency, +crash recovery, etc.) automatically. Assuming that the underlying data +structure can be used to construct an index for a given use case, this +approach makes it straightforward for a database user to produce a custom +index for a particular search problem or data type. Unfortunately, +these templates are restricted by their underlying data structure, +and can only produce indices that fit this model. There are two major +examples of generalized index templates: the generalized search tree +(GiST)~\cite{gist, concurrent-gist, pg-gist} and the generalized inverted +index (GIN)~\cite{pg-gin}. + +The GiST~\cite{gist} is a general data structure built on a search +tree, which allows the user to specify particular behaviors to +adapt it to their needs, while automatically providing concurrency +and crash recovery~\cite{concurrent-gist}. It has been implemented in +Postgres~\cite{pg-gist}. GiST requires the user to implement six +functions: +\begin{itemize} +\item $\mathbftt{consistent}(E, q)$ \\ + Given an internal node entry, $E$, and a query predicate $q$, + it determines whether the entry could satisfy the predicate and + returns true if so, or false if the predicate certainly cannot + be satisfied. 
+ +\item $\mathbftt{union}(P)$ \\ + Given a set of contiguous internal node entries, $P$, this + function returns a predicate that holds for all tuples stored + in children of the entries. + +\item $\mathbftt{compress}(E)$ \\ + Returns a compressed version of an entry, $E$. This is used to + produce internal node separator keys. + +\item $\mathbftt{decompress}(E)$ \\ + Returns a decompressed version of a compressed entry, $E$. This + is used to recover record intervals consistent with a separator + key. + +\item $\mathbftt{penalty}(E_1, E_2)$ \\ + Returns the ``penalty'' for inserting $E_2$ into the sub-tree + rooted at $E_1$. This is used for tree balancing in the insertion + routines. + +\item $\mathbftt{pick\_split}(P)$ \\ + Given a set of contiguous internal node entries, $P$, picks a split + point to break $P$ into two disjoint partitions. This is used during + split operations to maintain tree balance. + +\end{itemize} + +Given these user-defined functions, the GiST will automatically provide +two search routines: a general search returning all records matching a +predicate, and a specific search for linearly ordered data. This latter +search also requires that a $\mathbftt{compare}(E_1, E_2)$ function be specified, +which serves as a comparator between two records. The structure also +supports inserting and deleting records. + +The generalized inverted index, GIN, is a similar concept for a different +data structure~\cite{pg-gin}. Rather than generalizing a search tree, +GIN generalizes an inverted index. This structure represents a set of +key-value pairs, where the value is a set of composite entries containing +multiple keys. The index allows all values containing a specified key +to be easily identified. The classic example of a use for an inverted +index is document search, where the values are entire documents, and +the keys are specific words. 
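To make the document-search example concrete, the following is a minimal Python sketch of a plain (non-generalized) inverted index. It is an illustration only and is not taken from GIN or from the cited sources: the names build_inverted_index, lookup, and docs are hypothetical, and splitting on whitespace is an assumed, deliberately simplistic way of extracting words.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word (key) to the set of ids of documents (values) containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the ids of all documents containing `word` (empty set if none)."""
    return index.get(word.lower(), set())


# Hypothetical sample data for illustration.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
```

Here lookup(index, "quick") identifies documents 1 and 3 without scanning any document text, which is the property that makes the structure useful for search.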
+ +GIN requires the user to specify the following functions:\footnote{ + I've streamlined the representations of these functions somewhat + to conform to the conventions of this work. In the original + source documentation, these function definitions are given in C, + with many more arguments for outputs, configuration, etc. + } +\begin{itemize} + \item $\mathbftt{extract\_value}(V)$ \\ + Extract all of the keys from a given value, $V$, and return them + as an array. + \item $\mathbftt{extract\_query}(Q)$ \\ + Extract the keys to search for in a given query, $Q$. + \item $\mathbftt{consistent}(V, Q)$ \\ + Checks an indexed value, $V$, for the keys contained in a query, + $Q$. Returns an array of booleans, where the $i$th + element in the array is true if the $i$th key from the + query appears in the value. + \item $\mathbftt{compare}(E_1, E_2)$ \\ + A comparator used for sorting keys. +\end{itemize} + +Note that, unlike GiST, GIN does not generalize the construction of +the index itself. It uses a B-tree index over the keys with limited +user control over how it is built. The generalization of GIN is in the +extraction of keys from values and the query process itself. Like GiST, +a GIN index automatically provides concurrency and crash recovery.
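The division of labor described above can be sketched in Python as follows. This is a hedged illustration that follows the streamlined function forms used in this section, not PostgreSQL's actual C interface. The class name GenericInvertedIndex and its methods are hypothetical, a dictionary stands in for the B-tree over keys, and the rule that a value matches only when every query key is present is an assumption made for the example.

```python
from functools import cmp_to_key


class GenericInvertedIndex:
    """Hypothetical GIN-like template: the index is fixed, the hooks are user-supplied."""

    def __init__(self, extract_value, extract_query, consistent, compare):
        self.extract_value = extract_value
        self.extract_query = extract_query
        self.consistent = consistent
        self.compare = compare
        self.postings = {}  # key -> set of value ids (stand-in for the B-tree)
        self.values = {}    # value id -> indexed value

    def insert(self, value_id, value):
        self.values[value_id] = value
        for key in self.extract_value(value):
            self.postings.setdefault(key, set()).add(value_id)

    def sorted_keys(self):
        # The user-supplied comparator determines the key order.
        return sorted(self.postings, key=cmp_to_key(self.compare))

    def search(self, query):
        candidates = set()
        for key in self.extract_query(query):
            candidates |= self.postings.get(key, set())
        # Assumption: a value matches when every query key appears in it.
        return sorted(vid for vid in candidates
                      if all(self.consistent(self.values[vid], query)))


# User-defined hooks for word search over documents (illustrative only).
def extract_value(doc):
    return sorted(set(doc.lower().split()))

def extract_query(query):
    return query.lower().split()

def consistent(doc, query):
    present = set(extract_value(doc))
    return [key in present for key in extract_query(query)]

def compare(a, b):
    return (a > b) - (a < b)


idx = GenericInvertedIndex(extract_value, extract_query, consistent, compare)
idx.insert(1, "the quick brown fox")
idx.insert(2, "the lazy dog")
idx.insert(3, "quick dog")
```

In this sketch only the four hooks know anything about documents and words. Swapping them for functions over, say, arrays of integers would reuse the same index code unchanged, which mirrors the point that GIN generalizes key extraction and querying rather than index construction.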