\chapter{Background} \section{Database Indexes} \label{sec:indexes} Within a database system, search problems are expressed using some high-level language (or, in simpler systems such as key-value stores, mapped directly to commands), which the database system processes to produce a result. In many database systems, the most basic access primitive is a table scan, which sequentially examines each record within the data set. In many situations, however, the same query could be answered in less time using a more sophisticated data access scheme, and databases support a limited number of such schemes through the use of specialized data structures called \emph{indices} (or indexes). Indices can be built over a set of attributes in a table and provide faster access for particular search problems. The term \emph{index} is often abused within the database community to refer to a range of closely related, but distinct, conceptual categories.\footnote{ The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', or to refer to data structures used specifically in query processing, among other meanings. } This ambiguity is rarely problematic, as the subtle differences between these categories are not often significant, and context clarifies the intended meaning in situations where they are. However, this work explicitly operates at the interface of two of these categories, and so it is important to disambiguate between them. \subsection{The Classical Index} A database index is a specialized data structure that provides a means to efficiently locate records satisfying specific criteria. This enables more efficient query processing for supported search problems. A classical index can be modeled as a function mapping a set of attribute values, called a key, $\mathcal{K}$, to a set of record identifiers, $\mathcal{R}$.
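This mapping view of a classical index can be made concrete with a short sketch. The following Python fragment (all record and attribute names are illustrative, not drawn from any particular system) builds an index over a single attribute as a dictionary from key values to sets of record identifiers:

```python
# Minimal sketch of a classical index: a mapping from key values to
# sets of record identifiers. All names here are illustrative.

records = {
    1: {"name": "alice", "dept": "eng"},
    2: {"name": "bob",   "dept": "eng"},
    3: {"name": "carol", "dept": "hr"},
}

# Build an index with the "dept" attribute as the key K.
dept_index = {}
for rid, rec in records.items():
    dept_index.setdefault(rec["dept"], set()).add(rid)

# A point lookup consults the index rather than scanning every record.
matching_rids = dept_index.get("eng", set())
```

A real system would back this interface with a persistent structure such as a B-tree or hash table rather than an in-memory dictionary, but the mapping it implements is the same.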
The codomain of an index can be the set of record identifiers, a set containing sets of record identifiers, or the set of physical records, depending upon the configuration of the index.~\cite{cowbook} For our purposes here, we will focus on the first of these, though the use of other codomains would not have any material effect on our discussion. We will use the following definition of a ``classical'' database index, \begin{definition}[Classical Index~\cite{cowbook}] Consider a set of database records, $\mathcal{D}$. An index over these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where $\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, called a \emph{key}. \end{definition} To facilitate this mapping, indexes are built using data structures. The choice of data structure has implications for the performance of the index and for the types of search problem it can be used to accelerate. Broadly speaking, classical indices can be divided into two categories: ordered and unordered. Ordered indices allow for iteration over a set of record identifiers in a particular sorted order of keys, as well as for the efficient location of a specific key value in that order. These indices can be used to accelerate range scans and point-lookups. Unordered indices are specialized for point-lookups on a particular key value, and do not support iterating over records in some order.~\cite{cowbook, mysql-btree-hash} A relatively small set of data structures is usually used for creating classical indexes. For ordered indices, the most commonly used data structure is the B-tree~\cite{ubiq-btree},\footnote{ By \emph{B-tree} here, we are referring not only to the B-tree data structure itself, but to a wide range of related structures derived from it. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc.
} and the log-structured merge (LSM) tree~\cite{oneil96} is also often used within the context of key-value stores~\cite{rocksdb}. Some databases implement unordered indices using hash tables~\cite{mysql-btree-hash}. \subsection{The Generalized Index} The previous section discussed the classical definition of an index, as might be found in a database systems textbook. However, this definition is limited by its association specifically with mapping key fields to records. For the purposes of this work, a broader definition of index will be considered, \begin{definition}[Generalized Index] Consider a set of database records, $\mathcal{D}$, and a search problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to \mathcal{R}$. \end{definition} A classical index is a special case of a generalized index, with $\mathcal{Q}$ being a point-lookup or range scan based on a set of record attributes. A number of generalized indexes appear in some database systems. For example, some specialized databases or database extensions support indexes based on the R-tree\footnote{ Like the B-tree, R-tree here is used as a signifier for a general class of related data structures.} for spatial databases~\cite{postgis-doc, ubiq-rtree}, or on hierarchical navigable small world (HNSW) graphs for similarity search~\cite{pinecone-db}, among others. These systems are typically either add-on modules, or specialized standalone databases that have been designed specifically for answering particular types of queries (such as spatial queries, similarity search, string matching, etc.). %\subsection{Indexes in Query Processing} %A database management system utilizes indexes to accelerate certain %types of query. Queries are expressed to the system in some high %level language, such as SQL or Datalog. These are generalized %languages capable of expressing a wide range of possible queries.
%The DBMS is then responsible for converting these queries into a %set of primitive data access procedures that are supported by the %underlying storage engine. There are a variety of techniques for %this, including mapping directly to a tree of relational algebra %operators and interpreting that tree, query compilation, etc. But, %ultimately, this internal query representation is limited by the routines %supported by the storage engine.~\cite{cowbook} %As an example, consider the following SQL query (representing a %2-dimensional k-nearest neighbor problem)\footnote{There are more efficient %ways of answering this query, but I'm aiming for simplicity here %to demonstrate my point}, % %\begin{verbatim} %SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A % WHERE A.property = filtering_criterion % ORDER BY d % LIMIT 5; %\end{verbatim} % %This query will be translated into a logical query plan (a sequence %of relational algebra operators) by the query planner, which could %result in a plan like this, % %\begin{verbatim} %query plan here %\end{verbatim} % %With this logical query plan, the DBMS will next need to determine %which supported operations it can use to most efficiently answer %this query. For example, the selection operation (A) could be %physically manifested as a table scan, or could be answered using %an index scan if there is an ordered index over \texttt{A.property}. %The query optimizer will make this decision based on its estimate %of the selectivity of the predicate. This may result in one of the %following physical query plans % %\begin{verbatim} %physical query plan %\end{verbatim} % %In either case, however, the space of possible physical plans is %limited by the available access methods: either a sorted scan on %an attribute (index) or an unsorted scan (table scan). 
The database %must filter for all elements matching the filtering criterion, %calculate the distances between all of these points and the query, %and then sort the results to get the final answer. Additionally, %note that the sort operation in the plan is a pipeline-breaker. If %this plan were to appear as a sub-tree in a larger query plan, the %overall plan would need to wait for the full evaluation of this %sub-query before it could proceed, as sorting requires the full %result set. % %Imagine a world where a new index was available to the DBMS: a %nearest neighbor index. This index would allow the iteration over %records in sorted order, relative to some predefined metric and a %query point. If such an index existed over \texttt{(A.x, A.y)} using %\texttt{dist}, then a third physical plan would be available to the DBMS, % %\begin{verbatim} %\end{verbatim} % %This plan pulls records in order of their distance to \texttt{Q} %directly, using an index, and then filters them, avoiding the %pipeline breaking sort operation. While it's not obvious in this %case that this new plan is superior (this would depend upon the %selectivity of the predicate), it is a third option. It becomes %increasingly superior as the selectivity of the predicate grows, %and is clearly superior in the case where the predicate has unit %selectivity (requiring only the consideration of $5$ records total). % %This use of query-specific indexing schemes presents a query %optimization challenge: how does the database know when a particular %specialized index can be used for a given query, and how can %specialized indexes broadcast their capabilities to the query optimizer %in a general fashion? This work is focused on the problem of enabling %the existence of such indexes, rather than facilitating their use; %however these are important questions that must be considered in %future work for this solution to be viable. 
There has been work %done surrounding the use of arbitrary indexes in queries in the past, %such as~\cite{byods-datalog}. This problem is considered out-of-scope %for the proposed work, but will be considered in the future.
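As a concrete, deliberately naive illustration of the generalized-index definition, the Python sketch below maps a k-nearest-neighbor search problem to a set of record identifiers. A brute-force scan stands in for a real structure such as an R-tree or HNSW graph, and all names are illustrative:

```python
import math

# Naive sketch of a generalized index: a map from a search problem
# (here, a k-nearest-neighbor query over 2-d points) to record
# identifiers. A brute-force scan stands in for a real spatial or
# similarity-search structure such as an R-tree or an HNSW graph.

points = {"r1": (0.0, 0.0), "r2": (3.0, 4.0), "r3": (1.0, 1.0)}

def knn(query, k):
    """Return identifiers of the k records nearest to `query`."""
    return sorted(points, key=lambda r: math.dist(query, points[r]))[:k]
```

A classical index fits this same shape: a point lookup is simply the special case in which the search problem is equality on a key attribute rather than proximity under a distance metric.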