\chapter{Background} \section{Database Indexes} \label{sec:indexes} Within a database system, search problems are expressed using some high-level language (or, in simpler systems such as key-value stores, mapped directly to commands), which the database system processes to produce a result. In many database systems, the most basic access primitive is a table scan, which sequentially examines each record within the data set. In many situations, however, the same query could be answered in less time using a more sophisticated data access scheme, and databases support a limited number of such schemes through the use of specialized data structures called \emph{indices} (or indexes). Indices can be built over a set of attributes in a table and provide faster access for particular search problems. The term \emph{index} is often abused within the database community to refer to a range of closely related, but distinct, conceptual categories.\footnote{ The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', or to refer to data structures used specifically in query processing, among other meanings. } This ambiguity is rarely problematic, as the subtle differences between these categories are not often significant, and context clarifies the intended meaning in situations where they are. However, this work explicitly operates at the interface of two of these categories, and so it is important to disambiguate between them. \subsection{The Classical Index} A database index is a specialized data structure that provides a means to efficiently locate records satisfying specific criteria. This enables more efficient query processing for supported search problems. A classical index can be modeled as a function mapping a set of attribute values, called a key, $\mathcal{K}$, to a set of record identifiers, $\mathcal{R}$.
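This mapping view of a classical index can be made concrete with a short sketch. The following Python fragment (all record and attribute names are illustrative, not drawn from any particular system) builds an index over a single attribute as a dictionary from key values to sets of record identifiers:

```python
# Minimal sketch of a classical index: a mapping from key values to
# sets of record identifiers. All names here are illustrative.

records = {
    1: {"name": "alice", "dept": "eng"},
    2: {"name": "bob",   "dept": "eng"},
    3: {"name": "carol", "dept": "hr"},
}

# Build an index with the "dept" attribute as the key K.
dept_index = {}
for rid, rec in records.items():
    dept_index.setdefault(rec["dept"], set()).add(rid)

# A point lookup consults the index rather than scanning every record.
matching_rids = dept_index.get("eng", set())
```

A real system would back this interface with a persistent structure such as a B-tree or hash table rather than an in-memory dictionary, but the mapping it implements is the same.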
The codomain of an index can be the set of record identifiers, a set containing sets of record identifiers, or the set of physical records, depending upon the configuration of the index.~\cite{cowbook} For our purposes here, we will focus on the first of these, though the use of other codomains would not have any material effect on our discussion. We will use the following definition of a ``classical'' database index, \begin{definition}[Classical Index~\cite{cowbook}] Consider a set of database records, $\mathcal{D}$. An index over these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where $\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, called a \emph{key}. \end{definition} To facilitate this mapping, indexes are built using data structures. The choice of data structure has implications for the performance of the index and for the types of search problem it can be used to accelerate. Broadly speaking, classical indices can be divided into two categories: ordered and unordered. Ordered indices allow for iteration over a set of record identifiers in a particular sorted order of keys, as well as for the efficient location of a specific key value in that order. These indices can be used to accelerate range scans and point-lookups. Unordered indices are specialized for point-lookups on a particular key value, and do not support iterating over records in some order.~\cite{cowbook, mysql-btree-hash} A relatively small set of data structures is usually used for creating classical indexes. For ordered indices, the most commonly used data structure is the B-tree~\cite{ubiq-btree},\footnote{ By \emph{B-tree} here, we are referring not only to the B-tree data structure itself, but to a wide range of related structures derived from it. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc.
} and the log-structured merge (LSM) tree~\cite{oneil96} is also often used within the context of key-value stores~\cite{rocksdb}. Some databases implement unordered indices using hash tables~\cite{mysql-btree-hash}. \subsection{The Generalized Index} The previous section discussed the classical definition of an index, as might be found in a database systems textbook. However, this definition is limited by its association specifically with mapping key fields to records. For the purposes of this work, a broader definition of index will be considered, \begin{definition}[Generalized Index] Consider a set of database records, $\mathcal{D}$, and a search problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to \mathcal{R}$. \end{definition} A classical index is a special case of a generalized index, with $\mathcal{Q}$ being a point-lookup or range scan based on a set of record attributes. A number of generalized indexes appear in some database systems. For example, some specialized databases or database extensions support indexes based on the R-tree\footnote{ Like the B-tree, R-tree here is used as a signifier for a general class of related data structures.} for spatial databases~\cite{postgis-doc, ubiq-rtree}, or on hierarchical navigable small world (HNSW) graphs for similarity search~\cite{pinecone-db}, among others. These systems are typically either add-on modules, or specialized standalone databases that have been designed specifically for answering particular types of queries (such as spatial queries, similarity search, string matching, etc.). %\subsection{Indexes in Query Processing} %A database management system utilizes indexes to accelerate certain %types of query. Queries are expressed to the system in some high %level language, such as SQL or Datalog. These are generalized %languages capable of expressing a wide range of possible queries.
%The DBMS is then responsible for converting these queries into a %set of primitive data access procedures that are supported by the %underlying storage engine. There are a variety of techniques for %this, including mapping directly to a tree of relational algebra %operators and interpreting that tree, query compilation, etc. But, %ultimately, this internal query representation is limited by the routines %supported by the storage engine.~\cite{cowbook} %As an example, consider the following SQL query (representing a %2-dimensional k-nearest neighbor problem)\footnote{There are more efficient %ways of answering this query, but I'm aiming for simplicity here %to demonstrate my point}, % %\begin{verbatim} %SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A % WHERE A.property = filtering_criterion % ORDER BY d % LIMIT 5; %\end{verbatim} % %This query will be translated into a logical query plan (a sequence %of relational algebra operators) by the query planner, which could %result in a plan like this, % %\begin{verbatim} %query plan here %\end{verbatim} % %With this logical query plan, the DBMS will next need to determine %which supported operations it can use to most efficiently answer %this query. For example, the selection operation (A) could be %physically manifested as a table scan, or could be answered using %an index scan if there is an ordered index over \texttt{A.property}. %The query optimizer will make this decision based on its estimate %of the selectivity of the predicate. This may result in one of the %following physical query plans % %\begin{verbatim} %physical query plan %\end{verbatim} % %In either case, however, the space of possible physical plans is %limited by the available access methods: either a sorted scan on %an attribute (index) or an unsorted scan (table scan). 
The database %must filter for all elements matching the filtering criterion, %calculate the distances between all of these points and the query, %and then sort the results to get the final answer. Additionally, %note that the sort operation in the plan is a pipeline-breaker. If %this plan were to appear as a sub-tree in a larger query plan, the %overall plan would need to wait for the full evaluation of this %sub-query before it could proceed, as sorting requires the full %result set. % %Imagine a world where a new index was available to the DBMS: a %nearest neighbor index. This index would allow the iteration over %records in sorted order, relative to some predefined metric and a %query point. If such an index existed over \texttt{(A.x, A.y)} using %\texttt{dist}, then a third physical plan would be available to the DBMS, % %\begin{verbatim} %\end{verbatim} % %This plan pulls records in order of their distance to \texttt{Q} %directly, using an index, and then filters them, avoiding the %pipeline breaking sort operation. While it's not obvious in this %case that this new plan is superior (this would depend upon the %selectivity of the predicate), it is a third option. It becomes %increasingly superior as the selectivity of the predicate grows, %and is clearly superior in the case where the predicate has unit %selectivity (requiring only the consideration of $5$ records total). % %This use of query-specific indexing schemes presents a query %optimization challenge: how does the database know when a particular %specialized index can be used for a given query, and how can %specialized indexes broadcast their capabilities to the query optimizer %in a general fashion? This work is focused on the problem of enabling %the existence of such indexes, rather than facilitating their use; %however these are important questions that must be considered in %future work for this solution to be viable. 
There has been work %done surrounding the use of arbitrary indexes in queries in the past, %such as~\cite{byods-datalog}. This problem is considered out-of-scope %for the proposed work, but will be considered in the future.
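As a concrete, deliberately naive illustration of the generalized-index definition, the Python sketch below maps a k-nearest-neighbor search problem to a set of record identifiers. A brute-force scan stands in for a real structure such as an R-tree or HNSW graph, and all names are illustrative:

```python
import math

# Naive sketch of a generalized index: a map from a search problem
# (here, a k-nearest-neighbor query over 2-d points) to record
# identifiers. A brute-force scan stands in for a real spatial or
# similarity-search structure such as an R-tree or an HNSW graph.

points = {"r1": (0.0, 0.0), "r2": (3.0, 4.0), "r3": (1.0, 1.0)}

def knn(query, k):
    """Return identifiers of the k records nearest to `query`."""
    return sorted(points, key=lambda r: math.dist(query, points[r]))[:k]
```

A classical index fits this same shape: a point lookup is simply the special case in which the search problem is equality on a key attribute rather than proximity under a distance metric.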