\begin{abstract}
The execution of analytical queries on massive datasets presents challenges
due to long response times and high computational costs. As a result, the
analysis of representative samples of data has emerged as an attractive
alternative; this avoids the cost of processing queries against the entire
dataset, while still producing statistically valid results. Unfortunately,
the sampling techniques in common use sacrifice either sample quality or
performance, and so are poorly suited for this task. However, it is
possible to build high quality sample sets efficiently with the assistance
of indexes. This introduces a new challenge: real-world data is subject to
continuous update, and so the indexes must be kept up to date. This is
difficult, because existing sampling indexes present a dichotomy: efficient
sampling indexes are difficult to update, while easily updatable indexes
have poor sampling performance. This paper addresses this gap by
proposing a general and practical framework for extending most sampling
indexes with efficient update support, based on splitting indexes into
smaller shards and systematically reconstructing them periodically. The
framework's design space is examined, with an eye
towards exploring trade-offs between update performance, sampling
performance, and memory usage. Three existing static sampling indexes are
extended using this framework to support updates, and the generalization of
the framework to concurrent operations and larger-than-memory data is
discussed. Through a comprehensive suite of benchmarks, the extended
indexes are shown to match or exceed the update throughput of
state-of-the-art dynamic baselines, while presenting significant
improvements in sampling latency.
\end{abstract}