A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms
Articolo
Data di Pubblicazione:
2012
Citazione:
A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms / Buccafurri, F., Furfaro, F., Sacca', D.. - In: INFORMATION SCIENCES. - ISSN 0020-0255. - 188:(2012), pp. 121-150.
Abstract:
A histogram over a multi-dimensional data set is a synopsis
consisting of aggregate data summarizing the values of the points
inside non-overlapping ranges of the domain.
Owing to their effectiveness in supporting a fast (though approximate) estimation
of the answers of aggregate range queries, histograms are widely used in several contexts
dealing with multi-dimensional data, especially those where the precision
of the answers (within reasonable limits) is not the major requirement.
However, the practical impact of histograms has been limited by the fact that, so far, no
mechanism has been defined which provides a reliable (non-trivial) quantification of the
degree of approximation of the query estimates.
In this paper, this problem is addressed by introducing a probabilistic framework which
allows for estimating the accuracy of the approximate answers resulting from
evaluating aggregate queries over a histogram.
Specifically, given a histogram over a data set,
the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored
in the histogram.
Therein, the mean value and the variance of this random variable represent an estimate of
the actual answer of the corresponding query and of the error rate, respectively.
The proposed framework can exploit different kinds of aggregates (namely, sum and
count) stored in the histogram, as well as integrity constraints defined over
the original data.
consisting of aggregate data summarizing the values of the points
inside non-overlapping ranges of the domain.
Owing to their effectiveness in supporting a fast (though approximate) estimation
of the answers of aggregate range queries, histograms are widely used in several contexts
dealing with multi-dimensional data, especially those where the precision
of the answers (within reasonable limits) is not the major requirement.
However, the practical impact of histograms has been limited by the fact that, so far, no
mechanism has been defined which provides a reliable (non-trivial) quantification of the
degree of approximation of the query estimates.
In this paper, this problem is addressed by introducing a probabilistic framework which
allows for estimating the accuracy of the approximate answers resulting from
evaluating aggregate queries over a histogram.
Specifically, given a histogram over a data set,
the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored
in the histogram.
Therein, the mean value and the variance of this random variable represent an estimate of
the actual answer of the corresponding query and of the error rate, respectively.
The proposed framework can exploit different kinds of aggregates (namely, sum and
count) stored in the histogram, as well as integrity constraints defined over
the original data.
Tipologia CRIS:
1.1 Articolo in rivista
Keywords:
Multidimensional data; Compression; Histograms
Elenco autori:
Buccafurri, Francesco; Furfaro, F; Sacca', D
Link alla scheda completa:
Pubblicato in: