A Simple and Efficient Estimation Method for Stream Expression Cardinalities
01 January 2007
Estimating the cardinalities of set expressions defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. On the contrary, the estimators do not fully utilize the information in the sketches and thus are not statistically efficient. In this paper, we develop a statistically efficient yet simple estimator for the cardinalities based on a continuous variant of the well known Flajolet-Martin sketches. Specifically, we are able to show that, for two streams, our estimator has almost the same statistical efficiency as the Maximum Likelihood Estimator (MLE), which is well-known to be the most efficient because its variance can achieve the Cramer-Rao lower bound asymptotically. Moreover, for higher number of streams, our estimator is still computationally simple, but the MLE becomes intractable due to the complexity of the likelihood. Let $N$ be the cardinality of the union of all streams, and $|{cal S}|$ be the cardinality of a set expression $mathcal{S}$ to be estimated. For a given relative standard error $delta$, %root mean square relative error (relative standard error) $rse$, the memory requirement of our estimator is $O(delta^{-2}|{cal S}|^{-1}Nloglog N)$, which is superior to state-of-the-art algorithms, especially for large $N$ and small $frac{|{cal S}|}{N}$ where the estimation is more challenging.