Calculating the mean and standard deviation in one pass
October 27th, 2009
It is possible to calculate the mean and the standard deviation of a serie of double values while iterating only once on the whole set of data. This approach is thus also suitable for streaming data; it could update the mean and standard deviation on-the-fly.
Here I present an implementation of the algorithm shown by Mark Hoemmen in his paper Computing the standard deviation efficiently as a static Java method. The values are provided by an Iterator
. Note that in Java 5+ the iterator could directly force the objects to implement the interface. Here I have chosen to delegate the extraction of the double value to an extractor object (I’m using it on Java 1.4).
/** * A method to calculate the mean and standard deviation of a series * of double. The objects are provided by an {@link Iterator} and the * double value is extracted using the passed extractor. * It skips <code>NaN</code> values.<br> * * Returns a <code>double[3]</code> where <br> * [0] is the number of values used in the mean<br> * [1] contains the mean value of the series<br> * [2] contains the standard deviation (sigma) of the complete * population<br> * Returns <code>NaN</code> for the mean and std dev if no valid value * could be found in the series.<br> * * Algorithm taken from "Computing the standard deviation efficiently" * by Mark Hoemmen * (http://www.cs.berkeley.edu/~mhoemmen/cs194/Tutorials/variance.pdf) * * @return the number of values, mean and std-dev of the serie * */ public static double[] getMeanAndStdDev(Iterator it, DoubleExtractor e){ while(it.hasNext()){// while initial value is NaN, try next double xk = e.doubleValue(it.next()); if(Double.isNaN(xk)){ continue; } int k = 1; double Mk = xk; double Qk = 0; while(it.hasNext()){ xk = e.doubleValue(it.next()); if(Double.isNaN(xk)){ continue; } k++; double d = xk-Mk; // is actually xk - Mk-1, // as Mk was not yet updated Qk += (k-1)*d*d/k; Mk += d/k; } return new double[]{k,Mk,Math.sqrt(Qk/k)}; } return new double[]{0,Double.NaN,Double.NaN}; }
using the following interface
/** * An object which can return a double representation of passed objects. * * @author Philipp Buluschek * */ public interface DoubleExtractor { /** * @return the double representation of the passed object. */ public double doubleValue(Object o); }
Thanks! here a similar C implementation:
typedef struct statistics_s {
size_t k; // sample count
double Mk; // mean
double Qk; // standard variance
} statistics_t;
static void
statistics_record (statistics_t *stats, double x) {
stats->k++;
if (1 == stats->k) {
stats->Mk = x;
stats->Qk = 0;
} else {
double d = x – stats->Mk; // is actually xk – M_{k-1},
// as Mk was not yet updated
stats->Qk += (stats->k-1)*d*d/stats->k;
stats->Mk += d/stats->k;
}
}
static double
statistics_stdev (statistics_t *stats)
{
return sqrt (stats->Qk/stats->k);
}
static size_t
statistics_nsamples (statistics_t *stats)
{
return stats->k;
}
static double
statistics_variance (statistics_t *stats)
{
return stats->Qk / (stats->k – 1);
}
static double
statistics_stdvar (statistics_t *stats)
{
return stats->Qk / stats->k;
}
static void
statistics_init (statistics_t *stats)
{
stats->k = 0;
}