{"id":140,"date":"2009-10-27T11:30:55","date_gmt":"2009-10-27T16:30:55","guid":{"rendered":"http:\/\/blog.buluschek.com\/?p=140"},"modified":"2023-07-02T10:49:23","modified_gmt":"2023-07-02T08:49:23","slug":"calculating-the-mean-and-standard-deviation-in-one-pass","status":"publish","type":"post","link":"https:\/\/www.buluschek.com\/?p=140","title":{"rendered":"Calculating the mean and standard deviation in one pass"},"content":{"rendered":"<p>It is possible to calculate the mean and the standard deviation of a serie of double values while iterating only once on the whole set of data. This approach is thus also suitable for streaming data; it could update the mean and standard deviation on-the-fly. <\/p>\n<p>Here I present an implementation of the algorithm shown by Mark Hoemmen in his paper <a href=\"http:\/\/www.cs.berkeley.edu\/~mhoemmen\/cs194\/Tutorials\/variance.pdf\">Computing the standard deviation efficiently<\/a> as a static Java method. The values are provided by an <code>Iterator<\/code>. Note that in Java 5+ the iterator could directly force the objects to implement the interface. Here I have chosen to delegate the extraction of the double value to an extractor object (I&#8217;m using it on Java 1.4).<\/p>\n<p>[sourcecode lang=&#8221;java&#8221;]<br \/>\n\t\/**<br \/>\n\t * A method to calculate the mean and standard deviation of a series<br \/>\n\t * of double. The objects are provided by an {@link Iterator} and the<br \/>\n\t * double value is extracted using the passed extractor.<br \/>\n\t * It skips &lt;code&gt;NaN&lt;\/code&gt; values.&lt;br&gt;<br \/>\n\t *<br \/>\n\t * Returns a &lt;code&gt;double[3]&lt;\/code&gt; where &lt;br&gt;<br \/>\n\t * [0] is the number of values used in the mean&lt;br&gt;<br \/>\n\t * [1] contains the mean value of the series&lt;br&gt;<br \/>\n\t * [2] contains the standard deviation (sigma) of the complete<br \/>\n\t * population&lt;br&gt;<br \/>\n\t * Returns &lt;code&gt;NaN&lt;\/code&gt; for the mean and std dev if no valid value<br \/>\n\t * could be found in the series.&lt;br&gt;<br \/>\n\t *<br \/>\n\t * Algorithm taken from &quot;Computing the standard deviation efficiently&quot;<br \/>\n\t * by Mark Hoemmen<br \/>\n\t * (http:\/\/www.cs.berkeley.edu\/~mhoemmen\/cs194\/Tutorials\/variance.pdf)<br \/>\n\t *<br \/>\n\t * @return the number of values, mean and std-dev of the serie<br \/>\n\t *<br \/>\n\t *\/<br \/>\n\tpublic static double[] getMeanAndStdDev(Iterator it, DoubleExtractor e){<br \/>\n\t\twhile(it.hasNext()){\/\/ while initial value is NaN, try next<br \/>\n\t\t\tdouble xk = e.doubleValue(it.next());<br \/>\n\t\t\tif(Double.isNaN(xk)){<br \/>\n\t\t\t\tcontinue;<br \/>\n\t\t\t}<br \/>\n\t\t\tint k = 1;<br \/>\n\t\t\tdouble Mk = xk;<br \/>\n\t\t\tdouble Qk = 0;<br \/>\n\t\t\twhile(it.hasNext()){<br \/>\n\t\t\t\txk = e.doubleValue(it.next());<br \/>\n\t\t\t\tif(Double.isNaN(xk)){<br \/>\n\t\t\t\t\tcontinue;<br \/>\n\t\t\t\t}<br \/>\n\t\t\t\tk++;<br \/>\n\t\t\t\tdouble d = xk-Mk; \/\/ is actually xk &#8211; Mk-1,<br \/>\n\t\t\t\t                  \/\/ as Mk was not yet updated<br \/>\n\t\t\t\tQk += (k-1)*d*d\/k;<br \/>\n\t\t\t\tMk += d\/k;<br \/>\n\t\t\t}<br \/>\n\t\t\treturn new double[]{k,Mk,Math.sqrt(Qk\/k)};<br \/>\n\t\t}<br \/>\n\t\treturn new double[]{0,Double.NaN,Double.NaN};<br \/>\n\t}<br \/>\n[\/sourcecode]<\/p>\n<p>using the following interface<br \/>\n[sourcecode lang=&#8221;java&#8221;]<br \/>\n\/**<br \/>\n * An object which can return a double representation of passed objects.<br \/>\n *<br \/>\n * @author Philipp Buluschek<br \/>\n *<br \/>\n *\/<br \/>\npublic interface DoubleExtractor {<\/p>\n<p>\t\/**<br \/>\n\t * @return the double representation of the passed object.<br \/>\n\t *\/<br \/>\n\tpublic double doubleValue(Object o);<br \/>\n}<br \/>\n[\/sourcecode]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It is possible to calculate the mean and the standard deviation of a serie of double values while iterating only once on the whole set of data. This approach is thus also suitable for streaming data; it could update the mean and standard deviation on-the-fly. Here I present an implementation of the algorithm shown by [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4,27,18,19],"class_list":["post-140","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-code","tag-java","tag-mean","tag-standard-deviation"],"_links":{"self":[{"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/posts\/140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=140"}],"version-history":[{"count":13,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/posts\/140\/revisions"}],"predecessor-version":[{"id":339,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=\/wp\/v2\/posts\/140\/revisions\/339"}],"wp:attachment":[{"href":"https:\/\/www.buluschek.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.buluschek.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}