streamly-statistics
Copyright(c) 2024 Composewell Technologies
LicenseApache-2.0
Maintainerstreamly@composewell.com
Stabilityexperimental
PortabilityGHC
Safe HaskellNone
LanguageHaskell2010

Streamly.Statistics.Scanl

Description

See Streamly.Statistics for general information. This module provides scans instead of folds.

Synopsis

Incremental Scans

Scans of type Scanl m (a, Maybe a) b are incremental sliding window scans. An input of type (a, Nothing) indicates that the input element a is being inserted in the window without ejecting an old value increasing the window size by 1. An input of type (a, Just a) indicates that the first element is being inserted in the window and the second element is being removed from the window, the window size remains the same. The window size can only increase and never decrease.

You can compute the statistics over the entire stream using sliding window folds by keeping the second element of the input tuple as Nothing.

Also see Streamly.Data.Scanl for some basic window scans.

Summary Statistics

Location

incrMinimum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a Source #

The minimum element in a rolling window.

For smaller window sizes (< 30) Streamly.Internal.Data.Fold.windowMinimum performs better. If you want to compute the minimum of the entire stream Fold.min from streamly package would be much faster.

Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.

incrMaximum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a Source #

The maximum element in a rolling window.

For smaller window sizes (< 30) Streamly.Internal.Data.Fold.windowMaximum performs better. If you want to compute the maximum of the entire stream Streamly.Data.Fold.maximum from streamly package would be much faster.

Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.

incrRawMoment :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Int -> Scanl m (Incr a) a Source #

Raw moment is the moment about 0. The \(k\)th raw moment is defined as:

\(\mu'_k = \frac{\sum_{i=1}^n x_{i}^k}{n}\)

>>> rawMoment k = Fold.teeWith (/) (Fold.windowPowerSum p) Fold.windowLength

See https://en.wikipedia.org/wiki/Moment_(mathematics) .

Space: \(\mathcal{O}(1)\)

Time: \(\mathcal{O}(n)\)

incrRawMomentFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a Source #

Like rawMoment but powers can be negative or fractional. This is slower than rawMoment for positive intergal powers.

>>> rawMomentFrac p = Fold.teeWith (/) (Fold.windowPowerSumFrac p) Fold.windowLength

incrWelfordMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #

Same as mean but uses Welford's algorithm to compute the mean incrementally.

It maintains a running mean instead of a running sum and adjusts the mean based on a new value. This is slower than mean because of using the division operation on each step and it is numerically unstable (as of now). The advantage over mean could be no overflow if the numbers are large, because we do not maintain a sum, but that is a highly unlikely corner case.

Internal

incrGeometricMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Geometric mean, defined as:

\(GM = \sqrt[n]{x_1 x_2 \cdots x_n}\)

\(GM = \left(\prod_{i=1}^n x_i\right)^\frac{1}{n}\)

or, equivalently, as the arithmetic mean in log space:

\(GM = e ^{{\frac{\sum_{i=1}^{n}\ln a_i}{n}}}\)

>>> geometricMean = exp <$> lmap log mean

See https://en.wikipedia.org/wiki/Geometric_mean .

incrHarmonicMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #

The harmonic mean of the positive numbers \(x_1, x_2, \ldots, x_n\) is defined as:

\(HM = \frac{n}{\frac1{x_1} + \frac1{x_2} + \cdots + \frac1{x_n}}\)

\(HM = \left(\frac{\sum\limits_{i=1}^n x_i^{-1}}{n}\right)^{-1}\)

>>> harmonicMean = Fold.teeWith (/) length (lmap recip sum)
>>> harmonicMean = powerMeanFrac (-1)

See https://en.wikipedia.org/wiki/Harmonic_mean .

incrQuadraticMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

The quadratic mean or root mean square (rms) of the numbers \(x_1, x_2, \ldots, x_n\) is defined as:

\(RMS = \sqrt{ \frac{1}{n} \left( x_1^2 + x_2^2 + \cdots + x_n^2 \right) }.\)

>>> quadraticMean = powerMean 2

See https://en.wikipedia.org/wiki/Root_mean_square .

incrPowerMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Int -> Scanl m (Incr a) a Source #

The \(k\)th power mean of numbers \(x_1, x_2, \ldots, x_n\) is:

\(M_k = \left( \frac{1}{n} \sum_{i=1}^n x_i^k \right)^{\frac{1}{k}}\)

\(powerMean(k) = (rawMoment(k))^\frac{1}{k}\)

>>> powerMean k = (** (1 / fromIntegral k)) <$> rawMoment k

All other means can be expressed in terms of power mean. It is also known as the generalized mean.

See https://en.wikipedia.org/wiki/Generalized_mean

incrPowerMeanFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a Source #

Like powerMean but powers can be negative or fractional. This is slower than powerMean for positive intergal powers.

>>> powerMeanFrac k = (** (1 / k)) <$> rawMomentFrac k

Weighted Means

Exponential Smoothing.

ewma :: forall (m :: Type -> Type). Monad m => Double -> Scanl m Double Double Source #

Exponentially weighted moving average is given by ewma w where w is the smoothing factor varying between 0 and 1. To compute the moving average the newest input is given a weight of w and the running ewma is given a weight of 1 - w.

For an empty stream this API returns 0. For a non-empty stream the first value in the stream is used as the initial ewma.

The higher the smoothing factor the more weight is given to the new value. Consider some interesting special cases, when w is 1 the ewma is always the newest value, when w is 0 then the ewma is always the oldest value. If w is 0.5 then the new inputs get exponentially weighted by weights 1/2, 1/4, 1/8, ..., see below for details.

Mathematically, we define ewma \(s_n\), of \(n\) values, \(x_1,\ldots,x_n\), recursively as follows:

\(\begin{align} s_0& = x_0, \quad \text{initial value}\\ s_n & = \alpha x_{n} + (1-\alpha)s_{n-1},\quad n>0 \end{align}\)

If we expand the recursive term it reveals an exponential series:

\(s_n = \alpha \left[x_n + (1-\alpha)x_{n-1} + (1-\alpha)^2 x_{n-2} + \cdots + (1-\alpha)^{n-1} x_1 \right] + (1-\alpha)^n x_0\)

where \(\alpha\), the smoothing factor, is in the range \(0 <\alpha < 1\). More the value of \(\alpha\), the more weight is given to newer values. As a special case if it is 0 then the weighted sum would always be the same as the oldest value, if it is 1 then the sum would always be the same as the newest value.

See https://en.wikipedia.org/wiki/Moving_average

See https://en.wikipedia.org/wiki/Exponential_smoothing

ewmaRampUpSmoothing :: forall (m :: Type -> Type). Monad m => Double -> Double -> Scanl m Double Double Source #

ewma n k is like ewma but the smoothing factor used is itself exponentially smoothened starting from 1 to the final value k using n as its smoothing factor. In other words, the smoothing factor is derived by smoothing the series 1,k,k,k,... using n as the smoothing factor. As time goes on, the smoothing factor gets closer to k.

incrEwma :: forall (m :: Type -> Type). MonadIO m => Double -> Scanl m (Incr Double, RingArray Double) Double Source #

incrEwma w computes the ewma of the elements incrementally in a window using w as the smoothing factor.

ewma can be calculated incrementally as follows. If \(x_i\) is the value entering the window, n is the size of window, \(x_1\) is the second last value in the old window and \(x_0\) is the value exiting the window:

\(s_{new} = \alpha x_i + (1-\alpha)s_{old} + (1-\alpha)^{n} (x_1 - x_0\))

Unimplemented

Spread

Second order central moment is a statistical measure of dispersion. The \(k\)th moment about the mean (or \(k\)th central moment) is defined as:

\(\mu_k = \frac{1}{n}\sum_{i=1}^n {(x_{i}-\mu)}^k\)

See https://mathworld.wolfram.com/CentralMoment.html .

See https://en.wikipedia.org/wiki/Statistical_dispersion .

incrRange :: forall (m :: Type -> Type) a. (Monad m, Num a, Ord a) => Scanl m (Incr a) a Source #

The difference between the maximum and minimum elements of a rolling window.

>>> range = Fold.teeWith (-) maximum minimum

If you want to compute the range of the entire stream Fold.teeWith (-) Fold.maximum Fold.minimum from the streamly package would be much faster.

Space: \(\mathcal{O}(n)\) where n is the window size.

Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.

incrMd :: forall (m :: Type -> Type). MonadIO m => Scanl m (Incr Double, RingArray Double) Double Source #

md n computes the mean absolute deviation (or mean deviation) in a sliding window of last n elements in the stream.

The input of the scan is (incr, ring), where incr is the incremental window operation and ring is the contents of the entire window in a ring array.

The mean absolute deviation of the numbers \(x_1, x_2, \ldots, x_n\) is:

\(MD = \frac{1}{n}\sum_{i=1}^n |x_i-\mu|\)

Note: It is expensive to compute MD in a sliding window. We need to maintain a ring buffer of last n elements and maintain a running mean, when the result is extracted we need to compute the difference of all elements from the mean and get the average. Using standard deviation may be computationally cheaper.

See https://en.wikipedia.org/wiki/Average_absolute_deviation .

Pre-release

incrVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #

The variance \(\sigma^2\) of a population of \(n\) equally likely values is defined as the average of the squares of deviations from the mean \(\mu\). In other words, second moment about the mean:

\(\sigma^2 = \frac{1}{n}\sum_{i=1}^n {(x_{i}-\mu)}^2\)

\(\sigma^2 = rawMoment(2) - \mu^2\)

\(\mu_2 = -(\mu'_1)^2 + \mu'_2\)

Note that the variance would be biased if applied to estimate the population variance from a sample of the population. See sampleVariance.

See https://en.wikipedia.org/wiki/Variance.

Space: \(\mathcal{O}(1)\)

Time: \(\mathcal{O}(n)\)

incrStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Standard deviation \(\sigma\) is the square root of variance.

This is the population standard deviation or uncorrected sample standard deviation.

>>> stdDev = sqrt <$> variance

See https://en.wikipedia.org/wiki/Standard_deviation .

Space: \(\mathcal{O}(1)\)

Time: \(\mathcal{O}(n)\)

Shape

Third and fourth order central moments are a measure of shape.

See https://en.wikipedia.org/wiki/Shape_parameter .

See https://en.wikipedia.org/wiki/Standardized_moment .

incrSkewness :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Skewness \(\gamma\) is the standardized third central moment defined as:

\(\tilde{\mu}_3 = \frac{\mu_3}{\sigma^3}\)

The third central moment can be computed in terms of raw moments:

\(\mu_3 = 2(\mu'_1)^3 - 3\mu'_1\mu'_2 + \mu'_3\)

Substituting \(\mu'_1 = \mu\), and \(\mu'_2 = \mu^2 + \sigma^2\):

\(\mu_3 = -\mu^3 - 3\mu\sigma^2 + \mu'_3\)

Skewness is a measure of symmetry of the probability distribution. It is 0 for a symmetric distribution, negative for a distribution that is skewed towards left, positive for a distribution skewed towards right.

For a normal like distribution the median can be found around \(\mu - \frac{\gamma\sigma}{6}\) and the mode can be found around \(\mu - \frac{\gamma \sigma}{2}\).

See https://en.wikipedia.org/wiki/Skewness .

incrKurtosis :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Kurtosis \(\kappa\) is the standardized fourth central moment, defined as:

\(\tilde{\mu}_4 = \frac{\mu_4}{\sigma^4}\)

The fourth central moment can be computed in terms of raw moments:

\(\mu_4 = -3(\mu'_1)^4 + 6(\mu'_1)^2\mu'_2 - 4\mu'_1\mu'_3\ + \mu'_4\)

Substituting \(\mu'_1 = \mu\), and \(\mu'_2 = \mu^2 + \sigma^2\):

\(\mu_4 = 3\mu^4 + 6\mu^2\sigma^2 - 4\mu\mu'_3 + \mu'_4\)

It is always non-negative. It is 0 for a point distribution, low for light tailed (platykurtic) distributions and high for heavy tailed (leptokurtic) distributions.

\(\kappa >= \gamma^2 + 1\)

For a normal distribution \(\kappa = 3\sigma^4\).

See https://en.wikipedia.org/wiki/Kurtosis .

Estimation

incrSampleVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #

Unbiased sample variance i.e. the variance of a sample corrected to better estimate the variance of the population, defined as:

\(s^2 = \frac{1}{n - 1}\sum_{i=1}^n {(x_{i}-\mu)}^2\)

\(s^2 = \frac{n}{n - 1} \times \sigma^2\).

See https://en.wikipedia.org/wiki/Bessel%27s_correction.

incrSampleStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Sample standard deviation:

\(s = \sqrt{sampleVariance}\)

>>> sampleStdDev = sqrt <$> sampleVariance

See https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation .

incrStdErrMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #

Standard error of the sample mean (SEM), defined as:

\( SEM = \frac{sampleStdDev}{\sqrt{n}} \)

See https://en.wikipedia.org/wiki/Standard_error .

Space: \(\mathcal{O}(1)\)

Time: \(\mathcal{O}(n)\)

Probability Distribution

incrFrequency :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) (Map a Int) Source #

Count the frequency of elements in a sliding window.

>>> input = Stream.fromList [1,1,3,4,4::Int]
>>> f = Ring.slidingWindow 4 Statistics.frequency
>>> Stream.fold f input
fromList [(1,1),(3,1),(4,2)]