Copyright | (c) 2024 Composewell Technologies |
---|---|
License | Apache-2.0 |
Maintainer | streamly@composewell.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | None |
Language | Haskell2010 |
Streamly.Statistics.Scanl
Description
See Streamly.Statistics for general information. This module provides scans instead of folds.
Synopsis
- incrMinimum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a
- incrMaximum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a
- incrRawMoment :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Int -> Scanl m (Incr a) a
- incrRawMomentFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a
- incrWelfordMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a
- incrGeometricMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrHarmonicMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a
- incrQuadraticMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrPowerMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Int -> Scanl m (Incr a) a
- incrPowerMeanFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a
- ewma :: forall (m :: Type -> Type). Monad m => Double -> Scanl m Double Double
- ewmaRampUpSmoothing :: forall (m :: Type -> Type). Monad m => Double -> Double -> Scanl m Double Double
- incrEwma :: forall (m :: Type -> Type). MonadIO m => Double -> Scanl m (Incr Double, RingArray Double) Double
- incrRange :: forall (m :: Type -> Type) a. (Monad m, Num a, Ord a) => Scanl m (Incr a) a
- incrMd :: forall (m :: Type -> Type). MonadIO m => Scanl m (Incr Double, RingArray Double) Double
- incrVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a
- incrStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrSkewness :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrKurtosis :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrSampleVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a
- incrSampleStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrStdErrMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a
- incrFrequency :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) (Map a Int)
Incremental Scans
Scans of type Scanl m (a, Maybe a) b
are incremental sliding window
scans. An input of type (a, Nothing)
indicates that the input element
a
is being inserted in the window without ejecting an old value
increasing the window size by 1. An input of type (a, Just a)
indicates that the first element is being inserted in the window and the
second element is being removed from the window, the window size remains
the same. The window size can only increase and never decrease.
You can compute the statistics over the entire stream using sliding
window folds by keeping the second element of the input tuple as
Nothing
.
Also see Streamly.Data.Scanl for some basic window scans.
Summary Statistics
Location
incrMinimum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a Source #
The minimum element in a rolling window.
For smaller window sizes (< 30) Streamly.Internal.Data.Fold.windowMinimum performs better. If you want to compute the minimum of the entire stream Fold.min from streamly package would be much faster.
Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.
incrMaximum :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) a Source #
The maximum element in a rolling window.
For smaller window sizes (< 30) Streamly.Internal.Data.Fold.windowMaximum performs better. If you want to compute the maximum of the entire stream Streamly.Data.Fold.maximum from streamly package would be much faster.
Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.
incrRawMoment :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Int -> Scanl m (Incr a) a Source #
Raw moment is the moment about 0. The \(k\)th raw moment is defined as:
\(\mu'_k = \frac{\sum_{i=1}^n x_{i}^k}{n}\)
>>>
rawMoment k = Fold.teeWith (/) (Fold.windowPowerSum p) Fold.windowLength
See https://en.wikipedia.org/wiki/Moment_(mathematics) .
Space: \(\mathcal{O}(1)\)
Time: \(\mathcal{O}(n)\)
incrRawMomentFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a Source #
Like rawMoment
but powers can be negative or fractional. This is
slower than rawMoment
for positive intergal powers.
>>>
rawMomentFrac p = Fold.teeWith (/) (Fold.windowPowerSumFrac p) Fold.windowLength
incrWelfordMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #
Same as mean
but uses Welford's algorithm to compute the mean
incrementally.
It maintains a running mean instead of a running sum and adjusts the mean
based on a new value. This is slower than mean
because of using the
division operation on each step and it is numerically unstable (as of now).
The advantage over mean
could be no overflow if the numbers are large,
because we do not maintain a sum, but that is a highly unlikely corner case.
Internal
incrGeometricMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Geometric mean, defined as:
\(GM = \sqrt[n]{x_1 x_2 \cdots x_n}\)
\(GM = \left(\prod_{i=1}^n x_i\right)^\frac{1}{n}\)
or, equivalently, as the arithmetic mean in log space:
\(GM = e ^{{\frac{\sum_{i=1}^{n}\ln a_i}{n}}}\)
>>>
geometricMean = exp <$> lmap log mean
incrHarmonicMean :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #
The harmonic mean of the positive numbers \(x_1, x_2, \ldots, x_n\) is defined as:
\(HM = \frac{n}{\frac1{x_1} + \frac1{x_2} + \cdots + \frac1{x_n}}\)
\(HM = \left(\frac{\sum\limits_{i=1}^n x_i^{-1}}{n}\right)^{-1}\)
>>>
harmonicMean = Fold.teeWith (/) length (lmap recip sum)
>>>
harmonicMean = powerMeanFrac (-1)
incrQuadraticMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
The quadratic mean or root mean square (rms) of the numbers \(x_1, x_2, \ldots, x_n\) is defined as:
\(RMS = \sqrt{ \frac{1}{n} \left( x_1^2 + x_2^2 + \cdots + x_n^2 \right) }.\)
>>>
quadraticMean = powerMean 2
incrPowerMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Int -> Scanl m (Incr a) a Source #
The \(k\)th power mean of numbers \(x_1, x_2, \ldots, x_n\) is:
\(M_k = \left( \frac{1}{n} \sum_{i=1}^n x_i^k \right)^{\frac{1}{k}}\)
\(powerMean(k) = (rawMoment(k))^\frac{1}{k}\)
>>>
powerMean k = (** (1 / fromIntegral k)) <$> rawMoment k
All other means can be expressed in terms of power mean. It is also known as the generalized mean.
incrPowerMeanFrac :: forall (m :: Type -> Type) a. (Monad m, Floating a) => a -> Scanl m (Incr a) a Source #
Like powerMean
but powers can be negative or fractional. This is
slower than powerMean
for positive intergal powers.
>>>
powerMeanFrac k = (** (1 / k)) <$> rawMomentFrac k
Weighted Means
Exponential Smoothing.
ewma :: forall (m :: Type -> Type). Monad m => Double -> Scanl m Double Double Source #
Exponentially weighted moving average is given by ewma w
where w
is
the smoothing factor varying between 0 and 1. To compute the moving average
the newest input is given a weight of w
and the running ewma is given a
weight of 1 - w
.
For an empty stream this API returns 0. For a non-empty stream the first value in the stream is used as the initial ewma.
The higher the smoothing factor the more weight is given to the new value.
Consider some interesting special cases, when w
is 1 the ewma is always
the newest value, when w
is 0 then the ewma is always the oldest value. If
w
is 0.5
then the new inputs get exponentially weighted by weights
1/2, 1/4, 1/8, ...
, see below for details.
Mathematically, we define ewma \(s_n\), of \(n\) values, \(x_1,\ldots,x_n\), recursively as follows:
\(\begin{align} s_0& = x_0, \quad \text{initial value}\\ s_n & = \alpha x_{n} + (1-\alpha)s_{n-1},\quad n>0 \end{align}\)
If we expand the recursive term it reveals an exponential series:
\(s_n = \alpha \left[x_n + (1-\alpha)x_{n-1} + (1-\alpha)^2 x_{n-2} + \cdots + (1-\alpha)^{n-1} x_1 \right] + (1-\alpha)^n x_0\)
where \(\alpha\), the smoothing factor, is in the range \(0 <\alpha < 1\). More the value of \(\alpha\), the more weight is given to newer values. As a special case if it is 0 then the weighted sum would always be the same as the oldest value, if it is 1 then the sum would always be the same as the newest value.
ewmaRampUpSmoothing :: forall (m :: Type -> Type). Monad m => Double -> Double -> Scanl m Double Double Source #
ewma n k
is like ewma
but the smoothing factor used is itself
exponentially smoothened starting from 1
to the final value k
using n
as its smoothing factor. In other words, the smoothing factor is derived by
smoothing the series 1,k,k,k,...
using n
as the smoothing factor. As
time goes on, the smoothing factor gets closer to k.
incrEwma :: forall (m :: Type -> Type). MonadIO m => Double -> Scanl m (Incr Double, RingArray Double) Double Source #
incrEwma w
computes the ewma of the elements incrementally in a window
using w
as the smoothing factor.
ewma can be calculated incrementally as follows. If \(x_i\) is the value entering the window, n is the size of window, \(x_1\) is the second last value in the old window and \(x_0\) is the value exiting the window:
\(s_{new} = \alpha x_i + (1-\alpha)s_{old} + (1-\alpha)^{n} (x_1 - x_0\))
Unimplemented
Spread
Second order central moment is a statistical measure of dispersion. The \(k\)th moment about the mean (or \(k\)th central moment) is defined as:
\(\mu_k = \frac{1}{n}\sum_{i=1}^n {(x_{i}-\mu)}^k\)
incrRange :: forall (m :: Type -> Type) a. (Monad m, Num a, Ord a) => Scanl m (Incr a) a Source #
The difference between the maximum and minimum elements of a rolling window.
>>>
range = Fold.teeWith (-) maximum minimum
If you want to compute the range of the entire stream Fold.teeWith (-)
Fold.maximum Fold.minimum
from the streamly package would be much faster.
Space: \(\mathcal{O}(n)\) where n
is the window size.
Time: \(\mathcal{O}(n*w)\) where \(w\) is the window size.
incrMd :: forall (m :: Type -> Type). MonadIO m => Scanl m (Incr Double, RingArray Double) Double Source #
md n
computes the mean absolute deviation (or mean deviation) in a
sliding window of last n
elements in the stream.
The input of the scan is (incr, ring), where incr is the incremental window operation and ring is the contents of the entire window in a ring array.
The mean absolute deviation of the numbers \(x_1, x_2, \ldots, x_n\) is:
\(MD = \frac{1}{n}\sum_{i=1}^n |x_i-\mu|\)
Note: It is expensive to compute MD in a sliding window. We need to maintain a ring buffer of last n elements and maintain a running mean, when the result is extracted we need to compute the difference of all elements from the mean and get the average. Using standard deviation may be computationally cheaper.
See https://en.wikipedia.org/wiki/Average_absolute_deviation .
Pre-release
incrVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #
The variance \(\sigma^2\) of a population of \(n\) equally likely values is defined as the average of the squares of deviations from the mean \(\mu\). In other words, second moment about the mean:
\(\sigma^2 = \frac{1}{n}\sum_{i=1}^n {(x_{i}-\mu)}^2\)
\(\sigma^2 = rawMoment(2) - \mu^2\)
\(\mu_2 = -(\mu'_1)^2 + \mu'_2\)
Note that the variance would be biased if applied to estimate the population
variance from a sample of the population. See sampleVariance
.
See https://en.wikipedia.org/wiki/Variance.
Space: \(\mathcal{O}(1)\)
Time: \(\mathcal{O}(n)\)
incrStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Standard deviation \(\sigma\) is the square root of variance
.
This is the population standard deviation or uncorrected sample standard deviation.
>>>
stdDev = sqrt <$> variance
See https://en.wikipedia.org/wiki/Standard_deviation .
Space: \(\mathcal{O}(1)\)
Time: \(\mathcal{O}(n)\)
Shape
Third and fourth order central moments are a measure of shape.
incrSkewness :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Skewness \(\gamma\) is the standardized third central moment defined as:
\(\tilde{\mu}_3 = \frac{\mu_3}{\sigma^3}\)
The third central moment can be computed in terms of raw moments:
\(\mu_3 = 2(\mu'_1)^3 - 3\mu'_1\mu'_2 + \mu'_3\)
Substituting \(\mu'_1 = \mu\), and \(\mu'_2 = \mu^2 + \sigma^2\):
\(\mu_3 = -\mu^3 - 3\mu\sigma^2 + \mu'_3\)
Skewness is a measure of symmetry of the probability distribution. It is 0 for a symmetric distribution, negative for a distribution that is skewed towards left, positive for a distribution skewed towards right.
For a normal like distribution the median can be found around \(\mu - \frac{\gamma\sigma}{6}\) and the mode can be found around \(\mu - \frac{\gamma \sigma}{2}\).
incrKurtosis :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Kurtosis \(\kappa\) is the standardized fourth central moment, defined as:
\(\tilde{\mu}_4 = \frac{\mu_4}{\sigma^4}\)
The fourth central moment can be computed in terms of raw moments:
\(\mu_4 = -3(\mu'_1)^4 + 6(\mu'_1)^2\mu'_2 - 4\mu'_1\mu'_3\ + \mu'_4\)
Substituting \(\mu'_1 = \mu\), and \(\mu'_2 = \mu^2 + \sigma^2\):
\(\mu_4 = 3\mu^4 + 6\mu^2\sigma^2 - 4\mu\mu'_3 + \mu'_4\)
It is always non-negative. It is 0 for a point distribution, low for light tailed (platykurtic) distributions and high for heavy tailed (leptokurtic) distributions.
\(\kappa >= \gamma^2 + 1\)
For a normal distribution \(\kappa = 3\sigma^4\).
Estimation
incrSampleVariance :: forall (m :: Type -> Type) a. (Monad m, Fractional a) => Scanl m (Incr a) a Source #
Unbiased sample variance i.e. the variance of a sample corrected to better estimate the variance of the population, defined as:
\(s^2 = \frac{1}{n - 1}\sum_{i=1}^n {(x_{i}-\mu)}^2\)
\(s^2 = \frac{n}{n - 1} \times \sigma^2\).
incrSampleStdDev :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Sample standard deviation:
\(s = \sqrt{sampleVariance}\)
>>>
sampleStdDev = sqrt <$> sampleVariance
See https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation .
incrStdErrMean :: forall (m :: Type -> Type) a. (Monad m, Floating a) => Scanl m (Incr a) a Source #
Standard error of the sample mean (SEM), defined as:
\( SEM = \frac{sampleStdDev}{\sqrt{n}} \)
See https://en.wikipedia.org/wiki/Standard_error .
Space: \(\mathcal{O}(1)\)
Time: \(\mathcal{O}(n)\)
Probability Distribution
incrFrequency :: forall (m :: Type -> Type) a. (Monad m, Ord a) => Scanl m (Incr a) (Map a Int) Source #
Count the frequency of elements in a sliding window.
>>>
input = Stream.fromList [1,1,3,4,4::Int]
>>>
f = Ring.slidingWindow 4 Statistics.frequency
>>>
Stream.fold f input
fromList [(1,1),(3,1),(4,2)]