Minimal Example

The main goal of this notebook is to demonstrate how to use STACIE with a minimal, self-contained example. First, the properties of a simple Markov process are discussed, and then data is generated with this process. (A detailed derivation of the analytical results is provided in the final section.) The generated sequences are then analyzed using two models, followed by some comments on their applicability.

A secondary goal is to thoroughly discuss the plots generated by STACIE, which can help detect problems with the analysis or input data.

Library Imports and Matplotlib Configuration

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from stacie import (
    compute_spectrum,
    estimate_acint,
    ExpPolyModel,
    LorentzModel,
    UnitConfig,
    plot_extras,
    plot_fitted_spectrum,
)
mpl.rc_file("matplotlibrc")
%config InlineBackend.figure_formats = ["svg"]

The Markov Process

We use the following simple discrete-time Markov process:

\[ \hat{x}_{n+1} = \alpha \hat{x}_n + \beta \hat{z}_n \]

where \(\hat{z}_n\) are uncorrelated standard normal random variables, and \(\alpha\) and \(\beta\) are real constants. The parameter \(\alpha\) controls the autocorrelation of the process, with \(0 < \alpha < 1\).

One can show that the autocorrelation function of this process is given by

\[ c_\Delta = \frac{\beta^2}{1 - \alpha^2}\, \alpha^{|\Delta|} \]

The variance, the autocorrelation integral (with \(F=1\) and \(h=1\)), and the integrated correlation time are, respectively:

\[\begin{split} \begin{aligned} \sigma^2 &= \frac{\beta^2}{1 - \alpha^2} \\ \mathcal{I} &= \frac{1}{2}\left(\frac{\beta}{1 - \alpha}\right)^2 \\ \tau_{\text{int}} &= \frac{\mathcal{I}}{F\,c_0} = \frac{1}{2}\frac{1 + \alpha}{1 - \alpha} \end{aligned} \end{split}\]

Derivations of these equations can be found in the final section of this notebook.

The example below uses the parameters \(\alpha = 31/33\) and \(\beta = \sqrt{8/1089}\), for which we expect \(\mathcal{I} = 1\) and \(\tau_{\text{int}} = 16\).
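These values follow from inverting the expressions above, fixing \(\tau_{\text{int}} = 16\) and \(\mathcal{I} = 1\):

\[\begin{split} \begin{aligned} \frac{1 + \alpha}{1 - \alpha} = 2\tau_{\text{int}} = 32 \quad&\Rightarrow\quad \alpha = \frac{31}{33} \\ \beta = (1 - \alpha)\sqrt{2\mathcal{I}} = \frac{2\sqrt{2}}{33} \quad&\Rightarrow\quad \beta^2 = \frac{8}{1089} \end{aligned} \end{split}\]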

Data Generation

The following code cell generates 64 independent realizations of the Markov process, each 32768 steps long. The implementation vectorizes over the independent sequences, which is much faster than generating them one by one.

nseq = 64
nstep = 1024 * 32
alpha = 31 / 33
beta = np.sqrt(8 / 1089)
# Standard deviation of the stationary distribution,
# used to draw the initial states from equilibrium.
std = beta / np.sqrt(1 - alpha**2)
rng = np.random.default_rng(0)
sequences = np.zeros((nseq, nstep))
sequences[:, 0] = rng.normal(0, std, nseq)
# Vectorized recursion: advance all sequences one step at a time.
for i in range(1, nstep):
    sequences[:, i] = alpha * sequences[:, i - 1] + rng.normal(0, beta, nseq)
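
As a quick sanity check (an addition to the original notebook), the empirical variance of the generated sequences can be compared with the analytical value \(\sigma^2 = \beta^2/(1-\alpha^2) = 1/16\):

# Sanity check: the sample variance of the stationary sequences should be
# close to the analytical variance beta**2 / (1 - alpha**2) = 1/16.
print("analytical variance:", std**2)
print("empirical variance: ", sequences.var())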

Analysis With STACIE, Using the ExpPoly Model

The following code cell estimates the autocorrelation integral using the ExpPolyModel. Because the autocorrelation function decays exponentially, the spectrum features a Lorentzian peak at zero frequency. Hence, we use degrees \(S=\{0, 2\}\) for the polynomial, which ensures a zero derivative at the origin.
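
Concretely, with these degrees the model takes the following form (a sketch consistent with the exppoly(0, 2) label in the output below), whose first derivative indeed vanishes at \(f=0\):

\[ \hat{S}(f) = \exp\left(b_0 + b_2 f^2\right) \]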

spectrum = compute_spectrum(sequences)
result_exppoly = estimate_acint(spectrum, ExpPolyModel([0, 2]), verbose=True)
CUTOFF FREQUENCY SCAN cv2l(125%)
     neff   criterion        fcut
---------  ----------  ----------
     10.0        14.4    2.83e-04
     10.6        14.4    3.01e-04
     11.3        14.4    3.20e-04
     12.0        14.4    3.41e-04
     12.7        14.3    3.63e-04
     13.5        14.2    3.86e-04
     14.3        14.0    4.11e-04
     15.2        13.7    4.38e-04
     16.2        13.5    4.66e-04
     17.2        13.3    4.96e-04
     18.2        13.1    5.28e-04
     19.4        13.1    5.62e-04
     20.6        13.1    5.98e-04
     21.9        13.3    6.37e-04
     23.3        13.4    6.78e-04
     24.8        13.5    7.21e-04
     26.3        13.6    7.68e-04
     28.0        13.5    8.18e-04
     29.8        13.1    8.70e-04
     31.6        12.5    9.26e-04
     33.6        11.7    9.86e-04
     35.8        11.0    1.05e-03
     38.1        10.5    1.12e-03
     40.5        10.1    1.19e-03
     43.1        10.0    1.27e-03
     45.8        10.0    1.35e-03
     48.7        10.0    1.43e-03
     51.8        10.0    1.53e-03
     55.2         9.7    1.63e-03
     58.7         9.3    1.73e-03
     62.4         8.8    1.84e-03
     66.4         8.3    1.96e-03
     70.7         7.7    2.09e-03
     75.2         7.3    2.22e-03
     80.0         7.0    2.37e-03
     85.1         6.9    2.52e-03
     90.6         7.1    2.68e-03
     96.4         7.3    2.85e-03
    102.6         7.6    3.04e-03
    109.2         7.7    3.23e-03
    116.2         7.7    3.44e-03
    123.7         7.5    3.66e-03
    131.6         7.2    3.90e-03
    140.1         7.0    4.15e-03
    149.1         6.9    4.42e-03
    158.6         6.9    4.70e-03
    168.8         7.1    5.01e-03
    179.7         7.4    5.33e-03
    191.2         8.2    5.67e-03
    203.6         9.7    6.04e-03
    216.6        12.9    6.43e-03
    230.6        19.2    6.84e-03
    245.4        30.7    7.29e-03
    261.2        50.9    7.76e-03
    278.0        84.7    8.26e-03
    295.9       140.3    8.79e-03
Cutoff criterion exceeds incumbent + margin: 6.9 + 100.0.

INPUT TIME SERIES
    Time step:                     1.00e+00 
    Simulation time:               3.28e+04 
    Maximum degrees of freedom:    128.0

MAIN RESULTS
    Autocorrelation integral:      9.99e-01 ± 1.76e-02 
    Integrated correlation time:   1.59e+01 ± 2.80e-01 

SANITY CHECKS (weighted averages over cutoff grid)
    Effective number of points:    117.3 (ideally > 40)
    Regression cost Z-score:       -0.1 (ideally < 2)
    Cutoff criterion Z-score:      0.3 (ideally < 2)

MODEL exppoly(0, 2) | CUTOFF CRITERION cv2l(125%)
    Number of parameters:          2
    Average cutoff frequency:      3.47e-03

With the verbose=True option, STACIE prints the results of the analysis. The first section of the screen output shows the progress of the cutoff frequency scan and includes the following columns:

  • neff: the number of effective spectrum data points used in the fit.

  • criterion: the value of the cutoff criterion used for the weighted average over solutions at different cutoff frequencies.

  • fcut: the cutoff frequency used for the fit.

The remaining sections summarize the analysis and can be related directly to the concepts in the theory section. They already show that STACIE reproduces the expected values, \(\mathcal{I} = 1\) and \(\tau_{\text{int}} = 16\), within the estimated uncertainties.
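
The main results are also available programmatically as attributes of the object returned by estimate_acint(). The snippet below only uses attributes that also appear in the regression tests at the end of this notebook:

# Access the estimates directly from the result object.
print(f"acint        = {result_exppoly.acint:.4e}")
print(f"corrtime_int = {result_exppoly.corrtime_int:.4e}")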

The next code cell plots the model fitted to the spectrum.

uc = UnitConfig()
plt.close("fitted_exppoly")
fig, ax = plt.subplots(num="fitted_exppoly")
plot_fitted_spectrum(ax, uc, result_exppoly)
[Figure: the spectrum with the fitted ExpPoly model]

This plot displays a lot of information:

  • The spectrum is shown as blue dots.

  • The fitted model is plotted as a solid green line. Its parameters are weighted averages over multiple cutoff frequencies.

  • The red dotted line shows the weighted average of the switching functions, used to identify the low-frequency region of the spectrum.

  • The green band shows the expected uncertainty of the sampling spectrum, as a 95% confidence interval. Most of the blue data points should fall within this band, at least in the region where the model is fitted to the data (red dotted line close to 1.0).

  • The green dashed lines are the 95% confidence intervals of the fitted model. This confidence interval should be narrow in the low-frequency region.

  • The black vertical line corresponds to the weighted average of the cutoff frequencies used in the fit.

  • The plot title summarizes key information about the analysis, including:

    • The model used to fit the data.

    • The autocorrelation integral and its uncertainty.

    • The integrated correlation time and its uncertainty.

  • The legend shows the confidence level used for plotting.

The next plot shows some intermediate results, which can help you understand the fitting process or detect problems.

plt.close("extras_exppoly")
fig, axs = plt.subplots(2, 2, num="extras_exppoly")
plot_extras(axs, uc, result_exppoly)
[Figure: the four extras panels for the ExpPoly analysis]

This plot contains four panels with extra results:

  1. The top left panel shows the weight assigned to each cutoff frequency, based on the CV2L criterion. Things to look for:

    • If the cutoff weight is large for the lowest cutoffs, then the input sequences are likely too short. This typically also results in a low number of effective data points. Increasing the number of steps in the inputs will increase the frequency resolution of the spectrum, allowing for better fits with lower cutoff frequencies.

    • If the cutoff weight is large for the highest cutoffs, there is most likely also a problem with the analysis. There can be multiple causes for this:

      • The input sequences are much longer than necessary. In this case, you can increase the neff_max option of estimate_acint() to fit the model with higher cutoffs. However, this can be expensive, so it is recommended to use more and shorter sequences instead. This can be done by preprocessing the data with the split() function before computing the spectrum (a sketch follows this list). Even better is to plan ahead and avoid this situation.

      • The data is block-averaged with a block size that is too large, which limits the available frequency range.

  2. The top right panel shows the autocorrelation integral for each cutoff frequency. The dots indicate the extent to which each point contributes to the final result. (Black is high weight, white is low weight.) Things to look for:

    • If the autocorrelation integral shows sharp jumps at low cutoff frequencies, the model is most likely overfitting the data at low frequencies. These points are practically always given low cutoff weights, so you can ignore them. However, if you want to exclude them from the analysis, you can increase the neff_min option of estimate_acint().

  3. The bottom left panel shows the Z-scores of the regression cost and the cutoff criterion as functions of the cutoff frequency. The Z-score is the number of standard deviations by which a value deviates from its mean. For ill-behaved fits, the Z-scores easily exceed 2. With sufficient input data, high Z-scores should only occur where the cutoff weight is low. If the Z-scores are high for cutoff frequencies with high cutoff weights, the input data is insufficient for reliable error estimation or the model is not appropriate. In this case, it is recommended to use a different model or to increase the length of the input sequences.

  4. The bottom right panel shows the eigenvalues of the Hessian matrix of the fit, in a preconditioned parameter space, at each cutoff frequency. A large spread of the eigenvalues indicates that the fit is not well constrained. Such a large spread typically results in overfitting artifacts.
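
For completeness, here is the preprocessing sketch referenced in the discussion of the first panel: cutting long sequences into shorter ones before computing the spectrum. The import location and signature of split() are assumptions here; consult the STACIE API documentation for the actual interface.

# Hypothetical preprocessing sketch: cut each sequence into 8 shorter
# parts before computing the spectrum. The import location and signature
# of split() are assumptions; check the STACIE API documentation.
from stacie import split

spectrum_short = compute_spectrum(split(sequences, 8))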

Analysis With STACIE, Using the Lorentz Model

In this example, we know a priori that the autocorrelation function decays exponentially. Therefore, the LorentzModel should explain the spectrum perfectly, up to the statistical noise in the data.
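
This expectation can be made explicit (a short aside, not part of the original analysis). Summing the geometric series in the discrete-time Fourier transform of \(c_\Delta\) gives

\[ S(f) = \sum_{\Delta=-\infty}^{\infty} c_\Delta\, e^{-2\pi i f \Delta} = c_0\, \frac{1 - \alpha^2}{1 - 2\alpha\cos(2\pi f) + \alpha^2} \]

which reduces to a Lorentzian for small \(f\), where \(1 - 2\alpha\cos(2\pi f) + \alpha^2 \approx (1-\alpha)^2 + 4\pi^2\alpha f^2\), and whose value at the origin, \(S(0) = c_0\,(1+\alpha)/(1-\alpha) = 2\mathcal{I}\), recovers the autocorrelation integral.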

# Analysis
result_lorentz = estimate_acint(spectrum, LorentzModel(), verbose=True)

# Plotting
plt.close("fitted_lorentz")
fig, ax = plt.subplots(num="fitted_lorentz")
plot_fitted_spectrum(ax, uc, result_lorentz)
plt.close("extras_lorentz")
fig, axs = plt.subplots(2, 2, num="extras_lorentz")
plot_extras(axs, uc, result_lorentz)
CUTOFF FREQUENCY SCAN cv2l(125%)
     neff   criterion        fcut
---------  ----------  ----------
     15.0         inf    4.31e-04  (No correlation time estimate available.)
     15.9         inf    4.59e-04  (No correlation time estimate available.)
     16.9         inf    4.89e-04  (No correlation time estimate available.)
     18.0         inf    5.20e-04  (No correlation time estimate available.)
     19.1         inf    5.54e-04  (No correlation time estimate available.)
     20.3         inf    5.89e-04  (Variance of the correlation time estimate is too large.)
     21.6         inf    6.27e-04  (Variance of the correlation time estimate is too large.)
     23.0         inf    6.68e-04  (Variance of the correlation time estimate is too large.)
     24.4         inf    7.11e-04  (Variance of the correlation time estimate is too large.)
     25.9         inf    7.57e-04  (Variance of the correlation time estimate is too large.)
     27.6         inf    8.06e-04  (Variance of the correlation time estimate is too large.)
     29.3         inf    8.58e-04  (Variance of the correlation time estimate is too large.)
     31.2         inf    9.13e-04  (Variance of the correlation time estimate is too large.)
     33.2         inf    9.72e-04  (Variance of the correlation time estimate is too large.)
     35.3         inf    1.03e-03  (Variance of the correlation time estimate is too large.)
     37.5         inf    1.10e-03  (Variance of the correlation time estimate is too large.)
     39.9         inf    1.17e-03  (Variance of the correlation time estimate is too large.)
     42.4         inf    1.25e-03  (Variance of the correlation time estimate is too large.)
     45.1         inf    1.33e-03  (Variance of the correlation time estimate is too large.)
     48.0         inf    1.41e-03  (Variance of the correlation time estimate is too large.)
     51.1         inf    1.51e-03  (Variance of the correlation time estimate is too large.)
     54.4         inf    1.60e-03  (opt: Hessian matrix has non-positive eigenvalues: evals=array([-9.28856795e-05,  4.37431407e-01,  2.56266148e+00]))
     57.8         inf    1.71e-03  (Variance of the correlation time estimate is too large.)
     61.5         inf    1.82e-03  (Variance of the correlation time estimate is too large.)
     65.5         inf    1.93e-03  (Variance of the correlation time estimate is too large.)
     69.7         inf    2.06e-03  (Variance of the correlation time estimate is too large.)
     74.1         inf    2.19e-03  (Variance of the correlation time estimate is too large.)
     78.9         inf    2.33e-03  (Variance of the correlation time estimate is too large.)
     83.9         inf    2.48e-03  (Variance of the correlation time estimate is too large.)
     89.3         inf    2.64e-03  (Variance of the correlation time estimate is too large.)
     95.0         inf    2.81e-03  (Variance of the correlation time estimate is too large.)
    101.1         inf    2.99e-03  (Variance of the correlation time estimate is too large.)
    107.6         inf    3.19e-03  (Variance of the correlation time estimate is too large.)
    114.5         inf    3.39e-03  (Variance of the correlation time estimate is too large.)
    121.9         inf    3.61e-03  (Variance of the correlation time estimate is too large.)
    129.7         inf    3.84e-03  (Variance of the correlation time estimate is too large.)
    138.0         inf    4.09e-03  (Variance of the correlation time estimate is too large.)
    146.9         inf    4.36e-03  (Variance of the correlation time estimate is too large.)
    156.3         inf    4.64e-03  (Variance of the correlation time estimate is too large.)
    166.4         inf    4.94e-03  (Variance of the correlation time estimate is too large.)
    177.1         inf    5.25e-03  (Variance of the correlation time estimate is too large.)
    188.5         inf    5.59e-03  (Variance of the correlation time estimate is too large.)
    200.6         inf    5.95e-03  (Variance of the correlation time estimate is too large.)
    213.5         inf    6.34e-03  (Variance of the correlation time estimate is too large.)
    227.2         inf    6.75e-03  (Variance of the correlation time estimate is too large.)
    241.9         inf    7.18e-03  (Variance of the correlation time estimate is too large.)
    257.4        11.7    7.64e-03
    274.0        11.5    8.14e-03
    291.6        11.3    8.66e-03
    310.4        11.2    9.22e-03
    330.4        11.0    9.81e-03
    351.7        10.8    1.04e-02
    374.3        10.6    1.11e-02
    398.4        10.4    1.18e-02
    424.1        10.2    1.26e-02
    451.4        10.1    1.34e-02
    480.5         9.9    1.43e-02
    511.5         9.6    1.52e-02
    544.4         9.4    1.62e-02
    579.5         9.1    1.72e-02
    616.9         8.9    1.83e-02
    656.6         8.6    1.95e-02
    698.9         8.5    2.08e-02
    744.0         8.3    2.21e-02
    791.9         8.1    2.35e-02
    843.0         8.0    2.51e-02
    897.3         7.8    2.67e-02
    955.1         7.5    2.84e-02
   1016.7         7.0    3.02e-02
Reached the maximum number of effective points (1000).

INPUT TIME SERIES
    Time step:                     1.00e+00 
    Simulation time:               3.28e+04 
    Maximum degrees of freedom:    128.0

MAIN RESULTS
    Autocorrelation integral:      1.01e+00 ± 1.00e-02 
    Integrated correlation time:   1.60e+01 ± 1.59e-01 

SANITY CHECKS (weighted averages over cutoff grid)
    Effective number of points:    820.9 (ideally > 60)
    Regression cost Z-score:       0.6 (ideally < 2)
    Cutoff criterion Z-score:      -0.1 (ideally < 2)

MODEL lorentz(0.1) | CUTOFF CRITERION cv2l(125%)
    Number of parameters:          3
    Average cutoff frequency:      2.44e-02 
    Exponential correlation time:  1.59e+01 ± 2.80e-01 

RECOMMENDED SIMULATION SETTINGS (EXPONENTIAL CORR. TIME)
    Block time:                    5.01e+00 ± 1.57e-01 
    Simulation time:               1.00e+03 ± 2.22e+00
[Figures: the fitted Lorentz spectrum and the corresponding extras panels]

The extra plots reveal some noteworthy features:

  • At the lowest cutoff frequencies, the model clearly overfits the data: there is a large spread of the Hessian eigenvalues, and the autocorrelation integral fluctuates significantly. Because the results at these cutoff frequencies receive low weights, you don’t need to intervene manually to exclude them.

  • The cutoff weight is maximal for the highest cutoff frequencies. This is typically a sign that the input sequences are too long, but that is not the case here. While the Lorentz model would also yield excellent results with shorter sequences, the cutoff frequency would still end up high: because the Lorentz model explains the data perfectly, including more points can only lower the cutoff criterion. In more realistic cases, involving data from complex simulations or measurements, this is unlikely to happen.

Regression Tests

If you are experimenting with this notebook, you can ignore any exceptions below. The tests are only meant to pass for the notebook in its original form.

if abs(result_exppoly.acint - 1.0) > 0.03:
    raise ValueError(f"Wrong acint: {result_exppoly.acint:.4e}")
if abs(result_exppoly.corrtime_int - 16.0) > 0.5:
    raise ValueError(f"Wrong corrtime_int: {result_exppoly.corrtime_int:.4e}")
if abs(result_lorentz.acint - 1.0) > 0.03:
    raise ValueError(f"Wrong acint: {result_lorentz.acint:.4e}")
if abs(result_lorentz.corrtime_int - 16.0) > 0.5:
    raise ValueError(f"Wrong corrtime_int: {result_lorentz.corrtime_int:.4e}")
if abs(result_lorentz.corrtime_exp - 16.0) > 0.5:
    raise ValueError(f"Wrong corrtime_exp: {result_lorentz.corrtime_exp:.4e}")

Derivation of the Autocorrelation Integral

The Markov process is defined by the following equation:

\[ \hat{x}_{n+1} = \alpha \hat{x}_n + \beta \hat{z}_n \]

The stationary distribution of the process is Gaussian with zero mean. The memory of the initial state decays through repeated multiplication by the factor \(\alpha\); after a long time, only the accumulated contributions of the normal noise \(\hat{z}_n\) remain.

Since the two terms on the right-hand side of the equation are independent, the variance of the stationary distribution can be found by solving:

\[ c_0 = \alpha^2 c_0 + \beta^2 \]

This gives:

\[ c_0 = \frac{\beta^2}{1 - \alpha^2} \]

The covariance of two neighboring points is given by:

\[ \operatorname{cov}[\hat{x}_n, \hat{x}_{n+1}] = \alpha \operatorname{cov}[\hat{x}_n, \hat{x}_n] = \alpha\, c_0 \]

This can easily be generalized to points separated by \(\Delta>0\) steps through induction:

\[ \operatorname{cov}[\hat{x}_n, \hat{x}_{n+\Delta}] = \alpha\, \operatorname{cov}[\hat{x}_n, \hat{x}_{n+\Delta-1}] = \alpha^{\Delta}\, c_0 \]

with a similar result for \(\Delta<0\). Combining these results gives:

\[ c_\Delta = \frac{\beta^2}{1 - \alpha^2}\, \alpha^{|\Delta|} \]

The autocorrelation integral is defined by a simple quadrature rule (with \(F=1\) and \(h=1\)):

\[ \mathcal{I} = \frac{1}{2} \sum_{\Delta=-\infty}^{\infty} \frac{\beta^2}{1 - \alpha^2}\, \alpha^{|\Delta|} \]

This sum can be evaluated using geometric series. One must be careful not to double-count the \(\Delta=0\) term: the factor \(\tfrac{1}{2}\) in front of the two-sided sum cancels against the two identical one-sided sums over positive and negative \(\Delta\), but not for the central term. Isolating the \(\Delta=0\) term and substituting \(n = \Delta - 1\) in the remainder, so that \(\sum_{\Delta=1}^\infty \alpha^{\Delta}=\alpha\sum_{n=0}^{\infty}\alpha^n\), gives:

\[\begin{split} \begin{aligned} \mathcal{I} &= \frac{\beta^2}{(1 - \alpha^2)}\left( \frac{1}{2} + \alpha \sum_{n=0}^{\infty} \alpha^n \right) \\ &= \frac{\beta^2}{(1 - \alpha^2)}\left( \frac{1}{2} + \frac{\alpha}{1 - \alpha} \right) \\ &= \frac{\beta^2}{(1 - \alpha^2)}\left( \frac{1 + \alpha}{2(1 - \alpha)} \right) \\ &= \frac{1}{2}\left(\frac{\beta}{1 - \alpha}\right)^2 \end{aligned} \end{split}\]
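
As a final numerical check (an addition to the original notebook), the truncated two-sided sum reproduces the closed-form result:

# Verify the geometric-series result by truncating the two-sided sum.
delta = np.arange(-10000, 10001)
acf = beta**2 / (1 - alpha**2) * alpha ** np.abs(delta)
print("truncated sum:", 0.5 * acf.sum())
print("closed form:  ", 0.5 * (beta / (1 - alpha)) ** 2)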