Uncertainty of the Mean of Time-Correlated Data

This notebook shows how to use STACIE to compute the error of the mean of a time-correlated input sequence, meaning not all of its values are statistically independent.

This is a completely self-contained example that generates input sequences (with MCMC) and then analyzes them with STACIE. Atomic units are used unless otherwise noted.

We suggest experimenting with this notebook by making the following changes:

  • Change the number of sequences and their length.

  • Change the correlation time through PROPOSAL_STEP.

Library Imports and Matplotlib Configuration

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from scipy.integrate import quad
import scipy.constants as sc
from stacie import (
    UnitConfig,
    compute_spectrum,
    estimate_acint,
    LorentzModel,
    plot_extras,
    plot_fitted_spectrum,
    plot_spectrum,
)
mpl.rc_file("matplotlibrc")
%config InlineBackend.figure_formats = ["svg"]

Data Generation

The data for the analysis are generated by sampling a Kratzer–Feus potential of a diatomic molecule [Fue26, Kra20] at constant temperature. This potential is harmonic in \(1/r\):

\[ U(r) = \frac{K}{2} \left(\frac{r_0^2}{r} - r_0\right)^2 \]

where \(K\) is the force constant and \(r_0\) the equilibrium bond length. The sampled probability density is the Boltzmann distribution:

\[ p_r(r) = \frac{1}{Z} \exp\left(-\frac{U(r)}{k_\text{B}T}\right) \]

where the normalization \(Z\) is the classical partition function.

In this example, the force constant and bond length of the lithium dimer are used, with parameters from Zhao et al. [ZZF22] converted to atomic units. A high temperature is used to skew the distribution to larger distances.

K = 0.015
R0 = 5.150
TEMPERATURE = 1000
BOLTZMANN = sc.value("Boltzmann constant") / sc.value("Hartree energy")
BETA = 1 / (BOLTZMANN * TEMPERATURE)
PROPOSAL_STEP = 0.1


def logprob(r):
    """Calculate the logarithm of the probability."""
    energy = 0.5 * K * (R0**2 / r - R0) ** 2
    return -BETA * energy


def plot_potential_dist():
    plt.close("boltzmann")
    _, (ax1, ax2) = plt.subplots(2, 1, num="boltzmann", sharex=True)
    rgrid = np.linspace(0.5 * R0, 2 * R0, 100)
    ax2.sharex(ax1)
    ax1.plot(rgrid, -logprob(rgrid) / BETA)
    ax1.set_ylabel(r"Potential energy [E$_\text{h}$]")
    ax2.plot(rgrid, np.exp(logprob(rgrid)))
    ax2.set_ylabel("Boltzmann factor\n(unnormalized probability density)")
    ax2.set_xlabel("Internuclear distance [a$_0$]")


plot_potential_dist()
../_images/5b669b88f35b662696a2e457f4dc331a589e1a8b7a095519cfec5a1d70097eec.svg

The MCMC implementation below is non-standard in the sense that it is vectorized to generate multiple sequences in parallel.

def sample_mcmc_chain(niter, stride, ndim, burnin, seed=42):
    """Sample independent Markov Chains with the Metropolis algorithm.

    Parameters
    ----------
    niter
        The number of MCMC iterations to run.
    stride
        The number of iterations between samples returned,
        i.e. the thinning interval.
    ndim
        The number of independent Markov chains to run.
    burnin
        The number of iterations to discard at the beginning.
    seed
        The random number generator seed.

    Returns
    -------
    result
        A 2D array of shape (ndim, niter // stride) containing the sampled sequences.
    """
    rng = np.random.default_rng(seed)
    result = np.zeros((ndim, niter // stride))
    r_old = np.full(ndim, R0)
    lp_old = logprob(r_old)
    irow = 0
    istep = 0
    while irow < result.shape[1]:
        r_new = r_old + rng.normal(0, PROPOSAL_STEP, ndim)
        lp_new = logprob(r_new)
        accept = lp_new > lp_old
        mask = ~accept
        nrnd = mask.sum()
        if nrnd > 0:
            accept[mask] = rng.uniform(0, 1, nrnd) < np.exp(lp_new[mask] - lp_old[mask])
        r_old[accept] = r_new[accept]
        lp_old[accept] = lp_new[accept]
        if burnin > 0:
            burnin -= 1
            continue
        if istep % stride == 0:
            result[:, irow] = r_new
            irow += 1
        istep += 1
    return result


sequences = sample_mcmc_chain(10240, 5, 50, 200)
print(f"(nseq, nstep) = {sequences.shape}")
mean_mc = sequences.mean()
print(f"Monte Carlo E[r] ≈ {mean_mc:.5f} > R0 = {R0:.5f}")
(nseq, nstep) = (50, 2048)
Monte Carlo E[r] ≈ 5.28666 > R0 = 5.15000

Because of the finite temperature and the anharmonicity of the potential, the average distance is greater than the equilibrium bond length.

# Plot the beginning of a few sequences.
# The atomic unit of length is the Bohr radius, $\mathrm{a}_0$.
def plot_chains():
    plt.close("chains")
    _, ax = plt.subplots(num="chains")
    ax.plot(sequences[0][:500], label="Chain 1")
    ax.plot(sequences[1][:500], label="Chain 2")
    ax.plot(sequences[2][:500], label="Chain 3")
    ax.set_xlabel("Step")
    ax.set_ylabel(r"Bond length [a$_0$]")
    ax.set_title("Markov Chain samples")
    ax.legend()


plot_chains()
../_images/7afb143bb8ddd492bcac8d5e101617dc9df467893159078135282820ffa74ff2.svg

The sequences in the plot are clearly time-correlated. The following cells show how STACIE can be used to compute the uncertainty of this average, taking into account that not all samples are independent due to time correlations.

Uncertainty Quantification

The spectrum is calculated using settings that are appropriate for error estimation. See the Error Estimates section for the justification of the prefactors and include_zero_freq keyword arguments. Since we are analyzing MCMC data, the timestep argument is not specified, corresponding to a dimensionless time step of 1.

# Compute and plot the power spectrum.
spectrum = compute_spectrum(
    sequences,
    prefactors=2.0 / sequences.size,
    include_zero_freq=False,
)

The UnitConfig object contains settings that are reused by most plotting functions. The integral has units of length squared, \(\mathrm{a}_0^2\). (It is the variance of the mean.)

uc = UnitConfig(
    time_fmt=".1f",
    acint_fmt=".1e",
    acint_unit_str=r"a$^2_0$",
)
plt.close("spectrum")
_, ax = plt.subplots(num="spectrum")
plot_spectrum(ax, uc, spectrum, 180)
../_images/9ec8fe379b562ec4fad170fce1a58a536eb80761190d633bc3d3cabbc0872942.svg

From the spectrum, one can already visually estimate the variance of the mean: the limit to zero frequency is about \(6 \times 10^{-5}\,\mathrm{a}_0^2\). By normalizing the spectrum with the total simulation time, the spectrum has the correct unit of length squared. In the following cell, a model is fitted to the spectrum to get a more precise estimate.

result = estimate_acint(spectrum, LorentzModel(), verbose=True, uc=uc)
CUTOFF FREQUENCY SCAN cv2l(125%)
     neff   criterion        fcut
---------  ----------  ----------
     15.0         inf    7.38e-03  (Variance of the correlation time estimate is too large.)
     16.0         inf    7.85e-03  (Variance of the correlation time estimate is too large.)
     17.1         inf    8.36e-03  (Variance of the correlation time estimate is too large.)
     18.2         inf    8.90e-03  (Variance of the correlation time estimate is too large.)
     19.4         inf    9.47e-03  (Variance of the correlation time estimate is too large.)
     20.7         inf    1.01e-02  (Variance of the correlation time estimate is too large.)
     22.0         inf    1.07e-02  (Variance of the correlation time estimate is too large.)
     23.5         inf    1.14e-02  (Variance of the correlation time estimate is too large.)
     25.0         inf    1.22e-02  (Variance of the correlation time estimate is too large.)
     26.7         inf    1.29e-02  (Variance of the correlation time estimate is too large.)
     28.4         inf    1.38e-02  (Variance of the correlation time estimate is too large.)
     30.3         inf    1.47e-02  (Variance of the correlation time estimate is too large.)
     32.3         inf    1.56e-02  (Variance of the correlation time estimate is too large.)
     34.4         inf    1.66e-02  (Variance of the correlation time estimate is too large.)
     36.7         inf    1.77e-02  (Variance of the correlation time estimate is too large.)
     39.1        -4.8    1.88e-02
     41.6        -5.1    2.00e-02
     44.3        -5.5    2.13e-02
     47.2        -6.0    2.27e-02
     50.3        -6.6    2.42e-02
     53.6        -7.1    2.57e-02
     57.1        -7.4    2.74e-02
     60.8        -7.7    2.92e-02
     64.7        -7.9    3.11e-02
     68.9        -8.0    3.31e-02
     73.4        -8.1    3.52e-02
     78.2        -8.2    3.75e-02
     83.3        -8.3    3.99e-02
     88.7        -8.3    4.24e-02
     94.4        -8.3    4.52e-02
    100.5        -8.3    4.81e-02
    107.1        -8.4    5.12e-02
    114.0        -8.6    5.45e-02
    121.4        -8.7    5.80e-02
    129.2        -8.8    6.18e-02
    137.6        -8.9    6.57e-02
    146.5        -8.8    7.00e-02
    156.0        -8.7    7.45e-02
    166.1        -8.5    7.93e-02
    176.8        -8.3    8.44e-02
    188.3        -8.0    8.98e-02
    200.4        -7.9    9.56e-02
    213.4        -7.9    1.02e-01
    227.2        -8.0    1.08e-01
    241.9        -8.2    1.15e-01
    257.5        -8.4    1.23e-01
    274.2        -8.5    1.31e-01
    291.9        -8.5    1.39e-01
    310.7        -8.5    1.48e-01
    330.8        -8.3    1.58e-01
    352.2        -8.2    1.68e-01
    374.9         inf    1.79e-01  (cv2l: Insufficient data after cutoff.)
Scan stopped by cutoff criterion.

INPUT TIME SERIES
    Time step:                     1.0 
    Simulation time:               2048.0 
    Maximum degrees of freedom:    100.0

MAIN RESULTS
    Autocorrelation integral:      6.3e-05 ± 2.5e-06 a$^2_0$
    Integrated correlation time:   12.3 ± 0.5 

SANITY CHECKS (weighted averages over cutoff grid)
    Effective number of points:    164.8 (ideally > 60)
    Regression cost Z-score:       -0.1 (ideally < 2)
    Cutoff criterion Z-score:      1.0 (ideally < 2)

MODEL lorentz(0.1) | CUTOFF CRITERION cv2l(125%)
    Number of parameters:          3
    Average cutoff frequency:      7.87e-02 
    Exponential correlation time:  12.6 ± 0.6 

RECOMMENDED SIMULATION SETTINGS (EXPONENTIAL CORR. TIME)
    Block time:                    4.0 ± 0.3 
    Simulation time:               790.4 ± 4.5

The spectrum is normalized such that the integral of the autocorrelation function is equal to the variance of the mean. Because STACIE estimates errors of the autocorrelation integral, it can thus also estimate errors of errors of means.

The error of the mean and its uncertainty are printed in the following cell.

error_mc = np.sqrt(result.acint)
print(f"Error of the mean = {error_mc:.5f}")
error_of_error_mc = 0.5 * result.acint_std / error_mc
print(f"Uncertainty of the error of the mean = {error_of_error_mc:.5f}")
Error of the mean = 0.00793
Uncertainty of the error of the mean = 0.00016

It is also interesting to visualize the fitted spectrum and some intermediate results.

# Plot of the sampling and fitted model spectrum.
plt.close("fitted")
_, ax = plt.subplots(num="fitted")
plot_fitted_spectrum(ax, uc, result)
../_images/55957d04c2c2312c1fe0422a26485837884425aef72b80a2e773888aa6371bd4.svg

The Lorentz model can clearly explain the spectrum, even well beyond the width of the peak at zero frequency.

# Plot additional intermediate results as a function of the frequency cutoff.
plt.close("extras")
_, axs = plt.subplots(2, 2, num="extras")
plot_extras(axs, uc, result)
../_images/925339bcc06de90aa3feafe0bb75362a4020b70d60b86195858fe1a7a668c109.svg

The extra plots reveal several interesting challenges of the analysis:

  • The cutoff weight (top left panel) remains high up to the highest cutoff frequency considered. If one is only interested in the zero-frequency limit of the spectrum, there is little to be gained by including many data points in the fit at high frequencies, well past the width of the peak at zero frequency. These will not make the autocorrelation integral more precise, but bear the risk of introducing some bias due to underfitting. One may manually impose a maximum frequency cutoff with the fcut_max argument of the estimate_acint() function.

  • The risk for some bias at high cutoff frequencies is also visible in the Z-score associated with the cutoff criterion (green curve in the lower left panel). For higher cutoff frequencies, the Z-score slowly increases to values above 2, where the cutoff weight is still significant.

    The reason for the higher Z-score is that the input time series is not normally distributed, due to the asymmetry of the Kratzer–Fues potential. As a result, the MC chain cannot be described by a Gaussian process, and the uncertainty of the spectrum amplitudes is not exactly Gamma-distributed. You can verify this hypothesis by rerunning this example with TEMPERATURE = 100 and PROPOSAL_STEP = 0.03. This will result in a more symmetric distribution of bond lengths. By lowering the proposal step, the correlation time remains about the same. With these settings, a lower criterion Z-score is obtained at high cutoff frequencies.

Precise Mean With Numerical Quadrature

Because the probability density sampled by the MC chain is one-dimensional, it is feasible to compute the mean using numerical quadrature, which is much more accurate than the Monte Carlo estimate. (For production simulations, Monte Carlo is only advantageous for high-dimensional problems.)

As shown in the code below, the difference between the quadrature and Monte Carlo estimates is on the order of the estimated uncertainty of the MC result.

numer_quad = quad(lambda r: r * np.exp(logprob(r)), 0, 50)[0]
denom_quad = quad(lambda r: np.exp(logprob(r)), 0, 50)[0]
mean_quad = numer_quad / denom_quad
print(f"Quadrature  E[r]   ≈ {mean_quad:8.5f}")
print(f"Monte Carlo E[r]   ≈ {mean_mc:8.5f}")
print(f"|Difference|       = {abs(mean_quad - mean_mc):8.5f}")
print(f"Estimated MC error = {error_mc:8.5f}")
Quadrature  E[r]   ≈  5.28043
Monte Carlo E[r]   ≈  5.28666
|Difference|       =  0.00623
Estimated MC error =  0.00793

Autocorrelation time

The Lorentz model estimates the exponential correlation time [Sok97] from the width of the peak at zero frequency in the spectrum. It may differ from the integrated autocorrelation time. Only if the autocorrelation function is nothing but an exponentially decaying function, both should match.

print("Autocorrelation times:")
print(f"exponential: {result.corrtime_exp:5.2f} ± {result.corrtime_exp_std:5.2f}")
print(f"integrated:  {result.corrtime_int:5.2f} ± {result.corrtime_int_std:5.2f}")
Autocorrelation times:
exponential: 12.58 ±  0.57
integrated:  12.28 ±  0.49

Here, the deviation between the two autocorrelation times falls within the uncertainty of the estimates. This is the expected result, since the Lorentzian model is able to explain the whole spectrum.

Regression Tests

If you are experimenting with this notebook, you can ignore any exceptions below. The tests are only meant to pass for the notebook in its original form.

if abs(mean_mc - 5.28666) > 1e-3:
    raise ValueError(f"Wrong mean_mc: {mean_mc:.5f}")
if abs(error_mc - 0.00794) > 1e-3:
    raise ValueError(f"Wrong error_mc: {error_mc:.5f}")