Uncertainty of the Mean of Time-Correlated Data¶

When data exhibits time correlations, the error of the average cannot be computed by assuming that all data are statistically independent. Because of time correlations, there are fewer independent values than the number of elements in the data.

Quantifying the uncertainty of averages over time-correlated data is discussed in several textbooks, e.g., Appendix D of “Understanding Molecular Simulation” by Frenkel and Smit [FS02], or Section 8.4 in the book “Computer Simulation of Liquids” (second edition) by Allen and Tildesley [AT17].

Derivation¶

The sample mean of the time-dependent sequence \(\hat{\mathbf{x}}\) is:

\[ \hat{x}_\text{av} = \frac{1}{N} \sum_{n=0}^{N-1} \hat{x}_n \]

The variance of this sample mean is:

\[ \var[\hat{x}_\text{av}] = \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \cov[\hat{x}_n \,,\, \hat{x}_m] \]

We assume that the sequence is drawn from a stationary process, such that the covariance depends only on \(\Delta = n-m\):

\[ c_\Delta = \cov[\hat{x}_n \,,\, \hat{x}_m] \]

leading to:

\[ \var[\hat{x}_\text{av}] = \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{\Delta=-n}^{N-1-n} \cov[\hat{x}_n \,,\, \hat{x}_{n+\Delta}] \]

To simplify this expression, we must further assume that the second summation can be extended from \(\Delta=-\infty\) to \(\Delta=+\infty\). This approximation is acceptable if the correlation time of the sequence is small compared to \(N\). (In other words, we assume that \(c_\Delta\) decays to zero in a small number of steps compared to \(N\).) With this assumption, we find:

\[ \var[\hat{x}_\text{av}] \approx \frac{1}{N} \sum_{\Delta=-\infty}^{+\infty} c_\Delta \]

Analogously, when the sample mean is defined over a continuous function:

\[ \hat{x}_\text{av} = \frac{1}{T} \int_0^T \hat{x}(t)\,\mathrm{d} t \]

the variance of this sample mean is:

\[ \var[\hat{x}_\text{av}] = \frac{1}{T} \int_{-\infty}^\infty c(\Delta_t)\,\mathrm{d} \Delta_t \]

How to Compute with STACIE?¶

Because no factor \(1/2\) is present in the expression for the variance of the mean, the factor \(F\) must compensate for the factor \(1/2\) in the autocorrelation integral. Hence, we must use \(F=2\).

It is assumed that you can load the time-dependent sequences into a 2D NumPy array, where each row is a sequence and each column a time step. If you have a physical time step (in some unit of time), it is recommended that you use it as shown below, as it will result in more meaningful plots and time scales. If not available, you can set timestep=1 or remove it from the script altogether.

from stacie import compute_spectrum, estimate_acint, plot_results, PadeModel

# Load your sequences and the time step.
# The details depend on your use case.
sequences, timestep = ...

# The sequences must be an array with shape (nseq, nstep).
# Each row represents one time-dependent sequence with length nstep.
# Get the total simulation time (sum over all sequences)
total_time = timestep * sequences.size

# The factor 2 is just compensating for the factor 1/2 in the autocorrelation integral.
spectrum = compute_spectrum(
        sequences,
        prefactors=2.0 / total_time,
        timestep=timestep,
        include_zero_freq=False,
)
result = estimate_acint(spectrum, PadeModel([0, 2], [2]))
print("The mean", sequences.mean())
print("Error of the mean", np.sqrt(result.acint))
plot_results("error.pdf", result)

The spectrum at zero frequency must be excluded because it contains contributions from the mean, i.e., not only from the autocorrelation integral.

The Pade model is used here because it is nearly always a good choice for error estimates. However, if the data does not feature an exponential decay of the ACF, this model may not be appropriate. In such cases, you can use the ExpPolyModel instead. For more details, see the section on spectrum models.

A worked example can be found in the notebook the error of the mean of a sequence generated by a Metropolis Monte Carlo algorithm.