10. Trajectory Analysis

The analysis of the results is (in the first place) based on the HDF5 trajectory output files. On-line analysis (during the iterative algorithm, without writing data to disk) is also possible. As a final option, trajectory data from other programs may also be converted to the HDF5 format used by Yaff, which can then be post-processed as if it was generated by Yaff.

In all the examples below, the following import statement is implicitly assumed:

import h5py as h5

10.1. Slicing the data

All the analysis routines below have at least the following four optional arguments:

  • start: the first sample to consider for the analysis
  • step: consider only a sample each step iterations.
  • max_sample: consider at most max_sample number of samples.

Some routines also support the following optional argument:

  • end: the last sample to consider for the analysis.

The max_sample option is only possible when step is not specified and the total number of samples (or end) is known. When max_sample is given, the corresponding optimal value for step will be derived from max_sample. Some analysis may not have the max_sample argument, e.g. the spectrum analysis, because the choice of the step size for such analysis is a critical parameter that needs to be set carefully.

See yaff.analysis.utils.get_slice() for the documentation of the generic underlying routine that handles the slices.

10.2. Basic analysis

A few basic analysis routines are provided to quickly check the sanity of an MD simulation:

  • plot_energies makes a plot of the kinetic and the total energy as function of time. For example:

    f = h5.File('output.h5')

    makes a figure energies.png.

  • plot_temperate and plot_pressure are similar, but they plot the temperature and pressure as function of time instead of the energies.

  • plot_temp_dist plots the distribution (both pdf and cdf) of the instantaneous atomic and system temperatures and compares these with the expected analytical result for a constant-temperature ensemble. For example:

    f = h5.File('output.h5')

    makes a figure temp_dist.png

All these functions accept optional arguments to tune their behavior. See yaff.analysis.basic for more analysis routines and more details.

10.3. Advanced analysis

Yaff also includes analysis tools that can extract relevant macroscopic properties from a simulation. These analysis tools require some additional computations that can either be done in a post-processing step, or on-line.

10.3.1. Radial distribution functions (RDFs)

The following example computes and oxygen-oxygen RDF using a previously generated trajectory file:

f = h5.File('output.h5')
select = system.get_indexes('8') # select atoms with ATSELECT rule
rdf = RDF(f, rcut=4.8*angstrom, rspacing=0.1*angstrom, max_sample=100, select0=select)

In this example, the cutoff for the RDF is 4.8 Å and the spacing of the bins 0.1 Å. At most 100 samples are used to compute the RDF. The results are included in the HDF5 file, and optionally plotted using matplotlib. Alternatively, the same RDFAnalysis class can be used for on-line analysis, without the need to store huge amounts of data on disk:

select = system.get_indexes('O')
rdf = RDF(None, 4.8*angstrom, 0.1*angstrom, max_sample=100, select0=select)
verlet = VerletIntegrator(ff, hooks=rdf, temp0=300)

One may also construct RDFs based on two sets of particles, using two select* arguments. The following example computes the inter-particle RDF for the oxygen and hydrogen atoms, using a previously computed trajectory file:

f = h5.File('output.h5')
select0 = system.get_indexes('8') # select atoms with ATSELECT rule
select1 = system.get_indexes('1')
rdf = RDF(f, rcut=4.8*angstrom, rspacing=0.1*angstrom, max_sample=100,
          select0=select0, select1=select1)

The reference documentation of the rdf module (yaff.analysis.rdf) describes also the more exotic RDF options, e.g. to exclude intra-molecular distances from the RDF.

10.3.2. Vibrational spectra and correlation times

The (efficient) computation of vibrational spectra and autocorrelation functions are so closely related that they are carried out simultaneously. A basic vibrational spectrum (by default based on the atomic velocities) is done as follows:

spectrum = Spectrum(f, bsize=512)

The bsize argument determines the size of the blocks used for the spectral analysis. The trajectory is cut into blocks of the given size. For each block, the spectrum is computed, and then averaged over all blocks. The plot method makes a figure of the spectrum. The plot_ac method makes a figure of the corresponding autocorrelation function. All the results are also available as attributes of the spectrum object. Similar to the RDF analysis, the spectrum can be computed both on-line and off-line. One can also estimate the IR spectrum as follows:

spectrum = Spectrum(f, bsize=512, path='trajectory/dipole_vel', key='ir')

More details can be found in the reference documention: yaff.analysis.spectrum.

10.3.3. Diffusion constants

The diffusion constant (based on the displacements of all nuclei) is computed as follows:

diff = Diffusion(f, step=10, mult=5)
print diff.A

The positions are subsampled with step size of 10. The mult argument controls between which time intervals are used to compute the mean-square displacements (MSDs) used to carry out a fit of the Einstein diffusion equation. When set to 5, the MSDs are computed at intervals of 10, 20, 30, 40 and 50 steps. As soon as the analysis is ready, a straight line is fitted through the MSD versus time data. The slope (diff.A) of this line corresponds to the diffusion constant. A plot of these data and the fitted line can be made with diff.plot().

One may restrict the analysis to just a selection of atoms as follows:

diff = Diffusion(f, step=10, mult=5, select=[0,4,5])

See yaff.analysis.diffusion for more details.

TODO. Explain (and double check) how to compute diffusion constants of multiple molecules instead of just atoms.

10.3.4. Computing errors on thermodynamic averages

The block average method can be used to compute the error on time-dependent data that have some degree of auto-correlation. The following example computes the error on the average temperature from an MD simulation, using a HDF5 trajectory file.

f = h5.file('output.h5')
error, sinef = blav(f['trajectory/temperature'])

Two values are returned. error is the error on the average. sinef is the statistical sampling inefficiency. More details about this routine are given here: yaff.sampling.blav.

10.4. Post-processing external trajectory data

One may also use the analysis module of Yaff to process trajectories generated with other molecular simulation codes. The following example script (see data/examples/002_external_trajectory/rdf.py) converts the xyz file to HDF5 data, which is then used to perform the analysis

# Needed for python2 backward compatibility
from __future__ import print_function

# Import Yaff and Numpy libraries and setup a reference system. The reference
# system is used to recognize atom types etc.
import numpy as np
from yaff import *
system = System.from_file('trajectory.xyz', rvecs=np.diag([20.3, 20.3, 20.3])*angstrom)

# Create a HDF5 file and convert the XYZ file to arrays in the HDF5 file
import h5py as h5
with  h5.File('trajectory.h5', mode='w') as f:
    xyz_to_hdf5(f, 'trajectory.xyz')

    # Select two lists of atom indexes based on the ATSELECT rules '1' and '8'
    select0 = system.get_indexes('1')
    select1 = system.get_indexes('8')

    # Note. The remainder of the example may be moved to a separate script if
    # that would be more convenient, e.g. in case different RDFs must be generated.
    # This would avoid repetetive conversion of the XYZ file.

    # Create the RDF.
    rdf = RDF(10*angstrom, 0.1*angstrom, f, max_sample=100, select0=select0, select1=select1)
    # One may make plots with the rdf object ...
    # ... or access the results as Numpy arrays
    print('RDF DATA FOR THE X-AXIS [A]')
    print('RDF DATA FOR THE Y-AXIS')

See yaff.conversion for more routines to convert trajectory data.