sampling
In the Born-Oppenheimer philosophy, we explore the phase space of a molecule or a material and generate samples using molecular dynamics simulations. Those samples are then used to evaluate time averages of some property of interest in order to predict physical observables. In psiflow, such simulations are executed within i-PI, a versatile and efficient code which supports an impressive number of features. We mention the most important ones below
- molecular dynamics in various ensembles: most notably NVE, NVT, and fully anisotropic NPT. There exist a variety of thermostat and barostat options, the default being Langevin. Together with the ability to combine arbitrary hamiltonians, this includes biased molecular dynamics simulations using e.g. harmonic restraints (umbrella sampling).
- path-integral molecular dynamics (PIMD): allows for the simulation of the quantum behavior of light atomic nuclei. This is important for many systems involving hydrogen atoms at relatively low temperatures (<=room temperature). Importantly these simulations can also be performed in any of the aforementioned ensembles.
- geometry optimizations: i-PI can be used to optimize the geometry of a molecule or a material using a variety of optimization algorithms.
- replica exchange (parallel tempering): dramatically improves the sampling efficiency and ergodicity whenever nontrivial free energy barriers are present in the phase space of the system. In this approach, one considers replicas of the system at various temperatures and/or pressures, or optionally even with different hamiltonians.
- multiple walker metadynamics: simple but powerful method to overcome free energy barriers when a suitable collective variable is known for the system of interest.
the Walker
class
Psiflow is essentially a convenient wrapper around most of i-PI's features.
The key object which enables the execution of these simulations is the Walker
class.
A single walker describes a single replica of the system which evolves through phase space.
It is initialized with a Geometry
instance which describes the start of the simulation.
Optional arguments include a hamiltonian for the walker (which is used to evaluate forces
during simulations), the temperature and pressure, or a timestep.
from psiflow.sampling import Walker
from psiflow.geometry import Geometry
from psiflow.hamiltonians import MACEHamiltonian
start = Geometry.load("start.xyz")
walker = Walker(
start,
hamiltonian=MACEHamiltonian.mace_mp0(),
temperature=300.0,
pressure=None, # NVT simulation
timestep=0.5, # in femtoseconds, the default value
)
generating trajectories
Walkers can be propagated in time by using the sample
function.
It accepts a list of walkers, each of which will be propagated in phase space according to its own parameters.
Importantly, there are no restrictions on the type of walkers in this list.
Users can mix regular NVE walkers, with path-integral NPT walkers, and a list of N replica exchange walkers.
Internally, psiflow will recognize which walkers are independent and parallelize the execution as much as possible.
Consider the following example:
from psiflow.sampling import sample
outputs = sample(
walkers,
steps=1e6, # total number of timesteps to run the simulation for; this translates to 500 ps in this case
step=1e3, # sample every 1000 timesteps
start=1e5, # start sampling after 50 ps
)
print(outputs) # list of `SimulationOutput` instances
Output
instances, each of which contains the trajectory of a single walker,
as obtained by performing molecular dynamics simulations in the corresponding ensemble and with the provided total number of steps.
Note that each of these simulations is independent (there exists no coupling between them), and as such, they will be executed in parallel as much as possible.
The outputs are ordered in the same way as the input walkers (i.e. outputs[0]
corresponds to the output from walkers[0]
).
They provide access to the sampled trajectory of the simulation, the elapsed simulation time, and importantly, a number of observable properties
which have been written out by i-PI. These properties can be used to compute averages of physical observables, such as the internal energy or the virial stress tensor.
A full list of available properties is given in the i-PI documentation
(note that psiflow adheres to the same naming convention as adopted in i-PI).
The following options are kept track of by default:
energy
: the total energy of the system. The actual name of this quantity ispotential{electronvolt}
temperature
: the instantaneous temperature of the system. The actual name of this quantity istemperature{kelvin}
time
: the elapsed simulation time. The actual name of this quantity istime{picosecond}
volume
: the volume of the simulation cell (only for periodic systems). The actual name of this quantity isvolume{angstrom3}
Similarly to the evaluation of a Hamiltonian
or the querying of a snapshot in a Dataset
, simulation outputs are returned as futures.
For example, say we wanted to compute the average energy for each of the simulations:
import numpy as np
energy_futures = [output["potential{electronvolt}"] for output in outputs]
energies = [future.result() for future in energy_futures]
mean_energy = np.array([np.mean(energy) for energy in energies])
result()
), and then computes the mean energy for each simulation. In a very similar fashion, we can compute the bulk modules of bcc iron simply by constructing walkers at various pressures and extracting the corresponding volume{angstrom3}
observable -- see here.
In many cases, it is useful to save the trajectory of a simulation to disk.
Trajectories are essentially just a series of snapshots, and as such, psiflow represents them as Dataset
instances.
Each of the outputs has an attribute trajectory
which is a Dataset
instance.
Let us save the trajectory of the first simulation to disk:
As a sanity check, let us recompute the potential energies which were stored at each snapshot during the simulation using the compute
functionality of our MACE hamiltonian:
mace = walkers[0].hamiltonian # the hamiltonian used in the simulations
future = mace.compute(outputs[0].trajectory, 'energy') # future of the recomputed energies as an array
manual_energies_0 = future.result() # get the actual numpy array
assert np.allclose(
manual_energies_0,
energies[0],
)
step
argument is provided, which determines
the saving frequency for both the output properties (such as energy or temperature) as
well as for the trajectory.
In some cases, one wishes to save output properties without actually saving the
trajectory because it would end up consuming too much disk space.
For this use case, it is possible to pass keep_trajectory=False
as an additional
argument to the sample function.
A full list of possible arguments is given below:
- walkers (type
list[Walker]
): the walkers which should be propagated. Internally, they are partitioned into groups such that e.g. replica-exchange-coupled walkers are propagated at the same time (and on the same node) in order to allow for communication between them. - steps (type
int
): the total number of steps to perform in each of the simulations. - step (type
int
): saving frequency of both output properties as well as the trajectory. - start (type
int
): properties and snapshots will not be saved beforestart
steps have passed -- used for equilibration. - keep_trajectory (type
bool
): determines whether or not to save the trajectory. - max_force (type
float
, in eV/A): determines the maximum value of any of the forces in the walker before a simulation is interrupted. Large forces are typically a result of an unphysical region in phase space and/or a badly-trained ML potential. - observables (type
list[str]
): physical properties which are to be tracked and saved during the simulation. An exhaustive list of all options is given in the i-PI documentation. Four quantities are saved by default: energy, temperature, time, and unit cell volume. - fix_com (type
bool
): whether or not to fix the center of mass. Usually a good idea -- default isTrue
. - prng_seed (type
int
): default RNG seed for i-PI, in order to be able to reproduce - use_unique_seeds (type
bool
): use a unique seed for each of the walkers, starting at the givenprng_seed
for walker 0,prng_seed + 1
for walker 1, etc. - checkpoint_step (type
int
): how frequently to save checkpoints. Checkpoints are used at the end of a simulation in order to update thestate
attribute of the walker. If the simulation is (gracefully) terminated because of e.g. a walltime limit on the SLURM node, then the state of the walker will be determined by the last saved checkpoint. - verbosity (type
str
): verbosity of the i-PI output files;low
,medium
, orhigh
. - motion_defaults (type
ET.Element
): TODO: additional thermostat and barostat options
randomize and quench
If molecular simulation would rigorously satisfy the ergodic hypothesis, then the starting structure of a simulation is entirely irrelevant and will not affect the sampling distribution. In practice, however, the ergodic hypothesis is almost always violated due to a number of reasons, and the starting structure can have a significant impact on the subsequent simulation.
If the starting structure is too strongly out of equilibrium for the specific hamiltonian of the walker (i.e. the initial energy and forces are too large), the simulation will likely explode in the first few steps and it will practically never return physically relevant samples.
Because each walker can have its own hamiltonian, a given geometry might be an entirely infeasible start for one walker, but be reasonable for another.
Psiflow provides a simple and efficient method to assign reasonable starting structures to a collection of walkers based on a large set of possible candidates: the quench
method.
It accepts a list of Walker
instances, and a Dataset
of possible candidate structures.
It will automatically evaluate the hamiltonians of the walkers over the dataset in an efficient manner, and assign the structure with the lowest energy to each walker.
By applying the quench
method, psiflow reassigns walker starting geometries to the lowest energy structures as obtained from a dataset of choice:
from psiflow.sampling import quench
# assume 'walkers' is a list of `Walker` instances with badly initialized geometries
data = Dataset.load("my_large_dataset_of_candidates.xyz")
quench(walkers, data)
relaxed = Dataset([w.hamiltonian.evaluate(w.start) for w in walkers])
energies = relaxed.get('energy').result() # energies are now much lower
In an alternative setting, it might be useful to randomize the starting structure for each walkers from a given dataset (e.g. to improve sampling coverage). This can be achieved using the randomize
method:
advanced options
The real power of i-PI lies in its extensive support for advanced sampling algorithms. This section discusses the most important ones which are currently supported within psiflow.
PIMD simulations
The Born-Oppenheimer approximation breaks down for light nuclei (most notably hydrogen), and by adapting traditional molecular dynamics algorithms, it is possible to account for the quantum mechanical (wave-like) nature of protons during simulations. These techniques are referred to as path-integral methods and represent the core business of i-PI. In practice, it comes down to running N replicas of the same simulation in parallel, in which consecutive replicas are coupled by harmonic spring forces. As such, the cost of these simulations is N times higher than a classical simulation of the same system, and in many cases also the time step needs to be chosen smaller (e.g. 0.25 fs instead of 0.5 fs).
Performing PIMD simulations in psiflow is pretty much identical to 'normal' simulations;
walkers can be given an nbeads
attribute which determines the number of replicas. Its
default value is 1, meaning purely classical MD.
pimd_walker = Walker(
geometry,
hamiltonian,
temperature=200,
pressure=0.1, # PIMD is compatible with NVE, NVT, and even NPT
nbeads=32, # run 32 replicas in parallel
timestep=0.4, # in fs
)
classical_walker = Walker(
geometry,
hamiltonian,
temperature=600,
pressure=None,
nbeads=1, # the default value
timestep=1, # in fs
)
# simulate two walkers:
# the first will propagate 32 individual replicas, the second only one
outputs = sample(
[pimd_walker, classical_walker],
steps=10000,
)
start
and state
attributes which track the current and
start position of the walker only keep track of one replica.
replica exchange
Replica exchange (otherwise called parallel tempering) is a technique used to improve the ergodicity of sampling algorithms. It consists of simulating multiple replicas of the same system in parallel but at different thermodynamic conditions. This implies different temperatures, different pressures, or even different hamiltonians. During the simulation, Monte Carlo trial moves are attempted which try to swap atomic geometries between to replicas. If trial moves are rejected according to the specific acceptance criterion, it can be shown that replica exchange does not alter the theoretical sampling distribution of each of the replicas. The only thing it does is make the sampling less dependent on the initial starting configuration.
Replica exchange can be enabled between walkers of the same ensemble (all NVE, all NVT, or
all NPT) simply by calling the replica_exchange
function. The i-PI implementation
which psiflow uses requests two additional parameters; see the
documentation and the
associated paper for more details.
from psiflow.sampling import replica_exchange
walkers = []
for temperature, pressure in zip([300, 350, 400], [-10, 0, 10]):
walker = Walker(
start=geometry,
hamiltonian=hamiltonian,
temperature=temperature,
pressure=pressure,
)
walkers.append(walker)
# add replica exchange to walkers; modifies them in-place
replica_exchange(
walkers,
trial_frequency=100, # trial frequency, see i-PI docs
rescale_kinetic=True, # usually a good idea, see i-PI docs
)
output = sample(walkers, steps=10000, step=100)
coupled simulations run single-node
PIMD and replica exchange simulations involve N parallel replicas of the same system which are dependent on each other for their time propagation. As such, they are scheduled by psiflow in a way that guarantees that they run simultaneously and that they can communicate with each other without too much hassle. At the moment, this implies that all replicas need to be executed on a single node. If you have powerful nodes with four or eight GPUs and relatively lightweight MACE models, this is usually not a problem. For more large-scale applications, some manual hacking will be necessary. Open an issue on Github if you need help on this.
metadynamics
Metadynamics (and its variants) are an extremely common form of enhanced sampling and
phase space exploration, and i-PI has native support for these using PLUMED.
Similarly to how time-independent bias contributions are created simply based on their
input string, psiflow implements metadynamics using a Metadynamics
object which is
constructed based on a plumed input file.
For the proton jump example in vinyl alcohol, an equivalent metadynamics input would look like this:
from psiflow.sampling import Metadynamics
plumed_str = """UNITS LENGTH=A ENERGY=kj/mol
d_C: DISTANCE ATOMS=3,5
d_O: DISTANCE ATOMS=1,5
CV: COMBINE ARG=d_C,d_O COEFFICIENTS=1,-1 PERIODIC=NO
METAD ARG=CV PACE=10 SIGMA=0.1 HEIGHT=5
"""
metadynamics = Metadynamics(plumed_str)
walker = Walker(geometry, hamiltonian, temperature=300, metadynamics=metadynamics)
outputs = sample([walker], steps=10000, step=100)
# added hills during simulation are tracked as a parsl DataFuture
hills_file = metadynamics.external # 10,000 / 10 hills
The corresponding hills file which keeps track of the trajectory of the simulation in CV
space (as well as a list of the Gaussian hills) is stored in the metadynamics.external
attribute. Since this is, like any observable or trajectory, the outcome of a simulation,
psiflow uses futures to represent it. In this case, the hills file is therefore a future
-- a DataFuture
to be precise. It can be read and analyzed much like a regular hills
file, except that we need to make sure to call .result()
in order to wait until the
simulation has completed.
metadynamics.external.result() # call result() to wait until simulation has finished
# read it
with open(metadynamics.external.filepath, 'r') as f:
content = f.read()
print(content)
geometry optimizations
Aside from molecular dynamics, it is often very useful to use gradient-based optimization
on the potential energy surface to 'descend' into a (most likely local) energy minimum.
While this could theoretically be considered as a kind of zero-kelvin walker, geometry
optimization is implemented in a separate module, and does not require any walkers.
Instead, psiflow exposes a single optimize
function which requires the following
arguments:
- state (type
Geometry
): the initial state of the system. - hamiltonian (type
Hamiltonian
): the hamiltonian defines the PES to be optimized - steps (type
int
): the maximum number of optimization steps. - keep_trajectory (type
bool
): whether to keep the optimization trajectory - mode (type
str
): optimization algorithm, see i-PI documentation for details. - etol (type
float
): convergence tolerance on the energy (eV) - ptol (type
float
): convergence tolerance on the positions (angstrom) - ftol (type
float
): convergence tolerance on the forces (eV/angstrom)
The function returns a Future which represents the minimum-energy geometry:
from psiflow.sampling import optimize
minimum_geometry = optimize(
geometry, # starting structure
mace, # use MACE PES; will use float64 precision
ftol=1e-4, # sufficiently tight!
) # returns a future
energy = mace.compute(geometry, 'energy')
minimum_energy = mace.compute(minimum_geometry, 'energy')
A common scenario is to find the global minimum of the potential energy by performing
multiple optimizations with different starting configurations (which can be obtained using
e.g. high temperature MD).
While this is technically straightforward to implement via a custom
python_app
,
psiflow provides a convenient helper function for precisely this purpose:
optimize_dataset
.
It accepts the same arguments as optimize
except that the starting geometry should
become a Dataset
of geometries that represent the starting configuration for the
optimizations. Correspondingly, it returns Dataset
of minimized geometries: