Working with molecules

Introduction

Molecule objects

The class molmod.molecules.Molecule is a fundamental part of the MolMod package. Instances of this class hold all information to describe the geometry and the topology of a molecular structure. In practice the following attributes can be stored in a Molecule object:

  • numbers – An array of atomic numbers with N elements. This is the only mandatory attribute. All other attributes are optional.
  • coordinates – An array with atomic coordinates (in Bohr) with shape (N,3).
  • title – A short description of the molecule.
  • masses – An array with atomic masses (in \(m_e\))
  • graph – A MolecularGraph instance describing the connectivity of the atoms.
  • symbols – A tuple with atomic symbols. These may be the names of the elements or some force-field atom types.
  • unit_cell – A UnitCell instance describing the periodic boundary conditions.

One can create Molecule instances based on several conventional file formats as follows:

mol = Molecule.from_file("foo.xyz")

The extension of the file is used to detect the file format. The following formats are currently supported: *.cml, *.fchk, *.pdb, *.sdf, *.xyz. It is relatively trivial to implement new file formats, but there is no real urgent need for that because the babel program can convert most files into one of these formats.

A Molecule object can be written to file as follows:

mol.write_to_file("bar.xyz")

The currently supported formats for writing files are: *.xyz, *.cml

Derived quantities

Some derived quantities are accessible as attributes. The contents of these attributes are only computed when they are accessed for the first time. Once a derived quantity is computed it is stored internally (in the molecule object) in case it would be requested a second time. This mechanism is called caching. It imposes good scripting practices when using the MolMod package and also improves the efficiency of the scripts. These are a few convenient derived quantities:

  • chemical_formula – A string containing the chemical formula.
  • com – The center of mass of the molecule.
  • distance_matrix – A square matrix with all inter-atomic distances in a molecule (ignoring periodic boundary conditions for the moment).
  • inertia_tensor – The inertia tensor of the molecule computed with the center of mass as the origin.
  • mass – The total mass of the system.
  • size – The number of atoms.

An up-to-date list can be found in the reference documentation: molmod.molecules.Molecule. Most derived quantities depend on optional attributes. If these optional attributes are not present, an error is raised when a derived quantity is requested.

Modifying molecule objects

The caching mechanism has one important side-effect. When the underlying data of the derived quantities change, the cached results are no longer correct. This problem is called cache inconsistency. There are typically two ways to impose cache consistency: (i) all underlying quantities are kept constant, or (ii) cached results are deleted when the underlying data change. The first mechanism is used throughout the MolMod package. When you need to modify a Molecule object, just create a new one with modified attributes. This is facilitated by the copy_with method:

mol1 = Molecule.from_file("foo.xyz")
mol2 = mol1.copy_with(title="bar")

Any of the above attributes can be modified through copy_with. Cached derived quantities are never discarded with this approach.

The concept of derived quantities and read-only attributes is extensively used in the following classes:

Examples

Conversion of files

Although the conversion of files with molecular systems from one format to the other is easily done with the babel program, the example is still useful for didactic purposes.

File: molmod/examples/001_molecules/a_convert.py

#!/usr/bin/env python

from molmod import *

mol = Molecule.from_file("ibuprofen.sdf")
mol.write_to_file("ibuprofen.xyz")

Center of mass to origin

This example shows how a modified molecule object is made.

File: molmod/examples/001_molecules/b_com.py

#!/usr/bin/env python

from molmod import *

# 0) Load the molecule from file
mol1 = Molecule.from_file("ibuprofen.sdf")

# 1) The atomic masses are not included in the sdf file. We need them below to
# compute the center of mass, so we assign standard masses.
mol1.set_default_masses()
print(mol1.masses/amu)

# 2) Create modified coordinates. The attribute coordinates is an N by 3 array,
# while the com attribute is an array with three components. The following
# subtraction is performed elementwise at the level of the rows. Each row
# of the coordinates array corresponds to one atom. From each atomic coordinate
# the center of mass is subtracted to construct a new coordinates array.
new_coordinates = mol1.coordinates - mol1.com

# 3) Make a copy of mol1 with updated coordinates and write it to a file.
mol2 = mol1.copy_with(coordinates=new_coordinates)
mol2.write_to_file("ibuprofen_com.xyz")
print("Written file ibuprofen_com.xyz")

All carbons

The following program prints the X-, Y- and Z-coordinates of all Carbon atoms in the ibuprofen molecule.

File: molmod/examples/001_molecules/c_carbon.py

#!/usr/bin/env python

from molmod import *

# 0) Load the molecule
mol = Molecule.from_file("ibuprofen.sdf")

# 1) Print out all carbon positions
for i in range(mol.size):
    if mol.numbers[i] == 6:
        print(mol.coordinates[i]/angstrom)

Size of a molecule

One can define the size of a molecule in several ways. The following program prints out the largest inter-atomic distance in Ångström.

File: molmod/examples/001_molecules/d_size.py

#!/usr/bin/env python

from molmod import *

# 0) Load the molecule.
mol = Molecule.from_file("ibuprofen.sdf")

# 1) Print the largest element in the distance matrix.
print("Largest interatomic distance [A]:")
print(mol.distance_matrix.max()/angstrom)
# Some comments:
#  - One can just write code that assumes the attribute distance_matrix is
#    present, but in fact it is only computed once the first time it is used.
#  - The method max is a method of the numpy array. It returns the largest value
#    in an array.

Shape of a molecule

One can define the shape of a molecule based on the covariance of the atomic coordinates with respect to their geometric center. Some shape categories were introduced by Zingg. The following program computes the shape category for the ibuprofen molecule

File: molmod/examples/001_molecules/e_shape.py

#!/usr/bin/env python

from __future__ import print_function

from molmod import *

from numpy import dot, sqrt
from numpy.linalg import eigvalsh


# 0) Load the molecule.
mol = Molecule.from_file("ibuprofen.sdf")

# 1) Compute the arithmetic center of the atomic coordinates.
center = mol.coordinates.mean(axis=0)
# Without the axis=0 argument, the average over all X, Y and Z components would
# be computed. Now it just computes the averages for each column. axis=1 would
# refer to averaging over rows.

# 2) Move the arithmetic center of the coordinates to the origin. The same
# comments from the b_com.py apply.
centered = mol.coordinates - mol.coordinates.mean(axis=0)

# 3) Compute the covariance matrix of the centered coordinates.
covar = dot(centered.transpose(), centered)/mol.size

# 4) Compute the eigenvalues of the symmetric covariance matrix.
evals = eigvalsh(covar)

# 5) The spread along the three eigenvectors is computed as the standard deviation
# of the atomic coordinates along that direction:
c, b, a = sqrt(evals)
print("Spread along the long axis [A]:", a/angstrom)
print("Spread along the intermediate axis [A]:", b/angstrom)
print("Spread along the short axis [A]:", c/angstrom)

# 6) Test in which category this shape belongs. The factor R is set to 1.5.
R = 1.5
if b < R*c:
    if a < R*b:
        shape = 'equant' # sphere-like
    else:
        shape = 'prolate' # sigar-like
else:
    if a < R*b:
        shape = 'oblate' # pancake-like
    else:
        shape = 'bladed' # keyboard-like
print("The shape category:", shape)

Problems

  • Write a program that makes a list of all (unique) pairs of Carbon atoms that have an inter-atomic distance below 2.0 Ångström.

  • Try to find for each shape category (equant, prolate, oblate and bladed) some molecules from PubChem. Just download a bunch of SDF files and use the glob module from Python to loop over all these files:

    from glob import glob
    for fn in glob("*.sdf"):
        mol = Molecule.from_file(fn)
        # do something here
    

    If you feel lucky, extend to program so that it downloads the first 100 molecules from the PubChem database.