Running a Refinement (in detail)

This tutorial walks through running a refinement in more detail than the Argon a-to-z tutorial. The tutorials named in brackets below give more detail on creating the simulation setup used here.

By the end of this tutorial, a reader familiar with MD simulation should be capable of applying MDMC refinements to their own datasets and simulations.

In order to refine potential parameters the following is required:

A Universe which contains one or more atoms. (Building a Universe)
A Simulation which runs using the Universe. (Running a Simulation)
One or more Parameters to refine. (Selecting fitting Parameters)
One or more experimental datasets.

If you would like more detail on these, please see the tutorials in brackets; the actual setup of the simulation will be done quite concisely here.

[ ]:

import os
# This setting tells MDMC how many OpenMP threads (typically one per physical core
# on your computer) it should use for simulation and refinement calculations.
os.environ["OMP_NUM_THREADS"] = "4"

Experimental datasets

MDMC’s support for experimental datasets requires two things: the relevant Observable (e.g. dynamic structure factor \(S(Q,w)\)) must be available, and a Reader for the specific file format of the data must be implemented (e.g. the LAMPSQw Reader, which reads the LAMP output file format for dynamic structure factors). If either is missing for a dataset you’d like to use, please file an issue in the MDMC GitHub repository or contact support@mdmcproject.org.

Each experimental dataset must be specified by a Python dictionary containing the following:

file_name : A string with the file name of the experimental data
type : A string specifying the type of Observable which describes the data (e.g. SQw if the data is the dynamic structure factor, PDF if the data is the pair distribution function).
reader : A string specifying the name of the ObservableReader which will be used to read the file.
weight : A float which determines the relative importance weighting of this dataset to other datasets, if multiple datasets are being refined at the same time.
rescale_factor, optional : A float which rescales the experimental data linearly in order to match the absolute scale of the simulated data when calculating the Figure of Merit (FoM). Default is 1., i.e. no change to the data.
auto_scale, optional : A boolean specifying whether to automatically set the rescale_factor at each step of the refinement to the value which minimizes the FoM. Default is False.
use_FFT, optional : A boolean specifying whether to use the Fast Fourier Transform (FFT) in the calculation of variables from MD. FFTs are faster but can impose restrictions on the uniformity of variables. Default is True.
resolution : A single-entry dict of the form {'file' : str} or {key : float}. If {'file' : str}, the str should be the name of a file, in the same format as file_name, containing the results of a vanadium sample, which is used to determine the instrument energy resolution for this dataset. If {key : float}, the key should be the function type of the numeric resolution (e.g. 'gaussian' or 'lorentzian') and the float the FWHM of that function.
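As a rough illustration of the structure listed above, the following sketch checks a dataset dictionary before it is handed to a refinement. The helper check_dataset is hypothetical and not part of MDMC; it also assumes that every key not marked optional in the list (including resolution) is required.

```python
# Keys taken from the list above; treating 'resolution' as required is an assumption
REQUIRED_KEYS = {'file_name', 'type', 'reader', 'weight', 'resolution'}
OPTIONAL_KEYS = {'rescale_factor', 'auto_scale', 'use_FFT'}


def check_dataset(dataset):
    """Return a list of problems found in a dataset dictionary (empty if none)."""
    problems = []
    keys = set(dataset)
    for key in sorted(REQUIRED_KEYS - keys):
        problems.append(f"missing required key: {key!r}")
    for key in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unrecognised key: {key!r}")
    resolution = dataset.get('resolution')
    if resolution is not None:
        # The resolution should be a dict with exactly one key/value pair
        if not isinstance(resolution, dict) or len(resolution) != 1:
            problems.append("'resolution' should be a dict with exactly one entry")
    return problems


example = {'file_name': 'data/263K05Awat_LAMP',
           'type': 'SQw',
           'reader': 'LAMPSQw',
           'weight': 1.,
           'resolution': {'file': 'data/262p7K0A5van_LAMP'}}
print(check_dataset(example))
print(check_dataset({'file_name': 'data/263K05Awat_LAMP', 'wieght': 1.}))
```

A check of this kind catches misspelt keys (such as 'wieght' in the second call) before any simulation time is spent.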

For this tutorial we are using Johan Qvist et al (2011)’s data for supercooled water.

[ ]:

# Dataset from: Johan Qvist et al, J. Chem. Phys. 134, 144508 (2011)
QENS = {'file_name':'data/263K05Awat_LAMP',
        'type':'SQw',
        'reader':'LAMPSQw',
        'weight':1.,
        'auto_scale':True,
        'use_FFT':False,
        'resolution':{'file': 'data/262p7K0A5van_LAMP'}}

As the dataset we are using has both negative energy values and non-uniform spacing, the default behaviour of use_FFT=True would cause the data to be interpolated as part of the refinement process. Setting this to False allows us to preserve the original energy values.

The default behaviour of the scaling (rescale_factor=1., auto_scale=False) assumes that the dataset provided has been properly scaled and normalised for the refinement process. This is the preferred way of using MDMC, and arbitrary or automatic rescaling should be undertaken with care. For example, using auto_scale to determine the scaling does not take into account any physical aspects of scaling the data, such as the presence or absence of background events from peaks outside its range.
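To make concrete what "the value which minimizes the FoM" can mean, here is a minimal NumPy sketch. It assumes a least-squares FoM weighted by the experimental errors, that the rescale factor multiplies only the experimental data, and that the errors are left unchanged; MDMC's own implementation may differ. The function name optimal_rescale_factor is hypothetical.

```python
import numpy as np


def optimal_rescale_factor(d_exp, d_sim, sigma_exp):
    """Scale s minimising sum(((s * d_exp - d_sim) / sigma_exp) ** 2).

    Setting the derivative with respect to s to zero gives the closed form
    s = sum(w * d_exp * d_sim) / sum(w * d_exp ** 2), with w = 1 / sigma_exp ** 2.
    """
    w = 1.0 / np.asarray(sigma_exp) ** 2
    return np.sum(w * d_exp * d_sim) / np.sum(w * d_exp ** 2)


# Synthetic data: the "experiment" is the simulation at half the scale
d_sim = np.array([1.0, 2.0, 3.0, 4.0])
d_exp = 0.5 * d_sim
sigma_exp = np.full_like(d_exp, 0.1)

scale = optimal_rescale_factor(d_exp, d_sim, sigma_exp)
print(scale)
```

The factor is chosen purely to minimise the misfit, which is why it cannot account for physical effects such as background contributions.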

The specific manner in which the weight applies to calculating the figure of merit (FoM) depends on the particular FoM which is used for the refinement. By default this is a least-squares difference weighted by the experimental error (StandardFoMCalculator):

\(FoM_{total} = \sum_{i} FoM_{i}\)

where the sum is over the number of experimental datasets and we calculate a weighted FoM for the \(i\)-th dataset as follows:

\(FoM_{i} = w_{i} \sum_{j} \left( \frac{D_{j}^{exp} - D_{j}^{sim}}{\sigma_{j}^{exp}} \right)^2\)

\(D_{j}\) is an observable, \(\sigma_{j}\) is the error associated with the observable, and \(exp\) and \(sim\) refer to the experimental and simulated data respectively. \(D_{j}\) and \(\sigma_{j}\) are N-dimensional arrays of all the data points for a dataset.
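The two formulas above can be written out directly with NumPy. This is only a sketch of the arithmetic using made-up arrays, not the StandardFoMCalculator itself; the function names dataset_fom and total_fom are hypothetical.

```python
import numpy as np


def dataset_fom(d_exp, d_sim, sigma_exp, weight=1.0):
    """Weighted least-squares FoM for one dataset (arrays may be N-dimensional)."""
    residuals = (np.asarray(d_exp) - np.asarray(d_sim)) / np.asarray(sigma_exp)
    return weight * np.sum(residuals ** 2)


def total_fom(datasets):
    """Sum of the weighted FoMs over all datasets."""
    return sum(dataset_fom(**dataset) for dataset in datasets)


# Two synthetic datasets with different shapes and weights
datasets = [
    {'d_exp': np.array([1.0, 2.0, 3.0]),
     'd_sim': np.array([1.1, 1.9, 3.2]),
     'sigma_exp': np.array([0.1, 0.1, 0.2]),
     'weight': 1.0},
    {'d_exp': np.array([[0.5, 0.4], [0.3, 0.2]]),
     'd_sim': np.array([[0.5, 0.5], [0.3, 0.1]]),
     'sigma_exp': np.full((2, 2), 0.1),
     'weight': 2.0},
]
fom = total_fom(datasets)
print(fom)
```

Here the first dataset contributes 3 (three residuals of one standard deviation each) and the second contributes 2 x 2 = 4, so the weight acts as a simple multiplier on each dataset's share of the total.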

The exp_datasets parameter which is passed to Control is a list of all dataset dictionaries. In this instance there is only a single experimental dataset:

[ ]:

exp_datasets = [QENS]

If another dataset were to be included, for example the pair distribution function, this would require its own dictionary:

[ ]:

n_diffraction = {'file_name':'data/water_PDF',
                 'type':'PDF',
                 'reader':'ASCII',
                 'weight':1.,
                 'auto_scale':True,
                 'resolution': {'gaussian': 84}}
two_exp_datasets = [QENS, n_diffraction]

Controlling the parameters

The Parameters of the Universe can have various restrictions placed on them for the purposes of refinement.

[ ]:

universe = simulation.universe
fit_parameters = universe.parameters
print(fit_parameters['epsilon'])

The value of one Parameter can be tied to that of another if there is a physical reason for the dependence. For example, in a water molecule the charge of the oxygen can be set to -2 times that of the hydrogen:

[ ]:

fit_parameters['charge'][0].set_tie(fit_parameters['charge'][1], ' * - 2')
print(fit_parameters['charge'][0])

The value of a Parameter can have constraints set to limit its value to a certain range:

[ ]:

fit_parameters['epsilon'].constraints = (0.6, 0.7)
print(fit_parameters['epsilon'])

Finally, a Parameter can be fixed outright to whatever value it currently has.

[ ]:

fit_parameters['equilibrium_state'][0].fixed = True
print(fit_parameters['equilibrium_state'][0])

Note that Parameters that are fixed, tied, or equal to 0 cannot be refined, and so will not be passed to the Minimizer when the Control object is created. If constraints are set, refinement will still occur, but any proposed values that lie outside the range of the constraints will be clipped.
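Clipping a proposed value to a constraint range can be sketched in plain Python as follows. The function clip_to_constraints is hypothetical and only illustrates the behaviour described; it is not how MDMC exposes it.

```python
def clip_to_constraints(proposed, constraints):
    """Return the proposed value, moved to the nearest bound if it lies outside."""
    lower, upper = constraints
    return min(max(proposed, lower), upper)


constraints = (0.6, 0.7)
for proposed in (0.5, 0.65, 0.75):
    print(proposed, '->', clip_to_constraints(proposed, constraints))
```

A value inside the range is returned unchanged, while a value outside it is replaced by the nearest bound.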

Restricting the value of a Parameter in these ways applies to the Monte Carlo aspect of MDMC. Namely, it affects how the Parameter is allowed to change between refinement steps. It should not be confused with the use of constraint_algorithms when creating a Universe (see Building a Universe). The latter controls whether aspects of the Universe such as bonds are treated as rigid during molecular dynamics. It is entirely possible to have a rigid bond but allow the length of that bond to change between refinement steps, or conversely to have a bond that is free to oscillate during MD but whose equilibrium length is not altered as part of the refinement.

Controlling the figure of merit

The figure of merit (FoM) used in the refinement process to assess the goodness of fit between MD and experimental data can be configured using a dictionary with the following key/value pairs:

error : {'exp', 'none'} Whether to weight the FoM using the experimental errors of the data or not (respectively). The former results in values with larger (relative) error contributing less to the overall FoM.
norm : {'data_points', 'dof', 'none'} Whether to normalise the FoM using the total number of data points, the degrees of freedom (for the chi-squared figure of merit, this is number of data points minus the number of varying parameters), or not at all.

If not provided then the FoM defaults to using the first options, namely:

[ ]:

FoM_options = {'error':'exp', 'norm':'data_points'}

Currently, the only option for the overall form of the FoM is that of a chi-squared test. Details of the implementation can be found at this explanation page.
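The effect of the norm option can be illustrated with a few lines of Python. This is a sketch of the arithmetic described above, not MDMC's implementation; the function normalise_fom and the numbers used are hypothetical.

```python
def normalise_fom(fom, n_data_points, n_varying_parameters, norm='data_points'):
    """Normalise a raw chi-squared FoM according to the chosen option."""
    if norm == 'data_points':
        return fom / n_data_points
    if norm == 'dof':
        # Degrees of freedom: data points minus the number of varying parameters
        return fom / (n_data_points - n_varying_parameters)
    if norm == 'none':
        return fom
    raise ValueError(f"unknown norm option: {norm!r}")


raw_fom = 180.0
for option in ('data_points', 'dof', 'none'):
    print(option, normalise_fom(raw_fom, n_data_points=100,
                                n_varying_parameters=10, norm=option))
```

With 100 data points and 10 varying parameters, a raw value of 180 becomes 1.8 per data point or 2.0 per degree of freedom. The two normalisations only differ noticeably when the number of varying parameters is a significant fraction of the number of data points.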

Resolution Function

As the dataset included a vanadium sample, this will be used to determine the instrument resolution as a function of momentum. This function is then used in calculations involving the corresponding dataset. Note that for numerically defined resolutions, the resolution function is not accessible as an array. The resolution function can be accessed from the Observable (note that it is transformed into the time domain so that it can be applied multiplicatively to FQt rather than convolved with SQw):

[ ]:

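import numpy as np

# `control` and the momentum values `Q` are assumed to have been defined in earlier cells
exp_obs = control.observable_pairs[0].exp_obs
#help(exp_obs.resolution)
resolution_function = exp_obs.resolution.resolution_function
# Note that for multidimensional resolution functions, the innermost independent variable should be passed first
# Time values at which to evaluate the resolution function (np.linspace defaults to 50 points)
t = np.linspace(0, 1e5)
resolution_array = resolution_function(t, Q)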

The real and imaginary components of this can then be plotted as a 3D surface (note that for a Gaussian lineshape in the energy domain, one would expect no imaginary components and another Gaussian shape in the time domain):

[ ]:

%matplotlib notebook

from matplotlib import cm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

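def plot_SQw(Q, E, SQw):
    """Plot a 2D array as a surface; E may be any innermost independent variable (e.g. energy or time)."""
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    # meshgrid returns arrays of shape (len(Q), len(E)), which SQw must match
    X, Y = np.meshgrid(E, Q)
    ax.plot_surface(X, Y, SQw, cmap=cm.viridis)
    plt.show()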

[ ]:

plot_SQw(Q, t, np.real(resolution_array))
plot_SQw(Q, t, np.imag(resolution_array))

Note that for values outside the original range of the vanadium sample, nearest neighbour extrapolation is used.
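In one dimension, nearest neighbour extrapolation means that any point outside the measured range takes the value at the closest end of that range. The sketch below shows this with np.interp, which by default holds the end values constant outside the input range. The numbers are made up for illustration, and MDMC's own interpolation routine may work differently.

```python
import numpy as np

# Hypothetical measured points (x could be momentum or energy, for example)
x_measured = np.array([0.5, 1.0, 1.5])
y_measured = np.array([0.080, 0.084, 0.090])

# One point below the range, one inside it, one above it
x_requested = np.array([0.2, 0.75, 2.0])

# Inside the range np.interp interpolates linearly; outside it,
# the first or last measured value is returned
y_requested = np.interp(x_requested, x_measured, y_measured)
print(y_requested)
```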

When applied to calculated data, the resolution window is normalised to have a value of 1 at time zero for every momentum value. As the zero-time component is directly proportional to the integral over the energy domain, this is equivalent to enforcing that the static structure factor of vanadium is constant across all momenta.

Once normalised in this way, the vanadium resolution can be compared, at a specific Q value, to the resolution obtained by assuming a Gaussian lineshape with a FWHM of 84 μeV:

[ ]:

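from MDMC.common.constants import h
# Convert the FWHM (0.084 meV, i.e. 84 μeV) to a Gaussian standard deviation in energy
sigma_e = 0.084 / (2 * np.sqrt(2 * np.log(2)))
# h is in units of eV s whereas system units are meV fs, so apply a factor of 1e3 * 1e15 to convert it.
# Dividing by 2 * pi gives hbar; a Gaussian of width sigma_e in energy transforms to one of width hbar / sigma_e in time
sigma_t = h * 1e18 / (2 * np.pi * sigma_e)
control.observable_pairs[0].exp_obs.t = t
time_resolution = exp_obs._calculate_resolution_window(resolution_function)
Q_index = 10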
Q_index = 10

plt.figure()
plt.plot(t, np.real(time_resolution)[Q_index], label='Vanadium Resolution')
plt.plot(t, np.exp(-0.5 * (t / sigma_t) ** 2), label='Gaussian Resolution')
plt.legend()
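The relation between a Gaussian lineshape in energy and its time-domain window can be checked numerically without MDMC. The sketch below assumes the convention E = ħω, for which a Gaussian of standard deviation σ_E in energy transforms to a Gaussian of standard deviation σ_t = ħ / σ_E in time, with σ_E = FWHM / (2 sqrt(2 ln 2)). The value of ħ is hard-coded in meV fs, and the Fourier transform is evaluated by direct summation rather than by any MDMC routine.

```python
import numpy as np

HBAR = 658.2119569  # reduced Planck constant in meV fs

fwhm = 0.084  # meV, i.e. 84 micro-eV
sigma_e = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # standard deviation in meV
sigma_t = HBAR / sigma_e  # predicted standard deviation in fs

# Normalised Gaussian lineshape on a fine energy grid spanning +/- 10 sigma
E = np.linspace(-10.0 * sigma_e, 10.0 * sigma_e, 20001)
dE = E[1] - E[0]
lineshape = np.exp(-0.5 * (E / sigma_e) ** 2) / (sigma_e * np.sqrt(2.0 * np.pi))

# Fourier transform to the time domain; the lineshape is even in E, so the
# sine (imaginary) part vanishes and only the cosine part is needed
t = np.linspace(0.0, 1e5, 50)
window = np.array([np.sum(lineshape * np.cos(E * ti / HBAR)) * dE for ti in t])

analytic = np.exp(-0.5 * (t / sigma_t) ** 2)
max_diff = np.max(np.abs(window - analytic))
print(sigma_t, max_diff)
```

Because the lineshape is normalised to unit area, the window equals 1 at time zero, which matches the normalisation applied to the vanadium resolution window.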

FoM and Parameter Plotting

After a refinement has completed, the results can be read in and plotted with pandas. Unlike dynamic plotting, this can be done whether or not the refinement has been run within a Jupyter notebook, but it still requires matplotlib to be installed:

[ ]:

import pandas as pd
results = pd.read_csv(control.results_filename)

# If only the y variable is provided (e.g. 'FoM'), it is plotted against the number of steps
results.plot(y='FoM')

[ ]:

# The x axis can also be specified
results.plot(x='FoM', y=charge_parameter)
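Besides plotting, the results table can be queried directly, for example to find the refinement step with the lowest FoM. The sketch below uses a small in-memory CSV in place of control.results_filename; the 'FoM' column name follows the cells above, while the 'epsilon' column and all of the values are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the results file written by a refinement (synthetic values)
csv_text = """FoM,epsilon
12.0,0.65
9.5,0.66
7.1,0.68
7.8,0.67
"""
results = pd.read_csv(io.StringIO(csv_text))

# Row index (refinement step) with the smallest FoM, and the values at that step
best_step = results['FoM'].idxmin()
best_row = results.loc[best_step]
print(best_step)
print(best_row)
```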
