Conducting a simulated sequential learning run
In this tutorial we will show how to conduct a simulated sequential learning run over a fully explored design space.
Creating a fully explored DesignSpace
Following a similar procedure as in the previous tutorial, we will create a fully explored DesignSpace (i.e. no unknown labels). This time the structures will be clean mono-elemental surfaces, which we can generate via generate_surface_structures.
>>> # Generate the clean surfaces
>>> from autocat.surface import generate_surface_structures
>>> from autocat.utils import flatten_structures_dict
>>> surfs_dict = generate_surface_structures(
... ["Pt", "Cu", "Li", "Ti"],
... n_fixed_layers=2,
... default_lat_param_lib="pbe_fd"
... )
>>> surfs = flatten_structures_dict(surfs_dict)
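With the default facet choices this gives ten surfaces in total, consistent with the DesignSpace summary shown below:
>>> len(surfs)
10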
In this case we specified that the default lattice parameters should be taken from the library calculated with the PBE XC functional and a finite-difference basis set.
As before, we will create random labels for all of the structures. Keep in mind that for a meaningful sequential learning run these must be actual labels relevant to your design space!
>>> # Generate the labels for each structure
>>> import numpy as np
>>> labels = np.random.uniform(-1.5, 1.5, size=len(surfs))
Taking the structures and labels, we can define our DesignSpace.
>>> from autocat.learning.sequential import DesignSpace
>>> design_space = DesignSpace(surfs, labels)
>>> design_space
+-------------------------+--------------------------+
| | DesignSpace |
+-------------------------+--------------------------+
| total # of systems | 10 |
| # of unlabelled systems | 0 |
| unique species present | ['Pt', 'Cu', 'Li', 'Ti'] |
| maximum label | 1.1205404366846423 |
| minimum label | -1.3259701029215702 |
+-------------------------+--------------------------+
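The structures and labels held within the DesignSpace remain directly accessible. The design_space_structures attribute is used below when setting up the featurizer; the labels can be inspected analogously (here we assume the design_space_labels attribute):
>>> len(design_space.design_space_structures)
10
>>> design_space.design_space_labels.shape  # assumed label attribute
(10,)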
Doing a single simulated sequential learning run
Given our fully explored DesignSpace, we can simulate a sequential learning search over it to gain insights into guided searches within this context. To do this simulated run we can make use of the simulated_sequential_learning function. This will internally drive a SequentialLearner object, which will be returned at the end of the run.
As before, we will need to make choices with regard to the Predictor settings. In this case we will use a SineMatrix featurizer alongside a GaussianProcessRegressor.
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import RBF
>>> from dscribe.descriptors.sinematrix import SineMatrix
>>> from autocat.learning.featurizers import Featurizer
>>> from autocat.learning.predictors import Predictor
>>> kernel = RBF(1.5)
>>> regressor = GaussianProcessRegressor(kernel=kernel)
>>> featurizer = Featurizer(
... featurizer_class=SineMatrix,
... design_space_structures=design_space.design_space_structures
... )
>>> predictor = Predictor(regressor=regressor, featurizer=featurizer)
>>> predictor
+-----------+------------------------------------------------------------------+
| | Predictor |
+-----------+------------------------------------------------------------------+
| regressor | <class 'sklearn.gaussian_process._gpr.GaussianProcessRegressor'> |
| is fit? | False |
+-----------+------------------------------------------------------------------+
+-----------------------------------+-------------------------------------------+
| | Featurizer |
+-----------------------------------+-------------------------------------------+
| class | dscribe.descriptors.sinematrix.SineMatrix |
| kwargs | None |
| species list | ['Li', 'Ti', 'Pt', 'Cu'] |
| maximum structure size | 36 |
| preset | None |
| design space structures provided? | True |
+-----------------------------------+-------------------------------------------+
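Before handing the predictor to the sequential learning driver, we can optionally sanity-check it by hand. This is a sketch assuming the fit and predict interface from the previous tutorial; the simulated run below refits the predictor on its own training subsets regardless:
>>> # assumed to return mean predictions and uncertainties for a GPR
>>> predictor.fit(surfs, labels)
>>> predictions, uncertainties = predictor.predict(surfs)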
We also need to select parameters for candidate selection. This includes the acquisition function to be used, the target window (if applicable), and the number of candidates to pick at each iteration. This can be done via the CandidateSelector object. Let's use a maximum uncertainty (MU) acquisition function to pick candidates based on their associated uncertainty values.
>>> from autocat.learning.sequential import CandidateSelector
>>> candidate_selector = CandidateSelector(
... acquisition_function="MU",
... num_candidates_to_pick=1
... )
>>> candidate_selector
+-------------------------------+--------------------+
| | Candidate Selector |
+-------------------------------+--------------------+
| acquisition function | MU |
| # of candidates to pick | 1 |
| target window | None |
| include hhi? | False |
| include segregation energies? | False |
+-------------------------------+--------------------+
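If instead we wanted to steer the search toward labels within a particular range, we could use an acquisition function that exploits the target window, such as maximum likelihood of improvement. A sketch, assuming the MLI option and the target_window keyword:
>>> # hypothetical alternative selector targeting labels below zero
>>> mli_selector = CandidateSelector(
...     acquisition_function="MLI",
...     num_candidates_to_pick=1,
...     target_window=(-np.inf, 0.0),
... )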
Now we have everything we need to conduct a simulated sequential learning loop. We'll restrict the run to five sequential learning loops.
>>> from autocat.learning.sequential import simulated_sequential_learning
>>> sim_seq_learn = simulated_sequential_learning(
... full_design_space=design_space,
... candidate_selector=candidate_selector,
... predictor=predictor,
... init_training_size=1,
... number_of_sl_loops=5,
... )
>>> sim_seq_learn
+----------------------------------+--------------------+
| | Sequential Learner |
+----------------------------------+--------------------+
| iteration count | 6 |
| next candidate system structures | ['Cu36'] |
| next candidate system indices | [5] |
+----------------------------------+--------------------+
+-------------------------------+--------------------+
| | Candidate Selector |
+-------------------------------+--------------------+
| acquisition function | MU |
| # of candidates to pick | 1 |
| target window | None |
| include hhi? | False |
| include segregation energies? | False |
+-------------------------------+--------------------+
+-------------------------+--------------------------+
| | DesignSpace |
+-------------------------+--------------------------+
| total # of systems | 10 |
| # of unlabelled systems | 4 |
| unique species present | ['Pt', 'Cu', 'Li', 'Ti'] |
| maximum label | 0.9712050050259604 |
| minimum label | -1.3259701029215702 |
+-------------------------+--------------------------+
+-----------+------------------------------------------------------------------+
| | Predictor |
+-----------+------------------------------------------------------------------+
| regressor | <class 'sklearn.gaussian_process._gpr.GaussianProcessRegressor'> |
| is fit? | True |
+-----------+------------------------------------------------------------------+
+-----------------------------------+-------------------------------------------+
| | Featurizer |
+-----------------------------------+-------------------------------------------+
| class | dscribe.descriptors.sinematrix.SineMatrix |
| kwargs | None |
| species list | ['Li', 'Ti', 'Pt', 'Cu'] |
| maximum structure size | 36 |
| preset | None |
| design space structures provided? | True |
+-----------------------------------+-------------------------------------------+
Within the returned SequentialLearner object we now have information we can use for further analysis, including the prediction and uncertainty histories as well as the candidate selection history.
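For example (the attribute names below are our assumption of the AutoCat API; each history is a list with one entry per iteration):
>>> # which systems were in the training set at each iteration (assumed attribute)
>>> train_history = sim_seq_learn.train_idx_history
>>> # predicted labels over the full design space at each iteration (assumed attribute)
>>> pred_history = sim_seq_learn.predictions_history
>>> # corresponding uncertainty estimates (assumed attribute)
>>> unc_history = sim_seq_learn.uncertainties_history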
Doing multiple simulated sequential learning runs
It is often useful to consider the statistics of multiple independent simulated sequential learning runs. For this purpose we can make use of the multiple_simulated_sequential_learning_runs function. This acts in the same manner as the single-run version, but returns the SequentialLearner objects for all of the independent runs in a list. Moreover, the inputs remain the same, with the added option of running in parallel (since this is an embarrassingly parallel operation). Here we will conduct three independent runs in serial.
>>> from autocat.learning.sequential import multiple_simulated_sequential_learning_runs
>>> runs_history = multiple_simulated_sequential_learning_runs(
... full_design_space=design_space,
... candidate_selector=candidate_selector,
... predictor=predictor,
... init_training_size=1,
... number_of_sl_loops=5,
... number_of_runs=3,
... # number_of_parallel_jobs=N if you wanted to run in parallel
... )
Taking the SequentialLearner objects from within runs_history, their histories may be used to calculate more robust statistics about the simulated searches.
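As a minimal sketch (again assuming the uncertainties_history attribute discussed above), we could track the maximum predicted uncertainty at each iteration and average it over the independent runs:
>>> # one row per run, one column per iteration
>>> max_unc_per_iter = np.array(
...     [[np.max(u) for u in run.uncertainties_history] for run in runs_history]
... )
>>> mean_max_unc = max_unc_per_iter.mean(axis=0)  # mean across runs
>>> std_max_unc = max_unc_per_iter.std(axis=0)    # run-to-run spread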